OpenCL mystery: compiler bug, broken card or stupid scener?
category: code [glöplog]
So I just started working with OpenCL 2 weeks ago, and tonight I've run into a problem that left me completely flabbergasted. Since the nvidia forum doesn't let me post, even after registering, I'm hoping someone here could help me figure out what's going on.
I'm trying to do stream compaction, but I keep getting holes in the compacted texture. After 6+ hours of detective work, I've made this very minimal kernel that always displays the problem, although it just tries to write 3 specific pixels from 1 specific workitem:
Reading out the resulting image gives me .... 9, 9, 9,9, -1.#QNAN, -1.#QNAN, -1.#QNAN, -1.#QNAN, 7, 7, 7, 7 ....
No matter what I try, I can't write to destination pixel(405, 97) from the source workitem (403, 137). If I write to the same adresses from the previous workitem (402, 137) or the next (404, 137), all three writes go through. But the workitems above and below, so (403, 138) or (403, 136) also can't do the middle write.
The specific values are just the first ones to display this problem, but there are many others, and I haven't found a pattern yet:
Some trouble spots are on the same line, others are seperated by several lines.
I'm stumped on how this is possible. I tried changing the destination texture to a power of 2 (1024 instead of 766), or changing the workgroup size (let driver decide instead of (16, 8), but to no avail. The destination texture is shared with OpenGL, but is properly acquired. The original kernel has only 1 write per workitem, but has the same problem. Is this some memory synchronisation issue I'm not aware of?
Anyone got a hint on what I can check/try next? If it's relevant, my card is a Quadro 4600 (same chip as 8800 GTX) and can only do OpenCL 1.0, not 1.1 .
I'm trying to do stream compaction, but I keep getting holes in the compacted texture. After 6+ hours of detective work, I've made this very minimal kernel that always displays the problem, although it just tries to write 3 specific pixels from 1 specific workitem:
Code:
__kernel void CompactStream ( __read_only image2d_t srcImg,
__global uint* hitfield, __global uint* scan,
__write_only image2d_t dstImg,
int width, int height, int destWidth )
{
int2 theCoord = (int2) (get_global_id(0), get_global_id(1));
if ( theCoord.x == 403 && theCoord.y == 137)
if ( width == 766 && height==766 && destWidth == 766) // sanity check
{
write_imagef(dstImg, (int2)(404, 97), (float4)(9.,9.,9.,9.));
write_imagef(dstImg, (int2)(405, 97), (float4)(8.,8.,8.,8.));
write_imagef(dstImg, (int2)(406, 97), (float4)(7.,7.,7.,7.));
}
}
Reading out the resulting image gives me .... 9, 9, 9,9, -1.#QNAN, -1.#QNAN, -1.#QNAN, -1.#QNAN, 7, 7, 7, 7 ....
No matter what I try, I can't write to destination pixel(405, 97) from the source workitem (403, 137). If I write to the same adresses from the previous workitem (402, 137) or the next (404, 137), all three writes go through. But the workitems above and below, so (403, 138) or (403, 136) also can't do the middle write.
The specific values are just the first ones to display this problem, but there are many others, and I haven't found a pattern yet:
Code:
source -> can't write to:
403/137 -> 405/97
386/182 -> 373/124
450/240 -> 265/148
435/244 -> 417/149
339/258 -> 677/153
345/258 -> 678/153
...
Some trouble spots are on the same line, others are seperated by several lines.
I'm stumped on how this is possible. I tried changing the destination texture to a power of 2 (1024 instead of 766), or changing the workgroup size (let driver decide instead of (16, 8), but to no avail. The destination texture is shared with OpenGL, but is properly acquired. The original kernel has only 1 write per workitem, but has the same problem. Is this some memory synchronisation issue I'm not aware of?
Anyone got a hint on what I can check/try next? If it's relevant, my card is a Quadro 4600 (same chip as 8800 GTX) and can only do OpenCL 1.0, not 1.1 .
I had similar problems with similar kernels, but not with the same language.
I used GL_EXT_shader_image_load_store, but the result sounds familiar yes.
I'd bet for a driver bug on that one. yep.
I used GL_EXT_shader_image_load_store, but the result sounds familiar yes.
I'd bet for a driver bug on that one. yep.
Looks about right.. Probably time to make an exe and test on other hardware..