Modify

Opened 12 years ago

Closed 12 years ago

#74 closed defect (postpone)

Some kernels are not created on ipepdvcompute1 with ufo HEAD

Reported by: Matthias Vogelgesang Owned by: Matthias Vogelgesang
Priority: blocker Milestone: ufo-core-0.2
Component: ufo-core Version:
Keywords: Cc: Suren A. Chilingaryan, Anton Myagotin

Description

The reconstruction scripts do not work because the backprojection filter cannot successfully create a kernel from a successfully built program file. The error message is CL_INVALID_KERNEL_DEFINITION, which means that a kernel was compiled with different arguments on different devices. This is definitely not the case. Moreover, this affects only the backprojection and laminographic kernels.

Attachments (0)

Change History (17)

comment:1 Changed 12 years ago by Suren A. Chilingaryan

Just in case, I have recently updated the NVIDIA driver. Currently beta-driver for upcoming CUDA5 is installed.

comment:2 Changed 12 years ago by Matthias Vogelgesang

Hmm, yes I noticed the output. Can you revert this to something stable?

comment:3 Changed 12 years ago by Suren A. Chilingaryan

Generally, I don't want. With this driver, the start-up penalty is so much less. But, feel free to temporary install different driver if you think it could be the cause.

comment:4 Changed 12 years ago by Matthias Vogelgesang

*sigh* Someone is running X …

comment:5 Changed 12 years ago by Suren A. Chilingaryan

It is required for VirtualGL. I don't think anybody using it at the moment. So, feel free to execute "init 3".

comment:6 Changed 12 years ago by Matthias Vogelgesang

No, there are some open xterms and Firefox browser, so someone is probably working or pretending to work.

comment:7 Changed 12 years ago by Suren A. Chilingaryan

It's remote NX session. It doesn't require X server to be running. OK. I'm not completely sure if "init 3" will not kill the session, but it is Anton and he never logs out. The idle times are quite long so I guess it is safe to try.

comment:8 Changed 12 years ago by Matthias Vogelgesang

Cc: Anton Myagotin added

Interestingly, on ufosrv1 the error does not occur and all kernels are build and executed as expected. Maybe there is a problem, that 9 GPUs are installed in ipepdvcompute1?

comment:9 Changed 12 years ago by Suren A. Chilingaryan

I'd rather think it is because of active version of gcc. ufosrv1 has standard 4.6 and on ipepdvcompute1 I still enforce 4.3 for CUDA compatibility. Can you try compiling with 4.6?

comment:10 Changed 12 years ago by Matthias Vogelgesang

No, this is not the cause.

Edit: Removed all comments not related to this bug.

Last edited 12 years ago by Matthias Vogelgesang (previous) (diff)

comment:11 Changed 12 years ago by Matthias Vogelgesang

And it does depend on the GPU configuration. I wrote a little program that setups the platform, desired devices and tries to create these three simple kernels:

__kernel void assign(__global float *input, 
                     __global float *output)
{
    const int idx = get_global_id (0); 
    input[idx] = output[idx]; 
}

__kernel void two_const_params(__constant float *c_param_1,
                               __constant float *c_param_2)
{
}

__kernel void three_const_params(__constant float *c_param_1,
                                 __constant float *c_param_2,
                                 __constant float *c_param_3)
{
}

This is the output for three different GPU configurations (that means a context consisting of those GPUs):

me@ipepdvcompute1:.../ocl-regressions ./build/check --first=0 --last=0
# Platform: OpenCL 1.1 CUDA 5.0.1
# Device 0: GeForce GTX 680
Initialization: OK
Creating backproject program: OK
Creating kernel `assign`: OK
Creating kernel `two_const_params`: OK
Creating kernel `three_const_params`: OK
me@ipepdvcompute1:.../ocl-regressions ./build/check --first=0 --last=1
# Platform: OpenCL 1.1 CUDA 5.0.1
# Device 0: GeForce GTX 680
# Device 1: GeForce GTX 590
Initialization: OK
Creating backproject program: OK
Creating kernel `assign`: OK
Creating kernel `two_const_params`: OK
Creating kernel `three_const_params`: Error: CL_INVALID_KERNEL_DEFINITION
me@ipepdvcompute1:.../ocl-regressions ./build/check --first=1 --last=8
# Platform: OpenCL 1.1 CUDA 5.0.1
# Device 0: GeForce GTX 590
# Device 1: GeForce GTX 590
# Device 2: GeForce GTX 590
# Device 3: GeForce GTX 590
# Device 4: GeForce GTX 590
# Device 5: GeForce GTX 590
# Device 6: GeForce GTX 590
# Device 7: GeForce GTX 590
Initialization: OK
Creating backproject program: OK
Creating kernel `assign`: OK
Creating kernel `two_const_params`: OK
Creating kernel `three_const_params`: OK

Now, someone owes me a beer …

comment:12 Changed 12 years ago by Matthias Vogelgesang

As it turns out, __local parameters are also affected.

IMHO, the best solution would be to pull out the GTX 690 (why was it there in the first place, anyway?).

comment:13 Changed 12 years ago by Suren A. Chilingaryan

OK. According to Khronos clCreateKernel returns error CL_INVALID_KERNEL_DEFINITION if:

if the function definition for __kernel function given by
kernel_name such as the number of arguments, the argument types
are not the same for all devices for which the program
executable has been built.

Next, again according to Khronos:

Variables in the program scope or the outermost scope 
of kernel functions can be declared in the
__constant address space. ... Implementations are not 
required to aggregate these declarations into the fewest number
of constant arguments. This behavior is implementation defined.

I.e. compiler may group constant variables optimizing their number. NVIDIA compiler generates pretty different code based on the compute capability. I.e. it looks like for one of the architectures the compiler optimizes the constant declarations and for another - not. On one hand, NVIDIA may be blamed for this, on other hand it seems to me pretty unsafe to share the same OpenCL context between devices of different compute capabilities. I'd propose to create a separate OpenCL context for each compute capability. Alternatively, you may just skip the devices of non-primary compute compatibility.

PS. GTX680 to allow testing against new NVIDIA architecture, earlier-or-later NVIDIA will stop selling Fermies. And to catch such problems. We can't ignore possibility of having multiple architectures in the same system. For example, ANKA has such systems (tomo2) and generally the system may include the integrated graphics from NVIDIA. So, we need to handle such situations.

PPS. If this somebody will provide you a beer, I hope you'll share it with me ;)

comment:14 in reply to:  13 Changed 12 years ago by Matthias Vogelgesang

Replying to csa:

OK. According to Khronos clCreateKernel returns error CL_INVALID_KERNEL_DEFINITION if:

Nice, but totally irrelevant.

Next, again according to Khronos:

Variables in the program scope or the outermost scope 
of kernel functions can be declared in the
__constant address space. ... Implementations are not 
required to aggregate these declarations into the fewest number
of constant arguments. This behavior is implementation defined.

I.e. compiler may group constant variables optimizing their number. NVIDIA compiler generates pretty different code based on the compute capability. I.e. it looks like for one of the architectures the compiler optimizes the constant declarations and for another - not.

Again this has nothing to do with the actual (hardware/driver) problem.

On one hand, NVIDIA may be blamed for this,

Of course they are to be blamed. And I really don't see, why you try to come up with wrong arguments and misinterpret the standard to cover that up:

  1. There are only constant parameters in the program, so nothing to be aggregated here.
  1. The error code is totally misleading in this case. It must be issued by clCreateKernel() whenever the same kernel name is compiled with different function signatures on different devices (e.g. with a pre-processor switch). I just checked this, and at least in this regard, NVIDIA behaves correctly.

on other hand it seems to me pretty unsafe to share the same OpenCL context between devices of different compute capabilities.

Subjective opinion doesn't matter, the standard does and this is pretty clear: I can create a context with any device I want (why else there would be a clCreateContextFromType() function that pulls in all GPUs?). It is actually demanded from the standard that devices must meet a certain level of capability (e.g. minimum of 32kb of local memory).

I'd propose to create a separate OpenCL context for each compute capability. Alternatively, you may just skip the devices of non-primary compute compatibility.

No, I am not working around a clear bug from NVIDIA's implementation and against the standard. I propose that you pull out the card and put it somewhere else for better use.

PS. GTX680 to allow testing against new NVIDIA architecture, earlier-or-later NVIDIA will stop selling Fermies. And to catch such problems.

Why must this be caught on a production server? Maybe you, as a WP leader, should come up with a standard procedure that we all follow along. Just putting in some random cards in some random servers and hoping for the best, is surely not a good procedure.

We can't ignore possibility of having multiple architectures in the same system. For example, ANKA has such systems (tomo2) and generally the system may include the integrated graphics from NVIDIA. So, we need to handle such situations.

Yes, and the possibility was given with the old ipepdvcompute1 which had a mix of 295s and 580s.

Last edited 12 years ago by Matthias Vogelgesang (previous) (diff)

comment:15 Changed 12 years ago by Suren A. Chilingaryan

I don't see a point. There is no sense to have GTX295 installed any more. Instead, there is completely legal configuration of up to date GPU cards in ipepdvcompute1. Which is by the way a development, not a production server.

Besides, I was not saying NVIDIA coupling with standard, but tried to explain why it could arise and that is most probably the cause. NVIDIA has clearly a bug here. But any software has bugs and these bugs should be handled somehow.

comment:16 Changed 12 years ago by Matthias Vogelgesang

Milestone: ufo-core-0.2

comment:17 Changed 12 years ago by Matthias Vogelgesang

Resolution: postpone
Status: newclosed

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain Matthias Vogelgesang.
The resolution will be deleted. Next status will be 'reopened'.

Add Comment


E-mail address and name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.