In a CUDA kernel, how do I store an array in “local thread memory”?

Arrays, local memory and registers: There is a misconception here regarding the definition of “local memory”. “Local memory” in CUDA is actually global memory (and should really be called “thread-local global memory”) with interleaved addressing (which makes iterating over an array in parallel a bit faster than having each thread’s data blocked together). If you … Read more
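A minimal sketch of the distinction (the kernel is hypothetical, not from the answer): whether a per-thread array ends up in registers or in “local” memory depends largely on whether the compiler can resolve every index at compile time.

```cuda
__global__ void kernel(float* out)
{
    // Small array with compile-time-resolvable indexing: after full
    // unrolling, the compiler can usually promote this to registers.
    float a[4] = {0.f, 1.f, 2.f, 3.f};
    float sum = 0.f;
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        sum += a[i];

    // Runtime-dependent indexing defeats register promotion, so this
    // array typically lands in "local" memory: off-chip, thread-private
    // global memory with interleaved addressing.
    float b[64];
    for (int i = 0; i < 64; ++i)
        b[i] = out[i];
    sum += b[threadIdx.x % 64];

    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}
```

Actual placement is a compiler decision; inspecting the PTX/SASS (or compiling with `--ptxas-options=-v` to see local-memory usage) is the way to confirm it.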

Why has atomicAdd not been implemented for doubles?

Edit: As of CUDA 8, double-precision atomicAdd() is implemented in CUDA with hardware support in SM_6X (Pascal) GPUs. Original answer: at the time of writing, no CUDA devices supported atomicAdd for double in hardware. As you noted, it can be implemented in terms of atomicCAS on 64-bit integers, but there is a non-trivial performance cost for that. Therefore, the CUDA software … Read more
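The atomicCAS-based emulation mentioned above is the one given in the CUDA C Programming Guide, and is only needed on pre-Pascal (below sm_60) devices:

```cuda
// Software double-precision atomicAdd via 64-bit atomicCAS.
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Retry if another thread changed the value between the read
        // and the CAS. The integer comparison also sidesteps the
        // NaN != NaN problem a floating-point compare would have.
    } while (assumed != old);
    return __longlong_as_double(old);
}
```

Under contention every failed CAS forces another round trip, which is the non-trivial performance cost the answer refers to.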

What are the differences between CUDA compute capabilities?

The Compute Capabilities designate different architectures. In general, newer architectures run both CUDA programs and graphics faster than previous architectures. Note, though, that a high end card in a previous generation may be faster than a lower end card in the generation after. From the CUDA C Programming Guide (v6.0):
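A device's compute capability can be queried at runtime with the standard CUDA runtime API, which is useful when choosing code paths per architecture:

```cuda
#include <cstdio>
#include <cuda_runtime_api.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // major.minor is the compute capability, e.g. 6.1 for Pascal GP10x.
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```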

Should I unify two similar kernels with an ‘if’ statement, risking performance loss?

You have a third alternative, which is to use C++ templating and make the variable which is used in the if/switch statement a template parameter. Instantiate each version of the kernel you need, and then you have multiple kernels doing different things with no branch divergence or conditional evaluation to worry about, because the compiler … Read more
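A minimal sketch of that third alternative (the kernel and the `UseSquare` parameter are illustrative, not from the answer): the branch condition becomes a compile-time template parameter, so each instantiation is branch-free.

```cuda
// The boolean is resolved at compile time; the dead branch is removed,
// so there is no divergence or per-thread conditional evaluation.
template <bool UseSquare>
__global__ void transform(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (UseSquare)
            out[i] = in[i] * in[i];
        else
            out[i] = in[i];
    }
}

// Host side: select the instantiation once, at the launch site.
// transform<true><<<blocks, threads>>>(d_in, d_out, n);
// transform<false><<<blocks, threads>>>(d_in, d_out, n);
```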

Setting up Visual Studio Intellisense for CUDA kernel calls

Wow, lots of dust on this thread. I came up with a macro fix (well, more like workaround…) for this that I thought I would share:

// nvcc does not seem to like variadic macros, so we have to define
// one for each kernel parameter list:
#ifdef __CUDACC__
#define KERNEL_ARGS2(grid, block) <<< grid, block … Read more
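The excerpt cuts off mid-macro; the workaround presumably continues along these lines (a reconstruction under that assumption, not the answer's verbatim text):

```cuda
#ifdef __CUDACC__
// Real CUDA compiler: expand to the usual triple-chevron launch syntax.
#define KERNEL_ARGS2(grid, block)                 <<< grid, block >>>
#define KERNEL_ARGS3(grid, block, sh_mem)         <<< grid, block, sh_mem >>>
#define KERNEL_ARGS4(grid, block, sh_mem, stream) <<< grid, block, sh_mem, stream >>>
#else
// IntelliSense / host-only parsing: expand to nothing, so a call like
//   my_kernel KERNEL_ARGS2(dimGrid, dimBlock)(arg1, arg2);
// still parses as an ordinary function call.
#define KERNEL_ARGS2(grid, block)
#define KERNEL_ARGS3(grid, block, sh_mem)
#define KERNEL_ARGS4(grid, block, sh_mem, stream)
#endif
```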

cuda block synchronization

In CUDA 9, NVIDIA introduced the concept of cooperative groups, which allow you to synchronize all threads belonging to a group. Such a group can span all threads in the grid. This way you will be able to synchronize all threads in all blocks:

#include <cuda_runtime_api.h>
#include <cuda.h>
#include <cooperative_groups.h>

cooperative_groups::grid_group g = cooperative_groups::this_grid();
… Read more
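A sketch of how grid-wide synchronization is typically used (the two-phase kernel is illustrative, not from the answer):

```cuda
#include <cuda_runtime_api.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void twoPhase(float* data, int n)
{
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;       // phase 1
    grid.sync();                      // barrier across every block in the grid
    if (i < n) data[i] += data[0];    // phase 2 safely observes phase-1 results
}

// Note: grid.sync() is only valid when the kernel is launched with
// cudaLaunchCooperativeKernel (not a plain <<<...>>> launch), and the
// device must report the cooperativeLaunch property. Sketch:
//   void* args[] = { &d_data, &n };
//   cudaLaunchCooperativeKernel((void*)twoPhase, gridDim, blockDim, args);
```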
