In a CUDA kernel, how do I store an array in “local thread memory”?

Arrays, local memory and registers: There is a misconception here regarding the definition of “local memory”. “Local memory” in CUDA is actually global memory (and should really be called “thread-local global memory”) with interleaved addressing (which makes iterating over an array in parallel a bit faster than having each thread’s data blocked together). If you … Read more
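A minimal sketch of the distinction (the kernel is hypothetical, not from the answer): whether a per-thread array ends up in registers or in “local” memory depends largely on whether the compiler can resolve every index at compile time.

```cuda
__global__ void kernel(float* out)
{
    // Small array with compile-time-resolvable indexing: after full
    // unrolling, the compiler can usually promote this to registers.
    float a[4] = {0.f, 1.f, 2.f, 3.f};
    float sum = 0.f;
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        sum += a[i];

    // Runtime-dependent indexing defeats register promotion, so this
    // array typically lands in "local" memory: off-chip, thread-private
    // global memory with interleaved addressing.
    float b[64];
    for (int i = 0; i < 64; ++i)
        b[i] = out[i];
    sum += b[threadIdx.x % 64];

    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}
```

Actual placement is a compiler decision; inspecting the PTX/SASS (or compiling with `--ptxas-options=-v` to see local-memory usage) is the way to confirm it.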

Why has atomicAdd not been implemented for doubles?

Edit: As of CUDA 8, double-precision atomicAdd() is implemented in CUDA with hardware support in SM_6X (Pascal) GPUs. Original answer: at the time of writing, no CUDA devices supported atomicAdd for double in hardware. As you noted, it can be implemented in terms of atomicCAS on 64-bit integers, but there is a non-trivial performance cost for that. Therefore, the CUDA software … Read more
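The atomicCAS-based emulation mentioned above is the one given in the CUDA C Programming Guide, and is only needed on pre-Pascal (below sm_60) devices:

```cuda
// Software double-precision atomicAdd via 64-bit atomicCAS.
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Retry if another thread changed the value between the read
        // and the CAS. The integer comparison also sidesteps the
        // NaN != NaN problem a floating-point compare would have.
    } while (assumed != old);
    return __longlong_as_double(old);
}
```

Under contention every failed CAS forces another round trip, which is the non-trivial performance cost the answer refers to.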

What are the differences between CUDA compute capabilities?

The Compute Capabilities designate different architectures. In general, newer architectures run both CUDA programs and graphics faster than previous architectures. Note, though, that a high end card in a previous generation may be faster than a lower end card in the generation after. From the CUDA C Programming Guide (v6.0):
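A device's compute capability can be queried at runtime with the standard CUDA runtime API, which is useful when choosing code paths per architecture:

```cuda
#include <cstdio>
#include <cuda_runtime_api.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // major.minor is the compute capability, e.g. 6.1 for Pascal GP10x.
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```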

Should I unify two similar kernels with an ‘if’ statement, risking performance loss?

You have a third alternative, which is to use C++ templating and make the variable which is used in the if/switch statement a template parameter. Instantiate each version of the kernel you need, and then you have multiple kernels doing different things with no branch divergence or conditional evaluation to worry about, because the compiler … Read more
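A minimal sketch of that third alternative (the kernel and the `UseSquare` parameter are illustrative, not from the answer): the branch condition becomes a compile-time template parameter, so each instantiation is branch-free.

```cuda
// The boolean is resolved at compile time; the dead branch is removed,
// so there is no divergence or per-thread conditional evaluation.
template <bool UseSquare>
__global__ void transform(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (UseSquare)
            out[i] = in[i] * in[i];
        else
            out[i] = in[i];
    }
}

// Host side: select the instantiation once, at the launch site.
// transform<true><<<blocks, threads>>>(d_in, d_out, n);
// transform<false><<<blocks, threads>>>(d_in, d_out, n);
```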

Setting up Visual Studio Intellisense for CUDA kernel calls

Wow, lots of dust on this thread. I came up with a macro fix (well, more like workaround…) for this that I thought I would share:

// nvcc does not seem to like variadic macros, so we have to define
// one for each kernel parameter list:
#ifdef __CUDACC__
#define KERNEL_ARGS2(grid, block) <<< grid, block … Read more
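The excerpt cuts off mid-macro; the workaround presumably continues along these lines (a reconstruction under that assumption, not the answer's verbatim text):

```cuda
#ifdef __CUDACC__
// Real CUDA compiler: expand to the usual triple-chevron launch syntax.
#define KERNEL_ARGS2(grid, block)                 <<< grid, block >>>
#define KERNEL_ARGS3(grid, block, sh_mem)         <<< grid, block, sh_mem >>>
#define KERNEL_ARGS4(grid, block, sh_mem, stream) <<< grid, block, sh_mem, stream >>>
#else
// IntelliSense / host-only parsing: expand to nothing, so a call like
//   my_kernel KERNEL_ARGS2(dimGrid, dimBlock)(arg1, arg2);
// still parses as an ordinary function call.
#define KERNEL_ARGS2(grid, block)
#define KERNEL_ARGS3(grid, block, sh_mem)
#define KERNEL_ARGS4(grid, block, sh_mem, stream)
#endif
```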

cuda block synchronization

In CUDA 9, NVIDIA introduced the concept of cooperative groups, which allow you to synchronize all threads belonging to a group. Such a group can span all threads in the grid. This way you will be able to synchronize all threads in all blocks:

#include <cuda_runtime_api.h>
#include <cuda.h>
#include <cooperative_groups.h>

cooperative_groups::grid_group g = cooperative_groups::this_grid();
… Read more
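A sketch of how grid-wide synchronization is typically used (the two-phase kernel is illustrative, not from the answer):

```cuda
#include <cuda_runtime_api.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void twoPhase(float* data, int n)
{
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;       // phase 1
    grid.sync();                      // barrier across every block in the grid
    if (i < n) data[i] += data[0];    // phase 2 safely observes phase-1 results
}

// Note: grid.sync() is only valid when the kernel is launched with
// cudaLaunchCooperativeKernel (not a plain <<<...>>> launch), and the
// device must report the cooperativeLaunch property. Sketch:
//   void* args[] = { &d_data, &n };
//   cudaLaunchCooperativeKernel((void*)twoPhase, gridDim, blockDim, args);
```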
