Why has atomicAdd not been implemented for doubles?

Edit: As of CUDA 8, double-precision atomicAdd() is implemented in CUDA, with hardware support on SM_6x (Pascal) GPUs. Original answer: Currently, no CUDA devices support atomicAdd for double in hardware. As you noted, it can be implemented in terms of atomicCAS on 64-bit integers, but there is a non-trivial performance cost for that. Therefore, the CUDA software … Read more
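For reference, the software fallback the answer alludes to can be sketched as follows, using the well-known CAS-loop pattern from the CUDA C Programming Guide (guarded so it only exists on pre-Pascal virtual architectures):

```cuda
#include <cuda_runtime.h>

// Software atomicAdd for double on devices of compute capability < 6.0,
// built from 64-bit atomicCAS, as shown in the CUDA C Programming Guide.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Reinterpret the bits, add, and try to swap; retry if another
        // thread updated the location in the meantime.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    // Comparing integer bit patterns (not doubles) avoids a NaN != NaN loop.
    } while (assumed != old);
    return __longlong_as_double(old);
}
#endif
```

Because every contended update turns into a read-modify-CAS retry loop, this is noticeably slower than a native hardware atomic, which is the performance cost the answer mentions.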

What are the differences between CUDA compute capabilities?

The compute capabilities designate different GPU architectures. In general, newer architectures run both CUDA programs and graphics faster than older ones. Note, though, that a high-end card from a previous generation may be faster than a lower-end card from the generation after. From the CUDA C Programming Guide (v6.0):
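Independently of the guide's table, you can query the compute capability of each installed device at runtime; a minimal sketch using the runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the compute capability (major.minor) of every CUDA device.
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```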

CUDA block synchronization

In CUDA 9, NVIDIA is introducing the concept of cooperative groups, allowing you to synchronize all threads belonging to a group. Such a group can span all threads in the grid, so you can synchronize all threads in all blocks:

    #include <cuda_runtime_api.h>
    #include <cuda.h>
    #include <cooperative_groups.h>

    cooperative_groups::grid_group g = cooperative_groups::this_grid();
    … Read more
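A fuller sketch of how a grid-wide barrier is used inside a kernel (assuming a device and driver that support cooperative launch; the kernel name and phases are illustrative):

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void two_phase_kernel(int* data)
{
    cg::grid_group grid = cg::this_grid();

    // Phase 1: every thread in every block writes its part of data ...

    grid.sync();  // barrier across the whole grid, not just this block

    // Phase 2: now safe to read results produced by other blocks ...
}

// Note: such a kernel must be launched with cudaLaunchCooperativeKernel,
// and the whole grid must be resident on the device at once (check limits
// with cudaOccupancyMaxActiveBlocksPerMultiprocessor).
```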

Difference between cuda.h, cuda_runtime.h, cuda_runtime_api.h

In very broad terms: cuda.h defines the public host functions and types for the CUDA driver API. cuda_runtime_api.h defines the public host functions and types for the CUDA runtime API. cuda_runtime.h defines everything cuda_runtime_api.h does, as well as built-in type definitions and function overlays for the CUDA language extensions and device intrinsic functions. If you … Read more
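A small illustration of the practical difference (file name is hypothetical): a host-only source file that sticks to cuda_runtime_api.h can be built by an ordinary host compiler, with no nvcc involved.

```cuda
// host_only.cpp — uses only runtime API host declarations, so it can be
// compiled with a plain host compiler, e.g.:  g++ host_only.cpp -lcudart
#include <cuda_runtime_api.h>

int main()
{
    void* p = nullptr;
    cudaMalloc(&p, 1024);  // runtime API host call; no device code here
    cudaFree(p);
    return 0;
}

// By contrast, a .cu file containing kernels or device intrinsics needs
// cuda_runtime.h (nvcc includes it implicitly), and a driver API program
// includes cuda.h and calls cuInit, cuDeviceGet, and friends directly.
```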

Default Pinned Memory Vs Zero-Copy Memory

I think it depends on your application (otherwise, why would they provide both ways?). Mapped, pinned memory (zero-copy) is useful when either:
- The GPU has no memory on its own and uses RAM anyway
- You load the data exactly once, but you have a lot of computation to perform on it and you want to … Read more
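A minimal zero-copy sketch for the second case (error checking omitted; kernel and sizes are illustrative): the host buffer is pinned and mapped, so the kernel reads and writes it directly over the bus and no explicit cudaMemcpy is needed.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // each access travels over the bus to host RAM
}

int main()
{
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);  // allow mapping pinned allocations

    float* h = nullptr;
    cudaHostAlloc(&h, n * sizeof(float), cudaHostAllocMapped);  // pinned + mapped
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d = nullptr;
    cudaHostGetDevicePointer(&d, h, 0);  // device-visible alias, no copy made

    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();             // h now holds the results directly
    cudaFreeHost(h);
    return 0;
}
```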

What is the purpose of using multiple “arch” flags in Nvidia’s NVCC compiler?

Roughly speaking, the code compilation flow goes like this: CUDA C/C++ device code source –> PTX –> SASS The virtual architecture (e.g. compute_20, whatever is specified by -arch compute…) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is … Read more
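As an illustrative sketch of how the two halves interact (the flags and kernel below are examples, not the only valid combination): the -gencode switches select the PTX and SASS targets, and __CUDA_ARCH__ reflects the virtual architecture each device-code pass is compiled against, so device code can adapt at compile time.

```cuda
// Example fat-binary compilation (illustrative):
//   nvcc -gencode arch=compute_50,code=sm_50 \
//        -gencode arch=compute_50,code=sm_52 \
//        -gencode arch=compute_50,code=compute_50 kernel.cu
//
// __CUDA_ARCH__ is set per virtual architecture, so it can gate features:
__global__ void accumulate(double* sum, double v)
{
#if __CUDA_ARCH__ >= 600
    atomicAdd(sum, v);  // native double-precision atomicAdd exists from sm_60 on
#else
    // older virtual architectures would need a CAS-based fallback here
#endif
}
```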

Thrust inside user written kernels

As it was originally written, Thrust is purely a host-side abstraction. It cannot be used inside kernels. You can pass the device memory encapsulated inside a thrust::device_vector to your own kernel like this:

    thrust::device_vector<Foo> fooVector;
    // Do something thrust-y with fooVector
    Foo* fooArray = thrust::raw_pointer_cast(fooVector.data());
    // Pass raw array and … Read more
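A complete sketch of that pattern (the kernel is a made-up example): Thrust owns and frees the allocation, while raw_pointer_cast hands a plain device pointer to a user-written kernel.

```cuda
#include <thrust/device_vector.h>

__global__ void addOne(float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    thrust::device_vector<float> v(n, 0.0f);          // device memory owned by Thrust
    float* raw = thrust::raw_pointer_cast(v.data());  // plain device pointer

    addOne<<<(n + 255) / 256, 256>>>(raw, n);         // hand it to a custom kernel
    cudaDeviceSynchronize();

    // v still owns the memory; Thrust algorithms can keep operating on it,
    // and it is freed automatically when v goes out of scope.
    return 0;
}
```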
