How does CUDA assign device IDs to GPUs?
Set the environment variable CUDA_DEVICE_ORDER:

    export CUDA_DEVICE_ORDER=PCI_BUS_ID

Then the GPU IDs will be ordered by PCI bus ID.
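As a minimal sketch of how to verify the ordering (assuming the runtime API and at least one CUDA device; error checking omitted), you can print each device ID alongside its PCI location:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            // Print the device ID next to its PCI domain/bus/device location
            printf("Device %d: %s (PCI %04x:%02x:%02x)\n", dev, prop.name,
                   prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
        }
        return 0;
    }

With CUDA_DEVICE_ORDER=PCI_BUS_ID set, the device IDs printed here increase with the PCI bus ID; with the default FASTEST_FIRST ordering they may not.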
In very broad terms:

cuda.h defines the public host functions and types for the CUDA driver API.
cuda_runtime_api.h defines the public host functions and types for the CUDA runtime API.
cuda_runtime.h defines everything cuda_runtime_api.h does, as well as built-in type definitions and function overlays for the CUDA language extensions and device intrinsic functions.

If you …
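A minimal sketch of which header pairs with which API (assuming you link -lcuda for the driver API and -lcudart for the runtime API):

    #include <cuda.h>          // driver API: cuInit, cuDeviceGet, ...
    #include <cuda_runtime.h>  // runtime API plus language-extension support

    int main()
    {
        cuInit(0);                  // driver API entry point declared in cuda.h
        int count = 0;
        cudaGetDeviceCount(&count); // runtime API call declared in cuda_runtime_api.h
        return 0;
    }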
I think it depends on your application (otherwise, why would they provide both ways?). Mapped, pinned memory (zero-copy) is useful when either:

The GPU has no memory of its own and uses host RAM anyway.
You load the data exactly once, but you have a lot of computation to perform on it and you want to …
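A minimal sketch of the zero-copy setup (the scale kernel and the buffer size are hypothetical; error checking omitted):

    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n)  // hypothetical kernel operating in place
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *hostPtr = 0, *devPtr = 0;
        cudaSetDeviceFlags(cudaDeviceMapHost);                 // enable mapping before the context is created
        cudaHostAlloc((void**)&hostPtr, n * sizeof(float),
                      cudaHostAllocMapped);                    // pinned + mapped host allocation
        cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0); // device-side alias of hostPtr
        scale<<<(n + 255) / 256, 256>>>(devPtr, n);            // no explicit cudaMemcpy needed
        cudaDeviceSynchronize();
        cudaFreeHost(hostPtr);
        return 0;
    }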
The answer to the short question is "No". Warp-level branch divergence around a __syncthreads() instruction will cause a deadlock and result in a kernel hang. Your code example is not guaranteed to be safe or correct. The correct way to implement the code would be like this: __global__ void kernel(…) if (tidx < N) …
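A sketch of the safe pattern the answer is describing (the signature is hypothetical; the point is that __syncthreads() sits outside the conditional, so every thread in the block reaches it):

    __global__ void kernel(float *data, int N)  // hypothetical signature
    {
        int tidx = threadIdx.x + blockIdx.x * blockDim.x;

        if (tidx < N) {
            // first stage of work, only for in-range threads
        }
        __syncthreads();  // executed unconditionally by every thread in the block
        if (tidx < N) {
            // second stage that depends on the first stage having completed
        }
    }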
Roughly speaking, the code compilation flow goes like this:

CUDA C/C++ device code source -> PTX -> SASS

The virtual architecture (e.g. compute_20, whatever is specified by -arch compute…) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is …
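For example, an illustrative nvcc invocation for the Fermi-class targets named above:

    # PTX targets the virtual architecture, SASS targets the real one
    nvcc -arch=compute_20 -code=sm_21 kernel.cu -o kernel
    # equivalent -gencode spelling
    nvcc -gencode arch=compute_20,code=sm_21 kernel.cu -o kernel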
Define the class in a header that you #include, just like in C++. Any method that must be called from device code should be defined with both __device__ and __host__ declspecs, including the constructor and destructor if you plan to use new/delete on the device (note new/delete require CUDA 4.0 and a compute capability 2.0 …
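A minimal sketch of such a header (the Vec class is hypothetical):

    // vec.h -- included by both host and device code
    #pragma once

    class Vec {
    public:
        __host__ __device__ Vec(float x, float y) : x_(x), y_(y) {}  // callable in kernels
        __host__ __device__ ~Vec() {}
        __host__ __device__ float dot(const Vec &o) const { return x_ * o.x_ + y_ * o.y_; }
    private:
        float x_, y_;
    };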
As it was originally written, Thrust is purely a host-side abstraction. It cannot be used inside kernels. You can pass the device memory encapsulated inside a thrust::device_vector to your own kernel like this:

    thrust::device_vector< Foo > fooVector;
    // Do something thrust-y with fooVector

    Foo* fooArray = thrust::raw_pointer_cast( fooVector.data() );

    // Pass raw array and …
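A fuller sketch of the round trip (Foo, the kernel, and the launch configuration are all hypothetical):

    #include <thrust/device_vector.h>

    struct Foo { float value; };  // hypothetical payload type

    __global__ void fooKernel(Foo *fooArray, size_t n)
    {
        size_t i = threadIdx.x + blockIdx.x * (size_t)blockDim.x;
        if (i < n) fooArray[i].value *= 2.0f;  // operate on the raw device array
    }

    int main()
    {
        thrust::device_vector<Foo> fooVector(1024);
        Foo *fooArray = thrust::raw_pointer_cast(fooVector.data());  // raw device pointer
        fooKernel<<<(1024 + 255) / 256, 256>>>(fooArray, fooVector.size());
        cudaDeviceSynchronize();
        return 0;
    }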
UPDATE

GPU Version

    __global__ void hash(float *largeFloatingPointArray, int largeFloatingPointArraySize,
                         int *dictionary, int size, int num_blocks)
    {
        int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
        float y;                                         // compute one (or more) floats
        int noOfOccurrences = 0;
        int a;

        while (x < size)  // While there is work …