Allocating shared memory

CUDA supports dynamic shared memory allocation. If you define the kernel like this: __global__ void Kernel(const int count) { extern __shared__ int a[]; } and then pass the number of bytes required as the third argument of the kernel launch, Kernel<<< gridDim, blockDim, a_size >>>(count), then the array can be sized at run time. Be … Read more
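As a quick illustration, here is a minimal, self-contained sketch of that pattern (the kernel body and values are illustrative, not from the original answer):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void Kernel(const int count)
{
    extern __shared__ int a[];   // sized at launch by the third <<<>>> argument
    int tid = threadIdx.x;
    a[tid] = tid * 2;            // every thread stages one element
    __syncthreads();
    if (tid == 0)
        printf("a[%d] = %d\n", count - 1, a[count - 1]);
}

int main()
{
    const int count = 32;
    const size_t a_size = count * sizeof(int);  // bytes, not element count
    Kernel<<<1, count, a_size>>>(count);        // third argument sizes a[]
    cudaDeviceSynchronize();
    return 0;
}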

CUDA Driver API vs. CUDA runtime

The CUDA runtime makes it possible to compile and link your CUDA kernels into executables. This means that you don’t have to distribute cubin files with your application, or deal with loading them through the driver API. As you have noted, it is generally easier to use. In contrast, the driver API is harder to … Read more
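For contrast, a minimal driver API sketch, assuming a kernel compiled separately into kernel.cubin containing a function named Kernel (both names are illustrative); error checking is omitted for brevity:

#include <cuda.h>   // driver API header; link with -lcuda
#include <cstdio>

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Load the precompiled module and look up the kernel by name.
    CUmodule mod;
    cuModuleLoad(&mod, "kernel.cubin");      // illustrative file name
    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "Kernel");

    int count = 32;
    void *args[] = { &count };
    // grid 1x1x1, block 32x1x1, no dynamic shared memory, default stream
    cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, NULL, args, NULL);
    cuCtxSynchronize();

    cuCtxDestroy(ctx);
    return 0;
}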

High level GPU programming in C++ [closed]

There are many high-level libraries dedicated to GPGPU programming. Since they rely on CUDA and/or OpenCL, they have to be chosen wisely (a CUDA-based program will not run on AMD’s GPUs, unless it goes through a pre-processing step with projects such as gpuocelot). CUDA You can find some examples of CUDA libraries on the NVIDIA … Read more
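As one concrete example of the high-level style, Thrust (shipped with the CUDA toolkit) lets you sort on the GPU without writing a kernel; a minimal sketch:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <cstdio>

int main()
{
    // STL-like containers hide cudaMalloc/cudaMemcpy entirely.
    thrust::device_vector<int> d(1 << 20);
    thrust::sequence(d.begin(), d.end());                       // 0, 1, 2, ...
    thrust::sort(d.begin(), d.end(), thrust::greater<int>());   // GPU sort, descending
    thrust::host_vector<int> h = d;                             // implicit copy back
    printf("first = %d, last = %d\n", h.front(), h.back());
    return 0;
}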

CUDA: How to use -arch and -code and SM vs COMPUTE

Some related questions/answers are here and here. I am still not sure how to properly specify the architectures for code generation when building with nvcc. A complete description is somewhat complicated, but there are relatively simple, easy-to-remember canonical usages. Compile for the architecture (both virtual and real) that represents the GPUs you … Read more
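For example, the canonical usage looks like this (sm_70 / compute_70, i.e. Volta, is just an illustrative target):

# Shorthand: embeds Volta SASS plus the matching PTX, so newer GPUs can JIT it
nvcc -arch=sm_70 -o app app.cu

# Equivalent explicit form
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_70,code=compute_70 -o app app.cu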

CUDA model – what is warp size?

Direct Answer: Warp size is the number of threads in a warp, which is a sub-division used in the hardware implementation to coalesce memory access and instruction dispatch. Suggested Reading: As @Matias mentioned, I’d go read the CUDA C Best Practices Guide (you’ll have to scroll to the bottom where it’s listed). It might help … Read more
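If you want to see the value for yourself, it is queryable from both the host and the device (a minimal sketch; warpSize has been 32 on every CUDA GPU released to date):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void show_warp_size()
{
    // warpSize is a built-in variable in device code.
    if (threadIdx.x == 0)
        printf("device-side warpSize = %d\n", warpSize);
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("host-side warpSize = %d\n", prop.warpSize);
    show_warp_size<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}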

Running more than one CUDA applications on one GPU

CUDA activity from independent host processes will normally create independent CUDA contexts, one for each process. Thus, the CUDA activity launched from separate host processes will take place in separate CUDA contexts, on the same device. CUDA activity in separate contexts will be serialized. The GPU will execute the activity from one process, and when … Read more
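Whether separate host processes may hold contexts on the device at the same time also depends on its compute mode, which you can query from the runtime API; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    switch (prop.computeMode) {
    case cudaComputeModeDefault:
        printf("Default: multiple host processes may create contexts\n");
        break;
    case cudaComputeModeExclusiveProcess:
        printf("Exclusive-process: only one process at a time\n");
        break;
    case cudaComputeModeProhibited:
        printf("Prohibited: no contexts can be created\n");
        break;
    default:
        printf("Other/legacy compute mode: %d\n", prop.computeMode);
    }
    return 0;
}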
