cuda – Page 3 – Tarik Billa

CUDA: How to use -arch and -code and SM vs COMPUTE

May 18, 2023 by Tarik

Some related questions/answers are here and here. I am still not sure how to properly specify the architectures for code generation when building with nvcc. A complete description is somewhat complicated, but there are intended to be relatively simple, easy-to-remember canonical usages. Compile for the architecture (both virtual and real) that represents the GPUs you … Read more

CUDA model – what is warp size?

May 5, 2023 by Tarik

Direct Answer: Warp size is the number of threads in a warp, which is a sub-division used in the hardware implementation to coalesce memory access and instruction dispatch. Suggested Reading: As @Matias mentioned, I’d go read the CUDA C Best Practices Guide (you’ll have to scroll to the bottom where it’s listed). It might help … Read more
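
As a rough illustration of how warp size relates to block size, the sketch below counts the warps needed to cover one thread block. It is plain host-side C++, not device code. The helper name `warps_per_block` is hypothetical, and the warp size of 32 is an assumption that holds for current NVIDIA GPUs (device code can read the actual value from the built-in `warpSize`).

```cpp
#include <cstdint>

// Assumed warp size: 32 threads on current NVIDIA GPUs.
constexpr std::uint32_t kWarpSize = 32;

// Hypothetical helper: number of warps needed to hold `threads_per_block`
// threads. A partially filled last warp still occupies a whole warp, so the
// division rounds up.
constexpr std::uint32_t warps_per_block(std::uint32_t threads_per_block,
                                        std::uint32_t warp_size = kWarpSize) {
    return (threads_per_block + warp_size - 1) / warp_size;
}

static_assert(warps_per_block(256) == 8, "256 threads fill exactly 8 warps");
static_assert(warps_per_block(100) == 4, "100 threads round up to 4 warps");
```

Block sizes that are a multiple of the warp size avoid the partially filled last warp.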

Running more than one CUDA application on one GPU

May 3, 2023 by Tarik

CUDA activity from independent host processes will normally create independent CUDA contexts, one for each process. Thus, the CUDA activity launched from separate host processes will take place in separate CUDA contexts, on the same device. CUDA activity in separate contexts will be serialized. The GPU will execute the activity from one process, and when … Read more

Using std::vector in CUDA device code

April 29, 2023 by Tarik

You can’t use the STL in CUDA device code, but you may be able to use the Thrust library to do what you want. Otherwise, just copy the contents of the vector to the device and operate on it normally.

CUDA: How many concurrent threads in total?

April 28, 2023 by Tarik

The GTX 580 can have 16 * 48 concurrent warps (32 threads each) running at a time. That is 16 multiprocessors (SMs) * 48 resident warps per SM * 32 threads per warp = 24,576 threads. Don’t confuse concurrency and throughput. The number above is the maximum number of threads whose resources can be stored … Read more
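
The arithmetic in this excerpt can be written out as a small host-side C++ sketch. The helper name `max_resident_threads` is hypothetical; the three inputs are the GTX 580 figures quoted above.

```cpp
#include <cstdint>

// Hypothetical helper: upper bound on the number of threads that can be
// resident on the GPU at the same time.
constexpr std::uint64_t max_resident_threads(std::uint64_t sm_count,
                                             std::uint64_t warps_per_sm,
                                             std::uint64_t threads_per_warp) {
    return sm_count * warps_per_sm * threads_per_warp;
}

// GTX 580 figures from the text: 16 SMs, 48 resident warps per SM,
// 32 threads per warp.
static_assert(max_resident_threads(16, 48, 32) == 24576,
              "16 * 48 * 32 = 24,576 resident threads");
```

This is a residency limit, not a count of threads executing in the same clock cycle.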

Compression library using Nvidia’s CUDA [closed]

April 11, 2023 by Tarik

We have finished the first phase of research to increase the performance of lossless data compression algorithms. Bzip2 was chosen for the prototype. Our team optimized only one operation, the Burrows–Wheeler transform, and we got some results: a 2x-4x speedup on highly compressible files. The code works faster on all our tests. We are going to complete … Read more

How can I compile CUDA code then link it to a C++ project?

April 8, 2023 by Tarik

I was able to resolve my issue with a couple of different posts, including these ones. Don’t forget to link to the 64-bit library if you are using a 64-bit machine! It seems kind of obvious, but for clowns like me, that is something I forgot. Here is the makefile that … Read more

What is the difference between cuda vs tensor cores?

April 4, 2023 by Tarik

At the time this answer was written, only Tesla V100 and Titan V had tensor cores. Both GPUs have 5120 CUDA cores, where each core can perform up to 1 single-precision multiply-accumulate operation (e.g. in fp32: x += y * z) per GPU clock (e.g. the Tesla V100 PCIe frequency is 1.38 GHz). Each tensor core performs operations on small matrices … Read more
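
The CUDA-core figures above imply a peak fp32 rate that can be worked out with a short host-side C++ sketch. The helper names are hypothetical, and counting one multiply-accumulate as two floating-point operations is an assumed (though common) convention, not something stated in the excerpt.

```cpp
// Hypothetical helper: peak multiply-accumulate operations per second,
// assuming every core retires one MAC per clock as described in the text.
constexpr double peak_macs_per_second(double cuda_cores, double clock_hz) {
    return cuda_cores * clock_hz;
}

// Assumption: one MAC (x += y * z) counts as two floating-point operations,
// a multiply and an add.
constexpr double peak_flops(double cuda_cores, double clock_hz) {
    return 2.0 * peak_macs_per_second(cuda_cores, clock_hz);
}

// Tesla V100 PCIe figures from the text: 5120 CUDA cores at 1.38 GHz.
constexpr double kV100PeakFlops = peak_flops(5120.0, 1.38e9);
```

With those inputs the result is roughly 14.1 TFLOP/s of peak fp32 throughput from the CUDA cores alone.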

Cuda gridDim and blockDim

March 31, 2023 by Tarik

blockDim.x,y,z gives the number of threads in a block, in the particular direction. gridDim.x,y,z gives the number of blocks in a grid, in the particular direction. blockDim.x * gridDim.x gives the number of threads in a grid (in the x direction, in this case). Block and grid variables can be 1, 2, or 3 dimensional. … Read more
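
The relationships above can be modelled on the host with a minimal C++ sketch. `Dim3` here is a hypothetical stand-in for CUDA's `dim3`, and the helper names are made up for illustration. The global-index formula is not part of the excerpt; it is the usual way kernels combine `blockIdx`, `blockDim`, and `threadIdx`.

```cpp
// Hypothetical stand-in for CUDA's dim3: unspecified dimensions default to 1.
struct Dim3 {
    unsigned x, y, z;
    constexpr Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1)
        : x(x_), y(y_), z(z_) {}
};

// Number of threads in the grid along x: blockDim.x * gridDim.x.
constexpr unsigned threads_in_grid_x(Dim3 block_dim, Dim3 grid_dim) {
    return block_dim.x * grid_dim.x;
}

// Global x index of a thread, as a kernel would typically compute it:
// blockIdx.x * blockDim.x + threadIdx.x.
constexpr unsigned global_thread_x(unsigned block_idx_x, Dim3 block_dim,
                                   unsigned thread_idx_x) {
    return block_idx_x * block_dim.x + thread_idx_x;
}

// 4 blocks of 256 threads give 1024 threads along x; the last thread of the
// last block has global index 1023.
static_assert(threads_in_grid_x(Dim3(256), Dim3(4)) == 1024, "grid width");
static_assert(global_thread_x(3, Dim3(256), 255) == 1023, "last thread");
```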

Does __syncthreads() synchronize all threads in the grid?

March 28, 2023 by Tarik

The __syncthreads() command is a block-level synchronization barrier. That means it is safe to use when all threads in a block reach the barrier. It is also possible to use __syncthreads() in conditional code, but only when all threads in the block evaluate the condition identically; otherwise the execution is likely to hang or produce unintended … Read more