Why has atomicAdd not been implemented for doubles?

Edit: As of CUDA 8, double-precision atomicAdd() is implemented in CUDA, with hardware support on SM_6x (Pascal) GPUs. Original answer: Currently, no CUDA devices support atomicAdd for double in hardware. As you noted, it can be implemented in terms of atomicCAS on 64-bit integers, but there is a non-trivial performance cost for that. Therefore, the CUDA software … Read more
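For reference, the software fallback the answer alludes to can be sketched as follows, using the well-known CAS-loop pattern from the CUDA C Programming Guide (guarded so it only exists on pre-Pascal virtual architectures):

```cuda
#include <cuda_runtime.h>

// Software atomicAdd for double on devices of compute capability < 6.0,
// built from 64-bit atomicCAS, as shown in the CUDA C Programming Guide.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Reinterpret the bits, add, and try to swap; retry if another
        // thread updated the location in the meantime.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    // Comparing integer bit patterns (not doubles) avoids a NaN != NaN loop.
    } while (assumed != old);
    return __longlong_as_double(old);
}
#endif
```

Because every contended update turns into a read-modify-CAS retry loop, this is noticeably slower than a native hardware atomic, which is the performance cost the answer mentions.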

What are the differences between CUDA compute capabilities?

The compute capabilities designate different GPU architectures. In general, newer architectures run both CUDA programs and graphics faster than older ones. Note, though, that a high-end card from a previous generation may be faster than a lower-end card from the generation after. From the CUDA C Programming Guide (v6.0):
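Independently of the guide's table, you can query the compute capability of each installed device at runtime; a minimal sketch using the runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the compute capability (major.minor) of every CUDA device.
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```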

CUDA block synchronization

In CUDA 9, NVIDIA is introducing the concept of cooperative groups, allowing you to synchronize all threads belonging to a group. Such a group can span all threads in the grid, so you can synchronize all threads in all blocks:

    #include <cuda_runtime_api.h>
    #include <cuda.h>
    #include <cooperative_groups.h>

    cooperative_groups::grid_group g = cooperative_groups::this_grid();
    … Read more
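A fuller sketch of how a grid-wide barrier is used inside a kernel (assuming a device and driver that support cooperative launch; the kernel name and phases are illustrative):

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void two_phase_kernel(int* data)
{
    cg::grid_group grid = cg::this_grid();

    // Phase 1: every thread in every block writes its part of data ...

    grid.sync();  // barrier across the whole grid, not just this block

    // Phase 2: now safe to read results produced by other blocks ...
}

// Note: such a kernel must be launched with cudaLaunchCooperativeKernel,
// and the whole grid must be resident on the device at once (check limits
// with cudaOccupancyMaxActiveBlocksPerMultiprocessor).
```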

Difference between cuda.h, cuda_runtime.h, cuda_runtime_api.h

In very broad terms: cuda.h defines the public host functions and types for the CUDA driver API. cuda_runtime_api.h defines the public host functions and types for the CUDA runtime API. cuda_runtime.h defines everything cuda_runtime_api.h does, as well as built-in type definitions and function overlays for the CUDA language extensions and device intrinsic functions. If you … Read more
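A small illustration of the practical difference (file name is hypothetical): a host-only source file that sticks to cuda_runtime_api.h can be built by an ordinary host compiler, with no nvcc involved.

```cuda
// host_only.cpp — uses only runtime API host declarations, so it can be
// compiled with a plain host compiler, e.g.:  g++ host_only.cpp -lcudart
#include <cuda_runtime_api.h>

int main()
{
    void* p = nullptr;
    cudaMalloc(&p, 1024);  // runtime API host call; no device code here
    cudaFree(p);
    return 0;
}

// By contrast, a .cu file containing kernels or device intrinsics needs
// cuda_runtime.h (nvcc includes it implicitly), and a driver API program
// includes cuda.h and calls cuInit, cuDeviceGet, and friends directly.
```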

Default Pinned Memory Vs Zero-Copy Memory

I think it depends on your application (otherwise, why would they provide both ways?). Mapped, pinned memory (zero-copy) is useful when either:
- The GPU has no memory on its own and uses RAM anyway
- You load the data exactly once, but you have a lot of computation to perform on it and you want to … Read more
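A minimal zero-copy sketch for the second case (error checking omitted; kernel and sizes are illustrative): the host buffer is pinned and mapped, so the kernel reads and writes it directly over the bus and no explicit cudaMemcpy is needed.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // each access travels over the bus to host RAM
}

int main()
{
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);  // allow mapping pinned allocations

    float* h = nullptr;
    cudaHostAlloc(&h, n * sizeof(float), cudaHostAllocMapped);  // pinned + mapped
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d = nullptr;
    cudaHostGetDevicePointer(&d, h, 0);  // device-visible alias, no copy made

    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();             // h now holds the results directly
    cudaFreeHost(h);
    return 0;
}
```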

What is the purpose of using multiple “arch” flags in Nvidia’s NVCC compiler?

Roughly speaking, the code compilation flow goes like this: CUDA C/C++ device code source –> PTX –> SASS The virtual architecture (e.g. compute_20, whatever is specified by -arch compute…) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is … Read more
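As an illustrative sketch of how the two halves interact (the flags and kernel below are examples, not the only valid combination): the -gencode switches select the PTX and SASS targets, and __CUDA_ARCH__ reflects the virtual architecture each device-code pass is compiled against, so device code can adapt at compile time.

```cuda
// Example fat-binary compilation (illustrative):
//   nvcc -gencode arch=compute_50,code=sm_50 \
//        -gencode arch=compute_50,code=sm_52 \
//        -gencode arch=compute_50,code=compute_50 kernel.cu
//
// __CUDA_ARCH__ is set per virtual architecture, so it can gate features:
__global__ void accumulate(double* sum, double v)
{
#if __CUDA_ARCH__ >= 600
    atomicAdd(sum, v);  // native double-precision atomicAdd exists from sm_60 on
#else
    // older virtual architectures would need a CAS-based fallback here
#endif
}
```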

Thrust inside user written kernels

As it was originally written, Thrust is purely a host-side abstraction. It cannot be used inside kernels. You can pass the device memory encapsulated inside a thrust::device_vector to your own kernel like this:

    thrust::device_vector<Foo> fooVector;
    // Do something thrust-y with fooVector
    Foo* fooArray = thrust::raw_pointer_cast(fooVector.data());
    // Pass raw array and … Read more
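A complete sketch of that pattern (the kernel is a made-up example): Thrust owns and frees the allocation, while raw_pointer_cast hands a plain device pointer to a user-written kernel.

```cuda
#include <thrust/device_vector.h>

__global__ void addOne(float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    thrust::device_vector<float> v(n, 0.0f);          // device memory owned by Thrust
    float* raw = thrust::raw_pointer_cast(v.data());  // plain device pointer

    addOne<<<(n + 255) / 256, 256>>>(raw, n);         // hand it to a custom kernel
    cudaDeviceSynchronize();

    // v still owns the memory; Thrust algorithms can keep operating on it,
    // and it is freed automatically when v goes out of scope.
    return 0;
}
```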
