Edit: As of CUDA 8, double-precision atomicAdd()
is implemented in CUDA with hardware support in SM_6X (Pascal) GPUs.
Currently, no CUDA devices support atomicAdd for double in hardware. As you noted, it can be implemented in terms of atomicCAS on 64-bit integers, but there is a non-trivial performance cost for that.
Therefore, the CUDA software team chose to document a correct implementation as an option for developers, rather than make it part of the CUDA standard library. This way developers are not unknowingly opting in to a performance cost they don’t understand.
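For reference, here is a minimal sketch of the compare-and-swap approach, following the pattern shown in the CUDA C Programming Guide. The names atomicAddDouble, accumulateOnes and sumOnes are my own, chosen for illustration (a distinct name also avoids clashing with the built-in double overload on SM_6X and later), and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative name (not a CUDA built-in): double-precision atomic add
// emulated with a 64-bit atomicCAS loop. Returns the previous value.
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull;
    unsigned long long int assumed;
    do {
        assumed = old;
        // Try to swap in (assumed + val); atomicCAS returns the value it found.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Compare as integers so the loop cannot spin forever on NaN (NaN != NaN).
    } while (assumed != old);
    return __longlong_as_double(old);
}

// Hypothetical test kernel: each of the first n threads adds 1.0 to *sum.
__global__ void accumulateOnes(double* sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAddDouble(sum, 1.0);
    }
}

// Hypothetical host helper: launches the kernel and returns the accumulated sum.
double sumOnes(int n)
{
    double* d_sum = nullptr;
    double h_sum = 0.0;
    cudaMalloc(&d_sum, sizeof(double));
    cudaMemset(d_sum, 0, sizeof(double));   // all-zero bits is 0.0 in IEEE 754
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    accumulateOnes<<<blocks, threads>>>(d_sum, n);
    cudaMemcpy(&h_sum, d_sum, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    return h_sum;
}

int main()
{
    printf("sum = %f\n", sumOnes(1 << 16));
    return 0;
}
```

Every thread that loses the race has to retry, so under heavy contention on a single address the loop can run many times per thread. That is where the performance cost comes from.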
Aside: I don’t think this question should be closed as “not constructive”. I think it’s a perfectly valid question, +1.