Edit: As of CUDA 8, double-precision atomicAdd() is implemented in CUDA with hardware support in SM_6X (Pascal) GPUs.
Currently, no CUDA devices support atomicAdd for double in hardware. As you noted, it can be implemented in terms of atomicCAS on 64-bit integers, but there is a non-trivial performance cost for that.
Therefore, the CUDA software team chose to document a correct implementation as an option for developers, rather than make it part of the CUDA standard library. This way developers are not unknowingly opting in to a performance cost they don’t understand.
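For reference, here is a minimal sketch of the atomicCAS-based approach, modeled on the pattern in the CUDA C Programming Guide. The names `atomicAddDoubleCAS`, `addOnes`, and `sumOnes` are mine, for illustration only; the device function is deliberately not called `atomicAdd` so it will not clash with the built-in double overload on SM_6X and later. It assumes a device that supports 64-bit atomicCAS on global memory.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper name: emulates a double-precision atomic add with a
// compare-and-swap loop on the 64-bit integer view of the same memory.
__device__ double atomicAddDoubleCAS(double* address, double val)
{
    unsigned long long int* address_as_ull =
        reinterpret_cast<unsigned long long int*>(address);
    unsigned long long int old = *address_as_ull;
    unsigned long long int assumed;
    do {
        assumed = old;
        // Try to swap in (assumed + val); atomicCAS returns the value that
        // was actually in memory, which differs if another thread got there first.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Compare as integers, not doubles, so a NaN cannot cause an endless loop.
    } while (assumed != old);
    return __longlong_as_double(old);
}

// Illustrative kernel: every thread adds 1.0 to the same accumulator.
__global__ void addOnes(double* sum)
{
    atomicAddDoubleCAS(sum, 1.0);
}

// Illustrative host helper: launches the kernel and returns the accumulated value.
double sumOnes(int blocks, int threadsPerBlock)
{
    double* dSum = nullptr;
    cudaError_t err = cudaMalloc(&dSum, sizeof(double));
    assert(err == cudaSuccess);
    err = cudaMemset(dSum, 0, sizeof(double));  // all-zero bytes are 0.0
    assert(err == cudaSuccess);

    addOnes<<<blocks, threadsPerBlock>>>(dSum);
    err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);

    double hSum = 0.0;
    err = cudaMemcpy(&hSum, dSum, sizeof(double), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dSum);
    return hSum;
}
```

Every thread that loses the race has to retry, so under heavy contention on a single address the loop can spin many times. That is where the performance cost comes from.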
Aside: I don’t think this question should be closed as “not constructive”. I think it’s a perfectly valid question, +1.