Subtracting packed 8-bit integers in a 64-bit integer by 1 in parallel, SWAR without hardware SIMD
If you have a CPU with efficient SIMD instructions, SSE/MMX paddb (_mm_add_epi8) with a vector of -1 bytes is also viable. Peter Cordes’ answer also describes GNU C (gcc/clang) vector syntax and how to stay safe from strict-aliasing UB. I strongly encourage reviewing that answer as well. Doing it yourself with uint64_t is fully portable, but still requires care to avoid alignment problems and …
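For illustration, here is a minimal sketch of one well-known way to do the uint64_t version in C. It is not necessarily the exact code from the answer, which is truncated above. The function names swar_sub1_bytes and sub1_bytes_inplace are hypothetical, and the use of memcpy for loading and storing is an assumption about how one might sidestep the alignment and aliasing concerns mentioned above.

```c
#include <stdint.h>
#include <string.h>

/* Subtract 1 from each of the eight bytes packed in x.
   Each byte wraps independently (0x00 becomes 0xFF) and no borrow
   crosses a byte boundary. Hypothetical name, for illustration only. */
static uint64_t swar_sub1_bytes(uint64_t x)
{
    const uint64_t HIGH = UINT64_C(0x8080808080808080); /* top bit of every byte */
    const uint64_t ONES = UINT64_C(0x0101010101010101); /* 1 in every byte */

    /* Force every byte's top bit on, so each byte is at least 0x80 and
       subtracting 1 from it can never borrow from the byte above. */
    uint64_t forced = x | HIGH;

    /* Remember which top bits were originally clear; those positions
       must be flipped back after the subtraction. */
    uint64_t fixup = ~x & HIGH;

    return (forced - ONES) ^ fixup;
}

/* Apply the operation to 8 bytes at an arbitrary address.
   memcpy avoids both misaligned access and strict-aliasing problems;
   compilers typically turn it into a single load and store.
   Hypothetical helper, assuming buf points to at least 8 bytes. */
static void sub1_bytes_inplace(unsigned char *buf)
{
    uint64_t v;
    memcpy(&v, buf, sizeof v);
    v = swar_sub1_bytes(v);
    memcpy(buf, &v, sizeof v);
}
```

Because every byte is handled independently and the value is copied in and out with the same byte order, the result does not depend on endianness. A caller would step through a buffer 8 bytes at a time with sub1_bytes_inplace and handle any leftover tail bytes one by one.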