To enable all optimizations and disable all safety checks, compile your D program with the following DMD flags:
-O -inline -release -noboundscheck
EDIT: I’ve tried your programs with g++, dmd and gdc. dmd does lag behind, but gdc achieves performance very close to g++. The commandline I used was gdmd -O -release -inline
(gdmd is a wrapper around gdc which accepts dmd options).
Looking at the assembler listing, it looks like neither dmd nor gdc inlined scalar_product
, but g++/gdc did emit MMX instructions, so they might be auto-vectorizing the loop.