Fastest Integer Square Root in the fewest instructions
Have a look here. For instance, at 3(a) there is this method, which is trivially adaptable to do a 64->32 bit square root, and also trivially transcribable to assembler:

    /* by Jim Ulery */
    static unsigned julery_isqrt(unsigned long val) {
        unsigned long temp, g=0, b = 0x8000, bshft = 15;
        do {
            if (val >= …
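For the 64->32 bit case mentioned above, here is a minimal sketch of a shift-and-subtract integer square root. Note that this is not Jim Ulery's routine: it is the common bit-by-bit variant that walks down the powers of four. The name `isqrt64` and the choice of `uint64_t`/`uint32_t` are my own assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical helper (name is an assumption, not from the linked page):
 * returns floor(sqrt(val)) for a 64-bit input, using only shifts,
 * adds, subtracts and compares. */
static uint32_t isqrt64(uint64_t val)
{
    uint64_t root = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* highest power of four in 64 bits */

    /* Skip the powers of four that are larger than the input. */
    while (bit > val)
        bit >>= 2;

    /* Decide one result bit per iteration, from the top down. */
    while (bit != 0) {
        if (val >= root + bit) {
            val -= root + bit;          /* this result bit is set */
            root = (root >> 1) + bit;
        } else {
            root >>= 1;                 /* this result bit is clear */
        }
        bit >>= 2;
    }
    return (uint32_t)root;
}
```

The loop body is one compare, one subtract, and a couple of shifts and adds, so it should map onto assembler fairly directly. With a fixed starting `bit` the second loop can also be fully unrolled, at the cost of code size.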