Have a look here.
For instance, at 3(a) there is this method, which is trivially adaptable to do a 64->32 bit square root, and also trivially transcribable to assembler:
    /* by Jim Ulery */
    static unsigned julery_isqrt(unsigned long val) {
        unsigned long temp, g = 0, b = 0x8000, bshft = 15;
        do {
            /* temp = (g + b)^2 - g^2, built from shifts and an add */
            if (val >= (temp = (((g << 1) + b) << bshft--))) {
                g += b;
                val -= temp;
            }
        } while (b >>= 1);
        return g;
    }
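As a rough sketch of the 64->32 bit adaptation mentioned above (the name `isqrt64` and the use of the `<stdint.h>` types are my own choices, not part of the original routine), widening the working variables and starting from bit 31 should be enough:

```c
#include <stdint.h>

/* Hypothetical 64->32 bit variant of julery_isqrt; not from the linked page. */
static uint32_t isqrt64(uint64_t val) {
    uint64_t temp, g = 0, b = 0x80000000u;  /* b walks down from bit 31 */
    unsigned bshft = 31;
    do {
        /* temp = (g + b)^2 - g^2; this fits in 64 bits because g + b < 2^32 */
        temp = ((g << 1) + b) << bshft--;
        if (val >= temp) {
            g += b;
            val -= temp;
        }
    } while (b >>= 1);
    return (uint32_t)g;  /* floor(sqrt(val)) always fits in 32 bits */
}
```

The only changes are the wider types and the starting values of `b` and `bshft`; the loop now runs 32 iterations instead of 16.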
No divisions and no multiplications: only shifts, additions, subtractions and comparisons. The loop always runs 16 iterations, but the time taken will be somewhat unpredictable if the test inside it compiles to a branch (on ARM, conditionally executed instructions would avoid that).
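If the branch is a concern, one possible workaround in plain C is to turn the comparison into a mask, so the update is applied arithmetically rather than through an `if`. This is only a sketch: the name `isqrt32_branchless` is made up here, and whether the compiler actually emits branch-free code depends on the target and the optimisation settings.

```c
#include <stdint.h>

/* Hypothetical branch-free rewrite of the same bit-by-bit method. */
static uint32_t isqrt32_branchless(uint32_t val) {
    uint32_t g = 0, b = 0x8000u;
    unsigned bshft = 15;
    do {
        uint32_t temp = ((g << 1) + b) << bshft--;
        /* mask is all ones when val >= temp, and zero otherwise */
        uint32_t mask = (uint32_t)0 - (uint32_t)(val >= temp);
        g += b & mask;        /* accept bit b only when the mask is set */
        val -= temp & mask;   /* subtract only when the bit was accepted */
    } while (b >>= 1);
    return g;
}
```

Every iteration then does the same work regardless of the input, at the cost of a couple of extra AND operations.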
In general, this page lists ways to calculate square roots. If you happen to want a fast inverse square root (i.e. x**(-0.5)), or are just interested in amazing ways to optimise code, take a look at this, this and this.