I believe the key to performance here is to focus on the larger problem rather than on micro-optimizing the extraction of bit positions out of a random integer.
Judging by your sample code and previous SO question you are enumerating all words with K bits set in order, and extracting the bit indices out of these. This greatly simplifies matters.
If so then instead of rebuilding the bit position each iteration try directly incrementing the positions in the bit array. Half of the time this will involve a single loop iteration and increment.
Something along these lines:
// Walk through all len-bit words with num-bits set in order
void enumerate(size_t num, size_t len) {
size_t i;
unsigned int bitpos[64 + 1];
// Seed with the lowest word plus a sentinel
for(i = 0; i < num; ++i)
bitpos[i] = i;
bitpos[i] = 0;
// Here goes the main loop
do {
// Do something with the resulting data
process(bitpos, num);
// Increment the least-significant series of consecutive bits
for(i = 0; bitpos[i + 1] == bitpos[i] + 1; ++i)
bitpos[i] = i;
// Stop on reaching the top
} while(++bitpos[i] != len);
}
// Test function
void process(const unsigned int *bits, size_t num) {
do
printf("%d ", bits[--num]);
while(num);
putchar('\n');
}
Not particularly optimized but you get the general idea.