Why does code mutating a shared variable across threads apparently NOT suffer from a race condition?
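foo() is so short that each thread probably finishes before the next one is even spawned, so the increments almost never overlap in time. The race is still there; it just rarely gets a chance to show. If you add a sleep of random duration in foo() before the u++, the threads will be alive at the same time and you may start seeing the lost updates you expect.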
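Here is a minimal sketch of that experiment, assuming the code is C++ using std::thread (the question's actual code is not shown here). The helper `run` and the `max_sleep_us` parameter are made up for illustration. One deliberate change: `u` is a `std::atomic<int>` that is read and written in two separate steps, which models the non-atomic read-modify-write of `u++` while keeping the demo free of undefined behaviour.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <random>
#include <thread>
#include <vector>

// Hypothetical stand-in for the shared counter from the question.
// The separate load() and store() in foo() mimic what a plain u++ does
// (read, add one, write back) without invoking undefined behaviour.
std::atomic<int> u{0};

// Hypothetical variant of foo(): sleep for a random time, then "u++".
void foo(int max_sleep_us) {
    thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(0, max_sleep_us);
    std::this_thread::sleep_for(std::chrono::microseconds(dist(rng)));

    int seen = u.load();   // read
    u.store(seen + 1);     // write back; another thread may have written in between
}

// Spawns n threads running foo() and returns the final value of u.
int run(int n, int max_sleep_us) {
    assert(n > 0 && max_sleep_us >= 0);
    u.store(0);
    std::vector<std::thread> threads;
    threads.reserve(n);
    for (int i = 0; i < n; ++i) {
        threads.emplace_back(foo, max_sleep_us);
    }
    for (auto& t : threads) {
        t.join();
    }
    return u.load();
}
```

Calling `run(n, 100)` a few times with a large `n` may return values below `n`, though how often depends on the machine and scheduler. Replacing the load/store pair with `u.fetch_add(1)` should make it return `n` every time.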