atomic read
As said before, it’s platform dependent. On x86, a 32-bit value must be aligned on a 4-byte boundary for the read to be atomic. On most platforms, the general requirement is that the read executes as a single CPU instruction.
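To make the alignment requirement concrete, here is a minimal C++ sketch. The names `is_aligned_4`, `read_aligned` and `shared_value` are made up for illustration and are not from any library; the check only tells you whether an address sits on a 4-byte boundary, it does not by itself make the read atomic on every platform.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p sits on a 4-byte boundary.
inline bool is_aligned_4(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 4 == 0;
}

// alignas spells out the alignment a 32-bit integer normally gets anyway.
alignas(4) std::int32_t shared_value = 0;

// Hypothetical reader: refuses (in debug builds) to read a misaligned value,
// since a misaligned 32-bit read may be split into several memory accesses.
inline std::int32_t read_aligned(const std::int32_t* p) {
    assert(is_aligned_4(p));
    return *p;
}
```

Ordinary globals, locals and struct members of type `std::int32_t` are normally aligned correctly by the compiler; the check matters mostly for packed structs or pointers computed by hand into a byte buffer.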
optimizer caching
The optimizer doesn’t know you are reading a value modified by a different thread. Declaring the value volatile helps with that: the optimizer will issue a memory read / write for every access, instead of trying to keep the value cached in a register.
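A small sketch of what that looks like in C++. The names `stop_requested` and `spin_until_stopped` are invented for this example. It is deliberately single-threaded, so it only shows that the flag is re-read from memory on every loop iteration; volatile by itself does not make the access atomic or ordered, which is what the next point is about.

```cpp
#include <cassert>

// Hypothetical flag that another thread would set in a real program.
// volatile forces the compiler to reload it from memory on every access.
volatile bool stop_requested = false;

// Hypothetical polling loop: counts iterations until the flag is seen
// or max_iters is reached. Without volatile, the optimizer would be free
// to read the flag once and keep it in a register for the whole loop.
inline int spin_until_stopped(int max_iters) {
    assert(max_iters >= 0);
    int iterations = 0;
    while (!stop_requested && iterations < max_iters) {
        ++iterations;
    }
    return iterations;
}
```

With the flag already set the loop exits immediately; with the flag clear it runs until `max_iters`.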
CPU cache
Still, you might read a stale value: on modern architectures each core has its own cache and store buffer, and reads and writes can become visible to other cores later, or in a different order, than the program issued them. You need a read memory barrier, usually a platform-specific instruction.
On Wintel, thread synchronization functions will automatically add a full memory barrier, or you can use the InterlockedXxxx functions.
MSDN: Memory and Synchronization issues, MemoryBarrier Macro
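For illustration, here is a portable sketch of the barrier idea. Note the substitution: it uses standard C++11 `std::atomic_thread_fence` instead of the Windows `MemoryBarrier` macro or the InterlockedXxxx functions, so it can be compiled anywhere. The names `payload`, `ready`, `producer`, `consumer` and `run_handoff` are made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // plain data handed from writer to reader
std::atomic<bool> ready{false};   // flag announcing that payload is valid

void producer() {
    payload = 42;
    // Write barrier: the payload store may not move below the flag store.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // Read barrier: the payload load may not move above the flag load.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Runs one writer and one reader and returns what the reader saw.
int run_handoff() {
    payload = 0;
    ready.store(false);
    int seen = 0;
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer(producer);
    writer.join();
    reader.join();
    assert(ready.load());
    return seen;
}
```

The pairing is the important part: a barrier on the writing side alone, or on the reading side alone, is not enough. On Windows the same structure could presumably be written with `MemoryBarrier()` in place of both fences, at the cost of a full barrier on each side.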
[edit] please also see drhirsch’s comments.