x86, difference between BYTE and BYTE PTR

Summary: NASM/YASM requires word [ecx] when the operand-size isn’t implied by the other operand. (Otherwise [ecx] is ok). MASM/TASM requires word ptr [ecx] when the operand-size isn’t implied by the other operand. (Otherwise [ecx] is ok). They each choke on the other’s syntax. WARNING: This is a very strange area without any ISO standards or easy-to-find … Read more

What exactly is the base pointer and stack pointer? To what do they point?

esp is as you say it is, the top of the stack. ebp is usually set to esp at the start of the function. Function parameters and local variables are accessed by adding and subtracting, respectively, a constant offset from ebp. All x86 calling conventions define ebp as being preserved across function calls. ebp itself … Read more

Why does std::tuple break small-size struct calling convention optimization in C++?

It seems to be a matter of ABI. For instance, the Itanium C++ ABI reads: If the parameter type is non-trivial for the purposes of calls, the caller must allocate space for a temporary and pass that temporary by reference. And, further: A type is considered non-trivial for the purposes of calls if it has … Read more

Crash with icc: can the compiler invent writes where none existed in the abstract machine?

Your program is well-formed and free of undefined behaviour, as far as I can tell. The C++ abstract machine never actually assigns to a const object. A not-taken if() is sufficient to “hide”/“protect” things that would be UB if they executed. The only thing an if(false) can’t save you from is an ill-formed program, e.g. … Read more

Do function pointers force an instruction pipeline to clear?

On some processors an indirect branch will always clear at least part of the pipeline, because it will always mispredict. This is especially the case for in-order processors. For example, I ran some timings on the processor we develop for, comparing the overhead of an inline function call, versus a direct function call, versus an … Read more

Relative performance of swap vs compare-and-swap locks on x86

I assume atomic_swap(lockaddr, 1) gets translated to a xchg reg,mem instruction and atomic_compare_and_swap(lockaddr, 0, val) gets translated to a cmpxchg[8b|16b]. Some Linux kernel developers think cmpxchg is faster, because the lock prefix isn’t implied as with xchg. So if you are on a uniprocessor, multithread or can otherwise make sure the lock isn’t needed, you … Read more
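
As a rough illustration of the two acquire styles this excerpt compares, here is a minimal C++ sketch using std::atomic. The names SwapLock and CasLock are hypothetical and do not come from the original answer. Which instructions the compiler emits (xchg versus lock cmpxchg) is an assumption about typical x86 code generation, not something the C++ standard guarantees.

```cpp
#include <atomic>

// Hypothetical sketch: try-lock built on an unconditional atomic swap.
// On x86 the exchange is typically compiled to xchg reg,mem.
struct SwapLock {
    std::atomic<int> flag{0};

    bool try_lock() {
        // Always writes 1; we own the lock only if the old value was 0.
        return flag.exchange(1, std::memory_order_acquire) == 0;
    }

    void unlock() { flag.store(0, std::memory_order_release); }
};

// Hypothetical sketch: try-lock built on compare-and-swap.
// On x86 this is typically compiled to lock cmpxchg.
struct CasLock {
    std::atomic<int> flag{0};

    bool try_lock() {
        int expected = 0;
        // Writes 1 only if the current value equals expected (0).
        return flag.compare_exchange_strong(expected, 1,
                                            std::memory_order_acquire);
    }

    void unlock() { flag.store(0, std::memory_order_release); }
};
```

Both variants behave the same from the caller's point of view: try_lock() returns true only when the lock was free. Any performance difference depends on the generated instructions and the hardware.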

why are separate icache and dcache needed [duplicate]

The main reason is: performance. Another reason is power consumption. Separate dCache and iCache make it possible to fetch instructions and data in parallel. Instructions and data have different access patterns. Writes to iCache are rare. CPU designers are optimizing the iCache and the CPU architecture based on the assumption that code changes are rare. … Read more