Why are separate icache and dcache needed? [duplicate]

The main reason is performance. Another reason is power consumption. Separate dCache and iCache make it possible to fetch instructions and data in parallel. Instructions and data have different access patterns: writes to the iCache are rare. CPU designers optimize the iCache and the CPU architecture based on the assumption that code changes are rare. … Read more

What are _mm_prefetch() locality hints?

Sometimes intrinsics are better understood in terms of the instruction they represent rather than the abstract semantics given in their descriptions. The full set of the locality constants, as of today, is:

#define _MM_HINT_T0   1
#define _MM_HINT_T1   2
#define _MM_HINT_T2   3
#define _MM_HINT_NTA  0
#define _MM_HINT_ENTA 4
#define _MM_HINT_ET0  5
#define _MM_HINT_ET1  6
#define _MM_HINT_ET2  … Read more

Does a memory barrier ensure that the cache coherence has been completed?

The memory barriers present on the x86 architecture – but this is true in general – not only guarantee that all the previous1 loads or stores are completed before any subsequent load or store is executed; they also guarantee that the stores have become globally visible. By globally visible it is meant that other … Read more

What use is the INVD instruction?

Excellent question! One use-case for such a blunt-acting instruction as invd is in specialized or very-early-bootstrap code, such as when the presence or absence of RAM has not yet been verified. Since we might not know whether RAM is present, its size, or even if particular parts of it function properly, or we might not … Read more

Cycles/cost for L1 Cache hit vs. Register on x86?

Here’s a great article on the subject: http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1 To answer your question – yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly 😉 PS: The specifics will vary, but this link has some good ballpark figures: Approximate cost to access various caches … Read more

What is locality of reference?

This would not matter if your computer was filled with super-fast memory. But unfortunately that’s not the case and computer memory looks something like this1:

+----------+
|   CPU    | <<-- Our beloved CPU, superfast and always hungry for more data.
+----------+
|L1 - Cache| <<-- ~4 CPU-cycles access latency (very fast), 2 loads/clock throughput
+----------+
|L2 … Read more

Temporal vs Spatial Locality with arrays

Spatial and temporal locality describe two different characteristics of how programs access data (or instructions). Wikipedia has a good article on locality of reference. A sequence of references is said to have spatial locality if things that are referenced close in time are also close in space (nearby memory addresses, nearby sectors on a disk, … Read more

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

L1 is very tightly coupled to the CPU core, and is accessed on every memory access (very frequent). Thus, it needs to return the data really fast (usually within one clock cycle). Latency and throughput (bandwidth) are both performance-critical for L1 data cache. (e.g. four cycle latency, and supporting two reads and one write by … Read more
