Using SSE instructions

SSE instructions are processor-specific. You can look up which processor supports which SSE version on Wikipedia. Whether SSE code will be faster depends on many factors. The first is, of course, whether the problem is memory-bound or CPU-bound: if the memory bus is the bottleneck, SSE will not help much. Try simplifying …
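As a rough illustration of the point above, here is a minimal sketch of an SSE loop in C. The helper name `add_floats` and the `HAVE_SSE` macro are made up for this example; only the intrinsics from `<xmmintrin.h>` are real. The code falls back to a plain scalar loop when the compiler does not report SSE support.

```c
#include <stddef.h>

/* HAVE_SSE is a hypothetical macro for this sketch: it is set when the
   compiler reports SSE support through its predefined macros. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps */
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for every i in [0, n). */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if HAVE_SSE
    /* Four floats per iteration. Unaligned loads and stores are used so
       the caller does not need to align the arrays. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, or the whole array when SSE is not available. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

This loop performs one addition for every two loads and one store, so on large arrays it is likely to be limited by memory bandwidth rather than by arithmetic. In that case the SSE version may gain little over the scalar loop, which an optimizing compiler may vectorize on its own anyway.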

Processor, OS: 32-bit, 64-bit

Let’s try to answer this question by looking at people versus computers; hopefully this will shed some light on things for you. Things to Keep In Mind: As amazing as they are, computers are very, very dumb. Memory: People have memory (with the exception, arguably, of husbands and politicians). People store information in their memory …

Why is Intel Haswell XEON CPU sporadically miscomputing FFTs and ART?

EDIT: Problem solved. I owe the community a huge sorry and a big thank-you for your hints. Sorry to user anonymous, who seems to be involved in kernel development. What happened? We spent another two days debugging and fiddling around with the program code. No implementation problems were found. BUT: …

Determine word size of my processor

Your assumption about sizeof(int) is untrue; see this. Since you must know the processor, OS and compiler at compilation time, the word size can be inferred from the predefined architecture/OS/compiler macros that the compiler provides. However, while on simpler and most RISC processors word size, bus width, register size and memory organisation are often consistently one …
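A minimal sketch of the macro-based approach in C follows. `TARGET_BITS`, `target_bits` and `int_bits` are names invented for this example. The architecture macros checked are ones commonly predefined by GCC, Clang and MSVC, and the list is not exhaustive. When none of them matches, the sketch falls back to the pointer width from `<stdint.h>`, which is only an approximation of the word size.

```c
#include <limits.h>  /* CHAR_BIT */
#include <stdint.h>  /* UINTPTR_MAX */

/* TARGET_BITS is a hypothetical macro for this sketch. */
#if defined(__x86_64__) || defined(_M_X64) || defined(__aarch64__) || defined(_M_ARM64)
#  define TARGET_BITS 64
#elif defined(__i386__) || defined(_M_IX86) || defined(__arm__) || defined(_M_ARM)
#  define TARGET_BITS 32
#elif UINTPTR_MAX > 0xFFFFFFFFu
#  define TARGET_BITS 64  /* unknown architecture, wide pointers */
#else
#  define TARGET_BITS 32  /* assumption: 16-bit targets are not handled */
#endif

/* Word size in bits as inferred at compile time. */
int target_bits(void)
{
    return TARGET_BITS;
}

/* Width of int in bits, for comparison with target_bits(). */
int int_bits(void)
{
    return (int)(sizeof(int) * CHAR_BIT);
}
```

On common 64-bit platforms `int_bits()` still returns 32 while `target_bits()` returns 64, which is why sizeof(int) is not a reliable indicator of word size.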

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

L1 is very tightly coupled to the CPU core and is accessed on every memory access (very frequently). Thus, it needs to return the data really fast (usually within a few clock cycles). Latency and throughput (bandwidth) are both performance-critical for L1 data cache. (e.g. four cycle latency, and supporting two reads and one write by …

What is the difference between superscalar execution and pipelining?

Superscalar design involves the processor being able to issue multiple instructions in a single clock, with redundant facilities to execute an instruction. We’re talking about within a single core, mind you; multicore processing is different. Pipelining divides an instruction into steps, and since each step is executed in a different part of the processor, …
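To make the distinction concrete, here is a small C sketch of idealized cycle counts. It assumes a textbook model that is not part of the excerpt above: every stage takes one clock, there are no stalls or dependencies, and a superscalar core of width `w` can always issue `w` instructions per clock. The function names are made up for this example.

```c
/* n = number of instructions, k = number of pipeline stages,
   w = issue width (must be > 0). All counts are idealized. */

/* No pipelining: each instruction occupies the whole core for k cycles. */
unsigned long unpipelined_cycles(unsigned long n, unsigned long k)
{
    return n * k;
}

/* Pipelining: the first instruction needs k cycles, and every later
   instruction completes one cycle after the previous one. */
unsigned long pipelined_cycles(unsigned long n, unsigned long k)
{
    return n == 0 ? 0 : k + (n - 1);
}

/* Pipelining plus superscalar issue: instructions enter the pipeline
   in groups of w, so ceil(n / w) groups flow through the k stages. */
unsigned long superscalar_cycles(unsigned long n, unsigned long k,
                                 unsigned long w)
{
    unsigned long groups = (n + w - 1) / w;  /* ceil(n / w) */
    return n == 0 ? 0 : k + (groups - 1);
}
```

Under these assumptions, 100 instructions on a 5-stage design take 500 cycles without pipelining, 104 cycles with pipelining, and 54 cycles with pipelining and two-wide issue. Real processors fall short of these figures because of hazards, cache misses and branch mispredictions.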