Context switches much slower in new Linux kernels

The solution to the bad thread wake up performance problem in recent kernels has to do with the switch to the intel_idle cpuidle driver from acpi_idle, the driver used in older kernels. Sadly, the intel_idle driver ignores the user's BIOS configuration for the C-states and dances to its own tune. In other words, even if you completely disable all C states in your PC's (or server's) BIOS, this driver will still force them on during brief periods of inactivity, which occur almost constantly unless a synthetic benchmark that saturates every core (e.g., stress) is running. You can monitor C state transitions, along with other useful information related to processor frequencies, using the wonderful i7z tool (hosted on Google Code) on compatible hardware.

To see which cpuidle driver is currently active in your setup, just cat the current_driver file in the cpuidle section of /sys/devices/system/cpu as follows:

cat /sys/devices/system/cpu/cpuidle/current_driver
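
For a slightly fuller picture, a short sketch along the following lines prints the driver together with the idle states the kernel exposes for one CPU and the exit latency each one reports. It assumes the usual cpuidle sysfs layout (stateN directories containing name and latency files); show_cpuidle is a name made up for this example, and the optional first argument exists only so the function can be pointed at a different directory.

```shell
#!/bin/sh
# Print the active cpuidle driver and the idle states exposed for one CPU.
# Usage: show_cpuidle [sysfs_cpu_root] [cpu]   (both arguments are optional)
show_cpuidle() {
    root="${1:-/sys/devices/system/cpu}"
    cpu="${2:-cpu0}"

    if [ -r "$root/cpuidle/current_driver" ]; then
        echo "driver: $(cat "$root/cpuidle/current_driver")"
    else
        echo "driver: unknown (no current_driver file found)"
    fi

    # Each stateN directory describes one idle state; "latency" is the
    # exit latency in microseconds as reported by the driver.
    for state in "$root/$cpu"/cpuidle/state*; do
        [ -d "$state" ] || continue
        name=$(cat "$state/name" 2>/dev/null)
        latency=$(cat "$state/latency" 2>/dev/null)
        echo "$(basename "$state"): $name latency=${latency}us"
    done
    return 0
}

show_cpuidle
```

If the state list is empty and the driver reads none, no cpuidle driver is managing the C states on that machine.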

If you want your modern Linux OS to have the lowest context switch latency possible, you can disable all of these power saving features with kernel boot parameters.

On Ubuntu 12.04, you can do this by adding them to the GRUB_CMDLINE_LINUX_DEFAULT entry in /etc/default/grub, running update-grub, and rebooting. The boot parameters to add are:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll
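
After rebooting, it is worth confirming that the parameters actually reached the kernel. The sketch below checks /proc/cmdline for each one; check_boot_params is a hypothetical helper name, and the "quiet splash" shown in the comment is only an assumption about what a stock Ubuntu entry already contains.

```shell
#!/bin/sh
# For reference, the edited line in /etc/default/grub would look roughly like
# this (assuming the entry previously held only "quiet splash"):
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll"

# Report which of the given parameters appear on the kernel command line.
# Usage: check_boot_params <cmdline_file> <param>...
# Returns 0 if every parameter is present, 1 otherwise.
check_boot_params() {
    cmdline_file="$1"
    shift
    cmdline=" $(cat "$cmdline_file" 2>/dev/null) "
    missing=0
    for param in "$@"; do
        # The padding spaces ensure that only whole parameters match.
        case "$cmdline" in
            *" $param "*) echo "active:  $param" ;;
            *)            echo "missing: $param"; missing=1 ;;
        esac
    done
    return "$missing"
}

check_boot_params /proc/cmdline \
    intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll || true
```

A parameter reported as missing usually means that update-grub was not run or that the machine booted a different menu entry.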

Here are the gory details about what the three boot options do:

Setting intel_idle.max_cstate to zero will either revert your cpuidle driver to acpi_idle (which is what the documentation of the option says) or disable cpuidle completely. On my box it is completely disabled (i.e., displaying the current_driver file in /sys/devices/system/cpu/cpuidle produces an output of none). In this case the second boot option, processor.max_cstate=0, is unnecessary. However, since the documentation states that the OS should fall back to the acpi_idle driver, I put in the second boot option just in case.

The processor.max_cstate option sets the maximum C state for the acpi_idle driver to zero, hopefully disabling it as well. I do not have a system that I can test this on, because intel_idle.max_cstate=0 completely knocks out the cpuidle driver on all of the hardware available to me. However, if your installation does revert from intel_idle to acpi_idle with just the first boot option, please let me know in the comments whether the second option, processor.max_cstate, did what it was documented to do, so that I can update this answer.

Finally, the last of the three parameters, idle=poll, is a real power hog. It will disable C1/C1E, which removes the final remaining bit of latency at the expense of a lot more power consumption, so use that one only when it's really necessary. For most users this will be overkill, since the C1* latency is not all that large. Using my test application running on the hardware I described in the original question, the latency went from 9 µs to 3 µs. This is certainly a significant reduction for highly latency sensitive applications (e.g., financial trading, high precision telemetry/tracking, high frequency data acquisition, etc.), but may not be worth the electrical power hit for the vast majority of desktop apps. The only way to know for sure is to profile your application's improvement in performance against the actual increase in power consumption and heat of your hardware, and weigh the tradeoffs.

Update:

After additional testing with various idle=* parameters, I have discovered that setting idle to mwait, if supported by your hardware, is a much better idea. It seems that the use of the MWAIT/MONITOR instructions allows the CPU to enter C1E without adding any noticeable latency to the thread wake up time. With idle=mwait, you will get cooler CPU temperatures (as compared to idle=poll) and less power use, while still retaining the excellent low latencies of a polling idle loop. Therefore, my updated recommended set of boot parameters for low CPU thread wake up latency based on these findings is:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=mwait

The use of idle=mwait instead of idle=poll may also help Turbo Boost kick in (by helping the CPU stay below its TDP [Thermal Design Power]) and may help hyperthreading (for which MWAIT is the ideal mechanism for not consuming an entire physical core while at the same time avoiding the higher C states). This has yet to be proven in testing, however, and I will continue to test it.

Update 2:

The mwait idle option has been removed from newer 3.x kernels (thanks to user ck_ for the update). That leaves us with two options:

idle=halt – Should work as well as mwait, but test to be sure that this is the case with your hardware. The HLT instruction is almost equivalent to an MWAIT with state hint 0. The difference is that an interrupt is required to get out of a HLT state, while either a memory write or an interrupt can be used to get out of the MWAIT state. Depending on what the Linux kernel uses in its idle loop, this can make MWAIT potentially more efficient. So, as I said, test and profile to see whether it meets your latency needs.

and

idle=poll – The highest performance option, at the expense of power and heat.
