Customers often want the lowest possible latency for their applications, whether it’s VoIP, weather modeling, financial trading, etc. For a while now, CPUs have been able to transition between frequencies (P-states) based on load. The logic that drives this is called a CPU governor, and its user interface is the cpuspeed service.
For slightly less time, CPUs have been able to power certain sections of themselves on or off and scale voltages up or down to save power. This capability is known as C-states. The downside to the power savings that C-states provide is a decrease in performance, as well as non-deterministic performance that is outside of the application’s or operating system’s control.
Anyway, for years I have been seeing things like processor.max_cstate on the kernel cmdline. This got customers much better performance at higher power draw, a business trade-off they were fine with. As time went on, people began looking at their datacenters and thinking about how expensive power was getting, and how they’d like to find a way to consolidate. That’s code for virtualization. But what about workloads so far unsuitable for virtualization, like those I’ve mentioned…workloads that should continue to run on bare metal, and maybe even on specialized hardware?
There is a clear need for flexibility: sysadmins know that changes to the kernel cmdline require reboots. But the desire to enable absolute performance only-when-I-need-it demands a run-time tunable. Enter the /dev/cpu_dma_latency “pmqos” interface. This interface lets you specify a target latency for the CPU, meaning you can use it to indirectly (but precisely) control the C-state residency of your processors. For now it’s an all-or-nothing affair, but stay tuned: there is work underway to increase the granularity of C-state residency control to per-core.
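As I understand it, the interface works like this: you write a 32-bit target latency (in microseconds) into /dev/cpu_dma_latency, and the request stays active only as long as you hold the file descriptor open. Here’s a minimal Python sketch of that pattern (my own illustration, not the pmqos-static.py script itself):

```python
import os
import struct
import time

def hold_cpu_dma_latency(usec, duration_s):
    """Request a maximum CPU wakeup latency of `usec` microseconds.

    The kernel honors the request only while the fd stays open, so this
    sketch simply sleeps for `duration_s` seconds before releasing it.
    """
    fd = os.open("/dev/cpu_dma_latency", os.O_WRONLY)
    try:
        # The pm_qos interface accepts a binary 32-bit value, in microseconds.
        os.write(fd, struct.pack("i", usec))
        time.sleep(duration_s)  # the request is in force for this window
    finally:
        os.close(fd)  # kernel reverts to its default C-state policy

# Example (needs root): hold_cpu_dma_latency(0, 60) keeps CPUs in C0 for a minute.
```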
Back in 2011, Red Hatter Jan Vcelak wrote a handy script called pmqos-static.py that enables this paradigm. No more dependence on the kernel cmdline. Toggle C-states on demand, as your application desires. Control C-states from your application’s startup/shutdown scripts. Use cron to dial up the performance before business hours and dial it down after hours. Compared to the cmdline approach, this simple script can yield significant power savings.
A few notes, before the technical detail.
1) When you set processor.max_cstate=0, the kernel actually silently sets it to 1.
drivers/acpi/processor_idle.c:
1086         if (max_cstate == 0)
1087                 max_cstate = 1;
2) RHEL6 has had this interface forever, but only recently did we get the pmqos-static.py script.
3) This script provides “equivalent performance” to kernel cmdline options, with added flexibility.
So here’s what I mean…this is turbostat output on RHEL 6.3 on a Westmere X5650 (note: same behavior on a SNB E5-2690):
Test #1: processor.max_cstate=0
pk cr CPU    %c0   GHz   TSC    %c1   %c3    %c6  %pc3   %pc6
            2.37  3.06  2.67   0.03  0.13  97.47  0.00  67.31

Test #2: processor.max_cstate=1
pk cr CPU    %c0   GHz   TSC    %c1   %c3    %c6  %pc3   %pc6
            0.04  2.16  2.67   0.04  0.55  99.37  4.76  88.00

Test #3: processor.max_cstate=0 intel_idle.max_cstate=0
pk cr CPU    %c0   GHz   TSC    %c1   %c3    %c6  %pc3   %pc6
            0.02  2.20  2.67  99.98  0.00   0.00  0.00   0.00

Test #4: processor.max_cstate=1 intel_idle.max_cstate=0
pk cr CPU    %c0   GHz   TSC    %c1   %c3    %c6  %pc3   %pc6
            0.02  2.29  2.67  99.98  0.00   0.00  0.00   0.00

Test #5: intel_idle.max_cstate=0
pk cr CPU    %c0   GHz   TSC    %c1   %c3    %c6  %pc3   %pc6
            0.02  2.19  2.67  99.98  0.00   0.00  0.00   0.00
# rpm -q tuned
tuned-0.2.19-9.el6.noarch
Test #6: now with /dev/cpu_dma_latency set to 0 (via latency-performance profile) and intel_idle.max_cstate=0.
The cmdline overrides /dev/cpu_dma_latency.
# tuned-adm profile latency-performance
pk cr CPU    %c0   GHz   TSC    %c1   %c3    %c6  %pc3   %pc6
            0.01  2.32  2.67  99.99  0.00   0.00  0.00   0.00
Test #7: no cmdline options + /dev/cpu_dma_latency via latency-performance profile.
# tuned-adm profile latency-performance
pk cr CPU     %c0   GHz   TSC    %c1   %c3    %c6  %pc3   %pc6
           100.00  2.93  2.67   0.00  0.00   0.00  0.00   0.00
There is additional flexibility, too…let me illustrate:
# find /sys/devices/system/cpu/cpu0/cpuidle -name latency -o -name name | xargs cat
C0
0
NHM-C1
3
NHM-C3
20
NHM-C6
200
This shows the exit latency (in microseconds) for the various C-states on this particular Westmere (aka Nehalem/NHM). Each time the CPU transitions between C-states, you take a latency hit of almost exactly that many microseconds (which I can see in benchmarks). By default, an idle Westmere core sits in C6 (SNB sits in C7). Getting that core back up to C0 takes 200us.
Here’s what I meant about flexibility: you can control exactly which C-state you want your CPUs in via /dev/cpu_dma_latency and the pmqos-static.py script, all dynamically at runtime. cmdline options do not allow this level of control; as I showed, they override /dev/cpu_dma_latency. Exhaustive detail about what to expect from each C-state can be found in Intel’s Architecture documentation, around pages 35-52 or so…
Using the information I fished out of /sys above…set it to 200 and you’re in the deepest C-state:
# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=200
pk cr CPU    %c0   GHz   TSC    %c1   %c3    %c6  %pc3   %pc6
            0.04  2.19  2.67   0.04  0.26  99.66  0.91  91.91
Set it to anything in between 20 and 199, and you get into C3:
# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=199
pk cr CPU    %c0   GHz   TSC    %c1    %c3   %c6   %pc3  %pc6
            0.03  2.28  2.67   0.03  99.94  0.00  89.65  0.00
Set it to anything in between 1 and 19, and you get into C1:
# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=19
pk cr CPU    %c0   GHz   TSC    %c1   %c3    %c6  %pc3   %pc6
            0.02  2.18  2.67  99.98  0.00   0.00  0.00   0.00
Set it to 0 and you get into C0. This is what the latency-performance profile does:
# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=0
pk cr CPU     %c0   GHz   TSC    %c1   %c3    %c6  %pc3   %pc6
           100.00  2.93  2.67   0.00  0.00   0.00  0.00   0.00
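The pattern in these tests boils down to a simple rule: the kernel only uses C-states whose exit latency fits within the requested cpu_dma_latency. Using the Westmere latency table read out of /sys above, that selection logic can be sketched as (my own illustration of the observed behavior, not kernel code):

```python
# Westmere C-state table from /sys, ordered shallow -> deep: (name, exit latency in us)
WESTMERE_CSTATES = [("C0", 0), ("NHM-C1", 3), ("NHM-C3", 20), ("NHM-C6", 200)]

def deepest_allowed_cstate(requested_us, cstates=WESTMERE_CSTATES):
    """Return the deepest C-state whose exit latency fits the request."""
    allowed = [name for name, latency in cstates if latency <= requested_us]
    return allowed[-1]  # the table is ordered shallow -> deep

# deepest_allowed_cstate(0)   -> 'C0'      (Test #7, latency-performance)
# deepest_allowed_cstate(19)  -> 'NHM-C1'
# deepest_allowed_cstate(199) -> 'NHM-C3'
# deepest_allowed_cstate(200) -> 'NHM-C6'
```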
Sandy Bridge chips also have C7, but the same rules apply.
You have to decide whether this flexibility buys you anything in order to justify rolling any changes across your environment.
Maybe just understanding how and why it works is enough!
Moving away from unnecessary kernel cmdline options means one less thing for you to maintain, although I realize you still have to enable the tuned profile. So, dial up the performance when you need it, and save power/money when you don’t! Pretty cool!