“Baby, we were born to run…” — Every userspace process ever.

I tweeted recently that %usr is what we wanna do; %sys is what we gotta do… What I meant was that the kernel’s main goals in life are to bring up hardware, manage access to it on behalf of applications, and get out of the way.  This includes allocating memory when an application asks for it, taking network packets from an application and handing them to the network card, and deciding which application runs on which core, when it runs (ordering), and for how long.

Since at least the days of the Apollo Guidance Computer, there has been the concept of priorities in job scheduling.  Should you have the time, I highly recommend the Wikipedia article, this book, and the AGC Emulator.

Anyway, in more recent operating systems like Linux, the user interface to the job scheduler is quite similar — a system of policies and priorities.  There’s a great write-up in the Red Hat MRG Realtime docs here.

The system of policies and priorities represents a multi-tiered approach to ordering jobs on a multitasking operating system.  The user herself or an application may ask the kernel for a certain scheduling policy and priority.  By themselves, those values don’t mean much.  But when there’s a contended resource (such as a single CPU core), they quickly come into play by telling the scheduler how the various tasks rank relative to each other.  For example, in the case of the AGC, an engine control application would be prioritized higher than, say, a cabin heater.

The kernel can’t read minds, so we occasionally must provide it with guidance as to which application is the highest priority.  If you have a server whose purpose is to run an application that predicts the weather, you don’t need log cleanup scripts, data archival, backups, etc. running when the weather app has “real work” to do.  Without any guidance, the kernel will assume these tasks are of equal weight, when in fact the operator knows better.

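The tools to manipulate scheduler policy and priority are things like nice and chrt (there are also syscalls that apps can use directly).  In the previous example, you might use nice to inform the scheduler that the weather application is the most important task on the system, and it should run whenever possible.  Something like ‘nice -n -20 ./weather’ or ‘renice -20 `pidof weather`’.  (A niceness of -20 is the highest priority and 19 the lowest; setting a negative value requires root.  Note that a bare ‘nice -20’ is the old spelling of ‘nice -n 20’, which lowers the priority instead.)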
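The same knob works in the other direction, and that direction needs no privileges: instead of boosting the weather app, you can push the housekeeping jobs down.  Here’s a minimal sketch of that idea.  The helper name run_low_priority is made up for illustration; it’s not a standard tool.

```shell
# Sketch: run a housekeeping command at the lowest niceness (19).
# Raising niceness needs no privileges; going below 0 (e.g. -20) needs root.
# run_low_priority is a hypothetical helper name, not a standard tool.
run_low_priority() {
    nice -n 19 "$@"
}

# `nice` with no arguments prints the niceness it is running at,
# so this shows what a wrapped command would inherit.
run_low_priority nice
```

You would wrap the log cleanup or backup job the same way, e.g. run_low_priority ./cleanup.sh (script name assumed).  For the real-time policies the analogous tool is chrt; ‘chrt -p <pid>’ shows a task’s current policy and priority.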

Back to the kernel’s main point in life:  mediating access to hardware.  In order to do this, the kernel may spawn a special type of process called a kthread.  Kthreads cannot be controlled like regular processes (e.g. setting CPU/memory affinity or killing them).  At some point, if these kthreads have work to do, the scheduler will let them run (I wrote about some of this previously).  They have important jobs to do, like writing out dirty memory pages to disk (bdi-flush), shuffling network packets around (ksoftirqd), or servicing various kernel modules like InfiniBand.

When the kthreads run, they might run on the same core where the weather app is running.  This interruption in userspace execution can cause a few symptoms, e.g. jittery latency, increased CPU cache misses, and poor overall performance.

If you’re staring at one of these symptoms, you might be curious what’s the easiest way to find out what’s bumping you off-core and dumping your precious cache lines.

There are a few ways to determine this.  I wrote about how to use perf sched record to do it in a low latency whitepaper, but wanted to write about a second method I’ve been using a bit lately as well.

You can use a Systemtap script included in RHEL6 called ‘cycle_thief.stp’ (written by Red Hat’s Will Cohen) to find out what’s jumping ahead of you.  Here’s an example; PID 3391 is a KVM guest.  I added the [Section X] markers to make explaining the output a bit easier.  I also removed the histogram buckets with zero values to shorten the output.  Finally, I let it run for 30 seconds before hitting Ctrl+C.

# stap cycle_thief.stp -x 3391
^C
[Section 1]  task 3391 migrated: 1
[Section 2]  task 3391 on processor (us):
value |-------------------------------------------------- count
 16   |@@@@@@@@@@@@ 12
 32   |@@@@@@@@@@@ 11
 64   |@ 1
[Section 3] task 3391 off processor (us)
value   |-------------------------------------------------- count
 128    |@@@@@@@@@@@@ 12
 8192   |@@@@ 4
 131072 |@@@@ 4
 524288 |@@@ 3
[Section 4]
other pids taking processor from task 3391
 0    55
 3393 17
 2689 13
 115  4
 69   2
 431  1
[Section 5]
irq taking processor from task 3391
irq count min(us) avg(us) max(us)

Section 1 represents the number of times PID 3391 was migrated between CPU cores.

Section 2 is a histogram of the number of microseconds PID 3391 was on-core (actively executing on a CPU).

Section 3 is a histogram of the number of microseconds PID 3391 was off-core (something else was running).

Section 4 identifies which PIDs executed on the same core PID 3391 wanted to use during those 30 seconds (and thus bumped PID 3391 off-core).  You can grep the process table to see what these are.  Sometimes you’ll find other userspace processes, sometimes you’ll find kthreads.  You can see this KVM guest was off-core more than on.  It’s just an idle guest I created for this example, so that makes sense.
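Grepping the process table for each PID gets old quickly, so here’s a small sketch that does the lookup for a list of PIDs.  The helper name pid_names is made up for illustration, and it assumes a Linux /proc.

```shell
# Sketch: print "PID name" for each PID given, e.g. the ones listed in Section 4.
# pid_names is a hypothetical helper; it reads /proc/<pid>/comm (Linux-specific).
pid_names() {
    for pid in "$@"; do
        if [ -r "/proc/$pid/comm" ]; then
            printf '%s %s\n' "$pid" "$(cat "/proc/$pid/comm")"
        else
            printf '%s (no such process)\n' "$pid"
        fi
    done
}

# Example: look up this shell itself
pid_names "$$"
```

For the output above you’d run something like ‘pid_names 3393 2689 115 69 431’.  PID 0 has no /proc entry; it’s the kernel’s idle task, which fits with this being an idle guest.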

Section 5 is blank; had there been any IRQs serviced by this core during the 30 second script runtime, they’d be counted here.

With an understanding of the various policies and priorities (see MRG docs or man 2 setpriority), cycle_thief.stp is a super easy way of figuring out how to set your process policies and priorities to maximize the amount of time your app is on-core doing useful work.

Battle Plan for RDMA over Converged Ethernet (RoCE)

What is all that %sys time?  “I never know what she’s _doing_ in there…” Ha!

12:01:35 PM CPU %usr %nice %sys %iowait %irq %soft %idle
12:01:36 PM all 0.08 0.00  3.33 0.00    0.00 5.00  91.59
12:01:36 PM 0   0.00 0.00 40.59 0.00    0.00 59.41  0.00

...

You can instantly find out with ‘perf top’.  In this case (netperf), the kernel is spending time copying skb’s around, mediating between kernel and userspace.  I wrote a bit about this in a previous blog post; the traditional protection ring.

All that copying takes time…precious, precious time.  And CPU cycles; also precious.  And memory bandwidth…etc.

HPC customers have, for decades, been leveraging Remote Direct Memory Access (RDMA) technology to reduce latency and associated CPU time.  They use InfiniBand fabrics and associated InfiniBand verbs programming to extract every last bit of performance out of their hardware.

As always, that last few percent performance ends up being the most expensive.  Both in terms of hardware and software, as well as the people-talent and their effort.  But it’s also sometimes the most lucrative.

Over the last few years, some in-roads have been made in lowering the bar to entry into RDMA implementation, with one of those being RoCE (RDMA Over Converged Ethernet).  My employer Red Hat ships RoCE libraries (for Mellanox cards) in the “High Performance Networking” channel.

I’ve recently been working on characterizing RoCE in the context of its usefulness in various benchmarks and customer loads, so to that end I went into the lab and wired up a pair of Mellanox ConnectX-3 VPI cards back-to-back with a 56Gbit IB cable.  The cards are inside Sandy Bridge generation servers.

Provided some basic understanding of the hideous vernacular in this area, it turns out to be shockingly easy to set up RoCE.  Here’s some recommended reading to get you started:

First thing, make sure your server is subscribed to the HPN channel on RHN.  Then let’s get all the packages installed.

# yum install libibverbs-rocee libibverbs-rocee-devel libibverbs-rocee-devel-static libibverbs-rocee-utils libmlx4-rocee libmlx4-rocee-static rdma mstflint libibverbs-utils infiniband-diags

The Mellanox VPI cards are multi-mode, in that they support either InfiniBand or Ethernet.  The cards I’ve got came in InfiniBand mode, so I need to switch them over.  Mellanox ships a script called connectx_port_config to change the mode, but we can do it with driver options too.

Get the PCI address of the NIC:

# lspci | grep Mellanox
 21:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

Check what ethernet devices exist currently:

# ls -al /sys/class/net

I see ib0/1 devices now since my cards are in IB mode.  Now let’s change to ethernet mode.  Note you need to substitute your PCI address as it will likely differ from mine (21:00.0).  I need eth twice since this is a dual-port card.

 # echo "0000:21:00.0 eth eth" >> /etc/rdma/mlx4.conf
 # modprobe -r mlx4_ib
 # modprobe -r mlx4_en
 # modprobe -r mlx4_core
 # service rdma restart ; chkconfig rdma on
 # modprobe mlx4_core
 # ls -al /sys/class/net

Now I see eth* devices (you may see pXpY names depending on the BIOS), since the cards are now in eth mode. If you look in dmesg you will see the mlx4 driver automatically sucked in the mlx4_en module accordingly.  Cool!

Let’s verify that there is now an InfiniBand device ready for use:

# ibstat
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.11.500 <-- flashed the latest fw using mstflint.
	Hardware version: 0
	Node GUID: 0x0002c90300a0e970
	System image GUID: 0x0002c90300a0e973
	Port 1:
		State: Active  <-------------------- Sweet.
		Physical state: LinkUp
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0202c9fffea0e970
		Link layer: Ethernet
	Port 2:
		State: Down
		Physical state: Disabled
		Rate: 10
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0202c9fffea0e971
		Link layer: Ethernet
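If you just want the “Link layer” lines without the rest of the ibstat output, the same information is exposed in sysfs.  Here’s a small sketch that reads it; the helper name show_link_layers is made up for illustration.

```shell
# Sketch: list the link layer of every RDMA port, read from sysfs.
# show_link_layers is a hypothetical helper; the optional argument lets you
# point it at a different root (default: /sys/class/infiniband).
show_link_layers() {
    root=${1:-/sys/class/infiniband}
    for f in "$root"/*/ports/*/link_layer; do
        [ -r "$f" ] || continue          # glob matched nothing
        port=${f%/link_layer}; port=${port##*/}
        dev=${f%/ports/*};     dev=${dev##*/}
        printf '%s port %s: %s\n' "$dev" "$port" "$(cat "$f")"
    done
}

show_link_layers
```

On the box above, both ports of mlx4_0 should be reported as Ethernet once the mode switch has taken effect.  On a machine with no RDMA devices it prints nothing.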

Cool, so we’ve got our RoCE device up from a hardware init standpoint.  Now give it an IP like any old NIC.
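On RHEL6, “like any old NIC” usually means an ifcfg file.  Here’s an illustrative fragment; the device name (eth2) and the /24 netmask are assumptions, and the address is just the one used as the broker address in the tests further down.  Substitute your own.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth2
# Illustrative only: device name, address and netmask are assumptions.
DEVICE=eth2
BOOTPROTO=none
IPADDR=172.17.2.41
NETMASK=255.255.255.0
ONBOOT=yes
```

Then bring it up with ‘ifup eth2’ and ping the other side.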

Special note for IB users:  most IB switches have a subnet manager built in (RHEL ships one too, called opensm).  But when using RoCE there’s no need for opensm or any other subnet manager, as it’s specific to InfiniBand fabrics and plays no part in Ethernet fabrics.  The InfiniBandTA article I linked above goes into some detail about what benefits the SM provides on IB fabrics.

Now we get to the hard and confusing part.  Just kidding, we’re done.  Was it that intimidating?  Let’s test it out using an RDMA application that ships with Red Hat MRG Messaging, called qpid-latency-test.  I chose this because it supports RDMA as a transport.

# yum install qpid-cpp-server qpid-cpp-server-rdma qpid-cpp-client qpid-cpp-client-devel -y
# qpidd --auth no -m no
 2013-03-15 11:45:00 [Broker] notice SASL disabled: No Authentication Performed
 2013-03-15 11:45:00 [Network] notice Listening on TCP/TCP6 port 5672
 2013-03-15 11:45:00 [Security] notice ACL: Read file "/etc/qpid/qpidd.acl"
 2013-03-15 11:45:00 [System] notice Rdma: Listening on RDMA port 5672  <-- Sweet.
 2013-03-15 11:45:00 [Broker] notice Broker running

 

Defaults: around 100us.

# numactl -N0 -m0 nice -20 qpid-latency-test -b 172.17.2.41 --size 1024 --rate 10000 --prefetch=2000 --csv
 10000,0.104247,2.09671,0.197184
 10000,0.11297,2.12936,0.198664
 10000,0.099194,2.11989,0.197529
 ^C

With tcp-nodelay: around 95us

# numactl -N0 -m0 nice -20 qpid-latency-test -b 172.17.2.41 --size 1024 --rate 10000 --tcp-nodelay --prefetch=2000 --csv
 10000,0.094664,3.00963,0.163806
 10000,0.093109,2.14069,0.16246
 10000,0.094269,2.18473,0.163521

With RDMA/RoCE/HPN:  around 65us.

# numactl -N0 -m0 nice -20 qpid-latency-test -b 172.17.2.41 --size 1024 --rate 10000 --prefetch=2000 --csv -P rdma
 10000,0.065334,1.88211,0.0858769
 10000,0.06503,1.93329,0.0879431
 10000,0.062449,1.94836,0.0872795
 ^C

Percentage-wise, that’s a really substantial improvement: going from around 100us (or 95us with tcp-nodelay) to around 65us is roughly a third less latency.  Plus don’t forget all the %sys time (which also includes memory subsystem bandwidth usage) you’re saving.  You get all those CPU cycles back to spend on your application!
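To put a number on “percentage-wise”, here’s the arithmetic as a tiny sketch, using the rounded figures from the three runs above.  The helper name pct_reduction is made up for illustration.

```shell
# Sketch: percentage reduction going from a "before" to an "after" latency.
# pct_reduction is a hypothetical helper; both inputs are in the same unit (here us).
pct_reduction() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

pct_reduction 100 65   # defaults vs. RDMA
pct_reduction 95 65    # tcp-nodelay vs. RDMA
```

That works out to about 35% against the defaults and about 32% against the tcp-nodelay run.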

Disclaimer:  I didn’t do any heroic tuning on these systems.  The above performance test numbers are only to illustrate “proportional improvements”.  Don’t pay much attention to the raw numbers other than order-of-magnitude.  You can do much better starting with this guide.

So!  Maybe kick the tires on RoCE, and get closer to wire speed with lower latencies.  Have fun!

Big-win I/O performance increase coming to KVM guests in RHEL6.4

I finally got the pony I’ve been asking for.

There’s a very interesting (and impactful) performance optimization coming to RHEL6.4.  For years we’ve had to do this sort of tuning manually, but thanks to the power of open source, this magical feature has been implemented and is headed your way in RHEL6.4 (try it in the beta!)

What is this magical feature…is it a double-rainbow ?  Yes.  All the way.

It’s vhost thread affinity via virsh emulatorpin.

If you’re familiar with the vhost_net network infrastructure added to Linux, it moves the network I/O out of the main qemu userspace thread to a kthread called vhost-$PID (where $PID is the PID of the main KVM process for the particular guest).  So if your KVM guest is PID 12345, you would also see a [vhost-12345] process.

Anyway…with the growing number of CPUs and amount of RAM available, and the proliferation of NUMA systems (basically everything x86 these days), we have to be very careful to respect NUMA topology when tuning for maximum performance.  Lots of common optimizations these days center around NUMA affinity tuning, and the automatic vhost affinity support is tangentially related to that.

If you are concerned with having the best performance for your KVM guest, you may have already used either virsh or virt-manager to bind the VCPUs to physical CPUs or NUMA nodes.  virt-manager makes this very easy by clicking “Generate from host NUMA configuration”:

[Image: vcpupin]

OK that’s great.  The guest is going to stick around on those odd-numbered cores.  On my system, the NUMA topology looks like this:

# lscpu|grep NUMA
NUMA node(s): 4
NUMA node0 CPU(s): 0,2,4,6,8,10
NUMA node1 CPU(s): 12,14,16,18,20,22
NUMA node2 CPU(s): 13,15,17,19,21,23
NUMA node3 CPU(s): 1,3,5,7,9,11

So virt-manager will confine the guest’s VCPUs to node 3.  You may think you’re all set now.  You’re close, and you can see the rainbow on the horizon.  You have significantly improved guest performance already by respecting physical NUMA topology, but there is more to be done.  Inbound pony.

Earlier I described the concept of the vhost thread, which contains the network processing for its associated KVM guest.  We need to make sure that the vhost thread’s affinity matches the KVM guest affinity that we implemented with virt-manager.

At the moment, this feature is not exposed in virt-manager or virt-install, but it’s still very easy to do.  If your guest is named ‘rhel64’, and you want to bind its “emulator threads” (like vhost-net), all you have to do is:

# virsh emulatorpin rhel64 1,3,5,7,9,11 --live
# virsh emulatorpin rhel64 1,3,5,7,9,11 --config
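If you’d rather not type the CPU list by hand, the kernel publishes each node’s CPUs in sysfs.  Here’s a sketch of pulling the list from there; the helper name node_cpulist is made up for illustration, and I’m assuming the sysfs list format (ranges and commas) is something virsh will take as-is.

```shell
# Sketch: print the CPU list of one NUMA node, as the kernel reports it in sysfs.
# node_cpulist is a hypothetical helper; the second argument overrides the
# sysfs root (default: /sys/devices/system/node).
node_cpulist() {
    cat "${2:-/sys/devices/system/node}/node$1/cpulist"
}

# Show node 0 on this machine, if the NUMA sysfs tree is present.
if [ -r /sys/devices/system/node/node0/cpulist ]; then
    node_cpulist 0
fi

# Usage for the guest above (not run here):
#   cpus=$(node_cpulist 3)
#   virsh emulatorpin rhel64 "$cpus" --live
#   virsh emulatorpin rhel64 "$cpus" --config
```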

Now the vhost-net threads share a last-level-cache (LLC) with the VCPU threads.  Verify with:

# taskset -pc <PID_OF_KVM_GUEST>
# taskset -pc <PID_OF_VHOST_THREAD>
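The same check can be scripted.  This is a sketch that reads the allowed-CPU list straight from /proc instead of parsing taskset output; the helper names affinity_of and same_affinity are made up for illustration.

```shell
# Sketch: compare the allowed-CPU lists of two tasks.
# affinity_of / same_affinity are hypothetical helpers reading
# Cpus_allowed_list from /proc/<pid>/status (Linux-specific).
affinity_of() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "/proc/$1/status"
}

same_affinity() {
    [ "$(affinity_of "$1")" = "$(affinity_of "$2")" ]
}

# Example: a task trivially matches itself. In practice pass the guest's
# qemu PID and the PID of its vhost-<qemu PID> thread.
if same_affinity "$$" "$$"; then echo "match"; else echo "MISMATCH"; fi
```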

These should match.  Cache memory is far faster than main memory, and the performance benefits of this NUMA/cache sharing are obvious…using netperf:

Avg TCP_RR (latency)
Before: 12813 trans/s
After: 14326 trans/s
% diff: +11.8%
Avg TCP_STREAM (throughput)
Before: 8856Mbps
After: 9413Mbps
% diff: +6.3%

So that’s a great performance improvement; just remember for now to run the emulatorpin stuff manually. Note that as I mentioned in previous blog posts, I always mis-tune stuff to make sure I did it right. The “before” numbers above are from the mis-tuned case ;)

Off topic…while writing this blog I was reminded of a really funny story I read on Eric Sandeen’s blog about open source ponies. Ha!