nsinit: per-container resource monitoring of Docker containers on RHEL/Fedora

The use-case for per-application resource counters

Administrators of *NIX-based systems are quite accustomed to viewing resource counters strewn throughout the system, in places like /proc, /sys and more recently /cgroup or /sys/fs/cgroup.  With the release of RHEL6 came widespread enterprise adoption of Control Groups (cgroups), which had been implemented steadily over a series of years and vetted both there and in Fedora (RHEL’s upstream).

Implementing cgroups not only let sysadmins carve up a single OS into multiple logical partitions, it also bought them per-cgroup counters that the kernel maintains.  That’s in addition to common use-cases such as quality of service guarantees or charge-back.

Docker’s unique twist

With the recent uptick in adoption of Linux containers (Docker encapsulates several mature technologies into an impressive usability package), administrators might be wondering where the per-container resource counters are.  We’re in luck!  Since Docker relies heavily on cgroups, many of the counters that sysadmins are familiar with “just work”.  They could benefit from some usability improvements, but if you’re comfortable spelunking through the cgroup VFS, you can dig them out fairly easily.

I should note that the hierarchy and commands below are specific to RHEL and Fedora, so you might have to customize some paths or package names for your system.
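As a rough illustration of "spelunking through the cgroup VFS", here is a minimal Python sketch that reads counters straight out of a container's cgroup directory.  The helper names are mine, and the directory layout is an assumption: it differs between distributions and Docker versions, so treat the path in the usage note as a placeholder.

```python
from pathlib import Path


def read_counter(cgroup_dir, name):
    """Return the integer stored in a single-value cgroup file,
    for example memory.usage_in_bytes or cpuacct.usage."""
    return int((Path(cgroup_dir) / name).read_text().strip())


def read_key_values(cgroup_dir, name):
    """Parse a cgroup file that holds one 'key value' pair per line,
    for example memory.stat, into a dict of integers."""
    stats = {}
    for line in (Path(cgroup_dir) / name).read_text().splitlines():
        key, _, value = line.partition(" ")
        stats[key] = int(value)
    return stats
```

On a systemd-based host you might call read_counter("/sys/fs/cgroup/memory/system.slice/docker-<full-container-id>.scope", "memory.usage_in_bytes"), substituting whatever path your system actually uses.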

In the most recent versions of Fedora, engineers have begun building and shipping a binary called ‘nsinit’.  It is part of libcontainer, the “execution driver” for Docker.  nsinit is a very powerful debugging utility that lets sysadmins not only view per-container resource counters, but also view the container’s runtime configuration and “jump into” a running container.

How to use the nsinit utility

First you should grab a copy from Fedora, or build it yourself.  Building it yourself is an unnecessarily complicated exercise, so I’m glad it is now packaged for Fedora and you can just do:

# yum install --enablerepo=updates-testing golang-github-docker-libcontainer

$ rpm -qf `which nsinit`
golang-github-docker-libcontainer-1.1.0-7.git29363e2.fc20.x86_64

# nsinit
NAME:
 nsinit - A new cli application

USAGE:
 nsinit [global options] command [command options] [arguments...]

VERSION:
 0.1

COMMANDS:
 exec execute a new command inside a container
 init runs the init process inside the namespace
 stats display statistics for the container
 config display the container configuration
 nsenter init process for entering an existing namespace
 pause pause the container's processes
 unpause unpause the container's processes
 help, h Shows a list of commands or help for one command

I’ll cover the most useful of nsinit’s capabilities: config, stats and exec.

Note:  nsinit currently requires that you run it while you're inside the container's state directory.  So from here on, all commands assume you're in there.

So, something like this:

# docker ps -q
4caad5492898

# CID=`docker ps -q`
# cd /var/lib/docker/execdriver/native/$CID*
# ll
total 8
-rw-r-xr-x. 1 root root 3826 Sep  1 20:11 container.json
-rw-r--r--. 1 root root  114 Sep  1 20:11 state.json

Those files are plain text, although not very human-readable.  nsinit pretty-prints them.  For example, here is an abridged version of the output of nsinit config (full version here).  Note that you can get much of this info (but not all) from docker inspect.

# nsinit config

{
 "mount_config": {
 "mounts": [
 {
 "type": "bind",
 "source": "/var/lib/docker/init/dockerinit-1.1.1",
 "destination": "/.dockerinit",
 "private": true
 },
 {
 "type": "bind",
 "source": "/etc/resolv.conf",
 "destination": "/etc/resolv.conf",
 "private": true
 },
<snip>
 "mount_label": "system_u:object_r:svirt_sandbox_file_t:s0:c631,c744"
 },
 "hostname": "4caad5492898",
 "environment": [
 "HOME=/",
 "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/goroot/bin:/gopath/bin",
 "HOSTNAME=4caad5492898",
 "DEBIAN_FRONTEND=noninteractive",
 "GOROOT=/goroot",
 "GOPATH=/gopath"
 ],
 "namespaces": {
 "NEWIPC": true,
 "NEWNET": true,
 "NEWNS": true,
 "NEWPID": true,
 "NEWUTS": true
 },
 "capabilities": [
 "CHOWN",
 "DAC_OVERRIDE",
 "FOWNER",
 "MKNOD",
 "NET_RAW",
 "SETGID",
 "SETUID",
 "SETFCAP",
 "SETPCAP",
 "NET_BIND_SERVICE",
 "SYS_CHROOT",
 "KILL"
 ],
 "networks": [
 {
 "type": "loopback",
 "address": "127.0.0.1/0",
 "gateway": "localhost",
 "mtu": 1500
 },
 {
 "type": "veth",
 "bridge": "docker0",
 "veth_prefix": "veth",
 "address": "172.17.0.6/16",
 "gateway": "172.17.42.1",
 "mtu": 1500
 }
 ],
 "cgroups": {
 "name": "4caad5492898f1a4230353de15e2acfc05809c69d05ec7289c6a14ef6d57b195",
 "parent": "docker",
 "allowed_devices": [
<snip>
 "process_label": "system_u:system_r:svirt_lxc_net_t:s0:c631,c744",
 "restrict_sys": true
}
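If you only care about a few of those fields, a small script can pull them out.  This is a sketch that assumes container.json has the same structure as the nsinit config output above; the function names are hypothetical, and the keys are the ones visible in the abridged output.

```python
import json


def load_config(path="container.json"):
    """Load the JSON file from the container's state directory."""
    with open(path) as handle:
        return json.load(handle)


def summarize_config(config):
    """Pick a few fields of interest out of a parsed container config."""
    return {
        "hostname": config.get("hostname"),
        # Only namespaces that are actually switched on.
        "namespaces": sorted(
            name for name, enabled in config.get("namespaces", {}).items() if enabled
        ),
        "capabilities": sorted(config.get("capabilities", [])),
        "addresses": [net.get("address") for net in config.get("networks", [])],
    }
```

Run from inside the state directory, summarize_config(load_config()) gives a quick answer to questions like "which capabilities did this container keep?".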

The stats mode is far more interesting.  nsinit reads cgroup counters for CPU and memory usage.  The network statistics come from /sys/class/net/<EthInterface>/statistics.  From here you can see how much memory your application is using, chart its growth, watch CPU utilization, cross-check data from other tools, etc.

{
 "network_stats": {
 "rx_bytes": 180568,
 "rx_packets": 89,
 "tx_bytes": 28316,
 "tx_packets": 92
 },
 "cgroup_stats": {
 "cpu_stats": {
 "cpu_usage": {
 "total_usage": 985559718,
 "percpu_usage": [
 43613750,
 79789656,
 132486590,
 78759739,
 49063680,
 60703059,
 36277458,
 35919550,
 36329424,
 20096103,
 8148695,
 25279255,
 0,
 0,
 0,
 6144761,
 14814784,
 2612915,
 95162480,
 33853872,
 114861235,
 71115914,
 6533416,
 33993382
 ],
 "usage_in_kernelmode": 510000000,
 "usage_in_usermode": 440000000
 },
 "throlling_data": {}
 },
 "memory_stats": {
 "usage": 27992064,
 "max_usage": 29020160,
 "stats": {
 "active_anon": 4411392,
 "active_file": 3149824,
 "cache": 22278144,
 "hierarchical_memory_limit": 9223372036854775807,
 "hierarchical_memsw_limit": 9223372036854775807,
 "inactive_anon": 0,
 "inactive_file": 19128320,
 "mapped_file": 3723264,
 "pgfault": 94783,
 "pgmajfault": 25,
 "pgpgin": 19919,
 "pgpgout": 13902,
 "rss": 4460544,
 "rss_huge": 2097152,
 "swap": 0,
 "total_active_anon": 4411392,
 "total_active_file": 3149824,
 "total_cache": 22278144,
 "total_inactive_anon": 0,
 "total_inactive_file": 19128320,
 "total_mapped_file": 3723264,
 "total_pgfault": 94783,
 "total_pgmajfault": 25,
 "total_pgpgin": 19919,
 "total_pgpgout": 13902,
 "total_rss": 4460544,
 "total_rss_huge": 2097152,
 "total_swap": 0,
 "total_unevictable": 0,
 "unevictable": 0
 },
 "failcnt": 0
 },
 "blkio_stats": {}
 }
}
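The CPU counters are cumulative, so charting utilization means sampling twice and taking the difference.  Here is a minimal sketch, assuming each sample is the parsed JSON shown above and that total_usage is in nanoseconds (which is how the cgroup cpuacct.usage counter is reported).  The function names are mine.

```python
def cpu_percent(earlier, later, interval_seconds):
    """CPU consumed between two stats samples, as a percentage of one core.

    Assumes total_usage is a cumulative nanosecond counter.
    """
    before = earlier["cgroup_stats"]["cpu_stats"]["cpu_usage"]["total_usage"]
    after = later["cgroup_stats"]["cpu_stats"]["cpu_usage"]["total_usage"]
    return (after - before) / (interval_seconds * 1e9) * 100.0


def memory_breakdown(sample):
    """Split the memory usage of one sample into its rss and page cache parts."""
    mem = sample["cgroup_stats"]["memory_stats"]
    return {
        "usage": mem["usage"],
        "rss": mem["stats"]["rss"],
        "cache": mem["stats"]["cache"],
    }
```

In the sample above, most of the 27992064 bytes of usage is page cache (22278144) rather than rss (4460544), which is the sort of thing worth knowing before you set a memory limit.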

nsenter is commonly used to run a command inside an existing container, something like

# nsenter -m -u -n -i -p -t 19119 bash

Where 19119 is the PID of a process in the container.  Ugly.  nsinit makes this slightly easier (at least IMHO):

# nsinit exec cat /etc/hostname
4caad5492898
# nsinit exec bash
bash-4.2# exit
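If you do go the plain nsenter route, the fiddly part is finding the PID.  The sketch below looks it up with docker inspect and builds the same nsenter command line as above.  The function names are hypothetical, and it assumes both docker and nsenter are on the host's PATH.

```python
import subprocess


def container_pid(container_id):
    """Ask Docker for the host PID of the container's first process."""
    out = subprocess.check_output(
        ["docker", "inspect", "--format", "{{.State.Pid}}", container_id]
    )
    return int(out.decode().strip())


def nsenter_argv(pid, command):
    """Build the nsenter command line that joins the mount, UTS, network,
    IPC and PID namespaces of the given process."""
    return ["nsenter", "-m", "-u", "-n", "-i", "-p", "-t", str(pid)] + list(command)
```

Something like subprocess.call(nsenter_argv(container_pid(cid), ["bash"])), run as root on the host, would then drop you into the container.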

nsinit’s capabilities and reported statistics are incredibly useful when debugging the implementation of QoS for each container, implementing/verifying resource-ceilings/guarantees, and for a more complete understanding of what your containers are doing.

This area is fast-moving…I did want to call out two other important developments, which should ultimately have more broad applicability than nsinit.

Google has published a project called cAdvisor that provides a basic web interface, but more importantly an API for higher layers (such as Kubernetes) to use.

Red Hat has proposed container support for Performance Co-Pilot, a system-level performance monitoring utility in RHEL7, along with goals of teaching many other tools about containers.

Using SCHED_FIFO in Docker containers on RHEL

Well, I’ve been asked about this quite a few times now, so I figured a blog post was in order…

When I was trying to get cyclictest running in a container, I ran into a little snag. I couldn’t run realtime prio tasks inside a container by default. I checked all the normal ulimit stuff for RT, but no dice.  But I did find a way (ugly).

If you do want to run SCHED_FIFO tasks you can in fact do so, like this:

1. Run a privileged container (Docker drops cap_sys_nice by default) by adding this to your docker run command:

--privileged

Or, if you have a more recent version of Docker, add this to your docker run command:

--cap-add=sys_nice

2. Set rt_runtime_us > 0 for the parent cgroup of where Docker containers end up in the hierarchy:

# echo 950000 > /sys/fs/cgroup/cpu/system.slice/cpu.rt_runtime_us

Still blocked:

# docker run -it cyclictest bash
root@231fbb116315: ~ # chrt -f 1 w
chrt: failed to set pid 0's policy: Operation not permitted

3. Update cpu.rt_runtime_us for the new container:

# echo 900000 > `find /sys/fs/cgroup/cpu/system.slice|grep docker|grep scope|grep cpu.rt_runtime_us`

Now it works:

root@231fbb116315: ~ # chrt -f 1 w
11:01:56 up 26 min, 0 users, load average: 0.08, 0.05, 0.05
USER TTY LOGIN@ IDLE JCPU PCPU WHAT
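The find/grep one-liner in step 3 works when exactly one container is running; with several matches the shell redirect would be ambiguous.  Here is a sketch of the same idea that handles each match separately.  The function names are mine, and the matching rule (a directory whose name contains both "docker" and "scope") is an assumption based on the paths used above.

```python
import os


def find_rt_runtime_files(root):
    """Return every cpu.rt_runtime_us file that sits in a directory whose
    name contains both 'docker' and 'scope', searching below root."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "cpu.rt_runtime_us" not in filenames:
            continue
        leaf = os.path.basename(dirpath)
        if "docker" in leaf and "scope" in leaf:
            matches.append(os.path.join(dirpath, "cpu.rt_runtime_us"))
    return sorted(matches)


def set_rt_runtime(path, microseconds):
    """Write a realtime runtime budget, in microseconds, to one cgroup file."""
    with open(path, "w") as handle:
        handle.write(str(microseconds))
```

As root, looping over find_rt_runtime_files("/sys/fs/cgroup/cpu/system.slice") and calling set_rt_runtime(path, 900000) on each entry would mirror the echo above.  Keep in mind the kernel will refuse a value that does not fit within the parent cgroup's budget.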

Yes, it should be made easier…the question is at what level we integrate this: Docker or orchestration.

For more info, see this Red Hat Bugzilla.

Getting Started with Performance Analysis of Docker

Docker introduces some intriguing usability, packaging and deployment patterns.  These new patterns offer the potential to effect massive improvements to the enterprise application development and operations specialties.  Containers also offer the promise of bare metal performance while offering some amount of isolation as well.  But can they deliver on that promise?

Since the early part of January, the Performance Engineering Group at Red Hat has run a huge number of microbenchmarks, benchmarks and application workloads in Docker containers.  The output of that effort has been a steady stream of lessons learned and advice/guidance given to our product architects and developers.  How dense can we go?  How fast can it go?  Are these defaults “sane”?  What NOT to do…etc.

Disclaimer:  as anyone who has worked with Docker knows, it’s a project under heavy development.  I mention that because this blog post includes code snippets and observations that are tied to specific experiments and Docker/kernel versions.  YMMV, the answer of course is “it depends”, and so on.

Performance tests we’ve pointed at Docker containers

We’ve done a whole bunch of R&D testing with bleeding edge, “niche” hardware and software to push and pull Docker containers in completely unnatural ways.   Based on our choice of benchmarks, you can see that the initial approach was to calculate the precise overhead of containers as compared to bare metal (Red Hat’s Project Atomic will support bare metal deployment of containers).  Of course we are also gathering numbers with VMs to compare and containers in VMs (which might be the end-game, who knows…) via OpenStack etc.

Starting at the core, and working our way to the heaviest, pushing all the relevant subsystems to their limits:

  • In-house timing syscall benchmarks (including vdso), libMicro
  • Linpack, single and double precision, Streams
  • Various incantations of sysbench (oltp and cpu)
  • iozone, smallfile, spinning disk, ssd and NAND flash
  • netperf on 10g and 40g, SR-IOV (pipework)
  • OpenvSwitch with VXLAN offload-capable NICs
  • Traditional “large” applications, i.e. business analytics
  • Addressing single-host vertical scalability limits by fixing the Linux kernel and fiddling some bits in Docker.
  • Using OpenvSwitch to get past the spanning-tree limitations of # of ports per bridged-interface.

All of these mine-sweeping experiments (lots more to come!) have allowed us to find and fix plenty of issues and document best-practices that we hope will lead to a great customer experience.

BTW if you’re interested in serious, low level, Enterprise-grade performance analysis and tuning for Linux containers (or in general!), let’s have a chat @DockerCon … I’ll be one of the guys in a Project Atomic T-shirt :-)

Unique Docker Philosophies

  • Ease of use:  Docker automates the use of existing Linux kernel technologies into an easily consumable format.  Setup and administration of traditionally disjoint subsystems (cgroups, namespaces, iptables, selinux) are encapsulated by Docker.
  • Packaging:  Docker specifies an image/packaging format that allows an application to be packaged with its full userspace requirements.  No longer is there a necessary interaction between system-level packages (other than the kernel) and the containerized application.  The application sees only what is provided inside the container.  This can be, for example, a specific version of gcc or php that differs from what the host OS provides.  I keep drawing an analogy to BIND “views”.

Performance interests aside, those are the 2 main selling points for me, and the benefits of those cannot be overstated.

Surprise, we added some enterprise-y stuff

Docker learns about systemd

Red Hat has taught Docker to use systemd, rather than sysvinit.  I mention this because (depending on who you’re talking to) it may be controversial.  But I believe that the true promise of containers on Linux relies on specific capabilities that systemd provides:  at least init dbus messaging, remote capabilities, cgroups API, remote journaling.

Docker systemd unit-file override:

  • systemd supports “.d”-style overrides for installed unit-files.  This is the correct way to customize the defaults for any systemd unit-file.  Overrides go in /etc/systemd/system/.
  • I need an override for my testing, because I want to use my own bridge device and I want to play with the MTU as well.  By default, Docker creates a bridge called docker0 and assigns IP addresses from that pool, useful for development, not production.  For production, I guess folks will want to set up their own bridge (or pass through a device, macvlan, whatever).
  • Assuming you have a bridge that you want to use, create a new systemd unit override file called /etc/systemd/system/docker.service.  Here is an example where I’ve set Docker to use a bridge named ‘br1’ and I also added ‘-D’ to enable debug logging for the Docker daemon.  br1 is on my test network, on an IP range that I control.  Finally, I’ve bumped the MTU to 9000 for some throughput tests…
ExecStart=/usr/bin/docker -d --selinux-enabled -H fd:// -b br1 -D --mtu=9000
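For the “.d”-style variant mentioned in the first bullet, the override lives in a drop-in file rather than a full copy of the unit.  The sketch below only renders the text of such a drop-in.  The function name and the file name in the usage note are hypothetical; the daemon flags are the ones from the ExecStart line above.

```python
def render_dropin(bridge, mtu, debug=False):
    """Render a systemd drop-in that replaces the Docker daemon command line."""
    args = ["/usr/bin/docker", "-d", "--selinux-enabled", "-H", "fd://", "-b", bridge]
    if debug:
        args.append("-D")
    args.append("--mtu={}".format(mtu))
    # The empty ExecStart= clears the command inherited from the packaged
    # unit; without it systemd would see two commands for one service.
    return "[Service]\nExecStart=\nExecStart={}\n".format(" ".join(args))
```

The result of render_dropin("br1", 9000, debug=True) could be written to something like /etc/systemd/system/docker.service.d/bridge.conf, followed by systemctl daemon-reload and a restart of the service.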

Also Stephen Tweedie spotted unnecessary memory consumption in systemd mount/umount handling, which was fixed in record time by Lennart Poettering :-)

Docker learns about SELinux

Red Hat has brought SELinux support to Docker.  If you’ve been using Red Hat products for any length of time, you know security is a first order concern for us.  Look at the stats for critical CVE response time…adding SELinux support to Docker should come as no surprise :-)  Shout out to the wizards in Red Hat’s Security Response Team, btw.

After the initial bring-up, SELinux support has been fairly painless for us in the Performance Group.  Dan Walsh is doing a talk called “SELinux and Docker” at DockerCon next week (June 10, 2pm, actually).  To give you a sense of how serious Red Hat is about containers and Docker, I should also mention Red Hat’s CTO Brian Stevens is doing one of the keynotes and we’re Platinum sponsoring.  Here’s the very high level picture:

[Figure: Red Hat Project Atomic introduction]

Dockerfile for Performance Analysis

What is a Dockerfile?

Why create a Dockerfile specifically for Performance Analysis?

  • One of the core principles of Docker images is that they are absolutely as small as possible.  This is because when a user wants to use your container image, they must pull it over the network.  Docker hosts a registry at http://index.docker.io.  Folks may stand up their own internal registries as well, where bandwidth is a bit less of a concern and images can contain site-specific customizations, intellectual property, licensed software, etc.
  • Our engineers have been working hard to reduce the base image size.  Therefore, the base images include the smallest usable package set, plus necessary tooling/package management utilities (yum) to pull in anything else the user needs inside their containers.  Think @core on steroids.
  • Because of the size constraints on the base image, we have to layer on our usual set of Performance Analysis tools via Dockerfile rather than kickstart.
  • A very common question I get from the field is to provide a precise list of performance analysis packages/tools that I would recommend in their base RHEL images.  So I put a slide in the Summit deck this year:

[Figure: Summit slide listing helpful performance analysis utilities]

Example Dockerfile

It’s not all that complicated, but includes lots of helpful utilities for characterizing workloads running inside containers.  You might see that sysstat is missing; that’s because I monitor that information on the host.  This is one critical difference between virtualization and containers:  the VCPUs of a KVM guest exist as processes in the host.  With containers, the actual containerized binary shows up in the process list of the host.  Note:  the PID namespace ensures isolation of process tables between containers.
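Because containerized processes show up in the host's process list, a natural question when monitoring from the host is "which container does this PID belong to?".  One way to answer it is to read /proc/<pid>/cgroup.  This is a sketch that assumes the cgroup path embeds the full 64-character container ID after "docker/" or "docker-", which depends on how your system lays out its cgroups.  The function names are mine.

```python
import re

# Matches both ".../docker/<id>" and ".../docker-<id>.scope" style paths.
_DOCKER_ID = re.compile(r"docker[-/]([0-9a-f]{64})")


def container_id_from_cgroup(cgroup_text):
    """Return the container ID found in the contents of a /proc/<pid>/cgroup
    file, or None if the process is not in a Docker cgroup."""
    match = _DOCKER_ID.search(cgroup_text)
    return match.group(1) if match else None


def container_id_for_pid(pid):
    """Look up the container ID for a PID as seen from the host."""
    with open("/proc/{}/cgroup".format(pid)) as handle:
        return container_id_from_cgroup(handle.read())
```

Combined with the usual host-side tools (top, pidstat, perf), this lets you attribute what you see on the host back to a specific container.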

FROM rhel7:latest
MAINTAINER perf <perf@domain.com>

RUN yum install -q -y bc blktrace btrfs-progs ethtool gcc git gnuplot hwloc iotop iproute iputils less mailx man-db netsniff-ng net-tools numactl numactl-devel openssh-clients openssh-server passwd perf procps-ng psmisc screen strace tcpdump vim-enhanced wget xauth which 

RUN git clone http://whatever/project.git

ENV HOME /root
ENV USER root
WORKDIR /root
EXPOSE 22

You might also notice that I’m installing numactl and hwloc.  That’s because recent versions of Docker provide access to sysfs hardware topology tables from the host, allowing you to apply similar tuning techniques as you would on bare metal on containerized processes.  We had some pretty funny test automation explosions when sysfs hardware topology was not exposed :-)  Side note, you can’t tune IRQ affinity from a non-privileged container, but luckily IRQ balance really does a great job these days (even knows about PCI-locality).  Privileged containers CAN program IRQ affinity.

CPU and memory affinity is another important differentiation between VMs and containers.   In a container,  core1 is core1 on the host, core2 is core2 etc (depending on your cgroups config).  With VMs you apply specific vcpupin/numatune/emulatorpin commands in order to ensure VCPU threads and their memory utilize specific CPUs/memory banks.  The process of properly applying affinity to KVM guests is well-documented in Red Hat’s Virtualization Tuning and Optimization Guide.  Naturally, when we characterize VMs and containers inside VMs, we often apply much of that.
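Since CPU numbers inside a container are the host's CPU numbers, you can check what a process is allowed to run on with the same interfaces you would use on bare metal.  Below is a small sketch that parses the kernel's CPU list notation, as found in a cpuset cgroup's cpuset.cpus file.  The function name is hypothetical.

```python
def parse_cpu_list(text):
    """Turn a kernel CPU list such as '0-3,8' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPUs listed
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus
```

Inside a container, comparing os.sched_getaffinity(0) with the parsed contents of cpuset.cpus is a quick sanity check that your cgroups config landed where you intended.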

How to build a container with the Performance Dockerfile

# time docker build --no-cache=true -t r7perf --rm=true - < Dockerfile_r7perf

# docker run -it r7perf bash

root@7d7b16277784: / # exit

How do I add my benchmark/tool/workload to this Docker container?

  • Ideally, a pre-configured set of scripts would be committed to your own git repo, and pulled into this container automatically in the Dockerfile (RUN git clone http://whatever/project.git).  This is our approach.
  • Add a RUN command to the Dockerfile that uses yum, wget, git or similar to pull in, install and configure your software.
  • Run a container interactively, then pull down the benchmark manually.  This is our fallback for some of the more challenging/complex benchmarks and under-load analysis.

How to get a benchmark running inside a Docker container

Let’s take for example, sysbench.

  • I’ve built RPMs for sysbench for RHEL6 and RHEL7 and committed them to our git repository.  I’ve also committed my driver script called run-sysbench.sh. (this isn’t mandatory, but using git makes things a LOT easier).
    • You can add a RUN statement to the Dockerfile that wget’s your benchmark/tarball from somewhere, or a RUN that does another git clone of some other repository.
    • However you would normally transfer your code to a new machine, you can do the same thing in the Dockerfile.
  • Once the container build is complete, launch a container, and kick off your workload.  run-sysbench.sh could be any driver/wrapper script that you’ve got.
host# docker run -it --privileged r7perf bash

container# yum install -y bench/sysbench/rhel7/*rpm mariadb-server mariadb ; cd bench/sysbench

container# ./run-sysbench.sh oltp docker

...run-sysbench.sh completes and spits out an output/logfile that it copies off the container (rsync, ftp whatever).
  • That’s it.  When the script finishes and you’ve copied off the results (part of run-sysbench.sh), you can ‘exit’ the container.
  • Astute observers will have noticed that I snuck ‘--privileged’ onto the command line above.  That is because my run-sysbench.sh wants to drop_caches, and that’s not something permitted to a container by default.  As an alternative, instead of using privileges, a container could ssh into its host machine as root and drop_caches from there.  See daemon/execdriver/lxc/init.go in the Docker source for the additional capabilities afforded to “privileged” containers.
  • Fun example:  create 100 containers running apache, in 14 seconds :-)
# time for i in $(seq 100) ; do docker run -d r7perf /usr/sbin/httpd -DFOREGROUND ; done

43bd1efc8fd4d8cedcced29cedf7176286077661a4df02c27756b3959a9fa75f
de1cc33c8f73d9ebce8676ab52da5e1da9518c649af87688f4a89dbda197c7cb
...

real 0m14.159s
user 0m0.386s
sys 0m0.386s
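If you want per-container numbers rather than one wall-clock total, the same loop is easy to wrap in a script.  This is a sketch with a hypothetical helper that times any launch function; the docker run arguments are copied from the loop above.

```python
import subprocess
import time


def time_launches(count, launch):
    """Call launch() count times; return (total seconds, average per call)."""
    start = time.perf_counter()
    for _ in range(count):
        launch()
    elapsed = time.perf_counter() - start
    return elapsed, elapsed / count


def launch_httpd():
    """Start one detached container, discarding the ID that docker prints."""
    subprocess.check_call(
        ["docker", "run", "-d", "r7perf", "/usr/sbin/httpd", "-DFOREGROUND"],
        stdout=subprocess.DEVNULL,
    )
```

Calling time_launches(100, launch_httpd) reproduces the experiment.  For the run shown above, 14.159 seconds over 100 containers works out to roughly 0.14 seconds per container.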

It’s not very often that a new technology comes up that creates a whole new column for performance characterization.  But containers have done just that, and so it’s been quite the undertaking.  There are still many test variations to run, but so far we’re encouraged.

That said, I have to keep reminding myself that performance isn’t always the first concern for everyone (*gasp*).  The packaging, development and deployment workflow that breaks the ties between host userspace and container userspace has frankly been a breath of fresh air.