Kernel development

Brief items

Kernel release status

The 3.17 merge window is still open as of this writing, so there is no current development kernel.

Stable updates: the 3.15.9, 3.14.16, 3.10.52, and 3.4.102 updates were released on August 7. Greg warns that there will only be one more 3.15 update, so 3.15 users should be thinking about moving on.

Comments (none posted)

Quotes of the week

Like all good pull reqs this ends with a revert, so it must mean we tested it.

— Dave Airlie

Lying to apps generally ends up like children lying to parents - the lie gets more complicated to keep up each case you find until it breaks.

— Alan Cox

Comments (none posted)

Stultz: 2038 Kernel Summit Discussion Fodder

For those interested in the year-2038 problem, this posting from John Stultz is worth a read. He covers the problem, the work that has been done so far to try to mitigate it, and options for the future.

A more aggressive version of the previous proposal is what I’m calling the “New Virtual-Architecture” approach, basically extending the versioning control from the linker down into the kernel as well. It would be adding a new “virtual-architecture” to the kernel, not entirely unlike how x32 is supported on x86_64 systems. We would create entirely new ABI and architecture name in the kernel (think something like “armllt” or “i386llt”). We would preserve compatibility for legacy applications via personalities, similar mechanism as the compat_ interface used to support 32bit applications on 64bit kernels. In this case, we wouldn’t introduce new 64 bit syscalls in the kernel, as the existing interfaces would just be typed correctly for our new virtual architecture, but we would have duplicate syscall interfaces via the compat interfaces.

Full Story (comments: 2)

Kernel development news

3.17 merge window, part 2

By Jonathan Corbet
August 13, 2014

As of this writing, Linus has pulled 9,894 non-merge changesets into the kernel repository for the 3.17 development cycle; that is 3,750 since last week's merge window summary was written. The pace has fallen off in recent days; Linus is evidently traveling and will eventually arrive at the 2014 Kernel Summit, which begins on August 18.

Some of the more interesting user-visible changes merged since last week include:

The memfd and file sealing patches have been merged. A "memfd" is a region of memory identified by a file descriptor that may be passed between processes. File sealing allows a process to freeze the contents of a memfd, disallowing any further changes. Together, these features are meant to be a key part of the upcoming kdbus subsystem.
The new kexec_file_load() system call is available. It allows the kernel to perform signature checking on a new kernel before booting into it. That, in turn, should allow nervous distributors to enable the kexec functionality on systems running in a UEFI secure boot environment.
Initial multiqueue support has been added to the SCSI subsystem. Multiqueue operation should provide increased performance and scalability. This code is experimental in this release and off by default; the use_blk_mq module parameter must be provided to turn it on.
KVM virtualization now works on big-endian ARM systems.
DRM "render nodes," which provide access to the rendering hardware in graphics processors independently of the display, are now enabled by default.
Support for the old POWER3 and rs64 architectures has been removed from the kernel. These architectures have evidently been broken for a number of releases and nobody noticed. Support for Samsung S5P6440, S5P6450, and S5PC100 systems has also been removed.
New hardware support includes:
- Processors and systems: Mediatek MT6589 systems-on-chip (SoCs), Broadcom BCM7XXX-based boards, and Hisilicon HiX5HD2 SoCs.
- Audio: Cirrus Logic CS4265 codecs, Realtek ALC286 and ALC5670 codecs, Freescale asynchronous sample rate converters, Intel Broadwell Wildcatpoint audio DSPs, Hardkernel Odroid-X2 and Odroid-U3 audio controllers, SiRF SoC USP interfaces, and Texas Instruments TAS2552 mono audio amplifiers.
- Graphics: STMicroelectronics SoC stiH41x chipsets and Cirrus Logic CLPS711X framebuffers.
- Input: Microchip CAP1106 six-channel capacitive touch sensors and Wacom protocol 4 serial tablets.
- Miscellaneous: HP iPAQ Atmel Micro ASIC battery controllers, Intel Crystal Cove power management ICs, Maxim MAX77802 power management ICs, Freescale i.MX1 pin controllers, Qualcomm 8960 pin controllers, MSI GT68xR LED controllers, NVIDIA Tegra XUSB pad controllers, NXP PCF85063 real-time clocks, and Xilinx Zynq GPIO controllers.

Changes visible to kernel developers include:

The ALSA sound driver core is now able to work with hardware setups where multiple codecs are attached to a single digital audio interface.

By normal standards, the 3.17 merge window would be expected to close on August 17. Linus has suggested that, in compensation for the time when he is traveling, the window may stay open for a bit longer, allowing him to complete the merging process during the slower moments at the Kernel Summit. Regardless of how that turns out, next week's Kernel Page will include a summary of the final patches merged for this development cycle.

Comments (none posted)

Ftrace: The hidden light switch

August 13, 2014

This article was contributed by Brendan Gregg

You may think, as I did, that analyzing the Linux kernel is like venturing through a dark dungeon: without the addition of advanced tracers like SystemTap, there's much that can't be seen, and can only be inferred. However, I've recently found hidden switches that turn on some bright lights, strategically placed by Steven Rostedt and others since the 2.6.27 release. These are the ftrace profilers. I haven't even tried all the switches yet, but I'm stunned at what I've seen so far, and I'm having to rethink what I previously believed about Linux kernel performance analysis.

Recently at Netflix (where I work), a Cassandra database was performing poorly after a system upgrade, and disk I/O inflation (a massive increase in the number of I/O operations submitted) was suspected. There can be many causes for this: a worse cache-hit ratio, record-size inflation, readahead inflation, other applications, even other asynchronous kernel tasks (file system background scrubs). The question was: which one, and how do we fix it?

1. iosnoop

I started with basic server health checks, and then my iosnoop tool. iosnoop is a shell script that uses the /sys/kernel/debug ftrace facilities, and is in my perf-tools collection on GitHub, along with other ftrace hacks. These work on Netflix's existing servers (which often run Linux 3.2 with security patches) without any other addition to the system, and without requiring kernel debuginfo. In this case, iosnoop was run both with and without -Q to see the effect of queuing:

  # ./iosnoop -ts
  STARTs          ENDs            COMM         PID    TYPE DEV      BLOCK        BYTES     LATms
  13370264.614265 13370264.614844 java         8248   R    202,32   1431244248   45056      0.58
  13370264.614269 13370264.614852 java         8248   R    202,32   1431244336   45056      0.58
  13370264.614271 13370264.614857 java         8248   R    202,32   1431244424   45056      0.59
  13370264.614273 13370264.614868 java         8248   R    202,32   1431244512   45056      0.59
  [...]
  # ./iosnoop -Qts
  STARTs          ENDs            COMM         PID    TYPE DEV      BLOCK        BYTES     LATms
  13370410.927331 13370410.931182 java         8248   R    202,32   1596381840   45056      3.85
  13370410.927332 13370410.931200 java         8248   R    202,32   1596381928   45056      3.87
  13370410.927332 13370410.931215 java         8248   R    202,32   1596382016   45056      3.88
  13370410.927332 13370410.931226 java         8248   R    202,32   1596382104   45056      3.89
  [...]

I didn't see anything out of the ordinary: a higher disk I/O load was causing higher queue times.

The tools that follow are all from the same collection: they demonstrate existing capabilities of ftrace, and how they are useful for solving real problems.

2. tpoint

To investigate these disk reads in more detail, I used tpoint to trace the block:block_rq_insert tracepoint:

  # ./tpoint -H block:block_rq_insert
  Tracing block:block_rq_insert. Ctrl-C to end.
  # tracer: nop
  #
  #       TASK-PID    CPU#    TIMESTAMP  FUNCTION
  #          | |       |          |         |
          java-16035 [000] 13371565.253582: block_rq_insert: 202,16 WS 0 () 550505336 + 88 [java]
          java-16035 [000] 13371565.253582: block_rq_insert: 202,16 WS 0 () 550505424 + 56 [java]
          java-8248  [007] 13371565.278372: block_rq_insert: 202,32 R 0 () 660621368 + 88 [java]
          java-8248  [007] 13371565.278373: block_rq_insert: 202,32 R 0 () 660621456 + 88 [java]
          java-8248  [007] 13371565.278374: block_rq_insert: 202,32 R 0 () 660621544 + 24 [java]
          java-8249  [007] 13371565.311507: block_rq_insert: 202,32 R 0 () 660666416 + 88 [java]
  [...]

I was checking for anything obviously odd, but the I/O details looked normal. The -H option prints column headers.

Next, I traced the code path that led to this I/O by printing stack traces (-s), to see if they contained an explanation. I also added an in-kernel filter to match reads only (when the "rwbs" flag field contains "R"):

  # ./tpoint -s block:block_rq_insert 'rwbs ~ "*R*"' | head -1000
  Tracing block:block_rq_insert. Ctrl-C to end.
          java-8248  [005] 13370789.973826: block_rq_insert: 202,16 R 0 () 1431480000 + 8 [java]
          java-8248  [005] 13370789.973831: <stack trace>
   => blk_flush_plug_list
   => blk_queue_bio
   => generic_make_request.part.50
   => generic_make_request
   => submit_bio
   => do_mpage_readpage
   => mpage_readpages
   => xfs_vm_readpages
   => read_pages
   => __do_page_cache_readahead
   => ra_submit
   => do_sync_mmap_readahead.isra.24
   => filemap_fault
   => __do_fault
   => handle_pte_fault
   => handle_mm_fault
   => do_page_fault
   => page_fault
          java-8248  [005] 13370789.973831: block_rq_insert: 202,16 R 0 () 1431480024 + 32 [java]
          java-8248  [005] 13370789.973836: <stack trace>
   => blk_flush_plug_list
   => blk_queue_bio
   => generic_make_request.part.50
  [...]

Great! The output is similar to the previous example, but with stack traces beneath each tracepoint event. I limited the output using head as it is verbose.

tpoint is another ftrace-based tool. It's usually better to use perf events (the perf command) for this particular use case, as it can handle multi-user access to performance data and a higher event rate, although it is more time-consuming to use. I just wanted to quickly eyeball a few dozen stack traces for a given tracepoint.

The stacks were mostly the same as the example above, which provided a couple of leads: page faults and readahead. This Ubuntu system was using 2MB direct-mapped pages, instead of 4KB like the old system. It also had readahead set to 2048KB, instead of 128KB. Either of these differences could be causing the inflation, although tuning readahead had already been tested, and found to make no difference.

3. funccount

I wanted to understand that stack trace better, so I started by counting calls using funccount, which uses ftrace function profiling. Starting with the per-second rate of submit_bio():

  # ./funccount -i 1 submit_bio
  Tracing "submit_bio"... Ctrl-C to end.

  FUNC                              COUNT
  submit_bio                        27881

  FUNC                              COUNT
  submit_bio                        28478
  [...]

This rate, about 28,000 calls per second, is on par with what the disks are doing as seen from iostat. funccount is counting events in the kernel for efficiency.

Now checking the rate of filemap_fault(), which is closer in the stack to the database:

  # ./funccount -i 1 filemap_fault
  Tracing "filemap_fault"... Ctrl-C to end.

  FUNC                              COUNT
  filemap_fault                      2203

  FUNC                              COUNT
  filemap_fault                      3227
  [...]

This is consistent with what we believed the application was requesting from the filesystem. There is about a 10x inflation between these calls and the issued disk I/O (as evidenced by the submit_bio() calls).

4. funcslower

Just to confirm that the database is suffering latency caused by the stack trace I was studying, I used funcslower (another ftrace-based tool, which uses in-kernel timing and filtering for efficiency) to measure filemap_fault() calls taking longer than 1000 microseconds (1ms):

  # ./funcslower -P filemap_fault 1000
  Tracing "filemap_fault" slower than 1000 us... Ctrl-C to end.
   0)   java-8210    | ! 5133.499 us |  } /* filemap_fault */
   0)   java-8258    | ! 1120.600 us |  } /* filemap_fault */
   0)   java-8235    | ! 6526.470 us |  } /* filemap_fault */
   2)   java-8245    | ! 1458.30 us  |  } /* filemap_fault */
  [...]

These latencies look similar to those seen from disk I/O (with queue time). I'm in the right area.

5. funccount (again)

I noticed that the stack has "readpage" calls and then "readpages". Tracing them both at the same time:

  # ./funccount -i 1 '*mpage_readpage*'
  Tracing "*mpage_readpage*"... Ctrl-C to end.

  FUNC                              COUNT
  mpage_readpages                     364
  do_mpage_readpage                122930

  FUNC                              COUNT
  mpage_readpages                     318
  do_mpage_readpage                110344
  [...]

Here's our inflation: mpage_readpages() is being called about 300 times per second, and then do_mpage_readpage() over 100k times per second. This still looks like readahead, although we did try to adjust readahead sizes as an experiment, and it didn't make a difference.

6. kprobe

Maybe our readahead tuning didn't take effect? I can check this using dynamic tracing of kernel functions. Starting with the above stack trace, I saw that __do_page_cache_readahead() has nr_to_read (number of pages to read) as an argument, which comes from the readahead setting. Using kprobe, an ftrace- and kprobes-based tool, to dynamically trace this function and argument:

  # ./kprobe -H 'p:do __do_page_cache_readahead nr_to_read=%cx'
  Tracing kprobe m. Ctrl-C to end.
  # tracer: nop
  #
  #   TASK-PID    CPU#    TIMESTAMP  FUNCTION
  #      | |       |          |         |
      java-8714  [000] 13445354.703793: do: (__do_page_cache_readahead+0x0/0x180) nr_to_read=200
      java-8716  [002] 13445354.819645: do: (__do_page_cache_readahead+0x0/0x180) nr_to_read=200
      java-8734  [001] 13445354.820965: do: (__do_page_cache_readahead+0x0/0x180) nr_to_read=200
      java-8709  [000] 13445354.825280: do: (__do_page_cache_readahead+0x0/0x180) nr_to_read=200
  [...]

I used -H to print the header, and p: to specify that we will create a probe on function entry, which we'll call "do" (that alias is optional). The rest of that line specifies the function and optional arguments. Without kernel debuginfo, I can't refer to the nr_to_read symbol, so I need to use registers instead. I guessed %cx: If true, then our tuning hasn't taken hold, as 200 in hex is 512 pages: the original 2048KB.

7. funcgraph

To be sure, I read the code to see how this value is passed around, and used funcgraph to illustrate it:

  # ./funcgraph -P filemap_fault | head -1000
   2)   java-8248    |               |  filemap_fault() {
   2)   java-8248    |   0.568 us    |    find_get_page();
   2)   java-8248    |               |    do_sync_mmap_readahead.isra.24() {
   2)   java-8248    |   0.160 us    |      max_sane_readahead();
   2)   java-8248    |               |      ra_submit() {
   2)   java-8248    |               |        __do_page_cache_readahead() {
   2)   java-8248    |               |          __page_cache_alloc() {
   2)   java-8248    |               |            alloc_pages_current() {
   2)   java-8248    |   0.228 us    |              interleave_nodes();
   2)   java-8248    |               |              alloc_page_interleave() {
   2)   java-8248    |               |                __alloc_pages_nodemask() {
   2)   java-8248    |   0.105 us    |                  next_zones_zonelist();
   2)   java-8248    |               |                  get_page_from_freelist() {
   2)   java-8248    |   0.093 us    |                    next_zones_zonelist();
   2)   java-8248    |   0.101 us    |                    zone_watermark_ok();
   2)   java-8248    |               |                    zone_statistics() {
   2)   java-8248    |   0.073 us    |                      __inc_zone_state();
   2)   java-8248    |   0.074 us    |                      __inc_zone_state();
   2)   java-8248    |   1.209 us    |                    }
   2)   java-8248    |   0.142 us    |                    prep_new_page();
   2)   java-8248    |   3.582 us    |                  }
   2)   java-8248    |   4.810 us    |                }
   2)   java-8248    |   0.094 us    |                inc_zone_page_state();
  [...]

funcgraph uses another ftrace feature: the function graph profiler. It has moderate overhead, since it traces all kernel functions, so I only use it for exploratory purposes like this one-off. The output shows the code-flow in the kernel, and even has time deltas in microseconds. It's the call to max_sane_readahead() that is interesting, as that fetches the readahead value it wants to use.

8. kprobe (again)

This time I'll trace the return of the max_sane_readahead() function:

  # ./kprobe 'r:m max_sane_readahead $retval'
  Tracing kprobe m. Ctrl-C to end.
      java-8700  [000] 13445377.393895: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \
  			max_sane_readahead) arg1=200
      java-8723  [003] 13445377.396362: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \
  			max_sane_readahead) arg1=200
      java-8701  [001] 13445377.398216: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \
  			max_sane_readahead) arg1=200
      java-8738  [000] 13445377.399793: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \
  			max_sane_readahead) arg1=200
      java-8728  [000] 13445377.408529: m: (do_sync_mmap_readahead.isra.24+0x62/0x9c <- \
  			max_sane_readahead) arg1=200
  [...]

This is also 0x200 pages: 2048KB, and this time I used the $retval alias instead of guessing registers. So the tuning really did not take effect. Studying the kernel source, I saw that the readahead property was set by a function called file_ra_state_init(). Under what circumstances is that called, and how do I trigger it? ftrace/kprobes to the rescue again:

  # ./kprobe -s p:file_ra_state_init
  Tracing kprobe m. Ctrl-C to end.
            kprobe-20331 [002] 13454836.914913: file_ra_state_init: (file_ra_state_init+0x0/0x30)
            kprobe-20331 [002] 13454836.914918: <stack trace>
   => vfs_open
   => nameidata_to_filp
   => do_last
   => path_openat
   => do_filp_open
   => do_sys_open
   => sys_open
   => system_call_fastpath
            kprobe-20332 [007] 13454836.915191: file_ra_state_init: (file_ra_state_init+0x0/0x30)
            kprobe-20332 [007] 13454836.915194: <stack trace>
   => vfs_open
   => nameidata_to_filp
  [...]

This time I used -s to print stack traces, which showed that this often happens from the open() syscall. As I'd left Cassandra running while tuning readahead, it may not have reopened its files and run file_ra_stat_init(). So I restarted Cassandra to see if the readahead tuning would then take effect, and re-measured:

  # ./kprobe 'r:m max_sane_readahead $retval'
  Tracing kprobe m. Ctrl-C to end.
      java-11918 [007] 13445663.126999: m: (ondemand_readahead+0x3b/0x230 <- \
   			max_sane_readahead) arg1=80
      java-11918 [007] 13445663.128329: m: (ondemand_readahead+0x3b/0x230 <- \
  			max_sane_readahead) arg1=80
      java-11918 [007] 13445663.129795: m: (ondemand_readahead+0x3b/0x230 <- \
  			max_sane_readahead) arg1=80
      java-11918 [007] 13445663.131164: m: (ondemand_readahead+0x3b/0x230 <- \
  			max_sane_readahead) arg1=80
  [...]

Success!

iostat showed a large drop in disk I/O, and the database latency measurements were much better. This was simply a readahead change, where the new Ubuntu instances defaulted to 2048KB. What had misled us earlier was that tuning it had not appeared to make a difference, as the setting wasn't taking effect.

Another tunable we checked was the I/O scheduler, and whether changing it from "deadline" to "noop" was immediate. Using:

  ./funccount -i 1 'deadline*'
  ./funccount -i 1 'noop*'

gave us the answer: they showed the rate of the related kernel functions, and that the tuning was indeed immediate.

ftrace and perf-tools

All the tools I used here are from my perf-tools collection, which are front-ends to ftrace and related tracers (kprobes, tracepoints, and the function profiler). I've described some of them as hacks, which they are, as they use creative workarounds for the lack of some in-kernel features. For example, iosnoop reads both issue and completion events in user space, and calculates the latencies there, instead of doing that more efficiently in kernel context.

These tools are for older Linux systems without kernel debuginfo, and show what Linux can do using only its built-in ftrace. There's even more to ftrace that I didn't show here: profiling function average latency, tracing wakeup events, function probe triggers, etc. There is also Steven Rostedt's own front-end to ftrace, a multi-tool called trace-cmd (covered on LWN in 2010), which can do more than my collection of smaller tools, and is also much easier to use than operating on the /sys files directly.

To do even more, perf_events (which is also part of the Linux kernel source) with kernel debuginfo lets me examine complex data structures, and even trace kernel code by line number and watch local variables. What I still can't do, however, is perform complex in-kernel aggregations until I add other advanced tracers like SystemTap, ktap, or coming up: eBPF.

Future Work

Changes are on the way for the Linux kernel, as regular LWN readers likely know. The capabilities I need, to avoid using hacks, may be provided by eBPF, which may be merged as soon as Linux 3.18. If that happens, I'll be happy to create new versions of these tools making use of eBPF. They will likely look and feel the same, but their implementation will be much more efficient and reliable. I'll also be able to create many more tools without the fear of maintaining too many hacks.

Although, even if Linux does bring these capabilities in an upcoming release, it’ll be some time before I can really use them in production in my work environment, depending on how quickly we can or want to be on the latest kernel. So, while my hacks are temporary workarounds, they may be useful for some time to come.

Conclusion

We have a large number of Linux cloud instances to analyze at Netflix, and some interesting and advanced performance issues to solve. While we've been looking at using advanced tracers like SystemTap, I've also been studying what's there already, including the ftrace profilers. The scripts I showed in this post use ftrace, so they work on our our existing instances as-is, without even installing kernel debuginfo.

In this particular example, a Cassandra database experienced a disk I/O inflation issue caused by an increased readahead setting. I traced disk I/O events and latencies, and how these were created in the kernel. This included examining stack traces, counting function-call rates, measuring slow function times, tracing call graphs, and dynamic tracing of function calls and returns, with their arguments and return values.

I did all of this using ftrace, which has been in the Linux kernel for years. I found the hidden light switches.

If you are curious about the inner workings of these ftrace tools, see "Secrets of the Ftrace function tracer" by Steven Rostedt, "Debugging the kernel using Ftrace" part 1 and part 2, and Documentation/trace/ftrace.txt in the kernel source. The most important lesson here isn't about my tools, but that this level of tracing is even possible on existing Linux kernels. And if eBPF is added, a lot more will be possible.

Comments (4 posted)

Control groups, part 7: To unity and beyond

August 13, 2014

This article was contributed by Neil Brown

Control groups

The original justification for this series was so that we all could understand Linux control groups enough that we might be able to enjoy the occasional debates that crop up around them. It is now time to see how that worked — to see how well you, or I, can assess some proposal or challenge in the context of all the issues that appear to surround control groups. In this, the final installment of the series, we have two opportunities to test our new skills.

An obvious first target is the "unified hierarchy" that is available as a developer preview in Linux 3.16, which was covered recently on these pages. If you haven't done so already, now might be a good time to go back and re-read the article to see whether (and how) various issues we have found are being addressed. If you want to be thorough, read the unified-hierarchy.txt documentation too.

It might help to start by writing down a list of the issues that seemed important to you. It can also help to list some design patterns, or anti-patterns, to be on the lookout for. Four that I find helpful were identified in a previous series on the "Ghosts of Unix Past": full exploitation, conflated designs, unfixable designs, and high maintenance designs, all of which are summarized at the end of that last article.

The unified hierarchy: A score card

Having identified your key issues and arrived at your conclusions, you will no doubt want to either have them affirmed, or have an opportunity to defend them. To meet this need, I present some of my own conclusions.

Unification of hierarchy

It is nearly indisputable that the number of hierarchies allowed by classic cgroups is excessive. It is less clear that reducing the number to one is ideal. In our investigations we found very different uses of hierarchy: some subsystems imposed control downward, others collected accounting upward. These are very different uses involving different implementation concerns. It is arguable that they justify distinct hierarchies.

The unified hierarchy is clearly working toward the removal of excessive duplication, which is good. It doesn't seem to acknowledge that different subsystems might genuinely have incompatible needs, but then it hasn't completely closed the door to separate hierarchies yet. So this aspect deserves a B — good work, but room for improvement.

Processes only permitted in the leaves

The unified hierarchy requires that processes only exist in the leaves of the tree. The enforcement approach for this is somewhat clumsy. Leaves are "any node that doesn't extend any subsystem to children" and there is a two-step dance when creating a new level in the hierarchy. Processes must be moved down first, then subsystems can be extended down afterward.

This complexity achieves an end result that is already possible anyway (system administrators and tools could easily choose to keep processes in leaves) and, thus, is largely uninteresting. It's not clear that the kernel needs to enforce a sane policy of processes-only-in-leaves any more than it should enforce the sane policy that the filesystem root be read-only to most users.

I was going to give this issue a C (too complex), but there is a wart on the design that should be highlighted. Processes are excluded from internal cgroups except for the root cgroup, apparently because the root needs "special treatment". This exception actually leads to a score of C+ for reasons which will become apparent later.

Taming the chaotic subsystems

We have seen that the correlation between cgroup subsystems and elements of functionality is rather chaotic. This is not a new observation: at the 2011 Kernel Summit, Paul Turner was reported as saying that:

Google would, based on its experience, rip apart a lot of the controllers and rework them in a better form.

While that sort of rewrite may be too much, it would be nice if we could de-emphasize the current division into subsystems in the hope that more meaningful groupings could emerge, possibly between the control of processes and the control of other resources. The unified hierarchy seems well-placed to advance this need, but unfortunately goes in the opposite direction. Lists of subsystems now appear throughout the cgroups filesystem in the cgroup.controllers and cgroup.subtree_control files. It is true that attribute files are already named after their subsystem, but having a freezer.state file makes sense whether "freezer" is a separate subsystem or just an element of functionality.

Explicitly listing enabled subsystems in cgroup.controllers effectively entrenches the current structure, so this issue gets a D from me.

Providing a resource-consumer ID

We saw in part five that pages in memory can identify who gets the refund when the memory is freed, but not who gets charged for IO when the content is written out. By insisting that all subsystems use a single hierarchy, a single cgroup can serve as a resource consumer ID for all resources types. This is clearly a solution to the problem, but it is hard to tell if it is a good solution (different resources may be very different) so I'm reserving judgment for now and only giving a B.

Processes or threads

Classic cgroups allows individual threads within a process to be in different cgroups. Imagining a credible use case for this is difficult, but not quite impossible.

The cpuset controller can restrict processes to a set of CPUs and, separately, to a set of memory nodes in a NUMA system. The former restriction can be imposed on any thread using the sched_setaffinity() system call or the taskset program, without involving cgroups. But the set of memory nodes can only be configured through cgroups. Imposing different memory nodes on different threads (which share one address space) doesn't make much sense, so that doesn't justify cgroups per thread, but there are other values that can only be set through cgroups.

The Linux scheduler allows the priority of a thread to be set with much finer granularity than the traditional 40 point "nice" scale. It allows each thread or group to have a "weight" which ranges up to about 100,000 (weight = 1024 * 0.8^nice, approximately). This weight can only be set using cgroups. If you want that fine control of individual threads, you need threads in cgroups.

These are both examples of what I would call the "procfs problem". The procfs filesystem became a way for various ad hoc functionality to be added to the kernel with minimal design review, because there were no clear design guidelines. Consequently, it handles much more than processes. Similarly, cgroups seems to allow "back door" access for functionality that is not at all specific to the control of groups of processes. If the only use for threads-in-cgroups is to benefit from these back doors, then disallowing them might encourage better API design.

The unified hierarchy does exactly this and only allows processes (also known as thread groups) to be in different cgroups. This seems like a good idea but does raise a question: what exactly should we try to control? Threads? Processes? Something else? Whatever the answer, dropping support for moving individual threads seems like a good idea, so this gets an A.

Code simplicity

The unified hierarchy is only one step in what could be a long process. There have been a lot of improvements in the code leading up to the current state, but the full value of the changes won't be fully realized until some old functionality can be removed. When that might be is unknown.

What we do know is that only processes (not threads) will ultimately need to be in cgroups and they will only need to be in a single cgroup each. This will certainly bring simplicity, so an A is clearly in order.

Summary

The less-than-stellar scores assigned above probably have several causes, not least my own personal bias. The most significant single cause is almost certainly the foundation on which the unified hierarchy is being built. Like many first implementations, cgroups really isn't very good: The role of hierarchy and the purpose of subsystems are at best confused. If a sow's ear is all you have, a silk purse is really too much to ask for.

One of the premises of the unified hierarchy is that we have to stay with control groups in some form. Tejun Heo would have preferred a different structure layered "over the process tree like sessions or program groups", but mourned "that ship sailed long ago". A little over a year earlier, something happened which has implications that might not be so melancholy.

Auto-group Scheduling

As was reported in late 2010, there are ways other than cgroups to control groups of processes. Using the group scheduling support that was developed for cgroups, Mike Galbraith created a different, automatic mechanism to group processes together for scheduling.

The standard Unix scheduler, and most successors, attempts to be fair to processes, but processes aren't necessarily the best focus for fairness. On the AUSAM Unix variant I used as a student (Australian Unix Share Accounting Method, which evolved into SHARE II), fairness was aimed at users first, so that one student running six processes (the local limit at the time) would not get more CPU time than another student running only one. On a modern developer's desktop, the "job" (in the job-control sense — a process group) is a very logical grouping. Different jobs (browser, game, make -j 40) could reasonably compete against each other on an equal footing, and processes or threads within a job should reasonably compete against each other, but not, as individuals, against other threads.

There are two issues with automatic scheduling using process groups that were raised in the mailing list thread that records the history of auto-group scheduling. The issues were raised by very different people and received very different responses.

The first, raised by Linus Torvalds, is a suggestion that process groups are too fine-grained for this purpose. Creating a new scheduling group does have some cost, so doing it too often could introduce unacceptable slowness. Unfortunately there is no record of anyone measuring the cost (despite some encouragement from Linus) and only a vague assessment of what constituted "too often" — somewhere between "one for every command invocation in the shell" and "tens of thousands of times a second".

This claim was never really challenged. The final implementation used "sessions" rather then "process groups", which certainly do get created less often. However, this doesn't really seem like the right grouping. If you run:

    make -j 40 >& log &

to compile your project, and then frozen-bubble to pass the time — both from the same terminal window — your game will compete with 40 processes instead of with one job.

It is fairly easy to test the overhead that a fork()+exec() suffers if a scheduler group is also created: /bin/env /bin/echo hello and /bin/setsid /bin/echo hello will do exactly the same things except the latter creates a new session and hence a new scheduler group (if both are run from a shell script, not from an interactive shell):

    time bash -c 'for i in {1..10000}; do /usr/bin/setsid /bin/echo hi ; done > /dev/null'
    time bash -c 'for i in {1..10000}; do /usr/bin/env /bin/echo hi ; done > /dev/null'

The difference between those two is certainly in the noise.

The second issue was raised by Lennart Poettering: "On the desktop this is completely irrelevant." At the time this claim was made, it was true to a substantial extent, because auto-grouping was being done based on "controlling tty" and most desktop applications would equally have no controlling tty. A video editor and a browser would be in the same scheduling group, so the multiple rendering threads used by one could swamp the single thread used by the other. By the end of the discussion, it was again true to a different extent. Auto-grouping was now done based on "sessions", and most desktop session managers did not put each application into a different session. One session manager that was under development did: systemd already used setsid() as required.

Despite the fact that Lennart's comments were not well received, he was at that time working on software that could easily bring the benefits of auto-group scheduling to a larger group of users. No one seemed to realize that.

But, back to the main story, the key lesson from auto-group scheduling is that the cgroups effort inspired some useful functionality in the scheduler, and this functionality can be used quite separately from cgroups. When a process is in a non-root cgroup (from the perspective of the cpu subsystem), it is scheduled as directed by cgroups. When it is in the root, it is scheduled according to auto-groups (unless auto-groups has been disabled). This is why it is a positive that the unified hierarchy allows processes to remain in the root of the hierarchy even when the root is no longer a leaf. It means that the way is left open for independent resource management to be developed in parallel to cgroups, and for both cgroups and non-cgroups management to happen on the same system. This leads to the second challenge.

We have something that the original cgroups developers didn't start with: years of experience and working code. This is a wealth that we should be able to turn to our advantage. So, to test your new-found understanding of resource management, the challenge is this: inspired by auto-groups for scheduling, how would you implement resource management and process control in Linux alongside, but independently of, cgroups? Once you've thought that through, you can come back and compare your results to mine. Don't worry, we'll still be here when you're done.

Hindsight groups: highlighting some issues through contrast.

Contrast is a powerful tool for helping us see things more clearly. So to present the issues that I found to be important, I've embedded them in a different context. Hindsight groups, a name which reflects their origin, are sometimes different to make a point, and sometimes different just to be different. Hindsight groups are focused: they are only about restricting groups of processes. Any need that doesn't match that description needs to seek a home elsewhere.

In hindsight groups (or "hgroups"), the base unit of control is the process group, as created by interactive shells, by systemd, and, potentially, by any other session manager. Control can still be imposed on individual processes using prlimit() or similar commands, but controlling groups has no finer granularity than the process group.

To provide a management structure for these process groups, a new level in the PID hierarchy is added. A "process domain" is introduced above sessions and process groups. Processes are initially in domain zero. A process that is in domain zero and is alone in its session and its process group can call set_domainid() to start a new domain that is subordinate to domain zero, thus creating a two-level hierarchy of domains. When a new PID namespace is created, the domain containing the starting process appears as domain zero in the new namespace and new domains in that namespace are subordinate to the local domain zero, thus establishing a multi-level hierarchy.

The hierarchy formed by domains strongly constrains processes. Once inside a domain, a process cannot get out of that domain. Each domain is associated with a process group — the process group of the process that created it. All other process groups in the same domain are considered to be subordinate to that first process group. This effectively places all process groups into a hierarchy. It is very much an organizational hierarchy, rather than a classification hierarchy. It provides structural groupings like "login session" or "container" or "job". It collects processes based on the task they perform more than the way they behave.

With this new, more strongly defined role for process groups comes a new data structure that is allocated per-process-group, much like a signal_struct is allocated per-process. It contains a set of restrictions that apply to processes in the group. Some of these, like an access control list of devices (similar to that provided by the devices cgroup subsystem), are referred to whenever the process needs to check if something is permitted and cgroups is not configured or provides only the "root" cgroup. Others, like a set of CPUs that may be used, need to be pushed out to all processes and threads in the process group whenever they change. This is uniformly done by sending a virtual signal to all processes (similar to the approach taken by the freezer cgroup subsystem). During handling of that virtual signal, a process will update its local understanding based on restrictions in the process group.

The per-process-group restrictions can be changed by any process that has an appropriate user ID or has superuser permissions. However, a process can only give extra permissions (i.e. reduce restrictions) that its own process group has, and that every process group above it in the hierarchy has. Changes are not propagated down by the kernel, but a user-space tool can propagate the lifting or imposing of restrictions reliably.

One important restriction identifies certain actions that processes cannot perform, and causes them to block if they try. One setting of this restriction effectively freezes all processes in the group, much like the cgroups freezer. Another setting only freezes a process when it tries to create a new process group. This makes it possible to impose some restriction on all process groups in a domain in a race-free way.

The various shared resources: memory, CPU, network, and block I/O, each have specific needs and are managed separately. They benefit from these groupings and this hierarchy, but they are not tied to it.

Networking and block I/O have some similarities, as they generally involve capping or sharing data throughput. They are also quite easily virtualized, so that a sub-domain can be given access to a virtual device that routes data to and from a real device. They can have multiple separate devices to manage and have other concerns beyond just the process that is involved. The network system needs to manage its own link-control traffic and possibly traffic forwarded from another interface. The block-I/O subsystem already makes an internal distinction between metadata (using the REQ_META flag) and other data, and so needs to classify requests in different ways.

Consequently, these two systems have their own queuing management structures and are not known to hgroups. The various queuing algorithms may classify requests based on the originating domain, or they may support some labeling of individual processes (similar to the cgroups network classes subsystem), but that is beyond the interest of hgroups.

Memory usage management is quite different from the other shared resources because it is measured in space more than in time. With the other three (network, block I/O, CPU) a process can start or stop using the resource at any moment or can be temporarily barred from the resource with no ill effects. With memory, the resource is useless unless it is constantly available for some non-trivial period of time.

This means, as we saw in an earlier installment, that memory must be charged to some entity that persists for quite a while. It also means that it is difficult to impose proportional sharing. The cgroups memory controller imposes two limits: a hard limit that must not be exceeded and a soft limit that is only imposed when memory is very tight. Varying the soft limit doesn't really affect the proportion of sharing, but instead affects the proportion of pain imposed when memory must be freed.

These twin needs of persistence and imposing restrictions is met perfectly by process domains, and they serve much the same role as cgroups do in the hierarchy used by the mem subsystem. The memory resources used in each process group are charged to the containing domain, and to that domain's containing domain if there is one. If any limit is reached the allocation fails and memory reclaim is instigated. There are hard and soft limits just as with cgroups.

It is possible for a privileged process to redirect the memory accounting for any process group in a subordinate domain so that usage within that process group is charged to some other domain instead. This can be used, for example, to cause all domains belonging to a given user to have a single overall memory limit imposed, even though the primary hgroups structure doesn't recognize users. In any case, a PID number (currently 24 bit) is sufficient to identify a memory resource owner. This would allow two or even three identifiers for different resources to be attached to a request or a page of memory to charge subsequent handling properly.

CPU throughput limits are imposed in nearly the same way as memory allocation limits. The only difference is that the limits can be imposed on the local process group as well as just the domain. Limits can be both raised and lowered by a suitably privileged process.

CPU scheduling is probably the most complex of the resource managers. Scheduling groups are formed roughly following the domain/process-group/process hierarchy, but with grouping optional at each level. If grouping is enabled for domain 0, then the processes in each domain are grouped and those groups are scheduled against each other. If it isn't enabled, then the individual process groups in each domain are all scheduled against each other, creating a result quite similar to the current auto-groups. As with memory resources, a privileged process can direct a process group to be scheduled in the context of some other process group.

On a single-user system, it is likely that domain scheduling would be disabled, and the top-level scheduling would be between process groups. In a multi-user system, the extra cost of domain-level scheduling would probably be justified. Inside containers, the same choices can be made, independently in each container.

This enabling of CPU scheduling independently at each level is a little bit like the approach the unified hierarchy takes of optionally enabling different subsystems at different levels. It is more general, though, as the set of enabled levels does not need to be contiguous.

Hgroups CPU scheduling has another important difference from both cgroups and auto-groups. One of the problems with auto-groups scheduling is that it changes the effect of using nice to make a program run at a lower priority. The fact that nice doesn't really work anymore has been reported but not yet fixed. It seems that some regressions are less important than others, though possibly it hasn't been reported on the right forum.

The problem is that each scheduling group has a priority that is independent of the processes in the group. When you set the niceness of some process, it only causes it to be nice to processes in the same group (same session for auto-groups). When a user has multiple sessions (which is the whole point of auto-groups) they cannot easily be nice to each other.

Hgroups is not in the business of setting priorities, only of imposing restrictions. The restriction it imposes on a process group is to set an upper bound for the priority weight of that group. The effective weight of a group is then the sum (or possibly the maximum) of the weights of active members, providing that does not exceed the upper bound. This allows a low-priority process to continue to be genuinely nice to all other users, not just those in the same scheduling group. When there are no low-priority processes, it works much the same as the present scheme.

Epilogue

I have certainly found this adventure to be very educational and I'm thankful that you could join me on it. It has achieved the goal of a deep understanding, but I cannot yet tell if it will achieve the goal of improving entertainment. When the next chapter in the cgroups story is revealed, I am prepare to be excited or dismayed; thrilled or disgusted; challenged or affirmed. But the one thing I don't expect to be is bored.

Comments (3 posted)

Patches and updates

Kernel trees

Alexandre Oliva GNU Linux-libre 3.16-gnu is now available ?

Greg KH Linux 3.15.9 ?

Greg KH Linux 3.14.16 ?

Greg KH Linux 3.10.52 ?

Greg KH Linux 3.4.102 ?

Ben Hutchings Linux 3.2.62 ?

Architecture-specific

Kees Cook arm: support CONFIG_RODATA ?

Haojian Zhuang enable HiP04 SoC ?

Laura Abbott DMA Atomic pool for arm64 ?

Alexander Shishkin perf: Add infrastructure and support for Intel PT ?

Build system

Josh Triplett [PATCH] x86: Add "make tinyconfig" to configure the tiniest possible kernel ?

Core kernel code

Rakib Mullick ANNOUNCE: BLD-3.16 release. ?

Waiman Long locking/rwsem: enable reader opt-spinning & writer respin ?

Preeti U Murthy Power Scheduler Design ?

Paul E. McKenney RCU-tasks implementation ?

Jani Nikula module: add support for unsafe, tainting parameters ?

Alexei Starovoitov BPF syscall, maps, verifier, samples, llvm ?

Device drivers

Matthias Brugger tty: serial: Add mediatek UART driver ?

Gyungoh Yoo Adding a support for Skyworks SKY81452 ?

Kever Yang Patches to add support for Rockchip dwc2 controller ?

mathieu.poirier@linaro.org Coresight framework and drivers ?

Iyappan Subramanian net: Add APM X-Gene SoC Ethernet driver support ?

Caesar Wang This series adds support for RK3288 SoC integrated PWM ?

Matthias Brugger tty: serial: Add mediatek UART driver ?

Chanwoo Choi extcon: rt8973a: Add Richtek RT8973A MUIC driver ?

Bjorn Andersson Qualcomm Resource Power Manager driver ?

Andreas Werner Introduce MEN 14F021P BMC driver series ?

Lee Jones mtd: nand: Support for new DT NAND driver ?

Guodong Xu Add MFD and regulator drivers for Hi6421 PMIC SoC ?

Gabriel FERNANDEZ phy: miphy28lp: Introduce support for MiPHY28lp ?

Device driver infrastructure

Jenny TC power_supply: Introduce power supply charging driver ?

Lina Iyer PM QoS: per-cpu PM QoS support ?

Filesystems and block I/O

David Herrmann VFS revoke() ?

Memory management

Johannes Weiner mm: memcontrol: populate unified hierarchy interface v2 ?

Boaz Harrosh [RFC 0/9] pmem: Support for "struct page" with Persistent Memory storage ?

Security-related

Casey Schaufler Smack: Bring-up access mode ?

Casey Schaufler LSM: Generalize existing module stacking ?

Virtualization and containers

jgross@suse.com Add XEN pvSCSI support ?

Page editor: Jonathan Corbet
Next page: Distributions>>