
Resource control at Facebook

By Jake Edge
September 19, 2018


Facebook runs a lot of programs and it tries to pack as many as it can onto each machine. That means running close to—and sometimes beyond—the resource limits on any given machine. How the system reacts when, for example, memory is exhausted makes a big difference in whether Facebook gets its work done. Tejun Heo came to 2018 Open Source Summit North America to describe the resource control work that has been done by the team he works on at Facebook.

He began by presenting a graph (which can be seen in his slides [PDF]) that showed two different systems reacting to the same type of workload. In each case, the systems were handling typical Facebook requests at the same time a stress-producing process on the system was allocating memory at the rate of 10MB per second. One of those systems, running the standard Facebook kernel, would eventually encounter the out-of-memory (OOM) killer, which resulted in 40 minutes or so of downtime. The other system, running the code that Heo was there to talk about, had some relatively minor periods of handling fewer requests (from nearly 700 requests per second to 400-500) over that same period.

His team set out with a goal: "work-conserving full-OS resource isolation across applications". That's all a bit of a word salad, so he unpacked some of the perhaps unfamiliar terms. "Work conserving" means that machines stay busy; they do not go idle if there is work to do. "Full OS" refers to making the isolation transparent to the rest of the system. For example, virtual machines (VMs) can allocate various resources for isolating a process, but they lose integration with the underlying operating system. Memory that is devoted to one VM cannot be used by other VMs even when it sits idle. In a container environment, though, the underlying OS is integrated with the isolated workloads; keeping that integration is what "full OS" is intended to convey.

When the team set out on the project, it didn't seem that difficult, Heo said. There is a price that any machine in the Facebook cluster must pay—called the "fbtax"—which covers all of the infrastructure required to participate. That includes monitoring and other control programs that run on each node. If those auxiliary, but necessary, processes have memory leaks or other excessive resource consumption problems, they can take down a system for nearly an hour, as shown in the graph. The new "fbtax2" project set out to try to ensure systems could still respond to the real work that needs to be done even in the face of other misbehavior.

Challenges

[Tejun Heo]

As might be guessed, there were some challenges. For memory, the .high and .max values used by the memory controller are not work conserving. Because the machines at Facebook are fully utilized, creating artificial limits on memory just led to reduced performance and more fragile systems. In addition, the kernel's OOM killer does not protect the health of the workload and cannot detect if a workload is thrashing; it only detects whether or not the kernel can make progress. So if the workload is thrashing, but the kernel is not directly affected, the OOM killer will not run.

There is also "no good I/O controller to use", he said. The .max value for the I/O controller is, once again, not work conserving. I/O bandwidth is more oversubscribed than memory is for Facebook. The settings for I/O are based on bytes per second and I/O operations per second (IOPS) and it is difficult to find a middle ground when using those. What the Completely Fair Queuing (CFQ) I/O scheduler implements is not good for SSDs, or even hard disks, Heo said. The I/O controller for control groups version 2 (cgroups v2) does better than its predecessor, but it still does not account for all I/O that a container generates. In particular, filesystem metadata and swap I/O are not properly accounted for.

Priority inversions can occur when low-priority cgroups get throttled but generate I/O from filesystem operations or from swapping. For example, the ext4 filesystem only journals metadata, not the actual data, which can create a hard dependency across all writes in the system. The mmap_sem semaphore is another way that priority inversions can be caused. Even something as simple as ps needs to acquire mmap_sem in order to read /proc/PID/cmdline. A low-priority cgroup can hold that semaphore for a long time while doing I/O; if a higher priority cgroup wants to do something that requires the semaphore, it will be blocked.

Solutions

He then moved on to the solutions that the team has found. Instead of .max/.high values for controlling memory, .min and .low values were added. These provide more work conservation because there are no artificial limits; if a cgroup needs more memory and memory is available, it gets more memory. These settings are more forgiving as well; they can be a bit wrong and the system will function reasonably well. There is also work being done to provide proportional memory pressure to cgroups, he said.
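A minimal sketch (not Facebook's tooling) of what that looks like from user space: instead of writing a hard cap to memory.max, a protection is written to memory.low. It assumes a cgroup-v2 hierarchy mounted at /sys/fs/cgroup and an existing group named "workload"; both the path and the 17G figure (taken from the slice settings described later) are illustrative.

    # Sketch: protect a cgroup's memory with memory.low rather than capping
    # it with memory.max. Assumes cgroup v2 mounted at /sys/fs/cgroup and an
    # existing "workload" group; must run as root.
    CGROUP = "/sys/fs/cgroup/workload"

    def protect_memory(path, low_bytes):
        # Below memory.low, the group's pages are skipped by reclaim; above
        # it, the group competes normally, so no memory sits idle.
        with open(path + "/memory.low", "w") as f:
            f.write(str(low_bytes))

    protect_memory(CGROUP, 17 * 1024**3)  # 17G, as used for the workload slice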

It is difficult to tell whether a process is slow because of some inherent limitation in the program or whether it is waiting for some resource; the team realized it needed some visibility into that. Johannes Weiner has been working on the "pressure stall information" (PSI) metric for the last two years. It can help determine that "if I had more of this resource, I might have been able to run this percentage faster". It looks at memory, I/O, and CPU resource usage for the system and for individual cgroups to derive information that helps in "determining what's going on in the system".
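As a concrete illustration, here is a small Python sketch that reads the memory PSI numbers, assuming the interface as it eventually landed upstream (the /proc/pressure files and per-cgroup memory.pressure files); the parsing, not the exact paths, is the point.

    # Sketch: parse PSI output of the form
    #   some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
    #   full avg10=0.00 avg60=0.00 avg300=0.00 total=7890
    # "some" means at least one task was stalled on the resource; "full"
    # means all non-idle tasks were stalled at once.
    def read_pressure(path="/proc/pressure/memory"):
        metrics = {}
        with open(path) as f:
            for line in f:
                kind, rest = line.split(None, 1)
                fields = dict(kv.split("=") for kv in rest.split())
                metrics[kind] = {k: float(v) for k, v in fields.items()}
        return metrics

    mem = read_pressure()
    print("stalled on memory over last 10s: %.2f%%" % mem["some"]["avg10"])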

PSI is used for allocating resources to cgroups, but is also used by oomd, which is the user-space OOM killer that has been developed by the team. Oomd looks at the PSI values to check the health of the system; if those values are too bad, it will remediate the problem before the kernel OOM killer gets involved.

The configuration of oomd can be workload-dependent; if the web server is being slowed down more than 10%, that is a big problem, Heo said. On the other hand, if Chef or YUM are running 40% slower, "we don't really care". Oomd can act in the first case and not in the second because it provides a way to specify context-specific actions. There are still some priority inversions that can occur and oomd can also help ameliorate those.

The I/O latency controller is something that Josef Bacik has been working on for the past year, Heo said. One of the key problems for an I/O controller is that there is no metric for how much I/O bandwidth is available. As an alternative to the existing I/O controller, you can try to guarantee the latency of I/O completions, which is what the latency controller does. One workload might need 30ms completions, while another needs 20ms; it requires some experimentation to determine those numbers.

The latency controller is "a lot more work conserving" and can be used for both SSDs and hard disks. It supports "do first, pay later" for metadata and swap I/O operations, which allows I/O operations caused by lower priority cgroups to run at a higher priority to help avoid priority inversions. Once those I/Os complete, they will be charged to the lower priority cgroup, which will reduce or delay its ability to do I/O in the future. Also, the latency controller works with the multiqueue block layer, "which is the only thing we use anymore", he said.
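A sketch of what configuring the latency controller looks like, assuming the cgroup-v2 io.latency interface as it landed upstream, which takes per-device lines of the form "MAJ:MIN target=<microseconds>"; the device number and the 20ms figure here are illustrative, not Facebook's settings.

    # Sketch: set an I/O completion-latency target for a cgroup. The kernel
    # throttles sibling groups when this group's latency target is missed.
    CGROUP = "/sys/fs/cgroup/workload"

    def set_io_latency(major, minor, target_usec):
        with open(CGROUP + "/io.latency", "w") as f:
            f.write("%d:%d target=%d" % (major, minor, target_usec))

    set_io_latency(8, 0, 20000)  # 20ms completion target on device 8:0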

Facebook has switched to Btrfs as a way to fix the priority inversions. That is easy to do, he said, since the team has several core Btrfs developers (Bacik, Chris Mason, and Omar Sandoval) on it. Adding aborts for page-cache readahead solved most of the problems they were experiencing with mmap_sem contention, for example. The "do first, pay later" for swap I/O helped with priority inversion problems as well.

Results

Heo went on to describe Facebook's infrastructure as a prelude to the results of his team's work. Facebook has hundreds of thousands of machines with Btrfs as their root filesystem at this point and a seven-digit number of Btrfs containers, Heo said. The change to Btrfs happened over just a few weeks' time once his team told the infrastructure team what was wanted; "it was breathtaking to see". He noted that SUSE developers also did a lot of the upstream Btrfs work.

Without swap enabled, "all anonymous memory becomes memlocked", he said. Enabling swap allows memory pressure to build up gradually; Facebook enables swap for all cgroups except the ones used for the main workload.
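A one-line sketch of that policy, assuming the hierarchy layout described below (the workload.slice path is illustrative):

    # Sketch: leave swap on globally, but deny it to the main workload's
    # cgroup so its anonymous memory is never swapped out.
    with open("/sys/fs/cgroup/workload.slice/memory.swap.max", "w") as f:
        f.write("0")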

He then went through the cgroups that are used on the systems and their settings; all of these cgroups are managed as systemd "slices". The "hostcritical" slice has processes like oomd, sshd, systemd-journald, and rsyslog in it. It has mem.min=352M and io.latency=50ms (for a hard-disk-based machine). The "workload" cgroup has mem.low=17G and the same latency setting, while the "system" slice (which contains everything else) has no memory setting and io.latency=75ms.
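As a rough sketch, those settings correspond to cgroupfs writes like the following; in production they would be expressed as systemd slice properties instead, the device number 8:0 is a stand-in for whatever disk backs the machine, and the io.latency targets assume the interface's microsecond units.

    # Sketch: apply the slice settings quoted above directly via cgroupfs.
    import os

    SETTINGS = {
        "hostcritical.slice": {"memory.min": "352M", "io.latency": "8:0 target=50000"},
        "workload.slice":     {"memory.low": "17G",  "io.latency": "8:0 target=50000"},
        "system.slice":       {"io.latency": "8:0 target=75000"},  # no memory protection
    }

    for slice_name, knobs in SETTINGS.items():
        base = os.path.join("/sys/fs/cgroup", slice_name)
        for knob, value in knobs.items():
            with open(os.path.join(base, knob), "w") as f:
                f.write(value)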

Oomd is then configured to kill a memory hog in the system cgroup under certain conditions. If the workload slice is under moderate memory pressure and the system slice is under high memory pressure, or the system slice suffers from prolonged high memory pressure, a memory hog will be killed. Similarly, an I/O hog in the system cgroup will be killed if the workload is under moderate I/O pressure and the system cgroup is under high I/O pressure. In addition, a swap hog will be killed if swap is running out.
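For illustration only, here is that policy as a Python loop; oomd's actual implementation and configuration language differ, and the 10% and 40% thresholds below are invented, but the structure—compare PSI across slices, then kill the biggest consumer—is the same.

    # Sketch of the oomd policy described above (not oomd's real code or
    # config format): when the workload slice sees moderate memory pressure
    # and the system slice sees high pressure, kill the biggest memory
    # consumer under system.slice.
    import os, time

    ROOT = "/sys/fs/cgroup"

    def mem_pressure_avg10(slice_name):
        # First line of memory.pressure: "some avg10=X avg60=Y avg300=Z total=N"
        with open(os.path.join(ROOT, slice_name, "memory.pressure")) as f:
            fields = dict(kv.split("=") for kv in f.readline().split()[1:])
        return float(fields["avg10"])

    def kill_biggest_hog(slice_name):
        base = os.path.join(ROOT, slice_name)
        hogs = [d for d in os.listdir(base) if os.path.isdir(os.path.join(base, d))]
        if not hogs:
            return
        def usage(d):
            with open(os.path.join(base, d, "memory.current")) as f:
                return int(f.read())
        worst = max(hogs, key=usage)
        with open(os.path.join(base, worst, "cgroup.procs")) as f:
            for pid in f:
                os.kill(int(pid), 9)  # SIGKILL, as the kernel OOM killer would

    while True:
        if mem_pressure_avg10("workload.slice") > 10 and mem_pressure_avg10("system.slice") > 40:
            kill_biggest_hog("system.slice")
        time.sleep(1)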

He then showed a number of graphs that demonstrated better handling of various types of stress with fbtax2 versus the existing code. Adding memory leaks of 10MB/s, 50MB/s, and 100MB/s would effectively crash the existing systems, but the fbtax2 systems were able to continue on, sometimes with sizable dips in their performance, but still servicing requests.

When I/O stress was applied, the existing code would continue running, but take huge dips in performance, while the fbtax2 code took fewer and smaller dips. Those dips were still fairly large, though, which may be caused by trying to control too much, Heo said; that can make the system slice too unstable. It is important "to give the kernel some slack" so that it can balance things. All of the graphs can be seen in his slides.

So Facebook now has working full-OS resource isolation, Heo said; fbtax2 is rolling out throughout its infrastructure. Much of the code is already upstream; what remains should be there in the next two development cycles. There are still some things to do, including better batch workload handling and thread-pool management. While latency control is a really effective tool, it is not that great for multiple workloads with different needs. The team is looking at proportional I/O control to see if that can help in those scenarios. He closed by pointing attendees at the Facebook Linux projects page for more information on these and other projects.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to attend Open Source Summit in Vancouver.]


Resource control at Facebook

Posted Sep 19, 2018 17:46 UTC (Wed) by mtaht (subscriber, #11087) [Link] (1 responses)

It is so remarkable to see the language (like "work conserving") and methods of queue theory now entering everything. It seems like every week some new application surfaces; it seemed like half of the talks at netdevconf were a result of that...

There are a lot of good papers "out there":

https://www.google.com/search?rlz=1C5CHFA_enUS749US749&...

My copies of these two fundamental books by Kleinrock have become rather dog-eared over the last 8 years:

https://www.amazon.com/Queueing-Systems-Theory-Leonard-Kl...

Queue theory really suffers from having a terrible notation that I wish were easier to express in modern computer languages and designs. The use of a slash to describe an M/M/1 queue rather than some other UTF-8 character has always bothered me.

Volume II is surprisingly accessible to folk with a CS background without having to read volume one. If I could get everyone at a sophomore level in college to have read it, it would be a better world - not just in computer science, but science in general, business, economics, and so on.

https://www.amazon.com/Queueing-Systems-Vol-Computer-Appl...

While I'm off recommending books today for some reason, I have a first edition of this, heavily thumbed through, also:

https://www.amazon.com/Theory-Games-Economic-Behavior-Com...

Resource control at Facebook

Posted Sep 19, 2018 18:54 UTC (Wed) by tlamp (subscriber, #108540) [Link]

nothing to add, just wanted to thank you for your book recommendations!

Resource control at Facebook

Posted Sep 21, 2018 10:49 UTC (Fri) by mageta (subscriber, #89696) [Link] (6 responses)

Really interesting to see that Facebook is now rolling out Btrfs on such a large scale, and has it working well apparently - while on the other hand Red Hat seems to have given up on it entirely, dropping support from RHEL, AFAIK.

Resource control at Facebook

Posted Sep 21, 2018 11:47 UTC (Fri) by josefbacik (subscriber, #90083) [Link] (5 responses)

Well we run an upstream kernel and have 3 of the main developers, whereas RHEL runs some ancient kernel with nobody that contributes to btrfs development. But yeah, it’s on everything plus our containers use it for their base images, so each actual machine that has btrfs has at least 3 or 4 btrfs images running on it as well. It works pretty well, and really nobody is more surprised than me.

Resource control at Facebook

Posted Sep 24, 2018 16:51 UTC (Mon) by SEJeff (guest, #51588) [Link]

Disclaimer (for those who didn't know): You're one of the primary btrfs maintainers and have been contributing for a long time (I used to follow your work years ago!).

Resource control at Facebook

Posted Sep 24, 2018 17:43 UTC (Mon) by hkario (subscriber, #94864) [Link] (3 responses)

> It works pretty well, and really nobody is more surprised than me.

would that mean that the out-of-space handling is now good enough, or are Facebook workloads making sure this doesn't happen?

(or was it something else that was The Big Problem for btrfs?)

Resource control at Facebook

Posted Sep 24, 2018 18:43 UTC (Mon) by josefbacik (subscriber, #90083) [Link]

ENOSPC will always be a pain for us, but it's mostly just a minor annoyance at this point. I'm trying to fix the last big problem we have, but that still only hits tens of boxes a day across the whole fleet, which is pretty minimal. Once that is done I'm hoping that's it for ENOSPC problems for at least a day or two.

Resource control at Facebook

Posted Sep 24, 2018 18:48 UTC (Mon) by josefbacik (subscriber, #90083) [Link] (1 responses)

And to clarify we have _a lot_ of workloads on btrfs so anything normal users see we will see x1000. The root file system workload isn't super interesting, but with the images we schlep around it definitely uncovers weird corner cases. It's used as the backing store for a bunch of our gluster installs, so that has a whole other host of workloads that are pretty interesting. And finally our continuous build system really abuses the file system, basically mercurial checkout for the base volume, snapshot and checkout the new commit, build and run tests, destroy subvolume, rinse repeat a million times. If we're happy most people will be happy. That being said we are significantly more fault tolerant than random joe user, so YMMV.

Resource control at Facebook

Posted Sep 25, 2018 10:26 UTC (Tue) by hkario (subscriber, #94864) [Link]

> That being said we are significantly more fault tolerant than random joe user

that's one thing, the other thing is that things that are common in workstation workloads (like power failures) either don't happen or have completely different recovery protocols...

don't get me wrong, I really think that btrfs is the Linux file system we need in 21st century, I just wish it was already here for the full feature set...


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds