Seven problems with Linux containers
Kirill Kolyshkin of the OpenVZ project delivered a talk entitled "Seven problems with Linux containers" at SCALE 12x in Los Angeles. The talk was not a criticism of containers, naturally (given that OpenVZ is in the containerization business itself), but rather a guide to the challenges inherent in the idea and a comparison of the various solutions available.
Kolyshkin began by noting that he had previously submitted the same talk title to LinuxFest Northwest but had only managed to come up with six problems at that time. Consequently, he said, the "real" title was now "N problems with Linux containers." But he had come up with a seventh point to discuss this time, so he was pleased to announce that N currently equals 7. Nevertheless, he continued, each problem is really a large topic that could take up a session all on its own.
Problem 1: effective virtualization
The first problem, he said, is how to provide the most efficient virtualization for processes: the best performance with the least overhead, at least for users who cannot afford to throw additional hardware at every problem. Containers are the solution that OpenVZ and others believe is the best, but that solution needs to be understood in the context of the other possible approaches. Full virtualization on mainframes was once the preferred approach; it has since given way to modern software VMs. But every implementation grapples with the same efficiency problem, trying to minimize overhead while providing good performance.
If one looks at the computer as a sandwich of three layers (hardware, operating system, and user space), there are possible solutions at each layer. Intel has worked hard adding hardware features to support virtualization, VM hypervisor solutions provide multiple operating-system layers, and containers provide multiple user-space instances. Containers provide the best efficiency because they simply make processes "unsee" each other. Just as the venerable Unix chroot isolates a process in its own filesystem root, Linux namespaces do the same for everything else on the system: process IDs, network, users, inter-process communication (IPC), mounts, and UTS system identifiers. The idea is simple and powerful: a container amounts to activating all of the namespaces for a particular process.
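To make that concrete, here is a minimal sketch in C (an illustration, not OpenVZ's code) of what "activating the namespaces" looks like at the system-call level; unshare() detaches the calling process from each namespace named by its flags, and it requires root:

    /* Give the current process private namespaces, then run a shell
     * inside them.  Build with "cc -o mini-container mini-container.c"
     * and run as root on a reasonably recent (3.8+) kernel. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* One flag per namespace: mounts, hostname (UTS), IPC,
         * network, and process IDs. */
        if (unshare(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
                    CLONE_NEWNET | CLONE_NEWPID) == -1) {
            perror("unshare");
            return EXIT_FAILURE;
        }
        /* A new PID namespace takes effect only for children, so
         * fork; the shell runs as PID 1 of its own little world. */
        if (fork() == 0) {
            execlp("/bin/sh", "sh", (char *)NULL);
            perror("execlp");
            _exit(EXIT_FAILURE);
        }
        wait(NULL);
        return 0;
    }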
Problem 2: sharing resources among containers
With multiple processes isolated in containers, however, the next problem appears: fairly distributing access to system resources so that everyone gets their share. Every container needs access to CPU, RAM, disks, and "various kernel things," he said, plus one wants to protect the system against denial-of-service attacks and provide some means to prioritize certain containers (or even to implement service-level agreements for customers).
OpenVZ's solution is resource controls, a set of four mechanisms that can set per-container resource limits. The first is beancounters, named after the HP-UX feature it was derived from. Beancounters enumerate 20 or so separate parameters that can be limited on a per-process basis, but in many cases that turns out to be inadequate. After all, he said, many important programs (like Apache and PostgreSQL) are a collection of processes, not a single process, so per-process limits are insufficient. OpenVZ also provides resource controls for disk quotas and I/O, and uses a two-level CPU scheduler to manage CPU access.
The OpenVZ approach has its good points, he said, including the fact that it is dynamically adjustable. But it is not the only solution; the upstream kernel's solution is control groups (a.k.a. cgroups), which provide a similar set of limits and the controls to manage them. Cgroups are still a work in progress, he said; not everything can be controlled the way it should be yet, and there are still instances where processes can escape or abuse the system.
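For comparison, confining a process with cgroups is, on the surface, just filesystem manipulation. The sketch below uses the cgroup-v1 memory-controller layout that was typical at the time; the group name and the 256MB limit are arbitrary choices for illustration:

    /* Create a memory cgroup, cap it, and run a shell under the cap.
     * Assumes the v1 memory controller is mounted at
     * /sys/fs/cgroup/memory, as on typical 2014-era systems. */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void write_file(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f || fprintf(f, "%s", value) < 0 || fclose(f) != 0) {
            perror(path);
            exit(EXIT_FAILURE);
        }
    }

    int main(void)
    {
        char pid[32];

        /* Create the group and limit it to 256MB of RAM. */
        if (mkdir("/sys/fs/cgroup/memory/demo", 0755) != 0 &&
            errno != EEXIST) {
            perror("mkdir");
            return EXIT_FAILURE;
        }
        write_file("/sys/fs/cgroup/memory/demo/memory.limit_in_bytes",
                   "268435456");

        /* Move this process into the group; the limit then applies
         * to it and to everything it spawns. */
        snprintf(pid, sizeof(pid), "%d", getpid());
        write_file("/sys/fs/cgroup/memory/demo/tasks", pid);

        execlp("/bin/sh", "sh", (char *)NULL);
        perror("execlp");
        return EXIT_FAILURE;
    }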
Problem 3: making resource limits easy to understand
Both resource-sharing solutions reveal the third problem on Kolyshkin's list: how to make the resource limits easy for users to understand. OpenVZ found that its 20-or-so parameters were complicated enough on their own, and many of them are interdependent, which makes things even worse. Thus, the project tried creating a collection of valid and useful configurations to help. It wrote wiki pages and knowledge-base articles, and even produced an entire book on the subject. It also created a suite of tools for managing resources.
But users responded to this assortment of help with more puzzlement. "One or two users understand it," he said, but ultimately users just want to run their databases, not spend their time learning resource configuration. The project's next shot at a solution was VSwap, a greatly simplified alternative to beancounters with just two mandatory parameters (RAM and swap); the other parameters remain optional. Everyone understands RAM and swap, he said, "so you don't have to go write a book about it." VSwap has been much more successful than beancounters. At the moment, however, it is only available in OpenVZ, although it will probably head for the upstream kernel when the project finds time to work on it.
Problem 4: live migration
OpenVZ can migrate running containers from one machine to another without a shutdown, Kolyshkin said, which not even Solaris's Zones can do. But live migration faces another problem: migrating quickly enough to be useful. Big containers, such as those hosting Oracle databases, can have "gigs and gigs of RAM," he said, and can take several seconds to freeze and migrate. The end user, though, does not know that any migration took place—the only thing visible is that a query took several seconds to return.
OpenVZ's live migration involves several steps, he explained: freezing the container, serializing and dumping the container's state to a dump file, copying the dump file to the destination server, undumping it, then unfreezing the container. The delay problem is caused by the huge size of the dump file, which is slow to copy. The project's solution was network swap: freezing a minimal dump file containing only what is required to get the container up and running on the destination, then swapping over the rest of the RAM contents only when memory faults occur on the destination.
That solution was acceptably fast, he said, but it had a negative side effect: if the network goes down between the source and destination server, it can become impossible to network-swap in the necessary memory pages, and impossible to roll back state changes on the destination server. OpenVZ's new solution is iterative migration using the "soft dirty" RAM tracker. OpenVZ asks the Linux kernel to track which memory pages are modified, and periodically asks the kernel for the list, copying over the modified pages. The result is that the migration can begin with copying a much smaller set of pages to the destination, then allow the destination to catch up with an "rsync-like" copying scheme.
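The kernel interface behind this approach is the soft-dirty bit, which was merged for 3.11 and is described in Documentation/vm/soft-dirty.txt. Below is a rough sketch of one tracking round as a migration tool might perform it; the function names are illustrative and error handling is minimal:

    /* One round of soft-dirty tracking.  Writing "4" to
     * /proc/PID/clear_refs starts a clean interval; bit 55 of each
     * /proc/PID/pagemap entry then reports whether that page has
     * been written since. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PM_SOFT_DIRTY (1ULL << 55)

    /* Clear all soft-dirty bits for pid: the start of an iteration. */
    static int clear_soft_dirty(pid_t pid)
    {
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int)pid);
        int fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;
        ssize_t n = write(fd, "4", 1);
        close(fd);
        return n == 1 ? 0 : -1;
    }

    /* Has the page containing vaddr been written since the clear? */
    static int page_is_dirty(pid_t pid, uintptr_t vaddr)
    {
        char path[64];
        uint64_t entry;
        long psz = sysconf(_SC_PAGESIZE);

        snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        /* pagemap is an array of one 64-bit entry per virtual page. */
        ssize_t n = pread(fd, &entry, sizeof(entry),
                          (off_t)(vaddr / psz) * sizeof(entry));
        close(fd);
        if (n != (ssize_t)sizeof(entry))
            return -1;
        return !!(entry & PM_SOFT_DIRTY);
    }

A migration tool can then resend only the pages flagged dirty, repeating until the remaining set is small enough to freeze and transfer.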
Problem 5: upstreaming
Upstreaming code to the Linux kernel is the fifth problem, he said, and is probably OpenVZ's biggest challenge. The project initially did not think it would need to send its patches upstream, so it worked on one big patch set. But, just like Android, he said, it subsequently learned that getting its code upstream would have benefits like reduced maintenance time and not having to port the patch set to new releases. Once it decided to start working on getting its code upstream, however, it encountered resistance in a number of areas: "this is too big," "there are too many side effects," "we don't care about live migration," "your coding style is wrong," and so on.
Solution one was to rewrite the code, Kolyshkin said. "Solution zero" would have been to do the work upstream to begin with, he added, "but we had already written it." So the project rewrote its beancounters features using cgroups, and rewrote its PID namespace code several times before it was accepted. That was a huge effort, one that briefly pushed Parallels into the ranks of the top ten kernel contributors. Not every feature would be accepted upstream, he said, such as OpenVZ's checkpoint/restore code, so solution two was to "circumvent the system." That meant reimplementing a feature in user space with minimal kernel intervention, which resulted in Checkpoint/Restore In Userspace (CRIU). Now CRIU is about two years old, he said, and recently made its 1.0 release. It has attracted several users (such as Google and Samsung), interest from other projects (such as Docker), and was successfully merged upstream for the 3.3 kernel.
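CRIU is driven from user space, either with the criu command or through its small C library. The sketch below follows the C API documented on criu.org; since that interface is not described in the talk itself, treat the call names and the dump flow as a best-effort illustration rather than a verified OpenVZ workflow:

    /* Checkpoint a process tree with libcriu (link with -lcriu).
     * Assumes criu is installed and its service is reachable;
     * details follow the criu.org C-API documentation. */
    #include <fcntl.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <criu/criu.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s PID\n", argv[0]);
            return EXIT_FAILURE;
        }
        int dir = open("/tmp/ckpt", O_DIRECTORY);
        if (dir < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        criu_init_opts();
        criu_set_pid(atoi(argv[1]));   /* root of the tree to dump */
        criu_set_images_dir_fd(dir);   /* where the image files land */
        criu_set_log_file("dump.log");
        criu_set_leave_running(false); /* freeze, dump, then kill */

        if (criu_dump() < 0) {         /* criu_restore() is the inverse */
            fprintf(stderr, "dump failed; see dump.log\n");
            return EXIT_FAILURE;
        }
        return 0;
    }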
Problem 6: a common file system
The sixth problem is the limitations of sharing a filesystem between containers. Remember, he said, that a container is a directory that we chroot into; on a journaled filesystem, a large number of containers can create a bottleneck, particularly when updating metadata. Heavy I/O on lots of small files can also interrupt migrations, effectively causing a denial of service. "If you want to see an example of this," he said, "create a 1MB file and then truncate it by one byte a million times: the whole system will sit and wait for it." Upstream filesystems also offer no support for disk quotas or snapshots on filesystem subtrees.
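That torture test is easy to reproduce. Here is a minimal C rendering of it (the filename is an arbitrary choice); each truncate is a journaled metadata update, so running this on a shared ext3/ext4 filesystem shows the stall he described:

    /* Shrink a 1MB file one byte at a time: roughly a million
     * truncates, each of them a metadata update the journal must
     * order. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        const off_t start = 1024 * 1024;  /* 1MB */
        int fd = open("victim.dat", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }
        if (ftruncate(fd, start) != 0) {
            perror("ftruncate");
            return EXIT_FAILURE;
        }
        for (off_t len = start - 1; len >= 0; len--) {
            if (ftruncate(fd, len) != 0) {
                perror("ftruncate");
                break;
            }
        }
        close(fd);
        return 0;
    }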
The first solution many think of when facing these problems is the Linux logical volume manager (LVM). But, he said, LVM supports only block devices (and thus does not work over NFS, for example) and it lacks dynamic space allocation. Another possible solution is loopback devices, but those also lack dynamic space allocation and (until recent kernels) imposed a performance hit through double caching.
OpenVZ's solution is ploop, "a loop device with added bells and whistles." It supports online resizing, instant live snapshots, several different image file formats, and several I/O backends to avoid double-caching.
Problem 7: inefficient storage
The last problem Kolyshkin addressed was also storage related, but at a more practical level. Most OpenVZ users are hosting providers, he said, who tend to have many servers with many hard disks. In a recent look at storage, the average disk-space utilization among these users was a paltry 37%; the highest utilization was just 51%, and the worst was as low as 14%. The unused space might be considered a problem in itself, but the more important problem is all of the wasted disk bandwidth.
Here again, there are several possible solutions. The first might be to simply use storage area networks (SANs), placing all of the servers' storage together. SANs offer high availability, high performance, and high utilization. Furthermore, if the servers share a single SAN, live migrations would not need to involve any data copying. But the downside of SANs is their steep price tag.
OpenVZ's alternative solution is Parallels Cloud Storage (a.k.a. pstorage), which pools the unused disk space of all of the servers into a "virtual SAN." Users do not need to buy any additional hardware, and pstorage is elastic (including hot-plugging). Its "distributed reads" offer RAID-like I/O speed, plus fault tolerance and redundancy. It also offers data locality, and the developers argue that it has an advantage over distributed systems like Ceph and GlusterFS because it has instant—not eventual—consistency. That is, if different processes read the same data from different places, that data is always the same.
Kolyshkin took a few minutes for questions at the end of the session; most interestingly, one audience member asked if there was a solution for sharing entropy between containers. He answered that the usual solution is to use a daemon, but that from the containerization side the real issue is that there is currently no way to enforce or manage resource limits on entropy. Evidently, though, that is not a major problem for container users, since the limitation has not come up. Should that change in the future, of course, Kolyshkin can no doubt bump N up to 8.
Posted Feb 27, 2014 11:01 UTC (Thu)
by pbonzini (subscriber, #60935)
[Link] (1 responses)
Posted Mar 1, 2014 7:03 UTC (Sat)
by kolyshkin (guest, #34342)
[Link]
I guess it was not available at the time we started working on ploop (circa 2008 -- it was a research project for a long time).
Besides, one big problem with logical volumes is they require a block device, while ploop can reside on e.g. NFS or a pstorage volume. Also, LVM management is more complicated than dealing with one or a few image files.
Posted Feb 27, 2014 16:27 UTC (Thu)
by dowdle (subscriber, #659)
[Link] (1 responses)
For anyone interested, here's a video I recorded of Kir's first version of this talk that he did at LinuxFest Northwest in April 2013:
Posted Mar 3, 2014 5:21 UTC (Mon)
by dlang (guest, #313)
[Link]
There were also technical problems with the recordings; they are trying to see whether there's a way to fix them.
David Lang
Posted Feb 28, 2014 18:58 UTC (Fri)
by skorgu (subscriber, #39558)
[Link] (1 responses)
http://gluster.org/community/documentation/index.php/Glus...
http://ceph.com/ceph-storage/file-system/
Posted Mar 3, 2014 4:18 UTC (Mon)
by dev (guest, #34359)
[Link]
CEPH is really a strong consistency storage and is fine for VMs/containers.
GlusterFS is not! Just try it under failures and you will see how files appear/disappear, change their size and content between actual and outdated state etc. And it's by design in Gluster.
Posted Mar 1, 2014 15:29 UTC (Sat)
by pj (subscriber, #4506)
[Link]
Good article!
Posted Mar 3, 2014 22:15 UTC (Mon)
by tpo (subscriber, #25713)
[Link]
Posted May 27, 2014 20:22 UTC (Tue)
by bwhaley (guest, #80289)
[Link] (1 responses)
Anyone have further detail on that comment? Docker uses cgroups for resource control. I'm curious what the limitations are.
Posted May 27, 2014 23:32 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
However, cgroups do not physically isolate processes, so a malicious process can ptrace() a process from another cgroup and inject malicious code there. And there are lots of other similar ways to 'escape' cgroups.
Complete confinement is going to require close cooperation between namespaces and cgroups. But the relevant developers don't talk to each other. It's a mess.