LWN: Comments on "LCE: The failure of operating systems and how we can fix it" https://lwn.net/Articles/524952/ This is a special feed containing comments posted to the individual LWN article titled "LCE: The failure of operating systems and how we can fix it". en-us Sat, 18 Oct 2025 10:16:05 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/603852/ https://lwn.net/Articles/603852/ stevenp129 <div class="FormattedComment"> what about the fact that if a user could bind to a low-level port, they could take advantage of race conditions and put up a web server (or proxy) in place of the intended appliance? <br> <p> if user BOB wrote a program to constantly monitor Apache and, the second its PID died, fire up his own web server on port 80, he could steal sensitive information and passwords with great ease. <br> <p> on a shared hosting service (for example), if somebody neglected to update their CMS to the latest version, and the host runs their webserver without a chroot... a simple bug or exploit in a website could, in turn, allow a rogue PHP or CGI script to take over the entire server! not good!<br> <p> or imagine your DNS server going down due to a hostile takeover... they could redirect traffic to their own off-site server and perform phishing attacks against you and all your clients this way!<br> <p> Of course there are legitimate reasons to forbid those without privs from binding to ports less than 1024... I'm not sure what is so "stupid" about this idea?<br> </div> Sun, 29 Jun 2014 08:50:55 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/562099/ https://lwn.net/Articles/562099/ rajneesh <div class="FormattedComment"> How do you give a minimum guarantee using cgroups? Isn't cpu.cfs_quota_us just an upper limit on the cpu usage of the process using bandwidth control? Can you please explain how you achieved a minimum guarantee of cpu cycles?<br> </div> Sun, 04 Aug 2013 03:44:17 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/532231/ https://lwn.net/Articles/532231/ dps <div class="FormattedComment"> Putting my system admin hat on, I want both border *and* host security. There is a lot that makes sense to block at the border because outsiders have no business using it. Servers on the safe side of firewalls often have to have more services configured and are therefore less secure.<br> <p> If a border firewall blocks some attack traffic, then a security bug on an internal system is not immediately fatal and there is time to fix it before the border firewall's security is breached. If that has not happened, that implies either that nobody worthwhile has tried or that you can't detect security breaches.<br> <p> In an ideal world there would be no need for security because nobody would even think of doing a bad deed. The world has never been that way.<br> </div> Thu, 10 Jan 2013 12:03:32 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/526770/ https://lwn.net/Articles/526770/ Cyberax <div class="FormattedComment"> Yes, I remember that such a patch was proposed way back then and rejected.
And now it will be further complicated by cgroups/namespaces/whatever.<br> </div> Sat, 24 Nov 2012 19:58:15 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/526767/ https://lwn.net/Articles/526767/ dlang <div class="FormattedComment"> <font class="QuotedText">&gt;&gt; I don't have the time to look it up right now, but I remember seeing a /proc entry that removed the &lt;1024 limit.</font><br> <p> <font class="QuotedText">&gt; There is no such record (I naively thought so too). You can check the Linux source.</font><br> <p> Ok, I thought I remembered seeing it at some point in the past, I may have mixed it up with the ability to bind to IP addresses that aren't on the box &lt;shrug&gt;<br> <p> I wonder how quickly someone could whip up a patch to add this ;-)<br> <p> seriously, has this been discussed and rejected, or has nobody bothered to try and submit something like this?<br> </div> Sat, 24 Nov 2012 19:26:32 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/526766/ https://lwn.net/Articles/526766/ Cyberax <div class="FormattedComment"> <font class="QuotedText">&gt;I don't have the time to look it up right now, but I remember seeing a /proc entry that removed the &lt;1024 limit.</font><br> There is no such record (I naively thought so too). You can check the Linux source.<br> <p> <font class="QuotedText">&gt;I agree that in the modern Internet, that really doesn't make sense, but going back, you had trusted admins (not just of your local box, but of the other boxes you were talking to), and in that environment it worked.</font><br> A good mechanism would haven been to allow users access to a range of ports. Something simple like /etc/porttab with list of port ranges and associated groups would suffice.<br> <p> <font class="QuotedText">&gt;remember, these are the same people who think that firewalls are evil because they break the unlimited end-to-end connectivity of the Internet. :-)</font><br> I happen to think the same. Security should not be done on network's border, instead all the systems should be secured by local firewalls.<br> </div> Sat, 24 Nov 2012 19:21:39 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/526764/ https://lwn.net/Articles/526764/ dlang <div class="FormattedComment"> I don't have the time to look it up right now, but I remember seeing a /proc entry that removed the &lt;1024 limit.<br> <p> I agree that in the modern Internet, that really doesn't make sense, but going back, you had trusted admins (not just of your local box, but of the other boxes you were talking to), and in that environment it worked.<br> <p> so think naive not moronic<br> <p> remember, these are the same people who think that firewalls are evil because they break the unlimited end-to-end connectivity of the Internet. :-)<br> </div> Sat, 24 Nov 2012 19:13:10 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/526763/ https://lwn.net/Articles/526763/ bronson <div class="FormattedComment"> It made sense in the 80s and 90s where the typical Unix host was serving tens to thousands of people and root tended to be trustworthy. 
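[As an illustration of the port restriction being discussed above — a minimal sketch using only the Python standard library; the port numbers are purely illustrative. As comes up further down the thread, the usual CAP_NET_BIND_SERVICE/setcap workaround is per-binary, which is why it is little help for Python or Java programs.]
<pre>
# Sketch: what the <1024 restriction looks like from an unprivileged process.
import errno
import socket

def try_bind(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        # Fails with EACCES for ports below 1024 unless the process is
        # root or has the CAP_NET_BIND_SERVICE capability.
        s.bind(("0.0.0.0", port))
        print("bound to port %d" % port)
    except socket.error as e:
        if e.errno == errno.EACCES:
            print("permission denied for port %d - need root or "
                  "CAP_NET_BIND_SERVICE" % port)
        else:
            raise
    finally:
        s.close()

try_bind(80)    # denied for a normal user
try_bind(8080)  # fine; hence the redirect/proxy workarounds mentioned below
</pre>
[For a compiled server binary, setcap cap_net_bind_service=+ep on that binary lifts the restriction for that one program; what the comments above are asking for is a per-user or per-port-range policy, along the lines of the /etc/porttab idea, rather than a per-binary one.]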
No sysadmin wants his users competing to be the first to bind to port 79.<br> <p> It's true that those days are long gone and it's time for this restriction to disappear.<br> </div> Sat, 24 Nov 2012 19:09:33 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/526762/ https://lwn.net/Articles/526762/ Cyberax <div class="FormattedComment"> <font class="QuotedText">&gt; you can make it so that a normal user can bind to port 80, you can also set up iptables rules so that packets only flow on port 80 (or port 80 at a specific IP address) to and from processes running as a particular user.</font><br> <font class="QuotedText">&gt;it's not as trivial as a chown, but it's possible.</font><br> How? So far I have tried:<br> 1) Iptables - simply DoesNotWork(tm), particularly for localhost.<br> 2) Redirectors - PITA to set up and often no IPv6 support.<br> 3) Capabilities - no way to make it work with Python scripts or Java apps.<br> <p> For now I'm using nginx as a full-scale HTTP proxy.<br> <p> That restriction for &lt;1024 ports is by far the most moronic stupid imbecilic UNIX feature ever invented.<br> </div> Sat, 24 Nov 2012 18:59:14 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/526761/ https://lwn.net/Articles/526761/ dlang <div class="FormattedComment"> you can make it so that a normal user can bind to port 80, you can also set up iptables rules so that packets only flow on port 80 (or port 80 at a specific IP address) to and from processes running as a particular user.<br> <p> it's not as trivial as a chown, but it's possible.<br> <p> I think you are mixing up the purposes of using containers.<br> <p> It's not the need to use port 80 that causes you to use containers, it's that so many processes that use port 80 don't play well with each other (very specific version dependencies that conflict) or are not trusted to be well written, and so you want to limit the damage that they can do if they run amok (either due to local bugs or due to external attackers).<br> <p> containers don't give you as much isolation as full virtualization, but they do give you a lot of isolation (and are improving fairly rapidly), and they do so at a fraction of the overhead (both CPU and memory) of full virtualization.<br> <p> If you have a very heavy process you are running, you may not notice the overhead of the virtualization, but if you have fairly lightweight processes you are running, the overhead can be very significant.<br> <p> I'm not just talking about the CPU hit for running in a VM, or the memory hit from each VM having its own kernel, but also things like the hit from each VM doing its own memory management, the hit (both CPU and memory) from each VM needing to run its own copy of all the basic daemons (systemd/init, syslog, dbus, udev, etc) and so on.<br> <p> If you are running single-digit numbers of VMs on one system, you probably don't care about these overheads, but if you are running dozens to hundreds of VMs on one system, these overheads become very significant.<br> </div> Sat, 24 Nov 2012 18:07:19 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/526736/ https://lwn.net/Articles/526736/ jch <div class="FormattedComment"> <font class="QuotedText">&gt; Likewise, should the kernel be able to allow multiple processes to transparently use port 80?</font><br> <p> That's pretty much orthogonal to resource usage, isn't it?<br> <p> The issue here is that IP addresses don't obey the
usual permissions model: I can chown a directory, thereby giving a user the right to create files and subdirectories within this particular directory, but I cannot chown an IP address, thereby giving a user the right to bind port 80 on this particular address.<br> <p> I'd be curious to know if I'm the only person feeling that many of the uses of containers and virtualisation would be avoided if the administrator could chown an IP address (or an interface).<br> <p> -- jch<br> <p> </div> Sat, 24 Nov 2012 14:46:18 +0000 LXC? https://lwn.net/Articles/526383/ https://lwn.net/Articles/526383/ mathstuf <div class="FormattedComment"> Yeah, something like ezjail-admin would be nice for LXC. It'd make me consider using CentOS or Debian for my server instead of FreeBSD.<br> </div> Thu, 22 Nov 2012 15:12:10 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/526318/ https://lwn.net/Articles/526318/ Cyberax <div class="FormattedComment"> <font class="QuotedText">&gt;Isolation</font><br> <p> That is done acceptably by OpenVZ and Containers.<br> <p> <font class="QuotedText">&gt;Migration of VMs</font><br> <p> Ditto. There's an article about checkpoint/restore for Linux containers in this week's LWN issue.<br> <p> <font class="QuotedText">&gt;Easy provisioning.</font><br> OpenVZ actually wins here. Provisioning an OpenVZ container is dead easy. Mass operations are also very easy.<br> </div> Thu, 22 Nov 2012 09:50:20 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/526305/ https://lwn.net/Articles/526305/ hensema <p>Costa forgets important uses of virtualisation:</p> <ul> <li>Isolation of customers. Each customer can be king of his virtual machine. You'll never achieve this in a single OS instance. <li>Migration of VMs. You're not dependent on the hardware you've started the instance on. A VM can be migrated to faster/newer/more stable/cheaper/whatever hardware on-the-fly. It's common for VMs to have higher uptime than any physical hardware in the data centre. <li>Easy provisioning. Sure, you can do PXE boot and auto install an OS. But it'll never be as easy and flexible as provisioning a VM </ul> <p>Surely there's a lot to be won in process or user isolation in Linux itself. That'll both be useful when running on bare metal or on a HV. However Costa seems to want to go back to the hayday of mainframes and minis. That's not going to happen. Sorry.</p> Thu, 22 Nov 2012 09:15:39 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/526290/ https://lwn.net/Articles/526290/ Fowl <div class="FormattedComment"> Shared libraries often use IPC...<br> <p> Plus people want to use containers for more serious "untrusted" isolation.<br> </div> Thu, 22 Nov 2012 08:13:33 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/526282/ https://lwn.net/Articles/526282/ HelloWorld <div class="FormattedComment"> What's the problem with IPC? The whole point of containers is to have better granularity: share the IPC namespace, but don't share the file system namespace so that you can use your own shared libraries.<br> </div> Thu, 22 Nov 2012 06:08:54 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525788/ https://lwn.net/Articles/525788/ exel <div class="FormattedComment"> I think a lot of people still associate OpenVZ with Plesk/Virtuozzo. 
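[A concrete sketch of the per-namespace granularity mentioned a few comments up — a private filesystem view while the IPC namespace stays shared with the host. It assumes root privileges (CAP_SYS_ADMIN) and glibc; the CLONE_NEWNS value is the kernel's flag for a new mount namespace.]
<pre>
# Give this process its own mount namespace, but deliberately do NOT pass
# CLONE_NEWIPC, so System V IPC and POSIX message queues remain shared.
import ctypes
import os

CLONE_NEWNS = 0x00020000   # new mount namespace (from the kernel headers)

libc = ctypes.CDLL("libc.so.6", use_errno=True)
if libc.unshare(CLONE_NEWNS) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))

# From here on this process has its own copy of the mount table (subject
# to mount-propagation settings), so it can bind-mount its own library
# tree without affecting the host, while shared-memory segments it
# creates stay visible to processes outside the namespace.
print("running in a private mount namespace, pid %d" % os.getpid())
</pre>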
I must admit I haven't seen where that stuff has been going, but 6 years ago it was all pretty horrible.<br> <p> There is, at the core of OpenVZ, something that seems absolutely elegant and useful – a more isolating take on the FreeBSD jail concept, which itself was of course chroot on acid. Using it for advanced process isolation looks like a sensible application. The _typical_ application of OpenVZ, in the field right now, though, is that of a poor man's hypervisor. I think that a container technology is just the wrong approach for this.<br> <p> The big elephant in the room, for me, is security isolation. Containers all run under the same kernel, which means that a kernel compromise is a compromise of all attached security domains. An actual hypervisor setup adds an extra privilege layer that has to be separately broken.<br> <p> Again, this doesn't mean that OpenVZ cannot be tremendously useful. The most visible way Parallels is selling the technology, however, is not what people are looking for. This pans out in the market place. <br> <p> </div> Tue, 20 Nov 2012 01:40:49 +0000 Conflicting goals? https://lwn.net/Articles/525663/ https://lwn.net/Articles/525663/ mathstuf <div class="FormattedComment"> How does/would the 200% percentage work with big.LITTLE?<br> </div> Mon, 19 Nov 2012 03:14:37 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525564/ https://lwn.net/Articles/525564/ tom.prince <div class="FormattedComment"> There are a number of options to have light-weight bare-metal applications running in the VM, rather than full operating systems.<br> <p> <a href="http://www.openmirage.org/">http://www.openmirage.org/</a> and <a href="https://github.com/GaloisInc/HaLVM">https://github.com/GaloisInc/HaLVM</a> are tools for doing this that come to mind.<br> </div> Sat, 17 Nov 2012 16:47:58 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525476/ https://lwn.net/Articles/525476/ dlang <div class="FormattedComment"> The problem is that the kernel is not lightweight.<br> <p> When you run a separate kernel to manage one app, you now have multiple layers of caching for example.<br> <p> 'fixing' this gets very quickly to where it's at least as much complexity.<br> </div> Fri, 16 Nov 2012 18:27:50 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525465/ https://lwn.net/Articles/525465/ tbird20d <div class="FormattedComment"> At the risk of exposing my ignorance, the containers approach seems like it's piling complexity on top of complexity. I view it somewhat akin to manual loop-unrolling. Sure, you can get some good performance benefits, and sometimes it's called for, but it makes the code more difficult to understand and is harder to maintain.<br> <p> If the kernel is lightweight, then it seems like re-using it in recursive sort of way as a hypervisor, a'la KVM, seems like the more tractable long term approach, rather than adding lots of complexity to all these different code paths (basically, almost all of the major resource management paths in the kernel).<br> </div> Fri, 16 Nov 2012 17:22:44 +0000 LXC? https://lwn.net/Articles/525407/ https://lwn.net/Articles/525407/ TRS-80 Having decent userspace tools is something else that's missing from the upstream kernel container implementation. The kernel has all these features now, but no coherent way of managing them nicely yet. 
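[A sketch of the kind of ezjail-admin-style convenience layer being asked for here: a thin, hypothetical wrapper over the stock LXC command-line tools (lxc-create, lxc-start, lxc-stop, lxc-destroy). The container name and template are illustrative.]
<pre>
# Hypothetical wrapper around the LXC userspace tools of the time.
import subprocess

def run(*argv):
    print("+ " + " ".join(argv))
    subprocess.check_call(argv)

def create(name, template="debian"):
    run("lxc-create", "-n", name, "-t", template)

def start(name):
    run("lxc-start", "-n", name, "-d")   # -d: run detached in the background

def stop(name):
    run("lxc-stop", "-n", name)

def destroy(name):
    run("lxc-destroy", "-n", name)

if __name__ == "__main__":
    create("web01")
    start("web01")
</pre>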
Fri, 16 Nov 2012 12:29:16 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525396/ https://lwn.net/Articles/525396/ kolyshkin <div class="FormattedComment"> I am really sorry for you negative experience with travesty of horrors. Nevertheless, this product has no common roots (or common developers, managers etc) with OpenVZ, so this is hardly relevant.<br> <p> <font class="QuotedText">&gt; It's still trivial to forkbomb a container running</font><br> <font class="QuotedText">&gt; the VZ kernel and take down the whole host.</font><br> <p> If that would be true, all the hosting service providers using VZ (i.e. a majority of) would go out of business very soon. If CT resources (user beancounters) are configured in a sane way (and they are configured that way by default -- so unless host system administrator removes some very vital limits), this is totally and utterly impossible.<br> <p> So, let's turn words into actions. I offer you $100 (one hundred US dollars) for demonstrating a way to bring the whole system down using a fork bomb in OpenVZ (or Virtuozzo, for that matter) container. A reproducible description of what to do to achieve it is sufficient.<br> <p> <font class="QuotedText">&gt; The container model also requires a number of really, really annoying limits</font><br> <p> I can feel your pain. Have you ever heard of vswap? Here are some 1-year-old news for you?<br> - <a href="https://plus.google.com/113376330521944789537/posts/5WEzAbVoyJN">https://plus.google.com/113376330521944789537/posts/5WEzA...</a><br> - <a href="http://wiki.openvz.org/VSwap">http://wiki.openvz.org/VSwap</a><br> <p> In a nutshell, you only need to set RAM and SWAP for container, and keep the rest of really, really annoying limits unconfigured.<br> <p> <font class="QuotedText">&gt; because of the fundamental weakness of the container model</font><br> <p> Could you please enlighten us on what exactly is this fundamental weakness?<br> <p> <font class="QuotedText">&gt; You can get around that</font><br> <p> Now, I won't be commenting on the rest of your entry because it is based on wrong assumptions. <br> </div> Fri, 16 Nov 2012 09:11:58 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525310/ https://lwn.net/Articles/525310/ naptastic <div class="FormattedComment"> Parallels is the company responsible for the travesty of horrors that is Plesk.<br> <p> The hosting company I work for (who shall remain nameless) uses OpenVZ for virtual servers. It is wholly inadequate and we have begun transitioning to KVM.<br> <p> It's still trivial to forkbomb a container running the VZ kernel and take down the whole host. The container model also requires a number of really, really annoying limits (shared memory, locked memory, open files, total files, TCP sockets, total sockets, and on and on) that have to be there because of the fundamental weakness of the container model.<br> <p> You can get around that by using a container just for one thing, but then you still have to have full operating system just for that one thing. If I want to have memcache by itself in a container, I need a full file system and Linux install to support it. You lose some overhead by using a container instead of a hypervisor, then get it all right back plus some with the requirements of containment.<br> <p> The only advantage I can see is that you can update the hardware configuration in realtime. 
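[On the fork-bomb point above: OpenVZ's numproc beancounter caps the process count per container; the rough per-uid analogue on a stock kernel is RLIMIT_NPROC, sketched below. The limit value is arbitrary, and it only helps if the untrusted workload runs under its own uid.]
<pre>
# Cap the number of processes the current uid may own before running
# anything untrusted, so a fork bomb exhausts its own allowance instead
# of the host's process table.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("current nproc limit: soft=%s hard=%s" % (soft, hard))

# Once the cap is reached, fork() fails with EAGAIN.
resource.setrlimit(resource.RLIMIT_NPROC, (256, 256))

# ... exec the untrusted workload here ...
</pre>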
Other than that, use cgroups or use full virtualization.<br> </div> Thu, 15 Nov 2012 21:31:19 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525298/ https://lwn.net/Articles/525298/ glommer <div class="FormattedComment"> Which is a very big advantage of KVM, why I like it so much, and worked on it for so long.<br> <p> But while you are reusing all of the infrastructure from the OS - awesome - you still have two schedulers, two virtual memory subsystems, two IO dispatchers, etc.<br> <p> Containers, OTOH, are basically the Operating System taking resource isolation one step further, and allowing you to do all that without resorting to all the resource duplication you have with hypervisors - be your hypervisor your own OS or not.<br> <p> Which of them suits you better, is up to you, your use cases, and personal preferences.<br> </div> Thu, 15 Nov 2012 20:09:17 +0000 Conflicting goals? https://lwn.net/Articles/525296/ https://lwn.net/Articles/525296/ glommer <div class="FormattedComment"> Your description is precise and correct.<br> <p> However, there are many scenarios where you actually want to limit the maximum amount of cpu used, even without contention. An example of this, is cloud deployments where you pay for cpu time and value price predictability over performance.<br> <p> The cpu cgroup *also* allows one to set a maximum quota through the combination of the following knobs:<br> <p> cpu.cfs_quota_us<br> cpu.cfs_period_us<br> <p> If you define your quota as 50 % of your period, you will run for at most 50 % of the time. This is bandwidth based, in units of microseconds. So "use at most 2 cpus" is equivalent to 200 %. IOW, 2 seconds per second.<br> This is defaulted to -1, meaning "no upcap"<br> <p> Equivalent mechanism exists for rt tasks: cpu.rt_quota_us, etc.<br> <p> <p> Cheers<br> </div> Thu, 15 Nov 2012 20:01:38 +0000 LXC? https://lwn.net/Articles/525292/ https://lwn.net/Articles/525292/ glommer <div class="FormattedComment"> You can use LXC to run containers on Linux, but whether you can go to "production" with it, depends on what "production" means to you.<br> <p> There are many things that mainline Linux lacks. One of them, is the kernel memory limitation described in the article, that allows the host to protect against abuse from potentially malicious containers. It is trivial for a container to fill the memory with non-reclaimable objects, so no one else can be serviced.<br> <p> User namespaces are progressing rapidly, but they are not there yet. Eric Biederman is doing a great job with that, patches are flowing rapidly, but you still lack a fully isolated capability system.<br> <p> The pseudo file-systems /proc and /sys will still leak a lot of information from the host.<br> <p> Tools like "top" won't work, because it is impossible to grab per-group figures of cpu usage. And this is not an extensive list.<br> <p> So if "production" for you rely on any of the above, then no, you can't run LXC. If otherwise, then sure, you can run LXC.<br> <p> Besides that, a lot of the kernel features that LXC relies on, were contributed for the OpenVZ project. So it is not like we're trying to fork the kernel, and keep people on our branch forever. 
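[To make the arithmetic above concrete — a minimal sketch, assuming the v1 cpu and memory controllers are mounted in the usual places under /sys/fs/cgroup; the group name and limit values are illustrative. The shares value is the proportional weight discussed elsewhere in the thread (a floor that only matters under contention, which is also the answer to the minimum-guarantee question near the top of the feed), quota/period give the hard "200%" cap, and the memory knobs are the soft/hard pair.]
<pre>
import os

CPU = "/sys/fs/cgroup/cpu/capped"
MEM = "/sys/fs/cgroup/memory/capped"

def write(path, name, value):
    with open(os.path.join(path, name), "w") as f:
        f.write(str(value))

for path in (CPU, MEM):
    if not os.path.isdir(path):
        os.makedirs(path)          # creating the directory creates the cgroup

# Proportional share: relative weight against other groups (root = 1024),
# so it only takes effect when there is contention for the CPU.
write(CPU, "cpu.shares", 4096)

# Hard cap: 200 ms of CPU time per 100 ms period = "200%", i.e. at most
# two CPUs' worth of time even on an otherwise idle machine.
# cpu.cfs_quota_us defaults to -1, meaning no cap.
write(CPU, "cpu.cfs_period_us", 100000)
write(CPU, "cpu.cfs_quota_us", 200000)

# Memory: the soft limit is where reclaim starts picking on this group
# first; the hard limit is the absolute ceiling.
write(MEM, "memory.soft_limit_in_bytes", 800 * 1024 * 1024)
write(MEM, "memory.limit_in_bytes", 1024 * 1024 * 1024)

# Finally, move the current process into the group.
write(CPU, "tasks", os.getpid())
write(MEM, "tasks", os.getpid())
</pre>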
It's just quite a big amount of work, the trade-offs are not always clear for upstream, etc. - it is no different from Android in essence.<br> <p> The ultimate goal, as stated in the article, is to have all the kernel functionality in mainline, so people can use any userspace tool they want.<br> <p> Cheers<br> </div> Thu, 15 Nov 2012 19:48:10 +0000 LXC? https://lwn.net/Articles/525281/ https://lwn.net/Articles/525281/ xxiao <div class="FormattedComment"> I also thought about the same (i.e. use LXC), until the sales pitch for OpenVZ came out...<br> </div> Thu, 15 Nov 2012 19:01:58 +0000 LXC? https://lwn.net/Articles/525279/ https://lwn.net/Articles/525279/ pbryan <div class="FormattedComment"> <font class="QuotedText">&gt; It is possible to run production containers today, Glauber said, but not with the mainline kernel. Instead, one can use the modified kernel provided by the open source OpenVZ project that is supported by Parallels, the company where Glauber is employed.</font><br> <p> I was under the impression that it is possible to run production containers today with LXC. What functionality does OpenVZ provide that is not supported by LXC, i.e. via cgroups, clone(2) isolation flags, devpts isolation mechanisms?<br> </div> Thu, 15 Nov 2012 18:55:57 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525267/ https://lwn.net/Articles/525267/ drag <div class="FormattedComment"> And with KVM we call hypervisors "Linux".<br> </div> Thu, 15 Nov 2012 18:01:00 +0000 Conflicting goals? https://lwn.net/Articles/525229/ https://lwn.net/Articles/525229/ ottuzzi <div class="FormattedComment"> Hi,<br> <p> Solaris containers respect imposed limits only when there is competition for resources. <br> Say you have container A with a CPU limit of 80 and a container B with a CPU limit of 70: if container A is not using the CPU, container B can use all the CPU available; likewise, if B is not doing anything on the CPU, A can use all the CPU.<br> But when both A and B are trying to use all the CPU available, they are balanced proportionally to their weights.<br> I think Linux implements it the same way.<br> <p> Hope I was clear<br> Bye<br> Piero<br> </div> Thu, 15 Nov 2012 16:23:55 +0000 Conflicting goals? https://lwn.net/Articles/525222/ https://lwn.net/Articles/525222/ Jonno <div class="FormattedComment"> The memory controllers allow for both soft and hard limits, so you can set a soft limit of 80% (so if the cgroup is above 80%, its memory will be the first to go) and a hard limit of 100% (so the cgroup gets to use all memory no one else wants).<br> <p> The CPU controller works slightly differently: you can set a "shares" value, and *when cpu contention occurs* the cpu resources will be assigned proportionally. As the root cgroup defaults to 1024 shares, if you assign a shares value of 4096 to your cgroup (assuming there are no other cgroups), it will be limited to 80% of cpu time when there is contention, but be allowed to use more if no other process wants to be scheduled.<br> </div> Thu, 15 Nov 2012 16:04:44 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525226/ https://lwn.net/Articles/525226/ k3ninho <div class="FormattedComment"> TCP/IP?
That's a little broad: JSON via HTTP requests on port 80 (it's left to the reader to provide an insecure implementation via HTTPS/443).<br> </div> Thu, 15 Nov 2012 16:04:13 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525218/ https://lwn.net/Articles/525218/ raven667 <div class="FormattedComment"> I think this is exactly right and is something I have noticed as well. I would also add that this is the reality of the micro kernel model, so in a way tannenbaum was right., micro kernels are the future, but we call them hypervisors. <br> </div> Thu, 15 Nov 2012 15:26:35 +0000 Conflicting goals? https://lwn.net/Articles/525215/ https://lwn.net/Articles/525215/ NAR <div class="FormattedComment"> It may be wise to constrain a (group of) processes to e.g. 80% of RAM and 80% of CPU - but what if there are no other processes on the computer? 20% of the RAM and 20% of the CPU is not used. For example when I'm playing with a game, I'd like it to use every bit of resource to get that 90 frames per second rate. How does it differ from a fork bomb from the outside?<br> </div> Thu, 15 Nov 2012 15:13:14 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525144/ https://lwn.net/Articles/525144/ epa <div class="FormattedComment"> For better or worse inter-application communication will end up as being TCP/IP rather than via the filesystem or local IPC mechanisms.<br> </div> Thu, 15 Nov 2012 09:58:39 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525142/ https://lwn.net/Articles/525142/ robert_s <div class="FormattedComment"> "it works out more robust this way."<br> <p> Only because so far, little communication &amp; cooperation between these appliances has been sought or required. If "appliances" are our new "processes" the fun is going to come when the equivalent of IPC is required.<br> <p> And let's not even start talking about efficiency.<br> </div> Thu, 15 Nov 2012 09:55:15 +0000 LCE: The failure of operating systems and how we can fix it https://lwn.net/Articles/525122/ https://lwn.net/Articles/525122/ epa <div class="FormattedComment"> Glauber Costa is quite right that hypervisors are a response to the failure of the operating system to provide enough isolation between different processes or different users on the same machine. But there is another important way in which the operating system has failed, and that is in providing an interface which is wide enough to be usable and yet narrow enough to be completely specified and dependable.<br> <p> Often a large application will specify a particular Linux distribution or Windows version it is 'certified' to run on. The vendor may even insist that its application be the only thing running on the machine, if you want to get support. It may require particular versions of system libraries because those were the ones it was tested with. And yes, I am talking about big companies here, where stupid things are done for stupid big-organization reasons, and if you use free software and compile from source you are free of this nonsense, blah blah. But bear with me and assume that at least some of the time there is a legitimate reason to require an exact operating system version for running an application. 
(If you have ever worked on a support desk, you will find this reality easier to accept.)<br> <p> So what we start to see are 'appliances' where the application is packaged up with its operating system ready to load into a virtual machine. Instead of supplying a program which calls the complex interface provided by the kernel, C library, and other system libraries, the vendor supplies one which expects the 'ABI' of an idealized x86-compatible computer. It has proved easier to agree on that than to agree on the higher level interfaces. Even though, somewhat absurdly, it means that TCP/IP and filesystems and virtual memory are all being reimplemented inside the 'appliance', it works out more robust this way.<br> <p> </div> Thu, 15 Nov 2012 07:50:23 +0000