Containers vs Hypervisors: The Battle Has Just Begun (Linux.com)
Unikernel systems create tiny VMs. Mirage OS from the Xen Project incubator, for example, has created several network devices that run kilobytes in size (yes, that's “kilobytes” – when was the last time you heard of any VM under a megabyte?). They can get that small because the VM itself does not contain a general-purpose operating system per se, but rather a specially built piece of code that exposes only those operating system functions required by the application. There is no multi-user operating environment, no shell scripts, and no massive library of utilities to take up room – or to subvert in some nefarious exploit. There is just enough code to make the application run, and precious little for a malefactor to leverage. And in unikernels like Mirage OS, all the code that is present is statically type-safe, from the application stack all the way down to the device drivers themselves. It's not the “end-all be-all” of security, but it is certainly heading in the right direction.
Posted Aug 28, 2014 22:07 UTC (Thu)
by ibukanov (subscriber, #3942)
[Link] (10 responses)
If anything, one can try to compare Mirage OS with Google Native Client, as both projects target writing new code for a safe VM with a rather limited API. Similarly, one can compare Docker with, say, Vagrant [1], as both projects use the idea of shared image files.
Posted Aug 28, 2014 22:38 UTC (Thu)
by edomaur (subscriber, #14520)
[Link] (9 responses)
Posted Aug 29, 2014 6:13 UTC (Fri)
by ibukanov (subscriber, #3942)
[Link] (8 responses)
But if running in a lightweight hyper container means the code has to be recompiled against some special library like uClibc or nacl_io, which requires involving developers and not just system administrators, then that is not comparable with what Docker and friends offer.
Posted Aug 29, 2014 7:03 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (7 responses)
I'm playing with a small musl-based system running under Docker in my spare time.
Posted Aug 29, 2014 7:34 UTC (Fri)
by ibukanov (subscriber, #3942)
[Link] (6 responses)
Posted Aug 29, 2014 15:20 UTC (Fri)
by justincormack (subscriber, #70439)
[Link] (5 responses)
Posted Aug 30, 2014 14:25 UTC (Sat)
by dmarti (subscriber, #11625)
[Link] (4 responses)
If you want to run an arbitrary Linux application on a unikernel, you can try OSv. A simple example is a toy HTTP server; a real application is Redis on OSv. There are a few limitations--the main one is that it's a single address space, so there is no fork(2). However, many non-forking Linux applications will build and run on OSv with just a Makefile change (see the HTTP server example article for how to do that). Detailed info is in the OSv paper from the USENIX Annual Technical Conference: OSv—Optimizing the Operating System for Virtual Machines. (I work for the company behind OSv.)
Posted Sep 1, 2014 19:36 UTC (Mon)
by ibukanov (subscriber, #3942)
[Link] (3 responses)
Posted Sep 1, 2014 23:37 UTC (Mon)
by dmarti (subscriber, #11625)
[Link] (2 responses)
OSv will use some libraries built for a Linux host (such as libevent in the HTTP server example above), so you may not have to do a separate build just for your OSv systems and can simply use the library from your Linux environment of choice.
I don't know if it's meaningful to say that OSv isolation level is similar to that of NaCl. Both of them definitely have the goal of strict isolation, but they approach it in totally different ways: NaCl by forcing you to use a safe subset of valid x86_64 code, and OSv by using the hypervisor/guest kernel barrier.
Posted Sep 2, 2014 5:45 UTC (Tue)
by ibukanov (subscriber, #3942)
[Link] (1 responses)
However, this situation is still much better than a typical setup for Linux containers, where a bug in a big and fat Linux kernel allows an attacker to take over the whole system. And I suppose OSv can achieve the same if not better performance than container solutions.
What is interesting about NaCl is that it provides the same level of isolation as one gets using memory protection under a normal OS, with much cheaper system calls. They are still more expensive than function calls, but the performance toll should be small enough not to worry about it. So it would be interesting to port NaCl to OSv to get both the performance of a lightweight VM and the isolation one gets using a memory-protected kernel for system services.
Posted Sep 5, 2014 9:35 UTC (Fri)
by justincormack (subscriber, #70439)
[Link]
Posted Aug 29, 2014 1:43 UTC (Fri)
by allesfresser (guest, #216)
[Link] (7 responses)
Except, isn't that supposed to be one of the benefits of free/open source software--that a giant like Google can make the investment to research something like this, and then the whole ecosystem benefits from the techniques? Sure, maybe I won't use millions of servers like Google, but that doesn't mean I can't use the same containerizing sort of techniques they do, on a small scale. If the techniques are well-enough known for us to point to as a good example, then we should be able to use them as an example, no?
Posted Aug 29, 2014 11:19 UTC (Fri)
by dgm (subscriber, #49227)
[Link] (6 responses)
Posted Aug 29, 2014 12:52 UTC (Fri)
by Cato (guest, #7643)
[Link] (5 responses)
Posted Aug 29, 2014 18:54 UTC (Fri)
by Lennie (subscriber, #49641)
[Link] (4 responses)
They created new code to do things similar to what they are doing internally and open sourced that, to be able to create a community around the project.
Just like Docker wasn't used internally by dotCloud, they created something new they wanted to create a community around.
Google also open sourced Let Me Contain That For You (lmctfy) and cAdvisor in the same way. It's all new code.
In all cases their internal code must be years old and maybe not so pretty to look at; they probably wanted to start with a clean slate anyway, without any legacy baggage.
Posted Aug 30, 2014 0:22 UTC (Sat)
by Sesse (subscriber, #53779)
[Link] (3 responses)
(Disclaimer: I work at Google, but not with anything related to this.)
/* Steinar */
Posted Aug 30, 2014 9:00 UTC (Sat)
by Lennie (subscriber, #49641)
[Link]
But it was clearly not developed in isolation. It is based on code and design from BULL/SGI and IBM. A lot of it was developed in the mainline Linux kernel too.
cgroups is a generalization of cpusets, which was already in mainline and which originally came from BULL SA. cpusets was later rewritten by SGI. That all happened before it was used by cgroups.
Another example is, I believe, the memory controller accounting (design?), which came from IBM. The memory controller had a number of competing implementations, including beancounters by the OpenVZ guys.
Here is the original commit of cgroups:
Also have a look at the end of this email from 2 years before cgroups was created:
For example, the following sequence of commands will setup a cpuset
mount -t cpuset none /dev/cpuset
http://lwn.net/Articles/91637/
Looks kind of familiar, right ? ;-)
Posted Sep 8, 2014 11:00 UTC (Mon)
by dunlapg (guest, #57764)
[Link] (1 responses)
The only way to make containers reasonably secure is to tailor the container to the exact program you're using, and then also to reduce the number of system calls required by that program. This can't be shared between applications; it needs to be done over again from scratch for *every new application*. The fact that Google has done this work for Google Docs doesn't benefit you at all when you're running Apache.
Posted Sep 8, 2014 15:36 UTC (Mon)
by Lennie (subscriber, #49641)
[Link]
I do know there is a long list of things you should do which could allow you to be somewhat secure.
If they implement all of the long list, would it be enough to run code from an untrusted source? Maybe.
So far Google said they run customer-supplied untrusted code for Docker in a Docker container in a VM in a container.
Here is what Red Hat's Dan Walsh is working on:
Here is his list of tips at the end:
- Only run applications from a trusted source
He doesn't even mention the whole list, but as far as I can see, there is:
So far SELinux, capabilities, mounting /proc and /sys read-only, namespaces (except for user) and, I assume, seccomp are implemented in Docker. User namespaces haven't been implemented because not a lot of kernels running in the wild support them properly.
Posted Aug 29, 2014 11:13 UTC (Fri)
by azilian (guest, #47340)
[Link] (3 responses)
As far as security goes, if you configure your containers properly and give them their own physical storage, most of the security concerns disappear.
I'm not saying that containers are completely secure, but I'm trying to point out that they are reasonably secure if they are reasonably set up.
Posted Aug 29, 2014 11:31 UTC (Fri)
by dag- (guest, #30207)
[Link]
But as soon as there is uptake on the idea, things will get standardized, and that opens the door to abusing standardized APIs or standardized setups. And if the storage layer is replaced with cloud storage APIs, you have to include the attack vectors against the cloud storage as well.
Things do not necessarily become less complex, but it might help to reduce the number of (currently used) attack vectors.
Posted Aug 29, 2014 13:15 UTC (Fri)
by ewan (guest, #5533)
[Link]
And full virt VMs are reasonably fast if they are reasonably set up.
This is probably one of those things not worth having a war over - sometimes containers will be better, sometimes VMs will be better, but most times either one will do just fine, and pretty much interchangeably.
Posted Sep 2, 2014 2:22 UTC (Tue)
by raven667 (subscriber, #5198)
[Link]
I don't think that is true at all. What I've heard from the security people and the container people is that containers are not useful for hostile multi-tenant environments in the way that full VMs are. There are too many design holes which need to be plugged with SELinux or seccomp_bpf or whatever, the kernel attack surface is large, and there are _always_ 0days floating around which break the kernel, especially when you support fully-featured guest images. Of course this doesn't mean that hosting services aren't being offered, but what you believe is "reasonably secure" may differ from others.
Posted Aug 29, 2014 17:57 UTC (Fri)
by NightMonkey (subscriber, #23051)
[Link] (2 responses)
Thanks.
Posted Aug 29, 2014 23:47 UTC (Fri)
by azilian (guest, #47340)
[Link] (1 responses)
BSD Jails offer:
Linux containers offer:
Disclaimer: I'm not an expert in BSD and may have missed something.
Posted Aug 30, 2014 5:17 UTC (Sat)
by NightMonkey (subscriber, #23051)
[Link]
Applying the unikernel concept to more applications
Except Google is Google. They can afford to hire thousands of the best and brightest to do intelligent things that few others can do. After 30 years in this industry, and two decades dealing with customers on site, I doubt that most organizations could readily do what Google has done. If they could, they'd be Google, too.
https://git.kernel.org/cgit/linux/kernel/git/torvalds/lin...
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then move the current shell to that cpuset:
cd /dev/cpuset/top_cpuset
mkdir Charlie
cd Charlie
/bin/echo 2-3 > cpus
/bin/echo 1 > mems
/bin/echo $$ > tasks
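The `2-3` written to `cpus` above uses the kernel's list format: comma-separated numbers and inclusive ranges. As a minimal sketch of how such a string expands into individual CPU numbers (the helper name `parse_cpuset_list` is made up for illustration and is not part of any kernel tool):

```python
from __future__ import annotations


def parse_cpuset_list(spec: str) -> list[int]:
    """Expand a cpuset list string such as "2-3" or "0,4-6" into CPU numbers."""
    cpus: set[int] = set()
    for part in spec.strip().split(","):
        if not part:
            continue  # tolerate an empty string, e.g. an unset "cpus" file
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return sorted(cpus)


print(parse_cpuset_list("2-3"))  # the "Charlie" cpuset from the quoted email
```

The same format is what you would get back when reading the `cpus` file, so a helper like this can be used to check what a cpuset actually contains.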
https://opensource.com/business/14/9/security-for-docker
- Run applications on an enterprise quality host
- Install updates regularly
- Drop privileges as quickly as possible
- Run as non-root whenever possible
- Watch your logs
- setenforce 1
- cgroups to prevent a container from DoSing CPU/memory/disk for other containers
- seccomp to only allow certain syscalls:
https://github.com/docker/docker/blob/master/contrib/mkse...
- SELinux to only allow access to certain SELinux types and SELinux categories
- capabilities whitelist to only allow certain capabilities
- read-only mounts to allow only a few entries from /sys and /proc
- user namespaces to make sure root in the container is not root to the kernel / outside the container - you'll need a fairly new kernel to be able to use it securely
- pid namespace to let the container only see its own processes
- hostname namespace so the container has its own hostname
- networking namespace to give the container its own network stack
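To make the capabilities whitelist item a bit more concrete, here is a rough sketch that decodes a `CapEff` mask (the hex value shown in `/proc/<pid>/status`) and reports any capability outside a whitelist. The bit numbers are from `<linux/capability.h>`; the table is deliberately a small subset, and the function name `caps_outside_whitelist` and the sample mask are assumptions made up for this example, not anything Docker provides.

```python
from __future__ import annotations

# Bit numbers from <linux/capability.h>; only a small subset is listed here.
CAP_BITS = {
    "CAP_CHOWN": 0,
    "CAP_KILL": 5,
    "CAP_SETGID": 6,
    "CAP_SETUID": 7,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_NET_RAW": 13,
    "CAP_SYS_ADMIN": 21,
}


def caps_outside_whitelist(cap_eff_hex: str, whitelist: set[str]) -> list[str]:
    """Return the known capabilities set in the mask but absent from the whitelist."""
    mask = int(cap_eff_hex, 16)
    extra = [
        name
        for name, bit in CAP_BITS.items()
        if mask & (1 << bit) and name not in whitelist
    ]
    return sorted(extra)


# Hypothetical mask with CAP_CHOWN, CAP_NET_BIND_SERVICE and CAP_SYS_ADMIN set.
sample = "0000000000200401"
print(caps_outside_whitelist(sample, {"CAP_CHOWN", "CAP_NET_BIND_SERVICE"}))
```

Capabilities not listed in `CAP_BITS` are silently ignored here, so a real check would need the full table.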
And I have to point out that Docker is not the only container technology! LXC is out there, and is doing an awesome job at providing you with full OS containers.
- chroot
- network isolation
- shared memory isolation (I'm not sure if it is real isolation or a simple permission check)
- security level support per-container
- limit the network memory used by each process
- hostname/domainname per-container
- fine grained capabilities per-container can be achieved but not with the default tools
- Tools: jail/ezjail
- chroot
- network isolation
- Shared memory isolation
- hostname/domainname per-container
- PID isolation (many containers can each see a PID number 314, but these are different PIDs on the host machine)
- user mappings (map UID 433 on the host as UID 15 or UID 0 in the container)
- resource limit isolation
- CPU, Memory and I/O limits per-group
- Device isolation per-group
- there is a mechanism in the kernel to freeze/unfreeze all processes inside a container with a single command
- Fine grained capabilities per-container
- Tools: Docker, LXC and a lot of others.
- Live migration - CRIU. If certain rules are met, you can even live migrate a Linux container to another physical machine.
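The user mappings item can be sketched in a few lines. The kernel describes a mapping in `/proc/<pid>/uid_map` as lines of three numbers: first ID inside the namespace, first ID outside, and the length of the range. The helper below (its name, `host_to_container_uid`, is invented for this example) translates a host UID using such text, with the UID 433 to UID 15 mapping from the list as sample input.

```python
from __future__ import annotations


def host_to_container_uid(uid_map: str, host_uid: int) -> int | None:
    """Translate a host UID to the UID seen inside the container, if mapped."""
    for line in uid_map.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, count = map(int, fields)
        if outside <= host_uid < outside + count:
            return inside + (host_uid - outside)
    return None  # this host UID has no mapping in the container


# Sample map: host UID 433 appears as UID 15 inside the container.
print(host_to_container_uid("15 433 1\n", 433))
```

A map such as `0 100000 65536` would instead make host UID 100000 appear as root inside the container while it stays an unprivileged user outside.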