
Portable system services

By Jake Edge
November 9, 2016

Linux Plumbers Conference

In the refereed track of the 2016 Linux Plumbers Conference, Lennart Poettering presented a new type of service for systemd that he calls a "portable system service". It is a relatively new idea that he had not talked about publicly until the systemd.conf in late September. Portable system services borrow some ideas from various container managers and projects like Docker, but are targeting a more secure environment than most services (and containers) run in today.

There is no real agreement on what a "container" is, Poettering said, but most accept that they combine a way to bundle up resources and to isolate the programs in the bundle from the rest of the system. There is also typically a delivery mechanism for getting those bundles running in various locations. There may be wildly different implementations, but they generally share those traits.

[Lennart Poettering]

Portable system services are meant to provide the same kind of resource bundling, but to run the programs in a way that is integrated with the rest of the system. Sandboxing would be used to limit the problems that a compromised service could cause.

If you look at the range of ways that a service can be run, he said, you can put it on an axis from integrated to isolated. The classic system services, such as the Apache web server or NGINX, are fully integrated with the rest of the system. They can see all of the other processes, for example. At the other end of the scale are virtual machines, like those implemented by KVM, which are completely isolated. In between, moving from more integrated to more isolated, are portable system services, Docker-style micro-services, and full operating system containers such as LXC.

Portable system services combine the traditional, integrated services with some ideas from containers, Poettering said. The idea is to consciously choose what gets shared and what doesn't. Traditional services share everything: the network, filesystems, process IDs, the init system, devices, and logging. Some of those things will be walled off for portable system services.

This is the next step for system services, he said. The core idea behind systemd is service management; not everything is Docker-ized yet, but everything has a systemd service file. Administrators are already used to using systemd services, so portable services will just make them more powerful. In many cases, users end up creating super-privileged containers by dropping half of the security provided by the container managers and mostly just using the resource bundling aspect. He wants to go the other direction and take the existing services and add resource bundling.

Integration is good, not bad, Poettering said; having common logging and networking is often a good thing. Systemd currently recognizes two types of services: System V and native. A new "portable" service type will be added to support this new idea. It will be different from the other service types by having resource bundling and sandboxing.

To start with, unlike Docker, systemd does not want to be in the business of defining its own resource bundling format, so it will use a simple directory tree in a tarball, subvolume, or GUID Partition Table (GPT) image with, say, SquashFS inside. Services will run directly from that directory, which will be isolated using chroot().
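
For illustration, a unit for such a bundled service might look something like the sketch below; the unit name, bundle path, and binary are hypothetical, and RootDirectory= is simply systemd's existing chroot()-style directive:

    # /etc/systemd/system/foo-portable.service -- hypothetical example
    [Unit]
    Description=Service running from a bundled directory tree

    [Service]
    # Use the unpacked bundle as the service's root directory (chroot-style);
    # the ExecStart= binary is resolved inside that root.
    RootDirectory=/var/lib/bundles/foo
    ExecStart=/usr/bin/foo-daemon

    [Install]
    WantedBy=multi-user.target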

The sandboxing he envisions is mostly concerned with taking away the visibility of, and preventing access to, various system resources. He went through a long list of systemd directives that could be used to effect the sandboxing. For example, PrivateDevices and PrivateNetwork are booleans that restrict access to all but a minimal set of devices in /dev (e.g. /dev/null and /dev/urandom) and provide only a loopback interface for networking, respectively. PrivateTmp gives the service its own /tmp, which removes a major attack surface. There is a setting to provide a private user database with only three users: root, a service-specific user, and nobody; all other user IDs are mapped to nobody. Some other settings protect various directories in the system by mounting them read-only for the service; there is a setting to disallow realtime scheduling priority, another to restrict kernel-module loading, and so on.
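
To give a concrete feel for what that looks like in a unit file, here is a sketch using directives of that kind that already exist in systemd; the particular combination is illustrative, not one that Poettering presented:

    # Illustrative sandboxing settings for a service (not from the talk)
    [Service]
    # Only a minimal /dev (null, zero, urandom, ...)
    PrivateDevices=yes
    # Private network namespace with only a loopback interface
    PrivateNetwork=yes
    # The service gets its own /tmp and /var/tmp
    PrivateTmp=yes
    # Mount /usr, /boot, and /etc read-only for the service
    ProtectSystem=full
    # Make /home, /root, and /run/user inaccessible
    ProtectHome=yes
    # Refuse realtime scheduling priorities
    RestrictRealtime=yes
    # Deny kernel-module loading
    ProtectKernelModules=yes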

Many of those are already present in systemd and more will be added. The systemd project has been working to make sandboxing more useful for services, Poettering said. He would like to see a distribution such as Fedora turn these features on for the services that it ships. Another area systemd will be working on is per-service firewalls and accounting.
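
As a point of reference, per-service accounting and firewalling directives along these lines did land in later systemd releases; the snippet below postdates the talk and is only a sketch of that direction:

    [Service]
    # Count bytes and packets sent and received by the service
    IPAccounting=yes
    # Drop all IP traffic except to and from the loopback addresses
    IPAddressDeny=any
    IPAddressAllow=localhost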

Unlike native or System V services, portable services will have to opt out of these sandboxing features if they don't support them. In fact, he said, if systemd were just starting out today, the sandboxing options would have been opt-out for native services as well, but it is too late for that now.

There are some hard problems that need to be solved to make all of this work. One is that Unix systems are not ready to handle dynamic user IDs. When a portable service gets started, an unprivileged user for the service gets created, but is not put into the user database (e.g. passwd file). If a file is created by this user and then the service dies, the file lingers with a user ID that is unknown to the system.

One way to handle that is to restrict services with a dynamic user ID from being able to write to any of the filesystems, so using that feature will require the ProtectSystem feature that mounts the system directories read-only. Those services will get a private /tmp and a directory under /run that they can use, but those will be tied to the lifecycle of the service. That way, the files with unknown (to the system) user IDs will go away when the service and user ID are gone.
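
A minimal sketch of such a unit, using directives that exist in systemd (the service name and binary are made up): DynamicUser= allocates the transient user, ProtectSystem=strict keeps it from writing to the system directories, and RuntimeDirectory= provides a writable directory under /run that goes away with the service:

    # Hypothetical service using a dynamic user
    [Service]
    # Allocate a transient UID/GID for the lifetime of the service
    DynamicUser=yes
    # Mount the whole file-system hierarchy read-only for the service
    ProtectSystem=strict
    # Private /tmp, removed when the service stops
    PrivateTmp=yes
    # Writable /run/foo, created at start and removed with the service
    RuntimeDirectory=foo
    ExecStart=/usr/bin/foo-daemon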

Dynamic users are currently working in systemd, which is a big step forward, Poettering said. Right now, new users are installed when an RPM for a service is installed, which doesn't really scale. Dynamic users make that problem go away since the user ID is consumed only while the service is running.

Another problem that people encounter is that if they try to install a service in a chroot() environment, they will need to copy the user database into the chroot(). The idea behind the PrivateUsers setting is to make chroot() work right. That setting will restrict the service to only three users: one is the dynamic user for the service and the other two are root and nobody. Most distributions agree on the user IDs for root and nobody, so that will help make portable services able to run on various distributions.
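
In unit-file terms, that corresponds to the existing PrivateUsers= boolean; a hedged sketch combining it with a dynamic user might look like this:

    [Service]
    # The service's own transient user
    DynamicUser=yes
    # User namespace that maps only root and the service's own user;
    # every other UID/GID shows up as nobody/nogroup
    PrivateUsers=yes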

D-Bus is incompatible with a chroot() environment because there is a need to drop policy files into the host filesystem. For now, that is an unsolved problem, but the goal is to move D-Bus functionality into systemd itself. That is something the project should have done a long time ago, Poettering said. The systemd D-Bus server would then use a different policy mechanism that doesn't require access to the host filesystem.

He stressed that systemd is not building its own Docker-like container manager; it is, instead, providing building blocks to take a native service and turn it into a portable one. So systemd will only have a simple delivery mechanism that is meant to be used by developers for testing, not for production use. Things like orchestration and cluster deployment are out of scope for systemd, Poettering declared.

He showed a few examples of his vision of using the systemctl command to start, stop, and monitor portable services on the local host or a remote one, though it was not a demo. It was not entirely clear from the talk how far along things are for portable system services. The overall goal is to take existing systemd services, add resource bundling and sandboxing, and make "natural extensions to service management" to support the portable services idea, he said.
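
The specific commands were not spelled out, but the workflow he described maps onto systemctl operations that already exist; a rough, hypothetical sketch (the unit name is made up, and -H simply talks to a remote systemd over SSH) might look like:

    # Start, inspect, and stop a (hypothetical) portable service locally
    systemctl start foo-portable.service
    systemctl status foo-portable.service
    systemctl stop foo-portable.service

    # The same operations against a remote host over SSH
    systemctl -H admin@example.com start foo-portable.service
    systemctl -H admin@example.com status foo-portable.service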

Various questions were asked at the end. For example, updating the services is something that is out of scope for systemd. That can be handled with some other tool, but the low-level building blocks will be provided by systemd. Another question concerned configuration. Different configurations of a service would require building a different bundle with those changes, as the assumption is that configuration gets shipped in the bundle.

Having the security be opt-out is useful, but how will additions to the security restrictions be handled? Existing services could break under stricter rules. Poettering said that it was something he was aware of, but had not come up with a solution for yet. He wants to start with a powerful set of restrictions out of the box, but perhaps defining a target security level for a particular service could help deal with this backward incompatibility problem.

[ Thanks to LWN subscribers for supporting my travel to Santa Fe for LPC. ]


Index entries for this article
Conference: Linux Plumbers Conference/2016



/tmp

Posted Nov 10, 2016 10:16 UTC (Thu) by epa (subscriber, #39769) [Link] (7 responses)

Giving each service its own /tmp is a big step forward, but why don't we have per-user /tmp as standard already?

/tmp

Posted Nov 10, 2016 10:37 UTC (Thu) by mezcalero (subscriber, #45103) [Link] (3 responses)

Because that means you have to run each user in her/his own file system namespace whose root dir has mount propagation turned off towards the host's file system mount namespace (after all, you don't want the private /tmp mount to end up being in effect for the rest of the system). But if you do that, most relevant mount commands don't have an effect on the host anymore, even when run with sudo. And that's just plain annoying.

Moreover some software (mis-)uses /tmp as a place for communication primitives, most prominently X11, which hence makes it problematic to make the host's version of the dir unavailable to normal users.

/tmp

Posted Nov 10, 2016 12:04 UTC (Thu) by epa (subscriber, #39769) [Link] (2 responses)

I should have been clearer that you could get the same effect by removing global /tmp altogether and setting TMPDIR for each user, patching the occasional bit of software that hardcodes the path /tmp (or even working around it in the C library). But you have a point about X11 and other software expecting files in /tmp to be shared between users.

/tmp

Posted Nov 11, 2016 3:50 UTC (Fri) by wahern (subscriber, #37304) [Link] (1 responses)

If patching were an acceptable solution, systemd, Docker, etc. wouldn't be as popular as they are. Fixing problematic software without having to, you know, fix it, is their raison d'ĂȘtre. Which is ironic, given such software is invariably open source or actually authored by the team. Being able to cleanly fix problematic software without having to resort to complex proprietary management and tooling stacks was supposed to be the benefit of open source. Not only were you able to fix and upstream problems in particular applications, but you could help refactor the ecosystem, such as by judicious modifications or additions to libraries or the kernel, spreading the burden more efficiently.

But maybe it was really just about cost after all. And maybe industry will always gravitate toward such approaches because it's easier to buy (or download or otherwise acquire) an external solution than to fix the software, even if the software is open source, and especially when the solution holds out the promise of fixing the problem across the board, notwithstanding that such promises invariably fall short of that mark.

And that would be fine, except that such approaches de-emphasize and detract from the real need to actually fix and improve software. And that piling layers of intermediating software atop actual applications adds substantial complexity, creating more motivation for "simplifying" software. People don't want to create Debian or RPM packages so they create Docker, which necessitates packaging and distributing those Docker images, basically reinventing the entire Debian and RPM infrastructure. (Which isn't a defense of Debian or RPM, just that people aren't really solving the underlying pain points, instead vainly trying to avoid them.)

Or we don't want to call chroot() and setuid() from main() (controlled via getopt or environ), so that gets shunted to init, which now has even more problems to solve because it has to be much more general and doesn't have the advantage of being able to run from the most convenient and natural point in the application's initialization sequence; and the most general and inevitable solution is to duplicate all the various namespaces, even though most of it will be superfluous or unnecessary. And over time we get more clever at duplicating these namespaces (paravirtualization to virtualization to containers), adding increasingly complex mechanisms to the system and the kernel. It's no surprise the kernel and system software have become so trivial to exploit, despite the advent of increasingly powerful hardware enforcement mechanisms. The complexity is outracing the mitigations, even though open source was supposed to permit us to put a stop to that cycle by _fixing_ and _improving_ software, rather than being forced to work around broken software.

It's turtles all the way down. How many can we stack?

/tmp

Posted Nov 18, 2016 18:06 UTC (Fri) by drag (guest, #31333) [Link]

The reality is that how you solve these issues is to create better solutions and make them available to people and hope they adopt them. Then when they try to use them you get feedback on what works and what does not and then fix it. This requires short development cycles and rapid releases which necessitates incremental solutions.

Trying to solve the world's problems in one big open source push is not an option, and neither is ignoring problems and pretending everything is hunky-dory.

5-year plans don't work. Also, people are not as ignorant of the benefits of deb or rpm packaging as you may imagine. Although you seem to be ignorant of their problems.

/tmp

Posted Nov 17, 2016 8:58 UTC (Thu) by Wol (subscriber, #4433) [Link] (2 responses)

> but why don't we have per-user /tmp as standard already?

I was using exactly that on a Univac 1100 back in the 80s ...

The trouble is there is too much tradition baked into Unix/Linux. I'd love to get rid of all the dot-files in ~ and shove them in a directory such as ~/.etc; it would be nice to do a tmpfs mount of a ~/tmp directory at login time, etc. etc.

But how much work would it be to change all those programs out there so that they know and expect such things?

Cheers,
Wol

/tmp

Posted Nov 17, 2016 17:29 UTC (Thu) by joib (subscriber, #8541) [Link] (1 responses)

> I'd love to get rid of all the dot-files in ~, shove them in a directory such as ~/.etc

There's $XDG_CONFIG_HOME (defaults to ~/.config), which is used by a lot of modern stuff.

/tmp

Posted Nov 18, 2016 3:19 UTC (Fri) by Jonno (subscriber, #49613) [Link]

>> I'd love to get rid of all the dot-files in ~, shove them in a directory such as ~/.etc
> There's $XDG_CONFIG_HOME (defaults to ~/.config), which is used by a lot of modern stuff.

Just please note that shoving everything that has traditionally been put in dot-directories in ~ into XDG_CONFIG_HOME is not appropriate! But between XDG_CONFIG_HOME (~/.config), XDG_DATA_HOME (~/.local/share), XDG_CACHE_HOME (~/.cache), and XDG_RUNTIME_DIR (/run/user/$(id -u)), every need should be covered...

Portable system services

Posted Nov 10, 2016 13:08 UTC (Thu) by gebi (guest, #59940) [Link] (1 responses)

... "the assumption is that configuration gets shipped in the bundle"

no, please don't!
The one thing we have, or, better, should have, learned from the container world is that having the configuration mixed into the container/image/..., or anything that forces you to rebuild the container just because you are running it on different infrastructure, is a really bad idea.

The core premise should be that I can just copy the container from one server to another, tweak a few configuration bits, and have it running with the _same_ image/container/...

For an overview on what feature set would be nice:
http://kubernetes.io/docs/user-guide/configmap/
http://kubernetes.io/docs/user-guide/secrets/

Portable system services

Posted Nov 10, 2016 14:38 UTC (Thu) by micka (subscriber, #38720) [Link]

I don't think that means you need to rebuild the service "container" to change the configuration.
I think the standard approach for many things related to systemd is to have a default config shipped with the package, with overrides in /etc (and then perhaps further overrides for users).

Portable system services

Posted Nov 10, 2016 14:35 UTC (Thu) by mcatanzaro (subscriber, #93033) [Link]

"He would like to see a distribution such as Fedora turn these features on for the services that it ships."

This seems uncontroversial. Somebody just has to step up and do the work. Preferably upstream; Fedora should only mess with service files that are not upstream for whatever reason.

vs flatpak vs AppImage vs Snappy

Posted Nov 11, 2016 13:30 UTC (Fri) by NAR (subscriber, #1313) [Link] (3 responses)

How does this solution compare to flatpak, AppImage, or Snappy? My (very superficial) understanding is that these all provide sandboxed (containerized) applications.

vs flatpak vs AppImage vs Snappy

Posted Nov 17, 2016 7:32 UTC (Thu) by lindbergio (guest, #110789) [Link] (1 responses)

I might be wrong on this, but I think Snappy and Flatpak only provide the specific libraries required for an application, while an application inside a container has a completely isolated environment including its own libraries and binaries. So basically a container has its own "OS", utilizing namespace features such as IPC, network, PID, mount, and more. The container also shares the kernel with the host.

vs flatpak vs AppImage vs Snappy

Posted Nov 17, 2016 8:00 UTC (Thu) by alexl (subscriber, #19068) [Link]

Flatpak definitely uses all those features, as well as shipping all the dependencies (up to and including libc). That's not a full "OS" though, as it doesn't have the host support stuff like init, hardware support, the kernel, logging, etc. Just the libraries.

Snappy is somewhat similar, but it doesn't use namespaces; instead it uses multiple prefixes that get set up with env-var munging plus AppArmor to hide away things the app shouldn't be able to access.

vs flatpak vs AppImage vs Snappy

Posted Nov 28, 2016 18:48 UTC (Mon) by mildred593 (guest, #107325) [Link]

Flatpak is only targeted at desktop applications. You need to define a .desktop file and you get integration with the desktop. It cannot install system services.

Dynamic user id

Posted Nov 17, 2016 9:01 UTC (Thu) by Wol (subscriber, #4433) [Link]

The obvious solution, it seems to me, is to put the user ID in the system passwd file; then the systemd container uses it in the chrooted environment.

Maybe add a config id "DynamicUser=name", telling systemd to read "name" from the passwd file to get the id to use.

I was thinking to use "DynamicID=nnn" but that's actually a bad idea - it's too open to misuse (of the "I can't be bothered to do it properly" kind which is almost worse than deliberate abuse).

Cheers,
Wol

Portable system services

Posted Nov 18, 2016 14:18 UTC (Fri) by jond (subscriber, #37669) [Link]

> To start with, unlike Docker, systemd does not want to be in the business of defining its own resource bundling format, so it will use a simple directory tree in a tarball, subvolume, or GUID Partition Table (GPT) image with, say, SquashFS inside.

I honestly can't see the difference here. Docker uses a tar archive with some JSON metadata and has a well-defined image specification. I can't see how some other scheme using a different archiving system (or not), some other markup language (or not), and a different metadata layout (or not) is an improvement.

Opt-out security

Posted Nov 28, 2016 18:53 UTC (Mon) by mildred593 (guest, #107325) [Link]

systemd services could always define a key that changes the default security settings:

Isolate=None|All|...

Select the default isolation level. None is the default and means that there is no isolation by default. Any isolation feature must be opt-in. All will trigger all isolation features. If you specify a systemd version instead, all isolation features present in this systemd version will be used. Newer isolation features will be left out.


Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds