
Leading items

Welcome to the LWN.net Weekly Edition for February 22, 2018

This edition contains the following feature content:

  • New tricks for XFS: Dave Chinner on bringing snapshots and subvolumes to an old filesystem.
  • Licenses and contracts: are free-software licenses also contracts? The answer depends on the jurisdiction.
  • BPF comes to firewalls: the newly announced bpfilter packet-filtering mechanism.
  • Dynamic function tracing events: a proposed tracing mechanism that sidesteps ABI concerns.
  • The boot-constraint subsystem: preserving the bootloader's device configuration until drivers take over.
  • An overview of Project Atomic: a Linux distribution built for immutable, container-based cloud hosts.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

New tricks for XFS

By Jake Edge
February 21, 2018

linux.conf.au

The XFS filesystem has been in the kernel for fifteen years and was used in production on IRIX systems for five years before that. But it might just be time to teach that "old dog" of a filesystem some new tricks, Dave Chinner said, at the beginning of his linux.conf.au 2018 presentation. There are a number of features that XFS lacks when compared to more modern filesystems, such as snapshots and subvolumes; but he has been thinking—and writing code—on a path to get them into XFS.

Some background

XFS is the "original B-tree filesystem", since everything that the filesystem stores is organized in B-trees. They are not actually traditional B-trees, though; rather, they are a form of B* tree in which each node has a sibling pointer, allowing horizontal traversal of the tree. That kind of traversal is important when looking at features like copy on write (CoW).

An XFS filesystem is split into allocation groups, "which are like mini-filesystems"; they have their own free-space index B-trees, inode B-trees, reverse-mapping B-trees, and so on. File data is referenced by extents, with the help of B-trees. "Directories and attributes are more B-trees"; the directory B-tree is the most complex as it is a "virtually mapped, multiple index B-tree with all sorts of hashing" for scalability.

XFS uses write-ahead journaling for crash resistance. Its journaling is checkpoint-based, which is meant to reduce the write amplification that can result from changing blocks that are already in the journal.

He followed that with a quick overview of CoW filesystems. When a CoW filesystem writes to a block of data or metadata, it first makes a copy of it; in doing so, it needs to update the index tree entries to point to the new block. That leads to modifying the block that holds those entries, which necessitates another copy, thus a modification to the parent index entry, and so on, all the way up to the root of the filesystem. All of those updates can be written together anywhere in the filesystem, which allows lots of optimizations to be done. It also provides consistent on-disk images, since the entire update can be written prior to making an atomic change to the root-level index.

All of that is great for crash recovery, he said, but the downside is that it requires that space be allocated for these on-disk updates. That allocation process requires metadata updates, which means a metadata tree update, thus more space needs to be allocated for that. That leads to the problem that the filesystem does not know exactly how much space is going to be needed for a given CoW operation. "That leads to other problems in the future."

These index tree updates are what provide many of the features that are associated with CoW filesystems, Chinner said, "sharing, snapshots, subvolumes, and so on". They are all a natural extension of having an index tree structure that reference-counts objects; that allows multiple indexes to point to the same object by just increasing the reference count on it. Snapshots are simply keeping around an index tree that has been superseded; that can be done by taking a reference to that tree. Replication is done by creating a copy of the tree and all of its objects, which is a complicated process, but "does give us the send-receive-style replication" that users are familiar with.

CoW in XFS is different. Because of the B* trees, it cannot do the leaf-to-tip update that CoW filesystems do; it would require updating laterally as well, which in the worst case means updating the entire filesystem. So CoW in XFS is data-only.

Data-only CoW limits the functionality that XFS can provide; features like deduplication and file cloning are possible, but others are not. The features it does provide are useful for projects like overlayfs and NFS, Chinner said. The advantage of data-only CoW is that there is no impact on non-shared data or metadata. In addition, XFS can always calculate how much space is needed for a CoW operation because only the data is copied; the metadata is updated in place.

But, since the metadata updates are not done with CoW, crash resiliency is a bit more difficult—it is not a matter of simply writing a new tree branch and then switching to it atomically. XFS has implemented "deferred operations", which are a kind of "intent logging mechanism", Chinner said. Deferred operations were used for freeing extents in the past, but have been extended to do reference-count and reverse-mapping B-tree updates. That allows replaying CoW updates as part of recovery.

What is a subvolume?

Thinking about all of that led Chinner to a number of questions about what can be done with data-only CoW. Everyone seems to want subvolume snapshots, but that seems to require CoW operations for metadata. How can the problem be repackaged so that there is a way to implement the same functionality? That is the ultimate goal, of course. He wondered how much of a filesystem was actually needed to implement a subvolume. There are other implementations to look at, so we can learn from them, he said. "What should we avoid? What do they do right?" The good ideas can be stolen—copied—"because that's the easy way".

Going back to first principles, he asked: "what is a subvolume? What does it provide?" From what he can tell, there are three attributes that define a subvolume. It has flexible capacity, so it can grow or shrink without any impact. A subvolume is also a fully functioning filesystem that allows operations like punching holes in files or cloning. The main attribute, though, is that a subvolume is the unit of granularity for snapshots. Everything else is built on top of those three attributes.

He asked: could subvolumes be implemented as a namespace construct that sits atop the filesystem? Bind mounts and mount namespaces already exist in the VFS; he wondered whether those could be used to create something that "looks like and smells like a subvolume". Adding a directory-hierarchy quota on top of a bind mount results in a kind of flexible space management. If you "squint hard enough", that is something like a subvolume, he said.

Similarly, a recursive copy operation using --reflink=always can create a kind of snapshot. It still replicates the metadata, but the vast majority of the structure has been cloned without copying the data. Replication can be done with rsync and tar; "sure, it's slow", but there are tools to do that sort of thing. It doesn't really resemble a Btrfs subvolume, for example, but it can still provide the same functionality, Chinner said. In addition, overlayfs copies data and replicates metadata, so it shows that you can provide "something that looks like a subvolume using data-only copy on write".

Another idea might be to implement the subvolume below the filesystem with a device construct of some sort. In fact, we already have that, he said. A filesystem image can be stored in a sparse file, then loopback mounted. That image file can be cloned with data-only CoW, which allows for fast snapshots. The space management is "somewhat flexible", but is limited by what the block layer provides and what filesystems implement. Replication is a simple file copy.
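None of this requires new kernel code; as a rough sketch (generic commands, not taken from Chinner's talk), such a "subvolume" can be created and snapshotted today on a host filesystem that supports reflinks:

    # Create a sparse image file and put a filesystem in it.
    truncate -s 10G /vol/subvol.img
    mkfs.xfs /vol/subvol.img

    # Loopback-mount the image; the mount point is the "subvolume".
    mount -o loop /vol/subvol.img /srv/container1

    # Snapshot it: freeze the filesystem in the image, clone the image
    # file with data-only CoW, then unfreeze.
    xfs_freeze -f /srv/container1
    cp --reflink=always /vol/subvol.img /vol/subvol-snap.img
    xfs_freeze -u /srv/container1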

What this shows "is that what we think of as a subvolume, we're already using", Chinner said. The building blocks are there, they are just being used in ways that do not make people think of subvolumes.

The loopback filesystem solution suffers from the classic ENOSPC problem, however. If the filesystem that holds the image file runs out of space, it will communicate that by returning ENOSPC, but the filesystem inside the image will not be prepared to handle that failure and things break horribly: "blammo!". This is the same problem that thin provisioning has. It is worse than the CoW filesystem ENOSPC problem, because you can't predict when it will happen and you can't recover when it does, he said.

He returned to the idea of learning from others at that point. Overlayfs and, to a lesser extent, Btrfs have taught us that specifying subvolumes via mount options is "really, really clunky", Chinner said. Btrfs subvolumes share the same superblock, which can cause some subtle issues about how they are treated by various tools like find or backup programs. A subvolume needs to be implemented as an independent VFS entity and not just act like one. "There's only so much you can hide by lying."

The ENOSPC problem is important to solve. The root of the problem is that upper and lower volumes (however defined) have a different view of free-space availability and those two layers do not communicate about it. This problem has been talked about many times at LSFMM (for example, in 2017 and in 2016) without making any real progress. But a while back, Christoph Hellwig came up with a file layout interface for the Parallel NFS (pNFS) server running on top of XFS; it allowed the pNFS client to remotely map files from the server and to allocate blocks from the server. The actual data lives elsewhere and the client does its reads and writes to those locations; so the client is doing its filesystem allocation on the server and then doing the I/O to somewhere else. This provides a model for a cross-layer communication of space accounting and management that is "very instructive".

A new kind of subvolume

He has been factoring all of this into his thinking on a new type of subvolume; one that acts the same as the subvolumes CoW filesystems have, but is implemented quite differently. The kernel could be changed so that it can directly mount image files (rather than via the loopback device) and a device space-management API could be added. If a filesystem implements both sides of that API, image files of the same filesystem type can be used as subvolumes. The API can be used to get the mapping information, which will allow the subvolume to do its I/O directly to the host filesystem's block device. This breaks the longstanding requirement that filesystems must use block devices; with his changes, they can now use files directly.

But this mechanism will still work for block devices, which will make it useful for thin provisioning as well. The thin-provisioned block device (such as dm-thin) can implement the host side of the space-management API; the filesystem can then use the client-side API for space accounting and I/O mapping. That way the underlying block device will report ENOSPC before the filesystem has modified its structures and issued I/O. That is something of a bonus, he said, but if his idea solves two problems at once, that gives him reason to think he is on the right track.

Snapshots are "really easy in this model". The subvolume is frozen and the image file is cloned. It is fast and efficient. In effect, the subvolume gets CoW metadata even though its filesystem does not implement it; the data-only CoW of the filesystem below (where the image file resides) provides the metadata CoW.

Replication could be done by copying the image files, but there are better ways to do it. Two image files can be compared to determine which blocks have changed between two snapshots. It is quite simple to do and does not require any knowledge of what is in the files being replicated. He implemented a prototype using XFS filesystems on loopback devices in 200 lines of shell script using xfs_io. "It's basically a delta copy" that is independent of what is in the filesystem image; if you had two snapshots of ext4 filesystems, the same code would work, he said.
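The prototype script has not been published, but the basic shape of such a delta copy is easy to sketch (a hypothetical example; xfs_io's fiemap command reports a file's logical-to-physical extent mapping):

    # Extents that are still shared between the two snapshots map to the
    # same physical blocks; any difference marks a logical range that was
    # rewritten after the first snapshot was taken.
    xfs_io -r -c "fiemap -v" snap1.img > snap1.map
    xfs_io -r -c "fiemap -v" snap2.img > snap2.map
    diff snap1.map snap2.map

    # Only the differing logical ranges then need to be copied into the
    # replica of snap1.img to turn it into a replica of snap2.img.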

There are features that people are asking for that the current CoW filesystems (e.g. Btrfs, ZFS) cannot provide, but this new scheme could. Right now, there is a lot of data shared between files on disk that is not shared once it gets to the page cache. If you have 500 containers based on the same golden image, you can have multiple snapshots being used but each container has its own version of the same file in the cache. "So you have 500 copies of /bin/bash in memory", he said. Overlayfs does this the right way since it shares the one cached version of the unmodified Bash between all of the containers.

His goal is to get that behavior for this new scheme as well. That requires sharing the data in shared extents in the page cache. It is a complex and difficult problem, Chinner said, because the page cache is indexed by file and offset, whereas the only information available for the shared extents is their physical location in the filesystem (i.e. the block number). Instead of doing an exhaustive search in the page cache to see if a shared extent is cached, he is proposing adding a buffer cache that is indexed by block number. XFS already has a buffer cache, but it doesn't have a way to share pages between multiple files. Chinner indicated that Matthew Wilcox was working on solving that particular problem; that solution would be coming "maybe next week", he said with a grin.

For a long time people have been saying that you don't need encryption for subvolumes because containers are isolated, but then came Meltdown and Spectre, which broke all that isolation. He thinks that may lead some to want more layers of defense to make it harder to steal their data when that isolation breaks down. Adding the generic VFS file-encryption API to XFS will allow encrypting the image files and/or individual files within a subvolume. There might be something to be gained by adding key management into the space-management API as well.

It is looking like XFS could offer "encrypted, snapshottable, cloned subvolumes with these mechanisms", Chinner said. There is still a lot of work to do to get there, of course; it is still in its early stages.

The management interface that will be presented to users is not nailed down yet; he has been concentrating on getting the technology working before worrying about policy management. How subvolumes are represented, what the host volume looks like to users, and whether everything is a subvolume are all things that need to be worked out. There is also a need to integrate this work with tools like Anaconda and Docker.

None of the code has had any review yet; it all resides on his laptop and servers. Once it gets posted, there will be lots of discussion about the pieces he will need to push into the kernel as well as the XFS-specific parts. There will probably be "a few flame wars around that, a bit of shouting, all the usual melodrama that goes along with doing controversial things". He recommended popcorn.

He then gave a demo (starting around 36:56 in the YouTube video of the talk) of what he had gotten working so far. It is a fairly typical early stage demo, but managed to avoid living up to the names of the subvolume and snapshot, which were "blammo" and "kaboom".

After the demo, Chinner summarized the talk (and the work). He started out by looking at how to get the same functionality as subvolumes, but without implementing copy on write for metadata. The "underlying revelation" was to use files as subvolumes and to treat subvolumes as filesystems. That gives the same functionality as a CoW filesystem for that old dog XFS.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Sydney for LCA.]

Comments (35 posted)

Licenses and contracts

February 21, 2018

This article was contributed by Tom Yates


FOSDEM

Some days it seems that wherever two or more free-software enthusiasts gather together, there also shall be licensing discussions. One such discussion, which can get quite heated, is whether a given free-software license is a license or whether it is really a contract. This distinction is important, because most legal systems treat the two differently. I know from personal experience that such a discussion can go on, unresolved, for long periods, but it had not previously occurred to me to wonder whether this might be due to the answer being different in different jurisdictions. Fortunately, it has occurred to some lawyers to wonder just that, and three of them came together at FOSDEM 2018 to present their conclusions.

The talk was given by Pamela Chestek of Chestek Legal, Andrew Katz of Moorcrofts, and Michaela MacDonald of Queen Mary University of London. Chestek focused on the US legal system, Katz on that of England and Wales, while MacDonald focused on the civil law tradition that is characteristic of many EU member states. The four licenses they chose to consider were the "Modified" or "three-clause" BSD, the Apache License, the GNU General Public License (their presentation was not specific to GPLv3, but the passage they quoted to make a point was from GPLv3), and the Fair License. The first three are among the most common free-software licenses currently in use. The Fair License is the shortest license the Open Source Initiative has ever approved, and though it is used by hardly any free software, it was included as an example of the maximum possible simplicity in a license.

US considerations

Chestek, speaking first, said that in the US a license is a grant of permission, in this case permission to use copyrighted material. It can be "express" (explicit), either in writing or orally, or implicit in the rights-holder's conduct. It is a defense against a charge of copyright infringement; if you are accused of infringement, you can say "I have a license", and as long as your behavior is within any license conditions, you're not an infringer.

[Pamela Chestek]

A contract, however, is different. In its basic form, it is an agreement between two parties, with an offer by one party, an acceptance of that offer by the other party, and mutual consideration (which means that both parties must gain a definable benefit from the relationship). Some people claim that a free-software license cannot be a contract simply because of the absence of any consideration to the licensor. Chestek, however, said there is jurisprudence in the US establishing that the increases in market share, reputation, and product quality that result from offering your software under a free license are all valid potential considerations.

Chestek introduced here the term "bare license", which was to crop up throughout the talk. While noting that it is not a term commonly in use in the US, it is helpful in comparing jurisdictions. She defined it as a license that contains a rights grant and nothing else, no language that might incline a court toward treating it as a contract. Typically, however, a license will have more than the mere rights grant in it. Often it will include some conditions; then the important issue is which of those are conditions on the license and which are other requirements in the agreement. Failure to meet the former means you don't have a copyright defense; the presence of the latter may push a court toward treating the license as a contract. In deciding whether a particular requirement is a condition on the license, a court will look carefully at the wording. So when section 5 of GPLv3 says:

You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions

the requirements that follow are clearly established as conditions on the license. A simple requirement not to disparage the licensor, however, is not such a condition.

It is possible, said Chestek, that the Fair License is a bare license. It says simply:

Usage of the works is permitted provided that this instrument is retained with the works, so that any entity that uses the works is notified of this instrument. DISCLAIMER: THE WORKS ARE WITHOUT WARRANTY.

The warranty disclaimer worries her, though, as it may not be part of the rights grant. But, of the four licenses under consideration, it is the only one that she thinks might be a bare license and not also a contract. In the case of the three-clause BSD license, clause three may be troublesome: "The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission." She thinks a court would most likely treat that clause as a contractual obligation rather than a condition on the license. This was, she admitted to a later questioner, a narrow reading, but clause three lacked the "magic language" that courts look for in license conditions. Furthermore, there is some law in the US that requires license conditions to apply to the copyright grant so, for example, a requirement not to use the author's name in advertising has nothing to do with the copyrights and thus cannot be a condition on the license. In response to a follow-up question, she felt that even the "FreeBSD" or "two-clause" BSD license, which removes the problematic clause three, would still be regarded as having contractual elements.

In short, under US law, with the possible exception of the Fair License, all the licenses under consideration are likely to be regarded as contracts.

England, Wales, and, perhaps, Northern Ireland

[Andrew Katz]

Katz then considered the question in the context of the law of England and Wales, noting that a colleague had opined that the analysis would likely apply in Northern Ireland also. Scotland is more of a civil law jurisdiction and the analysis would not apply there. As is so often the case in English matters, this presentation involved a certain amount of history.

A bare license, from an English law perspective, is a promise not to enforce certain rights, such as copyright, that the licensor may have. If this promise is not enshrined in a contract—for free software it usually is not—the person making the promise may withdraw it at any time. So English courts a long time ago came up with the principle of estoppel, which says that if someone makes a promise, and you rely on that promise, they may not later revoke that promise.

Katz then presented a Victorian case usually known as Carbolic Smoke Ball. The plaintiff, Mrs. Carlill, bought one of these balls, which were guaranteed to be so effective against influenza that anyone who used the ball as directed and got the flu would be paid £100. As Katz said, "on the upside, the chemicals in the ball failed to kill her. On the downside, they also failed to kill the virus, and she caught influenza. Probably even more unfortunately, they failed to kill her husband, who was a trial lawyer". She sued for her £100 ("which was a lot of money in those days, before the Brexit vote") and won, but it went to appeal. In a judgment which is still studied by law students today, the Court of Appeal upheld the verdict, and in so doing extended contract law. Because estoppel provides no basis for an action, the court developed the idea of a unilateral contract, where one side makes a promise and the other party acts in such a way as to make it binding.

So the courts have shown willingness to construe contracts: that is, to behave as if a contract existed where none was actually signed. Furthermore they have also shown a willingness to look at a contract that does exist, and behave as if it contained terms that it does not, again in the furtherance of justice. However, just because the courts have shown a willingness to do so when there is a need to, it does not mean they are willing to do it everywhere. The limits are shown in Robin Ray v. Classic FM, where the court said that if it was going to imply terms into a contract, it could do so only to the extent they were necessary, and no further. On this authority, Katz argued that the same principle applied to implying a unilateral contract: if a bare license would obtain the full effect required, there is no need to imply a contract.

On that basis, he was minded to say that English law would hold that the Fair License, Apache, and BSD licenses were just licenses, while the GPL might be held to be a contract.

Civil law in the EU

[Michaela MacDonald]

MacDonald then looked at the question from an EU perspective. The main difference is that software licenses, whether proprietary or free, would be interpreted as enforceable bilateral contracts. The idea of a bare license as something other than a contract is fairly alien to the civil law tradition. This means that a free-software license must meet all the requirements of a contract to be enforceable; it needs an offer and an acceptance, though consideration is not relevant. Unlike common law, the civil law tradition focuses on the obligations of the contracting parties, rather than their promises.

Her conclusion was that all four licenses would be interpreted as contracts in civil law jurisdictions. In more than one case, German courts have held that the GPL is enforceable as a contract.

Due to time pressure, six slides were then presented very quickly, showing how bare licenses, unilateral contracts, and bilateral contracts were treated in the three jurisdictions with regard to six specific legal aspects. It is the differences in how the law treats licenses and contracts in each jurisdiction that makes the question of whether free-software licenses are also contracts so important.

The first aspect was taxonomy — principally, how acceptance is indicated. The second was the revocability of the license. Third-party beneficiary rights — whether someone other than the licensor can sue a licensee for failure to perform their obligations — was the third. The fourth was about specific performance; if a licensee fails to perform an act required by the license, whether you can ask a court to order them to perform it, or whether your only available remedy is damages in compensation. The fifth followed on, being about the award of legal costs: if you sue someone for violating your license and you win, can you get the other party to pay your legal costs? The sixth and last aspect was whether the licensor could lawfully exclude liability.

Given what conclusions were drawn about the status of free-software licenses in the various jurisdictions, the implications for the treatment of those licenses are as follows:

  • In the US, at least for the frequently-used free software licenses, acceptance is by conduct. The license is irrevocable once even partial performance has occurred. Third-party rights depend on the wording. Orders for specific performance are rarely granted but are theoretically possible. You won't get legal costs awarded, and exclusion of liability is likely to be valid.
  • In the UK, where many free-software licenses are probably just bare licenses, no acceptance is required for a defense against an infringement claim. The license is revocable by the licensor until the licensee has relied on it. No third-party rights exist. Orders for specific performance are not available. The award of legal costs is discretionary but may happen, and liability exclusion is generally enforceable.
  • In the EU, acceptance can be by any means, including conduct. The license is probably irrevocable. Third-party rights do exist, and specific performance is available. Nothing was said about legal costs, and liability exclusion is possible but is subject to the requirements of proportionality and consumer protection.

I'm a system administrator. I know beyond all doubt that many systems questions that seem simple on the surface are complex in both depth and width when fully examined, though I don't generally have to live with the added complexity of my free software changing its behavior depending on which country I'm running it in. It was a useful lesson to be reminded that questions that leak out of my field of expertise into others' do not magically become simple in so doing, and that "common sense says ..." is not a good legal argument.

Video of the talk and the slides from it are available here.

[We would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Brussels for FOSDEM.]

Comments (12 posted)

BPF comes to firewalls

By Jonathan Corbet
February 19, 2018
The Linux kernel currently supports two separate network packet-filtering mechanisms: iptables and nftables. For the last few years, it has been generally assumed that nftables would eventually replace the older iptables implementation; few people expected that the kernel developers would, instead, add a third packet filter. But that would appear to be what is happening with the newly announced bpfilter mechanism. Bpfilter may eventually replace both iptables and nftables, but there are a lot of questions that will need to be answered first.

It may be tempting to think that iptables has been the kernel's packet-filtering implementation forever, but it is a relative newcomer, having been introduced in the 2.4.0 kernel in 2001. Its predecessors (ipchains, introduced in 2.2.10, and ipfwadm, which dates back to 1.2.1 in 1995) are mostly forgotten at this point. Iptables has served the Linux community well and remains the firewalling mechanism that is most widely used, but it does have some shortcomings; it has lasted longer than the implementations that came before, but it is clearly not the best possible solution to the problem.

The newer nftables subsystem, merged for the 3.13 kernel release in early 2014, introduced an in-kernel virtual machine to implement firewall rules; users have been slowly migrating over, but the process has been slow. For some strange reason, system administrators have proved reluctant to throw away their existing firewall configurations, which were painful to develop and which still function as well as they ever did, and start over with a new and different system.

Still, it was logical to assume that nftables would eventually take over, especially as the iptables compatibility layers improved. Some people started to doubt this story, though, when serious development started on the BPF virtual machine. There seemed to be a lot of overlap between the two virtual machines, and BPF was being quickly extended in ways that improved its performance, functionality, and security. Even so, nftables development has continued, and there has been little talk — until now — of pushing BPF into the core of the firewalling code.

Bringing in BPF

The announcement of bpfilter changes that situation, though. In short, bpfilter enables the creation of BPF programs that can be attached to points in the network packet path and make filtering decisions. In the proof-of-concept patches, those programs are attached at the express data path (XDP) layer, where they are run from the network-interface drivers. But, as Daniel Borkmann noted in the introduction to the patches, BPF programs could be just as easily attached at any other point in the path, allowing them to make decisions at the same points that iptables rules do.

There are a number of advantages claimed for the bpfilter approach. BPF programs can be just-in-time compiled on most popular architectures, so they should be quite fast. The work that has been done to enable the offloading of XDP-level programs to the network interface itself can come into play here, moving firewall processing off the host CPU entirely. The use of BPF enables the writing of firewall rules in C, which may appeal to some developers who are starting from the beginning. And firewall code would be subject to the BPF verifier, adding a layer of security to the whole system.
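As a purely illustrative example (this is a generic XDP program, not code from the bpfilter patches), a hand-written rule that drops all incoming UDP packets might look something like this in C:

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/in.h>
    #include <bpf/bpf_endian.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int drop_udp(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph;

        /* Bounds checks are required to satisfy the BPF verifier. */
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
            return XDP_PASS;

        return iph->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";

In the bpfilter scheme, of course, administrators would not normally write such programs by hand; they would be generated from existing iptables rules by the translation code described below.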

One of the core design features for bpfilter is the ability to translate existing iptables rules into BPF programs. This feature is intended to make it easy for existing firewall configurations to be moved over to the new scheme, perhaps without system administrators even knowing that it is happening. This translation is done in an interesting manner. Iptables rules are passed to the kernel, so the kernel must take responsibility for doing that work, but the task can be a complex one that would benefit from a user-space implementation.

To enable such an implementation, the bpfilter developers have created a new mechanism that supports the creation of a special type of kernel module to handle this kind of task. These modules would be part of the kernel and would be shipped by distributors as just another .ko file, but they would contain an ordinary ELF executable. After the module has been loaded, its code can be run in a separate user-space process; all that is required is a call to a special version of call_usermodehelper().

This mechanism allows the translation code to be managed as if it were just another part of the kernel. That code can be developed in user space, though. When it runs, the translation code will be separated from the kernel, making it harder to attack the kernel via that path. If this mechanism catches on, one can imagine that a number of other tasks could eventually be pushed out of the kernel proper into one of these special user-space modules. Developers should be careful, though; this could prove to be a slippery slope leading toward something that starts to look like a microkernel architecture.

Early responses

There have not been a whole lot of comments thus far on the code itself. That may be partly because, in their haste to get a proof of concept out to illustrate the idea, the developers never quite got around to writing comments in the code — or even changelogs for the patches. The idea itself, though, has raised concerns for some developers.

Harald Welte, who is not often seen in this community these days, showed up with a number of questions. At the top of his list was the decision to emulate iptables rules with the new BPF mechanism. If the new subsystem is to ever replace the iptables implementation, it will need to implement exactly the same behavior; small and subtle differences could introduce security problems into deployed firewall configurations. Given the complexity of iptables, the chances of such differences happening are significant.

More fundamentally, the networking developers have wanted to phase out iptables and its user-space interfaces for some time. Iptables has not aged entirely well. For example, there is no way to add or replace a single rule (or small set of rules); iptables can only wipe out the entire configuration and start from scratch. That makes firewall changes expensive; it also gets difficult to coordinate changes when they are being made by multiple actors at once. The increasing use of containers has created just this kind of situation; addressing this problem requires moving away from the iptables API. The fact that iptables requires separate rule sets for IPv4 and IPv6 creates a pain point for administrators as well.

Implementing the iptables API with bpfilter, Welte said, will "risk perpetuating the design mistakes we made in iptables some 18 years ago for another decade or more". It will push back the (already distant) date when that API could be deprecated and removed. Rather than focusing on iptables, Welte said, the developers should create an emulation of the newer nftables API, which was designed with the lessons from iptables in mind. That would support sites that have already migrated and encourage that migration to continue.

Networking maintainer David Miller (who authored some of the new code) replied that iptables is still far more widely used, so implementing that interface provides for better testing coverage in the near term. Welte answered, though, that most of the biggest use cases (Docker and Kubernetes, for example) use the command-line tools rather than the iptables API, so there is no need to implement emulation of the API itself to test with those systems. Miller, however, disagreed with the idea that the iptables binaries could be easily replaced on deployed systems: "Like it or not iptables ABI based filtering is going to be in the data path for many years if not a decade or more to come".

Interestingly, while there was talk of implementing the nftables API, nobody has yet questioned the idea of applying the BPF virtual machine to firewalls, even though it would be likely to supplant nftables relatively quickly. Instead, Miller said in the discussion that nftables failed to address the performance problems in Linux's packet-filtering implementation, driving users toward user-space networking technologies instead. There is a real possibility that nftables could end up being one of those experiments that is able to shed some light on the problem space but never takes over in the real world.

Overall, bpfilter is an extremely young project and there are a lot of questions yet to be answered about it. While much of the packet-filtering logic can likely be expressed in BPF code, there are more advanced features (like connection tracking, pointed out by Florian Westphal) that are still likely to need a fair amount of kernel support. There are no performance numbers with the patch set, so any performance gains are still theoretical at this point. And the code itself is quite young, lacking both features and documentation.

The end result is that we'll probably not see bpfilter in the mainline kernel in the immediate future. Given the developers who have worked on it, though, bpfilter is clearly a serious initiative that is firmly aimed at getting into the mainline eventually. If it truly proves to be a better solution to the network packet-filtering problem, those developers seem likely to prevail eventually.

Comments (22 posted)

Dynamic function tracing events

By Jonathan Corbet
February 15, 2018
For as long as the kernel has included tracepoints, developers have argued over whether those tracepoints are part of the kernel's ABI. Tracepoint changes have had to be reverted in the past because they broke existing user-space programs that had come to depend on them; meanwhile, fears of setting internal code in stone have made it difficult to add tracepoints to a number of kernel subsystems. Now, a new tracing functionality is being proposed as a way to circumvent all of those problems.

Whether tracepoints are part of the kernel ABI is not an insignificant issue. The kernel's ABI promise states that working programs will not be broken by updated kernels. It has become clear in the past that this promise extends to tracepoints, most notably in 2011 when a tracepoint change broke powertop and had to be reverted. Some kernel maintainers prohibit or severely restrict the addition of tracepoints to their subsystems out of fear that a similar thing could happen to them. As a result, the kernel lacks tracepoints that users would find useful.

This topic has found its way onto the agenda at a number of meetings, including the 2017 Maintainers Summit. At that time, a clever idea had been raised: rather than place tracepoints in sensitive locations, developers could just put markers that would have to be explicitly connected to and converted to tracepoints at run time. By adding some hoops to be jumped through, it was hoped, this new mechanism would not create any new ABI guarantees. Then things went quiet for a couple of months.

Recently, though, tracing maintainer Steve Rostedt surfaced with a variation on that proposal that he is calling "dynamically created function-based events". The details have changed, but the basic nature of the ABI dodge remains the same. The key detail that is different comes from the observation that the kernel already has a form of marker in place that the tracing code can make use of.

Kernel code is normally compiled with the compiler options used for code profiling. As a result, each function begins with a call to a function called mcount() (or __fentry__() when a newer compiler is in use). When a user-space program is being profiled, mcount() tracks calls to each function and the time spent there. The kernel, though, replaces mcount() with its own version that supports features like function tracing. Most of the time, the mcount() calls are patched out entirely, but they can be enabled at run time when there is a need to trace calls into a specific function.

There are other possible uses for this function-entry hook. Rostedt's patch uses it to enable the creation of a tracepoint at the beginning of any kernel function at run time. With the tracefs control filesystem mounted, a new tracepoint can be created with a command like:

    echo 'SyS_openat(int dfd, string path, x32 flags, x16 mode)' \
    	 > /sys/kernel/tracing/function_events

This command requests the creation of a tracepoint at the entry to SyS_openat(), the kernel's implementation of the openat() system call. Four values will be reported from the tracepoint: the directory file descriptor (dfd), the given pathname (path), and the flags and mode arguments. This tracepoint will show up under events/functions and will look like any other tracepoint in the kernel. It can be queried, enabled, and disabled in the usual ways. Interestingly, path in this case points into user space, but the tracing system properly fetches and prints the data anyway.
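Assuming the usual tracefs layout (the exact paths shown here are illustrative), the new event can then be enabled and observed like any other:

    # Enable the newly created event, watch it fire, then disable it.
    echo 1 > /sys/kernel/tracing/events/functions/SyS_openat/enable
    cat /sys/kernel/tracing/trace_pipe
    echo 0 > /sys/kernel/tracing/events/functions/SyS_openat/enable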

There is evidently some work yet to be done: "I need to rewrite the function graph tracer, and be able to add dynamic events on function return." But the core is seemingly in place and working. That leaves an important question, though: will it be enough to avoid creating a new set of ABI-guaranteed interfaces to the kernel? Mathieu Desnoyers worried that it might not:

Having those tools hook on function names/arguments will not make this magically go away. As soon as kernel code changes, widely used trace analysis tools will start breaking left and right, and we will be back to square one. Only this time, it's the internal function signature which will have become an ABI.

Linus Torvalds disagreed with this worry, though. The extra step required to hook into the kernel implies a different view of the status of that hook:

Everybody *understands* that this is like a debugger: if you have a gdb script that shows some information, and then you go around and change the source code, then *obviously* you'll have to change your debugger script too. You don't keep the source code static just to make your gdb script happy. That would be silly.

In contrast, the explicit tracepoints really made people believe that they have some long-term meaning.

If reality matches this view, then the new dynamic tracepoint mechanism could go a long way toward defusing the ABI issues. The number of new tracepoints being added to the kernel would be likely to drop, as developers could simply use the dynamic variety instead. When tracepoints are added in the future, it is relatively likely that they will be designed to support some sort of system-management tool and, thus, be viewed as a part of the ABI from the outset.

That assumes that this patch series is eventually merged, of course. There was some dissent from Alexei Starovoitov, who complained that the new interface adds little to what can already be had with kprobes. He also disliked the text-oriented interface, suggesting (unsurprisingly) that BPF should be used instead to extract specific bits of data from the kernel. Rostedt noted, though, that many developers are put off by the complexity of getting started with BPF and would prefer something simpler.

Rostedt said that he thought the interface would be useful, but that he would not continue its development if others did not agree: "If others think this would be helpful, I would ask them to speak up now". Thus far, few people have spoken. If the dynamic function tracing mechanism is indeed something that other developers would like to have available, they might want to make their feelings known.

Comments (8 posted)

The boot-constraint subsystem

February 16, 2018

This article was contributed by Viresh Kumar

The fifth version of the patch series adding the boot-constraint subsystem is under review on the linux-kernel mailing list. The purpose of this subsystem is to honor the constraints put on devices by the bootloader before those devices are handed over to the operating system (OS) — Linux in our case. If these constraints are violated, devices may fail to work properly once the kernel starts reconfiguring the hardware; by tracking and enforcing those constraints, instead, we can ensure that hardware continues to work properly until the kernel is fully operational.

The bootloader is a piece of code that loads the operating system, normally after initializing a number of hardware components that are required during the boot process, such as the flash memory controller. More than one bootloader may take part in booting the OS; the first-stage bootloader loads the second-stage bootloader, and the second-stage bootloader loads the OS. Some of the most common bootloaders used with Linux are LILO (LInux LOader), LOADLIN (LOAD LINux), GRUB (GRand Unified Bootloader), U-Boot (Universal Bootloader) and UEFI (Unified Extensible Firmware Interface).

The bootloaders enable and configure many devices before passing control to Linux; it is important to hand them over to Linux in a glitch-free manner. A typical example is the display or LCD controller, which is used by the bootloader to show images while the platform boots Linux. The LCD controller may be using multiple resources, including clocks and regulators, that are shared with other devices. These shared resources must be configured in such a way that they satisfy the needs of all the devices that use them. If another device is probed before the LCD-controller driver, then the driver for that new device may end up disabling or reconfiguring shared resources to ranges that satisfy only the new device while rendering the LCD screen unstable. Another common use case is that of the debug serial port when it is enabled by the bootloader; it may be used by kernel developers to debug an early kernel oops with the earlycon command line parameter.

Of course, we can have more complex cases where the same resource is used by multiple devices. We can also have a case where the resources are not shared, but the kernel disables them forcefully if no users appear until a certain point in the kernel boot process. An example of that is the clock framework, which disables unused clocks at the late_initcall_sync() initcall level.

The boot-constraint core solves these complex boot-order dependencies between otherwise unrelated devices by setting constraints on the shared resources, on behalf of the bootloader, before any of these devices are probed. For example, if the multimedia controller (MMC) and the LCD controller share the clock and regulator resources and both are enabled by the bootloader, then the boot-constraint core adds the constraint on those resources before the LCD or MMC controllers are probed by their kernel device drivers. The constraints are set differently for each resource type; it can be a simple clk_enable() operation for the clock constraint, for example. The MMC and LCD controllers can get probed in any order later on and the constraints added by the boot-constraint core will be honored by the resources until the constraints are removed.

The boot constraints for a device are currently added by platform-specific code. These constraints remain set until a driver tries to probe the device; they are removed automatically by the driver core after the device is probed (successfully or unsuccessfully), except when the probe has been deferred because some of the required resources weren't available. The constraints are removed even when probing of the device fails, because the kernel will not retry probing the device by itself and there is no point keeping the constraint set in that case. This behavior can be changed in the future if a use case arises where it makes sense to keep the constraint set even after a failed probe.

Adding boot constraints

A boot constraint defines a configuration requirement for the device. For example, if the clock is enabled for a device by the bootloader and we want the device to continue working until the time the device is probed by its driver, then keeping this clock enabled is one of the boot constraints.

The boot-constraint core currently supports three types of boot constraints, based on resource types; they are represented by the enumeration below:

    enum dev_boot_constraint_type {
	DEV_BOOT_CONSTRAINT_CLK,
	DEV_BOOT_CONSTRAINT_PM,
	DEV_BOOT_CONSTRAINT_SUPPLY,
    };

DEV_BOOT_CONSTRAINT_CLK represents the clock boot constraint, DEV_BOOT_CONSTRAINT_PM represents the power-domain boot constraint (a power-management domain that must remain energized), and DEV_BOOT_CONSTRAINT_SUPPLY represents the power-supply boot constraint. This list can be expanded in future; basically any resource type can have constraints set for it.

A single boot constraint, of any type, can be added for a device using:

    int dev_boot_constraint_add(struct device *dev,
    				struct dev_boot_constraint_info *info);

This function must be called before the device is probed by its driver, otherwise the boot constraint will never be removed and may result in unwanted behavior of the hardware. dev represents the device for which the boot constraint is to be added, and info defines the boot constraint. This function returns zero on success, and a negative error number otherwise.

The dev_boot_constraint_info structure looks like:

    struct dev_boot_constraint_info {
	struct dev_boot_constraint constraint;
	void (*free_resources)(void *data);
	void *free_resources_data;
    };

It contains an instance of the dev_boot_constraint structure, the optional free_resources() callback, and its parameter. The boot-constraint core calls free_resources(), if available, with free_resources_data as its argument right after the constraint is removed.

The dev_boot_constraint structure represents the actual constraint:

    struct dev_boot_constraint {
	enum dev_boot_constraint_type type;
	void *data;
    };

It contains the type of the boot constraint and constraint-specific data, as each constraint type has different requirements. The data is mandatory for clock and power-supply boot constraints, while it is not required for the PM boot constraint.

The clock boot constraint is represented by the struct dev_boot_constraint_clk_info, which contains only the name string:

    struct dev_boot_constraint_clk_info {
	const char *name;
    };

This string must match the connection-id of the device's clock. The boot-constraint core adds the clock constraint by calling clk_prepare_enable() for the device's clock and removes it with a call to clk_disable_unprepare().
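Putting those pieces together, platform code could add a clock boot constraint for a device along these lines (a minimal sketch; the "pclk" connection-id and the LCD-controller device are hypothetical):

    static struct dev_boot_constraint_clk_info lcd_clk_info = {
        .name = "pclk",	/* connection-id of the LCD controller's clock */
    };

    static struct dev_boot_constraint_info lcd_clk_constraint = {
        .constraint = {
            .type = DEV_BOOT_CONSTRAINT_CLK,
            .data = &lcd_clk_info,
        },
        /* Nothing was allocated, so no free_resources() callback is needed. */
    };

    static int lcd_add_boot_constraint(struct device *lcd_dev)
    {
        /* Must be called before the LCD-controller driver probes the device. */
        return dev_boot_constraint_add(lcd_dev, &lcd_clk_constraint);
    }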

The power-supply constraint is represented by the dev_boot_constraint_supply_info structure:

    struct dev_boot_constraint_supply_info {
	const char *name;
	unsigned int u_volt_min;
	unsigned int u_volt_max;
    };

The name string must match the power-supply name as the boot-constraint core uses it as an argument to the regulator_get(dev, name) helper. u_volt_min and u_volt_max define the minimum and maximum voltage accepted by the device for which the constraint is added. The boot-constraint core adds the power-supply constraint by first calling regulator_set_voltage() (only if both u_volt_min and u_volt_max have non-zero values), followed by regulator_enable(). The constraint is removed with regulator_disable(), followed by regulator_set_voltage() if it was called when the constraint was set.

The PM boot constraint doesn't need any data; the boot-constraint core adds the constraint by attaching the power domain to the device with dev_pm_domain_attach(), which internally enables the power domain. The power domain is normally attached to the device by the driver core while it binds a driver to the device, which won't happen in this case as the power domain is already attached by the boot-constraint core. For this reason, the boot-constraint core does not detach the power domain while removing the constraint, as that will be done by the driver core when the driver is detached from the device at a later point.

Deferrable constraints

There is a limitation with dev_boot_constraint_add(), though. What if the resources required by the constraint for the device aren't available when dev_boot_constraint_add() is called? How do we make sure that the boot-constraint core gets a chance to set the constraints before the device is probed by its driver, if the resources are going to be available at a later point of time?

The solution to these problems is twofold. First, we have to make the process of adding constraints deferrable, so that the driver core tries adding constraints again later if need be. Second, we have to make sure that the boot-constraint core gets a chance to add the constraints after the resources are made available but before the device (to which we are adding the constraints) is probed by its driver.

The first solution is implemented with the help of (virtual) device-driver pairs. The boot-constraint core implements an internal platform driver with the name boot-constraints-dev; its boot_constraint_probe() callback is called for every device registered with the same name as the driver. A platform device (boot-constraints-dev) is created for every device constraint we want to add. The boot_constraint_probe() callback gets called for each of these platform devices and tries to add the respective constraint. If the resources required for adding the constraint aren't available yet, the probe callback returns -EPROBE_DEFER and the driver core moves the virtual device to the list of deferred devices that are re-probed at a later point of time. On successful probing of these virtual devices, the constraints are added with the boot-constraints core.

The second problem, the race with the actual device being probed before the constraints are set, is solved by requesting the driver core to start probing the deferred devices immediately after boot_constraint_probe() returns -EPROBE_DEFER for any of the constraints. Normally, the driver core performs re-probing of deferred devices from an internal routine registered with late_initcall(). But if the boot-constraint core finds that it failed to add one of the constraints because its resources weren't available, then it requests the driver core to perform the re-probing of deferred devices from that point. The driver core tries to re-probe the deferred devices after every new device is registered with it, so the constraint-specific platform device will be probed before the actual device for which constraints are being added. Yes, this puts an extra burden on the system during the boot process and may make the kernel boot process slightly longer, but it is the best we could do as an initial solution. Later on we may want the boot-constraint core to register with the resource-specific subsystems so that they can inform the core once the resource is available. This way, the core can get a chance to set the constraint before the actual device gets probed.

In order to support deferrable constraints and to simplify adding multiple boot constraints for a platform, the boot-constraints core provides another helper:

    void dev_boot_constraint_add_deferrable_of(struct dev_boot_constraint_of *oconst, 
					       int count);

Here, oconst represents an array of device-tree (DT) constraints and count represents the number of entries in the array. This helper provides an easy way to add one or more deferrable constraints for one or more devices. Just like the dev_boot_constraint_add() helper, it must be called before the devices (to which we want to add constraints) are probed by their drivers.

struct dev_boot_constraint_of represents one or more constraints corresponding to one or more devices with the same DT compatible string:

    struct dev_boot_constraint_of {
	const char *compat;
	struct dev_boot_constraint *constraints;
	unsigned int count;
	const char * const *dev_names;
	unsigned int dev_names_count;
    };

The compat string must match with the DT compatible property of the devices to which we want to apply constraints, constraints is an array of one or more boot constraints, and count is the array size. The constraints will be added for every device that matches the compat string, unless the optional dev_names field is set to a non-NULL value. This field is used to confine the boot constraints to only a subset of devices that match the compat string; it is an array containing names (e.g. serial@fff02000) of the DT nodes where the constraints must be applied and dev_names_count is the array size.
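As a sketch of how a platform might use this structure for a bootloader-enabled debug serial port (the compatible string, clock and supply names, and voltages below are hypothetical; the DT node name is the one from the description above):

    static struct dev_boot_constraint_clk_info uart_clk_info = {
        .name = "uartclk",
    };

    static struct dev_boot_constraint_supply_info uart_supply_info = {
        .name = "vdd-uart",
        .u_volt_min = 1800000,
        .u_volt_max = 1800000,
    };

    static struct dev_boot_constraint uart_constraints[] = {
        { .type = DEV_BOOT_CONSTRAINT_CLK, .data = &uart_clk_info },
        { .type = DEV_BOOT_CONSTRAINT_SUPPLY, .data = &uart_supply_info },
        { .type = DEV_BOOT_CONSTRAINT_PM },	/* no data needed */
    };

    /* Constrain only the serial port that the bootloader actually used. */
    static const char * const uart_dev_names[] = { "serial@fff02000" };

    static struct dev_boot_constraint_of uart_oconst = {
        .compat = "vendor,example-uart",
        .constraints = uart_constraints,
        .count = ARRAY_SIZE(uart_constraints),
        .dev_names = uart_dev_names,
        .dev_names_count = ARRAY_SIZE(uart_dev_names),
    };

    static void example_platform_init(void)
    {
        /* Must run before the serial driver probes the device. */
        dev_boot_constraint_add_deferrable_of(&uart_oconst, 1);
    }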

Curious readers can see this example of how boot constraints might be set on a real-world device.

Status and future work

The proposed series updates a few platforms to use the boot-constraint core, and numerous others could adopt it in the future. There is still a lot of work to be done, though. Eliminating the relatively slow boot caused by deferred constraints, as described earlier, is the top priority. It may also be useful to get the constraint-specific information from the device tree, which would reduce some platform-specific code. It would also be interesting to extend this work to other firmware interfaces, such as ACPI, since these problems are common across architectures.

Comments (3 posted)

An overview of Project Atomic

February 21, 2018

This article was contributed by J. B. Crawford

Terms like "cloud-native" and "web scale" are often used and understood as pointless buzzwords. Under the layers of marketing, though, cloud systems do work best with a new and different way of thinking about system administration. Much of the tool set used for cloud operations is free software, and Linux is the platform of choice for almost all cloud applications. While just about any distribution can be made to work, there are several projects working to create a ground-up system specifically for cloud hosts. One of the best known of these is Project Atomic from Red Hat and the Fedora Project.

The basic change in thinking from conventional to cloud-computing operations is often summed up as "pets versus cattle". Traditionally, we have treated our individual computers as "pets": distinct entities that need to be cared for and protected. If a server went down, you would carefully nurse it back to health or, in the worst case, replace it with a new host restored from a backup. In a cloud environment, hosts can be created and destroyed in seconds, so we take advantage of that by treating them as largely disposable "cattle". If a host encounters a problem, we simply destroy it and create a new one to take over its function.

Closely coupled with this paradigm shift is a move to containerization. Container systems like Moby (formerly known as Docker) or rkt allow you to deploy software by packaging it into an image, similar to a very lightweight virtual machine, complete with all dependencies and system configuration required. Containerized, cloud-based deployments are quickly becoming the most common arrangement for web applications and other types of internet-connected software that are amenable to horizontal scaling.

With disposable servers running software that comes packaged with all its requirements, there's no longer as much need to manage the underlying servers at all. Ideally, they are set up once and never changed—even for updates. Instead of being "administered" in the conventional sense, an out-of-date cloud host is simply destroyed and replaced with a new one. This pattern is referred to as "immutable infrastructure", and it's perhaps the largest technical shift involved in modern cloud computing.

In a typical container-based cloud deployment, there are actually two levels at which a Linux distribution is involved. First, there is the operating system on the cloud hosts. Second, there is the Linux environment in the application containers running on those hosts—these share the kernel but have their own user space. The flexibility to use different distributions for different applications is one of the advantages of containerization, but in production systems the distribution installed in the container will often be a lightweight one like Alpine or a stripped-down Ubuntu variant. The Linux distribution on the cloud host itself should be capable of running the containers and should provide whatever services are needed for maintenance and diagnostics.

Project Atomic produces such a distribution and, more generally, builds tools for Linux cloud hosts and containers. Project Atomic is the upstream project for many components of OpenShift Origin, which is the basis for Red Hat's commercial container-deployment platform OpenShift. The main Atomic product is Atomic Host, a Linux distribution for immutable cloud hosts running containers under Moby and Kubernetes.

Atomic Host comes in two different flavors, depending on the user's risk appetite and desired update cycle: one derived from CentOS and one derived from Fedora. On top of the base distribution, Atomic Host makes a number of modifications. The most significant is the rpm-ostree update system. Rpm-ostree is based on OSTree, which is described as "Git for operating systems"; it is integrated with the familiar Red Hat package-management ecosystem but operates quite a bit differently. Conceptually, rpm-ostree manages the system software much like a container image: the entire installation is one atomic version, and an update replaces it completely with a new version while keeping the previous version available for rollback. These versions are managed in a Git-like OSTree repository.

So, with Atomic Host, instead of installing an operating system on a server and then installing and configuring a variety of system software, you can instead create a virtual machine or cloud host from an Atomic Host image and then, in one step, install a Git-controlled system configuration. The system is ready to serve an application in just two steps that can be easily automated by a cloud orchestrator. When the image becomes outdated, whether because of updates to the kernel or just to some configuration files, you build a new image and completely replace the filesystem on the running hosts with the new image.
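
On a running host, day-to-day interaction with this model comes down to a handful of rpm-ostree commands, roughly along the following lines (deployment names and output will vary):

    # rpm-ostree status      (show the booted and any pending deployments)
    # rpm-ostree upgrade     (download and deploy the newest tree)
    # systemctl reboot       (boot into the new deployment)
    # rpm-ostree rollback    (switch back to the previous deployment)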

Atomic Host's Fedora variant updates on a strict two-week cadence, providing new images with the patches and updates that would normally come as updates to individual packages. The CentOS option currently releases irregularly, about once per month, although one of the goals of the CentOS Atomic SIG is to establish a regular release cycle. In a notable difference from a more conventional update strategy, users have to apply these complete system updates in order to receive security patches and bug fixes. Since per-package updates may arrive more quickly than every two weeks, there may be a somewhat larger window of vulnerability for Atomic Host when compared to Fedora or CentOS.

Composing with rpm-ostree

Let's take a closer look at how an rpm-ostree configuration is created. The Git repository controlling rpm-ostree contains a number of files that describe the cloud host in terms of installed package versions and their configurations. The rpm-ostree tool then "composes" the system, assembling the packages and configuration into a filesystem and storing the result as an image file. The underlying OSTree tooling actually records the state of the filesystem after the compose and then installs the resulting version by copying the entire filesystem onto a host. This is conceptually similar to a disk image but, since OSTree operates at a filesystem level, it is both more efficient and capable of using filesystem features to retain older versions for easy rollback of changes.
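
The input to a compose is a "treefile": a small JSON manifest that names the OSTree ref to commit to, the RPM repositories to draw from, and the packages to include. The fragment below is a simplified, hypothetical example rather than the manifest actually used to build Atomic Host:

    {
        "ref": "example/7/x86_64/atomic-host",
        "repos": ["centos-7"],
        "packages": ["kernel", "systemd", "ostree", "docker", "kubernetes"]
    }

    # rpm-ostree compose tree --repo=/srv/ostree/repo treefile.json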

This is a different way of installing updates from the one we've all become accustomed to. There are several major advantages, though. First, the filesystem replication process guarantees that the installed software and operating system configuration on one host will be identical to those installed on the others. The rpm-ostree mechanism also allows thorough testing (including automated integration testing) of a complete host configuration, from the operating system up, before making changes to the production environment. Second, it prevents a huge number of potential problems with software installation and update consistency, making it much easier to automatically update hosts—every update is an installation from scratch with no concerns about the starting state. Finally, it makes it far easier to replace hosts. Since all of the software on the host came from a known, version-controlled configuration, there's no need to worry that there is special configuration on a host that will be lost. In short, there's no need for backups as everything can be exactly recreated from the start.

Rpm-ostree extends the basic OSTree image approach by allowing "layering" of RPM packages on top of an OSTree configuration. If you have worked with Docker/Moby, this concept of image layering will be familiar. If an existing OSTree image meets your basic needs but you require some additional software, the RPM packages for the additional software can be added on top of the OSTree image without building an entirely new image. This eases the transition to immutable updates by allowing administrators to stick to one of their most familiar tools: making software available by installing a package.
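
Layering a package is a one-line operation, and the layered package persists across subsequent upgrades until it is removed again (the package name here is arbitrary):

    # rpm-ostree install strace      (layer an extra RPM on top of the base image)
    # systemctl reboot               (the new deployment includes the package)
    # rpm-ostree uninstall strace    (drop the layered package again)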

Also included in Atomic Host is the atomic command, which wraps Moby to simplify installing and running container images that have specific expectations about the way they are run. atomic is particularly well suited for Moby containers used for system functions; an example use case is automatically placing systemd unit files to manage the Moby container as a service. Instead of pulling a Moby image and going through several steps to set up a systemd unit to run it with appropriate settings, containers with the right metadata can just be installed with atomic install.
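
The metadata in question takes the form of labels embedded in the container image; atomic reads them and runs the commands they contain, substituting the image and container names. Roughly, and with a hypothetical image name, that looks something like this:

    # Labels a container image might carry (Dockerfile fragment):
    LABEL INSTALL="docker run --rm --privileged -v /:/host IMAGE /usr/bin/install.sh"
    LABEL RUN="docker run -d --name NAME IMAGE"

    # atomic install example.com/monitoring-agent
    # atomic run example.com/monitoring-agent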

It should be clear that Atomic Host is a significant change from traditional server-oriented Linux distributions. Atomic Host might sound difficult to use for small environments and one-off servers, but those are simply not its intended use cases. Immutable infrastructure works best in at-scale environments where hosts—servers, virtual machines, or cloud instances—can be obtained and abandoned easily. Philosophically, Atomic Host focuses on providing a testable, consistent environment overall, rather than maintaining individual hosts for maximum reliability. As a result, it is not a good choice for "pet" servers that are individually important and cannot easily be replaced. To truly take advantage of Atomic Host, the applications you run should be able to tolerate individual hosts disappearing and to make use of new hosts as they become available.

The benefit of this change is simpler and more automated maintenance of multiple servers. When your application runs on hosts that are nearly stateless and can be rebuilt in minutes, there's far less need for traditional system administration. Instead, your time can be spent on thorough design and testing of your system configuration.

Atomic Host is not the only product of Project Atomic. There is also Cockpit, which is a web application that allows for easy remote management and monitoring of hosts running Atomic Host. Cockpit is a step between manually installing and managing Atomic Host nodes and a complete configuration management system such as Puppet; Cockpit is still a tool to monitor and manage individual hosts, but allows you to do so from a central interface instead of via SSH. Because of Atomic Host's simple, low-maintenance design, this may be all that's needed for cloud environments with tens of hosts.

While it's perfectly possible to make changes to an Atomic Host installation on the fly via SSH or even Cockpit, doing so negates many of the advantages of immutable deployments by introducing inconsistent and likely poorly tested changes. As with many technologies, immutable deployments also require an element of discipline by the operators. The temptation to make a "quick fix" must be overcome in favor of a tested, controlled change to the entire deployment.

Finally, Project Atomic has also put quite a bit of work into developing the basic nuts and bolts of a well-run Linux cloud environment. A major example is its work on integrating Moby with SELinux. This effort, like Atomic Host and Cockpit, will help keep Linux solidly at the front of cloud computing.

Comments (11 posted)

Open-source trusted computing for IoT

February 21, 2018

This article was contributed by Mischa Spiegelmock


FOSDEM

At this year's FOSDEM in Brussels, Jan Tobias Mühlberg gave a talk on the latest work on Sancus, a project that was originally presented at the USENIX Security Symposium in 2013. The project is a fully open-source hardware platform to support "trusted computing" and other security functionality. It is designed to be used for internet of things (IoT) devices, automotive applications, critical infrastructure, and other embedded devices where trusted code is expected to be run.

A common security practice for some time now has been to sign executables to ensure that only the expected code is running on a system and to prevent software that is not trusted from being loaded and executed. Sancus is an architecture for trusted embedded computing that enables local and remote attestation of signed software, safe and secure storage of secrets such as encryption keys and certificates, and isolation of memory regions between software modules. In addition to the technical specification [PDF], the project also has a working implementation of code and hardware consisting of compiler modifications, additions to the hardware description language for a microcontroller to add functionality to the processor, a simulator, header files, and assorted tools to tie everything together.

Many people are already familiar with code signing; by default, smartphones won't install apps that haven't been approved by the vendor (e.g. Apple or Google), because each app must be submitted for approval and then signed with a vendor key whose public counterpart ships pre-installed on every phone. Similarly, many computers support mechanisms like ARM TrustZone or UEFI Secure Boot that are designed to prevent low-level rootkits from taking hold at the bootloader level. In practice, some of those technologies have been used to restrict computers to booting only Microsoft Windows or Google Chrome OS, though there are ways to disable the enforcement on most hardware.

In contrast to more proprietary schemes, which some argue restrict the freedom of end users, the Sancus project is a completely open-source design built explicitly on open-source hardware, libraries, operating systems, cryptography, and compilers. It can be used, if desired, in specialized contexts where it is of critical importance that trusted code runs in isolation: on, say, an automotive braking actuator attached to a controller-area-network bus, or a smart-grid system like the one that was hacked in Ukraine during the attack by Russia. These are the opposite of general-purpose devices; each performs one specific function, and integrity and isolation are critical.

The problem is that many medical devices, automotive controllers, industrial controllers, and similar sensitive embedded systems are built from resource-limited microcontrollers that may run software modules from different vendors. Misbehaving or malicious software can interfere with the operation of the other modules, expose or steal secrets, and compromise the integrity of the whole system. Integrity checks implemented in software are bypassed relatively easily compared to gate-level hardware checking; they also add considerable overhead and non-deterministic performance behavior.

Sancus 2.0 extends the openMSP430 16-bit microcontroller with a small and efficient set of strong security primitives, weighing in at under 1,500 lines of Verilog and increasing power consumption by about 6%, according to Mühlberg. It can disallow jumps to undeclared entry points, isolate the memory of software modules from one another, and attest those modules both locally and remotely.

Besides providing a key hierarchy and chain of trust for loading software modules, Sancus has a simple metadata descriptor for each module that stores the .text and .data ranges in memory; it then ensures that a .data section is inaccessible unless the program counter is in the .text range of the appropriate module. This is a simple but effective process isolation mechanism to ensure that secrets are not accessible from other software modules and that one module cannot disturb the memory of other modules.
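
A rough software model of that rule makes the mechanism concrete; the structure and function below are purely illustrative (the real check is implemented in gate-level logic, not C):

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative module descriptor: code and private-data ranges. */
    struct sm_descriptor {
            uintptr_t text_start, text_end;   /* public .text section  */
            uintptr_t data_start, data_end;   /* private .data section */
    };

    /*
     * Access to a module's private data is permitted only while the
     * program counter lies within that module's own .text range.
     */
    static bool access_allowed(const struct sm_descriptor *m,
                               uintptr_t pc, uintptr_t addr)
    {
            bool pc_in_text   = pc   >= m->text_start && pc   < m->text_end;
            bool addr_in_data = addr >= m->data_start && addr < m->data_end;

            return !addr_in_data || pc_in_text;
    }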

Sancus 2.0 comes with openMSP430 hardware extension Verilog code for use with FPGA boards and with the open-source Icarus Verilog tool. A simple "hello, world" example module written in C demonstrates the basic structure of a software module designed to be loaded in a trusted environment. There are also more complex examples and a demonstration trusted vehicular component system. An LLVM-based compiler is used to compile software to signed modules designed to be loaded by a trusted microcontroller.

Mühlberg mentioned that there is ongoing work on creating secure paths between peripherals for secure I/O, integration with common existing hardware solutions such as ARM TrustZone or Intel SGX, formal verification, and ensuring suitability for realtime applications.

To give a feel for the system in action, Mühlberg showed a demonstration video comparing two simulated automotive controller networks with malicious code running on a node. One can see the unsecured system behave erratically when receiving invalid messages, whereas the Sancus system gracefully slows down and safely disengages.

Much has been written about the upcoming IoTpocalypse: the lack of security in critical infrastructure and the general despair about the dismal state of easily exploitable embedded systems as they multiply and get connected to the internet. A project built from open-source building blocks and a free-software ethos that attempts to bring a layer of integrity and deterministic behavior to microcontrollers deserves to be lauded, and to be considered by anyone building hardware applications where security and reliability are strong requirements.

Comments (5 posted)

Page editor: Jonathan Corbet


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds