At linux.conf.au (LCA) 2017 in Hobart, Tasmania, Keith Packard talked with kernel graphics maintainer Dave Airlie about how virtual reality devices should be hooked up to Linux. They both thought it would be pretty straightforward to do, so it would "only take a few weeks", but Packard knew "in reality it would take a lot longer". In a talk at LCA 2018 in Sydney, Packard reported back on the progress he has made; most of it is now in the upstream kernel.
Packard has been consulting for Valve, which is a game technology company, to add support for head-mounted displays to Linux. Those displays have an inertial measurement unit (IMU) for position and orientation tracking and a display with some optics. The display is about 2Kx1K pixels in the hardware he is working with; that is split in half for each eye. The displays also have a "bunch of lenses", which makes them "more complicated than you would hope".
The display is meant to block out the real world and to make users believe they inhabit the virtual reality. "It's great if you want to stumble into walls, chairs, and tables." Nearly all of the audience indicated they had used a virtual reality headset, leading Packard to hyperbolically proclaim that he is the last person in the universe to obtain one.
There is a lot of computation that needs to be done in order to display a scene on the headset. The left and right eye buffers must be computed separately by the application. Those buffers must undergo an optical transform to compensate for the distortion introduced by the lenses. In addition, the OLED display has a different intensity for every pixel, which produces highly visible artifacts when moving an image across the display as the head moves. So his hardware has a table that is used to alter the intensities before the buffer is displayed. A regular Linux desktop is not suitable to be shown on a virtual reality headset; these devices should inherently be separate from the set of normal desktop display outputs.
Displaying to one of these head-mounted devices has hard realtime requirements. If the scene does not track the head motion perfectly, users will "get violently ill". Every frame must be perfect, both in its pixel content and in its on-time delivery. The slowest frame rate used is 90 frames per second, which is 50% faster than the usual desktop rate; the application "must hit every frame" or problems will occur. If the device simply presents a static image that does not change with head position, which can happen during debugging, it can cause the wearer to fall down if they are standing or fall out of their chair if they are not.
He and Airlie discussed a few different possibilities for supporting these displays. Most required changes to the desktops, X, the kernel, or some combination of them. All except one would have serious latency problems because X is not a hard realtime environment; that could lead to missed frames or other glitches. X is kind of a "best effort" display manager, Packard said; you give it a frame and it will display it sometime in the future. These displays need more than that.
The idea they came up with was to allow applications to "borrow" a display and get X out of the way—because "it's not helping". It adds latency and the application doesn't need any of the services it provides. By hacking the kernel and X (something that he and Airlie are well qualified to do, Packard said), they could provide a way for X to talk to some displays and for applications to talk to others without the two interacting at all.
Direct rendering manager (DRM) clients already talk directly to the kernel without going through X, he said. Those clients have a file descriptor to use for that communication; either those descriptors could be made more capable or a new kind of file descriptor could be added. Beyond rendering, the applications need to be able to do mode setting, which is a privileged operation that is normally mediated by the window system (e.g. X). What is needed is a way to pull the display away from X temporarily, so that the head-mounted display application has full control of it. But there is also a need for X to be able to recover the display if the application crashes.
So they came up with the idea of a "lease". The application would not own the device, but would have limited access for a limited amount of time. The lessor (e.g. X server) promises not to "come in and bug the application while it is busy displaying stuff"; it gives the application free rein on the display. The lessee (application) can set modes, flip buffers, and turn the display on or off for power saving; when the lease terminates (or the lessee does), the lessor can clean up.
Doing new things in the kernel often leads to getting sidetracked into a series of "yak-shaving exercises", Packard said. One of those came about when he was trying to change the kernel frame counter from 32 bits to 64. The current vertical blank (VBLANK) API is a mess; a single ioctl() command is used for three separate functions. It only supports a 32-bit frame counter, which will wrap in a few years and thus cannot be treated as a unique value. A multi-year test-debug cycle is not conducive to finding and fixing any bugs that result from the wrap, so he was hesitant to rely on it. It also only supported microsecond resolution, which was insufficient for his needs.
So he added two new ioctl() commands, one that would retrieve the frame counter and another to queue an event to be delivered on a particular count. Those counts are now 64-bit quantities with nanosecond resolution. He got sidetracked into solving this problem and "it took a surprising amount of time to integrate this into the kernel". It was such a small change to the API and "the specification is really clear and easy to read", so that meant that it got bikeshedded, which was frustrating. He thought it would be quick and easy to do, but "it took like three months". "Welcome to kernel development", he added.
For DRM leasing, he created three patch sets to "tell the story" of the changes he wanted to make. The first chunk simply changed the internal APIs so that the hooks he needed were available; it made no functional change to the kernel. The second was the core of his changes; it added lots of code but didn't touch other parts of the system. That allowed it to be reviewed and merged without actually changing how the kernel functioned. Only with the last chunk, which exposed the new functionality to user space, does the kernel API change. Breaking the patches up this way makes it easier for maintainers to review, Packard said.
Two different kinds of objects can be leased: connectors, which correspond to the physical output connectors, and CRTCs, which are the scan-out engines. Packard noted that CRTC stands for "cathode-ray tube controller" but that few people actually have a cathode-ray tube display any more. In fact, the only person he knows that does still have one plays a competitive video game; the one-frame latency introduced by using an LCD display evidently interferes with optimal game playing. "Wow!"
He also added some new ioctl() commands to allow applications to lease a connector and CRTC. The lessee can only see and manipulate the connector and CRTC resources the lease has been granted for; only the DRM master (e.g. X server) has access to the full set. The application only needs to change in one place: instead of opening the graphics device, it should request a lease. Once the lease is granted, it will only see the resources that it has leased, so nothing else in the application needs to change, he said.
He then demonstrated the code using an application he wrote (xlease) that gets a lease from the X server, starts a new X server, and tells the new server to use the leased device for its output. That came up with an xterm, but he could not actually type in that terminal window because the second X server had no input devices. He then ran the venerable x11perf 2D-performance test in a second window. He killed xlease, which killed the second X server; the original X server was able to recover and redisplay his desktop at that point.
The second X server was running as a normal, unprivileged user because it does not require special privileges to open the graphics device as the regular X server does. That could lead to a way to solve the longstanding request for multi-seat X. He has a graphics card with four outputs, for example, and X already handles multiple input devices, so it should be straightforward to set up true multi-seat environments. Each display would have its own X server and input devices simply by changing configuration files, he said.
The master X server can still see all of the resources, including those that have been leased out. It could still mess with the output on the leased devices; it has simply agreed not to do so, and there is no enforcement of that by the kernel. But desktops, such as GNOME or KDE/Plasma, would also be able to mess with those displays. So the X server effectively hides the leased resources from other clients; it simply pretends that there is nothing connected while the lease is active.
In an aside, Packard noted that he had seen patches to add leasing to the Wayland protocol the day before. There was no code yet, but "somebody is starting to think about how this might work in a Wayland environment, which is pretty cool".
All of the leasing mechanism sits atop an ability that was added for direct rendering infrastructure (DRI) 3: passing file descriptors from the server to X applications. Passing file descriptors via Unix-domain sockets has been possible for a long time, but was never used in X. DRI 2 had some "hokey mechanism" that was replaced by file-descriptor passing when DRI 3 came about. That required a bunch of changes and infrastructure in the X server that made this leasing work "a piece of cake", he said. The lease is represented as a file descriptor that can be passed to an application; using file descriptors that way is a powerful primitive that he thinks we will be seeing more of for X in the future.
There was a pile of minor fixes that he needed to make in X to support leasing. For example, the cursor needed to be disabled when the CRTC was leased so that the lessee doesn't end up with a random cursor on the screen. Similarly, the cursor needed to be restored to a known state when the lease terminates.
He has also done some work to provide leasing support for Vulkan. His original plan was to create a Vulkan extension that would be passed the file descriptor for it to use, but that ran aground on the rather heavyweight Vulkan standardization process for extensions. NVIDIA has an extension that does something related and he was able to modify that to handle leasing while still allowing it to be used for its original purpose with the binary-only NVIDIA driver.
He concluded the talk by saying that virtual reality on Linux is "working great and should be coming to your desktop soon". The kernel changes were merged for 4.15, while the X changes will be coming in xserver 1.20. The YouTube video of the talk is available for those interested.
[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Sydney for LCA.]
A VPN works by establishing an encrypted connection from an endpoint system to a trusted host elsewhere on the network. That host becomes the router through which some or all network traffic from the endpoint passes. Since this tunnel is encrypted, traffic that travels over the VPN is protected from eavesdroppers — until it reaches the trusted host, at least. Setting up the VPN connection in the first place requires authentication between the endpoints; that, in turn, allows hosts to place some trust in the packets coming over the VPN connection. It is thus a common configuration to only allow internal resources to be accessed via a VPN connection.
There are other advantages to VPNs as well. Today's youth tend to be well acquainted with the use of a VPN to bypass various types of content filtering. A VPN can be used to change a user's apparent network location, helping to circumvent annoyances like country-specific content blocking. The first eavesdropper many of us are likely to encounter is our own Internet service provider, which tends to take a great interest in collecting data on web traffic and such to sell to advertisers; using a VPN will frustrate this kind of prying.
Linux users have a number of VPN options available to them. In many cases, a set of OpenSSH tunnels will do the trick for specific applications. OpenVPN is a popular, free-software system that provides comprehensive VPN functionality. But OpenVPN suffers from a fair amount of complexity and, since it is a user-space implementation, it takes a toll on networking performance. IPsec is built into the kernel and has fewer performance problems, but it makes up for that with even more complexity.
In June 2016, Jason Donenfeld showed up with a new VPN implementation called WireGuard that claims to avoid the problems associated with other options. It is an in-kernel implementation (though still out of tree) that has been developed with performance in mind. The implementation is quite small (about 4,000 lines of code), making it relatively easy to verify. Configuration of the system is relatively simple though, as with any sort of network configuration, it seems, the "relatively" qualification is important.
Donenfeld has gone out of his way to make it easy to experiment with WireGuard; there are prebuilt packages available for a wide range of distributions. Those packages contain the source for the WireGuard implementation; it is built on the fly using the DKMS framework. Once the installation is done, the user is left with a kernel module (wireguard.ko) and the wg tool for configuration.
Every host connecting to a WireGuard implementation must use a public/private key pair for communication. The first step, thus, is to generate a new private key with a command like:
# wg genkey
uHCQ+Damh4F5zNVr9PvHiflW2aRU1SE0GQCVYkvxiEc=
The keys are generated using the Curve25519 elliptic curve; as a result they are quite a bit shorter than keys used by other algorithms. The associated public key can be created from the private key with the wg pubkey command.
WireGuard presents itself as a new type of network interface that can be used to route packets into a VPN. Thus, setting up a WireGuard implementation requires creating and configuring this interface, using a command series like:
# ip link add wg0 type wireguard
# ip addr add 10.0.0.1/24 dev wg0
# wg set wg0 private-key <private-key-file>
# ip link set wg0 up
These commands create a new network interface called wg0, loading the wireguard kernel module in the process. This interface is assigned the network address 10.0.0.1, and its private key is set to a key generated with wg genkey. Just running a bare wg command at this point will produce output like:
# wg
interface: wg0
public key: FNqV9pbUECLd7SNQ98jDlDRxqtppMTT9CEE8p1w6bTU=
private key: (hidden)
listening port: 41415
Like many recent protocols, WireGuard is based on UDP. Packets at one end are encrypted, then sent to the remote endpoint encapsulated within UDP packets, where they are decrypted and sent on their way. The above output tells us that port 41415 was chosen to listen for these UDP packets; the port number can also be explicitly configured with the wg command.
A command series like this must be carried out at both ends of the VPN connection; the IP addresses should be different, of course, but on the same subnet (10.0.0.0/24 in this case). Imagine we did something like that on the remote host, giving it IP address 10.0.0.2 and putting it on port 44556. The next step is to connect those two interfaces together so that they may pass packets back and forth. On the original machine (the one whose wg0 interface has address 10.0.0.1), we would run something like:
# wg set wg0 peer <public-key> allowed-ips 10.0.0.2/32 endpoint <ip-addr>:44556
Here, ip-addr is the real-world (not VPN) address of the other end of the connection. A similar command would be required on that other system, using the appropriate public key and IP address. At that point, it will be possible to communicate between the two hosts by using the appropriate addresses. Once the connection has been established the IP addresses can change; if one end is a laptop, for example, the VPN will still work after moving to a new network.
In a sense, that's really about all there is to it. But the real world does tend to be a bit more complicated, of course. For example, it is common to want the endpoint to send all of its network traffic over the VPN. That could be accomplished by setting the allowed-ips parameter in the above command to 0.0.0.0/0 and using ip route to set the default route to go through wg0. A slightly more complex setup (turning on IP forwarding, probably setting up NAT) would then be required on the other end to make the routing work.
The advantage of the WireGuard approach can be seen here, though; it creates interfaces that can be connected to each other. After that all of the normal networking routing and traffic-control features can be used to cause that VPN link to be used in a wide variety of ways. There are also some special features designed to allow WireGuard interfaces to be used within network namespaces.
I ran a test, using WireGuard to set up a link between the desktop machine and a remote cloud instance. It took a little while, but that is mostly a matter of being extremely rusty with the ip command set. The VPN tunnel worked as advertised in the end. Before enabling the tunnel, a SpeedOf.Me test showed 137Mbps bandwidth down and 12.9Mbps up; the ping time to an LWN server was 76ms. With all traffic routing over the WireGuard VPN link, downward bandwidth dropped to 131Mbps and upward to 12.4Mbps; ping times were nearly unchanged. That is not a zero cost, but it is not huge and one should bear in mind that going through a NAT gateway at the far end will be a big chunk of the total performance hit. So WireGuard does indeed appear to be reasonably fast.
One test is not a comprehensive evaluation, of course. It will be interesting to try WireGuard at the next conference with an overloaded network to see how well it copes with packet loss, for example, and no attempt was made to verify the cryptographic aspects of the protocol. WireGuard does seem like a relatively simple and fast VPN implementation, though, that could go a long way toward making VPN use nearly universal on Linux systems.
Getting to that point will require that WireGuard be merged into the mainline kernel, though. Donenfeld has stated that upstreaming the code was his intent from the beginning, but there have been almost no postings of the code on the kernel mailing lists. It is, thus, unsurprising that WireGuard remains out of tree. Donenfeld did post an upstreaming roadmap in November; it suggests that the code is unlikely to be merged right away since, for example, an overhaul of the cryptographic API is evidently a precondition. That overhaul has not yet happened, and neither has the promised near-term posting of the WireGuard code.
Chances are that this all will happen eventually, though. WireGuard seems to have generated a high level of interest, and it appears to have been deployed in many settings already. It has, for example, been integrated into OpenWrt with a set of configuration screens in the LuCI web interface. So there is clearly an audience for this functionality. Once the process of getting it upstream begins in earnest, it may run its course relatively quickly.
See this white paper [PDF] for lots of details on how WireGuard works.
By design, the C language does not define the contents of automatic variables — those that are created on the stack when the function defining them is called. If the programmer does not initialize automatic variables, they will thus contain garbage values; in particular, they will contain whatever happened to be left on the stack in the location where the variables are allocated. Failure to initialize these variables can, as a result, lead to a number of undesirable behaviors. Writing an uninitialized variable to user space will leak the data on the stack, which may be sensitive in one way or another. If the uninitialized value is used within the function, surprising results may ensue; if an attacker can find a way to control what will be left on the stack, they may be able to exploit this behavior to compromise the kernel. Both types of vulnerability have arisen in the kernel in the past and will certainly continue to pop up in the future.
Note that, while most uses of uninitialized data can be squarely blamed on the programmer, that is not always the case. For example, structures stored on the stack may contain padding between fields, and the compiler may well decide that it need not initialize the padding, since the program will not use that memory. But that memory can still be exposed to user space, should the kernel write such a structure in response to a system call.
The best solution to this problem would be to find and fix every location where on-stack variables are not properly initialized. Tools like KASan can help with this task, but chasing down this kind of problem is a never-ending game. It would, thus, be nice to have a way of automatically preventing this type of vulnerability.
For some time, Alexander Popov has been working on a port of the PaX STACKLEAK feature to the mainline kernel; the ninth version of the patch set was posted on March 3. This series adds a GCC plugin that tracks the maximum depth of the kernel stack; this information can be used to help prevent stack overruns. The main purpose of this tracking, though, is to allow the kernel to clear the kernel stack on return from every system call; the stack-clearing code can use the maximum depth to avoid clearing more stack space than was actually used. According to the cover letter, turning on this feature incurs a performance cost of about 1%; in return for this overhead, kernel code always runs in an environment where the contents of the stack are known to have been properly set.
Incidentally, the "clearing" of the stack is not setting it to zero. Instead, a special poison value is used; that should help to identify crashes that are caused by the use of uninitialized on-stack variables.
Kees Cook remarked that this series "should be ready to land pretty soon", but that was before Linus Torvalds became aware of it. Torvalds was not pleased, and made it clear that the STACKLEAK code was unlikely to make it into the mainline in its current form.
He suggested that security developers should focus more on finding and fixing problems, thus improving the kernel, rather than papering over issues in this way.
Needless to say, the developers involved see the situation a little differently, and Cook responded in defense of the patches.
That response led Torvalds to start thinking about what he described as a "*smart*" way of dealing with the problem. Simply clearing the stack did not strike him as *smart*, but having the compiler initialize all automatic variables to zero would be. This initialization would provide similar protection from uninitialized data, but it could also be omitted whenever the compiler could determine that the variable was properly initialized in some other way. The result should be protection with significantly lower overhead.
That overhead could be reduced further in performance-sensitive code by adding a special marker for variables that the compiler should not initialize, even if it seems that initialization is necessary. Places where this marker is needed would stand out in performance profiles, and the marker itself would be a red flag that uninitialized data may be present.
Cook was in favor of adding this functionality to the compiler, but he also said that it is insufficient. It takes a long time for a new compiler to be widely adopted; people will build new kernels with old compilers for a surprisingly long time. So an approach based solely on the compiler will not provide anything close to universal coverage for years. Adding the stack clearing into the kernel can protect sites regardless of whether a new compiler is used to build it. He also pointed out that there are a couple of cases where the zeroing of automatic variables does not provide complete coverage. If a vulnerability allows an attacker to read data below the current stack boundary, it can be exploited to read the possibly interesting data that will be sitting there. Clearing the stack also wipes out data that might otherwise be read by an unrelated vulnerability, considerably narrowing the window in which that vulnerability could be exploited.
The discussion had reached no definitive conclusion as of this writing. The STACKLEAK code has encountered a significant obstacle on its way into the mainline, but it shouldn't necessarily be written off quite yet. There do appear to be some valid reasons for having this feature in the kernel, in the short term at least, and the stack clearing can be disabled for users who do not want to pay the cost. So, with some persistence (and security developers have learned to be persistent), there may yet be a place for the STACKLEAK patches in the mainline.
This is the fourth article of a series discussing various methods of reducing the size of the Linux kernel to make it suitable for small environments. Reducing the kernel binary has its limits and we have pushed them as far as possible at this point. Still, our goal, which is to be able to run Linux entirely from the on-chip resources of a microcontroller, has not been reached yet. This article will conclude this series by looking at the problem from the perspective of making the kernel and user space fit into a resource-limited system.
A microcontroller is a self-contained system with peripherals, memory, and a CPU. It is typically small, inexpensive, and has low power-consumption characteristics. Microcontrollers are designed to accomplish one task and run one specific program. Therefore, the dynamic memory content of a microcontroller is usually much smaller than its static content. This is why it is common to find microcontrollers equipped with many times more ROM than RAM.
For example, the ATmega328 (a popular Arduino target) comes with 32KB of flash memory and only 2KB of static memory (SRAM). Now for something that can boot Linux, the STM32F767BI comes with 2MB of flash and 512KB of SRAM. So we'll aim for that resource profile and figure out how to move as much content as possible from RAM to ROM.
The idea of eXecute-In-Place (XIP) is to have the CPU fetch instructions directly from the ROM or flash memory where it is stored and avoid loading them into RAM altogether. XIP is a given in the microcontroller world where RAM is small, as we've seen. But XIP is used a bit less on larger systems where RAM is plentiful and simply executing everything from RAM is often simpler; executing from RAM is also faster due to high-performance caches. This is why most Linux targets don't support XIP. In fact, XIP in the kernel appears to be supported only on ARM, and its introduction predates the Git era.
For kernel XIP, it is necessary to have ROM or flash memory directly accessible in a range of the processor's memory address space, alongside system RAM, without the need for any software drivers. NOR flash is often used for that purpose as it offers random access, unlike the block-addressed NAND flash. Then, the kernel must be specially linked so the text and read-only data sections are allocated in the flash address range. All we need to do is enable CONFIG_XIP_KERNEL and the build system will prompt for the desired kernel physical address location in flash. Only the writable kernel data will be copied to RAM.
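On ARM, the resulting configuration fragment might look something like this; the address is illustrative only, assuming an STM32-style flash mapping at 0x08000000, and the real CONFIG_XIP_PHYS_ADDR value is board-specific:

```
CONFIG_XIP_KERNEL=y
# Physical address in flash that the kernel text is linked to run from
CONFIG_XIP_PHYS_ADDR=0x08000000
```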
It is therefore highly desirable with an XIP kernel to have as much code and data as possible placed in flash memory. The more that remains in flash, the less will be copied to the precious RAM. By default, functions are put in flash, along with any data annotated with the const qualifier. It is convenient that all the "constification" work that took place in recent kernel releases, mainly for hardening purposes, directly benefits the XIP kernel case too.
User space is a huge consumer of RAM. But, just like the kernel, user-space binaries have read-write and read-only segments. It would be nice to have the user-space read-only segments stored in the same flash memory, and executed directly from there rather than being loaded into RAM. However, unlike the kernel, which is a static binary loaded or mapped only once from well-known ROM and RAM addresses, user-space executables are organized into a filesystem, making things more complicated.
Could we get rid of the filesystem? Certainly we could. In fact, this is what most small realtime operating systems do: they link their application code directly with the kernel, bypassing the filesystem layer entirely. And that wouldn't be completely revolutionary even for Linux, as kernel threads are more or less treated like user-space applications: they have an execution context of their own, they are scheduled alongside user applications, they can be signaled, they appear in the task list, etc. And kernel threads have no filesystem under them. An application made into a kernel thread could crash the entire kernel, but in a microcontroller environment lacking a memory-management unit (MMU), this is already the case for pure user-space applications.
However, having a filesystem around for user-space applications still has many advantages we don't want to lose:
Compatibility with full-fledged Linux systems, so our application can be developed and tested natively on a workstation;
The convenience of having multiple unrelated applications together;
The ability to develop and update the kernel and user space independently of each other;
A clear boundary that identifies application code as not being a derived work of the kernel in the context of the GPL.
This being said, we want the smallest and simplest filesystem possible. Let's not forget that our flash memory budget is only 2MB, and our kernel (see the previous article in this series) weighs about 1MB already. That pretty much rules out writable filesystems due to their inherent overhead, and we don't want to be writing to the same flash where the kernel and user space live as this would render all the flash content inaccessible during write operations and crash any code executing from it.
Side note: It is possible to write to the actual flash memory being used for XIP with CONFIG_MTD_XIP but this is tricky, currently available only for Intel and AMD flash memory, and requires target-specific support.
So our choices for small, read-only filesystems are:
Squashfs: highly scalable, compressed by default, somewhat complex code, no XIP support
Romfs: small and simple code, no compression, partial (only on systems without an MMU) XIP support
Cramfs: small and simple code, compressed, partial (MMU-only) out-of-tree XIP support
I settled on cramfs: the small amount of available flash memory warrants the compression that romfs lacks, and cramfs's simple code base made it much easier to add no-MMU XIP support quickly than it would have been with squashfs. Also, cramfs can be used with the block-device subsystem configured out entirely.
However, the early attempts at adding XIP to cramfs were rather crude and lacking in a fundamental way. It was an all-or-nothing affair: each file was either completely uncompressed for XIP purposes, or entirely compressed. In reality, executables are made of both code and data, and since writable data has to be copied to RAM anyway, it is wasteful to keep that part uncompressed in flash. So I took it upon myself to completely redesign cramfs XIP support for both the MMU and no-MMU cases. I included the needed ability to mix compressed and uncompressed blocks at arbitrary alignments, and did so in a way that meets the quality standards for upstream inclusion (available in mainline since Linux v4.15).
I later (re)discovered that the almost 10-year-old AXFS filesystem (still maintained out of tree) could have been a good fit. I had forgotten about it though, and in any case I prefer to work with mainline code.
One may wonder why DAX was not used here. DAX is like XIP on steroids; it is tailored for large writable filesystems and relies on the presence of an MMU (which the STM32 processor lacks) to page in and out data as needed. Its documentation also mentions another shortcoming: "The DAX code does not work correctly on architectures which have virtually mapped caches such as ARM, MIPS and SPARC". Because cramfs with XIP is read-only and small enough to always be entirely mapped in memory, it is possible to achieve the intended result with a much simpler approach, making DAX somewhat overkill in this context.
Now that we're set with an XIP-capable filesystem, it is time to populate it. I'm using a static build of BusyBox to keep things simple. Using a target with an MMU, we can see how things are mapped in memory:
# cat /proc/self/maps
00010000-000a5000 r-xp 08101000 1f:00 1328 /bin/busybox
000b5000-000b7000 rw-p 00095000 1f:00 1328 /bin/busybox
000b7000-000da000 rw-p 00000000 00:00 0 [heap]
bea07000-bea28000 rw-p 00000000 00:00 0 [stack]
bebc1000-bebc2000 r-xp 00000000 00:00 0 [sigpage]
bebc2000-bebc3000 r--p 00000000 00:00 0 [vvar]
bebc3000-bebc4000 r-xp 00000000 00:00 0 [vdso]
ffff0000-ffff1000 r-xp 00000000 00:00 0 [vectors]
The clue that gives XIP away is the third column of the first output line: 0x08101000. That column is meant to be the file offset for the mapping, except that remap_pfn_range(), used to establish an XIP mapping, overwrites the file offset in the virtual memory area (VMA) structure (vma->vm_pgoff) with the physical address of the mapping. The value 0x08101000 would be far too large for a file offset here; instead, it corresponds to a location in the physical address range of the flash memory. Cramfs may also use vm_insert_mixed() in some cases, in which case this physical-address reporting is not available; a reliable way to display XIP mappings in all cases would be useful.
The second /bin/busybox mapping (the .data section) is flagged read-write (rw-p), unlike the first one (the .text section) which is read-only and executable (r-xp). Writable segments cannot be mapped to the flash memory and, therefore, have to be loaded in RAM in the usual way.
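The offset check above can be sketched in shell: any "offset" that falls within the flash's physical address range cannot be a real file offset. The base and size here are assumptions matching the STM32 flash mapping used in this article:

```shell
# Sketch: decide whether a maps-file "offset" is really a flash
# physical address (the sign of an XIP mapping).  flash_base and
# flash_size are assumed values for this STM32 target.
flash_base=$((0x08000000))
flash_size=$((0x01000000))
in_flash() {
    addr=$((0x$1))
    [ "$addr" -ge "$flash_base" ] && \
    [ "$addr" -lt $((flash_base + flash_size)) ]
}
# The third column of the first busybox mapping above:
if in_flash 08101000; then
    echo "0x08101000 lies in flash: XIP mapping"
fi
```

A real file offset such as 0x00095000 (from the second mapping) fails this test, as expected.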
The MMU makes it easy for a program to see its code at the absolute address it expects regardless of the actual memory used. Things aren't that simple in the no-MMU case, where user executables must be able to run at any memory address; position-independent code (PIC) is therefore a requirement. This ability is offered by the bFLT flat file format, and has been available for quite a long time with uClinux targets. However, this format has multiple limitations that make XIP, shared libraries, or the combination of both, unwieldy.
Fortunately there is a variant of ELF, called ELF FDPIC, that overcomes all those limitations. Because FDPIC segments are position-independent with no predetermined relative offset between them, it is possible to share common .text segments across multiple executable instances just like standard ELF-on-MMU targets, and those .text segments may be XIP as well. ELF FDPIC support was added to the ARM architecture (also available in mainline since Linux v4.15).
On my STM32 target, with the combination of an XIP-enabled cramfs and ELF FDPIC user-space binaries, the BusyBox mapping now looks like this:
# cat /proc/self/maps
00028000-0002d000 rw-p 00037000 1f:03 1660 /bin/busybox
0002d000-0002e000 rw-p 00000000 00:00 0
0002e000-00030000 rw-p 00000000 00:00 0 [stack]
081a0760-081d8760 r-xs 00000000 1f:03 1660 /bin/busybox
Due to the lack of an MMU, the XIP segment is even more obvious as there is no address translation and the flash-memory address is clearly visible. The no-MMU memory mapping support requires shared mappings for XIP, hence the "r-xs" annotation.
Okay, now we're all set for some hammering. We've seen that our XIP BusyBox above already saved 229,376 bytes of RAM, or 56 memory pages. That represents 44% of our total budget of 128 pages if we want to target 512KB of RAM. From now on, it is important to closely track where memory allocations go and determine how useful that precious memory is. Let's start by looking at the kernel itself, using a trimmed-down configuration from the previous article, but with CONFIG_XIP_KERNEL=y (and LTO disabled for now as it takes too long to build). We get:
text data bss dec hex filename
1016264 97352 169568 1283184 139470 vmlinux
The 1,016,264 bytes of text are located in flash so we can ignore them for a while. The 266,920 bytes of data and BSS, though, represent 51% of our RAM budget. Let's find out what is responsible for it with some scripting on the System.map file:
#!/bin/sh
#
# List the kernel's statically allocated RAM symbols (data and BSS),
# sorted by decreasing size.  System.map is sorted by address, so the
# size of each symbol is the distance to the next symbol.
{
	read addr1 type1 sym1
	while read addr2 type2 sym2; do
		size=$((0x$addr2 - 0x$addr1))
		case $type1 in
		b|B|d|D)
			# printf is portable; "echo -e" is not under /bin/sh
			printf '%s %s\t%s\n' "$type1" "$size" "$sym1"
			;;
		esac
		type1=$type2
		addr1=$addr2
		sym1=$sym2
	done
} < System.map | sort -n -r -k 2
The first output lines are:
B 133953016 _end
b 131072 __log_buf
d 8192 safe_print_seq
d 8192 nmi_print_seq
D 8192 init_thread_union
d 4288 timer_bases
b 4100 in_lookup_hashtable
b 4096 ucounts_hashtable
d 3960 cpuhp_ap_states
[...]
Here we ignore _end because its apparent huge size comes from the fact that the end of kernel static allocation in RAM comes before the next kernel symbol located in flash — much higher in the address space. It is always good to go back to System.map to make sense of some weird cases like this.
However, we do have a clearly identifiable memory allocation to pound on. Looking at the declaration of __log_buf we see:
/* record buffer */
#define LOG_ALIGN __alignof__(struct printk_log)
#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
This one is easy. Because we don't want to configure out the whole of printk() support just yet, we'll set CONFIG_LOG_BUF_SHIFT=12 (the smallest allowed value). While we're at it, we'll also set CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT to its minimum of ten. The result is:
text data bss dec hex filename
1016220 83016 42624 1141860 116c64 vmlinux
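A quick sanity check of the log-buffer change alone, as back-of-the-envelope shell arithmetic (the default buffer seen earlier was 1 << 17 bytes):

```shell
# __log_buf shrinks from 1 << 17 bytes (the 131072 listed above)
# down to 1 << 12 bytes with CONFIG_LOG_BUF_SHIFT=12:
echo $(( (1 << 17) - (1 << 12) ))   # prints 126976 (bytes saved)
```

The remainder of the overall saving comes from the safe_print_seq and nmi_print_seq buffers shrinking along with CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT.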
Our RAM usage went down from 266,920 to 125,640 bytes with a couple of simple configuration tweaks. Let's see our symbol-size list again:
B 134092280 _end
D 8192 init_thread_union
d 4288 timer_bases
b 4100 in_lookup_hashtable
b 4096 ucounts_hashtable
b 4096 __log_buf
d 3960 cpuhp_ap_states
[...]
The next contender is init_thread_union. This one is interesting because its size is derived from THREAD_SIZE_ORDER, which determines how many stack pages each kernel task gets. The first task (the init task) happens to have its stack statically allocated in the .data segment, which is why we see it here. Changing this from two pages to one page should be perfectly fine for our tiny environment, and this will also save one page per task with dynamically allocated stacks.
To reduce the size of timer_bases we'll tweak the value of LVL_BITS down from six to four. To reduce in_lookup_hashtable we change IN_LOOKUP_SHIFT from ten to five. And so on for a few more random kernel constants.
Figuring out and reducing static memory allocations is easy, as we've seen. But dynamic allocations must be dealt with as well, and for that we have to instrument our target and boot it. The first dynamic allocations come from the memblock allocator, since the usual kernel memory allocators are not up and running yet. All the instrumentation we need is already there; it suffices to add "memblock=debug" to the kernel command line to activate it. Here's what it shows:
memblock_reserve: [0x00008000-0x000229f7] arm_memblock_init+0xf/0x48
memblock_reserve: [0x08004000-0x08007dbe] arm_memblock_init+0x1d/0x48
Here we have our static RAM being reserved, followed by our kernel code and read-only data in flash (which is mapped starting at 0x08004000). If the kernel code were in RAM then it would make sense to reserve that too. In this case this is just a useless but harmless reservation since the flash will never be allocated for any other purpose anyway.
Now for actual dynamic allocations:
memblock_virt_alloc_try_nid_nopanic: 131072 bytes align=0x0 nid=0
from=0x0 max_addr=0x0 alloc_node_mem_map.constprop.6+0x35/0x5c
Normal zone: 32 pages used for memmap
Normal zone: 4096 pages, LIFO batch:0
This is our memmap array taking up 131,072 bytes (32 pages) in order to manage 4096 pages. By default, this target uses the full 16MB of external RAM available on the board. So if we reduce the amount of available memory to, say, 512KB, this array will shrink accordingly.
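A back-of-the-envelope check of that scaling: struct page evidently takes 131072 / 4096 = 32 bytes on this configuration, so with only 512KB of RAM the memmap array fits in a single page:

```shell
# memmap cost scales linearly with the number of managed pages
pages=$(( 512 * 1024 / 4096 ))   # pages in 512KB of RAM (128)
per_page=$(( 131072 / 4096 ))    # memmap bytes per page (32)
echo $(( pages * per_page ))     # prints 4096
```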
The next significant allocation is:
memblock_virt_alloc_try_nid_nopanic: 32768 bytes align=0x1000 nid=-1
from=0xffffffff max_addr=0x0 setup_per_cpu_areas+0x21/0x64
pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
32KB of per-CPU memory pool for a uniprocessor system with less than a megabyte of RAM? Nah. Here are a few tweaks to include/linux/percpu.h to reduce that to a single page:
-#define PCPU_MIN_UNIT_SIZE PFN_ALIGN(32 << 10)
+#define PCPU_MIN_UNIT_SIZE PFN_ALIGN(4 << 10)
-#define PERCPU_DYNAMIC_EARLY_SLOTS 128
-#define PERCPU_DYNAMIC_EARLY_SIZE (12 << 10)
+#define PERCPU_DYNAMIC_EARLY_SLOTS 32
+#define PERCPU_DYNAMIC_EARLY_SIZE (4 << 10)
+#undef PERCPU_DYNAMIC_RESERVE
+#define PERCPU_DYNAMIC_RESERVE (4 << 10)
It is worth noting that only the SLOB memory allocator (CONFIG_SLOB) still works after these changes.
Moving on to the next major allocation:
memblock_virt_alloc_try_nid_nopanic: 8192 bytes align=0x0 nid=-1
from=0x0 max_addr=0x0 alloc_large_system_hash+0x119/0x1a4
Dentry cache hash table entries: 2048 (order: 1, 8192 bytes)
memblock_virt_alloc_try_nid_nopanic: 4096 bytes align=0x0 nid=-1
from=0x0 max_addr=0x0 alloc_large_system_hash+0x119/0x1a4
Inode-cache hash table entries: 1024 (order: 0, 4096 bytes)
Who said this is a large system? Yes, you should get the idea by now; a couple more small tweaks are needed, but they're omitted from this article for the sake of keeping it reasonably short.
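For what it's worth, both of those hash tables can also be capped from the kernel command line without patching anything; these boot parameters are documented in the kernel's kernel-parameters.txt (the values here are arbitrary examples, and the kernel still rounds them up to its minimums):

```
dhash_entries=1024 ihash_entries=512
```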
After that, the usual kernel memory allocators such as kmalloc() take over, and allocations ultimately end up down in __alloc_pages_nodemask(). The same kind of tracing and tweaks may be applied until the boot is complete. Sometimes it is just a matter of configuring out more stuff, such as the sysfs filesystem, whose memory needs are a bit excessive for our budget, and so on.
Now that we have hammered down the kernel's RAM usage, we're ready to flash and boot it again. The minimum amount of RAM required for a successful boot to user space at this point is 800KB ("mem=800k" on the kernel command line). Let's explore our small world:
BusyBox v1.7.1 (2017-09-16 02:45:01 EDT) hush - the humble shell
# free
total used free shared buffers cached
Mem: 672 540 132 0 0 0
-/+ buffers/cache: 540 132
# cat /proc/maps
00028000-0002d000 rw-p 00037000 1f:03 1660 /bin/busybox
0002d000-0002e000 rw-p 00000000 00:00 0
0002e000-00030000 rw-p 00000000 00:00 0
00030000-00038000 rw-p 00000000 00:00 0
0004d000-0004e000 rw-p 00000000 00:00 0
00061000-00062000 rw-p 00000000 00:00 0
0006c000-0006d000 rw-p 00000000 00:00 0
0006f000-00070000 rw-p 00000000 00:00 0
00070000-00078000 rw-p 00000000 00:00 0
00078000-0007d000 rw-p 00037000 1f:03 1660 /bin/busybox
081a0760-081d8760 r-xs 00000000 1f:03 1660 /bin/busybox
Here we can see two four-page RAM mappings from offset 0x37000 of /bin/busybox. Those are two data instances, one for the shell process and one for the cat process. They both share the BusyBox XIP code segment at 0x081a0760, which is good. There are also two anonymous eight-page RAM mappings among the much smaller ones, though, and those eat into our page budget pretty quickly. They correspond to a 32KB stack for each of those processes. That certainly can be tweaked down:
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -337,6 +337,7 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
retval = -ENOEXEC;
if (stack_size == 0)
stack_size = 131072UL; /* same as exec.c's default commit */
+ stack_size = 8192;
if (is_constdisp(&interp_params.hdr))
interp_params.flags |= ELF_FDPIC_FLAG_CONSTDISP;
That's certainly a quick and nasty hack; properly setting the stack size in the ELF binary's header is the way to go. It would also require careful validation, say, on an MMU system with a fixed-size stack where any stack overflow could be caught. But hey, it wouldn't be our first hack at this point and it will do for now.
Still, before rebooting, let's explore some more:
# ps
PID USER VSZ STAT COMMAND
1 0 300 S {busybox} sh
2 0 0 SW [kthreadd]
3 0
ps invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL),
nodemask=(null), order=0, oom_score_adj=0
[...]
Out of memory: Kill process 19 (ps) score 5 or sacrifice child
The intervention of the out-of-memory killer was bound to happen at some point, of course. However, the out-of-memory report also provided this piece of information from the buddy allocator:
Normal: 2*4kB (U) 3*8kB (U) 2*16kB (U) 2*32kB (UM)
0*64kB 0*128kB 0*256kB = 128kB
The ps process tried to perform a memory allocation with order=0 (a single 4KB page) and it failed despite 128KB still being available. Why is that? It turns out that the page allocator refuses normal memory allocations when less than a certain small amount of memory remains free, as enforced by zone_watermark_ok(). This avoids possible deadlocks when a failed memory allocation results in the killing of a process, an operation that may itself require memory allocations. Even though this watermark is supposed to be small, in our tiny environment it is still something we don't need and can't afford. So let's simply zero those watermarks:
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7035,6 +7035,10 @@ static void __setup_per_zone_wmarks(void)
zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
+ zone->watermark[WMARK_MIN] = 0;
+ zone->watermark[WMARK_LOW] = 0;
+ zone->watermark[WMARK_HIGH] = 0;
+
spin_unlock_irqrestore(&zone->lock, flags);
}
Finally we're able to reboot with "mem=768k" on the kernel command line:
Linux version 4.15.0-00008-gf90e37b6fb-dirty (nico@xanadu.home) (gcc version 6.3.1 20170404
(Linaro GCC 6.3-2017.05)) #634 Fri Feb 23 14:03:34 EST 2018
CPU: ARMv7-M [410fc241] revision 1 (ARMv7M), cr=00000000
CPU: unknown data cache, unknown instruction cache
OF: fdt: Machine model: STMicroelectronics STM32F469i-DISCO board
On node 0 totalpages: 192
Normal zone: 2 pages used for memmap
Normal zone: 0 pages reserved
Normal zone: 192 pages, LIFO batch:0
random: fast init done
[...]
BusyBox v1.27.1 (2017-09-16 02:45:01 EDT) hush - the humble shell
# free
total used free shared buffers cached
Mem: 644 532 112 0 0 24
-/+ buffers/cache: 508 136
# ps
PID USER VSZ STAT COMMAND
1 0 276 S {busybox} sh
2 0 0 SW [kthreadd]
3 0 0 IW [kworker/0:0]
4 0 0 IW< [kworker/0:0H]
5 0 0 IW [kworker/u2:0]
6 0 0 IW< [mm_percpu_wq]
7 0 0 SW [ksoftirqd/0]
8 0 0 IW< [writeback]
9 0 0 IW< [watchdogd]
10 0 0 IW [kworker/0:1]
11 0 0 SW [kswapd0]
12 0 0 SW [irq/31-40002800]
13 0 0 SW [irq/32-40004800]
16 0 0 IW [kworker/u2:1]
21 0 0 IW [kworker/u2:2]
23 0 260 R ps
# grep -v " 0 kB" /proc/meminfo
MemTotal: 644 kB
MemFree: 92 kB
MemAvailable: 92 kB
Cached: 24 kB
MmapCopy: 92 kB
KernelStack: 64 kB
CommitLimit: 320 kB
Here it is! Not exactly our target of 512KB of RAM but 768KB is getting pretty close. Some microcontrollers already have more than that amount of available on-chip SRAM.
Easy improvements are still possible. We can see above that 14 of the 16 tasks are kernel threads, each with its own 4KB stack; some of them could certainly go. Going through another round of memory-page tracking would reveal yet more things that could be optimized out. And a dedicated application that doesn't spawn child processes would likely require even less RAM than this generic shell environment does. After all, some popular microcontrollers that are able to connect to the Internet have less total RAM than the free RAM remaining here.
There is at least one important lesson to be learned from the work on this project: shrinking the kernel's RAM usage is much easier than shrinking its code size. The code tends to be highly optimized already, because it has a direct influence on system performance, even on big systems. That is not necessarily the case for actual memory usage though. RAM comes relatively cheap on big systems, and wasting some of it really doesn't matter much in practice. Therefore, much low-hanging fruit can be found when optimizing RAM usage for small systems.
Other than the small tweaks and quick hacks presented here, all the major pieces relied upon in this article (XIP kernel, XIP user space, even some device tree memory usage reduction) are available in the mainline already. But further work beyond this proof of concept is still needed to make Linux on tiny devices really useful. Progression of this work will depend, as always, on people's desire to use it and willingness to form a community to promote its development.
[Thanks to Linaro for allowing me to work on this project and to write this article series.]
Sigal is a "simple static gallery generator" with a straightforward design, a nice feature set, and great themes. It was started as a toy project, but has nevertheless grown into a sizable and friendly community. After struggling with maintenance using half a dozen photo gallery projects along the way, I feel I have found a nice little gem that I am happy to share with LWN readers.
Sigal is part of a growing family of static site generators (SSG), software that generates web sites as static HTML files as opposed to more elaborate Content Management Systems (CMS) that generate HTML content on the fly. A CMS requires specialized server-side software that needs maintenance to keep up to date with security fixes. That software is always running and exposed on the network, whereas a site generated with an SSG is only a collection of never-changing files. This drastically reduces the attack surface as visitors do not (usually) interact with the software directly. Finally, web servers can deliver static content much faster than dynamic content, which means SSGs can offer better performance than a CMS.
Having contributed to a major PHP-based CMS for over a decade, I was glad to finally switch to an SSG (ikiwiki) for my own web site three years ago. My photo gallery, however, was still running on a CMS: after running the venerable Gallery software (in hibernation since 2014), then Coppermine, I ended up using Piwigo. But that required a PHP-enabled web server, which meant chasing an endless stream of security issues. While I did consider non-PHP alternatives like MediaGoblin, that seemed too complicated (requiring Celery, Paste, and PostgreSQL). Really, static site generators had me hooked and there was no turning back.
Initially, I didn't use Sigal, as I first stumbled upon PhotoFloat. It is the brainchild of Jason A. Donenfeld, the same person behind the pass password manager that we previously covered and the WireGuard virtual private network (VPN). PhotoFloat is a small Python program that generates a static gallery running custom JavaScript code. I was enthusiastic about the project: I packaged it for Debian and published patches to implement RSS feeds and multiple-gallery support. Unfortunately, patches from contributors would sit on the mailing list for months without feedback, which led some users to fork the project. Donenfeld was not happy with the result; he decried the new PHP dependency and claimed that the fork introduced a directory-traversal vulnerability. The fork now seems to be more active than the original and was renamed MyPhotoShare. But at that point, I was already looking for alternatives, and found out about Sigal while browsing a friend's photo gallery.
Sigal was created by a French software developer from Lyon, Simon Conseil. In an IRC interview, he said that he started working on Sigal as a "toy project to learn Python", as part of his work in Astrophysics data processing at the Very Large Telescope in Chile:
Before starting a new project from scratch, Conseil first looked for alternatives ("Gallerize, lazygal, and a few others") but couldn't find anything satisfactory. He wanted to reuse ideas from Pelican, for example the Jinja2 template engine for themes and the Blinker plugin system, so he started his own project.
Like other static gallery generators, Sigal parses a tree of images and generates thumbnails and HTML pages to show those images. Instead of deploying its own custom JavaScript application for browsing images in the browser, Sigal reuses existing applications like Galleria, PhotoSwipe, and Colorbox. Image metadata is parsed from Exif tags, but a Markdown-formatted text file can also be used to change image or album titles, descriptions, and locations. The latest 1.4 release can also read metadata from in-image IPTC tags. Sigal parses regular images using the Pillow library but can also read video files, which get converted to browser-readable video files with the ubiquitous FFmpeg. Sigal has good (if minimal) online documentation and, like any good Python program, can be installed with pip; I am working on packaging it for Debian.
Plugins offer support for image or copyright watermarks. The adjust plugin also allows for minor image adjustments, although those apply to the whole gallery, so it is unclear to me how useful that plugin really is. Even novice photographers would more likely make adjustments in a basic image editor like Shotwell, digiKam, or maybe even GIMP before trying to tweak images in a Python configuration file. Finally, another plugin provides a simple RSS feed, which is useful to let users keep track of the latest images published in the gallery.
When I asked him about future plans, Conseil said he had "no roadmap":
For me Sigal has been doing its job for a long time now, but the cool thing is that people find it useful and contribute. So my only wish is that this continues and to help the project live for and by its community, which is slowly growing.
Following this lead, I submitted patches and ideas of my own to the project while working on this article. The first shortcoming I have found with Sigal is the lack of access control. A photo gallery is either private or world-readable; there is no way to restrict access to only some albums or photos. I found a way, however, to implement folder password protection using the Basic authentication type for the Apache web server, which I documented in an FAQ entry. It's a little clunky as it uses password files managed through the old htpasswd command. It also means using passwords and, in my usability tests, some family members had trouble typing my weird randomly generated passwords on their tablets. I would have preferred to find a way to use URL-based authentication, with an unguessable one-time link, but I haven't found an easy way to do this in the web server. It can be done by picking a random name for the entire gallery, but not for specific folders, because those get leaked by Sigal. To protect certain pictures, they have to be in a separate gallery, which complicates maintenance.
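To illustrate the kind of setup involved, here is an Apache fragment along the lines of that FAQ entry, with made-up paths; the password file would be managed with htpasswd as described above:

```apache
# Protect one album directory with Basic authentication
# (directory and htpasswd paths are examples only)
<Directory "/var/www/gallery/private">
    AuthType Basic
    AuthName "Private album"
    AuthUserFile /etc/apache2/gallery.htpasswd
    Require valid-user
</Directory>
```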
Which brings us to gallery operation: to create a Sigal gallery, you need to create a configuration file and run the sigal build command. This is pretty simple, but I think it can be made even simpler. I have proposed having a default configuration file so that creating one isn't required to make new galleries. I also looked at implementing a "daemon" mode that would watch a directory for changes and rebuild when new pictures show up. For now, I have settled on a quick hack based on the entr utility, but there's talk of implementing the feature directly in the build command. Such improvements would enable mass hosting of photo galleries with minimal configuration. It would also make it easier to create password-less private galleries with unique, unguessable URLs.
Another patch I am working on is the stream plugin, which creates a new view of the gallery; instead of a folder-based interface, this shows the latest pictures published as a flat list. This is how commercial services like Instagram and Flickr work; even though you can tag pictures or group them by folder, they also offer a unified "stream" view of the latest entries in a gallery. As a demonstration of Sigal's clean design, I was able to quickly find my way in the code base to implement the required changes to the core libraries and unit tests, which are now waiting for review.
In closing, I have found Sigal to be a simple and elegant project. As it stands, it should be sufficient for basic galleries, but more demanding photographers and artists might need more elaborate solutions. Ratings, comments, and any form of interactivity will obviously be difficult to implement in Sigal; fans of those features should probably look at CMS solutions like Piwigo or the new Lychee project. But dynamic features are perhaps best kept to purpose-built free software like Discourse that embeds dynamic controls in static sites. In any case, for a system administrator tired of maintaining old software, the idea of having only static web sites to worry about is incredibly comforting. That simplicity and reliability has made Sigal a key tool in my amateur photographer toolbox.
A set of demos is available for readers who want to see more themes and play around with a real gallery.
Both the free-software and security communities have recently been focusing on the elements of our computers that run below the operating system. These proprietary firmware components are usually difficult or impossible to extend and it has long been suspected (and proven in several cases) that there are significant security concerns with them. The LinuxBoot Project is working to replace this complex, proprietary, and largely unknown firmware with a Linux kernel. That has the added benefit of replacing the existing drivers in the firmware with well-tested drivers from Linux.
To understand LinuxBoot and the problem it's working to solve, we first have to discuss how computers actually boot. We usually think of a running system as including the hardware, operating system (OS), and applications. However, for a number of reasons, there are several layers that run between the hardware and the OS. Most users are aware of UEFI (which replaced the older BIOS); for many systems, it prepares the system to run and loads the bootloader. These necessary functions are just the tip of the iceberg, though. Even after the computer finishes loading the OS, there are multiple embedded systems also running on the system entirely separate from the OS. Most notably, the Intel Management Engine (ME) runs a complete Minix operating system, while System Management Mode (SMM) is used to run code for certain events (e.g. laptop lid gets closed) in a way that is completely invisible to the running OS.
All of these add up to the LinuxBoot project's statement that there are "at least 2.5 kernels between the hardware and Linux"; those kernels collectively make up the firmware. Many of these firmware components are surprisingly complex and capable, with full network stacks and extensive hardware drivers. The firmware is a major concern to the free-software community, as it leaves all computers running on a large foundation of code that is proprietary and unaudited. It is also a major risk to large tech companies and cloud providers, since the firmware presents many opportunities for powerful, persistent exploits and rootkits. Beyond these substantial security concerns, improvements in the performance and flexibility of the early stages of boot are also a major motivation for this simplification and move to open source.
To work towards a more elegant and open solution, Google launched a project called NERF, or the Non-Extensible Reduced Firmware, which LWN covered back in November. The primary goal of NERF was to reduce the firmware attack surface by removing almost all functionality that was not necessary to start the operating system (although there are limits on the extent to which this is possible). NERF consisted of a "full stack" solution of stripped-down EFI firmware, a Linux kernel, and an initramfs with tools written in Go. Although these components all make up one bundle stored in ROM, they have since been split into separate projects: LinuxBoot is the firmware and kernel while the user-space initramfs image with Go tools for system booting is available as u-root. Due to this modularity, LinuxBoot can be used with a variety of initramfs images.
It's easier to understand how these components fit together if we first look a bit closer at the boot process. A UEFI computer boots in four main phases. The security phase (SEC) and the Pre-EFI Initialization Stage (PEI) are responsible for low-level operations to prepare the hardware and are usually specific to the hardware they are implemented for. After these two stages, the Driver Execution Environment (DXE) loads various drivers, and then the Boot Device Select (BDS) phase can begin.
The BDS phase is where most of what we as users are aware of happens: the system is searched for storage devices, those devices are inspected for bootloaders, and then one of those bootloaders is started. This sounds simple but, due to the many different storage devices and boot schemes available, it can be quite complicated in practice. To manage this complexity, the previous DXE stage loads hundreds of drivers on most systems, implementing support for the majority of the devices the final operating system will be able to use. Further, hardware in the system can provide "option ROMs" that include additional drivers to be loaded and executed prior to BDS. All of these are potentially vulnerable, and it's easy to imagine incredibly damaging attacks based on the extensive capabilities of the largely hidden UEFI.
This is where LinuxBoot comes in. It starts during the DXE stage, resulting in most of the drivers (and their associated attack surface) not being loaded. Instead, a Linux kernel is loaded as if it were a driver.
This brings several immediate advantages. The first, and perhaps most exciting, is that the Linux kernel is an open-source system that has been subject to a great deal of security scrutiny and receives regular fixes. This is almost the exact opposite of the case with existing UEFI drivers. The second is a practical improvement: running Linux at this early stage provides a standardized and well-documented programming environment for developing further modules. This makes it far more practical to extend UEFI's capabilities.
By loading during the DXE phase, LinuxBoot runs after the first two UEFI stages and takes over from that point, replacing the UEFI drivers. It therefore completely replaces a large portion of the boot process, much of what is often still colloquially called "the BIOS". Because LinuxBoot handles the part of the boot process where I/O devices become available, it eliminates the parts of UEFI that carry the greatest security and stability risk.
A Linux kernel alone is not of much use; to implement actual system features, a Linux user space is needed. That user space is provided in an initramfs filesystem, much like the standard Linux boot process. There are currently several options available for user-space tools. One is u-root, which is the original solution developed by Google as part of the NERF project. U-root is a tiny image containing binaries written entirely in the Go programming language, and can optionally be built as a single executable binary for optimal load and execution times. This makes it similar to BusyBox. U-root is a complete-enough Linux system to start the final operating system by simply executing its kernel using kexec.
Another notable option for the user space is HEADS, which is a Linux system specifically designed to be secure even when an attacker has physical access to the system. It achieves this through TPM-based hardware trust and extensive protections against firmware attacks. HEADS was originally designed to boot directly from coreboot, but has been ported to work as a LinuxBoot initramfs as well.
Essentially any Linux tool set that can be made compact enough to fit in the available ROM can be built into an initramfs for LinuxBoot, including BusyBox and other minimal user-space projects. The availability of several initramfs images emphasizes the core feature of LinuxBoot: it is an open system. Not only is the LinuxBoot source itself available, the community is free to develop initramfs images that utilize the power of the Linux kernel to extend the firmware in new directions that were almost impossible before due to the closed nature of UEFI implementations.
Because of the limitations of the environment available early in the boot process, the LinuxBoot kernel and user space are highly stripped down. You wouldn't want to use either as your actual computing environment. Instead, LinuxBoot can carry out the next steps to bring the system to a usable state. In many cases this is as simple as searching the available storage devices for a standard Linux kernel and handing over control using kexec, but LinuxBoot enables more sophisticated boot processes as well. An obvious application is cryptographic verification of the final operating system, building another link in a trusted computing system. The sky is the limit, though, and there are many possibilities: improved booting from the network, easier support for unusual storage arrangements, and advanced security controls, for example.
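The simple case described above, finding a kernel and handing over control, can be sketched with the kexec-tools command-line interface. This is a dry run that only prints the commands; the kernel path, the derived initramfs name, and the root= argument are illustrative assumptions, not a real configuration.

```shell
# Dry-run sketch: derive an initramfs name from an assumed kernel
# image and print the kexec-tools commands a LinuxBoot user space
# could run. All paths here are illustrative assumptions.
KERNEL=/boot/vmlinuz-4.15.2
INITRD=$(echo "$KERNEL" | sed 's/vmlinuz-/initrd.img-/')
echo "kexec -l $KERNEL --initrd=$INITRD --append='root=/dev/sda1 ro'"
echo "kexec -e"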
Just as the LinuxBoot initramfs depends on the LinuxBoot kernel to start it, the LinuxBoot kernel depends on a previous step in the boot process. This is where coreboot may be involved. LinuxBoot and coreboot are not competitors; rather, they address different stages of booting. Remember that the UEFI boot process consists of four stages, of which the third (the driver execution environment or DXE) is the point at which LinuxBoot starts. Coreboot is an implementation of the first two stages, where it can replace the UEFI firmware provided by the motherboard vendor. Coreboot only supports a narrow range of hardware but, when it can be used in concert with LinuxBoot, it enables an almost completely open boot process.
LinuxBoot itself promises much broader hardware support than coreboot because the LinuxBoot kernel and initramfs generally do not need to be heavily customized for the hardware—this is a major part of the reason that LinuxBoot is designed to start after the first few stages of UEFI boot. That allows the UEFI firmware to handle the hardware-specific steps that vendors tend to keep as trade secrets.
LinuxBoot has been demonstrated on Chromebooks, and has clear applications for consumer computers, but the major target for LinuxBoot at present is servers. It has recently drawn quite a bit of attention in the large data center environment; the Open Compute Project, which maintains open-source infrastructure designs used by Facebook among others, recently announced that LinuxBoot developer Ron Minnich is to serve as one of the two leads of its firmware work. Server vendor Horizon Computing plans to release the Winterfell server, an Open Compute Project implementation that ships with LinuxBoot and NERF, in the near future. Considering Facebook's heavy investment in Open Compute, this could mean LinuxBoot in use outside of Google soon.
The LinuxBoot Project has recently become a Linux Foundation project. Contributors to the project include Google and Facebook alongside smaller firms Horizon Computing, Two Sigma, and 9elements. Judging by the level of interest the project has attracted early on, this list seems likely to grow.
LinuxBoot is a promising development for anyone frustrated with the amount of hidden, proprietary, and highly privileged firmware in their computers—and the attack surface and maintenance complexity that all of that brings with it. The project's industry support and a general wave of interest in firmware security may make our infrastructure more secure and bring our devices a bit closer to being under our control.
Page editor: Jonathan Corbet
Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds