LWN.net Weekly Edition for October 11, 2012
Udev and firmware
Those who like to complain about udev, systemd, and their current maintainers have had no shortage of company recently as the result of a somewhat incendiary discussion on the linux-kernel mailing list. Underneath the flames, though, lie some important issues: who decides what constitutes appropriate behavior for kernel device drivers, how strong is our commitment to backward compatibility, and which tasks are best handled in the kernel without calling out to user space?

The udev process is responsible for a number of tasks, most initiated as the result of events originating in the kernel. It responds to device creation events by making device nodes, setting permissions, and, possibly, running a setup program. It also handles module loading requests and firmware requests from the kernel. So, for example, when a driver calls request_firmware(), that request is turned into an event that is passed to the udev process. Udev will, in response, locate the firmware file, read its contents, and pass the data back to the kernel. The driver will get its firmware blob without having to know anything about how things are organized in user space, and everybody should be happy.
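The driver side of this dance takes only a few lines of code. Here is a minimal sketch (the firmware file name and the example_push_to_hardware() helper are hypothetical):

    #include <linux/firmware.h>
    #include <linux/device.h>

    static int example_load_firmware(struct device *dev)
    {
        const struct firmware *fw;
        int ret;

        /* Generates a firmware request event; traditionally udev
           locates the file and feeds its contents back to the kernel. */
        ret = request_firmware(&fw, "example/device-fw.bin", dev);
        if (ret)
            return ret;    /* no firmware available; the driver must cope */

        /* fw->data points at the blob, fw->size gives its length */
        example_push_to_hardware(dev, fw->data, fw->size);

        release_firmware(fw);
        return 0;
    }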
Back in January, the udev developers decided to implement a stricter notion of sequencing between various types of events. No events for a specific device, they decided, would be processed until the loading of the driver module for that device had completed. Doing things this way makes it easier to keep things straight in user space and to avoid attempting operations that the kernel is not yet ready to handle. But it also created problems for some types of drivers. In particular, if a driver tries to load device firmware during the module initialization process, things will appear to lock up: udev sees that the module is not yet initialized, so it holds onto the firmware request, and everything stops. Udev developer Kay Sievers warned the world about this problem last January, making it clear that he saw firmware loading at module initialization time as a driver bug.
The problem with this line of reasoning, of course, is that one person's kernel bug is another's user-space problem. Firmware loading at module initialization time has worked just fine for a long time — if one ignores little problems like built-in modules, booting with init=/bin/sh, and other situations where proper user-space support is not present when the request_firmware() call takes place. What matters most is that it works for a normal bootstrap on a typical distribution install. The udev sequencing change breaks that: users of a number of distributions have been reporting that things no longer work properly with newer versions of udev installed.
Breaking currently-running systems is something the kernel development community tries hard to avoid, so it is not surprising that there was some disagreement over the appropriateness of the udev changes. Even so, various kernel developers were trying to work around the problems when Linus threw a bit of a tantrum, saying that the problem lies with udev and needs to be fixed there. He did not get the response that he was hoping for.
Kay answered that, despite the problem reports, udev had not yet been fixed, saying "we still haven't wrapped our head around how to fix it/work around it." He pointed out that things don't really hang, they just get "slow" while waiting for a 30-second timeout to expire. And he reiterated his position that the real problem lies in the kernel and should be fixed there. Linus was unimpressed, but, since he does not maintain udev, there is not a whole lot that he can do directly to solve the problem.
Or, then again, maybe there is. One possibility raised by a few developers was pulling udev into the kernel source tree and maintaining it as part of the kernel development process. There was a certain amount of support for this idea, but nobody actually stepped up to take responsibility for maintaining udev in that environment. Such a move would represent a fork of a significant package that would take it in a new direction; current plans are to integrate udev more thoroughly with systemd. The current udev developers thus seem unlikely to support putting udev in the kernel tree. Getting distributors to adopt the kernel's version of udev could also prove to be a challenge. In general, it is the sort of mess that is best avoided if at all possible.
An alternative is to simply short out udev for firmware loading altogether. That is, in fact, what has been done; the 3.7 kernel will include a patch (from Linus) that causes firmware loading to be done directly from the kernel without involving user space at all. If the kernel is unable to find the firmware file in the expected places (under /lib/firmware and variants) it will fall back to sending a request to udev in the usual manner. But if the kernel-space load attempt works, then udev will never even know that the firmware request was made.
This appears to be a solution that is workable for everybody involved. There is nothing particularly tricky about firmware loading, so few developers seem to have concerns about doing it directly from the kernel. Kay supports the idea as well, saying "I would absolutely like to get udev entirely out of the sick game of firmware loading". The real proof will be in how well the concept works once the 3.7 kernel starts seeing widespread testing, but the initial indications are that there will not be a lot of problems. If things stay that way, it would not be surprising to see the direct firmware loading patch backported to the stable series — once it has gained a few amenities like user-configurable paths.
One of the biggest challenges in kernel design can be determining what should be done in the kernel and what should be pushed out to user space. The user-space solution is often appealing; it can simplify kernel code and make it easier for others to implement their own policies. But an overly heavy reliance on user space can lead to just the sort of difficulty seen with firmware loading. In this case, it appears, the problem was better solved in the kernel; fortunately, it appears to have been a relatively easy one for the kernel to take back without causing compatibility problems.
CIA.vc shuts down
CIA didn't seem important until it was gone. For developers and users on IRC networks like Freenode, CIA was just there in the background, relaying commit messages into the channels of thousands of projects in real time—until recently.
CIA.vc was a central clearinghouse for commit messages sent to it from ten thousand or more version control repositories. There were CIA hooks for Subversion, git, bzr, and others, so a project just had to install such a hook into its repository and register on the CIA website. CIA handled the rest, collecting the commit messages as they came in and announcing them on appropriate channels via its swarm of IRC bots. Here is an example from the #commits channel from April:
<CIA-93> OpenWrt: [packages] fwknop: update to 2.0, use new startup commands
<CIA-93> vlc: Pierre Ynard master * r31b5fbdb6d vlc/modules/lua/libs/equalizer.c:
lua: fix memory and object leak and reset locale on error path
<CIA-93> FreeBSD: rakuco * ports/graphics/autoq3d/files/
(patch-src__cmds__cmds.cpp . patch-src__fgui__cadform.cpp):
<CIA-93> FreeBSD: Make the port build with gcc 4.6 (and possibly other compilers).
<CIA-93> gentoo: robbat2 * gentoo/xml/htdocs/proj/en/perl/outdated-cpan-packages.xml:
Automated update of outdated-cpan-packages.xml
<CIA-93> compiz-fusion: a.j.buxton master * /fusion/plugins-main/src/ezoom/ezoom.c:
For a decade, the CIA bots were part of the infrastructure of many projects; along with the bug tracker, mailing lists, wiki, and version control system, they helped tie communities together. Eric S. Raymond was among those who described the communal effect of the CIA service.
That stream of notifications dried up on September 26th, when CIA.vc was shut down due to a miscommunication with a hosting provider. It seems there were no backups. It is unclear whether CIA will return, but two possible replacements are available now.
irker
Irker is a simple replacement for CIA that was announced just three days later. Raymond had been developing it even before CIA went down, and chose a design quite different from the centralized CIA architecture.
Irker consists of two pieces: a server that acts as a simple relay to IRC and a client that sends messages to be relayed. The server has no knowledge of version control systems or commits, and could be used to relay any sort of content. All the version-control-specific code necessary to extract and format the message is in the client, which is run by a version control hook script.
The irker client and server typically both run on the same machine or LAN, so each project or hosting service is responsible for running its own service, rather than relying on a centralized service like CIA.
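To see how simple the relay is, consider this sketch of a submission client in C, assuming irkerd's published one-line JSON request format and its default port of 6659 (the channel and message text are made up):

    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define IRKER_PORT 6659    /* irkerd's default listener port */

    int main(void)
    {
        /* One JSON object per request: the IRC target and the text. */
        const char *req =
            "{\"to\": \"irc://chat.freenode.net/#commits\", "
            "\"privmsg\": \"myproject: fix off-by-one in foo()\"}";
        struct sockaddr_in addr = { 0 };
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
            return 1;
        addr.sin_family = AF_INET;
        addr.sin_port = htons(IRKER_PORT);
        addr.sin_addr.s_addr = inet_addr("127.0.0.1");  /* irkerd on this host */

        sendto(fd, req, strlen(req), 0,
               (struct sockaddr *)&addr, sizeof(addr));
        close(fd);
        return 0;
    }

A real hook would, of course, generate the message text from the commit being reported.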
Irker has undergone heavy development since the announcement, and is now considered basically feature complete. Its simple and general design is likely to lead to other things being built on top of it. For example, there is a CIA to irker proxy for sites that want to retain their current CIA hooks.
KGB
Although irker made a splash when CIA died, another clone has quietly been overlooked for years. KGB was developed by Martín Ferrari and Damyan Ivanov of the Debian project and released in 2009. KGB is shipped in the current Debian stable release, as well as in Ubuntu universe, making it easy to deploy as a replacement for CIA.
KGB is, like irker, a decentralized client-server system. Unlike irker's content-agnostic server, the KGB server is responsible for formatting notifications from commit information it receives from its clients. Though a less flexible design, this does insulate the clients from some details of IRC, particularly from message length limits.
KGB has enjoyed a pronounced upswing in feature requests and development since CIA went down, gaining features such as web links to commits, URL shortening, and the ability to broadcast all projects' notifications to a channel like #commits.
Will CIA.vc return?
The CIA.vc website currently promises an attempt to revive the service. Any attempt to do so will surely face numerous challenges, not least the missing database, which held much of the configuration driving CIA's behavior. Unless a recent backup of the database is found, any revived CIA.vc will need a great deal of reconfiguration to return it to its past functionality.
CIA's code base, while still available, is large and complex, with many moving parts written in different languages; it is reputedly difficult to install and has been neglected for years. Raymond's opinion is that "CIA suffered a complexity collapse", and, as he said, "It is notoriously difficult to un-collapse a rubble pile".
Even if CIA does eventually return, it seems likely that many projects will have moved away from it for good, deploying their own irker or KGB bots. The Apache Software Foundation, the KDE project, and Debian's Alioth project hosting site have already deployed their own bots. If larger hosting sites like GitHub, SourceForge, and Savannah follow suit, any revived CIA may be reduced to being, at best, a third player.
Conclusion
CIA.vc was a centralized service, with code that is free software, but with a design and implementation that did not encourage reuse. The service was widely used by the community, which mostly seems to have put up with its instability, its UTF-8 display bugs, its odd formatting of git revision numbers, and its often crufty hook scripts.
According to CIA's author, Micah Dowty, it never achieved a "critical mass of involvement" from contributors. Perhaps CIA was not seen as important enough to work on; but with two replacements now being developed, there is certainly evidence of interest. Or perhaps CIA did not present itself as a free software project, and so was instead treated as simply the service that it appeared to be. CIA's website featured things like a commit leaderboard and a new-project list, which certainly helped entice people to use it. (Your author must confess to occasionally trying to fit enough commits into a day to get to the top of that leaderboard.) But the website did not encourage the filing of bugs or patches.
In a way, the story of CIA mirrors the story of the version control systems it reported on. When CIA began in 2003, centralized version control was the norm. The Linux kernel used distributed version control only thanks to the proprietary BitKeeper, which itself ran a centralized commit publication service. These choices were entirely pragmatic, and the centralized CIA was perhaps in keeping with the times.
Much as happened with version control, the community has gone from reliance on a centralized service to having a choice of decentralized alternatives. As a result, new features are rapidly emerging in both KGB and irker that CIA never provided. This is certainly a healthy response to CIA's closure, but it also seems that our many years of reliance on the centralized service held us back from exploring the space that CIA occupied.
Linux and automotive computing security
There was no security track at the 2012 Automotive Linux Summit, but numerous sessions and the "hallway track" featured anecdotes about the ease of compromising car computers. This is no surprise: as Linux makes inroads into automotive computing, the security question takes on an urgency not found on desktops and servers. Too often, though, Linux and open source software in general are perceived as insufficiently battle-hardened for the safety-critical needs of highway-speed computing; in the comments on any automotive Linux news story, it is easy to find a skeptic scoffing that he or she would not trust Linux to manage the engine, brakes, or airbags. While hackers in other embedded Linux realms may understandably feel miffed at such a slight, the bigger problem is said skeptic's presumption that a modern Linux-free car is a secure environment — which is demonstrably untrue.
First, there is a mistaken assumption that computing is not yet a pervasive part of modern automobiles. Likewise mistaken is the assumption that safety-critical systems (such as the aforementioned brakes, airbags, and engine) are properly isolated from low-security components (like the entertainment head unit) and are not vulnerable to attack. It is also incorrectly assumed that the low-security systems themselves do not harbor risks to drivers and passengers. In reality, modern cars have shipped with multiple embedded computers for years (many of which are mandatory by government order), presenting a large attack surface with numerous risks to personal safety, theft, eavesdropping, and other exploits. But rather than exacerbating this situation, Linux and open source adoption stand to improve it.
There is an abundance of research dealing with hypothetical exploits to automotive computers, but the seminal work on practical exploits is a pair of papers from the Center for Automotive Embedded Systems Security (CAESS), a team from the University of California San Diego and the University of Washington. CAESS published a 2010 report [PDF] detailing attacks that they managed to implement against a pair of late-model sedans via the vehicles' Controller Area Network (CAN) bus, and a 2011 report [PDF] detailing how they managed to access the CAN network from outside the car, including through service station diagnostic scanners, Bluetooth, FM radio, and cellular modem.
Exploits
The 2010 paper begins by addressing the connectivity of modern cars. CAESS did not disclose the brand of vehicle they experimented on (although car mavens could probably identify it from the photographs), but they purchased two vehicles and experimented with them on the lab bench, on a garage lift, and finally on a closed test track. The cars were not high-end, but they provided a wide range of targets. Embedded electronic control units (ECUs) are found all over the automobile, monitoring and reporting on everything from the engine to the door locks, not to mention lighting, environmental controls, the dash instrument panel, tire pressure sensors, steering, braking, and so forth.
Not every ECU is designed to control a portion of the vehicle, but, due to the nature of the CAN bus, any ECU can be used to mount an attack. CAN is roughly equivalent to a link-layer protocol, but it is broadcast-only, employs no source addressing or authentication, and is susceptible to denial-of-service attacks (either through simple flooding or by broadcasting messages with high-priority message IDs, which force all other nodes to back off and wait). With a device plugged into the CAN bus (such as through the OBD-II port mandatory on all 1996-or-newer vehicles in the US), attackers can spoof messages from any ECU. Higher-level protocols are often employed on top of CAN, but CAESS was able to reverse-engineer the protocols in its test vehicles and found security holes that allowed attackers to brute-force the challenge-response system in a matter of days.
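Linux's own SocketCAN API (not part of the CAESS work, but a convenient illustration) shows how little the bus demands of a sender: any node can emit a frame carrying any message ID, and receivers have no way to verify where it came from. A sketch, with error checking omitted and a made-up message ID and payload:

    #include <string.h>
    #include <unistd.h>
    #include <net/if.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <linux/can.h>
    #include <linux/can/raw.h>

    int main(void)
    {
        int s = socket(PF_CAN, SOCK_RAW, CAN_RAW);
        struct sockaddr_can addr = { 0 };
        struct can_frame frame = { 0 };
        struct ifreq ifr;

        strcpy(ifr.ifr_name, "can0");
        ioctl(s, SIOCGIFINDEX, &ifr);       /* look up the CAN interface */
        addr.can_family = AF_CAN;
        addr.can_ifindex = ifr.ifr_ifindex;
        bind(s, (struct sockaddr *)&addr, sizeof(addr));

        /* Nothing stops a sender from claiming any message ID it likes. */
        frame.can_id = 0x120;               /* a hypothetical ECU's ID */
        frame.can_dlc = 2;
        frame.data[0] = 0xde;
        frame.data[1] = 0xad;
        write(s, &frame, sizeof(frame));

        close(s);
        return 0;
    }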
CAESS's test vehicles did separate the CAN bus into high-priority and low-priority segments, providing a measure of isolation. This too proved inadequate, however: a number of ECUs were connected to both segments and could therefore be used to bridge messages between them. That set-up is not an error; despite common thinking on the subject, quite a few features demanded by car buyers rely on coordination between the high- and low-priority devices.
For example, electronic stability control involves measuring wheel speed, steering angle, throttle, and brakes. Cruise control involves the throttle, brakes, speedometer readings, and possibly ultrasonic range sensors (for collision avoidance). Even the lowly door lock must be connected to multiple systems: wireless key fobs, speed sensors (to lock the doors when in motion), and the cellular network (so that remote roadside assistance can unlock the car).
The paper details a number of attacks the team deployed against the test vehicles. The team wrote a tool called CarShark to analyze and inject CAN bus packets, which provided a method to mount many attacks. However, the vehicle's diagnostic service (called DeviceControl) also proved to be a useful platform for attack. DeviceControl is intended for use by dealers and service stations, but it was easy to reverse engineer, and subsequently allowed a number of additional attacks (such as sending an ECU the "disable all CAN bus communication" command, which effectively shuts off part of the car).
The actual attacks tested include some startlingly dangerous tricks, such as disabling the brakes. But the team also managed to create combined attacks that put drivers at risk even with "low risk" components — displaying false speedometer or fuel gauge readings, disabling dash and interior lights, and so forth. Ultimately the team was able to gain control of every ECU in the car, load and execute custom software, and erase traces of the attack.
Some of these attacks exploited components that did not adhere to the protocol specification. For example, several ECUs allowed their firmware to be re-flashed while the car was in motion, which is expressly forbidden for obvious safety reasons. Other attacks were enabled by run-of-the-mill implementation errors, such as components that re-used the same challenge-response seed value every time they were power-cycled. But ultimately, the critical factor was the fact that any device on the vehicle's internal bus can be used to mount an attack; there is no "lock box" protecting the vital systems, and the protocol at the core of the network lacks fundamental security features taken for granted on other computing platforms.
Vectors
Of course, all of the attacks described in the 2010 paper relied on an attacker with direct access to the vehicle. That did not necessarily mean ongoing access; they explained that a dongle attached to the OBD-II port could work at cracking the challenge-response system while left unattended. But, even though there are a number of individuals with access to a driver's car over the course of a year (from mechanics to valets), direct access is still a hurdle.
The 2011 paper looked at vectors to attack the car remotely, to assess the potential for an attacker to gain access to the car's internal CAN bus, at which point any of the attacks crafted in the 2010 paper could easily be executed. It considered three scenarios: indirect physical access, short-range wireless networking, and long-range wireless networking. As one might fear, all three presented opportunities.
The indirect physical access attacks targeted the CD player and the dealership or service station's scanning equipment, which is physically connected to the car while it is in the shop for diagnosis. CAESS found that the model of diagnostic scanner used (which adhered to a 2004 US-government-mandated standard called PassThru) was internally an embedded Linux device, even though it was only used to interface with a Windows application running on the shop's computer. The scanner was equipped with WiFi, however, and broadcast its address and open TCP port in the clear. The diagnostic application API is undocumented, but the team sniffed the traffic and found several exploitable buffer overflows, not to mention extraneous services like telnet running on the scanner itself. Taking control of the scanner and programming it to upload malicious code to vehicles was little additional trouble.
The CD player attack was different; it started with the CD player's firmware update facility (which loads new firmware onto the player if a properly-named file is found on an inserted disc). But the player can also decode compressed audio files, including undocumented variants of Windows Media Audio (.WMA) files. CAESS found a buffer overflow in the .WMA player code, which in turn allowed the team to load arbitrary code onto the player. As an added bonus, the .WMA file containing the exploit plays fine on a PC, making it harder to detect.
The short-range wireless attack involved attacking the head unit's Bluetooth functionality. The team found that a compromised Android device could be loaded with a trojan horse application designed to upload malicious code to the car whenever it paired. A second option was even more troubling; the team discovered that the car's Bluetooth stack would respond to pairing requests initiated without user intervention. Successfully pairing a covert Bluetooth device still required correctly guessing the four-digit authorization PIN, but since the pairing bypassed the user interface, the attacker could make repeated attempts without those attempts being logged — and, once successful, the paired device does not show up in the head unit's interface, so it cannot be removed.
Finally, the long-range wireless attack gained access to the car's CAN network through the cellular-connected telematics unit (which handles retrieving data for the navigation system, but is also used to connect to the car maker's remote service center for roadside assistance and other tasks). CAESS discovered that although the telematics unit could use a cellular data connection, it also used a software modem application to encode digital data in an audio call — for greater reliability in less-connected regions.
The team reverse-engineered the signaling and data protocols used by this software modem, and were subsequently able to call the car from another cellular device, eventually uploading malicious code through yet another buffer overflow. Even more disturbingly, the team encoded this attack into an audio file, then played it back from an MP3 player into a phone handset, again seizing control over the car.
The team also demonstrated several post-compromise attack-triggering methods, such as delaying activation of the malicious code payload until a particular geographic location was reached, or a particular sensor value (e.g., speed or tire pressure) was read. It also managed to trigger execution of the payload by using a short-range FM transmitter to broadcast a specially-encoded Radio Data System (RDS) message, which vehicles' FM receivers and navigation units decode. The same attack could be performed over longer distances with a more powerful transmitter.
Among the practical exploits outlined in the paper are recording audio through the car's microphone and uploading it to a remote server, and connecting the car's telematics unit to a hidden IRC channel, from which attackers can send arbitrary commands at their leisure. The team speculates on the feasibility of turning this last attack into a commercial enterprise, building "botnet" style networks of compromised cars, and on car thieves logging car makes and models in bulk and selling access to stolen cars in advance, based on the illicit buyers' preferences.
What about Linux?
If, as CAESS seems to have found, the state-of-the-art is so poor in automotive computing security, the question becomes how Linux (and related open source projects) could improve the situation. Certainly some of the problems the team encountered are out of scope for automotive Linux projects. For example, several of the simpler ECUs are unsophisticated microcontrollers; the fact that some of them ship from the factory with blatant flaws (such as a broken challenge-response algorithm) is the fault of the manufacturer. But Linux is expected to run on the higher-end ECUs, such as the IVI head unit and telematics system, and these components were the nexus for the more sophisticated attacks.
Several of the sophisticated attacks employed by CAESS relied on security holes found in application code. The team observed that standard defenses (like stack cookies and address-space randomization) that are established practice in other computing environments simply have not been adopted in automotive system development, for lack of perceived need. Clearly, recognizing that risk and writing more secure application code would improve things, regardless of the operating system in question. But the fact that Linux is so widely deployed elsewhere means that more security-conscious code is available for the taking than there is for any other embedded platform.
Consider the Bluetooth attack, for example. Sure, with a little effort, one could envision a scenario in which unattended Bluetooth pairing is desirable — but in practice, Linux's dominance in the mobile device space means there is a greater likelihood that developers would quickly find and patch the problem than any tier-one supplier working in isolation would.
One step further is the advantage gained by having Linux serve as a common platform used by multiple manufacturers. CAESS observed in its 2011 paper that the "glue code" linking discrete modules together was the greatest source of exploits (e.g., the PassThru diagnostic scanning device), saying "virtually all vulnerabilities emerged at the interface boundaries between code written by distinct organizations." It also noted that this was an artifact of the automotive supply chain itself, in which individual components were contracted out to separate companies working from specifications, then integrated by the car maker once delivered.
A common platform employed by multiple suppliers would go a long way toward minimizing this type of issue, and that approach can only work if the platform is open source.
Finally, the terrifying scope of the attacks carried out in the 2010 paper (and if one does not find them terrifying, one needs to read them again) ultimately traces back to the insecure design of the CAN bus. CAN needs to be replaced; working with a standard IP stack instead means not having to reinvent the wheel. The networking angle has several factors not addressed in CAESS's papers, of course — most notably the still-emerging standards for vehicle ad-hoc networking (intended to serve as a vehicle-to-vehicle and vehicle-to-infrastructure channel).
On that subject, Maxim Raya and Jean-Pierre Hubaux recommend using public-key infrastructure and other well-known practices from the general Internet communications realm. While there might be some skeptics who would argue with Linux's first-class position as a general networking platform, it should be clear to all that proprietary lock-in to a single-vendor solution would do little to improve the vehicle networking problem.
Those on the outside may find the recent push toward Linux in the automotive industry frustratingly slow — after all, there is still no GENIVI code visible to non-members. But to conclude that the pace of development indicates Linux is not up to the task would be a mistake. The reality is that the automotive computing problem is enormous in scope — even considering security alone — and Linux and open source might be the only way to get it under control.
Security
Loading modules from file descriptors
Loadable kernel modules provide a mechanism to dynamically modify the functionality of a running system by allowing code to be loaded into, and unloaded from, the kernel. Loading code into the kernel via a module has a number of advantages over building a completely new monolithic kernel from modified source code. The first of these is that loading a kernel module does not require a system reboot, so new kernel functionality can be added without disturbing users and applications.
From a developer perspective, implementing new kernel functionality via modules is faster: a slow "compile kernel, reboot, test" sequence in each development iteration is instead replaced by a much faster "compile module, load module, test" sequence. Employing modules can also save memory, since code in a module can be loaded into memory only when it is actually needed. Device drivers are often implemented as loadable modules for this reason.
From a security perspective, loadable modules also have a potential downside: since a module has full access to kernel memory, it can compromise the integrity of a system. Although modules can be loaded only by privileged users, there are still potential security risks, since a system administrator may be unable to directly verify the authenticity and origin of a particular kernel module. Providing module-related infrastructure to support administrators in that task is the subject of ongoing effort, with one of the most notable pieces being the work to support module signing.
Kees Cook has recently posted a series of patches that tackle another facet of the module-verification problem. These patches add a new system call for loading kernel modules. To understand why the new system call is useful, we need to start by looking at the existing interface for loading kernel modules.
The Linux interface for loading kernel modules has had (since kernel 2.6.0) the following form:
int init_module(void *module_image, unsigned long len,
const char *param_values);
The caller supplies the ELF image of the to-be-loaded module via the memory buffer pointed to by module_image; len specifies the size of that buffer. (The param_values argument is a string that can be used to specify initial values for the module's parameters.)
The main users of init_module() are the insmod and modprobe commands. However, any privileged user-space application (i.e., one with the CAP_SYS_MODULE capability) can load a module in the same way that these commands do, via a three-step process: opening a file that contains a suitably built ELF image, reading or mmap()ing the file's contents into memory, and then calling init_module().
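That three-step sequence, sketched in code (error handling is trimmed, and glibc provides no wrapper, so the raw system call is used):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int load_module(const char *path)
    {
        struct stat st;
        void *image;
        int fd, ret;

        fd = open(path, O_RDONLY);          /* step 1: open the .ko file */
        if (fd < 0)
            return -1;
        fstat(fd, &st);
        image = mmap(NULL, st.st_size, PROT_READ,  /* step 2: map the image */
                     MAP_PRIVATE, fd, 0);
        close(fd);
        ret = syscall(SYS_init_module, image,      /* step 3: load it */
                      (unsigned long)st.st_size, "");
        munmap(image, st.st_size);
        return ret;
    }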
However, this call sequence is the source of an itch for Kees. Because the step of obtaining a file descriptor for the image file is separated from the module-loading step, the operating system loses the ability to make deductions about the trustworthiness of the module based on its origin in the filesystem.
His solution is fairly straightforward: remove the middle of the three steps listed above. Instead, the application opens the file and passes the returned file descriptor directly to the kernel as part of a new module-loading system call; the kernel then performs the task of reading the module image from the file as a precursor to loading the module.
Although the concept is simple, the solution has been through a few iterations, with the most notable changes being to the details of the user-space interface. Kees's initial proposal was to hack the existing init_module() interface so that, if NULL were passed in the module_image argument, the kernel would interpret the len argument as a file descriptor. Rusty Russell, the kernel modules subsystem maintainer, somewhat bluntly suggested that a new system call would be a better approach. On the next revision of the patch, H. Peter Anvin pointed out that the system call would be better named according to existing conventions, whereby the file-descriptor analog of an existing system call uses the same name with an "f" prefix. Thus, Kees arrived at the currently proposed interface:
int finit_module(int fd, const char *param_values);
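Usage would look like the following sketch; no glibc wrapper or syscall number exists yet, so the __NR_finit_module constant here is hypothetical:

    #include <fcntl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int fload_module(const char *path, const char *params)
    {
        /* The file descriptor tells the kernel exactly where
           the module image came from. */
        int fd = open(path, O_RDONLY);
        int ret;

        if (fd < 0)
            return -1;
        ret = syscall(__NR_finit_module, fd, params);
        close(fd);
        return ret;
    }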
In the most recent patch, Kees, who works for Google on Chrome OS, has also further elaborated on the motivations for adding this system call. Specifically, in order to ensure the integrity of a user's system, the Chrome OS developers would like to be able to enforce the restriction that kernel modules are loaded only from the system's read-only, cryptographically verified root filesystem. Since the developers already trust the contents of the root filesystem, employing module signatures to verify the contents of a kernel module would require the addition of an unnecessary set of keys to the kernel and would also slow down module loading. All that Chrome OS requires is a light-weight mechanism for verifying that the module image originates from that filesystem, and the new system call provides just that facility.
Kees pointed out that the new system call also has potential for wider use. For example, Linux Security Modules (LSMs) could use it to examine digital signatures contained in the module file's extended attributes (the file descriptor provides the kernel with the route to access the extended attributes). During discussion of the patches, interest in the new system call was confirmed by the maintainers of the IMA and AppArmor kernel subsystems.
At this stage, there appear to be few roadblocks to getting this system call into the kernel. The only question is when it will arrive. Kees would very much like to see the patches go into the currently open 3.7 merge window, but for various reasons, it appears probable that they will only be merged in Linux 3.8.
Update, January 2013: finit_module() was indeed merged in Linux 3.8, but with a changed API that added a flags argument that can be used to modify the behavior of the system call. Details can be found in the manual page.
Brief items
Security quotes of the week
Judging from the page titles and content the websites in question were targeted because they reference the number "45".
The Linux Foundation's UEFI secure boot system
The Linux Foundation has announced a new boot system meant to make life easier on UEFI secure boot systems. "In a nutshell, the Linux Foundation will obtain a Microsoft Key and sign a small pre-bootloader which will, in turn, chain load (without any form of signature check) a predesignated boot loader which will, in turn, boot Linux (or any other operating system). The pre-bootloader will employ a 'present user' test to ensure that it cannot be used as a vector for any type of UEFI malware to target secure systems. This pre-bootloader can be used either to boot a CD/DVD installer or LiveCD distribution or even boot an installed operating system in secure mode for any distribution that chooses to use it."
The CryptoParty Handbook
The first draft of the CryptoParty Handbook, a 390-page guide to maintaining privacy in the networked world, is available. "This book was written in the first 3 days of October 2012 at Studio Weise7, Berlin, surrounded by fine food and a lake of coffee amidst a veritable snake pit of cables. Approximately 20 people were involved in its creation, some more than others, some local and some far (Melbourne in particular)." It is available under the (still evolving) CC-BY-SA 4.0 license. The guide, too, is still evolving; it should probably be regarded the way one would look at early-stage cryptographic code. Naturally, the authors are looking for contributors to help make the next release better.
New vulnerabilities
bacula: information disclosure
Package(s): bacula
CVE #(s): CVE-2012-4430
Created: October 8, 2012
Updated: May 19, 2014

Description: From the Debian advisory:

It was discovered that bacula, a network backup service, does not properly enforce console ACLs. This could allow information about resources to be dumped by an otherwise-restricted client.
bind: denial of service
Package(s): bind
CVE #(s): CVE-2012-5166
Created: October 10, 2012
Updated: November 6, 2012

Description: From the Mandriva advisory:

A certain combination of records in the RBT could cause named to hang while populating the additional section of a response.
hostapd: denial of service
Package(s): hostapd
CVE #(s): CVE-2012-4445
Created: October 8, 2012
Updated: October 19, 2012

Description: From the Debian advisory:

Timo Warns discovered that the internal authentication server of hostapd, a user space IEEE 802.11 AP and IEEE 802.1X/WPA/WPA2/EAP Authenticator, is vulnerable to a buffer overflow when processing fragmented EAP-TLS messages. As a result, an internal overflow checking routine terminates the process. An attacker can abuse this flaw to conduct denial of service attacks via crafted EAP-TLS messages prior to any authentication.
libxslt: code execution
Package(s): libxslt
CVE #(s): CVE-2012-2893
Created: October 4, 2012
Updated: October 22, 2012

Description: From the Ubuntu advisory:

Cris Neckar discovered that libxslt incorrectly managed memory. If a user or automated system were tricked into processing a specially crafted XSLT document, a remote attacker could cause libxslt to crash, causing a denial of service, or possibly execute arbitrary code. (CVE-2012-2893)
mozilla: multiple vulnerabilities
Package(s): firefox, thunderbird, seamonkey
CVE #(s): CVE-2012-3983 CVE-2012-3989 CVE-2012-3984 CVE-2012-3985
Created: October 10, 2012
Updated: October 17, 2012

Description: From the Ubuntu advisory:

Henrik Skupin, Jesse Ruderman, Christian Holler, Soroush Dalili and others discovered several memory corruption flaws in Firefox. If a user were tricked into opening a specially crafted web page, a remote attacker could cause Firefox to crash or potentially execute arbitrary code as the user invoking the program. (CVE-2012-3982, CVE-2012-3983, CVE-2012-3988, CVE-2012-3989)

David Bloom and Jordi Chancel discovered that Firefox did not always properly handle the <select> element. A remote attacker could exploit this to conduct URL spoofing and clickjacking attacks. (CVE-2012-3984)

Collin Jackson discovered that Firefox did not properly follow the HTML5 specification for document.domain behavior. A remote attacker could exploit this to conduct cross-site scripting (XSS) attacks via javascript execution. (CVE-2012-3985)

Johnny Stenback discovered that Firefox did not properly perform security checks on test methods for DOMWindowUtils. (CVE-2012-3986)

Alice White discovered that the security checks for GetProperty could be bypassed when using JSAPI. If a user were tricked into opening a specially crafted web page, a remote attacker could exploit this to execute arbitrary code as the user invoking the program. (CVE-2012-3991)

Mariusz Mlynski discovered a history state error in Firefox. A remote attacker could exploit this to spoof the location property to inject script or intercept posted data. (CVE-2012-3992)

Mariusz Mlynski and others discovered several flaws in Firefox that allowed a remote attacker to conduct cross-site scripting (XSS) attacks. (CVE-2012-3993, CVE-2012-3994, CVE-2012-4184)

Abhishek Arya, Atte Kettunen and others discovered several memory flaws in Firefox when using the Address Sanitizer tool. If a user were tricked into opening a specially crafted web page, a remote attacker could cause Firefox to crash or potentially execute arbitrary code as the user invoking the program. (CVE-2012-3990, CVE-2012-3995, CVE-2012-4179, CVE-2012-4180, CVE-2012-4181, CVE-2012-4182, CVE-2012-4183, CVE-2012-4185, CVE-2012-4186, CVE-2012-4187, CVE-2012-4188)
mozilla: multiple vulnerabilities
Package(s): firefox, thunderbird, seamonkey
CVE #(s): CVE-2012-3982 CVE-2012-3986 CVE-2012-3988 CVE-2012-3990 CVE-2012-3991 CVE-2012-3992 CVE-2012-3993 CVE-2012-3994 CVE-2012-3995 CVE-2012-4179 CVE-2012-4180 CVE-2012-4181 CVE-2012-4182 CVE-2012-4183 CVE-2012-4184 CVE-2012-4185 CVE-2012-4186 CVE-2012-4187 CVE-2012-4188
Created: October 10, 2012
Updated: January 10, 2013

Description: From the Red Hat advisory:

Several flaws were found in the processing of malformed web content. A web page containing malicious content could cause Firefox to crash or, potentially, execute arbitrary code with the privileges of the user running Firefox. (CVE-2012-3982, CVE-2012-3988, CVE-2012-3990, CVE-2012-3995, CVE-2012-4179, CVE-2012-4180, CVE-2012-4181, CVE-2012-4182, CVE-2012-4183, CVE-2012-4185, CVE-2012-4186, CVE-2012-4187, CVE-2012-4188)

Two flaws in Firefox could allow a malicious website to bypass intended restrictions, possibly leading to information disclosure, or Firefox executing arbitrary code. Note that the information disclosure issue could possibly be combined with other flaws to achieve arbitrary code execution. (CVE-2012-3986, CVE-2012-3991)

Multiple flaws were found in the location object implementation in Firefox. Malicious content could be used to perform cross-site scripting attacks, script injection, or spoofing attacks. (CVE-2012-1956, CVE-2012-3992, CVE-2012-3994)

Two flaws were found in the way Chrome Object Wrappers were implemented. Malicious content could be used to perform cross-site scripting attacks or cause Firefox to execute arbitrary code. (CVE-2012-3993, CVE-2012-4184)
openstack-keystone: two authentication bypass flaws
Package(s): openstack-keystone
CVE #(s): CVE-2012-4456 CVE-2012-4457
Created: October 4, 2012
Updated: October 10, 2012

Description: From the Red Hat Bugzilla entries [1, 2]:

CVE-2012-4456: Jason Xu discovered several vulnerabilities in OpenStack Keystone token verification. The first occurs in the API /v2.0/OS-KSADM/services and /v2.0/OS-KSADM/services/{service_id}; the second occurs in /v2.0/tenants/{tenant_id}/users/{user_id}/roles. In both cases the OpenStack Keystone code fails to check if the tokens are valid. These issues have been addressed by adding checks in the form of test_service_crud_requires_auth() and test_user_role_list_requires_auth().

CVE-2012-4457: Token authentication for a user belonging to a disabled tenant should not be allowed.
openstack-swift: insecure use of python pickle
Package(s): openstack-swift
CVE #(s): CVE-2012-4406
Created: October 8, 2012
Updated: June 20, 2013

Description: From the Red Hat bugzilla:

Sebastian Krahmer (krahmer@suse.de) reports: swift uses pickle to store and load meta data. pickle is insecure and allows to execute arbitrary code in loads().
php: multiple vulnerabilities
Package(s): php
CVE #(s): (none listed)
Created: October 8, 2012
Updated: October 10, 2012

Description: PHP 5.4.7 fixes multiple vulnerabilities. See the PHP changelog for details.
phpldapadmin: cross-site scripting
Package(s): phpldapadmin
CVE #(s): CVE-2012-1114 CVE-2012-1115
Created: October 8, 2012
Updated: October 10, 2012

Description: From the Red Hat bugzilla:

Originally (2012-03-01), the following cross-site scripting (XSS) flaws were reported against LDAP Account Manager Pro (from the Secunia advisory):

1) Input passed to e.g. the "filteruid" POST parameter when filtering result sets in lam/templates/lists/list.php (when "type" is set to a valid value) is not properly sanitised before being returned to the user. This can be exploited to execute arbitrary HTML and script code in a user's browser session in context of an affected site.

2) Input passed to the "filter" POST parameter in lam/templates/3rdParty/pla/htdocs/cmd.php (when "cmd" is set to "export" and "exporter_id" is set to "LDIF") is not properly sanitised before being returned to the user. This can be exploited to execute arbitrary HTML and script code in a user's browser session in context of an affected site.

3) Input passed to the "attr" parameter in lam/templates/3rdParty/pla/htdocs/cmd.php (when "cmd" is set to "add_value_form" and "dn" is set to a valid value) is not properly sanitised before being returned to the user. This can be exploited to execute arbitrary HTML and script code in a user's browser session in context of an affected site.
php-zendframework: multiple vulnerabilities
Package(s): php-zendframework
CVE #(s): (none listed)
Created: October 8, 2012
Updated: October 10, 2012

Description: From the ZendFramework advisories [1], [2]:

[1] The default error handling view script generated using Zend_Tool failed to escape request parameters when run in the "development" configuration environment, providing a potential XSS attack vector.

[2] Developers using non-ASCII-compatible encodings in conjunction with the MySQL PDO driver of PHP may be vulnerable to SQL injection attacks. Developers using ASCII-compatible encodings like UTF8 or latin1 are not affected by this PHP issue.
wireshark: denial of service
Package(s): wireshark
CVE #(s): CVE-2012-5239 CVE-2012-3548
Created: October 8, 2012
Updated: March 8, 2013

Description: From the CVE entries:

The Mageia advisory references CVE-2012-5239, which is a duplicate of CVE-2012-3548. The dissect_drda function in epan/dissectors/packet-drda.c in Wireshark 1.6.x through 1.6.10 and 1.8.x through 1.8.2 allows remote attackers to cause a denial of service (infinite loop and CPU consumption) via a small value for a certain length field in a capture file. (CVE-2012-3548)
Page editor: Jake Edge
Kernel development
Brief items
Kernel release status
The 3.7 merge window is still open, so there is no current development kernel. See the article below for merges into 3.7 since last week.
Stable updates: The 3.2.31 stable kernel was released on October 10. In addition, the 3.0.45, 3.4.13, 3.5.6, and 3.6.1 stable kernels were released on October 7. Support for the 3.5 series is coming to an end, as there may only be one more update, so users of that kernel should be planning to upgrade.
Quotes of the week
Samsung's F2FS filesystem
Back in August, a linux-kernel discussion on removable device filesystems hinted at a new filesystem waiting in the wings. It now seems clear that said filesystem was the just-announced F2FS, a flash-friendly filesystem from Samsung. "F2FS is a new file system carefully designed for the NAND flash memory-based storage devices. We chose a log structure file system approach, but we tried to adapt it to the new form of storage." See the associated documentation file for details on the F2FS on-disk format and how it works.
[Update: Also see Neil Brown's dissection of f2fs on this week's kernel page.]
Kernel development news
3.7 Merge window part 2
As of this writing, Linus has pulled 9,167 non-merge changesets into the mainline for the 3.7 merge window; that's just over 3,600 changes since last week's summary. As predicted, the merge rate has slowed a bit as Linus found better things to do with his time. Still, it is shaping up to be an active development cycle.

User-visible changes since last week include:
- The kernel's firmware loader will now attempt to load files directly
from user space without involving udev. The firmware path is
currently wired to a few alternatives under /lib/firmware;
the plan is to make things more flexible in the future.
- The epoll_ctl() system call supports a new
EPOLL_CTL_DISABLE operation to disable polling on a specific
file descriptor.
- The Xen paravirtualization mechanism is now supported on the ARM
architecture.
- The tools directory contains a new "trace agent" utility; it
uses virtio to move trace data from a guest system to a host in an
efficient manner. Also added to tools is acpidump,
which can dump a system's ACPI tables to a text file.
- Online resizing of ext4 filesystems that use the metablock group
(meta_bg) or 64-bit block number features is now supported.
- The UBI translation layer for flash-based storage devices has gained
an experimental "fastmap" capability. The fastmap caches erase block
mappings, eliminating the need to scan the device at mount time.
- The Btrfs filesystem has gained the ability to perform hole punching
with the fallocate() system call.
- New hardware support includes:
- Systems and processors:
Freescale P5040DS reference boards,
Freescale / iVeia P1022RDK reference boards, and
MIPS Technologies SEAD3 evaluation boards.
- Audio:
Wolfson Bells boards,
Wolfson WM0010 digital signal processors,
TI SoC based boards with twl4030 codecs,
C-Media CMI8328-based sound cards, and
Dialog DA9055 audio codecs.
- Crypto: IBM 842 Power7 compression accelerators.
- Graphics: Renesas SH Mobile LCD controllers.
- Miscellaneous: ST-Ericsson STE Modem devices,
Maxim MAX8907 power management ICs,
Dialog Semiconductor DA9055 PMICs,
Texas Instruments LP8788 power management units,
Texas Instruments TPS65217 backlight controllers,
TI LM3630 and LM3639 backlight controllers,
Dallas DS2404 RTC chips,
Freescale SNVS RTC modules,
TI TPS65910 RTC chips,
RICOH 5T583 RTC chips,
Marvell MVEBU pin control units (and several SoCs using it),
Marvell 88PM860x PMICs,
LPC32x SLC and MLC NAND controllers, and
TI EDMA controllers.
- Video4Linux2: Syntek STK1160 USB audio/video bridges, TechnoTrend USB infrared receivers, Nokia N900 (RX51) IR transmitters, Chips&Media Coda multi-standard codecs, FCI FC2580 silicon tuners, Analog Devices ADV7604 decoders, Analog Devices AD9389B encoders, Samsung Exynos G-Scaler image processors, Samsung S5K4ECGX sensors, and Elonics E4000 silicon tuners.
Changes visible to kernel developers include:
- The precursors for the user-space API
header file split have been merged. These create
include/uapi directories meant to hold header files
containing the definitions of data types visible to user space.
Actually splitting those definitions out is a lengthy patch set that
looks to be only partially merged in 3.7; the rest will have to wait
for the 3.8 cycle.
- The core of the Nouveau driver for NVIDIA chipsets has been torn out
and rewritten. The developers understand the target hardware much
better than they did when Nouveau started; the code has now been
reworked to match that understanding.
- The Video4Linux2 subsystem tree has been massively reorganized; driver
source files are now organized by bus type. Most files have moved, so
developers working in this area will need to retrain their fingers for
the new locations. There is also a new, rewritten DVB USB core; a
number of drivers have been converted to this new code.
- The ALSA sound driver subsystem has added a new API for the management
of audio channels; see Documentation/sound/alsa/Channel-Mapping-API.txt
for details.
- The red-black tree implementation has been substantially reworked. It now implements both interval trees and priority trees; the older kernel "prio tree" implementation has been displaced by this work and removed.
Linus had raised the possibility of extending the merge window if his travels got in the way of pulling changes into the mainline. The changeset count thus far, though, suggests that there has been no problem with merging, so chances are that the merge window will close on schedule around October 14.
The new visibility of RCU processing
If you run a post-3.6 Linux kernel for long enough,
you will likely see a process named rcu_sched or
rcu_preempt or maybe even rcu_bh having
consumed significant CPU time.
If the system goes idle and all application processes exit,
these processes might well have the largest CPU consumption of
all the remaining processes.
It is only natural to ask “what are these processes and why
are they consuming so much CPU?”
The “what” part is easy: These are new kernel threads
that handle RCU grace periods, previously handled mainly in
softirq context.
An “RCU grace period” is a period of time after which
all pre-existing RCU read-side critical sections have completed,
so that if an RCU updater
removes a data element from an RCU-protected data structure and then
waits for an RCU grace period, it may subsequently safely carry out
destructive-to-readers actions, such as freeing the data element.
RCU read-side critical sections begin with rcu_read_lock()
and end with rcu_read_unlock().
Updaters can wait for an RCU grace period using synchronize_rcu(),
or they can asynchronously schedule a function to be invoked after
a grace period using call_rcu().
RCU's read-side primitives are extremely fast and scalable, so it
can be quite helpful in read-mostly situations.
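The canonical usage pattern, condensed into a sketch (gp is a hypothetical RCU-protected pointer, and serialization between concurrent updaters is omitted):

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
        int data;
    };
    static struct foo __rcu *gp;

    /* Reader: cheap, and never blocks an updater. */
    static int read_data(void)
    {
        struct foo *p;
        int val = -1;

        rcu_read_lock();
        p = rcu_dereference(gp);
        if (p)
            val = p->data;
        rcu_read_unlock();
        return val;
    }

    /* Updater: publish the new version, wait out a grace
       period, then free the old version. */
    static void update_data(struct foo *newp)
    {
        struct foo *oldp = rcu_dereference_protected(gp, 1);

        rcu_assign_pointer(gp, newp);
        synchronize_rcu();    /* all pre-existing readers are done */
        kfree(oldp);
    }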
For more detail on RCU, see:
The RCU API, 2010 Edition,
What is RCU, Fundamentally?,
this
set of slides [PDF], and
the RCU home page.
Quick Quiz 1: Why would latency be reduced by moving RCU work to a kthread? And why would anyone care about latency on huge machines?
The reason for moving RCU grace-period handling to a kernel thread was to improve real-time latency (both interrupt latency and scheduling latency) on huge systems by allowing RCU's grace-period initialization to be preempted: Without preemption, this initialization can inflict more than 200 microseconds of latency on huge systems. In addition, this change will very likely also improve RCU's energy efficiency while also simplifying the code. These potential simplifications are due to the fact that kernel threads make it easier to guarantee forward progress, avoiding hangs in cases where all CPUs are asleep and thus ignoring the current grace period, as confirmed by Paul Walmsley. But the key point here is that these kernel threads do not represent new overhead: Instead, overhead that used to be hidden in softirq context is now visible in kthread context.
Quick Quiz 2: If hackbench does a million grace periods in ten minutes, just how many does something like rcutorture do?
Now for “why so much CPU?”, which is the question
Ingo Molnar asked immediately upon seeing more than three minutes of
CPU time consumed by
rcu_sched after running a couple hours of kernel builds.
The answer is that Linux makes heavy use of RCU, so much so that
running hackbench for ten minutes can result in almost
one million RCU grace periods—and more than thirty seconds
of CPU time consumed by rcu_sched.
This works out to about thirty microseconds per grace period, which
is anything but excessive, considering the amount of work that grace
periods do.
As it turns out, the CPU consumption
of rcu_sched, rcu_preempt, and
rcu_bh
is often roughly equal to the sum of that of the ksoftirqd
threads.
Interestingly enough, in 3.6 and earlier, some of the RCU grace-period overhead
would have been charged to the ksoftirqd kernel threads.
But CPU overhead per grace period is only part of the story.
RCU works hard to process multiple updates (e.g., call_rcu()
or synchronize_rcu() invocations) with a single grace period.
It is not hard to achieve more than one hundred updates per grace
period, which results in a per-update overhead of only about 300 nanoseconds,
which is not bad at all.
Furthermore, workloads having well in excess of one thousand updates
per grace period
have been observed.
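The call_rcu() form is what makes this batching natural: an updater that cannot afford to block simply queues a callback, and all callbacks queued during one grace period are satisfied by that single grace period. A minimal sketch, with illustrative names:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
    int a;
    struct rcu_head rcu;    /* storage for the deferred callback */
};

/* Invoked by RCU after a grace period has elapsed. */
static void foo_reclaim(struct rcu_head *head)
{
    kfree(container_of(head, struct foo, rcu));
}

/* Non-blocking retirement: the actual kfree() happens later, and many
 * such requests can share a single grace period. */
static void foo_retire(struct foo *fp)
{
    call_rcu(&fp->rcu, foo_reclaim);
}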
Of course, the per-grace-period CPU overhead does vary, and
with it the per-update overhead.
First, the greater the number of possible CPUs
(as given at boot time by nr_cpu_ids),
the more work RCU must do when initializing and cleaning up grace periods.
This overhead grows fairly slowly, with additional work required
with the addition of each set of 16 CPUs, though this number varies
depending on the
CONFIG_RCU_FANOUT_LEAF kernel configuration parameter
and also on the rcutree.rcu_fanout_leaf kernel boot parameter.
Second, the greater the number of idle CPUs, the more work RCU must do when forcing quiescent states. Yes, the busier the system, the less work RCU needs to do! The reason for the extra work is that RCU is not permitted to disturb idle CPUs for energy-efficiency reasons. RCU must therefore probe a per-CPU data structure to read out idleness state during each grace period, likely incurring a cache miss on each such probe.
Third and finally, the overhead will vary depending on CPU clock rate, memory-system performance, virtualization overheads, and so on. All that aside, I see per-grace-period overheads ranging from 15 to 100 microseconds on the systems I use. I suspect that a system with thousands of CPUs might consume many hundreds of microseconds, or perhaps even milliseconds, of CPU time for each grace period. On the other hand, such a system might also handle a very large number of updates per grace period.
Quick Quiz 3: Now that all of the RCU overhead is appearing on the rcu_sched, rcu_preempt, and rcu_bh kernel threads, we should be able to more easily identify that overhead and optimize RCU, right? Answer
In conclusion, the rcu_sched, rcu_preempt, and
rcu_bh CPU overheads should not be anything to worry about.
They do not represent new overhead inflicted on post-3.6 kernels,
but rather better accounting of the
same overhead that RCU has been incurring all along.
Acknowledgments
I owe thanks to Ingo Molnar for first noting this issue and the need to let the community know about it. We all owe a debt of gratitude to Steve Dobbelstein, Stephen Rothwell, Jon Corbet, and Paul Walmsley for their help in making this article human-readable. I am grateful to Jim Wasko for his support of this effort.
Answers to Quick Quizzes
Quick Quiz 1: Why would latency be reduced by moving RCU work to a kthread? And why would anyone care about latency on huge machines?
Answer: Moving work from softirq to a kthread allows that work to be more easily preempted, and this preemption reduces scheduling latency. Low scheduling latency is of course important in real-time applications, but it also helps reduce OS jitter. Low OS jitter is critically important to certain types of high-performance-computing (HPC) workloads, which is the type of workload that tends to be run on huge systems.
Quick Quiz 2:
Wow!!! If hackbench does a million grace periods in ten minutes,
just how many does something like rcutorture do?
Answer:
Actually, rcutorture tortures RCU in many different ways,
including overly long read-side critical sections, transitions to and
from idle, and CPU hotplug operations.
Thus, a typical rcutorture run would probably “only”
do about 100,000 grace periods in a ten-minute interval.
In short, the grace-period rate can vary widely depending on your hardware, kernel configuration, and workload.
Quick Quiz 3:
Now that all of the RCU overhead is appearing
on the rcu_sched, rcu_preempt, and
rcu_bh kernel threads, we should be able to more
easily identify that overhead and optimize RCU, right?
Answer:
Yes and no.
Yes, it is easier to optimize that which can be easily measured.
But no, not all of RCU's overhead appears on the
rcu_sched, rcu_preempt, and
rcu_bh kernel threads.
Some of it still appears on the ksoftirqd kernel
threads, and some of it is spread over other tasks.
Still, yes, the greater visibility should be helpful.
An f2fs teardown
When a techno-geek gets a new toy there must always be an urge to take it apart and see how it works. Practicalities (and warranties) sometimes suppress that urge, but in the case of f2fs and this geek, the urge was too strong. What follows is the result of taking apart this new filesystem to see how it works.
f2fs (interestingly not "f3s") is the "flash-friendly file system", a new filesystem for Linux recently announced by engineers from Samsung. Unlike jffs2 and logfs, f2fs is not targeted at raw flash devices, but rather at the specific hardware that is commonly available to consumers — SSDs, eMMC, SD cards, and other flash storage with an FTL (flash translation layer) already built in. It seems that as hardware gets smarter, we need to make even more clever software to manage that "smartness". Does this sound like parenting to anyone else?
f2fs is based on the log-structured filesystem (LFS) design — which is hardly surprising given the close match between the log-structuring approach and the needs of flash. For those not familiar with log-structured design, the key elements are:
- That it requires copy-on-write, so data is always written to previously unused space.
- That free space is managed in large regions which are written to sequentially. When the number of free regions gets low, data that is still live is coalesced from several regions into one free region, thus creating more free regions. This process is known as "cleaning" and the overhead it causes is one of the significant costs of log structuring.
As the FTL typically uses a log-structured design to provide the wear-leveling and write-gathering that flash requires, this means that there are two log structures active on the device — one in the firmware and one in the operating system. f2fs is explicitly designed to make use of this fact and leaves a number of tasks to the FTL while focusing primarily on those tasks that it is well positioned to perform. So, for example, f2fs makes no effort to distribute writes evenly across the address space to provide wear-leveling.
The particular value that f2fs brings, which can justify it being "flash friendly", is that it provides large-scale write gathering so that when lots of blocks need to be written at the same time they are collected into large sequential writes which are much easier for the FTL to handle. Rather than creating a single large write, f2fs actually creates up to six in parallel. As we shall see, these are assigned different sorts of blocks with different life expectancies. Grouping blocks with similar life expectancies together tends to make the garbage collection process required by the LFS less expensive.
The "large-scale" is a significant qualifier — f2fs doesn't always gather writes into contiguous streams, only almost always. Some metadata, and occasionally even some regular data, is written via random single-block writes. This would be anathema for a regular log-structured filesystem, but f2fs chooses to avoid a lot of complexity by just doing small updates when necessary and leaving the FTL to make those corner cases work.
Before getting into the details of how f2fs does what it does, a brief list of some of the things it doesn't do is in order.
A feature that we might expect from a copy-on-write filesystem is cheap snapshots as they can be achieved by simply not freeing up the old copy. f2fs does not provide these and cannot in its current form due to its two-locations approach to some metadata which will be detailed later.
Other features that are missing are usage quotas, NFS export, and the "security" flavor of extended attributes (xattrs). Each of these could probably be added with minimal effort if they are needed, though integrating quotas correctly with the crash recovery would be the most challenging. We shouldn't be surprised to see some of these in a future release.
Blocks, segments, sections, and zones
Like most filesystems, f2fs is comprised of blocks. All blocks are 4K in size, though the code implicitly links the block size with the system page size, so it is unlikely to work on systems with larger page sizes as is possible with IA64 and PowerPC. The block addresses are 32 bits so the total number of addressable bytes in the filesystem is at most 2^(32+12) bytes, or 16 terabytes. This is probably not a limitation — for current flash hardware at least.
Blocks are collected into "segments". A segment is 512 blocks or 2MB in size. The documentation describes this as a default, but this size is fairly deeply embedded in the code. Each segment has a segment summary block which lists the owner (file plus offset) of each block in the segment. The summary is primarily used when cleaning to determine which blocks need to be relocated and how to update the index information after the relocation. One block can comfortably store summary information for 512 blocks (with a bit of extra space which has other uses), so 2MB is the natural size for a segment. Larger would be impractical and smaller would be wasteful.
Segments are collected into sections. There is genuine flexibility in the size of a section, though it must be a power of two. A section corresponds to a "region" in the outline of log structuring given above. A section is normally filled from start to end before looking around for another section, and the cleaner processes one section at a time. The default size when using the mkfs utility is 2^0, or one segment per section.
f2fs has six sections "open" for writing at any time with different sorts of data being written to each one. The different sections allow for file content (data) to be kept separate from indexing information (nodes), and for those to be divided into "hot", "warm", and "cold" according to various heuristics. For example, directory data is treated as hot and kept separate from file data because they have different life expectancies. Data that is cold is expected to remain unchanged for quite a long time, so a section full of cold blocks is likely to not require any cleaning. Nodes that are hot are expected to be updated soon, so if we wait a little while, a section that was full of hot nodes will have very few blocks that are still live and thus will be cheap to clean.
Sections are collected into zones. There may be any (integer) number of sections in a zone though the default is again one. The sole purpose of zones is to try to keep these six open sections in different parts of the device. The theory seems to be that flash devices are often made from a number of fairly separate sub-devices each of which can process IO requests independently and hence in parallel. If zones are sized to line up with the sub-devices, then the six open sections can all handle writes in parallel and make best use of the device.
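The arithmetic is easy to check with a few lines of user-space C; the constant names below are illustrative, not f2fs's actual identifiers, and the values are the defaults described above:

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE    4096ULL   /* all blocks are 4K */
#define SEG_BLOCKS     512ULL   /* blocks per segment */
#define SEGS_PER_SEC     1ULL   /* mkfs default: 2^0 segments per section */
#define SECS_PER_ZONE    1ULL   /* default: one section per zone */

int main(void)
{
    uint64_t seg_bytes = SEG_BLOCKS * BLOCK_SIZE;
    uint64_t max_bytes = (1ULL << 32) * BLOCK_SIZE; /* 32-bit block addresses */

    printf("segment size: %lluMB\n", (unsigned long long)(seg_bytes >> 20));
    printf("maximum filesystem size: %lluTB\n",
           (unsigned long long)(max_bytes >> 40));
    return 0;
}

This prints a 2MB segment and a 16TB maximum filesystem size, matching the figures above.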
These zones, full of sections of segments of blocks, make up the "main" area of the filesystem. There is also a "meta" area which contains a variety of different metadata such as the segment summary blocks already mentioned. This area is not managed following normal log-structured lines and so leaves more work for the FTL to do. Hopefully it is small enough that this isn't a problem.
There are three approaches to management of writes in this area. First, there is a small amount of read-only data (the superblock) which is never written once the filesystem has been created. Second, there are the segment summary blocks which have already been mentioned. These are simply updated in place. This can lead to uncertainty as to the "correct" contents of a block after a crash; for segment summaries, however, this is not an actual problem. The information in each summary is checked for validity before it is used, and if there is any chance that information is missing, it will be recovered from other sources during the recovery process.
The third approach involves allocating twice as much space as is required so that each block has two different locations it can exist in, a primary and a secondary. Only one of these is "live" at any time and the copy-on-write requirement of an LFS is met by simply writing to the non-live location and updating the record of which is live. This approach to metadata is the main impediment to providing snapshots. f2fs does a small amount of journaling of updates to this last group while creating a checkpoint, which might ease the task for the FTL somewhat.
Files, inodes, and indexing
Most modern filesystems seem to use B-trees or similar structures for managing indexes to locate the blocks in a file. In fact they are so fundamental to btrfs that it takes its name from that data structure. f2fs doesn't. Many filesystems reduce the size of the index by the use of "extents" which provide a start and length of a contiguous list of blocks rather than listing all the addresses explicitly. Again, f2fs doesn't (though it does maintain one extent per inode as a hint).
Rather, f2fs uses an indexing tree that is very reminiscent of the original Unix filesystem and descendants such as ext3. The inode contains a list of addresses for the early blocks in the file, then some addresses for indirect blocks (which themselves contain more addresses) as well as some double and triple-indirect blocks. While ext3 has 12 direct addresses and one each of the indirection addresses, f2fs has 929 direct addresses, two each of indirect and double-indirect addresses, and a single triple-indirect address. This allows the addressing of nearly 4TB for a file, or one-quarter of the maximum filesystem size.
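That "nearly 4TB" figure can be sanity-checked. The calculation below assumes that a 4K index block holds roughly 1018 four-byte block addresses (1024 minus a little header space); that per-block count is an assumption made for illustration, not a number taken from the f2fs source:

#include <stdint.h>
#include <stdio.h>

#define ADDRS_PER_INODE  929ULL /* direct addresses in the inode */
#define ADDRS_PER_BLOCK 1018ULL /* assumed addresses per index block */

int main(void)
{
    uint64_t n = ADDRS_PER_BLOCK;
    uint64_t blocks = ADDRS_PER_INODE   /* direct */
        + 2 * n                         /* two indirect */
        + 2 * n * n                     /* two double-indirect */
        + n * n * n;                    /* one triple-indirect */

    printf("maximum file size: about %lluGB\n",
           (unsigned long long)((blocks * 4096) >> 30));
    return 0;
}

This works out to roughly 4032GB, which is indeed nearly 4TB.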
While this scheme has some costs — which is why other filesystems have discarded it — it has a real benefit for an LFS. As f2fs does not use extents, the index tree for a given file has a fixed and known size. This means that when blocks are relocated through cleaning, it is impossible for changes in available extents to cause the indexing tree to get bigger — which could be embarrassing when the point of cleaning is to free space. logfs, another reasonably modern log-structured filesystem for flash, uses much the same arrangement for much the same reason.
Obviously, all this requires a slightly larger inode than ext3 uses. Copy-on-write is rather awkward for objects that are smaller than the block size so f2fs reserves a full 4K block for each inode which provides plenty of space for indexing. It even provides space to store the (base) name of the file, or one of its names, together with the inode number of the parent. This simplifies the recovery of recently-created files during crash recovery and reduces the number of blocks that need to be written for such a file to be safe.
Given that the inode is so large, one would expect that small files and certainly small symlinks would be stored directly in the inode, rather than just storing a single block address and storing the data elsewhere. However f2fs doesn't do that. Most likely the reality is that it doesn't do it yet. It is an easy enough optimization to add, so it's unlikely to remain absent for long.
As already mentioned, the inode contains a single extent that is a summary of some part of the index tree. It says that some range of blocks in the file are contiguous in storage and gives the address of this range. The filesystem attempts to keep the largest extent recorded here and uses it to speed up address lookups. For the common case of a file being written sequentially without any significant pause, this should result in the entire file being in that one extent, and make lookups in the index tree unnecessary.
Surprisingly, it doesn't seem there was enough space to store 64-bit timestamps, so instead of nanosecond resolution for several centuries in the future, it only provides single-second resolution until some time in 2038. This oversight was raised on linux-kernel and may well be addressed in a future release.
One of the awkward details of any copy-on-write filesystem is that whenever a block is written, its address is changed, so its parent in the indexing tree must change and be relocated, and so on up to the root of the tree. The logging nature of an LFS means that roll-forward during recovery can rebuild recent changes to the indexing tree so all the changes do not have to be written immediately, but they do have to be written eventually, and this just makes more work for the cleaner.
This is another area where f2fs makes use of its underlying FTL and takes a short-cut. Among the contents of the "meta" area is a NAT — a Node Address Table. Here "node" refers to inodes and to indirect indexing blocks, as well as blocks used for xattr storage. When the address of an inode is stored in a directory, or an index block is stored in an inode or another index block, it isn't the block address that is stored, but rather an offset into the NAT. The actual block address is stored in the NAT at that offset. This means that when a data block is written, we still need to update and write the node that points to it. But writing that node only requires updating the NAT entry. The NAT is part of the metadata that uses two-location journaling (thus depending on the FTL for write-gathering) and so does not require further indexing.
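The effect of the indirection can be sketched in a few lines of illustrative (not actual f2fs) code:

#include <stdint.h>

/* One entry per node: inode, indirect index block, or xattr block. */
struct nat_entry {
    uint32_t block_addr;    /* current on-flash location of the node */
};

static struct nat_entry nat[1 << 20];   /* table size chosen arbitrarily */

/* Directories and index blocks store node IDs, not block addresses,
 * so finding a node means one extra lookup through the NAT... */
static uint32_t node_location(uint32_t node_id)
{
    return nat[node_id].block_addr;
}

/* ...and relocating a node touches only its NAT entry; nothing that
 * references the node needs to be rewritten. */
static void node_relocate(uint32_t node_id, uint32_t new_addr)
{
    nat[node_id].block_addr = new_addr;
}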
Directories
An LFS doesn't really impose any particular requirements on the layout of a directory, except that updates should change as few blocks as possible, which is generally good for performance anyway. So we can assess f2fs's directory structure on an equal footing with other filesystems. The primary goal is to provide fast lookup by file name, and to provide a stable address for each name that can be reported using telldir().
The original Unix filesystem (once it had been adjusted for 256-byte file names) used the same directory scheme as ext2 — sequential search through a file full of directory entries. This is simple and effective, but doesn't scale well to large directories.
More modern filesystems such as ext3, xfs, and btrfs use various schemes involving B-trees, sometimes indexed by a hash of the file name. One of the problems with B-trees is that nodes sometimes need to be split and this causes some directory entries to be moved around in the file. This results in extra challenges to provide stable addresses for telldir() and is probably the reason that telldir() is often called out for being a poor interface.
f2fs uses some sequential searching and some hashing to provide a scheme that is simple, reasonably efficient, and trivially provides stable telldir() addresses. A lot of the hashing code is borrowed from ext3, however f2fs omits the use of a per-directory seed. This seed is a secret random number which ensures that the hash values used are different in each directory, so they are not predictable. Using such a seed provides protection against hash-collision attacks. While these might be unlikely in practice, they are so easy to prevent that this omission is a little surprising.
It is easiest to think of the directory structure as a series of hash tables stored consecutively in a file. Each hash table has a number of fairly large buckets. A lookup proceeds from the first hash table to the next, at each stage performing a linear search through the appropriate bucket, until either the name is found or the last hash table has been searched. During the search, any free space in a suitable bucket is recorded in case we need to create the name.
The first hash table has exactly one bucket which is two blocks in size, so for the first few hundred entries, a simple linear search is used. The second hash table has two buckets, then four, then eight and so on until the 31st table with about a billion buckets, each two blocks in size. Subsequent hash tables — should you need that many — all have the same number of buckets as the 31st, but now they are four blocks in size.
The result is that a linear search of several hundred entries can be required, possibly progressing through quite a few blocks if the directory is very large. The number of tables to search increases only as the logarithm of the number of entries in the directory, so the scheme scales fairly well. This is certainly better than a purely sequential search, but seems like it could be a lot more work than is really necessary. It does, however, guarantee that only one block needs to be updated for each addition or deletion of a file name, and, since entries are never moved, that the offset in the file is a stable address for telldir(); both are valuable features.
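The geometry of the scheme can be captured in a short sketch (levels are numbered from zero here, and the names are illustrative):

#include <stdint.h>

#define LAST_DOUBLING_LEVEL 30  /* the 31st table: about a billion buckets */

/* Buckets double at each level until the 31st table, then stay fixed. */
static uint64_t buckets_in_level(unsigned int level)
{
    if (level > LAST_DOUBLING_LEVEL)
        level = LAST_DOUBLING_LEVEL;
    return 1ULL << level;
}

/* Buckets are two blocks through the 31st table, four blocks after it. */
static unsigned int blocks_per_bucket(unsigned int level)
{
    return level <= LAST_DOUBLING_LEVEL ? 2 : 4;
}

/* A lookup hashes the name once, then at each level linearly searches
 * bucket (hash % buckets_in_level(level)) until the name is found or
 * the directory's last allocated level has been searched. */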
Superblocks, checkpoints, and other metadata
All filesystems have a superblock and f2fs is no different. However it does make a clear distinction between those parts of the superblock which are read-only and those which can change. These are kept in two separate data structures.
The f2fs_super_block, which is stored in the second block of the device, contains only read-only data. Once the filesystem is created, this is never changed. It describes how big the filesystem is, how big the segments, sections, and zones are, how much space has been allocated for the various parts of the "meta" area, and other little details.
The rest of the information that you might expect to find in a superblock, such as the amount of free space, the address of the segments that should be written to next, and various other volatile details, are stored in an f2fs_checkpoint. This "checkpoint" is one of the metadata types that follows the two-location approach to copy-on-write — there are two adjacent segments both of which store a checkpoint, only one of which is current. The checkpoint contains a version number so that when the filesystem is mounted, both can be read and the one with the higher version number is taken as the live version.
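A sketch of the mount-time selection, assuming that a copy failing its validity checks is treated as absent (the names are illustrative, not f2fs's actual code):

#include <stddef.h>
#include <stdint.h>

struct checkpoint {
    uint64_t version;
    /* ... free-space counts, next-segment pointers, and so on ... */
};

/* Both copies are read at mount time; a NULL argument stands for a
 * copy that failed its validity checks. The survivor with the higher
 * version number is the live checkpoint. */
static const struct checkpoint *
live_checkpoint(const struct checkpoint *a, const struct checkpoint *b)
{
    if (!a)
        return b;
    if (!b)
        return a;
    return a->version > b->version ? a : b;
}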
We have already mentioned the Node Address Table (NAT) and Segment Summary Area (SSA) that also occupy the meta area with the superblock (SB) and Checkpoints (CP). The one other item of metadata is the Segment Info Table or SIT.
The SIT stores 74 bytes per segment and is kept separate from the segment summaries because it is much more volatile. It primarily keeps track of which blocks are still in active use so that the segment can be reused when it has no active blocks, or can be cleaned when the active block count gets low.
When updates are required to the NAT or the SIT, f2fs doesn't make them immediately, but stores them in memory until the next checkpoint is written. If there are relatively few updates then they are not written out to their final home but are instead journaled in some spare space in Segment Summary blocks that are normally written at the same time. If the total amount of updates that are required to Segment Summary blocks is sufficiently small, even they are not written and the SIT, NAT, and SSA updates are all journaled with the Checkpoint block — which is always written during checkpoint. Thus, while f2fs feels free to leave some work to the FTL, it tries to be friendly and only performs random block updates when it really has to. When f2fs does need to perform random block updates it will perform several of them at once, which might ease the burden on the FTL a little.
Knowing when to give up
Handling filesystem-full conditions in traditional filesystems is relatively easy. If no space is left, you just return an error. With a log-structured filesystem, it isn't that easy. There might be a lot of free space, but it might all be in different sections and so it cannot be used until those sections are "cleaned", with the live data packed more densely into fewer sections. It usually makes sense to over-provision a log-structured filesystem so there are always free sections to copy data to for cleaning.
The FTL takes exactly this approach and will over-provision to both allow for cleaning and to allow for parts of the device failing due to excessive wear. As the FTL handles over-provisioning internally there is little point in f2fs doing it as well. So when f2fs starts running out of space, it essentially gives up on the whole log-structured idea and just writes randomly wherever it can. Inodes and index blocks are still handled carefully and there is a small amount of over-provisioning for them, but data is just updated in place, or written to any free block that can be found. Thus you can expect performance of f2fs to degrade when the filesystem gets close to full, but that is common to a lot of filesystems so it isn't a big surprise.
Would I buy one?
f2fs certainly seems to contain a number of interesting ideas, and a number of areas for possible improvement — both attractive attributes. Whether reality will match the promise remains to be seen. One area of difficulty is that the shape of an f2fs (such as section and zone size) needs to be tuned to the particular flash device and its FTL; vendors are notoriously secretive about exactly how their FTL works. f2fs also requires that the flash device is comfortable having six or more concurrently "open" write areas. This may not be a problem for Samsung, but does present some problems for your average techno-geek — though Arnd Bergmann has done some research that may prove useful. If this leads to people reporting performance results based on experiments where the f2fs isn't tuned properly to the storage device, it could be harmful for the project as a whole.
f2fs contains a number of optimizations which aim to ease the burden on the FTL. It would be very helpful to know how often these actually result in a reduction in the number of writes. That would help confirm that they are a good idea, or suggest that further refinement is needed. So, some gathering of statistics about how often the various optimizations fire would help increase confidence in the filesystem.
f2fs seems to have been written without much expectation of highly parallel workloads. In particular, all submissions of write requests are performed under a single semaphore. So f2fs probably isn't the filesystem to use for big-data processing on 256-core work-horses. It should be fine on mobile computing devices for a few more years though.
And finally, lots of testing is required. Some preliminary performance measurements have been posted, but to get a fair comparison you really need an "aged" filesystem and a large mix of workloads. Hopefully someone will make the time to do the testing.
Meanwhile, would I use it? Given that my phone is as much a toy to play with as a tool to use, I suspect that I would. However, I would make sure I had reliable backups first. But then ... I probably should do that anyway.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Security-related
Miscellaneous
Page editor: Jonathan Corbet
Distributions
Canonical courts financial contributions
Financial contributions are part of many free software projects. Users can, and do, contribute in lots of different ways, but helping the project keep the lights on and, perhaps, even cover some development time, is fairly common. Financial donations may also be used to champion certain features for a project; by ponying up some money, the donor may get some input into the direction of the project. The latter seems to be part of the motivation for Canonical's recent push to more prominently feature—but not require—financial contributions as part of the desktop download process.
On October 9, Steve George, Canonical's VP of Communications and Products, posted a message to the company's blog that described the change:
It's a bit surprising to see a large, company-backed distribution looking for financial contributions, but George made it clear that there has always been a way to do so, "albeit in a not-easy-to-find spot on our website". He said that users have been asking for a simpler way to contribute money, so Canonical was making one available. Now, clicking through to a desktop download page brings up a web application that allows contributions of up to $125 in each of eight categories.
According to community manager Jono Bacon, the application was inspired by the various Humble Bundles, which allow people to pay what they wish for computer games or ebooks. While the sliders used by Humble Bundles allow users to choose the amount to pay the authors, charities, and the company, the sliders in the Canonical application allow a choice of features contributors would like to put their money toward. The possibilities are:
- Make the desktop more amazing
- Performance optimisation for games and apps
- Improve hardware support on more PCs
- Phone and tablet versions of Ubuntu
- Community participation in Ubuntu development
- Better coordination with Debian and upstreams
- Better support for flavours like Kubuntu, Xubuntu, Lubuntu
- Tip to Canonical – they help make it happen
To aid the user in evaluating how much to give, the application suggests products that would cost roughly the same as the total contribution. Those range from a "grande extra shot mocha latte chino" at $2, through a "pair of LP Matador bongo drums" at $100, up to "an eight year-old dromedary camel" at $1000. Visitors can either choose their donation level and contribute via PayPal, or click through a link to go on to the download.
While Canonical undoubtedly does a great deal of work for the benefit of millions of Linux users, this is a rather unconventional approach. It is a little hard to imagine that it will generate a significant revenue stream, at least for an organization the size of Canonical. But the real value to Canonical (and by extension, Ubuntu) may be in the feedback it gets from users.
The categories for features are fairly broad, but a consensus among users on one (or a few) of them is certainly useful information. The fact that those users are willing to pay something to make that vote makes the data all the more interesting. There have been persistent complaints that some of the other Ubuntu flavors, Lubuntu, Xubuntu, Kubuntu, and so on, have lacked for financial backing in comparison to the number of users they bring to the table. This effort would give those flavors an opportunity to send a message, for example.
It would be great if the community were to get some visibility into the contributions. Canonical may be understandably loath to give out direct financial information, but there are other ways to do the reporting that would still benefit both the Ubuntu community as well as the larger FOSS ecosystem. Information on the "votes", perhaps as percentages of the number and dollar amount of the contributions, would be useful. That would help Ubuntu see places where more effort is desired as well as identifying potential trouble spots for other distributions and projects. According to George, the company is working on a schedule and format for reporting on the contributions.
Canonical has pursued other non-traditional revenue sources in the past, so this could just be another. With enough different revenue streams, even if some are fairly small and unpredictable, the company could reach a profitable state. That can only be a good thing for the long-term prosperity of not just Canonical, but the Ubuntu distribution as well. Ubuntu has millions of users worldwide and the Linux ecosystem is richer for its presence, so anything that helps its continued existence is definitely a net positive.
Brief items
Distribution quotes of the week
> I find it therefore doubtful that keeping the bottle logo solves any real world problem.
I find it doubtful that getting rid of the bottle logo solves any real world problem.
Distribution News
Debian GNU/Linux
bits from the DPL: September 2012
Debian Project Leader Stefano Zacchiroli has a few bits on his September activities. Topics include Google Code-In, Logo relicensing, and more.
Fedora
Fedora 18 pushed back by one week
Although the meeting was to discuss the Fedora 18 Beta freeze, the one-week delay that was decided upon will push the expected final release out to early December.
Newsletters and articles of interest
Distribution newsletters
- DistroWatch Weekly, Issue 477 (October 8)
- Maemo Weekly News (October 8)
- Ubuntu Weekly Newsletter, Issue 286 (October 7)
Garrett: Handling UEFI Secure Boot in smaller distributions
Matthew Garrett looks at UEFI secure boot in smaller distributions. "I've taken Suse's code for key management and merged it into my own shim tree with a few changes. The significant difference is a second stage bootloader signed with an untrusted key will cause a UI to appear, rather than simply refusing to boot. This will permit the user to then navigate the available filesystems, choose a key and indicate that they want to enrol it. From then on, the bootloader will trust binaries signed with that key."
Arch Linux switches to systemd (The H)
The H reports that the latest installation image for Arch Linux boots with systemd. "Based on the 3.5.5 Linux kernel, Arch Linux 2012.10.06 is a regular monthly snapshot of the rolling-release operating system for new installations. Among the changes since the last snapshot are a simplified EFI boot and setup process, and the use of the gummiboot boot manager to display a menu on EFI systems. Additionally, new packages including ethtool, the FSArchiver tool, the Partimage and Partclone partition utilities, rfkill and the TestDisk data recovery tool are now available on the live system."
Page editor: Rebecca Sobol
Development
getauxval() and the auxiliary vector
There are many mechanisms for communicating information between user-space applications and the kernel. System calls and pseudo-filesystems such as /proc and /sys are of course the most well known. Signals are similarly well known; the kernel employs signals to inform a process of various synchronous or asynchronous events—for example, when the process tries to write to a broken pipe or a child of the process terminates.
There are also a number of more obscure mechanisms for communication between the kernel and user space. These include the Linux-specific netlink sockets and user-mode helper features. Netlink sockets provide a socket-style API for exchanging information with the kernel. The user-mode helper feature allows the kernel to automatically invoke user-space executables; this mechanism is used in a number of places, including the implementation of control groups and piping core dumps to a user-space application.
The auxiliary vector, a mechanism for communicating information from the kernel to user space, has remained largely invisible until now. However, with the addition of a new library function, getauxval(), in the GNU C library (glibc) 2.16 release that appeared at the end of June, it has now become more visible.
Historically, many UNIX systems have implemented the auxiliary vector feature. In essence, it is a list of key-value pairs that the kernel's ELF binary loader (fs/binfmt_elf.c in the kernel source) constructs when a new executable image is loaded into a process. This list is placed at a specific location in the process's address space; on Linux systems it sits at the high end of the user address space, just above the (downwardly growing) stack, the command-line arguments (argv), and environment variables (environ).
From this description, we can see that although the auxiliary vector is somewhat hidden, it is accessible with a little effort. Even without using the new library function, an application that wants to access the auxiliary vector merely needs to obtain the address of the location that follows the NULL pointer at the end of the environment list. Furthermore, at the shell level, we can discover the auxiliary vector that was supplied to an executable by setting the LD_SHOW_AUXV environment variable when launching an application:
$ LD_SHOW_AUXV=1 sleep 1000
AT_SYSINFO_EHDR: 0x7fff35d0d000
AT_HWCAP: bfebfbff
AT_PAGESZ: 4096
AT_CLKTCK: 100
AT_PHDR: 0x400040
AT_PHENT: 56
AT_PHNUM: 9
AT_BASE: 0x0
AT_FLAGS: 0x0
AT_ENTRY: 0x40164c
AT_UID: 1000
AT_EUID: 1000
AT_GID: 1000
AT_EGID: 1000
AT_SECURE: 0
AT_RANDOM: 0x7fff35c2a209
AT_EXECFN: /usr/bin/sleep
AT_PLATFORM: x86_64
The auxiliary vector of each process on the system is also visible via a corresponding /proc/PID/auxv file. Dumping the contents of the file that corresponds to the above command (as eight-byte decimal numbers, because the keys and values are of that size on the 64-bit system used for this example), we can see the key-value pairs in the vector, followed by a pair of zero values that indicate the end of the vector:
$ od -t d8 /proc/15558/auxv
0000000 33 140734096265216
0000020 16 3219913727
0000040 6 4096
0000060 17 100
0000100 3 4194368
0000120 4 56
0000140 5 9
0000160 7 0
0000200 8 0
0000220 9 4200012
0000240 11 1000
0000260 12 1000
0000300 13 1000
0000320 14 1000
0000340 23 0
0000360 25 140734095335945
0000400 31 140734095347689
0000420 15 140734095335961
0000440 0 0
0000460
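For the curious, that manual approach can be sketched in a few lines of C. This sketch assumes a 64-bit system and that the process has not modified its environment (which could leave environ pointing away from the original layout):

#include <elf.h>    /* Elf64_auxv_t, AT_NULL */
#include <stdio.h>

extern char **environ;

int main(void)
{
    char **p = environ;
    Elf64_auxv_t *aux;

    while (*p)  /* skip the environment pointers */
        p++;
    p++;        /* step over the terminating NULL */

    for (aux = (Elf64_auxv_t *) p; aux->a_type != AT_NULL; aux++)
        printf("%3lu: 0x%lx\n",
               (unsigned long) aux->a_type,
               (unsigned long) aux->a_un.a_val);
    return 0;
}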
Scanning the high end of user-space memory or /proc/PID/auxv is a clumsy way of retrieving values from the auxiliary vector. The new library function provides a simpler mechanism for retrieving individual values from the list:
#include <sys/auxv.h>
unsigned long int getauxval(unsigned long int type);
The function takes a key as its single argument, and returns the corresponding value. The glibc header files define a set of symbolic constants with names of the form AT_* for the key value passed to getauxval(); these names are exactly the same as the strings displayed when executing a command with LD_SHOW_AUXV=1.
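For example, a program can query its own page size and the pathname it was executed as; note that some values, such as AT_EXECFN, are pointers and need a cast:

#include <sys/auxv.h>
#include <stdio.h>

int main(void)
{
    unsigned long pagesz = getauxval(AT_PAGESZ);
    unsigned long execfn = getauxval(AT_EXECFN);

    printf("page size: %lu\n", pagesz);
    if (execfn) /* getauxval() returns 0 for an absent key */
        printf("executed as: %s\n", (const char *) execfn);
    return 0;
}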
Of course, the obvious question by now is: what sort of information is placed in the auxiliary vector, and who needs that information? The primary customer of the auxiliary vector is the dynamic linker (ld-linux.so). In the usual scheme of things, the kernel's ELF binary loader constructs a process image by loading an executable into the process's memory, and likewise loading the dynamic linker into memory. At this point, the dynamic linker is ready to take over the task of loading any shared libraries that the program may need in preparation for handing control to the program itself. However, it lacks some pieces of information that are essential for these tasks: the location of the program inside the virtual address space, and the starting address at which execution of the program should commence.
In theory, the kernel could provide a system call that the dynamic linker could use in order to obtain the required information. However, this would be an inefficient way of doing things: the kernel's program loader already has the information (because it has scanned the ELF binary and built the process image) and knows that the dynamic linker will need it. Rather than maintaining a record of this information until the dynamic linker requests it, the kernel can simply make it available in the process image at some location known to the dynamic linker. That location is, of course, the auxiliary vector.
It turns out that there's a range of other information that the kernel's program loader already has and which it knows the dynamic linker will need. By placing all of this information in the auxiliary vector, the kernel either saves the programming overhead of making this information available in some other way (e.g., by implementing a dedicated system call), or saves the dynamic linker the cost of making a system call, or both. Among the values placed in the auxiliary vector and available via getauxval() are the following:
- AT_PHDR and AT_ENTRY: The values for these keys
are the address of the ELF program headers of the executable and the entry
address of the executable. The dynamic linker uses this information to
perform linking and pass control to the executable.
- AT_SECURE: The kernel assigns a nonzero value to this key
if this executable should be treated securely. This setting may be
triggered by a Linux Security Module, but the common reason is that the
kernel recognizes that the process is executing a set-user-ID or
set-group-ID program. In this case, the dynamic linker disables the use of
certain environment variables (as described in the ld-linux.so(8)
manual page) and the C library changes other aspects of its behavior.
- AT_UID, AT_EUID, AT_GID, and AT_EGID: These are the real and effective user and group IDs of the process. Making these values available in the vector saves the dynamic linker the cost of making system calls to determine the values. If the AT_SECURE value is not available, the dynamic linker uses these values to make a decision about whether to handle the executable securely (a sketch of this logic follows the list).
- AT_PAGESZ: The value is the system page size. The
dynamic linker needs this information during the linking phase, and the C
library uses it in the implementation of the malloc family of
functions.
- AT_PLATFORM: The value is a pointer to a string
identifying the hardware platform on which the program is running. In some
circumstances, the dynamic linker uses this value in the interpretation of
rpath values. (The ld-linux.so(8) man page describes
rpath values.)
- AT_SYSINFO_EHDR: The value is a pointer to the page
containing the Virtual Dynamic Shared Object (VDSO) that the kernel creates
in order to provide fast implementations of certain system calls. (Some
documentation on the VDSO can be found in the kernel source file
Documentation/ABI/stable/vdso.)
- AT_HWCAP: The value is a pointer to a multibyte mask of
bits whose settings indicate detailed processor capabilities. This
information can be used to provide optimized behavior for certain library
functions. The contents of the bit mask are hardware dependent (for
example, see the kernel source file
arch/x86/include/asm/cpufeature.h for details relating to the
Intel x86 architecture).
- AT_RANDOM: The value is a pointer to sixteen random bytes provided by the kernel. The dynamic linker uses this to implement a stack canary.
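As a small illustration of the AT_SECURE logic mentioned above, a library might do something along these lines; this is a hedged sketch of the decision, not the dynamic linker's actual code:

#include <sys/auxv.h>
#include <sys/types.h>
#include <unistd.h>

/* Treat execution as "secure" if the kernel says so via AT_SECURE;
 * fall back to comparing real and effective IDs otherwise. */
static int secure_execution(void)
{
    if (getauxval(AT_SECURE))
        return 1;
    return getuid() != geteuid() || getgid() != getegid();
}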
The precise reasons why the GNU C library developers have chosen to add the getauxval() function now are a little unclear. The commit message and NEWS file entry for the change were merely brief explanations of what the change was, rather than why it was made. The only clue provided by the implementer on the libc-alpha mailing list suggested that doing so was useful to allow for "future enhancements to the AT_ values, especially target-specific ones". That comment, plus the observation that the glibc developers tend to be rather conservative about adding new interfaces to the ABI, suggests that they have some interesting new user-space uses of the auxiliary vector in mind.
Brief items
Quote of the week
The KDE Manifesto
The KDE Manifesto has been released. "The KDE Manifesto is not intended to change the organization or the way it works. Its aim is only to describe how the KDE Community sees itself. What binds us together are certain values and their practical implications, without regard for who a person is or what background and skills they bring. It is a living document, so it will change over time as KDE continues to grow and mature. We are sharing the Manifesto to help people understand what KDE is all about, what we want to accomplish and why we do what we do."
Firefox 16
Mozilla has released Firefox 16. See the details in the release notes. Firefox 16.0 is also available for Android. Here are the Android version release notes.
HTTPS Everywhere 3.0
The Electronic Frontier Foundation (EFF) has released version 3.0 of HTTPS Everywhere. HTTPS Everywhere 3.0 adds encryption protection to 1,500 more websites, twice as many as previous stable releases. "Our current estimate is that HTTPS Everywhere 3 should encrypt at least a hundred billion page views in the next year, and trillions of individual HTTP requests."
systemtap release 2.0
Version 2.0 of the system diagnostic framework SystemTap has been released. This release adds a simple macro facility to the built-in scripting language, the ability to conditionally vary code based on the user's privilege level, and an experimental backend that allows SystemTap to profile a user's own processes (i.e., without root privileges).
xpra release 0.7.0
Antoine Martin wrote in to alert us to the latest release of xpra, the "screen for X" utility. This release includes a host of new features, including several video compression formats and experimental support for multiple, concurrent clients.
NASA: How would you use a NASA API?
At its open.NASA blog, the US space agency is soliciting input from the public on the data sets and APIs it provides. "As we collect more and more data, figuring out the best way to distribute, use, and reuse the data becomes more and more difficult. API's are one way we can significantly lower the barrier of entry to people from outside NASA being able to manipulate and access our public information." The current estimate is that NASA collects 15 terabytes of data per day, and future missions may collect far more.
Newsletters and articles
Development newsletters from the last week
- Caml Weekly News (October 9)
- What's cooking in git.git (October 4)
- What's cooking in git.git (October 7)
- Haskell Weekly News (October 3)
- Mozilla Hacks Weekly (October 4)
- OpenStack Community Newsletter (October 5)
- Perl Weekly (October 8)
- PostgreSQL Weekly News (October 8)
- Ruby Weekly (October 4)
Open Hardware Summit open to hybrid models (opensource.com)
Over at opensource.com, Ruth Suehle reports on the Open Hardware Summit, which was recently held in New York. At the summit, the Open Source Hardware Association was officially launched and various ideas about open hardware business strategies were discussed. "Many in the audience were waiting for the afternoon session that included Bre Pettis, co-founder and CEO of MakerBot, creators of a popular open source 3D printer. Earlier in the week, the company announced its latest product, the Replicator 2 3D printer. At the same time, Pettis announced to much controversy, 'For the Replicator 2, we will not share the way the physical machine is designed or our GUI because we don’t think carbon-copy cloning is acceptable and carbon-copy clones undermine our ability to pay people to do development.'"
The open GSM future arrives (The H)
The H creates a standalone mobile telephone network using the sysmoBTS base station. "In previous articles, we've looked at the question of how free are the phones that people use every day, and looked at the theory behind building your own GSM phone network using open source software. Now, in this article we take a look at the sysmoBTS, a small form-factor GSM Base Transceiver Station (BTS) built around these principles and the steps required to configure it to provide a standalone mobile telephone network that is useful for research, development and testing purposes."
Migurski: Openstreetmap in postgres
For anybody wanting to work with Openstreetmap data using PostgreSQL, here's a collection of useful tools and techniques. "At first glance, OSM data and Postgres (specifically PostGIS) seem like a natural, easy fit for one another: OSM is vector data, PostGIS stores vector data. OSM has usernames and dates-modified, PostGIS has columns for storing those things in tables. OSM is a worldwide dataset, PostGIS has fast spatial indexes to get to the part you want. When you get to OSM’s free-form tags, though, the row/column model of Postgres stops making sense and you start to reach for linking tables or advanced features like hstore..."
Weinberg: Open Source Hardware and the Law
At the Public Knowledge blog, Michael Weinberg addresses the differing legal underpinnings of open source hardware and open source software. "This combination – copyright that does not protect function, trademark that needs to be applied for and does not protect function, and patents that need to be applied for and can protect functions – means that most hardware projects are 'open' by default because their core functionality is not protected by any sort of intellectual property right. Of course, in this case 'open' means that their key functionality can be copied without legal repercussion, not that the schematics have been posted online or that it is easy to discover how they work (critical elements of open source hardware)." The article is an extension of Weinberg's recent talk at the Open Hardware Summit, and poses questions that are interesting in light of MakerBot's announcement that its latest 3D printer would not be open.
Page editor: Nathan Willis
Announcements
Brief items
FSF: LulzBot AO-100 3D printer now FSF-certified
The Free Software Foundation has awarded its first "Respects Your Freedom" (RYF) certification to the LulzBot AO-100 3D printer sold by Aleph Objects, Inc. "The RYF certification mark means that the product meets the FSF's standards in regard to users' freedom, control over the product, and privacy."
Articles of interest
FSFE Newsletter - October 2012
The FSFE (Free Software Foundation Europe) newsletter covers Software Freedom Day activities, software patents, free software in the French public administration, and several other topics.
The Patent, Used as a Sword (New York Times)
Here's a lengthy New York Times article looking at the problems with the US patent system. "In the smartphone industry alone, according to a Stanford University analysis, as much as $20 billion was spent on patent litigation and patent purchases in the last two years — an amount equal to eight Mars rover missions. Last year, for the first time, spending by Apple and Google on patent lawsuits and unusually big-dollar patent purchases exceeded spending on research and development of new products, according to public filings."
MeeGo to return next month with Jolla phone launch (The H)
The H reports on plans for a MeeGo-based phone. "The Finnish startup Jolla Ltd says that it has raised €200 million from a number of, currently unnamed, telecommunications companies and that it will be unveiling a MeeGo-based device next month. The funding consortium is reported to include at least one telecom operator, a chipset maker, and device and component manufacturers." Though MeeGo itself is free software, Jolla evidently plans to keep its "well-patented" user interface layer closed and license it to other companies.
New Books
Practical Vim--New from Pragmatic Bookshelf
Pragmatic Bookshelf has released "Practical Vim" by Drew Neil.
Calls for Presentations
CFP: Cloud Infrastructure, Distributed Storage and High Availability at LCA 2013
The Cloud Infrastructure, Distributed Storage and High Availability mini-conference will take place January 28 as part of linux.conf.au 2013 in Canberra, Australia. The call for papers closes November 4.
Upcoming Events
Events: October 11, 2012 to December 10, 2012
The following event listing is taken from the LWN.net Calendar.
| Date(s) | Event | Location |
|---|---|---|
| October 11–October 12 | Korea Linux Forum 2012 | Seoul, South Korea |
| October 12–October 13 | Open Source Developer's Conference / France | Paris, France |
| October 13–October 14 | Debian BSP in Alcester (Warwickshire, UK) | Alcester, Warwickshire, UK |
| October 13–October 14 | PyCon Ireland 2012 | Dublin, Ireland |
| October 13–October 15 | FUDCon:Paris 2012 | Paris, France |
| October 13 | 2012 Columbus Code Camp | Columbus, OH, USA |
| October 13–October 14 | Debian Bug Squashing Party in Utrecht | Utrecht, Netherlands |
| October 15–October 18 | OpenStack Summit | San Diego, CA, USA |
| October 15–October 18 | Linux Driver Verification Workshop | Amirandes, Heraklion, Crete |
| October 17–October 19 | LibreOffice Conference | Berlin, Germany |
| October 17–October 19 | MonkeySpace | Boston, MA, USA |
| October 18–October 20 | 14th Real Time Linux Workshop | Chapel Hill, NC, USA |
| October 20–October 21 | PyCon Ukraine 2012 | Kyiv, Ukraine |
| October 20–October 21 | Gentoo miniconf | Prague, Czech Republic |
| October 20–October 21 | PyCarolinas 2012 | Chapel Hill, NC, USA |
| October 20–October 23 | openSUSE Conference 2012 | Prague, Czech Republic |
| October 20–October 21 | LinuxDays | Prague, Czech Republic |
| October 22–October 23 | PyCon Finland 2012 | Espoo, Finland |
| October 23–October 25 | Hack.lu | Dommeldange, Luxembourg |
| October 23–October 26 | PostgreSQL Conference Europe | Prague, Czech Republic |
| October 25–October 26 | Droidcon London | London, UK |
| October 26–October 27 | Firebird Conference 2012 | Luxembourg, Luxembourg |
| October 26–October 28 | PyData NYC 2012 | New York City, NY, USA |
| October 27 | Central PA Open Source Conference | Harrisburg, PA, USA |
| October 27–October 28 | Technical Dutch Open Source Event | Eindhoven, Netherlands |
| October 27 | pyArkansas 2012 | Conway, AR, USA |
| October 27 | Linux Day 2012 | Hundreds of cities, Italy |
| October 29–November 3 | PyCon DE 2012 | Leipzig, Germany |
| October 29–November 2 | Linaro Connect | Copenhagen, Denmark |
| October 29–November 1 | Ubuntu Developer Summit - R | Copenhagen, Denmark |
| October 30 | Ubuntu Enterprise Summit | Copenhagen, Denmark |
| November 3–November 4 | OpenFest 2012 | Sofia, Bulgaria |
| November 3–November 4 | MeetBSD California 2012 | Sunnyvale, California, USA |
| November 5–November 7 | Embedded Linux Conference Europe | Barcelona, Spain |
| November 5–November 7 | LinuxCon Europe | Barcelona, Spain |
| November 5–November 9 | Apache OpenOffice Conference-Within-a-Conference | Sinsheim, Germany |
| November 5–November 8 | ApacheCon Europe 2012 | Sinsheim, Germany |
| November 7–November 9 | KVM Forum and oVirt Workshop Europe 2012 | Barcelona, Spain |
| November 7–November 8 | LLVM Developers' Meeting | San Jose, CA, USA |
| November 8 | NLUUG Fall Conference 2012 | ReeHorst in Ede, Netherlands |
| November 9–November 11 | Free Society Conference and Nordic Summit | Göteborg, Sweden |
| November 9–November 11 | Mozilla Festival | London, England |
| November 9–November 11 | Python Conference - Canada | Toronto, ON, Canada |
| November 10–November 16 | SC12 | Salt Lake City, UT, USA |
| November 12–November 16 | 19th Annual Tcl/Tk Conference | Chicago, IL, USA |
| November 12–November 17 | PyCon Argentina 2012 | Buenos Aires, Argentina |
| November 12–November 14 | Qt Developers Days | Berlin, Germany |
| November 16–November 19 | Linux Color Management Hackfest 2012 | Brno, Czech Republic |
| November 16 | PyHPC 2012 | Salt Lake City, UT, USA |
| November 20–November 24 | 8th Brazilian Python Conference | Rio de Janeiro, Brazil |
| November 24–November 25 | Mini Debian Conference in Paris | Paris, France |
| November 24 | London Perl Workshop 2012 | London, UK |
| November 26–November 28 | Computer Art Congress 3 | Paris, France |
| November 29–December 1 | FOSS.IN/2012 | Bangalore, India |
| November 29–November 30 | Lua Workshop 2012 | Reston, VA, USA |
| November 30–December 2 | Open Hard- and Software Workshop 2012 | Garching bei München, Germany |
| November 30–December 2 | CloudStack Collaboration Conference | Las Vegas, NV, USA |
| December 1–December 2 | Konferensi BlankOn #4 | Bogor, Indonesia |
| December 2 | Foswiki Association General Assembly | online and Dublin, Ireland |
| December 5–December 7 | Open Source Developers Conference Sydney 2012 | Sydney, Australia |
| December 5–December 7 | Qt Developers Days 2012 North America | Santa Clara, CA, USA |
| December 5 | 4th UK Manycore Computing Conference | Bristol, UK |
| December 7–December 9 | CISSE 12 | Everywhere, Internet |
| December 9–December 14 | 26th Large Installation System Administration Conference | San Diego, CA, USA |
If your event does not appear here, please tell us about it.
Page editor: Rebecca Sobol
