LWN.net Weekly Edition for October 11, 2012
Udev and firmware
Those who like to complain about udev, systemd, and their current maintainers have had no shortage of company recently as the result of a somewhat incendiary discussion on the linux-kernel mailing list. Underneath the flames, though, lie some important issues: who decides what constitutes appropriate behavior for kernel device drivers, how strong is our commitment to backward compatibility, and which tasks are best handled in the kernel without calling out to user space?

The udev process is responsible for a number of tasks, most initiated as the result of events originating in the kernel. It responds to device creation events by making device nodes, setting permissions, and, possibly, running a setup program. It also handles module loading requests and firmware requests from the kernel. So, for example, when a driver calls request_firmware(), that request is turned into an event that is passed to the udev process. Udev will, in response, locate the firmware file, read its contents, and pass the data back to the kernel. The driver will get its firmware blob without having to know anything about how things are organized in user space, and everybody should be happy.
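The driver side of this dance takes only a few lines of code. Here is a minimal sketch (the firmware file name and the example_push_to_hardware() helper are hypothetical):

    #include <linux/firmware.h>
    #include <linux/device.h>

    static int example_load_firmware(struct device *dev)
    {
        const struct firmware *fw;
        int ret;

        /* Generates a firmware request event; traditionally udev
           locates the file and feeds its contents back to the kernel. */
        ret = request_firmware(&fw, "example/device-fw.bin", dev);
        if (ret)
            return ret;    /* no firmware available; the driver must cope */

        /* fw->data points at the blob, fw->size gives its length */
        example_push_to_hardware(dev, fw->data, fw->size);

        release_firmware(fw);
        return 0;
    }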
Back in January, the udev developers decided to implement a stricter notion of sequencing between various types of events. No events for a specific device, they decided, would be processed until the loading of the driver module for that device had completed. Doing things this way makes it easier to keep things straight in user space and to avoid attempting operations that the kernel is not yet ready to handle. But it also created problems for some types of drivers. In particular, if a driver tries to load device firmware during the module initialization process, things will appear to lock up: udev sees that the module is not yet initialized, so it holds onto the firmware request, and everything stops. Udev developer Kay Sievers warned the world about this problem last January, making it clear that he saw firmware loading at module initialization time as a driver bug.
The problem with this line of reasoning, of course, is that one person's kernel bug is another's user-space problem. Firmware loading at module initialization time has worked just fine for a long time — if one ignores little problems like built-in modules, booting with init=/bin/sh, and other situations where proper user-space support is not present when the request_firmware() call takes place. What matters most is that it works for a normal bootstrap on a typical distribution install. The udev sequencing change breaks that: users of a number of distributions have been reporting that things no longer work properly with newer versions of udev installed.
Breaking currently-running systems is something the kernel development community tries hard to avoid, so it is not surprising that there was some disagreement over the appropriateness of the udev changes. Even so, various kernel developers were trying to work around the problems when Linus threw a bit of a tantrum, saying that the problem lies with udev and needs to be fixed there. He did not get the response that he was hoping for.
Kay answered that, despite the problem reports, udev had not yet been fixed, saying "we still haven't wrapped our head around how to fix it/work around it." He pointed out that things don't really hang, they just get "slow" while waiting for a 30-second timeout to expire. And he reiterated his position that the real problem lies in the kernel and should be fixed there. Linus was unimpressed, but, since he does not maintain udev, there is not a whole lot that he can do directly to solve the problem.
Or, then again, maybe there is. One possibility raised by a few developers was pulling udev into the kernel source tree and maintaining it as part of the kernel development process. There was a certain amount of support for this idea, but nobody actually stepped up to take responsibility for maintaining udev in that environment. Such a move would represent a fork of a significant package that would take it in a new direction; current plans are to integrate udev more thoroughly with systemd. The current udev developers thus seem unlikely to support putting udev in the kernel tree. Getting distributors to adopt the kernel's version of udev could also prove to be a challenge. In general, it is the sort of mess that is best avoided if at all possible.
An alternative is to simply short out udev for firmware loading altogether. That is, in fact, what has been done; the 3.7 kernel will include a patch (from Linus) that causes firmware loading to be done directly from the kernel without involving user space at all. If the kernel is unable to find the firmware file in the expected places (under /lib/firmware and variants) it will fall back to sending a request to udev in the usual manner. But if the kernel-space load attempt works, then udev will never even know that the firmware request was made.
This appears to be a solution that is workable for everybody involved. There is nothing particularly tricky about firmware loading, so few developers seem to have concerns about doing it directly from the kernel. Kay supports the idea as well, saying "I would absolutely like to get udev entirely out of the sick game of firmware loading". The real proof will be in how well the concept works once the 3.7 kernel starts seeing widespread testing, but the initial indications are that there will not be a lot of problems. If things stay that way, it would not be surprising to see the direct firmware loading patch backported to the stable series — once it has gained a few amenities like user-configurable paths.
One of the biggest challenges in kernel design can be determining what should be done in the kernel and what should be pushed out to user space. The user-space solution is often appealing; it can simplify kernel code and make it easier for others to implement their own policies. But an overly heavy reliance on user space can lead to just the sort of difficulty seen with firmware loading. In this case, it appears, the problem was better solved in the kernel; fortunately, it appears to have been a relatively easy one for the kernel to take back without causing compatibility problems.
CIA.vc shuts down
CIA didn't seem important until it was gone. For developers and users on IRC networks like Freenode, CIA was just there in the background, relaying commit messages into the channels of thousands of projects in real time—until recently.
CIA.vc was a central clearinghouse for commit messages sent to it from ten thousand or more version control repositories. There were CIA hooks for Subversion, git, bzr, and others, so a project just had to install such a hook into its repository and register on the CIA website. CIA handled the rest, collecting the commit messages as they came in and announcing them on appropriate channels via its swarm of IRC bots. Here is an example from the #commits channel from April:
<CIA-93> OpenWrt: [packages] fwknop: update to 2.0, use new startup commands
<CIA-93> vlc: Pierre Ynard master * r31b5fbdb6d vlc/modules/lua/libs/equalizer.c:
lua: fix memory and object leak and reset locale on error path
<CIA-93> FreeBSD: rakuco * ports/graphics/autoq3d/files/
(patch-src__cmds__cmds.cpp . patch-src__fgui__cadform.cpp):
<CIA-93> FreeBSD: Make the port build with gcc 4.6 (and possibly other compilers).
<CIA-93> gentoo: robbat2 * gentoo/xml/htdocs/proj/en/perl/outdated-cpan-packages.xml:
Automated update of outdated-cpan-packages.xml
<CIA-93> compiz-fusion: a.j.buxton master * /fusion/plugins-main/src/ezoom/ezoom.c:
For a decade, the CIA bots were part of the infrastructure of many projects; along with the bug tracker, mailing lists, wiki, and version control system, they helped tie communities together. Eric S. Raymond was among those who described the communal effect of the CIA service.
That stream of notifications dried up on September 26th, when CIA.vc was shut down due to a miscommunication with a hosting provider. It seems there were no backups. It is unclear whether CIA will return, but two possible replacements are available now.
irker
Irker is a simple replacement for CIA that was announced just three days later. Raymond had been developing it even before CIA went down, and chose a design quite different from the centralized CIA architecture.
Irker consists of two pieces: a server that acts as a simple relay to IRC and a client that sends messages to be relayed. The server has no knowledge of version control systems or commits, and could be used to relay any sort of content. All the version-control-specific code necessary to extract and format the message is in the client, which is run by a version control hook script.
The irker client and server typically both run on the same machine or LAN, so each project or hosting service is responsible for running its own service, rather than relying on a centralized service like CIA.
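To see how simple the relay is, consider this sketch of a submission client in C, assuming irkerd's published one-line JSON request format and its default port of 6659 (the channel and message text are made up):

    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define IRKER_PORT 6659    /* irkerd's default listener port */

    int main(void)
    {
        /* One JSON object per request: the IRC target and the text. */
        const char *req =
            "{\"to\": \"irc://chat.freenode.net/#commits\", "
            "\"privmsg\": \"myproject: fix off-by-one in foo()\"}";
        struct sockaddr_in addr = { 0 };
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
            return 1;
        addr.sin_family = AF_INET;
        addr.sin_port = htons(IRKER_PORT);
        addr.sin_addr.s_addr = inet_addr("127.0.0.1");  /* irkerd on this host */

        sendto(fd, req, strlen(req), 0,
               (struct sockaddr *)&addr, sizeof(addr));
        close(fd);
        return 0;
    }

A real hook would, of course, generate the message text from the commit being reported.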
Irker has undergone heavy development since the announcement, and is now considered basically feature complete. Its simple and general design is likely to lead to other things being built on top of it. For example, there is a CIA to irker proxy for sites that want to retain their current CIA hooks.
KGB
Although irker made a splash when CIA died, another clone has quietly been overlooked for years. KGB was developed by Martín Ferrari and Damyan Ivanov of the Debian project and released in 2009. KGB is shipped in the current Debian stable release, as well as in Ubuntu universe, making it easy to deploy as a replacement for CIA.
KGB is, like irker, a decentralized client-server system. Unlike irker's content-agnostic server, the KGB server is responsible for formatting notifications from commit information it receives from its clients. Though a less flexible design, this does insulate the clients from some details of IRC, particularly from message length limits.
KGB has enjoyed a pronounced upswing in feature requests and development since CIA went down, gaining features such as web links to commits, URL shortening, and the ability to broadcast all projects' notifications to a channel like #commits.
Will CIA.vc return?
The CIA.vc website currently promises an attempt to revive the service. Any attempt to do so will surely face numerous challenges, not least the missing database, which held much of the configuration driving CIA's behavior. Unless a recent backup of the database is found, any revived CIA.vc will need a great deal of reconfiguration to return it to its past functionality.
CIA's code base, while still available, is large and complex, with many moving parts written in different languages; it is reputedly difficult to install and has been neglected for years. Raymond's opinion is that "CIA suffered a complexity collapse", and, as he said, "It is notoriously difficult to un-collapse a rubble pile".
Even if CIA does eventually return, it seems likely that many projects will have moved away from it for good, deploying their own irker or KGB bots. The Apache Software Foundation, the KDE project, and Debian's Alioth project hosting site have already deployed their own bots. If larger hosting sites like GitHub, SourceForge, and Savannah follow suit, any revived CIA may be reduced to being, at best, a third player.
Conclusion
CIA.vc was a centralized service, with code that is free software, but with a design and implementation that did not encourage reuse. The service was widely used by the community, which mostly seems to have put up with its instability, its UTF-8 display bugs, its odd formatting of git revision numbers, and its often crufty hook scripts.
According to CIA's author, Micah Dowty, it never achieved a "critical mass of involvement" from contributors. Perhaps CIA was not seen as important enough to work on; but with two replacements now being developed, there is certainly evidence of interest. Or perhaps CIA did not present itself as a free software project, and so was instead treated as simply the service that it appeared to be. CIA's website featured things like a commit leaderboard and a new-project list, which certainly helped entice people to use it. (Your author must confess to occasionally trying to fit enough commits into a day to get to the top of that leaderboard.) But the website did not encourage the filing of bugs or patches.
In a way, the story of CIA mirrors the story of the version control systems it reported on. When CIA began in 2003, centralized version control was the norm. The Linux kernel used distributed version control only thanks to the proprietary BitKeeper, which itself ran a centralized commit publication service. These choices were entirely pragmatic, and the centralized CIA was perhaps in keeping with the times.
Much as happened with version control, the community has gone from reliance on a centralized service to having a choice of decentralized alternatives. As a result, new features are rapidly emerging in both KGB and irker that CIA never provided. This is certainly a healthy response to CIA's closure, but it also seems that our many years of reliance on the centralized service held us back from exploring the space that CIA occupied.
Linux and automotive computing security
There was no security track at the 2012 Automotive Linux Summit, but numerous sessions and the "hallway track" featured anecdotes about the ease of compromising car computers. This is no surprise: as Linux makes inroads into automotive computing, the security question takes on an urgency not found on desktops and servers. Too often, though, Linux and open source software in general are perceived as insufficiently battle-hardened for the safety-critical needs of highway-speed computing; in the comments on any automotive Linux news story, it is easy to find a skeptic scoffing that he or she would not trust Linux to manage the engine, brakes, or airbags. While hackers in other embedded Linux realms may understandably feel miffed at such a slight, the bigger problem is said skeptic's presumption that a modern Linux-free car is a secure environment — which is demonstrably untrue.
First, there is a mistaken assumption that computing is not yet a pervasive part of modern automobiles. Likewise mistaken is the assumption that safety-critical systems (such as the aforementioned brakes, airbags, and engine) are properly isolated from low-security components (like the entertainment head unit) and are not vulnerable to attack. It is also incorrectly assumed that the low-security systems themselves do not harbor risks to drivers and passengers. In reality, modern cars have shipped with multiple embedded computers for years (many of which are mandatory by government order), presenting a large attack surface with numerous risks to personal safety, theft, eavesdropping, and other exploits. But rather than exacerbating this situation, Linux and open source adoption stand to improve it.
There is an abundance of research dealing with hypothetical exploits to automotive computers, but the seminal work on practical exploits is a pair of papers from the Center for Automotive Embedded Systems Security (CAESS), a team from the University of California San Diego and the University of Washington. CAESS published a 2010 report [PDF] detailing attacks that they managed to implement against a pair of late-model sedans via the vehicles' Controller Area Network (CAN) bus, and a 2011 report [PDF] detailing how they managed to access the CAN network from outside the car, including through service station diagnostic scanners, Bluetooth, FM radio, and cellular modem.
Exploits
The 2010 paper begins by addressing the connectivity of modern cars. CAESS did not disclose the brand of vehicle they experimented on (although car mavens could probably identify it from the photographs), but they purchased two vehicles and experimented with them on the lab bench, on a garage lift, and finally on a closed test track. The cars were not high-end, but they provided a wide range of targets. Embedded electronic control units (ECUs) are found all over the automobile, monitoring and reporting on everything from the engine to the door locks, not to mention lighting, environmental controls, the dash instrument panel, tire pressure sensors, steering, braking, and so forth.
Not every ECU is designed to control a portion of the vehicle, but, due to the nature of the CAN bus, any ECU can be used to mount an attack. CAN is roughly equivalent to a link-layer protocol, but it is broadcast-only, employs no source addressing or authentication, and is susceptible to denial-of-service attacks (either through simple flooding or by broadcasting messages with high-priority message IDs, which force all other nodes to back off and wait). With a device plugged into the CAN bus (such as through the OBD-II port mandatory on all 1996-or-newer vehicles in the US), attackers can spoof messages from any ECU. Higher-level protocols are often employed on top of CAN, but CAESS was able to reverse-engineer the protocols in its test vehicles and found security holes that allowed attackers to brute-force the challenge-response system in a matter of days.
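Linux's own SocketCAN API (not part of the CAESS work, but a convenient illustration) shows how little the bus demands of a sender: any node can emit a frame carrying any message ID, and receivers have no way to verify where it came from. A sketch, with error checking omitted and a made-up message ID and payload:

    #include <string.h>
    #include <unistd.h>
    #include <net/if.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <linux/can.h>
    #include <linux/can/raw.h>

    int main(void)
    {
        int s = socket(PF_CAN, SOCK_RAW, CAN_RAW);
        struct sockaddr_can addr = { 0 };
        struct can_frame frame = { 0 };
        struct ifreq ifr;

        strcpy(ifr.ifr_name, "can0");
        ioctl(s, SIOCGIFINDEX, &ifr);       /* look up the CAN interface */
        addr.can_family = AF_CAN;
        addr.can_ifindex = ifr.ifr_ifindex;
        bind(s, (struct sockaddr *)&addr, sizeof(addr));

        /* Nothing stops a sender from claiming any message ID it likes. */
        frame.can_id = 0x120;               /* a hypothetical ECU's ID */
        frame.can_dlc = 2;
        frame.data[0] = 0xde;
        frame.data[1] = 0xad;
        write(s, &frame, sizeof(frame));

        close(s);
        return 0;
    }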
CAESS's test vehicles did separate the CAN bus into high-priority and low-priority segments, providing a measure of isolation. This too proved inadequate, however: a number of ECUs were connected to both segments and could therefore be used to bridge messages between them. That set-up is not an error; despite common thinking on the subject, quite a few features demanded by car buyers rely on coordination between the high- and low-priority devices.
For example, electronic stability control involves measuring wheel speed, steering angle, throttle, and brakes. Cruise control involves the throttle, brakes, speedometer readings, and possibly ultrasonic range sensors (for collision avoidance). Even the lowly door lock must be connected to multiple systems: wireless key fobs, speed sensors (to lock the doors when in motion), and the cellular network (so that remote roadside assistance can unlock the car).
The paper details a number of attacks the team deployed against the test vehicles. The team wrote a tool called CarShark to analyze and inject CAN bus packets, which provided a method to mount many attacks. However, the vehicle's diagnostic service (called DeviceControl) also proved to be a useful platform for attack. DeviceControl is intended for use by dealers and service stations, but it was easy to reverse engineer, and subsequently allowed a number of additional attacks (such as sending an ECU the "disable all CAN bus communication" command, which effectively shuts off part of the car).
The actual attacks tested include some startlingly dangerous tricks, such as disabling the brakes. But the team also managed to create combined attacks that put drivers at risk even with "low risk" components — displaying false speedometer or fuel gauge readings, disabling dash and interior lights, and so forth. Ultimately the team was able to gain control of every ECU in the car, load and execute custom software, and erase traces of the attack.
Some of these attacks exploited components that did not adhere to the protocol specification. For example, several ECUs allowed their firmware to be re-flashed while the car was in motion, which is expressly forbidden for obvious safety reasons. Other attacks were enabled by run-of-the-mill implementation errors, such as components that re-used the same challenge-response seed value every time they were power-cycled. But ultimately, the critical factor was the fact that any device on the vehicle's internal bus can be used to mount an attack; there is no "lock box" protecting the vital systems, and the protocol at the core of the network lacks fundamental security features taken for granted on other computing platforms.
Vectors
Of course, all of the attacks described in the 2010 paper relied on an attacker with direct access to the vehicle. That did not necessarily mean ongoing access; they explained that a dongle attached to the OBD-II port could work at cracking the challenge-response system while left unattended. But, even though there are a number of individuals with access to a driver's car over the course of a year (from mechanics to valets), direct access is still a hurdle.
The 2011 paper looked at vectors to attack the car remotely, to assess the potential for an attacker to gain access to the car's internal CAN bus, at which point any of the attacks crafted in the 2010 paper could easily be executed. It considered three scenarios: indirect physical access, short-range wireless networking, and long-range wireless networking. As one might fear, all three presented opportunities.
The indirect physical access attacks targeted the CD player and the dealership or service station's scanning equipment, which is physically connected to the car while it is in the shop for diagnosis. CAESS found that the model of diagnostic scanner used (which adhered to a 2004 US-government-mandated standard called PassThru) was internally an embedded Linux device, even though it was only used to interface with a Windows application running on the shop's computer. The scanner was equipped with WiFi, however, and broadcast its address and open TCP port in the clear. The diagnostic application API is undocumented, but the team sniffed the traffic and found several exploitable buffer overflows, not to mention extraneous services like telnet running on the scanner itself. Taking control of the scanner and programming it to upload malicious code to vehicles was little additional trouble.
The CD player attack was different; it started with the CD player's firmware update facility (which loads new firmware onto the player if a properly-named file is found on an inserted disc). But the player can also decode compressed audio files, including undocumented variants of Windows Media Audio (.WMA) files. CAESS found a buffer overflow in the .WMA player code, which in turn allowed the team to load arbitrary code onto the player. As an added bonus, the .WMA file containing the exploit plays fine on a PC, making it harder to detect.
The short-range wireless attack involved attacking the head unit's Bluetooth functionality. The team found that a compromised Android device could be loaded with a trojan horse application designed to upload malicious code to the car whenever it paired. A second option was even more troubling; the team discovered that the car's Bluetooth stack would respond to pairing requests initiated without user intervention. Successfully pairing a covert Bluetooth device still required correctly guessing the four-digit authorization PIN, but since the pairing bypassed the user interface, the attacker could make repeated attempts without those attempts being logged — and, once successful, the paired device does not show up in the head unit's interface, so it cannot be removed.
Finally, the long-range wireless attack gained access to the car's CAN network through the cellular-connected telematics unit (which handles retrieving data for the navigation system, but is also used to connect to the car maker's remote service center for roadside assistance and other tasks). CAESS discovered that although the telematics unit could use a cellular data connection, it also used a software modem application to encode digital data in an audio call — for greater reliability in less-connected regions.
The team reverse-engineered the signaling and data protocols used by this software modem, and were subsequently able to call the car from another cellular device, eventually uploading malicious code through yet another buffer overflow. Even more disturbingly, the team encoded this attack into an audio file, then played it back from an MP3 player into a phone handset, again seizing control over the car.
The team also demonstrated several post-compromise attack-triggering methods, such as delaying activation of the malicious code payload until a particular geographic location was reached, or a particular sensor value (e.g., speed or tire pressure) was read. It also managed to trigger execution of the payload by using a short-range FM transmitter to broadcast a specially-encoded Radio Data System (RDS) message, which vehicles' FM receivers and navigation units decode. The same attack could be performed over longer distances with a more powerful transmitter.
Among the practical exploits outlined in the paper are recording audio through the car's microphone and uploading it to a remote server, and connecting the car's telematics unit to a hidden IRC channel, from which attackers can send arbitrary commands at their leisure. The team speculates on the feasibility of turning this last attack into a commercial enterprise, building "botnet" style networks of compromised cars, and on car thieves logging car makes and models in bulk and selling access to stolen cars in advance, based on the illicit buyers' preferences.
What about Linux?
If, as CAESS seems to have found, the state-of-the-art is so poor in automotive computing security, the question becomes how Linux (and related open source projects) could improve the situation. Certainly some of the problems the team encountered are out of scope for automotive Linux projects. For example, several of the simpler ECUs are unsophisticated microcontrollers; the fact that some of them ship from the factory with blatant flaws (such as a broken challenge-response algorithm) is the fault of the manufacturer. But Linux is expected to run on the higher-end ECUs, such as the IVI head unit and telematics system, and these components were the nexus for the more sophisticated attacks.
Several of the sophisticated attacks employed by CAESS relied on security holes found in application code. The team observed that standard defenses (like stack cookies and address-space randomization) that are established practice in other computing environments simply have not been adopted in automotive system development, for lack of perceived need. Clearly, recognizing that risk and writing more secure application code would improve things, regardless of the operating system in question. But the fact that Linux is so widely deployed elsewhere means that more security-conscious code is available for the taking than there is for any other embedded platform.
Consider the Bluetooth attack, for example. Sure, with a little effort, one could envision a scenario in which unattended Bluetooth pairing is desirable — but in practice, Linux's dominance in the mobile device space means there is a greater likelihood that developers would quickly find and patch the problem than any tier-one supplier working in isolation would.
One step further is the advantage gained by having Linux serve as a common platform used by multiple manufacturers. CAESS observed in its 2011 paper that the "glue code" linking discrete modules together was the greatest source of exploits (e.g., the PassThru diagnostic scanning device), saying "virtually all vulnerabilities emerged at the interface boundaries between code written by distinct organizations." It also noted that this was an artifact of the automotive supply chain itself, in which individual components were contracted out to separate companies working from specifications, then integrated by the car maker once delivered.
A common platform employed by multiple suppliers would go a long way toward minimizing this type of issue, and that approach can only work if the platform is open source.
Finally, the terrifying scope of the attacks carried out in the 2010 paper (and if one does not find them terrifying, one needs to read them again) ultimately traces back to the insecure design of the CAN bus. CAN needs to be replaced; working with a standard IP stack instead means not having to reinvent the wheel. The networking angle has several factors not addressed in CAESS's papers, of course — most notably the still-emerging standards for vehicle ad-hoc networking (intended to serve as a vehicle-to-vehicle and vehicle-to-infrastructure channel).
On that subject, Maxim Raya and Jean-Pierre Hubaux recommend using public-key infrastructure and other well-known practices from the general Internet communications realm. While there might be some skeptics who would argue with Linux's first-class position as a general networking platform, it should be clear to all that proprietary lock-in to a single-vendor solution would do little to improve the vehicle networking problem.
Those on the outside may find the recent push toward Linux in the automotive industry frustratingly slow — after all, there is still no GENIVI code visible to non-members. But to conclude that the pace of development indicates Linux is not up to the task would be a mistake. The reality is that the automotive computing problem is enormous in scope — even considering security alone — and Linux and open source might be the only way to get it under control.
Security
Loading modules from file descriptors
Loadable kernel modules provide a mechanism to dynamically modify the functionality of a running system by allowing code to be loaded into, and unloaded from, the kernel. Loading code into the kernel via a module has a number of advantages over building a completely new monolithic kernel from modified source code. The first of these is that loading a kernel module does not require a system reboot, so new kernel functionality can be added without disturbing users and applications.
From a developer perspective, implementing new kernel functionality via modules is faster: a slow "compile kernel, reboot, test" sequence in each development iteration is instead replaced by a much faster "compile module, load module, test" sequence. Employing modules can also save memory, since code in a module can be loaded into memory only when it is actually needed. Device drivers are often implemented as loadable modules for this reason.
From a security perspective, loadable modules also have a potential downside: since a module has full access to kernel memory, it can compromise the integrity of a system. Although modules can be loaded only by privileged users, there are still potential security risks, since a system administrator may be unable to directly verify the authenticity and origin of a particular kernel module. Providing module-related infrastructure to support administrators in that task is the subject of ongoing effort, with one of the most notable pieces being the work to support module signing.
Kees Cook has recently posted a series of patches that tackle another facet of the module-verification problem. These patches add a new system call for loading kernel modules. To understand why the new system call is useful, we need to start by looking at the existing interface for loading kernel modules.
The Linux interface for loading kernel modules has had (since kernel 2.6.0) the following form:
int init_module(void *module_image, unsigned long len,
const char *param_values);
The caller supplies the ELF image of the to-be-loaded module via the memory buffer pointed to by module_image; len specifies the size of that buffer. (The param_values argument is a string that can be used to specify initial values for the module's parameters.)
The main users of init_module() are the insmod and modprobe commands. However, any privileged user-space application (i.e., one with the CAP_SYS_MODULE capability) can load a module in the same way that these commands do, via a three-step process: opening a file that contains a suitably built ELF image, reading or mmap()ing the file's contents into memory, and then calling init_module().
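That three-step sequence, sketched in code (error handling is trimmed, and glibc provides no wrapper, so the raw system call is used):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int load_module(const char *path)
    {
        struct stat st;
        void *image;
        int fd, ret;

        fd = open(path, O_RDONLY);          /* step 1: open the .ko file */
        if (fd < 0)
            return -1;
        fstat(fd, &st);
        image = mmap(NULL, st.st_size, PROT_READ,  /* step 2: map the image */
                     MAP_PRIVATE, fd, 0);
        close(fd);
        ret = syscall(SYS_init_module, image,      /* step 3: load it */
                      (unsigned long)st.st_size, "");
        munmap(image, st.st_size);
        return ret;
    }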
However, this call sequence is the source of an itch for Kees. Because the step of obtaining a file descriptor for the image file is separated from the module-loading step, the operating system loses the ability to make deductions about the trustworthiness of the module based on its origin in the filesystem.
His solution is fairly straightforward: remove the middle of the three steps listed above. Instead, the application opens the file and passes the returned file descriptor directly to the kernel as part of a new module-loading system call; the kernel then performs the task of reading the module image from the file as a precursor to loading the module.
Although the concept is simple, the solution has been through a few iterations, with the most notable changes being to the details of the user-space interface. Kees's initial proposal was to hack the existing init_module() interface so that, if NULL were passed in the module_image argument, the kernel would interpret the len argument as a file descriptor. Rusty Russell, the kernel modules subsystem maintainer, somewhat bluntly suggested that a new system call would be a better approach. On the next revision of the patch, H. Peter Anvin pointed out that the system call would be better named according to existing conventions, whereby the file-descriptor analog of an existing system call uses the same name with an "f" prefix. Thus, Kees arrived at the currently proposed interface:
int finit_module(int fd, const char *param_values);
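Usage would look like the following sketch; no glibc wrapper or syscall number exists yet, so the __NR_finit_module constant here is hypothetical:

    #include <fcntl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int fload_module(const char *path, const char *params)
    {
        /* The file descriptor tells the kernel exactly where
           the module image came from. */
        int fd = open(path, O_RDONLY);
        int ret;

        if (fd < 0)
            return -1;
        ret = syscall(__NR_finit_module, fd, params);
        close(fd);
        return ret;
    }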
In the most recent patch, Kees, who works for Google on Chrome OS, has also further elaborated on the motivations for adding this system call. Specifically, in order to ensure the integrity of a user's system, the Chrome OS developers would like to be able to enforce the restriction that kernel modules are loaded only from the system's read-only, cryptographically verified root filesystem. Since the developers already trust the contents of the root filesystem, employing module signatures to verify the contents of a kernel module would require the addition of an unnecessary set of keys to the kernel and would also slow down module loading. All that Chrome OS requires is a light-weight mechanism for verifying that the module image originates from that filesystem, and the new system call provides just that facility.
Kees pointed out that the new system call also has potential for wider use. For example, Linux Security Modules (LSMs) could use it to examine digital signatures contained in the module file's extended attributes (the file descriptor provides the kernel with the route to access the extended attributes). During discussion of the patches, interest in the new system call was confirmed by the maintainers of the IMA and AppArmor kernel subsystems.
At this stage, there appear to be few roadblocks to getting this system call into the kernel. The only question is when it will arrive. Kees would very much like to see the patches go into the currently open 3.7 merge window, but for various reasons, it appears probable that they will only be merged in Linux 3.8.
Update, January 2013: finit_module() was indeed merged in Linux 3.8, but with a changed API that added a flags argument that can be used to modify the behavior of the system call. Details can be found in the manual page.
Brief items
Security quotes of the week
Judging from the page titles and content the websites in question were targeted because they reference the number "45".
The Linux Foundation's UEFI secure boot system
The Linux Foundation has announced a new boot system meant to make life easier on UEFI secure boot systems. "In a nutshell, the Linux Foundation will obtain a Microsoft Key and sign a small pre-bootloader which will, in turn, chain load (without any form of signature check) a predesignated boot loader which will, in turn, boot Linux (or any other operating system). The pre-bootloader will employ a 'present user' test to ensure that it cannot be used as a vector for any type of UEFI malware to target secure systems. This pre-bootloader can be used either to boot a CD/DVD installer or LiveCD distribution or even boot an installed operating system in secure mode for any distribution that chooses to use it."
The CryptoParty Handbook
The first draft of the CryptoParty Handbook, a 390-page guide to maintaining privacy in the networked world, is available. "This book was written in the first 3 days of October 2012 at Studio Weise7, Berlin, surrounded by fine food and a lake of coffee amidst a veritable snake pit of cables. Approximately 20 people were involved in its creation, some more than others, some local and some far (Melbourne in particular)." It is available under the (still evolving) CC-BY-SA 4.0 license. The guide, too, is still evolving; it should probably be regarded the way one would look at early-stage cryptographic code. Naturally, the authors are looking for contributors to help make the next release better.
New vulnerabilities
bacula: information disclosure
Package(s): bacula
CVE #(s): CVE-2012-4430
Created: October 8, 2012
Updated: May 19, 2014

Description: From the Debian advisory:

It was discovered that bacula, a network backup service, does not properly enforce console ACLs. This could allow information about resources to be dumped by an otherwise-restricted client.
bind: denial of service
Package(s): bind
CVE #(s): CVE-2012-5166
Created: October 10, 2012
Updated: November 6, 2012

Description: From the Mandriva advisory:

A certain combination of records in the RBT could cause named to hang while populating the additional section of a response.
hostapd: denial of service
Package(s): hostapd
CVE #(s): CVE-2012-4445
Created: October 8, 2012
Updated: October 19, 2012

Description: From the Debian advisory:

Timo Warns discovered that the internal authentication server of hostapd, a user space IEEE 802.11 AP and IEEE 802.1X/WPA/WPA2/EAP Authenticator, is vulnerable to a buffer overflow when processing fragmented EAP-TLS messages. As a result, an internal overflow checking routine terminates the process. An attacker can abuse this flaw to conduct denial of service attacks via crafted EAP-TLS messages prior to any authentication.
libxslt: code execution
Package(s): libxslt
CVE #(s): CVE-2012-2893
Created: October 4, 2012
Updated: October 22, 2012

Description: From the Ubuntu advisory:

Cris Neckar discovered that libxslt incorrectly managed memory. If a user or automated system were tricked into processing a specially crafted XSLT document, a remote attacker could cause libxslt to crash, causing a denial of service, or possibly execute arbitrary code. (CVE-2012-2893)
mozilla: multiple vulnerabilities
Package(s): firefox, thunderbird, seamonkey
CVE #(s): CVE-2012-3983 CVE-2012-3989 CVE-2012-3984 CVE-2012-3985
Created: October 10, 2012
Updated: October 17, 2012

Description: From the Ubuntu advisory:

Henrik Skupin, Jesse Ruderman, Christian Holler, Soroush Dalili and others discovered several memory corruption flaws in Firefox. If a user were tricked into opening a specially crafted web page, a remote attacker could cause Firefox to crash or potentially execute arbitrary code as the user invoking the program. (CVE-2012-3982, CVE-2012-3983, CVE-2012-3988, CVE-2012-3989)

David Bloom and Jordi Chancel discovered that Firefox did not always properly handle the <select> element. A remote attacker could exploit this to conduct URL spoofing and clickjacking attacks. (CVE-2012-3984)

Collin Jackson discovered that Firefox did not properly follow the HTML5 specification for document.domain behavior. A remote attacker could exploit this to conduct cross-site scripting (XSS) attacks via javascript execution. (CVE-2012-3985)

Johnny Stenback discovered that Firefox did not properly perform security checks on test methods for DOMWindowUtils. (CVE-2012-3986)

Alice White discovered that the security checks for GetProperty could be bypassed when using JSAPI. If a user were tricked into opening a specially crafted web page, a remote attacker could exploit this to execute arbitrary code as the user invoking the program. (CVE-2012-3991)

Mariusz Mlynski discovered a history state error in Firefox. A remote attacker could exploit this to spoof the location property to inject script or intercept posted data. (CVE-2012-3992)

Mariusz Mlynski and others discovered several flaws in Firefox that allowed a remote attacker to conduct cross-site scripting (XSS) attacks. (CVE-2012-3993, CVE-2012-3994, CVE-2012-4184)

Abhishek Arya, Atte Kettunen and others discovered several memory flaws in Firefox when using the Address Sanitizer tool. If a user were tricked into opening a specially crafted web page, a remote attacker could cause Firefox to crash or potentially execute arbitrary code as the user invoking the program. (CVE-2012-3990, CVE-2012-3995, CVE-2012-4179, CVE-2012-4180, CVE-2012-4181, CVE-2012-4182, CVE-2012-4183, CVE-2012-4185, CVE-2012-4186, CVE-2012-4187, CVE-2012-4188)
mozilla: multiple vulnerabilities
Package(s): firefox, thunderbird, seamonkey
CVE #(s): CVE-2012-3982 CVE-2012-3986 CVE-2012-3988 CVE-2012-3990 CVE-2012-3991 CVE-2012-3992 CVE-2012-3993 CVE-2012-3994 CVE-2012-3995 CVE-2012-4179 CVE-2012-4180 CVE-2012-4181 CVE-2012-4182 CVE-2012-4183 CVE-2012-4184 CVE-2012-4185 CVE-2012-4186 CVE-2012-4187 CVE-2012-4188
Created: October 10, 2012
Updated: January 10, 2013

Description: From the Red Hat advisory:

Several flaws were found in the processing of malformed web content. A web page containing malicious content could cause Firefox to crash or, potentially, execute arbitrary code with the privileges of the user running Firefox. (CVE-2012-3982, CVE-2012-3988, CVE-2012-3990, CVE-2012-3995, CVE-2012-4179, CVE-2012-4180, CVE-2012-4181, CVE-2012-4182, CVE-2012-4183, CVE-2012-4185, CVE-2012-4186, CVE-2012-4187, CVE-2012-4188)

Two flaws in Firefox could allow a malicious website to bypass intended restrictions, possibly leading to information disclosure, or Firefox executing arbitrary code. Note that the information disclosure issue could possibly be combined with other flaws to achieve arbitrary code execution. (CVE-2012-3986, CVE-2012-3991)

Multiple flaws were found in the location object implementation in Firefox. Malicious content could be used to perform cross-site scripting attacks, script injection, or spoofing attacks. (CVE-2012-1956, CVE-2012-3992, CVE-2012-3994)

Two flaws were found in the way Chrome Object Wrappers were implemented. Malicious content could be used to perform cross-site scripting attacks or cause Firefox to execute arbitrary code. (CVE-2012-3993, CVE-2012-4184)
openstack-keystone: two authentication bypass flaws
Package(s): openstack-keystone
CVE #(s): CVE-2012-4456 CVE-2012-4457
Created: October 4, 2012
Updated: October 10, 2012

Description: From the Red Hat Bugzilla entries [1, 2]:

CVE-2012-4456: Jason Xu discovered several vulnerabilities in OpenStack Keystone token verification. The first occurs in the API /v2.0/OS-KSADM/services and /v2.0/OS-KSADM/services/{service_id}; the second occurs in /v2.0/tenants/{tenant_id}/users/{user_id}/roles. In both cases the OpenStack Keystone code fails to check if the tokens are valid. These issues have been addressed by adding checks in the form of test_service_crud_requires_auth() and test_user_role_list_requires_auth().

CVE-2012-4457: Token authentication for a user belonging to a disabled tenant should not be allowed.
openstack-swift: insecure use of python pickle
Package(s): openstack-swift
CVE #(s): CVE-2012-4406
Created: October 8, 2012
Updated: June 20, 2013

Description: From the Red Hat bugzilla:

Sebastian Krahmer (krahmer@suse.de) reports: swift uses pickle to store and load meta data. pickle is insecure and allows to execute arbitrary code in loads().
php: multiple vulnerabilities
Package(s): php
CVE #(s): (none listed)
Created: October 8, 2012
Updated: October 10, 2012

Description: PHP 5.4.7 fixes multiple vulnerabilities. See the PHP changelog for details.
phpldapadmin: cross-site scripting
Package(s): phpldapadmin
CVE #(s): CVE-2012-1114 CVE-2012-1115
Created: October 8, 2012
Updated: October 10, 2012

Description: From the Red Hat bugzilla:

Originally (2012-03-01), the following cross-site scripting (XSS) flaws were reported against LDAP Account Manager Pro (from the Secunia advisory):

1) Input passed to e.g. the "filteruid" POST parameter when filtering result sets in lam/templates/lists/list.php (when "type" is set to a valid value) is not properly sanitised before being returned to the user. This can be exploited to execute arbitrary HTML and script code in a user's browser session in context of an affected site.

2) Input passed to the "filter" POST parameter in lam/templates/3rdParty/pla/htdocs/cmd.php (when "cmd" is set to "export" and "exporter_id" is set to "LDIF") is not properly sanitised before being returned to the user. This can be exploited to execute arbitrary HTML and script code in a user's browser session in context of an affected site.

3) Input passed to the "attr" parameter in lam/templates/3rdParty/pla/htdocs/cmd.php (when "cmd" is set to "add_value_form" and "dn" is set to a valid value) is not properly sanitised before being returned to the user. This can be exploited to execute arbitrary HTML and script code in a user's browser session in context of an affected site.
php-zendframework: multiple vulnerabilities
Package(s): php-zendframework
CVE #(s): (none listed)
Created: October 8, 2012
Updated: October 10, 2012

Description: From the ZendFramework advisories [1], [2]:

[1] The default error handling view script generated using Zend_Tool failed to escape request parameters when run in the "development" configuration environment, providing a potential XSS attack vector.

[2] Developers using non-ASCII-compatible encodings in conjunction with the MySQL PDO driver of PHP may be vulnerable to SQL injection attacks. Developers using ASCII-compatible encodings like UTF8 or latin1 are not affected by this PHP issue.
wireshark: denial of service
Package(s): wireshark
CVE #(s): CVE-2012-5239 CVE-2012-3548
Created: October 8, 2012
Updated: March 8, 2013

Description: From the CVE entries:

The Mageia advisory references CVE-2012-5239, which is a duplicate of CVE-2012-3548. The dissect_drda function in epan/dissectors/packet-drda.c in Wireshark 1.6.x through 1.6.10 and 1.8.x through 1.8.2 allows remote attackers to cause a denial of service (infinite loop and CPU consumption) via a small value for a certain length field in a capture file. (CVE-2012-3548)
Page editor: Jake Edge
Kernel development
Brief items
Kernel release status
The 3.7 merge window is still open, so there is no current development kernel. See the article below for merges into 3.7 since last week.
Stable updates: The 3.2.31 stable kernel was released on October 10. In addition, the 3.0.45, 3.4.13, 3.5.6, and 3.6.1 stable kernels were released on October 7. Support for the 3.5 series is coming to an end, as there may only be one more update, so users of that kernel should be planning to upgrade.
Quotes of the week
Samsung's F2FS filesystem
Back in August, a linux-kernel discussion on removable device filesystems hinted at a new filesystem waiting in the wings. It now seems clear that said filesystem was the just-announced F2FS, a flash-friendly filesystem from Samsung. "F2FS is a new file system carefully designed for the NAND flash memory-based storage devices. We chose a log structure file system approach, but we tried to adapt it to the new form of storage." See the associated documentation file for details on the F2FS on-disk format and how it works.
[Update: Also see Neil Brown's dissection of f2fs on this week's kernel page.]
Kernel development news
3.7 Merge window part 2
As of this writing, Linus has pulled 9,167 non-merge changesets into the mainline for the 3.7 merge window; that's just over 3,600 changes since last week's summary. As predicted, the merge rate has slowed a bit as Linus found better things to do with his time. Still, it is shaping up to be an active development cycle.

User-visible changes since last week include:
- The kernel's firmware loader will now attempt to load files directly
from user space without involving udev. The firmware path is
currently wired to a few alternatives under /lib/firmware;
the plan is to make things more flexible in the future.
- The epoll_ctl() system call supports a new
EPOLL_CTL_DISABLE operation to disable polling on a specific
file descriptor.
- The Xen paravirtualization mechanism is now supported on the ARM
architecture.
- The tools directory contains a new "trace agent" utility; it
uses virtio to move trace data from a guest system to a host in an
efficient manner. Also added to tools is acpidump,
which can dump a system's ACPI tables to a text file.
- Online resizing of ext4 filesystems that use the metablock group
(meta_bg) or 64-bit block number features is now supported.
- The UBI translation layer for flash-based storage devices has gained
an experimental "fastmap" capability. The fastmap caches erase block
mappings, eliminating the need to scan the device at mount time.
- The Btrfs filesystem has gained the ability to perform hole punching
with the fallocate() system call.
- New hardware support includes:
- Systems and processors:
Freescale P5040DS reference boards,
Freescale / iVeia P1022RDK reference boards, and
MIPS Technologies SEAD3 evaluation boards.
- Audio:
Wolfson Bells boards,
Wolfson WM0010 digital signal processors,
TI SoC based boards with twl4030 codecs,
C-Media CMI8328-based sound cards, and
Dialog DA9055 audio codecs.
- Crypto: IBM 842 Power7 compression accelerators.
- Graphics: Renesas SH Mobile LCD controllers.
- Miscellaneous: ST-Ericsson STE Modem devices,
Maxim MAX8907 power management ICs,
Dialog Semiconductor DA9055 PMICs,
Texas Instruments LP8788 power management units,
Texas Instruments TPS65217 backlight controllers,
TI LM3630 and LM3639 backlight controllers,
Dallas DS2404 RTC chips,
Freescale SNVS RTC modules,
TI TPS65910 RTC chips,
RICOH 5T583 RTC chips,
Marvell MVEBU pin control units (and several SoCs using it),
Marvell 88PM860x PMICs,
LPC32x SLC and MLC NAND controllers, and
TI EDMA controllers.
- Video4Linux2: Syntek STK1160 USB audio/video bridges, TechnoTrend USB infrared receivers, Nokia N900 (RX51) IR transmitters, Chips&Media Coda multi-standard codecs, FCI FC2580 silicon tuners, Analog Devices ADV7604 decoders, Analog Devices AD9389B encoders, Samsung Exynos G-Scaler image processors, Samsung S5K4ECGX sensors, and Elonics E4000 silicon tuners.
Changes visible to kernel developers include:
- The precursors for the user-space API
header file split have been merged. These create
include/uapi directories meant to hold header files
containing the definitions of data types visible to user space.
Actually splitting those definitions out is a lengthy patch set that
looks to be only partially merged in 3.7; the rest will have to wait
for the 3.8 cycle.
- The core of the Nouveau driver for NVIDIA chipsets has been torn out
and rewritten. The developers understand the target hardware much
better than they did when Nouveau started; the code has now been
reworked to match that understanding.
- The Video4Linux2 subsystem tree has been massively reorganized; driver
source files are now organized by bus type. Most files have moved, so
developers working in this area will need to retrain their fingers for
the new locations. There is also a new, rewritten DVB USB core; a
number of drivers have been converted to this new code.
- The ALSA sound driver subsystem has added a new API for the management
of audio channels; see Documentation/sound/alsa/Channel-Mapping-API.txt
for details.
- The red-black tree implementation has been substantially reworked. It now implements both interval trees and priority trees; the older kernel "prio tree" implementation has been displaced by this work and removed.
Linus had raised the possibility of extending the merge window if his travels got in the way of pulling changes into the mainline. The changeset count thus far, though, suggests that there has been no problem with merging, so chances are that the merge window will close on schedule around October 14.
The new visibility of RCU processing
If you run a post-3.6 Linux kernel for long enough,
you will likely see a process named rcu_sched or
rcu_preempt or maybe even rcu_bh having
consumed significant CPU time.
If the system goes idle and all application processes exit,
these processes might well have the largest CPU consumption of
all the remaining processes.
It is only natural to ask “what are these processes and why
are they consuming so much CPU?”
The “what” part is easy: These are new kernel threads
that handle RCU grace periods, previously handled mainly in
softirq context.
An “RCU grace period” is a period of time after which
all pre-existing RCU read-side critical sections have completed,
so that if an RCU updater
removes a data element from an RCU-protected data structure and then
waits for an RCU grace period, it may subsequently safely carry out
destructive-to-readers actions, such as freeing the data element.
RCU read-side critical sections begin with rcu_read_lock()
and end with rcu_read_unlock().
Updaters can wait for an RCU grace period using synchronize_rcu(),
or they can asynchronously schedule a function to be invoked after
a grace period using call_rcu().
RCU's read-side primitives are extremely fast and scalable, so it
can be quite helpful in read-mostly situations.
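The canonical usage pattern, condensed into a sketch (gp is a hypothetical RCU-protected pointer, and serialization between concurrent updaters is omitted):

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
        int data;
    };
    static struct foo __rcu *gp;

    /* Reader: cheap, and never blocks an updater. */
    static int read_data(void)
    {
        struct foo *p;
        int val = -1;

        rcu_read_lock();
        p = rcu_dereference(gp);
        if (p)
            val = p->data;
        rcu_read_unlock();
        return val;
    }

    /* Updater: publish the new version, wait out a grace
       period, then free the old version. */
    static void update_data(struct foo *newp)
    {
        struct foo *oldp = rcu_dereference_protected(gp, 1);

        rcu_assign_pointer(gp, newp);
        synchronize_rcu();    /* all pre-existing readers are done */
        kfree(oldp);
    }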
For more detail on RCU, see:
The RCU API, 2010 Edition,
What is RCU, Fundamentally?,
this
set of slides [PDF], and
the RCU home page.
Quick Quiz 1: Why would latency be reduced by moving RCU work to a kthread? And why would anyone care about latency on huge machines?
The reason for moving RCU grace-period handling to a kernel thread was to improve real-time latency (both interrupt latency and scheduling latency) on huge systems by allowing RCU's grace-period initialization to be preempted: Without preemption, this initialization can inflict more than 200 microseconds of latency on huge systems. In addition, this change will very likely also improve RCU's energy efficiency while also simplifying the code. These potential simplifications are due to the fact that kernel threads make it easier to guarantee forward progress, avoiding hangs in cases where all CPUs are asleep and thus ignoring the current grace period, as confirmed by Paul Walmsley. But the key point here is that these kernel threads do not represent new overhead: Instead, overhead that used to be hidden in softirq context is now visible in kthread context.
Quick Quiz 2: If hackbench does a million grace periods in ten minutes, just how many does something like rcutorture do?
Now for “why so much CPU?”, which is the question
Ingo Molnar asked immediately upon seeing more than three minutes of
CPU time consumed by
rcu_sched after running a couple hours of kernel builds.
The answer is that Linux makes heavy use of RCU, so much so that
running hackbench for ten minutes can result in almost
one million RCU grace periods—and more than thirty seconds
of CPU time consumed by rcu_sched.
This works out to about thirty microseconds per grace period, which
is anything but excessive, considering the amount of work that grace
periods do.
As it turns out, the CPU consumption
of rcu_sched, rcu_preempt, and
rcu_bh
is often roughly equal to the sum of that of the ksoftirqd
threads.
Interestingly enough, in 3.6 and earlier, some of the RCU grace-period overhead
would have been charged to the ksoftirqd kernel threads.
But CPU overhead per grace period is only part of the story.
RCU works hard to process multiple updates (e.g., call_rcu()
or synchronize_rcu() invocations) with a single grace period.
It is not hard to achieve more than one hundred updates per grace
period, which results in a per-update overhead of only about 300 nanoseconds,
which is not bad at all.
Furthermore, workloads having well in excess of one thousand updates
per grace period
have been observed.
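The call_rcu() form is what makes this batching natural: an updater that cannot afford to block simply queues a callback, and all callbacks queued during one grace period are satisfied by that single grace period. A minimal sketch, with illustrative names:

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
    int a;
    struct rcu_head rcu;    /* storage for the deferred callback */
};

/* Invoked by RCU after a grace period has elapsed. */
static void foo_reclaim(struct rcu_head *head)
{
    kfree(container_of(head, struct foo, rcu));
}

/* Non-blocking retirement: the actual kfree() happens later, and many
 * such requests can share a single grace period. */
static void foo_retire(struct foo *fp)
{
    call_rcu(&fp->rcu, foo_reclaim);
}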
Of course, the per-grace-period CPU overhead does vary, and
with it the per-update overhead.
First, the greater the number of possible CPUs
(as given at boot time by nr_cpu_ids),
the more work RCU must do when initializing and cleaning up grace periods.
This overhead grows fairly slowly, with additional work required
with the addition of each set of 16 CPUs, though this number varies
depending on the
CONFIG_RCU_FANOUT_LEAF kernel configuration parameter
and also on the rcutree.rcu_fanout_leaf kernel boot parameter.
Second, the greater the number of idle CPUs, the more work RCU must do when forcing quiescent states. Yes, the busier the system, the less work RCU needs to do! The reason for the extra work is that RCU is not permitted to disturb idle CPUs for energy-efficiency reasons. RCU must therefore probe a per-CPU data structure to read out idleness state during each grace period, likely incurring a cache miss on each such probe.
Third and finally, the overhead will vary depending on CPU clock rate, memory-system performance, virtualization overheads, and so on. All that aside, I see per-grace-period overheads ranging from 15 to 100 microseconds on the systems I use. I suspect that a system with thousands of CPUs might consume many hundreds of microseconds, or perhaps even milliseconds, of CPU time for each grace period. On the other hand, such a system might also handle a very large number of updates per grace period.
Quick Quiz 3: Now that all of the RCU overhead is appearing on the rcu_sched, rcu_preempt, and rcu_bh kernel threads, we should be able to more easily identify that overhead and optimize RCU, right? Answer
In conclusion, the rcu_sched, rcu_preempt, and
rcu_bh CPU overheads should not be anything to worry about.
They do not represent new overhead inflicted on post-3.6 kernels,
but rather better accounting of the
same overhead that RCU has been incurring all along.
Acknowledgments
I owe thanks to Ingo Molnar for first noting this issue and the need to let the community know about it. We all owe a debt of gratitude to Steve Dobbelstein, Stephen Rothwell, Jon Corbet, and Paul Walmsley for their help in making this article human-readable. I am grateful to Jim Wasko for his support of this effort.
Answers to Quick Quizzes
Quick Quiz 1: Why would latency be reduced by moving RCU work to a kthread? And why would anyone care about latency on huge machines?
Answer: Moving work from softirq to a kthread allows that work to be more easily preempted, and this preemption reduces scheduling latency. Low scheduling latency is of course important in real-time applications, but it also helps reduce OS jitter. Low OS jitter is critically important to certain types of high-performance-computing (HPC) workloads, which is the type of workload that tends to be run on huge systems.
Quick Quiz 2:
Wow!!! If hackbench does a million grace periods in ten minutes,
just how many does something like rcutorture do?
Answer:
Actually, rcutorture tortures RCU in many different ways,
including overly long read-side critical sections, transitions to and
from idle, and CPU hotplug operations.
Thus, a typical rcutorture run would probably “only”
do about 100,000 grace periods in a ten-minute interval.
In short, the grace-period rate can vary widely depending on your hardware, kernel configuration, and workload.
Quick Quiz 3:
Now that all of the RCU overhead is appearing
on the rcu_sched, rcu_preempt, and
rcu_bh kernel threads, we should be able to more
easily identify that overhead and optimize RCU, right?
Answer:
Yes and no.
Yes, it is easier to optimize that which can be easily measured.
But no, not all of RCU's overhead appears on the
rcu_sched, rcu_preempt, and
rcu_bh kernel threads.
Some of it still appears on the ksoftirqd kernel
threads, and some of it is spread over other tasks.
Still, yes, the greater visibility should be helpful.
An f2fs teardown
When a techno-geek gets a new toy there must always be an urge to take it apart and see how it works. Practicalities (and warranties) sometimes suppress that urge, but in the case of f2fs and this geek, the urge was too strong. What follows is the result of taking apart this new filesystem to see how it works.
f2fs (interestingly not "f3s") is the "flash-friendly file system", a new filesystem for Linux recently announced by engineers from Samsung. Unlike jffs2 and logfs, f2fs is not targeted at raw flash devices, but rather at the specific hardware that is commonly available to consumers — SSDs, eMMC, SD cards, and other flash storage with an FTL (flash translation layer) already built in. It seems that as hardware gets smarter, we need to make even more clever software to manage that "smartness". Does this sound like parenting to anyone else?
f2fs is based on the log-structured filesystem (LFS) design — which is hardly surprising given the close match between the log-structuring approach and the needs of flash. For those not familiar with log-structured design, the key elements are:
- That it requires copy-on-write, so data is always written to previously unused space.
- That free space is managed in large regions which are written to sequentially. When the number of free regions gets low, data that is still live is coalesced from several regions into one free region, thus creating more free regions. This process is known as "cleaning" and the overhead it causes is one of the significant costs of log structuring.
As the FTL typically uses a log-structured design to provide the wear-leveling and write-gathering that flash requires, this means that there are two log structures active on the device — one in the firmware and one in the operating system. f2fs is explicitly designed to make use of this fact and leaves a number of tasks to the FTL while focusing primarily on those tasks that it is well positioned to perform. So, for example, f2fs makes no effort to distribute writes evenly across the address space to provide wear-leveling.
The particular value that f2fs brings, which can justify it being "flash friendly", is that it provides large-scale write gathering so that when lots of blocks need to be written at the same time they are collected into large sequential writes which are much easier for the FTL to handle. Rather than creating a single large write, f2fs actually creates up to six in parallel. As we shall see, these are assigned different sorts of blocks with different life expectancies. Grouping blocks with similar life expectancies together tends to make the garbage collection process required by the LFS less expensive.
The "large-scale" is a significant qualifier — f2fs doesn't always gather writes into contiguous streams, only almost always. Some metadata, and occasionally even some regular data, is written via random single-block writes. This would be anathema for a regular log-structured filesystem, but f2fs chooses to avoid a lot of complexity by just doing small updates when necessary and leaving the FTL to make those corner cases work.
Before getting into the details of how f2fs does what it does, a brief list of some of the things it doesn't do is in order.
A feature that we might expect from a copy-on-write filesystem is cheap snapshots as they can be achieved by simply not freeing up the old copy. f2fs does not provide these and cannot in its current form due to its two-locations approach to some metadata which will be detailed later.
Other features that are missing are usage quotas, NFS export, and the "security" flavor of extended attributes (xattrs). Each of these could probably be added with minimal effort if they are needed, though integrating quotas correctly with the crash recovery would be the most challenging. We shouldn't be surprised to see some of these in a future release.
Blocks, segments, sections, and zones
Like most filesystems, f2fs is comprised of blocks. All blocks are 4K in size, though the code implicitly links the block size with the system page size, so it is unlikely to work on systems with larger page sizes as is possible with IA64 and PowerPC. The block addresses are 32 bits so the total number of addressable bytes in the filesystem is at most 2^(32+12) bytes, or 16 terabytes. This is probably not a limitation — for current flash hardware at least.
Blocks are collected into "segments". A segment is 512 blocks or 2MB in size. The documentation describes this as a default, but this size is fairly deeply embedded in the code. Each segment has a segment summary block which lists the owner (file plus offset) of each block in the segment. The summary is primarily used when cleaning to determine which blocks need to be relocated and how to update the index information after the relocation. One block can comfortably store summary information for 512 blocks (with a bit of extra space which has other uses), so 2MB is the natural size for a segment. Larger would be impractical and smaller would be wasteful.
Segments are collected into sections. There is genuine flexibility in the size of a section, though it must be a power of two. A section corresponds to a "region" in the outline of log structuring given above. A section is normally filled from start to end before looking around for another section, and the cleaner processes one section at a time. The default size when using the mkfs utility is 2^0, or one segment per section.
f2fs has six sections "open" for writing at any time with different sorts of data being written to each one. The different sections allow for file content (data) to be kept separate from indexing information (nodes), and for those to be divided into "hot", "warm", and "cold" according to various heuristics. For example, directory data is treated as hot and kept separate from file data because they have different life expectancies. Data that is cold is expected to remain unchanged for quite a long time, so a section full of cold blocks is likely to not require any cleaning. Nodes that are hot are expected to be updated soon, so if we wait a little while, a section that was full of hot nodes will have very few blocks that are still live and thus will be cheap to clean.
Sections are collected into zones. There may be any (integer) number of sections in a zone though the default is again one. The sole purpose of zones is to try to keep these six open sections in different parts of the device. The theory seems to be that flash devices are often made from a number of fairly separate sub-devices each of which can process IO requests independently and hence in parallel. If zones are sized to line up with the sub-devices, then the six open sections can all handle writes in parallel and make best use of the device.
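The arithmetic is easy to check with a few lines of user-space C; the constant names below are illustrative, not f2fs's actual identifiers, and the values are the defaults described above:

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE    4096ULL   /* all blocks are 4K */
#define SEG_BLOCKS     512ULL   /* blocks per segment */
#define SEGS_PER_SEC     1ULL   /* mkfs default: 2^0 segments per section */
#define SECS_PER_ZONE    1ULL   /* default: one section per zone */

int main(void)
{
    uint64_t seg_bytes = SEG_BLOCKS * BLOCK_SIZE;
    uint64_t max_bytes = (1ULL << 32) * BLOCK_SIZE; /* 32-bit block addresses */

    printf("segment size: %lluMB\n", (unsigned long long)(seg_bytes >> 20));
    printf("maximum filesystem size: %lluTB\n",
           (unsigned long long)(max_bytes >> 40));
    return 0;
}

This prints a 2MB segment and a 16TB maximum filesystem size, matching the figures above.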
These zones, full of sections of segments of blocks, make up the "main" area of the filesystem. There is also a "meta" area which contains a variety of different metadata such as the segment summary blocks already mentioned. This area is not managed following normal log-structured lines and so leaves more work for the FTL to do. Hopefully it is small enough that this isn't a problem.
There are three approaches to management of writes in this area. First, there is a small amount of read-only data (the superblock) which is never written once the filesystem has been created. Second, there are the segment summary blocks which have already been mentioned. These are simply updated in place. This can lead to uncertainty as to the "correct" contents of a block after a crash; for segment summaries, however, this is not an actual problem. The information in each summary is checked for validity before it is used, and if there is any chance that information is missing, it will be recovered from other sources during the recovery process.
The third approach involves allocating twice as much space as is required so that each block has two different locations it can exist in, a primary and a secondary. Only one of these is "live" at any time and the copy-on-write requirement of an LFS is met by simply writing to the non-live location and updating the record of which is live. This approach to metadata is the main impediment to providing snapshots. f2fs does a small amount of journaling of updates to this last group while creating a checkpoint, which might ease the task for the FTL somewhat.
Files, inodes, and indexing
Most modern filesystems seem to use B-trees or similar structures for managing indexes to locate the blocks in a file. In fact they are so fundamental to btrfs that it takes its name from that data structure. f2fs doesn't. Many filesystems reduce the size of the index by the use of "extents" which provide a start and length of a contiguous list of blocks rather than listing all the addresses explicitly. Again, f2fs doesn't (though it does maintain one extent per inode as a hint).
Rather, f2fs uses an indexing tree that is very reminiscent of the original Unix filesystem and descendants such as ext3. The inode contains a list of addresses for the early blocks in the file, then some addresses for indirect blocks (which themselves contain more addresses) as well as some double and triple-indirect blocks. While ext3 has 12 direct addresses and one each of the indirection addresses, f2fs has 929 direct addresses, two each of indirect and double-indirect addresses, and a single triple-indirect address. This allows the addressing of nearly 4TB for a file, or one-quarter of the maximum filesystem size.
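That "nearly 4TB" figure can be sanity-checked. The calculation below assumes that a 4K index block holds roughly 1018 four-byte block addresses (1024 minus a little header space); that per-block count is an assumption made for illustration, not a number taken from the f2fs source:

#include <stdint.h>
#include <stdio.h>

#define ADDRS_PER_INODE  929ULL /* direct addresses in the inode */
#define ADDRS_PER_BLOCK 1018ULL /* assumed addresses per index block */

int main(void)
{
    uint64_t n = ADDRS_PER_BLOCK;
    uint64_t blocks = ADDRS_PER_INODE   /* direct */
        + 2 * n                         /* two indirect */
        + 2 * n * n                     /* two double-indirect */
        + n * n * n;                    /* one triple-indirect */

    printf("maximum file size: about %lluGB\n",
           (unsigned long long)((blocks * 4096) >> 30));
    return 0;
}

This works out to roughly 4032GB, which is indeed nearly 4TB.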
While this scheme has some costs — which is why other filesystems have discarded it — it has a real benefit for an LFS. As f2fs does not use extents, the index tree for a given file has a fixed and known size. This means that when blocks are relocated through cleaning, it is impossible for changes in available extents to cause the indexing tree to get bigger — which could be embarrassing when the point of cleaning is to free space. logfs, another reasonably modern log-structured filesystem for flash, uses much the same arrangement for much the same reason.
Obviously, all this requires a slightly larger inode than ext3 uses. Copy-on-write is rather awkward for objects that are smaller than the block size so f2fs reserves a full 4K block for each inode which provides plenty of space for indexing. It even provides space to store the (base) name of the file, or one of its names, together with the inode number of the parent. This simplifies the recovery of recently-created files during crash recovery and reduces the number of blocks that need to be written for such a file to be safe.
Given that the inode is so large, one would expect that small files and certainly small symlinks would be stored directly in the inode, rather than just storing a single block address and storing the data elsewhere. However f2fs doesn't do that. Most likely the reality is that it doesn't do it yet. It is an easy enough optimization to add, so it's unlikely to remain absent for long.
As already mentioned, the inode contains a single extent that is a summary of some part of the index tree. It says that some range of blocks in the file are contiguous in storage and gives the address of this range. The filesystem attempts to keep the largest extent recorded here and uses it to speed up address lookups. For the common case of a file being written sequentially without any significant pause, this should result in the entire file being in that one extent, and make lookups in the index tree unnecessary.
Surprisingly, it doesn't seem there was enough space to store 64-bit timestamps, so instead of nanosecond resolution for several centuries in the future, it only provides single-second resolution until some time in 2038. This oversight was raised on linux-kernel and may well be addressed in a future release.
One of the awkward details of any copy-on-write filesystem is that whenever a block is written, its address is changed, so its parent in the indexing tree must change and be relocated, and so on up to the root of the tree. The logging nature of an LFS means that roll-forward during recovery can rebuild recent changes to the indexing tree so all the changes do not have to be written immediately, but they do have to be written eventually, and this just makes more work for the cleaner.
This is another area where f2fs makes use of its underlying FTL and takes a short-cut. Among the contents of the "meta" area is a NAT — a Node Address Table. Here "node" refers to inodes and to indirect indexing blocks, as well as blocks used for xattr storage. When the address of an inode is stored in a directory, or an index block is stored in an inode or another index block, it isn't the block address that is stored, but rather an offset into the NAT. The actual block address is stored in the NAT at that offset. This means that when a data block is written, we still need to update and write the node that points to it. But writing that node only requires updating the NAT entry. The NAT is part of the metadata that uses two-location journaling (thus depending on the FTL for write-gathering) and so does not require further indexing.
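The effect of the indirection can be sketched in a few lines of illustrative (not actual f2fs) code:

#include <stdint.h>

/* One entry per node: inode, indirect index block, or xattr block. */
struct nat_entry {
    uint32_t block_addr;    /* current on-flash location of the node */
};

static struct nat_entry nat[1 << 20];   /* table size chosen arbitrarily */

/* Directories and index blocks store node IDs, not block addresses,
 * so finding a node means one extra lookup through the NAT... */
static uint32_t node_location(uint32_t node_id)
{
    return nat[node_id].block_addr;
}

/* ...and relocating a node touches only its NAT entry; nothing that
 * references the node needs to be rewritten. */
static void node_relocate(uint32_t node_id, uint32_t new_addr)
{
    nat[node_id].block_addr = new_addr;
}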
Directories
An LFS doesn't really impose any particular requirements on the layout of a directory, except that updates should change as few blocks as possible, which is generally good for performance anyway. So we can assess f2fs's directory structure on an equal footing with other filesystems. The primary goal is to provide fast lookup by file name, and to provide a stable address for each name that can be reported using telldir().
The original Unix filesystem (once it had been adjusted for 256-byte file names) used the same directory scheme as ext2 — sequential search through a file full of directory entries. This is simple and effective, but doesn't scale well to large directories.
More modern filesystems such as ext3, xfs, and btrfs use various schemes involving B-trees, sometimes indexed by a hash of the file name. One of the problems with B-trees is that nodes sometimes need to be split and this causes some directory entries to be moved around in the file. This results in extra challenges to provide stable addresses for telldir() and is probably the reason that telldir() is often called out for being a poor interface.
f2fs uses some sequential searching and some hashing to provide a scheme that is simple, reasonably efficient, and trivially provides stable telldir() addresses. A lot of the hashing code is borrowed from ext3, however f2fs omits the use of a per-directory seed. This seed is a secret random number which ensures that the hash values used are different in each directory, so they are not predictable. Using such a seed provides protection against hash-collision attacks. While these might be unlikely in practice, they are so easy to prevent that this omission is a little surprising.
It is easiest to think of the directory structure as a series of hash tables stored consecutively in a file. Each hash table has a number of fairly large buckets. A lookup proceeds from the first hash table to the next, at each stage performing a linear search through the appropriate bucket, until either the name is found or the last hash table has been searched. During the search, any free space in a suitable bucket is recorded in case we need to create the name.
The first hash table has exactly one bucket which is two blocks in size, so for the first few hundred entries, a simple linear search is used. The second hash table has two buckets, then four, then eight and so on until the 31st table with about a billion buckets, each two blocks in size. Subsequent hash tables — should you need that many — all have the same number of buckets as the 31st, but now they are four blocks in size.
The result is that a linear search of several hundred entries can be required, possibly progressing through quite a few blocks if the directory is very large. The number of tables to search increases only as the logarithm of the number of entries in the directory, so the scheme scales fairly well. This is certainly better than a purely sequential search, but seems like it could be a lot more work than is really necessary. It does, however, guarantee that only one block needs to be updated for each addition or deletion of a file name, and, since entries are never moved, that the offset in the file is a stable address for telldir(); both are valuable features.
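The geometry of the scheme can be captured in a short sketch (levels are numbered from zero here, and the names are illustrative):

#include <stdint.h>

#define LAST_DOUBLING_LEVEL 30  /* the 31st table: about a billion buckets */

/* Buckets double at each level until the 31st table, then stay fixed. */
static uint64_t buckets_in_level(unsigned int level)
{
    if (level > LAST_DOUBLING_LEVEL)
        level = LAST_DOUBLING_LEVEL;
    return 1ULL << level;
}

/* Buckets are two blocks through the 31st table, four blocks after it. */
static unsigned int blocks_per_bucket(unsigned int level)
{
    return level <= LAST_DOUBLING_LEVEL ? 2 : 4;
}

/* A lookup hashes the name once, then at each level linearly searches
 * bucket (hash % buckets_in_level(level)) until the name is found or
 * the directory's last allocated level has been searched. */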
Superblocks, checkpoints, and other metadata
All filesystems have a superblock and f2fs is no different. However it does make a clear distinction between those parts of the superblock which are read-only and those which can change. These are kept in two separate data structures.
The f2fs_super_block, which is stored in the second block of the device, contains only read-only data. Once the filesystem is created, this is never changed. It describes how big the filesystem is, how big the segments, sections, and zones are, how much space has been allocated for the various parts of the "meta" area, and other little details.
The rest of the information that you might expect to find in a superblock, such as the amount of free space, the address of the segments that should be written to next, and various other volatile details, are stored in an f2fs_checkpoint. This "checkpoint" is one of the metadata types that follows the two-location approach to copy-on-write — there are two adjacent segments both of which store a checkpoint, only one of which is current. The checkpoint contains a version number so that when the filesystem is mounted, both can be read and the one with the higher version number is taken as the live version.
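A sketch of the mount-time selection, assuming that a copy failing its validity checks is treated as absent (the names are illustrative, not f2fs's actual code):

#include <stddef.h>
#include <stdint.h>

struct checkpoint {
    uint64_t version;
    /* ... free-space counts, next-segment pointers, and so on ... */
};

/* Both copies are read at mount time; a NULL argument stands for a
 * copy that failed its validity checks. The survivor with the higher
 * version number is the live checkpoint. */
static const struct checkpoint *
live_checkpoint(const struct checkpoint *a, const struct checkpoint *b)
{
    if (!a)
        return b;
    if (!b)
        return a;
    return a->version > b->version ? a : b;
}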
We have already mentioned the Node Address Table (NAT) and Segment Summary Area (SSA) that also occupy the meta area with the superblock (SB) and Checkpoints (CP). The one other item of metadata is the Segment Info Table or SIT.
The SIT stores 74 bytes per segment and is kept separate from the segment summaries because it is much more volatile. It primarily keeps track of which blocks are still in active use so that the segment can be reused when it has no active blocks, or can be cleaned when the active block count gets low.
When updates are required to the NAT or the SIT, f2fs doesn't make them immediately, but stores them in memory until the next checkpoint is written. If there are relatively few updates then they are not written out to their final home but are instead journaled in some spare space in Segment Summary blocks that are normally written at the same time. If the total amount of updates that are required to Segment Summary blocks is sufficiently small, even they are not written and the SIT, NAT, and SSA updates are all journaled with the Checkpoint block — which is always written during checkpoint. Thus, while f2fs feels free to leave some work to the FTL, it tries to be friendly and only performs random block updates when it really has to. When f2fs does need to perform random block updates it will perform several of them at once, which might ease the burden on the FTL a little.
Knowing when to give up
Handling filesystem-full conditions in traditional filesystems is relatively easy. If no space is left, you just return an error. With a log-structured filesystem, it isn't that easy. There might be a lot of free space, but it might all be in different sections and so it cannot be used until those sections are "cleaned", with the live data packed more densely into fewer sections. It usually makes sense to over-provision a log-structured filesystem so there are always free sections to copy data to for cleaning.
The FTL takes exactly this approach and will over-provision to both allow for cleaning and to allow for parts of the device failing due to excessive wear. As the FTL handles over-provisioning internally there is little point in f2fs doing it as well. So when f2fs starts running out of space, it essentially gives up on the whole log-structured idea and just writes randomly wherever it can. Inodes and index blocks are still handled carefully and there is a small amount of over-provisioning for them, but data is just updated in place, or written to any free block that can be found. Thus you can expect performance of f2fs to degrade when the filesystem gets close to full, but that is common to a lot of filesystems so it isn't a big surprise.
Would I buy one?
f2fs certainly seems to contain a number of interesting ideas, and a number of areas for possible improvement — both attractive attributes. Whether reality will match the promise remains to be seen. One area of difficulty is that the shape of an f2fs (such as section and zone size) needs to be tuned to the particular flash device and its FTL; vendors are notoriously secretive about exactly how their FTL works. f2fs also requires that the flash device is comfortable having six or more concurrently "open" write areas. This may not be a problem for Samsung, but does present some problems for your average techno-geek — though Arnd Bergmann has done some research that may prove useful. If this leads to people reporting performance results based on experiments where the f2fs isn't tuned properly to the storage device, it could be harmful for the project as a whole.
f2fs contains a number of optimizations which aim to ease the burden on the FTL. It would be very helpful to know how often these actually result in a reduction in the number of writes. That would help confirm that they are a good idea, or suggest that further refinement is needed. So, some gathering of statistics about how often the various optimizations fire would help increase confidence in the filesystem.
f2fs seems to have been written without much expectation of highly parallel workloads. In particular, all submissions of write requests are performed under a single semaphore. So f2fs probably isn't the filesystem to use for big-data processing on 256-core work-horses. It should be fine on mobile computing devices for a few more years though.
And finally, lots of testing is required. Some preliminary performance measurements have been posted, but to get a fair comparison you really need an "aged" filesystem and a large mix of workloads. Hopefully someone will make the time to do the testing.
Meanwhile, would I use it? Given that my phone is as much a toy to play with as a tool to use, I suspect that I would. However, I would make sure I had reliable backups first. But then ... I probably should do that anyway.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Security-related
Miscellaneous
Page editor: Jonathan Corbet
Distributions
Canonical courts financial contributions
Financial contributions are part of many free software projects. Users can, and do, contribute in lots of different ways, but helping the project keep the lights on and, perhaps, even cover some development time, is fairly common. Financial donations may also be used to champion certain features for a project; by ponying up some money, the donor may get some input into the direction of the project. The latter seems to be part of the motivation for Canonical's recent push to more prominently feature—but not require—financial contributions as part of the desktop download process.
On October 9, Steve George, Canonical's VP of Communications and Products, posted a message to the company's blog that described the change:
It's a bit surprising to see a large, company-backed distribution looking for financial contributions, but George made it clear that there has always been a way to do so, "albeit in a not-easy-to-find spot on our website". He said that users have been asking for a simpler way to contribute money, so Canonical was making one available. Now, clicking through to a desktop download page brings up a web application that allows contributions of up to $125 in each of eight categories.
According to community manager Jono Bacon, the application was inspired by the various Humble Bundles, which allow people to pay what they wish for computer games or ebooks. While the sliders used by Humble Bundles allow users to choose the amount to pay the authors, charities, and the company, the sliders in the Canonical application allow a choice of features contributors would like to put their money toward. The possibilities are:
- Make the desktop more amazing
- Performance optimisation for games and apps
- Improve hardware support on more PCs
- Phone and tablet versions of Ubuntu
- Community participation in Ubuntu development
- Better coordination with Debian and upstreams
- Better support for flavours like Kubuntu, Xubuntu, Lubuntu
- Tip to Canonical – they help make it happen
To aid the user in evaluating how much to give, the application suggests products that would cost roughly the same as the total contribution. Those range from a "grande extra shot mocha latte chino" at $2, through a "pair of LP Matador bongo drums" at $100, up to "an eight year-old dromedary camel" at $1000. Visitors can either choose their donation level and contribute via PayPal, or click through a link to go on to the download.
While Canonical undoubtedly does a great deal of work for the benefit of millions of Linux users, this is a rather unconventional approach. It is a little hard to imagine that it will generate a significant revenue stream, at least for an organization the size of Canonical. But the real value to Canonical (and by extension, Ubuntu) may be in the feedback it gets from users.
The categories for features are fairly broad, but a consensus among users on one (or a few) of them is certainly useful information. The fact that those users are willing to pay something to make that vote makes the data all the more interesting. There have been persistent complaints that some of the other Ubuntu flavors, Lubuntu, Xubuntu, Kubuntu, and so on, have lacked for financial backing in comparison to the number of users they bring to the table. This effort would give those flavors an opportunity to send a message, for example.
It would be great if the community were to get some visibility into the contributions. Canonical may be understandably loath to give out direct financial information, but there are other ways to do the reporting that would still benefit both the Ubuntu community as well as the larger FOSS ecosystem. Information on the "votes", perhaps as percentages of the number and dollar amount of the contributions, would be useful. That would help Ubuntu see places where more effort is desired as well as identifying potential trouble spots for other distributions and projects. According to George, the company is working on a schedule and format for reporting on the contributions.
Canonical has pursued other non-traditional revenue sources in the past, so this could just be another. With enough different revenue streams, even if some are fairly small and unpredictable, the company could reach a profitable state. That can only be a good thing for the long-term prosperity of not just Canonical, but the Ubuntu distribution as well. Ubuntu has millions of users worldwide and the Linux ecosystem is richer for its presence, so anything that helps its continued existence is definitely a net positive.
Brief items
Distribution quotes of the week
> I find it therefore doubtful that keeping the bottle logo solves any real world problem.
I find it doubtful that getting rid of the bottle logo solves any real world problem.
Distribution News
Debian GNU/Linux
bits from the DPL: September 2012
Debian Project Leader Stefano Zacchiroli has a few bits on his September activities. Topics include Google Code-In, Logo relicensing, and more.
Fedora
Fedora 18 pushed back by one week
Although the meeting was to discuss the Fedora 18 Beta freeze, the one-week delay that was decided upon will push the expected final release out to early December.
Newsletters and articles of interest
Distribution newsletters
- DistroWatch Weekly, Issue 477 (October 8)
- Maemo Weekly News (October 8)
- Ubuntu Weekly Newsletter, Issue 286 (October 7)
Garrett: Handling UEFI Secure Boot in smaller distributions
Matthew Garrett looks at UEFI secure boot in smaller distributions. "I've taken Suse's code for key management and merged it into my own shim tree with a few changes. The significant difference is a second stage bootloader signed with an untrusted key will cause a UI to appear, rather than simply refusing to boot. This will permit the user to then navigate the available filesystems, choose a key and indicate that they want to enrol it. From then on, the bootloader will trust binaries signed with that key."
Arch Linux switches to systemd (The H)
The H reports that the latest installation image for Arch Linux boots with systemd. "Based on the 3.5.5 Linux kernel, Arch Linux 2012.10.06 is a regular monthly snapshot of the rolling-release operating system for new installations. Among the changes since the last snapshot are a simplified EFI boot and setup process, and the use of the gummiboot boot manager to display a menu on EFI systems. Additionally, new packages including ethtool, the FSArchiver tool, the Partimage and Partclone partition utilities, rfkill and the TestDisk data recovery tool are now available on the live system."
Page editor: Rebecca Sobol
Development
getauxval() and the auxiliary vector
There are many mechanisms for communicating information between user-space applications and the kernel. System calls and pseudo-filesystems such as /proc and /sys are of course the most well known. Signals are similarly well known; the kernel employs signals to inform a process of various synchronous or asynchronous events—for example, when the process tries to write to a broken pipe or a child of the process terminates.
There are also a number of more obscure mechanisms for communication between the kernel and user space. These include the Linux-specific netlink sockets and user-mode helper features. Netlink sockets provide a socket-style API for exchanging information with the kernel. The user-mode helper feature allows the kernel to automatically invoke user-space executables; this mechanism is used in a number of places, including the implementation of control groups and piping core dumps to a user-space application.
The auxiliary vector, a mechanism for communicating information from the kernel to user space, has remained largely invisible until now. However, with the addition of a new library function, getauxval(), in the GNU C library (glibc) 2.16 release that appeared at the end of June, it has now become more visible.
Historically, many UNIX systems have implemented the auxiliary vector feature. In essence, it is a list of key-value pairs that the kernel's ELF binary loader (fs/binfmt_elf.c in the kernel source) constructs when a new executable image is loaded into a process. This list is placed at a specific location in the process's address space; on Linux systems it sits at the high end of the user address space, just above the (downwardly growing) stack, the command-line arguments (argv), and environment variables (environ).
From this description, we can see that although the auxiliary vector is somewhat hidden, it is accessible with a little effort. Even without using the new library function, an application that wants to access the auxiliary vector merely needs to obtain the address of the location that follows the NULL pointer at the end of the environment list. Furthermore, at the shell level, we can discover the auxiliary vector that was supplied to an executable by setting the LD_SHOW_AUXV environment variable when launching an application:
$ LD_SHOW_AUXV=1 sleep 1000
AT_SYSINFO_EHDR: 0x7fff35d0d000
AT_HWCAP: bfebfbff
AT_PAGESZ: 4096
AT_CLKTCK: 100
AT_PHDR: 0x400040
AT_PHENT: 56
AT_PHNUM: 9
AT_BASE: 0x0
AT_FLAGS: 0x0
AT_ENTRY: 0x40164c
AT_UID: 1000
AT_EUID: 1000
AT_GID: 1000
AT_EGID: 1000
AT_SECURE: 0
AT_RANDOM: 0x7fff35c2a209
AT_EXECFN: /usr/bin/sleep
AT_PLATFORM: x86_64
The auxiliary vector of each process on the system is also visible via a corresponding /proc/PID/auxv file. Dumping the contents of the file that corresponds to the above command (as eight-byte decimal numbers, because the keys and values are of that size on the 64-bit system used for this example), we can see the key-value pairs in the vector, followed by a pair of zero values that indicate the end of the vector:
$ od -t d8 /proc/15558/auxv
0000000 33 140734096265216
0000020 16 3219913727
0000040 6 4096
0000060 17 100
0000100 3 4194368
0000120 4 56
0000140 5 9
0000160 7 0
0000200 8 0
0000220 9 4200012
0000240 11 1000
0000260 12 1000
0000300 13 1000
0000320 14 1000
0000340 23 0
0000360 25 140734095335945
0000400 31 140734095347689
0000420 15 140734095335961
0000440 0 0
0000460
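For the curious, that manual approach can be sketched in a few lines of C. This sketch assumes a 64-bit system and that the process has not modified its environment (which could leave environ pointing away from the original layout):

#include <elf.h>    /* Elf64_auxv_t, AT_NULL */
#include <stdio.h>

extern char **environ;

int main(void)
{
    char **p = environ;
    Elf64_auxv_t *aux;

    while (*p)  /* skip the environment pointers */
        p++;
    p++;        /* step over the terminating NULL */

    for (aux = (Elf64_auxv_t *) p; aux->a_type != AT_NULL; aux++)
        printf("%3lu: 0x%lx\n",
               (unsigned long) aux->a_type,
               (unsigned long) aux->a_un.a_val);
    return 0;
}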
Scanning the high end of user-space memory or /proc/PID/auxv is a clumsy way of retrieving values from the auxiliary vector. The new library function provides a simpler mechanism for retrieving individual values from the list:
#include <sys/auxv.h>
unsigned long int getauxval(unsigned long int type);
The function takes a key as its single argument, and returns the corresponding value. The glibc header files define a set of symbolic constants with names of the form AT_* for the key value passed to getauxval(); these names are exactly the same as the strings displayed when executing a command with LD_SHOW_AUXV=1.
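For example, a program can query its own page size and the pathname it was executed as; note that some values, such as AT_EXECFN, are pointers and need a cast:

#include <sys/auxv.h>
#include <stdio.h>

int main(void)
{
    unsigned long pagesz = getauxval(AT_PAGESZ);
    unsigned long execfn = getauxval(AT_EXECFN);

    printf("page size: %lu\n", pagesz);
    if (execfn) /* getauxval() returns 0 for an absent key */
        printf("executed as: %s\n", (const char *) execfn);
    return 0;
}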
Of course, the obvious question by now is: what sort of information is placed in the auxiliary vector, and who needs that information? The primary customer of the auxiliary vector is the dynamic linker (ld-linux.so). In the usual scheme of things, the kernel's ELF binary loader constructs a process image by loading an executable into the process's memory, and likewise loading the dynamic linker into memory. At this point, the dynamic linker is ready to take over the task of loading any shared libraries that the program may need in preparation for handing control to the program itself. However, it lacks some pieces of information that are essential for these tasks: the location of the program inside the virtual address space, and the starting address at which execution of the program should commence.
In theory, the kernel could provide a system call that the dynamic linker could use in order to obtain the required information. However, this would be an inefficient way of doing things: the kernel's program loader already has the information (because it has scanned the ELF binary and built the process image) and knows that the dynamic linker will need it. Rather than maintaining a record of this information until the dynamic linker requests it, the kernel can simply make it available in the process image at some location known to the dynamic linker. That location is, of course, the auxiliary vector.
It turns out that there's a range of other information that the kernel's program loader already has and which it knows the dynamic linker will need. By placing all of this information in the auxiliary vector, the kernel either saves the programming overhead of making this information available in some other way (e.g., by implementing a dedicated system call), or saves the dynamic linker the cost of making a system call, or both. Among the values placed in the auxiliary vector and available via getauxval() are the following:
- AT_PHDR and AT_ENTRY: The values for these keys
are the address of the ELF program headers of the executable and the entry
address of the executable. The dynamic linker uses this information to
perform linking and pass control to the executable.
- AT_SECURE: The kernel assigns a nonzero value to this key
if this executable should be treated securely. This setting may be
triggered by a Linux Security Module, but the common reason is that the
kernel recognizes that the process is executing a set-user-ID or
set-group-ID program. In this case, the dynamic linker disables the use of
certain environment variables (as described in the ld-linux.so(8)
manual page) and the C library changes other aspects of its behavior.
- AT_UID, AT_EUID, AT_GID, and AT_EGID: These are the real and effective user and group IDs of the process. Making these values available in the vector saves the dynamic linker the cost of making system calls to determine the values. If the AT_SECURE value is not available, the dynamic linker uses these values to make a decision about whether to handle the executable securely (a sketch of this logic follows the list).
- AT_PAGESZ: The value is the system page size. The
dynamic linker needs this information during the linking phase, and the C
library uses it in the implementation of the malloc family of
functions.
- AT_PLATFORM: The value is a pointer to a string
identifying the hardware platform on which the program is running. In some
circumstances, the dynamic linker uses this value in the interpretation of
rpath values. (The ld-linux.so(8) man page describes
rpath values.)
- AT_SYSINFO_EHDR: The value is a pointer to the page
containing the Virtual Dynamic Shared Object (VDSO) that the kernel creates
in order to provide fast implementations of certain system calls. (Some
documentation on the VDSO can be found in the kernel source file
Documentation/ABI/stable/vdso.)
- AT_HWCAP: The value is a pointer to a multibyte mask of
bits whose settings indicate detailed processor capabilities. This
information can be used to provide optimized behavior for certain library
functions. The contents of the bit mask are hardware dependent (for
example, see the kernel source file
arch/x86/include/asm/cpufeature.h for details relating to the
Intel x86 architecture).
- AT_RANDOM: The value is a pointer to sixteen random bytes provided by the kernel. The dynamic linker uses this to implement a stack canary.
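As a small illustration of the AT_SECURE logic mentioned above, a library might do something along these lines; this is a hedged sketch of the decision, not the dynamic linker's actual code:

#include <sys/auxv.h>
#include <sys/types.h>
#include <unistd.h>

/* Treat execution as "secure" if the kernel says so via AT_SECURE;
 * fall back to comparing real and effective IDs otherwise. */
static int secure_execution(void)
{
    if (getauxval(AT_SECURE))
        return 1;
    return getuid() != geteuid() || getgid() != getegid();
}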
The precise reasons why the GNU C library developers have chosen to add the getauxval() function now are a little unclear. The commit message and NEWS file entry for the change were merely brief explanations of what the change was, rather than why it was made. The only clue provided by the implementer on the libc-alpha mailing list suggested that doing so was useful to allow for "future enhancements to the AT_ values, especially target-specific ones". That comment, plus the observation that the glibc developers tend to be rather conservative about adding new interfaces to the ABI, suggests that they have some interesting new user-space uses of the auxiliary vector in mind.
Brief items
Quote of the week
The KDE Manifesto
The KDE Manifesto has been released. "The KDE Manifesto is not intended to change the organization or the way it works. Its aim is only to describe how the KDE Community sees itself. What binds us together are certain values and their practical implications, without regard for who a person is or what background and skills they bring. It is a living document, so it will change over time as KDE continues to grow and mature. We are sharing the Manifesto to help people understand what KDE is all about, what we want to accomplish and why we do what we do."
Firefox 16
Mozilla has released Firefox 16. See the details in the release notes. Firefox 16.0 is also available for Android. Here are the Android version release notes.
HTTPS Everywhere 3.0
The Electronic Frontier Foundation (EFF) has released version 3.0 of HTTPS Everywhere. HTTPS Everywhere 3.0 adds encryption protection to 1,500 more websites, twice as many as previous stable releases. "Our current estimate is that HTTPS Everywhere 3 should encrypt at least a hundred billion page views in the next year, and trillions of individual HTTP requests."
systemtap release 2.0
Version 2.0 of the system diagnostic framework SystemTap has been released. This release adds a simple macro facility to the built-in scripting language, the ability to conditionally vary code based on the user's privilege level, and an experimental backend that allows SystemTap to profile a user's own processes (i.e., without root privileges).
xpra release 0.7.0
Antoine Martin wrote in to alert us to the latest release of xpra, the "screen for X" utility. This release includes a host of new features, including several video compression formats and experimental support for multiple, concurrent clients.
NASA: How would you use a NASA API?
At its open.NASA blog, the US space agency is soliciting input from the public on the data sets and APIs it provides. "As we collect more and more data, figuring out the best way to distribute, use, and reuse the data becomes more and more difficult. API's are one way we can significantly lower the barrier of entry to people from outside NASA being able to manipulate and access our public information." The current estimate is that NASA collects 15 terabytes of data per day, and future missions may collect far more.
Newsletters and articles
Development newsletters from the last week
- Caml Weekly News (October 9)
- What's cooking in git.git (October 4)
- What's cooking in git.git (October 7)
- Haskell Weekly News (October 3)
- Mozilla Hacks Weekly (October 4)
- OpenStack Community Newsletter (October 5)
- Perl Weekly (October 8)
- PostgreSQL Weekly News (October 8)
- Ruby Weekly (October 4)
Open Hardware Summit open to hybrid models (opensource.com)
Over at opensource.com, Ruth Suehle reports on the Open Hardware Summit, which was recently held in New York. At the summit, the Open Source Hardware Association was officially launched and various ideas about open hardware business strategies were discussed. "Many in the audience were waiting for the afternoon session that included Bre Pettis, co-founder and CEO of MakerBot, creators of a popular open source 3D printer. Earlier in the week, the company announced its latest product, the Replicator 2 3D printer. At the same time, Pettis announced to much controversy, 'For the Replicator 2, we will not share the way the physical machine is designed or our GUI because we don’t think carbon-copy cloning is acceptable and carbon-copy clones undermine our ability to pay people to do development.'"
The open GSM future arrives (The H)
The H creates a standalone mobile telephone network using the sysmoBTS base station. "In previous articles, we've looked at the question of how free are the phones that people use every day, and looked at the theory behind building your own GSM phone network using open source software. Now, in this article we take a look at the sysmoBTS, a small form-factor GSM Base Transceiver Station (BTS) built around these principles and the steps required to configure it to provide a standalone mobile telephone network that is useful for research, development and testing purposes."
Migurski: Openstreetmap in postgres
For anybody wanting to work with Openstreetmap data using PostgreSQL, here's a collection of useful tools and techniques. "At first glance, OSM data and Postgres (specifically PostGIS) seem like a natural, easy fit for one another: OSM is vector data, PostGIS stores vector data. OSM has usernames and dates-modified, PostGIS has columns for storing those things in tables. OSM is a worldwide dataset, PostGIS has fast spatial indexes to get to the part you want. When you get to OSM’s free-form tags, though, the row/column model of Postgres stops making sense and you start to reach for linking tables or advanced features like hstore..."
Weinberg: Open Source Hardware and the Law
At the Public Knowledge blog, Michael Weinberg addresses the differing legal underpinnings of open source hardware and open source software. "This combination – copyright that does not protect function, trademark that needs to be applied for and does not protect function, and patents that need to be applied for and can protect functions – means that most hardware projects are 'open' by default because their core functionality is not protected by any sort of intellectual property right. Of course, in this case 'open' means that their key functionality can be copied without legal repercussion, not that the schematics have been posted online or that it is easy to discover how they work (critical elements of open source hardware)." The article is an extension of Weinberg's recent talk at the Open Hardware Summit, and poses questions that are interesting in light of MakerBot's announcement that its latest 3D printer would not be open.
Page editor: Nathan Willis
Announcements
Brief items
FSF: LulzBot AO-100 3D printer now FSF-certified
The Free Software Foundation has awarded its first "Respects Your Freedom" (RYF) certification to the LulzBot AO-100 3D printer sold by Aleph Objects, Inc. "The RYF certification mark means that the product meets the FSF's standards in regard to users' freedom, control over the product, and privacy."
Articles of interest
FSFE Newsletter - October 2012
The FSFE (Free Software Foundation Europe) newsletter covers Software Freedom Day activities, software patents, free software in the French public administration, and several other topics.
The Patent, Used as a Sword (New York Times)
Here's a lengthy New York Times article looking at the problems with the US patent system. "In the smartphone industry alone, according to a Stanford University analysis, as much as $20 billion was spent on patent litigation and patent purchases in the last two years — an amount equal to eight Mars rover missions. Last year, for the first time, spending by Apple and Google on patent lawsuits and unusually big-dollar patent purchases exceeded spending on research and development of new products, according to public filings."
MeeGo to return next month with Jolla phone launch (The H)
The H reports on plans for a MeeGo-based phone. "The Finnish startup Jolla Ltd says that it has raised €200 million from a number of, currently unnamed, telecommunications companies and that it will be unveiling a MeeGo-based device next month. The funding consortium is reported to include at least one telecom operator, a chipset maker, and device and component manufacturers." Though MeeGo itself is free software, Jolla evidently plans to keep its "well-patented" user interface layer closed and license it to other companies.
New Books
Practical Vim--New from Pragmatic Bookshelf
Pragmatic Bookshelf has released "Practical Vim" by Drew Neil.
Calls for Presentations
CFP: Cloud Infrastructure, Distributed Storage and High Availability at LCA 2013
The Cloud Infrastructure, Distributed Storage and High Availability mini-conference will take place January 28 as part of linux.conf.au 2013 in Canberra, Australia. The call for papers closes November 4.
Upcoming Events
Events: October 11, 2012 to December 10, 2012
The following event listing is taken from the LWN.net Calendar.
| Date(s) | Event | Location |
|---|---|---|
| October 11–October 12 | Korea Linux Forum 2012 | Seoul, South Korea |
| October 12–October 13 | Open Source Developer's Conference / France | Paris, France |
| October 13–October 14 | Debian BSP in Alcester (Warwickshire, UK) | Alcester, Warwickshire, UK |
| October 13–October 14 | PyCon Ireland 2012 | Dublin, Ireland |
| October 13–October 15 | FUDCon:Paris 2012 | Paris, France |
| October 13 | 2012 Columbus Code Camp | Columbus, OH, USA |
| October 13–October 14 | Debian Bug Squashing Party in Utrecht | Utrecht, Netherlands |
| October 15–October 18 | OpenStack Summit | San Diego, CA, USA |
| October 15–October 18 | Linux Driver Verification Workshop | Amirandes, Heraklion, Crete |
| October 17–October 19 | LibreOffice Conference | Berlin, Germany |
| October 17–October 19 | MonkeySpace | Boston, MA, USA |
| October 18–October 20 | 14th Real Time Linux Workshop | Chapel Hill, NC, USA |
| October 20–October 21 | PyCon Ukraine 2012 | Kyiv, Ukraine |
| October 20–October 21 | Gentoo miniconf | Prague, Czech Republic |
| October 20–October 21 | PyCarolinas 2012 | Chapel Hill, NC, USA |
| October 20–October 23 | openSUSE Conference 2012 | Prague, Czech Republic |
| October 20–October 21 | LinuxDays | Prague, Czech Republic |
| October 22–October 23 | PyCon Finland 2012 | Espoo, Finland |
| October 23–October 25 | Hack.lu | Dommeldange, Luxembourg |
| October 23–October 26 | PostgreSQL Conference Europe | Prague, Czech Republic |
| October 25–October 26 | Droidcon London | London, UK |
| October 26–October 27 | Firebird Conference 2012 | Luxembourg, Luxembourg |
| October 26–October 28 | PyData NYC 2012 | New York City, NY, USA |
| October 27 | Central PA Open Source Conference | Harrisburg, PA, USA |
| October 27–October 28 | Technical Dutch Open Source Event | Eindhoven, Netherlands |
| October 27 | pyArkansas 2012 | Conway, AR, USA |
| October 27 | Linux Day 2012 | Hundreds of cities, Italy |
| October 29–November 3 | PyCon DE 2012 | Leipzig, Germany |
| October 29–November 2 | Linaro Connect | Copenhagen, Denmark |
| October 29–November 1 | Ubuntu Developer Summit - R | Copenhagen, Denmark |
| October 30 | Ubuntu Enterprise Summit | Copenhagen, Denmark |
| November 3–November 4 | OpenFest 2012 | Sofia, Bulgaria |
| November 3–November 4 | MeetBSD California 2012 | Sunnyvale, California, USA |
| November 5–November 7 | Embedded Linux Conference Europe | Barcelona, Spain |
| November 5–November 7 | LinuxCon Europe | Barcelona, Spain |
| November 5–November 9 | Apache OpenOffice Conference-Within-a-Conference | Sinsheim, Germany |
| November 5–November 8 | ApacheCon Europe 2012 | Sinsheim, Germany |
| November 7–November 9 | KVM Forum and oVirt Workshop Europe 2012 | Barcelona, Spain |
| November 7–November 8 | LLVM Developers' Meeting | San Jose, CA, USA |
| November 8 | NLUUG Fall Conference 2012 | ReeHorst in Ede, Netherlands |
| November 9–November 11 | Free Society Conference and Nordic Summit | Göteborg, Sweden |
| November 9–November 11 | Mozilla Festival | London, England |
| November 9–November 11 | Python Conference - Canada | Toronto, ON, Canada |
| November 10–November 16 | SC12 | Salt Lake City, UT, USA |
| November 12–November 16 | 19th Annual Tcl/Tk Conference | Chicago, IL, USA |
| November 12–November 17 | PyCon Argentina 2012 | Buenos Aires, Argentina |
| November 12–November 14 | Qt Developers Days | Berlin, Germany |
| November 16–November 19 | Linux Color Management Hackfest 2012 | Brno, Czech Republic |
| November 16 | PyHPC 2012 | Salt Lake City, UT, USA |
| November 20–November 24 | 8th Brazilian Python Conference | Rio de Janeiro, Brazil |
| November 24–November 25 | Mini Debian Conference in Paris | Paris, France |
| November 24 | London Perl Workshop 2012 | London, UK |
| November 26–November 28 | Computer Art Congress 3 | Paris, France |
| November 29–December 1 | FOSS.IN/2012 | Bangalore, India |
| November 29–November 30 | Lua Workshop 2012 | Reston, VA, USA |
| November 30–December 2 | Open Hard- and Software Workshop 2012 | Garching bei München, Germany |
| November 30–December 2 | CloudStack Collaboration Conference | Las Vegas, NV, USA |
| December 1–December 2 | Konferensi BlankOn #4 | Bogor, Indonesia |
| December 2 | Foswiki Association General Assembly | online and Dublin, Ireland |
| December 5–December 7 | Open Source Developers Conference Sydney 2012 | Sydney, Australia |
| December 5–December 7 | Qt Developers Days 2012 North America | Santa Clara, CA, USA |
| December 5 | 4th UK Manycore Computing Conference | Bristol, UK |
| December 7–December 9 | CISSE 12 | Everywhere, Internet |
| December 9–December 14 | 26th Large Installation System Administration Conference | San Diego, CA, USA |
If your event does not appear here, please tell us about it.
Page editor: Rebecca Sobol
