Weekly Edition for November 21, 2012

Gnash, Lightspark, and Shumway

By Nathan Willis
November 21, 2012

Adobe's proprietary and often annoying Flash format is dying, to be replaced by a bagful of open technologies like HTML5, CSS3, SVG, JavaScript, and royalty-free media codecs. Or so we are told. Of course, we have been told this story often enough over the years that it is difficult to muster genuine excitement at the news. Nevertheless, the most recent combatant to enter the ring is Mozilla's Shumway, which is a distinctly different life form from existing free software projects like Lightspark and Gnash. Rather than implement Flash support in a browser plugin, Shumway translates .swf content into standard HTML and JavaScript, to be handled by the browser's main rendering engine.

The sparking and gnashing of teeth

Gnash and Lightspark are both reverse-engineered implementations of a Flash runtime (and, naturally, come with an accompanying Netscape Plugin API (NP-API) browser plugin), but they cover different parts of the specification. Gnash, the older of the two projects, implements versions 1 and 2 of Flash's ActionScript language, and the corresponding first generation of the ActionScript Virtual Machine (AVM1). This provides solid coverage for .swf files up through Flash 7, and partial support of Flash 8 and Flash 9 (including a significant chunk of .flv video found in the wild). Lightspark implements ActionScript 3 and the AVM2 virtual machine, which equates to support for Flash 9 and newer. Lightspark does have the ability to fall back on Gnash for AVM1 content, though, which enables users to install both and enjoy reasonably broad coverage without having to know the version information of Flash content in advance.

As is typical of reverse engineering efforts, however, neither project can claim full compatibility with the proprietary product. In practice, development tends to focus on specific use-cases and popular sites. Gnash, for example, was founded with the goal of supporting Flash-based educational games, and previous releases have focused on fixing support for popular video sites like YouTube. Lightspark maintains a wiki page detailing the status of support for common Flash-driven web sites. But the sheer variety of Flash content makes it virtually impossible to implement the full specification and offer any meaningful guarantee that the plugins will render the content without significant errors.

But an even bigger problem remains one of time and funding. Gnash in particular has struggled to raise the funds necessary for lead developer Rob Savoye to devote significant time to the code. Gnash has been a Free Software Foundation (FSF) high-priority project for years, and Savoye was the 2010 recipient of the FSF's Award for the Advancement of Free Software, but fundraising drives have nevertheless garnered low returns — low enough that as recently as March 2012, Savoye reported that the hosting bills for the site were barely covered. The last major release was version 0.8.10 in February 2012, which included OpenVG-based vector rendering along with touchscreen support. A student named Joshua Beck proposed a 2012 Google Summer of Code (GSoC) project to add OpenGL ES 2.0 support under Savoye's mentorship, but it was not accepted. Traffic on the mailing lists has slowed to a trickle, though there are still commits from Savoye and a devoted cadre of others.

Lightspark has made more frequent releases in recent years, including two milestone releases in 2012. The first, in June, introduced support for Adobe AIR applications and the BBC web site's video player. Version 0.7.0 in October added support for LZMA-compressed Flash content and experimental support for runtime bytecode optimization.

Both projects regularly make incremental additions to their suites of supported Flash opcodes and ActionScript functions, but neither has much in the way of headline-grabbing features in new releases. This is a bigger problem for Gnash, which does not have Adobe's newer enhancements to Flash to worry about; that is probably a key reason Gnash has had a hard time attracting donations. Lightspark can still tackle a host of new features with each update of Adobe Flash.

Of course, both projects' real competition has come from the easy availability of a freely-downloadable official browser plugin for Linux, but Adobe announced in February 2012 that Flash 11.2 would be the last release available as an NP-API plugin for Linux. Subsequent Linux releases would only be made as the built-in Flash plugin in Google's Chrome. The move has seemingly not motivated Flash-using Linux fans to cough up support for Gnash and Lightspark — but perhaps the next major update to Flash will.

I did it Shumway

Mozilla developer Jet Villegas wrote a blog post introducing Shumway on November 12, but the code has been available for several months. Shumway is described as an "experimental web-native runtime implementation" of the .swf format. Shumway essentially pushes handling of the formerly-Flash content to the browser's rendering engine and JavaScript interpreter. This protects against misbehaving plugins that eat up too many resources or simply crash. Shumway is available as a Firefox extension [XPI], though it is only expected to work on the most recent Firefox beta builds.

The recent Firefox build is required because Shumway parses Flash content and translates it into HTML5 elements, such as <canvas> and <video> elements, WebGL contexts, and good-old-fashioned embedded images. Shumway translates ActionScript into JavaScript to handle interactivity. Both AVM1 and AVM2 are supported, as are ActionScript versions 1, 2, and 3. The extension supports the use of <object> and <embed> tags to incorporate Flash into the page. As for multimedia codecs, Shumway can automatically take advantage of whatever codecs are available on the system.
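Whatever approach a runtime takes, plugin or translator, it has to start by reading the .swf header. As a minimal sketch, assuming the header layout from Adobe's published SWF file format specification (a three-byte signature, a version byte, and a little-endian 32-bit uncompressed file length), the Python below identifies a file's compression scheme and Flash version. The function names are hypothetical and are not taken from Shumway, Gnash, or Lightspark; the "ZWS" signature is the LZMA-compressed variant that Lightspark 0.7.0 learned to handle.

```python
import struct
import zlib

# Signature bytes -> compression scheme, per the SWF file format specification.
COMPRESSION = {b"FWS": "none", b"CWS": "zlib", b"ZWS": "lzma"}


def parse_swf_header(data: bytes) -> dict:
    """Parse the fixed 8-byte header that starts every .swf file."""
    if len(data) < 8:
        raise ValueError("too short to be a SWF file")
    signature = data[:3]
    if signature not in COMPRESSION:
        raise ValueError(f"unknown signature {signature!r}")
    version = data[3]                                 # Flash version, e.g. 7, 9, 10
    (file_length,) = struct.unpack("<I", data[4:8])  # uncompressed size, little-endian
    return {
        "compression": COMPRESSION[signature],
        "version": version,
        "file_length": file_length,
    }


def uncompressed_body(data: bytes) -> bytes:
    """Return the bytes after the 8-byte header, inflating zlib ("CWS") files."""
    header = parse_swf_header(data)
    if header["compression"] == "none":
        return data[8:]
    if header["compression"] == "zlib":
        return zlib.decompress(data[8:])
    # "ZWS" files use a nonstandard LZMA header layout; not handled in this sketch.
    raise NotImplementedError("LZMA-compressed SWF")
```

A translator such as Shumway would go on to walk the tag stream in that body, turning shape and bitmap tags into canvas or WebGL drawing and ActionScript bytecode into JavaScript; the header check is only the common first step.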

At the moment there is not a definitive compatibility list, so Shumway's support for any particular Flash file is a gamble. Villegas did say in a comment that the project is targeting Flash 10 and below, which he said accounts for "the vast majority of existing content."

The idea of translating Flash content into HTML5 is not original to Shumway, but its predecessors have been focused on Flash-based advertising. Google offers a web service called Swiffy that translates uploaded Flash files into JSON objects, targeted at advertisers wanting to deploy animated ads. Smokescreen was a JavaScript player designed to render Flash ads on iOS devices.

Slaying the Flash gorgon

Mozilla's goal with Shumway is to remove Flash from the equation altogether, replacing it with "open web" technologies. By demonstrating that HTML5 content is capable of reproducing anything that can be done in Flash, the thinking goes, the browser-maker can encourage more content creators to drop Flash from their workflows. One might think it fair to ask whether supporting Flash in any sense genuinely "promotes" the use of Flash alternatives. After all, in December 2010, Mozilla's Chris Blizzard told Joe Brockmeier that the organization was not interested in funding Flash support, open source or otherwise:

Our strategy is to invest in the web. While Flash is used on the web, it lacks an open process for development with open specifications and multiple competing implementations. Creating an open source version of Flash wouldn't change the fact that Flash's fate is determined by a single entity.

Blizzard's comment was in response to a question about supporting Gnash and Lightspark development. Sobhan Mohammadpour asked the same thing on the Shumway blog post, to which Villegas replied:

Processing SWF content in C/C++ exposes the same security & device compatibility problems as the Adobe Flash Player. It also doesn’t help advance the Open Web stack (eg. faster javascript and canvas rendering) with the research work.

Such a distinction might seem like splitting hairs to some. In particular, Villegas suggests that Gnash and Lightspark are a greater security risk than an .xpi browser extension. The Gnash team might take offense at that, especially considering the work the project has done to enforce a solid testing policy. But it is certainly true that massaging Flash content into generic web content has the potential to bring .swf and .flv support to a broader range of platforms. Both Gnash and Lightspark are developed primarily for Linux, with only intermittent working builds for Windows. On the other hand, Gnash and Lightspark also offer stand-alone, offline Flash players, which can be a resource-friendly way to work with Flash games and applications.

History also teaches us that it would be unwise to embrace Shumway too tightly, writing off Gnash and Lightspark as also-rans, for the simple reason that Shumway is still an experimental Mozilla project. Sure, some Mozilla experiments (such as Firefox Sync) move on to be fully integrated features in future browsers — but far more are put out to pasture and forgotten, with nary an explanation. Firefox Home, Chromatabs, Mozilla Raindrop — the list goes on and on.

It is also not clear exactly what to make of Villegas's statement about Flash 10 being the newest supported version. If that is a long-term limitation, then Shumway may be of finite usefulness. True, Flash may die out completely before there is ever a Flash 12, and Flash 11 may never account for a significant percentage of the web's .swf files. In that case, users everywhere will enjoy a blissful HTML5-driven future with plugin crashes a forgotten woe, and free unicorns as far as the eye can see. But where have we heard that one before?


Relicensing VLC from GPL to LGPL

By Nathan Willis
November 21, 2012

Selecting a software license can be tricky, considering all of the effects that such a choice has for the future: library compatibility, distribution, even membership in larger projects. But agreeing on a license at the beginning is child's play compared to trying to alter that decision later on. Case in point: the VLC media player has recently been relicensed under LGPLv2.1+, an undertaking that required project lead Jean-Baptiste Kempf to track down more than 230 individual developers for their personal authorization for the move.

VLC had been licensed under GPLv2+ since 2001; the development team decided to undertake the relicensing task for a number of reasons, including making VLC compatible with various gadget-vendor application stores (e.g., Apple's). Making the core engine available under LGPL terms would make it a more attractive target for independent developers seeking to write non-GPL applications, the argument goes, which benefits the project through added exposure, and may even attract additional contributor talent.

The license migration was approved by team vote in September 2011. The first big milestone was cleared the following November, a relicensing of libVLC and libVLCcore (which implement the external API and internal plugin layer, respectively), plus the auxiliary libraries libdvbpsi, libaacs, and libbluray. Kempf described the process involved on his blog. Because VLC contributors retain the authors' rights to their contributions, no matter how small, Kempf needed to locate and obtain permission from all of the roughly 150 developers who had written even a minor patch during the project's long history.

To do so, he harvested names and email addresses from the project's git repository and logs, and undertook a lengthy process of sifting through the records (both to weed out false matches, and to identify contributors who were credited in unofficial spots like in-line comments). With the list in hand, Kempf set out to contact each of the contributors to approve the licensing change. He was ultimately successful, and the change was committed. The commit notes that more than 99% of the developers consented to the change, and those agreeing account for 99.99% of the code, which he said is sufficient from a legal standpoint.
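The mechanical part of that search can be sketched in a few lines. This is a minimal illustration, not Kempf's actual tooling (which the source does not describe): it assumes author lines in the shape printed by `git log --format='%aN <%aE>'`, merges identities by lower-cased email address, and shows why a share of developers and a share of code can differ so sharply. All names and addresses are hypothetical.

```python
import re
from collections import Counter

# Matches one "Name <email>" line, the shape printed by
# git log --format='%aN <%aE>'
AUTHOR_RE = re.compile(r"^\s*(?P<name>.*?)\s*<(?P<email>[^<>\s]+)>\s*$")


def harvest_authors(log_lines):
    """Map each lower-cased email to (first name seen, commit count)."""
    counts = Counter()
    names = {}
    for line in log_lines:
        match = AUTHOR_RE.match(line)
        if match is None:
            continue  # skip malformed lines instead of guessing an identity
        email = match.group("email").lower()
        counts[email] += 1
        names.setdefault(email, match.group("name"))
    return {email: (names[email], n) for email, n in counts.items()}


def consent_shares(lines_by_author, consenting):
    """Return (fraction of authors who agreed, fraction of lines they wrote)."""
    agreed = [a for a in lines_by_author if a in consenting]
    author_share = len(agreed) / len(lines_by_author)
    line_share = sum(lines_by_author[a] for a in agreed) / sum(lines_by_author.values())
    return author_share, line_share
```

With made-up numbers, one holdout out of two authors who wrote 10 of 10,000 lines yields a 50% author share but a 99.9% code share. Merging by email is only a first pass; git's own `.mailmap` mechanism exists for the same purpose, and contributors credited only in comments still need a manual search.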

The modular community

But, as Kempf described in a follow-up post, the same method was less successful when he set out in 2012 to relicense the next major chunk of VLC code: the major playback modules. Together, they constitute a much larger codebase, with considerably more contributors (including some who are not necessarily committed VLC team members). After emailing the module authors, he said, he received responses from only 25% of them. Two rounds of follow-up emails edged the number up closer to 50%, but for the remainder he resorted to "finding and stalking" the holdouts through other means. Those means included IRC, GitHub, social networks, mutual friends, employers, and even whois data on domain names.

In the end, he managed to get approval from the overwhelming majority of the contributors, but there were some "no"s as well, plus a handful of individuals who never replied at all. At that point, he had to examine the unresolved contributions themselves and decide whether to delete them, reimplement them, refactor them into separate files, or drop the offending modules altogether. He made the license-changing commit on November 6, and listed about 25 modules that were not included. They include the work of 13 developers who either declined to give their approval or were unreachable, plus a set of modules that were ports from other projects (such as Xine or MPlayer) and thus not in the VLC team's purview.

By all accounts, the legwork required to hunt down and cajole more than 230 developers was arduous: in the second blog post, Kempf noted that it could get "really annoying" to contact people "over, over, over and over, and over" to ask for an answer. That is probably an understatement; in an email, Kempf said that at the outset no one thought it would even be doable.

He also elaborated on what comes next. Not every VLC module was targeted for the relicensing work of the previous year, he said. Out of the roughly 400 modules being developed, about 100 remain non-LGPL. First, for those who rarely venture beyond VLC's desktop media player functionality, it can be easy to forget all of the other functions it provides; those modules will remain under their existing licenses. In particular, VLC's media server, converter, and proxy functionality will remain in GPL modules. Other modules, including scripting and visualization, will remain GPL-licensed at least for the time being, because they do not impede the ability of third-party developers to write non-GPL playback applications, which was the leading use-case motivating the change. VLC's user interface and control modules will also remain GPL-licensed, in order to discourage non-free forks.

Kempf also pointed out that the VideoLAN non-profit organization holds the trademarks to VLC, VideoLAN, and other names, and restricts their usage to open source code. That reflects the project's concern that the move away from the GPL could be misinterpreted as a move away from free-ness (in multiple senses of the word); in addition to the trademark policy, both of the announcements about the relicensing project have emphasized that despite the change, VLC will remain free software.


But despite the consensus reached by the majority of core and module developers, there is still the problem of those twenty-odd playback modules that, for one reason or another, are not being relicensed. Kempf explained that the main VLC application will still be able to use all of the non-LGPL modules, and that only third-party libVLC applications will encounter any difficulties with license compatibility.

Authors of such applications may write their own modules for the missing functionality, or simply migrate to another module — given the modular nature of VLC, there are several modules out there that duplicate functionality implemented elsewhere. "The results might be slightly different, but I doubt many people will notice. There are a few exceptions, (probably 2 or 3) that will get rewritten, at some point, I think."

There are two modules Kempf predicted will never be reimplemented in LGPL code — DVD playback and Teletext support — because they rely on other GPL-licensed packages without viable non-GPL alternatives. He still holds out hope for tracking down a few of the still-unreached contributors, of course — only the authors of the iOS, Dolby, Headphone, and Mono modules outright declined to relicense their work.

It is not possible to predict exactly what effect the LGPL-relicensing work will have on third-party developers targeting iOS or other "app store" markets, thanks to the often opaque processes governing which content gets in and which gets rejected. VLC was yanked from the iOS App Store in January 2011, a decision believed to be due to the GPL license, but because Apple does not provide details about its decisions, the situation remains nebulous.

Nevertheless, hunting down several hundred individual developers from more than a decade of development is an impressive feat of, shall we say, logistical engineering. Relicensing a community project is rarely a simple task; one is reminded of the multi-year process required to relicense the Heyu home automation engine, which involved tracking down the estates of developers no longer with us. Many large software projects have contemplated a license change at one time or another, and typically the scope of tracking down and persuading all of the former developers is cited as a reason that such a change is unworkable. The kernel is the obvious example, and VLC's contributor pool is far smaller than the kernel's, to be sure. But the fact that Kempf was able to successfully chase down virtually the full set of both uncooperative and unintentionally-AWOL contributors in such a short time frame is an admirable achievement. Then again, the VLC team has long enjoyed a reputation for admirable achievements.


Android 4.2, tablets, and related thoughts

By Jonathan Corbet
November 20, 2012
The kind folks at Google decided that your editor was in need of a present for the holidays; soon thereafter, a box containing a Nexus 7 tablet showed up on the doorstep. One might think that the resulting joy might be somewhat mitigated by the fact that your editor has been in possession of an N7 tablet since last July, and one might be right. But the truth of the matter is that the gift was well timed, and not just because it's nice to be able to install ill-advised software distributions on a tablet without depriving oneself of a useful device.

It was not that long ago that a leading-edge tablet device was a fairly big deal. Family members would ask where the tablet was; the house clearly wouldn't contain more than one of them. What followed, inevitably, was an argument over who got to use the household tablet. But tablets are quickly becoming both more powerful and less expensive — a pattern that a few of us have seen in this industry before. We are quickly heading toward a world where tablet devices litter the house like notepads, cheap pens, or the teenager's dirty socks. Tablets are not really special anymore.

They are, however, increasingly useful. Your editor recently purchased a stereo component that locates his music on the network (served by Samba), plays said music through the sound system with a fidelity far exceeding that available from portable music players, and relies on an application running on a handy Android (or iOS) device for its user interface. Every handset and tablet in the house, suddenly, is part of the music system; this has led to a rediscovery of your editor's music collection — a development not universally welcomed by your editor's offspring. Other household devices, thermostats for example, are following the same path. There is no need to attach big control surfaces to household gadgets; those surfaces already exist on kitchen counters and in the residents' pockets.

So the addition of a tablet into a household already containing a few of them is not an unwelcome event; it nicely replaces the one that will eventually be found underneath the couch.

What's new in Android 4.2

About the time this tablet showed up, the Android 4.2 release came out as an over-the-air update. Some of the features to be found there would seem to have been developed with the ubiquitous tablet very much in mind. At the top of the list, arguably, is the new multiuser support. A new "users" settings screen allows the addition of new users to the device; each user gets their own settings, apps, lock screen, etc. Switching between users is just a matter of selecting one from the lock screen.

Android users are still not as strongly isolated as on a classic Linux system. Apps are shared between them so that, for example, if one user accepts an app update that adds permissions, it takes effect for everybody. The initial user has a sort of mild superuser access; no other users can add or delete users, for example, and the "factory reset" option is only available to the initial account. There doesn't seem to be a way to parcel out privileges to other accounts. The feature works well enough for a common use case: a tablet that floats around the house and is used by multiple family members. Perhaps someday the face unlock feature will recognize the user of the tablet and automatically switch to the correct account.

A feature that is not yet present is the ability to clone one tablet onto another. As we head toward the day when new tablets will arrive as prizes in cereal boxes, we will lose our patience with the quaint process of configuring the new tablet to work like the others do. Google has made significant progress in this area; a lot of useful stuff just appears on a new tablet once the connection to the Google account has been made. But there is still work to do; the process of setting up the K9 mail client is particularly tiresome, for example. And, naturally, storing even more information on the Google mothership is not without its concerns. Wouldn't it be nice to just put the new tablet next to an existing one and say "be like that other one"? The transfer could be effected with no central data storage at all, and life would be much easier.

Much of the infrastructure for this kind of feature appears to already be in place. The near-field communications (NFC) mechanism can be used to "beam" photos, videos, and more between two devices just by touching them together. The "wireless display" feature can be used to transmit screen contents to a nearby television. It should not be hard to do a full backup/restore to another device. Someday. Meanwhile, the "beaming" feature is handy to move photos around without going through the tiresome process of sending them through email.

Another significant new feature is the "swipe" gesture typing facility, whereby one spells words by dragging a finger across the keyboard from one letter to the next. Gesture typing has been available via add-on apps for a while, but now it's a part of the Android core. Using it feels a little silly at the outset; it is like a return to finger painting in elementary-school art class. For added fun, it will attempt to guess which word is coming next, allowing the typing process to be skipped entirely — as long as the guesses turn out to be accurate. In your editor's experience, gesture typing is no faster than tap-typing; if anything, it is a little slower. But the resulting text does seem to be less error-prone; whoever wrote the code doing the gesture recognition did a good job.

One interesting change is that the notification bar at the top has been split into two. The downward-swipe gesture on the left side gives the usual list of notifications — though many of them have been enhanced with actions selectable directly from the notification. On the right side, instead, one gets various settings options. The new scheme takes a while to get used to; it also seems like it takes a more determined effort to get the selected screen to actually stay down rather than teasing the user and popping right back up.

Various other new features exist. The "photo sphere camera" is evidently an extension of the panorama feature found in earlier releases; alas, it refuses to work on the N7's (poor) front-facing camera, so your editor was unable to test it out. The camera also now evidently has high dynamic range (HDR) processing functionality. On the Nexus 10 tablet, the "Renderscript" mechanism can use the GPU for computational tasks; no other device has the requisite hardware support at the moment. There is a screen magnification feature that can be used to zoom any screen regardless of whether the running app was written with that in mind. And so on.

One other change in the 4.2 release is the replacement of the BlueZ-based Bluetooth stack with a totally new stack (called "Bluedroid") from Broadcom. This stack, according to the release notes, "provides improved compatibility and reliability." A message on the android-platform list gives some additional reasons for the change, including the ability to run the Bluetooth stack in a separate process, elimination of the D-Bus dependency, and more. The licensing of the new "Bluedroid" stack has raised some questions of its own that have not been clarified as of this writing.

Bluetooth stack questions aside, the obvious conclusion is that the Android platform continues to advance quickly. Each release improves the experience, adds features, and generally cements Android's position as the Linux-based platform for mobile devices. Your editor would still like to see an alternative platform, preferably one that is closer to traditional Linux, but that seems increasingly unlikely as the spread of Android continues unabated and unchallenged. The good news is that Android continues to be (mostly) free software and it continues to improve. This stage of the evolution of the computing industry could easily have taken a highly proprietary turn; thanks to Android, the worst of that has been avoided.

(Thanks to Karim Yaghmour for pointers to the Bluedroid discussion).


Page editor: Jonathan Corbet


A rootkit dissected

By Jake Edge
November 21, 2012

A recently discovered Linux rootkit has a number of interesting attributes that make it worth a look. While it demonstrates the power that a rootkit has (to perform its "job" as well as hide itself from detection), this particular rootkit also has some fairly serious bugs, some of which are almost comical. What isn't known, at least yet, is how the system where it was discovered became infected; the rootkit contains no exploit to propagate itself.

The rootkit was reported to the full-disclosure mailing list on November 13 by "stack trace". Customers had noticed that they were being redirected to malicious sites by means of an <iframe> in the HTTP responses from Stack Trace's site. Stack Trace eventually found that the Nginx web server on the system was not delivering the <iframe> and tracked it to a loadable kernel module, which was attached to the posting. Since then, both CrowdStrike and Kaspersky Lab's Threatpost have analyzed the module's behavior.

The first step for a rootkit is to get itself running in the kernel. That can be accomplished by means of a loadable kernel module. In this case, the module is targeted at the most recent 64-bit kernel version used by Debian 6.0 ("Squeeze").

The presence of the module file on disk would indicate infection, though a look at the process list is required to determine if the rootkit is actually loaded. Once loaded, the module has a number of different tasks to perform that are described below. The CrowdStrike post has even more detail for those interested.

The rootkit targets HTTP traffic, so that it can inject an <iframe> containing an attack: either a malicious URL or some kind of JavaScript-based attack. In order to do that in a relatively undetectable way, it must insert itself into the kernel's TCP send path. It does so by hooking tcp_sendmsg().

Of course, that function and the other symbols that the rootkit wants to access are not exported, so they are not directly accessible to a kernel module. So the rootkit uses /proc/kallsyms to get the addresses it needs. Amusingly, there is code to fall back to looking for the proper file to parse for the addresses, but it is never used due to a bug. Even though the kernel version is hardcoded in several places in the rootkit, the helper function actually uses uname -r to get the version. The inability to fall back to that file, along with this version-getting oddity, makes it seem like multiple people (with little or no inter-communication) worked on the code. Other odd bugs in the rootkit only add to that feeling.
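/proc/kallsyms lists one symbol per line: an address, a one-letter symbol type, the name, and, for symbols that live in modules, a bracketed module name. A minimal Python sketch of looking a symbol up in text of that format follows; the function names and the sample addresses are made up, and on a real system unprivileged readers may see all-zero addresses, depending on the kptr_restrict sysctl.

```python
def parse_kallsyms(text):
    """Map symbol name -> (address, type letter) from kallsyms-formatted text."""
    symbols = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not a symbol line
        addr, sym_type, name = fields[0], fields[1], fields[2]
        # Module symbols carry a trailing "[module]" field, ignored here.
        # If a name appears more than once, the first definition wins.
        symbols.setdefault(name, (int(addr, 16), sym_type))
    return symbols


def lookup(symbols, name):
    """Return the address of `name`, or None if absent or hidden (zeroed)."""
    entry = symbols.get(name)
    if entry is None or entry[0] == 0:
        return None
    return entry[0]
```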

For example, when hooking various functions, the rootkit carefully saves away the five bytes it needs to overwrite with a jmp instruction, but then proceeds to write 19 bytes at the start of the function. That obliterates 14 bytes of code, which eliminates any possibility of unhooking the function. Beyond that, it can't call the unhooked version of the function either, so the rootkit contains private copies of all the functions it hooks.
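The five-byte figure comes from the x86 relative jump encoding: a single 0xE9 opcode byte followed by a signed 32-bit displacement, measured from the end of the jump instruction. The sketch below shows only that arithmetic, which is the same encoding compilers emit for ordinary near jumps, plus the count of bytes lost when more is written than was saved; the helper names are hypothetical and nothing here is taken from the rootkit.

```python
JMP_REL32 = 0xE9  # x86 opcode for "jmp rel32"
JMP_LEN = 5       # one opcode byte plus a 4-byte displacement


def encode_jmp(source: int, target: int) -> bytes:
    """Encode a 5-byte relative jmp located at `source` that lands on `target`."""
    # The displacement is relative to the address of the *next* instruction.
    disp = target - (source + JMP_LEN)
    if not -(2**31) <= disp < 2**31:
        raise ValueError("target out of rel32 range")
    return bytes([JMP_REL32]) + disp.to_bytes(4, "little", signed=True)


def bytes_clobbered(saved: int, written: int) -> int:
    """Original instruction bytes lost when more is written than was saved."""
    return max(0, written - saved)
```

Saving 5 bytes and then writing 19 leaves `bytes_clobbered(5, 19)`, or 14 bytes, of the original function with no backup, which is why unhooking becomes impossible.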

Beyond hooking tcp_sendmsg(), the rootkit also attempts to hide its presence. There is code to hide the files that it installs, as well as its threads. The file hiding works well enough by hooking vfs_readdir() and using a list of directories and files that should not be returned. Fortunately (or unfortunately, depending on one's perspective), the thread hiding doesn't work at all. It uses the same file-hiding code, but neither looks in /proc nor converts the names into PIDs, so ps and other tools show the threads. In the original report, Stack Trace noted two threads named get_http_inj_fr and write_startup_c; those names are fairly descriptive given the behavior being seen. The presence of one or both of those names in the process list would mean that the system has the rootkit loaded.
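Because the thread hiding fails, a check for those two names is straightforward. A minimal sketch, assuming the threads show up as top-level numeric entries under /proc (as kernel threads do) with their names in each entry's comm file; both names are 15 characters long, which fits the kernel's 16-byte task-name limit, so they are matched exactly. The function name is hypothetical, and a clean result proves little, since a better-written rootkit would hide its threads properly.

```python
import os

# Thread names reported by Stack Trace in the original full-disclosure post.
SUSPECT_NAMES = {"get_http_inj_fr", "write_startup_c"}


def find_suspect_threads(proc_root="/proc"):
    """Return sorted (pid, name) pairs whose comm matches a known rootkit thread."""
    hits = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are processes or kernel threads
        comm_path = os.path.join(proc_root, entry, "comm")
        try:
            with open(comm_path) as f:
                name = f.read().strip()
        except OSError:
            continue  # the task exited, or the file is unreadable
        if name in SUSPECT_NAMES:
            hits.append((int(entry), name))
    return sorted(hits)
```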

The rootkit does successfully remove itself from the list of loaded modules. It directly iterates down the kernel's module list and deletes the entry for itself. That way lsmod will not list the module, but it also means that it cannot be unloaded, obviating the "careful" preparations in the hooked functions for that eventuality.

As with other malware (botnets in particular), the rootkit has a "command and control" client. That client contacts a particular server (at a hosting service in Germany) for information about what to inject in the web pages. There is some simple, weak encryption used on the link for both authentication and obfuscation of the message.

Beyond just missing a way to propagate to other systems, the rootkit is also rather likely to fail to persist after a reboot. It has code to continuously monitor and alter /etc/rc.local to add an insmod for the rootkit module. It also hooks vfs_read() to look for the exact insmod line and adjust the buffer to hide that line from anyone looking at the file. But it just appends the command to rc.local, which means that on a default installation of Debian Squeeze it ends up just after an exit 0 line, where it will never be executed.
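Debian's stock /etc/rc.local ends with an exit 0 line, so the shell stops before reaching anything appended after it. A minimal, deliberately naive Python check for that situation follows; it assumes the terminating command sits on a line of its own, and it would be fooled by an exit 0 inside a conditional or function, which only a real shell parser could handle. The function name and the module path in the example are placeholders.

```python
def unreachable_lines(script: str) -> list:
    """Return non-blank lines that follow the first line consisting of 'exit 0'.

    Naive on purpose: an 'exit 0' inside an if-block, function, or heredoc
    would be treated the same as one at the top level.
    """
    lines = script.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "exit 0":
            return [l for l in lines[i + 1:] if l.strip()]
    return []
```

Run against a default rc.local with an appended insmod line, the sketch reports that line as dead code, which is exactly why the rootkit's persistence mechanism fails on a stock Squeeze system.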

Like much of the rest of the rootkit, the HTTP injection handling shows an odd mix of reasonably sensible choices along with some bugs. It looks at the first buffer to be sent to the remote side, verifies that its source port is 80 and that it is not being sent to the loopback address. It also compares the destination IP address with a list of 1708 search engine IP addresses, and does no further processing if it is on the list.
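Those three checks amount to a simple predicate. The sketch below restates them in Python using the standard `ipaddress` module; the function name and the contents of the skip list are hypothetical (the rootkit's actual list of 1708 search-engine addresses is not reproduced). One plausible effect of the checks, not stated by the analyses, is that the injection stays invisible both to an administrator testing from the server itself and to search-engine crawlers.

```python
import ipaddress


def should_process(src_port: int, dst_ip: str, skip_list: set) -> bool:
    """Restate the described filter: port-80 responses, not loopback, not skipped."""
    addr = ipaddress.ip_address(dst_ip)
    if src_port != 80:
        return False  # only plain-HTTP server responses are considered
    if addr.is_loopback:
        return False  # traffic to the local machine is left alone
    if addr in skip_list:
        return False  # stands in for the list of search-engine addresses
    return True
```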

One of the bugs that allowed Stack Trace to diagnose the problem is the handling of status codes. Instead of looking for the 200 HTTP success code, the rootkit looks for three strings on a blacklist that correspond to HTTP failures. That list is not exhaustive, so Stack Trace was able to see the injection in a 400 HTTP error response. Beyond that, the rootkit cleverly handles chunked Transfer-Encodings and gzip Content-Encodings, though the latter does an in-kernel decompress-inject-compress cycle that could lead to noticeable server performance problems.
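The difference between the two approaches to status codes is easy to see side by side. In this sketch the three blacklisted strings are an assumption, since the analyses do not say which ones the rootkit used; the point is only that any status missing from a blacklist, such as 400, slips through, while an explicit check for 200 rejects it. Both function names are hypothetical.

```python
# Hypothetical blacklist: the source does not say which three strings were used.
FAILURE_BLACKLIST = ("404 Not Found", "403 Forbidden", "500 Internal Server Error")


def blacklist_allows(status_line: str) -> bool:
    """The rootkit's approach: proceed unless a known failure string appears."""
    return not any(s in status_line for s in FAILURE_BLACKLIST)


def whitelist_allows(status_line: str) -> bool:
    """The stricter alternative: proceed only on an explicit 200 status code."""
    parts = status_line.split()
    return len(parts) >= 2 and parts[1] == "200"
```

For the status line `HTTP/1.1 400 Bad Request`, the blacklist check allows injection while the whitelist check does not; that gap is the one that let Stack Trace spot the <iframe>.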

None of the abilities of the rootkit are particularly novel, though it is interesting to see them laid bare like this. As should be obvious, a rootkit can do an awful lot in a Linux system, and has plenty of ways to hide its tracks. While this rootkit hid only some of its tracks, some of that sloppiness may have been introduced after the initial development. The CrowdStrike conclusion is instructive here: "Rather, it seems that this is contract work of an intermediate programmer with no extensive kernel experience, later customized beyond repair by the buyer."

The question of how the rootkit was installed to begin with is still open. Given the overall code quality, CrowdStrike is skeptical that some "custom privilege escalation exploit" was used. That suggests that some known but unpatched vulnerability (perhaps in a web application) or some kind of credential leak (e.g. the root password or an SSH key) was the culprit. Until and unless some mass exploit is used to propagate an upgraded version of the rootkit, it is really only of academic interest (except, of course, to anyone whose system is already infected).

Brief items

Security quotes of the week

So far, in most of the drive-by download scenarios an automated injection mechanism is implemented as a simple PHP script. In the case described above, we are dealing with something far more sophisticated - a kernel-mode binary component that uses advanced hooking techniques to ensure that the injection process is more transparent and low-level than ever before. This rootkit, though it's still in the development stage, shows a new approach to the drive-by download schema and we can certainly expect more such malware in the future.
-- Marta Janus

I do stay awake at night worrying that people are tagging my photo on Facebook, which could allow the New York Police Dept to submit a photo of protesters to Facebook and get a list of names and addresses of the people in the photo. Or it could allow the police to track my movements via existing networks of surveillance cameras by matching my image to my name. Would that require a search warrant? How would that impact my trust in my government to know that my movements are being tracked? Or worse, to know they might be tracked but I'll never know if [they] are or aren't?
-- Jamie McClelland

The ITU is the wrong place to make decisions about the future of the Internet.

Only governments have a voice at the ITU. This includes governments that do not support a free and open Internet. Engineers, companies, and people that build and use the web have no vote.

The ITU is also secretive. The treaty conference and proposals are confidential.

-- Google is concerned about a closed-door International Telecommunication Union (ITU) meeting in December

Attacking hardened Linux systems with kernel JIT spraying

The "main is usually a function" blog has a discussion on the use of "JIT spraying" techniques to attack the kernel, even when features like supervisor-mode execution prevention are turned on. "JIT spraying is a viable tactic when we (the attacker) control the input to a just-in-time compiler. The JIT will write into executable memory on our behalf, and we have some control over what it writes. Of course, a JIT compiling untrusted code will be careful with what instructions it produces. The trick of JIT spraying is that seemingly innocuous instructions can be trouble when looked at another way."

New Linux Rootkit Emerges (Threat Post)

Threat Post reports the discovery of a rootkit that targets 64-bit Linux systems. "The Linux rootkit does not appear to be a modified version of any known piece of malware and it first came to light last week when someone posted a quick description and analysis of it on the Full Disclosure mailing list. That poster said that his site had been targeted by the malware and some of his customers had been redirected to malicious sites."

New vulnerabilities

java-1.5.0-ibm: two vulnerabilities

Package(s): java-1.5.0-ibm
CVE #(s): CVE-2012-4820 CVE-2012-4822
Created: November 16, 2012
Updated: November 23, 2012

From the Red Hat advisory:

CVE-2012-4820 IBM JDK: java.lang.reflect.Method invoke() code execution

CVE-2012-4822 IBM JDK: java.lang.class code execution

Red Hat RHSA-2012:1485-01 java-1.4.2-ibm 2012-11-22
Red Hat RHSA-2012:1466-01 java-1.6.0-ibm 2012-11-15
Red Hat RHSA-2012:1465-01 java-1.5.0-ibm 2012-11-15
Red Hat RHSA-2012:1467-01 java-1.7.0-ibm 2012-11-15

java-1.6.0-ibm: code execution

Package(s): java-1.6.0-ibm
CVE #(s): CVE-2012-4823
Created: November 16, 2012
Updated: November 21, 2012

From the Red Hat advisory:

CVE-2012-4823 IBM JDK: java.lang.ClassLoader defineClass() code execution

Red Hat RHSA-2012:1467-01 java-1.7.0-ibm 2012-11-15
Red Hat RHSA-2012:1466-01 java-1.6.0-ibm 2012-11-15

java-1.7.0-ibm: code execution

Package(s): java-1.7.0-ibm
CVE #(s): CVE-2012-4821
Created: November 16, 2012
Updated: November 21, 2012

From the Red Hat advisory:

CVE-2012-4821 IBM JDK: getDeclaredMethods() and setAccessible() code execution

Red Hat RHSA-2012:1467-01 java-1.7.0-ibm 2012-11-15

kdelibs: multiple vulnerabilities

Package(s): kdelibs
CVE #(s): CVE-2012-4515 CVE-2012-4514
Created: November 16, 2012
Updated: February 18, 2013

From the Fedora advisory:

Bug #865831 - CVE-2012-4515 kdelibs: Use-after-free when context menu being used whilst the document DOM is being changed from within JavaScript

Bug #869681 - CVE-2012-4514 kdelibs (khtml): NULL pointer dereference when trying to reuse a frame with null part

Gentoo 201406-31 konqueror 2014-06-27
Mageia MGASA-2013-0054 kdelibs4 2013-02-16
openSUSE openSUSE-SU-2013:0127-1 kdelibs 2013-01-23
Fedora FEDORA-2012-17385 kdelibs 2012-11-16
openSUSE openSUSE-SU-2012:1581-1 kdelibs4 2012-11-28
Fedora FEDORA-2012-17388 kdelibs 2012-11-16

libtiff: code execution

Package(s): tiff
CVE #(s): CVE-2012-4564
Created: November 15, 2012
Updated: December 31, 2012

From the Ubuntu advisory:

Huzaifa S. Sidhpurwala discovered that the ppm2tiff tool incorrectly handled certain malformed PPM images. If a user or automated system were tricked into opening a specially crafted PPM image, a remote attacker could crash the application, leading to a denial of service, or possibly execute arbitrary code with user privileges. (CVE-2012-4564)

Fedora FEDORA-2014-6831 mingw-libtiff 2014-06-10
Fedora FEDORA-2014-6837 mingw-libtiff 2014-06-10
Gentoo 201402-21 tiff 2014-02-21
Slackware SSA:2013-290-01 libtiff 2013-10-18
Mandriva MDVSA-2013:046 libtiff 2013-04-05
openSUSE openSUSE-SU-2013:0187-1 tiff 2013-01-23
Fedora FEDORA-2012-20404 libtiff 2012-12-31
Fedora FEDORA-2012-20446 libtiff 2012-12-31
Scientific Linux SL-libt-20121219 libtiff 2012-12-19
Oracle ELSA-2012-1590 libtiff 2012-12-19
Oracle ELSA-2012-1590 libtiff 2012-12-18
CentOS CESA-2012:1590 libtiff 2012-12-19
Red Hat RHSA-2012:1590-01 libtiff 2012-12-18
Mandriva MDVSA-2012:174 libtiff 2012-11-22
Ubuntu USN-1631-1 tiff 2012-11-15
Debian DSA-2575-1 tiff 2012-11-18
Mageia MGASA-2012-0332 libtiff 2012-11-17

libunity-webapps: code execution

Package(s): libunity-webapps
CVE #(s): CVE-2012-4551
Created: November 21, 2012
Updated: November 21, 2012
Description: From the Ubuntu advisory:

It was discovered that libunity-webapps improperly handled certain hash tables. A remote attacker could use this issue to cause libunity-webapps to crash, or possibly execute arbitrary code.

Ubuntu USN-1635-1 libunity-webapps 2012-11-21

mozilla: multiple vulnerabilities

Package(s): firefox, thunderbird
CVE #(s): CVE-2012-4201 CVE-2012-4202 CVE-2012-4207 CVE-2012-4209 CVE-2012-4210 CVE-2012-4214 CVE-2012-4215 CVE-2012-4216 CVE-2012-5829 CVE-2012-5830 CVE-2012-5833 CVE-2012-5835 CVE-2012-5839 CVE-2012-5840 CVE-2012-5841 CVE-2012-5842
Created: November 21, 2012
Updated: January 8, 2013
Description: From the Red Hat advisory:

Several flaws were found in the processing of malformed web content. A web page containing malicious content could cause Firefox to crash or, potentially, execute arbitrary code with the privileges of the user running Firefox. (CVE-2012-4214, CVE-2012-4215, CVE-2012-4216, CVE-2012-5829, CVE-2012-5830, CVE-2012-5833, CVE-2012-5835, CVE-2012-5839, CVE-2012-5840, CVE-2012-5842)

A buffer overflow flaw was found in the way Firefox handled GIF (Graphics Interchange Format) images. A web page containing a malicious GIF image could cause Firefox to crash or, possibly, execute arbitrary code with the privileges of the user running Firefox. (CVE-2012-4202)

A flaw was found in the way the Style Inspector tool in Firefox handled certain Cascading Style Sheets (CSS). Running the tool (Tools -> Web Developer -> Inspect) on malicious CSS could result in the execution of HTML and CSS content with chrome privileges. (CVE-2012-4210)

A flaw was found in the way Firefox decoded the HZ-GB-2312 character encoding. A web page containing malicious content could cause Firefox to run JavaScript code with the permissions of a different website. (CVE-2012-4207)

A flaw was found in the location object implementation in Firefox. Malicious content could possibly use this flaw to allow restricted content to be loaded by plug-ins. (CVE-2012-4209)

A flaw was found in the way cross-origin wrappers were implemented. Malicious content could use this flaw to perform cross-site scripting attacks. (CVE-2012-5841)

A flaw was found in the evalInSandbox implementation in Firefox. Malicious content could use this flaw to perform cross-site scripting attacks. (CVE-2012-4201)

openSUSE openSUSE-SU-2014:1100-1 Firefox 2014-09-09
openSUSE openSUSE-SU-2013:0175-1 mozilla 2013-01-23
Gentoo 201301-01 firefox 2013-01-07
Debian DSA-2588-1 icedove 2012-12-16
Debian DSA-2584-1 iceape 2012-12-08
Debian DSA-2583-1 iceweasel 2012-12-08
Mageia MGASA-2012-0353 iceape 2012-12-07
Fedora FEDORA-2012-18683 firefox 2012-11-22
CentOS CESA-2012:1483 thunderbird 2012-11-22
CentOS CESA-2012:1482 firefox 2012-11-22
Scientific Linux SL-thun-20121121 thunderbird 2012-11-21
Oracle ELSA-2012-1483 thunderbird 2012-11-21
Oracle ELSA-2012-1482 firefox 2012-11-21
SUSE SUSE-SU-2012:1592-1 Mozilla Firefox 2012-11-29
openSUSE openSUSE-SU-2012:1586-1 xulrunner 2012-11-28
openSUSE openSUSE-SU-2012:1583-1 firefox 2012-11-28
Mageia MGASA-2012-0343 thunderbird 2012-11-23
Ubuntu USN-1638-2 ubufox 2012-11-21
Ubuntu USN-1636-1 thunderbird 2012-11-21
Slackware SSA:2012-326-03 thunderbird 2012-11-21
Slackware SSA:2012-326-01 seamonkey 2012-11-21
Fedora FEDORA-2012-18683 thunderbird-enigmail 2012-11-22
Fedora FEDORA-2012-18683 thunderbird 2012-11-22
Fedora FEDORA-2012-18931 seamonkey 2012-12-04
openSUSE openSUSE-SU-2012:1585-1 thunderbird 2012-11-28
Mageia MGASA-2012-0342 firefox 2012-11-23
Slackware SSA:2012-326-02 firefox 2012-11-21
Fedora FEDORA-2012-18683 xulrunner 2012-11-22
Fedora FEDORA-2012-18683 thunderbird-lightning 2012-11-22
Mandriva MDVSA-2012:173 firefox 2012-11-21
Red Hat RHSA-2012:1483-01 thunderbird 2012-11-20
Fedora FEDORA-2012-18952 seamonkey 2012-12-04
Ubuntu USN-1638-3 firefox 2012-12-03
openSUSE openSUSE-SU-2012:1584-1 seamonkey 2012-11-28
Ubuntu USN-1638-1 firefox 2012-11-21
Scientific Linux SL-fire-20121121 firefox 2012-11-21
Red Hat RHSA-2012:1482-01 firefox 2012-11-20

mysql: multiple unspecified vulnerabilities

Package(s): mysql
CVE #(s): CVE-2012-0540 CVE-2012-1689 CVE-2012-1734 CVE-2012-2749
Created: November 15, 2012
Updated: October 17, 2013

From the Red Hat advisory:

833737 - CVE-2012-2749 mysql: crash caused by wrong calculation of key length for sort order index

841349 - CVE-2012-0540 mysql: unspecified vulnerability related to GIS extension DoS (CPU Jul 2012)

841351 - CVE-2012-1689 mysql: unspecified vulnerability related to Server Optimizer DoS (CPU Jul 2012)

841353 - CVE-2012-1734 mysql: unspecified vulnerability related to Server Optimizer DoS (CPU Jul 2012)

Gentoo 201308-06 mysql 2013-08-29
Gentoo GLSA 201308-06:02 mysql 2013-08-30
Mandriva MDVSA-2013:008 mysql 2013-02-06
Scientific Linux SL-mysq-20130123 mysql 2013-01-23
Oracle ELSA-2013-0180 mysql 2013-01-22
CentOS CESA-2013:0180 mysql 2013-01-22
Red Hat RHSA-2013:0180-01 mysql 2013-01-22
Scientific Linux SL-mysq-20121115 mysql 2012-11-15
Oracle ELSA-2012-1462 mysql 2012-11-14
CentOS CESA-2012:1462 mysql 2012-11-15
Red Hat RHSA-2012:1462-01 mysql 2012-11-14

phpmyadmin: cross-site scripting

Package(s): phpmyadmin
CVE #(s): CVE-2012-5339 CVE-2012-5368
Created: November 20, 2012
Updated: November 21, 2012
Description: From the CVE entries:

Multiple cross-site scripting (XSS) vulnerabilities in phpMyAdmin 3.5.x before 3.5.3 allow remote authenticated users to inject arbitrary web script or HTML via a crafted name of (1) an event, (2) a procedure, or (3) a trigger. (CVE-2012-5339)

phpMyAdmin 3.5.x before 3.5.3 uses JavaScript code that is obtained through an HTTP session without SSL, which allows man-in-the-middle attackers to conduct cross-site scripting (XSS) attacks by modifying this code. (CVE-2012-5368)

openSUSE openSUSE-SU-2012:1507-1 phpmyadmin 2012-11-20

python-keyring: weak cryptography

Package(s): python-keyring
CVE #(s): CVE-2012-4571
Created: November 21, 2012
Updated: December 4, 2013
Description: From the Ubuntu advisory:

Dwayne Litzenberger discovered that Python Keyring's CryptedFileKeyring file format used weak cryptography. A local attacker may use this issue to brute-force CryptedFileKeyring keyring files.

Fedora FEDORA-2013-22694 python-keyring 2013-12-04
Ubuntu USN-1634-1 python-keyring 2012-11-20

ruby: denial of service

Package(s): ruby
CVE #(s): CVE-2012-5371
Created: November 19, 2012
Updated: December 7, 2012
Description: From the Red Hat bugzilla:

Ruby 1.9.3-p327 was released to correct a hash-flooding DoS vulnerability that only affects 1.9.x and the 2.0.0 preview [1].

As noted in the upstream report:

Carefully crafted sequence of strings can cause a denial of service attack on the service that parses the sequence to create a Hash object by using the strings as keys. For instance, this vulnerability affects web application that parses the JSON data sent from untrusted entity.

This vulnerability is similar to CVE-2011-4815 for ruby 1.8.7. ruby 1.9 versions were using modified MurmurHash function but it's reported that there is a way to create sequence of strings that collide their hash values each other. This fix changes the Hash function of String object from the MurmurHash to SipHash 2-4.

Ruby 1.8.x is not noted as being affected by this flaw.

Debian-LTS DLA-263-1 ruby1.9.1 2015-07-01
Gentoo 201412-27 ruby 2014-12-13
openSUSE openSUSE-SU-2013:0376-1 ruby19 2013-03-01
Red Hat RHSA-2013:0582-01 openshift 2013-02-28
Ubuntu USN-1733-1 ruby1.9.1 2013-02-21
Slackware SSA:2012-341-04 ruby 2012-12-06
Fedora FEDORA-2012-18017 ruby 2012-11-19

typo3-src: multiple vulnerabilities

Package(s): typo3-src
CVE #(s):
Created: November 16, 2012
Updated: November 21, 2012

From the Debian advisory:

Several vulnerabilities were discovered in TYPO3, a content management system. This update addresses cross-site scripting, SQL injection, and information disclosure vulnerabilities and corresponds to TYPO3-CORE-SA-2012-005.

Debian DSA-2574-1 typo3-src 2012-11-15

weechat: code execution

Package(s): weechat
CVE #(s): CVE-2012-5854
Created: November 19, 2012
Updated: November 28, 2012
Description: From the CVE entry:

Heap-based buffer overflow in WeeChat 0.3.6 through 0.3.9 allows remote attackers to cause a denial of service (crash or hang) and possibly execute arbitrary code via crafted IRC colors that are not properly decoded.

Gentoo 201405-03 weechat 2014-05-03
Mandriva MDVSA-2013:136 weechat 2013-04-10
openSUSE openSUSE-SU-2013:0150-1 weechat 2013-01-23
Fedora FEDORA-2012-19538 weechat 2012-12-11
Fedora FEDORA-2012-19533 weechat 2012-12-11
openSUSE openSUSE-SU-2012:1580-1 weechat 2012-11-28
Mageia MGASA-2012-0330 weechat 2012-11-17
Fedora FEDORA-2012-18006 weechat 2012-11-19
Fedora FEDORA-2012-17973 weechat 2012-11-19

xen: multiple vulnerabilities

Package(s): Xen
CVE #(s): CVE-2012-3497 CVE-2012-4535 CVE-2012-4536 CVE-2012-4537 CVE-2012-4538 CVE-2012-4539
Created: November 16, 2012
Updated: December 24, 2012

From the SUSE advisory:

* CVE-2012-4535: xen: Timer overflow DoS vulnerability (XSA 20)

* CVE-2012-4536: xen: pirq range check DoS vulnerability (XSA 21)

* CVE-2012-4537: xen: Memory mapping failure DoS vulnerability (XSA 22)

* CVE-2012-4538: xen: Unhooking empty PAE entries DoS vulnerability (XSA 23)

* CVE-2012-4539: xen: Grant table hypercall infinite loop DoS vulnerability (XSA 24)

* CVE-2012-3497: xen: multiple TMEM hypercall vulnerabilities (XSA-15)

SUSE SUSE-SU-2014:0470-1 Xen 2014-04-01
SUSE SUSE-SU-2014:0446-1 Xen 2014-03-25
Gentoo 201309-24 xen 2013-09-27
openSUSE openSUSE-SU-2012:1687-1 xen 2012-12-23
openSUSE openSUSE-SU-2012:1685-1 xen 2012-12-23
Debian DSA-2582-1 xen 2012-12-07
Oracle ELSA-2012-1540 kernel 2012-12-05
Scientific Linux SL-kern-20121206 kernel 2012-12-06
SUSE SUSE-SU-2012:1615-1 Xen 2012-12-06
SUSE SUSE-SU-2012:1487-1 Xen 2012-11-16
openSUSE openSUSE-SU-2012:1573-1 XEN 2012-11-26
openSUSE openSUSE-SU-2012:1572-1 XEN 2012-11-26
Fedora FEDORA-2012-18242 xen 2012-11-23
SUSE SUSE-SU-2012:1486-1 Xen 2012-11-16
Fedora FEDORA-2012-18249 xen 2012-11-23
SUSE SUSE-SU-2012:1503-1 libvirt 2012-11-19
CentOS CESA-2012:1540 kernel 2012-12-05
Red Hat RHSA-2012:1540-01 kernel 2012-12-04
Gentoo 201604-03 xen 2016-04-05

Page editor: Jake Edge

Kernel development

Brief items

Kernel release status

The current development kernel is 3.7-rc6, released on November 16; things have been slow since then as Linus has gone on vacation. "I'll have a laptop with me as I'm away, but if things calm down even further, I'll be happy. I'll do an -rc7, but considering how calm things have been, I suspect that's the last -rc. Unless something dramatic happens."

Stable updates: 3.0.52, 3.2.34, 3.4.19, and 3.6.7 were all released on November 17 with the usual set of important fixes.

Quotes of the week

Sometimes it's scary how many latent bugs we have in the kernel and how long many of them have been around. At other times, it's comforting. I mean, there's a pretty good chance that other people don't notice my screw ups, right?
Tejun Heo

End result: A given device only ever crashes exactly once on a given Windows system.
Peter Stuge on why Linux may have to do the same

I read that line several times and it just keeps sounding like some chant done by the strikers in Greece over the austerity measures...

"Consolidate a bit! The context switch code!"
"Consolidate a bit! The context switch code!"
"Consolidate a bit! The context switch code!"
"Consolidate a bit! The context switch code!"
I guess because it just sounds Greek to me.
Steven Rostedt

After six and a half years of writing and maintaining KVM, it is time to move to new things.
Avi Kivity hands off to Gleb Natapov

Barcelona Media Summit Report

Hans Verkuil has posted a report from the meeting of kernel-space media developers recently held in Barcelona. Covered topics include a new submaintainer organization, requirements for new V4L2 drivers, asynchronous loading, and more. "Basically the number of patch submissions increased from 200 a month two years ago to 700 a month this year. Mauro is unable to keep up with that flood and a solution needed to be found."

Kernel development news

LCE: Checkpoint/restore in user space: are we there yet?

By Michael Kerrisk
November 20, 2012

Checkpoint/restore refers to the ability to snapshot the state of an application (which may consist of multiple processes) and then later restore the application to a running state, possibly on a different (virtual) system. Pavel Emelyanov's talk at LinuxCon Europe 2012 provided an overview of the current status of the checkpoint/restore in user space (CRIU) system that has been in development for a couple of years now.

Uses of checkpoint/restore

There are various uses for checkpoint/restore functionality. For example, Pavel's employer, Parallels, uses it for live migration, which allows a running application to be moved between host machines without loss of service. Parallels also uses it for so-called rebootless kernel updates, whereby applications on a machine are checkpointed to persistent storage while the kernel is updated and rebooted, after which the applications are restored; the applications then continue to run, unaware that the kernel has changed and the system has been restarted.

Another potential use of checkpoint/restore is to speed up the start-up of applications that have a long initialization time. An application can be started and checkpointed to persistent storage after the initialization is completed. Later, the application can be quickly (re-)started from the checkpointed snapshot. (This is analogous to the dump-emacs feature that is used to speed up start times for emacs by creating a preinitialized binary.)

Checkpoint/restore also has uses in high-performance computing. One such use is for load balancing, which is essentially another application of live migration. Another use is incremental snapshotting, whereby an application's state is periodically checkpointed to persistent storage, so that, in the event of an unplanned system outage, the application can be restarted from a recent checkpoint rather than losing days of calculation.

"You might ask, is it possible to already do all of these things on Linux right now? The answer is that it's almost possible." Pavel spent the remainder of the talk describing how the CRIU implementation works, how close the implementation is to completion, and what work remains to be done. He began with some history of the checkpoint/restore project.

History of checkpoint/restore

The origins of the CRIU implementation go back to work that started in 2005 as part of the OpenVZ project. The project provided a set of out-of-mainline patches to the Linux kernel that supported a kernel-space implementation of checkpoint/restore.

In 2008, when the first efforts were made to upstream the checkpoint/restore functionality, the OpenVZ project communicated with a number of other parties who were interested in the functionality. At the time, it seemed natural to employ an in-kernel implementation of checkpoint/restore. A few years' work resulted in a set of more than 100 patches that implemented almost all of the same functionality as OpenVZ's kernel-based checkpoint/restore mechanism.

However, concerns from the upstream kernel developers eventually led to the rejection of the kernel-based approach. One concern related to the sheer scale of the patches and the complexity they would add to the kernel: the patches amounted to tens of thousands of lines and touched a very wide range of subsystems in the kernel. There were also concerns about the difficulties of implementing backward compatibility for checkpoint/restore, so that an application could be checkpointed on one kernel version and then successfully restored on a later kernel version.

Over the course of about a year, the OpenVZ project then turned its efforts to developing an implementation of checkpoint/restore that was done mainly in user space, with help from the kernel where it was needed. In January 2012, that effort was repaid when Linus Torvalds merged a first set of CRIU-related patches into the mainline kernel, albeit with an amusingly skeptical covering note from Andrew Morton:

A note on this: this is a project by various mad Russians to perform checkpoint/restore mainly from userspace, with various oddball helper code added into the kernel where the need is demonstrated.

So rather than some large central lump of code, what we have is little bits and pieces popping up in various places which either expose something new or which permit something which is normally kernel-private to be modified.

Since then, two versions of the corresponding user-space tools have been released: CRIU v0.1 in July, and CRIU v0.2, which added support for Linux Containers (LXC), in September.

Goal and concept

The ultimate goal of the CRIU project is to allow the entire state of an application to be dumped (checkpointed) and then later restored. This is a complex task, for several reasons. First of all, there are many pieces of process state that must be saved, for example, information about virtual memory mappings, open files, credentials, timers, process ID, parent process ID, and so on. Furthermore, an application may consist of multiple processes that share some resources. The CRIU facility must allow all of these processes to be checkpointed and restored to the same state.

For each piece of state that the kernel records about a process, CRIU needs two pieces of support from the kernel. The first piece is a mechanism to interrogate the kernel about the value of the state, in preparation for dumping the state during a checkpoint. The second piece is a mechanism to pass that state back to the kernel when the process is restored. Pavel illustrated this point using the example of open files. A process may open an arbitrary set of files. Each open() call results in the creation of a file descriptor that is a handle to some internal kernel state describing the open file. In order to dump that state, CRIU needs a mechanism to ask the kernel which files are opened by that process. To restore the application, CRIU then re-opens those files using the same descriptor numbers.
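Restoring a descriptor at a specific number is the easy part of that job. The following Python sketch (CRIU itself is written in C; Python's os module wraps the same system calls) shows one plausible way to do it: open() the file, move the result onto the saved number with dup2(), and seek to the saved offset. restore_fd() and the descriptor number 100 are hypothetical choices for illustration, and the sketch ignores sharing, non-regular files, and the many other details a real restore must handle.

```python
import os
import tempfile


def restore_fd(path: str, flags: int, target_fd: int, offset: int) -> int:
    """Re-open path so that it lands on descriptor number target_fd,
    positioned at the saved file offset (simplified illustration)."""
    fd = os.open(path, flags)
    if fd != target_fd:
        # dup2() closes target_fd first if it is already open, then makes it
        # refer to the same open file as fd; the temporary fd is then dropped.
        os.dup2(fd, target_fd)
        os.close(fd)
    os.lseek(target_fd, offset, os.SEEK_SET)
    return target_fd


# Usage: pretend a checkpoint recorded "descriptor 100, offset 2" for this file.
with tempfile.NamedTemporaryFile(delete=False) as saved:
    saved.write(b"abcdef")
restore_fd(saved.name, os.O_RDONLY, 100, 2)
print(os.read(100, 2))  # b'cd'
os.close(100)
os.unlink(saved.name)
```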

The CRIU system makes use of various kernel APIs for retrieving and restoring process state, including files in the /proc file system, netlink sockets, and system calls. Files in /proc can be used to retrieve a wide range of information about processes and their interrelationships. Netlink sockets are used both to retrieve and to restore various pieces of state information.

System calls provide a mechanism to both retrieve and restore various pieces of state. System calls can be subdivided into two categories. First, there are system calls that operate only on the process that calls them. For example, getitimer() can be used to retrieve only the caller's interval timer value. System calls in this category can't easily be used to retrieve or restore the state of arbitrary processes. However, later in his talk, Pavel described a technique that the CRIU project came up with to employ these calls. The other category of system calls can operate on arbitrary processes. The system calls that set process scheduling attributes are an example: sched_getscheduler() and sched_getparam() can be used to retrieve the scheduling attributes of an arbitrary process and sched_setscheduler() can be used to set the attributes of an arbitrary process.
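The difference between the two categories is visible in the system-call signatures themselves. In this Linux-only Python sketch, scheduling_state() (a hypothetical helper) queries a process by PID, while getitimer() takes no PID at all and can only describe the caller.

```python
import os
import signal


def scheduling_state(pid: int) -> tuple:
    """Return (policy, static priority) for an arbitrary process.

    sched_getscheduler() and sched_getparam() take a PID argument, so one
    process can inspect another; PID 0 means the caller itself.
    """
    policy = os.sched_getscheduler(pid)
    priority = os.sched_getparam(pid).sched_priority
    return policy, priority


def own_real_timer() -> tuple:
    """getitimer() has no PID argument: it only reports the caller's timer
    as (seconds until expiry, reload interval)."""
    return signal.getitimer(signal.ITIMER_REAL)


# Inspect the parent process, something getitimer() cannot do for us.
print(scheduling_state(os.getppid()))
print(own_real_timer())
```

The only way to learn another process's interval timers through getitimer() is to make that process call it, which is where the parasite-code technique described below comes in.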

CRIU requires kernel support for retrieving each piece of process state. In some cases, the necessary support already existed. However, in other cases, there is no kernel API that can be used to interrogate the kernel about the state; for each such case, the CRIU project must add a suitable kernel API. Pavel used the example of memory-mapped files to illustrate this point. The /proc/PID/maps file provides the pathnames of the files that a process has mapped. However, the file pathname is not a reliable identifier for the mapped file. For example, after the mapping was created, filesystem mount points may have been rearranged or the pathname may have been unlinked. Therefore, in order to obtain complete and accurate information about mappings, the CRIU developers added a new kernel API: /proc/PID/map_files.
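The unreliability of the pathname is easy to demonstrate without special privileges. This Python sketch (Linux-only; the helper names are hypothetical) maps a temporary file, unlinks it, and then looks at /proc/self/maps, where the kernel reports the old name with a " (deleted)" suffix rather than anything that could be reopened. It does not read map_files itself, because reading those links normally requires elevated privileges.

```python
import mmap
import os
import tempfile


def mapped_paths() -> list:
    """Return the pathname field of every file-backed line in /proc/self/maps."""
    paths = []
    with open("/proc/self/maps") as maps:
        for line in maps:
            # Fields: address perms offset dev inode [pathname]; the pathname
            # may itself contain spaces, so split at most five times.
            fields = line.split(maxsplit=5)
            if len(fields) == 6:
                paths.append(fields[5].rstrip("\n"))
    return paths


def path_seen_after_unlink() -> str:
    """Map a temporary file, unlink it, and report what /proc/self/maps shows."""
    fd, path = tempfile.mkstemp()
    real_path = os.path.realpath(path)  # maps shows the fully resolved path
    os.write(fd, b"\0" * mmap.PAGESIZE)
    mapping = mmap.mmap(fd, mmap.PAGESIZE, prot=mmap.PROT_READ)
    os.close(fd)
    os.unlink(path)
    try:
        return next(p for p in mapped_paths() if p.startswith(real_path))
    finally:
        mapping.close()


print(path_seen_after_unlink())  # e.g. "/tmp/tmpXXXXXXXX (deleted)"
```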

The situation when restoring process state is often a lot simpler: in many cases the same API that was used to create the state in the first place can be used to re-create the state during a restore. However, in some cases, restoring process state is not so simple. For example, getpid() can be used to retrieve a process's PID, but there is no corresponding API to set a process's PID during a restore (the fork() system call does not allow the caller to specify the PID of the child process). To address this problem, the CRIU developers added an API that could be used to control the PID that was chosen by the next fork() call. (In response to a question at the end of the talk, Pavel noted that in cases where the new kernel features added to support CRIU have security implications, access to those features has been restricted by a requirement that the user must have the CAP_SYS_ADMIN capability.)

Kernel impact and new kernel features

The CRIU project has largely achieved its goal, Pavel said. Instead of having a large mass of code inside the kernel that does checkpoint/restore, there are instead many small extensions to the kernel that allow checkpoint/restore to be done in user space. By now, just over 100 CRIU-related patches have been merged upstream or are sitting in "-next" trees. Those patches added nine new features to the kernel, of which only one was specific to checkpoint/restore; all of the others have turned out to also have uses outside checkpoint/restore. Approximately 15 further patches are currently being discussed on the mailing lists; in most cases, the principles have been agreed on by the stakeholders, but details are being resolved. These "in flight" patches provide two additional kernel features.

Pavel detailed a few of the more interesting new features added to the kernel for the CRIU project. One of these was parasite code injection, which was added by Tejun Heo, "not specifically within the CRIU project, but with the same intention". Using this feature, a process can be made to execute an arbitrary piece of code. The CRIU framework employs parasite code injection to use those system calls mentioned earlier that operate only on the caller's state; this obviated the need to add a range of new APIs to retrieve and restore various pieces of state of arbitrary processes. Examples of system calls used to obtain process state via injected parasite code are getitimer() (to retrieve interval timers) and sigaction() (to retrieve signal dispositions).

The kcmp() system call was added as part of the CRIU project. It allows the comparison of various kernel objects used by two processes. Using this system call, CRIU can build a full picture of what resources two processes share inside the kernel. Returning to the example of open files gives some idea of how kcmp() is useful.

Information about an open file is available via /proc/PID/fd and the files in /proc/PID/fdinfo. Together, these files reveal the file descriptor number, pathname, file offset, and open file flags for each file that a process has opened. This is almost enough information to be able to re-open the file during a restore. However, one notable piece of information is missing: sharing of open files. Sometimes, two file descriptors refer to the same open file description (the kernel's struct file). That can happen, for example, after a call to fork(), since the child inherits copies of all of its parent's file descriptors. As a consequence of this type of sharing, the file descriptors share the file offset and the open file flags.
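The effect of sharing an open file description can be seen directly. In this Python sketch (offsets_after_read() is a hypothetical helper), dup() stands in for the inheritance that fork() produces: reading through one descriptor moves the offset of its duplicate, whereas two independent open() calls keep independent offsets.

```python
import os
import tempfile


def offsets_after_read(shared: bool) -> tuple:
    """Open a scratch file on two descriptors, read 3 bytes through the first,
    and return the current offset of each descriptor.

    shared=True:  the second descriptor comes from dup(), so both refer to
                  the same open file description (as after fork()).
    shared=False: the second descriptor comes from a separate open() call.
    """
    fd_tmp, path = tempfile.mkstemp()
    os.write(fd_tmp, b"0123456789")
    os.close(fd_tmp)
    try:
        first = os.open(path, os.O_RDONLY)
        second = os.dup(first) if shared else os.open(path, os.O_RDONLY)
        os.read(first, 3)
        offsets = (os.lseek(first, 0, os.SEEK_CUR), os.lseek(second, 0, os.SEEK_CUR))
        os.close(first)
        os.close(second)
        return offsets
    finally:
        os.unlink(path)


print(offsets_after_read(shared=True))   # (3, 3)
print(offsets_after_read(shared=False))  # (3, 0)
```

A restore that simply called open() twice would produce the second result where the checkpointed process had the first.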

This sort of sharing of open file descriptions can't be restored via simple calls to open(). Instead, CRIU makes use of the kcmp() system call to discover instances of file sharing when performing the checkpoint, and then uses a combination of open() and file descriptor passing via UNIX domain sockets to re-create the necessary sharing during the restore. (However, this is far from the full story for open files, since there are many other attributes associated with specific kinds of open files that CRIU must handle. For example, inotify file descriptors, sockets, pseudo-terminals, and pipes all require additional work within CRIU.)
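Descriptor passing over a UNIX domain socket uses SCM_RIGHTS ancillary data, and the received descriptor refers to the same open file description as the one that was sent. The Python sketch below (it needs Python 3.9 or later for socket.send_fds() and socket.recv_fds(); pass_descriptor() is hypothetical) does everything within one process over a socketpair, whereas a real restore would pass descriptors between processes.

```python
import os
import socket
import tempfile


def pass_descriptor(fd: int) -> int:
    """Send fd across a UNIX domain socket pair as SCM_RIGHTS ancillary data
    and return the descriptor number that arrives on the receiving end."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with sender, receiver:
        # At least one byte of ordinary data accompanies the descriptor.
        socket.send_fds(sender, [b"x"], [fd])
        _data, fds, _flags, _addr = socket.recv_fds(receiver, 1, 1)
    return fds[0]


fd_tmp, path = tempfile.mkstemp()
os.write(fd_tmp, b"0123456789")
os.close(fd_tmp)

original = os.open(path, os.O_RDONLY)
received = pass_descriptor(original)
os.read(received, 4)                       # read through the passed copy...
print(received != original)                # True: a new descriptor number
print(os.lseek(original, 0, os.SEEK_CUR))  # 4: ...yet the offset is shared
os.close(received)
os.close(original)
os.unlink(path)
```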

Another notable feature added to the kernel for CRIU is sock_diag. This is a netlink-based subsystem that can be used to obtain information about sockets. sock_diag is an example of how a CRIU-inspired addition to the kernel has also benefited other projects. Nowadays, the ss command, which displays information about sockets on the system, also makes use of sock_diag. Previously, ss used /proc files to obtain the information it displayed. The advantage of employing sock_diag is that, by comparison with the corresponding /proc files, it is much easier to extend the interface to provide new information without breaking existing applications. In addition, sock_diag provides some information that was not available with the older interfaces. In particular, before the advent of sock_diag, ss did not have a way of discovering the connections between pairs of UNIX domain sockets on a system.
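As a rough illustration of the interface, the sketch below sends a sock_diag dump request for UNIX domain sockets and counts how many replies carry a peer attribute, the connection information that ss previously could not obtain. It assumes the kernel has UNIX socket diagnostics available (the unix_diag module) and that the caller may open a NETLINK_SOCK_DIAG socket; the function name is hypothetical, and real tools such as ss do considerably more parsing.

```c
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/sock_diag.h>
#include <linux/unix_diag.h>
#include <string.h>
#include <unistd.h>

/*
 * Dump every UNIX domain socket in the caller's network namespace.
 * Returns the number of sockets reported and stores in *with_peer how
 * many came back with a UNIX_DIAG_PEER attribute (connected sockets).
 * Returns -1 on any error.
 */
int count_unix_sockets(int *with_peer)
{
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG);
    if (fd < 0)
        return -1;

    struct {
        struct nlmsghdr nlh;
        struct unix_diag_req udr;
    } req;
    memset(&req, 0, sizeof(req));
    req.nlh.nlmsg_len = sizeof(req);
    req.nlh.nlmsg_type = SOCK_DIAG_BY_FAMILY;
    req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req.udr.sdiag_family = AF_UNIX;
    req.udr.udiag_states = -1;            /* sockets in every state */
    req.udr.udiag_show = UDIAG_SHOW_PEER; /* include peer inode numbers */

    struct sockaddr_nl kernel = { .nl_family = AF_NETLINK }; /* nl_pid 0 */
    if (sendto(fd, &req, sizeof(req), 0,
               (struct sockaddr *)&kernel, sizeof(kernel)) < 0) {
        close(fd);
        return -1;
    }

    int sockets = 0, peers = 0;
    long buf[32768 / sizeof(long)];       /* aligned buffer for replies */
    for (;;) {
        int len = (int)recv(fd, buf, sizeof(buf), 0);
        if (len <= 0)
            break;
        for (struct nlmsghdr *h = (struct nlmsghdr *)buf; NLMSG_OK(h, len);
             h = NLMSG_NEXT(h, len)) {
            if (h->nlmsg_type == NLMSG_DONE) {   /* end of the dump */
                close(fd);
                *with_peer = peers;
                return sockets;
            }
            if (h->nlmsg_type == NLMSG_ERROR)
                goto fail;
            struct unix_diag_msg *msg = NLMSG_DATA(h);
            struct rtattr *attr = (struct rtattr *)(msg + 1);
            int alen = (int)(h->nlmsg_len - NLMSG_LENGTH(sizeof(*msg)));
            sockets++;
            for (; RTA_OK(attr, alen); attr = RTA_NEXT(attr, alen))
                if (attr->rta_type == UNIX_DIAG_PEER)
                    peers++;
        }
    }
fail:
    close(fd);
    return -1;
}
```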

Pavel briefly mentioned a few other kernel features added as part of the CRIU work. TCP repair mode allows CRIU to checkpoint and restore an active TCP connection, transparently to the peer application. Virtualization of network device indices allows virtual network devices to be restored in a network namespace; it also had the side-benefit of a small improvement in the speed of network routing. As noted earlier, the /proc/PID/map_files directory was added for CRIU. CRIU has also implemented a technique for peeking at the data in a socket queue, so that the contents of a socket input queue can be dumped. Finally, CRIU added a number of options to the getsockopt() system call, so that various options that were formerly only settable via setsockopt() are now also retrievable.

Current status

Pavel then summarized the current state of the CRIU implementation, looking at what is supported by the mainline 3.6 kernel. CRIU currently supports (only) the x86-64 architecture. Asked at the end of the talk how much work would be required to port CRIU to a new architecture, Pavel estimated that the work should not be large. The main tasks are to implement code that dumps architecture-specific state (mainly registers) and reimplement a small piece of code that is currently written in x86 assembler.

Arbitrary process trees are supported: it is possible to dump a process and all of its descendants. CRIU supports multithreaded applications, memory mappings of all kinds, and terminals, process groups, and sessions. Open files are supported, including shared open files, as described above. Established TCP connections are supported, as are UNIX domain sockets.

The CRIU user-space tools also support various kinds of non-POSIX files, including inotify, epoll, and signalfd file descriptors, but the required kernel-side support is not yet available. Patches for that support are currently queued, and Pavel hopes that they will be merged for kernel 3.8.


The CRIU project tests its work in a variety of ways. First, there is the ZDTM (zero-down-time-migration) test suite. This test suite consists of a large number of small tests. Each test program sets up a test before a checkpoint, and then reports on the state of the tested feature after a restore. Every new feature merged into the CRIU project adds a test to this suite.

In addition, from time to time, the CRIU developers take some real software and test whether it survives a checkpoint/restore. Among the programs that they have successfully checkpointed and restored are the Apache web server, MySQL, a parallel compilation of the kernel, tar, gzip, an SSH daemon with connections, nginx, VNC with XScreenSaver and a client connection, MongoDB, and tcpdump.

Plans for the near future

The CRIU developers have a number of plans for the near future. (The CRIU wiki has a TODO list.) First among these is to complete the coverage of resources supported by CRIU. For example, CRIU does not currently support POSIX timers. The problem here is that the kernel doesn't currently provide an API to detect whether a process is using POSIX timers. Thus, if an application using POSIX timers is checkpointed and restored, the timers will be lost. There are some other similar examples. Fixing these sorts of problems will require adding suitable APIs to the kernel to expose the required state information.

Another outstanding task is to integrate the user-space crtools into LXC and OpenVZ to permit live migration of containers. Pavel noted that OpenVZ already supports live migration, but with its own out-of-tree kernel modules.

The CRIU developers plan to improve the automation of live migration. The issue here is that CRIU deals only with process state. However, there are other pieces of state in a container. One such piece of state is the filesystem. Currently, when checkpointing and restoring an application, it is necessary to ensure that the filesystem state has not changed in the interim (e.g., no files that are open in the checkpointed application have been deleted). Some scripting using rsync to automate the copying of files from the source system to the destination system could be implemented to ease the task of live migration.

One further piece of planned work is to improve the handling of user-space memory. Currently, around 90% of the time required to checkpoint an application is taken up by reading user-space memory. For many use cases, this is not a problem. However, for live migration and incremental snapshotting, improvements are possible. For example, when performing live migration, the whole application must first be frozen, and then the entire memory is copied out to the destination system, after which the application is restarted on the destination system. Copying out a huge amount of memory may require several seconds; during that time the application is unavailable. This situation could be alleviated by allowing the application to continue to run at the same time as memory is copied to the destination system, then freezing the application and asking the kernel which pages of memory have changed since the checkpoint operation began. Most likely, only a small amount of memory will have changed; those modified pages can then be copied to the destination system. This could result in a considerable shortening of the interval during which the application is unavailable. The CRIU developers plan to talk with the memory-management developers about how to add support for this optimization.
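The optimization described above is commonly known as pre-copy migration. The toy model below is hypothetical (it is not CRIU code and uses no real kernel interface): pages are plain integers, and a dirty array stands in for asking the kernel which pages changed after the first copy round began. It shows why only the dirtied pages need to be copied while the application is frozen.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy pre-copy migration. writes[] lists the page indices that the
 * still-running application modifies during the first copy round.
 * Returns how many pages had to be copied while the application was
 * frozen; a plain stop-and-copy approach would copy all npages.
 */
size_t precopy_migrate(int *src, int *dst, bool *dirty, size_t npages,
                       const size_t *writes, size_t nwrites)
{
    for (size_t i = 0; i < npages; i++)
        dirty[i] = false;

    /* Round 1: copy all memory while the application keeps running. */
    for (size_t i = 0; i < npages; i++)
        dst[i] = src[i];

    /* Meanwhile the application writes some pages; each is marked dirty. */
    for (size_t w = 0; w < nwrites; w++) {
        src[writes[w]] += 1;
        dirty[writes[w]] = true;
    }

    /* Freeze the application, then copy only the pages that changed. */
    size_t frozen_copies = 0;
    for (size_t i = 0; i < npages; i++) {
        if (dirty[i]) {
            dst[i] = src[i];
            frozen_copies++;
        }
    }
    return frozen_copies;
}
```

With eight pages and writes to pages 2, 5, and 2 again, only two pages are copied during the frozen interval instead of eight.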

Concluding remarks

Although many groups are interested in having checkpoint/restore functionality, an implementation that works with the mainline kernel has taken a long time in coming. When one looks into the details and realizes how complex the task is, it is perhaps unsurprising that it has taken so long. Along the way, one major effort to solve the problem—checkpoint/restore in kernel space—was considered and rejected. However, there are some promising signs that the mad Russians led by Pavel may be on the verge of success with their alternative approach of a user-space implementation.

Comments (11 posted)

VFS hot-data tracking

By Jonathan Corbet
November 20, 2012
At any level of the system, from the hardware to high-level applications, performance often depends on keeping frequently-used data in a place where it can be accessed quickly. That is the principle behind hardware caches, virtual memory, and web-browser image caches, for example. The kernel already tries to keep useful filesystem data in the page cache for quick access, but there can also be advantages to keeping track of "hot" data at the filesystem level and treating it specially. In 2010, a data temperature tracking patch set for the Btrfs filesystem was posted, but then faded from view. Now the idea has returned as a more general solution. The current form of the patch set, posted by Zhi Yong Wu, is called hot-data tracking. It works at the virtual filesystem (VFS) level, tracking accesses to data and making the resulting information available to user space via a couple of mechanisms.

The first step is the instrumentation of the VFS to obtain the needed information. To that end, Zhi Yong's patch set adds hooks to a number of core VFS functions (__blockdev_direct_IO(), readpage(), read_pages(), and do_writepages()) to record specific access operations. It is worth noting that hooking at this level means that this subsystem is not tracking data accesses as such; instead, it is tracking operations that cause actual file I/O. The two are not quite the same thing: a frequently-read page that remains in the page cache will generate no I/O; it could look quite cold to the hot-data tracking code.

The patch set uses these hooks to maintain a surprisingly complicated data structure, involving a couple of red-black trees, that is hooked into a filesystem's superblock structure. Zhi Yong used this bit of impressive ASCII art to describe it in the documentation file included with the patch set:

heat_inode_map           hot_inode_tree
    |                         |
    |                         V
    |           +-------hot_comm_item--------+
    |           |       frequency data       |
+---+           |        list_head           |
|               V            ^ |             V
| ...<--hot_comm_item-->...  | |  ...<--hot_comm_item-->...
|       frequency data       | |        frequency data
+-------->list_head----------+ +--------->list_head--->.....
       hot_range_tree                  hot_range_tree
             heat_range_map                  V
                   |           +-------hot_comm_item--------+
                   |           |       frequency data       |
               +---+           |        list_head           |
               |               V            ^ |             V
               | ...<--hot_comm_item-->...  | |  ...<--hot_comm_item-->...
               |       frequency data       | |        frequency data
               +-------->list_head----------+ +--------->list_head--->.....

In short, the idea is to track which inodes are seeing the most I/O traffic, along with the hottest data ranges within those inodes. The subsystem can produce a sorted list on demand. Unsurprisingly, this data structure can end up using a lot of memory on a busy system, so Zhi Yong has added a shrinker to clean things up when space gets tight. Specific file information is also dropped after five minutes (by default) with no activity.
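The two operations described here, aging out idle entries and producing a hottest-first list, can be illustrated with a flat array. This is only a sketch: the record type and function names are hypothetical, and the patch set itself uses red-black trees, list heads, and a shrinker rather than anything like this.

```c
#include <stdint.h>
#include <stdlib.h>

#define HOT_AGE_LIMIT_SECS 300 /* the five-minute default mentioned above */

/* Hypothetical per-inode record; not the patch set's hot_comm_item. */
struct hot_inode_rec {
    uint64_t ino;
    uint32_t temp;
    uint64_t last_access; /* seconds */
};

/* qsort comparator: higher temperature sorts first. */
static int hotter_first(const void *a, const void *b)
{
    const struct hot_inode_rec *x = a, *y = b;
    return (x->temp < y->temp) - (x->temp > y->temp);
}

/*
 * Drop records idle for longer than HOT_AGE_LIMIT_SECS, then sort the
 * survivors hottest-first in place. Returns the number of records kept.
 */
size_t hot_list(struct hot_inode_rec *recs, size_t n, uint64_t now)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (now - recs[i].last_access <= HOT_AGE_LIMIT_SECS)
            recs[kept++] = recs[i];
    qsort(recs, kept, sizeof(*recs), hotter_first);
    return kept;
}
```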

There is a new ioctl() command (FS_IOC_GET_HEAT_INFO) that can be used to obtain the relevant information for a specific file. The structure it uses shows the information that is available:

    struct hot_heat_info {
        __u64 avg_delta_reads;
        __u64 avg_delta_writes;
        __u64 last_read_time;
        __u64 last_write_time;
        __u32 num_reads;
        __u32 num_writes;
        __u32 temp;
        __u8 live;
    };

The hot-data tracking subsystem monitors the number of read and write operations, when the last operations occurred, and the average period between operations. A complicated calculation boils all that information down to a single temperature value, stored in temp. The live field is an input parameter to the ioctl() call: if it is non-zero, the temperature will be recalculated at the time of the call; otherwise a cached, previously-calculated value will be returned.
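The patch set's actual calculation is more involved and is not reproduced here. As a hypothetical illustration of how such inputs can be folded into one number, the sketch below rewards many operations, short gaps between them, and recent activity; the struct and function names, and the weighting, are assumptions made for this example only.

```c
#include <stdint.h>

/* Standalone copy of the relevant fields (hypothetical, for illustration). */
struct heat_sample {
    uint64_t avg_delta_reads, avg_delta_writes; /* mean gap between ops */
    uint64_t last_read_time, last_write_time;
    uint32_t num_reads, num_writes;
};

/*
 * Hypothetical temperature, NOT the patch set's formula: each of three
 * terms (operation count, short gaps, recent access) is capped at 1000.
 */
uint32_t toy_temperature(const struct heat_sample *h, uint64_t now)
{
    uint64_t ops = (uint64_t)h->num_reads + h->num_writes;
    uint64_t gap = (h->avg_delta_reads + h->avg_delta_writes) / 2;
    uint64_t last = h->last_read_time > h->last_write_time ?
                    h->last_read_time : h->last_write_time;
    uint64_t age = now > last ? now - last : 0;

    uint64_t t = 0;
    t += ops < 1000 ? ops : 1000;   /* frequency term */
    t += 1000 / (gap + 1);          /* short gaps make a file hotter */
    t += 1000 / (age + 1);          /* recent access makes a file hotter */
    return (uint32_t)t;
}
```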

The ioctl() call does not provide a way to query which parts of the file are the hottest, or to get a list of the hottest files. Instead, the debugfs interface must be used. Once debugfs is mounted, each device or partition with a mounted filesystem will be represented by a directory under hot_track/ containing two files. The most active files can be found by reading rt_stats_inode, while the hottest file ranges can be read from rt_stats_range. These are the interfaces that user-space utilities are expected to use to make decisions about, for example, which files (or portions of files) should be stored on a fast, solid-state drive.

Should a filesystem want to influence how the calculations are done, the patch set provides a structure (called hot_func_ops) as a place for filesystem-provided functions to calculate access frequencies, temperatures, and when information should be aged out of the system. In the posted patch set, though, only Btrfs uses the hot-data tracking feature, and it does not override any of those operations, so it is not entirely clear why they exist. The changelog states that support for ext4 and xfs has been implemented; perhaps one of those filesystems needed that capability.
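The usual kernel pattern for this kind of structure is a table of optional function pointers, with generic code falling back to a default when a filesystem leaves an entry NULL (as Btrfs effectively does here). The sketch below shows that pattern; the structure, member names, signatures, and the example override are hypothetical and are not taken from hot_func_ops.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical analogue of hot_func_ops; names are illustrative only. */
struct heat_ops {
    uint32_t (*temperature)(uint32_t num_reads, uint32_t num_writes);
    int (*is_aged)(uint64_t idle_secs);
};

static uint32_t default_temperature(uint32_t r, uint32_t w) { return r + w; }
static int default_is_aged(uint64_t idle) { return idle > 300; }

/* Use the filesystem's hook if it supplied one, else the generic default. */
uint32_t heat_temperature(const struct heat_ops *ops, uint32_t r, uint32_t w)
{
    if (ops && ops->temperature)
        return ops->temperature(r, w);
    return default_temperature(r, w);
}

int heat_is_aged(const struct heat_ops *ops, uint64_t idle_secs)
{
    if (ops && ops->is_aged)
        return ops->is_aged(idle_secs);
    return default_is_aged(idle_secs);
}

/* Hypothetical filesystem that overrides only the temperature hook. */
static uint32_t writes_weigh_double(uint32_t r, uint32_t w) { return r + 2 * w; }
static const struct heat_ops example_fs_ops = { .temperature = writes_weigh_double };
```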

The patch set has been through several review cycles and a lot of changes have been made in response to comments. The list of things still to be done includes scalability testing, a simpler temperature calculation function, and the ability to save file temperature data across an unmount. If nothing else, some solid performance information will be required before this patch set can be merged into the core VFS code. So hot-data tracking is not 3.8 material, but it may be ready for one of the subsequent development cycles.

Comments (1 posted)

The module signing endgame

By Jake Edge
November 21, 2012

Inserting a loadable module into the running kernel is a potential security problem, so some administrators want to be able to restrict which modules are allowed. One way to do that is to cryptographically sign modules and have the kernel verify that signature before loading the module. Module signing isn't for everyone, and those who aren't interested probably don't want to pay much of a price for that new feature. Even those who are interested will want to minimize that price. While cryptographically signing kernel modules can provide a security benefit, that boon comes with a cost: slower kernel builds. When that cost is multiplied across a vast number of kernel builds, it draws some attention.

David Miller complained on Google+ about the cost of module signing in mid-October. Greg Kroah-Hartman agreed in the comments, noting that an allmodconfig build took more than 10% longer between 3.6 and 3.7-rc1. The problem is the addition of module signing to the build process. Allmodconfig builds the kernel with as many modules as possible, which has the effect of build-testing nearly all of the kernel. Maintainers like Miller and Kroah-Hartman do that kind of build frequently, typically after each patch they apply, in order to ensure that the kernel still builds. Module signing can, of course, be turned off using CONFIG_MODULE_SIG, but that adds a manual configuration step to the build process, which is annoying.

Linus Torvalds noted Miller's complaint and offered up a "*much* simpler" solution: defer module signing until install time. There is already a mechanism to strip modules during the make modules_install step. Torvalds's change adds module signing into that step, which means that you don't pay the signing price until you actually install the modules. There are some use cases that would not be supported by this change, but Torvalds essentially dismissed them:

Sure, it means that if you want to load modules directly from your kernel build tree (without installing them), you'd better be running a kernel that doesn't need the signing (or you need to sign things explicitly). But seriously, nobody cares. If you are building a module after booting the kernel with the intention of loading that modified module, you aren't going to be doing that whole module signing thing *anyway*. Signed modules make sense when building the kernel and module together, so signing them as we install the kernel and module is just sensible.

One of the main proponents behind the module signing feature over the years has been David Howells; his code was used as the basis for module maintainer Rusty Russell's signature infrastructure patch. But Howells was not particularly happy with Torvalds's changes. He would like to be able to handle some of the use cases that Torvalds dismissed, including loading modules from the kernel build tree. He thinks that automatic signing should probably just be removed from the build process; a script could be provided to do signing manually.

Howells is looking at the signed modules problem from a distribution view. Currently, the keys used to sign modules can be auto-generated at build time, with the public key getting built into the kernel and the private portion being used for signing—and then likely deleted once the build finishes. That isn't how distributions will do things, so auto-generating keys concerns Howells:

It would also be nice to get rid of the key autogeneration stuff. I'm not keen on the idea of unparameterised key autogeneration - anyone signing their modules should really supply the appropriate address elements.

That may make sense for distributions or those who will be using long-lived keys, but it makes little sense for a more basic use case. With characteristic bluntness, Torvalds pointed that out:

You seem to dismiss the "people want to build their own kernel" people entirely.

One of the main sane use-cases for module signing is:

- randomly generated one-time key
- "make modules_install; make install"
- "make clean" to get rid of the keys.
- reboot.

and now you have a custom kernel that has the convenience of modules, yet is basically as safe as a non-modular build. The above makes it much harder for any kind of root-kit module to be loaded, and basically entirely avoids one fundamental security scare of modules.

Kroah-Hartman agreed with the need to support the use case Torvalds described, though he noted that keys are not removed by make clean, which he considered a bit worrisome. It turns out that make clean is documented to leave the files needed to build modules, so make distclean should be used to get rid of the key files.

Russell, who has always been a bit skeptical of module signing, pointed out that Torvalds's use case could be handled by just storing the hashes of the modules in the kernel—no cryptography necessary. While that's true, Russell's scheme would disallow some other use cases. Signing provides flexibility, Torvalds said, and is "technically the right thing to do". Russell countered:
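Russell's alternative is described here in a single sentence, so the following is only a sketch of the general idea under that reading: the build records a digest of each module, the list is compiled into the kernel, and the loader accepts only modules whose digest appears in it. FNV-1a is used purely to keep the example short; it is not collision-resistant, and a real scheme would need a cryptographic hash such as SHA-256. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a: a stand-in digest, NOT suitable for real security. */
static uint64_t fnv1a64(const unsigned char *p, size_t n)
{
    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV offset basis */
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;               /* FNV prime */
    }
    return h;
}

/* Returns 1 if the module image's digest is in the built-in list, else 0. */
int module_allowed(const unsigned char *image, size_t len,
                   const uint64_t *builtin, size_t nbuiltin)
{
    uint64_t h = fnv1a64(image, len);
    for (size_t i = 0; i < nbuiltin; i++)
        if (builtin[i] == h)
            return 1;
    return 0;
}
```

The trade-off shows up in the design: the list is fixed when the kernel is built, so a module compiled later cannot be loaded without rebuilding the kernel, whereas a signature made with the same key would still verify.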

It's 52k of extra text to get that 'nice flexible'; 1% of my kernel image. That's a lot of bug free code.

Russell's concerns notwithstanding, it is clear that module signing is here to stay. Torvalds's change was added for 3.7 (with some additions by Russell and Howells). For distributions, Josh Boyer has a patch that will add a "modules_sign" target. It will operate on the modules in their installed location (i.e. after a modules_install), and remove the signature, which will allow the distribution packaging system (e.g. RPM) to generate debuginfo for the modules before re-signing them. In that way, distributions can use Torvalds's solution at the cost of signing modules twice. Since that process should be far less frequent than developers building kernels (or build farms building kernels or ...), that tradeoff is likely to be well worth that small amount of pain.

Comments (none posted)

Page editor: Jonathan Corbet


Gentoo's udev fork

By Jonathan Corbet
November 21, 2012
The ability to fork a project is one of the fundamental freedoms that come with free software. If a software project is heading in a direction that is less than ideal for its users or developers, a competing version can be created and managed in a more useful manner. Forking has been used to great effect with projects like Emacs, GCC, and XFree86. The most successful forks have specific goals in mind and tend to attract at least a significant subset of the original project's developers. Other types of forks face a harder road. Arguably, a recently launched fork of the udev utility under the aegis of the Gentoo project is of the latter variety.

On November 13, the Gentoo council met to discuss, among other things, how to support systems that are configured with the /usr directory on a separate partition. The meeting minutes show that much of the discussion centered around a new udev fork that, it was hoped, might help to solve this problem. The existence of a new udev fork (now called eudev) within the Gentoo project took some developers by surprise, especially when the associated README file was observed to claim that it was a "Gentoo sponsored" project. This surprise led Greg Kroah-Hartman (a longtime Gentoo developer) to ask what the goals of the fork were. Getting an answer to that question turned out to be harder than one might have expected.

One of the developers behind eudev is Richard Yao; his response really needs to be read in its original form:

If we were using the waterfall model, I could outline some very nice long term goals for you, but we are doing AGILE development, so long term goals have not been well defined. Some short term goals have been defined, but I imagine that you are already familiar with them.

After extensive discussion with a lengthy digression on copyright law (the eudev developers removed some copyright notices in a way that drew objections), some of the project's motivations came into a bit of focus. Part of the problem is certainly the increased integration between udev and systemd. Udev is still easily built as a standalone binary, but some people worry that this situation might change in the future; mere association with systemd seems to be enough to provoke a response in some people.

That response carries over to the ongoing unhappiness over the deprecation of /usr on a separate partition. The developers involved claim that this configuration has not been supported for years and cannot be counted on to work. In truth, though, it does work for a lot of people, and those people are feeling like they are having a useful option taken away from them. Whether a fork like eudev can address those concerns remains to be seen.

Beyond that, a recent switch to the "kmod" library for the loading of kernel modules has created a certain amount of irritation; backing out that feature is one of the first changes made by the eudev developers. Udev uses kmod for a reason: avoiding modprobe calls speeds the bootstrap process considerably. But Gentoo developers like fine control over their systems, and some of them want to use that control to exclude kmod, which they see as unnecessary bloat or even a potential security problem. If udev requires kmod, that desire is thwarted, so the change has to come out.

There was also some discontent over the firmware loading difficulties caused by a udev change earlier this year. That problem has since been fixed; indeed, it has been fixed twice: once by loading firmware directly from the kernel, and once in udev. But some developers have not forgotten the incident and feel that the current udev maintainers cannot be trusted.

In truth, a bit of concern is understandable. The eudev developers point to statements like this one from Lennart Poettering:

Yes, udev on non-systemd systems is in our eyes a dead end, in case you haven't noticed it yet. I am looking forward to the day when we can drop that support entirely.

After reading that, it is natural to wonder if the current udev maintainers can truly be trusted to look after the interests of users who do not plan to switch to systemd. From there, it is not too hard to imagine maintaining a fork of udev as an "insurance policy" against misbehavior in the future.

That said, a better insurance policy might be to establish oneself as a participating and respected member of the current systemd/udev development community. The strong personalities found there notwithstanding, it is an active community with developers representing a number of distributions. A developer who can work within that group should be able to represent the interests of a distribution like Gentoo nicely while avoiding the costs of maintaining a fork of a crucial utility. And, should that strategy fail, creating a fork of udev remains an option in the future.

But nobody can tell free software developers what they can or cannot work on, and nobody is trying to tell the eudev developers that creating their own udev fork is wrong. The situation becomes a bit less clear, though, if eudev is destined to replace udev within Gentoo itself; then Gentoo users may well find they have an opinion on the matter. For now, no such replacement has happened. If it begins to look like that situation could change, one can imagine that the resulting discussion in the Gentoo community will be long and entertaining.

Comments (41 posted)

Brief items

Linux Mint 14 released

The Linux Mint team has released Linux Mint 14 "Nadia". "For the first time since Linux Mint 11, the development team was able to capitalize on upstream technology which works and fits its goals. After 6 months of incremental development, Linux Mint 14 features an impressive list of improvements, increased stability and a refined desktop experience. We’re very proud of MATE, Cinnamon, MDM and all the components used in this release, and we’re very excited to show you how they all fit together in Linux Mint 14."

Comments (none posted)

Distribution News


Fedora elections questionnaire answers are up

Fedora elections for seats on the advisory board, FESCo (Fedora Engineering Steering Committee) and FAmSCo (Fedora Ambassadors Steering Committee) are underway. The candidates' responses to questions from the community are available.

Full Story (comments: none)

Results of Fedora 19 Release Name Voting

A release name for Fedora 19 has been selected. F19 will also be known as Schrödinger's Cat.

Full Story (comments: 3)

Newsletters and articles of interest

Review: Ubuntu 12.10 Quantal Quetzal a mix of promise, pain (ars technica)

Ars Technica has a review of Ubuntu 12.10, or Quantal Quetzal. "One of the more intriguing desktop features for Ubuntu 12.10 is the inclusion of the Web Apps feature trialed in Ubuntu 12.04. Web apps are controls that support various popular Web tools, such as Gmail, Twitter, and Google Docs. Ubuntu includes two such Web apps out of the box: Amazon and Ubuntu One."

Comments (none posted)

Page editor: Rebecca Sobol


LCE: Don't play dice with random numbers

By Michael Kerrisk
November 20, 2012

H. Peter Anvin has been involved with Linux for more than 20 years, and is currently one of the x86 architecture maintainers. During his work on Linux, one of his areas of interest has been the generation and use of random numbers, and his talk at LinuxCon Europe 2012 was designed to address a lot of misunderstandings that he has encountered regarding random numbers. The topic is complex, and so, for the benefit of experts in the area, he noted that "I will be making some simplifications". (In addition, your editor has attempted to fill out some details that Peter mentioned only briefly, and so may have added some "simplifications" of his own.)

Random numbers

Possibly the first use of random numbers in computer programs was for games. Later, random numbers were used in Monte Carlo simulations, where randomness can be used to mimic physical processes. More recently, random numbers have become an essential component in security protocols.

Randomness is a subtle property. To illustrate this, Peter displayed a photograph of three icosahedral dice that he'd thrown at home, saying "here, if you need a random number, you can use 846". Why doesn't this work, he asked. First of all, a random number is only random once. In addition, it is only random until we know what it is. These facts are not the same thing. Peter noted that it is possible to misuse a random number by reusing it; this can lead to breaches in security protocols.
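A classic illustration of the reuse problem comes from ECDSA signatures, which need a fresh random value k for every signature. With private key d, group order n, message hashes z, and signatures (r, s), two signatures made with the same k (and therefore the same r) give up the private key after a few lines of algebra, provided z_1 and z_2 differ modulo n. Sony's PS3 signing is widely reported to have made exactly this mistake by using a constant k.

```latex
s_1 \equiv k^{-1}(z_1 + r d), \qquad s_2 \equiv k^{-1}(z_2 + r d) \pmod{n}

s_1 - s_2 \equiv k^{-1}(z_1 - z_2) \;\Longrightarrow\; k \equiv (z_1 - z_2)\,(s_1 - s_2)^{-1} \pmod{n}

d \equiv (s_1 k - z_1)\, r^{-1} \pmod{n}
```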

There are no tests for randomness. Indeed, there is a yet-to-be-proved mathematical conjecture that there are no tractable tests of randomness. On the other hand, there are tests for some kinds of non-randomness. Peter noted that, for example, we can probably quickly deduce that the bit stream 101010101010… is not random. However, tests don't prove randomness: they simply show that we haven't detected any non-randomness. Writing reliability tests for random numbers requires an understanding of the random number source and the possible failure modes for that source.
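Two of the simplest such tests show how each catches only one kind of non-randomness. The sketch below (function names hypothetical, in the spirit of the frequency and runs tests found in statistical test suites) counts ones and counts runs of identical bits. The stream 1010101010 passes the frequency check perfectly, with exactly half ones, yet has a run at every position where a fair source would give only about half as many; a stream of five ones followed by five zeros also has the right frequency but far too few runs.

```c
#include <stddef.h>

/* Number of 1 bits; a fair source gives roughly n/2. */
size_t count_ones(const unsigned char *bits, size_t n)
{
    size_t ones = 0;
    for (size_t i = 0; i < n; i++)
        ones += bits[i] != 0;
    return ones;
}

/*
 * Number of runs (maximal blocks of equal bits); for independent fair
 * bits the expected value is 1 + (n - 1) / 2.
 */
size_t count_runs(const unsigned char *bits, size_t n)
{
    if (n == 0)
        return 0;
    size_t runs = 1;
    for (size_t i = 1; i < n; i++)
        runs += bits[i] != bits[i - 1];
    return runs;
}
```

Passing both checks proves nothing: the output of a simple PRNG will usually pass them while remaining completely predictable.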

Most games that require some randomness will work fine even if the source of randomness is poor. However, for some applications, the quality of the randomness source "matters a lot".

Why does getting good randomness matter? If the random numbers used to generate cryptographic keys are not really random, then the keys will be weak and subject to easier discovery by an attacker. Peter noted a couple of recent cases of poor random number handling in the Linux ecosystem. In one of these cases, a well-intentioned Debian developer reacted to a warning from a code analysis tool by removing a fundamental part of the random number generation in OpenSSL. As a result, for a long time every SSH key generated on Debian was one of only 32,767 possible keys. Enumerating that set of keys is, of course, a trivial task. The resulting security bug went unnoticed for more than a year. The problem is that, unless you are testing for this sort of failure in the randomness source, good random numbers are hard to distinguish from bad ones. In another case, certain embedded Linux devices have been known to generate a key before they could generate good random numbers. A weakness along these lines allowed the Sony PS3 root key to be cracked [PDF] (see pages 122 to 130 of that presentation, and also this video of the presentation, starting at about 35'24").
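To make "trivial" concrete, here is a toy model, not OpenSSL code. It assumes, as in the Debian case, that the process ID in the range 1 to 32,767 was essentially the only varying input, and it substitutes a simple non-cryptographic mixing function (splitmix64) for real key generation. An attacker who sees a key just tries every possible PID.

```c
#include <stdint.h>

#define PID_MAX 32767 /* assumption: the only "entropy" is a PID in 1..32767 */

/* Toy key derivation: the splitmix64 output function. Not real crypto. */
uint64_t toy_keygen(uint64_t seed)
{
    uint64_t z = seed + 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

/* Attacker: try every possible PID; returns the matching PID, or 0. */
uint64_t recover_seed(uint64_t key)
{
    for (uint64_t pid = 1; pid <= PID_MAX; pid++)
        if (toy_keygen(pid) == key)
            return pid;
    return 0;
}
```

Recovering the seed takes at most 32,767 cheap function calls, a fraction of a second on any modern machine.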

Poor randomness can also be a problem for storage systems that depend on some form of probabilistically unique identifiers. Universally unique IDs (UUIDs) are the classic example. There is no theoretical guarantee that UUIDs are unique. However, if they are properly generated from truly random numbers, then, for all practical purposes, the chance of two UUIDs being the same is virtually zero. But if the source of random numbers is poor, this is no longer guaranteed.
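"Virtually zero" can be made concrete with the standard birthday approximation. Assuming version 4 UUIDs as defined in RFC 4122, which carry 122 random bits, the probability p of any collision among n UUIDs, and its value for a billion UUIDs, is given in the first two lines below. The last line shows, for a hypothetical broken source that supplies only 32 bits of real randomness, how few UUIDs it takes before a collision becomes more likely than not.

```latex
p(n) \approx 1 - e^{-n(n-1)/(2 \cdot 2^{122})} \approx \frac{n^2}{2^{123}}

n = 10^9: \quad p \approx \frac{10^{18}}{1.06 \times 10^{37}} \approx 9.4 \times 10^{-20}

\text{32 random bits: } p = \tfrac{1}{2} \text{ at } n \approx \sqrt{2 \ln 2 \cdot 2^{32}} \approx 77{,}000
```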

Of course, computers are not random; hardware manufacturers go to great lengths to ensure that computers behave reliably and deterministically. So, applications need methods to generate random numbers. Peter then turned to a discussion of two types of random number generators (RNGs): pseudo-random number generators and hardware random number generators.

Pseudo-random number generators

The classical solution for generating random numbers is the so-called pseudo-random number generator (PRNG):

A PRNG has two parts: a state, which is some set of bits determined by a "seed" value, and a chaotic process that operates on the state and updates it, at the same time producing an output sequence of numbers. In early PRNGs, the size of the state was very small, and the chaotic process simply multiplied the state by a constant and discarded the high bits. Modern PRNGs have a larger state, and the chaotic process is usually a cryptographic primitive. Because cryptographic primitives are usually fairly slow, PRNGs using non-cryptographic primitives are still sometimes employed in cases where speed is important.

The quality of PRNGs is evaluated on a number of criteria. For example, without knowing the seed and algorithm of the PRNG, is it possible to derive any statistical patterns in or make any predictions about the output stream? One statistical property of particular interest in a PRNG is its cycle length. The cycle length tells us how many numbers can be generated before the state returns to its initial value; from that point on, the PRNG will repeat its output sequence. Modern PRNGs generally have extremely long cycle lengths. However, some applications still use hardware-implemented PRNGs with short cycle lengths, because they don't really need high-quality randomness. Another property of PRNGs that is of importance in security applications is whether the PRNG algorithm is resistant to analysis: given the output stream, is it possible to figure out what the state is? If an attacker can do that, then it is possible to predict future output of the PRNG.
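The early style of generator described above can be sketched in a few lines. This toy uses a 16-bit state and a hypothetical multiplier (25173); multiplying and keeping only the low 16 bits is exactly "multiply by a constant and discard the high bits". For an odd seed, a multiplier congruent to 3 or 5 modulo 8 gives the longest cycle this scheme allows modulo 2^16, namely 2^14 = 16,384 outputs, and even seeds fall into shorter cycles. Because the output here is the state itself, a single observed output also reveals every future output, which is the opposite of resistance to analysis.

```c
#include <stdint.h>

#define MULT 25173u /* hypothetical multiplier; 25173 % 8 == 5 */

/* One step of an "early" PRNG: multiply, then keep only the low 16 bits. */
uint16_t mult_step(uint16_t state)
{
    return (uint16_t)((uint32_t)state * MULT); /* cast discards high bits */
}

/* Count steps until the state returns to the seed: the cycle length. */
uint32_t cycle_length(uint16_t seed)
{
    uint16_t s = seed;
    uint32_t n = 0;
    do {                /* terminates: multiplying by an odd MULT is a bijection */
        s = mult_step(s);
        n++;
    } while (s != seed);
    return n;
}
```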

Hardware random number generators

The output of a PRNG is only as good as its seed and algorithm, and while it may pass all known tests for non-randomness, it is not truly random. But, Peter noted, there is a source of true randomness in the world: quantum mechanics. This fact can be used to build hardware "true" random number generators.

A hardware random number generator (HWRNG) consists of a number of components, as shown in the following diagram:

Entropy is a measure of the disorder, or randomness, in a system. An entropy source is a device that "harvests" quantum randomness from a physical system. The process of harvesting the randomness may be simple or complex, but regardless of that, Peter said, you should have a good argument as to why the harvested information truly derives from a source of quantum randomness. Within a HWRNG, the entropy source is necessarily a hardware component, but the other components may be implemented in hardware or software. (In Peter's diagrams, redrawn and slightly modified for this article, yellow indicated a hardware component, and blue indicated a software component.)

Most entropy sources don't produce "good" random numbers. The source may, for example, produce ones only 25% of the time. This doesn't negate the value of the source. However, the "obvious" non-randomness must be eliminated; that is the task of the conditioner.
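The talk did not say which conditioner any particular device uses. Real designs are often built from cryptographic primitives. As a simpler, swapped-in illustration of what removing "obvious" non-randomness means, the classic von Neumann extractor reads bits in pairs. It emits the first bit of each unequal pair (01 or 10) and discards equal pairs. If the input bits are independent, even a source that produces ones only 25% of the time yields unbiased output, at the cost of discarding most of the input. The simulated source below is an assumption; a seeded software generator stands in for physical hardware.

```python
import random


def von_neumann(bits):
    """Debias independent bits: 01 -> 0, 10 -> 1, 00 and 11 are dropped."""
    out = []
    for first, second in zip(bits[0::2], bits[1::2]):
        if first != second:
            out.append(first)
    return out


# Simulated biased entropy source: ones only 25% of the time, independent bits.
rng = random.Random(2012)  # deterministic stand-in for a physical source
raw = [1 if rng.random() < 0.25 else 0 for _ in range(200_000)]
clean = von_neumann(raw)

print(f"raw ones: {sum(raw) / len(raw):.3f}")
print(f"conditioned ones: {sum(clean) / len(clean):.3f} from {len(clean)} bits")
```

Roughly four input bits in five are thrown away here. The extractor also does nothing about correlated input, which is one reason practical conditioners are more elaborate.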

The output of the conditioner is then fed into a cryptographically secure pseudo-random number generator (CSPRNG). The reason for doing this is that we can better reason about the output of a CSPRNG; by contrast, it is difficult to reason about the output of the entropy source. Thus, it is possible to say that the resulting device is at least as secure as a CSPRNG, but, since we have a constant stream of new seeds, we can be confident that it is actually a better source of random numbers than a CSPRNG that is seeded less frequently.
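A toy sketch of the conditioner-to-CSPRNG arrangement follows. It assumes SHA-256 from Python's hashlib as the cryptographic primitive; the talk did not name one for HWRNGs in general. Output blocks are hashes of a key and a counter. Each reseed() call folds fresh conditioner output into the key, so output produced after a reseed depends on the new input as well as the old state. The class and method names are hypothetical, and this is not a vetted DRBG design.

```python
import hashlib


class ToyHashCSPRNG:
    """Hypothetical SHA-256-based generator that accepts fresh seed material."""

    def __init__(self, seed: bytes) -> None:
        self.key = hashlib.sha256(b"seed" + seed).digest()
        self.counter = 0

    def reseed(self, fresh_entropy: bytes) -> None:
        # Fold new conditioner output into the existing key instead of
        # replacing it, so a poor new input does not discard the old state.
        self.key = hashlib.sha256(b"reseed" + self.key + fresh_entropy).digest()

    def generate(self, n: int) -> bytes:
        out = bytearray()
        while len(out) < n:
            # Output blocks are SHA-256(key || counter).
            out += hashlib.sha256(self.key + self.counter.to_bytes(8, "big")).digest()
            self.counter += 1
        # Ratchet the key through a one-way hash so that a later snapshot
        # of the state does not reveal output that was already handed out.
        self.key = hashlib.sha256(b"ratchet" + self.key).digest()
        return bytes(out[:n])


gen = ToyHashCSPRNG(b"first conditioner output")
block1 = gen.generate(32)
gen.reseed(b"later conditioner output")  # the "constant stream of new seeds"
block2 = gen.generate(32)
```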

The integrity monitor detects failures in the entropy source, addressing the problem that entropy sources can fail silently. For example, a circuit in the entropy source might pick up an induced signal from a wire on the same chip, with the result that the source outputs a predictable pattern. The integrity monitor therefore looks for the kinds of failure modes that are typical of this kind of source; if it detects a failure, it produces an error indication.
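A minimal sketch of an integrity monitor appears below. It is loosely modeled on the repetition-count and proportion-style health tests described in NIST SP 800-90B. The function name, the exception class, and all cutoff values are assumptions chosen for illustration and are not taken from any real device. The monitor raises an error if the source gets stuck on one value, or if one value dominates a window of samples.

```python
class EntropySourceFailure(Exception):
    """Raised when the health tests detect a likely entropy source failure."""


def check_health(bits, max_run=32, window=512, max_same_in_window=410):
    """Hypothetical integrity monitor for a list of 0/1 samples."""
    # Repetition-count style test: a long run of one value suggests a stuck source.
    run = 1
    for prev, cur in zip(bits, bits[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_run:
            raise EntropySourceFailure(f"same value repeated {run} times")
    # Proportion style test: in each full window, neither value may dominate.
    for start in range(0, len(bits) - window + 1, window):
        ones = sum(bits[start:start + window])
        if max(ones, window - ones) > max_same_in_window:
            raise EntropySourceFailure(f"window at {start}: {ones} ones of {window}")


try:
    check_health([1] * 100)  # a source stuck at one
except EntropySourceFailure as err:
    print("error indication:", err)

check_health([0, 1] * 2048)  # a predictable alternating pattern still passes
```

Generic tests like these only catch gross failures. An alternating pattern, such as an induced signal might produce, slips past both, so a real monitor is tailored to the failure modes of its particular source.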

There are various properties of a HWRNG that are important to users. One of these is bandwidth—the rate at which output bits are produced. HWRNGs vary widely in bandwidth. At one end, the physical drum-and-ball systems used in some public lottery draws produce at most a few numbers per minute. At the other end, some electronic hardware random number sources can generate output at the rate of gigabits per second. Another important property of HWRNGs is resistance to observation. An attacker should not be able to look into the conditioner or CSPRNG and figure out the future state.

Peter then briefly looked at a couple of examples of entropy sources. One of these is the recent Bull Mountain Technology digital random number generator produced by his employer (Intel). This device contains a logic circuit that is forced into an impossible state between 0 and 1, until the circuit is released by a CPU clock cycle, at which point quantum thermal noise forces the circuit randomly to zero or one. Another example of a hardware random number source—one that has actually been used—is a lava lamp. The motion of the liquids inside a lava lamp is a random process driven by thermal noise. A digital camera can be used to extract that randomness.

The Linux kernel random number generator

The Linux kernel RNG has the following basic structure:

The structure consists of a two-level cascaded sequence of pools coupled with CSPRNGs. Each pool is a large group of bits which represents the current state of the random number generator. The CSPRNGs are currently based on SHA-1, but the kernel developers are considering a switch to SHA-3.

The kernel RNG produces two user-space output streams. One of these goes to /dev/urandom and also to the kernel itself; the latter is useful because there are uses for random numbers within the kernel. The other output stream goes to /dev/random. The difference between the two is that /dev/random tries to estimate how much entropy is coming into the system, and will throttle its output if there is insufficient entropy. By contrast, the /dev/urandom stream does not throttle output, and if users consume all of the available entropy, the interface degrades to a pure CSPRNG.
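The behavioral difference can be modeled with a toy entropy counter. This is a hypothetical simplification, not the kernel's actual accounting, which estimates entropy from event timings and uses separate pools. In this model both readers draw from the same hash-based generator. The blocking reader refuses to return more bits than the current estimate, while the non-blocking reader always returns data and simply lets the estimate run down to zero. All names here are invented for the sketch.

```python
import hashlib


class ToyEntropyPool:
    """Hypothetical model of the entropy accounting behind the two devices."""

    def __init__(self) -> None:
        self.state = b"\x00" * 32
        self.entropy_bits = 0  # estimated entropy currently in the pool
        self.counter = 0

    def add_input(self, sample: bytes, estimated_bits: int) -> None:
        # Mix the sample in and credit the entropy it is estimated to carry.
        self.state = hashlib.sha256(self.state + sample).digest()
        self.entropy_bits += estimated_bits

    def _extract(self, n: int) -> bytes:
        out = b""
        while len(out) < n:
            self.counter += 1
            out += hashlib.sha256(self.state + self.counter.to_bytes(8, "big")).digest()
        return out[:n]

    def read_random(self, n: int):
        """Like /dev/random of that era: refuse if the estimate is too low."""
        if self.entropy_bits < 8 * n:
            return None  # a real reader would block here until more input arrives
        self.entropy_bits -= 8 * n
        return self._extract(n)

    def read_urandom(self, n: int) -> bytes:
        """Like /dev/urandom: always answer; at zero it is pure CSPRNG output."""
        self.entropy_bits = max(0, self.entropy_bits - 8 * n)
        return self._extract(n)


pool = ToyEntropyPool()
pool.add_input(b"interrupt timing sample", estimated_bits=64)
print(pool.read_random(16))           # None: 128 bits wanted, only 64 estimated
print(len(pool.read_urandom(16)))     # 16: urandom never refuses
```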

Starting with Linux 3.6, if the system provides an architectural HWRNG, then the kernel will XOR the output of the HWRNG with the output of each CSPRNG. (An architectural HWRNG is a complete random number generator designed into the hardware and guaranteed to be available in future generations of the chip. Such a HWRNG makes its output stream directly available to user space via dedicated assembler instructions.) Consequently, Peter said, the output will be even more secure. A member of the audience asked why the kernel couldn't just do away with the existing system and use the HWRNG directly. Peter responded that some people had been concerned that if the HWRNG turned out not to be good enough, then this would result in a security regression in the kernel. (It is also worth noting that some people have wondered whether the design of HWRNGs may have been influenced by certain large government agencies.)
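The combining step itself is a bytewise XOR. If the two inputs are independent, the result is at least as unpredictable as the better of them. A weak or even completely failed HWRNG therefore leaves the output no worse than the CSPRNG alone. The caveat is independence: in principle, a malicious source that could observe the other stream could cancel it out. In this sketch os.urandom() merely stands in for the kernel CSPRNG output and a block of zeros stands in for a broken HWRNG; the kernel does this mixing internally, not in user space.

```python
import os


def xor_streams(csprng_out: bytes, hwrng_out: bytes) -> bytes:
    """Combine two equal-length outputs bytewise by XOR."""
    if len(csprng_out) != len(hwrng_out):
        raise ValueError("streams must be the same length")
    return bytes(a ^ b for a, b in zip(csprng_out, hwrng_out))


csprng_block = os.urandom(16)  # stand-in for the kernel CSPRNG output
broken_hwrng = bytes(16)       # a failed HWRNG that returns only zeros
mixed = xor_streams(csprng_block, broken_hwrng)

# XOR with a constant removes no unpredictability: the result here is just
# the CSPRNG block, so a bad HWRNG is no worse than having no HWRNG.
assert mixed == csprng_block
```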

The inputs for the kernel RNG are shown in this diagram:

In the absence of anything else, the kernel RNG will use the timings of events such as hardware interrupts as inputs. In addition, certain fixed values such as the network card's MAC address may be used to initialize the RNG, in order to ensure that, even in the absence of any other input, different machines will at least be seeded with distinct values.
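Mixing event timings and fixed per-machine values into a pool can be sketched as follows. This is a hypothetical illustration using SHA-256 and time.perf_counter_ns(), not the kernel's mixing function, and the MAC addresses are made up. A MAC address adds no secret entropy because it is often visible on the network. It does, however, make two otherwise identical machines start from different pool states.

```python
import hashlib
import time


def initial_pool(device_id: bytes) -> bytes:
    # A fixed, usually public per-machine value (such as a MAC address)
    # adds no secret entropy, but it makes otherwise identical machines
    # start from different pool states.
    return hashlib.sha256(b"init" + device_id).digest()


def mix_event(pool: bytes, timestamp_ns: int) -> bytes:
    # Fold the timing of an event (for example, an interrupt) into the pool.
    # Only the unpredictable jitter in the timestamp contributes entropy.
    return hashlib.sha256(pool + timestamp_ns.to_bytes(16, "big")).digest()


# Two machines with made-up MAC addresses that see identical event timings.
pool_a = initial_pool(bytes.fromhex("001122334455"))
pool_b = initial_pool(bytes.fromhex("001122334466"))
for _ in range(4):
    now = time.perf_counter_ns()  # stand-in for an interrupt timestamp
    pool_a = mix_event(pool_a, now)
    pool_b = mix_event(pool_b, now)
print(pool_a != pool_b)  # True: the fixed identifiers keep them apart
```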

The rngd program is a user-space daemon whose job is to feed inputs (normally from the HWRNG driver, /dev/hwrng) to the kernel RNG input pool, after first performing some tests for non-randomness in the input. If there is a HWRNG with a kernel driver, then rngd will use that as its input source. In a virtual machine, the driver is also capable of taking random numbers from the host system via the virtio system. Starting with Linux 3.7, randomness can also be harvested from the HWRNG of the Trusted Platform Module (TPM) if the system has one. (In kernels before 3.7, rngd can access the TPM directly to obtain randomness, but doing so means that the TPM can't be used for any other purpose.) If the system has an architectural HWRNG, then rngd can harvest randomness from it directly, rather than going through the HWRNG driver.
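The rng-tools version of rngd screens its input with statistical tests from FIPS 140-2. As a hedged illustration, here is only the simplest of those, the monobit test. It assumes the FIPS 140-2 acceptance range of 9,725 < ones < 10,275 in a 20,000-bit block, and the function name is hypothetical. The example also shows a key limitation: a stuck source fails, but a perfectly predictable alternating pattern passes, because such tests can reveal only non-randomness.

```python
BLOCK_BITS = 20_000


def monobit_ok(bits) -> bool:
    """FIPS 140-2 style monobit test on one 20,000-bit block (bounds assumed)."""
    if len(bits) != BLOCK_BITS:
        raise ValueError("monobit test expects exactly 20,000 bits")
    ones = sum(bits)
    return 9_725 < ones < 10_275


print(monobit_ok([1] * BLOCK_BITS))  # False: a stuck source is rejected
print(monobit_ok([0, 1] * 10_000))   # True: predictable, yet balanced
```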

Administrator recommendations

"You really, really want to run rngd", Peter said. It should be started as early as possible during system boot-up, so that the applications have early access to the randomness that it provides.

One thing you should not do is the following:

    rngd -r /dev/urandom

Peter noted that he had seen this command in several places on the web. Its effect is to connect the output of the kernel's RNG back into itself, fooling the kernel into believing it has an endless supply of entropy.

HAVEGE (HArdware Volatile Entropy Gathering and Expansion) is a piece of user-space software that claims to extract entropy from CPU nondeterminism. Having read a number of papers about HAVEGE, Peter said he had been unable to work out whether this was a "real thing". Most of the papers that he has read run along the lines of "we took the output from HAVEGE, ran some tests on it, and all of the tests passed". The problem with this sort of reasoning is the point that Peter made earlier: there are no tests for randomness, only for non-randomness.

One of Peter's colleagues replaced the random input source employed by HAVEGE with a constant stream of ones. All of the same tests passed. In other words, all that the test results are guaranteeing is that the HAVEGE developers have built a very good PRNG. It is possible that HAVEGE does generate some amount of randomness, Peter said. But the problem is that the proposed source of randomness is simply too complex to analyze; thus it is not possible to make a definitive statement about whether it is truly producing randomness. (By contrast, the HWRNGs that Peter described earlier have been analyzed to produce a quantum theoretical justification that they are producing true randomness.) "So, while I can't really recommend it, I can't not recommend it either." If you are going to run HAVEGE, Peter strongly recommended running it together with rngd, rather than as a replacement for it.
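The colleague's experiment is easy to reproduce in miniature. The sketch below assumes, purely for illustration, that a "whitening" stage is SHA-256 in counter mode; HAVEGE's actual algorithm is different, and the helper names are hypothetical. Fed a constant stream of ones, the stage still produces output whose ones fraction is essentially 50%, so a bit-balance test cannot distinguish it from a good source. Yet the input carries no entropy, so anyone who knows the algorithm can regenerate the output exactly.

```python
import hashlib


def whiten(raw: bytes, n_bytes: int) -> bytes:
    """Hypothetical whitening stage: SHA-256 over (input, counter) blocks."""
    out = bytearray()
    counter = 0
    while len(out) < n_bytes:
        out += hashlib.sha256(raw + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n_bytes])


def ones_fraction(data: bytes) -> float:
    return sum(bin(byte).count("1") for byte in data) / (8 * len(data))


constant_input = b"\xff" * 64             # "a constant stream of ones"
output = whiten(constant_input, 125_000)  # one million output bits
print(f"{ones_fraction(output):.4f}")     # close to 0.5 despite zero entropy
```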

Guidelines for application writers

If you are writing applications that need to use a lot of randomness, you really want to use a cryptographic library such as OpenSSL, Peter said. Every cryptographic library has a component for dealing with random numbers. If you need just a little bit of randomness, then just use /dev/random or /dev/urandom. The difference between the two is how they behave when entropy is in short supply. Reads from /dev/random will block until further entropy is available. By contrast, reads from /dev/urandom will always return data immediately, but that data will degrade to CSPRNG output when entropy is exhausted. So, if you would rather have your application fail than receive randomness that is not of the highest quality, then use /dev/random. On the other hand, if you always want a non-blocking stream of (possibly pseudo-) random data, then use /dev/urandom.
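Below is a sketch of the two policies, assuming a Linux system; the helper name is hypothetical. The blocking behavior of /dev/random described above is that of kernels of that era, and recent kernels no longer block it once the RNG has been initialized. Opening /dev/random with O_NONBLOCK lets an application that prefers failure to lower-quality data see a shortfall as BlockingIOError instead of hanging. Reading through os.read() also avoids any user-space buffering.

```python
import os


def random_bytes(n: int, prefer_failure: bool) -> bytes:
    """Hypothetical helper: strict policy uses /dev/random, else /dev/urandom."""
    path = "/dev/random" if prefer_failure else "/dev/urandom"
    flags = os.O_RDONLY | (os.O_NONBLOCK if prefer_failure else 0)
    fd = os.open(path, flags)
    try:
        chunks = []
        remaining = n
        while remaining > 0:
            # On /dev/random this raises BlockingIOError when entropy is
            # short, rather than quietly returning lower-quality data.
            chunk = os.read(fd, remaining)
            if not chunk:
                raise OSError(f"unexpected end of file on {path}")
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


token = random_bytes(16, prefer_failure=False)  # never blocks
```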

"Please conserve randomness!", Peter said. If you are running recent hardware that has a HWRNG, then there is a virtually unlimited supply of randomness. But the majority of existing hardware does not have the benefit of a HWRNG. Don't use buffered I/O on /dev/random or /dev/urandom. The effect of performing a buffered read on one of these devices is to consume a large amount of the possibly limited entropy. For example, the C library stdio functions operate in buffered mode by default, and an initial read will consume 8192 bytes as the library fills its input buffer. A well-written application should use unbuffered I/O and read only as much randomness as it needs.

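Python's default open() mode is buffered in much the same way as stdio, so the same waste applies there. The sketch below compares how many bytes a 16-byte request actually pulls from an underlying file with and without a read-ahead buffer. A temporary file stands in for the device so that the file position can be measured. The sketch then reads exactly 16 bytes from /dev/urandom with buffering=0. It assumes a Linux or other Unix-like system, and the helper name is hypothetical.

```python
import io
import os
import tempfile


def read_exact_unbuffered(path: str, n: int) -> bytes:
    """Read exactly n bytes with no user-space read-ahead (buffering=0)."""
    with open(path, "rb", buffering=0) as f:  # raw FileIO object
        data = b""
        while len(data) < n:
            chunk = f.read(n - len(data))
            if not chunk:
                raise EOFError(f"{path} ended early")
            data += chunk
        return data


# Measure how much a 16-byte request really pulls from the underlying file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(65536))  # stand-in for the random device
try:
    with io.BufferedReader(io.FileIO(tmp.name, "rb")) as buffered:
        buffered.read(16)
        buffered_consumed = buffered.raw.tell()  # far more than 16 bytes
    with io.FileIO(tmp.name, "rb") as raw:
        raw.read(16)
        unbuffered_consumed = raw.tell()         # exactly 16 bytes
finally:
    os.unlink(tmp.name)

key = read_exact_unbuffered("/dev/urandom", 16)
print(buffered_consumed, unbuffered_consumed, len(key))
```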
Where possible, defer the extraction of randomness until as late as possible in an application. For example, in a network server that needs randomness, it is preferable to defer extraction of that randomness until (say) the first client connection, rather than extracting it when the server starts. The problem is that most servers start early in the boot process, when there may be little entropy available and many applications may be competing for randomness. If the randomness is being extracted from /dev/urandom, the effect may be that it degrades to a CSPRNG stream. If it is being extracted from /dev/random, then reads will block until the system has generated enough entropy.
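One way to express "extract late" in code is lazy initialization: the secret is drawn on first use rather than in the constructor. The sketch below uses functools.cached_property (Python 3.8 or later). The Server class and its methods are hypothetical, and os.urandom() stands in for whichever source the server actually uses.

```python
import os
from functools import cached_property


class Server:
    """Hypothetical server that defers drawing its secret until first use."""

    def __init__(self) -> None:
        self.started = True  # nothing is read from the kernel RNG here

    @cached_property
    def session_secret(self) -> bytes:
        # Evaluated on first access (for example, the first client
        # connection), typically well after boot when more entropy exists.
        return os.urandom(32)

    def handle_client(self) -> bytes:
        return self.session_secret


server = Server()                 # early in boot: no randomness consumed yet
secret = server.handle_client()   # first client: the secret is drawn now
```

Later calls reuse the cached value, so the server reads from the kernel only once.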

Future kernel work

Peter concluded his talk with a discussion of some upcoming kernel work. One of these pieces of work is the implementation of a policy interface that would allow the system administrator to configure certain aspects of the operation of the kernel RNG. So, for example, if the administrator trusts the chip vendor's HWRNG, then it should be possible to configure the kernel RNG to take its input directly from the HWRNG. Conversely, if you are a paranoid system administrator who doesn't trust the HWRNG, it should be possible to disable use of the HWRNG. These ideas were discussed by some of the kernel developers who were present at the 2012 Kernel Summit; the interface is still being architected, and will probably be available sometime next year.

Another piece of work is the completion of the virtio-rng system. This feature is useful in a virtual-machine environment. Guest operating systems typically have few sources of entropy. The virtio-rng system is a mechanism for the host operating system to provide entropy to a guest operating system. The guest side of this work was already completed in 2008. However, work on the host side (i.e., QEMU and KVM) got stalled for various reasons; hopefully, patches to complete the host side will be merged quite soon, Peter said.

A couple of questions at the end of the talk concerned the problem of live cloning a virtual machine image, including its memory (i.e., the virtual machine is not shut down and rebooted). In this case, the randomness pool of the cloned kernel will be duplicated in each virtual machine, which is a security problem. There is currently no way to invalidate the randomness pool in the clones, by (say) setting the clone's entropy count to zero. Peter thinks a (kernel) solution to this problem is probably needed.
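The duplication problem is easy to demonstrate with any generator whose state can be copied. In the sketch below, Python's random.Random, which is not a cryptographic generator, stands in for the guest kernel's pool. Copying its state with getstate() and setstate() mimics what a live clone does to the kernel RNG state.

```python
import random

# A seeded generator that has been running for a while stands in for the
# guest kernel's randomness pool.
original = random.Random(2012)
original.random()

# A live clone copies memory, so the clone receives the identical state.
clone = random.Random()
clone.setstate(original.getstate())

first_vm = [original.getrandbits(32) for _ in range(5)]
second_vm = [clone.getrandbits(32) for _ in range(5)]
print(first_vm == second_vm)  # True: both "VMs" emit the same numbers
```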

Concluding remarks

Interest in high-quality random numbers has increased in parallel with the growing demand to secure stored data and network communications with high-quality cryptographic keys. However, as various examples in Peter's talk demonstrated, there are many pitfalls to be wary of when dealing with random numbers. As HWRNGs become more prevalent, the task of obtaining high-quality randomness will become easier. But even if such hardware becomes universally available, it will still be necessary, for a long time, to deal with legacy systems that don't have a HWRNG. Furthermore, even with a near-infinite supply of randomness, it is still possible to make subtle but dangerous errors when dealing with random numbers.

Brief items

Quotes of the week

I think that it'd be cool to have our community be the community of people who can go wild on the platform - "let a thousand flowers bloom". That the core GNOME project is solid and useful, but that we encourage experimentation, respins, freedom for our users. That seems inconsistent with the current GNOME messaging.
Dave Neary

Our patent system is the envy of the world.
David Kappos, head of the United States Patent and Trademark Office

Upstart 1.6 released

Ubuntu's James Hunt announced the release of version 1.6 of the Upstart event-driven init system. This release adds support for initramfs-less booting, sbuild tests, and a stateful re-exec, which allows Upstart "to continue to supervise jobs after an upgrade of either itself, or any of its dependent libraries."

Thunderbird 17 released

Mozilla has released Thunderbird 17. According to the release notes, this version includes layout changes for RSS feeds and for tabs on Windows, and drops support for Mac OS X 10.5, not to mention the usual bundle of bugfixes and minor enhancements.

Firefox 17 released

Firefox 17 has been released. The release notes have all the details. Firefox 17 for Android has also been released, with separate release notes.

Day: The Next Step

At his blog, Allan Day outlines the next phase of GNOME 3's user experience development, which focuses on "content applications." The project is aiming to make it "quicker and less laborious for people to find content" and subsequently organize it. "To this end, we’re aiming to build a suite of new GNOME content applications: Music, Documents, Photos, Videos and Transfers. Each of these applications aims to provide a quick and easy way to access content, and will seamlessly integrate with the cloud." New mockups are available on the GNOME wiki.

GNOME Shell to support a "classic" mode

GNOME developer Matthias Clasen has announced that, with the upcoming demise of "fallback mode," the project will support a set of official GNOME Shell extensions to provide a more "classic" experience. "And while we certainly hope that many users will find the new ways comfortable and refreshing after a short learning phase, we should not fault people who prefer the old way. After all, these features were a selling point of GNOME 2 for ten years!"

Newsletters and articles

Development newsletters from the last week

Linux Color Management Hackfest reports

Linux color management developers met in Brno, Czech Republic over the weekend of November 10, and the lead developers from two teams have subsequently published their recaps of the event: Kai-Uwe Behrmann of Oyranos, and Richard Hughes of colord. Developers from GIMP, Taxi, and a host of printing-related projects were also on hand.

Garrett: More in the series of bizarre UEFI bugs

As more UEFI firmware becomes available, one would guess we'll find more exciting weirdness like what Matthew Garrett found. For whatever reason, the firmware in a Lenovo ThinkCentre M92p only wants to boot Windows or Red Hat Enterprise Linux (and, no, it is not secure boot related): "Every UEFI boot entry has a descriptive string. This is used by the firmware when it's presenting a menu to users - instead of "Hard drive 0" and "USB drive 3", the firmware can list "Windows Boot Manager" and "Fedora Linux". There's no reason at all for the firmware to be parsing these strings. But the evidence seemed pretty strong - given two identical boot entries, one saying "Windows Boot Manager" and one not, only the first would work."

Introducing Movit, free library for GPU-side video processing (Libre Graphics World)

Libre Graphics World takes a look at Movit, a new C++ library for video processing on GPUs. "Movit does all the processing on a GPU with GLSL fragment shaders. On a GT9800 with Nouveau drivers I was able to get 10.9 fps (92.1 ms/frame) for a 2971×1671px large image." The library currently performs color space conversions plus an assortment of general filters, and is probably best suited for non-linear video editor projects.

Page editor: Nathan Willis


Brief items

FSFE welcomes German Government's White Paper on "Secure Boot"

The Free Software Foundation Europe looks at a white paper from the German Ministry of the Interior about "Trusted Computing" and "Secure Boot". "The white paper says that "device owners must be in complete control of (able to manage and monitor) all the trusted computing security systems of their devices." This has been one of FSFE's key demands from the beginning. The document continues that "delegating this control to third parties requires conscious and informed consent by the device owner"."

Mozilla Foundation 2011 annual report

The Mozilla Foundation has released its 2011 annual report in a "collection of boxes" format that reminds one of a recent proprietary operating system release. "In June 2012 we released an update to Firefox for Android that we believe is the best browser for Android available. We completely rebuilt and redesigned the product in native UI, resulting in a snappy and dynamic upgrade to mobile browsing that is significantly faster than the Android stock browser."

Articles of interest

Linux Foundation Monthly Newsletter: November

The November edition of the Linux Foundation Monthly Newsletter covers the Automotive Grade Linux Workgroup, HP's platinum membership, open clouds, and several other topics.

Bottomley: Adventures in Microsoft UEFI Signing

James Bottomley's account of his UEFI bootloader signing experience is worth a read; there are still a few glitches in the system. "Once the account is created, you still can’t upload UEFI binaries for signature without first signing a paper contract. The agreements are pretty onerous, include a ton of excluded licences (including all GPL ones for drivers, but not bootloaders). The most onerous part is that the agreements seem to reach beyond the actual UEFI objects you sign. The Linux Foundation lawyers concluded it is mostly harmless to the LF because we don’t ship any products, but it could be nasty for other companies."

Apple Now Owns the Page Turn (New York Times)

The New York Times describes the latest innovation from Apple: a page turning animation for e-readers. Not only is it astonishingly brilliant, it's patented. "Apple argued that its patented page turn was unique in that it had a special type of animation other page-turn applications had been unable to create. [ ... ] The patent comes with three illustrations to explain how the page-turn algorithm works. In Figure 1, the corner of a page can be seen folding over. In Figure 2, the page is turned a little more. I’ll let you guess what Figure 3 shows."

Portuguese Government Adopts ODF (The Standards Blog)

Andy Updegrove covers a press release from the Portuguese Open Source Business Association on the government adoption of standard formats for documents. "[T]he Portuguese government has opted for ODF, the OpenDocument Format, as well as PDF and a number of other formats and protocols, including XML, XMPP, IMAP, SMTP, CALDAV and LDAP. The announcement is in furtherance of a law passed by the Portuguese Parliament on June 21 of last year requiring compliance with open standards (as defined in the same legislation) in the procurement of government information systems and when exchanging documents at citizen-facing government Web sites."

New Books

Python for Kids--New from No Starch Press

No Starch Press has released "Python for Kids" by Jason R. Briggs.

Calls for Presentations

Apache OpenOffice at FOSDEM 2013

There will be an Apache OpenOffice devroom at FOSDEM 2013, to be held February 2. The call for talks is open until December 23. FOSDEM (Free and Open Source software Developers' European Meeting) will take place February 2-3, 2013 in Brussels, Belgium.

Upcoming Events

Chumby Co-Inventor to Keynote at LCA

linux.conf.au (LCA) has announced the first of four keynote speakers for the 2013 conference. "Andrew "bunnie" Huang is best known as the lead hardware developer of open-source gadget "Chumby", a device designed from the ground up as an open source gadget, complete with open source hardware, and whose designers encourage hackers to get into the device and make it their own. He is also the author of "Hacking the Xbox", a book about reverse engineering consumer products and the social and practical issues around doing so."

Events: November 29, 2012 to January 28, 2013

The following event listing is taken from the Calendar.

Date(s) | Event | Location
November 29 - December 1 | FOSS.IN/2012 | Bangalore, India
November 29 - November 30 | Lua Workshop 2012 | Reston, VA, USA
November 30 - December 2 | Open Hard- and Software Workshop 2012 | Garching bei München, Germany
November 30 - December 2 | CloudStack Collaboration Conference | Las Vegas, NV, USA
December 1 - December 2 | Konferensi BlankOn #4 | Bogor, Indonesia
December 2 | Foswiki Association General Assembly | online and Dublin, Ireland
December 5 - December 7 | Qt Developers Days 2012 North America | Santa Clara, CA, USA
December 5 | 4th UK Manycore Computing Conference | Bristol, UK
December 5 - December 7 | Open Source Developers Conference Sydney 2012 | Sydney, Australia
December 7 - December 9 | CISSE 12 | Everywhere, Internet
December 9 - December 14 | 26th Large Installation System Administration Conference | San Diego, CA, USA
December 27 - December 30 | 29th Chaos Communication Congress | Hamburg, Germany
December 27 - December 29 | SciPy India 2012 | IIT Bombay, India
December 28 - December 30 | Exceptionally Hard & Soft Meeting 2012 | Berlin, Germany
January 18 - January 19 | Columbus Python Workshop | Columbus, OH, USA
January 18 - January 20 | FUDCon:Lawrence 2013 | Lawrence, Kansas, USA
January 20 | Berlin Open Source Meetup | Berlin, Germany

If your event does not appear here, please tell us about it.

Page editor: Rebecca Sobol

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds