LWN.net Weekly Edition for January 10, 2013

A few words about Simon 0.4.0

By Nathan Willis
January 9, 2013

The open source speech recognition project Simon unveiled version 0.4.0 on December 30, 2012, after two years of development. The new release boasts some significant architectural changes, so the project advises users not to replace existing versions on production systems. But the changes make Simon noticeably easier to work with, which will please new users. Conversing freely with one's Linux PC is still a ways off, but speech recognition with free software is no longer the exclusive domain of laboratory research.

"Speech recognition" can encompass a range of different projects, such as dictation (e.g., transcribing audio content) or detecting stress in a human voice. Simon is designed to function as a voice interface to the desktop computer; it listens to live audio input, picks out keywords intended as commands, and pipes them to other applications.

Categorical imperatives

Beginning with the 0.3.0 series released in 2010, Simon has based its command-recognition framework on the idea of separate "scenarios" for each application or use case. Scenarios can be as specific as the developer wishes to make them; a general web-browsing scenario for Firefox may be designed to handle only opening links and scrolling through pages, but another could be tailored to work with GMail functionality and keyboard shortcuts. Simon 0.4.0 builds on this approach by adding context awareness: it will activate and deactivate different scenarios depending on which applications the user has open and which have focus. The scenarios still need to be manually installed beforehand, though, so there is little risk Simon will start erasing your hard drives if you happen to walk by and utter the word "partition."

Simon can use any of several back-ends to perform the speech-recognition part of the puzzle. Earlier releases relied on either the BSD-licensed Julius or the better, but non-free, Hidden Markov Model Toolkit (HTK). Version 0.4.0 adds support for another free software recognition toolkit, CMU Sphinx.

The Sphinx engine is highly regarded for its quality, and provides functions that Julius does not, such as the ability to create one's own acoustic speech model. An acoustic model is the statistical representation of the sounds that correspond to the parts of speech that the engine is trying to recognize; it depends on both a "corpus" of audio samples of the speaker or speakers and on a grammar model for the language being spoken. Free sources for acoustic speech models have historically been hard to come by, because most were created by proprietary projects or had no clear licensing at all.

Luckily this situation is changing; the Voxforge project collects GPL-licensed speech models and enables users to create and upload their own. Like a lot of less-well-known free data projects, it could always use more contributions, but it is possible to download decent base models for a variety of languages. Simon 0.4.0 introduces a new internal format for its speech base models, but it is Voxforge compatible, and the English Voxforge model is included in the download. Simon 0.4.0 also includes tools allowing users to create and upload their own speech models to Voxforge.

Say what?

Despite being voice controlled, Simon comes with a graphical front-end for setting up the framework, managing scenarios, and working with speech models. The front-end is KDE-based, and building Simon pulls in a lot of KDE package dependencies. Packages for 0.4.0 have yet to appear, but compiling from source is straightforward. It is important to have CMU Sphinx installed beforehand in order to build a completely free Simon framework, though. Simon's modularity means the build script will simply compile Simon without Sphinx support if the engine is not found.

[Simon 0.4.0 scenarios]

At first run, the Simon setup window will walk users through the process of installing speech models and scenarios, as well as testing microphone input settings and related details. Speech models and scenarios are tracked using the Get Hot New Stuff (GHNS) system, so the available options can be searched through and installed directly within Simon itself. The scenarios currently available include general desktop utilities like window management and cursor control, applications like Firefox, Marble, and Amarok, and a smattering of individual tasks like taking a screenshot. Installing them is easy, and Simon's interface allows each to be activated or deactivated with a single click.

Arguably the biggest hurdle is finding the scenario one wants; scenarios are language-dependent, and only English, Dutch, and German ones appear to be published. In addition, there are frequently several options for each application with essentially the same description. Some descriptions are detailed enough to indicate that they were built with a specific acoustic model (Voxforge or HTK), but some are clearly old enough that they may have compatibility problems (such as the OpenOffice.org scenarios that come from the Simon 0.3.0 days). Some, like the Firefox scenario, also require installing other software (e.g., a Firefox add-on).

[Simon 0.4.0 running]

The main Simon window shows which scenarios are active and which acoustic speech models are loaded, and it displays the microphone volume level and the most recently recognized spoken words. The latter two items are useful for debugging. By default, the setup wizard steers the user toward a generic Voxforge speech model, but to really get good results the user needs to devote some time to training Simon. Most of the scenarios come with a bundled "training text" for this purpose: a list of words that the scenario is listening for. At any time, the user can click on Simon's "Start training" button and record new samples of the important words. These recordings are ingested by the speech recognition engine and added to a user-specific speech model. Simon layers this user-specific model over the base model, hopefully improving the results.

Word to the wise

[Simon 0.4.0 training]

The training interface is painless and provides a lot of hand-holding for new users. This is good news, since it is clear that at least a few training sessions are to be expected before Simon 0.4.0 is usable for daily tasks — even for those of us with perfect elocution. There are simply a lot of variables in human speech, and even more when one throws in the vagaries of cheap PC sound cards and microphones. The trainer prompts the user to speak each of the keywords, reports instantly whether the speaker's voice is too loud or too soft to be useful, and does the rest of the computation in the background.

The nicest thing about Simon 0.4.0, though, is that it moves speech control out of the "theoretical only" realm, where experienced researchers and laboratory conditions are required, and at least makes it possible for everyday users to get started. There is still a long way to go before speech control can offer a constant user interface option as it is depicted in Star Trek or (perhaps more troublingly) in 2001. But the scenario-specific set of commands makes Simon more usable than other open source speech recognition tools, and Simon's built-in training interface makes the necessary grunt work (no pun intended) of tailoring the speech model to one's actual voice about as painless as it can be.

The research into speech recognition will continue, of course. But Simon's new-found modularity will make it easier to incorporate theoretical advances into the desktop application without rewriting from scratch. For users, the next important stage is some development work on new scenarios to hook more applications into Simon. The trickiest part of the stack, though, is likely to remain training the speech recognition engine to recognize the specific user's voice. But no amount of software will eliminate that; just a good microphone and some patience.

Comments (7 posted)

Testing Magic Lantern 2.3

By Nathan Willis
January 9, 2013

In some circles, installing custom or aftermarket firmware like CyanogenMod on a $200 phone is enough to garner street cred, while in others, such minor trifles are fit only to be scoffed at. For those who do not flinch at danger, there is Magic Lantern, a GPL-licensed replacement firmware for high-end Canon digital SLR cameras. The current release is version 2.3, which offers a wealth of improvements for shooting video, plus a growing list of enhancements for still photographers.

Magic Lantern regularly makes releases for a fixed list of Canon models, at the moment including most of the models from the EOS 600D and up. The supported list focuses on cameras using Canon's DIGIC 4 chip and newer models. Recent DIGIC chips include an embedded ARM core which makes writing custom software possible, and the cameras can load and run firmware from an inserted memory card without overwriting the existing firmware. Consequently, projects like Magic Lantern and CHDK (which targets point-and-shoot models) can provide firmware that adds new functionality with minimal risk of bricking the camera — or of voiding the warranty and losing out on Canon's much-loved hardware service offerings. There is still risk involved, however, particularly for new camera models.

Magic Lantern was initially focused on improving video recording functionality. The first model supported by the project was the EOS 5D Mark II, a camera which started a minor revolution by allowing high-quality HD recording in a compact form. But for some budding filmmakers, the stock firmware simply left out too much. Magic Lantern added usability features like crop marks in the preview window, more precise control over ISO speed, white balance, and shutter speed, and a number of miscellaneous add-ons like on-screen sound meters for the audio input.

The current development work is focused on the EOS 5D Mark III, for which the third alpha release was unveiled on January 6. Installation requires unpacking the build onto a supported Compact Flash or SD card, making the card bootable, and loading it into the camera. The download package includes the firmware image plus several folders full of auxiliary files such as the focusing-screen overlays. Normally, the card can be set to automatically boot the camera into Magic Lantern, but this feature has not been enabled in the pre-release builds for the EOS 5D Mark III.

The 5D Mark III release is still incomplete in other areas as well; a good portion of the features enabled for other camera models are still unimplemented for the 5D Mark III. The issue is that some Magic Lantern features (for example, changes to live preview and information display) can work without touching any of the camera's persistent settings, but others require altering properties saved in onboard memory. The team has simply encountered too many unsolved problems with accessing and setting the 5D Mark III's stored settings. Developer a1ex reported that the stability test froze the camera and required a cold reboot and clearing all of the camera settings to restore functionality. For a piece of hardware with a four-digit price tag, some caution is understandable.

Still, there is a long list of features which are enabled in the 5D Mark III builds of 2.3. As is to be expected in light of the project's emphasis on digital film-making, most are related to video, but not all of them are so esoteric that a semester of cinematography class is required. The gradual exposure function, for example, allows the user to switch from one exposure setting to another while still filming; Magic Lantern will smoothly transition through the intermediate shutter and ISO speed settings, so that the change fades in (so to speak), instead of hitting all at once.

But there are more unusual features, too. The HDR video mode, for example, shoots twice as many frames as normal, alternating the exposure of each: one set to properly expose the highlights, and one set to properly expose the shadows. Combining the results into a single video stream is not easy, though, and needs to be done in post-production software. So far no tool exists for Linux users, although there is a script using the open source VirtualDub and Enfuse applications.

The majority of the Magic Lantern features enabled for the 5D Mark III at the moment are of the display or composition aid variety, though. But this is not to say that they are merely cosmetic; some offer important enhancements. For instance, the "display gain" feature brightens the live preview window so that items in frame are visible even if it is pitch black outside. That allows the user to compose a decent-looking foreground when doing night shooting or astrophotography, which is a nearly impossible task otherwise.

As a still photographer, I am more interested in some of Magic Lantern 2.3's features that are not yet available on the 5D Mark III. To be honest, though, there are so many features these days that nearly every user would find something useful even in a random subset of them. That is a testament to the development team's creativity. More important, of course, is that such aftermarket firmware allows camera owners to do more (and better) creative work. To Canon's credit, the company has not cracked down on Magic Lantern or CHDK — in fact the company adeptly steps around the issue of whether using either project is a warranty violation. Those users with camera models supported by stable builds of 2.3 should consider giving Magic Lantern a try — but should do so with open eyes. With a well-tested model, there is relatively little risk of doing damage to one's camera, but there is virtually no recourse should something go horribly wrong. Perhaps the best advice is to say: cowboy up, but do your reading first.

Comments (17 posted)

XBMC comes to Android

By Nathan Willis
January 9, 2013

Version 12 of the XBMC media-playback application is currently in the final stages of development; release candidate 3 was released on January 3. There are multiple enhancements to the codebase, but one of the biggest stories is that XBMC v12 will officially add support for Android. An Android port naturally makes XBMC available on tablets and handsets, but, just as importantly, it enables running on numerous set-top boxes, "smart TVs," and the increasingly popular smart TV dongle — device classes currently dominated by proprietary applications produced by entertainment companies.

[XBMC v12 on Android's main menu]

Binary builds of XBMC v12 RC3 are available for download from xbmc.org. The Android build is an .apk package that is installable on any device on which the user has enabled installation of non-Play Store software. The project site says that XBMC will eventually come to the Play Store, but not during the pre-release phase. The XBMC wiki has an Android hardware page outlining which devices have tested well with which media types — as one might expect, there is a significantly higher hardware threshold required to enable 1080p video playback.

The target platform for the initial Android release is set-top boxes, in particular the Pivos XIOS DS, which is a compact ARM Cortex-A9 device that the team used as the reference development platform. The project offers a few guidelines for assessing the suitability of other devices, including a note that, practically speaking, any Android device that does not have the NEON-compatible coprocessor (or does not have it enabled) will probably be unable to play back HD video. Nevertheless, there are unsupported NEON-free builds linked to from the Android hardware wiki page. The final caveat is that thus far the porting effort has not addressed power consumption, so users of battery-powered mobile devices may find XBMC to be quite draining — although the project assures users that this, too, will be addressed in the future. For wall-powered set-top boxes, of course, high power consumption is less of a problem.

Functionality

I tested the new release on a Nook Tablet running CyanogenMod 7 (CM7), and the battery-draining issue is indeed no joke. The device boasts a 4000 mAh battery, which XBMC managed to drain completely in a little over 3 hours, even though video playback only accounted for a small portion of the time. Granted, CM7 is an unofficial port for this particular device and comes with its own share of power consumption problems. Still, it is clear that there is considerable room for improvement. Nevertheless, even on year-old hardware and a less-than-up-to-date version of Android, XBMC runs remarkably well.

[XBMC v12 on Android's content browser]

Feature-wise, the good news is that the Android port is nothing short of the full XBMC experience — this is not a "light" or "mobile" version of the software. All of the media formats, network protocols, and add-ons supported in desktop XBMC are available in the Android edition. NFS access was missing from some of the early betas of XBMC v12, but as of now, there are no major gaps in player functionality. Video playback from standard-definition web sources was smooth, and a significantly better experience than accessing the same sites through either the stock Android browser or Firefox. Audio playback rarely stress-tests modern devices, so it gets less attention in reviews, but all of the audio add-ons tested worked like a charm as well.

There are, however, still hiccups to be encountered in individual plug-ins. To some degree this is unavoidable; a huge subset of the video playback add-ons, for example, are "screen scraper"-style hacks to retrieve content from specific Web-based video services, such as the many cable and broadcast TV channels that offer a subset of their programming online. The authors of these add-ons must rewrite their page parsing code every time the target site alters its layout, but one of XBMC's strengths is that add-ons are installable from within the XBMC interface, and updates to restore service can be pushed out quickly.

But reliance on third-party add-on developers has its downside; there are other add-ons available for desktop Linux XBMC that do not seem to work for the Android build, such as the D-Bus based notifications, some of which may never work because of platform limitations. Still others offer functionality that depends on external factors, such as the MythBox add-on, which allows XBMC to play back content from a MythTV back-end. But the add-on only supports MythTV 0.24, which is two releases out-of-date.

Experience

A far more significant problem with XBMC v12 on Android is navigating the user interface. XBMC has long had navigation "trap doors": spots where it is possible to navigate into a menu or tool, but it is either impossible to navigate back out, or it is only possible to navigate back out through different means (for example, menus where the left-arrow key allows you to enter a screen, but the screen can only be exited by hitting Escape). These trap doors are usability warts under the best of circumstances, but on an Android device they can leave the user stranded if the device does not have a hardware keyboard. Android phones might have a keyboard; tablets will not. Some set-top boxes come with wireless keyboards, although they are largely looked down on, and there is always the possibility of pairing Bluetooth keyboards. But users seem to loathe putting down the directional remote with its single-thumb driveability.

[XBMC v12 on Android's video player]

Trap doors are not the only interface difficulty, however. Many of XBMC's screens and onscreen controls assume the presence of either a traditional pointer or a touchscreen. Jumping directly to a specific point in the timeline of a song or video, for instance, requires a pointing device to be at least marginally accurate. There may not be a one-size-fits-all solution, considering the variety of content types XBMC plays (and the variety of caching/streaming challenges that accompany them), but some more work will probably be required to optimize for the Android set-top box, which is often touch-free (and may be pointer-free as well).

But the bigger question that XBMC needs to answer for potential Android users is how it offers an improvement over getting at the same content through other applications. Quite simply, the answer it gives is "it depends" — entirely on the type of content. Consuming Internet-delivered video and audio is significantly better through XBMC than it is through a browser. The difference is not quite as stark when compared to a dedicated Android application for a particular service (such as Grooveshark). And XBMC is far less compelling for content that requires more manual searching and browsing.

Take podcasts, for example. XBMC supports managing podcasts, but its interface for subscribing and listening to them is no better than any other on the market. In fact, when coupled with the difficulties of using the UI without a keyboard, it may actually be slightly worse. The same is true for watching or listening to files from local storage — there is no compelling advantage to using XBMC for this task over the stock Android tools, and in some places the interface makes the task more difficult.

As a result, XBMC for Android works well as an Internet content front-end, where a set-top box must compete against the rapidly growing stack of commercial streaming boxes from Roku, Netgear, and everyone else at the big consumer electronics shows. Some of these commercial products also offer an interface into the owner's local music and video collection (typically through UPnP/DLNA). XBMC can match that experience, although with a large enough collection no DLNA solution is particularly pleasant — all eventually fall back on scrolling through page after page of track titles.

Where XBMC has a clear advantage is that it will always be able to offer access to more online content than these proprietary competitors, because the community writes its own add-ons and updates them without the need to call in lawyers and negotiate complex multi-year distribution deals. This is probably where XBMC will make the biggest splash, if and when users of commercial Android set-top boxes can install XBMC through the Google Play store. The do-it-yourself crowd will probably find a desktop Linux-based XBMC set-top box both easier to build and more flexible — but the average consumer may very well discover a new world through seeing XBMC available as a one-click installation option.

The application may also end up being a handy option on handheld Android devices (once the power-consumption issues are fixed). There will probably be more and better options for podcasts and locally stored content, but XBMC's unified front-end to a wealth of Internet-delivered services is likely to be a hit even on phones. If nothing else, it saves users the trouble of scrolling through dozens and dozens of extra application launchers.

Comments (18 posted)

Page editor: Jonathan Corbet

Security

Attacking full-disk encryption with Inception

By Jake Edge
January 9, 2013

When using whole-disk encryption, it's sometimes tempting to be less concerned about attacks requiring physical presence. After all, putting a laptop to sleep is quite convenient, even though attacks like "Evil Maid" or "Cold Boot" are possible. A more recent attack just adds another worry to that list.

Inception is a tool released in 2011 that uses Firewire direct memory access (DMA) to access the memory of a sleeping (or simply powered-on, but locked) system. While it is an older tool, Inception recently got a notoriety boost from Cory Doctorow at Boing Boing, which is where I came across it. It is a rather interesting attack, and one that isn't really exploiting a bug.

In order to facilitate high-speed transfers, Firewire (aka IEEE 1394) requires the availability of a DMA mode. DMA allows the Firewire controller to directly access system memory, bypassing the CPU. While removing the potential bottleneck of the CPU does make transfers faster, it also opens up the contents of memory for any Firewire device to inspect or modify. This is the same memory that contains various things of interest, including the code to check passwords.

It is the password-checking code that Inception targets. When the incept program is run, it will patch the Linux, Windows, or Mac OS X code running on the system such that any password can be used to log in. After that, one can log in as root (or Administrator) without need for the password—the system is fully compromised. Since the patching is in memory only, though, the change disappears at the next reboot, which may make it more difficult to detect.

Inception doesn't require a Firewire interface on the targeted system, just some way to add one (e.g. PCMCIA, ExpressCard). Typically, the system will detect the Firewire device being added and helpfully install the drivers needed. The attacker's machine, which is attached to the victim over the Firewire interface, then sends commands to enable DMA mode. From there, the program looks for signatures of password authentication modules and patches any it finds.

There are, of course, other things one can do with access to the memory, including dumping its contents for use later on. The system memory may well contain information of interest, for example credentials of various sorts. Patching other parts of the operating system is possible as well, and the incept program has support for using custom signatures and patches. Inception is useful for more than just attacks, as it can be used to help analyze any running system—one that has been compromised, for example.

The attack code runs on Linux or OS X systems. It requires Python 3 and libforensic1394. Unsurprisingly, there are some caveats. Targets with more than 4G of RAM may not be attacked reliably because DMA is limited to the low 4G and the code of interest might be loaded higher up. In addition, certain OS X targets may repel the attack by disabling DMA under certain circumstances (like sleeping).

One obvious mitigation for Linux is to disable the Firewire drivers for systems that aren't using them. One could, instead, disable Firewire DMA when the drivers are loaded, but if Firewire is actually being used, that will clearly impact performance. Inception serves as a nice reminder that a powered-on system is vulnerable to many "physically present" kinds of attacks—even if the disk is encrypted.

Comments (22 posted)

Brief items

Security quotes of the week

DRM technology will still fail to prevent widespread infringement. In a related development, pigs will still fail to fly.
-- Ed Felten makes predictions for 2013

At a recent conference on the security of connected devices, [Columbia PhD candidate Ang] Cui demonstrated how they can easily insert malicious code into a Cisco VoIP phone (any of the 14 Cisco Unified IP Phone models) and start eavesdropping on private conversations -- not just on the phone but also in the phone's surroundings -- from anywhere in the world.

"It's not just Cisco phones that are at risk. All VoIP phones are particularly problematic since they are everywhere and reveal our private communications," says [Columbia professor Salvatore] Stolfo. "It's relatively easy to penetrate any corporate phone system, any government phone system, any home with Cisco VoIP phones -- they are not secure."

-- Science Daily

Comments (3 posted)

Two new (one "critical") Ruby on Rails vulnerabilities

Two new vulnerabilities (CVE-2013-0156, CVE-2013-0155) have been reported in the Ruby on Rails web framework. CVE-2013-0156 is considered a critical vulnerability that should be patched or worked around immediately ("allows attackers to bypass authentication systems, inject arbitrary SQL, inject and execute arbitrary code, or perform a DoS attack on a Rails application"), while CVE-2013-0155 can alter some SQL queries when JSON parameter parsing is used. They are different from the SQL injection vulnerability we reported on January 3. More information on -0156 can be found in this analysis.

Comments (6 posted)

New vulnerabilities

cups: unauthorized access to administration interface

Package(s): cups    CVE #(s): CVE-2012-6094
Created: January 7, 2013    Updated: April 5, 2013
Description: From the Mageia advisory:

During the process of CUPS socket activation code refactoring in favor of systemd capability a security flaw was found in the way CUPS service honored Listen localhost:631 cupsd.conf configuration option. The setting was recognized properly for IPv4-enabled systems, but failed to be correctly applied for IPv6-enabled systems. As a result, a remote attacker could use this flaw to obtain (unauthorized) access to the CUPS web-based administration interface.

Alerts:
Mandriva MDVSA-2013:034 cups 2013-04-05
Fedora FEDORA-2012-19606 cups 2013-02-26
Mageia MGASA-2013-0004 cups 2013-01-06

Comments (none posted)

dovecot: denial of service

Package(s): dovecot    CVE #(s): CVE-2012-5620
Created: January 7, 2013    Updated: January 9, 2013
Description: From the Red Hat bugzilla:

Dovecot 2.1.11 was released and includes a fix for a crash condition when the IMAP server was issued a SEARCH command with multiple KEYWORD parameters. An authenticated remote user could use this flaw to crash Dovecot.

Alerts:
Fedora FEDORA-2012-19752 dovecot 2013-01-05

Comments (none posted)

freeciv: denial of service

Package(s): freeciv    CVE #(s): CVE-2012-5645
Created: January 7, 2013    Updated: January 15, 2013
Description: From the Red Hat bugzilla:

A denial of service flaw was found in the way the server component of Freeciv, a turn-based, multi-player, X based strategy game, processed certain packets (invalid packets with whole packet length lower than packet header size or syntactically valid packets, but whose processing would lead to an infinite loop). A remote attacker could send a specially-crafted packet that, when processed would lead to freeciv server to terminate (due to memory exhaustion) or become unresponsive (due to excessive CPU use).

Alerts:
Mageia MGASA-2013-0005 freeciv 2013-01-14
Fedora FEDORA-2012-20623 freeciv 2013-01-05
Fedora FEDORA-2012-20610 freeciv 2013-01-05

Comments (none posted)

inkscape: denial of service

Package(s): inkscape    CVE #(s): CVE-2012-5656
Created: January 7, 2013    Updated: February 14, 2013
Description: From the Red Hat bugzilla:

An XML eXternal Entity (XXE) flaw was found in the way Inkscape, a vector-based drawing program using SVG as its native file format performed rasterization of certain SVG images. A remote attacker could provide a specially-crafted SVG image that, when opened in inkscape would lead to arbitrary local file disclosure or denial of service.

Alerts:
openSUSE openSUSE-SU-2013:0297-1 inkscape 2013-02-15
openSUSE openSUSE-SU-2013:0294-1 inkscape 2013-02-14
Ubuntu USN-1712-1 inkscape 2013-01-30
Mageia MGASA-2013-0006 inkscape 2013-01-14
Fedora FEDORA-2012-20621 inkscape 2013-01-05
Fedora FEDORA-2012-20620 inkscape 2013-01-05

Comments (none posted)

mozilla: multiple vulnerabilities

Package(s): firefox thunderbird    CVE #(s): CVE-2013-0749 CVE-2013-0770 CVE-2013-0760 CVE-2013-0761 CVE-2013-0763 CVE-2013-0771 CVE-2012-5829 CVE-2013-0768 CVE-2013-0764 CVE-2013-0745 CVE-2013-0747 CVE-2013-0752 CVE-2013-0757 CVE-2013-0755 CVE-2013-0756 CVE-2013-0743
Created: January 9, 2013    Updated: February 18, 2013
Description: From the Ubuntu advisory:

Christoph Diehl, Christian Holler, Mats Palmgren, Chiaki Ishikawa, Bill Gianopoulos, Benoit Jacob, Gary Kwong, Robert O'Callahan, Jesse Ruderman, and Julian Seward discovered multiple memory safety issues affecting Firefox. If the user were tricked into opening a specially crafted page, an attacker could possibly exploit these to cause a denial of service via application crash, or potentially execute code with the privileges of the user invoking Firefox. (CVE-2013-0769, CVE-2013-0749, CVE-2013-0770)

Abhishek Arya discovered several use-after-free and buffer overflow issues in Firefox. An attacker could exploit these to cause a denial of service via application crash, or potentially execute code with the privileges of the user invoking Firefox. (CVE-2013-0760, CVE-2013-0761, CVE-2013-0762, CVE-2013-0763, CVE-2013-0766, CVE-2013-0767, CVE-2013-0771, CVE-2012-5829)

A stack buffer overflow was discovered in Firefox. If the user were tricked into opening a specially crafted page, an attacker could possibly exploit this to cause a denial of service via application crash, or potentially execute code with the privileges of the user invoking Firefox. (CVE-2013-0768)

Jerry Baker discovered that Firefox did not always properly handle threading when performing downloads over SSL connections. An attacker could exploit this to cause a denial of service via application crash. (CVE-2013-0764)

Olli Pettay and Boris Zbarsky discovered flaws in the JavaScript engine of Firefox. An attacker could cause a denial of service via application crash, or potentially execute code with the privileges of the user invoking Firefox. (CVE-2013-0745, CVE-2013-0746)

Jesse Ruderman discovered a flaw in the way Firefox handled plugins. If a user were tricked into opening a specially crafted page, a remote attacker could exploit this to bypass security protections to conduct clickjacking attacks. (CVE-2013-0747)

Sviatoslav Chagaev discovered that Firefox did not properly handle XBL files with multiple XML bindings with SVG content. An attacker could cause a denial of service via application crash, or potentially execute code with the privileges of the user invoking Firefox. (CVE-2013-0752)

Mariusz Mlynski discovered two flaws that could be used to gain access to privileged chrome functions. An attacker could possibly exploit this to execute code with the privileges of the user invoking Firefox. (CVE-2013-0757, CVE-2013-0758)

Several use-after-free issues were discovered in Firefox. If the user were tricked into opening a specially crafted page, an attacker could possibly exploit this to execute code with the privileges of the user invoking Firefox. (CVE-2013-0753, CVE-2013-0754, CVE-2013-0755, CVE-2013-0756)

Two intermediate CA certificates were mis-issued by the TURKTRUST certificate authority. If a remote attacker were able to perform a man-in-the-middle attack, this flaw could be exploited to view sensitive information. (CVE-2013-0743)

Alerts:
openSUSE openSUSE-SU-2014:1100-1 Firefox 2014-09-09
Gentoo 201309-23 firefox 2013-09-27
Mandriva MDVSA-2013:050 nss 2013-04-05
SUSE SUSE-SU-2013:0306-1 Mozilla Firefox 2013-02-18
Mageia MGASA-2013-0053 qt4 2013-02-16
SUSE SUSE-SU-2013:0292-1 MozillaFirefox 2013-02-13
Ubuntu USN-1681-4 firefox 2013-02-05
Fedora FEDORA-2013-1432 seamonkey 2013-02-02
Fedora FEDORA-2013-1382 seamonkey 2013-02-02
Fedora FEDORA-2013-0723 thunderbird 2013-02-01
Mageia MGASA-2013-0020 firefox 2013-01-26
Fedora FEDORA-2013-1442 seamonkey 2013-01-26
openSUSE openSUSE-SU-2013:0175-1 mozilla 2013-01-23
openSUSE openSUSE-SU-2013:0149-1 Mozilla 2013-01-23
Ubuntu USN-1681-3 firefox 2013-01-22
Fedora FEDORA-2013-0653 thunderbird 2013-01-18
Fedora FEDORA-2013-0891 firefox 2013-01-16
Fedora FEDORA-2013-0306 ca-certificates 2013-01-15
Fedora FEDORA-2013-0589 thunderbird 2013-01-15
Ubuntu USN-1687-2 nspr 2013-01-14
Ubuntu USN-1687-1 nss 2013-01-14
Mageia MGASA-2013-0008 iceape 2013-01-14
Slackware SSA:2013-009-02 mozilla-thunderbird 2013-01-10
Slackware SSA:2013-009-01 mozilla-firefox 2013-01-10
Mandriva MDVSA-2013:002 firefox 2013-01-09
Ubuntu USN-1681-2 thunderbird 2013-01-08
Ubuntu USN-1681-1 firefox 2013-01-08
openSUSE openSUSE-SU-2013:0131-1 Mozilla 2013-01-23
Fedora FEDORA-2013-0885 xulrunner 2013-01-23
Fedora FEDORA-2013-0885 firefox 2013-01-23
SUSE SUSE-SU-2013:0049-1 MozillaFirefox 2013-01-18
SUSE SUSE-SU-2013:0048-1 MozillaFirefox 2013-01-18

Comments (none posted)

mozilla: multiple vulnerabilities

Package(s): firefox thunderbird xulrunner seamonkey    CVE #(s): CVE-2013-0744 CVE-2013-0746 CVE-2013-0748 CVE-2013-0750 CVE-2013-0753 CVE-2013-0754 CVE-2013-0758 CVE-2013-0759 CVE-2013-0762 CVE-2013-0766 CVE-2013-0767 CVE-2013-0769
Created: January 9, 2013    Updated: February 18, 2013
Description: From the Red Hat advisory:

Several flaws were found in the processing of malformed web content. A web page containing malicious content could cause Firefox to crash or, potentially, execute arbitrary code with the privileges of the user running Firefox. (CVE-2013-0744, CVE-2013-0746, CVE-2013-0750, CVE-2013-0753, CVE-2013-0754, CVE-2013-0762, CVE-2013-0766, CVE-2013-0767, CVE-2013-0769)

A flaw was found in the way Chrome Object Wrappers were implemented. Malicious content could be used to cause Firefox to execute arbitrary code via plug-ins installed in Firefox. (CVE-2013-0758)

A flaw in the way Firefox displayed URL values in the address bar could allow a malicious site or user to perform a phishing attack. (CVE-2013-0759)

An information disclosure flaw was found in the way certain JavaScript functions were implemented in Firefox. An attacker could use this flaw to bypass Address Space Layout Randomization (ASLR) and other security restrictions. (CVE-2013-0748)

Alerts:
openSUSE openSUSE-SU-2014:1100-1 Firefox 2014-09-09
Gentoo 201309-23 firefox 2013-09-27
SUSE SUSE-SU-2013:0306-1 Mozilla Firefox 2013-02-18
SUSE SUSE-SU-2013:0292-1 MozillaFirefox 2013-02-13
Ubuntu USN-1681-4 firefox 2013-02-05
Fedora FEDORA-2013-1432 seamonkey 2013-02-02
Fedora FEDORA-2013-1382 seamonkey 2013-02-02
Mageia MGASA-2013-0021 thunderbird 2013-01-26
Mageia MGASA-2013-0020 firefox 2013-01-26
Fedora FEDORA-2013-1442 seamonkey 2013-01-26
Ubuntu USN-1681-3 firefox 2013-01-22
SUSE SUSE-SU-2013:0049-1 MozillaFirefox 2013-01-18
Fedora FEDORA-2013-0891 xulrunner 2013-01-16
CentOS CESA-2013:0144 xulrunner 2013-01-10
Mageia MGASA-2013-0008 iceape 2013-01-14
Oracle ELSA-2013-0144 firefox 2013-01-12
Scientific Linux SL-thun-20130110 thunderbird 2013-01-10
Scientific Linux SL-fire-20130110 firefox 2013-01-10
Slackware SSA:2013-009-03 seamonkey 2013-01-10
Slackware SSA:2013-009-02 mozilla-thunderbird 2013-01-10
Slackware SSA:2013-009-01 mozilla-firefox 2013-01-10
Mandriva MDVSA-2013:002 firefox 2013-01-09
Ubuntu USN-1681-2 thunderbird 2013-01-08
Ubuntu USN-1681-1 firefox 2013-01-08
Oracle ELSA-2013-0144 firefox 2013-01-09
Oracle ELSA-2013-0145 thunderbird 2013-01-09
CentOS CESA-2013:0144 firefox 2013-01-09
CentOS CESA-2013:0145 thunderbird 2013-01-09
CentOS CESA-2013:0144 xulrunner 2013-01-09
Red Hat RHSA-2013:0145-01 thunderbird 2013-01-08
Red Hat RHSA-2013:0144-01 firefox 2013-01-08
openSUSE openSUSE-SU-2013:0131-1 Mozilla 2013-01-23
openSUSE openSUSE-SU-2013:0149-1 Mozilla 2013-01-23
SUSE SUSE-SU-2013:0048-1 MozillaFirefox 2013-01-18
CentOS CESA-2013:0145 thunderbird 2013-01-10
CentOS CESA-2013:0144 firefox 2013-01-10

Comments (none posted)

openshift-origin-node-util: multiple vulnerabilities

Package(s): openshift-origin-node-util    CVE #(s): CVE-2012-5646 CVE-2012-5647
Created: January 9, 2013    Updated: January 9, 2013
Description: From the Red Hat advisory:

A flaw was found in the way the administrative web interface for restoring applications (restorer.php) processed options passed to it. A remote attacker could send a specially-crafted request to restorer.php that would result in the query string being parsed as command line options and arguments. This could lead to arbitrary code execution with the privileges of an arbitrary application. (CVE-2012-5646)

An open redirect flaw was found in restorer.php. A remote attacker able to trick a victim into opening the restorer.php page using a specially-crafted link could redirect the victim to an arbitrary page. (CVE-2012-5647)

Alerts:
Red Hat RHSA-2013:0148-01 openshift-origin-node-util 2013-01-08

Comments (none posted)

php-pear-CAS: missing CN validation of CAS server certificate

Package(s): php-pear-CAS    CVE #(s): CVE-2012-5583
Created: January 9, 2013    Updated: January 9, 2013
Description: From the Fedora advisory:

* CVE-2012-5583 Missing CN validation of CAS server certificate [#58] (Joachim Fritschi)

Alerts:
Fedora FEDORA-2012-21122 php-pear-CAS 2013-01-09
Fedora FEDORA-2012-21106 php-pear-CAS 2013-01-09

Comments (none posted)

rails: input validation error

Package(s): rails    CVE #(s): CVE-2012-5664
Created: January 7, 2013    Updated: January 9, 2013
Description: From the Debian advisory:

joernchen of Phenoelit discovered that rails, an MVC Ruby-based framework geared for web application development, is not properly treating user-supplied input to "find_by_*" methods. Depending on how the Ruby on Rails application is using these methods, this allows an attacker to perform SQL injection attacks, e.g., to bypass authentication if Authlogic is used and the session secret token is known.

See this advisory for more information, patches, and workarounds.

Alerts:
SUSE SUSE-SU-2013:0508-1 rubygem-merb-core 2013-03-20
SUSE SUSE-SU-2013:0486-1 Ruby On Rails 2013-03-19
openSUSE openSUSE-SU-2013:0280-1 ruby on rails 2013-02-12
openSUSE openSUSE-SU-2013:0278-1 ruby on rails 2013-02-12
SUSE SUSE-SU-2013:0606-1 Ruby on Rails 2013-04-03
Debian DSA-2597-1 rails 2013-01-04

Comments (none posted)

Page editor: Jake Edge

Kernel development

Brief items

Kernel release status

The current development kernel remains 3.8-rc2; no 3.8 prepatches have been released in the last week.

Stable updates: 3.2.36 was released on January 4.

As of this writing, the 2.6.34.14, 3.0.58, 3.4.25, and 3.7.2 updates are in the review process; they can be expected on or after January 11.

Comments (none posted)

Quotes of the week

I used to believe in a single, integrated security module that addressed all the issues. Now that Linux is supporting everything from real time tire pressure gauges in tricycles to the global no-fly list that just doesn't seem reasonable. We need better turn around on supplemental mechanisms. That means collections of smaller, simpler LSMs instead of monoliths that only a few select individuals or organizations have any hope of configuring properly.
Casey Schaufler

Well at least it crashes safely.
Frederic Weisbecker

Comments (none posted)

Kernel development news

Per-entity load tracking

By Jonathan Corbet
January 9, 2013
The Linux kernel's CPU scheduler has a challenging task: it must allocate access to the system's processors in a way that is fair and responsive while maximizing system throughput and minimizing power consumption. Users expect these results regardless of the characteristics of their own workloads, and regardless of the fact that those objectives are often in conflict with each other. So it is not surprising that the kernel has been through a few CPU schedulers over the years. That said, things have seemed relatively stable in recent times; the current "completely fair scheduler" (CFS) was merged for 2.6.23 in 2007. But, behind that apparent stability, a lot has changed, and one of the most significant changes in some time was merged for the 3.8 release.

Perfect scheduling requires a crystal ball; when the kernel knows exactly what demands every process will make on the system and when, it can schedule those processes optimally. Unfortunately, hardware manufacturers continue to push affordable prediction-offload devices back in their roadmaps, so the scheduler has to be able to muddle through in their absence. Said muddling tends to be based on the information that is actually available, with each process's past performance being at the top of the list. But, interestingly, while the kernel closely tracks how much time each process actually spends running, it does not have a clear idea of how much each process is contributing to the load on the system.

One might well ask whether there is a difference between "CPU time consumed" and "load." The answer, at least as embodied in Paul Turner's per-entity load tracking patch set, which was merged for 3.8, would appear to be "yes." A process can contribute to load even if it is not actually running at the moment; a process waiting for its turn in the CPU is an example. "Load" is also meant to be an instantaneous quantity — how much is a process loading the system right now? — as opposed to a cumulative property like CPU usage. A long-running process that consumed vast amounts of processor time last week may have very modest needs at the moment; such a process is contributing very little to load now, despite its rather more demanding behavior in the past.

The CFS scheduler (in 3.7 and prior kernels) tracks load on a per-run-queue basis. It's worth noting that "the" run queue in CFS is actually a long list of queues; at a minimum, there is one for each CPU. When group scheduling is in use, though, each control group has its own per-CPU run queue array. Occasionally the scheduler will account for how much each run queue is contributing to the load on the system as a whole. Accounting at that level is sufficient to help the group scheduler allocate CPU time between control groups, but it leaves the system as a whole unaware of exactly where the current load is coming from. Tracking load at the run queue level also tends to yield widely varying estimates even when the workload is relatively stable.

Toward better load tracking

Per-entity load tracking addresses these problems by pushing this tracking down to the level of individual "scheduling entities" — a process or a control group full of processes. To that end, (wall clock) time is viewed as a sequence of 1ms (actually, 1024µs) periods. An entity's contribution to the system load in a period p_i is just the portion of that period that the entity was runnable — either actually running, or waiting for an available CPU. The trick, though, is to get an idea of contributed load that covers more than 1ms of real time; this is managed by adding in a decayed version of the entity's previous contribution to system load. If we let L_i designate the entity's load contribution in period p_i, then an entity's total contribution can be expressed as:

    L = L_0 + L_1*y + L_2*y^2 + L_3*y^3 + ...

Where y is the decay factor chosen. This formula gives the most weight to the most recent load, but allows past load to influence the calculation in a decreasing manner. The nice thing about this series is that it is not actually necessary to keep an array of past load contributions; simply multiplying the previous period's total load contribution by y and adding the new L_0 is sufficient.

In the current code, y has been chosen so that y^32 is equal to 0.5 (though, of course, the calculation is done with integer arithmetic in the kernel). Thus, an entity's load contribution 32ms in the past is weighted half as strongly as its current contribution.
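
For the curious, here is a minimal user-space sketch of the recurrence just described; the names are invented, and it uses floating point for clarity where the kernel uses 32-bit fixed-point arithmetic and precomputed decay tables:

    #include <stdio.h>

    /* y chosen so that y^32 == 0.5: a contribution is weighted half
     * as strongly after 32 periods; 0.97857206 is roughly 2^(-1/32). */
    static const double y = 0.97857206;

    /* One accounting period: decay the old sum, add the new sample.
     * runnable_fraction is the portion of the period (0.0 to 1.0)
     * during which the entity was runnable. */
    static double update_load(double load_sum, double runnable_fraction)
    {
        return load_sum * y + runnable_fraction;
    }

    int main(void)
    {
        double load = 0.0;
        int i;

        /* An entity runnable for half of every period converges
         * toward a steady-state sum of 0.5 / (1 - y). */
        for (i = 0; i < 256; i++)
            load = update_load(load, 0.5);

        printf("tracked load sum: %f (limit %f)\n", load, 0.5 / (1 - y));
        return 0;
    }

Because the sum converges, a steady workload yields a steady load value, while a burst of activity shows up immediately and then fades away over the following few 32ms half-lives.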

Once we have an idea of the load contributed by runnable processes, that load can be propagated upward to any containing control groups with a simple sum. But, naturally, there are some complications. Calculating the load contribution of runnable entities is easy, since the scheduler has to deal with those entities on a regular basis anyway. But non-runnable entities can also contribute to load; the fact that a password cracker is currently waiting on a page fault does not change the fact that it may be loading the system heavily. So there needs to be a way of tracking the load contribution of processes that, by virtue of being blocked, are not currently the scheduler's concern.

One could, of course, just iterate through all those processes, decay their load contribution as usual, and add it to the total. But that would be a prohibitively expensive thing to do. So, instead, the 3.8 scheduler will simply maintain a separate sum of the "blocked load" contained in each cfs_rq (control-group run queue) structure. When a process blocks, its load is subtracted from the total runnable load value and added to the blocked load instead. That load can be decayed in the same manner (by multiplying it by y each period). When a blocked process becomes runnable again, its (suitably decayed) load is transferred back to the runnable load. Thus, with a bit of accounting during process state transitions, the scheduler can track load without having to worry about walking through a long list of blocked processes.
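
The state-transition bookkeeping is simple enough to sketch; the structure and function names below are hypothetical stand-ins for illustration, not the kernel's actual cfs_rq code:

    /* Per-run-queue load sums, split as described in the text. */
    struct rq_load {
        double runnable_load;   /* sum over currently runnable entities */
        double blocked_load;    /* sum over blocked entities */
    };

    static const double y = 0.97857206;   /* same decay factor as above */

    /* Called once per period: both sums decay identically, so a
     * blocked entity's contribution keeps fading while it sleeps. */
    static void decay_period(struct rq_load *rq)
    {
        rq->runnable_load *= y;
        rq->blocked_load  *= y;
    }

    /* An entity blocks: move its current contribution across. */
    static void entity_blocks(struct rq_load *rq, double load)
    {
        rq->runnable_load -= load;
        rq->blocked_load  += load;
    }

    /* A blocked entity wakes: its suitably decayed contribution
     * moves back to the runnable sum. */
    static void entity_wakes(struct rq_load *rq, double decayed_load)
    {
        rq->blocked_load  -= decayed_load;
        rq->runnable_load += decayed_load;
    }

The benefit of the split shows up in decay_period(): each period, the scheduler updates two accumulated sums rather than walking a list of blocked processes.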

Another complication is throttled processes — those that are running under the CFS bandwidth controller and have used all of the CPU time available to them in the current period. Even if those processes wish to run, and even if the CPU is otherwise idle, the scheduler will pass them over. Throttled processes thus cannot contribute to load, so removing their contribution while they languish makes sense. But allowing their load contribution to decay while they are waiting to be allowed to run again would tend to skew their contribution downward. So, in the throttled case, time simply stops for the affected processes until they emerge from the throttled state.

What it is good for

The end result of all this work is that the scheduler now has a much clearer idea of how much each process and scheduler control group is contributing to the load on the system — and it has all been achieved without an increase in scheduler overhead. Better statistics are usually good, but one might wonder whether this information is truly useful for the scheduler.

It does seem that some useful things can be done with a better idea of an entity's load contribution. The most obvious target is likely to be load balancing: distributing the processes on the system so that each CPU is carrying roughly the same load. If the kernel knows how much each process is contributing to system load, it can easily calculate the effect of migrating that process to another CPU. The result should be more accurate, less error-prone load balancing. There are some patches in circulation that make use of load tracking to improve the scheduler's load balancer; something will almost certainly make its way toward the mainline in the near future.

Another feature needing per-entity load tracking is the small-task packing patch. The goal here is to gather "small" processes onto a small number of CPUs, allowing other processors in the system to be powered down. Clearly, this kind of gathering requires a reliable indicator of which processes are "small"; otherwise, the system is likely to end up in a highly unbalanced state.

Other subsystems may also be able to use this information; CPU frequency and power governors should be able to make better guesses about how much computing power will be needed in the near future, for example. Now that the infrastructure is in place, we are likely to see a number of developers using per-entity load information to optimize the behavior of the system. It is still not a crystal ball with a view into the future, but, at least, we now have a better understanding of the present.

Comments (21 posted)

Checkpoint/restore and signals

By Michael Kerrisk
January 9, 2013

Checkpoint/restore is a mechanism that permits taking a snapshot of the state of an application (which may consist of multiple processes) and then later restoring the application to a running state. One use of checkpoint/restore is for live migration, which allows a running application to be moved between host systems without loss of service. Another use is incremental snapshotting, whereby periodic snapshots are made of a long-running application so that it can be restarted from a recent snapshot in the event of a system outage, thus avoiding the loss of days of calculation. There are also many other uses for the feature.

Checkpoint/restore has a long history, which we covered in November. The initial approach, starting in 2005, was to provide a kernel-space implementation. However, the patches implementing this approach were ultimately rejected as being too complex, invasive, and difficult to maintain. This led to an alternate approach: checkpoint/restore in user space (CRIU), an implementation that performs most of the work in user space, with some support from the kernel. The benefit of the CRIU approach is that, by comparison with a kernel-space implementation, it requires fewer and less invasive changes in the kernel code.

To correctly handle the widest possible range of applications, CRIU needs to be able to checkpoint and restore as much of a process's state as possible. This is a large task, since there are very many pieces of process state that need to be handled, including process ID, parent process ID, credentials, current working directory, resource limits, timers, open file descriptors, and so on. Furthermore, some resources may be shared across multiple processes (for example, multiple processes may hold open file descriptors referring to the same open file), so that successfully restoring application state also requires reproducing shared aspects of process state.

For each piece of process state, CRIU requires two pieces of support from the kernel: a mechanism for retrieving the state (used during checkpoint) and a mechanism to set the state (used during restore). In some cases, the kernel provides most or all of the necessary support. In other cases, however, the kernel does not provide a mechanism to retrieve the (complete) value of the state during a checkpoint or does not provide a mechanism to set the state during restore. Thus, one of the ongoing pieces of work for the implementation of CRIU is to add support to the kernel for these missing pieces.

Andrey Vagin's recent patches to the signalfd() system call are an example of this ongoing work and illustrate the complexity of the task of saving and restoring process state. Before looking at these patches closely, we need to consider the general problem that CRIU is trying to solve with respect to signals, and consider some of the details that make the solution complicated.

The problem and its complexities

The overall problem that the CRIU developers want to solve is checkpointing and restoring a process's set of pending signals—the set of signals that have been queued for delivery to the process but not yet delivered. The idea is that when a process is checkpointed, all of the process's pending signals should be fetched and saved, and when the process is restored, all of the signals should be requeued to the process. As things stand, the kernel does not quite provide sufficient support for CRIU to perform either of these tasks.

At first glance, it might seem that the task is as simple as fetching the list of pending signal numbers during a checkpoint and then requeueing those signals during the restore. However, there's rather more to the story than that. First, each signal has an associated siginfo structure that provides additional information about the signal. That information is available when a process receives a signal. If a signal handler is installed using sigaction() with the SA_SIGINFO flag, then the additional information is available as the second argument of the signal handler, which is prototyped as:

    void handler(int sig, siginfo_t *siginfo, void *ucontext);

The siginfo structure contains a number of fields. One of these, si_code, provides further information about the origin of the signal. A positive number in this field indicates that the signal was generated by the kernel; a negative number indicates that the signal was generated by user space (typically by a library function such as sigqueue()). For example, if the signal was generated because of the expiration of a POSIX timer, then si_code will be set to the value SI_TIMER. On the other hand, if a SIGCHLD signal was delivered because a child process changed state, then si_code is set to one of a range of values indicating that the process terminated, was killed by a signal, was stopped, and so on.

Other siginfo fields provide further information about the signal. For example, if the signal was sent using the kill() system call, then the si_pid field contains the PID and the si_uid field contains the real user ID of the sending process. Various other fields in the siginfo structure provide information about specific signals.
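
These fields are easy to see in action. The following self-contained example installs a SIGCHLD handler with SA_SIGINFO and prints the si_code, si_pid, and si_uid values delivered when a child exits (calling printf() from a signal handler is not async-signal-safe; it is done here only for brevity):

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static void handler(int sig, siginfo_t *si, void *ucontext)
    {
        /* For a normally exiting child, si_code will be the positive,
         * kernel-generated value CLD_EXITED. */
        printf("sig=%d si_code=%d si_pid=%d si_uid=%d\n",
               sig, si->si_code, (int) si->si_pid, (int) si->si_uid);
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGCHLD, &sa, NULL);

        if (fork() == 0)        /* child: exit to generate a SIGCHLD */
            exit(0);

        pause();                /* parent: wait for the signal */
        return 0;
    }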

There are other factors that make checkpoint/restore of signals complicated. One of these is that multiple instances of the so-called real-time signals can be queued. This means that the CRIU mechanism must ensure that all of the queued signals are gathered up during a checkpoint.

One final detail about signals must also be handled by CRIU. Signals can be queued either to a specific thread or to a process as a whole (meaning that the signal can be delivered to any of the threads in the process). CRIU needs a mechanism to distinguish these two queues during a checkpoint operation, so that it can later restore them.

Limitations of the existing system call API

At first glance it might seem that the signalfd() system call could solve the problem of gathering all pending signals during a CRIU checkpoint:

    int signalfd(int fd, const sigset_t *mask, int flags);

This system call creates a file descriptor from which signals can be "read." Reads from the file descriptor return signalfd_siginfo structures containing much of the same information that is passed in the siginfo argument of a signal handler.
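
Conventional use of signalfd() looks something like the following sketch (error checking omitted): the signals of interest are blocked from normal delivery, a signalfd file descriptor is created, and signals are then accepted via read():

    #include <signal.h>
    #include <stdio.h>
    #include <sys/signalfd.h>
    #include <unistd.h>

    int
    main(void)
    {
        sigset_t mask;
        struct signalfd_siginfo fdsi;
        int sfd;

        sigemptyset(&mask);                 /* Signals to be read must be */
        sigaddset(&mask, SIGINT);           /* blocked from normal delivery */
        sigprocmask(SIG_BLOCK, &mask, NULL);

        sfd = signalfd(-1, &mask, 0);

        read(sfd, &fdsi, sizeof(fdsi));     /* Blocks until SIGINT arrives */
        printf("got signal %u, ssi_code=%d\n",
               fdsi.ssi_signo, (int) fdsi.ssi_code);
        return 0;
    }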

However, it turns out that using signalfd() to read all pending signals in preparation for a checkpoint has a couple of limitations. The first of these is that signalfd() is unaware of the distinction between thread-specific and process-wide signals: it simply returns all pending signals, intermingling those that are process-wide with those that are directed to the calling thread. Thus, signalfd() loses information that is required for a CRIU restore operation.

A second limitation is less obvious but just as important. As we noted above, the siginfo structure contains many fields. However, only some of those fields are filled in for each signal. (The same holds true of the signalfd_siginfo structure used by signalfd().) To simplify the task of deciding which fields need to be copied to user space when a kernel-generated signal is delivered (or read via a signalfd() file descriptor), the kernel encodes a value in the two most significant bytes of the si_code field. Elsewhere, the kernel uses a switch statement based on this value to select the code that copies values from the appropriate fields of the kernel-internal siginfo structure to the user-space siginfo structure. For example, for signals generated by POSIX timers, the kernel encodes the value __SI_TIMER in the high bytes of si_code, which indicates that various timer-related fields must be copied to the user-space siginfo structure.

Encoding a value in the high bytes of the kernel-internal siginfo.si_code field serves the kernel's requirements when it comes to implementing signal handlers and signalfd(). However, one piece of information is not copied to user space. For kernel-generated signals (i.e., those signals with a positive si_code value), the value encoded in the high bytes of the si_code field is discarded before that field is copied to user space, and it is not possible for CRIU to unambiguously reconstruct the discarded value based only on the signal number and the remaining bits that are passed in the si_code field. This means that CRIU can't determine which other fields in the siginfo structure are valid; in other words, information that is essential to perform a restore of pending signals has been lost.

A related limitation in the system call API affects CRIU restore. The obvious candidates for restoring pending signals are two low-level system calls, rt_sigqueueinfo() and rt_tgsigqueueinfo(), which queue signals for a process and a thread, respectively. These system calls are rarely used outside of the C library (where, for example, they are used to implement the sigqueue() and pthread_sigqueue() library functions). Aside from the thread-versus-process difference, the two system calls are quite similar. For example, rt_sigqueueinfo() has the following prototype:

    int rt_sigqueueinfo(pid_t tgid, int sig, siginfo_t *siginfo);

The system call sends the signal sig, whose attributes are provided in siginfo, to the process with the ID tgid. This seems perfect, except that the kernel imposes one limitation: siginfo.si_code must be less than 0. (This restriction exists to prevent a process from spoofing as the kernel when sending signals to other processes.) This means that even if we could use signalfd() to retrieve the two most significant bytes of si_code, we could not use rt_sigqueueinfo() to restore those bytes during a CRIU restore.
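
The restriction is easy to demonstrate with a sketch that calls rt_sigqueueinfo() directly via syscall() (glibc exposes only the higher-level sigqueue() wrapper). The call with a negative si_code succeeds; the call with a positive si_code fails, with EPERM on kernels of this era:

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int
    main(void)
    {
        siginfo_t si;
        sigset_t mask;

        sigemptyset(&mask);                 /* Keep the queued signal pending */
        sigaddset(&mask, SIGUSR1);
        sigprocmask(SIG_BLOCK, &mask, NULL);

        memset(&si, 0, sizeof(si));
        si.si_signo = SIGUSR1;

        si.si_code = SI_QUEUE;              /* Negative si_code: allowed */
        syscall(SYS_rt_sigqueueinfo, getpid(), SIGUSR1, &si);

        si.si_code = 1;                     /* Positive si_code: rejected */
        if (syscall(SYS_rt_sigqueueinfo, getpid(), SIGUSR1, &si) == -1)
            printf("positive si_code: %s\n", strerror(errno));
        return 0;
    }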

Progress towards a solution

Andrey's first attempt to add support for checkpoint/restore of pending signals took the form of an extension that added three new flags to the signalfd() system call. The first of these flags, SFD_RAW, changed the behavior of subsequent reads from the signalfd file descriptor: instead of returning a signalfd_siginfo structure, reads returned a "raw" siginfo structure that contains some information not returned via signalfd_siginfo and whose si_code field includes the two most significant bytes. The other flags, SFD_PRIVATE and SFD_GROUP, controlled whether reads should return signals from the per-thread queue or the process-wide queue.

One other piece of the patch set relaxed the restrictions in rt_sigqueueinfo() and rt_tgsigqueueinfo() so that a positive value can be specified in si_code, so long as the caller is sending a signal to itself. (It seems safe to allow a process to spoof as the kernel when sending signals to itself.)

A discussion on the design of the interface ensued between Andrey and Oleg Nesterov. Andrey noted that, for backward compatibility reasons, the signalfd_siginfo structure could not be fixed to supply the information required by CRIU, so a new message format really was required. Oleg noted that nondestructive reads that employed a positional interface (i.e., the ability to read message N from the queue) would probably be preferable.

In response to Oleg's feedback, Andrey has now produced a second version of his patches with a revised API. The SFD_RAW flag and the use of a "raw" siginfo structure remain, as do the changes to rt_sigqueueinfo() and rt_tgsigqueueinfo(). However, the new patch set provides a rather different interface for reading signals, via the pread() system call:

    ssize_t pread(int fd, void *buf, size_t count, off_t offset);

In normal use, pread() reads count bytes from the file referred to by the descriptor fd, starting at byte offset in the file. Andrey's patch repurposes the interface somewhat in order to read from signalfd file descriptors: offset is used both to select which queue to read from and to specify an ordinal position in that queue. The caller calculates the offset argument using the formula

    queue + pos

queue is either SFD_SHARED_QUEUE_OFFSET, to read from the process-wide signal queue, or SFD_PER_THREAD_QUEUE_OFFSET, to read from the per-thread signal queue. pos specifies an ordinal position (not a byte offset) in the queue; the first signal in the queue is at position 0. For example, the following call reads the fourth signal in the process-wide signal queue:

    n = pread(fd, &buf, sizeof(buf), SFD_SHARED_QUEUE_OFFSET + 3);

If there is no signal at position pos (i.e., an attempt was made to read past the end of the queue), pread() returns zero.

Using pread() to read signals from a signalfd file descriptor is nondestructive: the signal remains in the queue to be read again if desired.
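
Putting the pieces together, a checkpoint tool built on this API might drain the process-wide queue nondestructively along the lines of the following fragment. Note that this is a sketch against the proposed interface: SFD_RAW, SFD_SHARED_QUEUE_OFFSET, and the raw siginfo read format come from Andrey's patches and exist in no mainline kernel; mask is assumed to cover the signals of interest:

    /* Sketch only: relies on constants and semantics from the
       proposed (unmerged) patches */
    siginfo_t buf;
    ssize_t n;
    off_t pos;
    int fd;

    fd = signalfd(-1, &mask, SFD_RAW);

    for (pos = 0; ; pos++) {
        n = pread(fd, &buf, sizeof(buf), SFD_SHARED_QUEUE_OFFSET + pos);
        if (n <= 0)             /* Zero means no signal at this position */
            break;
        /* Save 'buf', including the complete si_code, to the checkpoint image */
    }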

Andrey's second round of patches has so far received little comment. Although Oleg proposed the revised API, he is unsure whether it will pass muster with Linus:

I think we should cc Linus.

This patch adds the hack and it makes signalfd even more strange.

Yes, this hack was suggested by me because I can't suggest something better. But if Linus dislikes this user-visible API it would be better to get his nack right now.

To date, however, a version of the patches that copies Linus does not seem to have gone out. In the meantime, Andrey's work serves as a good example of the complexities involved in getting CRIU to successfully handle checkpoint and restore of each piece of process state. And one way or another, checkpoint/restore of pending signals seems like a useful enough feature that it will make it into the kernel in some form, though possibly with a better API.

Comments (3 posted)

Xtables2 vs. nftables

By Jake Edge
January 9, 2013

The Linux kernel's firewall and packet filtering support has seen quite a few changes over the years. Back in 2009, it looked like a new packet filtering mechanism, nftables, was set to be the next generation solution for Linux. It was mentioned at the 2010 Kernel Summit as a solution that might apply more widely than just to netfilter. But nftables development stalled, until it was resurrected in 2012 by netfilter maintainer Pablo Neira Ayuso. During the lull, however, another, more incremental change to the existing netfilter code had been developed; Xtables2 was proposed for merging by Jan Engelhardt in mid-December. Both solutions target many of the same problems in the existing code, so it seems likely that only one of the two will be merged; which one is the subject of some debate.

Xtables2 has been under development since 2010 or so; Engelhardt gave a presentation [PDF] on it at the 2010 Netfilter workshop. Over the last few years, he has occasionally posted progress reports, but the pace of those (and development itself) seems to have picked up after Neira posted his intention to restart nftables development back in October. Given that it will be difficult—impossible, really—to sell two new packet filtering schemes, either the two will need to come together somehow, or one will have to win out.

At least so far, Neira and Engelhardt don't agree about the direction that netfilter should take. After the October nftables announcement, Engelhardt pointed out that one of the missing nftables features noted by Neira was already available in Xtables2: "atomic table replace and atomic dump". Neira's suggestion that Engelhardt look at adding the feature to nftables was rebuffed. Beyond that, though, Neira also said that it would be "*extremely hard* to justify its [Xtables2's] inclusion into mainline". To Engelhardt, and perhaps others, that sounded like a pre-judgment against merging Xtables2, which "would be really sad", he said. He continued by listing a number of useful features already available in Xtables2, including network namespace support, a netlink interface, read-copy update (RCU) support, atomic chain and table replacement, and more.

Much of Neira's announcement concerned the "compatibility layer" that will be needed for any replacement of the existing netfilter code. There are far too many users of iptables to leave behind—not to mention the "no ABI breakage" kernel rule. So, for some period of time, both the existing code that supports iptables and any new solution will have to coexist in the kernel. Eventually, the older code can be removed.

One of the main problems that both nftables and Xtables2 address is the code duplication in the existing netfilter implementation (which is often referred to as "xtables"). Because that code is protocol-aware, much of it is essentially duplicated four times in the kernel to handle the different use cases: IPv4, IPv6, ARP, and ethernet bridging. That is clearly sub-optimal; the difference between the two projects lies in how they eliminate the duplication. Xtables2 uses a single protocol-independent table per network namespace, while nftables defines a new virtual machine to process packets. Essentially, Xtables2 (as its name would imply) is an evolution of the existing code, while nftables is a more fundamental rework of Linux packet filtering.

That difference in approaches is evident in the discussion over Engelhardt's merge request. Neira is not impressed with the feature set, but he also complains that Xtables2 "inherits many of the design decisions that were taken while designing iptables back in the late nineties". As might be guessed, Engelhardt saw things differently:

nf_tables itself retains some "late nineties" design decisions as well.

In my opinion, there is nothing wrong with keeping some concepts. A developer is not required to reevaluate and reinnovate every concept there has been just for the heck of it. (The old "evolution, not revolution" credo.) Throwing everything overboard generally does not turn out to work these days.

Nftables is hardly a revolution, Neira said, because it implements backward compatibility: "revolutions are never backward compatible". Further discussion noted a number of conceptual similarities between the two approaches, with each side predictably arguing that their solution could do most or all of what the other could do.

There are some differences between the two, though. For one thing, Xtables2 seems further along, both with its code and with things like documentation and a roadmap. As Neira noted, the development hiatus for nftables clearly set the project back a ways, but he is not ready to provide more details quite yet:

I understand you want to know more on the future of nftables, but because the way I am, I prefer to skip "hot air" wording by now and talk on code done anytime soon.

So I have to request some patience from you. We promise to deliver as much information as possible once we are done with the core features.

So there are two competing proposals for how to move forward with netfilter, one that is largely ready (though certainly not complete), according to Engelhardt, and one that is still under active "core" development. While once seen as the likely successor, nftables certainly suffered from lack of attention for a few years, while Xtables2 was seemingly chugging along for much of that time.

Clearly, both Engelhardt and Neira favor their chosen solutions, and neither seems likely to join forces with the other. Engelhardt indicated that he isn't advocating dropping nftables, necessarily, but is instead focused on getting Xtables2 features into the hands of users:

In all fairness, I have never said anything about dropping nft. I am focused on xt2, its inclusion and subsequent maintenance, because it resolves the ipt shortcomings in a way that I think appeals most to the userspace crowd.

Neira proposed a discussion at this year's Netfilter workshop (which should happen by mid-year) to try to resolve the issue. While Engelhardt expressed some concern over the wait, a few months may be needed as Jozsef Kadlecsik pointed out: "Both nftables and xtables2 have got nice features, so it's not a simple question." Network maintainer David Miller concurred with Neira's proposal.

While it may be technically possible to merge Xtables2 and start down the path toward removing the older interface, then switch to nftables when it is ready, it seems an unlikely outcome. If the netfilter developers (and maintainers) are convinced that nftables is the right way forward, skipping the Xtables2 step makes sense. That may mean a longer delay before some of the longstanding problems in the existing code are addressed, but trying to maintain three different packet filtering schemes in the kernel is simply not going to happen.

Comments (14 posted)

Patches and updates

Kernel trees

Architecture-specific

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet

Distributions

Distributions face the MoinMoin and Rails vulnerabilities

By Jonathan Corbet
January 9, 2013
MoinMoin is a well-established wiki system with a long list of deployed sites. Like any web application, MoinMoin is in a sensitive position with regard to security: it tends to be directly exposed to the Internet, and, thus, must be able to handle anything that any attacker, anywhere in the world, might throw at it. Similar things could be said of the even more widely-used Ruby on Rails framework. Unfortunately, even in the most careful project, security problems will happen, leaving users exposed. How have distributions responded to the most recent MoinMoin and Rails vulnerabilities? As of this writing, the picture is not entirely encouraging.

Moin is less

MoinMoin's security record, as a whole, is not particularly bad. A look through the LWN vulnerability database shows a number of problems in 2009 and 2010, mostly of the cross-site scripting variety. There was only one vulnerability in 2011, and two in 2012. But the final 2012 vulnerability is not a small one: any attacker with write access is able to execute arbitrary code on the server. Since the purpose of most wiki systems is to give write access to the world (it takes considerable effort to revoke that access generally on a MoinMoin site), this is a widely exploitable vulnerability indeed.

One of the first victims was the Debian project, which disclosed a compromise on January 4. The Python project has also disclosed that its wiki site was broken into. Given the prevalence of the MoinMoin system, and the handy list of waiting victims (that is, deployed sites) posted by the project, it seems almost certain that there are other compromised sites out there. Anybody running a MoinMoin 1.9.x site that has not yet been patched should probably stop wasting time reading this article and fix their site.

But where is that fix to come from? Most of us, most of the time, outsource the business of integrating security patches to our distributors. That is one of the biggest advantages of running a well-supported distribution: we do not need to stay on top of every single vulnerability that gets reported. It is sufficient to install updates from the distributors occasionally and all should be well thereafter.

As of this writing, only two distributors — Debian and Ubuntu — have issued advisories for the MoinMoin vulnerability. None of the others have put out a fix yet. Some distributors, naturally, have no need to do so; MoinMoin is not shipped in Red Hat Enterprise Linux (and the version in the EPEL repository is old enough to not be vulnerable, but also old enough to be unsupported and possibly subject to an unknown number of other problems). Neither SUSE nor openSUSE appear to ship MoinMoin at all. But others are still shipping a vulnerable version.

Fedora ships the vulnerable version 1.9.5, for example; the vulnerability appears in the project's bug tracker, but no fix has yet been issued. The same applies to Gentoo; as of this writing, the bug entry suggests that an advisory is in the works, but it has not yet appeared. Linux Mint does not issue advisories at all; it's hard to say whether this distribution has picked up the fix or not. Anybody running Mandriva Linux is stuck with an old package, but that should be relatively low on their list of problems at this point.

Whether ten days (or more) is too long to wait for a fix for a huge security hole is, perhaps, a matter of perspective. In any case, users of a community distribution who are not paying for support have limited grounds for complaint. Community distributions have limited resources to put into security updates, especially during holiday periods, and a package like MoinMoin is not necessarily at the top of the priority list. Delays will happen, sometimes, though one would wish that they did not happen for problems of this magnitude.

Perhaps what is needed is a way for distributors to inform users of important vulnerabilities that cannot be immediately fixed. In this case, there are workarounds that a MoinMoin administrator can apply to secure a system (see the MoinMoin security fixes page for details) until a proper patch can be applied — but the administrator has to know that (1) there is a problem, and (2) a short-term workaround exists. If a distributor is unable to issue a timely advisory with a fix, perhaps they should at least consider issuing an advisory with a warning and any useful information that may be available?

Rough ride for rails

The Ruby on Rails project disclosed an SQL injection vulnerability on January 2, though the fact that this vulnerability already was known as CVE-2012-5664 suggests that it had been discovered earlier than that. On January 8, the project followed up with advisories for CVE-2013-0155 and CVE-2013-0156, the latter of which exposes most Rails-based sites to code-execution attacks. Thus far, the only distribution to issue updates is Debian. Most community distributions ship a version of Rails, and they are all shipping a vulnerable version as of this writing.

Once again, the advisories from the Rails project include workarounds for those who cannot immediately update their systems. Rails site administrators should be tuned into those advisories, but some certainly are not. Once again, an early warning from distributors might well save some of their users from considerable grief. Distributors are certainly aware of the problems and their workarounds, and they have a unique communication channel to their users. Perhaps, in cases where a fix cannot be made available right away, some sort of heads-up message could be sent out?

In the end, even users of relatively slow-to-update distributions may be better off than those who had to install MoinMoin or Rails from source because their distribution did not ship it. Every one of those hand-installed systems will remain vulnerable until the administrator hears that there is a problem, or, even worse, notices that the system has been compromised. Hopefully, most administrators will manage to get their systems updated before the worst happens. But it's hard to avoid thinking that some of our distributors could have done a little more to help them.

Comments (5 posted)

Brief items

Distribution quotes of the week

I use "political" and "ideological" without criticism. Debian's chief goal - freedom - is a matter of ideology. And because freedom always means escaping from someone's control, it's also a matter of politics.
-- Ian Jackson

It's about Ubuntu. It's about Ubuntu's malicious functionality, spyware. This is egregious behavior, and it calls for the strongest response. If it is accepted as normal, others are likely to follow the same path! We must respond to this as to a shocking crime.
-- Richard Stallman

Comments (none posted)

Red Hat Enterprise Linux 5.9 released

Red Hat has announced the availability of RHEL 5.9. "This release marks the beginning of Production Phase 2 of Red Hat Enterprise Linux 5 and demonstrates the company's continuing effort to promote stability and the preservation of customers' investments in the platform." The meaning of "production phase 2" can be found on this page; essentially, there will be no more software enhancements and hardware support enhancements will be limited to those that are easy to incorporate.

Comments (24 posted)

Open webOS on the Nexus 7

The Nexus 7 seems to have become the tablet development platform of choice; now webOS Nation reports that Open webOS has been ported to the N7. "The port was accomplished with the Galaxy Nexus project in conjunction with LibHybris, created by Carsten Munk (an engineer at Jolla, though he also leads Merproject, which grew out of Sailfish ancestors Maemo and Meego), a library that allows for 'bionic-based [Android] hardware adaptations in glibc systems', in essence making it easier to translate between the designed-for-Android hardware and Linux-based software like the Open webOS operating system."

Comments (none posted)

Fedora 18 Beta for ARM

The Fedora ARM team has announced that the Fedora 18 Beta release for ARM is now available. "The Beta release includes pre-built images for Versatile Express (QEMU), Trimslice (Tegra), Pandaboard (OMAP4), GuruPlug (Kirkwood), and Beagleboard (OMAP3) hardware platforms. The Fedora 18 Beta for ARM now includes an install tree in the yum repository which may be used to PXE-boot a kickstart-based install on systems that support it, such as the Calxeda EnergyCore (HighBank)."

Full Story (comments: none)

Distribution News

Debian GNU/Linux

An analysis of Debian wiki security breach

The Debian project disclosed that the security of its wiki system had been compromised. An analysis of that compromise and its implications has now been posted. "We have completed our audit of the original server hosting wiki.debian.org and have concluded that the penetration did not yield escalated privileges for the attacker(s) beyond the 'wiki' service account. That said, it is clear that the attacker(s) have captured the email addresses and corresponding password hashes of all wiki editors. The attacker(s) were particularly interested in the password hashes belonging to users of Debian, Intel, Dell, Google, Microsoft, GNU, any .gov and any .edu."

Full Story (comments: 18)

bits from the DPL: December 2012

Click below for Stefano Zacchiroli's monthly bits about DPL activities. Topics include talks, assets, DPL helpers, and collaboration with the outer world.

Full Story (comments: none)

Newsletters and articles of interest

Distribution newsletters

Comments (none posted)

Page editor: Rebecca Sobol

Development

Namespaces in operation, part 2: the namespaces API

By Michael Kerrisk
January 8, 2013

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the resource. Namespaces are used for a variety of purposes, with the most notable being the implementation of containers, a technique for lightweight virtualization. This is the second part in a series of articles that looks in some detail at namespaces and the namespaces API. The first article in this series provided an overview of namespaces. This article looks at the namespaces API in some detail and shows the API in action in a number of example programs.

The namespace API consists of three system calls—clone(), unshare(), and setns()—and a number of /proc files. In this article, we'll look at all of these system calls and some of the /proc files. In order to specify a namespace type on which to operate, the three system calls make use of the CLONE_NEW* constants listed in the previous article: CLONE_NEWIPC, CLONE_NEWNS, CLONE_NEWNET, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS.

Creating a child in a new namespace: clone()

One way of creating a namespace is via the use of clone(), a system call that creates a new process. For our purposes, clone() has the following prototype:

    int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);

Essentially, clone() is a more general version of the traditional UNIX fork() system call whose functionality can be controlled via the flags argument. In all, there are more than twenty different CLONE_* flags that control various aspects of the operation of clone(), including whether the parent and child process share resources such as virtual memory, open file descriptors, and signal dispositions. If one of the CLONE_NEW* bits is specified in the call, then a new namespace of the corresponding type is created, and the new process is made a member of that namespace; multiple CLONE_NEW* bits can be specified in flags.

Our example program (demo_uts_namespaces.c) uses clone() with the CLONE_NEWUTS flag to create a UTS namespace. As we saw last week, UTS namespaces isolate two system identifiers—the hostname and the NIS domain name—that are set using the sethostname() and setdomainname() system calls and returned by the uname() system call. You can find the full source of the program here. Below, we'll focus on just some of the key pieces of the program (and for brevity, we'll omit the error checking code that is present in the full version of the program).

The example program takes one command-line argument. When run, it creates a child that executes in a new UTS namespace. Inside that namespace, the child changes the hostname to the string given as the program's command-line argument.

The first significant piece of the main program is the clone() call that creates the child process:

    child_pid = clone(childFunc, 
                      child_stack + STACK_SIZE,   /* Points to start of 
                                                     downwardly growing stack */ 
                      CLONE_NEWUTS | SIGCHLD, argv[1]);

    printf("PID of child created by clone() is %ld\n", (long) child_pid);

The new child will begin execution in the user-defined function childFunc(); that function will receive the final clone() argument (argv[1]) as its argument. Since CLONE_NEWUTS is specified as part of the flags argument, the child will execute in a newly created UTS namespace.

The main program then sleeps for a moment. This is a (crude) way of giving the child time to change the hostname in its UTS namespace. The program then uses uname() to retrieve the host name in the parent's UTS namespace, and displays that hostname:

    sleep(1);           /* Give child time to change its hostname */

    uname(&uts);
    printf("uts.nodename in parent: %s\n", uts.nodename);

Meanwhile, the childFunc() function executed by the child created by clone() first changes the hostname to the value supplied in its argument, and then retrieves and displays the modified hostname:

    sethostname(arg, strlen(arg));
    
    uname(&uts);
    printf("uts.nodename in child:  %s\n", uts.nodename);

Before terminating, the child sleeps for a while. This has the effect of keeping the child's UTS namespace open, and gives us a chance to conduct some of the experiments that we show later.

Running the program demonstrates that the parent and child processes have independent UTS namespaces:

    $ su                   # Need privilege to create a UTS namespace
    Password: 
    # uname -n
    antero
    # ./demo_uts_namespaces bizarro
    PID of child created by clone() is 27514
    uts.nodename in child:  bizarro
    uts.nodename in parent: antero

As with most other namespaces (user namespaces are the exception), creating a UTS namespace requires privilege (specifically, CAP_SYS_ADMIN). This is necessary to avoid scenarios where set-user-ID applications could be fooled into doing the wrong thing because the system has an unexpected hostname.

Another possibility is that a set-user-ID application might be using the hostname as part of the name of a lock file. If an unprivileged user could run the application in a UTS namespace with an arbitrary hostname, this would open the application to various attacks. Most simply, this would nullify the effect of the lock file, triggering misbehavior in instances of the application that run in different UTS namespaces. Alternatively, a malicious user could run a set-user-ID application in a UTS namespace with a hostname that causes creation of the lock file to overwrite an important file. (Hostname strings can contain arbitrary characters, including slashes.)

The /proc/PID/ns files

Each process has a /proc/PID/ns directory that contains one file for each type of namespace. Starting in Linux 3.8, each of these files is a special symbolic link that provides a kind of handle for performing certain operations on the associated namespace for the process.

    $ ls -l /proc/$$/ns         # $$ is replaced by shell's PID
    total 0
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 ipc -> ipc:[4026531839]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 mnt -> mnt:[4026531840]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 net -> net:[4026531956]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 pid -> pid:[4026531836]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 user -> user:[4026531837]
    lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 uts -> uts:[4026531838]

One use of these symbolic links is to discover whether two processes are in the same namespace. The kernel does some magic to ensure that if two processes are in the same namespace, then the inode numbers reported for the corresponding symbolic links in /proc/PID/ns will be the same. The inode numbers can be obtained using the stat() system call (in the st_ino field of the returned structure).
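
As a concrete illustration, a sketch of that check in C might look like the following; the PIDs are placeholders (here, the parent and child from our demo_uts_namespaces run), and a more careful version would compare st_dev as well as st_ino:

    #include <stdio.h>
    #include <sys/stat.h>

    int
    main(void)
    {
        struct stat sb1, sb2;

        /* PIDs are placeholders for the two processes of interest */
        stat("/proc/27513/ns/uts", &sb1);
        stat("/proc/27514/ns/uts", &sb2);

        if (sb1.st_ino == sb2.st_ino)
            printf("same UTS namespace\n");
        else
            printf("different UTS namespaces\n");
        return 0;
    }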

However, the kernel also constructs each of the /proc/PID/ns symbolic links so that it points to a name consisting of a string that identifies the namespace type, followed by the inode number. We can examine this name using either the ls -l or the readlink command.

Let's return to the shell session above where we ran the demo_uts_namespaces program. Looking at the /proc/PID/ns symbolic links for the parent and child process provides an alternative method of checking whether the two processes are in the same or different UTS namespaces:

    ^Z                                # Stop parent and child
    [1]+  Stopped          ./demo_uts_namespaces bizarro
    # jobs -l                         # Show PID of parent process
    [1]+ 27513 Stopped         ./demo_uts_namespaces bizarro
    # readlink /proc/27513/ns/uts     # Show parent UTS namespace
    uts:[4026531838]
    # readlink /proc/27514/ns/uts     # Show child UTS namespace
    uts:[4026532338]

As can be seen, the content of the /proc/PID/ns/uts symbolic links differs, indicating that the two processes are in different UTS namespaces.

The /proc/PID/ns symbolic links also serve other purposes. If we open one of these files, then the namespace will continue to exist as long as the file descriptor remains open, even if all processes in the namespace terminate. The same effect can also be obtained by bind mounting one of the symbolic links to another location in the file system:

    # touch ~/uts                            # Create mount point
    # mount --bind /proc/27514/ns/uts ~/uts

Before Linux 3.8, the files in /proc/PID/ns were hard links rather than special symbolic links of the form described above. In addition, only the ipc, net, and uts files were present.

Joining an existing namespace: setns()

Keeping a namespace open when it contains no processes is of course only useful if we intend to later add processes to it. That is the task of the setns() system call, which allows the calling process to join an existing namespace:

    int setns(int fd, int nstype);

More precisely, setns() disassociates the calling process from one instance of a particular namespace type and reassociates the process with another instance of the same namespace type.

The fd argument specifies the namespace to join; it is a file descriptor that refers to one of the symbolic links in a /proc/PID/ns directory. That file descriptor can be obtained either by opening one of those symbolic links directly or by opening a file that was bind mounted to one of the links.

The nstype argument allows the caller to check the type of namespace that fd refers to. If this argument is specified as zero, no check is performed. This can be useful if the caller already knows the namespace type, or does not care about the type. The example program that we discuss in a moment (ns_exec.c) falls into the latter category: it is designed to work with any namespace type. Specifying nstype instead as one of the CLONE_NEW* constants causes the kernel to verify that fd is a file descriptor for the corresponding namespace type. This can be useful if, for example, the caller was passed the file descriptor via a UNIX domain socket and needs to verify what type of namespace it refers to.
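
For instance, a caller that wants to be sure it is joining a UTS namespace might use a fragment like the following (fd is assumed to have been obtained as described above); if fd refers to a namespace of some other type, setns() fails with EINVAL:

    if (setns(fd, CLONE_NEWUTS) == -1) {    /* Join only if fd refers to
                                               a UTS namespace */
        perror("setns");
        exit(EXIT_FAILURE);
    }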

Using setns() and execve() (or one of the other exec() functions) allows us to construct a simple but useful tool: a program that joins a specified namespace and then executes a command in that namespace.

Our program (ns_exec.c, whose full source can be found here) takes two or more command-line arguments. The first argument is the pathname of a /proc/PID/ns/* symbolic link (or a file that is bind mounted to one of those symbolic links). The remaining arguments are the name of a program to be executed inside the namespace that corresponds to that symbolic link and optional command-line arguments to be given to that program. The key steps in the program are the following:

    fd = open(argv[1], O_RDONLY);   /* Get descriptor for namespace */

    setns(fd, 0);                   /* Join that namespace */

    execvp(argv[2], &argv[2]);      /* Execute a command in namespace */

An interesting program to execute inside a namespace is, of course, a shell. We can use the bind mount for the UTS namespace that we created earlier in conjunction with the ns_exec program to execute a shell in the new UTS namespace created by our invocation of demo_uts_namespaces:

    # ./ns_exec ~/uts /bin/bash     # ~/uts is bound to /proc/27514/ns/uts
    My PID is: 28788

We can then verify that the shell is in the same UTS namespace as the child process created by demo_uts_namespaces, both by inspecting the hostname and by comparing the inode numbers of the /proc/PID/ns/uts files:

    # hostname
    bizarro
    # readlink /proc/27514/ns/uts
    uts:[4026532338]
    # readlink /proc/$$/ns/uts      # $$ is replaced by shell's PID
    uts:[4026532338]

In earlier kernel versions, it was not possible to use setns() to join mount, PID, and user namespaces, but, starting with Linux 3.8, setns() now supports joining all namespace types.

Leaving a namespace: unshare()

The final system call in the namespaces API is unshare():

    int unshare(int flags);

The unshare() system call provides functionality similar to clone(), but operates on the calling process: it creates the new namespaces specified by the CLONE_NEW* bits in its flags argument and makes the caller a member of the namespaces. (As with clone(), unshare() provides functionality beyond working with namespaces that we'll ignore here.) The main purpose of unshare() is to isolate namespace (and other) side effects without having to create a new process or thread (as is done by clone()).

Leaving aside the other effects of the clone() system call, a call of the form:

    clone(..., CLONE_NEWXXX, ...);

is roughly equivalent, in namespace terms, to the sequence:

    if (fork() == 0)
        unshare(CLONE_NEWXXX);      /* Executed in the child process */

One use of the unshare() system call is in the implementation of the unshare command, which allows the user to execute a command in a separate namespace from the shell. The general form of this command is:

    unshare [options] program [arguments]

The options are command-line flags that specify the namespaces to unshare before executing program with the specified arguments.

The key steps in the implementation of the unshare command are straightforward:

     /* Code to initialize 'flags' according to command-line options
        omitted */

     unshare(flags);

     /* Now execute 'program' with 'arguments'; 'optind' is the index
        of the next command-line argument after options */

     execvp(argv[optind], &argv[optind]);

A simple implementation of the unshare command (unshare.c) can be found here.

In the following shell session, we use our unshare.c program to execute a shell in a separate mount namespace. As we noted in last week's article, mount namespaces isolate the set of filesystem mount points seen by a group of processes, allowing processes in different mount namespaces to have different views of the filesystem hierarchy.

    # echo $$                             # Show PID of shell
    8490
    # cat /proc/8490/mounts | grep mq     # Show one of the mounts in namespace
    mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
    # readlink /proc/8490/ns/mnt          # Show mount namespace ID 
    mnt:[4026531840]
    # ./unshare -m /bin/bash              # Start new shell in separate mount namespace
    # readlink /proc/$$/ns/mnt            # Show mount namespace ID 
    mnt:[4026532325]

Comparing the output of the two readlink commands shows that the two shells are in separate mount namespaces. Altering the set of mount points in one of the namespaces and checking whether that change is visible in the other namespace provides another way of demonstrating that the two programs are in separate namespaces:

    # umount /dev/mqueue                  # Remove a mount point in this shell
    # cat /proc/$$/mounts | grep mq       # Verify that mount point is gone
    # cat /proc/8490/mounts | grep mq     # Is it still present in the other namespace?
    mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0

As can be seen from the output of the last two commands, the /dev/mqueue mount point has disappeared in one mount namespace, but continues to exist in the other.

Concluding remarks

In this article we've looked at the fundamental pieces of the namespace API and how they are employed together. In the follow-on articles, we'll look in more depth at some other namespaces, in particular, the PID and user namespaces; user namespaces open up a range of new possibilities for applications to use kernel interfaces that were formerly restricted to privileged applications.

(2013-01-15: updated the concluding remarks to reflect the fact that there will be more than one following article.)

Comments (10 posted)

Brief items

Quotes of the week

...with so much drama, he now praises the iPad.
-- Google's text-to-speech engine, when fed any of a number of input strings containing "-ed with."

A programmer had a problem. He thought to himself, "I know, I'll solve it with threads!". has Now problems. two he
-- Davidlohr Bueso

Comments (4 posted)

Systemd 197 released

For those watching systemd, version 197 has a number of interesting new features, including a new mechanism for stable network interface naming, an integrated bootchart implementation, optimized readahead behavior on Btrfs, and more. "systemd will no longer detect and recognize specific distributions. All distribution-specific #ifdeffery has been removed, systemd is now fully generic and distribution-agnostic. Effectively, not too much is lost as a lot of the code is still accessible via explicit configure switches. However, support for some distribution specific legacy configuration file formats has been dropped. We recommend distributions to simply adopt the configuration files everybody else uses now and convert the old configuration from packaging scripts."

Full Story (comments: 44)

Firefox 18 is now available

Firefox 18 has been released. See the release notes for the details. There are Firefox for Android release notes also.

Full Story (comments: 48)

wiki.python.org compromised

The python.org wiki was compromised and the attacker gained shell access to the "moin" user account. "Some time later, the attacker deleted all files owned by the "moin" user, including all instance data for both the Python and Jython wikis. The attacker also had full access to all MoinMoin user data on all wikis. In light of this, the Python Software Foundation encourages all wiki users to change their password on other sites if the same one is in use elsewhere. We apologize for the inconvenience and will post further news as we bring the new and improved wiki.python.org online."

This is the second high-profile compromise of a Moin-based wiki reported recently; anybody running such a site should be sure they are current with their security patches.

Full Story (comments: 2)

GCC 4.8.0 status report: only regression fixes and docs accepted as of now

Richard Biener has posted another report on the status of GCC 4.8.0 development. Noting that the code had "stabilized itself over the holidays," stage 3 of the development cycle is over, so "GCC trunk is now in release branch mode, thus only regression fixes and documentation changes are allowed now."

Full Story (comments: none)

GNU Radio 3.6.3 available

Version 3.6.3 of GNU Radio has been released. This release adds "major new capabilities and many bug fixes, while maintaining strict source compatibility with user code already written for the 3.6 API." Enhancements include asynchronous message passing, new blocks for interacting with the operating system's networking stack, and the ability to write signal processing blocks in Python.

Full Story (comments: none)

Newsletters and articles

Development newsletters from the past week

Comments (none posted)

The USPTO Would Like to Partner with the Software Community ... Wait. What? Really? (Groklaw)

Groklaw reports on an invitation from the United States Patent and Trademark Office (USPTO) for software developers to join two roundtable discussions aimed at "enhancing" the quality of software patents. Both events are in February: one in New York City and one in Silicon Valley. As Groklaw points out, the events are space-limited and proprietary software vendors are sure to attend. "Large companies with patent portfolios they treasure and don't want to lose can't represent the interests of individual developers or the FOSS community, those most seriously damaged by toxic software patents." (Thanks to Davide Del Vento)

Comments (36 posted)

Fox: How to un-break GNOME menus

Taryn Fox has written a blog entry examining GNOME 3's global application menu, including a few outstanding problems, and a proposal for how they could be addressed. "Ideally, the App Menu will contain all of a given app's functionality. This is the assumption new GNOME apps (like Documents) are building on, and the one certain existing apps (like Empathy) are adopting."

Comments (none posted)

Crouch: Packaged HTML5 Apps: Are we emulating failure?

At his blog, Mozilla's Luke Crouch explores whether or not HTML applications deployed on "web runtime" environments offer a better experience than those running in a traditional browser. The answer is evidently no: "If you're making an HTML5 app, consider - do you want to make a native desktop application? Why or why not? Then consider if the same reasoning is true for the native mobile application."

Comments (none posted)

Page editor: Nathan Willis

Announcements

Brief items

Some new minor site features

Regular LWN readers might be aware of the fact that we report from a fair number of conferences. The curious can now see just how many (and what we report on) in the new LWN.net conference coverage index. It turns out that even we were surprised by just how many events we've been to. Needless to say, we're not done; the conference index will be kept current as we report from future events (next stop: linux.conf.au).

Part of getting to future conferences, of course, is remembering to get our speaking proposals in on time. On the suspicion that we are not the only ones with this kind of problem, we have extended the LWN Events Calendar to include a calendar dedicated to call-for-proposals deadlines. If you have been thinking about presenting your work to the community and would like to know whose deadlines you are about to miss, the CFP deadline calendar should be a helpful resource.

Comments (16 posted)

Crowdfunding "Software Wars"

"Software Wars" is a "movie about how free software will save you thousands of dollars and lead to a better world". The project is looking for funds to finish the movie in an indiegogo campaign. "The average computer user is unaware there is a war for freedom going on that will determine the path of modern society. Software Wars is a movie about the battle for our right to share technology and ideas."

Comments (none posted)

Upcoming Events

PyCon 2013 Schedule Released

PyCon 2013 has announced the schedule for the conference. "The conference begins with two days of tutorials on March 13 and 14, followed by three days of talks from Friday March 15 through 17, ending with four days of sprints through March 21. The tutorial selections this year came from a record pool of 129 submissions, up from 2012's record of 88. "We faced a lot of hard decisions, choosing from some really outstanding proposals by excellent and experienced instructors," said tutorial co-chair Stuart Williams. A highlight of the schedule is expansion to include coverage for beginners not just to the Python language, but also to those new to programming as a whole. There's coverage from the web to the desktop and everything in between, with several tutorials being presented by full-time educators." PyCon takes place in Santa Clara, California.

Comments (none posted)

Events: January 10, 2013 to March 11, 2013

The following event listing is taken from the LWN.net Calendar.

    Date(s)                Event                                                         Location
    January 18-19          Columbus Python Workshop                                      Columbus, OH, USA
    January 18-20          FUDCon:Lawrence 2013                                          Lawrence, Kansas, USA
    January 20             Berlin Open Source Meetup                                     Berlin, Germany
    January 28-February 2  Linux.conf.au 2013                                            Canberra, Australia
    February 2-3           Free and Open Source software Developers' European Meeting    Brussels, Belgium
    February 15-17         Linux Vacation / Eastern Europe 2013 Winter Edition           Minsk, Belarus
    February 18-19         Android Builders Summit                                       San Francisco, CA, USA
    February 20-22         Embedded Linux Conference                                     San Francisco, CA, USA
    February 22-24         Southern California Linux Expo                                Los Angeles, CA, USA
    February 22-24         FOSSMeet 2013                                                 Calicut, India
    February 22-24         Mini DebConf at FOSSMeet 2013                                 Calicut, India
    February 23-24         DevConf.cz 2013                                               Brno, Czech Republic
    February 25-March 1    ConFoo                                                        Montreal, Canada
    February 26-28         ApacheCon NA 2013                                             Portland, Oregon, USA
    February 26-28         O’Reilly Strata Conference                                    Santa Clara, CA, USA
    February 26-March 1    GUUG Spring Conference 2013                                   Frankfurt, Germany
    March 4-8              LCA13: Linaro Connect Asia                                    Hong Kong, China
    March 6-8              Magnolia Amplify 2013                                         Miami, FL, USA
    March 9-10             Open Source Days 2013                                         Copenhagen, DK

If your event does not appear here, please tell us about it.

Page editor: Rebecca Sobol


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds