LWN.net Weekly Edition for November 8, 2012 [LWN.net]

LCE: Challenges for Linux networking

By Jonathan Corbet
November 7, 2012

It's common to think of networking as being essentially a solved problem for Linux. But, like the rest of the computing environment, networking is evolving quickly over time, and Linux will have to evolve with it. Marcel Holtmann used his LinuxCon Europe slot to talk about some of the work that is being done to ensure that Linux networking remains second to none.

In particular, Marcel is concerned with improving networking on mobile devices. He is staying away from enterprise networking, which, he said, is pretty well worked out. It's mostly based on cabled connections with "no crazy stuff" to worry about. But mobile networking is different. The environment is far more dynamic, there is much more user interaction involved, and there is a long list of tools needed to manage it — and many of those tools are old and not up to the task.

That said, some of the problems are nicely solved at this point. The "networking basics" are in place; these include Ethernet and WiFi support, along with Bluetooth, cellular protocols, and higher-level protocols like DHCP, DNS, and NTP. IPv6 support is mostly there, and tethering support is now in good shape. We also have the beginnings of support for the WISPr protocol which enables automatic login to wireless hotspots. Things are quite good, Marcel said, for the simpler use cases.

The problem is that, increasingly, the common use cases are not so simple, so there are a number of challenges to be overcome before Linux can take the next step. The first of those is being able to answer a simple question: is the system actually online? Simply being associated with an access point is often not enough if that access point requires authentication. It is not an uncommon experience to have a smartphone hook up to an access point with a recognized SSID and, since the authentication step has not yet happened, lose connectivity entirely. The problem is that, at this point, there is no easy way to know if a system is actually on the net or not.

An additional complication is the concept of a locally-connected network; a mobile device may be connected to local devices (a television, say) without having access to the wider Internet. We don't currently have a way to detect that situation or to communicate it to applications.

Connectivity is currently thought of at a global level — the system is either online or it is not. But different applications have different connectivity needs, so we need to move toward a concept of per-application connectivity. The system should be able to go online for a specific task (synchronizing an important application, say) without necessarily opening the floodgates for any other application that may want to talk to the world. We currently have no way to do that because Linux networking is based around systems, not around applications.

An additional irritation is that, once connectivity is established, all applications are notified of the availability of the net at the same time. That creates a sort of thundering herd problem where a whole range of processes try to open connections simultaneously. The result can be a clogged link and frustrated users. We really need a one-at-a-time waking mechanism for applications, with the more important ones being notified first.

There is a real need for finer-grained usage statistics for the net; currently all statistics are global, which tends not to be helpful in the mobile environment. There are no per-user or per-application statistics, and no way to distinguish home-network traffic from roaming traffic. Android has a solution of sorts for this problem, Marcel said, but it is "a horrible nasty hack" that will never make it upstream. Somehow we need to come up with a cleaner implementation.

Time is simple in the data center; one just has to fire up NTP and forget about it. But it is a more challenging concept on a mobile system. Those systems can move between time zones; it can be hard to know what time it actually is. For added fun, many of the available sources of time information — cellular networks, for example — are often simply wrong. So the system needs to find a way to look at data from multiple sources before concluding that it is safe to change the current time zone.

Time confusion can also create trouble for applications, especially those with complex synchronization requirements. To avoid problems, the system really needs to figure out the correct time before notifying applications that the network is available.

The concept of network routing must become more sophisticated in the mobile environment. Some applications, for example, should only be run if the virtual private network (VPN) is available, while others can run over any network. Other applications should not be allowed to use the VPN regardless of its availability; the same is true of Bluetooth and other types of connections.

If a WiFi connection becomes available, some applications may want to use it immediately, but an application in the middle of a conversation over the cellular connection should be allowed to wrap things up before being switched to the new route. So there needs to be something like per-application routing tables, something that a tool like ConnMan can change easily, but, Marcel said, the current network namespace implementation is too strict.

What bugs Marcel the most, though, is the Linux firewall mechanism. He wants to be able to set up personal firewalls that differ based on the current location; a configuration that works at home or in the office may be entirely inappropriate in a coffee shop. He didn't have much good to say about the Linux firewall implementation; it's highly capable, but impossible to configure in a reasonable way. There are, he said, lots of race conditions; reconfiguring the firewall involves clearing the old configuration and leaving things open for a period. He is hoping for a conversion to nftables, but progress on that front is slow.

Marcel would also like to see the easy creation of per-application firewalls; just because a port is opened for one application does not mean that others should be able to use it.

Local WiFi networks, for applications like sending video streams to a television, are coming. That requires support for protocols like WiFi Direct. The good news is that the kernel has fairly nice support at this point; unfortunately, the user interface work has not been done. So only expert users can make use of this capability now. Making it work properly for all users will require a lot of integration work into a lot of subsystems, much like full Bluetooth support needed.

Speaking of Bluetooth, it now has support for pairing via the near-field communications (NFC) protocol. Essentially, devices can be paired by touching them together. NFC-based authentication for WiFi access points is on its way. There is also work being done to support sensor networks over protocols like 802.15.4 and 6LoWPAN.

To summarize the talk, there is a lot of work to be done to provide the best mobile networking experience with Linux. Happily, that work is being done and the pieces are falling into place. Lest anybody think that there will be nothing for anybody to complain about, though, it is worth noting an admonition that Marcel saved for the very end of the talk: in the future, the use of systemd and control groups will be mandatory to get good mobile networking with Linux.

[Your editor would like to thank the Linux Foundation for funding his travel to the event.]

Comments (20 posted)

LCE: Systemd two years on

By Michael Kerrisk
November 7, 2012

It has now been a little more than two years since the systemd project began, and in that time a lot has happened. So Lennart Poettering, one of the main developers of systemd, took the opportunity to look back at the progress to date, in a talk at LinuxCon Europe 2012 . Kay Sievers, the other principal systemd developer, was also in the audience, and offered some pieces of background information during the talk.

The first announcement of systemd was in April 2010. The first distribution based on systemd, Fedora 15, came out in May 2011. Since then, it has appeared as the default init system in a number of other distributions, including openSUSE, Mageia, Mandriva, Arch Linux, and others. Systemd is also included in Debian and Gentoo, but it is not the default init system. The most notable absence from the list is, of course, Ubuntu. Lennart expressed the hope that Ubuntu might still switch over, but sounds somewhat less confident of that these days than he has in the past.

Lennart reported that the systemd project is in good health, with 15 committers (including committers from a range of distributors) and 374 contributors to date, although the latter number is inflated by the incorporation of udev into systemd; if the udev contributors that have contributed to that project during its seven-year lifespan are excluded, then systemd has had about 150 contributors. There are about 30 contributors in any given month, and around 490 subscribers to the project mailing list.

One of the goals of systemd was to be able to boot a system without requiring shell scripts, and that goal was achieved about a year ago. The problem with shell scripts, of course, is that they make it difficult to implement the necessary parallelization and asynchronous logic that are needed for fast boot times. As a result of eliminating shell scripts from the boot phase, the number of processes created during system start-up has fallen from 1500 to about 150. On a system with an SSD, booting a full user space can be accomplished in under one second on a standard PC-architecture laptop. Users of Fedora are unlikely to achieve boot times that are quite that fast, because of dependencies like LVM, but boot times of 10 seconds should be achievable "out of the box" with Fedora. Thus, whereas once upon a time the kernel was fast and user space was slow to boot, these days the position has reversed. Lennart remarked that the ball is now back in the kernel's court in terms of improving boot speeds.

Lennart noted that systemd started out as an init system, and as such replaces the classical System V init (sysvinit) program. But along the way, the developers figured out that booting a system is not solely the job of PID 1 (the PID created for the first process on the system, which executes the init program). There are other processes as well, such as init scripts. So the developers redefined systemd a little more, to be not just an init system, but also a platform. Thus, systemd includes a couple of mini-services that are useful for booting. These services include, for example, reading ahead files that were used during the last boot so that booting proceeds faster this time around. Systemd also added a number of other components, such as udev and a system logger, and replacements for pieces like ConsoleKit. The goal of including these services and components within systemd is to create a more integrated boot system.

These changes are sometimes perceived as feature creep in what is supposed to be simply an init system, which prompts the question: is it bloated? Lennart's response is an emphatic "no". In the systemd platform, PID 1 is simply a unit control system. Other tasks are performed by other pieces that ship as part of the systemd package. He likened this to being a repackaging of what once were separate tools that interacted during the boot phase. That repackaging allows a more integrated approach to the boot process that also eliminates some code and feature duplication that occurred across the formerly separate components. By eliminating duplication, and providing a more integrated approach to the boot components, the resulting system is easier for system administrators to manage. The increasing scope of systemd has created more dependencies on other packages, but Lennart noted that almost all of these dependencies are optional, and only required if users want the optional features available in systemd.

At this point, an audience member asked about the status of systemd with Busybox for embedded systems. Lennart noted that he works mainly on desktop systems, but the project certainly does care about embedded systems.

Systemd integrates with some other operating system components. For example, Dracut can pass performance information to systemd, as well as other information, such as whether the root file system was consistency checked. Systemd integrates closely with udev—so closely that udev was eventually merged into systemd, a decision that was, as Lennart noted, widely discussed, and somewhat controversial. Systemd integrates with D-Bus activation, so that if a D-Bus name is requested, a service will be started. Systemd integrates with the Plymouth boot screen splash tool, so that the two systems coordinate their handling of the screen and passwords. It integrates with the Gummiboot boot loader for EFI, so that it is possible to get boot statistics such as measures of the amount of time spent in BIOS, bootloader, kernel initialization, in init ramdisk, and in user-space activation.

Like the Linux kernel, systemd targets the full spectrum of platforms, with features for mobile, embedded, desktop, and server systems, and Lennart remarked that it was interesting that these diverse platforms gained common benefits from systemd. For example, the control-group management features for systemd are of interest for resource management in both server and embedded systems.

An audience member noted that the systemd project has not focused on porting systemd to other operating systems, such as FreeBSD, and asked whether there was no FreeBSD developer who said they'd like to port systemd to FreeBSD? That has not happened so far, Lennart said. The problem is that systemd is tightly bound to various Linux-specific features, such as control groups, for which there are no equivalents on other operating systems. Kay noted that attempts to port udev to other systems went nowhere, and that would be a necessary first step to port systemd.

Systemd has made a number of components in the traditional init system obsolete. The most obvious of these are, of course, sysvinit and init scripts, which are replaced by small C programs that can be parallelized during boot. In addition, systemd made a number of other components obsolete, such as ConsoleKit, pm-utils, and acpid (the ACPI daemon). Systemd's socket-activation feature renders inetd obsolete. Lennart emphasized that while systemd renders all of these components obsolete, most of the components (with obvious exceptions such as sysvinit) can still be used in parallel with systemd, if desired.

At this point, an audience member opined that the classical UNIX init process was designed to be as small and robust as possible, but that this seemed to contrast with where systemd was going. Is systemd at risk of becoming a large single point of failure? Lennart pointed out that while systemd replaces many earlier tools, it performs those tasks in separate processes. However, the notable difference is that the various components operate in a much more integrated fashion.

Lennart then turned to a discussion of the Journal. The Journal is a recently announced systemd component, which is posited as the successor of the traditional syslog logging system. The syslog system is nearly 30 years old, renders log messages in plain text, and does not lend itself to seemingly simple tasks such as showing the last ten log messages produced by a particular service. There are many problems with syslog. For example, it does not index the log and it does not verify the source of log messages.

The systemd developers concluded that logging is an essential part of service management, and none of the current solutions was adequate for systemd's needs, so they came up with the Journal. The Journal has structured metadata about every log event, including eighteen fields with information such as user ID, group ID, and SELinux context. The log is indexed, and secured so that the logged data is trustworthy. The Journal is also rate limited, so that no individual service can clog the log. The Journal has forward secure sealing, so that, periodically (by default, every fifteen minutes) a cryptographic tag is added to the logged data, and the cryptographic key is discarded. The effect is that an attacker who breaks into the system can't alter history in the log (other than possibly the log entries created in the current interval). Work is ongoing to add network logging functionality to the Journal.

Unfortunately, Lennart ran out of time, so that he was unable to go over his thoughts about the future of systemd. However, after two years, it's clear that systemd is an established part of the Linux landscape, and there are increasing signs that it is moving toward becoming an essential part of the operating system.

Comments (80 posted)

21st-century Csound

November 7, 2012

This article was contributed by Dave Phillips

Csound is a digital audio synthesis and processing environment with facilities for music and sound composition. It is a product of the world of Unix workstations in the 1980s, with a native user interface consisting of plain text files created in a text editor or word processor. Csound is one of the many systems that evolved from Max Mathews's pioneering MusicN series, which was the first set of programming languages dedicated to creating sound and music on computers. Csound is arguably the most advanced of all the systems derived from the MusicN languages.

More specifically, Csound evolved from Music5. It was written by Barry Vercoe, but, since its inception, the system has grown from contributions by many other developers. Csound has always been open-source software. Originally covered by a license from MIT, Csound is now protected by the LGPL, which makes it free and open-source software by contemporary standards. The source package also carries free license terms for certain components not covered by GNU licenses.

The complete Csound system is made of many parts, but the two most important are the Csound programming language and its compiler. A musical piece for Csound is usually written as a collection of instrument definitions and score directives, all of which are compiled to an audio waveform by the csound binary, either offline or in realtime. The language can describe a synthesizer, build a guitar effects processor, perform conditional transformations on a MIDI input stream, and so forth, while the scoring language describes events and their parameter sets to be interpreted by the specified instruments. When the instrument definitions and score instructions are complete, the audio engine can be invoked as a separate process to render the sound.

When processed by the Csound compiler, the following code creates a WAV file of a sine wave played for three seconds — an audio compiler's "Hello world!":

    <CsoundSynthesizer>

    <CsOptions>
    ; process with no display, minimal messaging, no graphics
    ; output to WAV file named simple-sine.wav
    -d -m0 -g -o simple-sine.wav
    </CsOptions>

    <CsInstruments>
    sr=48000	; audio sample rate
    ksmps=64	; samples per k-rate (control rate) period
    nchnls=2	; number of audio output channels	 

    instr 1		        ; define a Csound instrument by number
    kamp = p4			; amplitude value in the score event's 
                                ; 4th p-field
    kfreq = p5			; frequency value in the score event's 
                                ; 5th p-field
    ifn = p6 			; function table number in score event's 
                                ; 6th p-field
    aosc oscil kamp,kfreq,ifn	; an oscillator with amplitude, frequency, 
                                ; and f-table values from preceding parameters
    kenv linseg 0,p3*.50,1,p3*.50,0	; a gain envelope to modulate 
                                        ; the event's output volume
    aout = aosc*kenv		; apply the envelope to the oscillator output
    outs aout,aout		; send the modulated signal to the 
                                ; system audio stereo outs
    endin			; end the instrument definition
    </CsInstruments>

    <CsScore>
    ; a stored function table 
    ; table number, start time, number of sample points, 
    ; GEN function, harmonic fundamental

    f1 0 8192 10 1

    ; an event list with six p-fields
    ; instrument number, start time, event duration, 
    ; direct amplitude value, frequency in cycles per second (cps), 
    ; function table numnber

    i1 0.0 3 5000 440 1
    e		; end of score
    </CsScore>

    </CsoundSynthesizer>

This example is in Csound's unified file format, which joins the instrument definition and score events pieces that historically were in separate files. To render it to a soundfile we need to feed the file to the Csound compiler:

    csound simple-sine.csd

If all goes well you'll have a new 3-second recording of a sine wave that swells from silence to a fairly loud volume and back again to silence. If things don't go well you'll get some hopefully informative error messages from Csound to help debug your code.

I'm sure some readers are wondering why anyone would go to such trouble to produce a sine wave, but consider the following implications from our little example :

I can invoke as many oscillators as I want by simple cut-and-paste text manipulation.
I can increase the number of sample points in the function table that provides the sine wave.
The envelope can have as many stages as I wish.
I can easily specify a more complex multichannel audio output.
The score event values can be replaced by expressions and macros.

All of that just scratches the surface of Csound's capabilities, many of which are found in no other sound software or hardware. It is amazingly powerful software, but accessing its power can be difficult, especially for new users accustomed to GUI-based music software. The plain-text interface seems rather forbidding at first, as does the prospect of working at the command-line. Fortunately, the language has been designed for directness and simplicity, though its complexity varies with the particular elements used in your instruments, of course.

Instruments and processors are built from Csound's huge collection of opcodes, which are the synthesis and processing "black boxes" that provide the various parts — oscillators, envelopes, filters, etc. — required by your instrument designs. Each opcode has its own set of parameters and usage details, so while some may have only a few parameters to address, others may have considerably more. Values for those parameters can be set or calculated within your instrument definitions, or they may be provided by the scoring language.

In the general computing world, the dominance of the GUI might have doomed Csound and similar systems to the dustbin of music software history. However, Csound has proven to be flexible enough to evolve and and remain current, thanks especially to a talented development team ably managed by John ffitch for more than twenty years. Through those years, many fine contributing programmers have come and gone from the Csound development community, but John's deep involvement as Csound programmer and composer has ensured stability, high-performance, and compatibility. On top of this core reliability, we find many developers dedicated to easing the way into Csound by way of graphic IDEs (integrated development environments), documentation improvements, enhanced connectivity, and other developer and user-level amenities.

Thanks to the efforts of its dedicated developers, Csound has an API (application programming interface). Thanks to that API, interesting new applications can be devised with easy access to Csound's considerable capabilities. This article presents a fly-by scan of the activity in some of Csound's new directions, including its deployment on mobile devices, its ever-improving development environments, and its integration with external music and sound software by way of plugins or standalone programs built with the Csound API.

Going mobile

Work proceeds towards the deployment of Csound on Android and iOS devices. I haven't entered the mobile world yet — I'm still deciding on the hardware — but I can point to Steven Yi's VoiceTouch, Art Hunkins's ChimePad, Roger Kelly's MicroDrummer, and Rick Boulanger's CsoundTouch to indicate the state of the art. The Csound source tree now includes a branch for Android development, while the developers' mailing list reveals considerable interest and activity in this area. The hardware capabilities of the current devices pose problems with latency, memory usage, and screen real estate, but those will be resolved eventually. As the devices become more powerful, Csound will be ready to take advantage of their increased capabilities.

Good IDEs

The way into instrument design and audio composition with Csound has been made easier with environments such as Andres Cabrera's CsoundQt (shown at right), Steven Yi's blue, Stefano Bonetti's WinXound (at left), and Jean Piche's Cecilia. These environments differ in many ways, but all are dedicated to providing an organized entry into using Csound.

WinXound and CsoundQt look similar to general purpose IDEs such as Eclipse or Microsoft's VisualStudio, with panels for code inspection, debugging output, manual pages, GUI elements, and so forth. Blue (at left) is more focused on the use of Csound's interfaces to Python and Java, with provisions for instrument design and some powerful tools for composition.

Cecilia takes a more graph-oriented approach to composition by employing an X/Y coordinate system to create generative and modulation curves (shown at right) in a unique way for Csound.

The Cabbage patcher

Developer Rory Walsh has been working on an interesting project called Cabbage. Its web site describes the project as "software for creating cross-platform Csound-based audio plugins", a laudable goal that intends to bring Csound's capabilities to any Linux audio host that supports native VST plugins.

Cabbage processes a file created in a special version of Csound's unified file format and creates a GUI front-end for the instrument or effect (at left). A regular CSD file can be edited with Cabbage's file source editor and tweaked until ready for export as a plugin or standalone application. Currently, the Linux version of Cabbage will create standalone executables and Linux VST plugins, with support for either JACK or plain ALSA audio I/O.

I've had mixed but promising results from my tests of Cabbage. Most standalones work well, and I've been able to export plugins recognized as valid Linux VSTs by Qtractor and Ardour3. Rory also reports success using Cabbage-created plugins with the Linux version of the Renoise digital audio workstation (Renoise is built with the same JUCE framework used by Cabbage).

Documentation consists of a single HTML page in the Docs directory of the source package. The information on that page is useful, but it is not current. For more up-to-date information, you can communicate with Rory and other developers and users on the Cabbage forum and the Csound mailing lists.

The project is young but its Linux version is stable enough for public testing. Performance needs further tuning, the plugins themselves have a large memory footprint, and the documentation needs to be updated. Work remains to be done, but Rory is dedicated to Cabbage's success and welcomes participation from other developers and normal users.

Csound composition/performance environments

In Ye Olden Times, music and sound composition in Csound was a daunting task. The system includes a scoring language and a few utilities for massaging existing scores, but composing complete pieces directly in Csound meant laboring over every event and its unique parameters. Composers such as Richard Boulanger and James Dashow created substantial, beautiful works by hand-crafting every sound and effect, and this method is still available to anyone who wishes to work in such a manner. However, Csounders now enjoy a variety of helper applications and utilities, including some remarkable software dedicated to facilitating composition in Csound.

The modern Csound language includes opcodes — signal processing black boxes — that provide useful random number generators and stochastic cannons, but it lacks sophisticated procedures for further processing the generated data. Fortunately, the gurus of Csound development have created programming interfaces for other languages that do provide such procedures, such as C++, Java, and Python.

I've written elsewhere about athenaCL, CsoundAC, and CM/GRACE, all of which include Csound as an output target. These systems require some study — programs in CM/GRACE are written in a Lisp derivative, CsoundAC and athenaCL employ Python — and they are of interest primarily to composers working with algorithmic composition methods. CM/GRACE author Rick Taube observes that they address issues at what he calls the metalevel of music composition, a "representation of music ... concerned with representing the activity, or process, of musical composition as opposed to its artifact, the score." These packages are fascinating and powerful tools, but clearly they are not music manufacturing programs a la GarageBand.

Jean-Pierre Lemoine's AVSynthesis (at right) is designed as an environment for interaction with OpenGL 3D graphics transformations in realtime, accompanied by audio from Csound also rendered in realtime. However, it is equally useful as a standalone non-realtime system for music and sound composition with Csound. AVSynthesis provides a set of premade instruments and signal processors, along with an analog-style sequencer, a 64-position step sequencer, a GUI for shaping Cmask tendency masks, and a piano-roll interface. Alas, there are no facilities for adding new instruments or effects, but Jean-Pierre's instruments and processing modules are excellent. I've worked intensively with AVSynthesis for a few years, and I'm far from exhausting the possibilities of its included sound-design modules.

Stephane Rollandin's Surmulot is also dedicated to composition with Csound. The Surmulot home page describes the software as "a complex environment for music composition", which is an accurate description. Surmulot is a collection of three systems — the Csound-x code for using Emacs as a front-end for Csound, the muO Smalltalk objects created for use in the Squeak environment, and the GeoMaestro system based on the powerful Keykit program for experimental MIDI composition. Its author advises that Surmulot is not for everyone, but it is a rich environment for composers who want to work in an experimental environment and are not intimidated by emacs.

Incidentally, if you want to compose your Csound works in good old Western standard music notation you'll be pleased to learn that the Rosegarden sequencer can export scores created in that program to scores in the Csound format. Rosegarden's notation capabilities are extensive, and as far as I know it is the only sequencer/notation package that includes Csound as an export target.

Documenting the scene

When I began using Csound in 1989 the system was relatively small. Most of its documentation was in the official manual, a useful and informative tome despite a rather terse presentation and the lack of examples for many opcodes. A few pieces were available for inspection, but the learning process was largely by trial and error.

The current system is considerably larger, with many new opcodes, utilities, and third-party extensions. Likewise, Csound's documentation has expanded. The official manual is the focus of much labor from the community, with updated entries for the whole system and useful examples for all opcodes. The FLOSS on-line manual is more modest in scope but is especially valuable for beginners. Dozens of Vimeo and YouTube videos offer instruction and demonstrations of Csound in action, and if I'm in the off-line state I can read and study one of the five hardcopy books dedicated to Csound. It is fair to claim that the learning process has become considerably easier — perhaps I should say "much less difficult" — thanks to this profusion of resources.

But wait, there's more. In 1999, Hans Mikelson published the first issue of the Csound Magazine, an on-line collection of articles from and for the Csound community. All levels of participation were encouraged, and the magazine was a first-rate resource for new and experienced users. Alas, in 2002 Hans gave up management of its publication. However, in 2005 Steven Yi and James Hearon began publishing the Csound Journal. Like the Csound Magazine, the Journal accepts and publishes work regarding all aspects of Csound at all levels. Don't take my word for it though, just take a look at the Csound Journal/Csound Magazine article index for a quick survey of the various topics of interest to Csounders.

Google finds thousands of sites associated with a search for "Csound", including the predictable thousands of bogus or irrelevant sites. I suggest that you cut through the confusion and head straight to Dr. Richard Boulanger's cSounds.com. It's the best resource for exploring the world of Csound with news, announcements, links, podcasts, user pages, and more. If you want to know what's going on with Csound you need to check out cSounds.com.

I've mentioned the Csound mailing lists. Separate lists are available for developer and user-level discussions, though anyone is invited to participate on either list. For many years the lists have been the principal channel for Csound-relevant communications, and they remain so today. The level of discussion can be formidable, especially on the developers list, but exchange is typically civil and helpful. If you don't understand something, just ask, and you're likely to get a prompt response from one of the many Csound wizards populating the lists.

The more musical notes

Csound is the system of choice for many diverse composers. Check out the Csound group on SoundCloud, as it's quite a varied playlist. Csound is normally associated with academic and experimental computer music, but the collection at SoundCloud includes both popular and more esoteric styles. The Csound users mailing list frequently receives announcements for new works, and the occasional Csound-based piece shows up in KVR's Music Cafe. If you're composing with Csound, feel free to leave a link to your work in the comments on this article.

Outro

Csound is deep and complex software with powerful capabilities for audio/MIDI composition and processing. The Csound community is committed to improving every aspect of the system, including its documentation at user and developer levels. Projects such as Cabbage and CsoundForLive extend Csound into the larger world of the modern digital audio workstation, exposing its powers without demanding special knowledge from a normal user. Greater awareness of Csound help generate more interest in the system, expanding its development and user base, thus ensuring its continued relevance.

Comments (12 posted)

A look at PAM face-recognition authentication

By Nathan Willis
November 7, 2012

Multi-factor authentication traditionally counts "knowledge" (e.g., passwords), "possession" (e.g., physical tokens), and "inherence" (e.g., biometrics) as the three options available to choose from, but Linux's pluggable authentication modules (PAM) support rarely delves into the inherence realm. Most often, biometric authentication comes in the form of fingerprint scanners, which can be found as a built-in feature on plenty of laptops. The majority of new portables already ship with built-in user-facing webcams, however, which opens the door to another possibility: facial recognition. And there is a PAM module capable of doing face-recognition authentication — although it is far from robust.

About face

The module is called, fittingly, pam_face_authentication, and the project is hosted at Google Code. It was initially developed over the course of 2008-2009 through the Google Summer of Code (GSoC) program, subsequently slowed down (as all too many GSoC projects do), but has seen a small uptick in activity in 2012.

The module uses the OpenCV computer vision library to detect human faces in the webcam image, identify features, and match them against its database of samples. The samples are stored on a per-user basis, and the recognizer needs to be trained for each user account by accumulating a collection of snapshots. There is a Qt-based training tool included in the downloads. The tool itself is simple enough to use, but the real trick is amassing a large and robust enough collection of samples to begin with.

The reason a sizable sample collection is vital is that all facial recognition algorithms are sensitive to a number of environmental (for lack of a better word) factors, including the size of the face in the image captured, the tilt and orientation of the subject's head, and the illumination. Given one particular laptop, the size of the image is fairly predictable, but illumination is certainly not.

The pam_face_authentication algorithm works by detecting both the subject's face and eyes in the camera image, then measuring the distance between the eyes and their location in the oval of the face. The measurements is used to scale the captured image in order to compare it to the sample data. The actual comparison is based on a Haar classifier, which extracts image features in a manner similar to wavelet transformations. This technique is sensitive to lighting changes because backlight and sidelight can skew the detection of the face oval. OpenCV can overcome some illumination problems by normalizing the captured image and by working in grayscale, but it is not foolproof. Stray shadows can be mis-identified as one eye or the other, and the method dictates that the subject not be wearing glasses — which could be a practical inconvenience for many users. The 2009 GSoC project improved upon the original implementation, but it is an area where there is plenty of room left to grow.

Ultimately, the only strategy available for improving the success rate of the recognition process is to take a large number of sample images, preferably in a range of lighting conditions. In blog discussions about the module, several users mentioned taking a dozen or more sample images to train the tool for their own face. The OpenCV documentation notes that Haar classifiers are often trained on "a few hundreds of sample views of a particular object". That may sound daunting to all but the staunchly self-absorbed, but practice makes perfect.

Deployment

Still, the pam_face_authentication module is usable today. The team has published Kubuntu packages in a personal package archive (PPA) in the past, although they are no longer up to date. Installation from source is straightforward, though, with the chief hurdle being the lengthy list of OpenCV-related dependencies. Once built, the PAM module itself needs a configuration file at /usr/share/pam-configs/face_authentication; the project's documentation suggests:

    Name: face_authentication profile
    Default: yes
    Priority: 900
    Auth-Type: Primary
    Auth:
    [success=end default=ignore] pam_face_authentication.so enableX

These settings will replace the password prompt in the system's graphical greeter (e.g., GDM or KDM) with the module's webcam viewport. The module must first be activated, though, with

    pam-auth-update --package face_authentication

The pam-auth-update command persists across reboots, but it should take effect immediately, and one can simply log out to test it. When the user selects his or her username, the module immediately begins sampling the webcam image looking for a human face. When it detects and verifies the face as corresponding to the selected user account, it proceeds with the login. If one has not adequately trained the face-recognition database, though, the module will relay a series of helpful instructions ("Keep proper distance with the camera.", "Align your face.", etc.). Luckily, if these stage directions get tiring, the module will eventually fail over and revert to password-entry for login.

Image is everything

When it works, facial recognition login is slick and simple (plus nicely futuristic-feeling). In practice, though, this module leaves a bit to be desired. First and foremost, the training process deserves more attention. The interface allows the user to snap picture after picture, but there is not much feedback about the information extracted. The preview image shows the face oval and the eyes as detected by the algorithm, but nothing else. A visualization of the composite model built from the image collection would be constructive. There is also a high-to-low sensitivity slider under the advanced settings, but it, too, is light on detail. One can click on the "Test" button to see whether or not the current live image in the webcam passes recognition, but there is only a yes/no response. A numeric score, or a visual representation of the match, would improve the experience and potentially shorten the time required to amass a suitable image sample collection.

Together with the drawbacks of the facial recognition algorithm, all of these limitations keep pam_face_authentication from being a viable password replacement for any genuinely security-conscious user. As it is, the 2D facial-recognition algorithm requires considerable training to eliminate false negatives, while remaining susceptible to simple attacks (e.g., holding up a photograph of the targeted user). There are other facial recognition algorithms to be considered, including some that construct a 3D model of the user's face. Android has its own face authentication mechanism, introduced in the "Ice Cream Sandwich" release, which attempts to thwart the photograph attack by trying to detect blinks in the live image.

Then again, an entirely separate issue is that this implementation is a single-factor authentication module. A more useful tool would be a PAM module that integrates facial recognition with other factors; perhaps motivated developers will find pam_face_authentication a decent starting point.

Biometrics are an iffy authentication proposition on their own — a bandage or a scar can lock a user out of a system inadvertently, after all — but the Linux community would do well to push forward on them and physical authentication tokens as well. For all the security that bad password selection and sloppy password storage provide, any dedicated research on alternative authentication schemes is a welcome change.

Comments (1 posted)

Security quotes of the week

The course begins with a detailed discussion of how two parties who have a shared secret key can communicate securely when a powerful adversary eavesdrops and tampers with traffic. We will examine many deployed protocols and analyze mistakes in existing systems. The second half of the course discusses public-key techniques that let two or more parties generate a shared secret key. We will cover the relevant number theory and discuss public-key encryption and basic key-exchange. Throughout the course students will be exposed to many exciting open problems in the field.

-- A free online Cryptography course from Stanford Professor Dan Boneh

Actually from what I've seen on the security front there seems to a distinct view that secure boot is irrelevant because Windows 8 is so suspend/resume focussed that you might as well just trojan the box until the next reboot as its likely to be a couple of weeks [away].

-- Alan Cox

The job placement ad reveals that the law enforcement agency is currently looking to recruit two people to work on telecommunications interception at the source of the messages ("Quellen-TKÜ") at its head office in Cologne; the position is to be filled as soon as possible. The role principally involves "planning, operating and optimising a modern telecommunications network" designed to eavesdrop on internet telephone calls.

-- The H looks at a German government job posting

Sophos claim their products are deployed throughout healthcare, government, finance and even the military. The chaos a motivated attacker could cause to these systems is a realistic global threat. For this reason, Sophos products should only ever be considered for low-value non-critical systems and never deployed on networks or environments where a complete compromise by adversaries would be inconvenient,

-- Tavis Ormandy in CSO Online

Comments (none posted)

drupal: multiple vulnerabilities

Package(s):

drupal

CVE #(s):

CVE-2012-1588 CVE-2012-1589 CVE-2012-1590 CVE-2012-1591 CVE-2012-2153

Created:

November 2, 2012

Updated:

November 7, 2012

Description:

From the Mageia advisory:

Drupal core's text filtering system provides several features including removing inappropriate HTML tags and automatically linking content that appears to be a link. A pattern in Drupal's text matching was found to be inefficient with certain specially crafted strings. This vulnerability is mitigated by the fact that users must have the ability to post content sent to the filter system such as a role with the "post comments" or "Forum topic: Create new content" permission (CVE-2012-1588).

Drupal core's Form API allows users to set a destination, but failed to validate that the URL was internal to the site. This weakness could be abused to redirect the login to a remote site with a malicious script that harvests the login credentials and redirects to the live site. This vulnerability is mitigated only by the end user's ability to recognize a URL with malicious query parameters to avoid the social engineering required to exploit the problem (CVE-2012-1589).

Drupal core's forum lists fail to check user access to nodes when displaying them in the forum overview page. If an unpublished node was the most recently updated in a forum then users who should not have access to unpublished forum posts were still be able to see meta-data about the forum post such as the post title (CVE-2012-1590).

Drupal core provides the ability to have private files, including images, and Image Styles which create derivative images from an original image that may differ, for example, in size or saturation. Drupal core failed to properly terminate the page request for cached image styles allowing users to access image derivatives for images they should not be able to view. Furthermore, Drupal didn't set the right headers to prevent image styles from being cached in the browser (CVE-2012-1591).

Drupal core provides the ability to list nodes on a site at admin/content. Drupal core failed to confirm a user viewing that page had access to each node in the list. This vulnerability only concerns sites running a contributed node access module and is mitigated by the fact that users must have a role with the "Access the content overview page" permission. Unpublished nodes were not displayed to users who only had the "Access the content overview page" permission (CVE-2012-2153).

From the Drupal advisory:

A bug in the installer code was identified that allows an attacker to re-install Drupal using an external database server under certain transient conditions. This could allow the attacker to execute arbitrary PHP code on the original server.

For sites using the core OpenID module, an information disclosure vulnerability was identified that allows an attacker to read files on the local filesystem by attempting to log in to the site using a malicious OpenID server (Drupal SA-CORE-2012-003).

Alerts:

Mandriva	MDVSA-2013:074	drupal	2013-04-08
Mageia	MGASA-2012-0319	drupal	2012-11-01

Comments (none posted)

kernel: denial of service

Package(s):

kernel

CVE #(s):

CVE-2012-4565

Created:

November 6, 2012

Updated:

February 28, 2013

Description:

From the Red Hat bugzilla:

Reading TCP stats when using TCP Illinois congestion control algorithm can cause a divide by zero kernel oops.

An unprivileged local user could use this flaw to crash the system.

Alerts:

Oracle	ELSA-2013-1645	kernel	2013-11-26
Oracle	ELSA-2013-2507	kernel	2013-02-28
Oracle	ELSA-2013-0496	kernel	2013-02-28
Ubuntu	USN-1704-2	Quantal kernel	2013-02-01
Mageia	MGASA-2013-0016	kernel-rt	2013-01-24
Ubuntu	USN-1704-1	kernel	2013-01-22
Mageia	MGASA-2013-0011	kernel-tmb	2013-01-18
Mageia	MGASA-2013-0010	kernel	2013-01-18
Mageia	MGASA-2013-0012	kernel-vserver	2013-01-18
Mageia	MGASA-2013-0009	kernel-linus	2013-01-18
Oracle	ELSA-2012-2048	linux	2012-12-20
Oracle	ELSA-2012-2048	linux	2012-12-20
Oracle	ELSA-2012-2047	kernel	2012-12-20
Oracle	ELSA-2012-2047	kernel	2012-12-20
Oracle	ELSA-2012-1580	kernel	2012-12-19
Scientific Linux	SL-kern-20121219	kernel	2012-12-19
CentOS	CESA-2012:1580	kernel	2012-12-19
Red Hat	RHSA-2012:1580-01	kernel	2012-12-18
Ubuntu	USN-1649-1	linux-ti-omap4	2012-11-30
Ubuntu	USN-1651-1	linux	2012-11-30
Ubuntu	USN-1650-1	linux	2012-11-30
Ubuntu	USN-1647-1	linux-ti-omap4	2012-11-30
Ubuntu	USN-1646-1	linux	2012-11-30
Ubuntu	USN-1644-1	linux	2012-11-30
Ubuntu	USN-1652-1	linux-lts-backport-oneiric	2012-11-30
Red Hat	RHSA-2012:1491-01	kernel-rt	2012-12-04
Ubuntu	USN-1645-1	linux-ti-omap4	2012-11-30
Fedora	FEDORA-2012-17479	kernel	2012-11-06
Ubuntu	USN-1653-1	linux-ec2	2012-12-04
Ubuntu	USN-1648-1	linux	2012-11-30
Fedora	FEDORA-2012-17462	kernel	2012-11-06

Comments (none posted)

kernel: information leak

Package(s):

kernel

CVE #(s):

CVE-2012-4508

Created:

November 6, 2012

Updated:

March 15, 2013

Description:

From the Red Hat bugzilla:

A race condition flaw has been found in the way asynchronous I/O and fallocate interacted which can lead to exposure of stale data -- that is, an extent which should have had the "uninitialized" bit set indicating that its blocks have not yet been written and thus contain data from a deleted file. An unprivileged local user could use this flaw to cause an information leak.

Alerts:

Red Hat	RHSA-2013:1783-01	kernel	2013-12-05
Oracle	ELSA-2013-1645	kernel	2013-11-26
Red Hat	RHSA-2013:1519-01	kernel	2013-11-13
Ubuntu	USN-1900-1	linux-ec2	2013-07-04
Ubuntu	USN-1899-1	linux	2013-07-04
Debian	DSA-2668-1	linux-2.6	2013-05-14
Scientific Linux	SL-kern-20130314	kernel	2013-03-14
CentOS	CESA-2013:0496	kernel	2013-03-09
openSUSE	openSUSE-SU-2013:0396-1	kernel	2013-03-05
Oracle	ELSA-2013-2507	kernel	2013-02-28
Red Hat	RHSA-2013:0496-02	kernel	2013-02-21
Ubuntu	USN-1726-1	linux-ti-omap4	2013-02-14
Ubuntu	USN-1720-1	linux	2013-02-12
Ubuntu	USN-1719-1	linux-lts-backport-oneiric	2013-02-12
Ubuntu	USN-1704-2	Quantal kernel	2013-02-01
Mageia	MGASA-2013-0016	kernel-rt	2013-01-24
Ubuntu	USN-1704-1	kernel	2013-01-22
Mageia	MGASA-2013-0011	kernel-tmb	2013-01-18
Mageia	MGASA-2013-0010	kernel	2013-01-18
Mageia	MGASA-2013-0012	kernel-vserver	2013-01-18
Mageia	MGASA-2013-0009	kernel-linus	2013-01-18
Oracle	ELSA-2012-1540	kernel	2012-12-05
Scientific Linux	SL-kern-20121206	kernel	2012-12-06
Fedora	FEDORA-2012-17479	kernel	2012-11-06
Red Hat	RHSA-2012:1491-01	kernel-rt	2012-12-04
CentOS	CESA-2012:1540	kernel	2012-12-05
Red Hat	RHSA-2012:1540-01	kernel	2012-12-04

Comments (none posted)

mcrypt: buffer overflow

Package(s):

mcrypt

CVE #(s):

CVE-2012-4527

Created:

November 5, 2012

Updated:

November 8, 2012

Description:

From the openSUSE advisory:

Some potential mcrypt buffer overflows in the commandline tool were fixed, which could lead to early aborts of mcrypt. Due to FORTIFY_SOURCE catching such cases, it would have only aborted mcrypt with a buffer overflow backtrace.

Alerts:

Gentoo	201405-19	mcrypt	2014-05-18
Fedora	FEDORA-2012-17290	mcrypt	2012-11-08
Fedora	FEDORA-2012-17318	mcrypt	2012-11-08
openSUSE	openSUSE-SU-2012:1440-1	mcrypt	2012-11-05

Comments (none posted)

munin: multiple vulnerabilities

Package(s):

munin

CVE #(s):

CVE-2012-2103 CVE-2012-3513

Created:

November 5, 2012

Updated:

November 7, 2012

Description:

From the Ubuntu advisory:

It was discovered that the Munin qmailscan plugin incorrectly handled temporary files. A local attacker could use this issue to possibly overwrite arbitrary files. This issue only affected Ubuntu 10.04 LTS, Ubuntu 11.10, and Ubuntu 12.04 LTS. (CVE-2012-2103)

It was discovered that Munin incorrectly handled specifying an alternate configuration file. A remote attacker could possibly use this issue to execute arbitrary code with the privileges of the web server. This issue only affected Ubuntu 12.10. (CVE-2012-3513)

Alerts:

Gentoo	201405-17	munin	2014-05-18
Mandriva	MDVSA-2013:105	munin	2013-04-10
Mageia	MGASA-2012-0358	munin	2012-12-11
Ubuntu	USN-1622-1	munin	2012-11-05

Comments (none posted)

mysql: multiple unspecified vulnerabilities

Package(s):

mysql

CVE #(s):

CVE-2012-3144 CVE-2012-3147 CVE-2012-3149 CVE-2012-3150 CVE-2012-3156 CVE-2012-3158 CVE-2012-3160 CVE-2012-3163 CVE-2012-3166 CVE-2012-3167 CVE-2012-3173 CVE-2012-3177 CVE-2012-3180 CVE-2012-3197

Created:

November 5, 2012

Updated:

December 4, 2012

Description:

From the CVE entries:

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.5.26 and earlier allows remote authenticated users to affect availability via unknown vectors related to Server. (CVE-2012-3144)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.5.26 and earlier allows remote attackers to affect integrity and availability, related to MySQL Client. (CVE-2012-3147)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.5.26 and earlier allows remote authenticated users to affect confidentiality, related to MySQL Client. (CVE-2012-3149)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.1.64 and earlier, and 5.5.26 and earlier, allows remote authenticated users to affect availability via unknown vectors related to Server Optimizer. (CVE-2012-3150)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.5.25 and earlier allows remote authenticated users to affect availability via unknown vectors related to Server. (CVE-2012-3156)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.1.64 and earlier, and 5.5.26 and earlier, allows remote attackers to affect confidentiality, integrity, and availability via unknown vectors related to Protocol. (CVE-2012-3158)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.1.65 and earlier, and 5.5.27 and earlier, allows local users to affect confidentiality via unknown vectors related to Server Installation. (CVE-2012-3160)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.1.64 and earlier, and 5.5.26 and earlier, allows remote authenticated users to affect confidentiality, integrity, and availability via unknown vectors related to Information Schema. (CVE-2012-3163)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.1.63 and earlier, and 5.5.25 and earlier, allows remote authenticated users to affect availability via unknown vectors related to InnoDB. (CVE-2012-3166)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.1.63 and earlier, and 5.5.25 and earlier, allows remote authenticated users to affect availability via unknown vectors related to Server Full Text Search. (CVE-2012-3167)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.1.63 and earlier, and 5.5.25 and earlier, allows remote authenticated users to affect availability via unknown vectors related to InnoDB Plugin. (CVE-2012-3173)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.1.65 and earlier, and 5.5.27 and earlier, allows remote authenticated users to affect availability via unknown vectors related to Server. (CVE-2012-3177)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.1.65 and earlier, and 5.5.27 and earlier, allows remote authenticated users to affect availability via unknown vectors related to Server Optimizer. (CVE-2012-3180)

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.1.64 and earlier, and 5.5.26 and earlier, allows remote authenticated users to affect availability via unknown vectors related to Server Replication. (CVE-2012-3197)

Alerts:

Gentoo	201308-06	mysql	2013-08-29
Gentoo	GLSA 201308-06:02	mysql	2013-08-30
Mandriva	MDVSA-2013:102	mariadb	2013-04-10
Mageia	MGASA-2012-0349	mysql	2012-12-07
Scientific Linux	SL-mysq-20121115	mysql	2012-11-15
CentOS	CESA-2012:1462	mysql	2012-11-15
Ubuntu	USN-1621-1	mysql-5.1, mysql-5.5, mysql-dfsg-5.1	2012-11-05
Oracle	ELSA-2012-1462	mysql	2012-11-14
Red Hat	RHSA-2012:1462-01	mysql	2012-11-14
Debian	DSA-2581-1	mysql-5.1	2012-12-04
Mageia	MGASA-2012-0341	mysql	2012-11-23

Comments (1 posted)

openoffice.org: code execution

Package(s):

openoffice.org

CVE #(s):

CVE-2012-4233

Created:

November 1, 2012

Updated:

February 10, 2013

Description:

From the Debian advisory:

High-Tech Bridge SA Security Research Lab discovered multiple null-pointer dereferences based vulnerabilities in OpenOffice which could cause application crash or even arbitrary code execution using specially crafted files. Affected file types are LWP (Lotus Word Pro), ODG, PPT (MS Powerpoint 2003) and XLS (MS Excel 2003).

Alerts:

Mageia	MGASA-2013-0045	libreoffice	2013-02-09
openSUSE	openSUSE-SU-2013:0173-1	LibreOffice	2013-01-23
openSUSE	openSUSE-SU-2012:1686-1	libreoffice	2012-12-23
openSUSE	openSUSE-SU-2012:1523-1	LibreOffice	2012-11-22
Debian	DSA-2570-1	openoffice.org	2012-10-31

Comments (none posted)

otrs: cross-site scripting

Package(s):

otrs

CVE #(s):

CVE-2012-4751

Created:

November 7, 2012

Updated:

January 23, 2013

Description:

From the Mageia advisory:

Cross-site scripting (XSS) vulnerability in Open Ticket Request System (OTRS) Help Desk 2.4.x before 2.4.15, 3.0.x before 3.0.17, and 3.1.x before 3.1.11 allows remote attackers to inject arbitrary web script or HTML via an e-mail message body with whitespace before a javascript: URL in the SRC attribute of an element, as demonstrated by an IFRAME element.

Alerts:

Debian	DSA-2733-1	otrs2	2013-08-02
Mandriva	MDVSA-2013:112	otrs	2013-04-10
openSUSE	openSUSE-SU-2013:0145-1	otrs	2013-01-23
Mageia	MGASA-2012-0322	otrs	2012-11-06

Comments (none posted)

pcp: multiple unspecified vulnerabilities

Package(s):

pcp

CVE #(s):

Created:

November 6, 2012

Updated:

November 7, 2012

Description:

PCP 3.6.9 fixes several bugs that may cause security issues.

Alerts:

Fedora	FEDORA-2012-17019	pcp	2012-11-06
Fedora	FEDORA-2012-17050	pcp	2012-11-06

Comments (none posted)

remote-login-service: information leak

Package(s):

remote-login-service

CVE #(s):

CVE-2012-0959

Created:

November 6, 2012

Updated:

November 7, 2012

Description:

From the Ubuntu advisory:

It was discovered that Remote Login Service incorrectly purged account information when switching users. A local attacker could use this issue to possibly obtain sensitive information.

Alerts:

Ubuntu

USN-1624-1

remote-login-service

2012-11-05

Comments (none posted)

ssmtp: no TLS certificate validation

Package(s):

ssmtp

CVE #(s):

Created:

November 1, 2012

Updated:

November 7, 2012

Description:

From the Red Hat bugzilla entry:

It was reported that ssmtp, an extremely simple MTA to get mail off the system to a mail hub, did not perform x509 certificate validation when initiating a TLS connection to server. A rogue server could use this flaw to conduct man-in-the-middle attack, possibly leading to user credentials leak.

Alerts:

Fedora

FEDORA-2012-16163

ssmtp

2012-11-01

Comments (none posted)

xlockmore: denial of service

Package(s):

xlockmore

CVE #(s):

CVE-2012-4524

Created:

November 6, 2012

Updated:

November 9, 2012

Description:

From the Red Hat bugzilla:

A denial of service flaw was found in the way xlockmore, X screen lock and screen saver, performed passing arguments to underlying localtime() call, when the 'dlock' mode was used. An attacker could use this flaw to potentially obtain unauthorized access to screen / graphical session, previously locked by another user / victim.

Alerts:

Gentoo	201309-03	xlockmore	2013-09-02
Mageia	MGASA-2012-0328	xlockmore	2012-11-09
Fedora	FEDORA-2012-16490	xlockmore	2012-11-06
Fedora	FEDORA-2012-16485	xlockmore	2012-11-06

Comments (none posted)

Kernel release status

The current development kernel is 3.7-rc4, released on November 4. Linus Torvalds said: "Perhaps notable just because of the noise it caused in certain circles, there's the ext4 bitmap journaling fix for the issue that caused such a ruckus. It's a tiny patch and despite all the noise about it you couldn't actually trigger the problem unless you were doing crazy things with special mount options."

Stable status: 3.0.51, 3.4.18, and 3.6.6 were released on November 5. They all include important fixes. The latter two (3.4.18, 3.6.6) contain the fix for the ext4 corruption problem that initially looked much worse than it turned out to be.

Comments (none posted)

Quotes of the week

Who does not love the smell of formal methods first thing in the morning?

— Mel Gorman

TCP metrics code says "Yo dawg, I heard you like sizeof, so I did a sizeof of a sizeof, so you can size your size" Fix from Julian Anastasov.

— David Miller (Thanks to Benjamin Poirier)

Comments (none posted)

Many more words on volatile ranges

By Michael Kerrisk
November 5, 2012

The volatile ranges feature provides applications that cache large amounts of data that they can (relatively easily) re-create—for example, browsers caching web content—with a mechanism to assist the kernel in making decisions about which pages can be discarded from memory when the system is under memory pressure. An application that wants to assist the kernel in this way does so by informing the kernel that a range of pages in its address space can be discarded at any time, if the kernel needs to reclaim memory. If the application later decides that it would like to use those pages, it can request the kernel to mark the pages nonvolatile. The kernel will honor that request if the pages have not yet been discarded. However, if the pages have already been discarded, the request will return an error, and it is then the application's responsibility to re-create those pages with suitable data.

Volatile ranges, take 12

John Stultz first proposed patches to implement volatile ranges in November 2011. As we wrote then, the proposed user-space API was via POSIX_FADV_VOLATILE and POSIX_FADV_NONVOLATILE operations for the posix_fadvise() system call. Since then, it appears that he has submitted at least another eleven iterations of his patch, incorporating feedback and new ideas into each iteration. Along the way, some feedback from David Chinner caused John to revise the API, so that a later patch series instead used the fallocate() system call, with two new flags, FALLOCATE_FL_MARK_VOLATILE and FALLOCATE_FL_UNMARK_VOLATILE.

The volatile ranges patches were also the subject of a discussion at the 2012 Linux Kernel Summit memcg/mm minisummit. What became clear there was that few of the memory management developers are familiar with John's patch set, and he appealed for more review of his work, since there were some implementation decisions that he didn't feel sufficiently confident to make on his own. As ever, getting sufficient review of patches is a challenge, and the various iterations of John's patches are a good case in point: several iterations of his patches received no or little substantive feedback.

Following the memcg/mm minisummit, John submitted a new round of patches, in an attempt to move this work further forward. His latest patch set begins with a lengthy discussion of the implementation and outlines a number of open questions.

The general design of the API is largely unchanged, with one notable exception. During the memcg/mm minisummit, John noted that repeatedly marking pages volatile and nonvolatile could be expensive, and was interested in ideas about how the kernel could do this more efficiently. Instead, Taras Glek (a Firefox developer) and others suggested an idea that could side-step the question of how to more efficiently implement the kernel operations: if a process attempts to access a volatile page that has been discarded from memory, then the kernel could generate a SIGBUS signal for the process. This would allow a process that wants to briefly access a volatile page to avoid the expense of bracketing the access with calls to unmark the page as volatile and then mark it as volatile once more.

Instead, the process would access the data, and if it received a SIGBUS signal, it would know that the data at the corresponding address needs to be re-created. The SIGBUS signal handler can obtain the address of the memory access that generated the signal via one of its arguments. Given that information, the signal handler can notify the application that the corresponding address range must be unmarked as volatile and repopulated with data. Of course, an application that doesn't want to deal with signals can still use the more expensive unmark/access/mark approach.

There are still a number of open questions regarding the API. As noted above, following Dave Chinner's feedback, John revised the interface to use the fallocate() system call instead of posix_fadvise(), only to have it suggested by other memory management maintainers at the memcg/mm minisummit that posix_fadvise() or madvise() would be better. The latest implementation still uses fallocate(), though John thinks his original approach of using posix_fadvise() is slightly more sensible. In any case, he is still seeking further input about the preferred interface.

The volatile ranges patches currently only support mappings on tmpfs filesystems, and marking or unmarking a range volatile requires the use of the file descriptor corresponding to the mapping. In his mail, John explained that Taras finds the file-descriptor-based interface rather cumbersome:

In talking with Taras Glek, he pointed out that for his needs, the fd based interface is a little annoying, as it requires having to get access to tmpfs file and mmap it in, then instead of just referencing a pointer to the data he wants to mark volatile, he has to calculate the offset from start of the mmap and pass those file offsets to the interface.

John acknowledged that an madvise() interface would be nice, but it raises some complexities. The problem with an madvise() interface is that it could be more generally applied to any part of the process's address space. However, John wondered what semantics could be attached to volatile ranges that are applied to anonymous mappings (e.g., when pages are duplicated via copy-on-write, should the duplicated page also be marked volatile?) or file-mappings on non-tmpfs filesystems. Therefore, the latest patch series provides only the file-descriptor-based interface.

There are a number of other subtle implementation details that John has considered in the volatile ranges implementation. For example, if a large page range is marked volatile, should the kernel perform a partial discard of pages in the range when under memory pressure, or discard the entire range? In John's estimation, discarding a subset of the range probably destroys the "value" of the entire range. So the approach taken is to discard volatile ranges in their entirety.

Then there is the question of how to treat volatile ranges that overlap and volatile ranges that are contiguous. Overlapping ranges are coalesced into a single range (which means they will be discarded as a unit). Contiguous ranges are slightly different. The current behavior is to merge them if neither range has yet been discarded. John notes that coalescing in these circumstances may not be desirable: since the application marked the ranges volatile in separate operations, it may not necessarily wish to see both ranges discarded together.

But at this point a seeming oddity of the current implementation intervenes: the volatile ranges implementation deals with address ranges at a byte level of granularity rather than at the page level. It is possible to mark (say) a page and half as volatile. The kernel will only discard complete volatile pages, but, if a set of contiguous sub-page ranges covering an entire page is marked volatile, then coalescing the contiguous ranges allows the page to be discarded if necessary. In response to this and various other points in John's lengthy mail, Neil Brown wondered if John was:

trying to please everyone and risked pleasing no-one… For example, allowing sub-page volatile region seems to be above and beyond the call of duty. You cannot mmap sub-pages, so why should they be volatile?

John responded that it seemed sensible from a user-space point of view to allow sub-page marking and it was not too complex to implement. However, the use case for byte-granularity volatile ranges is not obvious from the discussion. Given that the goal of volatile ranges is to assist the kernel in freeing up what would presumably be a significant amount of memory when the system is under memory pressure, it seems unlikely that a process would make multiple system calls to mark many small regions of memory volatile.

Neil also questioned the use of signals as a mechanism for informing user space that a volatile range has been discarded. The problem with signals, of course, is that their asynchronous nature means that they can be difficult to deal with in user-space applications. Applications that handle signals incorrectly can be prone to subtle race errors, and signals do not mesh well with some other parts of the user-space API, such as POSIX threads. John replied:

Initially I didn't like the idea, but have warmed considerably to it. Mainly due to the concern that the constant unmark/access/mark pattern would be too much overhead, and having a lazy method will be much nicer for performance. But yes, at the cost of additional complexity of handling the signal, marking the faulted address range as non-volatile, restoring the data and continuing.

There are a number of other unresolved implementation decisions concerning the order in which volatile range pages should be discarded when the system is under memory pressure, and John is looking for input on those decisions.

A good heuristic is required for choosing which ranges to discard first. The complicating factor here is that a volatile page range may contain both frequently and rarely accessed data. Thus, using the least recently used page in a range as a metric in the decision about whether to discard a range could cause quite recently used pages to be discarded. The Android ashmem implementation (upon which John's volatile ranges work is based) employed an approach to this problem that works well for Android: volatile ranges are discarded in the order in which they are marked volatile, and, since applications are not supposed to touch volatile pages, the least-recently-marked-volatile order provides a reasonable approximation of least-recently-used order.

But the SIGBUS semantics described above mean that an application could continue to access a memory region after marking it as volatile. Thus, the Android approach is not valid for John's volatile range implementation. In theory, the best solution might be to evaluate the age of the most recently used page in each range and then discard the range with the oldest most recently used page; John suspects, however, that there may be no efficient way of performing that calculation.

Then there is the question of the relative order of discarding for volatile and nonvolatile pages. Initially, John had thought that volatile ranges should be discarded in preference to any other pages on the system, since applications have made a clear statement that they can recover if the pages are lost. However, at the memcg/mm minisummit, it was pointed out that there may be pages on the system that are even better candidates for discarding, such as pages containing streamed data that is unlikely to be used again soon (if at all). However, the question of how to derive good heuristics for deciding the best relative order of volatile pages versus various kinds of nonvolatile pages remains unresolved.

One other issue concerns NUMA-systems. John's latest patch set uses a shrinker-based approach to discarding pages, which allows for an efficient implementation. However, (as was discussed at the memcg/mm minisummit) shrinkers are not currently NUMA-aware. As a result, when one node on a multi-node system is under memory pressure, volatile ranges on another node might be discarded, which would throw data away without relieving memory pressure on the node where that pressure is felt. This issue remains unresolved, although some ideas have been put forward about possible solutions.

Volatile anonymous ranges

In the thread discussing John's patch set, Minchan Kim raised a somewhat different use case that has some similar requirements. Whereas John's volatile ranges feature operates only on tmpfs mappings and requires the use of a file descriptor-based API, Minchan expressed a preference for an madvise() interface that could operate on anonymous mappings. And whereas John's patch set employs its own address-range based data structure for recording volatile ranges, Minchan proposed that volatility could be an implemented as a new VMA attribute, VM_VOLATILE, and madvise() would be used to set that attribute. Minchan thinks his proposal could be useful for user-space memory allocators.

With respect to John's concerns about copy-on-write semantics for volatile ranges in anonymous pages, Minchan suggested volatile pages could be discarded so long as all VMAs that share the page have the VM_VOLATILE attribute. Later in the thread, he said he would soon try to implement a prototype for his idea.

Minchan proved true to his word, and released a first version of his prototype, quickly followed by a second version, where he explained that his RFC patch complements John's work by introducing:

new madvise behavior MADV_VOLATILE and MADV_NOVOLATILE for anonymous pages. It's different with John Stultz's version which considers only tmpfs while this patch considers only anonymous pages so this cannot cover John's one. If below idea is proved as reasonable, I hope we can unify both concepts by madvise/fadvise.

Minchan detailed his earlier point about user-space memory allocators by saying that many allocators call munmap() when freeing memory that was allocated with mmap(). The problem is that munmap() is expensive. A series of page table entries must be cleaned up, and the VMA must be unlinked. By contrast, madvise(MAD_VOLATILE) only needs to set a flag in the VMA.

However, Andrew Morton raised some questions about Minchan's use case:

Presumably the userspace allocator will internally manage memory in large chunks, so the munmap() call frequency will be much lower than the free() call frequency. So the performance gains from this change might be very small.

The whole point of the patch is to improve performance, but we have no evidence that it was successful in doing that! I do think we'll need good quantitative testing results before proceeding with such a patch, please.

Paul Turner also expressed doubts about Minchan's rationale, noting that the tcmalloc() user-space memory allocator uses the madvise(MADV_DONTNEED) operation when discarding large blocks from free(). That operation informs the kernel that the pages can be (destructively) discarded from memory; if the process tries to access the pages again, they will either be faulted in from the underlying file, for a file mapping, or re-created as zero-filled pages, for the anonymous mappings that are employed by user-space allocators. Of course, re-creating the pages zero filled is normally exactly the desired behavior for a user-space memory allocator. In addition, MADV_DONTNEED is cheaper than munmap() and has the further benefit that no system call is required to reallocate the memory. (The only potential downside is that process address space is not freed, but this tends not to matter on 64-bit systems.)

Responding to Paul's point, Motohiro Kosaki pointed out that the use of MADV_DONTNEED in this scenario is sometimes the source of significant performance problems for the glibc malloc() implementation. However, he was unsure whether or not Minchan's patch would improve things.

Minchan acknowledged Andrew's questioning of the performance benefits, noting that his patch was sent out as a request for comment; he agreed with the need for performance testing to justify the feature. Elsewhere in the thread, he pointed to some performance measurements that accompanied a similar patch proposed some years ago by Rik van Riel; looking at those numbers, Minchan believes that his patch may provide a valuable optimization. At this stage, he is simply looking for some feedback about whether his idea warrants some further investigation. If his MADV_VOLATILE proposal can be shown to yield benefits, he hopes that his approach can be unified with John's work.

Conclusion

Although various people have expressed an interest in the volatile ranges feature, its progress towards the mainline kernel has been slow. That certainly hasn't been for want of effort by John, who has been steadily refining his well-documented patches and sending them out for review frequently. How that progress will be affected by Minchan's work remains to be seen. On the positive side, Minchan—assuming that his own work yields benefits—would like to see the two approaches integrated. However, that effort in itself might slow the progress of volatile ranges toward the mainline.

Given the user-space interest in volatile ranges, one supposes that the feature will eventually make its way into the kernel. But clearly, John's work, and eventually also Minchan's complementary work, could do with more review and input from the memory management developers to reach that goal.

Comments (28 posted)

UEFI secure boot kernel restrictions

By Jake Edge
November 7, 2012

The UEFI secure boot "problem" spans multiple levels in a Linux system. There are the actual "getting Linux to boot" issues, which have mostly been addressed by the two signed bootloaders that are available for distributions and users. Beyond that, though, are a set of kernel features that have the potential to subvert secure boot. Depending on one's perspective, those features either need to be configurable in the kernel—so some distributions can turn them off in their signed kernels—or they pose little risk beyond that of existing (but unknown) kernel bugs. As might be guessed, both sides of that argument can be heard in a recent linux-kernel thread.

The root problem, so to speak, is that the root user on Linux systems is trusted by the kernel. That means root can make all sorts of temporary—or permanent—changes to the state of the system. Those changes include things like using kexec() to boot a different operating system, or writing a hibernation image to the swap partition for use in a resume. But both of those things could be used by an attacker to circumvent the protections that secure boot is meant to enforce.

If, for example, a user were to boot using one of the Microsoft-signed Linux bootloaders—"shim" or "pre-bootloader"—into a kernel that didn't restrict root's powers, that kernel could arrange to execute a different kernel, perhaps one that has been compromised. Worse yet, from the perspective of those worried about Microsoft blacklisting bootloaders or keys, that second kernel could actually be a malicious version of Windows. So, a secure-boot-protected system would end up booting mal-Windows, which is precisely the scenario secure boot is supposed to prevent.

If that occurs in the wild, various folks believe that Microsoft will blacklist the bootloader that was used in the attack. If that's the same bootloader used to boot Linux, new systems, as well as old systems that get the blacklist update, will no longer boot Linux. Matthew Garrett (at least) is concerned about that scenario, so he has proposed kernel changes that would prevent suitably configured kernels from using kexec() among a handful of other restrictions that could be used to circumvent secure boot.

That was back in early September, and those changes were relatively uncontroversial except for the capability name Garrett chose (CAP_SECURE_FIRMWARE) and the kexec() restriction. In mid-September, he followed up with a set of patches that were substantially the same, though the kexec() patch was removed and the capability was renamed to CAP_COMPROMISE_KERNEL. In the patch, he noted that "if anyone wants to deploy these then they should disable kexec until support for signed kexec payloads has been merged".

Things went quiet for more than a month, but have since erupted into a rather large thread. A query from Jiri Kosina about loading firmware into a secure boot kernel led to a discussion of the threat model that is being covered by Garrett's patch set. While Garrett agreed that firmware loading should eventually be dealt with via signatures, it is not as high on his priority list. An automated attack using crafted firmware would be very hardware-specific, requiring reverse engineering, and "we'd probably benefit from them doing that in the long run".

Garrett's focus on automated attacks makes it clear what threat models he is trying to thwart, so Kosina's next query, about resuming from hibernation, is an issue that Garrett believes should be addressed. It turns out that Josh Boyer has a patch to disable hibernation for secure boot systems, but that, like disabling kexec(), was not overwhelmingly popular.

There are other ways to handle resuming from hibernation, for example by creating keys at boot time that get stored in UEFI boot variables and that the kernel uses to sign hibernation images. But it is clear that some kernel developers are starting (or continuing) to wonder if the kernel secure boot support isn't going a bit—or far more than a bit—overboard.

For one thing, as James Bottomley pointed out, there will always be kernel bugs that allow circumvention of these restrictions (e.g. by root reading the hibernation signing key or flipping the capability bit):

[...] a local privilege elevation attack usually exploits some bug (classically a buffer overflow) which executes arbitrary code in kernel context. In that case, the same attack vector can be used to compromise any in-kernel protection mechanism including turning off the secure boot capability and reading the in-kernel private signing key.

[...] The point I'm making is that given that the majority of exploits will already be able to execute arbitrary code in-kernel, there's not much point trying to consider features like this as attacker prevention. We should really be focusing on discussing why we'd want to prevent a legitimate local root from writing to the suspend partition in a secure boot environment.

But kernel exploits appear to be "off the table", at least in terms of the secure boot circumvention that Garrett and others are concerned about. Kosina said:

My understanding is that we are not trying to protect against root exploiting the kernel. We are trying to protect against root tampering with the kernel code and data through legitimate use of kernel-provided [facilities] (/dev/mem, ioperm, reprogramming devices to DMA to arbitrary memory locations, resuming from hibernation image that has been tampered with, etc).

It's not exactly clear why Microsoft would make a distinction between a kernel exploit and using legitimate kernel services when making a blacklisting decision, though. But, for distributions that do ship signed kernels, they can reduce the attack surface substantially: to only those kernels that they have signed, with whatever vulnerabilities are present in those particular versions.

Eric Paris detailed one possible attack that installs a crafted Linux boot environment (with a legitimately signed bootloader and kernel), which sleeps after setting up a trojaned Windows hibernation image. Users would need to wake the machine twice, but would end up running malware in a secure boot system.

Bottomley and others are, at the very least, uncomfortable with the idea of an "untrusted root". At the heart of the kernel changes for secure boot is removing the ability for root to make persistent changes to the boot environment. The patches that Garrett has proposed close many of the known holes that would allow root to make those kinds of changes, but the argument is that there are likely to be others. As Alan Cox put it:

With all the current posted RH patches I can still take over the box as root trivially enough and you seem to have so far abolished suspend to disk, kexec and a pile of other useful stuff. To actually lock it down you'll have to do a ton more of this.

Another possible way to handle Linux being used as an attack vector against Windows (which is how keys are likely to get blacklisted) is to change the behavior of the Linux bootloaders. Bottomley suggested that a "present user" test on the first boot of the bootloader, which could be detected because the UEFI key database and the "machine owner key" database do not contain the proper keys, would alleviate the problem. Garrett pointed out that the shim bootloader does not do this because it needs to be able to boot unattended, even on first boot. But, Bottomley saw that as unfortunate:

[...] what I'm telling you is that by deciding to allow automatic first boot, you're causing the windows attack vector problem. You could easily do a present user test only on first boot which would eliminate it. Instead, we get all of this.

Garrett, though, sees unattended first boot as an absolute requirement, especially for those who are trying to do automated installations for Linux systems. Others disagreed, not surprisingly, and the discussion still continues. It should be noted that the pre-bootloader that Bottomley released does do a present user test on first boot (and beyond, depending on whether the user changes the secure boot configuration).

There does seem to be something of whack-a-mole problem here in terms of finding all of the ways that this "untrusted root" might be able to impact secure boot. In addition, new kernel features will have to also be scrutinized to see whether they need to be disabled depending on CAP_COMPROMISE_KERNEL.

Not trusting root is a very different model than kernel developers (and users) are accustomed to. One can imagine that all of the different problem areas will be tracked down eventually, but it will be a fair amount of work. Whether that work is truly justified in support of a feature that is largely (though not completely) only useful for protecting Windows is a big question. On the other hand, not being able to [easily] boot Linux on x86 hardware because of key blacklisting would be problematic too. This one will take some time to play out.

Comments (44 posted)

A NILFS2 score card

November 7, 2012

This article was contributed by Neil Brown

A recurring theme in the comments on various articles announcing the new f2fs "flash-friendly" filesystem was that surely some other filesystem might already meet the need, or could be adjusted to meet the need, rather than creating yet another new filesystem. This is certainly an interesting question, but not one that is easy to answer. The cost/benefit calculation for creating a new filesystem versus enhancing an existing one involves weighing many factors, including motivational, social, political, and even, to some extent, technical. It is always more fun building something from scratch; trying to enter and then influence an existing community is at best unpredictable.

Of the various factors, the only ones for which there is substantial visibility to the outside observer are the technical factors, so while they may not be the major factors, they are the ones that this article will explore. In particular, we will examine "NILFS2", a filesystem which has been part of Linux for several years and is — superficially at least — one the best contenders as a suitable filesystem for modest-sized flash storage.

NILFS2 was not written primarily to be a flash-based filesystem, so comparing it head-to-head on that basis with f2fs (which was) might not be entirely fair. Instead we will examine it on its own merits, comparing it with f2fs occasionally to provide useful context, and ask "could this have made a suitable base for a flash-focused filesystem?"

`NILFS2`: what is it?

NILFS2 is the second iteration of the "New Implementation of a Log-structured File System". It is described as a "Continuous Snapshotting" filesystem, a feature which will be explored in more detail shortly.

NILFS2 appears to still be under development, with lots of core functionality present, but a number of important features still missing, such as extended attributes, quotas, and an fsck tool. As such, it is in a similar state to f2fs: well worth experimenting with, but not really ready for production usage yet.

In contrast with f2fs, NILFS2 uses 64 bits for all block addressing, 96 bits for time stamps (nanoseconds forever!), but only 16 bits for link counts (would you ever have more than 65535 links to a file, or sub-directories in one directory?). F2fs, in its initial release, uses 32 bits for each of these values.

While f2fs is a hybrid LFS (Log-structured Filesystem), using update-in-place in a number of cases, NILFS2 is a pure LFS. With the exception of the superblock (stored twice, once at either end of the device), everything is written in one continuous log. Data blocks are added to the log, then indexing information, then inodes, then indexes for the inodes, and so on. Occasionally a "super root" inode is written, from which all other blocks in the filesystem can be found. The address of the latest "super root" is stored in the superblock, along with static values for various parameters of the filesystem and a couple of other volatile values such as the number of free blocks.

Whenever a collection of blocks is written to the log, it is preceded by a segment summary which identifies all the blocks in the segment (similar to the Segment Summaries of f2fs which are stored in a separate area). Consecutive segment summaries are linked together so that, in the event of a crash, all the segment summaries since the most recent super root can be read and the state of the filesystem can be reconstructed.

The segment size can be chosen to be any number of blocks, which themselves must have a size that is a power of two up to the page size of the host system. The default block size is 4KB and the default device segment size is 8MB. Segments can easily be made to line up with erase blocks in a flash device, providing their size is known. While NILFS2 tries to write whole segments at a time, it is not always possible, so a number of consecutive partial segments might be written, each with their own segment summary block.

Being a pure LFS, NILFS2 will never write into the middle of an active segment — as f2fs does when space is tight. It insists on "cleaning" partially used segments (copying live data to a new segment) to make more space available, and does not even keep track of which particular blocks in a segment might be free. If there are no clean segments beyond those reserved for cleaning, the filesystem is considered to be full.

Everything is a file!

The statement "Everything is a file" is part of the Unix tradition and, like many such statements, it sounds good without tying you down to meaning very much. Each of "file", "everything" and even "is a" is open to some interpretation. If we understand "file" to be a collection of data and index blocks that provide some linearly addressed storage, "everything" to mean most data and metadata — excepting only the superblock and the segment summaries — and "is a" to mean "is stored inside a", then NILFS2 honors this Unix tradition.

For example, in a more traditional filesystem such as ext3, inodes (of which there is one per file) are stored at fixed locations in the device — usually a number of locations distributed across the address space, but fixed nonetheless. In f2fs, a hybrid approach is used where the addresses of the inodes are stored in fixed locations (in the Node Address Table — NAT), while the inodes themselves appear in the log, wherever is convenient. For NILFS2, the inode table is simply another file, with its own inode which describes the locations of the blocks.

This file (referred to as the ifile) also contains a bitmap allowing unused inodes to be found quickly, and a "block group descriptor" which allows non-empty bitmaps to be found quickly. With the default block size, every 2²⁵ inodes has 1024 blocks for bitmaps, and one descriptor block, which lists how many bits are set in each of those bitmaps. If you want more inodes than that, a second descriptor block will be automatically allocated.

The inodes themselves are a modest 128 bytes in size, and here I must confess to an oversimplification in the article on f2fs. The statement that "Copy-on-write is rather awkward for objects that are smaller than the block size" holds a grain of truth, but isn't really true as it stands. The reality is more subtle.

The advantages of a small inode size are primarily in efficiency. Less space can be wasted, and fewer I/O requests are needed to load the same number of inodes. For a traditional filesystem with pre-allocated inode regions, the space wasted can be a significant issue. However, that does not really apply to an LFS which allocates the space on demand. The speed issue is slightly harder to reason about. Certainly if all the inodes for files in one directory live in one block, then the common task of running "ls -l" will be expedited. However if more information, such as extended attributes or file data for small files, is stored in a big inode, accessing that will require only one block to be read, not two.

The advantages of a block-sized inode — apart from the extra space, which is of uncertain value — is that inodes can be updated independently. OCFS2 (a cluster-based filesystem) uses this to simplify the locking overhead — a cluster node does not need to gain exclusive access to inodes in the same block as the one that it is interested in when it performs an update, because there aren't any. In an LFS, the main issue is reducing cleaning overhead. As we noted in the f2fs article, grouping data with similar life expectancy tends to reduce the expected cost of cleaning, so storing an inode together with the data that it refers to is a good idea. If there are several inodes in the one block, then the life expectancy will be the minimum for all the inodes, and so probably quite different from nearby data. This could impose some (hard to measure) extra cleaning cost.

On the whole, it would seem best for an LFS if the one-inode-per-block model were used, as there is minimal cost of wasted space and real opportunities for benefits. If ways are found to make maximal use of that extra space, possibly following some ideas that Arnd Bergmann recently suggested, then block-sized inodes would be even more convincing.

Small inodes might be seen as a reason not to choose NILFS2, though not a very strong reason. Adjusting NILFS2 to allow full-block inodes would not be a large technical problem, though it is unknown what sort of social problem it might be.

As with most filesystems, NILFS2 also stores each directory in a "file" so there are no surprises there. The surprise is that the format used is extremely simple. NILFS2 directories are nearly identical to ext2 directories, the only important difference being that they store a 64-bit inode number rather than just 32 bits. This means that any lookup requires a linear search through the directory. For directories up to about 400 entries (assuming fairly short names on average), this is no different from f2fs. For very large directories, the search time increases linearly for NILFS2, while it is logarithmic for f2fs. While f2fs is not particularly efficient at this task, NILFS2 clearly hasn't made any effort at effective support for large directories. There appears to be an intention to implement some sort of B-tree based directory structure in the future, but this has not yet happened.

Use the right tree for the job

If everything is a file, then it is clearly important to know what a file is. It starts at the inode which contains space for seven 64-bit addresses. When the file is small (seven blocks or less) these contain pointers to all of the allocated blocks. When the file is larger, this space changes to become the root of a B-tree, with three keys (file addresses), three pointers to other B-tree nodes, and a small header.

The interesting thing about this B-tree is that the leaves do not contain extents describing contiguous ranges of blocks; instead they describe each block individually. This is interesting because it does not fit the typical use-case for a B-tree.

The particular value of a B-tree is that it remains balanced independently of the ordering or spacing of keys being inserted or removed. There is a cost that blocks may occasionally need to be split or merged, but this is more than compensated for by the ability to add an unpredictable sequence of keys. When extents are being stored in a tree, it is not possible to predict how long each extent will be, or when an extent might get broken, so the sequence of keys added will be unpredictable and a B-tree is ideal.

When the keys being added to the index are the offsets of consecutive blocks, then the sequence is entirely predictable and a different tree is likely to be preferred. A radix tree (where the path through the tree is a simple function of the key value) is much more compact than a B-tree (as there is no need to store keys) and much simpler to code. This is the sort of tree chosen for f2fs, the tree used for ext3, and generally the preferred choice when block extents are not being used to condense the index.

The only case where a B-tree of individual blocks might be more efficient than a radix tree is where the file is very sparse, having just a few allocated blocks spread throughout a large area that is unallocated. Sparse files are simply not common enough among regular files to justify optimizing for them. Nor are many of the special files that NILFS2 uses likely to be sparse. The one exception is the checkpoint file (described later), and optimizing the indexing strategy for that one file is unlikely to have been a motivation.

So we might ask "why?". Why does NILFS2 use a B-tree, or why does it not use extents in its addressing? An early design document [PDF] suggests that B-trees were chosen due to their flexibility, and while it isn't yet clear that the flexibility is worth the cost, future developments might show otherwise. The lack of extent addressing can be explained with a much more concrete answer once we understand one more detail about file indexing.

Another layer of indirection

The headline feature for NILFS2 is "continuous snapshotting". This means that it takes a snapshot of the state of the filesystem "every few seconds". These are initially short-term snapshots (also called "checkpoints"), and can be converted to long-term snapshots, or purged, by a user-space process following a locally configurable policy. This means there are very likely to be lots of active snapshots at any one time.

As has been mentioned, the primary cost of an LFS is cleaning — the gathering of live data from nearly-empty segments to other segments so that more free space can be made available. When there is only one active filesystem, each block moved only requires one index to be updated. However, when there are tens or hundreds of snapshots, each block can be active in a fairly arbitrary sub-sequence of these, so relocating a block could turn into a lot of work in updating indices.

Following the maxim usually attributed to David Wheeler, this is solved by adding a level of indirection. One of the special files that NILFS2 uses is known as the "DAT" or "Disk Address Translation" file. It is primarily an array of 64-bit disk addresses, though the file also contains allocation bitmaps like the ifile, and each entry is actually 256 bits as there is also a record of which checkpoints the block is active in. The addresses in the leaves of the indexing trees for almost all files are not device addresses but are, instead, indexes into this array. The value found contains the actual device address. This allows a block to be relocated by simply moving the block and updating this file. All snapshots will immediately know where the new location is. Doing this with variable length extents would be impractical, which appears to be one of the main reasons that NILFS2 doesn't use them.

It should be noted that while this DAT file is similar to the NAT used by f2fs, it is different in scale. The f2fs NAT is used only to provide indirection when looking up nodes — inodes, index blocks, etc. — not when looking up data blocks. The DAT file is used for lookup of all blocks. This indirection imposes some cost on every access.

Estimating that cost is, again, not easy. Given a 4K block size, each block in the DAT file provides indexing for 128 other blocks. This imposes approximately a 1% overhead in storage space and at least a 1% overhead in throughput. If all the DAT entries for a given file are adjacent, the overhead will be just that 1%. If they are very widely spread out, it could be as much as 100% (if each DAT entry is in a different block of the DAT file). Files that are created quickly on a fresh filesystem will tend to the smaller number, files created slowly (like log files) on a well-aged filesystem will likely tend toward a larger number. An average of 3% or 4% probably wouldn't be very surprising, but that is little more than a wild guess.

Against this cost we must weigh the benefit, which is high frequency snapshots. While I have no experience with this feature, I do have experience with the "undo" feature in text editors. In my younger days I used "ed" and don't recall being upset by the lack of an undo feature. Today I use emacs and use undo all the time — I don't know that I could go back to using an editor without this simple feature. I suspect continual snapshots are similar. I don't miss what I don't have, but I could quickly get used to them.

So: is the availability of filesystem-undo worth a few percent in performance? This is a question I'll have to leave to my readers to sort out. To make it easier to ponder, I'll relieve your curiosity and clarify why not all files use the DAT layer of indirection. The answer is of course that the DAT file itself cannot use indirection as it would then be impossible to find anything. Every other file does use the DAT and every lookup in those files will involve indirection.

Other miscellaneous metadata

NILFS2 has two other metadata files. Both of these are simple tables of data without the allocation bitmaps of the DAT file and the ifile.

The "sufile" records the usage of each segment of storage, counting the number of active blocks in the segment, and remembering when the segment was written. The former is used to allow the segment to be reused when it reaches zero. The latter is used to guide the choice of which segment to clean next. If a segment is very old, there is not much point waiting for more blocks to die of natural attrition. If it is young, it probably is worthwhile to wait a bit longer.

The "cpfile" records checkpoints and every time a checkpoint is created a new record is added to the end of this file. This record stores enough information to reconstruct the state of all files at the time of the checkpoint. In particular, this includes the inode for the ifile. Left to itself, this file would grow without bound. However, in normal operation a user-space program (nilfs_cleanerd) will monitor usage and delete old checkpoints as necessary. This results in the cpfile becoming a sparse file with lots of empty space for most of the length of the file, a dense collection of records at the end, and an arbitrary number of individual blocks sprinkled throughout (for the long-term snapshots). This is the file for which a radix-tree index may not be the optimal indexing scheme. It isn't clear that would matter though.

Pros and cons

So now we must return to the question: from a technical perspective, would NILFS2 make a suitable base of a flash-optimized filesystem? The principal property of flash, that the best writes are sequential writes aligned to the underlying erase block size, is easily met by NILFS2, making it a better contender than filesystems with lots of fixed locations, but can we gain any insight by looking at the details?

One of the several flash-focused features of f2fs is that it has several segments open for writing at a time. This allows data with different life expectancies to be kept separate, and also improves the utilization of those flash devices that allow a degree of parallelism in access. NILFS2 only has a single segment open at a time, as is probably appropriate for rotating media with a high seek cost, and makes no effort to sort blocks based on their life expectancy. Adding these to NILFS2 would be possible, but it is unlikely that it would be straightforward.

Looking at the more generally applicable features of NILFS2, the directory structure doesn't scale, the file indexing is less than optimal, and the addressing indirection imposes a cost of uncertain size. On the whole, there seems to be little to recommend it and a substantial amount of work that would be required to tune it to flash in the way that f2fs has been tuned. It gives the impression of being a one-big-idea filesystem. If you want continual snapshots, this is the filesystem for you. If not, it holds nothing of real interest.

On the other side, f2fs comes across as a one-big-idea filesystem too. It is designed to interface with the FTL (Flash translation layer) found in today's flash devices, and provides little else of real interest, providing no snapshots and not working at all well with any storage device other than flash. Could this be a sign of a trend towards single-focus filesystems? And if so, is that a good thing, or a bad thing?

So, while NILFS2 could have been used as a starting point for a flash-focused filesystem, it is not at all clear that it would have been a good starting point, and it is hard to challenge the decision to create a new filesystem from scratch. Whether some other filesystem might have made a better start will have to be a question for another day.

Comments (18 posted)

Linus Torvalds Linux 3.7-rc4 ?

Greg KH Linux 3.6.6 ?

Thomas Gleixner 3.6.5-rt14 ?

Thomas Gleixner 3.6.5-rt15 brown paper bag release ?

Greg KH Linux 3.4.18 ?

Steven Rostedt 3.4.17-rt28 ?

Steven Rostedt 3.4.15-rt26 ?

Steven Rostedt 3.2.33-rt50 ?

Steven Rostedt 3.2.32-rt49 ?

Greg KH Linux 3.0.51 ?

Steven Rostedt 3.0.50-rt74 ?

Steven Rostedt 3.0.48-rt72 ?

Kees Cook arch/arm: support seccomp ?

Kees Cook remove CONFIG_EXPERIMENTAL ?

Steven Rostedt tracing: Using recursion bits for function tracing and ring buffer ?

Tejun Heo cgroup_freezer: implement proper hierarchy support ?

Alex Shi power aware scheduling ?

Luming Yu A simple hardware detector for latency as well as throughtput ver 0.10 ?

Stephane Eranian perf: add memory access sampling support ?

Jan Kiszka Add gdb python scripts as kernel debugging helpers ?

Jason Wang Multiqueue support in tuntap ?

Jiang Liu introduce ACPI based system device hotplug driver ?

Philip, Avinash Support for AM33xx PWM Susbsytem ?

Jon Mason PCI-Express Non-Transparent Bridge Support ?

Darrick J. Wong [RFC PATCH v2 0/3] mm/fs: Implement faster stable page writes on filesystems ?

Jeff Layton vfs: add the ability to retry lookup and operation to most path-based syscalls ?

Wen Congyang memory-hotplug: hot-remove physical memory ?

Glauber Costa kmem controller for memcg. ?

Michel Lespinasse mm: use augmented rbtrees for finding unmapped areas ?

Mel Gorman Foundation for automatic NUMA balancing ?

Srivatsa S. Bhat [RFC PATCH 00/10][Hierarchy] mm: Linux VM Infrastructure to support Memory Power Management ?

Rafael Aquini make balloon pages movable by compaction ?

Sjur Brændeland Introduce CAIF Virtio and reversed Vrings ?

George Zhang VMCI for Linux upstreaming ?

George Zhang VSOCK for Linux upstreaming ?

Slipping or skipping Fedora 18

By Nathan Willis
November 7, 2012

Fedora release dates have slipped in the past; the QA and release teams meet regularly for a Go/No-Go vote as the schedule draws to a close to assess whether outstanding blockers mandate additional development or QA time. But the Fedora 18 release cycle has seen more delays than usual — in large part because it incorporates an overhaul of the Anaconda installation tool, which is behind schedule. The inherent riskiness of rewriting such a critical component has some project members asking whether the feature-approval process itself needs refactoring, and others asking whether Fedora 18 should be pushed back significantly to ensure that the changes arrive with stability.

Anaconda runs loose

At the heart of the dilemma is the new UI feature for Anaconda. Although the name of the feature suggests an interface revamp, the work actually encompasses a host of changes, some of which touch on other low-level parts of the system, such as the removal of logical volume manager (LVM) storage as the default, and separating system-updating functionality into a separate tool.

Regardless of what one thinks about the specific changes, the refactored Anaconda is quite a bit behind schedule — and, just as importantly, it is still undergoing major development at a time in the development cycle when features should be implemented, and bugfixes should be the focus. As a result, the release schedule has been delayed five times; originally set for October 2, the beta release is now expected November 13. Anaconda issues dominate the list of blockers, though of course there are other issues.

Tom Lane expressed his frustration over the issue on the fedora-devel list on October 30:

It appears to me that anaconda is months away from being shippable. It's still got major features that are incomplete (one example above, but there are more), and I don't seem to be able to do anything at all with it without hitting serious bugs.

How is it that we're even considering shipping this version for F18? For any other package, we'd be telling the maintainer to hold off till F19. The rest of us don't get to be doing major feature development post-beta-freeze.

Lane's concerns were echoed by others, including the notion that the Anaconda development process was being allowed to run roughshod over rules and deadlines that were enforced on other packages. But the crux of the matter remained whether or not the new Anaconda could reasonably be expected to be finished. If not, Lane and the others argued, then surely shipping the Fedora 17 version of Anaconda was better than delaying the new release by still more months.

Adam Williamson argued that reverting to the old Anaconda would be more time consuming, since many of the new Anaconda's problems are fallout from changes to the Dracut initramfs infrastructure. "Oldui wasn't fixed for that, so if [we] were to try and switch back to oldui at this point, we'd have to go through the whole process of adjusting the code for the changes to dracut again, quite apart from any other issues."

Freeze or cut bait

For some, the one-week-delay announcements arriving weekly from the Go/No-Go meetings are as much a part of the problem as the readiness of the code. Fedora would be better to acknowledge the scope of the work remaining in Anaconda by pushing back the release schedule by one or two months, the argument goes. After all, the six month development cycle is not set in stone, and introducing major changes would be an acceptable reason to move it. Meanwhile, the weekly delay decisions consume time on their own, holding up other development teams — as well as generating many more "Fedora 18 delayed" news items. Lane contended that the project "can slip a month (or two) honestly, or you can fritter it away a week at a time, and ensure that as much of that time is unproductive as possible. There is not a third option."

David Airlie went even further, asking if the project should skip the Fedora 18 release entirely. That idea gained little traction, but there was support from many for pushing the release schedule back substantially. Tim Lauridsen called that idea preferable to reverting to the Fedora 17 Anaconda, since reverting would mean revisiting the new Anaconda integration problems during the Fedora 19 cycle.

But the length of the release cycle itself could also be contributing to the new Anaconda's continuing problems. Anaconda developer Vratislav Podzimek argued that the six month period on the calendar is not enough for a major rewrite, given how early the feature freeze takes place.

Originally it was about 3 months between the day F17 was released and the day new Anaconda was expected to work (F18 Alpha release). We of course didn't start the work on May 29, but since there were significant changes in F17 too at least part of the team had to fix bugs and make F17 releasable.

To that point, Toshio Kuratomi replied that, theoretically, each release cycle includes nine months of development time, because it begins when Rawhide is branched for the N+2 release, several months before N is final. However, he said, "there's a mismatch between this practice and the theory of how we develop Fedora." In practice, most developers do set up their development schedules around the not-quite-six-month release cycle. Fedora, he concluded, can either decide to bless this approach and work to make it more effective, or adopt a longer cycle.

There are some who feel that a longer cycle would benefit the project, but the idea has its downsides. For one, Fedora depends on several large upstream projects (e.g., GNOME) that use a six-month release cycle. Changing it would come at the cost of synchronization. For another, the reality is that most feature changes fit into the existing six-month schedule, and events like the Anaconda rewrite are the exception. Lengthening the release cycle for everyone would likely encourage other teams to work slower, and it would not prevent other large-scale features from running late — they would simply be later when they do run late. As Scott Schmit put it, if Fedora moves to a nine-month release cycle, "then people just get more ambitious with their features and then what? Slow the release cycle down more?"

Features and contingencies

But Schmit also contended that the root cause of the Anaconda problem was not that the Anaconda rewrite was big and disruptive, but that the feature came with no contingency plan, and the distribution did not adequately prepare for the risks in advance. There have been other "critical features" introduced in previous releases (GNOME 3 and systemd, for example), he said, about which Fedora leadership knew from the outset that a problem would force the entire release schedule to slip. In those instances, the teams worked hard to ensure that the new feature was ready and tested. In contrast, Anaconda's new UI "seems to have slipped through the cracks."

Others concurred; Williamson listed several high-impact effects of the new UI work that were not discussed when the feature was approved for inclusion:

* The change to only allowing one desktop environment to be installed interactively

* The change to not requiring a root password to be set

* The change to raw ext4 rather than LVM as the default partition scheme

* The fact that anaconda would no longer handle upgrades but an entirely new tool would be written for this

There were probably others, those are ones I recall. These four things alone are clearly major changes that merited discussion beyond the anaconda team and should have been at least explicitly included in the newUI feature and, in the fedup case, probably split out as its own feature.

Some, like Ralf Corsepius, suggested that the Fedora Engineering Steering Committee (FESCo) could alleviate similar woes in the future by making fallback strategies and contingencies part of the feature approval process. Williamson suggested extending Fedora's existing "critical path packages" concept to include "critical path features" as one such approach.

But not everyone sees a systemic problem in this case. To some, the Anaconda team simply should have started work on the rewrite a full release cycle before it was scheduled to be merged in — either working on it in parallel during the Fedora 17 cycle, or proposing it as a Fedora 19 feature. Jóhann B. Guðmundsson pointed out a conference presentation in which the Anaconda team speculated that the work would take two release cycles to complete. Considering that estimate from the team, it could hardly be a surprise that fitting the work into a single cycle proved to be a strain.

Of course, given limitless funds and an inexhaustible pool of developers, all sorts of distribution-building headaches would disappear. But the reality is that the Anaconda team has plenty on its plate already; certainly no one in the discussion accused team members of sitting idly by. They are as strapped for resources as anyone else. In any case, what-might-have-been is a moot point, or at least secondary in importance to preventing a recurrence in the future. The general consensus seems to be in favor of strengthening the feature approval process, particularly where high-impact changes are concerned. The list discussed the possibility of moving to a rolling-release model as well, but that suggestion never got beyond the speculative stage — for every proposed benefit, there was a downside, and the majority of both were hypothetical.

At least it is clear how working harder at assessing new critical features' viability and risks will improve things in future releases; whether that amounts to a formal process, or if the wake-up call of the new Anaconda experience will suffice, remains to be decided. As for what to do this cycle, so far no change in the process has been announced. The release schedule was pushed back again at the November 1 Go/No-Go meeting. The next meeting is scheduled for Thursday, November 8. As the schedule stands now, Fedora 18 is expected to be released December 11 — or about one month from now.

Comments (7 posted)

Distribution quote of the week

These sorts of articles seem to always be written as if the author honestly expects the list to be some sort of profound revelation. That no one even realizes these problems exist, and that now that they've been identified and put into a list, this will somehow be helpful in solving them.

Then, they're usually baffled by the extremely negative reaction that people working on Linux distributions have to a list that, in their eyes, is an unactionable and uninteresting merger of old, already fixed bugs, personal preferences, bare outlines of actual bugs without enough information to fix them, grand sweeping statements of vision with no resources attached, and misunderstandings.

-- Russ Allbery

Comments (none posted)

OpenBSD 5.2 Released

OpenBSD 5.2 has been released. "The most significant change in this release is the replacement of the user-level uthreads by kernel-level rthreads, allowing multithreaded programs to utilize multiple CPUs/cores." There are lots more new features and updates listed in the release announcement (click below).

Full Story (comments: 3)

openSUSE 12.2 for ARM

The openSUSE ARM team has released openSUSE 12.2 for ARM. "Initiated at the openSUSE Conference in 2011 in Nürnberg, the openSUSE ARM team has managed to bring one of the most important Linux distributions to the ARM architecture in a little over a year."

Comments (none posted)

Fedora 18 ARM Alpha Release

The Fedora ARM team has announced that the Fedora 18 Alpha release for ARM is available for testing. "The Alpha release includes prebuilt images for the Trimslice, Pandaboard and Highbank hardware platforms as well a Versatile Express image for use with QEMU."

Full Story (comments: none)

DragonFly 3.2.1 released

DragonFly BSD has announced the release of DragonFly BSD 3.2.1. The release notes contain details. "Significant work has gone into the scheduler to improve performance, using postgres benchmarking as a measure... DragonFly should be now one of the best selections for Postgres and other databases."

Comments (none posted)

bits from the DPL: October 2012

Debian Project Leader Stefano Zacchiroli has a few bits on his October activities. Topics include Debian on public clouds, DPL helpers meeting, events, delegations, and more.

Full Story (comments: none)

Introducing codesearch.debian.net, a regexp code search engine

Debian Code Search is a search engine for Debian source code packages. "It allows you to search all ≈ 17000 source packages, containing 130 GiB of FLOSS source code (including Debian packaging) with regular expressions."

Full Story (comments: none)

Fedora 18 Beta slips again

The Fedora Project has decided to slip the Fedora 18 beta release again. Interestingly, the final release date remains December 11, since there is resistance to pushing it closer to the holiday season. Meanwhile, there is an ongoing discussion on the fedora-devel list, the gist of which is that a number of developers think that rather more time is required to get this release into shape.

Full Story (comments: 7)

openSUSE 11.4 has reached end of SUSE support - 11.4 Evergreen goes on

openSUSE 11.4 has reached its official end-of-life, with no further support from SUSE. Maintenance will be continued by the Evergreen community team.

Full Story (comments: none)

Mageia 1 reaching EOL

The Mageia project released Mageia 1 June 1, 2011. Mageia supports releases for 18 months, so Mageia 1 will no longer be supported after December 1, 2012.

Comments (none posted)

Linuxfx GhostOS released version 6

Linuxfx GhostOS is a product of the Brazilian company Linuxfx. The newly released version 6 includes the KDE desktop with all plugins, drivers and applications needed for production and entertainment. This version also has full support for biometrics, but the access control software is currently only available in Portuguese. The OS itself supports Spanish and English, in addition to Portuguese.

Full Story (comments: none)

Distribution newsletters

DistroWatch Weekly, Issue 481 (November 5)
Ubuntu Weekly Newsletter, Issue 290 (November 4)

Comments (none posted)

Android turns 5 years old (The H)

The H covers five years of Android. "Five years ago on 5 November 2007, the then newly formed Open Handset Alliance (OHA) announced the launch of Android, described as a "truly open and comprehensive platform for mobile devices". Headed by Google, the OHA is a consortium of various organisations involved in developing the open source mobile platform. When it was founded, the group had 34 members including T-Mobile, HTC, Qualcomm and Motorola, and has since grown to 84 members including various other handset manufacturers, mobile carriers, application developers and semiconductor companies."

Comments (5 posted)

RTLWS: Modeling systems with Alloy

By Jake Edge
November 7, 2012

The final day of this year's Real Time Linux Workshop (RTLWS) had talks from a number of different researchers and others from the realtime Linux community. It also featured a talk from outside of that community, but on a topic—as might be guessed—that may be important to users and developers of realtime systems: modeling system behavior for better reliability. Eunsuk Kang from MIT gave an introduction to the Alloy language, which has been used to model and find bugs in systems ranging from network protocols to filesystems.

RTLWS organizer Nicholas Mc Guire prefaced the talk by noting that "Z" is used widely for modeling and design verification, but that there is no free Z support. Alloy, on the other hand, is similar to Z, but is free software. The availability of a free alternative will make this kind of analysis more widespread.

Kang is part of a group at MIT that focuses on how to build more reliable software, which is the motivation behind Alloy. Software design often starts with sketches, then moves into creating design documents in various formats. "Is this the best way to design software?" is one question to ask, he said.

Sketches are good because they are lightweight and informal. That allows people to brainstorm and to quickly prototype ideas. Documents are much more heavyweight and are geared toward completeness. But those documents can't be used to answer questions about the design, nor to determine whether the system design is consistent.

So, they set out to create a new language that is "simple" and would be easy to learn, so it would impose a low burden on its users. It could be used in a step between sketches and design documents. The language would be "precise and analyzable", so that one could ask questions about the system described. The language and tools would provide instant feedback in support of rapid prototyping. That new language is Alloy.

As a demonstration of Alloy and its development environment, Kang created a model of a filesystem in the language. The model consisted of several different kinds of objects, such as files, paths, and filesystems, as well as operations for things like adding and deleting files. The model didn't describe the implementation of the filesystem, just the relationships between the objects—and how they change based on the operations.

The model can then be "run" in the GUI. There is a built-in visualization component, which shows solutions that satisfy the constraints set out in the model. One can add assertions to the model, which can either be confirmed to a certain search depth or a counterexample can be found. Alloy is, in many ways, a descendant of languages like Prolog and automated theorem proving systems.

In the demo, Kang's seemingly reasonable model was missing a key constraint. Alloy was able to find a counterexample to an assertion because the model did not preclude multiple file objects having the same path. Once the add-file operation was changed to eliminate that possibility, no counterexample to the assertion (which essentially stated that the delete operation undoes the add operation) could be found.

Alloy has a small number of constructs, purposely. It is a declarative language that describes the system, but, importantly, not the implementation, in "pure first-order logic". Alloy allows for two different kinds of analysis, Kang said, simulation or assertion. The models get translated into a "constraint solving engine" that can find problems that testing misses.

Standard software testing tries to cover as many different behaviors of the system as possible, but one of its weaknesses is that it can't show the absence of bugs. The analyzer in Alloy is exhaustive, and can be used to "verify very deep properties" about the model, he said. The analyzer allows the designer to "reason about the behavior" of systems.

That kind of reasoning has been done using Alloy, Kang said, both by his group and other researchers and universities. One of his examples was not directly computer-related, instead concerning the beam-scheduling policies for radiation therapy at Massachusetts General Hospital. There are multiple treatment rooms, where several different kinds of requests for radiation from a single cyclotron can be made. There are multiple doctors and nurses making the requests, which get funneled to a master control room that enforces various safety requirements. By modeling the system, several race conditions and other problems were found that had the potential to silently drop requests.

In addition, Stanford and Berkeley have applied Alloy to web security protocols and found previously unknown vulnerabilities in HTML 5 and WebAuth. In fact, researchers using Alloy have found problems in protocols that had previously been "proved" correct. Pamela Zave analyzed the Chord peer-to-peer distributed hash table protocol, which had proofs of its correctness in one of the most cited papers in computer science, Kang said. But Zave found that it is not correct according to its specification, and that six properties believed to be true were not.

There are lots of other examples, he said, including smart card security, the flash filesystem for a Mars rover, and Java virtual machine security. What his group has found is that by using Alloy one will often "find very surprising results". There is ongoing work to generate test cases for the actual implementation from the models to help ensure that the code matches the model. There are also efforts to generate code from the models, but it is a difficult problem.

Going the other way, from existing code to models, is even more difficult. It is far more efficient to start modeling at the design stage, Kang said. The key observation from the team is that spending the time to model a system (or a portion of a larger system) is worth it because it will uncover "a lot of surprises". Those surprises are typically bugs, of course, so finding them—early or late—can only lead to better systems.

[ I would like to thank the RTLWS and OSADL for travel assistance to attend the conference. ]

Comments (9 posted)

Quotes of the week

No matter whether we increase the length of development or not, the time between feature freeze and the spinning of a release is too quick.

— Toshio Kuratomi

I have since then switched to the Awesome Window Manager which is probably the worst name you can use for software (try googling for "Awesome crash" to get an idea).

— Antoine Beaupré

Comments (2 posted)

Asterisk 11 available

Digium has unveiled version 11 of the free software telephony server Asterisk. This is a Long Term Support (LTS) release, which means support will be provided for four years. Changes in this version include WebRTC support, the DTLS-SRTP secure transport, and channel drivers for Jingle and Google Talk.

Comments (8 posted)

Raghavan: PulseConf 2012: Report

Arun Raghavan provides a report from the recent PulseAudio miniconference in Copenhagen. "We started off with a broad topic — what each of our personal visions/goals for the project are. Interestingly, two main themes emerged: having the most seamless desktop user experience possible, and making sure we are well-suited to the embedded world." (Thanks to Paul Wise)

Comments (9 posted)

GNOME 3.8 will depend on Python 3

Olav Vitters has announced that GNOME 3.8 will introduce a dependency on Python 3. Individual modules will be allowed to retain their dependency on Python 2, at the module owner's discretion, however. The GNOME wiki has been updated to include resource material for porting to Python 3; the changeover will begin during the 3.7.x development cycle.

Full Story (comments: none)

Development newsletters from the last week

Caml Weekly News (November 6)
What's cooking in git.git (November 4)
Haskell Weekly News (November 1)
Mozilla Hacks Weekly (November 1)
OpenStack Community Newsletter (November 2)
Perl Weekly (November 5)
PostgreSQL Weekly News (November 5)
Ruby Weekly (November 1)

Comments (none posted)

Seeking Enlightenment (The H)

The H interviews Carsten "Rasterman" Haitzler, leader of the Enlightenment project, about the desktop and its future. "The biggest thing E17 brings to the table is universal compositing. This means you can use a composited desktop without any GPU acceleration at all, and use it nicely. We don't rely on software fallback implementations of OpenGL. We literally have a specific software engine that is so fast that some developers spent weeks using it accidentally, not realising they had software compositing on their setup. E17 will fall back to software compositing automatically if OpenGL acceleration doesn't work. It is that good. It even works very nicely on an old Pentium-M @ 600Mhz with zero accelerated rendering support." (Thanks to Tom Arnold.)

Comments (78 posted)

LibreOffice and OpenOffice clash over user numbers (OStatic)

Susan Linton at OStatic reports on both sides of a dispute between LibreOffice and Apache OpenOffice project members over LibreOffice's download numbers. Rob Weir of OpenOffice challenged the LibreOffice statistics as "puffery" earlier in the week, claiming higher download rates for OpenOffice. Italo Vignoli of LibreOffice responded to the charge with a list of large organizations that have switched to LibreOffice, suggesting Weir was "scared by their success" — which, in turn, prompted another round of replies from each corner. Inter-project rivalries aside, the debate highlights the persistent difficulty of enumerating open source users. Who said statistics were boring?

Comments (61 posted)

Linaro Enterprise Group announced

Linaro has announced the formation of a group "to collaborate and accelerate the development of foundational software for ARM Server Linux". The Linaro Enterprise Group (LEG) consists of existing Linaro members ARM, Samsung, and ST-Ericsson, joined by new Linaro members AMD, Applied Micro Circuits Corporation, Calxeda, Canonical, Cavium, Facebook, HP, Marvell, and Red Hat. "The team will build on Linaro’s experience of bringing competing companies together to work on common solutions and enable OEMs, commercial Linux providers and System on a Chip (SoC) vendors to collaborate in a neutral environment on the development and optimization of the core software needed by the rapidly emerging market for low-power hyperscale servers." One would guess that the choice of LEG for the name of an ARM group was not entirely arbitrary.

Comments (8 posted)

Apple versus Motorola suit dismissed with prejudice

Groklaw reports that the Apple versus Motorola "FRAND royalty" patent case that was due to start on 5 November in the US District Court in Madison, Wisconsin has been dismissed, with prejudice.

Comments (none posted)

Let’s Limit the Effect of Software Patents, Since We Can’t Eliminate Them (Wired)

Richard Stallman shares some ideas to alleviate the patent problem. "The usual suggestions for correcting the problem legislatively involve changing the criteria for granting patents – for instance, to ban issuing patents on computational practices and systems to perform them. But this approach has two drawbacks. First, patent lawyers are clever at reformulating patents to fit whatever rules may apply; they transform any attempt at limiting the substance of patents into a requirement of mere form. For instance, many U.S. computational idea patents describe a system including an arithmetic unit, an instruction sequencer, a memory, plus controls to carry out a particular computation. This is a peculiar way of describing a computer running a program that does a certain computation; it was designed to make the patent application satisfy criteria that the U.S. patent system was believed for a time to require." (Thanks to Paul Wise)

Comments (222 posted)

Free Software Supporter -- Issue 55, October 2012

Free Software Supporter is the Free Software Foundation's monthly news digest. This issue covers a Spanish translation, GNUs trick-or-treat, nominate free software heros, happy Ada Lovelace Day, MediaGoblin, and several other topics.

Full Story (comments: none)

FSFE Newsletter - November 2012

This edition of the Free Software Foundation Europe's newsletter notes that rooting and flashing your device does not void the warranty in EU. Several other topics are also covered.

Full Story (comments: none)

Events: November 8, 2012 to January 7, 2013

The following event listing is taken from the LWN.net Calendar.

Date(s)	Event	Location
November 5 November 9	Apache OpenOffice Conference-Within-a-Conference	Sinsheim, Germany
November 5 November 8	ApacheCon Europe 2012	Sinsheim, Germany
November 7 November 9	KVM Forum and oVirt Workshop Europe 2012	Barcelona, Spain
November 7 November 8	LLVM Developers' Meeting	San Jose, CA, USA
November 8	NLUUG Fall Conference 2012	ReeHorst in Ede, Netherlands
November 9 November 11	Free Society Conference and Nordic Summit	Göteborg, Sweden
November 9 November 11	Mozilla Festival	London, England
November 9 November 11	Python Conference - Canada	Toronto, ON, Canada
November 10 November 16	SC12	Salt Lake City, UT, USA
November 12 November 16	19th Annual Tcl/Tk Conference	Chicago, IL, USA
November 12 November 17	PyCon Argentina 2012	Buenos Aires, Argentina
November 12 November 14	Qt Developers Days	Berlin, Germany
November 16 November 19	Linux Color Management Hackfest 2012	Brno, Czech Republic
November 16	PyHPC 2012	Salt Lake City, UT, USA
November 20 November 24	8th Brazilian Python Conference	Rio de Janeiro, Brazil
November 24 November 25	Mini Debian Conference in Paris	Paris, France
November 24	London Perl Workshop 2012	London, UK
November 26 November 28	Computer Art Congress 3	Paris, France
November 29 December 1	FOSS.IN/2012	Bangalore, India
November 29 November 30	Lua Workshop 2012	Reston, VA, USA
November 30 December 2	Open Hard- and Software Workshop 2012	Garching bei München, Germany
November 30 December 2	CloudStack Collaboration Conference	Las Vegas, NV, USA
December 1 December 2	Konferensi BlankOn #4	Bogor, Indonesia
December 2	Foswiki Association General Assembly	online and Dublin, Ireland
December 5 December 7	Open Source Developers Conference Sydney 2012	Sydney, Australia
December 5 December 7	Qt Developers Days 2012 North America	Santa Clara, CA, USA
December 5	4th UK Manycore Computing Conference	Bristol, UK
December 7 December 9	CISSE 12	Everywhere, Internet
December 9 December 14	26th Large Installation System Administration Conference	San Diego, CA, USA
December 27 December 30	29th Chaos Communication Congress	Hamburg, Germany
December 27 December 29	SciPy India 2012	IIT Bombay, India
December 28 December 30	Exceptionally Hard & Soft Meeting 2012	Berlin, Germany

If your event does not appear here, please tell us about it.

LWN.net Weekly Edition for November 8, 2012

Going mobile

Good IDEs

The Cabbage patcher

Csound composition/performance environments

Documenting the scene

The more musical notes

Outro

Security

About face

Deployment

Image is everything

Brief items

New vulnerabilities

drupal: multiple vulnerabilities

kernel: denial of service

kernel: information leak

mcrypt: buffer overflow

munin: multiple vulnerabilities

mysql: multiple unspecified vulnerabilities

openoffice.org: code execution

otrs: cross-site scripting

pcp: multiple unspecified vulnerabilities

remote-login-service: information leak

ssmtp: no TLS certificate validation

xlockmore: denial of service

Kernel development

Brief items

Kernel development news

Volatile ranges, take 12

Volatile anonymous ranges

Conclusion

NILFS2: what is it?

Everything is a file!

Use the right tree for the job

Another layer of indirection

Other miscellaneous metadata

Pros and cons

Patches and updates

Kernel trees

Architecture-specific

Build system

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Virtualization and containers

Distributions

Anaconda runs loose

Freeze or cut bait

Features and contingencies

Brief items

Distribution News

Debian GNU/Linux

Fedora

openSUSE

Other distributions

New Distributions

Newsletters and articles of interest

Development

Brief items

Newsletters and articles

Announcements

Brief items

Articles of interest

Upcoming Events

Events: November 8, 2012 to January 7, 2013

`NILFS2`: what is it?