LWN.net Weekly Edition for October 28, 2010
GStreamer: Past, present, and future
Longtime GStreamer hacker Wim Taymans opened the first-ever GStreamer conference with a look at where the multimedia framework came from, where it stands, and where it will be going in the future. The framework is a bit over 11 years old and Taymans has been working on it for ten of those years, as conference organizer Christian Schaller noted in his introduction. From a simple project that was started by Erik Walthinsen on an airplane flight, GStreamer has grown into a very capable framework that is heading toward its 1.0 release, promised by Taymans by the end of 2011.
Starting off with the "one slide about what GStreamer is", Taymans described the framework as a library for making multimedia applications. The core of the framework, which provides the plugin system for inputs, codecs, network devices, and so on, is the interesting part to him. The actual implementations of the plugins are contained in separate plugin libraries with a core-provided "pipeline that allows you to connect them together".
Some history
When GStreamer was started, the state of Linux multimedia was "very poor". XAnim was the utility for playing multimedia formats on Linux, but it was fairly painful to use. Besides GStreamer, various other multimedia projects (e.g. VLC, Ogle, MPlayer, FFmpeg, etc.) started in the 1999/2000 timeframe, which was something of an indication of where things were. The competitors were well advanced as QuickTime had appeared in 1991 and DirectShow in 1996. Linux was "way behind", Taymans said.
GStreamer's architecture came out of an Oregon Graduate Institute research project with some ideas from DirectShow (but not the bad parts) when the project was started in 1999. Originally, GStreamer was not necessarily targeted at multimedia, he said.
The use cases for GStreamer are quite varied, with music players topping the list. Those were "one of the first things that actually worked" using GStreamer. Now there are also video players (which are moving into web browsers), streaming servers, audio and video editors, and transcoding applications. One of the more recent uses for GStreamer, which was "unpredicted from my point of view", is voice-over-IP (VoIP); both the Empathy messaging application and the Tandberg video conferencing application use it.
After the plane flight, Walthinsen released version 0.0.1 in June 1999. By July 2002, 0.4.0 was released with GNOME support, though it was "very rough". In February 2003, 0.6.0 was released as the first version where audio worked well. After a major redesign to support multi-threading, 0.10.0 was released in December 2005. That is still the most recent major version, though there have been 30 minor releases, and 0.10.31 is coming soon. 0.10.x has been working very well, he said, which raises the question of when there will be a 1.0.
To try to get a sense for the size of the community and how it is growing, Taymans collected some statistics. There are more than 30 core developers in the project along with more than 200 contributors for a codebase that is roughly 205K lines of code. He also showed various graphs of the commits per month for the project and pointed out a spike around the time of the redesign for 0.10. There was also a trough at the point of the Git conversion. As expected, the trend of the number of commits per month rises over the life of the project.
In order to confirm a suspicion that he had, Taymans made the same graph for just the core, without the plugins, and found that commits per month has trailed off over the last year or so. The project has not been doing much in the way of new things in the core recently and this is reflected in the commit rate. He quoted Andy Wingo as an explanation for that: "We are in 'a state of decadence'".
When looking at a graph of the number of lines of code, you can see different growth rates between the core and plugins as well. The core trend line shows flat, linear growth. In contrast, the trend line for the plugins shows exponential growth. This reflects the growing number of plugins, many of which are also adding new features, while the core just gets incremental improvements and features.
The current state
Taymans then spent some time describing the features of GStreamer. It is fully multi-threaded now; that code is stable and works well. The advanced trick mode playback is also a high point, and it allows easy seeking within audio and video streams. The video editing support is coming along, while the RTP and streaming support are "top notch". The plugins are extensive and well-tested because they are out there and being used by lots of people. GStreamer is used by GNOME's Totem video player, which puts it in more hands. "Being in GNOME helps", he said.
The framework has advanced auto-plugging features that allow for dynamic pipeline changes to support a wide variety of application types. It is also very "binding friendly" as it has bindings for most languages that a developer might want to use. Developers will also find that it is "very debuggable".
There are many good points with the 0.10 codebase, and he is very happy with it, which is one of the reasons it has taken so long to get to a 1.0 release. The design of 0.10 was quite extensible, and allowed many more features to be added to it. Structures were padded so that additional elements could be added for new features, without breaking the API or ABI. For example, the state changes and clocks handling code was rewritten during the 0.10 lifetime. The developers were also able to add new features like navigation, quality of service, stepping, and buffering in 0.10.
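The padding technique can be illustrated with a small sketch. The structures and field names below are hypothetical, not GStreamer's actual definitions; the point is only that a new field carved out of reserved space leaves the total size and the offsets of the existing fields unchanged, which is what keeps older binaries working.

```python
import ctypes

# Hypothetical "version 1" structure: two real fields plus reserved slots.
class BufferV1(ctypes.Structure):
    _fields_ = [
        ("data", ctypes.c_void_p),
        ("size", ctypes.c_size_t),
        ("_reserved", ctypes.c_void_p * 4),  # padding kept for future use
    ]

# Hypothetical "version 2": one reserved slot has become a real field.
class BufferV2(ctypes.Structure):
    _fields_ = [
        ("data", ctypes.c_void_p),
        ("size", ctypes.c_size_t),
        ("extra", ctypes.c_void_p),          # new field, takes the first slot
        ("_reserved", ctypes.c_void_p * 3),  # one slot fewer than before
    ]

def layout(struct):
    """Return (total size, {field name: byte offset}) for a ctypes structure."""
    offsets = {name: getattr(struct, name).offset for name, _ in struct._fields_}
    return ctypes.sizeof(struct), offsets
```

Comparing `layout(BufferV1)` with `layout(BufferV2)` shows the same total size and identical offsets for `data` and `size`. Once the reserved slots are used up, this route is closed, which is the "running out of padding" problem that comes up later in the talk.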
Another thing that GStreamer did well was to add higher-level objects. GStreamer itself is fairly low-level, but for someone who just wants to play a file, there are a set of higher-level constructs to make that easy—like playbin2, for playing video and audio content, and tagreadbin to extract media metadata. The base classes that were implemented for 0.10, including those that have been added over the last five years, are also a highlight of the framework. Those classes handle things like sinks, sources, transforms, decoders, encoders, and so on.
There are also a number of bad points in the current GStreamer. The current negotiation of formats, codecs, and various other variable properties is too slow. The initial idea was to have an easy and comprehensible way to ask an object what it can do. That query will return the capabilities of the object, as well as the capabilities of everything that it is connected to, so the framework spends a lot of time generating a huge list of capabilities. Those capabilities are expressed in too verbose a format in Taymans's opinion. Reducing the verbosity and rethinking the negotiation API would result in major performance gains.
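A toy model may help show why such a query gets expensive. The `Element` class and its methods below are invented for illustration and are not GStreamer's API; the sketch assumes a simple linear chain in which each element answers with the intersection of its own formats and whatever its downstream neighbor reports.

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class Element:
    """Hypothetical pipeline element with a set of supported format names."""
    name: str
    formats: Set[str]
    downstream: Optional["Element"] = None
    queries: int = 0  # how many times this element has been asked

    def query_caps(self) -> Set[str]:
        """Formats usable from here to the end of the chain."""
        self.queries += 1
        if self.downstream is None:
            return set(self.formats)
        # The answer depends on everything connected downstream.
        return self.formats & self.downstream.query_caps()

def link(*elements):
    """Connect elements left to right and return the first one."""
    for upstream, downstream in zip(elements, elements[1:]):
        upstream.downstream = downstream
    return elements[0]
```

In this model every query walks the rest of the chain, so asking each element in an N-element pipeline once costs on the order of N squared lookups, before the size of the capability descriptions is even considered.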
The "biggest mistake of all" in GStreamer is that there is no extensible buffer metadata. Buffers are passed between the GStreamer elements, and there is no way to attach new information, like pointers to multiple video planes or information to handle non-standard strides, to those buffers. There also need to be generic ways to map the buffer data to support GPUs and DSPs, especially in embedded hardware. That is very difficult to handle with GStreamer currently, yet it is important for embedded performance.
While dynamic pipeline modifications work in the current code, "the moment you try it, you will suffer the curse of new segments", Taymans said. Those can cause the application to lose its timing and synchronization, and it is not easy to influence the timing of a stream, so it is difficult for an application to recover from. The original idea was that applications would create objects that encapsulated dynamic modifications, but that turned out not to be the case. There are also a handful of minor problems with 0.10, including an accumulation of deprecated APIs, running out of padding in some structures, and it becoming harder to add new features without breaking the API/ABI.
A look to the future
To address those problems, Taymans laid out the plans for GStreamer for the next year or so. In the short term, there will be a focus on speeding up the core, while still continuing to improve the plugins. There are more applications trying to do realtime manipulation of GStreamer pipelines, so it is important to make the core faster to support them. Reducing overhead by removing locks in shared data structures will be one of the ways used to increase performance.
In the medium term, over the next 2 or 3 months, Taymans will be collecting requirements for the next major version. The project will be looking at how to fix the problems that have been identified, so if anyone "has other problems that need fixing, please tell me". There will also be some experimentation in Git branches for features like adding extensible buffer metadata.
Starting in January, there will be a 0.11 branch and code will be merged there. Porting plugins to 0.11 will then start, with an eye toward having them all done by the end of 2011. Once the plugins have started being ported, applications will be as well. Then there will be a 1.0 release near the end of 2011, not 2010 as was announced in the past. "This time we'll do it, promise". Taymans concluded his talk with a joking promise that "world domination" would then be the result of a GStreamer 1.0 release.
Luminance HDR 2.0.1, an improved, but still trying, photography tool
The term "high dynamic range photography" (HDR) encompasses a variety of techniques for working with specialized image formats that are capable of handling extremes of brightness and shadow beyond what can be stored in more pedestrian file formats like TIFF and JPEG, and beyond what can be displayed on CRT and LCD monitors. The leading HDR application for desktop Linux users is Luminance HDR, though that dominance is mostly by default: Linux-based HDR applications are quite scarce, and have, if anything, become more so since the Grumpy Editor's HDR with Linux article was published in 2007. Luminance recently released an update, which makes progress on the usability front, but still leaves considerable room for growth in the future.
Version 2.0.1 was unveiled on October 9, the first update to the 2.0-series released by the project's new maintainer Davide Anastasia. Anastasia inherited maintenance duties in September, making him the fourth project leader in two years. At that time he outlined a short list of goals on the project blog, beginning with fixing long-standing crashes, then working to undo feature regressions introduced in the 2.0 release, and finally improving on what many users and software reviewers have (accurately) described as a confusing user interface. 2.0.1 introduces a few cleanups, but primarily consists of bug-fixes.
Linux users can download source code packages from the project's SourceForge.net site (the URL comes from the application's original name, Qtpfsgui — arguably the most intimidating project moniker open source software has ever released). There are Mac OS X and Windows binaries provided as well. The code is simple to compile; it uses the Qmake build tool and depends on Qt4, the image processing libraries Exiv2, libTIFF, and OpenEXR, and the FFTW3 and GNU Scientific Library math libraries. The only hiccup that I encountered in the build process was that Qt4-specific versions of Qmake, the Qt user interface compiler UIC, and Qt meta-object compiler MOC are required; those who build Qt4 applications regularly should have no trouble whatsoever.
HDR workflow: image creation
Presumably, anyone with hardware capable of natively capturing and displaying HDR content also has special-purpose editing software provided by Skywalker Ranch, Weta Digital, or some other professional studio. Luminance is designed for the rest of us, with standard-issue digital cameras and displays. Thus, its workflow consists of two major tasks: importing a set of low dynamic range (LDR) images to blend into a single HDR image, and taking an HDR image and mapping it into an LDR format for distribution over the web or in print.
Most readers have seen HDR-based photos on Flickr or other online sites. The canonical example scenarios are city streets photographed late at night (where the buildings, the street lights, and the lit windows are all visible in one shot) and scenes in broad daylight, where both sun-lit and shaded subjects are properly exposed. In all of these situations, the key is taking multiple exposures at different settings: some exposed for the shadow areas, some exposed for the highlights. In software, we can blend them together, leaving neither washed-out bright spots nor murky, underexposed shadows.
Creating a new HDR image in Luminance consists of loading in the set of
LDR originals, taken at bracketed exposure settings, lining them up, and
blending the stack down into a single image. The image importer allegedly
supports a wide variety of formats, including any camera raw file type
supported by DCraw,
JPEG, and TIFF. Once imported, Luminance reads the exposure setting from
each file's Exif tags, or allows you to input it manually if such a tag
cannot be found. There are two automatic image-alignment algorithms
available — an internal scheme labeled "Median threshold bitmap," and
the align_image_stack function from the open source panorama tool Hugin. Alternately, you can
choose to manually align the images using built-in editing tools. Finally,
you must choose an HDR image "profile," consisting of a weighting function,
response curve, and HDR creation algorithm. The settings you choose are
applied to your stack of input images, and the result pops up in a preview
window, where you can inspect it in all of its HDR glory.
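To give a feel for what the weighting function and the blend do, here is a minimal sketch. It is not Luminance's code and does not correspond to any of its profiles: it assumes a linear sensor response (so no response curve has to be recovered), uses a simple triangle weight, and treats each image as a flat list of values between 0 and 1. The function names are made up for this example.

```python
def hat_weight(z):
    """Triangle ("hat") weight: trust mid-tones, distrust values near 0 or 1."""
    return 1.0 - abs(2.0 * z - 1.0)

def merge_exposures(images, exposure_times):
    """Blend same-sized LDR images into one list of relative radiance values.

    Each image is a flat list of pixel values in [0, 1]. A linear response
    is assumed, so pixel / exposure_time is already a radiance estimate.
    """
    merged = []
    for pixels in zip(*images):
        estimates = [z / t for z, t in zip(pixels, exposure_times)]
        weights = [hat_weight(z) for z in pixels]
        total = sum(weights)
        if total > 0.0:
            value = sum(w * e for w, e in zip(weights, estimates)) / total
        else:
            # Every exposure is fully black or fully clipped here:
            # fall back to a plain average of the estimates.
            value = sum(estimates) / len(estimates)
        merged.append(value)
    return merged
```

A pixel that is blown out in the long exposure gets zero weight there, so its value in the result comes from the shorter exposures. That is the mechanism that keeps highlights from washing out in the merged image.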
In my tests, however, there were more than a few pitfalls to this process. First, selecting and loading the LDR images is more difficult than it needs to be, because you must select all of the images you want to use in the file selector, at the same time (i.e., there is no "add another" button). This means they must all be in the same directory, and on a practical level it means you must look them up in an image previewer first, because there is no thumbnail preview, and after a while the contents of IMG_4342.CR2 and IMG_4243.CR2 become harder to memorize. Luminance also cannot read Exif tags from TIFF files, and I was unable to successfully load JPEG conversions of my raw images, with Luminance complaining that they were an "invalid size," regardless of what size they were saved at.
The alignment step is also problematic; the Median threshold algorithm
crashed every time I tried it, and align_image_stack tended to hang
indefinitely. Eventually I decided to align my test images in Hugin
directly, but this is also a very trying process. The wiki documentation
is more than two releases out-of-date, and I could not decipher which
combination of checkboxes needed to be set for Hugin to align the images
geometrically without attempting to blend their exposure settings. That
experiment ended up being a useless tangent anyway, however, because
Luminance could not read any of the TIFF files Hugin produced. At the
Hugin wiki's suggestion, I also attempted to use the Perl-based hdrprep
for alignment, but it too failed to read the Exif data from the TIFFs.
The manual alignment tools offer some fine-grained control, including multiple ways to overlay two images on the canvas in order to eyeball their overlap and a masking function called Anti Ghosting. Sadly these tools also fall a bit short, primarily because there is no way to correct rotation problems, only vertical and horizontal pixel shifts. Even when I took test photos with a tripod, a small amount of rotation misalignment was part of the natural wind-and-camera-shake effects.
It is also difficult to make an informed choice about the HDR profiles, which are named "Profile 1" through "Profile 6." The weighting function, response curve, and HDR creation algorithm options are similarly opaque, and because installing 2.0.1 from source evidently does not include the user manual, looking up the HDR terminology online is the confused user's only recourse.
HDR workflow: image output
Luminance is capable of saving directly to several HDR image formats, including LogLuv TIFF and OpenEXR. These formats use floating point numbers rather than integers for pixel data, allowing them to encode a much wider range of total values, potentially 38 f-stops, depending on the options.
This is orders of magnitude greater than a modern PC screen can display, so Luminance provides an HDR "visualizer" that allows you to explore an HDR image by adjusting the gamma and exposure with sliders. This might confuse new users, because at first glance the process of importing and blending the source images appears to have produced nothing more than another LDR image; in fact, the visualizer only shows a portion of the image's range at a time, due to the physical limitations of the display.
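For readers unfamiliar with the unit: an f-stop is a doubling of light, so a dynamic range expressed in stops is the base-2 logarithm of the ratio between the brightest and darkest values. A small helper (the name is just for illustration) makes the arithmetic explicit:

```python
import math

def stops(brightest, darkest):
    """Dynamic range in f-stops: each stop is a doubling of light."""
    if brightest <= 0 or darkest <= 0:
        raise ValueError("both values must be positive")
    return math.log2(brightest / darkest)
```

For example, `stops(255, 1)` comes out just under 8, which is the ratio between the largest and smallest non-zero values of an 8-bit channel if gamma encoding is ignored.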
If your goal is to save the image to OpenEXR or another format, you can do so as soon as the import process is complete. Most of the time, however, you will be interested in the second of Luminance's major tasks, compressing the HDR image back into a common LDR format — in such a way that it preserves as much detail as possible. You do this with the "Tonemap HDR image" menu entry, which brings up a workspace where you can select and test nine different tone-mapping algorithms, creating thumbnail-sized preview images before committing to a final choice.
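As a rough illustration of what a tone-mapping operator does, the sketch below follows the simple global operator described by Reinhard and colleagues in 2002: luminances are scaled relative to their log-average and then compressed with L / (1 + L). It is not Luminance's implementation and ignores the extra settings the application exposes; the `key` and `delta` defaults are assumptions chosen for the example.

```python
import math

def reinhard_global(luminances, key=0.18, delta=1e-6):
    """Map HDR luminances (non-negative floats) into the range [0, 1).

    Scale by key / log-average luminance, then compress with L / (1 + L).
    """
    if not luminances:
        return []
    # delta keeps log() finite when a pixel is exactly black.
    log_avg = math.exp(
        sum(math.log(delta + lum) for lum in luminances) / len(luminances)
    )
    scaled = [key / log_avg * lum for lum in luminances]
    return [s / (1.0 + s) for s in scaled]
```

However large the input values are, the output stays below 1 and the ordering of the pixels is preserved; very bright areas are squeezed together while dark areas keep most of their separation.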
Here again the user interface confronts the user with a formidable list
of techno-speak options and little in the way of explanation. At some
level, that is expected; the algorithms have scientific (rather than
marketing-approved) names such as Mantiuk '06 and Reinhard '02 because they
are named after their
creators. But without reading the original papers, it is unreasonable
to expect a user to decipher all of the individual settings. The Ashikhmin
algorithm, for example, sports a checkbox labeled "Simple" and a radio
button allowing you to choose between "Equation Number 2" and "Equation
Number 4." Anyone who can guess what that means without looking is my
personal hero.
Still, at least Luminance gets it right by allowing you to experiment with multiple test images and to compare them side-by-side. Other parts of the GUI (such as the image loader) have a frustrating lack of backup or undo operations. The final output, after all, is the end that justifies all of the means — so if a user can experiment with different algorithms and eventually stumble across a pleasant result, he or she will be happy even if the underlying formulas remain a mystery.
The upper-bound on usability
The tone-mapping algorithm "issue" raises an important question for Luminance and other niche graphics applications, namely: is it always possible to build a user interface with novice-level simplicity, or are some tasks inherently complicated? Do users really need all nine tone-mapping algorithms? Perhaps Luminance could be refactored to hide all of the mathematical details from the user, or dress them up in friendlier terms — but maybe that process would destroy too much of the application itself, turning it into a toy. The same question could probably be asked about Hugin or any of several complex GIMP plugins.
I tend to think that photographers (like everyone else) have a greater capacity for understanding the scary mathematical and theoretical tasks than they give themselves credit for. Most have gotten used to the arcane demosaicing and noise-removal algorithms found in raw image editors, after all. While Luminance 2.0.1 was frustrating to work with for many reasons, the bulk of the frustration came not from exposing too much scientific technobabble, but from the same sort of usability and interface problems that plague any understaffed project: the lack of thumbnail previews, vacant tooltips, missing "undo" buttons, unsupported file formats, and sudden crashes. My guess is that, absent those stumbling blocks, almost any user could get used to the peculiarities of HDR image creation and tone-mapping.
That having been said, Anastasia has his work cut out for him. Luminance has had many cooks in recent years, a fact that has undoubtedly contributed to its perplexing user interface and crash-proneness. Cleaning it up is high on Anastasia's to-do list as project maintainer; those of us who want to see a high-quality open source HDR tool can only hope he manages to build some momentum. Version 2.0.1, although it was only a bug-fix release, is a tantalizing first step because it came mere weeks after Anastasia took over the reins — the gap between the last 1.9.x release and 2.0.0 lasted well over a year. Today, Luminance has an active maintainer, a new release, and a TODO file included with the source code package. It isn't perfect, but it could be the beginning of something good.
Ghosts of Unix Past: a historical search for design patterns
The exploration of design patterns is importantly a historical search. It is possible to tell in the present that a particular approach to design or coding works adequately in a particular situation, but to identify patterns which repeatedly work, or repeatedly fail to work, a longer term or historical perspective is needed. We benefit primarily from hindsight.
The previous series of articles on design patterns took advantage of the development history of the Linux kernel only implicitly, looking at the patterns that could be found in the kernel at the time with little reference to how they got there. Perspective was provided by looking at the results of multiple long-term development efforts, all included in the one code base.
For this series we try to look for patterns which become visible only over an extended time period. As development of a system proceeds, early decisions can have consequences that were not fully appreciated when they were made. If we can find patterns relating these decisions to their outcomes, it might be hoped that a review of these patterns while making new decisions will help to avoid old mistakes or to leverage established successes.
Full exploitation
A very appropriate starting point for this exploration is the Ritchie and Thompson paper, published in Communications of the ACM, which introduced "The Unix Time-Sharing System". In that paper the authors claimed that the success of Unix was not in "new inventions but rather in the full exploitation of a carefully selected set of fertile ideas." The importance of "careful selection" implies a historical perspective much like the one here proposed for exploring design patterns. A selection can only be made if previous experience is available which demonstrates a number of design avenues to choose between. It is to be hoped that identifying patterns would be one aspect of the care taken in that selection.
Over four weeks we will explore four design patterns which can be traced back to that early Unix of which Ritchie and Thompson wrote, but which can be seen much more clearly from the current perspective. Unfortunately they are not all good, but both good and bad can provide valuable lessons for guiding subsequent design.
"Full exploitation" is essentially a pattern in itself, and one we will come back to repeatedly. Whether it is applied to software development, architecture, or music composition, exploiting a good idea repeatedly can enhance the integrity and cohesion of the result and is - hopefully - a pattern that does not need further justification. That said, "full exploitation" can benefit from detailed illumination. We will gain such illumination for this, as for the other three patterns, by examining two specific examples.
Ritchie and Thompson identified in their abstract several features of Unix which they felt were noteworthy. The first two of these will be our first two examples. Using their words:
- A hierarchical file system incorporating demountable volumes,
- Compatible file, device, and inter-process I/O,
File Descriptors
The second of these is sometimes seen as a key hallmark of Unix and has been rephrased as "Everything is a file". However that term does the idea an injustice as it overstates the reality. Clearly everything is not a file. Some things are devices and some things are pipes and while they may share some characteristics with files, they certainly are not files. A more accurate, though less catchy, characterization would be "everything can have a file descriptor". It is the file descriptor as a unifying concept that is key to this design. It is the file descriptor that makes files, devices, and inter-process I/O compatible.
Though files, devices and pipes are clearly different objects with different behaviors, they nonetheless have some behaviors in common and by using the same abstract handle to refer to them, those similarities can be exploited. A program or library routine that does not care about the differences does not need to know about those differences at all, and a program that does care about the differences only needs to know at the specific places where those differences are relevant.
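A short sketch shows what that looks like in practice. The `drain()` routine below knows nothing about what kind of object it is reading from; the two wrappers (whose names are invented for this example) hand it a pipe and a regular file or device node respectively.

```python
import os

def drain(fd, chunk_size=4096):
    """Read from any readable file descriptor until end-of-file."""
    chunks = []
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:  # an empty read means EOF for files and pipes alike
            break
        chunks.append(chunk)
    return b"".join(chunks)

def via_pipe(payload):
    """Send a small payload through a pipe and drain the read end."""
    read_end, write_end = os.pipe()
    try:
        # Small payloads only: a large write would block on a full pipe.
        os.write(write_end, payload)
    finally:
        os.close(write_end)  # closing the writer lets the reader see EOF
    try:
        return drain(read_end)
    finally:
        os.close(read_end)

def via_file(path):
    """Open a regular file or device node and drain it the same way."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return drain(fd)
    finally:
        os.close(fd)
```

Only the code that creates the descriptor (`os.pipe()` versus `os.open()`) needs to know what sort of object is involved; the reading loop is shared.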
By taking the idea of a file descriptor and exploiting it also for serial devices, tape devices, disk devices, pipes, and so forth, Unix gained an integrity that has proved to be of lasting value. In modern Linux we also have file descriptors for network sockets, for receiving timer events and other events, and for accessing a whole range of new types of devices that were barely even thought of when Unix was first developed. This ability to keep up with ongoing development demonstrates the strength of the file-descriptor concept and is central to the value of the "full exploitation" pattern.
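One practical payoff is that a single waiting primitive can cover objects of quite different kinds. In the sketch below (function names are illustrative), a pipe and a socket are handed to the same `select()` call, which reports whichever descriptor has data waiting.

```python
import os
import select
import socket

def ready_descriptors(fds, timeout=1.0):
    """Return the subset of fds that are readable within timeout seconds."""
    readable, _, _ = select.select(fds, [], [], timeout)
    return readable

def demo():
    """Wait on a pipe and a socket at once; only the socket has data."""
    pipe_r, pipe_w = os.pipe()
    sock_a, sock_b = socket.socketpair()
    try:
        sock_b.send(b"ping")
        ready = ready_descriptors([pipe_r, sock_a.fileno()])
        return ready == [sock_a.fileno()]
    finally:
        os.close(pipe_r)
        os.close(pipe_w)
        sock_a.close()
        sock_b.close()
```

The waiting code never asks what kind of object sits behind each number, which is why newer descriptor types can be slotted into existing event loops without changing them.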
As we shall see, the file descriptor concept was not exploited as fully as it possibly could have been, either initially or during ongoing development. Some of the weaknesses that we will find are in places where there was missed opportunity for full exploitation of file descriptors or related ideas, and many of the strengths are in places where file descriptors were used to enable new functionality.
Single, Hierarchical namespace
The other noteworthy feature identified by Ritchie and Thompson (first in their list) was a hierarchical filesystem incorporating demountable volumes.
There are three key aspects to this file system which are particularly significant for the present illustration.
- It was hierarchical. We are so used to hierarchical namespaces
today that this seems like it should be a given. However at the time
it was somewhat innovative. Some contemporaneous filesystems, such as
the one used in CP/M, were completely flat with no sub-directories.
Others might have a fixed number of levels to the hierarchy, typically
two. The Unix filesystem allowed an arbitrarily deep hierarchy.
- It allowed demountable volumes. While each distinct storage
volume could store a separate hierarchical set of files, this
separation was hidden by combining all of these file sets into a
single all-encompassing hierarchy. Thus the idea of hierarchical
naming was exploited not just for a single device, but across the
union of all storage devices.
- It contained device-special files. These are filesystem objects that provide access to devices, both character devices like modems and block devices like disk drives. Thus the hierarchical naming scheme covered not only files and directories, but also all devices.
The design idea being fully exploited here is the hierarchical namespace. The result of exploiting it within a single storage device, across all storage devices, and providing access to devices as well as storage, is a "single namespace". This provides a uniform naming scheme to provide access to a wide variety of the objects managed by Unix.
The most obvious area where this exploitation continued in subsequent development is the area of virtual filesystems, such as procfs and sysfs in Linux. These allowed processes and many other entities which were not strictly devices or files to appear in the same common namespace.
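Because procfs lives in the ordinary namespace, no special API is needed to look at processes; the usual file and directory calls suffice. The following sketch assumes a Linux system with procfs mounted at /proc, and the helper names are made up for the example.

```python
import os

def proc_status_field(pid, field):
    """Look up one "Name: value" line from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status", "r") as status:
        for line in status:
            name, _, value = line.partition(":")
            if name == field:
                return value.strip()
    return None

def running_pids():
    """List process IDs by reading the /proc directory like any other."""
    return sorted(int(entry) for entry in os.listdir("/proc") if entry.isdigit())
```

Any tool that can read a file or list a directory, from `cat` to a shell glob, gets the same access without knowing anything about processes.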
Another effective exploitation is in the various autofs or auto-mount implementations which allow other objects, which are not necessarily storage, to appear in the namespace. Two examples are /net/hostname which includes hosts on the local network into the namespace, and /home/username which allows user names to appear. While these don't make hosts and users first-class namespace objects they are still valuable steps forward. In particular the latter removes the need for the tilde prefix supported by most shells and some editors (i.e. the mapping from ~username to that user's home directory). By incorporating this feature directly in the namespace, the functionality becomes available to all programs.
As with file descriptors, the hierarchical namespace concept was not exploited as fully as might have been possible so we don't really have a single namespace. Some aspects of this incompleteness are simple omissions which have since been rectified as mentioned above. However there is one area where a hierarchical namespace was kept separate, with unfortunate consequences that still aren't fully resolved today. That namespace is the namespace of devices. The device-special files used to include devices into the single namespace, while effective to some degree, are a poor second cousin to doing it properly.
A little reflection will show that the device namespace in Unix is a hierarchical space with three or more levels. The top level distinguishes between 'block' and 'character' devices. The second level, encoded in the major device number, usually identifies the driver which manages the device. Beneath this are one or two levels encoded in bit fields of the minor number. A disk drive controller might use some bits to identify the drive and others to identify the partition on that drive. A serial device driver might identify a particular controller, and then which of several ports on that controller corresponds to a particular device.
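The coordinates of a device in this hidden hierarchy can be read back from its device-special file. The helper below (its name is an invention for this example) reports the top-level block/character distinction along with the major and minor numbers.

```python
import os
import stat

def device_address(path):
    """Return (kind, major, minor) for a device special file."""
    info = os.stat(path)
    if stat.S_ISBLK(info.st_mode):
        kind = "block"
    elif stat.S_ISCHR(info.st_mode):
        kind = "char"
    else:
        raise ValueError(f"{path} is not a device special file")
    return kind, os.major(info.st_rdev), os.minor(info.st_rdev)
```

On Linux, `device_address("/dev/null")` reports a character device with major number 1 and minor number 3. Note that this only works for a path that already names a device; there is no equivalent call for listing everything under a given major number.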
The device special files in Unix provide only limited access to this namespace. It can be helpful to see them as symbolic links into this alternate namespace which add some extra permission checking. However while symlinks can point to any point in the hierarchy, device special files can only point to the actual devices, so they don't provide access to the structure of the namespace. It is not possible to examine the different levels in the namespace, nor to get a 'directory listing' of all entries from some particular node in the hierarchy.
Linux developers have made several attempts to redress this omission with initiatives such as devfs, devpts, udev, sysfs, and more recently devtmpfs. Given the variety of attempts, this is clearly a hard problem. Part of the difficulty is maintaining backward compatibility with the original Unix way of using device special files which gave, for example, stable permission setting on devices. There are doubtless other difficulties as well.
Not only was the device hierarchy not fully accessible, it was not fully extensible. The old limit of 255 major numbers and 255 minor numbers has long since been extended with minimal pain. However the top level of "block or char" distinction is more deeply entrenched and harder to change. When network devices came along they didn't really fit either as "block" or "character" so, instead of being squeezed into a model where they didn't fit, network devices got their very own separate namespace which has its own separate functions for enumerating all devices, opening devices, renaming devices etc.
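The separation is easy to observe from user space. In this sketch (helper names are illustrative), network interfaces are listed through their own enumeration call, and a second helper checks whether a name has any counterpart under /dev.

```python
import os
import socket

def network_devices():
    """Names of network interfaces, obtained via a dedicated enumeration call."""
    return [name for _index, name in socket.if_nameindex()]

def visible_in_dev(name):
    """True if a node with this name exists directly under /dev."""
    return os.path.exists(os.path.join("/dev", name))
```

On a typical Linux system the loopback interface `lo` shows up in `network_devices()`, while `visible_in_dev("lo")` is false: the interface cannot be opened, listed, or renamed with the ordinary file calls.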
So while hierarchical namespaces were certainly well exploited in the early design, they fell short of being fully exploited, and this led to later extensions not being able to continue the exploitation fully.
Closing
These two examples - file descriptors and a uniform hierarchical namespace - illustrate the pattern of "full exploitation" which can be a very effective tool for building a strong design. While we can see with hindsight that neither was carried out perfectly, they both added considerable value to Unix and its successors, adequately demonstrating the value of the pattern. Whenever one is looking to add functionality it is important to ask "how can this build on what already exists rather than creating everything from scratch?" and equally "How can we make sure this is open to be built upon in the future?"
The next article in this series will explore two more examples, examine their historical development, and extract a different pattern -- one that brings weakness rather than strength. It is a pattern that can be recognized early, but still is an easy trap for the unwary.
Exercises
The interested reader might like to try the following exercises to further explore some of the ideas presented in this article. There are no definitive answers, but rather the questions are starting points that might lead to interesting discoveries.
- Make a list of all kernel-managed objects that can be referenced
using a file descriptor, and the actions that can be effected through
that file descriptor. Make another list of actions or objects which do
not use a file descriptor. Explain how one such action or object
could benefit by being included in a fuller exploitation of file
descriptors.
- Identify three distinct namespaces in Unix or Linux that are not
primarily accessed through the "single namespace". For each,
identify one benefit that could be realized by incorporating the
namespace into the single namespace.
- Identify an area of the IP protocol suite where "full exploitation"
has resulted in significant simplicity, or otherwise been of benefit.
- Identify a design element that was fully exploited in the NFSv2 protocol. Compare and contrast this with NFSv3 and NFSv4.
Next article
Ghosts of Unix past, part 2: Conflated designs
Security
Two glibc vulnerabilities
Tavis Ormandy has been busy of late, poking around in the guts of GNU libc. Out of that have come two separate local privilege escalations that exploit an obscure corner (the dynamic linker auditing API) of glibc, while the exploits themselves use—abuse—some Linux features that many probably aren't aware of. These vulnerabilities and exploits provide good examples of the way that security researchers look at code and systems—a way of looking that more developers would do well to emulate.
The runtime library auditing API is a way for developers to intercept the actions of the dynamic linker to see the steps that it is taking while searching for .so files and resolving symbols from them. When a program is executed with the LD_AUDIT environment variable pointing to one or more shared libraries, the linker will make callbacks into functions in those libraries for various events that happen in the linking process. There are various events specified in the rtld-audit man page, including searching for an object, opening an object, binding to a symbol, and so on. It seems like a useful facility, but one that is likely not in the toolbox of many Linux developers.
The simpler of the two problems that Ormandy found was that setuid programs will open whatever arbitrary library a user specifies in LD_AUDIT, as long as that library lives on the trusted library path. The more well-known LD_PRELOAD environment variable, which preloads the specified libraries before the linker searches for others, is specifically prohibited from operating on setuid programs unless the library is on the trusted path and has the setuid bit set. Exploiting ping (or some other setuid program) with LD_PRELOAD would be trivial—a user-provided library could remap any call ping made to anything the attacker wanted—so it was an obvious restriction. LD_AUDIT using non-setuid libraries was evidently not so obvious.
The problem with allowing user-provided libraries to be used for auditing setuid programs is not anywhere in the auditing API, but is instead inherent in the way the runtime linker processes libraries. When the library is opened with dlopen() to determine whether the auditing callback symbols are present, any library initialization routines must be run. So, an exploit is done by finding a vulnerable system library (it must be on the trusted path) that was not written with setuid execution in mind (and thus does not have that bit set in the filesystem).
In his description of the flaw, Ormandy gives an example of using the libpcprofile.so library, which writes an output file to the path specified by the PCPROFILE_OUTPUT environment variable. Using ping for its setuid nature, he sets LD_AUDIT to the library, points PCPROFILE_OUTPUT where he wants, and ping ends up putting a user-writable file in /etc/cron.d. The details will vary depending on the distribution, but most will be vulnerable to the flaw. There is nothing particularly special about libpcprofile.so, as Ormandy describes ways to find other vulnerable system libraries, which are likely to be numerous—those libraries weren't meant to be used by privileged programs.
The other vulnerability is more difficult to exploit, but stems from a similar laxness in LD_AUDIT handling. In the Linux executable file format, ELF, library search paths can be specified in the executable itself using DT_RPATH or DT_RUNPATH tags. Those tags can contain a $ORIGIN value, which is replaced with the location of the executable in the filesystem. That way, a library used by a single executable can be located in a program-specific location rather than in the system library directories.
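The substitution itself is simple string replacement. The following is a minimal sketch of the idea, not glibc's actual code: `expand_origin()` is a hypothetical helper that replaces each `$ORIGIN` in a search-path entry with a directory supplied by the caller, and it assumes the caller has already worked out where the executable lives.

```c
#include <string.h>

/*
 * Illustrative only: replace each "$ORIGIN" in 'path' with 'origin_dir'.
 * Returns 0 on success, -1 if the result does not fit in 'out'.
 */
static int expand_origin(const char *path, const char *origin_dir,
                         char *out, size_t outlen)
{
    static const char token[] = "$ORIGIN";
    const size_t toklen = sizeof(token) - 1;
    const size_t dirlen = strlen(origin_dir);
    size_t used = 0;

    if (outlen == 0)
        return -1;

    while (*path) {
        if (strncmp(path, token, toklen) == 0) {
            /* Leave room for the terminating NUL. */
            if (used + dirlen >= outlen)
                return -1;
            memcpy(out + used, origin_dir, dirlen);
            used += dirlen;
            path += toklen;
        } else {
            if (used + 1 >= outlen)
                return -1;
            out[used++] = *path++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

For an executable found in /opt/app/bin, the entry `$ORIGIN/lib` would expand to /opt/app/bin/lib. Note that the result is only a path name; nothing ties it to whatever occupied that location when the expansion was done.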
The ELF specification recommends that $ORIGIN be disallowed for setuid executables, but glibc ignores that recommendation. Ormandy doesn't really see a problem with that:
Unfortunately, the $ORIGIN substitution code was reused in the LD_AUDIT path. There was seemingly an attempt to restrict the use of $ORIGIN in LD_AUDIT for privileged programs, but it was insufficient. $ORIGIN will be expanded if it is the only entry in LD_AUDIT. Since $ORIGIN expands to the directory that contained the program, it isn't necessarily obvious that there is anything there to exploit. But, there are known ways to exploit this kind of situation.
If the directory that contains the executable can be replaced with an exploit library object between the time $ORIGIN is expanded and when the value is used, the library will be loaded and the attacker can do what they like. It is essentially a race condition, but one that can be reliably won by the attacker. Ormandy's example basically pauses the execution of a ping that has been hardlinked into an attacker-controlled directory after the expansion of $ORIGIN has been done. He then removes the directory and its contents, and puts a library that has exploit code in its initialization function in the place of the directory.
That particular exploit mechanism is fairly modern, using relatively recent Linux kernel features, but there are others. Ormandy describes several other ways to exploit the flaw, with differing requirements (e.g. a C compiler or winning an easily winnable race) that might serve different attack strategies. While both are local privilege escalations, they very well might be used in conjunction with a web application or other flaw to turn them into a remote root vulnerability.
Both of these vulnerabilities are quite serious for systems that allow untrusted users to log in. Their impact on other systems depends on whether there are other vulnerable, network-facing programs. While it is a bit ironic that it was an audit of LD_AUDIT behavior that found these bugs, it seems clear that there isn't enough of that kind of auditing being done for Linux systems. It's always a bit worrisome to think of how many of these kinds of flaws are still lingering out there.
Brief items
A Firefox zero-day vulnerability
The Mozilla Security Blog warns of a new Firefox vulnerability which is already being exploited. "Users who visited an infected site could have been affected by the malware through the vulnerability. The trojan was initially reported as live on the Nobel Peace Prize site, and that specific site is now being blocked by Firefox's built-in malware protection. However, the exploit code could still be live on other websites." Disabling JavaScript (or running NoScript) will block exploit attempts.
New vulnerabilities
festival: code execution
| Package(s): | festival | CVE #(s): | CVE-2010-3996 |
| Created: | October 22, 2010 | Updated: | December 9, 2013 |
| Description: | From the openSUSE advisory: festival_server uses an unsafe LD_LIBRARY_PATH. Local users could exploit that to execute code as another user if that user runs festival_server. |
| Alerts: | |
glibc: privilege escalation
| Package(s): | glibc | CVE #(s): | CVE-2010-3847 |
| Created: | October 21, 2010 | Updated: | April 15, 2011 |
| Description: | From the Red Hat advisory: It was discovered that the glibc dynamic linker/loader did not handle the $ORIGIN dynamic string token set in the LD_AUDIT environment variable securely. A local attacker with write access to a file system containing setuid or setgid binaries could use this flaw to escalate their privileges. (CVE-2010-3847) For a detailed look, see Tavis Ormandy's report. |
| Alerts: | |
glibc: privilege escalation
| Package(s): | glibc | CVE #(s): | CVE-2010-3856 |
| Created: | October 22, 2010 | Updated: | January 12, 2011 |
| Description: | From the Debian advisory: Ben Hawkes and Tavis Ormandy discovered that the dynamic loader in GNU libc allows local users to gain root privileges using a crafted LD_AUDIT environment variable. |
| Alerts: | |
libsmi: arbitrary code execution
| Package(s): | libsmi | CVE #(s): | CVE-2010-2891 |
| Created: | October 22, 2010 | Updated: | December 16, 2013 |
| Description: | From the Mandriva advisory: A buffer overflow was discovered in libsmi when long OID was given in numerical form. This could lead to arbitrary code execution. |
| Alerts: | |
pidgin: denial of service
| Package(s): | pidgin | CVE #(s): | CVE-2010-3711 |
| Created: | October 21, 2010 | Updated: | March 14, 2011 |
| Description: | From the Red Hat advisory: Multiple NULL pointer dereference flaws were found in the way Pidgin handled Base64 decoding. A remote attacker could use these flaws to crash Pidgin if the target Pidgin user was using the Yahoo! Messenger Protocol, MSN, MySpace, or Extensible Messaging and Presence Protocol (XMPP) protocol plug-ins, or using the Microsoft NT LAN Manager (NTLM) protocol for authentication. (CVE-2010-3711) |
| Alerts: | |
tuxguitar: code execution
| Package(s): | tuxguitar | CVE #(s): | CVE-2010-3385 |
| Created: | October 21, 2010 | Updated: | October 27, 2010 |
| Description: | From the Red Hat bugzilla entry: Raphael Geissert conducted a review of various packages in Debian and found that tuxguitar contained a script that could be abused by an attacker to execute arbitrary code [1]. The vulnerability is due to an insecure change to LD_LIBRARY_PATH, an environment variable used by ld.so(8) to look for libraries in directories other than the standard paths. When there is an empty item in the colon-separated list of directories in LD_LIBRARY_PATH, ld.so(8) treats it as a '.' (current working directory). If the given script is executed from a directory where a local attacker could write files, there is a chance for exploitation. |
| Alerts: | |
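The empty-item problem described in the tuxguitar entry typically arises when a script prepends a directory to an unset variable, leaving a trailing colon. A rough sketch of a check for that condition follows; `has_empty_item()` is a hypothetical helper, and treating a completely empty value as harmless is an assumption of this example rather than a statement about ld.so.

```c
#include <string.h>

/*
 * Return 1 if the colon-separated list contains an empty item
 * (a leading ':', a trailing ':', or "::"), otherwise 0.
 * Assumption: a completely empty string counts as "no items".
 */
static int has_empty_item(const char *list)
{
    size_t len = strlen(list);

    if (len == 0)
        return 0;
    if (list[0] == ':' || list[len - 1] == ':')
        return 1;
    return strstr(list, "::") != NULL;
}
```

A value such as "/usr/lib/tuxguitar:" would be flagged, while "/usr/lib/tuxguitar" would not.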
Page editor: Jake Edge
Kernel development
Brief items
Kernel release status
The 2.6.37 merge window is open as of this writing, so there is no current development kernel prepatch. The merge window can be expected to close right around the end of the month. See the article below for a summary of activity in this merge window so far.
Stable updates: there have been no stable updates in the last week. The 2.6.27.55, 2.6.32.25, and 2.6.35.8 updates are currently in the review process and may be released at any time.
Quotes of the week
Linus warns of a short merge window
Linus has sent out a notice that the 2.6.37 merge window will indeed be shorter than usual; it will probably conclude on October 30 or 31, just in time for the 2010 Kernel Summit. "And so far, in the five days since the 2.6.36 release, we've merged 5500+ commits. That has turned my "maybe we can do a shorter merge window" into a 'we can definitely do a shorter merge window'. Because we already have enough changes, and there's almost a week to go - so I think we're well on track for doing that."
Clang builds a working 2.6.36 Kernel
Bryce Lelbach has announced that he has managed to build and boot a (mostly) working kernel using the LLVM-based Clang compiler. It seems that there are a lot of problems remaining, though, and he had to use a couple of GCC-compiled pieces to get the system to boot. "SELinux, Posix ACLs, IPSec, eCrypt, anything that uses the crypto API - None of these will compile, due to either an ICE or variable-length arrays in structures (don't remember which, it's in my notes somewhere). If it's variable-length arrays or another intentionally unsupported GNUtension, I'm hoping it's just used in some isolated implementation detail (or details), and not a fundamental part of the crypto API (honestly just haven't had a chance to dive into the crypto source yet)."
Running work in hardware interrupt context
As a general rule, kernel developers work to avoid running code in hardware interrupt context; there is a whole array of mechanisms by which interrupt-driven work can be deferred to less pressing times. Apparently, however, there is an occasional need to run arbitrary code in the hardware interrupt context - and there is no hardware conveniently signaling interrupts at the time. To enable the running of code in hardware interrupt context, a new API has been added to 2.6.37.
The first step is to fill in an irq_work structure:
    #include <linux/irq_work.h>

    struct irq_work my_work;
    void init_irq_work(struct irq_work *entry, void (*func)(struct irq_work *entry));
There is then a fairly familiar pair of functions for running the work indicated by this structure:
bool irq_work_queue(struct irq_work *entry);
void irq_work_sync(struct irq_work *entry);
The intended area of use is apparently code running from non-maskable interrupts which needs to be able to interact with the rest of the system. One should assume that just about any other use of this feature is likely to be scrutinized closely.
Jump label
The kernel is filled with tests whose results almost never change. A classic example is tracepoints, which will be disabled on running systems with only very rare exceptions. There has long been interest in optimizing the tests done in such places; with 2.6.37, the "jump label" feature will make those tests go away entirely.
Consider the definition of a typical tracepoint, which, behind all of the preprocessor madness, looks something like:
    static inline void trace_foo(args)
    {
        if (unlikely(trace_foo_enabled))
            goto do_trace;
        return;
    do_trace:
        /* Actually do tracing stuff */
    }
The cost of a test for a single tracepoint is essentially zero. The number of tracepoints in the kernel is growing, though, and each one adds a new test. Each test must fetch a value from memory, adding to the pressure on the cache and hurting performance. Given that the value almost never changes, it would be nice to find a way to optimize the "tracepoint disabled" case.
In 2.6.37, this tracepoint can be rewritten using a new macro:
    #include <linux/jump_label.h>

    #define JUMP_LABEL(key, label)  \
        if (unlikely(*key))         \
            goto label;
The nice thing is that JUMP_LABEL() does not have to be implemented like that. It can, instead, (1) note the location of the test and the key value in a special table, and (2) simply insert a no-op instruction. That reduces the cost of the test (and the tracepoint) to zero for the common "not enabled" case. Most of the time, the tracepoint will never be enabled and the omitted test will never be missed.
The tricky part happens when somebody wants to enable the tracepoint. Changing its status now requires calling one of a pair of special functions:
void enable_jump_label(void *key);
void disable_jump_label(void *key);
A call to enable_jump_label() will look up the key in the jump label table, then replace the special no-op instructions with the assembly equivalent of "goto label", enabling the tracepoint. Disabling the jump label will cause the no-op instruction to be restored.
The end result is a significant reduction in the overhead of disabled tracepoints. This feature only works on architectures which support it (x86 only, at the moment) and only with relatively recent versions of GCC; otherwise the preprocessor version is used.
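To make the rewritten tracepoint concrete, here is a user-space sketch built on the plain preprocessor form of JUMP_LABEL() given above. It does no code patching at all, so it shows only the shape of the calling code, not the optimization. The `unlikely()` definition is a stand-in using the GCC/Clang `__builtin_expect` builtin, and `trace_foo_hits` is a hypothetical counter taking the place of real tracing work.

```c
/* User-space stand-in for the kernel's unlikely() hint (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Fallback form of the macro, as given in the article. */
#define JUMP_LABEL(key, label)  \
    if (unlikely(*key))         \
        goto label;

static int trace_foo_enabled;   /* the "key" tested by the tracepoint */
static int trace_foo_hits;      /* hypothetical: counts calls that traced */

static inline void trace_foo(void)
{
    JUMP_LABEL(&trace_foo_enabled, do_trace);
    return;
do_trace:
    trace_foo_hits++;           /* actually do tracing stuff */
}
```

With the key left at zero, calls to trace_foo() fall straight through to the return; setting the key to a nonzero value makes subsequent calls take the do_trace branch. In the kernel's patched implementation that switch would be made through enable_jump_label() rather than by writing the variable directly.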
Kernel development news
2.6.37 merge window, part 1
The 2.6.36 kernel was released on October 20, and the 2.6.37 merge window duly started shortly thereafter. As of this writing, some 6450 changes have been merged for the next development cycle, with more surely to come. Some of the more significant, user-visible changes merged for 2.6.37 include:
- The first parts of the inode scalability patch set have been merged,
but, as of this writing, the core locking changes have not yet been
pushed for inclusion. See this
article for more information on the inode scalability work.
- The x86 architecture now uses separate stacks for interrupt handling
when 8K stacks are in use. The option to use 4K stacks has been
removed.
- The big kernel lock removal process continues; the core kernel is
almost entirely BKL-free. There is now a configuration option which
may be used to build a kernel without the BKL. File locking still
requires the BKL, though; schemes are afoot to fix it before the
close of the merge window, but this work is not yet complete. If file
locking can be cleaned up, it will be possible for many (or most)
users to run a BKL-free 2.6.37 kernel.
- The "rados block device" has been added. RBD allows the creation
of a special block device which is backed by objects stored in the
Ceph distributed system.
- The GFS2 cluster filesystem is no longer marked "experimental." GFS2
has also gained support for the fallocate() system call.
- A new sysfs file, /sys/selinux/status, allows a user-space
application to quickly notice when security policies have changed.
The intended use is evidently daemons which cache the results of
access-control decisions and need to know when those results might
change. A separate file, called policy, has been added for
those simply wanting to read the current policy from the kernel.
- The scheduler now works harder to avoid migrating high-priority
realtime tasks. The
scheduler also will no longer charge processor time used to handle
interrupts to the process which happened to be running at the time.
- VMware's VMI paravirtualization support has been deprecated
by the company and, as scheduled, removed from the 2.6.37 kernel.
- Some hibernation improvements have been merged, including the ability
to compress the hibernation image with LZO.
- The ARM architecture has gained support for the seccomp (secure computing)
feature.
- The block layer can now throttle I/O bandwidth to specific devices,
controlled by the cgroup mechanism. This is the second piece of the
I/O bandwidth controller puzzle which allows the establishment of
specific bandwidth limits which will be enforced even if more I/O
bandwidth is available.
- The new "ttyprintk" device allows suitably-privileged user space to
feed messages through the kernel by way of a pseudo TTY device.
- The kernel has gained support for the point-to-point tunneling
protocol (PPTP); see the
accel-pptp project page for more information.
- The NFS client has a new "idmapper" implementation for the translation between user and group names and IDs. The new code is more flexible and performs better; see Documentation/filesystems/nfs/idmapper.txt for details.
- There is a new -olocal_lock= mount option for the NFS client
which can cause it to treat either (or both) of flock() and
POSIX locks as local.
- Most of the functions of the nfsservctl() system call have
been deprecated and marked for removal in 2.6.40. There is a new
configuration option for those who would like to remove this
functionality ahead of time.
- Simple support for the pNFS protocol has been merged.
- Huge pages can now be migrated between nodes like normal memory pages.
- There is the usual pile of new drivers:
- Systems and processors: Flexibility Connect boards,
Telechips TCC ARM926-based systems,
Telechips TCC8000-SDK development kits,
Vista Silicon Visstrim_m10 i.MX27-based boards,
LaCie d2 Network v2 NAS boards,
Qualcomm MSM8x60 RUMI3 emulators,
Qualcomm MSM8x60 SURF eval boards,
Eukrea CPUIMX51SD modules,
Freescale MPC8308 P1M boards,
APM APM821xx evaluation boards,
Ito SH-2007 reference boards,
IBM "SMI-free" realtime BIOS's,
MityDSP-L138 and MityDSP-1808 systems,
OMAP3 Logic 3530 LV SOM boards,
OMAP3 IGEP modules, and
taskit Stamp9G20 CPU modules.
- Block: Chelsio T4 iSCSI offload engines.
- Input: Roccat Pyra gaming mice,
UC-Logic WP4030U, WP5540U and WP8060U tablets,
several varieties of Waltop tablets,
OMAP4 keyboard controllers,
NXP Semiconductor LPC32XX touchscreen controllers,
Hanwang Art Master III tablets,
ST-Ericsson Nomadik SKE keyboards,
ROHM BU21013 touch panel controllers, and
TI TNETV107X touchscreens.
- Miscellaneous: Freescale eSPI controllers,
Topcliff platform controller hub devices,
OMAP AES crypto accelerators,
NXP PCA9541 I2C master selectors,
Intel Clarksboro memory controller hubs,
OMAP 2-4 onboard serial ports,
GPIO-controlled fans,
Linear Technology LTC4261 Negative Voltage Hot Swap Controller
I2C interfaces,
TI BQ20Z75 gas gauge ICs,
OMAP TWL4030 BCI chargers,
ROHM BH1770GLC and OSRAM SFH7770 combined ALS and proximity sensors,
Avago APDS990X combined ALS and proximity sensors,
Intersil ISL29020 ambient light sensors, and
Medfield Avago APDS9802 ALS sensor modules.
- Network: Brocade 1010/1020 10Gb Ethernet cards,
Conexant CX82310 USB ethernet ports,
Atheros AR9170 "otus" 802.11n USB devices, and
Topcliff PCH Gigabit Ethernet controllers.
- Sound: Marvell 88pm860x codecs,
TI WL1273 FM radio codecs,
HP iPAQ RX1950 audio devices,
Native Instruments Traktor Kontrol S4 audio devices,
Aztech Sound Galaxy AZT1605 and AZT2316 ISA sound cards,
Wolfson Micro WM8985 and WM8962 codecs,
Wolfson Micro WM8804 S/PDIF transceivers,
Samsung S/PDIF controllers, and
Cirrus Logic EP93xx AC97 controllers.
- USB: Intel Langwell USB OTG transceivers, YUREX "leg shake" sensors, and USB-attached SCSI devices.
- The old ieee1394 stack has been removed, replaced at last by the "firewire" drivers.
Changes visible to kernel developers include:
- The jump label
optimization mechanism has been merged; its initial purpose is to
reduce the overhead of inactive tracepoints.
- Yet another RCU variant has been added: "tiny preempt RCU" is meant
for uniprocessor systems. "This implementation uses but a single blocked-tasks list rather than the combinatorial number used per leaf rcu_node by TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies processing. This version also takes advantage of uniprocessor execution to accelerate grace periods in the case where there are no readers."
- New tracepoints have been added in the network device layer, places
where sk_buff structures are freed,
softirq_raise(), workqueue operations, and
memory management LRU list shrinking operations.
There is also a new script for using perf to analyze network device
events.
- The wakeup latency tracer now has function graph support.
- There is a new mechanism for running
arbitrary code in hardware interrupt context.
- The power management layer now has a formal concept of "wakeup
sources" which can bring the system out of a sleep state. Among other
things, it can collect statistics to help the user determine what is
keeping a system awake. Wakeup events can abort the freezing of
tasks, reducing the time required to recover from an aborted suspend
or hibernate operation.
- A new mechanism for managing the automatic suspending of idle devices
has been added.
- There is a new set of functions for managing the "operating
performance points" of system-on-chip components. (commit).
- A long list of changes to the memblock (formerly LMB) low-level
management code has been merged, and the x86 architecture now uses
memblock for its early memory management.
- The default handling for lseek() has changed: if a driver
does not provide its own llseek() function, the VFS layer
will cause all attempts to change the file position to fail with an
ESPIPE error. All in-tree drivers which lacked
llseek() functions have been changed to use
noop_llseek(), which preserves the previous behavior.
- There is a new way to create workqueues:
    struct workqueue_struct *alloc_ordered_workqueue(const char *name, unsigned int flags);
Items submitted to the resulting workqueue will be run in order, one at a time. It's meant to eventually replace the old singlethreaded workqueues.
Also added is:
    bool flush_work_sync(struct work_struct *work);
This function will wait until a specific work item has completed.
- The ALSA ASoC API has been significantly extended to support sound
cards with multiple codecs and DMA controllers. (commit).
- The stack-based
kmap_atomic() patch has been merged, with an associated
API change. See the new Documentation/vm/highmem.txt file for
details.
- There are two new memory allocation helpers:
    void *vzalloc(unsigned long size);
    void *vzalloc_node(unsigned long size, int node);
Both behave like the equivalent vmalloc() calls, but they also zero the allocated memory.
- Most of the work needed to remove the concept of hard barriers from the block layer has been merged. This task will probably be completed before the closing of the merge window.
Linus has let it be known that he expects this merge window to be shorter than usual so that it can be closed before the 2010 Kernel Summit begins on November 1. Expect patches to be merged at a high rate until the end of October; an update next week will cover the changes merged in the last part of the 2.6.37 merge window.
Resolving the inode scalability discussion
Nick Piggin's VFS scalability patch set has been under development for well over one year. Linus was ready to pull this work during the 2.6.36 merge window, but Nick asked for more time for things to settle out; as a result, only some of the simpler parts were merged then. Last week, we mentioned that some developers became concerned when it started to become clear that the remaining work would not be ready for 2.6.37 either. Out of that concern came a competing version of the patch set (by Dave Chinner) and a big fight. This discussion was of the relatively deep and intimidating variety, but your editor, never afraid to make a total fool of himself, will attempt to clarify the core disagreements and a possible path forward anyway.
The global inode_lock is used within the virtual filesystem layer (VFS) to protect several data structures and a wide variety of inode-oriented operations. As a global lock, it has become an increasingly annoying bottleneck as the number of CPUs and threads in systems increases; it clearly needs to be broken up in a way which makes it more scalable. Unfortunately, like a number of old locks in the VFS, the boundaries of what's protected by inode_lock are not always entirely clear, so any attempts to change locking in that area must be done with a great deal of caution. That is why improving inode locking scalability has been such a slow affair.
Getting rid of inode_lock requires putting some other locking in place for everything that inode_lock protects. Nick's patch set creates separate global locks for some of those resources: wb_inode_list_lock for the list of inodes under writeback, and inode_lru_lock for the list of inodes in the cache. The standalone inodes_stat statistics structure is converted over to atomic types. Then the existing i_lock per-inode spinlock is used to cover everything else in the inode structure; once that is done, inode_lock can be removed. The remainder of the patch set (more than half of the total) is then dedicated to reducing the coverage of i_lock, often by using read-copy-update (RCU) instead.
Before any of that, though, Nick's patch set changed the way the core memory management "shrinker" code works. Shrinkers are callbacks which can be invoked by the core when memory is tight; their job is then to reduce the amount of memory used by a specific data structure. The inode and dentry caches can take up quite a bit of memory, so they both have shrinkers which will free up (hopefully) unneeded cache entries when the memory is needed elsewhere. Nick changed the shrinker API to cause it to target specific memory zones; that allows the core to balance free memory across memory types and across NUMA nodes.
The per-zone shrinkers were one of the early flash points in this debate. Dave Chinner and others on the VFS side of the house worried that invoking shrinkers in such a fine-grained way would increase contention at the filesystem level and make it harder to shrink the caches in an efficient way. They also thought that this change was orthogonal to the core goal of eliminating the scalability problems caused by the global inode_lock. Nick fought hard for per-zone shrinkers, and he clearly believes that they are necessary, but he has also dropped them from his patch set for now in an attempt to push things forward.
The next disagreement has to do with the coverage of i_lock; Dave Chinner's alternative patch set avoids using i_lock to cover most of the inode structure. Instead, Dave introduces other locks from the outset, reaching a point where he has relatively fine-grained lock coverage by the time inode_lock is removed at the end of his series. Compared to this approach, Nick's patches have been criticized as being messy and not as scalable.
Nick's response is that the "width" of i_lock is a detail which can be resolved later. His intent was to do the minimal amount of work required to allow the removal of inode_lock, without going straight for the ultimate scalable solution. The goal was to be able to ensure that the locking remains correct by changing as little as possible before the removal of the global lock; that way, hopefully, there are fewer chances of breaking things. Beyond that, any bugs which do slip through before the patch removing inode_lock will almost certainly not reveal themselves until after that removal. That means that anybody trying to use bisection to find a bug will end up at the inode_lock removal patch instead of the real culprit. Thus, minimizing the number of changes before that removal should make debugging easier.
That is why Nick removes inode_lock before the middle of his patch series, while Dave's series does that removal near the end. Both patch sets include a number of the same changes - putting per-bucket locks onto the inode hash table, for example - but Nick does it after removing inode_lock, while Dave does it before. There are also differences, with Nick heading deep into RCU territory while Dave avoids using RCU. Both developers claim to be aiming for similar end results; they just take different roads to get there.
Finally, there is also a deep disagreement over the locking of the inode cache itself. In current kernels, the cache data structure (the LRU and writeback lists, essentially) is covered by inode_lock with the rest. Both patch sets create separate locks for the LRU and for writeback. The problem is with lock ordering; one of the hardest problems in the VFS is ensuring that all locks are taken in the proper order so that the system will not deadlock. Nick's patches require the VFS to acquire i_lock for the inode(s) of interest prior to acquiring the writeback or LRU locks; Dave, instead, wants i_lock to be the innermost lock.
The problem is that it is not always possible to acquire the locks in the specified order. Code which is working through the LRU list, for example, must have that list locked; if it then decides to operate on an inode found in the LRU list, it must lock the inode. But that violates Nick's locking order. To make things work correctly, Nick uses spin_trylock() in such situations to avoid hanging. Uses of spin_trylock() tend to attract scrutiny, and that is the case here; Dave has described the code as "a large mess of trylock operations" which he has gone out of his way to avoid. Nick responds that the code is not that bad, and that Dave's approach brings locking complexities of its own.
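The trylock pattern is easier to see in miniature. What follows is a user-space sketch only, with Python's non-blocking Lock.acquire() standing in for spin_trylock(); the Inode class, the list contents, and the skip-on-failure policy are assumptions made for illustration, not a description of what either patch set actually does.

```python
import threading


class Inode:
    def __init__(self, ino):
        self.ino = ino
        self.i_lock = threading.Lock()   # stands in for the per-inode i_lock


lru_lock = threading.Lock()              # stands in for the LRU list lock
lru_list = [Inode(n) for n in range(4)]


def prune_lru():
    """Walk the LRU with the list lock held, taking each i_lock out of order.

    The assumed order is i_lock first, then lru_lock. This walker already
    holds lru_lock, so blocking on i_lock could deadlock against a thread
    that holds i_lock and is waiting for lru_lock. A non-blocking attempt
    (the trylock) avoids that: on failure the inode is simply skipped.
    """
    pruned, skipped = [], []
    with lru_lock:
        for inode in list(lru_list):
            if not inode.i_lock.acquire(blocking=False):   # trylock
                skipped.append(inode.ino)
                continue
            try:
                lru_list.remove(inode)
                pruned.append(inode.ino)
            finally:
                inode.i_lock.release()
    return pruned, skipped


# Simulate another code path that currently holds inode 2's lock.
busy = lru_list[2]
busy.i_lock.acquire()
pruned, skipped = prune_lru()
busy.i_lock.release()
```

Inodes 0, 1, and 3 are pruned while inode 2 is skipped and left on the list. The cost of the pattern is also visible here: every caller needs a fallback path for the case where the lock cannot be taken, which is the sort of thing that draws scrutiny.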
This is about where Al Viro jumped in, calling both approaches wrong. Al would like to see the writeback locks taken prior to i_lock (because code tends to work from the list first, prior to attacking individual inodes), but he says the LRU lock should be taken after i_lock because code changing the LRU status of an inode will normally already have that inode's lock. According to Al, Nick is overly concerned with the management of the various inode lists and, as a result, "overengineering" the code. After some discussion, Dave eventually agreed with something close to Al's view and acknowledged that Nick's placement of the LRU lock below i_lock was correct, eliminating that point of contention.
Al has also described the way he would like things to proceed; this is a good thing. When it comes to VFS locking, few are willing to challenge his point of view; that means that he can probably bring about a resolution to this particular dispute. He wants a patch series which starts with the split of the writeback and LRU lists, then proceeds by pulling things out from under inode_lock one at a time. He is apparently pulling together a tree based on both Nick's and Dave's work, but with things done in the order he likes. The end result will probably be credited to Nick, who figured out how to solve a long list of difficult problems around inode_lock, but it will differ significantly from what he initially proposed.
What is not at all clear, though, is how much of this will come together for the 2.6.37 merge window. Al has a long history of last-second pull requests full of hairy changes; Linus tends to let him get away with it. But this would be very last minute, and the changes are deep, so, while Al has pushed some of the initial changes, the core locking work may not be ready in time for 2.6.37. Either way, once inode scalability has been taken care of, discussion can begin on the removal of dcache_lock, which is a rather more complex problem than inode_lock; that should be interesting to watch.
Linux at NASDAQ OMX
One tends to think of "the NASDAQ" as a single exchange based in the US, but, in fact, NASDAQ OMX operates exchanges all over the world - and they run on Linux. In the US for instance, that includes markets like the NASDAQ Stock Market, The NASDAQ Options Market, and NASDAQ OMX PSX, its newest market that launched on October 8. At a brief presentation at the Linux Foundation's invitation-only End User Summit in Jersey City, NASDAQ OMX vice president Bob Evans talked about the ups and downs of using Linux in a seriously mission-critical environment.
NASDAQ OMX's exchanges run on thousands of Linux-based servers. These servers handle realtime transaction processing, monitoring, and development as well. The big challenge in this environment, of course, is performance; real money depends on whether the exchange can keep up with the order stream. Latency matters as much as throughput, though; orders must be responded to (and executed) within a bounded period of time. Needless to say, reliability is also crucially important; down time is not well received, to say the least.
To meet these requirements, NASDAQ OMX runs large clusters of thousands of machines. These clusters can process hundreds of millions of orders per day - up to one million orders per second - with 250µs latency.
According to Bob, Linux has incorporated some useful technologies in recent years. The NAPI interrupt mitigation technique for network drivers has, on its own, freed up about 1/3 of the available CPU time for other work. The epoll system call cuts out much of the per-call overhead, taking 33µs off of the latency in one benchmark. Handling clock_gettime() in user space via the VDSO page cuts almost another 60ns. Bob was also quite pleased with how the Linux page cache works; it is effective enough, he says, to eliminate the need to use asynchronous I/O, simplifying the code considerably.
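For readers who have not used these interfaces, here is a minimal sketch of a readiness-driven receive path with a timestamp taken via clock_gettime(). It uses Python's selectors module, whose DefaultSelector is backed by epoll on Linux, and a local socket pair in place of a real network connection; the order string is invented, and nothing here reflects NASDAQ OMX's actual code.

```python
import selectors
import socket
import time

# DefaultSelector picks the most efficient mechanism available;
# on Linux that is epoll.
sel = selectors.DefaultSelector()

# A connected pair of sockets stands in for a network connection.
order_in, order_out = socket.socketpair()
order_in.setblocking(False)
sel.register(order_in, selectors.EVENT_READ)

order_out.send(b"BUY 100 XYZ")   # hypothetical order message
sent_at = time.clock_gettime(time.CLOCK_MONOTONIC)

received = []
# A single wait reports every descriptor that is ready; the program does
# not have to probe each socket in turn.
for key, _events in sel.select(timeout=1.0):
    received.append(key.fileobj.recv(4096))
latency = time.clock_gettime(time.CLOCK_MONOTONIC) - sent_at

sel.unregister(order_in)
order_in.close()
order_out.close()
sel.close()
```

The descriptor is registered once and then waited on repeatedly, which is where the savings over rebuilding a descriptor set for every call come from. The measured interval in a sketch like this is dominated by interpreter overhead, so it says nothing about the figures quoted in the talk.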
On the other hand, there are some things which have not worked out as well for them. These include I/O signals; they are complex to program with and, if things get busy, the signal queue can overflow. The user-space libaio asynchronous I/O (AIO) implementation is thread-based; it scales poorly, he says, and does not integrate well with epoll. Kernel-based asynchronous I/O, instead, lacks proper socket support. He also mentioned the recvmsg() system call, which requires a call into the kernel for every incoming packet.
There is some new stuff coming along which shows some promise. The new recvmmsg() system call can receive multiple packets with a single call. For now, though, it is just a wrapper around the internal recvmsg() implementation and does not hold the socket lock across the entire operation. But, he said, recvmmsg() is a good example of how the ability to add new APIs to Linux is a good thing. He also likes the combination of kernel-based AIO and the eventfd() system call; that makes it possible to integrate file-based AIO into an application's normal event-processing loop. There is also some potential in syslets, which he sees as a way of delivering cheap notifications to user space; it's not clear whether syslets will scale usefully, though.
What NASDAQ OMX would really like to see in Linux now is good socket-based AIO. That would make it possible to replace epoll/recvmsg/sendmsg sequences with fewer system calls. Even better would be if the kernel could provide notifications for multiple events at a time. Best would be if the interface to this functionality were completely based on sockets. He described a vision of an "epoll-like kernel object" which would handle in-kernel network traffic processing. The application could post asynchronous send and receive requests to the queue, and receive notifications when they have been executed. He would like to see multiple sockets attached to a single object, and a file descriptor suitable for passing to poll() for notifications. With a setup like that, it should be possible to push more network traffic through the kernel with lower latencies.
In summary, NASDAQ OMX seems to be happy with its use of Linux. They also seem to like to go with current software - the exchange is currently rolling out 2.6.35.3 kernels. "Emerging APIs" are helping operations like NASDAQ OMX realize real-world performance gains in areas that matter. Linux, Bob says, is one of the few systems that are willing to introduce new APIs just for performance reasons. That is an interesting point of view to contrast with Linus Torvalds's often-stated claim that nobody uses Linux-specific APIs; it seems that there are users, they just tend to be relatively well hidden.
Page editor: Jonathan Corbet
Distributions
openSUSE Conference 2010: the state of openSUSE
From the 20th to the 23rd of October 2010, the second international openSUSE conference took place in Nuremberg, Germany. With the motto "Collaboration across Borders", all users, contributors, and supporters of the openSUSE project and free software in general were invited to four days of learning, discussing, and hacking. More than 250 openSUSE enthusiasts came to Nuremberg, and it was an excellent opportunity to see how the openSUSE project is doing these days.
Get your ass up
Hendrik "Henne" Vogelsang, a founder and board member of the openSUSE project, gave the first keynote, "Get your ass up!". He kicked off his presentation with a question to the audience about how old they thought SUSE was. People tend to forget that SUSE is one of the oldest distributions: it's already 18 years old. Compare that with Debian which is 17 or Red Hat/Fedora which is 16. People from all over the world have been using SUSE since they were very young, and it has a large user base. But at the same time, the community is very young: the openSUSE project was founded only 5 years ago, and Henne explained that it has only become a real open source project very recently: "For the first 3 years we really struggled in the transition from a company-made product to an open source project. Only when Factory was opened in 2009, openSUSE became really open."
This means that openSUSE is in a unique position: it is a very young project with a very old distribution and a very large user base. According to Henne, now is the time to take advantage of this position and make a difference: "We have no rules, so we have all freedom to start to do things today." And he immediately followed this with the advice to take responsibility:
Then Henne highlighted some examples of people who stuck their necks out and made a difference. Andrew Wafaa spearheaded a MeeGo version of openSUSE, Smeegol. This is not a Novell-initiated project (although there are some Novell people contributing as individuals), but completely done by volunteers. The openSUSE wiki is another example of the power of individuals: when the openSUSE project was started 5 years ago, one of the first things some volunteers did was to start a wiki. And a couple of months ago, the openSUSE wiki team launched a complete overhaul of the wiki with a new structure, theme, and search engine. Last but not least, Henne praised the strategy team and community manager Jos Poortvliet for investing their time.
He also stressed that you don't need to wait for an OK from everybody before you start with such an initiative: just make the difference, consensus is not needed. One of the things often stopping us from stepping up is the fear of duplication, he explained: "Why do we offer 8 desktop environments, why do we have Vim, Emacs and Gedit, why do we have KDE's Plasma Netbook and MeeGo for netbooks, and so on." According to Henne, there is nothing wrong with this: duplicated efforts are not a waste of time, because we can't all possibly want the same things. His advice is simple: don't let other people tell you to not do something because someone else already did it; we need diversity. He put it somewhat bluntly: "If you want to help, then help; but if you see someone doing something you are not interested in, just shut up and get out of the way."
Another thing that discourages people from stepping up is the fear of failure. But of course, if you always take a safe route and don't fail, then the project never really advances, never innovates. That's why it's so important to let each other fail and pick each other up after that. Henne's last piece of advice in his talk is a direct consequence of this approach: "Don't always think things through: you can't always have a 100% solution. Even if you have a small idea, try it out. Be playful, this is what open source is about." All in all, Henne's talk was a great reminder of the responsibility that each individual community member in an open source project bears.
openSUSE's strategy
OpenSUSE's community manager Jos Poortvliet presented an update about the strategy discussion we wrote about in June and in September. He started his talk with a remark: "When I joined Novell, I was glad that this discussion was going on, because I hadn't a really good idea of what openSUSE was either." He summarized that a strategy has two main goals: to help make decisions, and to help focus. Both goals are needed not only for technical matters, but also for marketing. For example, if the openSUSE marketing team decides to create a leaflet, the result should obviously depend on the target audience and the goals of the distribution. For instance, will you present the openSUSE Build Service and YaST in the leaflet? Probably not if you're targeting beginners. And will you make a default choice for a desktop environment? If you target beginners, you can choose for the users so they don't have to; but if you target power users, you pick a default desktop environment but allow choosing, or you include all necessary information for the users to make an informed choice themselves.
Jos also made it clear from the beginning that a strategy is not a vehicle to limit the community: he referred to Henne's message that you should not tell people not to do something, and he added that this holds even if it's something going against the strategy. And in a philosophical mood, he said "If you're not seeing yourself in the strategy document, the document has to change; not you!" In a free software community, people will always work on all sorts of things, but if you have a clear identity (not just "We are just another Linux distribution, and we're green") and clear goals, it's much easier to invite and attract other people.
The current openSUSE vision and strategy proposal is published on co-ment, a web-based document collaboration and annotation tool. The target user of openSUSE is described like this:
Jos gave some hypothetical examples: if you're an audio professional and need JACK and a realtime kernel, you are one of openSUSE's target users. And if you're a student new to Linux but wanting to learn and experiment, you're also a target user. The strategy also spells out what openSUSE offers:
The openSUSE project is built upon three pillars mentioned in the above quote: the community, the distribution, and the infrastructure. Therefore, the strategy proposal describes all three pillars. The community ("the heart of the openSUSE project") is described as collaborative and contributing improvements to upstream projects. Moreover, the community works closely with companies in its ecosystem that provide additional value, including support and enterprise offerings on top of or derived from openSUSE technology. Also, the barrier to becoming part of the openSUSE community should be lowered wherever possible. And last but not least, the proposal emphasizes that openSUSE aims to foster the development of free and open source software, but takes a pragmatic approach to what they ship to their users. Jos explained this as "We prefer to ship free software, but we'll not screw our users if they want audio or Flash support."
The openSUSE distribution is described as follows:
Jos added that one of the goals of the distribution is a good out-of-the-box experience based on sane defaults. The freedom of choice is exemplified in a wide software selection and compatibility with other operating systems, including Windows and Mac OS X. The last pillar of the project is the infrastructure:
This infrastructure part refers to one of the really strong points of the openSUSE distribution in recent times. The openSUSE Build Service makes it possible to make up to date packages available for multiple current releases, even for other distributions. Moreover, with the Kiwi build system and SUSE Studio, the project provides technology to easily build openSUSE derivatives in the form of live images, appliances, and even full distributions. Your author experienced that this is not just theory: when he wanted to create a Dutch version of the openSUSE 11.3 KDE4 live CD this week and his first attempts failed, he asked Jos who could help him and got an immediate response from openSUSE Boosters Will Stephenson and Stephan Kulow. After some configuration changes and two kiwi commands, the result was the desired Dutch live CD.
The strategy proposal also lists some things that openSUSE doesn't focus on. For example, it won't oversimplify the system to the point where configuring it becomes harder: "We prefer flexibility over an extreme focus on ease of use". OpenSUSE will also not aim at having the latest and greatest in shipped releases, nor will it provide feature upgrades for a shipped release. But flexibility also means that if you really want to install the latest packages, you can through the openSUSE Build Service. This way, you preserve the stability and integrity of the rest of your system.
After the presentation, there was plenty of time for questions, and questions there were. A valid criticism that was raised is that the wording of the current strategy proposal is too developer-centric. Jos agreed that this is the case and said that he wants to address this in a next version. Another person had the opinion that the strategy proposal is too boring and negative (he summarized it as "We don't want to be selfish like Ubuntu or unstable like Fedora"), to which Jos answered that a strategy document is indeed boring, but that it's needed to build upon and to create exciting marketing material.
Someone else remarked that this strategy doesn't seem actionable: "How will it change how we do openSUSE? What are the next steps?" According to Jos, this strategy is indeed not sufficient; some people need to step up and really move the community forward to its goals. Someone else proposed to split the document into a short one with a sexy high-level description of the strategy, and an implementation document that describes how the community will implement this strategy. The latter document can then talk about detailed things like openSUSE not doing feature upgrades for a shipped release.
openSUSE and Novell
Gerald Pfeifer, Director of Product Management at Novell's Open Platform Solutions Business Unit and thus responsible for all SUSE Linux Enterprise products and SUSE Studio, was the keynote speaker on Friday, with a talk entitled "openSUSE and Novell: an unlikely couple?" In his talk, Gerald tried to answer some misconceptions about how openSUSE and Novell work together. He started by stressing that there is no such thing as "Novell employees" as opposed to "the community": many of the employees of Novell's Open Platform Solutions business unit, even in the management team, have a history with free software. "A lot of Novell employees are part of the community, and these people with their feet in both Novell and the openSUSE community are not only important for openSUSE but also for Novell." As a side note, he remarked that it doesn't even make much sense to talk about "the openSUSE community", as there is an openSUSE kernel community, openSUSE forums community, openSUSE wiki community, openSUSE KDE community, openSUSE GNOME community, and so on, all behaving differently.
So why does Novell support the openSUSE community? Gerald presented Novell's twofold goal: increase the share of Linux as opposed to other operating systems, and maximize the amount of openSUSE and SUSE Linux Enterprise used. But even just awareness - if people know what openSUSE is - is already important. That said, Novell is a company with shareholders and a board, and it has certain rules and regulations to follow. For instance, a company is supposed to increase revenue or decrease costs with every action it takes, so Gerald explained that Novell can't fulfill each request from the openSUSE community like "Why don't you add 50 more people to the openSUSE Boosters team?"
Obviously Novell needs openSUSE because it is the base for SUSE Linux Enterprise (SLE). Gerald made this clear: "It's very hard to develop an enterprise Linux distribution every three to four years if you don't base it on a current distribution." For SLE, Novell does a lot of quality assurance, so it becomes more mature and stable than openSUSE, but Gerald stressed that stability is not a side criterion for openSUSE: "OpenSUSE is not a crash test facility for our enterprise Linux distribution, and it never was meant to be: I want it to be as stable as possible." He admitted, though, that there was at least one painful case where this went wrong: many SUSE users will remember the broken update mechanism in SUSE Linux 10.1.
One of the areas where Novell has a clear focus on contributing to openSUSE directly is the openSUSE Boosters team. These are thirteen people paid by Novell to support and "boost" the openSUSE community: they don't develop specific projects or maintain specific packages, but if they see a stumbling block for users or contributors, they remove these obstacles, e.g. in the areas of documentation or infrastructure.
But Gerald emphasized that Novell is contributing a lot more people to the openSUSE project than just the Boosters: there is for example the security team, the openSUSE community manager Jos Poortvliet, kernel people, a GCC maintainer who maintains GCC packages for openSUSE even if these versions are not and will not be used in SUSE Linux Enterprise (Richard Günther), and so on. Moreover, Novell contributes hardware and other infrastructure, and even pays legal costs, e.g. for checking license compliance. Tools like the openSUSE Build Service and SUSE Studio, developed by Novell, are directly beneficial to the openSUSE community as well. Of course, Novell also provides significant contributions to upstream projects of various sorts, which also benefits openSUSE.
Gerald concluded his talk by saying that it's important both for Novell and openSUSE that openSUSE stands more on its own feet: "I want to see more openSUSE volunteers at next year's conference." His reasoning was: the more Novell needs to direct efforts to baseline work, the less will land higher up the stack in terms of innovation. That's also why he said that Novell is very supportive of an openSUSE Foundation, e.g. by investing quite a bit of lawyers' time for the needed legal work.
After his talk, Gerald left some room for questions, which were numerous. Andrew Wafaa asked the most interesting one from the point of view of the relationship between Novell and openSUSE: why doesn't Novell open up its internal mailing lists? Gerald's answer was that internal mailing lists will always exist, and other companies working with open source projects also have them. Some conversations should just remain internal to the company. However, he has seen several cases of discussions on an internal mailing list, e.g. about the kernel, where someone mentioned openSUSE and someone else requested taking the matter to the opensuse-kernel mailing list. "At Novell, we all keep an eye on which discussions benefit from being done publicly." Someone else in the audience added the remark that the openSUSE community also has a responsibility here: "Keep the openSUSE mailing lists friendly, otherwise not only external people but also Novell employees get scared and discuss their stuff on the internal mailing lists."
Between focus and anarchy
These three talks give a good picture of what is going on in the openSUSE community right now. The presence of Gerald's talk in the schedule seems to suggest that Novell feels the need to defend the specifics of its relationship with openSUSE to the community. His assurance that Novell is very supportive of an openSUSE Foundation should be nice to hear for members of the community who want a stronger and more independent distribution that is able to attract more corporate sponsors.
The two other talks had somewhat contradictory messages. While Henne emphasized that everyone can do what they want in the openSUSE project, Jos tried to convince his audience that openSUSE needs a focused strategy to attract new people. Both of these approaches have some truth in them, but combining them will be a delicate task for the project. If everything is possible, like Henne maintains, then openSUSE will be a wonderful playground for technology enthusiasts and anarchist programmers, but most people from outside the community may be scared to join this chaos. On the other hand, if openSUSE chooses a strategy that is too focused, it may be easily able to attract new people that are interested in its goals, but it may alienate many of its current users.
The trick will be to fine-tune the current strategy to the point where most members of the community will choose to focus on the strategy's goals themselves, while those who want to explore other topics still have the freedom to do so. In practice, this doesn't seem a big change from the current situation, but it's important that people who contribute or want to contribute to openSUSE now have some written guidelines. Coupled with Henne's reminder that each individual community member bears responsibility for the project, the message is clear: now that openSUSE seems to have figured out its place in the Linux ecosystem, it's time to take action.
Brief items
Distribution quotes of the week
Well, someone actually read the anaconda changelog (which is probably the most surprising of all) and decided to comment. The whole mail thread was focused on whether or not Fedora as a whole should allow /usr on its own partition so it was really only tangentially about anaconda. We largely stayed out of the conversation and it kind of died without any real conclusion. The more interesting parts of the thread were about per-user /tmp which doesn't really have anything to do with the initial post.
Debian Edu/Skolelinux 6.0.0 alpha1 test release
Debian Edu/Skolelinux has released Debian "squeeze" based 6.0.0 alpha1. "This is the second test release based on Squeeze. The focus of this release is the thin clients and the diskless workstation setup. Please install a thin client server, and make sure all programs in the KDE menu work on both thin clients and diskless workstations. Especially sound is important to test."
Distribution News
Fedora
Fedora Board Meeting Recap 2010-10-21 - special meeting
Click below for a recap of the October 21, 2010 special meeting of the Fedora Board. Names for Fedora 15, the voting schedule, and spins were among the topics discussed.
Fedora Board Recap 2010-10-25
Click below for a recap of the October 25, 2010 meeting of the Fedora Board. Topics include F14 release planning, F15 release names, Fedora elections, and several other items.
Fedora 14 Final Release Declared GOLD
Fedora 14 was declared ready during the Go/No-Go meeting. Look for the release announcement on November 2.
Newsletters and articles of interest
Distribution newsletters
- DistroWatch Weekly, Issue 377 (October 25)
- Fedora Weekly News Issue 248 (October 20)
- openSUSE Weekly News, Issue 146 (October 25)
Shuttleworth: Unity shell will be default desktop in Ubuntu 11.04 (ars technica)
Ars technica has a report from Mark Shuttleworth's keynote at the Ubuntu Developer Summit, where Shuttleworth announced that the Unity shell will become Ubuntu's default user interface for both the desktop and netbook editions. "I also asked Shuttleworth why Canonical is building its own shell rather than customizing the GNOME Shell. He says that Canonical made an effort to participate in the GNOME Shell design process and found that Ubuntu's vision for the future of desktop interfaces was fundamentally different from that of the upstream GNOME Shell developers. He says that GNOME's rejection of global menus, for example, is one of the key philosophical differences that would be difficult to reconcile. Canonical has accumulated a team of professional designers with considerable expertise over the past few years. They want to set their own direction and create a user experience that meets the needs of their audience. The other major Linux vendors, who are setting the direction of GNOME Shell's design, have different priorities and are arguably less focused than Ubuntu on serving basic desktop users."
Blessed Unity: Ars reviews Ubuntu 10.10 (ars technica)
Ars technica has a review of Ubuntu 10.10. "Ubuntu 10.10, codenamed Maverick Meerkat, emerged from its burrow this month with some important changes. The user interface got a lift from some theming improvements and a new default font. Usability got a nice boost from a wide range of design improvements and feature enhancements in the Software Center and Ubiquity installer. Canonical's effort to clean up the notification area took another step forward with the addition of playback controls in the sound indicator menu. The latest version of GNOME is included, with a handful of minor improvements, and the F-Spot photo manager was replaced with Shotwell."
Hertzog: The secret plan behind the "3.0 (quilt)" Debian source package format
Raphaël Hertzog blogs about his work on the new source format known as "3.0 (quilt)". "This patch can have two functions: creating the required files in the debian sub-directory and applying changes to the upstream sources. Over time, if the maintainer made several modifications to the upstream source code, they would end up entangled (and undocumented) in this single patch. In order to solve this problem, patch systems were created (dpatch, quilt, simple-patchsys, dbs, ...) and many maintainers started using them. Each implementation is slightly different but the basic principle is always the same: store the upstream changes as multiple patches in the debian/patches/ directory and apply them at build-time (and remove them during cleanup)."
4 Reasons to Give Linux Mint 10 a Try (PCWorld)
PCWorld has a review of Linux Mint 10 RC. "Along with Ubuntu 10.10, Linux Mint 10 RC is based on version 2.6.35 of the Linux kernel along with version 2.32 of the GNOME desktop environment and X.org 7.5. All of these bring with them a raft of security and other improvements."
Fedora 14 Reflects Evolution of Leading-Edge Open Source
Red Hat News takes a look at the relationship between Fedora and Red Hat. "Red Hat participates in this process as part of the Fedora community, and its contributions to Fedora help enhance the technology selected by Fedora's substantial user and contributor base. Fedora helps Red Hat meet a goal of more scalable, extensible, and interoperable Red Hat Enterprise Linux, which is derived from Fedora."
TinyMe Linux For The Win (Yet Another Linux Blog)
Yet Another Linux Blog looks at TinyMe. "TinyMe is based on Unity Linux 2010 and was previously based on PCLinuxOS. It uses LXPanel, PCManFM and the Openbox Window Manager to handle the heavy desktop lifting. The ISO I used was a release candidate and lacked much of the polish of the TinyMe stable release of the past. Even though it's a release candidate, I still found it quite stable and usable..especially since I know my way around the openbox window manager."
Page editor: Rebecca Sobol
Development
The State of Conary
Once hailed as the "next generation" of package management, Conary was introduced by Erik Troan in 2004 at the Ottawa Linux Symposium (OLS). Though Conary hasn't replaced traditional Linux packaging technologies, it is in wider use than one might think. The next release promises better system management, but is anyone actually using Conary, and where's it going? The answers are yes, and possibly beyond Linux.
Conary was meant to solve some deficiencies that exist in standard package formats. For instance, package versioning as expressed by RPM or Debian packages does not allow for branches, only a linear newer/older model. Conary was also developed to make it easier for users to create their own distribution from a collection of repositories. The idea was that one might pick and choose repositories from which to install GNOME, Firefox, etc., rather than getting all of their software from Fedora or Debian or Ubuntu.
This has not quite come to pass, at least for most users. While there are distributions using Conary, the primary usage of Conary these days seems to be building custom distribution appliances for businesses.
How is Conary Different?
To learn more about Conary and its current state, we interviewed Michael K. Johnson, founding engineer at rPath and founding technical leader of the Fedora Project.
Despite its age, many Linux users have probably never heard of Conary or have heard of it only in passing. Fewer still are likely to be familiar with the details. Though Conary is lumped into discussions of package management, it's a bit more than that. Conary is described as a "distributed software management system" for Linux distributions, as opposed to a package management system. Rather than managing software as "specialized archives" (as Johnson calls RPMs and Debian packages), Conary packages are references to files in a database. The packages contain references to components, which are divided by their roles in a package, such as runtime requirements, documentation, libraries, etc.
Conary actually works as a sort of distributed source control system. Software comes from specific repositories, and the associations are much more granular than Debian packages or RPMs. For example, it's possible to remove a file from the system and, when the package that owns the file is updated, the individual file is not reinstalled. Files are treated as first class objects in Conary, and can be managed individually if desired.
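The file-level behavior described above can be illustrated with a toy model. This is a minimal Python sketch, not Conary's implementation or API: the function `apply_update` and the file paths are hypothetical, and the sketch simply assumes that the system remembers which files the administrator removed and skips them when a newer version of the owning package arrives.

```python
def apply_update(locally_removed, new_version):
    """Toy model of a file-aware package update (hypothetical, not Conary code).

    locally_removed: set of paths the administrator deliberately deleted
    new_version:     dict mapping path -> content shipped by the updated package
    Returns the dict of files that end up on the system after the update.
    """
    updated = {}
    for path, content in new_version.items():
        if path in locally_removed:
            # Respect the local removal: do not reinstall this file.
            continue
        updated[path] = content
    return updated


# The administrator deleted the man page that version 1 had installed.
locally_removed = {"/usr/share/man/man1/tool.1"}

# Version 2 of the package ships both files again.
new_version = {
    "/usr/bin/tool": "v2",
    "/usr/share/man/man1/tool.1": "v2 docs",
}

result = apply_update(locally_removed, new_version)
print(sorted(result))
```

In this model only the binary is updated; the removed man page stays removed because the decision is tracked per file rather than per package.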
Packages can have branches called shadows: customized versions of a package that reference the original plus a set of changes. For minimal changes, it's possible to have a "derived package" that applies changes without rebuilding the package. As the SCM heritage suggests, Conary also has rollback capabilities that are much more elegant than what is allowed by RPM or dpkg.
Conary also allows for "groups," something like a metapackage or task, that pull together the components that make up a collection of software meant to be installed together. GNOME or KDE might be distributed as a group, as might a collection of server software that contains all of the libraries, applications, and supporting software that need to be installed.
In short, Conary introduces much more detail and flexibility in managing system software.
Conary 2.2 is due out "soon," and Johnson says "near-final" snapshots are already being used in Foresight Linux development. Johnson says that 2.2 introduces a new and more flexible way to manage systems:
Conary 2.2 introduces "system models", which you can think of as "groups lite". A system model allows you to describe concisely how a system is different from a group on which it is based. Instead of building a group for each unique software combination, you can build fewer base groups, and then express minor unique variations on a per-system basis where conflict is unlikely. It makes it possible to have a more dynamic building-block approach to configuring systems, without giving up the control that Conary provides.
As an example, it is reasonable to have a model that expresses, 'This is one of my web application server systems, built from my group-mywebapp manifest. It is a Dell server, so I will add to it the Dell hardware support packages that my organization uses, which I have bundled together in group-dell-packages. It is being deployed in my Atlanta data center, so I need to add the administrative credentials required for all systems deployed in my Atlanta data center, so also include my atlanta-data-center-credentials package that sets up those credentials appropriately.'
Johnson went on to give a detailed example of the process and group definitions behind the changes that would be required to set up the server, in just a few lines. The upshot here is that Conary 2.2 adds features that make it very easy to clone and manage systems in a few commands.
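The building-block idea can be sketched in a few lines of Python. This is not Conary's system model syntax or API, which the article does not show; the `Group` and `SystemModel` classes are hypothetical, and the package names inside each group are made up for illustration. Only the names `group-mywebapp`, `group-dell-packages`, and `atlanta-data-center-credentials` come from Johnson's description.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Group:
    """A named, prebuilt collection of packages (hypothetical model)."""
    name: str
    packages: frozenset


@dataclass
class SystemModel:
    """Describes one system as a base group plus per-system additions."""
    base: Group
    additions: list = field(default_factory=list)

    def resolve(self):
        """Return the sorted list of packages this system should carry."""
        result = set(self.base.packages)
        for item in self.additions:
            if isinstance(item, Group):
                result |= item.packages  # add every package in the group
            else:
                result.add(item)  # a single package, given by name
        return sorted(result)


# Group contents below are invented placeholders.
webapp = Group("group-mywebapp", frozenset({"httpd", "mywebapp"}))
dell = Group(
    "group-dell-packages",
    frozenset({"dell-firmware-tools", "dell-raid-utils"}),
)

# One web application server: Dell hardware, Atlanta data center.
model = SystemModel(
    base=webapp,
    additions=[dell, "atlanta-data-center-credentials"],
)
print(model.resolve())
```

The point of the design is that `group-mywebapp` is built once, while hardware and site differences are expressed per system instead of as a separate group for every combination.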
System models are the primary new feature in 2.2, but Johnson says it also has memory improvements and uses less bandwidth.
Conary and rPath Adoption
For all its technical advantages, Conary (like rPath) has yet to take the world by storm. None of the major distributions have switched to Conary as their package management system or base for development. But that doesn't mean that it's not in use.
Conary and other package management systems are not mutually exclusive. Johnson says there was an "epiphany" early in 2009 about what Conary could do.
Support for "encapsulating" other package formats was introduced in Conary 2.1. rPath has been offering custom versions of CentOS, Red Hat Enterprise Linux, SUSE Linux Enterprise, and rPath.
If you don't see much of Conary in the wild, where is it being used? Johnson says that its largest audience is "enterprises using rPath's product line, based on Conary, to manage diverse systems on a massive scale," in other words "enterprise appliances." Johnson says that ISVs are also a significant audience for Conary, and are using Conary and rPath's tools to deliver software as software, hardware, and virtual appliances. He cites the Department of Energy, EMC, Fujitsu, IBM, and Qualcomm as customers who are using Conary and other rPath tools to build and manage software.
In general, Johnson says that there are hundreds of rPath derivatives, but not all are public:
Some derivatives that use Conary aren't rPath-based at all. Johnson says many customers are using rPath-supported versions of SLES, RHEL, and CentOS to create multiple products or "system definitions" from those platforms.
Of course there's also Foresight Linux, which is based on rPath and Conary. Foresight has had its ups and downs, and was on the ropes briefly when rPath laid off the developers who were working on the distribution. There's also interest in using Conary with major distributions, albeit in a slightly different way. For instance, there's Boots, a Conary-encapsulated mirror of Fedora. Interest in Boots started when Johnson proposed a change in direction for Foresight Linux. And Conary has been adopted for some derivative versions, like the Openfiler Storage Appliance.
Beyond Linux
In 2005, Johnson suggested that Conary was not limited to Linux, and could be used by the BSDs and other operating systems. So far, Johnson says that rPath has "not received significant feedback" from customers saying it'd be worthwhile for the company to package any of the BSDs. It's technically possible, but not in demand.
But the company is building support for managing Windows packages. Johnson says that the company is adding support for creating MSI installable packages for Windows, and is specifically hiring field engineers who have experience with MSI packaging and other Windows system provisioning.
Johnson says that the company is also working on managing other package types, so it may not be long before the rPath rBuilder tools support creating Ubuntu- or Debian-based appliances as well.
Though Conary has not replaced traditional package management for most Linux users or developers, nor has rPath become a household name for Linux users, it's more successful than one might think at first glance. Conary may well be worth a look for developers and ISVs that create software appliances, or for enterprises that wish to have more control over the management of their systems.
Brief items
Development quotes of the week
At any rate, most of the world appears to still be on Python 2.6 and can blissfully ignore that's there's no 2.8 planned for probably another year or more. Which means the Python devs still have a year to come to their senses about discontinuing the language people actually use.
Asterisk 1.8.0 released
Version 1.8.0 of the Asterisk telephony system has been released; this is a major, long-term-supported release. New features include secure RTP support, IPv6 SIP support, calendaring integration, a new call logging system, and more; see the change summary for more information than you will ever possibly be able to use. (Thanks to Graham Cantin).
DVDAuthor 0.7.0
DVDauthor is the tool which does the real work behind every other Linux-based DVD authoring application. Version 0.7.0 has been released; changes include better encoding support, more flexible configuration, and more.
KDevelop 4.1 released
KDE.News has the KDevelop 4.1 announcement. There are improvements in patch exporting, script integration, PHP support, and a new hex editor, but the headline feature seems to be Git support. "That means that we have support for the basic features for management of a VCS-controlled project, like moving, adding and removing files inside the project. Additionally we integrate the basic VCS features like comparing and reviewing local changes, sending our changes back to the server, updating the local checkout and annotating files."
The Mordor I/O library
Mozy, an online backup provider, has announced the release of some of its code under the BSD license; that includes the Mordor C++ I/O library. "Mordor is a high performance I/O library. One of its main goals is to provide very easy-to-use abstractions and encapsulation of difficult and complex concepts, yet still provide near absolute power in wielding them if necessary."
Mozilla's "Chromeless" project
Mozilla Labs has announced the "Chromeless" project, aimed at making it easier for developers to create browser interfaces. "The 'Chromeless' project experiments with the idea of removing the current browser user interface and replacing it with a flexible platform which allows for the creation of new browser UI using standard Web technologies such as HTML, CSS and JavaScript." There is a "pre-alpha prototype" available now.
Shed Skin 0.6 released
Shed Skin is a Python-to-C++ compiler; the 0.6 release is now available. It includes some major changes in how program analysis is done, allowing it to scale to larger ("several thousands of lines") programs. See the announcement for more information.
Valgrind 3.6.0 released
Valgrind 3.6.0 is available. New features include support for the ARM architecture, updated distribution support, an understanding of the SSE4.2 instruction set, an improved profiler, an experimental new heap profiler, and more; see the release notes for details.
Newsletters and articles
Development newsletters from the last week
- Caml Weekly News (October 26)
- OpenOffice.org Newsletter (October)
- PostgreSQL Weekly News (October 24)
- Python-URL! (October 26)
Gmail vs. Zimbra Desktop 2.0 (Linux Magazine)
Over at Linux Magazine, Joe "Zonker" Brockmeier takes Zimbra Desktop 2.0 for a spin, comparing it to the email interface that Gmail provides. While there is much to like with Zimbra, which is open source, he found Gmail to be easier to use. "Another major feature for Zimbra is that it allows you to use pretty much any mail service. So you can tie Zimbra Desktop to an IMAP server that you control, and all your mail belongs to you. Totally. For some folks, this feature alone is going to make Zimbra (or another email client) far more desirable than Gmail. While I am sometimes uneasy with Google's ever-increasing collection of data, I'm not personally concerned that someone at Google is reading my email. And Google's Gmail reliability has improved to the point that I haven't had a problem reaching my mail in several months."
Rosegarden - An open source MIDI / audio multi-tracker (The H)
The H has a lengthy review of Rosegarden. "If access to computers and networks have given us the means to copy and distribute recorded works, they have also given us the means to create our own music. If computers and networks have made it possible to take the programming art into the home, they have also taken the black arts of the recording studio closer to the back bedroom, and in so doing, have made the making of music more accessible to greater numbers of people - much as free software has made it easier for 'hobbyist' programmers to join in and test their skills and make a difference. Tools for GNU / Linux and free software have played their part in this evolution, and Rosegarden is a significant part of the canon, a well structured MIDI / audio sequencer and musical notation editor with a well thought out user interface which has put usability and ease of learning to the fore."
Yocto Project aims to standardize embedded Linux builds (LinuxDevices.com)
LinuxDevices.com has an overview of the Yocto Project just announced by the Linux Foundation. "Unlike build systems based on shell scripts or makefiles, the Yocto Project automates the fetching of sources from upstream sources or local project repositories, says the project. Its customization architecture is said to allow the choice of a wide variety of footprint sizes as well as control over the choice or absence of components such as graphics subsystems, visualization middleware, and services. Yocto is based on the GNOME-derived Poky Linux, a well established platform-independent, cross-compiling build system that uses the same architecture as the OpenEmbedded build system."
Page editor: Jonathan Corbet
Announcements
Non-Commercial announcements
CELF is joining the Linux Foundation
At the Embedded Linux Conference Europe, Tim Bird, architecture chair of the Consumer Electronics Linux Forum (CELF), announced that the organization was joining the Linux Foundation. Bird said that CELF "couldn't be more happy to have the opportunity" to work within the LF. Jim Zemlin, LF executive director, congratulated both organizations and mentioned that the LF would be doubling the funding that CELF currently puts into promoting embedded Linux. He also said that there would be some more information about the new Yocto project—an effort to standardize the embedded Linux development environment—later in the conference.
Update: The Linux Foundation press release about the merger is available as well.
2011 Fedora Scholarship open to applications
The Fedora Scholarship program is accepting applications from students who will be entering college in Fall 2011. "The Fedora Scholarship program recognizes one high school senior per year for contributions to the Fedora Project and free software/content in general. The scholarship is a $2,000 USD reward per year over each of the four years the recipient is in college, which is funded by Red Hat's Community Architecture team, as well as travel and lodging to the nearest FUDCon for each year of the scholarship."
Articles of interest
Nokia boosts Qt commitment, changes Symbian strategy (ars technica)
Ars technica reports on some changes to Nokia's mobile platform strategy. It plans to do more rapid and incremental Symbian releases, while making Qt the "sole focus" of its application development. "Nokia's plan to use Qt for all of its own applications is also significant. It will enable richer user interfaces and more consistency between Symbian and MeeGo. It also sends a strong message to third-party developers that Qt is ready for prime time on Nokia devices. The recent Qt 4.7 release brings some extremely compelling new functionality for building modern touch-friendly mobile software. Taking advantage of these capabilities will make the Symbian user experience better and help ameliorate some of the issues that detract from Symbian's competitiveness. During my recent tests of the N8, I often found myself thinking that the whole experience would be better if Qt was used pervasively in the bundled applications."
How Qt could bring better third-party software to Ubuntu (ars technica)
Ars technica looks at the advantages of the Qt toolkit. "A point that I think often gets overlooked in the toolkit debate is that adopting Qt doesn't necessarily imply ditching GNOME or switching to KDE. As we discussed in our review of Qt 4.5 last year, Qt has relatively robust support for Gtk+ theming, including conformity with the GNOME HIG and support for native GNOME dialogs. When everything is properly configured, Qt applications look entirely at home in GNOME environments. Adding a standard Qt library stack to a fresh Ubuntu installation requires only 16.5MB of packages, which expands to approximately 50MB on disk."
FSFE interview with Leena Simon
The Free Software Foundation Europe has posted an interview with Leena Simon. "I am fighting within The Pirate Party, as well as in the Freedom not Fear movement, for Free Software. In both movements a lot of people haven't understood yet how important Free Software is: FS does not really connect the one with the other. They are connected in different ways and I can also understand their critique about Free Software."
OpenOffice.org Council members resign (The H)
The H reports that Christoph Noack, Florian Effenberger and Thorsten Behrens have resigned from the OpenOffice.org community council. "Noack says in his email that his 'idea of a stable and working open-source environment differs from what I currently perceive when we talk about certain community structure characteristics.' Effenberger notes that he feels it's unfortunate that some people view OpenOffice.org and LibreOffice as separate and conflicting projects and that he hopes there will be a resolution in the future."
License compliance is not a problem for open source users (opensource.com)
Simon Phipps worries that excessive focus on license compliance actions obscures the fact that free software licenses make life easy for users. "Open source does not place a compliance burden on the end user, does not mandate acceptance of an end-user license agreement, does not subject you to para-police action from the BSA. That is a significant advantage, and there's no wonder that proprietary vendors want to hide it from you and make you think open source licensing is somehow complex, burdensome or risky. If all you want to do is use the software - which is all you are allowed to do with proprietary software as the other three freedoms are entirely absent - then open source software carries significantly less risk."
Legal Announcements
Gemalto sues Google, HTC, Motorola and Samsung
The mobile patent thicket grows thicker: a company called Gemalto has announced the filing of a lawsuit against Google, HTC, Motorola and Samsung, claiming that Android violates its patents 6,308,317, 7,117,485, and 7,818,727. The latter two were just issued in October; all seem to cover the revolutionary concept of running an interpreted language on a microcontroller.
New Books
Land of Lisp--New from No Starch Press
No Starch Press has released "Land of Lisp", a "Unique, Cartoon-Filled Guide Makes Lisp Programming Fun", by Conrad Barski.
MAKE Magazine Launches Do-It-Yourself Space Technology in New Issue
MAKE Magazine Volume 24 from O'Reilly Media is available.
Resources
Linux Foundation Monthly Newsletter: October 2010
The October issue of the Linux Foundation newsletter covers the Linux Foundation User Survey; New Open Compliance Resources Available; Linux Kernel Summit & Plumbers Conferences Are Coming Up; Aava Mobile, Insprit and OpenLogic Join The Linux Foundation; the Linux Foundation in the News; and Upcoming Training Opportunities.
Contests and Awards
EFF: Pioneer Awards
The Electronic Frontier Foundation (EFF) has announced the four winners of its 2010 Pioneer Awards. The winners are Pamela Jones and Groklaw, Steven Aftergood, James Boyle, and Hari Krishna Prasad Vemuru. "When Pamela Jones created Groklaw in 2003, she envisioned a new kind of participatory journalism and distributed discovery -- a place where programmers and engineers could educate lawyers on technology relevant to legal cases of significance to the Free and Open Source community, and where technologists could learn about how the legal system works. Groklaw quickly became an essential resource for understanding such important legal debates as the SCO-Linux lawsuits, the European Union antitrust case against Microsoft, and whether software should qualify for patent protection."
Education and Certification
Free Technology Academy partners with the FSF
The Free Technology Academy (FTA) and the Free Software Foundation (FSF) have announced their partnership in the FTA's Associate Partner Network. "The Network aims to expand the availability of professional educational courses and materials covering the concepts and applications of Free Software and free standards."
Upcoming Events
lca2011 - Announces Vint Cerf as a Keynote Speaker
The linux.conf.au 2011 organizing team has announced that Vinton G. Cerf will be a keynote speaker for lca2011. "Vinton G. Cerf has served as vice president and chief Internet evangelist for Google since October 2005. In this role, he is responsible for identifying new enabling technologies to support the development of advanced, Internet-based products and services from Google. He is also an active public face for Google in the Internet world."
Events: November 4, 2010 to January 3, 2011
The following event listing is taken from the LWN.net Calendar.
| Date(s) | Event | Location |
|---|---|---|
| November 1-5 | ApacheCon North America 2010 | Atlanta, GA, USA |
| November 3-5 | Linux Plumbers Conference | Cambridge, MA, USA |
| November 4 | 2010 LLVM Developers' Meeting | San Jose, CA, USA |
| November 5-7 | Free Society Conference and Nordic Summit | Gothenburg, Sweden |
| November 6-7 | Technical Dutch Open Source Event | Eindhoven, Netherlands |
| November 6-7 | OpenOffice.org HackFest 2010 | Hamburg, Germany |
| November 8-10 | Free Open Source Academia Conference | Grenoble, France |
| November 9-12 | OpenStack Design Summit | San Antonio, TX, USA |
| November 11 | NLUUG Fall conference: Security | Ede, Netherlands |
| November 11-13 | 8th International Firebird Conference 2010 | Bremen, Germany |
| November 12-14 | FOSSASIA | Ho Chi Minh City (Saigon), Vietnam |
| November 12-13 | Japan Linux Conference | Tokyo, Japan |
| November 12-13 | Mini-DebConf in Vietnam 2010 | Ho Chi Minh City, Vietnam |
| November 13-14 | OpenRheinRuhr | Oberhausen, Germany |
| November 15-17 | MeeGo Conference 2010 | Dublin, Ireland |
| November 18-21 | Piksel10 | Bergen, Norway |
| November 20-21 | OpenFest - Bulgaria's biggest Free and Open Source conference | Sofia, Bulgaria |
| November 20-21 | Kiwi PyCon 2010 | Waitangi, New Zealand |
| November 20-21 | WineConf 2010 | Paris, France |
| November 23-26 | DeepSec | Vienna, Austria |
| November 24-26 | Open Source Developers' Conference | Melbourne, Australia |
| November 27 | Open Source Conference Shimane 2010 | Shimane, Japan |
| November 27 | 12. LinuxDay 2010 | Dornbirn, Austria |
| November 29-30 | European OpenSource & Free Software Law Event | Torino, Italy |
| December 4 | London Perl Workshop 2010 | London, United Kingdom |
| December 6-8 | PGDay Europe 2010 | Stuttgart, Germany |
| December 11 | Open Source Conference Fukuoka 2010 | Fukuoka, Japan |
| December 13-18 | SciPy.in 2010 | Hyderabad, India |
| December 15-17 | FOSS.IN/2010 | Bangalore, India |
If your event does not appear here, please tell us about it.
Page editor: Rebecca Sobol
