Weekly Edition for April 5, 2012

Why bother supporting ARM?

By Jonathan Corbet
April 4, 2012
Two weeks ago, LWN covered the debate within the Fedora project over whether its ARM port should be designated one of that distribution's "primary" architectures. That discussion has progressed a little further, so an update may be warranted. But it may also be worthwhile to address a related question: why is there resistance to the concept of supporting ARM as a primary architecture in the first place? And why might it make sense to promote the ARM architecture anyway?

One of the things that came out in the original discussion is that the Fedora project did not have any idea of how to do that. Over its entire history, the project has never before seriously considered moving one of its secondary architectures to primary status. So there are no procedures in place and no criteria by which a decision to promote an architecture can be made. So, unsurprisingly, the project decided that it needs to come up with a set of reasonable criteria. On April 2, Matthew Garrett posted a draft showing what those criteria might look like.

The rules would appear to make sense. The Fedora infrastructure and release engineering teams need to have people who are able to represent any new primary architecture. The project must be able to build packages on its own servers. Anaconda, the Fedora installer, must work on the targeted hardware. Maintainers of important packages must have access to the supported hardware so they can fix problems. No binary blobs. And so on. Also required is approval from various Fedora teams, each of which can impose additional criteria if it sees the need. These rules are in an early form and can be expected to evolve over time, but the early responses on the mailing list suggest that most people are happy enough with what has been set down.

That said, there are clearly some people who do not see the point of supporting ARM as a primary architecture, and they have a number of reasons for their reluctance. The ARM architecture is messy, for example. The x86 architecture does not have a single design authority, but processors made by multiple vendors still resemble each other closely enough to create a fairly tightly-knit processor family. ARM does have a central design authority, but that authority leaves a lot of significant details up to individual manufacturers, of which there are many. So ARM is not a tightly-knit family; it is more like an extended group of hostile ex-spouses and in-laws who have moved to different continents to get away from each other.

The looseness of the ARM "platform" has led to a lot of innovation in the design space; there is no end of interesting ARM system-on-chip designs with all kinds of impressive integrated peripheral devices. But this diversity, along with a distressing lack of hardware discoverability, makes it impossible to create a single kernel that works on all (or even a significant subset of) ARM processors. Distributors hate having to maintain multiple kernels, and they hate having to put target-specific hacks into installers. ARM currently forces both, despite the ongoing work to consolidate kernel code and move hardware knowledge into the bootloader-supplied device tree structure.

ARM is also, for many developers, a relatively obscure architecture lacking the familiarity of x86. The fact that there are vastly more ARM systems running Linux than x86 systems does not really change that perception; most of us lack ARM-based development systems on our desks. Additionally, ARM processors are relatively slow. That is a problem for developers, who typically need to keep an x86 system and a cross-compiling toolchain around to be able to get through more than one edit-compile-test cycle in any given day. That slowness is also an issue for distributors; it can delay security updates and distribution releases, even for other architectures. And, while the hardware is slow, product cycles are fast; by the time developers have gotten a target working nicely, it may be obsolete and off the market.

Given all of these challenges, it is not surprising that some people would rather not be bothered by an architecture like ARM. The x86 world provides plenty of open, high-performing systems with wide support; why get distracted with that messy architecture where, even if the distribution can be made to work, the hardware is probably closed and won't allow it to be installed?

The answer, of course, is that said messy architecture is already performing much of our computing, and it will likely be doing more of it in the future. Traditional PC-style systems are no longer the center of attention; one assumes they will not go away entirely, but a lot of the action is elsewhere. There is a whole new crowd of makers looking to do interesting things with ARM-based designs; we are just beginning to see what can be done with interesting mobile devices, and the bulk of those devices are not, at this point, using x86 processors. Meanwhile, ARM has its eyes on data center applications where, some think, its compactness and power efficiency will make up for its lack of speed. The x86 architecture will be with us for a long time - even Intel has proved unable to kill it off in the past - but it is far from the only show in town.

It is also worth remembering that, for all its success, Linux is still a minority player on x86 systems. But Linux is the dominant system on ARM-based systems. The "year of the Linux desktop" may be an old and sad joke, but the year of the Linux gadget looks to be happening for real - again.

Given that ARM is where much of the action is, it would make sense that a Linux distribution - especially one that is supposed to be leading-edge and forward-looking - would want to support ARM as well as possible. Solid support for the architecture seems like a necessary precondition for any sort of presence in the interesting computing devices of the near future. Distributors like Ubuntu appear to have come to that conclusion; they have built on Debian's longstanding ARM support to create a distribution that, they hope, will be found in future devices. Without well-established ARM support, Fedora - along with the distributions derived from it - has little chance of competing in that area.

So one might well say that the questions being asked in the Fedora community are wrong. Rather than asking "why should we support ARM when it presents all of these difficulties?", it might make more sense to ask "how can we address these difficulties to provide the best ARM-based distribution possible?". The cynical among us might be tempted to say that Red Hat, Fedora's sponsor and main contributor, faces a classic disruptive technology problem. ARM is unlikely to displace x86 in the places where Red Hat currently sells support, and revenues from any future ARM-based "enterprise" distribution seem likely to be rather lower than those obtained from x86-based distributions. So it would be understandable if Red Hat were to show a lack of enthusiasm for the ARM architecture.

The cynical view is, at best, only partially right, though. Red Hat does not advertise the resources it is putting into ARM distribution development, but it clearly has a number of engineers on the task. Even as a "secondary" architecture, Fedora's ARM distribution has been solid enough to serve as the base for the ARM-based OLPC XO 1.75 laptop. Without Red Hat's support, there wouldn't be a Fedora ARM distribution even with secondary status. So it seems unlikely that Red Hat is the sticking point here, even if its contributions to the kernel's ARM subtree (29 patches total from 3.0 to the present) show little enthusiasm. More likely we're just seeing the usual noise as the wider community comes to terms with what will be required to support this architecture properly.

In the end, the world would not be well served by a single processor architecture; there is value in diversity. Similarly, an industry where ARM-based systems are dominated by Android variants may not be the best possible world. A lot of interesting things are happening in computing, and many of them involve the ARM architecture; there is a lot of value in having strong community-based distribution support for that architecture. That is why Fedora will, in the end, almost certainly bother to support ARM as a primary architecture despite the challenges it presents.

An update on Oracle v. Google

April 4, 2012

This article was contributed by Adam Saunders

After over a year and a half of legal proceedings, Oracle and Google will go to trial on April 16 in front of the United States District Court for the Northern District of California, to determine whether or not Google's Android software infringes Oracle's copyrights on Java, as well as some of its patents. If the parties don't settle, this trial is expected to take eight weeks.

A lot has happened since the litigation started. In August 2010, several months after acquiring Sun Microsystems, which developed Java and held the copyrights, Oracle launched a lawsuit against Google, claiming that Android's use of Java infringed seven of Oracle's patents, as well as the Java copyrights Oracle holds. The complaint demands an injunction barring Google from continuing its allegedly infringing activity, that "all copies made or used in violation of Oracle America's copyrights [...] be impounded and destroyed or otherwise reasonably disposed of", and that Oracle receive damages. Essentially, Oracle is formally seeking to stop Google's use of Java in Android, and wants compensation for that use.

The FSF has argued that if Google had used an available GPL-licensed version of Java, such as IcedTea, as part of Android, it would have avoided this litigation. This may be true; Sun (now Oracle) distributes Java under GPLv2 with a linking exception, which is what IcedTea is based off of. The GPLv2 implicit patent language, contained in sections 6 and 7 of the license, effectively gives users of Sun/Oracle's distribution of Java a royalty-free patent license that covers standard free software practices: the right to use, modify, and redistribute the software, including modified versions. With the linking exception, permissively licensed software and proprietary software that links to IcedTea could be developed without being licensed under GPLv2; thus, the app repository Google Play (formerly known as Android Market), with its proprietary apps as well as free software apps, would have still been possible.

However, this argument ignores the fact that the Android project started before Sun licensed Java under the GPL. Android, Inc. was founded in 2003 and acquired by Google in 2005; the relicensing of Java happened in November, 2006. When Android started, if a non-Sun programmer or development project wanted to make an open-source version of Java while minimizing the threat of copyright infringement, the only practical way to do this was to rely on clean room reverse engineering; this is what Google claims to have done. But clean room reverse engineering is not a helpful defense in patent litigation, which is why Google's way of implementing Java in Android - including basing Dalvik off of the Apache Harmony project, and not Sun/Oracle's GPL'd Java - exposes it to patent lawsuits from Oracle, assuming Oracle has any valid patents that read on Google's Java implementation.

So if you're Google, and you get sued by Oracle, one of the best things you can do to defend against the patent infringement claims is to get the patents reexamined and hope that they get rejected. This has been a very successful tactic; Google's requests for USPTO patent reexaminations have, over time, left Oracle with only two patents to litigate against Google. The reexamined claims in the '205 and '702 patents were rejected due to prior art, as were the reexamined claims in the '720 patent, the '447 patent, and the '476 patent. The '447 patent covered the concept of restricting access to objects based on where a specific program came from. The '720 patent claimed the novel concept of loading classes into a parent process before calling fork() so they would already be present for child processes. The '702 patent claimed the concept of coalescing duplicated objects (constants, for example) in a class file. The '476 patent is about determining access permissions depending on the calling sequence that led to a specific class method. Finally, the '205 patent claims the concept of a just-in-time compiler.

The only remaining patents are the '520 and the '104 patents:

  • The '104 patent, reissued in 2003, claims a "method and apparatus for resolving data references in generated code"; the method describes generating and interpreting executable code, and changing symbolic references in the code to numerical references when the code is interpreted. Cameron McKenzie aptly characterized this as claiming the very basic idea that "if you rid your code of symbolic references, and replace them with direct references, things are more efficient".

  • The '520 patent claims a "method and system for performing static initialization". Essentially, the virtual machine replaces a bunch of instructions initializing an array with a copy of the resulting array, speeding the initialization process.
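The general idea behind the '104 description, replacing a by-name lookup with a direct index the first time an instruction is interpreted, can be shown with a toy stack interpreter. This is only a sketch of the concept as summarized above; it is not a reading of the patent's claims and has nothing to do with how Dalvik or any Java virtual machine is actually written. The opcode names (`load_sym`, `load_slot`, `add`) and the `run()` helper are invented for the example.

```python
def run(code, symbols, slots):
    """Interpret `code`, rewriting symbolic loads into direct slot loads.

    code:    mutable list of (opcode, argument) pairs (hypothetical opcodes)
    symbols: dict mapping a variable name to its slot index
    slots:   list holding the variable values
    """
    stack = []
    for pc, (op, arg) in enumerate(code):
        if op == "load_sym":
            index = symbols[arg]              # by-name lookup, paid only once
            code[pc] = ("load_slot", index)   # rewrite the instruction in place
            stack.append(slots[index])
        elif op == "load_slot":
            stack.append(slots[arg])          # direct reference, no lookup
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()


program = [("load_sym", "x"), ("load_sym", "y"), ("add", None)]
result = run(program, {"x": 0, "y": 1}, [2, 40])
print(result)    # 42
print(program)   # the symbolic loads have been replaced by load_slot
```

A second run of the same `program` list never touches the `symbols` table, which is where the claimed efficiency gain comes from.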

With only these two patents left to litigate, Oracle is left hoping that it can claim a relatively low sum of damages from Google for alleged patent infringement.

What exactly has Oracle alleged in its copyright infringement claim? Oracle claimed infringement of "(a) 37 Java API design specifications and implementations and (b) 11 Java software code files". Google's defense here looks strong, and there are indications that the court agrees. For example, a recent court order asked Oracle to explain how Baker v. Selden applies to its copyright claims. In that case, the Supreme Court clearly established that one cannot use copyright to stop people from using the ideas contained in an expressive work; one can only use copyright to restrict use of the particular expressive work itself. Even though the same court order asked Google to address Sun's limitations on permitted uses of Apache Harmony APIs, it appears that the judge might view Baker as implying that one cannot use copyright to restrict API reimplementations in the way that Oracle is claiming in this case.

With regard to the allegedly infringing Java code files, Oracle specified them as:

the entire code for [...], obtained by decompiling object code [...] [,] code from [...] [and] comments from [...] [and] from [...]

This claim is weak; although these files had previously been in Android, they are no longer part of Android, and had never been distributed as part of an Android device.

At the end of March, Google made a settlement offer that Oracle rejected. The settlement involved donating a fraction of a percentage of total Android revenues until April 2018, but only if Oracle can demonstrate that the '520 and '104 patents had been infringed. Some might interpret this settlement offer as indicating that Google feels Oracle has a decent case, but Google might simply want this litigation to not drag on any longer; litigation is expensive and time-consuming.

How should the free software and open source community react to this litigation? As the proceedings have shown, Oracle has become far less threatening than it may have appeared in the summer of 2010. Most of Oracle's patents have been rejected. As Groklaw has noted, Oracle's copyright claims on its APIs look weak, with Google's defense that the complaint refers to functional, and therefore non-copyrightable, subject matter looking strong. In addition, Sun's praise of Android doesn't really help Oracle's case. Although it is far too early to tell how the case will turn out, what ruling Judge Alsup will give, and whether or not Android will face the need to change its relationship with Java, it is clear that Oracle's case is much, much weaker than it initially seemed in the summer of 2010. It is entirely possible that Android's current implementation of Java will be in excellent legal shape following this case.

It is important to remember that this lawsuit is only one of several instances of legal pressure being applied against Android. Apple has launched many patent lawsuits against several Android device manufacturers, with many of them retaliating against Apple with patent lawsuits of their own. Another example is Microsoft's pressuring of Android device makers into patent licensing agreements. Last year, the non-practicing entity Lodsys sued, among others, Android app developers. So, regardless of how Oracle v. Google is resolved, Android, and free software in general, will remain under significant threat from software patents.

Runtime filesystem consistency checking

By Jake Edge
April 3, 2012
This year's edition of the Linux Storage, Filesystem, and Memory Management Summit took place in San Francisco April 1-2, just prior to the Linux Foundation Collaboration Summit. Ashvin Goel of the University of Toronto was invited to the summit to discuss the work that he and others at the university had done on consistency checking as filesystems are updated, rather than doing offline checking using tools like fsck. One of the students who had worked on the project, Daniel Fryer, was also present to offer his perspective from the audience. Goel said that the work is not ready for production use, and Fryer echoed that, noting that the code is not 100% solid by any means. They are researchers, Goel said, so the community should give them some leeway, but any input to make their work more relevant to Linux would be appreciated.

Filesystems have bugs, Goel said, producing a list of bugs that caused filesystem corruption over the last few years. Existing solutions can't deal with these problems because they start with the assumption that the filesystem is correct. Journals, RAID, and checksums on data are nice features but they depend on offline filesystem checking to fix up any filesystem damage that may occur. Those solutions protect against problems below the filesystem layer and not against bugs in the filesystem implementation itself.

But, he said, offline checking is slow and getting slower as disks get larger. In addition, the data is not available while the fsck is being done. Because of that, checking is usually only done after things have obviously gone wrong, which makes the repair that much more difficult. The example given was a file and directory inode that both point to the same data block; how can the checker know which is correct at that point?

James Bottomley asked if there were particular tools that were used to cause various kinds of filesystem corruption, and if those tools were available for kernel hackers and others to use. Goel said that they have tools for both ext3 and btrfs, while audience members chimed in with other tools to cause filesystem corruptions. Those included fsfuzz, mentioned by Ted Ts'o, which will do random corruptions of a filesystem. It is often used to test whether malformed filesystems on USB sticks can be used to crash or subvert the kernel. There were others, like fswreck for the OCFS2 filesystem, as well as similar tools for XFS noted by Christoph Hellwig and another that Chris Mason said he had written for btrfs. Bottomley's suggestion that the block I/O scheduler could be used to pick blocks to corrupt was met with a response from another in the audience joking that the block layer didn't really need any help corrupting data—widespread laughter ensued.

Returning to the topic at hand, Goel stated that doing consistency checking at runtime is faced with the problem that consistency properties are global in nature and are therefore expensive to check. To find two pointers to the same data block, one must scan the entire filesystem, for example. In an effort to get around this difficulty, the researchers hypothesized that global consistency properties could be transformed into local consistency invariants. If only local invariants need to be checked, runtime consistency checking becomes a more tractable problem.

They started with the assumption that the initial filesystem is consistent, and that something below the filesystem layer, like checksums, ensures that correct data reaches the disk. At runtime, then, it is only necessary to check that the local invariants are maintained by whatever data is being changed in any metadata writes. This checking happens before those changes become "durable", so they reason by induction that the filesystem resulting from those is also consistent. By keeping any inconsistent state changes from reaching the disk, the "Recon" system makes filesystem repair unnecessary.

As an example, ext3 maintains a bitmap of the allocated blocks, so to ensure consistency when a block is allocated, Recon needs to test that the proper bit in the bitmap flips from zero to one and that the pointer used is the correct one (i.e. it corresponds to the bit flipped). That is the "consistency invariant" for determining that the block has been allocated correctly. A bit in the bitmap can't be set without a corresponding block pointer being set and vice versa. Additional checks are done to make sure that the block had not already been allocated, for example. That requires that Recon maintain its own block bitmap.
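The block-allocation invariant described above can be written down as a short check. What follows is a simplified sketch under stated assumptions, not Recon's code: bitmaps are modeled as Python lists of 0/1, the pointers added by a transaction as a set of block numbers, and the function name `check_block_allocation` and its parameters are invented for illustration.

```python
def check_block_allocation(old_bitmap, new_bitmap, new_pointers, shadow_bitmap):
    """Return a list of invariant violations for one transaction.

    old_bitmap, new_bitmap: the filesystem's block bitmap (lists of 0/1)
        before and after the transaction.
    new_pointers: set of block numbers that gained a pointer in the transaction.
    shadow_bitmap: the checker's own record of which blocks are allocated.
    """
    violations = []

    # Blocks whose bit went from zero to one in this transaction.
    flipped = {
        block
        for block, (old, new) in enumerate(zip(old_bitmap, new_bitmap))
        if old == 0 and new == 1
    }

    # A bit cannot be set without a matching pointer, and vice versa.
    for block in sorted(flipped - new_pointers):
        violations.append(f"block {block}: bit set without a pointer")
    for block in sorted(new_pointers - flipped):
        violations.append(f"block {block}: pointer set without a bit flip")

    # The block must not already be in use according to the checker's bitmap.
    for block in sorted(new_pointers):
        if shadow_bitmap[block]:
            violations.append(f"block {block}: already allocated")

    return violations


# A consistent allocation of block 1 produces no violations.
print(check_block_allocation([1, 0, 0, 0], [1, 1, 0, 0], {1}, [1, 0, 0, 0]))
```

The check only looks at the data touched by the transaction plus the checker's own bitmap, which is what makes it a local invariant rather than a scan of the whole filesystem.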

These invariants (they came up with 33 of them for ext3) are checked at the transaction commit point. The design of Recon is based on a fundamental mistrust of the filesystem code and data structures, so it sits between the filesystem and the block layer. When the filesystem does a metadata write, Recon records that operation. Similarly, it caches the data from metadata reads, so that the invariants can be validated without excessive disk reads. When the commit of a metadata update is done, the read cache is updated if the invariants are upheld in the update.

When filesystem metadata is updated, Recon needs to determine what logical change is being performed. It does that by examining the metadata block to determine what type of block it is, and then does a "logical diff" of the changes. The result is a "logical change record" that records five separate fields for each change: block type, ID, the field that changed, the old value, and the new value. As an example, Goel listed the change records that might result from appending a block to inode 12:

Using those records, the invariants can be checked to ensure that the block pointer referenced in the inode is the same as the one that has its bit set in the bitmap, for example.
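The five-field record layout and the "logical diff" step might be sketched as follows. The field names, block numbers, and values here are hypothetical and are not the records Goel showed; metadata blocks are modeled as plain dictionaries, and bitmap entries are modeled one block at a time with a made-up `allocated` field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRecord:
    block_type: str   # e.g. "inode" or "bitmap"
    ident: int        # inode number, or block number for bitmap entries
    field: str        # name of the field that changed
    old: object       # value before the update
    new: object       # value after the update


def logical_diff(block_type, ident, before, after):
    """Compare two views of one metadata object and emit change records."""
    records = []
    for field in sorted(set(before) | set(after)):
        old, new = before.get(field), after.get(field)
        if old != new:
            records.append(ChangeRecord(block_type, ident, field, old, new))
    return records


def pointers_match_bitmap(records):
    """True if every newly set block pointer has a matching 0 -> 1 bitmap flip."""
    pointed = {
        r.new for r in records
        if r.block_type == "inode" and r.field.startswith("block_ptr")
        and r.new is not None
    }
    marked = {
        r.ident for r in records
        if r.block_type == "bitmap" and (r.old, r.new) == (0, 1)
    }
    return pointed == marked


# Hypothetical example: append block 501 to inode 12.
records = logical_diff(
    "inode", 12,
    {"block_ptr[1]": None, "i_size": 4096, "i_blocks": 8},
    {"block_ptr[1]": 501, "i_size": 8192, "i_blocks": 16},
)
records += logical_diff("bitmap", 501, {"allocated": 0}, {"allocated": 1})

for record in records:
    print(record)
print(pointers_match_bitmap(records))
```

Working from records rather than raw blocks keeps the invariant checks independent of the on-disk layout of any one block type.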

Currently, when any invariant is violated, the filesystem is stopped. Eventually there may be ways to try to fix the problems before writing to disk, but for now, the safe option is to stop any further writes.

Recon was evaluated by measuring how many consistency errors were detected by it vs. those caught by fsck. Recon caught quite a few errors that were not detected by fsck, while it only missed two that fsck caught. In both cases, the filesystem checker was looking at fields that are not currently used by ext3. Many of the inconsistencies that Recon found and fsck didn't were changes to unallocated data, which are not important from a consistency standpoint, but still should not be changed in a correctly operating filesystem.

There are some things that neither fsck nor Recon can detect, like changes to filenames in directories or time field changes in inodes. In both cases, there isn't any redundant information to do a consistency check against.

The performance impact of Recon is fairly modest, at least in terms of I/O operations. With a cache size of 128MB, Recon could handle a web server workload with only a reduction of approximately 2% I/O operations/second based on a graph that was shown. The cache size was tweaked to find a balance based on the working set size of the workload so that the cache would not be flushed prematurely, which would otherwise cause expensive reads of the metadata information. The tests were run on a filesystem on a 1TB partition with 15-20GB of random files according to Fryer, and used small files to try to stress the metadata cache.

No data was presented on the CPU impact of Recon, other than to say that there was "significant" CPU overhead. Their focus was on the I/O cost, so more investigation of the CPU cost is warranted. Based on comments from the audience, though, some would be more than willing to spend some CPU in the name of filesystem consistency so that the far more expensive offline checking could be avoided in most cases.

The most important thing to take away from the talk, Goel said, is that as long as the integrity of written block data is assured, all of the ext3 properties that can be checked by fsck can instead be checked at runtime. As Ric Wheeler and others in the audience pointed out, that doesn't eliminate the need for an offline checker, but it may help reduce how often it's needed. Goel agreed with that, and noted that in 4% of their tests with corrupted filesystems, fsck would complete successfully, but that a second run would find more things to fix. Ts'o was very interested to hear that and asked that they file bugs for those cases.

There is ongoing work on additional consistency invariants as well as things like reducing the memory overhead and increasing the number of filesystems that are covered. Dave Chinner noted that invariants for some filesystems may be hard to come up with, especially for filesystems like XFS that don't necessarily do metadata updates through the page cache.

The reaction to Recon was favorable overall. It is an interesting project, and some were surprised that runtime consistency checking was possible at all. As always, there is more to do, and the team has limited resources, but most attendees seemed favorably impressed with the work.

[Many thanks are due to Mel Gorman for sharing his notes from this session.]

Page editor: Jonathan Corbet


Libsecret revealed

April 4, 2012

This article was contributed by Nathan Willis

GNOME developer Stef Walter has started work on a new client library for interacting with the secret-storage subsystems that allow users to keep track of passwords, encryption keys, and other sensitive data. The new project is named libsecret, and will enable more applications to connect to the GNOME and KDE "secret services," as well as fix longstanding threading and notification problems.

GNOME and KDE each provide their own mechanisms for storing, editing, and retrieving saved secrets; GNOME uses the gnome-keyring-daemon, and KDE uses ksecretservice (evidently there is no formal project page, but a ksecretservice-devel mailing list exists). However, both systems conform to the same standard, the Secret Service API. The API defines a "secret item" as a sensitive string needing protection (such as a password or key) along with an array of attributes. It also provides a way to group secret items into collections, each of which can be locked (i.e. encrypted) or unlocked separately.

Typically a desktop environment creates a default collection for each user account, unlocking it at login time, and locking it again when the session ends. Everyday applications like email clients or WiFi network applets can communicate with the service over D-Bus to securely store and retrieve credentials. However, the API also enables specialty applications (such as the Seahorse encryption key manager) to do more, like implement an interface through which users can create and manage their own collections at will.

Building a better client library

Previous GNOME releases exposed gnome-keyring-daemon to applications via the libgnome-keyring library. Libsecret is designed to be a more modern replacement for libgnome-keyring — the secret service daemon itself will remain unchanged. Walter announced the libsecret project on the GNOME desktop-devel list on March 26, noting that the new project would improve on libgnome-keyring by being thread-safe, introspectable, and properly asynchronous. The code is hosted on GNOME's Git repository, while the documentation currently resides in Walter's personal web space.

In an email, Walter elaborated on the shortcomings in libgnome-keyring that warranted writing a replacement from scratch. At its core, he said, libgnome-keyring was built around its own binary protocol. Support for speaking the Secret Service API over D-Bus was added later, but ultimately the underlying code needed to go. Libsecret is designed from the ground up for Secret Service and D-Bus, not only implementing the full API, but doing away with the custom bits. As a result, Walter said, applications will in theory be able to use libsecret to communicate with any Secret Service implementation, including ksecretservice (although so far this has not been tested).

The older API, Walter said, did not include support for change notifications from gnome-keyring-daemon. As a result, applications needed to restart in order to see changes to collections and secret items.

Libgnome-keyring had several other technical limitations, not related to the API. First, it was not thread-safe. Since multiple client applications may need to access password storage (including some that may do so with multiple threads) simultaneously, this was a serious impediment. Second, libgnome-keyring could not use GObject introspection (introduced in GTK+ 3.0), which made developing for it in JavaScript or Python impossible. Walter has added JavaScript and Python examples to the libsecret documentation illustrating the creation of a secret schema and storing, retrieving, and deleting a password.

The API rollout

Walter's documentation breaks the libsecret API into two parts. The first is a simple password API, which he has declared stable. The second is a more general API for "power user" applications, which he plans to continue to develop over the course of the GNOME 3.6 development cycle. The plan is to have the complete API finalized in time for 3.6, and patches deployed to migrate existing core GNOME applications over to libsecret prior to the release.

The simple password API defines methods for storing a secret item in a collection, searching the collection for a secret item, retrieving the secret, and deleting the item from storage. There are both synchronous and asynchronous calls; GUI apps can access the asynchronous methods to avoid blocking while waiting for a response from the Secret Service, while non-GUI applications are expected to use the synchronous methods.

The exact makeup of a secret item is defined by a schema. Every item contains a "secret" plus an optional list of attributes that may vary depending on the type of secret. Secret Services store all attributes as string data (in key-value pairs), but the values may be marked as boolean or integer types so that applications can properly interpret them. By default, retrieving a secret item begins with the application initiating a search against a particular collection (either the default collection, or a specified alternative) for a secret matching some search term. Libsecret assumes that the typical search will include the schema name (e.g., "password" or "key") desired, as the different schemas generally represent disjoint use-cases for secret storage, but this can be turned off with a flag.
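The store, search, and delete flow, together with the rule that attributes are kept as strings, can be modeled with a small in-memory class. This is a conceptual sketch only: it does not use libsecret or talk to any Secret Service over D-Bus. The class `InMemorySecretStore`, its method names, the `"true"`/`"false"` encoding of booleans, and the example schema name are all assumptions made for illustration.

```python
class InMemorySecretStore:
    """Toy model of a single collection; nothing here is encrypted."""

    def __init__(self):
        self._items = []   # list of (attributes, secret) pairs

    @staticmethod
    def _encode(schema_name, attributes):
        """Store every attribute as a string, as a Secret Service does."""
        encoded = {"schema": schema_name}
        for key, value in attributes.items():
            if isinstance(value, bool):        # test bool first: bool is an int
                encoded[key] = "true" if value else "false"
            else:
                encoded[key] = str(value)
        return encoded

    def _matches(self, stored, wanted):
        return all(stored.get(key) == value for key, value in wanted.items())

    def store(self, schema_name, attributes, secret):
        """Save a secret, replacing any item with identical attributes."""
        key = self._encode(schema_name, attributes)
        self._items = [(a, s) for a, s in self._items if a != key]
        self._items.append((key, secret))

    def lookup(self, schema_name, attributes):
        """Return the first secret whose attributes include the search terms."""
        wanted = self._encode(schema_name, attributes)
        for stored, secret in self._items:
            if self._matches(stored, wanted):
                return secret
        return None

    def clear(self, schema_name, attributes):
        """Delete matching items; return True if anything was removed."""
        wanted = self._encode(schema_name, attributes)
        kept = [(a, s) for a, s in self._items if not self._matches(a, wanted)]
        removed = len(kept) != len(self._items)
        self._items = kept
        return removed


store = InMemorySecretStore()
store.store("org.example.Password",
            {"server": "mail.example.org", "port": 993, "use_ssl": True},
            "not-a-real-password")
print(store.lookup("org.example.Password", {"port": 993}))
```

Including the schema name in every search mirrors the default behavior described above; a real client would also have to choose between the synchronous and asynchronous variants of each call.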

The simple password API does not explore using the library for encryption keys or other secret items, but the complete API supports these data types, and libsecret eventually will. The Secret Service API recommends that applications use human-readable text as the storage format for secrets, but this is also not a strict requirement. Applications can even use secret items to store compound data by encoding it in XML or another markup language.

The complete libsecret API also provides methods for an application to create, unlock, access, and lock collections, and to prompt the user for input (such as the password required to unlock a particular collection, or what to name a new collection). There will be support for a session collection that is automatically deleted at logout (which would be useful for storing temporary credentials like login passwords for mounting remote storage).

By itself, libsecret is neutral about the Secret Service used, which can have implications for the application developer. Libsecret, for example, does not guarantee that secret items are stored securely in the service, or that they are transferred to the client application in a secure manner. The Secret Service API allows a service and an application to negotiate an encryption algorithm for transferring secrets, but libsecret does not (yet) support this. However, as the Secret Service documentation points out, there may be other methods an application can use to minimize the risk of exposure, such as using mlock() to prevent memory pages from being written to swap.

Libsecret is still in its infancy, but the prospect of a modern password-and-key-storage mechanism is a refreshing one. Perhaps the most interesting new outcome would be more applications that can seamlessly support gnome-keyring-daemon and ksecretservice, but with the increasing presence of desktop applications written in JavaScript, an API accessible to that language is certainly valuable, too. Shortly after Walter posted the announcement about libsecret, he pushed out updates to the Seahorse, libcryptui, and gnome-keyring packages as well. Those updates were for GNOME 3.4 and thus were not changes for libsecret support, but that effort for those utilities should begin any day.

Brief items

Security quotes of the week

What "Girls Around Me" does is make clear just how useless Facebook's security settings are. In theory if you know what you're doing you can disclose your personal information to Facebook and prevent FB from sharing it with strangers. But in practice ordinary people are not all Bruce Schneier. Ordinary people with Facebook accounts tend to over-share personal information because our social instincts encourage us to share information with everyone we can see, and to discount abstractions (such as the possibility that software bots thousands of miles away might be harvesting the photographs and information we put online in order to better target advertisements at us—or worse).
-- Charlie Stross

He wants us to trust that a 400-ml bottle of liquid is dangerous, but transferring it to four 100-ml bottles magically makes it safe. He wants us to trust that the butter knives given to first-class passengers are nevertheless too dangerous to be taken through a security checkpoint. He wants us to trust the no-fly list: 21,000 people so dangerous they’re not allowed to fly, yet so innocent they can’t be arrested. He wants us to trust that the deployment of expensive full-body scanners has nothing to do with the fact that the former secretary of homeland security, Michael Chertoff, lobbies for one of the companies that makes them. He wants us to trust that there’s a reason to confiscate a cupcake (Las Vegas), a 3-inch plastic toy gun (London Gatwick), a purse with an embroidered gun on it (Norfolk, VA), a T-shirt with a picture of a gun on it (London Heathrow) and a plastic lightsaber that’s really a flashlight with a long cone on top (Dallas/Fort Worth).
-- Bruce Schneier continues his debate with former TSA administrator Kip Hawley

New vulnerabilities

aptdaemon: installs altered packages

Package(s): aptdaemon
CVE #(s): CVE-2012-0944
Created: April 2, 2012
Updated: April 4, 2012
Description: From the Ubuntu advisory:

It was discovered that Aptdaemon incorrectly handled installing packages without performing a transaction simulation. An attacker could possibly use this flaw to install altered packages.

Ubuntu USN-1414-1 aptdaemon 2012-04-02

chromium: multiple vulnerabilities

Package(s): chromium
CVE #(s): CVE-2011-3058 CVE-2011-3059 CVE-2011-3060 CVE-2011-3061 CVE-2011-3062 CVE-2011-3063 CVE-2011-3064 CVE-2011-3065
Created: April 2, 2012
Updated: October 26, 2012
Description: From the CVE entries:

Google Chrome before 18.0.1025.142 does not properly handle the EUC-JP encoding system, which might allow remote attackers to conduct cross-site scripting (XSS) attacks via unspecified vectors. (CVE-2011-3058)

Google Chrome before 18.0.1025.142 does not properly handle SVG text elements, which allows remote attackers to cause a denial of service (out-of-bounds read) via unspecified vectors. (CVE-2011-3059)

Google Chrome before 18.0.1025.142 does not properly handle text fragments, which allows remote attackers to cause a denial of service (out-of-bounds read) via unspecified vectors. (CVE-2011-3060)

Google Chrome before 18.0.1025.142 does not properly check X.509 certificates before use of a SPDY proxy, which might allow man-in-the-middle attackers to spoof servers or obtain sensitive information via a crafted certificate. (CVE-2011-3061)

Off-by-one error in the OpenType Sanitizer in Google Chrome before 18.0.1025.142 allows remote attackers to cause a denial of service or possibly have unspecified other impact via a crafted OpenType file. (CVE-2011-3062)

Google Chrome before 18.0.1025.142 does not properly validate the renderer's navigation requests, which has unspecified impact and remote attack vectors. (CVE-2011-3063)

Use-after-free vulnerability in Google Chrome before 18.0.1025.142 allows remote attackers to cause a denial of service or possibly have unspecified other impact via vectors related to SVG clipping. (CVE-2011-3064)

Skia, as used in Google Chrome before 18.0.1025.142, allows remote attackers to cause a denial of service (memory corruption) or possibly have unspecified other impact via unknown vectors. (CVE-2011-3065)

openSUSE openSUSE-SU-2014:1100-1 Firefox 2014-09-09
Gentoo 201301-01 firefox 2013-01-07
Mageia MGASA-2012-0324 webkit 2012-11-06
Ubuntu USN-1617-1 webkit 2012-10-25
Ubuntu USN-1430-3 thunderbird 2012-05-04
SUSE SUSE-SU-2012:0580-1 Mozilla Firefox 2012-05-02
SUSE SUSE-SU-2012:0688-1 MozillaFirefox 2012-06-02
Ubuntu USN-1430-2 ubufox 2012-04-27
Ubuntu USN-1430-1 firefox 2012-04-27
openSUSE openSUSE-SU-2012:0567-1 firefox, thunderbird, seamonkey, xulrunner 2012-04-27
Mandriva MDVSA-2012:066 mozilla 2012-04-27
Oracle ELSA-2012-0516 thunderbird 2012-04-25
Oracle ELSA-2012-0515 firefox 2012-04-25
Scientific Linux SL-fire-20120425 firefox 2012-04-25
Scientific Linux SL-thun-20120425 thunderbird 2012-04-25
CentOS CESA-2012:0516 thunderbird 2012-04-25
CentOS CESA-2012:0516 thunderbird 2012-04-24
CentOS CESA-2012:0515 firefox 2012-04-25
Red Hat RHSA-2012:0516-01 thunderbird 2012-04-24
Red Hat RHSA-2012:0515-01 firefox 2012-04-24
openSUSE openSUSE-SU-2012:0492-1 chromium 2012-04-12
Gentoo 201203-24 chromium 2012-03-30
Ubuntu USN-1430-4 apparmor 2012-06-12

drupal6-date: unspecified vulnerabilities

Package(s): drupal6-date
CVE #(s): (none listed)
Created: April 2, 2012
Updated: April 4, 2012
Description: Drupal6-date 2.8 evidently fixes some security issues.

Fedora FEDORA-2012-4616 drupal6-date 2012-04-01
Fedora FEDORA-2012-4606 drupal6-date 2012-04-01

flash-player: code execution

Package(s): flash-player
CVE #(s): CVE-2012-0773
Created: March 29, 2012
Updated: April 4, 2012
Description: From the Adobe advisory:

This update resolves a memory corruption vulnerability in the NetStream class that could lead to code execution (CVE-2012-0773).

SUSE SUSE-SU-2012:0437-1 flash-player 2012-03-30
Red Hat RHSA-2012:0434-01 flash-plugin 2012-03-29
openSUSE openSUSE-SU-2012:0427-1 flash-player 2012-03-29

freeradius: authentication bypass

Package(s): freeradius
CVE #(s): CVE-2011-2701
Created: April 2, 2012
Updated: April 4, 2012
Description: From the Mandriva advisory:

The ocsp_check function in rlm_eap_tls.c in FreeRADIUS 2.1.11, when OCSP is enabled, does not properly parse replies from OCSP responders, which allows remote attackers to bypass authentication by using the EAP-TLS protocol with a revoked X.509 client certificate.

Gentoo 201311-09 freeradius 2013-11-13
Mandriva MDVSA-2012:047 freeradius 2012-04-02

libpng: code execution

Package(s): libpng
CVE #(s): CVE-2011-3048
Created: April 2, 2012
Updated: April 26, 2012
Description: A memory corruption bug, possibly enabling arbitrary code execution, has been found and corrected in libpng.

Slackware SSA:2012-206-01 libpng 2012-07-24
Gentoo 201206-15 libpng 2012-06-22
Oracle ELSA-2012-0523 libpng 2012-04-25
Scientific Linux SL-libp-20120425 libpng 2012-04-25
CentOS CESA-2012:0523 libpng 2012-04-25
Red Hat RHSA-2012:0523-01 libpng 2012-04-25
Fedora FEDORA-2012-5515 libpng 2012-04-24
Fedora FEDORA-2012-5518 libpng 2012-04-24
openSUSE openSUSE-SU-2012:0491-1 libpng 2012-04-12
Fedora FEDORA-2012-5079 libpng10 2012-04-08
Fedora FEDORA-2012-5080 libpng10 2012-04-08
Ubuntu USN-1417-1 libpng 2012-04-05
Debian DSA-2446-1 libpng 2012-04-04
Mandriva MDVSA-2012:046 libpng 2012-04-02

nova: denial of service

Package(s): nova
CVE #(s): CVE-2012-1585
Created: March 29, 2012
Updated: April 9, 2012
Description: From the Ubuntu advisory:

Dan Prince discovered that Nova did not properly perform input validation on the length of server names. An authenticated attacker could issue requests using long server names to exhaust the storage resources containing the Nova API log file.

Fedora FEDORA-2012-5026 openstack-nova 2012-04-08
Ubuntu USN-1413-1 nova 2012-03-29

phpmyadmin: multiple vulnerabilities

Package(s): phpmyadmin
CVE #(s): CVE-2012-1190 CVE-2012-1902
Created: April 3, 2012
Updated: May 1, 2012
Description: From the Mandriva advisory:

It was possible to conduct XSS using a crafted database name (CVE-2012-1190).

The show_config_errors.php scripts did not validate the presence of the configuration file, so an error message shows the full path of this file, leading to possible further attacks (CVE-2012-1902).

Fedora FEDORA-2012-5631 phpMyAdmin 2012-05-01
Fedora FEDORA-2012-5624 phpMyAdmin 2012-05-01
openSUSE openSUSE-SU-2012:0494-1 phpMyAdmin 2012-04-12
Mandriva MDVSA-2012:050 phpmyadmin 2012-04-03

php-pear-CAS: multiple vulnerabilities

Package(s): php-pear-CAS
CVE #(s): CVE-2012-1104 CVE-2012-1105
Created: April 2, 2012
Updated: April 4, 2012
Description: From the Red Hat bugzilla entries [1], [2]:

1) A security flaw was found in the way phpCAS managed proxying of services. In the default configuration a phpCAS protected application allowed to proxy any other CAS service with proxy authorization and valid user credentials in the same SSO realm to other phpCAS applications. The application, CAS services has been proxied to, could use this flaw to in unauthorized way to use these CAS services.

2) An information disclosure flaw was found in the way phpCAS, the Central Authentication Service client library in PHP language, performed archiving of debug logging file in the default debug configuration and archiving of proxy configuration session data. Both of the files were archived in /tmp directory in files with unsafe permissions. A local attacker could use this flaw to obtain private user attributes and sensitive login tokens by inspecting content of those archived files.

Fedora FEDORA-2012-4077 php-pear-CAS 2012-03-31
Fedora FEDORA-2012-4119 php-pear-CAS 2012-03-31

pidgin: multiple vulnerabilities

Package(s): pidgin
CVE #(s): CVE-2011-4939 CVE-2012-1178
Created: April 2, 2012
Updated: March 15, 2013
Description: From the Red Hat bugzilla entries [1, 2]:

CVE-2011-4939: A NULL pointer dereference flaw was found in the way XMPP protocol plug-in of Pidgin, a Gtk+ based multiprotocol instant messaging client, performed change of user name for particular buddy. If a remote Pidgin user, present on the buddy list of the victim, changed their Pidgin nickname to specially-crafted value it would lead to Pidgin client crash.

CVE-2012-1178: A denial of service flaw was found in the way MSN protocol plug-in of Pidgin, a Gtk+ based multiprotocol instant messaging client, performed sanitization of certain not UTF-8 encoded text prior its presentation. A remote attacker could send a specially-crafted not UTF-8 encoded text (for example via Offline Instant Message post), which once processed by the Pidgin client of the victim would lead to that Pidgin client abort.

Oracle ELSA-2013-0646 pidgin 2013-03-14
openSUSE openSUSE-SU-2012:0905-1 pidgin 2012-07-24
Scientific Linux SL-pidg-20120719 pidgin 2012-07-19
Oracle ELSA-2012-1102 pidgin 2012-07-20
CentOS CESA-2012:1102 pidgin 2012-07-19
Red Hat RHSA-2012:1102-01 pidgin 2012-07-19
Ubuntu USN-1500-1 pidgin 2012-07-09
SUSE SUSE-SU-2012:0782-1 finch, libpurple and pidgin 2012-06-22
Fedora FEDORA-2012-4600 pidgin 2012-04-01

rpm: code execution

Package(s): rpm
CVE #(s): CVE-2012-0060 CVE-2012-0061 CVE-2012-0815
Created: April 4, 2012
Updated: May 7, 2012
Description: The rpm utility has several parsing flaws that can be exploited via a malicious package file to crash the tool or execute arbitrary code. Importantly, the exploit can happen before the validation of the package file's digital signature, so the checks that would normally stop a hostile package file are ineffective here.

Debian-LTS DLA-140-1 rpm 2015-01-28
Ubuntu USN-1695-1 rpm 2013-01-17
Gentoo 201206-26 rpm 2012-06-24
openSUSE openSUSE-SU-2012:0589-1 rpm, rpm-python 2012-05-07
openSUSE openSUSE-SU-2012:0588-1 rpm, rpm-python 2012-05-07
Fedora FEDORA-2012-5420 rpm 2012-04-22
Fedora FEDORA-2012-5421 rpm 2012-04-22
Oracle ELSA-2012-0451 rpm 2012-04-17
Mandriva MDVSA-2012:056 rpm 2012-04-12
Scientific Linux SL-rpm-20120404 rpm 2012-04-04
Oracle ELSA-2012-0451 rpm 2012-04-03
CentOS CESA-2012:0451 rpm 2012-04-03
Red Hat RHSA-2012:0451-01 rpm 2012-04-03

tryton-server: privilege escalation

Package(s): tryton-server
CVE #(s): CVE-2012-0215
Created: March 29, 2012
Updated: April 9, 2012
Description: From the Debian advisory:

It was discovered that the Tryton application framework for Python allows authenticated users to escalate their privileges by editing the Many2Many field.

Fedora FEDORA-2012-4988 trytond 2012-04-08
Fedora FEDORA-2012-4963 trytond 2012-04-08
Debian DSA-2444-1 tryton-server 2012-03-29

typo3-src: multiple vulnerabilities

Package(s): typo3-src
CVE #(s): CVE-2012-1606 CVE-2012-1607 CVE-2012-1608
Created: April 2, 2012
Updated: April 4, 2012
Description: From the Debian advisory:

CVE-2012-1606: Failing to properly HTML-encode user input in several places, the TYPO3 backend is susceptible to Cross-Site Scripting. A valid backend user is required to exploit these vulnerabilities.

CVE-2012-1607: Accessing a CLI Script directly with a browser may disclose the database name used for the TYPO3 installation.

CVE-2012-1608: By not removing non printable characters, the API method t3lib_div::RemoveXSS() fails to filter specially crafted HTML injections, thus is susceptible to Cross-Site Scripting.

Debian DSA-2445-1 typo3-src 2012-03-31

Page editor: Jake Edge

Kernel development

Brief items

Kernel release status

The current development kernel is 3.4-rc1, released by Linus on March 31. See the separate article below for a summary of the final changes merged for this development cycle.

Stable updates: the 3.0.27, 3.2.14, and 3.3.1 updates were released on April 2; they contain the usual long list of important fixes.

Quotes of the week

Publicly making fun of people is half the fun of open source programming.

In fact, the real reason to eschew programming in closed environments is that you can't embarrass people in public.

-- Linus Torvalds

+ * Wikipedia: "The current (13th) b'ak'tun will end, or be completed, on
+ * (December 21, 2012 using the GMT correlation".  GMT or
+ * Mexico/General? What's 6 hours between Mayans friends.. let's follow
+ * 'Mexican time' rules.  You might get 6 more hours of reading your
+ * mail, but don't count on it.
+ */
+#define END_13BAKTUN	1356069600
+extern int	emulatemayanprophecy;	/* End time before the Mayans do */
-- Theo de Raadt; Linux remains unprepared

Maybe I should ask the next person who submits a new architecture to do that work, that's usually how progress in asm-generic happens these days.
-- Arnd Bergmann

Although there have been numerous complaints about the complexity of parallel programming (especially over the past 5-10 years), the plain truth is that the incremental complexity of parallel programming over that of sequential programming is not as large as is commonly believed. Despite what you might have heard, the mind-numbing complexity of modern computer systems is not due so much to there being multiple CPUs, but rather to there being any CPUs at all. In short, for the ultimate in computer-system simplicity, the optimal choice is NR_CPUS=0.
-- Paul McKenney

OSADL on realtime Linux determinism

The Open Source Automation Development Lab has posted a press release celebrating a full year's worth of testing of latencies on several systems running the realtime preemption kernel. "Each graph consists of more than 730 latency plots put before one another with the time scale running from back to front. A latency plot displays the number of samples within a given latency class (resolution 1 µs). The logarithmic frequency values at the y-scale ensure that even a single outlier would be displayed (for details of the test procedures and the load scenarios please refer to this description). The absence of any outlier in all the very different systems clearly demonstrates that the perfect determinism of the mainline Linux real-time kernel is a generic feature; it is not restricted to any particular architecture." OSADL is an industry consortium dedicated to encouraging the development and use of Linux in automated systems.

Kernel development news

The conclusion of the 3.4 merge window

By Jonathan Corbet
April 3, 2012
Linus announced the 3.4-rc1 release and the closing of the merge window on March 31. At the outset, he had said that this merge window could run a little longer than usual; in fact, at 13 days, it was slightly shorter. One should not conclude that there was not much to pull, though; some 9,248 non-merge changesets went into the mainline before 3.4-rc1, and a couple of significant features have sneaked their way in afterward as well.

User-visible features merged since last week's summary include:

  • The device mapper "thin provisioning" target now supports discard requests, a feature which should help it to use the underlying storage more efficiently.

  • The dm-verity device mapper target has been merged. This target manages a read-only device where all blocks are checked against a cryptographic hash maintained elsewhere; it thus provides a certain degree of tampering detection. Details can be found in Documentation/device-mapper/verity.txt.

  • Support for the x32 ABI has been merged into the kernel. Getting support into the compiler and the C library is an ongoing project, and the creation of distributions using this ABI will take even longer, but the foundation, at least, is now in place.

  • The "high-speed synchronous serial interface" (HSI) framework has been merged. HSI is an interface that is mainly used to connect processors with cellular modem engines; it will be used for handset support in future kernel releases.

  • New drivers include:

    • Processors and platforms: Samsung EXYNOS5 SoCs, and NVIDIA Tegra3 SoCs.

    • Flash: SMI-attached SPEAR MTD NOR controllers, DiskOnChip G4 NAND flash devices, and Universal Flash Storage host controllers (details in Documentation/scsi/ufs.txt).

    • Miscellaneous: Apple "gmux" display multiplexers, Intel Sodaville GPIO controllers, TI TPS65217 and TPS65090 power management controllers, Ricoh RC5T583 power management system devices, Freescale i.MX on-chip ANATOP controllers, Summit Microelectronics SMB347 battery chargers, and ST Ericsson AB8500 battery management controllers.
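The dm-verity target listed above checks each block of a read-only device against a hash kept elsewhere. The following is a minimal Python sketch of that idea only; it is not dm-verity's on-disk format or its kernel interface, and the flat list of per-block SHA-256 digests, the 4096-byte `BLOCK_SIZE`, and the function names are assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch

def build_hashes(device: bytes):
    """Return one SHA-256 digest per block of a read-only device image."""
    return [hashlib.sha256(device[off:off + BLOCK_SIZE]).digest()
            for off in range(0, len(device), BLOCK_SIZE)]

def read_block(device: bytes, hashes, index: int) -> bytes:
    """Return block `index`, raising IOError if it does not match its digest."""
    block = device[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
    if hashlib.sha256(block).digest() != hashes[index]:
        raise IOError(f"block {index} failed verification")
    return block
```

Because the digests are held separately from the data, changing a single byte on the device causes the next read of that block to fail rather than silently return altered contents.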

Changes visible to kernel developers include:

  • The "common clock framework" unifies the handling of subsystem clocks, especially on the ARM architecture (though it is not limited to ARM). See Documentation/clk.txt for more information.

  • The DMA buffer sharing API has been extended to allow CPU access to the buffers; see the updated Documentation/dma-buf-sharing.txt file for details.

  • The direct rendering subsystem has gained initial support for the DMA buffer sharing mechanism. No drivers use it yet, but having this support in the mainline will ease the development of driver support for future kernels.

  • The massive <asm/system.h> include file has been split into several smaller files and removed; in-tree users have been fixed.

  • The new /proc/dma-mappings file on the ARM architecture displays the currently-active coherent DMA mappings. Since such mappings tend to be in short supply on ARM, this can be a useful debugging tool.

  • The ARM architecture has gained jump label ("static branch") support.

  • The just-in-time compiler for BPF packet filters has been ported to the ARM architecture.

There are a couple of other features that Linus may still be considering merging as of this writing, though the chances of them getting in would appear to be diminishing. One is the DMA mapping rework; Linus has been asking for potential users of this change to speak up, but few have done so. In other words, if there are developers out there who would like to see the improved DMA subsystem in the 3.4 release, you are running out of time to make that desire known. The other is POHMELFS, which has had some review snags and which also seems to lack a vocal community clamoring for its inclusion.

Beyond those possibilities, though, the time for new features to go into the 3.4 development cycle has now passed. The stabilization process has begun, with a probable final release in late May or early June.

2012 Linux Storage, Filesystem, and Memory Management Summit - Day 1

By Jake Edge
April 3, 2012

Day one of the Linux Storage, Filesystem, and Memory Management Summit (LSFMMS) was held in San Francisco on April 1. What follows is a report on the combined and MM sessions from the day largely based on Mel Gorman's write-ups, with some editing and additions from my own notes. In addition, James Bottomley sat in on the Filesystem and Storage discussions and his (lightly edited) reports are included as well. The plenary session from day one, on runtime filesystem consistency checking, was covered in a separate article.


Fengguang Wu began by enumerating his work on improving the writeback situation and instrumenting the system to get better information on why writeback is initiated. James Bottomley quickly pointed out that writeback has been discussed for several years at LSFMMS and asked specifically where things stand right now. Unfortunately, many people spoke at the same time, some without microphones, making the discussion difficult to follow. They did focus on how and when sync takes place, what impact it has, and whether anyone should care about how dd benchmarks behave. The bulk of the comments focused on the fairness of dealing with multiple syncs coming from multiple sources. Ironically, despite the clarity of the question, the discussion was vague. Because audience members did not use concrete examples, it could only be concluded that "on some filesystems for some workloads depending on what they do, writeback may do something bad".

Wu brought it back on topic by focusing on I/O-less dirty throttling and the complexities that it brings. However, the intention is to minimize seeks, and to provide less lock contention and low latency. He maintains that there were some impressive performance gains with some minor regressions. There are issues around integration with task/cgroup I/O controllers but considering the current state of I/O controllers, this was somewhat expected.

Bottomley asked about how much complexity this added; Dave Chinner pointed out that the complexity of the code was irrelevant because the focus should be on the complexity of the algorithm. Wu countered that the coverage of his testing was pretty comprehensive, covering a wide range of hardware, filesystems, and workloads.

For dirty reclaim, there is now a greater focus on pushing pageout work to the flusher threads with some effort to improve interactivity by focusing dirty reclaim on the tasks doing the dirtying. He stated that dirty pages reaching the end of the LRU are still a problem and suggested the creation of a dirty LRU list. With current kernels, dirty pages are skipped over by direct reclaimers, which increases CPU cost, making it a problem that varies between kernel versions. Moving them to a separate list unfortunately requires a page flag which is not readily available.

Memory control groups bring their own issues with writeback, particularly around flusher fairness. This is currently beyond control, with only coarse options available such as limiting the number of operations that can be performed on a per-inode basis or limiting the amount of I/O that can be submitted. There was mention of throttling based on the amount of I/O a process completed, but it was not clear how this would work in practice.

The final topic was on the block cgroup (blkcg) I/O controller and the different approaches to throttling based on I/O operations/second (IOPS) and access to disk time. Buffered writes are a problem, as is how they could possibly be handled via balance_dirty_pages(). A big issue with throttling buffered writes is still identifying the I/O owner and throttling them at the time the I/O is queued, which happens after the I/O owner has already executed a read() or write(). There was a request to clarify what the best approach might be but there were few responses. As months, if not years, of discussion on the lists imply, it is just not a straightforward topic and it was suggested that a spare slot be stolen to discuss it further (see the follow-up in the filesystem and storage sessions below).

At the end, Bottomley wanted an estimate of how close writeback was to being "done". After some hedging, Wu estimated that it was 70% complete.

Stable pages

The problems surrounding stable pages were the next topic under discussion. As was noted by Ted Ts'o, making writing processes wait for writeback to complete on stable pages can lead to unexpected and rather long latencies, which may be unacceptable for some workloads. Stable pages are only really needed for some systems where things like checksums calculated on the page require that the page be unchanged when it actually gets written.
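The checksum problem described above can be illustrated with a small Python simulation. It is a sketch under simplifying assumptions, not kernel code: `write_with_checksum()` and `verify()` are hypothetical names, CRC32 stands in for whatever checksum the storage stack uses, and "writeback" is reduced to copying a buffer.

```python
import zlib

def write_with_checksum(page: bytearray, modify_during_writeback=None):
    """Simulate writeback: checksum the page first, then send its contents out."""
    checksum = zlib.crc32(page)        # computed when writeback starts
    if modify_during_writeback:
        modify_during_writeback(page)  # a writer touches the page mid-flight
    on_disk = bytes(page)              # the data that actually reaches storage
    return checksum, on_disk

def verify(checksum, on_disk):
    """Return True if the stored data still matches the stored checksum."""
    return zlib.crc32(on_disk) == checksum
```

If the page is left alone during writeback, `verify()` succeeds; if a process modifies it after the checksum is computed, the data and checksum on storage disagree. Making writers wait, or copying the page, are two ways of preventing that window.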

Sage Weil and Boaz Harrosh listed the three options for handling the problem. The first was to reissue the write for pages that have changed while they were undergoing writeback, but that can confuse some storage systems. Waiting on the writeback (which is what is currently done) or doing a copy-on-write (COW) of the page under writeback were the other two. The latter option was the initial focus of the discussion.

James Bottomley asked if the cost of COW-ing the pages had been benchmarked and Weil said that it hadn't been. Weil and Harrosh are interested in workloads that really require stable writes and whether they were truly affected by waiting for the writeback to complete. Weil noted that Ts'o can just turn off stable pages, which fixes his problem. Bottomley asked: could there just be a mount flag to turn off stable pages? Another way to approach that might be to have the underlying storage system inform the filesystem if it needed stable writes or not.

Since waiting on writeback for stable pages introduces a number of unexpected issues, there is a question of whether replacing it with something with a different set of issues is the right way to go. The COW proposal may lead to problems because it results in there being two pages for the same storage location floating around. In addition, there are concerns about what would happen for a file that gets truncated after its pages have been copied, and how to properly propagate that information.

It is unclear whether COW would always be a win over waiting, so Bottomley suggested that the first step should be to get some reporting added into the stable writeback path to gather information on what workloads are being affected and what those effects are. After that, someone could flesh out a proposal on how to implement the COW solution that described how to work out the various problems and corner cases that were mentioned.

Memory vs. performance

While the topic name of Dan Magenheimer's slot, "Restricting Memory Usage with Equivalent Performance", was not of his choosing, that didn't deter him from presenting a problem for memory management developers to consider. He started by describing a graph of the performance of a workload as the amount of RAM available to it increases. Adding RAM reduces the amount of time the workload takes, to a certain point. After that point, adding more memory has no effect on the performance.

It is difficult or impossible to know the exact amount of RAM required to optimize the performance of a workload, he said. Two virtual machines on a single host are sharing the available memory, but one VM may need the additional memory that the other does not really need. Some kind of balance point between the workloads being handled by the two VMs needs to be found. Magenheimer has some ideas on ways to think about the problem that he described in the session.

He started with an analogy of two countries, one of which wants resources that the other has. Sometimes, especially in the past, that has meant going to war, but more recently economic solutions rather than violence have been used to allocate the resource. He wonders if a similar mechanism could be used in the kernel. There are a number of sessions in the memory management track that are all related to the resource allocation problem, he said, including memory control group soft limits, NUMA balancing, and ballooning.

The top-level question is how to determine how much memory an application actually needs vs. how much it wants. The idea is to try to find the point where giving some memory to another application has a negligible performance impact on the giver while the other application can use it to increase its performance. Beyond tracking the size of the application, Magenheimer posited that one could use calculus and calculate the derivative of the size growth to gain an idea of the "velocity" of the workload. Rik van Riel noted that this information could be difficult to track when the system is thrashing, but Magenheimer thought that tracking refaults could help with that problem.
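The "velocity" idea amounts to estimating the derivative of workload size over time. A minimal sketch using finite differences follows; the `growth_velocity()` name and the sample format of (time in seconds, size in bytes) pairs are assumptions for illustration, not anything presented in the session.

```python
def growth_velocity(samples):
    """Finite-difference estimate of d(size)/dt from (time_s, size_bytes) samples."""
    return [(s1 - s0) / (t1 - t0)
            for (t0, s0), (t1, s1) in zip(samples, samples[1:])]
```

A workload whose velocity has dropped to zero has stopped growing, which hints that it may already hold the memory it needs; a large positive velocity suggests it still wants more.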

Ultimately, Magenheimer wants to apply these ideas to RAMster, which allows machines to share "unused" memory between them. RAMster would allow machines to negotiate storing pages for other machines. For example, in an eight-machine system, seven machines could treat the remaining machine as a memory server, offloading some of their pages to that machine.

Workload size estimation might help, but the discussion returned to the old chestnut of trying to shrink memory to find at what point the workload starts "pushing" back by either refaulting or beginning to thrash. This would allow the issue to be expressed in terms of control theory. A crucial part of using control theory is having a feedback mechanism. By and large, virtual machines have almost non-existent feedback mechanisms for establishing the priority of different requests for resources. Further, performance analysis on resource usage is limited.
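The "shrink until it pushes back" approach can be sketched as a simple feedback loop. This is a toy model under stated assumptions: `find_working_set()` is a hypothetical name, and `refault_rate` is a caller-supplied function standing in for whatever measurement (refaults per second, for example) a real system would provide.

```python
def find_working_set(limit, refault_rate, step, threshold):
    """Shrink `limit` by `step` while the refault rate stays at or below `threshold`.

    `refault_rate` maps a memory limit to the observed refault rate at that
    limit; it is the feedback signal in this sketch.
    """
    while limit - step > 0 and refault_rate(limit - step) <= threshold:
        limit -= step
    return limit
```

The loop stops at the smallest limit, at the chosen step granularity, for which the workload does not yet push back. Without a feedback signal such as `refault_rate`, there is nothing for the loop to react to, which is the gap the session identified.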

Glauber Costa pointed out that potentially some of this could be investigated using memory cgroups that vary in size to act as a type of feedback mechanism even if it lacked a global view of resource usage.

In the end, this session was a problem statement - what feedback mechanisms does a VM need to assess how much memory the workload on a particular machine requires? This is related to workload working set size estimation but that is sufficiently different from Magenheimer's requirement that they may not share that much in common.

Ballooning for transparent huge pages

Rik van Riel began by reminding the audience that transparent huge pages (THP) gave a large performance gain in virtual machines by virtue of the fact that VMs use nested page tables, which doubles the normal cost of translation. Huge pages, by requiring far fewer translations, can make much of the performance penalty associated with nested page tables go away.

Once ballooning enters the picture though it rains on the parade by fragmenting memory and reducing the number of huge pages that can be used. The obvious approach is to balloon in 2M contiguous chunks. However, this has its own problems because compaction can only do so much. If a guest must shrink its memory by half, it may use all the regions that are capable of being defragmented. This would reduce or eliminate the number of 2M huge pages that could be used.

Van Riel's solution requires that balloon pages become movable within the guest, which requires changes to both the balloon driver and potentially the hypervisor. However, no one in the audience saw a problem with this as such. Balloon pages are not particularly complicated, because they just have one reference. They need a new page mapping with a migration callback to release the reference to the page and the contents do not need to be copied so there is an optimization available there.

Once that is established, it would also be nice to keep balloon pages within the same 2M regions. Dan Magenheimer mentioned a user that has a similar type of problem, but that problem is very closely related to what CMA does. It was suggested that Van Riel may need something very similar to MIGRATE_CMA except where MIGRATE_CMA forbids unmovable pages within their pageblocks, balloon drivers would simply prefer that unmovable pages were not allocated. This would allow further concentration of balloon pages within 2M regions without using compaction aggressively.
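The value of concentrating balloon pages can be seen by counting how many 2M regions remain free of them. The following sketch assumes 4K base pages (so 512 pages per 2M region) and uses made-up page frame numbers; it models only the counting, not the balloon driver or compaction.

```python
PAGES_PER_HUGE = 512  # 2M region / 4K base pages (assumed page sizes)


def usable_huge_regions(total_pages, balloon_pfns):
    """Count 2M-aligned regions that contain no balloon pages."""
    regions = total_pages // PAGES_PER_HUGE
    # Any region holding even one balloon page cannot back a huge page.
    touched = {pfn // PAGES_PER_HUGE for pfn in balloon_pfns}
    return regions - len(touched)


total = 8 * PAGES_PER_HUGE
scattered = [i * PAGES_PER_HUGE for i in range(8)]  # one balloon page per region
clustered = list(range(8))                          # same 8 pages, all in region 0

print(usable_huge_regions(total, scattered))  # 0
print(usable_huge_regions(total, clustered))  # 7
```

The same eight balloon pages eliminate every huge page when scattered but cost only one region when kept together, which is the effect that movable balloon pages and a preferred placement policy would aim for.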

There was no resistance to the idea in principle so one would expect that some sort of prototype will appear on the lists during the next year.

Finding holes for mmap()

Rik van Riel started a discussion on the problem of finding free virtual areas quickly during mmap() calls. Very simplistically, an mmap() requires a linear search of the virtual address space by virtual memory area (VMA) with some minor optimizations for caching holes and scan pointers. However, there are some workloads that use thousands of VMAs so this scan becomes expensive.

VMAs are already organized by a red-black tree (RB tree). Andrea Arcangeli had suggested that information about free areas near a VMA could be propagated up the RB tree toward the root. Essentially it would be an augmented RB tree that stores both allocated and free information. Van Riel was considering a simpler approach using a callback on a normal RB tree to store the hole size in the VMA. Using that, each RB node would know the total free space below it in an unsorted fashion.
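The general idea of storing hole sizes in the tree can be sketched as follows. This is not the kernel's implementation and not a red-black tree: it is a plain binary tree built from an already-sorted VMA list, in which each node records the gap below its VMA and the largest gap anywhere in its subtree. All names are hypothetical, and free space above the last VMA is ignored for brevity.

```python
class Node:
    def __init__(self, start, end, gap):
        self.start, self.end = start, end
        self.gap = gap        # free space immediately below this VMA
        self.max_gap = gap    # largest gap anywhere in this subtree
        self.left = self.right = None


def build(items):
    """Build a balanced tree from sorted (start, end, gap) tuples."""
    if not items:
        return None
    mid = len(items) // 2
    node = Node(*items[mid])
    node.left = build(items[:mid])
    node.right = build(items[mid + 1:])
    for child in (node.left, node.right):
        if child is not None:
            node.max_gap = max(node.max_gap, child.max_gap)
    return node


def find_hole(node, length):
    """Return the lowest start address of a hole >= length, or None."""
    if node is None or node.max_gap < length:
        return None  # prune: nothing in this subtree is big enough
    found = find_hole(node.left, length)
    if found is not None:
        return found
    if node.gap >= length:
        return node.start - node.gap
    return find_hole(node.right, length)


vmas = [(0x1000, 0x3000), (0x4000, 0x5000), (0x9000, 0xA000)]
items, prev_end = [], 0
for start, end in vmas:
    items.append((start, end, start - prev_end))
    prev_end = end
root = build(items)

print(hex(find_hole(root, 0x2000)))  # 0x5000
print(find_hole(root, 0x8000))       # None
```

Because a subtree is skipped whenever its `max_gap` is too small, the search follows a single path from the root rather than walking every VMA. Alignment constraints, which Van Riel raised as the harder problem, are not handled here.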

That potentially introduces fragmentation as a problem but that is inconsequential to Van Riel in comparison to the problem where a hole of a particular alignment is required. Peter Zijlstra maintained that augmented trees should be usable to do this, but that was disputed by Van Riel who said that augmented RB tree users have significant implementation responsibilities so this detail needs further research.

Again, there was little resistance to the idea in principle but there are likely to be issues during review about exactly how it gets implemented.

AIO/DIO in the kernel

Dave Kleikamp talked about asynchronous I/O (AIO) and how it is currently used for user pages. He wants to be able to initiate AIO from within the kernel, so he wants to convert struct iov_iter to contain either an iovec or bio_vec and then convert the direct I/O path to operate on iov_iter. He maintains that this should be a straightforward conversion based on the fact that it is the generic code that does all the complicated things with the various structures.

He tested the API change by converting a loop device to set O_DIRECT and submit via AIO. This eliminated caching in the underlying filesystem and assured consistency of the mounted file.

He sent out patches a month ago but did not get much feedback and was looking to figure out why that was. He was soliciting input on the approach and how it might be improved but it seemed like many had either missed the patches or otherwise not read them. There might be greater attention in the future.

The question was asked whether it would be a compatible interface for swap-over-arbitrary-filesystem. The latest swap-over-NFS patches introduced an interface for pinning pages for kernel I/O but Dave's patches appear to go further. It would appear that swap-over-NFS could be adapted to use Dave's work.

Dueling NUMA migration schemes

Peter Zijlstra started the session by talking about his approach for improving performance on NUMA machines. Simplistically, it assigns each process a home node; allocation policies will prefer to allocate from that node, and load-balancer policies will try to keep the threads near the memory they are using. System calls are introduced to allow assignment of thread groups and VMAs to nodes. Applications must be aware of the API to take advantage of it.

Once the decision has been made to migrate threads to a new node, their pages are unmapped and migrated as they are faulted, minimizing the number of pages to be migrated and correctly accounting for the cost of the migration to the process moving between nodes. As file pages may potentially be shared, the scheme focuses on anonymous pages. In general, the scheme is expected to work well for the case where the working set fits within a given NUMA node but be easier to implement than the hard binding support currently offered by the kernel. Preliminary tests indicate that it does what it is supposed to do for the cases it handles.
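The migrate-on-fault behavior described above can be modeled in a few lines. This is a toy user-space model with invented names, not the patch set's code: moving a task unmaps its pages without copying anything, and a page is migrated only when the task next touches it.

```python
class Task:
    def __init__(self, node, pages):
        self.node = node
        self.page_node = {p: node for p in pages}  # page -> NUMA node holding it
        self.unmapped = set()
        self.migrated = 0  # migration cost charged to this task

    def move_to(self, new_node):
        """Change the home node and unmap all pages; nothing is copied yet."""
        self.node = new_node
        self.unmapped = set(self.page_node)

    def touch(self, page):
        """Access a page; the first access after a move faults and migrates it."""
        if page in self.unmapped:
            self.unmapped.discard(page)
            if self.page_node[page] != self.node:
                self.page_node[page] = self.node
                self.migrated += 1


task = Task(node=0, pages=range(100))
task.move_to(1)
for page in range(10):
    task.touch(page)

print(task.migrated)        # 10
print(task.page_node[50])   # 0
```

Only the ten pages that were actually touched were moved, and the cost shows up in the counter of the task that changed nodes, mirroring the two properties claimed for the scheme.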

One key advantage Zijlstra cited for his approach was that he maintains information based on thread and VMA, which is predictable. In contrast, Andrea Arcangeli's approach requires storing information on a per-page basis and is much heavier in terms of memory consumption. There were few questions on the specifics of how it was implemented, with comments from the room focusing instead on comparing Zijlstra's and Arcangeli's approaches.

Hence, Arcangeli presented on AutoNUMA which consists of a number of components. The first is the knuma_scand component which is a page table walker that tracks the RSS usage of processes and the location of their pages. To track reference behavior, a NUMA page fault hinting component changes page table entries (PTEs) in an arrangement that is similar but not identical to PROT_NONE temporarily. Faults are then used to record what process is using a given page in memory. knuma_migrateN is a per-node thread that is responsible for migrating pages if a process should move to a new node. Two further components move threads near the memory they are using or alternatively, move memory to the CPU that is using it. Which option it takes depends on how memory is currently being used by the processes.

There are two types of data being maintained for decisions. sched_autonuma works on a task_struct basis and the data is collected by NUMA hinting page faults. The second is mm_autonuma which works on an mm_struct basis and gathers information on the working set size and the location of the pages it has mapped, which is generated by knuma_scand. [AutoNUMA workflow]

The details on how it decides whether to move threads or memory to different NUMA nodes are involved, but Arcangeli expressed a high degree of confidence that it could make close to optimal decisions on where threads and memory should be located. Arcangeli's slide that describes the AutoNUMA workflow is shown at right.

When it happens, migration is based on per-node queues and care is taken to migrate pages at a steady rate to avoid bogging the machine down copying data. While Arcangeli acknowledged the overall concept was complicated, he asserted that it was relatively well-contained without spreading logic throughout the whole of MM.

As with Zijlstra's talk, there were few questions on the specifics of how it was implemented, implying that not many people in the room have reviewed the patches, so Arcangeli moved on to explaining the benchmarks he ran. The results of the benchmarks looked as if performance was within a few percent of manually binding memory and threads to local nodes. It was interesting to note that for one benchmark, specjbb, it was clear that how well AutoNUMA does varies, which shows its non-deterministic behavior. But its performance never dropped below the base performance. He explained that the variation could be partially explained by the fact that AutoNUMA currently does not migrate THP pages; instead, it splits them and migrates the individual pages, depending on khugepaged to collapse the huge pages again.

Zijlstra pointed out that, for some of the benchmarks that were presented, his approach potentially performed just as well without the algorithm complexity or memory overhead. He asserted this was particularly true for KVM-based workloads as long as the workload fits within a NUMA node. He pointed out that the history of memcg led to a situation where it had to be disabled by default in many situations because of the overhead and that AutoNUMA was vulnerable to the same problem.

When it got down to it, the discussed points were not massively different to discussions on the mailing list except perhaps in terms of tone. Unfortunately, there was little discussion on whether there was any compatibility between the two approaches and what logic could be shared. This was due to time limitations but future reviewers may have a clearer view of the high-level concepts.

Soft limits in memcg

Ying Han began by introducing soft reclaim and stated she wanted to find what blockers existed for merging parts of it. It has reached the point where it is getting sufficiently complicated that it is colliding with other aspects of the memory cgroup (memcg) work.

Right now, the implementation of soft limits allows memcgs to grow above a soft limit in the absence of global memory pressure. In the event of global memory pressure, memcgs get shrunk if they are above their soft limit. The results for shrinking are similar to hierarchical reclaim for hard limits. In a superficial way, this concept is similar to what Dan Magenheimer wanted for RAMster except that it applies to cgroups instead of machines.
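One way to picture soft-limit shrinking is to divide a reclaim target among the cgroups that exceed their soft limits, in proportion to how far over they are. The proportional policy, the function name, and the numbers below are all assumptions for the sake of illustration; the session reached no consensus on the actual ratios.

```python
def soft_reclaim_targets(cgroups, pages_needed):
    """Split a reclaim target among cgroups that are over their soft limit.

    `cgroups` maps a name to (usage_pages, soft_limit_pages). Cgroups at or
    below their soft limit are left alone in this simplified model.
    """
    excess = {name: usage - soft
              for name, (usage, soft) in cgroups.items() if usage > soft}
    total = sum(excess.values())
    if total == 0:
        return {}  # nobody is over the soft limit
    want = min(pages_needed, total)
    # Integer division may leave the result slightly short of `want`.
    return {name: want * over // total for name, over in excess.items()}


cgroups = {
    "a": (1000, 600),  # 400 pages over
    "b": (500, 400),   # 100 pages over
    "c": (300, 400),   # under its soft limit
}
print(soft_reclaim_targets(cgroups, 250))  # {'a': 200, 'b': 50}
```

The empty result when every cgroup is under its soft limit is exactly the case that makes forward progress hard to guarantee under global memory pressure.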

Rik van Riel pointed out that a task may fit in a node and be within its soft limit. If there are other cgroups on the same node, the aggregate soft limit can be above the node size and, in some cases, that cgroup should be shrunk even if it is below the soft limit. This has a quality-of-service impact; Han recognizes that this needs to be addressed. This is somewhat of an administrative issue. The total of all hard limits can exceed physical memory with the impact being that global reclaim shrinks cgroups before they hit their hard limit. This may be undesirable from an administrative point of view. For soft limits, it makes even less sense if the total soft limits exceed physical memory as it would be functionally similar to if the soft limits were not set at all.

The primary issue was deciding the ratio at which pages should be reclaimed from cgroups. If there is global memory pressure and all cgroups are under their soft limit then a situation potentially arises whereby reclaim is retried indefinitely without forward progress. Hugh Dickins pointed out that soft reclaim has no requirement that cgroups under their soft limit never be reclaimed. Instead, reclaim from such cgroups should simply be resisted and the question is how it should be resisted. This may require that all cgroups get scanned to discover that they are all under their soft limit and then require burning more CPU rescanning them. Throttling logic is required but ultimately this is not dissimilar to how kswapd or direct reclaimers get throttled when scanning too aggressively. As with many things, memcg is similar to the global case but the details are subtly different.

Even then, there was no real consensus on how much memory should be reclaimed from cgroups below their soft limit. There is an inherent fairness issue here that does not appear to have changed much between different discussions. Unfortunately, discussions related to soft reclaim are separated by a lot of time and people need to be reminded of the details. This meant that little forward progress was made on whether to merge soft reclaim or not but there were no specific objections during the session. Ultimately, this is still seen as being a little Google-specific particularly as some of the shrinking decisions were tuned based on Google workloads. New use-cases are needed to tune the shrinking decisions and to support the patches being merged.

Kernel interference

Christoph Lameter started by stating that each kernel upgrade resulted in slowdowns for his target applications (which are for high-speed trading). This generates a lot of resistance to kernels being upgraded on their platform. The primary sources of interference were from faults, reclaim, inter-processor interrupts, kernel threads, and user-space daemons. Any one of these can create latency, sometimes to a degree that is catastrophic to their application. For example, if reclaim causes an additional minor fault to be incurred, it is in fact a major problem for their application.

This happens because of several trends. Kernels are simply more complex, with more causes of interference leaving less processor time available to the user. Other trends that affect them are larger memory sizes, leading to longer reclaim, as well as more processors, meaning that for-all-cpu loops take longer.

One possible measure would be to isolate OS activities to a subset of CPUs possibly including interrupt handling. Andi Kleen pointed out that even with CPU isolation, if unrelated processes are sharing the same socket, they can interfere with each other. Lameter maintained that while this was true such isolation was still of benefit to them.

For some of the examples brought up, there are people working on the issues but they are still works in progress and have not been merged. The fact of the matter is that the situation is less than ideal with kernels today. This is forcing them into a situation where they fully isolate some CPUs and bypass the OS as much as possible, which turns Linux into a glorified boot loader. It would be in the interest of the community to reduce such motivations by watching the kernel overhead, he said.

Filesystem and storage sessions

Copy offload

Frederick Knight, who is the NetApp T10 (SCSI) standards guy, began by describing copy offload, which is a method for allowing SCSI devices to copy ranges of blocks without involving the host operating system. Copy offload is designed to be a lot faster for large files because wire speed is no longer the limiting factor. In fact, in spite of the attention now, offloaded copy has been in SCSI standards in some form or other since the SCSI-1 days. EXTENDED COPY (abbreviated as XCOPY) takes two descriptors for the source and destination and a range of blocks. It is then implemented in a push model (source sends the blocks to the target) or a pull model (target pulls from source) depending on which device receives the XCOPY command. There's no requirement that the source and target use SCSI protocols to effect the copy (they may use an internal bus if they're in the same housing) but should there be a failure, they're required to report errors as if they had used SCSI commands.

A far more complex command set is TOKEN based copy. The idea here is that the token contains a ROD (Representation of Data) which allows arrays to give you an identifier for what may be a snapshot. A token represents a device and a range of sectors which the device guarantees to be stable. However, if the device does not support snapshotting and the region gets overwritten (or in fact, for any other reason), it may decline to accept the token and mark it invalid. This, unfortunately, means you have no real idea of the token lifetime, and every time the token goes invalid, you have to do the data transfer by other means (or renew the token and try again).

There was a lot of debate on how exactly we'd make use of this feature and whether tokens would be exposed to user space. They're supposed to be cryptographically secure, but a lot of participants expressed doubt on this and certainly anyone breaking a token effectively has access to all of your data.

NFS and CIFS are starting to consider token-based copy commands, and the token format would be standardized, which would allow copies from a SCSI disk token into an NFS/CIFS volume.

Copy offload implementation

The first point made by Hannes Reinecke is that identification of source and target for tokens is a nightmare if everything is done in user space. Obviously, there is a need to flush the source range before constructing the token, then we can possibly use FIEMAP to get the sectors. Chris Mason pointed out this wouldn't work for Btrfs and after further discussion the concept of a ref-counted FIETOKEN operation emerged instead.

Consideration then moved to hiding the token in some type of reflink() and splice()-like system calls. There was a lot more debate on the mechanics of this, including whether the token should be exposed to user space (unfortunately, yes, since NFS and CIFS would need it). Discussion wrapped up with the thought that we really needed to understand the user-space use cases of this technology.

RAID unification

pNFS is beginning to require complex RAID-ed objects which require advanced RAID topologies. This means that pNFS implementations need an advanced, generic, composable RAID engine that can implement any topology in a single compute operation. MD was rejected because composition requires layering within the MD system and that means you can't do advanced topologies in a single operation.

This proposal was essentially for a new interface that would unify all the existing RAID systems by throwing them away and writing a new one. Ted Ts'o pointed out that filesystems making use of this engine don't want to understand how to reconstruct the data, so the implementation should "just work" for the degraded case. If we go this route, we definitely need to ensure that all existing RAID implementations work as well as they currently do.

The action summary was to start with MD and then look at Btrfs. Since we don't really want new administrative interfaces exposed to users, any new implementation should be usable by the existing LVM RAID interfaces.


Dave Chinner reminded everyone that the methodology behind xfstest is "golden output matching". That means that all XFS tests produce output which is then filtered (to remove extraneous differences like timestamps or, rather, to fill them in with X's) and the success or failure indicated by seeing if the results differ from the expected golden result file. This means that the test itself shouldn't process output.
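Golden output matching is easy to sketch outside of the real test suite. The example below is not xfstest code; the timestamp format, the filter, and the sample output are assumptions chosen only to show the filter-then-compare pattern.

```python
import re

# Hypothetical timestamp format; real filters depend on the tool being tested.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def filter_output(raw):
    """Replace run-specific timestamps with X's so that runs compare equal."""
    return TIMESTAMP.sub("XXXX-XX-XX XX:XX:XX", raw)


def passes(raw_output, golden):
    """A test passes when its filtered output matches the golden file exactly."""
    return filter_output(raw_output) == golden


golden = "mounted at XXXX-XX-XX XX:XX:XX\nfiles: 3\n"

print(passes("mounted at 2012-04-01 09:30:00\nfiles: 3\n", golden))  # True
print(passes("mounted at 2012-04-01 09:30:00\nfiles: 4\n", golden))  # False
```

The test body never interprets the output; it only filters and compares, which is why a change in a tool's output format shows up as a failure until the golden file is regenerated.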

Almost every current filesystem is covered by xfstest in some form and all the code in XFS is tested at 75-80% coverage. (Dave said we needed to run the code coverage tools to determine what the code coverage of the tests in other filesystems actually is). Ext4, XFS and Btrfs regularly have the xfstest suite run as part of their development cycle.

Xfstest consists of ~280 tests which run in 45-60 minutes (depending on disk speed and processing power). Of these tests, about 100 are filesystem-independent. One of the problems is that the tests are highly dependent on the output format of tools, so, if that changes, the test reports false failures. On the other hand, it is easily fixed by constructing a new golden output file for the tests.

One of the maintenance nightmares is that the tests are numbered rather than named (which means everyone who writes a new test adds it as number 281 and Dave has to renumber). This should be fixed by naming tests instead. The test space should also become hierarchical (grouping by function) rather than the current flat scheme. Keeping a matrix of test results over time allows far better data mining and makes it easier to dig down and correlate reasons for intermittent failures, Chinner said.

Flushing and I/O back pressure

This was a breakout session to discuss some thoughts that arose during the general writeback session (reported above).

The main concept is that writeback limits are trying to limit the amount of time (or IOPS, etc.) spent in writeback. However, the flusher threads are currently unlimited because we have no way to charge the I/O they do to the actual tasks. Also, we have problems accounting for metadata (filesystems with journal threads) and there are I/O priority inversion problems (can't have high priority task blocked because of halted writeout on a low priority one which is being charged for it).

There are three problems:

  • Problems between CFQ and block flusher. This should now be solved by tagging I/O with the originating cgroup.

  • CFQ throws all I/O into a single queue (Jens Axboe thinks this isn't a problem).

  • Metadata ordering causes priority inversion.

On the last, the thought was that we could use transaction reservations as an indicator for whether we had to complete the entire transaction (or just throttle it entirely) regardless of the writeback limits which would avoid the priority inversions caused by incomplete writeout of transactions. For dirty data pages, we should hook writeback throttling into balance_dirty_pages(). For the administrator, the system needs to be simple, so there needs to be a single writeback "knob" to adjust.

Another problem is that we can throttle a process which uses buffered I/O but not if it uses AIO or direct I/O (DIO), so we need to come up with a throttle that works for all I/O.


2012 Linux Storage, Filesystem, and Memory Management Summit - Day 2

By Jake Edge
April 4, 2012

Day two of the Linux Storage, Filesystem, and Memory Management Summit was much like its predecessor, but with fewer combined sessions. It was held in San Francisco on April 2. Below is a look at the combined sessions as well as those in the Memory Management track that is largely based on write-ups from Mel Gorman as well as some additions from my notes. In addition, James Bottomley has written up the Filesystem and Storage track.

Flash media

Steven Sprouse was invited to the summit to talk about flash media. He is the director of NAND systems architecture at SanDisk, and his group is concerned with consumer flash products - for things like mobile devices, rather than enterprise storage applications, which is handled by a different group. But, he said, most of what he would be talking about is generic to most flash technologies.

The important measure of flash for SanDisk is called "lifetime terabyte writes", which is calculated by the following formula:

    lifetime terabyte writes = (physical capacity * write endurance) / write amplification
Physical capacity is increasing, but write endurance is decreasing (mostly due to cost issues). Write amplification is a measure of the actual amount of writing that must be done because the device has to copy data based on its erase block size. Write amplification is a function of the usage of the device, its block size, over-provisioning, and the usage of the trim command (to tell the device what blocks are no longer being used). Block sizes (which are the biggest concern for write amplification) are getting bigger for flash devices, resulting in higher write amplification.
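The formula is straightforward to evaluate. In the sketch below, write endurance is treated as a rated number of program/erase cycles and capacity is given in decimal gigabytes; both of those choices, along with every number used, are assumptions for illustration rather than SanDisk figures.

```python
def lifetime_terabyte_writes(capacity_gb, write_endurance_cycles, write_amplification):
    """Evaluate (physical capacity * write endurance) / write amplification.

    Returns decimal terabytes of host writes over the device's lifetime.
    """
    total_gb = capacity_gb * write_endurance_cycles / write_amplification
    return total_gb / 1000


# Hypothetical 64GB device rated for 3000 cycles.
print(lifetime_terabyte_writes(64, 3000, 4.0))  # 48.0
print(lifetime_terabyte_writes(64, 3000, 8.0))  # 24.0
```

Doubling the write amplification halves the lifetime, which is why larger block sizes are a concern even as physical capacity grows.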

The write endurance is measured in data retention years. As the cells in the flash get cycled, the amount of time that data will last is reduced. If 10,000 cycles are specified for the device, that doesn't mean they die at that point, just that they may no longer hold data for the required amount of time. There is also a temperature factor and most of the devices he works with have a maximum operating temperature of 45-50°C. Someone asked about read endurance, and Sprouse said that reads do affect endurance but didn't give any more details.

James Bottomley asked if there were reasons that filesystems should start looking at storing long-lived and short-lived data separately and not mixing the two. Sprouse said that may eventually be needed. He said there is a trend toward hybrid architectures that have small amounts of high-endurance (i.e. can handle many more write cycles) flash and much larger amounts of low-endurance flash. Filesystems may want to take advantage of that by storing things like the journal in the high-endurance portion, and more stable OS files in the low-endurance area. Or storing hot data on high-endurance and cold data on low-endurance. How that will be specified is not determined, however.

The specs for a given device are based on the worst-case flash cell, but the average cell will perform much better than that worst case. If you cycle all of the cells in a device the same number of times, one of the pages might well only last 364 days, rather than the one year in the spec. Those values are determined by the device being "cycled, read, and baked", he said. The latter is the temperature testing that is done.

Sprouse likened DRAM and SRAM to paper that has been written on in pencil. If a word is wrong, it can be erased without affecting the surrounding words. Flash is like writing in pen; it can't be erased, so a one-word mistake requires that the entire page be copied. That is the source of write amplification. From the host side, there may be a 512K write, but if that data resides in a 2048K block on the flash, the other three 512K chunks need to be copied, which makes for a write amplification factor of four. In 2004, flash devices were like writing on a small Post-it pad that could only fit four words, but in 2012, it is like writing on a piece of paper the size of a large table. The cost for a one-word change has gone way up.
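The arithmetic in that example generalizes to other write and block sizes. The sketch below assumes a simplified model in which writes are aligned to erase blocks and every touched block is rewritten in full; real devices are more complicated, and the second set of sizes is hypothetical.

```python
def write_amplification(host_write_kb, erase_block_kb):
    """Ratio of data the flash must write to data the host asked to write.

    Simplified model: aligned writes, each touched erase block fully rewritten.
    """
    blocks_touched = -(-host_write_kb // erase_block_kb)  # ceiling division
    return blocks_touched * erase_block_kb / host_write_kb


print(write_amplification(512, 2048))   # 4.0, the example from the talk
print(write_amplification(128, 4096))   # 32.0, small write on a larger block
print(write_amplification(2048, 2048))  # 1.0, a full-block write
```

Under this model a write only avoids amplification when it covers whole erase blocks, which is why knowing the erase block size matters so much to a filesystem.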

In order for filesystems to optimize their operation for the geometry of the flash, there needs to be a way to get that geometry information. Christoph Hellwig pointed out that Linux developers have been asking for five years for ways to get that information without success. Sprouse admitted that was a problem and that exposing that information may need to happen. There is also the possibility of filesystems tagging the data they are writing to give the device the information necessary to make the right decision.

Sprouse posed a question about the definition of a "random" write. A 1G write would be considered sequential by most, while 4K writes would be random, but what about sizes in between? Bottomley said that anything beyond 128K is sequential for Linux, while Hellwig said that anything up to 64M is random. But the "right" answer was: "tell me what the erase block size is". For flash, random writes are anything smaller than the erase block size. In the past writing in 128K chunks would have been reasonable, he said, but today each write of that size may make the flash copy several megabytes of data.

One way to minimize write amplification is to group data that is going to become obsolete at roughly the same time. Obsolete can mean that the data is overwritten or that it is thrown away via a trim or discard command. The filesystem should strive to avoid having cold data get copied because it is accidentally mixed in with hot data. As an example, Ted Ts'o mentioned package files (from an RPM or Debian package), which are likely to be obsoleted at the same time (by a package update or removal). Some kind of interface so that user space can communicate that information would be required.

In making those decisions, the focus should be on the hottest files (those changing most frequently) rather than the coldest files, Sprouse said. If the device could somehow know what the logical block addresses associated with each file are, that would help it make better decisions. As an example, if a flash device has four dies, and four files are being written, those files could be written in parallel across the dies. That has the effect of being fast for writing, but is much slower when updating one of the files. Alternatively, each could be written serially, which is slower, but will result in much less copying if one file is updated. Data must be moved around under the hood, Sprouse said, and if the flash knows that a set of rewrites are all associated with a single file, it could reorganize the data appropriately when it does the update.

There are a number of things that a filesystem could communicate to the device that would help it make better decisions. Examples include which blocks relate to the same file and which are related by function, such as files in a /tmp directory that will be invalid after the next boot, OS installation files, or browser cache files. Filesystems could also mark data that will be read frequently or written frequently. Flash vendors need to provide a way for the host to determine the geometry of a device, such as its page size, block size, and stripe size.

Those are all areas where OS developers and flash vendors could cooperate, he said. Another that he mentioned was to provide some way for the host to communicate how much time has elapsed since the last power off. Flash devices are still "operating" even when they are powered off, because they continue to hold the data that was stored. You could think of flash as DRAM with a refresh rate of one year, for example. If the flash knows that it has been off for six months, it could make better decisions for data retention.

Some in the audience advocated an interface to the raw flash, rather than going through the flash translation layer (FTL). Ric Wheeler disagreed, saying that we don't want filesystems to have to know about the low-level details of flash handling. Ts'o agreed and noted that new technologies may come along that invalidate all of the work that would have been put in for an FTL-less filesystem. Chris Mason also pointed out that flash manufacturers want to be able to put a sticker on the devices saying that it will store data for some fixed amount of time. They will not be able (or willing) to do that if it requires the OS to do the right thing to achieve that.

One thing that Mason would like to see is some feedback on hints that filesystems may end up providing to these devices. One of his complaints is that there is no feedback mechanism for the trim command, so that filesystem developers can't see what benefits using trim provides. Sprouse said that trim has huge benefits, but Mason wants to know whether Linux is effective at trimming. He would like to see ways to determine whether particular trim strategies are better or worse and, by extension, how any other hints provided by filesystems are performing.

Bottomley asked if flash vendors could provide a list of the information they are willing to provide about the internals of a given device. With that list, filesystem developers could say which would be useful. Many of these "secrets" about the internals of flash devices are not so secret, as Ts'o pointed out that Arnd Bergmann has done timing attacks to suss out these details, which he has published. Even if there are standards that provide ways for hosts to query these devices for geometry and other information, that won't necessarily solve the problem. As someone in the audience pointed out, getting things like that into a standard does not force the vendors to correctly fill in the data.

Wheeler asked if it would help for the attendees' "corporate overlords" to officially ask for that kind of cooperation from the vendors. There were representatives of many large flash-buying companies at the summit, so that might be a way to apply some pressure. Sprouse said that like most companies, there are different factions within SanDisk (and presumably other flash companies). His group sees the benefit of close cooperation with OS developers, but others see the inner workings as "secret sauce".

It is clear there are plenty of ways for the OS and these devices to cooperate, which would result in better usage and endurance. But there is plenty of work to do on both sides before that happens.

Device mapper and Bcache

Kent Overstreet discussed the Bcache project, which creates an SSD-based cache for other (slower) block devices. He began by pointing out that the device mapper (DM) stores much of the information that Bcache would need in user space. Basically, the level of pain required to extract the necessary information from DM meant that they bypassed it entirely.

It was more or less acknowledged that Bcache is sufficiently well established in terms of performance that DM should perhaps provide an API it can use. Basically, if a flash cache is to be implemented in the kernel, basing it upon Bcache would be preferable. It would also be preferred if any such cache were configured via an established interface such as DM; this is the core issue that is often bashed around.

It was pointed out that Bcache also required some block-layer changes to split BIOs in some cases, depending on the contents of the btree, which would have been difficult to communicate via DM. This reinforces the original point that adapting Bcache to DM would require a larger number of changes than expected. There was some confusion on exactly how Bcache was implemented and what the requirements are but the Bcache developers were not against adding DM support as such. They were just indifferent to DM because their needs had already been served.

In different variations, the point was made that the community is behind the curve in terms of caching on flash and that some sort of decision is needed. This did not prevent the discussion being pulled in numerous different directions that brought up a large number of potential issues with any possible approach. The semi-conclusion was that the community "has to do something", but there was no real conclusion on what that something should be. There was a vague message that a generic caching storage layer was required, one that would be based on SSDs initially, but exactly at which layer it should exist was unclear.

Memory hotplug

Hiroyuki Kamezawa discussed the problem of hot unplugging full NUMA nodes on Intel "Ivy Bridge"-based platforms. There are certain structures that are allocated on a node that cannot be reclaimed before unplug such as pgdat. The basic approach is to declare these nodes as fully ZONE_MOVABLE and allocate needed support structures off-node. The nodes this policy affects can be set via kernel parameters.

An alternative is to boot only one node and, later, hotplug the remaining nodes, marking them ZONE_MOVABLE as they are brought up. Unfortunately, there is an enumeration problem with this. The mapping of physical CPUs to NUMA nodes is not constant because altering a BIOS setting such as HT may change that mapping. For similar reasons, the NUMA node ID may change if DIMMs change slots. Hence, the problem is that the physical node IDs and node IDs as reported by the kernel are not the same between boots. If, on a four-node machine they boot nodes zero and one and hotplug node two, the physical addresses might vary and this is problematic when deciding which node to remove or even when deciding where to place containers.

To overcome this, they need some sort of translation layer that virtualizes the CPU and node ID numbers to keep the mappings consistent between boots. There is more than one use case for this, but the problem mentioned regarding companies that have very restrictive licensing based on CPU IDs was not a very popular one. To avoid going down a political rathole, that use case was acknowledged, but the conversation moved on as there are enough other reasons to provide the translation layer.

It was suggested that perhaps only one CPU and node be activated at boot and to bring up the remaining nodes after udev is active. udev could be used to create symbolic links mapping virtual CPU IDs to physical CPU IDs and similarly symbolic link virtual node IDs to the underlying physical IDs in sysfs. A further step might be to rename CPU IDs and node IDs at runtime to match what udev discovers similar to the way network devices can be renamed, but that may be unnecessary.

Conceivably, an alternative would be that the kernel could be informed what the mapping from virtual IDs to physical IDs should be (based on what's used by management software) and rename the sysfs directories accordingly, but that would be functionally equivalent. It was also suggested that this should be managed by the hardware but that is probably optimistic and would not work for older hardware.

Unfortunately, there was no real conclusion on whether such a scheme could be made to work or if it would suit Kamezawa's requirements.

Stalled MM patches

Dan Magenheimer started by discussing whether frontswap should be merged. It got stalled, he said, due to bad timing as he passed a line where there was an increased emphasis on review and testing. To address this he gave an overview of transcendent memory and its components such as the cleancache and frontswap front-ends and the zcache, RAMster, Xen, and KVM backends. Many of these components have been merged, with RAMster being the most recent addition, but frontswap is noticeable by its absence despite the fact that some products ship with it.

He presented the results of a benchmark run based on the old reliable parallel kernel build with increasing numbers of parallel compiles until it started hitting swap. He showed the performance difference when zcache was enabled. The figures seemed to imply that the overhead of the schemes was minimal until there was memory pressure but when zcache was enabled, performance could in fact improve due to more efficient use of RAM and reduced file and swap I/O. He referred people to the list where more figures are available.

He followed up by presenting the figures when the RAMster backend was used. The point was made that using RAMster might show an improvement on the target workload while regressing the performance of the machine that RAMster was taking resources from. Magenheimer acknowledged this but felt that there was sufficient evidence justifying frontswap's existence to have it merged.

Andrew Morton suggested posting it again with notes on what products are shipping with it already. He asked how many people had done a detailed review and was discouraged that apparently no one had. On further pushing it turned out that Andrea Arcangeli had looked at it and while he saw some problems he also thought it had been significantly improved in recent times. Rik van Riel's problem was that frontswap's API was synchronous but Magenheimer believes that some of these concerns have been alleviated in recent updates. Morton said that if this gets merged, it will affect everyone and insisted that people review it. It seems probable that more review will be forthcoming this time around as people in the room did feel that the frontswap+zcache combination, in particular, would be usable by KVM.

Kyungmin Park then talked about the contiguous memory allocator (CMA) and how it has gone through several versions with review but without being merged. Morton said that he had almost merged it a few times but then a new version would come out. He said to post it again and he'll merge that.

Mel Gorman then brought up swap over NFS, which has also stalled. He acknowledged that the patches are complex, and the feedback has been that the feature isn't really needed. But, he maintained, that's not true, it is used by some and, in fact, ships with SUSE Linux. Red Hat does not ship it, but he has had queries from at least one engineer there about the status of the patches.

Gorman's basic question was whether the MM developers were willing to deal with the complexity of swap over NFS. The network people have "stopped screaming" at him, which is not quite the same thing as being happy with the patches, but Gorman thinks progress has been made there. In addition, there are several other "swap over network filesystem" patches floating around, all of which will require much of the same infrastructure that swap over NFS requires.

Morton said that the code needs to be posted again and "we need to promise to look at it". Hopefully that will result in comments on whether it is suitable in its current state or, if not, what has to be done to make it acceptable.

Issues with mmap_sem

While implementing a page table walker for estimating working set size, Michel Lespinasse found a number of cases where mmap_sem hold time for writes caused significant problems. Despite the topic title ("Working Set Estimation"), he focused on enumerating the worst mmap_sem hold times, such as when a mapped file is accessed and the atime must be updated or when a threaded application is scanning files and hammering mmap_sem. The user visible effects of this can be embarrassing. For example, ps can stall for long periods of time if a process is stalled on mmap_sem which makes it difficult to debug a machine that is responding poorly.

There was some discussion on how mmap_sem could be changed to alleviate some of these problems. The proposed option was to tag a task_struct before entering a filesystem to access a page. If the filesystem needs to block and the task_struct was tagged, it would release the mmap_sem and retry the full fault from start after the filesystem returns control. Currently the only fault handler that does this properly is x86. The implementation was described as being ugly so he would like people to look at it and see how it could be improved. Hugh Dickins agreed that it was ugly and wants an alternative. He suggested that maybe we want an extension of pte_same to cover pte_same_vma_same() but it was not deeply considered. One possibility would be to have a sequence counter on the mm_struct and observe whether it changed.
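The drop-and-retry idea, combined with the sequence-counter suggestion, can be illustrated with a small user-space sketch. This is not kernel code: `AddressSpace` is a toy stand-in for mm_struct, a plain `threading.Lock` stands in for mmap_sem, and every name here is hypothetical. The lock is released around the blocking call, and the fault is retried from the start if the counter moved in the meantime.

```python
import threading


class AddressSpace:
    """Toy stand-in for mm_struct: a lock plus a change counter."""

    def __init__(self):
        self.lock = threading.Lock()  # plays the role of mmap_sem
        self.seq = 0                  # bumped on every layout change
        self.mappings = {}

    def change_mapping(self, addr, target):
        with self.lock:
            self.mappings[addr] = target
            self.seq += 1


def handle_fault(mm, addr, slow_io, max_retries=5):
    """Resolve addr, never holding the lock across the blocking slow_io()."""
    for attempt in range(1, max_retries + 1):
        with mm.lock:
            target = mm.mappings[addr]
            seen = mm.seq             # remember the layout generation
        data = slow_io(target)        # blocking work, lock not held
        with mm.lock:
            if mm.seq == seen:        # nothing changed while we slept
                return data, attempt
        # The layout changed underneath us: retry the fault from the start.
    raise RuntimeError("address space kept changing; giving up")
```

The cost of this design is that the blocking work may be repeated, which is why a retry limit is included in the sketch.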

Andrea Arcangeli pointed out that just dropping the mmap_sem may not help as it still gets hammered by multiple threads and instead the focus should be on avoiding blocking when holding mmap_sem for writing because it is an exclusive lock. Lespinasse felt that this was only a particular problem for mlockall() so there may be some promise for dropping mmap_sem for any blocking and recovering afterward.

Dickins felt that there was a time in the past when mmap_sem was dropped for writes and just a read semaphore held under some circumstances. He suggested doing some archeology of the commits to confirm if the kernel ever did that and, if so, what were the reasons it was dropped.

The final decision for Lespinasse was to post the patch that expands task_struct with information that would allow the mmap_sem to be dropped before doing a blocking operation. Peter Zijlstra has concerns that this might have some scheduler impact and Andi Kleen was concerned that it did nothing for hold times in other cases. It was suggested that the patch be posted with a micro-benchmark that demonstrates the problem and what impact the patch has on it. People that feel that there are better alternatives can then evaluate different patches with the same metrics.

Page flags

Hugh Dickins credited Johannes Weiner's work on reducing the size of mem_cgroup and highlighted Hiroyuki Kamezawa's further work. He asserted that mem_cgroup is now sufficiently small that it should be merged with page_cgroup. He then moved on to page flag availability and pointed out that there currently should be plenty of flags available on 64-bit systems. Andrew Morton pointed out that some architectures have stolen some of those flags already and that should be verified. Regardless of that potential problem it was noted that, due to some slab alignment patches, there is a hole in struct page and there is a race to make use of that space by expanding page flags.

The discussion was side-tracked by bringing up the problem of virtual memory area (VMA) flag availability. There were some hiccups with making VMA flags 64-bit in the past but thanks to work by Konstantin Khlebnikov, this is likely to be resolved in the near future.

Dickins covered a number of different uses of flags in the memory cgroup (memcg) and where they might be stored but pointed out that memcg was not the primary target. His primary concern was that some patches are contorting themselves to avoid using a page flag. He asserted that the overhead of this complexity is now higher than the memory savings from having a smaller struct page. As keeping struct page very small was originally for 32-bit server class systems (which are now becoming rare) he felt that we should just expand page flags. Morton pointed out that we are going to have to expand page flags eventually and now is as good a time as any.

Unfortunately numerous issues were raised about 32-bit systems that would be impacted by such a change and it was impossible to get consensus on whether struct page should be expanded or not. For example, it was pointed out that embedded CPUs with cache lines of 32 bytes benefit from the current arrangement. Instead it looks like further tricks may be investigated for prolonging the current situation such as reducing the number of NUMA nodes that can be supported on 32-bit systems.

Statistics for memcg

Johannes Weiner wanted to discuss the memcg statistics and what should be gathered. His problem is that he had very little traction on the list and felt maybe it would be better if he explained the situation in person.

The most important statistics he requires are related to memcg hierarchical reclaim. The simple case is just the root group and the basic case is one child that is reclaimed by either hitting its hard limit or due to global reclaim. It gets further complicated when there is an additional child and this is the minimum case of interest. In the hierarchy, cgroups might be arranged as follows:

        cgroup A
            cgroup B

The problem is that if cgroup B is being reclaimed then it should be possible to identify whether the reclaim is due to internal or external pressure. Internal pressure would be due to cgroup B hitting its hard limit. External pressure would be due to either cgroup A hitting its hard limit or global reclaim.

He wants to report pairs of counters for internal and external reclaims. By walking the cgroup tree, the statistics for external pressure can be calculated. By looking at the external figures for each cgroup in user space it can be determined exactly where external pressure originated from for any cgroup. The alternative is needing one group of counters per parent which is unwieldy. Just tracking counters about the parent would be complicated if the group were migrated.

The storage requirements are just for the current cgroup. When reporting to user space a tree walk is necessary so it costs computationally but the information will always be coherent even if memcg changes location in the tree. There was some dispute on what file exactly should expose this information but that was a relatively minor problem.
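One possible reading of this scheme is sketched below. It is an illustration under stated assumptions, not the interface Weiner proposed: the `Cgroup` class, the `reclaim()` and `report()` helpers, and the choice to report subtree totals are all hypothetical. Each group stores only its own pair of counters, and the totals are computed by a tree walk at report time, so nothing has to be fixed up when a group moves.

```python
class Cgroup:
    """Toy cgroup holding only its own internal/external reclaim counters."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = None
        self.children = []
        self.internal = 0  # pages reclaimed because this group hit its limit
        self.external = 0  # pages reclaimed due to a parent limit or global reclaim
        if parent is not None:
            self.move_to(parent)

    def move_to(self, new_parent):
        """Migrate the group; its counters travel with it unchanged."""
        if self.parent is not None:
            self.parent.children.remove(self)
        self.parent = new_parent
        new_parent.children.append(self)


def reclaim(origin, victim, pages):
    """Account pages reclaimed from victim; origin is None for global reclaim."""
    if origin is victim:
        victim.internal += pages
    else:
        victim.external += pages


def report(group):
    """Walk the subtree and return (internal, external) totals."""
    internal, external = group.internal, group.external
    for child in group.children:
        child_internal, child_external = report(child)
        internal += child_internal
        external += child_external
    return internal, external
```

With cgroup B nested under cgroup A as above, calling `reclaim(b, b, 10)` records internal pressure on B, while `reclaim(a, b, 5)` and `reclaim(None, b, 3)` both record external pressure on B.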

The point of the session was for people to understand how he wants to report statistics and why it is a sensible choice. It seemed that people in the room had a clearer view of his approach and future review might be more straightforward.

Development tree for memcg

Michal Hocko stood up to discuss the current state of the memcg devel tree. After the introduction of the topic, Andrew Morton asked why it was not based on linux-next which Hocko said was a moving target. This potentially leads to rebases. Morton did not really get why the tree was needed but the memcg maintainers said the motivation was to develop against a stable point in time without having to wrestle with craziness in linux-next.

Morton wanted the memcg stuff to be a client of the -mm tree. That is a client of linux-next but Andrew feels he could manage the issues as long as the memcg developers were willing to deal with rebases which they were. Morton is confident he can find a way to compromise without the creation of a new tree. In the event of conflicts, he said that those conflicts should be resolved sooner rather than later.

Morton separately asked how long it is going to take to finish memcg. It's one file, how much more can there be to do? Peter Zijlstra pointed out that much of the complexity is due to changing semantics and continual churn. The rate of change is slowing but it still happens.

The conclusion is that Morton will work on extracting the memcg stuff from his view of the linux-mm tree into the memcg devel tree on a regular basis to give them a known base to work against for new features. Some people in the room commented that they missed the mmotm tree as it used to form a relatively stable tree to develop against. There might be some effort in the future to revive something mmotm-like while still basing it on linux-next.

MM scalability

Andi Kleen talked a bit about some of the scalability issues he has run into. These are issues that have shown up in both micro and macro benchmarks. He gave the example of the VMA links for very large processes that fork causing chains that are thousands of VMAs long. TLB flushing is another problem where pages being reclaimed are resulting in an IPI for each page; he feels these operations need to be batched. Andrea Arcangeli pointed out that batching may be awkward because pages are being reclaimed in LRU, not MM, order. It could just send an IPI when a bunch of pages are gathered or be able to build lists of pages for multiple MMs.
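The effect of gathering pages per MM before flushing can be illustrated with a toy count. This is a simplified model, not how the kernel accounts for flushes: it assumes one IPI round per flush, and the MM names, page numbers, and `batch_size` parameter are made up for the example.

```python
from collections import defaultdict


def ipis_per_page(reclaimed):
    """One flush, and so one IPI round, for every reclaimed page."""
    return len(reclaimed)


def ipis_batched(reclaimed, batch_size):
    """Gather pages in LRU order, then flush once per MM in each batch."""
    ipis = 0
    for start in range(0, len(reclaimed), batch_size):
        by_mm = defaultdict(list)
        for mm, page in reclaimed[start:start + batch_size]:
            by_mm[mm].append(page)
        ipis += len(by_mm)  # one flush covers every page gathered for that MM
    return ipis


# Pages come off the LRU interleaved across address spaces, not grouped by MM.
lru_order = [("mm_a", 1), ("mm_b", 7), ("mm_a", 2), ("mm_c", 4),
             ("mm_b", 8), ("mm_a", 3), ("mm_c", 5), ("mm_b", 9)]
```

Because the pages arrive interleaved, small batches save little; the gain in this model only becomes large once a batch holds several pages from the same MM.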

Another issue is whether clearing the access bit should result in a TLB flush or not. There were disagreements in the room as to whether this would be safe. It potentially affects reclaim but the length of time a page lives on the inactive LRU list should be enough to ensure that the process gets scheduled and flushes the TLB. Relying on that was considered problematic but alternative solutions such as deferring the flush and then sending a global broadcast would interfere with other efforts to reduce IPI traffic. Just avoiding the flush for clearing the access bit should be fine in the vast majority of cases so chances are a patch will appear on the list for discussion.

Kleen next raised an issue with drain_pages(), which has a severe lock contention problem when releasing the pages back to the zone list as well as causing a large number of IPIs to be sent.

His final issue was that swap clustering in general seems to be broken and that the expected clustering of virtual address to contiguous areas in swap is not happening. This was something 2.4 was easily able to do because of how it scanned page tables but it's less effective now. However, there have been recent patches related to swap performance so that particular issue needs to be re-evaluated.

The clear point that shone through is that there are new scalability issues that are going to be higher priority as large machines become cheaper and that the community should be pro-active dealing with them.


Pavel Emelyanov briefly introduced how Parallels systems potentially create hundreds of containers on a system that are all effectively clones of a template. In this case, it is preferred that the file cache be shared between containers to limit the memory usage so as to maximize the number of containers that can be supported. In the past, they used a unionfs approach but as the number of containers increased so did the response time. This was not a linear increase and could be severe on loaded machines. If reclaim kicked in, then performance would collapse.

Their proposal is to extend cleancache to store the template files and share them between containers. Functionally this is de-duplication and, superficially, kernel samepage merging (KSM) would suit their requirements. However, there were a large number of reasons why KSM was not suitable, primarily because it would not be of reliable benefit but also because it would not work for file pages.

Dan Magenheimer pointed out that Xen de-duplicates data through use of a backend to cleancache and that they should create a new backend instead of extending cleancache which would be cleaner. It was suggested that when they submit the patches that they be very clear why KSM is not suitable to avoid the patches being dismissed by the casual observer.

What remains to be done for checkpoint/restore in user space?

Pavel Emelyanov talked about a project he started about six months ago to address some of the issues encountered by previous checkpoint implementations, mostly by trying to move it into user space. This was not without issue because there is still some assistance needed from the kernel. For example, kernel assistance was required to figure out if a page is really shared or not. A second issue mentioned was that given a UNIX socket, it cannot be discovered from userspace what its peer is.

They currently have two major issues. The first is with "stable memory management". Applications create big mappings but they do not access every single page in it and writing the full VMA to a disk file is a waste of time and space. They need to discover which pages have been touched. There is a system call for memory residency but it cannot identify that an address is valid but swapped out for example. For private mappings, it cannot distinguish between a COW page and one that is based on what is on disk. kpagemap also gives insufficient information because information such as virtual address to page frame number (PFN) is missing.

The second major problem is that, if an inode is being watched with inotify, extracting exact information about the watched inode is difficult. James Bottomley suggested using a debugfs interface. A second proposal was to extend the /proc interface in some manner. The audience in the room was insufficiently familiar with the issue to give full feedback so the suggestion was just to extend /proc in some manner, post the patch and see what falls out as people analyze the problem more closely. There was some surprise from Bottomley that people would suggest extending /proc but for the purpose of discussion it would not cause any harm.

Filesystem and Storage sessions

High IOPS and SCSI/Block

Roland Dreier began by noting that people writing block drivers have only two choices: A full request-based driver, or using make_request(). The former is far too heavyweight with a single very hot lock (the queue lock) and a full-fledged elevator. The latter is way too low down in the stack and bypasses many of the useful block functions, so Dreier wanted a third way that takes the best of both. Jens Axboe proposed using his multi-queue work which essentially makes the block queue per-CPU (and thus lockless) coupled with a lightweight elevator. Axboe has been sitting on these patches for a while but promised to dust them off and submit them. Dreier agreed this would probably be fine for his purposes.

Shyam Iyer previewed Dell's vision for where NVMe (Non-Volatile Memory express - basically PCIe cards with fast flash on them) was going. Currently the interface is disk-like, with all the semantics and overhead that implies, but ultimately Dell sees the device as having a pure memory interface using apertures over the PCIe bus. Many people in the room pointed out that while a memory-mapped interface may be appealing from the speed point of view, it wouldn't work if the device still had the error characteristics of a disk, because error handling in the memory space is much less forgiving. Memory doesn't do any software error recovery and every failure to deliver data instantly is a hard failure resulting in a machine check, so the device would have to do all recovery itself and only signal a failure to deliver data as a last resort.

LBA hinting and new storage commands

Frederick Knight began by previewing the current T10 thoughts on handling shingle drives: devices which vastly increase storage density by overlapping disk tracks. They can increase storage radically but at the expense of having to write a band at a time (a band is a set of overlapping [shingled] tracks). The committee has three thoughts on handling them:

  • Transparent: just make it look like a normal disk

  • Banding: Make the host manage the geometry (back to the old IDE driver days) and expose new SCSI commands for handling bands

  • Transparent with Hints: make it look like a normal disk but develop new SCSI commands to hint both ways between device and host what the data is and device characteristics are to try to optimize data placement

The room quickly decided that only the first and last were viable options, so the slides on the new banding commands were skipped.

In the possible hint-based architecture, there would be static and dynamic hints. Static would be from device to host signalling which indicated geometry preferences by LBA range, while dynamic would be from the host to device indicating the data characteristics on a write which would allow the device to do more intelligent placement.

It was also pointed out that shingled drives have very similar characteristics to SSDs if you consider a band to be equivalent to an erase block.

The problem with the dynamic hinting architecture is that the proposal would repurpose the current group field in the WRITE command to contain the hint, but there would only be six bits available. Unfortunately, virtually every member of the SCSI committee has their own idea about what should be hinted (all the way from sequential vs random in a 32-level sliding scale, write and read frequency and latency, boot time preload, ...) and this led to orders of magnitude more hints than fit into six bits, so the hint would be an index into a mode page which described what it means in detail. The room pointed out unanimously that the massive complexity in the description of the hints meant that we would never have any real hope of using them correctly since not even device manufacturers would agree exactly what they wanted. Martin Petersen proposed identifying a simple set of five or so hints and forcing at least array vendors to adhere to them when the LUN was in Linux mode.
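The space problem is easy to see with a little arithmetic. In the sketch below only the six-bit field and the 32-level sequential/random scale come from the discussion; the number of levels assigned to the other dimensions is an assumption chosen purely for illustration, and the names are hypothetical.

```python
import math

GROUP_FIELD_BITS = 6
MAX_HINT_INDEX = (1 << GROUP_FIELD_BITS) - 1  # 63, so 64 distinct values

# Hypothetical hint dimensions. Only the 32-level scale is from the text;
# the other level counts are assumed for the sake of the example.
HYPOTHETICAL_DIMENSIONS = {
    "sequential_vs_random": 32,
    "write_frequency": 4,
    "read_frequency": 4,
    "latency": 4,
    "boot_preload": 2,
}


def bits_needed(dimensions):
    """Bits required to encode every combination of the given dimensions."""
    combinations = math.prod(dimensions.values())
    return (combinations - 1).bit_length()


def pack_hint(index):
    """Validate that a mode-page index fits into the six-bit group field."""
    if not 0 <= index <= MAX_HINT_INDEX:
        raise ValueError("hint index does not fit in six bits")
    return index
```

Under these assumptions the sliding scale alone consumes five of the six bits, and encoding all dimensions directly would take twice the available space, which is why the field can only carry an index into a table of descriptions.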

Storage manager

Lukáš Czerner gave a description of the current state of his storage manager command-line tool, which, apart from having some difficulty creating XFS volumes, was working nicely and should take a lot of the annoying administrative complexity out of creating LVM volumes for mounted devices.

Trim, unmap, and write same

Martin Petersen began by lamenting that in the ATA TRIM command, T13 only left two bytes for the trim range, meaning that, with one sector of ranges, we could trim at most 32MB of disk in one operation. The other problem is that the current architecture of the block layer only allows us to trim contiguous ranges. Since TRIM is unqueued and filesystems can only send single ranges inline, trimming is currently a huge performance hit. Christoph Hellwig had constructed a prototype with XFS which showed that if we could do multi-range trims inline, performance could come back to within 1% of what it was without sending trim.
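The 32MB figure follows from the two-byte length field: a 16-bit sector count caps a single range at 65,535 sectors. The sketch below works through that arithmetic and shows how a larger extent would have to be split. It assumes 512-byte sectors, which the discussion does not state, and the helper name is hypothetical.

```python
SECTOR_BYTES = 512  # assumed sector size
COUNT_BITS = 16     # the two bytes left for the range length

MAX_SECTORS_PER_RANGE = (1 << COUNT_BITS) - 1               # 65535
MAX_BYTES_PER_RANGE = MAX_SECTORS_PER_RANGE * SECTOR_BYTES  # just under 32 MiB


def split_into_ranges(start_lba, n_sectors):
    """Split one extent into (lba, count) pairs that fit the 16-bit count."""
    ranges = []
    while n_sectors > 0:
        count = min(n_sectors, MAX_SECTORS_PER_RANGE)
        ranges.append((start_lba, count))
        start_lba += count
        n_sectors -= count
    return ranges
```

For instance, an extent of 200,000 sectors starting at LBA 0 would need four ranges under these assumptions, which is part of why the ability to send several ranges in one command matters.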

Discussion then focused on what had to happen to the block layer to send multi-range commands (it was pointed out that it isn't just trim: scatter/gather SCSI commands with multiple ranges are also on the horizon). Jens Axboe initially favored the idea of allowing a single BIO to carry multiple ranges, whereas Petersen had a prototype using linked BIOs for the range. After discussion it was decided that linked BIOs were a better way forward for the initial prototype.

SR-IOV and FC sysfs

SR-IOV (Single Root I/O virtualization) is designed to take the hypervisor out of storage virtualization by allowing a guest to have a physical presence on the storage fabric. The specific problem is that each guest needs a world wide name (WWN) as their unique address on the fabric. It was agreed that we could use some extended host interface for setting WWNs but that we shouldn't expose this to the guest. The other thought was around naming of virtual functions when they attach to hosts. In the network world, virtual function (vf) network devices appear as eth<phys>-<virt> so should we do the same for SCSI? The answer was categorically that without any good justification for this naming scheme: "hell no."

The final problem discussed was that when the vf is created in the host, the driver automatically binds to it, so it has to be unbound before passing the virtual function to the guest. Hannes Reinecke pointed out that binding could simply be prevented using the standard sysfs interfaces. James Bottomley would prefer that the driver simply refuse to bind to vf devices in the host.

Robert Love noted that the first iteration of Fibre Channel attributes was out for review. All feedback from Greg Kroah-Hartman has been incorporated so he asked for others to look at it (Bottomley said he'd get round to it now that Kroah-Hartman is happy).

Unit attention handling

How should we report "unit attentions" (UAs - basically SCSI errors reported by storage devices) to userspace? Three choices were proposed:

  • netlink - which works but is only one way

  • blktrace using debugfs - needs a tool to extract data

  • using structured logging - feasible only in the current merge window since the structured logging patch is now in 3.4-rc1

There was a lot of discussion, but it was agreed that the in-kernel handling should be done by a notifier chain to which drivers and other entities could subscribe, and the push to user space would happen at the other end, probably via either netlink or structured logging.
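The shape of that design can be sketched in a few lines. This is a user-space illustration of the notifier-chain pattern only, not the kernel's notifier API; the class, the priority convention, and the event fields are all hypothetical.

```python
class NotifierChain:
    """Toy notifier chain: subscribers are called in priority order."""

    def __init__(self):
        self._subscribers = []

    def register(self, callback, priority=0):
        self._subscribers.append((priority, callback))
        # Higher priority runs first; the stable sort keeps registration
        # order among subscribers with equal priority.
        self._subscribers.sort(key=lambda entry: -entry[0])

    def unregister(self, callback):
        self._subscribers = [entry for entry in self._subscribers
                             if entry[1] != callback]

    def notify(self, event):
        for _, callback in self._subscribers:
            callback(event)


handled = []
to_user_space = []


def driver_handler(event):
    """In-kernel consumer, for example a driver reacting to the event."""
    handled.append(event["device"])


def user_space_forwarder(event):
    """Last link in the chain: stands in for netlink or structured logging."""
    to_user_space.append("%s: %s" % (event["device"], event["reason"]))


unit_attentions = NotifierChain()
unit_attentions.register(user_space_forwarder, priority=0)
unit_attentions.register(driver_handler, priority=10)
unit_attentions.notify({"device": "sda", "reason": "capacity changed"})
```

Keeping the user-space push as just another subscriber means the choice between netlink and structured logging does not affect the in-kernel consumers.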

[ I would like to thank LWN subscribers for funding that allowed me to attend the summit. Big thanks are also due to Mel Gorman and James Bottomley for their major contributions to the summit coverage. ]

Group photo

Thanks to Alasdair Kergon for making his photograph of the 2012 Linux Storage, Filesystem, and Memory Management summit available.



Page editor: Jonathan Corbet


DuckDuck Debian?

April 4, 2012

This article was contributed by Nathan Willis

The Debian project is considering a proposal from the DuckDuckGo search engine to accept a percentage of the engine's advertising revenue that originates from web browsers on Debian machines. The project already welcomes donations and has several other organizations that make large contributions, but DuckDuckGo's proposal would require some accounting that would dictate some small changes to the browser software — and that, in turn, raises thornier questions for the project's developers and package maintainers.


Stefano Zacchiroli, the current Debian Project Leader (DPL), wrote to the debian-project list on March 27, explaining that he had been approached about a possible revenue-sharing agreement with the DuckDuckGo search engine. Two alternatives were proposed.

In the first scenario, the company would donate 25% of its income from inbound traffic on Debian systems (provided that DuckDuckGo is available as a search engine option in a web browser). In the second scenario, the company would donate 50% of its inbound-traffic income from Debian machines if Debian set DuckDuckGo as the default search engine in the browser. In both cases, the company proposed counting Debian traffic by using a modified search URL:{{search}}&t=debian

DuckDuckGo further requested that Debian send a periodic invoice to the company, presumably based on the traffic statistics that the search engine already publishes.

Zacchiroli commented that he had discussed the search URL proposal with Mike Hommey, maintainer of the Iceweasel package (Debian's re-branded version of Mozilla Firefox), who had no objections to the proposed string modification. Zacchiroli said that he was "very much inclined to accept" the proposal, and asked for input from the rest of the project. In particular, he pointed out that DuckDuckGo had a publicly-visible history of donating to open source projects, including other distributions. The main risk, he added, was that the project needed to make clear to the company that the agreement would not interfere with package maintainers' freedom to make their decisions on purely technical grounds. That said, he expressed his confidence that the risk was negligible, and that the maintainers could be trusted to "keep on doing their thing."

Zacchiroli also solicited input from the maintainers of other browsers, clarifying in a later message that DuckDuckGo had initially proposed appending the term "iceweasel" to the search string, to which he and Hommey had counter-proposed "debian" in order to include all of the packaged Web browsers in the arrangement.

The 50%

Several developers expressed concern that taking the "50% proposal" and changing the default search engine would upset standing relationships between Debian and the various browser projects, or between Debian and the displaced search engine. Some questioned whether Mozilla or Google might become unhappy with a change like this. Philip Hands asked whether replacing Google as the default search engine would endanger Google's sponsorship of DebConf or acceptance of Debian as a Google Summer of Code mentor. Steffen Möller said it would put Debian in "competition" with the upstream projects:

At the moment we are perceived as enthusiasts serving upstream developers with the best possible presentation of their work. Once we start getting money through their tools, they may possibly start thinking differently.

Clint Adams strongly disagreed with that sentiment, and said:

I reject and resent the idea that any software project has the entitlement to profit off of my web traffic.

Treating the change of a query string as theft is as ridiculous as broadcast TV stations telling me I'm robbing them by skipping commercials.

I was horrified to see this attitude espoused in the Ubuntu-Banshee episode.

Andreas Tille, on the other hand, asked whether Debian and Mozilla truly had a "good" relationship to begin with, given the renaming controversy of a few years ago. But ultimately Charles Plessy's viewpoint seemed to represent the views of most in the conversation, that the crux of the problem was in making a technical change from upstream's default for a non-technical reason.

Peter Samuelson suggested explicitly rejecting the 50% proposal outright, on the grounds that even the appearance of letting money influence a technical decision like the default search engine would be detrimental to the project. To that idea Zacchiroli replied that he had never intended to put the 50% option on the table, and meant only to open the floor to a discussion of the 25% proposal, even if that distinction had been unclear in his first message.

For his part, Iceweasel package maintainer Hommey commented that he would not even "start to consider [DuckDuckGo] as a default until it at least matches the user experience the current default engine provides, including search suggestions and localized results." So there would appear to be no possibility of Debian accepting the 50% proposal at this time.

Privacy and partnership

Although general opinion on the list leaned in favor of the 25% proposal endorsed by Zacchiroli, several people raised concerns. The first was that even the 25% arrangement established too close a link between the search engine provider and the activities of package maintainers. Möller, in the same message where he speculated about competing with upstream projects, suggested that it would be more consistent and ultimately preferable to list DuckDuckGo as yet another partner that donates to Debian, and ask them to donate whatever amount they see fit. Paul Wise agreed, and said that Debian should either spend its time building a flexible, user-configurable system for controlling revenue-sharing deals or not touch such deals at all.

Zacchiroli replied that although he could see the potential risk in the DuckDuckGo partnership unconsciously affecting the project, he was "quite convinced" that it would not impact the project. If nothing else, he said, the chain between the entity donating the funds (DuckDuckGo) and the people making technical decisions (individual package maintainers) is long enough to mitigate the risk.

Joey Hess raised a privacy issue, noting that incorporating OS information into the User-Agent (or, presumably, the search string itself, which was DuckDuckGo's proposal) would amount to leaking information about the machine to a third party. Thijs Kinkhorst observed that there are already many ways for remote servers to know that a machine is running Debian, including User-Agent and plug-in information. The fact that Iceweasel identifies itself as such is enough, on its own, to identify the system as being a Debian derivative, for example.

Regardless of how detailed the tracking abilities of DuckDuckGo's proposal are, however, Zacchiroli argued that per Debian's governance model, the ultimate choice ought to be left up to each browser package maintainer:

All in all, as a project we should simply see the agreement as something like "for every web browser in Debian who decides to use t=something, Debian will receive donations". If, due to the usual way we maintain packages, including upstream relationships, that set will shrink to nothing, too bad. The agreement will simply allow the set to exist, it will not magically fill it with browsers that implement t=something.

Send in the lawyers

Neither the hands-off-donation suggestion nor the privacy question garnered sufficient support to overwhelm the general interest in accepting the DuckDuckGo 25% proposal. The devils are always in the details, though — or, as Jonas Smedegaard commented when Zacchiroli described the 25% as "basically it", there is no "basically it" when legalese is attached.

Smedegaard asked for more details on Debian's end of the agreement, as did others. Zacchiroli alluded to clauses that allowed Debian to challenge the numbers in DuckDuckGo's periodic statistical reports, and allowed either party to terminate the arrangement, but he declined to post the agreement itself on the list because neither side had agreed to make it public. He did make it accessible to Debian Developers, however.

Hess noted that there was also a clause requiring Debian to provide 30 days' advance notice before "releasing changes to the implementation of the links." Axel Beckert asked if that meant advance notice of any modifications to the packages, but Zacchiroli expressed his interpretation that only changes to the search string were covered by the clause.

Still other questions about the specifics bubbled up at the end of March, and have yet to be definitively resolved. For example, Russ Allbery asked if the proposed agreement dealt with Debian's relationship to downstream projects:

Is DuckDuckGo aware of the fact that Debian is upstream of a number of derivative distributions that just import our packages, and if we modify our packages to do this, other distributions will be counted as "Debian" for their revenue-sharing purposes even if they aren't exactly?

For example, Ubuntu would inherit this behavior for the web browsers they just import from Debian, unless they went out of their way to change it.

Related, do they realize that we cannot and will not enforce any of the terms of their contract with us on any derivative distribution that happens to import Debian web browser packages?

Plessy asked whether the deal broke with Debian's long-standing policy against advertising. He cited a 2011 incident where the Debian Med project was removed from the Planet Debian feed for inviting users to shop online via an affiliate program that would direct funds back to Debian. Plessy said he did not see much difference between the two arrangements:

It is hard to guess where to draw the line between what is acceptable and what is unacceptable regarding revenue sharing agreements and their advertisement. I hope that the decision that will be taken about DuckDuckGo's proposition will be accompanied by a clarification on what we can generalise from it.

At this point, the general consensus appears to be in favor of accepting the 25% proposal, and leaving the decision about deploying the requisite search string change to the maintainer of each individual browser. With Hommey in favor of the change, and Iceweasel being the best-known browser, Debian can probably expect to start seeing some revenue from DuckDuckGo.

Several project members asked whether there were projected revenue numbers attached to the proposal; Zacchiroli said that he had asked his contact at DuckDuckGo, but that no statistics were available. Consequently, what the deal means for the project is uncertain. "Considering we're talking about a non-default search option, I agree with Mike that the share of our searches will be quite low. But I've no idea how that would map to actual donations." Your author has no doubt, however, that the more slippery questions over adding trackable information to the search string and where to draw the line between acceptable and unacceptable revenue streams will crop up again — particularly if the search kickback ends up being substantial.


Brief items

Distribution quotes of the week

You might be on to something here! But the 140 char limit would really stifle my creativity when it comes to comments. I'd rather create facebook pages for every package - that way we could add karma by "liking" a package.

We could even take it a step farther and use this for marketing. Just imagine - "Play farmville with glibc next wednesday and learn about the great new features!", "gdb has shared a picture with you", "NetworkManager wants to be your friend". Oh the possibilities ...

-- Tim Flink (Thanks to James Wilkinson)

Oooh. The next [Fedora] name must be Chartreuse Bikeshed.
-- mattdm

I think one of the things that makes Debian off-putting and unwelcoming is that we're a little *too* obsessed with criticizing everyone's ideas, and what some people see as "healthy discussion" other people see as "hurtful flamewars over bike shed colors."
-- Russ Allbery

I still think we need to specify that we don't discriminate on grounds of preferred bikeshed colour.
-- Ben Hutchings

We seem to be drifting into dangerous territory here. Should we not make explicit the fact that we are willing to discuss the colour of all sheds, even those used for the storage of pots?
-- Philip Hands


Debian joins the Open Source Initiative

The Debian project has announced that it is joining the Open Source Initiative as an affiliate. "By becoming an affiliate of the OSI, the Debian Project recognises the OSI's history of efforts towards goals shared by both organisations. However, the Debian Project will not automatically adopt OSI decisions on the acceptability of particular software licenses and will maintain an independent license review process."


Gentoo Linux releases 12.1 LiveDVD

Gentoo Linux has announced the release of the Gentoo 12.1 LiveDVD. "The LiveDVD is available in two flavors: a hybrid x86/x86_64 version, and an x86_64 multi lib version. The livedvd-x86-amd64-32ul-12.1 version will work on 32-bit x86 or 64-bit x86_64. If your CPU architecture is x86, then boot with the default gentoo kernel. If your arch is amd64, boot with the gentoo64 kernel. This means you can boot a 64-bit kernel and install a customized 64-bit user land while using the provided 32-bit user land. The livedvd-amd64-multilib-12.1 version is for x86_64 only."


Open Build Service Brings Website Integration

openSUSE's Open Build Service (OBS) is a system to collaboratively build and easily distribute packages for a wide variety of operating systems and platforms. OBS now has the ability to integrate the intelligent OBS 'download package' page into websites. "This is useful for projects who want to offer their users easy access to downloads for a wide variety of Linux (and non-linux) systems. Moreover, the Open Build Service 2.3 Release Candidate is out and the final release is near."


OmniTI Debuts OmniOS

OmniTI has announced OmniOS, a continuation of OpenSolaris, using the Illumos base. "OmniOS provides users with a traditional, Solaris-like installable operating system with a minimal package set to ease regulatory compliance. It delivers a self-hosting, environment with simplified processes for ongoing maintenance. Most importantly, it brings third-party software components up-to-date within OmniOS. Third-party software has been a problem with previous attempts to evolve OpenSolaris, as some have not been updated in a decade. It served as a key driver behind OmniTI's interest to develop OmniOS."


Ubuntu 12.04 LTS Beta 2

The second and final beta release of Ubuntu 12.04 LTS (Long Term Support) is available for testing. Variants Kubuntu, Edubuntu, Xubuntu, Lubuntu, Mythbuntu and Ubuntu Studio have also released a second beta. The final version of 12.04 LTS is expected to be released April 26.


Distribution News

Debian GNU/Linux

Bits from the 5th Debian Groupware Meeting

The Debian Groupware team recently met in Germany. Click below for a short summary of the meeting.


Ubuntu family

Ubuntu Studio 12.04 LTS

The Ubuntu Technical Board has approved Ubuntu Studio 12.04 (a variant aimed at multimedia creation) for three years of long term support.


Newsletters and articles of interest


Free is too expensive (Economist)

The Economist complains about the state of desktop Linux. "That said, even the latest KDE distributions are proving just as annoying to set up as Gnome versions. Your correspondent blames the rapid upgrade cycle for leaving too many features with rough edges, too many wonky drivers and utilities, and too many unchecked regressions (bugs caused by changes) in the kernel. All that Linux developers seem to want to do these days is add cool new features, rather than squish existing bugs and make the software more usable." The article is a little muddled, complaining about the "we know best" attitude while saying that Linux lacks the integration seen in iOS or Android, but it's worth a look.


Linux Mint vs. Ubuntu: the Best Option? (Datamation)

Matt Hartley compares Ubuntu to Linux Mint in a three page article on Datamation. "Despite the mutual goal of offering an easy to use Linux desktop, I've noticed that Ubuntu and Linux Mint have different approaches as to how they appeal to their users. In recent years, I've actually found the two distributions shift further apart than ever before. This change isn't a negative thing, rather a positive highlight that allows both distributions to differentiate themselves better. The shift began with different approaches to tools and software. Later, the differences between the distros evolved to include the desktops as well."


Page editor: Rebecca Sobol


Epiphany: the minimalist GNOME browser

By Jonathan Corbet
April 2, 2012
When one talks about web browsers for desktop Linux systems, there are usually two options on the table: Firefox or Chromium. There are a number of other browsers out there, though, including Epiphany, the GNOME project's official web browser. In past years, development of Epiphany appears to have slowed considerably, and it has not drawn much in the way of attention. Recently, though, there have been indications of a new burst of activity around Epiphany, so your editor decided to take a fresh look.

According to its web page, Epiphany "provides an elegant, responsive and uncomplicated user interface that fits in perfectly with GNOME." The initial experience is indeed uncomplicated; Epiphany, when it starts up, presents a single, unadorned, white window with an empty address bar at the top. No splash screens, no welcome messages, and no home page; indeed, Epiphany seems to lack the concept of a home page entirely. Actually getting content into the browser window is a matter of typing something into the bar at the top or dragging it over from some other application.

Epiphany is meant to be a fast browser. Much of its performance will naturally be bounded by the speed of the net and by the speed of the WebKit engine on which Epiphany is based, but your editor's subjective experience is that its developers have certainly not gotten in the way. Interaction with the net feels quick in a way that it most certainly does not with some other browsers. It can be a real pleasure to watch things happen so quickly.

For the purposes of simply reading web pages, the simplicity of Epiphany's design is also quite nice. There has been a clear effort to remove as much non-content junk from the screen as possible. In particular, the developers seem to have decided to leave as much vertical space as possible for web content. In these days when the designers of monitors seem to have all concluded that widescreen movie watching is the only interesting use case for their products, it is nice to get some of that vertical real estate back. More web page and less scrolling is always a good thing.

Interestingly, hovering over a link in Epiphany does not produce any sort of display showing where the link goes. That is a bit of information that browsers have provided since the beginning; its absence here is strange and a bit jarring. It is nice to have some clue of what awaits at the far end of a link, and there is no real reason not to provide it.

Many of the keyboard and mouse shortcuts that one would expect are there, so moving to Epiphany is not a huge shock. That said, there are a few things missing. Your editor misses moving through a page's history with shift and the mouse scrollwheel; that lack is made worse by Epiphany's failure to implement the "forward" and "back" buttons (buttons 8 and 9) found on some mice. The address bar pulls up options from the history like other browsers, but the tab key, which selects an item in Firefox, just causes them all to disappear with Epiphany. One must, instead, use the arrow keys, taking the hand out of home position and slowing the whole process. But these complaints are minor; the basic operation of the browser is mostly as one would expect.

All the minimalism does come with a bit of a cost, though. The ability to put a small number of frequently-used bookmarks into a toolbar over the window itself can be quite useful, but it is missing from Epiphany. Even the bookmarks themselves are not directly accessible; instead, they are found in a second-level menu behind a button with a gear-shaped icon. That button provides access to a number of other standard functions - open a tab, print the page, view page source, etc. Some interesting things are missing, though: this menu lacks any option to set preferences, access help, or even to quit the application.

This is a GNOME application we're talking about, so your editor was entirely prepared to believe that the Epiphany developers had concluded that a simple application like a web browser has no knobs that a user might actually want to tweak if they knew what was good for them. That turns out not to be the case, though; Epiphany does allow for the tweaking of a certain number of preferences, including the download location, font sizes (though the useful ability to set a minimum font size is missing), JavaScript and cookie behavior, and so on. How this window is obtained is, sadly, an indication of where GNOME is going.

GNOME 3 users know that the top of the screen is occupied by a mostly empty black bar; toward the left end an icon and name for the currently-focused application appears. Thus far, that icon has been mostly a decorative feature. But, it seems, the GNOME developers intend it to be for an application menu. So, to get at Epiphany's preferences window, help browser, history browser, etc., or to tell it to quit, one must move out of the application and to that icon (labeled "Web," not "Epiphany") to request it from the global application menu. That icon is detached from the window(s) it relates to; indeed, it is likely, in multi-monitor setups, to be on an entirely different screen. But running up mileage on the pointer to get to that menu is the distraction-free computing paradigm of the future, it seems.

It would, of course, be purely gratuitous for your editor to point out that getting at the global application menu is especially challenging in a focus-follows-mouse setting, so he would not dream of doing that.

There are a few other settings available to those who are willing to wander into the dconf registry. If you do not want Google to be the recipient of any non-URL text typed into the location bar, for example, you'll need to go into dconf to change the search URL. There's a surprising number of options for configuring Epiphany to run in a locked-down kiosk mode. Happily, the minimum font size option - useful for those of us who want text at the smallest easily-readable size, but no smaller - can also be found there.

There is an extension mechanism for Epiphany, but, seemingly, no way to obtain extensions from the net. Instead, the few available extensions are assumed to be available on the local system, usually packaged by the distributor. The options are limited but they do include useful tools like Adblock and Greasemonkey. There is also a "subscribe to RSS feed" extension, but it appears to only work with locally-running feed reader applications. In general, it would appear that the Epiphany developers don't expect to see vast numbers of extensions as one might find for other browsers.

Epiphany's developers seem to have a number of plans for the near future. The blank initial page may eventually be replaced by an "overview" that includes bookmarks and recent history; it seems intended to at least partially mirror GNOME Shell's overview screen. The planned Queues feature looks useful; it will let users move those pages they plan to read out of their bookmarks and/or open tabs. A port to the WebKit2 API is also in the works; that will allow Epiphany to run different tabs in different processes. And, of course, there is a data synchronization feature that will allow users to store history, bookmarks, and more in a central location.

In summary: the renewed effort has turned Epiphany into a quick and focused tool that can be quite pleasurable to use if you are willing to accept its limitations. It sometimes seems like the problem of writing a workable free web browser has been solved for some time, but there is value in continued innovation and experimentation in this area. Many of us spend a lot of time dinking around working on the web; better tools for that work can only be welcome. For some people, Epiphany, in its current or future form, may well be that better tool.


Brief items

Quotes of the week

I've seen programs that end up swapping bytes two, three, even four times as layers of software grapple over byte order. In fact, byte-swapping is the surest indicator the programmer doesn't understand how byte order works.
-- Rob Pike

Try to imagine yourself in the IPMC, being asked to vote for the release of [Apache OpenOffice] 3.4. You want to make sure the release follows Apache policies and guidelines. You want to protect the ASF. You want to ensure that users, including developers using our source code packages, get the greatest benefit from the release. But you are faced with a 10 million line code project, larger and more complex than anything you've faced before at Apache.

What do you do? Where do you start?

Honestly, I have absolutely no idea.

-- Rob Weir

Regular ls output, tuned as it was for 9600 baud terminals or so, is really too verbose for modern media such as twitter and cell phones. This new output format, enabled by the -j switch (or --format=jam, but you don't want to type all that on a cell phone!), brings ls into the 21st century with an appropriate level of conciseness.
-- Joey Hess


Leo 4.10 released

Leo is an interesting combination of text editor, integrated development environment, project management tool, music player, and more. The 4.10 release is now available; it includes a lot of new commands, better abbreviation capabilities, and more.


libam7xxx 0.1.2

The libam7xxx project aims to write a user-space driver for USB-connected handheld projectors; the 0.1.2 release is now available. It currently supports the Acer C110 and Philips PicoPix PPX 1020 devices.


netsniff-ng 0.5.6 released

Netsniff-ng is a toolkit for the analysis and generation of network traffic. The 0.5.6 release is essentially a rewrite from scratch that turns it into a set of tools for traffic capture and analysis, packet generation, route tracing, and more. "flowtop is a top-like connection tracking tool that can run on an end host or router. It is able to present TCP or UDP flows that have been collected by the kernel space netfilter framework. Next to reverse DNS data, connection states and ports, geographical information about the connection end points are supplied."


StarPU 1.0.0 released

StarPU is a set of GCC extensions and associated runtime system intended to facilitate the programming of heterogeneous systems - computers with a programmable graphics processing unit, for example. "StarPU typically makes it much easier for high performance libraries or compiler environments to exploit heterogeneous multicore machines possibly equipped with GPGPUs or Cell processors: rather than handling low-level issues, programmers may concentrate on algorithmic concerns." The 1.0.0 release is now available; it has support for NVIDIA GPUs, processors implementing OpenCL, and Cell processors.


Qt5 Alpha released

The first alpha release of the Qt5 toolkit is available, showing the direction that Qt is taking. A lot of the work appears to be under-the-hood restructuring, but there's a number of new features as well. "There was one basic vision driving a lot of the Qt 5 work: 'Qt 5 should be the foundation for a new way of developing applications. While offering all of the power of native Qt using C++, the focus should shift to a model, where C++ is mainly used to implement modular backend functionality for Qt Quick.'" (Thanks to Paul Wise).


Udev and systemd to merge

Kay Sievers has sent out an announcement that the udev and systemd projects will be merging into a single source tree. "Today, ‘Init’ needs to be fully hotplug-capable; udev device management and knowledge about device lifecycles is an integral part of systemd and not an isolated logic. Due to this, and to minimize our administrative workload, as well as to minimize duplication of code, and to resolve cyclic build dependencies in the core OS, we have decided to merge the two projects." What the developers will not do is remove the ability to build and run udev on a system that is not using systemd.


Newsletters and articles


Grinberg: Linux on an 8-bit micro?

On his blog, Dmitry Grinberg writes about getting Linux to run on an 8-bit microcontroller. In order to do so, he wrote an ARM emulator for the ATmega1284p. The results: "uARM is certainly no speed demon. It takes about 2 hours to boot to bash prompt ("init=/bin/bash" kernel command line). Then 4 more hours to boot up the entire Ubuntu ("exec init" and then login). Starting X takes a lot longer. The effective emulated CPU speed is about 6.5KHz, which is on par with what you'd expect emulating a 32-bit CPU & MMU on a measly 8-bit micro. Curiously enough, once booted, the system is somewhat usable. You can type a command and get a reply within a minute. That is to say that you can, in fact, use it. I used it to day to format an SD card, for example. This is definitely not the fastest, but I think it may be the cheapest, slowest, simplest to hand assemble, lowest part count, and lowest-end Linux PC. The board is hand-soldered using wires, there is not even a requirement for a printed circuit board."


Can Willow Garage’s “Linux for Robots” Spur Internet-Scale Growth? (Xconomy)

Xconomy looks at Willow Garage and its open source software for robots. "Called the Robot Operating System, or ROS, it’s a collection of algorithms that handle standard tasks required of every mobile robot—things like making sense of a visual scene, planning a path around obstacles. Unlike PR2, ROS is completely free, and is already being adapted by hundreds of robotics labs and companies around the world. It’s spreading so fast that [CEO Steve] Cousins says Willow Garage is considering creating a non-profit foundation, similar to the Apache Software Foundation, that could organize the developer community, collect donations, and act as an independent steward and champion for the software." LWN covered a talk by Willow Garage's Tully Foote from SCALE 10x in January.


Russell: Sources of Randomness for Userspace

On his blog, Rusty Russell digs into sources of randomness for user-space programs (other than just reading /dev/urandom). "There are three obvious classes of randomness: things about the particular machine we’re on, things about the particular boot of the machine we’re on, and things which will vary every time we ask." He goes on to look at examples in each category and give a rough guess of the number of bits of entropy each would produce.


Page editor: Jonathan Corbet


Brief items

Creative Commons 4.0 BY-NC-SA draft available

The Creative Commons has posted the first draft of its revised noncommercial-sharealike license with a request for comments. "As anticipated, the license fully licenses database rights on the same terms and conditions as copyright and neighboring rights. We have heard no compelling reason for reversing course on this new policy, and all early feedback suggests this is a welcomed change despite questions about their utility. We have taken care to ensure that the license only applies where permission is needed and the licensor holds those rights."


FSF announces 2011 Free Software Award winners

The Free Software Foundation has announced that the winner of the 2011 award for the advancement of free software is Yukihiro "Matz" Matsumoto, the creator of the Ruby language. The award for projects of social benefit went to the GNU Health project.


Linux Tycoon - Linux Distro Building Simulator Game

Linux Tycoon is a game in which you "build and manage your own Linux Distro… without actually building or managing your own Linux Distro. Basically take out the “work” and the “bug fixing” and the “programming” parts… and, wham-o!, you’ve got Linux Tycoon."


Articles of interest

Archiving Images with an Open Source Scanning Robot

Project Gado is aimed at developing an autonomous archival scanning robot that will allow small archives and museums to digitize holdings at a low cost and help preserve important documents and pictures. The article takes a look at the project. ""Almost every aspect of the project uses some kind of open source tool," [project manager Thomas] Smith says. "Our robot control software is fully Linux compatible, and we run Ubuntu Linux on all our computers at the Afro. The Gado 2 uses the open source Arduino microcontroller, and all the components that we created – PCB, physical parts – are open source as well." The Gado also uses the open source Tesseract OCR engine to process materials, and the MySQL database system to store metadata. "Using open source tools allowed us to create the machine inexpensively, which is extremely important given our requirement that the final device cost less than $500," Smith says." Gado kits are available for pre-sale and are expected to be delivered in August.

Comments (none posted)

LF video: How Linux is built

The Linux Foundation has posted a cute video describing (at a very high level) how the kernel development process works. There will be few surprises there for LWN readers, but it may be useful for a wider audience.

Comments (5 posted)

Whitehurst: A billion thanks to the open source community from Red Hat

Red Hat CEO Jim Whitehurst celebrates the company's billion dollar milestone with a donation. "Last December, Red Hat decided that no billion dollar milestone would be complete without honoring the open source community. To that end, we are making a $100,000 donation to the future of open source. Red Hat associates nominated and voted for the following organizations to benefit:" Creative Commons, Electronic Frontier Foundation, Software Freedom Law Center, and UNICEF Innovation Labs.

Comments (11 posted)

Calls for Presentations

KDE and openSUSE Announce Opening of CfP for Dedicated COSCUP Track

COSCUP (Conference for Open Source Coders, Users and Promoters) will be held August 20-21, 2012 in Taipei, Taiwan. KDE and openSUSE are organizing a full two-day track at COSCUP. The call for papers deadline (for this track) is June 15. "The program committee is looking for presentations about KDE and openSUSE. Please note that the talks do NOT have to be related to both KDE and openSUSE. KDE and openSUSE happily welcome talks about KDE on other distributions or other (non) desktop technologies like GNOME or OpenStack on openSUSE."

Full Story (comments: none)

Scalability micro-conference topic proposals (LPC2012)

There will be a micro-conference on scaling both upwards (many cores) and downwards (low footprint, energy efficiency) during the Linux Plumbers Conference (August 29-31, 2012 in San Diego, California). "Suggestions of topics are welcome. If you would like to present, please let us know: we have lightning-talk slots and a few 30 minutes slots available. Presentations should be oriented towards stimulating discussion over currently faced scalability problems and/or work in progress in the area of scalability."

Full Story (comments: none)

openSUSE Summit website up and CfP started

The call for proposals for the openSUSE Summit, which will be held September 21-23, 2012 in Orlando, Florida, is now open. Submissions will be accepted until June 15 for sessions in three different tracks: "openSUSE Community", "openSUSE Tech", or "open World"—there is also a category for "fun" proposals: "The openSUSE Summit, by virtue of being an openSUSE event, has fun high on the agenda. Therefore, proposals that are "outside the box" of a "regular" software focused conference are encouraged. Collaboratively Building a Giant Paper Mache Geeko has already been proposed and rejected due to environmental concerns."

Full Story (comments: none)

1st Call For Papers, 19th Annual Tcl/Tk Conference 2012

The 19th Annual Tcl/Tk Conference (Tcl'2012) will be held in Chicago, Illinois November 12-16. The proposal deadline is August 27. "The program committee is asking for papers and presentation proposals from anyone using or developing with Tcl/Tk (and extensions)."

Full Story (comments: none)

Upcoming Events


The X.Org Developer Conference (XDC2012) will be held September 19-21 in Nürnberg, Germany. "If you would like to give a talk during the event, please add it to the program page"

Full Story (comments: none)

Events: April 5, 2012 to June 4, 2012

The following event listing is taken from the Calendar.

April 3-5: LF Collaboration Summit (San Francisco, CA, USA)
April 5-6: Android Open (San Francisco, CA, USA)
April 10-12: Percona Live: MySQL Conference and Expo 2012 (Santa Clara, CA, United States)
April 12-19: SuperCollider Symposium (London, UK)
April 12-13: European LLVM Conference (London, UK)
April 12-15: Linux Audio Conference 2012 (Stanford, CA, USA)
April 13: Drizzle Day (Santa Clara, CA, USA)
April 16-18: OpenStack "Folsom" Design Summit (San Francisco, CA, USA)
April 17-19: Workshop on Real-time, Embedded and Enterprise-Scale Time-Critical Systems (Paris, France)
April 19-20: OpenStack Conference (San Francisco, CA, USA)
April 21: international Openmobility conference 2012 (Prague, Czech Republic)
April 23-25: Lustre User Group (Austin, TX, USA)
April 25-28: Evergreen International Conference 2012 (Indianapolis, Indiana)
April 27-29: Penguicon (Dearborn, MI, USA)
April 28: Linuxdays Graz 2012 (Graz, Austria)
April 28-29: LinuxFest Northwest 2012 (Bellingham, WA, USA)
May 2-5: Libre Graphics Meeting 2012 (Vienna, Austria)
May 3-5: Utah Open Source Conference (Orem, Utah, USA)
May 7-9: Tizen Developer Conference (San Francisco, CA, USA)
May 7-11: Ubuntu Developer Summit - Q (Oakland, CA, USA)
May 8-11: samba eXPerience 2012 (Göttingen, Germany)
May 11-12: Professional IT Community Conference 2012 (New Brunswick, NJ, USA)
May 11-13: Debian BSP in York (York, UK)
May 13-18: C++ Now! (Aspen, CO, USA)
May 17-18: PostgreSQL Conference for Users and Developers (Ottawa, Canada)
May 22-24: Military Open Source Software - Atlantic Coast (Charleston, SC, USA)
May 23-26: LinuxTag (Berlin, Germany)
May 23-25: Croatian Linux Users' Convention (Zagreb, Croatia)
May 25-26: Flossie 2012 (London, UK)
May 28-June 1: Linaro Connect Q2.12 (Gold Coast, Hong Kong)
May 29-30: International conference NoSQL matters 2012 (Cologne, Germany)
June 1-3: Wikipedia & MediaWiki hackathon & workshops (Berlin, Germany)

If your event does not appear here, please tell us about it.

Page editor: Rebecca Sobol

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds