
LWN.net Weekly Edition for March 27, 2014

Kicking the tires of Ford's OpenXC

By Nathan Willis
March 26, 2014

OpenXC is a software platform created by the US auto maker Ford. The goal is to enable developers to write applications that integrate with the data buses in Ford vehicles. To that end, the company has rolled out a set of open APIs and has defined a family of hardware interfaces to bridge the vehicle-to-computer gap. I recently got the chance to work with one of the hardware interfaces, and, while the capabilities of OpenXC are limited in scope, the project certainly hits the right notes in terms of software freedom and usability. The long-term challenge will be seeing how OpenXC cooperates with the larger open source automotive software ecosystem.

Hardware

Ford launched OpenXC in January of 2012 in a collaborative effort with Bug Labs, who designed the hardware interface modules. Officially called Vehicle Interfaces (VIs), these modules plug into a car's On-Board Diagnostic (OBD-II) port. On the inside, the VI runs a car-specific firmware that reads from the data bus and translates raw OBD-II messages into JSON objects (in OpenXC's own message format), which it then sends out over Bluetooth.

That setup all sounds good, but naturally one cannot get started with it without a compatible VI (a generic OBD-II dongle is not a substitute, since it does not output OpenXC messages). In mid-2013, when I first started looking at OpenXC, the official VIs were out of production and there was no expected delivery date for more. Nevertheless, I signed up to be notified if new reference VIs were released, and in early March received an email announcing that a fresh batch was available for purchase. When the ordered VI arrived, I took it for a test drive.

[OpenXC VI attached in car]

The good news is that it is entirely possible to build one's own VI, should the current batch run out again, using off-the-shelf components and the open hardware chipKIT Max32, an Arduino-compatible microcontroller. That option is not an inexpensive one, however, and requires some skill with assembling electronics. Whichever hardware route one takes, the VI must be flashed with firmware tailored to the vehicle on which it will be used. The list of supported vehicles is not long—currently 35 Ford models, the oldest being from 2005. But the firmware is BSD-licensed and the source is available on GitHub.

With the hardware in hand, one can attach it to the car's OBD-II port, pair with it over Bluetooth, and run a quick smoke test to make sure that data is indeed getting sent from the VI.

Development

OpenXC supports both Android and Python development. The Android API is accessible from an open source library that supports Android 2.3 through 4.4. The API provides a high-level VehicleManager service that returns vehicle data objects relayed from the VI. There is also an auxiliary app called OpenXC Enabler that starts the vehicle service and pairs with the VI whenever the Android device is booted.

The Python interface is a bit lower-level; a module is available (just for Python 2.6 and 2.7 at the moment) that provides classes for the supported OpenXC data types as well as callbacks, basic logging functionality, and a facility to relay vehicle data to a remote HTTP server. The Python module also supports connecting to the VI over a serial or USB connection (as opposed to Bluetooth alone), and there is a set of command-line tools that can be used to monitor, dump, and trace through messages.
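
The VI's output is just newline-delimited JSON, so a first smoke test does not strictly require the library at all. Here is a minimal sketch in that spirit; the device path, baud rate, and the name/value message layout are assumptions based on the documented OpenXC trace format, and pyserial stands in for whatever transport one actually uses:

    import json
    import serial  # pyserial; a Bluetooth RFCOMM device works the same way

    def watch(device="/dev/rfcomm0", baud=115200):
        # Read newline-delimited JSON messages such as
        # {"name": "vehicle_speed", "value": 42.0} from the VI.
        vi = serial.Serial(device, baud, timeout=10)
        try:
            while True:
                raw = vi.readline()
                if not raw:
                    break  # timed out without data; VI may be asleep
                try:
                    msg = json.loads(raw.decode("utf-8", "replace"))
                except ValueError:
                    continue  # skip partial or corrupted lines
                if msg.get("name") == "vehicle_speed":
                    print("speed:", msg["value"])
        finally:
            vi.close()

    if __name__ == "__main__":
        watch()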

Both the Android and Python interfaces provide access to the same set of vehicle data, which is limited to a set of measurements (e.g., vehicle_speed, accelerator_pedal_position, headlamp_status, etc.) and a set of events (at the moment, just steering-wheel button presses). Exactly which measurements and events are available varies with the vehicle model; my own test car is a Mustang, which does not support the full set (omitting such measurements as steering_wheel_angle and fuel_level).

But Ford's documentation is detailed enough that the same set of measurements could also be read by a simpler device plugged into the same OBD-II port; there are, in fact, other devices that expose much the same functionality to other automotive Linux projects. Tizen IVI, for example, can use the OBDLink MX to monitor the same data. Still other devices can not only read messages from the vehicle's Controller Area Network (CAN) bus (which, for the record, is the bus protocol Ford uses behind the OBD-II port), but can write to it as well.

Merging with the automotive ecosystem

Given that there are other options, the question for developers is whether OpenXC offers advantages over interfacing directly with the vehicle bus. It certainly does simplify Android app development—and, if there were any lingering doubt about its usability, the project maintains a list of known OpenXC apps, which cover uses ranging from indicating the optimal shift moment (a relatively straightforward calculation) to warning of parking collisions (which requires significant hardware modifications, including the addition of a camera). Hooks are also in place to tie in to other useful hardware features of Android devices (such as location data).

The app-development story, however, is limited to Android, at least for now. Apple iOS support appears to be out of the question, since (as it was explained on the mailing list) Apple requires the presence of an Apple authentication chip in order to make low-level Bluetooth connections. As one might expect, there has so far been no word about any of the alternative mobile platforms (Firefox OS, Ubuntu, or even Windows Phone).

The case for Python development with OpenXC is not quite as clear-cut. The Python interface is designed to run on a more full-featured computer, which puts it into competition with in-vehicle infotainment (IVI) projects like GENIVI and Tizen IVI. An application-level API for accessing vehicle data is certainly appealing, but it could be a hard sell for Ford to convince other automakers to adopt OpenXC's message format and semantics rather than drafting something that is vendor-neutral from day one. Several pages on the project web site suggest that Ford is interested in pursuing this goal, encouraging users to (for example) tell other car makers that they want to see OpenXC support.

And there is competition. The most plausible candidate for such a vendor-neutral API is probably the World Wide Web Consortium (W3C) Automotive Business Group's Vehicle API, which recently became a draft specification. Ford does not have any official representatives in the group, although a few Ford employees participate as individuals representing themselves. Most of the IVI development projects are also designed to support sending messages over the automotive data bus, which OpenXC does not yet address. There are hints in the project wiki and mailing list that further expansion is a possibility, but competing projects like Tizen IVI's Automotive Message Broker have already implemented message-writing functionality.

[OpenXC VI size comparison]

OpenXC also has a hurdle to overcome on the hardware side. The reference VI is quite large, especially given that it must fit into a socket which, by informal industry standard, juts down below the steering column toward the driver's legs. OBD-II ports are not designed for permanent hardware installation, both because of their location and because most of them do not latch into place; an innocent knee can easily dislodge the VI several times over the course of a drive.

That said, OpenXC is a commendable developer program for several reasons. The entire stack, including the VI firmware, is available as free software, which is refreshing and—judging from the list traffic—also seems to attract quite a few developers. In contrast, for example, support for the OBDLink dongle mentioned earlier had to be reverse-engineered by interested open source developers. Ford has also released quite a bit of information about its vehicles and their CAN bus usage to the public, as well as other useful resources like vehicle data traces. In an industry as competitive as the automotive business, this level of openness is not the norm; the developer programs of other car makers generally require some type of non-disclosure agreement and maintain a tighter grip on their specifications.

The main goal of OpenXC seems to be making it possible to write third-party applications that utilize Ford vehicle data. That is, OpenXC is not aimed at system integrators or equipment manufacturers, but at individual, do-it-yourself hackers. On that front, it is easy to get started with, and worth exploring if one has a compatible car. Whether OpenXC can grow to encompass broader uses and more ambitious goals remains to be seen.

Comments (5 posted)

PostgreSQL pain points

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
The kernel has to work for a wide range of workloads; it is arguably unsurprising that it does not always perform as well as some user communities would like. One community that has sometimes felt left out in the cold is the PostgreSQL relational database management system project. In response to an invitation from the organizers of the 2014 Linux Storage, Filesystem, and Memory Management Summit, PostgreSQL developers Robert Haas, Andres Freund, and Josh Berkus came to discuss their worst pain points and possible solutions.

PostgreSQL is an old system, dating back to 1996; it has a lot of users running on a wide variety of operating systems. So the PostgreSQL developers are limited in the amount of Linux-specific code they can add. It is based on cooperating processes; threads are not used. System V shared memory is used for interprocess communication. Importantly, PostgreSQL maintains its own internal buffer cache, but also uses buffered I/O to move data to and from disk. This combination of buffering leads to a number of problems experienced by PostgreSQL users.

Slow sync

[Robert Haas]

The first problem described is related to how data gets to disk from the buffer cache. PostgreSQL uses a form of journaling that they call "write-ahead logging". Changes are first written to the log; once the log is safely on disk, the main database blocks can be written back. Much of this work is done in a "checkpoint" process; it writes log entries, then flushes a bunch of data back to various files on disk. The logging writes are relatively small and contiguous; they work fairly well, and, according to Andres, the PostgreSQL developers are happy enough with how that part of the system works on Linux.

The data writes are another story. The checkpoint process paces those writes to avoid overwhelming the I/O subsystem. But, when it gets around to calling fsync() to ensure that the data is safely written, all of those carefully paced writes are flushed into the request queue at once and an I/O storm results. The problem, they said, is not that fsync() is too slow; instead, it is too fast. It dumps so much data into the I/O subsystem that everything else, including read requests from applications, is blocked. That creates pain for users and, thus, for PostgreSQL developers.

Ted Ts'o asked whether the ability to limit the checkpoint process to a specific percentage of the available I/O bandwidth would help. But Robert responded that I/O priorities would be better; the checkpoint process should be able to use 100% of the bandwidth if nothing else wants it. Use of the ionice mechanism (which controls I/O priorities in the CFQ scheduler) was suggested, but there is a problem: it does not work for I/O initiated from an fsync() call. Even if the data was written from the checkpoint process — which is not always the case — priorities are not applied when the actual I/O is started by fsync().

Ric Wheeler suggested that the PostgreSQL developers need to better control the speed with which they write data; Chris Mason added that the O_DATASYNC option could be used to give better control over when the I/O requests are generated. The problem here is that such approaches require PostgreSQL to know about the speed of the storage device.

The discussion turned back to I/O priorities, at which point it was pointed out that those priorities can only be enforced in the request queue maintained by the I/O scheduler. Some schedulers, including those favored by PostgreSQL users (who tend to avoid the CFQ scheduler), do not implement I/O priorities at all. But even schedulers that do support I/O priorities place limits on the length of the request queue. A big flush of data will quickly fill the queue, at which point I/O priorities lose most of their effectiveness; a high-priority request will still languish if there is no room for it in the request queue. So, it seems, I/O priorities are not the solution to the problem.

It's not clear what the right solution is. Ted asked if the PostgreSQL developers could provide a small program that would generate the kind of I/O patterns created by a running database. Given a way to reproduce the problem easily, kernel developers could experiment with different approaches to a solution. This "program" may take the form of a configuration script for PostgreSQL initially, but a separate (small) program is what the kernel community would really like to see.
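
The shape of such a program is easy enough to sketch. The following (written in Python rather than C for brevity, with arbitrary placeholder sizes and pacing) reproduces the pattern described above: a stream of paced writes followed by the single fsync() that triggers the storm:

    import os
    import time

    def checkpoint(path="/tmp/pgtest.dat", blocks=25600, block_size=8192):
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
        try:
            buf = b"\0" * block_size
            for i in range(blocks):        # roughly 200MB of dirty data
                os.write(fd, buf)
                if i % 100 == 0:
                    time.sleep(0.01)       # paced, as the checkpointer does
            start = time.time()
            os.fsync(fd)                   # the storm: all of it flushed at once
            print("fsync took %.2f seconds" % (time.time() - start))
        finally:
            os.close(fd)

    if __name__ == "__main__":
        checkpoint()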

Double buffering

PostgreSQL needs to do its own buffering; it also needs to use buffered I/O for a number of reasons. That leads to a problem, though: database data tends to be stored in memory twice, once in the PostgreSQL buffer, and once in the page cache. That increases the amount of memory used by PostgreSQL considerably, to the detriment of the system as a whole.

[Andres Freund]

Much of that memory waste could conceivably be eliminated. Consider, for example, a dirty buffer in the PostgreSQL cache. It is more current than any version of that data that the kernel might have in the page cache; the only thing that will ever happen to the page cache copy is that it will be overwritten when PostgreSQL flushes the dirty buffer. So there is no value to keeping that data in the page cache. In this case, it would be nice if PostgreSQL could tell the kernel to remove the pages of interest from the page cache, but there is currently no good API for that. Calling fadvise() with FADV_DONTNEED can, according to Andres, actually cause pages to be read in; nobody quite understood this behavior, but all agreed it shouldn't work that way. They can't use madvise() without mapping the files; doing that in possibly hundreds of processes tends to be very slow.
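
For reference, the interface that exists today looks like this; the sketch below (Python 3.3 and later expose posix_fadvise() directly) writes back a dirty buffer, then asks the kernel to drop the now-redundant page cache copy. As noted above, the PostgreSQL developers have seen FADV_DONTNEED misbehave, so this shows the API rather than a proven fix:

    import os

    def flush_and_drop(path, offset, data):
        fd = os.open(path, os.O_WRONLY)
        try:
            os.pwrite(fd, data, offset)   # write back the dirty buffer
            os.fsync(fd)                  # make sure it is on disk first
            # Ask the kernel (Linux) to drop the redundant cached pages:
            os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
        finally:
            os.close(fd)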

It would also be nice to be able to move pages in the opposite direction: PostgreSQL might want to remove a clean page from its own cache, but leave a copy in the page cache. That could possibly be done with a special write operation that would not actually cause I/O, or with a system call that would transfer a physical page into the page cache. There was some talk of the form such an interface might take, but this part of the discussion eventually wound down without any firm conclusions.

Regressions

Another problem frequently experienced by PostgreSQL users is that recent kernel features tend to create performance problems. For example, the transparent huge pages feature tends not to bring much benefit to PostgreSQL workloads, but it slows them down significantly. Evidently a lot of time goes into the compaction code, which is working hard without actually producing a lot of free huge pages. In many systems, terrible performance problems simply vanish when transparent huge pages are turned off.

Mel Gorman answered that, if compaction is hurting performance, it's a bug. That said, he hasn't seen any transparent huge page bugs for quite some time. There is, he said, a patch out there which puts a limit on the number of processes that can be performing compaction at any given time. It has not been merged, though, because nobody has ever seen a workload where too many processes running compaction was a problem. It might, he suggested, be time to revisit that particular patch.

Another source of pain is the "zone reclaim" feature, whereby the kernel reclaims pages from some zones even if the system as a whole is not short of memory. Zone reclaim can slow down PostgreSQL workloads; usually the best thing to do on a PostgreSQL server is to simply disable the feature altogether. Andres noted that he has been called in as a consultant many times to deal with performance problems related to zone reclaim; it has, he said, been a good money-maker for him. Still, it would be good if the problem were to be fixed.

Mel noted that the zone reclaim mode was written under the assumption that all processes in the system would fit into a single NUMA node. That assumption no longer makes sense; it's long past time, he said, that this option's default changed to "off." There seemed to be no opposition to that idea in the room, so a change may happen sometime in the relatively near future.

Finally, the PostgreSQL developers noted that, in general, kernel upgrades tend to be scary. The performance characteristics of Linux kernels tend to vary widely from one release to the next; that makes upgrades into an uncertain affair. There was some talk of finding ways to run PostgreSQL benchmarks on new kernels, but no definite conclusions were reached. As a whole, though, developers for both projects were happy with how the conversation came out; if nothing else, it represents a new level of communication between the two projects.

[Your editor would like to thank the Linux Foundation for supporting his travel to the LSFMM summit.]

Comments (68 posted)

Facebook and the kernel

By Jake Edge
March 26, 2014
2014 LSFMM Summit

As one of the plenary sessions on the first day of the Linux Storage, Filesystem, and Memory Management (LSFMM) Summit, Btrfs developer Chris Mason presented on how his new employer, Facebook, uses the Linux kernel. He shared some of the eye-opening numbers that demonstrate just how much processing Facebook does using Linux, along with some of the "pain points" the company has with the kernel. Many of those pain points overlap with the user-space database woes presented in an earlier session (and will be discussed further at the Collaboration Summit that follows LSFMM), he said, so he mostly concentrated on other problem areas.

Architecture and statistics

In a brief overview of Facebook's architecture, he noted that there is a web tier that is CPU- and network-bound. It handles requests from users as well as sending replies. Behind that is a memcached tier that caches data, mostly from MySQL queries. Those queries are handled by a storage tier that is a collection of several different database systems: MySQL, Hadoop, RocksDB, and others.

[Chris Mason]

Within Facebook, anyone can look at and change the code in its source repositories. The facebook.com site has its code updated twice daily, he said, so the barrier to getting new code in the hands of users is low. Those changes can be fixes or new features.

As an example, he noted that the "Look Back" videos, which Facebook created for each user by reviewing all of their posts to the service, added a huge amount of data and required a lot more network bandwidth. The process of creating and serving all of those videos was the topic of a Facebook engineering blog post. In all, 720 million videos were created, which required an additional 11 petabytes of storage, as well as consuming 450 Gb/second of peak network bandwidth for people viewing the videos. The Look Back feature was conceived, provisioned, and deployed in only 30 days, he said.

The code changes quickly, so when performance or other problems crop up, he and other kernel developers can tell others in the company that "you're doing it wrong". In fact, he said, "they love that". It does mean that he has to come up with concrete suggestions on how to do it right, but Facebook is not unwilling to change its code.

Facebook runs a variety of kernel versions. The "most conservative" hosts run a 2.6.38-based kernel. Others run the 3.2 stable series with roughly 250 patches. Other servers run the 3.10 stable series with around 60 patches. Most of the patches are in the networking and tracing subsystems, with a few memory-management patches as well.

One thing that seemed to surprise Mason was the high failure tolerance that the Facebook production system has. He mentioned the 3.10 pipe race condition that Linus Torvalds fixed. It is a "tiny race", he said, but Facebook was hitting it (and recovering from it) 500 times per day. The architecture of the system is such that it could absorb that kind of failure rate without users noticing anything wrong.

Pain points

Mason asked around within Facebook to try to determine the worst problems the company has with the kernel. In the end, two features were mentioned most frequently: stable pages and the completely fair queueing (CFQ) I/O scheduler. "I hope we never find those guys", he said with a laugh, since Btrfs implements stable pages. In addition, James Bottomley noted that Facebook also employs CFQ's developer (Jens Axboe).

Another area that was problematic for Facebook is surprises with buffered I/O latency, especially for append-only database files. Most of the time, those writes go fast, but sometimes they are quite slow. He would like to see the kernel avoid latency spikes like that.

He would like to see kernel-style spinlocks be available from user space. Rik van Riel suggested that perhaps POSIX locks could use adaptive locking, which would spin for a short time then switch to sleeping if the lock did not become available quickly. The memcached tier has a kind of user-space spinlock, Mason said, but it is "very primitive compared to the kernel".
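
The control flow van Riel described is simple to sketch. A real adaptive lock would be built on atomics and futexes in C; this Python rendering, using only a standard lock, illustrates the spin-then-sleep pattern and nothing more:

    import threading

    class AdaptiveLock:
        """Spin briefly on the assumption that the holder will release
        soon; fall back to sleeping in the kernel if it does not."""
        def __init__(self, spin_count=1000):
            self._lock = threading.Lock()
            self._spin_count = spin_count

        def acquire(self):
            for _ in range(self._spin_count):
                if self._lock.acquire(blocking=False):
                    return                 # got the lock while spinning
            self._lock.acquire()           # give up and sleep

        def release(self):
            self._lock.release()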

Fine-grained I/O priorities are another wish-list item for Facebook (and for the PostgreSQL developers as well). There are always cleaners and compaction threads that need to do I/O, but they shouldn't hold off the higher-priority "foreground" I/O. Mason was asked how the priorities would be specified: by I/O operation or by file range, for example. In addition, he was asked how fine-grained the priorities needed to be. Either way of specifying the priorities would be reasonable, and Facebook really only needs two (or a few) priority levels: low and high.

The subject of ionice was raised again. One of the problems with that as a solution is that it only works with the (disabled by Facebook) CFQ scheduler. Bottomley suggested making ionice work with all of the schedulers, which Mason said might help. In order to do that, though, Ted Ts'o noted that the writeback daemon will have to understand the ionice settings.

Another problem area is logging. Facebook logs a lot of data and the logging workloads have to use fadvise() and madvise() to tell the kernel that those pages should not be saved in the page cache. "We should do better than that." Van Riel suggested that the page replacement patches in recent kernels may make things better. Mason said that Facebook does not mind explicitly telling the kernel which processes are sequentially accessing the log files, but continually calling *advise() seems excessive.

Josef Bacik has also been working on a small change to Btrfs to allow rate-limiting buffered I/O. It was easy to do in Btrfs, Mason said, but if the idea pans out, it would move elsewhere for more general availability. Jan Kara was concerned that limiting only buffered I/O would be difficult since there are other kinds of I/O bound for the disk at any given time. Mason agreed, saying that the solution would not be perfect but might help.

Bottomley noted that ionice is an existing API that should be reused to help with these kinds of problems. Similar discussions of using other mechanisms in the past have run aground on "painful arguments about which API is right", he said. Just making balance_dirty_pages() aware of the ionice priority may solve 90% of the problem. Other solutions can be added later.

Mason explained that Facebook stores its logs in a large Hadoop database, but that the tools for finding problems in those logs are fairly primitive—grep essentially. He said that he would "channel Lennart [Poettering] and Kay [Sievers]" briefly to wish for a way to tag kernel messages. Bottomley's suggestion that Mason bring it up with Linus Torvalds at the next Kernel summit was met with widespread chuckling.

Danger tier

While 3.10 is fairly close to the latest kernels, Mason would like to run even more recent kernels. To that end, he is creating something he calls the "danger tier". He ported the 60 patches that Facebook currently adds to 3.10.x to the current mainline Git tree and is carving out roughly 1000 machines to test that kernel in the web tier. He will be able to gather lots of performance metrics from those systems.

As a simple example of the kinds of data he can gather, he put up a graph of request response times (without any units) that was gathered over 3 days. It showed a steady average response time line all the way at the bottom as well as the ten worst systems' response times. Those not only showed large spikes in the response times, but also that the baseline for those systems was roughly twice that of the average. He can determine which systems those are, ssh in, and try to diagnose what is happening with them.

He said that was just an example. Eventually he will be able to share more detailed information that can be used to try to diagnose problems in newer kernels and get them fixed more quickly. He asked for suggestions of metrics to gather for the future. With that, his session slot expired.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]

Comments (14 posted)

Page editor: Jonathan Corbet

Security

The S-CRIB password scrambler

By Nathan Willis
March 26, 2014

Password protection—and the lack thereof—is a hot news topic these days, particularly where public web services are concerned. It is no longer surprising to hear about the latest bulk theft of user credentials from a popular online merchant. Historically, user account passwords at such sites have been stored in hashed form, which requires some computing resources to crack—but not an unobtainable level of resources, at least for those passwords that are simple and easily guessed. A team of security researchers recently proposed a solution to that pain point, using an inexpensive hardware token to make stored passwords irretrievable to attackers.

Dan Cvrcek of the security consulting firm Smart Crib posted a description of the system at the Light Blue Touchpaper blog in early March. The setup uses a hardware security module (HSM) provisioned with a secret key; the key, in conjunction with the SHA-1 hash function, is used to compute the SHA1-HMAC message authentication code for each user password. The web service stores the SHA1-HMAC of each password rather than a simple hash, which means that even if the password file is stolen, the passwords cannot be brute-forced without the secret key held in the HSM.

Obviously storing user passwords in a more secure format is better, at least if the HSM can truly be trusted. If the key is somehow retrievable, the system falls apart. Smart Crib's system includes software that allows the user to provision the HSM with the key once during the initial setup phase. Subsequently, the HSM responds only to two commands: ENSCRAMBLE, which computes the SHA1-HMAC of the password provided as an argument, and ENGETID, which returns the ID of the HSM itself. Smart Crib's token, sold as the S-CRIB Scrambler, is a $70 USB stick, which can only be accessed by UART over USB (so it cannot normally be accessed by a virtual machine).
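
The construction itself is standard. A software rendering of what ENSCRAMBLE computes might look like the sketch below; in the real system the secret key never leaves the HSM, and holding it in a variable here is purely for illustration:

    import hashlib
    import hmac

    HSM_SECRET = b"provisioned-once-into-the-token"  # never leaves the HSM

    def enscramble(password):
        # HMAC-SHA1 of the password under the HSM's secret key
        return hmac.new(HSM_SECRET, password.encode("utf-8"),
                        hashlib.sha1).hexdigest()

    # The web service stores enscramble(password) instead of hash(password);
    # a stolen password database is useless without the key in the HSM.
    stored = enscramble("correct horse battery staple")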

Apart from the trust factor in the HSM itself, there are a number of practical questions to ask as well—starting with whether the HSM can keep up with a high load of password checking requests. Cvrcek reports that the Scrambler is able to respond to 330 requests per minute—or 0.18 seconds per password. That is not a particularly high rate, as commenters on the original blog post pointed out, though it should be noted that the S-CRIB Scrambler is a modified version of Smart Crib's other USB dongle, which is designed for secure password storage, not transaction speed. It might be relatively easy to design dedicated hardware for the task with considerably higher throughput. Scramblers can also be configured to work in clusters, provisioning multiple HSMs with the same secret key.

But another practical question is what would happen if the HSM was damaged or destroyed (either by attackers or through an accident). If the only copy of the key is lost, all of the user account passwords are genuinely irretrievable. Clearly the ability to provision multiple HSMs with the same key provides a means to keep an external backup, but on the other hand, that same backup is also a tempting target for thieves.

Another potential weakness with the Scrambler is its use of SHA1. As security researcher Jeremi Gosney told Ars Technica, the Scrambler uses only one round of SHA1, and SHA1 is no longer considered state-of-the-art. The Scrambler documentation defends this choice by pointing out that SHA1 contributes only a portion of the security of the SHA1-HMAC scheme; in total, they claim the equivalent of at least 122 bits of complexity. In comment 9 on the original blog post, Cvrcek also argues that SHA1's weaknesses are still limited to theoretical attacks for finding collisions, and collisions do not directly result in password cracks.

Einar Otto Stangvik raised yet another concern with the project on his blog. While encrypting the passwords for storage is a good idea, he said, the Scrambler does not currently offer an advantage over using something like bcrypt to do the job in software. Because bcrypt is adaptive, it can be configured to be unrealistically expensive for attackers to brute force. Whatever the current generation of hardware available to attackers, bcrypt can make password cracking just as expensive as the Scrambler or more so, without the potential pitfall of having the password database lost for good due to a hardware failure in the HSM.
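
For comparison, Stangvik's software alternative is a few lines with the third-party bcrypt package; the rounds parameter is the adaptive work factor, and raising it by one doubles the cost of each guess:

    import bcrypt

    def store(password, rounds=12):
        # gensalt() embeds the work factor in the stored hash
        return bcrypt.hashpw(password.encode("utf-8"),
                             bcrypt.gensalt(rounds=rounds))

    def check(password, stored_hash):
        return bcrypt.checkpw(password.encode("utf-8"), stored_hash)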

Here again, the Scrambler documentation defends the HSM approach, arguing that "too expensive for attackers" is no longer a meaningful measurement in a world where massive cloud server clusters can be rented cheaply. The Scrambler approach derives its security from the strength of the encryption algorithm, not from assumptions about the attacker's hardware. Stangvik does not take issue with that premise, but he advises against putting too much trust in an untested system like the Scrambler.

Of course, evidence suggests that the Smart Crib team agrees with that last point in theory; they are aware that the Scrambler is new and untested in the field, and Cvrcek's blog post explicitly asks for feedback from the security community.

Naturally, deploying the Scrambler system for a real-world site involves quite a few other puzzle pieces, such as end-to-end encryption between the client computer, the site's web server, and the host to which the HSM is attached (if indeed it is a separate machine). The Scrambler does offer some measure of protection for securing the communication channel. In addition to the secret key intended for SHA1-HMAC, the Scramblers are also provisioned with a separate key to encrypt the connection between the HSM host and web server. One assumes that the site and client computer connect over SSL/TLS.

In its current form, the Scrambler might not be fast enough to recommend for a high-performance web site, but it presents an interesting approach. Perhaps, with additional time and engineering, building trusted HSMs in an inexpensive USB form-factor will indeed offer a stronger approach to safely storing passwords and other credentials. Until then, there is still always bcrypt.

Comments (17 posted)

Brief items

Security quotes of the week

1990s: The net interprets censorship as damage and routes around it.

2010s: The net interprets censorship as entertainment and routes it around.

Don Marti

If you fret over hackers or intelligence agencies reading your email, wait until they eavesdrop on your eyesight.

That’s the troubling–if still remote–possibility demonstrated by a pair of California Polytechnic San Luis Obispo graduate researchers who have built what may be the world’s first spyware proof-of-concept for Google’s Glass computer eyepieces. The stealthy software, designed by 22-year-old Mike Lady and 24-year-old Kim Paterson, takes a photo every ten seconds when Glass’s display is off, uploading the images to a remote server without giving the wearer any sign that his or her vision is being practically livestreamed to a stranger. To trick users into accepting permissions that allow the software to take photos and access the Internet, Lady’s and Paterson’s app masquerades as note-taking software; they call it Malnotes.

Andy Greenberg in Forbes

Comments (none posted)

Full Disclosure Mailing List: A Fresh Start

The full-disclosure mailing list is back. Nmap developer Fyodor has announced that he is resurrecting the list after its abrupt closure in mid-March. "The new list must be run by and for the security community in a vendor-neutral fashion. It will be lightly moderated like the old list, and a volunteer moderation team will be chosen from the active users. As before, this will be a public forum for detailed discussion of vulnerabilities and exploitation techniques, as well as tools, papers, news, and events of interest to the community. FD differs from other security lists in its open nature, light (versus restrictive) moderation, and support for researchers' right to decide how to disclose their own discovered bugs. The full disclosure movement has been credited with forcing vendors to better secure their products and to publicly acknowledge and fix flaws rather than hide them. Vendor legal intimidation and censorship attempts won't be tolerated!"

Comments (none posted)

New vulnerabilities

apache: multiple vulnerabilities

Package(s):apache CVE #(s):CVE-2013-6438 CVE-2014-0098
Created:March 20, 2014 Updated:April 23, 2014
Description:

From the Mageia advisory:

Apache HTTPD before 2.4.9 was vulnerable to a denial of service in mod_dav when handling DAV_WRITE requests (CVE-2013-6438).

Apache HTTPD before 2.4.9 was vulnerable to a denial of service when logging cookies (CVE-2014-0098).

Alerts:
Mandriva MDVSA-2015:093 apache 2015-03-28
openSUSE openSUSE-SU-2014:1647-1 apache2 2014-12-15
SUSE SUSE-SU-2014:1082-1 apache2 2014-09-02
SUSE SUSE-SU-2014:1081-1 apache2 2014-09-02
SUSE SUSE-SU-2014:1080-1 apache2 2014-09-02
Gentoo 201408-12 apache 2014-08-29
openSUSE openSUSE-SU-2014:1044-1 apache2 2014-08-20
openSUSE openSUSE-SU-2014:1045-1 apache2 2014-08-20
openSUSE openSUSE-SU-2014:0969-1 apache 2014-08-07
SUSE SUSE-SU-2014:0967-1 the Apache Web Server 2014-08-07
Fedora FEDORA-2014-5004 httpd 2014-04-23
Scientific Linux SLSA-2014:0369-1 httpd 2014-04-04
Scientific Linux SLSA-2014:0370-1 httpd 2014-04-04
Oracle ELSA-2014-0369 httpd 2014-04-03
Oracle ELSA-2014-0370 httpd 2014-04-03
CentOS CESA-2014:0369 httpd 2014-04-04
CentOS CESA-2014:0370 httpd 2014-04-04
Red Hat RHSA-2014:0369-01 httpd 2014-04-03
Red Hat RHSA-2014:0370-01 httpd 2014-04-03
Slackware SSA:2014-086-02 httpd 2014-03-28
Fedora FEDORA-2014-4555 httpd 2014-03-31
Ubuntu USN-2152-1 apache2 2014-03-24
Mandriva MDVSA-2014:065 apache 2014-03-20
Mageia MGASA-2014-0135 apache 2014-03-19

Comments (none posted)

asterisk: two vulnerabilities

Package(s):asterisk CVE #(s):CVE-2014-2286 CVE-2014-2287
Created:March 21, 2014 Updated:January 27, 2017
Description: From the Fedora advisory:

CVE-2014-2286: AST-2014-001: Stack overflow in HTTP processing of Cookie headers.

Sending a HTTP request that is handled by Asterisk with a large number of Cookie headers could overflow the stack. Another vulnerability along similar lines is any HTTP request with a ridiculous number of headers in the request could exhaust system memory.

CVE-2014-2287: AST-2014-002: chan_sip: Exit early on bad session timers request

This change allows chan_sip to avoid creation of the channel and consumption of associated file descriptors altogether if the inbound request is going to be rejected anyway.

Alerts:
Gentoo 201405-05 asterisk 2014-05-03
Mandriva MDVSA-2014:078 asterisk 2014-04-16
Mageia MGASA-2014-0171 asterisk 2014-04-15
Mageia MGASA-2014-0172 asterisk 2014-04-15
Fedora FEDORA-2014-3779 asterisk 2014-03-21
Fedora FEDORA-2014-3762 asterisk 2014-03-21
Debian-LTS DLA-455-1 asterisk 2016-05-03
Debian-LTS DLA-781-1 asterisk 2017-01-13
Debian-LTS DLA-781-2 asterisk 2017-01-27

Comments (none posted)

chromium-browser: multiple vulnerabilities

Package(s):chromium-browser-stable CVE #(s):CVE-2014-1700 CVE-2014-1701 CVE-2014-1702 CVE-2014-1703 CVE-2014-1704 CVE-2014-1705 CVE-2014-1713 CVE-2014-1715
Created:March 20, 2014 Updated:April 16, 2014
Description:

From the Mageia advisory:

Use-after-free in speech (CVE-2014-1700).

UXSS in events (CVE-2014-1701).

Use-after-free in web database (CVE-2014-1702).

Potential sandbox escape due to a use-after-free in web sockets (CVE-2014-1703).

Multiple vulnerabilities in V8 fixed in version 3.23.17.18 (CVE-2014-1704).

Memory corruption in V8 (CVE-2014-1705).

Use-after-free in Blink bindings (CVE-2014-1713).

Directory traversal issue (CVE-2014-1715).

Alerts:
Red Hat RHSA-2014:1744-01 v8314-v8 2014-10-30
Fedora FEDORA-2014-10975 v8 2014-09-28
Fedora FEDORA-2014-11065 v8 2014-09-28
Gentoo 201408-16 chromium 2014-08-30
Fedora FEDORA-2014-4625 v8 2014-04-15
openSUSE openSUSE-SU-2014:0501-1 chromium 2014-04-09
Fedora FEDORA-2014-4081 v8 2014-04-02
Debian DSA-2883-1 chromium-browser 2014-03-23
Mageia MGASA-2014-0134 chromium-browser-stable 2014-03-19

Comments (none posted)

extplorer: multiple cross-site scripting flaws

Package(s):extplorer CVE #(s):CVE-2013-5951
Created:March 21, 2014 Updated:March 26, 2014
Description: From the Debian advisory:

Multiple cross-site scripting (XSS) vulnerabilities have been discovered in extplorer, a web file explorer and manager using Ext JS. A remote attacker can inject arbitrary web script or HTML code via a crafted string in the URL to application.js.php, admin.php, copy_move.php, functions.php, header.php and upload.php.

Alerts:
Debian DSA-2882-1 extplorer 2014-03-20

Comments (none posted)

icinga: buffer overflow

Package(s):icinga CVE #(s):CVE-2014-2386
Created:March 24, 2014 Updated:March 26, 2014
Description: From the openSUSE advisory:

The monitoring system icinga received security fixes in the cgi helpers where buffers could be overflowed by 1 byte. Note that this will be caught by the FORTIFY_SOURCE static overflow detection.

Alerts:
Debian DSA-2956-1 icinga 2014-06-11
openSUSE openSUSE-SU-2014:0420-1 icinga 2014-03-24

Comments (none posted)

initramfs-tools: incorrectly mounted /run

Package(s):initramfs-tools CVE #(s):
Created:March 25, 2014 Updated:March 27, 2014
Description: From the Ubuntu advisory:

Kees Cook discovered that initramfs-tools incorrectly mounted /run without the noexec option, contrary to expected behaviour.

Alerts:
Ubuntu USN-2153-1 initramfs-tools 2014-03-24

Comments (none posted)

kernel: denial of service

Package(s):kernel CVE #(s):CVE-2014-0055
Created:March 25, 2014 Updated:April 9, 2014
Description: From the Red Hat advisory:

A flaw was found in the way the get_rx_bufs() function in the vhost_net implementation in the Linux kernel handled error conditions reported by the vhost_get_vq_desc() function. A privileged guest user could use this flaw to crash the host.

Alerts:
Oracle ELSA-2014-1392 kernel 2014-10-21
openSUSE openSUSE-SU-2014:1246-1 kernel 2014-09-28
SUSE SUSE-SU-2014:0908-1 Linux kernel 2014-07-17
SUSE SUSE-SU-2014:0909-1 Linux kernel 2014-07-17
SUSE SUSE-SU-2014:0910-1 Linux kernel 2014-07-17
SUSE SUSE-SU-2014:0911-1 Linux kernel 2014-07-17
SUSE SUSE-SU-2014:0912-1 Linux kernel 2014-07-17
openSUSE openSUSE-SU-2014:0856-1 kernel 2014-07-01
openSUSE openSUSE-SU-2014:0840-1 kernel 2014-06-25
Ubuntu USN-2236-1 linux-ti-omap4 2014-06-05
Ubuntu USN-2235-1 kernel 2014-06-05
Ubuntu USN-2225-1 linux-lts-saucy 2014-05-27
Ubuntu USN-2224-1 linux-lts-raring 2014-05-27
Ubuntu USN-2223-1 linux-lts-quantal 2014-05-27
Ubuntu USN-2228-1 kernel 2014-05-27
Mageia MGASA-2014-0238 kernel-vserver 2014-05-24
Mageia MGASA-2014-0234 kernel-tmb 2014-05-23
Mageia MGASA-2014-0236 kernel-tmb 2014-05-24
Mageia MGASA-2014-0237 kernel-rt 2014-05-24
Mageia MGASA-2014-0235 kernel-linus 2014-05-24
Mageia MGASA-2014-0229 kernel-vserver 2014-05-19
Mageia MGASA-2014-0228 kernel 2014-05-19
Mageia MGASA-2014-0208 kernel-rt 2014-05-08
Mageia MGASA-2014-0207 kernel-linus 2014-05-08
Mageia MGASA-2014-0206 kernel 2014-05-08
Oracle ELSA-2014-0475 kernel 2014-05-07
CentOS CESA-2014:X009 kernel 2014-06-16
Fedora FEDORA-2014-4849 kernel 2014-04-09
Fedora FEDORA-2014-4675 kernel 2014-04-04
Scientific Linux SLSA-2014:0328-1 kernel 2014-03-25
Oracle ELSA-2014-0328 kernel 2014-03-25
CentOS CESA-2014:0328 kernel 2014-03-25
Red Hat RHSA-2014:0328-01 kernel 2014-03-25

Comments (none posted)

mozilla: multiple vulnerabilities

Package(s):firefox, thunderbird, seamonkey CVE #(s):CVE-2014-1496 CVE-2014-1501
Created:March 24, 2014 Updated:March 26, 2014
Description: From the SUSE advisory:

Security researcher Ash reported an issue where the extracted files for updates to existing files are not read only during the update process. This allows for the potential replacement or modification of these files during the update process if a malicious application is present on the local system. (CVE-2014-1496)

Security researcher Alex Infuehr reported that on Firefox for Android it is possible to open links to local files from web content by selecting "Open Link in New Tab" from the context menu using the file: protocol. The web content would have to know the precise location of a malicious local file in order to exploit this issue. This issue does not affect Firefox on non-Android systems. (CVE-2014-1501)

Alerts:
Gentoo 201504-01 firefox 2015-04-07
Mageia MGASA-2014-0146 iceape 2014-03-31
Fedora FEDORA-2014-4106 firefox 2014-03-25
SUSE SUSE-SU-2014:0418-1 firefox 2014-03-21

Comments (none posted)

nginx: code execution

Package(s):nginx CVE #(s):CVE-2014-0133
Created:March 20, 2014 Updated:June 23, 2014
Description:

From the Mageia advisory:

A bug in the experimental SPDY implementation in nginx was found, which might allow an attacker to cause a heap memory buffer overflow in a worker process by using a specially crafted request, potentially resulting in arbitrary code execution (CVE-2014-0133).

Alerts:
Mandriva MDVSA-2015:094 nginx 2015-03-28
Gentoo 201406-20 nginx 2014-06-22
openSUSE openSUSE-SU-2014:0450-1 nginx 2014-03-26
Mageia MGASA-2014-0136 nginx 2014-03-19

Comments (none posted)

nss: incorrect wildcard certificate handling

Package(s):nss CVE #(s):CVE-2014-1492
Created:March 21, 2014 Updated:August 19, 2014
Description: From the Mageia advisory:

In the NSS library before version 3.16, wildcard certificates were matched even when the wildcard character was embedded within the U-label of an internationalized domain name, which is not in accordance with RFC 6125 (CVE-2014-1492).

Alerts:
Gentoo 201504-01 firefox 2015-04-07
Mandriva MDVSA-2015:059 nss 2015-03-13
CentOS CESA-2014:1246 nss, nspr 2014-09-30
Scientific Linux SLSA-2014:1246-1 nss and nspr 2014-09-26
Oracle ELSA-2014-1246 nss, nspr 2014-09-17
Red Hat RHSA-2014:1246-01 nss, nspr 2014-09-16
openSUSE openSUSE-SU-2014:1100-1 Firefox 2014-09-09
Oracle ELSA-2014-1073 nss, nss-util, nss-softokn 2014-08-18
CentOS CESA-2014:1073 nss 2014-08-18
CentOS CESA-2014:1073 nss-softokn 2014-08-18
CentOS CESA-2014:1073 nss-util 2014-08-18
Red Hat RHSA-2014:1073-01 nss, nss-util, nss-softokn 2014-08-18
openSUSE openSUSE-SU-2014:0950-1 Mozilla 2014-07-30
Debian DSA-2994-1 nss 2014-07-31
Scientific Linux SLSA-2014:0917-1 nss and nspr 2014-07-22
Oracle ELSA-2014-0917 nss, nspr 2014-07-22
Red Hat RHSA-2014:0917-01 nss, nspr 2014-07-22
SUSE SUSE-SU-2014:0665-2 firefox 2014-05-28
SUSE SUSE-SU-2014:0727-1 firefox 2014-05-28
SUSE SUSE-SU-2014:0665-1 Mozilla Firefox 2014-05-16
SUSE SUSE-SU-2014:0638-2 Mozilla Firefox 2014-05-16
SUSE SUSE-SU-2014:0638-1 firefox 2014-05-14
openSUSE openSUSE-SU-2014:0629-1 seamonkey 2014-05-12
openSUSE openSUSE-SU-2014:0599-1 firefox 2014-05-02
Ubuntu USN-2185-1 firefox 2014-04-29
Ubuntu USN-2159-1 nss 2014-04-02
Slackware SSA:2014-086-04 nss 2014-03-28
Mandriva MDVSA-2014:066 nss 2014-03-20
Mageia MGASA-2014-0137 nss, firefox, and thunderbird 2014-03-20

Comments (none posted)

openssh: restriction bypass

Package(s):openssh CVE #(s):CVE-2014-2532
Created:March 25, 2014 Updated:April 7, 2014
Description: From the Ubuntu advisory:

Jann Horn discovered that OpenSSH incorrectly handled wildcards in AcceptEnv lines. A remote attacker could use this issue to possibly bypass certain intended environment variable restrictions.

Alerts:
Mandriva MDVSA-2015:095 openssh 2015-03-28
Scientific Linux SLSA-2014:1552-2 openssh 2014-10-22
Oracle ELSA-2014-1552 openssh 2014-10-16
Red Hat RHSA-2014:1552-02 openssh 2014-10-14
Fedora FEDORA-2014-6569 openssh 2014-06-10
Fedora FEDORA-2014-6380 openssh 2014-05-21
Gentoo 201405-06 openssh 2014-05-11
Mandriva MDVSA-2014:068 openssh 2014-04-09
Debian DSA-2894-1 openssh 2014-04-05
Mageia MGASA-2014-0143 openssh 2014-03-31
Slackware SSA:2014-086-06 openssh 2014-03-28
Ubuntu USN-2155-1 openssh 2014-03-25

Comments (none posted)

perltidy: insecure temporary file creation

Package(s):perltidy CVE #(s):CVE-2014-2277
Created:March 24, 2014 Updated:April 1, 2014
Description: From the Red Hat bugzilla:

Jakub Wilk discovered that perltidy's make_temporary_filename() function insecurely created temporary files via the use of the tmpnam() function. A local attacker could use this flaw to perform a symbolic link attack.
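
The general fix pattern is the same in any language: create the file atomically with O_EXCL and restrictive permissions, and use the returned handle rather than the name. In Python (standing in for the Perl original here), it is a short function with the tempfile module:

    import os
    import tempfile

    def make_temporary_file(prefix="scratch-"):
        # mkstemp() creates the file with O_CREAT|O_EXCL and mode 0600,
        # so a pre-planted symlink at a predictable name cannot be followed.
        fd, path = tempfile.mkstemp(prefix=prefix)
        return os.fdopen(fd, "w"), path   # use the open handle, not the name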

Alerts:
Mageia MGASA-2014-0147 perltidy 2014-03-31
Fedora FEDORA-2014-3891 perltidy 2014-03-24
Fedora FEDORA-2014-3874 perltidy 2014-03-24

Comments (none posted)

python: denial of service

Package(s):python CVE #(s):CVE-2013-1753
Created:March 24, 2014 Updated:March 26, 2014
Description: From the Mageia advisory:

A gzip bomb and unbound read denial of service flaw in python XMLRPC library

Alerts:
Oracle ELSA-2015-2101 python 2015-11-23
Red Hat RHSA-2015:2101-01 python 2015-11-19
Ubuntu USN-2653-1 python2.7, python3.2, python3.4 2015-06-25
Red Hat RHSA-2015:1064-01 python27 2015-06-04
Mandriva MDVSA-2015:075 python 2015-03-27
openSUSE openSUSE-SU-2014:0498-1 python3 2014-04-09
Mandriva MDVSA-2014:074 python 2014-04-09
Mageia MGASA-2014-0139 python 2014-03-24
Scientific Linux SLSA-2015:2101-1 python 2015-12-21

Comments (none posted)

python3: denial of service

Package(s):python3 CVE #(s):CVE-2013-7338
Created:March 24, 2014 Updated:January 6, 2015
Description: From the Python advisory:

I am using the zipfile module on a webserver which provides a service which processes files in zips uploaded by users, while hardening against zip bombs, I tried binary editing a zip to put in false file size information. The result is interesting, when with a ZIP_STORED file, or with carefully crafted ZIP_DEFLATED file (and perhaps ZIP_BZIP2 and ZIP_LZMA for craftier hackers than I), when the stated file size exceeds the size of the archive itself, ZipExtFile.read goes into an infinite loop, consuming 100% CPU.
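
The hardening the reporter describes amounts to never trusting the size fields in the zip headers. A sketch of that defense (the cap value is an arbitrary placeholder) reads each member in bounded chunks and bails out past a hard limit; it does not fix the interpreter bug itself, but it bounds what an application will accept from any one member:

    import zipfile

    MAX_MEMBER_SIZE = 64 * 1024 * 1024   # 64MB cap per archive member

    def read_member(archive_path, member_name):
        with zipfile.ZipFile(archive_path) as zf:
            data = bytearray()
            with zf.open(member_name) as member:
                while True:
                    chunk = member.read(65536)
                    if not chunk:
                        break
                    data.extend(chunk)
                    if len(data) > MAX_MEMBER_SIZE:
                        raise ValueError("zip member exceeds size cap")
            return bytes(data)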

Alerts:
Mandriva MDVSA-2015:076 python3 2015-03-27
Gentoo 201503-10 python 2015-03-18
Fedora FEDORA-2014-16479 python3 2015-01-06
Fedora FEDORA-2014-16393 python3 2014-12-12
openSUSE openSUSE-SU-2014:0597-1 python3 2014-05-02
openSUSE openSUSE-SU-2014:0498-1 python3 2014-04-09
Mageia MGASA-2014-0140 python3 2014-03-24

Comments (none posted)

python-swiftclient: SSL certificate verification

Package(s):python-swiftclient CVE #(s):CVE-2013-6396
Created:March 21, 2014 Updated:March 26, 2014
Description: From the Fedora advisory:

Add SSL certificate verification by default (CVE-2013-6396)

Alerts:
Fedora FEDORA-2014-3054 python-swiftclient 2014-03-21

Comments (none posted)

springframework-security: authentication bypass

Package(s):springframework-security CVE #(s):CVE-2014-0097
Created:March 21, 2014 Updated:March 26, 2014
Description: From the Red Hat bugzilla entry:

It was found that empty passwords could bypass authentication. From the original advisory:

"The ActiveDirectoryLdapAuthenticator does not check the password length. If the directory allows anonymous binds then it may incorrectly authenticate a user who supplies an empty password."

Alerts:
Fedora FEDORA-2014-3812 springframework-security 2014-03-21
Fedora FEDORA-2014-3811 springframework-security 2014-03-21

Comments (none posted)

tigervnc: code execution

Package(s):tigervnc CVE #(s):CVE-2014-0011
Created:March 24, 2014 Updated:November 6, 2014
Description: From the Red Hat bugzilla:

A heap-based buffer overflow was found in the way vncviewer rendered certain screen images from a vnc server. If a user could be tricked into connecting to a malicious vnc server, it may cause the vncviewer to crash, or could possibly execute arbitrary code with the permissions of the user running it.

Alerts:
Gentoo 201411-03 tigervnc 2014-11-05
Mageia MGASA-2014-0173 tigervnc 2014-04-15
Fedora FEDORA-2014-4180 tigervnc 2014-04-05
Fedora FEDORA-2014-4112 tigervnc 2014-03-24

Comments (none posted)

xen: denial of service

Package(s):Xen CVE #(s):CVE-2012-6333
Created:March 26, 2014 Updated:March 26, 2014
Description: From the CVE entry:

Multiple HVM control operations in Xen 3.4 through 4.2 allow local HVM guest OS administrators to cause a denial of service (physical CPU consumption) via a large input.

Alerts:
SUSE SUSE-SU-2014:0446-1 Xen 2014-03-25

Comments (none posted)

Page editor: Jake Edge

Kernel development

Brief items

Kernel release status

The current development kernel is 3.14-rc8, released on March 24. Linus said: "I delayed things a day from my normal schedule, hoping I'd feel more comfortable doing a 3.14 release, but that was not to be. So here's an rc8, and I expect to do the final 3.14 release next weekend."

Stable updates: 3.13.7, 3.10.34, and 3.4.84 were released on March 23; 3.12.15 came out on March 26.

Comments (none posted)

Kernel development news

The 2014 Linux Filesystem, Storage, and Memory Management Summit

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
The 2014 Linux Storage, Filesystem, and Memory Management Summit was held March 24 and 25 in Napa, California. Nearly 90 developers met to discuss many issues of interest to the core kernel community; it is one of the most focused and technical events on the calendar. Naturally, LWN was there; articles documenting the discussions held there will be added to this page as they become available.

Plenary sessions

Many of the topics to be discussed were deemed to be relevant to all three groups of developers at the event. Those topics include:

The memory management track

There were a number of topics discussed in a smaller setting involving just those developers who are interested in memory management issues:

The storage and filesystem track

For the most part, the storage and filesystem developers had joint sessions, though there is one exception. Here is what was discussed:

Filesystem-only track

There were also two storage-only track sessions, but those got wrapped up into articles above. Here is what the filesystem developers discussed:

Group photo

The traditional group photo was a somewhat disorganized affair. Your editor apologizes to everybody who is not visible in this picture.

[LSFMM group photo]

Comments (1 posted)

Various page cache issues

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
The Linux page cache is charged with maintaining the cache of blocks of data from persistent storage devices; it is a fundamental component of both the virtual filesystem and memory management subsystems. The page cache is key to the system's performance as a whole. Unfortunately, it is showing its age in a number of ways. The first technical session at the 2014 Linux Storage, Filesystem and Memory Management Summit discussed a number of page-cache-related problems and mapped a rough path toward their solution.

Large drives on 32-bit systems

One looming problem, introduced by James Bottomley, is that, on 32-bit systems, the page cache cannot work with drives larger than 16TB. The page cache uses the native page size (4KB) to address blocks on devices; with a 32-bit block index, the ability to represent blocks is exhausted at 16TB. The questions that James put to the group were: is this problem worth fixing on 32-bit systems and, if so, how could that be done?
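
The arithmetic behind the 16TB figure is simple enough to check: a 32-bit index of 4KB pages tops out at 2^32 times 2^12 bytes.

    PAGE_SIZE = 4 * 1024                 # 2**12 bytes
    MAX_PAGES = 2 ** 32                  # 32-bit page cache index
    print((MAX_PAGES * PAGE_SIZE) // 2 ** 40, "TiB")   # -> 16 TiB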

A number of developers clearly felt that there was no need to solve this problem. They see 32-bit systems as being slowly on their way out; anybody who wants to use large storage devices, it follows, should just get a 64-bit system to run them on. In the real world, though, things might not be that easy. There will be 32-bit devices out there for some time, especially in the embedded world. People will get 16TB USB-connected devices and expect them to work. Others will put a 32-bit processor into a cheap network-attached storage device and want to put large drives into it. There are also manufacturers putting 32-bit processors directly onto drives to allow them to support protocols like iSCSI.

As Dave Chinner pointed out, the core problem is in the page cache itself. If an application does direct I/O (which bypasses the page cache), there are no problems. Where things go bad is when user space attempts buffered I/O on a large device. That happens when filesystems are used, but also before that point: udev does buffered I/O to map out drives, for example. And there is one other little problem: even if a 16TB filesystem could be mounted and used on a 32-bit system, there is not enough address space to run a filesystem checker on it. That problem appears to be intractable on these systems. Even if the kernel is fixed, in other words, the rest of the ecosystem is just broken.

These observations led to a rough consensus in the room. There will be no real attempt to make larger filesystems work on 32-bit systems. But the direct I/O path will work with very large drives and will remain supported. So use cases that depend only on direct I/O — the iSCSI drive case, for example — will work on 32-bit systems. For the rest, it will just be necessary to get a 64-bit CPU.

Large pages in the page cache

Dave, along with Christoph Lameter, then moved the session on to the next topic: storing larger blocks of data in the page cache. It turned out that the two want to do this for different reasons, which drive them toward different solutions. It may be possible to solve both problems, but one solution is likely to come sooner than the other.

[Christoph Lameter]

Christoph is worried about operating system overhead at all levels of the stack; part of his solution is to support larger physical pages in the page cache. Larger physical pages would reduce management overhead in the kernel and the amount of setup and teardown associated with I/O operations on those pages. It may be possible to add an "order" attribute to pages in the page cache, allowing differently sized (and larger) pages to be stored there, but there is trouble lurking in a familiar place: maintaining the ability to actually allocate large physical pages after the system has been running for a while.

One solution, Christoph said, is to maintain a reserve of large pages for only this purpose, but that is "awkward" and hard to tune correctly. The other approach is to allow the kernel to move pages around, defragmenting things at run time. Much of the support is already there; the memory-management subsystem has the ability to migrate pages, and that ability is used in the page compaction code for memory defragmentation. The problem is that not all pages are movable; since it only takes one unmovable page to break up a large physical page, that is a significant roadblock.

There are a number of reasons why a particular page might not be movable, but one of the most significant causes of unmovable pages is the allocation of kernel-space objects. If objects obtained from the slab allocators were to be movable, though, this problem would go away. Christoph has patches to add this functionality for some heavily-used caches (the inode cache, for example), but they have not been merged. There was some resistance to this idea; the kernel is full of pointers to objects, and no object can be moved unless all of those pointers can be changed without introducing race conditions. Christoph maintained that much of the problem could be solved by addressing a few important data structures.

Even then, though, things may not work as well as Christoph might like. Mel Gorman made the point that, even on current systems where attempts are made to segregate movable and unmovable pages, a surprising amount of "movable" memory turns out not to be. Page table pages can be a substantial portion of memory, and they cannot be moved; a patch to fix that added 5-6% overhead, and was thus not merged. Other pages are allegedly movable but are pinned for direct I/O or otherwise locked in memory; even when a lot of care is taken, it can be hard to migrate pages. For now, the main symptom of this problem for most users is that transparent huge page allocations fail; that is not a huge problem. But, Mel said, if we come to really depend on being able to move pages, "it will blow up in our face."

Larger blocks in filesystems

Dave had a related but different question to ask. He is interested in better supporting filesystems with block sizes that are larger than the system's page size. Supporting larger physically contiguous pages in the page cache could help toward that goal, and this approach offers some advantages, the primary one being that it requires almost no changes to filesystem code. But the same cannot be said for the memory management subsystem, where some rather harder changes would need to be made. Rather than jump into all that work, he suggested, we should consider whether we really need to solve the problem that way.

Part of the impetus toward larger physical pages has been limitations in the kernel's block I/O stack; the maximum size of a single I/O request was, for a long time, too small to get full performance out of high-speed storage devices. Raising the page size would raise that limit, allowing more data to be transferred with the same number of pages, fixing the problem. But that request-size limit has since been solved, so there is much less need to raise the page size now. The real problem, he said, is that the page cache knows about the block sizes used within each filesystem. It ends up tracking a lot of filesystem-level information, duplicating the information stored in the filesystems themselves. This duplication leads to coherency issues and occasional nasty bugs; it is also the source of the limit that forces filesystem block sizes to be no larger than the system page size.

Nick Piggin tried to solve this problem some years ago with his fsblock work. The problem with that patch set is that it required extensive changes to every filesystem. Even the ext2 conversion, done as a proof of concept by Nick, needed a lot of changes. So Dave has been looking at a different approach. In short, he would like the page cache to stop maintaining a mapping between pages and filesystem blocks; it would, instead, concern itself with the state of the pages themselves. Everything else is already known in the filesystems; getting that tracking out of the page cache would make the block size limit go away.

This work would also enable the elimination of the use of buffer heads for the tracking of blocks. Buffer heads are painful for filesystem developers to work with; they also add a lot of overhead for little value. The mapping of pages to blocks is much better managed in the extent-tracking information already found in the filesystem code. About all the page cache would need to worry about is whether any given page is up-to-date with respect to permanent store; the rest can more easily and reliably be managed elsewhere.

Christoph came back in to point out that he wants the higher-order page cache to reduce application overhead, especially with I/O requests; the filesystem-level solution described by Dave, he said, does not address his problem. Martin Petersen pointed out that smaller I/O granularity may well be forced by host adapters anyway, but Christoph responded that he uses high-end hardware and doesn't have such problems. Dave said, to widespread agreement, that there are two separate problems being described here, and that, in any case, the large filesystem block solution is needed first.

There were some concerns about interactions between the ELF executable format, which has some alignment assumptions built into it, and a larger filesystem block size. It seems, though, that there are no real problems in this area, especially once one gets away from 32-bit systems.

The consensus at the end of the session seemed to be that Dave should push forward with his work to move filesystem awareness out of the page cache. It is, he said, a surprisingly small change, but still a significant bit of work. The changes will be opt-in; nobody expects the numerous older filesystems supported by Linux to be updated, and those filesystems need to continue to work. There were concerns about how much of this work will be in the generic filesystem code; there is no desire to see duplicated implementations copied into each filesystem.

Meanwhile, Christoph was invited to work on supporting larger physically contiguous pages in the page cache. It was made clear, though, that the request for larger allocations would have to be a hint to the allocator; the system cannot depend on those allocations succeeding. In both cases, the next step will be the posting of code for review.


Support for shingled magnetic recording devices

By Jake Edge
March 26, 2014
2014 LSFMM Summit

One of the plenary sessions on the first day of the Linux Storage, Filesystem, and Memory Management (LSFMM) Summit concerned Linux support for shingled magnetic recording (SMR) devices. These next-generation hard disks have a number of interesting characteristics that will be challenging to fully support. Martin Furuhjelm led a discussion among a few drive vendor representatives and the assembled kernel developers about the latest developments in SMR-land.

There are three types of SMR drives: device managed, host aware, and host managed. Device-managed drives will essentially act just like regular disk drives, though the translation layer in the drive may cause unexpected performance degradation at times (much like flash devices today). Existing drivers don't need to change for device-managed disks. The discussion concentrated mostly on host-aware drives (where the host should try to follow the requirements for shingled regions) and host-managed devices (where the requirements must be followed).

SMR drives will be made up of multiple zones, some that are "normal" and allow random reads and writes throughout the zone, and some that can only be written sequentially. For the sequential zones, there is a write pointer maintained for each zone that corresponds to where the next write must go. Depending on the mode, writing elsewhere in the zone will either be an error (in host-managed devices) or will lead to some kind of remapping of the write (for host-aware devices). That remapping may lead to latency spikes due to garbage collection at some later time.
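To make the zone model concrete, here is a minimal C sketch of the per-zone state such a drive maintains; the type and field names are purely illustrative and are not taken from the draft standard.

    /* Hypothetical model of an SMR zone; names are illustrative only. */
    enum zone_type {
        ZONE_CONVENTIONAL,            /* random reads and writes allowed */
        ZONE_SEQUENTIAL,              /* writes must land at the write pointer */
    };

    struct smr_zone {
        enum zone_type type;
        unsigned long long start_lba; /* first block of the zone */
        unsigned long long length;    /* zone size, in blocks */
        unsigned long long write_ptr; /* where the next write must go */
    };

    /* A write to a sequential zone is valid only at the write pointer;
     * a host-managed drive rejects anything else, while a host-aware
     * drive accepts it and remaps internally (at a later cost). */
    static int write_allowed(const struct smr_zone *z, unsigned long long lba)
    {
        return z->type == ZONE_CONVENTIONAL || lba == z->write_ptr;
    }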

Two new SCSI commands have been added, one to query what zones exist on the drive and another to reset the write pointer to the beginning of a particular zone. To get the best performance, an SMR-aware driver will need to only write sequentially to the sequential zones (that will likely make up most of the disk), but if it fails to do so, it will be a fatal error only on host-managed drives. For that reason, most of the kernel developers seemed to think the first SMR drives are likely to be host-aware since those will work (though perhaps poorly at times) with today's software.

The T10 technical committee (for SCSI interface standards) is currently working on finishing the standards for SMR, so it is important that Linux developers make any concerns they have with the drafts known soon. Ted Ts'o noted that the drafts are available from the T10 site (Furuhjelm recommended looking for "ZBC"). In addition, more information on SMR and Linux can be found in a writeup from last year's LSFMM.

There were some questions about the zone reporting functionality, but much of that is still up in the air at this point. Currently, all zones are expected to be the same size, though there is a belief that will change before the draft is finalized. There has also been talk of adding a filtering capability on the query, so that only zones fitting a particular category (active, full, sequential-only, etc.) would be returned.

The overall sense was that kernel developers are waiting for hardware before trying to determine how best to support SMR in Linux. No major complaints about the draft interface were heard, but until hardware hits, it will be difficult for anyone to determine where the problems lie.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]


Persistent memory

By Jake Edge
March 26, 2014
2014 LSFMM Summit

Matthew Wilcox and Kent Overstreet talked about support for persistent memory in the kernel on the first day of the 2014 Linux Storage, Filesystem, and Memory Management Summit held in Napa, California. There have been, well, persistent rumors of the imminent availability of persistent memory for some time, but Wilcox said you can actually buy some devices now. He wanted to report on some progress he had made on supporting these devices as well as to ask the assembled developers for their thoughts on some unresolved issues.

Persistent memory is supposed to be as fast as DRAM, but to retain its contents even without power. To support these devices, Wilcox has written a "direct access" block layer that is called DAX ("it has an 'X', which is cool", he said—it also is a three-letter acronym that is not used by anything else in the kernel). The idea behind DAX came from the execute-in-place (XIP) code in the kernel, not because the data accessed from persistent memory will be executed, necessarily, but because it should avoid the page cache. XIP originally came from IBM, which wanted to share executables and libraries between virtual machines, but it has also been used in the embedded world to execute directly from ROM or flash.

Since persistent memory is as fast as RAM, it doesn't make sense to put another copy into memory as a page cache entry. XIP seemed like a logical starting point, since it avoided the page cache, but it required a lot of work to make it suitable for persistent memory. So Wilcox rewrote it and renamed it. Filesystems will make calls to the direct_access() block device operation in a DAX driver to access data from the device without it ending up in the page cache. Wilcox would like to see DAX merged, so he was encouraging people in the room to look at the code and comment.
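As a rough sketch of the idea (the prototype shown approximates the posted patches and is not a final kernel API), a filesystem asks the driver for a directly addressable pointer into the device, then reads or writes through that pointer with no page-cache copy in between:

    /* Approximation of the proposed operation, not a stable interface:
     * given a sector, return a directly usable address for the
     * persistent memory backing it; the return value is the number of
     * bytes addressable there, or a negative error code. */
    long (*direct_access)(struct block_device *bdev, sector_t sector,
                          void **addr);

    /* Filesystem-side use, schematically: */
    void *addr;
    long avail = bdev->bd_disk->fops->direct_access(bdev, sector, &addr);
    if (avail >= len)
        memcpy(addr, data, len);   /* store straight to the device */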

But there are a few problem areas still. Currently, calling msync() to flush a range of memory to persistent storage will actually sync the entire file and metadata. That is not required by POSIX and Wilcox would like to change the behavior to just sync the range in question. Obviously that has a much further reach than just affecting DAX, and Peter Zijlstra cautioned that changing sync behavior can surprise user space, pointing to "fsync() wars from a few years back" as an example. User space often doesn't care what is supposed to be done, instead it depends on the existing semantics, he said.

Wilcox said that kernel developers "suck at implementing [syncing], user space sucks at using it" and concluded that "syncing sucks". The consensus seemed to be that any application that was syncing a range, but depending on the whole file being synced, is broken. Furthermore, Chris Mason was all in favor of fixing msync() for ranges as it would "make filesystem guys look good".
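For reference, msync() already takes a range; the argument was only over how much the kernel flushes in response. A minimal user-space illustration (data.bin is a stand-in for any large file, and error checking is abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDWR);
        char *map = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        map[42 * 4096] = 1;          /* dirty a single page */

        /* POSIX permits flushing only this page-aligned range; the
         * behavior under discussion flushes the whole file and its
         * metadata instead. */
        if (msync(map + 42 * 4096, 4096, MS_SYNC))
            perror("msync");
        return 0;
    }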

Another problem area is with the MAP_FIXED flag for mmap(). It has two meanings, one of which is not very well known, he said. MAP_FIXED means to map the pages at the address specified, which is expected. But it also means to unmap any pages that are in the way of that mapping, which is surprising. Someone must have wanted that behavior at one time, but no one wants it any more, he said. He has proposed a MAP_WEAK flag that would only map the memory if nothing else is occupying the address range.
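The surprising second meaning is easy to demonstrate. In the sketch below, the second mmap() call succeeds and quietly destroys the first mapping; MAP_WEAK is the proposed flag and does not exist in any kernel:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* An ordinary anonymous mapping... */
        void *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* ...mapped over: MAP_FIXED places the new mapping at exactly
         * 'addr' and silently unmaps whatever was there before. */
        void *again = mmap(addr, 4096, PROT_READ,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

        printf("%p %p\n", addr, again);   /* same address, twice */

        /* The proposed MAP_WEAK would fail here instead, because the
         * address range is already occupied. */
        return 0;
    }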

The get_user_pages() function cannot be used with persistent memory, because there are no struct page entries created for it. There could be a lot of pages in a persistent memory device, so wasting 64 bytes per page for a mostly unused struct page is not desirable; at 4KB per page, a 1TB device would need 16GB of page structures. The call to get_user_pages() is generally for I/O, so Dave Hansen has been working on a get_user_sg() that creates a scatter-gather list for doing I/O. The crypto subsystem also wants this capability, Wilcox said.

There is a problem, though. A truncate() operation could remove blocks out from under get_user_sg(), which would leave a mess behind. Wilcox wondered if file truncation could just be blocked until the pages are no longer pinned by the I/O operation. That did not seem popular, but Overstreet had another idea.

Overstreet has been working on a direct I/O rewrite for some time and, in many ways, doing a DAX mapping and doing direct I/O look similar, he said. His rewrite would create a new struct bio that would be the container for the I/O. It would get rid of the get_block() callback, which is, he said, a horrible interface. For one thing, it may have to read the mapping from disk, an operation which should be asynchronous, but get_block() isn't. Moving to struct bio would allow the usual filesystem locking to avoid races with truncate().

There were some complaints that making I/O be bio-based was problematic for filesystems like NFS and CIFS that don't use the bio structure. Overstreet said that we may get to a point where buffered I/O lives atop direct I/O, which would help that problem. In addition, Mason did not think that a bio-based interface would really be that big of a problem for NFS and others. A bio is just a container of pages, Overstreet said.

In the end, no really clear conclusions were drawn. It would seem that folks need to review the DAX code (and, eventually, Overstreet's direct I/O rewrite) before reaching those conclusions.

[ Thanks to the Linux Foundation for travel support to attend LSFMM. ]


Trinity and memory management testing

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
The Trinity tool is a system call fuzz-testing utility for the Linux kernel. By supplying random data to the kernel in a focused way, Trinity has managed to expose a large number of bugs over the years. Trinity maintainer Dave Jones addressed the memory management track at the 2014 Linux Storage, Filesystem, and Memory Management Summit to talk about how he is using the tool to turn up memory management bugs in particular.

He started by describing an idea he heard from Al Viro: create a memory range with mmap(), unmap a single page in the middle of that range, then pass the result to various system calls and see what happens. As he described it, "all hell broke loose." Large numbers of bugs have been turned up, followed by "heroic efforts" in the memory management community to fix them. As a result of those efforts, he said, he is now unable to find any problems in the 3.14-rc7 kernel's memory management code, which is a good thing.
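The trick itself is simple; a sketch (with error checking omitted for brevity) looks like this:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Map three pages, then punch a hole in the middle one. */
        char *buf = mmap(NULL, 3 * 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        munmap(buf + 4096, 4096);

        /* Hand the holed range to a system call; the kernel must cope
         * with the missing page gracefully (typically by returning
         * EFAULT or a short count), never by crashing. */
        int fd = open("/tmp/scratch", O_WRONLY | O_CREAT, 0600);
        write(fd, buf, 3 * 4096);
        return 0;
    }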

On the other hand, Sasha Levin has been using Trinity to find bugs in the linux-next tree, where they are rather more plentiful. Those bugs often move into the mainline during the merge window. Bringing more stability to the code in linux-next before it is merged would be a worthwhile thing to do.

In general, Dave said, Trinity is good at finding bugs in the dark corners of the kernel that nobody makes much use of. So areas like huge pages, page migration, and the mbind() system call have been fertile ground. In the case of mbind(), it turned out that all callers were going through a user-space library. That library did argument checking, so, naturally, the system call itself did not. The result was a predictable pile of bugs which have now been fixed.

Lots of parts of the memory management subsystem are, he said, simply not getting adequate testing now. Trinity helps in this area, but its memory management testing is still on the rudimentary side. He wants to develop it further; he plans to work on memory management fuzzing for much of the rest of the year. But, even now, Trinity is finding more bugs in the memory management code than can be dealt with.

Huge pages generate lots of bug reports from Fedora users; they also are the source of lots of problems found by Trinity. Reproducing those bugs in any more standard setting is hard, though; many of them involve applications, like the Java runtime, that Dave is unfamiliar with and uninterested in learning more about. So, for now, transparent huge pages are simply turned off for his Trinity runs.

Reproduction of crashes provoked by Trinity is an ongoing problem in general. The tool can log everything that it does, but the logging is, itself, an expensive operation that can change timings to the point that a lot of problems simply go away. Many crashes are also the result of corrupted internal state in the kernel; the sequence that causes the corruption may happen a long time before that corruption causes the kernel to crash. So establishing the cause of crashes can be difficult.

Dave had a couple of requests for memory management developers. One was for anybody adding new flags to existing system calls; when that happens, he would like to get a note so that he can start testing calls with that flag. He'll often notice them in the patch stream anyway, but an explicit notification is more reliable. The other thing he would like is to see more developers running Trinity on their systems. It is trivial to set up, he said; so there is no real reason not to make use of it.

[Your editor would like to thank the Linux Foundation for supporting his travel to the Summit.]


Compressed swap

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
There are a number of projects oriented around improving memory utilization through the compression of memory contents. Two of these, zswap and zram, have found their way into the mainline kernel; they both aim to replace swapping with compressed, in-memory storage of data. They differ in an important way, though: zram acts like a special block device which can be used as a swap device; zswap, instead, uses the "frontswap" hooks to try to avoid swapping altogether.

Bob Liu led a session to talk about this technology with a specific focus on zswap and a performance problem he has encountered with it. Zswap stores "swapped" data by compressing it and placing the result in a special RAM zone maintained by the "zbud" memory allocator. When the zbud pool fills, zswap must respond by evicting pages from that area and pushing them out to a real swap device. That involves decompressing the data, then writing the resulting pages to the swap device. That can slow things down significantly.

Bob had a couple of options that he asked the group to consider. One of those was to turn zswap into a write-through cache; any pages stored in zswap would also be written to the swap device at the same time. That would allow the instant eviction of pages from zswap; since they already exist on the swap device, no further effort would be required. The cost, of course, would be in the form of increased swap device I/O and space usage.

The second option would be to make the zswap area dynamic in size. It is currently a fixed-size region of memory. If it were dynamic, it could grow in response to increased demand. Of course, there would be limits to that growth, after which it would still be necessary to evict pages from the zswap area.

Bob may have hoped for guidance with regard to which direction he should take, but he did not get it. Instead, Mel Gorman made the point that neither zram nor zswap has been well analyzed to quantify the benefits they provide to a running system. When people do run benchmarks, they tend to choose tests like SPECjbb which, he said, is not well suited to the job. Or they pick kernel compiles, which is even worse.

What the compressed swapping subsystems really need, he said, is better demonstration workloads. In fact, they need those workloads so badly that no changes to the behavior of these subsystems will be considered until those workloads have been provided. So the real next step for developers working with compressed swapping is not to worry about how the system responds to pool exhaustion — at least, not until a better way to quantify the performance impact of any changes has been found.

[Your editor would like to thank the Linux Foundation for supporting his travel to the Summit.]


Memory management locking

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
Like many parts of the core kernel, the memory management subsystem is highly sensitive to lock contention, which can quickly ruin performance. Davidlohr Bueso has been working on fixing some locking problems in that code; he led a session in the memory management track of the 2014 Linux Storage, Filesystem, and Memory Management Summit to talk about the directions for that work.

Two locks that show contention problems are the anon_vma lock, which controls access to virtual memory areas representing regions of anonymous pages, and the i_mmap_mutex, which protects several fields of the address_space structure. These locks were once mutexes, but the read-mostly access patterns for those data structures led developers to switch them to reader/writer semaphores (rwsems) instead. The only problem is that performance took a significant hit whenever it became necessary to do a non-trivial amount of writing to those data structures.

Some of that performance was regained through the application of "rwsem stealing," whereby a thread that is running can grab a lock ahead of another thread which had been waiting for it. But, Davidlohr said, what was missing was the sort of adaptive spinning found in regular mutexes. Even though a mutex is a sleeping lock, a thread trying to acquire it may spin for a while in the hope that the lock will be released soon; doing so can yield a significant performance boost. Adding spinning to the rwsem implementation gets performance back to previous levels for all workloads. So, Davidlohr asked, is there any opposition to merging that code? The response in the room suggested that no such opposition exists.
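Schematically, the technique looks like the fragment below; the helper names are invented for illustration, and this is not the kernel's actual implementation:

    /* Illustration only: spin while the lock's owner is running on
     * another CPU, since it is likely to release the lock soon;
     * otherwise give up and sleep like a normal sleeping lock. */
    while (lock->owner && owner_is_running(lock->owner)) {
        if (need_to_reschedule())
            break;                /* stop spinning, go to sleep */
        cpu_relax();              /* polite busy-wait */
    }
    if (!try_acquire(lock))
        sleep_until_released(lock);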

In the case of the anon_vma lock, there is a strong desire to avoid using a sleeping lock at all. The rwlock mechanism exists for just that use case, but there are fairness issues with rwlocks. Waiman Long has done some work with queued rwlocks, which address those issues while also improving performance. Peter Zijlstra noted that he has rewritten those patches, but is not quite sure what to do with the results. He likes the fairness, but lacks good benchmarks by which to judge them. There are still problems using these locks on virtualized systems. Even so, he is not opposed to merging this code.

Not everybody feels quite the same way, though. Sagi Grimberg noted that he has code that needs to be able to sleep in functions like invalidate_page(), where the anon_vma lock can be held. So turning that lock into a non-sleeping lock would clearly create problems. This kind of need comes up in areas like InfiniBand and RDMA, where work that can potentially sleep has to be done in settings where this lock is held. Mechanisms like xpmem also have this problem.

Rik van Riel suggested that the best way to avoid problems is to get the relevant code upstream as soon as possible, but Davidlohr protested that the performance cost of using a sleeping lock is severe. Peter added that Linus has had "choice words" for authors of code needing a sleeping anon_vma lock. So, he said, the right thing to do with the non-sleeping lock patches would be to send them to Linus with an explanation of what would break if they were applied. Then we could all see what Linus chooses to do.

Davidlohr went on to say that he would also like to restart the discussion of the mmap_sem semaphore, which protects many parts of a process's address space. It is often held for too long, he said, creating excessive latencies. We are also serializing too much work. It is not necessary, he said, to lock the entire address space if work is being done on a portion of that space. Perhaps it is time to look at range locking as a way to reduce mmap_sem contention?

Michel Lespinasse responded that, while range locking might make sense, it would be better to work on eliminating long hold times for mmap_sem. Rik suggested that, perhaps, it could be turned into a per-virtual-memory-area lock, but Peter responded that this has been tried in the past. The patches have ended up replacing mmap_sem contention with contention for a "big VMA lock" instead.

Jan Kara raised the problem of holding mmap_sem when the memory management subsystem calls into filesystem code. Beyond performance problems, this pattern can create lock inversion issues as well. He has been working on eliminating the places where mmap_sem is held for filesystem calls for some time; that work is getting closer to being ready. There are just a couple of remaining problem areas, one of which is the page fault handling code, but there are solutions to that problem.

The other is calls to get_user_pages(), which requires that the mmap_sem be held by the caller. Jan has been converting callers to get_user_pages_fast(), which does not have that requirement. Most of the easy cases have been handled, but a few of the harder issues remain. Sometimes get_user_pages() is called in situations where mmap_sem has been acquired by higher-level code. The Video4Linux videobuf2 code has some interesting usages of its own which are hard to convert.
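The difference between the two interfaces, roughly as they stood in kernels of that era:

    /* Classic interface: the caller must already hold mmap_sem. */
    down_read(&current->mm->mmap_sem);
    ret = get_user_pages(current, current->mm, start, nr_pages,
                         1 /* write */, 0 /* force */, pages, NULL);
    up_read(&current->mm->mmap_sem);

    /* Fast variant: walks the page tables locklessly when possible,
     * taking mmap_sem internally only if it must fall back. */
    ret = get_user_pages_fast(start, nr_pages, 1 /* write */, pages);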

But the most worrisome area is uprobes, which needs to be able to place breakpoints into program code (text) pages when they are brought into memory. This code registers a callback on the creation of virtual memory areas; if need be, it installs breakpoints into text pages when they are instantiated. This results in a call into filesystem code from well within an mmap() call. Peter suggested that this one could be fixed by reordering the mmap() code. The initial setup work could be done, after which mmap_sem would be dropped and the page contents could be filled. That would create a window where a program might be able to access some of the mapped pages before their initialization is complete, but, Peter said, no well-behaved program will access pages created by mmap() before that call returns, so there should be no problems.

The session ended with some inconclusive discussion on rationalizing the naming of the growing family of get_user_pages() variants.

[Your editor would like to thank the Linux Foundation for supporting his travel to the Summit.]


Hardware pain points for memory management

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
H. Peter Anvin ran a brief session at the 2014 Linux Storage, Filesystem, and Memory Management Summit to ask a simple question: how could hardware (and processors in particular) improve to make the memory management task easier? While he offered no guarantees that any actual hardware changes would result from the discussion, he did say that he would be able to carry any requests back to the hardware people at Intel.

The first complaint had more to do with hardware-specific software in the kernel: Rik van Riel noted that the PowerPC architecture code does not implement the translation lookaside buffer flush functions. Some other architectures (such as SPARC) have similar limitations. That makes it hard to do the right thing in generic code. It would be nice, he said, if something could be done to make it easier for architecture-independent code to update page table entries.

Peter Zijlstra asked for a way to invalidate a range of page table entries on the x86 architecture. Another popular request for x86 was the ability to support 64KB pages. Currently, on that architecture, there is no hardware page size between 4KB and 2MB.

Mel Gorman asked for a fast operation to zero-fill a page of memory. This ability could be especially useful for huge pages, which can take a while to overwrite with zeroes. There was some talk about whether non-temporal stores (which can overwrite memory without pushing other data out of the processor caches) would be helpful in this situation. Somebody suggested zeroing pages in the kernel's idle loop, when nothing else is going on, but Christoph Lameter responded that he has tried that and it does not really help.
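Non-temporal stores are available from user space as well; a sketch using the x86 SSE2 intrinsics (illustrative of the idea discussed, not of any kernel code) might look like:

    #include <emmintrin.h>
    #include <stddef.h>

    /* Zero a region with streaming stores, which write around the CPU
     * caches instead of through them; 'page' must be 16-byte aligned
     * (any mmap()ed page is). */
    static void zero_page_nontemporal(void *page, size_t size)
    {
        __m128i zero = _mm_setzero_si128();
        char *p = page;
        size_t i;

        for (i = 0; i < size; i += 16)
            _mm_stream_si128((__m128i *)(p + i), zero);
        _mm_sfence();   /* order the streaming stores before any reuse */
    }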

Other requests included a version of the iret instruction that is less painful (faster) for the page fault handler. There was talk of the cost of responding to events and passing messages between CPUs; a version of the mwait instruction that works in user space was suggested as being possibly helpful. The end result of the session was a wishlist to be taken back to the hardware developers; what will come of that remains to be seen.

[Your editor would like to thank the Linux Foundation for supporting his travel to the Summit.]


Volatile ranges

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
"Volatile ranges" are special regions of memory containing data that the owner application can regenerate if need be. If the system runs short of memory, the kernel is free to evict data from a volatile range, but otherwise the space is usable for activities like caching. The volatile range concept was raised again at the 2014 Linux Storage, Filesystem, and Memory Management Summit, in two separate sessions. This article is a combined look at both discussions.

The first session started with an overview of the latest incarnation of the volatile ranges and MADV_FREE APIs; see this article for an overview of those proposals. One question that came up repeatedly concerned the need for a separate vrange() system call for volatile ranges. Some of the incarnations of that work have used madvise() instead, and some developers think that is the better approach. It turns out that one of the biggest arguments against an madvise() interface has to do with the process of marking pages as no longer being volatile. In that case, the system call needs to return two separate values: (1) how much memory was successfully marked non-volatile, and (2) whether any pages were purged by the kernel while they were marked as volatile. madvise() only allows for one return value, so it cannot be used to create that kind of interface.
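Roughly, the proposal of the time had the shape below; details varied between revisions, and regenerate_contents() stands in for whatever a given application would do after a purge:

    /* Approximate shape of the proposed system call: the return value
     * reports how much memory was marked, while *purged reports
     * whether the kernel discarded any pages in the meantime. */
    int vrange(unsigned long start, size_t length, int mode, int *purged);

    /* Taking a cache region out of the volatile state: */
    int purged = 0;
    int marked = vrange(start, length, VRANGE_NONVOLATILE, &purged);
    if (marked >= 0 && purged)
        regenerate_contents();   /* hypothetical application hook */

madvise(), with its single return value, has no way to report both outcomes at once.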

Should the interface indicate which pages have been purged when a range is marked non-volatile? The current code returns a single boolean value indicating only whether any pages have been purged at all. Hugh Dickins said that some users would like to have more detailed information. That said, there does not appear to be any plan to extend the interface in that direction at this point.

Another question has to do with page aging. When pages are marked as being volatile, should they be "aged" to look like they have not been referenced for a long time? Aging the pages in that way would cause them to be among the first that are reclaimed if the system encounters memory pressure. There does not seem to be much consensus on whether this kind of aging should be performed; if it is added, it might be under the control of a separate flag allowing user space to select the behavior it wants.

Hugh said he didn't like the vrange() name; he would rather see the name be a verb describing the action that is to be performed. There was also talk of making an madvise2() system call that would be able to provide the needed API. In the end, though, suggestions for better names have been in short supply, and Hugh agreed that, given all the revisions that volatile ranges have been through, keeping that functionality as a separate system call might be the best approach to take.

Keith Packard raised a related use case that he has: graphics drivers can allocate large amounts of memory for caching that they can give up if need be. But the existing shrinker interface is not actually invoked often by the kernel, so he ends up holding memory rather longer than is warranted. Perhaps the volatile range functionality could be made available in a form that could be used by drivers as well?

A couple of other API issues came up toward the end of the session. One had to do with what happens if a process writes to memory that is in a volatile range: in that case, should the memory remain volatile, or should writing the memory automatically make it non-volatile? Some developers would like to see the latter behavior, but John Stultz, the author of current versions of the patch, is uncomfortable with changing the state of pages on writes in that way.

The current interface is memory-based, in that a volatile range is described by a base address and a length. Some versions of the patch have, instead, used a file-based interface, where a volatile range is described as a portion of a file. The Android "ashmem" subsystem, which, it is hoped, can be replaced by volatile ranges someday, uses a file-based interface, but John said that it could be changed internally to use a memory-based method instead. Keith had a bit of a stronger requirement for a file-based interface, though. The graphics system, he said, does not normally have addresses for most of the memory it uses for caching, and mapping all of that memory could create problems on 32-bit systems where there is not a lot of address space available. So he would rather see a file-based API.

In the end, there was little in the way of concrete conclusions from this session. There will certainly be another version of the volatile ranges patch set at some point, but what it will look like is not entirely clear.

[Your editor would like to thank the Linux Foundation for supporting his travel to the Summit.]


Toward better testing

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
Dave Chinner and Dave Jones started off the second day of the 2014 Linux Storage, Filesystem, and Memory Management Summit with a discussion of testing tools. What are we doing now, and what can be done better? The state of the art has improved considerably in recent times, but there are always ways to do better yet.

Dave Chinner spoke as the maintainer of the xfstests filesystem testing suite. Despite its name, this suite has not been specific to the XFS filesystem for some time. There are, he said, more people now who are both using and contributing to xfstests, but there is still room for improvement. When a developer finds a filesystem bug, he said, the fix should include a contribution to the test suite to help ensure that the bug does not return in the future.

James Bottomley asked how much code coverage is provided by xfstests now. It seems that the quality assurance people at Red Hat have done some coverage testing; about 75% of the code in the XFS filesystem is exercised by xfstests. Coverage of ext4 is a bit less at 65%; there are currently no tests to exercise the ioctl() code in particular. In general, the common code paths are tested well, but the more esoteric features lack test coverage.

There was a request for the addition of power-failure testing to xfstests. Dave responded that there is a "crashme" script in xfstests now that can be used to randomly reboot the machine; XFS also has a special ioctl() that will immediately cut off the I/O stream, simulating a power failure on the underlying device. So, he said, there is no need to physically remove power to do power-failure testing; it can be done with the software tools that exist already.

Al Viro said that some tests will fail if the underlying storage partition is too small. Dave replied that there is a mechanism in the xfstests harness to specify how much space each test needs. In general, the minimum amount of space is 5-10GB; with that, most of the tests will run. At the other end, he runs some tests on a 100TB device, though, he noted dryly, it is wise to avoid any tests which need to fill the entire filesystem when working at that scale. Al also said that some tests can fail after thousands of operations; it would be nice for debugging to be able to replay an xfstests log and quickly zero in on places where things fail.

In general, Matthew Wilcox said, it is not always easy to figure out why a specific test failed. Dave responded that this situation may not change; the purpose of xfstests is to alert developers that a bug exists, not to actually find that bug. He did say that he would accept patches that provide more hints to developers, but that there is also a reluctance to go back and change existing tests. It is easy to break the test itself, sending developers scrambling to find a filesystem bug that does not actually exist. Things are bad enough even without changing the tests, he said: every couple of years the GNU utilities developers feel the need to change the formats of error strings, causing problems in the test suite.

Zach Brown complained that the discussion was focusing on details, when the most significant resource we have is the fact that Intel is paying people to put together testing infrastructure and actually run the tests on development kernels. Now, when developers introduce a bug, they will often get an automated email informing them of the fact. That is good, since, he said, the xfstests suite is painful to set up and run.

Dave Jones asked if we need a similar test suite for the storage layer. Ric Wheeler responded that storage vendors have such suites, but those suites tend to be kept private. Mike Snitzer has a test suite for the device mapper; among other things, it helped to find problems with the recently merged immutable biovec work. When asked why this tool isn't more widely used, Mike responded that the fact that it is written in Ruby might have something to do with it.

Another developer expressed a desire to coordinate filesystem tests with outside processes; the objective in particular was to create more memory pressure while the tests are running. Dave Chinner agreed that more testing should be done under memory pressure. Dave Jones suggested that the fault injection framework could be used; Dave Chinner agreed, but noted that fault injection, while exercising error paths, does little to exercise the reclaim paths in the kernel. So there is no substitute for real memory pressure. A program found in xfstests now will lock large amounts of memory into place, providing an easy way to add memory pressure to the system.
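A minimal sketch of that approach (illustrative; this is not the actual xfstests program):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* Pin a chunk of memory with mlock() so that everything else on
     * the machine runs under genuine memory pressure; the size in MB
     * comes from the command line. */
    int main(int argc, char **argv)
    {
        size_t size = (argc > 1 ? atoll(argv[1]) : 512) << 20;
        void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED || mlock(buf, size) != 0) {
            perror("mmap/mlock");
            return 1;
        }
        memset(buf, 0xaa, size);   /* make every page resident */
        pause();                   /* hold the memory until killed */
        return 0;
    }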

Moving beyond xfstests, Dave Jones asked the community what kinds of tests are missing in general. James immediately responded that we need better ways of testing for performance regressions. Mel Gorman added that the community is "completely blind" when it comes to I/O performance. He has added some simple I/O tests to the mmtests suite and found some regressions in that area almost right away. But, he said, having the test is not enough, some kinds of problems require looking over the results in a detail-oriented fashion. Performance regressions may manifest themselves as latency spikes that have little effect on overall throughput numbers.

Dave Jones recounted that, during the 3.10 development cycle, RAID5 was broken from the merge window until just before the release. Somebody, he said, should have found the problem sooner. It is also easy to bring down the kernel when assembling block devices with the device mapper. Developers, he said, simply are not trying to test a lot of this code in any sort of regular way.

Ted Ts'o suggested that not enough developers have come to understand the deep sense of relief that comes from knowing that a set of changes has passed all of the regression tests. He wished he knew of a way to package that feeling and sell it to new developers. In the absence of that ability, he said, maintainers should do more yelling at developers who clearly have not run the available tests on their patches. Once a culture of regular testing sets in, it tends to become persistent.

Dave Jones complained that, while we sometimes write tests for problems that have been experienced, we are not so good at proactively writing tests for functionality that might break sometime in the future. Dave Chinner agreed, saying that the quality assurance organizations run by distributors should really be writing more tests and trying to break things. In most organizations he has worked with, that kind of outside testing is the norm, but we don't do much of it in the kernel community. Developers, he said, tend not to break their own code well enough; we really need outside testers to find new and creative ways to break things.

As the discussion wound down, there was some talk about areas that do not have good tests now. The filesystem notification system calls were mentioned. Some of the more obscure memory management system calls — mremap() or remap_file_pages() for example — don't have much test coverage. More test coverage for the NUMA memory policy code would also be helpful. Developers may eventually write these tests; hopefully others will then run them and let the community know when things break.

[Your editor would like to thank the Linux Foundation for supporting his travel to the Summit.]


User-space out-of-memory handling

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
While opinions on how the kernel should respond to out-of-memory (OOM) situations vary, almost everybody seems to agree that what the kernel does now is in need of improvement. A session on the topic during the memory management track at the 2014 Linux Storage, Filesystem, and Memory Management Summit covered some possible improvements, but reached no real conclusions.

David Rientjes used the session to talk about his user-space OOM handling patches and to ask for a green light for their inclusion. He spent a while talking about how these patches work; this introduction can be found in David's article on the subject and will not be repeated here. David has been pushing this work for the last year or so, but it seems clear that the community is still not completely sold on it.

Sasha Levin asked whether it might be better to use the vmpressure mechanism, which sends notifications when memory is getting tight, rather than waiting for a full OOM situation and hoping that user space can handle it. The problem with that approach, as Rik van Riel put it, is that there is no limit to how quickly a system can consume its memory. David added that the vmpressure mechanism does not work as well as one might think. As an illustration of the problem, consider a process that locks many pages into memory; it will consume much of the available memory, but no pressure notifications will result because no reclaim is yet happening. The system can then go from a "no pressure" state to "out of memory" almost instantaneously once reclaim starts; there simply is no opportunity for user space to respond.

As the discussion went on, it became clear that the most discomfort existed around the use of a user-space handler to deal with global OOM situations. If a single control group under the memory controller (a "memcg") runs out of memory, it makes sense to have a user-space handler respond. But, Michal Hocko asked, do we really want to handle global OOM situations (where the system as a whole is out of memory) in user space? He agreed that the current code does not work for everybody, but, he said, pushing responsibility into user space opens up a can of worms and would be hard to maintain in the long term. It would be better, he suggested, to improve the global OOM killer in the kernel instead.

Tim Hockin, speaking about his work at Google (which has driven the user-space OOM handler development), talked about the problems they have had with OOM-handling requirements that have changed over time. Google has a hard time deciding what it wants to have happen in OOM situations; it seems hard to expect the kernel developers to anticipate where those requirements might go in the future. That has led to the desire to push the policy into user space where it can be changed without the need to build and deploy a new kernel — a process which does not happen quickly at Google. He would be happy with an in-kernel mechanism that allowed policies to be changed, but only if it is possible to effect a change without building a new kernel.

Robert Haas agreed that moving the policy into user space gives users the ability to make changes without having to change the kernel itself. Kernel developers, he said, simply are not smart enough to come up with all possible policies. But David said he was willing to try if that was how it would be done, though he suggested that the community might not be happy about the "hundreds of patches" implementing all of the possible policies that would result.

There was also some unhappiness about David's use of the memcg mechanism for global OOM handling. That mechanism will only work if control groups are built into the kernel, but there are still plenty of users who prefer not to enable control groups at all. The motivation for using that interface was to allow per-memcg and global OOM handlers to work with the same interface and be coded the same way. Peter Zijlstra suggested that the same control files could be placed in /proc for global OOM handling, providing something very close to the same interface without needing to enable control groups.

David asked for some guidance on how he could make progress in this area. It has been hard to get a consensus on his user-space OOM handling patches, but no viable alternatives have come forward. So he is somewhat stuck. Unfortunately, no consensus emerged in this session either, so there is still no clear path forward for this project.

[Your editor would like to thank the Linux Foundation for supporting his travel to the Summit.]


NUMA placement problems

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
The kernel's handling of task and memory placement has been the subject of a lot of discussion and development in recent years. The pace has slowed for the last few development cycles, but there is still work to be done in this area as can be seen by the discussion on the topic that was held at the 2014 Linux Storage, Filesystem, and Memory Management Summit. The session, led by Rik van Riel, Peter Zijlstra, and Mel Gorman, started with the question: lots of code has been merged, what should happen next? Peter made the observation that, while the code is in the kernel, few people have actually tried to take advantage of it to improve NUMA performance. What is most needed now is user feedback on how things are working and what could be improved.

Davidlohr Bueso said that, on his systems, he can still get much better performance from a carefully hand-tuned configuration than with the automatic NUMA placement code. Rik added that, as far as he can tell, things work close to optimally on four-node systems, but tend to fall apart on systems with more nodes than that. Mel asked why that might be; there was some speculation that the costs of page hinting (tracking who is using each page of memory so that it can be moved to the right node) might be responsible, or perhaps the more complex topology of larger NUMA systems is not being handled well. But it seems that nobody really knows what the problems are.

Mel said that truly understanding NUMA performance issues requires the collection of a lot of data. But that data collection is expensive, to the point that it disrupts the workload under study. It's hard enough for him to run his tests; he hasn't really found a good way for others to do it yet. It seems that Rik, Peter, and Mel each have their own way of measuring NUMA performance; they haven't done much talking among themselves in this area. That is, it was suggested, actually a good thing; each developer is able to find different problems with his particular approach.

Rik noted that, while the NUMA code tries hard to keep anonymous pages on the same node as the processes using them, the same care is not yet applied to page cache pages. His question was: should it be? It is not clear that localization of the page cache would lead to better performance overall; Mel said that this area is pretty much ignored for now.

Johannes Weiner said that a node-local page cache allocation policy might not make sense. If the system tries hard to allocate those pages locally, it could do so at the expense of pushing other useful pages out. At that point, the kernel is buying local pages at the expense of forcing disk I/O for other needed pages — probably not a good bargain. Currently, page reclaim is strongly tied to nodes, so some nodes can reclaim heavily while old pages languish on others. So, he said, it might be good to force some page aging on all nodes even if there isn't memory pressure everywhere. Then interleaving page cache pages across all nodes might be able to increase memory utilization and reduce the aging of useful pages, a win even if it results in more cross-node traffic.

There were complaints that processes communicating over network sockets are not grouped onto the same node, though they arguably should be. There is seemingly a bit of a disconnect with the networking developers on how that kind of grouping should be done. There would also be value in moving network-oriented processes onto the NUMA node that holds the network adapter they are using, but there is currently no I/O awareness in the NUMA code at all. Improving the integration of networking and NUMA placement is not going to be an easy task; it will likely involve carrying NUMA information through many layers of the network stack.

The session wound down without a lot in the way of hard conclusions. It seems clear that there is still a lot of work to be done in the area of NUMA placement.

[Your editor would like to thank the Linux Foundation for supporting his travel to the Summit.]


Memory compaction issues

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
Memory compaction is the process of relocating active pages in memory in order to create larger, physically contiguous regions — memory defragmentation, in other words. It is useful in a number of ways, not the least of which is making huge pages available. But compaction apparently has some problems of its own; Vlastimil Babka led a brief session in the 2014 Linux Storage, Filesystem, and Memory Management Summit to explore the issue.

After Vlastimil gave a quick overview of how compaction works (also described in this article) and described problems related to compaction overhead, Rik van Riel made the claim that there are two core issues to be looked at in this area: (1) can the compaction code be made to be faster, and (2) when compaction appears to be too expensive, should it just be skipped?

It seems that a number of compaction bugs have been fixed over the years, but some clearly remain. How, it was asked, can they be made easier to find? Writing test programs that reveal compaction problems tends to be hard; these problems arise out of specific workloads that exercise the system in certain ways. There does not appear to be any easy way to abstract the problematic access patterns out of the workloads into separate test programs.

What that means is that the memory management developers don't really have a good understanding of why compaction problems are happening. Some workloads obviously create situations where compaction gets expensive, but how that happens is obscure. So there is clearly a need to gain a better understanding of how the problems come about. One step in that direction might be to add a new counter that is incremented anytime the kernel detects that it has spent a significant amount of time in the compaction code. If that counter starts to increase, that will be a signal that bugs in the compaction code are being tickled. Then, perhaps, it will be possible to try to figure out where those bugs are.

[Your editor would like to thank the Linux Foundation for supporting his travel to the Summit.]


Huge page issues

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
Using huge pages can improve performance on a number of workloads, mostly through decreased paging costs and better translation lookaside buffer usage. But supporting huge pages imposes costs of its own on the kernel. The memory management track of the 2014 Linux Storage, Filesystem, and Memory Management Summit set aside some time to talk about those costs and how they might be reduced.

Aneesh Kumar started out by saying that, on architectures with larger page sizes (such as PowerPC), users tend to disable the transparent huge pages feature because it causes performance problems. Those problems might result from the fact that, when normal pages are larger (64KB, for example), huge pages are also larger. They may be getting large enough that internal fragmentation is a concern. One possible solution might be to split huge pages when they are swapped out of memory. That might reduce the amount of I/O required, especially if pages filled with zeroes can be skipped during the swapout process. There might also be some gains to be had by disabling the allocation of huge pages during page fault time, leaving it to the khugepaged daemon to assemble huge pages later on.

A question arises on NUMA systems: if a huge page cannot be allocated locally, is it better to allocate a remote huge page or a local small page? The benefits from using remote huge pages seemingly do not outweigh the costs of doing so. There was some confusion on this issue; Rik van Riel thought that falling back to small pages locally should already be the default course of action. Matthew Wilcox suggested that, if a page must be allocated remotely, it should always be a small page; huge pages should only be allocated on the local node.

There was also some talk of adding a new heuristic that would disable the allocation of huge pages in situations where memory is highly fragmented. When fragmentation happens, it might well be better to try to reduce overall memory usage by sticking with small pages. Once again, khugepaged can collapse things into huge pages later if the resources become available.

Peter Zijlstra had a different suggestion: take the transparent huge page mechanism out of the kernel entirely and leave the associated headaches behind. Andrew Morton agreed that transparent huge pages have "made a mess of the kernel." Hugh Dickins expanded on that thought, noting that the memory management subsystem as a whole has gotten significantly more complex, and that the transparent huge page feature is a big part of the problem. It is a feature that does benefit some workloads, and, he said, it was a magnificent technical achievement. But its reliance on memory compaction, its extensive reference counting, and its sheer code complexity are all downsides to the feature.

Rik noted that much of the complexity comes from the feature having been retrofitted onto the existing memory management code; it may be time to look at simplification through extensive rewriting of the code. Andrew agreed with a focus on simplicity, stating that the memory management code has gone beyond the developers' ability to maintain it.

Davidlohr Bueso shifted the conversation a bit, noting that HPUX has a "zero page daemon" charged with zero-filling huge pages when the CPU is otherwise idle. He would like to add a similar feature to Linux. It would work with the hugetlbfs subsystem only; transparent huge pages would not be zeroed in this way. Working with hugetlbfs is beneficial in that the kernel knows just how many pages have been configured, so there is an automatic bound on how many pages would need to be zeroed.

The benefit of such a mechanism would mainly be in reduced application-startup time. But it is unlikely to find its way into the mainline; Mel stated firmly that he saw it as a bunch of additional code for little real benefit. Adding work to the idle loop would have power-consumption implications, increase memory bandwidth usage, and lead to confusing variability in application performance. It is, he thought, a poor tradeoff overall. Andrew suggested finding a way to do the page zeroing from user space; that would make the overhead visible and put it all under user control.

Transparent huge pages currently only work with anonymous pages; file-backed pages are not covered. Kirill Shutemov talked about his work to extend transparent huge pages to the page cache (described briefly in this article). The work is about one year old and has shown significant improvements on some benchmarks. This improvement has come at the cost of adding a new lock to protect the page cache registry. Things get especially complicated when huge pages need to be split.

At that point, the discussion headed into the question of whether it is ever really necessary to split a huge page in the page cache. With anonymous pages, there are times when splitting is nearly unavoidable; performing copy-on-write on a page within a huge page is one example. But the page cache works differently. It might well be possible for some processes to map parts of a huge page as small pages while others see it as a single huge page cache page. But there are some interesting reference counting questions to be answered.

Reference counts for page cache pages live in the associated struct page in the system memory map. When the page is a huge page, there are many page structures, one for each small page that goes into the huge page. A reference count for a huge page is kept in the first of those page structures, while the reference counts in the remaining "tail pages" are not used. But if some threads see only a few of the tail pages rather than the huge page as a whole, where should the reference counts for those tail pages be kept? Rik suggested that the head page should be used to maintain a reference count for the whole huge page; a reference to the huge page would then set the count to 512 (the number of small pages that fit into a huge page). If the page is split, references to individual small pages can be dropped by decrementing the head-page counter, and the kernel still knows how many references to the huge page (or parts thereof) still exist.

Hugh worried that there could be confusion between the head page as representing a huge page and that page on its own as a small page, but Rik thought those issues could be worked out.

Should this work go upstream? Andrew suggested it should if, in the end, it makes things simpler. He also said that, in retrospect, the memory management subsystem should have been designed around variable page sizes from the beginning, but nobody was thinking in those terms. Any future work should be done with that kind of end goal in mind, though; grafting features like huge pages onto the existing memory management code has clearly been a mistake. That, of course, is a tall order for anybody wanting to improve the kernel's management of huge pages; it suggests that we could be seeing some fundamental changes in memory management in the coming years.

[Your editor would like to thank the Linux Foundation for supporting his travel to the Summit.]

Comments (1 posted)

Memory accounting and limits

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
Two separate sessions in the memory management track of the 2014 Linux Storage, Filesystem, and Memory Management Summit looked at memory accounting and the application of limits to memory usage. One would think that this old problem would have been solved long ago, but it is clear that there are still a number of open issues in this area.

Low limits

At the 2013 Summit, Michal Hocko tried to convince developers that a change was needed in how the "soft" limit in memory control groups ("memcgs") is implemented. He was not successful in that attempt, so, this year, he came back with a variation of that approach: rather than change soft limits, he would like to add a new limit to memcgs called the "low limit."

[Michal Hocko]

A soft limit is meant to provide an upper bound on memory consumption when the system is under memory pressure. If there is plenty of memory available, a memcg can consume more than its soft limit would allow, but, when pressure hits, the reclaim code will step in and the memcg's use will be cut back quickly to the soft limit. If the memory pressure persists, processes in memcgs may be cut back even further, well below the soft limit set by the memcg. But sometimes users don't want certain memcgs to go below a minimum amount of memory even when the memory pressure is severe.

That is the purpose of the low limit. If this limit is set on a memcg, the memory management subsystem will not reduce that memcg's usage below the limit even if the system is desperately short of memory. The low limit is meant to be a sort of guarantee; the system takes it seriously enough that it will go into a full out-of-memory condition before it will reduce a memcg below its low limit.
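For reference, memcg limits are set by writing values into per-group control files. Here is a minimal sketch, assuming the v1 memory controller is mounted at /sys/fs/cgroup/memory and a group named "mygroup" already exists; memory.limit_in_bytes and memory.soft_limit_in_bytes are the existing knobs, while the low-limit file name is only a guess, since the patch is not merged:

    /* Sketch: configuring memcg limits by writing control files. */
    #include <stdio.h>

    static int set_knob(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%s\n", value);
        return fclose(f);
    }

    int main(void)
    {
        const char *cg = "/sys/fs/cgroup/memory/mygroup";
        char path[256];

        /* Existing hard and soft limits. */
        snprintf(path, sizeof(path), "%s/memory.limit_in_bytes", cg);
        set_knob(path, "536870912");	/* 512MB ceiling */
        snprintf(path, sizeof(path), "%s/memory.soft_limit_in_bytes", cg);
        set_knob(path, "268435456");	/* 256MB under pressure */

        /* Proposed reclaim floor; hypothetical file name. */
        snprintf(path, sizeof(path), "%s/memory.low_limit_in_bytes", cg);
        set_knob(path, "134217728");	/* never reclaim below 128MB */
        return 0;
    }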

There were a couple of questions that resulted from this presentation. Peter Zijlstra went back to the idea of using the soft limit as a guarantee instead. Since nobody seems to like how soft limits are implemented now, why not just change things? Part of the problem is that the current default soft limit is "unlimited"; using the soft limit as a guarantee would require changing that default to zero. Whether that change (which constitutes an ABI change) would affect users is unclear; as Peter put it, anybody who is actually using soft limits is already changing that value anyway. But Michal, who fought this battle for a while, is nervous about changing that interface now, and he is not the only one.

Other developers questioned the wisdom of setting up a limit mechanism that is designed to push the system into out-of-memory situations. They don't feel that a minimal amount of memory can ever be guaranteed to a memcg, since the total amount of available memory cannot be guaranteed. But, in the end, most seem willing to let Michal try; if users break their systems with it, they get to keep all of the pieces.

But, in contrast to last year's discussion, Michal may well be pushed back toward using the soft limit rather than adding a new one. Some developers don't want to add yet another limit. There is also universal disdain for the current soft limit code, which, it is said, should not be viewed shortly after meals by developers with sensitive stomachs. Changing the way the limits work would enable the removal of much of that code. If soft limits are used, a simple "oom" Boolean flag could be added to allow users to request the "low limit" behavior; this flag would not be set by default. If the current view doesn't change, that is the form that the next version of this patch set will take.

Memory pinning

Peter Zijlstra got up to talk about situations where drivers need to allocate "pinned" pages — pages in a process's address space that cannot be swapped out or even migrated to a different location in memory. Pinning is useful for buffers used in RDMA conversations, with the perf events subsystem, and for video frame buffers, among other things. Once upon a time, Peter said, pinned pages were treated much like pages locked into memory with mlock() for accounting purposes. Either type of page would be accounted against the mlock limit, placing an upper bound on the total amount of memory a process could lock down.

More recently, he said, the accounting changed so that pinned pages are counted separately from locked pages. That essentially doubled the amount of memory a process could lock down. On some systems, that meant that processes were now able to push the system into an out-of-memory condition, which is not desirable. So Peter would like to revert the accounting back to the way it was before.

Andrew Morton replied that this could be hard. The kernel has been, for better or worse, changed to be more permissive; going back now could break things for other users. In the end, that view may carry, though no real conclusion was reached in the session.

One reason that Peter is looking at this functionality is that developers in the realtime community are figuring out that mlock() doesn't quite give them the guarantees they would like to have. Locking a page into memory guarantees that it will not be swapped out, but it still gives the kernel some freedom; in particular, the kernel is free to migrate a locked page between locations in RAM. Migration can cause delays and soft page faults for realtime applications, which is not welcomed by realtime developers.

As it happens, the kernel does not currently migrate locked pages, but the memory management developers reserve the right to do so in the future. So Peter is looking at adding a new set of system calls, mpin() and munpin(), that would fully pin pages in memory. When those calls go in, it would be nice to have a clear view of how the accounting will work. At the moment, it appears that pinned pages will go into a different accounting bin than locked pages.
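
A sketch of how the two interfaces would compare: mlock() and munlock() exist today, while mpin() and munpin() are only proposed, so they appear here in comments only:

    /* Sketch: locking versus the proposed pinning of a buffer.
     * Locked memory is bounded by RLIMIT_MEMLOCK, hence the
     * small buffer size. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define BUF_SIZE (16 * 4096)

    int main(void)
    {
        void *buf;

        if (posix_memalign(&buf, 4096, BUF_SIZE))
            return EXIT_FAILURE;

        /* Today: the pages cannot be swapped out, but the kernel
         * still reserves the right to move them within RAM. */
        if (mlock(buf, BUF_SIZE) != 0)
            perror("mlock");

        /* Proposed: the pages would also keep their physical
         * location, as RDMA and perf buffers require:
         *
         *     mpin(buf, BUF_SIZE);
         *     munpin(buf, BUF_SIZE);
         */
        munlock(buf, BUF_SIZE);
        free(buf);
        return 0;
    }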

[Your editor would like to thank the Linux Foundation for supporting his travel to the Summit.]

Comments (none posted)

Some vmsplice() issues

By Jonathan Corbet
March 26, 2014
2014 LSFMM Summit
Pavel Emelyanov works with the Checkpoint/Restore In Userspace (CRIU) project. One of the use cases for CRIU is live migration of processes from one host to another; that involves moving a lot of memory to and from sockets. The vmsplice() interface seems like an ideal tool for doing that work without unnecessarily copying the data. But in the process of using vmsplice() for this purpose, Pavel has run into a number of issues. In the final plenary session at the 2014 Linux Storage, Filesystem, and Memory Management Summit, Pavel discussed the problems he has encountered and their possible solutions.

One problem is that using a pipe to move pages of memory — part of the process of using vmsplice() — requires opening two separate file descriptors. CRIU needs to open a lot of pipes, so it tends to run into the limit on the total number of open file descriptors. Al Viro described a possible workaround: find one of the pipe file descriptors under /proc, open it as a read/write file descriptor, then close the two original descriptors. That will cut the number of required file descriptors in half.
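
A minimal sketch of that workaround (the helper name is invented, and error handling is abbreviated):

    /* Sketch: create a pipe, reopen one end read/write via /proc,
     * and close the two original descriptors, leaving a single
     * file descriptor per pipe. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int single_fd_pipe(void)
    {
        int fds[2], rw;
        char path[64];

        if (pipe(fds) != 0)
            return -1;
        snprintf(path, sizeof(path), "/proc/self/fd/%d", fds[0]);
        rw = open(path, O_RDWR);	/* one fd, both directions */
        close(fds[0]);
        close(fds[1]);
        return rw;
    }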

vmsplice(), when used with the SPLICE_F_GIFT flag, is meant to hand the indicated pages directly to the kernel without copying the data. But, Pavel said, it often ends up copying those pages anyway, even though the copying seems unnecessary. Some digging through the commit logs suggests that things were done this way to avoid surprising filesystems with pages of data coming from an unexpected direction. The filesystem developers seemed to agree that the amount of work required to handle such pages would be quite small, so perhaps this behavior could be changed. An action item was recorded to ask Nick Piggin (the original author of this code, who has since disappeared from the kernel community) whether there are any other subtle issues that might prevent greater use of zero-copy transfers.
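
For illustration, a minimal sketch of the call in question (the helper and its page-aligned buffer are invented; SPLICE_F_GIFT is the real flag under discussion):

    /* Sketch: gifting a page of data to the kernel via a pipe. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    ssize_t gift_page(int pipe_wr_fd)
    {
        void *page;
        struct iovec iov;

        if (posix_memalign(&page, 4096, 4096))
            return -1;
        memset(page, 0, 4096);

        iov.iov_base = page;
        iov.iov_len = 4096;
        /* The kernel may take ownership of the page rather than
         * copying it (though, as noted above, current kernels often
         * copy anyway); the caller must not touch it afterward. */
        return vmsplice(pipe_wr_fd, &iov, 1, SPLICE_F_GIFT);
    }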

Pavel's next problem is that pages sent to files with vmsplice() go into the page cache, but he would rather have them bypass the page cache and be written directly to the target file. It was pointed out that splicing to a file descriptor opened with O_DIRECT should work properly; at that point, the rest of the problem description came out. An O_DIRECT file descriptor does indeed work, but writes are synchronous, slowing things down. Pavel would rather there were a way to do asynchronous O_DIRECT writes via vmsplice(). Al allowed that it might be possible to make this work, but the job "might not be fun."

The final problem had to do with how to send pages out of another process's address space without actually copying them. James Bottomley suggested that some of the machinery behind the fork() system call could be used. The process would not actually be forked, but a copy of its address space would be made so that the migration process could get to its pages directly. The implementation of this functionality could be tricky but, if it could be done, it might make process migration significantly more efficient.

[Your editor would like to thank the Linux Foundation for supporting his travel to the Summit.]

Comments (1 posted)

Page editor: Jonathan Corbet

Distributions

Distributions wrestle with pip

By Jake Edge
March 26, 2014

One of the headline features for Python 3.4 was the inclusion of the pip Python package installer by default. But now that 3.4 is out, distributions are finding that the interactions between the bundled pip and their own system-installed pip packages were not completely thought out. Complicating things further is the virtualenv program that is used to create a private Python environment for development and other purposes.

Barry Warsaw sounded the alarm in a long message to the debian-python mailing list. Python 3.4 ships with an ensurepip program (disabled in Debian) that will bootstrap pip for a Python installation. Pip is released on its own schedule, separate from that of the main language and standard library; ensurepip is used to install (or upgrade) the pip package and, as its name implies, ensure that pip is available. Debian has its own .deb-based pip package, though, so ensurepip is not needed for the distribution; package dependencies will pull in pip.

Except that it is needed. On Debian, creating a virtual environment with the pyvenv program that ships with Python fails because ensurepip is not installed. To further complicate an already murky situation, the virtualenv command from the Debian python-virtualenv package works just fine. Virtual environments provide a Python installation that is entirely separate from any other Python installation on the system. They can be used to simultaneously run multiple versions of Python (2.x and 3.x, say) or to run programs using different library versions (multiple versions of Django, for example).

Debian is not the only distribution struggling with pip for Python 3.4; Fedora has run into the same problem. The issue has also been discussed before by both distributions (and presumably others).

The idea behind bundling pip comes from Python enhancement proposal (PEP) 453. It is really aimed at users of operating systems that don't have the equivalent of a distribution package manager (e.g. Windows or OS X), making it easier for those users to access packages beyond the standard library from the Python Package Index (PyPI). But even a distribution with as many packages as Debian does not package everything available on PyPI, so there is still a need for pip, even on Linux distributions.

But it is clear from the threads that pip is not universally admired. One of the problems with it was raised by Scott Kitterman back in September: there is no package validation when using pip. There are plans to add that feature, but that is not sufficient for some. As Kitterman put it: "I think that introducing a package download mechanism that is not cryptographically secured with a promise to later insecurely update the mechanism to have security is crazy talk."

In addition, pip does not play well with Python packages that are installed using the standard distribution package manager. It acts as if it owns the whole Python installation and will simply overwrite files in the site-packages directory when run as root. It clearly shows its non-Linux bias, which is irritating to some and can lead to unexpected results. For that reason, distributions generally try to restrict pip so that it either cannot run as root or only affects the user-specific package installation (in ~/.local).

Part of the solution may lie in finding a foolproof way to determine whether Python is running in a virtual environment or not. Warsaw outlined several tests that can be made to try to figure that out (and to figure which type of virtual environment it is). He suggested incorporating that into Debian's version of ensurepip so that it would be available for use in environments created with pyvenv, but not globally (which would allow users to run pip as root and potentially step on files installed by Apt and friends).
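
A minimal sketch of such tests; these are the commonly used checks, not necessarily the exact set Warsaw outlined:

    # Sketch: detect whether Python is running in a virtual
    # environment, and of which kind.
    import sys

    def virtual_env_type():
        # virtualenv replaces the interpreter and sets sys.real_prefix.
        if hasattr(sys, 'real_prefix'):
            return 'virtualenv'
        # PEP 405 environments (pyvenv) leave sys.base_prefix pointing
        # at the system installation while sys.prefix points into the
        # environment itself.
        if getattr(sys, 'base_prefix', sys.prefix) != sys.prefix:
            return 'pyvenv'
        return None        # a system Python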

Fedora has taken a different approach, as described by Bohuslav Kabrda. By using the custom rewheel program, Python packages in the Wheel format (which is how the bundled pip is delivered) can be unpacked, modified for Fedora, then repacked into a new Wheel file. The ensurepip program can call rewheel to install pip properly for Fedora.

Though rewheel is not distribution-specific, Warsaw was not particularly interested in that approach for Debian (or, especially, Ubuntu, where the pip package lives in Universe rather than the main repository). It would introduce a circular dependency for Debian and a cross-repository dependency for Ubuntu, Warsaw noted. But Fedora is also preparing for a switch to Python 3 as its default in the not-too-distant future, so the rewheel plan is expedient, Kabrda said.

There is clearly something of an impedance mismatch between Linux distributions and languages (or other systems) that have their own idea of how packages or add-ons should be installed. Making pip play more nicely with distribution package managers would go a long way toward solving the problem for Python, at least. There are movements in that direction, but we aren't there yet.

[ Thanks to Olav Vitters for a heads-up about this issue. ]

Comments (10 posted)

Brief items

Distribution quotes of the week

I absolutely did not start working on a "linux distribution", because that would be crazy. Do I look like a crazy-person?
-- Steve Kemp

Apps sprout up like mushrooms after rain, they change all the time, they conflict with each other, just conveying information from the development teams to the security people is a full time job.
-- Nicolas Mailhot

Doing this stuff requires a lot of hair-splitting, since it involves quite a bit of, as the saying goes, "tepid change for the somewhat better."
-- Russ Allbery

Neil and Lucas: do you have, or will you get, a Debian kilt and wear that for DebConf14?
-- Lars Wirzenius

Comments (3 posted)

Development for openSUSE 13.2 begins

The first milestone release of openSUSE 13.2 is available. This version makes Btrfs the default filesystem. YaST sports a new look with its Qt front-end ported to Qt5. KDE Frameworks 5 packages are included, along with Zypper 1.10.x, rpm 4.11.2, Wayland 1.4, and much more. openSUSE 13.2 is tentatively scheduled for a November release.

Comments (none posted)

Distribution News

Debian GNU/Linux

Debian Project Leader Elections

The Debian Project Leader election is underway, with two candidates. Incumbent DPL Lucas Nussbaum [platform] and challenger Neil McGovern [platform] are campaigning until March 30, after which voting begins. See the official vote page and follow the questions and answers on the debian-vote mailing list.

Comments (1 posted)

Newsletters and articles of interest

Distribution newsletters

Comments (none posted)

Fedora Present and Future: a Fedora.next 2014 Update (Part I, “Why?”)

In Fedora Magazine, Matthew Miller has an extensive look at why there is a need for Fedora.next. He links to a number of talks he and others have given as background, but the basic idea is that (at least in Miller's view) open source development has moved beyond the concept of distributions—they have just become boring infrastructure. "Well, actually all of the major distributions that work basically in the way Fedora does are on the decline. Slackware peaked before Fedora; openSUSE and Fedora seem to have peaked in terms of the buzz/popularity measure around 2006 or 2007. But Ubuntu has the same peak, just a bit later in 2009. If we count the years from now… that’s a long trend of decline for all of us. Ubuntu is still very popular, of course, but, they’re not cool. None of us are cool anymore. We want to be cool. How can we do that?" (Thanks to Paul Wise.)

Comments (161 posted)

Page editor: Rebecca Sobol

Development

The new and the not new in Java 8

By Nathan Willis
March 26, 2014

The "Standard Edition" of Java 8 was released on March 18. Over the course of the development cycle, this release has been through several delays and, with them, the feature set was cut back—losing some long-awaited changes to the platform. Still, the new release does incorporate what many see as the most important of the advertised additions: support for functional-programming style anonymous functions.

Lambdas

By far the most significant change in Java 8 is the addition of lambda expressions to the language definition. Lambda expressions, of course, are a key component of functional programming, but they have been available in popular imperative languages (other than Java) for quite a few years.

In short, lambda expressions are anonymous functions—they can be declared and called without being bound to a function name, and they can be passed to methods as parameters. Being able to declare and call anonymous functions is a verbosity-saving technique in its own right, but the ability to pass functions as parameters (which earlier Java releases could not do directly) is the most oft-discussed use case for lambda expressions. The canonical example for this functionality is a function that takes another function as an argument, such as integration. Prior to Java 8, integrating a function would first require defining, naming, and instantiating a wrapper class:

    public class FunctionWrapper
      {
        public double f(double x)
          {
            return x * x;
          }
      }

which would then be passed to the integration routine:

    FunctionWrapper MyFunctionF = new FunctionWrapper();
    double answer = TheCalculus.integrate(MyFunctionF, 0, 1);

Naturally, this process must be duplicated for each such function to be integrated. With lambda expressions, the code becomes much shorter:

    double answer = TheCalculus.integrate(x -> (x * x), 0, 1);

Thus, lambda expressions are (if nothing else) a space-saving addition to the language, removing the need for the anonymous inner classes used to wrap functions. But they have more far-reaching potential as well. In the long run, lambda expressions make it possible for developers to write more functional-language style code with the existing Java APIs. That is a different programming paradigm, so it is not easy to predict where it will lead, of course.

A more practical application of the feature is that it makes parallel programming simpler. Specifically, where large lists or sets are concerned, the purely imperative approach required iterating over the list or set in sequence. Lambda expressions make it possible to pass functions directly to the code that is handling each list or set element. As a result, several threads can split up the list or set and process it in parallel.

In fact, the Java 8 release makes use of this in the new java.util.stream package. Its Stream API is focused on parallel processing of large data sets.
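
As a quick, invented illustration (not code from the release itself), the same pipeline can run sequentially or in parallel simply by asking the collection for a different kind of stream:

    // Sum of squares, computed sequentially and then in parallel;
    // the library handles splitting the work across threads.
    import java.util.Arrays;
    import java.util.List;

    public class StreamDemo {
        public static void main(String[] args) {
            List<Integer> values = Arrays.asList(1, 2, 3, 4, 5);

            int sequential = values.stream()
                                   .mapToInt(x -> x * x)
                                   .sum();
            int parallel = values.parallelStream()
                                 .mapToInt(x -> x * x)
                                 .sum();
            System.out.println(sequential + " == " + parallel);
        }
    }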

Jigsaws, or the lack thereof

Lambda expressions have been a long-requested feature, since so many other languages already supported them. But the new release is also noteworthy for the fact that there were a few other major features originally planned that did not make it into the final Java 8. The first was Jigsaw, a complete modularization of JRE components, along the lines of the Python or Ruby module systems. Jigsaw was pulled from the new release in late 2012 and pushed back to the Java 9 development cycle. Of course, at the time, the decision to punt on Jigsaw was justified by saying that things had to be streamlined in order to meet Java 8's target ship date: end-of-year 2012.

Jigsaw arguably would have improved Java performance by minimizing the number of modules that needed to be loaded—or, in the case of embedded deployments, by minimizing the number of modules installed in the first place. But although the effort has been pushed back to Java 9 (ostensibly two years or so away), there is a concession in Java 8 to the desire for more modularity. Java 8 ships with a lightweight—if ultimately less flexible—alternative feature called Compact Profiles. This feature defines three distinct subsets of the Java SE platform, which can be used as development targets for smaller platforms.

Oracle's Jim Connors lists the components of each: "compact1" (containing 42 components), "compact2" (containing compact1 and an additional 30 components), and "compact3" (containing 37 components on top of the 72 found in compact2). While not a particularly flexible offering, the profiles do save space; they range in size from 14MB to 21MB, Connors notes, compared to about 45MB for the complete Java 8 SE virtual machine.

The final major feature that did not make it into Java 8 is Stripped Implementations, which was only dropped in the final month before the Java 8 release. The idea of stripped implementations was that developers could compile a Java application with a self-contained bundle of Java libraries built in, and as a result distribute the bundle as an executable that would run without the need to install the Java runtime environment.

The primary reason given for dropping Stripped Implementations was licensing—specifically, figuring out how to validate a particular stripped implementation. Mark Reinhold told the OpenJDK mailing list in early February that Oracle's legal team had concluded that Stripped Implementations would require rewriting the Java Technology Compatibility Kit (TCK) license. The TCK is used to validate that an implementation adheres to the Java specification. Out of a desire to not have wildly differing Stripped Implementations out there—all using the Java branding—the feature has been pushed back to Java 9.

Additional flavor crystals

Despite dropping two high-profile features, the Java 8 release was also delayed for at least one reason likely to meet with approval from most quarters. In mid-2013, Reinhold announced a concerted effort at Oracle to squash security bugs for Java 8, which further pushed back the final release date. Few Java developers expressed dissatisfaction with the goal of improving Java 8's security readiness, of course.

Ultimately, there were several other new features that made it into the release. Project Nashorn, for instance, is a new JavaScript engine designed to increase performance of JavaScript components running in server-side Java applications. There is also a new Date-Time package that is widely recognized as an improvement. Finally, there are two new atomic number classes: LongAdder and DoubleAdder. As with the existing AtomicInteger and AtomicLong offerings, they provide values that can be updated atomically, but the new atomics are designed specifically to perform well under high contention from multiple threads.
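
A brief, invented example of one of the new classes in action:

    // One million increments from many threads; LongAdder keeps
    // per-thread cells internally and sums them on demand, avoiding
    // the contention an AtomicLong would see on its single value.
    import java.util.concurrent.atomic.LongAdder;
    import java.util.stream.IntStream;

    public class AdderDemo {
        public static void main(String[] args) {
            LongAdder hits = new LongAdder();

            IntStream.range(0, 1_000_000)
                     .parallel()
                     .forEach(i -> hits.increment());
            System.out.println(hits.sum());    // prints 1000000
        }
    }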

All things considered, Java 8's major new feature is still lambda expressions; many of the secondary feature additions bring performance improvements of their own, but only time will tell whether they spark a resurgence in Java adoption. Certainly Java 7's lack of lambda expressions was seen as holding the platform back. Unfortunately, it seems as if Java shops will have to wait at least one more release cycle for the other big enhancements to arrive.

Comments (12 posted)

Brief items

Quote of the week

Never underestimate a C programmer's ability to write C code in any language you give them.
-- Jason Gorman

Comments (4 posted)

Firefox 28 is available

Mozilla has released Firefox 28 for Linux, Mac OS X, Windows, and Android. New in the desktop edition is support for VP9 video decoding, Opus audio decoding inside WebM containers, and the latest edition of the SPDY protocol, SPDY/3. In addition to those changes, the Android edition also adds native text selection, OpenSearch support, and the Estonian [et] locale.

Comments (16 posted)

Suricata 2.0 available

Version 2.0 of the Suricata intrusion-detection system (IDS) has been released. This version incorporates scalability improvements, better ICMPv6 support, and JSON output, among many other changes.

Full Story (comments: none)

musl libc 1.0.0 released

Version 1.0.0 of the "musl" C library implementation has been announced. Musl is aimed at small size and high efficiency; it is distributed under the MIT license. See this page for an introduction to the project.

Comments (91 posted)

Tracker 1.0.0 released

Tracker 1.0.0 has been released. Few functional changes landed since version 0.16, but this release does mark the production-ready, stable release of the semantic data storage engine in use by a number of open source desktop and mobile systems.

Full Story (comments: none)

Adobe releases OpenType toolkit for Linux

Adobe has released the latest build of its Adobe Font Development Kit for OpenType (AFDKO), a collection of command-line utilities and scripts for testing, validating, and manipulating OpenType font files. But a surprise in this latest update is that Linux packages are available for the first time. AFDKO is a valuable resource, but don't look for it to be packaged by distributions or other open source projects just yet; the license agreement prohibits some important freedoms, such as reverse-engineering or attempting to "discover the source code." Since many of the utilities are Python or Bash scripts, such prohibitions are perplexing if taken at face value, but a real license change is still in order before AFDKO would be compatible with most free software.

Comments (1 posted)

LibreJS 6.0 released

Version 6.0 of LibreJS, the GNU tool for blocking execution of non-free JavaScript in Web page contents, has been released. Quite a few new features are included, including the ability to whitelist specific scripts on a given page, a new Settings interface, better pattern-matching functionality in the blocklist, and improved performance on small-screen devices.

Full Story (comments: none)

Mozilla's "rr" debugger

Robert O'Callahan has posted an announcement of a new record-and-replay debugger (called rr) from the Mozilla project. "It's difficult to communicate the feeling of debugging with rr, but if you've ever done something wrong during debugging (e.g. stepped too far) and had that 'crap! Now I have to start all over again' sinking feeling --- rr does away with that. Everything you already learned about the execution --- e.g. the addresses of the objects that matter --- remains valid when you start the next replay. Your understanding of the execution increases monotonically."

Comments (22 posted)

Libav 10 available

Version 10 of the Libav multimedia library is now available. Many new codecs are supported, as are several new format-conversion options, filters, and media-decoding options, including hardware acceleration. Users are encouraged to look at the full changelog for a complete list of updates.

Comments (none posted)

GTK+ 3.12 released

GTK+ 3.12 has been released, with support for several new widgets (GtkFlowBox, GtkActionBar, and GtkPopover), better OS X integration, and support for version 1.4 of the Wayland protocol.

Full Story (comments: none)

GNOME 3.12 Released

GNOME 3.12 is out. "This is an exciting release for GNOME, and brings many new features and improvements, including app folders, enhanced system status and high-resolution display support. This release also includes new and redesigned applications for video, software, editing, sound recording and internet relay chat. Under the hood, support for using Wayland instead of X has progressed significantly." More information can be found in the release notes.

Full Story (comments: 27)

Newsletters and articles

Development newsletters from the past week

Comments (none posted)

GNOME 3.12 Seeded by GNOME OS Projects (Linux.com)

Linux.com has an interview with Allan Day and Matthias Clasen about GNOME 3.12 and the various projects that form GNOME OS. "Matthias: Continuous testing is being used in the GNOME project as we speak, and it is helping us every day to raise the quality of our code. Automatic builds are triggered after every commit, and a plethora of unit and integration tests are run on the resulting VM images. You can also download a VM image to run locally. Our continuous testing framework is based on OSTree, which describes itself as "Git for operating systems." It allows bootable, immutable, operating systems to be installed in parallel. OSTree is being used in a variety of interesting ways, such as in the Fedora Atomic initiative."

Comments (10 posted)

Does the Display Server matter?

Robert Ancell contends that the application toolkit is more important than the display server. "The result of this is the display server doesn't matter much to applications because we have pretty good toolkits that already hide all this information from us. And it doesn't matter much to drivers as they're providing much the same operations to anything that uses them (i.e. buffer management and passing shaders around)."

Martin Gräßlin counters that by taking a look at issues created by making applications work with multiple display servers. "Also the assumption that the toolkit behaves the same is just wrong. One of the issues I fixed was Qt returning a nullptr from QClipboard::mimeData on platform Wayland while it returned a valid pointer on all other platforms. It’s obviously an application bug but the documentation doesn’t say that there could be nullptr returned or not. There can be hundreds or thousands of such small behavior differences. An application developer is not able to ensure that the application is working correctly on a distro-specific display server."

Comments (105 posted)

Duffy: Hyperkitty at the 0th SpinachCon

At her blog, Máirín Duffy writes about her experience attending the "zeroth" SpinachCon, a new usability-centric conference for free software projects. "Like pointing out to a friend that they have spinach stuck in their teeth, SpinachCon is an event where free software fans can show up and let the participating free software projects know whether or not they have ‘spinach in their teeth,’ and between what teeth that spinach might be." Among the projects represented were Hyperkitty, MediaGoblin, Inkscape, and LibreOffice.

Comments (none posted)

Page editor: Nathan Willis

Announcements

Brief items

FSF gives awards to Matthew Garrett, Outreach Program for Women

The Free Software Foundation has announced that its annual award for the advancement of free software has gone to Matthew Garrett for his work making Linux support UEFI secure boot systems. The award for projects of social benefit went to the Outreach Program for Women.

Comments (31 posted)

FSFE: Computers in the post-Snowden era

The Free Software Foundation Europe is asking people to sign a petition requesting an unfettered choice of the operating system on telephones, laptops and other computing devices. "The revelations from Edward Snowden concerning massive surveillance of communications demonstrate the need for each person to be able to control their computers and phones. Yet computer and telephone manufacturers and retailers [typically] impose on users programs that jeopardise their privacy. Each person should therefore have the opportunity to refuse to pay for non-Free software, and be allowed to choose the programs that run on their telephone and computer."

Full Story (comments: none)

The AllSeen Alliance Announces Newest Members

The AllSeen Alliance has announced that Audio Partnership, Beechwoods Software, Beijing Winner Micro Electronics, CA Engineering, Imagination Technologies and Two Bulls are joining the alliance as community members. "Since launching in December 2013, the AllSeen Alliance has brought together 41 organizations to collaborate on an open software framework based on the AllJoyn open source project."

Comments (none posted)

Verizon Terremark Joins LF and New LF Members Collaborate on Connected Car Technology

The Linux Foundation has two announcements. The first is that Verizon Terremark has joined as a gold member. "Verizon Terremark's next generation cloud offering, Verizon Cloud, was announced in October and is now in public beta. It includes an IaaS platform, Verizon Cloud Compute, and an object-based storage service, Verizon Cloud Storage, that are built for the enterprise but nimble enough to meet the needs of small and medium-sized businesses, individual IT departments and software developers."

In other news, four companies have joined the Automotive Grade Linux Steering Committee; Advanced Driver Information Technology GmbH, ATS Advanced Telematic Systems GmbH, GlobalLogic, and OBIGO. "The Linux Foundation's AGL initiative continues to grow rapidly as the auto industry realizes that Linux -- both highly customizable and extremely powerful -- provides newfound control and innovation opportunities that are simply not possible with a proprietary operating system."

Full Story (comments: none)

Verizon Joins OIN

Open Invention Network (OIN) has announced that Verizon Communications has joined OIN as a licensee. "As the first major communications service provider to enroll in the OIN community, Verizon, a global leader in broadband, wireline, and wireless communications services, is demonstrating its commitment to open source software and the associated development efforts that benefit the entire communications industry."

Full Story (comments: none)

Articles of interest

FSFE: Open Letter to EU institutions: Time to support Open Standards

The Free Software Foundation Europe and Open Forum Europe wrote an open letter to the European Parliament and the European Commission asking for improved support for open standards. "The letter also raises the issue of video formats. Currently, it is difficult or impossible for Free Software users to follow the proceedings of the Parliament and the Council in real time, because the live video streams of these organisations rely on proprietary technology. This is a problem which OFE and FSFE have highlighted for many years."

Full Story (comments: none)

LF Collaborative Development Trends Report

The Linux Foundation has announced the release of the first-ever Collaborative Development Trends Report. "Companies in diverse industries across the globe are increasingly joining together to share development resources and build common open source code bases on which they can diversify their own products and services. These methods are dramatically disrupting the way technologies are being built and distributed: Linux, OpenStack, Hadoop, OpenDaylight and more are changing the way developers and business managers approach the world’s most complex technology challenges." Registration is required to access the full report.

Comments (none posted)

Calls for Presentations

Call for Papers for UK PostgreSQL conferences

Two one-day conferences will be taking place in a British country house near London. CHAR(14), "an international conference focused on Clustering, HA and Replication techniques, tools and user experiences", takes place July 8. PGDay UK follows on July 9. The call for papers closes April 17.

Full Story (comments: none)

CFP Deadlines: March 27, 2014 to May 26, 2014

The following listing of CFP deadlines is taken from the LWN.net CFP Calendar.

Deadline | Event Dates | Event | Location
March 31 | July 18–20 | GNU Tools Cauldron 2014 | Cambridge, England, UK
March 31 | September 15–19 | GNU Radio Conference | Washington, DC, USA
March 31 | June 2–4 | Tizen Developer Conference 2014 | San Francisco, CA, USA
March 31 | April 25–28 | openSUSE Conference 2014 | Dubrovnik, Croatia
April 3 | August 6–9 | Flock | Prague, Czech Republic
April 4 | June 24–27 | Open Source Bridge | Portland, OR, USA
April 5 | June 13–14 | Texas Linux Fest 2014 | Austin, TX, USA
April 7 | June 9–10 | DockerCon | San Francisco, CA, USA
April 14 | May 24 | MojoConf 2014 | Oslo, Norway
April 17 | July 9 | PGDay UK | near Milton Keynes, UK
April 17 | July 8 | CHAR(14) | near Milton Keynes, UK
April 18 | November 9–14 | Large Installation System Administration | Seattle, WA, USA
April 18 | June 23–24 | LF Enterprise End User Summit | New York, NY, USA
April 24 | October 6–8 | Operating Systems Design and Implementation | Broomfield, CO, USA
April 25 | August 1–3 | PyCon Australia | Brisbane, Australia
April 25 | August 18 | 7th Workshop on Cyber Security Experimentation and Test | San Diego, CA, USA
May 1 | July 14–16 | 2014 Ottawa Linux Symposium | Ottawa, Canada
May 1 | May 12–16 | Wireless Battle Mesh v7 | Leipzig, Germany
May 2 | August 20–22 | LinuxCon North America | Chicago, IL, USA
May 2 | August 20–22 | CloudOpen North America | Chicago, IL, USA
May 3 | May 17 | Debian/Ubuntu Community Conference - Italia | Cesena, Italy
May 4 | July 26–August 1 | Gnome Users and Developers Annual Conference | Strasbourg, France
May 9 | June 10–11 | Distro Recipes 2014 (canceled) | Paris, France
May 12 | July 19–20 | Conference for Open Source Coders, Users and Promoters | Taipei, Taiwan
May 18 | September 6–12 | Akademy 2014 | Brno, Czech Republic
May 19 | September 5 | The OCaml Users and Developers Workshop | Gothenburg, Sweden
May 23 | August 23–24 | Free and Open Source Software Conference | St. Augustin (near Bonn), Germany

If the CFP deadline for your event does not appear here, please tell us about it.

Upcoming Events

Events: March 27, 2014 to May 26, 2014

The following event listing is taken from the LWN.net Calendar.

Date(s) | Event | Location
March 26–28 | Collaboration Summit | Napa Valley, CA, USA
March 26–28 | 16. Deutscher Perl-Workshop 2014 | Hannover, Germany
March 29 | Hong Kong Open Source Conference 2014 | Hong Kong, Hong Kong
March 31–April 4 | FreeDesktop Summit | Nuremberg, Germany
April 2–5 | Libre Graphics Meeting 2014 | Leipzig, Germany
April 2–4 | Networked Systems Design and Implementation | Seattle, WA, USA
April 3 | Open Source, Open Standards | London, UK
April 7–9 | ApacheCon 2014 | Denver, CO, USA
April 7–8 | 4th European LLVM Conference 2014 | Edinburgh, Scotland, UK
April 8–10 | Lustre User Group Conference | Miami, FL, USA
April 8–10 | Open Source Data Center Conference | Berlin, Germany
April 11–13 | PyCon 2014 | Montreal, Canada
April 11 | Puppet Camp Berlin | Berlin, Germany
April 12–13 | State of the Map US 2014 | Washington, DC, USA
April 14–17 | Red Hat Summit | San Francisco, CA, USA
April 25–28 | openSUSE Conference 2014 | Dubrovnik, Croatia
April 26–27 | LinuxFest Northwest 2014 | Bellingham, WA, USA
April 29–May 1 | Embedded Linux Conference | San Jose, CA, USA
April 29–May 1 | Android Builders Summit | San Jose, CA, USA
May 1–4 | Linux Audio Conference 2014 | Karlsruhe, Germany
May 2–3 | LOPSA-EAST 2014 | New Brunswick, NJ, USA
May 8–10 | LinuxTag | Berlin, Germany
May 12–16 | OpenStack Summit | Atlanta, GA, USA
May 12–16 | Wireless Battle Mesh v7 | Leipzig, Germany
May 13–16 | Samba eXPerience | Göttingen, Germany
May 15–16 | ScilabTEC 2014 | Paris, France
May 17 | Debian/Ubuntu Community Conference - Italia | Cesena, Italy
May 20–21 | PyCon Sweden | Stockholm, Sweden
May 20–22 | LinuxCon Japan | Tokyo, Japan
May 20–24 | PGCon 2014 | Ottawa, Canada
May 21–22 | Solid 2014 | San Francisco, CA, USA
May 23–25 | PyCon Italia | Florence, Italy
May 23–25 | FUDCon APAC 2014 | Beijing, China
May 24 | MojoConf 2014 | Oslo, Norway
May 24–25 | GNOME.Asia Summit | Beijing, China

If your event does not appear here, please tell us about it.

Page editor: Rebecca Sobol


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds