LWN.net Weekly Edition for April 24, 2014
The next generation of Python programmers
In a keynote on day two of PyCon 2014 (April 12), Jessica McKellar made an impassioned plea for the Python community to focus on the "next generation" of Python programmers. She outlined the programming-education problem that exists in US high schools (and likely elsewhere in the world as well), but she also highlighted some concrete steps the community could take to help fix it. She expressed hope that she could report progress at the next PyCon, which will also be held in Montréal, Canada, next year.
![Jessica McKellar](https://static.lwn.net/images/2014/pycon-mckellar-sm.jpg)
Statistics
McKellar used the US state of Tennessee (known for "country music and Jack Daniels") as an example. That is where she went to high school and where her family lives. There are roughly 285,000 high school students in Tennessee, but only 251 of them took the advanced placement (AP) computer science (CS) exam. That is 0.09%. AP CS is the most common computer science class that high school students in the US can take.
She showed a slide with an Hour of Code promo that had various "important, famous, and rich people", including US President Barack Obama, who are enthusiastic about students learning to code, she said. They believe that all students should at least have the opportunity to learn how to program in school.
But the reality is quite a bit different. In terms of AP participation by students, CS is quite low. She put up a graph of AP test takers by subject for 2013, showing CS with roughly 30,000 takers nationwide, slightly ahead of 2D studio art and well below subjects like history or literature, which have 400,000+ participants each.
The problem, typically, is the availability of classes. McKellar's sister goes to high school in Nashville, Tennessee. She likes electronics and would be interested in taking a class in computer programming. The best that she is offered at her high school is a class on learning to use Microsoft Word, however. That's not just a problem for Tennessee, either, as there are roughly 25,000 high schools in the US and only around 2,300 of them teach AP CS.
She put up more statistics, including the pass rate for the AP CS exam, which was 66% in Tennessee; for the 25 African-American students who took it, however, the pass rate was only 32%. She showed maps of the US with states highlighted where no African-Americans took the test (11 states), no Hispanics (7), and no girls (3). One of the latter was Mississippi, where the lack of female test takers may be somewhat self-reinforcing; any girl in that state may well get the message that AP CS is not something that girls do. In addition, any girl who considers taking it will have to worry about her results being scrutinized on a national level: "I better pass or I will be a stat people talk about".
AP class participation by gender was next up. There are more AP classes where girls outnumber boys, but for math and science, that balance switches. CS has the worst gender imbalance of any AP class.
The Python community cares about this because it spends not just time and money, but "blood, sweat, and actual tears" trying to improve this imbalance, which starts in high school and even earlier. McKellar said she understands "how empowering programming is" and she wants students to have the opportunity to "engage in computational thinking skills". She wants the politicians that are making decisions about how the country is run to have that experience as well. Fixing this problem is important to the long-term success of the Python language as well.
What can be done
It was a "depressing introduction", she said, with lots of statistics that make it look like an "intractable problem". But there is some solace that can be taken from some of those statistics. While AP CS is the most gender-skewed AP exam, 29% of the test takers in Tennessee were girls. That is largely due to one teacher, Jill Pala in Chattanooga, who taught 30 of the 71 girls who took the exam. McKellar asked: If one teacher can make that big of a difference, what can a 200,000-member community do?
To answer that question, she asked three CS education specialists. If one is "armed with a Python community", what can be done to help educate the next generation of (Python) programmers? She got responses in four main areas: policy, student engagement, supporting teachers, and curriculum development. She said that she would be presenting the full fire hose of ideas in the hopes that everyone in the audience would find at least one that resonated.
Policy
To start with, in 35 states computer science is only an elective that doesn't give any math or science credit to a student who takes it. If a class doesn't "count for anything", students don't want to take it, schools don't want to offer it, and teachers don't want to teach it. One thing that audience members could do is to "pick up the phone" and ask legislators and school boards to change that.
There is also a lack of per-state data on who makes the decisions about high school graduation requirements, and no single place to go for the per-state credential requirements for a teacher to be certified in CS. This is a problem for policy makers because they have no way to judge their own state's policies by comparing them with those of their neighbors. It is something that could be fixed "in a weekend" with some Python, web scraping, and version control, she said.
Another problem is that AP CS is still taught in Java, which is a bad language to teach in. That is not her "hating on Java", but it is something that teachers say. Consider what goes into writing "hello world" in Java: it doesn't allow deferring certain concepts (e.g. classes, object-oriented programming), which makes it difficult to understand "if you've never written a for loop before". Java is also getting "long in the tooth" as the AP CS language. Pascal was that language for fifteen years, followed by C++ for six years, and now Java for the last eleven years.
People have been gathering information about what languages colleges use, what languages college teachers wish they were using, and what language they think they will be using ten years from now. Some of that information is compiled into Reid's List (described in this article [PDF], scroll down a ways), which shows that Python's popularity with college CS programs is clearly on the rise. But Reid's List has not been updated since 2012, as it is done manually, partly via phone calls, she said.
The 2012 list also shows Java with a clear lock on first place for the first programming language taught (Java 197, C++ 82, Python 43, ...). But, McKellar believes that Python's numbers have "skyrocketed" since then. She would like people to engage the College Board (which sets the AP standards) to switch the AP CS exam from Java to Python. The College Board bases its decision on what language teachers want to teach in, so the rise of Python could be instrumental in helping it to make that decision—especially if that rise has continued since 2012. AP CS courses in Python would make for a "more fun computing experience", she said, and by engaging with the College Board, that change could happen in four to six years.
Student engagement
Students don't really know what CS is or what it is about, so they don't have much interest in taking it. There are, however, lots of existing organizations that teach kids but don't know programming: summer camps, Boy Scouts, Girl Scouts, etc. This is an opportunity for the Python community to work with these groups to add a programming component to their existing activities.
There are also after-school programs that lack programming teachers. The idea is to take advantage of existing programs for engaging students to help those students learn a bit about computer science. That may make them more likely to participate in AP CS when they get to high school.
Supporting teachers
CS teachers are typically all alone, as there is generally only one per high school. That means they don't have anyone to talk to or to bounce ideas off of. But the Python community is huge, McKellar said, so it makes sense to bring those high school teachers into it.
Pythonistas could offer to answer lesson plan questions. Or offer to be a teaching assistant. They could also volunteer to visit the class to answer questions about programming. Inviting the teacher to a local user group meeting might be another way to bring them into the community.
Curriculum development
There is a new AP CS class called "CS Principles" that is being developed and tested right now. It covers a broader range of topics that will appeal to more students, so it is a "really exciting way to increase engagement", she said. So far, though, there is no day-to-day curriculum for the course in any language. That is a huge opportunity for Python.
If the best way to teach the CS Principles class was with a curriculum using Python, everyone would use it, she said. Teachers have a limited amount of time, so if there is an off-the-shelf curriculum that does a good job teaching the subject, most will just use it. The types of lessons that are laid out for the class look interesting (e.g. cryptography, data as art, sorting) and just require some eyes and hands to turn them into something Python-oriented that can be used in the classroom. Something like that could be used internationally, too, as there aren't many curricula available for teaching programming to high school students.
Deploying Python for high schools can be a challenge, however. She talked with one student who noted that even the teachers could not install software in the computer lab—and the USB ports had been filled with glue for good measure. That means that everything must be done in the browser. Runestone Interactive has turned her favorite book for learning Python, Think Python, into an interactive web-based textbook. The code is available on GitHub.
Perhaps the most famous browser-based Python is Skulpt, which is an implementation of the language in JavaScript (also available on GitHub). There are currently lots of open bugs for things that teachers want Skulpt to be able to do. Fixing those bugs might be something the community could do. Whether we like it or not, she said, the future of teaching programming may be in the browser.
Summing up
Since we are starting from such terrible numbers (both raw and percentage-wise), a small effort can make a big difference, McKellar said. The Python Software Foundation (PSF), where McKellar is a director, is ready to help. If you are interested in fixing Skulpt bugs, for example, the PSF will feed you pizza while you do that (in a sprint, say). It will also look to fund grant proposals for any kind of Python-education-related project.
She put forth a challenge: by next year's PyCon, she would like to see every attendee do one thing to further the cause of the next generation of Python programmers. At that point, she said, we can look at the statistics again and see what progress has been made. As she made plain, there is plenty of opportunity out there, it just remains to be seen if the community picks up the ball and runs with it.
Slides and video from McKellar's keynote are available.
Pickles are for delis
Alex Gaynor likes pickles, but not of the Python variety. He spoke at PyCon 2014 in Montréal, Canada to explain the problems he sees with the Python pickle object serialization mechanism. He demonstrated some of the things that can happen with pickles—long-lived pickles in particular—and pointed out some alternatives.
![Alex Gaynor](https://static.lwn.net/images/2014/pycon-gaynor-sm.jpg)
Pickle introduction
He began by noting that he is a fan of delis, pickles, and, sometimes, software, but that some of those things—software and the Python pickle module—were also among his least favorite things. The idea behind pickle serialization is simple enough: hand the dump() function an object, get back a byte string. That byte string can then be handed to the pickle module's load() function at a later time to recreate the object. Two of the use cases for pickles are to send objects between two Python processes or to store arbitrary Python objects in a database.
The pickle.dumps() (dump to a string) function returns "random nonsense", he said, and demonstrated that with the following code:
    >>> pickle.dumps([1, "a", None])
    "(lp0\nI1\naS'a'\np1\naNa."
By using the pickletools module, which is not well-known, he said, one can peer inside the nonsense:
    >>> pickletools.dis("(lp0\nI1\naS'a'\np1\naNa.")
        0: (    MARK
        1: l    LIST       (MARK at 0)
        2: p    PUT        0
        5: I    INT        1
        8: a    APPEND
        9: S    STRING     'a'
       14: p    PUT        1
       17: a    APPEND
       18: N    NONE
       19: a    APPEND
       20: .    STOP

The pickle format is a simple stack-based language, similar in some ways to the bytecode used by the Python interpreter. The pickle is just a list of instructions to build up the object, followed by a STOP opcode to return what it has built so far.
In principle, dumping data to the pickle format is straightforward: determine the object's type, find the dump function for that type, and call it. Each of the built-in types (like list, int, or string) would have a function that can produce the pickle format for that type.
But, what happens for user-defined objects? Pickle maintains a table of functions for the built-in types, but it can't do that for user-defined classes. It turns out that it uses the object's __reduce__() method, which returns a function and arguments used to recreate the object. That function and its arguments are put into the pickle, so that the function can be called (with those arguments) at unpickling time. Using the built-in object() type, he showed how that information is stored in the pickle (the output was edited by Gaynor for brevity):
    >>> pickletools.dis(pickle.dumps(object()))
        0: c    GLOBAL     'copy_reg _reconstructor'
       29: c    GLOBAL     '__builtin__ object'
       55: N    NONE
       56: t    TUPLE
       60: R    REDUCE
       64: .    STOP

The _reconstructor() method from the copy_reg module is used to reconstruct its argument, which is the object type from the __builtin__ module. Similarly, for a user-defined class (again, output has been simplified):
    >>> class X(object):
    ...     def __init__(self):
    ...         self.my_cool_attr = 3
    ...
    >>> x = X()
    >>> pickletools.dis(pickle.dumps(x))
        0: c    GLOBAL     'copy_reg _reconstructor'
       29: c    GLOBAL     '__main__ X'
       44: c    GLOBAL     '__builtin__ object'
       67: N    NONE
       68: t    TUPLE
       72: R    REDUCE
       77: d    DICT
       81: S    STRING     'my_cool_attr'
      100: I    INT        3
      103: s    SETITEM
      104: b    BUILD
      105: .    STOP

The pickle refers to the class X, by name, as well as the my_cool_attr attribute. By default, Python pickles all of the entries in x.__dict__, which stores the attributes of the object.
A class can define its own unique pickling behavior by defining the __reduce__() method. If it contains something that cannot be pickled (a file object, for example), some kind of custom pickling solution must be used. __reduce__() needs to return a function and arguments to be called at unpickling time, for example:
    >>> class FunkyPickle(object):
    ...     def __reduce__(self):
    ...         return (str, ('abc',),)
    ...
    >>> pickle.loads(pickle.dumps(FunkyPickle()))
    'abc'
Unpickling is "shockingly simple", Gaynor said. If we look at the first example again (i.e. [1, 'a', None]), the commands in the pickle are pretty straightforward (ignoring some of the extraneous bits). LIST creates an empty list on the stack, INT 1 puts the integer 1 on the stack, and APPEND appends it to the list. The string 'a' and None are handled similarly.
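To make the stack-machine model concrete, here is a toy interpreter written for this article rather than taken from the talk; the toy_unpickle() function and its instruction tuples are simplifications of the real opcode stream, which pickle decodes from bytes and which includes many more opcodes (plus a memo table for PUT and GET, ignored here):

    # A toy interpreter illustrating the stack-machine model described above.
    def toy_unpickle(instructions):
        stack = []
        for op, arg in instructions:
            if op == 'LIST':            # push an empty list
                stack.append([])
            elif op == 'INT':           # push an integer
                stack.append(arg)
            elif op == 'STRING':        # push a string
                stack.append(arg)
            elif op == 'NONE':          # push None
                stack.append(None)
            elif op == 'APPEND':        # pop a value, append it to the list below
                value = stack.pop()
                stack[-1].append(value)
            elif op == 'STOP':          # return whatever has been built so far
                return stack.pop()

    # Roughly the instruction sequence seen in the disassembly above:
    program = [
        ('LIST', None), ('INT', 1), ('APPEND', None),
        ('STRING', 'a'), ('APPEND', None),
        ('NONE', None), ('APPEND', None),
        ('STOP', None),
    ]
    print(toy_unpickle(program))        # [1, 'a', None]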
Pickle woes
But, as we've seen, pickles can cause calls to any function available to the program (built-ins, imported modules, or those present in the code). Using that, a crafted pickle can cause all kinds of problems—from information disclosure to a complete compromise of the user account that is unpickling the crafted data. It is not a purely theoretical problem, either, as several applications or frameworks have been compromised because they unpickled user-supplied data. "You cannot safely unpickle data that you do not trust", he said, pointing to a blog post that shows how to exploit unpickling.
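As a deliberately harmless illustration of what he was warning about (the NotReallyData class is made up for this article, not from the talk), a pickle can name any importable callable via __reduce__(), and that callable runs when the data is loaded; an attacker would simply substitute something far more damaging than os.getcwd():

    import os
    import pickle

    # __reduce__() can name any importable callable; pickle will call it,
    # with the given arguments, at load time. os.getcwd is a harmless
    # stand-in for a real payload.
    class NotReallyData(object):
        def __reduce__(self):
            return (os.getcwd, ())

    crafted = pickle.dumps(NotReallyData())
    # The "victim" only calls pickle.loads(), yet arbitrary code runs:
    print(pickle.loads(crafted))        # prints the current directory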
But, if the data is trusted, perhaps because we are storing and retrieving it from our own database, are there other problems with pickle? He put up a quote from the E programming language web site (scroll down a ways) that pointed to the problem:
- Erm, can I get back to you on that?
He then related a story that happened "many, many maintenance iterations ago, in a code base far, far away". Someone put a pickle into a database, then no one touched it for eighteen months or so. He needed to migrate the table to a different format in order to optimize the storage of some of the other fields. About halfway through the migration of this 1.6 million row table, he got an obscure exception: "module has no attribute".
As he mentioned earlier, pickle stores the name of the pickled class. What if that class no longer exists in the application? In that case, Python throws an AttributeError exception, because the "'module' object has no attribute 'X'" (where X is the name of the class). In Gaynor's case, he was able to go back into the Git repository, find the class in question, and add it back into the code.
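The failure mode he described is easy to reproduce; in this hypothetical sketch, deleting the Legacy class stands in for the maintenance change that removed or renamed it:

    import pickle

    class Legacy(object):
        pass

    stored = pickle.dumps(Legacy())     # e.g. written to a database column

    # Simulate a later release in which the class was removed or renamed:
    del Legacy

    try:
        pickle.loads(stored)
    except AttributeError as exc:
        print(exc)   # an AttributeError: the module no longer has the class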
A similar problem can occur if the name of an attribute in the class should change. The name is "hardcoded" in the pickle itself. In both of his examples, someone was doing maintenance on the code, made some seemingly innocuous changes, but didn't realize that there was still a reference to the old names in a stored pickle somewhere. In Gaynor's mind, this is the worst problem with pickles.
Alternatives
But if pickling is not a good way to serialize Python objects, what is the alternative? He said that he advocated writing your own dump() functions for objects that need them. He demonstrated a class that had a single size attribute, along with a JSON representation that was returned from dump():
    def dump(self):
        return json.dumps({
            "version": 1,
            "size": self.size,
        })

The version field is the key to making it all work as maintenance proceeds. If, at some point, size is changed to width and height, the dump() function can be changed to emit "version": 2. One can then create a load() function that deals with both versions. It can derive the new width and height attributes from size (perhaps using sqrt() if size was the area of a square table as in his example).
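A load() counterpart along those lines might look like the following sketch; the Table class and its width, height, and size attributes are illustrative stand-ins, not the talk's actual code:

    import json
    import math

    class Table(object):
        def dump(self):
            # Version 2 stores width and height instead of a single size.
            return json.dumps({
                "version": 2,
                "width": self.width,
                "height": self.height,
            })

        @classmethod
        def load(cls, data):
            fields = json.loads(data)
            obj = cls()
            if fields["version"] == 1:
                # Old records stored only the area of a square table;
                # derive the new attributes from it.
                obj.width = obj.height = math.sqrt(fields["size"])
            elif fields["version"] == 2:
                obj.width = fields["width"]
                obj.height = fields["height"]
            else:
                raise ValueError("unknown version: %r" % fields["version"])
            return obj

The point is that old serialized records keep working after the schema changes, because the loader knows how to interpret every version it has ever emitted.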
Writing your own dump() and load() functions is more testable, simpler, and more auditable, Gaynor said. It can be tested more easily because the serialization doesn't take place inside an opaque framework. The simplicity also comes from the fact that the code is completely under your control; pickle gives you all the tools needed to handle these maintenance problems (using __reduce__() and a few other special methods), but it takes a lot more code to do so. Custom serialization is more auditable because one must write dump() and load() for each class that will be dumped, rather than pickle's approach which simply serializes everything, recursively. If some attribute got pickled improperly, it won't be known until the pickle.load() operation fails somewhere down the road.
His example used JSON, but there are other mechanisms that could be used. JSON has an advantage that it is readable and supported by essentially every language out there. If speed is an issue, though, MessagePack is probably worth a look. It is a binary format and supports lots of languages, though perhaps somewhat fewer than JSON.
He concluded his talk by saying that "pickle is unsafe at any speed" due to the security issues, but, more importantly, the maintenance issues. Pickles are still great at delis, however.
An audience member wondered about using pickles for sessions, which is common in the Python web-framework world. Gaynor acknowledged pickle's attraction, saying that being able to toss any object into the session and get it back later is convenient, but it is also the biggest source of maintenance problems in his experience. The cookies that are used as session keys (or signed cookies that actually contain the pickle data) can pop up at any time, often many months later, after the code has changed. He recommends either only putting simple types that JSON can handle directly into sessions or creating custom dump() and load() functions for things JSON can't handle.
There are ways to make pickle handle code updates cleanly, but they require that lots of code be written to do so. Pickle is "unsafe by default" and it makes more sense to write your own code rather than to try to make pickle safe, he said. One thing that JSON does not handle, but pickle does, is cyclic references. Gaynor believes that YAML does handle cyclic references, though he cautioned that the safe_load() function in the most popular Python implementation must be used rather than the basic load() function (though he didn't elaborate). Cyclic references are one area that makes pickle look nice, he said.
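A quick demonstration of that difference, using only the standard json and pickle modules:

    import json
    import pickle

    cycle = []
    cycle.append(cycle)                 # a list that contains itself

    try:
        json.dumps(cycle)
    except ValueError as exc:
        print(exc)                      # "Circular reference detected"

    restored = pickle.loads(pickle.dumps(cycle))
    print(restored[0] is restored)      # True: the cycle survives the round trip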
One of the biggest lessons he has learned when looking at serialization is that there is no single serialization mechanism that is good for all use cases. Pickle may be a reasonable choice for multi-processing programs where processes are sending pickles to each other. In that case, the programmer controls both ends of the conversation and classes are not usually going to disappear during the lifetime of those processes. But the clear sense was that, even in that case, Gaynor would look at a solution other than pickle.
The video of the talk is at pyvideo.org (along with many other PyCon videos). The slides are available at Speaker Deck.
Transmageddon media transcoder reaches 1.0
Version 1.0 of the Transmageddon media transcoder was released in early April, complete with several new features in addition to the implicit stability cred that accompanies a 1.0 release. The tool itself looks unassuming: it occupies a small GTK+ window with little on display beyond a few pull-down menu selectors and a "Transcode" button, but it fills an important gap in the free software landscape. Support for compressed media formats can often turn into a minefield, with patent and licensing problems on one side and lack of free-software support on the other. Tools like Transmageddon that convert one format to another are in some sense a necessary evil as long as there are multiple audio/video formats to choose from, but a good free software tool can minimize the pain, making format conversion easy to tackle without constructing the lengthy and arcane command sequences demanded by command-line encoders.
![Transmageddon GUI](https://static.lwn.net/images/2014/04-transmageddon-gui-sm.png)
The new release was made April 1, and was followed up by a 1.1 release on April 10 that fixed a pair of important bugs. Source bundles are available for download on the project's web site, as are RPM packages. Transmageddon requires GTK+3, GLib 2, Python 3, and GStreamer 1.0. The formats that the application can handle depend on which GStreamer plugins are installed; for the fullest range of options, all of the plugin collections are recommended, including the Bad and Ugly sets.
Christian Schaller started Transmageddon as a kind of personal experiment to familiarize himself with Python and GStreamer development back in 2009. Since then, the application has grown a bit in scope, but mostly in accordance with new features supported by GStreamer. The original intent, for instance, was to support simple output profiles that targeted portable devices, so that a user could transcode a video file for playback on the Nokia 810 tablet or Apple iPod. As most people who have experimented with transcoding videos have discovered, though, there are still quite a few options to juggle for any one target device, including separate options for the video and audio codecs chosen and what to do with multiple audio tracks. In addition to that, of course, the landscape of codecs and container formats has continued to evolve over the intervening years, as it will certainly continue to do.
Transmageddon's main competition in the easy-to-use media transcoder space is arguably HandBrake, which also offers a user-friendly GUI and is available under the GPL. But HandBrake has its own oddities that inhibit its support from Linux distributions, starting with the fact that it ships linked against GPL-incompatible components like the versions of FAAC that include proprietary code.
In practice, the other main transcoding approach in use is directly calling the command-line transcoding features exposed by FFmpeg, Mencoder, and other media-playback libraries. The difficulty, of course, is that ease of use is severely lacking with this approach. One often needs a different incantation for almost every input and output option, and the command-line libraries have a notorious reputation for esoteric syntax that is difficult to memorize—a reputation that, frankly, is well-earned. Command-line usage by end users is not the main goal of any of these library projects.
Transmageddon, on the other hand, works—and it generally works without too many confusing errors that are hard to debug. The user selects the input file from a GTK+ file-chooser dialog, Transmageddon automatically detects the makeup and format of the audio and video data within, and presents drop-down lists with the output options supported by the available GStreamer plugins.
![Transmageddon and audio options](https://static.lwn.net/images/2014/04-transmageddon-audio-sm.png)
The 1.0 release introduced three major new features: support for multiple audio streams, support for directly ripping DVDs, and support for defining custom transcoding presets. The multiple audio stream support means that Transmageddon has the ability to transcode each audio stream independently—including omitting some streams, and including transcoding each stream into a different codec. DVDs commonly include multiple audio tracks these days, and for some (such as audio commentary tracks) it might make sense to use a speech-specific codec like Speex, rather than a general-purpose codec that must handle the blend of speech, music, and sound effects found in the primary audio mix.
![Transmageddon DVD ripping](https://static.lwn.net/images/2014/04-transmageddon-dvd-sm.png)
DVD-ripping, of course, is a common use case that Transmageddon had simply never supported before. The legality of ripping DVDs—even DVDs that you own—is a matter where the entertainment industry vehemently clashes with the software industry (and many consumers), so manually installing a few optional libraries like libdvdcss2 is required. Some Linux distributions make this step easier than others, but for a readable DVD, Transmageddon accurately detects the disc's media files, and transcoding them to a more portable format is not any more difficult than transcoding other content.
Custom presets for Transmageddon can be created by making an XML configuration file that follows the application's profile syntax. The syntax is straightforward, with <video> and <audio> elements encapsulating basic information like the codec to be used, frame rate, and number of audio channels. The only genuinely complex part of defining a custom profile is likely to be knowing and understanding the options for the chosen codecs; for help with that task, the Transmageddon site recommends getting familiar with GStreamer's gst-inspect-1.0 utility.
Two other low-level features also made it into the 1.0 release: support for Google's VP9 video codec and the ability to set a "language" tag for the input file (which is then preserved in the metadata of the output file). VP9 is now supported by Firefox, Chrome/Chromium, FFmpeg, and a few other applications, although it might not yet be called widespread. The language-tagging feature is just the beginning of better language support: DVDs typically tag their audio tracks with a language (which is metadata that Transmageddon preserves in the output file); in the future, Schaller said, he hopes to support multiple language tags on all input files, to simplify finding an audio track by language during playback.
Given its ease of use, Transmageddon is likely to attract quite a few users. Almost everyone runs into the need to convert a media file into another format at one time or another; some people do such transcoding on a regular basis, particularly for their portable devices. But Transmageddon is also worth exploring because it is built on GStreamer, which in recent years has grown into a library in widespread use by projects far beyond the GNOME desktop environment. GStreamer's plugin architecture makes it possible to build and use a multimedia system without any problematic codecs—and to build it without the hassle of recompiling from source. Distributors and system builders can address codec patent licenses and other such issues as they see fit, often on a per-plugin basis.
Transmageddon thus meets an important need with a simple and straightforward interface. At the same time, though, its existence also demonstrates that despite the popularity of GStreamer, FFmpeg, MPlayer, and similar big-name multimedia projects, transcoding media files remains an end-user's problem. Life for the average user would be much simpler if media applications like MythTV, XBMC, gPodder, and the various "music store" plugins of audio players automatically handled transcoding duties. But they don't. Partly that is because the never-ending codec wars make it unduly hard for a free-software application to automatically read and convert proprietary file formats, of course, but the upshot is that users have to juggle the codecs, containers, and transcoding themselves.
Security
Decentralized storage with Camlistore
Reducing reliance on proprietary web services has been a major target of free-software developers for years now. But it has taken on increased importance in the wake of Edward Snowden's disclosures about service providers cooperating with government mass-surveillance programs—not to mention the vulnerability that many providers have to surveillance techniques whether they cooperate or not. While some projects (such as Mailpile, ownCloud, or Diaspora) set out to create a full-blown service that users can be in complete control of, others, such as the Tahoe Least-Authority Filesystem, focus on more general functionality like decentralized data storage. Camlistore is a relative newcomer to the space; like Tahoe-LAFS it implements a storage system, but its creators are particularly interested in its use as a storage layer for blogs, content-management systems (CMSes), filesharing, and other web services.
Camlistore is a content-addressable storage (CAS) system with an emphasis on decentralized data storage. Specifically, the rationale for the project notes that it should be usable on a variety of storage back-ends, including Amazon's S3, local disk, Google Drive, or even mobile devices, with full replication of content between different locations.
Content addressability means that objects can be stored without assigning them explicit file names or placing them in a directory hierarchy. Instead, the "identity" of each object is a hash or digest calculated over the content of the object itself; subsequent references to the object are made by looking up the object's digest—where it is stored is irrelevant. As the rationale document notes, this property is a perfect fit for a good many objects used in web services today: photos, blog comments, bookmarks, "likes," and so on. These objects are increasingly created in large numbers, and rarely does a file name or storage location come into play. Rather, they are accessed through a search interface or a visual browsing feature.
The Camlistore project produces both an implementation of such a decentralized storage system and a schema for representing various types of content. The schema would primarily be of interest to those wishing to use Camlistore as a storage layer for other applications.
The project's most recent release is version 0.7, from February 27. The storage server (with several available back-ends) is included in the release, as are a web-based interface, a Filesystem in Userspace (FUSE) module for accessing Camlistore as a filesystem, several tools for interoperating with existing web services, and mobile clients for Android and iOS.
The architecture of a Camlistore repository includes storage nodes (referred to by the charming name "blob servers") and indexing/search nodes, which index uploaded items by their digests and provide a basic search interface. The various front-end applications (including the mobile and web interfaces) handle both connecting to a blob server for object upload and retrieval and connecting to a search server for finding objects.
There can be several blob servers that fully synchronize with one another by automatically mirroring all data; the existing implementations can use hard disk storage or any of several online storage services. At the blob-server level, the only items that are tracked are blobs: immutable byte sequences that are uploaded to the service. Each blob is indexed by its digest (also called a blobref); Camlistore supports SHA1, MD5, and SHA256 as digest functions. Blobs themselves are encrypted (currently with AES-128, although other ciphers may be added in the future).
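As a rough sketch of what content addressing means in practice, the blobref() helper below is invented for illustration; the "sha1-" prefix mirrors the style of Camlistore blobrefs, but this is not the project's client code:

    import hashlib

    def blobref(content):
        # A blob's identity is derived purely from its bytes; where the
        # blob happens to be stored is irrelevant to callers.
        return "sha1-" + hashlib.sha1(content).hexdigest()

    photo = b"...raw image bytes..."    # placeholder content
    print(blobref(photo))
    # The same bytes always produce the same blobref, so any mirror
    # holding the blob can serve a request for it.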
Semantically speaking, a blob does not contain any metadata—it is just a bunch of bytes. Metadata is attached to a blob by associating the blob with a data type from the schema, then cryptographically signing the result. Subsequently, an application can alter the attributes of a blob by creating a new signed schema blob (called a "claim"). For any blob, then, all of the claims on it are saved in the data store and can be replayed or backed up at will. That way, stored objects are mutable, but the changes to them are non-destructive. The current state of an object is the application of all of the claims associated with a blob, applied in the order of their timestamps.
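A simplified sketch of the claim-replay idea follows; the claim fields used here are invented for illustration and do not follow Camlistore's actual schema:

    # Reconstruct an object's current state by replaying its signed claims
    # in timestamp order; each claim here just sets one attribute.
    claims = [
        {"ts": 3, "attr": "title", "value": "Vacation 2014"},
        {"ts": 1, "attr": "title", "value": "Untitled"},
        {"ts": 2, "attr": "tag",   "value": "travel"},
    ]

    def current_state(claims):
        state = {}
        for claim in sorted(claims, key=lambda c: c["ts"]):
            state[claim["attr"]] = claim["value"]
        return state

    print(current_state(claims))
    # {'title': 'Vacation 2014', 'tag': 'travel'}: later claims win, but
    # the earlier ones remain stored, so changes are non-destructive.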
This storage architecture allows for, potentially, a wide variety of front-end clients. Index servers already exist that use SQLite, LevelDB, MySQL, PostgreSQL, MongoDB, and Google App Engine's data store to manage the indexed blobs. Since an index server is logically separate from the blob servers that it indexes, it is possible to run an index on a portable device that sports little built-in storage, and still be able to transparently access all of the content maintained in the remote storage locations. In addition, Camlistore has the concept of a "graph sync," in which only a subset of the total blob storage is synchronized to a particular device. While full synchronization is useful to preserve the data in the event that a web service like Amazon S3 unexpectedly becomes unreachable, there are certainly many scenarios when it makes sense to keep only some of the data on hand.
As far as using the blob storage is concerned, at present Camlistore only implements two models: the basic storage/search/retrieval approach one would use to manage the entire collection, and directly sharing a particular item with another user. By default, each Camlistore server is private to a single user; users can share an object by generating a signed assertion that another user is permitted to access the object. This signed assertion is just one more type of claim for the underlying blob in the database. Several user-authentication options are supported, but for now the recipient of the share needs to have an account on the originating Camlistore system.
It may be a while before Camlistore is capable of serving as a storage layer for a blog, photo-hosting site, or other web service, but when it is ready, it will bring some interesting security properties with it. As mentioned, all claims on items in the database are signed—using GPG keys. That not only allows for verification of important operations (like altering the metadata of a blob), but it means it would be possible to perform identity checks for common operations like leaving comments. Camlistore does have some significant competition from other decentralized storage projects, Tahoe-LAFS included, but it will be an interesting project to watch.
Brief items
Security quotes of the week
The depressing part of this is that there's no reason to believe that Panasonic are especially bad here - especially since a large number of vendors are shipping much the same Mediatek code, and so probably have similar (if not identical) issues. The future is made up of network-connected appliances that are using your electricity to mine somebody else's Dogecoin. Our nightmarish dystopia may be stranger than expected.
OpenSSL code beyond repair, claims creator of “LibreSSL” fork (Ars Technica)
Ars Technica takes a look at the LibreSSL fork of OpenSSL created by the OpenBSD project. "The decision to fork OpenSSL is bound to be controversial given that OpenSSL powers hundreds of thousands of Web servers. When asked why he wanted to start over instead of helping to make OpenSSL better, de Raadt said the existing code is too much of a mess. "Our group removed half of the OpenSSL source tree in a week. It was discarded leftovers," de Raadt told Ars in an e-mail. "The Open Source model depends [on] people being able to read the code. It depends on clarity. That is not a clear code base, because their community does not appear to care about clarity. Obviously, when such cruft builds up, there is a cultural gap. I did not make this decision... in our larger development group, it made itself.""
New vulnerabilities
cacti: multiple vulnerabilities
Package(s): cacti
CVE #(s): CVE-2014-2708 CVE-2014-2709 CVE-2014-2326 CVE-2014-2328 CVE-2014-2327
Created: April 17, 2014
Updated: June 30, 2014
Description: From the Red Hat bugzilla entries [1, 2]:

CVE-2014-2708 is for the SQL injection issues in graph_xport.php. CVE-2014-2709 is for the shell escaping issues in lib/rrd.php.

A posting to bugtraq from Deutsche Telekom noted multiple flaws in Cacti 0.8.7g:

CVE-2014-2326 (stored XSS): "The Cacti application is susceptible to stored XSS attacks. This is mainly the result of improper output encoding."

CVE-2014-2327 (missing CSRF token): "The Cacti application does not implement any CSRF tokens. More about CSRF attacks, risks and mitigations see https://www.owasp.org/index.php/Cross-Site_Request_Forgery_(CSRF). This attack has a vast impact on the security of the Cacti application, as multiple configuration parameters can be changed using a CSRF attack. One very critical attack vector is the modification of several binary files in the Cacti configuration, which may then be executed on the server. This results in full compromise of the Cacti host by just clicking a web link. A proof of concept exploit has been developed, which allows this attack, resulting in full (system level) access of the Cacti system. Further attack scenarios include the modification of the Cacti configuration and adding arbitrary (admin) users to the application."

CVE-2014-2328 (use of exec-like function calls without safety checks allows arbitrary command execution): "Cacti makes use of exec-like method PHP function calls, which execute command shell code without any safety checks in place. In combination with a CSRF weakness this can be triggered without the knowledge of the Cacti user. Also, for more elaborate attacks, this can be combined with a XSS attack. Such an attack will result in full system (Cacti host) access without any interaction or knowledge of the Cacti admin."
java: three unspecified vulnerabilities
Package(s): java-1.7.0-oracle
CVE #(s): CVE-2014-0432 CVE-2014-0448 CVE-2014-2422
Created: April 17, 2014
Updated: May 14, 2014
Description: Yet again more unspecified Java vulnerabilities.
java: multiple unspecified vulnerabilities
Package(s): java-1.6.0-sun
CVE #(s): CVE-2014-0449 CVE-2014-2401 CVE-2014-2409 CVE-2014-2420 CVE-2014-2428
Created: April 17, 2014
Updated: June 3, 2014
Description: More in a long series of unspecified Java vulnerabilities.
kernel: privilege escalation
Package(s): kernel
CVE #(s): CVE-2014-2851
Created: April 18, 2014
Updated: May 6, 2014
Description: From the CVE entry:

Integer overflow in the ping_init_sock function in net/ipv4/ping.c in the Linux kernel through 3.14.1 allows local users to cause a denial of service (use-after-free and system crash) or possibly gain privileges via a crafted application that leverages an improperly managed reference counter.
kernel: denial of service
Package(s): kernel
CVE #(s): CVE-2014-0155
Created: April 21, 2014
Updated: May 6, 2014
Description: From the CVE entry:

The ioapic_deliver function in virt/kvm/ioapic.c in the Linux kernel through 3.14.1 does not properly validate the kvm_irq_delivery_to_apic return value, which allows guest OS users to cause a denial of service (host OS crash) via a crafted entry in the redirection table of an I/O APIC. NOTE: the affected code was moved to the ioapic_service function before the vulnerability was announced.
mysql: multiple unspecified vulnerabilities
Package(s): mysql-5.5
CVE #(s): CVE-2014-0384 CVE-2014-2419 CVE-2014-2430 CVE-2014-2431 CVE-2014-2432 CVE-2014-2436 CVE-2014-2438 CVE-2014-2440
Created: April 23, 2014
Updated: July 24, 2014
Description: From the CVE entries:

Unspecified vulnerability in the MySQL Server component in Oracle MySQL 5.5.35 and earlier and 5.6.15 and earlier allows remote authenticated users to affect availability via vectors related to XML. (CVE-2014-0384)

Unspecified vulnerability in Oracle MySQL Server 5.5.35 and earlier and 5.6.15 and earlier allows remote authenticated users to affect availability via unknown vectors related to Partition. (CVE-2014-2419)

Unspecified vulnerability in Oracle MySQL Server 5.5.36 and earlier and 5.6.16 and earlier allows remote authenticated users to affect availability via unknown vectors related to Performance Schema. (CVE-2014-2430)

Unspecified vulnerability in Oracle MySQL Server 5.5.36 and earlier and 5.6.16 and earlier allows remote attackers to affect availability via unknown vectors related to Options. (CVE-2014-2431)

Unspecified vulnerability Oracle the MySQL Server component 5.5.35 and earlier and 5.6.15 and earlier allows remote authenticated users to affect availability via unknown vectors related to Federated. (CVE-2014-2432)

Unspecified vulnerability in Oracle MySQL Server 5.5.36 and earlier and 5.6.16 and earlier allows remote authenticated users to affect confidentiality, integrity, and availability via vectors related to RBR. (CVE-2014-2436)

Unspecified vulnerability in Oracle MySQL Server 5.5.35 and earlier and 5.6.15 and earlier allows remote authenticated users to affect availability via unknown vectors related to Replication. (CVE-2014-2438)

Unspecified vulnerability in the MySQL Client component in Oracle MySQL 5.5.36 and earlier and 5.6.16 and earlier allows remote attackers to affect confidentiality, integrity, and availability via unknown vectors. (CVE-2014-2440)
openshift-origin-broker: authentication bypass
Package(s): openshift-origin-broker
CVE #(s): CVE-2014-0188
Created: April 23, 2014
Updated: April 23, 2014
Description: From the Red Hat advisory:

A flaw was found in the way openshift-origin-broker handled authentication requests via the remote user authentication plug-in. A remote attacker able to submit a request to openshift-origin-broker could set the X-Remote-User header, and send the request to a passthrough trigger, resulting in a bypass of the authentication checks to gain access to any OpenShift user account on the system.
openssl: denial of service
Package(s): openssl
CVE #(s): CVE-2010-5298
Created: April 18, 2014
Updated: July 24, 2014
Description: From the Debian advisory: A read buffer can be freed even when it still contains data that is used later on, leading to a use-after-free. Given a race condition in a multi-threaded application it may permit an attacker to inject data from one connection into another or cause denial of service.
otrs: cross-site scripting
Package(s): otrs
CVE #(s): CVE-2014-2553 CVE-2014-2554
Created: April 22, 2014
Updated: June 10, 2014
Description: From the SUSE bug report:

Cross-site scripting (XSS) vulnerability in Open Ticket Request System (OTRS) 3.1.x before 3.1.21, 3.2.x before 3.2.16, and 3.3.x before 3.3.6 allows remote authenticated users to inject arbitrary web script or HTML via vectors related to dynamic fields.
python-django: multiple vulnerabilities
Package(s): python-django
CVE #(s): CVE-2014-0472 CVE-2014-0473 CVE-2014-0474
Created: April 22, 2014
Updated: May 5, 2014
Description: From the Ubuntu advisory:

Benjamin Bach discovered that Django incorrectly handled dotted Python paths when using the reverse() function. An attacker could use this issue to cause Django to import arbitrary modules from the Python path, resulting in possible code execution. (CVE-2014-0472)

Paul McMillan discovered that Django incorrectly cached certain pages that contained CSRF cookies. An attacker could possibly use this flaw to obtain a valid cookie and perform attacks which bypass the CSRF restrictions. (CVE-2014-0473)

Michael Koziarski discovered that Django did not always perform explicit conversion of certain fields when using a MySQL database. An attacker could possibly use this issue to obtain unexpected results. (CVE-2014-0474)
python-django-horizon: cross-site scripting
Package(s): python-django-horizon
CVE #(s): CVE-2014-0157
Created: April 23, 2014
Updated: May 30, 2014
Description: From the CVE entry:

Cross-site scripting (XSS) vulnerability in the Horizon Orchestration dashboard in OpenStack Dashboard (aka Horizon) 2013.2 before 2013.2.4 and icehouse before icehouse-rc2 allows remote attackers to inject arbitrary web script or HTML via the description field of a Heat template.
qemu: code execution
Package(s): qemu
CVE #(s): CVE-2014-0150
Created: April 18, 2014
Updated: December 12, 2014
Description: From the Debian advisory: Michael S. Tsirkin of Red Hat discovered a buffer overflow flaw in the way qemu processed MAC addresses table update requests from the guest. A privileged guest user could use this flaw to corrupt qemu process memory on the host, which could potentially result in arbitrary code execution on the host with the privileges of the qemu process.
qemu-kvm: multiple vulnerabilities
Package(s): qemu-kvm
CVE #(s): CVE-2014-0142 CVE-2014-0143 CVE-2014-0144 CVE-2014-0145 CVE-2014-0146 CVE-2014-0147 CVE-2014-0148
Created: April 23, 2014
Updated: April 23, 2014
Description: From the Red Hat advisory:

Multiple integer overflow, input validation, logic error, and buffer overflow flaws were discovered in various QEMU block drivers. An attacker able to modify a disk image file loaded by a guest could use these flaws to crash the guest, or corrupt QEMU process memory on the host, potentially resulting in arbitrary code execution on the host with the privileges of the QEMU process. (CVE-2014-0143, CVE-2014-0144, CVE-2014-0145, CVE-2014-0147)

A divide-by-zero flaw was found in the seek_to_sector() function of the parallels block driver in QEMU. An attacker able to modify a disk image file loaded by a guest could use this flaw to crash the guest. (CVE-2014-0142)

A NULL pointer dereference flaw was found in the QCOW2 block driver in QEMU. An attacker able to modify a disk image file loaded by a guest could use this flaw to crash the guest. (CVE-2014-0146)

It was found that the block driver for Hyper-V VHDX images did not correctly calculate BAT (Block Allocation Table) entries due to a missing bounds check. An attacker able to modify a disk image file loaded by a guest could use this flaw to crash the guest. (CVE-2014-0148)
rsync: denial of service
Package(s): rsync
CVE #(s): CVE-2014-2855
Created: April 18, 2014
Updated: March 29, 2015
Description: From the Mageia advisory: Ryan Finnie discovered that rsync 3.1.0 contains a denial of service issue when attempting to authenticate using a nonexistent username. A remote attacker could use this flaw to cause a denial of service via CPU consumption.
Page editor: Jake Edge
Kernel development
Brief items
Kernel release status
The current development kernel is 3.15-rc2, released on April 20. Linus said: "And on the seventh day the rc release rose again, in accordance with the Scriptures laid down at the kernel summit of the year 2004."
Stable updates: 3.13.11 came out on April 22. Greg has said that he will stop maintaining 3.13 at this point, but the Ubuntu kernel team has taken over support through April 2016.
Quotes of the week
In this repository, you will be able to compile your own kernel module, create a /dev/netcat device and pipe its output into an audio player.
ogg123 - < /dev/netcat
This repository contains the album's track data in source files, that (for complexity's sake) came from .ogg files that were encoded from .wav files that were created from .mp3 files that were encoded from the mastered .wav files which were generated from ProTools final mix .wav files that were created from 24-track analog tape.
Kernel development news
Ktap or BPF?
While the kernel's built-in tracing mechanisms have advanced considerably over the last few years, the kernel still lacks a DTrace-style dynamic tracing facility. In the last year we have seen the posting of two different approaches toward scriptable dynamic tracing: ktap and BPF tracing filters. Both work by embedding a virtual machine in the kernel to execute scripts, but the similarity ends there. Putting one virtual machine into the kernel for tracing is a hard sell; adding two of them is not really seen as an option by anybody involved. So, at some point, a decision will have to be made. A recent discussion on that topic gives some hints about the direction that decision could go.

The trigger for the discussion was the posting of a new version of the ktap patch set after a period of silence. While quite a bit of work has been done on ktap, little was done to address the concerns that kept ktap out of the 3.13 kernel. Ingo Molnar, who blocked the merging of ktap the last time around, was not pleased that progress had not been made on that front.
Virtual machines
There appear to be two specific points of argument that come up when the merits of ktap and BPF tracing filters are discussed. The first of those is, naturally, the question of introducing another virtual machine into the kernel. On this point, the discussion has shifted a bit, though, for a simple reason: while ktap needs its own virtual machine, the BPF engine is already in the mainline kernel, and it has been getting better.
BPF originally stood for "Berkeley packet filter"; it was used as a way to tell the kernel how to narrow down a stream of packets from a network interface when tools like tcpdump are in use. Over time, though, BPF has been used in other contexts, such as filtering access to system calls as part of the seccomp mechanism and in a number of packet classification subsystems. Alexei Starovoitov's tracing filters patch set simply allows this virtual engine to be used to select and process system events as well.
In 2011, BPF gained a just-in-time compiler that sped it up considerably. The 3.15 kernel takes this work further; it will feature a radically reworked (by Alexei) BPF engine that expands its functionality considerably. The new BPF offers the same virtual instruction set to user space, but those instructions are translated within the kernel into a format that is closer to what the hardware provides. The new format offers a number of advantages over the old, including ten registers instead of two, 64-bit registers, more efficient jump instructions, and a mechanism to allow kernel functions to be called from BPF programs. Needless to say, the additional capabilities have further reinforced BPF's position as the virtual machine of choice for an in-kernel dynamic tracing facility.
Thus, if ktap is to be accepted into the kernel, it almost certainly needs to be retargeted to the BPF virtual machine. Ktap author Jovi Zhangwei has expressed a willingness to consider making such a change, but he sees a number of shortcomings in BPF that would need to be resolved first. BPF as it currently exists does not support features needed by ktap, including access to global variables, timer-limited looping (or loops in general, since BPF disallows them by design), and more. Jovi also repeatedly complained about the BPF tracing filter design, which is oriented around attaching scripts to specific tracepoints; Jovi wants a more flexible mechanism that would allow attaching a single script to a range of tracepoints.
That last functionality should not be too hard to add. Most of the rest of Jovi's requests could probably be worked into BPF as well, especially if Jovi were to help to do the work. Alexei seems to be amenable to evolving BPF in ways that would enable it to better support ktap. The communication between the two developers appears to be difficult, though, with frequent misunderstandings being seen. At one point, Jovi concluded that Alexei was not interested in making the necessary changes to BPF; he responded by saying:
In truth, the situation need not be so grim, but there may be a need for an outside developer to come in and actually do the work to integrate ktap and BPF to show that it is possible. Thus far, volunteers to do this work have not made themselves known. And, in any case, there is another issue.
Scripting languages
Ktap is built on the Lua language, which offers a number of features (associative arrays, for example) that can be useful in dynamic tracing settings. Ingo, along with a few others, would rather see a language that looks more like C:
The BPF tracing filters patch uses a restricted version of the C language; Alexei has also provided backends for both GCC and LLVM to translate that language into something the BPF virtual machine can run. So, once again, the BPF approach appears to have a bit of an advantage here at the moment.
Unsurprisingly, Jovi feels differently about this issue; he sees the ktap language as being far simpler to work with. To support this claim, he provided this code from a BPF tracing filter example:
```c
void dropmon(struct bpf_context *ctx)
{
    void *loc;
    uint64_t *drop_cnt;

    loc = (void *)ctx->arg2;

    drop_cnt = bpf_table_lookup(ctx, 0, &loc);
    if (drop_cnt) {
        __sync_fetch_and_add(drop_cnt, 1);
    } else {
        uint64_t init = 0;
        bpf_table_update(ctx, 0, &loc, &init);
    }
}
```
This filter, he says, can be expressed this way in ktap:
```
var s = {}

trace skb:kfree_skb {
    s[arg2] += 1
}
```
Alexei concedes that ktap has a far less verbose source language, though he has reservations about the conciseness of the underlying bytecode. In any case, though, he (along with others) has suggested that, once there is agreement on which virtual machine is to be used, there could be any number of scripting languages supported in user space.
And that is roughly where the discussion wound down. There is a lot of interesting functionality to be found in ktap, but, the way things stand currently, it may well be that this code gets passed over in favor of an offering from a developer who is more willing to do what is needed to get the code upstream. That said, this discussion is far from resolved, and Jovi is not the only developer who is working on ktap. With the application of a bit of energy, it may yet be possible to get ktap's higher-level functionality into a condition where it could someday be merged.
Changing the default shared memory limits
The Linux kernel's System V shared-memory limit has, by default, been fixed at the same value since its inception. Although users can increase this limit, as the amount of memory expected by modern applications has risen over the years, the question has become whether or not it makes sense to simply increase the default setting—including the option of removing the limit altogether. But, as is often the case, it turns out that there are users who have come to expect the shared-memory limit to behave in a particular way, so altering it would produce unintended consequences. Thus, even though no one seems happy with the default setting as it is, how exactly to fix it is not simple.
System V–style shared memory (SHM) is commonly used as an interprocess communication resource; a set of cooperating processes (such as database sessions) can share a segment of memory up to the maximum size allowed by the operating system. That limit can be expressed in terms of bytes per shared segment (SHMMAX), and in terms of the total number of pages used for all SHM segments (SHMALL). On Linux, the default value of SHMMAX has always been set at 32MB, and the default value of SHMALL is defined as:
```c
#define SHMALL (SHMMAX/getpagesize()*(SHMMNI/16))
```
where SHMMNI is the maximum number of SHM segments—4096 by default—which in turn gives a default SHMALL of 2097152 pages. Though they have well-known defaults, both SHMMAX and SHMALL can be adjusted with sysctl. There is also a related parameter setting the minimum size of a shared segment (SHMMIN); unlike the other resource limits, it is set to one byte and cannot be changed.
While most users seem to agree that one byte is a reasonable minimum segment size, the same cannot be said for SHMMAX; 32MB does not go far for today's resource-hungry processes. In fact, it has been routine procedure for several years for users to increase SHMMAX on production systems, and it is standard practice to recommend increasing the limit for most of the popular applications that make use of SHM.
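As a concrete illustration of why raising the limit is routine, here is a minimal, hypothetical C sketch (not something taken from the discussion): on a kernel still using the 32MB default, the 64MB request below fails with EINVAL until an administrator raises kernel.shmmax.

```c
/*
 * A minimal sketch (hypothetical): request a 64MB System V segment.  With
 * the historical 32MB SHMMAX default, shmget() fails with EINVAL until the
 * limit is raised, for example via the kernel.shmmax sysctl.
 */
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    size_t size = 64UL * 1024 * 1024;            /* twice the default SHMMAX */

    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id < 0) {
        printf("shmget(%zu bytes) failed: %s\n", size, strerror(errno));
        return 1;
    }
    printf("created segment %d\n", id);
    shmctl(id, IPC_RMID, NULL);                  /* don't leave it lying around */
    return 0;
}
```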
Naturally, many in the community have speculated that it is high time to bump the limit up to some more suitable value, and on March 31, Davidlohr Bueso posted a patch that increased SHMMAX to 128MB. Bueso admitted that the size of the increase was an essentially arbitrary choice (a four-fold bump), but noted in the ensuing discussion that, in practice, users will probably prefer to make their own choice for SHMMAX as a percentage of the total system RAM; bumping up the default merely offers a more sensible starting point for contemporary hardware.
But Andrew Morton argued that increasing the size of the default parameter did not address the underlying issue—that users were frequently hitting what was, fundamentally, an artificial limit with no real reason behind it:
One way to make the problem go away forever would be to eliminate SHMMAX entirely, but as was pointed out in the discussion, administrators probably do want to be able to set some limit to ensure that no user creates a SHM segment that eats up all of the system memory. Motohiro Kosaki suggested setting the default to zero, to stand for "unlimited." Bueso then adopted that approach for the second version of his patch. Since SHMMIN is hardcoded to one, the reasoning goes, a SHMMAX of zero can never be mistaken for a valid setting: either it is the default ("unlimited") or it is the result of an overflow.
The updated patch also set the default value of SHMALL to zero—again representing "unlimited". But removing the limit on the total amount of SHM in this manner revealed a second wrinkle: as Manfred Spraul pointed out, setting SHMALL to zero is currently a move that system administrators (quite reasonably) use to disable SHM allocations entirely; the patch has the unwanted effect of completely reversing the outcome of that move—enabling unlimited SHM allocation.
Spraul subsequently wrote his own alternative patch set that attempts to avoid this issue by instead setting the defaults for SHMMAX and SHMALL to ULONG_MAX, which amounts to setting them to infinity. This solution is not without its risks, either; in particular, there are known cases where an application simply tries to increment the value of SHMMAX rather than setting it, which causes an overflow. The result would be that applications would encounter the wrong value for SHMMAX—most likely a value far smaller than they need, causing their SHM allocation attempts to fail.
Nevertheless, Bueso concurred that avoiding the reversal of behavior for manually setting SHMALL to zero was a good thing, and signed off on Spraul's approach. The latest version of Spraul's patch set attempts to avoid the overflow issue by using ULONG_MAX - 1L<<24 instead, but he admits that ultimately there is nothing preventing users from causing overflows when left to their own devices.
One final concern stemming from this change is that if a system implements no upper limits on SHM allocation, it will be possible for users to consume all of the available memory as SHM segments. If such a greedy allocation happens, however, the out-of-memory (OOM) killer will not be able to free that memory. The solution is for administrators to either enable the shm_rmid_forced option (which forces each SHM segment to be created with the IPC_RMID flag—guaranteeing that it is associated with at least one process, which in turn ensures that when the OOM killer kills the guilty process, the SHM segment vanishes with it) or to manually set SHM limits.
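For reference, the knob in question lives under /proc/sys/kernel. The hypothetical sketch below shows it being enabled from C purely for consistency with the other examples in this issue; an administrator would normally just run "sysctl kernel.shm_rmid_forced=1".

```c
/*
 * A minimal sketch (hypothetical): enable shm_rmid_forced by writing to its
 * procfs entry.  Requires root.
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/kernel/shm_rmid_forced", "w");

    if (f == NULL) {
        perror("shm_rmid_forced");
        return 1;
    }
    fputs("1\n", f);   /* segments now behave as if created with IPC_RMID */
    fclose(f);
    return 0;
}
```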
Since the desire to avoid manually configuring SHM limits was the original goal of the effort, it might seem as if the effort has come full circle. But, for the vast majority of users, removing the ancient defaults is a welcome improvement. Rogue users attempting to allocate all of the memory in a shared segment are at best an anomaly (and certainly something that administrators should stay on the lookout for), whereas the old default 32MB SHM size has long been problematic for a wide variety of users in need of shared memory.
Loopback NFS: theory and practice
The Linux NFS developers have long known that mounting an NFS filesystem onto the same host that is exporting it (sometimes referred to as a loopback or localhost NFS mount) can lead to deadlocks. Beyond one patch posted over ten years ago, little effort has been put into resolving the situation as no convincing use case was ever presented. Testing of the NFS implementation can certainly benefit from loopback mounts; this use probably triggered the mentioned patch. With that fix in place, the remaining deadlocks do take some effort to trigger, so the advice to testers was essentially "be careful and you should be safe".
For any other use case, it would seem that using a "bind" mount would provide a result that is indistinguishable from a loopback NFS mount. In short: if it hurts when you use a loopback NFS mount, then simply don't do that. However, a convincing use case recently came to light which motivated more thought on the issue. It led this author on an educational tour of the interaction between filesystems and memory management, and produced a recently posted patch set (replacing an earlier attempt) which removes most, and hopefully all, such deadlocks.
A simple cluster filesystem
That use case involves using NFS as the filesystem in a high-availability cluster where all hosts have shared access to the storage. For all nodes in the cluster to be able to access the storage equally, you need some sort of cluster filesystem like OCFS2, Ceph, or GlusterFS. If the cluster doesn't need particularly high levels of throughput and if the system administrator prefers to stick with known technology, NFS can provide a simple and tempting alternative.
To use NFS as a cluster filesystem, you mount the storage on an arbitrary node using a local filesystem (ext4, XFS, Btrfs, etc), export that filesystem using NFS, then mount the NFS filesystem on all other nodes. The node exporting the filesystem can make it appear in the local namespace in the desired location using bind mounts and no loopback NFS is needed — at least initially.
As this is a high-availability cluster, it must be able to survive the failure of any node, and particularly the failure of the node running the NFS server. When this happens, the cluster-management software can mount the filesystem somewhere else. The new owner of the filesystem can export it via NFS and take over the IP address of the failed host; all nodes will smoothly be able to access the shared storage again. All nodes, that is, except the node which has taken over as the NFS server.
The new NFS-serving node will still have the shared filesystem mounted via NFS and will now be accessing it as a loopback NFS mount. As such, it will be susceptible to all the deadlocks that have led us to recommend against loopback NFS mounts in the past. In this case, it is not possible to "simply use bind mounts" as the filesystem is already mounted, applications are already using it and have files open, etc. Unmounting that filesystem would require stopping those applications — an action which is clearly contrary to the high-availability goal.
This scenario is clearly useful, and clearly doesn't work. So what was previously a wishlist item, and quite far from the top of the list at that, has now become a bug that needs fixing.
Theory meets practice
The deadlocks that this scenario triggers generally involve a sequence of events like: (1) the NFS server tries to allocate memory, (2) the memory allocator then tries to free memory by writing some pages out to the filesystem via the NFS client, and (3) the NFS client waits for the NFS server to make some progress. My assumption had been that this deadlock was inevitable because the same memory manager was trying to serve two separate but competing users: the NFS client and the NFS server.
A possible fix might be to run the NFS server inside a virtual machine, and to give this VM a fixed and locked allocation of memory so there would not be any competition. This would work, but it is hardly the simple solution that our administrator was after and would likely present challenges in sizing the VM for optimal performance.
This is where I might have left things had not a colleague, Ulrich Schairer, presented me with a system which was deadlocking exactly as described and said, effectively, "It's broken, please fix". I reasoned that analyzing the deadlock would at least allow me to find a precise answer as to why it cannot work; it ended up leading to much more than that. After a sequence of patches and re-tests, it became clear that there were two classes of problem, each of which differed in important ways from the problem which was addressed 10 years ago. Trying to understand these problems led to an exploration of the nature and history of the various mechanisms already present in Linux to avoid memory-allocation deadlocks, as reported on last week.
With that context, it might seem that some manipulation of the __GFP_FS and/or PF_FSTRANS flags should allow the deadlock to be resolved. If we think of nfsd as simply being the lower levels of the NFS filesystem, then the deadlock involves a lower layer of a filesystem allocating memory and thus triggering writeout to that same filesystem. This is exactly the deadlock that __GFP_FS was designed to prevent, and, in fact, setting PF_FSTRANS in all nfsd threads did fix the deadlock that was the easiest to hit.
Further investigation revealed, as it often does, that reality is sometimes more complex than theory might suggest. Using the __GFP_FS infrastructure, either directly or through PF_FSTRANS, turns out to be neither sufficient, nor desirable, as a solution to the problems with loopback NFS mounts. The remainder of this article explores why it is not sufficient and next week we will conclude with an explanation of why it isn't even desirable.
A pivotal patch
Central to understanding both sides of this problem is a change that happened in Linux 3.2. This change was authored by my colleague Mel Gorman who fortunately sits just on the other side of the Internet from me and has greatly helped my understanding of some of these issues (and provided valuable review of early versions of this article). This patch series changed the interplay between memory reclaim and filesystem writeout in a way that, while not actually invalidating __GFP_FS, changed its importance.
Prior to 3.2, one of the several strategies that memory reclaim would follow was to initiate writeout of any dirty filesystem pages that it found. Writing a dirty page's contents to persistent storage is an obvious requirement before the page itself can be freed, so it would seem to make sense to do it while looking for pages to free. Unfortunately, it had some serious negative side effects.
One side effect was the amount of kernel stack space that could be used. The writepage() function in some filesystems can be quite complex and, as a result, can quite reasonably use a lot of stack space. If a memory allocation request was made in some unrelated code that also used a lot of stack space, then the fact that memory allocation led directly to memory reclaim and, from there, to filesystem writeout, meant that two heavy users of stack space could be joined together, significantly increasing the total amount of stack space that could be required. In some cases, the amount of space needed could exceed the size of the kernel stack.
Another side effect is that pages could be written out in an unpredictable order. Filesystems tend to be optimized to write pages out in the order they appear in the file, first page first. This allows space on the storage device to be allocated optimally and allows multiple pages to be easily grouped into fewer, larger writes. When multiple processes are each trying to reclaim memory, and each is writing out any dirty pages it finds, the result is somewhat less orderly than we might like.
Hence the change in Linux 3.2 removed writeout from direct reclaim, leaving it to be done by kswapd or the various filesystem writeback threads. In such a complex system as Linux memory management, a little change like that should be expected to have significant follow-on effects, and the patch mentioned above was really just the first of a short series which made the main change and then made some adjustments to restore balance. The particular adjustment which interests us was to add a small delay during reclaim.
Waiting for writeout
The writeout code that was removed would normally avoid submitting a write if doing so might block. This can happen if the block I/O request queue is full and the submission needs to wait for a free slot; it can be avoided by checking if the backing device is "congested". However, if the process that is allocating memory is in the middle of writing to a file on a particular device, and the memory reclaim code finds a dirty page that can be written to that same device, then it skips the congestion test and, thus, it may well block. This has the effect of slowing down any process writing to a device to match the speed of the device itself and is important for keeping balance in memory management.
With the change so that direct reclaim would no longer write out dirty file pages, this delay no longer happened (though the backing_dev_info field of the task structure which enabled the delay is still present with no useful purpose). In its place, we get an explicit small delay if all the dirty pages looked at are waiting for a congested backing device. This delay causes problems for loopback NFS mounts. In contrast to the implicit delay present before Linux 3.2, this delay is not avoided by clearing __GFP_FS. This is why using __GFP_FS or PF_FSTRANS is not sufficient.
Understanding this problem requires an understanding of the "backing device" object, an abstraction within Linux that holds some important information about the storage device underlying a filesystem. This information includes the recommended read-ahead size and the request queue length — and also whether the device is congested or not. For local filesystems struct backing_dev_info maps directly to the underlying block device (though, for Btrfs, which can have multiple block devices, there are extra challenges). For NFS, the queue in this structure is a list of requests to be sent to the NFS server rather than to a physical device. When this queue reaches a predefined size, the backing device for the NFS filesystem will be designated as "congested".
If the backing device for a loopback-mounted NFS filesystem ever gets congested while memory is tight, we have a problem. As nfsd tries to allocate pages to execute write requests, it will periodically enter reclaim and, as the NFS backing device is congested, it will be forced to sleep for 100ms. This delay will slow NFS throughput down to several kilobytes per second and so will ensure that the NFS backing device remains congested. This does not actually result in a deadlock as forward progress is achieved, but it is a livelock resulting in severely reduced throughput, which is nearly as bad.
This situation is very specific to our NFS scenario, as the problem is caused by a backing device writing into the page cache. It is not really a general filesystem recursion issue, so it is not the same sort of problem that might be addressed with suitable use of __GFP_FS.
Learning from history
This issue is, however, similar to the problem from ten years ago that was fixed by the patch mentioned in the introduction. In that case, the problem was that a process which was dirtying pages would be slowed down until a matching number of dirty pages had been written out. When this happened, nfsd could end up being blocked until nfsd had written out some pages, thus producing a deadlock. In our present case, the delay happens when reclaiming memory rather than when dirtying memory, and the delay has an upper limit of 100ms, but otherwise it is a similar problem.
The solution back then was to add a per-process flag called PF_LESS_THROTTLE, which was set only for nfsd threads. This flag increased the threshold at which the process would be slowed down (or "throttled") and so broke the deadlock.
There are two important ideas to be seen in that patch: use a per-process flag, and do not remove the throttling completely but relax it just enough to avoid the deadlock. If nfsd were not throttled at all when dirtying pages, that would just cause other problems.
With our 100ms delay, it is easy to add a test for the same per-process flag, but the sense in which the delay should only be partially removed is somewhat less obvious.
The problem occurs when nfsd is writing to a local filesystem, but the NFS queue is congested. nfsd should probably still be throttled when that local filesystem is congested, but not when the NFS queue is congested. If other queues are congested, it probably doesn't matter very much whether nfsd is throttled or not, though there is probably a small preference in favor of throttling.
As the backing_dev_info field of the task structure was (fortuitously) not removed when direct-reclaim writeback was removed in 3.2, we can easily use PF_LESS_THROTTLE to avoid the delay in cases where current->backing_dev_info (i.e. the backing device that nfsd is writing to) is not congested. This may not be completely ideal, but it is simple and meets the key requirements, so should be safe ... providing it doesn't upset other users of PF_LESS_THROTTLE.
Though PF_LESS_THROTTLE has only ever been used in nfsd, there have been various patches proposed between 2005 and 2013 adding the flag to the writeback process used by the loop block device, which makes a regular file behave like a block device. This process is in exactly the same situation as nfsd: it implements a backing device by writing into the page cache. As such, it can be expected to face exactly the same problems as described above and would equally benefit from having PF_LESS_THROTTLE set and having that flag bypass the 100ms delay. It is probably only a matter of time before some patch to add PF_LESS_THROTTLE to loop devices will be accepted.
There are three other places where direct reclaim can be throttled. The first is the function throttle_direct_reclaim(), which was added in 3.6 as part of swap-over-NFS support. This throttling is explicitly disabled for any kernel threads (i.e. processes with no user-space component). As both nfsd and the loop device thread are kernel threads, this function cannot affect users of PF_LESS_THROTTLE and so need not concern us.
The other two are in shrink_inactive_list() (the same function which holds the primary source of our present pain). The first of these repeatedly calls congestion_wait() until there aren't too many processes reclaiming memory at the same time, as this can upset some heuristics. This has previously led to a deadlock that was fixed by avoiding the delay whenever __GFP_FS or __GFP_IO was cleared. Further discussion of this will be left to next time, when we examine the use of __GFP_FS more closely.
The last delay is near the end of shrink_inactive_list(); it adds an extra delay (via congestion_wait() again) when it appears that the flusher threads are struggling to make progress. While a livelock triggered by this delay has not been seen in testing, it is conceivable that the flusher thread could block when the NFS queue is congested; that could lead to nfsd suffering this delay as well and so keeping the queue congested. Avoiding this delay in the same conditions as the other delay seems advisable.
One down, one to go
With the livelocks under control, not only for loopback NFS mounts but potentially for the loop block device as well, we only need to deal with one remaining deadlock. As we found with this first problem, the actual change required will be rather small. The effort to understand and justify that change, which will be explored next week, will be somewhat more substantial.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Distributions
Fedora's firewall furor
For many Fedora users, one of the first steps taken after installing a new machine is to turn off SELinux — though SELinux has indeed become less obstructive over time. In a similar vein, many users end up turning off the network firewall that Fedora sets up by default. The firewall has a discouraging tendency to get in the way as soon as one moves beyond simple web-browsing and email applications and, while it is possible to tweak the firewall to make a specific application work, few users have the knowledge or the patience to do that. So they just make the whole thing go away. That's why it has been proposed to turn off the firewall by default for the Fedora 21 Workstation product. While the proposal has been rejected, for now anyway, the reasons behind it remain. The change proposal reads this way:
As one might imagine, this proposal led to a lengthy discussion once it hit the Fedora development list. To those who oppose the proposal, it is seen as an admission of defeat: firewalling is too hard, so Fedora's developers are simply giving up and, in the process, making the system less secure and repeating mistakes made by other operating systems in the past. Others, though, question the level of real-world security provided by the firewall, especially when many users just end up turning it off anyway. But, behind all the noise, there actually appears to be a sort of consensus on how things should actually work; it's just that nobody really knows how to get to that point.
Nobody, obviously, wants a Fedora system to be insecure by default (unless perhaps they work for the NSA). But there is a desire for the applications installed by the user to simply work without the need for firewall tweaking. Beyond that, the set of services that should work should depend on the network environment that the system finds itself in. When the system is connected to a trusted network (at home, say), more services should be reachable than when it is connected to a dodgy airport wireless network. Fedora's firewalld tries to make the right thing happen in all cases, but the problem turns out to be hard.
Once the firewall is in place, any network service that is not explicitly allowed will not function properly. That is why the first attempt to set up something as simple as an SSH server on a Fedora system is usually doomed to failure. There are a couple of mechanisms that could address this problem, but they have issues of their own. The first possible solution is to provide an API to allow applications to request the opening of a hole in the firewall for a specific port. Firewalld already supports this API via D-Bus, but it is hard to see this API as a full solution to the problem for a couple of reasons:
- Firewalld is a Fedora-specific project, so its API, too, is naturally a Fedora-only affair. As a result, application developers are reluctant to include firewalld support in their code; it's an additional maintenance burden for a relatively small group of additional users.
- Users may not want an application to be able to silently perforate the firewall in all settings, especially if they are worried about malicious software running on their own systems.
A potential solution to the second problem (and, perhaps the first) is to put a mechanism in place to notice when an application is trying to listen on a firewalled port and ask the user if the port should be opened. The problem with this approach was nicely summarized by Daniel Walsh, who said: "Nothing worse than asking users security-related questions about opening firewall ports. Users will just answer yes, whether or not they are being hacked."
Even if they take the time to answer carefully, users will tend to get annoyed by security questions in short order. As a general rule, the "ask the user" approach tends not to work out well.
An alternative is to try to do the right thing depending on the network the system is connected to. On a trusted network, the firewall could allow almost all connections and services will just work. When connecting to a coffee-shop network, instead, a much tighter firewall would be in place. As it happens, firewalld was designed to facilitate this type of policy as well; it allows the placement of both networks and applications into "zones." When the system is on a network assigned to a specific zone, only applications which have been enabled for that zone will be able to open reachable ports to the world.
The current setup does not lack for zones; indeed, there are nine of them with names that vary from "trusted" to "external," "dmz," or "drop." As Matthias Clasen pointed out, this is far too many zones for most users to know what to do with, and there is no real information about what the differences between them are. Configuration is via a set of XML files; NetworkManager can put networks into zones if one digs far enough into the dialogs, but there is little help for users wanting to know what a specific zone means or how it can be changed.
There seems to be a rough consensus that, if firewalld had a more usable zones system, it could be left enabled by default. The move to disable the firewall is a clear statement that, in some minds at least, firewalld cannot be fixed in the Fedora 21 time frame. There is, however, one approach that might work: reducing the number of zones considerably. In fact, in a related discussion last February, Christian Schaller suggested that all the system needs by default is two zones: trusted and untrusted. When NetworkManager connects to a new network, it can ask the user whether that network is trusted or not and set the firewall accordingly.
This idea seemed to gain some favor in both discussions, but it is not clear that somebody will get around to actually making it work; that may need to change soon, though. On April 23, the Fedora Engineering Steering Committee discussed the proposal to disable the firewall and, with a five-to-two vote, rejected it. So the Fedora 21 workstation product will probably have a firewall by default, but how that firewall will work still needs to be figured out.
Brief items
Distribution quotes of the week
Debian 6.0 to get long-term support
The Debian project has announced that the security support period for the 6.0 ("squeeze") release has been extended by nearly two years; it now runs out in February 2016. At the end, squeeze will have received a full five years of security support. "squeeze-lts is only going to support i386 and amd64. If you're running a different architecture you need to upgrade to Debian 7 (wheezy). Also there are going to be a few packages which will not be supported in squeeze-lts (e.g. a few web-based applications which cannot be supported for five years). There will be a tool to detect such unsupported packages."
Ubuntu 14.04 LTS (Trusty Tahr) released
Ubuntu has announced the release of its latest long-term support distribution: Ubuntu 14.04 LTS (aka "Trusty Tahr"). The release notes have all the details. It comes in a multitude of configurations, for desktops, servers, the cloud, phones, and tablets; also in many flavors: Kubuntu, Edubuntu, Xubuntu, Lubuntu, Ubuntu GNOME, Ubuntu Kylin, and Ubuntu Studio. "Ubuntu 14.04 LTS is the first long-term support release with support for the new "arm64" architecture for 64-bit ARM systems, as well as the "ppc64el" architecture for little-endian 64-bit POWER systems. This release also includes several subtle but welcome improvements to Unity, AppArmor, and a host of other great software."
Newsletters and articles of interest
Distribution newsletters
- DistroWatch Weekly, Issue 555 (April 21)
- Five Things in Fedora This Week (April 22)
- Ubuntu Weekly Newsletter, Issue 364 (April 20)
Shuttleworth: U talking to me?
Ubuntu's Trusty Tahr has been released and that means it's time for a new development branch. Mark Shuttleworth has announced the name of the next Ubuntu release. "So bring your upstanding best to the table – or the forum – or the mailing list – and let’s make something amazing. Something unified and upright, something about which we can be universally proud. And since we’re getting that once-every-two-years chance to make fresh starts and dream unconstrained dreams about what the future should look like, we may as well go all out and give it a dreamlike name. Let’s get going on the utopic unicorn."
Emmabuntüs: A philanthropist’s GNU/Linux (muktware)
Muktware takes a quick look at Emmabuntüs. "Emmabuntüs is a desktop GNU/Linux distribution which originated in France with a humanitarian mission. It was designed with 4 primary objectives – refurbishing of computers given to humanitarian organizations like the Emmaüs communities, promoting GNU/Linux among beginners, extending the life of older equipments and reducing waste by over-consumption of raw materials."
Page editor: Rebecca Sobol
Development
Testing your full software stack with cwrap
Testing network applications correctly is hard. The biggest challenge is often to set up the environment to test a client/server application. One option is to set up several virtual machines or containers and run a full client/server interaction between them. But building this environment might not always be possible; for example, some build systems have no network at all and run as a non-privileged user. Also, for newcomers who want to contribute code to your project, setting up that kind of development environment is often a difficult and time-consuming task.
Reading and running the test cases is normally a good entry point toward understanding a project, because you learn how it is set up and how you need to use the API to achieve your goal. For these reasons, it would be preferable if there were a way to run the tests locally as a non-root user, while still being able to run in an environment as close to the real world as possible. Avoiding the testing of code that requires elevated privileges or networking is usually not an option, because many projects have a test-driven development model: to submit new code or to fix issues, a test case is required so that regressions are avoided.
The cwrap project
The cwrap project aims to help client/server software development teams that are trying to gain full functional test coverage. It makes it possible to run several instances of the full software stack on the same machine and perform local functional testing of complex network configurations. Daemons often require privilege separation and local user and group accounts, separate from the hosting system. The cwrap project does not require virtualization or root credentials and can be used on different operating systems.
It is basically like The Matrix, where reality is simulated and everything is a lie.
cwrap is a new project, but the ideas and the origin of the code come from the Samba codebase. cwrap exposes the internals of one of the most advanced FOSS testing systems, one that has helped Samba developers test their protocol implementations for many years. Samba is complex: it provides several server components that need to interact with each other, along with a client executable, a client library, and a testing suite called smbtorture. These need to be run against different server setups to test the correctness of the protocols and server components.
In trying to test your server, you may run into some problems. Your server might need to open privileged ports, which requires superuser access. If you need to run several instances of daemons for different tasks, then the setup becomes more complex. An example would be testing an SSH client with Kerberos: you need a KDC (key distribution center) and an SSH server. If you provide login or authentication functionality, user and group accounts on the system are required, which means each machine you run the tests on needs to have the same users. To be able to switch to a user after authentication, you have to be root in the first place. All these things make testing harder and the setup more complex.
What you actually want is to be able to run all required components on a single machine: the one a developer is working on. All tests should work as a normal non-privileged user. So what you really want is to just run make test and wait till all tests are finished.
The cwrap project enables you to set up such an environment easily by providing three libraries you can preload to any binary.
What is preloading?
Preloading is a feature of the dynamic linker (ld.so). It is available on most Unix systems and allows loading a user-specified shared library before all other shared libraries that are linked to an executable.
Library preloading is most commonly used when you need a custom version of a library function to be called. You might want to implement your own malloc(3) and free(3) functions that would perform rudimentary leak checking or memory access control for example, or you might want to extend the I/O calls to dump data when reverse engineering a binary blob. In those cases, the library to be preloaded would implement the functions you want to override. Only functions in dynamically loaded libraries can be overridden. You're not able to override a function the application implements by itself or links statically with. More details can be found in the man page of ld.so.
The wrappers use preloading to supply their own variants of several system or library calls suitable for unit testing of networked software or privilege separation. For example, the socket_wrapper includes its version of most of the standard API calls used to communicate over sockets. Its version routes the communication over local sockets.
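To make the mechanism concrete, here is a toy interposer (a hypothetical sketch with made-up file names, not socket_wrapper's actual code) that overrides connect(), logs the call, and then forwards it to the real implementation located with dlsym(RTLD_NEXT, ...):

```c
/*
 * toy.c: a hypothetical LD_PRELOAD interposer, not part of cwrap.
 * Build: gcc -shared -fPIC -o libtoy.so toy.c -ldl
 * Use:   LD_PRELOAD=./libtoy.so some_network_program
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/socket.h>

int connect(int fd, const struct sockaddr *addr, socklen_t len)
{
    /* Look up the next (real) connect() in the library search order. */
    static int (*real_connect)(int, const struct sockaddr *, socklen_t);

    if (real_connect == NULL)
        real_connect = (int (*)(int, const struct sockaddr *, socklen_t))
                       dlsym(RTLD_NEXT, "connect");

    fprintf(stderr, "toy preload: connect() on fd %d\n", fd);
    return real_connect(fd, addr, len);
}
```

socket_wrapper does essentially this for the whole socket API, except that instead of simply forwarding to the real functions it reroutes the traffic over Unix sockets under SOCKET_WRAPPER_DIR.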
The wrappers
cwrap consists of three different wrappers. Each of them implements a set of functions to fulfill a testing task. There is socket_wrapper, nss_wrapper and uid_wrapper.
socket_wrapper
This library redirects all network communication to happen over Unix sockets, emulating both IPv4 and IPv6 sockets and addresses. This allows you to start several daemons of a server component on the same machine without any conflicts. You are also able to simulate binding to privileged ports below port 1024, which normally requires root privileges. If you need to understand the packet flow to see what is happening on the wire, you can also capture the network traffic in pcap format and view it later with tools such as Wireshark.
The idea and the first incarnation of socket_wrapper came from Jelmer Vernooij in 2005. It made it possible to run the Samba torture suite against smbd in make test. From that point in time, we started to write more and more automated tests, and we needed more wrappers as the test setup became increasingly complex. With Samba 4.0, we needed to test the user and group management of an Active Directory server and make it simple for developers to do that. The technology has been in use and tested for a while now, but because the code was embedded in the Samba source tree, it wasn't possible to use it outside of the Samba code base. The cwrap project now makes this possible.
There are some features in development, like support for IP_PKTINFO in auxiliary messages of sendmsg() and recvmsg(). We would also like to add support for fd-passing with auxiliary messages soon, in order to implement and test some new features for the Samba DCERPC infrastructure.
Let's take a look at how socket_wrapper works on a single machine. Here is a demo you can run yourself after you have installed it:
```
# Open a console and create a directory for the unix sockets.
$ mktemp -d
/tmp/tmp.bQRELqDrhM

# Then start nc to listen for network traffic using the temporary directory.
$ LD_PRELOAD=libsocket_wrapper.so \
  SOCKET_WRAPPER_DIR=/tmp/tmp.bQRELqDrhM \
  SOCKET_WRAPPER_DEFAULT_IFACE=10 nc -v -l 127.0.0.10 7

# nc listens on 127.0.0.10 because it is specified on the command line
# and it corresponds to the SOCKET_WRAPPER_DEFAULT_IFACE value specified.

# Now open another console and start 'nc' as a client to connect to the server:
$ LD_PRELOAD=libsocket_wrapper.so \
  SOCKET_WRAPPER_DIR=/tmp/tmp.bQRELqDrhM \
  SOCKET_WRAPPER_DEFAULT_IFACE=100 \
  SOCKET_WRAPPER_PCAP_FILE=/tmp/sw.pcap nc -v 127.0.0.10 7

# (The client will use the address 127.0.0.100 when connecting to the server.)

# Now you can type 'Hello!' which will be sent to the server and should appear
# in the console output of the server.

# When you have finished, you can examine the network packet dump with
# "wireshark /tmp/sw.pcap"
```
nss_wrapper
There are projects that provide daemons needing to be able to create, modify, and delete Unix users. Others just switch user IDs to interact with the system on behalf of another user (e.g. a user space file server). To be able to test these, you need the privilege to modify the passwd and group files. With nss_wrapper it is possible to define your own passwd and group files which will be used by the software while it is under test.
If you have a client and server under test, they normally use functions to resolve network names to addresses (DNS) or vice versa. The nss_wrapper allows you to create a hosts file to set up name resolution for the addresses you use with socket_wrapper.
The user, group, and hosts functionality are all defined as wrappers around the Name Service Switch (NSS) API. The Name Service Switch is a modular system, used by most Unix systems, that allows you to fetch information from several databases (users, groups, hosts, and more) using loadable modules. The list and order of modules is configured in the file /etc/nsswitch.conf. Usually, the nsswitch.conf file contains the "files" module shipped with glibc that looks up users in /etc/passwd, groups in /etc/group, and hosts in /etc/hosts. But it's also possible to define additional sources of information by configuring third party modules — a good example might be looking up users from LDAP using nss_ldap.
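Because the interception happens at the NSS API level, it is not only getent that is redirected: any lookup the software under test makes through the C library sees the wrapper's files. Here is a minimal, hypothetical sketch; the user "bob" is assumed to exist only in a test passwd file like the one created in the example below.

```c
/* A hypothetical client of nss_wrapper, not from the project itself. */
#include <stdio.h>
#include <pwd.h>

int main(void)
{
    /* With nss_wrapper preloaded and NSS_WRAPPER_PASSWD pointing at a test
     * file containing "bob", this succeeds without touching /etc/passwd. */
    struct passwd *pw = getpwnam("bob");

    if (pw == NULL) {
        puts("no such user");
        return 1;
    }
    printf("uid=%u home=%s\n", (unsigned)pw->pw_uid, pw->pw_dir);
    return 0;
}
```

Run it as, for example, LD_PRELOAD=libnss_wrapper.so NSS_WRAPPER_PASSWD=passwd ./a.out, using the passwd file created below.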
Here is an example of using nss_wrapper to handle users and groups:
$ echo "bob:x:1000:1000:Bob Gecos:/home/test/bob:/bin/false" > passwd $ echo "root:x:65534:65532:Root user:/home/test/root:/bin/false" >> passwd $ echo "users:x:1000:" > group $ echo "root:x:65532:" >> group $ LD_PRELOAD=libnss_wrapper.so NSS_WRAPPER_PASSWD=passwd \ NSS_WRAPPER_GROUP=group getent passwd bob bob:x:1000:1000:Bob Gecos:/home/test/bob:/bin/falseThe following shows nss_wrapper faking the host name:
```
$ LD_PRELOAD=libnss_wrapper.so NSS_WRAPPER_HOSTNAME=test.example.org hostname
test.example.org
```
Here, nss_wrapper simulates host name resolution:
$ echo "fd00::5357:5faa test.cwrap.org" > hosts $ echo "127.0.0.170 test.cwrap.org" >> hosts # Now query ahostsv6 which returns only IPv6 addresses and # calls getaddrinfo() for each the entry. $ LD_PRELOAD="libnss_wrapper.so" NSS_WRAPPER_HOSTS=hosts \ getent ahostsv6 test.cwrap.org fd00::5357:5faa DGRAM test.cwrap.org fd00::5357:5faa STREAM test.cwrap.org
uid_wrapper
Some projects, such as a file server, need privilege separation to be able to switch to the user who owns the files and do file operations on their behalf. uid_wrapper convincingly lies to the application, letting it believe it is operating as root and even switching between UIDs and GIDs as needed. You can start any application making it believe it is running as root. We will demonstrate this later. You should keep in mind that you will not gain more permissions or privileges with uid_wrapper than you currently have; remember it is The Matrix.
Maybe you know that, on Linux, user and group IDs are actually per-thread attributes, and that glibc synchronizes changes across all threads: calling setuid(1000), for example, changes every thread to the given UID. The setuid(), setgid(), etc. functions send a signal to each thread, telling it that it should change the relevant ID; the signal handler in each thread then uses syscall() with the corresponding SYS_setXid constant to change the ID of the local thread. So, under glibc, if you want to change the UID only for the local thread, you have to make the system call directly:
```c
rc = syscall(SYS_setreuid, 1000, 0);
```
uid_wrapper has support for glibc's special privilege separation with threads. It intercepts calls to syscall() to handle the remapping of UIDs and GIDs. Here is an example of uid_wrapper in action:
```
$ LD_PRELOAD=libuid_wrapper.so UID_WRAPPER=1 UID_WRAPPER_ROOT=1 id
uid=0(root) gid=0(root) groups=100(users),0(root)
```
How are the wrappers tested?
You may sense a bit of a conflict of interest with wrappers. On one hand, this article stated that unit tests with wrappers strive to simulate the real-world environment as closely as possible. On the other hand, the wrappers substitute such fundamental calls as socket() and getpwnam(). It's paramount that the wrappers be extremely well tested so that you, as a user of the wrappers, are confident that any failure in testing implemented using the wrappers is a bug in the program under test and not an unwanted side effect of the wrappers. To this end, the wrappers include a large unit test suite that makes sure the wrappers function as intended. At the time of this writing, the code coverage is pretty high: nss_wrapper 79%, socket_wrapper 77%, and uid_wrapper 85%.
As an example of a unit test, the socket_wrapper implements a very simple echo server. The unit tests that exercise the read() or write() calls then connect to the echo server instance that is seemingly running on a privileged port. In fact, the echo server is run using socket_wrapper, so all communication is redirected over a local socket. You can inspect the unit test in the Samba repository. The CMakeLists.txt file also gives a good overview of how the tests are set up.
The wrappers leverage the cmocka unit testing framework that was covered in an earlier LWN article. In short, the cmocka library provides unit test developers with the ability to use mock objects. Moreover, the cmocka library has a very low dependency footprint; in fact, it requires only the standard C library.
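For those who have not seen cmocka, a test looks roughly like the following minimal, hypothetical example (it uses the group-runner API of current cmocka releases and is not one of the wrappers' actual tests):

```c
/*
 * A hypothetical cmocka test, not taken from the cwrap test suites.
 * Build with: gcc test_add.c -o test_add -lcmocka
 */
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <setjmp.h>
#include <cmocka.h>

/* The "code under test", inlined to keep the example self-contained. */
static int add(int a, int b)
{
    return a + b;
}

static void test_add(void **state)
{
    (void)state;                       /* no fixture needed here */
    assert_int_equal(add(2, 2), 4);
}

int main(void)
{
    const struct CMUnitTest tests[] = {
        cmocka_unit_test(test_add),
    };

    return cmocka_run_group_tests(tests, NULL, NULL);
}
```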
All the wrapper libraries are built using the cmake build system. In order to provide cwrap developers with an easy-to-use dashboard that displays the results of unit tests, an instance of the cdash dashboard is running and scheduling tests on several operating systems including several Linux distributions, FreeBSD, and OpenIndiana (descended from OpenSolaris). Currently the i686 and x86_64 architectures are tested. The dashboard is a one-stop view that lets you see if any of the unit tests has trouble or if compiling the wrappers or their unit tests yields any compiler errors or warnings.
Final thoughts
Regular LWN readers may have read about namespaces in Linux. These provide functionality similar to the lightweight virtualization mechanism known as containers. But to set up namespaces, you will often need root privileges. When distributions enable user namespaces, that requirement will go away, but there is another problem: namespaces are not available on BSD or Solaris.
Currently Samba is the only user of the cwrap libraries, since cwrap was not available for external consumption until recently. Andreas is currently working on cwrap integration to test libssh against an OpenSSH sshd server. We are also planning to improve the test environment of SSSD, but we haven't had time to work on it yet. At Red Hat, Quality Engineering has started to write tests for nss_ldap using nss_wrapper, but they are not upstream yet. If you plan to use cwrap, join us on the #cwrap IRC channel on Freenode.
Brief items
Quotes of the week
QEMU 2.0.0 released
The QEMU team has announced the release of version 2.0.0 of the QEMU "open source machine emulator and virtualizer". New features in the release include support for KVM on AArch64 (64-bit ARM) systems, support for all 64-bit ARMV8 instructions (other than the optional CRC and crypto extensions), support for the Allwinner A10-based cubieboard, CPU hotplug for Q35 x86 systems, better Windows guest performance when doing many floating-point or SIMD operations, live snapshot merging, new management interfaces for CPU and virtio-rng hotplug, direct access to NFSv3 shares using libnfs, and lots more. Detailed information about all of the changes can be found in the changelog.
ISC releases BIND 10 1.2, renames it, and turns it over to community
Internet Systems Consortium, the non-profit behind the BIND DNS server, has released version 1.2 of BIND 10, which is the last release it will make of the "applications framework for Internet infrastructure, such as DNS". That completes ISC's development effort on BIND 10, so it has renamed the project to Bundy and turned it over to the community for updates and maintenance.
"'BIND 10 is an excellent software system,' said Scott Mann, ISC's Vice President of Engineering, 'and a huge step forward in open-source infrastructure software. Unfortunately, we do not have the resources to continue development on both projects, and BIND 9 is much more widely used.' 'The BIND 10 software is open-source,' Scott added, 'so we are making it available for anyone who wants to continue its development. The source will be available from GitHub under the name Bundy, to mitigate the confusion between it and ISC's BIND 9 (a completely separate system). The name 'BIND' is associated with ISC; we have changed its name as a reminder that ISC is no longer involved with the project.'"
GCC 4.9.0 released
Version 4.9.0 of the GNU Compiler Collection is out. "GCC 4.9.0 is a major release containing substantial new functionality not available in GCC 4.8.x or previous GCC releases." The list of new features is indeed long; see the 4.9.0 release page for lots more information.
Linux Test Project released for April 2014
The stable test suite from the Linux Test Project has been updated for April 2014. Notable changes include 20 new syscall test cases, fixes for out-of-tree building and cross-compilation, and the rewrite of several scripts to run in shells other than bash.
Newsletters and articles
Development newsletters from the past week
- What's cooking in git.git (April 17)
- What's cooking in git.git (April 18)
- What's cooking in git.git (April 22)
- LLVM Weekly (April 21)
- OCaml Weekly News (April 22)
- OpenStack Community Weekly Newsletter (April 18)
- Perl Weekly (April 21)
- PostgreSQL Weekly News (April 20)
- Python Weekly (April 17)
- Ruby Weekly (April 17)
- Tor Weekly News (April 23)
Ars Technica: Tor network’s ranks of relay servers cut because of Heartbleed bug
Ars Technica reports on the impact that the "Heartbleed" bug in OpenSSL has had for the Tor anonymizing network. "The Tor Project team has been moving to provide patches for all of the components, and most of the core network was quickly secured. However, a significant percentage of the relay servers, many of which serve countries with heavy Internet censorship, have remained unpatched. These systems are operated by volunteers and may run unattended."
Faure: Freedesktop Summit 2014 Report
David Faure has a report on the Freedesktop Summit, which was held recently in Nuremberg. "The meeting also produced an agreement on the future of startup notification in the Wayland world. A protocol based on broadcast of D-Bus signals will be used instead of the current approach with X client messages. This approach is expected to integrate nicely with future frameworks for sandboxed applications. Improvements were also made to the protocol to allow for tab-based applications that make dynamic choices about creating a new tab or a new window depending on the workspace in which a document was opened."
[Editor's note: apologies to Ryan Lortie who wrote this article.]
Page editor: Nathan Willis
Announcements
Brief items
The Apache Software Foundation Announces 100 Million Downloads of Apache OpenOffice
The Apache Software Foundation has announced that Apache OpenOffice has been downloaded 100 million times. "Official downloads at openoffice.org are hosted by SourceForge, where users can also find repositories for more than 750 extensions and over 2,800 templates for OpenOffice."
Articles of interest
Plant Breeders Release First 'Open Source Seeds' (NPR)
NPR has a look at the cross-pollination of open source software and agriculture, resulting in the release of the first "Open Source Seeds". The new Open Source Seed Initiative was formed to put seeds, and, more importantly, their genetic material, into a protected commons, so they will be available in perpetuity. "At an event on the campus of the University of Wisconsin, Madison, backers of the new Open Source Seed Initiative will pass out 29 new varieties of 14 different crops, including carrots, kale, broccoli and quinoa. Anyone receiving the seeds must pledge not to restrict their use by means of patents, licenses or any other kind of intellectual property. In fact, any future plant that's derived from these open source seeds also has to remain freely available as well." (Thanks to Rich Brown.)
LAC14 interview series
Gabriel Nordeborn has started a series of interviews with people involved with the Linux Audio Conference, which will be held May 1-4 in Karlsruhe, Germany. As of this writing interviews with Miller Puckette, Robin Gareus, and Albert Graef are available.
New Books
Raspberry Pi, 2nd Edition--New from Pragmatic Bookshelf
Pragmatic Bookshelf has released "Raspberry Pi, 2nd Edition" by Maik Schmidt.
Calls for Presentations
CFP Deadlines: April 24, 2014 to June 23, 2014
The following listing of CFP deadlines is taken from the LWN.net CFP Calendar.
| Deadline | Event Dates | Event | Location |
|---|---|---|---|
| April 24 | October 6–October 8 | Operating Systems Design and Implementation | Broomfield, CO, USA |
| April 25 | August 1–August 3 | PyCon Australia | Brisbane, Australia |
| April 25 | August 18 | 7th Workshop on Cyber Security Experimentation and Test | San Diego, CA, USA |
| May 1 | July 14–July 16 | 2014 Ottawa Linux Symposium | Ottawa, Canada |
| May 1 | May 12–May 16 | Wireless Battle Mesh v7 | Leipzig, Germany |
| May 2 | August 20–August 22 | LinuxCon North America | Chicago, IL, USA |
| May 2 | August 20–August 22 | CloudOpen North America | Chicago, IL, USA |
| May 3 | May 17 | Debian/Ubuntu Community Conference - Italia | Cesena, Italy |
| May 4 | July 26–August 1 | Gnome Users and Developers Annual Conference | Strasbourg, France |
| May 9 | June 10–June 11 | Distro Recipes 2014 - canceled | Paris, France |
| May 12 | July 19–July 20 | Conference for Open Source Coders, Users and Promoters | Taipei, Taiwan |
| May 18 | September 6–September 12 | Akademy 2014 | Brno, Czech Republic |
| May 19 | September 5 | The OCaml Users and Developers Workshop | Gothenburg, Sweden |
| May 23 | August 23–August 24 | Free and Open Source Software Conference | St. Augustin (near Bonn), Germany |
| May 30 | September 17–September 19 | PostgresOpen 2014 | Chicago, IL, USA |
| June 6 | September 22–September 23 | Open Source Backup Conference | Köln, Germany |
| June 6 | June 10–June 12 | Ubuntu Online Summit 06-2014 | online, online |
| June 20 | August 18–August 19 | Linux Security Summit 2014 | Chicago, IL, USA |
If the CFP deadline for your event does not appear here, please tell us about it.
Upcoming Events
Events: April 24, 2014 to June 23, 2014
The following event listing is taken from the LWN.net Calendar.
| Date(s) | Event | Location |
|---|---|---|
| April 25–April 28 | openSUSE Conference 2014 | Dubrovnik, Croatia |
| April 26–April 27 | LinuxFest Northwest 2014 | Bellingham, WA, USA |
| April 29–May 1 | Embedded Linux Conference | San Jose, CA, USA |
| April 29–May 1 | Android Builders Summit | San Jose, CA, USA |
| May 1–May 4 | Linux Audio Conference 2014 | Karlsruhe, Germany |
| May 2–May 3 | LOPSA-EAST 2014 | New Brunswick, NJ, USA |
| May 8–May 10 | LinuxTag | Berlin, Germany |
| May 12–May 16 | Wireless Battle Mesh v7 | Leipzig, Germany |
| May 12–May 16 | OpenStack Summit | Atlanta, GA, USA |
| May 13–May 16 | Samba eXPerience | Göttingen, Germany |
| May 15–May 16 | ScilabTEC 2014 | Paris, France |
| May 17 | Debian/Ubuntu Community Conference - Italia | Cesena, Italy |
| May 20–May 24 | PGCon 2014 | Ottawa, Canada |
| May 20–May 21 | PyCon Sweden | Stockholm, Sweden |
| May 20–May 22 | LinuxCon Japan | Tokyo, Japan |
| May 21–May 22 | Solid 2014 | San Francisco, CA, USA |
| May 23–May 25 | FUDCon APAC 2014 | Beijing, China |
| May 23–May 25 | PyCon Italia | Florence, Italy |
| May 24 | MojoConf 2014 | Oslo, Norway |
| May 24–May 25 | GNOME.Asia Summit | Beijing, China |
| May 30 | SREcon14 | Santa Clara, CA, USA |
| June 2–June 3 | PyCon Russia 2014 | Ekaterinburg, Russia |
| June 2–June 4 | Tizen Developer Conference 2014 | San Francisco, CA, USA |
| June 9–June 10 | Erlang User Conference 2014 | Stockholm, Sweden |
| June 9–June 10 | DockerCon | San Francisco, CA, USA |
| June 10–June 12 | Ubuntu Online Summit 06-2014 | online, online |
| June 10–June 11 | Distro Recipes 2014 - canceled | Paris, France |
| June 13–June 14 | Texas Linux Fest 2014 | Austin, TX, USA |
| June 13–June 15 | State of the Map EU 2014 | Karlsruhe, Germany |
| June 13–June 15 | DjangoVillage | Orvieto, Italy |
| June 17–June 20 | 2014 USENIX Federated Conferences Week | Philadelphia, PA, USA |
| June 19–June 20 | USENIX Annual Technical Conference | Philadelphia, PA, USA |
| June 20–June 22 | SouthEast LinuxFest | Charlotte, NC, USA |
| June 21–June 28 | YAPC North America | Orlando, FL, USA |
| June 21–June 22 | AdaCamp Portland | Portland, OR, USA |
If your event does not appear here, please tell us about it.
Page editor: Rebecca Sobol