LWN.net Weekly Edition for January 4, 2013
LWN's 2013 Predictions
It's that time of year again: a new year has begun, so your editor must duly come up with a set of predictions that, by the end of the year, will look either embarrassing or painfully obvious — or both. It is a doomed exercise, but, then, we all need some comic relief sometimes. So, without further ado, here are a few thoughts on what can be expected this year.

In 2012, the shape of a solution for the UEFI secure boot problem came into view. In 2013, the full messiness of the secure boot situation will come to the fore. There are already reports of some secure-boot-enabled systems refusing to work right with the Linux "shim" solution; others will certainly come out. We are, after all, at the mercy of BIOS developers who only really care that Windows boots properly. We are also at the mercy of Microsoft, which could decide to take a more hostile stance at any time; there have already been snags in getting the Linux Foundation's bootloader signed.
UEFI secure boot bears watching, and we owe thanks to the developers who have been laboring to make Linux work in that environment. But the problem of locked-down systems is much larger — and much older — than UEFI secure boot, and many of the systems in question already run Linux. Maintaining access to "our" hardware will continue to be a problem this year, just like it has been in the past. "Runs Linux" will continue to mean something different than "the owner can run their own software on it," and UEFI secure boot does not really change the situation all that much, especially for non-x86 systems.
Maintaining access to our software will also be a problem; the usual hassles with software patents will continue, as will attempts to impose draconian surveillance laws, take over the management of the Internet, and so on. Such is life in the 21st century. One could hope that the patent issue, at least, would eventually get to the point where legislators feel the need to improve the situation even slightly; recent reports that a patent troll is shaking down companies for the novel act of using a scanner might suggest that we are getting closer. But the legal system excels at tolerating (and abetting) absurdity; don't expect any significant fixes in 2013.
The 3.12 kernel release will happen on November 20, 2013, or, at worst, by the beginning of December. The kernel development process has become a well-tuned machine with a highly predictable cycle; the longest cycle in 2012 (3.3) was only twelve days longer than the shortest (3.5). In the absence of significant externally imposed stress, it is hard to see anything changing that in 2013.
On the other hand, even your editor would not dream of trying to predict which features will be added in the 3.12 development cycle.
The community will continue to become less tolerant of unpleasant behavior from even its most prominent members. The history of free software resembles that of many frontier environments; at the outset there is a small set of explorers who work mostly below the radar. As the frontier is settled — free software becoming successful and often commercially driven — the small and young community begins to lose its "wild west" feel. In our rather larger and older community, the standards for behavior are becoming more stringent. At this point, almost nobody is seen as being so indispensable that we have to put up with them regardless of their behavior.
So, in 2013, we may well see more episodes where community members call out others for what they say or do and suggest that others refuse to associate with them. To an extent, that may lead to a more friendly and inclusive community environment. But a consensus on what constitutes acceptable behavior does not always exist. So we may also see energy going into personal fights that might be better directed toward more positive activities.
In a similar vein, recent resignations of GNU maintainers have made it clear that there is some disagreement within the organization on how decisions should be made. It may be time for a change of management in the Free Software Foundation and the GNU Project in 2013. Richard Stallman will certainly keep his place as the philosophical leader of both organizations, but it may become increasingly clear that his limited energies need not be absorbed by administrative matters. If these projects are to survive his eventual departure (nobody lasts forever), they would do well to better establish their processes and identities separate from their founder now.
"Vertical integration" will be heard often in 2013. There is no shortage of developers and companies asserting that the "distribution of a collection of disparate chunks of software" model is holding Linux back. Instead, they want to create a system following an overall design from top to bottom. The market performance of platforms like Android suggests that success may be found this way; now we have projects like GNOME OS, Firefox OS, and Ubuntu following similar paths. This kind of integration may well lead to a slicker result, but it also risks fragmenting the Linux world in a way that the traditional distributions did not.
This push toward vertical integration has generated a fair amount of conflict in our community; that will continue into 2013. Perhaps unsurprisingly, the strongest criticisms are reserved for those who are trying hardest to do this integration work as a community project rather than a company-controlled commercial product. As the Android developers have discovered, it can be a lot easier to just design a top-to-bottom system behind closed doors and release the result later. If we wish to minimize the fragmentation risk, we might want to engage more fully — and more constructively — with those who are doing their integration work in the open.
Some distribution will ship a release based on Wayland this year. It will be a painful experience for everybody involved. Wayland as a replacement for the venerable X Window System is certainly the future, but the future can sometimes take rather longer to arrive than one would expect.
Several new Linux-based platforms will ship on hardware in the coming year. We should see devices based on Firefox OS in some parts of the world. The Mer-based "Sailfish OS" may well be available in 2013. Samsung will likely ship Tizen-based handsets, the KDE-based "Vivaldi" tablet (or something descended from it) may actually ship, Ubuntu may show up on some mobile devices, and something totally unpredictable will probably materialize as well. There will be a wealth of interesting Linux-based choices, though they cannot all be expected to do well in the market.
Finally, LWN will celebrate its 15th anniversary on January 22. As we were thinking about what LWN might be in 1997, we could never have imagined that we would still be at it all these years later. Some people never learn, evidently. But, having come this far, we certainly don't plan to stop now. So, as we wish all of you a great 2013, we would also like to thank you for fifteen years of support, and all the years yet to come. We could not possibly ask for a better audience.
The 5.0 release of BlueZ
The BlueZ project, which provides the official Bluetooth support for Linux, unveiled version 5.0 on December 24, with the new major version number signifying changes in the API — changes found in several places in the Bluetooth stack. Only relatively recent kernels are supported due to reliance on some new kernel interfaces, several multimedia modules have been moved out to GStreamer or into separate projects, and there are new tools and supported profiles. The most significant changes are in the D-Bus API, however. Most (if not all) BlueZ applications will require a bit of porting, but the new interfaces should reduce work in the long run.
Interfaces
Linux 3.4 introduced a new Bluetooth Management kernel interface for interacting with Bluetooth adapters; BlueZ 5.0 now uses that interface to manage local hardware. The new interface lets the kernel manage all Host Controller Interface (HCI) traffic itself; under the 4.x stack, the kernel split that duty with user-space processes hooking into raw HCI sockets — with all of the synchronization and security problems that followed.
However, a recurring complication in the Bluetooth world is the standard's revisions and extensions. In particular, Bluetooth 4.0 added the Low Energy (LE) protocol designed to work with sensors and other Bluetooth devices capable of running off of small, "coin cell" batteries. Not only is the protocol different, but new adapter hardware is required on the host side as well. Support for Bluetooth LE adapters was not introduced until kernel 3.5; BlueZ 5.0 supports Bluetooth LE functionality only for this and subsequent kernels.
In user space, the most visible changes to working with BlueZ come from the D-Bus API. Version 5.0 has been rewritten to transform a number of previously custom interfaces into their standard D-Bus equivalents. For example, the BlueZ 4.x series provided its own methods for retrieving and setting an object's properties (GetProperties and SetProperty), and the PropertyChanged event to monitor an object for changes. In BlueZ 5.0, the standard D-Bus Properties interface is used instead.
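For instance, given a GDBusConnection named conn obtained with g_bus_get_sync() (a fuller example follows below), fetching an adapter property now goes through the standard org.freedesktop.DBus.Properties interface rather than a BlueZ-specific method. This is a sketch using GLib's GDBus API; the adapter object path is illustrative:

    /* Read the Powered property of an adapter via the standard D-Bus
     * Properties interface (BlueZ 4.x used a custom GetProperties). */
    GVariant *v = g_dbus_connection_call_sync(conn,
        "org.bluez", "/org/bluez/hci0",
        "org.freedesktop.DBus.Properties", "Get",
        g_variant_new("(ss)", "org.bluez.Adapter1", "Powered"),
        G_VARIANT_TYPE("(v)"), G_DBUS_CALL_FLAGS_NONE, -1, NULL, &err);
    if (v != NULL) {
        GVariant *inner;
        g_variant_get(v, "(v)", &inner);
        g_print("Powered: %d\n", g_variant_get_boolean(inner));
        g_variant_unref(inner);
        g_variant_unref(v);
    }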
Similarly, the generic D-Bus ObjectManager interface replaces BlueZ 4.x's custom interfaces for the tasks of adding, removing, and managing objects on the BlueZ D-Bus service. For the most part, this migration simply removes duplicate, BlueZ-only methods from the code; there are some significant changes as well, however. BlueZ 4.x had an org.bluez.Manager interface that enabled applications to find the attached Bluetooth adapters. BlueZ 5.0 has no equivalent interface, because D-Bus's generic ObjectManager provides an ObjectManager.GetManagedObjects method. This method returns every managed object in the BlueZ service, but, as the BlueZ 5.0 porting guide explains, applications need only scan through the returned objects to find an org.bluez.Adapter1 instance.
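As a concrete illustration, here is a minimal C sketch of that adapter scan using GLib's GDBus API; it follows the pattern described in the porting guide, though the program structure and error handling are simplified for the example:

    #include <gio/gio.h>

    int main(void)
    {
        GError *err = NULL;
        GDBusConnection *conn = g_bus_get_sync(G_BUS_TYPE_SYSTEM, NULL, &err);
        if (conn == NULL)
            g_error("connecting to system bus: %s", err->message);

        /* One call returns every object managed by the BlueZ service. */
        GVariant *reply = g_dbus_connection_call_sync(conn,
            "org.bluez", "/", "org.freedesktop.DBus.ObjectManager",
            "GetManagedObjects", NULL, G_VARIANT_TYPE("(a{oa{sa{sv}}})"),
            G_DBUS_CALL_FLAGS_NONE, -1, NULL, &err);
        if (reply == NULL)
            g_error("GetManagedObjects: %s", err->message);

        /* Scan the returned objects for an org.bluez.Adapter1 instance. */
        GVariantIter *objects;
        const gchar *path;
        GVariant *interfaces;
        g_variant_get(reply, "(a{oa{sa{sv}}})", &objects);
        while (g_variant_iter_loop(objects, "{&o@a{sa{sv}}}",
                                   &path, &interfaces)) {
            GVariant *adapter = g_variant_lookup_value(interfaces,
                                    "org.bluez.Adapter1", NULL);
            if (adapter != NULL) {
                g_print("Bluetooth adapter found at %s\n", path);
                g_variant_unref(adapter);
            }
        }
        g_variant_iter_free(objects);
        g_variant_unref(reply);
        g_object_unref(conn);
        return 0;
    }

The same reply also contains any org.bluez.Device1 objects the service knows about, so a single GetManagedObjects call can enumerate adapters and devices alike.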
The "1" appended to org.bluez.Adapter1 is another change debuting in BlueZ 5.0; it is the version number of the new Adapter interface. All of BlueZ's interfaces now have a version number attached, with the project committing to supporting the two most recent versions of each interface. Unfortunately, this backward-compatibility pledge does not apply retroactively; the BlueZ 4.x interfaces — and their predecessors — have been dropped.
Devices and profiles
The changes outlined above all concern interacting with the locally-attached Bluetooth host adapter. But, as much fun as that is, a single Bluetooth dongle is not particularly useful on its own — users will eventually want to pair or connect to other Bluetooth devices.
BlueZ 5.0 introduces new methods for discovering and connecting to Bluetooth devices, again retiring a number of custom methods in favor of simpler, more general D-Bus APIs. For example, the CreateDevice and CreatePairedDevice methods were used to explicitly instantiate devices in BlueZ 4.x, but they have been removed entirely. Instead, the ObjectManager automatically creates org.bluez.Device1 objects for each device discovered during a scan, and automatically removes unused objects every few minutes.
There is also a new general-purpose Device1.Connect method that applications can call to connect to a discovered device without knowing in advance which Bluetooth profiles the device supports. This simplifies the process considerably, particularly when dealing with devices that support multiple profiles (such as audio devices, which for example could serve as hands-free phone accessories or music headsets).
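Continuing the sketch above (and reusing its conn and err variables), a connection attempt is a single call on the device's object path; the path shown here is a hypothetical example of the usual form, since real paths come back from GetManagedObjects:

    /* Ask BlueZ to connect to a discovered device, letting it pick
     * among the profiles the device supports; a generous timeout is
     * sensible, since Bluetooth connection setup can be slow. */
    GVariant *ret = g_dbus_connection_call_sync(conn,
        "org.bluez", "/org/bluez/hci0/dev_00_11_22_33_44_55",
        "org.bluez.Device1", "Connect", NULL, NULL,
        G_DBUS_CALL_FLAGS_NONE, 30000 /* ms */, NULL, &err);
    if (ret != NULL)
        g_variant_unref(ret);
    else
        g_warning("Connect failed: %s", err->message);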
API changes are not all that the update provides, however. The new release adds support for five new profiles, all of them of the Bluetooth 4.0 LE variety. Cycling speedometers and heart-rate monitors are the newly-supported hardware device profiles, and new functional profiles support Bluetooth LE's alert-notification, scan-parameter, and HID-over-GATT services. The alert notification and scan parameter profiles are precisely what one would guess given a few minutes to brainstorm; the first is a framework for sending alert messages of various types (such as "new message" alerts sent from a computer to a Bluetooth watch), and the second is a framework for servers and wireless sensors to agree on connection and timing settings, which are important for Bluetooth LE's advertised power savings. The HID-over-GATT profile is a way for low-energy devices to serve as human interface devices (HIDs) over Bluetooth LE's Generic Attribute (GATT) profile.
With these additions, BlueZ's supported profiles number more than 25 (the list at the project's site is out of date at the moment, but a more complete list is found in the BlueZ git repository). The math can get a little fuzzy because some profiles are layered on top of other profiles, and it may make more sense to implement the higher-level profiles in other applications. For example, a 2011 Google Summer of Code project implemented the Phonebook Access Profile (PBAP) — which runs on top of the generic object exchange (OBEX) profile — in Evolution Data Server. BlueZ 5.0 pushes a few more profiles out of the BlueZ project itself; audio-video elements implementing the A2DP and AVRCP profiles were pushed upstream to GStreamer, and the telephony profiles HFP and HSP were dropped in favor of the external implementations in oFono.
Both the GStreamer and oFono profiles implement BlueZ's new org.bluez.Profile1 interface, which is designed for connecting to external processes that implement a Bluetooth profile. The interface expects some basic information common to all profiles, such as authentication behavior and service discovery protocol (SDP) records, but implementing the rest of the profile's behavior is up to the application. It is not clear how many external profile projects one could reasonably expect to appear, but BlueZ does still have a handful of unimplemented profiles to choose from (such as the blood pressure monitor profile BLP). Furthermore, it is certainly likely that the Bluetooth Special Interest Group (SIG) will come up with more profiles as Bluetooth LE continues to grow in popularity.
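A registration call might look like the following sketch; the handler's object path is hypothetical, the UUID is the standard Headset service UUID used purely as an example, and a real handler would also have to export the Profile1 methods (NewConnection, RequestDisconnection, and Release) at that path:

    /* Register a hypothetical handler object for the Headset UUID; an
     * empty options dictionary accepts BlueZ's defaults for the SDP
     * record, authentication requirements, and so on. */
    GVariantBuilder opts;
    g_variant_builder_init(&opts, G_VARIANT_TYPE("a{sv}"));

    GVariant *res = g_dbus_connection_call_sync(conn,
        "org.bluez", "/org/bluez",
        "org.bluez.ProfileManager1", "RegisterProfile",
        g_variant_new("(osa{sv})", "/example/headset",
                      "00001108-0000-1000-8000-00805f9b34fb", &opts),
        NULL, G_DBUS_CALL_FLAGS_NONE, -1, NULL, &err);
    if (res != NULL)
        g_variant_unref(res);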
The BlueZ project did a bit of refactoring in this release as well, moving its implementation of the Sub-band codec (SBC) into a standalone project, and adding some new tools for testing and monitoring Bluetooth functionality.
For end users, BlueZ 5.0's immediate gains are the LE protocol support and the new profiles. These will enable users to pair and use the latest generation of Bluetooth devices — at least, until the next revision of the Bluetooth specification is turned loose. For application developers, the simplified APIs are no doubt welcome news (particularly removing custom interfaces in favor of standardized D-Bus alternatives).
This is especially helpful when it comes to implementing new device profiles not yet supported by BlueZ itself — because Bluetooth products vary so much and new ones are constantly popping up without warning. One might well argue that Bluetooth's design makes life easier from a hardware maker's perspective by cramming as much intelligence into the protocol and the software stack as possible; Bluetooth devices are cheap, plentiful, and widely interoperable as a result. But the downside is increased complexity required of the operating system and application developer. BlueZ 5.0 manages to simplify things for Linux, which is a significant accomplishment — hopefully one that will bear fruit in the coming months.
Previewing digiKam 3.0.0
DigiKam, the KDE-based photo organization tool, is slated to release version 3.0.0 at the end of January (to coincide with the release of KDE 4.10). The digiKam project has a rapid release schedule — with four or more stable releases per year being typical in recent years — which can make staying up to date a challenge. But the circumstances of my personal photo collection dictated an upgrade, so I spent some time at the end of December working with the latest build. Although each release adds more features, digiKam is not quite a one-stop solution for photographers; where it excels is at the task of managing a large collection of images.
The "circumstances" I refer to are not unique to me; they are a common issue with digital photography and open source software. It started with the purchase of a new (as in just-released) DSLR camera. New hardware is an area where Linux photographers still suffer at the hands of the camera-makers, because the companies routinely change details in their cameras' raw photo formats. These companies typically provide their own Windows and OS X software for working with the new formats, but open source projects like dcraw and LibRaw must reverse-engineer those formats by examining donated sample images. Consequently, I was forced to use pre-release versions of various raw conversion applications in order to use the files produced by the new camera.
Working with pre-release editing software is not particularly painful, since daily builds and snapshots are standard practice for most of the projects — but then again, raw converters have the distinct advantage of being non-destructive, never touching the original file. It is riskier (at least to a degree) to use unstable code to manage the database of archived images, because data loss could occur and potentially harm many files at once. As a result, I ended 2012 with several thousand images that I had not imported into a digiKam collection. When the digiKam 3.0.0 release candidate was announced on December 29, I decided it was time to take the plunge (on a subset of the new images, to simplify matters).
DigiKam relies on a sizable number of external libraries, including several that are specific to KDE, so it can be an undertaking to compile if you do not run a recent version of KDE. That said, it does not demand the absolute latest versions of its KDE dependencies, so building it does not mean chasing bleeding-edge packages. In addition, digiKam does not impose nearly as many dependencies as some other KDE applications, so there is not significant overhead for those users running a GNOME or LXDE environment.
For the uninitiated, digiKam offers four major modes: the main collection manager, an image editor, a batch processor, and a "light table." The light table is a comparison tool designed to help you inspect multiple images in detail. In one sense it is just another view into the image collection, but it functions in its own window. In practice, a user would locate a set of images in the collection manager, then drag-and-drop them into the light table to zoom in and compare similar options side by side. The digiKam image editor is not as full-featured as any of the standalone Linux photo editors (such as Darktable or Rawstudio), and in particular it offers fewer features for working with raw image formats. Whether it meets any individual user's needs for a particular task is going to vary. The same could be said of the batch processor; most of the raw photo editors offer batch processing as well, and digiKam's batch support does not include every feature available in the editor, but it does include quite a few.
What's new in editing and export
In fact, the 3.0.0 release adds batch-processing support for several new effects and tools. Perhaps the most important is raw demosaicing, which allows the user a choice among multiple methods for transforming a raw image file's data into standard RGB pixels. Raw formats are in theory wrappers around minimally-processed sensor data, which often means that they incorporate different numbers of red, green, and blue pixels, arranged in specific patterns, and at unusual bit-depths. There is a never-ending disagreement on how best to interpolate this raw data into more traditional RGB triples; hence there are multiple demosaicing algorithms to choose from. The average user might not notice the difference, but sticklers for detail will enjoy the ability to choose.
Several other new batch-processing options have been added, including color transformation effects and cropping, both of which are more common tasks than choosing a custom raw demosaicing algorithm. It is also noteworthy that the batch processor was rewritten to parallelize image operations. As a result, processing a batch queue where several operations are performed on each file should be noticeably faster on multi-core machines by pipelining the images.
There is just one noteworthy addition to the image editing tools: automatic noise reduction, which was implemented as a student project during 2012's Summer of KDE program. This function uses digiKam's existing wavelet-based noise reduction feature, but it estimates the amount of noise in the open image and attempts to home in on an appropriate level of denoising that will not adversely soften important image features. This feature may prove useful for batch processing, when the user cannot spare the time to inspect every image file, but it is also handy to simply start off with a good guess at the amount of noise needing to be removed.
DigiKam offers users a variety of export options beyond the simple still image, and the 3.0.0 release adds to the list. First, the application's database can now import metadata tags for video files, and users can incorporate videos into "slideshow" output. That does stretch the definition of "slide," but since most digital cameras support recording video, it is undoubtedly a useful feature. Second, at some point in the recent past digiKam evidently lost the ability to export photos directly to KDE's wallpaper collection (the environment's Plasma framework allows for automatic wallpaper rotation and other fanciful effects); 3.0.0 restores this functionality. Last but certainly not least, digiKam can now act as a Digital Living Network Alliance (DLNA) media server, which enables users to discover and browse image collections by album on any DLNA-aware product, such as the many "smart" TVs on the market.
Import and management
With the image editing and export tools, it is certainly possible to work entirely within digiKam, but I have always preferred to make use of its collection management features and jump back and forth between an array of other applications for editing. The 3.0.0 release introduces only one change to the image management tools: the ability to pre-assign labels and tags to a set of images at the time they are imported from a camera into the collection. This is a time-saver; as memory card capacities increase, importing a card full of photos takes longer and longer. In my experience, the chief reason most people do not maintain good metadata about the location and subject of their photos is that the tools make it too difficult. Adding tag metadata at import time may not solve the problem entirely, but it helps — after all, the user is guaranteed to be present when attaching the camera and starting the import.
One of digiKam's biggest strengths is that it offers so much flexibility in where an image collection is stored and how it can be searched. Unlike most photo organizers, it can track images stored locally, on remote servers, and on removable media all in the same database. The search options include geotagging, time and date information, user-supplied text tags, a broad assortment of metadata options, and some more esoteric options like "image fingerprints." Fingerprints allow the user to sketch out an image, which is then compared against a mathematical decomposition of the images in the collection — it is essentially a find-by-visual-similarity search.
On the other hand, there is also one oft-requested feature that still has not made it into digiKam as of 3.0.0: face recognition. Face recognition is a tricky task, to be sure, but in digiKam's case the feature was a Google Summer of Code project that was started but dropped by the student before it was complete. Cynics might suggest that this is an inherent problem with the Summer of Code method of feature development, but the risk is hardly unique to it; after all, nothing prevents a non-student contributor from dropping out of a project either. You can still manually tag individuals in an image, which is a feature that the facial-recognition search will presumably hook into. On its own, though, tagging an individual in the People search tab does not offer any advantage over putting the person's name into a text tag.
The 3.0.0 release candidate offers a nice set of new features for digiKam users, but it is probably still wise for the average user to wait until the final release. In my limited test, the application failed to import and convert the existing, older version of digiKam's collection database to the new schema used by 3.0.0. Although there are workarounds, such as manually moving the database with the application's built-in database migration tool, corrupting the database of a hefty collection is a major problem. The automatic schema conversion problem has been reported, so it will hopefully be fixed before 3.0.0 is released. Once the kinks are worked out and the final release is available, it is certainly worth a look.
Security
Inferring TCP sequence numbers
Over the past few years, seemingly harmless internal kernel state that is available to user space has allowed attackers to gain useful information to facilitate their attacks. The TCP DelayedACKLost counter (exported to user space via the /proc/net/netstat virtual file) is yet another. It can be used to determine the sequence number of a TCP connection, which can lead to packet injection or connection hijacking.
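As an illustration of how little privilege is required, here is a small C sketch that reads the counter; since the column layout of /proc/net/netstat varies between kernels, it locates the field by name in the header line:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Return the TcpExt DelayedACKLost counter from /proc/net/netstat,
     * or -1 on any error.  The file alternates a line of counter names
     * with a line of the corresponding values. */
    long read_delayed_ack_lost(void)
    {
        FILE *f = fopen("/proc/net/netstat", "r");
        char names[4096], values[4096];
        long result = -1;

        if (f == NULL)
            return -1;
        while (fgets(names, sizeof(names), f) &&
               fgets(values, sizeof(values), f)) {
            if (strncmp(names, "TcpExt:", 7) != 0)
                continue;
            /* Walk the name and value lines in lockstep. */
            char *sn, *sv;
            char *n = strtok_r(names, " \n", &sn);
            char *v = strtok_r(values, " \n", &sv);
            while (n != NULL && v != NULL) {
                if (strcmp(n, "DelayedACKLost") == 0) {
                    result = atol(v);
                    break;
                }
                n = strtok_r(NULL, " \n", &sn);
                v = strtok_r(NULL, " \n", &sv);
            }
            break;
        }
        fclose(f);
        return result;
    }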
TCP sequence numbers have proven to be problematic over the years. Sequence numbers essentially count the number of octets transferred in each direction in a TCP connection. Knowing the sequence number, along with IP address and port number, allows an attacker to interfere with a connection by spoofing (or faking) packets that appear to have come from the other endpoint. Endpoints will reject packets that contain invalid sequence numbers, so those numbers must be known to an attacker. Originally, sequence numbers were easily predicted, so, in the mid-1990s, RFC 1948 specified that the initial sequence number (ISN) for each side of a TCP connection should be randomized.
Randomization should make sequence numbers difficult for attackers to guess, but various ways to infer the sequence numbers for a connection have come about over the years. In this case, researchers Zhiyun Qian, Z. Morley Mao, and Yinglian Xie discovered a way to quickly determine the sequence number for an open connection using the DelayedACKLost counter as a side channel. The researchers presented a paper [PDF] with their findings at the ACM Conference on Computer and Communications Security back in October. In addition, Qian posted a summary of the problem to the kernel netdev mailing list in December:
The problem is caused by the common TCP stats counters (the specific counter I found is DelayedACKLost) maintained by the kernel (but exposed to user space). By reading and reporting such counters to an external attacker (colluded), the aforementioned attack can be accomplished.
As described in their paper, the researchers found a way to use the DelayedACKLost counter as a reliable side channel to quickly narrow in on the sequence number. By having a client-side application relay the counter value to a collaborating host, which can then send probe packets to the client, a binary or N-way search can be done to determine the sequence number. In their test cases, the researchers were able to infer the sequence number using four to five round trips between the client application and its collaborator, which could complete in as little as 50ms.
The key to this search is a bug in the way the Linux kernel handles packets with incorrect sequence numbers. If a packet is received that has a sequence number "less than" that which is expected, the DelayedACKLost counter is incremented—regardless of whether the packet is an acknowledgment (ACK) or not. The calculation that is done to determine whether the number is less than what is expected essentially partitions the 32-bit sequence number space into two halves. Because DelayedACKLost does not get incremented if the sequence number is larger than the expected number, it can be used in a search to narrow in on the value of interest.
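The comparison in question is the kernel's before() helper from include/net/tcp.h; the signed cast is what splits the sequence space into two halves relative to the expected value:

    /* seq1 is "before" seq2 when their difference, taken as a signed
     * 32-bit quantity, is negative; every possible probe value thus
     * falls into one of two 2^31-sized halves relative to seq2. */
    static inline bool before(__u32 seq1, __u32 seq2)
    {
            return (__s32)(seq1 - seq2) < 0;
    }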
DelayedACKLost was once used to decide if the network stack should use delayed ACKs or not. It is meant to count missing incoming delayed ACKs. In normal usage, it is rarely incremented, so an attacker can get a "clean" signal by simply sending packets with various sequence numbers and observing their effects on the counter. The paper describes using an N-way search technique that splits the sequence number space into a small number of equal-sized bins, sends a different number of packets with sequence numbers from each bin, and uses the value of the counter to find the sequence number with fewer round trips between the client and the probing host.
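Ignoring sequence-number wraparound (which the real attack must handle, since the comparison above is modular), the search reduces to a textbook binary search; probe_is_below() here is a hypothetical stand-in for one spoofed probe plus a counter check relayed by the client-side application:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical primitive: send a probe carrying sequence number
     * 'seq' and report whether DelayedACKLost moved, that is, whether
     * the kernel considered 'seq' to be before the expected value. */
    extern bool probe_is_below(uint32_t seq);

    /* Plain binary search: at most 32 probes.  The paper's N-way
     * variant sends differing numbers of probes per bin in one round
     * trip, cutting the search to four or five round trips. */
    uint32_t infer_sequence_number(void)
    {
        int64_t lo = -1, hi = (1LL << 32) - 1;  /* target in (lo, hi] */

        while (hi - lo > 1) {
            int64_t mid = lo + (hi - lo) / 2;
            if (probe_is_below((uint32_t)mid))
                lo = mid;   /* mid is below the target */
            else
                hi = mid;   /* mid is at or above the target */
        }
        return (uint32_t)hi;
    }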
Beyond just inferring sequence numbers, though, the paper lays out a number of ways to use that information to interfere with real connections. The first of those is to inject data into an existing connection. By using the inferred sequence number, the "off-path" (i.e. not a man in the middle) attacker can spoof a packet to the client. The example used is an HTTP response from some internet server. If that server takes "long enough" to respond, the sequence number can be determined and a fake response can be sent before the real server responds. That turns out to be the pattern for Facebook, for example, and the researchers were able to inject JavaScript into a client session to update the user's status. In addition, if an HTTP connection is used for multiple requests, later responses (after the sequence number has been determined) can be spoofed.
A similar technique, using the same counter, can determine the client-side sequence number. That number can be used by an attacker to send spoofed packets in the other direction, so client requests to the server can be faked.
Beyond that, two types of connection hijacking are possible. A passive hijacking can be done on kernels older than 3.0.2 because there were only 24 bits of randomness in the ISN. When the connection is established to the remote server, the off-path attacker immediately sends spoofed "reset" (RST) packets to the server with sequence numbers covering the entire 24-bit client ISN space. It then proceeds to determine the server's ISN using the DelayedACKLost counter and sends a spoofed response once it has done so.
For more recent kernels, there is still an active hijacking that can be performed by pre-calculating the client ISNs for a range of port numbers. By "port jamming" the other ports (i.e. using those port numbers so they aren't available), the malware can ensure that the outgoing connection originates from a known, pre-calculated port. ISNs change at a known rate, so the attacker can calculate the ISN based on what it originally determined and how much time has passed since the determination was made. The trick is to know that a connection will be made to the server "soon". Ensuring that is the "active" piece of the puzzle.
Both hijacking techniques require that the server have some kind of stateful firewall that discards "out of state" packets. Otherwise, RST responses from the server on a connection it thinks is closed will confuse the real client (which thinks the connection is still open). It turns out that many internet services (e.g. Facebook, Twitter) do have such firewalls.
All of the techniques described in the paper were put to use in fairly "real world" scenarios. An Android application was used on the client side to monitor the counter, while other servers acted as the off-path collaborator. Many Android devices are based on 2.6.x, so they are vulnerable to either of the two hijacking techniques. Beyond the Facebook-status-updating malware mentioned above, they were able to create ways to "phish" for Facebook credentials using connection hijacking.
The fix for the problem on Linux is to check that the ACK bit is set on the packet before deciding that it is a lost delayed ACK. Attackers cannot just switch to turning on the ACK bit on the probes, because they would need the sequence number in the other direction to use as the "ACK number" in the packet. Eric Dumazet has posted a patch to discard the packets that would trigger this bug, but more is needed. The longer-term fix will require moving most of the packet safety checks to an earlier point in packet handling. That way, fewer kinds of bogus packets will get far enough into the networking stack to cause this kind of observable state change.
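Conceptually (this is a sketch of the described check, not the actual patch; th, skb, tp, and sk stand for the usual TCP input-path variables), the counter update gains a test on the ACK flag:

    /* Only count a segment as a lost delayed ACK if it actually
     * carries the ACK flag; a probe with the flag set but a bogus
     * acknowledgment number is discarded earlier in the input path. */
    if (th->ack && before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
            NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);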
One other possible solution would be to remove access to the DelayedACKLost counter for non-privileged programs. That would likely be difficult to do in Linux, as it has become part of the kernel's ABI, but it is something to consider for the future, a point the researchers themselves make in their paper.
There certainly are dangers from exposing internal kernel state to user space—sometimes those dangers don't manifest themselves for quite some time. Doing so has its benefits, though, for users and developers. It is a bit of a tricky balancing act—one that is made more difficult by the "no ABI changes" policy in the kernel. In this case, though, it seems that a solution has been found without changing the ABI. While the examples given in the paper may seem somewhat trivial, the techniques could certainly be used in ways that are far more damaging than an embarrassing Facebook status update.
Brief items
Security quotes of the week
Fraudulent certificates in the wild — again
Google reports that another fraudulent *.google.com digital certificate was detected by the Chrome browser in late December; this one traces back to the certificate authority TURKTRUST. "In response, we updated Chrome’s certificate revocation metadata on December 25 to block that intermediate CA, and then alerted TURKTRUST and other browser vendors. TURKTRUST told us that based on our information, they discovered that in August 2011 they had mistakenly issued two intermediate CA certificates to organizations that should have instead received regular SSL certificates." Expect a round of updates from other browser projects.
Ruby on Rails SQL injection issue
An SQL injection vulnerability in all Ruby on Rails releases has been disclosed. "Due to the way dynamic finders in Active Record extract options from method parameters, a method parameter can mistakenly be used as a scope. Carefully crafted requests can use the scope to inject arbitrary SQL." Fixes can be found in the 3.2.10, 3.1.9, and 3.0.18 releases. This seems like a good one to address quickly.
Update: this article has a lot more information on this vulnerability.
Apache plugin turns legit sites into bank-attack platforms (ars technica)
Ars technica writes about an Apache plugin that is being used to turn Linux web servers into Windows banking malware distribution sites. "The Apache plugin, which Eset software flags as Linux/Chapro.A, contains several features designed to make infections stealthy. To prevent being widely detected, it doesn't serve malicious content when a visitor's browser user agent indicates it's coming from Google or another automated search-engine agent. It also holds its fire against IP addresses that connect to the Web server over SSH-protected channels, preventing site administrators from being exposed. It also uses browser cookies and IP logging to prevent visitors from being exposed to exploits more than once. By hiding the attacks from search engines and admins—and making it hard to determine how end-user machines are infected—the features make it harder to identify the site as compromised."
New vulnerabilities
apparmor-profiles: insecure profile for Chromium
Package(s): apparmor-profiles
CVE #(s): (none)
Created: December 20, 2012
Updated: January 3, 2013
Description: From the Ubuntu advisory: Dan Rosenberg discovered that the example AppArmor profile for chromium-browser could be escaped by calling xdg-settings with a crafted environment.
chromium: multiple vulnerabilities
Package(s): chromium
CVE #(s): CVE-2012-5139 CVE-2012-5140 CVE-2012-5141 CVE-2012-5142 CVE-2012-5143 CVE-2012-5144
Created: December 21, 2012
Updated: January 28, 2013
Description: From the openSUSE advisory:
drupal: multiple vulnerabilities
Package(s): drupal
CVE #(s): CVE-2012-5651 CVE-2012-5653
Created: December 28, 2012
Updated: October 14, 2013
Description: From the Mageia advisory: A vulnerability was identified that allows blocked users to appear in user search results, even when the search results are viewed by unprivileged users (CVE-2012-5651). Drupal core's file upload feature blocks the upload of many files that can be executed on the server by munging the filename. A malicious user could name a file in a manner that bypasses this munging of the filename in Drupal's input validation (CVE-2012-5653).
elinks: information disclosure
Package(s): elinks
CVE #(s): CVE-2012-4545
Created: December 28, 2012
Updated: April 9, 2013
Description: From the Debian advisory: Marko Myllynen discovered that elinks, a powerful text-mode browser, incorrectly delegates user credentials during GSS-Negotiate.
fail2ban: unspecified vulnerability
Package(s): fail2ban
CVE #(s): CVE-2012-5642
Created: December 28, 2012
Updated: June 28, 2013
Description: From the Fedora advisory: The release notes for fail2ban 0.8.8 [1],[2] indicate: * [83109bc] IMPORTANT: escape the content of <matches> (if used in custom action files) since its value could contain arbitrary symbols. Thanks for discovery go to the NBS System security team. This could cause issues on the system running fail2ban as it scans log files, depending on what content is matched. There isn't much more detail about this issue than what is described above, so I think it may largely depend on the type of regexp used (what it matches) and the contents of the log file being scanned (whether or not an attacker could insert something that could be used in a malicious way).
freetype2: multiple vulnerabilities
Package(s): freetype2
CVE #(s): CVE-2012-5668 CVE-2012-5669 CVE-2012-5670
Created: December 28, 2012
Updated: March 18, 2015
Description: From the Mageia advisory:
A null pointer dereference flaw was found in the way the FreeType font rendering engine handled Glyph bitmap distribution format (BDF) fonts. A remote attacker could provide a specially-crafted BDF font file, which once processed in an application linked against FreeType would lead to that application crash (CVE-2012-5668).
An out-of-bounds heap-based buffer read flaw was found in the way the FreeType font rendering engine performed parsing of glyph information and relevant bitmaps for the glyph bitmap distribution format (BDF). A remote attacker could provide a specially-crafted BDF font file, which once opened in an application linked against FreeType would lead to that application crash (CVE-2012-5669).
An out-of-bounds heap-based buffer write flaw was found in the way the FreeType font rendering engine performed parsing of glyph information and relevant bitmaps for the glyph bitmap distribution format (BDF). A remote attacker could provide a specially-crafted font file, which once opened in an application linked against FreeType would lead to that application crash, or, potentially, arbitrary code execution with the privileges of the user running the application (CVE-2012-5670).
fuse-esb: denial of service
Package(s): fuse-esb
CVE #(s): CVE-2012-5370
Created: December 21, 2012
Updated: January 4, 2013
Description: From the Red Hat advisory: A denial of service flaw was found in the implementation of associative arrays (hashes) in JRuby. An attacker able to supply a large number of inputs to a JRuby application (such as HTTP POST request parameters sent to a web application) that are used as keys when inserting data into an array could trigger multiple hash function collisions, making array operations take an excessive amount of CPU time. To mitigate this issue, the Murmur hash function has been replaced with the Perl hash function. (CVE-2012-5370)
Note: Fuse ESB Enterprise 7.0.2 ships JRuby as part of the camel-ruby component, which allows users to define Camel routes in Ruby. The default use of JRuby in Fuse ESB Enterprise 7.0.2 does not appear to expose this flaw. If the version of JRuby shipped with Fuse ESB Enterprise 7.0.2 was used to build a custom application, then this flaw could be exposed.
gnupg: memory access violations
Package(s): gnupg
CVE #(s): CVE-2012-6085
Created: January 2, 2013
Updated: June 11, 2013
Description: From the Mandriva advisory: Versions of GnuPG <= 1.4.12 are vulnerable to memory access violations and public keyring database corruption when importing public keys that have been manipulated. An OpenPGP key can be fuzzed in such a way that gpg segfaults (or has other memory access violations) when importing the key.
mahara: multiple vulnerabilities
Package(s): mahara
CVE #(s): CVE-2012-2239 CVE-2012-2243 CVE-2012-2244 CVE-2012-2246 CVE-2012-2247 CVE-2012-2253 CVE-2012-6037
Created: December 28, 2012
Updated: January 3, 2013
Description: From the Debian advisory: Multiple security issues have been found in Mahara - an electronic portfolio, weblog, and resume builder - which can result in cross-site scripting, clickjacking, or arbitrary file execution.
CVE-2012-2239: Mahara 1.4.x before 1.4.4 and 1.5.x before 1.5.3 allows remote attackers to read arbitrary files or create TCP connections via an XML external entity (XXE) injection attack, as demonstrated by reading config.php.
CVE-2012-2243: Cross-site scripting (XSS) vulnerability in Mahara 1.4.x before 1.4.5 and 1.5.x before 1.5.4 allows remote attackers to inject arbitrary web script or HTML by uploading an XML file with the xhtml extension, which is rendered inline as script. NOTE: this can be leveraged with CVE-2012-2244 to execute arbitrary code without authentication, as demonstrated by modifying the clamav path.
CVE-2012-2244: Mahara 1.4.x before 1.4.5 and 1.5.x before 1.5.4 allows remote authenticated administrators to execute arbitrary programs by modifying the path to clamav. NOTE: this can be exploited without authentication by leveraging CVE-2012-2243.
CVE-2012-2246: Mahara 1.4.x before 1.4.5 and 1.5.x before 1.5.4 allows remote attackers to conduct clickjacking attacks to delete arbitrary users and bypass CSRF protection via account/delete.php.
CVE-2012-2247: Cross-site scripting (XSS) vulnerability in Mahara 1.4.x before 1.4.5 and 1.5.x before 1.5.4 allows remote attackers to inject arbitrary web script or HTML via vectors related to artefact/file/ and a crafted SVG file.
CVE-2012-2253: Cross-site scripting (XSS) vulnerability in group/members.php in Mahara 1.5.x before 1.5.7 and 1.6.x before 1.6.2 allows remote attackers to inject arbitrary web script or HTML via the query parameter.
CVE-2012-6037: Multiple cross-site scripting (XSS) vulnerabilities in Mahara 1.4.x before 1.4.5 and 1.5.x before 1.5.4, and other versions including 1.2, allow remote attackers to inject arbitrary web script or HTML via a CSV header with "unknown fields," which are not properly handled in error messages in the (1) bulk user, (2) group, and (3) group member upload capabilities. NOTE: this issue was originally part of CVE-2012-2243, but that ID was SPLIT due to different issues by different researchers.
mediawiki-extensions: cross-site scripting
Package(s): mediawiki-extensions
CVE #(s): (none)
Created: December 31, 2012
Updated: January 8, 2013
Description: From the Debian advisory: Thorsten Glaser discovered that the RSSReader extension for mediawiki, a website engine for collaborative work, does not properly escape tags in feeds. This could allow a malicious feed to inject JavaScript into the mediawiki pages.
moin: multiple vulnerabilities
Package(s): moin
CVE #(s): CVE-2012-6495 CVE-2012-6081 CVE-2012-6082 CVE-2012-6080
Created: December 31, 2012
Updated: September 25, 2013
Description: From the CVE entries:
Multiple directory traversal vulnerabilities in the (1) twikidraw (action/twikidraw.py) and (2) anywikidraw (action/anywikidraw.py) actions in MoinMoin before 1.9.6 allow remote authenticated users with write permissions to overwrite arbitrary files via unspecified vectors. NOTE: this can be leveraged with CVE-2012-6081 to execute arbitrary code. (CVE-2012-6495)
Multiple unrestricted file upload vulnerabilities in the (1) twikidraw (action/twikidraw.py) and (2) anywikidraw (action/anywikidraw.py) actions in MoinMoin before 1.9.6 allow remote authenticated users with write permissions to execute arbitrary code by uploading a file with an executable extension, then accessing it via a direct request to the file in an unspecified directory, as exploited in the wild in July 2012. (CVE-2012-6081)
Cross-site scripting (XSS) vulnerability in the rsslink function in theme/__init__.py in MoinMoin 1.9.5 allows remote attackers to inject arbitrary web script or HTML via the page name in a rss link. (CVE-2012-6082)
Directory traversal vulnerability in the _do_attachment_move function in the AttachFile action (action/AttachFile.py) in MoinMoin 1.9.3 through 1.9.5 allows remote attackers to overwrite arbitrary files via a .. (dot dot) in a file name. (CVE-2012-6080)
See the MoinMoin security fixes page for more information. Version 1.9.6 contains the patches for these issues.
ndjbdns: DNS cache poisoning
Package(s): ndjbdns
CVE #(s): CVE-2008-4392
Created: January 3, 2013
Updated: January 3, 2013
Description: From the NVD entry: dnscache in Daniel J. Bernstein djbdns 1.05 does not prevent simultaneous identical outbound DNS queries, which makes it easier for remote attackers to spoof DNS responses, as demonstrated by a spoofed A record in the Additional section of a response to a Start of Authority (SOA) query.
php-symfony2-HttpKernel: multiple vulnerabilities
Package(s): php-symfony2-HttpKernel
CVE #(s): CVE-2012-6431 CVE-2012-6432
Created: January 3, 2013
Updated: January 7, 2013
Description: From the Symfony advisory:
CVE-2012-6431: On the Symfony 2.0.x version, there's a security issue that allows access to routes protected by a firewall even when the user is not logged in.
CVE-2012-6432: For handling ESIs (via the render tag), Symfony uses a special route named _internal, defined in @FrameworkBundle/Resources/config/routing/internal.xml. As of Symfony 2.1, the internal routing file defines an additional route, _internal_public, to be able to manage HIncludes (also via the render tag). As the _internal route must only be used to route URLs between your PHP application and a reverse proxy, it must be secured to avoid any access from a browser. But the _internal_public route must always be available from a browser as it should be reachable by your frontend JavaScript (of course only if you are using HIncludes in your application). These two routes execute the same FrameworkBundle:Internal:index controller which in turn executes the controller passed as an argument in the URL. If these routes are reachable by a browser, an attacker could call them to execute protected controllers or any other service (as a controller can also be defined as a service).
php-ZendFramework: denial of service
Package(s): php-ZendFramework
CVE #(s): CVE-2012-5657
Created: December 28, 2012
Updated: April 10, 2013
Description: From the Mageia advisory: A vulnerability was reported in Zend Framework versions prior to 1.11.15 and 1.12.1, which can be exploited to disclose certain sensitive information. This flaw is caused due to an error in the "Zend_Feed_Rss" and "Zend_Feed_Atom" classes of the "Zend_Feed" component, when processing XML data. It can be used to disclose the contents of certain local files by sending specially crafted XML data including external entity references.
python-django: multiple vulnerabilities
Package(s): python-django
CVE #(s): (none)
Created: January 3, 2013
Updated: January 3, 2013
Description: From the Django advisory: Several earlier Django security releases focused on the issue of poisoning the HTTP Host header, causing Django to generate URLs pointing to arbitrary, potentially-malicious domains. In response to further input received and reports of continuing issues following the previous release, we're taking additional steps to tighten Host header validation. Also following up on a previous issue: in July of this year, we made changes to Django's HTTP redirect classes, performing additional validation of the scheme of the URL to redirect to (since, both within Django's own supplied applications and many third-party applications, accepting a user-supplied redirect target is a common pattern). Since then, two independent audits of the code turned up further potential problems. So, similar to the Host-header issue, we are taking steps to provide tighter validation in response to reported problems (primarily with third-party applications, but to a certain extent also within Django itself).
squid: denial of service
Package(s): squid
CVE #(s): CVE-2012-5643
Created: December 26, 2012
Updated: March 11, 2013
Description: From the CVE entry: Multiple memory leaks in tools/cachemgr.cc in cachemgr.cgi in Squid 2.x and 3.x before 3.1.22, 3.2.x before 3.2.4, and 3.3.x before 3.3.0.2 allow remote attackers to cause a denial of service (memory consumption) via (1) invalid Content-Length headers, (2) long POST requests, or (3) crafted authentication credentials.
tomcat: denial of service
Package(s): tomcat
CVE #(s): CVE-2012-5568
Created: December 28, 2012
Updated: January 3, 2013
Description: From the openSUSE advisory: Apache Tomcat through 7.0.x allows remote attackers to cause a denial of service (daemon outage) via partial HTTP requests, as demonstrated by Slowloris.
v8: multiple vulnerabilities
Package(s): v8
CVE #(s): CVE-2012-5120 CVE-2012-5128
Created: December 31, 2012
Updated: January 9, 2013
Description: From the CVE entries:
Google V8 before 3.13.7.5, as used in Google Chrome before 23.0.1271.64, on 64-bit Linux platforms allows remote attackers to cause a denial of service or possibly have unspecified other impact via crafted JavaScript code that triggers an out-of-bounds access to an array. (CVE-2012-5120)
Google V8 before 3.13.7.5, as used in Google Chrome before 23.0.1271.64, does not properly perform write operations, which allows remote attackers to cause a denial of service or possibly have unspecified other impact via unknown vectors. (CVE-2012-5128)
virtualbox-ose: denial of service
Package(s): virtualbox-ose
CVE #(s): CVE-2012-3221
Created: December 31, 2012
Updated: January 20, 2014
Description: From the Debian advisory: "halfdog" discovered that incorrect interrupt handling in VirtualBox, an x86 virtualization solution, can lead to denial of service.
Page editor: Jake Edge
Kernel development
Brief items
Kernel release status
The current development kernel is 3.8-rc2, released on January 2. Linus says: "It's a new year, people are getting back to work, and trying desperately to forget the over-eating that has been going on for the last two weeks. And hey, to celebrate, here's -rc2!" Perhaps as a result of the aforementioned over-eating, this patch is relatively small; things might be expected to pick up a bit by the time -rc3 comes out.
The 3.8-rc1 release happened on December 21. The most significant changes pulled between the last edition's summary and the closing of the merge window include the f2fs filesystem, and a driver for Dialog Semiconductor DA9055 watchdog devices.
Stable updates: no stable updates have been released since December 17. The 3.2.36 update is in the review process as of this writing; its release can be expected at any time.
Quotes of the week
But then after a few days, I've been thinking I should have taken a second cup of tea with me.
For all others this is just the last release of 2012.
With a shark on it.
The Tux3 filesystem returns
The "Tux3" next-generation filesystem project generated a lot of discussion and a fair amount of code before fading into obscurity; LWN last covered this work in 2008. Tux3 developer Daniel Phillips has resurfaced with a new-year posting suggesting that work on this code is resuming. "In brief, the first time Hirofumi ever put together all the kernel pieces in his magical lab over in Tokyo, our Tux3 rocket took off and made it straight to orbit. Or in less metaphorical terms, our first meaningful benchmarks turned in numbers that meet or even slightly beat the illustrious incumbent, Ext4." The code can be found in git://github.com/OGAWAHirofumi/tux3.git for those who want to play with it.
Kernel development news
Improving ticket spinlocks
Spinlocks, being the lowest-level synchronization mechanism in the kernel, are the target of seemingly endless attempts at performance enhancement. The ticket spinlock mechanism used in the mainline has resisted such attempts for a few years. Now, though, some developers have identified a performance bottleneck associated with these locks and are busily trying to come up with an improved version.

A spinlock is so-named because a CPU waiting for a contended lock will "spin" in a tight loop, repeatedly querying the lock until it becomes available. Ticket spinlocks adjust this algorithm by having each waiting CPU take a "ticket" so that each CPU obtains the lock in the order in which it arrived. These locks thus resemble the "take a number" mechanisms found at deli counters or motor vehicle division offices worldwide — though, with luck, the wait is rather shorter than is required to renew a driver's license in your editor's part of the world. Without the ticket mechanism, which was added for the 2.6.25 release, the kernel's spinlocks were unfair; in some situations, some waiters could be starved for an extended period of time.
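To make the "take a number" analogy concrete, here is a minimal user-space rendering of a ticket lock in C11 atomics; the kernel's implementation is architecture-specific and rather more compact, so treat this as a sketch of the algorithm, not kernel code:

    #include <stdatomic.h>

    struct ticket_lock {
        atomic_uint next;   /* next ticket to hand out */
        atomic_uint owner;  /* ticket currently being served */
    };

    static void ticket_lock(struct ticket_lock *l)
    {
        /* Atomically take the next ticket in line... */
        unsigned int ticket = atomic_fetch_add(&l->next, 1);

        /* ...then spin until our number is called.  Every load here
         * queries the (possibly contended) lock's cache line. */
        while (atomic_load(&l->owner) != ticket)
            ;   /* the kernel would call cpu_relax() here */
    }

    static void ticket_unlock(struct ticket_lock *l)
    {
        /* Serve the next waiter; this write requires exclusive
         * access to the lock's cache line. */
        atomic_fetch_add(&l->owner, 1);
    }

Fairness falls out naturally: tickets are granted in arrival order, and owner only ever advances by one.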
It has long been understood that lock contention reduces system performance considerably. The simple act of spinning for a lock clearly is not going to be good for performance, but there are also caching issues to take into account. If two CPUs are repeatedly acquiring a spinlock, the memory location representing that lock will bounce back and forth between those CPUs' caches. Even if neither CPU ever has to wait for the lock, the process of moving it between caches will slow things down considerably. For that reason, interest in lockless algorithms has been growing for many years.
In the case of a contended lock, though, cache contention would appear to be less of an issue. A CPU spinning on a lock will cache its contents in a shared mode; no cache bouncing should occur until the CPU owning the lock releases it. Releasing the lock (and its acquisition by another CPU) requires writing to the lock, and that requires exclusive cache access. The cache line movement at that time hurts, but probably not as much as waiting for the lock in the first place. So it would seem that trying to optimize cache behavior in the contended case is not likely to produce much in the way of useful results.
That picture is not complete, though; one must take a couple of other facts into account. Processors do not cache a single value; they cache a "line" of (typically) 128 consecutive bytes as a single unit. In other words, the cache lines in any contemporary processor are almost certainly significantly larger than what is required to hold a spinlock. So when a CPU needs exclusive access to a spinlock's cache line, it also gains exclusive access to a significant chunk of surrounding data. And that is where the other important detail comes into play: spinlocks tend to be embedded within the data structures that they protect, so that surrounding data is typically data of immediate interest to the CPU holding the lock.
Kernel code will acquire a lock to work with (and, usually, modify) a structure's contents. Often, changing a field within the protected structure will require access to the same cache line that holds the structure's spinlock. If the lock is uncontended, that access is not a problem; the CPU owning the lock probably owns the cache line as well. But if the lock is contended, there will be one or more other CPUs constantly querying its value, obtaining shared access to that same cache line and depriving the lock holder of the exclusive access it needs. A subsequent modification of data within the affected cache line will thus incur a cache miss. So CPUs querying a contended lock can slow the lock owner considerably, even though that owner is not accessing the lock directly.
How badly can throughput be impacted? In the description of his patch adding proportional backoff to ticket spinlocks, Rik van Riel describes a microbenchmark that is slowed by a factor of two when there is a single contending CPU, and by as much as a factor of ten with many CPUs in the mix. That is not just a slowdown; that is a catastrophic loss of performance. Needless to say, that is not the sort of behavior that kernel developers like to see.
Rik's solution is simple enough. Rather than spinning tightly and querying a contended lock's status, a waiting CPU should wait a bit more patiently, only querying the lock occasionally. So his patch causes a waiting CPU to loop a number of times doing nothing at all before it gets impatient and checks the lock again. It goes without saying that picking that "number of times" correctly is the key to good performance with this algorithm. While a CPU is looping without querying the lock it cannot be bouncing cache lines around, so the lock holder should be able to make faster progress. But too much looping will cause the lock to sit idle before the owner of the next ticket notices that its turn has come; that, too, will hurt performance.
The first step in Rik's patch series calculates how many CPUs must release the lock before the current CPU can claim it (by subtracting the current CPU's ticket number from the number currently being served) and loops 50 times for every CPU that is ahead in the queue. That is where the "proportional backoff" comes in; the further back in line the CPU is, the longer it will wait between queries of the lock. The result should be a minimizing of idle looping while also minimizing cache traffic.
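Expressed as a variation on the sketch above (and, again, only as an approximation of what Rik's kernel patch actually does), the waiting side becomes:

```c
/*
 * Proportional backoff: loop idly, 50 iterations per waiter ahead of
 * us in the queue, before querying the lock again.
 */
static void ticket_lock_backoff(ticket_lock_t *lock)
{
	unsigned int ticket = __sync_fetch_and_add(&lock->next, 1);
	unsigned int ahead;

	while ((ahead = ticket - lock->owner) != 0) {
		for (unsigned int i = 0; i < 50 * ahead; i++)
			__asm__ volatile("");	/* idle: no lock queries, no cache traffic */
	}
}
```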
The number 50 was determined empirically, but it seems unlikely that it will be optimal for all situations. So the final part of Rik's patch set attempts to tune that number dynamically. The dynamic delay factor is increased when the lock is found to be unavailable and decreased when the lock is obtained. The goal is to have a CPU query the lock an average of 2.7 times before obtaining it. The number 2.7, once again, was obtained by running lots of tests and seeing what worked best; subsequent versions of the patch have tweaked this heuristic somewhat. Details aside, the core idea is that the delay factor (a per-CPU value that applies to all contended locks equally) will increase for workloads experiencing more contention, tuning the system appropriately.
That said, the notion of a single delay for all locks is likely to be causing a severe case of raised eyebrows for some readers, and, indeed, it turned out to be inadequate; some locks are rather more contended than others, after all. So the January 3 version of Rik's patch keeps a hashed list (based on the spinlock address) of delay values instead.
Michel Lespinasse ran some experiments of his own to see how well the proportional backoff algorithm worked. In particular, he wanted to figure out whether it was truly necessary to calculate a dynamic delay factor, or whether an optimal static value could be found. His conclusion was that, in fact, a static value is good enough; it might be possible to do a little better with a dynamic value, he said, but the improvement is not enough to justify the added complexity of the tuning mechanism. There is just one little difficulty: nobody yet knows a good way to choose that static value, which is likely to vary from one system to the next.
If these results stand, and an appropriate way of picking the static value can be found, then there is probably not a case for adding dynamic backoff to the kernel's spinlock implementation. But the backoff idea in general would appear to be a significant improvement for some workloads. So the chances are good that we will see it added in some form in an upcoming development cycle.
The mempressure control group proposal
Last November, LWN described the vmpressure_fd() work which implemented a new system call making it possible for user-space applications to be informed when system memory is tight. Those applications could then respond by freeing memory, easing the crunch. That patch set has since evolved considerably.

Based on the feedback that author Anton Vorontsov received, the concept has changed from a new system call to a new, control-group-based subsystem. The controller implementation allows for integration with the memory controller, meaning that applications can receive notifications when their specific control group is running low on memory, even if the system as a whole is not under memory pressure.
As with previous versions of the patch, applications can receive notifications for three levels of memory pressure: "low" (memory reclaim is happening at a low level), "medium" (some swapping is happening), and "oom" (memory pressure is severe). But these notifications may no longer be the primary way in which applications interact with the controller, thanks to the most significant change in comparison to the previous vmpressure_fd() solution: the addition of a user-space "shrinker" interface allowing the kernel to ask user space to free specific amounts of memory when needed. This API was inspired by Andrew Morton's feedback on the first revision of the mempressure control group subsystem patch.
Andrew also worried that application developers may tune their programs against a particular kernel version; subtle behavioral changes in new kernel releases might then cause regressions. In short, Andrew complained, the behavior of the system as a whole was not testable, so there would be no way to know if subsequent kernel changes made performance worse.
Andrew's suggestion was to give more control to the kernel and introduce some kind of interface for user-space memory scanning and freeing (similar in its main concept to the shrink_slab() kernel shrinkers). This interface would control user-space reclaim behavior; if something goes wrong, it will be up to kernel to resolve the issue. It would also give kernel developers the ability to test and tune whole system behavior by writing a compliant user-space test application and running it.
The user-space shrinker implementation by Anton operates on the concept of chunks of an application-defined size. There is an assumption that the application does memory allocations with a specific granularity (the "chunk size"); that figure may not be 100% accurate, but the more accurate it is, the better. So if the application caches data in chunks of 1MB, that is the size it will provide to the shrinker interface. That is done through a sequence like this:
- The application opens the control interface, which is found as the file cgroup.event_control in the controller directory.
- The shrinker interface (mempressure.shrinker in the controller directory) must also be opened.
- The eventfd() system call is used to obtain a third file descriptor for notifications.
- The application then writes a string containing the eventfd() file descriptor number, the mempressure.shrinker file descriptor number, and the chunk size to the control interface.
Occasionally, the application should write a string to the shrinker file indicating how many chunks have been allocated or (using a negative count) freed. The kernel uses this information to maintain an internal count of how many reclaimable chunks the application is currently holding on to.
If the kernel wants the application to free some memory, the notification will come through the eventfd() file descriptor in the form of an integer count of the number of chunks that should be freed. The kernel assumes that the application will free the specified number of chunks before reading from the eventfd() file descriptor again. If the application isn't able to reclaim all chunks for some reason, it should re-add the number of chunks that were not freed by writing to the mempressure.shrinker file.
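Put together, a client of this interface might look something like the following sketch. The interface was never merged, so the controller mount point used here is hypothetical, and the exact control-string format is an assumption based on the description above:

```c
/* Hypothetical user-space shrinker client; the paths and string
 * formats follow the patch discussion and may not match any shipped
 * kernel. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define CHUNK_SIZE	(1024 * 1024)	/* our cache's 1MB allocation granularity */

int main(void)
{
	int cfd = open("/sys/fs/cgroup/mempressure/cgroup.event_control", O_WRONLY);
	int sfd = open("/sys/fs/cgroup/mempressure/mempressure.shrinker", O_RDWR);
	int efd = eventfd(0, 0);
	char buf[64];

	if (cfd < 0 || sfd < 0 || efd < 0)
		return 1;

	/* register the eventfd, the shrinker fd, and our chunk size */
	snprintf(buf, sizeof(buf), "%d %d %d", efd, sfd, CHUNK_SIZE);
	write(cfd, buf, strlen(buf));

	write(sfd, "16", 2);	/* we now hold 16 reclaimable chunks */

	for (;;) {
		uint64_t chunks;

		/* blocks until the kernel asks us to free some chunks */
		if (read(efd, &chunks, sizeof(chunks)) != sizeof(chunks))
			break;
		/* ... free 'chunks' cached chunks here; any shortfall
		 * should be written back to sfd ... */
	}
	return 0;
}
```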
The patchset also includes an example application (slightly buggy in the current version) for testing the new interface. It creates two threads; the first thread initializes the user-space shrinker mechanism notifications and then tries to allocate memory (more than physically available) in an infinite loop. The second thread listens for user-space shrinker notifications and frees the requested number of chunks (also in an infinite loop). Ideally, during the run of the test application the system shouldn't get into an out-of-memory condition, and it also shouldn't use much swap space (if any is available, of course).
Various comments were received on the patch set, so at least one more round of changes will be required before this interface can be considered for merging into the mainline. There is also an open question on how this feature interacts with volatile ranges and whether both mechanisms (neither of which has yet been merged) are truly required. So this discussion may continue well into the new year before we end up with reclaimable user-space memory caches in their final form.
Namespaces in operation, part 1: namespaces overview
The Linux 3.8 merge window saw the acceptance of Eric Biederman's sizeable series of user namespace and related patches. Although there remain some details to finish—for example, a number of Linux filesystems are not yet user-namespace aware—the implementation of user namespaces is now functionally complete.
The completion of the user namespaces work is something of a milestone, for a number of reasons. First, this work represents the completion of one of the most complex namespace implementations to date, as evidenced by the fact that it has been around five years since the first steps in the implementation of user namespaces (in Linux 2.6.23). Second, the namespace work is currently at something of a "stable point", with the implementation of most of the existing namespaces being more or less complete. This does not mean that work on namespaces has finished: other namespaces may be added in the future, and there will probably be further extensions to existing namespaces, such as the addition of namespace isolation for the kernel log. Finally, the recent changes in the implementation of user namespaces are something of a game changer in terms of how namespaces can be used: starting with Linux 3.8, unprivileged processes can create user namespaces in which they have full privileges, which in turn allows any other type of namespace to be created inside a user namespace.
Thus, the present moment seems a good point to take an overview of namespaces and a practical look at the namespace API. This is the first of a series of articles that does so: in this article, we provide an overview of the currently available namespaces; in the follow-on articles, we'll show how the namespace APIs can be used in programs.
The namespaces
Currently, Linux implements six different types of namespaces. The purpose of each namespace is to wrap a particular global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. One of the overall goals of namespaces is to support the implementation of containers, a tool for lightweight virtualization (as well as other purposes) that provides a group of processes with the illusion that they are the only processes on the system.
In the discussion below, we present the namespaces in the order that they were implemented (or at least, the order in which the implementations were completed). The CLONE_NEW* identifiers listed in parentheses are the names of the constants used to identify namespace types when employing the namespace-related APIs (clone(), unshare(), and setns()) that we will describe in our follow-on articles.
Mount namespaces (CLONE_NEWNS, Linux 2.4.19) isolate the set of filesystem mount points seen by a group of processes. Thus, processes in different mount namespaces can have different views of the filesystem hierarchy. With the addition of mount namespaces, the mount() and umount() system calls ceased operating on a global set of mount points visible to all processes on the system and instead performed operations that affected just the mount namespace associated with the calling process.
One use of mount namespaces is to create environments that are similar to chroot jails. However, by contrast with the use of the chroot() system call, mount namespaces are a more secure and flexible tool for this task. Other more sophisticated uses of mount namespaces are also possible. For example, separate mount namespaces can be set up in a master-slave relationship, so that the mount events are automatically propagated from one namespace to another; this allows, for example, an optical disk device that is mounted in one namespace to automatically appear in other namespaces.
Mount namespaces were the first type of namespace to be implemented on Linux, appearing in 2002. This fact accounts for the rather generic "NEWNS" moniker (short for "new namespace"): at that time no one seems to have been thinking that other, different types of namespace might be needed in the future.
UTS namespaces (CLONE_NEWUTS, Linux 2.6.19) isolate two system identifiers—nodename and domainname—returned by the uname() system call; the names are set using the sethostname() and setdomainname() system calls. In the context of containers, the UTS namespaces feature allows each container to have its own hostname and NIS domain name. This can be useful for initialization and configuration scripts that tailor their actions based on these names. The term "UTS" derives from the name of the structure passed to the uname() system call: struct utsname. The name of that structure in turn derives from "UNIX Time-sharing System".
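For a small taste of what that isolation looks like in practice, here is a sketch using the unshare() API that will be covered in the follow-on articles; it needs CAP_SYS_ADMIN (or, starting with Linux 3.8, a user namespace) to run:

```c
/* Change the hostname inside a private UTS namespace; the rest of
 * the system never sees the change. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char name[64];

	if (unshare(CLONE_NEWUTS) == -1) {
		perror("unshare");
		return 1;
	}
	sethostname("inside", strlen("inside"));
	gethostname(name, sizeof(name));
	printf("hostname in this namespace: %s\n", name);
	return 0;
}
```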
IPC namespaces (CLONE_NEWIPC, Linux 2.6.19) isolate certain interprocess communication (IPC) resources, namely, System V IPC objects and (since Linux 2.6.30) POSIX message queues. The common characteristic of these IPC mechanisms is that IPC objects are identified by mechanisms other than filesystem pathnames. Each IPC namespace has its own set of System V IPC identifiers and its own POSIX message queue filesystem.
PID namespaces (CLONE_NEWPID, Linux 2.6.24) isolate the process ID number space. In other words, processes in different PID namespaces can have the same PID. One of the main benefits of PID namespaces is that containers can be migrated between hosts while keeping the same process IDs for the processes inside the container. PID namespaces also allow each container to have its own init (PID 1), the "ancestor of all processes" that manages various system initialization tasks and reaps orphaned child processes when they terminate.
From the point of view of a particular PID namespace instance, a process has two PIDs: the PID inside the namespace, and the PID outside the namespace on the host system. PID namespaces can be nested: a process will have one PID for each of the layers of the hierarchy starting from the PID namespace in which it resides through to the root PID namespace. A process can see (e.g., view via /proc/PID and send signals with kill()) only processes contained in its own PID namespace and the namespaces nested below that PID namespace.
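Again previewing the API articles to come, a minimal sketch of PID namespace creation with clone() shows that two-PID view directly (the program must run with privilege):

```c
/* Create a child in a new PID namespace; inside, it sees itself as PID 1. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char stack[1024 * 1024];

static int child(void *arg)
{
	printf("inside the namespace: pid = %ld\n", (long) getpid());	/* 1 */
	return 0;
}

int main(void)
{
	/* pass the top of the stack; it grows downward on most architectures */
	pid_t pid = clone(child, stack + sizeof(stack),
			  CLONE_NEWPID | SIGCHLD, NULL);

	if (pid == -1) {
		perror("clone");
		return 1;
	}
	printf("from the parent:      pid = %ld\n", (long) pid);
	waitpid(pid, NULL, 0);
	return 0;
}
```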
Network namespaces (CLONE_NEWNET, started in Linux 2.6.24 and largely completed by about Linux 2.6.29) provide isolation of the system resources associated with networking. Thus, each network namespace has its own network devices, IP addresses, IP routing tables, /proc/net directory, port numbers, and so on.
Network namespaces make containers useful from a networking perspective: each container can have its own (virtual) network device and its own applications that bind to the per-namespace port number space; suitable routing rules in the host system can direct network packets to the network device associated with a specific container. Thus, for example, it is possible to have multiple containerized web servers on the same host system, with each server bound to port 80 in its (per-container) network namespace.
User namespaces (CLONE_NEWUSER, started in Linux 2.6.23 and completed in Linux 3.8) isolate the user and group ID number spaces. In other words, a process's user and group IDs can be different inside and outside a user namespace. The most interesting case here is that a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace. This means that the process has full root privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace.
Starting in Linux 3.8, unprivileged processes can create user namespaces, which opens up a raft of interesting new possibilities for applications: since an otherwise unprivileged process can hold root privileges inside the user namespace, unprivileged applications now have access to functionality that was formerly limited to root. Eric Biederman has put a lot of effort into making the user namespaces implementation safe and correct. However, the changes wrought by this work are subtle and wide ranging. Thus, it may happen that user namespaces have some as-yet unknown security issues that remain to be found and fixed in the future.
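That dual identity can be seen with a few lines of code; on a 3.8 kernel, this sketch should run as an ordinary unprivileged user (a process is allowed to write a uid_map entry that maps its own user ID):

```c
/* Become UID 0 inside a new user namespace while remaining an
 * ordinary user outside of it. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	uid_t outside = getuid();
	char map[32];
	int fd;

	if (unshare(CLONE_NEWUSER) == -1) {
		perror("unshare");
		return 1;
	}
	/* map our outside UID to UID 0 inside the namespace */
	snprintf(map, sizeof(map), "0 %ld 1", (long) outside);
	fd = open("/proc/self/uid_map", O_WRONLY);
	write(fd, map, strlen(map));
	close(fd);

	printf("uid outside: %ld, uid inside: %ld\n",
	       (long) outside, (long) getuid());	/* inside: 0 */
	return 0;
}
```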
Concluding remarks
It's now around a decade since the implementation of the first Linux namespace. Since that time, the namespace concept has expanded into a more general framework for isolating a range of global resources whose scope was formerly system-wide. As a result, namespaces now provide the basis for a complete lightweight virtualization system, in the form of containers. As the namespace concept has expanded, the associated API has grown—from a single system call (clone()) and one or two /proc files—to include a number of other system calls and many more files under /proc. The details of that API will form the subject of the follow-ups to this article.
Series index
The following list shows later articles in this series, along with their example programs:
- Part 2: the namespaces API
- demo_uts_namespaces.c: demonstrate the use of UTS namespaces
- ns_exec.c: join a namespace using setns() and execute a command
- unshare.c: unshare namespaces and execute a command; similar in concept to unshare(1)
- Part 3: PID namespaces
- pidns_init_sleep.c: demonstrate PID namespaces
- multi_pidns.c: create a series of child processes in nested PID namespaces
- Part 4: more on PID namespaces
- ns_child_exec.c: create a child process that executes a shell command in new namespace(s)
- simple_init.c: a simple init(1)-style program to be used as the init program in a PID namespace
- orphan.c: demonstrate that a child becomes orphaned and is adopted by the init process when its parent exits
- ns_run.c: join one or more namespaces using setns() and execute a command in those namespaces, possibly inside a child process; similar in concept to nsenter(1)
- Part 5: user namespaces
- demo_userns.c: simple program to create a user namespace and display process credentials and capabilities
- userns_child_exec.c: create a child process that executes a shell command in new namespace(s); similar to ns_child_exec.c, but with additional options for use with user namespaces
- Part 6: more on user namespaces
- userns_setns_test.c: test the operation of setns() from two different user namespaces.
- Part 7: network namespaces
- Mount namespaces and shared subtrees
- Mount namespaces, mount propagation, and unbindable mounts
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Virtualization and containers
Page editor: Jonathan Corbet
Distributions
Ubuntu for phones—and beyond
Canonical announced another entrant into the mobile phone space on January 2: Ubuntu for phones. In some ways it is similar to Ubuntu for Android, which was announced last February, but there are some substantial differences as well. To the confusion of some, Ubuntu for Android is not going away after this announcement—in fact, we may see it ship on devices before Ubuntu for phones hits store shelves. After nearly a year with no shipping hardware, some skepticism is understandable, but Canonical and founder Mark Shuttleworth expressed confidence that we will see some form of Ubuntu on mobile phones before the year is out. With luck, long before.
![Ubuntu for phones](https://static.lwn.net/images/2013/ubuntu-for-phones.jpg)
Essentially, Ubuntu for phones is exactly what it sounds like: the "full" Ubuntu distribution running on phone hardware. But it is clearly more than that as well. The user interface is radically different, both from desktop Ubuntu (Unity or any of the other choices) and from other mobile operating systems. But, phone Ubuntu has adopted one of the more interesting parts of Ubuntu for Android: the ability for suitably beefy hardware to be connected to a display and keyboard/mouse via a dock, allowing for desktop-mobile convergence.
In fact, the somewhat hype-filled video that accompanies the announcement (perhaps the fact that Shuttleworth calls it a "virtual keynote" should have been a clue) talks about convergence between mobile phone, tablet, desktop, TV, cloud, and "personal supercomputer" all using Ubuntu. It's clear that Canonical has a sweeping vision of where it—and Ubuntu—are headed.
Six minutes or so into the video, Shuttleworth introduces Ubuntu for phones, using what appears to be a Galaxy Nexus. Technical details of things like the software underlying the user interface are scant, but one suspects it is the Ubuntu user space running atop an Android-derived kernel.
Shuttleworth gives a tour of the interface, starting with the "welcome screen" (as opposed to a lock screen—something that has been attacked using a notorious patent) that dynamically updates various activities (such as "tweets" received, kilometers walked, talk time used, etc.) as well as the underlying artwork. Each of the four edges of the screen has a specific purpose, providing direct access from the welcome screen. For example, the left edge holds a handful of favorite apps that can be launched directly. Security would seem to be a concern here (as lock screens are often used to restrict phone access), but Shuttleworth indicated that it was "secure" without providing any details.
That lack of details is, of course, a bit irritating to some. It is not clear, for example, how much of all of this is "demo-ware" and how much is real. But the video is not directly targeted at LWN editors (or regular users) so much as at the hardware manufacturers and app developers. That makes perfect sense. Before we can get Ubuntu phones in our hands, Canonical needs to find hardware partners. Any of the pieces that are partly mocked-up for the Consumer Electronics Show starting January 7 will presumably be finished by the time we see phones.
The edge-based interface is touted as making it easier to perform the various tasks one might want to do with a phone, without having to constantly return to the home screen. That certainly looks like a compelling feature, given that those constant trips home are one of the pain points of other phone interfaces.
The top edge allows searching from any screen, for example; as befits a mobile device, that searching is done on the internet. Unsurprisingly, it will also search for "products" of various sorts, not just web pages. While some users have been unhappy with the addition of Amazon searching to the Unity "Dash" on the desktop, one could argue that it makes more sense on a device like a phone where content consumption is one of the primary activities—at least for some.
There are also a number of global gestures that will immediately take you to various screens or previously used apps. Overall, it looks like a well-thought-out interface that avoids some of the pitfalls of its competitors. It clearly aims to make the most of the entire screen by, for example, allowing the top-edge status bar to be hidden and putting the app controls "below" the bottom of the screen.
Beyond that, there's an app store (of course), but it is integrated with the app screen (which shows the installed apps), rather than by running a separate program (e.g. Google Play). The Ubuntu One "cloud" is integrated as well, so that settings, photos, and other content are all backed up. Integration of shared contact lists and other similar data with desktop applications is at the very least implied.
A phone ecosystem suffers from something of a chicken-and-egg problem, in that hardware devices are needed to generate interest from users and, importantly, app developers. But without an ecosystem of apps, it may be difficult to get hardware manufacturers interested. Canonical appears to be taking two approaches to solving that problem. It is clearly targeting manufacturers who already have Android-ready hardware in the pipeline (so little if any hardware customization will be needed), and it is pitching an Ubuntu-wide development story for apps.
For all Ubuntu devices, both HTML 5 and native applications are supported, with web applications being promoted to an equal footing with their native counterparts, according to Shuttleworth. For native applications, QML is recommended for C and C++ applications, with JavaScript for the user interface. There is also access to native OpenGL for graphics-intensive apps, such as games, which are clearly important to Canonical. Games are one of the areas where Linux lags on the desktop, and they are fairly critical to any mobile phone platform, so it is not a surprise that the company is particularly interested.
There were also a few interesting tidbits that were mentioned in the "keynote" and elsewhere. Ubuntu is shipping on "10% of the world's new branded PCs", which is a rather eye-opening number. In addition, Shuttleworth noted that Dell, Lenovo, ASUS, and, now, HP, are all shipping systems with Ubuntu pre-installed. He said that 70% of the systems offered by those companies are now Ubuntu certified. One of the biggest problem areas for desktop adoption has been finding systems that come with Linux installed, so those numbers would seem to bode well for the future.
One can only wish Canonical well with this new venture. The skeptical may point to the lack of progress on Ubuntu for Android devices, but that could soon change. Ubuntu for phones seems like a more coherent story overall, but it's too early to tell. From a free software perspective, there is the question of whether the user interface code (and any other underlying Canonical-owned pieces) will be released. So far, that is unclear, but Canonical has generally been a stalwart ally of free software along the way; with luck we'll see the code along with a phone or three in the coming year.
Brief items
Distribution quotes of the week
FreeBSD 9.1 released
The FreeBSD project has announced the release of FreeBSD 9.1. "This is the second release from the stable/9 branch, which improves on the stability of FreeBSD 9.0 and introduces some new features." Further information can be found in the release notes.
ROSA Desktop.Fresh 2012 Operating System
The ROSA company has announced the release of ROSA Desktop.Fresh 2012. "ROSA Desktop.Fresh 2012 is a new name for a line of non-commercial desktop operating systems developed by ROSA. The name underlines the fact that the system contains "fresh" versions of user software and system components (compared to enterprise editions of ROSA). The system is compatible with a wide range of modern hardware."
Newsletters and articles of interest
Distribution newsletters
- Debian Misc Developer News #31 (December 31)
- Debian Project News (December 24)
- DistroWatch Weekly, Issue 488 (December 24)
- Ubuntu Weekly Newsletter, Issue 297 (December 23)
Verhelst: m68k: back from the dead
Wouter Verhelst reports that Debian's m68k port has been resurrected. "Contrary to some rumours which I've had to debunk over the years, the m68k port did not go into limbo because it was kicked out of the archive; instead, it did because recent versions of glibc require support for thread-local storage, a feature that wasn't available on m68k, and nobody with the required time, willingness, and skill set could be found to implement it. This changed a few years back, when some people wrote the required support, because they were paid to do so in order to make recent Linux run on ColdFire processors again. Since ColdFire and m68k processors are sufficiently similar, that meant the technical problem was solved." (Thanks to Mattias Mattsson and Paul Wise)
Distributions for the Nexus 7 (TGDaily and HotHardware)
LWN has recently looked at Ubuntu on the Nexus 7 and CyanogenMod 10 on the Nexus 7. Two more distributions are also targeting the Nexus 7. TGDaily covers Bodhi Linux on this tablet and HotHardware looks at the Plasma Active port. Both are works in progress.
Page editor: Rebecca Sobol
Development
Special sections in Linux binaries
A section is an area in an object file that contains information which is useful for linking: the program's code and data, relocation information, and more. It turns out that the Linux kernel has some additional types of sections, called "special sections", that are used to implement various kernel features. Special sections aren't well known, so it is worth shedding some light on the topic.
Segments and sections
Although Linux supports several binary file formats, ELF (Executable and Linking Format) is the preferred format since it is flexible and extensible by design, and it is not bound to any particular processor or architecture. ELF binary files consist of an ELF header followed by a few segments. Each segment, in turn, includes one or more sections. The length of each segment and of each section is specified in the ELF header. Most segments, and thus most sections, have an initial address which is also specified in the ELF header. In addition, each segment has its own access rights.
The linker merges together all sections of the same type included in the input object files into a single section and assigns an initial address to it. For instance, the .text sections of all object files are merged together into a single .text section, which by default contains all of the code in the program. Some of the segments defined in an ELF binary file are used by the GNU loader to assign memory regions with specific access rights to the process.
Executable files include four canonical sections called, by convention, .text, .data, .rodata, and .bss. The .text section contains executable code and is packed into a segment which has the read and execute access rights. The .data and .bss sections contain initialized and uninitialized data respectively, and are packed into a segment which has the read and write access rights.
Linux loads the .text section into memory only once, no matter how many times an application is loaded. This reduces memory usage and launch time and is safe because the code doesn't change. For that reason, the .rodata section, which contains read-only initialized data, is packed into the same segment that contains the .text section. The .data section contains information that could be changed during application execution, so this section must be copied for every instance.
The "readelf -S" command lists the sections included in an executable file, while the "readelf -l" command lists the segments included in an executable file.
Defining a section
Where are the sections declared? If you look at a standard C program you won't find any reference to a section. However, if you look at the assembly version of the C program you will find several assembly directives that define the beginning of a section. More precisely, the ".text", ".data", and ".section rodata" directives identify the beginning of the three canonical sections mentioned previously, while the ".comm" directive defines an area of uninitialized data.
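As a rough illustration of that mapping, here is where a compiler would typically place a few ordinary C objects (a sketch; exact placement can vary with compiler options):

```c
int counter = 42;		/* initialized data: .data */
int scratch[1024];		/* uninitialized data: .bss (via .comm) */
const char greeting[] = "hi";	/* read-only data: .rodata */

int main(void)			/* code: .text */
{
	return counter;
}
```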
The GNU C compiler translates a source file into the equivalent assembly language file. The next step is carried out by the GNU assembler, which produces an object file. This file is an ELF relocatable file which contains only sections (segments which have absolute addresses cannot be defined in a relocatable file). Sections are now filled, with the exception of the .bss section, which just has a length associated with it.
The assembler scans the assembly lines, translates them into binary code, and inserts the binary code into sections. Each section has its own offset which tells the assembler where to insert the next byte. The assembler acts on one section at a time, which is called the current section. In some cases, for instance to allocate space to uninitialized global variables, the assembler does not add bytes in the current section, it just increments its offset.
Each assembly language program is assembled separately; the assembler thus assumes that the starting address of an object program is always 0. The GNU linker receives as input a group of these object files and combines them into a single executable file. This kind of linkage is called static linkage because it is performed before running the program.
The linker relies on a linker script to decide which address to assign to each section of the executable file. To get the default script of your system, you can issue the command:
ld --verbose
Special sections
If you compare the sections present in a simple executable file, say one associated with helloworld.c, with those present in the Linux kernel executable, you will notice that Linux relies on many special sections not present in conventional executable files. The number of such sections depends on the hardware platform. On an x86_64 system over 30 special sections are defined, while on an ARM system there are about ten.
You can use the readelf command to extract data from the ELF header of vmlinux, which is the kernel executable. When issuing this command on an x86_64 box you get something like:
Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 6 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
                 0x00000000007a3000 0x00000000007a3000  R E    200000
  LOAD           0x0000000000a00000 0xffffffff81800000 0x0000000001800000
                 0x00000000000c7b40 0x00000000000c7b40  RW     200000
  LOAD           0x0000000000c00000 0xffffffffff600000 0x00000000018c8000
                 0x0000000000000d60 0x0000000000000d60  R E    200000
  LOAD           0x0000000000e00000 0x0000000000000000 0x00000000018c9000
                 0x0000000000010f40 0x0000000000010f40  RW     200000
  LOAD           0x0000000000eda000 0xffffffff818da000 0x00000000018da000
                 0x0000000000095000 0x0000000000163000  RWE    200000
  NOTE           0x0000000000713e08 0xffffffff81513e08 0x0000000001513e08
                 0x0000000000000024 0x0000000000000024         4

 Section to Segment mapping:
  Segment Sections...
   00     .text .notes __ex_table .rodata __bug_table .pci_fixup __ksymtab __ksymtab_gpl __ksymtab_strings __init_rodata __param __modver
   01     .data
   02     .vsyscall_0 .vsyscall_fn .vsyscall_1 .vsyscall_2 .vsyscall_var_jiffies .vsyscall_var_vgetcpu_mode .vsyscall_var_vsyscall_gtod_data
   03     .data..percpu
   04     .init.text .init.data .x86_trampoline .x86_cpu_dev.init .altinstructions .altinstr_replacement .iommu_table .apicdrivers .exit.text .smp_locks .data_nosave .bss .brk
   05     .notes
Defining a Linux special section
Special sections are defined in the Linux linker script, which is distinct from the default linker script mentioned above. The corresponding source file is stored as kernel/vmlinux.lds.S in the architecture-specific subtree. This file uses a set of macros defined in the include/asm-generic/vmlinux.lds.h header file.
The linker script for the ARM hardware platform contains an easy-to-follow definition of a special section:
. = ALIGN(4);
__start___ex_table = .;
*(__ex_table)
__stop___ex_table = .;

The __ex_table special section is aligned to a multiple of four bytes. Furthermore, the linker creates a pair of identifiers, namely __start___ex_table and __stop___ex_table, and sets their addresses to the beginning and the end of __ex_table. Linux functions can use these identifiers to iterate through the bytes of __ex_table. Those identifiers must be declared as extern because they are defined in the linker script.
Defining and using special sections can thus be summarized as follows:
- Define the special section ".special" in the Linux linker script together with the pair of identifiers that delimit it.
- Insert the .section .special assembly directive into the Linux code to specify that all bytes up to the next .section assembly directive must be inserted in .special.
- Use the pair of identifiers to act on those bytes in the kernel.
This technique seems to apply to assembly code only. Luckily, the GNU C compiler offers the non-standard attribute construct to create special sections. The
__attribute__((__section__(".init.data")))
declaration, for instance, tells the compiler that the object it qualifies must be placed into the .init.data section. To make the code more readable, suitable macros are defined. The __initdata macro, for instance, is defined as:
#define __initdata __attribute__((__section__(".init.data")))
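The same trick can be tried outside the kernel: for any section whose name is a valid C identifier, the GNU linker automatically provides __start_ and __stop_ delimiter symbols. Here is a user-space sketch that emulates the kernel's function-pointer sections described below; the section and macro names are made up for the example:

```c
/* Collect function pointers in a custom section, then walk the
 * section between the linker-provided delimiter symbols. */
#include <stdio.h>

typedef void (*initcall_t)(void);

/* place a pointer to fn into the "my_initcalls" section */
#define my_initcall(fn) \
	static initcall_t __initcall_##fn \
	__attribute__((__section__("my_initcalls"), used)) = fn

/* GNU ld creates these automatically for C-identifier section names */
extern initcall_t __start_my_initcalls[], __stop_my_initcalls[];

static void hello(void) { printf("hello\n"); }
static void world(void) { printf("world\n"); }

my_initcall(hello);
my_initcall(world);

int main(void)
{
	for (initcall_t *fn = __start_my_initcalls; fn < __stop_my_initcalls; fn++)
		(*fn)();
	return 0;
}
```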
Some examples
As seen in the previous readelf listing, all special sections appearing in the Linux kernel end up packed in one of the segments defined in the vmlinux ELF header. Each special section fulfills a particular purpose. The following list groups some of the Linux special sections according to the type of information stored in them. Whenever applicable, the name of the macro used in the Linux code to refer to the section is mentioned instead of the special section's name.
- Binary code
Functions invoked only during the initialization of Linux are declared as __init and placed in the .init.text section. Once the system is initialized, Linux uses the section delimiters to release the page frames allocated to that section.
Functions declared as __sched are inserted into the .sched.text special section so that they will be skipped by the get_wchan() function, which is invoked when reading the /proc/PID/wchan file. This file contains the name of the function, if any, on which process PID is blocked (see "WCHAN: the waiting channel" for further details). The section delimiters bracket the sequence of addresses to be skipped. The down_read() function, for instance, is declared as __sched because it gives no helpful information on the event that is blocking the process.
- Initialized data
Global variables used only during the initialization of Linux are declared as __initdata and placed in the .init.data section. Once the system is initialized, Linux uses the section delimiters to release the page frames allocated to the section.
The EXPORT_SYMBOL() macro makes the identifier passed as parameter accessible to kernel modules. The identifier's string constant is stored in the __ksymtab_strings section.
- Function pointers
To invoke an __init function during the initialization phase, Linux offers an extensive set of macros (defined in <linux/init.h>); module_init() is a well-known example. Each of these macros puts a function pointer passed as its parameter in a .initcalli.init section, where i reflects the call's class (__init functions are grouped in several classes). During system initialization, Linux uses the section delimiters to successively invoke all of the functions pointed to.
- Pairs of instruction pointers
The _ASM_EXTABLE(addr1, addr2) macro allows the page fault exception handler to determine whether an exception was caused by a kernel instruction at address addr1 while trying to read or write a byte into a process address space. If so, the kernel jumps to addr2, which contains the fixup code; otherwise a kernel oops occurs. The delimiters of the __ex_table special section (see the previous linker script example) set the range of critical kernel instructions that transfer bytes from or to user space.
- Pairs of addresses
The EXPORT_SYMBOL() macro mentioned earlier also inserts in the ksymtab (or ksymtab_gpl) special section a pair of addresses: the identifier's address and the address of the corresponding string constant in __ksymtab_strings. When linking a module, the special sections filled by EXPORT_SYMBOL() allow the kernel to do a binary search to determine whether an identifier declared as extern by the module belongs to the set of exported symbols.
- Relative addresses
On SMP systems, the DEFINE_PER_CPU(type, varname) macro inserts the varname uninitialized global variable of type in the .data..percpu special section. Variables stored in that section are called per-CPU variables. Since .data..percpu is stored in a segment whose initial address is set to 0, the addresses of per-CPU variables are relative addresses. During system initialization, Linux allocates a memory area large enough to store the NR_CPUS groups of per-CPU variables. The section delimiters are used to determine the size of the group.
- Structures
The kernel's SMP alternatives mechanism allows a single kernel to be built optimally for multiple versions of a given processor architecture. Through the magic of boot-time code patching, advanced instructions can be exploited if, and only if, the system's processor is able to execute those instructions. This mechanism is controlled with the alternative() macro:
alternative(oldinstr, newinstr, feature);
This macro first stores oldinstr in the .text regular section. It then stores in the .altinstructions special section a structure that includes the following fields: the address of the oldinstr, the address of the newinstr, the feature flags, the length of the oldinstr, and the length of the newinstr. It stores newinstr in a .altinstr_replacement special section. Early in the boot process, every alternative instruction which is supported by the running processor is patched directly into the loaded kernel image; it will be filled with no-op instructions if need be.
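Each entry in .altinstructions is a small descriptor along the following lines; the field names here are illustrative, since the kernel's real struct alt_instr has varied across versions and architectures:

```c
/* One record per alternative() site, consumed during early boot. */
struct alt_instr_sketch {
	void		*old_addr;	/* oldinstr's location in .text */
	void		*new_addr;	/* newinstr in .altinstr_replacement */
	unsigned int	feature;	/* CPU feature bit gating the patch */
	unsigned char	old_len;	/* length of oldinstr, in bytes */
	unsigned char	new_len;	/* length of newinstr, in bytes */
};
```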
The .modinfo section is used by the modinfo command to show information about the kernel module. The data stored in the section is not loaded in the kernel address space. The .gnu.linkonce.this_module special section includes a module structure which contains, among other fields, the module's name. When inserting a module, the init_module() system call reads the module structure from this special section into an area of dynamic memory.
Conclusion
Although special sections can be defined in application programs too, there is no doubt that kernel developers have been quite creative in exploiting them. In fact, the examples listed above are by no means exhaustive and new special sections keep popping up in recent kernel releases. Without special sections, implementing some kernel features like those above would be rather difficult.
Brief items
Quote of the week
MediaGoblin 0.3.2 "Goblinverse" released
Hot on the heels of a successful fundraising campaign, the MediaGoblin decentralized media publishing platform has released version 0.3.2. The headline feature in the release is support for 3D models. "We've blogged about this, we've collared people at holiday parties, we've done everything but make a Gangnam Style parody video about it... but in case you haven't heard, you can now upload 3d models to MediaGoblin, whoo! This means you can build your own free-as-in-freedom Thingiverse replacement and start printing out objects. We support the sharing of STL and OBJ files. MediaGoblin can also call on Blender to create nice image previews during upload. Or if you prefer, you can use javascript to display live 3d previews in webgl-enabled browsers (we use the thingiview.js library to do this)."
LLVM 3.2 released
Version 3.2 of the LLVM compiler system and Clang C compiler has been released. "Despite only it being a bit over 6 months of development since 3.1, LLVM 3.2 is a huge leap, delivering a wide range of improvements and new features. Clang now includes industry-leading C++'11 support, improved diagnostics, C11 and Objective-C improvements (including 'ObjC literals' support), and the Clang static analyzer now has the ability to do inter-procedural (cross- function) analysis along with improved Objective-C support." See the release notes for lots of details.
Awesome 3.5 released
Version 3.5 of the "Awesome" window manager has been released. "The last major release happened more than three years ago. However, even longer ago, a civilization known as the 'Maya' predicted that today a great pain will be brought to everyone (Don't trust the 'Date' header of this mail or you will get a long and weird explanation about time zones and other weak excuses). Today is the day of thousand crys from users whose config broke. Today is the end. Welcome to the time after the end." See this message for a summary of changes in this release, or this LWN review of Awesome from 2011.
Enlightenment 17 released
Enlightenment DR 0.17.0 (E17) has been released, along with version 1.7.4 of the Enlightenment Foundation Libraries. LWN looked at Enlightenment in August 2011.
GNU C library 2.17 released
Version 2.17 of the GNU C library (glibc) is available. This release includes a port to ARM AArch64, contributed by Linaro, as well as a lot of bug fixes. The minimum Linux kernel version supported by this glibc release is 2.6.16.
GNU sed 4.2.2 released; maintainer resigns
Version 4.2.2 of the GNU stream editor "sed" is out. There's a number of new features, but the announcement also includes the resignation of the sed maintainer. "Barring any large change in policy and momentum from GNU, these three reasons are bound to be the first step towards the irrelevance of GNU. And barring any such policy change, I have no reason to be part of GNU anymore."
A proposal for "rebooted" Python asynchronous I/O support
Much of the discussion on the Python mailing lists in recent times has been devoted to the topic of a new framework to support the development of "event loop" programs in Python 3. That discussion has been pulled together into PEP 3156; there is an accompanying reference implementation currently called "tulip". Guido van Rossum is seeking comments on both the proposal and the implementation. Click below for the full text of the proposal.
Simon 0.4.0 released
Simon is a system for speech recognition; version 0.4.0 is now available. "This new version of the open source speech recognition system Simon features a whole new recognition layer, context-awareness for improved accuracy and performance, a dialog system able to hold whole conversations with the user and more."
GMP 5.1.0 available
Version 5.1.0 of the GNU Multiple Precision Arithmetic Library (GMP) has been released. A number of speed optimizations have been added, as has support for new processors. New functions for "multi-factorials, and primorial: mpz_2fac_ui, mpz_mfac_uiui and mpz_primorial_ui" have also been added.
Twisted 12.3 available
Version 12.3 of the Twisted framework has been released. This version adds partial support for Python 3.3, among other changes.
LightZone is now an open source project
LightZone, a multi-platform digital photo editor that started out as a proprietary product, has been released as an open source project. The code can be found at lightzoneproject.org.
GNU Automake 1.13 released
GNU Automake 1.13 has been released. This is a major update with several important changes, among them the ability to define custom recursive targets and changes to several macros.
Newsletters and articles
Development newsletters from the past week (and the week before that)
- Ruby Weekly (December 20)
- Caml Weekly News (December 25)
- What's cooking in git.git (December 21)
- What's cooking in git.git (December 26)
- Mozilla Hacks Weekly (December 20)
- Openstack Community Weekly Newsletter (December 21)
- Perl Weekly (December 24)
- PostgreSQL Weekly News (December 23)
- Ruby Weekly (December 27)
- TUGboat (December 31)
- Caml Weekly News (January 1)
- What's cooking in git.git (December 31)
- What's cooking in git.git (January 1)
- What's cooking in git.git (January 3)
- Openstack Community Weekly Newsletter (December 28)
- Perl Weekly (December 31)
- PostgreSQL Weekly News (December 31)
- Ruby Weekly (January 3)
Eben Upton: An educational life of Pi (The H)
The H interviews Raspberry Pi Foundation executive director Eben Upton about the educational mission of the foundation—something that got a bit lost in the excitement over the hardware. "The nice thing is that almost all of the good CS teaching software already runs on Linux, so the bulk of the work is in making sure it works well on the Pi, rather than developing things from a standing start. MIT Scratch is actually a great example of this – it's built on top of the Squeak Smalltalk VM, and because this has generally only been run in anger on modern desktop hardware there hasn't previously been a case for heavy optimisation of its graphics routines, so it's a little sluggish on the Pi right now. We've commissioned a couple of pieces of work, the first of which involves porting it to use Pixman as its rendering backend, and the second involves optimising Pixman itself for the Pi's ARMv6 architecture (which will obviously pay dividends elsewhere in the system too)."
7 Embedded Linux Stories to Watch in 2013 (Linux.com)
Linux.com looks ahead to where embedded Linux is heading for 2013. The article forecasts Linux to replace realtime operating systems (RTOS) in many devices, Android getting into traditional embedded devices, more open source embedded Linux projects becoming available, expansion for Linux in the mobile and automotive spaces, and more. "As Android enters the general embedded realm, several new Linux-based mobile OSes are stepping out to compete in the smartphone market. In 2013, the Linux Foundation's Tizen, Mozilla's Firefox OS, and Jolla's Meego spinoff, Sailfish, all plan to ship on new phones. If that's not enough, an upcoming mobile version of Ubuntu is due in 2014, HP's Open WebOS may yet reawaken on new hardware, and even the GNOME Foundation is planning a mobile-ready, developer-focused GNOME OS."
Märdian: Openmoko/OpenPhoenux GTA04 jumps off
Lukas 'Slyon' Märdian looks at an Openmoko-based smartphone. The Openmoko smartphone efforts were abandoned some time ago, but Golden Delicious Computers has taken the code and created the OpenPhoenux GTA04. "Golden Delicious Computers and the enthusiasts from the Openmoko community started off with the idea of stuffing a BeagleBoard into a Neo Freerunner case and connecting an USB UMTS dongle to it – this was the first prototype GTA04A1, announced in late 2010 and presented at OHSW 2010 and FOSDEM 2011." At this time there are about 300 GTA04 (A3+A4) devices in the wild and the company has GTA04A5 phones in production. (Thanks to Paul Wise)
Ten simple rules for the open development of scientific software
Here is some advice for scientists developing open-source software published on the PLOS Computational Biology site in early December. "The sustainability of software after publication is probably the biggest problem faced by researchers who develop it, and it is here that participating in open development from the outset can make the biggest impact. Grant-based funding is often exhausted shortly after new software is released, and without support, in-house maintenance of the software and the systems it depends on becomes a struggle. As a consequence, the software will cease to work or become unavailable for download fairly quickly, which may contravene archival policies stipulated by your journal or funding body. A collaborative and open project allows you to spread the resource and maintenance load to minimize these risks, and significantly contributes to the sustainability of your software."
Page editor: Nathan Willis
Announcements
Brief items
Canonical to demonstrate Ubuntu on phones
Canonical's much-hyped January 2 announcement is the availability of a version of the Ubuntu distribution for mobile phones; it will be on display at the Consumer Electronics Show starting January 7. "Your phone is more immersive, the screen is less cluttered, and you flow naturally from app to app with edge magic. The phone becomes a full PC and thin client when docked. Ubuntu delivers a magical phone that is faster to run, faster to use and fits perfectly into the Ubuntu family." There is no word about uptake by any handset manufacturers.
Petition: promote the use of free software in US schools
The US Whitehouse.gov site has a petition to promote the use of free software in schools. "Each year our educational system wastes billions of dollars for the purchase and support of proprietary operating systems and application software in our schools. The software is rigid and inflexible, opaque in its design and mysterious to our children. We advocate and propose the gradual replacement of privately owned software with restrictive licensing in favor of open source alternatives with GPL type licenses. In as much as possible we should have our students using software that complies with the definition of free software as defined by the Free Software Foundation." Registration is required to sign the petition. (Thanks to Davide Del Vento)
Articles of interest
Free Software Supporter -- Issue 57, December 2012
This edition of the Free Software Foundation's newsletter covers a Gnu bearing gifts, an interview with Kovid Goyal of Calibre, Bradley Kuhn on *Oracle v. Google*, new GNU releases, and several other topics.
FSFE: Fellowship interview with Anna Morris
The Free Software Foundation Europe talks with Anna Morris, co-founder of the FLOSSIE conference for women in Free Software, Manchester Fellowship Group Deputy Coordinator, and Co-Director of Ethical Pets Ltd. "Perhaps most importantly, the learning and subsequent freedom that I have achieved is also down to Free software (and the community surrounding it). Free Software challenges you to learn: to do for yourself, to be fearlessly independent when it comes to your tech. In the past few years I have taken pride in watching my skill base catchup with and overtake that of my proprietary-loving peers, even some paid professionals, simply by having a free and curious mindset. Free software frees you in many ways."
The H Year: 2012's Wins, Fails and Mehs
The H rates some highlights of 2012. "Win – The Linux community reboots Secure Boot – Microsoft's requirement that OEMs start using UEFI's Secure Boot function had caused much concern within the Linux community, but when that had died down, developers at Red Hat, SUSE, Canonical and the Linux Foundation worked on a range of solutions for Linux distributions, large and small, to use if they wanted to boot on a machine with Secure Boot enabled and a user not capable of disabling it. Good ideas and information was exchanged, code was written and answers were found; that's how things should work."
Upcoming Events
SCALE 11X: Discounts, UpSCALE and more
The Southern California Linux Expo (SCALE) will take place February 22-24, 2013 in Los Angeles, CA. Early bird registration ends January 8. The call for UpSCALE (short) talks is still open. There will be 'SCALE: The Next Generation', a "youth driven" conference, and that call for papers is open.
Events: January 4, 2013 to March 5, 2013
The following event listing is taken from the LWN.net Calendar.
| Date(s) | Event | Location |
|---|---|---|
| January 18-20 | FUDCon: Lawrence 2013 | Lawrence, Kansas, USA |
| January 18-19 | Columbus Python Workshop | Columbus, OH, USA |
| January 20 | Berlin Open Source Meetup | Berlin, Germany |
| January 28-February 2 | Linux.conf.au 2013 | Canberra, Australia |
| February 2-3 | Free and Open Source software Developers' European Meeting | Brussels, Belgium |
| February 15-17 | Linux Vacation / Eastern Europe 2013 Winter Edition | Minsk, Belarus |
| February 18-19 | Android Builders Summit | San Francisco, CA, USA |
| February 20-22 | Embedded Linux Conference | San Francisco, CA, USA |
| February 22-24 | Southern California Linux Expo | Los Angeles, CA, USA |
| February 22-24 | FOSSMeet 2013 | Calicut, India |
| February 22-24 | Mini DebConf at FOSSMeet 2013 | Calicut, India |
| February 23-24 | DevConf.cz 2013 | Brno, Czech Republic |
| February 25-March 1 | ConFoo | Montreal, Canada |
| February 26-March 1 | GUUG Spring Conference 2013 | Frankfurt, Germany |
| February 26-28 | ApacheCon NA 2013 | Portland, Oregon, USA |
| February 26-28 | O'Reilly Strata Conference | Santa Clara, CA, USA |
| March 4-8 | LCA13: Linaro Connect Asia | Hong Kong, China |
If your event does not appear here, please tell us about it.
Page editor: Rebecca Sobol