By Jonathan Corbet
October 10, 2012
Those who like to complain about udev, systemd, and their current
maintainers have had no shortage of company recently as the result of a
somewhat incendiary discussion on the linux-kernel mailing list.
Underneath the flames,
though, lie some important issues: who decides what constitutes appropriate
behavior for kernel device drivers, how strong is our commitment to
backward compatibility, and which tasks are best handled in the kernel
without calling out to user space?
The udev process is responsible for a number of tasks, most
initiated as the result of events originating in the kernel. It responds
to device creation events by making device nodes, setting permissions, and,
possibly, running a setup program. It also handles module loading requests
and firmware requests from the kernel. So, for example, when a driver
calls request_firmware(), that request is turned into an event
that is passed to the udev process. Udev will, in response, locate the
firmware file, read its contents, and pass the data back to the kernel.
The driver will get its firmware blob without having to know anything about
how things are organized in user space, and everybody should be happy.
Back in January, the udev developers decided to implement a stricter notion
of sequencing between various types of events. No events for a specific
device, they decided, would be processed until the process of loading the
driver module for that device had completed. Doing things this way makes
it easier for them to keep things straight in user space and to avoid
attempting operations that the kernel is not yet ready to handle. But it
also created problems for some types of drivers. In particular, if a
driver tries to load device firmware during the module initialization
process, things will appear to lock up. Udev sees that the module is not
yet initialized, so it will hold onto the firmware request and everything
stops. Udev developer Kay Sievers warned the
world about this problem last January:
We might need to work around that in the current udev for now, but
these drivers will definitely break in future udev versions.
Userspace, these days, should not be in charge of papering over
obvious kernel bugs like this.
The problem with this line of reasoning, of course, is that one person's
kernel bug is another's user-space problem. Firmware loading at module
initialization time has worked just fine for a long time — if one ignores
little problems like built-in modules, booting with init=/bin/sh,
and other situations where proper user-space support is not present when
the request_firmware() call takes place. What matters most is
that it works for a normal
bootstrap on a typical distribution install. The udev sequencing change
breaks that: users of a number of
distributions have been reporting that things no longer work properly with
newer versions of udev installed.
Breaking currently-running systems is something the kernel development
community tries hard to avoid, so it is not surprising that there was some
disagreement over the appropriateness of the udev changes. Even so,
various kernel developers were trying to work around the problems when
Linus threw a bit of a tantrum, saying that
the problem lies with udev and needs to be fixed there. He did not get the
response that he was hoping for.
Kay answered that, despite the problem
reports, udev had not yet been fixed, saying "we still haven't
wrapped our head around how to fix it/work around it." He pointed
out that things don't really hang, they just get "slow" while waiting for a
30-second timeout to expire. And he reiterated his position that the real
problem lies in the kernel and should be fixed there. Linus was unimpressed, but, since he does not maintain
udev, there is not a whole lot that he can do directly to solve the
problem.
Or, then again, maybe there is. One possibility raised by a few developers
was pulling udev into the kernel source tree and maintaining it as part of
the kernel development process. There was a certain amount of support for this idea,
but nobody actually stepped up to take responsibility for maintaining udev
in that environment. Such a move would represent a fork of a significant
package that would take it in a new direction; current plans are to
integrate udev more thoroughly with systemd.
The current udev developers thus seem unlikely to support putting udev in
the kernel tree. Getting distributors to adopt the kernel's version of
udev could also prove to be a challenge. In general, it is the sort of
mess that is best avoided if at all possible.
An alternative is to simply short out udev for firmware loading
altogether. That is, in fact, what has been done; the 3.7 kernel will
include a patch (from Linus) that causes firmware loading to be done
directly from the kernel without involving user space at all. If the
kernel is unable to find the firmware file in the expected places (under
/lib/firmware and variants) it will fall back to sending a request
to udev in the usual manner. But if the kernel-space load attempt works,
then udev will never even know that the firmware request was made.
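
The search-then-fall-back behavior described above can be illustrated with a small user-space sketch in C. It is not the kernel patch: the function name find_firmware(), the caller-supplied directory list, and the return convention are all assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Illustrative only (hypothetical helper, not kernel code): try each
 * directory in turn and report the first readable match.
 * Returns 0 and fills `out` with the path on success. Returns -1 and
 * leaves `out` empty when no directory holds the file; that is the point
 * at which the kernel would fall back to sending the request to udev.
 */
int find_firmware(const char *const dirs[], size_t ndirs,
                  const char *name, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);

    for (size_t i = 0; i < ndirs; i++) {
        int n = snprintf(out, outlen, "%s/%s", dirs[i], name);
        if (n < 0 || (size_t)n >= outlen)
            continue;           /* path does not fit in the buffer */
        if (access(out, R_OK) == 0)
            return 0;           /* found without involving user space */
    }
    out[0] = '\0';
    return -1;                  /* caller falls back to the udev path */
}
```

A caller would pass the firmware directories in priority order and treat a return value of -1 as the cue to issue the traditional user-space request.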
This appears to be a solution that is workable for everybody involved.
There is nothing particularly tricky about firmware loading, so few
developers seem to have concerns about doing it directly from the kernel.
Kay supports the idea as well, saying
"I would absolutely like to get udev entirely out of the sick game of
firmware loading." The real proof will be in how well the concept
works once the 3.7 kernel starts seeing widespread testing, but the initial
indications are that there will not be a lot of problems. If things stay
that way, it would not be surprising to see the direct firmware loading
patch backported to the stable series — once it has gained a few amenities
like user-configurable paths.
One of the biggest challenges in kernel design can be determining what
should be done in the kernel and what should be pushed out to user space.
The user-space solution is often appealing; it can simplify kernel code and
make it easier for others to implement their own policies. But an overly
heavy reliance on user space can lead to just the sort of difficulty seen
with firmware loading. In this case, it appears, the problem was better
solved in the kernel; fortunately, it seems to have been a relatively
easy one for the kernel to take back without causing compatibility
problems.
October 10, 2012
This article was contributed by Joey Hess
CIA didn't seem important until it was gone. For developers and users on
IRC networks like Freenode, CIA was just there in the background,
relaying commit messages into the channels of thousands of projects in real
time—until recently.
CIA.vc was a central clearinghouse for commit messages sent to it from
ten thousand or more version control repositories. There were CIA hooks
for Subversion, Git, bzr, and others, so a project just had to install
such a hook into its repository and register on the CIA website.
CIA handled the rest, collecting the commit messages as they came in and
announcing them on appropriate channels via its swarm of IRC bots. Here is
an example from the #commits channel from April:
<CIA-93> OpenWrt: [packages] fwknop: update to 2.0, use new startup commands
<CIA-93> vlc: Pierre Ynard master * r31b5fbdb6d vlc/modules/lua/libs/equalizer.c:
lua: fix memory and object leak and reset locale on error path
<CIA-93> FreeBSD: rakuco * ports/graphics/autoq3d/files/
(patch-src__cmds__cmds.cpp . patch-src__fgui__cadform.cpp):
<CIA-93> FreeBSD: Make the port build with gcc 4.6 (and possibly other compilers).
<CIA-93> gentoo: robbat2 * gentoo/xml/htdocs/proj/en/perl/outdated-cpan-packages.xml:
Automated update of outdated-cpan-packages.xml
<CIA-93> compiz-fusion: a.j.buxton master * /fusion/plugins-main/src/ezoom/ezoom.c:
For a decade, the CIA bots were part of the infrastructure of many
projects, which, along with their bug tracker, mailing lists, wiki, and
version control system, helped tie communities together.
Eric S. Raymond described the effect
of the CIA service as follows:
It makes IRC conversations among a development group more productive. It
also does something unquantifiable but good to the
coherence of the development groups that use it, and the coherence of
the open-source community as a whole — when the service was live it
was hard to watch #commits for any length of time without being
impressed and encouraged.
That stream of notifications dried up on September 26th,
when CIA.vc was shut down,
due to a miscommunication with a hosting provider. It seems there were no
backups. It is unclear if CIA will return, but there are two possible
replacements available now.
irker
Irker is a simple replacement for CIA
that was announced just three days later.
Raymond had been developing it even before CIA went down, and gave it
a very different architecture from that of the centralized CIA service.
Irker consists of two pieces: a server that acts as a simple relay to IRC
and a
client that sends messages to be relayed. The server has no knowledge of
version control systems or commits,
and could be used to relay any sort of content. All the
version-control-specific code necessary to extract and format the message
is in the client,
which is run by a version control hook script.
The irker client and server typically both run on the same machine or LAN,
so each project or hosting service is responsible for running its own
service, rather
than relying on a centralized service like CIA.
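
The division of labor described above, where the client does all of the formatting and the server only relays, can be sketched in C. This is a minimal illustration of the client side, assuming the relay accepts a JSON object with "to" and "privmsg" fields as irker's documentation describes; the function names, the channel URL used in the usage note, and the decision to flatten control characters to spaces are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append `src` verbatim; fail if it would not leave room for the NUL. */
static int append_raw(char *dst, size_t cap, size_t *pos, const char *src)
{
    size_t n = strlen(src);
    if (*pos + n >= cap)
        return -1;
    memcpy(dst + *pos, src, n);
    *pos += n;
    return 0;
}

/* Append `src` as JSON string content, escaping quotes and backslashes. */
static int append_escaped(char *dst, size_t cap, size_t *pos, const char *src)
{
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (c == '"' || c == '\\') {
            if (*pos + 2 >= cap)
                return -1;
            dst[(*pos)++] = '\\';
            dst[(*pos)++] = (char)c;
        } else {
            if (*pos + 1 >= cap)
                return -1;
            /* Simplification: control characters become spaces. */
            dst[(*pos)++] = (c < 0x20) ? ' ' : (char)c;
        }
    }
    return 0;
}

/*
 * Hypothetical client-side helper: build the request that a hook script
 * would hand to the relay. Returns the length written, or -1 if the
 * buffer is too small. The relay needs no knowledge of commits; it only
 * sees a destination and a line of text.
 */
int format_relay_request(char *out, size_t cap,
                         const char *channel_url, const char *text)
{
    size_t pos = 0;

    assert(out != NULL && cap > 0);
    if (append_raw(out, cap, &pos, "{\"to\": \"") ||
        append_escaped(out, cap, &pos, channel_url) ||
        append_raw(out, cap, &pos, "\", \"privmsg\": \"") ||
        append_escaped(out, cap, &pos, text) ||
        append_raw(out, cap, &pos, "\"}"))
        return -1;
    out[pos] = '\0';
    return (int)pos;
}
```

A hook script built this way might call format_relay_request() with a destination such as "irc://chat.freenode.net/#commits" and a one-line commit summary, then write the resulting string to the relay's socket.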
Irker has undergone heavy development since the announcement, and is now
considered basically feature complete. Its simple and general design
is likely to lead to other things being built on top of it. For example,
there is a CIA to irker proxy
for sites that want to retain their current CIA hooks.
KGB
Although irker made a splash when CIA died, another clone has quietly been
overlooked for years. KGB was
developed by Martín Ferrari and Damyan Ivanov of the Debian project and
released in 2009.
KGB is shipped in the current Debian stable release, as well as in Ubuntu
universe, making it
easy to deploy as a
replacement for CIA.
KGB is, like irker, a decentralized client-server system.
Unlike irker's content-agnostic server, the KGB server is responsible for
formatting notifications from commit information it receives from its
clients. Though a less flexible design, this does insulate the
clients from some details of IRC, particularly from message length limits.
KGB has enjoyed a pronounced upswing in feature requests and development
since CIA went down, gaining features such as web links to commits, URL
shortening, and the ability to broadcast all projects'
notifications to a channel like #commits. Developer Martín Ferrari says:
For a small project that was mainly developed and maintained for our own
use, this was quite some unexpected popularity!
Will CIA.vc return?
The CIA.vc website currently promises an attempt to
revive the service. Any attempt to do so will surely face numerous
challenges. Not least is the missing database, which configured much of
CIA's behavior. Unless a recent backup of the database is found, any
revived CIA.vc will certainly need much configuration to return it to its
past functionality.
CIA's code base, while still available,
is large and complex with many moving parts written in
different languages, is reputedly difficult to install,
and has been neglected for years. Raymond's opinion is that
"CIA suffered a complexity collapse", and
as he said:
"It is notoriously difficult to un-collapse a rubble pile".
Even if CIA does eventually return, it seems likely that many projects will
have moved away from it for good, deploying their own irker or KGB bots.
The Apache Software Foundation, KDE project, and Debian's Alioth project
hosting site have already deployed their own bots. If the larger hosting
sites like GitHub, SourceForge, and Savannah follow suit, any revived CIA
may be reduced to being, at best, a third player.
Conclusion
CIA.vc was a centralized service, with code that is free software, but with
a design and implementation that did not encourage reuse. The service was
widely used by the community, which mostly seems to have put up
with its instability, its UTF-8 display bugs, its odd formatting of git
revision numbers, and its often crufty hook scripts.
According to CIA's author,
Micah Dowty, it never achieved a "critical mass of
involvement" from contributors. Perhaps CIA was not seen as
important enough to work on. But with two replacements now being developed,
there is certainly evidence of interest. Or perhaps CIA did not present itself
as a free software project, and so was instead treated as simply the
service that
it appeared to be. CIA's website featured things like a commit leaderboard
and new project list, which certainly helped entice people to use it. (Your
author must confess to occasionally trying to fit enough commits into a day to
get to the top of that leaderboard.) But the website did not encourage
people to file bugs or patches.
In a way, the story of CIA mirrors the story of the version control systems
it reported on. When CIA began in 2003, centralized version control
was the norm. The Linux kernel used distributed version control only thanks
to the proprietary Bitkeeper, which itself ran a centralized commit
publication service. These choices were entirely pragmatic, and the
centralized CIA was perhaps in keeping with the times.
Much as happened with version control, the community has gone from being
reliant on a centralized service, to having a choice of
decentralized alternatives. As a result, new features
are rapidly emerging in both KGB and irker that CIA never provided. This is
certainly a healthy response to CIA's closure, but it also seems
that our many years of reliance on the centralized service held us back
from exploring the space that CIA occupied.
By Nathan Willis
October 10, 2012
There was no security track at the 2012 Automotive
Linux Summit,
but numerous sessions and the "hallway track"
featured anecdotes about the ease of compromising car
computers. This is no surprise: as Linux makes inroads into
automotive computing, the security question takes on an urgency not
found on desktops and
servers. Too often, though, Linux and open source
software in general are perceived as insufficiently battle-hardened
for the safety-critical needs of highway speed computing —
reading the comments on an automotive Linux news story it is easy to find
a skeptic scoffing that he or she would not trust Linux to manage the
engine, brakes, or airbags. While hackers in other embedded
Linux realms may understandably feel miffed at such a slight, the
bigger problem is said skeptic's presumption that a modern Linux-free
car is a secure environment — which is demonstrably untrue.
First, there is a mistaken assumption that computing is not
yet a pervasive part of modern automobiles. Likewise mistaken
is the assumption that safety-critical systems (such as the
aforementioned brakes, airbags, and engine) are properly isolated from
low-security components (like the entertainment head unit) and are not
vulnerable to attack. It is also incorrectly assumed that the low-security
systems themselves do not harbor risks to drivers and passengers. In
reality, modern cars have shipped with multiple embedded computers for years
(many of which are mandatory by government order), presenting a large
attack surface with numerous risks to personal safety, theft,
eavesdropping, and other exploits. But rather than exacerbating this
situation, Linux and open source adoption stand to improve it.
There is an abundance of research dealing with hypothetical exploits
to automotive computers, but the seminal work on practical exploits
is a pair of papers from the Center for Automotive Embedded
Systems Security (CAESS), a team from the University of California San
Diego and the University of Washington. CAESS published a 2010 report [PDF]
detailing attacks that they managed to implement against a pair of late-model
sedans via the vehicles' Controller Area Network (CAN) bus, and a
2011 report [PDF] detailing how they managed to access the CAN network from
outside the car, including through service station diagnostic scanners,
Bluetooth, FM radio, and cellular modem.
Exploits
The 2010 paper begins by addressing the connectivity of modern cars.
CAESS did not disclose the brand of vehicle they experimented on
(although car mavens could probably identify it from the photographs),
but they purchased two vehicles and experimented with them on the lab
bench, on a garage lift, and finally on a closed test track. The cars
were not high-end, but they provided a wide range of targets.
Embedded electronic control units (ECUs) are found all over the
automobile, monitoring and reporting on everything from the engine to
the door locks, not to mention lighting, environmental controls, the
dash instrument panel, tire pressure sensors, steering, braking, and
so forth.
Not every ECU is designed to control a portion of the
vehicle, but due to the nature of the CAN bus, any ECU can be used to
mount an attack. CAN is roughly equivalent to a link-layer protocol,
but it is broadcast-only, does not employ source addressing or
authentication, and is easily susceptible to denial-of-service attacks
(either through simple flooding or by broadcasting messages with
high-priority message IDs, which force all other nodes to back off and
wait). With a device plugged into the CAN bus (such as through the
OBD-II port mandatory on all 1996-or-newer vehicles in the US),
attackers can spoof messages from any ECU. There are often
higher-level protocols employed, but CAESS was able to
reverse-engineer the protocols in its test vehicles and found security
holes that allow attackers to brute-force the challenge-response
system in a matter of days.
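
Why a high-priority message ID is enough to starve the bus can be seen in a toy model of CAN arbitration, sketched here in C. The struct layout and function name are hypothetical, and the rule that the numerically lowest identifier wins is standard CAN behavior rather than something taken from the CAESS paper; the sketch models only the outcome of arbitration, not the bit-level signaling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified frame: note that there is no sender field. */
struct sim_frame {
    uint16_t id;        /* 11-bit identifier; doubles as the priority */
    uint8_t  len;       /* number of valid bytes in data[] */
    uint8_t  data[8];
};

/*
 * Return the index of the frame that wins arbitration among nodes that
 * start transmitting at the same moment. On a real bus a dominant 0 bit
 * overrides a recessive 1 bit as the identifier is sent, so the lowest
 * identifier wins and every other node backs off and retries later.
 */
size_t arbitrate(const struct sim_frame *pending, size_t n)
{
    assert(pending != NULL && n > 0);

    size_t winner = 0;
    for (size_t i = 1; i < n; i++)
        if (pending[i].id < pending[winner].id)
            winner = i;
    return winner;
}
```

In this model, a node that keeps queueing frames with identifier 0 wins every round, and since receivers see only the identifier and payload, nothing in the frame reveals which node actually sent it.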
CAESS's test vehicles did separate the CAN bus into high-priority and
low-priority segments, providing a measure of isolation. However,
this also proved to be inadequate, as there were a number of ECUs
that were connected to both segments and which could therefore be used to bridge
messages between them. That set-up is not an error, however; despite
common thinking on the subject, quite a few features demanded by car
buyers rely on coordinating between the high- and low-priority
devices.
For example, electronic stability control involves measuring
wheel speed, steering angle, throttle, and brakes. Cruise control
involves throttle, brakes, speedometer readings, and possibly
ultrasonic range sensors (for collision avoidance). Even the lowly
door lock must be connected to multiple systems: wireless key fobs,
speed sensors (to lock the doors when in motion), and the cellular
network (so that remote roadside assistance can unlock the car).
The paper details a number of attacks the team deployed against the
test vehicles. The team wrote a tool called CarShark to
analyze and inject CAN bus packets, which provided a method to mount
many attacks. However, the vehicle's diagnostic service (called
DeviceControl) also proved to be a useful platform for attack.
DeviceControl is intended for use by dealers and service stations, but
it was easy to reverse engineer, and subsequently allowed a number of
additional attacks (such as sending an ECU the "disable all CAN bus
communication" command, which effectively shuts off part of the car).
The actual attacks tested include some startlingly dangerous tricks,
such as disabling the brakes. But the team also managed to create
combined attacks that put drivers at risk even with "low risk"
components — displaying false speedometer or fuel gauge
readings, disabling dash and interior lights, and so forth.
Ultimately the team was able to gain control of every ECU in the car,
load and execute custom software, and erase traces of the attack.
Some of these attacks exploited components that did not adhere to the
protocol specification. For example, several ECUs allowed their firmware to be
re-flashed while the car was in motion, which is expressly forbidden
for obvious safety reasons. Other attacks were enabled by
run-of-the-mill implementation errors, such as components that re-used
the same challenge-response seed value every time they were
power-cycled. But ultimately, the critical factor was the fact that
any device on the vehicle's internal bus can be used to mount an
attack; there is no "lock box" protecting the vital systems, and the
protocol at the core of the network lacks fundamental security features
taken for granted on other computing platforms.
Vectors
Of course, all of the attacks described in the 2010 paper relied on an
attacker with direct access to the vehicle. That did not necessarily
mean ongoing access; they explained that a dongle attached to
the OBD-II port could work at cracking the challenge-response system
while left unattended. But, even though there are a number of
individuals with access to a driver's car over the course of a year
(from mechanics to valets), direct access is still a hurdle.
The 2011 paper looked at vectors to attack the car remotely, to assess
the potential for an attacker to gain access to the car's internal CAN
bus, at which point any of the attacks crafted in the 2010 paper could
easily be executed. It considered three scenarios: indirect physical
access, short-range wireless networking, and long-range wireless
networking. As one might fear, all three presented opportunities.
The indirect physical access involved compromising the CD player and
the dealership or service station's scanning equipment, which
is physically connected to the car while in the shop for diagnosis.
CAESS found that the model of diagnostic scanner used (which adhered
to a 2004 US government mandated standard called PassThru) was an embedded
Linux device internally, even though it was only used to interface
with a Windows application running on the shop's computer. However,
the scanner was equipped with WiFi, and it broadcast its address and
open TCP port in the clear. The diagnostic application API is
undocumented, but the team sniffed the traffic and found several
exploitable buffer overflows — not to mention extraneous services
like telnet also running on the scanner itself. Taking
control of the scanner and programming it to upload malicious code to
vehicles was little additional trouble.
The CD player attack was different; it started with the CD player's
firmware update facility (which loads new firmware onto the player if
a properly-named file is found on an inserted disc). But the player
can also decode compressed audio files, including undocumented
variants of Windows Media Audio (.WMA) files. CAESS found a buffer
overflow in the .WMA player code, which in turn allowed the team to
load arbitrary code onto the player. As an added bonus, the .WMA file
containing the exploit plays fine on a PC, making it harder to detect.
The short-range wireless attack involved attacking the head unit's
Bluetooth functionality. The team found that a compromised Android
device could be loaded with a trojan horse application designed to
upload malicious code to the car whenever it paired. A second option
was even more troubling; the team discovered that the car's Bluetooth stack
would respond to pairing requests initiated without user
intervention. Successfully pairing a covert Bluetooth device still
required correctly guessing the four-digit authorization PIN, but
since the pairing bypassed the user interface, the attacker could make
repeated attempts without those attempts being logged — and,
once successful, the paired device did not show up in the head
unit's interface, so it could not be removed.
Finally, the long-range wireless attack gained access to the car's CAN
network through the cellular-connected telematics unit (which handles
retrieving data for the navigation system, but is also used to connect
to the car maker's remote service center for roadside assistance and
other tasks). CAESS discovered that although the telematics unit
could use a cellular data connection, it also used a software modem
application to encode digital data in an audio call — for
greater reliability in less-connected regions.
The team
reverse-engineered the signaling and data protocols used by this
software modem, and were subsequently able to call the car from
another cellular device, eventually uploading malicious code through
yet another buffer overflow. Even more disturbingly, the team encoded
this attack into an audio file, then played it back from an MP3 player
into a phone handset, again seizing control over the car.
The team also demonstrated several post-compromise attack-triggering
methods, such as delaying activation of the malicious code payload
until a particular geographic location was reached, or a particular
sensor value (e.g., speed or tire pressure) was read. It also managed
to trigger execution of the payload by using a short-range FM
transmitter to broadcast a specially-encoded Radio Data System (RDS)
message, which vehicles' FM receivers and navigation units decode.
The same attack could be performed over longer distances with a more
powerful transmitter.
Among the practical exploits outlined in the paper are recording
audio through the car's microphone and uploading it to a remote
server, and connecting the car's telematics unit to a hidden IRC
channel, from which attackers can send arbitrary commands
at their leisure. The team speculates on the feasibility of turning
this last attack into a commercial enterprise, building "botnet" style
networks of compromised cars, and on car thieves logging car makes and
models in bulk and selling access to stolen cars in advance, based on
the illicit buyers' preferences.
What about Linux?
If, as CAESS seems to have found, the state of the art is so poor in
automotive computing security, the question becomes how Linux (and
related open source projects) could improve the situation. Certainly
some of the problems the team encountered are out of scope for
automotive Linux projects. For example, several of the simpler ECUs
are unsophisticated microcontrollers; the fact that some of them ship
from the factory with blatant flaws (such as a broken
challenge-response algorithm) is the fault of the manufacturer. But
Linux is expected to run on the higher-end ECUs, such as the IVI head
unit and telematics system, and these components were the nexus for
the more sophisticated attacks.
Several of the sophisticated attacks employed by CAESS relied on
security holes found in application code. The team acknowledged that
security measures (like stack cookies and address space
randomization) that are established practice in other computing
environments simply have not been adopted in automotive system
development for lack of perceived need. Clearly, recognizing that
risk and writing more secure application code would improve things,
regardless of the operating system in question. But the fact that
Linux is so widely deployed elsewhere means that more
security-conscious code is available for the taking than there is for
any other embedded platform.
Consider the Bluetooth attack, for example. Sure, with a little
effort, one could envision a scenario in which unattended Bluetooth
pairing is desirable. In practice, though, Linux's dominance in the
mobile device space means that its developers are more likely to find
and patch such a problem quickly than any tier-one supplier working
in isolation.
One step further is the advantage gained by having Linux serve as a
common platform used by multiple manufacturers. CAESS observed in its
2011 paper that the "glue code" linking discrete modules together was
the greatest source of exploits (e.g., the PassThru diagnostic
scanning device), saying "virtually all vulnerabilities emerged
at the interface boundaries between code written by distinct
organizations." It also noted that this was an artifact of the
automotive supply chain itself, in which individual components were
contracted out to separate companies working from specifications, then
integrated by the car maker once delivered:
Thus, while each supplier does unit testing (according to the
specification) it is difficult for the manufacturer to evaluate
security vulnerabilities that emerge at the integration
stage. Traditional kinds of automated analysis and code reviews
cannot be applied and assumptions not embodied in the
specifications are difficult to unravel. Therefore, while
this outsourcing process might have been appropriate for
purely mechanical systems, it is no longer appropriate for
digital systems that have the potential for remote compromise.
A common platform employed by multiple suppliers would go a long way
toward minimizing this type of issue, and that approach can only work
if the platform is open source.
Finally, the terrifying scope of the attacks carried out in the 2010
paper (and if one does not find them terrifying, one needs to read
them again) ultimately traces back to the insecure design of the CAN bus.
The CAN bus needs to be replaced; working with a standard IP stack,
instead, means not having to reinvent the wheel. The networking angle
has several factors not addressed in CAESS's papers, of course —
most notably the still-emerging standards for vehicle ad-hoc
networking (intended to serve as a vehicle-to-vehicle and
vehicle-to-infrastructure channel).
On that subject, Maxim Raya and Jean-Pierre Hubaux recommend
using public-key infrastructure and other
well-known practices from the general Internet communications realm.
While there might be some skeptics who would argue with Linux's
first-class position as a general networking platform, it should be
clear to all that proprietary lock-in to a single-vendor solution
would do little to improve the vehicle networking problem.
Those on the outside may find the recent push toward Linux in the
automotive industry frustratingly slow — after all, there is
still no GENIVI code visible to non-members. But to conclude that the
pace of development indicates Linux is not up to the task would be a
mistake. The reality is that the automotive computing problem is enormous
in scope — even considering security alone — and Linux and
open source might be the only way to get it under control.
Page editor: Jonathan Corbet
Security
By Michael Kerrisk
October 10, 2012
Loadable kernel modules provide a mechanism to dynamically modify the
functionality of a running system, by allowing code to be loaded and
unloaded from the kernel. Loading code into the kernel via a module has a
number of advantages over building a completely new monolithic kernel from
modified source code. The first of these is that loading a kernel module
does not require a system reboot. This means that new kernel functionality
can be added without disturbing users and applications.
From a developer perspective, implementing new kernel functionality via
modules is faster: a slow "compile kernel, reboot, test" sequence in each
development iteration is instead replaced by a much faster "compile module,
load module, test" sequence. Employing modules can also save memory, since
code in a module can be loaded into memory only when it is actually
needed. Device drivers are often implemented as loadable modules for this
reason.
From a security perspective, loadable modules also have a potential
downside: since a module has full access to kernel memory, it can
compromise the integrity of a system. Although modules can be loaded only
by privileged users, there are still potential security risks, since a
system administrator may be unable to directly verify the authenticity and
origin of a particular kernel module. Providing module-related
infrastructure to support administrators in that task is the subject of
ongoing effort, with one of the most notable pieces being the work to
support module signing.
Kees Cook has recently posted a series of patches that tackle another
facet of the module-verification problem. These patches add a new system
call for loading kernel modules. To understand why the new system call is
useful, we need to start by looking at the existing interface for loading
kernel modules.
The Linux interface for loading kernel modules has had (since kernel
2.6.0) the following form:
int init_module(void *module_image, unsigned long len,
                const char *param_values);
The caller supplies the ELF image of the to-be-loaded module
via the memory buffer pointed to by module_image; len
specifies the size of that buffer. (The param_values argument is
a string that can be used to specify initial values for the module's
parameters.)
The main users of init_module() are the insmod and
modprobe commands. However, any privileged user-space application
(i.e., one with the CAP_SYS_MODULE capability) can load a module
in the same way that these commands do, via a three-step process: opening a
file that contains a suitably built ELF image, reading
or mmap()ing the file's contents into memory, and then calling
init_module().
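The three-step sequence can be sketched from user space as follows. This is a minimal illustration in Python, not what insmod actually does: the helper name load_module_from_buffer() is invented for the example, the raw system call is reached through ctypes because Python's standard library has no wrapper for it, and the syscall numbers are assumptions that hold only for the two architectures listed. Actually loading a module this way requires CAP_SYS_MODULE.

```python
import ctypes
import os
import platform

# Assumption: init_module syscall numbers for two common architectures only.
SYS_INIT_MODULE = {"x86_64": 175, "aarch64": 105}


def load_module_from_buffer(path, param_values=""):
    """Load a module the classic way: open, read, then init_module()."""
    # Steps 1 and 2: open the file and read the ELF image into user memory.
    with open(path, "rb") as image_file:
        image = image_file.read()

    # Step 3: hand the buffer to the kernel. From here on the kernel sees
    # only bytes; it has no record of which file they came from.
    nr = SYS_INIT_MODULE[platform.machine()]
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    rc = libc.syscall(
        ctypes.c_long(nr),
        ctypes.c_char_p(image),
        ctypes.c_ulong(len(image)),
        ctypes.c_char_p(param_values.encode()),
    )
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```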
However, this call sequence is the source of an itch for Kees. Because
the step of obtaining a file descriptor for the image file is separated
from the module-loading step, the operating system loses the ability to
make deductions about the trustworthiness of the module based on its origin
in the filesystem. As Kees said:
being able to reason about the origin of a kernel module would be valuable
in situations where an OS already trusts a specific file system, file, etc,
due to things like security labels or an existing root of trust to a
partition through things like dm-verity.
His solution is fairly straightforward: remove the middle of the three
steps listed above. Instead, the application will open the file and pass
the returned file descriptor directly to the kernel as part of a new
module-loading system call; the kernel then performs the task of reading
the module image from the file as a precursor to loading the module.
Although the concept of the solution is simple, it has been through a
few iterations, with the most notable changes being to details of the
user-space interface. Kees's initial proposal was to hack the existing
init_module() interface, so that if NULL is passed in the
module_image argument, the kernel would interpret the len
argument as a file descriptor. Rusty Russell, the kernel modules subsystem
maintainer, somewhat bluntly suggested that
a new system call would be a better approach, and on the next revision of the patch, H. Peter Anvin
pointed out that the system call would be
better named according to existing conventions, where the file descriptor
analog of an existing system call simply uses the same name as that system
call, but with an "f" prefix. Thus, Kees has arrived at the currently proposed interface:
int finit_module(int fd, const char *param_values);
In the most recent patch, Kees, who works for Google on Chrome OS, has
also further elaborated on the motivations for adding this system call.
Specifically, in order to ensure the integrity of a user's system, the
Chrome OS developers would like to be able to enforce the restriction that
kernel modules are loaded only from the system's read-only,
cryptographically verified root filesystem. Since the developers already
trust the contents of the root filesystem, employing module signatures to verify the contents of a
kernel module would require the addition of an unnecessary set of keys to
the kernel and would also slow down module loading. All that Chrome OS
requires is a light-weight mechanism for verifying that the module image
originates from that filesystem, and the new system call provides just that
facility.
Kees pointed out that the new system call also has potential for wider
use. For example, Linux Security Modules (LSMs) could use it to examine
digital signatures contained in the module file's extended attributes (the
file descriptor provides the kernel with the route to access the extended
attributes). During discussion of the patches, interest in the new system
call was confirmed by the maintainers of the IMA and AppArmor kernel subsystems.
At this stage, there appear to be few roadblocks to getting this system
call into the kernel. The only question is when it will arrive. Kees would
very much like to see the patches go into the currently open 3.7 merge
window, but for various reasons, it appears
probable that they will only be merged in Linux 3.8.
Update, January 2013: finit_module() was indeed merged
in Linux 3.8, but with a changed API that added a flags argument
that can be used to modify the behavior of the system call. Details can be
found in the manual page.
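With the interface as merged in 3.8, the read step disappears: the caller passes an open file descriptor, the parameter string, and the flags argument mentioned in the update. A hedged sketch, again in Python via ctypes: load_module_from_fd() is a made-up helper name, the syscall numbers are assumed for x86-64 and arm64 only, and flags is simply left at zero, since the individual flag values are described in the manual page rather than here.

```python
import ctypes
import os
import platform

# Assumption: finit_module syscall numbers for two common architectures only.
SYS_FINIT_MODULE = {"x86_64": 313, "aarch64": 273}


def load_module_from_fd(path, param_values="", flags=0):
    """Load a module by handing the kernel a file descriptor."""
    # Step 1: open the file. There is no read step; the kernel reads the
    # image itself and can therefore reason about where the file lives.
    fd = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
    try:
        nr = SYS_FINIT_MODULE[platform.machine()]
        libc = ctypes.CDLL(None, use_errno=True)
        libc.syscall.restype = ctypes.c_long
        # Step 2: finit_module(fd, param_values, flags)
        rc = libc.syscall(
            ctypes.c_long(nr),
            ctypes.c_int(fd),
            ctypes.c_char_p(param_values.encode()),
            ctypes.c_int(flags),
        )
        if rc != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)
```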
Brief items
The point is that we in the community need to start the migration away from
SHA-1 and to SHA-2/SHA-3 now.
-- Bruce Schneier
That's because a design flaw in the service [McAfee Secure], and in competing services
offered by Trust Guard and others, makes it easy to discover in almost real
time when a customer has had the seal revoked. A revocation is either a
sign the site has failed to pay its bill, has been inaccessible for a
sustained period of time, or most crucially, is no longer able to pass the
daily security test.
-- Dan Goodin in ars technica
This apparent screw up in the automated filter mistakenly attempts to censor AMC Theatres, BBC, Buzzfeed, CNN, HuffPo, TechCrunch, RealClearPolitics, Rotten Tomatoes, ScienceDirect, Washington Post, Wikipedia and even the U.S. Government.
Judging from the page titles and content the websites in question were targeted because they reference the number "45".
-- TorrentFreak looks at a Microsoft DMCA notice
The Linux Foundation has announced a new boot system meant to make life
easier on UEFI secure boot systems. "In a nutshell, the Linux Foundation
will obtain a Microsoft Key and
sign a small pre-bootloader which will, in turn, chain load (without any
form of signature check) a predesignated boot loader which will, in turn,
boot Linux (or any other operating system). The pre-bootloader will employ
a 'present user' test to ensure that it cannot be used as a vector for any
type of UEFI malware to target secure systems. This pre-bootloader can be
used either to boot a CD/DVD installer or LiveCD distribution or even boot
an installed operating system in secure mode for any distribution that
chooses to use it."
The first draft of the CryptoParty Handbook, a 390-page guide to
maintaining privacy in the networked world, is available.
"This book was written in the first 3 days of
October 2012 at Studio Weise7, Berlin, surrounded by fine food and a lake
of coffee amidst a veritable snake pit of cables. Approximately 20 people
were involved in its creation, some more than others, some local and some
far (Melbourne in particular)." It is available under the (still evolving)
CC-BY-SA 4.0 license. The guide, too, is still evolving; it should probably
be regarded
the way one would look at early-stage cryptographic code. Naturally, the
authors are looking for contributors to help make the next release better.
New vulnerabilities
bacula: information disclosure
| Package(s): | bacula | CVE #(s): | CVE-2012-4430 |
| Created: | October 8, 2012 | Updated: | January 25, 2013 |
| Description: |
From the Debian advisory:
It was discovered that bacula, a network backup service, does not
properly enforce console ACLs. This could allow information about
resources to be dumped by an otherwise-restricted client. |
bind: denial of service
| Package(s): | bind | CVE #(s): | CVE-2012-5166 |
| Created: | October 10, 2012 | Updated: | November 6, 2012 |
| Description: |
From the Mandriva advisory:
A certain combination of records in the RBT could cause named to hang
while populating the additional section of a response. |
hostapd: denial of service
| Package(s): | hostapd | CVE #(s): | CVE-2012-4445 |
| Created: | October 8, 2012 | Updated: | October 19, 2012 |
| Description: |
From the Debian advisory:
Timo Warns discovered that the internal authentication server of hostapd,
a user space IEEE 802.11 AP and IEEE 802.1X/WPA/WPA2/EAP Authenticator,
is vulnerable to a buffer overflow when processing fragmented EAP-TLS
messages. As a result, an internal overflow checking routine terminates
the process. An attacker can abuse this flaw to conduct denial of service
attacks via crafted EAP-TLS messages prior to any authentication. |
libxslt: code execution
| Package(s): | libxslt | CVE #(s): | CVE-2012-2893 |
| Created: | October 4, 2012 | Updated: | October 22, 2012 |
| Description: |
From the Ubuntu advisory:
Cris Neckar discovered that libxslt incorrectly managed memory. If a user
or automated system were tricked into processing a specially crafted XSLT
document, a remote attacker could cause libxslt to crash, causing a denial
of service, or possibly execute arbitrary code. (CVE-2012-2893) |
mozilla: multiple vulnerabilities
| Package(s): | firefox, thunderbird, seamonkey | CVE #(s): | CVE-2012-3983, CVE-2012-3989, CVE-2012-3984, CVE-2012-3985 |
| Created: | October 10, 2012 | Updated: | October 17, 2012 |
| Description: |
From the Ubuntu advisory:
Henrik Skupin, Jesse Ruderman, Christian Holler, Soroush Dalili and others
discovered several memory corruption flaws in Firefox. If a user were
tricked into opening a specially crafted web page, a remote attacker could
cause Firefox to crash or potentially execute arbitrary code as the user
invoking the program. (CVE-2012-3982, CVE-2012-3983, CVE-2012-3988,
CVE-2012-3989)
David Bloom and Jordi Chancel discovered that Firefox did not always
properly handle the <select> element. A remote attacker could exploit this
to conduct URL spoofing and clickjacking attacks. (CVE-2012-3984)
Collin Jackson discovered that Firefox did not properly follow the HTML5
specification for document.domain behavior. A remote attacker could exploit
this to conduct cross-site scripting (XSS) attacks via javascript
execution. (CVE-2012-3985)
Johnny Stenback discovered that Firefox did not properly perform security
checks on tests methods for DOMWindowUtils. (CVE-2012-3986)
Alice White discovered that the security checks for GetProperty could be
bypassed when using JSAPI. If a user were tricked into opening a specially
crafted web page, a remote attacker could exploit this to execute arbitrary
code as the user invoking the program. (CVE-2012-3991)
Mariusz Mlynski discovered a history state error in Firefox. A remote
attacker could exploit this to spoof the location property to inject script
or intercept posted data. (CVE-2012-3992)
Mariusz Mlynski and others discovered several flaws in Firefox that allowed
a remote attacker to conduct cross-site scripting (XSS) attacks.
(CVE-2012-3993, CVE-2012-3994, CVE-2012-4184)
Abhishek Arya, Atte Kettunen and others discovered several memory flaws in
Firefox when using the Address Sanitizer tool. If a user were tricked into
opening a specially crafted web page, a remote attacker could cause Firefox
to crash or potentially execute arbitrary code as the user invoking the
program. (CVE-2012-3990, CVE-2012-3995, CVE-2012-4179, CVE-2012-4180,
CVE-2012-4181, CVE-2012-4182, CVE-2012-4183, CVE-2012-4185, CVE-2012-4186,
CVE-2012-4187, CVE-2012-4188) |
mozilla: multiple vulnerabilities
| Package(s): | firefox, thunderbird, seamonkey | CVE #(s): | CVE-2012-3982, CVE-2012-3986, CVE-2012-3988, CVE-2012-3990, CVE-2012-3991, CVE-2012-3992, CVE-2012-3993, CVE-2012-3994, CVE-2012-3995, CVE-2012-4179, CVE-2012-4180, CVE-2012-4181, CVE-2012-4182, CVE-2012-4183, CVE-2012-4184, CVE-2012-4185, CVE-2012-4186, CVE-2012-4187, CVE-2012-4188 |
| Created: | October 10, 2012 | Updated: | January 10, 2013 |
| Description: |
From the Red Hat advisory:
Several flaws were found in the processing of malformed web content. A web
page containing malicious content could cause Firefox to crash or,
potentially, execute arbitrary code with the privileges of the user running
Firefox. (CVE-2012-3982, CVE-2012-3988, CVE-2012-3990, CVE-2012-3995,
CVE-2012-4179, CVE-2012-4180, CVE-2012-4181, CVE-2012-4182, CVE-2012-4183,
CVE-2012-4185, CVE-2012-4186, CVE-2012-4187, CVE-2012-4188)
Two flaws in Firefox could allow a malicious website to bypass intended
restrictions, possibly leading to information disclosure, or Firefox
executing arbitrary code. Note that the information disclosure issue could
possibly be combined with other flaws to achieve arbitrary code execution.
(CVE-2012-3986, CVE-2012-3991)
Multiple flaws were found in the location object implementation in Firefox.
Malicious content could be used to perform cross-site scripting attacks,
script injection, or spoofing attacks. (CVE-2012-1956, CVE-2012-3992,
CVE-2012-3994)
Two flaws were found in the way Chrome Object Wrappers were implemented.
Malicious content could be used to perform cross-site scripting attacks or
cause Firefox to execute arbitrary code. (CVE-2012-3993, CVE-2012-4184) |
openstack-keystone: two authentication bypass flaws
| Package(s): | openstack-keystone | CVE #(s): | CVE-2012-4456, CVE-2012-4457 |
| Created: | October 4, 2012 | Updated: | October 10, 2012 |
| Description: |
From the Red Hat Bugzilla entries [1, 2]:
CVE-2012-4456: Jason Xu discovered several vulnerabilities in OpenStack
Keystone token verification:
The first occurs in the API /v2.0/OS-KSADM/services and
/v2.0/OS-KSADM/services/{service_id}, the second occurs in
/v2.0/tenants/{tenant_id}/users/{user_id}/roles
In both cases the OpenStack Keystone code fails to check if the tokens are
valid. These issues have been addressed by adding checks in the form of
test_service_crud_requires_auth() and test_user_role_list_requires_auth().
CVE-2012-4457: Token authentication for a user belonging to a disabled tenant should not be
allowed. |
openstack-swift: insecure use of python pickle
| Package(s): | openstack-swift | CVE #(s): | CVE-2012-4406 |
| Created: | October 8, 2012 | Updated: | October 18, 2012 |
| Description: |
From the Red Hat bugzilla:
Sebastian Krahmer (krahmer@suse.de) reports:
swift uses pickle to store and load meta data. pickle is insecure
and allows to execute arbitrary code in loads(). |
php: multiple vulnerabilities
| Package(s): | php | CVE #(s): | |
| Created: | October 8, 2012 | Updated: | October 10, 2012 |
| Description: |
PHP 5.4.7 fixes multiple vulnerabilities. See the PHP changelog for details. |
phpldapadmin: cross-site scripting
| Package(s): | phpldapadmin | CVE #(s): | CVE-2012-1114, CVE-2012-1115 |
| Created: | October 8, 2012 | Updated: | October 10, 2012 |
| Description: |
From the Red Hat bugzilla:
Originally (2012-03-01), the following cross-site (XSS) flaws were reported against LDAP Account Manager Pro (from Secunia advisory):
* 1) Input passed to e.g. the "filteruid" POST parameter when filtering result sets in lam/templates/lists/list.php (when "type" is set to a valid value) is not properly sanitised before being returned to the user. This can be exploited to execute arbitrary HTML and script code in a user's browser session in context of an affected site.
* 2) Input passed to the "filter" POST parameter in lam/templates/3rdParty/pla/htdocs/cmd.php (when "cmd" is set to "export" and "exporter_id" is set to "LDIF") is not properly sanitised before being returned to the user. This can be exploited to execute arbitrary HTML and script code in a user's browser session in context of an affected site.
* 3) Input passed to the "attr" parameter in lam/templates/3rdParty/pla/htdocs/cmd.php (when "cmd" is set to "add_value_form" and "dn" is set to a valid value) is not properly sanitised before being returned to the user. This can be exploited to execute arbitrary HTML and script code in a user's browser session in context of an affected site. |
php-zendframework: multiple vulnerabilities
| Package(s): | php-zendframework | CVE #(s): | |
| Created: | October 8, 2012 | Updated: | October 10, 2012 |
| Description: |
From the ZendFramework advisories [1], [2]:
[1] The default error handling view script generated using Zend_Tool failed to escape request parameters when run in the "development" configuration environment, providing a potential XSS attack vector.
[2] Developers using non-ASCII-compatible encodings in conjunction with the MySQL PDO driver of PHP may be vulnerable to SQL injection attacks. Developers using ASCII-compatible encodings like UTF8 or latin1 are not affected by this PHP issue. |
wireshark: denial of service
| Package(s): | wireshark | CVE #(s): | CVE-2012-5239, CVE-2012-3548 |
| Created: | October 8, 2012 | Updated: | March 8, 2013 |
| Description: |
From the CVE entries:
The Mageia advisory references CVE-2012-5239, which is a duplicate of CVE-2012-3548.
The dissect_drda function in epan/dissectors/packet-drda.c in Wireshark 1.6.x through 1.6.10 and 1.8.x through 1.8.2 allows remote attackers to cause a denial of service (infinite loop and CPU consumption) via a small value for a certain length field in a capture file. (CVE-2012-3548)
|
Page editor: Jake Edge
Kernel development
Brief items
The 3.7 merge window is still open, so there is no current
development kernel. See the article below for merges into 3.7 since last
week.
Stable updates: The 3.2.31 stable
kernel was released on October 10. In addition, the 3.0.45,
3.4.13,
3.5.6,
and 3.6.1 stable kernels were released on
October 7. Support for the 3.5 series is coming to an end, as there may
only be one more update, so users of that kernel should be planning to
upgrade.
If you look at Linux contributions they come from everywhere. The
core of the network routing code was written by Russians (and
Alexey who worked at a nuclear research instutite even turned up at
OLS with a 'minder' who as per every stereotype was apparently
capable of drinking vodka in half pints). We have code from
government projects, from educational projects (some of which are
in effect state funded), from businesses, from volunteers, from a
wide variety of non profit causes. Today you can boot a box
running Russian based network code with an NSA written ethernet
driver.
-- Alan Cox
Every time a patch goes through more than 3 purely coding style
revisions, a unicorn dies.
-- David Miller
Back in August, a linux-kernel discussion on removable device filesystems
hinted at a new filesystem waiting in the wings. It now seems clear that
said filesystem was the just-announced F2FS, a flash-friendly filesystem
from Samsung. "F2FS is a new file system carefully
designed for the NAND flash memory-based storage devices. We chose a log
structure file system approach, but we tried to adapt it to the new form of
storage." See the associated documentation file for details on the F2FS
on-disk format and how it works.
[Update: Also see Neil Brown's dissection of f2fs on this week's kernel page.]
Kernel development news
By Jonathan Corbet
October 10, 2012
As of this writing, Linus has pulled 9,167 non-merge changesets into the
mainline for the 3.7 merge window; that's just over 3,600 changes since
last week's summary. As predicted, the merge
rate has slowed a bit as Linus found better things to do with his time.
Still, it is shaping up to be an active development cycle.
User-visible changes since last week include:
- The kernel's firmware loader will now attempt to load files directly
from the filesystem, without a round trip through udev in user space.
The firmware path is currently wired to a few alternatives under
/lib/firmware; the plan is to make things more flexible in the future.
- The epoll_ctl() system call supports a new
EPOLL_CTL_DISABLE operation to disable polling on a specific
file descriptor.
- The Xen paravirtualization mechanism is now supported on the ARM
architecture.
- The tools directory contains a new "trace agent" utility; it
uses virtio to move trace data from a guest system to a host in an
efficient manner. Also added to tools is acpidump,
which can dump a system's ACPI tables to a text file.
- Online resizing of ext4 filesystems that use the metablock group
(meta_bg) or 64-bit block number features is now supported.
- The UBI translation layer for flash-based storage devices has gained
an experimental "fastmap" capability. The fastmap caches erase block
mappings, eliminating the need to scan the device at mount time.
- The Btrfs filesystem has gained the ability to perform hole punching
with the fallocate() system call.
- New hardware support includes:
  - Systems and processors: Freescale P5040DS reference boards,
    Freescale / iVeia P1022RDK reference boards, and
    MIPS Technologies SEAD3 evaluation boards.
  - Audio: Wolfson Bells boards, Wolfson WM0010 digital signal processors,
    TI SoC based boards with twl4030 codecs,
    C-Media CMI8328-based sound cards, and Dialog DA9055 audio codecs.
  - Crypto: IBM 842 Power7 compression accelerators.
  - Graphics: Renesas SH Mobile LCD controllers.
  - Miscellaneous: ST-Ericsson STE Modem devices,
    Maxim MAX8907 power management ICs,
    Dialog Semiconductor DA9055 PMICs,
    Texas Instruments LP8788 power management units,
    Texas Instruments TPS65217 backlight controllers,
    TI LM3630 and LM3639 backlight controllers,
    Dallas DS2404 RTC chips, Freescale SNVS RTC modules,
    TI TPS65910 RTC chips, RICOH 5T583 RTC chips,
    Marvell MVEBU pin control units (and several SoCs using it),
    Marvell 88PM860x PMICs, LPC32x SLC and MLC NAND controllers, and
    TI EDMA controllers.
  - Video4Linux2: Syntek STK1160 USB audio/video bridges,
    TechnoTrend USB infrared receivers,
    Nokia N900 (RX51) IR transmitters,
    Chips&Media Coda multi-standard codecs,
    FCI FC2580 silicon tuners, Analog Devices ADV7604 decoders,
    Analog Devices AD9389B encoders,
    Samsung Exynos G-Scaler image processors,
    Samsung S5K4ECGX sensors, and Elonics E4000 silicon tuners.
Changes visible to kernel developers include:
- The precursors for the user-space API
header file split have been merged. These create
include/uapi directories meant to hold header files
containing the definitions of data types visible to user space.
Actually splitting those definitions out is a lengthy patch set that
looks to be only partially merged in 3.7; the rest will have to wait
for the 3.8 cycle.
- The core of the Nouveau driver for NVIDIA chipsets has been torn out
and rewritten. The developers understand the target hardware much
better than they did when Nouveau started; the code has now been
reworked to match that understanding.
- The Video4Linux2 subsystem tree has been massively reorganized; driver
source files are now organized by bus type. Most files have moved, so
developers working in this area will need to retrain their fingers for
the new locations. There is also a new, rewritten DVB USB core; a
number of drivers have been converted to this new code.
- The ALSA sound driver subsystem has added a new API for the management
of audio channels; see Documentation/sound/alsa/Channel-Mapping-API.txt
for details.
- The red-black tree implementation has been substantially reworked. It
now implements both interval trees and priority trees; the older
kernel "prio tree" implementation has been displaced by this work and
removed.
Linus had raised the possibility of extending the merge window if his
travels got in the way of pulling changes into the mainline. The changeset
count thus far, though, suggests that there has been no problem with
merging, so chances are that the merge window will close on schedule around
October 14.
October 10, 2012
This article was contributed by Paul McKenney
If you run a post-3.6 Linux kernel for long enough,
you will likely see a process named rcu_sched or
rcu_preempt or maybe even rcu_bh having
consumed significant CPU time.
If the system goes idle and all application processes exit,
these processes might well have the largest CPU consumption of
all the remaining processes.
It is only natural to ask “what are these processes and why
are they consuming so much CPU?”
The “what” part is easy: These are new kernel threads
that handle RCU grace periods, previously handled mainly in
softirq context.
An “RCU grace period” is a period of time after which
all pre-existing RCU read-side critical sections have completed,
so that if an RCU updater
removes a data element from an RCU-protected data structure and then
waits for an RCU grace period, it may subsequently safely carry out
destructive-to-readers actions, such as freeing the data element.
RCU read-side critical sections begin with rcu_read_lock()
and end with rcu_read_unlock().
Updaters can wait for an RCU grace period using synchronize_rcu(),
or they can asynchronously schedule a function to be invoked after
a grace period using call_rcu().
RCU's read-side primitives are extremely fast and scalable, so it
can be quite helpful in read-mostly situations.
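The grace-period rule can be illustrated with a toy, single-threaded model. This is emphatically not how the kernel implements RCU; the ToyRCU class is invented here and borrows only the names of the primitives above. It shows the one property that matters: a callback queued with call_rcu() waits for readers that were already running, but not for readers that start later.

```python
class ToyRCU:
    """Single-threaded model of the grace-period rule (illustration only)."""

    def __init__(self):
        self._active = set()   # ids of readers inside a critical section
        self._next_id = 0
        self._pending = []     # (readers still being waited for, callback)

    def read_lock(self):
        """Enter a read-side critical section; returns a reader id."""
        reader = self._next_id
        self._next_id += 1
        self._active.add(reader)
        return reader

    def read_unlock(self, reader):
        """Leave a critical section and run callbacks whose wait is over."""
        self._active.discard(reader)
        pending, self._pending = self._pending, []
        for waiting_on, callback in pending:
            waiting_on.discard(reader)
            if waiting_on:
                self._pending.append((waiting_on, callback))
            else:
                callback()  # the grace period for this update has elapsed

    def call_rcu(self, callback):
        """Run callback once all pre-existing readers have finished."""
        waiting_on = set(self._active)  # snapshot: pre-existing readers only
        if waiting_on:
            self._pending.append((waiting_on, callback))
        else:
            callback()  # no readers at all, so nothing to wait for


rcu = ToyRCU()
freed = []
old_reader = rcu.read_lock()                   # started before the update
rcu.call_rcu(lambda: freed.append("element"))  # updater removed an element
new_reader = rcu.read_lock()                   # started after the update
freed_while_old_reader_active = list(freed)
rcu.read_unlock(old_reader)                    # ends the grace period
freed_after_old_reader_done = list(freed)
rcu.read_unlock(new_reader)
```

The element is "freed" as soon as the pre-existing reader leaves its critical section, even though the later reader is still running at that point.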
For more detail on RCU, see "The RCU API, 2010 Edition",
"What is RCU, Fundamentally?", this set of slides [PDF], and
the RCU home page.
Quick Quiz 1:
Why would latency be reduced by moving RCU work to a kthread?
And why would anyone care about latency on huge machines?
The reason for moving RCU grace-period handling to a kernel thread
was to improve real-time latency (both interrupt latency and
scheduling latency) on huge systems by allowing RCU's
grace-period initialization to be preempted:
Without preemption, this initialization can inflict
more than 200 microseconds of latency
on huge systems.
In addition, this change will very likely also improve
RCU's energy efficiency while also simplifying the code.
These potential simplifications are due to the fact that kernel threads
make it easier to guarantee forward progress, avoiding hangs in
cases where
all CPUs are asleep and thus ignoring the current grace period,
as confirmed by Paul Walmsley.
But the key point here is that these kernel threads do not represent
new overhead: Instead, overhead that used to be hidden in softirq
context is now visible in kthread context.
Quick Quiz 2:
Wow!!! If hackbench does a million grace periods in ten minutes,
just how many does something like rcutorture do?
Now for “why so much CPU?”, which is the question
Ingo Molnar asked immediately upon seeing more than three minutes of
CPU time consumed by
rcu_sched after running a couple hours of kernel builds.
The answer is that Linux makes heavy use of RCU, so much so that
running hackbench for ten minutes can result in almost
one million RCU grace periods—and more than thirty seconds
of CPU time consumed by rcu_sched.
This works out to about thirty microseconds per grace period, which
is anything but excessive, considering the amount of work that grace
periods do.
As it turns out, the CPU consumption of rcu_sched, rcu_preempt, and rcu_bh
is often roughly equal to the sum of that of the ksoftirqd threads.
Interestingly enough, in 3.6 and earlier, some of the RCU grace-period overhead
would have been charged to the ksoftirqd kernel threads.
But CPU overhead per grace period is only part of the story.
RCU works hard to process multiple updates (e.g., call_rcu()
or synchronize_rcu() invocations) with a single grace period.
It is not hard to achieve more than one hundred updates per grace
period, which results in a per-update overhead of only about 300 nanoseconds,
which is not bad at all.
Furthermore, workloads having well in excess of one thousand updates
per grace period have been observed.
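The arithmetic behind these figures is easy to check. The sketch below uses the round numbers quoted above (thirty seconds of CPU time, one million grace periods); the measured values were "more than" and "almost" those figures, so the results are approximations, and the function names are made up for this example.

```python
def per_grace_period_ns(cpu_seconds, grace_periods):
    """CPU time charged to each grace period, in nanoseconds."""
    return cpu_seconds * 1_000_000_000 // grace_periods


def per_update_ns(cpu_seconds, grace_periods, updates_per_grace_period):
    """CPU time per update when several updates share one grace period."""
    per_gp = per_grace_period_ns(cpu_seconds, grace_periods)
    return per_gp // updates_per_grace_period


gp_ns = per_grace_period_ns(30, 1_000_000)            # 30000 ns = 30 us
update_ns_100 = per_update_ns(30, 1_000_000, 100)     # 300 ns
update_ns_1000 = per_update_ns(30, 1_000_000, 1000)   # 30 ns
```

At one thousand updates per grace period, the same overhead works out to roughly 30 nanoseconds per update.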
Of course, the per-grace-period CPU overhead does vary, and
with it the per-update overhead.
First, the greater the number of possible CPUs
(as given at boot time by nr_cpu_ids),
the more work RCU must do when initializing and cleaning up grace periods.
This overhead grows fairly slowly, with additional work required
with the addition of each set of 16 CPUs, though this number varies
depending on the
CONFIG_RCU_FANOUT_LEAF kernel configuration parameter
and also on the rcutree.rcu_fanout_leaf kernel boot parameter.
Second, the greater the number of idle CPUs, the more work RCU must
do when forcing quiescent states.
Yes, the busier the system, the less work RCU needs to do!
The reason for the extra work is that RCU is not permitted to disturb
idle CPUs for energy-efficiency reasons.
RCU must therefore probe a per-CPU data structure to read out idleness
state during each grace period, likely incurring a cache miss on each such
probe.
Third and finally, the overhead will vary depending on CPU clock rate,
memory-system performance, virtualization overheads, and so on.
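The first of these factors can be made concrete with a little arithmetic. The sketch below assumes, as a simplification not spelled out above, that the extra initialization and cleanup work scales with the number of leaf-level groups of CPUs, and it ignores everything else about RCU's internal structure; leaf_groups() is a name made up for this example, with the default group size of 16 taken from the text.

```python
def leaf_groups(nr_cpu_ids, fanout_leaf=16):
    """Number of groups of fanout_leaf CPUs needed to cover nr_cpu_ids."""
    if nr_cpu_ids < 1 or fanout_leaf < 1:
        raise ValueError("both arguments must be positive")
    return -(-nr_cpu_ids // fanout_leaf)  # ceiling division


groups_16 = leaf_groups(16)       # 1
groups_17 = leaf_groups(17)       # 2: the 17th CPU starts a new group
groups_4096 = leaf_groups(4096)   # 256
```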
All that aside, I see per-grace-period overheads ranging from 15 to
100 microseconds on the systems I use.
I suspect that a system with thousands of CPUs might consume many
hundreds of microseconds, or perhaps even milliseconds, of CPU time
for each grace period.
On the other hand, such a system might also handle a very large number
of updates per grace period.
Quick Quiz 3:
Now that all of the RCU overhead is appearing
on the rcu_sched, rcu_preempt, and
rcu_bh kernel threads, we should be able to more
easily identify that overhead and optimize RCU, right?
In conclusion, the rcu_sched, rcu_preempt, and
rcu_bh CPU overheads should not be anything to worry about.
They do not represent new overhead inflicted on post-3.6 kernels,
but rather better accounting of the
same overhead that RCU has been incurring all along.
Acknowledgments
I owe thanks to Ingo Molnar for first noting this issue and the need
to let the community know about it.
We all owe a debt of gratitude to Steve Dobbelstein,
Stephen Rothwell, Jon Corbet, and Paul Walmsley for their help
in making this article human-readable.
I am grateful to Jim Wasko for his support of this effort.
Quick Quiz 1:
Why would latency be reduced by moving RCU work to a kthread?
And why would anyone care about latency on huge machines?
Answer:
Moving work from softirq to a kthread allows that work to be more
easily preempted, and this preemption reduces scheduling latency.
Low scheduling latency is of course important in real-time applications,
but it also helps reduce OS jitter.
Low OS jitter is critically important to certain types of
high-performance-computing (HPC) workloads, which is the type
of workload that tends to be run on huge systems.
Quick Quiz 2:
Wow!!! If hackbench does a million grace periods in ten minutes,
just how many does something like rcutorture do?
Answer:
Actually, rcutorture tortures RCU in many different ways,
including overly long read-side critical sections, transitions to and
from idle, and CPU hotplug operations.
Thus, a typical rcutorture run would probably “only”
do about 100,000 grace periods in a ten-minute interval.
In short, the grace-period rate can vary widely depending on your
hardware, kernel configuration, and workload.
Quick Quiz 3:
Now that all of the RCU overhead is appearing
on the rcu_sched, rcu_preempt, and
rcu_bh kernel threads, we should be able to more
easily identify that overhead and optimize RCU, right?
Answer:
Yes and no.
Yes, it is easier to optimize that which can be easily measured.
But no, not all of RCU's overhead appears on the
rcu_sched, rcu_preempt, and
rcu_bh kernel threads.
Some of it still appears on the ksoftirqd kernel
threads, and some of it is spread over other tasks.
Still, yes, the greater visibility should be helpful.
October 10, 2012
This article was contributed by Neil Brown
When a techno-geek gets a new toy there must always be an urge to take
it apart and see how it works. Practicalities (and warranties)
sometimes suppress that urge, but in the case of f2fs and this geek,
the urge was too strong. What follows is the result of taking apart
this new filesystem to see how it works.
f2fs (interestingly not "f3s") is the "flash-friendly file
system", a
new filesystem for Linux recently
announced
by engineers from Samsung. Unlike jffs2
and logfs, f2fs is not
targeted at raw flash
devices, but rather at the specific hardware that is commonly
available to consumers — SSDs, eMMC, SD cards, and other flash
storage with an
FTL
(flash translation layer) already built in.
It seems that as hardware
gets smarter, we need to make even more clever software to manage that
"smartness". Does this sound like parenting to anyone else?
f2fs is based on the log-structured filesystem (LFS) design — which
is hardly surprising given the
close match between the log-structuring
approach and the needs of flash.
For those not familiar with log-structured design, the key elements
are:
- That it requires copy-on-write, so data is always written to
previously unused space.
- That free space is managed in large regions which are written to
sequentially. When the number of free regions gets low, data that
is still live is coalesced from several regions into one free
region, thus creating more free regions. This process is known as
"cleaning" and the overhead it causes is one of the significant costs
of log structuring.
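The cleaning step described above can be pictured with a toy model. This is only a sketch: the region size, the dictionary layout, and the function name are invented for illustration and are not taken from f2fs.

```python
# Toy model of log-structured "cleaning"; sizes and names are hypothetical.
REGION_BLOCKS = 4  # made-up region size, in blocks


def clean(regions, victims, free_region):
    """Copy the live blocks of the victim regions into one free region.

    `regions` maps a region id to a list of blocks, where None marks a
    slot that holds no live data. Returns the ids of the regions freed.
    """
    live = [b for v in victims for b in regions[v] if b is not None]
    if len(live) > REGION_BLOCKS:
        raise ValueError("live data does not fit in one region")
    # One sequential write of the surviving blocks into the free region.
    regions[free_region] = live + [None] * (REGION_BLOCKS - len(live))
    for v in victims:
        regions[v] = [None] * REGION_BLOCKS  # the victim is now entirely free
    return list(victims)


regions = {
    0: ["a", None, None, "b"],
    1: [None, "c", None, None],
    2: [None, None, None, None],  # the one free region
}
freed = clean(regions, victims=[0, 1], free_region=2)
print(freed, regions[2])
```

Two sparsely used regions are freed at the price of one, so the number of free regions goes up by one. The copying itself is the overhead the text refers to.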
As the FTL typically uses a log-structured
design to provide the wear-leveling and write-gathering that flash
requires, this means that there are two log structures active on the
device — one in the firmware and one in the operating system.
f2fs is explicitly designed to make use of this fact and leaves a
number of tasks to the FTL while focusing primarily on those tasks
that it is well positioned to perform. So, for example, f2fs makes no
effort to distribute writes evenly across the address space to provide
wear-leveling.
The particular value that f2fs brings, which can justify it being
"flash friendly", is that it provides large-scale write gathering so
that when lots of blocks need to be written at the same time they are
collected into large sequential writes which are much easier for the
FTL to handle. Rather than creating a single large write, f2fs
actually creates up to six in parallel. As we shall see,
these are assigned different sorts of blocks with different life
expectancies. Grouping blocks with similar life expectancies together
tends to make the garbage collection process required by the LFS less expensive.
The "large-scale" is a significant qualifier — f2fs doesn't always
gather writes into contiguous streams, only almost always. Some
metadata, and occasionally even some regular data, is written via
random single-block writes. This would be anathema for a regular
log-structured filesystem, but f2fs chooses to avoid a lot of
complexity
by just doing small updates when necessary and leaving the FTL to make
those corner cases work.
Before getting into the details of how f2fs does what it does, a brief
list of some of the things it doesn't do is in order.
A feature that we might expect from a copy-on-write filesystem is
cheap snapshots as they can be achieved by simply not freeing up the
old copy. f2fs does not provide these and cannot in its current form
due to its two-locations approach to some metadata which will be detailed
later.
Other features that are missing are usage quotas, NFS export, and the
"security" flavor of extended attributes (xattrs). Each of these
could probably be added with minimal effort if they are needed, though
integrating quotas correctly with the crash recovery would be the most
challenging. We shouldn't be surprised to see some of these in a future
release.
Blocks, segments, sections, and zones
Like most filesystems, f2fs is comprised of blocks. All blocks
are 4K in size, though the code implicitly links the block size with
the system page size, so it is unlikely to work on systems with larger
page sizes as is possible with IA64 and PowerPC. The block addresses
are 32 bits so the total number of addressable bytes in the filesystem is at most 2^(32+12) bytes or 16 terabytes. This is probably not a
limitation — for current flash hardware at least.
Blocks are collected into "segments". A segment is 512 blocks or 2MB
in size. The documentation describes this as a default, but this size
is fairly deeply embedded in the code. Each segment has a segment
summary block which lists the owner (file plus offset) of each block
in the segment. The summary is primarily used when cleaning to
determine which blocks need to be relocated and how to update the
index information after the relocation. One block can comfortably
store summary information for 512 blocks (with a bit of extra space
which has other uses), so 2MB is the natural size for a segment.
Larger would be impractical and smaller would be wasteful.
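The sizes quoted so far follow from simple arithmetic, which can be checked with a few lines of Python. The constant names are made up for this sketch; the values are the ones given in the text.

```python
BLOCK_SIZE = 4096          # 4K blocks
BLOCK_ADDR_BITS = 32       # width of a block address
BLOCKS_PER_SEGMENT = 512   # blocks in one segment

max_fs_bytes = (1 << BLOCK_ADDR_BITS) * BLOCK_SIZE
segment_bytes = BLOCKS_PER_SEGMENT * BLOCK_SIZE
# Upper bound on the summary space one block can offer per segment block.
summary_budget = BLOCK_SIZE // BLOCKS_PER_SEGMENT

print(max_fs_bytes // 2**40, "TiB maximum filesystem size")   # 16
print(segment_bytes // 2**20, "MiB per segment")              # 2
print(summary_budget, "bytes of summary budget per block")    # 8
```

The eight-byte budget is only an upper bound. The text says that some space is left over for other uses, so the actual per-block summary entry must be smaller than that.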
Segments are collected into sections. There is genuine flexibility in
the size of a section, though it must be a power of two. A section
corresponds to a "region" in the outline of log structuring given above. A
section is normally filled from start to end before looking around for
another section, and the cleaner processes one section at a time.
The default size when using the mkfs utility is 2^0, or
one segment per section.
f2fs has six sections "open" for writing at any time with
different sorts of data being written to each one. The different sections
allow file content (data) to be kept separate from indexing
information (nodes), and for those to be divided into "hot", "warm",
and "cold" according to various heuristics. For example, directory
data is treated as hot and kept separate from file data because they have
different life expectancies. Data that is cold is
expected to remain unchanged for quite a long time, so a section full of cold
blocks is likely to not require any cleaning. Nodes that are hot are expected
to be updated soon, so if we wait a little while, a section that was full of hot
nodes will have very few blocks that are still live and thus will be cheap to
clean.
Sections are collected into zones. There may be any (integer) number
of sections in a zone though the default is again one. The sole
purpose of zones is to try to keep these six open sections in different
parts of the device. The theory seems to be that flash devices are
often made from a number of fairly separate sub-devices each of which
can process IO requests independently and hence in parallel. If zones
are sized to line up with the sub-devices, then the six open sections
can all handle writes in parallel and make best use of the
device.
These zones, full of sections of segments of blocks, make up the
"main" area of the filesystem. There is also a "meta" area which
contains a variety of different metadata such as the segment summary
blocks already mentioned. This area is not managed following normal
log-structured lines and so leaves more work for the FTL to do.
Hopefully it is small enough that this isn't a problem.
There are three approaches to management of writes in this area.
First, there is a small amount of read-only data (the superblock)
which is never written once the filesystem has been created. Second,
there are the segment summary blocks which have already been
mentioned. These are simply updated in-place. This can lead to
uncertainty as to the "correct" contents for the block after a crash,
however for segment summaries this is not an actual problem. The
information in it is checked for validity before it is used, and if
there is any chance that information is missing, it will be recovered
from other sources during the recovery process.
The third approach involves allocating twice as much space as is
required so that each block has two different locations it can exist
in, a primary and a secondary. Only one of these is "live" at any time
and the copy-on-write requirement of an LFS is met by simply writing to
the non-live location and updating the record of which is live.
This approach to metadata is the main impediment to providing snapshots.
f2fs does a small amount of journaling of updates to this last group
while creating a checkpoint, which might ease the task for the FTL
somewhat.
Files, inodes, and indexing
Most modern filesystems seem to use B-trees or similar structures for
managing indexes to locate the blocks in a file. In fact they are
so fundamental to btrfs that it takes its name from that data
structure. f2fs doesn't. Many filesystems reduce the size of the
index by the use of "extents" which provide a start and length of a
contiguous list of blocks rather than listing all the addresses
explicitly. Again, f2fs doesn't (though it does maintain one extent
per inode as a hint).
Rather, f2fs uses an indexing tree that is very reminiscent of the
original Unix filesystem and descendants such as ext3. The inode
contains a list of addresses for the early blocks in the file, then
some addresses for indirect blocks (which themselves contain more
addresses) as well as some double and triple-indirect blocks. While
ext3 has 12 direct addresses and one each of the indirection addresses,
f2fs has 929 direct addresses, two each of indirect and double-indirect
addresses, and a single triple-indirect address. This allows the
addressing of nearly 4TB for a file, or one-quarter of the maximum filesystem
size.
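The "nearly 4TB" figure can be reproduced with a short calculation. One number is needed that the text does not give: how many addresses fit in an indirect block. The sketch below assumes 1018, on the reasoning that a 4K block of 4-byte addresses holds at most 1024 and some space is presumably reserved for other purposes; treat that constant as an assumption.

```python
BLOCK_SIZE = 4096
DIRECT_IN_INODE = 929   # from the text
PTRS_PER_NODE = 1018    # assumption: addresses held by one indirect block

blocks = (
    DIRECT_IN_INODE
    + 2 * PTRS_PER_NODE         # two single-indirect blocks
    + 2 * PTRS_PER_NODE ** 2    # two double-indirect blocks
    + 1 * PTRS_PER_NODE ** 3    # one triple-indirect block
)
max_file_bytes = blocks * BLOCK_SIZE
max_file_tib = max_file_bytes / 2**40

print(blocks, "addressable blocks")
print(round(max_file_tib, 2), "TiB maximum file size")
```

With that assumption the triple-indirect block dominates the total, and the result lands just under 4TiB, in line with the figure above.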
While this scheme has some costs — which is why other filesystems have
discarded it — it has a real benefit for an LFS. As f2fs does not use
extents, the index tree for a given file has a fixed and known size.
This means that when blocks are relocated through cleaning, it is
impossible for changes in available extents to cause the indexing tree
to get bigger — which could be embarrassing when the point of
cleaning is to free space. logfs, another reasonably modern
log structured filesystem for flash, uses much the same arrangement
for much the same
reason.
Obviously, all this requires a slightly larger inode than ext3 uses.
Copy-on-write is rather awkward for objects that are smaller than the
block size so f2fs reserves a full 4K block for each inode which
provides plenty of space for indexing. It even provides space to
store the (base) name of the file, or one of its names, together with
the inode number of the parent. This simplifies the recovery of
recently-created files during crash recovery and reduces the number of
blocks that need to be written for such a file to be safe.
Given that the inode is so large, one would expect that small files and
certainly small symlinks would be stored directly in the inode, rather
than just storing a single block address and storing the data
elsewhere. However f2fs doesn't do that. Most likely the reality is
that it doesn't do it yet. It is an easy enough optimization
to add, so it's unlikely to remain absent for long.
As already mentioned, the inode contains a single extent that is a
summary of some part of the index tree. It says that some range of blocks
in the file are contiguous in storage and gives the address of this
range. The filesystem attempts to keep the largest extent recorded
here and uses it to speed up address lookups. For the common case of a
file being written sequentially without any significant pause, this should
result in the entire file being in that one extent, and make lookups in the
index tree unnecessary.
Surprisingly, it doesn't seem there was enough space to store 64-bit
timestamps, so instead of nanosecond resolution for several centuries
into the future, it only provides single-second resolution until some
time in 2038. This oversight was
raised on linux-kernel
and may well be
addressed in a future release.
One of the awkward details of any copy-on-write filesystem is that
whenever a block is written, its address is changed, so its parent
in the indexing tree must change and be relocated, and so on up to the
root of the tree. The logging nature of an LFS means that
roll-forward during recovery can rebuild recent changes to the indexing tree so all
the changes do not have to be written immediately, but they do have to
be written eventually, and this just makes more work for the cleaner.
This is another area where f2fs makes use of its underlying FTL and
takes a short-cut. Among the contents of the "meta" area is a NAT —
a Node Address Table. Here "node" refers to inodes and to indirect
indexing blocks, as well as blocks used for xattr storage. When the address of
an inode is stored in a directory, or an index block is stored in
an inode or another index block, it isn't the block address that is
stored, but rather an offset into the NAT. The actual block address
is stored in the NAT at that offset. This means that when a data
block is written, we still need to update and write the node that points to it. But writing that node only requires updating the NAT
entry. The NAT is part of the metadata that uses two-location journaling
(thus depending on the FTL for write-gathering) and so does not require further
indexing.
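The effect of the NAT can be seen in a toy model where nodes refer to each other by node number rather than by block address. Everything here (the class, the dictionary standing in for the NAT, the node contents) is invented for illustration and is not f2fs code.

```python
class ToyLog:
    """Append-only block store standing in for the main area."""

    def __init__(self):
        self.blocks = []

    def append(self, payload):
        self.blocks.append(payload)
        return len(self.blocks) - 1  # block address of the new copy


log = ToyLog()
nat = {}  # node number -> current block address


def write_node(nid, node):
    # Copy-on-write: every write of a node lands at a new block address,
    # and only the NAT entry for that node is updated.
    nat[nid] = log.append(dict(node))


def read_node(nid):
    return log.blocks[nat[nid]]


# Inode 1 refers to indirect node 2 by node number, not by block address.
write_node(2, {"data": [None]})
write_node(1, {"indirect": 2})
inode_addr_before = nat[1]

# Rewriting a data block gives it a new address, so node 2 must be rewritten.
new_data_addr = log.append("file contents, version 2")
node2 = dict(read_node(2))
node2["data"] = [new_data_addr]
write_node(2, node2)

# The inode was not rewritten, yet it still reaches the new data.
print(nat[1] == inode_addr_before)
print(read_node(read_node(1)["indirect"])["data"] == [new_data_addr])
```

The chain of rewrites stops at the first node: the inode above it keeps its old address because its reference to node 2 goes through the table.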
Directories
An LFS doesn't really impose any particular requirements on the layout
of a directory, except to change the fewest number of blocks
possible, which is generally good for performance anyway. So we can
assess f2fs's directory structure on an equal footing with other
filesystems. The primary goal is to provide fast lookup by file name,
and to provide a stable address of each name that can be reported
using telldir().
The original Unix filesystem (once it had been adjusted for 256-byte
file names) used the same directory scheme as ext2 — sequential search
through a file full of directory entries. This is simple and
effective, but doesn't scale well to large directories.
More modern filesystems such as ext3, xfs,
and btrfs use various
schemes involving B-trees, sometimes indexed by a hash of the file
name. One of the problems with B-trees is that nodes sometimes need
to be split and this causes some directory entries to be moved around
in the file. This results in extra challenges to provide stable
addresses for telldir() and is probably the reason that telldir() is often
called out for being a poor interface.
f2fs uses some sequential searching and some hashing to provide a
scheme that is simple, reasonably efficient, and
trivially provides stable telldir() addresses. A lot of the hashing
code is borrowed from ext3, however f2fs omits the use of a per-directory seed.
This seed is a secret random number which ensures that the hash values used are
different in each directory, so they are not predictable. Using such a seed
provides protection against hash-collision attacks. While these might be
unlikely in practice, they are so easy to prevent that this omission is a
little surprising.
It is easiest to think of the directory structure as a series of hash
tables stored consecutively in a file. Each hash table has a number of fairly large buckets. A
lookup proceeds from the first hash table to the next, at each stage
performing a linear search through the appropriate bucket, until
either the name is found or the last hash table has been searched.
During the search, any free space in a suitable bucket is recorded
in case we need to create the name.
The first hash table has exactly one bucket which is two blocks in
size, so for the first few hundred entries, a simple linear search is
used. The second hash table has two buckets, then four, then eight
and so on until the 31st table with about a billion buckets, each two
blocks in size. Subsequent hash tables — should you need that many —
all have the same number of buckets as the 31st, but now they are four
blocks in size.
The result is that a linear search of several hundred entries can be required,
possibly progressing through quite a few blocks if the directory is very large.
The length of this search increases only as the logarithm of the number of
entries in the directory, so it scales fairly well. This is certainly
better than a purely sequential search, but seems like it could be a
lot more work than is really necessary. It does however guarantee
that only one block needs to be updated for each addition or deletion
of a file name, and since entries are never moved, the offset in the
file is a stable address for telldir(), which are valuable features.
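The layout of these consecutive hash tables can be written down as a few small functions. This is a sketch under stated assumptions: levels are counted from zero, and the bucket is chosen as the hash modulo the number of buckets, which the text does not spell out. The function names are made up.

```python
MAX_DOUBLING_LEVEL = 30  # 0-based index of the 31st hash table


def buckets_in_level(level):
    # 1, 2, 4, 8, ... buckets, capped at 2^30 from the 31st table on.
    return 1 << min(level, MAX_DOUBLING_LEVEL)


def blocks_per_bucket(level):
    # Two-block buckets up to the 31st table, four-block buckets after it.
    return 2 if level <= MAX_DOUBLING_LEVEL else 4


def level_start_block(level):
    # File block at which this level's hash table begins.
    return sum(buckets_in_level(l) * blocks_per_bucket(l) for l in range(level))


def bucket_blocks(level, name_hash):
    """Return the file blocks to search linearly for a name at this level."""
    index = name_hash % buckets_in_level(level)  # assumed selection rule
    first = level_start_block(level) + index * blocks_per_bucket(level)
    return range(first, first + blocks_per_bucket(level))


print([level_start_block(l) for l in range(4)])  # [0, 2, 6, 14]
print(list(bucket_blocks(2, 7)))                 # [12, 13]
```

A lookup would call `bucket_blocks()` for level 0, then level 1, and so on, scanning two blocks (or four, at the deepest levels) each time. Since a bucket's position depends only on the level and the hash, an entry never has to move once it has been written.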
Superblocks, checkpoints, and other metadata
All filesystems have a superblock and f2fs is no different. However
it does make a clear distinction between those parts of the superblock
which are read-only and those which can change. These are kept in two
separate data structures.
The f2fs_super_block, which is stored in the second block of
the device, contains only read-only data. Once the filesystem is
created, this is never changed. It describes how big the filesystem
is, how big the segments, sections, and zones are, how much space has
been allocated for the various parts of the "meta" area, and
other little details.
The rest of the information that you might expect to find in a
superblock, such as the amount of free space, the address of the
segments that should be written to next, and various other volatile
details, is stored in an f2fs_checkpoint. This "checkpoint"
is one of the metadata types that follows the two-location approach to
copy-on-write — there are two adjacent segments both of which store a
checkpoint, only one of which is current. The
checkpoint contains a version number so that when the filesystem is
mounted, both can be read and the one with the higher version number
is taken as the live version.
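The two-location scheme with a version number is simple enough to sketch. The class and field names below are hypothetical, and the handling of an unreadable copy (represented as None) is an assumption rather than something the text describes.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    version: int
    free_blocks: int  # stand-in for the volatile state kept in a checkpoint


def pick_live(slots):
    """At mount time, read both copies and take the higher version."""
    readable = [cp for cp in slots if cp is not None]
    if not readable:
        raise ValueError("no readable checkpoint")
    return max(readable, key=lambda cp: cp.version)


def write_next(slots, free_blocks):
    """Write a new checkpoint into whichever slot is not currently live."""
    live = pick_live(slots)
    target = 1 - slots.index(live)
    slots[target] = Checkpoint(live.version + 1, free_blocks)


slots = [Checkpoint(7, 100), Checkpoint(6, 120)]
write_next(slots, free_blocks=90)
print(pick_live(slots))  # Checkpoint(version=8, free_blocks=90)
```

The live copy is never overwritten. If the write of the new copy is interrupted, the previous checkpoint is still intact in the other slot.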
We have already mentioned the Node Address Table (NAT) and Segment
Summary Area (SSA) that also occupy the meta area with the superblock
(SB) and Checkpoints (CP). The one other item of metadata is the
Segment Info Table or SIT.
The SIT stores 74 bytes per segment and is kept separate from the
segment summaries because it is much more volatile. It primarily keeps
track of which blocks are still in active use so that the segment can
be reused when it has no active blocks, or can be cleaned when the
active block count gets low.
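One layout that would account for 74 bytes per segment is a 16-bit count of valid blocks, a bitmap with one bit for each of the 512 blocks, and a 64-bit modification time. The text confirms only the total and the purpose; the breakdown below is an assumption made for illustration.

```python
import struct

BLOCKS_PER_SEGMENT = 512

# Assumed layout: u16 valid-block count, 1-bit-per-block bitmap, u64 mtime.
SIT_ENTRY = struct.Struct("<H%dsQ" % (BLOCKS_PER_SEGMENT // 8))


def pack_entry(valid_bitmap, mtime):
    """Build one hypothetical SIT entry from a 64-byte validity bitmap."""
    count = sum(bin(byte).count("1") for byte in valid_bitmap)
    return SIT_ENTRY.pack(count, valid_bitmap, mtime)


bitmap = bytes([0b00000101]) + bytes(63)  # two blocks still in use
entry = pack_entry(bitmap, mtime=0)
valid_count = SIT_ENTRY.unpack(entry)[0]

print(SIT_ENTRY.size)  # 74
print(valid_count)     # 2
```

With a count like this, a segment whose count reaches zero can be reused at once, and one with a low count is a cheap candidate for cleaning.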
When updates are required to the NAT or the SIT, f2fs doesn't make them
immediately, but stores them in memory until the next checkpoint is
written. If there are relatively few updates then they are not written
out to their final home but are instead journaled in some spare space
in Segment Summary blocks that are normally written at the same time. If the
total amount of updates that are required to Segment Summary blocks
is sufficiently small, even they are not written and the SIT, NAT,
and SSA updates are all journaled with the Checkpoint block — which is
always written during checkpoint. Thus, while f2fs feels free to leave
some work to the FTL, it tries to be friendly and only performs random
block updates when it really has to. When f2fs does need to perform
random block updates it will perform several of them at once, which
might ease the burden on the FTL a little.
Knowing when to give up
Handling filesystem-full conditions in traditional filesystems is
relatively easy. If no space is left, you just return an error.
With a log-structured filesystem, it isn't that easy. There might be a
lot of free space, but it might all be in different sections and so it
cannot be used until those sections are "cleaned", with the live data
packed more densely into fewer sections. It usually makes sense to
over-provision a log-structured filesystem so there are always free
sections to copy data to for cleaning.
The FTL takes exactly this approach and will over-provision to both allow
for cleaning and to allow for parts of the
device failing due to excessive wear. As the FTL handles
over-provisioning internally there is little point in f2fs doing it as
well. So when f2fs starts running out of space, it essentially gives
up on the whole log-structured idea and just writes randomly
wherever it can. Inodes and index blocks are still handled
carefully and there is a small amount of over-provisioning for them,
but data is just updated in place, or written to any free block that
can be found. Thus you can expect performance of f2fs to degrade when
the filesystem gets close to full, but that is common to a lot of
filesystems so it isn't a big surprise.
Would I buy one?
f2fs certainly seems to contain a number of interesting ideas, and a
number of areas for possible improvement — both attractive attributes.
Whether reality will match the promise remains to be seen. One area
of difficulty is that the shape of an f2fs (such as section and zone
size) needs to be tuned to the particular flash device and its FTL; vendors are notoriously secretive about exactly how their FTL
works. f2fs also requires that the flash device is comfortable
having six or more concurrently "open" write areas. This may not be a
problem for Samsung, but does present some problems for your average
techno-geek — though Arnd Bergmann has done some
research that may prove useful. If this leads to people reporting performance results
based on experiments where the f2fs isn't tuned properly to the
storage device, it could be harmful for the project as a whole.
f2fs contains a number of optimizations which aim to ease the burden
on the FTL. It would be very helpful to know how often these actually
result in a reduction in the number of writes. That would help
confirm that they are a good idea, or suggest that further refinement
is needed. So, some gathering of statistics about how often the
various optimizations fire would help increase confidence in the
filesystem.
f2fs seems to have been written without much expectation of highly
parallel workloads. In particular, all submission of write requests
is performed under a single semaphore. So f2fs probably isn't the
filesystem to use for big-data processing on 256-core work-horses. It
should be fine on mobile computing devices for a few more years though.
And finally, lots of testing is required. Some preliminary performance
measurements have been
posted, but to get a
fair comparison you really need an "aged" filesystem and a large mix
of workloads. Hopefully someone will make the time to do the testing.
Meanwhile, would I use it? Given that my phone is as much a toy to
play with as a tool to use, I suspect that I would. However, I would
make sure I had reliable backups first. But then ... I probably should
do that anyway.
Patches and updates
Kernel trees
- Thomas Gleixner: 3.6.1-rt1.
(October 9, 2012)
Page editor: Jonathan Corbet
Distributions
By Jake Edge
October 10, 2012
Financial contributions are part of many free software projects. Users
can, and do, contribute in lots of different ways, but helping the project
keep the lights on and, perhaps, even cover some development time, is
fairly common. Financial donations may also be used to champion
certain features
for a project; by ponying up some money, the donor may get some input into the
direction of the project. The latter seems to be part of the motivation for
Canonical's recent push
to more prominently feature—but not require—financial contributions
as part of the desktop download process.
On October 9, Steve George, Canonical's VP of Communications and Products,
posted a message
to the company's blog that described the change:
Today, we're making it easier for people to financially contribute to
Ubuntu if they want to. By introducing a 'contribute' screen as part of the
desktop download process, people can choose to financially support
different aspects of Canonical's work: from gaming and apps, developing the
desktop, phone and tablet, to co-ordination of upstreams or supporting
Ubuntu flavours. It's important to note that Ubuntu remains absolutely
free, financial contribution remains optional and it is not required in
order to download the software.
It's a bit surprising to see a large, company-backed distribution looking
for financial contributions, but George made it clear that there has always
been a way to
do so, "albeit in a not-easy-to-find spot on our website". He
said that users have been asking for a simpler way to contribute money, so
Canonical was making one available. Now, clicking through to a desktop
download page brings up a web application that allows contributions of
up to $125 in each of eight categories.
According to community manager Jono Bacon, the application
was inspired by the various Humble
Bundles, which allow people to pay what they wish for computer games or
ebooks. While the sliders used by Humble Bundles allow users to choose the
amount to pay the
authors, charities, and the company, the sliders in the Canonical
application allow a choice of features contributors would like to put
their money
toward. The possibilities are:
- Make the desktop more amazing
- Performance optimisation for games and apps
- Improve hardware support on more PCs
- Phone and tablet versions of Ubuntu
- Community participation in Ubuntu development
- Better coordination with Debian and upstreams
- Better support for flavours like Kubuntu, Xubuntu, Lubuntu
- Tip to Canonical – they help make it happen
To aid the user in evaluating how much to give, the application
suggests products that would cost roughly the same as the total contribution.
Those range from a "grande extra shot mocha latte chino" at $2, through a
"pair of LP Matador bongo drums" at $100, up to "an eight year-old
dromedary camel" at $1000. Visitors can either choose their donation level
and contribute via PayPal, or click through a link to go on to the download.
While Canonical undoubtedly does a great deal of work for the benefit of
millions of Linux users, this is a rather unconventional approach. It is a
little hard to imagine that it will generate a significant revenue stream,
at least for an organization the size of Canonical. But the real value to
Canonical
(and by extension,
Ubuntu) may be in the feedback it gets from users.
The categories for features are fairly broad, but a consensus among users
on one (or a few) of them is certainly useful information. The fact that
those users
are willing to pay something to make that vote makes the data all
the more
interesting. There have been persistent complaints that some of the other
Ubuntu flavors, Lubuntu, Xubuntu, Kubuntu, and so on, have lacked for
financial backing in comparison to the number of users they bring to the
table. This
effort would give those flavors an opportunity to send a message, for
example.
It would be great if the community were to get some visibility into the
contributions. Canonical may be understandably loath to give out direct
financial information, but there are other ways to do the reporting that
would still benefit both the Ubuntu community and the larger FOSS
ecosystem. Information on the "votes", perhaps as percentages of the
number and dollar amount of the contributions, would be useful. That would
help Ubuntu see places where more effort is desired as well as identifying
potential trouble spots for other distributions and projects. According to
George, the company is working on a schedule and format for reporting on
the contributions.
Canonical has pursued other non-traditional revenue sources in the past, so
this could just be another. With enough different revenue streams, even if
some are fairly small and unpredictable, the company could reach a
profitable state. That can only be a good thing for the long-term
prosperity of not just Canonical, but the Ubuntu distribution as well.
Ubuntu has millions of users worldwide and the Linux ecosystem is
richer for its presence, so anything that helps its continued existence
is definitely a net positive.
Brief items
On Mon, Oct 08, 2012 at 08:48:40PM +0200, Thijs Kinkhorst wrote:
> I find it therefore doubtful that keeping the bottle logo solves any
real world problem.
I find it doubtful that getting rid of the bottle logo solves any real world
problem.
--
Steve Langasek
Fedora has a huge history of "Hey this is a
great idea!" and getting a 30% solution put out with the idea that it
will become a 80% solution if people just wish hard enough... instead
the people who started it go off to new stuff that interests them and
the people who come after either throw away what was done before or
find that real life has other plans for them. And then infrastructure
gets handed the reigns of the nearly dead website, phone service, etc.
And when we say we can't support it.. we have to spend a year proving
that we can't get anyone to step up while everyone says "Geez
infrastructure can't do anything right."
--
Stephen J Smoogen
Perhaps some invalid data is better than no data at all.
--
Jesse Keating
Distribution News
Debian GNU/Linux
Debian Project Leader Stefano Zacchiroli has a few bits on his September
activities. Topics include Google Code-In, Logo relicensing, and more.
Fedora
Although the meeting was to discuss the Fedora 18 Beta freeze, the
decided-upon one-week delay will push the expected final release out to
early December.
Newsletters and articles of interest
Matthew Garrett
looks at
UEFI secure boot in smaller distributions. "
I've taken Suse's code for key management and merged it into my own shim tree with a few changes. The significant difference is a second stage bootloader signed with an untrusted key will cause a UI to appear, rather than simply refusing to boot. This will permit the user to then navigate the available filesystems, choose a key and indicate that they want to enrol it. From then on, the bootloader will trust binaries signed with that key."
The H
reports
that the latest installation image for Arch Linux boots with systemd. "
Based on the 3.5.5 Linux kernel, Arch Linux 2012.10.06 is a regular monthly snapshot of the rolling-release operating system for new installations. Among the changes since the last snapshot are a simplified EFI boot and setup process, and the use of the gummiboot boot manager to display a menu on EFI systems. Additionally, new packages including ethtool, the FSArchiver tool, the Partimage and Partclone partition utilities, rfkill and the TestDisk data recovery tool are now available on the live system."
Page editor: Rebecca Sobol
Development
By Michael Kerrisk
October 10, 2012
There are many mechanisms for communicating information between
user-space applications and the kernel. System calls and pseudo-filesystems
such as /proc and /sys are of course the most well known.
Signals are similarly well known; the kernel employs signals to
inform a process of various synchronous or asynchronous events—for
example, when the process tries to write to a broken pipe or a child of the
process terminates.
There are also a number of more obscure mechanisms for communication
between the kernel and user space. These include the Linux-specific netlink sockets and user-mode
helper features. Netlink sockets provide a socket-style API for
exchanging information with the kernel. The user-mode helper feature allows
the kernel to automatically invoke user-space executables; this mechanism
is used in a number of places, including the implementation of control
groups and piping core dumps
to a user-space application.
The auxiliary vector, a mechanism for communicating information from
the kernel to user space, has remained largely invisible until
now. However, with the addition of a new library function,
getauxval(), in the GNU C library (glibc) 2.16 release that
appeared at the end of June, it has now become more visible.
Historically, many UNIX systems have implemented the auxiliary vector
feature. In essence, it is a list of key-value pairs that the kernel's ELF
binary loader (fs/binfmt_elf.c in the kernel source) constructs
when a new executable image is loaded into a process. This list is placed
at a specific location in the process's address space; on Linux systems it
sits at the high end of the user address space, just above the (downwardly
growing) stack, the command-line arguments (argv), and environment
variables (environ).
From the description and diagram, we can see that although the
auxiliary vector is somewhat hidden, it is accessible with a little
effort. Even without using the new library function, an application that
wants to access the auxiliary vector merely needs to obtain the address of
the location that follows the NULL pointer at the end of the
environment list. Furthermore, at the shell level, we can discover the
auxiliary vector that was supplied to an executable by setting the
LD_SHOW_AUXV environment variable when launching an application:
$ LD_SHOW_AUXV=1 sleep 1000
AT_SYSINFO_EHDR: 0x7fff35d0d000
AT_HWCAP: bfebfbff
AT_PAGESZ: 4096
AT_CLKTCK: 100
AT_PHDR: 0x400040
AT_PHENT: 56
AT_PHNUM: 9
AT_BASE: 0x0
AT_FLAGS: 0x0
AT_ENTRY: 0x40164c
AT_UID: 1000
AT_EUID: 1000
AT_GID: 1000
AT_EGID: 1000
AT_SECURE: 0
AT_RANDOM: 0x7fff35c2a209
AT_EXECFN: /usr/bin/sleep
AT_PLATFORM: x86_64
The auxiliary vector of each process on the system is also visible via
a corresponding /proc/PID/auxv file. Dumping the contents of the
file that corresponds to the above command (as eight-byte decimal numbers,
because the keys and values are of that size on the 64-bit system used for
this example), we can see the key-value pairs in the vector, followed by a
pair of zero values that indicate the end of the vector:
$ od -t d8 /proc/15558/auxv
0000000 33 140734096265216
0000020 16 3219913727
0000040 6 4096
0000060 17 100
0000100 3 4194368
0000120 4 56
0000140 5 9
0000160 7 0
0000200 8 0
0000220 9 4200012
0000240 11 1000
0000260 12 1000
0000300 13 1000
0000320 14 1000
0000340 23 0
0000360 25 140734095335945
0000400 31 140734095347689
0000420 15 140734095335961
0000440 0 0
0000460
Scanning the high end of user-space memory or /proc/PID/auxv
is a clumsy way of retrieving values from the auxiliary vector. The new
library function provides a simpler mechanism for retrieving individual
values from the list:
#include <sys/auxv.h>
unsigned long int getauxval(unsigned long int type);
The function takes a key as its single argument, and returns the
corresponding value. The glibc header files define a set of symbolic
constants with names of the form AT_* for the key value passed to
getauxval(); these names are exactly the same as the strings
displayed when executing a command with LD_SHOW_AUXV=1.
Of course, the obvious question by now is: what sort of information is
placed in the auxiliary vector, and who needs that information? The
primary customer of the auxiliary vector is the dynamic linker
(ld-linux.so). In the usual scheme of things, the kernel's ELF
binary loader constructs a process image by loading an executable into the
process's memory, and likewise loading the dynamic linker into memory. At
this point, the dynamic linker is ready to take over the task of loading
any shared libraries that the program may need in preparation for handing
control to the program itself. However, it lacks some pieces of
information that are essential for these tasks: the location of the program
inside the virtual address space, and the starting address at which
execution of the program should commence.
In theory, the kernel could provide a system call that the dynamic
linker could use in order to obtain the required information. However, this
would be an inefficient way of doing things: the kernel's program loader already has the information (because it has scanned the
ELF binary and built the process image) and
knows that the dynamic linker will need it. Rather than maintaining a
record of this information until the dynamic linker requests it, the kernel
can simply make it available in the process image at some location known to
the dynamic linker. That location is, of course, the auxiliary vector.
It turns out that there's a range of other information that the
kernel's program loader already has and which it knows the dynamic linker
will need. By placing all of this information in the auxiliary vector, the
kernel either saves the programming overhead of making this information
available in some other way (e.g., by implementing a dedicated system
call), or saves the dynamic linker the cost of making a system call, or
both. Among the values placed in the auxiliary vector and available via
getauxval() are the following:
- AT_PHDR and AT_ENTRY: The values for these keys
are the address of the ELF program headers of the executable and the entry
address of the executable. The dynamic linker uses this information to
perform linking and pass control to the executable.
- AT_SECURE: The kernel assigns a nonzero value to this key
if this executable should be treated securely. This setting may be
triggered by a Linux Security Module, but the common reason is that the
kernel recognizes that the process is executing a set-user-ID or
set-group-ID program. In this case, the dynamic linker disables the use of
certain environment variables (as described in the ld-linux.so(8)
manual page) and the C library changes other aspects of its behavior.
- AT_UID, AT_EUID, AT_GID, and
AT_EGID: These are the real and effective user and group IDs of
the process. Making these values available in the vector saves the dynamic
linker the cost of making system calls to determine the values. If the
AT_SECURE value is not available, the dynamic linker uses these
values to make a decision about whether to handle the executable securely.
- AT_PAGESZ: The value is the system page size. The
dynamic linker needs this information during the linking phase, and the C
library uses it in the implementation of the malloc family of
functions.
- AT_PLATFORM: The value is a pointer to a string
identifying the hardware platform on which the program is running. In some
circumstances, the dynamic linker uses this value in the interpretation of
rpath values. (The ld-linux.so(8) man page describes
rpath values.)
- AT_SYSINFO_EHDR: The value is a pointer to the page
containing the Virtual Dynamic Shared Object (VDSO) that the kernel creates
in order to provide fast implementations of certain system calls. (Some
documentation on the VDSO can be found in the kernel source file
Documentation/ABI/stable/vdso.)
- AT_HWCAP: The value is a mask of
bits whose settings indicate detailed processor capabilities. This
information can be used to provide optimized behavior for certain library
functions. The contents of the bit mask are hardware dependent (for
example, see the kernel source file
arch/x86/include/asm/cpufeature.h for details relating to the
Intel x86 architecture).
- AT_RANDOM: The value is a pointer to sixteen random bytes
provided by the kernel. The dynamic linker uses this to implement a stack
canary.
The precise reasons why the GNU C library developers have chosen to add
the getauxval() function now are a little unclear. The commit
message and NEWS file entry for the change were merely brief explanations
of what the change was, rather than why it was made. The only clue provided by the implementer on the
libc-alpha mailing list suggested that doing so was useful to allow for
"future enhancements to the AT_ values, especially target-specific
ones." That comment, plus the observation that the glibc developers
tend to be rather conservative about adding new interfaces to the ABI,
suggests that they have some interesting new user-space uses of the
auxiliary vector in mind.
Comments (8 posted)
Brief items
I recommend NOT assuming that package managers are the cat's pajamas
and that therefore we can all skip the ability to usefully build from
source.
—
John Gilmore
Comments (none posted)
The
KDE Manifesto has been
released. "
The KDE Manifesto is not intended to change the organization or the way it works. Its aim is only to describe how the KDE Community sees itself. What binds us together are certain values and their practical implications, without regard for who a person is or what background and skills they bring. It is a living document, so it will change over time as KDE continues to grow and mature. We are sharing the Manifesto to help people understand what KDE is all about, what we want to accomplish and why we do what we do."
Comments (8 posted)
Mozilla has released Firefox 16. See the details in the release notes.
Firefox 16.0 is also available for Android. Here are the Android version
release notes.
Comments (2 posted)
The Electronic Frontier Foundation (EFF) has
released
version 3.0 of HTTPS Everywhere. HTTPS Everywhere 3.0 adds encryption
protection to 1,500 more websites, twice as many as previous stable
releases. "
Our current estimate is that HTTPS Everywhere 3 should encrypt at least a hundred billion page views in the next year, and trillions of individual HTTP requests."
Comments (none posted)
Version 2.0 of the system diagnostic framework SystemTap has been released. This release adds a simple macro facility to the built-in scripting language, the ability to conditionally vary code based on the user's privilege level, and an experimental backend that allows SystemTap to profile a user's own processes (i.e., without root privileges).
Full Story (comments: none)
Antoine Martin wrote in to alert us to the latest release of xpra, the "screen for X" utility. This release includes a host of new features, including several video compression formats and experimental support for multiple, concurrent clients.
Full Story (comments: 2)
At its open.NASA blog, the US space agency is soliciting input from the public on the data sets and APIs it provides. "As we collect more and more data, figuring out the best way to distribute, use, and reuse the data becomes more and more difficult. API’s are one way we can significantly lower the barrier of entry to people from outside NASA being able to manipulate and access our public information." The current estimate is that NASA collects 15 terabytes of data per day, and future missions may collect far more.
Comments (none posted)
Newsletters and articles
Over at opensource.com, Ruth Suehle
reports on the Open Hardware Summit, which was recently held in New York. At the summit, the
Open Source Hardware Association was officially launched and various ideas about open hardware business strategies were discussed. "
Many in the audience were waiting for the afternoon session that included Bre Pettis, co-founder and CEO of MakerBot, creators of a popular open source 3D printer. Earlier in the week, the company announced its latest product, the Replicator 2 3D printer. At the same time, Pettis announced to much controversy, 'For the Replicator 2, we will not share the way the physical machine is designed or our GUI because we don’t think carbon-copy cloning is acceptable and carbon-copy clones undermine our ability to pay people to do development.'"
Comments (7 posted)
The H
creates
a standalone mobile telephone network using the sysmoBTS base station. "
In previous articles, we've looked at the question of how free are the phones that people use every day, and looked at the theory behind building your own GSM phone network using open source software. Now, in this article we take a look at the sysmoBTS, a small form-factor GSM Base Transceiver Station (BTS) built around these principles and the steps required to configure it to provide a standalone mobile telephone network that is useful for research, development and testing purposes."
Comments (2 posted)
For anybody wanting to work with OpenStreetMap data using PostgreSQL, here's
a collection of useful tools and techniques. "
At first glance, OSM data and Postgres (specifically PostGIS) seem like a natural, easy fit for one another: OSM is vector data, PostGIS stores vector data. OSM has usernames and dates-modified, PostGIS has columns for storing those things in tables. OSM is a worldwide dataset, PostGIS has fast spatial indexes to get to the part you want. When you get to OSM’s free-form tags, though, the row/column model of Postgres stops making sense and you start to reach for linking tables or advanced features like hstore."
Comments (none posted)
At the Public Knowledge blog, Michael Weinberg addresses the differing legal underpinnings of open source hardware and open source software. "This combination – copyright that does not protect function, trademark that needs to be applied for and does not protect function, and patents that need to be applied for and can protect functions – means that most hardware projects are 'open' by default because their core functionality is not protected by any sort of intellectual property right. Of course, in this case 'open' means that their key functionality can be copied without legal repercussion, not that the schematics have been posted online or that it is easy to discover how they work (critical elements of open source hardware)." The article is an extension of Weinberg's recent talk at the Open Hardware Summit, and poses questions interesting in light of MakerBot's announcement that its latest 3D printer would not be open.
Comments (none posted)
Page editor: Nathan Willis
Announcements
Brief items
The Free Software Foundation has awarded its first *Respects Your
Freedom* (RYF) certification to the *LulzBot AO-100 3D Printer* sold
by Aleph Objects, Inc. "
The RYF certification mark means that the
product meets the FSF's standards in regard to users' freedom, control
over the product, and privacy."
Full Story (comments: none)
Articles of interest
The FSFE (Free Software Foundation Europe) newsletter covers Software
Freedom Day activities, software patents, free software in the French
public administration, and several other topics.
Full Story (comments: none)
Here's a lengthy New York Times article looking at the problems with the US
patent system. "
In the smartphone industry alone, according to a
Stanford University analysis, as much as $20 billion was spent on patent
litigation and patent purchases in the last two years — an amount equal to
eight Mars rover missions. Last year, for the first time, spending by Apple
and Google on patent lawsuits and unusually big-dollar patent purchases
exceeded spending on research and development of new products, according to
public filings."
Comments (67 posted)
The H
reports on plans for a MeeGo-based phone.
"
The Finnish startup Jolla Ltd says that it has raised €200 million from a number of, currently unnamed, telecommunications companies and that it will be unveiling a MeeGo-based device next month. The funding consortium is reported to include at least one telecom operator, a chipset maker, and device and component manufacturers." Though MeeGo itself is free software, Jolla evidently plans to keep its "
well-patented" user interface layer closed and license it to other companies.
Comments (125 posted)
New Books
Pragmatic Bookshelf has released "Practical Vim" by Drew Neil.
Full Story (comments: none)
Calls for Presentations
The Cloud Infrastructure, Distributed Storage and High Availability
mini-conference will take place January 28 as part of linux.conf.au 2013 in
Canberra, Australia. The call for papers closes November 4.
Full Story (comments: none)
Upcoming Events
Events: October 11, 2012 to December 10, 2012
The following event listing is taken from the
LWN.net Calendar.
| Date(s) | Event | Location |
| October 11 - October 12 | Korea Linux Forum 2012 | Seoul, South Korea |
| October 12 - October 13 | Open Source Developer's Conference / France | Paris, France |
| October 13 | 2012 Columbus Code Camp | Columbus, OH, USA |
| October 13 - October 14 | Debian Bug Squashing Party in Utrecht | Utrecht, Netherlands |
| October 13 - October 14 | Debian BSP in Alcester (Warwickshire, UK) | Alcester, Warwickshire, UK |
| October 13 - October 14 | PyCon Ireland 2012 | Dublin, Ireland |
| October 13 - October 15 | FUDCon:Paris 2012 | Paris, France |
| October 15 - October 18 | Linux Driver Verification Workshop | Amirandes, Heraklion, Crete |
| October 15 - October 18 | OpenStack Summit | San Diego, CA, USA |
| October 17 - October 19 | MonkeySpace | Boston, MA, USA |
| October 17 - October 19 | LibreOffice Conference | Berlin, Germany |
| October 18 - October 20 | 14th Real Time Linux Workshop | Chapel Hill, NC, USA |
| October 20 - October 21 | PyCarolinas 2012 | Chapel Hill, NC, USA |
| October 20 - October 21 | PyCon Ukraine 2012 | Kyiv, Ukraine |
| October 20 - October 21 | Gentoo miniconf | Prague, Czech Republic |
| October 20 - October 21 | LinuxDays | Prague, Czech Republic |
| October 20 - October 23 | openSUSE Conference 2012 | Prague, Czech Republic |
| October 22 - October 23 | PyCon Finland 2012 | Espoo, Finland |
| October 23 - October 25 | Hack.lu | Dommeldange, Luxembourg |
| October 23 - October 26 | PostgreSQL Conference Europe | Prague, Czech Republic |
| October 25 - October 26 | Droidcon London | London, UK |
| October 26 - October 27 | Firebird Conference 2012 | Luxembourg, Luxembourg |
| October 26 - October 28 | PyData NYC 2012 | New York City, NY, USA |
| October 27 | pyArkansas 2012 | Conway, AR, USA |
| October 27 | Central PA Open Source Conference | Harrisburg, PA, USA |
| October 27 | Linux Day 2012 | Hundreds of cities, Italy |
| October 27 - October 28 | Technical Dutch Open Source Event | Eindhoven, Netherlands |
| October 29 - November 1 | Ubuntu Developer Summit - R | Copenhagen, Denmark |
| October 29 - November 2 | Linaro Connect | Copenhagen, Denmark |
| October 29 - November 3 | PyCon DE 2012 | Leipzig, Germany |
| October 30 | Ubuntu Enterprise Summit | Copenhagen, Denmark |
| November 3 - November 4 | MeetBSD California 2012 | Sunnyvale, California, USA |
| November 3 - November 4 | OpenFest 2012 | Sofia, Bulgaria |
| November 5 - November 7 | LinuxCon Europe | Barcelona, Spain |
| November 5 - November 7 | Embedded Linux Conference Europe | Barcelona, Spain |
| November 5 - November 8 | ApacheCon Europe 2012 | Sinsheim, Germany |
| November 5 - November 9 | Apache OpenOffice Conference-Within-a-Conference | Sinsheim, Germany |
| November 7 - November 8 | LLVM Developers' Meeting | San Jose, CA, USA |
| November 7 - November 9 | KVM Forum and oVirt Workshop Europe 2012 | Barcelona, Spain |
| November 8 | NLUUG Fall Conference 2012 | ReeHorst in Ede, Netherlands |
| November 9 - November 11 | Python Conference - Canada | Toronto, ON, Canada |
| November 9 - November 11 | Mozilla Festival | London, England |
| November 9 - November 11 | Free Society Conference and Nordic Summit | Göteborg, Sweden |
| November 10 - November 16 | SC12 | Salt Lake City, UT, USA |
| November 12 - November 14 | Qt Developers Days | Berlin, Germany |
| November 12 - November 16 | 19th Annual Tcl/Tk Conference | Chicago, IL, USA |
| November 12 - November 17 | PyCon Argentina 2012 | Buenos Aires, Argentina |
| November 16 | PyHPC 2012 | Salt Lake City, UT, USA |
| November 16 - November 19 | Linux Color Management Hackfest 2012 | Brno, Czech Republic |
| November 20 - November 24 | 8th Brazilian Python Conference | Rio de Janeiro, Brazil |
| November 24 | London Perl Workshop 2012 | London, UK |
| November 24 - November 25 | Mini Debian Conference in Paris | Paris, France |
| November 26 - November 28 | Computer Art Congress 3 | Paris, France |
| November 29 - November 30 | Lua Workshop 2012 | Reston, VA, USA |
| November 29 - December 1 | FOSS.IN/2012 | Bangalore, India |
| November 30 - December 2 | Open Hard- and Software Workshop 2012 | Garching bei München, Germany |
| November 30 - December 2 | CloudStack Collaboration Conference | Las Vegas, NV, USA |
| December 1 - December 2 | Konferensi BlankOn #4 | Bogor, Indonesia |
| December 2 | Foswiki Association General Assembly | online and Dublin, Ireland |
| December 5 | 4th UK Manycore Computing Conference | Bristol, UK |
| December 5 - December 7 | Qt Developers Days 2012 North America | Santa Clara, CA, USA |
| December 5 - December 7 | Open Source Developers Conference Sydney 2012 | Sydney, Australia |
| December 7 - December 9 | CISSE 12 | Everywhere, Internet |
| December 9 - December 14 | 26th Large Installation System Administration Conference | San Diego, CA, USA |
If your event does not appear here, please
tell us about it.
Page editor: Rebecca Sobol