Brief items
The 3.2 kernel was released on January 4, after 72 days of
development. Among other things, this kernel adds the
proportional rate reduction TCP algorithm, the
extended verification module, the
CPU scheduler bandwidth controller, the
cross-memory attach IPC mechanism, the Hexagon
DSP architecture, improved recovery of corrupted Btrfs filesystems, and the
I/O-less dirty throttling code. See the
Kernelnewbies 3.2
page for lots more information.
As of this writing, the 3.3 merge window is open; see below for details on
what has been merged so far.
Stable updates: the 2.6.32.53, 3.0.16, and 3.1.8 stable kernel updates were released on
January 6. Each contains the usual long list of important fixes (OK,
2.6.32.53 only has nine fixes, but the newer kernels have quite a few
more).
The 2.6.32.54,
3.0.17,
3.1.9, and
3.2.1 stable updates are in the review
process; they can be expected on or after January 12. 3.1.9 is likely
to be the final update for the 3.1 kernel.
Comments (none posted)
Simplifying the code should always be the initial proposal. Adding
more complexity on top is the worst-case when-all-else-failed
option. Yet we so often reach for that option first :(
--
Andrew Morton
If this code were a character driver for an obscure serial port on
a lesser-known chip architecture, I don't think it would get any
attention at all. As it is, it's looking like at least a few man
months of work will be required, as well as some relatively
unneeded changes to Android user space, to get this feature into a
permanently acceptable state. I wouldn't be surprised to see this
stretch into a few calendar years.
Code that specializes the kernel in weird ways is accepted into the
kernel all the time, and I've tried to figure out why this
particular bit of code is treated differently. Especially since
this code is self-contained, configurable, and imposes no
perceivable long-term maintenance burden.
--
Tim Bird
So, I've said it many times before, and I'll say it again:
Yes, you are special and unique, just like everyone else.
The next person who says the "embedded is different" phrase again, owes
me a beer of my choice.
--
Greg Kroah-Hartman
Comments (6 posted)
Greg Kroah-Hartman has posted
an update on
his plans for long-term kernel maintenance. As he announced before, the
3.1 series is almost at the end of its update period; he is also
approaching the end of his maintenance for the long-lived 2.6.32 release.
"It is approaching its end-of-life, and I think I only have another
month or so doing releases of this. After I am finished with it, it might
be picked up by someone else, but I'm not going to promise
anything." As it happens, Tim Gardner has
stated that Ubuntu will support 2.6.32 through
April 2015 - though whether that support will translate into kernel
releases outside of the Ubuntu distribution is not clear. Ubuntu also
plans to take on 3.2 as a long-term supported kernel.
Comments (9 posted)
By Jonathan Corbet
January 11, 2012
Since the early days of the Linux device model, there has been a special
device class for "system devices," typically those built into the platform
itself. For almost as long, the driver core developers have felt that
there was no real need for this device type, which looks weirdly different
from every other type of device. For 3.3, they have actually done
something about it; system devices are no more.
All in-tree system device drivers have been fixed up to use regular devices
instead. The process is relatively simple; it can be seen in, for example,
this
commit updating kernel/time/clocksource.c. In short, the
embedded struct sys_device becomes a simple struct device
instead. Attributes defined with SYSDEV_ATTR() are switched to
DEVICE_ATTR(). The sysdev_class structure is turned into
a nearly empty bus_type structure instead. That is about all that
is required.
These changes, naturally, cause a user-space ABI change; system devices had
their own special area under /sys which goes away. That has the
potential to break programs and scripts, which would not be a good thing.
To avoid this problem, a special function has been added:
int subsys_system_register(struct bus_type *subsys,
                           const struct attribute_group **groups);
Registering a subsystem in this way will restore its old
/sys/devices/system hierarchy. Needless to say, this function
exists for backward compatibility purposes only; using it in new drivers is
not likely to be received well.
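Condensed to its essentials, the conversion pattern looks something like the following sketch; the foo and bar names are invented, and the clocksource commit linked above shows a real instance:

```c
/* Old style: a system device class with its own attribute macro. */
static struct sysdev_class foo_sysdev_class = {
	.name = "foo",
};
static SYSDEV_ATTR(bar, 0644, bar_show, bar_store);

/* New style: a nearly empty bus_type and an ordinary device
 * attribute.  Registering with subsys_system_register() preserves
 * the old /sys/devices/system/foo hierarchy for existing user
 * space; new code should use plain bus/device registration. */
static struct bus_type foo_subsys = {
	.name = "foo",
	.dev_name = "foo",
};
static DEVICE_ATTR(bar, 0644, bar_show, bar_store);

static int __init foo_init(void)
{
	return subsys_system_register(&foo_subsys, NULL);
}
```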
Comments (none posted)
By Jonathan Corbet
January 11, 2012
Over the years, we have seen a number of attempts to use the seccomp
("secure computing") mechanism to reduce the range of operations available
to a given process. The hope is to use such a mechanism as part of a
sandboxing solution that would allow (for example) a web browser to run
third-party code in a safer manner. Thus far, all of these attempts have
gone down in flames; see
Seccomp filters:
no clear path from last May for the most recent episode in this
particular story.
Things have been quiet on the seccomp front recently - until now. Will
Drewry, who has been behind the recent attempts to enhance seccomp, has
come up with an interesting new approach to
the problem. Whether this attempt will be more successful than its
predecessors remains to be seen, but Will has managed to step around some
of the traps that doomed his previous attempt.
In the last seccomp discussion, there was a fair amount of pressure to adapt the
kernel's tracing infrastructure to this task; there was also resistance to
using that infrastructure in that way. As explained in detail in the patch
posting, Will has come to the conclusion that the tracing infrastructure is
not really fit for the task anyway:
At every turn, it appears that the tracing infrastructure was
unsuited for being used for attack surface reduction or as a larger
security subsystem on its own. It is well suited for feeding a
policy enforcement mechanism (like seccomp), but not for letting
the logic co-exist. It doesn't mean that it has security problems,
just that there will be a continued struggle between having a
really good perf system and a really good kernel attack surface
reduction system if they were merged.
Will's new approach has a stroke of brilliance to it: rather than use the
ftrace filter mechanism, he has repurposed the networking layer's packet
filtering mechanism (BPF). The BPF code normally operates on packets; in
the seccomp context, instead, it operates on the register set at the time
of each system call. The registers will contain the system call number and
its parameters, allowing the filter to make a wide range of decisions on
what will (or will not) be allowed. BPF is also well-maintained and
well-optimized code; it even has an in-kernel just-in-time compiler. Some
of these advantages are lost because seccomp uses its own BPF
interpreter; one assumes that a way could be found to merge the two
implementations if the underlying idea looks like it will pass muster.
As of this writing, there has not really been time for comments on the new
patch. It will be interesting to see what the developers think.
Meanwhile, those wanting more information should see the patch posting and
the documentation file, which includes a
sample program showing how to use the new facility.
Comments (8 posted)
Kernel development news
By Jonathan Corbet
January 11, 2012
As of this writing, just over 5,700 non-merge changesets have been pulled
into the mainline for the 3.3 development cycle. A fair amount of work
remains to be pulled, so it looks like another fairly active cycle, though
perhaps not quite up to the level of 3.2.
Some of the more significant, user-visible changes merged so far include:
- The "team" network driver - a lightweight mechanism for bonding
multiple interfaces together - has been merged. The libteam project has the
user-space code needed to operate this device.
- The network priority control group controller has been added. This
controller allows the administrator to specify the priority with which
members of each control group have access to the network interfaces
available on the system. See net_prio.txt from the documentation
directory for more information.
- Also added is the TCP buffer size
controller which can be used to place limits on the amount of
kernel memory used to hold TCP buffers.
- The byte queue limits
infrastructure has been added, enabling control over how much data can
be queued for transmission over a network interface at any time.
- The Open vSwitch virtual network
switch has been merged.
- The ARM architecture has gained support for the "large physical
address extension," allowing 32-bit processors to address more than
4GB of installed memory.
- The "adaptive RED" queue management algorithm is now supported by the
networking layer.
- The near-field communications (NFC) layer has gained support for the
logical link control protocol (LLCP).
- The beginnings of dynamic frequency
selection support have been added to the wireless networking
subsystem.
- For S390 users who find the current limit of 3.8TB of RAM to be
constraining: 3.3 will add support for four-level page tables and an
upper limit of 64TB (for now).
- Various Android drivers have returned to the staging tree; see this article for more information.
- The C6X architecture (described in this
article) has been merged.
- The ext4 filesystem has added support for online resizing via the
EXT4_IOC_RESIZE_FS ioctl() command. This operation
does not (yet) work with filesystems using the "bigalloc" or "meta_bg"
features.
- The /proc filesystem has a new subdirectory for each process
called map_files; it contains a symbolic link describing
every file-backed mapping used by the relevant process. This feature
is one of many needed to support the desired checkpoint/restart
feature.
- /proc also supports a couple of new mount options. When
mounted with hidepid=1, /proc will deny access to
any process directories not owned by the requesting process. With
hidepid=2, even the existence of other processes will be
hidden. The default (hidepid=0) behavior is unchanged. The
other new option (gid=N) provides an ID for a group that is
allowed to access information for all processes regardless of the
hidepid= setting.
- New drivers:
- Systems and processors:
AppliedMicro APM8018X PowerPC processors,
Numascale NumaChip systems,
IBM Currituck (476fpe) boards, and
NVIDIA Tegra30 processors.
- Input:
TI TCA8418 keypad decoders,
Wacom Intuos4 wireless tablets,
EETI eGalax multi-touch panels,
GPIO-connected tilt switches,
Sharp GP2AP002A00F I2C Proximity/Opto sensors, and
PIXCIR I2C touchscreens.
- Miscellaneous: P7IOC PowerPC I/O hubs,
Dialog Semiconductor DA9052/53 PMIC devices,
SiRF SoC Platform Serial ports,
Analog Devices AD5421, AD5764, AD5744, and AD5380 digital to
analog converters,
GE PIO2 VME Parallel I/O cards,
OMAP 2/3/4 displays,
OMAP "Tiling and Isometric Lightweight Engine for Rotation" devices,
Dialog DA9052/DA9053 regulators,
VIA hardware watchdog timers, and
TI TCA6507 I2C LED controllers.
- Network: Calxeda 1G/10G XGMAC Ethernet interfaces and
ISA-based CC770 CAN controllers.
- USB: Marvell USB OTG transceivers and
Marvell EHCI host controllers.
- Graduations: Microsoft's Hyper-V virtual network
driver
and the gma500 graphics driver
have moved out of staging into the mainline.
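As an illustration of the new /proc mount options described above, a privileged process could apply them with a remount; this fragment is a sketch only (it requires CAP_SYS_ADMIN, and the group ID 4 is an arbitrary choice):

```c
/* Remount /proc so that unprivileged users see only their own
 * processes, while members of group 4 retain full visibility. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	if (mount("proc", "/proc", "proc", MS_REMOUNT,
		  "hidepid=2,gid=4") != 0) {
		perror("remount /proc");
		return 1;
	}
	return 0;
}
```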
Changes visible to kernel developers include:
- A reworked version of the DMA buffer sharing API has been merged; this
API has been described in a separate
article.
- The "memblock" low-level memory allocation API has been substantially
reworked.
- Quite a few VFS interfaces have been changed to use the
umode_t type for file mode bits.
- Also in the VFS: most of
the members of struct vfsmount have been moved elsewhere
(to a containing struct mount) and hidden from
filesystem code. A number of callbacks in struct
super_operations (specifically: show_stats(),
show_devname(), show_path() and
show_options()) now take a pointer to struct dentry
instead of struct vfsmount.
- The pin control subsystem has gained a
new configuration interface.
- Boolean module parameters have traditionally allowed the underlying
module variable to be of either bool or int type.
That tolerance is coming to an end with 3.3, where non-bool
types will generate a warning; the plan is apparently to change those
warnings to fatal compilation errors
in the 3.4 cycle. A lot of modules have seen type changes for their
parameters in preparation for the new regime.
- The "system device" type has been removed from the kernel; all
instances have been converted to regular devices instead. See this article for more information.
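The module-parameter change mentioned above is mechanical; a minimal sketch (the enable_foo name is invented):

```c
#include <linux/module.h>

/* Before 3.3 this was tolerated:
 *     static int enable_foo = 1;
 * Now the backing variable must really be bool, or the build will
 * warn (and, apparently, eventually fail). */
static bool enable_foo = true;
module_param(enable_foo, bool, 0644);
MODULE_PARM_DESC(enable_foo, "enable the foo feature");
```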
The merge window can be expected to remain open through approximately
January 18.
Comments (4 posted)
By Jonathan Corbet
January 10, 2012
Sometimes it seems that there are few uncontroversial topics in kernel
development, but saving power would normally be among them. Whether the
concern is keeping a battery from running down too soon or keeping the
planet from running down too soon, the ability to use less power per unit
of computation is seen as a good thing. So when the kernel's scheduler
maintainer threatened to rip out a bunch of power-saving code, it got some
people's attention.
The main thing the scheduler can do to reduce power consumption is to allow
as many CPUs as possible to stay in a deep sleep state for as long as
possible. With
contemporary hardware, a comatose CPU draws almost no power at all. If
there is a lot of CPU-intensive work to do, there will be obvious limits on
how much sleeping the CPUs can get away with. But, if the system is
lightly loaded, the way the scheduler distributes running processes can
have a significant effect on both performance and power use.
Since there is a bit of a performance tradeoff, the scheduler exports a
couple of tuning knobs under /sys/devices/system/cpu. The first,
called sched_mc_power_savings, has three possible settings:
- 0: The scheduler will not consider power usage when distributing tasks;
instead, tasks will be distributed across the system for maximum
performance. This is the default value.
- 1: One core will be filled with tasks before tasks will be moved to other
cores. The idea is to concentrate the running tasks on a relatively
small number of cores, allowing the others to remain idle.
- 2: Like (1), but with the additional tweak that newly awakened tasks will
be directed toward "semi-idle" cores rather than started on an idle
core.
There is another knob, sched_smt_power_savings, that takes the
same set of values, but applies the results to the threads of symmetric
multithreading (SMT) processors instead. These threads look a lot like
independent processors, but, since they share most of the underlying
hardware, they are not truly independent from each other.
Recently, Youquan Song noticed that sched_smt_power_savings did
not actually work as advertised; a quick patch followed to fix the problem. Scheduler
maintainer Peter Zijlstra objected to the fix, but he also made it clear
that he objects to the power-saving machinery in general. Just to make
that clear, he came back with a patch
removing the whole thing and a threat to merge that patch unless somebody
puts some effort into cleaning up the power-saving code.
Peter subsequently made it clear that he sees the value of power-aware
scheduling; the real problem is in the implementation. And, within that,
the real problem seems to be the control knobs. The two knobs provide
similar behavioral controls at two levels of the scheduler domain
hierarchy. But, with three possible values for each, the result is nine
different modes that the scheduler can run in. That seems like too much
complexity for a situation where the real choice comes down to "run as fast
as possible," or "use as little power as possible."
In truth, it is not quite that simple. The performance cost of loading up
every thread in an SMT processor is likely to be higher than that of
concentrating tasks at higher levels. Those threads contend for the actual
CPU hardware, so they will slow each other down. So one could conceive of
situations where an administrator might want to enable different behavior
at different levels, but such situations are likely to be quite rare. It
is probably not worth the trouble of maintaining the infrastructure to
support nine separate scheduler modes just in case somebody wants to do
something special.
For added fun, early versions of the patch
adding the "book" scheduling level (used only by the s390 architecture)
included a sched_book_power_savings switch, though that switch
went away before the patch was merged. There is
also the looming possibility that somebody may want to do the same for
scheduling at the NUMA node level. There comes a point where the number of
possibilities becomes ridiculous. Some people - Peter, for example - think
that point has already been reached.
That conclusion leads naturally to talk of what should replace the current
mechanism. One solution would be a simple knob with two settings:
"performance" or "low power." It could, as Ingo Molnar suggested, default to performance for
line-connected systems and low power for systems on battery. That seems
like a straightforward solution, but there is also a completely different approach suggested by
Indan Zupancic: move that decision making into the CPU governor instead.
The governor is charged with deciding which power state a CPU should be in
at any given (idle) time. It could be given the additional task of
deciding when CPUs should be taken offline entirely; the scheduler could
then just do its normal job of distributing tasks among the CPUs that are
available to it. Moving this responsibility to the governor is an
interesting thought, but one which does not
currently have any code to back it up; until somebody rectifies that little
problem, a governor-based approach probably will not receive a whole lot
more consideration.
Somebody probably will come through with the single-knob approach, though;
whether they will follow through and clean up the power-saving
implementation within the scheduler is harder to say. But it should be
enough to avert the threat of seeing that code removed altogether. And
that is certainly a good thing; imagine the power that would be uselessly
consumed in a flamewar over a regression in the kernel's power-aware
scheduling ability.
Comments (19 posted)
By Jonathan Corbet
January 11, 2012
Back in August 2011, LWN
looked at the DMA
buffer sharing patch set posted by Marek Szyprowski. Since then, that
patch has been picked up by Sumit Semwal, who modified it considerably in
response to comments from a number of developers. The version of this
patch that was merged for 3.3 differs enough from its predecessors that it
merits another look here.
The core idea remains the same, though: this mechanism allows DMA buffers
to be shared between drivers that might otherwise be unaware of each
other. The initial target use is sharing buffers between producers and
consumers of video streams; a camera device, for example, could acquire a
stream of frames into a series of buffers that are shared with the graphics
adapter, enabling the capture and display of the data with no copying in
the kernel.
In the 3.3 sharing scheme, one driver will set itself up as an exporter of
sharable buffers. That requires providing a set of callbacks to the buffer
sharing code:
struct dma_buf_ops {
    int (*attach)(struct dma_buf *buf, struct device *dev,
                  struct dma_buf_attachment *dma_attach);
    void (*detach)(struct dma_buf *buf,
                   struct dma_buf_attachment *dma_attach);
    struct sg_table *(*map_dma_buf)(struct dma_buf_attachment *dma_attach,
                                    enum dma_data_direction dir);
    void (*unmap_dma_buf)(struct dma_buf_attachment *dma_attach,
                          struct sg_table *sg);
    void (*release)(struct dma_buf *);
};
Briefly, attach() and detach() inform the exporting
driver when others take or release references to the buffer. The
map_dma_buf() and unmap_dma_buf() callbacks, instead,
cause the buffer to be prepared (or unprepared) for DMA and pass ownership
between drivers. A call to release() will be made when the last
reference to the buffer is released.
The exporting driver makes the buffer available with a call to:
struct dma_buf *dma_buf_export(void *priv, struct dma_buf_ops *ops,
                               size_t size, int flags);
Note that the size of the buffer is specified here, but there is
no pointer to the buffer itself. In fact, the current version of the
interface never passes around CPU-accessible buffer pointers at all.
One of the actions performed by dma_buf_export() is the creation
of an anonymous file to represent the buffer; flags is used to set
the mode bits on that file.
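Put together, an exporting driver might look something like the following sketch; the cam_* names are invented, and error handling is abbreviated:

```c
/* Callbacks handed to the buffer-sharing core by a hypothetical
 * camera driver. */
static struct dma_buf_ops cam_dmabuf_ops = {
	.attach        = cam_attach,
	.detach        = cam_detach,
	.map_dma_buf   = cam_map_dma_buf,
	.unmap_dma_buf = cam_unmap_dma_buf,
	.release       = cam_release,
};

	/* Export the buffer; 0600 becomes the mode bits on the
	 * anonymous file backing it. */
	buf = dma_buf_export(cam_buffer, &cam_dmabuf_ops,
			     cam_buffer_size, 0600);
	if (IS_ERR(buf))
		return PTR_ERR(buf);
```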
Since the file is anonymous, it is not visible to the rest of the kernel
(or user space) in any useful way. Truly exporting the buffer, instead,
requires obtaining a file descriptor for it and making that descriptor
available to user space. The descriptor can be had with:
int dma_buf_fd(struct dma_buf *dmabuf);
There is no standardized mechanism for passing that file descriptor to user
space, so it seems likely that any subsystem implementing this
functionality will add its own special ioctl() operation to get a
buffer's file descriptor. The same is true for the act of passing a file
descriptor to drivers that will share this buffer; it is something that
will happen outside of the buffer-sharing API.
A driver wishing to share a DMA buffer has to go through a series of calls
after obtaining the corresponding file descriptor, the first of which is:
struct dma_buf *dma_buf_get(int fd);
This function obtains a reference to the buffer and returns a
dma_buf structure pointer that can be used with the other API calls to
refer to the buffer. When the driver is finished with the buffer, it
should be returned with a call to dma_buf_put().
The next step is to "attach" to the buffer with:
struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
                                          struct device *dev);
This function will allocate and fill in yet another structure:
struct dma_buf_attachment {
    struct dma_buf *dmabuf;
    struct device *dev;
    struct list_head node;
    void *priv;
};
That structure will then be passed to the exporting driver's
attach() callback. There seem to be a couple of reasons for the
existence of this step, the first of which is simply to let the exporting
driver know about the consumers of the buffer. Beyond that, the
device structure passed by the calling driver can contain a
pointer (in its dma_params field) to one of these structures:
struct device_dma_parameters {
    unsigned int max_segment_size;
    unsigned long segment_boundary_mask;
};
The exporting driver should look at these constraints and ensure that the
buffer it is exporting can satisfy them; if not, the attach() call
should fail. If multiple drivers attach to the buffer, the exporting
driver will need to allocate the buffer in a way that satisfies all of
their constraints.
The final step is to map the buffer for DMA:
struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach,
                                        enum dma_data_direction direction);
This call turns into a call to the exporting driver's
map_dma_buf() callback.
If this call succeeds, the return value will be a scatterlist that can be
used to program the DMA operation into the device. A successful return
also means that the calling driver's device owns the buffer; it should not
be touched by the CPU during this time.
Note that mapping a buffer is an operation that can block for a number of
reasons; it may have to wait, for example, until the buffer is no longer
busy elsewhere.
Also worth noting is that, until this call is made, the buffer need not
necessarily be allocated anywhere. The exporting driver can wait until
others have attached to the buffer so that it can see their DMA constraints
and allocate the buffer accordingly. Of course, if the buffer lives in
device memory or is otherwise constrained on the exporting side, it can be
allocated sooner.
After the DMA operation is completed, the sharing driver should unmap the
buffer with:
void dma_buf_unmap_attachment(struct dma_buf_attachment *attach,
                              struct sg_table *sg_table);
That will, in turn, generate a call to the exporting driver's
unmap_dma_buf() function. Detaching from the buffer (when it is
no longer needed) can be done with:
void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach);
As might be expected, this function will call the exporting driver's
detach() callback.
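The full sequence on the consumer side might be sketched as follows; the disp_dev device is invented, fd is the descriptor received from user space via a subsystem-specific ioctl(), and error handling is omitted for brevity:

```c
	struct dma_buf *dmabuf = dma_buf_get(fd);
	struct dma_buf_attachment *att;
	struct sg_table *sgt;

	/* Let the exporter know about us (and our DMA constraints). */
	att = dma_buf_attach(dmabuf, &disp_dev);

	/* Take ownership of the buffer; may block. */
	sgt = dma_buf_map_attachment(att, DMA_BIDIRECTIONAL);

	/* ... program the device's DMA engine from sgt ... */

	dma_buf_unmap_attachment(att, sgt);
	dma_buf_detach(dmabuf, att);
	dma_buf_put(dmabuf);
```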
As of 3.3, there are no users for this interface in the mainline kernel.
There seems to be a fair amount of interest in using it, though, so Dave
Airlie pushed it into the mainline with the
idea that it would make the development of users easier. Some of those
users can be seen (in an early form) in Dave's
drm-prime repository and Rob
Clark's OMAP4 tree.
Comments (4 posted)
Patches and updates
Kernel trees
- Linus Torvalds: Linux 3.2.
(January 5, 2012)
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
Miscellaneous
- Lucas De Marchi: kmod 3.
(January 5, 2012)
Page editor: Jonathan Corbet