Brief items
The 3.4 kernel is out, released on May 20. Significant features in
this release include the Yama security module, support for the
x32 ABI, asymmetric multiprocessing support, the dm-verity device
mapper target, and more. For details, see the always-excellent
KernelNewbies 3.4 page.
The 3.5 merge window is open as of this writing; see the separate article
below for a summary of interesting changes merged so far.
Stable updates:
3.0.32, 3.2.18, and 3.3.7 were released on May 21,
and 2.6.34.12 on May 22.
Enough data has come in to satisfy me that with all the
improvements in Linux over the last year, and with BQL, codel and
fq_codel, that we've won a major battle in the war against
bufferbloat.
—
Dave Täht
else if(!strcmp(str, "pony")) {
tsc_clocksource_reliable = 1;
sched_clock_stable = 1;
tsc_perfect_smp_synchronization = 1;
}
—
Steven Rostedt
else if (!strcmp(str, "real")
panic("Can't handle real TSCs!\n");
—
Thomas Gleixner
Prediction: instead of Oracle coming out and admitting they were
morons about their idiotic suit against Android, they'll come out
posturing and talk about how they'll be vindicated, and pay lawyers
to take it to the next level of idiocy.
Sometimes I really wish I wasn't always right. It's a curse, I tell
you.
—
Linus Torvalds
Kernel development news
By Jonathan Corbet
May 23, 2012
Shortly after the release of the 3.4 kernel, Linus started the entire
process all over again for the 3.5 development cycle; over 2,500 changesets
were pulled into the mainline on the first day, and 4,600 have been merged
as of this writing. It looks like it will be
an interesting cycle with a lot of new stuff coming in and the removal of a
bunch of old cruft. As of this writing, user-visible changes pulled for
3.5 include:
- The TCP connection repair interface,
useful for the implementation of checkpoint/restart functionality, has
been merged.
- The networking stack has gained support for RFC 5827 early retransmit,
a mechanism aimed at speeding recovery from packet loss.
- The CoDel queue management algorithm,
which, hopefully, will be an important component in the solution to
the bufferbloat problem, has been merged.
- The seccomp filters mechanism has been
merged; it allows processes to reduce the set of available system
calls through the use of a mechanism based on the Berkeley packet
filter. See Documentation/prctl/seccomp_filter.txt
for details.
- The Yama security module has gained two new, increasingly restrictive
modes for controlling access to the PTRACE_ATTACH functionality.
- The logging reliability patch set
has been merged.
- The NUMA scheduler has been rewritten, with the result that it will
make different, hopefully better, scheduling decisions. Also, as has
been threatened for some months, the
power-aware scheduling code has been removed in the hope that somebody
will replace it with something that actually works.
- A lot of code has been removed in this development cycle, including
the ixp2000 Ethernet driver,
support for the sun4c SPARC CPU,
the ip_queue netfilter module (superseded by
nfnetlink_queue),
all support for token ring networking,
drivers for all MCA-based network cards,
support for the Econet protocol,
support for ARMv3 processors,
support for Intel IXP2xxx (XScale) processors,
support for ST-Ericsson U5500 development boards,
the Motorola 68360 serial port driver, and
the workqueue tracer.
- New drivers include:
- Processors and systems:
Blackfin BF609-based boards, and
Renesas Armadillo-800 EVA and KZM-A9-GT boards.
- Miscellaneous:
TI TPS65090 power regulators,
TI Palmas series power management chips,
RICOH RC5T583 power regulators,
Freescale MXS, IMX6Q, IMX53 and IMX51 IOMUX controllers,
ST Microelectronics SPEAr3xx pin controllers,
Renesas Emma Mobile SoC GPIO controllers and integrated serial ports,
Intersil ISL29028 concurrent light and proximity sensors,
TAOS TSL/TMD2x71 and TSL/TMD2x72 light and proximity sensors,
Analog Devices AD8366 variable gain amplifiers, and
Atmel AT91 analog to digital converters.
- Network: WIZnet W5100 and W5300 adapters,
Marvell Avastar 88W8797 wireless chipsets,
Emulex One Connect InfiniBand-over-Ethernet controllers, and
GCT Semiconductor GDM72xx WiMAX controllers.
- USB:
Marvell PXA USB OTG controllers,
Broadcom BCMA and SSB host controllers,
NXP ISP1301 USB transceivers, and
NXP LPC32XX USB peripheral controllers.
Also added is a "configurable composite gadget" driver that
allows user-space configuration of enabled functions.
- Staging graduations: the industrial I/O (IIO) core has
moved into drivers/iio;
VME drivers are now in drivers/vme, and
the Intel management engine interface (MEI) driver is now in
drivers/misc.
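As an illustration of the seccomp filter mechanism mentioned in the list above, here is a minimal user-space sketch in C. It is not taken from the kernel documentation: the helper names build_deny_filter() and install_deny_filter() are invented for this example, and it assumes a Linux system with the kernel's user-space headers installed. A production filter should also check seccomp_data.arch before trusting the system call number.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define DENY_FILTER_LEN 4

/* Hypothetical helper: fill insns (room for DENY_FILTER_LEN entries) with a
 * BPF program that fails system call number nr with EPERM and allows
 * everything else. */
void build_deny_filter(struct sock_filter *insns, unsigned int nr)
{
    const struct sock_filter prog[DENY_FILTER_LEN] = {
        /* Load the system call number from struct seccomp_data. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* On a match, fall through to the EPERM return; otherwise skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    for (int i = 0; i < DENY_FILTER_LEN; i++)
        insns[i] = prog[i];
}

/* Hypothetical helper: install the filter in the calling thread.
 * Returns 0 on success, -1 on failure (errno is set by prctl()). */
int install_deny_filter(unsigned int nr)
{
    struct sock_filter insns[DENY_FILTER_LEN];
    struct sock_fprog prog = { .len = DENY_FILTER_LEN, .filter = insns };

    build_deny_filter(insns, nr);
    /* Unprivileged callers must set no_new_privs before adding a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Once install_deny_filter() succeeds for a given system call number, that call should fail with EPERM in the calling thread while other calls proceed normally. A filter cannot be removed after it has been installed.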
Changes visible to kernel developers include:
- The many variants of the NLA_PUT() macro used with netlink
have been removed. Code should use one of the nla_put()
versions instead and make its error handling explicit.
- The mac80211 layer has gained support for MBSS mesh synchronization.
- There is new core support for the writing of near-field communication
(NFC) drivers using the HCI specification; see Documentation/nfc/nfc-hci.txt for details.
- The "regmap" subsystem, which centralizes the handling of banks of
device registers, now has support for registers in I/O memory.
- The pin control subsystem now has full device tree support.
- The Android "switch" class has been brought into the mainline and
extended into a general "external connector" framework.
- The "ramoops" mechanism has been reworked to use the pstore interface.
If the usual schedule holds, this merge window can be expected to close
around June 4. Watch this space next week for coverage of the next sets
of patches to be pulled into the mainline for the 3.5 development
cycle.
By Jonathan Corbet
May 23, 2012
Once upon a time, your editor had a job that involved working with a Data
General Nova system. The Nova had an interesting characteristic: since it
contained true core memory, the contents of that memory would persist
across a reboot—or a power-down. So the end-of-day shutdown procedure was
a simple matter of turning the machine off; when it was turned on the next
morning, it would simply continue where it was before. There were no
complaints about system boot time with that machine. The replacement of
core memory with silicon-based RAM brought a lot of nice advantages, but
the nonvolatile nature was lost on the way. But it appears that
nonvolatile memory may be about to make a comeback, bringing some
interesting development problems with it.
Matthew Wilcox raised the issue, noting
that nonvolatile memory (NVM) is coming, that it promises bandwidth and latency
numbers similar to those offered by dynamic RAM, and that, being cheaper
than DRAM, it is likely to be offered in larger sizes than DRAM is. He
later disclaimed any resemblance between
this description and any future products to be offered by his employer; it
is, he says, simply where the industry is going. Given that, it would be a
good idea for the kernel community to be ready for this technology when it
arrives.
One part of being ready is figuring out how to deal with nonvolatile
memory within the kernel. The suggested approach was to use a filesystem:
We think the appropriate way to present directly addressable NVM to
in-kernel users is through a filesystem. Different technologies
may want to use different filesystems, or maybe some forms of
directly addressable NVM will want to use the same filesystem as
each other.
A filesystem approach would allow the association of names with regions of
NVM space; an API was then proposed to allow the kernel to perform tasks
like mapping regions of NVM into the kernel's address space.
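The idea of naming a region and then mapping it has a familiar user-space analogue: a file name plus mmap(). The sketch below is only that analogy, written in C. It is not the in-kernel API that was proposed, and the function names region_map() and region_unmap() are invented for illustration.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map the region called name, creating it with size
 * bytes if it does not exist yet. Returns the mapping, or NULL on failure. */
void *region_map(const char *name, size_t size)
{
    void *addr;
    int fd = open(name, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return NULL;
    /* Make sure the backing object is large enough for the mapping. */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    return addr == MAP_FAILED ? NULL : addr;
}

/* Hypothetical helper: drop a mapping obtained from region_map(). */
int region_unmap(void *addr, size_t size)
{
    return munmap(addr, size);
}
```

Data written through one mapping is found again by a later region_map() call with the same name, which is the property a name-based NVM interface would need to provide across reboots.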
One question that came up quickly was: won't the use of the filesystem
model slow things down? There is a lot of overhead in the block layer,
which was not designed to deal with "storage" that operates at full memory
bandwidth. Matthew was never thinking of bringing in the full block layer,
though; instead, he said: "I'm hoping
that filesystem developers will indicate enthusiasm for moving to new
APIs." Such enthusiasm was in short supply in this discussion; that
is probably more indicative of a lack of thought about the problem than any
sort of active opposition (which was also in short supply).
James Bottomley, though, questioned the
filesystem idea, suggesting that NVM should be treated like memory. He
said that the way to access NVM might be through the kernel's normal memory
APIs, with nonvolatility just being another attribute of interest. One
could imagine calling kmalloc() with a new
GFP_NONVOLATILE flag, for example. The only problem with this approach is that
it is not enough to request an arbitrary nonvolatile region; callers will
usually want a specific NVM region that, presumably, contains data
from a previous use. So the memory API would have to be extended with some
sort of namespace giving reliable access to persistent data. To many, that
namespace looks like a filesystem; James suggested using 32-bit keys like
the SYSV shared memory mechanism does, but admirers of SYSV IPC tend to be
scarce indeed on linux-kernel.
So, while there are a lot of details to be worked out, some sort of
name-based kernel API seems likely to come about. Then there will be a
mechanism, either through the memory-related or filesystem-related system
calls, to make NVM available to user space. But that leads to another,
perhaps harder question: what, then, do we do with all that fast,
nonvolatile memory?
Some of it, certainly, could be used for caching; technologies like bcache could make good use of it. The page
cache could go there; Matthew suggested that the inode cache might be
another possibility. Both could speed booting considerably, though it
would be necessary to somehow validate the cache contents against
filesystems that could have changed while the system was down. Boaz
Harrosh suggested that filesystems could
store their journals in that space, speeding journal access and reducing
journal I/O load on the underlying storage devices. He also mentioned
checkpointing the system to NVM, allowing for quick recovery should the
system go down unexpectedly. Vyacheslav Dubeyko had some wilder ideas about how NVM could
eliminate system bootstrap entirely and make the concept of filesystems
obsolete; instead, everything would just live in a persistent object
environment.
Clearly, many of these ideas are going to take some time to come to
fruition. Nonvolatile memory changes things in fundamental ways; Linux may
have to scramble to keep up, but, then, that is a high-quality problem to
have. It will be most interesting to watch how this plays out over the
coming years.
By Jake Edge
May 23, 2012
Four bytes may not seem like a lot of space—typically it
isn't—but when that space is wasted millions of times, it starts to
add up. In addition, if the extra space has become part of the kernel ABI
(intentionally or not), it will be difficult to remove it. That particular
problem came up again in a recent linux-kernel discussion regarding the
trace event
header.
Just over a year ago, we looked at the
unused lock_depth field in event headers. Frederic Weisbecker had
added the field temporarily to assist in removal of the big kernel lock
(BKL), and once
the BKL was gone Steven Rostedt removed those, now useless, four bytes from the
header. Unfortunately, in the interim, PowerTOP had started accessing
events in the perf ring buffer, so removing lock_depth broke
PowerTOP. That field wasn't actually used by PowerTOP, but the tool
expected the header to have a particular size, which changed after
Rostedt removed the wasted space.
That led to a reversion of the removal, which means that every
event recorded by ftrace or perf has added overhead. The event format is
fully self-describing, however, so there is no need for utilities like
PowerTOP to grub around in the binary data making assumptions about what
the format is. It was, however, easier to read the data directly rather
than parse the format description, which is why PowerTOP did so.
Rostedt has created a library to parse
trace events using the format data that the kernel provides to avoid that
situation in the future. That
library was picked up by the recently
released PowerTOP 2.0, so Rostedt posted an
RFC asking when the lock_depth field—renamed to
padding as part of the revert—could be removed.
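To show what "using the format data" can look like, here is a minimal sketch in C that looks up a field's offset and size in a format description, so nothing needs to be assumed about the header layout. This is not Rostedt's library. The function name find_field() is invented for this example, and the sample text is only an illustration modeled on the per-event "format" files that the kernel exports.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sample of the common header fields in a format description. */
const char sample_format[] =
    "\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;\n"
    "\tfield:unsigned char common_flags;\toffset:2;\tsize:1;\tsigned:0;\n"
    "\tfield:unsigned char common_preempt_count;\toffset:3;\tsize:1;\tsigned:0;\n"
    "\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;\n"
    "\tfield:int common_padding;\toffset:8;\tsize:4;\tsigned:1;\n";

/* Hypothetical helper: find the field called name in format and store its
 * offset and size. Returns 0 on success, -1 if the field is not found. */
int find_field(const char *format, const char *name, int *offset, int *size)
{
    size_t name_len = strlen(name);
    const char *line = format;

    while ((line = strstr(line, "field:")) != NULL) {
        const char *semi = strchr(line, ';');

        if (semi == NULL)
            break;
        /* The declaration ends with the field name, preceded by a space. */
        if ((size_t)(semi - line) > name_len + 6) {
            const char *start = semi - name_len;

            if (start[-1] == ' ' && strncmp(start, name, name_len) == 0 &&
                sscanf(semi, ";\toffset:%d;\tsize:%d;", offset, size) == 2)
                return 0;
        }
        line = semi;  /* move on to the next field entry */
    }
    return -1;
}
```

A tool written this way asks for common_pid by name and keeps working if common_padding disappears, since a lookup for the removed field simply fails instead of shifting every offset that follows it.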
Linus Torvalds was not particularly concerned about the wasted space, but did want
to understand which distributions were picking up the new PowerTOP. It
turns out that the version in Fedora 14 (which Torvalds said he still uses
sometimes) is old enough that it doesn't use perf
events at all, so it is unaffected. More recent Fedoras (16, 17) are using
PowerTOP 1.98, which won't work with kernels built without the padding.
The assumption in the thread is that distributions will be picking up
PowerTOP 2.0 for releases coming later in the year, but that still leaves
users who build their own kernels on existing distributions in a bit of a
bind if the padding is removed. Existing distributions also have
various lifespans, and some will not be picking up the latest PowerTOP at all.
Rostedt asked how long the kernel needed to support
older distributions. PowerTOP, it seems, is in a different category from
other applications
because it is a developer-oriented tool. So Torvalds was willing to see the kernel change even if some
distributions get left behind:
But breaking something like a F14-15 timeframe distro or
something staid like a SLES (or "Debian Stale" or whatever they call
that thing that only takes crazy-old binaries)? It's fine. We don't
want to *rush* into it, but no, if those distros are basically not
updating, we can't care about them forever for something like
powertop.
Things that break *normal* applications are different. There the rule
really must be "never".
Arjan van de Ven concurred, pointing to 3.6
as a potential time frame to remove the padding, noting that those who
haven't updated their distribution to get the newer PowerTOP are unlikely
to be updating their kernel either. Rostedt said he will
queue the patch up for 3.6 or 3.7.
While the four bytes seem unimportant to both Torvalds and Ingo Molnar,
Rostedt pointed out that event size is a frequent problem for
tracing users. Beyond that, though, he disagrees with Molnar's contention
that the wasted space is merely a "cosmetic detail":
4 bytes is not cosmetic for a 32 byte event. That's 1/8th overhead. If
we could get rid of 4 bytes from struct page, would we do that? It's
only just 4 bytes for [every] 4096 bytes. Just a 1/1024 overhead. Of course
perf events are much bigger than 32 bytes. It's one of the biggest
complaints I hear about perf, the size of its events. We should be
trying hard to fix that.
For memory-constrained situations, for example on embedded devices or for users
trying to squeeze every process they can onto their systems, reducing the
overhead of events can make a difference. By capturing more events in the
same amount of memory, there is a better chance of finding the problem that
tracing was enabled for. When the issue came up a year ago, David Sharp of
Google noted that the size of events was a
big problem for the search giant. Others undoubtedly face similar challenges.
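The arithmetic behind these numbers is easy to check. The following C sketch is a simplification: it assumes fixed-size events and ignores the ring buffer's own per-page bookkeeping, and both function names are invented for this example.

```c
#include <stddef.h>

/* Number of fixed-size events that fit in a buffer of buf_bytes bytes. */
size_t events_that_fit(size_t buf_bytes, size_t event_bytes)
{
    return event_bytes ? buf_bytes / event_bytes : 0;
}

/* Overhead expressed as "one part in N": 4 wasted bytes out of 32 gives 8,
 * which is the 1/8th figure quoted above. */
size_t overhead_one_in(size_t wasted_bytes, size_t total_bytes)
{
    return wasted_bytes ? total_bytes / wasted_bytes : 0;
}
```

Under those assumptions, a 1MB (1,048,576-byte) buffer holds 32,768 events of 32 bytes, but 37,449 events of 28 bytes, a gain of roughly 14% from dropping the four bytes of padding.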
While the format of the perf ring buffer data may soon be a solved
problem—though it's possible, if unlikely, that other tools are
manually pulling data from the ring buffer—tracepoints as a whole are
still an
unresolved ABI issue. Right now, much of the work is in adding new
tracepoints, but some day one or more of those may need to come out or be
modified. If tools are dependent on specific tracepoints providing the
exact same
information in just the right place in the code, changing those will be a
real problem. And it will be one that is difficult for a library to paper over.
Page editor: Jonathan Corbet