Brief items
The current development kernel is 2.6.36-rc1, released (without
announcement) on August 15. A number of patches have been merged since
last week's merge window summary;
see below for the most significant of them. Overall, the headline additions to
2.6.36 look to be the AppArmor security module, a new
suspend mechanism which might -
or might not - address the needs of the Android project, the LIRC infrared
controller driver suite, a new out-of-memory killer, and the
fanotify hooks for anti-malware applications.
The full changelog is available for those who want all the details.
A handful of patches have been merged since 2.6.36-rc1; they include parts
of the VFS scalability patch set by Nick Piggin. We'll take a closer look
at those patches for next week's edition.
Stable updates: The 2.6.27.51, 2.6.32.19, 2.6.34.4, and 2.6.35.2 updates were released on
August 13. Greg Kroah-Hartman notes that the 2.6.34 updates are coming to an end,
with only one more planned. There is another 2.6.27 update in the review
process as of this writing; the future of 2.6.27 updates appears to be
short as well, given that Greg can no longer boot such old kernels on any
hardware in his possession.
Previously, the 2.6.27.50, 2.6.32.18, 2.6.34.3, and 2.6.35.1 updates came out on August 10.
So remember guys. Windows suspend/resume may work just fine. Mac
too. But Linux's suspend/resume isn't a buggy pile of crap. It's an
intelligent buggy pile of crap, that just wants to be loved.
-- Christian Hammond
My initial impression was also that power savings was Android's
single supreme goal, but a careful reading of this thread and the
ones preceding it taught me otherwise. Please see below for my
current understanding of what they are trying to accomplish.
-- Paul McKenney
When a user says, "Show the changes to the lower file system in my
overlaid file system," they are actually saying, "Replace
everything in /bin, but not /etc/hostname, and merge the lower
package database with the upper package database, and update
/etc/resolv.conf, unless it's the mailserver..."
-- Valerie Aurora
Actually, it's not that complicated:
1) base and suffices choose the possible types.
2) order of types is always the same: int -> unsigned -> long -> unsigned long -> long long -> unsigned long long
3) we always choose the first type the value would fit into
4) L in suffix == "at least long"
5) LL in suffix == "at least long long"
6) U in suffix == "unsigned"
7) without U in suffix, base 10 == "signed"
That's it.
-- Language lessons from Al Viro
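The rules in the quote above are compact enough to model directly. What follows is only an illustrative sketch, not anything taken from the original thread: the function name literal_type() is hypothetical, and the type limits assume an LP64 system (32-bit int, 64-bit long and long long), which is typical of 64-bit Linux but not universal.

```python
# Assumed LP64 widths; other ABIs (32-bit systems, for example) differ.
# Each entry: (name, maximum value, is_signed, rank), in the fixed order of rule 2.
TYPES = [
    ("int", 2**31 - 1, True, 0),
    ("unsigned int", 2**32 - 1, False, 0),
    ("long", 2**63 - 1, True, 1),
    ("unsigned long", 2**64 - 1, False, 1),
    ("long long", 2**63 - 1, True, 2),
    ("unsigned long long", 2**64 - 1, False, 2),
]


def literal_type(text):
    """Return the C type picked for an integer literal, or None if nothing fits."""
    body = text.rstrip("uUlL")
    suffix = text[len(body):].upper()
    if body.lower().startswith("0x"):
        value, decimal = int(body, 16), False
    elif body.startswith("0"):
        value, decimal = int(body, 8), False   # a leading 0 means octal
    else:
        value, decimal = int(body, 10), True
    wants_unsigned = "U" in suffix             # rule 6
    min_rank = suffix.count("L")               # rules 4 and 5: 0, 1 (L) or 2 (LL)
    for name, maximum, signed, rank in TYPES:
        if rank < min_rank:
            continue
        if wants_unsigned and signed:
            continue
        if decimal and not wants_unsigned and not signed:
            continue                           # rule 7: plain decimal stays signed
        if value <= maximum:                   # rule 3: first type that fits
            return name
    return None
```

Under these assumptions, literal_type("2147483648") yields "long" while literal_type("0x80000000") yields "unsigned int": the same value gets a different type depending only on the base it is written in.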
Several developers concerned with Linux power management met for a
mini-summit in Boston on August 9, immediately prior to LinuxCon.
Numerous topics were discussed, including suspend performance, the Android
and MeeGo approaches to power management, and more. Len Brown has posted a
set of notes from this gathering; click below for the full text.
Full Story (comments: 7)
LWN readers will have seen our reporting from the Linux Storage and
Filesystem Summit (day 1, day 2), held on
August 8 and 9 in Boston. Your editor was unable to attend the
storage-specific sessions, though, so they were not covered in those
articles. Fortunately, James Bottomley took detailed notes, which he has
now made available to us. Click below for the results, covering iSCSI, SAN
management, thin provisioning, trim and discard, solid-state storage
devices, multipath, and more.
Full Story (comments: 3)
The KVM Forum was held concurrently with LinuxCon on August 9 and 10.
Slides from all of the presentations made at the forum are now available in
PDF format.
By Jake Edge
August 18, 2010
Developers, understandably, want their code to be used, but turning
new features on by default is often thought to be taking things a bit too
far. Herbert Xu and other kernel crypto subsystem developers recently ran
afoul of this policy when a new option controlling the
self-testing of the crypto drivers at boot time was set to "yes" by default.
They undoubtedly thought that this feature was important (bad cryptography
can lead to system or data corruption), but Linux has a longstanding
policy that features should default to "off". When David Howells ran
into a problem caused by a bug when loading
the cryptomgr module, Linus Torvalds was quick to sharply remind Xu of that
policy.
The proximate cause of Howells's problem was that the cryptomgr module was
returning a value that made it appear as if it was not loaded. That caused
a cascade of problems early in the boot sequence when the module loader
tried to write an error message to /dev/console, which had not yet been
initialized. Xu sent out a patch to fix
that problem, but Howells's bisection pointed to a commit that added a way
to disable the boot-time crypto self-tests while leaving them enabled by
default.
Torvalds was characteristically blunt: "People always think that their magical code is so important. I tell
you up-front that [it] absolutely is not. Just remove the crap entirely,
please." He was unhappy that, at least by default, everyone would
be running these self-tests every time they boot. But Xu was worried
about data corruption and potentially flaky crypto hardware:
The purpose of these tests are to make a particular driver or
implementation available only if it passes them. So your encrypted
disk/file system will not be subject to a hardware/software combo
without at least some semblance of testing.
The last thing you want to is to upgrade your kernel with a new
hardware crypto driver that detects that you have a piece of rarely-
used crypto [hardware], decides to use it and ends up making your
data toast.
But Torvalds was unconvinced: "The _developer_ had better test the thing. That is absolutely
_zero_ excuse for then forcing every boot for every poor user to re-do
the test over and over again." Others were not so sure, however.
Kyle Moffett noted that he had been
personally bitten by new hardware crypto drivers that failed the
self-tests—thus falling back to the software implementation—so
he would like to see more testing:
So there are unique and compelling reasons for default-enabled basic
smoke tests of cryptographic support during boot. To be honest, the
test and integration engineer in me would like it if there were more
intensive in-kernel POST tests that could be enabled by a kernel
parameter or something for high-reliability embedded devices.
Torvalds's basic point was that the cost of making every user run
the self-tests at boot time was too high. The drivers should be
reliable or they shouldn't be in the kernel. He continued: "And if you worry about alpha-particles, you should run a
RAM test on every boot. But don't ask _me_ to run one."
Though Xu posted a patch to default the
self-tests to "off", it has not yet made its way into the mainline. Given
Torvalds's statements, though, that will probably happen relatively soon.
If distributions disagree with his assessment, they can, and presumably
will, enable the tests for their kernels.
Kernel development news
By Jonathan Corbet
August 16, 2010
The 2.6.36 merge window closed with the release of 2.6.36-rc1 on
August 15. About 1000 changesets were merged into the mainline after
last week's update; this
article will cover the significant additions since then, starting with the
user-visible changes:
- The Squashfs filesystem has gained support for filesystems compressed
with LZO.
- The Ceph filesystem now has advisory locking support.
- There is now support for erase and trim operations (including the
"secure" variants) in the multi-media card (MMC) subsystem. The block
layer has been extended with a new REQ_SECURE flag and a new
BLKSECDISCARD ioctl() command to support this
functionality.
- New drivers:
  - Block: ARM PXA-based PATA controllers.
  - Systems and processors:
    Income s.r.o. PXA270 single-board computers,
    Wiliboard WBD-111 boards, and
    Samsung S5PC210-based systems.
  - Miscellaneous:
    Semtech SX150-series I2C GPIO expanders,
    Intersil ISL12022 RTC chips,
    Freescale IMX DryIce real time clocks,
    Dallas Semiconductor DS3232 real-time clock chips,
    SuperH Mobile HDMI controllers,
    iPAQ h1930/h1940/rx1950 battery controllers,
    Intel MID battery controllers,
    STMicroelectronics STMPE I/O expanders,
    TI TPS6586x power management chips,
    ST-Ericsson AB8500 power regulators,
    Maxim Semiconductor MAX8998 power management controllers,
    Intersil ISL6271A power regulators,
    Analog Devices AD5398/AD5821 regulators,
    Lenovo IdeaPad ACPI rfkill switches,
    Winbond/Nuvoton NUC900 I2C controllers, and
    SMSC EMC2103 temperature and fan sensors.
Changes visible to kernel developers include:
- The ioctl() file operation has been removed, now that
all in-tree users have been converted to the unlocked_ioctl()
version which does not acquire the big kernel lock. Removal of the
BKL has gotten yet another step closer.
- The nearly unused function dma_is_consistent(), meant to
indicate whether cache-coherent DMA can be performed on a specific
range of memory, has been removed.
- The kfifo API has been reworked for ease of use
and performance. Some examples of how to use the API have been added
under samples/kfifo.
- There is a new set of functions for avoiding races with sysfs access
to module parameters:
    kparam_block_sysfs_read(name);
    kparam_unblock_sysfs_read(name);
    kparam_block_sysfs_write(name);
    kparam_unblock_sysfs_write(name);
Here, name is the name of the parameter as supplied to
module_param() in the same source file. They are implemented
with a mutex.
- Experimental support for multiplexed I2C busses has been added.
All told, some 7,770 changes were incorporated during this merge window.
There were not a whole lot of changes pushed back this time around. The
biggest feature which was not merged, perhaps, was transparent hugepages,
but that omission is most likely due to the lack of a proper pull request
from the maintainer.
Now the stabilization period begins. Linus has suggested that he plans to
repeat his attempt to hold a hard line against any post-rc1 changes which
are not clearly important fixes; we will see how that works out in practice.
By Jonathan Corbet
August 18, 2010
What happens if you try to put one billion files onto a Linux filesystem?
One might see this as an academic sort of question; even the most
enthusiastic music downloader will have to work a while to collect that
much data. It would require over 30,000 (clean) kernel trees to add up to
a billion files. Even contemporary desktop systems, which often seem to be
quite adept at the creation of vast numbers of small files, would be hard
put to make a billion of them. But, Ric Wheeler says, this is a problem we
need to be thinking about now, or we will not be able to scale up to
tomorrow's storage systems. His LinuxCon talk used the billion-file
workload as a way to investigate the scalability of the Linux
storage stack.
One's first thought, when faced with the prospect of handling one billion
files, might be to look for workarounds. Rather than shoveling all of those
files into a single filesystem, why not spread them out across a series of
smaller filesystems? The problems with that approach are that (1) it
limits the
kernel's ability to optimize head seeks and such, reducing performance, and
(2) it forces developers (or administrators) to deal with the hassles
involved in actually distributing the files. Inevitably things will get
out of balance, forcing things to be redistributed in the future.
Another possibility is to use a database rather than the filesystem. But
filesystems are familiar to developers and users, and they come with the
operating system from the outset. Filesystems also are better at handling
partial failure; databases, instead, tend to be all-or-nothing affairs.
If one wanted to experiment with a billion-file filesystem, how would one
come up with hardware which is up to the task? The
most obvious way at the moment is with external disk arrays. These boxes
feature non-volatile caching and a hierarchy of storage technologies. They
are often quite fast at streaming data, but random access may be fast or
slow, depending on where the data of interest is stored. They cost $20,000
and up.
With regard to solid-state storage, Ric noted only that 1TB still costs a
good $1000. So rotating media is likely to be with us for a while.
What if you wanted to put together a 100TB array on your own? They did it
at Red Hat; the system involved four expansion shelves holding 64 2TB
drives. It cost over $30,000, and was, Ric said, a generally bad idea.
Anybody wanting a big storage array will be well advised to just go out and
buy one.
The filesystem life cycle, according to Ric, starts with a mkfs operation.
The filesystem is filled, iterated over in various ways, and an occasional
fsck run is required. At some point in the future, the files are removed.
Ric put up a series of plots showing how ext3, ext4, XFS, and btrfs
performed on each of those operations with a one-million-file filesystem.
The results
varied, with about the only consistent factor being that ext4 generally
performs better than ext3. Ext3/4 are much slower than the others at
creating filesystems, due to the need to create the static inode tables.
On the other hand, the worst performers when creating 1 million files
were ext3 and XFS. Everybody except ext3 performs reasonably well when
running fsck - though btrfs shows room for some optimization. The big
loser when it comes to removing those million files is XFS.
To see the actual plots, have a look at Ric's
slides [PDF].
It's one thing to put one million files into a filesystem, but what about
one billion? Ric did this experiment on ext4, using the homebrew
array described above. Creating the filesystem in the first place was not
an exercise for the impatient; it took about four hours to run. Actually
creating those one billion files, instead, took a full four days. Surprisingly,
running fsck on this filesystem only took 2.5 hours - a real walk in the
park. So, in other words, Linux can handle one billion files now.
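Some simple arithmetic turns those times into average rates. The sketch below only restates the figures given in the talk (four days to create one billion files, 2.5 hours to check them); the function name is made up for the illustration, and the averages say nothing about how the rate changed as the filesystem filled.

```python
FILES = 1_000_000_000


def files_per_second(files, seconds):
    """Average rate over a whole run; the instantaneous rate will vary."""
    return files / seconds


create_rate = files_per_second(FILES, 4 * 24 * 3600)   # four days of file creation
fsck_rate = files_per_second(FILES, 2.5 * 3600)        # 2.5 hours of fsck

print(f"create: about {create_rate:,.0f} files/second")
print(f"fsck:   about {fsck_rate:,.0f} files/second")
```

That works out to roughly 2,900 files created per second, against about 111,000 files checked per second by fsck.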
That said, there are some lessons that came out of this experience; they
indicate where some of the problems are going to be. The first of these is
that running fsck on an ext4 filesystem takes a lot of memory: on a
70TB filesystem with one billion files, 10GB of RAM was needed. That
number goes up to 30GB when XFS is used, though, so things can get worse.
The short conclusion: you can put a huge amount of storage onto a small
server, but you'll not be able to run the filesystem checker on it.
That is a good limitation to know about ahead of time.
Next lesson: XFS, for all of its strengths, struggles when faced with
metadata-intensive workloads. There is work in progress to improve things
in this area, but, for now, it will not perform as well as ext3 in such
situations.
According to Ric, running ls on a huge filesystem is "a bad idea";
iterating over that many files can generate a lot of I/O activity. When
trying to look at that many files, you need to avoid running
stat() on every one of them or trying to sort the whole list.
Some filesystems can return the file type with the name in
readdir() calls, eliminating the need to call stat() in
many situations; that can help a lot in this case.
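As a user-space illustration of that last point, Python's os.scandir() exposes the file type that readdir() returns, so a directory walk can usually classify entries without a separate stat() call for each one. The following is a minimal sketch, not a tuned tool; count_entries() is a hypothetical helper name, and on filesystems that do not report the type the library quietly falls back to stat().

```python
import os


def count_entries(root):
    """Count files and directories under root without sorting anything
    and without an explicit stat() call per entry."""
    files = dirs = 0
    stack = [root]
    while stack:
        path = stack.pop()
        with os.scandir(path) as entries:
            for entry in entries:
                # is_dir() uses the type reported by readdir() when available
                if entry.is_dir(follow_symlinks=False):
                    dirs += 1
                    stack.append(entry.path)
                else:
                    files += 1
    return files, dirs
```

Entries are handled in whatever order the filesystem returns them; nothing is collected into a list or sorted, so memory use stays proportional to the depth and breadth of pending directories rather than to the total number of files.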
In general, enumeration of files tends to be slow; we can do, at best, a
few thousand files per second. That may seem like a lot of files, but, if
the target is one billion files, it will take a very long time to get
through the whole list. A related problem is backup and/or replication.
That, too, will take a very long time, and it can badly affect the
performance of other things running at the same time. That can be a
problem because, given that a backup can take days, it really needs to be
run on an operating, production system. Control groups and the I/O
bandwidth controller may help to preserve system performance in such
situations.
Finally, application developers must bear in mind that processes which run
this long will invariably experience failures, sooner or later. So they
will need to be designed with some sort of checkpoint and restart
capability. We also need to do better about moving on quickly when I/O
operations fail; lengthy retry operations can take a slow process and turn
it into an interminable one.
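One simple way to get checkpoint and restart behavior is to record progress durably after each unit of work and to resume from that record on the next run. The sketch below is an assumption-laden illustration rather than anything presented in the talk: process_all() and the JSON checkpoint format are invented here, the work list is assumed to be in a stable order, and a real tool would checkpoint in batches instead of after every item.

```python
import json
import os


def process_all(items, handle, checkpoint_path):
    """Process items in order, resuming after the last recorded position.

    If a failure hits after handle() succeeds but before the checkpoint is
    written, that item is handled again on restart, so handle() should be
    idempotent.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next"]
    for i in range(start, len(items)):
        handle(items[i])
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next": i + 1}, f)
            f.flush()
            os.fsync(f.fileno())          # make sure the record reaches the disk
        os.replace(tmp, checkpoint_path)  # atomic rename: old or new, never partial
```

An exception from handle() simply propagates; running the same call again picks up at the item that failed instead of starting over from the beginning.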
In other words, as things get bigger we will run into some scalability
problems. There is nothing new in that revelation. We've always overcome
those problems in the past, and should certainly be able to do so in the
future. It's always better to think about these things before they become
urgent problems, though, so talks like Ric's provide a valuable service to
the community.
By Jonathan Corbet
August 18, 2010
One of the higher-profile decisions made at the recently-concluded
Linux Storage and Filesystem
summit was to get rid of support for barriers in the Linux kernel block
subsystem. This was a popular decision, but also somewhat misunderstood
(arguably, by your editor above all). Now, a new patch series from Tejun
Heo shows how request ordering will likely be handled between filesystems
and the block layer in the future.
The block layer must be able to reorder disk I/O operations if it is to
obtain the sort of performance that users expect from their systems. On
rotating media, there is much to be gained by minimizing head seeks, and
that goal is best achieved by executing all nearby requests together,
regardless of the order in which those requests were issued. Even with
flash-based devices, there is some benefit to be had by grouping adjacent
requests, especially when small requests can be coalesced into larger
operations. Proper dispatch of requests to the low-level device driver is
normally the I/O scheduler's job; the scheduler will freely reorder
requests, blissfully ignorant of the higher-level decisions which created
those requests in the first place.
Note that this reordering also usually happens within the storage device
itself; requests will be cached in (possibly volatile) memory and writes
will be executed at a time which the hardware deems to be convenient. This
reordering is typically invisible to the operating system.
The problem, of course, is that it is not always safe to reorder I/O
requests in arbitrary ways. The classic example is that of a journaling
filesystem, which operates in roughly this way:
1. Begin a new transaction.
2. Write all planned metadata changes to the journal. Depending on
the filesystem and its configuration, data changes may go to the
journal as well.
3. Write a commit record closing out the transaction.
4. Begin the process of writing the journaled changes to the filesystem
itself.
5. Goto 1.
If the system were to crash before step 3 completes, everything written to
the journal would be lost, but the integrity of the filesystem would be
unharmed. If the system crashes after step 3, but before the changes
are written to the filesystem, those changes will be replayed at the next
mount, preserving both the metadata and the filesystem's integrity. Thus,
journaling makes a filesystem relatively crash-proof.
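The role of the commit record can be shown with a toy model of journal replay. This is a deliberately simplified sketch and does not reflect the on-disk format of any real filesystem: the journal is a Python list of tuples, the "filesystem" is a dictionary mapping block numbers to contents, and replay() is a hypothetical name.

```python
def replay(journal, filesystem):
    """Apply only those transactions whose commit record made it to the journal."""
    pending = {}
    for record in journal:
        kind = record[0]
        if kind == "begin":
            pending = {}                  # start collecting a new transaction
        elif kind == "write":
            _, block, data = record
            pending[block] = data
        elif kind == "commit":
            filesystem.update(pending)    # the transaction is complete: apply it
            pending = {}
    # anything still pending had no commit record and is discarded
    return filesystem


journal = [
    ("begin",), ("write", 1, "inode table"), ("commit",),
    ("begin",), ("write", 2, "directory block"),   # crash before the commit record
]
print(replay(journal, {}))
```

The first transaction is replayed in full and the second, which never reached its commit record, is dropped entirely; in neither case is a half-finished change applied.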
But imagine what can happen if requests are reordered. If the commit
record is written before all of the other changes have been written to the
journal, then, after a crash, an incomplete journal would be replayed,
corrupting the filesystem. Or, if a transaction frees some disk blocks
which are subsequently reused elsewhere in the filesystem, and the reused
blocks are written before the transaction which freed them is committed, a
crash at the wrong time would, once again, corrupt things. So, clearly,
the filesystem must be able to impose some ordering on how requests are
executed; otherwise, its attempts to guarantee filesystem integrity in all
situations may well be for nothing.
For some years, the answer has been barrier requests. When the filesystem
issues a request to the block layer, it can mark that request as a barrier,
indicating that the block layer should execute all requests issued before
the barrier prior to doing any requests issued afterward. Barriers should,
thus, ensure that operations make it to the media in the right order while
not overly constraining the block layer's ability to reorder requests
between the barriers.
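The intended semantics can be modeled in a few lines. In this sketch (an illustration only, unrelated to the kernel's actual I/O scheduler code), each request is a (sector, is_barrier) pair and the hypothetical dispatch_order() function sorts requests by sector to stand in for seek minimization, but never moves a request across a barrier.

```python
def dispatch_order(requests):
    """Reorder requests by sector within each barrier-delimited segment."""
    ordered, segment = [], []
    for sector, is_barrier in requests:
        if is_barrier:
            ordered.extend(sorted(segment))   # everything issued earlier goes first
            ordered.append(sector)            # then the barrier request itself
            segment = []
        else:
            segment.append(sector)
    ordered.extend(sorted(segment))           # trailing requests after the last barrier
    return ordered


issued = [(50, False), (10, False), (30, True), (5, False), (40, False), (1, False)]
print(dispatch_order(issued))
```

For the requests shown, sectors 10 and 50 are dispatched before the barrier at sector 30, and sectors 1, 5, and 40 only afterward, even though a pure seek-order sort would have put sector 1 first.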
In practice, barriers have an unpleasant reputation for killing block I/O
performance, to the point that administrators are often tempted to turn
them off and take their risks. While the tagged queue operations provided
by contemporary hardware should implement barriers reasonably well,
attempts to make use of those features have generally run into
difficulties. So, in the real world, barriers are implemented by simply draining the
I/O request queue prior to issuing the barrier operation, with some flush
operations thrown in to get the hardware to actually commit the data to
persistent media. Queue-drain operations will stall the device and kill
the parallelism needed for full performance; it's not surprising that the
use of barriers can be painful.
In their discussions of this problem, the storage and filesystem developers
have realized that the ordering semantics provided by block-layer barriers
are much stronger than necessary. Filesystems need to ensure that certain
requests are executed in a specific order, and they need to ensure that
specific requests have made it to the physical media before starting
others. Beyond that, though, filesystems need not concern themselves with
the ordering for most other requests, so the use of barriers constrains the
block layer more than is required. In general, it was concluded,
filesystems should concern themselves with ordering, since that's where the
information is, and not dump that problem into the block layer.
To implement this reasoning, Tejun's patch gets rid of hard-barrier operations
in the block layer; any filesystem trying to use them will get a cheery
EOPNOTSUPP error for its pains. A filesystem which wants
operations to happen in a specific order will simply need to issue them in
the proper order, waiting for completion when necessary. The block layer
can then reorder requests at will.
What the block layer cannot do, though, is evade the responsibility for
getting important requests to the physical media when the filesystem
requires it. So, while barrier requests are going away, "flush requests"
will replace them. On suitably-capable devices, a flush request can have
two separate requirements: (1) the write cache must be flushed before
beginning the operation, and (2) the data associated with the flush
request itself must be committed to persistent media by the time the
request completes. The second part is often called a "force unit access"
(or FUA) request.
In this world, a journaling filesystem can issue all of the journal writes
for a given transaction, then wait for them to complete. At that point, it
knows that the writes have made it to the device, but the device might have
cached those requests internally. The write of the commit record can then
follow, with both the "flush" and "FUA" bits set; that will ensure that all
of the journal data makes it to physical media before the commit record
does, and that the commit record itself is written by the time the request
completes. Meanwhile, all other I/O operations - playing through previous
transactions or those with no transaction at all - can be in flight at the
same time, avoiding the queue stall which characterizes the barrier
operations implemented by current kernels.
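A toy model of a drive with a volatile write cache shows why the two bits on the commit record are enough. Everything here is an assumption made for illustration: the Device class, its method names, and the block labels are invented, writes "complete" as soon as they land in the cache, and a crash is modeled as losing the cache contents.

```python
class Device:
    """A drive with a volatile write cache in front of persistent media."""

    def __init__(self):
        self.cache = {}
        self.media = {}

    def write(self, block, data, flush=False, fua=False):
        if flush:
            # flush: commit everything already in the cache before this write
            self.media.update(self.cache)
            self.cache.clear()
        if fua:
            # force unit access: this write goes straight to the media
            self.cache.pop(block, None)
            self.media[block] = data
        else:
            self.cache[block] = data

    def crash(self):
        self.cache.clear()                    # power loss: cached writes vanish


dev = Device()
dev.write("journal-1", "metadata A")          # completes, but may only be cached
dev.write("journal-2", "metadata B")
dev.write("commit", "txn 1", flush=True, fua=True)
dev.crash()
print(sorted(dev.media))
```

After the simulated crash the media holds both journal blocks and the commit record. Issuing the same commit write without the flags would leave nothing on the media at all, which is the case the filesystem has to rule out.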
The patch set has been well received, but there is still work to be done,
especially with regard to converting filesystems to the new way of doing
things. Christoph Hellwig has posted a set of patches to that end.
A lot of testing will be required as well; there is little desire
to introduce bugs in this area, since the consequences of failure are so
high. But the development cycle has just begun, leaving a fair amount of
time to shake down this work before the 2.6.37 merge window opens.
Page editor: Jonathan Corbet