
LinuxConf.eu wrapup

By Jonathan Corbet
September 12, 2007

The very first LinuxConf Europe event was held in Cambridge, UK, in the first week of September. This conference is the result of a cooperation between the UK Unix User Group and the German Unix User Group; it is, in a sense, a combination of the UKUUG and Linux-Kongress events held in previous years. Talks by Dirk Hohndel and Michael Kerrisk were published last week. Here is a summary of some other LCE events.

Power management remains the focus of a great deal of attention. Arjan van de Ven started off a set of power-related talks with an overview of where the problems are. His biggest point is that software is a critical part of the power consumption picture; contemporary hardware provides a number of [Arjan van de Ven] power-saving features, but software has a tendency to defeat them. Many of the ways in which this happens have been covered here before, so there is no need to repeat them. The core lesson here is that transitions between power states are expensive, so it is important that hardware components, once put into a power-saving state, be allowed to stay there for some time.

In the case of the CPU, idle periods of 20ms to 50ms are needed for effective power savings. Past kernels have rather defeated that goal, though, by receiving a clock interrupt every 1-10ms. The dynamic tick patches have finally fixed that problem, making it possible for longer sleeps to happen. But then user space comes along and ruins things. Since the advent of PowerTop, though, improvements have been coming quickly. Many distributions now consume at least 30% less power in typical laptop use.

Things may be getting better, but Matthew Garrett started the following session by noting that Linux still sucks - at least, it sucks power. This is a problem, he says, because getting half the battery lifetime that Windows gets on the same hardware is really embarrassing. Systems are still waking up far too much; the problems exist in both kernel and user space.

[Matthew Garrett] On the kernel side, the usual culprits - device drivers - are a big part of the problem. There are quite a few drivers which poll their hardware - sometimes up to 100 times every second. In some cases this cannot be avoided; the hardware may be broken in a way which requires this kind of polling. But in other cases the polling can be made smarter - such as turning it off when the device is not in use. There is still work to be done in this area.

User-space applications remain a problem. People tracking down wakeups often blame the X server, but the real trouble is usually the applications which are causing X to wake up. There is a tool in the works which will identify the real source of X wakeups; this is a good thing: once problems are identified they are usually fixed pretty quickly. Polling for vertical retrace periods (so that the display can be updated without artifacts) seems to be a particular problem; some API work is being done to make it easier to avoid this polling. Evidently there are also some applications which repeatedly ask the server if a particular extension is available; since the set of extensions does not change while the server is running, there is little point in doing this.

There are some interesting things which can be done to better use the power-saving features of the hardware. For example, some framebuffers can compress the video data into a dedicated memory area, then drive the video from the compressed data. This technique reduces video memory bandwidth, saving power (up to half a watt) in the process. An interesting consequence is that the amount of power saved is dependent on how well the screen's contents compress - a user's choice of background wallpaper will affect their power usage.

Finally, there is a lot to be gained if device drivers can communicate more information to user space, making polling unnecessary. Applications which poll for changes to the audio volume are an example here; if the sound system simply told them that the volume had been adjusted, they could update their displays and go back to sleep.

[Jörn Engel] Jörn Engel gave a talk on the death of hard disks. His core point is that flash-based storage is faster, requires less power, makes less noise, and is more robust than rotating storage. It is also more expensive, for now, but flash is getting cheaper much more quickly. Jörn projects that flash-based drives will become more economical than hard drives between 2012 and 2019, depending on which drives one looks at.

Flash makes life easier in a number of ways; the lack of seek delays, for example, means that much of the trouble the kernel goes to in scheduling of block I/O operations can be eliminated. On the other hand, flash has challenges of its own: it is not quite the random-access array of blocks that one would like. In particular, writing to flash requires dealing with wear-leveling issues, erase operations, and more.

Manufacturers have done their best to paper over these issues through the use of translation layers which make a flash array look like a simple disk drive. These layers make it easier to use flash with existing software, but there are problems: performance is not always what one would like, and there can be hidden caches which delay the persistent storage of data. So Jörn has a request to the flash manufacturers: give us direct access to the flash array, without translation layers, and let us figure out how to best support it.

Chris Mason is not waiting for flash to take over; instead, he is working on the next-generation Linux filesystem for rotating disks. The result, Btrfs, was the subject of Chris's talk at LCE. LWN covered Btrfs last June.

[Chris Mason] Chris's motivation is the fact that disks are, for all practical purposes, getting slower - the time required to read an entire disk is growing. Most systems still store large numbers of small files, leading to a lot of wasted space. Btrfs tries to address these issues and provide a number of interesting features as well. It is extent-based, resulting in more efficient storage of larger files. Small files are packed into the filesystem tree itself, eliminating the internal fragmentation experienced by a number of other filesystems. It has indexed directories, data and metadata checksums, efficient snapshots, sequence numbers in objects (facilitating quick and easy incremental backups), an online filesystem checker in the works, and more.

The directories are actually indexed twice. One index is there for fast filename lookup; the other one, instead, lets the readdir() system call return files in inode-number order, speeding filesystem traversals. Extended attributes are stored as directory entries. Every file has a backpointer to its containing directory - and, yes, multiply-linked files have backpointers to all of the directories in which they are found.

Perhaps the most fun part of the talk was the plots Chris has generated from various benchmark runs. The limiting factor on filesystem performance is generally disk seeks; it is important to minimize disk head movement. In general, ext3 tends to move the disk head all over the platter during benchmark runs while Btrfs and XFS do better. Chris noted that better writeback clustering in the virtual memory subsystem would help ext3.

[Seek counts plot]

More benchmark plots (some animated) can be found in the Btrfs benchmark and Seekwatcher pages. Toward the end, Chris was asked whether performance slows down when the disk gets full. The answer was "no" because the system crashes instead. That's a good reminder that Btrfs remains an early-stage development; the on-disk format has not even been finalized yet. But the production version of Btrfs is certainly something to look forward to.

Back in 2000, the British Computer Society awarded its Lovelace Medal to Linus Torvalds. In 2007, the society finally caught up with him to deliver the medal - though, as speaker Dr. David Hartley noted, they probably were almost as quick as the post office would have been. As is typically the case, Linus seemed somewhat embarrassed by the attention.

LinuxConf Europe intends to be a conference on a truly European scale. To that end, next year's event will likely move to Germany; the details were not yet finalized to the point that the location could be announced at this year's conference, though. LCE, helped by the kernel summit, has gotten this institution off to a good start; your editor is looking forward to next year's edition.



Sucking power

Posted Sep 13, 2007 2:33 UTC (Thu) by mrons (subscriber, #1751)

So if "User-space applications remain a problem" for power, why doesn't Windows have a similar problem?

Sucking power

Posted Sep 13, 2007 3:12 UTC (Thu) by JoeBuck (subscriber, #2330)

My guess is that some of what we consider userspace, they consider part of the OS (specifically, all those cool KDE and Gnome features). I think Windows has paid a cost for the tighter integration (more security issues) but might also have seen some benefits in the power area.

Sucking power

Posted Sep 13, 2007 15:57 UTC (Thu) by rfunk (subscriber, #4054)

Of course, in our case I would think that the Qt and Gtk+ libraries should be able to deal with many of those problems themselves, centralizing the problem. I suspect few people these days run apps that talk directly to the X server.

Sucking power

Posted Sep 20, 2007 6:14 UTC (Thu) by tuna (guest, #44480)

Those libraries are dealing with the problem; look at the new polling functions in GLib, such as g_timeout_add_seconds().

Sucking power

Posted Sep 13, 2007 16:04 UTC (Thu) by iabervon (subscriber, #722)

My feeling is that a smaller portion of applications on Windows do anything at all when they aren't the user's main task. There's a relatively small set of applications that will interrupt you (many of which are integrated with the kernel and sleep on input from drivers instead of polling), and some that play background music, but mostly they don't do anything when they don't have focus. Linux doesn't have so much of a mechanism for telling applications whether the user cares about them (and tends to have users watching a dozen things out of the corners of their eyes), so everything has to be polling if the things it is supposed to respond to aren't things that can be blocked on.

It's also the case with Gtk+ at least that it's hard to get good results out of multiple threads (each of which blocks on a different input) all updating the display; the only easy thing that works (last time I checked) is to have a single thread poll for input having happened and take care of responding. I think Windows has a totally different locking model that requires less polling.

Of course, I run a particularly old-school UI on my laptop, and I've been getting ~14 W for interaction stuff (i.e., when I'm not actually creating significant load; how much power doing work uses is a separate issue). On my laptop, the main sources of wakeups seem to be that the trackpad's input doesn't get batched at all and is high-frequency when in use, and iwlwifi is always busy.

Sucking power

Posted Sep 16, 2007 3:36 UTC (Sun) by sobdk (guest, #38278)

What makes you think Windows doesn't have a similar problem? With all of the crap you have to install on Windows, like virus scanners, anti-spyware tools, and internet security suites, I'm sure the situation is much, much worse. You did get me curious enough to do a little test today on my wife's machine, which has both Windows Vista (pretty much a bare install with virus software) and Kubuntu.

The awesome power of Windows Vista!

The gist is that, on the same machine, Windows would idle at about 79 watts while Kubuntu 7.04 with a kernel.org 2.6.22.6 kernel used 69 watts. The real kicker for me was that, even though I thought we had been putting the machine to sleep in Windows all this time, it actually does nothing!

Vista ?

Posted Sep 16, 2007 9:46 UTC (Sun) by khim (subscriber, #9252)

Why the hell are you comparing Linux with Vista? Vista is a hog in all senses. XP or W2K are good, though (W2K or Core, but not on Pentium 4).

Vista ?

Posted Sep 16, 2007 15:05 UTC (Sun) by sobdk (guest, #38278)

>Why the hell you are comparing Linux with Vista?

Well, it was the only Windows machine I have (Thank God!).

Honestly though, do you think my comparison was unfair? I compared the latest and greatest from both the Linux world and the Windows world. Most consumers, whether they like it or not, will get a new PC that has Windows Vista preinstalled. I would personally love to hear someone from Microsoft say "No wait, that's not fair! Please do a comparison with our older, far superior W2K and XP!"

Additionally, I can tell you that at WinHEC in 2006 I sat through several fine developer (marketing?) sessions explaining how all of the suspend and resume problems of XP were going to be solved in Vista. In XP, drivers (and perhaps app software) can simply veto a request for suspend. "Less asking and more telling" was the motto for Vista, as they explained that all drivers would have to get their act together and support suspend properly. Well, someone didn't get the memo on my machine.

LinuxConf.eu wrapup

Posted Sep 13, 2007 9:54 UTC (Thu) by pointwood (guest, #2814)

It is certainly good to hear that there is a lot of focus on power management. My two Thinkpads run great on Linux, but the battery doesn't survive very long :(

I'm looking forward to upgrading to Kubuntu Gutsy, which should have some of these improvements included, and seeing how much of an improvement I'll get over Feisty. Maybe I should run a simple benchmark to record the (I hope) improvement...

LinuxConf.eu wrapup

Posted Sep 13, 2007 10:58 UTC (Thu) by jengelh (subscriber, #33263)

>Every file has a backpointer to its containing directory - and, yes, multiply-linked files have backpointers to all of the directories in which they are found.

Hah, finally inotify is going to shine.

LinuxConf.eu wrapup

Posted Sep 13, 2007 21:11 UTC (Thu) by intgr (subscriber, #39733)

How is it going to shine any more than it does now? inotify works at the VFS layer, so it works the same way regardless of the underlying file system.

LinuxConf.eu wrapup

Posted Sep 13, 2007 21:34 UTC (Thu) by jengelh (subscriber, #33263)

Yes, of course you are right, inotify is at the VFS. But the VFS is deep and shallow :-)
Consider the following little pseudo function. (I am not sure it is lock-wise correct, but that is not the issue here.)

static ssize_t foofs_write(struct file *filp, const char __user *buf,
                           size_t len, loff_t *ppos)
{
    struct fooinode *ino = filp->f_dentry->d_inode->i_private;
    struct dentry *de;

    /* foofs keeps its own list of the dentries that reference this
     * inode; the list member name is elided here. */
    list_for_each_entry(de, &ino->dentries, ...) {
        fsnotify_modify(de);
    }
    return do_sync_write(filp, buf, len, ppos);
}

This is specific to foofs that it can actually trigger fsnotify_modify events on all dentries that reference the same inode, because only foofs has the 1:N mapping from inode:dentries.

The regular case outside foofs only triggers an fsnotify_modify on the dentry you actually opened (see fs/read_write.c in vfs_write()).

LinuxConf.eu wrapup

Posted Sep 20, 2007 6:54 UTC (Thu) by joib (guest, #8541)

In general, ext3 tends to move the disk head all over the platter during benchmark runs while Btrfs and XFS do better. Chris noted that better writeback clustering in the virtual memory subsystem would help ext3.

So presumably ext4, which adds delayed and multiblock allocation to ext3 will improve things. Also, IIRC there was some discussion about moving these features into the VFS layer. That being said, perhaps ext4 development is slowing down now that Sun went and bought clusterfs.

Filesystem block size

Posted Sep 20, 2007 8:57 UTC (Thu) by forthy (guest, #1525)

One part of the filesystem cruft, IMHO, is the block size limit. Remember, the rule of thumb for random access to blocked devices is that access time and transfer time should be about the same. For current disks, half a megabyte appears to be the sweet spot (maybe even one megabyte on some large terabyte disks); 15 years ago, 4k was right. Linux has so far limited the block size to 4k, and has only recently allowed filesystems to use other block sizes, mostly to mount media formatted on platforms with larger blocks (like the 8k blocks on IA64).

A new file system therefore should be able to put data together in blocks that will be able to grow in future as well (transfer rate increases with the square root of density or capacity, while seek time can be assumed to be almost constant). It needs to take care that data can be packed into these larger blocks - if you have half a megabyte as minimum transfer unit, you don't put a directory into one block, but a whole directory subtree, maybe with associated inodes and everything. You also put a bunch of small files into one single block, as well.

Essentially, as processors get faster and disks get larger, the behavior starts to shift. Main memory today reacts more like disks in the early days (cached block access instead of direct access), and disks tend to behave more like tapes (longer sequential access). Make sure you get all the information you need with as few seeks as possible.

Filesystem block size

Posted Sep 20, 2007 15:37 UTC (Thu) by joib (guest, #8541)

Remember, the rule of thumb for random access to blocked devices is that access time and transfer time should be about the same. For current disks, half a megabyte appears to be the sweet spot (maybe on some large terabyte disks even one megabyte);

Oracle recommends a big stripe size for raid10 arrays, largely based on a similar argument.

Linux so far limited block size to 4k, and only has recently allowed filesystems to use other block sizes, too; mostly to mount medias formatted on platforms with larger blocks (like the 8k blocks on IA64).

Patches supporting block sizes of up to 64 KB for ext2/3/4 were posted about a year ago; I don't know if they have been merged yet.

A new file system therefore should be able to put data together in blocks that will be able to grow in future as well (transfer rate increases with the square root of density or capacity, while seek time can be assumed to be almost constant). It needs to take care that data can be packed into these larger blocks - if you have half a megabyte as minimum transfer unit, you don't put a directory into one block, but a whole directory subtree, maybe with associated inodes and everything.

I think you have identified the correct disease (seek time remaining roughly constant vs. everything else improving), but I'm not convinced your cure is the correct one.

Consider what is already being done today (some filesystems like xfs have had many of these features already): Rather than huge block sizes, allocate many blocks at the same time (delayed allocation and multiblock allocation (mballoc), apparently making their way into the VFS layer), use extents to store many adjacent blocks (most filesystems except for ext2/3). For small files, readahead takes care of reading many blocks when the heads are moved to read at another spot.

You also put a bunch of small files into one single block, as well.

Reiser did a related thing (tail packing), but apparently the implementation is considered pretty complicated. In any case, the issue is not large vs. small block size, but rather block allocation. With mballoc and extents, small block filesystems get most of the advantage that large blocks have, without the complexity associated with tail packing.

Essentially, as processors get faster and disks get larger, the behavior starts to shift. Main memory today reacts more like disks in the early days (cached block access instead of direct access), and disks tend to behave more like tapes (longer sequential access).

"Disk is the new tape" - Jim Gray

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds