
Kernel development

Brief items

Kernel release status

The current stable 2.6 kernel remains 2.6.29. The 2.6.30 merge window is open (see below), and no stable updates have been released over the last week.

The 2.6.29.1 stable update is in the review process as of this writing. This update, containing just over 40 fixes, can be expected around April 2 or 3.

Comments (none posted)

Kernel development news

Quotes of the week

Really someone needs to sit down and actually build a proper model of the VM behaviour in a tool like netlogo rather than continually keep adding ever more complex and thus unpredictable hacks to it. That way we might better understand what is occurring and why.
-- Alan Cox

It is very disappointing that nobody appears to have attempted to do _any_ sensible tuning of these controls in all this time - we just keep thrashing around trying to pick better magic numbers in the base kernel.

Maybe we should set the tunables to 99.9% to make it suck enough to motivate someone.

-- Andrew Morton

We kernel people really are special. Expecting normal apps to spend the kind of effort we do (in scalability, in error handling, in security) is just not realistic.
-- Linus Torvalds

Comments (2 posted)

2.6.30 merge window, part I

By Jonathan Corbet
April 1, 2009
As of this writing, almost 6200 non-merge changesets have been added to the mainline for the 2.6.30 release. So the merge window is well and truly open. There's a lot of stuff set up for 2.6.30 already, with more certainly to come. The user-visible changes merged so far include:

  • The relatime mount option is now the default; this means that file access times will be updated only if the previous access time is older than the modification or change time. Another merged change causes the access time to be updated at least once per day regardless. Users needing access times to be updated on every access can use the new "strictatime" mount option to get that behavior. See That massive filesystem thread for more information on this change.

  • At long last, the integrity management patches have been merged. Among other things, this code can use the trusted platform module (TPM) to ensure that the files on a system have not been tampered with and to do remote attestation.

  • Also at long last, TOMOYO Linux has been merged. TOMOYO is a pathname-based security module similar to (but significantly different from) AppArmor.

  • There is a new cpuinfo_transition_latency sysfs variable for CPU frequency governors; it serves to inform user space of the time it takes for the CPU to transition from one frequency to another.

  • There is now support for the new AES-NI cryptographic instructions being introduced into Intel processors; see this white paper [PDF] for details on AES-NI.

  • The x86_64 and SuperH architectures have gained kexec jump support.

  • There is a new guest debugging interface for KVM, allowing the host to do interactive debugging of guest systems. KVM has also gained support for PowerPC e500 processors.

  • There is a new user-space API for detailed control over the timestamping of network packets. See Documentation/networking/timestamping.txt for details.

  • The Reliable Datagram Sockets (RDS) protocol is now supported by the networking layer. See Documentation/networking/rds.txt for more information.

  • The x86 architecture now has an option to put a "canary" value at the end of the kernel stack; if that value ever changes, the stack has been (accidentally or maliciously) overrun.

  • The reiserfs filesystem has seen a burst of work which cleans up the code, improves SELinux support, and improves performance. This is likely to be the last set of updates for reiserfs.

  • The usual array of new drivers has been merged. They include:

    • Block: PCI-Express SAS 6Gb/s host adapters.

    • Graphics: AMD R6xx/R7xx GPUs (2D only for now).

    • Networking: USB Qualcomm Serial modems, Marvell Libertas 8686 SPI 802.11b/g cards, Marvell 88w8xxx TOPDOG PCI/PCIe wireless cards, Prism54 SPI (stlc45xx) wireless adapters, Atmel at76c503/at76c505/at76c505a USB wireless adapters, OpenCores 10/100 Mbps Ethernet MAC devices, and Atheros "otus" 802.11n USB devices.

    • Sound: Mitac MIO A701 phones, Wolfson Micro WM8400 and WM9705 codecs, Wolfson Microelectronics 1133-EV1 modules, Atmel Audio Bitstream DAC devices, Atmel AC97 controllers, Asahi-Kasei AK4104 S/PDIF transmitters, Echo Audio IndigoIOx and IndigoDJx cards, Turtle Beach Classic, Fiji and Pinnacle cards, and Asus Xonar Essence STX sound cards.

    • Video/DVB: Mars-Semi MR97310A USB cameras, Freescale MC44S803 low power CMOS broadband tuners, SQ Technologies SQ905-based USB cameras, i.MX3x camera sensor interfaces, ST STV0900 satellite demodulators, ST STV6110 silicon tuners, SQ Technologies SQ905C-based USB cameras, Zarlink ZL10036 silicon tuners, LG Electronics LGDT3305-based tuners, Hauppauge HD PVR USB devices, and Intel CE6230 DVB-T USB2.0 receivers.

    • Processors and systems: SuperH SH7786, ESPT-Giga SH7763-based reference boards, SMSC reference platform with an SH7709S CPU, Palm LifeDrive and Tungsten|T5 systems, Brivo Systems LLC ACS-5000 master boards, Dave/DENX QongEVB-LITE platforms, Marvell RD-78x00-mASA development boards, Marvell PXA168 and PXA910 processors, TI OMAP850 processors, OMAP 3430 SDP boards, Nokia RX-51 internet tablets, Teltonika 3G Router RUT100 systems, Faraday FA526 cores, Cortina Systems Gemini family SoCs, GE Fanuc SBC310 and PPC9A PowerPC boards, Freescale Media5200 boards, AMCC Redwood(460SX) systems, Phytec phyCORE-MPC5200B-IO (pcm032) boards, and Freescale MPC8544 ("Socrates") boards.

    • Miscellaneous: AMCC PPC4xx crypto accelerators, Adrienne Electronics Corporation PCI time code devices, Symbol 6608 barcode scanners, E-Ink Broadsheet/Epson S1D13521 controllers, NXP Semiconductor PCA9665 i2c controllers, and Siemens Syleus and Hades sensor chips.

  • The "Phidgets" USB drivers have been removed; users should shift to the user-space drivers instead.

Changes visible to kernel developers include:

  • The adaptive spinning mutex patch has been merged. This change will cause mutexes to behave more like spinlocks in the contended case. If (and only if) the lock is held by code running on a different CPU, the mutex code will spin on the assumption that the lock will be released soon. This behavior results in significant performance improvements. Btrfs, which had its own spinning mutex implementation, has been converted to the new mutexes.

  • There is a new set of functions added to the crypto API which allow for piecewise compression and decompression of data.

  • The bus_id member of struct device is gone; code needing that information should use the dev_name() macro instead.

  • There is a new timer function:

        int mod_timer_pending(struct timer_list *timer, unsigned long expires);
    

    It is like mod_timer(), with the exception that it will not reactivate an already-expired timer; a usage sketch appears after this list.

  • There have been some changes around the fasync() function in struct file_operations. This function is now responsible for maintaining the FASYNC bit in struct file; it is also now called without the big kernel lock held. Finally, a positive return value from fasync() is mapped to zero, meaning that the return value from fasync_helper() can be returned directly by fasync(); a sketch of the resulting pattern appears after this list. (This is your editor's modest contribution to 2.6.30.)

  • The SCSI layer has a new support library for object storage device support; see Documentation/scsi/osd.txt for details.

  • The x86 "subarchitecture" mechanism has been removed, now that no architectures actually use it. The Voyager architecture has been removed as a result of these changes.

  • x86 is also the first architecture to use a new per-CPU memory allocator merged for 2.6.30. This allocator changes little at the API level, but it will provide for more efficient and flexible per-CPU variable management.

  • Support for compressing the kernel with the bzip2 or lzma algorithms has been added. Support for the old zImage format has been removed.

  • The asynchronous function call infrastructure is now enabled by default.

  • The DMA operations debugging facility has been merged.

  • The owner field of struct proc_dir_entry has been removed, causing lots of changes throughout the tree.
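As an illustration of two of the interface changes above, here is a hedged sketch of how a character driver might use mod_timer_pending() and the simplified fasync() convention. The "mydev" names are invented for this example; only mod_timer_pending(), fasync_helper(), and the return-value convention described above come from the kernel.

    #include <linux/fs.h>
    #include <linux/timer.h>
    #include <linux/jiffies.h>

    /* Invented example state for a hypothetical "mydev" driver; the timer
     * is assumed to be set up elsewhere with setup_timer(). */
    static struct fasync_struct *mydev_async_queue;
    static struct timer_list mydev_timer;

    /*
     * With the 2.6.30 changes, a positive return value from fasync_helper()
     * is mapped to zero by the caller, so it can be returned directly.
     */
    static int mydev_fasync(int fd, struct file *filp, int on)
    {
        return fasync_helper(fd, filp, on, &mydev_async_queue);
    }

    /*
     * mod_timer_pending() behaves like mod_timer(), except that it will not
     * reactivate a timer which has already expired.
     */
    static void mydev_push_back_timeout(unsigned long delay)
    {
        mod_timer_pending(&mydev_timer, jiffies + delay);
    }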

If the usual two-week pattern holds, the merge window can be expected to remain open through about April 9. The rate at which changes flow into the mainline will likely be lower for the second half of the merge window - the alternative is for this development cycle to be far larger than any of its predecessors. But it is certain that more interesting changes will be merged for 2.6.30.

Comments (21 posted)

Flushing out pdflush

April 1, 2009

This article was contributed by Goldwyn Rodrigues

The kernel page cache contains in-memory copies of data blocks belonging to files kept in persistent storage. Pages which have been written to, but not yet written back to disk, accumulate in the cache and are known as "dirty" pages. The amount of dirty memory is listed in /proc/meminfo. Pages in the cache are flushed to disk after an interval of 30 seconds. Pdflush is a set of kernel threads responsible for writing dirty pages to disk, either explicitly in response to a sync() call, or implicitly when the page cache runs out of free pages, when pages have been dirty for too long, or when there are too many dirty pages in the page cache (as specified by /proc/sys/vm/dirty_ratio).

At any given time, between two and eight pdflush threads are running in the system. The number of pdflush threads is determined by the load on the page cache; new pdflush threads are spawned if none of the existing threads has been idle for more than one second and there is more work in the pdflush work queue. On the other hand, if the least-recently-active pdflush thread has been asleep for more than one second, one thread is terminated; threads continue to be terminated until only the minimum number remains. The current number of running pdflush threads is reflected by /proc/sys/vm/nr_pdflush_threads.

A number of pdflush-related issues have come to light over time. Pdflush threads are common to all block devices, but it is thought that they would perform better if they concentrated on a single disk spindle. Contention between pdflush threads is avoided through the use of the BDI_pdflush flag on the backing_dev_info structure, but this interlock can also limit writeback performance. Another issue with pdflush is request starvation. There is a fixed number of I/O requests available for each queue in the system; if that limit is exceeded, any application requesting I/O will block waiting for a new slot. Since pdflush works on several queues, it cannot afford to block on any single queue, so it sets the nonblocking flag (wbc->nonblocking) in the writeback control structure. If other applications continue to write to the device, pdflush will not succeed in allocating request slots; it can thus be starved of access to a queue if it repeatedly finds that queue congested.

In his patch set, Jens Axboe proposes replacing the pdflush threads with flusher threads attached to each backing device info (BDI). Unlike pdflush threads, per-BDI flusher threads focus on a single disk spindle. With per-BDI flushing, when the request_queue is congested, blocking happens on request allocation, avoiding request starvation and providing better fairness.

With pdflush, the dirty inode list is stored by the super block of the filesystem. Since a per-BDI flusher needs to know about the dirty pages to be written by its assigned device, this list is now stored by the BDI. A request to flush the dirty inodes of a superblock results in flushing the inodes from the dirty lists of all of the backing devices used by that filesystem.

As with pdflush, per-BDI writeback is controlled through the writeback_control data structure, which instructs the writeback code what to do and how to perform the writeback. The important fields of this structure (illustrated in the sketch after this list) are:

  • sync_mode: defines the way synchronization should be performed with respect to inode locking. If set to WB_SYNC_NONE, the writeback will skip locked inodes; if set to WB_SYNC_ALL, it will wait for locked inodes to be unlocked before performing the writeback.

  • nr_to_write: the number of pages to write. This value is decremented as the pages are written.

  • older_than_this: If not NULL, all inodes older than the jiffies recorded in this field are flushed. This field takes precedence over nr_to_write.
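As a brief illustration (a sketch only, not code from the patch set), a caller of this era might fill in a writeback_control like this to write back up to 1024 pages belonging to inodes dirtied more than 30 seconds ago, without waiting on locked inodes:

    #include <linux/writeback.h>
    #include <linux/jiffies.h>

    static void example_writeback(void)
    {
        unsigned long oldest_jif = jiffies - 30 * HZ;
        struct writeback_control wbc = {
            .sync_mode       = WB_SYNC_NONE,  /* skip locked inodes */
            .older_than_this = &oldest_jif,   /* only inodes dirtied earlier */
            .nr_to_write     = 1024,          /* decremented as pages are written */
        };

        /* The wbc would then be handed to the writeback path; the exact
         * entry point differs between mainline and the per-BDI patches. */
        (void)wbc;
    }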

The struct bdi_writeback keeps all information required for flushing the dirty pages:

    struct bdi_writeback {
	struct backing_dev_info *bdi;
	unsigned int nr;
	struct task_struct	*task;
	wait_queue_head_t	wait;
	struct list_head	b_dirty;
	struct list_head	b_io;
	struct list_head	b_more_io;

	unsigned long		nr_pages;
	struct super_block	*sb;
    };

The bdi_writeback structure is initialized when the device is registered through bdi_register(). The fields of the bdi_writeback are:

  • bdi: the backing_dev_info associated with this bdi_writeback,

  • task: contains the pointer to the default flusher thread which is responsible for spawning threads for performing the flushing work,

  • wait: a wait queue for synchronizing with the flusher threads,

  • b_dirty: list of all the dirty inodes on this BDI to be flushed,

  • b_io: inodes parked for I/O,

  • b_more_io: more inodes parked for I/O; all inodes queued for flushing are inserted in this list, before being moved to b_io,

  • nr_pages: total number of pages to be flushed, and

  • sb: the pointer to the superblock of the filesystem which resides on this BDI.

nr_pages and sb are parameters passed asynchronously to the BDI flush thread, and are not fixed over the life of the bdi_writeback. This is done to accommodate devices holding multiple filesystems, and hence multiple super_blocks. With multiple super_blocks on a single device, a sync can be requested for a single filesystem residing on that device.

The bdi_writeback_task() function waits for dirty_writeback_interval, which is five seconds by default, and initiates wb_do_writeback(wb) periodically. If no pages are written for five minutes, the flusher thread exits (after a grace period of dirty_writeback_interval). If writeback work is required later, after such an exit, new flusher threads are spawned by the default writeback thread.
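That loop might look roughly as follows. This is a simplified sketch, not code from the patch set; in particular it assumes, for illustration only, that wb_do_writeback() reports how many pages it wrote.

    #include <linux/kthread.h>
    #include <linux/jiffies.h>
    #include <linux/sched.h>

    /* Sketch of the per-BDI flusher loop described above. */
    static int bdi_writeback_task_sketch(struct bdi_writeback *wb)
    {
        unsigned long last_active = jiffies;

        while (!kthread_should_stop()) {
            /* Assumption for this sketch: returns the number of pages written. */
            if (wb_do_writeback(wb) > 0)
                last_active = jiffies;

            /* Idle for five minutes: let this thread exit; the default
             * writeback thread will fork a new one if work appears later. */
            if (time_after(jiffies, last_active + 5 * 60 * HZ))
                break;

            /* dirty_writeback_interval: five seconds by default. */
            schedule_timeout_interruptible(5 * HZ);
        }
        return 0;
    }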

Writeback flushes are done in two ways:

  • pdflush style: This is initiated in response to an explicit writeback request, for example syncing inode pages of a super_block. wb_start_writeback() is called with the superblock information and the number of pages to be flushed. The function tries to acquire the bdi_writeback structure associated with the BDI. If successful, it stores the superblock pointer and the number of pages to be flushed in the bdi_writeback structure and wakes up the flusher thread to perform the actual writeout for the superblock. This is different from how pdflush performs writeouts: pdflush attempts to grab the device from the writeout path, blocking the writeouts from other processes.

  • kupdated style: If there are no explicit writeback requests, the thread wakes up periodically to flush dirty data. The first time one of an inode's pages stored on the BDI is dirtied, the dirtying time is recorded in the inode's address space. The periodic writeback code walks through the superblock's inode list, writing back the dirty pages of inodes older than a specified point in time. This pass is made once per dirty_writeback_interval, which defaults to five seconds.

After review of the first attempt, Jens added the ability to have multiple flusher threads per device, based on suggestions from Andrew Morton. Dave Chinner suggested that filesystems would like to have a flusher thread per allocation group. In the second iteration of the patch set, Jens added a new interface in the superblock to return the bdi_writeback structure associated with an inode:

    struct bdi_writeback *(*inode_get_wb) (struct inode *);

If inode_get_wb is NULL, the default bdi_writeback of the BDI is returned, which means there is only one bdi_writeback thread for the BDI. The maximum number of threads that can be started per BDI is 32.
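For illustration only, a filesystem wanting one flusher per allocation group might implement the hook along these lines. The myfs_* names and the lookup helpers are invented; only the inode_get_wb() prototype above comes from the patch set.

    #include <linux/fs.h>
    #include <linux/backing-dev.h>

    struct bdi_writeback;    /* from the patch set, shown earlier */

    /* Invented helpers; a real filesystem would provide these. */
    static unsigned int myfs_allocation_group(struct inode *inode);
    static struct bdi_writeback *myfs_get_wb(struct backing_dev_info *bdi,
                                             unsigned int group);

    /* Hypothetical: map each inode to one of a small set of bdi_writeback
     * structures, for example one per allocation group (the patch set
     * allows up to 32 per BDI). */
    static struct bdi_writeback *myfs_inode_get_wb(struct inode *inode)
    {
        struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;

        return myfs_get_wb(bdi, myfs_allocation_group(inode));
    }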

Initial experiments conducted by Jens found an 8% increase in performance on a simple SATA drive running the Flexible File System Benchmark (ffsb). File layout was smoother than with the vanilla kernel, as reported by vmstat, with a uniform distribution of buffers written out. With a ten-disk btrfs filesystem, per-BDI flushing performed 25% faster. The work is carried in Jens's block layer git tree (git://git.kernel.dk/linux-2.6-block.git) under the "writeback" branch. There have been no comments on the second iteration so far, but the per-BDI flusher threads are not yet considered ready to go into the 2.6.30 tree.

Acknowledgments: Thanks to Jens Axboe for reviewing and explaining certain aspects of the patch set.

Comments (14 posted)

That massive filesystem thread

By Jonathan Corbet
March 31, 2009
Long, highly-technical, and animated discussion threads are certainly not unheard of on the linux-kernel mailing list. Even by linux-kernel standards, though, the thread that followed the 2.6.29 announcement was impressive. Over the course of hundreds of messages, kernel developers argued about several aspects of how filesystems and block I/O work on contemporary Linux systems. In the end (your editor will be optimistic and say that it has mostly ended), we had a lot of heat - and some useful, concrete results.

One can only pity Jesper Krogh, who almost certainly didn't know what he was getting into when he posted a report of a process which had been hung up waiting for disk I/O for several minutes. All he was hoping for was a suggestion on how to avoid these kinds of delays - which are a manifestation of the famous ext3 fsync() problem - on his server. What he got, instead, was to be copied on the entire discussion.

Journaling priority

One of the problems is at least somewhat understood: a call to fsync() on an ext3 filesystem will force the filesystem journal (and related file data) to be committed to disk. That operation can create a lot of write activity which must be waited for. But contemporary I/O schedulers tend to favor read operations over writes. Most of the time, that is a rational choice: there is usually a process waiting for a read to complete, but writes can be done asynchronously. A journal commit is not asynchronous, though, and it can cause a lot of things to wait while it is in progress. So it would be better not to put journal I/O operations at the end of the queue.

In fact, it would be better not to make journal operations contend with the rest of the system at all. To that end, Arjan van de Ven has long maintained a simple patch which gives the kjournald thread realtime I/O priority. According to Alan Cox, this patch alone is sufficient to make a lot of the problems go away. The patch has never made it into the mainline, though, because Andrew Morton has blocked it. This patch, he says, does not address the real problem, and it causes a lot of unrelated I/O traffic to benefit from elevated priority as well. Andrew says the real fix is harder:

The bottom line is that someone needs to do some serious rooting through the very heart of JBD transaction logic and nobody has yet put their hand up. If we do that, and it turns out to be just too hard to fix then yes, perhaps that's the time to start looking at palliative bandaids.

Bandaid or not, this approach has its adherents. The ext4 filesystem has a new mount option (journal_ioprio) which can be used to set the I/O priority for journaling operations; it defaults to something higher than normal (but not realtime). More recently, Ted Ts'o has posted a series of ext3 patches which sets the WRITE_SYNC flag on some journal writes. That flag marks the operations as synchronous, which will keep them from being blocked by a long series of read operations. According to Ted, this change helps quite a bit, at least when there is a lot of read activity going on. The ext3 changes have not yet been merged for 2.6.30 as of this writing (none of Ted's trees have), but chances are they will go in before 2.6.30-rc1.
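To make the mechanism concrete, here is a hedged illustration (not Ted's actual patch) of what tagging a write this way looks like at the block layer: the commit code submits its buffers with WRITE_SYNC rather than a plain WRITE, so the I/O scheduler treats them as something a process is waiting on.

    #include <linux/fs.h>
    #include <linux/buffer_head.h>

    /* Illustration only: submit a journal buffer either as an ordinary
     * asynchronous write or as a synchronous one that will not be parked
     * behind a long stream of reads. */
    static void submit_journal_buffer(struct buffer_head *bh, int sync)
    {
        submit_bh(sync ? WRITE_SYNC : WRITE, bh);
    }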

data=ordered, fsync(), and fbarrier()

The real problem, though, according to Ted, is the ext3 data=ordered mode. That is the mode which makes ext3 relatively robust in the face of crashes, but, says Ted, it has done so at the cost of performance and the encouragement of poor user-space programming. He went so far as to express his regrets for this behavior:

All I can do is apologize to all other filesystem developers profusely for ext3's data=ordered semantics; at this point, I very much regret that we made data=ordered the default for ext3. But the application writers vastly outnumber us, and realistically we're not going to be able to easily roll back eight years of application writers being trained that fsync() is not necessary, and actually is detrimental for ext3.

The only problem here is that not everybody believes that ext3's behavior is a bad thing - at least, with regard to robustness. Much of this branch of the discussion covered the same issues raised by LWN in Better than POSIX? a couple of weeks before. A significant subset of developers do not want the additional robustness provided by ext3 data=ordered mode to go away. Matthew Garrett expressed this position well:

But you're still arguing that applications should start using fsync(). I'm arguing that not only is this pointless (most of this code will never be "fixed") but it's also regressive. In most cases applications don't want the guarantees that fsync() makes, and given that we're going to have people running on ext3 for years to come they also don't want the performance hit that fsync() brings. Filesystems should just do the right thing, rather than losing people's data and then claiming that it's fine because POSIX said they could.

One option which came up a couple of times was to extend POSIX with a new system call (called something like fbarrier()) which would enforce ordering between filesystem operations. A call to fbarrier() could, for example, cause the data written to a new file to be forced out to disk before that file could be renamed on top of another file. The idea has some appeal, but Linus dislikes it:

Anybody who wants more complex and subtle filesystem interfaces is just crazy. Not only will they never get used, they'll definitely not be stable...

So rather than come up with new barriers that nobody will use, filesystem people should aim to make "badly written" code "just work" unless people are really really unlucky. Because like it or not, that's what 99% of all code is.

And that is almost certainly how things will have to work. In the end, a system which just works is the system that people will want to use.
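For readers wondering what the contested pattern looks like in practice, here is a user-space sketch (file names invented, error handling abbreviated) of the write-then-rename sequence at the heart of this debate, with the fsync() call that filesystem developers were asking applications to add.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int save_file(const char *data, size_t len)
    {
        int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len) {
            close(fd);
            return -1;
        }

        /* The contested call: with ext3 data=ordered, ordering made it
         * largely unnecessary in practice; on filesystems with delayed
         * allocation, skipping it risks renaming an empty file into place
         * if the system crashes at the wrong moment. */
        if (fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        if (close(fd) != 0)
            return -1;

        /* Atomically replace the old copy with the new one. */
        return rename("config.tmp", "config");
    }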

relatime

Meanwhile, another branch of the conversation revisited an old topic: atime updates. Unix-style filesystems traditionally track the time that each file was last accessed ("atime"), even though, in reality, there is very little use for this information. Tracking atime is a performance problem, in that it turns every read operation into a filesystem write as well. For this reason, Linux has long had a "noatime" mount option which would disable atime updates on the indicated filesystem.

As it happens, though, there can be problems with disabling atime entirely. One of them is that the mutt mail client uses atime to determine whether there is new mail in a mailbox. If the time of last access is prior to the time of last modification, mutt knows that mail has been delivered into that mailbox since the owner last looked at it. Disabling atime breaks this mechanism. In response to this problem, the kernel added a "relatime" option which causes atime to be updated only if the previous value is earlier than the modification time. The relatime option makes mutt work, but it, too, turns out to be insufficient: some distributions have temporary-directory cleaning programs which delete anything which hasn't been used for a sufficiently long period. With relatime, files can appear to be totally unused, even if they are read frequently.

If relatime could be made to work, the benefits could be significant; the elimination of atime updates can get rid of a lot of writes to the disk. That, in turn, will reduce latencies for more useful traffic and will also help to avoid disk spin-ups on laptops. To that end, Matthew Garrett posted a patch to modify the relatime semantics slightly: it allows atime to be updated if the previous value is more than one day in the past. This approach eliminates almost all atime updates while still keeping the value close to current.
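The resulting policy can be summarized with a short sketch (simplified, not the kernel's actual code): update atime only when it is older than the modification or change time, or more than a day out of date.

    #include <linux/fs.h>
    #include <linux/time.h>

    /* Simplified sketch of the relatime decision described above. */
    static int relatime_need_update_sketch(struct inode *inode, struct timespec now)
    {
        /* Access time at or before the modification or change time? */
        if (timespec_compare(&inode->i_mtime, &inode->i_atime) >= 0)
            return 1;
        if (timespec_compare(&inode->i_ctime, &inode->i_atime) >= 0)
            return 1;

        /* The one-day rule added by Matthew Garrett's patch. */
        if ((long)(now.tv_sec - inode->i_atime.tv_sec) >= 24 * 60 * 60)
            return 1;

        return 0;
    }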

This patch was proposed for merging, and more: it was suggested that relatime should be made the default mode for filesystems mounted under Linux. Anybody wanting the traditional atime behavior would have to mount their filesystems with the new "strictatime" mount option. This idea ran into some immediate opposition, for a couple of reasons. Andrew Morton didn't like the hardwired 24-hour value, saying, instead, that the update period should be given as a mount option. This option would be easy enough to implement, but few people think there is any reason to do so; it's hard to imagine a use case which requires any degree of control over the granularity of atime updates.

Alan Cox, instead, objected to the patch as an ABI change and a standards violation. He tried to "NAK" the patch, saying that, instead, this sort of change should be done by distributors. Linus, however, said he doesn't care; the relatime change and strictatime option were the very first things he merged when he opened the 2.6.30 merge window. His position is that the distributors have had more than a year to make this change, and they haven't done so. So the best thing to do, he says, is to change the default in the kernel and let people use strictatime if they really need that behavior.

For the curious, Valerie Aurora has written a detailed article about this change. She doesn't think that the patch will survive in its current form; your editor, though, does not see a whole lot of pressure for change at this point.

I/O barriers

Suppose you are a diligent application developer who codes proper fsync() calls where they are necessary. You might think that you are then protected against data loss in the face of a crash. But there is still a potential problem: the disk drive may lie to the operating system about having written the data to persistent media. Contemporary hardware performs aggressive caching of operations to improve performance; this caching will make a system run faster, but at the cost of adding another way for data to get lost.

There is, of course, a way to tell a drive to actually write data to persistent media. The block layer has long had support for barrier operations, which cause data to be flushed to disk before more operations can be initiated. But the ext3 filesystem does not use barriers by default because there is an associated performance penalty. With ext4, instead, barriers are on by default.

Jeff Garzik pointed out one associated problem: a call to fsync() does not necessarily cause the drive to flush data to the physical media. He suggested that fsync() should create a barrier, even if the filesystem as a whole is not using barriers. In that way, he says, fsync() might actually live up to the promise that it is making to application developers.

The idea was not controversial, even though people are, as a whole, less concerned with caches inside disk drives. Those caches tend to be short-lived, and they are quite likely to be written even if the operating system crashes or some other component of the system fails. So the chances of data loss at that level are much smaller than they are with data in an operating system cache. Still, it's possible to provide a higher-level guarantee, so Fernando Luis Vazquez Cao posted a series of patches to add barriers to fsync() calls. And that is when the trouble started.
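In rough terms (a sketch, not Fernando's actual patches), the addition amounts to asking the underlying device to empty its volatile write cache once the filesystem has finished the writes that fsync() requires; what to do when that request fails is the point of contention described next.

    #include <linux/blkdev.h>

    /* Sketch: after the filesystem's own fsync() work is done, flush the
     * device's write cache so the data actually reaches persistent media. */
    static int fsync_flush_device(struct block_device *bdev)
    {
        /* Returns an error if the device cannot perform the flush - the
         * value whose propagation was disputed in the thread. */
        return blkdev_issue_flush(bdev, NULL);
    }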

The fundamental disagreement here is over what should happen when an attempt to send a flush operation to the device fails. Fernando's patch returned an ENOTSUPP error to the caller, but Linus asked for it to be removed. His position is that there is nothing that the caller can do about a failed barrier operation anyway, so there is no real reason to propagate that error upward. At most, the system should set a flag noting that the device doesn't support barriers. But, says Linus, filesystems should cope with what the storage device provides.

Ric Wheeler, instead, argues that filesystems should know if barrier operations are not working and be able to respond accordingly. Says Ric:

One thing the caller could do is to disable the write cache on the device. A second would be to stop using the transactions - skip the journal, just go back to ext2 mode or BSD like soft updates.

Basically, it lets the file system know that its data integrity building blocks are not really there and allows it (if it cares) to try and minimize the chance of data loss.

Alan Cox also jumped into this discussion in favor of stronger barriers:

Throw and pray the block layer can fake it simply isn't a valid model for serious enterprise computing, and if people understood the worst cases, for a lot of non enterprise computing.

Linus appears to be unswayed by these arguments, though. In his view, filesystems should do the best they can and accept what the underlying device is able to do. As of this writing, no patches adding barriers to fsync() have been merged into the mainline.

Related to this is the concept of laptop mode. It has been suggested that, when a system is in laptop mode, an fsync() call should not actually flush data to disk; flushing the data would cause the drive to spin up, defeating the intent of laptop mode. The response to I/O barrier requests would presumably be similar. Some developers oppose this idea, though, seeing it as a weakening of the promises provided by the API. This looks like a topic which could go a long time without any real resolution.

Performance tuning

Finally, there was some talk about trying to make the virtual memory subsystem perform better in general. Part of the problem here has been recognized for some time: memory sizes have grown faster than disk speeds. So it takes a lot longer to write out a full load of dirty pages than it did in the past. That simple dynamic is part of the reason why writeout operations can stall for long periods; it just takes that long to funnel gigabytes of data onto a disk drive. It is generally expected that solid-state drives will eventually make this problem go away, but it is also expected that it will be quite some time, yet, before those drives are universal.

In the mean time, one can try to improve performance by not allowing the system to accumulate as much data in need of writing. So, rather than letting dirty pages stay in cache for (say) 30 seconds, those pages should be flushed more frequently. Or the system could adjust the percentage of RAM which is allowed to be dirty, perhaps in response to observations about the actual bandwidth of the backing store devices. The kernel already has a "percentage dirty" limit, but some developers are now suggesting that the limit should be a fixed number of bytes instead. In particular, that limit should be set to the number of bytes which can be flushed to the backing store device in (say) one second.

Nobody objects to the idea of a better-tuned virtual memory subsystem. But there is some real disagreement over how that tuning should be done. Some developers argue for exposing the tuning knobs to user space and letting the distributors work it out. Andrew is a strong proponent of this approach:

I've seen you repeatedly fiddle the in-kernel defaults based on in-field experience. That could just as easily have been done in initscripts by distros, and much more effectively because it doesn't need a new kernel. That's data.

The fact that this hasn't even been _attempted_ (afaik) is deplorable. Why does everyone just sit around waiting for the kernel to put a new value into two magic numbers which userspace scripts could have set?

The objections to that approach follow these lines: the distributors cannot get these numbers right; in fact, they are not really even inclined to try to get them right. The proper tuning values tend to change from one kernel to the next, so it makes sense to keep them with the kernel itself. And the kernel should be able to get these things right if it is doing its job at all. Needless to say, Linus argues for this approach, saying:

We should aim to get it right. The "user space can tweak any numbers they want" is ALWAYS THE WRONG ANSWER. It's a cop-out, but more importantly, it's a cop-out that doesn't even work, and that just results in everybody having different setups. Then nobody is happy.

Linus has suggested (but not implemented) one set of heuristics which could help the system to tune itself. Neil Brown also has a suggested approach, based on measuring the actual performance of the system's storage devices. Fixing things at this level is likely to take some time; virtual memory changes always do. But some smart people are starting to think about the problem, and that's an important first step.

That, too, could be said for the discussion as a whole. There are clearly a lot of issues surrounding filesystems and I/O which have come to the surface and need to be discussed. The Linux kernel community as a whole needs to think through the sort of guarantees (for both robustness and performance) it will offer to its users and how those guarantees will be fulfilled. As it happens, the 2009 Linux Storage & Filesystems Workshop begins on April 6. Many of these topics are likely to be discussed there. Your editor has managed to talk his way into that room; stay tuned.

Comments (72 posted)

Patches and updates

Kernel trees

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds