Brief items
The current stable 2.6 kernel remains 2.6.29. The 2.6.30 merge
window is open (see below), and no stable updates have been released over
the last week.
The 2.6.29.1 stable update is in the review
process as of this writing. This update, containing just over 40
fixes, can be expected around April 2 or 3.
Kernel development news
Really someone needs to sit down and actually build a proper model
of the VM behaviour in a tool like netlogo rather than continually
keep adding ever more complex and thus unpredictable hacks to
it. That way we might better understand what is occurring and why.
--
Alan Cox
It is very disappointing that nobody appears to have attempted to
do _any_ sensible tuning of these controls in all this time - we
just keep thrashing around trying to pick better magic numbers in
the base kernel.
Maybe we should set the tunables to 99.9% to make it suck enough to
motivate someone.
--
Andrew Morton
We kernel people really are special. Expecting normal apps to spend
the kind of effort we do (in scalability, in error handling, in
security) is just not realistic.
--
Linus Torvalds
By Jonathan Corbet
April 1, 2009
As of this writing, almost 6200 non-merge changesets have been added to the
mainline for the 2.6.30 release. So the merge window is well and truly
open. There's a lot of stuff set up for 2.6.30 already, with more
certainly to come. The user-visible changes merged so far include:
- The relatime mount
option is now the default; this means that a file's access time will only
be updated if the existing value is earlier than the file's modification or change time.
Another change merged also causes the access time to be updated at
least once per day. Users needing access times to be updated for
every access can use the new "strictatime" mount option to get that
behavior. See "That massive
filesystem thread" for more information on this change.
- At long last, the integrity
management patches have been merged. Among other things, this
code can use the trusted platform module (TPM) to ensure that the
files on a system have not been tampered with and to do remote
attestation.
- Also at long last, TOMOYO
Linux has been merged. TOMOYO is a pathname-based security module
similar to (but significantly different from) AppArmor.
- There is a new cpuinfo_transition_latency sysfs variable for
CPU frequency governors; it serves to inform user space of the time it
takes for the CPU to transition from one frequency to another.
- There is now support for the new AES-NI cryptographic instructions
being introduced into Intel processors; see this
white paper [PDF] for details on AES-NI.
- The x86_64 and SuperH architectures have gained kexec jump support.
- There is a new guest debugging interface for KVM, allowing the host to
do interactive debugging of guest systems. KVM has also gained
support for PowerPC e500 processors.
- There is a new user-space API for detailed control over the
timestamping of network packets. See Documentation/networking/timestamping.txt
for details.
- The Reliable Datagram Sockets (RDS) protocol is now supported by the
networking layer. See Documentation/networking/rds.txt for more
information.
- The x86 architecture now has an option to put a "canary" value at the
end of the kernel stack; if that value ever changes, the stack has
been (accidentally or maliciously) overrun.
- The reiserfs filesystem has seen a burst of work which cleans up the
code, improves SELinux support, and improves performance. This is
likely to be the last set of updates for reiserfs.
- The usual array of new drivers has been merged. They include:
- Block: PCI-Express SAS 6Gb/s host adapters.
- Graphics: AMD R6xx/R7xx GPUs (2D only for now).
- Networking: USB Qualcomm Serial modems,
Marvell Libertas 8686 SPI 802.11b/g cards,
Marvell 88w8xxx TOPDOG PCI/PCIe wireless cards,
Prism54 SPI (stlc45xx) wireless adapters,
Atmel at76c503/at76c505/at76c505a USB wireless adapters,
OpenCores 10/100 Mbps Ethernet MAC devices, and
Atheros "otus" 802.11n USB devices.
- Sound: Mitac MIO A701 phones,
Wolfson Micro WM8400 and WM9705 codecs,
Wolfson Microelectronics 1133-EV1 modules,
Atmel Audio Bitstream DAC devices,
Atmel AC97 controllers,
Asaki-Kasei AK4104 S/PDIF transmitters,
Echo Audio IndigoIOx and IndigoDJx cards,
Turtle Beach Classic, Fiji and Pinnacle cards, and
Asus Xonar Essence STX sound cards.
- Video/DVB: Mars-Semi MR97310A USB cameras,
Freescale MC44S803 low power CMOS broadband tuners,
SQ Technologies SQ905-based USB cameras,
i.MX3x camera sensor interfaces,
ST STV0900 satellite demodulators,
ST STV6110 silicon tuners,
SQ Technologies SQ905C-based USB cameras,
Zarlink ZL10036 silicon tuners,
LG Electronics LGDT3305-based tuners,
Hauppauge HD PVR USB devices, and
Intel CE6230 DVB-T USB2.0 receivers.
- Processors and systems: SuperH SH7786,
ESPT-Giga SH7763-based reference boards,
SMSC reference platforms with SH7709S CPUs,
Palm LifeDrive and Tungsten|T5 systems,
Brivo Systems LLC ACS-5000 master boards,
Dave/DENX QongEVB-LITE platforms,
Marvell RD-78x00-mASA development boards,
Marvell PXA168 and PXA910 processors,
TI OMAP850 processors,
OMAP 3430 SDP boards,
Nokia RX-51 internet tablets,
Teltonika 3G Router RUT100 systems,
Faraday FA526 cores,
Cortina Systems Gemini family SoCs,
GE Fanuc SBC310 and PPC9A PowerPC boards,
Freescale Media5200 boards,
AMCC Redwood(460SX) systems,
Phytec phyCORE-MPC5200B-IO (pcm032) boards, and
Freescale MPC8544 ("Socrates") boards.
- Miscellaneous: AMCC PPC4xx crypto accelerators,
Adrienne Electronics Corporation PCI time code devices,
Symbol 6608 barcode scanners,
E-Ink Broadsheet/Epson S1D13521 controllers,
NXP Semiconductor PCA9665 i2c controllers, and
Siemens Syleus and Hades sensor chips.
- The "Phidgets" USB drivers have been removed; users should shift
to the user-space
drivers instead.
Changes visible to kernel developers include:
- The adaptive spinning mutex patch has been merged. This change will
cause mutexes to behave more like spinlocks in the contended case. If
(and only if) the lock is held by code running on a different CPU, the
mutex code will
spin on the assumption that the lock will be released soon. This behavior
results in significant performance improvements. Btrfs, which had its own spinning mutex
implementation, has been converted to the new mutexes.
- There is a new set of functions added to the crypto API which allow
for piecewise compression and decompression of data.
- The bus_id member of struct device is gone; code
needing that information should use the dev_name() helper function
instead.
- There is a new timer function:
int mod_timer_pending(struct timer_list *timer, unsigned long expires);
It is like mod_timer() with the exception that it will not
reactivate an already-expired timer.
- There have been some changes around the fasync() function in
struct file_operations. This function is now responsible for
maintaining the FASYNC bit in struct file; it is
also now called without the big kernel lock held. Finally, a positive
return value from fasync() is mapped to zero, meaning that
the return value from fasync_helper() can be returned
directly by fasync(). (This is your editor's modest
contribution to 2.6.30).
- The SCSI layer has a new support library for object storage device
support; see Documentation/scsi/osd.txt for details.
- The x86 "subarchitecture" mechanism has been removed, now that no
architectures actually use it. The Voyager architecture has been
removed as a result of these changes.
- x86 is also the first architecture to use a new per-CPU memory
allocator merged for 2.6.30. This allocator changes little at the API
level, but it will provide for more efficient and flexible per-CPU
variable management.
- Support for compressing the kernel with the bzip2 or lzma algorithms
has been added. Support for the old zImage format has been
removed.
- The asynchronous function
call infrastructure is now enabled by default.
- The DMA operations debugging
facility has been merged.
- The owner field of struct proc_dir_entry has been
removed, causing lots of changes throughout the tree.
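The difference between mod_timer() and mod_timer_pending(), as described in the list above, can be sketched with a toy user-space model. Nothing here is kernel code: struct toy_timer and the two toy_ functions are hypothetical stand-ins which only mimic the behavior described, and the return values follow the usual mod_timer() convention (nonzero if the timer was pending).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct timer_list; not the kernel structure. */
struct toy_timer {
    bool pending;            /* true while the timer is queued to fire */
    unsigned long expires;   /* expiry time, in jiffies */
};

/* Like mod_timer(): always (re)arms the timer.
 * Returns 1 if the timer was pending beforehand, 0 otherwise. */
int toy_mod_timer(struct toy_timer *t, unsigned long expires)
{
    int was_pending;

    assert(t != NULL);
    was_pending = t->pending ? 1 : 0;
    t->expires = expires;
    t->pending = true;
    return was_pending;
}

/* Like mod_timer_pending(): changes the expiry only while the timer is
 * still pending; a timer which has already expired is left alone. */
int toy_mod_timer_pending(struct toy_timer *t, unsigned long expires)
{
    assert(t != NULL);
    if (!t->pending)
        return 0;
    t->expires = expires;
    return 1;
}
```

In this model, calling toy_mod_timer_pending() on a timer which has already fired is a no-op, which is the property that makes the real function useful for code racing against timer expiry.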
If the usual two-week pattern holds, the merge window can be expected to
remain open through about April 9. The rate at which changes flow
into the mainline will likely be lower for the second half of the merge
window - the alternative is for this development cycle to be far larger
than any of its predecessors. But it is certain that more interesting
changes will be merged for 2.6.30.
April 1, 2009
This article was contributed by Goldwyn Rodrigues
The kernel page cache contains in-memory copies of data blocks
belonging to files kept in persistent storage.
Pages which are written to by a processor, but not yet written to disk, are
accumulated in cache and are known as "dirty" pages. The amount of
dirty memory is listed in
/proc/meminfo. Pages in
the cache are flushed to disk after an interval of 30 seconds. Pdflush
is a set of kernel threads which are responsible for writing the
dirty pages to disk, either explicitly in response to a
sync() call, or
implicitly in cases when the page cache runs out of pages, if the
pages have been in memory for too long, or there are too many dirty pages
in the page cache (as specified by
/proc/sys/vm/dirty_ratio).
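For readers who want to watch the dirty-page total mentioned above, here is a minimal user-space sketch which extracts the "Dirty:" field from /proc/meminfo. The function names are made up for this example, and the parser assumes the usual "Dirty: <number> kB" layout of that file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the "Dirty:" value (in kB) found in meminfo-formatted text,
 * or -1 if the field is missing or malformed. */
long dirty_kb_from_text(const char *text)
{
    const char *line = text;

    assert(text != NULL);
    while (line != NULL && *line != '\0') {
        long kb;

        /* Only match at the start of a line. */
        if (strncmp(line, "Dirty:", 6) == 0)
            return sscanf(line + 6, "%ld", &kb) == 1 ? kb : -1;
        line = strchr(line, '\n');
        if (line != NULL)
            line++;           /* step past the newline */
    }
    return -1;
}

/* Read /proc/meminfo and report the dirty-page total in kB; returns -1
 * if the file cannot be read (for example, on a non-Linux system). */
long dirty_kb_from_proc(void)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return dirty_kb_from_text(buf);
}
```

Calling dirty_kb_from_proc() in a loop while copying a large file is an easy way to see dirty memory build up and then drain as writeback runs.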
At any given time, there are between two and eight pdflush threads running in the
system. The number of pdflush threads is determined by the load on the
page cache; new pdflush threads are spawned if
none of the existing pdflush threads have been idle for more than
one second, and there is more work in the pdflush work queue.
On the other hand, if the last active pdflush thread has been asleep
for more than one second, one thread is terminated. Termination of
threads happens until only a minimum number of pdflush
threads remain. The current number of running pdflush threads is
reflected by /proc/sys/vm/nr_pdflush_threads.
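The thread-pool policy just described can be written down as a small decision function. This is only a sketch of the rules as stated in the text, not the pdflush code itself; the function name, its boolean inputs, and the TOY_ constants are all invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_MIN_PDFLUSH = 2, TOY_MAX_PDFLUSH = 8 };

/* Decide the next thread count from the current one.
 *   no_idle_thread_for_1s - no thread has been idle for over a second
 *   work_queued           - the pdflush work queue still holds work
 *   last_active_asleep_1s - the last active thread has slept for over
 *                           a second
 */
int toy_next_pdflush_count(int nr_threads, bool no_idle_thread_for_1s,
                           bool work_queued, bool last_active_asleep_1s)
{
    assert(nr_threads >= TOY_MIN_PDFLUSH && nr_threads <= TOY_MAX_PDFLUSH);

    /* Everybody is busy and more work is waiting: grow the pool. */
    if (no_idle_thread_for_1s && work_queued &&
        nr_threads < TOY_MAX_PDFLUSH)
        return nr_threads + 1;
    /* The pool has gone quiet: shrink by one, down to the minimum. */
    if (last_active_asleep_1s && nr_threads > TOY_MIN_PDFLUSH)
        return nr_threads - 1;
    return nr_threads;
}
```

Applied repeatedly to an idle system, the function walks the count down one thread at a time until only the minimum of two remains.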
A number of pdflush-related issues have come to light over time.
Pdflush threads are common to all block devices, but it is thought that
they would perform better if they concentrated on a single disk spindle.
Contention between pdflush threads is avoided through the use of the
BDI_pdflush flag on the backing_dev_info structure, but
this interlock can also limit writeback performance.
Another issue with pdflush is
request starvation. There is a fixed number of I/O requests available for each
queue in the system. If the limit is exceeded, any application
requesting I/O will block waiting for a new slot. Since pdflush works on several
queues, it cannot afford to block on any single queue, so it sets the
nonblocking flag in the writeback control structure (wbc->nonblocking). If other applications continue to write to the
device, pdflush will not succeed in allocating request slots.
This may lead to starvation of
access to the queue, if pdflush repeatedly finds the queue congested.
In his patch set, Jens Axboe proposes
using flusher threads per backing device info (BDI) as a
replacement for
pdflush threads. Unlike pdflush threads, per-BDI flusher threads focus
on a single disk spindle. With per-BDI flushing, when the
request_queue is congested, blocking happens on request allocation,
avoiding request starvation and providing better fairness.
With pdflush, the dirty inode list is stored by
the super block of the filesystem. Since the per-BDI flusher needs to
be aware of the dirty pages to be written by its assigned device, this list is now stored by the BDI.
Calls to flush dirty inodes on the superblock result in flushing the
inodes from the list of dirty inodes on the backing device for all
devices listed for the filesystem.
As with pdflush, per-BDI writeback is controlled through the
writeback_control data structure, which instructs the writeback code
what to do, and how to perform the writeback. The important fields of this
structure are:
- sync_mode: defines the way synchronization should be performed
with respect to inode locking. If set to WB_SYNC_NONE, the writeback
will skip locked inodes, whereas if set to WB_SYNC_ALL, it will wait for
locked inodes to be unlocked before performing the writeback.
- nr_to_write: the number of pages to write. This value is
decremented as the pages are written.
- older_than_this: If not NULL, all inodes older than the
jiffies recorded in this field are flushed. This field takes precedence over
nr_to_write.
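A toy user-space model may help to show how these three fields steer a writeback pass. All of the toy_ names below are hypothetical and the logic is a simplification: WB_SYNC_ALL is treated as "proceed" rather than actually waiting for a lock, jiffies wraparound is ignored, and no attempt is made to model how older_than_this and nr_to_write interact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_sync_mode { TOY_WB_SYNC_NONE, TOY_WB_SYNC_ALL };

/* Hypothetical, cut-down stand-in for struct writeback_control. */
struct toy_wbc {
    enum toy_sync_mode sync_mode;
    long nr_to_write;                       /* remaining page budget */
    const unsigned long *older_than_this;   /* NULL means no age filter */
};

struct toy_inode {
    unsigned long dirtied_when;  /* jiffies when first dirtied */
    bool locked;
    long dirty_pages;
};

/* Walk the inodes once and "write" pages; returns the pages written. */
long toy_writeback(struct toy_inode *inodes, size_t n, struct toy_wbc *wbc)
{
    long written = 0;
    size_t i;

    assert(inodes != NULL && wbc != NULL);
    for (i = 0; i < n && wbc->nr_to_write > 0; i++) {
        struct toy_inode *ino = &inodes[i];
        long chunk;

        /* Age filter: skip inodes dirtied at or after the cutoff. */
        if (wbc->older_than_this != NULL &&
            ino->dirtied_when >= *wbc->older_than_this)
            continue;
        /* WB_SYNC_NONE skips locked inodes; WB_SYNC_ALL would wait for
         * the lock, which this toy treats as "proceed". */
        if (ino->locked && wbc->sync_mode == TOY_WB_SYNC_NONE)
            continue;

        chunk = ino->dirty_pages < wbc->nr_to_write ?
                ino->dirty_pages : wbc->nr_to_write;
        ino->dirty_pages -= chunk;
        wbc->nr_to_write -= chunk;   /* decremented as pages are written */
        written += chunk;
    }
    return written;
}
```

The caller can inspect nr_to_write afterward to see how much of the budget was left unused, which mirrors the way the field is decremented in place.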
The struct bdi_writeback keeps all information required for flushing
the dirty pages:
struct bdi_writeback {
    struct backing_dev_info *bdi;
    unsigned int nr;
    struct task_struct *task;
    wait_queue_head_t wait;
    struct list_head b_dirty;
    struct list_head b_io;
    struct list_head b_more_io;
    unsigned long nr_pages;
    struct super_block *sb;
};
The bdi_writeback structure is initialized when the device is registered through
bdi_register(). The fields of the bdi_writeback are:
- bdi: the backing_dev_info associated with this
bdi_writeback,
- task: contains the pointer to the default flusher thread
which is responsible for spawning threads for performing the
flushing work,
- wait: a wait queue for synchronizing with the flusher threads,
- b_dirty: list of all the dirty inodes on this BDI to be flushed,
- b_io: inodes parked for I/O,
- b_more_io: more inodes parked for I/O; all inodes queued for
flushing are inserted in this list, before being moved to
b_io,
- nr_pages: total number of pages to be flushed, and
- sb: the pointer to the superblock of the filesystem which
resides on this BDI.
nr_pages and sb are parameters passed asynchronously to
the BDI flush thread, and are not fixed through the life of the
bdi_writeback. This is done to accommodate devices with multiple
filesystems, and hence multiple super_blocks. With multiple super_blocks
on a single device, a sync can be requested for a single filesystem
on the device.
The bdi_writeback_task() function waits for the
dirty_writeback_interval,
which by default is 5 seconds, and initiates wb_do_writeback(wb)
periodically. If there are no pages written for five minutes, the flusher
thread exits (with a grace period of dirty_writeback_interval).
If writeback work is later required (after exit), new flusher
threads are spawned by the default writeback thread.
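The lifetime rule described above (wake up every dirty_writeback_interval, exit after five minutes without writing anything) can be simulated in user space. The sketch below is not bdi_writeback_task() itself; the function name and the TOY_ constants are hypothetical, and the extra grace period is left out.

```c
#include <assert.h>
#include <stddef.h>

enum {
    TOY_WRITEBACK_INTERVAL_SECS = 5,    /* default dirty_writeback_interval */
    TOY_MAX_IDLE_SECS = 5 * 60          /* exit after five idle minutes */
};

/* pages_per_wakeup[i] is the number of pages written by the i-th
 * periodic wakeup.  Returns the 1-based number of the wakeup at which
 * the thread exits, or 0 if it is still running when the trace ends. */
size_t toy_flusher_lifetime(const long *pages_per_wakeup, size_t n)
{
    long idle_secs = 0;
    size_t i;

    assert(pages_per_wakeup != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (pages_per_wakeup[i] > 0)
            idle_secs = 0;                            /* did real work */
        else
            idle_secs += TOY_WRITEBACK_INTERVAL_SECS; /* nothing to do */
        if (idle_secs >= TOY_MAX_IDLE_SECS)
            return i + 1;
    }
    return 0;
}
```

With the default five-second interval, a completely idle thread gives up after 60 consecutive empty wakeups; a single page written in the middle resets the clock.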
Writeback flushes are done in two ways:
- pdflush style: This is initiated in response to an explicit writeback
request, for example syncing inode pages of a super_block.
wb_start_writeback() is called with the superblock information
and the number of pages to be flushed. The function tries to acquire
the bdi_writeback structure associated with the BDI. If successful, it
stores the superblock pointer and the number of pages to be flushed in the
bdi_writeback structure and wakes up the flusher thread to perform the
actual writeout for the superblock. This is different from how pdflush
performs writeouts: pdflush attempts to grab the device from the
writeout path, blocking the writeouts from other processes.
- kupdated style: If there are no explicit writeback requests, the thread
wakes up periodically to flush dirty data. The
first time one of the inode's pages stored in the BDI is dirtied, the
dirtying-time is recorded in the inode's address space. The periodic
writeback code walks through the superblock's inode list, writing
back dirty pages of the inodes older than a specified point in time.
This is run once per dirty_writeback_interval, which defaults
to five seconds.
After review of the first
attempt, Jens added
the ability to run multiple flusher threads per device, based on
the suggestions of Andrew Morton. Dave Chinner suggested that
filesystems would like to have a flusher thread per allocation group.
In the patch set (second iteration) which followed, Jens added a
new interface in the superblock to return the bdi_writeback structure
associated with the inode:
struct bdi_writeback *(*inode_get_wb) (struct inode *);
If inode_get_wb is NULL, the default bdi_writeback of the BDI is
returned, which means there is only one bdi_writeback thread for the BDI. The
maximum number of threads that can be started per BDI is 32.
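The NULL-means-default convention for inode_get_wb can be illustrated with ordinary function pointers. The structures below are hypothetical, stripped-down stand-ins for the kernel's superblock, inode, and BDI; only the lookup logic is meant to reflect the text.

```c
#include <assert.h>
#include <stddef.h>

struct toy_wb { int id; };                    /* stand-in for bdi_writeback */
struct toy_bdi { struct toy_wb default_wb; }; /* stand-in for the BDI */

struct toy_fs_inode;

struct toy_super {
    struct toy_bdi *bdi;
    /* Optional hook, mirroring the inode_get_wb prototype above. */
    struct toy_wb *(*inode_get_wb)(struct toy_fs_inode *);
};

struct toy_fs_inode {
    struct toy_super *sb;
    struct toy_wb *group_wb;  /* hypothetical per-allocation-group state */
};

/* Pick the writeback structure for an inode: ask the filesystem if it
 * provides the hook, otherwise fall back to the BDI's default. */
struct toy_wb *toy_inode_to_wb(struct toy_fs_inode *inode)
{
    assert(inode != NULL && inode->sb != NULL && inode->sb->bdi != NULL);
    if (inode->sb->inode_get_wb != NULL)
        return inode->sb->inode_get_wb(inode);
    return &inode->sb->bdi->default_wb;
}

/* Example hook which a filesystem might supply. */
struct toy_wb *toy_group_hook(struct toy_fs_inode *inode)
{
    return inode->group_wb;
}
```

A filesystem which does nothing gets the single default thread; one which installs a hook like toy_group_hook() can spread its inodes across several writeback structures, for example one per allocation group.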
Initial experiments conducted by Jens found an 8% increase in
performance on a simple SATA drive running Flexible File System
Benchmark (ffsb). File layout was smoother as compared to the
vanilla kernel as reported by vmstat, with a uniform distribution of
buffers written out. With a ten-disk btrfs filesystem, per-BDI flushing performed
25% faster. The writeback work is tracked in Jens's block layer git tree
(git://git.kernel.dk/linux-2.6-block.git) under the "writeback" branch.
There have been no comments on the second iteration so far, but
per-BDI flusher threads are still not ready to go into the
2.6.30 tree.
Acknowledgments: Thanks to Jens Axboe for reviewing and explaining
certain aspects of the patch set.
By Jonathan Corbet
March 31, 2009
Long, highly-technical, and animated discussion threads are certainly not
unheard of on the linux-kernel mailing list. Even by linux-kernel
standards, though, the
thread that followed the 2.6.29 announcement was impressive. Over the
course of hundreds of messages, kernel developers argued about several
aspects of how filesystems and block I/O work on contemporary Linux
systems. In the end (your editor will be optimistic and say that it has
mostly ended), we had a lot of heat - and some useful, concrete results.
One can only pity Jesper Krogh, who almost certainly didn't know what he
was getting into when he posted a report of
a process which had been hung up waiting for disk I/O for several minutes.
All he was hoping for was a suggestion on how to avoid these kinds of
delays - which are a manifestation of the famous ext3 fsync()
problem - on his server. What he got, instead, was to be copied on the
entire discussion.
Journaling priority
One of the problems is at least somewhat understood: a call to
fsync() on an ext3 filesystem will force the filesystem journal
(and related file data) to
be committed to disk. That operation can create a lot of write activity
which must be waited for. But contemporary I/O schedulers tend to
favor read operations over writes. Most of the time, that is a rational
choice: there is usually a process waiting for a read to complete, but
writes can be done asynchronously. A journal commit is not asynchronous,
though, and it can cause a lot of things to wait while it is in progress.
So it would be better not to put journal I/O operations at the end of the
queue.
In fact, it would be better not to make journal operations contend with the
rest of the system at all. To that end, Arjan van de Ven has long
maintained a simple patch
which gives the kjournald thread realtime I/O priority. According to Alan Cox, this patch alone is
sufficient to make a lot of the problems go away. The patch has never made
it into the mainline, though, because Andrew
Morton has blocked it. This patch, he says, does not address the real
problem, and it causes a lot of unrelated I/O traffic to benefit from
elevated priority as well. Andrew says the real fix is harder:
The bottom line is that someone needs to do some serious rooting
through the very heart of JBD transaction logic and nobody has yet
put their hand up. If we do that, and it turns out to be just too
hard to fix then yes, perhaps that's the time to start looking at
palliative bandaids.
Bandaid or not, this approach has its adherents. The ext4 filesystem has a
new mount option (journal_ioprio) which can be used to set the I/O
priority for journaling operations; it defaults to something higher than
normal (but not realtime). More recently, Ted Ts'o has posted a series of ext3 patches which
set the WRITE_SYNC flag on some journal writes. That flag marks
the operations as synchronous, which will keep them from being blocked by a
long series of read operations. According to Ted, this change helps quite
a bit, at least when there is a lot of read activity going on. The ext3
changes have not yet been merged for 2.6.30 as of this writing (none of
Ted's trees have), but chances are they will go in before 2.6.30-rc1.
data=ordered, fsync(), and fbarrier()
The real problem, though, according to Ted, is the ext3
data=ordered mode. That is the mode which makes ext3 relatively
robust in the face of crashes, but, says Ted, it has done so at the cost of
performance and the encouragement of poor user-space programming. He went
so far as to express his regrets for
this behavior:
All I can do is apologize to all other filesystem developers
profusely for ext3's data=ordered semantics; at this point, I very
much regret that we made data=ordered the default for ext3. But
the application writers vastly outnumber us, and realistically
we're not going to be able to easily roll back eight years of
application writers being trained that fsync() is not necessary,
and actually is detrimental for ext3.
The only problem here is that not everybody believes that ext3's behavior
is a bad thing - at least, with regard to robustness. Much of this branch
of the discussion covered the same issues raised by LWN in "Better than POSIX?" a couple of
weeks before. A significant subset of developers do not want the
additional robustness provided by ext3 data=ordered mode to go
away. Matthew Garrett expressed this position
well:
But you're still arguing that applications should start using
fsync(). I'm arguing that not only is this pointless (most of this
code will never be "fixed") but it's also regressive. In most cases
applications don't want the guarantees that fsync() makes, and
given that we're going to have people running on ext3 for years to
come they also don't want the performance hit that fsync()
brings. Filesystems should just do the right thing, rather than
losing people's data and then claiming that it's fine because POSIX
said they could.
One option which came up a couple of times was to extend POSIX with a new
system call (called something like fbarrier()) which would enforce
ordering between filesystem operations. A call to fbarrier()
could, for example, cause the data written to a new file to be forced out
to disk before that file could be renamed on top of another file. The idea
has some appeal, but Linus dislikes it:
Anybody who wants more complex and subtle filesystem interfaces is
just crazy. Not only will they never get used, they'll definitely
not be stable...
So rather than come up with new barriers that nobody will use,
filesystem people should aim to make "badly written" code "just
work" unless people are really really unlucky. Because like it or
not, that's what 99% of all code is.
And that is almost certainly how things will have to work. In the end, a
system which just works is the system that people will want to use.
relatime
Meanwhile, another branch of the conversation revisited an old topic: atime
updates. Unix-style filesystems traditionally track the time that each
file was last accessed ("atime"), even though, in reality, there is very
little use for this information. Tracking atime is a performance problem,
in that it turns every read operation into a filesystem write as well. For
this reason, Linux has long had a "noatime" mount option which would
disable atime updates on the indicated filesystem.
As it happens, though, there can be problems with disabling atime
entirely. One of them is that the mutt mail client uses atime to
determine whether there is new mail in a mailbox. If the time of last
access is prior to the time of last modification, mutt knows that
mail has been delivered into that mailbox since the owner last looked at
it. Disabling atime breaks this mechanism. In response to this problem,
the kernel added a "relatime" option which causes atime to be updated only
if the previous value is earlier than the modification time. The relatime
option makes mutt work, but it, too, turns out to be insufficient:
some distributions have temporary-directory cleaning programs which delete
anything which hasn't been used for a sufficiently long period. With
relatime, files can appear to be totally unused, even if they are read
frequently.
If relatime could be made to work, the benefits could be significant; the
elimination of atime updates can get rid of a lot of writes to the disk.
That, in turn, will reduce latencies for more useful traffic and will also
help to avoid disk spin-ups on laptops. To that end, Matthew Garrett
posted a patch to modify the relatime semantics slightly: it allows atime
to be updated if the previous value is more than one day in the past. This
approach eliminates almost all atime updates while still keeping the value
close to current.
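The resulting rule is simple enough to express as a single predicate. What follows is a user-space sketch, not the kernel's code: the function name is invented, timestamps are plain time_t values, and the maximum age is passed as a parameter purely for illustration. It also checks the inode change time alongside the modification time, and treats equal timestamps as needing an update, on the assumption that the kernel does both.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Returns true if an access at time `now` should update atime under
 * the modified relatime rule. */
bool toy_relatime_need_update(time_t atime, time_t mtime, time_t chg_time,
                              time_t now, long max_age_secs)
{
    assert(max_age_secs > 0);

    /* The file was modified since it was last read. */
    if (mtime >= atime)
        return true;
    /* The inode changed since the file was last read. */
    if (chg_time >= atime)
        return true;
    /* The recorded atime is at least max_age_secs old (one day in the
     * patch as merged). */
    if (difftime(now, atime) >= (double)max_age_secs)
        return true;
    return false;
}
```

A file which is read many times a day thus costs at most one atime write per day, while a temporary-file cleaner looking for files untouched for a week will still see a recent access time.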
This patch was proposed for merging, and more: it was suggested that
relatime should be made the default mode for filesystems mounted under
Linux. Anybody wanting the traditional atime behavior would have to mount
their filesystems with the new "strictatime" mount option. This idea ran
into some immediate opposition, for a couple of reasons. Andrew Morton didn't like the hardwired 24-hour value,
saying, instead, that the update period should be given as a mount option.
This option would be easy enough to implement, but few people think there
is any reason to do so; it's hard to imagine a use case which requires any
degree of control over the granularity of atime updates.
Alan Cox, instead, objected to the patch as
an ABI change and a standards violation. He tried to "NAK" the patch,
saying that, instead, this sort of change should be done by distributors.
Linus, however, said he doesn't care; the
relatime change and strictatime option were the very first things he merged
when he opened the 2.6.30 merge window. His position is that the
distributors have had more than a year to make this change, and they
haven't done so. So the best thing to do, he says, is to change the
default in the kernel and let people use strictatime if they really need
that behavior.
For the curious, Valerie Aurora has written a detailed
article about this change. She doesn't think that the patch will
survive in its current form; your editor, though, does not see a whole lot
of pressure for change at this point.
I/O barriers
Suppose you are a diligent application developer who codes proper
fsync() calls where they are necessary. You might think that you
are then protected against data loss in the face of a crash. But there is
still a potential problem: the disk drive may lie to the operating system
about having written the data to persistent media. Contemporary hardware
performs aggressive caching of operations to improve performance; this
caching will make a system run faster, but at the cost of adding another
way for data to get lost.
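For reference, the sort of "diligent" code in question usually looks something like the sketch below: write the new contents to a temporary file, fsync() it, then rename it over the target. The function name and its arguments are made up for this example, retrying on EINTR is omitted, and a fully careful application would probably fsync() the containing directory as well.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Replace `path` with `data`: write a temporary file, push it to
 * storage with fsync(), then rename it over the target.
 * Returns 0 on success, -1 on failure. */
int replace_file_durably(const char *path, const char *tmp_path,
                         const char *data)
{
    size_t len, done = 0;
    int fd;

    assert(path != NULL && tmp_path != NULL && data != NULL);
    len = strlen(data);
    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may accept fewer bytes than requested, so loop. */
    while (done < len) {
        ssize_t n = write(fd, data + done, len - done);

        if (n < 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        done += (size_t)n;
    }

    /* Ask the kernel to push the file's data out before renaming. */
    if (fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return rename(tmp_path, path);
}
```

Even with all of that in place, the fsync() call only guarantees what the kernel and the drive are willing to guarantee.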
There is, of course, a way to tell a drive to actually write data to
persistent media. The block layer has long had support for barrier
operations, which cause data to be flushed to disk before more operations
can be initiated. But the ext3 filesystem does not use barriers by default
because there is an associated performance penalty. With ext4, instead,
barriers are on by default.
Jeff Garzik pointed out one associated
problem: a call to fsync() does not necessarily cause the drive to
flush data to the physical media. He suggested that fsync()
should create a barrier, even if the filesystem as a whole is not using
barriers. In that way, he says, fsync() might actually live up to
the promise that it is making to application developers.
The idea was not controversial, even though people are, as a whole, less
concerned with caches inside disk drives. Those caches tend to be
short-lived, and they are quite likely to be written even if the operating
system crashes or some other component of the system fails. So the chances
of data loss at that level are much smaller than they are with data in an
operating system cache. Still, it's possible to provide a higher-level
guarantee, so Fernando Luis Vazquez Cao posted a series of patches to add barriers to
fsync() calls. And that is when the trouble started.
The fundamental disagreement here is over what should happen when an
attempt to send a flush operation to the device fails. Fernando's patch
returned an ENOTSUPP error to the caller, but Linus asked for it to be removed. His position is
that there is nothing that the caller can do about a failed barrier
operation anyway, so there is no real reason to propagate that error
upward. At most, the system should set a flag noting that the device
doesn't support barriers. But, says Linus, filesystems should cope with
what the storage device provides.
Ric Wheeler, instead, argues that
filesystems should know if barrier operations are not working and be able
to respond accordingly. Says Ric:
One thing the caller could do is to disable the write cache on the
device. A second would be to stop using the transactions - skip the
journal, just go back to ext2 mode or BSD like soft updates.
Basically, it lets the file system know that its data integrity
building blocks are not really there and allows it (if it cares) to
try and minimize the chance of data loss.
Alan Cox also jumped into this discussion
in favor of stronger barriers:
Throw and pray the block layer can fake it simply isn't a valid
model for serious enterprise computing, and if people understood
the worst cases, for a lot of non enterprise computing.
Linus appears to be unswayed by these arguments, though. In his view,
filesystems should do the best they can and accept what the underlying
device is able to do. As of this writing, no patches adding barriers to
fsync() have been merged into the mainline.
Related to this is the concept of laptop mode. It has been suggested that, when a system is in laptop
mode, an fsync() call should not actually flush data to disk;
flushing the data would cause the drive to spin up, defeating the intent of
laptop mode. The response to I/O barrier requests would presumably be
similar. Some developers oppose this idea, though, seeing it as a
weakening of the promises provided by the API. This looks like a topic
which could go a long time without any real resolution.
Performance tuning
Finally, there was some talk about trying to make the virtual memory
subsystem perform better in general. Part of the problem here has been
recognized for some time: memory sizes have grown faster than disk speeds.
So it takes a lot longer to write out a full load of dirty pages than it
did in the past. That simple dynamic is part of the reason why writeout
operations can stall for long periods; it just takes that long to funnel
gigabytes of data onto a disk drive. It is generally expected that
solid-state drives will eventually make this problem go away, but it is
also expected that it will be quite some time, yet, before those drives are
universal.
In the meantime, one can try to improve performance by not allowing the
system to accumulate as much data in need of writing. So, rather than
letting dirty pages stay in cache for (say) 30 seconds, those pages
should be flushed more frequently. Or the system could adjust the
percentage of RAM which is allowed to be dirty, perhaps in response to
observations about the actual bandwidth of the backing store devices.
The kernel already has a "percentage dirty" limit, but some developers are
now suggesting that the limit should be a fixed number of bytes instead.
In particular, that limit should be set to the number of bytes which can be
flushed to the backing store device in (say) one second.
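The arithmetic behind the two kinds of limit is easy to sketch. The helper names below are invented for this example, and any machine sizes or disk speeds plugged into them are illustrative inputs, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Dirty limit expressed as a percentage of RAM (the existing style). */
uint64_t toy_dirty_limit_from_ratio(uint64_t ram_bytes, unsigned ratio_pct)
{
    assert(ratio_pct <= 100);
    return ram_bytes / 100 * ratio_pct;
}

/* Dirty limit sized so that it can be flushed in `secs` seconds. */
uint64_t toy_dirty_limit_from_bandwidth(uint64_t bytes_per_sec,
                                        unsigned secs)
{
    return bytes_per_sec * secs;
}

/* Seconds needed to flush `dirty_bytes` at the given bandwidth,
 * rounded up to a whole second. */
uint64_t toy_flush_secs(uint64_t dirty_bytes, uint64_t bytes_per_sec)
{
    assert(bytes_per_sec > 0);
    return (dirty_bytes + bytes_per_sec - 1) / bytes_per_sec;
}
```

As a hypothetical example, a machine with 4GB of RAM, a 10% limit, and a disk which sustains 50MB/s could accumulate 400MB of dirty data, which would take about eight seconds to flush; a limit derived from one second of bandwidth would cap the total at 50MB instead.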
Nobody objects to the idea of a better-tuned virtual memory subsystem. But
there is some real disagreement over how that tuning should be done. Some
developers argue for exposing the tuning knobs to user space and letting
the distributors work it out. Andrew is a
strong proponent of this approach:
I've seen you repeatedly fiddle the in-kernel defaults based on
in-field experience. That could just as easily have been done in
initscripts by distros, and much more effectively because it
doesn't need a new kernel. That's data.
The fact that this hasn't even been _attempted_ (afaik) is deplorable.
Why does everyone just sit around waiting for the kernel to put a new
value into two magic numbers which userspace scripts could have
set?
The objections to that approach follow these lines: the distributors cannot
get these numbers right; in fact, they are not really even inclined to try
to get them right. The proper tuning values tend to change from one kernel
to the next, so it makes sense to keep them with the kernel itself. And
the kernel should be able to get these things right if it is doing its job
at all. Needless to say, Linus argues for
this approach, saying:
We should aim to get it right. The "user space can tweak any
numbers they want" is ALWAYS THE WRONG ANSWER. It's a cop-out, but
more importantly, it's a cop-out that doesn't even work, and that
just results in everybody having different setups. Then nobody is
happy.
Linus has suggested (but not implemented) one
set of heuristics which could help the system to tune itself. Neil
Brown also has a suggested approach, based
on measuring the actual performance of the system's storage devices.
Fixing things at this level is likely to take some time; virtual memory
changes always do. But some smart people are starting to think about the
problem, and that's an important first step.
That, too, could be said for the discussion as a whole. There are clearly
a lot of issues surrounding filesystems and I/O which have come to the
surface and need to be discussed. The Linux kernel community as a whole needs to
think through the sort of guarantees (for both robustness and performance)
it will offer to its users and how
those guarantees will be fulfilled. As it happens, the 2009 Linux
Storage & Filesystems Workshop begins on April 6. Many of
these topics are likely to be discussed there. Your editor has managed to
talk his way into that room; stay tuned.
Page editor: Jonathan Corbet