Brief items
The current stable 2.6 kernel is 2.6.17.3, released on June 30. It
was a single-fix release for a denial of service vulnerability in the
netfilter SCTP connection tracking code. One day earlier, 2.6.17.2 had
been released with a relatively large set of important fixes. The SCTP
fix can also be found in 2.6.16.23.
The current 2.6 prepatch is 2.6.18-rc1, released by Linus on
July 5. A summary of changes can be found in a separate article
below. Also available are the short-form changelog (too bulky
to be included with Linus's announcement) and the long-form
changelog.
The current -mm tree is 2.6.17-mm6. Recent changes to
-mm include some extensions to the read-copy-update API, some "massive" CPU
scheduler cleanup work, the removal of a number of old (OSS) sound drivers,
and a set of patches shrinking the inode structure. A great many
patches have been removed from -mm as they have found their way into
2.6.18-rc1.
Kernel development news
I like colorized diffs, but let's face it, those particular color
choices will make most people decide to pick out their eyes with a
fondue fork. And that's not good. Digging in your eye-sockets with
a fondue fork is strictly considered to be bad for your health, and
seven out of nine optometrists are dead set against the practice.
So in order to avoid a lot of blind git users, please apply this
patch.
-- Linus Torvalds
Your editor, having returned from an all-too-short vacation, was faced with
the prospect of looking over the 4500 (and counting) patches merged for the
2.6.18-rc1 release. Much of what has been merged is the usual set of fixes
and updates, but some more user and developer-visible patches have gone in
as well. The user-visible patches include:
- The new core time system has finally found its way into the mainline;
it was covered here in
January, 2005, but has evolved considerably since then.
- New device drivers for SMSC LAN911x Ethernet chipsets,
ZyDAS ZD1211-based wireless LAN adapters,
Myricom Myri-10G interfaces, CS553x NAND flash controllers,
Amstrad E3 Delta flash controllers, Abit uGuru hardware monitoring
chips, NS LM70 temperature sensors, a number of Echoaudio sound cards,
and more.
- Generic support for hardware random number generators has been added,
along with drivers for a long list of generators.
- The Philips Webcam driver has seen a massive update which adds image
decompression support (without legal issues this time), support for a
number of new devices, and many improvements.
- A large set of NFS patches has been merged, adding, among other
things, direct I/O support.
- A netlink interface for network bridging management.
- A netfilter connection tracking helper for the SIP protocol.
- The TCP Low
Priority, TCP Compound, and TCP Veno
congestion control algorithms.
- A new mechanism for attaching SELinux labels to network packets.
There is also a new set of hooks allowing SELinux to regulate the
kernel key management subsystem.
- Extended attribute support in the JFFS2 filesystem.
- A number of kernel include files have been cleaned up to make it
easier to include them into user-space applications.
- PCI devices now export an "enable" attribute via sysfs. The main
purpose for the new attribute is to allow the X server to enable and
disable devices without doing direct I/O memory access.
- The swapless page migration
patches have been merged, easing the movement of pages between
NUMA nodes. There is also a new move_pages() system call
which can be used to determine where pages reside and possibly move
them to a new node.
- The TCP segmentation offload code has been updated and improved.
There is a new "generic segmentation offload" layer which can emulate
TSO in software; evidently this approach yields some of the
performance benefits of TSO on hardware which does not support
segmentation offloading.
- The default disk I/O scheduler is now the "completely fair queueing"
(CFQ) scheduler.
- A massive set of serial ATA
changes has been merged, including a new error handler, rewritten
programmed I/O support, native command queueing (NCQ) support (which
should improve performance considerably), and hotplug support.
- Priority-inheriting
futexes have been merged into the mainline.
- SMPnice, a set of
scheduler heuristic changes meant to improve handling of low-priority
processes on SMP systems, has been merged.
Internal API changes visible to kernel developers include:
- The generic IRQ layer
has been merged. The SA_* flags to request_irq()
have been renamed; the new prefix is IRQF_. A long series of patches
has converted in-tree drivers over to the new names; the old names
are scheduled for removal in January, 2007.
- 64-bit resources are now
supported. This change affects a number of users of the resource
management API.
- The kernel lock
validator has gone in, along with a number of fixes for potential
deadlocks found by the validator.
- At long last, the devfs subsystem has been removed.
- An API and support for
the Intel I/OAT DMA engine.
- The skb_linearize() function has been reworked, and no longer
has a GFP flags argument. There is also a new
skb_linearize_cow() function which ensures that the resulting
SKB is writable.
- Network drivers should no longer manipulate the xmit_lock
spinlock in the net_device structure; instead, the following
new functions should be used:
    void netif_tx_lock(struct net_device *dev);
    void netif_tx_lock_bh(struct net_device *dev);
    void netif_tx_unlock(struct net_device *dev);
    void netif_tx_unlock_bh(struct net_device *dev);
    int netif_tx_trylock(struct net_device *dev);
- The long-deprecated inter_module API has finally been removed
altogether.
- A new kernel API providing access to the "inotify" functionality has
been added.
- The old scsi_request infrastructure has been removed, since
there are no longer any in-tree drivers which use it.
- The include file <linux/usb_input.h> is now
<linux/usb/input.h>.
- The VFS get_sb() filesystem method has a new prototype:
    int (*get_sb)(struct file_system_type *fstype, int flags,
                  const char *dev_name, void *data,
                  struct vfsmount *mnt);
The mnt parameter is new; it allows the filesystem to receive
a pointer to the target mount point structure. The mount point should
be associated with the superblock in the get_sb() method with
a call to:
    int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb);
The return value of get_sb() has also been changed to
an int error status. The various get_sb_*()
convenience functions have had the same changes applied. The purpose
of all this work is to allow NFS to share superblocks across mount
points.
- The statfs() superblock operation has a new prototype:
    int (*statfs)(struct dentry *dentry, struct kstatfs *stats);
The old struct super_block pointer is now a dentry
pointer instead.
- Some functions have been added to make it easy for kernel code to
allocate a buffer with vmalloc() and map it into user space.
They are:
    void *vmalloc_user(unsigned long size);
    void *vmalloc_32_user(unsigned long size);
    int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
                            unsigned long pgoff);
The first two functions are a form of vmalloc() which obtain
memory intended to be mapped into user space; among other things, they
zero the entire range to avoid leaking data.
vmalloc_32_user() allocates low memory only. A call to
remap_vmalloc_range() will complete the job; it will refuse,
however, to remap memory which has not been allocated with one of the
two functions above.
- The read-copy-update API is now accessible only to GPL-licensed
modules. The deprecated function synchronize_kernel() has
also been removed.
- There is a new strstrip() library function which removes
leading and trailing white space from a string.
- A new WARN_ON_ONCE macro will test a condition and complain
if that condition evaluates true - but only once per boot.
- A number of crypto API changes have been merged, the biggest being a
change to most algorithm-specific functions to take a pointer to the
crypto_tfm structure, rather than the old "context" pointer.
This change was necessary to support parameterized algorithms.
- There is a new make target "headers_install". Its purpose is
to install a set of kernel headers useful for libraries and user-space
tools. A limited set of headers is installed, and those headers are
sanitized on their way to the destination directory. It is hoped that
distributors will use this mechanism to set up kernel headers for
inclusion from user space in the future.
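Two of the smaller helpers in the list above, strstrip() and WARN_ON_ONCE, are easy to picture with a user-space sketch. What follows is not the kernel's code: the names strstrip_sketch(), WARN_ON_ONCE_SKETCH, warnings_emitted, and check_not_negative() are invented for illustration, and the sketch assumes that the string helper trims its buffer in place and returns a pointer to the first non-blank character.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * User-space sketch of a strstrip()-style helper (hypothetical name).
 * Trailing white space is cut off by writing a terminator into the
 * buffer; leading white space is skipped by returning a pointer past it.
 */
char *strstrip_sketch(char *s)
{
    size_t len = strlen(s);

    /* Drop trailing white space. */
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        len--;
    s[len] = '\0';

    /* Skip leading white space; the terminator is not white space. */
    while (isspace((unsigned char)*s))
        s++;
    return s;
}

/* Counts the warnings actually printed, so the behavior can be checked. */
int warnings_emitted;

/*
 * User-space sketch of a warn-once macro (hypothetical name). Each place
 * the macro is used gets its own static flag, so each call site can
 * complain at most once for the life of the process.
 */
#define WARN_ON_ONCE_SKETCH(cond)                                   \
    do {                                                            \
        static int warned_;                                         \
        if ((cond) && !warned_) {                                   \
            warned_ = 1;                                            \
            warnings_emitted++;                                     \
            fprintf(stderr, "warning: %s at %s:%d\n",               \
                    #cond, __FILE__, __LINE__);                     \
        }                                                           \
    } while (0)

/* Example call site: warns the first time a negative value shows up. */
void check_not_negative(int value)
{
    WARN_ON_ONCE_SKETCH(value < 0);
}
```

Passing a buffer holding "  hello world \t\n" to strstrip_sketch() yields "hello world", and calling check_not_negative(-1) repeatedly produces a single warning.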
As of this writing, the 2.6.18 merge window has closed, so there probably
will not be a whole lot of additions to the above list.
A few weeks ago, this page looked at possible additions to the ext3
filesystem and the question of whether the time had come to freeze ext3
and put new features into a new ext4 filesystem instead. The ext2/3
filesystem developers have now responded to that discussion with a clear
answer: they will be moving on to ext4.
More specifically, a new filesystem will be created under fs/ext4
in the kernel source. Said filesystem will register itself as
"ext3dev," in an attempt to make it crystal clear that it is a
development filesystem, not suitable for the storage of data which one
actually wishes to keep. New feature work - especially changes which
alter on-disk formats and prevent interoperation with current ext3 implementations
- will go into this new filesystem, while ext3 will continue to receive bug
fixes and some safe improvements. Throughout this process, the new
filesystem will retain its ability to work with the current ext3 format.
Sometime in the future, ext3dev will be declared stable and renamed "ext4."
Once the last bugs have been shaken out, this filesystem will lose its
"experimental" designation and users will be encouraged to upgrade. Since
support for ext3 formats will be there, this upgrade should be an easy
process, with no backup-and-restore step or downtime required. Further in
the future, the ext3 code may be removed and ext4 would transparently handle
ext3 filesystems as well.
There seems to be little opposition to this approach, so it would appear
that things will happen this way. Since the addition of a new,
experimental filesystem carries little regression risk, the creation of
ext4 and the addition of some new features (extents, for example) could yet
happen for 2.6.18.
July 5, 2006
This article was contributed by Valerie Henson
The Linux file systems community met in Portland in June 2006 to
discuss the next 5 years of file system development in Linux.
Organized by Val Henson, Zach Brown, and Arjan van de Ven, and
sponsored by Intel, Google, and Oracle, the Linux File Systems
Workshop brought together thirteen Linux file systems developers
and experts to share data and brainstorm for three days. Our goal was
to discuss the direction of Linux file systems development during the
next 5 years, with a focus on disruptive technologies rather than
incremental improvements. Our goal was not to design one new file
system to rule them all, but to come up with several useful new file
system architecture ideas (which may or may not reuse existing file
system code). To stay focused, we explicitly ruled out discussion of
the design of distributed or clustered file systems, with the
exception of how they impact local file system design. We came out of
the workshop with broad agreement on the problems facing Linux file
systems, several exciting file system architecture ideas, and a
commitment to working together on the next generation of Linux file
systems.
The Problem
Why do we need a Linux file systems workshop, when all seems well in
Linux file systems land? Disks purr gently along, larger and fatter
than ever before, but still essentially the same. I/O errors are an
endangered species, more rumor than fact, and easily corrected with a
simple fsck. The "df" command returns a comforting 50% free on most
of your file systems. You chuckle gently as you read old file system
man pages with directions for tuning inode/block ratios. Sure, that
32-bit file system size limit is looming somewhere over the horizon,
but a quick patch to change the size of your block pointers is all you
need and you'll be back in business again. After all, file systems
are a solved problem, right? Right?
If computer hardware never changed, we kernel developers would have
nothing better to do than argue about the optimal scheduling algorithm
and flame each others' coding style. Unfortunately, hardware has this
terrible habit of changing frequently, drastically, and worst of all,
exponentially. File systems are especially vulnerable to changes in
hardware because of their long-lived nature. Much of operating
systems software can be changed at will given a simple system reboot.
But file systems - and their on-disk data layouts - live on and on.
What has changed in hardware that affects file systems? Let's start
with some simple, unavoidable facts about the way disks are evolving.
Everyone knows that disk capacity is growing exponentially, doubling
every 9-18 months. But what about disk bandwidth and seek time? At
the last Storage Networking World
conference, Seagate presented some details of their hard disk
road map for the next 7 years (see page 16 of the
slides [PDF]). Their predictions for 3.5 inch hard disks are summarized
in the following table.
| Parameter | 2006 | 2009 | 2013 | Improvement |
| Capacity (GB) | 500 | 2000 | 8000 | 16x |
| Bandwidth (Mb/s) | 1000 | 2000 | 5000 | 5x |
| Read seek time (ms) | 8 | 7.2 | 6.5 | 1.2x |
In summary, over the next 7 years, disk capacity will increase by 16
times, while disk bandwidth will increase only 5 times, and seek time
will barely budge! Today it takes a theoretical minimum of 4,000
seconds, or about 1 hour, to read an entire disk sequentially (in
reality, it's longer due to a variety of factors). In 2013, it will
take a minimum of 12,800 seconds, or about 3.5 hours, to read an
entire disk - an increase of more than 3 times. Random I/O workloads are even
worse, since seek times are nearly flat. A workload that reads, e.g.,
10% of the disk non-sequentially will take much longer on our 8TB
2013-era disk than it did on our 500GB 2006-era disk.
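The arithmetic behind those read times can be checked with a few lines of C. This is only a sketch: it assumes decimal units (1 GB = 8,000 megabits) and reads the table's bandwidth figures as megabits per second. The names disk_spec, full_read_seconds(), and print_full_read_time() are invented for illustration.

```c
#include <stdio.h>

/* One column of the table above (Seagate's projections for 3.5 inch disks). */
struct disk_spec {
    int year;
    double capacity_gb;     /* capacity in gigabytes (decimal, 10^9 bytes) */
    double bandwidth_mbps;  /* sustained bandwidth in megabits per second */
};

/*
 * Theoretical minimum time, in seconds, to read the whole disk
 * sequentially: total megabits divided by megabits per second.
 * Assumes 1 GB = 8000 megabits; seeks and other overheads are ignored.
 */
double full_read_seconds(struct disk_spec d)
{
    double capacity_megabits = d.capacity_gb * 8000.0;

    return capacity_megabits / d.bandwidth_mbps;
}

/* Print the full-read time for one disk generation in seconds and hours. */
void print_full_read_time(struct disk_spec d)
{
    double seconds = full_read_seconds(d);

    printf("%d: %.0f s (%.2f h)\n", d.year, seconds, seconds / 3600.0);
}
```

Fed the 2006 column (500 GB, 1000 Mb/s) and the 2013 column (8000 GB, 5000 Mb/s), full_read_seconds() gives 4,000 and 12,800 seconds respectively, a ratio of 3.2.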
Another interesting change in hardware is the rate of increase in
capacity versus the rate of reduction in I/O errors per bit. In order
for a disk to have the same overall number of I/O errors, every time
capacity doubles, the per-bit I/O error rate must halve. Needless to
say, this isn't happening, so I/O errors are actually more common even
though the per-bit error rate has dropped.
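The scaling argument can be written down directly: the expected number of errors in one full pass over the disk is the number of bits times the per-bit error rate. The sketch below assumes independent errors and decimal gigabytes, and its function names are our own invention; any error rate passed in is a placeholder, not a figure from a disk vendor.

```c
/*
 * Expected number of I/O errors when every bit of the disk is read once,
 * assuming independent errors at a fixed per-bit rate.
 * capacity_gb is in decimal gigabytes (8e9 bits each).
 */
double expected_errors_per_full_read(double capacity_gb,
                                     double per_bit_error_rate)
{
    double bits = capacity_gb * 8e9;

    return bits * per_bit_error_rate;
}

/*
 * Per-bit error rate a larger disk would need in order to see the same
 * expected number of errors per full read as the smaller disk did.
 * Doubling the capacity halves the rate that can be tolerated.
 */
double rate_to_hold_errors_constant(double old_rate,
                                    double old_capacity_gb,
                                    double new_capacity_gb)
{
    return old_rate * old_capacity_gb / new_capacity_gb;
}
```

Whatever rate is plugged in, going from a 500GB disk to an 8000GB disk multiplies the expected error count by 16 unless the per-bit rate falls by the same factor.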
These are only a few of the changes in disk hardware that will occur
over the next decade. What do these changes mean for file systems?
First, fsck will take a lot longer in absolute terms, because disk
capacity is larger, but disk bandwidth is relatively smaller, and seek
time is relatively much larger. Fsck on multi-terabyte file systems
today can easily take 2 days, and in the future it will take even
longer! Second, the increasing number of I/O errors means that fsck
is going to happen a lot more often - and journaling won't help.
Existing file systems simply weren't designed with this kind of I/O
error frequency in mind.
These problems aren't theoretical - they are already affecting systems
that you care about. Recently, the main server for Linux kernel
source, kernel.org, suffered file system corruption from a failure at
the RAID level. It took over a week for fsck to repair the (ext3)
file system, when it would have taken far less time to restore from
backup.
The workshop
Now that the stage is set, we'll move on to what happened at the 2006
Workshop. The coverage has been split into the following pages:
- Day 1, devoted mostly to understanding
the current state of the art: file system repair, disk errors, lessons
learned from existing file systems, and major filesystem
architectures.
- Days 2 and 3, concerned with the way
forward: interesting ideas, near-term needs, and development plans.
Page editor: Jonathan Corbet