Brief items
The 2.6.25 merge window is still open, so there have not yet been
any prepatches for this development cycle. Patches continue to flow into
the mainline repository, with some 7500 changesets merged (as of this
writing) for 2.6.25.
The current -mm tree is 2.6.24-mm1. Recent changes
to -mm include the dropping of a number of subsystem trees due to patch
conflicts and the movement of vast numbers of patches into the mainline.
For older kernels: 2.6.22.17, released on
February 6, contains a significant number of fixes. It is likely to
be the last release in the 2.6.22.x series.
Kernel development news
I don't think that "developer-centric" debugging is really even
remotely our problem, and that I'm personally a lot more interested
in infrastructure that helps normal users give better
bug-reports. And kgdb isn't even _remotely_ it.
-- Linus Torvalds, still not sold on kernel debuggers.
I used kgdb continuously for 4-5 years until it broke. I don't
think I ever used it much for "debugging" as such. I used it more
for general observation of what's going on in the kernel.
-- Andrew Morton
By Jonathan Corbet
February 6, 2008
Since last week's installment, some 3800 changesets have been merged into
the mainline git repository. Some of the more interesting user-visible
changes found in that patch stream include:
- Support for new hardware, including RDC R-321x system-on-chip
processors, Onkyo SE-90PCI and SE-200PCI sound devices, Xilinx ML403
AC97 controllers, TI TLV320AIC3X audio codecs, Realtek
ALC889/ALC267/ALC269 codecs, VIA VT1708B HD audio codecs, SiS 7019
Audio Accelerator devices, C-Media 8788 (Oxygen) audio chipsets, Asus
AV200-based sound cards, Freescale MPC8610 audio devices, Audiotrak
Prodigy 7.1 HiFi audio devices, Conexant 5051 audio codecs,
MediaTek/TempoTec HiFier Fantasia sound cards, wireless RNDIS devices
(and Broadcom 4320-based devices in particular), USB printer gadgets
(intended for use in printer firmware),
and NetEffect 1/10Gb ethernet adapters.
- The nearly-unused ALSA sequencer instrument layer has been removed.
- SELinux has a new set of checks which allow the creation of policies
which control the flow of packets into and out of the system.
- Netfilter has a more flexible "hashlimit" mechanism for limiting the
number of packets to/from a given source over time.
- There is a new "flow" classifier for the network fair queueing code
which allows the more flexible creation of traffic policies.
- The futex mechanism has a new "bitset wait" mechanism which allows for
more targeted wakeups. This feature will be used by glibc to
implement optimized reader-writer locks.
- PCI hotplug is no longer an experimental feature.
- Support for PCI Express ASPM, a power management protocol, has been
added.
- The virtio "balloon" driver (which can be used to change the amount of
memory used by a KVM guest) and PCI driver have been added.
- The CLONE_STOPPED bit (for the clone() system call)
is said to be unused and is planned for removal. For 2.6.25, a
warning will be printed.
- The timerfd() system call is back, with a reworked, more capable
API.
- The page map patches,
which enable much better accounting of memory use by processes, have
been merged.
- The "PM QOS" infrastructure allows both kernel and user-space code to
register quality-of-service requirements (in the form of CPU DMA
latency, network latency, and network throughput). These requirements
will be taken into account when the kernel considers putting the
system into a lower-power state.
- Per-process capability bounding sets (which permanently remove
potential capabilities from a process) are now supported. 64-bit
capability mask support has also been merged.
- The simplified mandatory
access control kernel (SMACK) security module has been merged.
- The smbfs filesystem has (finally) been deprecated in favor of CIFS.
It is now scheduled for removal in 2.6.27.
- There is a new RPC transport module allowing (client) NFS mounts using
RDMA.
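The futex "bitset wait" item in the list above can be illustrated from user space. What follows is a minimal sketch, assuming a Linux system (2.6.25 or later) whose headers provide FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET in <linux/futex.h>; the wrapper names futex_wait_bitset() and futex_wake_bitset() are hypothetical, since the C library exports no futex() function and the operation must be reached through syscall(). A waiter supplies a 32-bit mask, and a wakeup is only delivered to waiters whose mask shares at least one bit with the mask given by the waker.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: sleep on *uaddr if it still holds "expected",
 * tagging this waiter with "bitset".  With a NULL timeout the call
 * blocks indefinitely; if *uaddr != expected it fails with EAGAIN.
 */
long futex_wait_bitset(uint32_t *uaddr, uint32_t expected, uint32_t bitset)
{
    assert(bitset != 0);    /* the kernel rejects an empty bitset */
    return syscall(SYS_futex, uaddr, FUTEX_WAIT_BITSET, expected,
                   NULL, NULL, bitset);
}

/*
 * Hypothetical wrapper: wake at most max_waiters waiters whose bitset
 * overlaps "bitset".  Returns the number of waiters woken.
 */
long futex_wake_bitset(uint32_t *uaddr, int max_waiters, uint32_t bitset)
{
    assert(bitset != 0);
    return syscall(SYS_futex, uaddr, FUTEX_WAKE_BITSET, max_waiters,
                   NULL, NULL, bitset);
}
```

A reader-writer lock might, for example, have readers wait with one bit and writers with another, so that an unlock can wake only the class of waiter it wants; how glibc actually assigns the bits is not covered here.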
Changes visible to kernel developers include:
- A large number of SUNRPC symbols (rpc_* and
rpcauth_*) have been changed to GPL-only exports.
- The x86 architecture merger continues, with quite a few files being
coalesced.
- The "flatmem" and "discontigmem" memory models have been removed on
the 64-bit x86 architecture; "sparsemem" is now used for all builds.
- The x86 spinlock implementation has been replaced with a "ticket
spinlock" mechanism which provides fair FIFO behavior.
- The fastcall function attribute didn't do anything on the x86
architecture, so it has been removed.
- x86 has a new set of functions for easily manipulating page
attributes. They are:
int set_memory_uc(unsigned long addr, int numpages); /* Uncached */
int set_memory_wb(unsigned long addr, int numpages); /* Cached */
int set_memory_x(unsigned long addr, int numpages); /* Executable */
int set_memory_nx(unsigned long addr, int numpages); /* Non-executable */
int set_memory_ro(unsigned long addr, int numpages); /* Read-only */
int set_memory_rw(unsigned long addr, int numpages); /* Read-write */
There is also a set of set_pages_* functions which take a
struct page pointer rather than a beginning address.
- Early-boot debugging of x86 systems via the FireWire port is now
supported.
- Bidirectional command support has been added to the SCSI layer.
- There is a new process state called TASK_KILLABLE. It is a
blocked state similar to TASK_UNINTERRUPTIBLE, with the
difference that a wakeup will happen upon delivery of a fatal signal.
The idea is to allow (almost) uninterruptible sleeps, but to still
allow the process to be killed outright - thus ending the problem of
unkillable processes stuck in the "D" state. There is a new set of
functions for using this state: wait_event_killable(),
schedule_timeout_killable(), mutex_lock_killable(),
etc.
- add_disk_randomness() has been unexported as there are no
more in-tree users.
- pci_enable_device_bars() has been replaced by two
more-specific functions: pci_enable_device_io() and
pci_enable_device_mem().
- The high-resolution timer API has been augmented with:
unsigned long hrtimer_forward_now(struct hrtimer *timer,
ktime_t interval);
It will move the given timer's expiration forward past the current
time as determined by the associated clock.
- The device structure now holds a pointer to a
device_dma_parameters structure:
struct device_dma_parameters {
    unsigned int max_segment_size;
    unsigned long segment_boundary_mask;
};
These parameters are used by the DMA mapping layer (and the IOMMU
mapping code in particular) to ensure that I/O operations are set up
within the device's constraints. The PCI layer supports this feature
with two new functions:
int pci_set_dma_max_seg_size(struct pci_dev *dev, unsigned int size);
int pci_set_dma_seg_boundary(struct pci_dev *dev, unsigned long mask);
Drivers for devices with unusually strict DMA limitations should
probably use these functions to ensure that those restrictions are
respected.
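The effect of hrtimer_forward_now(), mentioned in the list above, is easiest to see as arithmetic. The following is a user-space model only, not kernel code: struct model_timer and model_forward() are hypothetical names, times are plain nanosecond counts rather than ktime_t values, and the current time is passed in explicitly instead of being read from the timer's clock. It assumes that the return value is the number of intervals by which the expiration was advanced.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a timer: just an expiration time in ns. */
struct model_timer {
    int64_t expires_ns;
};

/*
 * Move the expiration forward in whole intervals until it lies past
 * now_ns.  Returns how many intervals were added; zero means that the
 * timer had not yet expired and was left alone.
 */
uint64_t model_forward(struct model_timer *t, int64_t now_ns,
                       int64_t interval_ns)
{
    assert(interval_ns > 0);

    if (now_ns < t->expires_ns)
        return 0;

    /* Intervals fully elapsed since expiration, plus one to get past now */
    uint64_t steps = (uint64_t)((now_ns - t->expires_ns) / interval_ns) + 1;

    t->expires_ns += (int64_t)steps * interval_ns;
    return steps;
}
```

A timer which expired at 1000ns with a 1000ns interval, examined at 3500ns, would be moved to 4000ns in this model, with a return value of three.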
One thing which has not made it into 2.6.25 is the KGDB debugger for
the x86 architecture. Amusingly, a linux.conf.au kernel mini-conf
discussion of "sneaking" KGDB past Linus proceeded for some time before the
participants noticed him standing in the back of the room listening to the
whole thing. His current position is that he won't pull it as part of the
x86 tree, and he's still not much interested in the idea in general.
As of this writing, the merge window is still open and could stay that way
for as much as a week. So more interesting code could still find its way
in through this merge window; stay tuned.
By Jake Edge
February 6, 2008
Performance, or lack thereof, has often been a knock against the
venerable Network File System (NFS), but no real competition has emerged.
NFS also has some serious flaws for programmers and users, with behavior
that is markedly different from that of local filesystems. Both of these
problems are spurring the creation of new network filesystems, two of
which were announced in the last week.
The Coherent Remote File System (CRFS) was introduced last week at
linux.conf.au by Zach Brown of Oracle. It uses BTRFS—pronounced
"butter-f-s"—as its storage on the server, rather than layering atop
any POSIX filesystem as NFS does. According to Brown, BTRFS has a number
of important features that outweigh the inconvenience for users of getting
their data into a BTRFS volume. The biggest is the ability to do compound
operations (creating or unlinking a file for example) in an atomic and
idempotent manner.
CRFS has a userspace daemon (crfsd) that talks to the BTRFS volume as well
as multiple clients. The clients use the kernel VFS caching infrastructure
extensively, and are thus implemented as kernel modules. A user wishing
to access the underlying BTRFS volume on the server must mount it as a
CRFS volume; crfsd must have exclusive access to the BTRFS volume. This
also differs from NFS, which will cooperate with local mounts of the
underlying filesystem.
The basic idea behind CRFS is to have clients cache as much of the
filesystem data as they can while using cache coherency protocols to reduce
the amount of network traffic that gets generated. Clients
keep track of the cache state for each object they have stored, while the
server tracks the cache state of all objects that any client has. The
messages between server and client consist of cache state transitions and
the data being transferred.
Data transfer in both directions is done using CRFS "item ranges". CRFS
uses the BTRFS key scheme to represent objects (file data, directories,
directory entries, inodes, etc.) in the filesystem.
An item range is a contiguous section of the key space, specified by a
minimum and maximum key value as part of the message. When the client is
filling its cache, it can request a particular key but also offer to take
other surrounding keys as part of the response; if the server sees those
keys in the BTRFS leaf node, it can send them along as well.
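Since the CRFS code has not been released, the item-range idea can only be sketched under assumptions. The example below models a key as the (objectid, type, offset) triple used by BTRFS, compared field by field in that order; struct item_key, struct item_range, key_cmp(), and range_contains() are hypothetical names, and the type values used with them are arbitrary. It shows how a server could decide whether a neighboring key found in a leaf falls within the range a client has offered to accept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key, modeled on the BTRFS (objectid, type, offset) triple */
struct item_key {
    uint64_t objectid;
    uint8_t  type;
    uint64_t offset;
};

/* A contiguous section of the key space: min <= key <= max */
struct item_range {
    struct item_key min;
    struct item_key max;
};

/* Order keys by objectid, then type, then offset */
int key_cmp(const struct item_key *a, const struct item_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* True if the key lies inside the (inclusive) range */
bool range_contains(const struct item_range *r, const struct item_key *k)
{
    assert(key_cmp(&r->min, &r->max) <= 0);   /* range must be well-formed */
    return key_cmp(&r->min, k) <= 0 && key_cmp(k, &r->max) <= 0;
}
```

With a range covering every key belonging to one object ID, all of that object's items would pass the check, while the first key of the next object would not.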
For a simple untar, CRFS currently shows something on the order of a 3x
speedup over asynchronous NFS mounts. Comparing against synchronous NFS
mounts (where each write has to actually hit the remote disk) is not
sensible; there is a roughly 10x speed difference between the two types of
NFS mounts. Brown has been working on CRFS for "about a year" and is
planning to release the code eventually. Until that happens, the slides
[PDF] and video [Theora] from his talk, along with a few postings to his
weblog, are the only sources of information about CRFS.
Another filesystem, one that aims to have a broader reach than CRFS, is
the Parallel Optimized Host Message Exchange Layered File System
(POHMELFS), announced in a linux-kernel posting by Evgeniy Polyakov.
POHMELFS is meant to be a building block for a distributed filesystem that
would offer a multi-server architecture and allow for disconnected
filesystem operations. Polyakov has only been working on it for a month,
so it is, at best, the start of a proof of concept.
The POHMELFS vision is in some ways similar to CRFS in that the clients
will handle as much as possible locally, with minimal server interaction.
Like CRFS, client kernel modules talk to a server userspace daemon, using
cache coherency protocols to keep the data and metadata in sync. For CRFS,
the coherency is not yet implemented, but is fleshed out to some extent,
while POHMELFS has quite a bit of fleshing out to do. Unlike CRFS,
POHMELFS supports POSIX filesystems on the server side and the code is
available now.
There are some rather large hurdles to overcome in the POHMELFS vision, not
least of which is handling file IDs in separate client-side filesystems such
that they can be synchronized with the server. The current code implements
a write-through cache version that creates objects on the server before
they are used in the client-side cache. There is also an additional patch
that implements a hack to disable the writeback cache and use only
client-side caching. The latter is, not
surprisingly, very fast, but not terribly usable for multiple mounts of the
filesystem. Essentially Polyakov is showing the benefits of client-side
caching, but in the context of a broader scheme.
It will be a long time, if ever, before we see some descendant of either of
these filesystems in the kernel. There is much work to be done, but they
are worth looking at to see where network and distributed filesystems may
be headed. For them to become useful outside of the Linux world, with
something like the ubiquity of NFS, there would have to be some kind of
standardization followed by adoption by the major players. That will take a
very long time.
By Jonathan Corbet
February 6, 2008
Spinlocks are the lowest-level mutual exclusion mechanism in the Linux
kernel. As such, they have a great deal of influence over the safety and
performance of the kernel, so it is not surprising that a great deal of
optimization effort has gone into the various (architecture-specific)
spinlock implementations. That does not mean that all of the work has been
done, though; a patch merged for 2.6.25 shows that there is always more
which can be done.
On the x86 architecture, in the 2.6.24 kernel, a spinlock is represented by
an integer value. A value of one indicates that the lock is available.
The spin_lock() code works by decrementing the value (in a
system-wide atomic manner), then looking to see whether the result is
zero; if so, the lock has been successfully obtained. Should, instead, the
result of the decrement operation be negative, the spin_lock() code
knows that the lock is owned by somebody else. So it busy-waits ("spins")
in a tight loop until the value of the lock becomes positive; then it goes
back to the beginning and tries again.
Once the critical section has been executed, the owner of the lock releases
it by setting it to 1.
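The algorithm just described can be modeled in portable C. This is a sketch only: the real 2.6.24 code is x86 assembly, while the version below uses C11 atomics, and the old_spin_* names are hypothetical. It follows the same steps, though: decrement atomically, check for zero, and otherwise spin until the value becomes positive before trying again.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: 1 means available, zero or negative means held */
typedef struct {
    atomic_int slock;
} old_spinlock_t;

void old_spin_init(old_spinlock_t *lock)
{
    atomic_init(&lock->slock, 1);
}

/* One attempt: atomically decrement; the lock is ours if the result is 0 */
bool old_spin_try_once(old_spinlock_t *lock)
{
    /* atomic_fetch_sub() returns the value from before the decrement */
    return atomic_fetch_sub(&lock->slock, 1) - 1 == 0;
}

void old_spin_lock(old_spinlock_t *lock)
{
    while (!old_spin_try_once(lock)) {
        /* Somebody else owns the lock: spin until it looks available */
        while (atomic_load(&lock->slock) <= 0)
            ;
    }
}

void old_spin_unlock(old_spinlock_t *lock)
{
    assert(atomic_load(&lock->slock) <= 0);   /* lock must be held */
    atomic_store(&lock->slock, 1);
}
```

Note that nothing in old_spin_lock() records who arrived first; whichever processor wins the race to decrement after the store in old_spin_unlock() becomes the next owner.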
This implementation is very fast, especially in the uncontended case (which
is how things should be most of the time). It also makes it easy to see
how bad the contention for a lock is - the more negative the value of the
lock gets, the more processors are trying to acquire it. But there is one
shortcoming with this approach: it is unfair. Once the lock is released,
the first processor which is able to decrement it will be the new owner.
There is no way to ensure that the processor which has been waiting the
longest gets the lock first; in fact, the processor which just released the
lock may, by virtue of owning that cache line, have an advantage should it
decide to reacquire the lock quickly.
One would hope that spinlock unfairness would not be a problem; usually, if
there is serious contention for locks, that contention is a performance
issue even before fairness is taken into account. Nick Piggin recently
revisited this issue, though, after noticing:
On an 8 core (2 socket) Opteron, spinlock unfairness is extremely
noticable, with a userspace test having a difference of up to 2x
runtime per thread, and some threads are starved or "unfairly"
granted the lock up to 1 000 000 (!) times.
This sort of runtime difference is certainly undesirable. But lock
unfairness can also create latency issues; it is hard to give latency
guarantees when the wait time for a spinlock can be arbitrarily long.
Nick's response was a new spinlock implementation which he calls "ticket
spinlocks." Under the initial version of this patch, a spinlock became a
16-bit quantity, split into two bytes called "next" and "owner".
Each byte can be thought of as a ticket number. If you have ever been to a
store where customers take paper tickets to ensure that they are served in
the order of arrival, you can think of the "next" field as being the number
on the next ticket in the dispenser, while "owner" is the number appearing
in the "now serving" display over the counter.
So, in the new scheme, the value of a lock is initialized (both fields) to
zero. spin_lock() starts by noting the value of the lock, then
incrementing the "next" field - all in a single, atomic operation. If the
value of "next" (before the increment) is equal to "owner," the lock has
been obtained and work can continue. Otherwise the processor will spin,
waiting until "owner" is incremented to the right value. In this scheme,
releasing a lock is a simple matter of incrementing "owner."
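A ticket lock along these lines can be sketched with C11 atomics. The names here (ticket_lock_t, ticket_take(), and so on) are hypothetical, and there is one deliberate simplification: the kernel keeps both fields in a single 16-bit word, while this model uses two separate one-byte atomic variables, which is enough to show the ordering behavior.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of the two one-byte ticket fields */
typedef struct {
    _Atomic uint8_t next;    /* next ticket in the dispenser */
    _Atomic uint8_t owner;   /* the "now serving" display */
} ticket_lock_t;

void ticket_init(ticket_lock_t *lock)
{
    atomic_init(&lock->next, 0);
    atomic_init(&lock->owner, 0);
}

/* Take a ticket: note the value of "next" and increment it atomically */
uint8_t ticket_take(ticket_lock_t *lock)
{
    return atomic_fetch_add(&lock->next, 1);
}

void ticket_lock(ticket_lock_t *lock)
{
    uint8_t mine = ticket_take(lock);

    /* Spin until "owner" reaches our ticket number */
    while (atomic_load(&lock->owner) != mine)
        ;
}

void ticket_unlock(ticket_lock_t *lock)
{
    /* If the fields are equal, nobody holds the lock */
    assert(atomic_load(&lock->owner) != atomic_load(&lock->next));
    atomic_fetch_add(&lock->owner, 1);
}
```

Because each waiter spins on its own ticket number, an unlock hands the lock to exactly one processor: the one which has been waiting the longest.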
The implementation described above does have one small disadvantage in that
it limits the number of processors to 256 - any more than that, and a
heavily-contended lock could lead to multiple processors thinking they had
the same ticket number. Needless to say, the resulting potential for
mayhem is not something which can be tolerated. But the 256-processor
limit is an unwelcome constraint for those working on large systems, which
already have rather more processors than that. So the add-on "big
ticket" patch - also merged for 2.6.25 - uses 16-bit values when the
configured maximum number of processors exceeds 256. That raises the
maximum system size to 65536 processors - who could ever want more than
that?
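The limit follows directly from the width of the ticket counter, as a few lines of C can show. In this sketch the function names are hypothetical, and arrivals at a contended lock are simply numbered from zero; the ticket each arrival receives is that number truncated to the width of the field.

```c
#include <assert.h>
#include <stdint.h>

/* Ticket given to the n-th arrival when tickets are 8 or 16 bits wide */
uint8_t ticket8(uint32_t arrival)
{
    return (uint8_t)arrival;     /* wraps around after 256 tickets */
}

uint16_t ticket16(uint32_t arrival)
{
    return (uint16_t)arrival;    /* wraps around after 65536 tickets */
}

/* Nonzero if two different arrivals end up with the same ticket number */
int tickets_collide8(uint32_t a, uint32_t b)
{
    assert(a != b);
    return ticket8(a) == ticket8(b);
}

int tickets_collide16(uint32_t a, uint32_t b)
{
    assert(a != b);
    return ticket16(a) == ticket16(b);
}
```

If the current owner holds ticket 0 and 256 more processors queue up behind it, the last of them also draws ticket 0 from an 8-bit counter and will conclude that it owns the lock. With 16-bit tickets the same collision requires 65536 waiters.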
With the older spinlock implementation, all processors contending for a
lock fought to see who could grab it first. Now they wait nicely in line
and grab the lock in the order of arrival. Multi-thread run times even
out, and maximum latencies are reduced (and, more to the point, made
deterministic). There is a slight cost to the new implementation, says
Nick, but that gets very small on contemporary processors and is
essentially zero relative to the cost of a cache miss - which is a common
event when dealing with contended locks. The x86 maintainers clearly
thought that the benefits of eliminating the unseemly scramble for
spinlocks exceeded this small cost; it seems unlikely that others will disagree.
Patches and updates
Virtualization and containers
- sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org: Devpts namespace.
(February 6, 2008)
Miscellaneous
- Eric Leblond: ulogd.
(February 5, 2008)
Page editor: Jake Edge