Brief items
The current 2.6 prepatch remains 2.6.24-rc8; no new prepatches have
been released over the last week. Around 100 patches have gone into the
mainline repository since -rc8 was released. Your editor expects the final
2.6.24 release just before everybody heads off to linux.conf.au.
The current -mm tree is 2.6.24-rc8-mm1. Andrew has been
expressing some frustration with the process of bringing together -mm
patches:
"The volume of rejects and build errors which are caused by
subsystem maintainers fiddling with other people's stuff is quite
out of control. Something needs to happen here."
What has happened for the moment is that a lot of git trees have been
dropped from this release. Other changes include asynchronous crypto
support in the device mapper, a number of Chinese translations of core
kernel documents, a lot of IDE updates, and a Sony memory stick driver.
For older kernels: 2.6.16.59 was released with
about a dozen fixes on January 19. 2.6.16.60-rc1 (January 22)
starts the next cycle with several more fixes.
Kernel development news
As my daughter would say: that patch fell out of the ugly tree, and hit
every branch on the way down. Very impressive.
--
Linus Torvalds (for the curious, here is
the referenced patch)
These things are all _soo_ much simpler than all the issues you
have to do in the kernel, so this is just a complete toy compared
to all the things we do inside Linux to do the same thing with
pluggable hashes on a per-path-component basis etc.
(User space developers are weenies. One of the most fun parts of
git development for me has been how easy everything is ;)
--
Linus Torvalds (thanks to Nicholas
Pitre)
One thing the kernel never faced was fifteen years of fundamental
stagnation with a wealth of kludge-arounds piled on top.
--
Keith Packard
Are you saying that this linux can run on a computer without
windows underneath it, at all ? As in, without a boot disk, without
any drivers, and without any services ?
That sounds preposterous to me.
--
"jerryleecooper"
By Jonathan Corbet
January 23, 2008
Last week's Kernel Page may
have been filesystem-heavy, but there was still a big omission, in the form
of ext4. But ext4, being the successor to ext3, may well be the filesystem
many of us are using a few years from now. Things have been relatively
quiet on that front - at least, outside of the relevant mailing lists - but
the ext4 developers have not been idle. Some of their work has now come to
the surface with Ted Ts'o's posting of the
ext4 merge plans for 2.6.25.
One of the changes going into ext4 is a lifting of the longstanding 4KB
block size limit. That does not mean that just any block size works, though,
and this feature will benefit fewer people than one might think, for one
specific reason: the block size must still be no larger than the page size
on the host system. So those of us running x86 systems with 4KB pages will
be stuck with 4KB blocks still. And, on any system, the maximum block size
is now 64KB.
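The constraint described above can be written down as a small check. What follows is only a user-space sketch, not the kernel's code: the name block_size_ok() is hypothetical, and the 1KB lower bound is an assumption borrowed from ext2/ext3 tradition rather than something stated here.

```c
#include <stdbool.h>
#include <stddef.h>

#define MIN_BLOCK_SIZE 1024u          /* assumed lower bound (ext2/ext3 tradition) */
#define MAX_BLOCK_SIZE (64u * 1024u)  /* the 64KB ceiling mentioned in the text */

/* Hypothetical helper: can a filesystem with this block size be used
 * on a host with this page size? */
static bool block_size_ok(size_t block_size, size_t page_size)
{
    /* Block sizes are powers of two. */
    if (block_size == 0 || (block_size & (block_size - 1)) != 0)
        return false;
    if (block_size < MIN_BLOCK_SIZE || block_size > MAX_BLOCK_SIZE)
        return false;
    /* The block must be no larger than one page on the host. */
    return block_size <= page_size;
}
```

Under these assumptions, a 64KB block is acceptable on a system with 64KB pages, while an 8KB block is rejected on an x86 system with 4KB pages.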
One amusing effect of this change is that the size of a directory entry can
now be as large as 64KB as well. But the field which holds the size of
directory entries is only 16 bits wide. So a special hack has been
employed to recognize 64KB directory entries and keep everything
consistent.
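One way such a hack could work is to reserve a single 16-bit value as a stand-in for 65536. The sketch below assumes exactly that; the names (MAX_REC_LEN, rec_len_to_disk(), rec_len_from_disk()) are illustrative, byte order is ignored, and nothing here should be read as the code in the actual patch. The choice of 0xFFFF as the sentinel relies on the assumption that real entry lengths are multiples of four, so that value can never be a legitimate length.

```c
#include <stdint.h>

#define MAX_REC_LEN 0xFFFFu     /* hypothetical sentinel: largest 16-bit value */
#define FULL_BLOCK  (1u << 16)  /* 65536 bytes, one more than 16 bits can hold */

/* Encode an in-memory entry length for a 16-bit on-disk field.
 * Returns 0 on success, -1 if the length cannot be represented. */
static int rec_len_to_disk(unsigned len, uint16_t *out)
{
    if (len == FULL_BLOCK) {
        *out = MAX_REC_LEN;     /* the special case: a 64KB entry */
        return 0;
    }
    if (len > FULL_BLOCK || len == MAX_REC_LEN)
        return -1;              /* too large, or collides with the sentinel */
    *out = (uint16_t)len;
    return 0;
}

/* Decode the 16-bit on-disk value back into a real length. */
static unsigned rec_len_from_disk(uint16_t dlen)
{
    return dlen == MAX_REC_LEN ? FULL_BLOCK : dlen;
}
```

Every other length passes through unchanged, so existing directories would not be affected by a scheme of this kind.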
Some internal variables have overflow problems as well. Block numbers are
stored as a signed, 32-bit quantity, and so are block group numbers. That
limits the maximum size of a filesystem to a mere 256PB. In 2.6.25, these values will
become unsigned long variables, eliminating that intolerably low limit.
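The derivation of the 256PB figure is not given here, but one set of assumptions that reproduces it is: a signed 32-bit block group number (so at most 2^31 groups), 4KB blocks, and groups sized by a single bitmap block (one bit per block, so 8 × 4096 = 32768 blocks per group). The helper name below is hypothetical.

```c
#include <stdint.h>

/* A signed 32-bit counter can number groups 0 .. 2^31 - 1. */
#define MAX_GROUPS (1ULL << 31)

/* Hypothetical helper: filesystem size limit implied by the group count. */
static uint64_t max_fs_bytes(uint64_t block_size)
{
    /* Assumption: one bitmap block per group, one bit per block. */
    uint64_t blocks_per_group = 8 * block_size;
    return MAX_GROUPS * blocks_per_group * block_size;
}
```

With 4KB blocks that works out to 2^31 × 2^15 × 2^12 = 2^58 bytes, which is 256 binary petabytes.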
Through some trickery, the inode field which stores the number of blocks
associated with a file will be expanded to 48 bits, raising the
maximum size of an individual file to just under 2^48 512-byte
blocks.
The work does not stop there, though: another patch redefines that field
to mean the number of filesystem blocks (instead of 512-byte sectors) used
by the file. This is a change which has to be handled carefully, since it
is an on-disk format change which could create trouble for people with
existing ext4 filesystems. Everybody who is using ext4 should certainly be
doing so with the knowledge that it's a development filesystem and is only
suitable for storing files which are not valuable for more than about
30 minutes - Rawhide OpenOffice.org updates, say. But it still would be
nice to not trash every existing ext4 filesystem out there. So the
i_blocks field will continue, by default, to hold the number of
512-byte blocks. But, if that field exceeds 32 bits and forces the use of
48-bit numbers, it is thereafter interpreted as filesystem blocks. Since
no existing filesystems are yet using 48-bit numbers, this approach
successfully avoids breaking them.
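The dual interpretation of the block count can be illustrated with a few lines of C. This is a sketch of the rule as described above, not the kernel's implementation; the function name and the uses_48bit argument (standing in for whatever on-disk marker records that the count has outgrown 32 bits) are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Hypothetical helper: bytes accounted for by an inode's stored block count. */
static uint64_t i_blocks_to_bytes(uint64_t stored, bool uses_48bit,
                                  uint32_t fs_block_size)
{
    if (uses_48bit)
        return stored * fs_block_size;  /* new meaning: filesystem blocks */
    return stored * SECTOR_SIZE;        /* old meaning: 512-byte units */
}
```

A 48-bit count of 512-byte units tops out just under 2^57 bytes (128 binary petabytes); counting 4KB filesystem blocks instead raises that to just under 2^60 bytes.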
Journal checksums are another feature arriving for 2.6.25. If the system
crashes, the journal is used to recover any transactions which were
committed, but which did not actually make it to disk. It sure would
be nice to know that the journal, as stored in the filesystem, is intact
before using it to make changes elsewhere.
The checksum enables the filesystem to ensure that the journal is good and
avoid (further) corrupting the filesystem if it is not. An interesting
side benefit is that the checksum loosens the constraints on how the
journal is written to disk, since an incompletely-written journal will now
be detected; that should help to improve filesystem performance slightly.
Note that full data checksumming is still not on the agenda for ext4. But
checksumming the journal is a good (if small) step in the right direction.
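The idea of checking the journal before replaying it can be sketched in user space. The code below uses a plain bitwise CRC-32 as the checksum; which algorithm the journal patch uses is not stated here, so that choice, along with the names crc32_buf() and transaction_intact(), is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_buf(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical recovery-time check: replay a transaction only if the
 * checksum recorded at commit time matches the data found in the journal. */
static bool transaction_intact(const unsigned char *data, size_t len,
                               uint32_t stored_sum)
{
    return crc32_buf(data, len) == stored_sum;
}
```

A transaction whose blocks only partially reached the disk would fail this comparison and be skipped, which is what allows the ordering constraints on journal writes to be relaxed.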
Another patch changes the VFS API by turning the i_version
field of the inode structure into an unsigned, 64-bit value on all
architectures. This version number is incremented when the file is
changed, and it's stored (split into two fields) in the on-disk inode.
64-bit version numbers are required by NFSv4, which uses them to provide
the dreaded "stale file handle" error when things change.
There is a new ioctl() (EXT4_IOC_MIGRATE) which can be
used to explicitly request that the on-disk inode for a file be converted
to the ext4 format.
The ext4 filesystem is extent-based, and has been for some time.
"Extent-based" means that it tracks block allocations by extents (first
block, number of blocks) rather than storing pointers to each individual
block, as is done in ext3. There are a number of performance benefits to
doing things this way, especially for larger files. Those benefits
disappear, though, if a file's blocks cannot be grouped into the smallest
number of extents possible.
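To make the difference concrete, here is a toy extent map and lookup routine. The structure layout is an assumption for illustration (it adds a logical starting offset to the "first block, number of blocks" pair so that a file can have several extents), and it does not reflect the ext4 on-disk format.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative extent: a run of contiguous blocks. */
struct extent {
    uint64_t logical;   /* first file block covered by this extent */
    uint64_t physical;  /* first on-disk block of the run */
    uint32_t len;       /* number of contiguous blocks */
};

/* Map a file block to its on-disk block; UINT64_MAX means "not mapped". */
static uint64_t extent_lookup(const struct extent *ext, size_t n,
                              uint64_t file_block)
{
    for (size_t i = 0; i < n; i++) {
        if (file_block >= ext[i].logical &&
            file_block - ext[i].logical < ext[i].len)
            return ext[i].physical + (file_block - ext[i].logical);
    }
    return UINT64_MAX;
}
```

A 320-block file laid out in two runs needs only two of these records, where a per-block scheme would need 320 pointers; a badly fragmented file, however, degenerates toward one extent per block.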
One technique which greatly helps in optimizing block allocations for files
is to allocate them in relatively large groups, rather than individually.
In 2.6.25, ext4 will contain the multi-block allocator, which does exactly
that. One might think that allocating a few blocks at a time would not be
that big of a change, but the multi-block allocator is by far the most
complex patch in the set. A lot of effort and heuristics go into deciding
how many blocks to allocate, finding the optimal set of blocks, tracking
the allocation, recovering blocks which end up never being used, ensuring
that an application cannot read pre-allocated (but unwritten) blocks in
search of leaked secrets, etc. It is quite a bit of code, but it is worth
the trouble; multi-block allocation will be enabled by default in 2.6.25.
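The core operation, finding a run of free blocks rather than a single one, can be shown with a deliberately naive bitmap scan. This toy has none of the heuristics described above and is in no way the multi-block allocator itself; the function name and the convention that a zero bit means "free" are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Find the first run of `want` free (zero) bits in a block bitmap.
 * Returns the index of the first block in the run, or -1 if none exists. */
static long find_free_run(const unsigned char *bitmap, size_t nbits,
                          size_t want)
{
    size_t run = 0;

    if (want == 0)
        return -1;
    for (size_t i = 0; i < nbits; i++) {
        bool used = (bitmap[i / 8] & (1u << (i % 8))) != 0;

        run = used ? 0 : run + 1;   /* a used block resets the run */
        if (run == want)
            return (long)(i + 1 - want);
    }
    return -1;
}
```

Asking for eight blocks at once either yields one contiguous run (and thus one extent) or fails, at which point a real allocator has to decide whether to settle for less.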
As noted above, a number of these patches force changes to the on-disk data
structure. According to Ted, though, these should be the last on-disk
changes for ext4. There are some features which still will not have been
merged when 2.6.25 comes around - delayed allocation and online
defragmentation among them - but they should not require format changes.
So ext4 is getting closer to the point where it is considered ready for
production use.
It is not at that point yet, though, and people who use it are still
doing so at their own risk. To help drive that point home, Ted has
proposed a new mount flag
(called test_fs) which communicates to the kernel the user's
understanding that they are about to mount a developmental filesystem and
will not go filing lawsuits if things go wrong. In the absence of this
mount option, an ext4 filesystem will refuse to mount. One might think
that child-proofing the filesystem in this way would not be necessary, but
some extra care in this area can only be a good thing. Filesystem-related
surprises are rarely welcome.
By Jake Edge
January 23, 2008
Stuttering audio or an unresponsive desktop – typically caused by
operating system latency – are two things that annoy
users. They can be difficult problems to diagnose, though, as they are
transient
and buried deep inside the kernel. A new tool, LatencyTOP, seeks to provide more
information on where latency is occurring so that it can be fixed or avoided.
Latency is the measure of how much time elapses between when an action is
initiated and when its effects become visible. If a user clicks the mouse
button in an application, the latency is the amount of time between that
click and when the associated action begins. There are lots of different
reasons for
latency, some of which are outside of Linux's control; being able
to measure what latency the OS is contributing will be very useful.
LatencyTOP is reporting on a specific subset of latency causes, as described
in the announcement:
"There are many types and causes of latency, and LatencyTOP [focuses on the]
type that causes audio skipping and desktop stutters. Specifically, LatencyTOP
focuses on the cases where the applications want to run and execute useful
code, but there's some resource that's not currently available (and the
kernel then blocks the process). This is done both on a system level and
on a per process level, so that you can see what's happening to the system,
and which process is suffering and/or causing the delays."
LatencyTOP measures the average and maximum amount of latency in various
operations by inserting annotation calls in the kernel. An example from
the announcement is instructive:
     asmlinkage long sys_sync(void)
     {
    +	struct latency_entry reason;
    +	set_latency_reason("sync system call", &reason);
     	do_sync(1);
    +	restore_latency_reason(&reason);
    +
     	return 0;
     }
The scheduler accumulates any time spent sleeping, between the
set_latency_reason() and
restore_latency_reason() calls,
charging it to the "sync system call". Any lower level calls to set the
latency reason will be ignored in this code path – they may be useful
in other code paths – as it is the highest level active reason that
gets charged.
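The "highest level wins" rule can be modeled in a few lines of user-space C. The function names match those in the example above, but the bodies, the layout of struct latency_entry, and the single global current_reason (which would be per-task state in a kernel) are assumptions made for illustration, not the code from the patch.

```c
#include <stddef.h>

/* Hypothetical layout: remembers what was active before this annotation. */
struct latency_entry {
    const char *saved;
};

static const char *current_reason;  /* would be per-task in a real kernel */

static void set_latency_reason(const char *reason, struct latency_entry *entry)
{
    entry->saved = current_reason;
    if (current_reason == NULL)     /* only the outermost annotation takes effect */
        current_reason = reason;
}

static void restore_latency_reason(struct latency_entry *entry)
{
    current_reason = entry->saved;
}
```

In this model, a nested call made while "sync system call" is active leaves that reason in place, and the reason is cleared only when the outermost restore runs.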
The current interface for annotating is likely to change, though the
semantics will stay the same. Comments on the
original submission suggested using the kernel markers feature that was
merged for 2.6.24. LatencyTOP developer Arjan van de Ven seems amenable to
that; reusing a kernel interface, rather than adding a new one, is
generally the right choice. There is other work to do as well; the patch
was submitted for other kernel hackers to test and comment on, not to be
merged into the mainline.
LatencyTOP comes with a userspace application, shown in the accompanying
screenshot, that displays the information gathered. It reads from the
/proc/latency_stats file that is created by the LatencyTOP infrastructure patch,
so long as CONFIG_LATENCYTOP is enabled in the kernel. The upper pane shows
the nine largest latencies over the past 30 seconds; ten were apparently
intended, so this looks like an off-by-one error in the code.
A list of process names runs along the bottom of the display; a process can be
selected with the arrow keys. The latency sources for
that process will then be shown in the lower pane. The screenshot
shows the tool with the
firefox process selected. As can be seen, there are still lots of areas
that need annotations – "Unknown reason" along with the wait channel are
displayed when the reason has not been set. When narrowing a problem down,
it should be straightforward for a kernel hacker to add annotations to the
appropriate locations.
LatencyTOP, like its sibling PowerTOP –
also developed by van de Ven at the Intel Open Source Technology Center
– is a powerful tool for trying to track down system problems. It
will probably undergo some changes along the way: the userspace
application is still rather rudimentary and the kernel data collection
needs finer-grained locking. But, before too long, a mainstream tool
to measure system latency based on this work should appear.
By Jonathan Corbet
January 23, 2008
Virtualized guests running under Linux like to think that they are doing
their own memory management. The truth of the matter, though, is that the
host system cannot allow guests to directly modify the page tables used by
the hardware; allowing that sort of access would compromise the security of
the host. So, somehow, the host must be involved in the guest's memory
management. One common technique is through the use of shadow page
tables. Guest systems maintain their own page tables, but they are not the
tables used by the memory management unit. Instead, whenever the guest
makes a change to its tables, the host system intercepts the operation,
checks it for validity, then mirrors the change in the real page tables,
which "shadow" those maintained by the guest.
One problem with this technique, as implemented in Linux currently, is that
there is no easy way for the host to feed page table changes back to the
guest. In particular, if the host system decides that it wants to push a
given page out to swap, it can't tell the guest that the page is no longer
resident. So virtualization mechanisms like KVM avoid the problem
altogether by pinning pages in memory
when they are mapped in shadow page tables. That solves the problem, but
it makes it impossible to swap processes running KVM-based virtual machines out of main
memory.
This seems like a good thing to fix. And a fix exists, in the form of the
MMU notifiers patch posted by
Andrea Arcangeli (from his shiny new Qumranet address). This patch allows
an interested subsystem to be notified whenever specific memory management
events take place. The process starts by setting up a set of callbacks:
    struct mmu_notifier_ops {
	void (*release)(struct mmu_notifier *mn,
			struct mm_struct *mm);
	int (*age_page)(struct mmu_notifier *mn,
			struct mm_struct *mm,
			unsigned long address);
	void (*invalidate_page)(struct mmu_notifier *mn,
				struct mm_struct *mm,
				unsigned long address);
	void (*invalidate_range)(struct mmu_notifier *mn,
				 struct mm_struct *mm,
				 unsigned long start, unsigned long end);
    };
These callbacks are bundled into an mmu_notifier structure:
    struct mmu_notifier {
	struct hlist_node hlist;
	const struct mmu_notifier_ops *ops;
    };
The interested code then registers its notifier with:
    void mmu_notifier_register(struct mmu_notifier *mn,
			       struct mm_struct *mm);
Here, mm is the mm_struct structure associated with a
given address space. It is not expected that anybody will be interested in
all memory management events, so notifiers are associated with
specific address spaces. Once the notifier is in place, the callbacks will
be invoked when interesting things happen:
- release() is called when the relevant mm_struct
is about to go away. So it will be the last callback made to that
notifier.
- age_page() indicates that the memory management subsystem
wants to clear the "referenced" flag on the page associated with the
given address. This callback should return the previous
value of the referenced bit, or the closest approximation available on
the host architecture.
- invalidate_page() and invalidate_range() are both
ways of telling the guest that the given address(es) are no longer
valid - the page has been reclaimed. Upon return from this callback,
the affected address range should not be referenced by the guest.
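How a client might hook into these callbacks can be mocked up in user space. Everything below is a stand-in: struct mm_struct is left opaque, only invalidate_page() is modeled, the hlist linkage is omitted, and struct shadow_mmu with its tiny table of mapped addresses is a hypothetical client, not KVM's code.

```c
#include <stddef.h>

struct mm_struct;                   /* opaque stand-in for the kernel type */
struct mmu_notifier;

struct mmu_notifier_ops {
    void (*invalidate_page)(struct mmu_notifier *mn, struct mm_struct *mm,
                            unsigned long address);
};

struct mmu_notifier {
    const struct mmu_notifier_ops *ops;
};

/* Hypothetical client: a tiny "shadow table" of mapped addresses. */
#define SHADOW_SLOTS 4
struct shadow_mmu {
    struct mmu_notifier mn;               /* first member, see shadow_of() */
    unsigned long mapped[SHADOW_SLOTS];   /* 0 means the slot is empty */
};

static struct shadow_mmu *shadow_of(struct mmu_notifier *mn)
{
    /* Valid because mn is the first member of struct shadow_mmu. */
    return (struct shadow_mmu *)mn;
}

/* Drop any shadow mapping for an address the host has reclaimed. */
static void shadow_invalidate_page(struct mmu_notifier *mn,
                                   struct mm_struct *mm,
                                   unsigned long address)
{
    struct shadow_mmu *s = shadow_of(mn);

    (void)mm;
    for (size_t i = 0; i < SHADOW_SLOTS; i++)
        if (s->mapped[i] == address)
            s->mapped[i] = 0;
}

static const struct mmu_notifier_ops shadow_ops = {
    .invalidate_page = shadow_invalidate_page,
};
```

The pattern of embedding the notifier inside a larger private structure is what lets a callback recover its own state from the mmu_notifier pointer it is handed.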
For the curious, the KVM patches
(showing how these notifiers are used there) have also been posted.
While this patch set is aimed at KVM, there has been some interest from
other directions as well - virtual machines are not the only places where
separate (but related) page tables are maintained. Graphics processing
units on contemporary video cards are an example - they have their own
memory management units and have some interesting management issues of their own.
Remote DMA (RDMA) engines are another possible user. So these patches have
attracted comments from a few potential users, and have changed
significantly since their first posting. The discussion is still ongoing,
so further changes may come about before the notifiers find their way into
the mainline.
Page editor: Jonathan Corbet