By Jonathan Corbet
August 12, 2009
Tux3. The once-noisy
Tux3 development community has
gone rather quiet in recent months. An inquiry into the status of the
project led to one of last week's
quotes of the week, wherein
developer Daniel Phillips pled a lack of time and expressed regrets at not
having merged the code into the mainline months ago. When asked (by Ted
Ts'o) for a description of what makes Tux3 interesting, Daniel
responded this way:
"I think Tux3 fills an empty niche in our filesystem ecology where
a simple, clean and modern general purpose filesystem should exist
and there is none. In concrete terms, Tux3 implements a
single-pointer-per-extent model that Btrfs and ZFS do not. This
allows a very simple *physical* design, with much complexity
pushed to the *logical* level where things generally behave
better. A simple physical design offers many benefits, including
making it easier to take a run at that holiest of holy grails,
online check and repair."
What Tux3 needs, it seems, is some new development energy. It could be an
interesting project for developers who want to get started in
filesystem development.
Resource counters. The resource
counter mechanism is built into control groups; it is intended for use
by tools like the memory use controller. These counters contain, at their
core, a (believe it or not) counter value which tracks the current usage of
a resource by a given control group. This counter has run into the same
problem which afflicts any frequently-changed global variable: it scales
poorly due to cache line bouncing. The usage of some resources (pages of
memory, for example) can change frequently, causing the associated counter
to be a drag on the system as a whole.
Balbir Singh's scalable resource counters
patch aims to fix that situation. With this patch, the single "usage"
counter becomes an array of per-CPU counters. Since each processor works
with its own copy of the counter, there is no more cache line bouncing and
things run faster. The downside is that the count becomes approximate.
The per-CPU counters are summed occasionally to keep everything roughly in
sync, but keeping exact counts would take away much of the scalability that
this patch was meant to provide. The good news is that exact counts are
not really needed anyway; as long as the counter reflects something close
enough to reality, the system will work essentially as it did before, only
a little more quickly.
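The tradeoff between exact and approximate counts can be illustrated with a small user-space sketch. This is not the code from the patch: the names (approx_counter, DEMO_NR_CPUS, the batch threshold) are hypothetical, the "CPUs" are just array slots, and there is no locking or real per-CPU storage. It assumes one plausible policy, folding a local delta into the shared total once it grows past a threshold, as a stand-in for the occasional summing described above.

```c
#include <stdlib.h>

#define DEMO_NR_CPUS 4  /* hypothetical CPU count for this sketch */

/* Hypothetical approximate counter; not the kernel's res_counter. */
struct approx_counter {
    long global;               /* shared total, updated only on a fold */
    long local[DEMO_NR_CPUS];  /* per-CPU deltas not yet folded in */
    long batch;                /* fold once |local delta| reaches this */
};

void approx_counter_init(struct approx_counter *c, long batch)
{
    c->global = 0;
    c->batch = batch;
    for (int i = 0; i < DEMO_NR_CPUS; i++)
        c->local[i] = 0;
}

/* Fast path: touch only this CPU's slot unless the delta has grown large. */
void approx_counter_add(struct approx_counter *c, int cpu, long delta)
{
    c->local[cpu] += delta;
    if (labs(c->local[cpu]) >= c->batch) {
        c->global += c->local[cpu];  /* the only write to shared state */
        c->local[cpu] = 0;
    }
}

/* Cheap read: may lag reality by less than batch per CPU. */
long approx_counter_read(const struct approx_counter *c)
{
    return c->global;
}

/* Expensive read: walks every slot to produce the exact value. */
long approx_counter_sum(const struct approx_counter *c)
{
    long sum = c->global;

    for (int i = 0; i < DEMO_NR_CPUS; i++)
        sum += c->local[i];
    return sum;
}
```

With a batch of 4, three single-unit charges on one CPU leave the cheap read at zero while the exact sum reports 3; the fourth charge folds the delta into the shared total. The error of the cheap read is bounded by the batch size times the number of CPUs, which is the sense in which the count stays "close enough to reality."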
Inline spinlocks. Once upon a time, spinlocks were implemented with
a series of inline functions, on the notion that such a
performance-critical primitive would need to be as fast as possible. That
changed in 2004, when
spinlocks were turned into normal functions. The function call overhead
hurt a bit, but moving spinlocks out-of-line made the kernel considerably
smaller, which has performance benefits of its own. And that's how
spinlocks have been ever since.
The pendulum may be about to swing the other way again, though, at least
for the S390 architecture. Heiko Carstens noted that function calls on
this architecture are quite expensive. He put together an inline spinlocks patch and
measured performance improvements of 1-5%. He would now like to merge this
patch into the mainline, along with a configuration option allowing each
architecture to choose the best way to implement spinlocks. So far, there
has been little commentary for or against this idea.
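The idea of a build-time switch between inline and out-of-line lock functions can be sketched in user space as follows. This is only an illustration, not the patch itself: the DEMO_INLINE_SPINLOCKS macro and the demo_spin_* names are hypothetical, the lock is a bare C11 atomic_flag rather than a kernel spinlock, and the noinline attribute assumes GCC or Clang.

```c
#include <stdatomic.h>

/*
 * Hypothetical build-time switch: define DEMO_INLINE_SPINLOCKS to ask the
 * compiler to inline the lock operations; otherwise force real function
 * calls (GCC/Clang attribute).
 */
#ifdef DEMO_INLINE_SPINLOCKS
#define DEMO_LOCK_LINKAGE static inline
#else
#define DEMO_LOCK_LINKAGE static __attribute__((noinline))
#endif

struct demo_spinlock {
    atomic_flag locked;
};

/* Spin until the flag is acquired. */
DEMO_LOCK_LINKAGE void demo_spin_lock(struct demo_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;  /* busy-wait */
}

/* Release the lock so another thread can take it. */
DEMO_LOCK_LINKAGE void demo_spin_unlock(struct demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Try once; return 1 if the lock was taken, 0 if it was already held. */
DEMO_LOCK_LINKAGE int demo_spin_trylock(struct demo_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}
```

Callers are identical in both configurations; only the generated code differs. Inlining removes the call overhead at every lock site at the cost of duplicating the lock body there, which is the size-versus-speed tradeoff at issue.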
Const seq_operations. James Morris has posted a patch making seq_operations structures
constant throughout the kernel. These structures are almost always
populated at compile time and never need to change; allowing the function
pointers therein to be overwritten can only be useful to those who would
like to subvert the kernel. A number of core VFS operations structures
have been made const over the years, but seq_operations
has not been addressed until now. James says: "This is derived from
the grsecurity patch, although generated
from scratch because it's simpler than extracting the changes
from there."
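The effect of constifying an operations structure can be seen in a small user-space analogue. The demo_seq_ops structure below is a simplified, hypothetical stand-in for the kernel's seq_operations (the callback signatures are not the kernel's), iterating over a fixed array. Because the table is declared const, a toolchain can place it in read-only memory, and any assignment to one of its function pointers is rejected at compile time.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified analogue of an iterator operations table. */
struct demo_seq_ops {
    const void *(*start)(size_t *pos);
    const void *(*next)(const void *v, size_t *pos);
    void (*stop)(const void *v);
    int (*show)(char *buf, size_t len, const void *v);
};

static const int demo_items[] = { 10, 20, 30 };
#define DEMO_NR_ITEMS (sizeof(demo_items) / sizeof(demo_items[0]))

static const void *demo_start(size_t *pos)
{
    return *pos < DEMO_NR_ITEMS ? &demo_items[*pos] : NULL;
}

static const void *demo_next(const void *v, size_t *pos)
{
    (void)v;
    ++*pos;
    return demo_start(pos);
}

static void demo_stop(const void *v)
{
    (void)v;  /* nothing to release in this sketch */
}

static int demo_show(char *buf, size_t len, const void *v)
{
    return snprintf(buf, len, "%d\n", *(const int *)v);
}

/* Populated at compile time; "demo_ops.show = ..." would not compile. */
static const struct demo_seq_ops demo_ops = {
    .start = demo_start,
    .next  = demo_next,
    .stop  = demo_stop,
    .show  = demo_show,
};

/* Walk the sequence through a const pointer; returns items shown. */
int demo_walk(const struct demo_seq_ops *ops, char *out, size_t len)
{
    size_t pos = 0, used = 0;
    int shown = 0;
    const void *v;

    if (len == 0)
        return 0;
    out[0] = '\0';
    for (v = ops->start(&pos); v != NULL; v = ops->next(v, &pos)) {
        int n = ops->show(out + used, len - used, v);

        if (n < 0 || (size_t)n >= len - used)
            break;  /* output buffer full */
        used += (size_t)n;
        shown++;
    }
    ops->stop(v);
    return shown;
}
```

Consumers such as demo_walk() only ever need a pointer-to-const, so making the table itself const costs nothing for legitimate users while closing off one avenue for overwriting function pointers at run time.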
data=guarded. Back in the middle of the discussion of crash robustness
and latency in the ext3 filesystem, Chris Mason came forward with a
proposal for a data=guarded
mode, which would delay metadata updates when files change size to
prevent the disclosure of unrelated information. Since then, the
data=guarded patch has disappeared from view. In response to a query from
Frans Pop, Chris confirmed that he is still
working on that code, and that he plans to get it merged for 2.6.32.
Among those welcoming the news was Andi Kleen, who remarked: "data=writeback already cost
me a few files after crashes here." The data=guarded mode may not
help with that particular problem, though: it is really meant to combine
the security benefits of data=ordered (not disclosing random data, in
particular) with the performance benefits of data=writeback. The worst
data-loss problems should have already been addressed by the robustness
fixes that went into ext3 for 2.6.30.