Brief items
The current 2.6 development prepatch is 2.6.31-rc1,
released on June 24. The
2.6.31 merge window is now closed, and the stabilization phase has
begun. As always, see the long-format changelog for the details.
There have been no stable updates over the last week. The
2.6.27.26, 2.6.29.6, and 2.6.30.1 updates are in the review process,
though, with an expected release on or after July 3. The 2.6.30.1
update contains over 100 patches.
Kernel development news
I have a special mix of crack that helps me see Patterns
everywhere, even in C code. Some patterns are bright, shiny, and
elegant. Others are muddy and confused. struct request_queue has
a distinct shadow over it just now.
--
Neil Brown
The whole VM is designed around the notion that most of memory is
just clean caches, and it's designed around that simply because if
it's not true, the VM freedom is so small that there's not a lot a
VM can reasonably do.
--
Linus Torvalds
I'd be quite surprised if they deliberately changed their VFAT code
to break Linux with this patch. I'd say it is more likely that once
Linux kernels with this change are in widespread use that Microsoft
will start to test any changes in their VFAT filesystem to make
sure it works with Linux with this patch.
--
Andrew Tridgell
Perhaps we should require that the kernel developers and mainstream
distribution maintainers all run Ardour for three weeks and attempt
at least two multitrack/multichannel recordings. At least by then
they'd maybe have a better notion of what defines a system for
serious recording.
--
Dave Phillips
By Jonathan Corbet
July 1, 2009
O_NODE. Miklos Szeredi has proposed a new flag (O_NODE) which could be
passed to open() calls. This
flag, in essence, says that the calling program wants to open the indicated
filesystem node, but doesn't want to actually
do anything with it.
With such opens, the underlying
open() file operation will not be
called, reads and writes will not be allowed, etc.
One might well wonder what the use for such an operation is. The main
motivation would appear to be to allow an application to create a file
descriptor which can be passed to other system calls - fstat(),
say, or openat(). File descriptors used in this way do not really
need access to the underlying file, so it makes sense to provide a way to
create file descriptors without that access.
O_PONIES. Rik van Riel has proposed
another new open flag (actually called O_REWRITE) which is
intended to help applications easily avoid the "zero-length files after a
crash" problem. A program could open an existing file with
O_REWRITE and get some special semantics. The new file would
exist, invisibly, alongside the existing file for as long as it remains
open; during that time, any other opens of that file would get the old
version. Once the file is closed, the kernel will rename it to the given
name in an atomic manner, ensuring that either the old version or the
(full) new version will exist should a crash happen in the middle.
This option would make it easy for application developers to rewrite
existing files without worrying about robustness. Some might respond that
it would be better to just teach those developers to use fsync(),
but, as Rik notes, "relying on application developers to get it right
seems to not have worked out well." Rik's proposal currently lacks
an accompanying patch, so it's not destined for the mainline anytime soon.
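For comparison, the pattern that O_REWRITE is meant to make unnecessary can be written out by hand in user space: write the new contents to a temporary file, fsync() it, then rename() it over the old name. The following is a minimal sketch of that pattern, not of Rik's proposal; the helper name atomic_rewrite and the ".tmp" suffix are illustrative choices, and the final directory fsync assumes a POSIX system.

```python
import os
import tempfile


def atomic_rewrite(path, data):
    """Replace the contents of `path` with `data` (bytes) so that a crash
    leaves either the old file or the complete new file, never an empty one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the new version next to the old one so that rename() stays
    # within one file system and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # force the new data to stable storage
        os.replace(tmp_path, path)   # atomically swap in the new version
    except BaseException:
        # Best-effort cleanup of the temporary file if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Flush the directory entry change as well (POSIX systems).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Every application that rewrites files has to get each of these steps right, which is the burden the proposed flag would move into the kernel.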
VFAT patents. As discussed elsewhere, Andrew
Tridgell has posted a new lawyer-approved patch aimed at working around
Microsoft's VFAT patents. The discussion on the lists has taken a bit of a
different course this time around; there is still some annoyance at making
changes like this to deal with the problems of the U.S. patent system, but
those voices have been relatively quiet. Not completely quiet, though;
Alan Cox said:
Putting the stuff in kernel upsets everyone who isn't under US
rule, creates situations where things cannot be discussed and
doesn't make one iota of difference to the vendors because they
will remove the code from the source tree options and all anyway -
because as has already been said it reduces risk.
Beyond that, what developers worry about is interoperability with other
VFAT implementations. Alan Cox is asking
that, if this patch goes in, the modified filesystem no longer be called
"VFAT," since, as he sees it, it's now something else. Ted Ts'o has responded that "VFAT" is a bit of a slippery
concept to begin with. It's not clear how this issue will be resolved.
Voyager's voyage. James Bottomley is a proud owner of an archaic
Voyager system; Voyager is an
x86-based architecture with a number of
quaint features - though, contrary to rumor, steam power is not among
them. It is not clear whether any Voyager-based systems are still running
outside of James's basement. Nonetheless, James has been maintaining the
Voyager architecture for years.
More recently, Voyager got kicked out when the code was broken in the
process of an x86 subarchitecture-support rewrite. When James tried to get
it put back in, x86 maintainer Ingo Molnar objected, saying that the costs of the
patch were not justified by the benefits of serving such a small user base
in the mainline kernel. In the end, Ingo rejected the patch outright, leading to what
appeared to be an unsolvable stalemate between the two developers.
Things changed about the time that Linus jumped
into the conversation:
Ingo, "absurd irrelevance" is not a reason. If it was, we'd lose about
half our filesystems etc.
Neither is "thousands of lines of code", or "it hasn't always
worked". Again, if it was, then we'd have to get rid of just about
all drivers out there.
Ingo eventually backed down on a number of
his complaints about the Voyager patches. What remains, though, is a long
list of technical problems with the Voyager tree and how it has been
managed. James has accepted those complaints as valid, and will work
toward resolving them. Before too long, Voyager owners (both of them)
should once again have full support for their beloved architecture in the
mainline kernel.
July 1, 2009
This article was contributed by Valerie Aurora (formerly Henson)
If a file system discussion goes on long enough, someone will bring up
soft updates eventually, usually in the form of, "Duhhh, why are you
Linux people so stupid? Just use soft updates, like BSD!" Generally,
there will be no direct reply to this comment and the conversation
will flow silently around it, like a stream around an inky black
boulder. Why is this? Is it pure NIH (Not Invented Here) on the part
of Linux developers (and Solaris and OS X and AIX and...) or is there
something deeper going on? Why are soft updates so famous and yet so
seldom implemented? In this article, I will argue that soft updates
are, simply put, too hard to understand, implement, and maintain to be
part of the mainstream of file system development - while
simultaneously attempting to explain how soft updates work. Oh, the
irony!
Soft updates: The view from 50,000 feet
Soft updates is one of a family of techniques for maintaining on-disk
file system consistency. The basic problem is that a file system
doesn't always get shut down cleanly - think power outage or operating
system crash - and if this happens in the middle of an update to the
file system (say, deleting a file), the on-disk state of the file
system may be inconsistent (corrupt). The original solution to this
problem was to run fsck on the entire file system to find and
correct inconsistencies; ext2 is an example of a file system that uses
this approach. (Note that this use of fsck - to recover from
an unclean shutdown - is different from the use of fsck to
check and repair a file system that has suffered corruption through
some other cause.)
The fsck approach has obvious drawbacks
(excessive time, possible lost data), so file system developers have
invented new techniques. The most popular and well-known is that of
logging or journaling: before we begin writing out the changes to the
file system, we write a short description of the changes we are about
to make (a journal entry) to a separate area of the disk (the
journal). If the system crashes in the middle of writing out the
changes, we simply finish up the changes by replaying the journal
entry at the next file system mount.
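The journaling idea can be shown with a toy model. This is a sketch under simplifying assumptions (whole blocks are logged, the journal itself never tears), and the class and method names are invented for illustration rather than taken from any real file system.

```python
class JournaledDisk:
    """Toy block device with a write-ahead journal kept in a separate area."""

    def __init__(self):
        self.blocks = {}    # block number -> contents (the file system proper)
        self.journal = []   # committed journal entries not yet checkpointed

    def commit(self, changes):
        # Step 1: describe the whole update in the journal before touching
        # the file system itself.
        self.journal.append(dict(changes))

    def checkpoint(self, crash_after=None):
        # Step 2: write the changes to their final locations. `crash_after`
        # simulates losing power after that many block writes.
        written = 0
        for entry in self.journal:
            for blockno, data in entry.items():
                if crash_after is not None and written == crash_after:
                    return False          # crashed in the middle of the update
                self.blocks[blockno] = data
                written += 1
        self.journal.clear()
        return True

    def replay(self):
        # At the next mount, re-apply every journal entry. Rewriting a
        # block that already reached the disk is harmless.
        return self.checkpoint()
```

A crash between the two block writes leaves the file system half-updated, but the journal entry survives, so replay() finishes the job.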
Soft updates, instead, takes a two-step approach to crash recovery. First, we
carefully order writes to disk so that, at the time of a crash (or any
other time), the
only inconsistencies are ones in which a file system structure is
marked as allocated when it is actually unused. Second, at the next
boot after the crash, fsck is run in the background on a file
system snapshot (more on that later) to find and free file system
structures that are wrongly marked as allocated. Basically, soft
updates orders writes to the disk so that only relatively harmless
inconsistencies are possible, and then fixes them in the background by
checking and repairing the entire file system. The benchmark results
are fairly stunning: in common workloads, performance is often
within 5% of that of BSD's memory-only file system. The older version
of FFS, which used synchronous writes and foreground fsck to
provide similar reliability, often runs 20-30% slower than the
in-memory file system.
Step 1: Update dependencies
The first part of implementing soft updates is figuring out how to
order the writes to the disk so that after a crash, the only possible
errors are inodes and blocks erroneously marked as allocated (when
they are actually free). First, the authors lay out some rules to
follow when writing changes to disk in order to accomplish this goal.
From the paper:
- Never point to a structure before it has been initialized
(e.g., an inode must be initialized before a directory entry
references it).
- Never re-use a resource before nullifying all previous pointers
to it (e.g., an inode's pointer to a data block must be nullified
before that disk block may be re-allocated for a new inode).
- Never reset the old pointer to a live resource before the new
pointer has been set (e.g., when renaming a file, do not remove the
old name for an inode until after the new name has been written).
Pairs of changes in which one change must be written to disk before
the next change can be written, according to the above rules, are
called update dependencies. For some more examples of update
dependencies, take the case of writing to the first block in a file
for the first time. The first update dependency is that the block
bitmap, which records which blocks are in-use, must be written to show
that the block is in use before the block pointer in the inode is set.
If a crash were to occur at this point, the only inconsistency would
be one bit in the block bitmap showing a block is allocated when it
isn't actually. This is a resource leak, and must be fixed
eventually, but the file system can operate correctly with this error
as long as it doesn't run out of blocks.
The second update dependency is that the data in the block itself must
be written before the block pointer in the inode can be set (along
with the increase in the inode size and the associated timestamp
updates). If it weren't, a crash at this point would result in
garbage appearing in the file - a potential security hole, as well, if
that garbage came from a previously written file. Instead, a crash
would result in a leaked block (marked as allocated when it isn't)
that happens to contain the data from the attempted write. As a
result, the write to the bitmap and the write of the data to the block
must complete (in any order) before the write that updates the inode's
block pointer, size, and timestamps.
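The two dependencies in this example form a small graph, and any write order that respects it is acceptable. The sketch below encodes them and checks candidate orderings; the labels are descriptive strings invented here, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each key may reach the disk only after every update in its value set.
update_dependencies = {
    "inode: block pointer, size, timestamps": {
        "bitmap: mark block allocated",
        "data: contents of the new block",
    },
}


def safe_write_order(dependencies):
    """Return one ordering of the updates that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def respects_dependencies(order, dependencies):
    """Check that no update is written before something it depends on."""
    position = {update: i for i, update in enumerate(order)}
    return all(
        position[before] < position[after]
        for after, befores in dependencies.items()
        for before in befores
    )
```

The bitmap and data writes may land in either order, but the inode update always comes last; reversing the order fails the check.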
These rules about ordering of writes aren't new for soft updates; they
were originally created for writes to a "normal" FFS file system. In
the original FFS code, ordering of writes is enforced with synchronous
writes - that is, the ongoing file system operation (create, unlink,
etc.) waits for each ordered write to hit disk before going on to the
next step. While the write is in progress, the operating system
buffer containing the disk block in question is locked. Any other
operation needing to change that buffer has to wait its turn. As a
result, many metadata operations progress at disk speed (i.e.,
murderously slowly).
Step 2: Efficiently satisfying update dependencies
So far, we have determined that synchronous writes on locked buffers
are a slow, painful way of enforcing the ordering of writes to the
file system. But synchronous writes are overkill for most file system
operations; other than fsync(), we generally don't want a
guarantee that the result has been written to stable storage before
the system call returns, and as we've seen, the file system code
itself usually only cares about the order of writes, not when they
complete. What we want is a way to record changes to metadata, along
with the associated ordering constraints, and then schedule the actual
writes at our leisure. No problem, right? We'll just add a couple of
pointers to each in-memory buffer containing metadata, linking it to
the blocks it has to come before and after.
Turns out there is a problem: cyclical dependencies. We have to write
to the disk in block-size units, and each block can potentially
contain metadata affected by more than one metadata operation. If two
different operations affect the same blocks, it can easily result in
conflicting requirements: operation A requires that block 1 be written
before block 2, and operation B requires that block 2 be written
before block 1. Now you can't write out any changes without violating
the ordering constraints. What to do?
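The deadlock is easy to reproduce once ordering is tracked per block rather than per change. In this sketch the two constraints from operations A and B are fed to a topological sort, which finds no valid order; the function name is made up for the example, and graphlib requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter

# Operation A: block 1 must be written before block 2.
# Operation B: block 2 must be written before block 1.
# Mapping: block -> set of blocks that must be written first.
block_order = {
    "block 2": {"block 1"},   # from operation A
    "block 1": {"block 2"},   # from operation B
}


def find_write_order(constraints):
    """Return a valid block write order, or None if the constraints cycle."""
    try:
        return list(TopologicalSorter(constraints).static_order())
    except CycleError:
        return None
```

With only operation A's constraint the answer is block 1, then block 2; with both constraints there is no answer at all.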
Most people, at this point, decide to use journaling or copy-on-write
to deal with this problem. Both techniques group related changes into
transactions - a set of writes that must take effect all at once - and
write them out to disk in such a manner that they take effect
atomically. But if you are Greg Ganger and Yale Patt, you come up
with a scheme to record individual modifications to blocks (such as
the update to a single bit in a bitmap block) and their relationships
to other individual changes (that change requires this other change to
be written out first). Then, when you write out a block, you lock it
and iterate through the records of individual changes to this block.
For each individual change whose dependencies haven't yet been
satisfied, you undo that change to the block, and then write out the
resulting block. When the write is done, you re-apply the changes
(roll forward), unlock, and continue on your way until the next write.
The write you just completed may have satisfied the update
dependencies of other blocks, so now you can go through the same
process (lock, roll back, write, roll forward, unlock) for those
blocks. Eventually, all the dependencies will be satisfied and
everything will be written to disk, all without running into any circular
dependencies. This, in a nutshell, is what makes soft updates unique.
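Here is a much-simplified simulation of that roll-back and roll-forward loop. It is a sketch of the idea only, not of the FFS code: the Change record, the slot names, and the flush() driver are all invented for illustration, and building the on-disk image from a copy of the buffer stands in for the lock, roll back, write, roll forward, unlock sequence.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    name: str
    block: str
    slot: str
    old: object
    new: object
    depends_on: set = field(default_factory=set)  # changes that must hit disk first
    on_disk: bool = False


def write_block(block, memory, disk, changes):
    """Write one block, undoing changes whose dependencies are still unmet."""
    written = {c.name for c in changes if c.on_disk}
    image = dict(memory[block])          # copy of the in-memory buffer
    safe = []
    for c in changes:
        if c.block != block or c.on_disk:
            continue
        if c.depends_on <= written:
            safe.append(c)               # dependencies satisfied: keep new value
        else:
            image[c.slot] = c.old        # roll back, for this write only
    disk[block] = image                  # the (simulated) disk write
    for c in safe:
        c.on_disk = True
    # The buffer in `memory` was never modified, so roll-forward is implicit.
    return [c.name for c in safe]


def flush(memory, disk, changes):
    """Keep writing blocks until every change has reached the disk."""
    log = []
    while not all(c.on_disk for c in changes):
        progress = False
        for block in memory:
            if any(c.block == block and not c.on_disk for c in changes):
                done = write_block(block, memory, disk, changes)
                log.append((block, done))
                progress = progress or bool(done)
        if not progress:
            raise RuntimeError("dependency cycle between individual changes")
    return log
```

Take a file create (inode initialized before the directory entry is added) alongside a file removal (directory entry cleared before the inode is freed), touching the same directory block and inode block. At the block level that is a cycle, but flush() resolves it in three writes: the directory block with the new entry rolled back, then the inode block, then the directory block again.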
Recording changes and dependencies
So what does a record of a metadata change, and its corresponding
update dependencies, actually look like at the data structure level?
First, there are twelve (as of the 1999 paper) distinct structures to
record the different types of dependencies:
| Structure | Dependency tracked |
| bmsafemap | block/inode bitmaps |
| inodedep | inode |
| allocdirect | blocks referenced by direct block pointers |
| indirdep | indirect block |
| allocindir | blocks referenced by indirect block pointers |
| pagedep | directory entry add/remove |
| diradd | new directory entry |
| mkdir | new directory creation |
| dirrem | directory removal |
| freefrag | fragment to be freed |
| freeblks | block to be freed |
| freefile | inode to be freed |
Each kind of dependency-tracking structure includes pointers that
allow it to be linked into lists attached to the buffers containing
the relevant on-disk structures. These lists are what the soft
updates code traverses during the roll-back and roll-forward
operations on a block being written to disk. Each dependency
structure has a set of state flags describing the current status of
the dependency. The flags indicate whether the dependency is
currently applied to the associated buffer, whether all of the writes
it depends on have completed, and whether the update described by the
dependency tracking structure itself has been written to disk. When
all three of these flags are set (the update is applied to the
in-memory buffer, all its dependent writes are completed, and the
update is written to disk), the dependency structure can be thrown
away.
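The three-flag lifecycle can be sketched as a small bit set. The flag names below are illustrative only, since the description above gives the three conditions but not the identifiers used in the code.

```python
from enum import Flag, auto


class DepState(Flag):
    # Illustrative names for the three conditions described above.
    APPLIED = auto()         # update is currently applied to the in-memory buffer
    DEPS_COMPLETE = auto()   # every write this update depends on has completed
    WRITTEN = auto()         # the update itself has been written to disk


ALL_DONE = DepState.APPLIED | DepState.DEPS_COMPLETE | DepState.WRITTEN


def can_free(state):
    """A dependency structure can be discarded once all three flags are set."""
    return (state & ALL_DONE) == ALL_DONE
```

In this model a roll-back would clear APPLIED for the duration of the write, and a roll-forward would set it again.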
Page 7 of the 1999
soft updates paper [PDF] begins the descriptions of
specific kinds of update dependency structures and their relationships
to each other. I've read this paper at least 15 times, and each time
I get to page 7, I'm feeling pretty good and thinking, "Yeah,
okay, I must be smarter now than the last time I read this because I'm
getting it this time," - and then I turn to page 8 and my head
explodes. Here's the first figure on that page:
And that's only the figure from the left-hand column. An only
slightly less complex spaghetti-gram occupies the right-hand column.
This goes on for six pages, describing each specific kind of update
dependency and its progression through various lists associated with
buffers and file system structures and, most importantly, other update
dependency structures. You find yourself struggling through paragraphs like:
Figure 10 shows the structures involved in renaming a file. [Figure 10
looks much like the figure above.] The dependencies follow the same
series of steps as those for adding a new file entry, with two
variations. First, when a roll-back of an entry is needed because its
inode has not yet been written to disk, the entry must be set back to
the previous inode number rather than zero. The previous inode number
is stored in a dirrem structure. The DIRCHG flag is set in
the diradd structure so that the roll-back code knows to use
the old inode number stored in the dirrem structure. The
second variation is that, after the modified directory entry is
written to disk, the dirrem structure is added to the work
daemon's tasklist list so that the link count of the old
inode will be decremented as described in Section 3.9.
Say that three times fast!
The point is not that the details of soft updates are too complex for
mere humans to understand (although I personally wouldn't bet
against Greg Ganger being superhuman). The point is that this
complexity reflects a lack of generality and abstraction in the design
of soft updates as a whole. In soft updates, every file system
operation must be individually analyzed for write dependencies, every
on-disk structure must have a custom-designed dependency tracking
structure, and every update operation must allocate one of these
structures and hook itself into the web of other custom-designed
dependency tracking structures. If you add a new file system feature,
like extended attributes, or change the on-disk format, you have to
start from scratch and reason out the relevant dependencies, design a
new structure, and write the roll-forward/roll-back routines. It's
fiddly, tedious work, and the difficulty of doing it correctly doesn't
make it any better a use of programmer staff-hours.
Contrast the highly operation-specific design of soft updates to the
transaction-style interface used by most journaling and copy-on-write
file systems. When you begin a logical operation (such as a file
create), you create a transaction handle. Then, for each on-disk
structure you have to modify, you add that buffer to the list of
buffers modified by this transaction. When you are done, you close
out the transaction and hand it off to the journaling (or COW)
subsystem, which figures out how to merge it with other transactions
and write them out to disk properly. The user of the transaction
interface only has to know how to open, close, and add blocks to a
transaction, while the transaction code only has to know which blocks
are part of the same transaction. Adding a new write operation
requires no special knowledge or analysis beyond remembering to add
changed blocks to the transaction.
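To show how little the caller needs to know, here is a sketch of such a transaction interface. The class and method names are hypothetical and do not correspond to any particular journaling layer; the point is only the shape of the API.

```python
class Journal:
    """Stand-in for the journaling (or COW) subsystem."""

    def __init__(self):
        self.committed = []          # one frozenset of block numbers per transaction

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, journal):
        self.journal = journal
        self.buffers = set()
        self.closed = False

    def add_buffer(self, blockno):
        # The only obligation on the caller: declare every block it modifies.
        if self.closed:
            raise RuntimeError("transaction already closed")
        self.buffers.add(blockno)

    def close(self):
        # Hand the whole set to the journal, which decides how and when to
        # write it so that it takes effect atomically.
        self.closed = True
        self.journal.committed.append(frozenset(self.buffers))


def create_file(journal, bitmap_block, inode_block, dir_block):
    """A file create, expressed only as 'these blocks change together'."""
    handle = journal.begin()
    handle.add_buffer(bitmap_block)
    handle.add_buffer(inode_block)
    handle.add_buffer(dir_block)
    handle.close()
```

Nothing in create_file() says which block must be written first; that knowledge lives entirely on the other side of the interface.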
The lack of generalization and abstraction shows up again when the
combination of update dependency ordering and the underlying disk
format poke out and cause strange user-visible behavior. The most
prominent example shows up when removing a directory; following the
rules governing update dependencies means that a directory's ".."
entry can't be removed until the directory itself is recorded as
unlinked on the disk. Chains of update dependencies sometimes
resulted in up to two minutes of delay between the return
of rmdir() and the corresponding decrement of the parent
directory's link count. This can break, among other things, a simple
recursive "rm -rf". The fix was to fake up a second link
count that is reported to userspace, but the real problem was a
too-tight coupling between on-disk structures, the system to maintain
on-disk consistency, and the user-visible structures. Long chains of
update dependencies cause problems elsewhere, during unmount
and fsync() in particular.
Fsck and the snapshot
But wait, there's more! The second stage of recovery for soft updates
is to run fsck after the next boot, in the background using a
snapshot of the file system metadata. File system snapshots in FFS
are implemented by creating a sparse file the same size as the file
system - the snapshot file. Whenever a block of metadata is altered,
the original data is first copied to the corresponding block in the
snapshot file. Reads of unaltered blocks in the snapshot redirect to
the originals. Online fsck runs on the snapshot of the file
system metadata, finding leaked blocks and inodes. Once it completes,
fsck uses a special system call to mark these blocks and
inodes as free again.
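The snapshot mechanism amounts to copy-on-write at block granularity. A minimal model follows, with the class name and the dictionary-based "disk" both invented for illustration.

```python
class Snapshot:
    """Sparse copy-on-write snapshot of a block-addressed file system."""

    def __init__(self, live_blocks):
        self.live = live_blocks      # the live file system: block number -> data
        self.saved = {}              # sparse: only blocks altered since the snapshot

    def write_live(self, blockno, data):
        # Before a block is altered for the first time, preserve the original.
        if blockno not in self.saved:
            self.saved[blockno] = self.live.get(blockno)
        self.live[blockno] = data

    def read(self, blockno):
        # Altered blocks come from the snapshot file; unaltered ones
        # redirect to the live file system.
        if blockno in self.saved:
            return self.saved[blockno]
        return self.live.get(blockno)
```

The background fsck reads through read() and therefore sees a frozen image, while foreground activity keeps changing the live blocks through write_live().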
Online fsck implemented in this manner has severe limitations. First,
recovery from a crash still requires reading and processing the
metadata for the entire file system - in the background, certainly,
but that's still a lot of I/O. (Freeblock scheduling piggybacks
low-priority I/O, like that of a
background fsck, on high-priority foreground I/O so that it
interferes as little as possible with "real" work, but that's cold
comfort.) Second, it's not actually a full file system check and
repair, it's just a scan for leaked blocks and inodes - expected
inconsistencies. The whole concept of running fsck on a
snapshot file whose blocks are allocated from the same file system
assumes that the file system is not corrupted in a way that leaves
blocks marked as free when they are actually allocated.
Conclusion
Conceptually, soft updates can be explained concisely - order writes
according to some simple rules, track updates to metadata blocks,
roll-back updates with unsatisfied dependencies before writing the
block to disk, then roll-forward the updates again. But when it comes
to implementation, only programmers with deep, encyclopedic knowledge
of the file system's on-disk format can derive the correct ordering
rules and construct the associated data structures and web of
dependencies. The close coupling of on-disk data structures and their
updates and the user-visible data structures and their updates results
in weird, counter-intuitive behavior that must be covered up with
additional code.
Overall, soft updates is a sophisticated, insightful, clever idea - and
an evolutionary dead end. Journaling and copy-on-write systems are
easier to implement, require less special-purpose code, and demand far
less of the programmer writing to the interface.
By Jake Edge
July 1, 2009
We last looked at the perfcounters
patches back in
December, shortly after they appeared. Since that time, a great deal of
work has been done, culminating in perfcounters being included into the
mainline during
the recently completed 2.6.31 merge window. Along the way, a
tool to use perfcounters, called perf, was added to the mainline
as well.
Adding perf to the kernel tools/ directory is one of the
more surprising aspects of the perfcounters merge. Kernel hackers have
long been leery of adding user-space tools into the kernel source tree, but
Linus Torvalds was unconvinced by multiple
complaints about that approach. He pointed to oprofile to explain:
It took literally
months for the user mode tools to catch up and get the patches to support
new functionality into CVS (or is it SVN?), and after that it took even
longer for them to become part of a release and be picked up by
distributions. In fact, I'm not sure it is part of a release even now - I
had to make a bug report to Fedora to get atom and Nehalem support in my
tools: I think they took the unofficial patch.
Others were not so sure that oprofile being developed
separately from the kernel was the root cause of those failures. Christoph
Hellwig had other ideas: "I don't
think oprofile has been a [disaster] because of any kind of split,
but because the design has been a failure from day 1." But,
Torvalds wants to try including the tool to see
where it leads: "Let's give a _new_ approach a chance, and
see if we can avoid the mistakes of yesteryear this time."
The perf tool itself is a fairly simple command-line tool, which
can be built from the tools/perf directory. It also includes
some documentation, in the form of man pages that are also
available via perf help (as well as in HTML and other formats).
At its simplest, it gathers and reports some statistics for a particular
command:
$ perf stat ./hackbench 10
Time: 4.174
Performance counter stats for './hackbench 10':
8134.135358 task-clock-msecs # 1.859 CPUs
23524 context-switches # 0.003 M/sec
1095 CPU-migrations # 0.000 M/sec
16964 page-faults # 0.002 M/sec
10734363561 cycles # 1319.669 M/sec
12281522014 instructions # 1.144 IPC
121964514 cache-references # 14.994 M/sec
10280836 cache-misses # 1.264 M/sec
4.376588249 seconds time elapsed.
This summarizes the performance events that occurred while running the
hackbench micro-benchmark program. There is a combination of
hardware events (cycles, instructions, cache-references, and cache-misses)
as well as software events (task-clock-msecs, context-switches,
CPU-migrations, and page-faults) that are derived from the kernel code and
not the processor-specific performance monitoring unit (PMU). Currently,
support for hardware events is available for Intel, AMD, and PowerPC
PMUs, but other architectures still have support for the software
events.
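The figures in the right-hand column of the perf stat output appear to be simple ratios of the raw counts. The arithmetic below reproduces three of them from the numbers shown above; the variable and function names are made up here, and the cache miss ratio at the end is an extra derived value that perf did not print in this run.

```python
# Raw values copied from the `perf stat ./hackbench 10` output above.
task_clock_msecs = 8134.135358
elapsed_seconds = 4.376588249
cycles = 10734363561
instructions = 12281522014
cache_references = 121964514
cache_misses = 10280836


def per_second_millions(count, busy_msecs):
    """Event rate in millions per second of CPU time."""
    return count / (busy_msecs / 1000.0) / 1e6


cpus_used = task_clock_msecs / (elapsed_seconds * 1000.0)   # ~1.859 CPUs
ipc = instructions / cycles                                  # ~1.144 IPC
cycle_rate = per_second_millions(cycles, task_clock_msecs)   # ~1319.669 M/sec
miss_ratio = cache_misses / cache_references                 # not printed by perf here
```

Note that the per-second rates are relative to task-clock time (CPU time summed over all CPUs), not to the 4.38 seconds of wall-clock time.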
There is also a top-like mode for observing which kernel functions
are being executed most frequently in a continuously updating display:
$ perf top -c 1000 -p 3216
------------------------------------------------------------------------------
PerfTop: 360 irqs/sec kernel:65.0% [1000 cycles], (target_pid: 3216)
------------------------------------------------------------------------------
samples pcnt RIP kernel function
______ _______ _____ ________________ _______________
1214.00 - 5.3% - 00000000c045cb4d : lock_acquire
1148.00 - 5.0% - 00000000c045d1d3 : lock_release
911.00 - 4.0% - 00000000c045d377 : lock_acquired
509.00 - 2.2% - 00000000c05a0cbc : debug_locks_off
490.00 - 2.2% - 00000000c05a2f08 : _raw_spin_trylock
489.00 - 2.1% - 00000000c041d1d8 : read_hpet
488.00 - 2.1% - 00000000c04419b8 : run_timer_softirq
483.00 - 2.1% - 00000000c04d5f72 : do_sys_poll
477.00 - 2.1% - 00000000c05a34a0 : debug_smp_processor_id
462.00 - 2.0% - 00000000c043df85 : __do_softirq
404.00 - 1.8% - 00000000c074d93f : sub_preempt_count
353.00 - 1.5% - 00000000c074d9d2 : add_preempt_count
338.00 - 1.5% - 00000000c0408a76 : native_sched_clock
318.00 - 1.4% - 00000000c074b4c3 : _spin_lock_irqsave
309.00 - 1.4% - 00000000c044ea10 : enqueue_hrtimer
This is a static version of the output from looking at a largely quiescent
firefox process (pid 3216), sampling every 1000 cycles.
There is quite a bit more that perf can do. There is a
record sub-function that gathers the performance counter data into
a perf.data file which can be used by other commands:
$ perf record ./hackbench 10
Time: 4.348
[ perf record: Captured and wrote 2.528 MB perf.data (~110448 samples) ]
$ perf report --sort comm,dso,symbol | head -15
#
# (110146 samples)
#
# Overhead Command Shared Object Symbol
# ........ ................ ......................... ......
#
10.70% hackbench [kernel] [k] check_bytes_and_report
9.07% hackbench [kernel] [k] slab_pad_check
5.67% hackbench [kernel] [k] on_freelist
5.28% hackbench [kernel] [k] lock_acquire
5.03% hackbench [kernel] [k] lock_release
3.19% hackbench [kernel] [k] init_object
3.02% hackbench [kernel] [k] lock_acquired
2.47% hackbench [kernel] [k] _raw_spin_trylock
This output shows the top eight kernel functions executed while running
hackbench. The same data file can also be used by perf annotate (when
given a symbol name and the appropriate vmlinux file)
to show the disassembled code for a function, along with the
number of samples recorded on each instruction. There is clearly a wealth
of information that can be derived from the tool.
The original posting of the perfcounters patches came as something of a surprise
to Stéphane Eranian, who had long been working on another
performance monitoring solution, "perfmon". While he is still a bit
skeptical of perfcounters, which were originally proposed by Ingo Molnar
and Thomas Gleixner, he has been reviewing the patches, and providing
lengthy comments. Molnar also responded at length, breaking his reply into
multiple chunks which can be found in the thread.
Perfmon was targeted at exposing as much of the underlying PMU data as
possible to user space, but Molnar explicitly rejects that goal:
So for every "will you support advanced PMU feature X, Y and Z"
question you ask, the first-level answer is: 'please show the
developer usecase and integrate it into our tools so we can see how
it all works and how useful it is'.
"A tool might want to do this" is not a good enough answer. We now
have a working OSS tool-space with 'perf' where such arguments for
more PMU features can be made in very specific terms: patches,
numbers and comparisons. Actual hands-on utility, happy developers
and faster apps is what matters in the end - not just the list of
PMU features we expose.
His focus, presumably shared with his co-maintainers Peter Zijlstra and Paul
Mackerras, is to generalize performance measurement features so that they
are not dependent on any particular CPU and that they fit well with
developer work flow: "I do claim we had few if any sane performance
analysis tools before
under Linux, and i think we are still in the stone ages and still
have a lot of work to do in this area." From Molnar's perspective,
that ease of use for users and developers is one of the main areas where
perfmon fell short.
Molnar is not shy about pointing out that perfcounters still needs a lot of
work, but the framework is there, so features can be added to that. As
yet, there is no documentation in the kernel Documentation/
directory, but one presumes that will be handled sometime soon. Overall,
perfcounters and the perf tool look to be a highly useful addition
to the kernel, one that should start providing benefits—in the form
of better performance—in the near term.
By Jonathan Corbet
July 1, 2009
One of the features merged for 2.6.31 is the "fsnotify" file event
notification framework. Fsnotify serves as a new, common underpinning for
the inotify and dnotify APIs, simplifying the code considerably. But this
simplification, as welcome as it is, was never the real purpose behind
fsnotify. Instead, fsnotify exists to serve as the support structure for
fanotify, the "fscking all
notification system," which has now been posted for further review.
Fanotify was once known as TALPA; its main purpose is to
enable the implementation of malware scanners on Linux systems. When TALPA
was first proposed, it ran into criticism on a number of fronts, not the
least of which was a general disdain for malware scanning as a security
technique. The sad fact of the matter, though, is that a number of
customers require this functionality, so a market for such products on
Linux exists. Thus far, scanning products for Linux have relied on a
number of distasteful techniques, including hooking into the system call
table or the loading of binary-only security modules. Fanotify, it is
hoped, will help to wean these products off of such hacks and get them out
of the kernel altogether.
The user-space API used by fanotify is, to your editor's eye, a little
strange. An fanotify application starts by opening a socket with the new
PF_FANOTIFY protocol family. This socket must then be bound to an
"address" described this way:
struct fanotify_addr {
    sa_family_t family;
    __u32 priority;
    __u32 group_num;
    __u32 mask;
    __u32 timeout;
    __u32 unused[16];
};
The family field should be AF_FANOTIFY. The
priority field is used to determine which socket gets a specific
event if more than one fanotify socket exists; lower priority values win.
The group_num is used by the fsnotify layer to identify a group of
event listeners. The timeout field currently appears to be
unused. Finally, mask describes the events that the application
is interested in hearing about:
- FAN_ACCESS: every file access.
- FAN_MODIFY: file modifications.
- FAN_CLOSE: when files are closed.
- FAN_OPEN: open() calls.
- FAN_ACCESS_PERM: like FAN_ACCESS, except that the
process trying to access the file is put on hold while the fanotify
client decides whether to allow the operation.
- FAN_OPEN_PERM: like FAN_OPEN, but with the
permission check.
- FAN_EVENT_ON_CHILD: the caller is interested in events on
full directory hierarchies.
- FAN_GLOBAL_LISTENER: notify for events on all files in the
system.
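To make the bind step concrete, here is a minimal sketch of how a scanner might fill in the address and bind a socket it has already created. The structure follows the layout shown above, with uint32_t standing in for __u32. The numeric values given for AF_FANOTIFY and the FAN_* flags are placeholders invented so that the fragment compiles on its own; the real values live in the patch set. The helper names fill_scanner_addr() and bind_fanotify() are hypothetical, and socket creation is left to the caller since the socket type is not covered here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>   /* sa_family_t, struct sockaddr, bind() */

/* PLACEHOLDER values, not taken from the patch set. */
#define AF_FANOTIFY          37
#define FAN_ACCESS           0x0001u
#define FAN_MODIFY           0x0002u
#define FAN_CLOSE            0x0004u
#define FAN_OPEN             0x0008u
#define FAN_ACCESS_PERM      0x0010u
#define FAN_OPEN_PERM        0x0020u
#define FAN_EVENT_ON_CHILD   0x0040u
#define FAN_GLOBAL_LISTENER  0x0080u

/* Same field order as the structure in the article. */
struct fanotify_addr {
    sa_family_t family;
    uint32_t priority;
    uint32_t group_num;
    uint32_t mask;
    uint32_t timeout;
    uint32_t unused[16];
};

/*
 * Hypothetical helper: describe a listener that wants to approve or
 * deny every open and access on the system.
 */
void fill_scanner_addr(struct fanotify_addr *addr,
                       uint32_t priority, uint32_t group_num)
{
    assert(addr != NULL);
    memset(addr, 0, sizeof(*addr));   /* timeout and unused[] stay zero */
    addr->family = AF_FANOTIFY;
    addr->priority = priority;        /* lower values win */
    addr->group_num = group_num;
    addr->mask = FAN_OPEN_PERM | FAN_ACCESS_PERM | FAN_GLOBAL_LISTENER;
}

/* Hypothetical helper: bind an already-created PF_FANOTIFY socket. */
int bind_fanotify(int sock, const struct fanotify_addr *addr)
{
    return bind(sock, (const struct sockaddr *)addr, sizeof(*addr));
}
```

A caller would create the socket, call fill_scanner_addr() with its chosen priority and group number, and then pass both to bind_fanotify(), checking for a return value of -1 in the usual way.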
Once the socket has been bound, the application can learn about filesystem
activity using the well-known event-reading system call
getsockopt(). A call to getsockopt() with
SOL_FANOTIFY as the level and FANOTIFY_GET_EVENT as the
option will return one or more structures like this:
struct fanotify_event_metadata {
    __u32 event_len;
    __s32 fd;
    __u32 mask;
    __u32 f_flags;
    pid_t pid;
    pid_t tgid;
    __u64 cookie;
};
Here, fd is an open, read-only file descriptor for the file in
question, mask describes the event (using the flags described
above), f_flags is a copy of the flags provided by the process
trying to access the file, and pid and tgid identify that
process (in a namespace-unaware way, currently). If the event is one
requiring permission from the application, cookie will contain a
value which can be used to grant or deny that permission.
Note that the provided file descriptor should eventually be closed;
otherwise these file descriptors are likely to accumulate rather quickly.
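The event-reading side might look something like the following sketch. It makes two assumptions that the description above does not confirm: that getsockopt() updates its length argument to the number of bytes returned, and that multiple events are packed back to back, each event_len bytes long. The values for SOL_FANOTIFY and FANOTIFY_GET_EVENT are placeholders, and fetch_events(), drain_events(), and remember_cookie() are hypothetical names. The one firm point is that every descriptor handed over gets closed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>   /* getsockopt(), socklen_t */
#include <sys/types.h>    /* pid_t, ssize_t */
#include <unistd.h>       /* close() */

/* PLACEHOLDER values, not taken from the patch set. */
#define SOL_FANOTIFY        0x1000
#define FANOTIFY_GET_EVENT  1

/* Same field order as the article; uintN_t stands in for __uN. */
struct fanotify_event_metadata {
    uint32_t event_len;
    int32_t  fd;
    uint32_t mask;
    uint32_t f_flags;
    pid_t    pid;
    pid_t    tgid;
    uint64_t cookie;
};

typedef void (*event_handler)(const struct fanotify_event_metadata *ev,
                              void *ctx);

/* Fetch pending events; returns bytes stored in buf, or -1 on error. */
ssize_t fetch_events(int sock, char *buf, size_t cap)
{
    socklen_t len = (socklen_t)cap;

    if (getsockopt(sock, SOL_FANOTIFY, FANOTIFY_GET_EVENT, buf, &len) < 0)
        return -1;
    return (ssize_t)len;   /* ASSUMPTION: len now holds the bytes returned */
}

/*
 * Walk a buffer of events, hand each one to the handler, then close its
 * file descriptor. Returns the number of events processed.
 */
size_t drain_events(const char *buf, size_t len,
                    event_handler handler, void *ctx)
{
    size_t off = 0, count = 0;

    assert(buf != NULL);
    while (len - off >= sizeof(struct fanotify_event_metadata)) {
        struct fanotify_event_metadata ev;

        memcpy(&ev, buf + off, sizeof(ev));   /* avoid unaligned reads */
        if (ev.event_len < sizeof(ev) || ev.event_len > len - off)
            break;                            /* malformed; stop here */
        if (handler)
            handler(&ev, ctx);
        if (ev.fd >= 0)
            close(ev.fd);                     /* do not leak descriptors */
        off += ev.event_len;
        count++;
    }
    return count;
}

/* Example handler: record the cookie of the most recent event. */
void remember_cookie(const struct fanotify_event_metadata *ev, void *ctx)
{
    *(uint64_t *)ctx = ev->cookie;
}
```

Closing the descriptor inside the loop, rather than leaving it to each handler, makes it harder for a busy listener to run out of file descriptors; a handler that needs the file for longer would have to dup() it first.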
When access decisions are being made, the application must notify the
kernel with a call to setsockopt() using the
FANOTIFY_ACCESS_RESPONSE option and a structure like:
struct fanotify_so_access {
    __u64 cookie;
    __u32 response;
};
The cookie value from the event should be provided, and
response should be one of FAN_ALLOW or
FAN_DENY. If the application does not respond within five
seconds, the kernel will allow the action to proceed. Five seconds should
be sufficient for file scanning, but it could be a problem with some other
possible applications of fanotify, such as hierarchical storage management
systems. Fanotify developer Eric Paris notes that an option allowing the
response to be delayed indefinitely will probably be added at some point.
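Answering a permission event could be sketched as follows. The structure matches the one shown above, with uintN_t types standing in for the kernel's __uN types; the values for SOL_FANOTIFY, FANOTIFY_ACCESS_RESPONSE, FAN_ALLOW, and FAN_DENY are placeholders, and fill_response() and send_response() are hypothetical helper names. The scanner's verdict is assumed to arrive as a simple flag.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>   /* setsockopt() */

/* PLACEHOLDER values, not taken from the patch set. */
#define SOL_FANOTIFY              0x1000
#define FANOTIFY_ACCESS_RESPONSE  2
#define FAN_ALLOW                 1u
#define FAN_DENY                  2u

/* Same field order as the article; uintN_t stands in for __uN. */
struct fanotify_so_access {
    uint64_t cookie;
    uint32_t response;
};

/*
 * Hypothetical helper: build the answer for one permission event.
 * The cookie must be the one delivered with that event.
 */
void fill_response(struct fanotify_so_access *resp,
                   uint64_t cookie, int file_is_clean)
{
    assert(resp != NULL);
    memset(resp, 0, sizeof(*resp));   /* clear padding as well */
    resp->cookie = cookie;
    resp->response = file_is_clean ? FAN_ALLOW : FAN_DENY;
}

/* Hypothetical helper: hand the decision back to the kernel. */
int send_response(int sock, const struct fanotify_so_access *resp)
{
    return setsockopt(sock, SOL_FANOTIFY, FANOTIFY_ACCESS_RESPONSE,
                      resp, sizeof(*resp));
}
```

Since the kernel allows the access if no answer arrives within five seconds, a scanner built along these lines would need to finish its check, or send FAN_DENY, before that deadline if it wants to block anything.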
It is possible to adjust the set of files subject to notifications with the
FANOTIFY_SET_MARK, FANOTIFY_REMOVE_MARK, and
FANOTIFY_CLEAR_MARKS operations. If the
FAN_GLOBAL_LISTENER option was provided at bind time, then all
files are "marked" at the outset; FANOTIFY_REMOVE_MARK can be used
to prune those which are not interesting. Otherwise at least one
FANOTIFY_SET_MARK call must be made before events will be
returned.
Some details have been left out, but the above discussion covers the core
parts of the fanotify API. Comments on this posting have been relatively
scarce; opposition to this feature seems to have faded away over the last
year or so. What's left is getting the API right; your editor suspects
that the use of getsockopt() as an event retrieval interface may
raise a few eyebrows sooner or later. But, once that's ironed out, chances
are good that Linux will be well on the way toward having a general file
access notification and permission interface.
Page editor: Jonathan Corbet