Brief items
The current 2.6 development kernel is 2.6.30-rc7,
released on May 23.
"So go wild. I suspect I'll do an -rc8, but we're definitely getting
closer to release-time - it would be good to get as much testing as
possible, and it should generally be pretty safe to try this all
out." The long-format changelog has the details.
The current stable 2.6 kernel remains 2.6.29.4; there have been no
stable releases over the last week.
Kernel development news
Interesting, how telling somebody that they need to learn C is
considered an unacceptable thing to do. Hostile to newbies, or
some such. Introducing more magic that has to be learnt if one
wants to read the kernel source, OTOH, is just fine...
--
Al Viro
Sorry but are you really suggesting every program in the world that
uses write() anywhere should put it into a loop? That seems just
like really bad API design to me, requiring such contortions in a
fundamental system call just to work around kernel deficiencies.
I can just imagine the programmers putting nasty comments about the
Linux kernel on top of those loops and they would be fully
deserved.
--
Andi Kleen discovers POSIX
Hey, don't look at me - blame Brian Kernighan or George Bush or
someone.
--
Andrew Morton disclaims responsibility
By Jonathan Corbet
May 27, 2009
Union directories. While a number of developers are working on the
full union mount problem, Miklos Szeredi has taken a simpler approach:
union directories. Only
top-level directory unification is provided, and changes can only be made
to the top-level filesystem. That eliminates the need for a lot of complex
code doing directory copy-up, whiteouts, and such, but also reduces the
functionality significantly.
Optimizing writeback timers. On a normal Linux system, the
pdflush process wakes up every five seconds to force dirty page
cache pages back to their backing store on disk. This wakeup happens
whether or not there is anything needing to be written back. Unnecessary
wakeups are increasingly unwelcome, especially on systems where power
consumption matters, so it would be nice to let pdflush sleep when
there is nothing for it to do.
Artem Bityutskiy has put together a patch set to do just that. It
changes the filesystem API to make it easier for the core VFS to know when
a specific filesystem has dirty data. That information is then used to
decide whether pdflush needs to be roused from its slumber. The
idea seems good, but there's one little problem: this work conflicts with
the per-BDI flusher threads
patches by Jens Axboe. Jens's patches get rid of the pdflush
timer and make a lot of other changes, so these two projects do not
currently play well together. So Artem is headed back to the drawing board
to base his work on top of Jens's patches instead of the mainline.
recvmmsg(). Arnaldo Carvalho de Melo has proposed a new system call for
the socket API:
struct mmsghdr {
    struct msghdr msg_hdr;
    unsigned msg_len;
};
ssize_t recvmmsg(int socket, struct mmsghdr *mmsg, int vlen, int flags);
The difference between this system call and recvmsg() is that it
is able to accept multiple messages with a single call. That, in turn,
reduces system call overhead in high-bandwidth network applications. The
comments in the patch suggest that sendmmsg() is in the plans, but
no implementation has been posted yet.
There was a suggestion that this
functionality could be obtained by extending recvmsg() with a new
message flag, rather than adding a new system call. But, as David Miller
pointed out, that won't work. The kernel
currently ignores unrecognized flags; that will make it impossible for user
space to determine whether a specific kernel supports multiple-message
receives or not. So the new system call is probably how this feature will
be added.
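The semantics of the proposed call can be approximated in user space with a loop over recvmsg(). The sketch below is only an emulation under stated assumptions: demo_mmsghdr, demo_recvmmsg(), and demo_roundtrip() are hypothetical names invented for illustration, the choice to return a partial count when a later receive fails is an assumption, and the real system call would do this work in one kernel entry rather than one per message.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Mirrors the proposed struct mmsghdr; renamed so it cannot clash with libc. */
struct demo_mmsghdr {
    struct msghdr msg_hdr;  /* ordinary recvmsg() header */
    unsigned msg_len;       /* bytes received for this message */
};

/*
 * Emulation of the proposed call: fill up to vlen message slots.
 * Returns the number of messages received, or -1 if the very first
 * recvmsg() fails (assumed error policy, not taken from the patch).
 */
static ssize_t demo_recvmmsg(int socket, struct demo_mmsghdr *mmsg,
                             int vlen, int flags)
{
    int i;

    for (i = 0; i < vlen; i++) {
        ssize_t n = recvmsg(socket, &mmsg[i].msg_hdr, flags);

        if (n < 0)
            return i > 0 ? i : -1;
        mmsg[i].msg_len = (unsigned)n;
    }
    return i;
}

/*
 * Send nsend datagrams of 1..nsend bytes over a socketpair, then drain
 * them with one demo_recvmmsg() call. Returns the number of messages
 * received, or -1 on any error or length mismatch.
 */
static int demo_roundtrip(int nsend)
{
    enum { SLOTS = 8, BUFSZ = 64 };
    char bufs[SLOTS][BUFSZ];
    struct iovec iov[SLOTS];
    struct demo_mmsghdr msgs[SLOTS];
    int sv[2], i, got = -1;

    if (nsend < 0 || nsend > SLOTS)
        return -1;
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    for (i = 0; i < nsend; i++)
        if (send(sv[0], "abcdefgh", (size_t)i + 1, 0) < 0)
            goto out;

    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < SLOTS; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = BUFSZ;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* MSG_DONTWAIT stops the batch as soon as the queue is empty. */
    got = (int)demo_recvmmsg(sv[1], msgs, SLOTS, MSG_DONTWAIT);
    for (i = 0; i < got; i++) {
        if (msgs[i].msg_len != (unsigned)i + 1) {
            got = -1;
            break;
        }
    }
out:
    close(sv[0]);
    close(sv[1]);
    return got;
}
```

A caller sets up one iovec per slot, as with recvmsg(), and then inspects msg_len in each returned slot.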
By Jonathan Corbet
May 27, 2009
As the 2.6.30 development cycle heads toward a close, it is natural to look
back at what has been merged and where it came from. So here is LWN's
traditional look at who wrote the code which went into the mainline this
time around.
Once again, 2.6.30 was a large development cycle; it saw the incorporation
(through just after 2.6.30-rc7) of 11,733 non-merge changesets from 1125
developers. The number of changesets exceeds 2.6.29, but the number of
developers falls just short of the 1166 seen last time around. Those
developers added 1.14 million lines of code this time around, while
taking out 513,000, for a net growth of 624,000 lines.
The individual developer statistics for 2.6.30 look like:
| Most active 2.6.30 developers |
| By changesets |
| Ingo Molnar | 324 | 2.8% |
| Bill Pemberton | 227 | 1.9% |
| Stephen Hemminger | 204 | 1.7% |
| Hans Verkuil | 199 | 1.7% |
| Takashi Iwai | 188 | 1.6% |
| Bartlomiej Zolnierkiewicz | 186 | 1.6% |
| Steven Rostedt | 179 | 1.5% |
| Greg Kroah-Hartman | 150 | 1.3% |
| Jeremy Fitzhardinge | 125 | 1.1% |
| Mark Brown | 107 | 0.9% |
| Jaswinder Singh Rajput | 105 | 0.9% |
| Rusty Russell | 100 | 0.9% |
| Tejun Heo | 98 | 0.8% |
| Johannes Berg | 98 | 0.8% |
| Hannes Eder | 88 | 0.8% |
| Michal Simek | 85 | 0.7% |
| Luis R. Rodriguez | 85 | 0.7% |
| Sujith | 85 | 0.7% |
| David Howells | 80 | 0.7% |
| Yinghai Lu | 78 | 0.7% |
| By changed lines |
| Greg Kroah-Hartman | 120353 | 9.0% |
| ADDI-DATA GmbH | 43420 | 3.3% |
| Mithlesh Thukral | 42424 | 3.2% |
| Alex Deucher | 26576 | 2.0% |
| David Schleef | 25905 | 1.9% |
| David Woodhouse | 24636 | 1.8% |
| Ramkrishna Vepa | 23495 | 1.8% |
| Lior Dotan | 22506 | 1.7% |
| Eric Moore | 22266 | 1.7% |
| Eilon Greenstein | 18399 | 1.4% |
| Jaswinder Singh Rajput | 18168 | 1.4% |
| Hans Verkuil | 18048 | 1.4% |
| David Howells | 17941 | 1.3% |
| Andy Grover | 16355 | 1.2% |
| Michal Simek | 15827 | 1.2% |
| Sri Deevi | 15514 | 1.2% |
| Frank Mori Hess | 15450 | 1.2% |
| Ben Hutchings | 15031 | 1.1% |
| Ingo Molnar | 13876 | 1.0% |
| Bill Pemberton | 13817 | 1.0% |
On the changesets side, Ingo Molnar is at the top of the list this time
around; as usual, he created a vast number of patches - about five per day
- in the x86 architecture code, ftrace, and beyond. Bill Pemberton is
perhaps better known as the maintainer of the Elm mail client; he did a lot
of cleanup work with the COMEDI drivers in the -staging tree. The bulk of
Stephen Hemminger's work involved converting network drivers to the new
net_device_ops API. Hans Verkuil continues to improve the
Video4Linux2 framework and associated drivers, and Takashi Iwai continues
to generate a lot of patches as the ALSA maintainer.
Linus kicked off the 2.6.30 development cycle by noting that about one third of
the changes in 2.6.30-rc1 were "crap." So, unsurprisingly, the top three
entries in the "by changed lines" column all got there through the addition
of -staging drivers. Alex Deucher added Radeon R6xx/R7xx support; many of
his "changed lines" were associated microcode firmware. And David Schleef
added another set of drivers to the -staging tree.
Contributions to 2.6.30 could be traced back to some 190 employers.
Looking at the most-active employer information, we see:
| Most active 2.6.30 employers |
| By changesets |
| (None) | 1970 | 16.8% |
| Red Hat | 1305 | 11.1% |
| (Unknown) | 1184 | 10.1% |
| Intel | 855 | 7.3% |
| Novell | 832 | 7.1% |
| IBM | 630 | 5.4% |
| (Consultant) | 293 | 2.5% |
| Atheros Communications | 262 | 2.2% |
| Oracle | 252 | 2.1% |
| University of Virginia | 227 | 1.9% |
| Fujitsu | 217 | 1.8% |
| Vyatta | 204 | 1.7% |
| Renesas Technology | 152 | 1.3% |
| NTT | 121 | 1.0% |
| MontaVista | 115 | 1.0% |
| HP | 107 | 0.9% |
| Wolfson Microelectronics | 105 | 0.9% |
| (Academia) | 102 | 0.9% |
| Nokia | 98 | 0.8% |
| XenSource | 91 | 0.8% |
| By lines changed |
| (Unknown) | 181413 | 13.6% |
| Novell | 164229 | 12.3% |
| (None) | 118095 | 8.9% |
| Intel | 86060 | 6.5% |
| Red Hat | 73954 | 5.5% |
| LinSysSoft Technologies | 64798 | 4.9% |
| ADDI-DATA GmbH | 43420 | 3.3% |
| SofaWare | 39245 | 2.9% |
| Broadcom | 31956 | 2.4% |
| AMD | 28364 | 2.1% |
| Entropy Wave | 25905 | 1.9% |
| IBM | 25702 | 1.9% |
| Oracle | 25588 | 1.9% |
| NTT | 25235 | 1.9% |
| Neterion | 23495 | 1.8% |
| LSI Logic | 22304 | 1.7% |
| Atheros Communications | 21627 | 1.6% |
| (Consultant) | 19209 | 1.4% |
| Freescale | 16139 | 1.2% |
| PetaLogix | 15846 | 1.2% |
These numbers are somewhat similar to those seen in previous development
cycles. There are a few unfamiliar companies here; they are pretty much
all present as a result of contributions to -staging. It is interesting to
note that Atheros and Broadcom, once known as uncooperative companies, are
increasing their contributions over time.
Your editor has not looked at signoff statistics for the last few cycles.
The interesting thing to be found in Signed-off-by tags is an indication of
who the gatekeepers to the kernel are. Especially if one disregards
signoffs by the author of each patch, what remains is (mostly) the signoffs
of subsystem maintainers who approved the patches for merging. For 2.6.30,
these numbers look like this:
| Top non-author signoffs in 2.6.30 |
| Individuals |
| David S. Miller | 1216 | 12.1% |
| John W. Linville | 865 | 8.6% |
| Ingo Molnar | 836 | 8.3% |
| Greg Kroah-Hartman | 797 | 7.9% |
| Mauro Carvalho Chehab | 784 | 7.8% |
| Andrew Morton | 660 | 6.6% |
| James Bottomley | 250 | 2.5% |
| Linus Torvalds | 219 | 2.2% |
| Len Brown | 189 | 1.9% |
| Takashi Iwai | 165 | 1.6% |
| Jeff Kirsher | 145 | 1.4% |
| Russell King | 127 | 1.3% |
| H. Peter Anvin | 120 | 1.2% |
| Mark Brown | 115 | 1.1% |
| Jesse Barnes | 111 | 1.1% |
| Benjamin Herrenschmidt | 111 | 1.1% |
| Reinette Chatre | 104 | 1.0% |
| Martin Schwidefsky | 95 | 0.9% |
| Avi Kivity | 91 | 0.9% |
| Paul Mundt | 89 | 0.9% |
| Employers |
| Red Hat | 4264 | 42.4% |
| Novell | 1386 | 13.8% |
| Intel | 951 | 9.5% |
| Google | 660 | 6.6% |
| (None) | 408 | 4.1% |
| IBM | 378 | 3.8% |
| Linux Foundation | 219 | 2.2% |
| (Consultant) | 166 | 1.6% |
| (Unknown) | 127 | 1.3% |
| Wolfson Microelectronics | 115 | 1.1% |
| Renesas Technology | 92 | 0.9% |
| Marvell | 91 | 0.9% |
| Atomide | 81 | 0.8% |
| Oracle | 80 | 0.8% |
| Astaro | 65 | 0.6% |
| Freescale | 63 | 0.6% |
| Cisco | 61 | 0.6% |
| Analog Devices | 60 | 0.6% |
| Univ. of Michigan CITI | 59 | 0.6% |
| Panasas | 58 | 0.6% |
Signoffs have always been more concentrated than contributions in general.
Still, one wonders how David Miller manages to approve a solid twenty
patches every day. On the employer side, things are more concentrated than
ever; over half of the patches going into the kernel go through the hands
of a developer at Red Hat or Novell. Developers, it seems, work for a
great many companies, but subsystem maintainers gravitate toward a small
handful of firms.
All told, the picture remains one of a well-oiled, fast-moving development
process. We also see a picture of a -staging tree which is growing at a
tremendous rate; your editor is tempted to exclude -staging patches from
future reports if the rate does not slow somewhat. Even without -staging,
though, a lot of work is being done on the kernel, with the
participation of a large group of developers, and it doesn't look like it
will be slowing down anytime soon.
Postscript: Jan Engelhardt sent your editor a pointer to a
short script which, through use of the git blame command,
tallies up the "ownership" of every line in the kernel. The top results
for 2.6.30-rc7 look like this:
| Who last touched kernel code lines |
| Lines | Pct | Who |
| 4063723 | 35.17% | Linus Torvalds |
| 464021 | 4.02% | Greg Kroah-Hartman |
| 94200 | 0.82% | David Howells |
| 86031 | 0.74% | David S. Miller |
| 82608 | 0.71% | Luis R. Rodriguez |
| 72200 | 0.62% | Bryan Wu |
| 70128 | 0.61% | Takashi Iwai |
| 66859 | 0.58% | Ralf Baechle |
| 55785 | 0.48% | Hans Verkuil |
| 54069 | 0.47% | Paul Mundt |
| 54007 | 0.47% | Kumar Gala |
| 53288 | 0.46% | David Brownell |
| 51640 | 0.45% | Russell King |
| 50611 | 0.44% | Paul Mackerras |
| 49499 | 0.43% | Andrew Victor |
| 49347 | 0.43% | Mauro Carvalho Chehab |
| 49256 | 0.43% | Alan Cox |
| 47305 | 0.41% | Mikael Starvik |
| 47040 | 0.41% | Ben Dooks |
| 44307 | 0.38% | Benjamin Herrenschmidt |
Linus shows a high ownership because he was the initial committer at the
beginning of the git era. To a rough approximation, one can conclude that
approximately one third of the code in the kernel has not been touched
since that time. There are other interesting things which can be done with
line-level statistics; your editor plans to explore this idea some in the
future.
May 26, 2009
This article was contributed by Nitin Gupta
The idea of memory compression—compress relatively unused pages and
store them in memory
itself—is simple and has been around for a long
time. Compression, through the elimination of expensive disk I/O, is far
faster than swapping those pages to secondary storage.
When a page is needed again, it is decompressed and given back, which
is, again, much faster than going to swap.
An implementation of this idea on Linux is currently under development
as the compcache
project. It creates a virtual block device (called ramzswap) which acts
as a swap disk. Pages swapped to this disk are compressed and stored in
memory itself. The project home contains use cases, performance
numbers, and other related bits. The whole aim of the project is not
just performance — on swapless setups, it allows running applications
that would otherwise simply fail due to lack of memory. For example,
Edubuntu included
compcache to lower the RAM requirements of its installer.
The performance
page on the project wiki shows numbers for configurations that
closely match netbooks, thin clients, and embedded devices. These
initial results look promising. For example, in the benchmark for thin
clients, ramzswap gives nearly the same effect as doubling the memory.
Another benchmark
shows that the average time required to complete swap requests is
reduced drastically with ramzswap. With a swap partition located on
a 10000 RPM disk, the average time required for swap read and write
requests was found to be 168ms and 355ms, respectively. With
ramzswap, the corresponding numbers were a mere 12µs and 7µs, respectively;
this includes the time for checking zero-filled pages and
compressing/decompressing all non-zero pages.
The approach of using a virtual block device is a major simplification
over earlier attempts. The previous implementation required changes to the
swap write path, page
fault handler, and page cache lookup functions (find_get_page() and
friends). Those patches did not gain widespread acceptance due to their
intrusive nature. The new approach is far less intrusive, but at a cost:
compcache has lost the ability to
compress page cache (filesystem backed) pages. It
can now compress swap cache (anonymous) pages only. At the same time,
this simplicity and non-intrusiveness got it included in Ubuntu, ALT
Linux, LTSP
(Linux Terminal Server Project) and maybe other places as well.
It should be noted that, when used at the hypervisor level, compcache
can compress any part of the guest memory and for any kind of guest OS
(Linux, Windows etc) — this should allow running more virtual
machines for
a given amount of total host memory. For example, in KVM the
guest physical memory is simply anonymous memory for the host (Linux
kernel in this case). Also, with the recent MMU notifier support
included in the Linux kernel, nearly the entire guest
physical memory is now swappable [PDF].
Implementation
All of the individual components are separate kernel modules:
- LZO compressor: lzo_compress.ko, lzo_decompress.ko (already in
mainline)
- xvMalloc memory allocator: xvmalloc.ko
- compcache block device driver: ramzswap.ko
Once these modules are loaded, one can just enable the ramzswap swap device:
swapon /dev/ramzswap0
Note that ramzswap cannot be used as a generic block device. It can
only handle page-aligned I/O, which is sufficient for use as a swap
device. No use case has yet come to light that would justify the effort
to make it a generic compressed read-write block device. Also, to
minimize block layer overhead, ramzswap uses the "no queue" mode of
operation. Thus, it accepts requests directly from the block layer and
avoids all overhead due to request queue logic.
The ramzswap module accepts parameters for "disk" size, memory limit, and
backing swap partition. The optional backing swap partition parameter
is the physical disk swap partition where ramzswap will forward
read/write requests for pages that compress to a size larger than
PAGE_SIZE/2 — so we keep only highly compressible pages in
memory.
Additionally, pages filled entirely with zeros are detected, and no memory is
allocated for them. For "generic" desktop workloads (Firefox,
email client, editor, media player etc.), we typically see 4000-5000
zero-filled pages.
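The placement rules described above can be summarized in a few lines of user-space C. This is only a sketch of the decision, not ramzswap code: DEMO_PAGE_SIZE, the demo_dest values, and both function names are hypothetical, a 4096-byte page is assumed, and keeping a poorly compressible page in memory when no backing swap partition is configured is an assumption the text does not spell out.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u  /* assumed page size for illustration */

enum demo_dest {
    DEMO_ZERO_PAGE,     /* nothing stored; page is all zeros */
    DEMO_IN_MEMORY,     /* compressed copy kept in RAM */
    DEMO_BACKING_SWAP   /* forwarded to the physical swap partition */
};

/* Return 1 if every byte of the page is zero, 0 otherwise. */
static int demo_page_is_zero(const unsigned char *page)
{
    size_t i;

    for (i = 0; i < DEMO_PAGE_SIZE; i++)
        if (page[i] != 0)
            return 0;
    return 1;
}

/*
 * Decide where a swapped-out page goes, given its compressed length.
 * Pages that compress to more than half a page go to the backing
 * device when one exists; everything else stays in memory.
 */
static enum demo_dest demo_classify(const unsigned char *page,
                                    size_t compressed_len,
                                    int have_backing_swap)
{
    if (demo_page_is_zero(page))
        return DEMO_ZERO_PAGE;
    if (have_backing_swap && compressed_len > DEMO_PAGE_SIZE / 2)
        return DEMO_BACKING_SWAP;
    return DEMO_IN_MEMORY;
}
```

The zero check comes first because it avoids both the compression step and any allocation.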
Memory management
One of the biggest challenges in this project is managing variable-sized
compressed chunks. For this, ramzswap uses a memory allocator
called xvmalloc,
developed specifically for this project. It has O(1) malloc/free, very
low fragmentation (within 10% of ideal in all tests), and can use
highmem (useful on 32-bit systems with >1G memory). It exports a
non-standard allocator interface:
struct xv_pool *xv_create_pool(void);
void xv_destroy_pool(struct xv_pool *pool);
int xv_malloc(struct xv_pool *pool, u32 size, u32 *pagenum, u32 *offset, gfp_t flags);
void xv_free(struct xv_pool *pool, u32 pagenum, u32 offset);
xv_malloc() returns a <pagenum, offset>
pair. It is then up to the caller to map this page (with kmap())
to get a valid kernel-space pointer.
The justification for the use of a custom memory allocator was provided when the
compcache patches
were posted to linux-kernel. Both the SLOB and SLUB allocators were found to
be unsuitable for use in this project. SLOB targets embedded devices and claims
to have good space efficiency. However, it was found to have some major
problems: It has O(n) alloc/free behavior and can lead to large amounts of wasted
space as
detailed in this LKML post.
SLUB, on the other hand, has a different set of problems.
The first is the usual fragmentation issue. The data presented here
shows that kmalloc uses ~43% more memory than xvmalloc. Another problem is
that it depends
on allocating higher order pages to reduce fragmentation. This is not
acceptable for ramzswap as it is used in tight-memory situations, so higher
order allocations are almost guaranteed to fail.
The xvmalloc allocator, on the other hand, always allocates zero-order
pages when it needs to expand a memory pool.
Also, both SLUB and SLOB are limited to allocating from
low memory. This
particular limitation applies only to 32-bit systems with more
than 1G of memory. On such systems, neither allocator is able to
allocate from the high memory zone. This restriction is not acceptable for
the compcache project. Users with such configurations reported memory
allocation failures from ramzswap (before xvmalloc was developed) even
when plenty of high-memory was available. The xvmalloc allocator,
on the other hand, is able to allocate from the high memory region.
Considering the above points, xvmalloc could potentially replace the
SLOB allocator. However, this would involve a lot of additional work, as
xvmalloc provides a non-standard
malloc/free interface. Also, xvmalloc is not
scalable in its current state (neither is SLOB) and hence cannot be
considered as a replacement for SLUB.
The memory needed for compressed pages is not pre-allocated; it grows
and shrinks on demand. On initialization, ramzswap creates an xvmalloc
memory pool. When the pool does not have enough memory to satisfy an
allocation request, it grows by allocating single (0-order) pages from
the kernel page allocator. When an object is freed, xvmalloc merges it with
adjacent free blocks in the same page. If the resulting free block size
is equal to PAGE_SIZE (i.e. the page no longer contains any objects), the
page is released back to the kernel.
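The grow-on-demand and release-when-empty behavior, together with the <pagenum, offset> style of handle, can be modeled with a toy pool in user space. The sketch below is not the xvmalloc algorithm: it uses a simple bump pointer per page with no free-block merging, so freed space is only recovered when a whole page empties. All demo_* names and both constants are hypothetical, and malloc() stands in for the kernel page allocator.

```c
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096u  /* assumed page size */
#define DEMO_MAX_PAGES 16u    /* arbitrary cap for the toy pool */

struct demo_pool {
    unsigned char *page[DEMO_MAX_PAGES]; /* NULL: slot has no page */
    unsigned used[DEMO_MAX_PAGES];       /* offset of next free byte */
    unsigned live[DEMO_MAX_PAGES];       /* live objects in the page */
    unsigned npages;                     /* pages currently held */
};

/* Returns 0 and fills <pagenum, offset> on success, -1 on failure. */
static int demo_alloc(struct demo_pool *p, unsigned size,
                      unsigned *pagenum, unsigned *offset)
{
    unsigned i, slot = DEMO_MAX_PAGES;

    if (size == 0 || size > DEMO_PAGE_SIZE)
        return -1;

    /* Prefer a page that is already held and still has room. */
    for (i = 0; i < DEMO_MAX_PAGES; i++) {
        if (p->page[i] && p->used[i] + size <= DEMO_PAGE_SIZE) {
            slot = i;
            break;
        }
    }

    /* Otherwise grow the pool by exactly one page. */
    if (slot == DEMO_MAX_PAGES) {
        for (i = 0; i < DEMO_MAX_PAGES && p->page[i]; i++)
            ;
        if (i == DEMO_MAX_PAGES)
            return -1;
        p->page[i] = (unsigned char *)malloc(DEMO_PAGE_SIZE);
        if (!p->page[i])
            return -1;
        p->used[i] = 0;
        p->live[i] = 0;
        p->npages++;
        slot = i;
    }

    *pagenum = slot;
    *offset = p->used[slot];
    p->used[slot] += size;
    p->live[slot]++;
    return 0;
}

/* Drop one object; give the page back once nothing in it is live. */
static void demo_free(struct demo_pool *p, unsigned pagenum)
{
    if (pagenum >= DEMO_MAX_PAGES || !p->page[pagenum] ||
        p->live[pagenum] == 0)
        return;
    if (--p->live[pagenum] == 0) {
        free(p->page[pagenum]);
        p->page[pagenum] = NULL;
        p->npages--;
    }
}

/* Turn a handle into a pointer; stands in for the kmap() step. */
static void *demo_map(struct demo_pool *p, unsigned pagenum,
                      unsigned offset)
{
    return p->page[pagenum] + offset;
}
```

Returning a handle rather than a pointer is what lets the real allocator use pages that have no permanent kernel mapping; the caller maps the page only for as long as it needs to touch the data.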
This allocation and freeing of objects can lead to fragmentation of the
ramzswap memory. Consider the case where a lot of objects are freed
in a short period of time and, subsequently, there are very few swap
write requests. In that case, the xvmalloc pool can end up with a lot of
partially filled pages, each containing
only a small number of live
objects. To handle this case, some sort of xvmalloc memory defragmentation
scheme would need to be implemented; this could be done by
relocating objects from almost-empty pages to other pages in the xvmalloc
pool. However, it should be noted that, practically, after months of
use on several desktop machines, waste due to xvmalloc memory
fragmentation never exceeded 7%.
Swap limitations and tools
Being a block device, ramzswap can never know when a compressed page is no
longer required — say, when the owning process has exited. Such stale
(compressed) pages simply waste memory. But with the recent "swap discard" support,
this is no longer as much of a problem. Swap discard sends a BIO_RW_DISCARD bio request when it
finds a free swap cluster during swap
allocation. Although compcache does not get the callback
immediately after a page becomes stale, it is still better than just
keeping those pages in memory until they are overwritten by another
page. Support for the swap discard mechanism was added in compcache-0.5.
In general, the discard request comes
a long time after a page has become stale. Consider a case where
a memory-intensive workload terminates and there is no further
swapping activity. In that case, ramzswap will end up having lots of
stale pages. No discard requests will come to ramzswap since no further
swap allocations are being done. Once swapping activity starts
again, it is expected that discard requests will be received for some of these
stale pages. So, to make ramzswap more effective, changes are
required in the kernel (not yet done) to scan the swap bitmap more
aggressively to find any
freed swap clusters — at least in the case of RAM backed swap devices.
Also, an adaptive compressed cache resizing policy would be useful
— monitor accesses to the compressed cache and move relatively unused
pages to a physical swap device. Currently, ramzswap can simply
forward uncompressible pages to a backing swap disk, but it cannot swap out
memory allocated by xvmalloc.
Another interesting sub-project is the SwapReplay
infrastructure. This tool is meant to easily test memory allocator behavior under
actual swapping conditions. It is a kernel module and a set of
userspace tools to replay swap events in userspace. The kernel module
stacks a pseudo block device (/dev/sr_relay) over a physical swap device.
When the kernel swaps over this pseudo device, it dumps a <sector number, R/W
bit, compress length> tuple to userspace and then
forwards the I/O request to the backing swap device (provided as a
swap_replay module parameter). This data can then be parsed using a
parser library which provides a callback interface for
swap events. Clients using this library can provide any action for
these events — show compressed length histograms, simulate ramzswap
behavior etc. No kernel patching is required for this functionality.
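A client of such a callback interface might look roughly like the following. This is a sketch only: the real parser library's types, function names, and trace format are not described here, so demo_swap_event, demo_replay(), and the histogram client are all hypothetical, as is the choice to count only write events.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u  /* assumed page size */
#define DEMO_BUCKETS 8u       /* histogram buckets of 512 bytes each */

/* One <sector number, R/W bit, compress length> tuple from a trace. */
struct demo_swap_event {
    unsigned long sector;
    int is_write;
    unsigned compress_len;
};

typedef void (*demo_event_cb)(const struct demo_swap_event *ev, void *ctx);

/* Feed every event in the trace to the client's callback, in order. */
static void demo_replay(const struct demo_swap_event *trace, size_t n,
                        demo_event_cb cb, void *ctx)
{
    size_t i;

    for (i = 0; i < n; i++)
        cb(&trace[i], ctx);
}

struct demo_histogram {
    unsigned long bucket[DEMO_BUCKETS];
};

/* Example client: histogram of compressed lengths for swap writes. */
static void demo_histogram_cb(const struct demo_swap_event *ev, void *ctx)
{
    struct demo_histogram *h = (struct demo_histogram *)ctx;
    unsigned idx;

    if (!ev->is_write)
        return;  /* assumption: reads are skipped to avoid double counting */
    idx = ev->compress_len * DEMO_BUCKETS / DEMO_PAGE_SIZE;
    if (idx >= DEMO_BUCKETS)
        idx = DEMO_BUCKETS - 1;  /* a full page lands in the last bucket */
    h->bucket[idx]++;
}
```

Because the replay loop knows nothing about what the callback does, the same trace can drive a histogram, an allocator simulation, or any other client.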
The swap replay infrastructure has been very useful throughout
ramzswap development. The ability to replay swap traces allows for easy and
consistent simulation of any workload without the need to set it up and run it
again and again. So, if a user is suffering from high memory
fragmentation under some workload, they could simply send me a swap trace
for that workload and I would have all the data needed to reproduce the
condition on my side, without the need to set up the same workload.
Clients for the parser library were written to simulate ramzswap behavior
over traces from a variety of workloads leading to easier evaluation of
different memory allocators and, ultimately, development and enhancement
of the xvmalloc allocator. In the future, it will also help in testing a variety
of eviction policies to support adaptive compressed cache resizing.
Conclusion
The compcache project is currently under active development; some of the
additional features planned are: adaptive compressed cache
resizing, swapping of xvmalloc memory to a physical swap disk,
memory defragmentation by relocating compressed chunks within memory,
and compressed swapping to disk (4-5 pages swapped out with a single disk
I/O). Later, it might be extended to compress page-cache pages too
(as earlier patches did); for now, it just includes the ramzswap component to
handle anonymous memory compression.
The last time the ramzswap patches were submitted for review, only LTSP
performance data was provided as a justification for this feature.
Andrew Morton was not
satisfied with this data. However, now there is a lot more data
uploaded to the performance page on the project wiki that shows
performance improvements with ramzswap. Andrew also pointed out the lack
of data for cases where ramzswap can cause performance loss:
We would also be
interested in seeing the performance _loss_ from these
patches. There must be some cost somewhere. Find a worstish-case test
case and run it and include its results in the changelog too, so we
better understand the tradeoffs involved here.
The project still lacks data for such cases. However, it should
be available by the 2.6.32 time frame, when these patches will be posted
again for possible inclusion in mainline.
By Jonathan Corbet
May 25, 2009
LWN covered the debugfs API
back in 2004. Rather more recently, Shen Feng kindly proposed the addition
of LWN's debugfs article as a file in the Documentation directory. There
was only one little problem with that suggestion: as one might expect, the
debugfs API has changed a little since 2004. The following is an attempt
to update the original document to cover the full API as it exists in the
2.6.30 kernel.
Debugfs exists as a simple way for kernel developers to make information
available to user space. Unlike /proc, which is only meant for
information about a process, or sysfs, which has strict
one-value-per-file rules, debugfs has no rules at all. Developers can put
any information they want there. The debugfs filesystem is also intended
to not serve as a stable ABI to user space; in theory, there are no
stability constraints placed on files exported there. The real world is not always so simple, though;
even debugfs interfaces are best designed with the idea that they will need
to be maintained forever.
Debugfs is typically mounted with a command like:
mount -t debugfs none /sys/kernel/debug
(Or an equivalent /etc/fstab line).
There is occasional dissent on the mailing lists regarding the proper mount
location for debugfs, and some documentation refers to mount points like
/debug instead. For now, user-space code which uses debugfs files
will be more portable if it finds the debugfs mount point in
/proc/mounts.
Note that the debugfs API is exported GPL-only to modules.
Code using debugfs should include <linux/debugfs.h>. Then,
the first order of business will be to create at least one directory to
hold a set of debugfs files:
struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
This call, if successful, will make a directory called name
underneath the indicated parent directory. If parent is
NULL, the directory will be created in the debugfs root. On
success, the return value is a struct dentry pointer which can be
used to create files in the directory (and to clean it up at the end). A
NULL return value
indicates that something went wrong. If ERR_PTR(-ENODEV) is returned, that
is an indication that the kernel has been built without debugfs support and
none of the functions described below will work.
The most general way to create a file within a debugfs directory is with:
struct dentry *debugfs_create_file(const char *name, mode_t mode,
                                   struct dentry *parent, void *data,
                                   const struct file_operations *fops);
Here, name is the name of the file to create, mode
describes the access permissions the file should have, parent
indicates the directory which should hold the file, data will be
stored in the i_private field of the resulting inode
structure, and fops is a set of file operations which implement
the file's behavior. At a minimum, the read() and/or
write() operations should be provided; others can be included as
needed. Again, the return value will be a dentry pointer to the
created file, NULL for error, or ERR_PTR(-ENODEV) if debugfs
support is missing.
In a number of cases, the creation of a set of file operations is not
actually necessary; the debugfs code provides a number of helper functions
for simple situations. Files containing a single integer value can be
created with any of:
struct dentry *debugfs_create_u8(const char *name, mode_t mode,
                                 struct dentry *parent, u8 *value);
struct dentry *debugfs_create_u16(const char *name, mode_t mode,
                                  struct dentry *parent, u16 *value);
struct dentry *debugfs_create_u32(const char *name, mode_t mode,
                                  struct dentry *parent, u32 *value);
struct dentry *debugfs_create_u64(const char *name, mode_t mode,
                                  struct dentry *parent, u64 *value);
These files support both reading and writing the given value; if a specific
file should not be written to, simply set the mode bits
accordingly. The values in these files are in decimal; if hexadecimal is
more appropriate, the following functions can be used instead:
struct dentry *debugfs_create_x8(const char *name, mode_t mode,
                                 struct dentry *parent, u8 *value);
struct dentry *debugfs_create_x16(const char *name, mode_t mode,
                                  struct dentry *parent, u16 *value);
struct dentry *debugfs_create_x32(const char *name, mode_t mode,
                                  struct dentry *parent, u32 *value);
Note that there is no debugfs_create_x64().
These functions are useful as long as the developer knows the size of the
value to be exported. Some types can have different widths on different
architectures, though, complicating the situation somewhat. There is a
function meant to help out in one special case:
struct dentry *debugfs_create_size_t(const char *name, mode_t mode,
                                     struct dentry *parent,
                                     size_t *value);
As might be expected, this function will create a debugfs file to represent
a variable of type size_t.
Boolean values can be placed in debugfs with:
struct dentry *debugfs_create_bool(const char *name, mode_t mode,
                                   struct dentry *parent, u32 *value);
A read on the resulting file will yield either Y (for non-zero
values) or N, followed by a newline. If written to, it will
accept Y or N in either upper or lower case, or 1 or 0.
Any other input will be silently ignored.
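The read and write behavior just described can be modeled in a few lines of user-space C. This is a sketch of the documented semantics rather than the kernel's code: the demo_* names are hypothetical, and looking only at the first byte of the input is an assumption made for the model.

```c
#include <stddef.h>

/*
 * Model of a write to a debugfs boolean file. Returns 1 if the value
 * was updated, 0 if the input was silently ignored.
 */
static int demo_bool_write(unsigned *value, const char *buf, size_t len)
{
    if (len == 0)
        return 0;
    switch (buf[0]) {  /* assumption: only the first byte matters */
    case 'y':
    case 'Y':
    case '1':
        *value = 1;
        return 1;
    case 'n':
    case 'N':
    case '0':
        *value = 0;
        return 1;
    default:
        return 0;
    }
}

/* Model of a read: "Y\n" for non-zero values, "N\n" otherwise. */
static size_t demo_bool_read(unsigned value, char out[2])
{
    out[0] = value ? 'Y' : 'N';
    out[1] = '\n';
    return 2;
}
```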
Finally, a block of arbitrary binary data can be exported with:
struct debugfs_blob_wrapper {
    void *data;
    unsigned long size;
};
struct dentry *debugfs_create_blob(const char *name, mode_t mode,
                                   struct dentry *parent,
                                   struct debugfs_blob_wrapper *blob);
A read of this file will return the data pointed to by the
debugfs_blob_wrapper structure. Some drivers use "blobs" as a
simple way to return several lines of (static) formatted text output. This
function can be used to export binary information, but there does not
appear to be any code which does so in the mainline. Note that files
created with debugfs_create_blob() are read-only.
There are a couple of other directory-oriented helper functions:
struct dentry *debugfs_rename(struct dentry *old_dir,
                              struct dentry *old_dentry,
                              struct dentry *new_dir,
                              const char *new_name);
struct dentry *debugfs_create_symlink(const char *name,
                                      struct dentry *parent,
                                      const char *target);
A call to debugfs_rename() will give a new name to an existing
debugfs file, possibly in a different directory. The new_name
must not exist prior to the call; the return value is old_dentry
with updated information. Symbolic links can be created with
debugfs_create_symlink().
There is one important thing that all debugfs users must take into account:
there is no automatic cleanup of any directories created in debugfs. If a
module is unloaded without explicitly removing debugfs entries, the result
will be a lot of stale pointers and no end of highly antisocial behavior.
So all debugfs users - at least those which can be built as modules - must
be prepared to remove all files and directories they create there. A file
can be removed with:
void debugfs_remove(struct dentry *dentry);
The dentry value can be NULL.
Once upon a time, debugfs users were required to remember the
dentry pointer for every debugfs file they created so that they
could all be cleaned up. We live in more civilized times now, though, and
debugfs users can call:
void debugfs_remove_recursive(struct dentry *dentry);
If this function is passed a pointer for the dentry corresponding
to the top-level directory, the entire hierarchy below that directory will
be removed.
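Putting the pieces together, a minimal module using this API might look like the sketch below. It is written against the 2.6.30-era signatures shown in this article and has to be built as a kernel module rather than run in user space. The directory name "demo", the file names, and all demo_* identifiers are hypothetical.

```c
#include <linux/module.h>
#include <linux/init.h>
#include <linux/types.h>
#include <linux/errno.h>
#include <linux/err.h>
#include <linux/debugfs.h>

static struct dentry *demo_dir;  /* hypothetical directory: <debugfs>/demo */
static u32 demo_counter;         /* exported in decimal as "counter" */
static u32 demo_enabled = 1;     /* exported as Y/N in "enabled" */

static int __init demo_init(void)
{
    demo_dir = debugfs_create_dir("demo", NULL);
    if (IS_ERR(demo_dir))        /* -ENODEV: kernel built without debugfs */
        return PTR_ERR(demo_dir);
    if (!demo_dir)               /* NULL: something went wrong */
        return -ENOMEM;

    if (!debugfs_create_u32("counter", 0644, demo_dir, &demo_counter) ||
        !debugfs_create_bool("enabled", 0644, demo_dir, &demo_enabled)) {
        /* Undo whatever was created before failing the load. */
        debugfs_remove_recursive(demo_dir);
        return -ENOMEM;
    }
    return 0;
}

static void __exit demo_exit(void)
{
    /* No automatic cleanup exists, so remove the whole hierarchy. */
    debugfs_remove_recursive(demo_dir);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");  /* the debugfs API is exported GPL-only */
```

Only the top-level dentry is remembered here; debugfs_remove_recursive() takes care of the files beneath it.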
Page editor: Jonathan Corbet