
Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.30-rc7, released on May 23. "So go wild. I suspect I'll do an -rc8, but we're definitely getting closer to release-time - it would be good to get as much testing as possible, and it should generally be pretty safe to try this all out." The long-format changelog has the details.

The current stable 2.6 kernel remains 2.6.29.4; there have been no stable releases over the last week.


Kernel development news

Quotes of the week

Interesting, how telling somebody that they need to learn C is considered an unacceptable thing to do. Hostile to newbies, or some such. Introducing more magic that has to be learnt if one wants to read the kernel source, OTOH, is just fine...
-- Al Viro

Sorry but are you really suggesting every program in the world that uses write() anywhere should put it into a loop? That seems just like really bad API design to me, requiring such contortions in a fundamental system call just to work around kernel deficiencies.

I can just imagine the programmers putting nasty comments about the Linux kernel on top of those loops and they would be fully deserved.

-- Andi Kleen discovers POSIX

Hey, don't look at me - blame Brian Kernighan or George Bush or someone.
-- Andrew Morton disclaims responsibility


In brief

By Jonathan Corbet
May 27, 2009
Union directories. While a number of developers are working on the full union mount problem, Miklos Szeredi has taken a simpler approach: union directories. Only top-level directory unification is provided, and changes can only be made to the top-level filesystem. That eliminates the need for a lot of complex code doing directory copy-up, whiteouts, and such, but also reduces the functionality significantly.

Optimizing writeback timers. On a normal Linux system, the pdflush process wakes up every five seconds to force dirty page cache pages back to their backing store on disk. This wakeup happens whether or not there is anything needing to be written back. Unnecessary wakeups are increasingly unwelcome, especially on systems where power consumption matters, so it would be nice to let pdflush sleep when there is nothing for it to do.

Artem Bityutskiy has put together a patch set to do just that. It changes the filesystem API to make it easier for the core VFS to know when a specific filesystem has dirty data. That information is then used to decide whether pdflush needs to be roused from its slumber. The idea seems good, but there's one little problem: this work conflicts with the per-BDI flusher threads patches by Jens Axboe. Jens's patches get rid of the pdflush timer and make a lot of other changes, so these two projects do not currently play well together. So Artem is headed back to the drawing board to base his work on top of Jens's patches instead of the mainline.

recvmmsg(). Arnaldo Carvalho de Melo has proposed a new system call for the socket API:

    struct mmsghdr {
	struct msghdr	msg_hdr;
	unsigned	msg_len;
    };

    ssize_t recvmmsg(int socket, struct mmsghdr *mmsg, int vlen, int flags);

The difference between this system call and recvmsg() is that it is able to accept multiple messages with a single call. That, in turn, reduces system call overhead in high-bandwidth network applications. The comments in the patch suggest that sendmmsg() is in the plans, but no implementation has been posted yet.

There was a suggestion that this functionality could be obtained by extending recvmsg() with a new message flag, rather than adding a new system call. But, as David Miller pointed out, that won't work. The kernel currently ignores unrecognized flags; that will make it impossible for user space to determine whether a specific kernel supports multiple-message receives or not. So the new system call is probably how this feature will be added.
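From user space, batching looks roughly like the sketch below. Note that this uses the recvmmsg() interface as it later became available through glibc, which added a final timeout argument not present in the proposal above; recv_batch() and the buffer sizes are invented for the example.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Receive up to vlen datagrams (at most 16) with a single recvmmsg()
 * call. Each datagram lands in its own 64-byte buffer; returns the
 * number of messages received, or -1 on error. */
int recv_batch(int fd, char bufs[][64], unsigned int vlen, int flags)
{
    struct mmsghdr msgs[16];
    struct iovec iovs[16];

    if (vlen > 16)
        vlen = 16;
    memset(msgs, 0, sizeof(msgs));
    for (unsigned int i = 0; i < vlen; i++) {
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len  = sizeof(bufs[0]);
        msgs[i].msg_hdr.msg_iov    = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    /* The merged interface grew a final timeout argument (NULL = none). */
    return recvmmsg(fd, msgs, vlen, flags, NULL);
}
```

One call replaces what would otherwise be a recvmsg() loop with one system call per datagram; with MSG_DONTWAIT, it drains whatever is queued without blocking.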


Developer statistics for 2.6.30

By Jonathan Corbet
May 27, 2009
As the 2.6.30 development cycle heads toward a close, it is natural to look back at what has been merged and where it came from. So here is LWN's traditional look at who wrote the code which went into the mainline this time around.

Once again, 2.6.30 was a large development cycle; it saw the incorporation (through just after 2.6.30-rc7) of 11,733 non-merge changesets from 1125 developers. The changeset count exceeds that of 2.6.29, but the number of developers falls just short of the 1166 seen last time around. Those developers added 1.14 million lines of code this time around, while taking out 513,000, for a net growth of 624,000 lines.

The individual developer statistics for 2.6.30 look like:

Most active 2.6.30 developers

By changesets
  Ingo Molnar                 324   2.8%
  Bill Pemberton              227   1.9%
  Stephen Hemminger           204   1.7%
  Hans Verkuil                199   1.7%
  Takashi Iwai                188   1.6%
  Bartlomiej Zolnierkiewicz   186   1.6%
  Steven Rostedt              179   1.5%
  Greg Kroah-Hartman          150   1.3%
  Jeremy Fitzhardinge         125   1.1%
  Mark Brown                  107   0.9%
  Jaswinder Singh Rajput      105   0.9%
  Rusty Russell               100   0.9%
  Tejun Heo                    98   0.8%
  Johannes Berg                98   0.8%
  Hannes Eder                  88   0.8%
  Michal Simek                 85   0.7%
  Luis R. Rodriguez            85   0.7%
  Sujith                       85   0.7%
  David Howells                80   0.7%
  Yinghai Lu                   78   0.7%

By changed lines
  Greg Kroah-Hartman       120353   9.0%
  ADDI-DATA GmbH            43420   3.3%
  Mithlesh Thukral          42424   3.2%
  Alex Deucher              26576   2.0%
  David Schleef             25905   1.9%
  David Woodhouse           24636   1.8%
  Ramkrishna Vepa           23495   1.8%
  Lior Dotan                22506   1.7%
  Eric Moore                22266   1.7%
  Eilon Greenstein          18399   1.4%
  Jaswinder Singh Rajput    18168   1.4%
  Hans Verkuil              18048   1.4%
  David Howells             17941   1.3%
  Andy Grover               16355   1.2%
  Michal Simek              15827   1.2%
  Sri Deevi                 15514   1.2%
  Frank Mori Hess           15450   1.2%
  Ben Hutchings             15031   1.1%
  Ingo Molnar               13876   1.0%
  Bill Pemberton            13817   1.0%

On the changesets side, Ingo Molnar is at the top of the list this time around; as usual, he created a vast number of patches - about five per day - in the x86 architecture code, ftrace, and beyond. Bill Pemberton is perhaps better known as the maintainer of the Elm mail client; he did a lot of cleanup work with the COMEDI drivers in the -staging tree. The bulk of Stephen Hemminger's work involved converting network drivers to the new net_device_ops API. Hans Verkuil continues to improve the Video4Linux2 framework and associated drivers, and Takashi Iwai continues to generate a lot of patches as the ALSA maintainer.

Linus kicked off the 2.6.30 development cycle by noting that about one third of the changes in 2.6.30-rc1 were "crap." So, unsurprisingly, the top three entries in the "by changed lines" column all got there through the addition of -staging drivers. Alex Deucher added Radeon R6xx/R7xx support; many of his "changed lines" were associated with microcode firmware. And David Schleef added another set of drivers to the -staging tree.

Contributions to 2.6.30 could be traced back to some 190 employers. Looking at the most-active employer information, we see:

Most active 2.6.30 employers

By changesets
  (None)                     1970  16.8%
  Red Hat                    1305  11.1%
  (Unknown)                  1184  10.1%
  Intel                       855   7.3%
  Novell                      832   7.1%
  IBM                         630   5.4%
  (Consultant)                293   2.5%
  Atheros Communications      262   2.2%
  Oracle                      252   2.1%
  University of Virginia      227   1.9%
  Fujitsu                     217   1.8%
  Vyatta                      204   1.7%
  Renesas Technology          152   1.3%
  NTT                         121   1.0%
  MontaVista                  115   1.0%
  HP                          107   0.9%
  Wolfson Microelectronics    105   0.9%
  (Academia)                  102   0.9%
  Nokia                        98   0.8%
  XenSource                    91   0.8%

By lines changed
  (Unknown)                181413  13.6%
  Novell                   164229  12.3%
  (None)                   118095   8.9%
  Intel                     86060   6.5%
  Red Hat                   73954   5.5%
  LinSysSoft Technologies   64798   4.9%
  ADDI-DATA GmbH            43420   3.3%
  SofaWare                  39245   2.9%
  Broadcom                  31956   2.4%
  AMD                       28364   2.1%
  Entropy Wave              25905   1.9%
  IBM                       25702   1.9%
  Oracle                    25588   1.9%
  NTT                       25235   1.9%
  Neterion                  23495   1.8%
  LSI Logic                 22304   1.7%
  Atheros Communications    21627   1.6%
  (Consultant)              19209   1.4%
  Freescale                 16139   1.2%
  PetaLogix                 15846   1.2%

These numbers are somewhat similar to those seen in previous development cycles. There are a few unfamiliar companies here; they are pretty much all present as a result of contributions to -staging. It is interesting to note that Atheros and Broadcom, once known as uncooperative companies, are increasing their contributions over time.

Your editor has not looked at signoff statistics for the last few cycles. The interesting thing to be found in Signed-off-by tags is an indication of who the gatekeepers to the kernel are. Especially if one disregards signoffs by the author of each patch, what remains is (mostly) the signoffs of subsystem maintainers who approved the patches for merging. For 2.6.30, these numbers look like this:

Top non-author signoffs in 2.6.30

Individuals
  David S. Miller            1216  12.1%
  John W. Linville            865   8.6%
  Ingo Molnar                 836   8.3%
  Greg Kroah-Hartman          797   7.9%
  Mauro Carvalho Chehab       784   7.8%
  Andrew Morton               660   6.6%
  James Bottomley             250   2.5%
  Linus Torvalds              219   2.2%
  Len Brown                   189   1.9%
  Takashi Iwai                165   1.6%
  Jeff Kirsher                145   1.4%
  Russell King                127   1.3%
  H. Peter Anvin              120   1.2%
  Mark Brown                  115   1.1%
  Jesse Barnes                111   1.1%
  Benjamin Herrenschmidt      111   1.1%
  Reinette Chatre             104   1.0%
  Martin Schwidefsky           95   0.9%
  Avi Kivity                   91   0.9%
  Paul Mundt                   89   0.9%

Employers
  Red Hat                    4264  42.4%
  Novell                     1386  13.8%
  Intel                       951   9.5%
  Google                      660   6.6%
  (None)                      408   4.1%
  IBM                         378   3.8%
  Linux Foundation            219   2.2%
  (Consultant)                166   1.6%
  (Unknown)                   127   1.3%
  Wolfson Microelectronics    115   1.1%
  Renesas Technology           92   0.9%
  Marvell                      91   0.9%
  Atomide                      81   0.8%
  Oracle                       80   0.8%
  Astaro                       65   0.6%
  Freescale                    63   0.6%
  Cisco                        61   0.6%
  Analog Devices               60   0.6%
  Univ. of Michigan CITI       59   0.6%
  Panasas                      58   0.6%

Signoffs have always been more concentrated than contributions in general. Still, one wonders how David Miller manages to approve a solid twenty patches every day. On the employer side, things are more concentrated than ever; over half of the patches going into the kernel go through the hands of a developer at Red Hat or Novell. Developers, it seems, work for a great many companies, but subsystem maintainers gravitate toward a small handful of firms.

All told, the picture remains one of a well-oiled, fast-moving development process. We also see a picture of a -staging tree which is growing at a tremendous rate; your editor is tempted to exclude -staging patches from future reports if the rate does not slow somewhat. Even without -staging, though, a lot of work is being done on the kernel, with the participation of a large group of developers, and it doesn't look like it will be slowing down anytime soon.

Postscript: Jan Engelhardt sent your editor a pointer to a short script which, through use of the git blame command, tallies up the "ownership" of every line in the kernel. The top results for 2.6.30-rc7 look like this:

Who last touched kernel code lines

    Lines      Pct   Who
  4063723   35.17%   Linus Torvalds
   464021    4.02%   Greg Kroah-Hartman
    94200    0.82%   David Howells
    86031    0.74%   David S. Miller
    82608    0.71%   Luis R. Rodriguez
    72200    0.62%   Bryan Wu
    70128    0.61%   Takashi Iwai
    66859    0.58%   Ralf Baechle
    55785    0.48%   Hans Verkuil
    54069    0.47%   Paul Mundt
    54007    0.47%   Kumar Gala
    53288    0.46%   David Brownell
    51640    0.45%   Russell King
    50611    0.44%   Paul Mackerras
    49499    0.43%   Andrew Victor
    49347    0.43%   Mauro Carvalho Chehab
    49256    0.43%   Alan Cox
    47305    0.41%   Mikael Starvik
    47040    0.41%   Ben Dooks
    44307    0.38%   Benjamin Herrenschmidt

Linus shows a high ownership because he was the initial committer at the beginning of the git era. To a rough approximation, one can conclude that approximately one third of the code in the kernel has not been touched since that time. There are other interesting things which can be done with line-level statistics; your editor plans to explore this idea some in the future.


Compcache: in-memory compressed swapping

May 26, 2009

This article was contributed by Nitin Gupta

The idea of memory compression—compress relatively unused pages and store them in memory itself—is simple and has been around for a long time. Compression, through the elimination of expensive disk I/O, is far faster than swapping those pages to secondary storage. When a page is needed again, it is decompressed and given back, which is, again, much faster than going to swap.
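The mechanics can be seen in a userspace toy: compress a page on the way out, store only the compressed bytes, and decompress on fault. The run-length encoder below is purely illustrative (compcache itself uses LZO), and both function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy run-length encoder standing in for LZO: the page is stored as
 * (count, byte) pairs. Returns the compressed size, or 0 if the result
 * would not fit in out_max bytes; a real driver would then send the
 * page to the backing swap device uncompressed. */
size_t rle_compress(const uint8_t *page, size_t len,
                    uint8_t *out, size_t out_max)
{
    size_t o = 0;

    for (size_t i = 0; i < len; ) {
        size_t run = 1;

        while (i + run < len && run < 255 && page[i + run] == page[i])
            run++;
        if (o + 2 > out_max)
            return 0;
        out[o++] = (uint8_t)run;
        out[o++] = page[i];
        i += run;
    }
    return o;
}

/* Expand the (count, byte) pairs back into a page; returns the number
 * of bytes written. */
size_t rle_decompress(const uint8_t *in, size_t in_len,
                      uint8_t *page, size_t max)
{
    size_t o = 0;

    for (size_t i = 0; i + 1 < in_len; i += 2)
        for (int r = 0; r < in[i] && o < max; r++)
            page[o++] = in[i + 1];
    return o;
}
```

A mostly uniform 4096-byte page collapses to a few dozen bytes this way; the real win, as described above, is that both directions are in-memory operations rather than disk I/O.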

An implementation of this idea on Linux is currently under development as the compcache project. It creates a virtual block device (called ramzswap) which acts as a swap disk. Pages swapped to this disk are compressed and stored in memory itself. The project home contains use cases, performance numbers, and other related bits. The whole aim of the project is not just performance — on swapless setups, it allows running applications that would otherwise simply fail due to lack of memory. For example, Edubuntu included compcache to lower the RAM requirements of its installer.

The performance page on the project wiki shows numbers for configurations that closely match netbooks, thin clients, and embedded devices. These initial results look promising. For example, in the benchmark for thin clients, ramzswap gives nearly the same effect as doubling the memory. Another benchmark shows that the average time required to complete swap requests is reduced drastically with ramzswap. With a swap partition located on a 10000 RPM disk, the average times for swap read and write requests were found to be 168ms and 355ms, respectively, while with ramzswap the corresponding numbers were a mere 12µs and 7µs — and this includes the time for checking zero-filled pages and compressing/decompressing all non-zero pages.

The approach of using a virtual block device is a major simplification over earlier attempts. The previous implementation required changes to the swap write path, page fault handler, and page cache lookup functions (find_get_page() and friends). Those patches did not gain widespread acceptance due to their intrusive nature. The new approach is far less intrusive, but at a cost: compcache has lost the ability to compress page cache (filesystem backed) pages. It can now compress swap cache (anonymous) pages only. At the same time, this simplicity and non-intrusiveness got it included in Ubuntu, ALT Linux, LTSP (Linux Terminal Server Project) and maybe other places as well.

It should be noted that, when used at the hypervisor level, compcache can compress any part of the guest memory and for any kind of guest OS (Linux, Windows etc) — this should allow running more virtual machines for a given amount of total host memory. For example, in KVM the guest physical memory is simply anonymous memory for the host (Linux kernel in this case). Also, with the recent MMU notifier support included in the Linux kernel, nearly the entire guest physical memory is now swappable [PDF].

Implementation

All of the individual components are separate kernel modules:

  • LZO compressor: lzo_compress.ko, lzo_decompress.ko (already in mainline)
  • xvMalloc memory allocator: xvmalloc.ko
  • compcache block device driver: ramzswap.ko
Once these modules are loaded, one can just enable the ramzswap swap device:
    swapon /dev/ramzswap0
Note that ramzswap cannot be used as a generic block device. It can only handle page-aligned I/O, which is sufficient for use as a swap device. No use case has yet come to light that would justify the effort to make it a generic compressed read-write block device. Also, to minimize block layer overhead, ramzswap uses the "no queue" mode of operation. Thus, it accepts requests directly from the block layer and avoids all overhead due to request queue logic.

The ramzswap module accepts parameters for "disk" size, memory limit, and backing swap partition. The optional backing swap partition parameter names the physical disk swap partition to which ramzswap will forward read/write requests for pages that compress to a size larger than PAGE_SIZE/2 — so only highly compressible pages are kept in memory. Additionally, zero-filled pages are detected and no memory is allocated for them at all. For "generic" desktop workloads (Firefox, email client, editor, media player etc.), we typically see 4000-5000 zero-filled pages.
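Those two checks are simple to state in code. This is a userspace sketch with invented names; the real driver operates on struct page and its policy logic is more involved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096

/* Zero-filled pages need no storage at all: the driver just records
 * a flag for them in its page table. */
bool page_is_zero_filled(const uint8_t *page)
{
    for (size_t i = 0; i < TOY_PAGE_SIZE; i++)
        if (page[i])
            return false;
    return true;
}

/* The policy described above: keep a page in memory only if it
 * compressed to less than half a page; otherwise forward it to the
 * backing swap partition, if one was configured. */
bool keep_in_memory(size_t compressed_len)
{
    return compressed_len < TOY_PAGE_SIZE / 2;
}
```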

Memory management

One of the biggest challenges in this project is managing variable-sized compressed chunks. For this, ramzswap uses a memory allocator called xvmalloc, developed specifically for this project. It has O(1) malloc/free, very low fragmentation (within 10% of ideal in all tests), and can use highmem (useful on 32-bit systems with more than 1G of memory). It exports a non-standard allocator interface:

    struct xv_pool *xv_create_pool(void);
    void xv_destroy_pool(struct xv_pool *pool);

    int xv_malloc(struct xv_pool *pool, u32 size, u32 *pagenum, u32 *offset, gfp_t flags);
    void xv_free(struct xv_pool *pool, u32 pagenum, u32 offset);

xv_malloc() returns a <pagenum, offset> pair. It is then up to the caller to map this page (with kmap()) to get a valid kernel-space pointer.

The justification for the use of a custom memory allocator was provided when the compcache patches were posted to linux-kernel. Both the SLOB and SLUB allocators were found to be unsuitable for use in this project. SLOB targets embedded devices and claims to have good space efficiency. However, it was found to have some major problems: It has O(n) alloc/free behavior and can lead to large amounts of wasted space as detailed in this LKML post.

SLUB, on the other hand, has a different set of problems. The first is the usual fragmentation issue: the data presented here shows that kmalloc uses ~43% more memory than xvmalloc. Another problem is that it depends on allocating higher-order pages to reduce fragmentation. This is not acceptable for ramzswap, since it is used in tight-memory situations where higher-order allocations are almost guaranteed to fail. The xvmalloc allocator, by contrast, always allocates zero-order pages when it needs to expand a memory pool.

Also, both SLUB and SLOB are limited to allocating from low memory. This limitation applies only to 32-bit systems with more than 1G of memory; on such systems, neither allocator is able to allocate from the high memory zone. That restriction is not acceptable for the compcache project: users with such configurations reported memory allocation failures from ramzswap (before xvmalloc was developed) even when plenty of high memory was available. The xvmalloc allocator, in contrast, is able to allocate from the high memory region.

Considering the above points, xvmalloc could potentially replace the SLOB allocator. However, this would involve a lot of additional work, as xvmalloc provides a non-standard malloc/free interface. Also, xvmalloc is not scalable in its current state (neither is SLOB) and hence cannot be considered as a replacement for SLUB.

The memory needed for compressed pages is not pre-allocated; it grows and shrinks on demand. On initialization, ramzswap creates an xvmalloc memory pool. When the pool does not have enough memory to satisfy an allocation request, it grows by allocating single (0-order) pages from the kernel page allocator. When an object is freed, xvmalloc merges it with adjacent free blocks in the same page. If the resulting free block size is equal to PAGE_SIZE (i.e. the page no longer contains any objects), the page is released back to the kernel.

This allocation and freeing of objects can lead to fragmentation of the ramzswap memory. Consider the case where a lot of objects are freed in a short period of time and, subsequently, there are very few swap write requests. In that case, the xvmalloc pool can end up with a lot of partially filled pages, each containing only a small number of live objects. To handle this case, some sort of xvmalloc memory defragmentation scheme would need to be implemented; this could be done by relocating objects from almost-empty pages to other pages in the xvmalloc pool. However, it should be noted that, practically, after months of use on several desktop machines, waste due to xvmalloc memory fragmentation never exceeded 7%.

Swap limitations and tools

Being a block device, ramzswap can never know when a compressed page is no longer required — say, when the owning process has exited. Such stale (compressed) pages simply waste memory. But with the recent "swap discard" support, this is no longer as much of a problem. Swap discard sends a BIO_RW_DISCARD bio request when it finds a free swap cluster during swap allocation. Although compcache does not get the callback immediately after a page becomes stale, it is still better than just keeping those pages in memory until they are overwritten by another page. Support for the swap discard mechanism was added in compcache-0.5.

In general, the discard request comes a long time after a page has become stale. Consider a case where a memory-intensive workload terminates and there is no further swapping activity. In those cases, ramzswap will end up having lots of stale pages. No discard requests will come to ramzswap since no further swap allocations are being done. Once swapping activity starts again, it is expected that discard requests will be received for some of these stale pages. So, to make ramzswap more effective, changes are required in the kernel (not yet done) to scan the swap bitmap more aggressively to find any freed swap clusters — at least in the case of RAM backed swap devices. Also, an adaptive compressed cache resizing policy would be useful — monitor accesses to the compressed cache and move relatively unused pages to a physical swap device. Currently, ramzswap can simply forward uncompressible pages to a backing swap disk, but it cannot swap out memory allocated by xvmalloc.

Another interesting sub-project is the SwapReplay infrastructure. This tool is meant to make it easy to test memory allocator behavior under actual swapping conditions. It is a kernel module and a set of userspace tools to replay swap events in user space. The kernel module stacks a pseudo block device (/dev/sr_relay) over a physical swap device. When the kernel swaps over this pseudo device, the module dumps a <sector number, R/W bit, compressed length> tuple to user space and then forwards the I/O request to the backing swap device (provided as a swap_replay module parameter). This data can then be parsed using a parser library which provides a callback interface for swap events. Clients using this library can take any action for these events — show compressed-length histograms, simulate ramzswap behavior, etc. No kernel patching is required for this functionality.

The swap replay infrastructure has been very useful throughout ramzswap development. The ability to replay swap traces allows easy and consistent simulation of any workload without the need to set it up and run it again and again. So, if a user is suffering from high memory fragmentation under some workload, he could simply send me a swap trace for that workload, and I would have all the data needed to reproduce the condition on my side — without the need to set up the same workload.

Clients for the parser library were written to simulate ramzswap behavior over traces from a variety of workloads, leading to easier evaluation of different memory allocators and, ultimately, to the development and enhancement of the xvmalloc allocator. In the future, it will also help in testing a variety of eviction policies to support adaptive compressed cache resizing.

Conclusion

The compcache project is currently under active development; some of the additional features planned are: adaptive compressed cache resizing, swapping of xvmalloc memory out to a physical swap disk, memory defragmentation by relocating compressed chunks within memory, and compressed swapping to disk (4-5 pages swapped out with a single disk I/O). Later, it might be extended to compress page-cache pages too (as the earlier patches did); for now, it just includes the ramzswap component to handle anonymous memory compression.

Last time the ramzswap patches were submitted for review, only LTSP performance data was provided as a justification for this feature. Andrew Morton was not satisfied with this data. However, there is now a lot more data uploaded to the performance page on the project wiki showing performance improvements with ramzswap. Andrew also pointed out the lack of data for cases where ramzswap can cause a performance loss:

We would also be interested in seeing the performance _loss_ from these patches. There must be some cost somewhere. Find a worstish-case test case and run it and include its results in the changelog too, so we better understand the tradeoffs involved here.

The project still lacks data for such cases. However, it should be available by the 2.6.32 time frame, when these patches will be posted again for possible inclusion in mainline.


An updated guide to debugfs

By Jonathan Corbet
May 25, 2009
LWN covered the debugfs API back in 2004. Rather more recently, Shen Feng kindly proposed the addition of LWN's debugfs article as a file in the Documentation directory. There was only one little problem with that suggestion: as one might expect, the debugfs API has changed a little since 2004. The following is an attempt to update the original document to cover the full API as it exists in the 2.6.30 kernel.

Debugfs exists as a simple way for kernel developers to make information available to user space. Unlike /proc, which is only meant for information about a process, or sysfs, which has strict one-value-per-file rules, debugfs has no rules at all. Developers can put any information they want there. The debugfs filesystem is also intended to not serve as a stable ABI to user space; in theory, there are no stability constraints placed on files exported there. The real world is not always so simple, though; even debugfs interfaces are best designed with the idea that they will need to be maintained forever.

Debugfs is typically mounted with a command like:

    mount -t debugfs none /sys/kernel/debug

(Or an equivalent /etc/fstab line). There is occasional dissent on the mailing lists regarding the proper mount location for debugfs, and some documentation refers to mount points like /debug instead. For now, user-space code which uses debugfs files will be more portable if it finds the debugfs mount point in /proc/mounts.
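That lookup takes only a few lines of C. The sketch below (find_debugfs() is an invented name) parses text in /proc/mounts format rather than opening the file directly, but the logic is the same: take the mount point of the first filesystem whose type is "debugfs".

```c
#include <stdio.h>
#include <string.h>

/* Scan text in /proc/mounts format ("device dir type options ...")
 * for the first filesystem of type "debugfs" and copy its mount point
 * into buf. Returns 1 on success, 0 if debugfs is not mounted. */
int find_debugfs(const char *mounts, char *buf, size_t buflen)
{
    char dev[64], dir[256], type[64];
    const char *line = mounts;

    while (line && *line) {
        if (sscanf(line, "%63s %255s %63s", dev, dir, type) == 3 &&
            strcmp(type, "debugfs") == 0) {
            snprintf(buf, buflen, "%s", dir);
            return 1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return 0;
}
```

A real program would read /proc/mounts into a buffer (or use setmntent() and getmntent() from <mntent.h>) and fall back to a default location if no debugfs mount is found.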

Note that the debugfs API is exported GPL-only to modules.

Code using debugfs should include <linux/debugfs.h>. Then, the first order of business will be to create at least one directory to hold a set of debugfs files:

    struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);

This call, if successful, will make a directory called name underneath the indicated parent directory. If parent is NULL, the directory will be created in the debugfs root. On success, the return value is a struct dentry pointer which can be used to create files in the directory (and to clean it up at the end). A NULL return value indicates that something went wrong. If -ENODEV is returned, that is an indication that the kernel has been built without debugfs support and none of the functions described below will work.

The most general way to create a file within a debugfs directory is with:

    struct dentry *debugfs_create_file(const char *name, mode_t mode,
				       struct dentry *parent, void *data,
				       const struct file_operations *fops);

Here, name is the name of the file to create, mode describes the access permissions the file should have, parent indicates the directory which should hold the file, data will be stored in the i_private field of the resulting inode structure, and fops is a set of file operations which implement the file's behavior. At a minimum, the read() and/or write() operations should be provided; others can be included as needed. Again, the return value will be a dentry pointer to the created file, NULL for error, or -ENODEV if debugfs support is missing.

In a number of cases, the creation of a set of file operations is not actually necessary; the debugfs code provides a number of helper functions for simple situations. Files containing a single integer value can be created with any of:

    struct dentry *debugfs_create_u8(const char *name, mode_t mode,
				     struct dentry *parent, u8 *value);
    struct dentry *debugfs_create_u16(const char *name, mode_t mode,
				      struct dentry *parent, u16 *value);
    struct dentry *debugfs_create_u32(const char *name, mode_t mode,
				      struct dentry *parent, u32 *value);
    struct dentry *debugfs_create_u64(const char *name, mode_t mode,
				      struct dentry *parent, u64 *value);

These files support both reading and writing the given value; if a specific file should not be written to, simply set the mode bits accordingly. The values in these files are in decimal; if hexadecimal is more appropriate, the following functions can be used instead:

    struct dentry *debugfs_create_x8(const char *name, mode_t mode,
				     struct dentry *parent, u8 *value);
    struct dentry *debugfs_create_x16(const char *name, mode_t mode,
				      struct dentry *parent, u16 *value);
    struct dentry *debugfs_create_x32(const char *name, mode_t mode,
				      struct dentry *parent, u32 *value);

Note that there is no debugfs_create_x64().

These functions are useful as long as the developer knows the size of the value to be exported. Some types can have different widths on different architectures, though, complicating the situation somewhat. There is a function meant to help out in one special case:

    struct dentry *debugfs_create_size_t(const char *name, mode_t mode,
				         struct dentry *parent, 
					 size_t *value);

As might be expected, this function will create a debugfs file to represent a variable of type size_t.

Boolean values can be placed in debugfs with:

    struct dentry *debugfs_create_bool(const char *name, mode_t mode,
				       struct dentry *parent, u32 *value);

A read on the resulting file will yield either Y (for non-zero values) or N, followed by a newline. If written to, it will accept either upper- or lower-case values, or 1 or 0. Any other input will be silently ignored.
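Those write semantics can be modeled in a few lines of userspace C. This is an illustration only — parse_bool_write() is an invented name and the in-kernel implementation differs — but it captures which characters are accepted and which are silently ignored.

```c
/* Model of how a write to a debugfs boolean file is interpreted:
 * 'Y'/'y'/'1' set the value, 'N'/'n'/'0' clear it, and anything
 * else is silently ignored, leaving the value unchanged.
 * Returns 1 if the character was accepted, 0 otherwise. */
int parse_bool_write(char c, unsigned int *value)
{
    switch (c) {
    case 'Y': case 'y': case '1':
        *value = 1;
        return 1;
    case 'N': case 'n': case '0':
        *value = 0;
        return 1;
    default:
        return 0;    /* silently ignored */
    }
}
```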

Finally, a block of arbitrary binary data can be exported with:

    struct debugfs_blob_wrapper {
	void *data;
	unsigned long size;
    };

    struct dentry *debugfs_create_blob(const char *name, mode_t mode,
				       struct dentry *parent,
				       struct debugfs_blob_wrapper *blob);

A read of this file will return the data pointed to by the debugfs_blob_wrapper structure. Some drivers use "blobs" as a simple way to return several lines of (static) formatted text output. This function can be used to export binary information, but there does not appear to be any code which does so in the mainline. Note that files created with debugfs_create_blob() are read-only.

There are a couple of other directory-oriented helper functions:

    struct dentry *debugfs_rename(struct dentry *old_dir, 
    				  struct dentry *old_dentry,
		                  struct dentry *new_dir, 
				  const char *new_name);

    struct dentry *debugfs_create_symlink(const char *name, 
                                          struct dentry *parent,
				      	  const char *target);

A call to debugfs_rename() will give a new name to an existing debugfs file, possibly in a different directory. The new_name must not exist prior to the call; the return value is old_dentry with updated information. Symbolic links can be created with debugfs_create_symlink().

There is one important thing that all debugfs users must take into account: there is no automatic cleanup of any directories created in debugfs. If a module is unloaded without explicitly removing debugfs entries, the result will be a lot of stale pointers and no end of highly antisocial behavior. So all debugfs users - at least those which can be built as modules - must be prepared to remove all files and directories they create there. A file can be removed with:

    void debugfs_remove(struct dentry *dentry);

The dentry value can be NULL.

Once upon a time, debugfs users were required to remember the dentry pointer for every debugfs file they created so that they could all be cleaned up. We live in more civilized times now, though, and debugfs users can call:

    void debugfs_remove_recursive(struct dentry *dentry);

If this function is passed a pointer for the dentry corresponding to the top-level directory, the entire hierarchy below that directory will be removed.


Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.30-rc7
Thomas Gleixner 2.6.29.4-rt15
Thomas Gleixner 2.6.29.4-rt16

Architecture-specific

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Arnaldo Carvalho de Melo New socket API: recvmmsg
Dmitry Eremin-Solenikov IEEE 802.15.4 implementation for Linux

Security-related

Virtualization and containers

Benchmarks and bugs

Page editor: Jonathan Corbet


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds