Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.35-rc5, released on July 12. "And I merged the ARM defconfig minimization thing, which isn't the final word on the whole defconfig issue, but since it removes almost 200 _thousand_ lines of ARM defconfig noise, it's a pretty big deal." We looked at the ARM defconfig issue a few weeks back, and Linus has pulled from Uwe Kleine-König's tree that provides a starting point for the defconfig cleanup. The short-form changelog is appended to the announcement, and all the details are available in the full changelog.

Five stable kernels were released on July 5: 2.6.27.48, 2.6.31.14, 2.6.32.16, 2.6.33.6, and 2.6.34.1.

Quotes of the week

hrmpf, one of those wonderful messages where neither it nor its source code give you any clue regarding what caused it nor how to fix it.

-- Andrew Morton

This has been an especially interesting year in the field. We've landed the infrastructure for generic runtime power management, glued that into PCI and started implementing that at the driver level. pm_qos is being reworked to improve performance and scalability as we start seeing more drivers that need to express their own constraints. And, of course, we had the wakelock/suspend blockers conversation that didn't end in a terribly satisfactory manner, although Rafael is now working on an implementation that presents equivalent functionality with a different userspace API.

-- Matthew Garrett gives an overview of the world of Linux power management

Kernel development news

Kernel development statistics for 2.6.35

July 14, 2010

This article was contributed by Greg Kroah-Hartman.

In the tradition of summarizing the statistics of the Linux kernel releases before the actual release of the kernel version itself, here is a summary of what has happened in the Linux kernel tree over the past few months.

This kernel release has seen 9460 changesets from about 1145 different developers so far. That is in line with the past few kernel releases, both in the number of changes and in the size of the development community, as can be seen in this table:

Kernel	Patches	Devs
2.6.29	11,600	1170
2.6.30	11,700	1130
2.6.31	10,600	1150
2.6.32	10,800	1230
2.6.33	10,500	1150
2.6.34	9,100	1110
2.6.35	9,460	1145

Perhaps our years of increasing developer activity (more developers and more changes with each release) have finally reached a plateau. If so, that is not a bad thing, as a number of us were wondering what the limits of our community were going to be. At around 10 thousand changes per release, that limit is indeed quite high, so there is no need to be concerned, as the Linux kernel is still, by far, the most active software development project the world has ever seen.

In looking at the individual developers, the quantity and size of contributions is still quite large:

Most active 2.6.35 developers

By changesets

Mauro Carvalho Chehab 228 2.3%

Dan Carpenter 135 1.3%

Greg Kroah-Hartman 134 1.3%

Arnaldo Carvalho de Melo 121 1.2%

Johannes Berg 105 1.0%

Ben Dooks 98 1.0%

Julia Lawall 96 1.0%

Hans Verkuil 92 0.9%

Alexander Graf 84 0.8%

Eric Dumazet 82 0.8%

Peter Zijlstra 79 0.8%

Paul Mundt 79 0.8%

Johan Hovold 75 0.7%

Tejun Heo 74 0.7%

Stephen Hemminger 74 0.7%

Mark Brown 71 0.7%

Sage Weil 70 0.7%

Alex Deucher 68 0.7%

Randy Dunlap 67 0.7%

Frederic Weisbecker 66 0.7%

By changed lines

Uwe Kleine-König 194249 18.5%

Ralph Campbell 53250 5.1%

Greg Kroah-Hartman 31714 3.0%

Stepan Moskovchenko 30037 2.9%

Arnaud Patard 28783 2.7%

Mauro Carvalho Chehab 27902 2.7%

Eliot Blennerhassett 18490 1.8%

Luis R. Rodriguez 16555 1.6%

Daniel Mack 16176 1.5%

Bob Beers 11703 1.1%

Jason Wessel 10502 1.0%

Viresh KUMAR 10499 1.0%

Barry Song 10116 1.0%

James Chapman 9645 0.9%

Steve Wise 9580 0.9%

Sjur Braendeland 8775 0.8%

Alex Deucher 7721 0.7%

Jassi Brar 7554 0.7%

Sujith 7544 0.7%

Giridhar Malavali 6867 0.7%

Uwe Kleine-König, who works for Pengutronix, dominates the "changed lines" list due to one patch that Linus just pulled for the 2.6.35-rc5 release that deleted almost all of the ARM default config files. When Uwe posted his patch, Linus responded with:

Well, I can hardly refuse a pull that removes almost 200k lines. So I'd happily pull it. Just this single line in your email is a very very powerful thing:

> 177 files changed, 652 insertions(+), 194157 deletions(-)

Other than that major cleanup, the majority of the work was in drivers. Ralph Campbell did a lot of Infiniband driver work, I did a lot of cleanup on some staging drivers, and Stepan Moskovchenko and Arnaud Patard contributed new drivers to the staging tree. Mauro Carvalho Chehab contributed lots of Video for Linux driver work — rounding out the top 6 contributors by lines of code changed.

Continuing the view that this kernel is much like previous ones, 177 different employers were found to have contributed to the 2.6.35 kernel release:

Most active 2.6.35 employers

By changesets

(None) 1429 14.2%

Red Hat 1185 11.8%

(Unknown) 904 9.0%

Intel 637 6.3%

Novell 559 5.6%

IBM 295 2.9%

Nokia 253 2.5%

(Consultant) 215 2.1%

Atheros Communications 175 1.7%

AMD 173 1.7%

Oracle 169 1.7%

Samsung 163 1.6%

Texas Instruments 162 1.6%

(Academia) 140 1.4%

Fujitsu 138 1.4%

Google 122 1.2%

Renesas Technology 102 1.0%

Analog Devices 98 1.0%

Simtec 96 1.0%

NTT 93 0.9%

By lines changed

Pengutronix 195175 18.6%

Red Hat 82334 7.8%

(None) 79313 7.6%

(Unknown) 72426 6.9%

QLogic 72131 6.9%

Novell 49651 4.7%

Intel 47260 4.5%

Code Aurora Forum 40081 3.8%

Mandriva 29105 2.8%

Atheros Communications 29055 2.8%

Samsung 25817 2.5%

ST Ericsson 20463 2.0%

Analog Devices 18889 1.8%

AudioScience Inc. 18545 1.8%

caiaq 16194 1.5%

Nokia 14891 1.4%

Texas Instruments 14864 1.4%

(Consultant) 14209 1.4%

IBM 12235 1.2%

ST Microelectronics 11728 1.1%

But enough of the normal way of looking at the kernel as a whole body. Let's try something different this time, and break the contributions down by the different functional areas of the kernel.

The kernel is a bit strange in that it is a mature body of code that still changes quite frequently and throughout the whole body of code. It is not just drivers that are changing, but the "core" kernel as well. That is pretty unusual for a mature code base. The core kernel code (those files that all architectures and users use no matter what their configuration is) comprises roughly 5% of the kernel by lines of code, and you will find that roughly 5% of the total kernel changes happen in that code. Here are the raw numbers of changes to the "core" kernel files for the 2.6.35-rc5 release:

Action	Lines	% of all changes
Added	27,550	4.50%
Deleted	7,450	1.90%
Modified	6,847	4.93%

Note that the percentage deleted is a bit off because of the huge defconfig deletion by Uwe described above.

So, if the changes are made in a uniform way across the kernel, does that mean that the same companies contribute in a uniform way as well, or do some contribute more to one area than another?

I've broken the kernel files down into six different categories:

core : This includes the files in the init, block, ipc, kernel, lib, mm, and virt subdirectories.
drivers : This includes the files in the crypto, drivers, sound, security, include/acpi, include/crypto, include/drm, include/media, include/mtd, include/pcmcia, include/rdma, include/rxrpc, include/scsi, include/sound, and include/video subdirectories.
filesystems : This includes the files in the fs subdirectory.
networking : This includes the files in the net and include/net subdirectories.
architecture-specific : This includes the files in the arch, include/xen, include/math-emu, and include/asm-generic subdirectories.
miscellaneous : This includes all of the rest of the files not included in the above categories.
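
The category rules above amount to a prefix match on each file's path. As a rough illustration only (the function name and table layout are assumptions, not the scripts actually used to produce these statistics), the split could be sketched like this:

```python
# Sketch of the six-way split described above. The names used here are
# illustrative; they are not taken from the tools behind the statistics.
CATEGORY_PREFIXES = [
    ("core", ["init", "block", "ipc", "kernel", "lib", "mm", "virt"]),
    ("drivers", ["crypto", "drivers", "sound", "security",
                 "include/acpi", "include/crypto", "include/drm",
                 "include/media", "include/mtd", "include/pcmcia",
                 "include/rdma", "include/rxrpc", "include/scsi",
                 "include/sound", "include/video"]),
    ("filesystems", ["fs"]),
    ("networking", ["net", "include/net"]),
    ("architecture-specific", ["arch", "include/xen", "include/math-emu",
                               "include/asm-generic"]),
]


def classify(path):
    """Return the category for a path relative to the top of the tree."""
    for category, prefixes in CATEGORY_PREFIXES:
        for prefix in prefixes:
            # Match whole path components, so "netfoo/x" does not match "net".
            if path == prefix or path.startswith(prefix + "/"):
                return category
    # Everything not listed above falls into the catch-all category.
    return "miscellaneous"
```

With this sketch, kernel/sched.c lands in "core", include/net/tcp.h in "networking", and anything under tools/ in "miscellaneous".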

Based on these categories, the size of the 2.6.35 kernel is as follows:

Category	% Lines
Core	4.37%
Drivers	57.06%
Filesystems	7.21%
Networking	5.03%
Arch-specific	21.92%
Miscellaneous	4.43%

Here are the top companies contributing to the different areas of the kernel:

Most active 2.6.35 employers (core)

By changesets

Red Hat 218 16.5%

(None) 148 11.2%

IBM 66 5.0%

Novell 60 4.5%

Intel 59 4.5%

(Unknown) 57 4.3%

Fujitsu 33 2.5%

Google 30 2.3%

Wind River 22 1.7%

Oracle 22 1.7%

Nokia 22 1.7%

(Consultant) 22 1.7%

By lines changed

Wind River 9535 25.4%

Red Hat 6277 16.7%

Novell 2385 6.4%

(None) 2074 5.5%

IBM 2064 5.5%

Intel 1480 3.9%

Fujitsu 1431 3.8%

Google 1417 3.8%

VirtualLogix Inc. 992 2.6%

ST Ericsson 957 2.6%

caiaq 707 1.9%

(Unknown) 614 1.6%

The companies contributing to the core kernel files are not surprising. These companies have all contributed to Linux for a long time, and it is part of their core strategy. Wind River has a high number of lines changed due to all of the KGDB work that Jason Wessel has been doing in getting that codebase cleaned up and merged into the main kernel tree.

Most active 2.6.35 employers (drivers)

By changesets

(None) 1022 18.1%

(Unknown) 678 12.0%

Red Hat 528 9.4%

Intel 499 8.9%

Novell 336 6.0%

Nokia 199 3.5%

Atheros Communications 165 2.9%

(Academia) 94 1.7%

IBM 86 1.5%

QLogic 86 1.5%

By lines changed

QLogic 72122 12.2%

(None) 61356 10.4%

(Unknown) 60802 10.3%

Red Hat 47204 8.0%

Intel 39891 6.7%

Novell 36951 6.2%

Code Aurora Forum 34888 5.9%

Mandriva 28867 4.9%

Atheros Communications 28844 4.9%

AudioScience Inc. 18535 3.1%

Because the drivers make up over 50% of the overall size of the kernel, the contributions here track the overall company statistics very closely. The company AudioScience Inc. sneaks onto the list of changes due to all of the work that Eliot Blennerhassett has been doing on the asihpi sound driver.

Most active 2.6.35 employers (filesystems)

By changesets

Red Hat 134 15.9%

Oracle 77 9.1%

New Dream Network 76 9.0%

Novell 76 9.0%

(Unknown) 73 8.7%

(None) 58 6.9%

NetApp 42 5.0%

Parallels 39 4.6%

IBM 23 2.7%

Univ. of Michigan CITI 23 2.7%

By lines changed

Oracle 7194 24.2%

Red Hat 6392 21.5%

Novell 3989 13.4%

(Unknown) 3081 10.4%

(None) 2024 6.8%

New Dream Network 1423 4.8%

NetApp 897 3.0%

Google 857 2.9%

Parallels 687 2.3%

(Consultant) 546 1.8%

Filesystem contributions, like drivers, match up with the different company strengths. New Dream Network might not be a familiar name to a lot of people, but their development on the Ceph filesystem pushed it into the list of top contributors. The University of Michigan did a lot of NFS work, bringing that organization into the top ten.

Most active 2.6.35 employers (networking)

By changesets

SFR 74 9.6%

(Consultant) 73 9.5%

Red Hat 72 9.3%

(None) 67 8.7%

ProFUSION 55 7.1%

Intel 45 5.8%

Astaro 35 4.5%

Vyatta 34 4.4%

(Unknown) 34 4.4%

Oracle 20 2.6%

ST Ericsson 20 2.6%

Univ. of Michigan CITI 20 2.6%

By lines changed

Katalix Systems 9213 24.2%

ST Ericsson 8003 21.0%

(Consultant) 3691 9.7%

Univ. of Michigan CITI 2334 6.1%

Astaro 1956 5.1%

Red Hat 1882 4.9%

Intel 1607 4.2%

SFR 1555 4.1%

ProFUSION 1065 2.8%

(None) 1060 2.8%

(Unknown) 1035 2.7%

Like the filesystem list, networking also shows the University of Michigan's large contributions as well as many of the other common Linux companies. But here a number of not-so-familiar companies start showing up.

SFR is a French mobile phone company, and contributed lots of changes all through the networking core. ProFUSION is an embedded development company that did a lot of Bluetooth development for this kernel release. Katalix Systems is another embedded development company and they contributed a lot of l2tp changes. Astaro is a networking security company that contributed a number of netfilter changes.

Most active 2.6.35 employers (architecture-specific)

By changesets

Red Hat 146 7.2%

(None) 143 7.0%

IBM 120 5.9%

Novell 109 5.4%

Samsung 100 4.9%

Texas Instruments 94 4.6%

AMD 90 4.4%

Simtec 85 4.2%

(Unknown) 75 3.7%

(Consultant) 73 3.6%

By lines changed

Pengutronix 194211 60.5%

Samsung 15341 4.8%

ST Microelectronics 10038 3.1%

(None) 8338 2.6%

Red Hat 7981 2.5%

(Consultant) 6695 2.1%

IBM 6064 1.9%

Novell 5973 1.9%

Code Aurora Forum 5114 1.6%

Analog Devices 4345 1.4%

With the architecture-specific files taking up the second largest chunk of code in the kernel, the list of contributing companies is closer to the list of overall contributors as well, with more hardware companies showing that they contribute a lot of development to get Linux working properly on their specific processors.

Most active 2.6.35 employers (miscellaneous)

By changesets

Red Hat 206 26.9%

(None) 110 14.4%

(Unknown) 35 4.6%

Novell 28 3.7%

Intel 27 3.5%

IBM 18 2.4%

Fujitsu 16 2.1%

Google 15 2.0%

Wind River 9 1.2%

(Academia) 9 1.2%

Vyatta 9 1.2%

By lines changed

Red Hat 12772 34.0%

Broadcom 6082 16.2%

(None) 5156 13.7%

(Unknown) 2757 7.3%

Intel 2212 5.9%

(Academia) 1850 4.9%

Samsung 769 2.1%

Wind River 593 1.6%

Fujitsu 592 1.6%

Nokia 532 1.4%

IBM 499 1.3%

The rest of the various kernel files that don't fall into any other major category show that Red Hat has done a lot of work on the userspace performance monitoring tools that are bundled with the Linux kernel.

As for overall trends in the different categories, Red Hat shows that they completely dominate all areas of developing the Linux kernel when it comes to the number of contributions. No other company shows up in the top ten contributors for all categories like they do. But, by breaking out the kernel contributions in different areas of the kernel, we see that a number of different companies are large contributors in different, important areas. Normally these contributions get drowned out by the larger contributors, but the more specialized contributors are just as important to advancing the Linux kernel.

A brief history of union mounts

July 14, 2010

This article was contributed by Valerie Aurora

Several weeks ago, I mentioned on my blog that I planned to move out of programming in the near future. A few days later I received this email from a kernel hacker friend:

At first, I thought we were losing a great hacker... But then I read on your blog: "Don't worry, I'm going to get union mounts into mainline before I change careers," and I realized this means you'll be with us for a few years yet! :)

How long have union mounts existed without going into the mainline Linux kernel? Well, to put it in a human perspective, if you'd been born the same year as the first Linux implementation of union mounts, you'd be writing your college application essays right now. Werner Almesberger began work on the Inheriting File System, one of the early ancestors of Linux union mounts, in 1993 - 17 years ago!

Background

A union mount does the opposite of a normal mount: Instead of hiding the namespace of the file system covered by the new mount, it shows a combination of the namespaces of the unioned file systems. Some use cases include a writable live CD/DVD-based system (without a complicated mess of symbolic links, bind mounts, and writable directories), and a shared base file system used by multiple clients. For an extremely detailed review of unioning file systems in general, see the LWN series on unioning file systems.

This article will provide a high-level overview of various implementations of union mounts from the original 1993 Inheriting File System through the present day VFS-based union mount implementation and plans for near-term development. We deliberately leave aside unionfs, aufs, and other non-VFS implementations of unioning, in large part because the probability of merging a non-VFS unioning file system into mainline appears to be even lower than that of a VFS-based solution.

readdir() redux

Throughout this article, we will place special emphasis on the evolution of readdir(), since historically it has been the greatest stumbling block for any implementation of union mounts. A summary from the first article in the LWN unioning file systems series:

One of the great tragedies of the UNIX file system interface is the enshrinement of readdir(), telldir(), seekdir(), etc. family in the POSIX standard. An application may begin reading directory entries and pause at any time, restarting later from the "same" place in the directory. The kernel must give out 32-bit magic values which allow it to restart the readdir() from the point where it last stopped. Originally, this was implemented the same way as positions in a file: the directory entries were stored sequentially in a file and the number returned was the offset of the next directory entry from the beginning of the directory. Newer file systems use more complex schemes and the value returned is no longer a simple offset. To support readdir(), a unioning file system must merge the entries from lower file systems, remove duplicates and whiteouts, and create some sort of stable mapping that allows it to resume readdir() correctly. Support from userspace libraries can make this easier by caching the results in user memory.
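
To make the merging requirement concrete, here is a toy model in Python. It is not kernel code: the layers are plain dictionaries, the "whiteout" marker is just a string, and the cookie is a simple list index, which is exactly the kind of simple offset that the quote says newer file systems no longer provide.

```python
# Each layer is modeled as a dict mapping name -> entry kind, with the
# topmost layer first. The kind "whiteout" marks a name that has been
# deleted in the union. All names here are illustrative assumptions.
def merged_listing(layers):
    seen = set()      # names already decided by a higher layer
    result = []
    for layer in layers:
        for name, kind in layer.items():
            if name in seen:
                continue              # duplicate: the higher layer wins
            seen.add(name)
            if kind != "whiteout":
                result.append(name)   # a whiteout hides the lower entry
    return result


def readdir_from(layers, cookie, count):
    """Return up to `count` names starting at position `cookie`,
    plus the cookie to pass to the next call."""
    listing = merged_listing(layers)
    chunk = listing[cookie:cookie + count]
    return chunk, cookie + len(chunk)
```

The cookie in this sketch stays meaningful only as long as none of the layers change between calls. Keeping it stable when they do change is the hard part that the implementations described below kept running into.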

Union mounts development time line

As mentioned earlier, one of the first implementations of a unioning file system was the Inheriting File System. In a pattern to be repeated by many future developers, Werner quickly became disenchanted with the complexity of the implementation of IFS and stopped working on it, suggesting that future developers try a mixed user/kernel implementation instead:

Well, I completed it to the point where it was a nice proof of concept, but still with problems (leaks inodes, probably has a few races left, was also a bit too liberal with locking, etc.).

Then I looked back at what I did and was disgusted by its complexity. So I decided that, before I might even consider proposing inclusion into the mainstream kernel, I'd have to see how much poorer (performance-wise) a user-space implementation would be. I did some initial hacking on NFS until I convinced myself that userfs might be the better approach. Unfortunately, I never found the time to work on that.

Many other kernel developers agreed with Werner. One of Linus Torvalds' earliest recorded NAKs of a kernel-based union file system came in 1996:

While at USENIX, I saw the _correct_ way to do a union FS. It was done as a pre-loaded shared library, and because of that it was a lot more flexible than any kernel implementation would ever be [...] After having seen that, I don't think I necessarily would even want a kernel implementation. It simply was so much better done in user space.

In 1998, Werner updated his IFS page to suggest working on a unioning file system as a good academic research topic:

Sounds like a very nice master's thesis topic for some good Linux hacker ;-) [...] So far nobody has taken the challenge. So, if you're an aspiring kernel hacker, aren't afraid of complexity, have a lot of time, and are looking for an interesting but useful project, you may just have found it :-)

Around 2003 - 2004, Jan Blunck took up the gauntlet Werner threw down and began working on union mounts for his thesis. The union mount implementation Jan produced lay dormant until 2007, when discussion about merging unionfs into mainline triggered renewed interest in a VFS-based version of unioning. At that point, Bharata B. Rao took the lead and began working with Jan Blunck on a new version of union mounts. Bharata and Jan posted several versions in 2007.

The first version posted in April 2007 used Jan's original strategy of keeping two pointers in the dentry for each directory, one pointing to the directory below this dentry's in the union stack, and one to the dentry of the topmost directory. The drawback to this implementation is that each file system can only be in one union stack at a time, since dentries are shared between all mounts of the same underlying file system.

The second version posted in May 2007 implemented yet another minor variation on in-kernel readdir(), this time using per file pointer cookies. From the patch set's documentation:

When two processes issue readdir()/getdents() call on the same unioned directory, both of them would be referring to the same dentries via their file structures. So it becomes necessary to maintain rdstate separately for these two instances. This is achieved by using a cookie variable in the rdstate. Each of these rdstate instances would get a different cookie thereby differentiating them.

In June 2007, Bharata and Jan posted a third version with an important and novel change to the way union stacks are formed. They replaced the in-dentry links to the topmost and lower directories with an external structure of pointers to (vfsmount, dentry) pairs. For the first time, a file system could be part of more than one union mount. From the patch set's documentation:

In this new approach, the way union stack is built and traversed has been changed. Instead of dentry-to-dentry links forming the stack between different layers, we now have (vfsmount, dentry) pairs as the building blocks of the union stack. Since this (vfsmount, dentry) combination is unique across all namespaces, we should be able to maintain the union stack sanely even if the filesystem is union mounted privately in different namespaces or if it appears under different mounts due to various types of bind mounts.
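
The effect of keying the stack on (vfsmount, dentry) pairs can be sketched with a small Python model. Everything here is a simplification for illustration: the pair is a namedtuple of strings, and the external structure is an ordinary dictionary rather than whatever the patch set actually used.

```python
from collections import namedtuple

# Stand-in for a (vfsmount, dentry) pair; both fields are plain strings.
Path = namedtuple("Path", ["vfsmount", "dentry"])


class UnionStacks:
    """External table linking each (vfsmount, dentry) pair to the pair below."""

    def __init__(self):
        self._lower = {}

    def link(self, upper, lower):
        self._lower[upper] = lower

    def layers(self, top):
        """Walk from the topmost directory down through its union stack."""
        current = top
        while current is not None:
            yield current
            current = self._lower.get(current)
```

In this model, the same dentry reached through two different mounts yields two distinct keys, so one read-only file system can sit at the bottom of two separate unions without the stacks interfering with each other.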

In July 2007, Jan posted a fourth version with some relatively minor changes to the way whiteouts were implemented, among a few other things. Jan says, "I'm able to compile the kernel with this patches applied on a 3 layer union mount with the [separate] layers bind mounted to different locations. I haven't done any performance tests since I think there is a more important topic ahead: better readdir() support."

In December 2007, Bharata B. Rao posted a fifth version that implemented another in-kernel version of readdir():

In this approach, the cached dirents are given offsets in the form of linearly increasing indices/cookies (like 0, 1, 2,...). This helps us to uniformly define offsets across all the directories of the union irrespective of the type of filesystem involved. Also this is needed to define a seek behaviour on the union mounted directory. This cache is stored as part of the struct file of the topmost directory of the union and will remain as long as the directory is kept open.

However, this approach had multiple problems, including excessive use of kernel memory to cache directory entries and to keep the mapping of indices to dentries.
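
A minimal Python model of this index-based cache shows both why seeking becomes easy and where the memory goes. The class and method names are hypothetical, and building the whole cache at open time is an assumption made to keep the sketch short.

```python
class UnionDirFile:
    """Toy model of an open union directory holding a cache of merged entries."""

    def __init__(self, layers):
        # layers: lists of names, topmost first. The cache lives as long
        # as this object, mirroring a cache attached to the open file.
        self._cache = []
        seen = set()
        for names in layers:
            for name in names:
                if name not in seen:
                    seen.add(name)
                    self._cache.append(name)
        self._pos = 0

    def readdir(self):
        """Return (name, offset of the next entry), or None at the end."""
        if self._pos >= len(self._cache):
            return None
        name = self._cache[self._pos]
        self._pos += 1
        return name, self._pos

    def seek(self, offset):
        # Offsets are plain indices (0, 1, 2, ...), so seeking is trivial.
        self._pos = offset
```

Every open directory carries a full copy of the merged listing in this model, which is the memory cost mentioned above.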

readdir() continued to be a stumbling block, and union mounts development slowed down for most of 2008. In April 2008, Nagabhushan BS implemented and posted a version of union mounts with most of the readdir() logic moved to glibc. "I went through Bharata's RFC post on glibc based Union Mount readdir solution (http://lkml.org/lkml/2008/3/11/34) and have come up with patches against glibc to implement the same."

However, moving the complexity to user space wasn't the panacea everyone had hoped for. The glibc maintainers had many objections, the kernel interface was an obvious kludge (returning whiteouts for "." to signal a unioned directory), and no one could figure out how to handle NFS sanely.

In November 2008, Miklos Szeredi posted a simplified version of union mounts that implemented readdir() in the kernel.

The directory entries are read starting from the top layer and they are maintained in a cache. Subsequently when the entries from the bottom layers of the union stack are read they are checked for duplicates (in the cache) before being passed out to the user space. There can be multiple calls to readdir/getdents routines for reading the entries of a single directory. But union directory cache is not maintained across these calls. Instead for every call, the previously read entries are re-read into the cache and newly read entries are compared against these for duplicates before being they are returned to user space.

This implementation only worked for file systems that return a simple increasing offset in the d_off field for readdir(). So ext2 worked, but any file system with a modern directory hashing scheme did not.

In early 2009, I started to get interested in union mounts. I talked to several groups inside Red Hat and asked them what they needed most from file systems. I heard the same story over and over: "We really really need a unioning file system, but for some reason no one at Red Hat will support unionfs..." I did some research on the available implementations and decided to go to work on Jan Blunck's union mount patch set.

In May 2009, Jan Blunck and I posted a version of union mounts that implemented in-kernel readdir() using a new concept: the fallthru directory entry. The basic idea is that the first time readdir() is called on a directory, the visible directory entries from all the underlying directories are copied up to the topmost directory as fallthru directory entries. This eliminated all the problems I knew of in previous readdir() implementations, but required the topmost file system to always be read-write. This implementation also was limited to only two layers: one read-only file system overlaid with one read-write file system because we were concerned with lock ordering problems.
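
The fallthru idea can be illustrated with a short Python model. This is only a sketch under simplifying assumptions: directories are dictionaries, there are exactly two layers, and the `copied_up` flag is a hypothetical stand-in for however the real code remembered that a directory had already been merged.

```python
class FallthruDir:
    """Toy model of a two-layer union directory using fallthru entries."""

    def __init__(self, top, lower):
        self.top = top          # dict name -> kind, the writable layer
        self.lower = lower      # dict name -> kind, the read-only layer
        self.copied_up = False  # hypothetical "already merged" marker

    def readdir(self):
        if not self.copied_up:
            for name in self.lower:
                # Names already present on top (including whiteouts) win.
                if name not in self.top:
                    self.top[name] = "fallthru"
            self.copied_up = True
        # From here on the topmost directory alone describes the union.
        return [n for n, kind in self.top.items() if kind != "whiteout"]
```

The first call writes into the top layer, which is why this scheme needs the topmost file system to be read-write. Later calls only read the top directory, so the usual single-directory readdir() positions apply.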

In October 2009, I posted a version of union mounts that implemented some of the more difficult system calls, such as truncate(), pivot_root(), and rename(). However, implementing chmod() and other system calls that modified files without opening them turned out to be fairly difficult with the current code base. We thought the hard part was copying up file data in open(), rename(), and link(), but it turned out they were somewhat easier to implement because they already looked up the parent directory of the file to be altered. For union mounts, we need the parent directory's dentry and vfsmount in order to create a new version of the file in the topmost level of the union file system if necessary. open(), rename(), and link() also needed the parent directory in order to create new directory entries, so we just reused the parent in the union mount in-kernel copyup code. But system calls like chmod() that only alter existing files did not bother to look up the parent directory's path, only the target. Regretfully, I decided to start on a major rewrite.

In March 2010, I posted a rewrite of the pathname lookup mechanism for union mounts, taking into account Al Viro's recent VFS cleanups and removing a great deal of unnecessary code duplication.

In May 2010, I posted the first version of union mounts that implemented nearly all file related system calls correctly. The four exceptions were fchmod(), fchown(), fsetxattr(), and futimensat(), which will fail on read-only file descriptors. (UNIX is full of surprises; none of the VFS experts I talked to knew that these system calls would succeed on a read-only fd.)

The central primitive in this version is a function called user_path_nd(). It is a combination of user_path(), which looks up a pathname and returns the corresponding dentry and vfsmount, and user_path_parent(), which looks up the parent directory of the file or directory given by the pathname and returns the struct nameidata for the parent. (struct nameidata is too complex to describe in full here, but suffice to say it is usually needed to create an entry in a directory.) user_path_nd() returns both the parent's nameidata and the child's path. Once we have both these pieces of information, we can do an in-kernel copyup of a file in chmod() or any other system call that modifies a file. Unfortunately, user_path_nd() is also the weakest point in this version of union mounts: it's racy, inefficient, and copies up files even if the system call fails.

The day after I posted that version, I flew to North Carolina for a long-anticipated in-person code review with Al Viro. We spent three days in his office painfully reviewing the entire union mount implementation. Al immediately figured out how to delete a third of the code I'd spent the last year carefully massaging, and then outlined how to rewrite the other two-thirds of the code more elegantly, including user_path_nd(). As a result of this code review marathon, Linux has a feature-complete implementation of union mounts that has undergone a full code review by the Linux VFS maintainer for the first time. Of course, the resulting todo list is long and complex, and some problems may turn out to be insoluble, but it's an important step forward.

The biggest design change Al suggested was to move the head of the union stack back into the dentry, while keeping the rest of the union stack in a singly linked list of struct union_dir's allocated external to the dentries for the read-only parts of the union stack. This combines the speed and elegance of Jan Blunck's original design using in-dentry pointers to the union stack, with the flexibility of Bharata B. Rao's (vfsmount, dentry) pairs, which allow file systems to be part of many read-only layers. This change removed the entire union stack hash table and the associated lookup logic and shrunk the union_dir struct from 7 members to 2. I posted this hybrid linked list version on June 15, 2010.
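
The shape of this hybrid design can be modeled in a few lines of Python. The two fields given to the list node here (a path and a link to the next layer down) are a guess at what a two-member structure would hold; the text above does not name the members, so treat the field names as assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnionDir:
    path: tuple                         # (vfsmount, dentry) of a read-only layer
    lower: Optional["UnionDir"] = None  # next layer down, or None at the bottom


@dataclass
class Dentry:
    name: str
    # The head of the union stack lives in the dentry itself.
    union_stack: Optional[UnionDir] = None


def layers_below(dentry):
    """Yield the read-only layers under a dentry, highest first."""
    node = dentry.union_stack
    while node is not None:
        yield node.path
        node = node.lower
```

Finding the layers under a directory is then a pointer walk starting from the dentry, with no hash table lookup involved.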

Most recently, on June 25th, 2010, I posted a version that implemented submounts in the read-only layers, as well as allowed more than two read-only layers again. Then I went on a two week vacation - the longest vacation I've had since I started working on union mounts - and tried to forget everything I knew about it.

Future Work

The next step is to implement the remainder of Al Viro's review comments. The last big-ticket item is rewriting user_path_nd() and the in-kernel file copyup boilerplate. After that, it's back for another round of code review from Al and the other VFS maintainers. The 2010 Linux Storage and File Systems workshop is in early August. With luck we can hash out any remaining architectural problems face-to-face at the workshop and possibly merge union mounts into mainline before it's old enough to vote. Or it might languish for another 17 years outside the kernel. Such are the vicissitudes of Linux kernel development.

Acknowledgments: I want to extend special thanks to the following people: Kevin Roderick, who provided moral support, Tim Bowen, who gave me a free day at the Spoke6 co-working space while I worked on this article, and, of course, Jake Edge, whose editorial suggestions were, as usual, right on.

The USB composite framework

July 14, 2010

This article was contributed by Michal "mina86" Nazarewicz

Linux is widely used in mobile devices, which should not come as a surprise. It is a powerful and versatile system, and one of its strengths is its support for USB devices of all kinds. That includes "gadgets" — devices that act as USB slaves — like USB flash drives (i.e. pendrives). The USB composite framework makes writing drivers for these kinds of devices relatively easy.

As users keep more data on their mobile devices, the demand for interoperability with desktop computers increases. No one wants to buy a special cable or a "docking station" just to copy a few photos. What users want is to connect the device via a USB cable and get it working out of the box. Linux can give that to them.

Have you ever wondered how this actually works? What happens behind the scenes when a USB connection is established? Better yet, have you wondered how to write a USB gadget for your new and shiny embedded evaluation board?

In this article, I will try to shed some light on that topic.

USB overview

The Universal Serial Bus (or USB) standard defines a master-slave communication protocol. This means that there is one control entity (a master or a host), which decides who can transmit data through the wire. The other entities (slaves, devices, or gadgets) must obey and respond to the host's requests. Slaves do not communicate with each other. A host is usually a desktop computer, while the gadgets are devices such as mice, keyboards, phones, printers, etc.

People are used to seeing Linux systems in the master or host role on a USB bus. But the Linux USB stack also provides support for the slave or gadget role: the device at the other end of the wire. For example, when one connects a pendrive to a Linux host, the host handles it with the usb-storage driver. However, if we had a Linux machine with a USB Device Controller (or UDC), we could run Alan Stern's File Storage Gadget (or FSG). FSG is, as its name implies, a gadget driver which implements the USB mass storage device class (or UMC). That would allow the machine in question to act as a USB drive (aka pendrive).

When a device is connected, an enumeration process begins. During this process, the device is assigned a unique 7-bit identifier used for addressing. As a consequence, up to 127 slaves (including hubs) can be connected to a single host.
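
The arithmetic behind that limit can be sketched in a few lines of C. This is only an illustration: the macro and function names are made up for this example, and the detail that address zero serves as the default address for devices that have not been enumerated yet comes from the USB specification rather than from the text above.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
#define USB_ADDR_BITS 7u

/* Number of distinct values a 7-bit address field can hold. */
static unsigned int usb_addr_space(void)
{
	return 1u << USB_ADDR_BITS;	/* 128 */
}

/* Address 0 is the default address a device answers to before the
 * host assigns it one, so it is not available for enumerated devices. */
static unsigned int usb_max_devices(void)
{
	unsigned int space = usb_addr_space();

	assert(space > 1);
	return space - 1;		/* 127 */
}
```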

Communication is based on logical pipes which join the master with one of a slave's endpoints (or EPs). There can be up to 16 endpoints (numbered from 0 to 15) on a device. Endpoint zero (or EP0) is reserved for setup requests (e.g., a query for descriptors, a request to set a particular configuration, etc.).

Pipes are unidirectional (one-way) and data can go to (via an IN endpoint) or from (via an OUT endpoint) the host. (It is important to remember that from a slave's point of view, an IN endpoint is the one it writes to and an OUT endpoint is the one it reads from.) There are also four transfer modes: bulk, isochronous, interrupt, and control.

Endpoints are grouped into interfaces which are then grouped into configurations. Different configurations may contain different interfaces, as well as have different power demands. All that information is saved in various descriptors requested by the host during enumeration. One can see them using the lsusb tool. Here is the stripped-down (and annotated) output for a Kingston pendrive:

    Bus 001 Device 004: ID 0951:1614 Kingston Technology
    Device Descriptor:
      idVendor           0x0951 Kingston Technology
      idProduct          0x1614
      bNumConfigurations      1     [only one configuration]
      Configuration Descriptor:     [the first and only config]
        bNumInterfaces          1    [only one interface]
        MaxPower              200mA
        Interface Descriptor:        [the first and only intf.]
          bNumEndpoints           2   [two endpoints]
          bInterfaceClass         8 Mass Storage
          bInterfaceSubClass      6 SCSI
          bInterfaceProtocol     80 Bulk (Zip)
          Endpoint Descriptor:        [the first endpoint]
            bEndpointAddress     0x81  EP 1 IN
            bmAttributes            2
              Transfer Type            Bulk
              Usage Type               Data
          Endpoint Descriptor:        [the second endpoint]
            bEndpointAddress     0x02  EP 2 OUT
            bmAttributes            2
              Transfer Type            Bulk
              Usage Type               Data

After the host receives the descriptors and learns what kind of a gadget has been connected, it can choose a configuration to be used and start communicating. At most one configuration can be active at a time.
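
The bEndpointAddress and bmAttributes values in the listing above each pack several fields into a single byte. The following user-space sketch decodes them according to the layout in the USB specification: bit 7 of the address gives the direction, the low four bits give the endpoint number, and the low two bits of bmAttributes select the transfer type. The structure, macro, and function names are invented for this example; kernel code would rely on the definitions in <linux/usb/ch9.h> instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space names; the bit layout follows the USB spec. */
#define EP_NUMBER_MASK	0x0f	/* bits 0..3: endpoint number */
#define EP_DIR_IN	0x80	/* bit 7 set: device-to-host (IN) */
#define EP_XFER_MASK	0x03	/* bits 0..1 of bmAttributes */

struct ep_info {
	unsigned int number;	/* 0..15 */
	int is_in;		/* 1 = IN (to the host), 0 = OUT */
	const char *xfer;	/* transfer type name */
};

static struct ep_info ep_decode(uint8_t bEndpointAddress, uint8_t bmAttributes)
{
	/* Indexed by the two-bit transfer type field. */
	static const char *const xfer_names[] = {
		"Control", "Isochronous", "Bulk", "Interrupt",
	};
	struct ep_info info;

	info.number = bEndpointAddress & EP_NUMBER_MASK;
	info.is_in = (bEndpointAddress & EP_DIR_IN) != 0;
	info.xfer = xfer_names[bmAttributes & EP_XFER_MASK];
	assert(info.number <= 15);
	return info;
}
```

For the Kingston pendrive, ep_decode(0x81, 2) describes endpoint 1 as a bulk IN endpoint and ep_decode(0x02, 2) describes endpoint 2 as a bulk OUT endpoint, which matches the lsusb annotations.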

Linux USB composite framework

There is, however, another module that implements UMC: my Mass Storage Gadget (or MSG). The obvious question is why there are two drivers that seem to do the very same thing. The answer has to do with the Linux USB composite framework.

The "old way" of creating gadgets is to get the specification and implement everything as a single, monolithic module. Gadget Zero, File Storage Gadget, and GadgetFS are examples of such gadgets.

This approach has two rather big disadvantages:

many of the common USB functionalities (core device setup requests on EP0) have to be implemented in each and every module; and
it can be tricky to combine the code from several gadgets into a new gadget with combined functionality.

For those reasons, David Brownell came up with the composite framework which has two advantages over the old approach:

all of the core USB requests are implemented by the framework; and
a single functionality, or USB composite function, is developed separately from other functions as well as from the USB bus logic that is not directly related to this function. Later, such functions are combined using the composite framework to form a composite gadget.

[USB Composite Gadget's descriptors structure]

From a composite gadget's perspective, a device has some functions grouped into configurations. One function may be present in any number of configurations. Each function may have several interfaces and other descriptors but that is transparent to the kernel module.

Put on top of the "raw" USB descriptors structure, a USB composite function can be regarded as an abstraction for a group of interfaces.

That is another excellent property of the framework — most implementation details are hidden "under the hood" and one does not need to think about them when developing a gadget. Instead of thinking about endpoints and interfaces, one thinks about functions. Therefore, FSG is a gadget developed in the "old way", whereas MSG is a composite gadget which uses only one composite function — the Mass Storage Function (or MSF). As a matter of fact, MSF has been created from FSG to allow for the creation of more complicated drivers that would have UMC as part of their functionality.

Overall driver structure

In this article, I will try to explain how to create a mass storage composite gadget. It is in the kernel already, but let's forget that FSG and MSG exist for a moment.

What is great about Linux is that a lot has already been done and one can get results with relatively little effort. As such, I will show how to create a working driver using MSF and some "composite glue".

I will start with the structure of the module, while skipping the details of the Mass Storage Function. The first step is to define a device descriptor. It stores some basic information about the gadget:

    static struct usb_device_descriptor msg_dev_desc = {
    	.bLength =		sizeof msg_dev_desc,
    	.bDescriptorType =	USB_DT_DEVICE,
    	.bcdUSB =		cpu_to_le16(0x0200),
    	.idVendor =		cpu_to_le16(FSG_VENDOR_ID),
    	.idProduct =		cpu_to_le16(FSG_PRODUCT_ID),
    };

The usb_device_descriptor structure has some more fields but they are not required or not important for our module. What has been set is:

bLength and bDescriptorType: Standard fields that every descriptor has.
bcdUSB: The version of the USB specification the device supports, encoded in BCD (so 0x0200 means 2.00).
idVendor and idProduct: Each device must have a unique vendor and product identifier pair. To avoid collisions, companies (vendors) can buy a vendor ID which gives them a namespace of 65536 product IDs to use. NetChip has donated some product IDs to the Linux community. Later, the Linux Foundation got the whole vendor ID for use with Linux. FSG_VENDOR_ID is actually NetChip's vendor ID and, along with FSG_PRODUCT_ID, that is what FSG uses.
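
To make the BCD encoding of bcdUSB concrete, here is a small user-space sketch that turns such a value into a readable version string. The helper name is hypothetical, and it takes the value in host byte order (the descriptor itself stores it little-endian, which is why the driver wraps the constant in cpu_to_le16()).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format a bcdUSB value such as 0x0200 as "2.00".
 * Each hex nibble holds one decimal digit: two for the major version,
 * one for the minor version and one for the sub-minor version. */
static int bcd_usb_to_string(uint16_t bcd, char *buf, size_t len)
{
	unsigned int major = ((bcd >> 12) & 0xf) * 10 + ((bcd >> 8) & 0xf);
	unsigned int minor = (bcd >> 4) & 0xf;
	unsigned int sub = bcd & 0xf;

	/* A valid BCD nibble never exceeds 9. */
	assert(((bcd >> 12) & 0xf) <= 9 && ((bcd >> 8) & 0xf) <= 9);
	assert(minor <= 9 && sub <= 9);

	return snprintf(buf, len, "%u.%u%u", major, minor, sub);
}
```

With this helper, 0x0200 is rendered as "2.00" and 0x0110 as "1.10".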

The next step is to define a USB configuration which will be provided by the driver. It is described by a usb_configuration structure which, among other things, points to a bind callback function. Its purpose is to bind all USB composite functions to the configuration. Usually, it is a simple function, as most of the job is done prior to its invocation.

Put together it looks as follows:

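    /* Bind callback: attaches the Mass Storage Function to the configuration.
     * It is defined first so that msg_config can refer to it; msg_fsg_common
     * is defined together with the MSF setup code. */
    static int __ref msg_do_config(struct usb_configuration *c)
    {
    	return fsg_add(c->cdev, c, &msg_fsg_common);
    }

    static struct usb_configuration msg_config = {
    	.label			= "Linux Mass Storage",
    	.bind			= msg_do_config,
    	.bConfigurationValue	= 1,
    	.bmAttributes		= USB_CONFIG_ATT_SELFPOWER,
    };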

The msg_config object specifies a label (used for debug messages), the bind callback, the configuration's number (each configuration must have a unique, non-zero number), and indicates that the device is self powered. All that msg_do_config() does is bind the MSF to the configuration.

That definition is then used by the msg_bind() function, which is a callback to set up composite functions, prepare descriptors, add all configurations supported by the device, etc.:

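    /* Bit 0 is set once the gadget has been successfully set up. */
    static unsigned long msg_registered;

    static int __ref msg_bind(struct usb_composite_dev *cdev)
    {
    	int ret;

    	/* Set up the shared fsg_common object used by the MSF. */
    	ret = msg_fsg_init(cdev);
    	if (ret < 0)
    		return ret;

    	ret = usb_add_config(cdev, &msg_config);
    	if (ret >= 0)
    		set_bit(0, &msg_registered);
    	/* Drop the initial reference; bound MSF instances hold their own. */
    	fsg_common_put(&msg_fsg_common);
    	return ret;
    }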

The msg_bind() function does the following: it initializes the Mass Storage Function, adds the previously defined configuration to the USB device, and (at the end) drops its reference to the msg_fsg_common object. If everything succeeds, it sets the msg_registered flag to record that the gadget has been registered and initialized.

With all of the above, a composite device can be defined. For this purpose, the usb_composite_driver structure is used. Besides specifying the name, it points to the device descriptors and the bind callback:

    static struct usb_composite_driver msg_device = {
    	.name		= "g_my_mass_storage",
    	.dev		= &msg_dev_desc,
    	.bind		= msg_bind,
    };

At this point, all that is left are the init and exit module functions:

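    static int __init msg_init(void)
    {
    	return usb_composite_register(&msg_device);
    }

    /* Not marked __exit: it is also called when the MSF worker thread exits. */
    static void msg_exit(void)
    {
    	/* Unregister only if the flag was still set, so this happens once. */
    	if (test_and_clear_bit(0, &msg_registered))
    		usb_composite_unregister(&msg_device);
    }

    module_init(msg_init);
    module_exit(msg_exit);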

They use the usb_composite_register() and usb_composite_unregister() functions to register and unregister the device. The msg_registered variable is used to ensure the device is unregistered only once.

To sum things up:

A composite device (msg_device) is registered in msg_init() when the module loads.
It has a device bind callback (msg_bind()) that initializes MSF and adds the configuration to the gadget.
The configuration (msg_config) has its own bind callback (msg_do_config()), which binds MSF to the configuration.
The really hard work is done inside the MSF.

Mass Storage Function

With the big picture in mind, let's get into the finer details: the inner workings of the Mass Storage Function. There are a couple of things to watch out for when dealing with it.

First of all, because MSF can be bound to several configurations, it needs to share some data between the instances and at the same time store information specific for each configuration. The fsg_common structure is used for shared data. An instance of this structure needs to be initialized prior to binding MSF.

Because the common object is used by several MSF instances, it has no single owner, so a reference counter is needed to decide when it can be destroyed. That's the reason for the fsg_common_put() call at the end of the msg_bind() function.
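
The ownership rule described above can be illustrated with a minimal user-space reference counter. This is a sketch under stated assumptions, not the MSF code: the structure and function names are hypothetical, the counter is not thread-safe, and the freed_flag member exists only so that the destruction can be observed from outside. Kernel code would typically use a facility such as struct kref for this.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a shared object such as fsg_common. */
struct shared_common {
	int refcount;		/* number of current holders */
	int *freed_flag;	/* set to 1 on destruction (demo only) */
};

static struct shared_common *common_create(int *freed_flag)
{
	struct shared_common *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->refcount = 1;	/* the creator holds the first reference */
	c->freed_flag = freed_flag;
	*freed_flag = 0;
	return c;
}

/* Take an additional reference, e.g. for each bound function instance. */
static void common_get(struct shared_common *c)
{
	assert(c->refcount > 0);
	c->refcount++;
}

/* Drop one reference; the last holder to leave destroys the object. */
static void common_put(struct shared_common *c)
{
	assert(c->refcount > 0);
	if (--c->refcount == 0) {
		*c->freed_flag = 1;
		free(c);
	}
}
```

In this model, the creator can drop its reference as soon as every instance has taken its own, which mirrors the put at the end of msg_bind(): the object stays alive until the last instance lets go of it.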

Closely connected with the fsg_common structure is a worker thread which MSF uses to handle all the host's requests. When an fsg_common object is created, a thread is started as well. It terminates either when the fsg_common object is destroyed or when it is killed with an INT, TERM, or KILL signal. In the latter case, the fsg_common object may still exist even after the worker's death. Whatever the reason, when the thread exits, a thread_exits callback is invoked.

It is important to note that a signal may terminate the worker thread, but why would one want to do that? The reason is simple. As long as MSF is holding any open files, the filesystems which those files belong to cannot be unmounted. That is bad news for a shutdown script.

Alan Stern's solution in FSG is to close all backing files when the worker thread receives an INT, TERM, or KILL signal. Because MSF is to be used with various composite gadgets, a callback has been introduced rather than hardcoding that behavior.

The last thing to note is that MSF is customizable. The UMC specification allows for a single device to have several logical units (sometimes called LUNs, which is strictly speaking incorrect since LUN stands for Logical Unit Number). Each logical unit may be read-only or read-write, may emulate a CD-ROM or disk drive, and may be removable or not.

All of this configuration must be specified when the fsg_common structure is initialized. The fsg_config structure is used for exactly that purpose. In most cases, a module author does not want to fill it in themselves, but would rather let a user of the module decide the settings.

To make it as easy as possible, an fsg_module_parameters structure and an FSG_MODULE_PARAMETERS() macro are provided by the MSF. The former stores user-supplied arguments, whereas the latter defines several module parameters.

Given an fsg_module_parameters object, one may use fsg_config_from_params() followed by fsg_common_init() to create an fsg_common object. Alternatively, fsg_common_from_params() can be used, which combines the calls to the other two functions.

Here is how it all works when put together:

    static struct fsg_module_parameters msg_mod_data = { .stall = 1 };
    FSG_MODULE_PARAMETERS(/* no prefix */, msg_mod_data);

    static struct fsg_common msg_fsg_common;

    static int msg_thread_exits(struct fsg_common *common)
    {
    	msg_exit();
    	return 0;
    }

    static int msg_fsg_init(struct usb_composite_dev *cdev)
    {
    	struct fsg_config config;
    	struct fsg_common *retp;

    	fsg_config_from_params(&config, &msg_mod_data);

    	config.thread_exits = msg_thread_exits;

    	retp = fsg_common_init(&msg_fsg_common, cdev, &config);
    	return IS_ERR(retp) ? PTR_ERR(retp) : 0;
    }

The msg_exit() function has been chosen as MSF's thread_exits callback. Since MSF is nonoperational after the thread has exited, there is no need to keep the composite device registered; instead, the gadget is unregistered.

At this point, it should become obvious why the msg_registered flag is being used. Since usb_composite_unregister() can be called from two different places, a mechanism to guarantee that it will be called only once is needed — atomic bit operations are perfect for such tasks.
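
The once-only guarantee can be demonstrated outside the kernel with C11 atomics. The sketch below is an assumption-laden stand-in, not kernel code: atomic_fetch_and() plays the role of test_and_clear_bit(), a counter stands in for the call to usb_composite_unregister(), and all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space stand-ins for the kernel's bit operations. */
static atomic_ulong registered;		/* bit 0: gadget is registered */
static int unregister_calls;		/* counts real unregistrations */

static void mark_registered(void)
{
	atomic_fetch_or(&registered, 1UL);
}

/* Atomically clear bit 0 and report whether it was set beforehand. */
static int test_and_clear_registered(void)
{
	return (atomic_fetch_and(&registered, ~1UL) & 1UL) != 0;
}

/* May be called from several places; only the first caller after
 * registration sees the bit set and performs the actual work. */
static void unregister_once(void)
{
	if (test_and_clear_registered())
		unregister_calls++;	/* stands in for the real unregister call */

	/* Postcondition in this single-threaded demo: the flag is clear. */
	assert((atomic_load(&registered) & 1UL) == 0);
}
```

Because the test and the clear happen in one atomic step, two callers racing on unregister_once() cannot both observe the bit as set, so the guarded work runs at most once per registration.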

And that would be it. We are done. One can grab the full source code and start playing with it.

The beauty of the composite framework is that all the really hard stuff has already been written. One can write devices and experiment with different configurations without deep knowledge of the USB specification or the Linux gadget API. At the same time, it is a perfect introduction to some more serious USB programming.

Running

To use the gadget, one needs to provide a disk image that the gadget will present to the USB host as a storage device. Running dd on the gadget machine is a simple way to create one:

    # dd if=/dev/zero of=disk.img bs=1M count=64

With the disk image in place, the module can be loaded:

    # insmod g_my_mass.ko file=$PWD/disk.img

Connecting the device to the host should produce several messages in the host system log, among others:

    usb 1-4.4: new high speed USB device using ehci_hcd and address 8
    usb 1-4.4: New USB device found, idVendor=0525, idProduct=a4a5
    usb-storage: device scan complete
    sd 6:0:0:0: [sdb] Attached SCSI removable disk
    sd 6:0:0:0: [sdb] 131072 512-byte logical blocks: (67.1 MB/64.0 MiB)
     sdb: unknown partition table
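
The size reported by the host can be checked against the dd command used to create the image. The helper below is a hypothetical illustration of that arithmetic: 131072 blocks of 512 bytes amount to 67108864 bytes, which is exactly 64 MiB (the bs=1M count=64 image) and roughly 67.1 MB in decimal units.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total capacity in bytes of a block device. */
static uint64_t capacity_bytes(uint64_t blocks, uint32_t block_size)
{
	assert(block_size != 0);
	return blocks * block_size;
}
```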

All that is left is to create a partition with a filesystem and start using the pendrive:

    # fdisk /dev/sdb
    ...
    # dmesg -c
    sd 6:0:0:0: [sdb] Assuming drive cache: write through
     sdb: sdb1

    # mkfs.vfat /dev/sdb1
    mkfs.vfat 3.0.9 (31 Jan 2010)
    # mount /dev/sdb1 /mnt/
    # touch /mnt/foo
    # umount /mnt

As has been shown, the gadget works like a charm.

Conclusion

The Linux USB composite framework provides a fairly straightforward way to add USB devices. Before the composite framework came along, developers needed to implement all USB requests for each gadget they wanted to add to the system. The framework handles basic USB requests and separates each USB composite function, which allows gadget authors to think in terms of functions rather than low-level interfaces and communication handling.

As one might guess, this article just scratches the surface of what the composite framework can do. The driver that was shown is a single-configuration, single-function gadget, so the advantages over non-composite gadgets are not readily apparent. A future article may look at drivers for more powerful gadgets using the composite framework.


Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.35-rc5

Thomas Gleixner 2.6.33.6-rt26

Architecture-specific

Pekka Enberg x86: Early-boot serial I/O support

Tejun Heo x86-64: software IRQ masking and handling

Stephen Rothwell powerpc: reduce the size of the defconfigs

Yinghai Lu Use memblock with x86

Build system

Stephen Rothwell kbuild: Enable building defconfigs from Kconfig files

Grant Likely Kconfig: Enable Kconfig fragments to be used for defconfig

Core kernel code

Mathieu Desnoyers Generic Ring Buffer Library

Development tools

Srikar Dronamraju Uprobes Patches:

Peter Zijlstra perf pmu interface changes -v3

Device drivers

Arjan van de Ven pm: Add runtime PM statistics to sysfs

Arnd Bergmann further BKL removal

Documentation

Michael Kerrisk man-pages-3.25 is released

Filesystems and block I/O

Munehiro Ikeda blkiocg async support

Aneesh Kumar K.V Generic name to handle and open by handle syscalls

Arnd Bergmann block: BKL removal, version 4

Arnd Bergmann VFS: turn no_llseek into the default

Yehuda Sadeh ceph-rbd: ceph RADOS block device

Vivek Goyal cfq-iosched: Implement cfq group idling

David Howells Add a dentry op to handle automounting rather than abusing follow_link

Memory management

FUJITA Tomonori unify dma_get_cache_alignment implementations

Nathan Fontenot De-couple sysfs memory directories from memory sections

Xiaotian Feng [RFC] swap over nfs -v21

Networking

Michael S. Tsirkin netfilter: add CHECKSUM target

Simon Horman IPVS full NAT support + netfilter 'ipvs' match support

Security-related

David P. Quigley Labeled-NFS: Security Label support in NFSv4

Page editor: Jake Edge