Brief items
The current development kernel is 2.6.35-rc5, released on July 12. "And
I merged the ARM defconfig minimization thing, which isn't the final word
on the whole defconfig issue, but since it removes almost 200 _thousand_
lines of ARM defconfig noise, it's a pretty big deal." We looked at the ARM defconfig
issue a few weeks back, and Linus has pulled from Uwe Kleine-König's tree, which provides a
starting point for the defconfig cleanup. The short-form changelog is
appended to the announcement, and all the details are available in the full changelog.
Five stable kernels were released on July 5: 2.6.27.48,
2.6.31.14, 2.6.32.16, 2.6.33.6, and 2.6.34.1.
hrmpf, one of those wonderful messages where neither it nor its source
code give you any clue regarding what caused it nor how to fix it.
--
Andrew Morton
This has been an especially interesting year in the field. We've landed the
infrastructure for generic runtime power management, glued that into PCI
and started implementing that at the driver level. pm_qos is being reworked
to improve performance and scalability as we start seeing more drivers that
need to express their own constraints. And, of course, we had the
wakelock/suspend blockers conversation that didn't end in a terribly
satisfactory manner, although Rafael is now working on an implementation
that presents equivalent functionality with a different userspace API.
--
Matthew Garrett
gives an overview of the world of Linux power management
Kernel development news
In the tradition of summarizing the statistics of the Linux kernel
releases before the actual release of the kernel version itself, here is
a summary of what has happened in the Linux kernel tree over the past
few months.
This kernel release has seen 9460 changesets from about 1145 different
developers so far. This continues the trend of the past few kernel
releases in both the number of changes and the size of the development
community, as can be seen in this table:
| Kernel | Patches | Devs |
| 2.6.29 | 11,600 | 1170 |
| 2.6.30 | 11,700 | 1130 |
| 2.6.31 | 10,600 | 1150 |
| 2.6.32 | 10,800 | 1230 |
| 2.6.33 | 10,500 | 1150 |
| 2.6.34 | 9,100 | 1110 |
| 2.6.35 | 9,460 | 1145 |
Perhaps our years of increasing developer activity (getting more
developers per release and more changes per release) have finally
reached a plateau. If so, that is not a bad thing, as a number of us
were wondering what the limits of our community were going to be. At
around 10 thousand changes per release, that limit is indeed quite high,
so there is no need to be concerned, as the Linux kernel is still, by
far, the most active software development project the world has ever
seen.
In looking at the individual developers, the quantity and size of
contributions is still quite large:
| Most active 2.6.35 developers |
| By changesets |
| Mauro Carvalho Chehab | 228 | 2.3% |
| Dan Carpenter | 135 | 1.3% |
| Greg Kroah-Hartman | 134 | 1.3% |
| Arnaldo Carvalho de Melo | 121 | 1.2% |
| Johannes Berg | 105 | 1.0% |
| Ben Dooks | 98 | 1.0% |
| Julia Lawall | 96 | 1.0% |
| Hans Verkuil | 92 | 0.9% |
| Alexander Graf | 84 | 0.8% |
| Eric Dumazet | 82 | 0.8% |
| Peter Zijlstra | 79 | 0.8% |
| Paul Mundt | 79 | 0.8% |
| Johan Hovold | 75 | 0.7% |
| Tejun Heo | 74 | 0.7% |
| Stephen Hemminger | 74 | 0.7% |
| Mark Brown | 71 | 0.7% |
| Sage Weil | 70 | 0.7% |
| Alex Deucher | 68 | 0.7% |
| Randy Dunlap | 67 | 0.7% |
| Frederic Weisbecker | 66 | 0.7% |
|
| By changed lines |
| Uwe Kleine-König | 194249 | 18.5% |
| Ralph Campbell | 53250 | 5.1% |
| Greg Kroah-Hartman | 31714 | 3.0% |
| Stepan Moskovchenko | 30037 | 2.9% |
| Arnaud Patard | 28783 | 2.7% |
| Mauro Carvalho Chehab | 27902 | 2.7% |
| Eliot Blennerhassett | 18490 | 1.8% |
| Luis R. Rodriguez | 16555 | 1.6% |
| Daniel Mack | 16176 | 1.5% |
| Bob Beers | 11703 | 1.1% |
| Jason Wessel | 10502 | 1.0% |
| Viresh KUMAR | 10499 | 1.0% |
| Barry Song | 10116 | 1.0% |
| James Chapman | 9645 | 0.9% |
| Steve Wise | 9580 | 0.9% |
| Sjur Braendeland | 8775 | 0.8% |
| Alex Deucher | 7721 | 0.7% |
| Jassi Brar | 7554 | 0.7% |
| Sujith | 7544 | 0.7% |
| Giridhar Malavali | 6867 | 0.7% |
|
Uwe Kleine-König, who works for Pengutronix, dominates the
"changed lines" list due to one patch that Linus just pulled for the 2.6.35-rc5 release that deleted almost all of the ARM
default config files. When Uwe posted his patch, Linus responded with:
Well, I can hardly refuse a pull that removes almost 200k lines. So
I'd happily pull it. Just this single line in your email is a very
very powerful thing:
> 177 files changed, 652 insertions(+), 194157 deletions(-)
Other than that major cleanup, the majority of the work was in drivers.
Ralph Campbell did a lot of Infiniband driver work, I did a lot of
cleanup on some staging drivers, and Stepan Moskovchenko and Arnaud
Patard contributed new drivers to the staging tree.
Mauro Carvalho Chehab contributed lots of Video for Linux driver work —
rounding out the top 6 contributors by lines of code changed.
Continuing the view that this kernel is much like previous ones, 177
different employers were found to have contributed to the 2.6.35 kernel
release:
| Most active 2.6.35 employers |
| By changesets |
| (None) | 1429 | 14.2% |
| Red Hat | 1185 | 11.8% |
| (Unknown) | 904 | 9.0% |
| Intel | 637 | 6.3% |
| Novell | 559 | 5.6% |
| IBM | 295 | 2.9% |
| Nokia | 253 | 2.5% |
| (Consultant) | 215 | 2.1% |
| Atheros Communications | 175 | 1.7% |
| AMD | 173 | 1.7% |
| Oracle | 169 | 1.7% |
| Samsung | 163 | 1.6% |
| Texas Instruments | 162 | 1.6% |
| (Academia) | 140 | 1.4% |
| Fujitsu | 138 | 1.4% |
| Google | 122 | 1.2% |
| Renesas Technology | 102 | 1.0% |
| Analog Devices | 98 | 1.0% |
| Simtec | 96 | 1.0% |
| NTT | 93 | 0.9% |
|
| By lines changed |
| Pengutronix | 195175 | 18.6% |
| Red Hat | 82334 | 7.8% |
| (None) | 79313 | 7.6% |
| (Unknown) | 72426 | 6.9% |
| QLogic | 72131 | 6.9% |
| Novell | 49651 | 4.7% |
| Intel | 47260 | 4.5% |
| Code Aurora Forum | 40081 | 3.8% |
| Mandriva | 29105 | 2.8% |
| Atheros Communications | 29055 | 2.8% |
| Samsung | 25817 | 2.5% |
| ST Ericsson | 20463 | 2.0% |
| Analog Devices | 18889 | 1.8% |
| AudioScience Inc. | 18545 | 1.8% |
| caiaq | 16194 | 1.5% |
| Nokia | 14891 | 1.4% |
| Texas Instruments | 14864 | 1.4% |
| (Consultant) | 14209 | 1.4% |
| IBM | 12235 | 1.2% |
| ST Microelectronics | 11728 | 1.1% |
|
But enough of the normal way of looking at the kernel as a whole body.
Let's try something different this time, and break the contributions
down by the different functional areas of the kernel.
The kernel is a bit strange in that it is a mature body of code that still
changes quite frequently and throughout the whole body of code. It is not
just drivers that are changing, but the "core" kernel as well. That is
pretty unusual for a mature code base.
The core kernel code — those files that all architectures and
users use no matter what their configuration is — comprises 5% of the
kernel (by lines of code), and you will find that 5% of the total kernel
changes happen
in that code. Here is the raw number of changes for the "core" kernel
files for the 2.6.35-rc5 release.
| Action | Lines | % of all changes |
| Added | 27,550 | 4.50% |
| Deleted | 7,450 | 1.90% |
| Modified | 6,847 | 4.93% |
Note that the percentage deleted is a bit off because of the huge defconfig
deletion by Uwe, as described above.
So, if the changes are made in a uniform way across the kernel, does
that mean that the same companies contribute in a uniform way as well,
or do some contribute more to one area than another?
I've broken the kernel files down into six different categories:
- core: This includes the files in the init, block, ipc, kernel, lib, mm, and virt subdirectories.
- drivers: This includes the files in the crypto, drivers, sound, security, include/acpi, include/crypto, include/drm, include/media, include/mtd, include/pcmcia, include/rdma, include/rxrpc, include/scsi, include/sound, and include/video subdirectories.
- filesystems: This includes the files in the fs subdirectory.
- networking: This includes the files in the net and include/net subdirectories.
- architecture-specific: This includes the files in the arch, include/xen, include/math-emu, and include/asm-generic subdirectories.
- miscellaneous: This includes all of the rest of the files not included in the above categories.
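To make the breakdown concrete, here is a minimal Python sketch of how a changed file could be sorted into one of these six categories by its path prefix. The directory lists are copied from the definitions above; the function name `classify()` and the sample paths are illustrative assumptions, not the actual tooling used to generate these statistics.

```python
# Directory prefixes copied from the category definitions above.
# The function name and the sample paths are illustrative assumptions.
CATEGORIES = [
    ("core", ("init/", "block/", "ipc/", "kernel/", "lib/", "mm/", "virt/")),
    ("drivers", ("crypto/", "drivers/", "sound/", "security/",
                 "include/acpi/", "include/crypto/", "include/drm/",
                 "include/media/", "include/mtd/", "include/pcmcia/",
                 "include/rdma/", "include/rxrpc/", "include/scsi/",
                 "include/sound/", "include/video/")),
    ("filesystems", ("fs/",)),
    ("networking", ("net/", "include/net/")),
    ("architecture-specific", ("arch/", "include/xen/",
                               "include/math-emu/", "include/asm-generic/")),
]


def classify(path):
    """Return the category name for a path relative to the kernel tree root."""
    for name, prefixes in CATEGORIES:
        # str.startswith() accepts a tuple and matches any of its members.
        if path.startswith(prefixes):
            return name
    # Everything that matches no prefix falls into the catch-all bucket.
    return "miscellaneous"


if __name__ == "__main__":
    for sample in ("mm/slab.c", "include/net/tcp.h", "tools/perf/builtin-top.c"):
        print(sample, "->", classify(sample))
```

Counting changesets or changed lines per employer within each category would then be a matter of running every touched file through a function like this one.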
Based on these categories, the size of the 2.6.35 kernel is as follows:
| Category | % Lines |
| Core | 4.37% |
| Drivers | 57.06% |
| Filesystems | 7.21% |
| Networking | 5.03% |
| Arch-specific | 21.92% |
| Miscellaneous | 4.43% |
Here are the top companies contributing to the different areas of the
kernel:
| Most active 2.6.35 employers (core) |
| By changesets |
| Red Hat | 218 | 16.5% |
| (None) | 148 | 11.2% |
| IBM | 66 | 5.0% |
| Novell | 60 | 4.5% |
| Intel | 59 | 4.5% |
| (Unknown) | 57 | 4.3% |
| Fujitsu | 33 | 2.5% |
| Google | 30 | 2.3% |
| Wind River | 22 | 1.7% |
| Oracle | 22 | 1.7% |
| Nokia | 22 | 1.7% |
| (Consultant) | 22 | 1.7% |
|
| By lines changed |
| Wind River | 9535 | 25.4% |
| Red Hat | 6277 | 16.7% |
| Novell | 2385 | 6.4% |
| (None) | 2074 | 5.5% |
| IBM | 2064 | 5.5% |
| Intel | 1480 | 3.9% |
| Fujitsu | 1431 | 3.8% |
| Google | 1417 | 3.8% |
| VirtualLogix Inc. | 992 | 2.6% |
| ST Ericsson | 957 | 2.6% |
| caiaq | 707 | 1.9% |
| (Unknown) | 614 | 1.6% |
|
The companies contributing to the core kernel files are not surprising.
These companies have all contributed to Linux for a long time, and it is
part of their core strategy. Wind River has a high number of lines
changed due to all of the KGDB work that Jason Wessel has been doing in
getting that codebase cleaned up and merged into the main kernel tree.
| Most active 2.6.35 employers (drivers) |
| By changesets |
| (None) | 1022 | 18.1% |
| (Unknown) | 678 | 12.0% |
| Red Hat | 528 | 9.4% |
| Intel | 499 | 8.9% |
| Novell | 336 | 6.0% |
| Nokia | 199 | 3.5% |
| Atheros Communications | 165 | 2.9% |
| (Academia) | 94 | 1.7% |
| IBM | 86 | 1.5% |
| QLogic | 86 | 1.5% |
|
| By lines changed |
| QLogic | 72122 | 12.2% |
| (None) | 61356 | 10.4% |
| (Unknown) | 60802 | 10.3% |
| Red Hat | 47204 | 8.0% |
| Intel | 39891 | 6.7% |
| Novell | 36951 | 6.2% |
| Code Aurora Forum | 34888 | 5.9% |
| Mandriva | 28867 | 4.9% |
| Atheros Communications | 28844 | 4.9% |
| AudioScience Inc. | 18535 | 3.1% |
|
Because the drivers make up over 50% of the overall
size of the kernel, the contributions here track the overall company statistics
very closely. The company AudioScience Inc. sneaks onto the list of
changes due to all of the work that Eliot Blennerhassett has been doing
on the asihpi sound driver.
| Most active 2.6.35 employers (filesystems) |
| By changesets |
| Red Hat | 134 | 15.9% |
| Oracle | 77 | 9.1% |
| New Dream Network | 76 | 9.0% |
| Novell | 76 | 9.0% |
| (Unknown) | 73 | 8.7% |
| (None) | 58 | 6.9% |
| NetApp | 42 | 5.0% |
| Parallels | 39 | 4.6% |
| IBM | 23 | 2.7% |
| Univ. of Michigan CITI | 23 | 2.7% |
|
| By lines changed |
| Oracle | 7194 | 24.2% |
| Red Hat | 6392 | 21.5% |
| Novell | 3989 | 13.4% |
| (Unknown) | 3081 | 10.4% |
| (None) | 2024 | 6.8% |
| New Dream Network | 1423 | 4.8% |
| NetApp | 897 | 3.0% |
| Google | 857 | 2.9% |
| Parallels | 687 | 2.3% |
| (Consultant) | 546 | 1.8% |
|
Filesystem contributions, like drivers, match up with the different
company strengths. New Dream Network might not be a familiar name to a
lot of people, but their development on the Ceph filesystem pushed it
into the list of top contributors. The University of Michigan did a lot
of NFS work, bringing that organization into the top ten.
| Most active 2.6.35 employers (networking) |
| By changesets |
| SFR | 74 | 9.6% |
| (Consultant) | 73 | 9.5% |
| Red Hat | 72 | 9.3% |
| (None) | 67 | 8.7% |
| ProFUSION | 55 | 7.1% |
| Intel | 45 | 5.8% |
| Astaro | 35 | 4.5% |
| Vyatta | 34 | 4.4% |
| (Unknown) | 34 | 4.4% |
| Oracle | 20 | 2.6% |
| ST Ericsson | 20 | 2.6% |
| Univ. of Michigan CITI | 20 | 2.6% |
|
| By lines changed |
| Katalix Systems | 9213 | 24.2% |
| ST Ericsson | 8003 | 21.0% |
| (Consultant) | 3691 | 9.7% |
| Univ. of Michigan CITI | 2334 | 6.1% |
| Astaro | 1956 | 5.1% |
| Red Hat | 1882 | 4.9% |
| Intel | 1607 | 4.2% |
| SFR | 1555 | 4.1% |
| ProFUSION | 1065 | 2.8% |
| (None) | 1060 | 2.8% |
| (Unknown) | 1035 | 2.7% |
|
Like the filesystem list, networking also shows the University of
Michigan's large contributions as well as many of the other common Linux
companies. But here a number of not-so-familiar companies start showing
up.
SFR is a French mobile phone company, and contributed lots of changes
all through the networking core. ProFUSION is an embedded development
company that did a lot of Bluetooth development for this kernel release.
Katalix Systems is another embedded development company and they
contributed a lot of l2tp changes. Astaro is a networking security
company that contributed a number of netfilter changes.
| Most active 2.6.35 employers (architecture-specific) |
| By changesets |
| Red Hat | 146 | 7.2% |
| (None) | 143 | 7.0% |
| IBM | 120 | 5.9% |
| Novell | 109 | 5.4% |
| Samsung | 100 | 4.9% |
| Texas Instruments | 94 | 4.6% |
| AMD | 90 | 4.4% |
| Simtec | 85 | 4.2% |
| (Unknown) | 75 | 3.7% |
| (Consultant) | 73 | 3.6% |
|
| By lines changed |
| Pengutronix | 194211 | 60.5% |
| Samsung | 15341 | 4.8% |
| ST Microelectronics | 10038 | 3.1% |
| (None) | 8338 | 2.6% |
| Red Hat | 7981 | 2.5% |
| (Consultant) | 6695 | 2.1% |
| IBM | 6064 | 1.9% |
| Novell | 5973 | 1.9% |
| Code Aurora Forum | 5114 | 1.6% |
| Analog Devices | 4345 | 1.4% |
|
With the architecture-specific files taking up the second largest chunk
of code in the kernel, the list of contributing companies is closer to
the list of overall contributors as well, with more hardware companies
showing that they contribute a lot of development to get Linux working
properly on their specific processors.
| Most active 2.6.35 employers (miscellaneous) |
| By changesets |
| Red Hat | 206 | 26.9% |
| (None) | 110 | 14.4% |
| (Unknown) | 35 | 4.6% |
| Novell | 28 | 3.7% |
| Intel | 27 | 3.5% |
| IBM | 18 | 2.4% |
| Fujitsu | 16 | 2.1% |
| Google | 15 | 2.0% |
| Wind River | 9 | 1.2% |
| (Academia) | 9 | 1.2% |
| Vyatta | 9 | 1.2% |
|
| By lines changed |
| Red Hat | 12772 | 34.0% |
| Broadcom | 6082 | 16.2% |
| (None) | 5156 | 13.7% |
| (Unknown) | 2757 | 7.3% |
| Intel | 2212 | 5.9% |
| (Academia) | 1850 | 4.9% |
| Samsung | 769 | 2.1% |
| Wind River | 593 | 1.6% |
| Fujitsu | 592 | 1.6% |
| Nokia | 532 | 1.4% |
| IBM | 499 | 1.3% |
|
The rest of the various kernel files that don't fall into any other
major category show that Red Hat has done a lot of work on the userspace
performance monitoring tools that are bundled with the Linux kernel.
As for overall trends in the different categories, Red Hat shows that
they completely dominate all areas of developing the Linux kernel when it
comes to the number of contributions. No other company shows up in the top
ten contributors for all categories like they do. But, by breaking out the
kernel contributions in different areas of the kernel, we see that a number
of different companies are large contributors in different, important
areas. Normally these contributions get drowned out by the larger
contributors, but the more specialized contributors are just as important
to advancing the Linux kernel.
July 14, 2010
This article was contributed by Valerie Aurora (formerly Henson)
Several weeks ago, I mentioned on my blog that I
planned to
move out of programming in the near future. A few days later I
received this email from a kernel hacker friend:
At first, I thought we were losing a great hacker... But
then I read on your blog: "Don't worry, I'm going to get union mounts
into mainline before I change careers," and I realized this means
you'll be with us for a few years yet! :)
How long has union mounts existed without going into the mainline
Linux kernel? Well, to put it in a human perspective, if you'd been
born the same year as the first Linux implementation of union mounts,
you'd be writing your college application essays right now. Werner
Almesberger began work on
the Inheriting
File System, one of the early ancestors of Linux union mounts, in
1993 - 17 years ago!
Background
A union mount does the opposite of a normal mount: Instead of hiding
the namespace of the file system covered by the new mount, it shows a
combination of the namespaces of the unioned file systems. Some use
cases include a writable live CD/DVD-based system (without a
complicated mess of symbolic links, bind mounts, and writable
directories), and a shared base file system used by multiple clients.
For an extremely detailed review of unioning file systems in general,
see the LWN series on the subject.
This article will provide a high-level overview of various
implementations of union mounts from the original 1993 Inheriting File
System through the present day VFS-based union mount implementation
and plans for near-term development. We deliberately leave aside
unionfs, aufs, and other non-VFS implementations of unioning, in large
part because the probability of merging a non-VFS unioning file system
into mainline appears to be even lower than that of a VFS-based
solution.
readdir() redux
Throughout this article, we will place special emphasis on the
evolution of
readdir(), since historically it has been the
greatest stumbling block for any implementation of union mounts. A
summary from the first article in the LWN unioning file systems
series:
One of the great tragedies of the UNIX file system interface is the
enshrinement
of readdir(), telldir(), seekdir(),
etc. family in the POSIX standard. An application may begin reading
directory entries and pause at any time, restarting later from the
"same" place in the directory. The kernel must give out 32-bit magic
values which allow it to restart the readdir() from the point
where it last stopped. Originally, this was implemented the same way
as positions in a file: the directory entries were stored sequentially
in a file and the number returned was the offset of the next directory
entry from the beginning of the directory. Newer file systems use more
complex schemes and the value returned is no longer a simple
offset. To support readdir(), a unioning file system must
merge the entries from lower file systems, remove duplicates and
whiteouts, and create some sort of stable mapping that allows it to
resume readdir() correctly. Support from userspace libraries
can make this easier by caching the results in user memory.
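The merging step described in that summary can be sketched in a few lines of Python. This is only a toy model under stated assumptions: each layer is a plain dictionary, `WHITEOUT` is a hypothetical marker for a name deleted in that layer, and `union_listing()` is an illustrative name rather than anything from a real patch set. It deliberately ignores the hard part, which is producing a stable resume position across calls.

```python
# Toy model: each layer is a dict mapping a name to its entry.
# WHITEOUT is a hypothetical marker meaning "this name was deleted here".
WHITEOUT = object()


def union_listing(layers):
    """Merge directory layers, given topmost first, into one list of names.

    A name found in an upper layer hides the same name in every lower
    layer, and a whiteout hides the name without listing it at all.
    """
    seen = set()
    merged = []
    for layer in layers:
        for name in sorted(layer):
            if name in seen:
                continue  # duplicate of an entry already decided higher up
            seen.add(name)
            if layer[name] is not WHITEOUT:
                merged.append(name)
    return merged


if __name__ == "__main__":
    top = {"a": "file", "b": WHITEOUT}
    bottom = {"a": "file", "b": "file", "c": "file"}
    print(union_listing([top, bottom]))  # prints ['a', 'c']
```

Here "a" appears once even though both layers contain it, and "b" disappears because the top layer holds a whiteout for it.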
Union mounts development time line
As mentioned earlier, one of the first implementations of a unioning
file system was the Inheriting
File System. In a pattern to be repeated by many future
developers, Werner quickly became disenchanted with the complexity of
the implementation of IFS and stopped working on it, suggesting that
future developers try a mixed user/kernel implementation instead:
Well, I completed it to the point where it was a nice proof of
concept, but still with problems (leaks inodes, probably has a few
races left, was also a bit too liberal with locking, etc.).
Then I looked back at what I did and was disgusted by its
complexity. So I decided that, before I might even consider proposing
inclusion into the mainstream kernel, I'd have to see how much poorer
(performance-wise) a user-space implementation would be. I did some
initial hacking on NFS until I convinced myself that userfs might be
the better approach. Unfortunately, I never found the time to work on that.
Many other kernel developers agreed with Werner. One of Linus
Torvalds' earliest
recorded NAKs of a kernel-based union file system came in 1996:
While at USENIX, I saw the _correct_ way to do a union FS. It was done
as a pre-loaded shared library, and because of that it was a lot more
flexible than any kernel implementation would ever be [...] After
having seen that, I don't think I necessarily would even want a kernel
implementation. It simply was so much better done in user space.
In 1998, Werner updated his IFS page to suggest working on
a unioning file system as a good academic research topic:
Sounds like a very nice master's thesis topic for some good Linux
hacker ;-) [...] So far nobody has taken the challenge. So, if you're
an aspiring kernel hacker, aren't afraid of complexity, have a lot of
time, and are looking for an interesting but useful project, you may
just have found it :-)
Around 2003 - 2004, Jan Blunck took up the gauntlet Werner threw down
and
began working
on union mounts for his thesis. The union mount implementation
Jan produced lay dormant until 2007, when discussion about
merging unionfs
into mainline
triggered renewed interest in a VFS-based version of unioning. At
that point, Bharata B. Rao took the lead and began working with Jan
Blunck on a new version of union mounts. Bharata and Jan posted
several versions in 2007.
The first
version posted in April 2007 used Jan's original strategy of keeping two
pointers in the dentry for each directory, one pointing to the
directory below this dentry's in the union stack, and one to the
dentry of the topmost directory. The drawback to this implementation
is that each file system can only be in one union stack at a time,
since dentries are shared between all mounts of the same underlying
file system.
The second
version posted in May 2007 implemented yet another minor variation on
in-kernel readdir(), this time using per file pointer cookies. From
the patch set's documentation:
When two processes issue readdir()/getdents() call
on the same unioned directory, both of them would be referring to the
same dentries via their file structures. So it becomes necessary to
maintain rdstate separately for these two instances. This is achieved
by using a cookie variable in the rdstate. Each of these rdstate
instances would get a different cookie thereby differentiating them.
In June 2007, Bharata and
Jan posted
a third version with an important and novel change to the way
union stacks are formed. They replaced the in-dentry links to the
topmost and lower directories with an external structure of pointers
to (vfsmount, dentry) pairs. For the first time, a file system could
be part of more than one union mount. From the patch set's
documentation:
In this new approach, the way union stack is built and traversed has
been changed. Instead of dentry-to-dentry links forming the stack
between different layers, we now have (vfsmount, dentry) pairs as the
building blocks of the union stack. Since this (vfsmount, dentry)
combination is unique across all namespaces, we should be able to
maintain the union stack sanely even if the filesystem is union
mounted privately in different namespaces or if it appears under
different mounts due to various types of bind mounts.
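As a rough illustration of why the pairs matter, the following Python sketch keeps the union stack in a table outside the directory entries, keyed by (mount, dentry) pairs. Everything here is a simplified assumption: `Path`, `UnionStacks`, and the string names stand in for kernel structures and are not taken from the patch set.

```python
from collections import namedtuple

# Hypothetical stand-in for a kernel (vfsmount, dentry) pair.
Path = namedtuple("Path", ["mount", "dentry"])


class UnionStacks:
    """Union stacks stored outside the dentries, keyed by (mount, dentry)."""

    def __init__(self):
        self._lower = {}  # Path -> Path of the layer directly below it

    def stack_on(self, upper, lower):
        """Record that 'lower' sits directly beneath 'upper'."""
        self._lower[upper] = lower

    def layers(self, top):
        """Walk from the topmost directory down through the lower layers."""
        result = [top]
        while result[-1] in self._lower:
            result.append(self._lower[result[-1]])
        return result


if __name__ == "__main__":
    stacks = UnionStacks()
    shared = "ro_root"  # one dentry, visible through two different mounts
    stacks.stack_on(Path("rw1", "top1"), Path("mntA", shared))
    stacks.stack_on(Path("mntA", shared), Path("base1", "root1"))
    stacks.stack_on(Path("rw2", "top2"), Path("mntB", shared))
    stacks.stack_on(Path("mntB", shared), Path("base2", "root2"))
    print(stacks.layers(Path("rw1", "top1")))
    print(stacks.layers(Path("rw2", "top2")))
```

The same dentry takes part in two independent stacks because the table distinguishes it by mount. With links stored inside the dentry itself there would be room for only one of them.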
In July 2007, Jan
posted
a fourth version with some relatively minor changes to the way
whiteouts were implemented, among a few other things. Jan says,
"I'm able to compile the kernel with this patches applied on a 3
layer union mount with the [separate] layers bind mounted to different
locations. I haven't done any performance tests since I think there is
a more important topic ahead: better readdir() support."
In December 2007, Bharata B.
Rao posted
a fifth version that implemented another in-kernel version
of readdir():
In this approach, the cached dirents are given offsets in the form of
linearly increasing indices/cookies (like 0, 1, 2,...). This helps us
to uniformly define offsets across all the directories of the union
irrespective of the type of filesystem involved. Also this is needed
to define a seek behaviour on the union mounted directory. This cache
is stored as part of the struct file of the topmost directory of the
union and will remain as long as the directory is kept open.
However, this approach had multiple problems, including excessive use
of kernel memory to cache directory entries and to keep the mapping of
indices to dentries.
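A minimal Python model of that scheme shows both its appeal and its cost. The class and method names are illustrative assumptions; the real code kept the cache in the struct file of the topmost directory.

```python
class CachedUnionDir:
    """Models an open unioned directory whose merged entries are cached.

    Offsets are plain indices (0, 1, 2, ...) into the cache, so they mean
    the same thing whatever file systems lie underneath.
    """

    def __init__(self, merged_names):
        # The whole merged listing stays in memory while the directory
        # is open; this is the memory cost mentioned above.
        self._cache = list(merged_names)
        self._offset = 0

    def readdir(self):
        """Return the next name, or None at the end of the directory."""
        if self._offset >= len(self._cache):
            return None
        name = self._cache[self._offset]
        self._offset += 1
        return name

    def tell(self):
        """Return a cookie that can later be handed back to seek()."""
        return self._offset

    def seek(self, cookie):
        """Resume reading from a position previously returned by tell()."""
        if not 0 <= cookie <= len(self._cache):
            raise ValueError("cookie out of range")
        self._offset = cookie


if __name__ == "__main__":
    d = CachedUnionDir(["a", "b", "c"])
    d.readdir()
    pos = d.tell()
    d.readdir()
    d.seek(pos)
    print(d.readdir())  # prints b
```

Seeking becomes trivial, but every open unioned directory pins a complete copy of its merged listing.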
readdir() continued to be a stumbling block, and union mounts
development slowed down for most of 2008. In April 2008, Nagabhushan
BS implemented
and posted
a version of union mounts with most of the readdir() logic
moved to glibc. "I went through Bharata's RFC post on glibc
based Union Mount readdir solution
(http://lkml.org/lkml/2008/3/11/34)
and have come up with patches against glibc to implement the
same."
However, moving the complexity to user space wasn't the panacea everyone
had hoped for. The glibc maintainers had many objections, the kernel
interface was an obvious kludge (returning whiteouts for "." to signal
a unioned directory), and no one could figure out how to handle NFS
sanely.
In November 2008, Miklos
Szeredi posted
a simplified version of union mounts that implemented readdir() in the
kernel.
The directory entries are read starting from the top layer and they
are maintained in a cache. Subsequently when the entries from the
bottom layers of the union stack are read they are checked for
duplicates (in the cache) before being passed out to the user
space. There can be multiple calls to readdir/getdents routines for
reading the entries of a single directory. But union directory cache
is not maintained across these calls. Instead for every call, the
previously read entries are re-read into the cache and newly read
entries are compared against these for duplicates before being they
are returned to user space.
This implementation only worked for file systems that return a simple
increasing offset in the d_off field for readdir(). So ext2 worked,
but any file system with a modern directory hashing scheme did not.
In early 2009, I started to get interested in union mounts. I talked
to several groups inside Red Hat and asked them what they needed most
from file systems. I heard the same story over and over: "We
really really need a unioning file system, but for some reason no one
at Red Hat will support unionfs..." I did some research on the
available implementations and decided to go to work on Jan Blunck's
union mount patch set.
In May 2009, Jan Blunck and
I posted
a version of union mounts that implemented
in-kernel readdir() using a new concept: the fallthru
directory entry. The basic idea is that the first
time readdir() is called on a directory, the visible
directory entries from all the underlying directories are copied up to
the topmost directory as fallthru directory entries. This eliminated
all the problems I knew of in previous readdir()
implementations, but required the topmost file system to always be
read-write. This implementation also was limited to only two layers:
one read-only file system overlaid with one read-write file system
because we were concerned with lock ordering problems.
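The fallthru idea can be modeled in Python along these lines. This is a sketch under simplifying assumptions: directories are dictionaries, `FALLTHRU` and `WHITEOUT` are hypothetical string markers, and the `copied_up` flag is an invented way to remember that the copy has already happened.

```python
# Hypothetical markers for the two special kinds of directory entry.
FALLTHRU = "fallthru"
WHITEOUT = "whiteout"


class UnionDir:
    """Two-layer union directory: a writable top over a read-only lower."""

    def __init__(self, top, lower):
        self.top = top          # dict: name -> "file", WHITEOUT, or FALLTHRU
        self.lower = lower      # dict: name -> "file"; never modified
        self.copied_up = False  # invented flag: have fallthrus been created?

    def readdir(self):
        if not self.copied_up:
            for name in self.lower:
                # setdefault() leaves existing top-layer entries and
                # whiteouts alone, so only uncovered names get a fallthru.
                self.top.setdefault(name, FALLTHRU)
            self.copied_up = True
        # From here on, the top directory alone describes the union.
        return sorted(n for n, kind in self.top.items() if kind != WHITEOUT)


if __name__ == "__main__":
    top = {"a": "file", "b": WHITEOUT}
    lower = {"b": "file", "c": "file"}
    d = UnionDir(top, lower)
    print(d.readdir())  # prints ['a', 'c']
    print(top["c"])     # prints fallthru
```

After the first call, the listing comes from a single directory on a single file system, so its native readdir() offsets can be used as they are. The write to `top` is where the read-write requirement comes from.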
In October 2009,
I posted
a version of union mounts that implemented some of the more difficult
system calls, such as truncate(), pivot_root(),
and rename(). However, implementing chmod() and
other system calls that modified files without opening them turned out
to be fairly difficult with the current code base. We thought the
hard part was copying up file data
in open(), rename(), and link(), but it
turned out they were somewhat easier to implement because they already
looked up the parent directory of the file to be altered. For union
mounts, we need the parent directory's
dentry and vfsmount in order to create a new version
of the file in the topmost level of the union file system if
necessary. open(), rename(), and link() also
needed the parent directory in order to create new directory entries,
so we just reused the parent in the union mount in-kernel copyup code.
But system calls like chmod() that only alter existing files
did not bother to look up the parent directory's path, only the
target. Regretfully, I decided to start on a major rewrite.
In March 2010,
I posted
a rewrite of the pathname lookup mechanism for union mounts, taking
into account Al Viro's recent VFS cleanups and removing a great deal
of unnecessary code duplication.
In May 2010,
I
posted the first version of union mounts that implemented nearly
all file related system calls correctly. The four exceptions
were fchmod(), fchown(), fsetxattr(),
and futimensat(), which will fail on read-only file
descriptors. (UNIX is full of surprises; none of the VFS experts I
talked to knew that these system calls would succeed on a read-only
fd.)
The central primitive in this version is a function
called user_path_nd(). It is a combination
of user_path(), which looks up a pathname and returns the
corresponding dentry and vfsmount, and user_path_parent(),
which looks up the parent directory of the file or directory given by
the pathname and returns the struct nameidata for the parent. (struct
nameidata is too complex to describe in full here, but suffice to say
it is usually needed to create an entry in a
directory.) user_path_nd() returns both the parent's
nameidata and the child's path. Once we have both these pieces of
information, we can do an in-kernel copyup of a file
in chmod() or any other system call that modifies a file.
Unfortunately, user_path_nd() is also the weakest point in
this version of union mounts: it's racy, inefficient, and copies up
files even if the system call fails.
The day after I posted that version, I flew to North Carolina for a
long-anticipated in-person code review with Al Viro. We spent three
days in his office painfully reviewing the entire union
mount implementation. Al immediately figured out how to delete a
third of the code I'd spent the last year carefully massaging, and
then outlined how to rewrite the other two-thirds of the code more
elegantly, including user_path_nd(). As a result of this
code review marathon, Linux has a feature-complete implementation of
union mounts that has undergone a full code review by the Linux VFS
maintainer for the first time. Of course, the resulting todo list is
long and complex, and some problems may turn out to be insoluble, but
it's an important step forward.
The biggest design change Al suggested was to move the head of the
union stack back into the dentry, while keeping the rest of the union
stack in a singly linked list of struct union_dir's allocated
external to the dentries for the read-only parts of the union stack.
This combines the speed and elegance of Jan Blunck's original
design using in-dentry pointers to the union
stack, with the flexibility of Bharata B. Rao's (vfsmount,
dentry) pairs, which allow file systems to be part of many
read-only layers. This change removed the entire union stack hash
table and the associated lookup logic and shrank
the union_dir struct from 7 members to 2.
I posted
this hybrid linked list version on June 15, 2010.
Most recently, on June 25th, 2010,
I posted
a version that implemented submounts in the read-only layers, as well
as allowed more than two read-only layers again. Then I went on a two
week vacation - the longest vacation I've had since I started working
on union mounts - and tried to forget everything I knew about it.
Future Work
The next step is to implement the remainder of Al Viro's review
comments. The last big-ticket item is
rewriting user_path_nd() and the in-kernel file copyup
boilerplate. After that, it's back for another round of code review
from Al and the other VFS maintainers. The 2010 Linux Storage and
File Systems workshop is in early August. With luck we can hash out
any remaining architectural problems face-to-face at the workshop and
possibly merge union mounts into mainline before it's old enough to
vote. Or it might languish for another 17 years outside the kernel.
Such are the vicissitudes of Linux kernel development.
Acknowledgments: I want to extend special thanks to the following
people: Kevin Roderick, who provided moral support, Tim Bowen, who
gave me a free day at the
Spoke6 co-working space while I
worked on this article, and, of course, Jake Edge, whose editorial
suggestions were, as usual, right on.
July 14, 2010
This article was contributed by Michal "mina86" Nazarewicz
Linux is widely used in mobile devices, which should not come as a surprise.
It is a powerful and versatile system, and one of its strengths is its support
for USB devices of all kinds. That includes "gadgets" — devices that act as
USB slaves — like USB flash drives (i.e. pendrives). The USB
composite framework makes writing drivers for these kinds of devices
relatively easy.
As users keep more data on their mobile devices, the demand for
interoperability with desktop computers increases. No one wants to
buy a special cable or a "docking station" just to copy a few
photos. What users want is to connect the device via a USB cable
and get it working out of the box. Linux can give that to
them.
Have you ever wondered how this actually works? What happens
behind the scenes when a USB connection is established? Better yet,
have you wondered how to write a USB gadget for your new and shiny
embedded evaluation board?
In this article, I will try to shed some light on that topic.
USB overview
The Universal Serial Bus (or USB) standard defines
a master-slave communication protocol. This means that there is
one control entity (a master or a host),
which decides who can transmit data through the wire. The other
entities (slaves, devices, or
gadgets) must obey and respond to the host's requests.
Slaves do not communicate with each other. A host is usually
a desktop computer, while the gadgets are devices such as mice,
keyboards, phones, printers, etc.
People are used to seeing Linux systems in the master or host
role on a USB bus. But the Linux USB stack also provides support
for the slave or gadget role — the device at the other end
of the wire. For example, when one connects a pendrive to
a Linux host, the host handles it with the usb-storage
driver.
However, if we had a Linux machine with a USB Device Controller (or
UDC), we could run Alan Stern's File
Storage Gadget (or FSG). FSG is, as its name implies, a gadget driver
which implements the USB mass storage device class (or UMC). That
would allow the machine in question to act as a USB drive (aka pendrive).
When a device is connected, an enumeration process
begins. During this process, the device is assigned a unique
7-bit identifier used for addressing. As a consequence, up to 127
slaves (including hubs) can be connected to a single host.
Communication is based on logical pipes which join
the master with one of a slave's endpoints (or
EPs).
There can be up to 16 endpoints (numbered from 0
to 15) on a device. Endpoint zero (or EP0) is
reserved for setup requests (e.g., a query for descriptors or a
request to set a particular configuration).
Pipes are unidirectional (one-way) and data can go to (via an
IN endpoint) or from (via an OUT endpoint)
the host. (It is important to remember that from a slave's point
of view, an IN endpoint is the one it writes to and an OUT
endpoint is the one it reads from.) There are also four transfer
modes: bulk, isochronous, interrupt, and control.
Endpoints are grouped into interfaces which are then
grouped into configurations. Different configurations
may contain different interfaces, as well as have different power
demands. All that information is saved in various
descriptors requested by the host during enumeration.
One can see them using the lsusb tool. Here is the stripped-down
(and annotated) output for a Kingston pendrive:
Bus 001 Device 004: ID 0951:1614 Kingston Technology
Device Descriptor:
idVendor 0x0951 Kingston Technology
idProduct 0x1614
bNumConfigurations 1 [only one configuration]
Configuration Descriptor: [the first and only config]
bNumInterfaces 1 [only one interface]
MaxPower 200mA
Interface Descriptor: [the first and only intf.]
bNumEndpoints 2 [two endpoints]
bInterfaceClass 8 Mass Storage
bInterfaceSubClass 6 SCSI
bInterfaceProtocol 80 Bulk (Zip)
Endpoint Descriptor: [the first endpoint]
bEndpointAddress 0x81 EP 1 IN
bmAttributes 2
Transfer Type Bulk
Usage Type Data
Endpoint Descriptor: [the second endpoint]
bEndpointAddress 0x02 EP 2 OUT
bmAttributes 2
Transfer Type Bulk
Usage Type Data
After the host receives the descriptors and learns what kind of
a gadget has been connected, it can choose a configuration to be
used and start communicating. At most one configuration can be
active at a time.
Linux USB composite framework
There is, however, another module that implements UMC: my Mass
Storage Gadget (or MSG). The obvious question is why there
are two drivers that seem to do the very same thing. The answer
lies in the Linux USB composite framework.
The "old way" of creating gadgets is to get the specification
and implement everything as a single, monolithic module. Gadget
Zero, File Storage Gadget, and GadgetFS
are examples of such gadgets.
This approach has two rather big disadvantages:
- many of the common USB functionalities (core device setup
requests on EP0) have to be implemented in each and every
module; and
- it can be tricky to combine the code from several gadgets into a
new gadget with combined functionality.
For those reasons, David Brownell came up with the composite
framework which has two advantages over the old
approach:
- all of the core USB requests are implemented by the
framework; and
- a single piece of functionality, a USB composite
function, is developed separately from other
functions as well as from the USB bus logic that is not directly related
to this function. Later, such functions are combined
using the composite framework to form a composite
gadget.
From a composite gadget's perspective, a device has some
functions grouped into configurations. One function may be
present in any number of configurations. Each function may have
several interfaces and other descriptors but that is transparent
to the kernel module.
Put on top of the "raw" USB descriptors structure, a USB
composite function can be regarded as an abstraction for a group
of interfaces.
That is another excellent property of the framework —
most implementation details are hidden "under the hood" and one
does not need to think about them when developing a gadget.
Instead of thinking about endpoints and interfaces, one thinks
about functions. Therefore, FSG is a gadget developed in the "old way", whereas
MSG is a composite gadget which uses only one composite function
— the Mass
Storage Function (or MSF).
As a matter of fact, MSF has been created from FSG to allow for
the creation of more complicated drivers that would have UMC as
part of their functionality.
Overall driver structure
In this article, I will try to explain how to create a mass
storage composite gadget. It is in the kernel already,
but let's forget that FSG and MSG exist for a moment.
What is great about Linux is that a lot has already been done
and one can get results with relatively little effort. As such,
I will show how to create a working driver using MSF and some
"composite glue".
I will start with the structure of the module, while skipping the details of the Mass
Storage Function. The first step is to define a device descriptor. It
stores
some basic information about the gadget:
static struct usb_device_descriptor msg_dev_desc = {
	.bLength = sizeof msg_dev_desc,
	.bDescriptorType = USB_DT_DEVICE,
	.bcdUSB = cpu_to_le16(0x0200),
	.idVendor = cpu_to_le16(FSG_VENDOR_ID),
	.idProduct = cpu_to_le16(FSG_PRODUCT_ID),
};
The usb_device_descriptor
structure has some more fields but they are not required or
not important for our module. What has been set is:
- bLength and bDescriptorType
- Standard fields that every descriptor has.
- bcdUSB
- The version of the USB specification the device supports, encoded
in BCD (so 0x0200 means 2.00).
- idVendor and idProduct
- Each device must have a unique vendor and product
identifier pair. To avoid collisions, companies (vendors) can
buy a vendor ID which gives them a namespace of 65536 product
IDs to use.
NetChip has donated some product IDs to the Linux community.
Later, the Linux Foundation got the whole vendor ID for use
with Linux.
FSG_VENDOR_ID
is actually NetChip's vendor ID and, along with
FSG_PRODUCT_ID, that is what FSG uses.
The next step is to define a USB configuration which will be
provided by the driver. It is described by a usb_configuration
structure which, among other things, points to a bind
callback function. Its purpose is to bind all USB composite functions
to the configuration. Usually, it is a simple function, as most of
the job is done prior to its invocation.
Put together it looks as follows:
static struct usb_configuration msg_config = {
	.label = "Linux Mass Storage",
	.bind = msg_do_config,
	.bConfigurationValue = 1,
	.bmAttributes = USB_CONFIG_ATT_SELFPOWER,
};

static int __ref msg_do_config(struct usb_configuration *c)
{
	return fsg_add(c->cdev, c, &msg_fsg_common);
}
The msg_config object specifies a label (used for debug
messages), the bind callback, the configuration's number
(each configuration must have a unique, non-zero number), and
indicates that the device is self-powered. All that
msg_do_config() does is bind the MSF to the configuration.
That definition is then used by the msg_bind() function,
which is a callback to set up composite functions,
prepare descriptors, add all configurations supported by the
device, etc.:
static int __ref msg_bind(struct usb_composite_dev *cdev)
{
	int ret;

	ret = msg_fsg_init(cdev);
	if (ret < 0)
		return ret;

	ret = usb_add_config(cdev, &msg_config);
	if (ret >= 0)
		set_bit(0, &msg_registered);
	fsg_common_put(&msg_fsg_common);
	return ret;
}
The msg_bind() function does the following: it initializes the Mass
Storage Function, adds the previously defined configuration to the USB
device, and (at the end) puts the msg_fsg_common object, dropping
its own reference to it. If everything succeeds, it sets the
msg_registered flag to record that the gadget has been registered and
initialized.
With all of the above, a composite device can be
defined. For this purpose, the usb_composite_driver
structure is used. Besides specifying the name, it points to
the device descriptors and the bind callback:
static struct usb_composite_driver msg_device = {
	.name = "g_my_mass_storage",
	.dev = &msg_dev_desc,
	.bind = msg_bind,
};
At this point, all that is left are the init and exit module
functions:
static int __init msg_init(void)
{
	return usb_composite_register(&msg_device);
}

static void msg_exit(void)
{
	if (test_and_clear_bit(0, &msg_registered))
		usb_composite_unregister(&msg_device);
}
They use the usb_composite_register()
and usb_composite_unregister()
functions to register and unregister the device. The
msg_registered variable is used to ensure the device is
unregistered only once.
To sum things up:
- A composite device (msg_device) is registered
in msg_init() when the module loads.
- It has a device bind callback (msg_bind())
that initializes MSF and adds the configuration to the gadget.
- The configuration (msg_config) has its own
bind callback (msg_do_config()), which binds MSF
to the configuration.
- The really hard work is done inside the MSF.
Mass Storage Function
With the big picture in mind, let's get into the finer details: the
inner workings of the Mass Storage Function. There are a couple
of things to watch out for when dealing with it.
First of all, because MSF can be bound to several
configurations, it needs to share some data between the instances
and at the same time store information specific for each
configuration. The fsg_common
structure is used for shared data. An instance of this
structure needs to be initialized prior to binding MSF.
Because the common object is used by several MSF instances, it
has no single owner; a reference counter is therefore needed to decide
when it can be destroyed. That's the reason for the fsg_common_put()
call at the end of the msg_bind()
function.
Closely connected with the fsg_common structure is
a worker
thread which MSF uses to handle all the host's requests. When
an fsg_common object is created, a thread is started as well.
It terminates either when the fsg_common object is
destroyed or when it is killed with an INT, TERM, or KILL signal.
In the latter case, the fsg_common object may still exist
even after the worker's death. Whatever the reason, when the thread exits,
a thread_exits callback is invoked.
It is important to note that a signal may terminate the worker
thread, but why would one want to do that? The reason is simple. As
long as MSF is holding any open
files, the filesystems which those files belong to cannot be
unmounted. That is bad news for a shutdown script.
Alan Stern's solution in FSG was to close all backing
files when the worker thread receives an INT, TERM, or KILL signal.
Because MSF is to be used with various composite gadgets, a callback
has been introduced rather than hardcoding that behavior.
The last thing to note is that MSF is customizable. The UMC
specification allows for a single device to have several
logical units (sometimes called LUNs, which
is strictly speaking incorrect since LUN stands for Logical Unit
Number). Each logical unit may be read-only or read-write, may emulate
a CD-ROM or disk drive, and may be removable or not.
All of this configuration must be specified when the
fsg_common structure is initialized. The fsg_config
structure is used for exactly that purpose. In most cases,
a module author does not want to fill it themselves, but rather let
a user of the module decide the settings.
To make it as easy as possible, an fsg_module_parameters
structure and an FSG_MODULE_PARAMETERS()
macro are provided by the MSF. The former stores
user-supplied arguments, whereas the latter defines several module
parameters.
Having an fsg_module_parameters object, one may use fsg_config_from_params()
followed by fsg_common_init()
to create an fsg_common object. Alternatively, fsg_common_from_params()
can be used, which combines the calls to the other two functions.
Here is how it all works when put together:
static struct fsg_module_parameters msg_mod_data = { .stall = 1 };
FSG_MODULE_PARAMETERS(/* no prefix */, msg_mod_data);

static struct fsg_common msg_fsg_common;

static int msg_thread_exits(struct fsg_common *common)
{
	msg_exit();
	return 0;
}

static int msg_fsg_init(struct usb_composite_dev *cdev)
{
	struct fsg_config config;
	struct fsg_common *retp;

	fsg_config_from_params(&config, &msg_mod_data);
	config.thread_exits = msg_thread_exits;
	retp = fsg_common_init(&msg_fsg_common, cdev, &config);
	return IS_ERR(retp) ? PTR_ERR(retp) : 0;
}
The msg_exit() function has been chosen as MSF's
thread_exits callback. Since MSF is nonoperational after
the thread has exited, there is no need to keep the composite
device registered; instead, the gadget is unregistered.
At this point, it should become obvious why the
msg_registered flag is being used. Since
usb_composite_unregister() can be called from two different
places, a mechanism to guarantee that it will be called only once
is needed — atomic bit operations are perfect for such
tasks.
And that would be it. We are done. One can grab the full source code and start playing with
it.
The beauty of the composite framework is that all the really
hard stuff has been already written. One can write devices and
experiment with different configurations without deep knowledge of
the USB specification or the Linux gadget API. At the same time,
it is a perfect introduction to some more serious USB
programming.
Running
To use the gadget, one needs to provide a disk image that will
act as a real USB device to the USB host. Using dd on
the device is perfect for creating one:
# dd if=/dev/zero of=disk.img bs=1M count=64
With the disk image in place, the module can be loaded:
# insmod g_my_mass.ko file=$PWD/disk.img
Connecting the device to the host should produce several
messages in the host system log, among others:
usb 1-4.4: new high speed USB device using ehci_hcd and address 8
usb 1-4.4: New USB device found, idVendor=0525, idProduct=a4a5
usb-storage: device scan complete
sd 6:0:0:0: [sdb] Attached SCSI removable disk
sd 6:0:0:0: [sdb] 131072 512-byte logical blocks: (67.1 MB/64.0 MiB)
sdb: unknown partition table
All that is left is to create a partition with a filesystem and
start using the pendrive:
# fdisk /dev/sdb
...
# dmesg -c
sd 6:0:0:0: [sdb] Assuming drive cache: write through
sdb: sdb1
# mkfs.vfat /dev/sdb1
mkfs.vfat 3.0.9 (31 Jan 2010)
# mount /dev/sdb1 /mnt/
# touch /mnt/foo
# umount /mnt
As has been shown, the gadget works like a charm.
Conclusion
The Linux USB composite framework provides a fairly straightforward way
to add USB devices. Before the composite framework came along,
developers
needed to implement all USB requests for each gadget they wanted to add to
the system. The framework handles basic USB requests and separates each
USB composite function, which allows gadget authors to think in terms of
functions rather than low-level interfaces and communication handling.
As one might guess, this article just scratches the surface of what the
composite framework can do. The driver that was shown is a
single-configuration, single-function gadget, so the advantages over
non-composite gadgets are not readily apparent. A future article may look
at drivers for more powerful gadgets using the composite framework.
Page editor: Jake Edge