Kernel developers tend to be insulated from many of the people who actually
use their code. As a way of helping bridge this gap, the kernel
summit often invites specific users to address the group and talk about the
problems they are having with the kernel. In Tokyo, the summit heard from
a panel of five representatives from Japanese companies in both the
enterprise and embedded areas. The panel, which was really a series of
five independent presentations, shone an interesting light on how Linux is
being used in Japan.
First up was Norihiro Kumagai, representing Sharp. Sharp decided a while
back to move from its proprietary, in-house realtime operating system to
Linux. Its products are generally based on system-on-chip hardware, with
drivers coming from the component vendors. That tends to force the use of
binary-only drivers. Why does Sharp not just reject components which do
not come with free drivers? The answer is that it is still hard to find
competitively-priced parts with open drivers.
Sharp's systems run a Linux kernel, but they don't look like traditional
Linux systems. To begin with, Sharp has dispensed with shells altogether,
choosing instead to boot directly into application code. The main purpose
there is to achieve a faster startup. A lot of work has been done on
getting the kernel to load quickly; it turns out that there is a tradeoff
between the time required to load a kernel (arguing for strong compression)
and the time required to extract it (arguing against that compression).
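The tradeoff can be illustrated with a back-of-the-envelope model. This is only a sketch: the function name, its parameters, and any figures fed to it are assumptions made for illustration, not measurements from Sharp. It treats loading as reading the compressed image from flash at a fixed rate, followed by unpacking at a rate measured in megabytes of output per second.

```c
/*
 * Hypothetical model of kernel load time; none of these parameters
 * come from Sharp's presentation.
 *
 *   image_mb     size of the uncompressed kernel image, in MB
 *   ratio        compression ratio (1.0 means stored uncompressed)
 *   flash_mb_s   read bandwidth of the boot medium, in MB/s
 *   unpack_mb_s  decompressor throughput, in MB of output per second
 */
double boot_load_seconds(double image_mb, double ratio,
                         double flash_mb_s, double unpack_mb_s)
{
    /* Stronger compression means fewer bytes to read from flash... */
    double read_s = (image_mb / ratio) / flash_mb_s;

    /* ...but the whole image must then be unpacked before it can run. */
    double unpack_s = (ratio > 1.0) ? image_mb / unpack_mb_s : 0.0;

    return read_s + unpack_s;
}
```

Under this model, a stronger compressor only helps if the read time it saves exceeds the extra unpacking time it costs, which depends entirely on the relative speed of the flash and the processor.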
Has Sharp been working with the community to improve boot time? Regardless
of any other concerns, there is a real impediment to doing that: Sharp's
systems are running a 2.4.20 kernel. That, evidently, is what their
supplier (MontaVista) is giving to them.
These systems do a lot with user-space drivers. Much of the code was
ported directly from the in-house RTOS, and it was easier to make it work
in user space. There is a special char driver which is used to dispatch
interrupts to user space. All of Sharp's code runs within the context of a
single process; evidently it just works.
Use of filesystems is also minimized. JFFS2 was evaluated, but found to be
too slow to mount. The JFFS2 garbage collection thread is too heavyweight,
and they also experienced some stability problems with (the 2.4.20 version
of) JFFS2. So, instead, the root comes from a read-only cramfs image, and
persistent data is managed through direct, user-space access to the MTD
device.
Why did Sharp choose Linux? They liked the quality of the core kernel
code. The driver support is good, and the quality of the compiler
toolchain was unmatched anywhere else. The availability of source code is
crucially important; Sharp needs to be able to fix problems on short notice
and without relying on anybody else. And Sharp appreciates the ongoing
work being done by the development community - even though it is making
little use of the community's recent work. In response to a question on
whether Sharp had tried a more recent kernel, the answer was that it was
usually not possible; the board support is usually not in the mainline so
they have to go with what the embedded system provider is offering.
Next up was Ryoichi Sugimura of Panasonic, who is also a director of the LiMo
Foundation. Panasonic, he says, started looking at Linux for digital TV
applications around 1996, and for mobile use around 2000. Initially
Panasonic looked at sharing its work through the Consumer Electronics Linux
Foundation (CELF), but, more recently, it has been contributing code into
LiMo instead.
Like others, Panasonic has faced a number of challenges in making Linux
work for its products. Reducing memory use was high on the list; it was
addressed through the use of execute-in-place technology and more. Startup
time is important; it was improved through heavy use of prelinking.
Realtime performance was required; to make that happen, Panasonic has
employed the realtime preemption patch set. And power consumption is
always an issue; one thing that Panasonic did here was to get rid of the
periodic timer tick when the processor goes idle.
As mentioned before, Panasonic is now working with LiMo as the outlet for
its code contributions. They are having some trouble, though, figuring out
how to get code upstream. There is also concern about multicore systems
which present some serious development and debugging challenges.
Tim Bird presented as a representative of both Sony and CELF. According to
Tim, Linux has achieved world domination everywhere except on desktop
systems. It is, he says, the new monopoly, but it's a benevolent monopoly
which is less worrisome than many others.
Tim raised a number of "pain points" for Sony. One of those is the
"version gap" between what embedded vendors are shipping and the mainline.
It is, he says, getting better; just a few years ago, the 2.4 kernel was
still being used in new products. Now most companies have moved up to
at least 2.6.11, which is still not great. Sony is currently looking at
2.6.29 for products being designed now.
A question was asked: how does the decision on kernel versions get made?
It seems there tend to be a lot of internal battles around that decision.
Often, though, it is made by default: system-on-chip vendors make a new
product, then get one of the embedded Linux vendors to create a kernel for
them. That will be the kernel which is available for manufacturers to
use. The presence of binary-only drivers can further constrain options in
this area.
Patch maintenance is another source of pain. According to Tim, Sony is
currently carrying 1029 patches against the 2.6.29 kernels; as was observed
by the audience, this is worse than many of the enterprise kernels. The
patches break down this way:
| 637 | External features not currently in the mainline, including the realtime tree and the LTTng trace toolkit |
| 164 | Board support |
| 93 | Realtime fixes and tuning patches |
| 68 | Local features and fixes |
| 34 | Internal build system patches |
| 28 | Fixes backported from later kernels |
It was observed that backports are a relatively small part of the total.
What, it was asked, does Sony do about security? The response was
relatively vague; in essence, we were told that the security needs were
reduced because Sony's devices are closed systems.
Sony would like to get more patches into the mainline, but that proves to
be a challenging thing to do. Developers who submit patches are often
rewarded by requests for an expanded scope and other work which is only
partially related. For example, a patch adding memory notifications to
control groups drew a request that the author create a generic event
mechanism for the control group subsystem. But embedded developers are
often not full-time kernel developers; they have neither the time nor the
skills to respond to this kind of request.
So what happens instead is that embedded developers suffer an ongoing
barrage of criticism about their lack of contribution. They would like to
contribute, but their code tends not to be good enough. The complaints are
not fun, but Tim notes that there are fewer complaints than there once
were.
Sony wishes that there were fewer barriers to switching versions. One
thing that would help a lot would be to merge some of the significant
out-of-tree projects, starting with the realtime tree.
Other issues include the growing size of the kernel, though, as Tim notes,
"Moore's law saves us." He also acknowledged that the biggest bloat
problems are in user space. Boot time is always important to embedded
developers, but it has been improving quickly recently. Filesystems need
work; UBIFS still must scan the media at mount time, so it is not suitable
for a fast-booting system. It seems that the flash filesystem developers
know how to solve this problem, but nobody has actually done it yet. Power
management is an issue; embedded vendors want their systems to be "mostly
asleep" and consuming as little power as possible. Memory management can
be a struggle; better ways for notification of and recovery from low-memory
situations would be helpful. Video and audio drivers can be problematic,
especially in conjunction with the realtime patches. And security is
always on the radar - even SMACK is too big for systems like this; SELinux
was not even mentioned.
Moving away from embedded systems, Takahiro Itagaki, of NTT and a PostgreSQL
developer, talked about what PostgreSQL would like to see from the kernel.
Unlike some other database systems, PostgreSQL does not do any direct I/O.
The project's philosophy is that it would like to take advantage of the
buffer management done by the kernel and not attempt to rewrite the block
layer in user space. As will be seen, this approach has some costs, but it
also enables PostgreSQL to be supported on a number of platforms by a
relatively small team of developers.
One thing PostgreSQL would like to see is support for low-priority I/O by
background tasks. Things like vacuuming the database should run in a way
which does not interfere with production work. The biggest problem seems
to be that calls to fsync() take the inode mutex, thus blocking
most other operations on the file. Even lseek() will block while
this is happening. Much of the problem could be solved just by avoiding
lseek() and using pread() and pwrite() instead.
That, however, is a hard sell; evidently pread() disabled
readahead on certain other operating systems. An alternative is to fix
lseek(); evidently that has been attempted, but the patch was not
accepted.
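The difference between the two access styles can be sketched as follows. This is an illustration only, not PostgreSQL code: the helper names and the PAGE_BYTES constant are hypothetical. The positional calls pread() and pwrite() take the file offset as an argument, so no lseek() call is needed and the descriptor's current offset is never touched.

```c
#define _XOPEN_SOURCE 700   /* ask for the POSIX declarations of pread()/pwrite() */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical page size for illustration; not taken from PostgreSQL. */
#define PAGE_BYTES 8192

/* Seek-then-read: two system calls, and the lseek() is the call that
 * was described as blocking while fsync() is running on the file. */
ssize_t read_page_seek(int fd, long page_no, void *buf)
{
    if (lseek(fd, (off_t)page_no * PAGE_BYTES, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, PAGE_BYTES);
}

/* Positional read: one system call, no lseek(), and the descriptor's
 * file offset is left unchanged. */
ssize_t read_page_pread(int fd, long page_no, void *buf)
{
    return pread(fd, buf, PAGE_BYTES, (off_t)page_no * PAGE_BYTES);
}

/* Positional write, the counterpart of read_page_pread(). */
ssize_t write_page_pwrite(int fd, long page_no, const void *buf)
{
    return pwrite(fd, buf, PAGE_BYTES, (off_t)page_no * PAGE_BYTES);
}
```

Whether the positional calls behave as well on every platform PostgreSQL supports is exactly the concern raised in the session; the sketch only shows what the change would look like at the call site.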
The other issue for PostgreSQL is duplicated caching between the database
and the kernel. Since buffered I/O is being used, any cache kept by
PostgreSQL itself risks duplicating data already stored in memory by the
kernel. Much of this could be avoided if PostgreSQL were to use
mmap() to access its files. But that creates problems in
situations where blocks must be written in a specific order - PostgreSQL's
write-ahead logging in particular. To avoid this problem, the PostgreSQL
developers would like to have a special madvise() operation to
tell the kernel not to flush specific blocks to disk. As it turns out,
though, this would be an expensive option to implement, so enthusiasm was
low.
Linus suggested that the use of mmap() was not the way to go, that
it would always be more painful. Chris Mason said that one option could be
to avoid writing the commit block to mapped memory entirely until the rest
of the data had made it to disk; that would avoid the problems that result
if the commit block is written too soon. But Alan Cox warned that disk
drives will reorder operations, so there is a need for barriers in any
case, and those cannot be done through mapped memory. So a true solution
seems elusive.
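The ordering Chris Mason described might look something like the sketch below. The function name and the record layout (payload at the start of the mapping, a one-byte commit flag at a caller-chosen offset) are assumptions made for illustration; nothing here comes from PostgreSQL. The mapping is assumed to be a MAP_SHARED mapping of the file, as returned by mmap(). The sketch does not address Alan Cox's point about drives reordering writes.

```c
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: store a payload in a shared file mapping and only
 * touch the commit flag once the payload has been synced.
 *
 *   map, map_len  a MAP_SHARED file mapping and its length
 *   commit_off    offset of the one-byte commit flag within the mapping
 *   data, len     payload, which must fit entirely below commit_off
 *
 * Returns 0 on success, -1 on a bad argument or msync() failure.
 */
int commit_in_order(char *map, size_t map_len, size_t commit_off,
                    const char *data, size_t len)
{
    if (commit_off >= map_len || len > commit_off)
        return -1;

    /* Step 1: write the payload and ask the kernel to write it back. */
    memcpy(map, data, len);
    if (msync(map, map_len, MS_SYNC) != 0)
        return -1;

    /* Step 2: only now dirty the commit flag, then sync again. Until
     * this store, the kernel has no commit block it could flush early. */
    map[commit_off] = 1;
    return msync(map, map_len, MS_SYNC);
}
```

The point of the design is that the commit flag does not exist in the mapped page until the payload has been synced, so writeback cannot flush it prematurely; the cost is two synchronous msync() calls per commit.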
The final presenter was Kazuhiro Itakura of the Bank of Tokyo-Mitsubishi
UFJ, Ltd. His main request was for better support for mirroring in LVM.
Current mirroring suffers from the problem that the mirror log goes to one
destination only. If that device goes down, the other will go into
read-only mode, effectively stopping the system. Mirroring also does not
detect a device which simply goes out to lunch without returning an I/O
error status. Proprietary Unix systems have evidently done a better job of
mirroring for a long time; it would be nice to see this in Linux as well.
As it happens, more robust mirroring can be done now with network-attached
storage devices. For locally-attached systems, loopback can be used. It is
non-trivial to set up, though; it was acknowledged that a better solution
is needed.
The panel ended there. It was seen by most as a useful exercise;
opportunities for either side to talk to the other in this way are
relatively rare. There is always value for kernel developers in hearing
where their users are suffering; with any luck at all, the result will be a
better kernel for everybody.