Brief items
The current development kernel is 2.5.67, which was
released by Linus on April 7. This big
patch includes more IDE work, a big x86-64 merge, more preparation for an
enlarged
dev_t type, a bunch of PCMCIA work, a new SCSI debug module, some
IPSec patches, some driver model work, and many other fixes and updates.
See
the long-format changelog for the
details.
Linus's BitKeeper repository contains the first steps in a process of
marking user-space pointers with a new
__user attribute. This
attribute is meant to be used by static code checkers to find places where
these pointers are being dereferenced directly. There is also a small change
to the semantics of msync(MS_ASYNC) (it no longer actually starts
any I/O), some reverse-mapping VM speedups, a new requirement that gcc
version 2.95 (or later) be used to compile the kernel, a big pile of small
fixes from Alan Cox, an NFSv4 update, a big IA-64 update, and a number of
other fixes.
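As a rough user-space illustration of how such an annotation can work: the marker expands to nothing for the ordinary compiler, so it costs nothing at run time, while a static checker sees an attribute it can track. The checker-side definition shown here is an assumption about how the attribute might be spelled, and the demo_ functions are hypothetical stand-ins, not kernel interfaces.

```c
#include <stddef.h>
#include <string.h>

#ifdef __CHECKER__
/* Assumed checker-side spelling; not taken from the text above. */
# define __user __attribute__((noderef, address_space(1)))
#else
/* The regular compiler sees an empty macro. */
# define __user
#endif

/* Hypothetical stand-in for a copy_from_user()-style helper.
 * Returns the number of bytes that could NOT be copied (always 0 here). */
static size_t demo_copy_from_user(void *to, const void __user *from, size_t n)
{
    memcpy(to, (const void *)from, n);
    return 0;
}

/* Fetch one int through an annotated pointer. */
static int demo_get_value(const int __user *uptr, int *out)
{
    /* Writing "*out = *uptr;" here is the kind of direct dereference
     * that a checker aware of the annotation is meant to flag. */
    return demo_copy_from_user(out, uptr, sizeof(*out)) ? -1 : 0;
}
```

The point of the design is that the annotation lives in the type, so a checker can follow it through function signatures without any help from the code being checked.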
The current prepatch from Alan Cox is 2.5.67-ac1. The most significant
change here is the inclusion of Bartlomiej Zolnierkiewicz's new taskfile
IDE I/O implementation (covered briefly here last
week). "Handle with care, no naked flames, do
not inhale...."
The current stable kernel is 2.4.20. Marcelo released the seventh 2.4.21 prepatch on
April 4; it is, he says, hopefully the last prepatch in the 2.4.21
series (before the release candidates start). This prepatch includes e1000
and e100 updates, another large set of fixes from the -ac tree, a bluetooth
update, some ext3 fixes, and a number of other tweaks.
Comments (none posted)
Kernel development news
There have been no new patches toward an expanded
dev_t type for a
week or two. The discussion goes on, however. Things do seem to be heading
toward a conclusion as it becomes clear that the real issue is the scope of
the changes to be made for 2.5.
The expansion of dev_t is uncontroversial; the only real point of
discussion there is how big it should be. That will be Linus's call; he
hinted a while back that he was changing his mind and preferred a 64-bit
value (32 bits each for the major and minor number) over 32 bits with a
12:20 split. In more recent times he has been silent.
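To make the two candidate sizes concrete, here is a minimal user-space sketch. It assumes the simplest contiguous packing (major number in the high bits, minor number in the low bits); the demo_ names are hypothetical, and whatever encoding is finally merged may well differ.

```c
#include <stdint.h>

/* 32-bit candidate: 12-bit major above a 20-bit minor. */
#define DEMO_MINOR_BITS 20
#define DEMO_MINOR_MASK ((UINT32_C(1) << DEMO_MINOR_BITS) - 1)
#define DEMO_MAJOR_MASK UINT32_C(0xfff)

static inline uint32_t demo_mkdev32(uint32_t major, uint32_t minor)
{
    return ((major & DEMO_MAJOR_MASK) << DEMO_MINOR_BITS) |
           (minor & DEMO_MINOR_MASK);
}

static inline uint32_t demo_major32(uint32_t dev)
{
    return dev >> DEMO_MINOR_BITS;
}

static inline uint32_t demo_minor32(uint32_t dev)
{
    return dev & DEMO_MINOR_MASK;
}

/* 64-bit candidate: 32 bits each for major and minor. */
static inline uint64_t demo_mkdev64(uint32_t major, uint32_t minor)
{
    return ((uint64_t)major << 32) | minor;
}

static inline uint32_t demo_major64(uint64_t dev)
{
    return (uint32_t)(dev >> 32);
}

static inline uint32_t demo_minor64(uint64_t dev)
{
    return (uint32_t)dev;
}
```

With the 12:20 split there is room for 4096 majors with about a million minors each; the 64-bit form simply never runs out in practice, at the cost of a larger type.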
The real disagreement has to do with the form of the expanded
dev_t patches, which implement something that looks very much like
the old, static device number space. Some developers (well, one at least:
Roman Zippel) complain that the patch should "go all the way" and create a
fully dynamic number space. He cites
numerous quotes from Chairman Linus,
who favors a dynamic device numbering scheme, to support his point.
(Linus, again, has been silent in the current discussion.)
Unless he comes up with some impressive patches quickly, Roman looks likely
to lose this argument. The focus of the work at the moment is to relieve
an immediate, pressing problem: the lack of available device numbers. The
problem is especially acute for SCSI disk drives, where the number of
possible disks is too small, and they have been restricted to 16
partitions. A simple fix for this problem will make the people most
concerned with dev_t expansion happy for now.
The bigger problem - the management of an entirely dynamic device number
space - is still characterized by a paucity of working solutions. One
approach (devfs) works, but it is a solution that is disliked by many. The
most viable competing approach at the moment looks like the hotplug
mechanism, which allows the kernel developers to push the entire problem
into user space. Some promising work is being done in that area, but it is
unlikely that even those closest to this work would claim that it will be
ready for production deployment in the near future. There is also the
little matter of the 2.5 feature freeze to worry about.
So a fully dynamic device number space looks like a 2.7 development. Few
people contest the idea that a dynamic number space is, in the long run, a
better way of doing things. But few people are ready to make that jump for
2.6.
Comments (5 posted)
One would think that it wouldn't be worth arguing over... The macro in
question is defined as:
#define SET_MODULE_OWNER(dev) ((dev)->owner = THIS_MODULE)
Rusty Russell had marked that macro as "deprecated" during the course of
his module work. There was, he thought, no real reason to keep it around.
Others disagreed, though, and Zwane Mwaikambo recently submitted (and Linus
accepted) a little patch to un-deprecate the macro. Why
do people care, when it's just as easy to set the owner field of
the structure in question directly?
The real reason, it seems, is that the macro helps in writing device
drivers which work over a wide range of kernels. Various structures
(including file_operations and net_device) lacked an
owner field in the 2.2 kernel. If a driver uses
SET_MODULE_OWNER, it is easy to make that driver compile under 2.2
with a suitable compatibility macro. If the driver sets the owner
field directly, the only way to make it work with older kernels is with
#ifdef, which is strongly discouraged in kernel code.
SET_MODULE_OWNER thus takes the form of a simple accessor function
which helps code work regardless of what actually happens inside a
particular structure.
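The compatibility trick can be sketched in user-space C as follows. Everything here apart from SET_MODULE_OWNER and THIS_MODULE is a hypothetical stand-in: DEMO_OLD_KERNEL models a 2.2-style structure with no owner field, and the empty macro body is one assumed form of the "suitable compatibility macro".

```c
#include <stddef.h>

/* Stand-ins so the sketch compiles outside the kernel. */
struct module { const char *name; };
static struct module demo_this_module = { "demo_driver" };
#define THIS_MODULE (&demo_this_module)

/* Build with -DDEMO_OLD_KERNEL to model a structure lacking "owner". */
struct demo_net_device {
    const char *name;
#ifndef DEMO_OLD_KERNEL
    struct module *owner;
#endif
};

#ifdef DEMO_OLD_KERNEL
/* Compatibility definition: there is no field to set, so do nothing. */
#define SET_MODULE_OWNER(dev) do { } while (0)
#else
#define SET_MODULE_OWNER(dev) ((dev)->owner = THIS_MODULE)
#endif

/* The driver code itself is identical for both kernels: the only
 * #ifdef lives in the compatibility header, not in the driver. */
static void demo_init_device(struct demo_net_device *dev)
{
    dev->name = "demo0";
    SET_MODULE_OWNER(dev);
}
```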
The final solution was to leave the macro un-deprecated, but with a comment
from Jeff Garzik:
/* Think of SET_MODULE_OWNER like an IBM mainframe: leave it in a dark
corner for years, don't break it, but don't ever upgrade it either
:) If there is something newer and sexier than the mainframe, it's
ok to use that instead. The mainframe won't feel lonely. -- Jeff
Garzik */
Comments (1 posted)
Stephen Smalley has a mission: he would like to get the NSA's
Security-Enhanced Linux (SELinux) patches merged into the 2.5 kernel. In
theory this task should not be all that hard: the whole point of the Linux
Security Module patches is to make it possible to plug in new security
regimes at will. At the moment, however, things don't actually work that
well, so a couple of new patches have been sent out for comments.
The first patch is relatively
straightforward. Files in SELinux have "security labels" which provide
fine-grained control over which processes can access them. SELinux needs a
mechanism to set and read those labels. So the extended attributes patch
just provides an easy mechanism for the manipulation of security labels on
files in an ext3 filesystem. Eventually, says Stephen, it will be
necessary to add this interface to most filesystems - including the virtual
ones. For example, a suitably patched version of OpenSSH can set labels on
pseudo terminals if /dev/pts supports them.
The second patch is a little trickier.
SELinux also attaches attributes to processes, and it needs an interface by
which those attributes can be manipulated from user space. At one point,
this interface was provided by the general-purpose sys_security()
system call that was part of the LSM patch. sys_security() did not sit well with a number of kernel
developers, however, and it was removed in 2.5.50. General-purpose
"multiplexor" system call interfaces are very much out of favor; they make
it almost impossible to understand the actual interface exported by the
kernel.
So SELinux has to figure out a way to manage process attributes without
sys_security(). The options were (1) to add a new,
special-purpose system call, or (2) to find some other, trickier way of
doing it. The SELinux developers opted for the latter.
With the process attributes patch, each /proc entry corresponding
to a process
would have a new attr subdirectory, containing three files.
attr/current could be read to obtain the current security
attributes for a process but (in SELinux, at least) could not be
written. A process can write its own attr/exec file, which
is a place to store process attributes for the future. The next time that
the process performs an exec() call to run a new image, the
attributes stored in attr/exec will be applied. Needless to say,
the currently loaded security module gets veto power over which attributes
can be written to that file. Finally, attr/fscreate contains
attributes which will be applied to the next file created by the process.
Storing file attributes there avoids race conditions where a program
wearing a black hat attempts to access a file in the time between its
creation and when security attributes are applied.
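From user space, the proposed interface would be driven with ordinary file operations. The sketch below builds the path to one of the three files and reads it; the demo_ helpers are hypothetical, and the read simply reports failure on a kernel that does not provide the attr directory.

```c
#include <stdio.h>
#include <string.h>

/* Build "/proc/<who>/attr/<name>", where who is "self" or a PID string
 * and name is "current", "exec" or "fscreate".
 * Returns 0 on success, -1 if the buffer is too small. */
static int demo_attr_path(char *buf, size_t len,
                          const char *who, const char *name)
{
    int n = snprintf(buf, len, "/proc/%s/attr/%s", who, name);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read one of the calling process's attributes into buf, which is
 * always NUL-terminated on success.  Returns the number of bytes
 * read, or -1 if the file cannot be opened or read. */
static long demo_read_attr(const char *name, char *buf, size_t len)
{
    char path[128];
    FILE *f;
    size_t n;

    if (len == 0 || demo_attr_path(path, sizeof(path), "self", name) < 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    n = fread(buf, 1, len - 1, f);
    if (ferror(f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[n] = '\0';
    return (long)n;
}
```

Setting attributes for the next exec() or the next file creation would be the mirror image: open attr/exec or attr/fscreate for writing and write the desired label, subject to the security module's veto.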
Kernel developers do not like multiplexor interfaces, but it is probably
worth discussing whether system interfaces based on magic /proc
files are better. One could say that, with /proc, at least the
interface is visible. For now, at least, that discussion is not happening;
there have been, as of this writing, no public comments posted in the day
since the patches went out.
Comments (4 posted)
Driver porting
The direct memory access (DMA) support layer has been extensively changed
in 2.6, but, in many cases, device drivers should work unaltered. For
developers working on new drivers, or for those wanting to keep their code
current with the latest API, there are a fair number of changes to be aware
of.
The most evident change is the creation of the new generic DMA layer. Most
driver programmers will be aware of the pci_* DMA support
functions; SPARC programmers may have also encountered the analogous set of
sbus_* functions. Starting with 2.5.53, a new set of generic DMA
functions was added which is intended to provide a DMA support API that is
not specific to any particular bus. The new functions look much like the
old ones; changing from one API to the other is a fairly automatic job.
The discussion below will note changes in the DMA API without looking at
every new dma_* function. See our DMA API quick reference
page for a concise summary of the mapping from the old PCI interface to
the new generic functions.
Allocating DMA regions
The new and old DMA APIs both distinguish between "consistent" (or
"coherent") and "streaming" memory. Consistent memory is guaranteed to
look the same to the processor and to DMA-capable devices, without problems
caused by caching; it is most often used for long-lasting, bidirectional
I/O buffers. Streaming memory may have cache effects, and is generally
used for a single transfer.
The PCI functions for allocating consistent memory are unchanged from 2.4:
void *pci_alloc_consistent(struct pci_dev *dev, size_t size,
dma_addr_t *dma_handle);
void pci_free_consistent(struct pci_dev *dev, size_t size,
void *cpu_addr, dma_addr_t dma_handle);
The generic version is a little different, adopting the term "coherent" for
this type of memory, and adding an allocation flag:
void *dma_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, int flag);
void dma_free_coherent(struct device *dev, size_t size,
void *cpu_addr, dma_addr_t dma_handle);
Here the added flag argument is the usual memory allocation flag.
pci_alloc_consistent() is deemed to have an implicit
GFP_ATOMIC flag.
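The relationship between the two allocators can be modeled in user space as shown here. This is only a simulation under stated assumptions: every demo_ type, constant and function is a hypothetical stand-in for its kernel counterpart, malloc() takes the place of the real allocator, and the "bus address" is just the CPU address.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t demo_dma_addr_t;
struct demo_device { int last_flag; };            /* records the flag used */
struct demo_pci_dev { struct demo_device dev; };
enum { DEMO_GFP_ATOMIC = 1, DEMO_GFP_KERNEL = 2 };

/* Models dma_alloc_coherent(): returns the CPU address of the buffer
 * and stores the address the device would use in *handle. */
static void *demo_dma_alloc_coherent(struct demo_device *dev, size_t size,
                                     demo_dma_addr_t *handle, int flag)
{
    void *cpu_addr = malloc(size);

    dev->last_flag = flag;
    if (cpu_addr)
        *handle = (demo_dma_addr_t)(uintptr_t)cpu_addr;
    return cpu_addr;
}

/* Models dma_free_coherent(): needs the same size and both addresses. */
static void demo_dma_free_coherent(struct demo_device *dev, size_t size,
                                   void *cpu_addr, demo_dma_addr_t handle)
{
    (void)dev;
    (void)size;
    (void)handle;
    free(cpu_addr);
}

/* Models pci_alloc_consistent(): the same call with an implicit
 * atomic allocation flag. */
static void *demo_pci_alloc_consistent(struct demo_pci_dev *pdev, size_t size,
                                       demo_dma_addr_t *handle)
{
    return demo_dma_alloc_coherent(&pdev->dev, size, handle, DEMO_GFP_ATOMIC);
}
```

A driver that allocates from a context where sleeping is allowed could, with the generic call, pass a less restrictive flag than the PCI wrapper implies.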
For single-buffer streaming allocations, the PCI interface is, once again,
unchanged, and the generic DMA interface is isomorphic to the PCI version.
There is now an enumerated type for describing the direction of the
mapping:
enum dma_data_direction {
DMA_BIDIRECTIONAL = 0,
DMA_TO_DEVICE = 1,
DMA_FROM_DEVICE = 2,
DMA_NONE = 3,
};
The actual mapping and unmapping functions are:
dma_addr_t dma_map_single(struct device *dev, void *addr,
size_t size,
enum dma_data_direction direction);
void dma_unmap_single(struct device *dev, dma_addr_t dma_addr,
size_t size,
enum dma_data_direction direction);
dma_addr_t dma_map_page(struct device *dev, struct page *page,
unsigned long offset, size_t size,
enum dma_data_direction direction);
void dma_unmap_page(struct device *dev, dma_addr_t dma_addr,
size_t size,
enum dma_data_direction direction);
As is the case with the PCI versions of these functions, use of the
offset and size parameters is discouraged unless you
really know what you are doing.
There has been one significant change in the creation of scatter/gather
streaming DMA mappings. The 2.4 version of struct scatterlist
used a char * pointer (called address) for the
buffer to be mapped, with a
struct page pointer that would be used only for high memory
addresses. In 2.6, the address pointer is gone, and all
scatterlists must be built using struct page pointers.
The generic versions of the scatter/gather functions are:
int dma_map_sg(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction direction);
void dma_unmap_sg(struct device *dev, struct scatterlist *sg,
int nhwentries, enum dma_data_direction direction);
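For drivers that used to fill in the address pointer, the conversion amounts to describing each buffer as a page plus an offset within that page. The following user-space sketch shows the arithmetic only; the demo_ structures are hypothetical stand-ins, the offset and length field names are assumptions, and a 4KB page size is assumed.

```c
#include <stdint.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u   /* assumption: 4KB pages */

/* Stand-ins for struct page and a page-based scatterlist entry. */
struct demo_page { unsigned long pfn; };
struct demo_scatterlist {
    struct demo_page *page;    /* page holding the buffer */
    unsigned int offset;       /* start of the buffer within the page */
    unsigned int length;       /* buffer length in bytes */
};

/* Fill one entry from a simulated linear address.  "pages" is an array
 * indexed by page frame number; the buffer is assumed not to cross a
 * page boundary. */
static void demo_sg_set_buf(struct demo_scatterlist *sg,
                            struct demo_page *pages,
                            uintptr_t addr, unsigned int len)
{
    sg->page = &pages[addr / DEMO_PAGE_SIZE];
    sg->offset = (unsigned int)(addr % DEMO_PAGE_SIZE);
    sg->length = len;
}
```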
Noncoherent DMA mappings
The generic DMA layer in 2.6 includes a set of functions for the creation
of explicitly noncoherent mappings. Very few drivers will need to use this
interface; it is mostly intended for code that must work on older platforms
that are unable to create coherent mappings. Note that there are no PCI
equivalents for these functions; you must use the generic variants.
A noncoherent mapping is created with:
void *dma_alloc_noncoherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, int flag);
This function behaves identically to dma_alloc_coherent(), except
that the returned mapping might not be in coherent memory. Drivers using
this memory must be careful to follow the ownership rules and call the
appropriate dma_sync_* functions when needed. An additional
function:
void dma_sync_single_range(struct device *dev, dma_addr_t dma_handle,
unsigned long offset, size_t size,
enum dma_data_direction direction);
will synchronize only a portion of a (larger) noncoherent mapping.
When your driver is done with the mapping, it should be returned to the
system with:
void dma_free_noncoherent(struct device *dev, size_t size,
void *cpu_addr, dma_addr_t dma_handle);
Double address cycle addressing
The PCI bus is capable of a "double address cycle" (DAC) mode of
operation. DAC enables the use of 64-bit DMA addresses, greatly expanding
the range of memory which is reachable on systems without I/O memory
mapping units. DAC is also expensive, however, and is not properly
supported by all devices and buses. So the DMA support routines will
normally go out of their way to avoid creating mappings that require DAC -
even when the driver has set an address mask that would allow it.
There are occasions where DAC is useful, however. In particular, very
large DMA mappings may not be possible in the normal, single-cycle address
range. For these rare cases, the PCI layer (but not the generic DMA layer)
provides a special set of functions. Note that the DAC functions can be
very expensive to use; they should generally be avoided unless absolutely
necessary. These functions aren't strictly a 2.6 feature; they were also
added to 2.4.13.
A DAC-capable driver must begin by setting a separate address mask:
int pci_dac_set_dma_mask(struct pci_dev *dev, u64 mask);
The mask describes the address range that your device can
support. If the function returns non-zero, DAC addressing cannot be used
and should not be attempted.
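An address mask is simply a value with one bit set for each address line the device can drive. The helpers below are hypothetical illustrations (they are not kernel functions): one builds a mask for a given number of address bits, and the other tests whether a mask reaches beyond the 32-bit range that a single address cycle can cover.

```c
#include <stdint.h>

/* Mask with the low "bits" address bits set; bits is clamped to 64. */
static uint64_t demo_dma_mask(unsigned int bits)
{
    if (bits == 0)
        return 0;
    return bits >= 64 ? ~UINT64_C(0) : (UINT64_C(1) << bits) - 1;
}

/* Non-zero if the mask includes addresses above 4GB, which cannot be
 * expressed in a single 32-bit address cycle. */
static int demo_mask_needs_dac(uint64_t mask)
{
    return (mask >> 32) != 0;
}
```

A device with 40 address lines, for example, would describe itself with demo_dma_mask(40); only masks of that wider kind are worth passing to the DAC interface at all.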
A DAC mapping is created with:
dma64_addr_t pci_dac_page_to_dma(struct pci_dev *dev,
struct page *page,
unsigned long offset,
int direction);
There are a few things to note about DAC mappings. They can only be created
using struct page pointers and offsets; DAC mappings, by their
nature, will be in high memory and thus will not have kernel virtual
addresses. DAC mappings are a straight address translation requiring no
external resources, so there is no need to explicitly unmap them after
use. Finally, all DAC mappings are inconsistent (noncoherent) mappings, so
explicit synchronization is needed to ensure that the device and CPU see
the same memory. For a DAC mapping, use:
void pci_dac_dma_sync_single(struct pci_dev *dev,
dma64_addr_t dma_addr,
size_t len, int direction);
Some other details
On many architectures, no resources are consumed by DMA mappings, and thus
there is no real need to unmap them. The various unmap functions are set
up as no-ops on those architectures, but some programmers evidently dislike
the need to remember DMA mapping addresses and lengths unnecessarily. So
2.6 (and 2.4 as of 2.4.18) has a fairly elaborate bit of preprocessor abuse
which can be used to save a couple words of memory. See
Documentation/DMA-mapping.txt in
the source tree if this appeals to you.
The "PCI pool" interface is definitely not a 2.5-specific feature, since it
first appeared in 2.4.4. That is new enough, however, that some references
(e.g. Linux Device Drivers, Second Edition) do not cover it. The
PCI pool interface enables the use of very small DMA buffers. In the past,
such buffers would often be kept in device-specific structures. Some users
ran into trouble, however, when the DMA buffer shared a cache line with
other members of the same structure. The PCI pool interface was created to
help move tiny DMA buffers into their own space and avoid this sort of
memory corruption. Again, see DMA-mapping.txt for the details.
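The cache-line problem is easy to see with a little arithmetic. In this user-space sketch the structure layout and the 64-byte line size are assumptions made for illustration, and the structure is taken to start on a cache-line boundary; the helper reports whether two fields, given by offset and length, touch a common line.

```c
#include <stddef.h>

#define DEMO_CACHE_LINE 64u   /* assumption; the real size is platform-specific */

/* A device structure with a tiny DMA buffer embedded among fields
 * that the CPU updates on its own. */
struct demo_dev_state {
    unsigned int irq_count;     /* written by the CPU */
    unsigned char status[8];    /* written by the device via DMA */
    unsigned int flags;         /* written by the CPU */
};

/* Do the byte ranges [a, a+alen) and [b, b+blen) touch a common cache
 * line?  Both lengths must be greater than zero. */
static int demo_share_cache_line(size_t a, size_t alen, size_t b, size_t blen)
{
    size_t a_first = a / DEMO_CACHE_LINE;
    size_t a_last = (a + alen - 1) / DEMO_CACHE_LINE;
    size_t b_first = b / DEMO_CACHE_LINE;
    size_t b_last = (b + blen - 1) / DEMO_CACHE_LINE;

    return a_first <= b_last && b_first <= a_last;
}
```

Here status and irq_count land in the same line, so a CPU write to the counter and a device write to the buffer can interfere on a system without coherent caches; giving the buffer its own separately allocated space removes the overlap.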
Comments (none posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Janitorial
- Christoph Hellwig: libfs.
(April 6, 2003)
Memory management
Networking
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet