
Kernel development

Brief items

Kernel release status

The current development kernel is 2.5.67, which was released by Linus on April 7. This big patch includes more IDE work, a big x86-64 merge, more preparation for an enlarged dev_t type, a bunch of PCMCIA work, a new SCSI debug module, some IPSec patches, some driver model work, and many other fixes and updates. See the long-format changelog for the details.

Linus's BitKeeper repository contains the first steps in a process of marking user-space pointers with a new __user attribute. This attribute is meant to be used by static code checkers to find places where these pointers are being dereferenced directly. There is also a small change to the semantics of msync(MS_ASYNC) (it no longer actually starts any I/O), some reverse-mapping VM speedups, a new requirement that gcc version 2.95 (or later) be used to compile the kernel, a big pile of small fixes from Alan Cox, an NFSv4 update, a big IA-64 update, and a number of other fixes.
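As a rough illustration of the __user idea, the sketch below shows one way such an annotation can be set up so that it vanishes for the ordinary compiler and is visible only to a checker. The attribute spelling follows what later kernel headers use for this purpose and should be read as an assumption here; model_copy_from_user() and example_get_value() are hypothetical stand-ins written so the example compiles in user space.

```c
#include <string.h>

/* Assumed definition: only a static checker sees the attribute. */
#ifdef __CHECKER__
# define __user __attribute__((noderef, address_space(1)))
#else
# define __user
#endif

/* Toy stand-in for the kernel's copy_from_user(); returns bytes NOT copied. */
static unsigned long model_copy_from_user(void *to, const void __user *from,
                                          unsigned long n)
{
    memcpy(to, (const void *)from, n);
    return 0;
}

/* Hypothetical handler that receives a pointer from user space. */
static int example_get_value(const int __user *uptr, int *out)
{
    /* "*out = *uptr;" is the direct dereference a checker would flag. */
    if (model_copy_from_user(out, uptr, sizeof(*out)) != 0)
        return -1;
    return 0;
}
```

The point of the annotation is that the direct dereference in the comment compiles without complaint under gcc, so only a separate checking pass can catch it.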

The current prepatch from Alan Cox is 2.5.67-ac1; the most significant change here is the inclusion of Bartlomiej Zolnierkiewicz's new taskfile IDE I/O implementation (covered briefly here last week). "Handle with care, no naked flames, do not inhale...."

The current stable kernel is 2.4.20. Marcelo released the seventh 2.4.21 prepatch on April 4; it is, he says, hopefully the last prepatch in the 2.4.21 series (before the release candidates start). This prepatch includes e1000 and e100 updates, another large set of fixes from the -ac tree, a Bluetooth update, some ext3 fixes, and a number of other tweaks.


Kernel development news

The ongoing device number debate

There have been no new patches toward an expanded dev_t type for a week or two. The discussion goes on, however. Things do seem to be heading toward a conclusion as it becomes clear that the real issue is the scope of the changes to be made for 2.5.

The expansion of dev_t is uncontroversial; the only real point of discussion there is how big it should be. That will be Linus's call; he hinted a while back that he was changing his mind and preferred a 64-bit value (32 bits each for the major and minor number) over 32 bits with a 12:20 split. In more recent times he has been silent.
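To make the two candidate layouts concrete, here is a minimal sketch of how a major/minor pair could be packed under each split. The type and function names (dev32_t, mkdev_12_20(), and so on) are invented for illustration and do not come from any posted patch; the sketch also assumes that the 12:20 split puts the major number in the upper 12 bits.

```c
#include <stdint.h>

/* Hypothetical 32-bit layout: 12-bit major on top, 20-bit minor below. */
typedef uint32_t dev32_t;

#define MINOR_BITS_12_20 20
#define MINOR_MASK_12_20 ((1u << MINOR_BITS_12_20) - 1)

static inline dev32_t mkdev_12_20(unsigned int major, unsigned int minor)
{
    return ((dev32_t)major << MINOR_BITS_12_20) | (minor & MINOR_MASK_12_20);
}

static inline unsigned int major_12_20(dev32_t dev)
{
    return dev >> MINOR_BITS_12_20;
}

static inline unsigned int minor_12_20(dev32_t dev)
{
    return dev & MINOR_MASK_12_20;
}

/* Hypothetical 64-bit layout: 32 bits each for major and minor. */
typedef uint64_t dev64_t;

static inline dev64_t mkdev_32_32(uint32_t major, uint32_t minor)
{
    return ((dev64_t)major << 32) | minor;
}

static inline uint32_t major_32_32(dev64_t dev)
{
    return (uint32_t)(dev >> 32);
}

static inline uint32_t minor_32_32(dev64_t dev)
{
    return (uint32_t)(dev & 0xffffffffu);
}
```

Under the 12:20 layout there is room for 4096 major numbers with 1,048,576 minors each; the 32:32 layout simply removes both limits for practical purposes, at the cost of a wider type.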

The real disagreement has to do with the form of the expanded dev_t patches, which implement something that looks very much like the old, static device number space. Some developers (well, one at least: Roman Zippel) complain that the patch should "go all the way" and create a fully dynamic number space. He cites numerous quotes from Chairman Linus, who favors a dynamic device numbering scheme, to support his point. (Linus, again, has been silent in the current discussion.)

Unless he comes up with some impressive patches quickly, Roman looks likely to lose this argument. The focus of the work at the moment is to relieve an immediate, pressing problem: the lack of available device numbers. The problem is especially acute for SCSI disk drives, where the number of possible disks is too small, and they have been restricted to 16 partitions. A simple fix for this problem will make the people most concerned with dev_t expansion happy for now.

The bigger problem - the management of an entirely dynamic device number space - is still characterized by a paucity of working solutions. One approach (devfs) works, but it is a solution that is disliked by many. The most viable competing approach at the moment looks like the hotplug mechanism, which allows the kernel developers to push the entire problem into user space. Some promising work is being done in that area, but it is unlikely that even those closest to this work would claim that it will be ready for production deployment in the near future. There is also the little matter of the 2.5 feature freeze to worry about.

So a fully dynamic device number space looks like a 2.7 development. Few people contest the idea that a dynamic number space is, in the long run, a better way of doing things. But few people are ready to make that jump for 2.6.


SET_MODULE_OWNER

One would think that it wouldn't be worth arguing over... The macro in question is defined as:

    #define SET_MODULE_OWNER(dev) ((dev)->owner = THIS_MODULE)

Rusty Russell had marked that macro as "deprecated" during the course of his module work. There was, he thought, no real reason to keep it around. Others disagreed, though, and Zwane Mwaikambo recently submitted (and Linus accepted) a little patch to un-deprecate the macro. Why do people care, when it's just as easy to set the owner field of the structure in question directly?

The real reason, it seems, is that the macro helps in writing device drivers which work over a wide range of kernels. Various structures (including file_operations and net_device) lacked an owner field in the 2.2 kernel. If a driver uses SET_MODULE_OWNER, it is easy to make that driver compile under 2.2 with a suitable compatibility macro. If the driver sets the owner field directly, the only way to make it work with older kernels is with #ifdef, which is strongly discouraged in kernel code. SET_MODULE_OWNER thus takes the form of a simple accessor function which helps code work regardless of what actually happens inside a particular structure.
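The compatibility trick can be modeled in a few lines. The sketch below is a user-space imitation, not kernel code: struct module, THIS_MODULE, the HAVE_OWNER_FIELD switch, and struct net_device_model are all stand-ins invented here. It shows how a compatibility header might define SET_MODULE_OWNER as a no-op for a structure that lacks the owner field, so that the driver body itself needs no #ifdef.

```c
#include <stddef.h>

/* Stand-ins for kernel definitions so the sketch builds in user space. */
struct module { const char *name; };
static struct module this_module = { "example" };
#define THIS_MODULE (&this_module)

/* Hypothetical feature switch: set to 0 to model a 2.2-style structure. */
#ifndef HAVE_OWNER_FIELD
#define HAVE_OWNER_FIELD 1
#endif

struct net_device_model {
    const char *name;
#if HAVE_OWNER_FIELD
    struct module *owner;
#endif
};

#if HAVE_OWNER_FIELD
#define SET_MODULE_OWNER(dev) ((dev)->owner = THIS_MODULE)
#else
/* Compatibility definition for structures without an owner field. */
#define SET_MODULE_OWNER(dev) do { } while (0)
#endif

/* The driver code is the same in both configurations. */
static void example_init(struct net_device_model *dev)
{
    dev->name = "eth-example";
    SET_MODULE_OWNER(dev);
}
```

All of the conditional compilation lives in one header-like section, which is the property that makes the macro attractive to people maintaining drivers across kernel versions.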

The final solution was to leave the macro un-deprecated, but with a comment from Jeff Garzik:

/* Think of SET_MODULE_OWNER like an IBM mainframe: leave it in a dark corner for years, don't break it, but don't ever upgrade it either :) If there is something newer and sexier than the mainframe, it's ok to use that instead. The mainframe won't feel lonely. -- Jeff Garzik */


Supporting SELinux

Stephen Smalley has a mission: he would like to get the NSA's Security-Enhanced Linux (SELinux) patches merged into the 2.5 kernel. In theory this task should not be all that hard: the whole point of the Linux Security Module patches is to make it possible to plug in new security regimes at will. At the moment, however, things don't actually work that well. Thus a couple of new patches have been sent out for comments.

The first patch is relatively straightforward. Files in SELinux have "security labels" which provide fine-grained control over which processes can access them. SELinux needs a mechanism to set and read those labels. So the extended attributes patch just provides an easy mechanism for the manipulation of security labels on files in an ext3 filesystem. Eventually, says Stephen, it will be necessary to add this interface to most filesystems - including the virtual ones. For example, a suitably patched version of OpenSSH can set labels on pseudo terminals if /dev/pts supports them.

The second patch is a little trickier. SELinux also attaches attributes to processes, and it needs an interface by which those attributes can be manipulated from user space. At one point, this interface was provided by the general-purpose sys_security() system call that was part of the LSM patch. sys_security() did not sit well with a number of kernel developers, however, and it was removed in 2.5.50. General-purpose "multiplexor" system call interfaces are very much out of favor; they make it almost impossible to understand the actual interface exported by the kernel.

So SELinux has to figure out a way to manage process attributes without sys_security(). Their options would be (1) to add a new, special-purpose system call, or (2) find some other, trickier way of doing it. They opted for the latter.

With the process attributes patch, each /proc entry corresponding to a process would have a new attr subdirectory, containing three files. attr/current could be read to obtain the current security attributes for a process, but (in SELinux, at least) could not be written. A process can write its own attr/exec file, which is a place to store process attributes for the future. The next time that the process performs an exec() call to run a new image, the attributes stored in attr/exec will be applied. Needless to say, the currently loaded security module gets veto power over which attributes can be written to that file. Finally, attr/fscreate contains attributes which will be applied to the next file created by the process. Storing file attributes there avoids race conditions where a program wearing a black hat attempts to access a file in the time between its creation and when security attributes are applied.
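From user space, an interface like this would be driven with ordinary file operations. The following is a minimal sketch under the assumption that the files live at /proc/<pid>/attr/<name> as described above; the helper names are hypothetical, and the format of the label string is left entirely to the security module.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Build "/proc/<pid>/attr/<name>"; returns 0, or -1 if buf is too small. */
static int attr_path(char *buf, size_t len, pid_t pid, const char *name)
{
    int n = snprintf(buf, len, "/proc/%ld/attr/%s", (long)pid, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read one attribute file into out; returns bytes read, or -1 on failure
 * (for example, on a kernel without the attr directory). */
static long read_attr(pid_t pid, const char *name, char *out, size_t len)
{
    char path[64];
    FILE *f;
    size_t n;

    if (len == 0 || attr_path(path, sizeof(path), pid, name) < 0)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    n = fread(out, 1, len - 1, f);
    out[n] = '\0';
    fclose(f);
    return (long)n;
}

/* Ask that 'label' be applied at this process's next exec(); the loaded
 * security module may refuse, in which case -1 is returned. */
static int set_exec_attr(pid_t self, const char *label)
{
    char path[64];
    FILE *f;
    int ok;

    if (attr_path(path, sizeof(path), self, "exec") < 0)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fwrite(label, 1, strlen(label), f) == strlen(label);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A program would call read_attr(getpid(), "current", buf, sizeof(buf)) to see its own attributes, then set_exec_attr() just before exec(); writing attr/fscreate before creating a file would follow the same pattern.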

Kernel developers do not like multiplexor interfaces, but it is probably worth discussing whether system interfaces based on magic /proc files are better. One could say that, with /proc, at least the interface is visible. For now, at least, that discussion is not happening; there have been, as of this writing, no public comments posted in the day since the patches went out.


Driver porting

Driver porting: DMA changes

This article is part of the LWN Porting Drivers to 2.6 series.

The direct memory access (DMA) support layer has been extensively changed in 2.6, but, in many cases, device drivers should work unaltered. For developers working on new drivers, or for those wanting to keep their code current with the latest API, there are a fair number of changes to be aware of.

The most evident change is the creation of the new generic DMA layer. Most driver programmers will be aware of the pci_* DMA support functions; SPARC programmers may have also encountered the analogous set of sbus_* functions. Starting with 2.5.53, a new set of generic DMA functions was added which is intended to provide a DMA support API that is not specific to any particular bus. The new functions look much like the old ones; changing from one API to the other is a fairly automatic job.

The discussion below will note changes in the DMA API without looking at every new dma_* function. See our DMA API quick reference page for a concise summary of the mapping from the old PCI interface to the new generic functions.

Allocating DMA regions

The new and old DMA APIs both distinguish between "consistent" (or "coherent") and "streaming" memory. Consistent memory is guaranteed to look the same to the processor and to DMA-capable devices, without problems caused by caching; it is most often used for long-lasting, bidirectional I/O buffers. Streaming memory may have cache effects, and is generally used for a single transfer.

The PCI functions for allocating consistent memory are unchanged from 2.4:

    void *pci_alloc_consistent(struct pci_dev *dev, size_t size,
                               dma_addr_t *dma_handle);
    void pci_free_consistent(struct pci_dev *dev, size_t size,
                             void *cpu_addr, dma_addr_t dma_handle);

The generic version is a little different, adopting the term "coherent" for this type of memory, and adding an allocation flag:

    void *dma_alloc_coherent(struct device *dev, size_t size,
                             dma_addr_t *dma_handle, int flag);
    void dma_free_coherent(struct device *dev, size_t size,
                           void *cpu_addr, dma_addr_t dma_handle);

Here the added flag argument is the usual memory allocation flag. pci_alloc_consistent() is deemed to have an implicit GFP_ATOMIC flag.
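The relationship between the two allocation calls can be pictured as a thin wrapper. The sketch below is a user-space model only: the structures, the GFP_* values, and the malloc()-based dma_alloc_coherent() are placeholders invented so that the example compiles, and the embedded dev member of struct pci_dev is assumed. What it illustrates is the "implicit GFP_ATOMIC" rule stated above.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t dma_addr_t;

/* Placeholder types and flag values; not the kernel's definitions. */
struct device { int last_flag; };
struct pci_dev { struct device dev; };
#define GFP_KERNEL 1
#define GFP_ATOMIC 2

/* Stub: ordinary heap memory, with the pointer doubling as "bus address". */
static void *dma_alloc_coherent(struct device *dev, size_t size,
                                dma_addr_t *dma_handle, int flag)
{
    void *cpu_addr = malloc(size);

    dev->last_flag = flag;              /* recorded so the flag is visible */
    *dma_handle = (dma_addr_t)(uintptr_t)cpu_addr;
    return cpu_addr;
}

static void dma_free_coherent(struct device *dev, size_t size,
                              void *cpu_addr, dma_addr_t dma_handle)
{
    (void)dev; (void)size; (void)dma_handle;
    free(cpu_addr);
}

/* The PCI call modeled as the generic call with GFP_ATOMIC supplied. */
static void *pci_alloc_consistent(struct pci_dev *pdev, size_t size,
                                  dma_addr_t *dma_handle)
{
    return dma_alloc_coherent(&pdev->dev, size, dma_handle, GFP_ATOMIC);
}
```

A driver that allocates its buffers from process context could therefore switch to the generic call and pass GFP_KERNEL, which lets the allocation sleep rather than fail under memory pressure.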

For single-buffer streaming allocations, the PCI interface is, once again, unchanged, and the generic DMA interface is isomorphic to the PCI version. There is now an enumerated type for describing the direction of the mapping:

    enum dma_data_direction {
        DMA_BIDIRECTIONAL = 0,
        DMA_TO_DEVICE = 1,
        DMA_FROM_DEVICE = 2,
        DMA_NONE = 3,
    };

The actual mapping and unmapping functions are:

    dma_addr_t dma_map_single(struct device *dev, void *addr,
                              size_t size,
                              enum dma_data_direction direction);
    void dma_unmap_single(struct device *dev, dma_addr_t dma_addr,
                          size_t size,
                          enum dma_data_direction direction);

    dma_addr_t dma_map_page(struct device *dev, struct page *page,
                            unsigned long offset, size_t size,
                            enum dma_data_direction direction);
    void dma_unmap_page(struct device *dev, dma_addr_t dma_addr,
                        size_t size,
                        enum dma_data_direction direction);

As is the case with the PCI versions of these functions, use of the offset and size parameters is discouraged unless you really know what you are doing.
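The usual life cycle of a single-buffer streaming mapping is map, transfer, unmap, with the same size and direction given at both ends. The sketch below models that sequence in user space; the two dma_* functions are stubs with the prototypes shown above (they only count active mappings), and example_send() and example_send_done() are hypothetical driver routines.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

enum dma_data_direction {
    DMA_BIDIRECTIONAL = 0,
    DMA_TO_DEVICE = 1,
    DMA_FROM_DEVICE = 2,
    DMA_NONE = 3,
};

/* Placeholder device structure that just counts live mappings. */
struct device { int active_mappings; };

/* Stubs standing in for the kernel functions. */
static dma_addr_t dma_map_single(struct device *dev, void *addr, size_t size,
                                 enum dma_data_direction direction)
{
    (void)size; (void)direction;
    dev->active_mappings++;
    return (dma_addr_t)(uintptr_t)addr;
}

static void dma_unmap_single(struct device *dev, dma_addr_t dma_addr,
                             size_t size, enum dma_data_direction direction)
{
    (void)dma_addr; (void)size; (void)direction;
    dev->active_mappings--;
}

/* Hypothetical transmit path: hand one buffer to the device. */
static dma_addr_t example_send(struct device *dev, void *buf, size_t len)
{
    dma_addr_t bus = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

    /* A real driver would now give 'bus' to the hardware and start the
     * transfer; the CPU should leave the buffer alone until it completes. */
    return bus;
}

/* Hypothetical completion handler: same size and direction as the map. */
static void example_send_done(struct device *dev, dma_addr_t bus, size_t len)
{
    dma_unmap_single(dev, bus, len, DMA_TO_DEVICE);
}
```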

There has been one significant change in the creation of scatter/gather streaming DMA mappings. The 2.4 version of struct scatterlist used a char * pointer (called address) for the buffer to be mapped, with a struct page pointer that would be used only for high memory addresses. In 2.6, the address pointer is gone, and all scatterlists must be built using struct page pointers.

The generic versions of the scatter/gather functions are:

    int dma_map_sg(struct device *dev, struct scatterlist *sg, 
                   int nents, enum dma_data_direction direction);
    void dma_unmap_sg(struct device *dev, struct scatterlist *sg, 
                      int nhwentries, enum dma_data_direction direction);
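The scatterlist change described above means that each entry is now expressed as a page, an offset within that page, and a length. The sketch below models the conversion from a buffer address in user space: the struct layouts, the toy memory[] and mem_map[] arrays, the 4 KB page size, and the sg_fill() helper are all assumptions made for illustration. In a real driver the kernel's virt_to_page() would supply the page pointer.

```c
#include <stddef.h>

#define PAGE_SHIFT 12                   /* assumes 4 KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define NPAGES     16

/* Stand-in for the kernel's struct page. */
struct page { unsigned long flags; };

/* 2.6-style entry: no 'address' member, only page/offset/length. */
struct scatterlist {
    struct page *page;
    unsigned int offset;
    unsigned int length;
};

/* Toy "physical memory" with one struct page per page frame. */
static unsigned char memory[NPAGES * PAGE_SIZE];
static struct page mem_map[NPAGES];

/* Describe a buffer inside memory[] as a scatterlist entry. */
static void sg_fill(struct scatterlist *sg, void *buf, unsigned int len)
{
    unsigned long pos = (unsigned long)((unsigned char *)buf - memory);

    sg->page = &mem_map[pos >> PAGE_SHIFT];
    sg->offset = (unsigned int)(pos & (PAGE_SIZE - 1));
    sg->length = len;
}
```

An array of entries filled this way is what would then be handed to dma_map_sg().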

Noncoherent DMA mappings

The generic DMA layer in 2.6 includes a set of functions for the creation of explicitly noncoherent mappings. Very few drivers will need to use this interface; it is mostly intended for code that must work on older platforms that are unable to create coherent mappings. Note that there are no PCI equivalents for these functions; you must use the generic variants.

A noncoherent mapping is created with:

    void *dma_alloc_noncoherent(struct device *dev, size_t size,
                                dma_addr_t *dma_handle, int flag);

This function behaves identically to dma_alloc_coherent(), except that the returned mapping might not be in coherent memory. Drivers using this memory must be careful to follow the ownership rules and call the appropriate dma_sync_* functions when needed. An additional function:

    void dma_sync_single_range(struct device *dev, dma_addr_t dma_handle,
                               unsigned long offset, size_t size,
                               enum dma_data_direction direction);

will synchronize only a portion of a (larger) noncoherent mapping.

When your driver is done with the mapping, it should be returned to the system with:

    void dma_free_noncoherent(struct device *dev, size_t size, 
                              void *cpu_addr, dma_addr_t dma_handle);

Double address cycle addressing

The PCI bus is capable of a "double address cycle" (DAC) mode of operation. DAC enables the use of 64-bit DMA addresses, greatly expanding the range of memory which is reachable on systems without I/O memory management units. DAC is also expensive, however, and is not properly supported by all devices and buses. So the DMA support routines will normally go out of their way to avoid creating mappings that require DAC - even when the driver has set an address mask that would allow it.

There are occasions where DAC is useful, however. In particular, very large DMA mappings may not be possible in the normal, single-cycle address range. For these rare cases, the PCI layer (but not the generic DMA layer) provides a special set of functions. Note that the DAC functions can be very expensive to use; they should generally be avoided unless absolutely necessary. These functions aren't strictly a 2.6 feature; they were also added to 2.4.13.

A DAC-capable driver must begin by setting a separate address mask:

    int pci_dac_set_dma_mask(struct pci_dev *dev, u64 mask);

The mask describes the address range that your device can support. If the function returns non-zero, DAC addressing cannot be used and should not be attempted.
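A driver would typically probe for DAC support once, at initialization time, and remember the answer. The sketch below models that decision in user space: struct pci_dev here is a placeholder with an invented platform_has_dac switch, both mask-setting functions are stubs that merely imitate the return convention described above, and example_setup_masks() is a hypothetical helper.

```c
#include <stdint.h>

typedef uint64_t u64;

/* Placeholder device structure; platform_has_dac is a test switch. */
struct pci_dev {
    int platform_has_dac;
    u64 dma_mask;
    u64 dac_mask;
};

/* Stub for the ordinary (single address cycle) mask call. */
static int pci_set_dma_mask(struct pci_dev *dev, u64 mask)
{
    dev->dma_mask = mask;
    return 0;
}

/* Stub for the DAC mask call: non-zero means DAC must not be used. */
static int pci_dac_set_dma_mask(struct pci_dev *dev, u64 mask)
{
    if (!dev->platform_has_dac)
        return -1;
    dev->dac_mask = mask;
    return 0;
}

/* Returns 1 if DAC mappings may be used, 0 if only normal mappings are
 * available, and -1 if no usable DMA configuration was found. */
static int example_setup_masks(struct pci_dev *dev)
{
    if (pci_set_dma_mask(dev, 0xffffffffULL) != 0)
        return -1;
    return pci_dac_set_dma_mask(dev, 0xffffffffffffffffULL) == 0;
}
```

The normal mask is set first because the driver still needs the ordinary mapping functions for everything that does not require DAC.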

A DAC mapping is created with:

    dma64_addr_t pci_dac_page_to_dma(struct pci_dev *dev,
                                     struct page *page,
                                     unsigned long offset,
                                     int direction);

There are a few things to note about DAC mappings. They can only be created using struct page pointers and offsets; DAC mappings, by their nature, will be in high memory and thus will not have kernel virtual addresses. DAC mappings are a straight address translation requiring no external resources, so there is no need to explicitly unmap them after use. Finally, all DAC mappings are inconsistent (noncoherent) mappings, so explicit synchronization is needed to ensure that the device and CPU see the same memory. For a DAC mapping, use:

    void pci_dac_dma_sync_single(struct pci_dev *dev,
                                 dma64_addr_t dma_addr,
                                 size_t len, int direction);

Some other details

On many architectures, no resources are consumed by DMA mappings, and thus there is no real need to unmap them. The various unmap functions are set up as no-ops on those architectures, but some programmers evidently dislike the need to remember DMA mapping addresses and lengths unnecessarily. So 2.6 (and 2.4 as of 2.4.18) has a fairly elaborate bit of preprocessor abuse which can be used to save a couple words of memory. See Documentation/DMA-mapping.txt in the source tree if this appeals to you.
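The general shape of that preprocessor technique can be shown with a small user-space imitation. Every name below (NEED_UNMAP_STATE, DECLARE_UNMAP_ADDR(), unmap_addr_set(), struct ring_entry, and so on) is invented for this sketch; the kernel's actual macro names and usage rules are in DMA-mapping.txt. The idea is that the structure members and their accessors both disappear on platforms where unmapping is a no-op.

```c
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Hypothetical platform switch: 1 if unmap calls need the saved values. */
#ifndef NEED_UNMAP_STATE
#define NEED_UNMAP_STATE 1
#endif

#if NEED_UNMAP_STATE
#define DECLARE_UNMAP_ADDR(name)        dma_addr_t name;
#define DECLARE_UNMAP_LEN(name)         uint32_t name;
#define unmap_addr(ptr, name)           ((ptr)->name)
#define unmap_addr_set(ptr, name, val)  (((ptr)->name) = (val))
#define unmap_len(ptr, name)            ((ptr)->name)
#define unmap_len_set(ptr, name, val)   (((ptr)->name) = (val))
#else
/* Members vanish entirely; accessors turn into constants and no-ops. */
#define DECLARE_UNMAP_ADDR(name)
#define DECLARE_UNMAP_LEN(name)
#define unmap_addr(ptr, name)           (0)
#define unmap_addr_set(ptr, name, val)  do { } while (0)
#define unmap_len(ptr, name)            (0)
#define unmap_len_set(ptr, name, val)   do { } while (0)
#endif

/* Hypothetical per-buffer bookkeeping in a driver's ring. Note that the
 * declaration macros supply their own semicolons. */
struct ring_entry {
    void *buffer;
    DECLARE_UNMAP_ADDR(mapping)
    DECLARE_UNMAP_LEN(len)
};
```

Driver code that only ever touches the members through the accessor macros compiles unchanged in both configurations, and the structure shrinks when the saved values are not needed.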

The "PCI pool" interface is definitely not a 2.5-specific feature, since it first appeared in 2.4.4. That is new enough, however, that some references (e.g. Linux Device Drivers, Second Edition) do not cover it. The PCI pool interface enables the use of very small DMA buffers. In the past, such buffers would often be kept in device-specific structures. Some users ran into trouble, however, when the DMA buffer shared a cache line with other members of the same structure. The PCI pool interface was created to help move tiny DMA buffers into their own space and avoid this sort of memory corruption. Again, see DMA-mapping.txt for the details.


Patches and updates

Kernel trees

Linus Torvalds: Linux 2.5.67
Alan Cox: Linux 2.5.67-ac1
Andrew Morton: 2.5.66-mm3
Alan Cox: Linux 2.5.66-ac2
Marcelo Tosatti: Linux 2.4.21-pre7
J.A. Magallon: Linux 2.4.21-pre6-jam1
Con Kolivas: 2.4.20-ck5

Core kernel code

Development tools

Device drivers

Documentation

Denis Vlasenko: lk maintainers
Dave Hansen: meminfo documentation

Filesystems and block I/O

Janitorial

Christoph Hellwig: libfs

Memory management

Matthew Dobson: Memory Binding Take 2 (0/1)
Matthew Dobson: Memory Binding Take 2 (1/1)
Hugh Dickins: flush flush_page_to_ram

Networking

Benchmarks and bugs

Miscellaneous

Matthias Andree: lk-changelog.pl 0.93
Douglas Gilbert: sg3_utils-1.03 released

Page editor: Jonathan Corbet


Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds