
Kernel Summit 2005: I/O Busses

From LWN's 2005 Kernel Summit coverage.
The session on I/O buses got off to a bit of a slow start. PCI-Express support was the first topic to cover, but the fact that, for the most part, PCI-Express simply works put a damper on the discussion. There are some potential issues with interrupt ordering, but nothing anybody was all that worried about.

PCI error recovery was the next topic. Ben Herrenschmidt discussed the two current patches out there; these patches were discussed in detail in the July 13 LWN Kernel Page, so there is no need to repeat that material here. The discussion did not produce any clear consensus on which patch - if either - should be merged.

The bulk of the time was spent discussing I/O memory management units. The IOMMU, on systems which have one, implements a virtual address space as seen by peripherals. This layer of indirection has a couple of advantages: it can be used to make otherwise unreachable memory available for DMA operations, and it can make physically scattered memory appear to be contiguous. The IOMMU, thus, can be used to implement scatter/gather I/O operations without the target device knowing about it.
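
A driver does not program the IOMMU directly; it goes through the kernel's generic DMA mapping API. The sketch below (not something presented at the summit; details vary between kernel versions and architectures) shows the general shape of that interface:

    #include <linux/dma-mapping.h>
    #include <linux/scatterlist.h>

    /*
     * Map a scatterlist for a DMA transfer to the device.  Returns the
     * number of mapped segments, or a negative error code.
     */
    static int example_map(struct device *dev, struct scatterlist *sg, int nents)
    {
            int mapped = dma_map_sg(dev, sg, nents, DMA_TO_DEVICE);

            if (mapped == 0)
                    return -ENOMEM;
            /*
             * The device is then programmed with sg_dma_address() and
             * sg_dma_len() for each of the 'mapped' entries.  Behind an
             * IOMMU, 'mapped' can be smaller than 'nents': physically
             * scattered pages have been made contiguous in bus space.
             */
            return mapped;
    }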

As it turns out, there are several different scatter/gather interfaces in the kernel. The block layer has its own, well-developed code; networking, too, has a strong scatter/gather implementation. There are, however, a number of char drivers which wish to do scatter/gather I/O, and each one has its own implementation; some are better than others. There have been suggestions before for a generic, kernel-wide scatter/gather implementation, and James raised the idea again. Nobody appears to be in a hurry to reimplement kernel scatter/gather operations, though; the parts of the kernel which benefit most from the capability already have solid implementations that work well.
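
For a rough idea of what each of those char drivers ends up reinventing, here is approximately how a scatterlist was assembled by hand in kernels of this era (a hypothetical fragment; pages[] and NPAGES stand in for a driver's own buffer bookkeeping, and later kernels grew helpers for this job):

    struct scatterlist sg[NPAGES];
    int i;

    memset(sg, 0, sizeof(sg));
    for (i = 0; i < NPAGES; i++) {
            sg[i].page   = pages[i];  /* one struct page per chunk of the buffer */
            sg[i].offset = 0;
            sg[i].length = PAGE_SIZE;
    }
    /* ... then hand sg[] to dma_map_sg(), as in the earlier sketch ... */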

Just because a system has an IOMMU does not mean that it should always be used. Some hardware implements its own scatter/gather operations, and, in many cases, that implementation is quite efficient. Currently there is no mechanism in the kernel for using - or bypassing - the IOMMU on a per-device basis, and no real way to know which option performs best. What became clear in the discussion is that somebody needs to put some serious effort into measuring the performance impact of the IOMMU on various architectures, and with various peripherals. For now, such decisions are mostly guesswork.

Peripherals which perform their own scatter/gather can often support both 32-bit and 64-bit descriptors. For a number of reasons, the use of 32-bit descriptors tends to be more efficient. The question was asked: is it worth using the 64-bit modes at all? One answer came from a hardware company representative: support for 32-bit descriptors can be expected to fade away. Commercial pressures may lead to 64-bit being the only well-supported mode available.

Finally, there is the issue of support for systems without an IOMMU. Such systems cannot work with 64-bit DMA addresses. In fact, many of them do not even handle full 32-bit addresses; it is fairly common to find hardware which can only address 31 bits. The kernel does not currently handle such hardware well. The memory zone mechanism was designed around this type of problem, but there are only two zones of interest; for a device which cannot deal with 32-bit addresses, the only safe zone is the DMA zone, which, reflecting its history, only covers the bottom 24 bits (16MB). This zone is thus constrained to be small. There really needs to be a way to allocate memory which is outside of the traditional DMA zone, but which still fits within a constrained address mask.
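
To make the problem concrete: a driver for such limited hardware can declare its constraint through the DMA mask interface (a sketch, not code from the discussion), but, absent an IOMMU, there is currently no good way to allocate memory honoring a 31-bit mask other than falling back to the tiny DMA zone:

    /*
     * Sketch only: tell the DMA layer that this device can address
     * just 31 bits ('dev' is the driver's struct device).  Without an
     * IOMMU, every DMA buffer for this device must then live in the
     * lower 2GB of memory.
     */
    if (dma_set_mask(dev, 0x7fffffffULL))
            printk(KERN_WARNING "device cannot do 31-bit DMA\n");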

Thus, asks James, do we need a new memory allocation API? He proposed either kmalloc_dev() or kmalloc_mask(). The former would take a pointer to a device structure; it would have the advantage of also being able to allocate local memory on a NUMA system. The latter, instead, would simply try to find memory addressable within a given mask. In either case, the new functions would not be implemented through the creation of new memory zones - nobody wants to add more zone balancing challenges to the kernel. Instead, a best-effort attempt would be made to allocate suitable memory from the lower end of ZONE_NORMAL; if that did not work, the attempt would fail.
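
Neither interface exists yet, so the prototypes below are purely hypothetical sketches of what the two proposals might look like:

    /*
     * Hypothetical prototypes only - neither function was in the kernel
     * when this was written.  kmalloc_dev() would consult the device's
     * DMA mask (and, on NUMA systems, its node); kmalloc_mask() would
     * take the address mask directly.  "gfp_t" stands in for the usual
     * allocation-flags argument.
     */
    void *kmalloc_dev(struct device *dev, size_t size, gfp_t flags);
    void *kmalloc_mask(u64 mask, size_t size, gfp_t flags);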

As for the question of which API should be implemented: the developers decided that they wanted both. James promised to post a patch.

Index entries for this article
Kernel: Direct memory access
Kernel: PCI



I/O Busses

Posted Jul 19, 2005 6:10 UTC (Tue) by mst@mellanox.co.il (guest, #27097)

> PCI-Express support was the first topic to cover, but the fact that,
> for the most part, PCI-X simply works put a damper on the discussion.

Don't you mean "PCI-Express simply works"?

I/O Busses

Posted Jul 19, 2005 9:27 UTC (Tue) by simon_kitching (guest, #4874)

Maybe it was just a typo, but PCI Express is abbreviated PCIe. PCI-X is a different thing.


Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds