LWN.net Weekly Edition for January 9, 2020

Welcome to the LWN.net Weekly Edition for January 9, 2020

This edition contains the following feature content:

Some median Python NaNsense: not-a-number values create confusion for users of the Python statistics module, but the right fix is not obvious.
Toward a conclusion for Python dictionary "addition": the PEP 584 discussion winds down.
A medley of performance-related BPF patches: BPF is being sped up in a number of ways.
Removing the Linux /dev/random blocking pool: kernel developers give up on providing "true" random numbers.
The trouble with IPv6 extension headers: what is the best way for the networking developers to keep features that are prone to abuse out of the Internet protocols?

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.
Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Some median Python NaNsense

By Jonathan Corbet
January 3, 2020

Anybody who has ever taken a numerical analysis course understands that floating-point arithmetic on computers is a messy affair. Even so, it is easy to underestimate just how messy things can be. This topic came to the fore in an initially unrelated python-ideas mailing-list thread; what should the Python statistics module do with floating-point values that are explicitly not numbers?

Kemal Diri doubtless did not mean to start a massive thread with this request to add a built-in function to the language to calculate the average of the values in a list. That request was quickly dismissed, but the developers went on to the seemingly strange behavior of the statistics module's median() function when presented with floating-point not-a-number values.

What's not-a-number?

Python integer values have no specific internal representation and are defined to cover an arbitrary range. The float type, instead, is restricted to what the underlying CPU implements; on almost all current hardware, floating-point numbers are described by IEEE 754. That format looks like this:

The interpretation of these numbers is relatively straightforward — except for the quirks. Since this is essentially a sign/magnitude format, it is capable of representing both positive and negative values of zero; both are treated as zero during mathematical operations and (most) comparisons, but they are distinct numbers. The exponent is a biased value that makes comparisons easier. And, importantly for this discussion, an exponent value that is all ones is interpreted in special ways.

In particular, if the exponent is all ones and the mantissa is all zeroes, then the value represents infinity (usually written as "inf"). The sign bit matters, so there can be both positive and negative inf values. Any other value of the mantissa, instead, creates a special value called "not-a-number" or NaN. These values can be created by the hardware when an operation results in a value that cannot be represented — arithmetic overflow, for example. NaN values are also often used in scientific code to represent missing data, though that was apparently not a part of the original intent for NaNs.

There is more. With 52 bits to play with (plus the sign bit), it is obviously possible to create a lot of unique NaN values. Some code uses the mantissa bits to record additional information. But the most significant bit of the mantissa is special: if that bit is set, the result is a "quiet NaN"; if it is clear, instead, the value is a "signaling NaN". Quiet NaNs will quietly propagate through computations; adding a number to a quiet NaN yields another quiet NaN as a result. Operations on signaling NaNs are supposed to generate an immediate error.

(And, yes, some processors invert the meaning of the "quiet" bit, but that's more pain than we need to get into here.)

The problem with median()

The diversion of the initial thread came about when David Mertz pointed out that median() can yield some interesting results when the input data includes NaNs. To summarize his examples:

   >>> import statistics
   >>>
   >>> NaN = float('nan')
   >>> statistics.median([NaN, 1, 2, 3, 4])
   2
   >>> statistics.median([1, 2, 3, 4, NaN])
   3

This strikes Mertz as a nonsensical result: the return value from a function like median() should not depend on the order of the data passed to it, but that is exactly what happens if the input data contains NaN values.

There was no immediate consensus on whether this behavior is indeed a problem or not. Richard Damon asserted: "Getting garbage answers for garbage input isn't THAT unreasonable". Steven D'Aprano, who added the statistics module to the standard library, agreed: "this is a case of garbage in, garbage out". In both cases, the basis of the argument is that median() is documented to require orderable input values, and that is not what is provided in the example above.

Naturally, others see things differently. It turns out that IEEE 754 defines a "total order" that can be used to order all floating-point values. In this order, it turns out that signaling NaNs are bigger than infinity (in both the positive and negative directions), and quiet NaNs are even bigger than that. So some felt that this total order should be used in calculating the median of a list of floating-point values. Brendan Barnwell went a bit further by arguing that NaN values should be treated like other numbers: "The things that computers work with are floats, and NaN is a float, so in any relevant sense it is a number". D'Aprano disagreed strongly:

If something doesn't quack like a duck, doesn't swim like a duck, and doesn't walk like a duck, and is explicitly called Not A Duck, would we insist that it's actually a duck?

Despite insisting that median() is within its rights to return strange results in this situation, D'Aprano also acknowledged that silently returning those results might not be the best course of action. All that is needed is to decide what that best course would actually be.

What should `median()` do

Much of the discussion was thus understandably focused on what the correct behavior for median() (and for other functions in the statistics module that have similar issues) should be. There were two core concerns here: what the proper behavior is, and the performance cost of implementing a different solution.

The cost is the easier concern to get a handle on. median() currently works by sorting the supplied values into an ordered list, then taking the value in the middle. Dealing with NaN values would require adding extra tests to the sort, slowing it down. D'Aprano figured that this change would slow median() by a factor between four and eight times. Christopher Barker did some tests and measured a slowdown range between two and ten times, depending on how the checking for NaN values is done. (And, for the curious, testing for NaN is not as easy as one might hope.) Some of this slowdown could be addressed by switching to a smarter way of calculating the median.

Once those little problems have been dealt with, there is still the issue of what median() should actually do. A number of alternatives surfaced during the discussion:

Keep the current behavior, which works well enough for most users (who may never encounter a NaN value in their work) and doesn't hurt performance. Users who want specialized NaN handling could use a library like NumPy instead.
Simply ignore NaNs.
Raise an exception if the input data contains NaNs. This applies to quiet NaNs; if a signaling NaN is encountered, D'Aprano said, an exception should always be raised.
Return NaN if the input data contains NaNs.
Move the NaN values to one end of the list.
Sort the list using total order (which would move NaNs to both ends of the list, depending on their sign bits).
Probably one or two others as well.

D'Aprano quickly came to the conclusion that the community is unlikely to come to a consensus on what the proper handling for NaN values is. So he has proposed adding a keyword argument (called nan_policy) to median() that would allow the user to pick between a subset of the options above. The default behavior would be to raise an exception.

This proposal appears to have brought the discussion to an orderly close; nobody seems to strongly object to handling the problem in this way. All that is left is to actually write the code to implement this behavior. After that, one would hope, not-a-number handling in the statistics module would be not-a-problem.

Comments (33 posted)

Toward a conclusion for Python dictionary "addition"

By Jake Edge
January 8, 2020

One of Guido van Rossum's last items of business as he finished his term on the inaugural steering council for Python was to review the Python Enhancement Proposal (PEP) that proposes a new update and union operators for dictionaries. He would still seem to be in favor of the idea, but it will be up to the newly elected steering council and whoever the council chooses as the PEP-deciding delegate (i.e. BDFL-Delegate). Van Rossum provided some feedback on the PEP and, inevitably, the question of how to spell the operator returned, but the path toward getting a decision on it is now pretty clear.

PEP 584 ("Add + and += operators to the built-in dict class") has been in the works since last March, but the idea has been around for a lot longer than that. LWN covered a discussion back in March 2015, though it had come up well before that as well. It is a seemingly "obvious" language enhancement, at least for proponents, that would simply create an operator for dictionaries to either update them in-place or to easily create a combination of two dictionaries:

>>> d = {'spam': 1, 'eggs': 2, 'cheese': 3}
>>> e = {'cheese': 'cheddar', 'aardvark': 'Ethel'}
>>> d + e
{'spam': 1, 'eggs': 2, 'cheese': 'cheddar', 'aardvark': 'Ethel'}
>>> e + d
{'cheese': 3, 'aardvark': 'Ethel', 'spam': 1, 'eggs': 2}

>>> d += e
>>> d
{'spam': 1, 'eggs': 2, 'cheese': 'cheddar', 'aardvark': 'Ethel'}

As can be seen, the operation would not be commutative as the value for any shared keys will come from the second (i.e. right-hand) operand, which makes it order-dependent. There are some who do not see the operators as desirable features for the language, but the most vigorous discussion over the last year or so has been about its spelling, with a strong preference for using | and |= among participants in those threads—including Van Rossum.

At the beginning of December, Van Rossum posted his review of the PEP to the python-ideas mailing list. He encouraged the authors (Brandt Bucher and Steven D'Aprano) to request a BDFL-Delegate for the PEP from the steering council, noting that he would not be on the council after the end of the year. D'Aprano indicated that he would be doing so. Apparently that happened, because, tucked away in the notes from the November and December steering council meetings was a mention that a BDFL-Delegate had been assigned—none other than Van Rossum himself.

In his review, he comes down strongly in favor of | and |= and had some other minor suggestions. He said: "All in all I would recommend to the SC to go forward with this proposal, targeting Python 3.9, assuming the operators are changed to | and |=, and the PEP is brought more in line with the PEP editing guidelines from PEP 1 and PEP 12." Given that, and that he is the decision maker for the PEP, it would seem to be smooth sailing for its acceptance.

That did not stop some from voicing objections to the PEP as a whole or the spelling of the operator in particular, of course, though the discussion was collegial as is so often the case in the Python world. Van Rossum thought that | might be harder for newcomers, but was not particularly concerned about that: "I don't think beginners should be taught these operators as a major tool in their toolbox". But Ryan Gonzalez thought that beginners might actually find that spelling easier because of its congruence to the Python set union operator.

Serhiy Storchaka is not a fan of the PEP in general, but believes that | is a better choice than +. He thinks there are already other ways to accomplish the same things that the operators would provide and that their use may be error-prone. He also had a performance concern, but Brett Cannon pointed out that it might only exist for CPython; PyPy and other Pythons might not have the same performance characteristics. Furthermore:

To me this PEP is entirely a question of whether the operators will increase developer productivity and not some way to do dict merging faster, and so performance questions should stay out of it unless it's somehow slower than dict.update().

Marco Sulla made the argument that using | is illogical because sets also support a set difference operation using -, while the PEP does not propose that operator for dictionaries (though it should be noted that a previous incarnation of the PEP did have "subtraction", but it was not well-received and was dropped). Andrew Barnert felt that "illogical" was not the right reason to choose one spelling over the other:

It’s logical to spell the union of two dicts the same way you spell the union of two sets; it’s also logical to spell the concatenation of two dicts the same way you spell the concatenation of two lists. The question is which one is a more useful analogy, or which one is less potentially confusing, not which one you can come up with a convoluted way of declaring illogical if you really try.

Sulla continued by saying that since list and string subtraction make no sense, that it is an unfair comparison. But Chris Angelico pointed out that's not necessarily the case either, since that operation does make sense in some contexts. While he doesn't necessarily think Python should add support for those use cases, "I do ask people to be a little more respectful to the notion that these operations are meaningful". What followed was a bit of a digression into mathematics and the meaning of various operations, much of which had little to do with Python.

There were two offshoots of the discussion. "Random832" suggested a generic way to add an operator specific to a module: all code in the module could use the operator but it would not bubble out from there. Cannon thought it could be quite confusing to programmers who did not realize the operator was redefined. "And debugging this wouldn't be fun either." Storchaka brought up some performance concerns, which could perhaps be worked around, but the general reaction to Random832's idea was negative.

Jonathan Fine thought that since the proposed | operator gives preference to the right operand ("merge-right" in his terminology), there was a need for a merge-left operation. He called it gapfill(), which was a puzzling name choice to some; it would only add values for keys in the right-hand operand that were not present in the left-hand one. While the use case of, say, filling in defaults to a dictionary that held command-line options is reasonable, there are a number of other ways to do that (as is also true for |, however). Fine did not propose that an operator be added but did note that some other Python operations could be seen to give preference to the left-hand operand, which might make the merge-right | operator confusing. There was not a lot of reaction to the idea, but it doesn't look to be going anywhere for now.

D'Aprano plans to update the PEP based on the feedback from Van Rossum and others. It presumably also needs to run the gauntlet of the python-dev mailing list before Van Rossum can decide its fate. There is still plenty of time for all of that before the Python 3.9 release, even though the project adopted a 12-month release cycle a few months back. Python 3.9 is due in early October; it's a pretty good bet that | and |= for dictionaries will make the cut. Even if they do not, though, one of the goals was to put the subject to rest once and for all; a rejected PEP would serve as a place to point those who ask about dictionary "addition" in the future.

Comments (7 posted)

A medley of performance-related BPF patches

By Jonathan Corbet
January 2, 2020

One of the advantages of the in-kernel BPF virtual machine is that it is fast. BPF programs are just-in-time compiled and run directly by the CPU, so there is no interpreter overhead. For many of the intended use cases, though, "fast" can never be quite fast enough. It is thus unsurprising that there are currently a number of patch sets under development that are intended to speed up one aspect or another of using BPF in the system. A few, in particular, seem about ready to hit the mainline.

The BPF dispatcher

BPF programs cannot run until they are "attached" to a specific call point. Tracing programs are attached to tracepoints, while networking express data path (XDP) programs are attached to a specific network device. In general, more than one program can be attached at any given location. When it comes time to run attached programs, the kernel will work through a linked list and invoke each program in turn.

Actually executing a compiled BPF program is done with an indirect jump. Such jumps were never entirely fast, but in the age of speculative-execution vulnerabilities those jumps have been turned into retpolines — a construct that defeats a number of Spectre attacks, but which also turns indirect jumps into something that is far slower than they were before. For cases where BPF programs are invoked frequently, such as for every incoming network packet, that extra overhead hurts.

There have been a number of efforts aimed at reducing the retpoline performance penalty in various parts of the kernel. The BPF dispatcher patch set is Björn Töpel's approach to the problem for BPF programs, and for the XDP use case in particular. It maintains a machine-code trampoline containing a direct jump instruction for every attached BPF program; this trampoline must be regenerated whenever a program is added to or removed from the list. When the time comes to call a BPF program, the trampoline is invoked with the address of the program of interest; it then executes a binary search to find the direct-jump instruction corresponding to that program. The jump is then executed, causing the desired program to be run.

That may seem like a lot of overhead to replace an indirect call, but it is still faster than using a retpoline — by a factor of about three, according to the performance result posted with the patch series. In fact, indirect jumps are so expensive that the dispatcher is competitive even in the absence of retpolines, so it is enabled whether retpolines are in use or not. This code is in its fifth revision and seems likely to make its way into the mainline before too long.

Memory-mappable maps

BPF maps are the way that BPF programs store persistent data; they come in a number of varieties but are essentially associative arrays that can be shared with other BPF programs or with user space. Access to maps from within BPF programs is done by way of special helper functions; since everything happens within the kernel, this access is relatively fast. Getting at a BPF map from user space, instead, must be done with the bpf() system call, which provides operations like BPF_MAP_LOOKUP_ELEM and BPF_MAP_UPDATE_ELEM.

If one simply needs to read out the results at the end of a tracing run, calling bpf() is unlikely to be a problem. In the case of user-space programs that run for a long time and access a lot of data in BPF maps, though, the system-call overhead may well prove to be too much. Much of the time, the key to good performance is avoiding system calls as much as possible; making a call into the system for each item of data exchanged with a BPF program runs counter to that principle. Andrii Nakryiko has a partial solution to this problem in the form of memory-mappable BPF maps. It allows a user-space process to map a BPF array map (one that is indexed with simple integers) directly into its address space; thereafter, data in BPF maps can be accessed directly, with no need for system calls at all.

There are some limitations in the current patch set; only array maps can be mapped in this way, and maps containing spinlocks cannot be mapped (which makes sense, since user space will be unable to participate in the locking protocol anyway). Maps must be created with the BPF_F_MAPPABLE attribute (which causes them to be laid out differently in memory) to be mappable. This patch set has been applied to the BPF repository and can be expected to show up in the 5.6 kernel.

Batched map operations

Memory-mapping BPF maps is one way of avoiding the bpf() system call but, as seen above, it has some limitations. A different approach to reducing system calls can be seen in the batched operations patch set from Brian Vazquez. System calls are still required to access BPF map elements, but it becomes possible to access multiple elements with a single system call.

In particular, the patch set introduces four new map-related commands for the bpf() system call: BPF_MAP_LOOKUP_BATCH, BPF_MAP_LOOKUP_AND_DELETE_BATCH, BPF_MAP_UPDATE_BATCH, and BPF_MAP_DELETE_BATCH. These commands require the following structure to be passed in the bpf() call:

    struct { /* struct used by BPF_MAP_*_BATCH commands */
        __aligned_u64   in_batch;
        __aligned_u64   out_batch;
        __aligned_u64   keys;
        __aligned_u64   values;
        __u32           count;
        __u32           map_fd;
        __u64           elem_flags;
        __u64           flags;
    } batch;

For lookup operations (which, despite their name, are intended to read through a map's entries rather than look up specific entries), keys points to an array able to hold count keys; values is an array for count values. The kernel will pass through the map, storing the keys and associated values for a maximum of that many elements, and setting count to the number actually returned. Setting in_batch to NULL starts the lookup at the beginning of the map; the out_batch value can be used for subsequent calls to pick up where the previous call left off, thus allowing traversal of the entire map.

Update and delete operations expect keys to contain the keys for the map elements to be affected. Updates also use values for the new values to be associated with keys.

The batch operations do not eliminate system calls for access to map elements, but they can reduce those calls considerably; one call can affect 100 (or more) elements at a time rather than just one element. The batch operations do have some significant advantages over memory-mapping; for example, they can be used for any map type, not just array maps. It is also possible to perform operations (like deletion) that cannot be done with memory-mapping.

There is thus a place for both approaches. This patch set is in its third revision, having picked up a number of reviews and acks along the way, so it, too, seems likely to be merged in the near future.

Comments (6 posted)

Removing the Linux /dev/random blocking pool

By Jake Edge
January 7, 2020

The random-number generation facilities in the kernel have been reworked some over the past few months—but problems in that subsystem have been addressed over an even longer time frame. The most recent changes were made to stop the getrandom() system call from blocking for long periods of time at system boot, but the underlying cause was the behavior of the blocking random pool. A recent patch set would remove that pool and it would seem to be headed for the mainline kernel.

Andy Lutomirski posted version 3 of the patch set toward the end of December. It makes "two major semantic changes to Linux's random APIs". It adds a new GRND_INSECURE flag to the getrandom() system call (though Lutomirski refers to it as getentropy(), which is implemented in glibc using getrandom() with fixed flags); that flag would cause the call to always return the amount of data requested, but with no guarantee that the data is random. The kernel would just make its best effort to give the best random data it has at that point in time. "Calling it 'INSECURE' is probably the best we can do to discourage using this API for things that need security."

The patches also remove the blocking pool. The kernel currently maintains two pools of random data, one that corresponds to /dev/random and another for /dev/urandom, as described in this 2015 article. The blocking pool is the one for /dev/random; reads to that device will block (thus the name) until "enough" entropy has been gathered from the system to satisfy the request. Further reads from that file will also block if there is insufficient entropy in the pool.

Removing the blocking pool means that reads from /dev/random behave like getrandom() with a flags value of zero (and turns the GRND_RANDOM flag into a noop). Once the cryptographic random-number generator (CRNG) has been initialized, reads from /dev/random and calls to getrandom(..., 0) will not block and will return the requested amount of random data. Lutomirski said:

I believe that Linux's blocking pool has outlived its usefulness. Linux's CRNG generates output that is good enough to use even for key generation. The blocking pool is not stronger in any material way, and keeping it around requires a lot of infrastructure of dubious value.

The changes were made with an eye toward ensuring that existing programs are not really affected; in fact, the problems with long waits for things like generating GnuPG keys will get better.

This series should not break any existing programs. /dev/urandom is unchanged. /dev/random will still block just after booting, but it will block less than it used to. getentropy() with existing flags will return output that is, for practical purposes, just as strong as before.

Lutomirski noted that there is still the open question of whether the kernel should provide so-called "true random numbers", which is, to a certain extent, what the blocking pool was meant to do. He can only see one reason to do so: "compliance with government standards". He suggested that if the kernel were to provide that, it should be done through an entirely different interface—or be punted to user space by providing a way for it to extract raw event samples that could be used to create such a blocking pool.

Stephan Müller suggested that his Linux random-number generator (LRNG) patch set (now up to version 26) might be a way to provide true random numbers for applications that need them. The LRNG is "fully compliant to SP800-90B requirements", which makes it a solution to the governmental-standards problem. Matthew Garrett objected to the term "true random data", noting that the devices being sampled could, in principle, be modeled accurately enough to make them predictable: "We're not sampling quantum events here." Müller said that the term comes from the German AIS 31 standard to describe a random-number generator that only produces output "at an equal rate as the underlying noise source produces entropy".

Beyond the terminology, though, having a blocking pool as is proposed by the LRNG patches will just lead to various problems, at least if it is available without privilege, Lutomirski said:

This doesn’t solve the problem. If two different users run stupid programs like gnupg, they will starve each other.

As I see it, there are two major problems with /dev/random right now: it’s prone to DoS (i.e. starvation, malicious or otherwise), and, because no privilege is required, it’s prone to misuse. Gnupg is misuse, full stop. If we add a new unprivileged interface, gnupg and similar programs will use it, and we lose all over again.

Müller noted that the addition of getrandom() will now allow GnuPG to use that interface since it will provide the needed guarantee that the pool has been initialized. From discussions with GnuPG maintainer Werner Koch, Müller believes that guarantee is the only reason GnuPG currently reads directly from /dev/random. But if there is an unprivileged interface that is subject to denial of service (like /dev/random today), it will be misused by some applications, Lutomirski asserted.

Theodore Y. Ts'o, who is the maintainer of the Linux random-number subsystem, appears to have changed his mind along the way about the need for a blocking pool. He said that removing that pool would effectively get rid of the idea that Linux has a true random-number generator (TRNG), which "is not insane; this is what the *BSD's have always done". He, too, is concerned that providing a TRNG mechanism will just serve as an attractant for application developers. He also thinks that it is not really possible to guarantee a TRNG in the kernel, given all of the different types of hardware supported by Linux. Even making the facility only available to root will not solve the problem:

Application programmers would give instructions requiring that their application be installed as root to be more secure, "because that way you can get access the _really_ good random numbers".

Müller asked if Ts'o was giving up on the blocking pool implementation that he had added long ago. Ts'o agreed that he was; he is planning to take the patches from Lutomirski and is pretty strongly opposed to adding a blocking interface back into the kernel.

The kernel can't offer up any guarantees about whether or not the noise source has been appropriately characterized. All say, a GPG or OpenSSL developer can do is get the vague sense that TRUERANDOM is "better" and of course, they want the best security, so of *course* they are going to try to use it. At which point it will block, and when some other clever user (maybe a distro release engineer) puts it into an init script, then systems will stop working and users will complain to Linus.

For cryptographers and others who really need a TRNG, Ts'o is also in favor of providing them a way to collect their own entropy in user space to use as they see fit. Entropy collection is not something that the kernel can reliably do on all of the different hardware that it supports, nor can it estimate the amount of entropy provided by the different sources, he said.

The kernel shouldn't be mixing various noise sources together, and it certainly shouldn't be trying to claim that it knows how many bits of entropy that it gets when [it] is trying to play some jitter entropy game on a stupid-simple CPU architecture for IOT/Embedded user cases where everything is synchronized off of a single master oscillator, and there is no CPU instruction reordering or register renaming, etc., etc.

You can talk about providing tools that try to make these estimations --- but these sorts of things would have to be done on each user's hardware, and for most distro users, it's just not practical.

So if it's just for cryptographers, then let it all be done in userspace, and let's not make it easy for GPG, OpenSSL, etc., to all say, "We want TrueRandom(tm); we won't settle for less". We can talk about how do we provide the interfaces so that those cryptographers can get the information they need so they can get access to the raw noise sources, separated out and named, and with possibly some way that the noise source can authenticate itself to the Cryptographer's userspace library/application.

There was a bit of discussion about how that interface might look; there may be security implications for some of the events, for example. Ts'o noted that the keyboard scan codes (i.e. the keys pressed) are mixed into the pool as part of the entropy collection. "Exposing this to userspace, even if it is via a privileged system call, would be... unwise." It does seem possible that other event timings could provide some kind of side-channel information leak as well.

So it would seem that a longtime feature of the Linux random-number subsystem is on its way out. Given the changes that the random-number subsystem have undergone recently, it effectively was only causing denial-of-service problems when it was used; there are now better ways to get the best random numbers that the kernel can provide. If a TRNG is still desired for Linux, that lack will need to be addressed in the future, but likely will not be done within the kernel itself.

Comments (23 posted)

The trouble with IPv6 extension headers

By Jonathan Corbet
January 7, 2020

It has taken longer than anybody might have liked, but the IPv6 protocol is slowly displacing IPv4 across the Internet. A quick, highly scientific "grep the access logs" test shows that about 16% of the traffic to LWN.net is currently using IPv6, and many large corporate networks are using IPv6 exclusively internally. This version of the IP protocol was designed to be more flexible than IPv4 in a number of ways; the "extension header" mechanism is one way in which that flexibility is achieved. A proposal to formalize extension-header processing in the kernel's networking stack has led to some concerns, though, about how this feature will be used and what role Linux should play in its development.

In both versions of the IP protocol, the header of each packet contains a collection of information about how the packet is to be handled; at a minimum, it contains source and destination addresses and a higher-level protocol number. In IPv4, the contents of the header are rigidly specified; it is difficult to add new types of information to the header. When IPv6 was designed, extension headers were added as a way to (relatively) easily add new information in the future.

A few extension header types are defined in RFC8200 (which describes IPv6). Two of particular interest are the "Hop-by-Hop" and "Destination" headers; the former is meant to be acted upon by every system that handles the packet, while the latter is only for the destination node's attention. These headers may contain one or more options, each encoded in a type-length-value (TLV) format. RFC8200 only defines a couple of options that insert padding into the header, but there is interest in adding a number of others.

For example, In-situ Operations, Administration, and Maintenance options are meant to allow providers to collect telemetry information on packets passing through their networks. The Path MTU mechanism uses a Hop-by-Hop option to discover the maximum packet size a path can handle. Firewall and Service Tickets (FAST) are a Hop-by-Hop option that documents a packet's right to traverse a network or pass through a firewall. The Segment Routing options allows a packet to contain the path it should take through a network. And so on.

Tom Herbert has been working on a patch series making a number of changes to how IPv6 extension headers are handled in Linux. It adds infrastructure to allow kernel modules to register their support for specific Hop-by-Hop and Destination options, and makes the creation and parsing of the associated TLVs easy. Specific options may be added to packets or connections by unprivileged users, while others are restricted to privileged users only; either way, the code tries to ensure that the options are well-formed and ordered correctly.

One of the most controversial features is not actually a part of this patch set; Herbert lists it as work for the future. This feature would perform the insertion of new extension headers into packets passing through a system. Header insertion is a violation of RFC8200, but that naturally doesn't stop the purveyors of routers and other middleboxes from doing it anyway. That creates all of the usual problems, including packet transmission failing for reasons that are entirely opaque to a distant sender, proprietary headers leaking onto the public Internet, and more.

Networking maintainer David Miller was less than pleased by the idea of adding header-insertion capabilities to the Linux kernel:

And honestly, this stuff sounds so easy to misuse by governments and other entities. It could also be used to allow ISPs to limit users in very undesirable and unfair ways. And honestly, surveillance and limiting are the most likely uses for such a facility. I can't see it legitimately being promoted as a "security" feature, really.

It is not hard to imagine how injected headers could be used. They could mark "slow lane" packets, for example, or packets that should be forwarded to that mysterious locked room in an Internet service provider's basement. These are not capabilities that Linux developers are generally enthusiastic about supporting; it is thus not surprising that Miller made it clear that he is in no hurry to merge this code into the networking subsystem.

Herbert acknowledged Miller's concerns, but noted that router vendors will engage in abuse regardless of whether Linux supports a specific feature. None of this behavior requires the use of extension headers at all. Adding better extension header support to the kernel, though, might be a way to minimize the scope of these abuses in the future:

This is why Linux is so critical to networking, it is the only open forum where real scrutiny is applied to how protocols are implemented. If the alternatives are given free [rein] to lead then it's very likely we'll end up being stuck with what they do and probably have to follow their lead regardless of how miserable they make the protocols. We've already seen this in segment routing, their attempts to kill IP fragmentation, and all the other examples of protocol ossification that unnecessarily restrict what hosts are allowed to send in the network and hence reduce the utility and security we are able to offer the user.

One way in which Herbert hopes to improve the situation is via a new attribution option that would at least allow network managers to determine the source of an injected extension header that is causing problems. As things stand now, there is no way to know which system may be injecting problematic headers into packets as they pass through. More generally, he hopes that showing how to do things "right" will help to deter the worst hacks. Miller was skeptical about whether this could work; Herbert countered with protocols like QUIC, TLS, and TCP fast open as examples of how Linux developers have been able to steer protocols in a better direction in the past.

That is where the conversation stands as of this writing. How it is resolved matters, though. For all practical purposes, Linux is the reference implementation and the proving ground for the protocols that make up the public Internet. Adoption by Linux ensures that a feature will be available across the net; rejection can doom a feature in the long run. But rejection also abdicates the community's role in the development of new protocols, and Linux, too, can be routed around if the forces driving a feature are strong enough. Whether we want to resist header injection or to try to mitigate its worst abuses from the inside is a question that the networking community will need to find an answer to in the relatively near future.

Comments (4 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

Briefs: Firefox 72; Ruby 2.7; news.gmane.org; Quotes; ...
Announcements: Newsletters; conferences; security updates; kernel patches; ...

Next page: Brief items>>