
LWN.net Weekly Edition for January 17, 2019

Welcome to the LWN.net Weekly Edition for January 17, 2019

This edition contains the following feature content:

  • Fedora, UUIDs, and user tracking: how to count a distribution's users while preserving their privacy.
  • Approaching the kernel year-2038 end game: fixing the system-call interface for 32-bit time values.
  • A slow start to openSUSE's board election: a shortage of board candidates may reflect earlier tensions.
  • Ringing in a new asynchronous I/O API: the proposed io_uring interface.
  • Adiantum: encryption for the low end: an encryption mode for devices lacking cryptographic acceleration.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Fedora, UUIDs, and user tracking

By Jake Edge
January 15, 2019

"User tracking" is generally contentious in free-software communities—even if the "tracking" is not really intended to do so. It is often distributions that have the most interest in counting their users, but Linux users tend to be more privacy conscious than users of more mainstream desktop operating systems. The Fedora project recently discussed how to count its users and ways to preserve their privacy while doing so.

Ben Cotton brought up the topic in the context of a proposal for Fedora 30. Instead of the current method of counting unique IP addresses that request updates from the DNF mirrors, which is an unreliable estimator of Fedora usage, the proposal would create a universally unique identifier (UUID) for each installed system that would be sent with DNF mirror-list requests. It explicitly calls out privacy concerns: "We don't want to track; just count."

The proposal outlines the kind of information that the project would like to count, including the version of Fedora, the Fedora variant (or spin), and the architecture of the machine. It would also be useful to have some way to distinguish long-lived installations from one-off test systems in virtual machines. Currently, variants cannot be distinguished and the unique IP counting method both undercounts systems behind network address translation (NAT) and overcounts systems that change IP addresses frequently. The UUID is similar to what openSUSE uses, so "this is ground already traveled".

Using the machine ID (stored in /etc/machine-id) as the UUID is not part of the plan, since it may be used in other ways that would facilitate tracking. So some kind of random UUID would be generated for this purpose. But, as Lennart Poettering pointed out, sending a UUID makes tracking possible even if the project doesn't want to do that tracking. Essentially, users would need to trust that the project isn't doing the tracking because it says it isn't. While he was skeptical that Fedora really wanted to use a UUID that way, he did suggest using an application-specific machine ID, like those calculated by sd_id128_get_machine_app_specific(). That way, Fedora would be using an existing mechanism that generates a UUID using the machine ID and an ID specific to the counting application.

Poettering also mentioned that Ubuntu counts installations via NTP, which might be an option if Fedora wanted to run its own NTP servers. Both Ubuntu and Fedora configure their systems to regularly ping the NTP servers. Another possibility would be to send a "countme" flag once a day as part of the captive-portal and connectivity detection that is already installed with Fedora, but that did not sit well with Kevin Kofler. He called the existing NetworkManager-config-connectivity-fedora package "spyware" and does not install it on his systems. Fedora project leader Matthew Miller (who is also the owner of the feature proposal) said that the connectivity check could be used but it would only count a subset of desktops and not other types of installations, such as server, cloud, or container. In addition, setting up NTP servers would be much more work than hosting a UUID-counting service, he said.

Miller said that the intention is to rotate the logs "fairly frequently", but that is not really visible to users so there is still a trust factor present. But Tom Gundersen suggested another approach:

You could move the rotation to the client by hashing the UUID with a timestamp of sufficiently coarse granularity (a week?) before submitting it.

Then you make sure that all UUIDs submitted by a given machine during a given time window are the same, but UUIDs submitted in different windows are not related, and you don't have to trust the server to respect your privacy.
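Gundersen's client-side rotation amounts to submitting hash(UUID || week number) instead of the raw UUID. A minimal sketch of the idea follows; the function names are hypothetical, and a toy FNV-1a hash stands in for the cryptographic hash a real implementation would need (otherwise the server could brute-force UUIDs back out of the submitted values):

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Toy 64-bit FNV-1a hash.  A real implementation would use a
 * cryptographic hash so the server cannot reverse the mapping. */
static uint64_t fnv1a64(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint64_t h = 0xcbf29ce484222325ULL;

	while (len--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Derive the value actually submitted: hash(UUID || week number).
 * All submissions from one machine within a week match; submissions
 * in different weeks cannot be linked to each other. */
uint64_t weekly_id(const char *uuid, time_t now)
{
	uint64_t week = (uint64_t)now / (7 * 24 * 3600);
	unsigned char buf[64];
	size_t n = strlen(uuid);

	if (n > sizeof(buf) - sizeof(week))
		n = sizeof(buf) - sizeof(week);
	memcpy(buf, uuid, n);
	memcpy(buf + n, &week, sizeof(week));
	return fnv1a64(buf, n + sizeof(week));
}
```

The server can still count distinct values within one window, but correlating values across windows would require inverting the hash.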

That approach would "make sense", Poettering said, though he still advocated using NTP or the "HTTP ping" that is done as part of the captive-portal detection. Others, such as Bruno Wolff III, are worried that even if the UUIDs are changed frequently, users still have to trust Fedora (or someone who gained access to the logs) not to correlate UUIDs, IP addresses, and other information to track users that way. Beyond that, Nicolas Mailhot is concerned about interaction with the EU General Data Protection Regulation (GDPR); that requires a shift in thinking about how data can be misused:

That's what the GDPR is about. It's *your* responsibility as data collector to think about how data could be used, it's *your* problem to protect it, it's *your* problem if it's misused, you can not make it available on a platter for others to do evil things with and claim it's those people's problem.

Wolff also pointed out that attackers may try to send UUIDs that are unexpected. Those could be generated to try to attack the system in some way or they could simply be strings containing profanity or other "not safe for work" (NSFW) content. He wants to ensure that the actual UUID strings don't end up in reports or require review by humans. Even ensuring that the strings are valid hexadecimal doesn't preclude inventive usage that could embarrass the project or offend people. Beyond that, UUIDs could be changed more frequently to try to inflate the statistics.

As these privacy and other problems with the UUID scheme were being discussed, Poettering came up with a scheme that alleviated most of the problems that were identified. He proposed that a "countme" flag simply be added to a single mirror-list query each week. The sum of all such queries over a week's time should provide an accurate estimate of the number of Fedora systems. That way, UUIDs need not be stored, which removes much of the concern—data that is not stored cannot be misused.

Poettering followed up by noting that avoiding even the appearance of tracking will likely result in fewer users disabling the counting mechanism. Miller was enthusiastic about the idea; he suggested that since there would be no UUID associated with the information, the "countme" flag could increment once per week, which would give some additional information about the longevity of systems—without providing much information that could be used for tracking.
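Miller's incrementing variant could be handled entirely on the client side; the sketch below is illustrative (the names and state layout are not Fedora's actual implementation):

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define WEEK_SECONDS (7 * 24 * 3600)

/* Persisted client-side state; nothing here identifies the machine. */
struct countme_state {
	time_t   last_sent;	/* when the flag was last sent */
	uint32_t counter;	/* how many weeks this system has reported */
};

/* Called before a mirror-list request: returns true if this request
 * should carry the "countme" flag, and advances the counter so the
 * server learns only "some machine is in its Nth week". */
bool countme_due(struct countme_state *st, time_t now)
{
	if (now - st->last_sent < WEEK_SECONDS)
		return false;
	st->last_sent = now;
	st->counter++;
	return true;
}
```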

It would not even necessarily require that every machine reported, Roberto Ragusa suggested. Machines could decide whether to report based on some property of their machine ID (e.g. divides evenly by 1000) or by combining machine ID and the date so that the counted systems would change over time. Then the counts could simply be multiplied by whatever is used as a modulus to provide the actual estimate.
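Ragusa's sampling idea might look something like this (an illustrative sketch; the function name and modulus are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define SAMPLE_MODULUS 1000	/* roughly one machine in 1000 reports */

/* Decide whether this machine reports today.  machine_hash would be
 * derived from /etc/machine-id in practice; mixing in the date makes
 * the sampled subset of machines rotate over time.  The server
 * multiplies the resulting count by SAMPLE_MODULUS to estimate the
 * total population. */
bool should_report(uint64_t machine_hash, time_t now)
{
	uint64_t day = (uint64_t)now / (24 * 3600);

	return (machine_hash + day) % SAMPLE_MODULUS == 0;
}
```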

Overall, there were few complaints about the simpler counting mechanism. Miller has updated the proposal using Poettering's method; it should be posted to the mailing list soon, once he receives some feedback from the DNF developers. It seems likely that Fedora 30 will have the feature when it is released, which is currently scheduled for the end of April.

We have looked at other user-counting initiatives and proposals along the way. In 2010, there was a proposal to add UUID tracking to Yum, but Fedora has been trying to figure out how to unobtrusively count users for longer than that. A 2006 scheme involving a tracking image was proposed for Fedora Core 7. More recently, the Django web-framework project discussed adding analytics that would report to Google servers, which was not popular with Debian (at least).

There is a certain amount of tension between the needs of a distribution or software project and the needs of users—especially when it comes to privacy issues. Being able to show the existence of more project users will generally lead to a higher profile and potentially more funding for development and other activities. Counting variants can also help projects make better decisions about where to allocate their scarce resources. But many users do not want to be tracked, though they may be willing to be counted. This Fedora proposal seems like it finds a reasonable balance by reusing an existing mechanism without adding something that could be tracked. It will be interesting to see what Fedora finds once it rolls out this counting feature to users.

Comments (65 posted)

Approaching the kernel year-2038 end game

By Jonathan Corbet
January 11, 2019
In January 2038, the 32-bit time_t value used on many Unix-like systems will run out of bits and be unable to represent the current time. This may seem like a distant problem, but, as Tom Scott recently observed, the year-2038 apocalypse is now closer to the present than the year-2000 problem. The fact that systems being deployed now will still be operating in 2038 adds urgency to the issue as well. The good news is that work has been underway for years to prepare Linux for this date, so there should be no need to call developers out of retirement in 2037 in a last-minute panic. Some of the final steps in this transition for the core kernel have been posted, and seem likely to be merged for 5.1.
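The arithmetic behind the deadline is easy to check: a signed 32-bit time_t overflows after 2^31-1 seconds past the Unix epoch. A small sketch, assuming the host's own time_t is 64 bits (as on modern 64-bit Linux), shows exactly where that lands:

```c
#include <stdint.h>
#include <time.h>

/* The last second representable in a signed 32-bit time_t. */
#define TIME32_MAX 2147483647

/* On a system whose own time_t is 64 bits, gmtime() can show when a
 * 32-bit counter runs out: 03:14:07 UTC on January 19, 2038.  One
 * second later, a signed 32-bit time_t wraps to -2147483648, which
 * corresponds to December 13, 1901. */
struct tm last_time32_moment(void)
{
	time_t t = TIME32_MAX;

	return *gmtime(&t);
}
```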

There have been numerous phases to this work, which has been carried out primarily by Arnd Bergmann and Deepa Dinamani. Timekeeping within the kernel has been reworked to use 64-bit values throughout, even on 32-bit systems, for example. A lot of work was required to get there, but that was, in some sense, the easy part; since the changes were all internal to the kernel, the developers involved were free to change interfaces when needed. Life becomes more difficult when it comes to the system-call interface, since that cannot be changed at whim without breaking user-space applications.

The approach that has been taken here, for many of the relevant system calls, is to recognize that most systems already have a 64-bit solution for 32-bit applications. Most 64-bit kernels are able to run 32-bit processes; to do so, they provide a set of compatibility (or "compat") system calls to perform impedance matching. Typically, these compat calls simply reformat 32-bit types into their 64-bit equivalent, then pass the result to the native 64-bit implementations. In other words, the compat calls do exactly what is needed to connect a user space process using 32-bit times to a kernel that uses 64-bit times throughout.
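The reformatting that a compat call performs can be sketched for the timespec case; the structure names here are illustrative (the kernel's own type for the 32-bit layout is old_timespec32):

```c
#include <stdint.h>

/* A 32-bit process's timespec, as the compat layer sees it. */
struct timespec32 {
	int32_t tv_sec;
	int32_t tv_nsec;
};

/* The kernel's native representation uses 64-bit fields. */
struct timespec64 {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* The essence of a compat system call: widen the 32-bit structure,
 * then hand the result to the native 64-bit implementation.  The
 * year-2038 work promotes exactly this path into the native 32-bit
 * system calls. */
struct timespec64 widen_timespec(struct timespec32 ts32)
{
	struct timespec64 ts64 = {
		.tv_sec  = ts32.tv_sec,	/* sign-extended to 64 bits */
		.tv_nsec = ts32.tv_nsec,
	};

	return ts64;
}
```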

Much of the work that has been done to this point, thus, has been promoting these compat system calls to become the native 32-bit system calls. User space sees no changes, but the kernel is able to leave 32-bit times behind entirely. To that end, one of the key changes in this patch set posted by Bergmann is to take the compat calls and define them as proper system calls for 32-bit systems. In the process, these calls are renamed; futex() becomes futex_time32(), for example. Then, 32-bit architectures are switched over to use the new _time32() calls.

The only remaining problem, of course, is that user space is still using 32-bit times, so things will still explode on schedule in 2038. Fixing that problem is not something that the kernel can do on its own, but it can provide the infrastructure to make the transition possible. In particular, for all of the _time32() calls described above, the patch set also exposes the 64-bit versions with _time64() suffixes. So, once this patch is applied, both the (broken) 32-bit and (fixed) 64-bit interfaces are available in 32-bit systems.

At this point, the ball moves into the court of the C library and distribution developers. A new C library release can define the system-call interfaces with 64-bit time values, and implement those interfaces with calls to the _time64() versions. Older binaries, instead, will continue to use the 32-bit versions. For many applications, all that will be needed at this point is a rebuild and they will be prepared to survive the 2038 transition. Others, of course, will require more work. Distributors have the option of rebuilding everything they ship for 64-bit time, then disabling 32-bit times entirely by turning off the COMPAT_32BIT_TIME configuration variable. Most distributors, though, are likely to support both modes for some time yet.

For the curious, the system calls affected are: adjtimex(), clock_adjtime(), clock_getres(), clock_gettime(), clock_nanosleep(), clock_settime(), futex(), io_getevents(), io_pgetevents(), mq_timedsend(), mq_timedreceive(), nanosleep(), ppoll(), pselect6(), recvmmsg(), rt_sigtimedwait(), sched_rr_get_interval(), semtimedop(), timer_gettime(), timer_settime(), timerfd_gettime(), timerfd_settime(), and utimensat(). The plan for the GNU C Library transition has been posted in great detail as well.

These changes fix the core kernel system-call interfaces, but that is not the end of the story. There are many other places in the kernel's user-space API where time values appear, and many of them need to be fixed as well. Those are slowly being addressed. Consider, for example, the SO_TIMESTAMP socket option (described in this man page); it enables the reception of control messages with network timestamp values. Those values are specified using struct timeval, which is not year-2038 safe.

This patch set from Dinamani addresses that problem by adding a new set of options that are year-2038 safe. An application can request SO_TIMESTAMP_NEW to get a new control-message format with 64-bit times; the SO_TIMESTAMPNS and SO_TIMESTAMPING options have seen a similar treatment. Socket timeout values also have a year-2038 problem; this patch series adds SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW to address it. Once again, libraries and (possibly) applications will need to be changed to be able to make use of these new options.

Once this work gets in, the kernel community, at least, can begin to think that there is some light at the end of the tunnel. Problems will remain, mostly in filesystem timestamps and time values that are passed to ioctl() calls, but the work as a whole can be seen as entering the clean-up phase. For library developers and distributors, though, the real work is just beginning. The good news is that they still have some time to get their piece of the work done so that systems deployed in the near future will be ready for the not-so-near (but approaching rapidly) 2038 deadline.

Comments (40 posted)

A slow start to openSUSE's board election

By Jonathan Corbet
January 10, 2019
What if you announced a board election and nobody ran? That is the quandary the openSUSE project faced as recently as January 4, when the nomination deadline loomed and no candidates for the three open seats had come forward. The situation has since changed, and openSUSE members will have a wide slate of candidates to choose from. But the seeming reticence to come forward may well be a reflection of some unresolved tensions that exploded into a flame war several months ago.

The openSUSE board is not a hugely powerful organization. It serves as a central coordinating point for the project, and it commands a small budget that can be used to promote the project and its initiatives. The board is also the decision maker for conduct issues, with the authority to suspend or expel members from the project. It is made up of six members: a chair appointed by SUSE, and five members elected by the project membership for two-year terms.

The five-week nomination period for the current election began on December 11; it closes on January 13. After that, campaigning begins, with ballots being returned in the first half of February; results are set to be announced on February 16. All of that, of course, assumes that candidates actually show up to campaign for those seats; at the beginning of January, that had not happened. The backup plan, as noted in the article linked in the introduction, is that "the three remaining members of the openSUSE Board will be tasked to choose new Board Members, based on their own personal choices, to fill those three vacant seats" (emphasis in the original). That doesn't strike anybody as the best way to properly represent the openSUSE community.

There are plenty of reasons why candidates for a volunteer board might be in short supply. People are busy and do not always have time available for that sort of commitment. The nomination period coincided with the holidays when potential candidates are occupied with vacations, travel, and finding the perfect gift for that impossible-to-please relative. Not everybody wants to take on the stress of campaigning for an office and possibly being rejected by one's peers. And so on. But it may also be that potential candidates saw a side of the board that they didn't like back in August and have chosen not to be a part of it.

As detailed in the July 31 board meeting minutes, the openSUSE board was presented with a proposal to sponsor a youth football team in Bavaria. This was an unusual request. As noted in the minutes, the acceptance was justified by saying: "even though this is not a conference, there will be openSUSE Contributors present". Said contributors, most likely, will be too busy remonstrating with the referee to ponder openSUSE development while at the event but, even so, the board voted — five in favor, one against — to provide an undisclosed amount of sponsorship.

This decision caused a few raised eyebrows among the openSUSE members, some of whom felt that those funds could have been spent in ways that were more likely to benefit the openSUSE project. In the resulting discussion, board member Ana Martínez let it be known that hers had been the dissenting vote on this issue. The discussion meandered along a little further, wandering into important topics like potentially sponsoring a pony if a humane way could be found to brand it with the openSUSE logo — a brilliant opportunity that, for some unknown reason, was not pursued further.

Also in that discussion, though, board chair Richard Brown let it be known that he was not happy about Martínez having disclosed her vote, invoking "the Board's rule that decisions made collectively are defended collectively". Martínez apologized at the time, and the topic appeared to be finished — until the August 21 meeting minutes noted that the appropriateness of expressing personal points of view had been a topic at that meeting. The resulting discussion went on for about 150 posts — of about 250 total posted to the opensuse-project list between then and now.

Bryan Lunduke started it off by arguing that it's crucial that board members be able to explain their positions:

The Board positions are elected (other than Richard, who is installed by a corporation). If Board members cannot speak their personal opinions on any (and every) topic, it becomes *pointless* to have elections -- as there would be no way to easily understand the differences between elected officials and hold them accountable to the people who elected them.

Brown responded with a lengthy argument on why he felt that the board needs to act as a unified organization, while encouraging dissent internally:

The Board needs to be an environment that supports dissent too. I like the fact that the current Board is full of passionate members and we manage to fill a whole hour every week seeking agreement on the topics we're discussing. But the Board's ability to function in it's roles of decision makers of last resort, or arbitrator of disputes, is severely impacted if the Board cannot be seen to be a unified body. Or else every decision can easily be picked apart by people playing political games of divide and conquer.

He also raised the issue of code-of-conduct decisions, where knowing how individuals had voted could lead to substantial tension going forward. When Martínez argued in favor of more transparency, he responded that she was not living up to the expectations for a board member, and that if her views won out "the Board's ability to function in the roles it exists for will be grossly undermined". After that, the discussion became increasingly unpleasant.

At its close, Lunduke, who has served on the board in the past, departed the project with a fair amount of fanfare, citing "profane, rude, slanderous... and just plain not nice" responses from some current board members. He was probably referring to messages like this one. Martínez, instead, resigned her seat in November, citing a lack of time to keep up with the responsibilities of board membership.

As far as can be seen from the public record, there has been no change to the policy described by Brown wherein board members are expected to keep their personal opinions on board decisions to themselves. During the discussion, he had said that anything else would lead to people choosing not to run for the board:

If every decision of the Board is open to the level of public flagellation as you see in this thread, I think it's safe to say that no right minded individual will ever volunteer for the Board. It would be a torturous punishment, not a difficult but necessary service provided by volunteers as it currently is.

For comparison, the Fedora Council conducts most meetings in public, seemingly without the sort of problems discussed here. Meanwhile, openSUSE board decisions have not been subject to this sort of "flagellation" since the discussion died down, but volunteers for the board proved to be scarce anyway. It would not be surprising to learn that many potential members were put off by this heated debate, regardless of whether they agree with the "unified front" policy. The good news for the openSUSE project is that the drought has ended. Current board members Sarah Julia Kriesch and Christian Boltz have decided to run for re-election; they will be competing against Vinzenz Vietzke, Sébastien Poher, Axel Braun, and Nathan Wolf, at least.

So it seems that the openSUSE project will have a reasonable set of candidates to choose from when the voting begins. Whether the transparency issue will come up during the campaigning remains to be seen. But it seems like a question that is likely to come up again at some point, both in the openSUSE project and in many other community governance settings where contributors have to balance their individual roles with the expectations associated with any decision-making roles they may fill.

Comments (18 posted)

Ringing in a new asynchronous I/O API

By Jonathan Corbet
January 15, 2019
While the kernel has had support for asynchronous I/O (AIO) since the 2.5 development cycle, it has also had people complaining about AIO for about that long. The current interface is seen as difficult to use and inefficient; additionally, some types of I/O are better supported than others. That situation may be about to change with the introduction of a proposed new interface from Jens Axboe called "io_uring". As might be expected from the name, io_uring introduces just what the kernel needed more than anything else: yet another ring buffer.

Setting up

Any AIO implementation must provide for the submission of operations and the collection of completion data at some future point in time. In io_uring, that is handled through two ring buffers used to implement a submission queue and a completion queue. The first step for an application is to set up this structure using a new system call:

    int io_uring_setup(int entries, struct io_uring_params *params);

The entries parameter is used to size both the submission and completion queues. The params structure looks like this:

    struct io_uring_params {
	__u32 sq_entries;
	__u32 cq_entries;
	__u32 flags;
	__u16 resv[10];
	struct io_sqring_offsets sq_off;
	struct io_cqring_offsets cq_off;
    };

On entry, this structure (with the possible exception of flags as described later) should simply be initialized to zero. On return from a successful call, the sq_entries and cq_entries fields will be set to the actual sizes of the submission and completion queues; the code is set up to allocate the requested number of submission entries (entries), and twice that many completion entries.

The return value from io_uring_setup() is a file descriptor that can then be passed to mmap() to map the buffer into the process's address space. More specifically, three calls are needed to map the two ring buffers and an array of submission-queue entries; the information needed to do this mapping will be found in the sq_off and cq_off fields of the io_uring_params structure. In particular, the submission queue, which is a ring of integer array indices, is mapped with a call like:

    subqueue = mmap(0, params.sq_off.array + params.sq_entries*sizeof(__u32),
    		    PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
		    ring_fd, IORING_OFF_SQ_RING);

Where params is the io_uring_params structure, and ring_fd is the file descriptor returned from io_uring_setup(). The addition of params.sq_off.array to the length of the region accounts for the fact that the ring is not located right at the beginning. The actual array of submission-queue entries, instead, is mapped with:

    sqentries = mmap(0, params.sq_entries*sizeof(struct io_uring_sqe),
    		    PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
		    ring_fd, IORING_OFF_SQES);

This separation of the queue entries from the ring buffer is needed because I/O operations may well complete in an order different from the submission order. The completion queue is simpler, since the entries are not separated from the queue itself; the incantation required is similar:

    cqentries = mmap(0, params.cq_off.cqes + params.cq_entries*sizeof(struct io_uring_cqe),
    		    PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
		    ring_fd, IORING_OFF_CQ_RING);

It's perhaps worth noting at this point that Axboe is working on a user-space library that will hide much of the complexity of this interface from most users.

I/O submission

Once the io_uring structure has been set up, it can be used to perform asynchronous I/O. Submitting an I/O request involves filling in an io_uring_sqe structure, which looks like this (simplified a bit):

    struct io_uring_sqe {
	__u8	opcode;		/* type of operation for this sqe */
	__u8	flags;		/* IOSQE_ flags */
	__u16	ioprio;		/* ioprio for the request */
	__s32	fd;		/* file descriptor to do IO on */
	__u64	off;		/* offset into file */
	void	*addr;		/* buffer or iovecs */
	__u32	len;		/* buffer size or number of iovecs */
	union {
	    __kernel_rwf_t	rw_flags;
	    __u32		fsync_flags;
	};
	__u64	user_data;	/* data to be passed back at completion time */
	__u16	buf_index;	/* index into fixed buffers, if used */
    };

The opcode describes the operation to be performed; options include IORING_OP_READV, IORING_OP_WRITEV, IORING_OP_FSYNC, and a couple of others that we will return to. There are clearly a number of parameters that affect how the I/O is performed, but most of them are relatively straightforward: fd describes the file on which the I/O will be performed, for example, while addr and len describe a set of iovec structures pointing to the memory where the I/O is to take place.
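Filling in an entry for a vectored read might look like this; the sketch below uses a simplified, non-ABI-accurate layout, and the opcode value is illustrative rather than taken from the patch set:

```c
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

#define IORING_OP_READV 1	/* opcode value illustrative */

/* Field layout simplified from the patch set; not ABI-accurate. */
struct io_uring_sqe {
	uint8_t		opcode;
	uint8_t		flags;
	uint16_t	ioprio;
	int32_t		fd;
	uint64_t	off;
	void		*addr;
	uint32_t	len;
	uint32_t	rw_flags;
	uint64_t	user_data;
	uint16_t	buf_index;
};

/* Prepare a vectored read of nr_vecs iovecs from fd at offset off.
 * user_data comes back untouched in the completion event, letting
 * the application match completions to submissions. */
void prep_readv(struct io_uring_sqe *sqe, int fd, struct iovec *iovs,
		unsigned nr_vecs, uint64_t off, uint64_t user_data)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode	= IORING_OP_READV;
	sqe->fd		= fd;
	sqe->addr	= iovs;
	sqe->len	= nr_vecs;
	sqe->off	= off;
	sqe->user_data	= user_data;
}
```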

As mentioned above, the io_uring_sqe structures are kept in an array that is mapped into both user and kernel space. Actually submitting one of those structures requires placing its index into the submission queue, which is defined this way:

    struct io_uring {
	u32 head;
	u32 tail;
    };

    struct io_sq_ring {
	struct io_uring		r;
	u32			ring_mask;
	u32			ring_entries;
	u32			dropped;
	u32			flags;
	u32			array[];
    };

The head and tail values are used to manage entries in the ring; if the two values are equal, the ring is empty. User-space code adds an entry by putting its index into array[r.tail] and incrementing the tail pointer; only the kernel side should change r.head. Once one or more entries have been placed in the ring, they can be submitted with a call to:

    int io_uring_enter(unsigned int fd, u32 to_submit, u32 min_complete, u32 flags);

Here, fd is the file descriptor associated with the ring, and to_submit is the number of entries in the ring that the kernel should submit at this time. The return value should be zero if all goes well.
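The producer side of that protocol, using the structures above, might look like this in user space; it is an illustrative sketch, omitting the memory barriers a real implementation needs to ensure that the kernel sees the array store before the tail update:

```c
#include <stdint.h>

struct io_uring {
	uint32_t head;
	uint32_t tail;
};

struct io_sq_ring {
	struct io_uring	r;
	uint32_t	ring_mask;	/* ring_entries - 1 (power of two) */
	uint32_t	ring_entries;
	uint32_t	dropped;
	uint32_t	flags;
	uint32_t	array[];	/* indices into the sqe array */
};

/* Enqueue one submission-queue entry by index; returns 0 on success
 * or -1 if the ring is full.  Only user space advances the tail. */
int sq_ring_push(struct io_sq_ring *ring, uint32_t sqe_index)
{
	if (ring->r.tail - ring->r.head == ring->ring_entries)
		return -1;	/* full */
	ring->array[ring->r.tail & ring->ring_mask] = sqe_index;
	ring->r.tail++;
	return 0;
}
```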

Completion events will find their way into the completion queue as operations are executed. If flags contains IORING_ENTER_GETEVENTS and min_complete is nonzero, io_uring_enter() will block until at least that many operations have completed. The actual results can be found in the completion structure:

    struct io_uring_cqe {
	__u64	user_data;	/* sqe->user_data submission passed back */
	__s32	res;		/* result code for this event */
	__u32	flags;
    };

Where user_data is a value passed from user space when the operation was submitted and res is the return code for the operation. The flags field will contain IOCQE_FLAG_CACHEHIT if the request could be satisfied without needing to perform I/O — an option that may yet have to be reconsidered given the current concern about using the page cache as a side channel.

These structures live in the completion queue, which looks similar to the submission queue:

    struct io_cq_ring {
	struct io_uring		r;
	u32			ring_mask;
	u32			ring_entries;
	u32			overflow;
	struct io_uring_cqe	cqes[];
    };

In this ring, the r.head index points to the first available completion event, while r.tail points to the last; user space should only change r.head.
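The consumer side mirrors the submission path; again this is an illustrative sketch with memory barriers omitted:

```c
#include <stdbool.h>
#include <stdint.h>

struct io_uring_cqe {
	uint64_t	user_data;
	int32_t		res;
	uint32_t	flags;
};

struct io_uring {
	uint32_t head;
	uint32_t tail;
};

struct io_cq_ring {
	struct io_uring		r;
	uint32_t		ring_mask;
	uint32_t		ring_entries;
	uint32_t		overflow;
	struct io_uring_cqe	cqes[];
};

/* Pop one completion event, if any.  The kernel advances the tail as
 * operations complete; only user space advances the head. */
bool cq_ring_pop(struct io_cq_ring *ring, struct io_uring_cqe *out)
{
	if (ring->r.head == ring->r.tail)
		return false;	/* ring is empty */
	*out = ring->cqes[ring->r.head & ring->ring_mask];
	ring->r.head++;
	return true;
}
```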

The interface as described so far is enough to enable a user-space program to enqueue multiple I/O operations and to collect the results as those operations complete. The functionality is similar to what the current AIO interface provides, though the interface is quite different. Axboe claims that it is far more efficient, but no benchmark results have been included yet to back up that claim. Among other things, this interface can do asynchronous buffered I/O without a context switch in cases where the desired data is in the page cache; buffered I/O has always been a bit of a sore spot for Linux AIO.

Advanced features

There are, however, some more features worthy of note in this interface. One of those is the ability to map a program's I/O buffers into the kernel. This mapping normally happens with each I/O operation so that data can be copied into or out of the buffers; the buffers are unmapped when the operation completes. If the buffers will be used many times over the course of the program's execution, it is far more efficient to map them once and leave them in place. This mapping is done by filling in yet another structure describing the buffers to be mapped:

    struct io_uring_register_buffers {
	struct iovec *iovecs;
	__u32 nr_iovecs;
    };

That structure is then passed to another new system call:

    int io_uring_register(unsigned int fd, unsigned int opcode, void *arg);

In this case, the opcode should be IORING_REGISTER_BUFFERS. The buffers will remain mapped for as long as the initial file descriptor remains open, unless the program explicitly unmaps them with IORING_UNREGISTER_BUFFERS. Mapping buffers in this way is essentially locking memory into RAM, so the usual resource limit that applies to mlock() applies here as well. When performing I/O to premapped buffers, the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED operations should be used.

There is also an IORING_REGISTER_FILES operation that can be used to optimize situations where many operations will be performed on the same file(s).

In many high-bandwidth settings, it can be more efficient for the application to poll for completion events rather than having the kernel collect them and wake the application up; that is the motivation behind the existing block-layer polling interface, for example. Polling is most efficient in situations where, by the time the application gets around to doing a poll, there is almost certainly at least one completion ready for it to consume. This polling mode can be enabled for io_uring by setting the IORING_SETUP_IOPOLL flag when calling io_uring_setup(). In such rings, an occasional call to io_uring_enter() (with the IORING_ENTER_GETEVENTS flag set) is mandatory to ensure that completion events actually make it into the completion queue.

Finally, there is also a fully polled mode that (almost) eliminates the need to make any system calls at all. This mode is enabled by setting the IORING_SETUP_SQPOLL flag at ring setup time. A call to io_uring_enter() will kick off a kernel thread that will occasionally poll the submission queue and automatically submit any requests found there; completion polling is also performed if it has been requested. As long as the application continues to submit I/O and consume the results, I/O will happen with no further system calls.

Eventually, though (after one second currently), the kernel will get bored if no new requests are submitted and the polling will stop. When that happens, the flags field in the submission queue structure will have the IORING_SQ_NEED_WAKEUP bit set. The application should check for this bit and, if it is set, make a new call to io_uring_enter() to start the mechanism up again.

This patch set is in its third version as of this writing, though that is a bit deceptive since there were (at least) ten revisions of the polled AIO patch set that preceded it. While it is possible that the interface is beginning to stabilize, it would not be surprising to see some significant changes yet. One review comment that has not yet been addressed is Matthew Wilcox's request that the name be changed to "something that looks a little less like io_urine". That could yet become the biggest remaining issue — as we all know, naming is always the hardest part in the end. But, once those details are worked out, the kernel may yet have an asynchronous I/O implementation that is not a constant source of complaints.

For the curious, Axboe has posted a complete example of a program that uses the io_uring interface.


Adiantum: encryption for the low end

By Jake Edge
January 16, 2019

Low-end devices bound for developing countries, such as those running the Android Go edition, lack encryption support because the hardware doesn't provide any cryptographic acceleration. That means users in developing countries have no protection for the data on their phones. Google would like to change that situation. The company worked on adding the Speck cipher to the kernel, but decided against using it because of opposition due to Speck's origins at the US National Security Agency (NSA). As a replacement, the Adiantum encryption mode was developed; it has been merged for Linux 5.0.

Eric Biggers has been spearheading the effort; he posted version 4 of the Adiantum patch set in mid-November and it was pulled by kernel crypto maintainer Herbert Xu shortly thereafter; it will appear in the 5.0 kernel. Meanwhile Speck was removed from the kernel in 4.20 for lack of any maintainer or users. The Adiantum patch description is lengthy and informative, but there is also a paper by Biggers and Paul Crowley (who did much of the work in coming up with Adiantum and its predecessor HPolyC). Incidentally, the paper notes that the name "Adiantum" is the genus of the maidenhair fern.

Adiantum is intended as a choice of encryption mode for disk encryption on Linux systems. It can be used either for block-level encryption as part of dm-crypt or for file and directory encryption as part of fscrypt. Adiantum and its supporting crypto primitives needed to be added to the kernel so that it could be used from those kernel subsystems; most of the 14-part patch set consists of adding the various primitives that Adiantum builds on.

It's worth noting that Adiantum is not a new encryption algorithm as such; instead, it is a repackaging of the ChaCha stream cipher that makes it useful for disk encryption. That makes reasoning about its security relatively straightforward:

Adiantum is a construction, not a primitive. Its security is reducible to that of XChaCha12 and AES-256, subject to a security bound; the proof is in Section 5 of our paper. Therefore, one need not "trust" Adiantum; they only need trust XChaCha12 and AES-256.

In this way, the authors have tried, with apparent success, to avoid the trust issues that surrounded Speck.

Many low-end, inexpensive devices (e.g. mobile phones for the developing world) and even some smartwatches are shipped with older or less powerful Arm CPUs that lack the cryptographic extensions that more recent processors have. The goal was to find a way to encrypt filesystem data on those devices and, crucially, to be able to decrypt it quickly enough that users will not be annoyed by the performance—or have their batteries unduly impacted. Speck mostly fit the bill, but it turns out that Adiantum is even faster (roughly 30%), so the political issues that made Speck untenable turned out to be a boon for users.

HPolyC was the original algorithm that Biggers and Crowley were planning to use as a Speck replacement; it was already faster than Speck, but some further refinements led to Adiantum and even better performance. The main change from HPolyC to Adiantum is the hash function used. Both use the Poly1305 message authentication code (MAC) hash family, but Adiantum first runs the data through a hash from the NH family, which effectively compresses it by a factor of 32; only then is Poly1305 applied.

Both Poly1305 and NH are families of hash functions that are deemed "almost universal". A universal hash family has the property that it minimizes collisions even when the input is controlled by an adversary. Each member of the family spreads its inputs over a wide range of buckets, but an adversary who knows which single member is in use can engineer collisions against it. Choosing a member of the family at random, based on the key, thwarts that kind of attack.

Using NH in addition to Poly1305 does reduce the key agility of Adiantum; the paper recommends HPolyC for applications that need to switch keys quickly. NH is simple enough to implement in SIMD assembly (such as Arm NEON) for performance, while the more complicated Poly1305 implementation is written in C, which aids portability.

The bulk encryption cipher used is XChaCha12, an extended-nonce stream cipher from the ChaCha family. It uses 12 rounds, as the name implies, fewer than the 20 rounds of the commonly used ChaCha20. The best-known attacks against ChaCha apply to the seven-round variant, so ChaCha12 still provides a comfortable security margin. The XChaCha12 encryption is combined with a single AES-256 encryption, but of just 16 bytes per block. AES is often used for disk encryption on higher-end devices because their processors provide AES acceleration, but without that acceleration it is far too slow and power hungry for low-end devices.

According to Biggers, this provides a better security margin than HPolyC or AES. In addition, Adiantum has the property that changing a single bit in the input completely scrambles the block, unlike other modes (e.g. XTS), where it will only affect 16 bytes in the block.

Adiantum is a length-preserving encryption mode, which is important for disk encryption: the ciphertext for a sector must fit in exactly the space the plaintext occupied. It would be ideal to store a random nonce alongside each block of ciphertext, Biggers said, but that requires another layer (such as dm-integrity) to manage the extra per-block data. That negatively impacts performance, so, at least for now, length-preserving encryption is needed.

"Encryption for all" is an explicit goal in various domains; it has driven the push for "HTTPS Everywhere", for example. It is nice to see work being done to ensure that people in developing countries will be able to secure their data on what may well be their only computing device: their mobile phone. One hopes that Adiantum and HPolyC will be adopted widely—in Android and beyond.


Page editor: Jonathan Corbet


Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds