Timer IDs, CRIU, and ABI challenges

By Jonathan Corbet
March 6, 2025

The kernel project has usually been willing to make fundamental internal changes if they lead to a better kernel in the end. The project also, though, goes out of its way to avoid breaking interfaces that have been exposed to user space, even if programs come to rely on behavior that was never documented. Sometimes, those two principles come into conflict, leading to a situation where fixing problems within the kernel is either difficult or impossible. This sort of situation has been impeding performance improvements in the kernel's POSIX timers implementation for some time, but it appears that a solution has been found.

Timers and CRIU

The POSIX timers API allows a process to create its own private interval timers based on any of the clocks provided by the kernel. A process calls timer_create() to create such a timer:

    int timer_create(clockid_t clockid, struct sigevent *sevp, timer_t *id);

The id argument is a pointer to a location where the kernel can return an ID used to identify the new timer; it is of type timer_t, which maps eventually to an int. Various other system calls can use that ID to arm or disarm the timer, query its current status, or delete it entirely. The man page for timer_create() indicates that each created timer will have an ID that is unique within the creating process, but makes no other promises about the returned value.
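As a rough illustration of that interface, the sketch below creates and deletes a timer using the raw system calls, so that the kernel's integer ID is visible directly; the C library's timer_t is typically an opaque type instead. The helper names create_timer_raw() and delete_timer_raw() are invented for this example, and SIGEV_NONE is chosen only so that no signal delivery is involved.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: create an unarmed timer on the given clock and
 * return the kernel's integer ID for it, or -1 with errno set. */
int create_timer_raw(clockid_t clockid)
{
    struct sigevent sev;
    int id = -1;                    /* the kernel-side timer ID is an int */

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_NONE;  /* no notification on expiry */

    if (syscall(SYS_timer_create, clockid, &sev, &id) == -1)
        return -1;
    return id;
}

/* Hypothetical helper: delete a timer by its kernel ID; 0 on success. */
int delete_timer_raw(int id)
{
    return (int)syscall(SYS_timer_delete, id);
}
```

Two timers created this way within one process will receive different IDs, which is all that the man page promises.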

The "unique within the process" guarantee came with the 3.10 kernel release in 2013; previously, the timer IDs were unique system-wide. To understand that change, one has to look at the Checkpoint/Restore in Userspace (CRIU) project, which has long worked on the ability to save the state of a group of processes to persistent storage, then restore that group at a future time, possibly on a different system. Reconstructing the state of a set of processes well enough that the processes themselves are not aware of having been restored in this way is a challenging task; the CRIU developers have often struggled to get all of the pieces working (and to keep them that way).

POSIX timers were one of the places where they ran into trouble. To restore a process that is using timers, CRIU must be able to recreate the timers with the same ID they had at checkpoint time, but the system-call API provides no way to request a specific timer ID. Even if such an ability existed, though, the existence of a single, system-wide ID space for timers was an insurmountable problem; CRIU might try to recreate a timer for a process, only to find that some other, unrelated process in the system already had a timer with that ID. In such cases, the restore would fail.

This problem was addressed with this patch from Pavel Emelyanov, which implemented a new hash table to store timer IDs. That table was still global, but the timer IDs kept therein took the identity of the owning process (specifically, the address of its signal_struct structure) into account, separating each process's timer IDs from all the others. At that point, the problem of ID collision when restoring a process went away.
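The idea of a single global table keyed by both owner and ID can be sketched in user space as follows. This is a toy model rather than the kernel's code: the toy_timer structure, the table size, and the mixing constant in toy_hash() are all assumptions made for illustration, and an opaque pointer stands in for the address of the owning process's signal_struct.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_BITS 9
#define TABLE_SIZE (1u << TABLE_BITS)

struct toy_timer {
    const void *owner;        /* stand-in for the owner's signal_struct address */
    int id;                   /* timer ID, unique only within that owner */
    struct toy_timer *next;   /* bucket chain */
};

/* One table shared by every "process" in the toy model. */
static struct toy_timer *table[TABLE_SIZE];

/* Mix the owner and the ID together; the constant is an arbitrary choice. */
static unsigned int toy_hash(const void *owner, int id)
{
    uint64_t key = (uint64_t)(uintptr_t)owner ^ (uint32_t)id;

    key *= 0x9E3779B97F4A7C15ull;
    return (unsigned int)(key >> (64 - TABLE_BITS));
}

/* A match requires both the owner and the ID to agree. */
struct toy_timer *toy_lookup(const void *owner, int id)
{
    struct toy_timer *t;

    for (t = table[toy_hash(owner, id)]; t; t = t->next)
        if (t->owner == owner && t->id == id)
            return t;
    return NULL;
}

/* Returns 0 on success, -1 if this owner already has a timer with this ID. */
int toy_insert(struct toy_timer *t)
{
    unsigned int bucket = toy_hash(t->owner, t->id);

    if (toy_lookup(t->owner, t->id))
        return -1;
    t->next = table[bucket];
    table[bucket] = t;
    return 0;
}
```

In this model, two owners can each hold a timer with ID 0 without conflict, while a second insertion of ID 0 for the same owner is refused.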

The other problem — the lack of a way to request a specific timer ID — remained, though. To address that problem, CRIU stuck with the approach it had used before, which was based on some internal knowledge about how the kernel allocates those IDs. There is a simple, per-process counter, starting at zero, that is used for timer IDs; that counter is incremented every time a new timer is created. So a series of timer_create() calls will yield a predictable series of IDs, counting through the integer space. When CRIU must create a timer with a specific ID within a to-be-restored process, it takes advantage of this knowledge and simply runs a loop, allocating and destroying timers, until the requested ID is returned.
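A minimal sketch of that kind of search loop might look like the following. It is not CRIU's actual code; the function name create_timer_by_search() is invented here, and the loop relies on the undocumented sequential-allocation behavior described above. It gives up if the kernel hands back an ID beyond the requested one or if the IDs stop increasing.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: create and delete timers until the kernel returns
 * the wanted ID. Returns 0 with that timer left in place, or -1 on failure. */
int create_timer_by_search(clockid_t clockid, int wanted)
{
    struct sigevent sev;
    int last = -1;

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_NONE;  /* the timer is never armed here */

    for (;;) {
        int id = -1;

        if (syscall(SYS_timer_create, clockid, &sev, &id) == -1)
            return -1;
        if (id == wanted)
            return 0;

        /* Not the ID we need: release it before trying again. */
        syscall(SYS_timer_delete, id);

        /* Stop if the counter has passed the target or is not advancing. */
        if (id > wanted || id <= last) {
            errno = ERANGE;
            return -1;
        }
        last = id;
    }
}
```

Each iteration costs two system calls, so the time to reach a given ID grows linearly with its distance from the current counter value.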

If a process only creates a small number of timers in its lifetime, this linear ID search will not take long. Checkpointing, though, is often used on long-running processes in order to save their state should something go wrong partway through. That kind of process, if it regularly creates and destroys timers, can end up with IDs that are spread out widely in the integer space. That, in turn, means it can take a long time to land on the needed ID at restore time.

Without a paddle

In 2023, Thomas Gleixner sent this summary in response to a timer bug report; he noted that in some cases, the allocation loop "will run for at least an hour to restore a single timer". That is not the speedy restore operation that CRIU users may have been hoping for. But the real problem at the time was that the requirement to allocate timer IDs sequentially in the kernel was getting in the way of some needed changes to the internal global hash table, which, in turn, were blocking other changes within the timer subsystem. Since that behavior could not be changed without breaking CRIU, Gleixner concluded that the kernel was "up the regression creek without a paddle".

At the time, some possible solutions were considered. Reducing the ID space from 0..INT_MAX to something smaller could speed the ID search, but it would still be an ABI break; CRIU would no longer be able to restore any process that had created timers with a larger ID. A new system call to create a timer with a given ID was another possibility but, due to how the timer API works (and the sigevent structure it accepts), the 64-bit and 32-bit versions of the system call could not be made compatible. That would require the addition of another "compat" system call, which is something the kernel developers have gone out of their way to avoid for some time. In the end, the conversation wound down with no solution being found.

In mid-February 2025, networking developer Eric Dumazet posted a patch series aimed at reducing locking contention in the kernel's timer code, citing "incidents involving gazillions of concurrent posix timers". That work elicited some testy responses from Gleixner, but there was no questioning the existence of a real problem. So Gleixner went off to create his own patch series, incorporating Dumazet's work, but then aiming to solve the other problems as well. Most of the series is focused on implementing a new hash table that lacks the performance problems found in the current kernel; benchmark results included in the cover letter show that some success was achieved on that front.

A better solution for CRIU

But then Gleixner set out to solve the CRIU problem as well. Rather than create a new system call to enable the creation of a timer with a specific ID, though, he concluded that the id argument to timer_create() could be used to provide that ID. All that is needed is a flag to tell timer_create() to use the provided value rather than generating its own ... but timer_create() has no flags argument. So, if timer_create() is to gain the ability to read a timer ID from user space, some other way needs to be found to let it know that this behavior is requested.

The answer is a pair of new prctl() operations. A call like this:

    prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON);

will cause the calling process to enter a "timer restoration mode" that causes timer_create() to read the requested timer ID from the location pointed to by the id parameter passed from user space. The special value TIMER_ANY_ID can be provided in cases where user space does not have an ID it would like to request. Another prctl() call with PR_TIMER_CREATE_RESTORE_IDS_OFF will exit the restoration mode, causing any subsequent timer_create() calls to generate an ID internally as usual.
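Putting those pieces together, a restore step might be sketched as follows. Several assumptions apply: the function name create_timer_with_id() is invented, the PR_TIMER_CREATE_RESTORE_IDS constants are expected to come from the installed kernel headers (when they are absent, the sketch simply reports ENOSYS), and the raw system call is used on the assumption that a C library wrapper may not pass the caller's value through to the kernel. The unused prctl() arguments are passed as explicit zeroes.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: create a timer with a specific ID using the timer
 * restoration mode. Returns the ID on success, -1 with errno set otherwise. */
int create_timer_with_id(clockid_t clockid, int wanted)
{
#ifdef PR_TIMER_CREATE_RESTORE_IDS
    struct sigevent sev;
    int id = wanted;    /* in restoration mode the kernel reads this value */
    int saved_errno;
    long ret;

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_NONE;

    /* Enter restoration mode; this fails on kernels without the feature. */
    if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON,
              0UL, 0UL, 0UL) == -1)
        return -1;

    ret = syscall(SYS_timer_create, clockid, &sev, &id);
    saved_errno = errno;

    /* Always leave restoration mode so later callers see normal behavior. */
    prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF,
          0UL, 0UL, 0UL);

    if (ret == -1) {
        errno = saved_errno;
        return -1;
    }
    return id;
#else
    /* Headers predate the feature; report it as unsupported. */
    (void)clockid;
    (void)wanted;
    errno = ENOSYS;
    return -1;
#endif
}
```

Switching the mode off again immediately keeps the window in which other timer_create() callers could be surprised as small as possible.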

This functionality is narrowly aimed at CRIU's needs. Normally, adding this kind of process-wide state would be an invitation for problems; some distant thread could make a timer_create() call while the restoration mode is enabled, but expecting the old behavior, and thus be unpleasantly surprised. But CRIU can use this mode at the special point where the restarted processes have been created, but are not yet allowed to resume running at the spot where they were checkpointed. At that time, CRIU is entirely in control and can manage the state properly.

Another important point is that the prctl() call will fail on an older kernel that does not support the timer restoration mode. When CRIU sees that failure, it can go back to the old, brute-force method of allocating timers. The CRIU developers will thus be able to take advantage of the new API while maintaining compatibility for users on older kernels.
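That fallback decision could be made with a small probe along these lines. This is only a sketch of the idea, not what CRIU does; the enum and the function pick_timer_restore_strategy() are invented names, and the prctl() constants are again assumed to come from the installed kernel headers.

```c
#include <sys/prctl.h>

/* Hypothetical names for the two ways of recreating a timer ID. */
enum timer_restore_strategy {
    RESTORE_WITH_PRCTL,   /* kernel supports the timer restoration mode */
    RESTORE_BY_SEARCH     /* fall back to the create-and-delete loop */
};

/* Probe the running kernel once and report which approach to use. */
enum timer_restore_strategy pick_timer_restore_strategy(void)
{
#ifdef PR_TIMER_CREATE_RESTORE_IDS
    if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON,
              0UL, 0UL, 0UL) == 0) {
        /* Supported: switch the mode back off so the probe has no side effect. */
        prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF,
              0UL, 0UL, 0UL);
        return RESTORE_WITH_PRCTL;
    }
#endif
    /* Either the headers or the running kernel lack the feature. */
    return RESTORE_BY_SEARCH;
}
```

The compile-time check covers builds against older headers, while the run-time check covers a newer build running on an older kernel.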

One problem that will remain even after this series is merged is that the sequential-allocation behavior of timer_create(), in the absence of the new prctl() operation, is still part of the kernel's ABI. The timer developers never meant to make that promise, but they are stuck with it for as long as CRIU installations continue to depend on it. The good news is that updating CRIU will generally be necessary for users who update their kernels anyway, since that is the only way to get support for newer kernel features. So, perhaps before too long, the sequential-allocation guarantee for timer_create() can be retired — unless some other user that depends on it emerges from the woodwork.

Index entries for this article
Kernel	Checkpointing
Kernel	Development model/User-space ABI
Kernel	Releases/6.15
Kernel	System calls/timer_create()

No `flags` argument... oh, wait!

Posted Mar 7, 2025 8:19 UTC (Fri) by Karellen (subscriber, #67644) [Link] (6 responses)

but timer_create() has no flags argument.

What, again?

...Oh, wait, the API is specified by POSIX, so Linux can't add extra parameters, even if experience shows it would have been wise to do so! :-)

No `flags` argument... oh, wait!

Posted Mar 7, 2025 8:37 UTC (Fri) by wahern (subscriber, #37304) [Link] (1 responses)

timer_create was standardized in 1997, nearly 2 decades before including a flags argument in new syscalls became common in Linux. See this 2014 LWN article: https://lwn.net/Articles/585415/. A year and a half later Linux added Documentation/adding-syscalls.txt (now Documentation/process/adding-syscalls.rst) which made it policy: https://lwn.net/Articles/654026/.

No `flags` argument... oh, wait!

Posted Mar 7, 2025 11:13 UTC (Fri) by wahern (subscriber, #37304) [Link]

Also, FWIW, I think you could add flags to timer_create via the clockid argument, the same way flags were added to the socket syscall, where SOCK_CLOEXEC and SOCK_NONBLOCK were defined to be OR'able into the type argument.

No `flags` argument... oh, wait!

Posted Mar 7, 2025 9:21 UTC (Fri) by ballombe (subscriber, #9523) [Link] (3 responses)

POSIX does not mandate the syscall interface, only the libc interface, so the Linux syscall could still have a flag that would be set by the libc.

No `flags` argument... oh, wait!

Posted Mar 10, 2025 8:27 UTC (Mon) by tglx (subscriber, #31301) [Link] (2 responses)

Which is not backwards compatible as it would break existing usage of the syscall. So no, we can't add a flag.

No `flags` argument... oh, wait!

Posted Mar 10, 2025 9:19 UTC (Mon) by Wol (subscriber, #4433) [Link] (1 responses)

Is there a runtime mechanism to check for the existence of a syscall? If so, can you just rename it?

Cheers,
Wol

No `flags` argument... oh, wait!

Posted Mar 10, 2025 16:13 UTC (Mon) by tglx (subscriber, #31301) [Link]

Of course we could add a new syscall, but as explained by Jonathan that would require a compat syscall as well, which is what we really try to avoid. As the use case for this is very narrow, the prctl() turned out to be the least of all evils while maintaining full user space compatibility.

Stash in clockid?

Posted Mar 7, 2025 11:55 UTC (Fri) by eru (subscriber, #2753) [Link] (4 responses)

I wonder why the extension to choose timerid was not stashed into the clockid parameter by setting a higher bit ? There is room, the constants for clockid are small integers. This would have kept the effect local.

Stash in clockid?

Posted Mar 7, 2025 15:08 UTC (Fri) by abatters (✭ supporter ✭, #6932) [Link] (1 responses)

There is already a hack to put file descriptors in clockids; see "Dynamic clocks" section of https://man7.org/linux/man-pages/man2/clock_gettime.2.html

Stash in clockid?

Posted Mar 7, 2025 16:54 UTC (Fri) by eru (subscriber, #2753) [Link]

There would still be room, since the clockid is a 32-bit value and file descriptors are never very large. The clock type values take the bottom 3 bits. If the sign bit is left alone, and we use bit 30 for the special flag, this still leaves 27 bits for the fd part, which is way more than enough.

Stash in clockid?

Posted Mar 7, 2025 15:12 UTC (Fri) by bushdave (guest, #58418) [Link] (1 responses)

That wouldn't be a compatible interface. In the POSIX API, the timer_t *id pointer is write-only for the system call. It is fair to assume that it points to uninitialized memory, with any combination of bits set.

Stash in clockid?

Posted Mar 7, 2025 15:13 UTC (Fri) by bushdave (guest, #58418) [Link]

Never mind, I understand now that I thought I read something I didn't.