
The edge-triggered misunderstanding

By Jonathan Corbet
August 5, 2021
The Android 12 beta release first appeared in May of this year. As is almost obligatory, this release features "the biggest design change in Android's history"; what's an Android release without requiring users to relearn everything? That historical event was not meant to include one change that many beta testers are noticing, though: a kernel regression that breaks a significant number of apps. This problem has just been fixed, but it makes a good example of why preventing regressions can be so hard and how the kernel project responds to them when they do happen.

Back in late 2019, David Howells made some changes to the pipe code to address a couple of problems. Unfortunately, that work caused the sort of regression that the kernel community finds most unacceptable: it slowed down (or even hung) kernel builds. After an extensive discussion, an unfortunate interaction with the GNU make job server was identified, and a fix by Linus Torvalds was applied that made the problem go away. The 5.5 kernel was released shortly afterward, kernel builds sped back up, and the problem was deemed to have been solved.

Not done yet

At the end of July, Sandeep Patil informed the community that, while the GNU make problem may have been fixed, the fix created a new problem of its own. He included a patch to revert Torvalds's fix. That revert clearly was never going to be applied on its own — kernel developers still lack an appetite for slower builds — but it did spark an investigation of the real problem.

The 2019 pipe rework and subsequent fix had been much more painful than Torvalds thought they should be, so he made a number of changes to both the organization and the behavior of the code. Specifically, an important change was made to how pipes work with system calls like epoll_wait(), poll(), and select(). If the desired type of I/O is not possible without blocking, these calls will put the calling process onto a wait queue. When the situation changes (data can now be read or written), the processes on the appropriate wait queue are woken so that they can request their I/O.

The 2019 fix changed the way that wakeup was performed. Before, a write to a pipe would unconditionally wake any waiting readers; indeed, it could wake them multiple times in a single system call. The fix changed this behavior to only perform the wakeup if the pipe buffer was empty at the start of the operation; a write to a pipe that already contained data waiting to be read would simply add the new data without causing the wakeup. The reasoning behind this change was straightforward enough: the polling system calls will return immediately if data is already available for reading, so there should not be any waiters if the pipe has available data.

On the edge

There is, however, a mode to epoll_wait() called "edge triggered" (or EPOLLET) that behaves a little differently. A process requesting an edge-triggered wait will not see an immediate return from epoll_wait() if there is available data; instead, it will wait until the situation changes. At least, that is how it worked before the 2019 patches. Once the pipe driver stopped performing wakeups any time that data arrived, processes initiating edge-triggered waits when there was already data available would not see the "edge" and would not wake.

There was, evidently, reason to wonder if this kind of problem would ever arise. The previous version of pipe_write() included this helpful comment:

    /* Always wake up, even if the copy fails. Otherwise
     * we lock up (O_NONBLOCK-)readers that sleep due to
     * syscall merging.
     * FIXME! Is this really true?
     */

It turns out that it was really true. There are a number of Android libraries, such as Realm, that depend on getting edge-triggered wakeups even if there had been data waiting in the pipe before the epoll_wait() call was made. The purpose, evidently, is to wait until the pipe buffer is full, then read all of the data in a single operation. Those libraries broke when the 5.10 kernel went into the Android 12 beta, bringing down a set of apps with them. Realm has since worked around the problem but, as Patil pointed out, the joy of bundling means that "it will be a while before all applications incorporate the updated library". Fixing the kernel would repair the problem for all apps.

There seems to be widespread agreement that these libraries manifest a misunderstanding of what "edge-triggered" means and are using the edge-triggered mode incorrectly. As Torvalds explained:

This is literally an epoll() confusion about what an "edge" is.

An edge is not "somebody wrote more data". An edge is "there was no data, now there is data".

And a level triggered event is *also* not "somebody wrote more data". A level-triggered signal is simply "there is data".

Notice how neither edge nor level are about "more data". One is about the edge of "no data" -> "some data", and the other is just a "data is available".

That, however, is not how edge-triggered operation was implemented for pipes. Unsurprisingly, in a demonstration of Hyrum's law in action, applications began to rely on the actual behavior of the system rather than the intended semantics. The epoll() man page agrees with Torvalds's description, describing just the sort of blocking behavior experienced by the broken apps. In the distant past, kernel developers might have just answered that these libraries are doing it wrong. But that's not how the kernel works now; thus, Torvalds continued:

But we have the policy that regressions aren't about documentation or even sane behavior.

Regressions are about whether a user application broke in a noticeable way.

That interpretation of "regression" requires that the problem be fixed. And indeed that was done with a new fix that was merged for 5.14-rc4 at the end of July and was included in the 5.10.56 and 5.13.8 stable updates. This patch does not quite restore the old behavior; specifically, it will only perform one wakeup per write operation. It does appear to have fixed the problem, though.

Problem solved?

The 5.5 kernel was released in January 2020 — a time when few of us understood what was about to descend upon us; against that backdrop, a significant kernel regression was just one more item for the list. The regression thus endured for a year and a half, and found its way into the 5.10 long-term-stable release last December. The fact that it only surfaced now suggests a testing gap among certain users; happily, it was caught before the next Android release was finalized.

The ominous question that remains, though, is whether, in that year and a half, any applications became dependent on the newer semantics. And, indeed, there is already a report (from Intel's automated test system) that the hackbench benchmark regressed by nearly 13% after the latest fix was applied. Torvalds responded that he is "not sure how much hackbench really matters" and that the regression "probably doesn't matter one whit". Even so, he posted a new patch that provides something closer to the older behavior — but only if the pipe is being used with one of the polling functions. Should it turn out that the hackbench regression does matter, there will be a fix in hand.

If the latest fix breaks something else, though, kernel developers may have a difficult choice to make. It is possible that there is no way forward without leaving a broken application somewhere; this is why it is important to catch regressions early. With luck, that sort of breakage won't happen and this particular episode will finally be done.

Index entries for this article
Kernel: Android
Kernel: Epoll



The edge-triggered misunderstanding

Posted Aug 5, 2021 17:38 UTC (Thu) by scientes (guest, #83068) [Link] (15 responses)

The Android release also exists in this creepy world where Google has their own Top Level Domain.

The edge-triggered misunderstanding

Posted Aug 5, 2021 21:01 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

On the other hand, this world also has TLDs for: .bar, .beer and .coffee! It can't be THAT bad.

The edge-triggered misunderstanding

Posted Aug 5, 2021 21:06 UTC (Thu) by flussence (guest, #85566) [Link] (13 responses)

They own a whole bunch of TLDs: several named after products they're in the process of killing, common English words and suffixes, .ads (predictably), .dev (which they effectively stole from the commons by forcing HSTS and breaking a lot of internal network setups), and, for some reason, .zip.

The edge-triggered misunderstanding

Posted Aug 5, 2021 21:14 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (12 responses)

> .dev (which they effectively stole from the commons by forcing HSTS and breaking a lot of internal network setups)

AFAIK, `.dev` was never a protected TLD. Tor had `.onion` officially protected through IETF for just this reason. I don't know why `.dev` wasn't protected as well if it was so widely used.

.dev

Posted Aug 5, 2021 22:32 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

The Tor project had a rationale for how names in .onion actually work, resulting in them having global meaning. The various incompetent .dev hijackers don't have any rationale, their idea is they should just get to make up whatever they want and they're astonished that this doesn't work. Accordingly you could hardly expect them to get around to actually going through the process to reserve these names (which took a bunch of time to produce RFC 7686), and if they had attempted to do so you can expect they'd immediately have started bickering about how everybody else's usage is wrong and only theirs is OK.

The more popular hijacked TLDs were .mail, .corp, and .home - ICANN concluded that these names were so poisoned it couldn't delegate them. There was in fact an attempt to write an ID for those TLDs but as expected it descended into squabbling and went nowhere.

Hijacking is a perfectly reasonable way to carve out space if you have the numbers on your side. The IETF's OID arc 1.3.6.1 was hijacked. 1.3.6 means the US Department of Defense, but of course the DoD didn't delegate the arc to the IETF; instead the IETF simply hijacked it, reasoning that the fledgling Internet needed an OID arc, that the DoD would obviously have donated 1.3.6.1 had it understood, and so why waste time figuring out who, if anybody, had the authority to authorize it. Today you use OIDs under 1.3.6.1 all over the place and think nothing of it; the hijacking was entirely successful, and it would be unthinkable to "give back" 1.3.6.1.

But the .dev users didn't have the numbers on their side and so .dev is now a somewhat successful TLD with HSTS enabled. Unsuccessful hijack attempts go badly for the hijackers.

The edge-triggered misunderstanding

Posted Aug 5, 2021 23:03 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (10 responses)

In general, IETF doesn't really like this sort of "make up your own fake TLD" nonsense, at least with respect to "real" DNS (although "fake" DNS such as mDNS does get to reserve names like .local). They did reserve .home.arpa for use in "home networks" - but there's no equivalent for other networks. I read that less as an endorsement than as a tacit acknowledgement that home networks are, for most consumers, an unfixable clusterfuck, and so we'd rather just partition it off to a subdomain of .arpa (which is already a weird no-man's land) rather than trying to get non-technical end users excited about configuring their networks properly. Also, telling every end user in the world to go buy their own domain name is just never going to work.

For "real" (business) networks, I think it's wiser to use a subdomain of example.com (which you own), and configure DNS to search from there (which you can do via DHCP or as part of your client provisioning setup). The net effect of this is that users can type http://foo/ in their web browsers, and it automatically expands to (say) foo.dev.example.com, with no potential for confusion with foo.dev. That will allow you to acquire a certificate for *.dev.example.com and run "real" TLS to any client device, without having to install custom root certificates on the client, run your own internal CA, etc. It also means that, when a user copies and pastes a URL, they always get the FQDN, so that it will work even on devices which have not been set up for this DNS search chicanery (assuming, of course, that your security policies allow such devices to connect in the first place).

Disclaimer: I work for Google.

The edge-triggered misunderstanding

Posted Aug 5, 2021 23:17 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (5 responses)

I use `.internal` as the TLD in my home network. IIRC, that is one of the reserved names as well that doesn't route over the global DNS network. I had been using `dev.<mydomain>`, but it just got too messy trying to sync up public DNS routing and internal routing with the same names that I just gave up and used something private internally.

The edge-triggered misunderstanding

Posted Aug 6, 2021 12:26 UTC (Fri) by Jonno (subscriber, #49613) [Link] (1 responses)

> I use `.internal` as the TLD in my home network. IIRC, that is one of the reserved names as well that doesn't route over the global DNS network.

It isn't. The only reserved TLDs are: .example, .invalid, .local, .localhost, .onion, and .test. There are also some reserved subdomains of other TLDs, see [special-use-domain-names] for a complete list.

There are two current proposals for adding new reserved TLDs: .alt [draft-ietf-dnsop-alt-tld] and the country codes that ISO-3166-1 specifies will never be assigned to a territory (.aa .qm .qn .qo .qp .qq .qr .qs .qt .qu .qv .qw .qx .qy .qz .xa .xb .xc .xd .xe .xf .xg .xh .xi .xj .xk .xl .xm .xn .xo .xp .xr .xs .xt .xu .xv .xw .xx .xy .xz .zz) [draft-ietf-dnsop-private-use-tld].

There was a 2017 proposal for .internal [draft-wkumari-dnsop-internal], but it was not adopted.

[special-use-domain-names] https://www.iana.org/assignments/special-use-domain-names...
[draft-ietf-dnsop-alt-tld] https://datatracker.ietf.org/doc/draft-ietf-dnsop-alt-tld/
[draft-ietf-dnsop-private-use-tld] https://datatracker.ietf.org/doc/draft-ietf-dnsop-private...
[draft-wkumari-dnsop-internal] https://datatracker.ietf.org/doc/draft-wkumari-dnsop-inte...

The edge-triggered misunderstanding

Posted Aug 6, 2021 13:29 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

Ah I missed that the `.internal` RFC was never adopted. I guess I'll put migration to…something else on my list. Thanks.

The edge-triggered misunderstanding

Posted Aug 6, 2021 16:14 UTC (Fri) by tialaramex (subscriber, #21167) [Link] (2 responses)

> IIRC, that is one of the reserved names as well that doesn't route over the global DNS network.

Somebody else already explained that in fact .internal is not reserved although there was effort in that direction. But I want to particularly respond to "doesn't route over the global DNS network" because this is a grave misunderstanding of what's going on.

Your bogus names _will_ leak. If you've just made up bogus names for some service that you would cheerfully have given a name in a public DNS zone delegated to you, then it's no worse than before. But a LOT of people have the idea that their bogus names are "not routed" and are therefore magically safer; that is the exact same mistake that happens with RFC 1918 addressing, and it has the same consequences.

These bogus names aren't special, nothing magically says "Oh this name is a double-secret internal name, protect it", it just goes in the big pile of bogus names like all the others. Unsurprisingly bad guys are much more interested in what's in that pile, although infrastructure researchers pick through it too.

> it just got too messy trying to sync up public DNS routing and internal routing with the same names

But your eventual solution doesn't sync up to public DNS either. So, just don't do that. It makes plenty of sense that I have some.internal.tlrmx.org without there being any public DNS records for that name. But unlike hijacking a TLD I *can* add public DNS records later if I want to, because it's in a namespace I actually control.

The edge-triggered misunderstanding

Posted Aug 6, 2021 17:18 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

I should say that I'm not worried about leakage so much as squatting, so I wasn't clear there. If someone knows I have a local DNS name of `git` or can guess my naming scheme for hosts…so be it.

We just used to use a `foobar.com` for internal routing (related to our actual domain name, but not publicly registered) at $DAYJOB. "Hilarity" ensued when someone *did* register that name and we started getting hosts external to our network responding to our SSH queries. I presume they were watching dead DNS queries pile up and snatched it in the hopes to get something juicy.

The edge-triggered misunderstanding

Posted Aug 7, 2021 4:55 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

As mentioned upthread, it is completely safe to use fake subdomains of home.arpa, because it was explicitly reserved for this purpose (RFC 8375). Of course, they can still leak, but if you're running a local recursive resolver, you can at least configure it to avoid routing home.arpa queries outside of your homenet (because they will never be valid in the global DNS). So in principle, you can prevent fake DNS queries from reaching the external world, if you really want to. It's just a lot of work for questionable benefit, IMHO.

Technically, RFC 8375 positions itself as a "correction" to RFC 7788, but it also acknowledges that you don't have to be doing HNCP to use home.arpa. Which is a Good Thing, because I still can't figure out if HNCP is even a real protocol, or just a stack of "wouldn't it be nice if consumer products did [X]" ideas wearing a trenchcoat.

The edge-triggered misunderstanding

Posted Aug 6, 2021 12:12 UTC (Fri) by Wol (subscriber, #4433) [Link]

> Also, telling every end user in the world to go buy their own domain name is just never going to work.

Which is why, back in the day, demon was such a good IAP (Internet *access* provider). You opened an account, and you got <account>.demon.co.uk. Whether you used it as a host, or a subdomain, or whatever, was up to you. You also got a static IPv4, which you again could use for your host, or an RFC1918 network, or whatever, was up to you ...

The problem, of course, was this implied the user had some technical nous, which back then they typically did. Then, of course, the bean counters took demon over, and everything just imploded ...

(Obviously, the bean counters were massively helped by clueless telcos and regulators ...)

Cheers,
Wol

The edge-triggered misunderstanding

Posted Aug 8, 2021 4:13 UTC (Sun) by motiejus (subscriber, #92837) [Link] (2 responses)

How can I get a real TLS cert for a subdomain.example.com? I did a bit of searching, to no avail.

TLS for private subdomains

Posted Aug 9, 2021 9:40 UTC (Mon) by james (subscriber, #1325) [Link] (1 responses)

Assuming you own the parent domain, you use DNS verification. You would have to (temporarily) create a public TXT record in the parent domain's DNS, as specified by the certificate authority: for example, you could put a record for _acme-challenge.test.internal.example.com in the public example.com DNS.
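
In zone-file form, the temporary record would look something like this (a sketch; the names are placeholders, and the token value is supplied by the CA during the ACME dns-01 challenge):

```
; added to the public example.com zone only for the duration of the check
_acme-challenge.test.internal.example.com. 300 IN TXT "<token-from-the-CA>"
```

Once the CA has validated it, the record can be deleted again.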

This would (tend to) confirm that the subdomain exists, but if you want a "real" TLS certificate, the subdomain will be included in public Certificate Transparency logs.

You would not have to publish any A or AAAA records for the subdomain, nor would you have to make any computers on the subdomain available to the outside world.

TLS for private subdomains

Posted Aug 9, 2021 12:54 UTC (Mon) by motiejus (subscriber, #92837) [Link]

I thought you literally meant internal.example.com :)

The edge-triggered misunderstanding

Posted Aug 5, 2021 21:01 UTC (Thu) by mkerrisk (subscriber, #1978) [Link] (26 responses)

> It is possible that there is no way forward without leaving a broken application somewhere; this is why it is important to catch regressions early.

sigh... This regression was reported quite a while ago... https://lore.kernel.org/lkml/300cb158-5ab1-ed55-404f-8abc9cbdcae0@gmail.com/:

> Yes, user space code does surprising things. But, give people enough time and every detail of API behavior will come to be depended upon by someone. We don't know if anyone depends on the old pipe EPOLLET behavior. I also imagine the chances are small, but if users do depend on it, they are in for an unpleasant surprise (missed notifications).

The edge-triggered misunderstanding

Posted Aug 5, 2021 23:31 UTC (Thu) by Karellen (subscriber, #67644) [Link]

Is that a report of a regression, though? That is, does a change in behaviour count as a regression if no users actually rely on the legacy, unintended (buggy) semantics?

Yes, ultimately a user did rely on those semantics, so it turned out to be a regression, but that was not known then and at the time it was merely a hypothetical possibility where "the chances are small" of it being an issue.

(And, to be fair, of the millions of places that Linux is used, and from the millions of software projects that run on it, in 18 months only 1 library has been found to rely on the buggy behaviour. The chances were indeed small.)

The edge-triggered misunderstanding

Posted Aug 5, 2021 23:34 UTC (Thu) by mkerrisk (subscriber, #1978) [Link] (24 responses)

> The epoll() man page agrees with Torvalds's description, describing just the sort of blocking behavior experienced by the broken apps.

Well, that is questionable. Elsewhere, the page says:

> Since even with edge-triggered epoll, multiple events can be generated upon receipt of multiple chunks of data, the caller has the option to specify the EPOLLONESHOT flag, to tell epoll to disable the associated file descriptor after the receipt of an event with epoll_wait(2).

As I argued in Oct 2020, the existence of EPOLLONESHOT is, in my estimation, an argument against Linus's interpretation. And:

> That, however, is not how edge-triggered operation was implemented for pipes.

In fact, every other mechanism that I tested (Internet domain stream sockets, POSIX message queues, terminals, hierarchical epoll) behaves the same as EPOLLET historically behaved for pipes.

The edge-triggered misunderstanding

Posted Aug 6, 2021 0:10 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (23 responses)

Well, the man page (epoll(7)) also says this:

> An application that employs the EPOLLET flag should use
> nonblocking file descriptors to avoid having a blocking read or
> write starve a task that is handling multiple file descriptors.
> The suggested way to use epoll as an edge-triggered (EPOLLET)
> interface is as follows:
>
> a) with nonblocking file descriptors; and
>
> b) by waiting for an event only after read(2) or write(2) return
> EAGAIN.

My read is that the admonition about EPOLLONESHOT is simply a warning that you are guaranteed *at least* one notification per edge event, rather than exactly one notification per edge event, so if you don't want to get duplicate notifications, you have to explicitly say you don't want them. That doesn't mean that you should rely on getting duplicate notifications, just that the kernel won't promise to prevent them altogether.

But OTOH the pattern described above is clearly wrong, because there is a race between read(2)/write(2) returning EAGAIN and you calling epoll_wait(2), during which more data could become available. This is safe as long as you don't use EPOLLONESHOT, because an event will still be generated even if you are not waiting for it (and then epoll_wait will return immediately). But if you use EPOLLONESHOT, then you have to re-arm the epoll before calling read or write to avoid the race, which renders the whole thing rather pointless because now you have another race between epoll_ctl and read/write, where you can still receive duplicate notifications anyway.

The last time I had to look at this man page (many years ago), I felt that the entire EPOLLET mode of operation was a disaster waiting to happen, and frankly I'm still not sure what problem it solves that can't be more easily solved using EPOLLLT. If there's more data to consume, don't you want to, I don't know, actually consume it?

Disclaimer: I work for Google, but not on Android.

The edge-triggered misunderstanding

Posted Aug 6, 2021 0:44 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (16 responses)

> The last time I had to look at this man page (many years ago), I felt that the entire EPOLLET mode of operation was a disaster waiting to happen, and frankly I'm still not sure what problem it solves that can't be more easily solved using EPOLLLT. If there's more data to consume, don't you want to, I don't know, actually consume it?

After doing some further research, I think the intended use case was "multiple threads call epoll_wait and the kernel dispatches work to them automatically." Which is nice and all, but IMHO this is more safely handled with explicit userspace thread pooling and work queueing. That way, you can guarantee that exactly one "logical work object" is handled by exactly one thread at a time, and you can be completely confident that you won't end up with duplicate or missed notifications (you use EPOLLLT, the controller thread calls epoll_ctl(..., EPOLL_CTL_DEL, ...) on the notified file descriptor when an event happens, and the worker thread does the reverse once it's finished with the resulting work object).

Sure, you burn an extra thread, but threads are cheap, and this design is extensible to more complex work objects (e.g. involving more than one file descriptor).

The edge-triggered misunderstanding

Posted Aug 6, 2021 1:24 UTC (Fri) by wahern (subscriber, #37304) [Link] (15 responses)

> you use EPOLLLT, the controller thread calls epoll_ctl(..., EPOLL_CTL_DEL, ...) on the notified file descriptor when an event happens, and the worker thread does the reverse once it's finished with the resulting work object).

This entirely defeats the purpose of epoll, which is O(1) polling. A design that requires deleting/adding or rearming a descriptor on every event devolves to similar performance as poll(2). This is why Zed Shaw's Mongrel2 ended up switching to poll (as a performance "optimization") for non-long-poll connections, though I don't think he realized the issue. The issue was the design and composition of the internal Mongrel2 APIs, which necessitated the delete/add for every event.

The edge-triggered misunderstanding

Posted Aug 6, 2021 3:44 UTC (Fri) by comex (subscriber, #71521) [Link] (14 responses)

Looking at the implementation of epoll, it keeps the list of descriptors in a red-black tree, so I'd expect adding and deleting them to be O(log n), a far cry from poll()'s O(n). And I'd be surprised if the tree walk itself was a major component of the cost. Locking overhead and system call overhead, on the other hand, make this pattern probably best to avoid.

The edge-triggered misunderstanding

Posted Aug 6, 2021 5:57 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

Though, if you are using Linux specific system calls, polling with either io_submit or io_uring is preferable anyway...

The edge-triggered misunderstanding

Posted Aug 6, 2021 15:37 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (12 responses)

> Locking overhead and system call overhead, on the other hand, make this pattern probably best to avoid.

But that applies just as well to EPOLLONESHOT! So what's the point of it?

The way I see it, either you can tolerate duplicate events, or you can't. If you can't, then you *must* do some sort of explicit syscall to tell the kernel when you're done with a given event and that it is now safe to dispatch another event. But as I explained above, if you want to use EPOLLONESHOT and epoll_ctl(..., EPOLL_CTL_MOD, ...) for that purpose, then you have to call epoll_ctl either before or after the read/write, and both ways result in a race condition. So you just can't use EPOLLONESHOT for that purpose, and must instead use single-threaded EPOLLLT with explicit add/remove. There's no other safe, duplicate-free way to use this API.

OTOH, if you can tolerate duplicate notifications, then I don't see the point of EPOLLONESHOT at all. Just use EPOLLLT (single-threaded) or EPOLLET (multi-threaded, but you somehow don't care about the threads walking all over each other when the same fd gets signalled multiple times).

The edge-triggered misunderstanding

Posted Aug 6, 2021 16:55 UTC (Fri) by HenrikH (subscriber, #31152) [Link] (8 responses)

Well, that depends on how EPOLLONESHOT is implemented in epoll when you rearm it: if epoll checks the state of the fd (i.e. whether there is buffered data for EPOLLIN, or room to write for EPOLLOUT), then it can avoid a race condition here. Also, epoll could guess that if EPOLLONESHOT is used then the fd will be rearmed again, so it could retain the fd in its internal list but mark it as temporarily disabled.

I have not checked any of the kernel sources though so I don't know if any of this is actually implemented, but it could be.

The edge-triggered misunderstanding

Posted Aug 6, 2021 17:30 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (6 responses)

> Well that depends on how EPOLLONESHOT is implemented in epoll when you rearm it, if epoll checks the state of the fd (aka if there is a buffer on EPOLLIN or if it can write on EPOLLOUT) then it can avoid a race condition here.

That's how EPOLLLT behaves. If you're using EPOLLET, then you have explicitly asked for the opposite behavior (i.e. to only notify when the fd *changes states*). There is nothing in epoll(7) (or any of the other pages I looked at) which suggests that a "disabled" EPOLLONESHOT fd can still receive events, which are saved until the fd is re-armed. The implication is that the events are simply dropped.

The edge-triggered misunderstanding

Posted Aug 6, 2021 21:17 UTC (Fri) by HenrikH (subscriber, #31152) [Link] (5 responses)

EPOLLONESHOT was introduced so that you could load-balance reads over multiple threads; if the race you describe existed, it would make the whole idea of the flag moot (and a lot of software out there completely broken). It was introduced after Bryan Cantrill's epoll rant in 2015, https://www.youtube.com/watch?v=l6XQUciI-Sc&t=3420s , where he critiques that epoll cannot be used to load-balance reads across multiple threads (among other things).

The edge-triggered misunderstanding

Posted Aug 6, 2021 21:34 UTC (Fri) by mkerrisk (subscriber, #1978) [Link] (1 responses)

> It was introduced after Bryan Cantrill's epoll-rant in 2015

EPOLLONESHOT appeared in Linux 2.6.2, in 2004. I think you perhaps are thinking of EPOLLEXCLUSIVE.

The edge-triggered misunderstanding

Posted Aug 7, 2021 2:07 UTC (Sat) by HenrikH (subscriber, #31152) [Link]

Yes, it seems like Wikipedia mixed them up. He does speak about epoll in that video as if EPOLLONESHOT didn't exist, though, which is why I was fooled. Come to think of it, I should have reacted to the date of 2015, since I've used the flag for many years before that :)

The edge-triggered misunderstanding

Posted Aug 6, 2021 22:21 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (2 responses)

Maybe the documentation is incomplete or misleading, but as it stands, the man page says you get a race condition. If this race condition does not in fact exist, then the man page should explicitly call that out, rather than tacitly assuming that we all watch random YouTube videos of BSD developers criticizing Linux.

The edge-triggered misunderstanding

Posted Aug 7, 2021 2:18 UTC (Sat) by HenrikH (subscriber, #31152) [Link]

Ok, so I'm trying to find the actual discussion on LKML that led to the flag, but at least the comment that Davide Libenzi has on the code is this:

+ * EPOLLONESHOT bit that disables the descriptor when an event is received,
+ * until the next EPOLL_CTL_MOD will be issued.

Note the "when an even it received _until_" so the event is put on hold until the fd is re-armed by a EPOLL_CTL_MOD.

The edge-triggered misunderstanding

Posted Aug 7, 2021 3:23 UTC (Sat) by HenrikH (subscriber, #31152) [Link]

Well, the man page should perhaps be made a bit clearer, because the behaviour is specifically the one I've described. I just made a test of this using pipes on a 5.11 kernel; strace output: (* marks lines from a different pid)

* write(4, "6\334\177", 3) = 3
epoll_create1(EPOLL_CLOEXEC) = 5
epoll_ctl(5, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLONESHOT|EPOLLET, {u32=3, u64=2581266273925070851}}) = 0
epoll_wait(5, [{EPOLLIN, {u32=3, u64=2581266273925070851}}], 1, -1) = 1
read(3, "6\334\177", 3) = 3
* write(4, "6\334\177", 3) = 3
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=10, tv_nsec=0}, 0x7fff71722d70) = 0
epoll_ctl(5, EPOLL_CTL_MOD, 3, {EPOLLIN|EPOLLONESHOT|EPOLLET, {u32=3, u64=2581266273925070851}}) = 0
epoll_wait(5, [{EPOLLIN, {u32=3, u64=2581266273925070851}}], 1, -1) = 1
read(3, "6\334\177", 3) = 3

The edge-triggered misunderstanding

Posted Aug 6, 2021 17:37 UTC (Fri) by comex (subscriber, #71521) [Link]

At the time you call epoll_ctl to add or re-enable a fd, the kernel will always check whether the fd is already ready and queue an event for it if so. [1] To this extent, it acts 'level-triggered' regardless of whether you request level-triggered or edge-triggered mode. So it should be safe to use epoll with EPOLLET plus EPOLLONESHOT, and always start by adding/enabling the FD, not calling read/write. Then once you receive an event, you call read/write, then re-arm with EPOLL_CTL_MOD. If another event comes between the read/write and the re-arm, you will still get notified.

However, this does directly contradict the man page's "suggested way" of using EPOLLET [2] which is to only use epoll after read/write return EAGAIN. And the behavior of checking for readiness at add/enable time doesn't seem to be documented, so it's hard to say if this is meant to be guaranteed.

[1] https://stackoverflow.com/questions/12920243/if-a-file-is...

[2] https://man7.org/linux/man-pages/man7/epoll.7.html

The edge-triggered misunderstanding

Posted Aug 6, 2021 17:12 UTC (Fri) by HenrikH (subscriber, #31152) [Link] (2 responses)

Checking the man page shows that re-arming the fd on EPOLLONESHOT is done via EPOLL_CTL_MOD and not EPOLL_CTL_ADD so the fd remains in the epoll internal list of descriptors.

The edge-triggered misunderstanding

Posted Aug 6, 2021 17:19 UTC (Fri) by comex (subscriber, #71521) [Link] (1 responses)

That still takes the lock and walks the tree, but at least it doesn't have to allocate a new `struct epitem` like EPOLL_CTL_ADD does.

The edge-triggered misunderstanding

Posted Aug 6, 2021 18:20 UTC (Fri) by HenrikH (subscriber, #31152) [Link]

And more importantly, I think it also means that there is no race condition between read returning EAGAIN and you re-arming the fd, since the fd remains in the list and epoll can mark it there.

The edge-triggered misunderstanding

Posted Aug 6, 2021 7:56 UTC (Fri) by Sesse (subscriber, #53779) [Link] (4 responses)

FWIW, EPOLLONESHOT is not really relevant for most applications, and doesn't have a lot to do with edge-triggering. I believe the original use-case was a socket where you only wanted to connect() and see whether there was anything at all on the given port (some P2P server needed it).

As background information: The way edge-triggering works is that you keep your socket in the set all the time (so there's no race against epoll_add). The simplest example is a socket where you have a lot of data you want to write, but it could get filled up; you write until the socket buffer is full and get EAGAIN (now the level is low), and then you know that you will get a wakeup through epoll when there's room the next time (the level goes from low to high). With level-triggering, you would have to take the socket in and out of the epoll set all the time, so edge-triggering is more efficient. (But you have to be really careful never to forget about the socket in the high-level state, since you won't ever get any wakeups for it then!) A similar pattern exists for reads, where you could need to pause reading due to your own buffers getting full.

The edge-triggered misunderstanding

Posted Aug 6, 2021 14:05 UTC (Fri) by abatters (✭ supporter ✭, #6932) [Link] (2 responses)

> FWIW, EPOLLONESHOT is not really relevant for most applications, and doesn't have a lot to do with edge-triggering. I believe the original use-case was a socket where you only wanted to connect() and see whether there was anything at all on the given port (some P2P server needed it).

I use EPOLLOUT | EPOLLONESHOT on a dummy fd that is always ready as a generic way for another thread to wake up epoll_wait without using signals. Another way would be to write data to a pipe and then read it back out after it woke up the epoll_wait, but that would require more syscalls.

The edge-triggered misunderstanding

Posted Aug 6, 2021 14:37 UTC (Fri) by Sesse (subscriber, #53779) [Link] (1 responses)

Isn't that what eventfd is for? It sounds more idiomatic than modifying the epoll set from another thread (which is what you're doing, right?).

The edge-triggered misunderstanding

Posted Aug 6, 2021 15:30 UTC (Fri) by abatters (✭ supporter ✭, #6932) [Link]

Thanks for pointing that out; I hadn't seen eventfd before.

The edge-triggered misunderstanding

Posted Aug 19, 2021 7:31 UTC (Thu) by njs (subscriber, #40338) [Link]

> FWIW, EPOLLONESHOT is not really relevant for most applications [...] I believe the original use-case was a socket where you only wanted to connect() and see whether there was anything at all on the given port (some P2P server needed it).

It's also useful for working around epoll's weird ambiguities between file descriptors/file descriptions -- if you get into the situation where your fd table and epoll's fd table don't match up, then EPOLLONESHOT is the only way to escape infinite wakeup loops:

https://github.com/python-trio/trio/blob/cb48b33a42b09dde...

Of course correct code should never hit this case in the first place, but for generic library code that wants to degrade gracefully it's very useful.

It also reduces syscalls if you're using epoll to implement a higher-level API that's built around issuing read/write operations (like you see with Go/io_uring/IOCP), rather than "register an fd for repeated usage".

The edge-triggered misunderstanding

Posted Aug 7, 2021 6:31 UTC (Sat) by jonas.bonn (subscriber, #47561) [Link]

The true benefit of edge notifications is for _writeability_ notifications. It lets you forgo the enable/disable EPOLLOUT dance every time a message is enqueued, and thus saves a large number of syscalls.

The edge-triggered misunderstanding

Posted Aug 5, 2021 23:09 UTC (Thu) by Bigos (subscriber, #96807) [Link] (4 responses)

In my current understanding of the kernel workflow, if the Android project uses kernel code from 1.5 years ago and releases it only now, it should manage any regressions by itself.

I understand that such regressions only come to light when the code is used, but this is what "beta" distributions are for. With such a slow development workflow, we need a testing window in this waterfall scheme. If a piece of code (like a kernel) is only updated every year (or less often), you will see bugs and regressions each time it is updated.

What can upstream do about bugs that are filed years after they are introduced? They can fix them immediately, but that is still years after the fact. For a downstream that is lagging behind, it is thus their responsibility to deal with the situation.

And then what about the "no regression policy"? This is indeed a difficult question. When a regression is reported long after it was introduced, it is difficult to tell whether reverting to the old behavior is good or not. The policy is about "not breaking userspace", but there are many "userspaces" running in parallel, each using a different kernel/userspace combo. The policy makes sense when userspace is using a recent kernel, but falls apart when the kernel in use is so old that upstream has forgotten about it, yet it is "new" for a particular downstream.

--------------------------------
Wild delusions below...

For the kernel itself, the policy should thus change, in my opinion. If the downstream doesn't want to update the kernel in a "reasonable" timeframe (which would need to be defined), the no-regression policy should not apply (or rather, not apply universally; not breaking current downstreams would take priority over not breaking past downstreams). I am just a dreaming bystander, though, and I understand that such a thing would be difficult to impose and the benefits might be questionable...

The edge-triggered misunderstanding

Posted Aug 6, 2021 0:53 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (1 responses)

In CMake (which has largely the same regression policies, except that undocumented or explicitly internal bits get more of a "why are you fiddling around in there?" than in Linux), the policy is that if the regression is more than a release behind, there's not much we can do in the general case. Without a timely report, there's not much to be done, as the *new* behavior is also legitimately something that may be depended upon. But we also have fairly quick uptake in the development distros, which helps find issues, at least for builds hosted in such projects (since user testing isn't as necessary for CMake's "did we break something?" detection).

Basically, test things as much as possible, but if you fall behind, it's going to be a tough decision to figure out what the right behavior going forwards is.

The edge-triggered misunderstanding

Posted Aug 6, 2021 19:11 UTC (Fri) by khim (subscriber, #9252) [Link]

> except that undocumented or explicitly internal bits are more of a "why are you fiddling around there?" than Linux

Linux has an insane number of unstable interfaces. If changes to these break apps, no one would even think twice.

But syscalls are sacred. Regressions there are fixed religiously. And that's certainly a good thing.

The edge-triggered misunderstanding

Posted Aug 6, 2021 19:07 UTC (Fri) by khim (subscriber, #9252) [Link] (1 responses)

You are almost correct, but the devil is in the details. Let me show you with a change in the quote:

> In my current understanding of the kernel workflow, if the Android project uses kernel code from 1.5 years ago and releases it only now it should manage ~~any regressions~~ backporting by itself.

And yes, sure. They will.

> What can upstream do about bugs that are filed years after they are produced?

Fix them? Why is that so hard to understand for some? Regressions are regressions. They have to be fixed if there are real programs used by real users (as in: anything that existed before the introduction of the regression, and not something made specifically to show off that you can write super-fragile code which may be broken by any unrelated change in the kernel).

Remember that story with autofs? Please pay close attention to this line: "the automount program from the autofs-tools package, which is still in use on a great many systems, had run into this problem a number of years ago".

Yet it was fixed, because, you know, regressions are regressions.

> For a downstream that is lagging behind, it is thus their responsibility to deal with this whole situation.

Oh, sure. Of course. Once a regression is fixed in all supported kernels, the job of Linus and the kernel developers is done. Backporting these changes to older, already unsupported, kernels is not their job.

But it has to be fixed in Linus's tree, even if it was introduced 10 years ago.

That's the rule, and that is why Linux is the most popular OS kernel (by now maybe more popular than all the others combined on MMU-equipped devices).

And yes, sometimes it requires real doozies of solutions to satisfy both apps and libraries developed before the regression happened and after. I guess at some point it will just be impossible to satisfy them all. But AFAIK this has never happened so far (or, more likely, it happened but nobody noticed: that old "if nobody notices, it's not broken" rule).

> I am just a bystander dreamer, though, understanding that such a thing would be difficult to impose and the benefits might be questionable...

Depends on whom you ask, obviously. Apple, Google, and Microsoft would be delighted by such a decision. Maybe even Fuchsia would get a chance. But it's hard to see what adopting the CADT model would bring to Linux.

The edge-triggered misunderstanding

Posted Aug 19, 2021 14:45 UTC (Thu) by mrugiero (guest, #153040) [Link]

> Fix them? Why is that so hard to understand for some? Regressions are regressions. They have to be fixed if there are real programs used by real users (as in: anything that existed before introduction of regression and not made specifically to show-off that you can write superfragile code which may be broken by any unrelated change in the kernel).

Except that when two contradictory behaviors have been in the wild for a while, fixing the regression may be a regression itself. Only time will tell, but if/when it happens, which one will you pick? In that scenario someone's userspace will be broken, inevitably. The wording in what you quoted wasn't necessarily the best, but this is the real issue. In fact, the current regression is a bugfix, if we believe the docs; regardless of the "don't break userspace" rule, an implementation contradicting the documented promise is a bug.

The edge-triggered misunderstanding

Posted Aug 6, 2021 2:05 UTC (Fri) by itsmycpu (guest, #139639) [Link] (15 responses)

As a potential future user of epoll, it seems "one wakeup per write operation" is the best and cleanest definition for edge-triggered so far. (Assuming, of course, the wake up comes after all the data from the write is available for reading.) Depending on the use case, I would want to use either the newest version of edge-triggered, or level-triggered.

It isn't clear what the intended benefit of the kernel 5.5 version of edge-triggered would be, compared to level-triggered. How is "there was no data, now there is data" sufficiently different from "there is data"? After reading all the data?

The documentation in the man page says:
> edge-triggered mode delivers events only when changes occur on the monitored file descriptor
Without knowing the internals of a file descriptor, I would expect all writes to count as a change to the file, not just writes to an empty file. The latter would require additional reasoning and documentation.

The edge-triggered misunderstanding

Posted Aug 6, 2021 9:30 UTC (Fri) by intgr (subscriber, #39733) [Link] (5 responses)

> How is "there was no data, now there is data" sufficiently different from "there is data"?

This was explained upthread: https://lwn.net/Articles/865404/

> After doing some further research, I think the intended use case was "multiple threads call epoll_wait and the kernel dispatches work to them automatically."

With level-triggered behavior, all events would be dispatched to all threads. Edge-triggered instead behaves as if the event is "consumed" after dispatching.

The edge-triggered misunderstanding

Posted Aug 6, 2021 13:20 UTC (Fri) by itsmycpu (guest, #139639) [Link]

> This was explained upthread: https://lwn.net/Articles/865404/

Actually I already wrote about using it in a multi-thread scenario, in another discussion.
However, such a scenario isn't obvious at all, and it is not described in the man page documentation. It is not the normal use case.
And it works as intended *only* with the 5.5 version.

So, in my opinion, a bit of a stretch. Even if some might like such a scenario, having a mode that is useful only in such a specific way would require pointing out such use cases in order to make sense.

The edge-triggered misunderstanding

Posted Aug 6, 2021 14:18 UTC (Fri) by itsmycpu (guest, #139639) [Link] (3 responses)

> With level-triggered behavior, all events would be dispatched to all threads. Edge-triggered instead behaves as if the event is "consumed" after dispatching.

I missed that this is still part of the reply to my post, so I'm sorry for splitting my response and not responding to the intended meaning at first.

Are you suggesting that edge-triggered will not wake more than one thread? I can see why you might interpret the comment further above in that way, but it doesn't seem to be documented that way, and I haven't noticed any indication in that direction anywhere else. I am not convinced it is actually so, as I would expect it to be a separate option if it really were one. (Though perhaps a useful one.)

The edge-triggered misunderstanding

Posted Aug 6, 2021 14:40 UTC (Fri) by itsmycpu (guest, #139639) [Link]

> Are you suggesting that edge-triggered will not wake more than one thread? I can see why you might interpret the comment further above in that way,
> but it doesn't seem documented in that way, [...]

Sorry once more, I was too quick. This is actually described by the epoll man page as a special case, at the end:

> If multiple threads (or processes, if child processes have
> inherited the epoll file descriptor across fork(2)) are blocked
> in epoll_wait(2) waiting on the same epoll file descriptor and a
> file descriptor in the interest list that is marked for edge-
> triggered (EPOLLET) notification becomes ready, just one of the
> threads (or processes) is awoken from epoll_wait(2).

This behavior seems to apply only in that special case, like an additional option; it is not the general meaning of "edge-triggered". However, it is certainly an interesting feature. I would hope that it applies generally when multiple threads use the same epoll file descriptor from the same epoll_create, so that it is not limited to forks.

The edge-triggered misunderstanding

Posted Aug 6, 2021 14:42 UTC (Fri) by intgr (subscriber, #39733) [Link] (1 responses)

> Are you suggesting that edge-triggered will not wake more than one thread?
> it doesn't seem documented in that way

It's documented that way in the current version of the man page at least
https://manpages.debian.org/testing/manpages/epoll.7.en.html

> If multiple threads (or processes, if child processes have
> inherited the epoll file descriptor across fork(2)) are blocked
> in epoll_wait(2) waiting on the same epoll file descriptor and a
> file descriptor in the interest list that is marked for edge-
> triggered (EPOLLET) notification becomes ready, just one of the
> threads (or processes) is awoken from epoll_wait(2). This
> provides a useful optimization for avoiding "thundering herd"
> wake-ups in some scenarios.

The edge-triggered misunderstanding

Posted Aug 6, 2021 14:49 UTC (Fri) by itsmycpu (guest, #139639) [Link]

> It's documented that way in the current version of the man page at least

Yes, I just realized that and corrected myself in the post just (2 minutes) before yours. :)

The edge-triggered misunderstanding

Posted Aug 6, 2021 12:07 UTC (Fri) by wtarreau (subscriber, #51152) [Link] (8 responses)

> it seems "one wakeup per write operation" is the best and cleanest definition for edge-triggered so far.

I can assure you that the notion of "level-triggered" vs "edge-triggered" is universally used for binary states, as it comes from the world of electronics where signals are sampled (clocks, interrupts, etc). There's no such thing as "raising the signal to strengthen the interrupt". Either you report the interrupt as long as the signal is high (or low), or you report the interrupt only when it flips (hence the name "edge": when you look at the signal on an oscilloscope, you see an edge there).

Now the fact that you need it to work differently is another matter, and I do think that there's a lot of value, for plenty of userspace programs, in a mechanism that wakes you up each time the conditions change.

Regardless, the rules for EPOLLET are pretty clear: only EAGAIN will re-arm the notification, so if you do not try to read/write when a state change is reported, you may very well never become aware that a buffer was fully consumed.

The edge-triggered misunderstanding

Posted Aug 6, 2021 13:45 UTC (Fri) by itsmycpu (guest, #139639) [Link] (2 responses)

> Regardless, the rules for EPOLLET are pretty clear: only EAGAIN will re-arm the notification, ...

How would that be clear as a "rule"? In the man page linked in the article, which is also referenced upthread, it is just a "suggested way" to use it:

> The suggested way to use epoll as an edge-triggered (EPOLLET) interface is as follows:
>
> a) with nonblocking file descriptors; and
>
> b) by waiting for an event only after read(2) or write(2) return EAGAIN.

However, as best I can tell, if epoll is used in this way, 'level-triggered' would accomplish the same thing, and using 'edge-triggered' in this scenario seems redundant and pointless.

It does not say there that EAGAIN is necessary to "re-arm" the notification. Later on, it just says that reading until EAGAIN is necessary to make sure that all the data is read, but that is quite obvious anyway. And nothing else is implied by the example above.

The edge-triggered misunderstanding

Posted Aug 9, 2021 16:31 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (1 responses)

It's "suggested" in that it's an illustration of one way to do it. The crux of this is that you're not guaranteed to be notified again until you reach EAGAIN which rearms the event. And that's the normal way to deal with edge-triggered signals, it's not specific to epoll. Simply see it as a test-and-set on whether or not this requires to be reported. That's particularly efficient and does not guarantee that you will not get multiple notifications from other situations but it's the right way to make sure events are not reported more often than needed.

I do remember some old docs back in the 2.5.x days where it was clearly stated that you *had* to maintain your own event cache so that if you can't read till EAGAIN you need to know by yourself that you have to try again once possible. And that's how highly efficient I/Os are achieved anyway.

The edge-triggered misunderstanding

Posted Aug 9, 2021 20:12 UTC (Mon) by itsmycpu (guest, #139639) [Link]

> I do remember some old docs back in the 2.5.x days where it was clearly stated that you *had* to maintain your own event cache so that if you can't read till EAGAIN you need to know by yourself that you have to try again once possible.

With edge-triggered, that is still true, since there won't necessarily be another write (or not for a long time). So you'd need a timeout or something (in a single-threaded scenario) and maintain the information that you haven't encountered EAGAIN yet (unless you will try to read anyway, for example at certain intervals).

This makes sense if the fact that data arrived, and the data itself, are handled at different times; for example, if data is usually buffered for some time but some other action is taken immediately.

The edge-triggered misunderstanding

Posted Aug 6, 2021 17:05 UTC (Fri) by HenrikH (subscriber, #31152) [Link] (4 responses)

>Regardless, the rules for EPOLLET are pretty clear: only EAGAIN will re-arm the notification

Not really, and this is why EPOLLONESHOT was introduced. EPOLLET will re-arm when there is a new write (unless it was done on a pipe on v5-v15), so if you use the same epoll fd in multiple threads with EPOLLET, then once thread A has called read and there is a new write, epoll will notify thread B about EPOLLIN on the same fd. EPOLLONESHOT is the only way to solve this.

The edge-triggered misunderstanding

Posted Aug 9, 2021 16:43 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (3 responses)

Not exactly; EPOLLONESHOT is different. It's for when you're waiting for a single event and do not want to remain subscribed after that; it performs the EPOLL_CTL_DEL automatically for you. A typical example: you read a request from a client, you get all of it, and you are not interested in reading whatever arrives after that until you're done processing it, because maybe you've released the buffer and cannot allocate a new one. Normally you would have to unregister the event. With EPOLLONESHOT you don't need to; you'll get notified exactly once, and you will have to subscribe again if needed.

The problem with pollers is that many users tend to treat them as providing exact guarantees, but they should not. Instead, a poller must be used as a hint that something happened on an FD, with the reported flags providing more or less accurate indications of the details. For example:
- the poller returns EPOLLIN, but when you try to read you get EAGAIN. Why? Because you're on UDP, the checksum was calculated on the fly during recvmsg(), which figured out it was incorrect and discarded the datagram.
- the poller returns the same event multiple times regardless of EPOLLET: there might be different conditions whose precedence causes it to be reported.
- you've read all of the response from an FD without reaching EAGAIN, but you're certain, based on the sockets you're using, that you've reached the end. Usually you can consider that you'll be called again, but there's no such guarantee.
- EPOLLHUP and EPOLLRDHUP have been wrong several times in the past.
- epoll + splice() did funny things back when splice() was quite bogus (before 2.6.25). For example, splice() could report EAGAIN if the target was full and thus was not reading the input at all, yet this EAGAIN was irrelevant to the recv condition that was supposed to re-arm the poller.

The sole purpose of a poller is "only sleep if you're absolutely certain you can sleep; when in doubt, wake up and notify me". It's not the poller's job to be exact about all conditions; it's the application's. When in doubt, the poller must wake up. In some cases I'm pretty sure that even EPOLLET will wake up more often than needed, and that must not be a problem for the application.

The edge-triggered misunderstanding

Posted Aug 10, 2021 2:13 UTC (Tue) by HenrikH (subscriber, #31152) [Link] (2 responses)

If you use it that way, then you are leaving behind a larger and larger watch list for epoll. EPOLLONESHOT does not call EPOLL_CTL_DEL automatically for you; it simply marks the fd in the watch list as temporarily disabled after it wakes up.

It can be used for the scenario you described, but you have to make a manual call to EPOLL_CTL_DEL to really remove the fd from the internal watch list and avoid a potential memory leak.

The edge-triggered misunderstanding

Posted Aug 10, 2021 4:08 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (1 responses)

> EPOLLONESHOT does not call EPOLL_CTL_DEL automatically for you, it simply marks the fd on the watch list as temporarily disabled after it wakes up.

From the epoll(7) man page: "When the EPOLLONESHOT flag is specified, it is the caller's responsibility to rearm the file descriptor using epoll_ctl(2) with EPOLL_CTL_MOD".

You can use it for example to wait for a connect() to complete without having to disable polling on the FD when you don't intend to use the connection immediately (e.g. when preparing connection pools).

The edge-triggered misunderstanding

Posted Aug 10, 2021 16:50 UTC (Tue) by HenrikH (subscriber, #31152) [Link]

Yes, that goes exactly to what I was saying. Re-arming resets the disabled flag in the internal watch list. That you have to do this with EPOLL_CTL_MOD and not EPOLL_CTL_ADD should already tell you that EPOLLONESHOT does not call EPOLL_CTL_DEL automatically.

So if you use this flag as a way to avoid calling EPOLL_CTL_DEL, then you have a memory leak (as well as an inefficient watch list in epoll).

I'm not really sure where you are going with this, since none of this contradicts what I have written now several times.


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds