LWN: Comments on "The edge-triggered misunderstanding"

The edge-triggered misunderstanding

mrugiero — Thu, 19 Aug 2021 14:45:55 +0000

> Fix them? Why is that so hard to understand for some? Regressions are regressions. They have to be fixed if there are real programs used by real users (as in: anything that existed before introduction of regression and not made specifically to show-off that you can write superfragile code which may be broken by any unrelated change in the kernel).

Except that when two contradictory behaviors have been in the wild for a while, fixing the regression may be a regression itself. Only time will tell, but if/when it happens, which one will you pick? In that scenario someone will be broken in userspace, inevitably. The wording in what you quoted wasn't necessarily the best, but this is the real issue. In fact, the current regression is a bugfix if we believe the docs: regardless of the "don't break userspace" rule, an implementation contradicting the documented promise is a bug.

The edge-triggered misunderstanding

njs — Thu, 19 Aug 2021 07:31:00 +0000

> FWIW, EPOLLONESHOT is not really relevant for most applications [...] I believe the original use-case was a socket where you only wanted to connect() and see whether there was anything at all on the given port (some P2P server needed it).

It's also useful for working around epoll's weird ambiguities between file descriptors/file descriptions -- if you get into the situation where your fd table and epoll's fd table don't match up, then EPOLLONESHOT is the only way to escape infinite wakeup loops:

https://github.com/python-trio/trio/blob/cb48b33a42b09dde...

Of course correct code should never hit this case in the first place, but for generic library code that wants to degrade gracefully it's very useful.

It also reduces syscalls if you're using epoll to implement a higher-level API that's built around issuing read/write operations (like you see with Go/io_uring/IOCP), rather than "register an fd for repeated usage".

The edge-triggered misunderstanding

HenrikH — Tue, 10 Aug 2021 16:50:16 +0000

Yes, that goes exactly into what I was saying. Re-arming it resets the disabled flag in the internal watchlist. That you have to do this with EPOLL_CTL_MOD and not EPOLL_CTL_ADD should already tell you that EPOLLONESHOT does not call EPOLL_CTL_DEL automatically.

So if you use this flag as a way to not have to call EPOLL_CTL_DEL then you have a memory leak (as well as an inefficient watchlist in epoll).
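The point about EPOLL_CTL_MOD versus EPOLL_CTL_ADD can be checked from userspace. The following is a minimal sketch using Python's `select.epoll` wrapper on Linux; the function name `oneshot_lifecycle` and the use of a pipe are illustrative choices, not anything from the kernel or the man page. It records what happens at each step after a one-shot event has fired.

```python
import os
import select


def oneshot_lifecycle():
    """Record what each epoll_ctl step does after a one-shot event has fired."""
    flags = select.EPOLLIN | select.EPOLLONESHOT
    r, w = os.pipe()
    ep = select.epoll()
    steps = {}
    try:
        ep.register(r, flags)                 # EPOLL_CTL_ADD
        os.write(w, b"x")
        steps["first_wait"] = ep.poll(1)      # event delivered; fd is now disabled
        steps["while_disabled"] = ep.poll(0)  # byte is still unread, yet no event
        try:
            ep.register(r, flags)             # a second EPOLL_CTL_ADD
            steps["second_add"] = "ok"
        except FileExistsError:               # EEXIST: the fd never left the list
            steps["second_add"] = "EEXIST"
        ep.modify(r, flags)                   # EPOLL_CTL_MOD re-arms it
        steps["after_mod"] = ep.poll(1)
        ep.unregister(r)                      # EPOLL_CTL_DEL actually removes it
        try:
            ep.unregister(r)
            steps["second_del"] = "ok"
        except FileNotFoundError:             # ENOENT: only now is it gone
            steps["second_del"] = "ENOENT"
    finally:
        ep.close()
        os.close(r)
        os.close(w)
    return steps
```

If the one-shot event had removed the fd, the second add would succeed and the first delete would fail; the sketch is written so that either outcome is recorded rather than assumed.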

Not really sure where you are going with this, since none of this contradicts what I have now written several times.

The edge-triggered misunderstanding

wtarreau — Tue, 10 Aug 2021 04:08:48 +0000

> EPOLLONESHOT does not call EPOLL_CTL_DEL automatically for you, it simply marks the fd on the watch list as temporarily disabled after it wakes up.

From the epoll(7) man page: "When the EPOLLONESHOT flag is specified, it is the caller's responsibility to rearm the file descriptor using epoll_ctl(2) with EPOLL_CTL_MOD".

You can use it for example to wait for a connect() to complete without having to disable polling on the FD when you don't intend to use the connection immediately (e.g. when preparing connection pools).
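A rough sketch of that connect() use case, assuming a long-lived epoll instance shared by the pool; `start_connect` and `finish_connect` are hypothetical helper names. The socket is registered once with EPOLLOUT | EPOLLONESHOT, the completion is reported once, and the idle but still writable socket then produces no further events. Note that the fd stays on the epoll watch list until it is removed with EPOLL_CTL_DEL or closed.

```python
import errno
import os
import select
import socket


def start_connect(ep, addr):
    """Begin a nonblocking connect and ask for a single completion event."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setblocking(False)
    rc = s.connect_ex(addr)
    if rc not in (0, errno.EINPROGRESS):
        s.close()
        raise OSError(rc, os.strerror(rc))
    ep.register(s.fileno(), select.EPOLLOUT | select.EPOLLONESHOT)
    return s


def finish_connect(s):
    """Call after the EPOLLOUT event: SO_ERROR tells whether connect succeeded."""
    err = s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err:
        raise OSError(err, os.strerror(err))
```

After `finish_connect()` the socket can sit in the pool; when it is actually needed, the caller re-arms it with `ep.modify()` for whatever events it then cares about.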

The edge-triggered misunderstanding

HenrikH — Tue, 10 Aug 2021 02:13:31 +0000

If you use it that way then you are leaving behind a larger and larger watch list for epoll. EPOLLONESHOT does not call EPOLL_CTL_DEL automatically for you, it simply marks the fd on the watch list as temporarily disabled after it wakes up.

It can be used for the scenario that you described but you have to do a manual call to EPOLL_CTL_DEL to really remove the fd from the internal watchlist and remove a potential memory leak.

The edge-triggered misunderstanding

itsmycpu — Mon, 09 Aug 2021 20:12:32 +0000

> I do remember some old docs back in the 2.5.x days where it was clearly stated that you *had* to maintain your own event cache so that if you can't read till EAGAIN you need to know by yourself that you have to try again once possible.

With edge-triggered, that is still true, since there won't necessarily be another write (or not for a long time). So you'd need a timeout or something (in a single-threaded scenario) and maintain the info that you haven't encountered EAGAIN yet (unless you will try to read anyway, for example in certain intervals).
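One way to picture such an event cache is the sketch below. `ReadyCache` is a hypothetical class, not an existing API: it remembers every fd that was reported readable and only forgets it once a read has actually returned EAGAIN (or EOF), so the application never sleeps while it knows of undrained data.

```python
import os
import select


class ReadyCache:
    """Remember fds reported readable until a read on them hits EAGAIN."""

    def __init__(self):
        self.ep = select.epoll()
        self.pending = set()

    def add(self, fd):
        os.set_blocking(fd, False)
        self.ep.register(fd, select.EPOLLIN | select.EPOLLET)

    def wait(self, timeout):
        # Never sleep while we already know of fds that were not drained.
        if self.pending:
            timeout = 0
        for fd, _events in self.ep.poll(timeout):
            self.pending.add(fd)
        return set(self.pending)

    def read_some(self, fd, nbytes):
        try:
            data = os.read(fd, nbytes)
        except BlockingIOError:
            self.pending.discard(fd)  # EAGAIN: from here on, rely on the next edge
            return b""
        if not data:                  # EOF: nothing more will arrive
            self.pending.discard(fd)
        return data
```

With this in place a partial read is harmless: the fd stays in `pending`, so the next `wait()` returns it immediately instead of blocking for an edge that may never come.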

This makes sense if the fact that data did arrive, and the data itself, are handled at different times; for example if data is usually buffered for some time, but some other action is taken immediately.

The edge-triggered misunderstanding

wtarreau — Mon, 09 Aug 2021 16:43:32 +0000

Not exactly, EPOLLONESHOT is different, it's for when you're waiting for a single event and do not want to remain subscribed after that. It performs the EPOLL_CTL_DEL automatically for you. A typical example is that you read a request from a client, you get all of it and are not interested in reading whatever happens after until you're done processing it, because maybe you've released the buffer and cannot allocate a new one in this case. Normally you would have to unregister the event. With EPOLLONESHOT you don't need to: you'll get notified exactly once and you will have to subscribe again if needed.

The problem with pollers is that many users tend to consider them as providing exact guarantees, but they should not. Instead, a poller must be used as a hint that something happened on an FD, and the reported flags provide more or less accurate indications of the details. For example:
- the poller returns EPOLLIN but when you try to read you get EAGAIN. Why? Because you're on UDP, the checksum was calculated on the fly during recvmsg(), which figured it was incorrect and destroyed the result.
- the poller returns the same event multiple times regardless of EPOLLET: there might be different conditions whose precedence causes it to be reported.
- you've read all of the response from an FD, not reached EAGAIN, but you're certain based on the sockets you're using that you've reached the end. Usually you can consider that you'll be called again but there's no such guarantee.
- EPOLLHUP and EPOLLRDHUP used to be wrong several times in the past
- epoll + splice() have done funny things at a time where splice() was quite bogus (before 2.6.25). For example splice() could report EAGAIN if the target was full, thus was not reading the input at all, yet this EAGAIN was irrelevant to the recv condition that was supposed to re-arm the poller.

The sole purpose of a poller is "only sleep if you're absolutely certain you can sleep; in doubt, wake up and notify me". It's not the poller's job to be exact about all conditions, it's the application's. In doubt the poller must wake up. In some cases I'm pretty sure that even EPOLLET will wake up more often than needed and that must not be a problem for the application.
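In code, treating the poller as a hint mostly means that the read path must tolerate a wakeup with nothing behind it. A minimal sketch, with `try_recv` as a hypothetical helper name, for a nonblocking socket:

```python
import socket


def try_recv(sock, bufsize=65536):
    """Return data, b"" on EOF, or None when the wakeup turned out to be spurious."""
    try:
        return sock.recv(bufsize)
    except (BlockingIOError, InterruptedError):
        # EAGAIN/EINTR after a readiness report is not an error; just wait again.
        return None
```

The caller then distinguishes three outcomes (data, end of stream, nothing yet) instead of assuming that EPOLLIN guarantees a successful read.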

The edge-triggered misunderstanding

wtarreau — Mon, 09 Aug 2021 16:31:47 +0000

It's "suggested" in that it's an illustration of one way to do it. The crux of this is that you're not guaranteed to be notified again until you reach EAGAIN, which rearms the event. And that's the normal way to deal with edge-triggered signals, it's not specific to epoll. Simply see it as a test-and-set on whether or not this needs to be reported. That's particularly efficient and does not guarantee that you will not get multiple notifications from other situations, but it's the right way to make sure events are not reported more often than needed.

I do remember some old docs back in the 2.5.x days where it was clearly stated that you *had* to maintain your own event cache so that if you can't read till EAGAIN you need to know by yourself that you have to try again once possible. And that's how highly efficient I/Os are achieved anyway.
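The "read until EAGAIN" pattern can be sketched as a small drain loop. This assumes a nonblocking fd; the name `drain` and the returned `(data, eof)` pair are illustrative choices.

```python
import os


def drain(fd, chunk=4096):
    """Read a nonblocking fd until EAGAIN or EOF; return (data, reached_eof)."""
    chunks = []
    while True:
        try:
            data = os.read(fd, chunk)
        except BlockingIOError:
            return b"".join(chunks), False  # EAGAIN: safe to wait for the next edge
        if not data:
            return b"".join(chunks), True   # EOF: the peer closed its end
        chunks.append(data)
```

Only the EAGAIN return means the application may rely on a future edge-triggered notification; stopping earlier requires remembering that the fd is still readable.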

TLS for private subdomains

motiejus — Mon, 09 Aug 2021 12:54:16 +0000

I thought you literally meant internal.example.com :)

TLS for private subdomains

james — Mon, 09 Aug 2021 09:40:50 +0000

Assuming you own the parent domain, you use DNS verification. You would have to (temporarily) create a public TXT record in the parent domain's DNS, as specified by the certificate authority: for example, you could put a record for _acme-challenge.test.internal.example.com in the public example.com DNS.

This would (tend to) confirm that the subdomain exists, but if you want a "real" TLS certificate, the subdomain will be included in public Certificate Transparency logs.

You would not have to publish any A or AAAA records for the subdomain, nor would you have to make any computers on the subdomain available to the outside world.
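For ACME-based certificate authorities, the TXT record described above follows the dns-01 challenge from RFC 8555: the record value is the unpadded base64url encoding of the SHA-256 digest of the key authorization, which is the challenge token joined by a dot to the account key thumbprint. The sketch below only illustrates that computation. `dns01_record` is a hypothetical helper, the token and thumbprint are placeholders supplied by the CA and the account key, and in practice an ACME client does this for you.

```python
import base64
import hashlib


def dns01_record(domain, token, account_thumbprint):
    """Return (record name, TXT value) for an ACME dns-01 challenge."""
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"_acme-challenge.{domain}", value
```

Only this TXT record needs to be visible in the public DNS; nothing in the computation depends on A or AAAA records for the subdomain.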

The edge-triggered misunderstanding

motiejus — Sun, 08 Aug 2021 04:13:58 +0000

How can I get a real tls cert for a subdomain.exame.com? I did a bit of searching, to no avail.

The edge-triggered misunderstanding

jonas.bonn — Sat, 07 Aug 2021 06:31:51 +0000

The true benefit of edge notifications is for _writeability_ notifications. It lets you forgo the enable/disable EPOLLOUT dance every time a message is enqueued, and thus saves a large number of syscalls.
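A small sketch of why that dance is needed with level-triggered writability and not with edge-triggered. The helper `count_writable_reports` is hypothetical; it registers an idle, writable socket and counts how many back-to-back polls report it.

```python
import select
import socket


def count_writable_reports(flags, polls=3):
    """Count how many back-to-back polls report an idle socket as writable."""
    a, b = socket.socketpair()
    ep = select.epoll()
    try:
        ep.register(a.fileno(), flags)
        return sum(1 for _ in range(polls) if ep.poll(0))
    finally:
        ep.close()
        a.close()
        b.close()
```

With plain `EPOLLOUT` every poll returns immediately, so interest has to be switched off whenever the send queue is empty and back on when a write hits EAGAIN. With `EPOLLOUT | EPOLLET` the socket is reported once and interest can simply stay registered.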

The edge-triggered misunderstanding

NYKevin — Sat, 07 Aug 2021 04:55:23 +0000

As mentioned upthread, it is completely safe to use fake subdomains of home.arpa, because it was explicitly reserved for this purpose (RFC 8375). Of course, they can still leak, but if you're running a local recursive resolver, you can at least configure it to avoid routing home.arpa queries outside of your homenet (because they will never be valid in the global DNS). So in principle, you can prevent fake DNS queries from reaching the external world, if you really want to. It's just a lot of work for questionable benefit, IMHO.

Technically, RFC 8375 positions itself as a "correction" to RFC 7788, but it also acknowledges that you don't have to be doing HNCP to use home.arpa. Which is a Good Thing, because I still can't figure out if HNCP is even a real protocol, or just a stack of "wouldn't it be nice if consumer products did [X]" ideas wearing a trenchcoat.

The edge-triggered misunderstanding

HenrikH — Sat, 07 Aug 2021 03:23:09 +0000

Well, the manpage should perhaps be made a bit clearer, because the behaviour is specifically like the one I've described. I just made a test of this using pipes on a 5.11 kernel, strace: (* lines from a different pid)

* write(4, "6\334\177", 3) = 3
epoll_create1(EPOLL_CLOEXEC) = 5
epoll_ctl(5, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLONESHOT|EPOLLET, {u32=3, u64=2581266273925070851}}) = 0
epoll_wait(5, [{EPOLLIN, {u32=3, u64=2581266273925070851}}], 1, -1) = 1
read(3, "6\334\177", 3) = 3
* write(4, "6\334\177", 3) = 3
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=10, tv_nsec=0}, 0x7fff71722d70) = 0
epoll_ctl(5, EPOLL_CTL_MOD, 3, {EPOLLIN|EPOLLONESHOT|EPOLLET, {u32=3, u64=2581266273925070851}}) = 0
epoll_wait(5, [{EPOLLIN, {u32=3, u64=2581266273925070851}}], 1, -1) = 1
read(3, "6\334\177", 3) = 3
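The same sequence can be reproduced in a single process. This is a sketch of the test, not the original program: `oneshot_et_rearm` is a made-up name and Python's `select.epoll` stands in for the raw syscalls. The second write happens while the fd is disabled, and the question is whether the event is still there after EPOLL_CTL_MOD.

```python
import os
import select


def oneshot_et_rearm():
    """Write while a one-shot, edge-triggered fd is disabled, then re-arm it."""
    flags = select.EPOLLIN | select.EPOLLONESHOT | select.EPOLLET
    r, w = os.pipe()
    ep = select.epoll()
    try:
        os.write(w, b"abc")
        ep.register(r, flags)        # EPOLL_CTL_ADD
        first = ep.poll(1)           # one event; the fd is disabled from here on
        os.read(r, 3)
        os.write(w, b"abc")          # arrives while the fd is disabled
        disabled = ep.poll(0.1)      # nothing is reported in this window
        ep.modify(r, flags)          # EPOLL_CTL_MOD re-arms
        rearmed = ep.poll(1)         # the data written meanwhile is reported
        return first, disabled, rearmed
    finally:
        ep.close()
        os.close(r)
        os.close(w)
```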

The edge-triggered misunderstanding

HenrikH — Sat, 07 Aug 2021 02:18:37 +0000

Ok, so I'm trying to find the actual discussion on LKML that led to the flag, but at least the comment that Davide Libenzi has on the code is this:

+ * EPOLLONESHOT bit that disables the descriptor when an event is received,
+ * until the next EPOLL_CTL_MOD will be issued.

Note the "when an event is received, _until_": the event is put on hold until the fd is re-armed by an EPOLL_CTL_MOD.

The edge-triggered misunderstanding

HenrikH — Sat, 07 Aug 2021 02:07:36 +0000

Yes, it seems like Wikipedia mixed them up. He does speak about epoll in that video as if EPOLLONESHOT didn't exist, though, which is why I was fooled. Come to think of it, I should have reacted to the date of 2015, since I had used the flag for many years before that :)

The edge-triggered misunderstanding

NYKevin — Fri, 06 Aug 2021 22:21:31 +0000

Maybe the documentation is incomplete or misleading, but as it stands, the man page says you get a race condition. If this race condition does not in fact exist, then the man page should explicitly call that out, rather than tacitly assuming that we all watch random YouTube videos of BSD developers criticizing Linux.

The edge-triggered misunderstanding

mkerrisk — Fri, 06 Aug 2021 21:34:55 +0000

> It was introduced after Bryan Cantrill's epoll-rant in 2015

EPOLLONESHOT appeared in Linux 2.6.2, in 2004. I think you perhaps are thinking of EPOLLEXCLUSIVE.

The edge-triggered misunderstanding

HenrikH — Fri, 06 Aug 2021 21:17:43 +0000

EPOLLONESHOT was introduced so that you could load balance reads over multiple threads. If the race that you talk about existed, it would make the whole idea of the flag moot (and a lot of software out there completely broken). It was introduced after Bryan Cantrill's epoll-rant in 2015 https://www.youtube.com/watch?v=l6XQUciI-Sc&t=3420s where he critiques that epoll cannot be used to load balance reads in multiple threads (among other things).

The edge-triggered misunderstanding

khim — Fri, 06 Aug 2021 19:11:15 +0000

> except that undocumented or explicitly internal bits are more of a "why are you fiddling around there?" than Linux

Linux has an insane number of unstable interfaces. And if changes to these break apps then no one would even think twice.

But syscalls are sacred. These are fixed in case of regressions religiously. And that's certainly a good thing.

The edge-triggered misunderstanding

khim — Fri, 06 Aug 2021 19:07:40 +0000

You are almost correct but devil is in details. Let me show you with a change in the quote:

> In my current understanding of the kernel workflow, if the Android project uses kernel code from 1.5 years ago and releases it only now it should manage any ~~regressions~~ backporting by itself.

And yes, sure. They will.

> What can upstream do about bugs that are filed years after they are produced?

Fix them? Why is that so hard to understand for some? Regressions are regressions. They have to be fixed if there are real programs used by real users (as in: anything that existed before introduction of regression and not made specifically to show-off that you can write superfragile code which may be broken by any unrelated change in the kernel).

Remember that story with autofs? Please pay close attention to this line: "automount program from the autofs-tools package, which is still in use on a great many systems, had run into this problem a number of years ago."

Yet it was fixed, because, you know, regressions are regressions.

> For a downstream that is lagging behind, it is thus their responsibility to deal with this whole situation.

Oh, sure. Of course. After the regression is fixed in all supported kernels, the job of Linus and the kernel developers is done. Backporting these changes to older, already unsupported, kernels is not their job.

But it has to be fixed in Linus's tree even if it was introduced 10 years ago.

That's the rule, and that is why Linux is the most popular OS kernel (maybe by now more popular than all others combined on MMU-equipped devices).

And yes, sometimes it requires quite a doozy of a solution to satisfy both the apps and libraries developed before the regression happened and those developed after. I guess at some point it would just be impossible to support them all. But AFAIK this has never happened so far (or, more likely, it happened, but nobody noticed: that old "if nobody notices, it's not broken" rule).

> I am just a bystander dreamer, though, understanding that such a thing would be difficult to impose and the benefits might be questionable...

Depends on who you ask, obviously. Apple, Google, Microsoft would be delighted by such a decision. Maybe even Fuchsia would get a chance. But it's hard to see what this adoption of the CADT model would bring to Linux.

The edge-triggered misunderstanding

HenrikH — Fri, 06 Aug 2021 18:20:49 +0000

And more importantly, I think that it also means that there is no race condition between read returning EAGAIN and you re-arming the fd: because the fd remains in the list, epoll can mark it as ready in the meantime.

The edge-triggered misunderstanding

comex — Fri, 06 Aug 2021 17:37:27 +0000

At the time you call epoll_ctl to add or re-enable a fd, the kernel will always check whether the fd is already ready and queue an event for it if so. [1] To this extent, it acts 'level-triggered' regardless of whether you request level-triggered or edge-triggered mode. So it should be safe to use epoll with EPOLLET plus EPOLLONESHOT, and always start by adding/enabling the FD, not calling read/write. Then once you receive an event, you call read/write, then re-arm with EPOLL_CTL_MOD. If another event comes between the read/write and the re-arm, you will still get notified.

However, this does directly contradict the man page's "suggested way" of using EPOLLET [2] which is to only use epoll after read/write return EAGAIN. And the behavior of checking for readiness at add/enable time doesn't seem to be documented, so it's hard to say if this is meant to be guaranteed.

[1] https://stackoverflow.com/questions/12920243/if-a-file-is...

[2] https://man7.org/linux/man-pages/man7/epoll.7.html
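The readiness check at add time is easy to observe, with the caveat raised above that it does not seem to be a documented guarantee. A minimal sketch, with `ready_at_add` as an illustrative name: the data is written before the fd is registered, so no edge occurs after registration at all.

```python
import os
import select


def ready_at_add():
    """Register an already-readable fd edge-triggered and poll it twice."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        os.write(w, b"x")                      # data arrives BEFORE registration
        ep.register(r, select.EPOLLIN | select.EPOLLET)
        first = ep.poll(0)                     # reported despite no new edge
        second = ep.poll(0)                    # reported once only
        return first, second
    finally:
        ep.close()
        os.close(r)
        os.close(w)
```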

The edge-triggered misunderstanding

NYKevin — Fri, 06 Aug 2021 17:30:35 +0000

> Well that depends on how EPOLLONESHOT is implemented in epoll when you rearm it, if epoll checks the state of the fd (aka if there is a buffer on EPOLLIN or if it can write on EPOLLOUT) then it can avoid a race condition here.

That's how EPOLLLT behaves. If you're using EPOLLET, then you have explicitly asked for the opposite behavior (i.e. to only notify when the fd *changes states*). There is nothing in epoll(7) (or any of the other pages I looked at) which suggests that a "disabled" EPOLLONESHOT fd can still receive events, which are saved until the fd is re-armed. The implication is that the events are simply dropped.

The edge-triggered misunderstanding

comex — Fri, 06 Aug 2021 17:19:13 +0000

That still takes the lock and walks the tree, but at least it doesn't have to allocate a new `struct epitem` like EPOLL_CTL_ADD does.

The edge-triggered misunderstanding

mathstuf — Fri, 06 Aug 2021 17:18:27 +0000

I should say that I'm not worried about leakage so much as squatting, so I wasn't clear there. If someone knows I have a local DNS name of `git` or can guess my naming scheme for hosts…so be it.

We just used to use a `foobar.com` for internal routing (related to our actual domain name, but not publicly registered) at $DAYJOB. "Hilarity" ensued when someone *did* register that name and we started getting hosts external to our network responding to our SSH queries. I presume they were watching dead DNS queries pile up and snatched it in the hopes to get something juicy.

The edge-triggered misunderstanding

HenrikH — Fri, 06 Aug 2021 17:12:23 +0000

Checking the man page shows that re-arming the fd on EPOLLONESHOT is done via EPOLL_CTL_MOD and not EPOLL_CTL_ADD so the fd remains in the epoll internal list of descriptors.

The edge-triggered misunderstanding

HenrikH — Fri, 06 Aug 2021 17:05:17 +0000

>Regardless, the rules for EPOLLET are pretty clear: only EAGAIN will re-arm the notification

Not really, and this is why EPOLLONESHOT was introduced. EPOLLET will re-arm when there is a new write (unless it was done on a pipe on v5-v15), so if you use the same epoll fd in multiple threads with EPOLLET, then once thread A has called read and there is a new write, epoll will notify thread B about EPOLLIN on the same fd. EPOLLONESHOT is the only way to solve this.
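A sketch of that multi-threaded pattern, under the assumption that every worker shares one epoll instance and re-arms an fd only after it has finished with it. `run_oneshot_workers` and its bookkeeping are hypothetical; the `overlaps` counter exists only to detect two threads being handed the same fd at once.

```python
import os
import select
import threading


def run_oneshot_workers(read_fds, n_threads, expected_bytes, timeout=5.0):
    """Drain nonblocking read_fds from several threads sharing one epoll."""
    flags = select.EPOLLIN | select.EPOLLONESHOT
    ep = select.epoll()
    lock = threading.Lock()
    busy = set()                       # fds currently owned by some worker
    stats = {"received": 0, "overlaps": 0}
    done = threading.Event()

    def worker():
        while not done.is_set():
            for fd, _events in ep.poll(0.05, 1):
                with lock:
                    if fd in busy:     # would mean two threads got the same fd
                        stats["overlaps"] += 1
                    busy.add(fd)
                count = 0
                while True:
                    try:
                        data = os.read(fd, 4096)
                    except BlockingIOError:
                        break          # EAGAIN: drained
                    if not data:
                        break          # EOF (not handled further in this sketch)
                    count += len(data)
                with lock:
                    busy.discard(fd)
                    stats["received"] += count
                    if stats["received"] >= expected_bytes:
                        done.set()
                ep.modify(fd, flags)   # re-arm; data that arrived meanwhile is reported

    for fd in read_fds:
        ep.register(fd, flags)
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    done.wait(timeout)
    done.set()
    for t in threads:
        t.join()
    ep.close()
    return stats
```

Because the fd is disabled between the event and the `ep.modify()` call, no second thread can be notified for it during that window, whatever is written in the meantime.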

The edge-triggered misunderstanding

HenrikH — Fri, 06 Aug 2021 16:55:11 +0000

Well, that depends on how EPOLLONESHOT is implemented in epoll when you rearm it: if epoll checks the state of the fd (i.e. whether there is buffered data for EPOLLIN, or whether it can write for EPOLLOUT), then it can avoid a race condition here. Also, epoll could guess that if EPOLLONESHOT is used then the fd will be rearmed again, so it could retain the fd in its internal list and only mark it as being temporarily disabled.

I have not checked any of the kernel sources though so I don't know if any of this is actually implemented, but it could be.

The edge-triggered misunderstanding

tialaramex — Fri, 06 Aug 2021 16:14:00 +0000

> IIRC, that is one of the reserved names as well that doesn't route over the global DNS network.

Somebody else already explained that in fact .internal is not reserved although there was effort in that direction. But I want to particularly respond to "doesn't route over the global DNS network" because this is a grave misunderstanding of what's going on.

Your bogus names _will_ leak. If you've just made up bogus names for some service you would cheerfully have given a name in the public DNS that's delegated to you then it's no worse than before. But a LOT of people have this idea that their bogus names are "not routed" and so magically this is safer, the exact same mistake that happens for RFC 1918 addressing and with the same consequences.

These bogus names aren't special, nothing magically says "Oh this name is a double-secret internal name, protect it", it just goes in the big pile of bogus names like all the others. Unsurprisingly bad guys are much more interested in what's in that pile, although infrastructure researchers pick through it too.

> it just got too messy trying to sync up public DNS routing and internal routing with the same names

But your eventual solution doesn't sync up to public DNS either. So, just don't do that. It makes plenty of sense that I have some.internal.tlrmx.org without there being any public DNS records for that name. But unlike hijacking a TLD I *can* add public DNS records later if I want to, because it's in a namespace I actually control.

The edge-triggered misunderstanding

NYKevin — Fri, 06 Aug 2021 15:37:32 +0000

> Locking overhead and system call overhead, on the other hand, make this pattern probably best to avoid.

But that applies just as well to EPOLLONESHOT! So what's the point of it?

The way I see it, either you can tolerate duplicate events, or you can't. If you can't, then you *must* do some sort of explicit syscall to tell the kernel when you're done with a given event and that it is now safe to dispatch another event. But as I explained above, if you want to use EPOLLONESHOT and epoll_ctl(..., EPOLL_CTL_MOD, ...) for that purpose, then you have to call epoll_ctl either before or after the read/write, and both ways result in a race condition. So you just can't use EPOLLONESHOT for that purpose, and must instead use single-threaded EPOLLLT with explicit add/remove. There's no other safe, duplicate-free way to use this API.

OTOH, if you can tolerate duplicate notifications, then I don't see the point of EPOLLONESHOT at all. Just use EPOLLLT (single-threaded) or EPOLLET (multi-threaded, but you somehow don't care about the threads walking all over each other when the same fd gets signalled multiple times).

The edge-triggered misunderstanding

abatters — Fri, 06 Aug 2021 15:30:46 +0000

Thanks for pointing that out; I hadn't seen eventfd before.

The edge-triggered misunderstanding

itsmycpu — Fri, 06 Aug 2021 14:49:42 +0000

> It's documented that way in the current version of the man page at least

Yes, I just realized that and corrected myself in the post just (2 minutes) before yours. :)

The edge-triggered misunderstanding

intgr — Fri, 06 Aug 2021 14:42:55 +0000

> Are you suggesting that edge-triggered will not wake more than one thread?
> it doesn't seem documented in that way

It's documented that way in the current version of the man page at least
https://manpages.debian.org/testing/manpages/epoll.7.en.html

> If multiple threads (or processes, if child processes have
> inherited the epoll file descriptor across fork(2)) are blocked
> in epoll_wait(2) waiting on the same epoll file descriptor and a
> file descriptor in the interest list that is marked for edge-
> triggered (EPOLLET) notification becomes ready, just one of the
> threads (or processes) is awoken from epoll_wait(2). This
> provides a useful optimization for avoiding "thundering herd"
> wake-ups in some scenarios.

The edge-triggered misunderstanding

itsmycpu — Fri, 06 Aug 2021 14:40:05 +0000

> Are you suggesting that edge-triggered will not wake more than one thread? I can see why you might interpret the comment further above in that way,
> but it doesn't seem documented in that way, [...]

Sorry once more, I was too quick. This is actually described by the epoll man page as a special case, at the end:

This behavior seems to apply only in that special case, like an additional option. It is not the general meaning of "edge-triggered". However certainly an interesting feature. I would hope that it generally applies when multiple threads use the same epoll file descriptor from the same epoll_create, so that it is not limited to forks.

The edge-triggered misunderstanding

Sesse — Fri, 06 Aug 2021 14:37:08 +0000

Isn't that what eventfd is for? It sounds more idiomatic than modifying the epoll set from another thread (which is what you're doing, right?).

The edge-triggered misunderstanding

itsmycpu — Fri, 06 Aug 2021 14:18:18 +0000

> With level-triggered behavior, all events would be dispatched to all threads. Edge-triggered instead behaves as if the event is "consumed" after dispatching.

I missed that this is still part of the reply to my post, so I'm sorry for splitting my response and not responding to the intended meaning at first.

Are you suggesting that edge-triggered will not wake more than one thread? I can see why you might interpret the comment further above in that way, but it doesn't seem documented in that way, and I also haven't noticed any indication in that direction anywhere else. I am not convinced it is actually so, as I would expect that to be a separate option, if it really were an option. (Though perhaps a useful one.)

The edge-triggered misunderstanding

abatters — Fri, 06 Aug 2021 14:05:45 +0000

> FWIW, EPOLLONESHOT is not really relevant for most applications, and doesn't have a lot to do with edge-triggering. I believe the original use-case was a socket where you only wanted to connect() and see whether there was anything at all on the given port (some P2P server needed it).

I use EPOLLOUT | EPOLLONESHOT on a dummy fd that is always ready as a generic way for another thread to wakeup epoll_wait without using signals. Another way would be to write data to a pipe and then read it out after it woke up the epoll_wait, but that would require more syscalls.
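A sketch of this wake-up trick, with some assumptions spelled out: the `Waker` class is hypothetical, and the write end of an otherwise unused pipe stands in for the "dummy fd that is always ready". The fd is put on the list with no events requested, and another thread arms it for a single EPOLLOUT delivery with one `EPOLL_CTL_MOD`.

```python
import os
import select
import threading


class Waker:
    """Wake a thread blocked in epoll_wait with one epoll_ctl call and no reads."""

    def __init__(self, ep):
        self.ep = ep
        # Assumption: the write end of an empty pipe serves as the always-ready fd.
        self._read_end, self.fd = os.pipe()
        ep.register(self.fd, 0)        # on the list, but no events requested yet

    def wake(self):
        # Arms EPOLLOUT for exactly one delivery; the fd disables itself afterwards.
        self.ep.modify(self.fd, select.EPOLLOUT | select.EPOLLONESHOT)

    def close(self):
        self.ep.unregister(self.fd)
        os.close(self.fd)
        os.close(self._read_end)


def demo_wake():
    """Block one thread in poll() and wake it from the calling thread."""
    ep = select.epoll()
    waker = Waker(ep)
    seen = []
    t = threading.Thread(target=lambda: seen.extend(ep.poll(5)))
    t.start()
    waker.wake()                       # called from the "other" thread
    t.join()
    woke_on_waker = [fd for fd, _ in seen] == [waker.fd]
    quiet_afterwards = ep.poll(0) == []
    waker.close()
    ep.close()
    return woke_on_waker, quiet_afterwards
```

Compared with the pipe approach, the woken thread has nothing to read back: the one-shot flag leaves the fd disabled until the next `wake()`.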

The edge-triggered misunderstanding

itsmycpu — Fri, 06 Aug 2021 13:45:19 +0000

> Regardless, the rules for EPOLLET are pretty clear: only EAGAIN will re-arm the notification, ...

How would that be clear as a "rule"? In the man page linked in the article, which is also referenced up thread, it is just a "suggested way of use":

> The suggested way to use epoll as an edge-triggered (EPOLLET) interface is as follows:
>
> a) with nonblocking file descriptors; and
>
> b) by waiting for an event only after read(2) or write(2) return EAGAIN.

However, as best I can tell, if used in this way, 'level-triggered' would accomplish the same thing, and using 'edge-triggered' in this scenario seems redundant and pointless.

It does not say there that EAGAIN is necessary to "re-arm" the notification. Later on, it just says that reading until EAGAIN is necessary to make sure that all the data is read, however that is quite obvious anyway. And nothing else is implied by the example above.

The edge-triggered misunderstanding

mathstuf — Fri, 06 Aug 2021 13:29:42 +0000

Ah I missed that the `.internal` RFC was never adopted. I guess I'll put migration to…something else on my list. Thanks.