In search of an appropriate RLIMIT_MEMLOCK default
The kernel implements a set of resource limits applied to each (unprivileged) running process; they regulate how much CPU time a process can use, how many files it can have open, and more. The setrlimit() man page documents the full set. Of interest here is RLIMIT_MEMLOCK, which places a limit on how much memory a process can lock into RAM. Its default value is 64KB; the system administrator can raise it, but unprivileged processes cannot.
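As a quick illustration (a sketch, not code from the discussion), a process can read its own locked-memory limit with getrlimit(); an unprivileged attempt to raise the hard limit will fail:

    /* Sketch: query RLIMIT_MEMLOCK and try to raise it.  Raising the
     * hard limit requires privilege (CAP_SYS_RESOURCE); without it,
     * the setrlimit() call below fails with EPERM. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
            printf("soft: %llu, hard: %llu\n",
                   (unsigned long long) rl.rlim_cur,
                   (unsigned long long) rl.rlim_max);

        rl.rlim_cur = rl.rlim_max = 8 * 1024 * 1024;    /* 8MB */
        if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0)
            perror("setrlimit");    /* EPERM for ordinary users */
        return 0;
    }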
Once upon a time, locking memory was a privileged operation. The ability to prevent memory from being swapped out can present resource-management problems for the kernel; if too much memory is locked, there will not be enough left for the rest of the system to function normally. The widespread use of cryptographic utilities like GnuPG eventually led to this feature being made available to all processes, though. By locking memory containing sensitive data (keys and passphrases, for example), GnuPG can prevent that data from being written to swap devices or core-dump files. To enable this extra security, the kernel community opened up the mlock() system call to all users, but set the limit for the number of pages that can be locked to a relatively low value.
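For illustration, a minimal sketch of the pattern described here (not GnuPG's actual code; the buffer size is arbitrary):

    #include <strings.h>        /* explicit_bzero() (glibc) */
    #include <sys/mman.h>

    #define KEY_SIZE 4096

    static unsigned char key_buf[KEY_SIZE];

    /* Lock the buffer so key material can never reach a swap device;
     * the locked pages are charged against RLIMIT_MEMLOCK. */
    int setup_secret_buffer(void)
    {
        if (mlock(key_buf, sizeof(key_buf)) != 0)
            return -1;          /* over the limit, or not permitted */
        /* ... store keys and passphrases in key_buf ... */
        return 0;
    }

    void teardown_secret_buffer(void)
    {
        explicit_bzero(key_buf, sizeof(key_buf));   /* wipe first */
        munlock(key_buf, sizeof(key_buf));
    }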
Uses of memory change over time. GnuPG does not really need more locked memory than it did years ago, but there are now other ways that users can run into the locked-memory limit. BPF programs, for example, are stored in unswappable kernel memory, with the space used being charged against this limit. These programs tend to be relatively small, but 64KB is likely to be constraining for many users. The big new consumer of locked memory, though, is io_uring.
Whenever the kernel sets up a user-space buffer for I/O, that buffer must be locked into memory for the duration of the operation. This locking is a short-lived affair and is not charged against the user's limit. There is, however, quite a bit of work involved in setting up an I/O buffer and locking it in memory; if that buffer is used for frequent I/O operations, the setup and teardown costs can reach a point where they slow the application measurably. As a way of eliminating this cost, the io_uring subsystem allows users to "register" their buffers; that operation sets up the buffers for I/O and leaves them in place where they can be used repeatedly.
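As a rough sketch of what registration looks like from user space (using liburing; the queue depth and buffer size here are arbitrary choices for illustration):

    #include <liburing.h>
    #include <stdlib.h>

    /* Sketch: register one fixed buffer and use it for a read.  The
     * registered buffer is charged against RLIMIT_MEMLOCK, so with a
     * 64KB default this registration is likely to fail. */
    int read_with_fixed_buffer(int fd)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct iovec iov;
        int ret;

        iov.iov_len = 1024 * 1024;          /* a 1MB buffer */
        iov.iov_base = malloc(iov.iov_len);
        if (!iov.iov_base)
            return -1;

        ret = io_uring_queue_init(8, &ring, 0);
        if (ret < 0)
            return ret;

        /* One-time setup: pins the buffer in memory. */
        ret = io_uring_register_buffers(&ring, &iov, 1);
        if (ret < 0)
            return ret;     /* often -ENOMEM when over RLIMIT_MEMLOCK */

        /* The buffer can now be reused for many operations with no
         * per-I/O setup or teardown cost. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read_fixed(sqe, fd, iov.iov_base, iov.iov_len, 0, 0);
        io_uring_submit(&ring);

        ret = io_uring_wait_cqe(&ring, &cqe);
        if (ret == 0) {
            ret = cqe->res;
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return ret;
    }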
I/O buffers can be large, so locking them into memory can consume significant amounts of RAM; it thus makes sense that a limit on how much memory can be locked in this way should be imposed. So, when buffers are registered, the kernel charges them against the same locked-memory limit. This is where the 64KB limit becomes truly constraining; to make the use of io_uring worthwhile, one almost certainly wants to use much larger buffers than will fit in that space. The 64KB default limit, as a result, has the potential to make io_uring unavailable to users unless it is increased by distributors or administrators — and that tends not to happen.
To avoid this problem, Drew DeVault would like to raise that limit to 8MB. Expecting the problem to be addressed elsewhere, he said, is not realistic:
The buck, as it were, stops with the kernel. It's much easier to address it here than it is to bring it to hundreds of distributions, and it can only realistically be relied upon to be high-enough by end-user software if it is more-or-less ubiquitous.
Matthew Wilcox pointed out that there are plenty of other ways for a malicious user to lock down at least 8MB of memory, so he saw no added danger from this change; he did, though, have a couple of reservations. Perhaps it would be better to somehow scale the limit, he said, so that it would be smaller on machines with small amounts of memory. He also wondered whether 8MB was the right value for the new limit, or whether io_uring users would need still more. Jens Axboe, the maintainer of io_uring, replied that "8MB is plenty for most casual use cases", and those are the cases that should "just work" without the need for administrator intervention.
Andrew Morton, though, was not convinced about this value — or any other:
We're never going to get this right, are we? The only person who can decide on a system's appropriate setting is the operator of that system. Haphazardly increasing the limit every few years mainly reduces incentive for people to get this right.
DeVault answered that "perfect is the enemy of good", and that he lacked the time to try to convince all of the distributors to configure a more realistic default. Morton's further suggestion that the limit should have been set to zero from the beginning to force a solution in user space was not received well. And that, more or less, is where the conversation wound down.
One line of thought here seems to be that the kernel community should not try to come up with usable defaults for parameters like RLIMIT_MEMLOCK; that will force downstream distributors to think about what their users need and configure things accordingly. But that seems like a recipe for the status quo, where a useful new feature is, in fact, not useful on most systems. Putting some thought into reasonable default values is something one normally expects from a software project; it's not clear why the kernel would be different in this regard. So this change will, in all likelihood, eventually find its way in, but perhaps not until the emails-to-lines-changed ratio becomes even higher.
Index entries for this article:
    Kernel: io_uring
    Kernel: Memory management/User-space memory locking
    Kernel: Resource limits
Posted Nov 19, 2021 20:46 UTC (Fri) by atnot (subscriber, #124910)
I have the opposite thought there. When downstream is forced to set a configuration option, the knowledge about which values are sensible stops flowing upstream. That means an endless growth of suboptimal config flags and tweaking knobs that nobody dares touch, because they don't understand what they are set to in the real world.
That is why I think bad heuristics are always preferable to flexible configuration: people will send patches for bad heuristics, while flexible configuration just ends up piling up in (often secret) internal tweak lists.
Posted Nov 19, 2021 21:33 UTC (Fri) by tux3 (subscriber, #101245)
Don't make the default so hopelessly hard to change that users work around it locally, without trying to work with upstream. Let vaguely reasonable limit increase requests go through, until and unless real-world users find practical concerns. People who like stability will still run stable kernels, and people who like BPF in their io-uring get to make -rc1 a proper -rc1.
These are not the kind of changes that frequently result in systems not booting. I don't want to minimize how bad it is to break working code. It is bad. It makes me want to run 2.6. But, probably, you save more pain by preventing default ossification than you cause by making rc testers speak up a little more often.
Posted Nov 19, 2021 22:09 UTC (Fri) by mtaht (subscriber, #11087)
The question might be how much I/O can be outstanding in the kernel, thinking about that not in terms of size, but time. Putting 1ms of requests in io_uring seems like a reasonable amount of buffering before context switching back into the application.
Figuring out a BQL-like yet predictive mechanism for sizing io_uring in this way was too long to scribble in the margins of this post.
Posted Nov 20, 2021 23:50 UTC (Sat) by gerdesj (subscriber, #5446)
Me too. The article clearly states that Jens Axboe was consulted and he gave his opinion. That's an "expert opinion" in anyone's book. It wasn't a "hand on the book" type of opinion but a piece of knowledge from someone who knows stuff.
For me, as a sysadmin, that is good enough for a default. I have a massive number of things to juggle already and I need sensible defaults. If it is wrong for my use case, I'll wrangle the problem myself - that's my problem, but I want sensible defaults and an opinion from someone who wrote the code is good enough for me.
I really don't want to be bothered with a minor religious disagreement. Kernel runtime parameter config is ... sysadmin stuff - that's what I do (or try to avoid as much as possible). Kernel parameter defaults: I want the likes of Jens to opine on that. If the defaults eventually turn out to be mad then that will fairly soon come back as a bug and get fixed at source. Also the person(s) best placed to fix the problem will get the best feedback.
Posted Nov 20, 2021 9:34 UTC (Sat) by Lionel_Debroux (subscriber, #30014)
https://mobile.twitter.com/spendergrsec/status/1461794900...
"This MEMLOCK_LIMIT change is great for exploit reliability when accessing userland data directly (afaik my enlightenment exploits were the only ones at the time that accounted for this): 8MB of data instead of 64kb (old unpriv limit for most systems) guaranteed not to fault (i.e. crash) when the kernel accesses it (modulo SMAP of course)"
I don't want the io_uring or eBPF attack surface: these pieces of infrastructure have appeared on security vulnerability lists multiple times and, more fundamentally, I don't need them in the first place. Most of the many computers I have access to don't have SSDs, even with a SATA interface, and don't run high-performance workloads where eBPF is relevant.
Therefore, changing the default value of RLIMIT_MEMLOCK strongly looks like a proposition with net negative value: worse security without a functional improvement relevant to me, the user of said computers... and probably to the many users of other computers whose workloads don't benefit from io_uring & eBPF, too.
Posted Nov 20, 2021 20:11 UTC (Sat) by Sesse (subscriber, #53779)
Posted Nov 20, 2021 21:25 UTC (Sat) by andresfreund (subscriber, #69562)
Posted Nov 20, 2021 21:33 UTC (Sat) by Sesse (subscriber, #53779)
Posted Nov 20, 2021 22:05 UTC (Sat) by andresfreund (subscriber, #69562)
There's a second issue with just using readiness for receiving network data: the amount of buffer space that requires. Newer kernel versions address that by providing buffer space that isn't associated with a specific IO and is used when needed. But that buffer space IIRC is counted as locked.
Posted Nov 20, 2021 22:19 UTC (Sat) by andresfreund (subscriber, #69562)
Posted Nov 21, 2021 20:37 UTC (Sun) by calumapplepie (guest, #143655)
The change may not give you any direct performance improvement; however, the improved performance of the servers you use (as they are encouraged to use io_uring) benefits you, and even that tiny benefit is probably greater than the cost of slightly reducing the difficulty of exploits that rely on directing the kernel to access userspace buffers without faulting.
Posted Nov 20, 2021 9:43 UTC (Sat) by ddevault (subscriber, #99589)
Either way this is resolved, I ended up coming up with a workaround for my original problem, but it would certainly be much better to do without it. Here's to the eternal bikeshed!
Posted Dec 2, 2021 14:15 UTC (Thu) by smitty_one_each (subscriber, #28989)
"Drewnux is your video server distro, because it gets the IO right."
Posted Nov 21, 2021 11:39 UTC (Sun) by nilsmeyer (guest, #122604)
Posted Nov 21, 2021 23:41 UTC (Sun) by NYKevin (subscriber, #129325)
So the actual consumer of the setrlimit(2) API is generally going to be container orchestration systems, and/or distros that replace the kernel's defaults with their own defaults. If you're not using a container orchestration system, then it's pretty likely that your limits are either whatever the (distro or kernel) default is, or managed in an ad-hoc way where you reactively bump them up or down in order to fix problems, rather than proactively planning out your usage in a systematic fashion. But lots of people don't use orchestration systems. So the defaults had better be pretty reasonable, since bad defaults will inconvenience a very large group of sysadmins.
Posted Nov 24, 2021 8:53 UTC (Wed) by nilsmeyer (guest, #122604)
Posted Nov 25, 2021 9:44 UTC (Thu) by NYKevin (subscriber, #129325)
The cattle people, by contrast, don't care about rlimit defaults in the first place, because they just ingest the dimension into their existing resource management system and automate the whole problem away. So in a sense, these defaults are a pets-only affair, and you need to take a realistic view of what a pet-shop sysadmin can plausibly do in terms of tuning arcane system parameters that they likely have never even heard of. Of course, plenty of pets are actually workstations, or laptops, or other such non-server devices,* in which case the sysadmin is probably either a SWE or the IT department, and might not even speak fluent bash, let alone know what an "rlimit" is supposed to be.
* I'm ignoring Android and Android-like devices because I assume that Google or whoever will extensively customize absolutely everything, and pick sane defaults (or at least, defaults that are not obviously ridiculous). Also, it would not really make sense for Linus and co. to try and guess how to tune Linux to perform well on a smartphone.
Posted Nov 25, 2021 13:03 UTC (Thu) by Wol (subscriber, #4433)
And you're forgetting people like me, who run a home server, so just like MS pisses me off with the error message "contact your system administrator", you're doing the same telling me to "contact yourself to fix the problem", WITHOUT giving me any clues as to what the problem is, or how to solve it.
So the equivalence "pet-shop admin == end user without a clue" is largely true. You're just throwing these people under the bus, but the reality is these people are also VERY IMPORTANT in the PR war. We run linux from choice, and our contribution is political, not technical ...
Cheers,
Wol
Posted Nov 26, 2021 0:25 UTC (Fri) by NYKevin (subscriber, #129325)
Posted Nov 26, 2021 11:16 UTC (Fri) by farnz (subscriber, #17727)
I went along to my employer's optional internal training on being productive with the command line, and was surprised by how much of what my fellow attendees thought was "amazing information" was stuff I considered basic CLI knowledge.
CLIs just aren't the normal form of interaction for modern programmers, most of the time, and thus the basics aren't being learnt (or taught).
Posted Nov 30, 2021 16:32 UTC (Tue) by nix (subscriber, #2304)
Posted Nov 26, 2021 13:07 UTC (Fri) by Wol (subscriber, #4433)
sed, grep, awk, etc. are all power tools I have never really got to grips with, because I've never been in an environment where they were either (a) the go-to tools, or (b) familiar to my colleagues.
And while I've forgotten most of my CPL, it had a lot of power capabilities that were the equivalent.
But yes, I did grow up and cut my computing teeth in a time before GUIs like X and Windows. It just wasn't in a *nix environment, which is why I'm so damning of (li)nux sometimes - I feel it's a case of "Unix won because it was good enough and cheap enough". Doesn't stop it being crap :-)
Cheers,
Wol
Posted Nov 23, 2021 14:26 UTC (Tue) by ncm (guest, #165)
If buffers only written to by the kernel were charged to the kernel, that might sidestep most of the conflict. That would depend on such buffers being made to be writable only by the kernel, which would be a new but IMO valuable feature. A buffer I cannot write into is one that cannot be sprayed by malicious code in my process.
Posted Nov 23, 2021 17:27 UTC (Tue) by notriddle (subscriber, #130608)
Posted Dec 10, 2021 7:17 UTC (Fri) by massimiliano (subscriber, #3048)
Then perhaps the sensible proposal would be to have two separated limits: one (with a low default, like 64Kb) for locked memory writable by user space, and another (with a larger default, like 8Mb, because it is safe anyway) for locked memory writable only by the kernel.
Posted Nov 23, 2021 15:43 UTC (Tue) by stevie-oh (subscriber, #130795)
Why isn't this a problem for applications that aren't using io_uring?
What if I issue a `read` call with an 8MB buffer?
What if I create eight new threads and have each thread issue a `read` call into an 8MB buffer? That's 64MB right there.
As I understand it, from a computational standpoint, using io_uring does not increase a program's capabilities. In the same way that the bulk of the Linux kernel could be written in Brainf*ck instead of C, any program implemented using io_uring could accomplish the same sort of thing with multiple threads and ordinary calls to `read`/`write`/etc.
Posted Nov 23, 2021 16:21 UTC (Tue) by corbet (editor, #1)
A basic read() call does not lock the buffer into memory most of the time. If you are doing some sort of direct I/O, the buffer may indeed be locked into memory but (as the article says) that lock will only exist for the duration of the I/O operation and is not charged against the locked-memory limit. Registered buffers, instead, remain locked in memory indefinitely, thus need to be treated differently.
Posted Dec 10, 2021 8:42 UTC (Fri) by njs (subscriber, #40338)
Posted Nov 23, 2021 16:27 UTC (Tue) by david.hildenbrand (subscriber, #108299)
So there is an additional long-term reference to a page that is not due to the page being mapped into a user space page table.
While we have such a reference on a page, that page cannot be swapped out ("locked into memory"), just as with mlock. But worse, it cannot be migrated/moved around in memory anymore, meaning that it can prevent memory compaction. Possibly forever.
It behaves at least as badly as mlock'ed memory, so, for now, we account it against RLIMIT_MEMLOCK. But really, it's much worse.
The only reason io_uring uses FOLL_LONGTERM is because it's faster than alternatives that are "nice" to the MM subsystem and can actually drop references temporarily to re-take them when necessary later.
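In kernel terms, such a long-term pin looks roughly like this (a much-simplified sketch, not io_uring's actual registration code; user_addr and nr_pages are assumed context):

    struct page **pages;
    int ret;

    /* Take long-term references on the user pages backing a buffer.
     * Until unpin_user_pages() runs, these pages can be neither
     * swapped out nor migrated. */
    pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
    if (!pages)
            return -ENOMEM;

    ret = pin_user_pages_fast(user_addr, nr_pages,
                              FOLL_WRITE | FOLL_LONGTERM, pages);
    if (ret < 0)
            goto out_free;

    /* ... the buffer may now be used for I/O indefinitely ... */

    unpin_user_pages(pages, nr_pages);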
a) Why isn't this a problem for applications that aren't using io_uring?
We barely have features that expose FOLL_LONGTERM to unprivileged user space (for a good reason if you ask me).
b) What if I issue a `read` call with an 8MB buffer?
IIUC, a read with O_DIRECT will take a short-term reference for DMA and pass the pages to the HW for a very short time frame, then release the page reference. Without O_DIRECT, we simply copy the result to the target user space page from an in-kernel buffer. (A sketch of the direct-I/O case follows this list.)
c) What if I create eight new threads and have each thread issue a `read` call into an 8MB buffer? That's 64MB right there.
At least the 8MB user-space buffer exists only once.
d) "using io_uring does not increase a program's capabilities"
In regards to fixed buffers I think you are correct. It's a pure performance improvement.
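For point b), a sketch of such a direct-I/O read from user space (the 4096-byte alignment is illustrative; actual alignment requirements are device-dependent):

    #define _GNU_SOURCE         /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* The kernel pins the buffer's pages only while this one read()
     * is in flight; nothing is charged to RLIMIT_MEMLOCK. */
    ssize_t direct_read(const char *path, size_t len)
    {
        void *buf;
        ssize_t n;
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0)
            return -1;
        if (posix_memalign(&buf, 4096, len) != 0) {
            close(fd);
            return -1;
        }
        n = read(fd, buf, len);     /* pages pinned only for this call */
        free(buf);
        close(fd);
        return n;
    }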
Posted Nov 25, 2021 0:45 UTC (Thu) by Fowl (subscriber, #65667)
Posted Nov 25, 2021 8:15 UTC (Thu) by david.hildenbrand (subscriber, #108299)
It should be as simple as handling (and logging/warning users about) errors from IORING_REGISTER_BUFFERS and then, instead of issuing, e.g., an IORING_OP_READ_FIXED/IORING_OP_WRITE_FIXED, issuing an IORING_OP_READ/IORING_OP_WRITE. Of course, further buffer index management in user space has to consider that the buffer registration previously failed.
A fallback inside the kernel itself would also be imaginable; however, this rather hides the error from the user and might bring surprises: not everybody wants a silent fallback.
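A sketch of that user-space fallback (using liburing; the function names and error handling here are illustrative):

    #include <liburing.h>
    #include <stdio.h>

    static int have_fixed_buffers;

    /* Try to register buffers; on failure (e.g. due to RLIMIT_MEMLOCK),
     * remember to use the non-fixed operations instead. */
    void setup_buffers(struct io_uring *ring, struct iovec *iovs, unsigned n)
    {
        if (io_uring_register_buffers(ring, iovs, n) == 0)
            have_fixed_buffers = 1;
        else
            fprintf(stderr, "buffer registration failed; using non-fixed I/O\n");
    }

    void queue_read(struct io_uring *ring, int fd, struct iovec *iov, int idx)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (have_fixed_buffers)
            /* IORING_OP_READ_FIXED under the hood */
            io_uring_prep_read_fixed(sqe, fd, iov->iov_base,
                                     iov->iov_len, 0, idx);
        else
            /* IORING_OP_READ: same I/O, without the pinned buffer */
            io_uring_prep_read(sqe, fd, iov->iov_base, iov->iov_len, 0);
    }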
Posted Nov 25, 2021 18:59 UTC (Thu) by andresfreund (subscriber, #69562)
Posted Nov 26, 2021 8:43 UTC (Fri) by david.hildenbrand (subscriber, #108299)
The man page mentions for IORING_REGISTER_BUFFERS: "After a successful call, the supplied buffers are mapped into the kernel and eligible for I/O. To make use of them, the application must specify the IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED opcodes in the submission queue entry ...". No notion of IOSQE_BUFFER_SELECT.
IOSQE_BUFFER_SELECT in contrast is "Used in conjunction with the IORING_OP_PROVIDE_BUFFERS command".
Digging through the code, I can spot that io_import_fixed() is really only used for IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED. IOSQE_BUFFER_SELECT references only buffers in the "IOSQE_BUFFER_SELECT" ("io_buffers") domain, not in the "IORING_REGISTER_BUFFERS" ("user_bufs") domain.
To be precise (include/uapi/linux/io_uring.h):

    union {
            /* index into fixed buffers, if used */
            __u16   buf_index;
            /* for grouped buffer selection */
            __u16   buf_group;
    }

IOSQE_BUFFER_SELECT effectively uses the same member, but it's interpreted as a buffer group selection, not a fixed buffer selection.
Posted Nov 23, 2021 19:40 UTC (Tue) by flussence (guest, #85566)
Visible arbitrary limits like this are stable ABI, whether we like it or not. By increasing them you subconsciously train developers to take that number for granted and grow their software towards it, and we've seen the end result of that in this area with glibc's gigantic thread stacks leading to supposedly portable software not working in musl. We need good backpressure mechanisms, not more bufferbloat.
Posted Nov 26, 2021 18:44 UTC (Fri) by Hattifnattar (subscriber, #93737)
Then have a 3-tier configuration system, where distributions and users would not generally _replace_ config files, but rather supplement/override values in them via separate (config-level and user-level) configs.
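As a concrete illustration of that override style (file names hypothetical), an administrator can already raise this particular limit with small fragments rather than by replacing any shipped file:

    # /etc/systemd/system.conf.d/memlock.conf -- drop-in raising the
    # default locked-memory limit for services
    [Manager]
    DefaultLimitMEMLOCK=8M

    # /etc/security/limits.d/memlock.conf -- for login sessions;
    # values are in kilobytes
    *   soft    memlock     8192
    *   hard    memlock     8192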
Posted Nov 26, 2021 19:18 UTC (Fri) by Wol (subscriber, #4433)
But your idea should be implemented everywhere. Systemd uses that idea - I know it has system and sysadmin config files (and user ones where appropriate).
But so many programs have a monolithic config file. Running gentoo, I have to be careful that a config-file update doesn't trash my configuration. I made the mistake of editing the Dovecot config file, and then an update trashed everything. Only after that :-) did I discover that dovecot imports a local config, which gentoo obviously doesn't provide, so now all my config is in that ... :-)
But how many other programs are that sensible?
Cheers,
Wol
Posted Dec 2, 2021 21:19 UTC (Thu) by mezcalero (subscriber, #45103)
Lennart