In search of an appropriate RLIMIT_MEMLOCK default
The kernel implements a set of resource limits applied to each (unprivileged) running process; they regulate how much CPU time a process can use, how many files it can have open, and more. The setrlimit() man page documents the full set. Of interest here is RLIMIT_MEMLOCK, which places a limit on how much memory a process can lock into RAM. Its default value is 64KB; the system administrator can raise it, but unprivileged processes cannot.
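As a quick illustration (a sketch, not code from the discussion), a process can read its own locked-memory limit with getrlimit(); an unprivileged attempt to raise the hard limit will fail:

    /* Sketch: query RLIMIT_MEMLOCK and try to raise it.  Raising the
     * hard limit requires privilege (CAP_SYS_RESOURCE); without it,
     * the setrlimit() call below fails with EPERM. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
            printf("soft: %llu, hard: %llu\n",
                   (unsigned long long) rl.rlim_cur,
                   (unsigned long long) rl.rlim_max);

        rl.rlim_cur = rl.rlim_max = 8 * 1024 * 1024;    /* 8MB */
        if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0)
            perror("setrlimit");    /* EPERM for ordinary users */
        return 0;
    }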
Once upon a time, locking memory was a privileged operation. The ability to prevent memory from being swapped out can present resource-management problems for the kernel; if too much memory is locked, there will not be enough left for the rest of the system to function normally. The widespread use of cryptographic utilities like GnuPG eventually led to this feature being made available to all processes, though. By locking memory containing sensitive data (keys and passphrases, for example), GnuPG can prevent that data from being written to swap devices or core-dump files. To enable this extra security, the kernel community opened up the mlock() system call to all users, but set the limit for the number of pages that can be locked to a relatively low value.
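For illustration, a minimal sketch of the pattern described here (not GnuPG's actual code; the buffer size is arbitrary):

    #include <strings.h>        /* explicit_bzero() (glibc) */
    #include <sys/mman.h>

    #define KEY_SIZE 4096

    static unsigned char key_buf[KEY_SIZE];

    /* Lock the buffer so key material can never reach a swap device;
     * the locked pages are charged against RLIMIT_MEMLOCK. */
    int setup_secret_buffer(void)
    {
        if (mlock(key_buf, sizeof(key_buf)) != 0)
            return -1;          /* over the limit, or not permitted */
        /* ... store keys and passphrases in key_buf ... */
        return 0;
    }

    void teardown_secret_buffer(void)
    {
        explicit_bzero(key_buf, sizeof(key_buf));   /* wipe first */
        munlock(key_buf, sizeof(key_buf));
    }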
Uses of memory change over time. GnuPG does not really need more locked memory than it did years ago, but there are now other ways that users can run into the locked-memory limit. BPF programs, for example, are stored in unswappable kernel memory, with the space used being charged against this limit. These programs tend to be relatively small, but 64KB is likely to be constraining for many users. The big new consumer of locked memory, though, is io_uring.
Whenever the kernel sets up a user-space buffer for I/O, that buffer must be locked into memory for the duration of the operation. This locking is a short-lived affair and is not charged against the user's limit. There is, however, quite a bit of work involved in setting up an I/O buffer and locking it in memory; if that buffer is used for frequent I/O operations, the setup and teardown costs can reach a point where they slow the application measurably. As a way of eliminating this cost, the io_uring subsystem allows users to "register" their buffers; that operation sets up the buffers for I/O and leaves them in place where they can be used repeatedly.
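As a rough sketch of what registration looks like from user space (using liburing; the queue depth and buffer size here are arbitrary choices for illustration):

    #include <liburing.h>
    #include <stdlib.h>

    /* Sketch: register one fixed buffer and use it for a read.  The
     * registered buffer is charged against RLIMIT_MEMLOCK, so with a
     * 64KB default this registration is likely to fail. */
    int read_with_fixed_buffer(int fd)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct iovec iov;
        int ret;

        iov.iov_len = 1024 * 1024;          /* a 1MB buffer */
        iov.iov_base = malloc(iov.iov_len);
        if (!iov.iov_base)
            return -1;

        ret = io_uring_queue_init(8, &ring, 0);
        if (ret < 0)
            return ret;

        /* One-time setup: pins the buffer in memory. */
        ret = io_uring_register_buffers(&ring, &iov, 1);
        if (ret < 0)
            return ret;     /* often -ENOMEM when over RLIMIT_MEMLOCK */

        /* The buffer can now be reused for many operations with no
         * per-I/O setup or teardown cost. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read_fixed(sqe, fd, iov.iov_base, iov.iov_len, 0, 0);
        io_uring_submit(&ring);

        ret = io_uring_wait_cqe(&ring, &cqe);
        if (ret == 0) {
            ret = cqe->res;
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return ret;
    }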
I/O buffers can be large, so locking them into memory can consume significant amounts of RAM; it thus makes sense that a limit on how much memory can be locked in this way should be imposed. So, when buffers are registered, the kernel charges them against the same locked-memory limit. This is where the 64KB limit becomes truly constraining; to make the use of io_uring worthwhile, one almost certainly wants to use much larger buffers than will fit in that space. The 64KB default limit, as a result, has the potential to make io_uring unavailable to users unless it is increased by distributors or administrators — and that tends not to happen.
To avoid this problem, Drew DeVault would like to raise that limit to 8MB. Expecting the problem to be addressed elsewhere, he said, is not realistic:
The buck, as it were, stops with the kernel. It's much easier to address it here than it is to bring it to hundreds of distributions, and it can only realistically be relied upon to be high-enough by end-user software if it is more-or-less ubiquitous.
Matthew Wilcox pointed out that there are plenty of other ways for a malicious user to lock down at least 8MB of memory, so he saw no added danger from this change; he did, though, have a couple of reservations. Perhaps it would be better to somehow scale the limit, he said, so that it would be smaller on machines with small amounts of memory. He also wondered whether 8MB was the right value for the new limit, or whether io_uring users would need still more. Jens Axboe, the maintainer of io_uring, replied that "8MB is plenty for most casual use cases", and those are the cases that should "just work" without the need for administrator intervention.
Andrew Morton, though, was not convinced about this value — or any other:
We're never going to get this right, are we? The only person who can decide on a system's appropriate setting is the operator of that system. Haphazardly increasing the limit every few years mainly reduces incentive for people to get this right.
DeVault answered that "perfect is the enemy of good", and that he lacked the time to try to convince all of the distributors to configure a more realistic default. Morton's further suggestion that the limit should have been set to zero from the beginning to force a solution in user space was not received well. And that, more or less, is where the conversation wound down.
One line of thought here seems to be that the kernel community should not try to come up with usable defaults for parameters like RLIMIT_MEMLOCK; that will force downstream distributors to think about what their users need and configure things accordingly. But that seems like a recipe for the status quo, where a useful new feature is, in fact, not useful on most systems. Putting some thought into reasonable default values is something one normally expects from a software project; it's not clear why the kernel would be different in this regard. So this change will, in all likelihood, eventually find its way in, but perhaps not until the emails-to-lines-changed ratio becomes even higher.
Index entries for this article:
    Kernel: io_uring
    Kernel: Memory management/User-space memory locking
    Kernel: Resource limits
Posted Nov 19, 2021 20:46 UTC (Fri) by atnot (subscriber, #124910)
I have the opposite thought there. When downstream is forced to set a configuration option, the knowledge about which values are sensible stops flowing upstream. That means an endless growth of suboptimal config flags and tweaking knobs that nobody dares touch, because they don't understand what they are set to in the real world.
That is why I think bad heuristics are always preferable to flexible configuration: people will send patches for bad heuristics, while flexible configuration just ends up piling up in (often secret) internal tweak lists.
Posted Nov 19, 2021 21:33 UTC (Fri) by tux3 (subscriber, #101245)
Don't make the default so hopelessly hard to change that users work around it locally, without trying to work with upstream. Let vaguely reasonable limit increase requests go through, until and unless real-world users find practical concerns. People who like stability will still run stable kernels, and people who like BPF in their io-uring get to make -rc1 a proper -rc1.
These are not the kind of changes that frequently result in systems not booting. I don't want to minimize how bad it is to break working code. It is bad. It makes me want to run 2.6. But, probably, you save more pain by preventing default ossification than you cause by making rc testers speak up a little more often.
Posted Nov 19, 2021 22:09 UTC (Fri) by mtaht (subscriber, #11087)
The question might be how much I/O can be outstanding in the kernel, thinking about that not in terms of size, but time. Putting 1ms of requests in io_uring seems like a reasonable amount of buffering before context switching back into the application.
Figuring out a BQL-like yet predictive mechanism for sizing io_uring in this way was too long to scribble in the margins of this post.
Posted Nov 20, 2021 23:50 UTC (Sat) by gerdesj (subscriber, #5446)
Me too. The article clearly states that Jens Axboe was consulted and he gave his opinion. That's an "expert opinion" in anyone's book. It wasn't a "hand on the book" type of opinion but a piece of knowledge from someone who knows stuff.
For me, as a sysadmin, that is good enough for a default. I have a massive number of things to juggle already and I need sensible defaults. If it is wrong for my use case, I'll wrangle the problem myself - that's my problem, but I want sensible defaults and an opinion from someone who wrote the code is good enough for me.
I really don't want to be bothered with a minor religious disagreement. Kernel runtime parameter config is ... sysadmin stuff - that's what I do (or try to avoid as much as possible). Kernel parameter defaults: I want the likes of Jens to opine on that. If the defaults eventually turn out to be mad then that will fairly soon come back as a bug and get fixed at source. Also the person(s) best placed to fix the problem will get the best feedback.
Posted Nov 20, 2021 9:34 UTC (Sat) by Lionel_Debroux (subscriber, #30014)
https://mobile.twitter.com/spendergrsec/status/1461794900...
"This MEMLOCK_LIMIT change is great for exploit reliability when accessing userland data directly (afaik my enlightenment exploits were the only ones at the time that accounted for this): 8MB of data instead of 64kb (old unpriv limit for most systems) guaranteed not to fault (i.e. crash) when the kernel accesses it (modulo SMAP of course)"
I don't want the io_uring or eBPF attack surface: these pieces of infrastructure have appeared on security vulnerability lists multiple times and, more fundamentally, I don't need them in the first place. Most of the many computers I have access to don't have SSDs, even with a SATA interface, and don't run high-performance workloads where eBPF is relevant.
Therefore, changing the default value of RLIMIT_MEMLOCK strongly looks like a proposition with net negative value: worse security without a functional improvement relevant to me, the user of said computers... and probably to the many users of other computers whose workloads don't benefit from io_uring & eBPF, too.
Posted Nov 20, 2021 20:11 UTC (Sat) by Sesse (subscriber, #53779)
Posted Nov 20, 2021 21:25 UTC (Sat) by andresfreund (subscriber, #69562)
Posted Nov 20, 2021 21:33 UTC (Sat) by Sesse (subscriber, #53779)
Posted Nov 20, 2021 22:05 UTC (Sat) by andresfreund (subscriber, #69562)
There's a second issue with just using readiness for receiving network data: the amount of buffer space that requires. Newer kernel versions address that by providing buffer space that isn't associated with a specific IO and is used when needed. But that buffer space IIRC is counted as locked.
Posted Nov 20, 2021 22:19 UTC (Sat) by andresfreund (subscriber, #69562)
Posted Nov 21, 2021 20:37 UTC (Sun) by calumapplepie (guest, #143655)
The change may not give you any direct performance improvement; however, the improved performance of the servers you use (as they are encouraged to use io_uring) benefits you, and even that tiny benefit is probably greater than the cost of slightly reducing the difficulty of exploits that rely on directing the kernel to access userspace buffers without faulting.
Posted Nov 20, 2021 9:43 UTC (Sat) by ddevault (subscriber, #99589)
Either way this is resolved, I ended up coming up with a workaround for my original problem, but it would certainly be much better to do without it. Here's to the eternal bikeshed!
Posted Dec 2, 2021 14:15 UTC (Thu) by smitty_one_each (subscriber, #28989)
"Drewnux is your video server distro, because it gets the IO right."
Posted Nov 21, 2021 11:39 UTC (Sun) by nilsmeyer (guest, #122604)
Posted Nov 21, 2021 23:41 UTC (Sun) by NYKevin (subscriber, #129325)
So the actual consumer of the setrlimit(2) API is generally going to be container orchestration systems, and/or distros that replace the kernel's defaults with their own defaults. If you're not using a container orchestration system, then it's pretty likely that your limits are either whatever the (distro or kernel) default is, or managed in an ad-hoc way where you reactively bump them up or down in order to fix problems, rather than proactively planning out your usage in a systematic fashion. But lots of people don't use orchestration systems. So the defaults had better be pretty reasonable, since bad defaults will inconvenience a very large group of sysadmins.
Posted Nov 24, 2021 8:53 UTC (Wed) by nilsmeyer (guest, #122604)
Posted Nov 25, 2021 9:44 UTC (Thu) by NYKevin (subscriber, #129325)
The cattle people, by contrast, don't care about rlimit defaults in the first place, because they just ingest the dimension into their existing resource management system and automate the whole problem away. So in a sense, these defaults are a pets-only affair, and you need to take a realistic view of what a pet-shop sysadmin can plausibly do in terms of tuning arcane system parameters that they likely have never even heard of. Of course, plenty of pets are actually workstations, or laptops, or other such non-server devices,* in which case the sysadmin is probably either a SWE or the IT department, and might not even speak fluent bash, let alone know what an "rlimit" is supposed to be.
* I'm ignoring Android and Android-like devices because I assume that Google or whoever will extensively customize absolutely everything, and pick sane defaults (or at least, defaults that are not obviously ridiculous). Also, it would not really make sense for Linus and co. to try and guess how to tune Linux to perform well on a smartphone.
Posted Nov 25, 2021 13:03 UTC (Thu) by Wol (subscriber, #4433)
And you're forgetting people like me, who run a home server, so just like MS pisses me off with the error message "contact your system administrator", you're doing the same telling me to "contact yourself to fix the problem", WITHOUT giving me any clues as to what the problem is, or how to solve it.
So the equivalence "pet-shop admin == end user without a clue" is largely true. You're just throwing these people under the bus, but the reality is these people are also VERY IMPORTANT in the PR war. We run linux from choice, and our contribution is political, not technical ...
Cheers,
Wol
Posted Nov 26, 2021 0:25 UTC (Fri) by NYKevin (subscriber, #129325)
Posted Nov 26, 2021 11:16 UTC (Fri) by farnz (subscriber, #17727)
I went along to my employer's optional internal training on being productive with the command line, and was surprised by how much of what my fellow attendees thought was "amazing information" was stuff I considered basic CLI knowledge.
CLIs just aren't the normal form of interaction for modern programmers, most of the time, and thus the basics aren't being learnt (or taught).
Posted Nov 30, 2021 16:32 UTC (Tue) by nix (subscriber, #2304)
Posted Nov 26, 2021 13:07 UTC (Fri) by Wol (subscriber, #4433)
sed, grep, awk, etc. are all power tools I have never really got to grips with, because I've never been in an environment where they were either (a) the go-to tools, or (b) familiar to my colleagues.
And while I've forgotten most of my CPL, it had a lot of power capabilities that were the equivalent.
But yes, I did grow up and cut my computing teeth in a time before GUIs like X and Windows. It just wasn't in a *nix environment, which is why I'm so damning of (li)nux sometimes - I feel it's a case of "Unix won because it was good enough and cheap enough". Doesn't stop it being crap :-)
Cheers,
Wol
Posted Nov 23, 2021 14:26 UTC (Tue) by ncm (guest, #165)
If buffers only written to by the kernel were charged to the kernel, that might sidestep most of the conflict. That would depend on such buffers being made to be writable only by the kernel, which would be a new but IMO valuable feature. A buffer I cannot write into is one that cannot be sprayed by malicious code in my process.
Posted Nov 23, 2021 17:27 UTC (Tue) by notriddle (subscriber, #130608)
Posted Dec 10, 2021 7:17 UTC (Fri) by massimiliano (subscriber, #3048)
Then perhaps the sensible proposal would be to have two separated limits: one (with a low default, like 64Kb) for locked memory writable by user space, and another (with a larger default, like 8Mb, because it is safe anyway) for locked memory writable only by the kernel.
Posted Nov 23, 2021 15:43 UTC (Tue) by stevie-oh (subscriber, #130795)
Why isn't this a problem for applications that aren't using io_uring?
What if I issue a `read` call with an 8MB buffer?
What if I create eight new threads and have each thread issue a `read` call into an 8MB buffer? That's 64MB right there.
As I understand it, from a computational standpoint, using io_uring does not increase a program's capabilities. In the same way that the bulk of the Linux kernel could be written in Brainf*ck instead of C, any program implemented using io_uring could accomplish the same sort of thing with multiple threads and ordinary calls to `read`/`write`/etc.
Posted Nov 23, 2021 16:21 UTC (Tue) by corbet (editor, #1)
A basic read() call does not lock the buffer into memory most of the time. If you are doing some sort of direct I/O, the buffer may indeed be locked into memory but (as the article says) that lock will only exist for the duration of the I/O operation and is not charged against the locked-memory limit. Registered buffers, instead, remain locked in memory indefinitely, thus need to be treated differently.
Posted Dec 10, 2021 8:42 UTC (Fri) by njs (subscriber, #40338)
Posted Nov 23, 2021 16:27 UTC (Tue) by david.hildenbrand (subscriber, #108299)
So there is an additional long-term reference to a page that is not due to the page being mapped into a user space page table.
While we have such a reference on a page, that page cannot be swapped out ("locked into memory"), just as with mlock. But worse, it cannot be migrated/moved around in memory anymore, meaning that it can prevent memory compaction. Possibly forever.
It behaves at least as badly as mlock'ed memory, so, for now, we account it against RLIMIT_MEMLOCK. But really, it's much worse.
The only reason io_uring uses FOLL_LONGTERM is because it's faster than alternatives that are "nice" to the MM subsystem and can actually drop references temporarily to re-take them when necessary later.
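In kernel terms, such a long-term pin looks roughly like this (a much-simplified sketch, not io_uring's actual registration code; user_addr and nr_pages are assumed context):

    struct page **pages;
    int ret;

    /* Take long-term references on the user pages backing a buffer.
     * Until unpin_user_pages() runs, these pages can be neither
     * swapped out nor migrated. */
    pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
    if (!pages)
            return -ENOMEM;

    ret = pin_user_pages_fast(user_addr, nr_pages,
                              FOLL_WRITE | FOLL_LONGTERM, pages);
    if (ret < 0)
            goto out_free;

    /* ... the buffer may now be used for I/O indefinitely ... */

    unpin_user_pages(pages, nr_pages);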
a) Why isn't this a problem for applications that aren't using io_uring?
We barely have features that expose FOLL_LONGTERM to unprivileged user space (for a good reason if you ask me).
b) What if I issue a `read` call with an 8MB buffer?
IIUC, a read with O_DIRECT will take a short-term reference for DMA and pass the pages to the HW for a very short time frame, then release the page reference. Without O_DIRECT, we simply copy the result to the target user space page from an in-kernel buffer. (A sketch of the direct-I/O case follows this list.)
c) What if I create eight new threads and have each thread issue a `read` call into an 8MB buffer? That's 64MB right there.
At least the 8MB user-space buffer exists only once.
d) "using io_uring does not increase a program's capabilities"
In regards to fixed buffers I think you are correct. It's a pure performance improvement.
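For point b), a sketch of such a direct-I/O read from user space (the 4096-byte alignment is illustrative; actual alignment requirements are device-dependent):

    #define _GNU_SOURCE         /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* The kernel pins the buffer's pages only while this one read()
     * is in flight; nothing is charged to RLIMIT_MEMLOCK. */
    ssize_t direct_read(const char *path, size_t len)
    {
        void *buf;
        ssize_t n;
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0)
            return -1;
        if (posix_memalign(&buf, 4096, len) != 0) {
            close(fd);
            return -1;
        }
        n = read(fd, buf, len);     /* pages pinned only for this call */
        free(buf);
        close(fd);
        return n;
    }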
Posted Nov 25, 2021 0:45 UTC (Thu) by Fowl (subscriber, #65667)
Posted Nov 25, 2021 8:15 UTC (Thu) by david.hildenbrand (subscriber, #108299)
It should be as simple as handling (and logging/warning users about) errors from IORING_REGISTER_BUFFERS and then, instead of issuing, e.g., an IORING_OP_READ_FIXED/IORING_OP_WRITE_FIXED, issuing an IORING_OP_READ/IORING_OP_WRITE. Of course, further buffer index management in user space has to consider that the buffer registration previously failed.
A fallback inside the kernel itself would also be imaginable; however, this rather hides the error from the user and might bring surprises: not everybody wants a silent fallback.
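A sketch of that user-space fallback (using liburing; the function names and error handling here are illustrative):

    #include <liburing.h>
    #include <stdio.h>

    static int have_fixed_buffers;

    /* Try to register buffers; on failure (e.g. due to RLIMIT_MEMLOCK),
     * remember to use the non-fixed operations instead. */
    void setup_buffers(struct io_uring *ring, struct iovec *iovs, unsigned n)
    {
        if (io_uring_register_buffers(ring, iovs, n) == 0)
            have_fixed_buffers = 1;
        else
            fprintf(stderr, "buffer registration failed; using non-fixed I/O\n");
    }

    void queue_read(struct io_uring *ring, int fd, struct iovec *iov, int idx)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (have_fixed_buffers)
            /* IORING_OP_READ_FIXED under the hood */
            io_uring_prep_read_fixed(sqe, fd, iov->iov_base,
                                     iov->iov_len, 0, idx);
        else
            /* IORING_OP_READ: same I/O, without the pinned buffer */
            io_uring_prep_read(sqe, fd, iov->iov_base, iov->iov_len, 0);
    }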
Posted Nov 25, 2021 18:59 UTC (Thu) by andresfreund (subscriber, #69562)
Posted Nov 26, 2021 8:43 UTC (Fri) by david.hildenbrand (subscriber, #108299)
The man page mentions for IORING_REGISTER_BUFFERS: "After a successful call, the supplied buffers are mapped into the kernel and eligible for I/O. To make use of them, the application must specify the IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED opcodes in the submission queue entry ...". No notion of IOSQE_BUFFER_SELECT.
IOSQE_BUFFER_SELECT in contrast is "Used in conjunction with the IORING_OP_PROVIDE_BUFFERS command".
Digging through the code, I can spot that io_import_fixed() is really only used for IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED. IOSQE_BUFFER_SELECT references only buffers in the "IOSQE_BUFFER_SELECT" ("io_buffers") domain, not in the "IORING_REGISTER_BUFFERS" ("user_bufs") domain.
To be precise (include/uapi/linux/io_uring.h):

    union {
            /* index into fixed buffers, if used */
            __u16   buf_index;
            /* for grouped buffer selection */
            __u16   buf_group;
    }

IOSQE_BUFFER_SELECT effectively uses the same member, but it's interpreted as a buffer group selection, not a fixed buffer selection.
Posted Nov 23, 2021 19:40 UTC (Tue) by flussence (guest, #85566)
Visible arbitrary limits like this are stable ABI, whether we like it or not. By increasing them you subconsciously train developers to take that number for granted and grow their software towards it, and we've seen the end result of that in this area with glibc's gigantic thread stacks leading to supposedly portable software not working in musl. We need good backpressure mechanisms, not more bufferbloat.
Posted Nov 26, 2021 18:44 UTC (Fri) by Hattifnattar (subscriber, #93737)
Then have a 3-tier configuration system, where distributions and users would not generally _replace_ config files, but rather supplement/override values in them via separate (config-level and user-level) configs.
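As a concrete illustration of that override style (file names hypothetical), an administrator can already raise this particular limit with small fragments rather than by replacing any shipped file:

    # /etc/systemd/system.conf.d/memlock.conf -- drop-in raising the
    # default locked-memory limit for services
    [Manager]
    DefaultLimitMEMLOCK=8M

    # /etc/security/limits.d/memlock.conf -- for login sessions;
    # values are in kilobytes
    *   soft    memlock     8192
    *   hard    memlock     8192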
Posted Nov 26, 2021 19:18 UTC (Fri) by Wol (subscriber, #4433)
But your idea should be implemented everywhere. Systemd uses that idea - I know it has system and sysadmin config files (and user ones where appropriate).
But so many programs have a monolithic config file. Running gentoo, I have to be careful that a config-file update doesn't trash my configuration. I made the mistake of editing the Dovecot config file, and then an update trashed everything. Only after that :-) did I discover that dovecot imports a local config, which gentoo obviously doesn't provide, so now all my config is in that ... :-)
But how many other programs are that sensible?
Cheers,
Wol
Posted Dec 2, 2021 21:19 UTC (Thu) by mezcalero (subscriber, #45103)
Lennart