
Two new ways to read a file quickly

By Jonathan Corbet
March 6, 2020
System calls on Linux are relatively cheap, though the mitigations for speculative-execution vulnerabilities have made them more expensive than they once were. But even cheap system calls add up if one has to make a large number of them. Thus, developers have been working on ways to avoid system calls for a long time. Currently under discussion is a pair of ways to reduce the number of system calls required to read a file's contents, one of which is rather simpler than the other.

readfile()

LWN recently looked at the proposed fsinfo() system call, which is intended to return information about mounted filesystems to an application. One branch of the discussion delved into whether that information could be exported via sysfs instead; one concern that was expressed about that approach is that the cost of reading a lot of little files would be too high. Miklos Szeredi argued that it would not be, but suggested that, if people were concerned, they could reduce this cost by introducing a new system call to read the contents of a file:

    ssize_t readfile(int dfd, const char *path, char *buf, size_t bufsize, 
		     int flags);

The dfd and path arguments would identify a file in the usual way. A successful readfile() would read the contents of the indicated file into buf, up to a maximum of bufsize bytes, returning the number of bytes actually read. On its face, readfile() adds nothing new; an application can obtain the same result with calls to openat(), read(), and close(). But it reduces the number of system calls required from three to one, and that turns out to be interesting to some users.
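
For illustration, the userspace fallback looks something like this; a minimal sketch that abbreviates error handling and, unlike a careful implementation, does not loop on short reads:

    #include <fcntl.h>
    #include <unistd.h>

    /* Three system calls where readfile() would need only one. */
    static ssize_t readfile_fallback(int dfd, const char *path,
                                     char *buf, size_t bufsize)
    {
        int fd = openat(dfd, path, O_RDONLY);
        if (fd < 0)
            return -1;
        ssize_t n = read(fd, buf, bufsize);
        close(fd);
        return n;
    }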

In particular, Karel Zak, the maintainer of the util-linux project, offered "many many beers" for the implementation of readfile(). Many of the utilities in util-linux, along with tools like ps and top from the procps project, spend a lot of time reading information from small /proc and sysfs files; having a readfile() call would make them quite a bit more efficient.

People who complain that it's hard to get kernel developers to pay attention to their problems have clearly missed an important technique; Greg Kroah-Hartman quickly responded with enthusiasm: "Unlimited beers for a 21-line kernel patch? Sign me up!". He provided a first implementation, and went on to say that this system call might actually make sense to have. Naturally, the patch grew past 21 lines once all of the details that needed to be taken into account were dealt with, and there is still a manual page to write. But it seems likely that readfile() will be formally submitted in the near future.

Of course, some people are already talking about the need for a writefile() as well.

readfile() on the ring

As the conversation progressed, Jann Horn pointed out that the developers working on io_uring have also expressed interest in adding a readfile()-like capability. The whole point of io_uring is to be able to perform system-call actions asynchronously and without having to actually call into the kernel, so it might seem like a good fit for this use case. He did note that truly supporting that feature in io_uring is "a bit complicated", since there is no way to propagate a file descriptor returned by openat() to a subsequent read() operation queued in the ring. Without that, the read() cannot be queued until after the file has been opened, defeating the purpose of the exercise.

The fact of the matter, though, is that "a bit complicated" is a good description of io_uring in general. It seems unlikely that the author of a tool like ps is going to want to go through all of the effort needed to set up an io_uring instance, map it into the address space, queue some operations, and start things running just to avoid some system calls when reading /proc. But the developers of other, more complex applications would, it seems, like to have this sort of capability.

In short order, perhaps in the hope of tapping into that "unlimited beer" stream, io_uring maintainer Jens Axboe posted a patch that fills in the missing piece. It works by remembering the file descriptor returned by the last openat() call in a given chain of operations. To implement a readfile(), an application could set up an io_uring chain with three operations, corresponding to the openat(), read(), and close() calls. For the latter two, though, the usual file-descriptor argument would be provided as the special constant IOSQE_FD_LAST_OPEN, which would be replaced by the descriptor for the last opened file when the operation executes.
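
In liburing terms, such a chain might be built roughly as follows. The preparation helpers shown do exist in liburing, but IOSQE_FD_LAST_OPEN comes from the proposed patch and never landed in a mainline kernel, so treat this as a sketch of the idea rather than working code:

    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_openat(sqe, AT_FDCWD, "/proc/uptime", O_RDONLY, 0);
    sqe->flags |= IOSQE_IO_LINK;          /* chain to the next request */

    sqe = io_uring_get_sqe(&ring);
    /* The magic constant stands in for whatever descriptor the
       openat() above eventually returns. */
    io_uring_prep_read(sqe, IOSQE_FD_LAST_OPEN, buf, sizeof(buf), 0);
    sqe->flags |= IOSQE_IO_LINK;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_close(sqe, IOSQE_FD_LAST_OPEN);

    io_uring_submit(&ring);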

This approach works, at the cost of complicating the interface and implementation with the magic file-descriptor substitution. Josh Triplett had a different idea, which he first posted in an LWN comment in January: let applications specify which file descriptor they would like to use when opening a file. He filled out that idea in March with a patch series adding a new O_SPECIFIC_FD flag to the openat2() system call. This feature is available independently of io_uring; if an application really wants to open a file on descriptor 42 and no other, the new flag makes that possible.
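
What that looks like in practice is still in flux; the sketch below assumes the desired descriptor is passed in a new field of struct open_how, and is an illustration of the proposal rather than a merged API (neither O_SPECIFIC_FD nor the fd field exists in mainline kernels):

    struct open_how how = {
        .flags = O_RDONLY | O_SPECIFIC_FD,  /* proposed flag */
        .fd    = 42,                        /* hypothetical field */
    };
    int fd = syscall(SYS_openat2, AT_FDCWD, "/proc/uptime", &how, sizeof(how));
    /* On success fd == 42; if descriptor 42 is already open, the
       call fails with EBUSY rather than silently replacing it. */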

The patch set also adds a new prctl() operation to set the minimum file descriptor to use when the application has not requested a specific one. This minimum defaults to zero, preserving the "lowest available descriptor" semantics that Unix has guaranteed forever. A developer wanting to control the file descriptors used could raise this minimum and know that the kernel would not use any of the descriptors below that minimum without an explicit request.
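
So a process that wants, say, descriptors 3 through 1023 for its own explicit use could raise the minimum once at startup. The operation name below is a placeholder; the actual constant comes from the unmerged patch set:

    prctl(PR_SET_FD_MIN, 1024, 0, 0, 0);   /* placeholder name */
    /* From here on the kernel allocates descriptors starting at 1024
       unless a lower number is explicitly requested. */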

It only took Axboe about three hours to come up with a new patch series integrating this work. It mostly consists of delaying the checks of file-descriptor validity so that they don't happen ahead of the call that actually makes a given descriptor valid. There seems to be a general agreement that this approach makes more sense than magic file-descriptor substitution, so this is the version that seems likely to go ahead.

At this point, though, this work has only circulated on the io_uring list, which has a relatively limited readership. Axboe has said that he plans to post it more generally in the near future, and that merging for 5.7 is within the realm of possibility. So it may well be that there will soon be two different ways for an application to read the contents of a file with a minimum of system calls — and Karel Zak may end up buying a lot of beer.

Index entries for this article
Kernel: io_uring
Kernel: System calls/readfile()



Two new ways to read a file quickly

Posted Mar 6, 2020 16:16 UTC (Fri) by mezcalero (subscriber, #45103) [Link] (50 responses)

Hmm, how is one actually supposed to use O_SPECIFIC_FD in real life though? I mean, picking some specific fd number can work in trivial, single-threaded programs maybe, but how is that supposed to work in typical generic userspace code that has many threads and many libraries/subsystems all running in the same address space (and fd space), all wanting to take possession of some fd? I mean, are apps supposed to block fd ranges ahead of time, by dup()ing /dev/null a couple of times, or how is that supposed to work? Not getting this...

Two new ways to read a file quickly

Posted Mar 6, 2020 17:32 UTC (Fri) by axboe (subscriber, #904) [Link] (39 responses)

The simple use case can be done by raising the min fd as described in the article. For slightly more complicated cases, the app could manage that space (between the standard descriptors 0-2 and min_fd). Outside of that, one idea bounced around was fd reservation, which essentially ends up being the same thing, except the kernel now manages the reservations for you.

Two new ways to read a file quickly

Posted Mar 6, 2020 20:18 UTC (Fri) by mezcalero (subscriber, #45103) [Link] (37 responses)

Quite frankly this doesn't sound thought through to the end to me. Having to have a userspace pre-allocator/fd-arena manager requires complex code that glibc would have to carry to be useful (since every lib, every subsystem loaded into the local process has to buy into this for it to work). Otherwise everyone is going to step on each other's toes here. I am pretty sure picking a numeric fd is something the kernel should do; it's kinda its primary job: allocating objects...

This concept makes things easy in kernel space maybe, but is not manageable in non-trivial real-life apps in userspace, I am sure.

I mean, it's a bit like rtsigs: almost no one uses them, because there are no allocation semantics defined, and thus library code could never safely, generically, reliably make use of them. rtsigs are a failure as an I/O concept in hindsight, and I think this plays a major role in that; I don't think it would be a good idea to repeat this with fds now. It's kinda like picking the semantics of one of the crappiest kernel concepts we have (rtsigs) and applying them to one of the saner kernel concepts we have (fds), just because it's a bit easier to implement in the kernel (at the cost of much heavier userspace work).

Or to say this differently: few people would think that a userspace malloc() implementation that takes the memory pointer value as input (rather than as output) would be a good idea, I am sure.

Two new ways to read a file quickly

Posted Mar 6, 2020 20:46 UTC (Fri) by axboe (subscriber, #904) [Link] (11 responses)

I don't generally disagree with you. 1) The concept isn't finalized at all, and 2) as mentioned at the end, having the kernel manage the allocations was indeed one of the suggestions.

Two new ways to read a file quickly

Posted Mar 6, 2020 21:25 UTC (Fri) by willy (subscriber, #9762) [Link] (10 responses)

I'd like to discuss this on linux-fsdevel, but in lieu of that ...

I think the right way to do this is to have userspace open /dev/null as often as it needs to in order to create the fds it will need. Then use those fd #s in the io_uring calls.
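
(A minimal sketch of that reservation step, relying on the kernel's lowest-available-descriptor rule and assuming no other thread opens files concurrently:)

    /* Claim a contiguous block of descriptor numbers for later use
       in io_uring requests. */
    int reserved[16];
    for (int i = 0; i < 16; i++)
        reserved[i] = open("/dev/null", O_RDONLY);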

Two new ways to read a file quickly

Posted Mar 6, 2020 23:58 UTC (Fri) by jlayton (subscriber, #31672) [Link] (6 responses)

I've been back and forth on it, but I think leaving fd allocation to the kernel is ultimately the right thing to do. Very few applications actually care _what_ fd they end up getting, and trying to do anything else is going to make it hard to eliminate competing users. This functionality will almost certainly end up in certain libraries after all so you would need a standard allocator of some sort.

Mainly, io_uring needs to be able to specify that a subsequent read (or write, or whatever) use the fd from an open done earlier in the same chain. I think just being able to express "use the fd from last open" would be super useful for about 90% of use cases, and you could always layer on another way to deal with multiple fds later.

Two new ways to read a file quickly

Posted Mar 7, 2020 6:33 UTC (Sat) by ncm (guest, #165) [Link] (5 responses)

In multi-threaded programs (which do exist), the concept of a "last-opened file descriptor" is entirely meaningless. How can anyone think this would be a good idea?

Two new ways to read a file quickly

Posted Mar 7, 2020 10:39 UTC (Sat) by intgr (subscriber, #39733) [Link] (4 responses)

But in the context of a single io_uring command queue, "last-opened file descriptor" seems perfectly well defined.

Two new ways to read a file quickly

Posted Mar 7, 2020 17:33 UTC (Sat) by justincormack (subscriber, #70439) [Link] (3 responses)

Not if you open lots of files asynchronously. You need to identify which open you meant.

Two new ways to read a file quickly

Posted Mar 7, 2020 18:01 UTC (Sat) by nivedita76 (subscriber, #121790) [Link] (2 responses)

A linked sequence of io_uring operations provides that. If you need to operate on lots of files, you have lots of linked sequences, each of which opens one file and operates on it.

Two new ways to read a file quickly

Posted Feb 20, 2024 13:40 UTC (Tue) by sammythesnake (guest, #17693) [Link] (1 responses)

Sorry for the thread necromancy, but...

Isn't operating on thousands of files in one io_uring linked sequence exactly the kind of thing some applications would like to do, reducing thousands of syscalls to a couple...?

Would some kind of per-io_uring-linked-sequence "pseudo-FD" make sense? In/alongside open operations, you could provide a number (1, 2, 3...) for each file opened in the sequence that the kernel transparently maps to "real" FDs internally. Later operations could then identify which of the files opened within the sequence should be acted on (e.g. "read the file *this sequence* calls '1'". Maybe with negative FD numbers...?)

The pFD *could* be sequentially allocated so subsequent calls would simply say "the third one opened" but keeping those straight while editing the sequence would be error-prone, so that's probably not a win over finding a way to nominate a pFD.

Obviously, there are details to sort out, like managing the pFD->FD mappings and getting the API right, but none of that sounds nastier than the other things suggested in this thread (to me, at least - I'm merely a curious bystander!)

This is presumably a very naive question, but can't an io_uring open() operation save the FD it returns in a look-up table to be referenced by later operations - that would seem the "obvious" way to me, but I assume this isn't possible, or this whole thread would be moot...

Two new ways to read a file quickly

Posted Feb 20, 2024 16:32 UTC (Tue) by kleptog (subscriber, #1183) [Link]

I think you have described the descriptorless FDs: https://lwn.net/Articles/863071/

Two new ways to read a file quickly

Posted Mar 7, 2020 13:34 UTC (Sat) by josh (subscriber, #17465) [Link] (2 responses)

I don't think that opening /dev/null is the right approach here, for multiple reasons:

1) The block of reserved fds shouldn't actually consist of open file descriptors that take up resources, especially if we may want to have a block of reserved fds per thread.
2) If the only thing keeping an fd reserved is that it has an open file on it, then once that fd is opened and subsequently closed, it stops being reserved. The fd should stay reserved after being closed.
3) O_SPECIFIC_FD specifically doesn't allow opening "over" an existing open file descriptor the way dup2 does; it'll return -EBUSY. I felt that would be less error-prone, and would help catch races.

Two new ways to read a file quickly

Posted Mar 8, 2020 15:08 UTC (Sun) by pbonzini (subscriber, #60935) [Link] (1 responses)

Regarding 3, I think that's a mistake: O_SPECIFIC_FD to open and dup2() to close seems like a very easy way to manage pre-reserved file descriptors, even without the prctl().

Two new ways to read a file quickly

Posted Mar 8, 2020 15:14 UTC (Sun) by josh (subscriber, #17465) [Link]

Perhaps we could have a specialized "reserved" fd type, use that instead of opening /dev/null, and only allow "overwriting" that. It would help to have a guaranteed contiguous chunk of fds to allocate out of, though.

Two new ways to read a file quickly

Posted Mar 7, 2020 2:53 UTC (Sat) by ploxiln (subscriber, #58395) [Link] (24 responses)

Programs like ps, top, grep, and even nginx or mariadb, could benefit from this kind of thing, while not linking many "miscellaneous" libraries. Not all programs should be huge agglomerations like systemd or libreoffice or gnome-shell, in fact most programs shouldn't. And "miscellaneous" libraries should not be using these advanced APIs automatically anyway, they should always be under explicit control of the main program. You would never want libdbus or libsystemd or glib2 to use these advanced interfaces automatically. (Of course, systemd does seem to want to do every new whiz-bang thing automatically, default enabled, and then when it causes problems it's always someone else's fault...)

Two new ways to read a file quickly

Posted Mar 7, 2020 19:03 UTC (Sat) by dezgeg (subscriber, #92243) [Link] (13 responses)

> Programs like ps, top, grep, and even nginx or mariadb, could benefit from this kind of thing

Well what if you want to implement (or use) the ps command but as a C library?

> And "miscellaneous" libraries should not be using these advanced APIs automatically anyway, they should always be under explicit control of the main program.

Why not? As a potential user of a ps-as-a-C-library, why would I care what it does internally? And what classifies as an "advanced API" in contrast to a "non-advanced API" anyway?

Two new ways to read a file quickly

Posted Mar 7, 2020 19:44 UTC (Sat) by ploxiln (subscriber, #58395) [Link] (12 responses)

> As a potential user of a ps-as-a-C-library, why would I care what it does internally?

You should care about what every library you use does internally. Every single library you use brings security, compatibility, efficiency, and general tech-debt burden.

ps-as-a-library would need the old open/read/close method for compatibility with older kernels, and that method should be the default for newer kernels too. A user who knows what they are doing could enable the super-efficient mode explicitly, by passing a range of file descriptors they want ps-as-a-library to use for it.

> And what classifies as an "advanced API" in contrast to a "non-advanced API" anyway?

Anything which needs to be managed in a process-global manner. Signals are another example. Libraries may suggest the user register some signal handlers for them, or offer to register signal handlers if the user explicitly requests it.

zlib has no interest in this, nor libjpeg ... openssl takes sockets you create and open yourself, and they can even be non-blocking. The opening and closing of file descriptors is up to you, not the library. Or, some libraries offer helpers to open and close files or connections, which you have to call explicitly.

You could use a library which makes a mess - but don't.

Two new ways to read a file quickly

Posted Mar 7, 2020 20:19 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)

> ps-as-a-library would need the old open/read/close method for compatibility with older kernels, and that method should be the default for newer kernels too.
Why? Compat code (to be eventually removed) is fine. But why wouldn't a library use the new functionality?

> A user which knows what they are doing could enable the super-efficient mode explicitly, by passing a range of file descriptors they want ps-as-a-library to use for it.
If a library can do it transparently then why not do it?

Two new ways to read a file quickly

Posted Mar 8, 2020 19:59 UTC (Sun) by ibukanov (subscriber, #3942) [Link] (10 responses)

> to be eventually removed

My past experience tells me that the new and shiny interface is more likely to be removed or deprecated (to be replaced by an even newer and shinier interface) than something that has survived for ages and still does the job.

Case in point is epoll. If I have old code that uses poll, I would not bother to replace that with epoll; I would rather go straight to io_uring.

Two new ways to read a file quickly

Posted Mar 8, 2020 21:37 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

> My past experience tells that most likely it is the new and shiny interface that will be removed or deprecated (to be replaced by even a newer and shinier interface) than something that is survived for ages and still does the job.
Why not select() then, to be maximally traditional?

Traditional UNIX interfaces are badly designed, on all fronts (process management, file management, AIO, you name it). And traditional design affected the replacement interfaces as well: poll (and epoll), AIO, etc. These days thankfully the attitude is "screw UNIX and design something that makes sense".

This is definitely refreshing.

Two new ways to read a file quickly

Posted Mar 9, 2020 9:39 UTC (Mon) by ibukanov (subscriber, #3942) [Link] (8 responses)

It does not matter that the interface may look badly designed from the present moment. The only thing that matters is that it is still used after 30 years, meaning it is good enough across a huge range of hardware.

And if an application uses select() and it shows its limitations, I will skip poll() when patching it and, depending on the target, opt for epoll or io_uring. Which shows the same rule: a new thing is more likely to be replaced by an even newer thing than an old thing that still works acceptably is to stop being suitable.

Essentially the new thing is optimized for the present moment. We do not know how relevant that optimization will be in the future. The old thing, by surviving and still working, shows that when it was created it was not over-optimized.

Two new ways to read a file quickly

Posted Mar 9, 2020 16:34 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

> Essentially the new thing is optimized for the present moment. We do not know how relevant that optimization will be in future.
The thing is, readfile() is inherently more optimizable than the open()/read()/close() sequence. And it simply can't be slower than them.

Two new ways to read a file quickly

Posted Mar 9, 2020 21:11 UTC (Mon) by ibukanov (subscriber, #3942) [Link] (6 responses)

> readfile() is inherently more optimizable

It is more optimizable in the context of the present hardware and current kernel code. The RISC architecture is based on an assumption that a small number of fast operations is better than a big set of complex ones. So it could be that on future hardware one could have a small set of super-fast syscalls. Then readfile implemented as a syscall would be a liability.

Two new ways to read a file quickly

Posted Mar 9, 2020 21:13 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> It is more optimizable in context of the present hardware and current kernel code.
Incorrect. It'll ALWAYS be more optimizable. In the worst case it'll be no worse than the open/read/close sequence.

The main potential for optimization is for networked filesystems where one readfile() request can easily save a couple of roundtrips.

Two new ways to read a file quickly

Posted Mar 17, 2020 16:18 UTC (Tue) by nix (subscriber, #2304) [Link] (4 responses)

ibukanov just gave you a plausible example of a situation in which they are not incorrect: in which common syscalls like read/write/open etc are ultrafast, while uncommon ones like readfile remain slow. Oops, now it's less optimizable.

Do you even read what you're responding to?

Two new ways to read a file quickly

Posted Mar 17, 2020 16:40 UTC (Tue) by mebrown (subscriber, #7960) [Link] (2 responses)

Did you notice he used the word "optimizable"? Paying careful attention to that suffix: -able, can you explain a future world that exists where it's not possible to optimize one system call to be faster than three system calls?

Two new ways to read a file quickly

Posted Mar 17, 2020 22:50 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

I just did: a potential world in which there is a small set of fast syscalls (perhaps architectural limitations prevent the set becoming larger), and a larger set of slow ones. open() and read() would almost certainly be in the fast set; readfile(), being rarely used to date, would surely be in the slow set. If syscall entry/exit in the fast set is at least three times faster than from the slow set, there's nothing you can do to make that one slow syscall faster than the three fast ones, as long as they're doing remotely the same work.

(Similar things have existed before, and will again: the vDSO is one such, as is the fast syscall optimization in Solaris on SPARC.)

Two new ways to read a file quickly

Posted Mar 18, 2020 10:10 UTC (Wed) by farnz (subscriber, #17727) [Link]

Except that for such a potential world to be worth considering, you need to explain how it's plausible.

The "fast syscall optimization" in Solaris on SPARC used the fact that SPARC has 128 syscall entry points in the hardware to optimize up to 128 syscalls - that's over a third of Linux syscalls, more if you ignore all the legacy syscalls (as Solaris could, since it could do the translation from legacy to current in libc). It only had such a drastic effect in Solaris since the "fast" syscalls didn't make use of the generic ABI translation at syscall entry that Solaris chose to do to simplify syscall implementation - in other words, it worked around a known software deficiency in Solaris, stemming from their desire to use the same SunStudio compiler and ABI for all code, rather than teaching SunStudio to have a syscall ABI for kernel code to use.

The vDSO isn't about syscalls per se; the vDSO page is a userspace page that happens to be shared with the kernel, and contains userspace code and data from the kernel, allowing you to completely avoid making a syscall.

Remember that, at heart, syscalls are four machine micro-operations sequenced sensibly; everything else is built on top of this:

  1. Save the current privilege level, so that you can restore it on return.
  2. Save the next PC so that you can return back here.
  3. Set the current privilege level.
  4. Set PC to a syscall entry point.

Any optimization in hardware that leads to a subset of syscalls being faster has to be in the last micro-operation; all the others are common to all syscalls. The only such optimization that's possible is to have alternate syscall entry points for different syscalls; this is what the SPARC trap system does, using a 128 entry trap table to decide which syscall entry point to use.

Note, too, that the tendency over time is to optimize the hardware with a single syscall entry point, since that's just a single pointer-sized piece of data to track; Intel 8008 through to 80286 only had INT for syscalls, 80386 added call gates, while Pentium II added SYSENTER which only has a single possible entry point. Similarly, ARM, MIPS, POWER, PowerPC, RISC-V, and AArch64 all only have a single instruction to do syscalls that goes to a single syscall entry point (albeit that on POWER, PowerPC, ARM, and AArch64, that instruction also includes a limited amount of data that's supplied to the kernel, intended for use as a syscall number).

SPARC is the one exception to the rule that more modern architectures only have a single syscall entry point, with its trap table of 128 entries, and even then, it was only a performance win because Solaris was able to use the trap table to get around its own earlier bad decisions around syscall handling.

Two new ways to read a file quickly

Posted Mar 17, 2020 16:53 UTC (Tue) by farnz (subscriber, #17727) [Link]

Except that that's an implausible situation, based on the hardware of the last 50 (yes, 50!) years.

The trend has been towards fewer system call instructions, not more, over time. In the 1970s, you had things like the 8008's RST instructions, which gave you a small number of fast system calls. RISC CPUs have tended to have just a single syscall type instruction (sc/svc in PowerPC/POWER, SVC in AArch64, SWI in AArch32, syscall in MIPS), with the exception of SPARC, whose trap instructions allowed you to specify different trap handlers directly.

In modern x86, the SYSENTER/SYSCALL instructions are also a single option - there's no "fast path" included here at all.

Now, AArch32, AArch64, POWER/PowerPC, and VAX all have an argument supplied as part of the syscall instruction itself, but it's literally just an argument. It doesn't point you to a new trap handler, it's just an argument to the handler.

Two new ways to read a file quickly

Posted Mar 8, 2020 7:55 UTC (Sun) by mm7323 (subscriber, #87386) [Link] (9 responses)

> Programs like ps, top, grep ...

But is anyone actually complaining or concerned about poor performance of these programs?

Two new ways to read a file quickly

Posted Mar 8, 2020 11:29 UTC (Sun) by mpr22 (subscriber, #60784) [Link] (5 responses)

top's job is to inspect system performance, so it should itself be performant to minimize its impact on system performance.

And I'm sure plenty of people care about grep running slower than it needs to.

Two new ways to read a file quickly

Posted Mar 8, 2020 17:00 UTC (Sun) by mm7323 (subscriber, #87386) [Link] (4 responses)

> And I'm sure plenty of people care about grep running slower than it needs to.

grep on large files should be I/O bound, and grep on a small file is surely overshadowed by process startup time rather than an extra system call to get the file contents into a buffer.

I've also never noticed top negatively impacting system performance; even the busybox version on little embedded systems has never caused me a problem or disappointed.

Generically allowing system-call batching is a good idea, but personally I'm less convinced by esoteric system-calls for specific and limited use cases.

Two new ways to read a file quickly

Posted Mar 8, 2020 18:06 UTC (Sun) by andresfreund (subscriber, #69562) [Link] (1 responses)

I have seen top use a noticeable fraction of CPU time. You're not gonna hit that on a 2 core system with 30 idling processes. But a busy 64 core system with a few thousand processes is a different story.

Two new ways to read a file quickly

Posted Mar 8, 2020 23:42 UTC (Sun) by himi (subscriber, #340) [Link]

Very much so, and even in less intrusive cases it introduces a spike in instantaneous load that can affect interactive and time-sensitive processes (games, audio, that kind of thing).

And the general principle that measuring the system should have as little impact as possible on the properties being measured definitely holds.

Two new ways to read a file quickly

Posted Mar 9, 2020 11:05 UTC (Mon) by Sesse (subscriber, #53779) [Link]

grep on a small file, sure, but what about grep on many small files (e.g. grep -r foo /usr/include)?

Two new ways to read a file quickly

Posted Mar 9, 2020 22:00 UTC (Mon) by roc (subscriber, #30627) [Link]

I grep through large source trees all the time --- several gigabytes of data spread over thousands or millions of files, all cached in RAM so not I/O bound.

Two new ways to read a file quickly

Posted Mar 8, 2020 20:02 UTC (Sun) by excors (subscriber, #95769) [Link]

Many people do care about grep being slow, and care enough to rewrite it from scratch. Even ack is advertised as being faster than grep (mostly by filtering the list of files to search, e.g. skipping .git directories by default), and ag is advertised as "an order of magnitude faster than ack" (because of multithreading, mmap() instead of read(), faster regex engine, etc), and ripgrep is advertised as being much faster than ag (better multithreading, both mmap() and read() in different situations, better regex engine, etc).

There's no point optimising the kernel for grep itself, because grep could be improved by maybe two orders of magnitude with purely application changes; but it might be worth considering whether kernel changes could improve the performance of ripgrep which has already taken most of the low-hanging fruit.

Two new ways to read a file quickly

Posted Mar 8, 2020 23:51 UTC (Sun) by himi (subscriber, #340) [Link]

The point of readfile() in the context of ps and the like is that it's opening, reading the contents of, and closing lots of small files - that's a lot of small operations performed on the filesystem, which in cases like a clustered filesystem could be translated into things like a lot of distributed lock operations and metadata operations, which can have major impacts on filesystem performance. Using a single syscall won't get rid of all that, but depending on the implementation it could represent a significant improvement.

Two new ways to read a file quickly

Posted Mar 17, 2020 16:38 UTC (Tue) by mebrown (subscriber, #7960) [Link]

I am. I have an embedded system that I need to monitor for performance issues and I'd rather not my performance monitoring tools cause performance issues!

In practice I observe that current implementations of top use a noticeable percentage of CPU, which can throw off my observations.

Two new ways to read a file quickly

Posted Mar 8, 2020 15:06 UTC (Sun) by pbonzini (subscriber, #60935) [Link]

Is there a dup2 io_uring operation that can be used instead of close?

Two new ways to read a file quickly

Posted Mar 6, 2020 17:39 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

I would guess that highly-parallel code would use some kind of allocator that would divide the descriptor space into thread-local arenas.

Two new ways to read a file quickly

Posted Mar 6, 2020 18:28 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 responses)

So... it's simpler to implement tcfdalloc() in userspace than to just let io_uring do the "magic fd substitution" in kernelspace? Maybe I just don't understand what we mean by "simplicity."

Two new ways to read a file quickly

Posted Mar 6, 2020 18:42 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

The dense structure of file descriptors is actually becoming a problem in its own right. Having a way to opt out of it is a nice feature in itself.

Two new ways to read a file quickly

Posted Mar 6, 2020 17:46 UTC (Fri) by flussence (guest, #85566) [Link] (2 responses)

Bash, and I'd guess other POSIX shells, currently requires an extra dup2+close to force file handles to have a number other than the kernel-assigned one. That happens for every < or > redirect except when the command line is lucky enough to ask for the same number the kernel gives it (which never happens). An average autoconf script contains about 1500 of these, so it adds up.
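
(In C, the dance for a redirect like 5>out.log looks roughly like this; a flag along the lines of O_SPECIFIC_FD would collapse it into the open() itself:)

    int tmp = open("out.log", O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (tmp != 5) {
        dup2(tmp, 5);   /* move the file to the requested descriptor */
        close(tmp);
    }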

Two new ways to read a file quickly

Posted Mar 6, 2020 21:45 UTC (Fri) by nivedita76 (subscriber, #121790) [Link] (1 responses)

Does it really? I thought the configure script opens the file handles it wants upfront, eg attaching 5 to config.log, and thereafter the commands that want to log just do >&5, which should just be a dup2 call?

Two new ways to read a file quickly

Posted Mar 6, 2020 23:06 UTC (Fri) by flussence (guest, #85566) [Link]

You're right, it looks like they already thought of this. There's still a few hundred lines that don't: >$CONFIG_STATUS (containing a string filename) and >/dev/null and the like. But that's something autotools upstream could fix on their own if they wanted.

Two new ways to read a file quickly

Posted Mar 7, 2020 1:16 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (2 responses)

> I mean are apps supposed to block fd ranges ahead of time, by dup()ing /dev/null a couple of times

What's wrong with that? It seems like a perfectly reasonable approach to me.

More generally, the io_uring "use the last FD" proposal is just a special case of the "promise pipelining" approach that systems like Cap'n Proto implement [1]. In a way, the series of IO requests in io_uring amounts to a little program. We can add features like "local variables" (the last FD opened), flow control, or even branches to this little programming language, but I think it's better to just use eBPF to specify batches of system-call-like operations that a program might want to do in the kernel.

[1] https://capnproto.org/rpc.html

Two new ways to read a file quickly

Posted Mar 7, 2020 3:10 UTC (Sat) by josh (subscriber, #17465) [Link]

> I think it's better to just use eBPF to specify batches of system-call-like operations that a program might want to do in the kernel.

The X Window System solved this problem in the 1980s, and not by embedding a programming language in the X protocol. Letting the user specify the ID allows sending a batch of requests that include object creation.

X gave clients an initial range of IDs they could use, and then provided a protocol to ask for a new "block" of IDs. We could, similarly, provide a call to reserve a block of IDs, and a library can ask for and use such a block without any fear of stepping on other libraries.

The "minimum fd" mechanism was intended as a trivial reservation mechanism for programs to use, as well as something useful for other purposes (such as ensuring that you can't "accidentally" open a random file as file descriptor 1 if file descriptor 1 got closed). It's not the only such reservation mechanism possible, just the simplest.

Two new ways to read a file quickly

Posted Mar 7, 2020 6:02 UTC (Sat) by wahern (subscriber, #37304) [Link]

> In a way, the series of IO requests in io_uring amounts to a little program. We can add features like "local variables" (the last FD opened), flow control, or even branches to this little programming language

Long ago there was a proposal and proof of concept, syslets, that did exactly that: https://lwn.net/Articles/221887/ It eventually morphed into something simpler (https://lwn.net/Articles/261473/) and then just died.

Two new ways to read a file quickly

Posted Mar 7, 2020 3:15 UTC (Sat) by josh (subscriber, #17465) [Link]

The requested file descriptor must *not* already be open; if it is, the attempt to allocate it will return -EBUSY. This is intentionally not a dup2-style operation that closes the target fd, because that would be more error-prone (stepping on a file descriptor used elsewhere in the program).

The file descriptor should be reserved, instead. The min_fd approach provides one way to do that; we could also add a more general mechanism to ask the kernel for a range of unused file descriptors.

Two new ways to read a file quickly

Posted Mar 6, 2020 16:33 UTC (Fri) by walters (subscriber, #7396) [Link] (9 responses)

> spend a lot of time reading information from small /proc and sysfs files; having a readfile() call would make them quite a bit more efficient.

Isn't the cost of that mostly locking and rendering them in the kernel, not the system calls? (OK I didn't try to measure, but I hope the people asking for this have tried and will refute me by posting numbers)

> To implement a readfile(), an application could set up an io_uring chain with three operations, corresponding to the openat(), read(), and close() calls.

One tricky thing with this is that the simple "read a file into memory" falls over badly if somehow one is passed a large file. In libostree we have a helper which switches to mmap() for large files: https://github.com/ostreedev/ostree/blob/26a2be0578ec1089...
and I know a lot of other userspace does that, like ripgrep, etc.

Two new ways to read a file quickly

Posted Mar 7, 2020 10:37 UTC (Sat) by adobriyan (subscriber, #30858) [Link] (8 responses)

> Isn't the cost of that mostly locking and rendering them in the kernel, not the system calls?

Yes. Naming /proc and /sys as an example is quite funny.

On my system the numbers are:
a) calling non-existent system call -- 600 cycles (as measured by rdtsc)
b) calling umask(0) -- 670 cycles (system call which does something)
c) open, read, close /proc/version -- ~6500 cycles (static /proc file which goes through seq_file interface)
d) open, read, close /proc/loadavg -- ~7580 cycles (dynamic /proc file)

Sysfs generally generates deeper hierarchies and (correct me if I'm wrong) revalidates dentries on each lookup.
But sysfs files have simple contents.

I feel that readfile is not important. Stracing all those stat-collecting top-like utilities shows that they are living in the stone age.

    5516 openat(AT_FDCWD, "/proc/uptime", O_RDONLY) = 5
    5516 lseek(5, 0, SEEK_SET) = 0
    5516 read(5, "4082.55 63567.25\n", 8191) = 17

and then it seeks to offset 0 again.

    5516 openat(AT_FDCWD, "/proc", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 6
    5516 fstat(6, {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
    5516 getdents64(6, /* 273 entries */, 32768) = 6856
    5516 openat(AT_FDCWD___WHAT___, "/proc/1/stat", O_RDONLY) = 7

Reading a file into a Vec<u8> by default in Rust does multiple system calls, because it doubles the buffer for the vector contents and starts with a small value like 16(?).

Why even help userspace developers?

Two new ways to read a file quickly

Posted Mar 7, 2020 11:34 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (1 responses)

> Why even help userspace developers?

"Some userspace developers are gormless" is not an argument against providing better tools for userspace developers who are not gormless.

(Whether any particular tool is actually a better tool is a separate conversation.)

Two new ways to read a file quickly

Posted Mar 7, 2020 12:05 UTC (Sat) by adobriyan (subscriber, #30858) [Link]

It is not, but it can be very demoralizing.

If top(1) would start pread()ing /proc/uptime, it would do one system call, just like with readfile().

The best way to speed up reading lots of /proc and /sys files, by a factor of 5 or so, is to upload statistics without VFS involvement, but this battle is probably lost.

Two new ways to read a file quickly

Posted Mar 7, 2020 14:38 UTC (Sat) by burntsushi (guest, #110124) [Link] (5 responses)

> Reading a file into a Vec<u8> by default in Rust does multiple system calls, because it doubles the buffer for the vector contents and starts with a small value like 16(?)

No it doesn't: https://doc.rust-lang.org/src/std/fs.rs.html#266-274

    $ cat src/main.rs
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let data = std::fs::read("/tmp/some-big-file")?;
        println!("{}", data.len());
        Ok(())
    }

    $ cargo build --release

    $ strace ./target/release/rustfile
    openat(AT_FDCWD, "/tmp/some-big-file", O_RDONLY|O_CLOEXEC) = 3
    fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
    statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
    statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=941088098, ...}) = 0
    mmap(NULL, 941088768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9f65d43000
    read(3, "Presented by IM Pictures\nProduce"..., 941088099) = 941088098
    read(3, "", 1) = 0
    close(3)

Two new ways to read a file quickly

Posted Mar 7, 2020 16:44 UTC (Sat) by adobriyan (subscriber, #30858) [Link] (4 responses)

> No it doesn't:

Most files in /proc report st_size=0.

    openat(AT_FDCWD, "/proc/stat", O_RDONLY|O_CLOEXEC) = 3
    fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
    statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
    statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0
    read(3, "cpu 2591925 76 66642 2680980 29", 32) = 32
    read(3, "58 0 925 0 0 0\ncpu0 161817 6 407", 32) = 32
    read(3, "8 167469 97 0 429 0 0 0\ncpu1 158"..., 64) = 64
    read(3, "cpu2 158993 7 4186 170648 115 0 "..., 128) = 128
    read(3, "60993 10 3957 168784 202 0 7 0 0"..., 256) = 256
    read(3, "9 163063 143 0 60 0 0 0\ncpu12 16"..., 512) = 512
    read(3, " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"..., 1024) = 1024
    read(3, " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"..., 2048) = 821
    read(3, "", 1227) = 0
    close(3) = 0

Two new ways to read a file quickly

Posted Mar 7, 2020 23:02 UTC (Sat) by josh (subscriber, #17465) [Link] (3 responses)

It doesn't seem reasonable to put all the blame on userspace when the kernel gives it misleading information.

I wonder if we could enhance statx to have a STATX_SIZE_HINT flag? With that flag, statx could return a new attribute indicating that the file has an unspecified size and should be read in a single read call, along with a hint for a buffer size that's *probably* big enough. That would substantially reduce the number of read calls.

(Also, for future reference, the first statx call is Rust probing to see if the kernel supports statx, and it only happens for the first statx in the program. Likewise, the fcntl checks if the kernel respects O_CLOEXEC, and that only happens on the first open.)

Two new ways to read a file quickly

Posted Mar 9, 2020 14:10 UTC (Mon) by walters (subscriber, #7396) [Link] (2 responses)

Maybe a simpler change would be for the kernel to cache the *last* size of a file in /proc and report it? Though it might trigger bugs in userspace apps that wouldn't be prepared for the file growing between stat() and read().

Two new ways to read a file quickly

Posted Mar 9, 2020 15:29 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

I think any app or library expecting stat to have trustworthy information at read time without some safety net is already a disaster waiting to happen (especially if they're in the /proc hierarchy; the file could fail to *open* after stat returns because the process exited in the meantime).

Two new ways to read a file quickly

Posted Mar 11, 2020 11:51 UTC (Wed) by adobriyan (subscriber, #30858) [Link]

The best way to read /proc is to read a PAGE_SIZE minimum at once, and interpret a short read as EOF for small files like /proc/uptime or /proc/*/statm, which are bounded in size. Bigger reads should be (PAGE_SIZE << n) for unbounded files (/proc/*/maps):

    m->buf = seq_buf_alloc(m->size <<= 1);

Most of sysfs is 4KB tops, but arbitrarily sized for binary attributes.
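
(Rendered as C, that advice amounts to something like the following, with the short read doubling as the EOF indication:)

    char buf[4096];     /* PAGE_SIZE on most architectures */
    int fd = open("/proc/uptime", O_RDONLY);
    ssize_t n = read(fd, buf, sizeof(buf));
    /* n < sizeof(buf) means the whole (bounded) file was read */
    close(fd);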

Two new ways to read a file quickly

Posted Mar 6, 2020 17:40 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

> The dfd and path arguments would identify a file in the usual way. A successful readfile() would read the contents of the indicated file into buf, up to a maximum of bufsize bytes, returning the number of bytes actually read.
How does it indicate that there is still unread data left?

Two new ways to read a file quickly

Posted Mar 6, 2020 18:54 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (7 responses)

The initial patch does not appear to try. Either you get a short read, a complete read, or an error. I think the idea is that you pass a large-but-reasonable buffer size, and if you get a complete read, then you simply assume the file is too big, and you try again with open()/read()/close(). Alternatively, you stat() it first and then dynamically allocate a sufficiently large buffer (plus one byte to detect whether the file has grown, if you care).

(Obviously, this isn't going to work very well with special files like /dev/zero. But IMHO that's the application's fault for trying to read an infinite stream of data in the first place.)

Two new ways to read a file quickly

Posted Mar 6, 2020 19:28 UTC (Fri) by nkiesel (guest, #11748) [Link] (6 responses)

One could imagine an API which returns a negative number in error case, a number in [0;bufSize] for complete reads, and bufSize + 1 for incomplete reads.

Two new ways to read a file quickly

Posted Mar 6, 2020 22:58 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (5 responses)

You can emulate that in userspace: just allocate a buffer one byte longer and pass bufSize+1 in the first place. You even get to read the first "extra" byte for free!
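
(Concretely, wanting up to 8192 bytes from the proposed call, and assuming a readfile() shim built on syscall(2), since no glibc wrapper would exist at first:)

    char buf[8192 + 1];  /* one extra byte to detect truncation */
    ssize_t n = readfile(AT_FDCWD, "/proc/stat", buf, sizeof(buf), 0);
    if (n >= 0 && n <= 8192)
        ;   /* complete: the file fit in the size we actually wanted */
    else if (n == 8193)
        ;   /* buffer filled: more than 8192 bytes, fall back */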

Two new ways to read a file quickly

Posted Mar 7, 2020 10:58 UTC (Sat) by intgr (subscriber, #39733) [Link] (2 responses)

Although I understand how people might consider non-power-of-two sized buffers ugly. The extra 1 byte in an allocation of 8193 bytes would have an alignment overhead of 15 bytes.

In some cases malloc() will also call mmap() to get whole-page allocations, and that would cause a whole 4k page to be wasted for 1 byte.

Two new ways to read a file quickly

Posted Mar 7, 2020 18:06 UTC (Sat) by zlynx (guest, #2285) [Link] (1 responses)

Power of two sized buffers turn out to be a memory wasting trap though.

There was a blog article I read a while ago, linked off Hacker News I think, talking about some research into memory usage this programmer had done. It turned out that many allocations for buffers of sizes such as 4,096 bytes ended up having 16 bytes or so added to them. This was for memory allocation tracking, or, if passed into some API, the library would put the buffer into a structure with other variables.

If I remember correctly the author determined that allocating 4,000 bytes was a nice round number that tended to round up to a 4,096 page size much more reliably.

Two new ways to read a file quickly

Posted Mar 19, 2020 12:43 UTC (Thu) by mgedmin (subscriber, #34497) [Link]

Was it https://blog.mozilla.org/nnethercote/2011/08/05/clownshoe... That was a great article from a Firefox developer working on memory optimization.

Two new ways to read a file quickly

Posted Mar 8, 2020 16:29 UTC (Sun) by nkiesel (guest, #11748) [Link]

Nope, because if the result is bufSize+1 the caller again does not know if the file fit perfectly into the buffer or if there are more bytes to read.

Two new ways to read a file quickly

Posted Mar 10, 2020 16:27 UTC (Tue) by ThomasBellman (guest, #67902) [Link]

Even better, take a cue from snprintf() and return the number of bytes that would have been read if the buffer were infinitely large (i.e., return the file size). And in case the kernel can't quickly determine the size of the file, return either MAXINT or bufSize+1.

Two new ways to read a file quickly

Posted Mar 6, 2020 18:11 UTC (Fri) by mm7323 (subscriber, #87386) [Link] (2 responses)

Do we need a writefile() too? What about preadfile() or readfile()?

I think looking at a way to generically batch any system call in an easy to use way might be a better approach, though not easy to get.

Perhaps a getlastfd() 'system call' could also be used to help bridge the problem of open() (or socket() etc...) then read(), without needing to use fixed FD numbers or loop back through userspace.

Two new ways to read a file quickly

Posted Mar 6, 2020 19:11 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 responses)

> I think looking at a way to generically batch any system call in an easy to use way might be a better approach, though not easy to get.

We have that. It's called "userspace."

Snark aside, I think any solution in this space needs to be very clear on exactly what set of problems it is solving, and what set of problems are out of scope. Otherwise, I imagine you would inevitably end up with a harder-to-use-but-less-flexible syscall interface.

> Perhaps a getlastfd() 'system call' could also be used to help bridge the problem of open() (or socket() etc...) then read() without needing to use fixed FD numbers or loop beck through userspace.

Would getlastfd() be thread-local, or would it be thread-unsafe? Does the kernel track fds in a thread-local way right now?

Two new ways to read a file quickly

Posted Mar 7, 2020 3:12 UTC (Sat) by josh (subscriber, #17465) [Link]

"userspace" has this increasingly problematic performance bottleneck called "system calls", which io_uring replaces with a much faster mechanism called "shared memory".

Two new ways to read a file quickly

Posted Mar 7, 2020 1:04 UTC (Sat) by quotemstr (subscriber, #45331) [Link] (4 responses)

These new system calls are nice, but we can't add a fixed-function system call for every single thing somebody might want to do in the kernel. My ideal is a mechanism to send a little BPF program to the kernel and let it specify a bunch of different things to do there --- for example, open a file, read this byte, mmap this range, whatever, all as a single unit of work, entering the kernel once. You'd register such a program with the kernel from userspace, get a file descriptor back, and invoke your custom program via a single system call.

This way, any userspace program would be able to build its own system call.

Two new ways to read a file quickly

Posted Mar 7, 2020 6:54 UTC (Sat) by ncm (guest, #165) [Link] (2 responses)

How is this (or any of the other notions) better than the series of system calls? In each case you have a series of things you want the kernel to do, and more or less elaborate ways to specify them. But actually doing the system calls achieves exactly the desired thing without bloating the kernel any.

Io_uring makes some sense, eliminating a million or a billion system calls in exchange for a little setup. Eliminating two out of three calls just looks foolish. Even if you have to read a thousand files, or a million: one million calls, three million calls, who cares? The time spent is in pottering around in the file system, not transitioning between user and system. An ordinary library function much shorter than 21 lines does it.

The whole discussion makes no sense. It makes me wonder if Linux is shuffling toward irrelevance. It has happened to better kernels.

Two new ways to read a file quickly

Posted Mar 7, 2020 7:12 UTC (Sat) by epa (subscriber, #39769) [Link]

I imagine that on some network file systems having a single system call means a single round trip to the server.

Two new ways to read a file quickly

Posted Mar 7, 2020 7:47 UTC (Sat) by beagnach (guest, #32987) [Link]

> The time spent is in pottering around in the file system, not transitioning between user and system.

I think much of the discussion is based on the assumption that there are situations where this is not true.
So far we've seen a lot of generalizations - hopefully some benchmarks will be provided in the near future.

Two new ways to read a file quickly

Posted Mar 9, 2020 9:28 UTC (Mon) by pwfxq (subscriber, #84695) [Link]

I could see small utilities being cross-compiled into BPF and run directly in kernel space, rather than user space. After that, I could see some wag expanding the BPF instruction limit and getting Emacs to run in kernel space.

Two new ways to read a file quickly

Posted Mar 17, 2020 15:45 UTC (Tue) by nix (subscriber, #2304) [Link]

This will break a lot of software through name clashes once it hits glibc (or, if namespaces exclude it by default, through misbinding of the symbol to the one in glibc).

readfile is a commonly used C identifier: Debian codesearch reveals 143 pages of broken packages, nearly all of which appear to be legitimate uses of this identifier in actual C code (though some is in C++ which is presumably safer).


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds