Two new ways to read a file quickly
readfile()
LWN recently looked at the proposed fsinfo() system call, which is intended to return information about mounted filesystems to an application. One branch of the discussion delved into whether that information could be exported via sysfs instead; one concern that was expressed about that approach is that the cost of reading a lot of little files would be too high. Miklos Szeredi argued that it would not be, but suggested that, if people were concerned, they could reduce this cost by introducing a new system call to read the contents of a file:
ssize_t readfile(int dfd, const char *path, char *buf, size_t bufsize, int flags);
The dfd and path arguments would identify a file in the usual way. A successful readfile() would read the contents of the indicated file into buf, up to a maximum of bufsize bytes, returning the number of bytes actually read. On its face, readfile() adds nothing new; an application can obtain the same result with calls to openat(), read(), and close(). But it reduces the number of system calls required from three to one, and that turns out to be interesting to some users.
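For comparison, a minimal userspace equivalent built from the three calls it would replace might look like the sketch below; this only illustrates the semantics described above (short reads are not retried and errors simply propagate), it is not the proposed kernel implementation.

#include <fcntl.h>
#include <unistd.h>

/* Sketch of what readfile() would do in a single system call today:
 * open the file, read up to bufsize bytes into buf, close it, and
 * return the number of bytes read (or -1 on error). */
static ssize_t readfile_fallback(int dfd, const char *path,
                                 char *buf, size_t bufsize)
{
        int fd = openat(dfd, path, O_RDONLY);
        ssize_t n;

        if (fd < 0)
                return -1;
        n = read(fd, buf, bufsize);
        close(fd);
        return n;
}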
In particular, Karel Zak, the maintainer of the util-linux project, offered "many many beers" for the implementation of readfile(). Many of the utilities in util-linux (tools like ps and top, for example) spend a lot of time reading information from small /proc and sysfs files; having a readfile() call would make them quite a bit more efficient.
People who complain that it's hard to get kernel developers to pay attention to their problems clearly have missed an important technique; Greg Kroah-Hartman quickly responded with enthusiasm: "Unlimited beers for a 21-line kernel patch? Sign me up!". He provided a first implementation, and went on to say that this system call might actually make sense to have. Naturally, the patch has grown past 21 lines as the details that need to be taken into account have been dealt with, and there is still a manual page to write. But it seems likely that readfile() will be submitted in the near future.
Of course, some people are already talking about the need for a writefile() as well.
readfile() on the ring
As the conversation progressed, Jann Horn pointed out that the developers working on io_uring have also expressed interest in adding a readfile()-like capability. The whole point of io_uring is to be able to perform system-call actions asynchronously and without having to actually call into the kernel, so it might seem like a good fit for this use case.
He did note that truly supporting that feature in io_uring is "a bit complicated", since there is no way to propagate a file descriptor returned by openat() to a subsequent read() operation queued in the ring. Without that, the read() cannot be queued until after the file has been opened, defeating the purpose of the exercise.
The fact of the matter, though, is that "a bit complicated" is a good description of io_uring in general. It seems unlikely that the author of a tool like ps is going to want to go through all of the effort needed to set up an io_uring instance, map it into the address space, queue some operations, and start things running just to avoid some system calls when reading /proc. But the developers of other, more complex applications would, it seems, like to have this sort of capability.
In short order, perhaps in the hope of tapping into that "unlimited beer" stream, io_uring maintainer Jens Axboe posted a patch that fills in the missing piece. It works by remembering the file descriptor returned by the last openat() call in a given chain of operations. To implement a readfile(), an application could set up an io_uring chain with three operations, corresponding to the openat(), read(), and close() calls. For the latter two, though, the usual file-descriptor argument would be provided as the special constant IOSQE_FD_LAST_OPEN, which would be replaced by the descriptor for the last opened file when the operation executes.
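As a rough illustration, such a chain might look something like the following with liburing. IOSQE_FD_LAST_OPEN is the special constant from Axboe's posted patch and never made it into a mainline kernel, so treat this as a sketch of the idea rather than a working interface; error handling is omitted.

#include <liburing.h>
#include <fcntl.h>

/* Queue openat -> read -> close as one linked chain.  The read and
 * close pass IOSQE_FD_LAST_OPEN (from the posted patch, not mainline)
 * in place of a real descriptor, meaning "the fd returned by the
 * previous openat in this chain". */
static int queue_readfile(struct io_uring *ring, const char *path,
                          char *buf, unsigned len)
{
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
        sqe->flags |= IOSQE_IO_LINK;    /* don't start the read until the open completes */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, IOSQE_FD_LAST_OPEN, buf, len, 0);
        sqe->flags |= IOSQE_IO_LINK;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_close(sqe, IOSQE_FD_LAST_OPEN);

        return io_uring_submit(ring);
}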
This approach works, at the cost of complicating the interface and implementation with the magic file-descriptor substitution. Josh Triplett had a different idea, which he first posted in an LWN comment in January: let applications specify which file descriptor they would like to use when opening a file. He filled out that idea in March with a patch series adding a new O_SPECIFIC_FD flag to the openat2() system call. This feature is available independently of io_uring; if an application really wants to open a file on descriptor 42 and no other, the new flag makes that possible.
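For contrast, the closest an application can get today is to open the file normally and then dup2() the result onto the number it wants, which costs extra system calls and briefly leaves the file open on the "wrong" descriptor. A small sketch of that existing workaround (not the proposed O_SPECIFIC_FD interface):

#include <fcntl.h>
#include <unistd.h>

/* Today's workaround: open wherever the kernel chooses, then move the
 * descriptor to the number the caller wanted.  Unlike the proposed
 * O_SPECIFIC_FD flag, dup2() will silently close anything that is
 * already open at wanted_fd. */
static int open_at_fd(const char *path, int wanted_fd)
{
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;
        if (fd == wanted_fd)
                return fd;
        if (dup2(fd, wanted_fd) < 0) {
                close(fd);
                return -1;
        }
        close(fd);
        return wanted_fd;
}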
The patch set also adds a new prctl() operation to set the minimum file descriptor to use when the application has not requested a specific one. This minimum defaults to zero, preserving the "lowest available descriptor" semantics that Unix has guaranteed forever. A developer wanting to control the file descriptors used could raise this minimum and know that the kernel would not use any of the descriptors below that minimum without an explicit request.
It only took Axboe about three hours to come up with a new patch series integrating this work. It mostly consists of delaying the checks of file-descriptor validity so that they don't happen ahead of the call that actually makes a given descriptor valid. There seems to be a general agreement that this approach makes more sense than magic file-descriptor substitution, so this is the version that seems likely to go ahead.
At this point, though, this work has only circulated on the io_uring list,
which has a relatively limited readership. Axboe has said
that he plans to post it more generally in the near future, and that
merging for 5.7 is within the realm of possibility. So it may well be that
there will soon be two different ways for an application to read the
contents of a file with a minimum of system calls — and Karel Zak may end
up buying a lot of beer.
Index entries for this article:
Kernel: io_uring
Kernel: System calls/readfile()
Posted Mar 6, 2020 16:16 UTC (Fri)
by mezcalero (subscriber, #45103)
[Link] (50 responses)
Posted Mar 6, 2020 17:32 UTC (Fri)
by axboe (subscriber, #904)
[Link] (39 responses)
Posted Mar 6, 2020 20:18 UTC (Fri)
by mezcalero (subscriber, #45103)
[Link] (37 responses)
This concept makes things easy in kernel space maybe, but is not manageable in non-trivial real-life apps in userspace, I am sure.
I mean, it's a bit like rtsigs: almost no one uses them, because there are no allocation semantics defined and thus library code could never safely, generically, reliably make use of them. rtsigs are a failure as an I/O concept in hindsight and this plays a major role in that, I think, and I don't think it would be a good idea to repeat this with fds now. It's kinda like picking the semantics of one of the crappiest kernel concepts we have (rtsigs) and applying them to one of the saner kernel concepts we have (fds), just because it's a bit easier to implement in the kernel (at the cost of much heavier userspace work).
Or to say this differently: few people would think that a userspace malloc() implementation that takes the memory pointer value as input (rather than as output) would be a good idea, I am sure.
Posted Mar 6, 2020 20:46 UTC (Fri)
by axboe (subscriber, #904)
[Link] (11 responses)
Posted Mar 6, 2020 21:25 UTC (Fri)
by willy (subscriber, #9762)
[Link] (10 responses)
I think the right way to do this is to have userspace open /dev/null as often as it needs to in order to create the fds it will need. Then use those fd #s in the io_uring calls.
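A minimal illustration of the trick being suggested here; the placeholder opens are ordinary open() calls, nothing io_uring-specific, and the only point is to pin down descriptor numbers in advance.

#include <fcntl.h>

/* Reserve 'count' descriptor numbers by opening /dev/null onto them.
 * The numbers stay valid (and unused by anything else) until the
 * placeholders are closed, so they can be used as fixed fd slots in
 * queued io_uring operations. */
static int reserve_fds(int *fds, int count)
{
        for (int i = 0; i < count; i++) {
                fds[i] = open("/dev/null", O_RDONLY);
                if (fds[i] < 0)
                        return -1;      /* earlier placeholders remain open */
        }
        return 0;
}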
Posted Mar 6, 2020 23:58 UTC (Fri)
by jlayton (subscriber, #31672)
[Link] (6 responses)
Mainly, io_uring needs to be able to specify that a subsequent read (or write, or whatever) use the fd from an open done earlier in the same chain. I think just being able to express "use the fd from last open" would be super useful for about 90% of use cases, and you could always layer on another way to deal with multiple fds later.
Posted Mar 7, 2020 6:33 UTC (Sat)
by ncm (guest, #165)
[Link] (5 responses)
Posted Mar 7, 2020 10:39 UTC (Sat)
by intgr (subscriber, #39733)
[Link] (4 responses)
Posted Mar 7, 2020 17:33 UTC (Sat)
by justincormack (subscriber, #70439)
[Link] (3 responses)
Posted Mar 7, 2020 18:01 UTC (Sat)
by nivedita76 (subscriber, #121790)
[Link] (2 responses)
Posted Feb 20, 2024 13:40 UTC (Tue)
by sammythesnake (guest, #17693)
[Link] (1 responses)
Isn't operating on thousands of files in one io_uring linked sequence exactly the kind of thing some applications would like to do, reducing thousands of syscalls to a couple...?
Would some kind of per-io_uring-linked-sequence "pseudo-FD" make sense? In/alongside open operations, you could provide a number (1, 2, 3...) for each file opened in the sequence that the kernel transparently maps to "real" FDs internally. Later operations could then identify which of the files opened within the sequence should be acted on (e.g. "read the file *this sequence* calls "1". Maybe with negative FD numbers...?)
The pFD *could* be sequentially allocated so subsequent calls would simply say "the third one opened" but keeping those straight while editing the sequence would be error-prone, so that's probably not a win over finding a way to nominate a pFD.
Obviously, there are details to sort out like managing the pFD->FD mappings, and getting the API right, but none of that sounds nastier than the other things suggested in this thread (to me, at least - I'm merely a curious bystander!)
This is presumably a very naive question, but can't an io_uring open() operation save the FD it returns in a look-up table to be referenced by later operations - that would seem the "obvious" way to me, but I assume this isn't possible, or this whole thread would be moot...
Posted Feb 20, 2024 16:32 UTC (Tue)
by kleptog (subscriber, #1183)
[Link]
Posted Mar 7, 2020 13:34 UTC (Sat)
by josh (subscriber, #17465)
[Link] (2 responses)
1) The block of reserved fds shouldn't actually consist of open file descriptors that take up resources, especially if we may want to have a block of reserved fds per thread.
Posted Mar 8, 2020 15:08 UTC (Sun)
by pbonzini (subscriber, #60935)
[Link] (1 responses)
Posted Mar 8, 2020 15:14 UTC (Sun)
by josh (subscriber, #17465)
[Link]
Posted Mar 7, 2020 2:53 UTC (Sat)
by ploxiln (subscriber, #58395)
[Link] (24 responses)
Posted Mar 7, 2020 19:03 UTC (Sat)
by dezgeg (subscriber, #92243)
[Link] (13 responses)
Well what if you want to implement (or use) the ps command but as a C library?
> And "miscellaneous" libraries should not be using these advanced APIs automatically anyway, they should always be under explicit control of the main program.
Why not? As a potential user of a ps-as-a-C-library, why would I care what it does internally? And what classifies as an "advanced API" in contrast to a "non-advanced API" anyway?
Posted Mar 7, 2020 19:44 UTC (Sat)
by ploxiln (subscriber, #58395)
[Link] (12 responses)
You should care about what every library you use does internally. Every single library you use brings security, compatibility, efficiency, and general tech-debt burden.
ps-as-a-library would need the old open/read/close method for compatibility with older kernels, and that method should be the default for newer kernels too. A user which knows what they are doing could enable the super-efficient mode explicitly, by passing a range of file descriptors they want ps-as-a-library to use for it.
> And what classifies as an "advanced API" in contrast to a "non-advanced API" anyway?
Anything which needs to be managed in a process-global manner. Signals are another example. Libraries may suggest the user register some signal handlers for them, or offer to register signal handlers if the user explicitly requests it.
zlib has no interest in this, nor libjpeg ... openssl takes sockets you create and open yourself, and they can even be non-blocking. The opening and closing of file descriptors is up to you, not the library. Or, some libraries offer helpers to open and close files or connections, which you have to call explicitly.
You could use a library which makes a mess - but don't.
Posted Mar 7, 2020 20:19 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (11 responses)
> A user which knows what they are doing could enable the super-efficient mode explicitly, by passing a range of file descriptors they want ps-as-a-library to use for it.
Posted Mar 8, 2020 19:59 UTC (Sun)
by ibukanov (subscriber, #3942)
[Link] (10 responses)
My past experience tells me that most likely it is the new and shiny interface that will be removed or deprecated (to be replaced by an even newer and shinier interface) rather than something that has survived for ages and still does the job.
Case in point is epoll. If I have an old code that uses poll, I would not bother to replace that with epoll. I would rather go straight to io ring.
Posted Mar 8, 2020 21:37 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (9 responses)
Traditional UNIX interfaces are badly designed, on all fronts (process management, file management, AIO, you name it). And traditional design affected the replacement interfaces as well: poll (and epoll), AIO, etc. These days thankfully the attitude is "screw UNIX and design something that makes sense".
This is definitely refreshing.
Posted Mar 9, 2020 9:39 UTC (Mon)
by ibukanov (subscriber, #3942)
[Link] (8 responses)
And if an application uses select and it shows its limitations, I will skip poll when patching it and, depending on the target, will opt for epoll or the io ring. Which shows the same rule: a new thing will most likely be replaced by an even newer thing before the old thing that still works acceptably stops being suitable.
Essentially, the new thing is optimized for the present moment. We do not know how relevant that optimization will be in the future. The old thing, by surviving and still working, shows that it was not over-optimized when it was created.
Posted Mar 9, 2020 16:34 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (7 responses)
Posted Mar 9, 2020 21:11 UTC (Mon)
by ibukanov (subscriber, #3942)
[Link] (6 responses)
It is more optimizable in the context of the present hardware and current kernel code. RISC architecture is based on an assumption that a small number of fast operations is better than a big set of complex ones. So it could be that on future hardware one could have a small set of super-fast syscalls. Then readfile implemented as a syscall would be a liability.
Posted Mar 9, 2020 21:13 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
The main potential for optimization is for networked filesystems where one readfile() request can easily save a couple of roundtrips.
Posted Mar 17, 2020 16:18 UTC (Tue)
by nix (subscriber, #2304)
[Link] (4 responses)
Do you even read what you're responding to?
Posted Mar 17, 2020 16:40 UTC (Tue)
by mebrown (subscriber, #7960)
[Link] (2 responses)
Posted Mar 17, 2020 22:50 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
(Similar things have existed before, and will again: the vDSO is one such, as is the fast syscall optimization in Solaris on SPARC.)
Posted Mar 18, 2020 10:10 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
Except that for such a potential world to be worth considering, you need to explain how it's plausible.
The "fast syscall optimization" in Solaris on SPARC used the fact that SPARC has 128 syscall entry points in the hardware to optimize up to 128 syscalls - that's over a third of Linux syscalls, more if you ignore all the legacy syscalls (as Solaris could, since it could do the translation from legacy to current in libc). It only had such a drastic effect in Solaris since the "fast" syscalls didn't make use of the generic ABI translation at syscall entry that Solaris chose to do to simplify syscall implementation - in other words, it worked around a known software deficiency in Solaris, stemming from their desire to use the same SunStudio compiler and ABI for all code, rather than teaching SunStudio to have a syscall ABI for kernel code to use.
The vDSO isn't about syscalls per se; the vDSO page is a userspace page that happens to be shared with the kernel, and contains userspace code and data from the kernel, allowing you to completely avoid making a syscall.
Remember that, at heart, syscalls are four machine micro-operations sequenced sensibly; everything else is built on top of this:
Any optimization in hardware that leads to a subset of syscalls being faster has to be in the last micro-operation; all the others are common to all syscalls. The only such optimization that's possible is to have alternate syscall entry points for different syscalls; this is what the SPARC trap system does, using a 128 entry trap table to decide which syscall entry point to use.
Note, too, that the tendency over time is to optimize the hardware with a single syscall entry point, since that's just a single pointer-sized piece of data to track; Intel 8008 through to 80286 only had INT for syscalls, 80386 added call gates, while Pentium II added SYSENTER which only has a single possible entry point. Similarly, ARM, MIPS, POWER, PowerPC, RISC-V, and AArch64 all only have a single instruction to do syscalls that goes to a single syscall entry point (albeit that on POWER, PowerPC, ARM, and AArch64, that instruction also includes a limited amount of data that's supplied to the kernel, intended for use as a syscall number).
SPARC is the one exception to the rule that more modern architectures only have a single syscall entry point, with its trap table of 128 entries, and even then, it was only a performance win because Solaris was able to use the trap table to get around its own earlier bad decisions around syscall handling.
Posted Mar 17, 2020 16:53 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
Except that that's an implausible situation, based on the hardware of the last 50 (yes, 50!) years.
The trend has been towards fewer system call instructions, not more, over time. In the 1970s, you had things like the 8008's RST instructions, which gave you a small number of fast system calls. RISC CPUs have tended to have just a single syscall type instruction (sc/svc in PowerPC/POWER, SVC in AArch64, SWI in AArch32, syscall in MIPS), with the exception of SPARC, whose trap instructions allowed you to specify different trap handlers directly.
In modern x86, the SYSENTER/SYSCALL instructions are also a single option - there's no "fast path" included here at all.
Now, AArch32, AArch64, POWER/PowerPC, and VAX all have an argument supplied as part of the syscall instruction itself, but it's literally just an argument. It doesn't point you to a new trap handler, it's just an argument to the handler.
Posted Mar 8, 2020 7:55 UTC (Sun)
by mm7323 (subscriber, #87386)
[Link] (9 responses)
But is anyone actually complaining or concerned about poor performance of these programs?
Posted Mar 8, 2020 11:29 UTC (Sun)
by mpr22 (subscriber, #60784)
[Link] (5 responses)
And I'm sure plenty of people care about grep running slower than it needs to.
Posted Mar 8, 2020 17:00 UTC (Sun)
by mm7323 (subscriber, #87386)
[Link] (4 responses)
grep on large files should be IO bound, and grep on a small file is surely overshadowed by process startup time rather than an extra system call to get file contents into a buffer.
I've also never noticed top negatively impacting system performance; even the busybox version on little embedded systems has never caused me a problem or disappointed. Generically allowing system-call batching is a good idea, but personally I'm less convinced by esoteric system-calls for specific and limited use cases.
Posted Mar 8, 2020 18:06 UTC (Sun)
by andresfreund (subscriber, #69562)
[Link] (1 responses)
Posted Mar 8, 2020 23:42 UTC (Sun)
by himi (subscriber, #340)
[Link]
And the general principle of having measurement of the system cause as little impact as possible on the properties being measured definitely holds.
Posted Mar 9, 2020 11:05 UTC (Mon)
by Sesse (subscriber, #53779)
[Link]
Posted Mar 9, 2020 22:00 UTC (Mon)
by roc (subscriber, #30627)
[Link]
Posted Mar 8, 2020 20:02 UTC (Sun)
by excors (subscriber, #95769)
[Link]
There's no point optimising the kernel for grep itself, because grep could be improved by maybe two orders of magnitude with purely application changes; but it might be worth considering whether kernel changes could improve the performance of ripgrep which has already taken most of the low-hanging fruit.
Posted Mar 8, 2020 23:51 UTC (Sun)
by himi (subscriber, #340)
[Link]
Posted Mar 17, 2020 16:38 UTC (Tue)
by mebrown (subscriber, #7960)
[Link]
In practice I observe that current implementations of top use a noticeable percentage of CPU, which can throw off my observations.
Posted Mar 8, 2020 15:06 UTC (Sun)
by pbonzini (subscriber, #60935)
[Link]
Posted Mar 6, 2020 17:39 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Mar 6, 2020 18:28 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted Mar 6, 2020 18:42 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Mar 6, 2020 17:46 UTC (Fri)
by flussence (guest, #85566)
[Link] (2 responses)
Posted Mar 6, 2020 21:45 UTC (Fri)
by nivedita76 (subscriber, #121790)
[Link] (1 responses)
Posted Mar 6, 2020 23:06 UTC (Fri)
by flussence (guest, #85566)
[Link]
Posted Mar 7, 2020 1:16 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
What's wrong with that? It seems like a perfectly reasonable approach to me.
More generally, the io_uring "use the last FD" proposal is just a special case of the "promise pipelining" approach that systems like Cap'n Proto implement [1]. In a way, the series of IO requests in io_uring amounts to a little program. We can add features like "local variables" (the last FD opened), flow control, or even branches to this little programming language, but I think it's better to just use eBPF to specify batches of system-call-like operations that a program might want to do in the kernel.
Posted Mar 7, 2020 3:10 UTC (Sat)
by josh (subscriber, #17465)
[Link]
The X Window System solved this problem in the 1980s, and not by embedding a programming language in the X protocol. Letting the user specify the ID allows sending a batch of requests that include object creation.
X gave clients an initial range of IDs they could use, and then provided a protocol to ask for a new "block" of IDs. We could, similarly, provide a call to reserve a block of IDs, and a library can ask for and use such a block without any fear of stepping on other libraries.
The "minimum fd" mechanism was intended as a trivial reservation mechanism for programs to use, as well as something useful for other purposes (such as ensuring that you can't "accidentally" open a random file as file descriptor 1 if file descriptor 1 got closed). It's not the only such reservation mechanism possible, just the simplest.
Posted Mar 7, 2020 6:02 UTC (Sat)
by wahern (subscriber, #37304)
[Link]
Long ago there was a proposal and proof of concept, syslets, that did exactly that: https://lwn.net/Articles/221887/ It eventually morphed into something simpler (https://lwn.net/Articles/261473/) and then just died.
Posted Mar 7, 2020 3:15 UTC (Sat)
by josh (subscriber, #17465)
[Link]
The file descriptor should be reserved, instead. The min_fd approach provides one way to do that; we could also add a more general mechanism to ask the kernel for a range of unused file descriptors.
Posted Mar 6, 2020 16:33 UTC (Fri)
by walters (subscriber, #7396)
[Link] (9 responses)
Isn't the cost of that mostly locking and rendering them in the kernel, not the system calls? (OK I didn't try to measure, but I hope the people asking for this have tried and will refute me by posting numbers)
> To implement a readfile(), an application could set up an io_uring chain with three operations, corresponding to the openat(), read(), and close() calls.
One tricky thing with this is that the simple "read a file into memory" falls over badly if somehow one is passed a large file. In libostree we have a helper which switches to mmap() for large files: https://github.com/ostreedev/ostree/blob/26a2be0578ec1089...
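A rough sketch of that pattern is below; it is not the libostree helper linked above, and the 128 KiB cutoff is an arbitrary number chosen for illustration.

#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define SMALL_FILE_LIMIT (128 * 1024)   /* arbitrary cutoff for this sketch */

/* Read small files into a malloc'd buffer; mmap() large ones so a huge
 * file is not copied through read() into an equally huge allocation.
 * The caller frees or munmaps depending on *is_mapped. */
static void *load_file(int fd, size_t *len_out, int *is_mapped)
{
        struct stat st;
        char *buf;

        if (fstat(fd, &st) < 0)
                return NULL;
        *len_out = (size_t)st.st_size;

        if (st.st_size > SMALL_FILE_LIMIT) {
                void *p = mmap(NULL, *len_out, PROT_READ, MAP_PRIVATE, fd, 0);
                *is_mapped = 1;
                return p == MAP_FAILED ? NULL : p;
        }

        *is_mapped = 0;
        buf = malloc(*len_out ? *len_out : 1);
        if (buf && read(fd, buf, *len_out) != (ssize_t)*len_out) {
                free(buf);
                return NULL;
        }
        return buf;
}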
Posted Mar 7, 2020 10:37 UTC (Sat)
by adobriyan (subscriber, #30858)
[Link] (8 responses)
Yes. Naming /proc and /sys as an example is quite funny.
On my system the numbers are:
Sysfs generally generates deeper hierarchies and (correct me if I'm wrong) revalidates dentries on each lookup.
I feel that readfile is not important. Stracing all those stat collecting top-like utilities shows that they are living in stone age.
5516 openat(AT_FDCWD, "/proc/uptime", O_RDONLY) = 5
and then it reseeks to offset 0 again.
5516 openat(AT_FDCWD, "/proc", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 6
Reading a file into Vec[u8] by default in Rust does multiple system calls because it doubles the buffer for the vector contents and starts with a small value like 16(?).
Why even help userspace developers?
Posted Mar 7, 2020 11:34 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link] (1 responses)
"Some userspace developers are gormless" is not an argument against providing better tools for userspace developers who are not gormless.
(Whether any particular tool is actually a better tool is a separate conversation.)
Posted Mar 7, 2020 12:05 UTC (Sat)
by adobriyan (subscriber, #30858)
[Link]
If top(1) started preading /proc/uptime, it would do one system call, just like with readfile().
The best way to speed up reading lots of /proc and /sys files by factor of 5x is to upload statistics without VFS involvement.
Posted Mar 7, 2020 14:38 UTC (Sat)
by burntsushi (guest, #110124)
[Link] (5 responses)
No it doesn't: https://doc.rust-lang.org/src/std/fs.rs.html#266-274
$ cat src/main.rs
$ cargo build --release
$ strace ./target/release/rustfile
Posted Mar 7, 2020 16:44 UTC (Sat)
by adobriyan (subscriber, #30858)
[Link] (4 responses)
Most files in /proc report st_size=0.
openat(AT_FDCWD, "/proc/stat", O_RDONLY|O_CLOEXEC) = 3
Posted Mar 7, 2020 23:02 UTC (Sat)
by josh (subscriber, #17465)
[Link] (3 responses)
I wonder if we could enhance statx to have a STATX_SIZE_HINT flag? With that flag, statx could return a new attribute indicating that the file has an unspecified size and should be read in a single read call, along with a hint for a buffer size that's *probably* big enough. That would substantially reduce the number of read calls.
(Also, for future reference, the first statx call is Rust probing to see if the kernel supports statx, and it only happens for the first statx in the program. Likewise, the fcntl checks if the kernel respects O_CLOEXEC, and that only happens on the first open.)
Posted Mar 9, 2020 14:10 UTC (Mon)
by walters (subscriber, #7396)
[Link] (2 responses)
Posted Mar 9, 2020 15:29 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
Posted Mar 11, 2020 11:51 UTC (Wed)
by adobriyan (subscriber, #30858)
[Link]
m->buf = seq_buf_alloc(m->size <<= 1);
Most of sysfs is 4KB tops but arbitrary sized for binary attributes.
Posted Mar 6, 2020 17:40 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
Posted Mar 6, 2020 18:54 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (7 responses)
(Obviously, this isn't going to work very well with special files like /dev/zero. But IMHO that's the application's fault for trying to read an infinite stream of data in the first place.)
Posted Mar 6, 2020 19:28 UTC (Fri)
by nkiesel (guest, #11748)
[Link] (6 responses)
Posted Mar 6, 2020 22:58 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (5 responses)
Posted Mar 7, 2020 10:58 UTC (Sat)
by intgr (subscriber, #39733)
[Link] (2 responses)
In some cases malloc() will also call mmap() to get whole-page allocations, and that would cause a whole 4k page to be wasted for 1 byte.
Posted Mar 7, 2020 18:06 UTC (Sat)
by zlynx (guest, #2285)
[Link] (1 responses)
There was a blog article I read a while ago, linked off Hacker News I think, talking about some research into memory usage this programmer had done. It turned out that many allocations for buffers of sizes such as 4,096 bytes ended up having 16 bytes or so added to it. This was for memory allocation tracking, or if passed into some API the library would put it into a structure with other variables.
If I remember correctly the author determined that allocating 4,000 bytes was a nice round number that tended to round up to a 4,096 page size much more reliably.
Posted Mar 19, 2020 12:43 UTC (Thu)
by mgedmin (subscriber, #34497)
[Link]
Posted Mar 8, 2020 16:29 UTC (Sun)
by nkiesel (guest, #11748)
[Link]
Posted Mar 10, 2020 16:27 UTC (Tue)
by ThomasBellman (guest, #67902)
[Link]
Posted Mar 6, 2020 18:11 UTC (Fri)
by mm7323 (subscriber, #87386)
[Link] (2 responses)
I think looking at a way to generically batch any system call in an easy to use way might be a better approach, though not easy to get.
Perhaps a getlastfd() 'system call' could also be used to help bridge the problem of open() (or socket() etc...) then read(), without needing to use fixed FD numbers or loop back through userspace.
Posted Mar 6, 2020 19:11 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
We have that. It's called "userspace."
Snark aside, I think any solution in this space needs to be very clear on exactly what set of problems it is solving, and what set of problems are out of scope. Otherwise, I imagine you would inevitably end up with a harder-to-use-but-less-flexible syscall interface.
> Perhaps a getlastfd() 'system call' could also be used to help bridge the problem of open() (or socket() etc...) then read(), without needing to use fixed FD numbers or loop back through userspace.
Would getlastfd() be thread-local, or would it be thread-unsafe? Does the kernel track fds in a thread-local way right now?
Posted Mar 7, 2020 3:12 UTC (Sat)
by josh (subscriber, #17465)
[Link]
Posted Mar 7, 2020 1:04 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link] (4 responses)
This way, any userspace program would be able to build its own system call.
Posted Mar 7, 2020 6:54 UTC (Sat)
by ncm (guest, #165)
[Link] (2 responses)
Io_uring makes some sense, eliminating a million or a billion system calls in exchange for a little setup. Eliminating two out of three calls just looks foolish. Even if you have to read a thousand files, or a million: one million calls, three million calls, who cares? The time spent is in pottering around in the file system, not transitioning between user and system. An ordinary library function much shorter than 21 lines does it.
The whole discussion makes no sense. It makes me wonder if Linux is shuffling toward irrelevance. It has happened to better kernels.
Posted Mar 7, 2020 7:12 UTC (Sat)
by epa (subscriber, #39769)
[Link]
Posted Mar 7, 2020 7:47 UTC (Sat)
by beagnach (guest, #32987)
[Link]
I think much of the discussion is based on the assumption that there are situations where this is not true.
Posted Mar 9, 2020 9:28 UTC (Mon)
by pwfxq (subscriber, #84695)
[Link]
Posted Mar 17, 2020 15:45 UTC (Tue)
by nix (subscriber, #2304)
[Link]
readfile is a commonly used C identifier: Debian codesearch reveals 143 pages of broken packages, nearly all of which appear to be legitimate uses of this identifier in actual C code (though some is in C++ which is presumably safer).
2) If the only thing keeping an fd reserved is that it has an open file on it, then once that fd is opened and subsequently closed, it stops being reserved. The fd should stay reserved after being closed.
3) O_SPECIFIC_FD specifically doesn't allow opening "over" an existing open file descriptor the way dup2 does; it'll return -EBUSY. I felt that would be less error-prone, and would help catch races.
Programs like ps, top, grep, and even nginx or mariadb, could benefit from this kind of thing, while not linking many "miscellaneous" libraries. Not all programs should be huge agglomerations like systemd or libreoffice or gnome-shell, in fact most programs shouldn't.
And "miscellaneous" libraries should not be using these advanced APIs automatically anyway, they should always be under explicit control of the main program. You would never want libdbus or libsystemd or glib2 to use these advanced interfaces automatically.
(Of course, systemd does seem to want to do every new whiz-bang thing automatically, default enabled, and then when it causes problems it's always someone else's fault...)
Why? Compat code (to be eventually removed) is fine. But why wouldn't a library use new functionality?
If a library can do it transparently then why not do it?
Why not select() then, to be maximally traditional?
The thing is, readfile() is inherently more optimizable than the open()/read()/close() sequence. And it simply can't be slower than them.
Incorrect. It'll ALWAYS be more optimizable. In the worst case it'll be no worse than open/read/close sequence.
and I know a lot of other userspace does that, like ripgrep, etc.
a) calling non-existent system call -- 600 cycles (as measured by rdtsc)
b) calling umask(0) -- 670 cycles (system call which does something)
c) open, read, close /proc/version -- ~6500 cycles (static /proc file which goes through seq_file interface)
d) open, read, close /proc/loadavg -- ~7580 cycles (dynamic /proc file)
But sysfs has simple file contents.
5516 lseek(5, 0, SEEK_SET) = 0
5516 read(5, "4082.55 63567.25\n", 8191) = 17
5516 fstat(6, {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
5516 getdents64(6, /* 273 entries */, 32768) = 6856
5516 openat(AT_FDCWD___WHAT___, "/proc/1/stat", O_RDONLY) = 7
but this battle is probably lost.
fn main() -> Result<(), Box<dyn std::error::Error>> {
let data = std::fs::read("/tmp/some-big-file")?;
println!("{}", data.len());
Ok(())
}
openat(AT_FDCWD, "/tmp/some-big-file", O_RDONLY|O_CLOEXEC) = 3
fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=941088098, ...}) = 0
mmap(NULL, 941088768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9f65d43000
read(3, "Presented by IM Pictures\nProduce"..., 941088099) = 941088098
read(3, "", 1) = 0
close(3)
fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0
read(3, "cpu 2591925 76 66642 2680980 29", 32) = 32
read(3, "58 0 925 0 0 0\ncpu0 161817 6 407", 32) = 32
read(3, "8 167469 97 0 429 0 0 0\ncpu1 158"..., 64) = 64
read(3, "cpu2 158993 7 4186 170648 115 0 "..., 128) = 128
read(3, "60993 10 3957 168784 202 0 7 0 0"..., 256) = 256
read(3, "9 163063 143 0 60 0 0 0\ncpu12 16"..., 512) = 512
read(3, " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"..., 1024) = 1024
read(3, " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"..., 2048) = 821
read(3, "", 1227) = 0
close(3) = 0
How does it indicate that there is still unread data left?
Even better, take a cue from snprintf() and return the number of bytes that would have been read if the buffer were infinitely large (i.e., return the file size). And in case the kernel can't quickly determine the size of the file, return either MAXINT or bufSize+1.
So far we've seen a lot of generalizations - hopefully some benchmarks will be provided in the near future.