Two new ways to read a file quickly

Posted Mar 9, 2020 16:34 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)
In reply to: Two new ways to read a file quickly by ibukanov
Parent article: Two new ways to read a file quickly

> Essentially the new thing is optimized for the present moment. We do not know how relevant that optimization will be in future.
The thing is, readfile() is inherently more optimizable than the open()/read()/close() sequence. And it simply can't be slower than them.
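
For reference, the three-call sequence being compared can be sketched roughly as below. This is only an illustration: `read_small_file` and the 4096-byte default are made-up names and values, and the sketch shows the userspace sequence that a readfile() call would replace, not readfile() itself.

```python
import os

def read_small_file(path, bufsize=4096):
    """Read up to bufsize bytes from path using three separate syscalls.

    Hypothetical helper for illustration; a single readfile()-style call
    would cover the same work with one kernel entry instead of three.
    """
    fd = os.open(path, os.O_RDONLY)   # syscall 1: open()
    try:
        return os.read(fd, bufsize)   # syscall 2: read()
    finally:
        os.close(fd)                  # syscall 3: close()
```

Each of the three calls pays its own kernel entry and exit, which is the overhead a combined call could avoid.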

Posted Mar 9, 2020 21:11 UTC (Mon) by ibukanov (subscriber, #3942) [Link] (6 responses)

> readfile() is inherently more optimizable

It is more optimizable in the context of present hardware and current kernel code. The RISC architecture is based on the assumption that a small number of fast operations is better than a large set of complex ones. So it could be that future hardware offers a small set of super-fast syscalls. Then readfile() implemented as a syscall would be a liability.

Posted Mar 9, 2020 21:13 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> It is more optimizable in context of the present hardware and current kernel code.
Incorrect. It'll ALWAYS be more optimizable. In the worst case it'll be no worse than the open/read/close sequence.

The main potential for optimization is for networked filesystems where one readfile() request can easily save a couple of roundtrips.

Posted Mar 17, 2020 16:18 UTC (Tue) by nix (subscriber, #2304) [Link] (4 responses)

ibukanov just gave you a plausible example of a situation in which they are not incorrect: in which common syscalls like read/write/open etc are ultrafast, while uncommon ones like readfile remain slow. Oops, now it's less optimizable.

Do you even read what you're responding to?

Posted Mar 17, 2020 16:40 UTC (Tue) by mebrown (subscriber, #7960) [Link] (2 responses)

Did you notice he used the word "optimizable"? Paying careful attention to that suffix: -able, can you explain a future world that exists where it's not possible to optimize one system call to be faster than three system calls?

Posted Mar 17, 2020 22:50 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

I just did: a potential world in which there is a small set of fast syscalls (perhaps architectural limitations prevent the set becoming larger), and a larger set of slow ones. open() and read() would almost certainly be in the fast set; readfile(), being rarely used to date, would surely be in the slow set. If syscall entry/exit in the fast set is at least three times faster than from the slow set, there's nothing you can do to make that one slow syscall faster than the three fast ones, as long as they're doing remotely the same work.
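
The arithmetic behind that claim can be written out as a toy cost model. Everything here is an assumption made for illustration: the function names, the arbitrary cost units, and the simplification that both paths do exactly the same amount of work once inside the kernel.

```python
def total_cost(n_calls, entry_exit_cost, work):
    """Total cost of n_calls syscalls that together perform `work`."""
    return n_calls * entry_exit_cost + work

def one_slow_call_wins(fast_entry, slow_entry, work):
    """True if one slow-path call beats three fast-path calls.

    Assumes the same `work` is done either way, so only the
    entry/exit overhead differs between the two options.
    """
    three_fast = total_cost(3, fast_entry, work)
    one_slow = total_cost(1, slow_entry, work)
    return one_slow < three_fast
```

Under these assumptions the single call wins only while `slow_entry < 3 * fast_entry`; at a ratio of three or more, the three fast calls are at least as cheap.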

(Similar things have existed before, and will again: the vDSO is one such, as is the fast syscall optimization in Solaris on SPARC.)

Posted Mar 18, 2020 10:10 UTC (Wed) by farnz (subscriber, #17727) [Link]

Except that for such a potential world to be worth considering, you need to explain how it's plausible.

The "fast syscall optimization" in Solaris on SPARC used the fact that SPARC has 128 syscall entry points in the hardware to optimize up to 128 syscalls - that's over a third of Linux syscalls, more if you ignore all the legacy syscalls (as Solaris could, since it could do the translation from legacy to current in libc). It only had such a drastic effect in Solaris since the "fast" syscalls didn't make use of the generic ABI translation at syscall entry that Solaris chose to do to simplify syscall implementation - in other words, it worked around a known software deficiency in Solaris, stemming from their desire to use the same SunStudio compiler and ABI for all code, rather than teaching SunStudio to have a syscall ABI for kernel code to use.

The vDSO isn't about syscalls per se; the vDSO page is a userspace page that happens to be shared with the kernel and contains userspace code and data from the kernel, allowing you to avoid making a syscall at all.

Remember that, at heart, syscalls are four machine micro-operations sequenced sensibly; everything else is built on top of this:

Save the current privilege level, so that you can restore it on return.
Save the next PC so that you can return back here.
Set the current privilege level.
Set PC to a syscall entry point.
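
As a rough illustration, the four steps above can be modelled on a toy CPU. The `Cpu` class, the privilege-level constants, and the addresses are all hypothetical; real hardware performs these steps in microcode or dedicated logic, not in software.

```python
from dataclasses import dataclass, field

USER, KERNEL = 3, 0  # hypothetical privilege levels (lower = more privileged)

@dataclass
class Cpu:
    pc: int = 0
    privilege: int = USER
    saved: list = field(default_factory=list)  # stands in for save registers

def syscall_enter(cpu, entry_point, insn_len=1):
    """Model the four micro-operations of a syscall instruction."""
    saved_privilege = cpu.privilege       # 1. save current privilege level
    return_pc = cpu.pc + insn_len         # 2. save the next PC
    cpu.saved.append((saved_privilege, return_pc))
    cpu.privilege = KERNEL                # 3. set the new privilege level
    cpu.pc = entry_point                  # 4. jump to the syscall entry point

def syscall_return(cpu):
    """Undo syscall_enter: restore the saved privilege level and PC."""
    cpu.privilege, cpu.pc = cpu.saved.pop()
```

Only step 4 takes a value (`entry_point`) that could differ from one syscall to another; the other three steps are the same for every call.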

Any optimization in hardware that leads to a subset of syscalls being faster has to be in the last micro-operation; all the others are common to all syscalls. The only such optimization that's possible is to have alternate syscall entry points for different syscalls; this is what the SPARC trap system does, using a 128 entry trap table to decide which syscall entry point to use.
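
A toy contrast between the two designs, as a sketch only: the handler names, the syscall numbers, and the `trace` list are invented for illustration, and the "common entry code" step stands in for whatever generic work a kernel does at its single entry point.

```python
def sys_read():
    return "read"

def sys_open():
    return "open"

# Hypothetical syscall numbers, for illustration only.
HANDLERS = {0: sys_read, 1: sys_open}

def single_entry_point(syscall_nr, trace):
    """One hardware entry point: common code runs, then software dispatches."""
    trace.append("common entry code")
    return HANDLERS[syscall_nr]()

# A 128-slot table, matching the size of the SPARC trap table described above.
TRAP_TABLE = [None] * 128
TRAP_TABLE[0] = sys_read
TRAP_TABLE[1] = sys_open

def trap_table_entry(trap_nr, trace):
    """Hardware selects the handler directly; no common code runs first.

    `trace` is unused here and kept only so both functions have the same shape.
    """
    return TRAP_TABLE[trap_nr]()
```

Both paths reach the same handler; the difference is only whether shared entry code runs before it, which is where the Solaris saving came from.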

Note, too, that the tendency over time is to optimize the hardware with a single syscall entry point, since that's just a single pointer-sized piece of data to track; Intel's 8086 through to 80186 only had INT for syscalls, the 80286 added call gates, while the Pentium II added SYSENTER, which only has a single possible entry point. Similarly, ARM, MIPS, POWER, PowerPC, RISC-V, and AArch64 all only have a single instruction to do syscalls that goes to a single syscall entry point (although on POWER, PowerPC, ARM, and AArch64, that instruction also carries a limited amount of data that's supplied to the kernel, intended for use as a syscall number).

SPARC is the one exception to the rule that more modern architectures only have a single syscall entry point, with its trap table of 128 entries, and even then, it was only a performance win because Solaris was able to use the trap table to get around its own earlier bad decisions around syscall handling.

Posted Mar 17, 2020 16:53 UTC (Tue) by farnz (subscriber, #17727) [Link]

Except that that's an implausible situation, based on the hardware of the last 50 (yes, 50!) years.

The trend has been towards fewer system call instructions, not more, over time. In the 1970s, you had things like the 8008's RST instructions, which gave you a small number of fast system calls. RISC CPUs have tended to have just a single syscall type instruction (sc/svc in PowerPC/POWER, SVC in AArch64, SWI in AArch32, syscall in MIPS), with the exception of SPARC, whose trap instructions allowed you to specify different trap handlers directly.

In modern x86, the SYSENTER/SYSCALL instructions are also a single option - there's no "fast path" included here at all.

Now, AArch32, AArch64, POWER/PowerPC, and VAX all have an argument supplied as part of the syscall instruction itself, but it's literally just an argument. It doesn't point you to a new trap handler, it's just an argument to the handler.