What about async metadata

Posted Jan 17, 2019 1:03 UTC (Thu) by dw (guest, #12017)
In reply to: What about async metadata by Sesse
Parent article: Ringing in a new asynchronous I/O API

I have it on my todo list to write a fully CPU/IO-parallel ZIP implementation (because it's fairly straightforward), with an article around it highlighting that most of the traditional UNIX tooling is utterly obsolete on pretty much all modern devices. Naturally it can't really benefit from the work here, for the reason given in the parent comment, but yeah, the problem is very real, and frankly an entirely ridiculous state of affairs.

What about async metadata

Posted Jan 17, 2019 12:30 UTC (Thu) by Sesse (subscriber, #53779)

So you want to demonstrate that something is obsolete by implementing… an obsolete compression algorithm? :-)

(zlib/deflate is still around pretty much only due to huge transition costs, and a fragmented market among the alternatives. Try something like zstd if you want to make a clean break.)

What about async metadata

Posted Jan 17, 2019 12:33 UTC (Thu) by dw (guest, #12017)

There's always newer and better technology around, but tech is only useful when it's compatible with what you already have :) And ZIPs are eeeverywhere

What about async metadata

Posted Jan 22, 2019 10:09 UTC (Tue) by epa (subscriber, #39769)

It would be convenient to have a system call that declares 'I plan to read this file in the near future'. The kernel would make a best effort to get that file into the page cache, using background I/O, while your process continues. So if you are about to zip up a directory, call plan_to_read() on each file, then continue reading them sequentially as normal. It wouldn't be quite as fast as a true parallel implementation, but for some tasks it could give you 80% of the performance gains without having to rewrite your creaky old sequential code.

What about async metadata

Posted Jan 22, 2019 11:35 UTC (Tue) by dw (guest, #12017)

Isn't this basically what posix_fadvise() gives us already? But IIRC that interface currently or previously blocked while readahead happened.
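For reference, a minimal sketch of that hint, assuming Linux and Python's `os.posix_fadvise` wrapper. The helper name `advise_willneed` is made up for illustration, and whether the call returns before the readahead completes depends on the kernel and filesystem.

```python
import os

def advise_willneed(path):
    """Ask the kernel to start pulling `path` into the page cache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and len=0 together mean "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        # The hint applies to the file's cached pages, so the
        # descriptor can be closed straight away.
        os.close(fd)
```

Note that this still costs an open and a close per file, which is exactly the kind of round trip the small-file case suffers from.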

For zipping, imagine something like a 100k-item maildir of tiny 1.5 KB messages. While the compression is still relatively expensive, a huge chunk of the operation will be wasted on ceremonial serialized filesystem round-trips (open/close/read/stat/getdents/etc). To avoid that, I'm not sure there is any way around a whole bunch of threads keeping as many FS operations in flight as possible (either doing the CPU bits or any IO bits for uncached data) to get even close to a genuinely busy computer.
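A rough sketch of the "many threads keeping FS operations in flight" idea, assuming Python's standard `zipfile` and `concurrent.futures` modules. The names `read_file` and `zip_directory` and the default of 32 workers are illustrative choices, not a tuned design. Only the open/read/close round trips overlap here; `zipfile` is not safe for concurrent writers, so compression stays serial in the main thread.

```python
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor

def read_file(path):
    # One open/read/close sequence; many of these run concurrently.
    with open(path, "rb") as f:
        return path, f.read()

def zip_directory(src_dir, zip_path, workers=32):
    """Zip the regular files directly inside src_dir; return the file count."""
    paths = sorted(e.path for e in os.scandir(src_dir) if e.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool, \
         zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # map() yields results in input order while workers read ahead.
        # All pending file contents are held in memory, which is fine
        # for tiny messages but not for large files.
        for path, data in pool.map(read_file, paths):
            zf.writestr(os.path.relpath(path, src_dir), data)
    return len(paths)
```

A fully parallel version would also have to compress entries on multiple cores and stitch the resulting streams into the archive, which this sketch does not attempt.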

What about async metadata

Posted Jan 22, 2019 12:22 UTC (Tue) by epa (subscriber, #39769)

Yes, I was thinking of a few large files, where the overhead really is in I/O and not in bookkeeping.

How about a generalized stat() that lets you open a directory and get info on all the files it contains? That would save a lot of time, and not just for parallel code. Network filesystems, for example.

What about async metadata

Posted Jan 22, 2019 12:38 UTC (Tue) by epa (subscriber, #39769)

You mentioned posix_fadvise(). That is useful but not quite the stupidly simple interface I had in mind. It requires an open file handle. I envisaged a call that takes a filename and nothing else, works entirely in the background, and does not fail (not even if the file doesn't exist or whatever; it just does nothing in that case).

You could then sprinkle these calls all over your code -- including scripting languages -- and get a handy speedup without having to do any real programming.
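Something close to this can be approximated in user space today. The sketch below is a hypothetical `plan_to_read()` built on `posix_fadvise()` and a single background thread; it assumes Linux and Python's `os.posix_fadvise`, and it is not an existing system call or library function.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical helper: one shared background worker issues the hints.
_prefetcher = ThreadPoolExecutor(max_workers=1)

def _prefetch(path):
    try:
        # O_NONBLOCK keeps open() from hanging on things like FIFOs.
        fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    except OSError:
        return False  # missing or unreadable: silently do nothing
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        return True
    except OSError:
        return False  # e.g. not a regular file
    finally:
        os.close(fd)

def plan_to_read(path):
    """Hint that `path` will be read soon; never raises for I/O errors."""
    # The returned future can be ignored by callers.
    return _prefetcher.submit(_prefetch, path)
```

Usage is just `plan_to_read(name)` for each file before the sequential loop starts. The open and close still happen per file, only off the main thread.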

What about async metadata

Posted Feb 26, 2019 1:53 UTC (Tue) by josh (subscriber, #17465)

> It would be convenient to have a system call that declares 'I plan to read this file in the near future'.

The readahead system call does that.