Dynamically allocated pseudo-filesystems

Posted May 17, 2022 23:41 UTC (Tue) by dgc (subscriber, #6611)
In reply to: Dynamically allocated pseudo-filesystems by neilbrown
Parent article: Dynamically allocated pseudo-filesystems

> > a find from the root took multiple minutes, and pegged the CPU at 100%, to find that there were 31 million files in it.
>
> Is this even slightly surprising?

Nope.

> If you want "find" to be fast, keep everything in the cache and put up with the memory cost.

But that's just plain wrong. Caches only speed up the *second* access, and find is generally a single-access, cold-cache workload.

Indeed, what I find surprising is that nobody seems to recognise that the limit here is find being "100% CPU bound". That is, find isn't automatically multithreading and making use of all the CPUs in the system. Yet find is a trivially parallelisable workload - iterating individual (sub-) directories per thread scales almost perfectly out to either IO or CPU hardware limits.
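
To illustrate the "one directory per thread" structure described above, here is a minimal sketch. It is not how any existing find implementation works; the names `scan_dir` and `parallel_count` are made up for this example. It assumes CPython, where threads overlap the directory-reading syscalls but not the interpreter-level work, so it shows the shape of the approach rather than the scaling figures quoted in this thread.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def scan_dir(path):
    """Scan one directory; return (non-directory entry count, subdirectory paths)."""
    count, subdirs = 0, []
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                else:
                    count += 1
    except OSError:
        pass  # unreadable or concurrently removed directory: skip it
    return count, subdirs


def parallel_count(root, workers=None):
    """Count non-directory entries under root, one directory per task."""
    workers = workers or os.cpu_count() or 1
    total = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(scan_dir, root)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                count, subdirs = future.result()
                total += count
                # Each newly discovered subdirectory becomes its own task.
                pending.update(pool.submit(scan_dir, d) for d in subdirs)
    return total
```

Calling `parallel_count("/some/tree")` walks the tree with one task per directory. Because each directory is independent work, the number of workers can be raised until either the storage or the CPUs become the limit.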

e.g. I can run a concurrent find+stat iteration that visits every inode in a directory structure of over 50 million inodes on XFS in about 1m30s on my test machine before 16+ CPUs are fully CPU bound on inode cache lock contention. With lock contention sorted, it scales out to 32 CPUs and comes down to about 30s - roughly 1.5 million inodes a second can be streamed through the dentry and inode cache before being CPU bound again.

The inode cache alone on this machine can stream about 6 million cold inodes/s (XFS bulkstat on same 50 million inodes using DONT_CACHE) before we run out of CPU and memory reclaim starts to fall over handling the >10GB/s of memory allocation and reclaim this requires (on a 16GB RAM machine). And even with this sort of crazy high inode scanning rate, the disk is only barely over 50% utilised at ~150k IOPS and 3.5GB/s of read bandwidth.
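
As a back-of-envelope check on the figures above, the arithmetic can be written out as follows. This is only a sketch: it assumes exactly 50 million inodes, exactly 90s and 30s for the two find runs, exactly 10GB/s of allocation, and decimal units, where the text gives approximate or lower-bound values.

```python
# Figures quoted in the comment, rounded to exact values (an assumption).
INODES = 50_000_000            # "over 50 million inodes"
CONTENDED_SECONDS = 90         # "about 1m30s" with lock contention
SCALED_SECONDS = 30            # "about 30s" with contention sorted
BULKSTAT_RATE = 6_000_000      # cold inodes/s via bulkstat
ALLOC_BYTES_PER_S = 10e9       # ">10GB/s" of allocation and reclaim
READ_BYTES_PER_S = 3.5e9       # read bandwidth during bulkstat
READ_IOPS = 150_000            # "~150k IOPS"

# Inode throughput of the two concurrent find+stat runs.
contended_rate = INODES / CONTENDED_SECONDS
scaled_rate = INODES / SCALED_SECONDS

# Time for bulkstat to stream the same inode population.
bulkstat_seconds = INODES / BULKSTAT_RATE

# Memory churn per inode and average size of each read IO.
alloc_bytes_per_inode = ALLOC_BYTES_PER_S / BULKSTAT_RATE
bytes_per_read = READ_BYTES_PER_S / READ_IOPS

print(f"{contended_rate:,.0f} inodes/s contended")
print(f"{scaled_rate:,.0f} inodes/s scaled out")
print(f"{bulkstat_seconds:.1f} s for bulkstat")
print(f"{alloc_bytes_per_inode:,.0f} bytes allocated per inode")
print(f"{bytes_per_read:,.0f} bytes per read IO")
```

Under those assumptions the runs work out to about 0.56 million and 1.67 million inodes/s, bulkstat covers the same inodes in a little over 8s, each inode costs on the order of 1.7kB of allocation, and the average read is around 23kB.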

Modern SSDs are *crazy fast* and we can build machines containing dozens of them and we have the memory bandwidth to feed them all. In-memory and pseudo filesystems that use CPUs to do all the processing/IO (and I include PMEM+DAX in that group) are *slow* compared to the amount of cached data we can stream and access via asynchronous DMA directly to/from the hardware.

So what this anecdote says to me is that this 'find is slow' problem is caused by the fact that our basic filesystem tools still treat systems and storage as if they were a machine from the 1980s - one CPU and a real slow spinning disk - and so fail to use much of the capability the hardware actually has....

> Beware of premature optimisation (the rt of al evl)

Yup, optimising OS structures because a single threaded find is CPU bound is optimising the wrong thing. We should be providing tools that can, out of the box, scale out to the capability of the underlying hardware they are provided with. There are orders of magnitude to be gained by scaling out the tool; optimising for a single CPU bound workload will, at best, gain a few percent.

-Dave.

Dynamically allocated pseudo-filesystems

Posted May 18, 2022 6:04 UTC (Wed) by zdzichu (subscriber, #17118)

The article didn't state _which find_ was used. Presumably it was GNU find.
I'm personally using https://github.com/sharkdp/fd daily. It parallelizes on all CPU cores by default.

Dynamically allocated pseudo-filesystems

Posted May 18, 2022 8:43 UTC (Wed) by dgc (subscriber, #6611)

True, but it doesn't really matter _which find_ was used if it only used 100% CPU. A parallel find that was constrained to a single CPU would behave the same.

FWIW, I do know there are find (and other tool) variants out there that are multi-threaded. I use tools like lbzip2 because compression is another common operation that is trivially parallelisable. The problem is we have to go out of our way to discover and then install multi-threaded tools. It is long past the point where the distros should be defaulting to parallelised versions of common tools rather than those being the exception...

-Dave.

Dynamically allocated pseudo-filesystems

Posted May 26, 2022 14:31 UTC (Thu) by mrugiero (guest, #153040)

> I use tools like lbzip2 because compression is another common operation that is trivially parallelisable.

There are caveats for compression. Block schemes like bzip2 are trivially parallelisable with increased memory usage (which is quite low anyway) as the only drawback, but Lempel-Ziv and streaming compressors in general may take a hit to compression ratio, at least if done without care.
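
A minimal sketch of the block-scheme case, using Python's standard `bz2` module: the input is cut into chunks, each chunk is compressed independently on a worker thread, and the resulting bzip2 streams are concatenated. This is a simplification for illustration and not how lbzip2 is implemented; the function name `parallel_bz2` and the 900,000-byte default chunk size (chosen to match bzip2's largest block size) are assumptions of this example.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2(data, chunk_size=900_000, workers=4):
    """Compress independent chunks in parallel and concatenate the streams."""
    # Split the input; an empty input still yields one (empty) stream.
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the streams stay in sequence.
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


if __name__ == "__main__":
    payload = b"dynamically allocated pseudo-filesystems\n" * 200_000
    packed = parallel_bz2(payload)
    # bz2.decompress accepts multiple concatenated streams.
    assert bz2.decompress(packed) == payload
    print(len(payload), "->", len(packed), "bytes")
```

Each chunk is compressed with no knowledge of its neighbours, which costs bzip2 very little because it already works on independent blocks. The same chunking applied to a compressor that relies on a long shared history is where the ratio penalty mentioned above would be expected to appear.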

Dynamically allocated pseudo-filesystems

Posted May 23, 2022 4:33 UTC (Mon) by alison (subscriber, #63752)

A colleague once filed a bug ticket with the complaint that "find" on /proc took so long. "Tell Linus," I wrote in the comments and marked it as "Won't Fix."