LWN: Comments on "One billion files on Linux"
https://lwn.net/Articles/400629/
This is a special feed containing comments posted to the individual LWN article titled "One billion files on Linux".

hard links?
https://lwn.net/Articles/402647/
pixelpapst:
I guess Ric has done the big rm -R already, but for the next experiment I'd be interested in trying a cp -al. This is something I use a lot, probably more than is strictly healthy. Bonus points for trying to diff -ur the two beasts. :-)
Sat, 28 Aug 2010 06:57:16 +0000

One billion files on Linux
https://lwn.net/Articles/401598/
pr1268:
One can only imagine you used a script to generate the files and directories. Either that, or you are a very fast typist! ;-)
Tue, 24 Aug 2010 01:06:26 +0000

One billion files on Linux
https://lwn.net/Articles/401597/
roelofs:
> there is an option to tell ls not to bother sorting the output (-N IIRC) and I've found that to be significant in many cases.

I think you mean -f (at least for GNU ls). -N has something to do with quoting, according to ls --help. I've used the former but not the latter, AFAIR.

Greg
Tue, 24 Aug 2010 00:59:55 +0000

One billion files on Linux
https://lwn.net/Articles/401230/
mhelsley:
Thanks for the clarification.

This use of rsync presents an interesting case for the userspace portion of checkpoint/restart.

During checkpoint we often need to checkpoint the contents of the filesystems. One way to do that is with a frozen filesystem and rsync. Obviously, if we're rsync'ing to mirror the filesystem in the first place, we shouldn't attempt to checkpoint the rsync task's filesystem(s) with rsync -- we'd want to do a "local" snapshot if possible.

Since the kernel does not force userspace to save the filesystem contents, userspace can choose if and how it will do so. In other words, this case requires no special changes to the checkpoint syscall.
Fri, 20 Aug 2010 21:54:57 +0000

One billion files on Linux
https://lwn.net/Articles/401163/
ricwheeler:
My general point was that anything that takes days or weeks to complete will break eventually. Think of using rsync to mirror a billion files over a wide area network, for example. After a network issue or a power outage, you do not want to have to start over from the first file.

How you checkpoint/restart is less critical to me. Some applications (like rsync itself) should be aware of this and restartable by design. Others would certainly benefit from external checkpointing.
Fri, 20 Aug 2010 18:12:31 +0000

One billion files on Linux
https://lwn.net/Articles/401033/
dlang:
One of the headaches with doing an ls on a large number of files is that, by default, ls fetches everything and then sorts it all by filename. There is an option to tell ls not to bother sorting the output (-N IIRC) and I've found that to be significant in many cases.
Fri, 20 Aug 2010 00:08:33 +0000
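To make the ls discussion above concrete, the following is a guess at the kind of readdir()-only loop that an unsorted, stat-free listing boils down to. It is an illustration, not GNU ls itself; with GNU ls, the -f or -U options get you roughly this behaviour.

/*
 * Print directory entries in the order readdir() returns them:
 * no sorting, no per-entry stat(). Illustration only.
 */
#include <dirent.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    DIR *dir = opendir(path);

    if (!dir) {
        perror("opendir");
        return 1;
    }

    struct dirent *de;
    while ((de = readdir(dir)) != NULL)
        printf("%s\n", de->d_name);   /* no sort, no per-entry stat() */

    closedir(dir);
    return 0;
}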
One billion files on Linux
https://lwn.net/Articles/400988/
mhelsley:
"Finally, application developers must bear in mind that processes which run this long will invariably experience failures, sooner or later. So they will need to be designed with some sort of checkpoint and restart capability."

Was that exactly Ric's point -- that the applications had to checkpoint themselves? Or did he just say that being able to checkpoint applications was necessary? I ask because there's a big difference. Expecting all applications that might be run in these environments to explicitly checkpoint themselves just isn't practical. Look at how many non-HPC applications use BLCR, for example.

The alternative is to enable "external" checkpointing: checkpoints that don't require rewriting the application, or ld preloads, etc. There is already an effort underway to push this to mainline:

https://ckpt.wiki.kernel.org/index.php/Main_Page
Thu, 19 Aug 2010 20:32:31 +0000

One billion files on Linux
https://lwn.net/Articles/400989/
ricwheeler:
One thing you can do (and upstream, tools like rm do this now) is to get a bunch of entries back from readdir and then sort them by inode number.

That removes the random, seeky nature of the list for file systems that suffer from this (ext3/4, reiserfs, others?).

For the more advanced layouts, you should look to btrfs.
Thu, 19 Aug 2010 20:16:03 +0000

One billion files on Linux
https://lwn.net/Articles/400968/
liljencrantz:
The advantage of putting all files in the same directory is that it's slightly easier to code it that way. The disadvantage is that you end up with directories whose contents effectively can't be listed using ls; you likely can't even count the number of files in the directory. Basically, it becomes some kind of storage tar pit. I think I'll stick to using subfolders. And once mailing lists with more than, say, 10 million messages in them become common, I'll start worrying about a subfoldered replacement for maildir. :-)
Thu, 19 Aug 2010 18:34:45 +0000

One billion files on Linux
https://lwn.net/Articles/400967/
liljencrantz:
Oh, OK. That makes more sense to me. Thanks for explaining.
Thu, 19 Aug 2010 18:27:21 +0000

One billion files on Linux
https://lwn.net/Articles/400957/
bcopeland:
> When trying to look at that many files, you need to avoid running stat() on every one of them or trying to sort the whole list.

Underlying this issue is that today's directories (for ext4 at least) are not set up to iterate in inode order. The consequence is that if you walk the files in the order they are stored in the directory, and the inodes aren't in the cache, you have to seek all over the disk to get to the inode information. I remember reading once that the htree designers planned at some point to group the files in htree leaves into buckets based on inode; I wonder if anything ever came of that?
Thu, 19 Aug 2010 17:08:37 +0000
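A rough illustration of the readdir-then-sort idea Ric describes (and of the seek problem bcopeland points at): read the whole directory first, sort the entries by the inode number returned in d_ino, and only then stat() them. This is not the actual coreutils change, just a sketch of the technique.

/*
 * Read a directory, sort entries by inode number, then lstat() them,
 * so the inode table is visited roughly in order instead of in
 * directory-hash order. Sketch only.
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

struct entry {
    char name[256];
    ino_t ino;
};

static int by_ino(const void *a, const void *b)
{
    ino_t ia = ((const struct entry *)a)->ino;
    ino_t ib = ((const struct entry *)b)->ino;

    return (ia > ib) - (ia < ib);
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    DIR *dir = opendir(path);
    struct dirent *de;
    struct entry *list = NULL;
    size_t n = 0, cap = 0;

    if (!dir || chdir(path) != 0) {
        perror(path);
        return 1;
    }

    /* Pass 1: read the whole directory, remembering name and inode number. */
    while ((de = readdir(dir)) != NULL) {
        if (n == cap) {
            cap = cap ? cap * 2 : 1024;
            list = realloc(list, cap * sizeof(*list));
            if (!list) {
                perror("realloc");
                return 1;
            }
        }
        snprintf(list[n].name, sizeof(list[n].name), "%s", de->d_name);
        list[n].ino = de->d_ino;
        n++;
    }
    closedir(dir);

    /* Sort by inode number so the stat() pass does not seek randomly. */
    qsort(list, n, sizeof(*list), by_ino);

    /* Pass 2: stat in inode order. */
    for (size_t i = 0; i < n; i++) {
        struct stat st;

        if (lstat(list[i].name, &st) == 0)
            printf("%10lld %s\n", (long long)st.st_size, list[i].name);
    }

    free(list);
    return 0;
}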
One billion files on Linux
https://lwn.net/Articles/400893/
zzxtty:
If people are wondering about the validity of one billion files, I can give an example: I work with MRI data. We do a lot of fMRI, which generates lots of files (DICOM images). One file is generated per slice, and with fMRI you continuously scan someone for an extended period of time, so a single scan can generate 20,000 files. If you've got several MRI scanners and have been up and running for a few years, the one billion file mark is not large. So far this year we have generated over 23 million files on one of our scanners.

However, I'm not sure I'd want to store them all on one file system; it would be a nightmare to restore from tape if anything did go horribly wrong. This is where data management comes in: I create a new partition for each scanner, each year. Currently we run all this on midrange hardware RAID, formatted with ZFS, and it appears to cope. Would be nice to move it all to Linux =)
Thu, 19 Aug 2010 15:20:42 +0000

One billion files on Linux
https://lwn.net/Articles/400937/
ricwheeler:
The test was for a file system, not for a single directory. In the test I ran, I did use a thousand subdirectories (each with 1 million files).
Thu, 19 Aug 2010 15:19:07 +0000

One billion files on Linux
https://lwn.net/Articles/400879/
cesarb:
> But in what situations will it make more sense to not group a billion file items into logical groups?

Things like squid cache directories, git object directories, ccache cache directories, that hidden thumbnails directory in your $HOME... They all have in common that the files are named by a hash or something similar. There is no logical grouping at all here; it is a completely flat namespace.

Most of these work around the resulting large number of files in a single directory by extracting some bits (usually 4 or 8) of the hash and using them as the name of a subdirectory (which works because the hashes used have an almost perfectly uniform distribution). Sometimes more than one level is used. If the filesystem can easily deal with a huge number of files in a single directory, this extra complexity is not needed.

There are also Maildir directories, which use one file per message, where the only logical grouping is a "folder" or similar. If you have a million messages in a single "folder" (for instance one named "linux-kernel-mailing-list" which holds all the messages you have collected since 1999), you need a filesystem which can deal with a million files in a single directory. And here the names are not hashes, so the scheme above fails (and even if it worked, it would not be a Maildir anymore).
Thu, 19 Aug 2010 10:57:02 +0000
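A toy version of the work-around cesarb describes, not squid's or git's actual code: the first two hex digits of the content hash become a subdirectory name, spreading the flat namespace over 256 directories. The base directory name and hash string in the example are made up.

/* Build "base/ab/abcdef..." from a hex hash string such as a SHA-1. */
#include <stdio.h>

static void shard_path(char *out, size_t outlen,
                       const char *base, const char *hexhash)
{
    /* the first 8 bits of the hash pick one of 256 subdirectories */
    snprintf(out, outlen, "%s/%.2s/%s", base, hexhash, hexhash);
}

int main(void)
{
    char path[4096];

    shard_path(path, sizeof(path), "cache",
               "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12");
    printf("%s\n", path);   /* cache/2f/2fd4e1c6... */
    return 0;
}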
One billion files on Linux
https://lwn.net/Articles/400870/
liljencrantz:
I probably simply lack the imagination, but I fail to see when it would be beneficial to keep as many as a billion files in the same directory. A billion files spread out through a million different directories in a hierarchy? That makes loads of sense, and it's really just a matter of time before that becomes normal enough. But in what situations will it make more sense to not group a billion file items into logical groups?
Thu, 19 Aug 2010 10:13:21 +0000

One billion files on Linux
https://lwn.net/Articles/400852/
niner:
Recently I did similar tests to determine how well PostgreSQL would be able to deal with databases with potentially hundreds of thousands of tables. From what I found out, it's only limited by the file system's ability to work with that many files in a single directory.

So I tried that and put about 4.3 million files in a directory on my ext4 file system. It took quite a while to create them, and later to delete them, but file access times were impressive. It seems like accessing a file by its name in such a directory takes a pretty much constant amount of time. Reading in the directory is quite fast as well, though that obviously takes longer the larger the directory is.
Thu, 19 Aug 2010 06:54:14 +0000
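A sketch of the kind of micro-test described above; the directory name and the 100,000-file count are invented for illustration (the real test used millions of files on ext4), but the shape is the same: create a pile of empty files in one directory, then time a single lookup by name once the directory is large.

/*
 * Create NFILES empty files in one directory, then time one stat()
 * lookup by name. Illustration only; scale NFILES up with care.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define NFILES 100000L

static double now(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char name[64];
    struct stat st;
    double t0;

    if (mkdir("manyfiles", 0755) != 0 && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    t0 = now();
    for (long i = 0; i < NFILES; i++) {
        snprintf(name, sizeof(name), "manyfiles/f%08ld", i);
        int fd = open(name, O_CREAT | O_WRONLY, 0644);

        if (fd < 0) {
            perror(name);
            return 1;
        }
        close(fd);
    }
    printf("created %ld files in %.2f s\n", NFILES, now() - t0);

    /* One lookup by name in the now-large directory. */
    snprintf(name, sizeof(name), "manyfiles/f%08ld", NFILES / 2);
    t0 = now();
    if (stat(name, &st) != 0)
        perror(name);
    printf("one lookup by name took %.6f s\n", now() - t0);
    return 0;
}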