
One billion files on Linux

By Jonathan Corbet
August 18, 2010
What happens if you try to put one billion files onto a Linux filesystem? One might see this as an academic sort of question; even the most enthusiastic music downloader will have to work a while to collect that much data. It would require over 30,000 (clean) kernel trees to add up to a billion files. Even contemporary desktop systems, which often seem to be quite adept at the creation of vast numbers of small files, would be hard put to make a billion of them. But, Ric Wheeler says, this is a problem we need to be thinking about now, or we will not be able to scale up to tomorrow's storage systems. His LinuxCon talk used the billion-file workload as a way to investigate the scalability of the Linux storage stack.

One's first thought, when faced with the prospect of handling one billion files, might be to look for workarounds. Rather than shoveling all of those files into a single filesystem, why not spread them out across a series of smaller filesystems? The problems with that approach are that (1) it limits the kernel's ability to optimize head seeks and such, reducing performance, and (2) it forces developers (or administrators) to deal with the hassles involved in actually distributing the files. Inevitably things will get out of balance, forcing things to be redistributed in the future.

Another possibility is to use a database rather than the filesystem. But filesystems are familiar to developers and users, and they come with the operating system from the outset. Filesystems also are better at handling partial failure; databases, instead, tend to be all-or-nothing affairs.

If one wanted to experiment with a billion-file filesystem, how would one come up with hardware which is up to the task? The most obvious way at the moment is with external disk arrays. These boxes feature non-volatile caching and a hierarchy of storage technologies. They are often quite fast at streaming data, but random access may be fast or slow, depending on where the data of interest is stored. They cost $20,000 and up.

With regard to solid-state storage, Ric noted only that 1TB still costs a good $1000. So rotating media is likely to be with us for a while.

What if you wanted to put together a 100TB array on your own? They did it at Red Hat; the system involved four expansion shelves holding 64 2TB drives. It cost over $30,000, and was, Ric said, a generally bad idea. Anybody wanting a big storage array will be well advised to just go out and buy one.

The filesystem life cycle, according to Ric, starts with a mkfs operation. The filesystem is filled, iterated over in various ways, and an occasional fsck run is required. At some point in the future, the files are removed. Ric put up a series of plots showing how ext3, ext4, XFS, and btrfs performed on each of those operations with a one-million-file filesystem. The results varied, with about the only consistent factor being that ext4 generally performs better than ext3. Ext3/4 are much slower than the others at creating filesystems, due to the need to create the static inode tables. On the other hand, the worst performers when creating 1 million files were ext3 and XFS. Everybody except ext3 performs reasonably well when running fsck - though btrfs shows room for some optimization. The big loser when it comes to removing those million files is XFS.

To see the actual plots, have a look at Ric's slides [PDF].

It's one thing to put one million files into a filesystem, but what about one billion? Ric did this experiment on ext4, using the homebrew array described above. Creating the filesystem in the first place was not an exercise for the impatient; it took about four hours to run. Actually creating those one billion files took a full four days. Surprisingly, running fsck on this filesystem only took 2.5 hours - a real walk in the park. So, in other words, Linux can handle one billion files now.

That said, there are some lessons that came out of this experience; they indicate where some of the problems are going to be. The first of these is that running fsck on an ext4 filesystem takes a lot of memory: on a 70TB filesystem with one billion files, 10GB of RAM was needed. That number goes up to 30GB when XFS is used, though, so things can get worse. The short conclusion: you can put a huge amount of storage onto a small server, but you'll not be able to run the filesystem checker on it. That is a good limitation to know about ahead of time.

Next lesson: XFS, for all of its strengths, struggles when faced with metadata-intensive workloads. There is work in progress to improve things in this area, but, for now, it will not perform as well as ext3 in such situations.

According to Ric, running ls on a huge filesystem is "a bad idea"; iterating over that many files can generate a lot of I/O activity. When trying to look at that many files, you need to avoid running stat() on every one of them or trying to sort the whole list. Some filesystems can return the file type with the name in readdir() calls, eliminating the need to call stat() in many situations; that can help a lot in this case.
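
A minimal sketch of that readdir()-based approach (not Ric's benchmark code): count regular files using the d_type field where the filesystem provides it, falling back to lstat() only when d_type comes back as DT_UNKNOWN.

    /* Minimal sketch: count regular files without stat()ing each one,
     * using the d_type field that some filesystems (ext4, btrfs, XFS)
     * fill in for readdir(); fall back to lstat() when it is DT_UNKNOWN. */
    #define _GNU_SOURCE
    #include <dirent.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : ".";
        DIR *dir = opendir(path);
        if (!dir) {
            perror("opendir");
            return 1;
        }

        struct dirent *de;
        unsigned long regular = 0, fallback = 0;
        while ((de = readdir(dir)) != NULL) {
            if (de->d_type == DT_REG) {
                regular++;              /* type came straight from readdir() */
            } else if (de->d_type == DT_UNKNOWN) {
                struct stat sb;         /* filesystem did not fill in d_type */
                char buf[4096];
                snprintf(buf, sizeof(buf), "%s/%s", path, de->d_name);
                if (lstat(buf, &sb) == 0 && S_ISREG(sb.st_mode))
                    regular++;
                fallback++;
            }
        }
        closedir(dir);
        printf("%lu regular files (%lu needed lstat)\n", regular, fallback);
        return 0;
    }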

In general, enumeration of files tends to be slow; we can do, at best, a few thousand files per second. That may seem like a lot of files, but, if the target is one billion files, it will take a very long time to get through the whole list. A related problem is backup and/or replication. That, too, will take a very long time, and it can badly affect the performance of other things running at the same time. Since a backup can take days, it really needs to be run on a live, production system. Control groups and the I/O bandwidth controller may be able to help preserve system performance in such situations.
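
To make that last point concrete, here is a hypothetical sketch of dropping a backup process into a low-priority block-I/O control group. It assumes the cgroup-v1 blkio controller (whose proportional weights typically require the CFQ I/O scheduler) is mounted at /sys/fs/cgroup/blkio; the mount point, group name, and weight value are all assumptions that vary from system to system.

    /* Hypothetical sketch: drop a long-running backup job into a
     * low-priority blkio control group so it competes less with
     * production I/O. Paths, group name, and weight are assumptions. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int rc = (fputs(val, f) < 0) ? -1 : 0;
        fclose(f);
        return rc;
    }

    int main(void)
    {
        char buf[64];

        /* Create a child cgroup and give it a low proportional weight. */
        mkdir("/sys/fs/cgroup/blkio/backup", 0755);
        write_str("/sys/fs/cgroup/blkio/backup/blkio.weight", "100\n");

        /* Move this process into the group, then do the backup work. */
        snprintf(buf, sizeof(buf), "%d\n", getpid());
        write_str("/sys/fs/cgroup/blkio/backup/tasks", buf);

        /* ... exec or perform the backup/replication here ... */
        return 0;
    }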

Finally, application developers must bear in mind that processes which run this long will invariably experience failures, sooner or later. So they will need to be designed with some sort of checkpoint and restart capability. We also need to do better about moving on quickly when I/O operations fail; lengthy retry operations can take a slow process and turn it into an interminable one.
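
As a hypothetical illustration of that checkpoint-and-restart idea (the state-file names, the interval, and the process_file() placeholder are all invented for the example), a long-running file walk might periodically and atomically record how far it has gotten, so that a crash means resuming rather than starting over:

    /* Hypothetical sketch of application-level checkpointing for a long
     * file walk: atomically record the index of the last completed item
     * so a crash or I/O failure means resuming, not starting over. */
    #include <stdio.h>
    #include <unistd.h>

    #define CKPT     "walk.ckpt"
    #define CKPT_TMP "walk.ckpt.tmp"

    static long load_checkpoint(void)
    {
        long done = 0;
        FILE *f = fopen(CKPT, "r");
        if (f) {
            if (fscanf(f, "%ld", &done) != 1)
                done = 0;               /* unreadable state: start over */
            fclose(f);
        }
        return done;
    }

    static void save_checkpoint(long done)
    {
        FILE *f = fopen(CKPT_TMP, "w");
        if (!f)
            return;
        fprintf(f, "%ld\n", done);
        fflush(f);
        fsync(fileno(f));               /* make sure it is really on disk */
        fclose(f);
        rename(CKPT_TMP, CKPT);         /* atomically replace old state   */
    }

    int main(void)
    {
        long i, total = 1000000000L;    /* one billion items, in principle */

        for (i = load_checkpoint(); i < total; i++) {
            /* process_file(i);  -- the real per-file work goes here */
            if (i % 100000 == 0)
                save_checkpoint(i);     /* checkpoint every 100,000 files  */
        }
        save_checkpoint(total);
        return 0;
    }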

In other words, as things get bigger we will run into some scalability problems. There is nothing new in that revelation. We've always overcome those problems in the past, and should certainly be able to do so in the future. It's always better to think about these things before they become urgent problems, though, so talks like Ric's provide a valuable service to the community.

Index entries for this article
Kernel: Filesystems
Kernel: Scalability
Conference: LinuxCon North America/2010



One billion files on Linux

Posted Aug 19, 2010 6:54 UTC (Thu) by niner (subscriber, #26151) [Link]

Recently I did similar tests to determine how well PostgreSQL would be able to deal with databases with potentially hundreds of thousands of tables. From what I found out, it's only limited by the file system's ability to work with that many files in a single directory.

So I tried that and put about 4.3 million files in a directory on my ext4 file system. It took quite a while to create and later delete them, but file access times were impressive. It seems like accessing a file by its name in such a directory takes a pretty much constant amount of time. Reading in the directory is quite fast as well, though that obviously takes longer the larger the directory is.
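
A test along these lines can be reproduced with a small throwaway program like the following sketch; the default count and the naming scheme are arbitrary, and a shell loop would do just as well.

    /* Sketch of a throwaway file-creation test like the one described
     * above; the default count and the naming scheme are arbitrary. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        long i, count = argc > 1 ? atol(argv[1]) : 4300000L;
        char name[64];

        for (i = 0; i < count; i++) {
            snprintf(name, sizeof(name), "file-%010ld", i);
            int fd = open(name, O_CREAT | O_WRONLY | O_EXCL, 0644);
            if (fd < 0) {
                perror(name);
                return 1;
            }
            close(fd);                  /* a zero-length file is enough */
            if (i && i % 1000000 == 0)
                printf("%ld files created\n", i);
        }
        return 0;
    }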

One billion files on Linux

Posted Aug 19, 2010 10:13 UTC (Thu) by liljencrantz (guest, #28458) [Link] (5 responses)

I probably simply lack the imagination, but I fail to see when it would be beneficial to keep as many as a billion files in the same directory. A billion files spread out through a million different directories in a hierarchy? That makes loads of sense, and it's really just a matter of time before that becomes normal enough. But in what situations will it make more sense not to group a billion file items into logical groups?

One billion files on Linux

Posted Aug 19, 2010 10:57 UTC (Thu) by cesarb (subscriber, #6266) [Link] (1 responses)

> But in what situations will it make more sense not to group a billion file items into logical groups?

Things like squid cache directories, git object directories, ccache cache directories, that hidden thumbnails directory in your $HOME... They all have in common that the files are named by a hash or something similar. There is no logical grouping at all here; it is a completely flat namespace.

Most of these work around the resulting large number of files in a single directory by extracting some bits (usually 4 or 8) of the hash and using them as the name of a subdirectory (which works because the hashes used have an almost perfectly uniform distribution). Sometimes more than one level is used. If the filesystem can easily deal with a huge number of files in a single directory, this extra complexity is not needed.
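
A minimal sketch of that fan-out scheme (the digest string and directory layout here are purely illustrative; a real cache would compute an actual SHA-1 or similar):

    /* Sketch of the fan-out scheme described above: use the first two hex
     * characters of a content hash as a subdirectory name, much as git's
     * object store does. The digest string is just a placeholder. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    int main(void)
    {
        const char *hash = "ab54d286e7f41c9b0d2f9c3d8e1a7b6c5d4e3f21";
        char dir[3], path[128];

        snprintf(dir, sizeof(dir), "%.2s", hash);   /* "ab": one of 256 buckets */
        mkdir(dir, 0755);
        snprintf(path, sizeof(path), "%s/%s", dir, hash);
        printf("object would be stored at %s\n", path);
        return 0;
    }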

There are also Maildir directories, which use one file per message, and the only logical grouping is a "folder" or similar. If you have a million messages in a single "folder" (for instance one named "linux-kernel-mailing-list" which has all the messages you collected since 1999), you need a filesystem which can deal with a million files in a single directory. And here the names are not hashes, so the scheme above fails (and even if it worked, it would not be a Maildir anymore).

One billion files on Linux

Posted Aug 19, 2010 18:34 UTC (Thu) by liljencrantz (guest, #28458) [Link]

The advantage of putting all files in the same directory is that it's slightly easier to code it that way. The disadvantage is that you have directories that effectively can't have their contents listed using ls; you likely can't even count the number of files in the directory. Basically some kind of storage tar pit. I think I'll stick to using subfolders. And once mailing lists with more than say 10 million messages in them become common, I'll start worrying about a subfoldered replacement for maildir. :-)

One billion files on Linux

Posted Aug 19, 2010 15:19 UTC (Thu) by ricwheeler (subscriber, #4980) [Link] (2 responses)

The test was for a file system, not for a single directory. In the test I ran, I did use a thousand subdirectories (each with 1 million files).

One billion files on Linux

Posted Aug 19, 2010 18:27 UTC (Thu) by liljencrantz (guest, #28458) [Link]

Oh, ok. That makes more sense to me. Thanks for explaining.

One billion files on Linux

Posted Aug 24, 2010 1:06 UTC (Tue) by pr1268 (guest, #24648) [Link]

One can only imagine you used a script to generate the files and directories. Either that, or you are a very fast typist! ;-)

One billion files on Linux

Posted Aug 19, 2010 15:20 UTC (Thu) by zzxtty (guest, #45175) [Link]

If people are wondering about the validity of 1 billion files, I can give an example: I work with MRI data. We do a lot of fMRI, which generates lots of files (DICOM images). One file is generated per slice, and with fMRI you continuously scan someone for an extended period of time, so a single scan can generate 20,000 files. If you've got several MRI scanners and have been up and running for a few years, the 1 billion file mark is not large. So far this year we have generated over 23 million files on one of our scanners.

However, I'm not sure I'd want to store them all on one file system; it would be a nightmare to restore from tape if anything did go horribly wrong. This is where data management comes in: I create a new partition for each scanner, each year. Currently we run all this on midrange hardware RAID and format with ZFS, and it appears to cope. Would be nice to move it all to Linux =)

One billion files on Linux

Posted Aug 19, 2010 17:08 UTC (Thu) by bcopeland (subscriber, #51750) [Link] (1 responses)

> When trying to look at that many files, you need to avoid running stat() on every one of them or trying to sort the whole list.

Underlying this issue is that today's directories (for ext4 at least) are not set up to iterate in inode order. The consequence is that if you do a walk of the files in the order they are stored in the directory, and the inodes aren't in the cache, you have to seek all over the disk to get to the inode information. I remember reading once that the htree designers planned at some point to group the files in htree leaves into buckets based on inode; I wonder if anything ever came of that?

One billion files on Linux

Posted Aug 19, 2010 20:16 UTC (Thu) by ricwheeler (subscriber, #4980) [Link]

One thing you can do (and upstream, tools like rm do this now) is to get a bunch of entries back from readdir and then sort them by inode number.

That removes the random, seeky nature of the list for file systems that suffer from this (ext3/4, reiserfs, other?).

For the more advanced layouts, you should look to btrfs.
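
A minimal sketch of the readdir-then-sort approach described in this comment (not the actual code in rm): collect the directory entries, sort them by inode number, and only then stat() them, so the inode table is read roughly in disk order.

    /* Minimal sketch: read all directory entries, sort them by inode
     * number, then stat() in that order to reduce seeking on filesystems
     * whose directories are not stored in inode order. */
    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    struct entry {
        ino_t ino;
        char  name[256];
    };

    static int by_inode(const void *a, const void *b)
    {
        const struct entry *x = a, *y = b;
        return (x->ino > y->ino) - (x->ino < y->ino);
    }

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : ".";
        DIR *dir = opendir(path);
        if (!dir) {
            perror("opendir");
            return 1;
        }

        size_t n = 0, cap = 1024;
        struct entry *list = malloc(cap * sizeof(*list));
        if (!list) {
            closedir(dir);
            return 1;
        }

        struct dirent *de;
        while ((de = readdir(dir)) != NULL) {
            if (n == cap) {
                cap *= 2;
                list = realloc(list, cap * sizeof(*list));
                if (!list) {
                    closedir(dir);
                    return 1;
                }
            }
            list[n].ino = de->d_ino;
            snprintf(list[n].name, sizeof(list[n].name), "%s", de->d_name);
            n++;
        }
        closedir(dir);

        qsort(list, n, sizeof(*list), by_inode);    /* visit inodes in order */

        for (size_t i = 0; i < n; i++) {
            struct stat sb;
            char buf[4096];
            snprintf(buf, sizeof(buf), "%s/%s", path, list[i].name);
            if (stat(buf, &sb) == 0)
                printf("%-40s %12lld\n", list[i].name, (long long)sb.st_size);
        }
        free(list);
        return 0;
    }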

One billion files on Linux

Posted Aug 19, 2010 20:32 UTC (Thu) by mhelsley (guest, #11324) [Link] (2 responses)

"Finally, application developers must bear in mind that processes which run this long will invariably experience failures, sooner or later. So they will need to be designed with some sort of checkpoint and restart capability."

Was that exactly Ric's point -- that the applications had to checkpoint themselves? Or did he just say that being able to checkpoint applications was necessary? I ask because there's a big difference. Expecting all applications that might be run in these environments to explicitly checkpoint themselves just isn't practical. Look at how many non-HPC applications use BLCR for example.

The alternative is to enable "external" checkpointing. Checkpoints that don't require rewriting the application, or ld preloads, etc. There is already an effort underway to push this to mainline:

https://ckpt.wiki.kernel.org/index.php/Main_Page

One billion files on Linux

Posted Aug 20, 2010 18:12 UTC (Fri) by ricwheeler (subscriber, #4980) [Link] (1 responses)

My general point was that anything that takes days or weeks to complete, will break eventually. Think of using rsync to mirror a billion files over a wide area network for example. After a network issue or a power outage, you do not want to have to start from the first file.

How you checkpoint/restart is less critical to me. I would see that some applications (like rsync itself) should be aware and restartable in their design. Others would certainly benefit from external checkpointing.

One billion files on Linux

Posted Aug 20, 2010 21:54 UTC (Fri) by mhelsley (guest, #11324) [Link]

Thanks for the clarification.

This use of rsync presents an interesting case for the userspace portion of checkpoint/restart.

During checkpoint we often need to checkpoint the contents of the filesystems. One way to do that is with a frozen filesystem and rsync. Obviously if we're rsync'ing to mirror the filesystem in the first place then we shouldn't attempt to checkpoint the rsync task's filesystem(s) with rsync -- we'd want to do a "local" snapshot if possible.

Since the kernel does not force userspace to save the filesystem contents userspace can choose if and how it will do so. In other words this case requires no special changes to the checkpoint syscall.

One billion files on Linux

Posted Aug 20, 2010 0:08 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

One of the headaches with doing an ls on a large number of files is that by default ls fetches everything then sorts it all by filename. There is an option to tell ls not to bother sorting the output (-N IIRC) and I've found that to be significant in many cases.

One billion files on Linux

Posted Aug 24, 2010 0:59 UTC (Tue) by roelofs (guest, #2599) [Link]

> There is an option to tell ls not to bother sorting the output (-N IIRC) and I've found that to be significant in many cases.

I think you mean -f (at least for GNU ls). -N has something to do with quoting, according to ls --help. I've used the former but not the latter, AFAIR.

Greg

hard links ?

Posted Aug 28, 2010 6:57 UTC (Sat) by pixelpapst (guest, #55301) [Link]

I guess Ric has done the big rm -R already, but for the next experiment, I'd be interested in trying a cp -al. This is something I use a lot, probably more than is strictly healthy. Bonus points for trying to diff -ur the two beasts. :-)


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds