Why kernel.org is slow
Discussion on the mailing lists reveals that the kernel.org servers (there are two of them) often run with load averages in the range of 2-300. So it's not entirely surprising that they are not always quite as responsive as one would like. There is talk of adding servers, but there is also a sense that the current servers should be able to keep up with the load. So the developers have been looking into what is going on.
The problem seems to originate with git. Kernel.org hosts quite a few git repositories and a version of the gitweb system as well - though gitweb is often disabled when the load gets too high. According to kernel.org administrator H. Peter Anvin, the git-related problems, in turn, come down to the speed with which Linux can read directories.
Clearly, something is not quite right with the handling of large filesystems under heavy load. Part of the problem may be that Linux is not dedicating enough memory to caching directories in this situation, but the real problems are elsewhere. It turns out that:
- The getdents() system call, used to read a directory, is, according to Linus, one of the most expensive in Linux. The locking is such that only one process can be reading a given directory at any given time. If that process must wait for disk I/O, it sleeps holding the inode semaphore and blocks all other readers - even if some of the others could work with parts of the directory which are already in memory.
- No readahead is done on directories, so each block must be read, one by one, with the whole process stopping and waiting for I/O each time.
- To make things worse, while the ext3 filesystem tries hard to lay out files contiguously on the disk, it does not make the same effort with directories. So the chances are good that a multi-block directory will be scattered on the disk, forcing a seek for each read and defeating any track caching the drive may be doing.
It has been reported that the third of the above-listed problems can be addressed by moving to XFS, which does a better job at keeping directories together. Kernel.org could make such a switch - at the cost of about a week's downtime for each server. So one should not expect it to happen overnight.
The first priority for improving the situation is, most likely, the
implementation of some sort of directory readahead. That change would cut
the amount of time spent waiting for directory I/O and, crucially, would
require no change to existing filesystems - not even a backup and restore -
to get better performance. An early readahead patch has been circulated,
but this issue looks complex enough that a few iterations of careful work
will be required to arrive at a real solution. So look for something to
show up in the 2.6.21 time frame.
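To make the getdents() cost concrete, here is a small illustrative user-space program (mine, not from any patch under discussion) that simply times a full scan of one directory. Each readdir() call is backed by getdents() underneath, and on a large, uncached directory every block read can stall on disk while the directory's inode semaphore is held. Running it twice - cold cache, then warm - shows how much of the time goes to waiting for directory I/O.

/* Illustrative sketch only: time a full scan of one directory.
 * Each readdir() here is serviced by getdents() in the kernel; on a
 * large, uncached directory every block may require a synchronous
 * disk read, performed while the directory's inode lock is held. */
#include <dirent.h>
#include <stdio.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : ".";
    struct timeval start, end;
    long entries = 0;

    DIR *dir = opendir(path);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    gettimeofday(&start, NULL);
    while (readdir(dir) != NULL)   /* each batch may sleep on disk I/O */
        entries++;
    gettimeofday(&end, NULL);
    closedir(dir);

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%ld entries in %.3f seconds\n", entries, secs);
    return 0;
}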
Why kernel.org is slow: EXT3 file fragmentation?
Posted Jan 11, 2007 11:38 UTC (Thu)
by etienne_lorrain@yahoo.fr (guest, #38022)
[Link] (2 responses)

> To make things worse, while the ext3 filesystem tries hard to lay out
> files contiguously on the disk, it does not make the same effort with
> directories.

It is maybe only me, but I have had a problem with creating contiguous files on EXT3 for quite some time.

I need them for my Gujin bootloader, to simulate a floppy/hard disk before booting: a small resident piece of software simulating a BIOS disk to do all sorts of stuff. The resident code reads the data from the disk, so it does not manage holes at all for technical reasons, and while it is easy to create contiguous files with ISO9660 filesystems where all files are contiguous, FAT and especially EXT* are more of a problem.

Even if you submit the complete file to write at once through the low-layer interface of Linux, you often end up with multiple segments - maybe linked to some security or load balancing setup.

I have even written two small programs if someone wants to experiment: one to copy a file trying to get a single segment (and so a 12-block hole at its beginning for EXT2/3), retrying multiple times if it does not achieve it, and one to display the file segments on the partition/disk.

To reproduce, get gujin-1.6.tar.gz from http://gujin.org or sourceforge, gunzip/untar and, choosing a big file like "disk.c", from an ext2/3 partition (YMMV):

# make showmap addhole
# su
# ./addhole disk.c holly_disk.c
Warning: created file with multiple segments, renaming to 'htmpA' and retrying
Warning: created file with multiple segments, renaming to 'htmpB' and retrying
Warning: created file with multiple segments, renaming to 'htmpC' and retrying
Success: cleaning intermediate files
# ./showmap holly_disk.c
File "holly_disk.c" of size 306600 (599 blocks of 512) is on filesystem 0xFE0A.
Device block size: 512, FS block size: 4096, device size: 41943038 blocks
Device length: 21474835456 bytes
The device start at 0, C/H/S: 0/0/0.
File (75 blocks of 4096) begin with a hole of 12 block, and start at block 334377 for 63 blocks,
last block 334439 and file has 1 fragments.
FIBMAP succeeded after end of file, block index 75 give block 0
# rm holly_disk.c

Maybe it is time for an EXT2/3 analyser/defragmenter, or to disable something somewhere?
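The transcript above mentions FIBMAP, which is also the easiest way to inspect a file's layout without building the Gujin tools. A minimal block-map lister along those lines might look like the sketch below (the fragment-counting logic and all names are illustrative, not the actual showmap code); like the transcript's su suggests, FIBMAP requires root.

/* Minimal FIBMAP-based fragment lister (illustrative sketch).
 * For each filesystem block of the file, FIBMAP returns the physical
 * block number (0 for a hole); consecutive physical blocks belong to
 * the same fragment. Must be run as root. */
#include <fcntl.h>
#include <linux/fs.h>      /* FIBMAP, FIGETBSZ */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    int bsz = 0;
    if (fstat(fd, &st) < 0 || ioctl(fd, FIGETBSZ, &bsz) < 0) {
        perror("stat/FIGETBSZ");
        return 1;
    }

    long nblocks = (st.st_size + bsz - 1) / bsz;
    long fragments = 0, prev = -2;

    for (long i = 0; i < nblocks; i++) {
        int blk = (int)i;                  /* in: logical block index */
        if (ioctl(fd, FIBMAP, &blk) < 0) { /* out: physical block     */
            perror("FIBMAP");
            return 1;
        }
        if (blk != prev + 1)               /* discontinuity: new fragment */
            fragments++;
        prev = blk;
    }
    printf("%s: %ld blocks of %d, %ld fragment(s)\n",
           argv[1], nblocks, bsz, fragments);
    close(fd);
    return 0;
}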
Why kernel.org is slow: EXT3 file fragmentation?
Posted Jan 13, 2007 22:13 UTC (Sat)
by jzbiciak (guest, #5246)
[Link] (1 responses)

The EXT2/EXT3 filesystem has a number of structures, such as block group bitmaps, that occur in a regular pattern over the surface of the disk. While ext2/ext3 try to minimize gaps in a file, these fixed structures do interrupt large files. So, in your case, you're probably hitting that. Furthermore, no amount of allocator smarts will make a fully contiguous file if it's larger than the distance between two block bitmaps.

Note that fragmenting a file in this manner tends not to be a performance issue, since large linear reads still mostly read the file of interest, and skipping a few blocks doesn't typically cause a seek. That is why I imagine it's not widely considered an issue. It's quite a bit different than, say, the old MS-DOS first-fit policy that could cause a file to be literally spread piecemeal across the entire filesystem. That sort of fragmentation destroys performance.
Why kernel.org is slow: EXT3 file fragmentation?
Posted Jan 15, 2007 11:51 UTC (Mon)
by etienne_lorrain@yahoo.fr (guest, #38022)
[Link]

No - here, as shown in the small and quick example, to create a contiguous file of approximately 300 Kbytes the small program had to try 4 times; the first 3 times the file was not contiguous.

The first 3 files are not deleted but renamed after each unsuccessful try, so as not to position the new file at the exact same place on the disk.

I'd bet there has not been even one "EXT2/3 fixed area" collision - those should be very unusual, and the behaviour I see on multiple distributions is very usual. The small program stops trying after 16 unsuccessful (i.e. non-contiguous) files, because sometimes I got 10 non-contiguous 1.44 Mbytes files (unloaded machine, plenty of space on the FS)...

I do agree that if the two fragments are nearby it should not produce a performance loss.
Why kernel.org is slow
Posted Jan 11, 2007 13:19 UTC (Thu)
by davecb (subscriber, #1574)
[Link] (5 responses)

Directory performance is a long-lived issue with Unix-derived operating systems, and a known hard problem even in the research world: Andy Tanenbaum's "amoeba" team have some interesting publications on the subject.

In a previous life, the low-hanging fruit in in-memory directory structures were:

- The time to find that a file does not exist. Ironically, NTFS does it better, with an ordered (actually b-tree) structure, but one can get surprising improvements by sorting just the in-memory form of the structure.
- Searching for something which does exist: as above.
- Using the full generality of locking for an update to a single directory entry. Renaming to an equal-length or shorter name is a common case which can be done with minimal locking (depending on your locking structure: YMMV (:-)).
- Reader-writer locks, for some sense of that phrase. Getting the right sense seems to be rather subtle, but the read speed that kernel.org needs can be directly addressed here.
- And finally, lock-free and low-lock schemes, optimal for the combination of reader-writer and fast in-memory access, for all of the above.

It is understood that the last is something of a challenge (;-))

--dave
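A minimal sketch of the reader-writer idea mentioned in the comment above, assuming a purely in-memory table of directory entries (the structures and function names are hypothetical, and this is user-space pthreads code rather than anything from the kernel): lookups take the lock shared, so any number can run in parallel, while an in-place rename takes it exclusively for only a short critical section.

/* Hypothetical in-memory directory table guarded by a rwlock (sketch).
 * Many concurrent lookups proceed in parallel; a rename excludes them
 * only for the time it takes to update one entry.
 * Compile with -lpthread. */
#include <pthread.h>
#include <string.h>

#define NAME_MAX_LEN 255

struct dir_entry {
    char name[NAME_MAX_LEN + 1];
    unsigned long inode;
};

struct mem_dir {
    pthread_rwlock_t lock;
    struct dir_entry *entries;
    size_t count;
};

/* Shared (read) lock: any number of lookups may run concurrently. */
long dir_lookup(struct mem_dir *d, const char *name)
{
    long ino = -1;
    pthread_rwlock_rdlock(&d->lock);
    for (size_t i = 0; i < d->count; i++) {
        if (strcmp(d->entries[i].name, name) == 0) {
            ino = (long)d->entries[i].inode;
            break;
        }
    }
    pthread_rwlock_unlock(&d->lock);
    return ino;
}

/* Exclusive (write) lock: rename-in-place is a short critical section. */
int dir_rename(struct mem_dir *d, const char *old_name, const char *new_name)
{
    int ret = -1;
    pthread_rwlock_wrlock(&d->lock);
    for (size_t i = 0; i < d->count; i++) {
        if (strcmp(d->entries[i].name, old_name) == 0) {
            strncpy(d->entries[i].name, new_name, NAME_MAX_LEN);
            d->entries[i].name[NAME_MAX_LEN] = '\0';
            ret = 0;
            break;
        }
    }
    pthread_rwlock_unlock(&d->lock);
    return ret;
}

The linear scan inside the read lock is deliberately naive; as the comment notes, even sorting the in-memory form (or hashing it) buys a lot before any locking cleverness is needed.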
Why kernel.org is slow
Posted Jan 11, 2007 13:48 UTC (Thu)
by etienne_lorrain@yahoo.fr (guest, #38022)
[Link] (1 responses)

Just curious: if the problem is linked to read/write access and locks, would it be good (when there are a lot of reads and few writes, as for the Linux versions) to keep the filesystem mounted read-only most of the time?

I mean, keep the partition containing the data read-only, then to update do:

mount -o remount,rw /server/data
cp -ra new_linux_version /server/data
sync
mount -o remount,ro /server/data

Etienne.
Why kernel.org is slow
Posted Jan 11, 2007 16:01 UTC (Thu)
by davecb (subscriber, #1574)
[Link]

Hmmn, does someone know if Linux directory locks are never held on read-only media? I know zfs locks at the directory-entry level (see http://src.opensolaris.org/source/xref/loficc/crypto/usr/... ) but UFSs generally lock the in-memory directory, and don't know if it comes from RO or RW media...

Anyone know ext3 that well?

--dave
Why kernel.org is slow
Posted Jan 13, 2007 13:20 UTC (Sat)
by ebiederm (subscriber, #35028)
[Link] (2 responses)

Linux appears to do a much better job than the NT kernel for the in-memory data structures. A cheap way to see this is to run git on a Windows system. There is an order of magnitude performance hit for directory-sensitive things. I don't believe that is just cygwin.

ext3 for large directories hashes the filename and looks it up in a btree. Using a hash of the filename results in a better branching factor in your btree. So the on-disk data structures are not at a disadvantage.

I haven't looked, but ext2+ directories should all be kept in the same block group, which is roughly a single disk track. So even with fragmentation the disk track cache should work well. I don't remember if block groups are small enough so that they always map to the same disk track, though.

So I'm pretty certain the issue is the large directories and the inode semaphore.

Read-ahead should help a lot if the pages don't get thrown out before we use them.

Changing the locking to allow more concurrency is a trickier problem. If done right, my gut feel is that you should be able to operate essentially lock free, with multiple concurrent writes and reads going on simultaneously. The readdir semantics allow for it. But anything with a high degree of concurrency comes with tricky corner cases.

Eric
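A back-of-the-envelope illustration of the branching-factor point above; the record sizes are round assumptions for illustration, not ext3's actual on-disk layout:

/* Back-of-the-envelope branching factors (illustrative numbers only,
 * not ext3's real record formats).  A fixed-size hash key packs far
 * more routing entries into one block than variable-length names do,
 * so the index tree stays shallower. */
#include <stdio.h>

int main(void)
{
    const int block = 4096;          /* one directory/index block        */
    const int hash_entry = 4 + 4;    /* 32-bit hash + 32-bit block ptr   */
    const int avg_name = 32;         /* assumed average filename length  */
    const int name_entry = avg_name + 4 + 4; /* name + inode + block ptr */

    printf("hash-keyed fanout : %d entries/block\n", block / hash_entry);
    printf("name-keyed fanout : %d entries/block\n", block / name_entry);
    return 0;
}

With these made-up sizes, a single 4KB index block routes to 512 children instead of roughly 100, so an index covering a huge directory can stay a level shallower.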
Why kernel.org is slow
Posted Jan 13, 2007 15:44 UTC (Sat)
by davecb (subscriber, #1574)
[Link]

Excellent!

I'd be inclined to say that lock-free algorithms might be a solution to look closely at... more speculation after I've had a chance to think about it (;-))

--dave
Why kernel.org is slow
Posted Jan 14, 2007 10:25 UTC (Sun)
by evgeny (guest, #774)
[Link]

> I don't believe that is just cygwin.

Comparing with [an app running under] cygwin is unfair. Try watching a configure script under cygwin and natively - the difference can easily be a factor of ten.
Why kernel.org is slow
Posted Jan 11, 2007 16:09 UTC (Thu)
by smitty_one_each (subscriber, #28989)
[Link]

Efforts at doing a git-fetch to go from 2.6.20-rc2-mm1 to 2.6.20-rc3-mm1 have failed spectacularly half a dozen times for me.

After counting ~15k objects and establishing that it should push ~10k, git-fetch dies on an EOF.

Could LWN host git trees for folks happy to pay for the bandwidth, and, if so, what cost might that be?
Why kernel.org is slow
Posted Jan 12, 2007 0:42 UTC (Fri)
by wcooley (guest, #1233)
[Link]

This explains why my rsync mirrors from mirror.kernel.org have been failing for the last few weeks.
Why kernel.org is slow
Posted Jan 13, 2007 0:17 UTC (Sat)
by brouhaha (subscriber, #1698)
[Link] (1 responses)

How difficult would it be to revise the ext3 code to try to keep directory blocks contiguous, as it attempts to do for files? The on-disk structures shouldn't need to change, so it wouldn't break compatibility.

If that's hard to do, how about a tunable parameter to control how much space is initially allocated to directories, and try to keep that initial allocation contiguous?
Why kernel.org is slow
Posted Jan 18, 2007 10:21 UTC (Thu)
by forthy (guest, #1525)
[Link]

It's a bit of a mystery to me why nobody has attacked this problem a long time ago. Directory read was always a pain in the neck, and you can imagine how slow it is if you compare locate with find (and how big the impact of rebuilding the locate database is).

From a more abstract point of view, the directory is a data base with file names, and an n:1 relation between file names and parent directories. The relation between overall file system size and directory size is quite good, i.e. the directory size is a small percentage figure. On a larger file server here with about 1TB space used, the locatedb (which contains just everything) is only ~64MB. Even when you use a larger, less space-efficient directory structure, 128MB/TB should be completely sufficient. A modern RAID array can read 128MB in a fraction of a second, the memory is there to keep it all, so a find / -name '*' can - if well implemented - print a result within a second or less.

I'd suggest the following to the file system implementors: Forget everything you'd read about Unix directories. Start from scratch. Get a decent knowledge about how data bases work; the directory is a data base. An extremely simple one, so to say. Create a single directory file for the directory data base; make sure that it won't fragment much over time (if the directory grows beyond the previously allocated space, allocate a larger space, and copy the directory over completely). Do read-aheads and all the other caching stuff like for any other file when accessing the directory data base. Keep the file names easy to access by using a large hash table (on disk - not to be computed on the fly!). The hash key is computed as usual from the directory id + file name hash.

And for the locking: Make sure that readers never have to lock a directory. They'll maybe get stale content when a writer adds or removes files from a directory, but that's ok. You can never rely on getdents() entries to be valid when you open() them later. Writers should use an RCU mechanism for updating directories.
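One possible concrete reading of "the hash key is computed as usual from the directory id + file name hash" is sketched below; the FNV-1a hash, the field widths, and the bucket count are arbitrary illustrative choices of mine, not part of the proposal. The directory identifier is folded into the hash of the name, so every (directory, name) pair maps to a predictable bucket of the single on-disk table.

/* Hypothetical directory-database key: hash(directory id + file name).
 * FNV-1a is used purely for illustration; any decent hash would do. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t fnv1a64(const void *data, size_t len, uint64_t h)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;               /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Key for one (directory, name) pair in the single directory database. */
static uint64_t dir_key(uint64_t dir_id, const char *name)
{
    uint64_t h = 14695981039346656037ULL;    /* FNV offset basis */
    h = fnv1a64(&dir_id, sizeof dir_id, h);  /* fold in the directory id */
    return fnv1a64(name, strlen(name), h);   /* then the file name */
}

int main(void)
{
    /* Two names in the same directory land in different buckets of a
     * (here, toy-sized) on-disk hash table. */
    const uint64_t table_buckets = 1 << 20;
    printf("%llu\n", (unsigned long long)(dir_key(2, "vmlinuz") % table_buckets));
    printf("%llu\n", (unsigned long long)(dir_key(2, "initrd.img") % table_buckets));
    return 0;
}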
Why kernel.org is slow
Posted Jan 14, 2007 1:08 UTC (Sun)
by csamuel (✭ supporter ✭, #2624)
[Link]

Does anyone know how JFS behaves in this situation as well?

I've been doing some basic benchmarking recently with bonnie++, and what sprang out at me was that XFS is *much* slower for file creation & deletion than both ext3 and JFS (under the 2.6.15 kernel in Ubuntu 6.06), and that with a reasonable log file size (400) JFS was faster than ext3.