Why kernel.org is slow
Discussion on the mailing lists reveals that the kernel.org servers (there are two of them) often run with load averages in the range of 2-300. So it's not entirely surprising that they are not always quite as responsive as one would like. There is talk of adding servers, but there is also a sense that the current servers should be able to keep up with the load. So the developers have been looking into what is going on.
The problem seems to originate with git. Kernel.org hosts quite a few git repositories and a version of the gitweb system as well - though gitweb is often disabled when the load gets too high. According to kernel.org administrator H. Peter Anvin, the git-related problems, in turn, come down to the speed with which Linux can read directories.
Clearly, something is not quite right with the handling of large filesystems under heavy load. Part of the problem may be that Linux is not dedicating enough memory to caching directories in this situation, but the real problems are elsewhere. It turns out that:
- The getdents() system call, used to read a directory, is, according to Linus, one of the most expensive in Linux. The locking is such that only one process can be reading a given directory at any given time. If that process must wait for disk I/O, it sleeps holding the inode semaphore and blocks all other readers - even if some of the others could work with parts of the directory which are already in memory.
- No readahead is done on directories, so each block must be read, one by one, with the whole process stopping and waiting for I/O each time.
- To make things worse, while the ext3 filesystem tries hard to lay out files contiguously on the disk, it does not make the same effort with directories. So the chances are good that a multi-block directory will be scattered on the disk, forcing a seek for each read and defeating any track caching the drive may be doing.
It has been reported that the third of the above-listed problems can be addressed by moving to XFS, which does a better job at keeping directories together. Kernel.org could make such a switch - at the cost of about a week's downtime for each server. So one should not expect it to happen overnight.
The first priority for improving the situation is, most likely, the
implementation of some sort of directory readahead. That change would cut
the amount of time spent waiting for directory I/O and, crucially, would
require no change to existing filesystems - not even a backup and restore -
to get better performance. An early readahead patch has been circulated,
but this issue looks complex enough that a few iterations of careful work
will be required to arrive at a real solution. So look for something to
show up in the 2.6.21 time frame.
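To make the getdents() cost concrete, here is a small illustrative user-space program (mine, not from any patch under discussion) that simply times a full scan of one directory. Each readdir() call is backed by getdents() underneath, and on a large, uncached directory every block read can stall on disk while the directory's inode semaphore is held. Running it twice - cold cache, then warm - shows how much of the time goes to waiting for directory I/O.

/* Illustrative sketch only: time a full scan of one directory.
 * Each readdir() here is serviced by getdents() in the kernel; on a
 * large, uncached directory every block may require a synchronous
 * disk read, performed while the directory's inode lock is held. */
#include <dirent.h>
#include <stdio.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : ".";
    struct timeval start, end;
    long entries = 0;

    DIR *dir = opendir(path);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    gettimeofday(&start, NULL);
    while (readdir(dir) != NULL)   /* each batch may sleep on disk I/O */
        entries++;
    gettimeofday(&end, NULL);
    closedir(dir);

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%ld entries in %.3f seconds\n", entries, secs);
    return 0;
}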
Why kernel.org is slow: EXT3 file fragmentation?
Posted Jan 11, 2007 11:38 UTC (Thu)
by etienne_lorrain@yahoo.fr (guest, #38022)
[Link] (2 responses)

> To make things worse, while the ext3 filesystem tries hard to lay out
> files contiguously on the disk, it does not make the same effort with
> directories.

It is maybe only me, but I have had a problem with creating contiguous files on EXT3 for quite some time.

I need them for my Gujin bootloader, to simulate a floppy/hard disk before booting: a small resident piece of software simulating a BIOS disk to do all sorts of stuff. The resident code reads the data from the disk, so it does not manage holes at all for technical reasons, and while it is easy to create contiguous files with ISO9660 filesystems where all files are contiguous, FAT and especially EXT* are more of a problem.

Even if you submit the complete file to write at once through the low-layer interface of Linux, you often end up with multiple segments - maybe linked to some security or load balancing setup.

I have even written two small programs if someone wants to experiment: one to copy a file trying to get a single segment (and so a 12-block hole at its beginning for EXT2/3), retrying multiple times if it does not achieve it, and one to display the file segments on the partition/disk.

To reproduce, get gujin-1.6.tar.gz from http://gujin.org or sourceforge, gunzip/untar and, choosing a big file like "disk.c", from an ext2/3 partition (YMMV):

# make showmap addhole
# su
# ./addhole disk.c holly_disk.c
Warning: created file with multiple segments, renaming to 'htmpA' and retrying
Warning: created file with multiple segments, renaming to 'htmpB' and retrying
Warning: created file with multiple segments, renaming to 'htmpC' and retrying
Success: cleaning intermediate files
# ./showmap holly_disk.c
File "holly_disk.c" of size 306600 (599 blocks of 512) is on filesystem 0xFE0A.
Device block size: 512, FS block size: 4096, device size: 41943038 blocks
Device length: 21474835456 bytes
The device start at 0, C/H/S: 0/0/0.
File (75 blocks of 4096) begin with a hole of 12 block, and start at block 334377 for 63 blocks,
last block 334439 and file has 1 fragments.
FIBMAP succeeded after end of file, block index 75 give block 0
# rm holly_disk.c

Maybe it is time for an EXT2/3 analyser/defragmenter, or to disable something somewhere?
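The transcript above mentions FIBMAP, which is also the easiest way to inspect a file's layout without building the Gujin tools. A minimal block-map lister along those lines might look like the sketch below (the fragment-counting logic and all names are illustrative, not the actual showmap code); like the transcript's su suggests, FIBMAP requires root.

/* Minimal FIBMAP-based fragment lister (illustrative sketch).
 * For each filesystem block of the file, FIBMAP returns the physical
 * block number (0 for a hole); consecutive physical blocks belong to
 * the same fragment. Must be run as root. */
#include <fcntl.h>
#include <linux/fs.h>      /* FIBMAP, FIGETBSZ */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    int bsz = 0;
    if (fstat(fd, &st) < 0 || ioctl(fd, FIGETBSZ, &bsz) < 0) {
        perror("stat/FIGETBSZ");
        return 1;
    }

    long nblocks = (st.st_size + bsz - 1) / bsz;
    long fragments = 0, prev = -2;

    for (long i = 0; i < nblocks; i++) {
        int blk = (int)i;                  /* in: logical block index */
        if (ioctl(fd, FIBMAP, &blk) < 0) { /* out: physical block     */
            perror("FIBMAP");
            return 1;
        }
        if (blk != prev + 1)               /* discontinuity: new fragment */
            fragments++;
        prev = blk;
    }
    printf("%s: %ld blocks of %d, %ld fragment(s)\n",
           argv[1], nblocks, bsz, fragments);
    close(fd);
    return 0;
}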
Why kernel.org is slow: EXT3 file fragmentation?
Posted Jan 13, 2007 22:13 UTC (Sat)
by jzbiciak (guest, #5246)
[Link] (1 responses)

The EXT2/EXT3 filesystem has a number of structures, such as block group bitmaps, that occur in a regular pattern over the surface of the disk. While ext2/ext3 try to minimize gaps in a file, these fixed structures do interrupt large files. So, in your case, you're probably hitting that. Furthermore, no amount of allocator smarts will make a fully contiguous file if it's larger than the distance between two block bitmaps.

Note that fragmenting a file in this manner tends not to be a performance issue, since large linear reads still mostly read the file of interest, and skipping a few blocks doesn't typically cause a seek. That is why I imagine it's not widely considered an issue. It's quite a bit different than, say, the old MS-DOS first-fit policy that could cause a file to be literally spread piecemeal across the entire filesystem. That sort of fragmentation destroys performance.
Why kernel.org is slow: EXT3 file fragmentation?
Posted Jan 15, 2007 11:51 UTC (Mon)
by etienne_lorrain@yahoo.fr (guest, #38022)
[Link]

No - here, as shown in the small and quick example, to create a contiguous file of approximately 300 Kbytes the small program had to try 4 times; the first 3 times the file was not contiguous.

The first 3 files are not deleted but renamed after each unsuccessful try, so as not to position the new file at the exact same place on the disk.

I'd bet there has not been even one "EXT2/3 fixed area" collision - those should be very unusual, and the behaviour I see on multiple distributions is very usual. The small program stops trying after 16 unsuccessful (i.e. non-contiguous) files, because sometimes I got 10 non-contiguous 1.44 Mbytes files (unloaded machine, plenty of space on the FS)...

I do agree that if the two fragments are nearby it should not produce a performance loss.
Why kernel.org is slow
Posted Jan 11, 2007 13:19 UTC (Thu)
by davecb (subscriber, #1574)
[Link] (5 responses)

Directory performance is a long-lived issue with Unix-derived operating systems, and a known hard problem even in the research world: Andy Tanenbaum's "amoeba" team have some interesting publications on the subject.

In a previous life, the low-hanging fruit in in-memory directory structures were:

- The time to find that a file does not exist. Ironically, NTFS does it better, with an ordered (actually b-tree) structure, but one can get surprising improvements by sorting just the in-memory form of the structure.
- Searching for something which does exist: as above.
- Using the full generality of locking for an update to a single directory entry. Renaming to an equal-length or shorter name is a common case which can be done with minimal locking (depending on your locking structure: YMMV (:-)).
- Reader-writer locks, for some sense of that phrase. Getting the right sense seems to be rather subtle, but the read speed that kernel.org needs can be directly addressed here.
- And finally, lock-free and low-lock schemes, optimal for the combination of reader-writer and fast in-memory access, for all of the above.

It is understood that the last is something of a challenge (;-))

--dave
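A minimal sketch of the reader-writer idea mentioned in the comment above, assuming a purely in-memory table of directory entries (the structures and function names are hypothetical, and this is user-space pthreads code rather than anything from the kernel): lookups take the lock shared, so any number can run in parallel, while an in-place rename takes it exclusively for only a short critical section.

/* Hypothetical in-memory directory table guarded by a rwlock (sketch).
 * Many concurrent lookups proceed in parallel; a rename excludes them
 * only for the time it takes to update one entry.
 * Compile with -lpthread. */
#include <pthread.h>
#include <string.h>

#define NAME_MAX_LEN 255

struct dir_entry {
    char name[NAME_MAX_LEN + 1];
    unsigned long inode;
};

struct mem_dir {
    pthread_rwlock_t lock;
    struct dir_entry *entries;
    size_t count;
};

/* Shared (read) lock: any number of lookups may run concurrently. */
long dir_lookup(struct mem_dir *d, const char *name)
{
    long ino = -1;
    pthread_rwlock_rdlock(&d->lock);
    for (size_t i = 0; i < d->count; i++) {
        if (strcmp(d->entries[i].name, name) == 0) {
            ino = (long)d->entries[i].inode;
            break;
        }
    }
    pthread_rwlock_unlock(&d->lock);
    return ino;
}

/* Exclusive (write) lock: rename-in-place is a short critical section. */
int dir_rename(struct mem_dir *d, const char *old_name, const char *new_name)
{
    int ret = -1;
    pthread_rwlock_wrlock(&d->lock);
    for (size_t i = 0; i < d->count; i++) {
        if (strcmp(d->entries[i].name, old_name) == 0) {
            strncpy(d->entries[i].name, new_name, NAME_MAX_LEN);
            d->entries[i].name[NAME_MAX_LEN] = '\0';
            ret = 0;
            break;
        }
    }
    pthread_rwlock_unlock(&d->lock);
    return ret;
}

The linear scan inside the read lock is deliberately naive; as the comment notes, even sorting the in-memory form (or hashing it) buys a lot before any locking cleverness is needed.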
Why kernel.org is slow
Posted Jan 11, 2007 13:48 UTC (Thu)
by etienne_lorrain@yahoo.fr (guest, #38022)
[Link] (1 responses)

Just curious: if the problem is linked to read/write access and locks, would it be good (when there are a lot of reads and few writes, as for the Linux versions) to keep the filesystem mounted read-only most of the time?

I mean, keep the partition containing the data read-only, then to update do:

mount -o remount,rw /server/data
cp -ra new_linux_version /server/data
sync
mount -o remount,ro /server/data

Etienne.
Why kernel.org is slow
Posted Jan 11, 2007 16:01 UTC (Thu)
by davecb (subscriber, #1574)
[Link]

Hmmn, does someone know if Linux directory locks are never held on read-only media? I know zfs locks at the directory-entry level (see http://src.opensolaris.org/source/xref/loficc/crypto/usr/... ) but UFSs generally lock the in-memory directory, and don't know if it comes from RO or RW media...

Anyone know ext3 that well?

--dave
Why kernel.org is slow
Posted Jan 13, 2007 13:20 UTC (Sat)
by ebiederm (subscriber, #35028)
[Link] (2 responses)

Linux appears to do a much better job than the NT kernel for the in-memory data structures. A cheap way to see this is to run git on a Windows system. There is an order of magnitude performance hit for directory-sensitive things. I don't believe that is just cygwin.

ext3 for large directories hashes the filename and looks it up in a btree. Using a hash of the filename results in a better branching factor in your btree. So the on-disk data structures are not at a disadvantage.

I haven't looked, but ext2+ directories should all be kept in the same block group, which is roughly a single disk track. So even with fragmentation the disk track cache should work well. I don't remember if block groups are small enough so that they always map to the same disk track, though.

So I'm pretty certain the issue is the large directories and the inode semaphore.

Read-ahead should help a lot if the pages don't get thrown out before we use them.

Changing the locking to allow more concurrency is a trickier problem. If done right, my gut feel is that you should be able to operate essentially lock free, with multiple concurrent writes and reads going on simultaneously. The readdir semantics allow for it. But anything with a high degree of concurrency comes with tricky corner cases.

Eric
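A back-of-the-envelope illustration of the branching-factor point above; the record sizes are round assumptions for illustration, not ext3's actual on-disk layout:

/* Back-of-the-envelope branching factors (illustrative numbers only,
 * not ext3's real record formats).  A fixed-size hash key packs far
 * more routing entries into one block than variable-length names do,
 * so the index tree stays shallower. */
#include <stdio.h>

int main(void)
{
    const int block = 4096;          /* one directory/index block        */
    const int hash_entry = 4 + 4;    /* 32-bit hash + 32-bit block ptr   */
    const int avg_name = 32;         /* assumed average filename length  */
    const int name_entry = avg_name + 4 + 4; /* name + inode + block ptr */

    printf("hash-keyed fanout : %d entries/block\n", block / hash_entry);
    printf("name-keyed fanout : %d entries/block\n", block / name_entry);
    return 0;
}

With these made-up sizes, a single 4KB index block routes to 512 children instead of roughly 100, so an index covering a huge directory can stay a level shallower.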
Why kernel.org is slow
Posted Jan 13, 2007 15:44 UTC (Sat)
by davecb (subscriber, #1574)
[Link]

Excellent!

I'd be inclined to say that lock-free algorithms might be a solution to look closely at... more speculation after I've had a chance to think about it (;-))

--dave
Why kernel.org is slow
Posted Jan 14, 2007 10:25 UTC (Sun)
by evgeny (guest, #774)
[Link]

> I don't believe that is just cygwin.

Comparing with [an app running under] cygwin is unfair. Try watching a configure script under cygwin and natively - the difference can easily be a factor of ten.
Why kernel.org is slow
Posted Jan 11, 2007 16:09 UTC (Thu)
by smitty_one_each (subscriber, #28989)
[Link]

Efforts at doing a git-fetch to go from 2.6.20-rc2-mm1 to 2.6.20-rc3-mm1 have failed spectacularly half a dozen times for me.

After counting ~15k objects and establishing that it should push ~10k, git-fetch dies on an EOF.

Could LWN host git trees for folks happy to pay for the bandwidth, and, if so, what cost might that be?
Why kernel.org is slow
Posted Jan 12, 2007 0:42 UTC (Fri)
by wcooley (guest, #1233)
[Link]

This explains why my rsync mirrors from mirror.kernel.org have been failing for the last few weeks.
Why kernel.org is slow
Posted Jan 13, 2007 0:17 UTC (Sat)
by brouhaha (subscriber, #1698)
[Link] (1 responses)

How difficult would it be to revise the ext3 code to try to keep directory blocks contiguous, as it attempts to do for files? The on-disk structures shouldn't need to change, so it wouldn't break compatibility.

If that's hard to do, how about a tunable parameter to control how much space is initially allocated to directories, and try to keep that initial allocation contiguous?
Why kernel.org is slow
Posted Jan 18, 2007 10:21 UTC (Thu)
by forthy (guest, #1525)
[Link]

It's a bit of a mystery to me why nobody has attacked this problem a long time ago. Directory read was always a pain in the neck, and you can imagine how slow it is if you compare locate with find (and how big the impact of rebuilding the locate database is).

From a more abstract point of view, the directory is a data base with file names, and an n:1 relation between file names and parent directories. The relation between overall file system size and directory size is quite good, i.e. the directory size is a small percentage figure. On a larger file server here with about 1TB space used, the locatedb (which contains just everything) is only ~64MB. Even when you use a larger, less space-efficient directory structure, 128MB/TB should be completely sufficient. A modern RAID array can read 128MB in a fraction of a second, the memory is there to keep it all, so a find / -name '*' can - if well implemented - print a result within a second or less.

I'd suggest the following to the file system implementors: Forget everything you'd read about Unix directories. Start from scratch. Get a decent knowledge about how data bases work; the directory is a data base. An extremely simple one, so to say. Create a single directory file for the directory data base; make sure that it won't fragment much over time (if the directory grows beyond the previously allocated space, allocate a larger space, and copy the directory over completely). Do read-aheads and all the other caching stuff like for any other file when accessing the directory data base. Keep the file names easy to access by using a large hash table (on disk - not to be computed on the fly!). The hash key is computed as usual from the directory id + file name hash.

And for the locking: Make sure that readers never have to lock a directory. They'll maybe get stale content when a writer adds or removes files from a directory, but that's ok. You can never rely on getdents() entries to be valid when you open() them later. Writers should use an RCU mechanism for updating directories.
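One possible concrete reading of "the hash key is computed as usual from the directory id + file name hash" is sketched below; the FNV-1a hash, the field widths, and the bucket count are arbitrary illustrative choices of mine, not part of the proposal. The directory identifier is folded into the hash of the name, so every (directory, name) pair maps to a predictable bucket of the single on-disk table.

/* Hypothetical directory-database key: hash(directory id + file name).
 * FNV-1a is used purely for illustration; any decent hash would do. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t fnv1a64(const void *data, size_t len, uint64_t h)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;               /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Key for one (directory, name) pair in the single directory database. */
static uint64_t dir_key(uint64_t dir_id, const char *name)
{
    uint64_t h = 14695981039346656037ULL;    /* FNV offset basis */
    h = fnv1a64(&dir_id, sizeof dir_id, h);  /* fold in the directory id */
    return fnv1a64(name, strlen(name), h);   /* then the file name */
}

int main(void)
{
    /* Two names in the same directory land in different buckets of a
     * (here, toy-sized) on-disk hash table. */
    const uint64_t table_buckets = 1 << 20;
    printf("%llu\n", (unsigned long long)(dir_key(2, "vmlinuz") % table_buckets));
    printf("%llu\n", (unsigned long long)(dir_key(2, "initrd.img") % table_buckets));
    return 0;
}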
Why kernel.org is slow
Posted Jan 14, 2007 1:08 UTC (Sun)
by csamuel (✭ supporter ✭, #2624)
[Link]

Does anyone know how JFS behaves in this situation as well?

I've been doing some basic benchmarking recently with bonnie++, and what sprang out at me was that XFS is *much* slower for file creation & deletion than both ext3 and JFS (under the 2.6.15 kernel in Ubuntu 6.06), and that with a reasonable log file size (400) JFS was faster than ext3.