Posted Jul 30, 2009 16:15 UTC (Thu) by joib (guest, #8541)
Parent article: A short history of btrfs
Great article, thanks!
Though I'm a bit confused why the SLAB allocator style approach in ZFS would be such a problem from a fragmentation standpoint. In the VM, SLAB was introduced to reduce fragmentation in the first place, after all. Not to mention, many user-space malloc() implementation use somewhat similar strategies to cope with arbitrary sized requests, e.g. the Doug Lea allocator in glibc malloc maintains lists of objects in increasing size of powers-of-two all the way up to the mmap limit, and each allocation is rounded up to the nearest power-of-2, all this mainly to reduce fragmentation (or something like that, it's been a while since I looked into it). Heck, back in fs land, one thing that ext4 got from Lustre was having separate large- and small-file areas on disk, to reduce fragmentation, and it wouldn't surprise me if btrfs did something similar as well.
Anyway, from the article it naively sounds like the real problem is that the SLAB's are pre-allocated, perhaps even at fs creation time, rather than as needed from free space, which again naively would do away with a lot of the requirement to coalesce or split objects?
Posted Aug 1, 2009 17:27 UTC (Sat) by graydon (subscriber, #5009)
[Link]
A couple points. I'm not expert enough to be sure these answer your question:
As far as I know, slab allocators are based on a regional grouping: "this entire range of memory is dedicated to objects of size N bytes", say. The Lea allocator keeps a freelist that can hold objects of N bytes, but there's nothing saying those objects are either allocated from an N-byte-object slab initially, nor that they're exactly N bytes. Just that that's the right power-of-two freelist to put them on. It also coalesces adjacent allocations into larger blocks and moves them to larger freelists whenever it can.
Furthermore -- here I am getting more sketchy in my understanding -- I think that the design of a slab-like allocator implies having a certain degree of co-operation from your allocation client. The client has to be willing to ask for memory from slabs, and the memory it asks for has to (commonly) be in slab-entry-sized units. This is fine -- actually a correct design observation that slab allocation was invented to take advantage of -- when your client is a kernel with a fixed-at-compile-time assortment of struct sizes it's going to be (mostly) allocating. It is not really a correct design observation for things like "general-purpose mallocs" or "general-purpose filesystems", where the clients are unknown and aren't interested in confining themselves to slab-appropriate behavior.
IOW I think both are trying to "reduce fragmentation", but with a different assumed type of workload: a slab allocator eliminates external fragmentation almost entirely if the client can co-operate, but a more general allocator like Lea malloc lacks that co-operation promise, and so takes a simpler approach that (on average) keeps both internal and external fragmentation "under control", but with less of a guarantee about it.
I believe, if I'm reading correctly, that btrfs will be more in the Lea-malloc camp: possibly subject to more external fragmentation (say if you let a bunch of leaves from the middle of blocks drop to 0-refcount), but possibly less internally fragmenting than ZFS (due to the absence of slab size-grouping), so better on average for the general workload of a filesystem. Allocators are always tuned to workload assumptions.