bigalloc

Posted Nov 30, 2011 21:58 UTC (Wed) by cmccabe (guest, #60281)
In reply to: bigalloc by tialaramex
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums

> Right now, with 4096 byte pages, the performance is... well, we're working
> on it but it's already surprisingly good. But if bigalloc clusters mean
> the unit of caching is larger, it seems like bad news for us.

mmap is such an elegant facility, but it lacks a few things. The first is a reasonable way to handle I/O errors. The second is a way to do nonblocking I/O. You can sort of fudge the second point by using mincore(), but it doesn't work that well.
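To make the mincore() fudge concrete, here is a minimal sketch in C. The helper name range_is_resident() is made up for illustration. It asks the kernel whether every page backing a range is currently in memory, so a caller could decide to hand the access to another thread instead of risking a blocking page fault. The answer is only a snapshot: a page can be evicted right after the call, which is a large part of why this approach is unsatisfying.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if every page backing [addr, addr + len)
 * is resident, 0 if at least one page is not, and -1 on error.
 */
int range_is_resident(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0)
        return -1;
    size_t page = (size_t)pagesz;

    /* mincore() requires a page-aligned start address. */
    uintptr_t start = (uintptr_t)addr & ~((uintptr_t)page - 1);
    size_t span = (size_t)((uintptr_t)addr - start) + len;
    size_t npages = (span + page - 1) / page;

    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;

    if (mincore((void *)start, span, vec) != 0) {
        free(vec);
        return -1;
    }

    int resident = 1;
    for (size_t i = 0; i < npages; i++) {
        /* Only the least significant bit of each entry is defined. */
        if ((vec[i] & 1) == 0) {
            resident = 0;
            break;
        }
    }
    free(vec);
    return resident;
}
```

A caller would check range_is_resident(ptr, len) before touching the mapping and fall back to a helper thread (or a plain read) when it returns 0.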

As far as performance goes... SSDs are great at random reads, but small random writes are often not so good. Don't assume that you can write small chunks anywhere on the flash "for free." The firmware has to do a lot of write coalescing to even make that possible, let alone fast.

bigalloc might very well be slower for you IF you have poor locality-- for example, if most data structures are smaller than 4k, and you never access two sequential data structures. If you have bigger data structures, bigalloc could very well end up being faster.

If you have poor locality, you should try reducing readahead in /sys/block/sda/queue/read_ahead_kb or wherever. There's no point reading bytes that you're not going to access.
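For what it's worth, the readahead knob is just a small text file holding a number of kilobytes, so it can be read and set from a program as well as from a shell. The sketch below is only an illustration: the function names are made up, the device name in the path depends on your system, and writing the real sysfs file needs root. Note that this setting applies to the whole block device, not to one mapping.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: read the readahead window (in KiB) from a
 * sysfs-style file such as /sys/block/sda/queue/read_ahead_kb.
 * Returns -1 on failure.
 */
long get_read_ahead_kb(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    long kb = -1;
    if (fscanf(f, "%ld", &kb) != 1)
        kb = -1;
    fclose(f);
    return kb;
}

/*
 * Hypothetical helper: write a new readahead window (in KiB).
 * Returns 0 on success, -1 on failure.
 */
int set_read_ahead_kb(const char *path, long kb)
{
    assert(path != NULL && kb >= 0);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fprintf(f, "%ld\n", kb) > 0;
    /* Buffered write errors may only surface when the stream is closed. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Reading the current value first and logging it before changing it makes it easy to restore the old setting after a benchmark run.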


bigalloc

Posted Dec 1, 2011 16:56 UTC (Thu) by tialaramex (subscriber, #21167)

On I/O errors: Sure. We even catch some implausible signal like SIGBUS if we write to a previously unallocated block on a full filesystem for example. But in practice other than giving developers like myself a terrible shock when it first happened (a SIGBUS? what the hell did I touch?) such behaviour isn't too troubling for us. If an SSD actually dies, we're out of action for some time no matter what, just as we would be if the RAM failed. We anticipate this happening once in a while; it isn't a reason to give up and go home.

Yes, our locality is fairly poor such that readahead is actively bad news. The data structures which dominate are exactly page-sized. We may end up changing anything from a few bytes to a whole page (and even when we write a whole page we need the old contents to determine the new contents), but the chance we then move on to the linearly next (or previous) page is negligible.

My impression was that readahead would be disabled by suitable incantations of madvise(). Is that wrong? It didn't benchmark as wrong on toy systems, but I would have to check whether we actually re-tested on the big machines.

bigalloc

Posted Dec 1, 2011 20:36 UTC (Thu) by cmccabe (guest, #60281)

> We even catch some implausible signal like SIGBUS if we write to
> a previously unallocated block on a full filesystem for example.

If I were you, I'd use posix_fallocate to de-sparsify (manifest?) all of the blocks. Then you don't have unpleasant surprises waiting for you later.
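Roughly what I have in mind, as a minimal sketch: preallocate_file() is a made-up name, and the size is whatever your application needs. The idea is that a full filesystem then shows up as an error code at setup time rather than as a SIGBUS on some later store into the mapping. This assumes a filesystem that overwrites in place, like ext4; copy-on-write filesystems may still need new space on write.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: make sure blocks are allocated for the first
 * `size` bytes of `path`, creating the file if needed.
 * Returns 0 on success or an errno-style code (e.g. ENOSPC) on failure.
 */
int preallocate_file(const char *path, off_t size)
{
    assert(path != NULL && size > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return errno;

    /* posix_fallocate() returns the error number directly; it does not set errno. */
    int err = posix_fallocate(fd, 0, size);

    if (close(fd) != 0 && err == 0)
        err = errno;
    return err;
}
```

Call it once before mmap()ing the file, and treat a nonzero return as "do not start" rather than finding out later.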

> My impression was that readahead would be disabled by suitable
> incantations of madvise(). Is that wrong? It didn't benchmark as wrong on
> toy systems, but I would have to check whether we actually re-tested on
> the big machines.

I looked at mm/filemap.c and found this:

> static void do_sync_mmap_readahead(...) {
>     ...
>     if (VM_RandomReadHint(vma))
>         return;
>     ...
> }

So I'm guessing you're safe with MADV_RANDOM. But it might be wise to check the source of the kernel you're using in case something is different in that version.
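For reference, the userspace side of that hint looks roughly like the sketch below. map_random_access() is a made-up helper name, and mapping the whole file read/write with MAP_SHARED is my assumption about your setup. The madvise() call is advisory, so the sketch keeps the mapping even if the hint is rejected.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a whole file read/write and tell the kernel
 * to expect random access. Returns NULL on failure; on success the
 * mapping length is stored in *len_out.
 */
void *map_random_access(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0) {
        perror("open");
        return NULL;
    }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    size_t len = (size_t)st.st_size;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

    /* Advisory hint only: report a failure but keep using the mapping. */
    if (madvise(p, len, MADV_RANDOM) != 0)
        perror("madvise");

    *len_out = len;
    return p;
}
```

Release the mapping with munmap(p, len) when done. Whether the hint really suppresses readahead still depends on the kernel version, as noted above.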

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds