
bigalloc

Posted Nov 30, 2011 12:36 UTC (Wed) by tialaramex (subscriber, #21167)
Parent article: Improving ext4: bigalloc, inline data, and metadata checksums

How will this interact with huge, almost randomly written files?

I have an application where what we really want is lots of RAM. But RAM is expensive: we can afford to buy 2TB of RAM, but not 20TB. However, we can afford to go quite a lot slower than RAM sometimes, so long as our averages are good enough, so our solution is to use SSDs plus RAM, via mmap().

When we're lucky, the page we want is in RAM, we update it, and the kernel lazily writes it back to an SSD whenever. When we're unlucky, the SSD has to retrieve the page we need, which takes longer and of course forces one of the other pages out of cache, in the worst case forcing it to wait for that page to be written first. We can arrange to be somewhat luckier than pure chance would dictate, on average, but we certainly can't make this into a nice linear operation.
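The arrangement described above can be sketched roughly like this (a toy illustration, not the poster's actual code; the file name and sizes are invented):

```python
# Minimal sketch of an mmap()-backed store: a file on the SSD is mapped
# read/write, the page cache acts as the RAM tier, and the kernel writes
# dirty pages back lazily. Path and sizes are made up for illustration.
import mmap
import os

PAGE = mmap.PAGESIZE  # 4096 bytes on most systems

def open_store(path, size):
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    os.ftruncate(fd, size)
    mem = mmap.mmap(fd, size)  # shared mapping: stores dirty page-cache pages
    os.close(fd)               # the mapping keeps the file open
    return mem

store = open_store("store.dat", 1024 * PAGE)
store[7 * PAGE : 7 * PAGE + 5] = b"hello"  # hits RAM; may fault a page in from SSD
store.flush()                              # msync(): force writeback rather than wait
store.close()
os.unlink("store.dat")
```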

Right now, with 4096 byte pages, the performance is... well, we're working on it but it's already surprisingly good. But if bigalloc clusters mean the unit of caching is larger, it seems like bad news for us.



bigalloc

Posted Nov 30, 2011 14:34 UTC (Wed) by Seegras (guest, #20463)

> But if bigalloc clusters mean the unit of caching is larger, it seems like
> bad news for us.

You're not supposed to make filesystems with bigalloc clusters if you don't want them or if it hampers your performance.

bigalloc

Posted Nov 30, 2011 18:22 UTC (Wed) by tialaramex (subscriber, #21167)

Ah, OK, I somehow got the idea this was the unavoidable future, rather than another option. Nothing to worry about then. Thanks for pointing that out.

bigalloc

Posted Nov 30, 2011 19:19 UTC (Wed) by jimparis (subscriber, #38647)

It also seems that you might not want to use a filesystem at all for that type of application, but instead just mmap a block device directly.
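From user space the two look almost the same; a hedged sketch of mapping a raw device (the device path is hypothetical, opening it needs the right permissions, and writes go straight to the device):

```python
# Sketch of mapping a block device directly, with no filesystem in
# between. The device path below is hypothetical; anything already on
# the device would be clobbered by writes through the mapping.
import mmap
import os

def map_raw(path, length, offset=0):
    fd = os.open(path, os.O_RDWR)   # no O_CREAT: the node must already exist
    mem = mmap.mmap(fd, length, offset=offset)
    os.close(fd)
    return mem

# e.g.: mem = map_raw("/dev/nvme0n1", 2 * 1024**4)  # map the first 2TB
```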

bigalloc

Posted Jun 14, 2012 14:19 UTC (Thu) by Klavs (guest, #10563)

like varnish does :)

bigalloc

Posted Nov 30, 2011 20:02 UTC (Wed) by iabervon (subscriber, #722)

My impression is that bigalloc doesn't affect the unit of caching. Rather, it affects the unit of disk block allocation: pages 0-1023 are adjacent on your SSD, the filesystem metadata records that the whole 4MB is in use, and the inode needs only a single disk location to find it, but the pages are still cached and accessed independently.

bigalloc

Posted Nov 30, 2011 21:16 UTC (Wed) by walex (subscriber, #69836)

Note that 'ext4' supports extents, so files can get allocated with very large contiguous extents already, for example for a 70MB file:

#  du -sm /usr/share/icons/oxygen/icon-theme.cache
69      /usr/share/icons/oxygen/icon-theme.cache
#  filefrag /usr/share/icons/oxygen/icon-theme.cache
/usr/share/icons/oxygen/icon-theme.cache: 1 extent found
#  df -T /usr/share/icons/oxygen/icon-theme.cache
Filesystem    Type   1M-blocks      Used Available Use% Mounted on
/dev/sda3     ext4       25383     12558     11545  53% /

But so far free space has been tracked in block-sized units, and the new feature changes the amount of free space accounted for by each bit in the free-space bitmap.

Which means that, as surmised, the granularity of allocation has changed (for example, the minimum extent size).

bigalloc

Posted Nov 30, 2011 21:58 UTC (Wed) by cmccabe (guest, #60281)

> Right now, with 4096 byte pages, the performance is... well, we're working
> on it but it's already surprisingly good. But if bigalloc clusters mean
> the unit of caching is larger, it seems like bad news for us.

mmap is an elegant facility, but it lacks a few things. The first is a reasonable way to handle I/O errors. The second is a way to do nonblocking I/O. You can sort of fudge the second point by using mincore(), but it doesn't work that well.

As far as performance goes... SSDs are great at random reads, but small random writes are often not so good. Don't assume that you can write small chunks anywhere on the flash "for free." The firmware has to do a lot of write coalescing to even make that possible, let alone fast.

bigalloc might very well be slower for you IF you have poor locality-- for example, if most data structures are smaller than 4k, and you never access two sequential data structures. If you have bigger data structures, bigalloc could very well end up being faster.

If you have poor locality, you should try reducing readahead in /sys/block/sda/queue/read_ahead_kb or wherever. There's no point reading bytes that you're not going to access.
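For instance (a hedged sketch; the sda path is an assumption about where the store lives, and writing the sysfs file requires root):

```python
# Sketch of turning down block-device readahead as suggested above.
# The sysfs path assumes the store lives on sda; units are KB, and
# writing the file needs root.
def set_readahead(sysfs_path, kb):
    with open(sysfs_path, "w") as f:
        f.write(str(kb))

# set_readahead("/sys/block/sda/queue/read_ahead_kb", 0)  # no readahead
```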

bigalloc

Posted Dec 1, 2011 16:56 UTC (Thu) by tialaramex (subscriber, #21167)

On I/O errors: sure. We even catch the occasional implausible signal, like a SIGBUS when we write to a previously unallocated block on a full filesystem. But in practice, other than giving developers like myself a terrible shock the first time it happened (a SIGBUS? what the hell did I touch?), such behaviour isn't too troubling for us. If an SSD actually dies, we're out of action for some time no matter what, just as we would be if the RAM failed. We anticipate this happening once in a while; it isn't a reason to give up and go home.

Yes, our locality is fairly poor such that readahead is actively bad news. The data structures which dominate are exactly page-sized. We may end up changing anything from a few bytes to a whole page (and even when we write a whole page we need the old contents to determine the new contents), but the chance we then move on to the linearly next (or previous) page is negligible.

My impression was that readahead would be disabled by suitable incantations of madvise(). Is that wrong? It didn't benchmark as wrong on toy systems, but I would have to check whether we actually re-tested on the big machines.

bigalloc

Posted Dec 1, 2011 20:36 UTC (Thu) by cmccabe (guest, #60281)

> We even catch some implausible signal like SIGBUS if we write to
> a previously unallocated block on a full filesystem for example.

If I were you, I'd use posix_fallocate to de-sparsify (manifest?) all of the blocks. Then you don't have unpleasant surprises waiting for you later.
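In Python terms that suggestion looks something like this (a sketch; os.posix_fallocate wraps the C call, and the file name is made up):

```python
# Sketch of pre-allocating the whole file with posix_fallocate before
# mapping it, so a later store through the mapping cannot fail with
# ENOSPC (and hence SIGBUS) on a full filesystem.
import mmap
import os

SIZE = 1024 * mmap.PAGESIZE

fd = os.open("store.dat", os.O_RDWR | os.O_CREAT, 0o600)
os.posix_fallocate(fd, 0, SIZE)  # every block is now really allocated, no holes
mem = mmap.mmap(fd, SIZE)
os.close(fd)
mem[SIZE - 1] = 0                # safe: the backing block already exists
mem.close()
os.unlink("store.dat")
```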

> My impression was that readahead would be disabled by suitable
> incantations of madvise(). Is that wrong? It didn't benchmark as wrong on
> toy systems, but I would have to check whether we actually re-tested on
> the big machines.

I looked at mm/filemap.c and found this:

> static void do_sync_mmap_readahead(...)
> {
>         ...
>         if (VM_RandomReadHint(vma))
>                 return;
>         ...
> }

So I'm guessing you're safe with MADV_RANDOM. But it might be wise to check the source of the kernel you're using in case something is different in that version.
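From user space, setting that hint looks like this (a sketch; mmap.madvise() needs Python 3.8+, and MADV_RANDOM is platform-specific):

```python
# Sketch of setting MADV_RANDOM on a mapping so the kernel's fault-time
# readahead path bails out early for it. Requires Python 3.8+ on Linux.
import mmap
import os

fd = os.open("data.bin", os.O_RDWR | os.O_CREAT, 0o600)
os.ftruncate(fd, 16 * mmap.PAGESIZE)
mem = mmap.mmap(fd, 16 * mmap.PAGESIZE)
os.close(fd)
mem.madvise(mmap.MADV_RANDOM)  # sets the random-read hint for this VMA
mem[0] = 1                     # a fault now brings in just this page
mem.close()
os.unlink("data.bin")
```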


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds