Database use of SMR drives

Posted Jun 26, 2015 18:25 UTC (Fri) by butlerm (subscriber, #13312)
In reply to: A report from PGCon 2015 by ncm
Parent article: A report from PGCon 2015

To start with, it is difficult to think of any random access storage device less suitable for database use than an SMR drive. The use of such drives in a busy production database sounds like a good way to reduce write transaction throughput by a factor of one hundred or more.

That said, it is puzzling to me that PostgreSQL would go out of its way to rewrite a block on a mass storage device simply to update a hint bit of the sort that ought to be easily derived by examining the contents of the block itself. Of course, if it is not rewriting blocks merely to update a hint bit, then there shouldn't be a problem.

Database use of SMR drives

Posted Jun 27, 2015 13:54 UTC (Sat) by kleptog (subscriber, #1183)

The hint bits are there because, while it is possible to determine the state of the tuple by examining the tuple itself, that doesn't mean this check is free. There's a whole file dedicated to visibility checks, and they aren't simple functions. These need to be called on every single row in every single scan. Hint bits help short-circuit that and help performance. The most useful is the bit that says "this tuple is visible to everyone", because it helps everyone and is true most of the time, just not when the tuple was first inserted.

The other related situation is when the transaction counter wraps, which it does once every 2^32 transactions. The XIDs in the tuples need to be updated with a special marker to indicate it's not valid any more.
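To make the wraparound problem concrete, here is a minimal Python sketch of comparing 32-bit XIDs with modular arithmetic. The names `xid_precedes` and `next_xid` are made up for this illustration, and the sketch ignores any reserved or special XID values; it is only meant to show why a sufficiently old XID stops comparing as "older" once the counter has moved far enough ahead.

```python
XID_MODULUS = 1 << 32  # a 32-bit transaction counter wraps after 2^32 values


def next_xid(xid: int) -> int:
    """Advance the counter, wrapping back to 0 after 2^32 - 1."""
    return (xid + 1) % XID_MODULUS


def xid_precedes(a: int, b: int) -> bool:
    """Return True if XID a is logically older than XID b.

    The difference is taken modulo 2^32 and read as a signed 32-bit
    number, so "older" only makes sense while the two XIDs are less
    than 2^31 transactions apart.
    """
    diff = (a - b) % XID_MODULUS
    return diff >= (1 << 31)  # i.e. the signed 32-bit difference is negative


# A small XID is older than a slightly larger one.
print(xid_precedes(5, 10))
# Just before the wrap is still older than just after it.
print(xid_precedes(XID_MODULUS - 1, 3))
# But once the gap exceeds 2^31, the old XID no longer looks older.
print(xid_precedes(0, (1 << 31) + 1))
```

Under these assumptions, the last comparison is the problematic case: a tuple stamped with a very old XID would start to look as if it came from the future, which is why such XIDs have to be replaced by a marker before the counter gets that far.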

There are solutions to these problems, but they generally cost either disk space or performance elsewhere. You could, for example, switch to 64-bit transaction IDs and never remove any old transaction logs (the clog), and you would never need to rewrite old tuples. Of course, you'd never be able to recover the disk space used by the transaction logs either.

SMR for databases would be useful for the WAL logging, since they are write once, but for the tables it seems unlikely.

Database use of SMR drives

Posted Jun 27, 2015 17:08 UTC (Sat) by andresfreund (subscriber, #69562)

> The hint bits are there because, while it is possible to determine the state of the tuple by examining the tuple itself, that doesn't mean this check is free. There's a whole file dedicated to visibility checks, and they aren't simple functions. These need to be called on every single row in every single scan. Hint bits help short-circuit that and help performance. The most useful is the bit that says "this tuple is visible to everyone", because it helps everyone and is true most of the time, just not when the tuple was first inserted.

Correct.

Minor nitpick: it's not primarily the cost of these functions; they're called in many situations regardless. It's that another on-disk file (the 'clog', an integer-indexed file listing whether a transaction committed or not) has to be consulted. That's often where much of the time is spent, especially if you have significant throughput and access older, uncached regions of the clog.
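As a rough illustration of how a hint bit avoids the clog lookup, here is a toy Python model. Everything in it (`CommitLog`, `HeapTuple`, `inserter_committed`, the status strings) is hypothetical naming for this sketch and not PostgreSQL's actual code; real visibility checks also involve snapshots and deleting transactions, which are left out here.

```python
from dataclasses import dataclass

COMMITTED, ABORTED, IN_PROGRESS = "committed", "aborted", "in progress"


class CommitLog:
    """Toy stand-in for the clog: maps an XID to a transaction status."""

    def __init__(self, statuses):
        self._statuses = dict(statuses)
        self.lookups = 0  # counts how often the (expensive) log is consulted

    def status(self, xid):
        self.lookups += 1
        return self._statuses.get(xid, IN_PROGRESS)


@dataclass
class HeapTuple:
    xmin: int                     # XID of the inserting transaction
    hint_committed: bool = False  # hint: inserter is known to have committed
    hint_aborted: bool = False    # hint: inserter is known to have aborted


def inserter_committed(tup: HeapTuple, clog: CommitLog) -> bool:
    """Check whether the inserting transaction committed, using hints first."""
    if tup.hint_committed:
        return True
    if tup.hint_aborted:
        return False
    status = clog.status(tup.xmin)  # the slow path: consult the log
    if status == COMMITTED:
        tup.hint_committed = True   # remember the answer in the tuple itself
    elif status == ABORTED:
        tup.hint_aborted = True
    return status == COMMITTED


clog = CommitLog({100: COMMITTED, 101: ABORTED})
row = HeapTuple(xmin=100)
inserter_committed(row, clog)  # first check consults the log and sets the hint
inserter_committed(row, clog)  # second check is answered by the hint alone
print(clog.lookups)
```

In this sketch, setting the hint modifies the tuple, which corresponds to the block being changed (and eventually rewritten) as discussed earlier in the thread; a transaction that is still in progress gets no hint, so it keeps hitting the log.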

> SMR for databases would be useful for the WAL logging, since they are write once, but for the tables it seems unlikely.

I think SMR for the WAL would not be a good idea: due to the need for somewhat frequent fsync calls, you'll likely end up with horrible performance.

But for some workloads where you have append-only data that is only infrequently read, it's not unrealistic to use SMR. It's quite common that only the last few months' worth of data will be modified, but that you have to archive years' worth for regulatory and reporting reasons. Moving older partitions to storage with different characteristics is not unreasonable.

Database use of SMR drives

Posted Jun 30, 2015 21:27 UTC (Tue) by snuxoll (guest, #61198)

> You could, for example, switch to 64-bit transaction IDs and never remove any old transaction logs (the clog), and you would never need to rewrite old tuples. Of course, you'd never be able to recover the disk space used by the transaction logs either.

I'm confused, how would not rolling over the XID prevent you from removing old WAL segments?

Database use of SMR drives

Posted Jun 30, 2015 21:51 UTC (Tue) by kleptog (subscriber, #1183)

> I'm confused, how would not rolling over the XID prevent you from removing old WAL segments?

You can always eventually remove old WAL segments, but not the transaction log, the one that tracks for each transaction whether it committed or not. It's only 2 bits per transaction, but after 2^32 transactions that's still 1GB of disk space. To get rid of those logs you have to, at some point, go back and mark each tuple as either permanently committed or rolled back. This isn't done with the hint bits though, but by using a special Frozen-XID marker.

The rolling over is only slightly related. If you're going to truncate the transaction log anyway to less than 2^32 transactions, then you don't need to remember more than 2^32 transactions, and so you can save the space by only using 32-bit XIDs. If you use larger XIDs, then you have more flexibility for people who don't mind spending the extra disk space (about 16GB at 2 bits per transaction) to remember the last 2^36 transactions, for example.
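The space figures above follow directly from the 2 bits per transaction. A small Python sketch of the arithmetic (the function name `clog_bytes` is made up for this example, and it counts only the status bits, ignoring any file or page overhead):

```python
BITS_PER_TRANSACTION = 2  # status bits per transaction, as stated above
GIB = 1 << 30             # bytes in one GiB


def clog_bytes(retained_transactions: int) -> int:
    """Bytes needed to keep the commit status of this many transactions."""
    return retained_transactions * BITS_PER_TRANSACTION // 8


# Remembering a full 32-bit XID space:
print(clog_bytes(1 << 32) // GIB, "GiB")
# Remembering the last 2^36 transactions:
print(clog_bytes(1 << 36) // GIB, "GiB")
```

Each extra bit of retained XID history doubles the space, so keeping 2^36 transactions costs sixteen times what keeping 2^32 does.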