PostgreSQL pain points

Posted Mar 26, 2014 20:37 UTC (Wed) by jberkus (guest, #55561)
In reply to: PostgreSQL pain points by marcH
Parent article: PostgreSQL pain points

As with most things, using raw partitions exchanges one set of problems for another:

* Database blocks become difficult to resize and move.
* Can no longer use standard tools like "rsync" with database files.
* The database project now needs a staff to maintain what's basically their own filesystem
* Can't keep up with hardware advances in a timely fashion.
* Throwing away all of the good stuff developed by Linux IO and FS geeks over the last 20 years.
* Clobbering all other IO-using software on the same machine.

For Postgres, raw partitions aren't even reasonable to contemplate since we'd need to add 5-10 full-time hackers to the community just to build and maintain that portion of the code.

virtualization

Posted Mar 26, 2014 22:52 UTC (Wed) by marcH (subscriber, #57642) [Link] (6 responses)

> * The database project now needs a staff to maintain what's basically their own filesystem

Yes, agreed totally.

> Throwing away all of the good stuff developed by Linux IO and FS geeks over the last 20 years.

I understand all the "it's more [development] work" arguments that you put in one form or the other. Yes for sure it is: exactly like the duplication of effort we have in the variety of filesytems (on various operating systems) that we have out there. Some better at some loads and others at others.

> * Database blocks become difficult to resize and move.

Yes, "virtualization"/layering has pros and cons. But if you really want "bare-metal" performance you know where you have to go.

> * Can no longer use standard tools like "rsync" with database files.

Well, you can't use that on a live database anyway, so this point looks moot. Unless maybe you rely on a filesystem with snapshotting which is... not far from duplicating a database feature! Same pattern gain.

> * Can't keep up with hardware advances in a timely fashion.
> * Clobbering all other IO-using software on the same machine.

Sorry I don't get these two. Care to elaborate?

virtualization

Posted Mar 27, 2014 18:58 UTC (Thu) by kleptog (subscriber, #1183) [Link] (5 responses)

> > * Can no longer use standard tools like "rsync" with database files.

> Well, you can't use that on a live database anyway, so this point looks moot. Unless maybe you rely on a filesystem with snapshotting which is... not far from duplicating a database feature! Same pattern gain.

You can. It's useful for both backup and replication. Basically you can use rsync to quickly update your backup image. And then you take a copy of the WAL logs. The combination gives you a backup. If you have a snapshotting filesystem you can indeed achieve similar effects.

> > * Can't keep up with hardware advances in a timely fashion.
> > * Clobbering all other IO-using software on the same machine.

> Sorry I don't get these two. Care to elaborate

For the first, consider the effects the rise of SSD is having on the Linux VFS. That would need to be replicated in the database. For the second, as a userspace program you don't have a good view of what the rest of the system is doing, hence you might be interfering with other processes. The kernel has the overview.

It's a feature that a database doesn't assume it's the only program on a machine.

virtualization

Posted Mar 27, 2014 19:32 UTC (Thu) by marcH (subscriber, #57642) [Link] (2 responses)

> > * Clobbering all other IO-using software on the same machine.

> For the second, as a userspace program you don't have a good view of what the rest of the system is doing, hence you might be interfering with other processes. The kernel has the overview.

How is the raw partition approach worse here? I would intuitively think it makes things better: less sharing.

Anyway: any database of serious size runs on dedicated or practically dedicated hardware, doesn't it?

virtualization

Posted Mar 28, 2014 22:24 UTC (Fri) by kleptog (subscriber, #1183) [Link]

> > > * Clobbering all other IO-using software on the same machine.

> > For the second, as a userspace program you don't have a good view of what the rest of the system is doing, hence you might be interfering with other processes. The kernel has the overview.

> How is the raw partition approach worse here? I would intuitively think it makes things better: less sharing.

I think it depends on what your goals are. If your goal is to make the absolutely fastest database server possible, then you'd probably want to use raw access on a system with nothing else running.

If your goal is to make a database server that is broadly useful, runs efficiently on a wide variety of systems then asking the kernel to do its job is the better idea.

PostgreSQL tends to the latter. The gains you can get from raw access are simply not worth the effort and would make PostgreSQL much harder to deploy in many situations. A database server that only works well when it's got the machine to itself is a PITA in many situations.

virtualization

Posted Mar 29, 2014 3:37 UTC (Sat) by fandingo (guest, #67019) [Link]

> How is the raw partition approach worse here?

I'm not sure a database should be implementing operations necessary for ATA TRIM.

virtualization

Posted Mar 27, 2014 19:36 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> The combination gives you a backup

Heh, that was missing.

I am still not convinced that rsync is the ultimate database backup tool. As much as I love rsync it surely does not have the patented exclusivity of incremental copying/backup techniques.

virtualization

Posted Apr 14, 2014 7:41 UTC (Mon) by MortenSickel (subscriber, #3238) [Link]

Then I think you have not looked well enough into it. As was mentioned earlier, rsync of the database files in combination with the wal logs gives you a simple backup that is immideately usable. At my earlier job, we were heavy users of postgres and used that as our main backup system. (It also makes setting up replication a snap)
So, no raw partitions, please - unless rsync and other file management tools get patched to read them... :-P

On the other hand, for any database of a certain size and importance, you probably want to have a separate partition for the database files so I could be possible to advice using a certain file system with some certain parameters to get optimal performance.

PostgreSQL pain points

Posted Mar 27, 2014 3:34 UTC (Thu) by zblaxell (subscriber, #26385) [Link] (2 responses)

> Database blocks become difficult to resize and move.

LVM.

> Can't keep up with hardware advances in a timely fashion.

Most of that happens below the block device level, so filesystems and raw partitions get it at the same time.

> Can no longer use standard tools like "rsync" with database files.

Databases tend to have their own. You often can't use rsync with a live database file on a filesystem either.

> The database project now needs a staff to maintain what's basically their own filesystem

That private filesystem doesn't have to do much that the database wasn't doing already. You could skip an indirection layer.

Unlike a filesystem, a database is not required to support legacy on-disk data across major releases (your DBA must replicate, dump/restore, or in-place upgrade instead). This means the private database filesystem could adapt more quickly to changes in storage technology compared to a kernel filesystem.

> Throwing away all of the good stuff developed by Linux IO and FS geeks over the last 20 years.

You are assuming that FS geeks are developing stuff that is relevant for databases. A database might be better off freed from the constraints of living with a filesystem layer (and legacy filesystem feature costs) between it and its storage.

OTOH a filesystem might be better after all--but that has to be proven, not assumed.

> Clobbering all other IO-using software on the same machine.

That's also true in the filesystem case.

PostgreSQL pain points

Posted Mar 27, 2014 6:22 UTC (Thu) by amacater (subscriber, #790) [Link] (1 responses)

A database might be better off ...

In a slightly different context - IBM Clearcase did/does something similar.
Softtware snapshotting and versioning by intercepting file system calls and writing to a custom intermediate file system level.

Result: everyone's worst nightmare if a large disk fails - IBM _might_ be able to recover your life's work if you can send them the entire filesystem.

And yes, dirty pages and flushing are fun :(

PostgreSQL pain points

Posted Mar 27, 2014 8:00 UTC (Thu) by marcH (subscriber, #57642) [Link]

A database needs a backup strategy built-in anyway.

About ClearCase: anyone who has used it (and used other things) knows it was one of the worst pieces of engineering ever. So, if you want to be convincing I suggest not using it as an example in any point you are trying to make.

PostgreSQL pain points

Posted Mar 28, 2014 1:02 UTC (Fri) by rodgerd (guest, #58896) [Link] (2 responses)

Not to mention it's a huge pain in the arse as a sysadmin. My experience is predominantly with Oracle's raw filesystem (ASM), and the lack of standard tooling to manage the filesystem... it's just not a good thing. To the point where our DBAs will avoid ASM unless they're building a RAC.

Probably the most annoying experience was discovering that ASM doesn't do any kind of sanity check when starting filesystems: after a SAN change, the SAN operator wired some LUNs back to the wrong boxes. With an AAAABAAAA on one server and BBBBABBBB on another, LVM would have simply refused to start the volume group. ASM started and then Oracle would silently coredump every time it tried to work with data on the misplaced disk. Such as the horrors of re-inventing decades of volume management and filesystems.

PostgreSQL pain points

Posted Mar 28, 2014 8:31 UTC (Fri) by marcH (subscriber, #57642) [Link] (1 responses)

Yet there is a *variety* of filesystems developed, even just on Linux. How come?

Wild, poorly educated guess: in an ideal world, shouldn't databases be hosted on a trimmed down, "semi-filesystem" which has only management features and none of the duplicated performance stuff which gets in the way? It could be called say, LLVM++ for instance?

PostgreSQL pain points

Posted Mar 29, 2014 13:56 UTC (Sat) by kleptog (subscriber, #1183) [Link]

That might be an idea, except at the moment we're still at the API phase: what information needs to be communicated and how? Stuff mentioned in the article like: order dependencies between blocks, indicating that the kernel cache is out of date, etc. Whether it's eventually implemented in the VFS or as ioctl()s on a fancy new filesystem is really beside the point.

I wonder if finding a way to expose parts of the JBD (Journalling Block Device) to user space might help with any of this.