The Tux3 filesystem returns
Posted Jan 3, 2013 7:33 UTC (Thu) by daniel (guest, #3181)
In reply to: The Tux3 filesystem returns by jeltz
Parent article: The Tux3 filesystem returns
Posted Jan 6, 2013 3:19 UTC (Sun) by jeltz (guest, #88600)
Posted Jan 6, 2013 13:47 UTC (Sun) by andresfreund (subscriber, #69562)
Postgres actually uses a similar split: the individual backends insert log entries into memory, and the "wal writer" writes them to disk in the background. Only when an individual backend requires some WAL records to be written/fsynced (e.g. because it wants its commit record to be safely on disk) and that portion of the WAL has not yet been written/fsynced does the backend do so itself.
I don't think there is any fundamental difference here. PG will always have higher overhead than a filesystem because the guarantees it gives (by default, at least) are far stricter, and more fundamentally because it's layered *above* a filesystem, so it pays the filesystem's overhead in the first place.
.oO(Don't mention postgres if you don't want to be talked to death :P) 
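The split described above can be sketched in a few lines. This is a toy model, not Postgres code: all names (WalBuffer, insert, flush_upto) are hypothetical, and a Python list stands in for the WAL file and for write()+fsync().

```python
import threading

class WalBuffer:
    """Toy model of the backend / wal-writer split: backends append WAL
    records to memory only; flushing to 'disk' happens separately, done
    either by a background writer or by a backend that needs durability
    and finds its records not yet flushed. (Hypothetical sketch.)"""

    def __init__(self):
        self.lock = threading.Lock()
        self.records = []        # in-memory WAL
        self.flushed_upto = 0    # number of records already on "disk"
        self.disk = []           # stand-in for the WAL file

    def insert(self, record):
        """Backend fast path: append to memory only; return the record's LSN."""
        with self.lock:
            self.records.append(record)
            return len(self.records)

    def flush_upto(self, lsn):
        """Write/fsync everything up to lsn, unless already done.
        Called periodically by the wal writer, or directly by a backend
        that wants its commit record safely on disk."""
        with self.lock:
            if self.flushed_upto < lsn:
                self.disk.extend(self.records[self.flushed_upto:lsn])
                self.flushed_upto = lsn   # models write() + fsync()

wal = WalBuffer()
lsn = wal.insert(b"commit xid=17")   # cheap: memory only
wal.flush_upto(lsn)                  # backend forces its commit record out
```

Note that a second `flush_upto(lsn)` with the same LSN is a no-op, which is exactly why a backend only pays the sync cost when the wal writer has not got there first.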
 
     
    
Posted Jan 6, 2013 21:29 UTC (Sun) by raven667 (subscriber, #5198)
Only in a few places: when new files are created, or when the metadata for existing files changes, for example by changing the file size.  Files that are preallocated have very little overhead to (re)write, except perhaps mtime updates.
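The preallocation point can be demonstrated directly. A small sketch (assuming Linux and a filesystem that supports `posix_fallocate`, such as ext4 or XFS): after the blocks are reserved up front, rewriting within them changes neither the file size nor the block allocation, so no allocation metadata needs updating.

```python
import os, tempfile

# Preallocate blocks up front; later rewrites inside that range touch
# no block-allocation metadata, only (at most) timestamps.
fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, 1 << 20)   # reserve 1 MiB of blocks now
    os.pwrite(fd, b"x" * 8192, 4096)     # rewrite within the preallocated
                                         # range: size and layout unchanged
    size = os.fstat(fd).st_size
    print(size)                          # 1048576, fixed at preallocation time
finally:
    os.close(fd)
    os.unlink(path)
```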
 
     
There are two separate areas of contention here:
- Synchronous disk writes/syncs. That can be eased by batching the required syncs for several backends/transactions into one sync, and by lowering the consistency requirements a bit (e.g. setting synchronous_commit=off).
- Contention around the WAL data structures. For relatively obvious reasons only one backend can insert into the WAL (in memory!) at a time, so there can be rather heavy contention around the locks protecting it. There's some work going on to make the locking more fine-grained, but it's complicated stuff and the price of screwing up would be way too high, so it might take some time.