|
|
Log in / Subscribe / Register

MVCC implementation

MVCC implementation

Posted Jan 3, 2010 20:21 UTC (Sun) by butlerm (subscriber, #13312)
In reply to: So how does PostGreSQL compare to MySQL? by ringerc
Parent article: The ongoing MySQL campaign

<em>VACUUM or something like it is pretty much required in any MVCC
design</em>

Not quite. Oracle's MVCC, for example, works something like the following
(I believe InnoDB uses much the same scheme):

1. The latest version of any row is stored in the primary data block, along
with some transaction related meta data.
2. The information necessary to recover prior versions of any row is stored
in separate rollback segments
3. Changes to both are written to the redo log
4. When a query is processed, if the version of the row currently stored in
a data block isn't the one needed, a prior version is recovered by
referring to the appropriate rollback segment.
5. When a transaction commits, or soon thereafter, the rollback data for
that transaction is discarded. No disk I/O required.
6. Alternatively, if a transaction is rolled back, prior versions of
modified rows are recovered from the rollback segment and restored to their
original positions in the primary data blocks.

The main advantage of this scheme is that data blocks generally only get
modified once per transaction, rather than once initially and once some
time later, during the vacuum process. However, redo log overhead is
probably twice as high, and if a transaction is large or long enough, the
rollback entries will physically hit the the disk before they are discarded
as well.


to post comments

MVCC implementation

Posted Jan 4, 2010 4:12 UTC (Mon) by ringerc (subscriber, #3071) [Link]

Good point.

A redo log does have a large disk I/O cost if you have a high update/delete volume and/or any long-running SERIALIZABLE transactions, but it doesn't require the same sort of work as Pg's in-place storage of superceded rows. On the other hand, it has a HUGE cost to queries - in the form of random disk I/O - if they have to hit the redo log on disk.

I don't know enough to comment on which is "better". I suspect that like most things, it's probably a trade-off where some workloads benefit from one and others from the other.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds