Thread summary: the Linux kernel and PostgreSQL

From:  Mel Gorman <mgorman-AT-suse.de>
To:  pgsql-hackers-AT-postgresql.org
Subject:  Re: Linux kernel impact on PostgreSQL performance (summary v1 2014-1-15)
Date:  Wed, 15 Jan 2014 14:14:08 +0000
Message-ID:  <20140115141408.GJ4963@suse.de>
Cc:  Josh Berkus <josh-AT-agliodbs.com>, Joshua Drake <jd-AT-commandprompt.com>, Robert Haas <robertmhaas-AT-gmail.com>, Tom Lane <tgl-AT-sss.pgh.pa.us>, lsf-pc-AT-lists.linux-foundation.org, Magnus Hagander <magnus-AT-hagander.net>, Kevin Grittner <kgrittn-AT-ymail.com>, James Bottomley <James.Bottomley-AT-HansenPartnership.com>, Andres Freund <andres-AT-2ndquadrant.com>, Hannu Krosing <hannu-AT-2ndquadrant.com>, Dave Chinner <david-AT-fromorbit.com>, Claudio Freire <klaussfreire-AT-gmail.com>, Jonathan Corbet <corbet-AT-lwn.net>, Trond Myklebust <trondmy-AT-gmail.com>

> One assumption would be that Postgres is perfectly happy with the current
> kernel behaviour in which case our discussion here is done.

It has been demonstrated that this statement was farcical.  The thread is
massive just from interaction with the LSF/MM program committee.  I'm hoping
that there will be Postgres representation at LSF/MM this year to bring
the issues to a wider audience. I expect that LSF/MM can only commit to
one person attending the whole summit due to limited seats but we could
be more flexible for the Postgres track itself so informal meetings
can be arranged for the evenings and at collab summit.

In case this gets forgotten, this mail describes what has already been
discussed and some of the proposals. Some stuff I do not describe because
it was superseded by later discussion. If I missed something important,
misinterpreted or simply screwed up then shout and I'll update this. I'd
rather none of this gets lost even if it takes months or years to address
it all.

On testing of modern kernels
----------------------------

Josh Berkus claims that most people are using Postgres with 2.6.19 and
consequently there may be poor awareness of recent kernel developments.
This is a disturbingly large window of opportunity for problems to have
been introduced. It raises the question of what sort of penetration modern
distributions shipping Postgres have. More information on why older kernels
dominate in Postgres installations would be nice.

Postgres bug reports and LKML
-----------------------------

It is claimed that LKML does not welcome bug reports but it's less clear
what the basis of this claim is.  Is it because the reports are ignored? A
possible explanation is that they are simply getting lost in the LKML noise
and there would be better luck if the bug report was cc'd to a specific
subsystem list. Another explanation is that there is not enough data
available to debug the problem. The worst explanation is that to date
the problem has not been fixable but the details of this have been lost
and are now unknown. Is it possible that some of these bug reports can be
refreshed so at least there is a chance they get addressed?

Apparently there were changes to the reclaim algorithms that crippled
performance without any sysctls. The problem may be compounded by the
introduction of adaptive replacement cache in the shape of the thrash
detection patches currently being reviewed.  Postgres investigated the
use of ARC in the past and ultimately abandoned it. Details are in the
archives (http://www.postgresql.org/search/?m=1&q=arc&l=1&...). I
have not read them, just noting they exist for future reference.

Sysctls to control VM behaviour are not popular as such tuning parameters
are often used as an excuse to not properly fix the problem. Would it be
possible to describe a test case that shows 2.6.19 performing well and a
modern kernel failing? That would give the VM people a concrete basis to
work from to either fix the problem or identify exactly what sysctls are
required to make this work.

I am confident that any bug related to VM reclaim in this area has been lost.
At least, I recall no instances of it being discussed on linux-mm and it
has not featured at LSF/MM in recent years.

IO Scheduling
-------------

Kevin Grittner has stated that it is known that the DEADLINE and NOOP
schedulers perform better than any alternatives for most database loads.
It would be desirable to quantify this for some test case and see whether
the default scheduler can cope in some way.
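
For anyone reproducing such a comparison, the scheduler can be switched at
runtime through sysfs. What follows is a minimal C sketch, not from the
thread; the device name "sda" is an assumption, and reading the same sysfs
file lists the available schedulers with the active one in brackets.

        /* Minimal sketch: select the deadline scheduler for one block
         * device by writing to its sysfs queue attribute. The device
         * name "sda" is an assumption; run as root. */
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            FILE *f = fopen("/sys/block/sda/queue/scheduler", "w");

            if (!f) {
                perror("scheduler");
                return EXIT_FAILURE;
            }
            /* The kernel matches the token against the available
             * schedulers and switches the queue over immediately. */
            fprintf(f, "deadline");
            return fclose(f) ? EXIT_FAILURE : EXIT_SUCCESS;
        }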

The deadline scheduler makes sense to a large extent though. Postgres
is sensitive to large latencies due to IO write spikes. It is at least
plausible that deadline would give more deterministic behaviour for
parallel reads in the presence of large writes assuming there were not
ordering problems between the reads/writes and the underlying filesystem.

For reference, these IO spikes can be massive. If the shared buffer is
completely dirtied in a short space of time then it could be 20-25% of
RAM being dirtied and writeback required in typical configurations. There
have been cases where it was worked around by limiting the size of the
shared buffer to a small enough size so that it can be written back
quickly. There are other tuning options available such as altering when
dirty background writing starts within the kernel but that will not help if
the dirtying happens in a very short space of time. Dave Chinner described
the considerations as follows

	There's no absolute rule here, but the threshold for background
	writeback needs to consider the amount of dirty data being generated,
	the rate at which it can be retired and the checkpoint period the
	application is configured with. i.e. it needs to be slow enough to
	not cause serious read IO perturbations, but still fast enough that
	it avoids peaks at synchronisation points. And most importantly, it
	needs to be fast enough that it can complete writeback of all the
	dirty data in a checkpoint before the next checkpoint is triggered.

	In general, I find that threshold to be somewhere around 2-5s
	worth of data writeback - enough to keep a good amount of write
	combining and the IO pipeline full as work is done, but no more.

	e.g. if your workload results in writeback rates of 500MB/s,
	then I'd be setting the dirty limit somewhere around 1-2GB as
	an initial guess. It's basically a simple trade off buffering
	space for writeback latency. Some applications perform well with
	increased buffering space (e.g. 10-20s of writeback) while others
	perform better with extremely low writeback latency (e.g. 0.5-1s).
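
To make that arithmetic concrete, the corresponding kernel knob is
vm.dirty_background_bytes. A sketch, assuming the 500MB/s figure from the
quote above rather than a measured rate; note that setting the bytes form
zeroes vm.dirty_background_ratio.

        /* Sketch of the rule of thumb above: size the background
         * writeback threshold at roughly 2s worth of writeback. The
         * 500MB/s rate is the example figure from the quote, not a
         * measured value. Run as root. */
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            const long long rate_bps = 500LL * 1024 * 1024; /* writeback rate */
            const long long window_s = 2;                   /* buffering window */
            FILE *f = fopen("/proc/sys/vm/dirty_background_bytes", "w");

            if (!f) {
                perror("dirty_background_bytes");
                return EXIT_FAILURE;
            }
            /* 500MB/s * 2s ~= 1GB of dirty data before background
             * writeback starts. */
            fprintf(f, "%lld", rate_bps * window_s);
            return fclose(f) ? EXIT_FAILURE : EXIT_SUCCESS;
        }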

Some of this may have been addressed in recent changes with IO-less dirty
throttling. When considering stalls related to excessive IO it will be
important to check if the kernel was later than 3.2 and what the underlying
filesystem was.

Again, it really should be possible to demonstrate this with a test case,
one driven by pgbench maybe? Workload would generate a bunch of test data,
dirty a large percentage of it and try to sync. Metrics would be measuring
average read-only query latency when reading in parallel to the write,
average latencies from the underlying storage, IO queue lengths etc and
comparing default IO scheduler with deadline or noop.

NUMA Optimisations
------------------

The primary one that showed up was zone_reclaim_mode. Enabling that parameter
is a disaster for many workloads and apparently Postgres is one. It might
be time to revisit leaving that thing disabled by default and explicitly
requiring that NUMA-aware workloads that are correctly partitioned enable it.
Otherwise NUMA considerations are not that much of a concern right now.
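
As a quick sanity check, the current setting can be read back from procfs.
A minimal sketch, not from the thread:

        /* Report whether zone_reclaim_mode is enabled. A non-zero value
         * makes the kernel reclaim node-local pages before falling back
         * to remote NUMA nodes, the behaviour described above as a
         * disaster for workloads like Postgres. */
        #include <stdio.h>

        int main(void)
        {
            FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");
            int mode;

            if (!f || fscanf(f, "%d", &mode) != 1) {
                perror("zone_reclaim_mode");
                return 1;
            }
            fclose(f);
            printf("zone_reclaim_mode = %d (%s)\n", mode,
                   mode ? "consider setting it to 0" : "disabled, as recommended");
            return 0;
        }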

Direct IO, buffered IO and double buffering
-------------------------------------------

The general position of Postgres is that the kernel knows more about
storage geometries and IO scheduling than an application can or should
know. It would be preferred to have interfaces that allow Postgres to
give hints to the kernel about how and when data should be written back.
The alternative is exposing details of the underlying storage to userspace
so Postgres can implement a full IO scheduler using direct IO. It has
been asserted on the kernel side that the optimal IO size and alignment
should be all the details that are required in the majority of cases.
While some database vendors have this option, the Postgres community
does not have the resources to implement something of this magnitude.

I can understand Postgres' preference for using the kernel to handle these
details for them. They are a cross-platform application and the kernel
should not be washing its hands of the problem and hiding behind direct
IO as a solution. Ted Ts'o summarises the issues as

	The high order bit is what's the right thing to do when database
	programmers come to kernel engineers saying, we want to do <FOO>
	and the performance sucks.  Do we say, "Use O_DIRECT, dummy", not
	withstanding Linus's past comments on the issue?  Or do we have
	some general design principles that we tell database engineers that
	they should do for better performance, and then all developers for
	all of the file systems can then try to optimize for a set of new
	API's, or recommended ways of using the existing API's?

In an effort to avoid depending on direct IO there are some proposals
and/or wishlist items

   1. Reclaim pages only under reclaim pressure but then prioritise their
      reclaim. This avoids a problem where fadvise(DONTNEED) discards a
      page only to have a read/write or WILLNEED hint immediately read
      it back in again (a sketch of this pattern follows the quote below).
      The requirements are similar to the volatile range hinting but they
      do not use mmap() currently and would need a file-descriptor based
      interface. Robert Haas had some concerns with the general concept
      and described them thusly

	This is an interesting idea but it stinks of impracticality.
	Essentially when the last buffer pin on a page is dropped we'd
	have to mark it as discardable, and then the next person wanting
	to pin it would have to check whether it's still there.  But the
	system call overhead of calling vrange() every time the last pin
	on a page was dropped would probably hose us.

	Well, I guess it could be done lazily: make periodic sweeps through
	shared_buffers, looking for pages that haven't been touched in a
	while, and vrange() them.  That's quite a bit of new mechanism,
	but in theory it could work out to a win.  vrange() would have
	to scale well to millions of separate ranges, though.  Will it?
	And a lot depends on whether the kernel makes the right decision
	about whether to chunk data from our vrange() vs. any other page
	it could have reclaimed.
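
      For illustration, the discard-then-reread pattern this item wants to
      avoid can be reproduced with today's interfaces. A minimal sketch,
      with a hypothetical file name:

        /* Illustration of the pattern item 1 wants to avoid: DONTNEED
         * drops a clean page from the page cache and the very next read
         * has to fetch it from disk again. The file name is
         * hypothetical. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            char buf[8192];
            int fd = open("/tmp/pgdata-example", O_RDONLY);

            if (fd < 0) {
                perror("open");
                return 1;
            }
            pread(fd, buf, sizeof(buf), 0);   /* page is now in the cache */
            posix_fadvise(fd, 0, sizeof(buf), POSIX_FADV_DONTNEED); /* dropped */
            pread(fd, buf, sizeof(buf), 0);   /* goes back to disk */
            close(fd);
            return 0;
        }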

   2. Only writeback some pages if explicitly synced or dirty limits
      are violated. Jeff Janes states that he has problems with large
      temporary files that generate IO spikes when the data starts hitting
      the platter even though the data does not need to be preserved. Jim
      Nasby agreed and commented that he "also frequently see this, and it
      has an even larger impact if pgsql_tmp is on the same filesystem as
      WAL. Which *theoretically* shouldn't matter with a BBU controller,
      except that when the kernel suddenly decides your *temporary*
      data needs to hit the media you're screwed."

      One proposal that may address this, sketched hypothetically in code
      after this item, is

	Allow a process with an open fd to hint that pages managed by this
	inode will have dirty-sticky pages. Pages will be ignored by dirty
	background writing unless there is an fsync call or dirty page limits
	are hit. The hint is cleared when no process has the file open.
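
      A hypothetical sketch of that proposal follows. F_SET_DIRTY_STICKY
      is invented here purely to illustrate the semantics; no such fcntl()
      command exists and the call fails with EINVAL on real kernels.

        /* Hypothetical sketch of the dirty-sticky hint. The
         * F_SET_DIRTY_STICKY command and its value are invented to
         * illustrate the proposed semantics; they do not exist. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        #define F_SET_DIRTY_STICKY 1050   /* invented command, not real */

        int main(void)
        {
            int fd = open("pgsql_tmp/sort-spill", O_RDWR | O_CREAT, 0600);

            if (fd < 0) {
                perror("open");
                return 1;
            }
            /* Proposed meaning: background writeback ignores dirty pages
             * of this inode until fsync() or the dirty limits are hit. */
            if (fcntl(fd, F_SET_DIRTY_STICKY, 1) < 0)
                perror("fcntl (expected to fail on today's kernels)");
            /* ... write temporary data that may never need to reach disk ... */
            close(fd);
            return 0;
        }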

   3. Only writeback pages if explicitly synced. Postgres has strict write
      ordering requirements (a sketch of the ordering follows this item).
      In the words of Tom Lane -- "As things currently stand, we dirty the
      page in our internal buffers, and we don't write it to the kernel
      until we've written and fsync'd the WAL data that needs to get to
      disk first". mmap() would avoid double buffering but it has no
      control over the write ordering which is a show-stopper. As Andres
      Freund described:

	Postgres' durability works by guaranteeing that our journal
	entries (called WAL := Write Ahead Log) are written & synced to
	disk before the corresponding entries of tables and indexes reach
	the disk. That also allows to group together many random-writes
	into a few contiguous writes fdatasync()ed at once. Only during
	a checkpointing phase the big bulk of the data is then (slowly,
	in the background) synced to disk. I don't see how that's doable
	with holding all pages in mmap()ed buffers.

      There are also concerns there would be an absurd number of mappings.

      The problem with this sort of dirty pinning interface is that it
      can deadlock the kernel if all dirty pages in the system cannot be
      written back by the kernel. James Bottomley stated

	No, I'm sorry, that's never going to be possible.  No user space
	application has all the facts. If we give you an interface to
	force unconditional holding of dirty pages in core you'll livelock
	the system eventually because you made a wrong decision to hold
	too many dirty pages.

      However, it was very clearly stated that the write ordering is
      critical. If the kernel breaks the requirement then the database
      can get trashed in the event of a power failure.

      This led to a discussion on write barriers which the kernel uses
      internally but there are scaling concerns both with the number of
      constraints that would exist and the requirement that Postgres use
      mapped buffers.

      I did not bring it up on the list but one possibility is that the
      kernel would allow a limited number of pinned dirty pages. If a
      process tries to dirty more pages without cleaning some of them
      we could either block it or fail the write. The number of dirty
      pages would be controlled by limits and we'd require that the limit
      be lower than dirty_ratio|bytes or be at most 50% of that value.
      There are unclear semantics about what happens if the process crashes.
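
      For reference, a minimal sketch of the ordering described in this
      item, with hypothetical file names. This is the discipline that an
      mmap()ed buffer cannot guarantee:

        /* Sketch of the write ordering in item 3: the WAL record must be
         * durable before the data page that depends on it is handed to
         * the kernel. File names are hypothetical. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            const char wal_record[] = "insert tuple ...";
            const char data_page[8192] = { 0 };
            int wal_fd  = open("/tmp/wal-example",  O_WRONLY | O_CREAT | O_APPEND, 0600);
            int heap_fd = open("/tmp/heap-example", O_WRONLY | O_CREAT, 0600);

            if (wal_fd < 0 || heap_fd < 0) {
                perror("open");
                return 1;
            }
            write(wal_fd, wal_record, sizeof(wal_record));
            fdatasync(wal_fd);   /* WAL must reach stable storage first */
            /* Only now may the dirty page be handed to the kernel; with
             * an mmap()ed buffer the kernel could write it at any time,
             * breaking this ordering. */
            pwrite(heap_fd, data_page, sizeof(data_page), 0);
            close(wal_fd);
            close(heap_fd);
            return 0;
        }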

   4. Allow userspace process to insert data into the kernel page cache
      without marking the page dirty. This would allow the application
      to request that the OS use the application copy of data as page
      cache if it does not have a copy already. The difficulty here
      is that the application has no way of knowing if something else
      has altered the underlying file in the meantime via something like
      direct IO. Granted, such activity has probably corrupted the database
      already but initial reactions are that this is not a safe interface
      and there are coherency concerns.

      Dave Chinner asked "why, exactly, do you even need the kernel page
      cache here?"  when Postgres already knows how and when data should
      be written back to disk. The answer boiled down to "To let kernel do
      the job that it is good at, namely managing the write-back of dirty
      buffers to disk and to manage (possible) read-ahead pages". Postgres
      has some ordering requirements but it does not want to be responsible
      for all cache replacement and IO scheduling. Hannu Krosing summarised
      it best as

	Again, as said above the linux file system is doing fine. What we
	want is a few ways to interact with it to let it do even better
	when working with Postgres by telling it some stuff it otherwise
	would have to second guess and by sometimes giving it back some
	cache pages which were copied away for potential modifying but
	ended up clean in the end.

	And let the linux kernel decide if and how long to keep these pages
	in its cache using its superior knowledge of disk subsystem and
	about what else is going on in the system in general.

   5. Allow copy-on-write of page-cache pages to anonymous memory. This
      would limit the double RAM usage to some extent. It's not as simple
      as having a MAP_PRIVATE mapping of a file-backed page because
      presumably they want this data in a shared buffer shared between
      Postgres processes (a demonstration of the COW behaviour follows
      this item). The implementation details of something like this are
      hairy because it's mmap()-like but not mmap() as it does not have
      the same writeback semantics due to the write ordering requirements
      Postgres has for database integrity.

      Completely nuts and this was not mentioned on the list, but arguably
      you could try implementing something like this as a character device
      that allows MAP_SHARED with ioctls controlling what file and offset
      backs pages within the mapping.  A new mapping would be forced
      resident and read-only. A write would COW the page. It's a crazy
      way of doing something like this but avoids a lot of overhead.
      Even considering the stupid solution might make the general solution
      a bit more obvious.

      For reference, Tom Lane comprehensively
      described the problems with mmap at
      http://www.postgresql.org/message-id/17515.1389715715@sss.p...

      There were some variants of how something like this could be achieved
      but no finalised proposal at the time of writing.
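
      To ground the MAP_PRIVATE point in item 5, a minimal demonstration
      of the copy-on-write behaviour, with a hypothetical file name that
      must exist and be at least one page long:

        /* Demonstration of the copy-on-write behaviour item 5 refers to.
         * The file name is hypothetical; the file must exist and be at
         * least one page long. The first write gives this process a
         * private anonymous copy that the file never sees. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("/tmp/cow-example", O_RDWR);
            char *p;

            if (fd < 0) {
                perror("open");
                return 1;
            }
            p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
            }
            p[0] = 'X';   /* COW: this page is now anonymous and private */
            /* Other processes never see the modification, which is why
             * MAP_PRIVATE alone cannot back a buffer shared between
             * Postgres processes. */
            munmap(p, 4096);
            close(fd);
            return 0;
        }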

Not all of these suggestions are viable but some are more viable than
others. Ultimately we would still need a test case showing the benefit
even if that depends on a Postgres patch taking advantage of a new
feature.

-- 
Mel Gorman
SUSE Labs





Thread summary: the Linux kernel and PostgreSQL

Posted Jan 15, 2014 19:26 UTC (Wed) by dlang (guest, #313) [Link]

> Josh Berkus claims that most people are using Postgres with 2.6.19

> It is claimed that LKML does not welcome bug reports but it's less clear
> what the basis of this claim is.

actually, I suspect that the problem is that people trying to report bugs are using such an old kernel. You need to report the bug to someone who will support the kernel version you are running, and you then need to be willing to test possible fixes that they give you.

For example, if you are running a RHEL 5 box (shipped with 2.6.18) and try to report a bug to the kernel list, but aren't willing to change the kernel that you're running to something other than what's provided by Red Hat, then your bug report will be ignored. You would need to report it to Red Hat (who you are paying for support) and be willing to install a test kernel that they provide you.

Thread summary: the Linux kernel and PostgreSQL

Posted Jan 16, 2014 9:02 UTC (Thu) by pbonzini (subscriber, #60935) [Link] (3 responses)

Regarding the I/O schedulers, the deadline scheduler also has better performance than CFQ for virtualization.

Thread summary: the Linux kernel and PostgreSQL

Posted Jan 16, 2014 17:11 UTC (Thu) by jhoblitt (subscriber, #77733) [Link] (1 responses)

I think virtually all tuneD profiles set the block scheduler to deadline. I recall that when CFQ was merged into mainline, at least for my desktop, it did improve general disk/system interactivity but it seems that either the deadline algorithm or the hardware characteristics have changed enough since that deadline is likely the better general scheduler. I find that deadline is also better than for [SATA] JBOD SSDs and most (but not all) BBU backed RAID array configurations.

Why is CFQ still the block default?

Thread summary: the Linux kernel and PostgreSQL

Posted Jan 16, 2014 17:12 UTC (Thu) by jhoblitt (subscriber, #77733) [Link]

s/better than/better than NOOP/

Thread summary: the Linux kernel and PostgreSQL

Posted Jan 18, 2014 9:27 UTC (Sat) by dlang (guest, #313) [Link]

I think the key is what the disk subsystem looks like.

If you have a small number of spinning rust drives, CFQ is pretty good.

But as your disk gets more complicated (caching controllers, RAID, virtualization and sharing drives with other OS instances, etc) the heuristics of CFQ stop working well and either deadline or noop end up working better.

The differing workload has some effect as well, but high performance databases almost never talk directly to spinning rust drives. The closest they come is talking to RAID1 pairs of spinning rust drives, and if someone is really worried about performance, they talk to something that behaves _very_ differently.

