From: Dave Chinner <david-AT-fromorbit.com>
To: npiggin-AT-suse.de
Subject: Re: [patch 00/52] vfs scalability patches updated
Date: Wed, 30 Jun 2010 21:30:54 +1000
Cc: linux-fsdevel-AT-vger.kernel.org, linux-kernel-AT-vger.kernel.org,
    John Stultz <johnstul-AT-us.ibm.com>,
    Frank Mayhar <fmayhar-AT-google.com>
On Thu, Jun 24, 2010 at 01:02:12PM +1000, npiggin-AT-suse.de wrote:
Can you put a git tree up somewhere?
> Update to vfs scalability patches:
Now that I've had a look at the whole series, I'll make an overall
comment: I suspect that the locking is sufficiently complex that we
can count the number of people that will be able to debug it on one
hand. This patch set didn't just fall off the locking cliff, it
fell into a bottomless pit...
> Last time I was testing on a 32-node Altix which could be considered as not a
> sweet-spot for Linux performance target (ie. improvements there may not justify
> complexity). So recently I've been testing with a tightly interconnected
> 4-socket Nehalem (4s/32c/64t). Linux needs to perform well on this size of
Sure, but I have to question how much of this is actually necessary?
A lot of it looks like scalability for scalability's sake, not
because there is a demonstrated need...
> *** Single-thread microbenchmark (simple syscall loops, lower is better):
> Test            Difference at 95.0% confidence (50 runs)
> open/close      -6.07% +/- 1.075%
> creat/unlink    27.83% +/- 0.522%
> Open/close is a little faster, which should be due to one less atomic in the
> dput common case. Creat/unlink is significantly slower, which is due to RCU
> freeing inodes.
That's a pretty big ouch. Why does RCU freeing of inodes cause that
much regression? The RCU freeing is out of line, so where does the big
impact come from?
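
For readers who want to reproduce the shape of the "simple syscall loops" quoted above, here is a minimal sketch. The original harness is not shown in this thread, so everything here is an assumption: the helper names `time_loop` and `bench`, the iteration count, and the use of a temporary directory. Python interpreter overhead will dominate the absolute numbers, so this only illustrates what the two loops do and is not a way to measure a 6% or 28% difference.

```python
import os
import tempfile
import time


def time_loop(body, iterations: int) -> float:
    """Return the average wall-clock seconds per call of body()."""
    start = time.perf_counter()
    for _ in range(iterations):
        body()
    return (time.perf_counter() - start) / iterations


def bench(iterations: int = 10000) -> dict:
    """Time an open/close loop and a creat/unlink loop (hypothetical harness)."""
    with tempfile.TemporaryDirectory() as tmp:
        existing = os.path.join(tmp, "existing")
        scratch = os.path.join(tmp, "scratch")
        # Create the file once so the open/close loop never allocates an inode.
        os.close(os.open(existing, os.O_CREAT | os.O_WRONLY, 0o600))

        def open_close():
            os.close(os.open(existing, os.O_RDONLY))

        def creat_unlink():
            # Each iteration allocates a new inode and then frees it again.
            fd = os.open(scratch, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
            os.close(fd)
            os.unlink(scratch)

        return {
            "open/close": time_loop(open_close, iterations),
            "creat/unlink": time_loop(creat_unlink, iterations),
        }


if __name__ == "__main__":
    for name, per_call in bench().items():
        print(f"{name}: {per_call * 1e6:.2f} us per iteration")
```

The creat/unlink loop is the one that allocates and frees an inode on every iteration, which is why it is the loop exposed to any change in how inodes are freed.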
> *** 64 parallel git diff on 64 kernel trees fully cached (avg of 5 runs):
>         vanilla       vfs
> real    0m4.911s      0m0.183s
> user    0m1.920s      0m1.610s
> sys     4m58.670s     0m5.770s
> After vfs patches, 26x increase in throughput, however parallelism is limited
> by test spawning and exit phases. sys time improvement shows closer to 50x
> improvement. vanilla is bottlenecked on dcache_lock.
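
The 26x and "closer to 50x" figures can be checked directly against the quoted table. The sketch below only redoes that arithmetic; the helper name `to_seconds` is hypothetical, and the timings are copied verbatim from the table above.

```python
def to_seconds(stamp: str) -> float:
    """Convert a time(1)-style 'XmY.YYYs' string to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# (vanilla, vfs) pairs copied from the quoted benchmark table.
timings = {
    "real": ("0m4.911s", "0m0.183s"),
    "user": ("0m1.920s", "0m1.610s"),
    "sys": ("4m58.670s", "0m5.770s"),
}

# Ratio of vanilla time to patched time for each row.
ratios = {
    name: to_seconds(vanilla) / to_seconds(patched)
    for name, (vanilla, patched) in timings.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

The real-time ratio comes out a little under 27x and the sys-time ratio a little under 52x, consistent with the numbers claimed in the quoted text.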
So if we cherry pick patches out of the series, what is the bare
minimum set needed to obtain a result in this ballpark? Same for the
> *** Reclaim
> I have not done much reclaim testing yet. It should be more scalable and lower
> latency due to significant reduction in lru locks interfering with other
> critical sections in inode/dentry code, and because we have per-zone locks.
> Per-zone LRUs mean that reclaim is targeted to the correct zone, and that
> kswapd will operate on lists of node-local memory objects.
This means we no longer have any global LRUness to inode or dentry
reclaim, which is going to significantly change caching behaviour.
It's also got interesting corner cases like a workload running on a
single node with a dentry/icache working set larger than the VM
wants to hold on a single node.
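
The corner case can be illustrated with a toy model. This is not the kernel's reclaim logic: the class `LRU`, the node count, the capacities and the cyclic access pattern are all assumptions chosen for illustration, and a cyclic scan is the worst case for any LRU. The sketch compares one global LRU covering all nodes against a single node-local LRU when a workload confined to one node has a working set larger than that node's share.

```python
from collections import OrderedDict


class LRU:
    """Toy LRU cache that only tracks keys and counts evictions."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()
        self.evictions = 0

    def touch(self, key) -> bool:
        """Reference key; return True on a hit, False on a miss."""
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            return True
        self.items[key] = None
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
            self.evictions += 1
        return False


def hit_rate(lru: LRU, working_set: int, passes: int) -> float:
    """Cycle through the working set repeatedly and return the hit fraction."""
    hits = total = 0
    for _ in range(passes):
        for key in range(working_set):
            hits += lru.touch(key)
            total += 1
    return hits / total


NODES = 4          # assumed node count
PER_NODE = 100     # assumed cache objects each node will hold
WORKING_SET = 150  # larger than one node's share, smaller than the total
PASSES = 10

global_rate = hit_rate(LRU(NODES * PER_NODE), WORKING_SET, PASSES)
local_rate = hit_rate(LRU(PER_NODE), WORKING_SET, PASSES)

print(f"global LRU hit rate:     {global_rate:.2f}")
print(f"node-local LRU hit rate: {local_rate:.2f}")
```

Under these assumptions the global list misses only on the first pass, while the node-local list evicts each object just before it is needed again, even though the other nodes have room to spare.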
We went through these sorts of problems with cpusets a few years
back, and the workaround for it was not to limit the slab cache to
the cpuset's nodes. Handling this sort of problem correctly seems
distinctly non-trivial, so I'm really very reluctant to move in this
direction without clear evidence that we have no other