LWN.net Logo

smp overhead, and rwlocks considered harmful

From:  Andrew Morton <akpm@digeo.com>
To:  linux-kernel@vger.kernel.org
Subject:  smp overhead, and rwlocks considered harmful
Date:  Sat, 22 Mar 2003 17:58:16 -0800


I've been looking at the CPU cost of the write() system call.  Time how long
it takes to write a million bytes to an ext2 file, via a million
one-byte-writes:

	time dd if=/dev/zero of=foo bs=1 count=1M

This only uses one CPU.  It takes twice as long on SMP.

On a 2.7GHz P4-HT:

	2.5.65-mm4, UP:
		0.34s user 1.00s system 99% cpu 1.348 total
	2.5.65-mm4, SMP:
		0.41s user 2.04s system 100% cpu 2.445 total

	2.4.21-pre5, UP:
		0.34s user 0.96s system 106% cpu 1.224 total
	2.4.21-pre5, SMP:
		0.42s user 1.95s system 99% cpu 2.372 total

(The small additional overhead in 2.5 is expected - there are more function
calls due to the addition of AIO and there is more setup due to the (large)
writev speedups).


On a 500MHz PIII Xeon:

	500MHz PIII, UP:
		1.08s user 2.90s system 100% cpu 3.971 total
	500MHz PIII, SMP:
		1.13s user 4.86s system 99% cpu 5.999 total


This pretty gross.  About six months back I worked out that across the
lifecycle of a pagecache page (creation via write() through to reclaim via
the page LRU) we take 27 spinlocks and rwlocks.  And this does not even
include semaphores and atomic bitops (I used lockmeter).  I'm not sure it is
this high any more - quite a few things were fixed up, but it is still high.


Profiles for 2.5 on the P4 show that it's all in fget(), fput() and
find_get_page().  Those locked operations are really hurting.

One thing is noteworthy: ia32's read_unlock() is buslocked, whereas
spin_unlock() is not.  So let's see what happens if we convert file_lock
from an rwlock to a spinlock:


2.5.65-mm4, SMP:
	0.34s user 2.00s system 100% cpu 2.329 total

That's a 5% speedup.


And if we were to convert file->f_count to be a nonatomic "int", protected by
files_lock it would probably speed things up further.

I've always been a bit skeptical about rwlocks - if you're holding the lock
for long enough for a significant amount of reader concurrency, you're
holding it for too long.  eg:  tasklist_lock.



SMP:

c0254a14 read_zero                                   189   0.3841
c0250f5c clear_user                                  217   3.3906
c0148a64 sys_write                                   251   3.9219
c0148a24 sys_read                                    268   4.1875
c014b00c __block_prepare_write                       485   0.5226
c0130dd4 unlock_page                                 493   7.7031
c0120220 current_kernel_time                         560   8.7500
c0163190 __mark_inode_dirty                          775   3.5227
c01488ec vfs_write                                   983   3.1506
c0132d9c generic_file_write                          996  10.3750
c01486fc vfs_read                                   1010   3.2372
c014b3ac __block_commit_write                       1100   7.6389
c0149500 fput                                       1110  39.6429
c01322e4 generic_file_aio_write_nolock              1558   0.6292
c0130fc0 find_lock_page                             2301  13.0739
c01495f0 fget                                       2780  33.0952
c0108ddc system_call                                4047  91.9773
c0106f64 default_idle                              24624 473.5385
00000000 total                                     45321   0.0200

UP:

c023c9ac radix_tree_lookup                            13   0.1711
c015362c inode_times_differ                           14   0.1944
c015372c inode_update_time                            14   0.1000
c012cb9c generic_file_write                           15   0.1630
c012ca90 generic_file_write_nolock                    17   0.1214
c0140e1c fget                                         17   0.3542
c023deb8 __copy_from_user_ll                          17   0.1545
c0241514 read_zero                                    17   0.0346
c014001c vfs_read                                     18   0.0682
c01401dc vfs_write                                    20   0.0758
c01430d8 generic_commit_write                         20   0.1786
c01402e4 sys_read                                     21   0.3281
c0142968 __block_commit_write                         29   0.2014
c023dc4c clear_user                                   34   0.5312
c0140324 sys_write                                    38   0.5938
c01425cc __block_prepare_write                        39   0.0422
c012c0f4 generic_file_aio_write_nolock                89   0.0362
c0108b54 system_call                                 406   9.2273
00000000 total                                       944   0.0004



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


(Log in to post comments)

Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds