In DragonFly, we did a fair bit of work improving the VM on SMP systems recently (over the last year or so); we still use RB-trees with sx (rw) locking for vm_maps (mapping (VA, aspace) -> vm_pages) and there is still only one pagedaemon (the kthread handling page queue scans), but the front end of the VM and page-fault handling are now SMP-friendly. The page queues are also locked in a fairly fine-grained way.
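To make the shape of that a little more concrete, here's a rough userspace sketch -- invented names throughout, a pthread rwlock standing in for the kernel's sx/rw lock, and a flat array standing in for the RB-tree just to keep it short. The point is only that the fault path needs the shared side of the lock, so faults in different threads of one process don't serialize on the map; mmap()/munmap()-style mutations are what take it exclusively.

    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uintptr_t vm_offset_t;

    /* One VA range in an address space, pointing at its backing object. */
    struct map_entry {
        vm_offset_t start, end;
        void       *object;     /* stand-in for the vm_object -> vm_pages */
    };

    /* Per-address-space map: in DFly this is an RB-tree under an sx/rw
     * lock; a sorted array is used here purely to keep the sketch short. */
    struct addr_map {
        pthread_rwlock_t  lock;
        struct map_entry *entries;
        size_t            nentries;
    };

    /* Fault-path lookup: the shared (read) side of the lock is enough.
     * Copies the entry out while the lock is held. */
    int
    map_lookup(struct addr_map *map, vm_offset_t va, struct map_entry *out)
    {
        int found = 0;

        pthread_rwlock_rdlock(&map->lock);
        for (size_t i = 0; i < map->nentries; i++) {
            if (va >= map->entries[i].start && va < map->entries[i].end) {
                *out = map->entries[i];
                found = 1;
                break;
            }
        }
        pthread_rwlock_unlock(&map->lock);
        return found;
    }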
DFly's recent 3.2 release is pretty competitive with Linux on Postgres/pgbench workloads, and some of the lessons learned while improving that performance might interest the Linux VM team -- for instance, DragonFly implemented a limited form of page-table sharing, which reduced the overhead of managing pv_entries ((D)FBSD's version of rmap entries; see the pre-objrmap Linux VM). Page-table sharing wouldn't bring the same sort of wins in Linux (because of objrmap), but it is worth exploring; I suspect fork() performance would benefit.
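For anyone without the (D)FBSD background, here's roughly what a pv_entry is (a structural sketch with invented names, not the real structures): one entry per (pmap, va) mapping of a physical page, chained off the page so the pagedaemon can find and tear down every mapping of it. With N processes mapping the same S-page region that's on the order of N*S entries to allocate, link, and walk; sharing the page-table pages for that region means one set of PTEs, so most of that per-mapping bookkeeping simply goes away.

    #include <stdint.h>

    typedef uintptr_t vm_offset_t;
    struct pmap;                        /* per-address-space MMU state */

    /* One (pmap, va) mapping of a physical page; (D)FBSD's rmap analogue. */
    struct pv_entry {
        struct pmap     *pv_pmap;       /* which address space maps it */
        vm_offset_t      pv_va;         /* at which virtual address */
        struct pv_entry *pv_next;       /* next mapping of the same page */
    };

    /* The physical page keeps the head of that list, so reverse lookups
     * (page -> all its mappings) are cheap for the pagedaemon. */
    struct vm_page {
        struct pv_entry *pv_list;
        /* ... paging state, queue linkage, etc. ... */
    };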
Another neat bit of DFly VM machinery that might be interesting is 'swapcache', which basically allows clean file data and metadata to be written out to a (hopefully solid-state) swap device. It has a similar effect to flashcache, except that it happens up at the file/vnode layer rather than down at the block-device layer.
Some other interesting things we learned while scaling out DFly: we spent a bunch of time optimizing our spinlocks on a wide AMD K10 system (4 sockets, 48 cores), and along the way we reinvented Linux's ticket spinlock, except with the two counters on separate cachelines rather than packed into parts of the same word. We found ticket locks to be pretty terrible under heavy contention (see the comment on LWN's discussion: "Ticket spinlocks perform terribly for uncontested and highly contested cases.") and got a pretty incredible reduction in concurrent buildworld time, from 175 sec to 90-something sec, just by moving to something like a while-cmpxchg loop.
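For reference, the two shapes look roughly like this (userspace sketch with GCC atomic builtins and made-up names; the real DFly spinlocks carry more state, and the "while-cmpxchg" lock is really just a test-and-test-and-set):

    #include <stdint.h>
    #include <stdbool.h>

    /* Ticket-style lock: FIFO and fair.  In the DFly experiment the two
     * counters sat on separate cachelines instead of sharing one word. */
    struct ticket_lock {
        _Alignas(64) uint32_t next_ticket;  /* take-a-number counter */
        _Alignas(64) uint32_t now_serving;  /* bumped by the releaser */
    };

    static void
    ticket_lock(struct ticket_lock *l)
    {
        uint32_t me = __atomic_fetch_add(&l->next_ticket, 1, __ATOMIC_RELAXED);
        while (__atomic_load_n(&l->now_serving, __ATOMIC_ACQUIRE) != me)
            ;       /* spin; a cpu_pause()/PAUSE hint would go here */
    }

    static void
    ticket_unlock(struct ticket_lock *l)
    {
        __atomic_fetch_add(&l->now_serving, 1, __ATOMIC_RELEASE);
    }

    /* "while-cmpxchg" lock: unfair, waiters spin read-only until the lock
     * looks free and only then attempt the atomic swap. */
    struct cmpxchg_lock {
        _Alignas(64) uint32_t locked;
    };

    static void
    cmpxchg_lock(struct cmpxchg_lock *l)
    {
        uint32_t expected;

        do {
            while (__atomic_load_n(&l->locked, __ATOMIC_RELAXED))
                ;   /* wait until it looks free */
            expected = 0;
        } while (!__atomic_compare_exchange_n(&l->locked, &expected, 1,
                     false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED));
    }

    static void
    cmpxchg_unlock(struct cmpxchg_lock *l)
    {
        __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
    }

My read of why the second one behaved so much better at 48-way contention: the ticket lock's strict FIFO handoff means the next owner is often on a far-away socket, so the lock line (and the data it protects) has to migrate across the machine on every handoff, while the unfair cmpxchg loop tends to let a core that already has the line warm reacquire it, which is much kinder to a 4-socket box.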