Use option -XX:+UseConcMarkSweepGC.
For more information, see this article:
The Managed Runtime Initiative
Posted Jun 19, 2010 23:37 UTC (Sat) by giltene (guest, #67734)
The duration of the stop-the-world pause in all current JVM GCs is generally linear in the amount of live data the heap contains (you have to scan all that stuff and fix all the pointers to the relocated objects). This means that the larger the heap, the larger the pause. Sun's CMS (the Mostly Concurrent Mark Sweep -XX:+UseConcMarkSweepGC mentioned above) will delay compaction as long as it can and track empty spaces in free lists, but it will eventually fall back on its compaction code and pause for about 2-4 seconds per live gigabyte on a modern x86-64 machine. This is why JVMs are generally not used with more than a few GB of data, except for batch apps (ones that can accept tens of seconds of complete pause). Since a 256GB server now costs less than $18K, there is a ~100x and growing gap between commodity server capacity and the ability of an individual runtime to scale with acceptable response times.
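Pauses like these can be observed from inside the process itself. The sketch below (illustrative only, not any particular tool) has a thread sleep for a fixed short interval and compare intended versus actual elapsed time; any large overshoot is a whole-process stall, such as a stop-the-world GC pause. The class and method names are assumptions for this example:

```java
// Minimal "hiccup" detector sketch: a thread that sleeps for a fixed
// interval and measures how much longer it actually slept. A large
// overshoot means the whole process was stalled (e.g. by a
// stop-the-world GC pause). Names here are illustrative.
public class PauseDetector {
    // Returns the worst observed stall (oversleep) in milliseconds
    // across the given number of samples.
    public static long maxStallMillis(long intervalMillis, int samples)
            throws InterruptedException {
        long worst = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            Thread.sleep(intervalMillis);
            long actualMillis = (System.nanoTime() - start) / 1_000_000;
            worst = Math.max(worst, actualMillis - intervalMillis);
        }
        return worst;
    }

    public static void main(String[] args) throws InterruptedException {
        // Run alongside an allocation-heavy workload (e.g. with
        // -Xmx8g -XX:+UseConcMarkSweepGC) to see GC-induced stalls.
        System.out.println("worst stall: " + maxStallMillis(1, 100) + " ms");
    }
}
```

On an idle JVM the worst stall stays near zero; under heavy allocation with a compacting collector, it approaches the multi-second pauses described above.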
The Pauseless GC algorithm and implementation put forward as part of the Managed Runtime Initiative changes all this. It compacts the heap and fixes up pointers concurrently, without having to stop the world to do so. As a result, response time is completely decoupled from memory size, and indirectly from allocation rate and throughput, breaking the 100x logjam.
Posted Jun 20, 2010 17:19 UTC (Sun) by rilder (subscriber, #59804)
Posted Jun 21, 2010 4:06 UTC (Mon) by giltene (guest, #67734)
Due to various needs, though, our current implementation includes a separate physical memory management subsystem. The main, overriding need for this is the avoidance of unnecessary TLB invalidates when physical pages are unmapped and then mapped [a bit later] within the same process; this is the most frequent pattern across the pauseless collector and the application mutators, and addressing it pretty much requires process-local free lists, which we use.
WRT VMs (e.g., running this kernel as a guest on top of KVM or VMware), we see those as common targets. Because the rate of virtual memory mapping changes in our system is quite substantial (well over 1000x compared to typical OS and application loads), we care a lot about the cost of those manipulations in virtualized environments. Luckily, the EPT (Intel) and NPT/RVI (AMD) features in all modern x86-64 machines practically eliminate this cost, taking the hypervisor out of the business of intercepting, tracking, and applying guest-virtual to guest-physical mappings. Without those HW-assisted features, applying changes at the rate we do would certainly "hurt" on virtualized systems.
We'll also be able to play with memory ballooning, and we've got some very good uses for it, but the current implementation we've posted doesn't use it yet. When we do, we'll be looking to deflate balloons at sustained rates of several GBs per second, which will probably put some serious stress on current host implementations. This is one of the items at the hypervisor level that we're going to be playing with as part of the Managed Runtime Initiative [large page and high sustained rate ballooning]. At first glance, KVM's handling of ballooning seems like it can easily be extended to accommodate what we'll need.
Posted Jun 23, 2010 17:06 UTC (Wed) by jonabbey (subscriber, #2736)
Posted Jun 27, 2010 0:00 UTC (Sun) by giltene (guest, #67734)
1) A concurrent, multi-pass OldGen marker
2) An incremental, stop-the-world OldGen compactor
3) A stop-the-world NewGen copying collector
It's the stop-the-world nature of (2) and (3) above that determines pause behavior for large-heap, high-throughput applications. When object copying and/or compaction are done in a stop-the-world operation, moving even a single object in memory requires that *all* references to that object in the heap be tracked down and corrected before the program is allowed to continue execution. This takes an amount of time bounded only by the size of the heap (i.e., it takes a multi-seconds-per-live-GB pause to find all those references). While a lot of cool code and tuning goes into spacing such long pauses out in time, all stop-the-world-compaction collectors (including G1) will periodically fall back on their full stop-the-world compaction code when dealing with complex heaps. [The collector's elaborate code for doing that operation is there for a reason.] For applications that cannot accept the occasional 50-second pause during normal operation, this means heap size is still limited by pause time.
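The cost structure described here can be seen in a toy model: treat the heap as a flat array of "reference" slots, and note that relocating one object forces a scan of every slot to rewrite references to its old address. This is purely illustrative (not HotSpot's actual code); all names are invented for the sketch:

```java
// Toy model of stop-the-world relocation. The "heap" is an array of
// reference slots (integers standing in for addresses). Moving one
// object means scanning every slot to fix references to its old
// address -- O(heap size) work regardless of the object's size,
// which is why the pause scales with the heap. Illustrative only.
public class StwRelocate {
    // Rewrite every reference to oldAddr so it points at newAddr.
    // Returns the number of slots visited (always the full heap).
    public static int relocate(int[] heap, int oldAddr, int newAddr) {
        int visited = 0;
        for (int i = 0; i < heap.length; i++) {
            visited++;
            if (heap[i] == oldAddr) heap[i] = newAddr;
        }
        return visited;
    }

    public static void main(String[] args) {
        int[] heap = new int[1_000_000];
        heap[42] = 7;             // one slot references the object at "7"
        int work = relocate(heap, 7, 9);
        System.out.println(work); // prints 1000000: a full-heap scan
    }
}
```

Real collectors use remembered sets and card tables to narrow the scan, but as the comment notes, the full stop-the-world fixup remains the fallback.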
In contrast, the GPGC algorithm used in both Azul's Zing VM and in the Managed Runtime Initiative contribution (and the default, production collector for Azul Vega systems since 2005) can be classified as:
1) A concurrent, guaranteed-single-pass OldGen marker
2) A concurrent OldGen compactor
3) A concurrent, guaranteed-single-pass NewGen marker
4) A concurrent NewGen compactor
Concurrent [as opposed to incremental, stop-the-world] compaction is the key here. It completely decouples application pause behavior from heap size and allocation rates, and allows us to deal with the amounts of memory prevalent in today's servers without those nasty many-second pauses. GPGC simply doesn't have any form of stop-the-world compaction code in it [as a fallback or otherwise]. It doesn't need it. In GPGC, objects are relocated without having to stop the application while all references to them are tracked down and fixed. Instead, reference remapping is a concurrent operation supported by self-healing read barriers.
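The self-healing idea can be sketched in a toy model, assuming a simple forwarding table stands in for the collector's record of relocated objects. On each reference load, a barrier checks whether the loaded reference is stale; if so, it remaps it and writes the corrected reference back into the slot it came from, so that slot never pays the remap cost again. This is a sketch under those assumptions, not GPGC's implementation, and all names are invented:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a self-healing read barrier over a toy heap. A forwarding
// table (old address -> new address) models relocated objects. On a
// load that sees a stale reference, the barrier remaps it AND heals
// the slot in place, so subsequent loads of that slot are clean.
// Illustrative only; the real barrier is emitted on every ref load.
public class SelfHealingBarrier {
    final int[] heap;                                       // slots hold addresses
    final Map<Integer, Integer> forwarding = new HashMap<>(); // old -> new
    int remapCount = 0;                                     // remaps actually taken

    SelfHealingBarrier(int[] heap) { this.heap = heap; }

    // Read barrier: load heap[slot]; if that address was relocated,
    // remap the reference and heal the slot before returning.
    int loadRef(int slot) {
        int ref = heap[slot];
        Integer moved = forwarding.get(ref);
        if (moved != null) {        // stale reference: remap + heal
            remapCount++;
            heap[slot] = moved;     // heal: later loads skip the remap
            return moved;
        }
        return ref;                 // common case: no work beyond the check
    }

    public static void main(String[] args) {
        int[] heap = {7, 7, 3};
        SelfHealingBarrier b = new SelfHealingBarrier(heap);
        b.forwarding.put(7, 9);     // collector moved object 7 -> 9 concurrently
        b.loadRef(0);               // first load of slot 0 remaps and heals it
        b.loadRef(0);               // second load finds the healed slot
        System.out.println(b.remapCount); // prints 1: each slot heals once
    }
}
```

Because each stale slot is fixed at most once, by its first loader, the mutators collectively complete the pointer fixup that a stop-the-world collector would do in one long pause.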
It's the software pipeline used to support this concurrent compaction (and guaranteed-single-pass marking) that needs the new virtual memory operations we've put forward.
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds