The Managed Runtime Initiative

Posted Jun 19, 2010 23:37 UTC (Sat) by giltene (guest, #67734)
In reply to: The Managed Runtime Initiative by foom
Parent article: The Managed Runtime Initiative

All current commercial GCs on large-scale runtimes include code that compacts the object heap in a stop-the-world operation. Compaction is unavoidable in long-lived applications that use variable-sized objects (e.g. XML data). Most GC setups can be tuned in various ways to delay this inevitable compaction, but none can avoid it. Think of it as a ticking time bomb - do enough work, and you'll need to defragment the heap because your new object can't fit in any of the empty spaces you've been tracking.
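To make the "can't fit in any of the empty spaces" point concrete, here is a minimal sketch of a first-fit free list. The hole sizes and the 128 KB request are made-up numbers, and `first_fit` is a hypothetical helper, not code from any real collector.

```python
def first_fit(free_holes, size):
    """Return the index of the first hole that can hold `size`, or None."""
    for i, hole in enumerate(free_holes):
        if hole >= size:
            return i
    return None

# Hypothetical heap: free space (in KB) left between surviving objects.
free_holes = [48, 16, 64, 32, 40]
request = 128

total_free = sum(free_holes)  # plenty of free memory in total

# No single hole is large enough, so allocation fails without compaction.
fits_without_compaction = first_fit(free_holes, request) is not None

# Compaction slides live objects together, merging all holes into one.
fits_after_compaction = first_fit([total_free], request) is not None

print(total_free, fits_without_compaction, fits_after_compaction)
```

The heap has more than enough free memory overall, but only compaction turns it into a single usable region.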

The duration of the stop-the-world pause in all current JVM GCs is generally linear in the amount of live data the heap contains (you have to scan all that stuff and fix all the pointers to the relocated objects). This means that the larger the heap, the longer the pause. Sun's CMS (the Mostly Concurrent Mark Sweep collector, -XX:+UseConcMarkSweepGC, mentioned above) will delay the compaction as long as it can and track empty spaces in free lists, but it will eventually fall back on its compaction code and pause for about 2-4 seconds per live gigabyte on a modern x86-64 machine. This is why JVMs are generally not used with more than a few GB of data, except for batch apps (ones that can accept tens of seconds of complete pause). Since a 256GB server now costs less than $18K, there is a ~100x and growing gap between commodity server capacity and the ability of an individual runtime to scale with acceptable response times.
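A back-of-the-envelope sketch of that linear relationship, using only the 2-4 seconds per live GB figure quoted above. The function name and the sample heap sizes are illustrative assumptions, not measurements.

```python
# Rough range quoted above for a full stop-the-world compaction.
LOW_SECONDS_PER_LIVE_GB = 2.0
HIGH_SECONDS_PER_LIVE_GB = 4.0

def pause_range_seconds(live_gb):
    """Estimated (low, high) full-compaction pause for `live_gb` of live data."""
    return (live_gb * LOW_SECONDS_PER_LIVE_GB,
            live_gb * HIGH_SECONDS_PER_LIVE_GB)

# Hypothetical live-data sizes, in GB.
for live_gb in (1, 4, 16, 64):
    low, high = pause_range_seconds(live_gb)
    print(f"{live_gb:>3} GB live: {low:.0f}-{high:.0f} s pause")
```

Under this linear model, a heap with tens of GB of live data lands in pauses measured in minutes, which is why heap sizes stay small for interactive applications.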

The Pauseless GC algorithm and implementation put forward as part of the Managed Runtime Initiative changes all this. It compacts the heap and fixes up pointers concurrently, without having to stop the world to do so. As a result, response time is completely decoupled from memory size, and indirectly from allocation rate and throughput, breaking the 100x logjam.

The Managed Runtime Initiative

Posted Jun 20, 2010 17:19 UTC (Sun) by rilder (guest, #59804) [Link]

Looks interesting. However, I recently read about commits related to memory compaction and transparent huge pages, which you may find useful in your work.
Also, in your current implementation, are VMs treated differently wrt memory management? There have also been commits for VMs covering memory ballooning etc.

The Managed Runtime Initiative

Posted Jun 21, 2010 4:06 UTC (Mon) by giltene (guest, #67734) [Link]

Both transparent huge page support and compaction (which transparent huge pages cannot reliably be done without) are good things. We may be able to merge some of our functionality with those over time. We are certainly large consumers of 2MB pages, so we see compaction as very important, and with it we may be able to provide transparent, reliable dynamic allocation of 2MB pages in our subsystem. Without compaction, pages must be explicitly funded and taken away from the kernel upfront, since fragmentation may make it impossible to get hold of them later.

Due to various needs though, our current implementation uses a separate physical memory management subsystem. The main, overriding need for this is the avoidance of unnecessary TLB invalidates when physical pages are unmapped and mapped [a bit later] within the same process - this is the most frequent pattern across the pauseless collector and the application mutators, and addressing it pretty much requires process-local free lists, which we use.

WRT VMs (e.g. running this kernel as a guest on top of KVM or VMware), we see those as common targets. Because the rate of virtual memory mapping changes in our system is quite substantial (well over 1000x compared to the typical OS and application loads), we care a lot about the cost of those manipulations in virtualized environments. Luckily, the EPT (Intel) and NPT/RVI (AMD) features in all modern x86-64 machines practically eliminate this cost, taking the hypervisor out of the business of intercepting, tracking, and applying guest-virtual to guest-physical mappings. Without those HW-assisted features, applying changes at the rate we do would certainly "hurt" on virtualized systems.

We'll also be able to play with memory ballooning, and we've got some very good uses for it, but the current implementation we've posted doesn't. When we do, we'll be looking to deflate balloons at sustained rates of several GBs per second, which will probably put some serious stress on current host implementations. This is one of the items at the hypervisor level that we're going to be playing with as part of the Managed Runtime Initiative [large page and high sustained rate ballooning]. At first glance, KVM's handling of ballooning seems like it can easily be extended to accommodate what we'll need.

The Managed Runtime Initiative

Posted Jun 23, 2010 17:06 UTC (Wed) by jonabbey (guest, #2736) [Link]

Have you characterized the performance of Sun's G1 collector that they will be using in Java 7?

The Managed Runtime Initiative

Posted Jun 27, 2010 0:00 UTC (Sun) by giltene (guest, #67734) [Link]

We've obviously been following G1's progress and the algorithm it uses. It's hard to "characterize the actual performance" of G1 right now given its current experimental state [simple experiments right now show longer pauses than HotSpot's production CMS collector]. However, analyzing G1's basic mechanisms is pretty straightforward. Here is how they can be classified:

1) A concurrent, multi-pass OldGen marker
2) An incremental, stop-the-world OldGen compactor
3) A stop-the-world NewGen copying collector

It's the stop-the-world nature of (2) and (3) above that determines pause behavior for large heap and high throughput applications. When object copying and/or compaction are done in a stop-the-world operation, moving even a single object in memory requires that *all* references to that object in the heap be tracked down and corrected before the program is allowed to continue execution. This takes an amount of time bounded only by the size of the heap (i.e. it takes a multi-seconds-per-live-GB pause to find all those references). While a lot of cool code and tuning goes into spacing such long pauses out in time, all stop-the-world-compaction collectors (including G1) will periodically fall back on their full stop-the-world compaction code when dealing with complex heaps. [The collector's elaborate code for doing that operation is there for a reason]. For applications that cannot accept the "occasional 50 seconds pause" during normal operation, this means heap size is still limited by pause time.
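As a toy model of why moving even one object is expensive in a stop-the-world design, the sketch below represents the heap as a dict of address -> list of reference fields. The addresses and the function name are hypothetical, and real collectors use far more elaborate bookkeeping; the point is only that the worst-case fix-up work grows with the heap, not with the object being moved.

```python
def relocate_stop_the_world(heap, old_addr, new_addr):
    """Move one object, then fix every reference to it.

    Must finish before the application resumes. Returns the number of
    reference fields that had to be examined.
    """
    heap[new_addr] = heap.pop(old_addr)
    scanned = 0
    for fields in heap.values():
        for i, ref in enumerate(fields):
            scanned += 1
            if ref == old_addr:
                fields[i] = new_addr  # correct the stale pointer
    return scanned

# Hypothetical heap: address -> list of addresses the object points to.
heap = {
    0x10: [0x20, 0x30],
    0x20: [0x30],
    0x30: [],
    0x40: [0x30, 0x10],
}

scanned = relocate_stop_the_world(heap, 0x30, 0x90)
print(scanned, heap)
```

Every reference field in the heap is visited to move a single object, so the pause scales with total live data.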

In contrast, the GPGC algorithm used in both Azul's Zing VM and in the Managed Runtime Initiative contribution (and in the default, production collector for Azul Vega systems since 2005) can be classified as:
1) A concurrent, guaranteed-single-pass OldGen marker
2) A concurrent OldGen compactor
3) A concurrent, guaranteed-single-pass NewGen marker
4) A concurrent NewGen compactor

Concurrent [as opposed to incremental stop-the-world] compaction is the key here. It completely decouples application pause behavior from heap size and allocation rates, and allows us to deal with the amounts of memory prevalent in today's servers without those nasty many-seconds pauses. GPGC simply doesn't have any form of stop-the-world compaction code in it [as a fallback or otherwise]. It doesn't need it. In GPGC, objects are relocated without having to stop the application while all references to them are tracked down and fixed. Instead, ref remapping is a concurrent operation supported by self-healing read barriers.
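A very rough sketch of the self-healing read barrier idea, under heavy simplifying assumptions: a single thread, a plain forwarding table, and a hypothetical `ToyHeap` class. This is not the GPGC code and leaves out everything that makes the real thing concurrent; it only shows that a stale reference gets corrected lazily, at the moment it is loaded, rather than in a heap-wide pass.

```python
class ToyHeap:
    """Toy heap: address -> dict of named reference fields."""

    def __init__(self, objects):
        self.objects = objects
        self.forwarding = {}  # old address -> new address

    def relocate(self, old_addr, new_addr):
        """Move an object without touching any reference to it."""
        self.objects[new_addr] = self.objects.pop(old_addr)
        self.forwarding[old_addr] = new_addr

    def load_ref(self, holder, field):
        """Read barrier: runs on every reference load."""
        ref = self.objects[holder][field]
        if ref in self.forwarding:
            ref = self.forwarding[ref]
            # Self-heal: fix the field so later loads see no stale value.
            self.objects[holder][field] = ref
        return ref

heap = ToyHeap({1: {"next": 2}, 2: {"next": None}})
heap.relocate(2, 7)

stale = heap.objects[1]["next"]      # still the old address
loaded = heap.load_ref(1, "next")    # barrier returns the new address
healed = heap.objects[1]["next"]     # field was corrected in place
print(stale, loaded, healed)
```

The relocation itself touches no references; each stale field is repaired the first time the application reads through it.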

It's the software pipeline used to support this concurrent compaction (and guaranteed-single-pass marking) that needs the new virtual memory operations we've put forward.

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds