The Managed Runtime Initiative
Posted Jun 18, 2010 1:27 UTC (Fri) by giltene (guest, #67734)
It does have some pesky issues. Like GC, but those can actually be solved. We've shown that. It's just that [at least the one way we know of for] solving them does require fundamental changes to things like kernel memory management. Some of the core operations we are advocating for make an actual 1,000,000x difference to how critical phases of GC behave, thereby making the difference between a practical, no-more-pauses GC world and the current one. [For some operation metrics numbers, see http://www.managedruntime.org/files/downloads/AzulVmemMet... ]
Does that make such fundamental changes worthwhile? We think so. Some may not.
Posted Jun 18, 2010 16:19 UTC (Fri) by jzbiciak (subscriber, #5246)
As an outsider looking in, these batch virtual memory operations sound interesting in their own right, and not specifically related to Java and GC. It's just that your Java GC can make immediate use of them.
This seems like a similar transformation to providing a vector of IO operations, versus repeated calls to read() or write().
What other workloads might benefit from such batch-oriented VM? Poul-Henning Kamp had an article in ACM Queue recently that showed a 10x speedup on heap structures by avoiding VM hits. While not identical to the problem you're solving, I think it highlights that explicitly optimizing VM behavior, even on "solved problems," is a generally interesting space, with opportunities for huge improvements. After all, heaps are "optimal," right? Then why the 10x speedup? Finding how your work can be leveraged outside your Java GC environment will make it more attractive.
Posted Jun 19, 2010 5:45 UTC (Sat) by giltene (guest, #67734)
We built the enhanced APIs to be generic [we think], without assumptions about the user-level code being a runtime, and focusing on the needed functionality and semantics. Our GC code is all in user space (part of the OpenJDK based code we put up along with the kernel mods) - it just needs some very scalable and somewhat different virtual and physical memory manipulation semantics from the kernel.
So yes, there can certainly be other uses for batched virtual memory operation with extremely fast commit times, and for other features like explicit-TLB-invalidate semantics (allowing user process to determine when a TLB invalidate is required, instead of issuing one per page), scalable virtual memory manipulation (not done under process-global lock), large page support that includes remapping capability, user-controlled shatter/unshatter transitions from large to small mappings while retaining large physical page layouts, etc.
Posted Jun 21, 2010 20:56 UTC (Mon) by nix (subscriber, #2304)
Posted Jun 18, 2010 16:28 UTC (Fri) by k8to (subscriber, #15413)
I always assumed the pervasive and unpleasant nature of this problem was simply Sun's inability to make Java a real platform beyond a relatively narrow set of environments. It seemed like the other kids all came up with solutions that worked for them.
Posted Jun 18, 2010 22:19 UTC (Fri) by foom (subscriber, #14868)
But, python already has such poor performance, and, especially, cannot run multiple threads at once, so it doesn't come up that much. People just don't use python for the kinds of applications that demand high performance with high allocation rates on 12 threads at once...
Posted Jun 19, 2010 8:46 UTC (Sat) by vachi (guest, #67512)
Posted Jun 19, 2010 19:19 UTC (Sat) by k8to (subscriber, #15413)
Meanwhile, we can deconstruct your analogy by pointing out that DOS does not have a scheduler.
Posted Jun 19, 2010 19:51 UTC (Sat) by foom (subscriber, #14868)
Use option -XX:+UseConcMarkSweepGC.
For more information, see this article:
Posted Jun 19, 2010 23:37 UTC (Sat) by giltene (guest, #67734)
The duration of the stop-the-world pause in all current JVM GCs is generally linear in the amount of live data the heap contains (you have to scan all that stuff and fix all the pointers to the relocated objects). This means that the larger the heap, the larger the pause. Sun's CMS (the Mostly Concurrent Mark Sweep -XX:+UseConcMarkSweepGC mentioned above) will delay the compaction as long as it can and track empty spaces in free lists, but it will eventually fall back on its compaction code and pause for about 2-4 seconds per live gigabyte on a modern x86-64 machine. This is why JVMs are generally not used with more than a few GB of data, except for batch apps (ones that can accept tens of seconds of complete pause). Since a 256GB server now costs less than $18K, there is a ~100x and growing gap between commodity server capacity and the ability of an individual runtime to scale with acceptable response times.
The Pauseless GC algorithm and implementation put forward as part of the Managed Runtime Initiative changes all this. It compacts the heap and fixes up pointers concurrently, without having to stop the world to do so. As a result, response time is completely decoupled from memory size, and indirectly from allocation rate and throughput, breaking the 100x logjam.
Posted Jun 20, 2010 17:19 UTC (Sun) by rilder (guest, #59804)
Posted Jun 21, 2010 4:06 UTC (Mon) by giltene (guest, #67734)
Due to various needs, though, our current implementation includes a separate physical memory management subsystem. The main, overriding need for this is the avoidance of unnecessary TLB invalidates when physical pages are unmapped and mapped [a bit later] within the same process - this is the most frequent pattern across the pauseless collector and the application mutators, and addressing it pretty much requires process-local free lists, which we use.
WRT VMs (e.g., running this kernel as a guest on top of KVM or VMware), we see those as common targets. Because the rate of virtual memory mapping changes in our system is quite substantial (well over 1000x the typical OS and application loads), we care a lot about the cost of those manipulations in virtualized environments. Luckily, the EPT (Intel) and NPT/RVI (AMD) features in all modern x86-64 machines practically eliminate this cost, taking the hypervisor out of the business of intercepting, tracking, and applying guest-virtual to guest-physical mappings. Without those HW-assisted features, applying changes at the rate we do would certainly "hurt" on virtualized systems.
We'll also be able to play with memory ballooning, and we've got some very good uses for it, but the current implementation we've posted doesn't use it yet. When we do, we'll be looking to deflate balloons at sustained rates of several GBs per second, which will probably put some serious stress on current host implementations. This is one of the items at the hypervisor level that we're going to be playing with as part of the Managed Runtime Initiative [large page and high sustained rate ballooning]. At first glance, KVM's handling of ballooning seems like it can easily be extended to accommodate what we'll need.
Posted Jun 23, 2010 17:06 UTC (Wed) by jonabbey (guest, #2736)
Posted Jun 27, 2010 0:00 UTC (Sun) by giltene (guest, #67734)
1) A concurrent, multi-pass OldGen marker
2) An incremental, stop-the-world OldGen compactor
3) A stop-the-world NewGen copying collector
It's the stop-the-world nature of (2) and (3) above that determines pause behavior for large heap and high throughput applications. When object copying and/or compaction are done in a stop-the-world operation, moving even a single object in memory requires that *all* references to that object in the heap be tracked down and corrected before the program is allowed to continue execution. This takes an amount of time bound only by the size of the heap (i.e. it takes a multi-seconds-per-live-GB pause to find all those references). While a lot of cool code and tuning goes into spacing such long pauses out in time, all stop-the-world-compaction collectors (including G1) will periodically fall back on their full stop-the-world compaction code when dealing with complex heaps. [The collector's elaborate code for doing that operation is there for a reason]. For applications that cannot accept the "occasional 50-second pause" during normal operation, this means heap size is still limited by pause time.
In contrast, the GPGC algorithm used in both Azul's Zing VM and in the Managed Runtime Initiative contribution (and the default, production collector for Azul Vega systems since 2005) can be classified as:
1) A concurrent, guaranteed-single-pass OldGen marker
2) A concurrent OldGen compactor
3) A concurrent, guaranteed-single-pass NewGen marker
4) A concurrent NewGen compactor
Concurrent [as opposed to incremental stop-the-world] compaction is the key here. It completely decouples application pause behavior from heap size and allocation rates, and allows us to deal with the amounts of memory prevalent in today's servers without those nasty many-seconds pauses. GPGC simply doesn't have any form of stop-the-world compaction code in it [as a fall back or otherwise]. It doesn't need it. In GPGC, objects are relocated without having to stop the application while all references to them are tracked down and fixed. Instead, ref remapping is a concurrent operation supported by self-healing read barriers.
It's the software pipeline used to support this concurrent compaction (and guaranteed-single-pass marking) that needs the new virtual memory operations we've put forward.
Posted Sep 6, 2010 22:20 UTC (Mon) by Blaisorblade (guest, #25465)
Nah, the behavior is quite different. The Sun JVM seems very good at keeping a large number of objects in memory and delaying the reclamation of their space until it can do a large number at once, causing stalls and wasting memory. Other systems, like Python, seem to reclaim objects much more incrementally, which might not be as efficient in the long-term view.
As just said elsewhere by Nix, since Python uses standard reference counting, it is not efficient even in the short term view, because copying a pointer to the stack causes a heap mutation, even to pass a parameter to a procedure. That's why trying to support multithreading gave a 2x slowdown. Given that other portions of code have been optimized, I believe the slowdown nowadays would be bigger. And the slowdown you get with Java and a smaller heap is probably still not comparable to the one you get in Python (no less than 10x).
Python apps seem to have decent performance when they do little and the rest is written in C, but as soon as you try to actually do something with Python code, you lose. I don't get how the same community, which prefers C to Java for performance reasons, can even mention Python. I hope it's not the same people at least.
There are many realtime GCs, and each of them is better than Python's.
In particular, Cliff Click described the pauseless GC, with its amazing performance and small overhead, somewhere on his blog, which I recommend to those interested in the field (even if it's quite technical). However, he describes their special CPU, but it seems they could port it to x86, and the code is in this release.
For Python, refcounting was just a bad choice in the beginning, and it's now impossible to get rid of it without rewriting everything - and they have neither the manpower nor the will. And all of this was well known: the people implementing Lisp, Smalltalk, and Self have known it for the last 20 years, together with a number of other techniques.
Posted Sep 6, 2010 23:13 UTC (Mon) by nix (subscriber, #2304)
since Python uses standard reference counting, it is not efficient even in the short term view, because copying a pointer to the stack causes a heap mutation, even to pass a parameter to a procedure. That's why trying to support multithreading gave a 2x slowdown.
Posted Sep 10, 2010 21:07 UTC (Fri) by Blaisorblade (guest, #25465)
Moreover, there was work (from Antoine Pitrou, IIRC) to change the policies of the GIL - it was/is released and reacquired every 100 opcodes (which might be ridiculously small or too large, depending on the opcodes), while Pitrou wanted some saner scheduling.
Releasing it every timeslice (i.e. ~80-100 ms, IIRC) would probably help with the problems in that presentation - I should have another look to see if this makes sense. A simple user-level scheduler for GIL acquisition might be needed in the worst case, but hopefully not.
Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds