The Managed Runtime Initiative

The Managed Runtime Initiative

Posted Jun 17, 2010 23:26 UTC (Thu) by ncm (subscriber, #165)
In reply to: The Managed Runtime Initiative by giltene
Parent article: The Managed Runtime Initiative

"Java" and "GC" are bad words anywhere outside the walled garden. You won't change that by demanding fundamental changes in schedulers and kernel memory management in order to get tolerable performance that normal programs get without.



The Managed Runtime Initiative

Posted Jun 18, 2010 1:27 UTC (Fri) by giltene (guest, #67734) [Link]

It's a pretty big garden [Java, .NET, Python, Ruby, etc.]. It's more like the wide open space where most of the world actually is, and the walls seem to be around the rest of the [non-managed-runtime] world these days. Most enterprise-scale apps written in the last several years run on managed runtimes. Most modern frameworks and reusable code bases target them, and the vast majority of the world's programmers build software for them. Furthermore, apps running on managed runtimes tend to handle scales and complexities that nothing else tries to. Why? We don't really know. It's just reality.

It does have some pesky issues, like GC, but those can actually be solved. We've shown that. It's just that solving them [at least the one way we know of] does require fundamental changes to things like kernel memory management. Some of the core operations we are advocating make an actual 1,000,000x difference to how critical phases of GC behave, thereby making the difference between a practical, no-more-pauses GC world and the current one. [For some operation metrics numbers, see http://www.managedruntime.org/files/downloads/AzulVmemMet... ]

Does that make such fundamental changes worthwhile? We think so. Some may not.

The Managed Runtime Initiative

Posted Jun 18, 2010 16:19 UTC (Fri) by jzbiciak (subscriber, #5246) [Link]

As an outsider looking in, these batch virtual memory operations sound interesting in their own right, and not specifically related to Java and GC. It's just that your Java GC can make immediate use of them.

This seems like a similar transformation to providing a vector of IO operations, versus repeated calls to read() or write().
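
For reference, the vectored-I/O pattern that analogy points at looks like this in standard POSIX C: one writev() call submits many buffers in a single kernel crossing, where a loop of write() calls would pay the syscall cost per buffer. (This is plain existing API, shown only to make the batching analogy concrete.)

    /* readv/writev: one syscall carries a whole vector of buffers.
     * Batched VM operations would amortize mapping changes the same
     * way this amortizes per-syscall I/O overhead. */
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        char a[] = "hello, ";
        char b[] = "world\n";
        struct iovec iov[2] = {
            { .iov_base = a, .iov_len = sizeof(a) - 1 },
            { .iov_base = b, .iov_len = sizeof(b) - 1 },
        };
        /* One kernel crossing for both buffers. */
        return writev(STDOUT_FILENO, iov, 2) < 0;
    }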

What other workloads might benefit from such batch-oriented VM? Poul-Henning Kamp had an article in ACM Queue recently that showed a 10x speedup on heap structures by avoiding VM hits. While not identical to the problem you're solving, I think it highlights that explicitly optimizing VM behavior, even on "solved problems," is a generally interesting space, with opportunities for huge improvements. After all, heaps are "optimal," right? Then why the 10x speedup? Finding how your work can be leveraged outside your Java GC environment will make it more attractive.

The Managed Runtime Initiative

Posted Jun 19, 2010 5:45 UTC (Sat) by giltene (guest, #67734) [Link]

Good point about relevance outside of GC, Java, and runtimes. There certainly may be uses in other workloads (in-memory DBs? DBs in general? Sparse-memory applications that rely on virtual memory tricks?). In general, current virtual memory implementations in almost all OSs assume that virtual memory manipulation is a relatively rare event. We have put forward an algorithm and an application built on rapidly changing mappings that could make a real difference to a vast array of applications, but they can't do so within the limitations of current virtual memory APIs and manipulation speeds.

We built the enhanced APIs to be generic [we think], without assumptions about the user-level code being a runtime, focusing instead on the needed functionality and semantics. Our GC code is all in user space (part of the OpenJDK-based code we put up along with the kernel mods); it just needs some very scalable and somewhat different virtual and physical memory manipulation semantics from the kernel.

So yes, there can certainly be other uses for batched virtual memory operations with extremely fast commit times, and for the other features: explicit-TLB-invalidate semantics (allowing a user process to determine when a TLB invalidate is required, instead of issuing one per page), scalable virtual memory manipulation (not done under a process-global lock), large page support that includes remapping capability, user-controlled shatter/unshatter transitions from large to small mappings while retaining large physical page layouts, etc.
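
To make the shape of such an interface concrete, here is a purely hypothetical sketch; none of these names exist in any kernel, and the actual APIs in the posted kernel mods may differ. It only illustrates the semantics described above: many mapping changes per call, under finer-grained locking, with the TLB invalidate deferred until the caller requests it.

    /* HYPOTHETICAL batched-remap interface -- illustration only. */
    #include <stddef.h>

    struct vmem_remap {
        void  *old_addr;  /* current virtual address of the range   */
        void  *new_addr;  /* where the same physical pages reappear */
        size_t len;       /* bytes, multiple of the page size       */
    };

    #define VMEM_DEFER_TLB_FLUSH 0x1   /* no per-call shootdown */

    /* Apply all remaps in one kernel crossing, without taking a
     * process-global lock (hypothetical declaration). */
    int vmem_batch_remap(const struct vmem_remap *ops, int nops,
                         unsigned int flags);

    /* The caller issues the single deferred TLB invalidate once it
     * knows no stale translation can still be used. */
    int vmem_tlb_invalidate(void);

    /* Example use by a relocating GC: remap a batch of evacuated
     * regions, then invalidate once instead of once per page:
     *
     *   vmem_batch_remap(ops, n, VMEM_DEFER_TLB_FLUSH);
     *   vmem_tlb_invalidate();
     */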

The Managed Runtime Initiative

Posted Jun 21, 2010 20:56 UTC (Mon) by nix (subscriber, #2304) [Link]

A good few of these features sound like they could be useful for user-mode-linux as well (and anything else that has to manipulate a *lot* of mappings).

The Managed Runtime Initiative

Posted Jun 18, 2010 16:28 UTC (Fri) by k8to (subscriber, #15413) [Link]

While managed runtimes are, as you say, a large set, I don't hear about these sorts of nasty stalls in e.g. Python or Ruby very much. They definitely come up in Java all the time. I would get them when I developed on Mono back in 2005, but assumed that was immaturity (they were using a relatively naive Boehm collector at the time).

I always assumed the pervasive and unpleasant nature of this problem was simply Sun's inability to make Java a real platform beyond a relatively narrow set of environments. It seemed like the other kids all came up with solutions that worked for them.

The Managed Runtime Initiative

Posted Jun 18, 2010 22:19 UTC (Fri) by foom (subscriber, #14868) [Link]

Python certainly does have GC stalls.

But Python already has such poor performance, and in particular cannot run multiple threads at once, that GC stalls just don't come up that much. People don't use Python for the kinds of applications that demand high performance with high allocation rates on 12 threads at once...

The Managed Runtime Initiative

Posted Jun 19, 2010 8:46 UTC (Sat) by vachi (guest, #67512) [Link]

I guess the JVM is one of the most-used VMs in large and complex enterprise apps. That's why most complaints are heard from that direction. By the way, I don't hear complaints about the scheduler performance of DOS nowadays. DOS must have a pretty good scheduler compared to Linux.

The Managed Runtime Initiative

Posted Jun 19, 2010 19:19 UTC (Sat) by k8to (subscriber, #15413) [Link]

Nah, the behavior is quite different. The Sun JVM seems very good at keeping a large number of objects in memory and stalling reclamation of their space until it can do a large number at once, causing stalls and wasting memory. Other systems, like Python, seem to reclaim objects much more incrementally, which might not be as efficient in a long-term view.

Meanwhile, we can deconstruct your analogy by pointing out that DOS does not have a scheduler.

The Managed Runtime Initiative

Posted Jun 19, 2010 19:51 UTC (Sat) by foom (subscriber, #14868) [Link]

The Sun JVM has many different GC options. If you want smaller pauses at the cost of some throughput, you can use the collector designed for doing just that: the concurrent collector.

Use option -XX:+UseConcMarkSweepGC.

For more information, see this article:
http://java.sun.com/javase/technologies/hotspot/gc/gc_tun...

The Managed Runtime Initiative

Posted Jun 19, 2010 23:37 UTC (Sat) by giltene (guest, #67734) [Link]

All current commercial GCs on large-scale runtimes include code that compacts the object heap in a stop-the-world operation. Compaction is unavoidable in long-lived applications that use variable-sized objects (e.g. XML data). Most GC setups can be tuned in various ways to delay this inevitable compaction, but none can avoid it. Think of it as a ticking time bomb: do enough work, and you'll need to defragment the heap because your new object can't fit in any of the empty spaces you've been tracking.
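
A toy illustration of that time bomb: total free space can be plentiful while no single hole fits the request, and at that point only compaction (moving live objects so the holes coalesce) can help. This is purely illustrative and not how any JVM allocator is actually written.

    /* Ten scattered 1-unit holes: 10 units free in total, yet a
     * 4-unit allocation fails because no hole is contiguous. A
     * compacting GC moves live objects together so the free space
     * becomes one large hole. */
    #include <stdio.h>

    int main(void)
    {
        int holes[10] = {1,1,1,1,1,1,1,1,1,1};  /* fragmented heap */
        int want = 4, total = 0, largest = 0;

        for (int i = 0; i < 10; i++) {
            total += holes[i];
            if (holes[i] > largest)
                largest = holes[i];
        }
        printf("free: %d, largest hole: %d, request: %d -> %s\n",
               total, largest, want,
               largest >= want ? "fits" : "fails: compaction needed");
        return 0;
    }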

The duration of the stop-the-world pause in all current JVM GCs is generally linear in the amount of live data the heap contains (you have to scan all that data and fix all the pointers to the relocated objects). This means that the larger the heap, the larger the pause. Sun's CMS (the mostly-concurrent mark-sweep collector, the -XX:+UseConcMarkSweepGC mentioned above) will delay compaction as long as it can and track empty spaces in free lists, but it will eventually fall back on its compaction code and pause for about 2-4 seconds per live gigabyte on a modern x86-64 machine. This is why JVMs are generally not used with more than a few GB of data, except for batch apps (ones that can accept tens of seconds of complete pause). Since a 256GB server now costs less than $18K, there is a ~100x and growing gap between commodity server capacity and an individual runtime's ability to scale with acceptable response times.

The Pauseless GC algorithm and implementation put forward as part of the Managed Runtime Initiative changes all this. It compacts the heap and fixes up pointers concurrently, without having to stop the world to do so. As a result, response time is completely decoupled from memory size, and indirectly from allocation rate and throughput, breaking the 100x logjam.

The Managed Runtime Initiative

Posted Jun 20, 2010 17:19 UTC (Sun) by rilder (guest, #59804) [Link]

Looks interesting. However, I recently read about commits related to memory compaction and transparent huge pages; you may find these useful in your work.
Also, in your current implementation, are VMs treated differently with respect to memory management? There have also been commits related to VMs, for memory ballooning etc.

The Managed Runtime Initiative

Posted Jun 21, 2010 4:06 UTC (Mon) by giltene (guest, #67734) [Link]

Both transparent huge page support and compaction (without which transparent huge pages cannot reliably work) are good things. We may be able to merge some of our functionality with those over time. We are certainly large consumers of 2MB pages, so we see compaction as very important; with it, we may be able to provide transparent, reliable dynamic allocation of 2MB pages in our subsystem. Without compaction, pages must be explicitly reserved and taken away from the kernel up front, since fragmentation may make it impossible to get hold of them later.
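
For context, the up-front reservation user space has to resort to today looks roughly like this: grabbing 2MB pages explicitly at startup via mmap() with MAP_HUGETLB (available in recent kernels; huge pages must be configured first). With reliable kernel compaction, such pages could instead be handed out on demand.

    /* Reserve 2MB huge pages before fragmentation sets in.
     * Requires configured huge pages, e.g.:
     *   echo 64 > /proc/sys/vm/nr_hugepages
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64UL * 2 * 1024 * 1024;     /* 64 x 2MB pages */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                       -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }
        printf("reserved %zu bytes of 2MB pages at %p\n", len, p);
        return 0;
    }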

Due to various needs, though, our current implementation uses a separate physical memory management subsystem. The main, overriding need for this is the avoidance of unnecessary TLB invalidates when physical pages are unmapped and remapped [a bit later] within the same process; this is the most frequent pattern across the pauseless collector and the application mutators, and addressing it pretty much requires process-local free lists, which we use.

WRT VMs (e.g. running this kernel as a guest on top of KVM or VMware), we see those as common targets. Because the rate of virtual memory mapping changes in our system is quite substantial (well over 1000x the typical OS and application load), we care a lot about the cost of those manipulations in virtualized environments. Luckily, the EPT (Intel) and NPT/RVI (AMD) features in all modern x86-64 machines practically eliminate this cost, taking the hypervisor out of the business of intercepting, tracking, and applying guest-virtual to guest-physical mappings. Without those HW-assisted features, applying changes at the rate we do would certainly "hurt" on virtualized systems.

We'll also be able to play with memory ballooning, and we've got some very good uses for it, but the current implementation we've posted doesn't use it. When we do, we'll be looking to deflate balloons at sustained rates of several GB per second, which will probably put some serious stress on current host implementations. This is one of the items at the hypervisor level that we're going to be playing with as part of the Managed Runtime Initiative [large pages and high sustained-rate ballooning]. At first glance, KVM's handling of ballooning seems like it can easily be extended to accommodate what we'll need.

The Managed Runtime Initiative

Posted Jun 23, 2010 17:06 UTC (Wed) by jonabbey (guest, #2736) [Link]

Have you characterized the performance of Sun's G1 collector that they will be using in Java 7?

The Managed Runtime Initiative

Posted Jun 27, 2010 0:00 UTC (Sun) by giltene (guest, #67734) [Link]

We've obviously been following G1's progress and the algorithm it uses. It's hard to "characterize the actual performance" of G1 right now given its current experimental state [simple experiments right now show longer pauses than HotSpot's production CMS collector]. However, analyzing G1's basic mechanisms is pretty straightforward. Here is how they can be classified:

1) A concurrent, multi-pass OldGen marker
2) An incremental, stop-the-world OldGen compactor
3) A stop-the-world NewGen copying collector

It's the stop-the-world nature of (2) and (3) above that determines pause behavior for large-heap, high-throughput applications. When object copying and/or compaction are done in a stop-the-world operation, moving even a single object in memory requires that *all* references to that object in the heap be tracked down and corrected before the program is allowed to continue execution. This takes an amount of time bounded only by the size of the heap (i.e. it takes a multi-seconds-per-live-GB pause to find all those references). While a lot of cool code and tuning goes into spacing such long pauses out in time, all stop-the-world-compaction collectors (including G1) will periodically fall back on their full stop-the-world compaction code when dealing with complex heaps. [The collector's elaborate code for doing that operation is there for a reason.] For applications that cannot accept the occasional 50-second pause during normal operation, this means heap size is still limited by pause time.

In contrast, the GPGC algorithm used in both Azul's Zing VM and the Managed Runtime Initiative contribution (and the default, production collector on Azul Vega systems since 2005) can be classified as:
1) A concurrent, guaranteed-single-pass OldGen marker
2) A concurrent OldGen compactor
3) A concurrent, guaranteed-single-pass NewGen marker
4) A concurrent NewGen compactor

Concurrent [as opposed to incremental, stop-the-world] compaction is the key here. It completely decouples application pause behavior from heap size and allocation rates, and allows us to deal with the amounts of memory prevalent in today's servers without those nasty many-second pauses. GPGC simply doesn't have any form of stop-the-world compaction code in it [as a fallback or otherwise]. It doesn't need it. In GPGC, objects are relocated without stopping the application while all references to them are tracked down and fixed. Instead, reference remapping is a concurrent operation supported by self-healing read barriers.
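
For readers unfamiliar with the mechanism, here is what a self-healing read barrier does, boiled down to a compilable C sketch. The shape (check each loaded reference, consult a forwarding table for objects in pages being relocated, and write the corrected pointer back to the slot it was loaded from) follows the published Pauseless GC descriptions; the names here are invented for illustration, this is not Azul's actual code, and the real thing uses page protections and atomic updates.

    /* Self-healing load barrier sketch: a slot that once held a
     * stale reference is fixed in place the first time it is
     * loaded, so the same slot never triggers the barrier again. */
    typedef struct object object_t;

    /* Hypothetical stand-ins for GC state: */
    extern int       gc_is_relocating(object_t *ref);     /* from-space? */
    extern object_t *gc_forwarded_address(object_t *ref); /* new copy   */

    static inline object_t *load_ref(object_t **slot)
    {
        object_t *ref = *slot;
        if (ref && gc_is_relocating(ref)) {
            ref = gc_forwarded_address(ref); /* copied concurrently */
            *slot = ref;  /* self-heal: real code uses an atomic CAS */
        }
        return ref;
    }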

It's the software pipeline used to support this concurrent compaction (and guaranteed-single-pass marking) that needs the new virtual memory operations we've put forward.

The Managed Runtime Initiative

Posted Sep 6, 2010 22:20 UTC (Mon) by Blaisorblade (guest, #25465) [Link]

Nah, the behavior is quite different. The Sun JVM seems very good at keeping a large number of objects in memory and stalling reclamation of their space until it can do a large number at once, causing stalls and wasting memory. Other systems, like Python, seem to reclaim objects much more incrementally, which might not be as efficient in a long-term view.

As Nix just said elsewhere, since Python uses standard reference counting, it is not efficient even in the short-term view, because copying a pointer to the stack causes a heap mutation, even just to pass a parameter to a procedure. That's why trying to support multithreading gave a 2x slowdown. Given that other portions of the code have been optimized since, I believe the slowdown nowadays would be bigger. And the slowdown you get with Java and a smaller heap is probably still not comparable to the one you get in Python (no less than 10x).
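
CPython's C API makes that heap mutation directly visible; this minimal embedded-interpreter program uses the real Py_INCREF/Py_REFCNT macros to show that merely taking another reference writes to the object's header.

    /* Every additional reference to a CPython object mutates its
     * ob_refcnt field in the heap -- which is exactly what happens
     * implicitly when a parameter is passed.
     * Build: cc demo.c $(python-config --cflags --ldflags) */
    #include <Python.h>

    int main(void)
    {
        Py_Initialize();

        PyObject *obj = PyLong_FromLong(12345);
        printf("refcount after creation: %ld\n",
               (long)Py_REFCNT(obj));

        Py_INCREF(obj);              /* a new reference = a heap write */
        printf("refcount after another reference: %ld\n",
               (long)Py_REFCNT(obj));

        Py_DECREF(obj);
        Py_DECREF(obj);
        Py_Finalize();
        return 0;
    }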

Python apps seem to have decent performance when they do little and the rest is written in C, but as soon as you try to actually do something in Python code, you lose. I don't get how the same community which prefers C to Java for performance reasons can even mention Python. I hope it's not the same people, at least.

There are many realtime GCs, and each of them is better than Python's. In particular, Cliff Click described the pauseless GC, with its amazing performance and small overhead, somewhere on his blog, which I recommend to those interested in the field (even if quite technical). However, he describes their special CPU; it seems they could port the collector to x86, and the code is in this release.

For Python, refcounting was just a bad choice in the beginning, and it's now impossible to get rid of it without rewriting everything - and they have neither the manpower nor the will. And all of this was well known; people implementing Lisp, Smalltalk, and Self have known it for the last 20 years, together with a number of other techniques.

The Managed Runtime Initiative

Posted Sep 6, 2010 23:13 UTC (Mon) by nix (subscriber, #2304) [Link]

since Python uses standard reference counting, it is not efficient even in the short-term view, because copying a pointer to the stack causes a heap mutation, even just to pass a parameter to a procedure. That's why trying to support multithreading gave a 2x slowdown.
Oh no, there are much more appalling reasons why Python's multithreading is awful. The GIL acquisition macros were never designed with multiple CPUs in mind: on a one-thread-of-execution machine (what we used to call 'one CPU'), it all works fine, but if there's more than one, they race with each other and often end up bouncing ownership back and forth, with both threads blocked for astounding periods of time. There's an awesome presentation on the subject, strongly recommended.

The Managed Runtime Initiative

Posted Sep 10, 2010 21:07 UTC (Fri) by Blaisorblade (guest, #25465) [Link]

Ah, that's right - I even knew of that presentation (and watched it again). I had kind of guessed that people were working on the problem (without any evidence other than "somebody must have noticed"), but I had no idea how to fix it.

Moreover, there was work (from Antoine Pitrou, IIRC) to change the policies of the GIL - it was/is released and reacquired every 100 opcodes (which might be ridiculously short or far too long, depending on the opcodes), while Pitrou wanted some saner scheduling.

Releasing it every timeslice (i.e. every ~80-100 ms, IIRC) would probably help with the problems shown in that presentation - I should have another look to know whether this makes sense. A simple user-level scheduler for GIL acquisition might be needed in the worst case, but hopefully not.

