LWN: Comments on "Tratt: Fast Enough VMs in Fast Enough Time" https://lwn.net/Articles/480033/ This is a special feed containing comments posted to the individual LWN article titled "Tratt: Fast Enough VMs in Fast Enough Time". en-us Mon, 22 Sep 2025 03:09:50 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/481991/ https://lwn.net/Articles/481991/ kleptog <div class="FormattedComment"> As long as you only have a few different paths it's not a big deal. The problem is when the number of paths explodes exponentially. The case I came across was the parsing of email headers, where each header is processed by a separate piece of code. Because practically every email has its headers in a different order, the tracing JIT ends up producing many almost-but-not-quite-identical paths, leading to memory bloat. We ended up dropping PyPy in one project because the memory per unit of work was such that we could run multiple copies of the CPython code in the space of one PyPy copy and come out ahead.<br> <p> I don't think this is an insurmountable problem, but it's certainly a problem. Of course, if the PyPy team licks it, then it will be fixed for every JIT based on PyPy. That's the really cool part :).<br> </div> Thu, 16 Feb 2012 20:18:44 +0000
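The failure mode kleptog describes is easy to reproduce in miniature. Below is a minimal sketch of an order-dependent dispatch loop of the kind that defeats a tracing JIT; the handler names and structure are hypothetical illustrations, not kleptog's actual code.

<pre>
# A toy dispatch loop of the kind that makes a tracing JIT unhappy.
# Handler names and structure are hypothetical, for illustration only.

def handle_from(value): return ("from", value.lower())
def handle_to(value): return ("to", value.lower())
def handle_subject(value): return ("subject", value.strip())
def handle_date(value): return ("date", value)

HANDLERS = {
    "From": handle_from,
    "To": handle_to,
    "Subject": handle_subject,
    "Date": handle_date,
}

def process_message(headers):
    # A tracing JIT records the *concrete* sequence of handlers executed.
    # Two messages with the same headers in a different order take
    # different paths through this loop, so each ordering can become a
    # separate trace: with N distinct headers, up to N! distinct traces.
    result = []
    for name, value in headers:
        result.append(HANDLERS[name](value))
    return result

process_message([("From", "a@example.com"), ("Subject", "hi")])
process_message([("Subject", "hi"), ("From", "a@example.com")])  # a new path
</pre>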
Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480790/ https://lwn.net/Articles/480790/ daglwn <div class="FormattedComment"> I suppose I can see an argument for calling the P4 trace cache a JIT because it translates to uops. But it's a poor idea, because a certain sequence of several branches having the same outcome over and over is fairly unlikely unless you are very lucky.<br> <p> A traditional JIT is much smarter about choosing what to optimize, mostly because static analysis can usually guess pretty well which variables will contain the same values repeatedly, even though it may not know exactly what those values are.<br> <p> A trace cache has to guess which branches are going to be predictable AND what the common values will be. A software JIT only needs to guess which variables will have predictable values.<br> </div> Sat, 11 Feb 2012 16:58:55 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480789/ https://lwn.net/Articles/480789/ daglwn <div class="FormattedComment"> This is essentially what HP's Dynamo project from the '90s did. The static compiler created PA RISC code (in this case it was optimized statically), and the Dynamo runtime took that code and translated it to PA RISC, optimizing it in the process given known runtime values. They saw good speedups.<br> </div> Sat, 11 Feb 2012 16:50:42 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480752/ https://lwn.net/Articles/480752/ khim <blockquote><font class="QuotedText">It wasn't even a JIT.</font></blockquote> <p>Why not? A <a href="http://en.wikipedia.org/wiki/Just-in-time_compilation">JIT</a> is supposed to compile bytecode to native code - and this is exactly what all contemporary CPUs are doing. Sure, their design is kind of unusual: where traditional JITs compile from an easy-to-compile bytecode to a predetermined native code, the JITs in our CPUs compile from an ad-hoc, predetermined bytecode to flexible native code - but is it such a big difference?</p> <blockquote><font class="QuotedText">A closer analogue to a JIT would be a hardware value predictor. Those never really took off either because predicting anything other than 0 is pretty tough.</font></blockquote> <p>This is true for software JITs as well. They, too, usually try to predict the outcomes of branches, etc. In fact, the designs of hardware JITs and software JITs are surprisingly similar. The biggest difference, of course, lies in the fact that a software JIT has gobs of memory and can cache the results of its work pretty aggressively, while a hardware JIT only has tiny on-chip memory - but on the other hand, a hardware JIT uses resources not directly accessible to the program itself, so it does not interfere with the program's execution.</p> Sat, 11 Feb 2012 13:02:37 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480748/ https://lwn.net/Articles/480748/ khim <blockquote><font class="QuotedText">This code could not be generated at static compile time because the program input is not available then. Even if you can predict a static path at static compile time (doubtful) you still don't have the actual data values to take advantage of.</font></blockquote> <p>You have a point. The P4 was a purely tracing JIT, and contemporary CPUs dropped that approach because it's hard to combine both in a single chip.</p> <p>As the P4 story shows, a pure tracing JIT is a disaster and will probably always be a disaster. But a <b>combination</b> of a regular JIT with a tracing JIT may sometimes make sense, as was already <a href="http://lwn.net/Articles/480727/">pointed out</a>. The question is: how often will it make sense?</p> <blockquote><font class="QuotedText">DEC's FX!32 and IBM's DAISY were similar projects.</font></blockquote> <p>And here you are missing the point again. DAISY, Dynamo, FX!32 and IA-32 EL all played with architectures which explicitly tried to switch from a JIT implemented in hardware to an "intelligent compiler". This idea flew like a lead balloon. In the vain hope of saving it they tried software JITs - but that did not work either.</p> <blockquote><font class="QuotedText">They weren't tracing JITs but they could just as well have been.</font></blockquote> <p>I'm not so sure. The <a href="http://blog.mozilla.com/nnethercote/2011/05/31/you-lose-more-when-slow-than-you-gain-when-fast/">general problem of tracing JITs</a> (which killed NetBurst, too) would probably be fatal for these projects.</p> <p>Now, the use of a tracing JIT as a "superbooster" for rare cases may be interesting, but these projects never went that far.</p> <p>P.S. Actually, the architecture of contemporary CPUs makes the work of JIT creators <b>very</b> frustrating. It's easy to show a real-world case where a JIT wins over a pure static compiler. But since contemporary CPUs always use a JIT, we don't have a "JIT vs compiler" competition. We have a "“JIT+compiler” vs “JIT+JIT”" competition - and <b>this</b> combination rarely makes sense. Especially when the two stacked JITs are created by different teams which don't cooperate at all.</p> Sat, 11 Feb 2012 12:44:50 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480729/ https://lwn.net/Articles/480729/ flewellyn <div class="FormattedComment"> Also, I kept misreading what you said as "Futurama Projections". Made me wonder what Fry and Bender had to do with partial evaluation.<br> </div> Sat, 11 Feb 2012 03:54:24 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480727/ https://lwn.net/Articles/480727/ flewellyn <div class="FormattedComment"> Fascinating stuff. As far as the efficiency issues of tracing JITs go, I've been pondering the idea, and I wonder if it might be useful to have a combination of "method JIT" and tracing?<br> <p> To wit, the system starts by creating compiled (but not necessarily highly optimized) versions of the functions, and then traces their execution and creates "trace-optimized" versions for common paths. This way, when the traced condition fails, you don't have to drop down to interpretation: you still have compiled code. It's just not as fast as the traced path.<br> <p> I dunno, too much work? Difficult to automate?<br> </div> Sat, 11 Feb 2012 03:53:37 +0000
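What flewellyn sketches is essentially tiered compilation with a trace tier above a baseline tier. A minimal sketch of the control flow follows; all names are hypothetical, and no claim is made about any real VM's internals.

<pre>
# Hypothetical sketch of a baseline-compiled function with an optional
# trace-optimized fast path; a guard failure falls back to the baseline
# *compiled* code, never all the way down to the interpreter.

TRACE_THRESHOLD = 1000

class Function:
    def __init__(self, baseline_code):
        self.baseline = baseline_code   # always available, mildly optimized
        self.trace = None               # trace-optimized version, if any
        self.hot_count = 0

    def call(self, args):
        if self.trace is not None:
            ok, result = self.trace(args)   # the trace checks its own guards
            if ok:
                return result
            # Guard failed: fall through to the baseline compiled code.
        self.hot_count += 1
        if self.hot_count == TRACE_THRESHOLD:
            self.trace = record_and_compile_trace(self.baseline, args)
        return self.baseline(args)

def record_and_compile_trace(baseline, sample_args):
    # Stand-in for the real work: record the path `baseline` takes on
    # sample_args and emit a specialized version guarded on that path.
    def trace(args):
        if args == sample_args:     # toy guard: exact-match only
            return True, baseline(args)
        return False, None
    return trace
</pre>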
Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480708/ https://lwn.net/Articles/480708/ daglwn <div class="FormattedComment"> The trace cache was not a tracing JIT in any sense. It wasn't even a JIT. It simply stored pre-decoded micro-ops as a trace (in the sense of branches going a certain way) in the instruction cache. It did not attempt to do the kind of semantic optimization a JIT would do. It simply elided branches. If the branch predictor generated a sequence of predictions that mapped to a previously-seen trace, the trace cache would feed the micro-ops directly to the core.<br> <p> The trace cache failed because predicting multiple branches ahead of time is a failed exercise. You'll end up backtracking and flushing the pipeline a whole lot. Hence the power drain.<br> <p> A closer analogue to a JIT would be a hardware value predictor. Those never really took off either, because predicting anything other than 0 is pretty tough.<br> </div> Fri, 10 Feb 2012 23:41:46 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480703/ https://lwn.net/Articles/480703/ daglwn <div class="FormattedComment"> That's sometimes true, but not usually.<br> <p> HP's Dynamo project had great success JITing PA RISC code to PA RISC code. What's the point? They had access to values at runtime that they did not have at static compile time. They were able to delete branches and optimize very large basic blocks such that the resulting code ran fast enough to more than overcome the overhead of the translation.<br> <p> This code could not be generated at static compile time because the program input is not available then. Even if you can predict a static path at static compile time (doubtful), you still don't have the actual data values to take advantage of.<br> <p> DEC's FX!32 and IBM's DAISY were similar projects.<br> <p> They weren't tracing JITs, but they could just as well have been.<br> </div> Fri, 10 Feb 2012 23:35:53 +0000
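The "known runtime values" advantage is easy to show in a few lines: once a value that only arrives at run time is pinned, branches on it can be deleted and the remainder optimized as straight-line code. The Python sketch below is an analogy only; Dynamo itself worked on PA RISC machine code, not source, and the function names here are invented.

<pre>
# Specializing a function against a value that is only known at run time.
# Once `mode` is pinned, the per-pixel branch on it disappears entirely.

def blend_generic(pixels, mode, alpha):
    out = []
    for p in pixels:
        if mode == "opaque":        # branch re-tested on every pixel
            out.append(p)
        elif mode == "blend":
            out.append(int(p * alpha))
        else:
            out.append(0)
    return out

def specialize_for(mode, alpha):
    # A static compiler cannot do this: `mode` and `alpha` only arrive
    # at run time.  A dynamic optimizer can, and the branch is gone.
    if mode == "opaque":
        return lambda pixels: list(pixels)
    if mode == "blend":
        return lambda pixels: [int(p * alpha) for p in pixels]
    return lambda pixels: [0] * len(pixels)

fast_blend = specialize_for("blend", 0.5)   # branch resolved once, here
fast_blend([10, 20, 30])                    # straight-line loop from now on
</pre>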
Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480697/ https://lwn.net/Articles/480697/ daglwn <p>It is very cool stuff. It's a form of partial evaluation and the Futamura projections.</p> <a href="http://en.wikipedia.org/wiki/Partial_evaluation">http://en.wikipedia.org/wiki/Partial_evaluation</a> <p>RPython appears to essentially be an implementation of some of the Futamura projections.</p> <p>So in RPython terms:</p> <ul> <li>The tracing JIT performs/is projection #1; it creates the native code by specializing the interpreter against the source program, capturing the resulting instruction stream -- this is the "tracing" part.</li> <li>The RPython compiler/implementation performs/is projection #2; given an interpreter, it creates a compiler (a tracing JIT) that can handle any input program.</li> </ul> <p>Whether RPython really specializes itself against the interpreter to create the tracing JIT determines whether it actually uses Futamura projection #2 to achieve its ends. My guess is that it's more "hard-coded" than that and is not really using partial evaluation to do the work. I am curious to know, however.</p> <p>This isn't to take anything away from the PyPy people. I don't know of any other such project as close to mainstream as PyPy is. That in itself is a monumental achievement. Taking things from the lab to production use is not easy.</p> <p>Incidentally, RPython itself could have been created by applying the third projection:</p> <ul> <li>Specializing the specializer for itself (as applied in #2), yielding a tool that can convert any interpreter into an equivalent compiler.</li> </ul> <p>Obviously I don't know if they built RPython this way (I suspect not), but this kind of tool has been known for some time. Theoretically one can take the CPython interpreter, write a couple of specializers and spit out a program that is essentially RPython.</p> <p>There are also musings of a fourth projection, for example:</p> <p><a href="http://dl.acm.org/citation.cfm?id=1480954">http://dl.acm.org/citation.cfm?id=1480954</a></p> Fri, 10 Feb 2012 23:29:33 +0000
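The first projection is concrete enough to sketch: specializing a tiny interpreter against a fixed program yields a residual function that is, in effect, the compiled program. The sketch below does the "mix" step by hand for a toy stack language; it illustrates the idea and is not how RPython implements it.

<pre>
# First Futamura projection in miniature: mix(interpreter, program)
# yields a residual program.  RPython's tracing JIT gets a similar
# effect by recording the interpreter's actions on a concrete program.

def interpret(program, x):
    stack = [x]
    for op, arg in program:
        if op == "push":
            stack.append(arg)
        elif op == "add":
            stack.append(stack.pop() + stack.pop())
        elif op == "mul":
            stack.append(stack.pop() * stack.pop())
    return stack.pop()

def specialize(program):
    # Partially evaluate `interpret` with `program` known: the dispatch
    # loop is unrolled away, leaving a fixed sequence of steps for this
    # one program -- the "residual program".
    ops = []
    for op, arg in program:
        if op == "push":
            ops.append(lambda stack, a=arg: stack.append(a))
        elif op == "add":
            ops.append(lambda stack: stack.append(stack.pop() + stack.pop()))
        elif op == "mul":
            ops.append(lambda stack: stack.append(stack.pop() * stack.pop()))
    def compiled(x):
        stack = [x]
        for step in ops:
            step(stack)
        return stack.pop()
    return compiled

TWO_X_PLUS_ONE = [("push", 2), ("mul", None), ("push", 1), ("add", None)]
f = specialize(TWO_X_PLUS_ONE)
assert f(10) == interpret(TWO_X_PLUS_ONE, 10) == 21
</pre>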
Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480620/ https://lwn.net/Articles/480620/ khim Bingo. The P4 was supposed to be superfast because it used a tracing JIT (while regular x86 CPUs have used a normal-style JIT for about the last 20 years). And yes, in some narrow cases the tracing JIT was better. But in general it was a disaster: the CPUs ran hot, drew a lot of power and performed worse than CPUs with a regular JIT. Fri, 10 Feb 2012 17:07:57 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480619/ https://lwn.net/Articles/480619/ khim <blockquote><font class="QuotedText">Does that apply to tracing JIT more than JIT compilers generally?</font></blockquote> <p>This is a problem with any JIT, but it's worse for a tracing one.</p> <blockquote><font class="QuotedText">Surely the main use for JIT compilers is situations when static compiling is not practical, like when the precise target platform is not known at the time when the code is circulated, or when the person running the code wants more control over it than they get from a precompiled binary.</font></blockquote> <p>The main use is for languages which don't support static compilation.</p> <blockquote><font class="QuotedText">Think web applications - do you really want to precompile your application for every CPU it might run on, and would people really want to run your precompiled binary locally without good reason to trust you?</font></blockquote> <p>Well, we all use JITs every time we run a program (our CPUs don't execute x86 code directly; they translate it to RISC-like µops on the fly), thus the question becomes: "do we need <b>yet another</b> JIT?" - and sometimes the answer is yes, but more often than not it's no, and people use managed code because it has high buzzword-compliance value.</p> Fri, 10 Feb 2012 17:05:12 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480563/ https://lwn.net/Articles/480563/ adobriyan <div class="FormattedComment"> I think the P4 trace cache was meant.<br> </div> Fri, 10 Feb 2012 14:30:14 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480399/ https://lwn.net/Articles/480399/ jensend <div class="FormattedComment"> What the devil does NetBurst have to do with tracing JITs? I think you must have posted the wrong link. CPU branch prediction is a very, very different problem.<br> </div> Thu, 09 Feb 2012 20:35:03 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480379/ https://lwn.net/Articles/480379/ mjthayer <div class="FormattedComment"> <font class="QuotedText">&gt; The problem here is that in the power-constrained world a tracing JIT does not look like a huge win: sure, it can optimize something pretty thoroughly, but if your code paths are truly static then a static compiler will be even better, and if they change relatively often then it wastes power on optimizations which are never used.</font><br> <p> Does that apply to tracing JITs more than JIT compilers generally? Surely the main use for JIT compilers is situations when static compiling is not practical, like when the precise target platform is not known at the time when the code is circulated, or when the person running the code wants more control over it than they get from a precompiled binary. Think web applications - do you really want to precompile your application for every CPU it might run on, and would people really want to run your precompiled binary locally without good reason to trust you? Or for that matter, think processor emulation.<br> </div> Thu, 09 Feb 2012 20:07:04 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480270/ https://lwn.net/Articles/480270/ khim <p>Well, if you think about it: <a href="http://www.silentpcreview.com/forums/viewtopic.php?t=22042">the most expensive tracing JIT</a> ended up being a colossal failure and was eventually scrapped, so I'm pretty sure tracing JITs still have a long, long way to go.</p> <p>And I'm not sure they can be fixed in principle. The problem here is that in the power-constrained world a tracing JIT does not look like a huge win: sure, it can optimize something pretty thoroughly, but if your code paths are truly static then a static compiler will be even better, and if they change relatively often then it wastes power on optimizations which are never used.</p> Thu, 09 Feb 2012 16:38:14 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480214/ https://lwn.net/Articles/480214/ farnz <p>I think I'd argue that the thing that's innovative in RPython is not the tracing JIT compiler (you could extend LLVM to have a tracing JIT compiler/interpreter for the LLVM IR, after all - it's "just" a matter of engineering time), but the application of a tracing JIT compiler to the problem of making writing a JIT compiler easy. <p>The leap involved is immense - most programmers are capable of writing a simple interpreter for a language of their own design. Not as many programmers are capable of writing a compiler from their language to an IR. RPython says "if you can write your interpreter in this subset of Python, and add these annotations, you get a JIT for free." As it happens, they've chosen a subset of Python that's reasonably easy to live with, and the annotations are ones that should be natural to anyone writing an interpreter. <p>To give you an idea of how different this is, imagine writing a modern implementation of <a rel="nofollow" href="http://www.bbcbasic.co.uk/bbcbasic.html">BBC BASIC</a>, absent the commands beginning "*", and the CALL and USR facilities for leaving BASIC to call arbitrary machine code. You'd probably tokenize the language the way the BBC Micro did in the 80s, then directly interpret the tokenized "bytecode" (which has a one-to-one correspondence to the original source). <p>Now try to JIT this - to use LLVM, you need to rework your interpreter to translate to LLVM IR (a non-trivial task), then you check it works by using the LLVM interpreter to run it, <b>then</b> you add the JIT bits. In contrast, to use RPython, you rewrite your interpreter in RPython (trivial if it was already written in Python, not that hard if it was written in something else), and check it still works as an interpreter. You then annotate a few key points in your interpreter, and it's done; you have a JIT compiler. Thu, 09 Feb 2012 11:25:48 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480213/ https://lwn.net/Articles/480213/ mjthayer <div class="FormattedComment"> From the article:<br> <font class="QuotedText">&gt; Tracing JITs are relatively new and have some limitations, at least based on what we currently know. Mozilla, for example, removed their tracing JIT a few months back, because while it's sometimes blazingly fast, it's sometimes rather slow. This is due to a tracing JIT optimising a single code-path at a time [...] Code which tends to take the same path time after time benefits hugely from tracing; code which tends to branch unpredictably can take considerable time to derive noticeable benefits from the JIT.</font><br> <p> Surely it wouldn't be a huge leap from what they have already achieved to keep a trace around for generated code, and to produce a merged trace (and regenerate the code) if a second path becomes hot enough. I am assuming, of course, that this has been shown by measurement and not just guesswork to be a problem (the author does seem to be serious about what he is doing).<br> </div> Thu, 09 Feb 2012 10:59:25 +0000
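For what it's worth, mjthayer's suggestion resembles what the PyPy project calls "bridges": when a guard in an existing trace fails often enough, a new trace is recorded from the failure point and attached there, rather than discarding the original. A minimal sketch of the bookkeeping follows, with hypothetical names rather than PyPy's actual internals.

<pre>
# Sketch of guard-failure bookkeeping that grows a second hot path off
# an existing trace ("bridge"-style).  All names are hypothetical.

BRIDGE_THRESHOLD = 200

class Guard:
    def __init__(self, check):
        self.check = check          # predicate the trace assumed true
        self.failures = 0
        self.bridge = None          # trace compiled from the failure point

class Trace:
    def __init__(self, fast_path, guard, slow_path):
        self.fast_path = fast_path
        self.guard = guard
        self.slow_path = slow_path  # interpreter (or baseline) fallback

    def run(self, state):
        if self.guard.check(state):
            return self.fast_path(state)
        self.guard.failures += 1
        if self.guard.bridge is not None:
            return self.guard.bridge(state)
        if self.guard.failures >= BRIDGE_THRESHOLD:
            # The second path is hot: compile it and hang it off the
            # guard instead of throwing the whole trace away.
            self.guard.bridge = compile_bridge(self.slow_path, state)
        return self.slow_path(state)

def compile_bridge(slow_path, sample_state):
    # Stand-in for recording and compiling a trace of the failing path,
    # starting from the state observed at the guard (sample_state).
    return lambda state: slow_path(state)
</pre>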
Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480135/ https://lwn.net/Articles/480135/ ati <div class="FormattedComment"> Another key difference between RPython and LLVM or libjit is that the former is a tracing, optimizing JIT, while the latter are one-time JIT compilers.<br> <p> In practice this means that in the case of RPython, the JIT compiler will dynamically optimize hot code areas that might span several functions, based on runtime profiling information (with the aid of annotations from the VM), effectively emitting binary code for an entire call chain. The binary code further includes some code (guards) for reverting back to the interpreter.<br> <p> In contrast, LLVM JIT compilation is explicitly requested by the VM, and works by compiling a function at a time to binary code that has to be managed by the VM runtime. Obviously it is possible to build a tracing VM on top of LLVM, but it would require some effort.<br> </div> Wed, 08 Feb 2012 22:21:40 +0000

Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480110/ https://lwn.net/Articles/480110/ farnz <p>I've no experience of libjit, so can't comment there at all, but LLVM's JIT is a "traditional" JIT from the author's perspective. <p>The key difference is that to use the LLVM JIT, you must write your system to output LLVM IR as its bytecode. You can then use any of the LLVM interpreter, LLVM JIT, or LLVM compiler technologies to run the resulting code. <p>In the RPython setup, you write an interpreter for a bytecode of your own design. You annotate a few key points in the interpreter to enable the JIT to work. The RPython code does the rest for you; it works out how to JIT-compile your bytecode based on tracing your interpreter as it interprets the bytecode. <p>The big advantage is that you can design the bytecode around the needs of your language - something like LLVM IR is designed around the needs that the LLVM project anticipated, which may or may not match your requirements. Wed, 08 Feb 2012 20:49:14 +0000
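To make "annotate a few key points" concrete, here is roughly what the hints look like for a toy bytecode interpreter, following the shape of the published PyPy/RPython tutorials. The module path and driver arguments have varied between releases (older code imported from pypy.rlib.jit), so treat this as a sketch rather than a current API reference.

<pre>
# Sketch of an RPython interpreter with JIT hints, after the published
# PyPy tutorials.  Module paths and signatures varied across releases;
# this is illustrative, not a current API reference.
from rpython.rlib.jit import JitDriver

# "Green" variables identify a position in the user program (constant
# for a given loop); "red" variables are the changing runtime state.
jitdriver = JitDriver(greens=['pc', 'program'], reds=['acc'])

def interpret(program, acc):
    pc = 0
    while pc < len(program):
        # The merge point tells the tracing JIT where loops in the *user*
        # program can close, i.e. what to trace and when to stop tracing.
        jitdriver.jit_merge_point(pc=pc, program=program, acc=acc)
        op = program[pc]
        if op == 'i':       # increment the accumulator
            acc += 1
        elif op == 'd':     # decrement the accumulator
            acc -= 1
        pc += 1
    return acc
</pre>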
Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480101/ https://lwn.net/Articles/480101/ flewellyn <div class="FormattedComment"> Wow, this is impressive. I had no idea about tracing JITs in general, and in particular I didn't know about RPython's ability to create one for a given interpreter program. Amazing stuff.<br> <p> Even if PyPy ultimately never takes off as the default Python implementation and never replaces CPython in popularity, the project will still be a great success simply because it introduced this automatic JIT creation technology. Even if nobody uses PyPy in production, if people use RPython or something like it to create new language-specific VMs with automatic tracing JITs built in, this is a HUGE advancement of the state of the art.<br> </div> Wed, 08 Feb 2012 20:12:30 +0000
Tratt: Fast Enough VMs in Fast Enough Time https://lwn.net/Articles/480096/ https://lwn.net/Articles/480096/ atai <div class="FormattedComment"> This is interesting. Just curious: how would the RPython approach compare to, say, libjit and LLVM? The author did not seem to have considered those two alternatives.<br> </div> Wed, 08 Feb 2012 20:01:58 +0000