Tratt: Fast Enough VMs in Fast Enough Time
However, in addition to outputting optimised C code, RPython automatically creates a second representation of the user's program. Assuming RPython has been used to write a VM for language L, one gets not only a traditional interpreter, but also an optimising Just-In-Time (JIT) compiler for free. In other words, when a program written in L executes on an appropriately written RPython VM, hot loops (i.e. those which are executed frequently) are automatically turned into machine code and executed directly. This is RPython's unique selling point, as I'll now explain.
Posted Feb 8, 2012 20:01 UTC (Wed)
by atai (subscriber, #10977)
[Link] (3 responses)
How does this compare to simply using an existing JIT framework, such as LLVM's JIT or libjit?
Posted Feb 8, 2012 20:49 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (2 responses)
I've no experience of libjit, so can't comment there at all, but LLVM's JIT is a "traditional" JIT from the author's perspective.
The key difference is that to use the LLVM JIT, you must write your system to output LLVM IR as its bytecode. You can then use any of the LLVM interpreter, LLVM JIT, or LLVM compiler technologies to run the resulting code.
In the RPython setup, you write an interpreter for a bytecode of your own design. You annotate a few key points in the interpreter to enable the JIT to work. The RPython code does the rest for you; it works out how to JIT compile your bytecode based on tracing your interpreter as it interprets the bytecode.
The big advantage is that you can design the bytecode around the needs of your language - something like LLVM IR is designed around the needs the LLVM project anticipated, which may or may not match your requirements.
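For the curious, here is a minimal sketch of what those annotations look like. The toy bytecode and opcode names are invented for the example, but JitDriver, jit_merge_point and can_enter_jit are the actual hooks RPython exposes in rpython.rlib.jit (in the PyPy of this era the module lived under pypy.rlib.jit):

    from rpython.rlib.jit import JitDriver

    # Tell the JIT which variables identify a position in the user's program
    # ("greens") and which are ordinary mutable state ("reds").
    jitdriver = JitDriver(greens=['pc', 'program'], reds=['stack'])

    # Invented opcodes for a toy stack-based bytecode.
    PUSH1, ADD, JUMP_IF_NONZERO, HALT = range(4)

    def interpret(program):
        pc = 0
        stack = []
        while True:
            # Every pass through the dispatch loop announces itself here.
            jitdriver.jit_merge_point(pc=pc, program=program, stack=stack)
            opcode = program[pc]
            if opcode == PUSH1:
                stack.append(1)
                pc += 1
            elif opcode == ADD:
                right = stack.pop()
                stack.append(stack.pop() + right)
                pc += 1
            elif opcode == JUMP_IF_NONZERO:
                target = program[pc + 1]
                if stack.pop():
                    if target < pc:
                        # A backward jump closes a loop in the *user's* program;
                        # this hint lets the JIT start tracing that loop.
                        jitdriver.can_enter_jit(pc=target, program=program,
                                                stack=stack)
                    pc = target
                else:
                    pc += 2
            else:  # HALT
                return stack

Run under plain CPython this is just an interpreter; translating it with RPython's JIT option (--opt=jit) is what turns the hints into a tracing JIT.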
Posted Feb 8, 2012 22:21 UTC (Wed)
by ati (guest, #82816)
[Link] (1 responses)
This practically means that, in the case of RPython, the JIT compiler will dynamically optimize hot code areas that might span several functions, based on runtime profiling information (with the aid of annotations from the VM), effectively emitting binary code for an entire call-chain. The binary code further includes some code (guards) for reverting back to the interpreter.
In contrast, LLVM JIT compilation is explicitly requested by the VM, and works by compiling a function at a time to binary code that has to be managed by the VM runtime. Obviously it is possible to build a tracing VM on top of LLVM, but it would require some effort. So the former is a tracing, optimizing JIT, while the latter is a one-time JIT compiler.
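To make the contrast concrete, here is roughly what the explicitly requested, function-at-a-time style looks like. This sketch uses the llvmlite binding (which postdates this discussion) purely as an illustration: the VM hands LLVM the IR for one function, asks for machine code, and then owns the resulting function pointer itself:

    import ctypes
    import llvmlite.binding as llvm

    llvm.initialize()
    llvm.initialize_native_target()
    llvm.initialize_native_asmprinter()

    # IR for the single function the VM has decided to compile.
    module_ir = r"""
    define i64 @add(i64 %a, i64 %b) {
    entry:
      %sum = add i64 %a, %b
      ret i64 %sum
    }
    """

    mod = llvm.parse_assembly(module_ir)
    mod.verify()

    target_machine = llvm.Target.from_default_triple().create_target_machine()
    engine = llvm.create_mcjit_compiler(mod, target_machine)
    engine.finalize_object()  # compile now, a whole module/function at a time

    # The VM gets back a raw function pointer and manages it from here on.
    addr = engine.get_function_address("add")
    add = ctypes.CFUNCTYPE(ctypes.c_int64, ctypes.c_int64, ctypes.c_int64)(addr)
    print(add(2, 3))  # 5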
Posted Feb 9, 2012 11:25 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
I think I'd argue that the thing that's innovative in RPython is not the tracing JIT compiler (you could extend LLVM to have a tracing JIT compiler/interpreter for the LLVM IR, after all - it's "just" a matter of engineering time), but the application of a tracing JIT compiler to the problem of making writing a JIT compiler easy.
The leap involved is immense - most programmers are capable of writing a simple interpreter for a language of their own design. Not as many programmers are capable of writing a compiler from their language to an IR. RPython says "if you can write your interpreter in this subset of Python, and add these annotations, you get a JIT for free." As it happens, they've chosen a subset of Python that's reasonably easy to live with, and the annotations are ones that should be natural to anyone writing an interpreter.
To give you an idea of how different this is, imagine writing a modern implementation of BBC BASIC, absent the commands beginning "*", and the CALL and USR facilities for leaving BASIC to call arbitrary machine code. You'd probably tokenize the language the way the BBC Micro did in the 80s, then directly interpret the tokenized "bytecode" (which has a one-to-one correspondence to the original source).
Now try and JIT this - to use LLVM, you need to rework your interpreter to translate to LLVM IR (which is a non-trivial task), then you check it works by using the LLVM interpreter to run it, then you add the JIT bits. In contrast, to use RPython, you rewrite your interpreter in RPython (trivial if it was already written in Python, not that hard if it was written in something else), and check it still works as an interpreter. You then annotate a few key points in your interpreter, and it's done; you have a JIT compiler.
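For anyone who never met the machine, a sketch of the one-to-one tokenization being described, with illustrative (not authentic) token values:

    # Illustrative one-byte token values; the real BBC Micro assigned its own
    # specific codes, but the exact numbers don't matter for the idea.
    KEYWORDS = {"PRINT": 0xF1, "FOR": 0xE3, "TO": 0xB8, "NEXT": 0xED}

    def tokenize(line):
        """Replace each keyword with a one-byte token and keep everything else
        verbatim, so the tokenized form maps one-to-one back to the source."""
        out = bytearray()
        for word in line.split(" "):
            if word in KEYWORDS:
                out.append(KEYWORDS[word])
            else:
                out.extend(word.encode("ascii"))
            out.append(ord(" "))
        return bytes(out[:-1])

    print(tokenize("FOR I = 1 TO 10"))  # b'\xe3 I = 1 \xb8 10'

A tokenized-bytecode interpreter then walks these bytes directly, as described above.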
Posted Feb 8, 2012 20:12 UTC (Wed)
by flewellyn (subscriber, #5047)
[Link] (4 responses)
Even if PyPy ultimately never takes off as the default Python implementation and never replaces CPython in popularity, the project will still be a great success simply because it introduced this automatic JIT-creation technology. Even if nobody uses PyPy in production, if people use RPython or something like it to create new language-specific VMs with automatic tracing JITs built in, this is a HUGE advancement of the state of the art.
Posted Feb 10, 2012 23:29 UTC (Fri)
by daglwn (guest, #65432)
[Link] (3 responses)
It is very cool stuff. It's a form of partial evaluation along the lines of the Futamura projections; RPython appears essentially to be an implementation of some of them.
So, in RPython terms: the VM author's interpreter is the program being specialized, and the RPython toolchain plays the role of the specializer; specializing that specializer against the interpreter (projection #2) would yield the compiler - here, the tracing JIT.
Whether RPython actually specializes itself against the interpreter to create the tracing JIT determines whether it really uses Futamura projection #2 to achieve its ends. My guess is that it's more "hard coded" than that and is not really using partial evaluation to do the work. I am curious to know, however. This isn't to take anything away from the PyPy people. I don't know of any other such project as close to mainstream as PyPy is. That in itself is a monumental achievement. Taking things from the lab to production use is not easy.
Incidentally, RPython itself could have been created by applying the third projection: specializing the specializer against itself yields a compiler generator, i.e. a tool that turns an interpreter into a compiler.
Obviously I don't know if they built RPython this way (I suspect not) but this kind of tool has been known for some time. Theoretically one can take the CPython interpreter, write a couple of specializers and spit out a program that is essentially RPython. There are also musings of a fourth projection, for example: http://dl.acm.org/citation.cfm?id=1480954
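For readers who haven't met the projections, here is a toy, runnable rendering of the idea in Python, using plain partial application as a stand-in for a real specializer (a genuine partial evaluator would also optimize the residual program instead of merely closing over the static argument):

    from functools import partial

    def interp(program, data):
        """A trivial 'interpreter': program is a list of (op, constant) pairs."""
        acc = data
        for op, k in program:
            if op == "add":
                acc += k
            elif op == "mul":
                acc *= k
        return acc

    def mix(f, static_arg):
        """Stand-in specializer: fix f's first argument. A real partial
        evaluator would also emit an optimized residual program."""
        return partial(f, static_arg)

    double_plus_one = [("mul", 2), ("add", 1)]

    # 1st projection: specializing the interpreter to a source program
    # yields a "compiled" program.
    target = mix(interp, double_plus_one)
    assert target(10) == 21

    # 2nd projection: specializing the specializer to the interpreter
    # yields a compiler for the interpreted language.
    compiler = mix(mix, interp)
    assert compiler(double_plus_one)(10) == 21

    # 3rd projection: specializing the specializer to itself yields a
    # compiler generator - feed it an interpreter, get a compiler back.
    cogen = mix(mix, mix)
    assert cogen(interp)(double_plus_one)(10) == 21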
Posted Feb 11, 2012 3:53 UTC (Sat)
by flewellyn (subscriber, #5047)
[Link] (1 responses)
I've wondered why nobody combines the two approaches. To wit, the system starts by creating compiled (but not necessarily highly optimized) versions of the functions, and then traces their execution and creates "trace-optimized" versions for common paths. This way, when the traced condition fails, you don't have to drop down to interpretation: you still have compiled code. It's just not as fast as the traced path.
I dunno, too much work? Difficult to automate?
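A tiny sketch of the control flow being suggested (the names here are invented and correspond to no actual PyPy interface): the runtime prefers a trace-specialized version when one exists, and a guard failure falls back to the baseline compiled version rather than to the interpreter:

    class GuardFailure(Exception):
        """Raised when a runtime check baked into a trace does not hold."""

    def baseline_pow(base, exp):
        # "Compiled but not highly optimized": handles every case.
        result = 1
        for _ in range(exp):
            result *= base
        return result

    def traced_pow_exp_2(base, exp):
        # Trace specialized for the commonly observed case exp == 2.
        if exp != 2:              # the guard
            raise GuardFailure()
        return base * base

    def run_pow(base, exp):
        try:
            return traced_pow_exp_2(base, exp)   # the fast, traced path
        except GuardFailure:
            return baseline_pow(base, exp)       # still compiled, just slower

    print(run_pow(3, 2))  # 9, via the trace
    print(run_pow(3, 5))  # 243, via the baseline version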
Posted Feb 11, 2012 16:50 UTC (Sat)
by daglwn (guest, #65432)
[Link]
Posted Feb 11, 2012 3:54 UTC (Sat)
by flewellyn (subscriber, #5047)
[Link]
Posted Feb 9, 2012 10:59 UTC (Thu)
by mjthayer (guest, #39183)
[Link] (12 responses)
> Tracing JITs are relatively new and have some limitations, at least based on what we currently know. Mozilla, for example, removed their tracing JIT a few months back, because while it's sometimes blazingly fast, it's sometimes rather slow. This is due to a tracing JIT optimising a single code-path at a time [...] Code which tends to take the same path time after time benefits hugely from tracing; code which tends to branch unpredictably can take considerable time to derive noticeable benefits from the JIT.

Surely it wouldn't be a huge leap from what they have already achieved to keep a trace around for generated code and to produce a merged trace (and regenerate the code) if a second path becomes hot enough. I am assuming, of course, that this has been shown by measurement to be a problem and is not just guesswork (the author does seem to be serious about what he is doing).
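For what it's worth, PyPy's JIT does attach so-called bridges to guards that fail often enough, which is close to what is being asked for here. A rough sketch of the bookkeeping involved (the threshold and all names are invented for illustration):

    BRIDGE_THRESHOLD = 200  # invented; the real threshold is a tuning knob

    class Guard:
        def __init__(self, check):
            self.check = check     # returns True when the fast path still holds
            self.failures = 0
            self.bridge = None     # compiled code for the second path, once hot

        def execute(self, value, fast_path, slow_path, compile_trace):
            if self.check(value):
                return fast_path(value)
            self.failures += 1
            if self.bridge is None and self.failures >= BRIDGE_THRESHOLD:
                # The second path has become hot: trace it, compile it, and
                # attach the result to this guard so future failures stay fast.
                self.bridge = compile_trace(slow_path)
            if self.bridge is not None:
                return self.bridge(value)
            return slow_path(value)  # interpreter / baseline fallback

    # Usage sketch: guard on "the value is an int", with a stand-in "compiler"
    # that just hands the path back unchanged.
    g = Guard(lambda v: isinstance(v, int))
    print(g.execute(3, fast_path=lambda v: v * 2,
                    slow_path=lambda v: int(v) * 2,
                    compile_trace=lambda path: path))  # 6, via the fast path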
Posted Feb 9, 2012 16:38 UTC (Thu)
by khim (subscriber, #9252)
[Link] (10 responses)
Well, if you think about it: the most expensive tracing JIT ended up being a colossal failure and was eventually scrapped, so I'm pretty sure tracing JITs still have a long, long way to go. And I'm not sure they can be fixed in principle. The problem here is that in a power-constrained world a tracing JIT does not look like a huge win: sure, it can optimize something pretty thoroughly, but if your code paths are truly static then a static compiler will do even better, and if they change relatively often then the tracing JIT wastes power on optimizations which are never used.
Posted Feb 9, 2012 20:07 UTC (Thu)
by mjthayer (guest, #39183)
[Link] (1 responses)
Does that apply to tracing JITs more than to JIT compilers generally? Surely the main use for JIT compilers is in situations where static compiling is not practical, like when the precise target platform is not known at the time the code is distributed, or when the person running the code wants more control over it than they get from a precompiled binary. Think web applications - do you really want to precompile your application for every CPU it might run on, and would people really want to run your precompiled binary locally without good reason to trust you? Or, for that matter, think of processor emulation.
Posted Feb 10, 2012 17:05 UTC (Fri)
by khim (subscriber, #9252)
[Link]
> Does that apply to tracing JITs more than to JIT compilers generally?

This is a problem with any JIT, but it's worse for a tracing one.

> Surely the main use for JIT compilers is in situations where static compiling is not practical, like when the precise target platform is not known at the time the code is distributed, or when the person running the code wants more control over it than they get from a precompiled binary.

The main use is for languages which don't support static compilation. Well, we all use JITs every time we run a program (our CPUs don't execute x86 code directly; they translate it to RISC-like µops on the fly), so the question becomes "do we need yet another JIT?" Sometimes the answer is yes, but more often than not it's no, and people use managed code because it has high buzzword-compliance value.
Posted Feb 9, 2012 20:35 UTC (Thu)
by jensend (guest, #1385)
[Link] (5 responses)
Posted Feb 10, 2012 14:30 UTC (Fri)
by adobriyan (subscriber, #30858)
[Link] (4 responses)
The Pentium 4 trace cache?
Posted Feb 10, 2012 17:07 UTC (Fri)
by khim (subscriber, #9252)
[Link] (3 responses)
Bingo. The P4 was supposed to be superfast because it used a tracing JIT (while regular x86 CPUs have used a normal-style JIT for about the last 20 years). And yes, in some narrow cases the tracing JIT was better. But in general it was a disaster: the CPUs ran hot, drew a lot of power, and performed worse than CPUs with a regular JIT.
Posted Feb 10, 2012 23:41 UTC (Fri)
by daglwn (guest, #65432)
[Link] (2 responses)
It wasn't even a JIT. The trace cache failed because predicting multiple branches ahead of time is a failed exercise. You'll end up backtracking and flushing the pipeline a whole lot. Hence the power drain.
A closer analogue to a JIT would be a hardware value predictor. Those never really took off either because predicting anything other than 0 is pretty tough.
Posted Feb 11, 2012 13:02 UTC (Sat)
by khim (subscriber, #9252)
[Link] (1 responses)
> It wasn't even a JIT.

Why not? A JIT is supposed to compile bytecode to native code - and that is exactly what all contemporary CPUs are doing. Sure, their design is kind of unusual: where traditional JITs call for compilation from an easy-to-compile bytecode to predetermined native code, the JITs in our CPUs are built to compile from an ad-hoc, predetermined bytecode to flexible native code - but is it such a big difference?

> A closer analogue to a JIT would be a hardware value predictor. Those never really took off either because predicting anything other than 0 is pretty tough.

This is true for software JITs as well. They, too, usually try to predict the outcomes of branches, etc. In fact, the designs of hardware JITs and software JITs are surprisingly similar. The biggest difference, of course, lies in the fact that a software JIT has gobs of memory and can cache the results of its work pretty aggressively, while a hardware JIT only has tiny on-chip memory - but on the other hand, a hardware JIT uses resources not directly accessible to the program itself, so it does not interfere with the program's execution.
Posted Feb 11, 2012 16:58 UTC (Sat)
by daglwn (guest, #65432)
[Link]
A traditional JIT is much smarter about choosing what to optimize mostly because the static analysis can usually guess pretty well which variables will contain the same values repeatedly, even though it may not know exactly what those values are.
A trace cache has to guess which branches are going to be predictable AND what the common values will be. A software JIT only needs to guess which variables will have predictable values.
Posted Feb 10, 2012 23:35 UTC (Fri)
by daglwn (guest, #65432)
[Link] (1 responses)
HP's Dynamo project had great success JITing PA-RISC code to PA-RISC code. What's the point? They had access to values at runtime that they did not have at static compile time. They were able to delete branches and optimize very large basic blocks such that the resulting code ran fast enough to more than overcome the overhead of the translation.
This code could not be generated at static compile time because the program input is not available then. Even if you can predict a static path at static compile time (doubtful), you still don't have the actual data values to take advantage of.
DEC's FX!32 and IBM's DAISY were similar projects.
They weren't tracing JITs but they could just as well have been.
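A small, invented illustration of the kind of opportunity being described: once the optimizer has observed at run time that a value is effectively constant, it can delete the branch that a static compiler had to keep:

    def blend(pixels, mode):
        # As statically compiled: the branch on `mode` is re-tested on every
        # iteration because the compiler cannot know which mode will be used.
        out = []
        for p in pixels:
            if mode == "lighten":
                out.append(min(p + 16, 255))
            else:
                out.append(max(p - 16, 0))
        return out

    def blend_lighten(pixels):
        # What a Dynamo-style runtime optimizer can emit after observing that
        # `mode` is always "lighten" in this run: the branch is simply gone,
        # leaving one straight-line loop body.
        return [min(p + 16, 255) for p in pixels]

    print(blend([10, 250], "lighten"))  # [26, 255]
    print(blend_lighten([10, 250]))     # [26, 255] - same result, no branch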
Posted Feb 11, 2012 12:44 UTC (Sat)
by khim (subscriber, #9252)
[Link]
> This code could not be generated at static compile time because the program input is not available then. Even if you can predict a static path at static compile time (doubtful), you still don't have the actual data values to take advantage of.

You have a point. The P4 was a purely tracing JIT, and contemporary CPUs dropped that approach because it's hard to combine both in a single chip. As the P4 story shows, a pure tracing JIT is a disaster and will probably always be a disaster. But a combination of a regular JIT with a tracing JIT may sometimes make sense, as was already pointed out. The question is: how often will it make sense?

> DEC's FX!32 and IBM's DAISY were similar projects.

And here you are missing the point again. DAISY, Dynamo, FX!32 and IA-32 EL all play with architectures which explicitly tried to switch from a JIT implemented in hardware to an "intelligent compiler". This idea flew like a lead balloon. In the vain hope of saving it they tried a software JIT - but that does not work as well.

> They weren't tracing JITs but they could just as well have been.

I'm not so sure. The general problem of tracing JITs (which killed NetBurst, too) will probably be fatal for these projects. Now, the use of a tracing JIT as a "superbooster" for rare cases may be interesting, but these projects never went that far.

P.S. Actually, the architecture of contemporary CPUs makes the work of JIT creators very frustrating. It's easy to show real-world cases where a JIT wins over a pure static compiler. But since contemporary CPUs always use a JIT, we don't have a "JIT vs. compiler" competition; we have a "JIT+compiler" vs. "JIT+JIT" competition - and this combination rarely makes sense, especially when the two stacked JITs are created by different teams which don't cooperate at all.
Posted Feb 16, 2012 20:18 UTC (Thu)
by kleptog (subscriber, #1183)
[Link]
I don't think this is an insurmountable problem, but it's certainly a problem. Of course, if the PyPy team licks it, then it will be fixed for every JIT based on PyPy. That's the really cool part :).