
PyPy: the other new compiler project

PyPy: the other new compiler project

Posted May 20, 2010 6:18 UTC (Thu) by ekj (guest, #1524)
Parent article: PyPy: the other new compiler project

It seems to me that this observation is a lot more general.

It is frequently said that high-level languages (e.g. Python) provide better developer productivity at the cost of slower execution. It used to be true that for real speed you hand-coded the inner loop in assembly.

With every passing generation of CPUs, though, that has become less and less true. There are simply so many optimizations, and so much complexity, that most mortal programmers are in practice unable to write the body of a loop better than an optimizing compiler can.

That is, compilers translating C to assembler do a better job of it than human beings can.

PyPy seems to demonstrate that the same is true at a higher level, at least in some situations. Writing code in (R)Python and having PyPy translate that to C can, in many situations, give a program that runs faster than it would have had you written it in C in the first place.

Hand-coding assembler, increasingly, doesn't pay -- the compiler does it better.

Could it be that hand-coding C *ALSO* doesn't pay, and that you'd tend to be better off writing the program in a higher-level language and having a compiler translate it? Or is that too general, despite applying when the program you're writing is a Python interpreter?



PyPy: the other new compiler project

Posted May 20, 2010 6:51 UTC (Thu) by Tjebbe (subscriber, #34055) [Link]

That is what the Java people claim, and perhaps it is true. I'm not sure we are there yet: whenever I talk to those people about what I do (in C and C++) they make the above statement (of course, right after "all C is insecure"), but a "prove me wrong" (with something more than a benchmark) usually ends the conversation.

So I'm very interested to see if this project will :)

Sorry, but this is not true at all

Posted May 20, 2010 7:12 UTC (Thu) by khim (subscriber, #9252) [Link]

> There are simply so many optimizations, and so much complexity, that most mortal programmers are in practice unable to write the body of a loop better than an optimizing compiler can.

Small functions (like memcpy) are still much faster if hand-coded in assembler. And more complex functions are often written in "today's assembler": code using NEON or SSE intrinsics, where each "function call" generates just one known instruction. You give up register allocation duty, but that is done to save coding time; it does not speed up execution.

Now, if we are talking about megabytes of code, the compiler does better than a human, but that's because there is not enough time to carefully hand-optimize huge amounts of code, so in effect the human's -O0 code is compared with the compiler's -O3 code...

> Writing code in (R)Python and having PyPy translate that to C can, in many situations, give a program that runs faster than it would have had you written it in C in the first place.

Care to present your benchmarks? I'm seeing comparisons between PyPy (a moderately fast JIT) and CPython (an extremely slow interpreter); I don't see where PyPy beats C. Such benchmarks are hard to write correctly (the language interfaces are too different), but usually when your C library is slower than the Python one (be it CPython or PyPy in any incarnation) it's because it does 100 times more work (and you are throwing away 99% of what it did), or because it spends lots of time going from Python to C and back (the switch is usually slower than either C or Python).

> Hand-coding assembler, increasingly, doesn't pay -- the compiler does it better.

Sorry, but this is not true at all. The compiler does it worse, unless it recognizes some precomputed pattern. For example, the compiler multiplies a register by a small number in the best possible way - much better than a human. Why? It's easy: compiler writers tried all possible combinations (billions of them) and selected the best one for each number below some cut-off (this is how ICC does it; GCC does worse because it only contains a few rules). But this approach hits the wall really fast. In general, hand-coded assembly still wins - if you take your time and write good assembler for your CPU.
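As a rough illustration (in Python, purely for the arithmetic; real compilers emit machine instructions such as x86 shifts and lea, not Python), these are the kinds of shift-and-add identities a strength-reducing compiler searches over when multiplying by a small constant:

```python
# Multiplication by a small constant rewritten as shifts and adds,
# the identities an optimizing compiler exploits instead of emitting
# a (historically slower) multiply instruction.

def times9_naive(x):
    return x * 9

def times9_shift_add(x):
    # x * 9 == x * 8 + x == (x << 3) + x
    return (x << 3) + x

def times10_shift_add(x):
    # x * 10 == x * 8 + x * 2 == (x << 3) + (x << 1)
    return (x << 3) + (x << 1)

# The identities hold for every integer, negative values included.
for x in range(-100, 100):
    assert times9_naive(x) == times9_shift_add(x)
    assert x * 10 == times10_shift_add(x)
```

The compiler's advantage, per the comment above, is exhaustive search over such instruction sequences; its limitation is that the search only covers patterns the writers anticipated.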

The problem here is timing: to write good hand-coded assembler for a P4 for a sizable program, you'd need 5-10 years. By that time the P4 will be history, Core 2 and Atom will be kings, and you'll be behind once again. That's why you use the hybrid approach cited above: it's just faster to write code with intrinsics, so you have some hope of shipping the product while the CPU is still in use.

> Could it be that hand-coding C *ALSO* doesn't pay, and that you'd tend to be better off writing the program in a higher-level language and having a compiler translate it?

It does not apply in general, and it does not apply here. PyPy wins because it's a JIT while CPython is a pure interpreter.

Sorry, but this is not true at all

Posted May 20, 2010 12:04 UTC (Thu) by djc (subscriber, #56880) [Link]

Here's a microbenchmark where PyPy outperformed C:

http://morepypy.blogspot.com/2008/01/rpython-can-be-faste...

(AIUI PyPy has gotten a lot better since...)

There are lies, damn lies and microbenchmarks...

Posted May 20, 2010 16:53 UTC (Thu) by khim (subscriber, #9252) [Link]

This is exactly what I'm talking about: the creators of a compiler always know where they can beat everyone else. Also note that even the author of the benchmark in question readily admits they are comparing totally different algorithms! The benchmark is almost entirely tied to the speed of the allocator: the C version uses a hand-coded allocator to make the speed of malloc less relevant, but it's not the best allocator out there.

Sure, changes in the algorithm can buy you more speed than micro-optimizations, but... how is that related to the topic under discussion?
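To illustrate the point about allocator-bound microbenchmarks, here is a hypothetical Python sketch: the same sum of squares written once with a fresh allocation per call and once against a preallocated buffer. Which variant wins depends heavily on the interpreter and its allocator, which is exactly why such numbers say little about the languages themselves:

```python
# Sketch of a microbenchmark that largely measures the allocator:
# identical arithmetic, very different allocation behavior.
import timeit

N = 1000

def allocating():
    # a new 1000-element list is created and discarded on every call
    return sum([i * i for i in range(N)])

BUF = [0] * N

def reusing():
    # same arithmetic, but the buffer is allocated exactly once
    for i in range(N):
        BUF[i] = i * i
    return sum(BUF)

assert allocating() == reusing()  # identical results either way

t_alloc = timeit.timeit(allocating, number=1000)
t_reuse = timeit.timeit(reusing, number=1000)
print(f"allocating: {t_alloc:.3f}s  reusing: {t_reuse:.3f}s")
```

The printed ratio shifts between CPython versions and allocators; a cross-language comparison built on such a kernel is measuring memory management, not the compilers.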

Sorry, but this is not true at all

Posted May 20, 2010 16:20 UTC (Thu) by intgr (subscriber, #39733) [Link]

I agree with everything else that you stated; however:

> CPython (extremely slow interpreter)

Compared to JITs it is slow, yes. Compared to other *interpreters*, CPython is the fastest one that exists; in most cases it outperforms other interpreters such as Perl, PHP and, needless to say, Ruby.

Considering that the Python language is far more dynamic than PHP (dynamic typing; class definitions, functions, operator overloads, etc. can change at any point at runtime, including magic methods like __getattr__), I think it is a real achievement.
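A small sketch of that dynamism (standard Python; only the toy class names are made up): attribute lookup can be intercepted at runtime via __getattr__, and methods can be rebound on a live class, so the interpreter can never assume lookups are static:

```python
# Two dynamic features the interpreter must support at every call site.

class Proxy:
    """Forwards unknown attribute lookups to a wrapped object."""
    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        # invoked only when normal attribute lookup fails
        return getattr(self._target, name)

p = Proxy([3, 1, 2])
assert p.count(1) == 1  # 'count' resolved through __getattr__ at call time

class Greeter:
    def greet(self):
        return "hello"

g = Greeter()
assert g.greet() == "hello"
Greeter.greet = lambda self: "goodbye"  # rebind the method on the class
assert g.greet() == "goodbye"           # existing instances see the change
```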

Okay, as far as the interpreters go it's not so bad....

Posted May 20, 2010 17:18 UTC (Thu) by khim (subscriber, #9252) [Link]

I'm not sure I want to start a debate about interpreters, but in all cases we are talking not about percentage differences but about multiples of C (let alone assembler) speed. A JIT beats an interpreter, and a compiler beats a JIT on real tasks. Assembler beats everything... if you give the programmer time - and we are talking years here, so it's just not practical.

The JIT case is very interesting. People often think that a JIT can outperform a compiler (we just need to wait a few more years), but in practice it's just not so. The reason is simple: cache. While the number of transistors in CPUs grows every year, the number of transistors in a CPU core is essentially constant (think L1 cache: 20 years ago - 8K in the 486; 10 years ago - 128K in the Athlon; today... still 128K and often less). This means that a JIT uses a very scarce resource for its work, so while artificial samples can be created where a JIT outperforms simple PBO (profile-based optimization), in real programs it almost always loses.

Okay, as far as the interpreters go it's not so bad....

Posted May 20, 2010 17:24 UTC (Thu) by intgr (subscriber, #39733) [Link]

Do note that my above post is agreeing with you:
> I agree with everything else that you stated

I didn't want to start a "debate", I just thought it was unfair to call CPython an "extremely slow interpreter", because it's not.

Okay, as far as the interpreters go it's not so bad....

Posted May 22, 2010 5:15 UTC (Sat) by salimma (subscriber, #34460) [Link]

I'm not convinced cache is much of an issue for long-running applications -- for those, one should compare the performance of a Java or C# application after the JIT is no longer being triggered, with a C/C++ equivalent.

It does not matter...

Posted May 22, 2010 9:48 UTC (Sat) by khim (subscriber, #9252) [Link]

Your loss can be big or small, but you can't win:

  1. If the JIT determines at some point that it's no longer needed and "disconnects" - it's just a version of PBO.
  2. If the JIT determines that the situation is static but checks from time to time that it hasn't changed - you lose small.
  3. If the JIT actively works and recompiles everything all the time - you lose big.

You can only ever win if the JIT recompiles stuff constantly (so PBO in a normal compiler can't cope) AND the workload does not depend on the L1 cache all that much (so the loss from the JIT's work is more than compensated by JIT optimizations). This situation can easily be created in tests but almost never occurs in real life.

Sorry, but this is not true at all

Posted May 22, 2010 15:58 UTC (Sat) by nix (subscriber, #2304) [Link]

> Compared to other *interpreters*, CPython is the fastest one that exists

That is extremely debatable. In general, when given a choice between speed and implementation clarity, Python has gone for the latter.

Lua is one example of an interpreter immensely faster than CPython (partly simply because it is smaller: the entire interpreter fits in L2 cache on my machine; Python will barely fit in L3.)

Sorry, but this is not true at all

Posted May 23, 2010 15:17 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

> Python is the fastest interpreter.

Ha. hAHahhahahaHAHAH!

It doesn't even use computed goto to make a threaded interpreter. Try running Erlang in your spare time - it's much faster than CPython. Or various Forth interpreters.
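Computed goto is a C-level technique (a GCC extension), so it cannot be expressed in Python itself, but the dispatch loop whose overhead it attacks can be sketched. Here is a toy stack-machine interpreter (hypothetical opcodes) in which every iteration pays the central dispatch cost that threaded code avoids by jumping directly from one opcode handler to the next:

```python
# Toy stack-machine interpreter. The while/if-elif structure is the
# classic "switch in a loop" dispatch; a threaded interpreter replaces
# it with a direct jump at the end of each opcode handler.

PUSH, ADD, MUL, HALT = range(4)

def run(code):
    stack = []
    pc = 0
    while True:                      # central dispatch: one trip per opcode
        op = code[pc]
        pc += 1
        if op == PUSH:
            stack.append(code[pc])
            pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == HALT:
            return stack.pop()

# (2 + 3) * 4
prog = [PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]
assert run(prog) == 20
```

In C, computed goto lets each handler end with `goto *targets[code[pc]]`, removing the shared loop-top branch that the CPU mispredicts.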

CPython and computed goto

Posted May 23, 2010 17:33 UTC (Sun) by scottt (subscriber, #5028) [Link]

> It doesn't even use computed goto to make a threaded interpreter.

The 'release31-maint' branch of CPython does use computed goto; see USE_COMPUTED_GOTOS in Python/ceval.c.

Sorry, but this is not true at all

Posted May 21, 2010 15:42 UTC (Fri) by vonbrand (guest, #4458) [Link]

The delightful "Writing Efficient Programs" by Jon Bentley (sadly long out of print, but his "Programming Pearls" contains the gist of it) tells you what to do to make programs go faster or use less memory.

First, measure where the performance drains are; they turn out not to be at all evident (programmers are notoriously bad at guessing them!). Look at the architecture of the program and check for more efficient algorithms. Then look at the "small picture": typical programs spend 95% of their time in 5% of their code. If you make that 5% go twice as fast, your program goes almost twice as fast; futzing around with the rest gives almost no improvement. Only if rewriting in your high-level language hits the wall should you consider rewriting in a lower-level language.

And never forget that hacking a program for efficiency has a cost in maintainability, and only under rare circumstances is the added programmer time of extreme measures worth the savings in computer time (with Moore's law, it is getting ever harder to justify).
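Bentley's "measure first" advice can be followed directly with Python's standard profiler; the hot/cold split below is a made-up example, but the workflow is the real one:

```python
# Measure before optimizing: profile a program with an obvious hot
# function and some cheap housekeeping, and let the numbers say where
# the time actually goes.
import cProfile

def hot(n):
    # the 5% of the code where 95% of the time goes
    return sum(i * i for i in range(n))

def cold():
    # housekeeping that looks suspicious but costs almost nothing
    return len("bookkeeping")

def program():
    for _ in range(100):
        hot(20_000)
        cold()

profiler = cProfile.Profile()
profiler.enable()
program()
profiler.disable()
# hot() dominates cumulative time, so it is the only place worth optimizing
profiler.print_stats(sort="cumulative")
```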

PyPy: the other new compiler project

Posted May 20, 2010 13:07 UTC (Thu) by liljencrantz (guest, #28458) [Link]

Often-repeated wisdom, but I've rarely seen it work like that in practice. Interpreted languages usually have huge memory overheads that force non-trivial data out of the caches, drastically lowering performance on real-world workloads.

That said, I think the trade-off of increased programmer productivity for decreased program speed is often the right choice.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds