
Progress on the Gilectomy


Posted Jun 2, 2017 2:55 UTC (Fri) by lhastings (guest, #66451)
In reply to: Progress on the Gilectomy by excors
Parent article: Progress on the Gilectomy

> > The benchmark that he always uses is a "really bad recursive Fibonacci".
> That sounds like it might produce quite misleading results for the new reference counting approach. Atomic increment/decrement can be relatively fast when the data stays in the core's cache, it only gets really expensive when ping-ponging between multiple cores. Writing all refcount operations to a big per-thread log will always be quite expensive, since that always has to go out to RAM, and I guess you need to write at least a 64-bit pointer each time.
[...]
> That suggests code that is constantly touching the same object in many threads will potentially be much faster with the log approach. But for single-threaded code, or multi-threaded code that mainly uses objects in a single thread at once, the difference is much less clear

First, it'd be the rare Python program that didn't share objects across threads. Many built-in objects are singletons in Python: None, True, False, the empty string, the empty tuple, and the "small integers" (-5 through 256 inclusive). Modules, classes, and functions are singletons too, and they all have reference counts. And then the constants *compiled into* those modules, classes, and functions are all shared and refcounted. I expect that even a program designed to run well in multiple threads, sharing as few objects across cores as possible, will still implicitly share a lot of objects under the hood. My bad benchmark is admittedly a pathological case of object sharing, but it's not a world of difference.
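For what it's worth, the singleton sharing is easy to observe from CPython itself. A quick illustration (note the small-int cache and getrefcount behavior are CPython implementation details, not language guarantees):

```python
import sys

# CPython caches the small integers -5 through 256 as singletons, so
# every thread that uses, say, 256 is touching the same object (and
# therefore the same reference count field).
a = 256
b = 256
assert a is b          # same cached object

# 257 is outside the cache; building it twice at runtime yields two
# distinct objects.  (int("257") avoids the compiler's constant
# folding, which would otherwise deduplicate the literals.)
c = int("257")
d = int("257")
assert c is not d

# None is one process-wide object; its refcount reflects use from
# every thread in the interpreter.
print(sys.getrefcount(None))
```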

Second, while I concede I don't genuinely know how atomic incr/decr works inside the CPU, obviously there must be *some* synchronization going on under the covers. The core you're running on doesn't know in advance whether or not another core is examining the memory you want to atomically incr/decr, so it *has* to synchronize somehow with the other cores. That synchronization itself seems to be expensive, and the more cores you have the more synchronization it needs to do. I assume this synchronization is a primary cause of the worse-than-linear scaling I observed when benchmarking the Gilectomy. And that's why, even with its obvious overhead, reference count logging seems to be a big performance win.

Why am I so sure it's this synchronization rather than the cache invalidation that matters? Because the cache invalidation is *still happening* with reference count logging. The commit thread for the reference count log is continually changing the reference counts on the shared objects: the small ints, the function, the code object, the module object, etc. Those writes invalidate those objects' cache lines for all the other cores. And yet I observe a big improvement with the reference count logger. It seems to me that all the reference count log is really doing is getting rid of the synchronization overhead.
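To make the scheme concrete, here is a minimal sketch of the logging idea as described above; the names (RefcountLog, commit_thread) and structure are mine, not the actual Gilectomy code. Workers only append to a private buffer, and a single commit thread applies the buffered deltas, so the shared counts have exactly one writer:

```python
import threading
from collections import defaultdict
from queue import Queue

class RefcountLog:
    """Per-thread buffer of (obj_id, delta) entries.

    Workers only append to their own log -- no atomic operations and
    no cross-core synchronization on the shared reference counts.
    """
    def __init__(self, commit_queue, flush_at=1024):
        self.entries = []
        self.commit_queue = commit_queue
        self.flush_at = flush_at

    def incref(self, obj_id):
        self._log(obj_id, +1)

    def decref(self, obj_id):
        self._log(obj_id, -1)

    def _log(self, obj_id, delta):
        # the hot path: a plain list append, nothing shared
        self.entries.append((obj_id, delta))
        if len(self.entries) >= self.flush_at:
            self.flush()

    def flush(self):
        self.commit_queue.put(self.entries)
        self.entries = []

def commit_thread(commit_queue, refcounts):
    """The single writer: applies every log in turn.  The writes still
    invalidate other cores' cached copies of the objects, but no
    writer-vs-writer synchronization is ever needed."""
    while True:
        log = commit_queue.get()
        if log is None:          # sentinel: shut down
            break
        for obj_id, delta in log:
            refcounts[obj_id] += delta
            # a real implementation would free the object when its
            # count reaches zero
```

The point of the design is visible in `_log()`: the hot path is a plain list append with no atomic instructions, and all the cache invalidation is concentrated in the one commit thread.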

If you'd like to test your theory that atomic incr/decr isn't really so bad, you could try it with the Gilectomy. Here's how I'd do it. For a test with N cores, I'd have N modules, each with their own separate implementation of fib(). That'd ensure they weren't sharing the functions and code objects, the constants inside the code objects, or the modules. I'd then change the function so the lowest it went was 377, aka fib(14). That'd cut down on use of the small integers, though the code would still use 1 and 2, and the fib recursion level (e.g. the 14 in fib(14)) would still be using small ints. That's about as good as you're going to get with fib in Python. I predict you'll see improved scaling over my original fib benchmark but it'll still be markedly worse-than-linear, and not as gentle as with the reference count log.
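Here's roughly how I'd sketch that experiment in plain Python; the exec-into-a-fresh-namespace trick stands in for the N separate modules, and the harness is my own, untested against the Gilectomy itself:

```python
import threading

FIB_SOURCE = """
def fib(n):
    # Bottom out at fib(14) == 377 instead of recursing all the way
    # down.  The numeric results above the floor diverge from true
    # Fibonacci, but the point is the workload: fewer touches on the
    # shared small-int singletons (though 1, 2, and the small
    # recursion arguments remain).
    if n <= 14:
        return 377
    return fib(n - 1) + fib(n - 2)
"""

def make_private_fib():
    # exec the source into a fresh namespace so each thread gets its
    # own function object, code object, and constants -- standing in
    # for the N separate per-thread modules described above.
    ns = {}
    exec(FIB_SOURCE, ns)
    return ns["fib"]

def run(n_threads, arg=25):
    results = [None] * n_threads
    def worker(i, fib):
        results[i] = fib(arg)
    threads = [threading.Thread(target=worker, args=(i, make_private_fib()))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each call to `make_private_fib()` compiles a fresh code object, so the threads share neither the function, its code object, nor its module-level namespace; only the interpreter-wide singletons remain shared.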



Progress on the Gilectomy

Posted Jun 3, 2017 7:22 UTC (Sat) by rghetta (subscriber, #39444)

Is there some convergence with Eric Snow's separate-subinterpreters work [https://lwn.net/Articles/650489/]? While the two projects approach the GIL problem from almost opposite directions, could one benefit from the work done in the other?


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds