Worst case performance and "impossibly unlikely" conditions
Posted Jan 11, 2012 23:56 UTC (Wed) by nix (subscriber, #2304)
In reply to: Worst case performance and "impossibly unlikely" conditions by epa
Parent article: Denial of service via hash collisions
The slower ordered-tree-based dictionary implementations... aren't necessarily slower. The last hash I implemented was a modified-splay-tree-based system in which each get of a given item rotated it 1/Nth of the way towards the top of the tree (N being customizable). Finding a random item scales as O(log n), which is admittedly slower than the O(1) of a perfect hash or a sufficiently empty bucket hash -- but repeated searches over the same set of keys end up O(1), or O(m) where m is the size of the searched set, and additions are always O(log n) as well. This is definitely not true of bucket hashes, which have to be periodically expanded, and perhaps contracted, for efficiency, leading to nasty spikes in the time profile. Having O(1) operations suddenly take on the order of O(n)-with-a-large-constant time while an expansion happens is... not good. I know the cost amortizes to O(1) per operation, but it's still a jagged and nasty way to get there.
That splay-tree-based hash had other nice properties, such as an iteration order stable under non-colliding modification, and even stable under collision with certain limitations (if you deleted the currently-iterated value, and earlier-iterated values had colliding hash values, you would see such values more than once). It was also really easy to lock for multithreaded use, and copying and freeing it was pleasantly easy.
But its primary attraction to me was the limited conditional count. Because of the absence of expansion code, the only conditional in the whole thing other than the odd-or-even part of the tree rotation code was the trivial five-line code to handle hash collisions, which could have been omitted had I chosen to record full 64-bit hash values. There was no rarely-executed expansion code, no rehashing, nothing rarely executed to harbour bugs. Or there wasn't until I added extra code to efficiently pack small-enough data items into the item pointers, anyway...
(The thing was under a proprietary license thanks to my old free-software-phobic employer, but I will probably be reimplementing it and a number of related data structures under the BSD license sometime soonish. I'm sick of not having them available. :)
Posted Jan 12, 2012 1:26 UTC (Thu) by cmccabe (guest, #60281)
There was a post about this in 2010 on the freebsd list:
http://lists.freebsd.org/pipermail/freebsd-hackers/2010-S...
Basically, it was a great idea in the 1980s, but I'm not sure how well it works these days.
Posted Jan 12, 2012 1:52 UTC (Thu) by wahern (subscriber, #37304)
It's easy enough to test. NetBSD, FreeBSD, and OpenBSD all have <sys/tree.h>, which contains preprocessor-based splay and red-black tree implementations that, other than the algorithm, are implemented almost identically. It's self-contained, so you can just download the single header file. See man 3 tree for the API.
Those kernels tend to rely on RB trees fairly heavily. OpenBSD uses <sys/tree.h> for almost everything, including PF and the VM. It's perhaps not as fast as the radix and other specialized trees in Linux, but it's simple and unlikely to spring any surprises in the future.
I wrote a left-leaning RB tree variant which is a drop-in replacement, except that the prefix is LLRB instead of RB. LLRBs use a simpler algorithm, so the code is easier to hack on if necessary.
Posted Jan 12, 2012 21:42 UTC (Thu) by cmccabe (guest, #60281)
Posted Jan 13, 2012 13:52 UTC (Fri) by nix (subscriber, #2304)
And I contend that in the vast majority of applications this is true. This site has a focus on kernel development, but most programs are not the kernel and most data structures are not randomly accessed on a large number of CPUs in a very hot path. And you need all of those things to be true at once before a splay tree's disadvantages become substantial. (That is not to say that it always has advantages over other implementations: I would only bother with a splay tree if I expected to be populating large hashes and then only using bits of them most of the time. But for some very common applications -- e.g. caching -- this is true.)
Posted Jan 18, 2012 23:36 UTC (Wed) by cmccabe (guest, #60281)
I'm still wary of splay trees because they generate dirty cache lines which need to be written back to main memory at some point. That eats into the available memory bandwidth. And of course, even if all of your accesses are single-threaded, there's always the possibility of false sharing.
Posted Jan 20, 2012 1:58 UTC (Fri) by Wol (subscriber, #4433)
The Pick database is based on bucket hashes, and modern implementations resize so fast the user won't notice. Time is proportional to the delta in data: given a 4K bucket, it takes exactly the same time to grow the file from 4K to 8K as from 1M (1024K) to 1028K -- 4K of new space either way.
Look up "linear hashing" on Wikipedia.
Cheers,
Wol
Posted Jan 26, 2012 18:09 UTC (Thu) by nix (subscriber, #2304)