"Lengths that are _not_ at risk: 1, 2, 4, 6, 8, 10, 12, 14, 16, 18.
The rest are at risk (meaning that 8-bit chars in _some_ positions result in 1 to 3 preceding chars being ignored)."
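To make the "preceding chars being ignored" effect concrete, here is a minimal Python sketch. It assumes the flaw is a signed-char sign extension while the NUL-terminated key is packed cyclically into 18 32-bit words; that mechanism is not spelled out in the quote above, and the function name `expand_key` is hypothetical. It models only the key packing, not the full hash.

```python
def expand_key(key: bytes, buggy: bool) -> list[int]:
    """Pack the NUL-terminated key, repeated cyclically, into 18 32-bit words."""
    data = key + b"\x00"          # the terminating NUL is part of the cycle
    words, pos = [], 0
    for _ in range(18):
        tmp = 0
        for _ in range(4):
            byte = data[pos]
            if buggy and byte >= 0x80:
                # Assumed bug: the byte is treated as a signed char, so the
                # sign bit is extended over the upper 24 bits before the OR.
                byte |= 0xFFFFFF00
            tmp = ((tmp << 8) | byte) & 0xFFFFFFFF
            pos = (pos + 1) % len(data)
        words.append(tmp)
    return words


# Length 3 (at risk): the 8-bit char 0xE9 wipes out the char before it.
print(expand_key(b"a\xe9c", buggy=True) == expand_key(b"b\xe9c", buggy=True))
# Length 2 (not at risk): the first char survives in another position.
print(expand_key(b"a\xe9", buggy=True) == expand_key(b"b\xe9", buggy=True))
```

With a length-3 key the cycle is 4 bytes long, so each char always lands in the same position within a word and the char before the 8-bit one is overwritten every time. With a length-2 key the cycle is 3 bytes long, so the same char also shows up at the end of a word, where nothing after it can overwrite it.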
For 97946 different Russian words, I got "70890 (72%) and 97213 (99%) unique hashes for koi8-r and utf-8, respectively."
"For koi8-r, 22 hashes are seen over 100 times each, with the top one being seen 190 times. For utf-8, the top hash (most common) is seen 4 times, then 84 hashes are seen 3 times each.
Thus, obviously the bug does cause collisions. There are not as many of those as some people might expect for nearly purely 8-bit inputs. Yet the very common hashes for koi8-r are worrisome. Even though running the entire koi8-r wordlist against a bunch of hashes would only give a 30% speedup due to the bug, someone who focuses on the words producing the 22 top hashes - so they only try 22 words - would crack around 3% of passwords based on randomly picked words from that list (assuming uniform distribution of random word numbers). For utf-8, this risk is much lower: trying the top 85 passwords (0.087% of candidates) effectively tests 256 of them (0.26%)."
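The percentages quoted above can be rechecked from the counts given in the quotes. The following is only an arithmetic sketch in Python; the variable names are made up for illustration, and the koi8-r "around 3%" figure can only be bounded from below here, because the exact counts for the 22 top hashes are not listed.

```python
total_words = 97946
unique_koi8r = 70890
unique_utf8 = 97213

# Share of distinct hashes per encoding (quoted as 72% and 99%).
pct_unique_koi8r = 100 * unique_koi8r / total_words
pct_unique_utf8 = 100 * unique_utf8 / total_words

# Candidates per distinct hash for koi8-r: about 1.38, that is, roughly
# 28% fewer distinct hashes than words, near the quoted "30%".
candidates_per_hash = total_words / unique_koi8r

# utf-8: the top hash is seen 4 times, then 84 hashes are seen 3 times each.
top_utf8_guesses = 1 + 84
top_utf8_covered = 1 * 4 + 84 * 3
pct_guesses = 100 * top_utf8_guesses / total_words
pct_covered = 100 * top_utf8_covered / total_words

# koi8-r: 22 hashes seen over 100 times each gives only a lower bound.
pct_koi8r_lower_bound = 100 * 22 * 100 / total_words

print(f"koi8-r unique: {pct_unique_koi8r:.1f}%, utf-8 unique: {pct_unique_utf8:.1f}%")
print(f"koi8-r candidates per distinct hash: {candidates_per_hash:.2f}")
print(f"utf-8: {top_utf8_guesses} guesses ({pct_guesses:.3f}%) "
      f"cover {top_utf8_covered} words ({pct_covered:.2f}%)")
print(f"koi8-r: 22 guesses cover more than {pct_koi8r_lower_bound:.1f}% of words")
```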