Brilliant explanation

Posted Aug 4, 2021 11:36 UTC (Wed) by excors (subscriber, #95769)
In reply to: Brilliant explanation by niner
Parent article: Watson: Launchpad now runs on Python 3

> Bugs like the potential overflow of synthetics id space will of course be addressed in time.

Are there ideas on how that would be addressed? 2^32 graphemes seems like an unfortunately low limit that might be hit under plausible use cases, now that we've got enough RAM for programs to easily slurp a multi-gigabyte file into memory, so it seems bad if the language can't handle those strings correctly. But I don't see how it can handle those strings without changing the basic NFG idea of representing each grapheme with a unique 32-bit integer.

> Random access to strings does actually have a use in parsing. E.g. backtracking in a regex engine must be kinda horrible if you have to iterate through the whole string to the current position every time you jump backwards. C implementations can get around this even with variable length encodings by keeping a pointer into the string.

You shouldn't need to re-iterate forwards through the whole string to jump backwards, you can just iterate backwards over code points from the current position until you find the next grapheme cluster boundary. (Unicode says the grapheme cluster rules can be implemented as a DFA (https://unicode.org/reports/tr29/#State_Machines), and it's always possible to reverse a DFA. That's defined over code points but it's easy to layer it over a backwards UTF-8 byte iterator.)

But I'm not sure that matters for regex backtracking anyway - isn't that normally implemented by pushing the current position onto a stack, so you can backtrack there later? If you're doing grapheme-based iteration over the string, you don't need to push grapheme indexes onto the stack, you can just push an opaque iterator corresponding to the current position (which may be implemented internally as a byte index so it's trivial to dereference the iterator, and the implementation can guarantee it's always the index of the first byte of a grapheme). So there's still no need for random access or even for reverse iterators.

Brilliant explanation

Posted Aug 4, 2021 12:32 UTC (Wed) by Wol (subscriber, #4433) [Link] (1 responses)

> But I'm not sure that matters for regex backtracking anyway - isn't that normally implemented by pushing the current position onto a stack, so you can backtrack there later?

So you've just confirmed the previous poster's point! To do this, you need random byte-count access into the string - which is exactly what is allegedly impossible (or not necessary, I think we're all losing track of the arguments :-)

Cheers,
Wol

Brilliant explanation

Posted Aug 4, 2021 13:48 UTC (Wed) by excors (subscriber, #95769) [Link]

No, in this context "random access" is about whether you can access the Nth element (grapheme cluster, code point, code unit, whatever) in O(1) time for any N.

An array of code points (as used by Python) supports random access to code points but not to grapheme clusters. Raku's NFG supports random access to grapheme clusters but not to code points. An array of UTF-8 bytes (as used by Rust and Swift) or an array of UTF-16 code units (as used by Java) support neither.

In all of them, given an iterator (i.e. a pointer to an existing element) you can move forward/backward by one element in O(1) time, and you can save the iterator in some other data structure (like a stack) and go back to it later.

I think the interesting question around Raku is whether random access to grapheme clusters is actually useful in practice, or whether iterators are almost always sufficient (in which case you can implement strings as UTF-8 and avoid the complexity/performance sacrifices that Raku makes).