
Brilliant explanation

Posted Aug 3, 2021 15:17 UTC (Tue) by HelloWorld (guest, #56129)
In reply to: Watson: Launchpad now runs on Python 3 by excors
Parent article: Watson: Launchpad now runs on Python 3

Unfortunately almost no programming language gets this right, one exception being Raku.



Brilliant explanation

Posted Aug 3, 2021 18:34 UTC (Tue) by excors (subscriber, #95769) [Link] (7 responses)

> Unfortunately almost no programming language gets this right, one exception being Raku.

Hmm, I wondered how Raku compares to Swift. Swift seems relatively simple - it uses UTF-8 internally (since Swift 5: https://swift.org/blog/utf8-string/) while the API is based around grapheme clusters (https://docs.swift.org/swift-book/LanguageGuide/StringsAn...). You can't use integer indexes into strings (though you can construct an Index type with an "offsetBy" integer, at O(n) cost), and String.count is O(n). Strings are not stored or processed in normalised form, though comparison operators are based on canonical equivalence.
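
For example (a quick illustration using only the standard String API):

let s = "Swift strings"
// s[1] does not compile: String has no integer subscript
let second = s.index(s.startIndex, offsetBy: 1)  // O(n) walk over grapheme clusters
print(s[second])            // w
print(s.count)              // 13, computed in O(n)
print("ö" == "o\u{0308}")   // true: == uses canonical equivalence, not byte equality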

(https://hsivonen.fi/string-length/ explains some drawbacks with that: "🤦🏼‍♂️".count gives different numbers in the same version of Swift on different distros, because it depends on the ICU database. That's dangerous if you compute character indexes and store on disk or send over the network, or if you store/send strings with a character-based length limit, because a second instance of your code may surprisingly interpret indexes and lengths differently.)
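
(For reference, the different "lengths" of that emoji in Swift; only .count depends on the grapheme segmentation data, the other views don't:)

let s = "🤦🏼‍♂️"
print(s.count)                 // 1 grapheme cluster (with current segmentation data)
print(s.unicodeScalars.count)  // 5 code points
print(s.utf16.count)           // 7 UTF-16 code units
print(s.utf8.count)            // 17 UTF-8 bytes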

I can't find any detailed documentation of Raku's approach, but it sounds like a similar grapheme-centric API to Swift, except that it allows random access to graphemes and the operations are O(1). The representation is basically UTF-32, but first it converts all strings to NFC, and then for every grapheme cluster that's >1 code point it dynamically allocates a new integer (outside of Unicode's 21-bit code point range) and adds that to a global dictionary, so the string is stored as a sequence of 32-bit integers that are either single-grapheme code points or indexes into the grapheme dictionary.
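
If I've got that right, the core idea is roughly this toy sketch (my own made-up Swift model, not how MoarVM actually lays things out):

// Toy model of NFG: single-code-point graphemes are stored as the code point itself;
// multi-code-point graphemes get a freshly allocated negative id ("synthetic"),
// with the code points kept in a global side table. A string is then an [Int32].
struct NFGTable {
    private var byCodes: [[UInt32]: Int32] = [:]  // code-point sequence -> synthetic
    private var bySynthetic: [[UInt32]] = []      // synthetic -> code-point sequence
    mutating func grapheme(for codes: [UInt32]) -> Int32 {
        if codes.count == 1 { return Int32(codes[0]) }
        if let id = byCodes[codes] { return id }  // seen before: reuse the synthetic
        bySynthetic.append(codes)
        let id = Int32(-bySynthetic.count)        // negative ids: only 2^31 to hand out
        byCodes[codes] = id
        return id
    }
    func codes(for grapheme: Int32) -> [UInt32] {
        grapheme >= 0 ? [UInt32(grapheme)] : bySynthetic[Int(-grapheme) - 1]
    }
}
// "a" followed by "o"+U+0308 would be stored as [0x61, -1], with the table
// remembering that synthetic -1 stands for [0x6F, 0x308].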

That sounds interesting, but kind of problematic (if I'm understanding it correctly). A carefully-crafted few gigabytes of text could exhaust the 31-bit dictionary space, which at best is a DOS vulnerability, though apparently MoarVM doesn't even check for overflow so it'll reuse integers and probably corrupt strings. And if you try to write a carefully RAM-limited DOS-resistant Raku network server that only accepts small strings from each client, an attacker could just send lots of small strings with unique grapheme combinations, until eventually your server is bloated with many gigabytes of grapheme dictionary (because MoarVM doesn't attempt to garbage-collect old unused graphemes). Then there are other problems, like being unable to roundtrip Unicode text (it will all get converted to NFC), and worse performance when iterating over strings because it has to walk through the grapheme dictionary (with lots of extra cache misses), and worse performance because of UTF-32 (where mostly-ASCII strings will take 4x more memory than UTF-8). And it's not supported by the JVM backend for Raku.

hsivonen argues that random access to code points is almost never actually useful, and I think the same applies to grapheme clusters. You only need iterators. Making random access operations be O(1) at the expense of DOS vulnerabilities and memory usage and iteration performance does not sound like a great tradeoff.

Brilliant explanation

Posted Aug 4, 2021 8:10 UTC (Wed) by niner (subscriber, #26151) [Link] (3 responses)

All in all your analysis of Raku and MoarVM is pretty good. Just a small addition: MoarVM doesn't actually force all strings into the 32-bit representation. Pure ASCII strings, for example, remain in their 7-bit version. Bugs like the potential overflow of synthetics id space will of course be addressed in time.

Random access to strings does actually have a use in parsing. E.g. backtracking in a regex engine must be kinda horrible if you have to iterate through the whole string to the current position every time you jump backwards. C implementations can get around this even with variable length encodings by keeping a pointer into the string.

Brilliant explanation

Posted Aug 4, 2021 11:36 UTC (Wed) by excors (subscriber, #95769) [Link] (2 responses)

> Bugs like the potential overflow of synthetics id space will of course be addressed in time.

Are there ideas on how that would be addressed? 2^31 synthetic graphemes seems like an unfortunately low limit that might be hit under plausible use cases, now that we've got enough RAM for programs to easily slurp a multi-gigabyte file into memory, so it seems bad if the language can't handle those strings correctly. But I don't see how it can handle those strings without changing the basic NFG idea of representing each grapheme with a unique 32-bit integer.

> Random access to strings does actually have a use in parsing. E.g. backtracking in a regex engine must be kinda horrible if you have to iterate through the whole string to the current position every time you jump backwards. C implementations can get around this even with variable length encodings by keeping a pointer into the string.

You shouldn't need to re-iterate forwards through the whole string to jump backwards, you can just iterate backwards over code points from the current position until you find the next grapheme cluster boundary. (Unicode says the grapheme cluster rules can be implemented as a DFA (https://unicode.org/reports/tr29/#State_Machines), and it's always possible to reverse a DFA. That's defined over code points but it's easy to layer it over a backwards UTF-8 byte iterator.)
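
(Swift's String works that way already - it's a BidirectionalCollection over graphemes, so stepping backwards from a saved position never needs a rescan from the start:)

let s = "a🤦🏼‍♂️b"
var i = s.endIndex
while i != s.startIndex {
    i = s.index(before: i)   // step back one grapheme cluster from the current position
    print(s[i])              // prints b, then 🤦🏼‍♂️, then a
}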

But I'm not sure that matters for regex backtracking anyway - isn't that normally implemented by pushing the current position onto a stack, so you can backtrack there later? If you're doing grapheme-based iteration over the string, you don't need to push grapheme indexes onto the stack, you can just push an opaque iterator corresponding to the current position (which may be implemented internally as a byte index so it's trivial to dereference the iterator, and the implementation can guarantee it's always the index of the first byte of a grapheme). So there's still no need for random access or even for reverse iterators.
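
Something like this toy matcher (purely illustrative; positions are saved and restored as opaque String.Index values rather than integer offsets):

// Advance grapheme-by-grapheme; return the index just past the match, or nil.
func matchLiteral(_ s: String, at start: String.Index, _ lit: String) -> String.Index? {
    var i = start
    for ch in lit {
        guard i < s.endIndex, s[i] == ch else { return nil }
        i = s.index(after: i)
    }
    return i
}

// Try to match "ab", falling back to "a" at the same starting position.
func matchToy(_ s: String) -> Bool {
    var pos = s.startIndex
    let saved = pos                               // "push" the current position
    if let end = matchLiteral(s, at: pos, "ab") {
        pos = end
    } else {
        pos = saved                               // backtrack: restore the saved iterator
        guard let end = matchLiteral(s, at: pos, "a") else { return false }
        pos = end
    }
    return pos == s.endIndex
}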

Brilliant explanation

Posted Aug 4, 2021 12:32 UTC (Wed) by Wol (subscriber, #4433) [Link] (1 responses)

> But I'm not sure that matters for regex backtracking anyway - isn't that normally implemented by pushing the current position onto a stack, so you can backtrack there later?

So you've just confirmed the previous poster's point! To do this, you need random byte-count access into the string - which is exactly what is allegedly impossible (or not necessary, I think we're all losing track of the arguments :-)

Cheers,
Wol

Brilliant explanation

Posted Aug 4, 2021 13:48 UTC (Wed) by excors (subscriber, #95769) [Link]

No, in this context "random access" is about whether you can access the Nth element (grapheme cluster, code point, code unit, whatever) in O(1) time for any N.

An array of code points (as used by Python) supports random access to code points but not to grapheme clusters. Raku's NFG supports random access to grapheme clusters but not to code points. An array of UTF-8 bytes (as used by Rust and Swift) or an array of UTF-16 code units (as used by Java) supports neither.

In all of them, given an iterator (i.e. a pointer to an existing element) you can move forward/backward by one element in O(1) time, and you can save the iterator in some other data structure (like a stack) and go back to it later.

I think the interesting question around Raku is whether random access to grapheme clusters is actually useful in practice, or whether iterators are almost always sufficient (in which case you can implement strings as UTF-8 and avoid the complexity/performance sacrifices that Raku makes).

Have you misunderstood NFG?

Posted Aug 6, 2021 0:09 UTC (Fri) by raiph (guest, #89283) [Link] (2 responses)

I think you've misunderstood NFG. But I'm not an expert, so I hope you will reply to this comment to clear things up as far as my comments below are concerned. TIA.

I'll start by noting where Raku has a problem you didn't mention, one I think is shared by Raku, Swift, and Elixir:

> "🤦🏼‍♂️".count gives different numbers in the same version of Swift on different distros, because it depends on the ICU database.

Raku doesn't rely on a user's system's ICU. Instead it bundles (automatically extracted and massaged) relevant data with the compiler.

Thus "🤦🏼‍♂️".chars, and more generally operations on Unicode strings, give the same results when using the same compiler version.

So Rakudo has the opposite problem: when and how to update the Unicode data bundled with the compiler.

For example, Unicode 9, released in 2016, introduced prepend combiners, with consequences for grapheme clustering. Afaik that wasn't introduced into a Rakudo compiler until the end of 2016.

I note the result for "🤦🏼‍♂️".chars is 3 for Rakudo versions until the end of 2016, and then 1 thereafter. I don't know why, but I presume it was Unicode 9.

Unicode evolution is a blessing (dealing with real problems that need to be solved) and a curse.

I think it'll be a decade or so before grapheme handling becomes a truly mature aspect of Unicode, Swift, Raku, Elixir, and whichever other new PLs are brave enough to join them in making their standard string type be a sequence of elements, each of which is "what a user thinks of as a character" (quoting the Unicode standard).

> That's dangerous if you compute character indexes and store on disk or send over the network, or if you store/send strings with a character-based length limit, because a second instance of your code may surprisingly interpret indexes and lengths differently.)

Right.

Pain will ensue if devs mix up text strings as sequences of *characters* aka graphemes, and binary strings as storage / on-the-wire encoded data that's a sequence of *bytes* or *codepoints*.

> I can't find any detailed documentation of Raku's approach

The best *overview* of the MoarVM implementation that I know of is https://github.com/MoarVM/MoarVM/blob/master/docs/strings...

Beyond that I think one has to read the comments in the relevant MoarVM source code.

> The representation is basically UTF-32

No. NFG is a new internal storage form that uses strands (cf ropes).

None of the strands are UTF-32, though some will be sequences of 32 bit integers.

> A carefully-crafted few gigabytes of text could exhaust the 31-bit dictionary space, which at best is a DOS vulnerability, though apparently MoarVM doesn't even check for overflow so it'll reuse integers and probably corrupt strings.

Can you point to where "dictionary" aspects are introduced in https://github.com/MoarVM/MoarVM/blob/master/src/strings/... ?

Based on comments by core devs, I've always thought it was a direct lookup table. My knowledge of C is sharply limited, but the comments in that source file suggest it's a table, and a search of the page for "dict" does not match. Have you misunderstood due to not previously finding suitable doc / source code?

As for the DoS vulnerability, let me pick two exchanges that I think distil the Raku position.

The first focuses on the technical. See core devs discussing it one day in 2017: https://logs.liz.nl/moarvm/2017-01-12.html (Note that this log service is in its early days following the freenode --> libera transition, and the search functionality is buggy.) A key comment by jnthn, MoarVM's lead dev:

> All the negative numbers = space for over 2 billion synthetics. ... there's easily 100 bytes of data stored per synthetic, perhaps more. If we used the full synthetic space that'd be 214 gigabytes of RAM used. ... I think we're going to run out of memory before we run out of synthetic graphemes. :)

The second focuses on the need to eventually deal with it -- folk will want to use Rakudo with more than 200GB RAM used for strings. From an exchange in 2018 (TimToady is Larry Wall):

timotimo: we need a --fast flag for rakudo that just throws out all checks for invariants :P
TimToady: then we'll make that the default, and make people use a --slow flag when the security flaws emerge :P
jnthn: Worked for speculative execution :P
TimToady: still worries about dos attacks on nfg...
TimToady: but yeah, we need to get things into widespread use before we slow them down again :)

Short term, Rakudo users need to accept that you don't run Rakudo in a production setting where it can use more than about 200GB of RAM for strings. Longer term, something better has to be done.

> unable to roundtrip Unicode text (it will all get converted to NFC)

Raku is able to roundtrip arbitrary text, including arbitrary Unicode text. See https://docs.raku.org/language/unicode#index-entry-UTF8-C8

> worse performance when iterating over strings because it has to walk through the grapheme dictionary (with lots of extra cache misses)

Aiui NFG iterates the encoded data *once* when a string is first encountered, and thereafter it is a rope of fixed width strands, and that's mostly it. What leads you to talk of having to walk through a "dictionary", and incur "lots of extra cache misses"? Was that a guess?

> worse performance because of UTF-32 (where mostly-ASCII strings will take 4x more memory than UTF-8).

As per several earlier notes, that's not the case.

> it's not supported by the JVM backend for Raku.

True, but the JVM backend has experimental status, with several problems, not just NFG. It's about a lack of tuits among those working on the JVM backend, not the merits of NFG.

> hsivonen argues that random access to code points is almost never actually useful

Fwiw I fully agree. One could say it's almost always *dangerous*. Why bother with random access to code points when their correct interpretation as text absolutely requires iteration?

> I think the same applies to grapheme clusters.

Why? To be clear, a grapheme cluster is the Unicode standard's abstraction of what it defines as "what a user thinks of as a character".

A character, something an ordinary human understands, is not remotely comparable to a code point, which is a computing implementation detail.

And there are string processing applications where you really need O(1) character handling.

> You only need iterators.

I don't think that's true. I'll return to that in another comment.

> Making random access operations be O(1) at the expense of DOS vulnerabilities and memory usage and iteration performance does not sound like a great tradeoff.

Larry Wall committed to NFG, fully aware Rakoons would have to deal with that vulnerability. It requires something like 15GB worth of carefully constructed attack strings to overflow the grapheme space, which, in 2017, would allocate more than 200GB of RAM for string-related storage. It's a theoretical risk that must be addressed longer term; in the near term one must constrain a Rakudo instance to reject allocating string space beyond 200GB or so.

The UTF-32 memory-usage expense you thought existed does not. The strand implementation can be expanded as necessary to include forms other than the 8-bit-wide strand that was designed for efficiently handling mostly-ASCII strings.

Does the iteration performance expense you describe exist? I'd appreciate it if you could link to a line or lines in the source code I linked, indicating what you mean. TIA.

In the meantime, aiui there is great value to NFG's O(1) character handling. In short, it's about meeting the performance needs of a string-processing PL that, like Swift and Elixir, takes Unicode seriously, but that, in Raku's case, also takes seriously the need for performant, directly language-integrated Unicode-era string pattern matching / regexing / parsing. That's not Swift's focus, which is fair enough, but it is fundamental to Raku and its features.

Have you misunderstood NFG?

Posted Aug 6, 2021 16:43 UTC (Fri) by mbunkus (subscriber, #87248) [Link]

That was very insightful & interesting. Thank you very much.

Have you misunderstood NFG?

Posted Aug 6, 2021 18:15 UTC (Fri) by excors (subscriber, #95769) [Link]

>> That's dangerous if you compute character indexes and store on disk or send over the network, or if you store/send strings with a character-based length limit, because a second instance of your code may surprisingly interpret indexes and lengths differently.)
> Pain will ensue if devs mix up text strings as sequences of *characters* aka graphemes, and binary strings as storage / on-the-wire encoded data that's a sequence of *bytes* or *codepoints*.

I was thinking more of cases where you mix up strings as sequences of Unicode-8.0-graphemes and as sequences of Unicode-9.0-graphemes. Like you implement a login form where the password must be at least 8 characters (using Raku's built-in definition of 'character'), and a user registers with an 8-character password, but then you upgrade Raku and now that user's password is only 7 characters and the form won't let them log in.

To avoid bugs in cases like that, you need to realise beforehand that Raku's definition of 'character' is not stable and you need to implement some alternative form of length counting for any persistent data. I don't mean that's a huge problem, but it's unfortunate that the language's default seemingly-simple string API is creating those traps for unwary programmers.
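
(One way to avoid the trap for persisted limits is to count something whose definition can't move between Unicode versions, e.g. scalars or bytes - a sketch with a made-up helper:)

// .count (grapheme clusters) can change when the segmentation rules change;
// unicodeScalars.count and utf8.count cannot, so they're safer for stored limits.
func passwordLongEnough(_ pw: String, minScalars: Int = 8) -> Bool {
    pw.unicodeScalars.count >= minScalars
}
let pw = "🤦🏼‍♂️🤦🏼‍♂️"
print(pw.count)                // 2 with current rules; this number has moved before
print(passwordLongEnough(pw))  // true: 10 scalars, and that can never change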

>> The representation is basically UTF-32
> No. NFG is a new internal storage form that uses strands (cf ropes).
> None of the strands are UTF-32, though some will be sequences of 32 bit integers.

If there are no multi-codepoint graphemes, a single strand, and at least one non-ASCII character, then the string is a sequence of 32-bit integers that are the numeric values of Unicode code points, i.e. in that basic case it's UTF-32 :-). And that has the benefits of UTF-32 (you can find the Nth character in constant time) and the drawbacks (4 bytes of memory per code point; worse than UTF-8 even for CJK), which are not really affected by the more advanced features that NFG adds.

>> A carefully-crafted few gigabytes of text could exhaust the 31-bit dictionary space, which at best is a DOS vulnerability, though apparently MoarVM doesn't even check for overflow so it'll reuse integers and probably corrupt strings.
>
> Can you point to where "dictionary" aspects are introduced in https://github.com/MoarVM/MoarVM/blob/master/src/strings/... ?
>
> Based on comments by core devs, I've always thought it was a direct lookup table. My knowledge of C is sharply limited, but the comments in that source file suggest it's a table, and a search of the page for "dict" does not match. Have you misunderstood due to not previously finding suitable doc / source code?

By "dictionary" I don't mean a specific data structure like a Python dict / hash table, I just mean any kind of key-value mapping. In MoarVM it looks like there's actually two: MVM_nfg_codes_to_grapheme maps from a sequence of code points to a synthetic (i.e. a negative 32-bit number), or allocates a new synthetic if it hasn't seen this sequence before, and is implemented with a trie; and MVM_nfg_get_synthetic_info maps from a synthetic to a struct containing the sequence of code points (and some other data), implemented with an array. Both of those are limited to 2^31 entries before they'll run out of unique synthetics.

>> If we used the full synthetic space that'd be 214 gigabytes of RAM used. ... I think we're going to run out of memory before we run out of synthetic graphemes. :)

You can get a VM with 256GB RAM for maybe $1.50/hour - that's not a lot of memory by modern standards. It might have been a reasonable limit in a programming language designed a couple of decades ago, but it seems quite shortsighted to design that now, particularly in a language that's meant to be good at text processing.
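
(For what it's worth, jnthn's arithmetic checks out:)

// 2^31 synthetics at ~100 bytes each:
let bytes = Double(1 << 31) * 100
print(bytes / 1e9)   // ≈ 214.7, i.e. about 214 GB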

I think it's not obvious that the implementation could be optimised later, without significant tradeoffs in performance or complexity. Specifically the performance guarantees of the string API require the string to be stored as an array of fixed-size graphemes (to allow the O(1) indexing), so the only obvious way to increase the grapheme limit is to increase the element size, which would greatly increase memory usage and reduce performance in the vast majority of programs that don't use billions of graphemes, and/or to dynamically switch between multiple string representations. (There may be non-obvious solutions of course; I'm certainly not an expert and haven't investigated this very deeply, and I'd be interested if there were existing discussions about this.)

(I don't particularly care about technical limitations of a non-production-ready compiler/runtime which can be fixed later; but I am interested when those limitations are an inevitable consequence of the language definition. Raku expects the O(1) grapheme indexing and that constrains all future implementations of the language, and it's interesting to compare that to other languages' string models.)

(Incidentally I'm ignoring strands here, because it looks like MoarVM has a fixed limit of 64 strands per string. Finding the Nth character might require iterating through 64 strands before doing an array lookup, but technically that's still O(1) even if the constant factors will be bad.)
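
(Conceptually something like this toy sketch - not MoarVM's actual layout:)

// Toy rope: a string as a short list of strands, each a flat array of graphemes
// (code points or synthetics). Finding the Nth grapheme scans at most a fixed
// number of strands, then does one array index - still O(1), with a bad constant.
struct Rope {
    var strands: [[Int32]]
    func grapheme(at n: Int) -> Int32 {
        var remaining = n
        for strand in strands {                 // bounded scan (<= 64 strands in MoarVM)
            if remaining < strand.count { return strand[remaining] }
            remaining -= strand.count
        }
        fatalError("index out of range")
    }
}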

> Raku is able to roundtrip arbitrary text, including arbitrary Unicode text. See https://docs.raku.org/language/unicode#index-entry-UTF8-C8

It can, as long as you want to do almost nothing with the string apart from decode and encode it (in which case why bother using a Unicode string at all? You could just keep it as bytes). Even if the input is perfectly valid UTF-8 but in NFD instead of NFC, you'll get garbage if you try to print it or encode it like a normal Unicode string:

say Buf.new(0x6f, 0xcc, 0x88).decode('utf8');
# ö
say Buf.new(0x6f, 0xcc, 0x88).decode('utf8-c8');
# o􏿽xCC􏿽x88
say Buf.new(0x6f, 0xcc, 0x88).decode('utf8-c8').encode('utf8');
# utf8:0x<6F F4 8F BF BD 78 43 43 F4 8F BF BD 78 38 38>

The single UTF-8 grapheme gets decoded into 3, so any grapheme-based text processing algorithms will misbehave. The only way to get correct processing is to not use UTF8-C8, and let Raku lossily normalize your string.

That contrasts with Swift where strings aren't stored in normalized form but most string operations behave as if they were. E.g.:

let a = String(decoding: [0x6f, 0xcc, 0x88], as: UTF8.self)
let b = String(decoding: [0xc3, 0xb6], as: UTF8.self)
print(a, b, a == b)
# ö ö true
print(a.count, b.count, a.utf8.count, b.utf8.count)
# 1 1 3 2

so it operates over graphemes but it preserves the original bytes when you re-encode as UTF-8.

>> worse performance when iterating over strings because it has to walk through the grapheme dictionary (with lots of extra cache misses)
> What leads you to talk of having to walk through a "dictionary", and incur "lots of extra cache misses"?

I was slightly wrong about the "walk" because I mixed up the grapheme array and the trie - reading the string only needs the array, which is much cheaper than the trie.

But still, if you're doing an operation where you need to access each code point (e.g. to encode the string) you will iterate over the string's 32-bit elements, and if an element is a synthetic then you have to read nfg->synthetics[n].codes (which will likely be a cache miss, at least once per unique grapheme in the string) then read nfg->synthetics[n].codes[0..m] (another cache miss). That sounds slower than if the string was just stored as UTF-8/UTF-16/UTF-32, where all the relevant data is stored sequentially and will be automatically prefetched.

Admittedly that's only particularly relevant when iterating over code points, which doesn't seem too important outside of encoding. Encoding does seem quite important though. I don't have any benchmarks or anything, just a vague concern about non-linear data structures in general. It's a deliberate tradeoff to get better performance in some grapheme operations, but I worry it's a high cost for questionable benefit.

> And there are string processing applications where you really need O(1) character handling.

Do you have specific examples where it is really needed? I suspect that in a large majority of cases, all you really need is forward/backward iteration and the ability to store iterators in memory. E.g. for backtracking in pattern matching / regexing / parsing, you just store a stack of iterators to jump back to, instead of storing a stack of numeric indexes. That can be done with a variable-sized-grapheme string representation (like UTF-8 or UTF-32), avoiding the compromises that are required by a fixed-size-grapheme representation.

