
Have you misunderstood NFG?

Posted Aug 6, 2021 18:15 UTC (Fri) by excors (subscriber, #95769)
In reply to: Have you misunderstood NFG? by raiph
Parent article: Watson: Launchpad now runs on Python 3

>> That's dangerous if you compute character indexes and store on disk or send over the network, or if you store/send strings with a character-based length limit, because a second instance of your code may surprisingly interpret indexes and lengths differently.)
> Pain will ensue if devs mix up text strings as sequences of *characters* aka graphemes, and binary strings as storage / on-the-wire encoded data that's a sequence of *bytes* or *codepoints*.

I was thinking more of cases where you mix up strings as sequences of Unicode-8.0-graphemes and as sequences of Unicode-9.0-graphemes. Say you implement a login form where the password must be at least 8 characters (using Raku's built-in definition of 'character'), and a user registers with an 8-character password; then you upgrade Raku, that user's password is now only 7 characters, and the form won't let them log in.
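
A sketch of that scenario, using the emoji skin-tone rule change from Unicode 9.0 (I haven't actually run this on a Unicode-8.0-era Rakudo, so treat the first number as my reading of the old rules):

my $password = 'pass56' ~ "\x[1F44D]\x[1F3FD]";   # six ASCII characters plus thumbs-up + skin-tone modifier
say $password.chars;
# 8 under Unicode 8.0 grapheme rules (the modifier counted as its own character), 7 under 9.0 and later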

To avoid bugs in cases like that, you need to realise beforehand that Raku's definition of 'character' is not stable and you need to implement some alternative form of length counting for any persistent data. I don't mean that's a huge problem, but it's unfortunate that the language's default seemingly-simple string API is creating those traps for unwary programmers.
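
One workaround sketch: count something stable (code points or encoded bytes) for any limit that has to survive an upgrade, e.g.:

my $password = 'pass56' ~ "\x[1F44D]\x[1F3FD]";
say $password.codes;
# 8 - code points (after NFC), unaffected by grapheme-rule changes
say $password.encode('utf8').bytes;
# 14 - bytes, also stable, and usually what the storage layer cares about anyway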

>> The representation is basically UTF-32
> No. NFG is a new internal storage form that uses strands (cf ropes).
> None of the strands are UTF-32, though some will be sequences of 32 bit integers.

If there are no multi-codepoint graphemes, only a single strand, and at least one non-ASCII character, then the string is a sequence of 32-bit integers whose values are Unicode code points, i.e. in that basic case it's UTF-32 :-). And that has the benefits of UTF-32 (you can find the Nth character in constant time) and the drawbacks (4 bytes of memory per code point; worse than UTF-8 even for CJK), which are not really affected by the more advanced features that NFG adds.
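
As a quick illustration of the memory point from the language level (the 12-byte figure is just 3 graphemes × 4 bytes; you can't observe the internal storage directly):

say "日本語".chars;
# 3
say "日本語".encode('utf8').bytes;
# 9 - UTF-8 needs 3 bytes per character here, versus 12 for a flat array of 32-bit elements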

>> A carefully-crafted few gigabytes of text could exhaust the 31-bit dictionary space, which at best is a DOS vulnerability, though apparently MoarVM doesn't even check for overflow so it'll reuse integers and probably corrupt strings.
>
> Can you point to where "dictionary" aspects are introduced in https://github.com/MoarVM/MoarVM/blob/master/src/strings/... ?
>
> Based on comments by core devs, I've always thought it was a direct lookup table. My knowledge of C is sharply limited, but the comments in that source file suggest it's a table, and a search of the page for "dict" does not match. Have you misunderstood due to not previously finding suitable doc / source code?

By "dictionary" I don't mean a specific data structure like a Python dict / hash table, I just mean any kind of key-value mapping. In MoarVM it looks like there's actually two: MVM_nfg_codes_to_grapheme maps from a sequence of code points to a synthetic (i.e. a negative 32-bit number), or allocates a new synthetic if it hasn't seen this sequence before, and is implemented with a trie; and MVM_nfg_get_synthetic_info maps from a synthetic to a struct containing the sequence of code points (and some other data), implemented with an array. Both of those are limited to 2^31 entries before they'll run out of unique synthetics.

>> If we used the full synthetic space that'd be 214 gigabytes of RAM used. ... I think we're going to run out of memory before the run out of synthetic graphemes. :)

You can get a VM with 256GB of RAM for maybe $1.50/hour - that's not a lot of memory by modern standards. It might have been a reasonable limit in a programming language designed a couple of decades ago, but it seems quite shortsighted to build that limit in now, particularly in a language that's meant to be good at text processing.

I think it's not obvious that the implementation could be optimised later without significant tradeoffs in performance or complexity. Specifically, the performance guarantees of the string API require the string to be stored as an array of fixed-size graphemes (to allow the O(1) indexing), so the only obvious ways to increase the grapheme limit are to increase the element size (which would greatly increase memory usage and reduce performance for the vast majority of programs that don't use billions of graphemes) and/or to dynamically switch between multiple string representations. (There may be non-obvious solutions of course; I'm certainly not an expert and haven't investigated this very deeply, and I'd be interested if there were existing discussions about this.)
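
A toy way to see that constraint from the language side (not how MoarVM stores anything, just the shape of the fixed-width vs variable-width tradeoff):

my @graphemes = "x\x[308]abc".comb;   # one list element per grapheme: ('ẍ', 'a', 'b', 'c')
say @graphemes[2];
# b - a single array lookup, which is the O(1) guarantee NFG is built around
my $utf8 = "x\x[308]abc".encode('utf8');
# finding the Nth grapheme inside $utf8 instead means scanning from the start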

(I don't particularly care about technical limitations of a non-production-ready compiler/runtime which can be fixed later; but I am interested when those limitations are an inevitable consequence of the language definition. Raku expects the O(1) grapheme indexing and that constrains all future implementations of the language, and it's interesting to compare that to other languages' string models.)

(Incidentally I'm ignoring strands here, because it looks like MoarVM has a fixed limit of 64 strands per string. Finding the Nth character might require iterating through 64 strands before doing an array lookup, but technically that's still O(1) even if the constant factors will be bad.)

> Raku is able to roundtrip arbitrary text, including arbitrary Unicode text. See https://docs.raku.org/language/unicode#index-entry-UTF8-C8

It can, as long as you want to do almost nothing with the string apart from decode and encode it (in which case why bother with a Unicode string at all? You could just keep it as bytes). Even if the input is perfectly valid UTF-8 but in NFD rather than NFC, you'll get garbage if you try to print it or encode it like a normal Unicode string:

say Buf.new(0x6f, 0xcc, 0x88).decode('utf8');
# ö
say Buf.new(0x6f, 0xcc, 0x88).decode('utf8-c8');
# o􏿽xCC􏿽x88
say Buf.new(0x6f, 0xcc, 0x88).decode('utf8-c8').encode('utf8');
# utf8:0x<6F F4 8F BF BD 78 43 43 F4 8F BF BD 78 38 38>

The single UTF-8 grapheme gets decoded into 3, so any grapheme-based text processing algorithms will misbehave. The only way to get correct processing is to not use UTF8-C8, and let Raku lossily normalize your string.
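
To spell that out with the same bytes (the .chars figures follow from the output above; I'd expect the plain re-encode to give you the NFC bytes rather than the original NFD ones):

say Buf.new(0x6f, 0xcc, 0x88).decode('utf8').chars;
# 1 - normalized to a single grapheme
say Buf.new(0x6f, 0xcc, 0x88).decode('utf8-c8').chars;
# 3 - the UTF8-C8 synthetics don't combine, so grapheme counts are off
say Buf.new(0x6f, 0xcc, 0x88).decode('utf8').encode('utf8');
# utf8:0x<C3 B6> - re-encoding the normal decode gives the NFC bytes, not the bytes you started with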

That contrasts with Swift where strings aren't stored in normalized form but most string operations behave as if they were. E.g.:

let a = String(decoding: [0x6f, 0xcc, 0x88], as: UTF8.self)
let b = String(decoding: [0xc3, 0xb6], as: UTF8.self)
print(a, b, a == b)
# ö ö true
print(a.count, b.count, a.utf8.count, b.utf8.count)
# 1 1 3 2

so it operates over graphemes but it preserves the original bytes when you re-encode as UTF-8.

>> worse performance when iterating over strings because it has to walk through the grapheme dictionary (with lots of extra cache misses)
> What leads you to talk of having to walk through a "dictionary", and incur "lots of extra cache misses"?

I was slightly wrong about the "walk" because I mixed up the grapheme array and the trie - reading the string only needs the array, which is much cheaper than the trie.

But still, if you're doing an operation where you need to access each code point (e.g. to encode the string) you will iterate over the string's 32-bit elements, and if an element is a synthetic then you have to read nfg->synthetics[n].codes (which will likely be a cache miss, at least once per unique grapheme in the string) then read nfg->synthetics[n].codes[0..m] (another cache miss). That sounds slower than if the string was just stored as UTF-8/UTF-16/UTF-32, where all the relevant data is stored sequentially and will be automatically prefetched.

Admittedly that's only particularly relevant when iterating over code points, which doesn't seem too important outside of encoding. Encoding does seem quite important though. I don't have any benchmarks or anything, just a vague concern about non-linear data structures in general. It's a deliberate tradeoff to get better performance in some grapheme operations, but I worry it's a high cost for questionable benefit.

> And there are string processing applications where you really need O(1) character handling.

Do you have specific examples where it is really needed? I suspect that in a large majority of cases, all you really need is forward/backward iteration and the ability to store iterators in memory. E.g. for backtracking in pattern matching / regexing / parsing, you just store a stack of iterators to jump back to, instead of storing a stack of numeric indexes. That can be done with a variable-sized-grapheme string representation (like UTF-8 or UTF-32), avoiding the compromises that are required by a fixed-size-grapheme representation.

