
Rustaceans at the border

Posted Apr 14, 2022 21:49 UTC (Thu) by tialaramex (subscriber, #21167)
Parent article: Rustaceans at the border

I don't foresee people whose mascot is a penguin having too much trouble welcoming people whose mascot is a crab.

As a C programmer I will say that Rust's standard library (not all of which, admittedly, the kernel gets, see other comments about the stack of core -> alloc -> std of which the kernel will receive core and a somewhat custom alloc) has much more than I was used to from the C standard library and in many cases also utility libraries like Glib.

For programs where in C I'd have reached for at least some third-party utility libraries, in Rust the standard library is often more than enough. Yet it's still a very hospitable place to do the sort of low-level programming Linux is all about. For example, Rust explicitly has a char type which reflects Unicode scalar values, but it pragmatically offers predicates like is_ascii_hexdigit() both on the char type and on u8 (the unsigned byte). If you actually have bytes (say from a character device), and the only thing you care about is whether those bytes are ASCII hex digits, with no interest in whether they might be part of an Emoji or a Korean word, or anything fancy like that, then Rust doesn't waste your time.
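A minimal sketch of that point, assuming a small buffer of bytes stands in for whatever a character device would hand back. The helper names all_hex_bytes and all_hex_chars are made up for illustration; is_ascii_hexdigit() itself is the standard library method on both u8 and char.

```rust
/// Hypothetical helper: true if `bytes` is non-empty and every byte is an
/// ASCII hex digit (0-9, a-f, A-F). No UTF-8 decoding is involved.
fn all_hex_bytes(bytes: &[u8]) -> bool {
    !bytes.is_empty() && bytes.iter().all(|b| b.is_ascii_hexdigit())
}

/// Hypothetical helper: the same check on text, going through `char`.
fn all_hex_chars(text: &str) -> bool {
    !text.is_empty() && text.chars().all(|c| c.is_ascii_hexdigit())
}

fn main() {
    // Raw bytes, as they might arrive from a device.
    println!("{}", all_hex_bytes(b"dead10CC"));
    // 0xFF alone is not valid UTF-8, but the u8 predicate does not care.
    println!("{}", all_hex_bytes(&[b'4', 0xFF]));
    // The char predicate gives the same answer for ASCII text.
    println!("{}", all_hex_chars("c0ffee"));
}
```

The u8 version never needs the input to be valid UTF-8, which is the case the comment describes.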



Rustaceans at the border

Posted Apr 16, 2022 6:45 UTC (Sat) by bartoc (guest, #124262) [Link] (14 responses)

It's pretty embarrassing that C doesn't have is_ascii_digit and friends tbh.

Someone should write a paper, I bet it could get accepted.

Rustaceans at the border

Posted Apr 16, 2022 16:31 UTC (Sat) by tialaramex (subscriber, #21167) [Link] (12 responses)

Arguably Rust got lucky here. In the 1990s locale APIs were all the rage, and perhaps a new Rust language would have adopted locale-sensitive APIs for all this stuff as C did. After all C is also a low-level bit-banging type language, and yet nevertheless isdigit() insists on being locale sensitive.

By the early 21st century, there was more of a note of caution. An API which tells you whether this code is arguably a digit of some sort in the character encoding somebody (mis)configured on your server is rarely what you needed, while an API which says ASCII 0 through 9 are digits only is often useful. If you'd given away the latter in order to offer the former you looked a bit daft. Today, Rust offers only two things here, is this an ASCII digit, or (exclusively on its char type which represents Unicode's "scalar values" only) is this in Unicode's digit class? Got EUC-JP? Big-5? Too bad, decode them into Unicode and use our Unicode APIs.
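To make the two options concrete, here is a rough sketch. The function classify is a hypothetical name; is_ascii_digit() and is_numeric() are the standard methods on char. As far as I know, is_numeric() covers the Unicode general categories Nd, Nl and No, so it is somewhat broader than "decimal digit".

```rust
/// Hypothetical helper: returns (is an ASCII digit, is Unicode numeric).
fn classify(c: char) -> (bool, bool) {
    (c.is_ascii_digit(), c.is_numeric())
}

fn main() {
    // '7' is ASCII, U+0663 is ARABIC-INDIC DIGIT THREE, 'x' is neither.
    for &c in ['7', '\u{0663}', 'x'].iter() {
        let (ascii, unicode) = classify(c);
        println!("{:?}: ascii_digit={} unicode_numeric={}", c, ascii, unicode);
    }
}
```

Nothing here consults a locale or a configured character encoding; text in EUC-JP or Big-5 would have to be decoded to Unicode before either predicate applies.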

If you'd done that in 1995 there'd be howls of outrage at what seems to be cultural imperialism at work, today not so much.

Rustaceans at the border

Posted Apr 16, 2022 19:22 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (9 responses)

> If you'd done that in 1995 there'd be howls of outrage at what seems to be cultural imperialism at work, today not so much.

I think this was because in 1995, surrogate pairs didn't yet exist and everybody still thought of "Unicode" as a 16-bit encoding, which is completely inadequate for encoding all of CJK (without considering the fact that you also want to have enough room left over for all of the other writing systems in the world). But even in 1996, when they did theoretically have the code space for all of CJK, they hadn't done Extension A yet, let alone all of the subsequent CJK extensions, so a whole heck of a lot of characters were not actually encoded.* And then you have the Han unification controversy etc., so it's really not too surprising that people in the 1990s were not sure that Unicode was going to work out as well as it did.

* In 1989, the Consortium took the position that these characters were not "useful" enough for this to be a real problem in practice, which is how they managed to conclude that 16 bits would be enough in the first place. Unsurprisingly, if you tell a user "you can't enter your name because it uses an obscure character," people get upset.

Rustaceans at the border

Posted Apr 17, 2022 10:23 UTC (Sun) by khim (subscriber, #9252) [Link] (8 responses)

I wonder what would have happened in an alternate history where the IBM System/360 hadn't imposed the 8-bit byte and people had continued with 36-bit words. In such a world we might still be using 36-bit systems (very few devices even today need more than 64GiB RAM) and Unicode would be a simple 18-bit encoding, etc.

But of course by now the 8-bit byte is too entrenched to make such a switch.
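The arithmetic behind those figures, as a quick sketch. It assumes a 36-bit address names one 8-bit byte, which is my reading of where the 64GiB figure comes from; the function names are hypothetical.

```rust
/// Hypothetical helper: number of distinct addresses with `bits` address bits.
fn addressable_bytes(bits: u32) -> u64 {
    1u64 << bits
}

/// Hypothetical helper: number of code points a fixed `bits`-wide encoding holds.
fn code_points(bits: u32) -> u64 {
    1u64 << bits
}

fn main() {
    const GIB: u64 = 1 << 30;
    // 2^36 bytes, expressed in GiB.
    println!("36-bit addresses: {} GiB", addressable_bytes(36) / GIB);
    // Half of a 36-bit word as a fixed-width character.
    println!("18-bit encoding: {} code points", code_points(18));
    // For comparison, the 16-bit space originally planned for Unicode.
    println!("16-bit encoding: {} code points", code_points(16));
}
```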

Rustaceans at the border

Posted Apr 19, 2022 2:01 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

> very few devices even today need more than 64GiB RAM

Eh, it depends what you use it for. If you have a bunch of dedicated machines running something like memcached, you probably do want to shove an unusually large amount of RAM into them. The same goes for NAS systems or anything else that has a lot of RAM pressure. I'm not sure if you'd actually hit 64 GiB per device - that probably depends on other factors like the size and shape of your traffic, the number and speed of CPU cores, etc., but OTOH there are RAM-heavy systems that are difficult to horizontally scale (e.g. traditional RDBMS's), so I could imagine scenarios where you might end up deploying such a thing, at least as a stopgap until you figure out how to do replication without having glaring consistency problems all over the place.

Rustaceans at the border

Posted Apr 19, 2022 6:34 UTC (Tue) by flussence (guest, #85566) [Link] (6 responses)

I feel like Unicode could easily fit all the current semantics into 16 bits of codepoints with lots of room to spare, if it were redone today as RISC (consistent use of combining sequences) instead of CISC (e.g. the entire precomposed CJK glyphs area, all the latin-with-extra-squiggles codeplanes).

Still wouldn't be anywhere near simple to use, but that's how written communication is.

Rustaceans at the border

Posted Apr 19, 2022 11:06 UTC (Tue) by ssokolow (guest, #94568) [Link] (4 responses)

Unfortunately, it's not as simple as it sounds.

Wikipedia's commentary on Han Ideographs (Chinese, Japanese Kanji, Korean Hanja, etc.) being only offered precomposed is "However, attempts to do this for character encoding have stumbled over the fact that Chinese characters do not decompose as simply or as regularly as Hangul does."

...and I remember reading that the precomposed Latin stuff is necessary to guarantee that text strings used as opaque lookup keys (eg. filesystem paths) wouldn't get altered when round-tripping between a legacy encoding and Unicode, regardless of the circumstances.
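A small sketch of why this matters for opaque keys, assuming a hypothetical in-memory index keyed by file name. The precomposed and decomposed spellings render identically, but they are different code point sequences, so a byte-for-byte lookup treats them as different keys.

```rust
use std::collections::HashMap;

/// Hypothetical index keyed by file name, compared byte for byte.
fn build_index() -> HashMap<String, u32> {
    let mut index = HashMap::new();
    // Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
    index.insert(String::from("caf\u{00E9}.txt"), 1);
    index
}

fn main() {
    let index = build_index();
    let precomposed = "caf\u{00E9}.txt";
    // Decomposed form: plain 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    let decomposed = "cafe\u{0301}.txt";

    println!("precomposed lookup: {:?}", index.get(precomposed));
    println!("decomposed lookup:  {:?}", index.get(decomposed));
    println!("byte lengths: {} vs {}", precomposed.len(), decomposed.len());
}
```

If a conversion step silently swapped one spelling for the other, a key that went out would no longer match when it came back.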

Rustaceans at the border

Posted Apr 20, 2022 1:16 UTC (Wed) by flussence (guest, #85566) [Link] (1 responses)

I can accept that first point, but regarding latin I believe macOS munges every filename through NFD, so it's already a lost cause.

Rustaceans at the border

Posted Apr 20, 2022 3:42 UTC (Wed) by ssokolow (guest, #94568) [Link]

That's fine. Programs are supposed to assume that the filesystem may change under them. That's the cost of a shared resource which doesn't use Microsoft Visual SourceSafe-style locking overkill.

What's important is that, if the OS APIs give you a string identifier, your internal string processing can round-trip what you were given without altering it.

For comparison, I imagine that using the Windows version of Python's os.path.normcase for purposes other than in-process equality comparisons would cause i18n issues since it uses Python's internal .lower() method and thus the Unicode case-conversion tables baked into that version of Python while NTFS lookups use a case-folding table baked into the NTFS partition at the time it was formatted to ensure that Unicode updates can't introduce case-equivalence collisions for already-existing paths.

Rustaceans at the border

Posted Apr 22, 2022 10:24 UTC (Fri) by smurf (subscriber, #17840) [Link] (1 responses)

Assuming that there even is a legacy encoding that has composing codepoints *and* the corresponding composed characters.

And even if there is, you could mark the offenders, e.g. by placing a combining grapheme joiner U+034F between them.

IMHO the real reason is that, at the time, font rendering engines were not clever enough to show alternate glyphs for composed characters whose naïve superposition of their constituent parts simply doesn't work. (As in, all accented/umlauted/whatever'd capital letters.)

That, or the precedent of Latin-1 with its mountain of composed characters proved too strong and nobody even thought about solving the problem some other way until it was too late.

That, or the problem was deemed unfixable because instead of expanding Han-encoded texts by 50% (three-byte UTF-8 instead of two-byte words) you'd blow them up by >250% (two bytes for radical A, two for radical B, at least one for either marking the end of a glyph or a joiner; more if there's a radical C involved) which would not have been acceptable at the time. After all, at the time Weird Al chastised Microsoft that "in case you haven't noticed, four-gig drives don't grow on trees".
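The 50% figure can be checked with a short sketch. It assumes a legacy encoding that spends exactly two bytes per Han character, as the comment does; two_byte_len is a hypothetical helper, not a real codec.

```rust
/// Hypothetical helper: size in bytes if every character took exactly two
/// bytes, the assumption made above for legacy Han encodings.
fn two_byte_len(text: &str) -> usize {
    text.chars().count() * 2
}

/// Size in bytes of the same text as UTF-8 (Rust strings are UTF-8).
fn utf8_len(text: &str) -> usize {
    text.len()
}

fn main() {
    // U+6F22 U+5B57, the word "kanji" written in Han characters.
    let text = "\u{6F22}\u{5B57}";
    let legacy = two_byte_len(text);
    let utf8 = utf8_len(text);
    println!("legacy: {} bytes, UTF-8: {} bytes", legacy, utf8);
    println!("growth: {}%", (utf8 - legacy) * 100 / legacy);
}
```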

Rustaceans at the border

Posted Apr 22, 2022 13:48 UTC (Fri) by khim (subscriber, #9252) [Link]

> That, or the precedent of Latin-1 with its mountain of composed characters proved too strong and nobody even thought about solving the problem some other way until it was too late.

It's not even about the “precedent of Latin-1”. It's about the simple practical need to keep parts of your data in Unicode and parts in some other encoding with constant conversions between these.

It took years (about 10 to 20 years, in fact) before people, finally, stopped using legacy encodings.

If Unicode had been impossible (or very hard and inefficient) to use in that fashion then it would never have taken off.

Size considerations were also quite real: Japan persisted for years with ISO-2022-JP, both because round-tripping through Unicode is not perfect there and because Unicode made documents 50% larger.

The only big issue with Unicode was the initial assumption that 16 bits would be enough, after all: that prompted the useless and very costly trip to UCS-2, then UTF-16, and then, finally, to UTF-8.

UCS-2 made sense but UTF-16 has all the problems of UTF-8 without giving you any benefits.

If people had realized earlier that UCS-2 wouldn't work then all that hoopla with two kinds of functions in Java, endless bugs with UTF-16 in browsers and other such things could have been avoided.

But oh, well, we can't change the past, can only adopt UTF-8 for the future.
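A sketch of the variable-width point, using only standard char methods (len_utf8, len_utf16, encode_utf16). The helper names widths and utf16_units are made up for illustration. A character outside the 16-bit range needs two UTF-16 code units (a surrogate pair), so UTF-16 is no more fixed-width than UTF-8.

```rust
/// Hypothetical helper: (UTF-8 length in bytes, UTF-16 length in 16-bit units).
fn widths(c: char) -> (usize, usize) {
    (c.len_utf8(), c.len_utf16())
}

/// Hypothetical helper: the UTF-16 code units for one char.
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

fn main() {
    // 'A' is ASCII, U+6F22 is a Han character inside the 16-bit range,
    // U+20000 is a Han character from CJK Extension B, outside it.
    for &c in ['A', '\u{6F22}', '\u{20000}'].iter() {
        let (utf8, utf16) = widths(c);
        println!("U+{:04X}: {} UTF-8 bytes, {} UTF-16 units", c as u32, utf8, utf16);
    }
    // The surrogate pair for U+20000.
    println!("{:X?}", utf16_units('\u{20000}'));
}
```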

Rustaceans at the border

Posted Apr 19, 2022 19:40 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

There are more than 100000 unique Han characters. They can usually be decomposed into simpler characters (radical + phonetic), but even with that simplification it's going to get uncomfortably close to 2^16 code points.

Rustaceans at the border

Posted Apr 17, 2022 21:19 UTC (Sun) by bartoc (guest, #124262) [Link] (1 responses)

well, the real issue is that getting a feature into C is quite involved, and I can understand nobody wanting to go through all the trouble (including possible travel with airfare and hotel costs and so on) just to standardize a one line function.

the real problem with locales (besides them not really working with variable width encodings, and being based on code units) is that programmers do NOT expect the behavior of many of these functions to change out from under them (printf is locale sensitive!). This is not just beginners either! When the C++ committee standardized formatting (via std::format) for dates and times they accidentally made it locale sensitive by basically saying “interpret the format string as strftime would”, whoops. (the std::format model is locale invariant by default with special specifiers to do locale things, and the ability to pass in a locale object if you wanna use that instead of the global one)

C locales are so totally insufficient for actual internationalization that having everything be locale sensitive basically only results in non-user-facing stuff being mangled. I hope you like your log analytics misclassifying output from all your machines in countries with a different date order than your developers! It's totally insane.

Even if they were useful for localization the actual specification is essentially “do whatever you want, unless it's the C locale”, it's really, really bad. And in practice implementers do just phone in locales because they aren't really useful for anything anyway. They should be deprecated and removed (or “removed” by specifying that all locales are equivalent to the C locale)

Then there's the multiple attempts at standard C encoding conversion routines, all of which are broken.

Even if you stick to Unicode you can get into trouble with cursed/unexpected Unicode transformation formats; GB18030 is the worst (and the only really bad one in somewhat common usage)

Rustaceans at the border

Posted Apr 19, 2022 15:21 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

We're diverting pretty far from the original topic, but as to it being difficult to get features in C, there isn't any actual requirement.

Consider the BSD sockets API. Why is that commonplace? Did some JTC1 sub-committee sign off on it and then all our operating systems got the same API? No, it's just the right shape and so everybody adopted it and any "standards" are subsequent and simply documenting what was de facto already the case about networking.

Rustaceans at the border

Posted Apr 29, 2022 15:02 UTC (Fri) by peter-b (guest, #66996) [Link]

> It's pretty embarrassing that C doesn't have is_ascii_digit and friends tbh.
>
> Someone should write a paper, I bet it could get accepted.

I will be submitting exactly such a proposal for std::ascii_isdigit() etc. for the C++ standard library.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds