Would you like signs with those chars?

Posted Oct 26, 2022 7:32 UTC (Wed) by NYKevin (subscriber, #129325)
In reply to: Would you like signs with those chars? by tialaramex
Parent article: Would you like signs with those chars?

> they do that fine on UTF-8 data without any decoding.

No, they won't. At best they will pass non-ASCII bytes through without doing whatever the function is defined to do (e.g. tolower won't actually lowercase your letters), and at worst they will silently corrupt the data (if they think it's in one of the legacy 8-bit encodings).
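A minimal sketch of the pass-through case, written in Rust rather than C so that locale settings don't enter into it. It uses the standard to_ascii_lowercase and to_lowercase methods on str; the helper names ascii_lower and unicode_lower are made up for illustration. C's tolower may behave differently depending on the active locale, which is the "at worst" case and is not shown here.

```rust
/// Lowercase only the ASCII letters A-Z; every other character,
/// including non-ASCII ones encoded as UTF-8, is left untouched.
fn ascii_lower(s: &str) -> String {
    s.to_ascii_lowercase()
}

/// Full Unicode lowercasing, for comparison.
fn unicode_lower(s: &str) -> String {
    s.to_lowercase()
}

fn main() {
    // "\u{c9}" is the precomposed capital E with acute accent.
    let input = "\u{c9}COLE";
    // The ASCII-only version leaves the accented capital as it was.
    println!("{}", ascii_lower(input));
    // The Unicode-aware version lowercases it as well.
    println!("{}", unicode_lower(input));
}
```

The ASCII-only result is still valid UTF-8, but the accented capital letter has not been lowercased.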

> You often do not need the fancy Unicode is_digit but only is_ascii_digit for real software because overwhelmingly the "is it a digit?" question is not cultural but purely technical.

There is a subset of edge cases where a string does not contain linguistically useful information, like a phone number or UUID. In those cases, these ASCII-only functions are somewhat useful, but most of them could just as easily be done with regular expressions like [0-9]+. Realistically, you need nontrivial parsing logic anyway, to deal with things like embedded dashes and other formatting vagaries, so you may as well solve both problems with the same tool (which can and should be Unicode-capable, because ASCII is ultimately "just" a subset of UTF-8). In that context, these ASCII-only functions look rather less useful to me.

The problem is, ASCII-only functions are also an attractive nuisance. They make things a little too comfortable for the programmer who's still living in 1974, the programmer who still thinks that strings are either ASCII or "uh, I dunno, those funny letters that use the high bit, I guess?" Those programmers are the reason that so many Americans "can't" have diacritical marks in their names (on their IDs, their airline tickets, etc.). If you are writing string logic in 2022, and your strings have anything to do with real text that will be read or written by actual humans, then your strings are or should be Unicode. Unicode is the default, not the "fancy" rare exception. If you have strings, and they're not some variety of Unicode, then one of the following is true:

1. They're encoding something that sort of looks like text, but is not really text, like a phone number.
2. They are raw bytes in some binary format, and not text at all.
3. In practice, they mostly are Unicode, but that's not your problem (e.g. because you're a filesystem and the strings are paths).
4. You hate your non-English-speaking users (and the English-speakers who have diacritical marks anywhere in their string for whatever reason - we shouldn't make assumptions).
5. You inherited a pile of tech debt and it's too late to fix it now.

Would you like signs with those chars?

Posted Oct 26, 2022 12:03 UTC (Wed) by tialaramex (subscriber, #21167)

The thread you're replying in is about the "isalpha et al class of function". To my mind that's specifically the predicates, but if you insist on also including tolower and toupper from the same part of the standard library, then that's still fine, although more narrowly useful: they perform exactly as anticipated.

Sure enough Rust provides to_ascii_uppercase and to_ascii_lowercase here too.

[ Rust also provides to_uppercase and to_lowercase on char, but because this is a tricky problem these are appropriately more complicated ]
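One way to see why the Unicode versions are more complicated: char::to_uppercase returns an iterator rather than a single char, because one character can uppercase to several. A small sketch using only standard library methods (the wrapper names upper_unicode and upper_ascii are hypothetical):

```rust
/// Unicode uppercasing of one char; the result may be more than one char.
fn upper_unicode(c: char) -> String {
    c.to_uppercase().collect()
}

/// ASCII-only uppercasing; always exactly one char, non-ASCII unchanged.
fn upper_ascii(c: char) -> char {
    c.to_ascii_uppercase()
}

fn main() {
    // U+00DF is the German sharp s.
    let sharp_s = '\u{df}';
    println!("{}", upper_unicode(sharp_s)); // two characters: "SS"
    println!("{}", upper_ascii(sharp_s));   // unchanged
    println!("{}", upper_ascii('a'));       // "A"
}
```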

I already mentioned (but you snipped) that this will go wrong if your C library thinks it knows the byte is from some legacy encoding like 8859-1.

> most of them could just as easily be done with regular expressions like [0-9]+

This sort of completely inappropriate use of technology (resorting to regular expressions to just match ASCII digits) is how we get software that is hundreds of times bigger and slower than necessary.
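For comparison, a sketch of what the anchored pattern ^[0-9]+$ amounts to without a regex engine. It uses u8::is_ascii_digit from the Rust standard library; the function name is_all_ascii_digits is made up for this example, and the input is treated as raw bytes.

```rust
/// True if `s` is non-empty and every byte is an ASCII digit 0-9.
/// Roughly what the anchored regex ^[0-9]+$ would accept.
fn is_all_ascii_digits(s: &[u8]) -> bool {
    !s.is_empty() && s.iter().all(|b| b.is_ascii_digit())
}

fn main() {
    println!("{}", is_all_ascii_digits(b"20221026")); // true
    println!("{}", is_all_ascii_digits(b"2022-10"));  // false: '-' is not a digit
    println!("{}", is_all_ascii_digits(b""));         // false: empty input
}
```

Bytes above 0x7F are never ASCII digits, so UTF-8 encoded text that contains non-ASCII digits is simply rejected rather than misread.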

> Realistically, you need nontrivial parsing logic anyway

Again, you seem to have determined that people would be looking at these functions where they're completely inappropriate, but C itself isn't the right choice in the applications you're thinking about.

> If you are writing string logic in 2022, and your strings have anything to do with real text that will be read or written by actual humans, then your strings are or should be Unicode.

Certainly, but again, we're not asking about JavaScript or C# or even Rust, we're talking about C and most specifically about the Linux kernel. Whether the people implementing a driver for a Bluetooth device are "actual humans" is I guess up for question, but they're focused very tightly on low level technical details where the fact that the Han writing system has numbers is *irrelevant* to the question of whether this byte is "a digit" in the sense they mean.
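The two senses of "digit" can be put side by side in a small Rust sketch. It relies on char::is_ascii_digit and char::is_numeric from the standard library; the helper digit_views is hypothetical. Note that is_numeric is defined in terms of the Unicode general categories Nd, Nl and No, so it is only one possible notion of a "cultural" digit: it accepts the Arabic-Indic digit three but not the Han numeral three, which is categorised as a letter.

```rust
/// Returns (is it an ASCII digit?, is it numeric by Unicode general category?).
fn digit_views(c: char) -> (bool, bool) {
    (c.is_ascii_digit(), c.is_numeric())
}

fn main() {
    // '7', Arabic-Indic digit three (U+0663), Han numeral three (U+4E09), 'x'
    for c in ['7', '\u{663}', '\u{4e09}', 'x'] {
        let (ascii, numeric) = digit_views(c);
        println!("U+{:04X}: ascii_digit={} numeric={}", c as u32, ascii, numeric);
    }
}
```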

C only provides these 1970s functions, and so you're correct that you should not try to write user-facing software in C in 2022. But the 1970s style ASCII-only functions are actually useful, like it or not, because a bunch of technical infrastructure we rely on works this way, even if maybe it wouldn't if you designed it today (or maybe it would, hard to say).

Example: DNS name labels are (a subset of) ASCII. to_ascii_lowercase or to_ascii_uppercase is exactly appropriate for comparing such labels. You might say "Surely we should just agree on the case on the wire", but actually we must not do that, at least not before all DNS traffic is DoH. It turns out our security mitigation for some issues relies on random bits in DNS queries, and there aren't really enough of them, so we put more randomness in the case bits of the letters in the DNS name. Your code therefore needs to work properly with DNS queries that have the case bits set or unset seemingly at random, so as to prevent attackers guessing the exact name sent...
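A rough sketch of both halves of that, under some loudly stated assumptions: mix_case and labels_equal are hypothetical helpers, not taken from any real resolver, and the "randomness" here is a caller-supplied bitmask so the example stays deterministic. A real implementation would draw the mask from a cryptographically secure random source. The comparison uses the standard eq_ignore_ascii_case method on byte slices.

```rust
/// Flip the ASCII case bit (0x20) of the letter at byte position i
/// whenever bit i of `mask` is set. Non-letters are left alone.
/// Hypothetical helper: a real resolver would use a secure random mask.
fn mix_case(label: &[u8], mask: u64) -> Vec<u8> {
    label
        .iter()
        .enumerate()
        .map(|(i, &b)| {
            let flip = b.is_ascii_alphabetic() && (mask >> (i % 64)) & 1 == 1;
            if flip { b ^ 0x20 } else { b }
        })
        .collect()
}

/// Compare two labels, ignoring the case of ASCII letters only.
fn labels_equal(a: &[u8], b: &[u8]) -> bool {
    a.eq_ignore_ascii_case(b)
}

fn main() {
    // Mask 0b101 flips the letters at positions 0 and 2.
    let sent = mix_case(b"example", 0b101);
    println!("{}", String::from_utf8_lossy(&sent));
    // The responder's copy still matches the canonical lowercase label.
    println!("{}", labels_equal(&sent, b"example"));
}
```

Nothing in this needs to know about Unicode: the label is a byte string, and the only operation on it is toggling or ignoring one bit of the ASCII letters.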

The end user doesn't see any of this, you aren't expected to type randomly cased hostnames in URLs, nor to apply Punycode rules, you can type in a (Unicode) hostname, and your browser or other software just makes everything work. But the code to do that may be written in C (or perhaps these days Rust) and its DNS name matching only cares about ASCII, even though the actual names are Unicode.