
Would you like signs with those chars?

Would you like signs with those chars?

Posted Oct 24, 2022 18:14 UTC (Mon) by ballombe (subscriber, #9523)
Parent article: Would you like signs with those chars?

C also has the isalpha et al. class of functions, which conveniently take an int argument but are sometimes implemented as an array lookup, so you have to cast any char argument to unsigned char to quiet the compiler, even if your strings are 7-bit.



Would you like signs with those chars?

Posted Oct 24, 2022 21:29 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (17 responses)

That isn't the problem, as far as I can tell. The problem is that, if you accidentally have a character with the high bit set (because you're using ISO-8859-1 or Windows-1252 or some other 8-bit ASCII superset instead of* UTF-8), then it will sign-extend and you will get nonsense. Since those non-Unicode encodings used to be fairly popular, compilers started warning on this, even though I believe there's nothing in the standard that explicitly requires a diagnostic on conversion from char to int. Taking care to do this is good practice, because if you offer a 7-bit channel, people will (ab)use it as an 8-bit channel whether it is intended to support that or not.

If you can find an environment where someone actually declared isalpha and relatives with arrays rather than ints, I would be very surprised, because the standard specifies that the argument must be an int, and any subsequent int-to-array conversion is the callee's problem.

* This problem is theoretically also possible if you are using UTF-8, but in that case, you'll get nonsense anyway, because UTF-8 has to be decoded before you can call functions like isalpha on it - and at that point, you've already widened everything to 32 bits, so hopefully you did it correctly.

Would you like signs with those chars?

Posted Oct 24, 2022 22:45 UTC (Mon) by wahern (subscriber, #37304) [Link] (3 responses)

> That isn't the problem, as far as I can tell.

I can't find conclusive examples for is- ctype routines, but here is how tolower was defined during the first few releases of OpenBSD, as forked from NetBSD:

#define tolower(c) ((_tolower_tab_ + 1)[c])

It's still defined similarly on NetBSD, today: http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/sys/ctype_inl...

Also, EOF is a permitted value and typically -1 (thus the +1 in the above), though that would typically only be an issue for non-C locales.

Would you like signs with those chars?

Posted Oct 24, 2022 23:41 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (2 responses)

> #define tolower(c) ((_tolower_tab_ + 1)[c])

Even so, I don't believe that the standard *actually* says that c has to be unsigned in that expression - just that the "usual arithmetic conversions" happen (i.e. the compiler magicks it into an int when you're not looking). Compilers presumably added that warning because there were instances of arrays being indexed with negative char, but not negative int or any other signed type. And, again, that presumably had something to do with ASCII supersets and other nonsense involving dirty 7 bit channels.

> Also, EOF is a permitted value and typically -1 (thus the +1 in the above), though that would typically only be an issue for non-C locales.

The argument is of type int (according to the standard, not that untyped macro), not char, so it's completely unambiguous: You are allowed to pass negative numbers to those routines, because int is always signed, and if it is implemented as a macro, it has to accept signed values in the int range. Of course, if you pass negatives other than EOF (or whatever EOF is #define'd to), then the standard presumably gives you UB (which is why it's OK for the array implementation to walk off the end in that case).

Would you like signs with those chars?

Posted Oct 25, 2022 0:22 UTC (Tue) by wahern (subscriber, #37304) [Link]

The C standard says, "The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined."

Would you like signs with those chars?

Posted Oct 25, 2022 15:13 UTC (Tue) by mrvn42 (guest, #161806) [Link]

>> #define tolower(c) ((_tolower_tab_ + 1)[c])
>> Also, EOF is a permitted value and typically -1 (thus the +1 in the above), though that would typically only be an issue for non-C locales.

> The argument is of type int (according to the standard, not that untyped macro), not char, so it's completely unambiguous: You are allowed to pass negative numbers to those routines, because int is always signed, and if it is implemented as a macro, it has to accept signed values in the int range. Of course, if you pass negatives other than EOF (or whatever EOF is #define'd to), then the standard presumably gives you UB (which is why it's OK for the array implementation to walk off the end in that case).

The problem is that this will only work for values between -1 and 127 with an array of 129 bytes. A value of -2 (or any other negative value besides EOF, with signed char) would access memory before the array, and a value of 255 (EOF mistakenly stored in an unsigned char, or any non-ASCII byte) would access memory after it.

Looking at the source link in the other comments, the BSD code seems to assume chars are unsigned. The test for ASCII doesn't work with signed chars at all.

So I assume that "_tolower_tab_" is actually 257 bytes long, to cover all unsigned chars plus EOF (which is -1 when stored as an int).

Would you like signs with those chars?

Posted Oct 25, 2022 0:31 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (10 responses)

> you'll get nonsense anyway, because UTF-8 has to be decoded before you can call functions like isalpha on it

ASCII is a subset of UTF-8. So if you have a C library which is content to implement these functions for ASCII, they do that fine on UTF-8 data without any decoding.

You run into a problem, as you should expect, if the C library thinks the data is 8859-1 when it's actually UTF-8, but otherwise it just provides the very useful answer to the question: is this (alphabetic / a digit / punctuation / whitespace / etc.) in ASCII?

Rust deliberately provides the ASCII variants of these functions on both char (a Unicode scalar value) and u8 (an unsigned 8-bit integer, i.e. like C's unsigned char), named is_ascii_digit and so forth. For real software you often do not need the fancy Unicode is_digit, only is_ascii_digit, because overwhelmingly the "is it a digit?" question is not cultural but purely technical.

Would you like signs with those chars?

Posted Oct 25, 2022 3:29 UTC (Tue) by dvdeug (guest, #10998) [Link] (7 responses)

isalpha and friends are defined to work on the current locale; you can't trust them to work on just ASCII.

isxdigit is always 0-9, A-F and a-f, and as far as I can tell from the manpage, isdigit is always 0-9. Barring those, why is it useful to ask "is this an alphabetic/punctuation/whitespace character in ASCII?" Pretty much everything is defined in terms of Unicode now; even if you're processing C code, you should still be prepared for identifiers in Russian, Greek or Chinese. I have a hard time thinking of a case where it's the right thing to check whether something is some unspecified alphabetic character, but only those in ASCII.

Would you like signs with those chars?

Posted Oct 25, 2022 9:48 UTC (Tue) by pbonzini (subscriber, #60935) [Link] (1 responses)

The typical example is configuration files where you *can* restrict identifiers to ASCII. Using locale functions will cause a mess for Turkish and Azerbaijani speakers, thanks to the "dotless i" and "dotted I" characters in their alphabets.

Would you like signs with those chars?

Posted Oct 25, 2022 15:46 UTC (Tue) by dvdeug (guest, #10998) [Link]

You can restrict it to ASCII, but the question is *should* you. The dotted I / dotless i issues only matter if you're doing case-insensitive comparisons, which isn't a very Unix thing.

Would you like signs with those chars?

Posted Oct 25, 2022 15:46 UTC (Tue) by khim (subscriber, #9252) [Link] (4 responses)

> Pretty much everything is defined in terms of Unicode now; even if you're processing C code, you should still be prepared for identifiers in Russian, Greek or Chinese.

Sure, but only part of your program which deals with identifiers needs adjustment.

You can write int foo = 42; where only the identifier may contain non-ASCII characters, which means that you can easily use the “C” locale and all the ASCII-only functions with a simple change: where before Unicode you used isalpha(c) or isalnum(c), now you would use c < 0 || isalpha(c) and c < 0 || isalnum(c).

That's how doxygen handles it; I would assume someone may use isalpha(c) and/or isalnum(c) in a similar way.

Would you like signs with those chars?

Posted Oct 25, 2022 16:01 UTC (Tue) by dvdeug (guest, #10998) [Link] (3 responses)

That diff shows doxygen keeping UTF-8 characters together. The question is why you need to know that a character is alphabetic, but only in ASCII. In your example, the lex is "int", identifier, "=", integer, ";". It's not looking for ASCII letters; it's looking for 'i', 'n', 't', or the identifier in Unicode.

Would you like signs with those chars?

Posted Oct 25, 2022 16:10 UTC (Tue) by khim (subscriber, #9252) [Link] (2 responses)

> It's not looking for ASCII letters; it's looking for 'i', 'n', 't', or the identifier in Unicode.

Only, the identifier is not “Unicode”. It's alpha-or-Unicode, then alnum-or-Unicode (where “Unicode” is defined as “anything with the high bit set”).

Doxygen does that with lex, but in simpler cases you may do the same with ctype.h.

Would you like signs with those chars?

Posted Oct 25, 2022 23:41 UTC (Tue) by dvdeug (guest, #10998) [Link] (1 responses)

No, not if that was C or C++ code. I dug up a copy of the C++2003 standard I had lying around, and it specifically defines the set of letters allowed in an identifier; there's a limited number of usable Unicode characters. I'm pretty sure that any standard updated this century will have been made with reference to Unicode Standard Annex #31. The JVM (not Java) standard goes the other way and only restricts . ; [ / from appearing in a name.

Would you like signs with those chars?

Posted Oct 26, 2022 0:36 UTC (Wed) by khim (subscriber, #9252) [Link]

Sure, but the standard doesn't say what a compiler (or, even worse, a non-compiler) has to do with broken programs.

And if you ignore what the standard says and just go with isalpha/isalnum + Unicode (where Unicode == “high bit is set”), then you would handle all correct programs perfectly. And if someone feeds you an incorrect one… who cares how it's handled?

It's not as if we live in a world where everyone cares all that much about following the standard to a T.

Would you like signs with those chars?

Posted Oct 26, 2022 7:32 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

> they do that fine on UTF-8 data without any decoding.

No they won't: at best they will pass non-ASCII through without doing whatever the function is defined to do (e.g. tolower won't actually lowercase your letters), and at worst they will silently corrupt it (if they think it's one of the legacy 8-bit encodings).

> You often do not need the fancy Unicode is_digit but only is_ascii_digit for real software because overwhelmingly the "is it a digit?" question is not cultural but purely technical.

There is a subset of edge cases where a string does not contain linguistically useful information, such as a phone number or UUID. In those cases, these ASCII-only functions are somewhat useful, but most of them could just as easily be done with regular expressions like [0-9]+. Realistically, you need nontrivial parsing logic anyway, to deal with things like embedded dashes and other formatting vagaries, so you may as well solve both problems with the same tool (which can and should be Unicode-capable, because ASCII is ultimately "just" a subset of UTF-8). In that context, these ASCII-only functions look rather less useful to me.

The problem is, ASCII-only functions are also an attractive nuisance. They make things a little too comfortable for the programmer who's still living in 1974, the programmer who still thinks that strings are either ASCII or "uh, I dunno, those funny letters that use the high bit, I guess?" Those programmers are the reason that so many Americans "can't" have diacritical marks in their names (on their IDs, their airline tickets, etc.). If you are writing string logic in 2022, and your strings have anything to do with real text that will be read or written by actual humans, then your strings are or should be Unicode. Unicode is the default, not the "fancy" rare exception. If you have strings, and they're not some variety of Unicode, then one of the following is true:

1. They're encoding something that sort of looks like text, but is not really text, like a phone number.
2. They are raw bytes in some binary format, and not text at all.
3. In practice, they mostly are Unicode, but that's not your problem (e.g. because you're a filesystem and the strings are paths).
4. You hate your non-English-speaking users (and the English-speakers who have diacritical marks anywhere in their string for whatever reason - we shouldn't make assumptions).
5. You inherited a pile of tech debt and it's too late to fix it now.

Would you like signs with those chars?

Posted Oct 26, 2022 12:03 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

The thread you're replying in is about the "isalpha et al class of function" - to my mind that's specifically the predicates, but if you insist on also including tolower and toupper from the same part of the standard library, that's still fine, although more narrowly useful: they perform exactly as anticipated.

Sure enough Rust provides to_ascii_uppercase and to_ascii_lowercase here too.

[ Rust also provides to_uppercase and to_lowercase on char, but because this is a tricky problem these are appropriately more complicated ]

I already mentioned (but you snipped) that this will go wrong if your C library thinks it knows the byte is from some legacy encoding like 8859-1.

> most of them could just as easily be done with regular expressions like [0-9]+

This sort of completely inappropriate use of technology (resorting to regular expressions to just match ASCII digits) is how we get software that is hundreds of times bigger and slower than necessary.

> Realistically, you need nontrivial parsing logic anyway

Again, you seem to have determined that people would be looking at these functions where they're completely inappropriate, but C itself isn't the right choice in the applications you're thinking about.

> If you are writing string logic in 2022, and your strings have anything to do with real text that will be read or written by actual humans, then your strings are or should be Unicode.

Certainly, but again, we're not asking about Javascript or C# or even Rust, we're talking about C and most specifically about the Linux kernel. Whether the people implementing a driver for a bluetooth device are "actual humans" is I guess up for question, but they're focused very tightly on low level technical details where the fact that the Han writing system has numbers is *irrelevant* to the question of whether this byte is "a digit" in the sense they mean.

C only provides these 1970s functions, and so you're correct that you should not try to write user-facing software in C in 2022. But the 1970s-style ASCII-only functions are actually useful, like it or not, because a bunch of the technical infrastructure we rely on works this way, even if maybe it wouldn't if you designed it today (or maybe it would; hard to say).

Example: DNS name labels are (a subset of) ASCII. to_ascii_lowercase or to_ascii_uppercase is exactly appropriate for comparing such labels. You might say "surely we should just agree on the case on the wire", but actually we must not do that, at least not before all DNS traffic is DoH: it turns out our security mitigation for some issues relies on random bits in DNS queries, and there aren't really enough, so we actually put more randomness in the case bits of the letters in the DNS name. Your code therefore needs to work properly with DNS queries that have the case bits set or unset seemingly at random, so as to prevent attackers guessing the exact name sent...

The end user doesn't see any of this, you aren't expected to type randomly cased hostnames in URLs, nor to apply Punycode rules, you can type in a (Unicode) hostname, and your browser or other software just makes everything work. But the code to do that may be written in C (or perhaps these days Rust) and its DNS name matching only cares about ASCII, even though the actual names are Unicode.

Would you like signs with those chars?

Posted Oct 25, 2022 9:14 UTC (Tue) by jengelh (guest, #33263) [Link] (1 responses)

>if you accidentally have a character with the high bit set (because you're using ISO-8859-1 or Windows-1252 or some other 8-bit ASCII superset instead of* UTF-8), then it will sign extend and you will get nonsense.

int lowertbl[] = {-1, 0, 1, ..., 0x40, 0x61, 0x62, ...};
#define tolower(c) ((lowertbl+1)[c])

Sign extension need not produce nonsense. Thanks to the equivalence of the expressions (lowertbl+1)[c] <=> *(lowertbl+1+c) <=> *(lowertbl+c+1) <=> lowertbl[c+1], what matters is whether the pointer still points to something sensible.

Would you like signs with those chars?

Posted Oct 26, 2022 7:35 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

The standard requires that all functions implemented as macros must also be provided as actual functions, so that you can take their addresses. If implemented as a function, the parameter must be declared as int, and then c gets coerced to a negative number by sign extension before you even reach the indexing expression. You would have to pad the table with an extra 128 entries, not an extra 1 entry.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds