Would you like signs with those chars?
Posted Oct 24, 2022 18:14 UTC (Mon) by ballombe (subscriber, #9523)
Parent article: Would you like signs with those chars?
Posted Oct 24, 2022 21:29 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (17 responses)
If you can find an environment where someone actually declared isalpha and relatives with arrays rather than ints, I would be very surprised, because the standard specifies that the argument must be an int, and any subsequent int-to-array conversion is the callee's problem.
* This problem is theoretically also possible if you are using UTF-8, but in that case, you'll get nonsense anyway, because UTF-8 has to be decoded before you can call functions like isalpha on it - and at that point, you've already widened everything to 32 bit, so hopefully you did it correctly.
Posted Oct 24, 2022 22:45 UTC (Mon)
by wahern (subscriber, #37304)
[Link] (3 responses)
I can't find conclusive examples for is- ctype routines, but here is how tolower was defined during the first few releases of OpenBSD, as forked from NetBSD:
#define tolower(c) ((_tolower_tab_ + 1)[c])
It's still defined similarly on NetBSD, today: http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/sys/ctype_inl...
Also, EOF is a permitted value and typically -1 (thus the +1 in the above), though that would typically only be an issue for non-C locales.
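A minimal sketch of the table-with-offset technique described above, in C. The names lower_tab, lower_tab_init, tab_tolower and char_tolower are hypothetical and are not taken from the NetBSD or OpenBSD sources; the sketch assumes EOF is -1 and an ASCII-compatible character set, and it only lowercases ASCII letters.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>   /* EOF */

/* Assumption of this sketch: EOF is -1, so it fits in the slot before 0. */
_Static_assert(EOF == -1, "sketch assumes EOF is -1");

/* Hypothetical table: slot 0 is for EOF, slots 1..UCHAR_MAX+1 for 0..UCHAR_MAX. */
static short lower_tab[UCHAR_MAX + 2];
static int lower_tab_ready;

/* Fill the table: ASCII upper-case letters map to lower case, the rest to itself. */
static void lower_tab_init(void)
{
    lower_tab[0] = EOF;
    for (int c = 0; c <= UCHAR_MAX; c++)
        lower_tab[c + 1] = (short)((c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c);
    lower_tab_ready = 1;
}

/* Lookup in the style of the macro above; c must be EOF or in 0..UCHAR_MAX. */
static int tab_tolower(int c)
{
    assert(c == EOF || (c >= 0 && c <= UCHAR_MAX));
    if (!lower_tab_ready)
        lower_tab_init();
    return (lower_tab + 1)[c];   /* same element as lower_tab[c + 1] */
}

/* Wrapper for plain char input: convert to unsigned char before widening. */
static int char_tolower(char ch)
{
    return tab_tolower((unsigned char)ch);
}
```

With this layout, tab_tolower(EOF) reads slot 0 and every unsigned char value reads a slot inside the table; a sign-extended negative char other than -1 would trip the assertion instead of reading before the table.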
Posted Oct 24, 2022 23:41 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
Even so, I don't believe that the standard *actually* says that c has to be unsigned in that expression - just that the "usual arithmetic conversions" happen (i.e. the compiler magicks it into an int when you're not looking). Compilers presumably added that warning because there were instances of arrays being indexed with negative char, but not negative int or any other signed type. And, again, that presumably had something to do with ASCII supersets and other nonsense involving dirty 7 bit channels.
> Also, EOF is a permitted value and typically -1 (thus the +1 in the above), though that would typically only be an issue for non-C locales.
The argument is of type int (according to the standard, not that untyped macro), not char, so it's completely unambiguous: You are allowed to pass negative numbers to those routines, because int is always signed, and if it is implemented as a macro, it has to accept signed values in the int range. Of course, if you pass negatives other than EOF (or whatever EOF is #define'd to), then the standard presumably gives you UB (which is why it's OK for the array implementation to walk off the end in that case).
Posted Oct 25, 2022 0:22 UTC (Tue)
by wahern (subscriber, #37304)
[Link]
Posted Oct 25, 2022 15:13 UTC (Tue)
by mrvn42 (guest, #161806)
[Link]
> The argument is of type int (according to the standard, not that untyped macro), not char, so it's completely unambiguous: You are allowed to pass negative numbers to those routines, because int is always signed, and if it is implemented as a macro, it has to accept signed values in the int range. Of course, if you pass negatives other than EOF (or whatever EOF is #define'd to), then the standard presumably gives you UB (which is why it's OK for the array implementation to walk off the end in that case).
The problem is that this will only work for values between -1 and 127 for an array of 129 bytes. A value of -2 (or any other non-ascii value other than EOF with signed char) would access memory before the array and a value of 255 (EOF mistakenly stored in an unsigned char or anything non ascii) would access memory after the array.
Looking at the source link in the other comments the BSD code seems to assume chars are unsigned. The test for ascii doesn't work with signed chars at all.
So I assume that "_tolower_tab_" is actually 257 bytes long to cover all unsigned chars and EOF (which is -1 when stored as int).
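The out-of-range indices discussed here come from how a signed char widens to int. A small sketch, assuming 8-bit chars; the helper names are hypothetical and only illustrate the conversion, not any particular libc:

```c
#include <ctype.h>

/* What a callee sees when a signed char holding a high-bit byte is widened to int. */
static int promoted_from_signed(signed char sc)
{
    return sc;                    /* sign-extends: 0xE9 arrives as -23 */
}

/* The conventional fix: go through unsigned char before widening to int. */
static int promoted_from_unsigned(signed char sc)
{
    return (unsigned char)sc;     /* 0xE9 arrives as 233 */
}

/* Wrapper that is safe to call with any char value, signed or not. */
static int safe_isalpha(char ch)
{
    return isalpha((unsigned char)ch);
}
```

Only the second form yields a value that the ctype functions are specified to accept (an unsigned char value or EOF).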
Posted Oct 25, 2022 0:31 UTC (Tue)
by tialaramex (subscriber, #21167)
[Link] (10 responses)
ASCII is a subset of UTF-8. So if you have a C library which is content to implement these functions for ASCII, they do that fine on UTF-8 data without any decoding.
You run into a problem, as you should expect, if the C library thinks the data is 8859-1 when it's actually UTF-8, but otherwise it just provides the very useful answer to the question: is this (alphabetic / a digit / punctuation / whitespace / etc.) in ASCII?
Rust deliberately provides the ASCII variants of these functions on both char (a Unicode scalar value) and u8 (an unsigned 8-bit integer ie like C's unsigned char) named is_ascii_digit and so forth. You often do not need the fancy Unicode is_digit but only is_ascii_digit for real software because overwhelmingly the "is it a digit?" question is not cultural but purely technical.
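For comparison, a C sketch of locale-independent ASCII predicates in the same spirit. The function names mirror the Rust ones but are hypothetical in C; they are not part of the standard library. Byte values are spelled out in hex so the result does not depend on the locale or on the signedness of char.

```c
#include <stdbool.h>

/* True only for the ASCII digits '0'..'9' (0x30..0x39). */
static bool is_ascii_digit(unsigned char b)
{
    return b >= 0x30 && b <= 0x39;
}

/* True only for the ASCII letters 'A'..'Z' and 'a'..'z'. */
static bool is_ascii_alphabetic(unsigned char b)
{
    return (b >= 0x41 && b <= 0x5A) || (b >= 0x61 && b <= 0x7A);
}

/* True for space and the ASCII control whitespace bytes \t \n \v \f \r. */
static bool is_ascii_space(unsigned char b)
{
    return b == 0x20 || (b >= 0x09 && b <= 0x0D);
}
```

Because every byte of a multi-byte UTF-8 sequence has the high bit set, these predicates return false for all of them and can be applied to UTF-8 data without decoding.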
Posted Oct 25, 2022 3:29 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (7 responses)
isxdigit is always 0-9A-Fa-f, and as far as I can tell from the manpage, isdigit is always 0-9. Barring those, why is it useful to ask "is this an alphabetic/punctuation/whitespace character in ASCII?" Pretty much everything is defined in terms of Unicode now; even if you're processing C code, you should still be prepared for identifiers in Russian, Greek or Chinese. I have a hard time thinking about a case where it's the right thing to check if something is some unspecified alphabetic character, but only those in ASCII.
Posted Oct 25, 2022 9:48 UTC (Tue)
by pbonzini (subscriber, #60935)
[Link] (1 responses)
Posted Oct 25, 2022 15:46 UTC (Tue)
by dvdeug (guest, #10998)
[Link]
Posted Oct 25, 2022 15:46 UTC (Tue)
by khim (subscriber, #9252)
[Link] (4 responses)
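Sure, but only the part of your program that deals with identifiers needs adjustment. You can write int foo = 42; but can not write int foo = 42; which means that you can easily use the “C” locale and all ASCII-only functions with a simple change: where before Unicode you used isalpha(c) or isalnum(c), now you would use c < 0 || isalpha(c) and c < 0 || isalnum(c). That's how doxygen handles it; I would assume someone may use isalpha(c) and/or isalnum(c) in a similar way.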
Posted Oct 25, 2022 16:01 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (3 responses)
Posted Oct 25, 2022 16:10 UTC (Tue)
by khim (subscriber, #9252)
[Link] (2 responses)
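Only the identifier is not “Unicode”. It's alpha or Unicode, then alnum or Unicode (where Unicode is defined as “anything with a high bit set”). Doxygen does that with lex, but in simpler cases you may do the same with ctype.h.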
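A rough C sketch of that rule: an identifier is a letter or high-bit byte, followed by letters, digits or high-bit bytes. The helper names are hypothetical, the underscore handling is an assumption added for C-like identifiers, and this is not how Doxygen itself is implemented. It tests b >= 0x80 on unsigned bytes, which is the same check as c < 0 on a signed char.

```c
#include <ctype.h>
#include <stddef.h>

/* Identifier start: ASCII letter, underscore, or any byte with the high bit set. */
static int is_ident_start(unsigned char b)
{
    return b >= 0x80 || b == '_' || isalpha(b);
}

/* Identifier continuation: as above, plus ASCII digits. */
static int is_ident_cont(unsigned char b)
{
    return b >= 0x80 || b == '_' || isalnum(b);
}

/* Length in bytes of the identifier at the start of s, or 0 if there is none. */
static size_t ident_len(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    if (!is_ident_start(p[0]))
        return 0;
    while (p[n] != '\0' && is_ident_cont(p[n]))
        n++;
    return n;
}
```

The scanner never decodes UTF-8; it accepts malformed sequences as identifier bytes, which matches the "who cares about incorrect programs" trade-off rather than strict validation.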
Posted Oct 25, 2022 23:41 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (1 responses)
Posted Oct 26, 2022 0:36 UTC (Wed)
by khim (subscriber, #9252)
[Link]
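Sure, but the standard doesn't say what a compiler (or, even worse, a non-compiler) has to do with broken programs. And if you ignore what the standard says and just go with isalpha/isalnum + Unicode (where Unicode == “high bit is set”) then you would handle all correct programs perfectly. And if someone feeds it an incorrect one… who cares how it would be handled? It's not as if we live in a world where everyone cares all that much about following the standard to a T.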
Posted Oct 26, 2022 7:32 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
No they won't, at best they will pass through non-ASCII without doing whatever the function is defined to do (e.g. tolower won't actually lowercase your letters), and at worst they will silently corrupt it (if they think it's one of the legacy 8-bit encodings).
> You often do not need the fancy Unicode is_digit but only is_ascii_digit for real software because overwhelmingly the "is it a digit?" question is not cultural but purely technical.
There is a subset of edge cases where a string does not contain linguistically useful information, like a phone number or UUID. In those cases, these ASCII-only functions are somewhat useful, but most of them could just as easily be done with regular expressions like [0-9]+. Realistically, you need nontrivial parsing logic anyway, to deal with things like embedded dashes and other formatting vagaries, so you may as well solve both problems with the same tool (which can and should be Unicode-capable, because ASCII is ultimately "just" a subset of UTF-8). In that context, these ASCII-only functions look rather less useful to me.
The problem is, ASCII-only functions are also an attractive nuisance. They make things a little too comfortable for the programmer who's still living in 1974, the programmer who still thinks that strings are either ASCII or "uh, I dunno, those funny letters that use the high bit, I guess?" Those programmers are the reason that so many Americans "can't" have diacritical marks in their names (on their IDs, their airline tickets, etc.). If you are writing string logic in 2022, and your strings have anything to do with real text that will be read or written by actual humans, then your strings are or should be Unicode. Unicode is the default, not the "fancy" rare exception. If you have strings, and they're not some variety of Unicode, then one of the following is true:
1. They're encoding something that sort of looks like text, but is not really text, like a phone number.
Posted Oct 26, 2022 12:03 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link]
Sure enough Rust provides to_ascii_uppercase and to_ascii_lowercase here too.
[ Rust also provides to_uppercase and to_lowercase on char, but because this is a tricky problem these are appropriately more complicated ]
I already mentioned (but you snipped) that this will go wrong if your C library thinks it knows the byte is from some legacy encoding like 8859-1
> most of them could just as easily be done with regular expressions like [0-9]+
This sort of completely inappropriate use of technology (resorting to regular expressions to just match ASCII digits) is how we get software that is hundreds of times bigger and slower than necessary.
> Realistically, you need nontrivial parsing logic anyway
Again, you seem to have determined that people would be looking at these functions where they're completely inappropriate, but C itself isn't the right choice in the applications you're thinking about.
> If you are writing string logic in 2022, and your strings have anything to do with real text that will be read or written by actual humans, then your strings are or should be Unicode.
Certainly, but again, we're not asking about Javascript or C# or even Rust, we're talking about C and most specifically about the Linux kernel. Whether the people implementing a driver for a bluetooth device are "actual humans" is I guess up for question, but they're focused very tightly on low level technical details where the fact that the Han writing system has numbers is *irrelevant* to the question of whether this byte is "a digit" in the sense they mean.
C only provides these 1970s functions, and so you're correct that you should not try to write user-facing software in C in 2022. But, the 1970s style ASCII-only functions are actually useful, like it or not, because a bunch of technical infrastructure we rely on works this way, even if maybe it wouldn't if you designed it today (or maybe it would, hard to say)
Example: DNS name labels are (a subset of) ASCII. to_ascii_lowercase or to_ascii_uppercase is exactly appropriate for comparing such labels. You might say "Surely we should just agree on the case on the wire", but actually we must not do that, at least not before all DNS traffic is DoH. It turns out our security mitigation for some issues relies on random bits in DNS queries, and there aren't really enough, so we actually put more randomness in the case bits of the letters in the DNS name. Your code therefore needs to work properly with DNS queries that have the case bits set or unset seemingly at random, so as to prevent attackers guessing the exact name sent...
The end user doesn't see any of this, you aren't expected to type randomly cased hostnames in URLs, nor to apply Punycode rules, you can type in a (Unicode) hostname, and your browser or other software just makes everything work. But the code to do that may be written in C (or perhaps these days Rust) and its DNS name matching only cares about ASCII, even though the actual names are Unicode.
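A sketch of the kind of ASCII-only label comparison described above, in C. The function names are hypothetical, labels are treated as length-delimited byte strings, and only the ASCII letters A-Z are case-folded; any other byte must match exactly.

```c
#include <stdbool.h>
#include <stddef.h>

/* Fold an ASCII upper-case letter to lower case; leave every other byte alone. */
static unsigned char ascii_lower(unsigned char b)
{
    return (b >= 0x41 && b <= 0x5A) ? (unsigned char)(b | 0x20) : b;
}

/* Compare two labels, ignoring the case of ASCII letters only. */
static bool label_eq_ignore_ascii_case(const unsigned char *a, size_t alen,
                                       const unsigned char *b, size_t blen)
{
    if (alen != blen)
        return false;
    for (size_t i = 0; i < alen; i++) {
        if (ascii_lower(a[i]) != ascii_lower(b[i]))
            return false;
    }
    return true;
}
```

A query sent as "ExAmPlE" then matches a stored "example" regardless of which case bits were randomized, and no locale setting can change the result.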
Posted Oct 25, 2022 9:14 UTC (Tue)
by jengelh (guest, #33263)
[Link] (1 responses)
int lowertbl[] = {-1, 0, 1, ..., 0x40, 0x61, 0x62, ...};
Sign extension need not produce nonsense. Thanks to the equivalence of the expressions (lowertbl+1)[c] <=> *(lowertbl+1+c) <=> *(lowertbl+c+1) <=> lowertbl[c+1], what matters is whether the pointer still points to something sensible.
Posted Oct 26, 2022 7:35 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
>> Also, EOF is a permitted value and typically -1 (thus the +1 in the above), though that would typically only be an issue for non-C locales.
> Pretty much everything is defined in terms of Unicode now; even if you're processing C code, you should still be prepared for identifiers in Russian, Greek or Chinese.
int foo = 42;
but can not write int foo = 42;
which means that you can easily use the “C” locale and all ASCII-only functions with a simple change: where before Unicode you used isalpha(c)
or isalnum(c)
now you would use c < 0 || isalpha(c)
and c < 0 || isalnum(c).
isalpha(c)
and/or isalnum(c)
in a similar way.
> It's not looking for ASCII letters; it's looking for 'i', 'n', 't', or the identifier in Unicode.
alpha
or Unicode then alnum
or Unicode (where Unicode is defined as “anything with a high bit set”).
ctype.h
isalpha/isalnum
+ Unicode (where Unicode == “high bit is set”) then you would handle all correct programs perfectly. And if someone feeds incorrect one… who cares how would it be handled?
T
2. They are raw bytes in some binary format, and not text at all.
3. In practice, they mostly are Unicode, but that's not your problem (e.g. because you're a filesystem and the strings are paths).
4. You hate your non-English-speaking users (and the English-speakers who have diacritical marks anywhere in their string for whatever reason - we shouldn't make assumptions).
5. You inherited a pile of tech debt and it's too late to fix it now.
#define tolower(c) ((lowertbl+1)[c])