Language misfeature

Posted Jun 22, 2011 20:33 UTC (Wed) by eru (subscriber, #2753)
Parent article: A hole in crypt_blowfish

When the byte value stored at *ptr has its high bit set, it is treated as a negative number.

I assume ptr is a char*, in which case this is not completely correct: The unadorned char actually has implementation-defined signedness. Most C implementations make it signed, but there are some that don't, and they are quite legal. This bug would not be present if compiled with them.
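A minimal sketch of the effect under discussion, not the actual crypt_blowfish code. The helper names are made up for illustration, and the "signed" variant uses an explicit signed char so the sign extension shows up even on platforms where plain char is unsigned. It assumes a two's complement machine.

```c
#include <stdint.h>

/* Hypothetical helper: append one byte to a 32-bit accumulator the way a
 * platform with signed plain char would do it. */
uint32_t append_sign_extended(uint32_t acc, unsigned char byte)
{
    /* 0x80..0xFF become negative here (two's complement assumed). */
    signed char c = (signed char)byte;

    acc <<= 8;
    acc |= (uint32_t)c;   /* a negative value sets all the upper bits */
    return acc;
}

/* Hypothetical helper: the same operation with an unsigned byte. */
uint32_t append_zero_extended(uint32_t acc, unsigned char byte)
{
    acc <<= 8;
    acc |= byte;          /* always 0..255, upper bits are left alone */
    return acc;
}
```

With an accumulator of 0x12 and a byte of 0x80, the zero-extended version yields 0x1280, while the sign-extended version wipes out the earlier byte and yields 0xFFFFFF80.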

This almost-but-not-quite-always signedness of char is one of the stupidest features of C, contributing to quite a lot of bugs (including some in my own code). I really hate it.


Language misfeature

Posted Jun 22, 2011 20:47 UTC (Wed) by daney (subscriber, #24551)

You can always do:

#include <stdint.h>

And use uint8_t (or int8_t) instead of char.

Language misfeature

Posted Jun 22, 2011 21:49 UTC (Wed) by samroberts (subscriber, #46749)

Only if you ignore or suppress sign mismatch warnings (but doing so can also lead to this kind of bug), or code entirely in your own world, never using string literals (the type of "xx" is array of char, which decays to char*) or exchanging data with library functions accepting or returning char*.

Language misfeature

Posted Jun 22, 2011 22:19 UTC (Wed) by dgm (subscriber, #49227)

All is fine and well with both signed and unsigned chars, as long as you only use them to store characters. This "misfeature" is only troublesome when you start to do arithmetic with chars, which most library functions do not.

So, if you plan to do arithmetic on characters, better cast them to int or unsigned char, or whatever type has the properties you need for your operations.

Language misfeature

Posted Jun 23, 2011 4:18 UTC (Thu) by eru (subscriber, #2753)

This "misfeature" is only troublesome when you start to do arithmetic with chars, which most library functions do not.

Not just arithmetic. Even indexing an array with a char value will bite the unwary.
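A small sketch of the indexing trap, using a made-up helper that counts byte frequencies. Writing counts[buf[i]] would index with a negative value wherever plain char is signed and the byte has its high bit set. The cast to unsigned char is the usual way around it.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: count how often each byte value occurs in
 * buf[0..len). counts must have UCHAR_MAX + 1 elements. */
void count_bytes(const char *buf, size_t len, size_t counts[UCHAR_MAX + 1])
{
    for (size_t i = 0; i <= UCHAR_MAX; i++)
        counts[i] = 0;

    for (size_t i = 0; i < len; i++) {
        /* The cast keeps the index in 0..UCHAR_MAX whether plain char
         * is signed or unsigned on this implementation. */
        counts[(unsigned char)buf[i]]++;
    }
}
```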

Language misfeature

Posted Jun 23, 2011 8:48 UTC (Thu) by kris.shannon (subscriber, #45828)

Array indexing is pointer arithmetic.

Language misfeature

Posted Jun 23, 2011 12:13 UTC (Thu) by dgm (subscriber, #49227)

I assume you mean comparing the value of characters, taken as numbers. That's arithmetic too.

Language misfeature

Posted Jun 23, 2011 15:27 UTC (Thu) by jani (subscriber, #74547)

ctype.h functions don't seem like arithmetic, but for the parameter:

"In all cases c is an int, the value of which must be representable as an unsigned char or must equal the value of the macro EOF. If the argument has any other value, the behaviour is undefined."

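A sketch of what that requirement means in practice, with a made-up helper name. Passing a plain char straight to isalpha() is undefined when the value is negative (and not EOF), so the commonly recommended idiom is to cast to unsigned char first.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count the alphabetic characters in a
 * NUL-terminated string. */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* isalpha(*s) would be undefined for negative *s; the cast
         * makes the argument representable as an unsigned char. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```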
Language misfeature

Posted Jun 22, 2011 22:22 UTC (Wed) by vapier (subscriber, #15768)

In the GCC world, "implementation" does not simply mean "GCC": the ports are not all the same. The default signedness of char can differ between architectures, and even between OS targets for the same architecture.

$ grep DEFAULT_SIGNED_CHAR gcc/config/*/*.h | grep -c 0
19
$ grep DEFAULT_SIGNED_CHAR gcc/config/*/*.h | grep -c 1
28

For example, mips generally defaults to 1 (signed), but mips/irix defaults to 0 (unsigned).

Long story short: if you want a signed char, you had better write "signed char".
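For code that needs to know which way a given target went, a minimal sketch using CHAR_MIN from <limits.h>. The function name is made up for illustration. GCC also has the -fsigned-char and -funsigned-char options to override the target default.

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if plain char is signed on this
 * implementation, 0 if it is unsigned. CHAR_MIN is 0 exactly when
 * plain char is unsigned. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```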

Language misfeature

Posted Jun 23, 2011 15:17 UTC (Thu) by HelloWorld (guest, #56129)

Weak typing is generally a horrible idea, and it's yet another reason to just avoid C whenever you can.

Language misfeature

Posted Jun 24, 2011 12:38 UTC (Fri) by wookey (subscriber, #5501)

It's true: this bug would (maybe; one would need to check the details of the code) not manifest itself on ARM machines, where chars are unsigned by default.

NOT a Language misfeature, it's a Programming Error

Posted Jun 28, 2011 10:30 UTC (Tue) by lacos (subscriber, #70616)

Anybody who ever looked at the C89 standard, section "6.1.2.5 Types" (I'm even considering the age of the program in question here) knows that "char" may have value representation equivalent to that of "char signed".

paragraph 2:

----v----
An object declared as type _char_ is large enough to store any member of the basic execution character set. If a member of the required source character set enumerated in 5.2.1 is stored in a _char_ object, its value is guaranteed to be positive. If other quantities are stored in a char object, the behavior is implementation-defined; the values are treated as either signed or nonnegative integers.
----^----

The real fuckup here is that "ptr" was declared pointer-to-char, instead of pointer-to-char-unsigned. "char unsigned" is the type to access binary data or the object representation of objects.

"char" (which is a different type from "char signed", but may have identical value representation) does not even have to be able to represent more than 255 (NOT 256!) distinct values, if we rely on nothing else than the C89 standard. This accommodates sign-magnitude and one's complement representations. See C89 "5.2.4.2.1 Sizes of integral types <limits.h>":

----v----
[...] Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.

- number of bits for smallest object that is not a bit-field (byte)
CHAR_BIT 8

- minimum value for an object of type signed char
SCHAR_MIN -127

- maximum value for an object of type signed char
SCHAR_MAX +127

- maximum value for an object of type unsigned char
UCHAR_MAX 255

- minimum value for an object of type char
CHAR_MIN see below

- maximum value for an object of type char
CHAR_MAX see below

[...]

If the value of an object of type char is treated as a signed integer when used in an expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of UCHAR_MAX.
----^----

"char *" is there so you can work with _TEXT_ that consists of elements of the (basic or extended) execution character set. "char unsigned *" is there for everything else "binary". If in doubt, use "char unsigned *".

Another example: reading something from a socket and using string functions (like strstr(), strcmp() etc) on the result is _broken_, from a portability aspect.
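One way to read this advice, sketched with a made-up helper: treat received data as unsigned char with an explicit length, and search it with memcmp() instead of strstr(). Unlike the string functions, this does not depend on a terminating NUL and is not confused by embedded zero bytes or high-bit bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find needle in hay, both given with explicit
 * lengths. Returns the offset of the first match, or -1 if none. */
long find_bytes(const unsigned char *hay, size_t hay_len,
                const unsigned char *needle, size_t needle_len)
{
    if (needle_len == 0 || needle_len > hay_len)
        return -1;

    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        if (memcmp(hay + i, needle, needle_len) == 0)
            return (long)i;
    }
    return -1;
}
```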

NOT a Language misfeature, it's a Programming Error

Posted Jun 29, 2011 6:34 UTC (Wed) by eru (subscriber, #2753)

I think you missed my point. I do agree it is a programming error. But this error (and countless others like it) was made much easier to commit by this language misfeature, which makes straightforward code behave in a very non-intuitive way.

unsigned char, by the way, is a fairly "new" invention. The original K&R C did not have it, just the char with implementation-dependent signedness. Therefore there is probably still legacy code around that has bugs like this, because char was the only way to represent a byte and the coders did not remember to put masking around all widening accesses.
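A sketch of the masking idiom meant here, with a made-up helper name, assuming a two's complement machine and 8-bit chars. Each plain char is masked with 0xFF as it is widened, so any sign extension is discarded before the bytes are combined.

```c
/* Hypothetical helper: build a 32-bit big-endian value from four
 * plain chars, masking each one as it is widened. */
unsigned long load_be32(const char *p)
{
    return ((unsigned long)(p[0] & 0xFF) << 24)
         | ((unsigned long)(p[1] & 0xFF) << 16)
         | ((unsigned long)(p[2] & 0xFF) << 8)
         |  (unsigned long)(p[3] & 0xFF);
}
```

Forgetting the mask on any one of the four accesses is enough to reintroduce the problem on a signed-char platform.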

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds