High code-points vs. invalid representations of low ones
Posted Feb 13, 2006 4:28 UTC (Mon) by xoddam
In reply to: Riddled with errors
Parent article: Setting up international character support (Linux.com)
> a nice string like 0xFC 0x80 0x80 0x80 0x80 0x80 0x80 0x8A would
> screw up some of the stupider UTF-8 engines.
You're absolutely right, and for exactly that reason only the shortest
possible representation of a code point is valid UTF-8. Your code should
NEVER produce byte sequences like the above, and you should treat such
input as an attempt to sneak something past either your program or its
user. The best policy (I'm not sure whether it's a MUST or a SHOULD in
the relevant standards) is to reject either the offending character or
the entire input on these grounds.
None of which decisively determines whether you should accept code points
so high that they haven't been assigned yet. The worst that can happen is
an integer overflow (signed Unicode characters, anyone?); otherwise you
just get a weird character. Throw it away or draw a box; I doubt anyone
really cares which.
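A quick sketch of the distinction (assuming Python 3's decoder): anything beyond U+10FFFF can't be expressed in well-formed UTF-8 at all, while a merely unassigned code point inside the range round-trips fine -- that's the "weird character" case.

```python
# Bytes that would encode U+110000, one past the top of the Unicode range.
# A conforming decoder refuses them (0x90 is not a legal continuation
# byte after 0xF4), so overflow never reaches your integer type.
too_high = bytes([0xF4, 0x90, 0x80, 0x80])
try:
    too_high.decode("utf-8")
except UnicodeDecodeError:
    print("rejected: beyond U+10FFFF")

# A valid but "weird" code point (private use, no assigned meaning)
# encodes and decodes without complaint -- box-drawing is up to the font.
weird = chr(0x10FFFD)
assert weird.encode("utf-8").decode("utf-8") == weird
```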