High code-points vs. invalid representations of low ones
Posted Feb 13, 2006 5:13 UTC (Mon) by dractyl (subscriber, #26334)
In reply to: High code-points vs. invalid representations of low ones by xoddam
Parent article: Setting up international character support (Linux.com)
> Your code should NEVER produce byte sequences like the above
Unless of course I'm trying to break or subvert a UTF-8 decoder, in which case that's exactly the kind of thing I want my code to produce. ;) That's what I originally had in mind when I suggested the byte sequence.
FYI, the RFC version of the standard says you should not accept code points outside the defined range. From the standard: "There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences." Also, an integer overflow can be a very bad thing; I could see an exploitable attack (or at least a crash) arising from one, so it's best to protect against it.
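To make that concrete, here's a minimal sketch (not anyone's production code) of a decoder that applies those limits: it rejects the old 5- and 6-byte lead bytes outright, rejects overlong encodings, and refuses anything above U+10FFFF or in the surrogate range. The function name and interface are just for illustration.

    #include <stdint.h>
    #include <stddef.h>

    /* Decode one UTF-8 sequence from [*p, end). On success, advance *p
     * and return the code point; on any invalid input, return -1
     * without advancing. */
    static int32_t decode_utf8(const uint8_t **p, const uint8_t *end)
    {
        const uint8_t *s = *p;
        if (s >= end)
            return -1;

        uint8_t b = *s++;
        int32_t cp;
        int cont;       /* continuation bytes expected */
        int32_t min;    /* smallest code point legal at this length,
                           used to reject overlong encodings */

        if (b < 0x80)      { cp = b;        cont = 0; min = 0x0;     }
        else if (b < 0xC0) { return -1; }   /* stray continuation byte */
        else if (b < 0xE0) { cp = b & 0x1F; cont = 1; min = 0x80;    }
        else if (b < 0xF0) { cp = b & 0x0F; cont = 2; min = 0x800;   }
        else if (b < 0xF8) { cp = b & 0x07; cont = 3; min = 0x10000; }
        else               { return -1; }   /* 5-/6-byte forms: reject */

        while (cont--) {
            if (s >= end || (*s & 0xC0) != 0x80)
                return -1;
            cp = (cp << 6) | (*s++ & 0x3F);
        }

        /* Reject overlong forms, UTF-16 surrogates, and anything
         * past U+10FFFF. A 4-byte sequence can encode values up to
         * 0x1FFFFF, so the range check is not optional. */
        if (cp < min || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return -1;

        *p = s;
        return cp;
    }

Because the accumulator never holds more than 21 bits, a bounds check like this also rules out the integer-overflow scenario: there's no way for a hostile byte sequence to push the value past what an int32_t can hold.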
