Riddled with errors

Posted Feb 11, 2006 4:40 UTC (Sat) by pimlott (guest, #1535)
In reply to: Riddled with errors by dractyl
Parent article: Setting up international character support (Linux.com)

That being said, I'm still forced to deal with ISO/IEC 10646's concept of the world for security reasons.
Sorry if I'm misinterpreting what you mean here, but I can't help responding to that statement with alarm. If considering that your input might be ISO/IEC 10646 makes your code more secure, you must be using slapdash programming practices (and an unsafe language) to begin with. Ill-formed input is ill-formed input, regardless of whether some related standard thinks it's valid.


Riddled with errors

Posted Feb 11, 2006 9:23 UTC (Sat) by dractyl (subscriber, #26334)

Allow me to clarify. It seems to me I can approach writing a decoder two different ways:

1. Decide that ISO/IEC 10646 is valid or common enough and play along. In this case I will decode sequences of up to 6 bytes.

2. Decide that I'd rather follow the Unicode format strictly, in which case I'll drop sequences over 4 bytes long as invalid.
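To make the two options concrete, here is a rough sketch of the one place where they actually differ: working out the sequence length from the lead byte. The function name `sequence_length` and the `allow_legacy` flag are made up for illustration and don't come from any particular library.

```python
def sequence_length(lead: int, allow_legacy: bool = False) -> int:
    """Return the total sequence length implied by a UTF-8 lead byte.

    Returns 0 if the byte cannot start a sequence. With allow_legacy=True
    the 5- and 6-byte forms from the original ISO/IEC 10646 definition are
    accepted (option 1); otherwise they are rejected (option 2).
    """
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if lead < 0xC0:
        return 0  # 10xxxxxx: continuation byte, not a lead byte
    if lead < 0xE0:
        return 2  # 110xxxxx (0xC0 and 0xC1 can only start overlong forms)
    if lead < 0xF0:
        return 3  # 1110xxxx
    if lead < 0xF8:
        return 4  # 11110xxx (range of the decoded value is checked elsewhere)
    if allow_legacy:
        if lead < 0xFC:
            return 5  # 111110xx: legacy form only
        if lead < 0xFE:
            return 6  # 1111110x: legacy form only
    return 0  # 0xFE, 0xFF, or a long form while in strict mode
```

Everything after this step (checking continuation bytes, range checks on the decoded value) would be the same in both approaches, which is part of why the choice matters so little.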

In either case, I'll have to discard code points higher than 0x10FFFF. Even 4-byte sequences can still produce illegal code points, since they carry 21 bits of data and so can encode values up to 0x1FFFFF, well past Unicode's limit. Other illegal values, such as the surrogates U+D800 to U+DFFF and the noncharacters U+FFFE and U+FFFF, as well as any other bugaboos, still need to be stripped out. As for overlong sequences, I'm more inclined to normalize them than anything else.
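As a sketch of the checks that apply whichever option is chosen, something like the following would do, assuming the Unicode ceiling of U+10FFFF. The helper names (`is_acceptable`, `minimal_length`, `is_overlong`) are hypothetical, and the list of rejected values is only the handful mentioned here, not a complete set of noncharacters.

```python
MAX_CODE_POINT = 0x10FFFF  # assumed upper limit of the Unicode code space


def is_acceptable(cp: int) -> bool:
    """Reject out-of-range values, surrogates, and U+FFFE / U+FFFF."""
    if cp < 0 or cp > MAX_CODE_POINT:
        return False  # a 4-byte sequence can reach 0x1FFFFF, which is too big
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogate range, reserved for UTF-16
    if cp in (0xFFFE, 0xFFFF):
        return False  # noncharacters
    return True


def minimal_length(cp: int) -> int:
    """Shortest UTF-8 length (1 to 4 bytes) that can hold cp."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def is_overlong(cp: int, used_length: int) -> bool:
    """True if cp was encoded in more bytes than it needs."""
    return used_length > minimal_length(cp)
```

For instance, the bytes C0 AF decode to 0x2F ("/") in two bytes where one would do, so `is_overlong(0x2F, 2)` is true. Whether a decoder then normalizes that to a plain "/" or rejects the input outright is a policy decision this sketch leaves open.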

It was not my intention to suggest detecting which form is used in a given piece of data, as that seems unnecessarily fragile and error-prone in a decoder. My point was merely that we must consider the existence of ISO/IEC 10646 when writing our software.

At the end of the day, it hardly seems to matter which you choose.

