
Python moratorium and the future of 2.x


Posted Nov 13, 2009 20:09 UTC (Fri) by spitzak (guest, #4593)
In reply to: Python moratorium and the future of 2.x by intgr
Parent article: Python moratorium and the future of 2.x

The problem is that programmers will do exactly what you are saying: if "corrupt UTF-8" means "this data is not UTF-8", that is EXACTLY how they will treat it. They will remove all attempts to interpret ANY text as UTF-8, most likely using ISO-8859-1 instead (sometimes they will double-encode the UTF-8, which has the same end result).

You cannot fix invalid UTF-8 if you say it is somehow "not UTF-8". An obvious example: you are unable to fix an incorrect UTF-8 filename if the filesystem API does not handle it, because you will be unable to name the incorrect file in the rename() call!
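A small sketch of this point, using Python 3's bytes-based filesystem API (the filename here is a hypothetical example of a byte sequence that is not valid UTF-8; this assumes a Unix filesystem, where names are arbitrary bytes):

```python
import os
import tempfile

# A hypothetical filename containing a byte that is not valid UTF-8.
bad_name = b"report-\xff.txt"

os.chdir(tempfile.mkdtemp())

# Create the file via the bytes API; a str-only API that rejected
# invalid UTF-8 could not even spell this name.
with open(bad_name, "wb") as f:
    f.write(b"data")

# The bytes name round-trips through the filesystem unchanged...
assert bad_name in os.listdir(b".")

# ...so we can repair it with rename(), which an API that refused
# to represent the broken name could never do.
os.rename(bad_name, b"report.txt")
```

The point is that to *fix* a bad name you first have to be able to *refer* to it; an API that raises an error on sight makes the repair impossible.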

You also have to realize that an error thrown when you merely look at a string is a DENIAL OF SERVICE. For some low-level programmer just trying to get a job done, this is about 10,000 times worse than "some foreign letters will get lost". They will debate the solution for about 0.001 seconds and then switch their encoding to ISO-8859-1. Or they will do worse things, such as strip all bytes with the high bit set, or clear the high bit, or change all bytes with the high bit set to "\xNN" sequences. I have seen all of these done, over and over and over again. The rules are #1: avoid that DoS error, #2: keep most English readable.

We must redesign these systems so that programmers are encouraged to work in Unicode by making it EASY, and stop trying to be politically correct, and certainly stop claiming that invalid code points somehow make data "not UTF-8". They do not, any more than misspelled words make a text "not English" and somehow unreadable by an English reader.

How about len(str)?

You seem to be under the delusion that "the number of Unicode code points" or (more likely) "the number of UTF-16 code units" is somehow interesting. I suspect "how much memory this takes" is an awful lot more interesting, and therefore len(str) should return the number of bytes. If you really, really want "characters" then you are going to have to scan the string and do something about canonical decomposition and all the other Unicode nuances.
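A short Python 3 sketch of why none of these counts agree, using a decomposed accented character as the example (code points vs. UTF-8 bytes vs. anything a user would call a "character"):

```python
import unicodedata

s = "cafe\u0301"  # 'café' with a combining acute accent (decomposed)

# len() counts code points: neither bytes nor user-visible characters.
assert len(s) == 5
assert len(s.encode("utf-8")) == 6   # U+0301 takes 2 bytes in UTF-8

# Canonical composition (NFC) merges 'e' + accent into one code point,
# changing both counts while the visible text stays "café".
composed = unicodedata.normalize("NFC", s)
assert len(composed) == 4
assert len(composed.encode("utf-8")) == 5
```

So the same visible four-character word yields 4 or 5 code points and 5 or 6 bytes depending on normalization, which is the commenter's point about needing canonical decomposition handling before "characters" mean anything.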



Python moratorium and the future of 2.x

Posted Nov 13, 2009 21:15 UTC (Fri) by nix (subscriber, #2304)

> They will remove all attempts to interpret ANY text as UTF-8, most likely using ISO-8859-1 instead

Only if they never want any users outside the US and Europe, in which case they'll be marginalized sooner rather than later.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds