
Python moratorium and the future of 2.x

Posted Nov 13, 2009 12:25 UTC (Fri) by intgr (subscriber, #39733)
In reply to: Python moratorium and the future of 2.x by spitzak
Parent article: Python moratorium and the future of 2.x

I agree with you on the point of increasing efficiency by doing fewer conversions. Basically it boils down to annotating a string with its encoding and doing the conversion lazily.

But I cannot agree with you on this claim:

> They are living in a fantasy world where UTF-8 magically has no errors in it. In the real world, if those errors are not preserved, data will be corrupted.

If your supposedly UTF-8 strings have errors in them, then that data is already corrupt. What you are proposing is sweeping those corruptions under the carpet and claiming they never existed. This is exactly the failure mode that forces every single layer of an application to implement the same clunky workarounds again and again.

The real solution to this problem is detecting corruption early -- at the source -- to prevent it from propagating any further. In fact, a frequent source of these corruptions is precisely handling UTF-8 strings as if they were a bunch of bytes.

To a coder like you, who knows all the details of character encodings, this is very obvious. However, most coders have neither the experience nor the time to think through all the issues every time they deal with strings. An "array of characters" is simply a much simpler model for them, with less opportunity to screw up.

But Python 3 doesn't force you to use this model. It simply clears up the ambiguity between what you know is text, and what isn't. In Python 2, you never knew whether a "str" variable contains ASCII text or just a blob of binary data. In Python 3, if you have data in uncertain encodings, you use the "bytes" type. Just don't claim that it's a text string -- as long as you don't know how to interpret it, it's not text.
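A minimal Python 3 sketch of the str/bytes distinction described above (the string contents are my own illustration, not from the thread):

```python
# bytes: a blob whose interpretation has not been declared yet
raw = b"caf\xc3\xa9"

# str: text, produced by an explicit, named conversion
text = raw.decode("utf-8")
assert text == "café"

# Data that is not valid UTF-8 fails loudly at the point of conversion,
# instead of silently corrupting state several layers later:
bad = b"caf\xe9"  # ISO-8859-1 bytes mislabelled as UTF-8
try:
    bad.decode("utf-8")
except UnicodeDecodeError:
    print("caught at the source")
```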

> For some reason the ability to do character = string[int] is somehow so drilled into programmers' brains that they turn into complete idiot savants

How about len(str)?


Python moratorium and the future of 2.x

Posted Nov 13, 2009 18:11 UTC (Fri) by foom (subscriber, #14868) [Link]

> How about len(str)?

That is not actually a particularly useful number to have available. Some more useful numbers:

a) Bounding box if rendered in a particular font. (or for a terminal: number of cell-widths)
b) Number of glyphs (after processing combining characters/etc)
c) Number of bytes when stored in a particular encoding.
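A small Python sketch of how those counts diverge for one and the same string (the examples here are mine, not foom's):

```python
import unicodedata

s = "e\u0301"  # 'e' plus a combining acute accent: one glyph on screen
assert len(s) == 2                                # code points
assert len(unicodedata.normalize("NFC", s)) == 1  # glyphs, roughly (b)
assert len(s.encode("utf-8")) == 3                # bytes in UTF-8 (c)
assert len(s.encode("utf-16-le")) == 4            # bytes in UTF-16 (c)
# For terminal cell widths (a), even the stdlib only gives hints:
assert unicodedata.east_asian_width("漢") == "W"  # rendered 2 cells wide
```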

Python moratorium and the future of 2.x

Posted Nov 13, 2009 20:09 UTC (Fri) by spitzak (guest, #4593) [Link]

The problem is that programmers will do exactly what you are saying: if "corrupt UTF-8" means "this data is not UTF-8", that is EXACTLY how they will treat it. They will remove all attempts to interpret ANY text as UTF-8, most likely using ISO-8859-1 instead (sometimes they will double-encode the UTF-8, which has the same end result).

You cannot fix invalid UTF-8 if you say it is somehow "not UTF-8". An obvious example: you are unable to fix an incorrectly encoded filename if the filesystem API does not handle it, because you cannot even name the offending file in the rename() call!
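For what it's worth, the approach CPython later settled on for this filename problem is PEP 383's surrogateescape error handler, which smuggles the invalid bytes through a str so they round-trip intact; a sketch, with an invented filename for illustration:

```python
raw = b"caf\xe9.txt"  # not valid UTF-8 (hypothetical filename bytes)

# surrogateescape maps each undecodable byte to a lone surrogate...
name = raw.decode("utf-8", errors="surrogateescape")

# ...so encoding the same way reproduces the original bytes exactly,
# and the ill-encoded file can still be named in a rename() call.
assert name.encode("utf-8", errors="surrogateescape") == raw
```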

You have to also realize that an error thrown when you look at a string is a DENIAL OF SERVICE. For some low-level programmer just trying to get a job done, this is about 10,000 times worse than "some foreign letters will get lost". They will debate the solution for about .001 second and they will then switch their encoding to ISO-8859-1. Or they will do worse things, such as strip all bytes with the high bit set, or remove the high bit, or change all bytes with the high bit set to "\xNN" sequences. I have seen all of these done, over and over and over again. The rules are #1: avoid that DOS error, #2: make most English still readable.

We must redesign these systems so that programmers are encouraged to work in Unicode, by making it EASY, and stop trying to be politically correct and certainly stop this bullshit about saying that invalid code points somehow make it "not UTF-8". It does not do so, any more than misspelled words make a text "not English" and somehow unreadable by an English-reader.

> How about len(str)?

You seem to be under the delusion that "the number of Unicode code points" or (more likely) "the number of UTF-16 code units" is somehow interesting. I suspect "how much memory this takes" is an awful lot more interesting, and therefore len(str) should return the number of bytes. If you really want "characters", then you are going to have to scan the string and do something about canonical decomposition and all the other Unicode nuances.
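For the record, CPython later decoupled those two numbers: len() counts code points, while memory use depends on the widest character in the string (PEP 393's flexible representation, Python 3.3). A sketch:

```python
import sys

narrow = "a" * 10
wide = "a" * 9 + "\U0001F600"  # one astral code point in the string

assert len(narrow) == len(wide) == 10  # same code-point count...
assert sys.getsizeof(wide) > sys.getsizeof(narrow)  # ...more memory
```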

Python moratorium and the future of 2.x

Posted Nov 13, 2009 21:15 UTC (Fri) by nix (subscriber, #2304) [Link]

> They will remove all attempts to interpret ANY text as UTF-8, most likely using ISO-8859-1 instead

Only if they never want any users outside the US and Europe, in which case they'll be marginalized sooner rather than later.

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds