Python moratorium and the future of 2.x
Posted Nov 13, 2009 3:12 UTC (Fri) by spitzak (guest, #4593)
They are living in a fantasy world where UTF-8 magically has no errors in it.
In the real world, if those errors are not preserved, data will be corrupted. This means that the "Unicode" is USELESS because we cannot store text in it. Therefore everybody will have to use byte strings and because the easiest way to avoid "errors" is to say the strings are ISO-8859-1 then we will revert to non-Unicode really fast. This is a terrible result and it is shameful that the people causing it are under the delusion that they are "helping Unicode".
For some reason the ability to do character = string[int] is somehow so drilled into programmers' brains that they turn into complete idiot savants, doing incredible amounts of insanely complex and error-prone work rather than dare to question their initial assumption and come up with the obvious solution they would reach for with any other piece of data stored in a stream, such as words.
The strings should be BYTES and there should not be two types. If you want "characters" (in the TINY TINY TINY percentage of the cases where you do) then you use an ITERATOR!!!! I.e. "for x in string", where x is set to a special item that can compare to characters and can also encode errors. To change the codec you make a different object, but the bytes just get their reference count incremented, so there is no copying and changing codecs is trivial and O(1). "Unicode strings" would mean the codec is set to UTF-8 and "bytes" would mean the codec is set to some byte version, or possibly the iterator is disallowed. Also the parser needs to translate "\uXXXX" in a string to the UTF-8 representation and "\xNN" to a byte with that value.
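A rough Python sketch of the scheme described above: one underlying byte string, with "characters" only ever produced lazily by an iterator that tolerates invalid sequences instead of raising. The `chars` helper and its error behavior are an illustration invented here, not anything Python actually ships.

```python
import codecs

def chars(data: bytes, encoding: str = "utf-8"):
    """Lazily yield characters from a byte string, without copying it.

    Invalid byte sequences come out as U+FFFD replacement characters
    rather than raising, so the underlying bytes stay authoritative.
    (This helper is a sketch of the proposal, not a stdlib API.)
    """
    dec = codecs.getincrementaldecoder(encoding)(errors="replace")
    for i in range(len(data)):
        # Feed one byte at a time; the decoder buffers partial sequences.
        for ch in dec.decode(data[i:i + 1]):
            yield ch
    yield from dec.decode(b"", final=True)

# A valid string plus one stray 0xFF byte: iteration still completes.
print(list(chars("héllo".encode() + b"\xff")))
```

Note that iteration is O(n) from the start of the string, which is exactly the trade-off the proposal accepts in exchange for O(1) codec changes and no copies.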
Posted Nov 13, 2009 12:25 UTC (Fri) by intgr (subscriber, #39733)
I agree with you on the point of increasing efficiency by doing fewer conversions. Basically it boils down to annotating a string with its encoding, and doing the conversion lazily.
But I cannot agree with you on this claim:
They are living in a fantasy world where UTF-8 magically has no errors in it. In the real world, if those errors are not preserved, data will be corrupted.
If your supposedly UTF-8 strings have errors in them, then that data is already corrupt. What you are talking about is sweeping those corruptions under the carpet and claiming that they never existed. This is exactly the failure mode that forces every single layer of an application to implement the same clunky workarounds again and again.
The real solution to this problem is detecting corruptions early -- at the source -- to keep them from propagating any further. In fact, a frequent source of these corruptions is precisely the practice of handling UTF-8 strings as if they were a bunch of bytes.
To a coder like you, who knows all the details of character encodings, this is very obvious. However, most coders have neither the experience nor the time to think through all the issues every time they deal with strings. Having an "array of characters" is simply a much easier model for them, with less opportunity to screw up.
But Python 3 doesn't force you to use this model. It simply clears up the ambiguity between what you know is text, and what isn't. In Python 2, you never knew whether a "str" variable contains ASCII text or just a blob of binary data. In Python 3, if you have data in uncertain encodings, you use the "bytes" type. Just don't claim that it's a text string -- as long as you don't know how to interpret it, it's not text.
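The str/bytes split described here can be shown in a few lines (the `raw` value is a made-up example, not from the thread):

```python
# In Python 3 the two types are distinct: bytes is a blob of unknown
# provenance, str is text, and the decoding step between them is explicit.
raw = b"caf\xc3\xa9"            # five bytes, encoding not yet known
text = raw.decode("utf-8")      # an explicit claim: "this is UTF-8 text"

assert len(raw) == 5            # bytes
assert len(text) == 4           # characters: 'café'

# Mixing the two is a TypeError rather than a silent source of mojibake,
# unlike Python 2's implicit str/unicode coercion:
try:
    raw + text
except TypeError:
    print("bytes + str refused")
```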
For some reason the ability to do character = string[int] is somehow so drilled into programmers brains that they turn into complete idiot savants
How about len(str)?
Posted Nov 13, 2009 18:11 UTC (Fri) by foom (subscriber, #14868)
That is not actually a particularly useful number to have available. Some more useful numbers:
a) Bounding box if rendered in a particular font. (or for a terminal: number of cell-widths)
b) Number of glyphs (after processing combining characters/etc)
c) Number of bytes when stored in a particular encoding.
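These measures genuinely disagree, which is the point being made. A small Python demonstration, using a combining character so that "code points", "glyphs", and "bytes" all give different answers:

```python
import unicodedata

s = "e\u0301"   # 'e' plus U+0301 combining acute: one glyph, two code points

assert len(s) == 2                                  # code points
assert len(unicodedata.normalize("NFC", s)) == 1    # glyphs, after composing
assert len(s.encode("utf-8")) == 3                  # bytes in UTF-8
assert len(s.encode("utf-16-le")) == 4              # bytes in UTF-16
```

(Even NFC is only an approximation of "number of glyphs": some combining sequences have no precomposed form, and cell widths in a terminal are yet another question.)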
Posted Nov 13, 2009 20:09 UTC (Fri) by spitzak (guest, #4593)
You cannot fix invalid UTF-8 if you insist it is somehow "not UTF-8". An obvious example: you cannot fix an incorrect UTF-8 filename if the filesystem API refuses to handle it, because you will be unable to name the broken file in the rename() call!
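This is the exact case PEP 383 (accepted for Python 3.1, around the time of this thread) addresses: the `surrogateescape` error handler maps each undecodable byte to a lone surrogate, and the reverse mapping restores the original bytes exactly, so a broken filename can still be passed back to rename(). A short demonstration:

```python
import os

# A filename whose bytes are not valid UTF-8:
bad = b"caf\xff.txt"

# surrogateescape turns the stray 0xFF into the lone surrogate U+DCFF
# instead of raising, and encoding reverses it losslessly:
name = bad.decode("utf-8", "surrogateescape")
assert name.encode("utf-8", "surrogateescape") == bad

# os.fsdecode()/os.fsencode() apply the same scheme, which is how
# Python 3 programs can still open and rename such files.
assert os.fsencode(os.fsdecode(bad)) == bad
```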
You have to also realize that an error thrown when you look at a string is a DENIAL OF SERVICE. For some low-level programmer just trying to get a job done, this is about 10,000 times worse than "some foreign letters will get lost". They will debate the solution for about .001 second and they will then switch their encoding to ISO-8859-1. Or they will do worse things, such as strip all bytes with the high bit set, or remove the high bit, or change all bytes with the high bit set to "\xNN" sequences. I have seen all of these done, over and over and over again. The rules are #1: avoid that DOS error, #2: make most English still readable.
We must redesign these systems so that programmers are encouraged to work in Unicode, by making it EASY, and stop trying to be politically correct and certainly stop this bullshit about saying that invalid code points somehow make it "not UTF-8". It does not do so, any more than misspelled words make a text "not English" and somehow unreadable by an English-reader.
How about len(str)?
You seem to be under the delusion that "the number of Unicode code points" or (more likely) "the number of UTF-16 code units" is somehow interesting. I suspect "how much memory this takes" is a lot more interesting, and therefore len(str) should return the number of bytes. If you really really want "characters" then you are going to have to scan the string and do something about canonical decomposition and all the other Unicode nuances.
Posted Nov 13, 2009 21:15 UTC (Fri) by nix (subscriber, #2304)
They will remove all attempts to interpret ANY text as UTF-8, most likely using ISO-8859-1 instead.
Posted Nov 19, 2009 12:34 UTC (Thu) by yeti-dn (guest, #46560)
That's the real core of the problem.
The scary myths about broken UTF-8 are very likely being spread by precisely the same people who broke the UTF-8 in the first place because they cannot imagine anything beyond ISO-8859-1.
I live outside the US and Western Europe. If I do something as sloppy as treating UTF-8 text as ISO-8859-1 (removing 8-bit chars, escaping, whatever), I completely mangle it. So people here don't do it. I rarely encounter broken UTF-8, and when I do, it invariably comes from Western countries.
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds