Proposal: Moratorium on Python language changes

Posted Oct 23, 2009 18:01 UTC (Fri) by pboddie (guest, #50784)
In reply to: Proposal: Moratorium on Python language changes by sergey
Parent article: Proposal: Moratorium on Python language changes

> Python 3 needed a clean break from 2 for multiple reasons. The way I saw it was that modern requirements (Unicode support, "professional open source") seriously challenged some of the original design principles and conventions.

Python 2 supported Unicode fairly well. The only thing it didn't do was confront programmers with Unicode and good text handling practices immediately, which is what making Unicode objects the main string type seems to achieve, albeit with various caveats. I recall some cases (although not the details) where this can be counterproductive - generally where system interfaces, input/output and filesystems are involved - and Python 3's approach doesn't necessarily solve all the problems. You'll have to elaborate on what "professional open source" means - it can't have too much to do with things like better support for binary compatibility (although some work has been done on this) or cross-compilation (another area where Python sees a lot of action without any support from the core developers), which are arguably big "professional" areas.

> I don't doubt that new users will benefit from Python 3's changes. As for book writers, they've got a reason to publish updated or completely new books; what's not to like?

Not a great deal unless you have a book on Python 2.x out. It's also somewhat embarrassing for all the books already out there to suddenly be incompatible with the "mainstream" edition of the language: a great way to dilute the "brand". Obviously, people can argue that PHP and a variety of other technologies do this kind of thing all the time, but it gives ammunition to people like managers who think that everyone should just write Java programs and not use "this open source experimental stuff".

As for the standard library, I confess to having had an interest in advocating work on renovating it, and the response was more or less a brush-off. With stuff like PyPy and Unladen Swallow (and lots of other work, mostly hovering around the Python 2.x language definition), had the standard library been prioritised instead of Python 3, I don't think many people would really miss an absent Python 3 at all.


"Unicode"

Posted Oct 29, 2009 19:52 UTC (Thu) by spitzak (guest, #4593) [Link]

Python is doing "Unicode" wrong, and Python 3.0 is making things worse.

Strings should and MUST be bytes. In the real world UTF-8 can have invalid bytes in it, and UTF-16 can have invalid sequences. The Python implementers are living in a fantasy land where they believe that declaring data to be "UTF-8" somehow magically causes all the invalid sequences to disappear, as though the laws of quantum physics make them impossible.

Here is what really happens: the very first moment that some programmer gets an error when reading invalid UTF-8, they will "fix" it by changing the encoding to ISO-8859-1. They are replacing a total Denial of Service failure with a minor defect that all the non-ASCII characters are mangled. This is a no-brainer choice for the programmers and pretending they will act otherwise is stupid.
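
To make the trade-off concrete, here is a minimal Python 3 sketch (the sample bytes and function names are made up for illustration). Strict UTF-8 decoding raises on a single stray byte, while decoding the same data as ISO-8859-1 never fails but mangles every multi-byte character:

```python
# Illustrative input: valid UTF-8 for "café" followed by one stray byte 0xFF.
data = "café".encode("utf-8") + b"\xff"


def decode_strict(raw: bytes) -> str:
    """Strict UTF-8 decoding: raises UnicodeDecodeError on any invalid byte."""
    return raw.decode("utf-8")


def decode_latin1(raw: bytes) -> str:
    """The 'quick fix': every byte maps to a character, so this never fails."""
    return raw.decode("iso-8859-1")


try:
    decode_strict(data)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The two UTF-8 bytes of "é" come out as two separate Latin-1 characters.
mangled = decode_latin1(data)
```

The ISO-8859-1 result can be encoded back to the original bytes without loss, which is part of why it is such a tempting workaround.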

Ballmer and the Microsoft programmers are laughing themselves silly right now as they watch FOSS swiftly sabotage any ability to handle Unicode with these moronic ideas and relegate Linux to an ISO-8859-1-only ghetto. This is shameful, especially when you consider that UTF-8 was invented by Ken Thompson and Rob Pike for Plan 9, a Unix derivative.

I should be able to reliably take ANY byte array and turn it into "unicode" and back with NO errors and NO loss. Any other design makes Python useless and forces me to use "bytes" for all data and to set the encoding to ISO-8859-1 so those API's that want "Unicode" will not throw errors and crash my program.

Instead I believe the amount of Python code that would fail if string[x] returned the x'th byte from the UTF-8 encoding is minuscule. All such code is searching for ASCII and will still work.

String constants should be bytes. "\uXXXX" means the bytes for the UTF-8 encoding. "\xNN" means exactly that byte, even if the result is invalid UTF-8.

If you want Unicode code points, have "for x in string" return them, but as a special new object where encoding errors are unique different values that can be tested and don't compare equal to any character. Converting to "Unicode" would simply add a flag to the string object that indicates that "for x in string" acts this way. There could be other flags for all the different codecs, and it could also do canonical composition/decomposition and other Unicode actions.

"Unicode"

Posted Oct 30, 2009 20:31 UTC (Fri) by nix (subscriber, #2304) [Link]

> Here is what really happens: the very first moment that some programmer gets an error when reading invalid UTF-8, they will "fix" it by changing the encoding to ISO-8859-1. They are replacing a total Denial of Service failure with a minor defect that all the non-ASCII characters are mangled. This is a no-brainer choice for the programmers and pretending they will act otherwise is stupid.

Maybe in the US (hell, there they'd probably just go back to 7-bit ASCII!); maybe in parts of Europe. But a lot of people want stuff to work in the Far East now...

"Unicode"

Posted Oct 30, 2009 21:59 UTC (Fri) by spitzak (guest, #4593) [Link]

> hell, there they'd probably just go back to 7-bit ASCII!

Indeed they have. I have personally encountered software that "fixed" encoding problems by masking the high bit, by removing all bytes with the high bit set, and by replacing all bytes with the high bit set with "\xNN" sequences. So claiming that they would even preserve ISO-8859-1 was perhaps being too kind. In fact we are regressing to earlier than the 1980's by going ASCII-only.

What is happening in the Far East is that Asian text is getting stored in UCS-2 (though they may claim it is UTF-16), or in non-error-throwing encodings such as the older JP multibyte, while all other text is in ISO-8859-1 or ASCII (they may claim it is UTF-8). Thus text is relegated to two different file types, the exact thing Unicode was supposed to fix!

"Unicode"

Posted Oct 31, 2009 0:24 UTC (Sat) by nix (subscriber, #2304) [Link]

> In fact we are regressing to earlier than the 1980's by going ASCII-only.

I don't know who 'we' is, but it doesn't describe any software development shop I know of. Everyone is more i18n-aware than they used to be, not less.

As for text being relegated to multiple encodings, well, Unicode is rapidly conquering over there, as well. Yes, you have to distinguish between UCS-2 and UTF-8, but you've had to do that for ages, and there are pretty accurate heuristics now. Needing heuristics to detect encodings is nothing new, either: we've always needed them for EBCDIC-versus-ASCII, even before ISO-8859 was heard of.

And this problem of illegal UTF-8 characters which you claim is so catastrophic? I've never once seen them outside fuzz tests, attempted attacks, and while debugging a heuristic charset detector. They just don't occur in normal use of a system, at all. Catastrophe? No. The security implications are interesting, but not as significant as problems with equality-comparing UTF-8 characters without considering that they may not be in canonical form -- a problem you didn't mention.
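
The canonical-form problem can be shown with a short Python 3 sketch using the standard `unicodedata` module (the helper name `canonically_equal` is only illustrative). Two strings that render identically compare unequal until both are normalized:

```python
import unicodedata

composed = "\u00e9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


naive_equal = composed == decomposed
normalized_equal = canonically_equal(composed, decomposed)
```

The UTF-8 byte sequences of the two forms also differ, so a byte-wise comparison has exactly the same blind spot.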

"Unicode"

Posted Oct 31, 2009 1:47 UTC (Sat) by spitzak (guest, #4593) [Link]

The most obvious catastrophe is the inability to access data that is not valid UTF-8, even to fix it.
You can't fix incorrect UTF-8 if your editor refuses to load the file. For a more obvious example,
you cannot correct an incorrect UTF-8 filename if your filesystem API refuses to provide a way to
identify that file in the rename call.

If you have not seen a junior programmer "fix" these by treating the UTF-8 as ISO-8859-1
(sometimes done by "double encoding UTF-8" but the result is the same) then I don't think you
have worked very much with teams of programmers. This is destroying I18N on Linux and in many
internet standards. On Windows it is destroying UTF-16 but it is less of a problem as only non-BMP
characters are lost.

I think changing to iterators is the first step to correctly handling canonical forms and all the other
Unicode problems. This insistence on changing it to a fixed size array and ignoring patterns is
actually a deterrent to correct comparisons.

"Unicode"

Posted Oct 31, 2009 12:37 UTC (Sat) by nix (subscriber, #2304) [Link]

How often do you *see* allegedly-UTF-8 data that isn't valid UTF-8? In my
experience it's vanishingly rare, much less common than encountering
Unicode mapping to unmapped codepoints. What's more, both are dealt with
the same way: the latter is shown using a square box glyph (losing
information about precisely which character it is, but you rarely care);
the former is dealt with by transforming it into a convenient valid
character, often a form of ? or the replacement character, or a graphical
box containing the invalid bytes (you sometimes lose information about
precisely what the invalid string was, but you rarely care). (Noncanonical
UTF-8 is generally quietly canonicalized.)
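
As a rough illustration of the replacement approach, Python 3's built-in `errors="replace"` handler substitutes U+FFFD for each undecodable sequence. The sketch below (sample bytes are made up) also shows the information loss mentioned above: two different invalid inputs become the same string.

```python
def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for anything that is not valid."""
    return raw.decode("utf-8", errors="replace")


# Two different invalid inputs...
first = decode_lossy(b"a\xffb")
second = decode_lossy(b"a\xfeb")
# ...are indistinguishable once decoded.
```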

You are making a mountain out of a very, very small and already-levelled
molehill: Python's behaviour is known bad and will almost certainly be
fixed, that's why there was such a lot of noise over it. To claim that
it's 'destroying' UTF-8 is utterly laughable.

As for filenames in UTF-8, well, that's why POSIX considers filenames to
be a byte string. So should interfaces to POSIX. This is unlikely to
affect anything but a language that nobody much uses yet and an OS
(Windows/NTFS) that has taken considerable pain over this (especially
combined with case-insensitivity) and which thankfully is not an OS this
site is about.

"Unicode"

Posted Nov 2, 2009 18:37 UTC (Mon) by spitzak (guest, #4593) [Link]

I agree it is very rare, but it just takes ONE failure to make a programmer say "forget that, I'll treat it as ISO-8859-1 because I don't give a s**t about Chinese..."

You would like errors to turn into boxes, but the majority of software does not do that; instead it throws exceptions, which in most cases is equivalent to a Denial of Service if in fact there is no other way to convey the string to the back end. Particularly nasty for me are Python's conversions to "Unicode", Qt strings, Qt's HTML renderer, and the XRender "draw this UTF-8 string" API. I am sure there are many many other examples.

In my ideal solution, conversion is deferred until as late as possible, probably as part of the glyph layout code (ie Pango, etc). At this point it is harmless to make a lossy conversion (since layout is lossy anyway, doing canonicalization), and I would convert the error bytes to the matching characters in the Microsoft CP1252 character set. This has the advantage that accidental non-UTF-8 is readable by the users. Believe me they really don't want to see boxes!
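
One way this display-time fallback could be sketched in Python 3 is a custom decode error handler registered through `codecs.register_error`. The handler name `cp1252_fallback` and the function `decode_for_display` are hypothetical, and falling back to ISO-8859-1 for the few bytes CP1252 leaves unassigned is an assumption of this sketch, not part of the proposal:

```python
import codecs


def cp1252_fallback(err):
    """Decode-error handler: show each undecodable byte as its CP1252 character."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # A few bytes (such as 0x81) are unassigned in CP1252; this sketch
    # assumes ISO-8859-1 is an acceptable fallback for those.
    text = "".join(
        bytes([b]).decode("cp1252", errors="ignore") or chr(b) for b in bad
    )
    return text, err.end  # resume decoding after the bad bytes


codecs.register_error("cp1252_fallback", cp1252_fallback)  # hypothetical name


def decode_for_display(raw: bytes) -> str:
    """Lossy, display-only decoding: valid UTF-8 is kept, the rest is guessed."""
    return raw.decode("utf-8", errors="cp1252_fallback")


shown = decode_for_display(b"na\xefve \x93quoted\x94")
```

The result is only suitable for layout; it cannot be converted back to the original bytes, which is why the conversion belongs as late as possible.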

The Python "solution" of turning errors into 0xDCxx sort of works, but has the nasty problem that you must track the original source of a string to properly convert back to UTF-8 or UTF-16. If you don't, you either make it impossible to produce all possible UTF-16 strings (very bad because you will be unable to name all files on Windows), or you make it possible for a malicious invalid UTF-16 string to turn into a valid UTF-8 string. For this reason I don't think this solution is going to work, and that keeping the strings as UTF-8 (and converting UTF-16 TO UTF-8, which is lossless) is the only way to go.
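
For reference, the mechanism in question is Python 3's `surrogateescape` error handler (PEP 383), which maps each undecodable byte 0xNN to the lone surrogate U+DCNN. A minimal sketch of both the round trip and the concern raised above, with made-up sample data:

```python
# Round trip: an invalid byte survives decoding and re-encoding.
raw = b"caf\xe9"                                   # Latin-1 bytes, not valid UTF-8
text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")

# The concern: a string that already contains lone surrogates for some other
# reason (for example, it came from invalid UTF-16) is indistinguishable from
# one produced by the handler, so encoding it can yield well-formed UTF-8.
suspicious = "\udcc3\udca9"
smuggled = suspicious.encode("utf-8", errors="surrogateescape")
```

Here `smuggled` is the two bytes 0xC3 0xA9, which a strict UTF-8 decoder happily reads as "é".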

"POSIX considers filenames to be byte strings": this sort of statement is the problem. Of course it is byte strings. What you are really saying is "I will pretend the problem does not exist by declaring anything that might contain errors to be "not UTF-8"". The problem is that at some point somebody wants to look at what the byte string means to the user, they will indeed have to say "oh yes this *is* UTF-8". Or worse, they might say "oh this is ISO-8859-1 because then I know my program won't throw a damn exception". Statements like this are exactly the problem I am hoping can be fixed.

In reality, of course filenames are "byte strings", but this is because UTF-8 is a byte string. ALL BYTE STRINGS ARE UTF-8. They are also ASCII and ISO-8859-1 or JP encoded or random binary garbage! They can have invalid UTF-8 sequences in them. They can also have misspelled words, control characters, or they can have French words in them while the program thinks they are English. They can have invalid Unicode glyph sequences such as misplaced combining accents. They can spell out a false math proof, or a political opinion that you disagree with. There are billions of errors that can be in the string. Deal with it correctly, instead of declaring that some tiny ill-defined subset of possible errors make the string be "not UTF-8".

"Unicode"

Posted Nov 3, 2009 6:47 UTC (Tue) by Cato (subscriber, #7643) [Link]

I think the answer is more configurability of the conversion process - depending on the context you may want to stop the conversion or insert substitute characters as XML entities, \xNNNN, etc. Perl's Encode module does a pretty good job here.
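
Python 3 offers similar configurability through the `errors` argument of `bytes.decode` and `str.encode`. A small sketch, assuming Python 3.5 or later (where `backslashreplace` also works for decoding); the wrapper names are illustrative only:

```python
def decode_with_policy(raw: bytes, policy: str = "strict") -> str:
    """Decode UTF-8 with a caller-chosen built-in error handler."""
    return raw.decode("utf-8", errors=policy)


def encode_ascii_with_policy(text: str, policy: str = "strict") -> bytes:
    """Encode to ASCII with a caller-chosen built-in error handler."""
    return text.encode("ascii", errors=policy)


raw = b"caf\xe9"  # made-up sample: Latin-1 bytes, not valid UTF-8

dropped = decode_with_policy(raw, "ignore")              # bad bytes removed
boxed = decode_with_policy(raw, "replace")               # U+FFFD substituted
escaped = decode_with_policy(raw, "backslashreplace")    # literal \xNN text
entity = encode_ascii_with_policy("caf\u00e9", "xmlcharrefreplace")
```

With the default `"strict"` policy the same call raises `UnicodeDecodeError`, so stopping the conversion is also one of the available choices.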

Also, filenames are not always byte strings, unfortunately - every filesystem has various illegal characters, and NTFS and HFS+ expect valid UTF-8 (HFS+ uses UTF-16 internally, and it must also be decomposed Unicode i.e. NFD).

"Unicode"

Posted Nov 3, 2009 19:17 UTC (Tue) by spitzak (guest, #4593) [Link]

> filenames are not always byte strings, unfortunately - every filesystem has various illegal characters

That is the opposite problem. The problem I am trying to solve is that the filesystem can have filenames that are NOT possible in the API that libraries are providing.

The only non-byte-stream filename API that is used at all is UTF-16. However UTF-16 (including invalid UTF-16) can be losslessly translated to UTF-8 and then back to UTF-16. Therefore all filesystems in existence can be controlled by a byte stream API, using UTF-8 as the encoding.
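
A sketch of that round trip in Python 3, assuming version 3.4 or later (where the `surrogatepass` handler is supported by the UTF-16 codecs) and an even-length little-endian input. The function names are illustrative. An unpaired surrogate is carried through as a three-byte sequence that strict UTF-8 would reject:

```python
def utf16_to_utf8(raw16: bytes) -> bytes:
    """Convert UTF-16-LE to UTF-8-style bytes, keeping unpaired surrogates."""
    text = raw16.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")


def utf8_to_utf16(raw8: bytes) -> bytes:
    """Inverse of utf16_to_utf8 for bytes that function produced."""
    text = raw8.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16-le", errors="surrogatepass")


# Made-up filename: "a", an unpaired high surrogate U+D800, then "b".
name16 = "a".encode("utf-16-le") + b"\x00\xd8" + "b".encode("utf-16-le")
name8 = utf16_to_utf8(name16)
back = utf8_to_utf16(name8)
```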

It is true that there are UTF-8 streams that cannot be turned into UTF-16; these would be "illegal characters" for the filenames. If the filesystem does not have a byte API then this can be replicated by turning all errors into "illegal characters" in UTF-16 so that an equivalent error is thrown.

"Unicode"

Posted Nov 2, 2009 9:40 UTC (Mon) by njs (subscriber, #40338) [Link]

> This is destroying I18N on Linux and in many internet standards.

I think if you want us to take such catastrophic declarations seriously you should perhaps name some examples of specific free software or internet standards that have had their I18N "destroyed" (or even negatively affected).

"Unicode"

Posted Oct 30, 2009 22:38 UTC (Fri) by foom (subscriber, #14868) [Link]

I believe from reading some of the Apocalypsen years ago that Perl 6 is going to do something like this. Seems like a nice plan.

Decoding all your bytes into UCS4 is wasteful in time and memory, and 99% of the time unnecessary. Who ever wants to talk about codepoints, anyways? If you want more than bytes, graphemes are the more useful concept, and Python's unicode strings do nothing for you there.

People (especially people who design programming languages) need to get over this obsession with constant-time access to arbitrary numbered codepoints. It's really *not* an important or useful feature to have in your API!

"Unicode"

Posted Oct 31, 2009 0:27 UTC (Sat) by nix (subscriber, #2304) [Link]

O(1) *anything* is a good property to retain, IMO, especially with regard
to things as fundamental as characters.

Go from O(1) to O(n) on individual codepoint access and suddenly O(n)
stuff on strings goes to O(n^2) and so on: not remotely good.

"Unicode"

Posted Oct 31, 2009 0:50 UTC (Sat) by foom (subscriber, #14868) [Link]

> Go from O(1) to O(n) on individual codepoint access and suddenly O(n)
> stuff on strings goes to O(n^2) and so on: not remotely good.

That clearly only happens if your language doesn't have such a thing as iterators.
"increment(character_iterator)" is still O(1) even if your underlying representation is
UTF-8. The need to access an arbitrary numbered unicode codepoint in a string in
constant time isn't really all that useful.

Unicode codepoints don't really correspond to anything humans care about...splitting
a string in the middle of a Grapheme is really just as bad as splitting it in the middle of
a UTF-8 codepoint-sequence.
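
As a sketch of the iterator argument in Python 3: the generator below walks a UTF-8 byte buffer and does a bounded amount of work per step, because the lead byte alone says how long each sequence is. It assumes the input is valid UTF-8, and the name `iter_codepoints` is illustrative.

```python
def iter_codepoints(buf: bytes):
    """Yield (byte_offset, code_point) pairs; assumes buf is valid UTF-8."""
    i = 0
    while i < len(buf):
        lead = buf[i]
        # The lead byte determines the sequence length (1 to 4 bytes).
        if lead < 0x80:
            n = 1
        elif lead < 0xE0:
            n = 2
        elif lead < 0xF0:
            n = 3
        else:
            n = 4
        yield i, ord(buf[i:i + n].decode("utf-8"))
        i += n


pairs = list(iter_codepoints("a\u00e9\u20ac\U0001f600".encode("utf-8")))
```

The byte offset in each pair plays the role of the iterator state, so a full pass over the string stays linear.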

"Unicode"

Posted Oct 31, 2009 1:11 UTC (Sat) by spitzak (guest, #4593) [Link]

You are exactly the sort of misguided person who is destroying UTF-8.

Please explain EXACTLY where the "N" comes from that you are passing to your "go to the Nth UTF-16 code point" function. Answer: it is calculated by looking at all the preceding N-1 "characters" and therefore it is a misguided attempt to store an iterator in an integer, and it can be trivially replaced by a real iterator that uses a byte offset or pointer.

"Unicode"

Posted Oct 31, 2009 1:29 UTC (Sat) by nix (subscriber, #2304) [Link]

You're just repeating what foom said in different words, I think.

"Unicode"

Posted Oct 31, 2009 1:29 UTC (Sat) by nix (subscriber, #2304) [Link]

Sorry, I misinterpreted you. Of course it's more complicated to iterate
over strings now, but really not much more, and UTF-8 (unlike the
fixed-width multibyte encodings) is easy to resync to if you start from an
arbitrary byte, so things like binary searches in long strings are still
possible with a tiny bit of extra tweaking.
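
The resync step can be sketched in a few lines of Python 3 (function name illustrative, valid UTF-8 assumed). Continuation bytes all have the bit pattern 10xxxxxx, so from any byte offset you back up at most three bytes to reach the start of the enclosing character:

```python
def char_start(buf: bytes, pos: int) -> int:
    """Return the offset of the first byte of the character containing pos."""
    # Continuation bytes look like 0b10xxxxxx; skip backwards over them.
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


buf = "a\u20acb".encode("utf-8")   # bytes: 61 | E2 82 AC | 62
starts = [char_start(buf, p) for p in range(len(buf))]
```

A binary search over a long sorted buffer can apply this to each midpoint before comparing.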

And, agreed, the ability to treat a string as a fixed-width array is
really quite unimportant: generally people iterate over strings rather
than leaping to position N. (You meant 'position' or 'offset', though,
not 'codepoint', which is entirely different. Codepoint 'access' isn't
even a particularly meaningful concept: what does it mean to 'access'
ASCII codepoint 65? Codepoints just *are*.)

"Unicode"

Posted Nov 2, 2009 9:46 UTC (Mon) by njs (subscriber, #40338) [Link]

You can store text in UTF-8 and still have O(log n) random access (and insertion!) by storing your string in a code-point-offset-indexed tree. No-one seems to bother implementing this, though, presumably because it's complicated, and the overhead only amortizes out when you need to do random access/insertion on large chunks of text, which turns out to be a fairly rare need in practice. Also, converting from UCS-4 to UTF-8 is a pessimization for CJK.
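
A full offset-indexed tree is too long to show here, so the Python 3 sketch below substitutes a simpler technique: a flat table of checkpoints recording the byte offset of every `stride`-th code point. That gives lookups bounded by `stride` steps rather than O(log n), and it does not support insertion. The class name and parameters are hypothetical, and valid UTF-8 is assumed.

```python
class IndexedUtf8:
    """UTF-8 bytes plus a sampled index of code point -> byte offset."""

    def __init__(self, data: bytes, stride: int = 64):
        self.data = data
        self.stride = stride
        self.checkpoints = []
        count = 0
        for offset, byte in enumerate(data):
            if (byte & 0xC0) != 0x80:          # first byte of a code point
                if count % stride == 0:
                    self.checkpoints.append(offset)
                count += 1
        self.length = count

    def _next(self, pos: int) -> int:
        """Advance from one code point start to the next."""
        pos += 1
        while pos < len(self.data) and (self.data[pos] & 0xC0) == 0x80:
            pos += 1
        return pos

    def byte_offset(self, index: int) -> int:
        if not 0 <= index < self.length:
            raise IndexError(index)
        pos = self.checkpoints[index // self.stride]
        for _ in range(index % self.stride):   # at most stride - 1 steps
            pos = self._next(pos)
        return pos

    def __getitem__(self, index: int) -> str:
        start = self.byte_offset(index)
        return self.data[start:self._next(start)].decode("utf-8")


sample = "h\u00e9llo w\u00f6rld \u20acuro \U0001f600!" * 10
indexed = IndexedUtf8(sample.encode("utf-8"), stride=4)
```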

"Unicode"

Posted Nov 2, 2009 18:49 UTC (Mon) by spitzak (guest, #4593) [Link]

UTF-8 is self-synchronizing, meaning that you can find character boundaries in O(1) time and thus you can do random access/insertion of large chunks of text, even if the insertion point is somehow generated at random.

What you can't do is define the insertion point as "after N repetitions of this regexp" which is really what is wanted when people say "characters". But no other data structure would dare to require this as the basic iterator. For some reason though text makes otherwise intelligent programmers into morons and they are just blind to the obvious solution.

As for CJK, I think you meant UTF-16, not UCS-4. Converting from UCS-4 to anything will save memory, as it uses 4 bytes per Unicode code point. UTF-16 does use only 2 bytes for 0x800-0xFFFF while UTF-8 uses 3. However UTF-8 uses one byte for 0x00-0x7F and UTF-16 uses 2, so in fact the text will be smaller if there are more of these. In real CJK texts (ie not just single words) there are, because the ASCII range has spaces, newlines, numbers, all English quoted words, and all the XML markup, while the 3-byte ones are generally one-per-word and thus outnumbered. I believe however that some lengthy east-Indian texts, which use a phonetic alphabet that unfortunately requires 3 bytes per letter in UTF-8, are less efficient in UTF-8 than in UTF-16.
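
The per-character arithmetic is easy to check in Python 3. The two sample strings below are made up purely to show the mechanics; they say nothing about how real CJK documents are distributed.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for one string in three encodings (no byte-order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "ucs-4": len(text.encode("utf-32-le")),
    }


pure_cjk = encoded_sizes("\u65e5\u672c\u8a9e")                  # three CJK characters
markup = encoded_sizes('<p class="note">\u65e5\u672c\u8a9e</p>')  # same text inside ASCII markup
```

For the bare CJK string UTF-16 is smaller (6 bytes against 9), while for the marked-up string the ASCII characters tip the balance toward UTF-8.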

In any case talking about compression is silly, as you can use ZIP to turn any large document into far less than 1 byte per Unicode code point.

"Unicode"

Posted Nov 2, 2009 20:31 UTC (Mon) by njs (subscriber, #40338) [Link]

> UTF-8 is self-synchronizing, meaning that you can find character boundaries in O(1) time and thus you can do random access/insertion of large chunks of text, even if the insertion point is somehow generated at random.

Yes, I know. I thought we were talking about random access using character offsets, rather than byte offsets, though -- at least, that's what I was talking about in my comment. My point is that you can still do better than O(n) for arbitrary character access.

I don't really understand what point you're making about regexps -- all the utf-8 apis I know provide character iterators. I am, though, skeptical that the authors are really all morons, and not sure that claiming they are really adds anything to the conversation.

Re: UTF-8 vs. UCS-4/UTF-16: You're right, I misremembered. UTF-8 and UTF-16 are identical in terms of the hassle of doing random access indexing, and both are more memory-efficient than UCS-4, so I guess everything I said applies to both.

I mentioned compression because the original poster complained that UCS-4 was wasteful of memory; one of the motivations for using UTF-8 instead is that it gives some effective compression. Obviously for long-term compressed storage there are better solutions, but that's not what we're talking about.

"Unicode"

Posted Nov 3, 2009 4:01 UTC (Tue) by dvdeug (subscriber, #10998) [Link]

If you really want compressed in-core string storage, there's SCSU or BOCU-1. The memory versus code tradeoff is generally not considered worth it, though.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds