I agree with you on the point of increasing efficiency by doing fewer conversions. Basically it boils down to annotating a string with its encoding and doing the conversion lazily.
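A minimal sketch of what that could look like (the LazyText class and its interface are my invention, not an existing library):

    import functools

    class LazyText:
        """Raw bytes tagged with their declared encoding, decoded on demand."""

        def __init__(self, raw: bytes, encoding: str = "utf-8"):
            self._raw = raw
            self._encoding = encoding

        @functools.cached_property
        def text(self) -> str:
            # Pay the decoding cost at most once, and only if text is asked for.
            return self._raw.decode(self._encoding)

        @property
        def raw(self) -> bytes:
            # Code that merely forwards the data never decodes at all.
            return self._raw

Code that just shuffles the data around reads .raw and never pays for a conversion; the first access to .text validates and converts in one step.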
But I cannot agree with you on this claim:
"They are living in a fantasy world where UTF-8 magically has no errors in it. In the real world, if those errors are not preserved, data will be corrupted."
If your supposedly UTF-8 strings have errors in them, then that data is already corrupt. What you are talking about is sweeping those corruptions under the carpet and claiming that they never existed. This is exactly the failure mode that forces every single layer of an application to implement the same clunky workarounds again and again.
The real solution to this problem is detecting corruption early -- at the source -- to keep it from propagating any further. In fact, a frequent source of these corruptions is precisely handling UTF-8 strings as if they were just a bunch of bytes.
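Concretely, that means decoding strictly at the I/O boundary instead of deep inside the application. A minimal sketch (the filename is made up):

    # Decode at the boundary: a bad byte fails here, loudly,
    # instead of surfacing three layers later as mojibake.
    with open("input.dat", "rb") as f:
        raw = f.read()

    try:
        text = raw.decode("utf-8")   # strict error handling is the default
    except UnicodeDecodeError as e:
        # The exception pinpoints the exact offending byte offset.
        raise SystemExit("input.dat is not valid UTF-8 at byte %d: %s"
                         % (e.start, e.reason))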
To a coder like you, who knows all the details of character encodings, this is very obvious. However, most coders have neither the experience nor the time to think through all the issues every time they deal with strings. An "array of characters" is just a much simpler model for them, and it leaves less opportunity to screw up.
But Python 3 doesn't force you to use this model. It simply clears up the ambiguity between what you know is text and what isn't. In Python 2, you never knew whether a "str" variable contained ASCII text or just a blob of binary data. In Python 3, if you have data in an uncertain encoding, you use the "bytes" type. Just don't claim that it's a text string -- as long as you don't know how to interpret it, it's not text.
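To make the distinction concrete (the byte values here are just illustrative):

    >>> blob = b"\x66\x6f\x6f\xff"   # bytes: no claim about what they mean
    >>> text = "foo"                 # str: guaranteed valid Unicode
    >>> blob + text
    TypeError: can't concat str to bytes
    >>> blob.decode("utf-8")
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte
    >>> blob.decode("latin-1")       # fine once you commit to an interpretation
    'fooÿ'

Mixing the two is an error instead of silently producing garbage, and a decode either succeeds completely or fails loudly.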
"For some reason the ability to do character = string[int] is somehow so drilled into programmers' brains that they turn into complete idiot savants"
How about len(str)?
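It is the same story as indexing: len() only has a sane answer once the type says what you are counting. For example:

    >>> s = "héllo"
    >>> len(s)                   # code points
    5
    >>> b = s.encode("utf-8")
    >>> len(b)                   # bytes: 'é' takes two
    6
    >>> b[1]                     # indexing bytes yields an int, not a character
    195

In Python 2, len() on a UTF-8 encoded str would quietly give you 6 where the programmer almost certainly meant 5.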