
Report from the Python Language Summit

Report from the Python Language Summit

Posted Apr 16, 2015 18:11 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
In reply to: Report from the Python Language Summit by HelloWorld
Parent article: Report from the Python Language Summit

So how many characters are in composite symbols?

The only sane and modern way to do Unicode is UTF-8.



Report from the Python Language Summit

Posted Apr 16, 2015 19:27 UTC (Thu) by HelloWorld (guest, #56129) [Link] (1 responses)

> So how many characters are in composite symbols?
What does that have to do with the fact that bytes are not text/strings/characters?

> The only sane and modern way to do Unicode is UTF-8.
Regardless of whether this is true or not, there is a lot of data in all kinds of encodings, and developers had better think about which one they are going to use when reading that data.
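To illustrate the point about choosing an encoding when reading data, here is a minimal Python 3 sketch. The sample bytes and the variable names are made up for this example; the codec names are standard Python codecs. The same four bytes come out as different text, or fail to decode at all, depending on the encoding the reader picks.

```python
# Hypothetical input: the word "café" as written by a Latin-1 system.
raw = "café".encode("latin-1")      # four bytes, the last one is 0xE9

# Decoding with the encoding the data was actually written in works.
as_latin1 = raw.decode("latin-1")

# Decoding with a different single-byte codec "succeeds" but yields other text.
as_cp437 = raw.decode("cp437")

# Decoding as UTF-8 fails outright, because a lone 0xE9 is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1), repr(as_cp437), utf8_ok)
```

Only the reader who knows where the data came from can tell which of these results is the right one.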

Report from the Python Language Summit

Posted Apr 16, 2015 22:40 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> What does that have to do with the fact that bytes are not text/strings/characters?
The fact that sequences of UCS-4 code points are not text/strings/characters either, any more than sequences of raw bytes are.

> Regardless of whether this is true or not, there is a lot of data in all kinds of encodings, and developers had better think about which one they are going to use when reading that data.
Python 3 practically forces one to transcode data from one format to another all the time for no specific reason.

Report from the Python Language Summit

Posted Apr 20, 2015 10:38 UTC (Mon) by niner (guest, #26151) [Link]

"So how many characters are in composite symbols?"

Characters? I'd say one. I can definitely say (as far as I understand this anyway) that it's one grapheme and one or more code points.
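A small Python 3 sketch of the "one grapheme, one or more code points" distinction, using "é" as the example (my choice, not from the thread). It relies only on `len`, `str.encode`, and `unicodedata.normalize` from the standard library.

```python
import unicodedata

composed = "\u00e9"       # é as a single precomposed code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# Both render as one grapheme, but len() counts code points.
composed_len = len(composed)        # 1
decomposed_len = len(decomposed)    # 2

# The two strings compare unequal until they are normalized to the same form.
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed

# UTF-8 byte counts differ yet again.
composed_bytes = len(composed.encode("utf-8"))      # 2
decomposed_bytes = len(decomposed.encode("utf-8"))  # 3

print(composed_len, decomposed_len, equal_raw, equal_nfc,
      composed_bytes, decomposed_bytes)
```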

Perl 6 will deal with strings as sequences of Normalized Form Graphemes (NFG). There's a very interesting blog post about what this means:

https://6guts.wordpress.com/2015/04/12/this-week-unicode-...

I guess the only two sane ways of handling Unicode are:
* be completely agnostic and treat strings as opaque sequences of bytes, or
* go all in and work with graphemes whenever possible.
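For the second option, Python's standard library has no grapheme segmentation, so the following is only a rough approximation and not what Perl 6's NFG does. The helper `approx_grapheme_count` is a hypothetical name; it merely skips code points with a nonzero canonical combining class, and it ignores the full Unicode segmentation rules (emoji sequences, Hangul jamo, marks with combining class 0, and so on).

```python
import unicodedata

def approx_grapheme_count(text: str) -> int:
    """Rough estimate: count code points that are not combining marks.

    This is NOT full grapheme cluster segmentation; it is an
    illustration of why len() and "characters" can disagree.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

word = "noe\u0308l"   # "noël" with a decomposed ë (e + COMBINING DIAERESIS)

code_points = len(word)                   # counts code points
graphemes = approx_grapheme_count(word)   # counts base characters only

print(code_points, graphemes)
```

A real implementation would follow the grapheme cluster rules in the Unicode text segmentation annex rather than this shortcut.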

Report from the Python Language Summit

Posted Apr 21, 2015 22:26 UTC (Tue) by nix (subscriber, #2304) [Link]

I'll just tell all the existing systems out there to use UTF-8, even if they don't. I'm sure I can find a way to jam all of Unicode onto the Adafruit-based display board my Python code is talking to: it has a whole 64K of flash and 2K of RAM! I'm sure I can fit glyphs for all of Unicode in there and still have space for everything else it has to do!

No, not everything can use UTF-8, even in an ideal world. And such systems will *always* use different encodings, so to talk to such systems Python's enforced conversion is extremely valuable. And even when you're not, and the system you are talking to uses UTF-8 or some other Unicode variant, the enforced conversion is *still* valuable because it forces you to think about what encoding is in use, and amazingly often it's not straight UTF-8, or it's UTF-8 with extra requirements such as needing to be canonicalized or decanonicalized in a particular way or "oops we didn't say but experimentation makes it clear that $strange_canonicalization is the only way to go". (I have seen all of these on real systems, along with people claiming UTF-8 but meaning UTF-16 because they didn't know there was a difference, and vice versa -- and, in the latter cases, cursed them.)
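As a sketch of what the enforced conversion looks like in practice, assume a hypothetical device that only accepts ASCII (the real board's protocol is not described here, and `to_device_bytes` is a name invented for this example). In Python 3 a `str` has to be encoded explicitly before it can be written out as bytes, which is where the encoding decision gets made.

```python
def to_device_bytes(text: str, encoding: str = "ascii") -> bytes:
    """Encode text for a (hypothetical) device with a limited character set."""
    # errors="replace" substitutes "?" for anything the target encoding lacks.
    return text.encode(encoding, errors="replace")

ascii_payload = to_device_bytes("café")              # é cannot be represented
latin1_payload = to_device_bytes("café", "latin-1")  # é fits in one byte

# "UTF-8" and "UTF-16" are not interchangeable, even for plain ASCII text.
utf8_bytes = "A".encode("utf-8")
utf16_bytes = "A".encode("utf-16-le")

print(ascii_payload, latin1_payload, utf8_bytes, utf16_bytes)
```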

