
Python 2.8?

Posted Jan 23, 2017 12:51 UTC (Mon) by ssokolow (guest, #94568)
In reply to: Python 2.8? by mathstuf
Parent article: Python 2.8?

> No, [OsString] can represent arbitrary bytes.

I never said it couldn't. On POSIX platforms, String and OsString both rely on an internal Vec<u8> to store the raw bytes... String just enforces extra invariants and presents a different set of methods.

That's what I meant when I said they had the same in-memory representation. You can even use the to_str method to get an &str pointing at the contents of the OsString, as long as it's valid UTF-8 (it returns None otherwise).
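To make the "same in-memory representation" point concrete, here is a minimal sketch for Unix-like platforms only. The helper names os_from_bytes and shares_buffer are made up for illustration; the std calls (OsStringExt::from_vec, OsStrExt::as_bytes, OsStr::to_str) are the real ones. It checks that the &str handed back by to_str starts at the same address as the OsString's own bytes, i.e. that it is a borrowed view and not a copy.

```rust
use std::ffi::OsString;
// Unix-only extension traits: they expose the raw bytes behind OsStr/OsString.
use std::os::unix::ffi::{OsStrExt, OsStringExt};

/// Hypothetical helper: build an OsString straight from raw bytes, with no validation.
fn os_from_bytes(bytes: &[u8]) -> OsString {
    OsString::from_vec(bytes.to_vec())
}

/// Hypothetical helper: true if to_str() succeeds and the returned &str
/// points at the same buffer the OsString already owns.
fn shares_buffer(os: &OsString) -> bool {
    match os.to_str() {
        Some(s) => s.as_ptr() == os.as_bytes().as_ptr(),
        None => false, // contents were not valid UTF-8
    }
}
```

With these, shares_buffer(&os_from_bytes("héllo".as_bytes())) should be true, while os_from_bytes(b"fo\xff").to_str() is None because 0xFF can never appear in valid UTF-8.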

> It actually uses WTF-16 internally, a superset of UTF-16.

I looked at the up-to-date source to the standard library before I made my post.

OsString uses an inner type called sys::os_str::Buf to actually store the string and, on Windows, that's an API wrapper around sys_common::wtf8::Wtf8Buf which, again, uses a Vec<u8> internally.

(The key detail is that, when Windows UTF-16 can't be converted to valid UTF-8, it's not because it encodes code points that UTF-8 can't represent; it's because it violates higher-level rules about surrogates never occurring in isolation or out of order. After the transformation process is complete, UTF-16 and UTF-8 are simply two different ways to map 21-bit numbers into sequences of bytes, so it's trivial for WTF-8 to represent an arbitrary sequence of numbers 21 bits or narrower just by omitting some of the rule checks that a UTF-8 codec enforces.)
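As a rough illustration of "omitting some of the rule checks", here is a sketch of a UTF-8-style encoder that simply doesn't reject surrogates. This is not the standard library's Wtf8Buf code; encode_loose is a made-up name, and capping the input at 0x10FFFF is my own assumption to match Unicode's range. Real WTF-8 also has rules this sketch ignores, such as combining a properly paired high and low surrogate into one supplementary code point.

```rust
/// Hypothetical sketch: encode a number using UTF-8's bit layout,
/// but without the "no surrogates (0xD800..=0xDFFF)" check a strict codec applies.
fn encode_loose(cp: u32) -> Option<Vec<u8>> {
    let bytes = match cp {
        // 1 byte: 0xxxxxxx
        0x0000..=0x007F => vec![cp as u8],
        // 2 bytes: 110xxxxx 10xxxxxx
        0x0080..=0x07FF => vec![0xC0 | (cp >> 6) as u8, 0x80 | (cp & 0x3F) as u8],
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (this range includes the surrogates)
        0x0800..=0xFFFF => vec![
            0xE0 | (cp >> 12) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0x1_0000..=0x10_FFFF => vec![
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        // Assumption: anything past U+10FFFF is out of range.
        _ => return None,
    };
    Some(bytes)
}
```

For ordinary code points this matches real UTF-8 (0xE9 becomes C3 A9). For the lone surrogate 0xD800 it produces ED A0 80, which std::str::from_utf8 rejects, and char::from_u32(0xD800) likewise returns None. The bytes fit the layout just fine; it's only the higher-level rule that forbids them.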

> Right from the docs at the OsString link you gave

I actually started out with that exact quote but edited it out for brevity.



Python 2.8?

Posted Jan 23, 2017 20:34 UTC (Mon) by mathstuf (subscriber, #69389)

Ah, yeah, about the "backing store": I shouldn't comment too early in the morning.

The WTF-8 versus WTF-16 thing is news to me. Maybe it was communicated poorly when I first heard about it or it has changed since then.

Thanks for the clarifications.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds