Python 2.8?
Posted Jan 23, 2017 11:44 UTC (Mon) by ssokolow (guest, #94568)
In reply to: Python 2.8? by liw
Parent article: Python 2.8?
Rust has three types of "strings". From least to most "binary", they are:
std::string::String
- Guaranteed to be valid UTF-8 but uses a Vec<u8> internally.
- May contain NULL bytes.
- Codepoint-oriented APIs.
- Can be iterated byte-wise or codepoint-wise. Grapheme clusters or words available via the unicode-segmentation crate.
std::ffi::OsString
- Can represent any byte/codepoint sequence returned by OS-native APIs.
- Can represent NULL bytes even if the OS-native form can't.
- Designed to be producible from String without conversion cost.
  - On POSIX platforms, in-memory representation is identical to String.
  - On Windows, uses a UTF-8 superset called WTF-8 which matches the relaxed well-formedness rules of the Windows UTF-16 APIs.
Vec<u8>
- Exactly what it looks like. A vector of unsigned 8-bit integers.
- May contain NULL bytes.
- Serves the same role as Python's "bytestrings".
- Usable as a "string" type because Vec has "sequence of items" versions of most APIs you'd think of as being for string manipulation.
std::path::{Path,PathBuf} (OsString internally)
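For anyone who wants to poke at the three types listed above, here is a minimal sketch using only the standard library. The helper name three_flavours and the sample text are made up for illustration.

```rust
use std::ffi::OsString;
use std::path::PathBuf;

/// Hypothetical helper: builds all three "string" flavours from one &str.
fn three_flavours(text: &str) -> (String, OsString, Vec<u8>) {
    let s: String = text.to_owned();
    // Every valid String is also a valid OsString, so this cannot fail.
    let os: OsString = OsString::from(s.clone());
    // Consuming a String hands back the UTF-8 bytes it was storing.
    let bytes: Vec<u8> = s.clone().into_bytes();
    (s, os, bytes)
}

fn main() {
    // The embedded \0 is fine in all three types.
    let (s, os, bytes) = three_flavours("naïve\0text");

    // Codepoint-wise and byte-wise views of the same String differ
    // because 'ï' takes two bytes in UTF-8.
    println!("{} codepoints, {} bytes", s.chars().count(), s.len());

    // Vec<u8> has "sequence of items" versions of string-like operations.
    println!("starts with \"na\": {}", bytes.starts_with(b"na"));
    println!("pieces split on 0: {}", bytes.split(|&b| b == 0).count());

    // Paths are built on top of OsString.
    let path = PathBuf::from(&os);
    println!("{:?}", path);
}
```

The conversions only go "downhill" for free: String to OsString or Vec<u8> always works, while going back requires a validity check.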
Posted Jan 23, 2017 12:20 UTC (Mon) by mathstuf (subscriber, #69389)
> On POSIX platforms, In-memory representation is identical to String.
No, it can represent arbitrary bytes.
> On Windows, uses a UTF-8 superset called WTF-8 which matches the relaxed well-formedness rules of the Windows UTF-16 APIs.
It actually uses WTF-16 internally, a superset of UTF-16.
Right from the docs at the OsString link you gave:
> On Unix systems, strings are often arbitrary sequences of non-zero bytes, in many cases interpreted as UTF-8.
> On Windows, strings are often arbitrary sequences of non-zero 16-bit values, interpreted as UTF-16 when it is valid to do so.
> In Rust, strings are always valid UTF-8, but may contain zeros.
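A small sketch of what the docs quote means in practice, assuming a Unix target for the interesting part. The helper os_string_from_bytes is hypothetical; on non-Unix targets it falls back to a lossy conversion purely so the example still compiles.

```rust
use std::ffi::OsString;

/// Hypothetical helper: wraps raw bytes in an OsString.
/// On Unix the bytes are taken over as-is, with no validation.
#[cfg(unix)]
fn os_string_from_bytes(bytes: Vec<u8>) -> OsString {
    use std::os::unix::ffi::OsStringExt;
    OsString::from_vec(bytes)
}

/// Fallback for other targets (an assumption of this sketch):
/// invalid sequences are replaced with U+FFFD.
#[cfg(not(unix))]
fn os_string_from_bytes(bytes: Vec<u8>) -> OsString {
    OsString::from(String::from_utf8_lossy(&bytes).into_owned())
}

fn main() {
    // 0xFF never appears in valid UTF-8, so String refuses these bytes.
    let raw = vec![b'f', b'o', 0xFF, b'o'];
    println!("as String: {:?}", String::from_utf8(raw.clone()).is_err());

    // OsString accepts them; to_str() reports whether they are valid UTF-8.
    let os = os_string_from_bytes(raw);
    println!("{:?} -> to_str: {:?}", os, os.to_str());

    // A Rust String may contain a zero byte.
    let with_zero = String::from("a\0b");
    println!("length with embedded zero: {}", with_zero.len());
}
```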
Posted Jan 23, 2017 12:51 UTC (Mon) by ssokolow (guest, #94568)
> No, [OsString] can represent arbitrary bytes.
I never said it couldn't. On POSIX platforms, String and OsString both rely on an internal Vec<u8> to store the raw bytes... String just enforces extra invariants and presents a different set of methods. That's what I meant when I said they had the same in-memory representation. You can even use the to_str method to get an &str pointing at the contents of the OsString as long as it's valid UTF-8.
I looked at the up-to-date source to the standard library before I made my post. OsString uses an inner type called sys::os_str::Buf to actually store the string and, on Windows, that's an API wrapper around sys_common::wtf8::Wtf8Buf which, again, uses a Vec<u8> internally.
(The key detail is that, when Windows UTF-16 is invalid UTF-8, it's not because it encodes codepoints that UTF-8 can't represent, it's that it violates higher-level rules about surrogates never occurring in isolation or out of order. After the transformation process is complete, UTF-16 and UTF-8 are simply two different ways to map 21-bit numbers into sequences of bytes, so it's trivial for WTF-8 to represent an arbitrary sequence of numbers 21 bits or narrower just by omitting some of the rules checks that a UTF-8 codec enforces.)
I actually started out with that exact quote but edited it out for brevity.
Posted Jan 23, 2017 20:34 UTC (Mon) by mathstuf (subscriber, #69389)
The WTF-8 versus WTF-16 thing is news to me. Maybe it was communicated poorly when I first heard about it or it has changed since then.
Thanks for the clarifications.
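To make the WTF-8 point from this thread concrete, here is a rough sketch of a UTF-8-style encoder that simply skips the surrogate check. The function encode_generalized_utf8 is a hypothetical illustration, not the standard library's Wtf8Buf code, and it is capped at U+10FFFF as an assumption of the sketch.

```rust
/// Hypothetical helper: encodes a value with the UTF-8 bit layout but
/// without rejecting surrogates (U+D800..=U+DFFF), which a strict
/// UTF-8 codec would refuse.
fn encode_generalized_utf8(cp: u32) -> Vec<u8> {
    assert!(cp <= 0x10_FFFF, "outside the range this sketch handles");
    match cp {
        0..=0x7F => vec![cp as u8],
        0x80..=0x7FF => vec![0xC0 | (cp >> 6) as u8, 0x80 | (cp & 0x3F) as u8],
        0x800..=0xFFFF => vec![
            0xE0 | (cp >> 12) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        _ => vec![
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
    }
}

fn main() {
    // For ordinary codepoints the output matches real UTF-8.
    println!("{:02X?}", encode_generalized_utf8('é' as u32));

    // A lone surrogate, as Windows UTF-16 APIs may hand out, still
    // gets a byte sequence; strict UTF-8 validation rejects it.
    let lone = encode_generalized_utf8(0xD800);
    println!("{:02X?}", lone);
    println!("strict UTF-8 accepts it: {}", std::str::from_utf8(&lone).is_ok());
}
```

The only difference from a strict encoder is the missing range check, which is why the byte-level representation can stay a plain Vec<u8>.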