Of bytes and encoded strings

Posted Jan 27, 2014 14:55 UTC (Mon) by anselm (subscriber, #2796)
In reply to: Of bytes and encoded strings by raven667
Parent article: Of bytes and encoded strings

Tcl represents all strings (as opposed to »byte arrays«) as UTF-8 internally. I/O channels can be configured with other encodings, so that data read from them is converted from that encoding to UTF-8 and data written to them is converted from UTF-8 to that encoding. There is also the notion of a »system encoding« that provides the default. On a non-UTF-8 system (think IBM mainframe), the »external« representation of strings might therefore be completely different from what Tcl uses internally, but data is still read from and written to files with transparent re-encoding.
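For readers coming from the Python side, here is a rough sketch of the same idea. It is a Python 3 analogue, not Tcl code: the in-memory buffer stands in for a file, and ISO 8859-1 is picked arbitrarily as the »external« encoding. The point is only that the program handles text, while the channel converts at the boundary.

```python
import io

# Text as the program sees it: a string, independent of any file encoding.
text = "Grüße, Tcl"

# Writing: the "channel" knows its external encoding and converts on the way out.
# (io.BytesIO is an in-memory stand-in for a real file.)
raw = io.BytesIO()
writer = io.TextIOWrapper(raw, encoding="iso-8859-1", newline="")
writer.write(text)
writer.flush()
external = raw.getvalue()  # the bytes as they would appear on disk

# Reading: the same bytes are converted back on the way in.
reader = io.TextIOWrapper(io.BytesIO(external), encoding="iso-8859-1", newline="")
round_tripped = reader.read()

print(external)
print(round_tripped == text)
```

The external bytes differ from what a UTF-8 encoding of the same text would contain, yet the program never has to care.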

»Byte arrays« (i.e., non-UTF-8 strings of 8-bit bytes) are mostly used for things like binary protocol data units and communication with third-party libraries at the C level. They can be converted to strings and back if required, and I/O channels can be set to do no re-encoding so that byte arrays can be read and written as-is. There are special Tcl commands (»binary format« and »binary scan«) that help with assembling byte arrays out of integer bytes, words, etc., and with taking them apart again.
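To make the »protocol data unit« case concrete, here is a small sketch, again as a Python 3 analogue rather than Tcl (in Tcl the corresponding commands would be »binary format« and »binary scan«). The message layout is made up for illustration: a 1-byte type, a 2-byte length and a 4-byte sequence number, all big-endian, followed by a payload.

```python
import struct

# String -> bytes is an explicit conversion with a named encoding.
payload = "héllo".encode("utf-8")

# Assemble the (hypothetical) header from integers: type=1, length, sequence=42.
# ">" means big-endian with no padding; B = 1 byte, H = 2 bytes, I = 4 bytes.
header = struct.pack(">BHI", 1, len(payload), 42)
pdu = header + payload

# Take the message apart again.
header_size = struct.calcsize(">BHI")
msg_type, length, seq = struct.unpack(">BHI", pdu[:header_size])

# Bytes -> string is, again, an explicit conversion.
text = pdu[header_size:header_size + length].decode("utf-8")

print(msg_type, length, seq, text)
```

Everything between the encode and the decode is plain bytes; no re-encoding takes place in the middle.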

The main difference is that Tcl does not distinguish between »ASCII strings« and »Unicode strings« the way Python (2) does, since all the strings one usually does business with in a Tcl program are UTF-8 strings. Byte arrays are really a niche thing that normally doesn't come up. In Python (2), by contrast, what Tcl would call a »byte array« is the default string type, and Unicode strings must be specially marked (with a »u« prefix on the literal).

