LWN: Comments on "GNU ed 1.6 released"

GNU ed 1.6 released

dirtyepic — Sun, 12 Feb 2012 04:54:26 +0000

These days? SearchKit.

GNU ed 1.6 released

ndk — Mon, 09 Jan 2012 15:43:24 +0000

Ah, those were the days: back in the late 70s/early 80s, we had a Prime 750 running PrimOS (anybody remember those?) with an IBM-inspired line editor from the mid-60s as the system editor: if you think ed is frustrating, you should try that beast. I spent a lot of pleasant all-nighters on a terminal with a 300-baud modem porting the Kernighan & Plauger software tools to PL1/G; of course I started with ed. After that bootstrap step, there was a quantum leap in productivity (and a 1200-baud modem upgrade helped, but not as much). The ed port (and quite a few of the other tools) was actually used in the classroom for a few years, before Prime actually paid somebody to port emacs to their OS.

note the consistent user interface

k8to — Fri, 06 Jan 2012 01:25:11 +0000

I learned C++ in a mud environment where all I had was ed. I did a lot of pasting from a local editor with very tightly controlled flow rates.

Sometimes I did legitimately fix bugs in source on the server system using ed commands. It was painful.

GNU ed 1.6 released

jhhaller — Thu, 05 Jan 2012 17:18:00 +0000

My first editor on Unix was ed; vi and emacs hadn't been written yet. em (ed for mortals) was next. But then, moving to emacs instead of vi, I never really learned much of vi other than a, i, d, r, and x; most of my use of vi consists of a colon followed by an ed command. Go to the end of the file,

:$

make a copy of a line,

:.t.

move 3 lines

:.,.+2t52
:.,.+2d

(assuming moving lines forward), and replace apple with banana

:g/apple/s//banana/g
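For readers who don't speak ed, here is a rough Python sketch of what that last command does, assuming the buffer is simply a list of lines. The function name `global_substitute` and the sample lines are made up for illustration, and Python's regex dialect is not the same as ed's. The empty pattern in `s//banana/g` reuses the regex from the `g/apple/` part, which is why `apple` only has to be typed once.

```python
import re

def global_substitute(lines, pattern, replacement):
    """Rough equivalent of ed's g/pattern/s//replacement/g on a list of lines."""
    regex = re.compile(pattern)
    result = []
    for line in lines:
        if regex.search(line):                    # g/pattern/ selects matching lines
            line = regex.sub(replacement, line)   # s//replacement/g reuses the pattern
        result.append(line)
    return result

# Hypothetical buffer contents, for illustration only.
buffer = ["apple pie", "cherry tart", "apple and apple"]
edited = global_substitute(buffer, "apple", "banana")
```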

GNU ed 1.6 released

neilbrown — Thu, 05 Jan 2012 02:11:04 +0000

We must never forget the greatest legacy that 'ed' has left us.

If 're' is a 'regular expression', then
/re/
will search for it.
/re/p
will search and then print.
g/re/p
will apply this globally - for every line that matches 're', print the line.
So if you wanted to write a program that just printed the lines that match a regular expression - what do you call it?
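A minimal sketch of such a program in Python, assuming the input is already split into lines. The function name `g_re_p` and the sample lines are invented for the example, and Python's `re` syntax differs from ed's basic regular expressions.

```python
import re

def g_re_p(pattern, lines):
    """Return every line that matches the regular expression, like ed's g/re/p."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical input, for illustration only.
sample = ["ed is the standard text editor", "vi came later", "sed and awk"]
matches = g_re_p(r"ed", sample)
```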

GNU ed 1.6 released

cmccabe — Wed, 04 Jan 2012 16:33:53 +0000

Well, I guess you are right. Unicode support, even in higher-level languages, still is not perfect. Disappointing.

note the consistent user interface

tnoo — Wed, 04 Jan 2012 11:15:18 +0000

this still breaks on ^C. The source code must be more complex, maybe like this:

trap "" SIGINT;while :;do read x;echo \?;done

GNU ed 1.6 released

tialaramex — Wed, 04 Jan 2012 02:38:20 +0000

If the higher level language and its associated string manipulation APIs were created in the last decade or so, and by someone with a thorough understanding of Unicode itself and the general problems of different text systems, then yes, maybe.

In practice, I can't think of any languages like that. Many of them are built by people who at best assumed other writing systems are just like Latin except with differently shaped squiggles. They often mandate that "text" means "UTF-16 strings" and then blunder into all sorts of problems with filenames, URLs, streams of bytes some idiot stashed in a "text" field on a database, and other things that definitely aren't UTF-16 strings. There may be built-in assumptions about writing direction, the meaning of "character" (a very, very tricky issue) and so on.

As a rule of thumb if the language claims to be "high level" and yet it has a "character" data type that's distinct from a string, or can be treated meaningfully as an integer, or it has the same data type for binary data and text, then either they're yanking your chain or they had no idea about Unicode. C has the excuse that Unicode literally didn't exist back then. Languages like Python will have to provide their own excuses.

Some more bad signs:

• Mentions of the "length" of a string that don't either include or point at a multi-paragraph discussion of what "length" means in this context.

• Discussion of collation or "sorting" strings that doesn't mention locale.

• A string equality operator or comparison method that doesn't come with a multi-paragraph discussion of Unicode equivalence.

Of course a lot of this stuff can be /fixed/ in theory. But fixes after the fact are often messy. They can involve things like deprecated methods on core objects, parallel APIs replacing every mention of character with "string", or even inventing another type "Unicode string" and then going around replacing all the other APIs in the system with Unicode-friendly ones, leaving maintenance programmers to handle the debris.
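To make the "length" and "equality" points concrete, here is a small Python 3 sketch using only the standard `unicodedata` module. The variable names are arbitrary; the two strings are the precomposed and decomposed spellings of the same accented letter.

```python
import unicodedata

composed = "\u00e9"        # e-acute as one precomposed code point
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT

# len() counts code points, not user-perceived characters.
lengths = (len(composed), len(decomposed))

# The == operator compares code points, so canonically equivalent text differs.
naive_equal = composed == decomposed

# Normalizing both sides to the same form makes the comparison meaningful.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# Yet another notion of "length": bytes in the UTF-8 encoding.
utf8_lengths = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

Both strings render as the same single letter, yet they give three different answers to "how long is it?" depending on what is being counted.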

GNU ed 1.6 released

cmccabe — Tue, 03 Jan 2012 23:52:41 +0000

I don't see a reason to use ICU unless you need the functionality that ICU provides. Nearly all of the C programs I've ever written just treat strings as opaque blobs and don't try to do "proper semantic text operations."

UTF-8 works great for what I need. My only wish is that it had been invented sooner, so that people didn't come up with N+1 different subtly defective, backwards incompatible "wide character" solutions.

If I were performing fancy operations on text, I would probably do it in a higher level language with built-in unicode support. At that point the encoding should be a non-issue (right?) because the high level language abstracts that away.

GNU ed 1.6 released

JoeBuck — Tue, 03 Jan 2012 21:00:01 +0000

Yes, ed was optimized for use with DECwriters spitting out characters at 300 baud (and I'm old enough that I actually used it as intended).

This is not true...

khim — Tue, 03 Jan 2012 20:01:41 +0000

UTF-8 is not just a run-of-the-mill variable-length encoding. Ken Thompson modified IBM's original proposal to make sure most algorithms which treat strings as sequences of 8-bit characters were still usable with UTF-8.

This means that yes, you can easily use UTF-8 with programs like GNU ED or GNU M4 which know absolutely nothing about UTF-8 but correctly support 8bit characters in strings.
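A quick way to see this property is to run purely byte-oriented operations on UTF-8 data. The sketch below is Python 3 and only illustrates the encoding; it says nothing about how ed or m4 are implemented, and the sample text is made up.

```python
text = "naïve café, 日本語 and plain ASCII"
data = text.encode("utf-8")

# A byte-oriented search for an encoded substring lands on a character boundary.
needle = "café".encode("utf-8")
byte_offset = data.find(needle)
found = data[byte_offset:byte_offset + len(needle)].decode("utf-8")

# Every byte of a multi-byte sequence has its high bit set, so an ASCII byte
# such as b"," can never turn up inside an encoded non-ASCII character.
non_ascii_bytes = [b for ch in text if ord(ch) > 127 for b in ch.encode("utf-8")]
all_high_bit = all(b >= 0x80 for b in non_ascii_bytes)

# Splitting on an ASCII delimiter at the byte level therefore keeps characters whole.
parts = [p.decode("utf-8") for p in data.split(b", ")]
```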

GNU ed 1.6 released

blitzkrieg3 — Tue, 03 Jan 2012 19:39:23 +0000

UTF-8 is a variable width encoding. It sounds like they now have support for something like extended ASCII.

GNU ed 1.6 released

nicku — Tue, 03 Jan 2012 19:38:09 +0000

It's strange to relate how happy I was writing all my (mostly Pascal) computing assignments at UNSW in ed on the locally compiled Unix through 2400 bps green terminals in 1986--1989. About twenty of us simultaneously wrote 6809 assembly language programs on a time-share OS/9 system running on one 68000 CPU in an ed-like editor.

note the consistent user interface

NAR — Tue, 03 Jan 2012 15:28:01 +0000

Once I had to use a router which had ed (but not vi) installed to edit its configuration file on the router. For some very simple tasks I actually could edit the file, but for anything complicated it was easier to download the file, edit it with vim, upload, then reload the configuration.

GNU ed 1.6 released

bjartur — Tue, 03 Jan 2012 13:15:43 +0000

I think ed is even simpler than that: ed doesn't bother with encoding at all, but leaves the mess to terminals. You should be fine as long as you refrain from inputting partial characters.

GNU ed 1.6 released

sorpigal — Tue, 03 Jan 2012 12:55:02 +0000

> There's no *nice* way to hack in this kind of support after the fact...

There, fixed it for ya.

GNU ed 1.6 released

mpr22 — Tue, 03 Jan 2012 12:02:49 +0000

Even if you ignore IPA, not all Latin-alphabet combining sequences used in the orthography of natural languages have precomposed code points. For example, as far as I know there is still no precomposed code point for n̈ - and yes, this does have a use other than correctly representing the name of a certain fictional heavy metal band.
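This is easy to check with Python 3's `unicodedata` module: NFC normalization composes a sequence only when a precomposed code point exists. In the sketch below (variable names are arbitrary), n with a tilde collapses to one code point while n with a diaeresis stays as two.

```python
import unicodedata

n_tilde = unicodedata.normalize("NFC", "n\u0303")      # n + COMBINING TILDE
n_diaeresis = unicodedata.normalize("NFC", "n\u0308")  # n + COMBINING DIAERESIS

tilde_len = len(n_tilde)          # composes to the single code point U+00F1
diaeresis_len = len(n_diaeresis)  # no precomposed form, so it stays as two
```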

note the consistent user interface

rsidd — Tue, 03 Jan 2012 08:09:04 +0000

Source code for ed:

while :;do read x;echo \?;done

(from here)

GNU ed 1.6 released

ssmith32 — Tue, 03 Jan 2012 05:56:42 +0000

>...There's no hacking in this kind of support after the fact;...

Yet somehow, this is what always happens :D

GNU ed 1.6 released

wahern — Tue, 03 Jan 2012 04:04:32 +0000

Also, it's worth pointing out, from the FAQ:

Q: Doesn’t it cause a problem to have only UTF-16 string APIs, instead of UTF-32 char APIs?

A: Almost all international functions (upper-, lower-, titlecasing, case folding, drawing, measuring, collation, transliteration, grapheme-, word-, linebreaks, etc.) should take string parameters in the API, not single code-points (UTF-32). Single code-point APIs almost always produce the wrong results except for very simple languages, either because you need more context to get the right answer, or because you need to generate a sequence of characters to return the right answer, or both.

(Source: http://unicode.org/faq/utf_bom.html)
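Case mapping is a handy illustration of the "generate a sequence of characters" case. The Python 3 sketch below relies on `str.upper()` applying full case mappings; the variable names are arbitrary. A function that took one code point and had to return one code point could not produce these results.

```python
sharp_s = "\u00df"                  # LATIN SMALL LETTER SHARP S, one code point
sharp_s_upper = sharp_s.upper()     # uppercases to the two-letter string "SS"

ligature = "\ufb01"                 # LATIN SMALL LIGATURE FI, one code point
ligature_upper = ligature.upper()   # uppercases to the two-letter string "FI"

grew = (len(sharp_s), len(sharp_s_upper))
```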

GNU ed 1.6 released

wahern — Tue, 03 Jan 2012 04:02:15 +0000

My fault for being lazy with the terminology.

Question: do all combining sequences have precomposed equivalents? I think all the Latin ones do, but what about other scripts?

note the consistent user interface

tnoo — Tue, 03 Jan 2012 03:38:55 +0000

Let's look at a typical novice's session with the mighty ed:

golem$ ed

?
help
?
?
?
quit
?
exit
?
bye
?
hello?
?
eat flaming death
?
^C
?
^C
?
^D
?

---

Note the consistent user interface and error reportage. Ed is generous enough to flag errors, yet prudent enough not to overwhelm the novice with verbosity.

“Ed is the standard text editor.”

Ed, the greatest WYGIWYG editor of all.

(from http://www.gnu.org/fun/jokes/ed.msg.html)

GNU ed 1.6 released

mjg59 — Tue, 03 Jan 2012 03:34:44 +0000

From the spec:

"UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value"

So UTF-32 isn't variable length. The sudden rise in the use of emoji and other non-BMP characters means that ignoring the variable length of UTF-16 is already broken in real-world cases in non-CJK markets, too.
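Counting code units makes the difference visible. This is a Python 3 sketch; the helper `code_units` is invented for the example, and the `-le` codec variants are used so that no byte order mark is counted.

```python
bmp_char = "\u00e9"        # e-acute, inside the Basic Multilingual Plane
emoji = "\U0001F600"       # GRINNING FACE, outside the BMP

def code_units(ch, encoding, unit_size):
    """Number of code units the character occupies in the given encoding."""
    return len(ch.encode(encoding)) // unit_size

# UTF-16 needs a surrogate pair (two 16-bit units) for the emoji.
utf16_units = (code_units(bmp_char, "utf-16-le", 2), code_units(emoji, "utf-16-le", 2))

# UTF-32 always uses exactly one 32-bit unit per code point.
utf32_units = (code_units(bmp_char, "utf-32-le", 4), code_units(emoji, "utf-32-le", 4))
```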

GNU ed 1.6 released

wahern — Tue, 03 Jan 2012 03:28:13 +0000

Both UTF-16 and UTF-32 are also variable length encodings, it's just that most Windows, Java, etc programmers treat them like fixed length encodings. It mostly works for the time being, but will begin breaking when non-Western markets mature and people start requiring more feature parity when slicing and dicing text without buying special-purpose editing packages.

For the time being people have low expectations. But political and technical movements like Simplified Chinese will eventually hit substantial cultural barriers and the push back will require that software handle locales which didn't adapt to western syntax. That will mean following the Unicode rules to a T. To follow the Unicode rules you have to use an API for even simple things like "character" iteration, etc, unless the programming language supports the proper semantic text operations, like Perl6 can over graphemes using its neat NFG hack. Scripts like Thai have no mandatory punctuation, so again you need to use accessors with a complex built-in rule base to detect, e.g., end-of-sentence. There's no hacking in this kind of support after the fact; it has to be baked into the code.

APIs like ICU are huge, but in many cases can make the code more clear. Unfortunately ICU doesn't get used much because the rule tables are so gargantuan that virtual memory explodes (though most of that is mmap'd straight from disk), and programmers are still beholden to their notion of low-level C-like character strings.

In 10-20 years we are going to see a surge in demand for I18N and L10N programmers to refactor all the crap hacks that came out of the 1990s, heralded by Microsoft's and Sun's half-hearted adoption of UTF-16.

GNU ed 1.6 released

nescafe — Tue, 03 Jan 2012 01:48:47 +0000

GNU ed 1.6 released

halfline — Tue, 03 Jan 2012 01:06:22 +0000

GNU ed 1.6 released

Karellen — Tue, 03 Jan 2012 00:56:56 +0000

I think the only problem with the article is that it means 8-bit character *encodings*, where the encoding of the character set uses 8 bits (e.g. UTF-8), rather than 7 (e.g. UTF-7), regardless of how many bits are required by the character *set* (20 for all proper unicode encodings such as both UTF-7 and UTF-8).

Anyway, whatever that wikipedia article means, it's kind of a red herring, as "ed" being 8-bit clean means that it can handle both 8 bit character sets (e.g. ISO-8859-*) and 8-bit character encodings (e.g. UTF-8).
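As a rough illustration of what "8-bit clean" buys you, the Python 3 sketch below compares a filter that clears the top bit of every byte with one that passes bytes through untouched. It is a toy model, not a description of what any version of ed actually does, and the variable names are made up.

```python
original = "Gr\u00fc\u00dfe".encode("utf-8")   # bytes with the high bit set

# A program that is not 8-bit clean might clear bit 7 of every byte.
stripped = bytes(b & 0x7F for b in original)

# An 8-bit-clean program passes every byte through untouched.
passed_through = bytes(original)

survives_clean = passed_through.decode("utf-8") == "Gr\u00fc\u00dfe"
survives_stripping = stripped == original
```

The same test gives the same answer for ISO-8859-1 input, since any byte above 0x7F is damaged by the stripping filter regardless of which encoding it belongs to.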

GNU ed 1.6 released

geuder — Tue, 03 Jan 2012 00:35:03 +0000

True, but I don't think the wikipedia article you link to is very clear.

UTF-8 is not a character set at all, but a variable length encoding of a 16-bit or 32-bit character set.

8 bit cleanliness was a big issue for most Europeans wanting to write anything correctly on a computer in their mother tongue in the early 90s. Most editors could only handle 7 bit character sets without quirks.

8 bit character sets were an intermediate step in the 90s. 8 bits per character are enough for most bigger European languages (but not a single 8 bit character set for all of them).

But 8 bits don't help the Asians. They need 16 bits per character at least, and because different sets were needed in Europe it wasn't ideal even there.

I think the majority of software in use has been basically 16 bit Unicode for many years now. Windows, Symbian, and Java use true 16 bit wide characters, while all Linux distributions I have used use UTF-8 encoding by default. The nice thing with UTF-8 is that you can't even tell the difference from the old ASCII as long as you stick to 7 bit ASCII characters, because their encoding is identical, 8 bits with the most significant bit being 0.
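The ASCII compatibility is easy to verify. In this Python 3 sketch (sample strings chosen arbitrarily), pure ASCII text encodes to the same bytes either way, and the encoding only becomes variable length once you leave ASCII.

```python
ascii_text = "ed 1.6"

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
high_bits_clear = all(b < 0x80 for b in ascii_text.encode("utf-8"))

# Outside ASCII, a character takes two, three, or four bytes.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]   # a, e-acute, euro sign, emoji
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
```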

Whether ed supports UTF-8 or not is not said in the announcement. IMHO 8 bit cleanliness defines support of 8 bit character sets, not stripping away or clearing the most significant bit. UTF-8 is more than this, the editor must be able to handle the variable length encoding.

But whether it can or cannot be used for writing texts in European languages I use regularly, I don't see a reason why I personally would do it using ed. As long as all American programmers remember every day that the world is not 7 bit and not even all ASCII characters are reachable on the keyboard without using a modifier key I'm happy. The original question shows that there is work to do, so excuse my long comment.

GNU ed 1.6 released

andrel — Mon, 02 Jan 2012 22:39:56 +0000

It means that this version of ed works with 8-bit character sets.

GNU ed 1.6 released

johnny — Mon, 02 Jan 2012 22:27:43 +0000

What does "8-bit clean" mean? That the code only uses 8-bit variables?