Hyphens, minus, and dashes in Debian man pages [LWN.net]

Name	Codepoint	Glyph
Hyphen-Minus	002D	-
Hyphen	2010	‐
En Dash	2013	–
Em Dash	2014	—
Minus Sign	2212	−
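For readers who want to check which of these characters a given string actually contains, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `describe` is made up for illustration; the code points are the five from the table above.

```python
import unicodedata

# The five code points listed in the table above.
DASHES = ["\u002d", "\u2010", "\u2013", "\u2014", "\u2212"]

def describe(ch):
    """Return ('U+XXXX', official Unicode name) for a single character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

for ch in DASHES:
    code, name = describe(ch)
    print(code, name, ch)
```

Running `describe` over pasted text is a quick way to tell look-alike glyphs apart when the font does not.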

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 13:45 UTC (Mon) by willy (subscriber, #9762) [Link] (1 responses)

Illustrating this issue further, perhaps, is the broken Android behaviour around rendering hyphens. It's alluded to here:

https://github.com/guardian/frontend/issues/17506

but the Grauniad appears to have worked around it instead of getting Android fixed. Jake and I had some correspondence on this issue last year, tracked down to using &#8209 (a different glyph, the non-breaking hyphen!)
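As a small illustration of why that character reference is yet another glyph, the following Python sketch decodes it and looks up its Unicode name. The helper name `decode_entity` is hypothetical; `html.unescape` and `unicodedata.name` are standard library calls.

```python
import html
import unicodedata

def decode_entity(entity):
    """Decode one HTML character reference into (char, 'U+XXXX', Unicode name)."""
    ch = html.unescape(entity)
    return ch, f"U+{ord(ch):04X}", unicodedata.name(ch)

print(decode_entity("&#8209;"))
```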

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 18:27 UTC (Mon) by branden (guest, #7029) [Link]

[[I am using the "send a free link" feature to read this article, and find that I cannot "Post a comment", but only reply to one, so my response to your comments is only somewhat apropos. I apologize for that.]]

The LWN editor griped in his piece that the distinction between a hyphen and a hyphen-minus was "invisible" (as it clearly wasn't with the Android font you used).

I've had good results with the "FreeMono" font, from the Debian fonts-freefont-ttf package. I find hyphens and minus signs readily distinguishable with it, and it has excellent coverage; I've used it to extensively revise the groff_char(7) man page, which exercises every glyph one is likely to see in a man page, and practically speaking, many more besides.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 15:02 UTC (Mon) by ms-tg (subscriber, #89231) [Link]

> may have been dashed, but that does not bar

Thank you for this humor!

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 15:06 UTC (Mon) by amacater (subscriber, #790) [Link]

I see what you did in the last sentence of the article there, Jon:)

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 15:16 UTC (Mon) by smoogen (subscriber, #97) [Link]

> Developers of free software are, of course, diligent about writing man pages;
> they do the job promptly, take their time to get every detail right, and can be
> expected to use the right kind of dash in every situation, even though the
> output from using the wrong kind looks exactly the same. They will surely
> not be bothered by the fact that a format designed to document
> command-line options contains a trap whereby the failure to add backslashes
> silently introduces problems for users who are distant in time and space.

Bravo Mr Corbet, Bravo!

I started with a smirk, went on to a smile, and by the end was what my family said was "insane" giggling at the idea of various programmers I have known and the man pages written, updated and maintained by them.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 15:20 UTC (Mon) by zorro (subscriber, #45643) [Link] (4 responses)

Could groff not simply make the remapping dependent on the output font?
Monospace font → Hyphen-Minus
Proportional font → Hyphen

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 17:05 UTC (Mon) by branden (guest, #7029) [Link] (2 responses)

> Could groff not simply make the remapping dependent on the output font?

No, because when the formatter runs in "nroff mode" (is producing output for a terminal), there is no support for "font families" (a groff invention that structures a somewhat unruly mess of uncategorized fonts that AT&T device-independent troff developed starting around 1980).

There's more on this in the groff manual.

https://www.gnu.org/software/groff/manual/groff.html.node...

In case it doesn't go without saying, few to zero terminal emulators support switching font families, or between monospaced and proportional type, at least for anything less than the entire rendered screen at once.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 23:28 UTC (Mon) by rfunk (subscriber, #4054) [Link] (1 responses)

But nroff mode can be considered entirely monospace anyway, right?

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 2:54 UTC (Tue) by branden (guest, #7029) [Link]

> But nroff mode can be considered entirely monospace anyway, right?

Yes. So in nroff mode, you can simply remap `-` to `\-` on a groff output device that distinguishes them ("utf8" is the only nroff-mode device that does), if you don't care about a collateral effect of line breaks in filled text not happening as often as they should, as mentioned by Russ Allbery in this LWN thread. (And admittedly, many people don't.)

That simple remapping is in fact what the groff "PROBLEMS" file recommends for those who aren't concerned with man page typography, and what the latest revision of the Debian groff package does--it updates a conffile (/etc/groff/man.local), which the site admin can modify if they like. Hence the storm in a teacup. :)

Hyphens, minus, and dashes in Debian man pages

Posted Nov 5, 2023 10:43 UTC (Sun) by cpitrat (subscriber, #116459) [Link]

Or couldn't groff rather invert the behavior? I fail to understand why '-' should be translated to the Unicode version Hyphen rather than to the more natural ASCII Hyphen-minus version. And when the user wants a Unicode Hyphen, then escape the -.

It's still an awful task to fix all the man pages which "use it right" (as per v1.23) but it seems much more natural to me.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 15:54 UTC (Mon) by epa (subscriber, #39769) [Link] (65 responses)

Perhaps groff could partially reinstate the old behaviour? If the - character appears at the start of a word, that is, preceded by a space character, then output the ASCII - character. In the middle of a word (surrounded by nonspace), or even at the end of a word, it can indeed be typeset as a hyphen.

I guess that wouldn’t work for ‘hyphenated’ long option names, which have - in the middle, so some more elaborate rule might be needed. Perhaps easier to fix the manpage sources after all.
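To make the long-option problem concrete, here is a rough Python sketch of the whitespace rule described above. It is not anything groff does; the function name `naive_render` and the sample sentences are invented for illustration.

```python
HYPHEN = "\u2010"        # Unicode HYPHEN
HYPHEN_MINUS = "\u002d"  # ASCII hyphen-minus

def naive_render(text):
    """Hypothetical rule: '-' at the start of a word stays ASCII, else becomes HYPHEN."""
    out = []
    for i, ch in enumerate(text):
        if ch != "-":
            out.append(ch)
        elif i == 0 or text[i - 1].isspace():
            out.append(HYPHEN_MINUS)
        else:
            out.append(HYPHEN)
    return "".join(out)

print(naive_render("use -v for long-term debugging"))
print(naive_render("pass --dry-run first"))
```

The first sentence comes out as intended, but in the second only the first character of `--dry-run` stays ASCII; the other two become U+2010, so the option can no longer be pasted into a shell.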

When that’s done, can we arrange for applications such as spreadsheets to understand the Unicode minus sign?

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 16:22 UTC (Mon) by Wol (subscriber, #4433) [Link] (35 responses)

> I guess that wouldn’t work for ‘hyphenated’ long option names, which have - in the middle, so some more elaborate rule might be needed. Perhaps easier to fix the manpage sources after all.

Or assume that ascii-dash means ascii-dash? If you want a breaking hyphen, *that* should be \- or something fancy.

> When that’s done, can we arrange for applications such as spreadsheets to understand the Unicode minus sign?

Yes it would be nice for spreadsheets to recognise Unicode minus, but surely we need proper Unicode keyboards before we worry too much about that (and yes, I expect non-Anglo-Saxon countries already have Unicode keyboards, but how many people writing all this stuff even know how to get at Unicode? Apart from the £ sign (which is almost certainly Unicode) I have no idea how to access any other Unicode).

Cheers,
Wol

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 17:23 UTC (Mon) by branden (guest, #7029) [Link] (3 responses)

> Or assume that ascii-dash means ascii-dash? If you want a breaking hyphen, *that* should be \- or something fancy.

The author of troff, Joseph Ossanna of the Bell Laboratories Computing Science Research Center, and a close colleague of Ken Thompson and Dennis Ritchie (whose names may be more familiar) faced this problem when the Labs took delivery of its first phototypesetter. All Unix document formatting had, up to that point (sometime in 1972/1973), been done using Ossanna's "nroff" or the older "roff" program to print to typewriters, where there is indeed no distinction between a hyphen, a minus sign, or a dash (unless you type it more than once).

Fonts for typesetting are a different story. They can have en dashes, em dashes, figure dashes, and almost always have distinguishable hyphen and minus characters.

Given an installed base of nroff users and documents, including Unix man pages, the arrival of the typesetter meant that Ossanna had to decide whether "-" should map to the typesetter's hyphen, or its minus sign.

He chose the former. My bet is that some frequency analysis of glyph usage was done--perhaps with some degree of rigor, since this was Bell Labs after all--and found that "-" occurred much more often as a hyphen than as a minus sign. And moreover, that its use as a minus sign was in fairly restricted contexts, like setting mathematical expressions (necessarily simple ones on typewriting terminals like the Western Electric Model 37 that the Labs used).

You can read more about these matters in the groff_char(7) and roff(7) man pages in the groff 1.23.0 release.

It is an accident of history that over the years, Unix users largely gave up using troff and nroff _except_ for man page composition, and so people notice the prevalence of the ASCII hyphen-minus much more often than they would in any other context.

Switching the glyphs' meanings around now would (a) break other *roff documents or (b) if done only for man(7) (and mdoc(7)), would make those macro packages work inconsistently with all others.

There is perhaps a case to be made for (b), but there is already a means of giving man(7) and mdoc(7) authors the crude solution many of them already desire, and it is to remap characters in the ASCII-WYSIWYG manner that many man page authors desire in the site-local configuration files that have been around for decades, man.local and mdoc.local. That is what Colin has done for Debian; we've both anticipated this day for years.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 6:53 UTC (Tue) by jengelh (guest, #33263) [Link] (2 responses)

How hard would it be to add a roff command so that man page files can individually choose/override what U+002D is subsequently rendered as for the manpage?

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 10:18 UTC (Wed) by tao (subscriber, #17563) [Link] (1 responses)

What would be the point? Do you really think that manual writers who cannot be bothered with using \- would be bothered to add another command that specifies how U+002D should be rendered? If the manual page needs fixing anyway it might as well be fixed the right way instead of using a workaround.

Hyphens, minus, and dashes in Debian man pages

Posted Nov 5, 2023 11:13 UTC (Sun) by cpitrat (subscriber, #116459) [Link]

Why is this 'the right way'? I fail to see why having a default that interprets a character as another, visually indistinguishable one, and needing to escape it to really get it, is 'the right way'. It seems to me that the right way would be to have '-' mean '-' and '\-' mean the other one. Or, of course, directly type the Unicode char or its code with the proper escape sequence.

Why not do the same with other char? Use '\_' if you want a real underscore otherwise you get U+0332. Use '\i' for a i otherwise you get 'U+0049, U+0131'.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 26, 2023 8:46 UTC (Thu) by rsidd (subscriber, #2582) [Link] (16 responses)

> Apart from the £ sign (which is almost certainly Unicode) I have no idea how to access any other Unicode).

I have an .XCompose file set up to give me pretty much any Unicode symbol I typically use with a few keystrokes. Eg AltGr+L+= → ₤. AltGr+E+= → €. AltGr+R+= → ₹. All the Greek letters, most of the common math symbols, etc. It may take you half an hour to set up and you can keep adding, once you are used to it you won't go back. This is a good starting point.

I was a skeptic of using Greek letters like μ and δ in writing Julia code. But once I started, I found it is just much more readable that way.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 26, 2023 9:40 UTC (Thu) by geert (subscriber, #98403) [Link] (15 responses)

Most of these you can get by just enabling "Compose Key" in the "Keyboard" section of GNOME Settings.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 27, 2023 15:35 UTC (Fri) by gutschke (subscriber, #27910) [Link]

And if you happen to use a Chromebook, don't forget to install the Compose extension from the app store.

Hyphens, minus, and dashes in Debian man pages

Posted Nov 9, 2023 13:00 UTC (Thu) by Wol (subscriber, #4433) [Link] (13 responses)

> Most of these you can get by just enabling "Compose Key" in the "Keyboard" section of GNOME Settings.

Except my make.conf includes "-gnome -gtk".

Although I guess KDE/Plasma has something similar.

Cheers,
Wol

Hyphens, minus, and dashes in Debian man pages

Posted Nov 16, 2023 3:33 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

That GNOME Settings probably just does something like:

/usr/bin/setxkbmap -option lv3:ralt_switch_multikey

Hyphens, minus, and dashes in Debian man pages

Posted Dec 2, 2023 9:33 UTC (Sat) by ssokolow (guest, #94568) [Link] (11 responses)

System Settings → Input Devices → Keyboard → Advanced → Position of Compose key

Hyphens, minus, and dashes in Debian man pages

Posted Dec 2, 2023 14:45 UTC (Sat) by Wol (subscriber, #4433) [Link] (10 responses)

Thanks. So I thought I'd just take a look ...

Key to choose the 2nd level - "<>" - what on earth is that :-)

The keyboard I'm typing on has ",<" and ".>" but no "<>" key :-)

Mind you, it did tell me how to make num lock default to on - it's been a real pain that num lock seems to change now and again for no apparent reason ...

Cheers,
Wol

Hyphens, minus, and dashes in Debian man pages

Posted Dec 2, 2023 18:04 UTC (Sat) by jem (subscriber, #24231) [Link] (9 responses)

The "<> key" refers to the key between left shift key and the Z key (Y on a German keyboard and W on a French keyboard). Not all keyboards have this key, but if it is present it is usually labelled "<" (unshifted) and ">" (shifted).

If you don't have this key, you can always choose some other key from the 19 alternatives on the list.

Hyphens, minus, and dashes in Debian man pages

Posted Dec 2, 2023 18:48 UTC (Sat) by halla (subscriber, #14185) [Link] (8 responses)

I've never had, or even seen, a keyboard with a key between left shift and Z...

Hyphens, minus, and dashes in Debian man pages

Posted Dec 2, 2023 19:00 UTC (Sat) by gioele (subscriber, #61675) [Link] (7 responses)

> I've never had, or even seen, a keyboard with a key between left shift and Z...

The standard ISO layout (used everywhere except in USA and Japan) has a key between left shift and Z.

https://switchandclick.com/wp-content/uploads/2021/02/phy...

Hyphens, minus, and dashes in Debian man pages

Posted Dec 2, 2023 19:35 UTC (Sat) by halla (subscriber, #14185) [Link] (6 responses)

That "standard" ISO layout isn't used by any brand of keyboard, except Apple, in the Netherlands, where I live. I've never met anyone who got the "standard" ISO keyboard layout when buying a macbook in the Netherlands, though I'm sure they exist. I just have never met them.

Hyphens, minus, and dashes in Debian man pages

Posted Dec 2, 2023 20:40 UTC (Sat) by johill (subscriber, #25196) [Link] (5 responses)

It's really common (likely standard) for German layout keyboards ... I tend to switch mine to US in software for programming, but have multiple ISO/German layout keyboards, including most recently a split one (Keychron Q11).

Hyphens, minus, and dashes in Debian man pages

Posted Dec 3, 2023 0:08 UTC (Sun) by Wol (subscriber, #4433) [Link] (4 responses)

My keyboard (which looks like the standard UK layout) has that key, but it's "\|". To the best of my knowledge I've never seen a "<>" key ...

Cheers,
Wol

Hyphens, minus, and dashes in Debian man pages

Posted Dec 3, 2023 8:19 UTC (Sun) by gioele (subscriber, #61675) [Link] (2 responses)

> My keyboard (which looks like the standard UK layout) has that key, but it's "\|". To the best of my knowledge I've never seen a "<>" key ...

So you have seen the key, but you haven't seen it labeled "<>". :)

The technical name for that physical key, regardless of its legend (= printed label) is "1st main key of the B (= 2nd from the bottom) row". Common legends for it are "<>", "\|", "~`", "][", "«»", "^*".

Hyphens, minus, and dashes in Debian man pages

Posted Dec 3, 2023 10:51 UTC (Sun) by Wol (subscriber, #4433) [Link] (1 responses)

The problem is everyone thinks their layout is the standard layout :-)

For example the AZERTY keyboard is very common in Europe ...

That's probably why I find keyboards so confusing - the dominant culture is US, within Europe it's Germany, and nobody thinks to tell you how to remap a UK keyboard - if all the descriptions are in terms of the developer's local keymaps, we don't seem to have any UK developers ... :-)

Cheers,
Wol

Hyphens, minus, and dashes in Debian man pages

Posted Dec 3, 2023 11:30 UTC (Sun) by gioele (subscriber, #61675) [Link]

> For example the AZERTY keyboard is very common in Europe ...

Well, if Europe is France, then yes. :) The most common visual layout in Europe is QWERTY, followed by QWERTZ (German-speaking countries and Balkans). https://commons.wikimedia.org/wiki/File:Latin_keyboard_la...

> That's probably why I find keyboards so confusing - the dominant culture is US, within Europe it's Germany, and nobody thinks to tell you how to remap a UK keyboard

That is indeed a real issue.

Keyboards have different levels of abstraction (physical layout, visual layout, functional layout) and only the first levels are really standardized (and many different standards exist). And even the standards are often not followed. So it is hard to write documentation in a way that applies to a non US-centric audience.

I have in a radius of 20 meters from my chair at least 10 different keyboards, all of which are "almost" standard ISO, but each of them has a peculiarity (different physical shape, non-standard legends, extra functionalities...) that makes them non-standard.

Xorg/xkb tries to document all this variability using a declarative language (see xkbcomp/xkbprint) but no keyboard manufacturer I know of provides xkb data for their keyboard. (And in the end everything is an evdev keyboard these days, so...)

Hyphens, minus, and dashes in Debian man pages

Posted Dec 4, 2023 21:23 UTC (Mon) by tao (subscriber, #17563) [Link]

You've got to make yourself some Nordic friends. Most of their keyboards are likely to have a <> key.

Hyphens, minus, and dashes in Debian man pages

Posted Nov 9, 2023 12:36 UTC (Thu) by dwmw2 (subscriber, #2063) [Link] (13 responses)

For a standard British keyboard, hold down the right AltGr key, press lots of others (both with and without Shift held down). See what you get.

There are arrows on yuUi ←↓↑→, m is µ, S is §. Superscript numbers ¹²³ on 123...

Most typists don't need to see the basic letters on the keyboard in order to be able to type. Why would you need to be able to see these? ☺

Hyphens, minus, and dashes in Debian man pages

Posted Nov 9, 2023 13:03 UTC (Thu) by Wol (subscriber, #4433) [Link] (12 responses)

> Most typists don't need to see the basic letters on the keyboard in order to be able to type. Why would you need to be able to see these?

Standard British keyboard? Does such a thing exist any more? I think I have access to four British keyboards - my fancy ergonomic Logitech jobby, my two laptops (home and work), and my wife's laptop. All four keyboards appear to be different.

And I'm a 6-fingered typist. Comes from playing the guitar - my left hand can type, my right hand is two fingered hunt-n-peck :-)

:-)

Cheers,
Wol

Hyphens, minus, and dashes in Debian man pages

Posted Nov 9, 2023 13:20 UTC (Thu) by dwmw2 (subscriber, #2063) [Link]

"Standard British keyboard" didn't refer to the physical device on which you bash your fingers. It's about the standard keymap which you get when you install a Linux distribution and tell it what language and keyboard layout you want.

I'm fairly sure that whatever physical keyboard device I plug into my machine (within reason), if I press AltGr-m on it I'm going to get a µ, etc.

Hyphens, minus, and dashes in Debian man pages

Posted Nov 9, 2023 18:43 UTC (Thu) by mpr22 (subscriber, #60784) [Link] (10 responses)

> Standard British keyboard? Does such a thing exist any more?

In terms of layout? Yes, there is.

I've got one right in front of me, and another one in the "WEEE to get rid of" pile that I really need to clear down given I'm moving house soon. Both were purchased within the past five years. (The one in the WEEE pile is there due to negligent handling by mpr22, not due to negligent manufacture.)

They're from different manufacturers (neither of which is Unicomp) and have identical layouts to the Fujitsu FKB-4725 I had back in the late 90s-early 00s, apart from (a) having Windows keys and (b) the broken one having volume and power keys where the undamaged one has Foo Lock indicator lights.

(I'm a nine-finger typist; I learned to type before I ever laid hands on a stringed instrument.)

Even on a laptop, the distinguishing features are "double-height Return key; backtick left of 1; backslash between LSHIFT and Z; semicolon/colon, singlequote/at, and hash/tilde between L and RETURN; 2 has doublequote on it and 3 has £ on it".

Hyphens, minus, and dashes in Debian man pages

Posted Nov 9, 2023 22:48 UTC (Thu) by Wol (subscriber, #4433) [Link] (9 responses)

> backslash between LSHIFT and Z;

Except the laptop I'm typing this on has no key between LSHIFT and Z

> hash/tilde between L and RETURN;

hash/tilde is above a single-height return

And the fancy ergo logitech I'm using - while similar to the layout you describe - has some very weird keys.

2 / euro / at / double-quote

double-quote / at / single-quote

3 / pound / hash

4 / euro / dollar

and there's some more weirdos too ...

Cheers,
Wol

Hyphens, minus, and dashes in Debian man pages

Posted Nov 10, 2023 9:54 UTC (Fri) by farnz (subscriber, #17727) [Link] (6 responses)

I have a similar Logitech layout. The reason for those weird keys is that the keyboard has both the PC keysyms (right hand side of the key) and the Mac keysyms (left hand side of the key), along with some symbols that are found via AltGr on a PC keyboard or ⌥ / Option on a Mac keyboard.

It also has this thing of labelling all keys with names in lower case, which I've copied for the description below.

On my keyboard, the AltGr symbols are in unfilled circles for the PC keysym, and filled circles for the Mac keysym. So, the 4 key generates $ with shift, and € with alt gr on a PC, while on a Mac, it generates 4 or $ only. The 2 key is the other way round; on a Mac, it generates @ with shift, and € with opt ⌥, while on a PC, it generates " with shift.

And there are more complex keys, like the one to the top left, above tab. On a PC, I can get ` (on its own), ¬ (with shift) and | (with alt gr) from it, while on a Mac, it would give me § (on its own) or ± (with shift).

Hyphens, minus, and dashes in Debian man pages

Posted Nov 10, 2023 10:38 UTC (Fri) by dwmw2 (subscriber, #2063) [Link] (5 responses)

The button you call the 4 key produces a scancode, probably 33. Unless you mean the keypad 4 key, which might be 92.

The software receiving those scancodes may convert them to anything it likes, according to the software keyboard layout/configuration. Any relationship between the symbols generated and the pretty pictures which are painted on the keyboard is purely coincidental.
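A toy Python model of that idea, assuming nothing about real scancode values: the key identifiers `KEY_2` and `KEY_4` and the modifier name `level3` (AltGr on a PC, Option on a Mac) are placeholders, and the symbol tables follow the two-legend keyboard described earlier in this thread.

```python
# Hypothetical key identifiers stand in for real scancodes.
# "level3" is a made-up name for AltGr (PC) or Option (Mac).
PC_UK = {
    ("KEY_2", "none"): "2", ("KEY_2", "shift"): '"',
    ("KEY_4", "none"): "4", ("KEY_4", "shift"): "$", ("KEY_4", "level3"): "€",
}
MAC_UK = {
    ("KEY_2", "none"): "2", ("KEY_2", "shift"): "@", ("KEY_2", "level3"): "€",
    ("KEY_4", "none"): "4", ("KEY_4", "shift"): "$",
}

def translate(keymap, key, modifier="none"):
    """Look up the symbol for a key press; None means the keymap defines nothing."""
    return keymap.get((key, modifier))

# Same physical key press, different result depending only on the keymap.
print(translate(PC_UK, "KEY_2", "shift"), translate(MAC_UK, "KEY_2", "shift"))
```

The keyboard contributes only the key identifier; everything else lives in the table the software happens to load.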

On my Chromebook keyboard, the pretty picture above the 3 on the key to the left of that one is a #. But it generates a £ when I press it with Shift held down.

Hyphens, minus, and dashes in Debian man pages

Posted Nov 10, 2023 10:45 UTC (Fri) by farnz (subscriber, #17727) [Link] (4 responses)

Yep, but the default keymaps convert those scancodes to a specific set of symbols; my keyboard has two sets of keysyms printed on it, which makes it rather cluttered to look at, but that's Logitech's way of only producing one SKU for two markets.

The computer, of course, can't see the pictures; it relies on scancodes. But between OS defaults and my keyboard's HID descriptors telling the computer what it "should" do, the computer will do what I described unless I specifically tell the computer to use a non-default keymap.

Hyphens, minus, and dashes in Debian man pages

Posted Nov 10, 2023 12:21 UTC (Fri) by dwmw2 (subscriber, #2063) [Link] (3 responses)

> … the computer will do what I described unless I specifically tell the computer to use a non-default keymap

Right. That's what I was getting at in my first post when I described the things that the AltGr key does in the default keymap that I get when I install a Linux distribution in British English.

Hyphens, minus, and dashes in Debian man pages

Posted Nov 10, 2023 13:54 UTC (Fri) by farnz (subscriber, #17727) [Link] (2 responses)

I was responding to a different point; that of why a Logitech keyboard has pictures on keycaps that do not correspond to any character you can get from that key in a default setup of a single OS; it's done that way because if you share the keyboard between macOS and Windows (or macOS and Linux), you get different symbols in text input boxes from a default setup given the same scancodes.

Two orthogonal outcomes from the same situation (keyboard sending scancodes, and having pretty pictures in the hope your OS is set up to interpret the scancodes the way the keyboard maker thought it would). Although I do wish keyboard manufacturers would bring the Compose key back; I have it mapped as Shift-CapsLock (because who uses CapsLock as CapsLock), but I remember the good old days of a separate Compose keycap :-)

Hyphens, minus, and dashes in Debian man pages

Posted Nov 10, 2023 15:40 UTC (Fri) by dwmw2 (subscriber, #2063) [Link] (1 responses)

We just need keyboards with LED display keycaps, so the software can ensure that they do display the symbol which will result from pressing them. In real time, as modifiers and combining characters change...

Hyphens, minus, and dashes in Debian man pages

Posted Nov 11, 2023 1:10 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

Optimus keyboard debuted 15 years ago: https://www.artlebedev.com/optimus/maximus/

There have been other similar products, but they kinda all died. Mostly because experienced users just don't look at the keyboard.

Hyphens, minus, and dashes in Debian man pages

Posted Nov 10, 2023 17:23 UTC (Fri) by mpr22 (subscriber, #60784) [Link] (1 responses)

> hash/tilde is above a single-height return

Sounds like your laptop uses the American-style physical layout (and is thus not a standard British keyboard).

https://en.wikipedia.org/wiki/British_and_American_keyboards

Hyphens, minus, and dashes in Debian man pages

Posted Nov 11, 2023 11:21 UTC (Sat) by dwmw2 (subscriber, #2063) [Link]

Yes, obviously the physical layout is the US one.

But in software it is using the standard British layout, in precisely the context I first used that phrase in this thread. Which was nothing to do with the hardware.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 17:11 UTC (Mon) by branden (guest, #7029) [Link] (2 responses)

> Perhaps groff could partially reinstate the old behaviour? If the - character appears at the start of a word, that is, preceded by a space character, then output the ASCII - character. In the middle of a word (surrounded by nonspace), or even at the end of a word, it can indeed be typeset as a hyphen.

> I guess that wouldn’t work for ‘hyphenated’ long option names, which have - in the middle, so some more elaborate rule might be needed.

I attempted to anticipate suggestions like this.

"Many people who want to "solve" this issue forget (or ignore) that not every '-' is a minus sign. Some are actual hyphens, as in "long-term effects" and "word-aligned struct members". Trying to infer a distinction from white space adjacency also won't work. Consider the phrases "word- or byte-sized caching" and "object-based vs. -oriented programming". While sophistication with compound hyphenated affixes is seldom seen in man pages, we most often find it where a man page author has taken considerable care with their technical writing. Such pages are less likely than most to require revision with blunt instruments like regular expression-based global search and replace operations."

https://lists.debian.org/debian-devel/2023/10/msg00085.html

If I knew of an algorithm that would faultlessly figure out what the writer meant, I'd use it.

> Perhaps easier to fix the manpage sources after all.

That is my conclusion, even knowing that doing so is sure to be difficult, and to meet much resistance.
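A script cannot decide what the writer meant, but it can at least list the places a human should look at. The following Python sketch does only that; the name `unescaped_hyphens` and the sample text are invented, and the pattern deliberately ignores corner cases such as an escaped backslash directly before a hyphen or comments that start mid-line.

```python
import re

# A hyphen-minus that is not preceded by a backslash.
UNESCAPED = re.compile(r"(?<!\\)-")

def unescaped_hyphens(source):
    """Yield (line number, column, line) for each unescaped '-' in roff source."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        if line.startswith('.\\"'):  # skip whole-line roff comments
            continue
        for match in UNESCAPED.finditer(line):
            yield lineno, match.start() + 1, line

sample = "\n".join([
    r'.\" a comment - ignored',
    r".B apt-get",
    r"Run \fBapt\-get \-\-help\fR for long-term use.",
])
hits = list(unescaped_hyphens(sample))
for lineno, column, line in hits:
    print(f"{lineno}:{column}: {line}")
```

On the sample it reports both `apt-get` (probably a bug, since it is a command name) and `long-term` (a legitimate hyphen), which is exactly the judgment call that still needs a person.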

Distributors like Debian can of course shield their readers from these difficulties; I expected that, which is why the advice in the "PROBLEMS" file (to which Mr. Corbet helpfully linked) looks the way it does. On the downside, few distributors _ship_ this piece of documentation, so it is harder for distribution users to find than it could be.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 21:20 UTC (Mon) by kleptog (subscriber, #1183) [Link]

> If I knew of an algorithm that would faultlessly figure out what the writer meant, I'd use it.

I suspect ChatGPT (or some similar LLM) could do a pretty good job. The harder part I think would be getting all the patches merged (since you don't want to be doing this on the fly).

Hyphens, minus, and dashes in Debian man pages

Posted Nov 5, 2023 11:17 UTC (Sun) by cpitrat (subscriber, #116459) [Link]

But how harmful is it to have the wrong Hyphen in these cases? Compared to having the wrong one in command line arguments which means copy-pasting fails, searching fails, etc...?
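A short Python sketch of that asymmetry, using a made-up one-line option parser (`parse_flags`) rather than any real tool: text that contains U+2010 instead of the ASCII hyphen-minus neither parses as an option nor matches a search for the ASCII spelling.

```python
HYPHEN = "\u2010"  # Unicode HYPHEN, which looks like '-' in many fonts

def parse_flags(argv):
    """Toy option parser: only words starting with ASCII '-' count as options."""
    return [arg for arg in argv if arg.startswith("-")]

typed = "ls --all".split()
pasted = f"ls {HYPHEN}{HYPHEN}all".split()

print(parse_flags(typed))                # the option is recognised
print(parse_flags(pasted))               # nothing is recognised
print("--all" in " ".join(pasted))       # searching for the ASCII form fails too
```

A wrongly typeset hyphen in running prose, by contrast, changes nothing a reader would type or search for.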

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 17:43 UTC (Mon) by rra (subscriber, #99804) [Link] (25 responses)

pod2man has for years attempted to do this, and the current version still does. In the next release, I'm going to change it to always convert - to \-, because the heuristics don't work, sadly.

To see why, instead of thinking about options, think about UNIX commands. For example, consider the standard Debian command "apt-get" and also consider the English phrase "many other package managers are apt-like". The former must translate to the ASCII hyphen-minus for searching and cut and paste to work, since the executable on disk uses a hyphen-minus in the name. But the latter is correctly typeset with a hyphen, not hyphen-minus.

The roff input language does correctly distinguish, but because the output looks almost identical in most fonts the bug of using the wrong choice is almost impossible for most people to detect. It therefore occurs frequently, and would even if literally every person writing a man page knew about this problem, for the same reason that I make spelling errors regularly even though I know how to correctly spell words. For POD, the problem is worse: the input language simply does not distinguish. There is no \- equivalent in POD, and no heuristic that will correctly map cases like the above apt-get vs. apt-like. Therefore, the only safe thing to do is to convert all input - characters to the ASCII hyphen-minus. People who really want hyphens can mark their POD documents as UTF-8 and use a Unicode hyphen (which modern roff also handles correctly).
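To show what the blanket conversion amounts to, here is a minimal Python sketch. It is not pod2man's actual code; the function name `escape_all_hyphens` is invented, and it assumes the input has no roff escapes of its own, as is the case for POD text.

```python
def escape_all_hyphens(text):
    """Replace every '-' with the roff escape for an ASCII hyphen-minus."""
    return text.replace("-", "\\-")

print(escape_all_hyphens("apt-get"))
print(escape_all_hyphens("many other package managers are apt-like"))
```

Both phrases end up with the escaped form, so the command name stays searchable and pasteable at the cost of setting "apt-like" with a hyphen-minus rather than a true hyphen.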

I appreciate Branden's desire to tilt at windmills, being myself an occasional champion spin-jouster, and he is of course technically correct (which as we all know is the best kind of correct). But I stand by my personal opinion that this is a lost cause that will never be fixed properly, and attempting to get people to fix it properly is just going to annoy people without making much positive impact on the world.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 18:09 UTC (Mon) by branden (guest, #7029) [Link] (24 responses)

Hi Russ,

I think automatic man page source generators and human beings who fire up a text editor to write man(7) documents constitute different problem domains. You've stated your intentions with respect to pod2man in this area multiple times where I could see them and I have no objection. pod2man has a problem to solve and I don't want groff to inhibit your flexibility to solve it.

> For POD, the problem is worse: the input language simply does not distinguish. There is no \- equivalent in POD, and no heuristic that will correctly map cases like the above apt-get vs. apt-like. Therefore, the only safe thing to do is to convert all input - characters to the ASCII hyphen-minus. People who really want hyphens can mark their POD documents as UTF-8 and use a Unicode hyphen (which modern roff also handles correctly).

This seems fine to me. People seduced by the siren song of Unicode's giant character set may find themselves learning to distinguish these characters anyway, and those who don't--and who, moreover, may write the documentation hurriedly and with resentment--won't be troubled with it.

(That said, my exposure to POD documentation, generally in the Perl core, tells me that its quality is very high, suggesting that it is written by people who care enough to make it that way. But I've seen far too many terrible man pages to not complain about the status quo. And despite my affiliation with the GNU Project I emphasize that I don't endorse its [slowly fading] policy of man page deprecation, which I think has only contributed to the unhappy state of affairs by feeding shiftless programmers' indifference to writing documentation at all. Anything not worth doing is not worth doing well, no?)

If a person sits down to write a man page from scratch in a text editor, they will have things to learn, and in my opinion the hyphen/minus distinction is one of them. (As the original article suggested, there are in fact four other "ASCII" glyph distinctions to learn about.)

The theme of audience is also applicable to why I made this change in groff upstream. The GNU Project generally releases source archives, not binary packages. The primary consumers of groff releases from GNU are therefore, I would expect, people who already know of the package and desire to obtain it.

Distributions are different. Their users read man pages without even knowing that groff is involved. That is why it is important to me that (a) groff retain its customizability and (b) that its defaults be correct. groff man(7) actually went down this road before, about 15 years ago, when it first introduced the "utf8" output device. Screeches of clueless outrage erupted from the land then as they have now. Then-maintainer Werner Lemberg threw a blanket over the racket with the aforementioned character-remapping...but he did it in the macro file implementing (most of) man(7) itself, _not_ in the stock man.local file. I think that was an (innocent) error, as it suggested a stronger endorsement of doing this remapping-to-ASCII than was intended. (We discussed the revival of the old behavior on the groff list years ago. Werner, since retired as groff maintainer, felt much as you do; that it was technically correct to do what groff 1.23.0 shipped doing, but that the howls of frustrated man page readers and the commitment of a few vocal man page authors to their bad habits would be too much to endure.)

I figured the distributors are better placed to make this decision. I still think that. But the occasional storm in a tea kettle is the price, and on a slow news week, such weather can fuel an LWN piece.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 18:54 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (5 responses)

One thing that seems to be missing from the article (or at least, I couldn't find it): What exactly is the typesetting benefit of emitting a U+2010 (instead of a U+002D) at all, under any circumstances whatsoever, if ~all fonts render U+2010 approximately the same as U+002D anyway? Do screen readers pronounce them differently? Does one of the standard Unicode algorithms handle them differently (e.g. line breaking, BIDI, etc.)? Is there some other use case that is not obvious to me? Or are we literally doing this for no other reason than "that's what the spec says"?

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 19:15 UTC (Mon) by branden (guest, #7029) [Link]

> What exactly is the typesetting benefit of emitting a U+2010 (instead of a U+002D) at all, under any circumstances whatsoever, if ~all fonts render U+2010 approximately the same as U+002D anyway?

"Approximately" is doing a lot of work here. On the groff list we do see discussions and even complaints when dash-like symbols are drawn with incorrect lengths. People will notice, at least if their font doesn't hide the glyph distinctions from them (which brings its own problems, completely independently of anything to do with groff). https://www.unicode.org/faq/security.html

As I noted above, the GNU FreeFont's Mono face has good coverage and I've been using it happily, including much groff development activity, for years. People who dislike serifs might hate it, though.

> Do screen readers pronounce them differently?

I don't know for certain, but would expect so. One would not expect to hear "mother-in-law" pronounced with the interior punctuation called out. For Unix command-line and C language literals, you very much do. Those dashes (hyphen-minuses) are important.

> Does one of the standard Unicode algorithms handle them differently (e.g. line breaking, BIDI, etc.)?

groff doesn't apply the Unicode line-breaking algorithm (because it predates Unicode), but it does something similar. When researching this matter for Russ Allbery on the Debian list I discovered that essentially all man page formatters will break the line after a hyphen (*roff input: -) and none will after a minus sign (*roff input: \-). They're a little less consistent for things like em dashes.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 19:58 UTC (Mon) by rra (subscriber, #99804) [Link]

The distinction is more obvious in proportional-width typeset fonts. If you were looking at a high-quality printed book, for example, you may still not consciously notice the difference unless you're a font geek, but the hyphen will look better. The hyphen-minus is too long. (Well, unless the font itself intentionally collapses the difference, which it might. This area is murky; Unicode has a lot of separate glyphs for dashes for exactly this reason, and hyphen-minus is a compromise glyph that is all of those things and none of them, so it's somewhat up to the font to decide how to implement that compromise.)

Most man page views are in fixed-width fonts, though, and there the distinction is much less apparent, or occasionally nonexistent. With a fixed-width font, the character has to take up the same space regardless of the dash length, so the length variation is much less useful.

(Also, as Branden said, it affects line wrapping, which sometimes can matter a lot.)

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 8:25 UTC (Wed) by nim-nim (subscriber, #34454) [Link] (1 responses)

The basic “problem” man pages face is that text rendering goes forward. Screen pixel density goes up, as does font accuracy and unicode coverage.

This may not be obvious in a Linux console, which is basically stuck in the pre-Unicode past, where non-ASCII rendering is broken in various ways and where you force 1024×768 resolution because otherwise various things break hard. But in a graphical Linux (or Windows or OSX or Android) terminal that exercises screens at their full pixel density, using OpenType vector fonts that try to render Unicode (including its nuances) ever more accurately, encoding mistakes start to become visible and will become ever more visible as the years pass.

On a high-DPI phone screen and on many mid- to high-end computer screens, the tech is already capable of matching (and exceeding) traditional paper printing, so the traditional “it’s a problem for typography buffs that do paper print” excuse does not apply anymore.

Also, remember that people have needed translated man pages for a long time, so considering them an ASCII-only world is highly inaccurate.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 19:12 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

I'm not claiming that Unicode in general is somehow unnecessary (indeed, my original comment does not claim *anything at all* - it asks questions to which I honestly did not know the answers at the time).

But what I will say is this: When I have written content for the web, which (in the present day) very much is a high-DPI Unicode-friendly environment, I have never felt any need to use U+2010 HYPHEN. I have used the en dash, the em dash, and the minus sign, but all of those render very differently from U+002D HYPHEN-MINUS. I was just trying to understand why anyone wants U+2010 in particular, when it looks so similar to U+002D even in a proportional font. I mean, just look at them in the article. They are practically homoglyphs, and I had to lean really far in just to see that U+2010 is about half a pixel thicker (in my font, when subpixel hinting is enabled).

If there are in fact screen-reader benefits, then this probably shouldn't be written off entirely. OTOH, one could say the same about <em> and <i>. Screen readers can certainly benefit from distinguishing between italics for emphasis, and all other italics. But nobody on the web makes that distinction in practice, despite what the W3C and WHATWG recommend. Markdown, for instance, has two different syntaxes for italics, but I'm not aware of any well-known flavor of Markdown actually using one syntax for <em> and one for <i> (CommonMark specifies <em> for both, and most other Markdowns don't even bother telling you which one they emit unless you dig into the guts of the implementation).

The point in all this: You can't make authors care about semantics if they do not wish to care. From the perspective of the average author, U+2010 is just "U+002D, except if I use it in computer code, then it breaks things." They do not wish to know the difference between U+2010 and U+002D, and no amount of "well they should learn" is going to change their behavior.

Hyphens, minus, and dashes in Debian man pages

Posted Jan 7, 2024 23:18 UTC (Sun) by mirabilos (subscriber, #84359) [Link]

2010 (and 2011, which is used in cases like “foo-, -bar- and -baz-type”) are much narrower than 002D even in Fixed [Misc] for example.
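For anyone who can't tell these apart in their font, the code points mentioned in this thread can be listed by name using Python's standard `unicodedata` module. This is only a sketch; the helper name `describe` is made up here.

```python
import unicodedata

# Code points discussed in the article and in this thread.
DASH_LIKE = (0x002D, 0x2010, 0x2011, 0x2013, 0x2014, 0x2212)


def describe(codepoint: int) -> str:
    """Return e.g. 'U+2010 HYPHEN' for a code point (hypothetical helper)."""
    return f"U+{codepoint:04X} {unicodedata.name(chr(codepoint))}"


if __name__ == "__main__":
    for cp in DASH_LIKE:
        # Print the glyph too, so the rendering in the local font can be compared.
        print(chr(cp), describe(cp))
```

Whatever the glyphs look like on screen, the names make it clear which character a document actually contains.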

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 19:02 UTC (Mon) by rra (subscriber, #99804) [Link] (12 responses)

Right, I do understand. There is no doubt that roff makes this distinction and writing roff input correctly involves making this distinction. And maybe leaving it to distributions is the best approach. Certainly if you view your role as maintainer as the keeper of the original intent of the roff typesetting language, I certainly see how you arrive at that conclusion.

My arguments are, briefly:

1. It is very difficult to get this correct. Yes, there are a lot of things to learn when writing man pages, but bugs that cannot be caught by automated tools and don't produce visibly different output are extremely hard to eliminate. This is effectively a foot-gun in the roff language that authors will continue to get wrong because getting it wrong produces no visible effect.

2. The distinction is mostly drawback for the typical use of man pages. Most views of man pages are in contexts where the only distinction between hyphen and hyphen-minus is (maybe) whether roff does line breaking at that point, and with the (IMO correct) increasing trend of disabling full justification in man pages, this is a very minor benefit. The glyphs are otherwise essentially identical, but hyphen breaks searching and cut and paste. The positive benefits are mostly for troff output for printed material, and for man pages this is not a nonexistent use case but it is very rare. This is why I have dropped all of the pod2man transformations that were only useful for troff output; they were causing problems for nroff output and were essentially never used as intended.

3. The world has changed since roff was designed. This is not going to be persuasive if you see your role as preservation of the original roff intent, so to some extent this is a conflict of uses. You are maintaining the roff typesetting system, but most people writing man pages are just trying to present documentation to the user and don't care about the roff typesetting system as such. roff was designed in a world without Unicode, but we have Unicode now. If people want hyphens, or matched single quotes, there is now a fairly good argument they should just type the thing they intend using Unicode. I think if roff were invented today, \- would not exist and - would mean \- because roff would just use Unicode input and respect the characters the user entered.

I am very sympathetic to the argument that this should translate into roff preserving the original distinction by default, but all distributions disabling this distinction when processing man pages. I think that is a fairly reasonable compromise, although it does have the drawback of requiring all the distributions to duplicate essentially the same configuration work.
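The "breaks searching and cut and paste" point in argument 2 can be illustrated with a few lines of Python. This is a sketch with made-up sample strings, not output captured from any real man page: the two rendered lines look nearly identical in most fonts, but only one of them contains the text a user would type at a shell.

```python
HYPHEN_MINUS = "\u002d"  # what the shell and the file system use
HYPHEN = "\u2010"        # what a formatter may emit for an unescaped "-"

# Hypothetical rendered text: same words, different dash code point.
rendered_with_hyphen = f"Use apt{HYPHEN}get to install packages."
rendered_with_minus = f"Use apt{HYPHEN_MINUS}get to install packages."


def search(needle: str, rendered: str) -> bool:
    """Plain substring search, like typing a pattern into a pager."""
    return needle in rendered


if __name__ == "__main__":
    print(search("apt-get", rendered_with_hyphen))  # False
    print(search("apt-get", rendered_with_minus))   # True
```

Pasting from the first string into a terminal fails for the same reason: no executable has U+2010 in its name.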

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 19:57 UTC (Mon) by branden (guest, #7029) [Link] (11 responses)

> Right, I do understand. There is no doubt that roff makes this distinction and writing roff input correctly involves making this distinction. And maybe leaving it to distributions is the best approach. Certainly if you view your role as maintainer as the keeper of the original intent of the roff typesetting language, I certainly see how you arrive at that conclusion.

It's a factor. There are historical roff documents that I'd like to keep working nicely, as well as I can.

For example: https://github.com/g-branden-robinson/retypesetting-mathe...

That said, I do not consider myself beholden to bug-for-bug compatibility with AT&T troff (James Clark didn't, though he accommodated several), or to making the same decisions about issues in areas not even specified by CSTR #54, the "Troff User's Manual" (originally written by Ossanna in 1976, revised in 1992 by Kernighan). One relatively vocal subscriber to the groff mailing list sees me more as a heedless radical with meager respect for the wisdom of my superior ancestors. There is a certain exhilaration in juxtaposing that critique with yours.

> Yes, there are a lot of things to learn when writing man pages, but bugs that cannot be caught by automated tools and don't produce visibly different output are extremely hard to eliminate.

Guessing which glyph to use as seen in the examples on the debian-devel list, and here, appears to be an AI-hard problem.

> This is effectively a foot-gun in the roff language that authors will continue to get wrong because getting it wrong produces no visible effect.

I _do_ have some advice on this front: use a good font, one where glyphs for different code points look different.

> the only distinction between hyphen and hyphen-minus is (maybe) whether roff does line breaking at that point

I expect what will happen with pod2man specifically is that you'll use \- everywhere, people will notice that breaks stop happening in as many places, resulting in wide adjustments, then they will...

> and with the (IMO correct) increasing trend of disabling full justification in man pages, this is a very minor benefit.

...join team ragged-right margin. Well, fear not, I've actually made it easier for you to get what you want. https://git.savannah.gnu.org/cgit/groff.git/tree/NEWS?h=1...

People who have been turning automatic hyphenation off in the first place may also welcome that.

> but hyphen breaks searching and cut and paste.

I spend a lot of time reading man pages. I find myself not struggling over this issue. Maybe I'm weird.

> The positive benefits are mostly for troff output for printed material, and for man pages this is not a nonexistent use case but it is very rare.

Linux man-pages maintainer Alejandro Colomar and I are doing what we can to encourage people to rediscover typeset manuals. I've linked to the collected groff-man-pages PDF elsewhere in this discussion. Deri James is doing an invaluable service by helping us get man page cross references wired up to PDF hyperlinks. (Of course, you actually have to tell man(7) that something is a man page cross reference first, and thereby hangs a tale. https://git.savannah.gnu.org/cgit/groff.git/tree/NEWS?h=1... )

> The world has changed since roff was designed.

Certainly. There'd be no upset users and no LWN article to reply to if we weren't enjoying the blessing of Unicode support in our terminal emulators. So in a sense this all can be laid at Markus Kuhn's doorstep.
Nah, it's okay, people can keep blaming me. I'm used to it.

> This is not going to be persuasive if you see your role as preservation of the original roff intent, so to some extent this is a conflict of uses.

Not wholly. I also have a handful of new macros I want to introduce to the man(7) macro language. For groff 1.23.0, I settled on one, already linked to above.

The NEWS file entry for 1.23.0 is lengthy; I encourage anyone with any interest in groff to review it.

> but most people writing man pages are just trying to present documentation to the user and don't care about the roff typesetting system as such

People writing C programs are just trying to solve a problem and don't care about the programming language that much. (Okay, C has plenty of people who love it madly for its own sake. Consider substituting "JavaScript".)

If someone is going to pick up a tool to do a job, they're going to have to develop a competence with that tool. If plain text is all a person can manage, that is what they should write their documentation in.

(At this point in the discussion, champions of ReStructured Text and/or one of several not-quite-compatible dialects of Markdown typically appear on the scene, each claiming that their markup language is the _one_, _obvious_ way to write "plain text" such that it is suitable for conversion to richer formatting languages.)

I'm trying to reach and assist people who care about writing man(7) (and mdoc(7) for that matter) competently. If they don't care about that, I'm wasting my time, and they shouldn't waste theirs telling the world how man(7) _should_ be done.

> If people want hyphens, or matched single quotes, there is now a fairly good argument they should just type the thing they intend using Unicode. I think if roff were invented today, \- would not exist and - would mean \- because roff would just use Unicode input and respect the characters the user entered.

I don't agree, because you're forgetting about the minus sign. Unicode has a hyphen (U+2010) and a minus sign (U+2212), and "obviously", a person should input those code points for their distinct purposes.

This works great until someone needs to input a "literal" for an overloaded code point in the Basic Latin code chart that has syntactical significance to something like a shell prompt or a language compiler. Then they need that hen's tooth U+002D code point, even though it is meaningful _only_ for talking to computers, and not for any other domain of discourse. And that's not even taking into account the folks who ride in and want distinguishable en dashes, em dashes, figure dashes, and others the LWN article didn't mention. Fitting distinguishable glyphs for these into a half-width character cell even with a fair number of pixels in the horizontal dimension (say, more than 8) starts to become a real pickle.

So, no, I don't think "just use Unicode" is going to solve all the problems here at a stroke. It can help, but eventually people are going to need roff special characters or something like them so that they can tell unlike things apart with confidence. Even if they use a good font.

When you consider the problem space seriously, it turns out the WYSIWYG advocates are on team DWIM. And we know how well that turns out.

None of this is to tell you what you should do with podlators; the plan you've articulated seems fine to me. If, by some miracle, Alex Colomar and I convince more than a handful of people that the PDF man page experience is actually kind of nice, and some of those folks then turn to Perl docs and wonder what its story is, I'll be happy to help you come up with a new coat of paint to go over that layer of primer you just stripped it down to.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 20:16 UTC (Mon) by rra (subscriber, #99804) [Link] (7 responses)

> This works great until someone needs to input a "literal" for an overloaded code point in the Basic Latin code chart that has syntactical significance to something like a shell prompt or a language compiler. Then they need that hen's tooth U+002D code point, even though it is meaningful _only_ for talking to computers, and not for any other domain of discourse.

But this is exactly my point. If you need that code point, you enter that code point, and it should be typeset as that code point, without any second-guessing on the part of the typesetting system.

The second-guessing is there because in a pure ASCII world there was no alternative, because there were not ASCII code points for the different meanings. You therefore had to pick one of them to be the default and represent the other ones with escapes, and roff decided, quite reasonably for typesetting and less reasonably for man pages, that a hyphen was the most common usage and should be the default and the other usage should use an escape. But the point of Unicode is that you no longer have the context collapse on input, because there are code points available to express your exact intent. Essentially, the use of the precise Unicode code point replaces roff escapes (down to being slightly more annoying to type).

We have been down this path already with quotes. In the pure ASCII world, we invented various conventions like `single quotes' or ``double quotes'' used in different typesetting systems, but now you can use correct Unicode quotes if you care about this distinction, and many editors will assist you in doing so to avoid the entry problem. I am dubious any newly-invented typesetting system today would try to overload ASCII quotes in the way that TeX did; instead, it would handle Unicode quotes correctly.

None of these solutions are ideal because the keyboard is not large enough and doesn't allow easily drawing these distinctions. But if you view all the various typesetting escapes as substitutes for not having the correct character on the keyboard, I would argue that using the correct Unicode character is the modern replacement. It works uniformly across multiple typesetting systems, so you don't have to relearn how to use it for each piece of software, and you are far more likely to have active editor assistance in making the entry easier.

In other words, no, I have not forgotten about minus. If you want a minus sign in typeset material, you should enter an actual minus sign, which is U+2212 and will pair correctly in the font with a plus sign. This is not a hyphen, is not an en-dash, is not an em-dash, and is not a hyphen-minus. These are all distinct characters used for different purposes in high-quality typeset material. If you are talking about programming languages, you may not want an actual minus sign because programming languages do not use actual minus signs. You may want a hyphen-minus, which should be typeset as such. (Unfortunately, this does create the problem that Unicode has only one plus sign, so you have to choose between fidelity and correct glyph matching between plus and minus signs if you are aiming for printed output. This choice is somewhat context-dependent, and one option would be to make those characters match in fixed-width fonts.)

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 3:17 UTC (Tue) by branden (guest, #7029) [Link] (6 responses)

Hi Russ,

Okay. I find little to argue with in this presentation. I think if one were to undertake a "man: The Next Generation", one would likely proceed exactly as you describe, and let Unicode do the heavy lifting of glyph distinction.

...at least as far as some kind of alpha or trial run. I suspect people would rapidly run into trouble with hyphenated phrases (e.g., "long-standing, Debian-specific patches"). As you say, the keyboard is not large enough. At some point we run into not a technological problem, but a human one; it's hard to make people care about typographical distinctions that they don't want to care about, especially if their horizons stretch no farther than a terminal window. If they think of Unicode as mainly a resource for dingbats and emoji, we're unlikely to make much headway in the matter.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 3:31 UTC (Tue) by rra (subscriber, #99804) [Link] (4 responses)

> As you say, the keyboard is not large enough. At some point we run into not a technological problem, but a human one; it's hard to make people care about typographical distinctions that they don't want to care about, especially if their horizons stretch no farther than a terminal window.

This is the point where my own struggles with problems like this over the last ten years have given me a lot of respect for the amount of thought that's gone into Unicode. They took a careful and pragmatic decision to provide code points that represent the ambiguous merged character, and then separate code points that more precisely indicate intent. This to some extent means that within a Unicode world, both options are possible and the document author gets to choose how much to care.

If you want very nice typesetting, you can use hyphens, minuses, and en-dashes in the ways they were intended to be used. If you want to be lazy and not think about it, you can use a Unicode hyphen-minus and you get a compromise character that looks "okay" and, importantly, is clearly marked as a semantic compromise. Any typesetting system gets the correct information that the user was either talking about code or decided not to care about the distinctions between dashes, and therefore the typesetting system probably shouldn't try to care more than the user did.

This is similar to what they did with apostrophe and single quotes: the preferred characters in Unicode are U+2018 and U+2019, and U+0027 is defined as a neutral character that is intentionally left ambiguous, for users who don't care enough to draw the distinction.

You can't force users to care. The best you can do is provide them with the tools and make it clear whether they chose to use them or not. (And indeed, despite knowing all of this, I always use neutral single and double quotes and a hyphen-minus, because I don't care enough. Although I have started using real em-dashes, and I will occasionally use a real en-dash, so maybe eventually I'll come around.)

I'm simplifying a bit, and the Unicode world is not quite as shiny as all of that. Typesetting and human languages are messy and there are still sharp edges and ambiguities. But it's a system that a whole lot of people put a whole lot of thought into, and the results embed more practical wisdom than I think people realize.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 3:55 UTC (Tue) by branden (guest, #7029) [Link] (1 responses)

> This is the point where my own struggles with problems like this over the last ten years have given me a lot of respect for the amount of thought that's gone into Unicode.

I concur with this. I don't think _anyone_ involved with groff development views Unicode as anything less than a tremendous boon to the sanity of glyph and character repertoires. (Oh, how I wish James Clark had decided to store groff characters internally as ints instead of C++ chars. But we'll get that refactored, knock wood.)

I have seen _one_ person grouse that apostrophes (however rendered) and right single quotation marks should be kept logically separate, and I have some sympathy for that view, because they _are_ logically separate--but it seems no English typesetting tradition ever sees fit to distinguish them in print. If I were to regard occasional man page authors as intransigent with respect to correct glyph choices, I dread to measure the inertia of commercial publishers.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 14:02 UTC (Wed) by smoogen (subscriber, #97) [Link]

As a side note. I want to say thank you to both rra and branden. A lot of conversations about fonts and layout can lead to ill-chosen words between participants, because even a slightly off font can cause the brain to think 'lion, get ready to fight'. This conversation had instead a lot of 'we agree', and 'we can agree to disagree', and also a LOT of documents I have not read. [I need to get a copy of the updated Kernighan troff manual to add to my Kernighan collection!]

Again thank you for teaching and making this conversation something enjoyable to read.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 7:26 UTC (Tue) by smurf (subscriber, #17840) [Link]

> the preferred characters in Unicode are U+2018 and U+2019

Depends on your locale; don't forget about U+201A. And then there's places where they use U+2039/U+203A … and other places where they use U+203A/U+2039. See https://en.wikipedia.org/wiki/Quotation_mark for even more enlightening examples.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 16:24 UTC (Tue) by gray_-_wolf (subscriber, #131074) [Link]

> This is the point where my own struggles with problems like this over the last ten years have given me a lot of respect for the amount of thought that's gone into Unicode.

Maybe in some areas. The whole Han unification thing is in my opinion still a mistake. Having to know what language the text is in in order to be able to render it correctly is... annoying.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 18:13 UTC (Wed) by jwarnica (subscriber, #27492) [Link]

The obvious solution here is that it is the responsibility of the author to type the correct thing, and of one's editor (and/or desktop environment, OS, etc.) to convert some keystroke into the right code point saved to disk. I include by reference JWZ's, er, rant on tabs, spaces and what the physical tab key does, as evidence that this is not a new problem (or solution). Further consider that a lot of word processors already figure out how to Do The Right Thing with quotes these days; pure "text" editors usually have some awareness of what is being written and, when configured to personal preferences, should be mostly able to figure out The Right Thing at the time of authorship.

Authors+authoring tools, who care, can be careful, once, and the 78 downstream tools never are allowed to second guess things. Authors+authoring tools who don't care... Well, then the 78 downstream tools at least do what they are directly told without any hackery, and the cause of the errors (if any) becomes clear: the human and/or the single tool they interact with.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 0:43 UTC (Tue) by jkingweb (subscriber, #113039) [Link] (2 responses)

> I'm trying to reach and assist people who care about writing man(7) (and mdoc(7) for that matter) competently. If they don't care about that, I'm wasting my time, and they shouldn't waste theirs telling the world how man(7) _should_ be done.

I am such a person (especially after reading this article), though I'm a complete neophyte. Thus far I've been writing Markdown and converting it using Pandoc, mainly because I had no idea where to begin to learn how to do things properly. And... I still don't. Should I start by reading man(7), or mdoc(7), or something else altogether? There seem to be many schools of thought (as is so common in the free software world), but I'd love an authoritative hand to point me in *one* direction, whatever it is.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 3:07 UTC (Tue) by branden (guest, #7029) [Link] (1 responses)

> Thus far I've been writing Markdown and converting it using Pandoc, mainly because I had no idea where to begin to learn how to do things properly. And... I still don't. Should I start by reading man(7), or mdoc(7), or something else altogether?

My recommendation is the groff_man_style(7) page in the groff 1.23.0 release. It attempts to bring the reader from a state of no knowledge about man(7) or roff to a point where they can write a man page. It's not quite a tutorial--it doesn't start with a skeleton page that you fill in, but it covers the basics first and then discusses each group of man(7) macros in approximately the order you're likely to need to use them. So it starts with `TH` and `SH` and their relatives, then covers paragraphing macros, then synopsis macros, then hyperlink macros, and finally font styling macros.

You can start on page 253 of the collected groff man pages PDF. https://www.gnu.org/software/groff/manual/groff-man-pages...

> There seem to be many schools of thought (as is so common in the free software world), but I'd love an authoritative hand to point me in *one* direction, whatever it is.

In the course of the past several years I've learned a great deal about the history of *roff and man pages, and I've attempted to reflect that learning in the content of groff's own man pages.

But even authoritative voices are not infallible, so if you find errors, I'd appreciate hearing about them. (I find that the adjective sits on me uncomfortably, in any case.)

Hyphens, minus, and dashes in Debian man pages

Posted Oct 30, 2023 22:33 UTC (Mon) by jkingweb (subscriber, #113039) [Link]

Thank you very much for the pointer. I found groff_man_style(7) very well written and an excellent introduction. After spending a couple of days transcribing my manual to groff and then mdoc, I decided to go with mdoc (despite its doing a poorer job of layout in a couple of places, especially with links; the groff .UR macro yields beautiful output), but I don't know if I would have been able to grasp the basics without first reading groff_man_style(7), so you have my gratitude!

Hyphens, minus, and dashes in Debian man pages

Posted Oct 28, 2023 1:41 UTC (Sat) by ceplm (subscriber, #41334) [Link] (4 responses)

> I think automatic man page source generators and human beings who fire up a text editor to write man(7) documents constitute different problem domains.

Absolutely, and I claim the proportion of the latter is pretty close to the proportion of programmers who write their programs in assembler. I don’t know the exact proportion, but the number of people who write manpages in actual `man(7)` or `mandoc(7)` tends IMHO towards zero. Everybody uses some generator from some reasonable markup language (pod, markdown, rst).

Hyphens, minus, and dashes in Debian man pages

Posted Oct 28, 2023 9:21 UTC (Sat) by branden (guest, #7029) [Link] (1 responses)

> I [claim?] the proportion of the later ones is pretty close to proportion of programmers who write their program in assembler. I don’t know the proportion but number of people who write manpages in actual `man(7)` or `mandoc(7)` tends IMHO towards zero. Everybody uses some generator from some reasonable markup language (pod, markdown, rst).

It appears that you are unfamiliar with the work product of the Linux man-pages project and the activity of its mailing list.

https://lore.kernel.org/linux-man/

Of the 2,680 man pages that project maintains at current count, only one (bpf-helpers(7)) is maintained in a different markup language.

You also appear to be unfamiliar with the man page maintenance practices of the *BSD community. I haven't measured how many mdoc(7) pages they maintain, but I reckon it's the same order of magnitude.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 29, 2023 0:07 UTC (Sun) by ceplm (subscriber, #41334) [Link]

I don’t want to for a second disrespect anybody who writes manpages with man(7) or mdoc(7), and especially not the people who maintain awesome kernel, glibc and similar libraries man pages (and bash(1)). They are absolute heroes!

However, I suspect that there is some correlation between people who are willing to program in C and people who write manpages in their raw format. In my Pythonish part of the world, I have never met anybody who would have non-generated ones (e.g., https://pypi.org/project/sphinxcontrib-manpage or https://pypi.org/project/argparse-manpage), and yes, I am absolutely certain that the quality of such manpages is five flies down from the good ones.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 28, 2023 16:29 UTC (Sat) by rra (subscriber, #99804) [Link] (1 responses)

In addition to what Branden noted, it seems worth mentioning that the only markup language I'm aware of that both has the semantic richness to properly mark up a C library function and that converts easily to man pages is DocBook, which comes with its own set of problems. You can get sort of adequate results with POD or Markdown, but a lot of the details of the formatting will be poor.

The man macros in roff are not a bad markup language to write in directly. Writing directly in mandoc is even nicer. There are some weirdnesses and oddities, and I personally accept the loss of formatting flexibility and use POD (for possibly obvious reasons), but I would reach for writing roff directly long before I would tolerate the excessive verbosity and tedium of writing something in any XML- or SGML-based markup language. And most of the others just don't have the required detailed markup to do a good job with complex formatting.

I wouldn't recommend roff for a long technical document (either LaTeX or reStructuredText with Sphinx, depending on the nature of the document, are massively superior in my opinion), but it's not a horrible choice and you can get good output with it, and I think you'd be better off than with XML.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 29, 2023 0:10 UTC (Sun) by ceplm (subscriber, #41334) [Link]

See above … and yes reStructuredText is my preferred format for all my writing, so sphinxcontrib-manpage seems like the best tool for me, when I need a manpage (and it is not very often). Quite certainly I won’t be returning back to DocBook, that’s just too much pain for too little gain.

In my opinion, man page source documents are not the correct place to discard that information.

Posted Oct 23, 2023 15:57 UTC (Mon) by Wol (subscriber, #4433) [Link] (2 responses)

> Mapping all hyphens and minus signs to a single character, as people whose blood pressure spikes over this issue tend to promote as a first resort, is an ineluctably information-discarding operation.

Except, if the original input is a plain hyphen/minus (as it is in ASCII), surely converting it silently to a hyphen or minus is an information-corrupting operation. Surely discarding the corruption makes more sense !?!?

Computers should be "do as I say, not what *you* think I mean". If I type an ascii dash, please give me an ascii dash! Don't auto-corrupt it behind my back! If I want something else, make me ask for what I want!

Cheers,
Wol

In my opinion, man page source documents are not the correct place to discard that information.

Posted Oct 23, 2023 17:17 UTC (Mon) by dskoll (subscriber, #1630) [Link] (1 responses)

The problem is that troff started out as a typesetting tool and generally speaking, for typeset output you want a - on input to be turned into a ‐ (hyphen) on output, just as a human typesetter would do when typesetting typewritten source material. I believe LaTeX does the same thing, though it's much less noticeable because LaTeX doesn't produce output on a terminal.

FWIW, I religiously use \- in my man pages where I mean codepoint 002D. It's hard-coded in my muscle memory now.
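
A habit like this can be partly checked mechanically. The following is a minimal Python sketch, not an existing tool: the function name `find_unescaped_option_dashes` and the regular expression are illustrative assumptions, and the heuristic only catches a bare hyphen-minus that begins an option-like word, so it will miss some cases and may flag others wrongly.

```python
import re

# A bare "-" or "--" that starts an option-like word and is not preceded by
# a backslash, a word character, or another hyphen-minus.
# Heuristic only: this is an assumption for illustration, not a roff parser.
_BARE_OPTION = re.compile(r"(?<![\\\w\-])-{1,2}[A-Za-z0-9][\w\-]*")


def find_unescaped_option_dashes(source: str) -> list[tuple[int, str]]:
    """Return (line number, token) pairs for option-like words written
    with a bare hyphen-minus instead of the \\- escape."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _BARE_OPTION.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

Run over a page's source, this would report `--verbose` on a line such as `Run with --verbose`, while leaving `\-\-verbose` and `command-line` alone.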

In my opinion, man page source documents are not the correct place to discard that information.

Posted Oct 24, 2023 4:38 UTC (Tue) by branden (guest, #7029) [Link]

> I religiously use \- in my man pages where I mean codepoint 002D. It's hard-coded in my muscle memory now.

I find this issue closely analogous to = vs. == in C, and I think many of the same people who call troff's distinction between - and \- "stupid" also derogate their peers who manage to screw up the =/== distinction, deriding them as inexperienced newbies. It seems that knowing what you're talking about in C is a virtue, but in documentation it is a tedious waste of time. (I've met many specimens of brogrammer--perhaps I am unusually unfortunate.)

In my view, in fact, the C situation is _less_ excusable, because at the time Ken and Dennis came up with C, := as an assignment operator had been around at least since Algol 60 (so, a decade or more, and _everybody_ knew what Algol was), and there was certainly no problem finding the keys with which to type it. (I like Hillel Wayne's take on the matter: "Nowadays most languages use = entirely because C uses it, and we can trace C using it to CPL being such a trash fire.")

By contrast, you will require winning-lottery-ticket levels of luck to find distinguishable hyphen and minus keys on a keyboard.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 17:14 UTC (Mon) by butlerm (subscriber, #13312) [Link] (28 responses)

The only information that is being lost here is information that groff is making up out of thin air by translating an ASCII minus to something other than an ASCII minus.

They should quit doing that; it is not helpful to change one character into a different one by default. A macro or some other option should be required to do something out of the ordinary like that, in accordance with the principle of least surprise.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 17:34 UTC (Mon) by branden (guest, #7029) [Link] (27 responses)

> The only information that is being lost here is information that groff is making up out of thin air

No, groff's behavior is consistent with every other implementation of troff in the world, including the original implementation dating back to about 1973, appearing in Fourth Edition Unix from Bell Labs.

> by translating an ASCII minus to something other than an ASCII minus.

There is no such thing as an "ASCII minus". The relevant standards documents call it a "hyphen-minus", which reveals the very problem you are trying to conceal with your poorly informed proclamation.

Strictly, ASCII didn't even give the characters names, except arguably the control characters.

ISO 8859/ECMA-94 did, and they call the character "hyphen-minus", as did Unicode 1.0 and every revision since.

https://www.ecma-international.org/wp-content/uploads/ECM...
https://www.unicode.org/versions/Unicode1.0.0/CodeCharts2...
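
For anyone who wants to check the names without opening the code charts, here is a small Python sketch using the standard `unicodedata` module. The helper name `describe` is made up for illustration; the names themselves come from the Unicode Character Database shipped with Python.

```python
import unicodedata

# Code points discussed in the article's table.
CODE_POINTS = [0x002D, 0x2010, 0x2013, 0x2014, 0x2212]


def describe(code_point: int) -> str:
    """Format a code point together with its Unicode character name."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


if __name__ == "__main__":
    for cp in CODE_POINTS:
        print(describe(cp))
```

U+002D comes back as HYPHEN-MINUS, not as either a hyphen or a minus sign.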

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 17:59 UTC (Mon) by butlerm (subscriber, #13312) [Link]

> No, groff's behavior is consistent with every other implementation of troff in the world, including the original implementation dating back to about 1973, appearing in Fourth Edition Unix from Bell Labs.

It is defective by design then, and they should fix it. Or apparently it was fixed, and they decided to break it.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 18:41 UTC (Mon) by ms-tg (subscriber, #89231) [Link] (16 responses)

I find these responses thoroughly befuddling.

Who are the user base that are benefitting from the typesetting of '-' hyphen-minus in the source of man pages into anything else?

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 19:04 UTC (Mon) by branden (guest, #7029) [Link] (15 responses)

> Who are the user base that are benefitting from the typesetting of '-' hyphen-minus in the source of man pages into anything else?

It is precisely that set of people who render man pages to output formats that distinguish hyphens from minus signs. It seems to come as a shock to some people that you can do things like render man pages as PDF.

https://www.gnu.org/software/groff/manual/groff-man-pages...

Save the specific format of PDF itself, this was the intention and practice of the people who brought us Unix man pages in the first place.

"The manual was intended to be typeset; some detail is sacrificed on terminals." (man(1), _Unix Time-Sharing System Programmer's Manual_, Eighth Edition, Volume 1, February 1985)

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 19:26 UTC (Mon) by tzafrir (subscriber, #11501) [Link] (2 responses)

I think it is more common to read man pages as HTML. The following pages seem to show hyphens as hyphen-minus characters (unless I read incorrectly):

https://manpages.debian.org/unstable/groff-base/groff.1.e...
https://www.man7.org/linux/man-pages/man7/groff.7.html

I do see a different character for an em-dash in the second one (—).

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 20:10 UTC (Mon) by branden (guest, #7029) [Link] (1 responses)

> I think it is more common to read man pages as HTML.

Possibly. I wish I had reliable statistics!

> The following pages seem to show hyphens as hyphen-minus characters (unless I read incorrectly):

Ingo Schwarze might thank me for pointing out that the back-end renderer that debiman uses for this purpose is mandoc(1), and so groff is not involved at all.

mandoc is indeed better in many cases at rendering man pages to HTML than groff is. I'm not happy about that, but it's my understanding of the status quo. grohtml(1), the relevant part of groff, is difficult to work on. I've fixed some bugs in it but it inherently attempts a much more ambitious thing than mandoc(1) does. groff's HTML support attempts to handle the full roff language. mandoc(1) avowedly does not, and Ingo swears it never will.

Hyphens, minus, and dashes in Debian man pages

Posted Jan 7, 2024 23:21 UTC (Sun) by mirabilos (subscriber, #84359) [Link]

My HTML-format manpages (like http://www.mirbsd.org/man1/mksh to complement the prior example) are vastly more read than the PDF-format ones.

(I generate them from catman pages though, which in turn are produced with nroff (not gnroff) and the BSD mdoc, man.old, me, ms, etc. macropackages.)

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 19:33 UTC (Mon) by butlerm (subscriber, #13312) [Link] (10 responses)

Given that a primary purpose of man pages is to document command line options and other things you can type at the keyboard it is at the very least rather more convenient for 'hyphen-minus' characters to be preserved rather than be converted into something else.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 20:04 UTC (Mon) by branden (guest, #7029) [Link] (9 responses)

> Given that a primary purpose of man pages is to document command line options and other things you can type at the keyboard it is at the very least rather more convenient for 'hyphen-minus' characters to be preserved rather than be converted into something else.

If only Ken, Dennis, Steve Bourne, Doug McIlroy, et al., had had the benefit of your wisdom...

https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/man/...

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 20:46 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (1 responses)

I think this is a rather uncharitable interpretation of butlerm's comment. It is fairly obvious, at least to me, that the comment was phrased in the present tense, and is about what is convenient for man(7) users *today*, rather than what might have made sense in the 70's.

(And yes, it is fair to point out that we do not have a time machine and cannot change the past. But it is also fair to point out that standards are paper. We can, and should, regularly ask whether the benefits of any given standard continue to outweigh its costs, so long as we remember to include backcompat in that cost/benefit analysis. This is how C got rid of trigraphs, for example.)

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 3:37 UTC (Tue) by branden (guest, #7029) [Link]

> I think this is a rather uncharitable interpretation of butlerm's comment.

I think it's about as charitable as statements like "groff is making up out of thin air" and "defective by design". It's the currency he seems to be comfortable trading in.

> It is fairly obvious, at least to me, that the comment was phrased in the present tense, and is about what is convenient for man(7) users *today*, rather than what might have made sense in the 70's.

But the specific case of - vs. \- is one area where the passage of time _hasn't_ made much of a difference. It was just as difficult to learn the distinction and operate the keyboard to produce these alternatives in the mid-1970s as it is today. Arguably worse back then, in fact, since the Bell Labs Unix room people all used Western Electric Teletypes, and my impression is that the force required to actuate the keys on those things was colossal compared to, say, an Apple Magic keyboard. (Apart from machine memory constraints and a baud rate that makes continental drift look like a test of special relativity, this may account for Ken Thompson and early Unix culture's preoccupation with extreme abbreviation.)

And as Russ suggested above, it's not like today's keyboards have separate, convenient hyphen and minus keys.

> But it is also fair to point out that standards are paper. We can, and should, regularly ask whether the benefits of any given standard continue to outweigh its costs, so long as we remember to include backcompat in that cost/benefit analysis.

Quite so, and that is what I have tried to do. Moreover, *roff and the man(7) language are not formally standardized anyway. (Some may consider this fortunate.) All we have is convention. I didn't see any reason to make historical man pages render incorrectly. I've collected them and use them (informally) to regression-test groff. Here's one (long, technical) example of a groff regression that I felt honor-bound to undo to keep compatibility with historical man pages even though I felt it was a technical detriment. https://lists.gnu.org/archive/html/groff/2022-06/msg00026...

Note the follow-ups, particularly https://lists.gnu.org/archive/html/groff/2022-06/msg00048... .

> This is how C got rid of trigraphs, for example.

Yes, and a good thing. Their whole purpose was to accommodate people whose keyboards couldn't even type the printable code points in ASCII (the true, 7-bit, ANSI version, as opposed to early revisions of ISO 646, or any form of ISO 8859, which people slovenly call "ASCII").

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 7:11 UTC (Tue) by butlerm (subscriber, #13312) [Link] (6 responses)

> If only Ken, Dennis, Steve Bourne, Doug McIlroy, et al., had had the benefit of your wisdom...

Apparently the Debian maintainer finds it immensely more practical to continue mapping hyphen-minus to hyphen-minus in the man macro package, so it is hard to see what the audience is for treating hyphen-minus as something other than hyphen-minus in man pages.

Somewhere in the development history of that package someone decided it was a useful thing to treat hyphen-minus as hyphen-minus and now there is a breaking change that distributions apparently cannot adopt in practice to revert that behavior because thousands of man pages have (inadvertently) come to rely on it.

Perhaps it was a mistake to allow hyphen-minus to map to hyphen-minus when the standard was for it to do something else, but it appears to be an accommodation that is now almost inevitable - and indeed almost as if an invisible hand had restored the natural default mapping of a character to itself notwithstanding what was more convenient five decades ago - at least for man pages.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 8:35 UTC (Tue) by cjwatson (subscriber, #7322) [Link] (1 responses)

People shouldn't interpret me as being opposed to Branden's technical goals; I just have unfortunately finite time. I do take care to make the distinction in the *roff documents I write, and I think others should do the same for the sake of better printed output. As far as I know the only real point of difference is that I don't want to externalize the costs of better printed output onto the readers of manual pages in terminals (and even Branden has some sympathy with that position when I put it that way, I think).

I'd also like to clarify that the change I made in the Debian packaging is only for manual pages _rendered in terminals_ and not for things like PDF output.
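
To make the terminal-only scope concrete, here is a toy Python model of the two behaviours under discussion. It is a sketch under stated assumptions, not groff's implementation: the function `render_for_terminal` and its `workaround` flag are hypothetical, and the mapping is simplified to what this thread describes (a bare `-` typeset as U+2010, `\-` giving U+002D, and the distribution workaround mapping both to U+002D).

```python
HYPHEN_MINUS = "\u002d"  # what the keyboard produces
HYPHEN = "\u2010"        # what a bare "-" is typeset as, per this thread


def render_for_terminal(source: str, workaround: bool) -> str:
    """Toy model of terminal output for "-" and "\\-" in man page text.

    workaround=True imitates a distribution remapping a bare "-" to
    U+002D; workaround=False leaves it as a typographic hyphen.
    Assumption: no other roff syntax is interpreted.
    """
    out = []
    i = 0
    while i < len(source):
        if source.startswith("\\-", i):
            out.append(HYPHEN_MINUS)  # escaped form: always U+002D here
            i += 2
        elif source[i] == "-":
            out.append(HYPHEN_MINUS if workaround else HYPHEN)
            i += 1
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

In this model, a page that writes `\-\-help` gives a copyable `--help` either way; only pages that wrote a bare `--help` depend on the workaround.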

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 10:24 UTC (Tue) by branden (guest, #7029) [Link]

Hi Colin,

Thank you for adding your perspective here.

> People shouldn't interpret me as being opposed to Branden's technical goals; I just have unfortunately finite time. I do take care to make the distinction in the *roff documents I write, and I think others should do the same for the sake of better printed output. As far as I know the only real point of difference is that I don't want to externalize the costs of better printed output onto the readers of manual pages in terminals (and even Branden has some sympathy with that position when I put it that way, I think).

I do, and I am sorry that your mailbox exploded in flames over this issue, particularly since I know my fire hose to be a high-volume one.

I can accept the interpretive frame that this was a matter of balancing externalities; when I first came to the issue, I asked myself, "well, how are people who *want* typographically superior man pages supposed to see the errors so that they can fix them?" Once one had set up a UTF-8 environment if necessary, and selected for one's terminal emulator a font that is not an outright impediment in this area, as implied by the LWN editor's OP, there were a few possibilities under the status quo ante (groff 1.22.4 and going back several years).

1a. Fork your distributor's package locally and maintain a patch against tmac/an-old.tmac (as it was then known). Remember to keep your forked package for any other machines where you want to do this work.

2a. Use dpkg-divert on /usr/share/groff/1.22.4/tmac/an-old.tmac and maintain a modified copy of that file. Remember to duplicate this diversion process on any other machines you install where you want to do this work. (Other distributions, I assume, have some equivalent to dpkg-divert.)

3a. Download groff from GNU and build and maintain an installation of it outside the packaging system. And, oh yeah, patch tmac/an-old.tmac there, too, damn it.

Now, as of groff 1.23.0, if a person wants to improve man pages in this respect (we few, we happy OCD-ridden few), their courses of action are as follows.

1b. Go to work right away, if your distributor hasn't changed groff in this respect.

2b. Modify the conffile /etc/groff/man.local and comment out the workaround. Experienced system administrators of my acquaintance are accustomed to backing up /etc, or at least checking it for things they don't want to lose when copying or migrating systems.

3b. Download groff 1.23.0 from GNU and build it and use it as-is.

These three alternatives each seem superior to their 1.22.4 counterparts to me.

There is a cost, yes. Judging by repology.org, the quantity of groff package maintainers in the world numbers between 10 and 100 (Fermi estimate). It might make me a jerkass to expect these folks to read the NEWS file when a release happens every few years, and to be up to the challenge of applying a small diff to a text file. But having some experience as a package maintainer, these didn't seem like onerous expectations to me.

And in case the point need be reiterated to onlookers, _you_ were not surprised by this change. (I haven't heard from any other groff 1.23.0 packagers on this point, though I have about other matters.) It was telegraphed and discussed literally years ago. The surprise, if it was one, was the vehemence of some users' response.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 15:41 UTC (Tue) by WolfWings (subscriber, #56790) [Link] (3 responses)

> Perhaps it was a mistake to allow hyphen-minus to map to hyphen-minus...

...or perhaps the mistake was ever mapping almost any unescaped ASCII character to anything except itself? Principle of least surprise.

I get it, decision made decades ago, but it's a random typographical gotcha to have this cross-mapped to another character by default, up there with Excel's "OH THAT'S A DATE!"-ism.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 17:27 UTC (Tue) by branden (guest, #7029) [Link] (2 responses)

> ...or perhaps the mistake was ever mapping almost any unescaped ASCII character to anything except itself? Principle of least surprise.

Then consider how surprising it would be to have to write hyphen\(hyminus, non\(hyman\(hypage, command\(hyline, cuts\(hyand\(hypastes, UTF\(hy8, line\(hybreaking, pre\(hyrelease, look\(hyalike, and device\(hyindependent, to name just a few examples of hyphenated words or phrases from this very web page.
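
To get a feel for that hypothetical, here is a short Python sketch that rewrites ordinary text the way an author would have to type it if a bare `-` were never treated as a hyphen. The function name `to_explicit_hyphens` is invented for illustration; `\(hy` is the roff special character escape for a hyphen used in the examples above.

```python
def to_explicit_hyphens(text: str) -> str:
    """Spell every hyphen-minus in `text` as the roff escape \\(hy."""
    return text.replace("-", r"\(hy")
```

Applied to a phrase like `cuts-and-pastes`, it yields `cuts\(hyand\(hypastes`, which is the typing burden being described.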

I begin to perceive that people aren't going to read the groff_char(7) man page no matter how many times I link to it, so I'll just quote it.

History

A consideration of the typefaces originally available to AT&T nroff and troff illuminates many conventions that one might regard as idiosyncratic fifty years afterward. (See section “History” of roff(7) for more context.) The face used by the Teletype Model 37 terminals of the Murray Hill Unix Room was based on ASCII, but assigned multiple meanings to several code points, as suggested by that standard. Decimal 34 (") served as a dieresis accent and neutral double quotation mark; decimal 39 (') as an acute accent, apostrophe, and closing (right) single quotation mark; decimal 45 (-) as a hyphen and a minus sign; decimal 94 (^) as a circumflex accent and caret; decimal 96 (`) as a grave accent and opening (left) single quotation mark; and decimal 126 (~) as a tilde accent and (with a half‐line motion) swung dash. The Model 37 bore an optional extended character set offering upright Greek letters and several mathematical symbols; these were documented as early as the kbd(VII) man page of the (First Edition) Unix Programmer’s Manual.

At the time Graphic Systems delivered the C/A/T phototypesetter to AT&T, the ASCII character set was not considered a standard basis for a glyph repertoire by traditional typographers. In the stock Times roman, italic, and bold styles available, several ASCII characters were not present at all, nor was most of the Teletype’s extended character set. AT&T commissioned a “special” font to retain their accustomed glyph repertoire.

A representation of the coverage of the C/A/T’s text fonts follows. The glyph resembling an underscore is a baseline rule, and that resembling a vertical line is a box rule. In italics, the box rule was not slanted. We also observe that the hyphen and minus sign were already “de‐unified” by the fonts provided; a decision whither to map an input “-” therefore had to be taken.

            A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
            a b c d e f g h i j k l m n o p q r s t u v w x y z
            0 1 2 3 4 5 6 7 8 9 fi fl ffi ffl
            ! $ % & ( ) ‘ ’ * + - . , / : ; = ? [ ] │
            • □ — ‐ _ ¼ ½ ¾ ° † ′ ¢ ® ©

The special font supplied the missing ASCII and Teletype extended glyphs, among several others. The plus, minus, and equals signs appeared in the special font despite availability in text fonts “to insulate the appearance of equations from the choice of standard [read: text] fonts”—a priority since troff was turned to the task of mathematical typesetting as soon as it was developed.

We note that AT&T took the opportunity to de‐unify the apostrophe/right single quotation mark from the acute accent (a choice ISO later duplicated in its 8859 series of standards). A slash intended to be mirror‐symmetric with the backslash was also included, as was the Bell System logo; we do not attempt to depict the latter.

         α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ σ ς τ υ ϕ χ ψ ω
         Γ Δ Θ Λ Ξ Π Σ Υ Φ Ψ Ω
         " ´ \ ^ _ ` ~ / < > { } # @ + − = ∗
         ≥ ≤ ≡ ≈ ∼ ≠ ↑ ↓ ← → × ÷ ± ∞ ∂ ∇ ¬ ∫ ∝ √ ‾ ∪ ∩ ⊂ ⊃ ⊆ ⊇ ∅ ∈
         § ‡ ☜ ☞ | ○ ⎧ ⎩ ⎫ ⎭ ⎨ ⎬ ⎪ ⌊ ⌋ ⌈ ⌉

One ASCII character as rendered by the Model 37 was apparently abandoned. That device printed decimal 124 (|) as a broken vertical line, like Unicode U+00A6 (¦). No equivalent was available on the C/A/T; the box rule \[br], brace vertical extension \[bv], and “or” operator \[or] were used as contextually appropriate.

Devices supported by AT&T device‐independent troff exhibited some differences in glyph detail. For example, on the Autologic APS‐5 phototypesetter, the square \(sq became filled in the Times bold face.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 27, 2023 1:21 UTC (Fri) by ms-tg (subscriber, #89231) [Link] (1 responses)

Is it possible to generate man pages without groff?

Hyphens, minus, and dashes in Debian man pages

Posted Oct 27, 2023 6:56 UTC (Fri) by branden (guest, #7029) [Link]

> Is it possible to generate man pages without groff?

If you mean "is it possible to *format* man pages with something other than groff", then options include (limiting myself to software projects that are maintained--albeit some at a very slow pace) the following.

1. Heirloom Doctools troff: https://n-t-roff.github.io/heirloom/doctools.html
2. mandoc: https://mandoc.bsd.lv/
3. Plan 9 from User Space troff: https://github.com/9fans/plan9port
4. neatroff: http://litcave.rudi.ir/neatroff.pdf

neatroff does not ship with a man(7) package, but you can configure it to use another troff's. I haven't tested this extensively.

Not all of these implement all the same extensions to the original man(7) dialect of 1979 that groff does. The groff_man(7) man page tracks such portability issues.

There are several other partial interpreters of man+roff, like mandoc, out there that produce HTML output exclusively. Most are of dubious quality and many are dead--no longer maintained. Several are unrelated but call themselves "man2html", and when discussing them, it is crucial to clarify which one you're talking about. Of those, Thomas Dickey's is probably the highest quality, but I have never rigorously evaluated it for completeness or correctness. https://invisible-island.net/scripts/man2html.html

(Hmm, LWN's comment previewer seems to think "-" characters are invalid in URLs, and so won't hyperlink them.

Maybe I should have spelled it "\-".)

Hyphens, minus, and dashes in Debian man pages

Posted Jan 7, 2024 23:20 UTC (Sun) by mirabilos (subscriber, #84359) [Link]

Yes, and I had to apply extra workarounds around GNU groff “features” to get not-broken URLs in them.

http://www.mirbsd.org/MirOS/dist/mir/mksh/mksh.pdf at least is an okay result.

It is typeset with the BSD mdoc macropackage, not the GNU one, thankfully.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 19:15 UTC (Mon) by tzafrir (subscriber, #11501) [Link] (3 responses)

Are there any non-man-pages documents typeset with groff in the documentation included in Debian (manually-generated)?

Hyphens, minus, and dashes in Debian man pages

Posted Oct 23, 2023 20:16 UTC (Mon) by branden (guest, #7029) [Link] (2 responses)

> Are there any non-man-pages documents typeset with groff in the documentation included in Debian?

Yes.

usr/share/doc/groff-base/meintro.ps.gz
usr/share/doc/groff-base/meintro_fr.ps.gz
usr/share/doc/groff-base/meref.ps.gz
usr/share/doc/groff-base/ms.ps.gz
usr/share/doc/groff-base/pdf/automake.pdf.gz
usr/share/doc/groff-base/pdf/msboxes.pdf.gz
usr/share/doc/groff-base/pdf/pdfmark.pdf.gz
usr/share/doc/groff-base/pic.ps.gz

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 10:30 UTC (Tue) by taladar (subscriber, #68407) [Link] (1 responses)

Any outside a groff package?

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 11:10 UTC (Tue) by branden (guest, #7029) [Link]

I see these on my Debian system. Just what I happen to have installed.

troffcvt: /usr/share/troffcvt/tc.me
troffcvt: /usr/share/troffcvt/tc.mm
troffcvt: /usr/share/troffcvt/tc.ms
ksh: /usr/share/doc/ksh/PROMO.mm.gz
ksh: /usr/share/doc/ksh/builtins.mm.gz
ksh: /usr/share/doc/ksh/sh.memo.gz
xterm: /usr/share/doc/xterm/ctlseqs.ms.gz
cvs: /usr/share/doc/cvs/cvs-paper.ms.gz

There are some dozens of historical Unix documents written variously in ms, mm, and me(7). Some of them have encumbered licensing (or their copyright status is unknown), and others, for instance from the old BSD PS1, PS2, USD, and SMM manuals, simply aren't packaged for Debian as far as I know.

The 150 or so Bell Labs Computing Science Technical Reports documents were all, to the best of my knowledge, composed with troff (the very earliest ones with nroff alone) but unfortunately the sources to these are seldom available (copyright encumbrance again). My understanding is that Doug McIlroy and Brian Kernighan in particular have kept pretty good track of their work artifacts, but for $reasons don't just slap them up online.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 26, 2023 17:27 UTC (Thu) by anton (subscriber, #25547) [Link] (4 responses)

No, groff's behavior is consistent with every other implementation of troff in the world, including the original implementation dating back to about 1973, appearing in Fourth Edition Unix from Bell Labs.

At that time troff output appeared on paper. Nobody cut and pasted from there to terminal input, and nobody computer-searched it for, say --some-option.

The nroff output at the time certainly did not contain a Unicode hyphen, because Unicode did not exist at the time. I expect that the nroff output in 1973 converted hyphen-minus into hyphen-minus where groff's nroff implementation in 2023 converts hyphen-minus into hyphens. In 1973 there probably was not much cutting and pasting, but I expect that you could search man pages already, and the use of hyphen-minus and other ASCII characters was already beneficial for that purpose.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 26, 2023 18:36 UTC (Thu) by branden (guest, #7029) [Link]

> At that time troff output appeared on paper.

At that time, so did nroff output. The Bell Labs CSRC, for reasons not completely clear to me, never bothered to work on support for character-cell video terminals (so-called "glass TTYs"). Part of this may have been due to the Western Electric Teletype's market position and AT&T's status as a (regulated) monopoly. In any event, by all accounts, Bell Labs Research Unix leapfrogged from paper Teletypes to the Jerq/Blit/DMD 5620 graphical terminal (which, I note, was still _branded_ "Teletype"). This is also why Seventh Edition Unix (1979) didn't have a pager program. A lot of their glass TTY support came in by merging back stuff from BSD in the 1980s, the Research Unix years. And as far as I know, _commercial_ AT&T Unix didn't take any more of that than they had to.

Support for character-cell video terminals fell to the Berkeley CSRG and to the commercial AT&T Unix concern, which seems to have been reorganized and rebranded about as often as happens in modern tech companies. Rivalry proliferated here: more(1) vs. pg(1), termcap vs. terminfo, Berkeley's anemic curses vs. AT&T's much more capable one (but locked up of course behind hefty license fees and shouted claims of trade secrecy). To take just one example, BSD curses only ever supported one form of highlighting: "standout/standend". AT&T curses, by the time of System V Release 4 (1989), supported several, following ISO 6429, and used a generalized attribute management data type. (Naturally enough, AT&T picked one that was too small.)

> Nobody cut and pasted from there to terminal input,

Indeed not, since the only selection buffer available was in the operator's brain.

Well, I suppose one could have used the Teletype's paper tape punch/reader attachment.

> and nobody computer-searched it for, say --some-option.

grep(1) existed since very early days. In fact it seems to have shown up in Fourth Edition Unix, just like troff itself. https://minnie.tuhs.org/cgi-bin/utree.pl?file=V4/man/man1...

(I could be slightly off there--troff's man page took much longer to show up than troff itself did, and the source code for some early editions of Unix remains lost.)

A few points: (1) the hyphen and minus were not de-unified in terminal output, only on the typesetter. (2) Typesetter output was not practically searchable; the C/A/T byte stream was not practical for such purposes. (3) Kernighan invented a text-based output format for device-independent troff. You _could_ search that, and observe the difference between a hyphen (written with '-' with either the 'c' command or the anonymous, optimized move-and-print command; see CSTR #97) and the minus sign, which required the 'C' command.

But, I would guess, few people apart from those troubleshooting a troff output driver ever looked at that output file format.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 26, 2023 18:40 UTC (Thu) by branden (guest, #7029) [Link] (2 responses)

> groff's nroff implementation in 2023 converts hyphen-minus into hyphens

This is an overgeneralization. Like AT&T troff, groff interprets an (unescaped) input hyphen-minus as a hyphen. Like AT&T troff, a device that doesn't have distinct hyphen and minus sign glyphs maps them to the same thing in output.

Some knowledge of the architecture of AT&T device-independent troff might be helpful to you.

https://www.hack.org/mc/texts/ditroff-kernighan.ps.gz

Hyphens, minus, and dashes in Debian man pages

Posted Oct 27, 2023 7:57 UTC (Fri) by anton (subscriber, #25547) [Link] (1 responses)

My point is that while in typeset output there may have been hyphens in the output in 1973, for output to an xterm or the like a - in input resulted in the same ASCII character in the output until people switched to Unicode locales (maybe a decade ago). And from what I read in the article and comments, with groff this did not change even at that time, because it still produced the ASCII character. Only groff-1.23.0 changed that. So the claim that this change is in line with 1973 behaviour is wrong as far as on-screen usage by most users is concerned.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 27, 2023 10:00 UTC (Fri) by branden (guest, #7029) [Link]

> My point is that while in typeset output there may have been hyphens in the output in 1973, for output to an xterm or the like

Your point reveals considerable historical ignorance.

(A) xterm did not exist in 1973. Even the DEC VT100 terminal that xterm originally emulated (when implemented circa 1984) would not go into production for another five years.

(B) Output to terminals, at Bell Labs in 1973, was largely to printing devices like the Western Electric Teletype Model 37. Being based on ink and paper, they were capable of constructive overstriking, a vanishingly rare feature of video terminals (storage-tube displays like the Tektronix 4014 excepted--and which troff treated like a typesetter anyway, see the old tc(1) command: https://www.unix.com/man-page/v7/1/TC/ ). Thus, anyone who started using nroff with a video terminal like the Lear Siegler ADM-3a (on which Bill Joy developed vi) made a more significant break with traditional hardware behavior than this is.

> resulted in the same ASCII character in the output until people switched to Unicode locales (maybe a decade ago).

15 years ago or more, depending on how adventurous one (or one's distribution) was. As I have implied elsewhere in this discussion, Unicode support in terminal emulators crossed a Rubicon. All of a sudden there were distinguishable hyphens and minus signs, and, worse--compared to the problem facing Bell Labs when they acquired the C/A/T phototypesetter--a third character, the hyphen-minus, was still retained. Several years ago I proposed adding a new special character, "hm", to correspond specifically and solely to U+002D, but none of the groff experts on its mailing list thought that was a good idea. And I no longer do, either; man page authors are even less likely to start typing "\(hm\(hmlong\(hmoption" than they are "\-\-long\-option". Worse, an "hm" special character would require implementation in all other man page formatters, and several of those are abandoned, or so indifferently maintained that there is no realistic hope of this job ever getting done.

> And from what I read in the article and comments, with groff this did not change even at that time, because it still produced the ASCII character. Only groff-1.23.0 changed that.

Wrong again. groff's "utf8" device was added in groff 1.16 (released 2000-05-23). https://git.savannah.gnu.org/cgit/groff.git/tree/ChangeLo...

groff's man(7) package did change to collapse the '-' and '\-' ordinary and special characters to the same thing for the utf8 device--get ready to wave your bloody shirt--nine years later, in January 2009. https://git.savannah.gnu.org/cgit/groff.git/commit/?id=98...

Evidently groff's maintainer at the time, Werner Lemberg, did not share your sense of alarm or urgency.

The subsequent release was groff 1.20, 2009-01-05. Interestingly, this was also the first release to which the GNU GPLv3 applied, the presence of which reliably sends Apple into prophylactic shock, so Mac OS X _never_ had this "fix", and for a considerable number of users we can measure a significantly greater historical longevity for mapping - and \- distinctly on groff's utf8 device. groff 1.23.0's behavior is thus a return to form in this sense.

I reiterate: I don't think the character translations introduced for man(7) on the utf8 device in groff 1.20 were an inherently bad idea; it just shouldn't have been done in tmac/an-old.tmac, but rather man.local, and I think it should have been commented out by default for the reasons you can see on this very page in my reply to cjwatson.

'[W]hen I first came to the issue, I asked myself, "well, how are people who *want* typographically superior man pages supposed to see the errors so that they can fix them?"'

The "solutions" proposed by practically everyone objecting to groff 1.23.0's behavior in this respect involve changing these defaults in a way that is much more tedious to override, and poorly serve people who want to locate and fix problems, which is why maintaining the '-'/'\-' distinction is appropriate in the source archives produced and hosted by the GNU Project. It's fine if distributors want to do something different; that is what Colin has done, and I welcome his decision if it reduces the number of ignorant harangues he (and I, as groff's Debian package co-maintainer) have to endure about it.
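For anyone who does want to locate and fix such problems in man page sources, here is a minimal Python sketch of the idea. The names `BARE_OPTION`, `find_bare_options`, and `escape_option` are made up for illustration, and the regular expression is only a heuristic: it flags option-like words written with unescaped '-' and will miss cases such as an option glued directly to a font escape.

```python
import re

# Heuristic (an assumption, not groff's own logic): one or two bare hyphens
# that start a word and are followed by a letter look like a command-line
# option. A preceding backslash, word character, or hyphen disqualifies it.
BARE_OPTION = re.compile(r"(?<![\\\w-])--?[A-Za-z][\w-]*")


def find_bare_options(roff_source: str) -> list[str]:
    """Return option-like words whose hyphens are not written as \\-."""
    return BARE_OPTION.findall(roff_source)


def escape_option(option: str) -> str:
    """Rewrite every '-' in an option name as the roff minus escape '\\-'."""
    return option.replace("-", "\\-")


if __name__ == "__main__":
    line = r"Use --long-option or \-\-other\-option, a well-known flag."
    for found in find_bare_options(line):
        print(found, "->", escape_option(found))
```

Ordinary hyphenated words such as "well-known" are left alone because the hyphen follows a letter.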

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 4:24 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (19 responses)

This illustrates pretty well the problem I'm having with the abuse of UNICODE and the excess of representations for visually similar characters. Characters should not convey semantics, only a representation. If you'd draw the same character with a pen on paper for different use cases, then it MUST be the same character (and don't tell me about 0 and O, these are both different and of different classes). Here we have fewer differences between multiple characters using different code points than between multiple instances of the same character written by the same person. When I'm writing "ls -l" I'm using the same "-" as in "-1" and mentally pronounce it like "minus", and unsurprisingly I'm using the key "minus" on my keyboard for this, so I'm certain it is the same minus character because my keyboard doesn't know what I'm thinking when I'm pressing that key.

It's unimaginable that humanity managed to reach a point where people could argue over the type of horizontal bar they'd have to use on display depending on the context when the same key is pressed on the keyboard and nobody cares at either ends!

At least reading the man pages in ASCII should fix the copy-paste problem...

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 7:38 UTC (Tue) by smurf (subscriber, #17840) [Link]

> nobody cares at either ends!

It might come as a surprise to you that there are people who *do* care, if only to make the output look slightly more reasonable when those pesky-hyphenated-word-expression-thingies want to span a line break.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 16:45 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (16 responses)

We have this problem with Turkish. If they had instead had different code points for "explicitly dotted lowercase i" vs. "Latin lowercase i", `toupper` might not need to care about the locale! There's also the Han unification where this was taken, but the three languages use similar-looking characters differently. There was some…disappointment about that decision.

I have questions for you about Cyrillic "а" vs. ASCII "a". I don't believe the former changes based on the font, but the latter definitely does (into the "o with a stem on the right" form generally). How about the upper case version where they generally *are* the same glyph rendering? This sounds vastly more complicated than Unicode as it is because it is trying to move all of the complexity into the operators instead of keeping it in the data. How is one supposed to capitalize an English string that quotes a Russian word like "гражданский"? Sorting becomes convoluted because anything that isn't English needs to decipher "is this 'Н' meant to be sorted like Cyrillic 'en' or English 'aich' in this context?" (hint: it is the Cyrillic "en").

> Characters should not convey semantics, only a representation.

So where do you stand on the tabs-vs-spaces debate then? Hopefully you don't care (beyond consistency) if that is your view. Do you use any ASCII control characters beyond `\n` and `\0`? These are nothing *but* semantic representations of things.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 17:15 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (15 responses)

I think you in part responded yourself to the question by saying that they don't capitalize similarly, hence they *are* different characters. Probably when you write with a pen in cursive you will not even draw them similarly due to the way they attach to their neighbors.

Here we're speaking about punctuation symbols that are drawn similarly with a pen, typed with the same key, indistinguishable when read on paper or on screen if you don't have the other ones to compare, etc. They *are* the same character, for the writer and for the reader. The simple fact that it becomes so confusing that you cannot copy-paste a simple command line anymore should indicate that it just went too far in the distinction when some absolutely insist on using different internal representations and resort to heuristics or rules to say "let's say that without a backslash we'll use this one and with a backslash it will be this one".

What will be the next step, left-justified vs centered vs right-justified hyphen/dash/minus ? Hyphen to use between upper case letters and another one to use between lower case letters ? A special shorter minus sign to be used in front of the zero because it looks nicer ? Maybe we'll reach a point where we'll need 256 bits to represent all character variants and it will be sufficient to simply indicate all pixels in a 16x16 matrix that will then be vectorized, and even then I'm not sure it will be sufficient for some.

> So where do you stand on the tabs-vs-spaces debate then?

There's no real "debate", rather preferences that are dictated by the largest consensus among initial authors of a project. Practices can evolve sometimes, but you'll note that tab is not a representation but a control character.

> Do you use any ASCII control characters beyond `\n` and `\0`?

Yes I do, but they're "control characters", which means that they're reserved encodings in byte streams that precisely escape the representation flow to act on the controls of the terminal. XON/XOFF and ESC (0x1B) are a perfect illustration of this by the way. No real character is associated with that.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 17:53 UTC (Tue) by branden (guest, #7029) [Link]

> I think you in part responded yourself to the question by saying that they don't capitalize similarly, hence they *are* different characters.

The price of this choice is a larger space of homoglyph attacks.

What Unicode attacks is a proper engineering problem: there is no solution that is optimal in all dimensions. Trade-offs must be made.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 19:18 UTC (Tue) by butlerm (subscriber, #13312) [Link] (13 responses)

Perhaps minus signs, dashes, and hyphens should be treated the same for matching purposes, as variants of the same character. Diacritical marks should be handled the same way - such that you have an underlying (alphabetical) character and a number of variants that all answer to an (ordinary) search for the base letter.
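A minimal Python sketch of what such matching could look like. The set of code points is the table from the article plus U+2011 (the non-breaking hyphen mentioned earlier in the thread); the names `DASH_LIKE` and `dash_insensitive` are hypothetical, and which characters count as "variants" is a policy choice rather than anything a standard prescribes.

```python
import re

# Hyphen-minus, hyphen, non-breaking hyphen, en dash, em dash, minus sign.
DASH_LIKE = "\u002d\u2010\u2011\u2013\u2014\u2212"
DASH_CLASS = "[" + re.escape(DASH_LIKE) + "]"


def dash_insensitive(search_text: str) -> re.Pattern:
    """Compile a literal search string so that any dash-like character in it
    matches every member of DASH_LIKE in the searched text."""
    parts = [DASH_CLASS if ch in DASH_LIKE else re.escape(ch) for ch in search_text]
    return re.compile("".join(parts))


if __name__ == "__main__":
    # A rendered man page line that uses U+2010 where the user types U+002D.
    rendered = "Use \u2010\u2010some\u2010option to enable it."
    print(bool(dash_insensitive("--some-option").search(rendered)))
```

The stored text is left untouched; only the comparison is loosened, so copy and paste still yields whatever the page contains.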

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 20:46 UTC (Tue) by branden (guest, #7029) [Link] (6 responses)

> Perhaps minus signs, dashes, and hyphens should be treated the same for matching purposes, as variants of the same character.

That's a fine idea; so good that there's already a facility in PDF for this (called CMap), and something similar is at work on this LWN web page too. When I searched for hyphens in Firefox, it matched all the alternative dash symbols in Mr. Corbet's exhibit as well.

Since less(1) is the 800-lb. gorilla of pagers (with a man page weighing nearly as much when printed), that might be a good place to see where this might be implemented. Terminal emulators are another possibility, since they generally have to know something about Unicode character properties anyway, and already have a giant data structure housing all of the grapheme clusters rendered at every character cell in the nominal window plus the scrollback buffer. But I won't hold my breath for xterm to implement a search dialog. ;-)

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 0:14 UTC (Wed) by butlerm (subscriber, #13312) [Link] (3 responses)

One step further would be for a paste target to do a conversion from allowed variants to canonical form. I don't really see any point in pasting any variant other than hyphen-minus into a text terminal or standard text box for example - at least not by default.

For a rich text editor the default should be reversed, i.e. to preserve variants unless the user prefers to paste as plain text, similar to what many rich text editors do already.
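As a rough Python sketch of that default, assuming a hypothetical `paste()` hook that a terminal or text box would call on incoming clipboard text. The mapping table is my own guess at "allowed variants" and is not taken from any existing toolkit.

```python
# Hypothetical policy table: which variants collapse to which ASCII character.
PLAIN_TEXT_MAP = str.maketrans({
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
    "\u00a0": " ",  # no-break space
})


def paste(text: str, rich_target: bool = False) -> str:
    """Return the text to insert: canonicalized for plain-text targets,
    untouched for rich-text targets."""
    if rich_target:
        return text
    return text.translate(PLAIN_TEXT_MAP)


if __name__ == "__main__":
    copied = "ls \u2212l\u00a0/tmp"
    print(repr(paste(copied)))
    print(repr(paste(copied, rich_target=True)))
```

Everything outside the table passes through unchanged, so text such as a "⚠" in a command argument would survive this particular policy.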

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 5:58 UTC (Wed) by donald.buczek (subscriber, #112892) [Link]

> I don't really see any point in pasting any variant other than hyphen-minus into a text terminal or standard text box for example.

I have a command line tool to detect and remove identified phishing email from our users mailboxes selectable by header fields. Just yesterday I've used

./delete_malware.py "" "⚠ Action Required"

(copied-and-pasted from my shell history)

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 11:44 UTC (Wed) by nim-nim (subscriber, #34454) [Link]

> I don't really see any point in pasting any variant other than hyphen-minus into a text terminal or standard text box for example - at least not by default.

Filenames can contain pretty much any Unicode codepoint, many apps will derive the file name from human text typed within the file (for example, song track titles, document title, author name, etc).

Some apps will even insist on their god-given right to use any random bunch of bytes in the filename, even when the byte combination is explicitly forbidden in UTF-8.

Thus cut and pasting any command that contains a filename can involve at least the full UTF-8 scope.
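To illustrate the "random bunch of bytes" case, here is a small Python sketch. The file name is invented, and `to_text`/`to_bytes` are illustrative helpers; they use the `surrogateescape` error handler, which is the scheme Python itself uses for file names on POSIX systems.

```python
def to_text(raw_name: bytes) -> str:
    """Decode a file name, carrying undecodable bytes through as lone
    surrogate code points instead of raising an error."""
    return raw_name.decode("utf-8", errors="surrogateescape")


def to_bytes(name: str) -> bytes:
    """Inverse of to_text(): restore the original bytes exactly."""
    return name.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"track-\xff.ogg"  # 0xFF never occurs in valid UTF-8
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        print("strict decoding fails:", err.reason)
    text = to_text(raw)
    print(repr(text), to_bytes(text) == raw)
```

Any paste-time "canonicalization" that rewrites characters risks turning such a name into one that no longer refers to the file.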

Hyphens, minus, and dashes in Debian man pages

Posted Oct 26, 2023 21:12 UTC (Thu) by Wol (subscriber, #4433) [Link]

> One step further would be for a paste target to do a conversion from allowed variants to canonical form. I don't really see any point in pasting any variant other than hyphen-minus into a text terminal or standard text box for example - at least not by default.

Or, if you do a right-click-paste, one of the options should be "paste as ascii (or 8-bit)" (which could be terminal-sensitive ie if it's a konsole or xterm or whatever) which would either convert the multiple variants to space or dash, accented characters to plain, etc etc, or just drop characters it can't convert.

Then at least what happens is under human control ...

Cheers,
Wol

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 5:52 UTC (Wed) by donald.buczek (subscriber, #112892) [Link] (1 responses)

> less(1) [...] might be a good place to see where this might be implemented.

getopt(3) and friends to regard any hyphen-like character as the option character? Would resolve one basic problem.

Whitespace-splitting of shells might also be candidates, because for some reason, non-breaking space variants seem to be slipping into command lines copied and pasted from email by our users, because they use webmail.
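The no-break space problem is easy to reproduce. This Python sketch uses the standard `shlex` module as a stand-in for shell word splitting (an approximation; real shells split on IFS) and compares it with Unicode-aware splitting. The function names are made up for illustration.

```python
import shlex


def split_like_a_shell(command_line: str) -> list[str]:
    """Split on ASCII whitespace only, roughly as a POSIX shell would."""
    return shlex.split(command_line)


def split_unicode_aware(command_line: str) -> list[str]:
    """Split on anything Unicode classifies as whitespace (no quoting rules)."""
    return command_line.split()


if __name__ == "__main__":
    pasted = "ls\u00a0-l /tmp"  # U+00A0 NO-BREAK SPACE between "ls" and "-l"
    print(split_like_a_shell(pasted))
    print(split_unicode_aware(pasted))
```

With ASCII-only splitting, the command name becomes "ls", a no-break space, and "-l" fused into one word, which is why the pasted line fails with a confusing "command not found".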

Hyphens, minus, and dashes in Debian man pages

Posted Oct 27, 2023 17:32 UTC (Fri) by gutschke (subscriber, #27910) [Link]

This sounds good initially, until you start thinking about the security implications. I would not want to touch either getopt(3) or shell parsing and then have to demonstrate that I didn't open subtle security bugs in the process.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 3:14 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (3 responses)

In fact as a variant of this, I tend to think that's essentially a matter of rendering and it should be the text processors that use the one they want depending on surrounding characters. Look at LaTeX for example: you write "suffice" and it will emit the single ligature character "ﬃ" to make it look nice. The human should only know that they want to write a hyphen-minus-dash (in fact a short middle horizontal bar) and let the text renderer adapt it for screen or paper. It would then not cause any copy-paste issues since it would always be the same character.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 9:29 UTC (Wed) by nim-nim (subscriber, #34454) [Link] (1 responses)

There are coding-oriented fonts on the market that apply the same ligature tricks to prettify code as normal fonts do for human text (as in your example).

Those fonts are *more* sensible to encoding errors not less. If you use the wrong kind of hyphen/dash/minus they will make ligature mistakes.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 9:31 UTC (Wed) by nim-nim (subscriber, #34454) [Link]

susceptible not sensible sorry for the gallicism (wtarreau and me will auto-correct, other readers not)

Hyphens, minus, and dashes in Debian man pages

Posted Oct 26, 2023 4:25 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

The Unicode Consortium has said that presentational forms like that will only be added for backcompat reasons, and will not be encoded in the future (because they're considered rich text). If you want to emit a *glyph* that looks like that, rendered as (e.g.) a sequence of SVG or (E)PS curves, then sure, do whatever, but don't try to represent it as plain text (because you will eventually want to add a ligature that Unicode does not support, and when you ask them for it, they will say "no").

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 9:24 UTC (Wed) by nim-nim (subscriber, #34454) [Link]

Minus signs, dashes, and hyphens definitely have differences when drawn by hand or rendered in anything but an ASCII-oriented terminal.

Dashes are long, minus signs have the same width as plus signs, hyphens are short.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 11:36 UTC (Wed) by Sesse (subscriber, #53779) [Link]

If you use the Unicode Collation Algorithm (UCA) with some sufficiently low “strength”, you'll get this automatically. (E.g. the most lenient setting is case- and accent-insensitive, so that you can search for e and match É. This is what browsers typically do for Ctrl-F, AFAIK, although in an asymmetric variant. - and – almost certainly match on this level, although I haven't checked the tables.)
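Python's standard library does not ship the UCA, so the following is only a rough stand-in for a low-strength comparison, not the algorithm itself: decompose, drop combining marks, and case-fold. `loose_key` is a hypothetical name, and this sketch makes no claim about how dashes compare under the real collation tables.

```python
import unicodedata


def loose_key(text: str) -> str:
    """Approximate a case- and accent-insensitive comparison key.
    This is NOT the UCA; it ignores tailoring and most of its rules."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


if __name__ == "__main__":
    print(loose_key("\u00c9") == loose_key("e"))  # É versus e
    print(loose_key("Na\u00efve") == loose_key("naive"))
```

A real implementation would use an ICU binding or similar and select the collation strength there.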

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 17:26 UTC (Tue) by mpr22 (subscriber, #60784) [Link]

> the abuse of UNICODE and the excess of representations for visually similar characters.

The "multiplicity of representations" concern has existed since the publication of ISO 8859-5:1988 and ISO 8859-7:1987.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 5:44 UTC (Tue) by pabs (subscriber, #43278) [Link] (1 responses)

Is anyone working on fixing the fonts that don't properly render the different characters differently?

Hyphens, minus, and dashes in Debian man pages

Posted Oct 24, 2023 13:56 UTC (Tue) by JanC_ (guest, #34940) [Link]

There really is no typographic reason to render them all differently (some should be, of course, like ‘em dash’ vs. ‘en dash’, but for others that would be optional, or even questionable in some cases).

It might be useful for code fonts used to edit (manpage) sources, but there are also rendering options (colour, style, …) that code editors can use to highlight “special” characters (as they often already do).

I love this

Posted Oct 24, 2023 15:07 UTC (Tue) by mattdm (subscriber, #18) [Link]

This is going to sound a little facetious, but I'm completely serious: this is the kind of article I can't imagine anyone but LWN providing. Thank you!

I love this

Posted Oct 24, 2023 15:07 UTC (Tue) by mattdm (subscriber, #18) [Link]

I know this is going to sound like I'm being facetious, but this is exactly the kind of article I love to see on LWN that I can't imagine being written anywhere else.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 25, 2023 0:30 UTC (Wed) by neilbrown (subscriber, #359) [Link] (13 responses)

I was searching in the man page for the "match" operator because I wasn't sure how to invert it...
I couldn't find "~". I eventually found the documentation I wanted which suggested I use "!˜".
But when I try that, I'm told
awk: cmd. line:1: ^ invalid char '�' in expression

The "char" it identifies is the first byte of the UTF-8 encoding.

There is certainly room for improvement here.... I guess I should send a patch.
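For reference, a short Python sketch that shows why awk chokes. It assumes the character in the rendered man page is U+02DC (SMALL TILDE) rather than the ASCII tilde; `describe` is just an illustrative helper.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, bytes]:
    """Return the code point, Unicode name, and UTF-8 bytes of one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8"))


if __name__ == "__main__":
    print(describe("~"))       # what awk expects in "!~"
    print(describe("\u02dc"))  # what was (by assumption) pasted instead
```

The ASCII tilde is a single byte, while the small tilde is a two-byte sequence whose lead byte is not valid on its own, which matches the replacement character in the error message.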

Hyphens, minus, and dashes in Debian man pages

Posted Oct 26, 2023 12:01 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (12 responses)

If you are cursed with a Windows or Java-oriented text editor it will often add a starting BOM to UTF-8 files even though UTF-8 has no need of it and it breaks processing right and left.

But Java and Windows are historically UCS-2/UTF-16 environments with half-assed UTF-8 support.
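A small Python sketch of the BOM breakage and one way around it. The script content is invented; `utf-8-sig` is the standard codec that drops a leading BOM if one is present, and the helper names are illustrative.

```python
import codecs


def decode_keep_bom(data: bytes) -> str:
    """Plain UTF-8 decoding: a leading BOM survives as U+FEFF."""
    return data.decode("utf-8")


def decode_strip_bom(data: bytes) -> str:
    """UTF-8 decoding that silently drops a leading BOM, if any."""
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    saved_by_editor = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"
    print(decode_keep_bom(saved_by_editor).startswith("#!"))
    print(decode_strip_bom(saved_by_editor).startswith("#!"))
```

With the BOM in place the file no longer begins with "#!", which is one of the ways it breaks processing.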

Hyphens, minus, and dashes in Debian man pages

Posted Oct 27, 2023 8:08 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (11 responses)

Windows is kind of in an unfortunate place with Unicode. They have to contend with:

* Files that are UTF-16LE, no BOM included because it is/was the OS's native encoding.
* Files that are some Windows code page, often but not always 1252, no BOM included because it's not Unicode and the BOM doesn't exist.
* Applications that can choose at compile time whether they want to do this fancy-pants Unicode (UTF-16) thing, or use one of those weird code pages instead. This may sound trite, but it means that (among other issues) the entire filesystem needs full backcompat with non-Unicode-aware apps (solved by pulling out the old FOO~1 trick).
* Applications that set their code page to UTF-8, and then proceed to use the "non-Unicode" legacy API with a Unicode encoding. Microsoft even recommends doing this.[1]
* A file format (plain text) that has no encoding information (indeed, no out-of-band metadata whatsoever).
* An outside world that is (at least for the most part) completely hostile to non-Unicode encodings, and increasingly unwilling to accommodate UTF-16.

NTFS does actually have the necessary facilities to smuggle an encoding declaration in e.g. an alternate data stream. But the TXT file extension is older than NTFS, so I can't exactly blame them for not using a time machine here.

[1]: https://learn.microsoft.com/en-us/windows/apps/design/glo...

Hyphens, minus, and dashes in Debian man pages

Posted Oct 28, 2023 15:29 UTC (Sat) by wtarreau (subscriber, #51152) [Link] (10 responses)

It's indeed a real mess.

I've always been very angry at the encoding UTF-8 uses because it was purposely made to be transparent to 7-bit encoding, and being designed by english-speaking people, they probably underestimated the amount of trouble it would cause to those already using code pages daily due to accents and extra letters. In addition, UTF-8 is known for being extremely inefficient for some languages like Chinese.

Ideally we'd need a different encoding that does *not* support 7-bit chars and recodes all of them using prefixes not part of ASCII code pages such as some control chars and 0x7F. This would make documents, file names etc non-ambiguous (old vs new format) instead of trying to be "mostly compatible". This "mostly compatible" aspect is a disaster because a same document tends to contain different encodings at different places when edited with multiple persons who didn't notice the problem. I've even seen a few times here in France some ads printed on paper with a few incorrect character sequences such as "Ã©" for "é" due to UTF-8 coding issues. This would not happen if no single char would appear correctly. Sure it would use a larger encoding for mostly 7-bit texts but for those using mostly non 7-bit it would be much better. Possibly it could even end up with 3 bytes if enough control codes were used and end up being of fixed size.
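The "Ã©" artifact is what UTF-8 bytes look like when read back as ISO 8859-1. A minimal Python sketch, with `misdecode` and `repair` as illustrative names:

```python
def misdecode(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode as UTF-8, then decode with the wrong single-byte encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Undo misdecode(), assuming nothing was lost along the way."""
    return garbled.encode(wrong_encoding).decode("utf-8")


if __name__ == "__main__":
    garbled = misdecode("caf\u00e9")
    print(garbled)
    print(repair(garbled))
```

The ASCII letters come through unharmed and only the accented one is damaged, which is exactly the "mostly compatible" failure being described.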

Hyphens, minus, and dashes in Debian man pages

Posted Oct 28, 2023 17:26 UTC (Sat) by Wol (subscriber, #4433) [Link]

Yup. This causes a big problem with utf-8 support on Pick/MV, because chars 0xff, 0xfe, 0xfd, 0xfc, 0xfb, and 0xfa are special, are found all over the place all the time, and worse are searched for all the time - they're sort of the Pick equivalent of the C 0x00 termination character.

Internationalisation and utf-8 causes massively nasty hacks to get round this problem ...

Cheers,
Wol

Hyphens, minus, and dashes in Debian man pages

Posted Oct 28, 2023 18:16 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> I've always been very angry at the encoding UTF-8 uses because it was purposely made to be transparent to 7-bit encoding, and being designed by english-speaking people, they probably underestimated the amount of trouble it would cause to those already using code pages daily due to accents and extra letters

Like, really? I speak several languages with non-Latin scripts, and UTF-8 is the best invention since sliced bread. At least I can edit the text in BOTH of my native languages at the same time.

> UTF-8 is known for being extremely inefficient for some languages like Chinese.

I happen to speak Chinese and I know a bit about its early computer history. The first "encoding" of Chinese used _five_ bytes for each character. UTF-8 uses 3 bytes for most Chinese characters, and UCS-2 uses 2 bytes. So "extremely inefficient" is completely misleading.
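The byte counts are easy to check. A small sketch, with two assumptions: Python has no "ucs-2" codec, so `utf_16_le` stands in for it (the two agree for characters in the Basic Multilingual Plane), and `encoded_len` is just an illustrative helper name.

```python
def encoded_len(text: str, codec: str) -> int:
    """Number of bytes needed to store text in the given codec."""
    return len(text.encode(codec))


sample = "中文"  # two common Chinese characters, both in the BMP

print(encoded_len(sample, "utf_8"))      # 3 bytes per character
print(encoded_len(sample, "utf_16_le"))  # 2 bytes per character

# The trade-off goes the other way for ASCII text.
print(encoded_len("man", "utf_8"))
print(encoded_len("man", "utf_16_le"))
```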

> This "mostly compatible" aspect is a disaster because the same document tends to contain different encodings in different places when edited by multiple people who didn't notice the problem.

Perhaps you should take your part and stop using weird encodings?

I honestly have not seen any problems with incorrect text rendering related to UTF-8 within the last 10 years. And I use non-Latin scripts constantly.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 28, 2023 18:26 UTC (Sat) by Wol (subscriber, #4433) [Link] (4 responses)

> I honestly have not seen any problems with incorrect text rendering related to UTF-8 within the last 10 years. And I use non-Latin scripts constantly.

That's probably the problem, actually. Any mistakes in non-Latin scripts are obvious, and probably screw up the sentence. wtarreau's problems are with the occasional Latin screwup where it's 99% correct.

And, speaking from experience, what you really don't want is when the screwups are rare. Because they're rare, you notice them more, because you're conditioned to expect everything to be correct. For Cyrillic and Chinese scripts, they were probably fixed properly. For Latin, they were almost certainly "almost" fixed, for an infuriatingly near-perfect definition of "fixed".

Cheers,
Wol

Hyphens, minus, and dashes in Debian man pages

Posted Oct 28, 2023 19:16 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (3 responses)

As I understand it, wtarreau's problems with UTF-8 are the result of having to deal with a huge pile of ISO/IEC 8859-encoded text, some of which he isn't allowed to recode because it isn't his and some of which he doesn't want to recode because there's a lot of it.

(Also some philosophical objections to its design.)

Hyphens, minus, and dashes in Debian man pages

Posted Oct 28, 2023 21:14 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Honestly, not buying it. Just copy the files locally, run iconv and read the UTF-8. Especially if you need to _edit_ the files.
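For anyone without iconv at hand, the same recoding step can be sketched in Python. This assumes the input really is ISO 8859-1; the function names are hypothetical, and a real conversion would need to know the actual source encoding of each file.

```python
from pathlib import Path


def latin1_to_utf8(data: bytes) -> bytes:
    """Reinterpret ISO 8859-1 bytes as text and re-encode that text as UTF-8."""
    return data.decode("iso8859_1").encode("utf_8")


def recode_copy(src: Path, dst: Path) -> None:
    """Write a UTF-8 copy of src to dst, leaving the original file untouched."""
    dst.write_bytes(latin1_to_utf8(src.read_bytes()))


# 0xE9 is "é" in ISO 8859-1; in UTF-8 it becomes the two bytes C3 A9.
print(latin1_to_utf8(b"caf\xe9"))
```

Working on a local copy keeps the 8859-encoded originals intact, which matters when the files belong to someone else.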

Hyphens, minus, and dashes in Debian man pages

Posted Oct 29, 2023 8:47 UTC (Sun) by joib (subscriber, #8541) [Link] (1 responses)

Yes, IMHO in 2023 continuing to use 8859 locales creates more problems than just biting the bullet and switching. YMMV.

Funny UTF-8-related war story. A long time ago at a previous job, we migrated an NFS service from Linux servers to NetApps. After a while a user complained that some files had disappeared. It turns out that the Linux NFSv4 code treats filenames as a bag of bytes, but NetApp follows the RFC, which says that filenames must be valid UTF-8. So the problem was that the filenames in question were 8859-encoded. Mounting with NFSv3 and renaming the affected files fixed it (IIRC there's a tool called convmv that does this).
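Finding such names before a migration is straightforward if the directory is listed as raw bytes. A minimal sketch, assuming a POSIX-like system where file names are byte strings; both function names are invented for this example, and it only detects the problem rather than renaming anything the way convmv does.

```python
import os


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw file name decodes cleanly as UTF-8."""
    try:
        name.decode("utf_8")
    except UnicodeDecodeError:
        return False
    return True


def non_utf8_names(directory: bytes) -> list[bytes]:
    """List entries whose names are not valid UTF-8.

    Passing a bytes path makes os.listdir return bytes names,
    so nothing is decoded (or mangled) behind our back.
    """
    return [name for name in os.listdir(directory) if not is_valid_utf8(name)]


print(is_valid_utf8("café".encode("utf_8")))      # UTF-8 name
print(is_valid_utf8("café".encode("iso8859_1")))  # 8859-encoded name
```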

Hyphens, minus, and dashes in Debian man pages

Posted Oct 29, 2023 10:37 UTC (Sun) by Wol (subscriber, #4433) [Link]

But wtarreau is maintaining systems he has neither the authority nor ability to switch ...

"If it's working, DON'T TOUCH IT".

Cheers,
Wol

Hyphens, minus, and dashes in Debian man pages

Posted Oct 28, 2023 19:35 UTC (Sat) by mpr22 (subscriber, #60784) [Link]

> I've always been very angry at the encoding UTF-8 uses because it was purposely made to be transparent to 7-bit encoding,

If you want computer programmers with a non-IBM background to accept an encoding, having "transcode from ASCII" be something other than a no-op was always likely to result in referral to the reply in the case of Arkell v. Pressdram.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 29, 2023 9:09 UTC (Sun) by farnz (subscriber, #17727) [Link] (1 responses)

It's not just UTF-8 that's got 7-bit ASCII as a subset; Shift-JIS, KOI8-R, KOI8-U, all the ISO 8859 variants, EUC-CN, all the ISO-2022 variants, GBK, GB 18030, Big5, CNS 11643 and KS X 1001 all use ASCII as a subset. The only commonly used exceptions to the rule that 7-bit ASCII is represented by itself use a minimum of 2 bytes for all characters (UTF-16, UTF-32, TRON), or are themselves 7-bit or shorter.

And the 2-byte and longer encodings also represent ASCII as itself, but need a lead byte to indicate that the next character is ASCII. It's thus trivial to convert 7-bit ASCII to any commonly used encoding, and had UTF-8 bucked this trend, it would have been impossible for it to get traction - why use UTF-8 and have to do a complicated transcode from ASCII when you can stick to ISO 2022?
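The subset property can be spot-checked with the codecs Python ships. This is a sketch, not a proof: it tests one sample string, the codec names are Python's spellings (for example `gb2312` for EUC-CN), and the sample deliberately avoids backslash and tilde, which some Shift-JIS definitions map differently.

```python
ASCII_SUPERSETS = [
    "utf_8", "shift_jis", "koi8_r", "koi8_u", "iso8859_1", "iso8859_15",
    "gb2312", "gbk", "gb18030", "big5", "euc_kr",
]
FIXED_WIDTH = ["utf_16_le", "utf_32_le"]


def ascii_is_identity(codec: str, sample: str = "man -E ascii") -> bool:
    """True if encoding an ASCII sample with codec yields the same bytes as ASCII."""
    return sample.encode(codec) == sample.encode("ascii")


for codec in ASCII_SUPERSETS + FIXED_WIDTH:
    print(f"{codec:12} {ascii_is_identity(codec)}")
```

The two fixed-width codecs are the ones that report `False`, matching the "minimum of 2 bytes for all characters" exception above.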

Hyphens, minus, and dashes in Debian man pages

Posted Oct 29, 2023 12:43 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

> KOI8-R, KOI8-U

These are even more interesting: they place graphically or phonetically similar characters ("A" and "а", "F" and "ф", etc.) into the same positions modulo 128. So if the 8th bit is lost, the text can still be somewhat readable. It's a clever hack, but I'm glad that it's no longer needed.
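The trick is easy to see by simulating a 7-bit channel. A short sketch using Python's `koi8_r` codec; `drop_high_bit` is an illustrative name, and it assumes the input contains only characters that KOI8-R can encode.

```python
def drop_high_bit(text: str) -> str:
    """Encode as KOI8-R, clear bit 7 of every byte, and read the result as ASCII."""
    raw = text.encode("koi8_r")
    return bytes(b & 0x7F for b in raw).decode("ascii")


# Lowercase Cyrillic lands on uppercase Latin letters, so case is swapped
# but the word remains recognizable as a transliteration.
print(drop_high_bit("привет"))  # → PRIWET
print(drop_high_bit("аф"))      # → AF
```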

Hyphens, minus, and dashes in Debian man pages

Posted Oct 26, 2023 6:52 UTC (Thu) by AdamW (subscriber, #48457) [Link]

Huh. Pretty sure I reported this on Fedora back in July, via a tip from 'hiredman' on chat:
https://bugzilla.redhat.com/show_bug.cgi?id=2224123
though in my testing, at least, it wasn't easily associated directly with a groff update...

Hyphens, minus, and dashes in Debian man pages

Posted Oct 29, 2023 0:53 UTC (Sun) by acolin (guest, #61859) [Link]

Thanks for this. If I ever paste something and it doesn't work due to this WYSI-N-WYG situation, I will not become angry; instead, I will recall fondly reading this article and discussion.

P.S. Treating similarly-typeset characters as the same in search and in paste (upon user request) seems to help defuse this curious situation.
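A sketch of what such folding could look like, using the code points from the table at the top of the page. The name `fold_dashes` and the choice to map everything to hyphen-minus are assumptions for illustration; a real search or paste feature would have to decide which look-alikes to include.

```python
# Hyphen, en dash, em dash and minus sign, all folded to U+002D hyphen-minus.
DASH_LOOKALIKES = {
    0x2010: "-",  # HYPHEN
    0x2013: "-",  # EN DASH
    0x2014: "-",  # EM DASH
    0x2212: "-",  # MINUS SIGN
}


def fold_dashes(text: str) -> str:
    """Replace dash look-alikes with ASCII hyphen-minus; other characters are kept."""
    return text.translate(DASH_LOOKALIKES)


pasted = "ls \u2212l \u2010\u2010all"  # what a typeset man page might hand you
print(fold_dashes(pasted))             # → ls -l --all
```

Applying the same fold to both the search pattern and the searched text makes `-l` match however the page rendered it.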

Hyphens, minus, and dashes in Debian man pages

Posted Oct 29, 2023 5:20 UTC (Sun) by da4089 (subscriber, #1195) [Link] (2 responses)

So, naively ... it's 2000 packages. C'mon ... one package each? 90% of the problem can be fixed in a week.

Hyphens, minus, and dashes in Debian man pages

Posted Oct 29, 2023 13:21 UTC (Sun) by edgewood (subscriber, #1123) [Link]

I don't think it's the technical work that's the problem, but the social/political work of getting upstream to accept the patches.

Hyphens, minus, and dashes in Debian man pages

Posted Nov 4, 2023 19:45 UTC (Sat) by rra (subscriber, #99804) [Link]

And next week the problem will be reintroduced in half a dozen packages because for most people it's a silent error that they will introduce by accident. You need some equivalent of a spell-checker to catch reintroduction of the problem or all your work in converting files will degrade with time. And writing that checker is quite challenging.

If someone did manage to write a really good one, we could introduce it as a QA step and indeed it probably wouldn't be that hard to fix man pages over time. In my experience, upstream often doesn't really care, but will merge a PR since why not. But the one we had definitely did not work (I can think of several obvious problems with it just off the top of my head), and writing a better one is challenging.

Someone elsewhere in this discussion suggested using ChatGPT, an option that I find hilarious given ChatGPT's well-known devotion to accuracy and specific detail.

Hyphens, minus, and dashes in Debian man pages

Posted Nov 1, 2023 11:57 UTC (Wed) by qwertyface (subscriber, #84167) [Link] (2 responses)

These sorts of issues are so pervasive and so hard to debug. I remember once copying and pasting a PostScript snippet from the PDF version of the PostScript Language Reference Manual that is available on Adobe's website, and it not working because the quotation marks were ‘ ’ not ' '. If there is one document in the whole history of computing where you wouldn't expect that mistake, it would be that one, but there it was.

Interestingly, PowerShell treats ‘ or ’ as ', and “ or ” as ", so copy-paste out of auto-converted documents works fine. I guess it probably does the equivalent with the various dashes. I'm not aware of any other language or shell that does the same. One Microsoft feature we should adopt?

Hyphens, minus, and dashes in Debian man pages

Posted Dec 2, 2023 11:56 UTC (Sat) by ssokolow (guest, #94568) [Link] (1 responses)

You'd immediately open up a ton of shell injection exploits, since the assumption of which functions have special meaning is baked into a million different functions like Python's shlex.quote. PowerShell can get away with supporting that because it's a new shell with a new syntax.
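To make the concern concrete, here is what `shlex.quote` does today with a straight versus a curly apostrophe. This is only an illustration of current behaviour, not a demonstration of an actual exploit.

```python
import shlex

# A straight apostrophe is special to POSIX shells, so it gets escaped.
straight = shlex.quote("a'b")
# A curly apostrophe is just data to those shells; it is wrapped but left as-is.
curly = shlex.quote("a\u2019b")

print(straight)
print(curly)
```

If a shell started treating ’ as a closing quote, the second result would no longer be one safely quoted word, and every caller relying on it would be affected.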

Hyphens, minus, and dashes in Debian man pages

Posted Dec 2, 2023 20:13 UTC (Sat) by ssokolow (guest, #94568) [Link]

Ugh. Which characters have special meaning. Don't post while sleep deprived, kids!

Hyphens, minus, and dashes in Debian man pages

Posted Feb 27, 2024 14:30 UTC (Tue) by lmb (subscriber, #39048) [Link]

I've finally hit this on a man page, was extremely confused, and eventually found this article as an explanation for what was happening to me.

As a user, I'm ... not entirely in love with this change. Yes, it is technically correct, but it has *horrible* UX and breaks things at times when you've already had to resort to reading documentation, which are not normally times when you want something else to be befuddling. My primary contact with roff since the 90s (and, I suspect, that of 99% of all users) has been man pages, and for that use case, this change is questionable.

Yes, in theory, all broken man pages out there should be fixed, that's where the origin of the brokenness is, and I appreciate the thoughtful discussion and decision and commitment to proper layout and typesetting.

In practice, I roll my eyes at technical correctness and alias man to `man -E ascii`. Sorry.