Hyphens, minus, and dashes in Debian man pages
Last July, Sven Joachim filed a bug report regarding a change in groff, and in how it renders man pages for terminals in particular. A change to the handling of the character often referred to as "hyphen", "minus", or "dash" ("-") made many man pages rather harder to work with. To understand the problem, it's worth noting that Unicode provides a plethora of similar characters, some of which are:
Name          Codepoint  Glyph
Hyphen-Minus  002D       -
Hyphen        2010       ‐
En Dash       2013       –
Em Dash       2014       —
Minus Sign    2212       −
There are many more — Unicode is nothing if not generous in this regard. The term "dashes" will be used to refer to this class of glyphs here.
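For readers who want to check which of these characters a given string actually contains, the following is a minimal Python sketch. It relies only on the standard unicodedata module; the names DASHES and describe are illustrative choices, not anything taken from groff or Debian.

```python
import unicodedata

# The five look-alike characters listed above, written as escapes so
# that the source itself is unambiguous.
DASHES = ["\u002d", "\u2010", "\u2013", "\u2014", "\u2212"]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for ch in DASHES:
        print(describe(ch))
```

Running describe() over text copied from a terminal is a quick way to see whether a suspicious "-" is really a Hyphen-Minus.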
The specified behavior of groff is that an ASCII "-" (Hyphen-Minus) in the input becomes a Hyphen in the output. If the desire is to use Hyphen-Minus in the output, then the input should use the sequence "\-" instead. If the author of a man page types "--frobnicate" as an option name, the output will read "‐‐frobnicate" (with Hyphen) rather than "--frobnicate" (with Hyphen-Minus). The two look the same, but there is a crucial difference. A user who searches for "--frobnicate" in a man page will not find it if the wrong type of dash is used and, if that user cuts-and-pastes an example with the wrong dash, it will not work.
As an example, one can try pasting these two lines into a shell:
/usr/bin/echo --help
/usr/bin/echo ‐‐help
The results from one will be rather more helpful than from the other. Use of the wrong type of dash can also break URLs and corrupt file names.
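The search failure can be reproduced outside of a pager. The sketch below is only an illustration: the rendered line and the option name are made up, and the normalize() helper simply mimics the effect of remapping Hyphen back to Hyphen-Minus; it is not how groff or man implement anything.

```python
HYPHEN_MINUS = "\u002d"  # what the user types on the keyboard
HYPHEN = "\u2010"        # what an unescaped "-" becomes in UTF-8 output

# Hypothetical line of rendered man-page output for an unescaped option.
rendered = f"{HYPHEN}{HYPHEN}frobnicate  enable frobnication"
query = "--frobnicate"


def find_option(page: str, option: str) -> bool:
    """Plain substring search, as a pager's search would do."""
    return option in page


def normalize(page: str) -> str:
    """Map Hyphen back to Hyphen-Minus (an assumption-level workaround)."""
    return page.replace(HYPHEN, HYPHEN_MINUS)


if __name__ == "__main__":
    print(find_option(rendered, query))             # search on raw output
    print(find_option(normalize(rendered), query))  # search after remapping
```

The first search misses because the two strings differ at the code-point level even though they look alike on screen.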
Developers of free software are, of course, diligent about writing man pages; they do the job promptly, take their time to get every detail right, and can be expected to use the right kind of dash in every situation, even though the output from using the wrong kind looks exactly the same. They will surely not be bothered by the fact that a format designed to document command-line options contains a trap whereby the failure to add backslashes silently introduces problems for users who are distant in time and space.
Shockingly, this turns out not to be the case, and Linux man pages are
overflowing with unescaped dashes. Years ago, the Debian project tried to
address this problem by adding a check to its Lintian tool that would issue a
warning when unescaped dashes were used. That check was dropped in
2015, though, after Niels Thykier concluded that it was simply being
ignored: "The tag has existed since 2004 (commit fb2e7de). To date
there are still 2000 packages with the issue." Since then, there has
been no warning shown to Debian developers when man pages contain unescaped
dashes.
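To give a feel for what such a check involves, here is a rough Python sketch of a heuristic that flags option-like dashes lacking a backslash. It is not Lintian's actual implementation; the regular expression, the function name, and the sample input are all assumptions made for illustration.

```python
import re

# Heuristic: one or two "-" followed by a letter, where the first "-" is
# not preceded by a backslash, a word character, or another "-".
OPTION_LIKE = re.compile(r"(?<![\\\w-])--?[A-Za-z]")


def unescaped_option_dashes(source: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that look like unescaped options."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if line.startswith('.\\"'):  # skip roff comment lines
            continue
        if OPTION_LIKE.search(line):
            hits.append((lineno, line))
    return hits


# Hypothetical man-page source: one correct line, one bug, one real hyphen.
sample = "\n".join([
    r".B \-\-frobnicate",
    r"Use --frobnicate for long-term effects.",
    r"object-based vs. -oriented programming",
])

if __name__ == "__main__":
    for lineno, line in unescaped_option_dashes(sample):
        print(f"{lineno}: {line}")
```

The third sample line is flagged even though its dash is a genuine hyphen, which shows why a pattern-based check can only warn and cannot safely rewrite the source on its own.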
Given the prevalence of this problem, it would arguably make sense to apply a fix at the processing level. And, indeed, groff has, for many years, duly remapped the Hyphen-Minus character (and a few others) in the man-page macros, making dash characters simply work as many would expect. That helpful behavior ended with the groff 1.23.0 release in July:
The an (man) and doc (mdoc) macro packages no longer remap the -, ', and ` input characters to Basic Latin code points on UTF-8 devices, but treat them as groff normally does (and AT&T troff before it did) for typesetting devices, where they become the hyphen, apostrophe or right single quotation mark, and left single quotation mark, respectively. This change is expected to expose glyph usage errors in man pages. See the "PROBLEMS" file for a recipe that will conceal these errors. A better long-term approach is for man pages to adopt correct input practices.
Problems were indeed exposed, and users began to complain; bugs were filed and the topic showed up on the debian-devel mailing list as well. G. Branden Robinson, the upstream maintainer of groff and author of this change, defended the new behavior:
Mapping all hyphens and minus signs to a single character, as people whose blood pressure spikes over this issue tend to promote as a first resort, is an ineluctably information-discarding operation. In my opinion, man page source documents are not the correct place to discard that information.
Among other things, the information being discarded by this change includes whether line-breaking is allowed; Hyphen-minus does not allow it, while Hyphen does.
Others disagreed with Robinson's position; Russ Allbery said:
My opinion is that the world of documents that are handled by man do not encode meaningful distinctions between - and \-, and man should therefore unify those characters.
Colin Watson, who maintains Debian's groff package, admitted that he had overlooked this problem when he updated Debian to the 1.23.0 release:
I was aware of the change, but it somehow fell off my list of things to make a positive decision about when packaging 1.23.0. I'm rather inclined to revert this by adding the rest of the recipe above to debian/mandoc.local (while I agree with the idealized typographical point being made, I have approximately negative appetite for the Sisyphean task of fixing an entire distribution's manual pages in practice).
A few weeks later, he said that his plan was to leave the change in place during the current Debian 13 ("Trixie") development cycle, but then to revert it prior to the pre-release freeze to avoid inflicting problems on Debian's users. That would, in theory, give developers time to fix as many of the problems as possible. After the discussion went on for a while, though, he changed his mind, stating that he was unwilling to have his inbox filled with this discussion for the next year. So the remapping of "-" has been reinstated into Debian's version of groff.
This little episode may well be repeated in other distributions as they
catch up with the groff 1.23.0 release. It also is probably not finished
within Debian. This situation brings together the problems of
documentation writing, typographic correctness, and Unicode look-alike code
points, all of which are fertile ground for disagreement. The hopes that
removing the remapping in groff would lead to the fixing of all those man
pages may have been dashed, but that does not bar another attempt in the
future.
Posted Oct 23, 2023 13:45 UTC (Mon)
by willy (subscriber, #9762)
[Link] (1 responses)
https://github.com/guardian/frontend/issues/17506
but the Grauniad appears to have worked around it instead of getting Android fixed. Jake and I had some correspondence on this issue last year, tracked down to using ‑ (a different glyph, the non-breaking hyphen!)
Posted Oct 23, 2023 18:27 UTC (Mon)
by branden (guest, #7029)
[Link]
The LWN editor griped in his piece that the distinction between a hyphen and a hyphen-minus was "invisible" (as it clearly wasn't with the Android font you used).
I've had good results with the "FreeMono" font, from the Debian fonts-freefont-ttf package. I find hyphens and minus signs readily distinguishable with it, and it has excellent coverage; I've used it to extensively revise the groff_char(7) man page, which exercises every glyph one is likely to see in a man page, and practically speaking, many more besides.
Posted Oct 23, 2023 15:02 UTC (Mon)
by ms-tg (subscriber, #89231)
[Link]
Thank you for this humor!
Posted Oct 23, 2023 15:06 UTC (Mon)
by amacater (subscriber, #790)
[Link]
Posted Oct 23, 2023 15:16 UTC (Mon)
by smoogen (subscriber, #97)
[Link]
Bravo Mr Corbet, Bravo!
I started with a smirk, went onto a smile, and by the end was what my family said was "insane" giggling at the idea of various programmers I have known and the man pages written, updated and maintained by them.
Posted Oct 23, 2023 15:20 UTC (Mon)
by zorro (subscriber, #45643)
[Link] (4 responses)
Posted Oct 23, 2023 17:05 UTC (Mon)
by branden (guest, #7029)
[Link] (2 responses)
No, because when the formatter runs in "nroff mode" (is producing output for a terminal), there is no support for "font families" (a groff invention that structures a somewhat unruly mess of uncategorized fonts that AT&T device-independent troff developed starting around 1980).
There's more on this in the groff manual.
https://www.gnu.org/software/groff/manual/groff.html.node...
In case it doesn't go without saying, few to zero terminal emulators support switching font families, or between monospaced and proportional type, at least for anything less than the entire rendered screen at once.
Posted Oct 23, 2023 23:28 UTC (Mon)
by rfunk (subscriber, #4054)
[Link] (1 responses)
Posted Oct 24, 2023 2:54 UTC (Tue)
by branden (guest, #7029)
[Link]
Yes. So in nroff mode, you can simply remap `-` to `\-` on a groff output device that distinguishes them ("utf8" is the only nroff-mode device that does), if you don't care about a collateral effect of line breaks in filled text not happening as often as they should, as mentioned by Russ Allbery in this LWN thread. (And admittedly, many people don't.)
That simple remapping is in fact what the groff "PROBLEMS" file recommends for those who aren't concerned with man page typography, and what the latest revision of the Debian groff package does--it updates a conffile (/etc/groff/man.local), which the site admin can modify if they like. Hence the storm in a teacup. :)
Posted Nov 5, 2023 10:43 UTC (Sun)
by cpitrat (subscriber, #116459)
[Link]
It's still an awful task to fix all the man pages which "use it right" (as per v1.23) but it seems much more natural to me.
Posted Oct 23, 2023 15:54 UTC (Mon)
by epa (subscriber, #39769)
[Link] (65 responses)
I guess that wouldn’t work for ‘hyphenated’ long option names, which have - in the middle, so some more elaborate rule might be needed. Perhaps easier to fix the manpage sources after all.
When that’s done, can we arrange for applications such as spreadsheets to understand the Unicode minus sign?
Posted Oct 23, 2023 16:22 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (35 responses)
Or assume that ascii-dash means ascii-dash? If you want a breaking hyphen, *that* should be \- or something fancy.
> When that’s done, can we arrange for applications such as spreadsheets to understand the Unicode minus sign?
Yes it would be nice for spreadsheets to recognise Unicode minus, but surely we need proper Unicode keyboards before we worry too much about that (and yes, I expect non-Anglo-Saxon countries already have Unicode keyboards, but how many people writing all this stuff even know how to get at Unicode? Apart from the £ sign (which is almost certainly Unicode) I have no idea how to access any other Unicode).
Cheers,
Posted Oct 23, 2023 17:23 UTC (Mon)
by branden (guest, #7029)
[Link] (3 responses)
The author of troff, Joseph Ossanna of the Bell Laboratories Computing Science Research Center, and a close colleague of Ken Thompson and Dennis Ritchie (whose names may be more familiar) faced this problem when the Labs took delivery of its first phototypesetter. All Unix document formatting had, up to that point (sometime in 1972/1973), been done using Ossanna's "nroff" or the older "roff" program to print to typewriters, where there is indeed no distinction between a hyphen, a minus sign, or a dash (unless you type it more than once).
Fonts for typesetting are a different story. They can have en dashes, em dashes, figure dashes, and almost always have distinguishable hyphen and minus characters.
Given an installed base of nroff users and documents, including Unix man pages, the arrival of the typesetter meant that Ossanna had to decide whether "-" should map to the typesetter's hyphen, or its minus sign.
He chose the former. My bet is that some frequency analysis of glyph usage was done--perhaps with some degree of rigor, since this was Bell Labs after all--and found that "-" occurred much more often as a hyphen than as a minus sign. And moreover, that its use as a minus sign was in fairly restricted contexts, like setting mathematical expressions (necessarily simple ones on typewriting terminals like the Western Electric Model 37 that the Labs used).
You can read more about these matters in the groff_char(7) and roff(7) man pages in the groff 1.23.0 release.
It is an accident of history that over the years, Unix users largely gave up using troff and nroff _except_ for man page composition, and so people notice the prevalence of the ASCII hyphen-minus much more often than they would in any other context.
Switching the glyphs' meanings around now would (a) break other *roff documents or (b) if done only for man(7) (and mdoc(7)), would make those macro packages work inconsistently with all others.
There is perhaps a case to be made for (b), but there is already a means of giving man(7) and mdoc(7) authors the crude solution many of them already desire, and it is to remap characters in the ASCII-WYSIWYG manner that many man page authors desire in the site-local configuration files that have been around for decades, man.local and mdoc.local. That is what Colin has done for Debian; we've both anticipated this day for years.
Posted Oct 24, 2023 6:53 UTC (Tue)
by jengelh (guest, #33263)
[Link] (2 responses)
Posted Oct 25, 2023 10:18 UTC (Wed)
by tao (subscriber, #17563)
[Link] (1 responses)
Posted Nov 5, 2023 11:13 UTC (Sun)
by cpitrat (subscriber, #116459)
[Link]
Why not do the same with other char? Use '\_' if you want a real underscore otherwise you get U+0332. Use '\i' for a i otherwise you get 'U+0049, U+0131'.
Posted Oct 26, 2023 8:46 UTC (Thu)
by rsidd (subscriber, #2582)
[Link] (16 responses)
I have an .XCompose file set up to give me pretty much any Unicode symbol I typically use with a few keystrokes. Eg AltGr+L+= → ₤. AltGr+E+= → €. AltGr+R+= → ₹. All the Greek letters, most of the common math symbols, etc. It may take you half an hour to set up and you can keep adding, once you are used to it you won't go back. This is a good starting point.
I was a skeptic of using Greek letters like μ and δ in writing Julia code. But once I started, I found it is just much more readable that way.
Posted Oct 26, 2023 9:40 UTC (Thu)
by geert (subscriber, #98403)
[Link] (15 responses)
Posted Oct 27, 2023 15:35 UTC (Fri)
by gutschke (subscriber, #27910)
[Link]
Posted Nov 9, 2023 13:00 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (13 responses)
Except my make.conf includes "-gnome -gtk".
Although I guess KDE/Plasma has something similar.
Cheers,
Posted Nov 16, 2023 3:33 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link]
/usr/bin/setxkbmap -option lv3:ralt_switch_multikey
Posted Dec 2, 2023 9:33 UTC (Sat)
by ssokolow (guest, #94568)
[Link] (11 responses)
Posted Dec 2, 2023 14:45 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (10 responses)
Key to choose the 2nd level - "<>" - what on earth is that :-)
The keyboard I'm typing on has ",<" and ".>" but no "<>" key :-)
Mind you, it did tell me how to make num lock default to on - it's been a real pain that num lock seems to change now and again for no apparent reason ...
Cheers,
Posted Dec 2, 2023 18:04 UTC (Sat)
by jem (subscriber, #24231)
[Link] (9 responses)
If you don't have this key, you can always choose some other key from the 19 alternatives on the list.
Posted Dec 2, 2023 18:48 UTC (Sat)
by halla (subscriber, #14185)
[Link] (8 responses)
Posted Dec 2, 2023 19:00 UTC (Sat)
by gioele (subscriber, #61675)
[Link] (7 responses)
The standard ISO layout (used everywhere except in USA and Japan) has a key between left shift and Z.
https://switchandclick.com/wp-content/uploads/2021/02/phy...
Posted Dec 2, 2023 19:35 UTC (Sat)
by halla (subscriber, #14185)
[Link] (6 responses)
Posted Dec 2, 2023 20:40 UTC (Sat)
by johill (subscriber, #25196)
[Link] (5 responses)
Posted Dec 3, 2023 0:08 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (4 responses)
Cheers,
Posted Dec 3, 2023 8:19 UTC (Sun)
by gioele (subscriber, #61675)
[Link] (2 responses)
So you have seen the key, but you haven't seen it labeled "<>". :)
The technical name for that physical key, regardless of its legend (= printed label) is "1st main key of the B (= 2nd from the bottom) row". Common legends for it are "<>", "\|", "~`", "][", "«»", "^*".
Posted Dec 3, 2023 10:51 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (1 responses)
For example the AZERTY keyboard is very common in Europe ...
That's probably why I find keyboards so confusing - the dominant culture is US, within Europe it's Germany, and nobody thinks to tell you how to remap a UK keyboard - if all the descriptions are in terms of the developer's local keymaps, we don't seem to have any UK developers ... :-)
Cheers,
Posted Dec 3, 2023 11:30 UTC (Sun)
by gioele (subscriber, #61675)
[Link]
Well, if Europe is France, then yes. :) The most common visual layout in Europe is QWERTY, followed by QWERTZ (German-speaking countries and Balkans). https://commons.wikimedia.org/wiki/File:Latin_keyboard_la...
> That's probably why I find keyboards so confusing - the dominant culture is US, within Europe it's Germany, and nobody thinks to tell you how to remap a UK keyboard
That is indeed a real issue.
Keyboards have different levels of abstraction (physical layout, visual layout, functional layout) and only the first levels are really standardized (and many different standards exist). And even the standards are often not followed. So it is hard to write documentation in a way that applies to a non US-centric audience.
I have in a radius of 20 meters from my chair at least 10 different keyboards, all of which are "almost" standard ISO, but each of them has a peculiarity (different physical shape, non-standard legends, extra functionalities...) that makes them non-standard.
Xorg/xkb tries to document all this variability using a declarative language (see xkbcomp/xkbprint) but no keyboard manufacturer I know of provides xkb data for their keyboard. (And in the end everything is an evdev keyboard these days, so...)
Posted Dec 4, 2023 21:23 UTC (Mon)
by tao (subscriber, #17563)
[Link]
Posted Nov 9, 2023 12:36 UTC (Thu)
by dwmw2 (subscriber, #2063)
[Link] (13 responses)
There are arrows on yuUi ←↓↑→, m is µ, S is §. Superscript numbers ¹²³ on 123...
Most typists don't need to see the basic letters on the keyboard in order to be able to type. Why would you need to be able to see these? ☺
Posted Nov 9, 2023 13:03 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (12 responses)
Standard British keyboard? Does such a thing exist any more? I think I have access to four British keyboards - my fancy ergonomic Logitech jobby, my two laptops (home and work), and my wife's laptop. All four keyboards appear to be different.
And I'm a 6-fingered typist. Comes from playing the guitar - my left hand can type, my right hand is two fingered hunt-n-peck :-)
:-)
Cheers,
Posted Nov 9, 2023 13:20 UTC (Thu)
by dwmw2 (subscriber, #2063)
[Link]
I'm fairly sure that whatever physical keyboard device I plug into my machine (within reason), if I press AltGr-m on it I'm going to get a µ, etc.
Posted Nov 9, 2023 18:43 UTC (Thu)
by mpr22 (subscriber, #60784)
[Link] (10 responses)
In terms of layout? Yes, there is.
I've got one right in front of me, and another one in the "WEEE to get rid of" pile that I really need to clear down given I'm moving house soon. Both were purchased within the past five years. (The one in the WEEE pile is there due to negligent handling by mpr22, not due to negligent manufacture.)
They're from different manufacturers (neither of which is Unicomp) and have identical layouts to the Fujitsu FKB-4725 I had back in the late 90s-early 00s, apart from (a) having Windows keys and (b) the broken one having volume and power keys where the undamaged one has Foo Lock indicator lights.
(I'm a nine-finger typist; I learned to type before I ever laid hands on a stringed instrument.)
Even on a laptop, the distinguishing features are "double-height Return key; backtick left of 1; backslash between LSHIFT and Z; semicolon/colon, singlequote/at, and hash/tilde between L and RETURN; 2 has doublequote on it and 3 has £ on it".
Posted Nov 9, 2023 22:48 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (9 responses)
Except the laptop I'm typing this on has no key between LSHIFT and Z
> hash/tilde between L and RETURN;
hash/tilde is above a single-height return
And the fancy ergo logitech I'm using - while similar to layout you describe - has some very weird keys.
2 / euro / at / double-quote
double-quote / at / single-quote
3 / pound / hash
4 / euro / dollar
and there's some more weirdos too ...
Cheers,
Posted Nov 10, 2023 9:54 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (6 responses)
I have a similar Logitech layout. The reason for those weird keys is that the keyboard has both the PC keysyms (right hand side of the key) and the Mac keysyms (left hand side of the key), along with some symbols that are found via AltGr on a PC keyboard or ⌥ / Option on a Mac keyboard.
It also has this thing of labelling all keys with names in lower case, which I've copied for the description below.
On my keyboard, the AltGr symbols are in unfilled circles for the PC keysym, and filled circles for the Mac keysym. So, the 4 key generates $ with shift, and € with alt gr on a PC, while on a Mac, it generates 4 or $ only. The 2 key is the other way round; on a Mac, it generates @ with shift, and € with opt ⌥, while on a PC, it generates " with shift.
And there are more complex keys, like the one to the top left, above tab. On a PC, I can get ` (on its own), ¬ (with shift) and | (with alt gr) from it, while on a Mac, it would give me § (on its own) or ± (with shift).
Posted Nov 10, 2023 10:38 UTC (Fri)
by dwmw2 (subscriber, #2063)
[Link] (5 responses)
The button you call the 4 key produces a scancode, probably 33. Unless you mean the keypad 4 key, which might be 92. The software receiving those scancodes may convert them to anything it likes, according to the software keyboard layout/configuration. Any relationship between the symbols generated and the pretty pictures which are painted on the keyboard is purely coincidental.
Posted Nov 10, 2023 10:45 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (4 responses)
Yep, but the default keymaps convert those scancodes to a specific set of symbols; my keyboard has two sets of keysyms printed on it, which makes it rather cluttered to look at, but that's Logitech's way of only producing one SKU for two markets.
The computer, of course, can't see the pictures; it relies on scancodes. But between OS defaults and my keyboard's HID descriptors telling the computer what it "should" do, the computer will do what I described unless I specifically tell the computer to use a non-default keymap.
Posted Nov 10, 2023 12:21 UTC (Fri)
by dwmw2 (subscriber, #2063)
[Link] (3 responses)
Posted Nov 10, 2023 13:54 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (2 responses)
I was responding to a different point; that of why a Logitech keyboard has pictures on keycaps that do not correspond to any character you can get from that key in a default setup of a single OS; it's done that way because if you share the keyboard between macOS and Windows (or macOS and Linux), you get different symbols in text input boxes from a default setup given the same scancodes.
Two orthogonal outcomes from the same situation (keyboard sending scancodes, and having pretty pictures in the hope your OS is set up to interpret the scancodes the way the keyboard maker thought it would). Although I do wish keyboard manufacturers would bring the Compose key back; I have it mapped as Shift-CapsLock (because who uses CapsLock as CapsLock), but I remember the good old days of a separate Compose keycap :-)
Posted Nov 10, 2023 15:40 UTC (Fri)
by dwmw2 (subscriber, #2063)
[Link] (1 responses)
Posted Nov 11, 2023 1:10 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link]
There have been other similar products, but they kinda all died. Mostly because experienced users just don't look at the keyboard.
Posted Nov 10, 2023 17:23 UTC (Fri)
by mpr22 (subscriber, #60784)
[Link] (1 responses)
Sounds like your laptop uses the American-style physical layout (and is thus not a standard British keyboard).
https://en.wikipedia.org/wiki/British_and_American_keyboards
Posted Nov 11, 2023 11:21 UTC (Sat)
by dwmw2 (subscriber, #2063)
[Link]
But in software it is using the standard British layout, in precisely the context I first used that phrase in this thread. Which was nothing to do with the hardware.
Posted Oct 23, 2023 17:11 UTC (Mon)
by branden (guest, #7029)
[Link] (2 responses)
> I guess that wouldn’t work for ‘hyphenated’ long option names, which have - in the middle, so some more elaborate rule might be needed.
I attempted to anticipate suggestions like this.
"Many people who want to "solve" this issue forget (or ignore) that not every '-' is a minus sign. Some are actual hyphens, as in "long-term effects" and "word-aligned struct members". Trying to infer a distinction from white space adjacency also won't work. Consider the phrases "word- or byte-sized caching" and "object-based vs. -oriented programming". While sophistication with compound hyphenated affixes is seldom seen in man pages, we most often find it where a man page author has taken considerable care with their technical writing. Such pages are less likely than most to require revision with blunt instruments like regular expression-based global search and replace operations."
https://lists.debian.org/debian-devel/2023/10/msg00085.html
If I knew of an algorithm that would faultlessly figure out what the writer meant, I'd use it.
> Perhaps easier to fix the manpage sources after all.
That is my conclusion, even knowing that doing so is sure to be difficult, and to meet much resistance.
Distributors like Debian can of course shield their readers from these difficulties; I expected that, which is why the advice in the "PROBLEMS" file (to which Mr. Corbet helpfully linked) looks the way it does. On the downside, few distributors _ship_ this piece of documentation, so it is harder for distribution users to find than it could be.
Posted Oct 23, 2023 21:20 UTC (Mon)
by kleptog (subscriber, #1183)
[Link]
I suspect ChatGPT (or some similar LLM) could do a pretty good job. The harder part I think would be getting all the patches merged (since you don't want to be doing this on the fly).
Posted Nov 5, 2023 11:17 UTC (Sun)
by cpitrat (subscriber, #116459)
[Link]
Posted Oct 23, 2023 17:43 UTC (Mon)
by rra (subscriber, #99804)
[Link] (25 responses)
To see why, instead of thinking about options, think about UNIX commands. For example, consider the standard Debian command "apt-get" and also consider the English phrase "many other package managers are apt-like". The former must translate to the ASCII hyphen-minus for searching and cut and paste to work, since the executable on disk uses a hyphen-minus in the name. But the latter is correctly typeset with a hyphen, not hyphen-minus.
The roff input language does correctly distinguish, but because the output looks almost identical in most fonts the bug of using the wrong choice is almost impossible for most people to detect. It therefore occurs frequently, and would even if literally every person writing a man page knew about this problem, for the same reason that I make spelling errors regularly even though I know how to correctly spell words. For POD, the problem is worse: the input language simply does not distinguish. There is no \- equivalent in POD, and no heuristic that will correctly map cases like the above apt-get vs. apt-like. Therefore, the only safe thing to do is to convert all input - characters to the ASCII hyphen-minus. People who really want hyphens can mark their POD documents as UTF-8 and use a Unicode hyphen (which modern roff also handles correctly).
I appreciate Branden's desire to tilt at windmills, being myself an occasional champion spin-jouster, and he is of course technically correct (which as we all know is the best kind of correct). But I stand by my personal opinion that this is a lost cause that will never be fixed properly, and attempting to get people to fix it properly is just going to annoy people without making much positive impact on the world.
Posted Oct 23, 2023 18:09 UTC (Mon)
by branden (guest, #7029)
[Link] (24 responses)
I think automatic man page source generators and human beings who fire up a text editor to write man(7) documents constitute different problem domains. You've stated your intentions with respect to pod2man in this area multiple times where I could see them and I have no objection. pod2man has a problem to solve and I don't want groff to inhibit your flexibility to solve it.
> For POD, the problem is worse: the input language simply does not distinguish. There is no \- equivalent in POD, and no heuristic that will correctly map cases like the above apt-get vs. apt-like. Therefore, the only safe thing to do is to convert all input - characters to the ASCII hyphen-minus. People who really want hyphens can mark their POD documents as UTF-8 and use a Unicode hyphen (which modern roff also handles correctly).
This seems fine to me. People seduced by the siren song of Unicode's giant character set may find themselves learning to distinguish these characters anyway, and those who don't--and who, moreover, may write the documentation hurriedly and with resentment--won't be troubled with it.
(That said, my exposure to POD documentation, generally in the Perl core, tells me that its quality is very high, suggesting that it is written by people who care enough to make it that way. But I've seen far too many terrible man pages to not complain about the status quo. And despite my affiliation with the GNU Project I emphasize that I don't endorse its [slowly fading] policy of man page deprecation, which I think has only contributed to the unhappy state of affairs by feeding shiftless programmers' indifference to writing documentation at all. Anything not worth doing is not worth doing well, no?)
If a person sits down to write a man page from scratch in a text editor, they will have things to learn, and in my opinion the hyphen/minus distinction is one of them. (As the original article suggested, there are in fact four other "ASCII" glyph distinctions to learn about.)
The theme of audience is also applicable to why I made this change in groff upstream. The GNU Project generally releases source archives, not binary packages. The primary consumers of groff releases from GNU are therefore, I would expect, people who already know of the package and desire to obtain it.
Distributions are different. Their users read man pages without even knowing that groff is involved. That is why it is important to me that (a) groff retain its customizability and (b) that its defaults be correct. groff man(7) actually went down this road before, about 15 years ago, when it first introduced the "utf8" output device. Screeches of clueless outrage erupted from the land then as they have now. Then-maintainer Werner Lemberg threw a blanket over the racket with the aforementioned character-remapping...but he did it in the macro file implementing (most of) man(7) itself, _not_ in the stock man.local file. I think that was an (innocent) error, as it suggested a stronger endorsement of doing this remapping-to-ASCII than was intended. (We discussed the revival of the old behavior on the groff list years ago. Werner, since retired as groff maintainer, felt much as you do; that it was technically correct to do what groff 1.23.0 shipped doing, but that the howls of frustrated man page readers and the commitment of a few vocal man page authors to their bad habits would be too much to endure.)
I figured the distributors are better placed to make this decision. I still think that. But the occasional storm in a tea kettle is the price, and on a slow news week, such weather can fuel an LWN piece.
Posted Oct 23, 2023 18:54 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (5 responses)
Posted Oct 23, 2023 19:15 UTC (Mon)
by branden (guest, #7029)
[Link]
"Approximately" is doing a lot of work here. On the groff list we do see discussions and even complaints when dash-like symbols are drawn with incorrect lengths. People will notice, at least if their font doesn't hide the glyph distinctions from them (which brings its own problems, completely independently of anything to do with groff). https://www.unicode.org/faq/security.html
As I noted above, the GNU FreeFont's Mono face has good coverage and I've been using it happily, including much groff development activity, for years. People who dislike serifs might hate it, though.
> Do screen readers pronounce them differently?
I don't know for certain, but would expect so. One would not expect to hear "mother-in-law" pronounced with the interior punctuation called out. For Unix command-line and C language literals, you very much do. Those dashes (hyphen-minuses) are important.
> Does one of the standard Unicode algorithms handle them differently (e.g. line breaking, BIDI, etc.)?
groff doesn't apply the Unicode line-breaking algorithm (because it predates Unicode), but it does something similar. When researching this matter for Russ Allbery on the Debian list I discovered that essentially all man page formatters will break the line after a hyphen (*roff input: -) and none will after a minus sign (*roff input: \-). They're a little less consistent for things like em dashes.
Posted Oct 23, 2023 19:58 UTC (Mon)
by rra (subscriber, #99804)
[Link]
Most man page views are in fixed-width fonts, though, and there the distinction is much less apparent, or occasionally nonexistent. With a fixed-width font, the character has to take up the same space regardless of the dash length, so the length variation is much less useful.
(Also, as Branden said, it affects line wrapping, which sometimes can matter a lot.)
Posted Oct 25, 2023 8:25 UTC (Wed)
by nim-nim (subscriber, #34454)
[Link] (1 responses)
This may not be obvious in a Linux console, which is basically stuck in the pre-Unicode past, where non-ASCII rendering is broken in various ways and where you force 1024×768 resolution because otherwise various things break hard. But in a graphical Linux (or Windows or OSX or Android) terminal that exercises screens at their full pixel density, using OpenType vector fonts that try to render Unicode (including its nuances) ever more accurately, encoding mistakes start to become visible and will become ever more visible as the years pass.
On a high-DPI phone screen and on many mid-to-high-end computer screens, the technology can already match (and exceed) traditional paper printing, so the traditional “it’s a problem for typography buffs who print on paper” excuse no longer applies.
Also, remember that people have needed translated man pages for a long time, so considering them an ASCII-only world is highly inaccurate.
Posted Oct 25, 2023 19:12 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
But what I will say is this: When I have written content for the web, which (in the present day) very much is a high-DPI Unicode-friendly environment, I have never felt any need to use U+2010 HYPHEN. I have used the en dash, the em dash, and the minus sign, but all of those render very differently from U+002D HYPHEN-MINUS. I was just trying to understand why anyone wants U+2010 in particular, when it looks so similar to U+002D even in a proportional font. I mean, just look at them in the article. They are practically homoglyphs, and I had to lean really far in just to see that U+2010 is about half a pixel thicker (in my font, when subpixel hinting is enabled).
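For anyone who wants to check which code point they are actually looking at, here is a minimal Python sketch using the standard `unicodedata` module. The list `DASHES` and the helper `describe` are names made up for this illustration; the code points are the five listed in the article's table.

```python
import unicodedata

# The five dash-like code points from the article's table.
DASHES = ["\u002d", "\u2010", "\u2013", "\u2014", "\u2212"]


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


names = [describe(ch) for ch in DASHES]
for line in names:
    print(line)

# However similar the glyphs look, the characters never compare equal.
print("\u002d" == "\u2010")
```

The point of the sketch is only that the near-homoglyphs are trivially distinguishable to software, even when a font makes them hard to tell apart by eye.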
If there are in fact screen-reader benefits, then this probably shouldn't be written off entirely. OTOH, one could say the same about <i> and <em>. Screen readers can certainly benefit from distinguishing between italics for emphasis, and all other italics. But nobody on the web makes that distinction in practice, despite what the W3C and WHATWG recommend. Markdown, for instance, has two different syntaxes for italics, but I'm not aware of any well-known flavor of Markdown actually using one syntax for <i> and one for <em> (CommonMark specifies <em> for both, and most other Markdowns don't even bother telling you which one they emit unless you dig into the guts of the implementation).
The point in all this: You can't make authors care about semantics if they do not wish to care. From the perspective of the average author, U+2010 is just "U+002D, except if I use it in computer code, then it breaks things." They do not wish to know the difference between U+2010 and U+002D, and no amount of "well they should learn" is going to change their behavior.
Posted Jan 7, 2024 23:18 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link]
Posted Oct 23, 2023 19:02 UTC (Mon)
by rra (subscriber, #99804)
[Link] (12 responses)
My arguments are, briefly:
1. It is very difficult to get this correct. Yes, there are a lot of things to learn when writing man pages, but bugs that cannot be caught by automated tools and don't produce visibly different output are extremely hard to eliminate. This is effectively a foot-gun in the roff language that authors will continue to get wrong because getting it wrong produces no visible effect.
2. The distinction is mostly drawback for the typical use of man pages. Most views of man pages are in contexts where the only distinction between hyphen and hyphen-minus is (maybe) whether roff does line breaking at that point, and with the (IMO correct) increasing trend of disabling full justification in man pages, this is a very minor benefit. The glyphs are otherwise essentially identical, but hyphen breaks searching and cut and paste. The positive benefits are mostly for troff output for printed material, and for man pages this is not a nonexistent use case but it is very rare. This is why I have dropped all of the pod2man transformations that were only useful for troff output; they were causing problems for nroff output and were essentially never used as intended.
3. The world has changed since roff was designed. This is not going to be persuasive if you see your role as preservation of the original roff intent, so to some extent this is a conflict of uses. You are maintaining the roff typesetting system, but most people writing man pages are just trying to present documentation to the user and don't care about the roff typesetting system as such. roff was designed in a world without Unicode, but we have Unicode now. If people want hyphens, or matched single quotes, there is now a fairly good argument they should just type the thing they intend using Unicode. I think if roff were invented today, \- would not exist and - would mean \- because roff would just use Unicode input and respect the characters the user entered.
I am very sympathetic to the argument that this should translate into roff preserving the original distinction by default, but all distributions disabling this distinction when processing man pages. I think that is a fairly reasonable compromise, although it does have the drawback of requiring all the distributions to duplicate essentially the same configuration work.
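The search problem in point 2 can be reproduced in a few lines of Python. This is only a sketch: the string `rendered` stands in for formatter output that used U+2010 for an option name, and `fold_hyphens` is a hypothetical helper, not something any pager or man implementation is claimed to provide.

```python
HYPHEN = "\u2010"  # U+2010 HYPHEN, what a bare "-" can become on output

# Stand-in for rendered man page text in which the option was written
# without the backslash escapes.
rendered = f"Use {HYPHEN}{HYPHEN}frobnicate to enable frobnication."

# What a user would type into a search prompt (plain U+002D).
query = "--frobnicate"

found_raw = query in rendered
print(found_raw)  # the plain-ASCII query does not match


def fold_hyphens(text):
    """Map U+2010 back to U+002D so that ASCII searches match again."""
    return text.replace(HYPHEN, "-")


found_folded = query in fold_hyphens(rendered)
print(found_folded)
```

Folding the characters together recovers searchability, at the cost of discarding exactly the distinction the typesetter was trying to preserve.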
Posted Oct 23, 2023 19:57 UTC (Mon)
by branden (guest, #7029)
[Link] (11 responses)
It's a factor. There are historical roff documents that I'd like to keep working nicely, as well as I can.
For example: https://github.com/g-branden-robinson/retypesetting-mathe...
That said, I do not consider myself beholden to bug-for-bug compatibility with AT&T troff (James Clark didn't, though he accommodated several), or to making the same decisions about issues in areas not even specified by CSTR #54, the "Troff User's Manual" (originally written by Ossanna in 1976, revised in 1992 by Kernighan). One relatively vocal subscriber to the groff mailing list sees me more as a heedless radical with meager respect for the wisdom of my superior ancestors. There is a certain exhilaration in juxtaposing that critique with yours.
> Yes, there are a lot of things to learn when writing man pages, but bugs that cannot be caught by automated tools and don't produce visibly different output are extremely hard to eliminate.
Guessing which glyph to use as seen in the examples on the debian-devel list, and here, appears to be an AI-hard problem.
> This is effectively a foot-gun in the roff language that authors will continue to get wrong because getting it wrong produces no visible effect.
I _do_ have some advice on this front: use a good font, one where glyphs for different code points look different.
> the only distinction between hyphen and hyphen-minus is (maybe) whether roff does line breaking at that point
I expect what will happen with pod2man specifically is that you'll use \- everywhere, people will notice that breaks stop happening in as many places, resulting in wide adjustments, then they will...
> and with the (IMO correct) increasing trend of disabling full justification in man pages, this is a very minor benefit.
...join team ragged-right margin. Well, fear not, I've actually made it easier for you to get what you want. https://git.savannah.gnu.org/cgit/groff.git/tree/NEWS?h=1...
People who have been turning automatic hyphenation off in the first place may also welcome that.
> but hyphen breaks searching and cut and paste.
I spend a lot of time reading man pages. I find myself not struggling over this issue. Maybe I'm weird.
> The positive benefits are mostly for troff output for printed material, and for man pages this is not a nonexistent use case but it is very rare.
Linux man-pages maintainer Alejandro Colomar and I are doing what we can to encourage people to rediscover typeset manuals. I've linked to the collected groff-man-pages PDF elsewhere in this discussion. Deri James is doing an invaluable service by helping us get man page cross references wired up to PDF hyperlinks. (Of course, you actually have to tell man(7) that something is a man page cross reference first, and thereby hangs a tale. https://git.savannah.gnu.org/cgit/groff.git/tree/NEWS?h=1... )
> The world has changed since roff was designed.
Certainly. There'd be no upset users and no LWN article to reply to if we weren't enjoying the blessing of Unicode support in our terminal emulators. So in a sense this all can be laid at Markus Kuhn's doorstep.
> This is not going to be persuasive if you see your role as preservation of the original roff intent, so to some extent this is a conflict of uses.
Not wholly. I also have a handful of new macros I want to introduce to the man(7) macro language. For groff 1.23.0, I settled on one, already linked to above.
The NEWS file entry for 1.23.0 is lengthy; I encourage anyone with any interest in groff to review it.
> but most people writing man pages are just trying to present documentation to the user and don't care about the roff typesetting system as such
People writing C programs are just trying to solve a problem and don't care about the programming language that much. (Okay, C has plenty of people who love it madly for its own sake. Consider substituting "JavaScript".)
If someone is going to pick up a tool to do a job, they're going to have to develop a competence with that tool. If plain text is all a person can manage, that is what they should write their documentation in.
(At this point in the discussion, champions of ReStructured Text and/or one of several not-quite-compatible dialects of Markdown typically appear on the scene, each claiming that their markup language is the _one_, _obvious_ way to write "plain text" such that it is suitable for conversion to richer formatting languages.)
I'm trying to reach and assist people who care about writing man(7) (and mdoc(7) for that matter) competently. If they don't care about that, I'm wasting my time, and they shouldn't waste theirs telling the world how man(7) _should_ be done.
> If people want hyphens, or matched single quotes, there is now a fairly good argument they should just type the thing they intend using Unicode. I think if roff were invented today, \- would not exist and - would mean \- because roff would just use Unicode input and respect the characters the user entered.
I don't agree, because you're forgetting about the minus sign. Unicode has a hyphen (U+2010) and a minus sign (U+2212), and "obviously", a person should input those code points for their distinct purposes.
This works great until someone needs to input a "literal" for an overloaded code point in the Basic Latin code chart that has syntactical significance to something like a shell prompt or a language compiler. Then they need that hen's tooth U+002D code point, even though it is meaningful _only_ for talking to computers, and not for any other domain of discourse. And that's not even taking into account the folks who ride in and want distinguishable en dashes, em dashes, figure dashes, and others the LWN article didn't mention. Fitting distinguishable glyphs for these into a half-width character cell even with a fair number of pixels in the horizontal dimension (say, more than 8) starts to become a real pickle.
So, no, I don't think "just use Unicode" is going to solve all the problems here at a stroke. It can help, but eventually people are going to need roff special characters or something like them so that they can tell unlike things apart with confidence. Even if they use a good font.
When you consider the problem space seriously, it turns out the WYSIWYG advocates are on team DWIM. And we know how well that turns out.
None of this is to tell you what you should do with podlators; the plan you've articulated seems fine to me. If, by some miracle, Alex Colomar and I convince more than a handful of people that the PDF man page experience is actually kind of nice, and some of those folks then turn to Perl docs and wonder what its story is, I'll be happy to help you come up with a new coat of paint to go over that layer of primer you just stripped it down to.
Posted Oct 23, 2023 20:16 UTC (Mon)
by rra (subscriber, #99804)
[Link] (7 responses)
But this is exactly my point. If you need that code point, you enter that code point, and it should be typeset as that code point, without any second-guessing on the part of the typesetting system.
The second-guessing is there because in a pure ASCII world there was no alternative, because there were not ASCII code points for the different meanings. You therefore had to pick one of them to be the default and represent the other ones with escapes, and roff decided, quite reasonably for typesetting and less reasonably for man pages, that a hyphen was the most common usage and should be the default and the other usage should use an escape. But the point of Unicode is that you no longer have the context collapse on input, because there are code points available to express your exact intent. Essentially, the use of the precise Unicode code point replaces roff escapes (down to being slightly more annoying to type).
We have been down this path already with quotes. In the pure ASCII world, we invented various conventions like `single quotes' or ``double quotes'' used in different typesetting systems, but now you can use correct Unicode quotes if you care about this distinction, and many editors will assist you in doing so to avoid the entry problem. I am dubious any newly-invented typesetting system today would try to overload ASCII quotes in the way that TeX did; instead, it would handle Unicode quotes correctly.
None of these solutions are ideal because the keyboard is not large enough and doesn't allow easily drawing these distinctions. But if you view all the various typesetting escapes as substitutes for not having the correct character on the keyboard, I would argue that using the correct Unicode character is the modern replacement. It works uniformly across multiple typesetting systems, so you don't have to relearn how to use it for each piece of software, and you are far more likely to have active editor assistance in making the entry easier.
In other words, no, I have not forgotten about minus. If you want a minus sign in typeset material, you should enter an actual minus sign, which is U+2212 and will pair correctly in the font with a plus sign. This is not a hyphen, is not an en-dash, is not an em-dash, and is not a hyphen-minus. These are all distinct characters used for different purposes in high-quality typeset material. If you are talking about programming languages, you may not want an actual minus sign because programming languages do not use actual minus signs. You may want a hyphen-minus, which should be typeset as such. (Unfortunately, this does create the problem that Unicode has only one plus sign, so you have to choose between fidelity and correct glyph matching between plus and minus signs if you are aiming for printed output. This choice is somewhat context-dependent, and one option would be to make those characters match in fixed-width fonts.)
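As a small illustration of the claim that programming languages want the hyphen-minus rather than a typographic minus sign, here is a Python sketch. It uses Python's built-in `int()` as the example parser; `parses_as_int` is a helper name invented for this illustration, and other languages may behave differently.

```python
def parses_as_int(text):
    """Report whether Python's int() accepts the given literal."""
    try:
        int(text)
        return True
    except ValueError:
        return False


ascii_ok = parses_as_int("-5")        # U+002D HYPHEN-MINUS, then 5
minus_ok = parses_as_int("\u22125")   # U+2212 MINUS SIGN, then 5
hyphen_ok = parses_as_int("\u20105")  # U+2010 HYPHEN, then 5

print(ascii_ok, minus_ok, hyphen_ok)
```

Only the hyphen-minus form is accepted as a sign, which is why documentation about code needs U+002D to survive typesetting intact.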
Posted Oct 24, 2023 3:17 UTC (Tue)
by branden (guest, #7029)
[Link] (6 responses)
Okay. I find little to argue with in this presentation. I think if one were to undertake a "man: The Next Generation", one would likely proceed exactly as you describe, and let Unicode do the heavy lifting of glyph distinction.
...at least as far as some kind of alpha or trial run. I suspect people would rapidly run into trouble with hyphenated phrases (e.g., "long-standing, Debian-specific patches"). As you say, the keyboard is not large enough. At some point we run into not a technological problem, but a human one; it's hard to make people care about typographical distinctions that they don't want to care about, especially if their horizons stretch no farther than a terminal window. If they think of Unicode as mainly a resource for dingbats and emoji, we're unlikely to make much headway in the matter.
Posted Oct 24, 2023 3:31 UTC (Tue)
by rra (subscriber, #99804)
[Link] (4 responses)
This is the point where my own struggles with problems like this over the last ten years have given me a lot of respect for the amount of thought that's gone into Unicode. They took a careful and pragmatic decision to provide code points that represent the ambiguous merged character, and then separate code points that more precisely indicate intent. This to some extent means that within a Unicode world, both options are possible and the document author gets to choose how much to care.
If you want very nice typesetting, you can use hyphens, minuses, and en-dashes in the ways they were intended to be used. If you want to be lazy and not think about it, you can use a Unicode hyphen-minus and you get a compromise character that looks "okay" and, importantly, is clearly marked as a semantic compromise. Any typesetting system gets the correct information that the user was either talking about code or decided not to care about the distinctions between dashes, and therefore the typesetting system probably shouldn't try to care more than the user did.
This is similar to what they did with apostrophe and single quotes: the preferred characters in Unicode are U+2018 and U+2019, and U+0027 is defined as a neutral character that is intentionally left ambiguous, for users who don't care enough to draw the distinction.
You can't force users to care. The best you can do is provide them with the tools and make it clear whether they chose to use them or not. (And indeed, despite knowing all of this, I always use neutral single and double quotes and a hyphen-minus, because I don't care enough. Although I have started using real em-dashes, and I will occasionally use a real en-dash, so maybe eventually I'll come around.)
I'm simplifying a bit, and the Unicode world is not quite as shiny as all of that. Typesetting and human languages are messy and there are still sharp edges and ambiguities. But it's a system that a whole lot of people put a whole lot of thought into, and the results embed more practical wisdom than I think people realize.
Posted Oct 24, 2023 3:55 UTC (Tue)
by branden (guest, #7029)
[Link] (1 responses)
I concur with this. I don't think _anyone_ involved with groff development views Unicode as anything less than a tremendous boon to the sanity of glyph and character repertoires. (Oh, how I wish James Clark had decided to store groff characters internally as ints instead of C++ chars. But we'll get that refactored, knock wood.)
I have seen _one_ person grouse that apostrophes (however rendered) and right single quotation marks should be kept logically separate, and I have some sympathy for that view, because they _are_ logically separate--but it seems no English typesetting tradition ever sees fit to distinguish them in print. If I were to regard occasional man page authors as intransigent with respect to correct glyph choices, I dread to measure the inertia of commercial publishers.
Posted Oct 25, 2023 14:02 UTC (Wed)
by smoogen (subscriber, #97)
[Link]
Again thank you for teaching and making this conversation something enjoyable to read.
Posted Oct 24, 2023 7:26 UTC (Tue)
by smurf (subscriber, #17840)
[Link]
Depends on your locale; don't forget about U+201A. And then there's places where they use U+2039/U+203A … and other places where they use U+203A/U+2039. See https://en.wikipedia.org/wiki/Quotation_mark for even more enlightening examples.
Posted Oct 24, 2023 16:24 UTC (Tue)
by gray_-_wolf (subscriber, #131074)
[Link]
Maybe in some areas. The whole Han unification thing is in my opinion still a mistake. Having to know what language the text is in in order to be able to render it correctly is... annoying.
Posted Oct 25, 2023 18:13 UTC (Wed)
by jwarnica (subscriber, #27492)
[Link]
Authors+authoring tools, who care, can be careful, once, and the 78 downstream tools never are allowed to second guess things. Authors+authoring tools who don't care... Well, then the 78 downstream tools at least do what they are directly told without any hackery, and the cause of the errors (if any) becomes clear: the human and/or the single tool they interact with.
Posted Oct 24, 2023 0:43 UTC (Tue)
by jkingweb (subscriber, #113039)
[Link] (2 responses)
I am such a person (especially after reading this article), though I'm a complete neophyte. Thus far I've been writing Markdown and converting it using Pandoc, mainly because I had no idea where to begin to learn how to do things properly. And... I still don't. Should I start by reading man(7), or mdoc(7), or something else altogether? There seems to be many schools of thought (as is so common in the free software world), but I'd love an authoritative hand point me in *one* direction, whatever it is.
Posted Oct 24, 2023 3:07 UTC (Tue)
by branden (guest, #7029)
[Link] (1 responses)
My recommendation is the groff_man_style(7) page in the groff 1.23.0 release. It attempts to bring the reader from a state of no knowledge about man(7) or roff to a point where they can write a man page. It's not quite a tutorial--it doesn't start with a skeleton page that you fill in, but it covers the basics first and then discusses each group of man(7) macros in approximately the order you're likely to need to use them. So it starts with `TH` and `SH` and their relatives, then covers paragraphing macros, then synopsis macros, then hyperlink macros, and finally font styling macros.
You can start on page 253 of the collected groff man pages PDF. https://www.gnu.org/software/groff/manual/groff-man-pages...
> There seems to be many schools of thought (as is so common in the free software world), but I'd love an authoritative hand point me in *one* direction, whatever it is.
In the course of the past several years I've learned a great deal about the history of *roff and man pages, and I've attempted to reflect that learning in the content of groff's own man pages.
But even authoritative voices are not infallible, so if you find errors, I'd appreciate hearing about them. (I find that the adjective sits on me uncomfortably, in any case.)
Posted Oct 30, 2023 22:33 UTC (Mon)
by jkingweb (subscriber, #113039)
[Link]
Posted Oct 28, 2023 1:41 UTC (Sat)
by ceplm (subscriber, #41334)
[Link] (4 responses)
Absolutely, and I am sure the proportion of the latter ones is pretty close to the proportion of programmers who write their programs in assembler. I don’t know the proportion, but the number of people who write manpages in actual `man(7)` or `mandoc(7)` tends IMHO towards zero. Everybody uses some generator from some reasonable markup language (pod, markdown, rst).
Posted Oct 28, 2023 9:21 UTC (Sat)
by branden (guest, #7029)
[Link] (1 responses)
It appears that you are unfamiliar with the work product of the Linux man-pages project and the activity of its mailing list.
https://lore.kernel.org/linux-man/
Of the 2,680 man pages that project maintains at current count, only one (bpf-helpers(7)) is maintained in a different markup language.
You also appear to be unfamiliar with the man page maintenance practices of the *BSD community. I haven't measured how many mdoc(7) pages they maintain, but I reckon it's the same order of magnitude.
Posted Oct 29, 2023 0:07 UTC (Sun)
by ceplm (subscriber, #41334)
[Link]
However, I suspect that there is some correlation between people who are willing to program in C and people who are writing manpages in their raw format. In my Pythonish part of the world, I have never met anybody who would have non-generated ones (e.g., https://pypi.org/project/sphinxcontrib-manpage or https://pypi.org/project/argparse-manpage), and yes, I am absolutely certain that quality of such manpages is five flies down from the good ones.
Posted Oct 28, 2023 16:29 UTC (Sat)
by rra (subscriber, #99804)
[Link] (1 responses)
Writing directly in the man macros in roff is not a bad markup language. Writing directly in mandoc is even nicer. There are some weirdnesses and oddities, and I personally accept the loss of formatting flexibility and use POD (for possibly obvious reasons), but I would reach for writing roff directly long before I would tolerate the excessive verbosity and tedium of writing something in any XML- or SGML-based markup language. And most of the others just don't have the required detailed markup to do a good job with complex formatting.
I wouldn't recommend roff for a long technical document (either LaTeX or reStructuredText with Sphinx, depending on the nature of the document, are massively superior in my opinion), but it's not a horrible choice and you can get good output with it, and I think you'd be better off than with XML.
Posted Oct 29, 2023 0:10 UTC (Sun)
by ceplm (subscriber, #41334)
[Link]
Posted Oct 23, 2023 15:57 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (2 responses)
Except, if the original input is a plain hyphen/minus (as it is in ASCII), surely converting it silently to a hyphen or minus is an information-corrupting operation. Surely discarding the corruption makes more sense !?!?
Computers should be "do as I say, not what *you* think I mean". If I type an ascii dash, please give me an ascii dash! Don't auto-corrupt it behind my back! If I want something else, make me ask for what I want!
Cheers,
Posted Oct 23, 2023 17:17 UTC (Mon)
by dskoll (subscriber, #1630)
[Link] (1 responses)
The problem is that troff started out as a typesetting tool and generally speaking, for typeset output you want a - on input to be turned into a ‐ (hyphen) on output, just as a human typesetter would do when typesetting typewritten source material. I believe LaTeX does the same thing, though it's much less noticeable because LaTeX doesn't produce output on a terminal.
FWIW, I religiously use \- in my man pages where I mean codepoint 002D. It's hard-coded in my muscle memory now.
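The input-to-output rule described here can be modelled in a few lines of Python. To be clear, this is a toy model of the behavior as the article describes it (a bare - becomes U+2010, while \- yields U+002D), not how groff is implemented; `roff_render_dashes` is a name made up for the sketch and it ignores every other roff escape.

```python
import re

HYPHEN = "\u2010"        # what a bare "-" in the input turns into
HYPHEN_MINUS = "\u002d"  # what the escaped form "\-" turns into

# Match the two-character escape first so its "-" is not seen as bare.
_DASH = re.compile(r"\\-|-")


def roff_render_dashes(source):
    """Toy model: map roff-style dash input to output code points."""
    def replace(match):
        return HYPHEN_MINUS if match.group() == r"\-" else HYPHEN
    return _DASH.sub(replace, source)


careful = roff_render_dashes(r"\-\-frobnicate")  # author used escapes
careless = roff_render_dashes("--frobnicate")    # author did not

print(careful == "--frobnicate")
print(careless == "--frobnicate")
```

The two results look the same on most terminals, which is precisely why the unescaped form is so easy to get wrong without noticing.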
Posted Oct 24, 2023 4:38 UTC (Tue)
by branden (guest, #7029)
[Link]
I find this issue closely analogous to = vs. == in C, and I think many of the same people who call troff's distinction between - and \- "stupid" also derogate their peers who manage to screw up the =/== distinction, deriding them as inexperienced newbies. It seems that knowing what you're talking about in C is a virtue, but in documentation it is a tedious waste of time. (I've met many specimens of brogrammer--perhaps I am unusually unfortunate.)
In my view, in fact, the C situation is _less_ excusable, because at the time Ken and Dennis came up with C, := as an assignment operator had been around at least since Algol 60 (so, a decade or more, and _everybody_ knew what Algol was), and there was certainly no problem finding the keys with which to type it. (I like Hillel Wayne's take on the matter: "Nowadays most languages use = entirely because C uses it, and we can trace C using it to CPL being such a trash fire.")
By contrast, you will require winning-lottery-ticket levels of luck to find distinguishable hyphen and minus keys on a keyboard.
Posted Oct 23, 2023 17:14 UTC (Mon)
by butlerm (subscriber, #13312)
[Link] (28 responses)
They should quit doing that; it is not helpful to change one character into a different one by default. A macro or some other option should be required to do something out of the ordinary like that, in accordance with the principle of least surprise.
Posted Oct 23, 2023 17:34 UTC (Mon)
by branden (guest, #7029)
[Link] (27 responses)
No, groff's behavior is consistent with every other implementation of troff in the world, including the original implementation dating back to about 1973, appearing in Fourth Edition Unix from Bell Labs.
> by translating an ASCII minus to something other than an ASCII minus.
There is no such thing as an "ASCII minus". The relevant standards documents call it a "hyphen-minus", which reveals the very problem you are trying to conceal with your poorly informed proclamation.
Strictly, ASCII didn't even give the characters names, except arguably the control characters.
ISO 8859/ECMA-94 did, and they call the character "hyphen-minus", as did Unicode 1.0 and every revision since.
https://www.ecma-international.org/wp-content/uploads/ECM...
Posted Oct 23, 2023 17:59 UTC (Mon)
by butlerm (subscriber, #13312)
[Link]
It is defective by design then, and they should fix it. Or apparently it was fixed, and they decided to break it.
Posted Oct 23, 2023 18:41 UTC (Mon)
by ms-tg (subscriber, #89231)
[Link] (16 responses)
Who are the user base that are benefitting from the typesetting of '-' hyphen-minus in the source of man pages into anything else?
Posted Oct 23, 2023 19:04 UTC (Mon)
by branden (guest, #7029)
[Link] (15 responses)
It is precisely that set of people who render man pages to output formats that distinguish hyphens from minus signs. It seems to come as a shock to some people that you can do things like render man pages as PDF.
https://www.gnu.org/software/groff/manual/groff-man-pages...
Save the specific format of PDF itself, this was the intention and practice of the people who brought us Unix man pages in the first place.
"The manual was intended to be typeset; some detail is sacrificed on terminals." (man(1), _Unix Time-Sharing System Programmer's Manual_, Eighth Edition, Volume 1, February 1985)
Posted Oct 23, 2023 19:26 UTC (Mon)
by tzafrir (subscriber, #11501)
[Link] (2 responses)
https://manpages.debian.org/unstable/groff-base/groff.1.e...
I do see a different character for an em-dash in the second one (—).
Posted Oct 23, 2023 20:10 UTC (Mon)
by branden (guest, #7029)
[Link] (1 responses)
Possibly. I wish I had reliable statistics!
> The following pages seem to show hyphens as hyphen-minus characters (unless I read incorrectly):
Ingo Schwarze might thank me for pointing out that the back-end renderer that debiman uses for this purpose is mandoc(1), and so groff is not involved at all.
mandoc is indeed better in many cases at rendering man pages to HTML than groff is. I'm not happy about that, but it's my understanding of the status quo. grohtml(1), the relevant part of groff, is difficult to work on. I've fixed some bugs in it but it inherently attempts a much more ambitious thing than mandoc(1) does. groff's HTML support attempts to handle the full roff language. mandoc(1) avowedly does not, and Ingo swears it never will.
Posted Jan 7, 2024 23:21 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link]
(I generate them from catman pages though, which in turn are produced with nroff (not gnroff) and the BSD mdoc, man.old, me, ms, etc. macropackages.)
Posted Oct 23, 2023 19:33 UTC (Mon)
by butlerm (subscriber, #13312)
[Link] (10 responses)
Posted Oct 23, 2023 20:04 UTC (Mon)
by branden (guest, #7029)
[Link] (9 responses)
If only Ken, Dennis, Steve Bourne, Doug McIlroy, et al., had had the benefit of your wisdom...
https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/man/...
Posted Oct 23, 2023 20:46 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
(And yes, it is fair to point out that we do not have a time machine and cannot change the past. But it is also fair to point out that standards are paper. We can, and should, regularly ask whether the benefits of any given standard continue to outweigh its costs, so long as we remember to include backcompat in that cost/benefit analysis. This is how C got rid of trigraphs, for example.)
Posted Oct 24, 2023 3:37 UTC (Tue)
by branden (guest, #7029)
[Link]
I think it's about as charitable as statements like "groff is making up out of thin air" and "defective by design". It's the currency he seems to be comfortable trading in.
> It is fairly obvious, at least to me, that the comment was phrased in the present tense, and is about what is convenient for man(7) users *today*, rather than what might have made sense in the 70's.
But the specific case of - vs. \- is one area where the passage of time _hasn't_ made much of a difference. It was just as difficult to learn the distinction and operate the keyboard to produce these alternatives in the mid-1970s as it is today. Arguably worse back then, in fact, since the Bell Labs Unix room people all used Western Electric Teletypes, and my impression is that the force required to actuate the keys on those things was colossal compared to, say, an Apple Magic keyboard. (Apart from machine memory constraints and a baud rate that makes continental drift look like a test of special relativity, this may account for Ken Thompson and early Unix culture's preoccupation with extreme abbreviation.)
And as Russ suggested above, it's not like today's keyboards have separate, convenient hyphen and minus keys.
> But it is also fair to point out that standards are paper. We can, and should, regularly ask whether the benefits of any given standard continue to outweigh its costs, so long as we remember to include backcompat in that cost/benefit analysis.
Quite so, and that is what I have tried to do. Moreover, *roff and the man(7) language are not formally standardized anyway. (Some may consider this fortunate.) All we have is convention. I didn't see any reason to make historical man pages render incorrectly. I've collected them and use them (informally) to regression-test groff. Here's one (long, technical) example of a groff regression that I felt honor-bound to undo to keep compatibility with historical man pages even though I felt it was a technical detriment. https://lists.gnu.org/archive/html/groff/2022-06/msg00026...
Note the follow-ups, particularly https://lists.gnu.org/archive/html/groff/2022-06/msg00048... .
> This is how C got rid of trigraphs, for example.
Yes, and a good thing. Their whole purpose was to accommodate people whose keyboards couldn't even type the printable code points in ASCII (the true, 7-bit, ANSI version, as opposed to early revisions of ISO 646, or any form of ISO 8859, which people slovenly call "ASCII").
Posted Oct 24, 2023 7:11 UTC (Tue)
by butlerm (subscriber, #13312)
[Link] (6 responses)
Apparently the Debian maintainer finds it immensely more practical to continue mapping hyphen-minus to hyphen-minus in the man macro package, so it is hard to see what the audience is for treating hyphen-minus as something other than hyphen-minus in man pages.
Somewhere in the development history of that package someone decided it was a useful thing to treat hyphen-minus as hyphen-minus and now there is a breaking change that distributions apparently cannot adopt in practice to revert that behavior because thousands of man pages have (inadvertently) come to rely on it.
Perhaps it was a mistake to allow hyphen-minus to map to hyphen-minus when the standard was for it to do something else, but it appears to be an accommodation that is now almost inevitable - and indeed almost as if an invisible hand had restored the natural default mapping of a character to itself notwithstanding what was more convenient five decades ago - at least for man pages.
Posted Oct 24, 2023 8:35 UTC (Tue)
by cjwatson (subscriber, #7322)
[Link] (1 responses)
I'd also like to clarify that the change I made in the Debian packaging is only for manual pages _rendered in terminals_ and not for things like PDF output.
Posted Oct 24, 2023 10:24 UTC (Tue)
by branden (guest, #7029)
[Link]
Thank you for adding your perspective here.
> People shouldn't interpret me as being opposed to Branden's technical goals; I just have unfortunately finite time. I do take care to make the distinction in the *roff documents I write, and I think others should do the same for the sake of better printed output. As far as I know the only real point of difference is that I don't want to externalize the costs of better printed output onto the readers of manual pages in terminals (and even Branden has some sympathy with that position when I put it that way, I think).
I do, and I am sorry that your mailbox exploded in flames over this issue, particularly since I know my fire hose to be a high-volume one.
I can accept the interpretive frame that this was a matter of balancing externalities; when I first came to the issue, I asked myself, "well, how are people who *want* typographically superior man pages supposed to see the errors so that they can fix them?" Once one had set up a UTF-8 environment if necessary, and selected for one's terminal emulator a font that is not an outright impediment in this area, as implied by the LWN editor's OP, there were a few possibilities under the status quo ante (groff 1.22.4 and going back several years).
1a. Fork your distributor's package locally and maintain a patch against tmac/an-old.tmac (as it was then known). Remember to keep your forked package for any other machines where you want to do this work.
2a. Use dpkg-divert on /usr/share/groff/1.22.4/tmac/an-old.tmac and maintain a modified copy of that file. Remember to duplicate this diversion process on any other machines you install where you want to do this work. (Other distributions, I assume, have some equivalent to dpkg-divert.)
3a. Download groff from GNU and build and maintain an installation of it outside the packaging system. And, oh yeah, patch tmac/an-old.tmac there, too, damn it.
Now, as of groff 1.23.0, if a person wants to improve man pages in this respect (we few, we happy OCD-ridden few), their courses of action are as follows.
1b. Go to work right away, if your distributor hasn't changed groff in this respect.
2b. Modify the conffile /etc/groff/man.local and comment out the workaround. Experienced system administrators of my acquaintance are accustomed to backing up /etc, or at least checking it for things they don't want to lose when copying or migrating systems.
3b. Download groff 1.23.0 from GNU and build it and use it as-is.
These three alternatives each seem superior to their 1.22.4 counterparts to me.
There is a cost, yes. Judging by repology.org, the quantity of groff package maintainers in the world numbers between 10 and 100 (Fermi estimate). It might make me a jerkass to expect these folks to read the NEWS file when a release happens every few years, and to be up to the challenge of applying a small diff to a text file. But having some experience as a package maintainer, these didn't seem like onerous expectations to me.
And in case the point need be reiterated to onlookers, _you_ were not surprised by this change. (I haven't heard from any other groff 1.23.0 packagers on this point, though I have about other matters.) It was telegraphed and discussed literally years ago. The surprise, if it was one, was the vehemence of some users' response.
Posted Oct 24, 2023 15:41 UTC (Tue)
by WolfWings (subscriber, #56790)
[Link] (3 responses)
...or perhaps the mistake was ever mapping almost any unescaped ASCII character to anything except itself? Principle of least surprise. I get it, decision made decades ago, but it's a random typographical gotcha to have this cross-mapped to another character by default up there with Excel's "OH THAT'S A DATE!"-ism.
Posted Oct 24, 2023 17:27 UTC (Tue)
by branden (guest, #7029)
[Link] (2 responses)
> ...or perhaps the mistake was ever mapping almost any unescaped ASCII character to anything except itself? Principle of least surprise.
Then consider how surprising it would be to have to write hyphen\(hyminus, non\(hy\(man\(hypage, command\(hyline, cuts\(hyand\(hypastes, UTF\(hy8, line\(hybreaking, pre\(hyrelease, look\(hyalike, and device\(hyindependent, to name just a few examples of hyphenated words or phrases from this very web page.
I begin to perceive that people aren't going to read the groff_char(7) man page no matter how many times I link to it, so I'll just quote it.
History
A consideration of the typefaces originally available to AT&T
nroff and troff illuminates many conventions that one might
regard as idiosyncratic fifty years afterward. (See section
“History” of roff(7) for more context.) The face used by the
Teletype Model 37 terminals of the Murray Hill Unix Room was
based on ASCII, but assigned multiple meanings to several code
points, as suggested by that standard. Decimal 34 (") served as
a dieresis accent and neutral double quotation mark; decimal 39
(') as an acute accent, apostrophe, and closing (right) single
quotation mark; decimal 45 (-) as a hyphen and a minus sign;
decimal 94 (^) as a circumflex accent and caret; decimal 96 (`)
as a grave accent and opening (left) single quotation mark; and
decimal 126 (~) as a tilde accent and (with a half‐line motion)
swung dash. The Model 37 bore an optional extended character set
offering upright Greek letters and several mathematical symbols;
these were documented as early as the kbd(VII) man page of the
(First Edition) Unix Programmer’s Manual.

At the time Graphic Systems delivered the C/A/T phototypesetter
to AT&T, the ASCII character set was not considered a standard
basis for a glyph repertoire by traditional typographers. In the
stock Times roman, italic, and bold styles available, several
ASCII characters were not present at all, nor was most of the
Teletype’s extended character set. AT&T commissioned a “special”
font to retain their accustomed glyph repertoire.

A representation of the coverage of the C/A/T’s text fonts
follows. The glyph resembling an underscore is a baseline rule,
and that resembling a vertical line is a box rule. In italics,
the box rule was not slanted. We also observe that the hyphen
and minus sign were already “de‐unified” by the fonts provided; a
decision whither to map an input “-” therefore had to be taken.

The special font supplied the missing ASCII and Teletype extended
glyphs, among several others. The plus, minus, and equals signs
appeared in the special font despite availability in text fonts
“to insulate the appearance of equations from the choice of
standard [read: text] fonts”—a priority since troff was turned to
the task of mathematical typesetting as soon as it was developed.

We note that AT&T took the opportunity to de‐unify the
apostrophe/right single quotation mark from the acute accent (a
choice ISO later duplicated in its 8859 series of standards). A
slash intended to be mirror‐symmetric with the backslash was also
included, as was the Bell System logo; we do not attempt to
depict the latter.

One ASCII character as rendered by the Model 37 was apparently
abandoned. That device printed decimal 124 (|) as a broken
vertical line, like Unicode U+00A6 (¦). No equivalent was
available on the C/A/T; the box rule \[br], brace vertical
extension \[bv], and “or” operator \[or] were used as
contextually appropriate.

Devices supported by AT&T device‐independent troff exhibited some
differences in glyph detail. For example, on the Autologic APS‐5
phototypesetter, the square \(sq became filled in the Times bold
face.
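For readers who want to find the affected spots in their own pages, the '-' versus '\-' distinction can be checked mechanically. What follows is only a minimal sketch in Python, not a groff tool: the function name find_unescaped_options and both regular expressions are invented for illustration, and the heuristic assumes that an option looks like one or two hyphen-minus characters followed by a letter. It strips inline font escapes, skips comment lines, and ignores hyphens inside ordinary words such as "command-line".

```python
import re

# Inline font escapes such as \fB, \f(CW, or \f[B]; removed before matching.
FONT_ESCAPE = re.compile(r"\\f(?:\[[^\]]*\]|\(..|.)")

# One or two unescaped "-" that start an option-like word. The lookbehind
# rejects "\-" (already escaped) and hyphens inside words ("command-line").
OPTION = re.compile(r"(?<![\\\w-])-{1,2}[A-Za-z][\w-]*")


def find_unescaped_options(source):
    """Return (line number, text) pairs for option-like words using a bare '-'."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if line.startswith('.\\"'):  # roff comment line
            continue
        stripped = FONT_ESCAPE.sub("", line)
        hits.extend((lineno, m.group()) for m in OPTION.finditer(stripped))
    return hits


sample = "\n".join([
    ".TH FOO 1",
    ".B --frobnicate",
    r"Use the command-line \-\-help option.",
    r"\fB--verbose\fP is noisy.",
])
for lineno, text in find_unescaped_options(sample):
    print(f"line {lineno}: {text} should probably be written with \\-")
```

The sketch deliberately reports candidates rather than rewriting the file, since only the author can tell whether a given "-" was meant as a hyphen or as a literal hyphen-minus.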
Posted Oct 27, 2023 1:21 UTC (Fri)
by ms-tg (subscriber, #89231)
[Link] (1 responses)
Posted Oct 27, 2023 6:56 UTC (Fri)
by branden (guest, #7029)
[Link]
If you mean "is it possible to *format* man pages with something other than groff", then options include (limiting myself to software projects that are maintained--albeit some at a very slow pace) the following.
1. Heirloom Doctools troff: https://n-t-roff.github.io/heirloom/doctools.html
neatroff does not ship with a man(7) package, but you can configure it to use another troff's. I haven't tested this extensively.
Not all of these implement all the same extensions to the original man(7) dialect of 1979 that groff does. The groff_man(7) man page tracks such portability issues.
There are several other partial interpreters of man+roff, like mandoc, out there that produce HTML output exclusively. Most are of dubious quality and many are dead--no longer maintained. Several are unrelated but call themselves "man2html", and when discussing them, it is crucial to clarify which one you're talking about. Of those, Thomas Dickey's is probably the highest quality, but I have never rigorously evaluated it for completeness or correctness. https://invisible-island.net/scripts/man2html.html
(Hmm, LWN's comment previewer seems to think "-" characters are invalid in URLs, and so won't hyperlink them.
Maybe I should have spelled it "\-".)
Posted Jan 7, 2024 23:20 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link]
http://www.mirbsd.org/MirOS/dist/mir/mksh/mksh.pdf at least is an okay result.
It is typeset with the BSD mdoc macropackage, not the GNU one, thankfully.
Posted Oct 23, 2023 19:15 UTC (Mon)
by tzafrir (subscriber, #11501)
[Link] (3 responses)
Posted Oct 23, 2023 20:16 UTC (Mon)
by branden (guest, #7029)
[Link] (2 responses)
Yes.
usr/share/doc/groff-base/meintro.ps.gz
Posted Oct 24, 2023 10:30 UTC (Tue)
by taladar (subscriber, #68407)
[Link] (1 responses)
Posted Oct 24, 2023 11:10 UTC (Tue)
by branden (guest, #7029)
[Link]
troffcvt: /usr/share/troffcvt/tc.me
There are some dozens of historical Unix documents written variously in ms, mm, and me(7). Some of them have encumbered licensing (or their copyright status is unknown), and others, for instance from the old BSD PS1, PS2, USD, and SMM manuals, simply aren't packaged for Debian as far as I know.
The 150 or so Bell Labs Computing Science Technical Reports documents were all, to the best of my knowledge, composed with troff (the very earliest ones with nroff alone) but unfortunately the sources to these are seldom available (copyright encumbrance again). My understanding is that Doug McIlroy and Brian Kernighan in particular have kept pretty good track of their work artifacts, but for $reasons don't just slap them up online.
Posted Oct 26, 2023 17:27 UTC (Thu)
by anton (subscriber, #25547)
[Link] (4 responses)
The nroff output at the time certainly did not contain an Unicode hyphen, because Unicode did not exist at the time. I expect that the nroff output in 1973 converted hyphen-minus into hyphen-minus where groff's nroff implementation in 2023 converts hyphen-minus into hyphens. In 1973 there probably was not much cutting and pasting, but I expect that you could search man pages already, and the use of hyphen-minus and other ASCII characters was already beneficial for that purpose.
Posted Oct 26, 2023 18:36 UTC (Thu)
by branden (guest, #7029)
[Link]
At that time, so did nroff output. The Bell Labs CSRC, for reasons not completely clear to me, never bothered to work on support for character-cell video terminals (so-called "glass TTYs"). Part of this may have been due to the Western Electric Teletype's market position and AT&T's status as a (regulated) monopoly. In any event, by all accounts, Bell Labs Research Unix leapfrogged from paper Teletypes to the Jerq/Blit/DMD 5620 graphical terminal (which, I note, was still _branded_ "Teletype"). This is also why Seventh Edition Unix (1979) didn't have a pager program. A lot of their glass TTY support came in by merging back stuff from BSD in the 1980s, the Research Unix years. And as far as I know, _commercial_ AT&T Unix didn't take any more of that than they had to.
Support for character-cell video terminals fell to the Berkeley CSRG and to the commercial AT&T Unix concern, which seems to have been reorganized and rebranded about as often as happens in modern tech companies. Rivalry proliferated here: more(1) vs. pg(1), termcap vs. terminfo, Berkeley's anemic curses vs. AT&T's much more capable one (but locked up of course behind hefty license fees and shouted claims of trade secrecy). To take just one example, BSD curses only ever supported one form of highlighting: "standout/standend". AT&T curses, by the time of System V Release 4 (1989), supported several, following ISO 6429, and used a generalized attribute management data type. (Naturally enough, AT&T picked one that was too small.)
> Nobody cut and pasted from there to terminal input,
Indeed not, since the only selection buffer available was in the operator's brain.
Well, I suppose one could have used the Teletype's paper tape punch/reader attachment.
> and nobody computer-searched it for, say --some-option.
grep(1) existed since very early days. In fact it seems to have shown up in Fourth Edition Unix, just like troff itself. https://minnie.tuhs.org/cgi-bin/utree.pl?file=V4/man/man1...
(I could be slightly off there--troff's man page took much longer to show up than troff itself did, and the source code for some early editions of Unix remains lost.)
A few points: (1) the hyphen and minus were not de-unified in terminal output, only on the typesetter. (2) Typesetter output was not practically searchable; the C/A/T byte stream was not practical for such purposes. (3) Kernighan invented a text-based output format for device-independent troff. You _could_ search that, and observe the difference between a hyphen (written with '-' with either the 'c' command or the anonymous, optimized move-and-print command; see CSTR #97) and the minus sign, which required the 'C' command.
But, I would guess, few people apart from those troubleshooting a troff output driver ever looked at that output file format.
Posted Oct 26, 2023 18:40 UTC (Thu)
by branden (guest, #7029)
[Link] (2 responses)
This is an overgeneralization. Like AT&T troff, groff interprets an (unescaped) input hyphen-minus as a hyphen. Like AT&T troff, a device that doesn't have distinct hyphen and minus sign glyphs maps them to the same thing in output.
Some knowledge of the architecture of AT&T device-independent troff might be helpful to you.
Posted Oct 27, 2023 7:57 UTC (Fri)
by anton (subscriber, #25547)
[Link] (1 responses)
Posted Oct 27, 2023 10:00 UTC (Fri)
by branden (guest, #7029)
[Link]
Your point reveals considerable historical ignorance.
(A) xterm did not exist in 1973. Even the DEC VT100 terminal that xterm originally emulated (when implemented circa 1984) would not go into production for another five years.
(B) Output to terminals, at Bell Labs in 1973, was largely to printing devices like the Western Electric Teletype Model 37. Being based on ink and paper, they were capable of constructive overstriking, a vanishingly rare feature of video terminals (storage-tube displays like the Tektronix 4014 excepted--and which troff treated like a typesetter anyway, see the old tc(1) command: https://www.unix.com/man-page/v7/1/TC/ ). Thus, anyone who started using nroff with a video terminal like the Lear Siegler ADM-3a (on which Bill Joy developed vi) made a more significant break with traditional hardware behavior than this is.
> resulted in the same ASCII character in the output until people switched to Unicode locales (maybe a decade ago).
15 years ago or more, depending on how adventurous one (or one's distribution) was. As I have implied elsewhere in this discussion, Unicode support in terminal emulators crossed a Rubicon. All of a sudden there were distinguishable hyphens and minus signs, and, worse--compared to the problem facing Bell Labs when they acquired the C/A/T phototypesetter--a third character, the hyphen-minus, was still retained. Several years ago I proposed adding a new special character, "hm", to correspond specifically and solely to U+002D, but none of the groff experts on its mailing list thought that was a good idea. And I no longer do, either; man page authors are even less likely to start typing "\(hm\(hmlong\(hmoption" than they are "\-\-long\-option". Worse, an "hm" special character would require implementation in all other man page formatters, and several of those are abandoned, or so indifferently maintained that there is no realistic hope of this job ever getting done.
> And from what I read in the article and comments, with groff this did not change even at that time, because it still produced the ASCII character. Only groff-1.23.0 changed that.
Wrong again. groff's "utf8" device was added in groff 1.16 (released 2000-05-23). https://git.savannah.gnu.org/cgit/groff.git/tree/ChangeLo...
groff's man(7) package did change to collapse the '-' and '\-' ordinary and special characters to the same thing for the utf8 device--get ready to wave your bloody shirt--nine years later, in January 2009. https://git.savannah.gnu.org/cgit/groff.git/commit/?id=98...
Evidently groff's maintainer at the time, Werner Lemberg, did not share your sense of alarm or urgency.
The subsequent release was groff 1.20, 2009-01-05. Interestingly, this was also the first release to which the GNU GPLv3 applied, the presence of which reliably sends Apple into prophylactic shock, so Mac OS X _never_ had this "fix", and for a considerable number of users we can measure a significantly greater historical longevity for mapping - and \- distinctly on groff's utf8 device. groff 1.23.0's behavior is thus a return to form in this sense.
I reiterate: I don't think the character translations introduced for man(7) on the utf8 device in groff 1.20 were an inherently bad idea; it just shouldn't have been done in tmac/an-old.tmac, but rather man.local, and I think it should have been commented out by default for the reasons you can see on this very page in my reply to cjwatson.
'[W]hen I first came to the issue, I asked myself, "well, how are people who *want* typographically superior man pages supposed to see the errors so that they can fix them?"'
The "solutions" that practically everyone objecting to groff 1.23.0's behavior in this respect proposes involve changing these defaults in a way that is much more tedious to override, and poorly serve people who want to locate and fix problems, which is why maintaining the '-'/'\-' distinction is appropriate in the source archives produced and hosted by the GNU Project. It's fine if distributors want to do something different; that is what Colin has done, and I welcome his decision if it reduces the number of ignorant harangues he (and I, as groff's Debian package co-maintainer) have to endure about it.
Posted Oct 24, 2023 4:24 UTC (Tue)
by wtarreau (subscriber, #51152)
[Link] (19 responses)
It's unimaginable that humanity managed to reach a point where people could argue over the type of horizontal bar they'd have to use on display depending on the context when the same key is pressed on the keyboard and nobody cares at either end!
At least reading the man pages in ASCII should fix the copy-paste problem...
Posted Oct 24, 2023 7:38 UTC (Tue)
by smurf (subscriber, #17840)
[Link]
It might come as a surprise to you that there are people who *do* care, if only to make the output look slightly more reasonable when those pesky-hyphenated-word-expression-thingies want to span a line break.
Posted Oct 24, 2023 16:45 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (16 responses)
I have questions for you about Cyrillic "а" vs. ASCII "a". I don't believe the former changes based on the font, but the latter definitely does (into the "o with a stem on the right" form generally). How about the upper case version where they generally *are* the same glyph rendering? This sounds vastly more complicated than Unicode as it is because it is trying to move all of the complexity into the operators instead of keeping it in the data. How is one supposed to capitalize an English string that quotes a Russian word like "гражданский"? Sorting becomes convoluted because anything that isn't English needs to decipher "is this 'Н' meant to be sorted like Cyrillic 'en' or English 'aich' in this context?" (hint: it is the Cyrillic "en").
> Characters should not convey semantics, only a representation.
So where do you stand on the tabs-vs-spaces debate then? Hopefully you don't care (beyond consistency) if that is your view. Do you use any ASCII control characters beyond `\n` and `\0`? These are nothing *but* semantic representations of things.
Posted Oct 24, 2023 17:15 UTC (Tue)
by wtarreau (subscriber, #51152)
[Link] (15 responses)
Here we're speaking about punctuation symbols that are drawn similarly with a pen, typed with the same key, indistinguishable when read on paper or on screen if you don't have the other ones to compare, etc. They *are* the same character, for the writer and for the reader. The simple fact that it becomes so confusing that you cannot copy-paste a simple command line anymore should indicate that it just went too far in the distinction when some absolutely insist on using different internal representations and resort to heuristics or rules to say "let's say that without a backslash we'll use this one and with a backslash it will be this one".
What will be the next step, left-justified vs centered vs right-justified hyphen/dash/minus? Hyphen to use between upper case letters and another one to use between lower case letters? A special shorter minus sign to be used in front of the zero because it looks nicer? Maybe we'll reach a point where we'll need 256 bits to represent all character variants and it will be sufficient to simply indicate all pixels in a 16x16 matrix that will then be vectorized, and even then I'm not sure it will be sufficient for some.
> So where do you stand on the tabs-vs-spaces debate then?
There's no real "debate", rather preferences that are dictated by the largest consensus among initial authors of a project. Practices can evolve sometimes, but you'll note that tab is not a representation but a control character.
> Do you use any ASCII control characters beyond `\n` and `\0`?
Yes I do, but they're "control characters", which means that they're reserved encodings in byte streams that precisely escape the representation flow to act on the controls of the terminal. XON/XOFF and ESC (0x1B) are a perfect illustration of this by the way. No real character is associated with that.
Posted Oct 24, 2023 17:53 UTC (Tue)
by branden (guest, #7029)
[Link]
The price of this choice is a larger space of homoglyph attacks.
What Unicode attacks is a proper engineering problem: there is no solution that is optimal in all dimensions. Trade-offs must be made.
Posted Oct 24, 2023 19:18 UTC (Tue)
by butlerm (subscriber, #13312)
[Link] (13 responses)
Posted Oct 24, 2023 20:46 UTC (Tue)
by branden (guest, #7029)
[Link] (6 responses)
That's a fine idea; so good that there's already a facility in PDF for this (called CMap), and something similar is at work on this LWN web page too. When I searched for hyphens in Firefox, it matched all the alternative dash symbols in Mr. Corbet's exhibit as well.
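The kind of matching described here, where a search for the ASCII character also finds its look-alikes, can be sketched in a few lines. This is a minimal illustration in Python and not how Firefox, PDF viewers, or less(1) actually do it; the names DASH_FOLD, fold_dashes, and find_folded are hypothetical, and the folding table only covers the code points listed in the article's table.

```python
# Dash-like code points from the article's table, all folded to U+002D.
# The set is illustrative, not exhaustive.
DASH_FOLD = str.maketrans({
    "\u2010": "-",  # HYPHEN
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
    "\u2212": "-",  # MINUS SIGN
})


def fold_dashes(text):
    """Replace each dash-like character with an ASCII hyphen-minus."""
    return text.translate(DASH_FOLD)


def find_folded(haystack, needle):
    """Like str.find(), but insensitive to which kind of dash was used."""
    # The mapping is one character to one character, so indexes into the
    # folded text are also valid indexes into the original text.
    return fold_dashes(haystack).find(fold_dashes(needle))


rendered = "Use \u2010\u2010frobnicate to frobnicate."
print(rendered.find("--frobnicate"))          # plain search misses it
print(find_folded(rendered, "--frobnicate"))  # folded search finds it
```

Folding only at search time leaves the rendered text untouched, so typographically distinct output and searchability do not have to be traded against each other.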
Since less(1) is the 800-lb. gorilla of pagers (with a man page weighing nearly as much when printed), that might be a good place to see where this might be implemented. Terminal emulators are another possibility, since they generally have to know something about Unicode character properties anyway, and already have a giant data structure housing all of the grapheme clusters rendered at every character cell in the nominal window plus the scrollback buffer. But I won't hold my breath for xterm to implement a search dialog. ;-)
Posted Oct 25, 2023 0:14 UTC (Wed)
by butlerm (subscriber, #13312)
[Link] (3 responses)
For a rich text editor the default should be reversed, i.e. to preserve variants unless the user prefers to paste as plain text, similar to what many rich text editors do already.
Posted Oct 25, 2023 5:58 UTC (Wed)
by donald.buczek (subscriber, #112892)
[Link]
I have a command line tool to detect and remove identified phishing email from our users mailboxes selectable by header fields. Just yesterday I've used
./delete_malware.py "" "⚠ Action Required"
(copied-and-pasted from my shell history)
Posted Oct 25, 2023 11:44 UTC (Wed)
by nim-nim (subscriber, #34454)
[Link]
Filenames can contain pretty much any unicode codepoint, many apps will derive the file name from human text typed within the file (for example, song track titles, document title, author name, etc).
Some apps will even insist on their god-given right to use any random bunch of bytes in the filename, even when the byte combination is explicitly forbidden in UTF-8.
Thus cut and pasting any command that contains a filename can involve at least the full UTF-8 scope.
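To illustrate the point about file names that are not valid UTF-8: the sketch below, in Python, shows a byte sequence that a strict UTF-8 decoder rejects, and one way such a name can still be carried through text-handling code without loss, using the "surrogateescape" error handler. The file name itself is made up for the example.

```python
# A file name as raw bytes; 0xFF can never occur in well-formed UTF-8.
raw_name = b"track-\xff.txt"

try:
    raw_name.decode("utf-8")
    strictly_valid = True
except UnicodeDecodeError:
    strictly_valid = False
print("valid UTF-8:", strictly_valid)

# surrogateescape maps each undecodable byte to a lone surrogate code point,
# so the name survives a round trip through str unchanged.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")
print(round_tripped == raw_name)
```

Any tool that wants to normalize pasted commands therefore has to decide what to do with such names, since they cannot be treated as ordinary Unicode text.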
Posted Oct 26, 2023 21:12 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Or, if you do a right-click-paste, one of the options should be "paste as ascii (or 8-bit)" (which could be terminal-sensitive ie if it's a konsole or xterm or whatever) which would either convert the multiple variants to space or dash, accented characters to plain, etc etc, or just drop characters it can't convert.
Then at least what happens is under human control ...
Cheers,
Posted Oct 25, 2023 5:52 UTC (Wed)
by donald.buczek (subscriber, #112892)
[Link] (1 responses)
getopt(3) and friends to regard any hyphen-like character as the option character? Would resolve one basic problem.
Whitespace-splitting of shells might also be candidates, because for some reason, non-breaking space variants seem to be slipping into command lines copied and pasted from email by our users, because they use webmail.
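As a rough sketch of the whitespace half of this idea, the Python snippet below normalizes Unicode space separators (general category "Zs", which includes the non-breaking space U+00A0) to an ASCII space before splitting a pasted command line into words. It is not a patch for any shell; split_pasted is a hypothetical helper, and it assumes that every such space in pasted text was meant as an ordinary word separator, which would be wrong for file names that really do contain one.

```python
import shlex
import unicodedata


def split_pasted(command):
    """Split a pasted command line, treating Unicode spaces as plain spaces."""
    normalized = "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in command
    )
    # shlex.split() applies POSIX-shell-like quoting rules to the result.
    return shlex.split(normalized)


pasted = "ls\u00a0-l\u00a0/tmp"  # non-breaking spaces, as copied from webmail
print(split_pasted(pasted))
```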
Posted Oct 27, 2023 17:32 UTC (Fri)
by gutschke (subscriber, #27910)
[Link]
Posted Oct 25, 2023 3:14 UTC (Wed)
by wtarreau (subscriber, #51152)
[Link] (3 responses)
Posted Oct 25, 2023 9:29 UTC (Wed)
by nim-nim (subscriber, #34454)
[Link] (1 responses)
Those fonts are *more* sensitive to encoding errors not less. If you use the wrong kind of hyphen/dash/minus they will make ligature mistakes.
Posted Oct 25, 2023 9:31 UTC (Wed)
by nim-nim (subscriber, #34454)
[Link]
Posted Oct 26, 2023 4:25 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link]
Posted Oct 25, 2023 9:24 UTC (Wed)
by nim-nim (subscriber, #34454)
[Link]
Dashes are long, minus signs have the same width as plus signs, hyphens are short.
Posted Oct 25, 2023 11:36 UTC (Wed)
by Sesse (subscriber, #53779)
[Link]
Posted Oct 24, 2023 17:26 UTC (Tue)
by mpr22 (subscriber, #60784)
[Link]
The "multiplicity of representations" concern has existed since the publication of ISO 8859-5:1988 and ISO 8859-7:1987.
Posted Oct 24, 2023 5:44 UTC (Tue)
by pabs (subscriber, #43278)
[Link] (1 responses)
Posted Oct 24, 2023 13:56 UTC (Tue)
by JanC_ (guest, #34940)
[Link]
It might be useful for code fonts used to edit (manpage) sources, but there are also rendering options (colour, style, …) that code editors can use to highlight “special” characters (as they often already do).
Posted Oct 24, 2023 15:07 UTC (Tue)
by mattdm (subscriber, #18)
[Link]
Posted Oct 25, 2023 0:30 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (13 responses)
The "char" it identifies is the first bytes of the utf-8 encoding..
There is certainly room for improvement here.... I guess I should send a patch.
Posted Oct 26, 2023 12:01 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (12 responses)
But Java and Windows are historically UCS-2/UTF-16 environments with half-assed UTF-8 support.
Posted Oct 27, 2023 8:08 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (11 responses)
* Files that are UTF-16LE, no BOM included because it is/was the OS's native encoding.
NTFS does actually have the necessary facilities to smuggle an encoding declaration in e.g. an alternate data stream. But the TXT file extension is older than NTFS, so I can't exactly blame them for not using a time machine here.
[1]: https://learn.microsoft.com/en-us/windows/apps/design/glo...
Posted Oct 28, 2023 15:29 UTC (Sat)
by wtarreau (subscriber, #51152)
[Link] (10 responses)
I've always been very angry at the encoding UTF-8 uses because it was purposely made to be transparent to 7-bit encoding, and its designers, being English-speaking people, probably underestimated the amount of trouble it would cause to those already using code pages daily for accents and extra letters. In addition, UTF-8 is known for being extremely inefficient for some languages like Chinese.
Ideally we'd need a different encoding that does *not* support 7-bit chars and recodes all of them using prefixes that are not part of ASCII code pages, such as some control chars and 0x7F. This would make documents, file names, etc. unambiguous (old vs. new format) instead of trying to be "mostly compatible". This "mostly compatible" aspect is a disaster because a same document tends to contain different encodings at different places when edited with multiple persons who didn't notice the problem. A few times here in France I've even seen ads printed on paper with incorrect character sequences such as "Ã©" for "é" due to UTF-8 coding issues. This would not happen if no single char appeared correctly. Sure, it would use a larger encoding for mostly 7-bit texts, but for those using mostly non-7-bit text it would be much better. Possibly it could even end up with 3 bytes if enough control codes were used, and end up being of fixed size.
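The printed-ad corruption mentioned here is easy to reproduce. A minimal sketch, assuming the faulty step read UTF-8 bytes as Latin-1 (which legacy codec was actually involved is an assumption; Windows-1252 gives the same result for this character):

```python
# Encode "é" as UTF-8, then decode the bytes with the wrong (legacy) codec.
text = "é"
utf8_bytes = text.encode("utf-8")        # two bytes: 0xC3 0xA9
mojibake = utf8_bytes.decode("latin-1")  # each byte is read as its own character

print(utf8_bytes.hex(" "))
print(mojibake)
```

Because the plain ASCII letters around the accent survive this round trip untouched, the damage is limited to the accented characters, which is why it can go unnoticed.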
Posted Oct 28, 2023 17:26 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
Internationalisation and UTF-8 cause massively nasty hacks to get round this problem ...
Cheers,
Posted Oct 28, 2023 18:16 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
Like, really? I speak several languages with non-Latin scripts, and UTF-8 is the best invention since sliced bread. At least I can edit text in BOTH of my native languages at the same time.
> UTF-8 is known for being extremely inefficient for some languages like Chinese.
I happen to speak Chinese and I know a bit about its early computer history. The first "encoding" of Chinese used _five_ bytes for each character. UTF-8 uses 3 bytes for most Chinese characters, and UCS-2 uses 2 bytes. So "extremely inefficient" is completely misleading.
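The byte counts quoted here can be checked directly. A small sketch; the sample character is an arbitrary choice, and since Python has no UCS-2 codec, UTF-16-LE stands in for it (the two are the same size for characters in the Basic Multilingual Plane):

```python
# Compare encoded sizes of one common Chinese character (U+4E2D).
ch = "中"
utf8_len = len(ch.encode("utf-8"))
utf16_len = len(ch.encode("utf-16-le"))  # no BOM; stand-in for UCS-2

print(utf8_len, utf16_len)
```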
> This "mostly compatible" aspect is a disaster because a same document tends to contain different encodings at different places when edited with multiple persons who didn't notice the problem.
Perhaps you should take your part and stop using weird encodings?
I honestly have not seen any problems with incorrect text rendering related to UTF-8 within the last 10 years. And I use non-Latin scripts constantly.
Posted Oct 28, 2023 18:26 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (4 responses)
That's probably the problem, actually. Any mistakes in non-Latin scripts are obvious, and probably screw up the sentence. wtarreau's problems are with the occasional Latin screwup where it's 99% correct.
And, speaking from experience, what you really don't want is when the screwups are rare. Because they're rare, you notice them more, because you're conditioned to expect everything to be correct. For Cyrillic and Chinese scripts, they were probably fixed properly. For Latin, they were almost certainly "almost" fixed, for an infuriatingly near-perfect definition of "fixed".
Cheers,
Posted Oct 28, 2023 19:16 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link] (3 responses)
(Also some philosophical objections to its design.)
Posted Oct 28, 2023 21:14 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Oct 29, 2023 8:47 UTC (Sun)
by joib (subscriber, #8541)
[Link] (1 responses)
Funny UTF-8 related war story. A long time ago at a previous job, we migrated an NFS service from Linux servers to NetApps. After a while a user complained that some files had disappeared. It turns out that the Linux NFSv4 code treats filenames as a bag of bytes, but NetApp follows the RFC, which says that filenames must be valid UTF-8. So the problem was that the filenames in question were ISO 8859 encoded. Mounting with NFSv3 and renaming the affected files fixed it (IIRC there's a tool called convmv that does this).
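A check along these lines could flag such names before a migration. This is only a sketch: `is_valid_utf8` and the sample file names are hypothetical, and it does not reproduce what convmv or either NFS implementation does.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw file name decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same visible name, stored under two different encodings.
latin1_name = "café.txt".encode("latin-1")  # b'caf\xe9.txt'
utf8_name = "café.txt".encode("utf-8")      # b'caf\xc3\xa9.txt'

print(is_valid_utf8(latin1_name), is_valid_utf8(utf8_name))
```

To scan a real directory, `os.listdir(b".")` returns the names as raw bytes, which is the form this check needs.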
Posted Oct 29, 2023 10:37 UTC (Sun)
by Wol (subscriber, #4433)
[Link]
"If it's working, DON'T TOUCH IT".
Cheers,
Posted Oct 28, 2023 19:35 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link]
If you want computer programmers with a non-IBM background to accept an encoding, having "transcode from ASCII" be something other than a no-op was always likely to result in referral to the reply in the case of Arkell v. Pressdram.
Posted Oct 29, 2023 9:09 UTC (Sun)
by farnz (subscriber, #17727)
[Link] (1 responses)
It's not just UTF-8 that's got 7-bit ASCII as a subset; Shift-JIS, KOI8-R, KOI8-U, all the ISO 8859 variants, EUC-CN, all the ISO-2022 variants, GBK, GB 18030, Big5, CNS 11643 and KS X 1001 all use ASCII as a subset. The only commonly used exceptions to the rule that 7-bit ASCII is represented by itself use a minimum of 2 bytes for all characters (UTF-16, UTF-32, TRON), or are themselves 7-bit or shorter.
And the 2 byte and longer encodings also represent ASCII as itself, but need a lead byte to indicate that the next character is ASCII. It's thus trivial to convert 7-bit ASCII to any commonly used encoding, and had UTF-8 bucked this trend, it'd be impossible to get traction - why use UTF-8 and have to do a complicated transcode from ASCII when you can stick to ISO 2022?
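The "ASCII is represented by itself" property can be spot-checked with Python's bundled codecs. A sketch only: the codec names are Python's spellings for a subset of the encodings listed above, and the sample string is arbitrary.

```python
# Python codec names for some of the ASCII-compatible encodings named above.
ASCII_SUPERSETS = [
    "utf-8", "shift_jis", "koi8_r", "koi8_u", "iso8859_1",
    "gbk", "gb18030", "big5", "euc_kr",
]

sample = "man --help"
ascii_bytes = sample.encode("ascii")

# For each codec, check that pure-ASCII text encodes to the same bytes.
same = {enc: sample.encode(enc) == ascii_bytes for enc in ASCII_SUPERSETS}

# UTF-16 is one of the exceptions: every character takes at least 2 bytes.
utf16_differs = sample.encode("utf-16-le") != ascii_bytes

print(same)
print(utf16_differs)
```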
Posted Oct 29, 2023 12:43 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link]
These are even more interesting: they place graphically or phonetically similar characters ("A" and "а", "F" and "ф", etc.) into the same positions modulo 128. So if the 8th bit is lost, the text can still be somewhat readable. It's a clever hack, but I'm glad that it's no longer needed.
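The modulo-128 trick can be demonstrated with Python's `koi8_r` codec. A minimal sketch, where `drop_high_bit` is a hypothetical helper that simulates a channel that loses the 8th bit:

```python
def drop_high_bit(text: str) -> str:
    """Encode as KOI8-R, clear bit 7 of every byte, and read the result as ASCII."""
    raw = text.encode("koi8_r")
    return bytes(b & 0x7F for b in raw).decode("ascii")


# The two pairs mentioned in the comment: Cyrillic "а" and "ф".
print(drop_high_bit("а"), drop_high_bit("ф"))
```

Note that the letter case flips in the process (lowercase Cyrillic lands on uppercase Latin), which makes stripped text easy to recognize.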
Posted Oct 26, 2023 6:52 UTC (Thu)
by AdamW (subscriber, #48457)
[Link]
Posted Oct 29, 2023 0:53 UTC (Sun)
by acolin (guest, #61859)
[Link]
P.S. Treating similarly-typeset characters as the same in search and in paste (upon user request) seems to help defuse this curious situation.
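One way such folding could look, as a sketch rather than a description of any existing pager or terminal: `fold_dashes` is a hypothetical helper, and the table covers only the four look-alikes listed in the article, not the full Unicode set.

```python
# Map the dash look-alikes from the article's table to ASCII Hyphen-Minus.
DASH_LOOKALIKES = {
    "\u2010": "-",  # Hyphen
    "\u2013": "-",  # En Dash
    "\u2014": "-",  # Em Dash
    "\u2212": "-",  # Minus Sign
}
_TABLE = str.maketrans(DASH_LOOKALIKES)


def fold_dashes(text: str) -> str:
    """Replace dash look-alikes so that search and paste see plain '-'."""
    return text.translate(_TABLE)


rendered = "use \u2010\u2010frobnicate to frobnicate"
print("--frobnicate" in rendered)               # search on the raw text
print("--frobnicate" in fold_dashes(rendered))  # search after folding
```

Folding should stay opt-in, as the comment suggests, because in running prose the distinction between these characters is sometimes intentional.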
Posted Oct 29, 2023 5:20 UTC (Sun)
by da4089 (subscriber, #1195)
[Link] (2 responses)
Posted Oct 29, 2023 13:21 UTC (Sun)
by edgewood (subscriber, #1123)
[Link]
Posted Nov 4, 2023 19:45 UTC (Sat)
by rra (subscriber, #99804)
[Link]
If someone did manage to write a really good one, we could introduce it as a QA step and indeed it probably wouldn't be that hard to fix man pages over time. In my experience, upstream often doesn't really care, but will merge a PR since why not. But the one we had definitely did not work (I can think of several obvious problems with it just off the top of my head), and writing a better one is challenging.
Someone elsewhere in this discussion suggested using ChatGPT, an option that I find hilarious given ChatGPT's well-known devotion to accuracy and specific detail.
Posted Nov 1, 2023 11:57 UTC (Wed)
by qwertyface (subscriber, #84167)
[Link] (2 responses)
Interestingly, PowerShell treats ‘ or ’ as ', and “ or ” as ", so copy-paste out of auto-converted documents works fine. I guess it probably does the equivalent with the various dashes. I'm not aware of any other language or shell that does the same. One Microsoft feature we should adopt?
Posted Dec 2, 2023 11:56 UTC (Sat)
by ssokolow (guest, #94568)
[Link] (1 responses)
Posted Dec 2, 2023 20:13 UTC (Sat)
by ssokolow (guest, #94568)
[Link]
Posted Feb 27, 2024 14:30 UTC (Tue)
by lmb (subscriber, #39048)
[Link]
As a user, I'm ... not entirely in love with this change. Yes, it is technically correct, but it has *horrible* UX and breaks things at times when you've already had to resort to reading documentation, which are not normally times when you want something else to be befuddling. My primary contact with roff since the 90s (and, I suspect, that of 99% of all users) has been man pages, and for that use case, this change is questionable.
Yes, in theory, all broken man pages out there should be fixed, that's where the origin of the brokenness is, and I appreciate the thoughtful discussion and decision and commitment to proper layout and typesetting.
In practice, I roll my eyes at technical correctness and alias man to `man -E ascii`. Sorry.
Hyphens, minus, and dashes in Debian man pages
> they do the job promptly, take their time to get every detail right, and can be
> expected to use the right kind of dash in every situation, even though the
> output from using the wrong kind looks exactly the same. They will surely
> not be bothered by the fact that a format designed to document
> command-line options contains a trap whereby the failure to add backslashes
> silently introduces problems for users who are distant in time and space.
Monospace font → Hyphen-Minus
Proportional font → Hyphen
Wol
Apart from the £ sign (which is almost certainly Unicode), I have no idea how to access any other Unicode.
Wol
Wol
Wol
Wol
Wol
Wol
… the computer will do what I described unless I specifically tell the computer to use a non-default keymap
Right. That's what I was getting at in my first post when I described the things that the AltGr key does in the default keymap that I get when I install a Linux distribution in British English.
We just need keyboards with LED display keycaps, so the software can ensure that they do display the symbol which will result from pressing them. In real time, as modifiers and combining characters change...
Nah, it's okay, people can keep blaming me. I'm used to it.
In my opinion, man page source documents are not the correct place to discard that information.
Wol
https://www.unicode.org/versions/Unicode1.0.0/CodeCharts2...
https://www.man7.org/linux/man-pages/man7/groff.7.html
Perhaps it was a mistake to allow hyphen-minus to map to hyphen-minus...
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 fi fl ffi ffl
! $ % & ( ) ‘ ’ * + - . , / : ; = ? [ ] │
• □ — ‐ _ ¼ ½ ¾ ° † ′ ¢ ® ©
α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ σ ς τ υ ϕ χ ψ ω
Γ Δ Θ Λ Ξ Π Σ Υ Φ Ψ Ω
" ´ \ ^ _ ` ~ / < > { } # @ + − = ∗
≥ ≤ ≡ ≈ ∼ ≠ ↑ ↓ ← → × ÷ ± ∞ ∂ ∇ ¬ ∫ ∝ √ ‾ ∪ ∩ ⊂ ⊃ ⊆ ⊇ ∅ ∈
§ ‡ ☜ ☞ | ○ ⎧ ⎩ ⎫ ⎭ ⎨ ⎬ ⎪ ⌊ ⌋ ⌈ ⌉
2. mandoc: https://mandoc.bsd.lv/
3. Plan 9 from User Space troff: https://github.com/9fans/plan9port
4. neatroff: http://litcave.rudi.ir/neatroff.pdf
usr/share/doc/groff-base/meintro_fr.ps.gz
usr/share/doc/groff-base/meref.ps.gz
usr/share/doc/groff-base/ms.ps.gz
usr/share/doc/groff-base/pdf/automake.pdf.gz
usr/share/doc/groff-base/pdf/msboxes.pdf.gz
usr/share/doc/groff-base/pdf/pdfmark.pdf.gz
usr/share/doc/groff-base/pic.ps.gz
troffcvt: /usr/share/troffcvt/tc.mm
troffcvt: /usr/share/troffcvt/tc.ms
ksh: /usr/share/doc/ksh/PROMO.mm.gz
ksh: /usr/share/doc/ksh/builtins.mm.gz
ksh: /usr/share/doc/ksh/sh.memo.gz
xterm: /usr/share/doc/xterm/ctlseqs.ms.gz
cvs: /usr/share/doc/cvs/cvs-paper.ms.gz
No, groff's behavior is consistent with every other implementation of troff in the world, including the original implementation dating back to about 1973, appearing in Fourth Edition Unix from Bell Labs.
At that time troff output appeared on paper. Nobody cut and pasted from there to terminal input, and nobody computer-searched it for, say, `--some-option`.
My point is that while in typeset output there may have been hyphens in the output in 1973, for output to an xterm or the like a `-` in input resulted in the same ASCII character in the output until people switched to Unicode locales (maybe a decade ago). And from what I read in the article and comments, with groff this did not change even at that time, because it still produced the ASCII character. Only groff-1.23.0 changed that. So the claim that this change is in line with 1973 behaviour is wrong as far as on-screen usage by most users is concerned.
Wol
I love this
I couldn't find "~". I eventually found the documentation I wanted which suggested I use "!˜".
But when I try that, I'm told
awk: cmd. line:1: ^ invalid char '�' in expression
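One plausible reading of that error, assuming the pasted character was U+02DC (SMALL TILDE) and that awk scanned its input one byte at a time: the first byte of the two-byte UTF-8 sequence is not a valid character on its own, so the terminal shows a replacement character. A sketch of that interpretation in Python:

```python
small_tilde = "\u02dc"  # "˜" SMALL TILDE, which is not ASCII "~" (U+007E)
encoded = small_tilde.encode("utf-8")

# Decoding only the first byte cannot succeed, so a replacement
# character (U+FFFD) is substituted, as in the awk message.
first_byte_only = encoded[:1].decode("utf-8", errors="replace")

print(encoded.hex(" "))
print(first_byte_only)
```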
* Files that are some Windows code page, often but not always 1252, no BOM included because it's not Unicode and the BOM doesn't exist.
* Applications that can choose at compile time whether they want to do this fancy-pants Unicode (UTF-16) thing, or use one of those weird code pages instead. This may sound trite, but it means that (among other issues) the entire filesystem needs full backcompat with non-Unicode-aware apps (solved by pulling out the old FOO~1 trick).
* Applications that set their code page to UTF-8, and then proceed to use the "non-Unicode" legacy API with a Unicode encoding. Microsoft even recommends doing this.[1]
* A file format (plain text) that has no encoding information (indeed, no out-of-band metadata whatsoever).
* An outside world that is (at least for the most part) completely hostile to non-Unicode encodings, and increasingly unwilling to accommodate UTF-16.
Wol
Wol
Wol
https://bugzilla.redhat.com/show_bug.cgi?id=2224123
though in my testing, at least, it wasn't easily associated directly with a groff update...
I don't think it's the technical work that's the problem, but the social/political work of getting upstream to accept the patches
You'd immediately open up a ton of shell injection exploits, since the assumption of which functions have special meaning is baked into a million different functions like Python's `shlex.quote`.
PowerShell can get away with supporting that because it's a new shell with a new syntax.
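The concern can be seen in how `shlex.quote` treats the two kinds of apostrophe. A small sketch using the standard library only; the input strings are arbitrary examples:

```python
import shlex

# ASCII apostrophe: recognized as special, so it is escaped explicitly.
ascii_quote = shlex.quote("it's")

# Right single quotation mark (U+2019): the string is wrapped in single
# quotes, but the curly quote itself is passed through unescaped.
curly_quote = shlex.quote("it\u2019s")

print(ascii_quote)
print(curly_quote)
```

That output is safe for a POSIX shell, where U+2019 is an ordinary character. A shell that started treating it as a closing quote would end the quoted string early, which is the injection risk described above.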
Ugh. Which characters have special meaning. Don't post while sleep deprived, kids!