Hyphens, minus, and dashes in Debian man pages
Posted Oct 23, 2023 19:57 UTC (Mon) by branden (guest, #7029)
In reply to: Hyphens, minus, and dashes in Debian man pages by rra
Parent article: Hyphens, minus, and dashes in Debian man pages
It's a factor. There are historical roff documents that I'd like to keep working nicely, as well as I can.
For example: https://github.com/g-branden-robinson/retypesetting-mathe...
That said, I do not consider myself beholden to bug-for-bug compatibility with AT&T troff (James Clark didn't, though he accommodated several), or to making the same decisions about issues in areas not even specified by CSTR #54, the "Troff User's Manual" (originally written by Ossanna in 1976, revised in 1992 by Kernighan). One relatively vocal subscriber to the groff mailing list sees me more as a heedless radical with meager respect for the wisdom of my superior ancestors. There is a certain exhilaration in juxtaposing that critique with yours.
> Yes, there are a lot of things to learn when writing man pages, but bugs that cannot be caught by automated tools and don't produce visibly different output are extremely hard to eliminate.
Guessing which glyph to use as seen in the examples on the debian-devel list, and here, appears to be an AI-hard problem.
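To make the difficulty concrete, here is a minimal Python sketch of the kind of guess an automated tool might make. The heuristic (a word that starts with one or two hyphen-minus characters followed by a letter or digit is probably a command-line option) and the name `looks_like_option` are assumptions for illustration, not anything groff or a lint tool is known to implement.

```python
import re

# Assumed heuristic: one or two leading hyphen-minus characters followed by
# a letter or digit make a word look like a command-line option.
OPTION = re.compile(r"-{1,2}[A-Za-z0-9]")

def looks_like_option(word):
    """Return True if the word begins like a short or long option."""
    return bool(OPTION.match(word))

for word in ["-o", "--verbose", "apt-get", "long-standing"]:
    print(word, looks_like_option(word))
```

The sketch recognizes "-o" and "--verbose", but it treats "apt-get" and "long-standing" identically, even though the first is a command name that must keep its hyphen-minus and the second is an ordinary English compound. Telling those two apart takes knowledge of what the text means, which is where pattern matching stops helping.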
> This is effectively a foot-gun in the roff language that authors will continue to get wrong because getting it wrong produces no visible effect.
I _do_ have some advice on this front: use a good font, one where glyphs for different code points look different.
> the only distinction between hyphen and hyphen-minus is (maybe) whether roff does line breaking at that point
I expect what will happen with pod2man specifically is that you'll use \- everywhere, people will notice that breaks stop happening in as many places, resulting in wide adjustments, then they will...
> and with the (IMO correct) increasing trend of disabling full justification in man pages, this is a very minor benefit.
...join team ragged-right margin. Well, fear not, I've actually made it easier for you to get what you want. https://git.savannah.gnu.org/cgit/groff.git/tree/NEWS?h=1...
People who have been turning automatic hyphenation off in the first place may also welcome that.
> but hyphen breaks searching and cut and paste.
I spend a lot of time reading man pages. I find myself not struggling over this issue. Maybe I'm weird.
> The positive benefits are mostly for troff output for printed material, and for man pages this is not a nonexistent use case but it is very rare.
Linux man-pages maintainer Alejandro Colomar and I are doing what we can to encourage people to rediscover typeset manuals. I've linked to the collected groff-man-pages PDF elsewhere in this discussion. Deri James is doing an invaluable service by helping us get man page cross references wired up to PDF hyperlinks. (Of course, you actually have to tell man(7) that something is a man page cross reference first, and thereby hangs a tale. https://git.savannah.gnu.org/cgit/groff.git/tree/NEWS?h=1... )
> The world has changed since roff was designed.
Certainly. There'd be no upset users and no LWN article to reply to if we weren't enjoying the blessing of Unicode support in our terminal emulators. So in a sense this all can be laid at Markus Kuhn's doorstep.
Nah, it's okay, people can keep blaming me. I'm used to it.
> This is not going to be persuasive if you see your role as preservation of the original roff intent, so to some extent this is a conflict of uses.
Not wholly. I also have a handful of new macros I want to introduce to the man(7) macro language. For groff 1.23.0, I settled on one, already linked to above.
The NEWS file entry for 1.23.0 is lengthy; I encourage anyone with any interest in groff to review it.
> but most people writing man pages are just trying to present documentation to the user and don't care about the roff typesetting system as such
People writing C programs are just trying to solve a problem and don't care about the programming language that much. (Okay, C has plenty of people who love it madly for its own sake. Consider substituting "JavaScript".)
If someone is going to pick up a tool to do a job, they're going to have to develop a competence with that tool. If plain text is all a person can manage, that is what they should write their documentation in.
(At this point in the discussion, champions of ReStructured Text and/or one of several not-quite-compatible dialects of Markdown typically appear on the scene, each claiming that their markup language is the _one_, _obvious_ way to write "plain text" such that it is suitable for conversion to richer formatting languages.)
I'm trying to reach and assist people who care about writing man(7) (and mdoc(7) for that matter) competently. If they don't care about that, I'm wasting my time, and they shouldn't waste theirs telling the world how man(7) _should_ be done.
> If people want hyphens, or matched single quotes, there is now a fairly good argument they should just type the thing they intend using Unicode. I think if roff were invented today, \- would not exist and - would mean \- because roff would just use Unicode input and respect the characters the user entered.
I don't agree, because you're forgetting about the minus sign. Unicode has a hyphen (U+2010) and a minus sign (U+2212), and "obviously", a person should input those code points for their distinct purposes.
This works great until someone needs to input a "literal" for an overloaded code point in the Basic Latin code chart that has syntactical significance to something like a shell prompt or a language compiler. Then they need that hen's tooth U+002D code point, even though it is meaningful _only_ for talking to computers, and not for any other domain of discourse. And that's not even taking into account the folks who ride in and want distinguishable en dashes, em dashes, figure dashes, and others the LWN article didn't mention. Fitting distinguishable glyphs for these into a half-width character cell even with a fair number of pixels in the horizontal dimension (say, more than 8) starts to become a real pickle.
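For reference, the code points in question can be listed with Python's standard `unicodedata` module. This is only a sketch; the selection in `DASH_LIKE` and the helper name `describe` are choices made here for illustration.

```python
import unicodedata

# Hyphen-minus, hyphen, figure dash, en dash, em dash, and minus sign.
DASH_LIKE = ["\u002d", "\u2010", "\u2012", "\u2013", "\u2014", "\u2212"]

def describe(chars):
    """Return (code point label, Unicode character name) pairs."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]

for label, name in describe(DASH_LIKE):
    print(label, name)
```

Six distinct characters come back with six distinct names, which is exactly the set a monospaced terminal font is asked to render distinguishably in one character cell each.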
So, no, I don't think "just use Unicode" is going to solve all the problems here at a stroke. It can help, but eventually people are going to need roff special characters or something like them so that they can tell unlike things apart with confidence. Even if they use a good font.
When you consider the problem space seriously, it turns out the WYSIWYG advocates are on team DWIM. And we know how well that turns out.
None of this is to tell you what you should do with podlators; the plan you've articulated seems fine to me. If, by some miracle, Alex Colomar and I convince more than a handful of people that the PDF man page experience is actually kind of nice, and some of those folks then turn to Perl docs and wonder what its story is, I'll be happy to help you come up with a new coat of paint to go over that layer of primer you just stripped it down to.
Posted Oct 23, 2023 20:16 UTC (Mon)
by rra (subscriber, #99804)
[Link] (7 responses)
But this is exactly my point. If you need that code point, you enter that code point, and it should be typeset as that code point, without any second-guessing on the part of the typesetting system.
The second-guessing is there because in a pure ASCII world there was no alternative, because there were not ASCII code points for the different meanings. You therefore had to pick one of them to be the default and represent the other ones with escapes, and roff decided, quite reasonably for typesetting and less reasonably for man pages, that a hyphen was the most common usage and should be the default and the other usage should use an escape. But the point of Unicode is that you no longer have the context collapse on input, because there are code points available to express your exact intent. Essentially, the use of the precise Unicode code point replaces roff escapes (down to being slightly more annoying to type).
We have been down this path already with quotes. In the pure ASCII world, we invented various conventions like `single quotes' or ``double quotes'' used in different typesetting systems, but now you can use correct Unicode quotes if you care about this distinction, and many editors will assist you in doing so to avoid the entry problem. I am dubious any newly-invented typesetting system today would try to overload ASCII quotes in the way that TeX did; instead, it would handle Unicode quotes correctly.
None of these solutions are ideal because the keyboard is not large enough and doesn't allow easily drawing these distinctions. But if you view all the various typesetting escapes as substitutes for not having the correct character on the keyboard, I would argue that using the correct Unicode character is the modern replacement. It works uniformly across multiple typesetting systems, so you don't have to relearn how to use it for each piece of software, and you are far more likely to have active editor assistance in making the entry easier.
In other words, no, I have not forgotten about minus. If you want a minus sign in typeset material, you should enter an actual minus sign, which is U+2212 and will pair correctly in the font with a plus sign. This is not a hyphen, is not an en-dash, is not an em-dash, and is not a hyphen-minus. These are all distinct characters used for different purposes in high-quality typeset material. If you are talking about programming languages, you may not want an actual minus sign because programming languages do not use actual minus signs. You may want a hyphen-minus, which should be typeset as such. (Unfortunately, this does create the problem that Unicode has only one plus sign, so you have to choose between fidelity and correct glyph matching between plus and minus signs if you are aiming for printed output. This choice is somewhat context-dependent, and one option would be to make those characters match in fixed-width fonts.)
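The idea that a precise Unicode code point stands in for a roff escape can be sketched as a one-to-one table. This is a hypothetical Python illustration, not what pod2man or groff does: the names `ROFF_ESCAPES` and `to_roff` are made up here, the special character names are the ones I understand groff_char(7) to document, and the use of `\-` for hyphen-minus follows the man page convention discussed in this thread.

```python
# Hypothetical table: each Unicode character the author typed maps to
# exactly one roff escape, so the converter never has to guess intent.
ROFF_ESCAPES = {
    "\u002d": "\\-",     # hyphen-minus, as used in code and option names
    "\u2010": "\\[hy]",  # hyphen
    "\u2212": "\\[mi]",  # minus sign
    "\u2013": "\\[en]",  # en dash
    "\u2014": "\\[em]",  # em dash
}

def to_roff(text):
    """Replace each listed character with its escape; leave the rest alone.

    Backslashes already present in the input are not handled in this sketch.
    """
    return "".join(ROFF_ESCAPES.get(ch, ch) for ch in text)

print(to_roff("tar -xf"))       # tar \-xf
print(to_roff("x \u2212 y"))    # x \[mi] y
```

Because the table is keyed on what was entered, the only decision left is the author's choice of character on input; the conversion itself is purely mechanical.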
Posted Oct 24, 2023 3:17 UTC (Tue)
by branden (guest, #7029)
[Link] (6 responses)
Okay. I find little to argue with in this presentation. I think if one were to undertake a "man: The Next Generation", one would likely proceed exactly as you describe, and let Unicode do the heavy lifting of glyph distinction.
...at least as far as some kind of alpha or trial run. I suspect people would rapidly run into trouble with hyphenated phrases (e.g., "long-standing, Debian-specific patches"). As you say, the keyboard is not large enough. At some point we run into not a technological problem, but a human one; it's hard to make people care about typographical distinctions that they don't want to care about, especially if their horizons stretch no farther than a terminal window. If they think of Unicode as mainly a resource for dingbats and emoji, we're unlikely to make much headway in the matter.
Posted Oct 24, 2023 3:31 UTC (Tue)
by rra (subscriber, #99804)
[Link] (4 responses)
This is the point where my own struggles with problems like this over the last ten years have given me a lot of respect for the amount of thought that's gone into Unicode. They took a careful and pragmatic decision to provide code points that represent the ambiguous merged character, and then separate code points that more precisely indicate intent. This to some extent means that within a Unicode world, both options are possible and the document author gets to choose how much to care.
If you want very nice typesetting, you can use hyphens, minuses, and en-dashes in the ways they were intended to be used. If you want to be lazy and not think about it, you can use a Unicode hyphen-minus and you get a compromise character that looks "okay" and, importantly, is clearly marked as a semantic compromise. Any typesetting system gets the correct information that the user was either talking about code or decided not to care about the distinctions between dashes, and therefore the typesetting system probably shouldn't try to care more than the user did.
This is similar to what they did with apostrophe and single quotes: the preferred characters in Unicode are U+2018 and U+2019, and U+0027 is defined as a neutral character that is intentionally left ambiguous, for users who don't care enough to draw the distinction.
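One way to see the asymmetry between the precise characters and the neutral ones is a small Python sketch that folds the former into the latter. The table `FOLD` and the function name `stop_caring` are hypothetical, and folding an en-dash to a single hyphen-minus is just one possible choice.

```python
# Precise code points on the left, the neutral ASCII compromise on the right.
FOLD = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2010": "-",  # hyphen
    "\u2212": "-",  # minus sign
    "\u2013": "-",  # en dash
})

def stop_caring(text):
    """Collapse precise characters into their ambiguous ASCII counterparts."""
    return text.translate(FOLD)

print(stop_caring("\u2018x \u2212 y\u2019"))
```

Going from precise to neutral is a trivial table lookup. Going the other way is not, because the information about intent is gone, which is why a typesetting system that receives U+002D or U+0027 has nothing reliable to second-guess with.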
You can't force users to care. The best you can do is provide them with the tools and make it clear whether they chose to use them or not. (And indeed, despite knowing all of this, I always use neutral single and double quotes and a hyphen-minus, because I don't care enough. Although I have started using real em-dashes, and I will occasionally use a real en-dash, so maybe eventually I'll come around.)
I'm simplifying a bit, and the Unicode world is not quite as shiny as all of that. Typesetting and human languages are messy and there are still sharp edges and ambiguities. But it's a system that a whole lot of people put a whole lot of thought into, and the results embed more practical wisdom than I think people realize.
Posted Oct 24, 2023 3:55 UTC (Tue)
by branden (guest, #7029)
[Link] (1 responses)
I concur with this. I don't think _anyone_ involved with groff development views Unicode as anything less than a tremendous boon to the sanity of glyph and character repertoires. (Oh, how I wish James Clark had decided to store groff characters internally as ints instead of C++ chars. But we'll get that refactored, knock wood.)
I have seen _one_ person grouse that apostrophes (however rendered) and right single quotation marks should be kept logically separate, and I have some sympathy for that view, because they _are_ logically separate--but it seems no English typesetting tradition ever sees fit to distinguish them in print. If I were to regard occasional man page authors as intransigent with respect to correct glyph choices, I dread to measure the inertia of commercial publishers.
Posted Oct 25, 2023 14:02 UTC (Wed)
by smoogen (subscriber, #97)
[Link]
Again thank you for teaching and making this conversation something enjoyable to read.
Posted Oct 24, 2023 7:26 UTC (Tue)
by smurf (subscriber, #17840)
[Link]
Depends on your locale; don't forget about U+201A. And then there's places where they use U+2039/U+203A … and other places where they use U+203A/U+2039. See https://en.wikipedia.org/wiki/Quotation_mark for even more enlightening examples.
Posted Oct 24, 2023 16:24 UTC (Tue)
by gray_-_wolf (subscriber, #131074)
[Link]
Maybe in some areas. The whole Han unification thing is in my opinion still a mistake. Having to know what language the text is in in order to be able to render it correctly is... annoying.
Posted Oct 25, 2023 18:13 UTC (Wed)
by jwarnica (subscriber, #27492)
[Link]
Authors+authoring tools, who care, can be careful, once, and the 78 downstream tools never are allowed to second guess things. Authors+authoring tools who don't care... Well, then the 78 downstream tools at least do what they are directly told without any hackery, and the cause of the errors (if any) becomes clear: the human and/or the single tool they interact with.
Posted Oct 24, 2023 0:43 UTC (Tue)
by jkingweb (subscriber, #113039)
[Link] (2 responses)
I am such a person (especially after reading this article), though I'm a complete neophyte. Thus far I've been writing Markdown and converting it using Pandoc, mainly because I had no idea where to begin to learn how to do things properly. And... I still don't. Should I start by reading man(7), or mdoc(7), or something else altogether? There seem to be many schools of thought (as is so common in the free software world), but I'd love an authoritative hand to point me in *one* direction, whatever it is.
Posted Oct 24, 2023 3:07 UTC (Tue)
by branden (guest, #7029)
[Link] (1 responses)
My recommendation is the groff_man_style(7) page in the groff 1.23.0 release. It attempts to bring the reader from a state of no knowledge about man(7) or roff to a point where they can write a man page. It's not quite a tutorial--it doesn't start with a skeleton page that you fill in, but it covers the basics first and then discusses each group of man(7) macros in approximately the order you're likely to need to use them. So it starts with `TH` and `SH` and their relatives, then covers paragraphing macros, then synopsis macros, then hyperlink macros, and finally font styling macros.
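For orientation only, here is a hedged Python sketch that emits the smallest page using the first macros in that order, `TH` and `SH`, plus `PP` for a paragraph break and `B` for bold. It is not taken from groff_man_style(7); the function `man_skeleton`, its parameters, and the placeholder text are all hypothetical, and a real page will want more than this.

```python
def man_skeleton(name, section, summary):
    """Build a minimal man(7) source string from placeholder values."""
    # Write hyphen-minus in the command name as \- so it stays literal.
    escaped = name.replace("-", "\\-")
    lines = [
        f".TH {escaped} {section}",       # title heading: name and section
        ".SH Name",
        f"{escaped} \\- {summary}",       # conventional "name \- summary" line
        ".SH Synopsis",
        f".B {escaped}",                  # command name in bold
        ".SH Description",
        "First paragraph of the description.",
        ".PP",                            # start a new paragraph
        "Second paragraph of the description.",
    ]
    return "\n".join(lines) + "\n"

print(man_skeleton("apt-get", 8, "hypothetical one-line summary"), end="")
```

Piping the output through something like `groff -man -Tutf8` is one way to check how it renders; the synopsis and hyperlink macros covered later in the page are deliberately left out of this sketch.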
You can start on page 253 of the collected groff man pages PDF. https://www.gnu.org/software/groff/manual/groff-man-pages...
> There seem to be many schools of thought (as is so common in the free software world), but I'd love an authoritative hand to point me in *one* direction, whatever it is.
In the course of the past several years I've learned a great deal about the history of *roff and man pages, and I've attempted to reflect that learning in the content of groff's own man pages.
But even authoritative voices are not infallible, so if you find errors, I'd appreciate hearing about them. (I find that the adjective sits on me uncomfortably, in any case.)
Posted Oct 30, 2023 22:33 UTC (Mon)
by jkingweb (subscriber, #113039)
[Link]
