
A report from the documentation maintainer

By Jonathan Corbet
October 26, 2016
It is now nearly exactly two years since my ill-advised decision to accept the role of the maintainer of the kernel's documentation collection. After a bit of a slow start, things have started to happen in the documentation area. As part of the preparation exercise for an upcoming Kernel Summit session on documentation, here is a report on where things stand and where they are going.

The biggest overall change, of course, is the transition away from a homebrew DocBook-based toolchain to a formatted documentation setup based on the Sphinx system, as was described in this article last July. The transition made some waves when it hit; in the 4.8-rc1 announcement, Linus noted that a full 20% of the patch set was documentation updates. It is fair to say that kernel developers do not ordinarily put that much effort into documentation. Much of the credit for this work goes to Daniel Vetter and Mauro Carvalho Chehab, who worked hard to transition the GPU and media subsystem documentation, respectively, along with Jani Nikula and Markus Heiser, who made the Sphinx-based plumbing work.

Perhaps unsurprisingly, there have been places where Sphinx has not worked out quite as well as desired. Perhaps the biggest initial disappointment was PDF output. The original plan was to use rst2pdf, a relatively simple tool that offered the possibility of creating PDF files without a heavy toolchain. It does indeed create pretty output for simple input files, but it falls over completely with more complex documents; after a while, it became clear that it was not going to meet the kernel community's needs.

That means falling back to LaTeX in 4.9; LaTeX works, but is not without its drawbacks. LaTeX is not a small system; the basic install on my openSUSE Tumbleweed system was over 1,700 packages. The base Fedora installation is much smaller, but that is not necessarily better. There, getting the documentation built requires executing a seemingly endless loop of "which .sty file is missing now, and which package provides it?" work. Part of the idea behind switching to Sphinx was to make setting up the toolchain easier; that goal has still been met for those who are happy with HTML or EPUB output, but remains elusive for PDF output.
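The loop in question typically starts from a LaTeX error message; a small sketch of the dance on Fedora (the missing style file here, fncychap.sty, is just an example):

```shell
# Pull the missing style file's name out of a (simulated) LaTeX error,
# then build the dnf query that maps it to a Fedora package.
err="! LaTeX Error: File \`fncychap.sty' not found."
sty=$(printf '%s\n' "$err" | sed -n "s/.*File \`\(.*\.sty\)' not found.*/\1/p")
echo "$sty"                       # fncychap.sty
echo "dnf provides '*/$sty'"      # run this, then install the package it names
```

Repeat until the build gets through; that is the "seemingly endless loop" in practice.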

After 4.8

The 4.7 kernel contains 34 "template" files that are processed by the DocBook-based toolchain; that number is down to 30 in the 4.9-rc kernels. The conversion of the remaining template files continues; eventually they will all be done and the DocBook dependency can be removed. The conversion is generally easy to do (there is a script included in the kernel source that helps), but making it all look nice can take a little longer. And updating some of the kernel's ancient documentation to match current reality may take longer yet.

A few dozen template files are one thing, but what about the various plain-text files scattered around the documentation directory? There are over 2,000 of these (not counting the device-tree files), some rather more helpful than others. Very little organizational thought has been applied to this directory. As former documentation maintainer Rob Landley put it in 2007, "Documentation/* is a gigantic mess, currently organized based on where random passers-by put things down last". It has improved little since then.

Now we are trying to improve it by applying some structure to the directory and by bringing the plain-text files into the growing body of Sphinx-based documentation. The latter task is easy — most of the plain-text files are almost in the reStructuredText format used by Sphinx already, so only minor tweaks are required. The organizational task is harder.
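As a toy illustration of how small those tweaks usually are, here is one of the most common fixups: turning a file's first line into a proper reST title (the file name and contents below are invented for the example):

```shell
# Give a plain-text document a reST title underline and a .rst extension.
cd "$(mktemp -d)"
printf 'some-feature\n\nThis file describes some feature.\n' > some-feature.txt
title=$(head -n 1 some-feature.txt)
{
    echo "$title"
    printf '=%.0s' $(seq "${#title}"); echo    # underline as long as the title
    tail -n +2 some-feature.txt
} > some-feature.rst
head -n 2 some-feature.rst
```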

The 4.9 kernel will contain a couple of new sub-manuals in the Sphinx-based documentation. One of them, called dev-tools, is a collection of the plain-text documents about tools that can be used in kernel development. The other, driver-api, gathers information of interest to device-driver developers. Both of these books are works in progress; they exist in their current form mostly to show the way forward.

In 4.10, the chances are good that three more major sub-manuals will put in an appearance. One of them, tentatively called core-api, will be a collecting point for documentation about the core-kernel interfaces. That information is currently widely distributed among plain-text files and kerneldoc comments within the source itself; it will be good to have it together in one place — sometime well in the future, when the process of creating this manual has run its course.

Next, the process book will hold our (fairly extensive) documentation on how to work with the kernel development community. It includes the often-cited SubmittingPatches document (now process/submitting-patches.rst), along with information on coding style, email client configuration, and more. This work (done by Mauro) was ready in time for 4.9, but I put the brakes on it out of fear that moving files like SubmittingPatches would leave a lot of dangling links in the brains of the development community. Various discussions over the past month have failed to turn up even a single developer who was unhappy about it, though, so the current plan is for this work to proceed for 4.10.

The last proposed book recognizes that there are multiple audiences for the kernel's documentation; it will (probably) be called admin-guide and will be aimed at system administrators, users, and others who are trying to figure out how to get the kernel to do what they want. Much of our documentation covers module parameters, tuning knobs, and user-space APIs; collecting and organizing it should make it more accessible for our users.

Open issues

As this work proceeds, a number of issues have come up that are still in need of resolution; many of them come down to a tradeoff between simplicity and functionality. On the simplicity side, it is desirable to keep the documentation toolchain as simple and easy to set up as possible so that anybody can build the docs. On the other hand, making use of more functionality (and thus adding to the toolchain's dependencies) enables the creation of more expressive documentation.

One such issue is the use of the Sphinx math extension, which supports the formatting of mathematical expressions using the LaTeX syntax. As of 4.9, the media documentation is using this extension, but there is a cost: it forces the use of LaTeX even to build the HTML documentation. The hope is to find an easy way to fall back gracefully when LaTeX is unavailable in order to soften this dependency.

A deeper question has to do with the automatic generation of reStructuredText documentation from other files in the kernel tree. That is already done with the in-source kerneldoc comments, of course, but there is interest in pulling in a number of other types of information as well. That extends as far as reformatting the MAINTAINERS file as part of the documentation build process. There are patches circulating to allow, to a varying extent, the running of arbitrary programs during the documentation build to do this generation; these patches run into concerns about security and maintainability. The form of the solution to this problem is not yet entirely clear.

Interestingly, there is significant disagreement over the removal of ancient, obsolete documentation. Do we really need, say, documentation from 1996 describing how to manually bisect bugs in the 1.3.x kernel? Resistance to removing such cruft usually comes in the form of "but it might be useful to somebody someday." But we do not retain unused code on that basis; we recognize that there is a cost to carrying such code in the kernel. There is, likewise, a cost to carrying old, obsolete documentation, paid by both the documentation maintainers and the users the documentation is meant to help. In my opinion, some spring cleaning is in order, even if spring is a distant prospect in the northern hemisphere.

One other possibly contentious change has been suggested by a few people now. Documentation/ is a long name, and is the only top-level directory in the kernel starting with a capital letter. One can joke that this distinction highlights the importance of documentation, but it's also a lot for people to type. So I've been asked a few times if it could be renamed to something like "docs". That, I think, is a question for the Kernel Summit.

Finally, it should be said that much of the above consists of a rearrangement of a bunch of kernel documentation that is of varying quality and is not all current. It makes the documentation prettier and, hopefully, easier to find, but does not yet turn it into a coherent body of accessible and useful material. There is a good case for doing the organizational work first, as long as we don't forget that there is a lot more to be done.

Despite the disagreements over how to proceed in some of these areas, and despite the magnitude of the task, there is a broad consensus that the time has come to improve the kernel's documentation. More effort is going into this part of the kernel than has been seen for some years. With any luck, kernel developers, distributors, and users will all be the beneficiaries of this work. For anybody who is looking for a way to help with kernel development, there are plenty of opportunities in the documentation area; we would be happy to hear from you. The linux-doc list at vger.kernel.org is a relatively calm place to work on documentation without subjecting oneself to the linux-kernel firehose. We look forward to your patches.

Index entries for this article
KernelDocumentation



A report from the documentation maintainer

Posted Oct 27, 2016 5:28 UTC (Thu) by xtifr (guest, #143) [Link] (17 responses)

Wow, ok, not really my place to criticize any kernel devs, but the ones who don't want to "remove" obsolete documentation--do they not understand how git works? I am, frankly, a bit boggled.

I'm less boggled—but still slightly—by the directory-name-is-too-long crowd. Are there really that many people who use neither a point-and-click interface nor filename completion? (And are there really that many people working on the kernel who find typing that difficult?)

Anyway, thanks for the report. Definitely interesting, and very promising, though I'm sorry to hear about the LaTeX-related struggles.

A report from the documentation maintainer

Posted Oct 27, 2016 14:10 UTC (Thu) by niner (guest, #26151) [Link] (9 responses)

There are more than 2,300 references to Documentation/ files in comments in the kernel source. None of them could have used auto-completion or point and click.

Documentation/ references

Posted Oct 27, 2016 14:15 UTC (Thu) by corbet (editor, #1) [Link] (8 responses)

Heh...that's a bit of a disincentive to moving Documentation/, of course...:)

Even more common, though, is references in email. People type those every day, and I've already gotten complaints about making the path to files longer than it is already.

Documentation/ references

Posted Oct 28, 2016 20:30 UTC (Fri) by HIGHGuY (subscriber, #62277) [Link] (7 responses)

git supports symlinks, so it should be easy to add docs -> Documentation, followed by a period of clean-ups in the code, and ultimately moving Documentation over to docs or swapping things around with Documentation -> docs. The latter might accommodate web links pointing to kernel.org git repos and similar things.

Is there some peculiar reason why that wouldn't be allowed? (checkout on FAT or NTFS filesystems perhaps >:-] )
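For the record, the first step is trivial to try in a scratch repository (paths invented); git records the symlink itself as a mode-120000 index entry, so filesystems without symlink support are the only real concern:

```shell
# Add a "docs" symlink alongside Documentation/ in a throwaway repo.
cd "$(mktemp -d)"
git init -q .
git config user.email you@example.com && git config user.name you
mkdir Documentation && touch Documentation/README
ln -s Documentation docs
git add Documentation docs
git ls-files --stage docs    # mode 120000 marks a symlink
```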

Documentation/ references

Posted Oct 29, 2016 0:12 UTC (Sat) by flussence (guest, #85566) [Link]

How do you propose fixing all the other links that point to kernel trees hosted somewhere other than kernel.org? Or do those just stay broken forever?

Documentation/ references

Posted Oct 31, 2016 13:19 UTC (Mon) by cesarb (subscriber, #6266) [Link] (5 responses)

> (checkout on FAT or NTFS filesystems perhaps >:-] )

Checkout on FAT or NTFS is already not possible, since the kernel has files which differ only on case.
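Those collisions are easy to list with a case-folded sort plus GNU uniq's duplicate printing; the colliding names below are invented for a small stand-in tree (in a real kernel tree, run the same pipeline from the top level):

```shell
# List path names that would collide on a case-insensitive filesystem.
cd "$(mktemp -d)"
mkdir net && touch net/xfrm.h net/XFRM.h net/other.h
find . -type f | sort -f | uniq -Di    # -D: print all duplicates, -i: ignore case
```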

Documentation/ references

Posted Oct 31, 2016 15:39 UTC (Mon) by zdzichu (subscriber, #17118) [Link] (4 responses)

NTFS is case-sensitive, so there isn't a problem.

Documentation/ references

Posted Oct 31, 2016 16:21 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Not true. NTFS is case-preserving but not case-sensitive. You _can_ create files on NTFS that differ only in case, but it will cause a lot of problems (like not being able to delete them using regular tools).

Documentation/ references

Posted Nov 7, 2016 4:10 UTC (Mon) by JanC_ (guest, #34940) [Link] (2 responses)

NTFS is case sensitive.

The NT kernel supports case sensitive file systems.

The Windows subsystems & applications using it have problems with it.

Documentation/ references

Posted Nov 7, 2016 4:44 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

NTFS stores case mapping in a metafile called "$UpCase". You are free to ignore it and write files named "Foo" and "FOO" but this will break the Windows API.

Documentation/ references

Posted Nov 7, 2016 12:16 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

It is fun to use a Samba share from Linux, create files Windows has problems with, and explore that way. A file named CON shows up mangled in Explorer as some other string X; create a real file named X and Explorer cannot delete the CON file until the other one is gone (selecting the right file does not help; deletion goes by name). Similar shenanigans happen with mixed case.

A report from the documentation maintainer

Posted Oct 27, 2016 17:54 UTC (Thu) by mchehab (subscriber, #41156) [Link] (1 responses)

> Interestingly, there is significant disagreement over the removal of ancient, obsolete documentation. Do we really need, say, documentation from 1996 (http://static.lwn.net/kerneldoc/admin-guide/bug-hunting.html)

> > Wow, ok, not really my place to criticize any kernel devs, but the ones who don't want to "remove" obsolete documentation--do they not understand how git works?

My point, with regard to bug_hunting.html, is that, except for one really old section ("Finding it the old way"), the file provides tips that are still useful, like using git bisect to identify the source of a bug, and objdump/gdb to identify what part of the source code caused it.

So, IMHO, instead of dropping everything, we should drop just the outdated section and improve the rest of the document to reflect current practices, like what I proposed in this patch:

https://marc.info/?l=linux-doc&m=147752366022236&w=2

A report from the documentation maintainer

Posted Nov 4, 2016 12:54 UTC (Fri) by Wol (subscriber, #4433) [Link]

Or, what I'm trying to do on the linux raid wiki, have a section for obsolete stuff. So I've linked all the out-of-date stuff there, and as I rewrite it, the new stuff replaces the old on the main pages.

So all the old stuff is still there if it's wanted, but it's in a section flagged as "obsolete".

Cheers,
Wol

A report from the documentation maintainer

Posted Oct 27, 2016 20:40 UTC (Thu) by flussence (guest, #85566) [Link] (4 responses)

I find the proposal to rename the directory completely ridiculous. Too long? "D<tab>" is one character shorter than "do<tab>".

And it's *much* faster to type than "d<tab><beat><tab><shell beeps and displays ambiguous results><cursing at whoever proposed this>o<tab>".

I get the feeling the people asking for that change are just pedants that don't actually use the documentation.

A report from the documentation maintainer

Posted Oct 30, 2016 8:25 UTC (Sun) by dirtyepic (guest, #30178) [Link] (3 responses)

Believe it or not, some people have the ability to type into things other than shells.

A report from the documentation maintainer

Posted Nov 2, 2016 12:02 UTC (Wed) by robbe (guest, #16131) [Link] (2 responses)

If your text editor cannot complete words, maybe consider switching?

The example of e-mail was brought up. I guess you are surrounding the reference to a file under Documentation/ with one or more sentences, containing complete words (not abbreviations) for the most part.

A report from the documentation maintainer

Posted Nov 3, 2016 15:22 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

If your text editor can complete words, maybe consider ditching it?

I consider the text completion in Kate (my favourite editor) an absolute PAIN. Likewise date completion in word processors. Etc etc. If I *don't ask for it*, then I *don't want it*. How many times have I had the stuff I'm typing corrupted by autocomplete, autocorrect, etc.? Far too many!!! And they usually use something like "return" to indicate that I want the change, so if I'm gaily typing away and put this thing at the end of the line, I get the auto-change that I didn't ask for because the editor thinks I did!

Getting the UI on this right is *NOT* easy, and then giving it to a touch typist who has been trained *NOT* to look at the screen is going to cause an awful lot of grief!

Cheers,
Wol

A report from the documentation maintainer

Posted Nov 3, 2016 18:18 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

You're talking about autocompletion. Supporting completion behind a keybinding (e.g., <C-n> or <C-p> in Vim) is not the same thing.

Sphinx LaTeX woes

Posted Oct 27, 2016 11:26 UTC (Thu) by jnareb (subscriber, #46500) [Link] (2 responses)

> There, getting the documentation built requires executing a seemingly endless loop of
> "which .sty file is missing now, and which package provides it?" work.

Or you can use `texliveonfly` package / tool that implements on demand installation in TeX Live 2010 and later.

> One such issue is the use of the Sphinx math extension, which supports the formatting of mathematical expressions
> using the LaTeX syntax. As of 4.9, the media documentation is using this extension, but there is a cost: it forces
> the use of LaTeX even to build the HTML documentation. The hope is to find an easy way to fall back gracefully
> when LaTeX is unavailable in order to soften this dependency.

Are not there tools that allow conversion of LaTeX math syntax (e.g. $$z^2 = x^2 + y^2$$) to MathML without
requiring LaTeX? There might be a problem with image fallback, but one could use plain text (LaTeX math syntax
or ASCIImath) instead.

Also there are JavaScript-based solutions like MathJax and/or KaTeX, with math rendered client-side.

Sphinx LaTeX woes

Posted Oct 27, 2016 16:27 UTC (Thu) by mchehab (subscriber, #41156) [Link]

> Or you can use `texliveonfly` package / tool that implements on demand installation in TeX Live 2010 and later.

Sounds promising! Will do some tests here.

> Are not there tools that allow conversion of LaTeX math syntax (e.g. $$z^2 = x^2 + y^2$$) to MathML without requiring LaTeX? There might be a problem with image fallback, but one could use plain text (LaTeX math syntax or ASCIImath) instead.

There is a MathML extension. We haven't tested it yet. Not sure what dependencies it would require, nor whether it will work for all the document outputs (HTML, ePub, and LaTeX/PDF) that the documentation build supports.

> Also there are JavaScript-based solutions like MathJax and/or KaTeX, with math rendered client-side.

The problem is that Sphinx accepts just *one* extension to handle math:: tags. So, it should be either sphinx.ext.imgmath or sphinx.ext.mathjax (http://www.sphinx-doc.org/en/1.4.8/ext/math.html). So, if we move to mathjax, PDF output will likely not work anymore. Maybe with some Python magic we could find a way to make the Sphinx conf.py identify the type of output and switch the extension at runtime. It also seems that the MathJax extension requires an extra package to be installed.

Sphinx LaTeX woes

Posted Oct 27, 2016 21:33 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

There's matplotlib, which can render LaTeX math, is written in Python, and is widely available.

epub -> PDF

Posted Oct 27, 2016 13:11 UTC (Thu) by tsr2 (subscriber, #4293) [Link] (2 responses)

So, if the epub output is satisfactory, would calibre's ebook convert be a relatively easy way to produce satisfactory PDF output from the epub?

epub -> PDF

Posted Oct 27, 2016 18:00 UTC (Thu) by mchehab (subscriber, #41156) [Link] (1 responses)

> So, if the epub output is satisfactory, would calibre's ebook convert be a relatively easy way to produce satisfactory PDF output from the epub?

Just for fun, I tried to convert the epub book with calibre (version 2.60.0). It failed:

Rendering failed
Traceback (most recent call last):
  File "/usr/lib64/calibre/calibre/ebooks/pdf/render/from_html.py", line 279, in render_html
    self.do_paged_render()
  File "/usr/lib64/calibre/calibre/ebooks/pdf/render/from_html.py", line 405, in do_paged_render
    amap['anchors'])
  File "/usr/lib64/calibre/calibre/ebooks/pdf/render/engine.py", line 335, in add_links
    self.pdf.links.add(current_item, start_page, links, anchors)
  File "/usr/lib64/calibre/calibre/ebooks/pdf/render/links.py", line 41, in add
    a[anchor] = Destination(start_page, pos, self.pdf.get_pageref)
  File "/usr/lib64/calibre/calibre/ebooks/pdf/render/links.py", line 22, in __init__
    pref = get_pageref(pnum-1)
  File "/usr/lib64/calibre/calibre/ebooks/pdf/render/serialize.py", line 303, in get_pageref
    return self.page_tree.obj.get_ref(pagenum)
  File "/usr/lib64/calibre/calibre/ebooks/pdf/render/serialize.py", line 187, in get_ref
    return self['Kids'][num-1]
IndexError: list index out of range

epub -> PDF

Posted Nov 4, 2016 13:37 UTC (Fri) by Wol (subscriber, #4433) [Link]

I'm pretty certain I've tried to use calibre on epubs. And given up.

The fewer steps and tools you need, the better. The less room for a fsck-up, the better.

Cheers,
Wol

A report from the documentation maintainer

Posted Oct 27, 2016 13:30 UTC (Thu) by cesarb (subscriber, #6266) [Link] (47 responses)

> Documentation/ is a long name, and is the only top-level directory in the kernel starting with a capital letter. One can joke that this distinction highlights the importance of documentation, but it's also a lot for people to type.

I have a simpler explanation: uppercase sorts first. Therefore, it (together with related files like README) gets "out of the way" of the real source code, instead of nesting somewhere between "arch" and "kernel".

Just take a look at https://git.kernel.org/cgit/linux/kernel/git/torvalds/lin... to see it in action. Also, won't most people type "D<TAB>" instead of the full name? (Conveniently, there's only one directory starting with uppercase D; "d<TAB>" leads to the device drivers.)

A report from the documentation maintainer

Posted Oct 27, 2016 16:33 UTC (Thu) by mstone_ (subscriber, #66309) [Link] (46 responses)

That's a concept that seems to be related to the length and grayness of one's beard. Most people these days use a locale other than C, and those locales typically ignore case when collating. I look forward to the time when I can tell my grandchildren the wonders of sorting by ASCII character code and they'll be impressed in the same way I'm fascinated by the tricks used by pre-industrial craftsmen. My kids just think I'm weird.

A report from the documentation maintainer

Posted Oct 28, 2016 7:37 UTC (Fri) by ianmcc (guest, #88379) [Link] (45 responses)

Yeah, I did a worksheet on makefiles a couple of weeks ago in my computational physics class. I was about to write in the notes "the name 'Makefile' is preferred because it appears at the top of the directory list", when I thought hmm, I should check that. No system that I use has that sorting order any more.

A report from the documentation maintainer

Posted Oct 29, 2016 6:39 UTC (Sat) by lsl (guest, #86508) [Link] (44 responses)

I thought everyone sets their LC_COLLATE to C (C.UTF-8, if supported) anyway to preserve their sanity. You don't? Fire up bash and try this:
> case b in [A-Z]) echo upper;; *) echo lower;; esac
Changed your mind?

See also:

Bash (In en-US.UTF-8 locale, '*[A-Z]*' matches very oddly):
http://savannah.gnu.org/support/index.php?108609
Findutils (regex ranges [A-Z] and [a-z] ignore case):
http://savannah.gnu.org/bugs/?30327
[…]

A report from the documentation maintainer

Posted Oct 29, 2016 12:47 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (1 responses)

Well, this is Zsh, but I can't reproduce either of those things:

% locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
% case b in [A-Z]) echo upper ;; *) echo lower ;; esac
lower
% touch a A
% find -maxdepth 1 -name '[A-Z]'
./A

A report from the documentation maintainer

Posted Oct 29, 2016 13:55 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

OK, fired up bash and yeah, the case statement is broken. But findutils seems to have been fixed. Using findutils-4.6.0-8.fc25.x86_64 here.

A report from the documentation maintainer

Posted Oct 29, 2016 12:52 UTC (Sat) by mstone_ (subscriber, #66309) [Link]

Nope, most people do not change their LC_COLLATE to C. This is especially true IME for people whose native character set isn't US-ASCII, because it makes 30 year old directory listings look nice at the expense of normal behavior for basically everything else.

A report from the documentation maintainer

Posted Oct 30, 2016 21:59 UTC (Sun) by neilbrown (subscriber, #359) [Link] (2 responses)

> I thought everyone sets their LC_COLLATE to C (C.UTF-8, if supported) anyway to preserve their sanity. You don't?

Yes. Yes! A thousand times YES.

> Fire up bash and try this:
> > case b in [A-Z]) echo upper;; *) echo lower;; esac
> Changed your mind?

Or better:

> rm [A-Z]*

WHAT! It removed all the files whose names start with a lower case letter!?!?!

A report from the documentation maintainer

Posted Oct 31, 2016 0:32 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (1 responses)

Two solutions: stop using a buggy shell or use {A..Z} which doesn't have the problem. Well, a third would be to have bash expand globs on <Tab> like Zsh can[1], but I don't know how to do that.

[1]`rm a*<Tab>` becomes `rm afile adir` and so on.

A report from the documentation maintainer

Posted Oct 31, 2016 16:13 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

> use {A..Z} which doesn't have the problem

That has a different problem: Unlike the glob pattern [A-Z], the braces expand to every letter from A to Z, even if some of them don't match:

$ echo [A-D]*
Desktop Documents Downloads
$ echo {A..D}*
A* B* C* Desktop Documents Downloads

> have bash expand globs on <Tab> like Zsh can[1]

The readline command for this in Bash is glob-expand-word (C-x *).

A report from the documentation maintainer

Posted Nov 2, 2016 18:50 UTC (Wed) by robbe (guest, #16131) [Link] (36 responses)

Well, you could set globasciiranges, as documented in the fine manpage... This is not one of those projects that lack tunable knobs.

$ locale | grep LC_COLLATE
LC_COLLATE="de_AT.UTF-8"
$ echo [A-C]*
A1 A2 b1 B1 b2 B2 c1 c2
$ shopt -s globasciiranges
$ echo [A-C]*
A1 A2 B1 B2

FWIW, I’m one of those people who are glad that we finally transcended ASCII.

A report from the documentation maintainer

Posted Nov 2, 2016 20:07 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (35 responses)

> FWIW, I’m one of those people who are glad that we finally transcended ASCII.

I'm also glad that we're no longer restricted to ASCII filenames, but is it really too much to ask that character case be taken into account for sorting and glob pattern matching, regardless of the locale? It's not as if Unicode fails to distinguish between upper- and lower-case characters where applicable. Case sensitivity should not be a function of the encoding.

A report from the documentation maintainer

Posted Nov 2, 2016 20:42 UTC (Wed) by bronson (guest, #4806) [Link] (13 responses)

Programmers may believe that ABCabc is an appropriate collation sequence, but most people expect the more traditional AaBbCc. And computers are increasingly made for most people.

A report from the documentation maintainer

Posted Nov 2, 2016 22:22 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (12 responses)

Obviously there are pros and cons to each collation sequence, and ideally the shell would be configurable to use either ABCabc or AaBbCc. What I have yet to see is any good reason why the collation sequence should change just because you're working in Unicode rather than ASCII.

A report from the documentation maintainer

Posted Nov 2, 2016 23:25 UTC (Wed) by mstone_ (subscriber, #66309) [Link] (11 responses)

It doesn't change with Unicode vs ASCII, it changes with natural sorting vs byte sorting. You can set a locale which is not Unicode but which does reflect long established rules of dictionary sorting. ASCII byte sorting is a bug. It leads to behavior which is simply incorrect according to various languages' conventions for ordering text. It was introduced because it was feasible on the computers of the 1960s, not because it was correct. In recognition of the fact that some people wrote code which was bug-dependent, the C locale was introduced to lock everything to 1980. Some people cling to that as the pinnacle of human achievement and become increasingly angry as the world moves on and fewer and fewer people see things the same way. (Literally: their program output is not the same.) Be that as it may, most of the world (especially the vast majority of people who couldn't write their own name with plain ASCII and the shortcut byte-ordering it facilitated) isn't going to look back.

A report from the documentation maintainer

Posted Nov 3, 2016 15:45 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (10 responses)

> It doesn't change with unicode vs ASCII, it changes with natural sorting vs byte sorting.

Exactly as I've been saying. And yet, enabling Unicode *character encoding* in your environment somehow implies to the shell that you want so-called "natural" sorting rather than traditional byte sorting, unless you take further steps to override LC_COLLATE. These are unrelated topics and there is no good reason for a change in character encoding to affect the sort order, or vice-versa.

There is nothing "unnatural" about sorting uppercase characters first to match the traditional behavior of the shell. This is no different from the common option to sort directories before regular files. For that matter there is nothing "unnatural" about sorting filenames as the byte strings they actually are, regardless of encoding, rather than by their locale-specific decoding into text. Only the byte-strings are guaranteed to be uniform system-wide; the text presented to the user depends on the current environment.

> especially the vast majority of people who couldn't write their own name with plain ASCII and the shortcut byte-ordering it facilitated

Again, separate issues. Nothing about byte-order sorting precludes the use of Unicode character encoding in filenames.

Anyway, this is getting rather far off-topic. The subject was glob patterns and the dubious wisdom of matching characters not specified by the user. If the user writes the glob [A-Z] that should be read as "the uppercase letters from A to Z", not "all letters except lowercase z" (using a range of letters in "natural" sort order) or "all letters" (using case-insensitive comparison). A user who went to the trouble of writing [A-Z] in uppercase obviously didn't intend to match [a-z]. Sort order is, for the most part, a matter of preference and not correctness; redefining the filenames which match a given glob pattern carries a much larger risk of data-loss.

As for the earlier objection that only a programmer would view things this way... the shell is a programming language! Anyone writing commands at the shell prompt is a programmer. If you want an interface designed for non-programmers, there are any number of GUI file managers and program launchers available.

A report from the documentation maintainer

Posted Nov 3, 2016 15:56 UTC (Thu) by farnz (subscriber, #17727) [Link] (3 responses)

But "the uppercase letters from A to Z" is locale-dependent. For a Pole, that includes Ó. For a Swede, it does not.

Basically, the moment you go beyond the user specifying all characters that they're interested in, you're making locale assumptions; whether that be because you're guessing at case, or because you're looking at a range and the user expects the range to include the characters they would include in that range.

A report from the documentation maintainer

Posted Nov 4, 2016 0:09 UTC (Fri) by lsl (guest, #86508) [Link] (2 responses)

There's always Unicode code point order, where A-Z is well-defined without respect to locale. UTF-8 has the nice property of sorting that way under a simple strcmp-based comparator.

Sure, it isn't the order you'd see in a good old telephone book, but at least it's simple and predictable.
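[That property is easy to check; here is a short Python sketch (not part of the original comment) confirming that sorting UTF-8 byte strings bytewise agrees with sorting by code point:]

```python
# UTF-8 encodes larger code points as lexicographically larger byte
# sequences, so bytewise (strcmp-style) order equals code point order.
words = ["Zebra", "apple", "néve", "Ärger", "Ω"]

by_codepoint = sorted(words)  # Python compares strings by code point
by_bytes = sorted(w.encode("utf-8") for w in words)

# The two orderings agree once the code-point-sorted list is encoded.
assert [w.encode("utf-8") for w in by_codepoint] == by_bytes
```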

A report from the documentation maintainer

Posted Nov 4, 2016 9:55 UTC (Fri) by farnz (subscriber, #17727) [Link] (1 responses)

That's the problem, though - there are several non-ambiguous but arbitrary orderings (Polish alphabet, Swedish alphabet, English alphabet, French alphabet, Unicode code point etc). The machine can't (definitionally, as they conflict) give you all reasonable orderings at once; historically (1970s and 1980s) we handled this by saying that the ordering used in English is the One True Ordering, and anyone who thinks in another language can learn the One True Language. Modern machines can do better (and should, IMO).

A report from the documentation maintainer

Posted Nov 5, 2016 1:19 UTC (Sat) by lsl (guest, #86508) [Link]

> Modern machines can do better (and should, IMO).

I'm not convinced. Software is buggy and crappy enough already even without supporting a thousand different ways to sort a directory listing. The machines can do better, but programmers apparently can't.

A report from the documentation maintainer

Posted Nov 3, 2016 16:22 UTC (Thu) by mstone_ (subscriber, #66309) [Link] (5 responses)

You keep saying things that don't make sense. Using Unicode character encoding doesn't imply anything. Yes, they are unrelated topics, but you keep conflating them for some reason. The implication comes not from the character encoding, but from the user specifying the language that they want to use, including that language's sorting rules. What you seem to want is for the natural-language sorting rules to be arbitrarily ignored (always?) because you want to make things easier for the programmer in one particular case. (Similarly, some people want program output to be exactly as it was in 1980 rather than translated for the user because it makes parsing the output easier--another reason we have the C locale.)

I guess you haven't had users complain when things are byte-order sorted and they want to see things sorted in a way that makes sense for humans instead. (This is quite a common complaint, and lazy programmers often brush it off as unimportant even though it's quite important to the users.)

At any rate, the shell isn't a programming language, it's a user interface. Yes, it tries to be both, and that's why it's both a lousy language and a lousy interface. It has knobs you can fiddle to make it do whatever it is you want; the fact that you have to fiddle with a bunch of knobs is part of why it's a lousy language. If you don't like this, it's more productive to choose a different language than to rail against reality. Specifically, you need something with better-defined semantics for pattern matching, probably something with the knobs in the API call itself rather than a bunch of environment variables which alter the semantics in unexpected and non-obvious ways. Another option is to just force the locale to C at the top of your script and explain to any users that you just can't be bothered to care about their language preference.

A report from the documentation maintainer

Posted Nov 3, 2016 19:07 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (4 responses)

> The implication comes not from the character encoding, but from the user specifying the language that they want to use, including that language's sorting rules.

The instructions provided for setting the character encoding, and the defaults for systems configured to use UTF-8, always seem to include the language as well as the character encoding. This bundling of language and encoding and collation and other preferences into a single global setting is the root of the problem. The same environment variable controls both, unless you set yet another variable to selectively override the sorting rules. It doesn't make sense to use the same settings everywhere, and pattern-matching in the shell programming language is one of those areas where a dependency on the current locale makes no sense.

The fact that the user wants to see messages in English (or whatever language) and use UTF-8 for character encoding should not be taken to imply that they want to change the way the shell expands glob patterns.

As I said before, collation order for user-visible output is not really the point. Sometimes the correctness of a script does depend on the sort order internally (such as *.d directories), but typically the files involved are defined to start with ASCII digits and case issues consequently do not apply (given reasonable locale definitions). Personally I would be content with an option to sort by a file type, case, filename triplet which otherwise followed the locale-specific rules. Absent that option, byte-order sorting with LC_COLLATE=C gives the desired behavior in 99.9% of the cases I am ever likely to encounter. YMMV.

> At any rate, the shell isn't a programming language, it's a user interface.

Shell is a user interface in the same sense as any programming language: a user interface designed for use by programmers. It's ridiculous to claim that it isn't a programming language, since a significant fraction of the programs on most Unix systems are written in it. By percentage of commands executed it's a programming language first, with interactive use as a distant second.

> If you don't like this, it's more productive to choose a different language than to rail against reality. Specifically you need something with better defined semantics for pattern matching, probably something with the knobs in the API call itself rather than a bunch of environment variables which alter the semantics in unexpected and non-obvious ways.

What do you think this thread was about? This "better language" you suggest is shell, as it existed before glob patterns became locale-aware and thus context-dependent and dangerously unpredictable. Now the only safe way to write a Bash shell script with glob patterns is by setting the globasciiranges option within the script. For system scripts which can't be fixed the only option is to force LC_COLLATE=C.
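[For scripts, the mitigations mentioned above amount to a couple of lines at the top; this is a sketch, not from the original comment (globasciiranges needs bash 4.3 or later):]

```shell
# Make glob ranges predictable regardless of the user's locale.
# Option 1 (bash >= 4.3): interpret [A-Z] as the ASCII range only.
shopt -s globasciiranges 2>/dev/null || true
# Option 2 (portable): force byte-order collation for the whole script.
export LC_COLLATE=C

# From here on, [A-Z]* matches only names that begin with an
# ASCII uppercase letter, in every environment.
```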

A report from the documentation maintainer

Posted Nov 3, 2016 20:05 UTC (Thu) by mstone_ (subscriber, #66309) [Link] (3 responses)

> The instructions provided for setting the character encoding, and the defaults for systems configured to use UTF-8, always seem to include the language as well as the character encoding.

You continue to go on about character encoding, which has nothing to do with this. I don't fully understand why you keep bringing it up. If you byte sort purely English-language text using UTF-8 encoding it's exactly the same as byte sorting purely English text using ASCII encoding. But real people don't want their output byte sorted, they want it sorted in a way that's sensible to them.

> The fact that the user wants to see messages in English (or whatever language) and use UTF-8 for character encoding should not be taken to imply that they want to change the way the shell expands glob patterns.

Clearly it should; that's the reality. You want a different reality than the one we inhabit. It's also a good thing that people who use languages other than English can use character classes which include characters from their language, which they couldn't do in the 1980s. And if you still want 1980s semantics, tada, they're still there by turning a knob.

> Shell is a user interface in the same sense as any programming language: a user interface designed for use by programmers. It's ridiculous to claim that it isn't a programming language, since significant fraction of the programs on most Unix systems are written in it. By percentage of commands executed it's a programming language first, with interactive use as a distant second.

Most system commands are now written in other, better languages. The majority of people using a unix command line today interact via the shell but don't program using it. That's for the best. In my bin & sbin right now I have:
1236 compiled
238 shell
163 perl
83 bash
71 python
11 ruby
1 tcl
1 php

Looking at the shell scripts they're mostly either trivial, or at least 15 years old. 100 of them are less than 50 lines. A couple are cheats, only using the shell to launch some other interpreter. As far as I can tell from a quick glance, none of them use the critical "remove-files-with-a-glob-character-class" capability you're so stuck on. The longest are 1-2k LOC and date from the early to mid 90s. So yeah, some people program in it, mostly trivial scripts which avoid dealing with corner cases, and mostly as a legacy capability. I stand by my assertion that the majority of users' interaction with the shell is as an interface with some automation capabilities. I suspect that was always true, and that the number of people who copied around cool snippets of .profile was much higher than the number of people who actually created them (that's certainly my recollection of the CS lab).

> What do you think this thread was about? This "better language" you suggest is shell, as it existed before glob patterns became locale-aware and thus context-dependent and dangerously unpredictable.

I have no idea what this thread is about at this point. You keep bringing up character encodings for some reason, and asserting that the shell is something other than what it actually is. I started the thread with the assertion that uppercasing filenames to make them come first is something that's pretty irrelevant to anyone except a few wild-eyed people who haven't kept up with the reality of modern systems, and then you proceeded to demonstrate my point.

Yeah, some tricks that worked 30 years ago don't work anymore (like uppercasing filenames to make them come first). New things have come along that most people are pretty happy about (I'm personally happy taking advantage of the fact that I can run interpreters and libraries that aren't constrained by 1970s or '80s era hardware limitations). You can either continue to complain that things changed, or acknowledge that today's user has a different experience from one 30 years ago and do things in a way that makes sense in a modern context.

A report from the documentation maintainer

Posted Nov 3, 2016 22:37 UTC (Thu) by bronson (guest, #4806) [Link]

> and do things in a way that makes sense in a modern context.

Where "modern context" refers to the centuries-old practice of case-insensitive collation. :)

Did case-sensitive collation even exist before punchcards?

A report from the documentation maintainer

Posted Nov 14, 2016 18:27 UTC (Mon) by Wol (subscriber, #4433) [Link] (1 responses)

> If you byte sort purely english language text using UTF-8 encoding it's exactly the same as byte sorting purely english text using ASCII encoding.

Except that if you try that with the purely ENGLISH word "Noël", you can't sort that in ASCII :-)

Cheers,
Wol

A report from the documentation maintainer

Posted Nov 14, 2016 20:23 UTC (Mon) by bronson (guest, #4806) [Link]

You can absolutely byte sort it.

If you're saying that's inadequate for most people then, agreed, but that's been stated on this thread a few times already...

A report from the documentation maintainer

Posted Nov 2, 2016 21:28 UTC (Wed) by farnz (subscriber, #17727) [Link] (20 responses)

It's not too much to ask, but it's a hard problem. To take a couple of examples, using == for "insensitive comparison":

  • Should ß == SS? In a German filename, the answer is "yes", because ß is just a different way to write ss; in an English filename, however, it's a symbol, not a letter.
  • Should a == ä? In an English or Afrikaans filename, yes; the diacritic is a pronunciation guide only. In a Swedish filename, no, because ä is a different letter to a.
  • Should i == I? In most Latin alphabet languages, yes; lowercase i is dotted, uppercase I is not. In Turkish, however, i's uppercase form is İ, not I.

Unicode makes a decent stab at a solution (section 5.18), but then explicitly calls out Lithuanian and Turkic languages as cases where the default algorithm will not include something that users expect it to include; further, the Unicode solution is based on the principle that it's better for the algorithm to match things it shouldn't, than it is for it to miss things it should match. Thus, an English user will be surprised that the glob S* matches ß, but that's better than a German user being surprised when s* does not match ß. Similarly, a Swede is going to be surprised when nä* matches nävi, specifically so that an English user isn't surprised when na* doesn't match nävi.

A report from the documentation maintainer

Posted Nov 2, 2016 22:44 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (1 responses)

> ... the Unicode solution is based on the principle that it's better for the algorithm to match things it shouldn't, than it is for it to miss things it should match.

That's debatable, and it really depends on what the glob pattern is being used for. When deleting files, for example, it would generally be better to match conservatively so that you don't remove files which the user didn't expect to match. In most cases it's much easier to clean up any files which were missed than it is to restore ones which were unexpectedly removed. The same would apply to any operation which modified files in place (e.g. perl -i).

Personally, I just set LC_COLLATE=C and compare all strings bytewise--which does not preclude the use of Unicode filenames. I find this far less surprising than any of the locale-specific options, and think it would be a safer default, especially for scripts. Interactively, absent some indication of the user's intent, perhaps the shell should evaluate the glob pattern both ways and generate an error if the results do not agree.
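[That last idea can be sketched in Python (hypothetical helper names, not from the thread; fnmatchcase compares code points exactly, and casefold stands in here for a locale-aware, case-insensitive interpretation):]

```python
import fnmatch

def strict_match(name, pattern):
    # code-point-exact matching, no locale involvement
    return fnmatch.fnmatchcase(name, pattern)

def loose_match(name, pattern):
    # stand-in for a case-insensitive, locale-aware interpretation
    return fnmatch.fnmatchcase(name.casefold(), pattern.casefold())

def cautious_match(name, pattern):
    # refuse to guess when the two interpretations disagree
    strict, loose = strict_match(name, pattern), loose_match(name, pattern)
    if strict != loose:
        raise ValueError(f"{pattern!r} vs {name!r}: result depends on interpretation")
    return strict
```

With these helpers, cautious_match("Makefile", "M*") succeeds under either reading, while cautious_match("readme", "R*") raises instead of silently picking a side.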

A report from the documentation maintainer

Posted Nov 2, 2016 23:13 UTC (Wed) by mstone_ (subscriber, #66309) [Link]

The solution here is simply not to blindly delete files using glob patterns. If you need this for some reason you'd darn well better set your locale to C, but you should probably just come up with a safer solution.

A report from the documentation maintainer

Posted Nov 2, 2016 22:54 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (1 responses)

> It's not too much to ask, but it's a hard problem.

As an addendum to my previous comment, the examples given here support the argument that *case-insensitive* comparison presents difficulties above and beyond case-sensitive comparison of Unicode strings. This would make traditional case-sensitive glob matching and sorting easier to implement than the current case-insensitive behavior. "Should ß == SS?" No, the first is lowercase and the second is uppercase. (Of course, you still have to deal with the question of whether ß == ss.)

A report from the documentation maintainer

Posted Nov 3, 2016 9:35 UTC (Thu) by farnz (subscriber, #17727) [Link]

The other trouble is that any human-friendly comparison function presents difficulties for the machine - too much of what's "correct" depends on your cultural norms. For example, I've known native Arabic speakers who simply could not get their heads around the idea that English has 5 different short vowel sounds, and that the difference between "nit" and "net" is significant (because the short vowel modifies the preceding consonant) - to their eyes, both were reasonable ways to write what in Arabic would be nt (as Arabic writing does not include short vowel sounds by default).

The brains of the squishy bag of meat in control of the computer are weird places - while I don't think the difficulty is insurmountable, I don't think it's a trivial problem to solve, either.

A report from the documentation maintainer

Posted Nov 3, 2016 22:12 UTC (Thu) by tao (subscriber, #17563) [Link]

Swedish standardised sorting actually treats V = W, which annoys the hell out of me, seeing as my last name starts with a W :)

A report from the documentation maintainer

Posted Nov 4, 2016 10:10 UTC (Fri) by tdz (subscriber, #58733) [Link] (14 responses)

> Should ß == SS? In a German filename, the answer is "yes", because ß is just a different way to
> write ss;

Actually no, because the result would be pronounced differently, or even be a different word. Double-s is just the workaround for scripts that don't have 'ß'.

A report from the documentation maintainer

Posted Nov 4, 2016 10:14 UTC (Fri) by farnz (subscriber, #17727) [Link] (13 responses)

So what is the correct capitalisation of "groß"? I can't find a capital eszet in normal use, and the beginners guide I'm consulting says that "groß" would capitalise as "GROSS". For that to hold true for a case-insensitive comparison, the filename "groß ding" has to compare equal to the filename "GROSS DING", otherwise the comparison is not fully case-insensitive; and on the case insensitive locale-aware OSes I've tried, that indeed holds true.

A report from the documentation maintainer

Posted Nov 4, 2016 10:25 UTC (Fri) by tao (subscriber, #17563) [Link] (4 responses)

You mean "ẞ" U+1E9E LATIN CAPITAL LETTER SHARP S?

A report from the documentation maintainer

Posted Nov 4, 2016 11:23 UTC (Fri) by idrys (subscriber, #4347) [Link]

I have not seen much (if anything at all) of the capital ß in the wild - it's mostly capitalized as the previous poster noted.

(With all the fun of 'in Maßen' (i.e. not too much) vs. 'in Massen' (a lot) - although the reforms inflicted upon German writing in the last years mean that most people are confused anyway...)

A report from the documentation maintainer

Posted Nov 4, 2016 11:42 UTC (Fri) by farnz (subscriber, #17727) [Link] (2 responses)

It exists, yes, but all the references I can find (as a beginner in the language, not a native speaker nor resident in a German-speaking country) tell me that the use of U+1E9E is for things like advertising, where you're doing the "shouting caps" emphasis trick - thus, where an English advert might say "BIG..." the German equivalent advert would use "GROẞ" to indicate that the caps are not "real" capitals, they're shouting.

In normal writing, though, everything seems to use SS as the capital form of ß.

A report from the documentation maintainer

Posted Nov 7, 2016 18:16 UTC (Mon) by JanC_ (guest, #34940) [Link] (1 responses)

That's at least partially also because “ẞ” isn't on German typewriters, (traditional) computer keyboards, etc.

A report from the documentation maintainer

Posted Nov 8, 2016 9:40 UTC (Tue) by anselm (subscriber, #2796) [Link]

The codepoint for the uppercase “ß” was added to Unicode as a sort of precaution and to make life easier for people implementing case conversion routines. In Germany, in spite of this, uppercase “ß” isn't actually being used in practice, and in fact there is no official agreement or rule as to what the glyph should even look like (although people keep throwing around “ẞ” as if that was some sort of gospel). It would be reasonable to render U+1E9E as “SS” except for the ambiguity with “in Maßen/Massen”, or as “ß” (i.e., exactly like the lowercase “ß”) except for the bad typography of having a lowercase letter in the middle of a bunch of uppercase ones.

Generally, according to the new German orthography rules we're now supposed to write “ß” after a long vowel (or diphthong) and “ss” after a short vowel. This is not the worst part of the orthography reform but the uppercase-“ß” issue remains largely unresolved.

A report from the documentation maintainer

Posted Nov 7, 2016 9:11 UTC (Mon) by tdz (subscriber, #58733) [Link] (7 responses)

Hi,

as the other commenter points out, there exists a symbol and a code to display it. But even being a native speaker, this discussion is actually the first place I've ever seen it. :D

You can use double-s in place of ß in capitalization and that's what people usually do. I've also seen ß not being capitalized; GROß in your example. ß won't ever appear at the beginning of words, so there is no 'natural' use case for capital ß. I'm not sure if many people are aware that it exists.

A report from the documentation maintainer

Posted Nov 7, 2016 10:08 UTC (Mon) by farnz (subscriber, #17727) [Link] (6 responses)

Aha, someone who can actually answer the underlying question for me!

Would you expect a case-insensitive equality operator to have "groß" == "gross" == "GROSS" == "GROß" == "GROẞ" (which the case-insensitive OS I've played with chooses to do in a German locale)?

Put differently, would you expect that if you searched for "groß" in a text document, you would not find matches for "GROSS" but would for "GROß"? Equally, if you searched a text document for "GROSS", would you expect to see matches for "groß", or only for "gross" in a case-insensitive search?

A report from the documentation maintainer

Posted Nov 7, 2016 10:27 UTC (Mon) by johill (subscriber, #25196) [Link] (4 responses)

I'm in the same situation as tdz, being a native German speaker and never having seen the ẞ (upper-case) before. I actually appreciate if ss ends up being equivalent to ß in all cases, for multiple reasons:
  1. sometimes I don't have German keyboard settings available immediately, making it awkward to enter ß
  2. a document may use old or new orthography, so words like "Fluss" (river; this is the currently correct spelling) may be spelled as "Fluß" (old spelling)
  3. when spelled in headings/etc., "SS" will frequently be used to replace "ß"
So I'd argue that treating things as in your example ("groß" == "gross" == "GROSS" == "GROß" == "GROẞ") is helpful.
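[The equivalences listed above can be checked with Python's full case folding; this is a sketch, not part of the original comment:]

```python
# All five spellings collapse to the same key under full case folding.
forms = ["groß", "gross", "GROSS", "GROß", "GROẞ"]
assert {f.casefold() for f in forms} == {"gross"}

# Simple lowercasing is not enough: it keeps the eszett distinct,
# mapping capital ẞ back to ß rather than to "ss".
assert "GROẞ".lower() == "groß"
assert "GROß".lower() != "gross"
```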

A report from the documentation maintainer

Posted Nov 7, 2016 11:20 UTC (Mon) by idrys (subscriber, #4347) [Link] (3 responses)

(native speaker here as well)

While this matching helps with words that can simply be written in two ways, I'd be rather surprised to get a match for a different word (like in Maßen vs. in Massen). And I think the new orthography is, for the most part, horrible (it emphasized writing over reading while neglecting that you know what you're writing but your reader doesn't). But adherents of the old orthography will die out over time anyway :/

A report from the documentation maintainer

Posted Nov 7, 2016 13:45 UTC (Mon) by farnz (subscriber, #17727) [Link] (2 responses)

Hmm. Are you saying that, when doing a case-insensitive match, you'd really want the computer to be aware of the intended dictionary word? So that a search for "groß" matches all of "GROSS", "groß" and "gross" (leaving your human intelligence to determine which ones are "good" matches), while a search for "maßen" should match "Maßen" but not "MASSEN" or "Massen", because "Maßen" and "Massen" are different words? Or is there an underlying rule that I'm not seeing (something like "Maßen" should match "MASSEN" as an all-caps Maßen and "maßen" as missing the initial capital, but not "Massen" or "massen" because the casing rules let you see that ss was deliberate, not the result of round-tripping through upper case back to lower case)?

A report from the documentation maintainer

Posted Nov 7, 2016 14:30 UTC (Mon) by idrys (subscriber, #4347) [Link] (1 responses)

I'd prefer to not match eszet vs. double-s at all, generally. I understand and to a degree follow the reasoning, but I think it would cause more confusion than not. (And your example neatly illustrates this; too much side-knowledge required.)

I _could_ imagine an exception for eszet vs. upper-case double-s, but I'd be surprised if 'grep -i maßen' would find MASSEN as well... (And what about 'SZ' as a capitalization for 'ß'? It is now extremely uncommon, but I've seen this in documents up to the mid-20th century.)

[As an aside, old documents are sometimes inconsistent for eszet vs. double-s in people's names as well, as they sometimes capitalized names and sometimes not, so this is not a new issue. We are not 100% sure what the family name on my mother's side is for that reason. Oh well...]

A report from the documentation maintainer

Posted Nov 7, 2016 16:52 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

FYI, there also exist ligature codepoints, like `fi`, which would need to be split apart on uppercasing.

A report from the documentation maintainer

Posted Nov 10, 2016 14:39 UTC (Thu) by tdz (subscriber, #58733) [Link]

These are all different words, so they should probably not compare equal by default. Having the option of treating ß and ss the same could be useful, though. OTOH I never had this problem in practice.

In English, people sometimes (frequently?) confuse "its" and "it's". Treating them the same in text searches seems a comparable use case.

I thought about your question about ß in capital-letter advertising messages, but I can't remember having seen that anywhere. I could imagine that advertisers avoid using ß and ss in capital letters, because it doesn't look good either way.

A report from the documentation maintainer

Posted Nov 3, 2016 14:47 UTC (Thu) by ssl (guest, #98177) [Link]

>I thought everyone sets their LC_COLLATE to C (C.UTF-8, if supported) anyway to preserve their sanity.

Um, actually the converse. I have my collate set to pl_PL.UTF-8 because I want my ą's sorted after a, ć's after c, ó after o, and so on. C puts them after 'z' which makes absolutely no sense. C.UTF-8 breaks the letters completely.

A report from the documentation maintainer

Posted Nov 6, 2016 0:56 UTC (Sun) by ceplm (subscriber, #41334) [Link]

Concerning the incredible number of packages for TeX Live:

a) I still dream somebody would finish the lout_writer branch of docutils (https://gitlab.com/mcepl/docutils/tree/lout_writer); Lout has many idiosyncrasies (e.g., it was created before Unicode, so it is still 8-bit-based), but it is way simpler than TeX and just as capable.

b) The other dream is that somebody would take XeTeX and make a minimal distribution of TeX just for producing PDF files, with as many dependencies as possible taken from the system (e.g., no TeX fonts; everything from the system-wide installed TrueTypes, etc.). TeX itself is not that huge, but the giant machinery around it is overcomplicated, confusing, and perhaps to a large extent unnecessary.


Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds