Development
What's new in TeX, part 2
This is the second half of our look at how development around the venerable TeX document-preparation system has evolved in recent years. The first article dealt with TeX engines and packages that offer support for modern font handling, scripting, and graph plotting. This time, we will explore work that seeks to bridge the gap between the printed-page world addressed by the original TeX system and the web-centric needs of authors and publishers today.
Background
When TeX first arrived on the scene, it was a revelation. Those who took the time to master it could create beautiful mathematical documents by themselves, without engaging the services of a professional typesetter or a secretary. At that time, and for some time thereafter, "document" meant "something printed on paper."
Advances in hardware and software soon changed that definition—but also made TeX even more valuable. LaTeX automatically numbered our equations and sections for us, eliminating the drudgery of renumbering dozens of references after adding or removing an equation. New tools allowed me to preview my thesis chapters right on the screen, without having to print them out; now, of course, this is routine. PDF output engines became standard and, eventually, computers became fast enough to scroll through PDFs fluidly, making the dream of a paperless collection of scholarly literature a reality.
Around the same time, there was another revolution underway. I remember watching it unfold in real time, as WAIS, Gopher, FTP, and the World Wide Web competed to become the standard for online delivery and sharing of documents. It was over almost before it started, as everyone was enchanted with the Web's embedded hypertext links.
The web has made tremendous progress, and the TeX world has evolved as well. However, the web still does not solve the problems solved by TeX. And TeX still does not offer what the web provides.
This creates a problem for scientists and other authors of technical material, who may be loath to give up the ease of mathematical markup, and the high quality of the result, offered by TeX (and who may be required to use it by their preferred journals), but would like to make their results available on the web. The web, after all, provides immediacy, nearly universal access, and opportunities for feedback and reuse that the sometimes glacial cycles of formal journal publication cannot match. Construction of a bridge between the worlds of the web and of TeX has begun from each side of the divide; whether the two efforts will meet in the middle and give us a new, unified infrastructure remains to be seen.
Typography on the web
If you follow the web-design literature you will see plenty of articles about "typography on the web." Almost all of them are about subjects such as how best to serve font files and font-licensing issues; they are rarely about typography itself. The ability to specify fonts for web pages, rather than relying on the user's system fonts, represents a significant advance and is crucially important for displaying mathematics on the web, as we'll discuss below. But despite some progress in the text-layout engines used in web browsers, the state of typography on the web is still not up to the (perhaps slightly obsessive) standards that habitual TeX users have learned to expect.
Here are two figures illustrating the problem. The left image shows TeX output, and the right image shows how the Chrome web browser displays the same text. I've used the same font in both examples and tried to get the line length and leading to be the same.
The most obvious deficiency in the web example is the inconsistent (and often large) gaps in the spacing of the text, caused by the lack of hyphenation. But it exhibits more subtle problems, too, such as the absence of ligatures (note "coffin" at the end of line eleven). Web results can be improved by using a JavaScript hyphenator, but until something equivalent to TeX's whole-paragraph optimization is implemented in browsers, their rendering of text will always be—at least subtly—inferior.
These examples are of simple prose. More demanding typography, such as might be encountered in poetry or in "critical edition" texts with multiple languages and scripts or footnotes within footnotes, demands a degree of typographic control that is nearly impossible to achieve with HTML.
One obvious solution might be to simply serve our PDF files and not worry about translation to HTML. Since it's easy, with LaTeX, to insert hypertext references in PDF output, one can seamlessly navigate between PDF documents and between PDF and HTML. However, the inability of PDF to reflow, its larger file sizes, its reuse-unfriendly binary nature, and a lingering prejudice against the format on the web, compel us to look for a more flexible solution.
Math on the web
Early attempts to render mathematical notation on the web, including translations from LaTeX, relied either on the insertion of images for equations or on "font hacks" that built up equations by wrapping characters from symbol fonts in individually positioned elements. Both approaches introduced assorted problems, but were the best we could do.
One attempt at a solution to these problems is MathML, a W3C standard for notation of mathematics. Despite existing for a decade, however, MathML still has poor browser support. There are tools to translate various input formats into MathML, but it's a verbose format unsuited to direct authoring.
As an example of MathML's verbosity, let's look at the display of a simple equation (which some mathematicians consider the most beautiful equation in mathematics): e^{iπ} = -1.
In a (La)TeX document, the markup for this equation is:
e^{i\pi} = -1
There are a handful of online services that translate TeX mathematical markup to MathML. The one at mathtowebonline.com translates the above into:
<mtable class="m-equation-square" displaystyle="true"
        style="display: block; margin-top: 1.0em; margin-bottom: 2.0em">
  <mtr>
    <mtd>
      <mspace width="6.0em" />
    </mtd>
    <mtd columnalign="left">
      <msup>
        <mi>e</mi>
        <mrow>
          <mi>i</mi>
          <mi>π</mi>
        </mrow>
      </msup>
      <mo>=</mo>
      <mo>-</mo>
      <mn>1</mn>
    </mtd>
  </mtr>
</mtable>
MathJax seems to be succeeding at what MathML set out to do. This project has changed the landscape for mathematics on the web dramatically. By including a single JavaScript-loading directive in the header of your web page, you can mix LaTeX math freely with your HTML. The reader's browser will download any required math fonts, and the results look almost identical to LaTeX. While this doesn't solve the other typographical issues mentioned above, MathJax is an excellent solution to the problem of math on the web, and far better than anything that has come before, at least if we insist on serving HTML. Its only major drawback is the considerable delay required to download the fonts (if the user doesn't already have them) and for JavaScript to render the equations. From the point of view of speed, there seems to be no advantage over serving the equivalent document as a PDF.
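For example, a minimal page might look something like the following (a sketch only; the CDN URL and script name shown here are for one recent MathJax version, and the details vary with the version and configuration you choose):

<!DOCTYPE html>
<html>
<head>
  <!-- Load MathJax from a CDN; adjust the URL for the version you use -->
  <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js"></script>
</head>
<body>
  <p>Euler's identity can appear inline, \(e^{i\pi} = -1\),
     or as a displayed equation: \[ e^{i\pi} = -1 \]</p>
</body>
</html>

The browser fetches the script and any required web fonts, and the LaTeX delimiters are replaced with typeset mathematics when the page is rendered.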
KaTeX is an alternative to MathJax that claims to be significantly faster. It processes mathematical markup on the server, rather than in the browser. Time will tell whether this kind of approach will replace MathJax, which can only become more useful as JavaScript engines continue to improve.
Write once ...
At first glance, one obvious way for users of TeX to embrace the web would be to continue to write their documents in TeX and transform them, using an automated tool, to HTML. In this way, a single source file could produce both a PDF and at least an approximate HTML rendering of the same document. Unfortunately, this is impossible. TeX is a Turing-complete programming language, as well as a system of markup. With difficulty, you can express any computation within the TeX language. This means that no program can determine whether an arbitrary TeX source document, when run through the TeX program, will even terminate and produce output. The only way to find out what the result of running TeX on a source document will be is to actually run TeX and look at the result.
A common problem is that TeX-to-HTML translators are usually unable to handle TeX containing user macros, since the results of macros are impossible to predict without running TeX on the source document. Here is a very simple example of a macro, which will work in plain TeX or LaTeX:
\def\nchars#1#2{#2%
  \count0=#1 \advance\count0 by -1
  \ifnum\count0>0 \nchars{\count0}{#2} \fi}
This macro accepts two parameters: a number and an arbitrary character. When it is called, it prints the character the specified number of times, looping by recursion. If we say \nchars{30}{\textbullet} in our TeX source, we get thirty bullets printed in the output. It would be easy to make a mistake in the definition of a macro such as this that would lead to an infinite loop. Naturally, that would not result in a useful TeX document either, but it illustrates why the problem of automatically translating TeX to declarative markup is insoluble.
However, the fact that something is merely impossible doesn’t stop everyone. There are a handful of TeX-to-HTML translators available. Some do a pretty good job on relatively simple documents, and are worth considering if you have a TeX paper that you need to convert to HTML quickly.
One converter that does take the approach of requiring the user to run TeX first is the tex4ht system, which is actually distributed with TeX Live. To use tex4ht, you include a special style (and, optionally, embed special directives) in your LaTeX source, and process your document with an engine that creates a DVI file. Then you process the DVI file with a tool that can create not only HTML, but .odt (OpenOffice) and other formats. Tex4ht has, indeed, made realistic single-source authoring workflows [PDF] based on LaTeX markup a reasonable approach—if the author is willing to prepare source files with multiple output targets in mind. There are some limitations, though, including the inability to use OpenType fonts.
For example, here is the classic TeX representation of Stokes' Theorem (which we explored earlier in part one of this series), with tex4ht markup added:
\documentclass{article}
\def\HCode#1{}
\begin{document}

\HCode{<div name = 'top'>Top</div>}

Here is the elementary version of Stokes' Theorem:
\[ \int_\Sigma \nabla\times \mathbf{F} \cdot d\Sigma = \oint_{\partial\Sigma}\mathbf{F}\cdot d\mathbf{r} \]

\HCode{<a href = '\#top'>Go to top</a>}

\end{document}
The \HCode commands in the example embed their arguments verbatim in the HTML output. We've used this mechanism to include an example of some navigation that might be useful on a web page. To prevent the \HCode commands from interfering if the file is fed through pdflatex, the second line defines \HCode to simply discard its argument.
The tex4ht installation includes a number of convenience scripts that include the required style files automatically, generate images from equations, and perform other tasks needed to produce the final output file. If HTML is the target, we process the example above by passing its filename to the htlatex command. The result is an HTML file in the current directory, along with a PNG image for the displayed equation.
Another approach is offered by the pdf2htmlEX project. This tool translates any PDF (not just one created by TeX) into an HTML page. In my tests, it did an accurate job at what is sometimes described as an impossible task, and the examples on its GitHub page are impressive.
But the HTML that this program generates is not human-readable and poses a challenge to search-engine indexing: nearly every character is wrapped in an individually positioned HTML element. The result is HTML in name, but not in spirit; the pages are slow to load and they do not reflow. For example, this sentence (taken from the "scientific paper" example on the pdf2htmlEX site):

Dynamic languages such as JavaScript are more difficult to compile than statically typed ones.

corresponds to a long stretch of individually positioned markup.
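To give a rough idea of the result (this is an illustrative sketch with invented class names and coordinates, not actual pdf2htmlEX output), the page is built from absolutely positioned elements, roughly one per glyph or small group of glyphs:

<div class="page-line" style="position:absolute; top:412.3px; left:72.0px">
  <span style="position:absolute; left:0.0px">D</span>
  <span style="position:absolute; left:8.9px">y</span>
  <span style="position:absolute; left:15.2px">n</span>
  <span style="position:absolute; left:22.1px">a</span>
  <!-- ...one positioned element per glyph, for every line on the page... -->
</div>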
There seems to be little advantage in this approach over simply putting a PDF online.
Going the other way—writing in HTML and translating, as needed, to TeX—is a non-starter except for the simplest of documents. This is because HTML provides a small subset of what can be expressed using TeX (see part one of this series for examples, or browse the TeX Showcase). However, partial translations might be useful in certain circumstances. An interesting project in this vein is xml2tex, which allows the author to define arbitrary mappings from XML tags (and therefore from XHTML) to TeX markup.
Markup to markup
The alternative to writing in TeX or HTML and trying to translate between them is to author in another markup system, more general than either, that can be translated into a variety of target formats. The most prominent example of this approach is DocBook. In its XML incarnation, DocBook is a "schema", or set of tags, that describes various types of documents, including books, by defining the abstract intent of their various elements.
There are tools to transform DocBook files into various other text formats, such as HTML, and into "final output" formats, including PDF. It has a bad reputation among people who have gravitated to TeX in search of high-quality output, though, because, historically, the toolchain for producing PDF from DocBook created results with very poor typography. More recently, the maturing of the dblatex project, which converts DocBook to LaTeX, is making it possible to write in DocBook with HTML, high-quality PDF, and other targets in mind.
This is still not an acceptable solution for a great many authors, however, because of the daunting nature of DocBook's tag system. It is highly abstract, voluminous (there are over 400 elements), and requires deep nesting of XML elements to express commonplace intents. Despite its huge catalogue of elements, DocBook is draconian about enforcing its abstract conception of how a document should be described, making it impossible to specify purely visual output, such as bold text.
The need for a more pleasant and flexible markup system, one that was nevertheless more abstract than TeX or HTML and could be transformed to either, led Torsten Bronger to create the tbook project. Tbook is also an XML schema, but it is designed to incorporate an almost-LaTeX syntax for mathematics—and to be less verbose, avoiding DocBook's deep nesting. Here is Bronger's example of what you need to do to include a figure in DocBook:
<figure label="1.1" float="1">
  <title>The galaxy.</title>
  <mediaobject>
    <imageobject>
      <imagedata fileref="galaxie.png" format="PNG"/>
    </imageobject>
  </mediaobject>
</figure>
Obviously, this is more verbose and deeply nested than HTML, TeX, or any markup system that an author would be likely to find convenient. Tbook markup is more compact:
<figure>
  <graphics file="galaxie" kind="vector"/>
  <caption>The galaxy.</caption>
</figure>
It can be transformed into LaTeX, HTML, or even DocBook, using the author's set of XSLT stylesheets. It contains facilities for processing illustrations and performing a few other tasks for the convenience of its main intended audience: authors of scientific papers. Regrettably, tbook never gained traction, probably because modifying its behavior non-trivially meant working with XSLT—an exercise in masochism rivaling the creation of LaTeX style files.
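To give a sense of what that involves, here is a hypothetical fragment in the style of such a stylesheet (not taken from the tbook distribution) that maps a tbook <caption> element to a LaTeX \caption command:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Illustrative only: emit \caption{...} for each tbook <caption> -->
  <xsl:template match="caption">
    <xsl:text>\caption{</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>}</xsl:text>
  </xsl:template>
</xsl:stylesheet>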
The very idea of writing with XML tags at all is anathema to many people. Linus Torvalds, for example, believes that XML is "probably the worst format ever designed", "a complete disaster", and "horrible crap". Relief can be found by turning to the family of "lightweight" text-based formats. These formats all strive to make the document source easy for the human eye to parse, which they accomplish by adopting some conventions that evolved in email (and elsewhere) for indicating emphasis, headings, and the like.
One of the earliest of these is AsciiDoc, which was originally intended as a friendlier shorthand for DocBook. It has transformers to create HTML, LaTeX, and more, and it is designed so that both its syntax and transformations can be conveniently extended by the user. Direct transformation to LaTeX is still experimental, but can be accomplished by transforming to DocBook and using dblatex or other DocBook tools for the final step. Here is a somewhat recursive example that gives some of the flavor of AsciiDoc markup:
What's New in TeX: Part 2
=========================

Write once...
-------------

Here is the http://tbookdtd.sourceforge.net/db-diss.html[tbook author's]
example of what you need to do to *include a figure* in DocBook:

.DocBook Example
----
<figure label="1.1" float="1">
  <title>The galaxy.</title>
  <mediaobject>
    <imageobject>
      <imagedata fileref="galaxie.png" format="PNG"/>
    </imageobject>
  </mediaobject>
</figure>
----

Obviously, this is _more verbose_ and deeply nested
This figure shows what the HTML looks like when this sample is run through the asciidoc command, using the default stylesheet.
Pandoc is a Haskell program that can transform between many pairs of file formats. It is not a solution for authoring for the web using LaTeX, despite offering a LaTeX-to-HTML translator, because its LaTeX parser is still under development. However, it can translate from a highly extended form of Markdown (a lightweight format in the same spirit as AsciiDoc) to LaTeX, HTML, and many other formats. Pandoc makes it realistic to adopt its dialect of Markdown as an authoring language and target the web, PDF, and even OpenOffice and Word. You can use LaTeX math directly, and even transform your text to DocBook, if your publisher requires it.
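As a brief illustration (the title and wording are made up, but the syntax is standard Pandoc Markdown with embedded LaTeX math):

% What's New in TeX: Part 2

# Write once...

Pandoc's extended Markdown accepts LaTeX math directly, so we can
write $e^{i\pi} = -1$ inline, or as a displayed equation:

$$ e^{i\pi} = -1 $$

A command along the lines of pandoc -s notes.md -o notes.html (or notes.tex, or, with a LaTeX installation, notes.pdf) then produces the corresponding output format.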
Sphinx began as a solution for writing Python documentation, but is being adopted by a growing number of authors of technical books and papers. Its input is reStructuredText, which is another lightweight format similar to AsciiDoc. It can convert to HTML, LaTeX, and a few other formats, such as man pages. Sphinx may be a more convenient choice than Pandoc if you intend to alter the transformation engine and are more familiar with Python than with Haskell.
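For comparison, a rough reStructuredText equivalent is shown below (a small sketch; getting the math rendered in Sphinx's HTML output assumes one of its math extensions, such as sphinx.ext.mathjax, is enabled):

Write once...
=============

Euler's identity can be written inline as :math:`e^{i\pi} = -1`,
or displayed with the math directive:

.. math::

   e^{i\pi} = -1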
Pandoc, Sphinx, and AsciiDoc are all mature, realistic solutions that should allow you to repurpose your possibly math-heavy documents for multiple targets, including the web, PDF, and more. They, as well as everything else I've discussed in this article, are free and open source. They can be used in conjunction with MathJax, and so can serve, for authors of technical material, as the bridge between the web and the TeX world.
I hope I've provided some sense, in this article and the last, of how the two great branches of technical publication have each gained greater power and added interoperability in recent years. Despite the progress in both the TeX and web worlds, however, we still face a Tower of Babel of markup languages and a paralysis of choice in technology when sitting down to write. The difficulty is not merely in finding a workflow that is congenial, which is simple enough, but in making choices that will future-proof your content and allow it to travel, with little manual intervention, from journal to web site, from book chapter to slideshow.
I regret that there is no obvious, final recommendation to give here in this final paragraph—except what is probably self-evident to readers of this publication, which is to stick with free software and text-based formats that allow the profitable use of version control and all our familiar utilities. And, finally, to relax and remember that the quality of your content is infinitely more important than the technology you use to express it.
Brief items
Quotes of the week
The rest is details. Yes, we've invented ways to do the details better, or worse, as the case may be.
Stellarium 0.14 released
Version 0.14 of the Stellarium planetarium application has been released. The update "brings a big leap forward in astronomical accuracy for historical applications" with support for the IAU2006 long-term precession model and support for nutation correction. There has also been a major update to the deep-sky object data set, and updates to several plugins.
Mozilla Launches Open Source Support Program

Mozilla CEO Mitchell Baker has announced the launch of "an award program specifically focused on supporting open source and free software." The main focus of the program will be to provide financial support to other projects, "to recognize and celebrate communities who are leading the way with open source projects that contribute to our work and the health of the Web. It encompasses both: a) a 'give back' element for open source and free software projects that Mozilla relies on; and b) a 'give forward' component for supporting other projects where financial resources from Mozilla can make our entire community more successful." The initial funding allocation for the program is $1,000,000, and Mozilla is seeking applications for ten recipient projects. The announcement also notes that one planned component of the program will be to fund security-related work. (Thanks to Martin Michlmayr).

KDevelop 5.0.0 beta available

The first beta release of KDevelop 5.0.0 is available. The code base has been ported to Qt 5 and KDE Frameworks 5, the legacy C++ parser and semantic-analysis plugin has been replaced with a much more powerful one based on Clang, and the hand-written CMake interpreter has been removed in favor of upstream CMake, along with more features, code cleanup, and bug fixes.

Tizen 2.4 SDK released

The Tizen SDK has been updated to support the project's 2.4 release, including "support of the Tizen 2015 UX (new winset and style) and over 3,000 new APIs." The new APIs include voice control and additional input methods; there are also new Tizen reference apps included in the SDK download.

LabPlot 2.1.0 available

Version 2.1.0 of LabPlot, the KDE data-analysis and visualization application, has been released. Changes include an interface for manipulating matrix-like data, a "workbook" tool that allows users to group related objects into a shared workspace, and support for importing many new data formats.

Container provisioning with Guix

At the GNU Guix blog, David Thompson introduces the new Guix container implementation. Currently, Guix allows users to spawn a new development environment inside of a container by adding the --container flag to the guix environment command, and allows the root user to create a full Guix system within a container. "There is still much work to be done in order to make call-with-container a robust container platform. For example, control groups could be used to arbitrarily limit the resources a container can consume, and virtual network interfaces could be used to give containers access to the net without sharing the host system's network interfaces. If you would like to help improve call-with-container, or any other part of the Guix codebase, please join the fun!"

Elasticsearch 2.0.0 is available

Elasticsearch version 2.0.0 has been released. Based on Lucene 5.2.1, the update adds support for configurable compression on storage, merges the "query" and "filter" APIs into a single interface, and allows aggregation operations (such as moving-average calculations) to be chained together. Also noteworthy is that the new release runs under the Java Security Manager, which should provide additional protection against security exploits.
Newsletters and articles
Development newsletters from the past week
- What's cooking in git.git (October 22)
- What's cooking in git.git (October 26)
- LLVM Weekly (October 26)
- OCaml Weekly News (October 27)
- Perl Weekly (October 26)
- PostgreSQL Weekly News (October 25)
- Python Weekly (October 22)
- Ruby Weekly (October 22)
- This Week in Rust (October 26)
- Tahoe-LAFS Weekly News (October 27)
- Tor Weekly News (October 24)
- Wikimedia Tech News (October 26)
Coghlan: 27 languages to improve your Python
Python language developer Nick Coghlan has posted a survey of 27 languages that, he thinks, have lessons for Python. "One of the things we do as part of the Python core development process is to look at features we appreciate having available in other languages we have experience with, and see whether or not there is a way to adapt them to be useful in making Python code easier to both read and write. This means that learning another programming language that focuses more specifically on a given style of software development can help improve anyone's understanding of that style of programming in the context of Python."
Swarm v. Fleet v. Kubernetes v. Mesos (O'Reilly)
Here's a survey of orchestration systems on the O'Reilly site. "Various software tools and solutions exist to help with these challenges. Let’s focus on orchestration tools, which help make all the pieces work together, working with the cluster to start containers on appropriate hosts and connect them together. Along the way, we’ll consider scaling and automatic failover, which are important features."
Page editor: Nathan Willis