Pandoc: a workhorse for document conversion
Pandoc is a document-conversion program that can translate among a myriad of formats, including LaTeX, HTML, Office Open XML (docx), plain text, and Markdown. It is also extensible by writing Lua filters that can manipulate the document structure and perform arbitrary computations. Pandoc has appeared in various LWN articles over the years, such as my look at Typst and at the importance of free software to science in 2025, but we have never provided an overview of the tool. The February release of Pandoc 3.9, which adds the ability to compile the program to WebAssembly (Wasm) so that Pandoc can run in web browsers, will likely also be of interest.
Problems solved
Writers, scholars, scientists, and many others often need to create versions of their documents in various formats. For example, a mathematician, after preparing a paper formatted with LaTeX, may want to generate an HTML version to share on the web, and later create a set of slides for a conference. Sections of the paper might be useful in a grant application, but the funding agency might require a Microsoft Word format. Creating these various incarnations would seem to require a great deal of redundant work, especially if the paper contains equations, figures, and references.
This is the problem that Pandoc was created to solve. Using Pandoc, our hypothetical mathematician can write the paper once and have it quickly and automatically converted to any of over 40 output formats. Others may simply have a document in some format and want it translated to a different format to use it in some other way. Pandoc also lurks behind other projects and online services that require document conversion, such as the popular Quarto "scientific and technical publishing system". The list of Pandoc input and output formats can be found in its README file.
A program that directly translated among N formats would have to incorporate N² translators. Instead of implementing the resulting hundreds of translation routines, Pandoc relies on a modular architecture. It ingests documents using "readers" specialized for each input format, whose job is to translate into Pandoc's internal abstract syntax tree (AST). Output is emitted by "writers" that translate the internal representation into the desired output format.
This strategy turns the N² problem into a 2N problem. It also makes it easier to enlarge Pandoc's repertoire: a contributor who would like the system to be able to read a new format need only write a Lua program to convert that format into the internal representation, after which Pandoc will be able to translate the format into any of its many outputs. Similarly, a writer module for a new format gives Pandoc the ability to convert any of the document types that it already understands into the new one. (I've simplified the situation somewhat, in that the set of document formats that Pandoc can read is not identical to the set of formats that it can write, but they have a large intersection.)
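The reader/AST/writer arrangement can be sketched in a few lines of Python. This is a toy under stated assumptions, not Pandoc's actual code: the "AST" here is just a list of (kind, text) pairs, and the names read_markdown, write_html, write_plain, and convert are invented for illustration. The two counting functions compare the approaches; strictly, direct translation among N formats needs N(N-1) routines, which grows like N².

```python
# Toy model of the reader -> AST -> writer design; every name here is hypothetical.

def read_markdown(text):
    """Reader: lines starting with '# ' become headers, other non-blank lines paragraphs."""
    ast = []
    for line in text.splitlines():
        if line.startswith("# "):
            ast.append(("Header", line[2:]))
        elif line.strip():
            ast.append(("Para", line))
    return ast


def write_html(ast):
    """Writer: emit one HTML element per AST node (no escaping; this is only a toy)."""
    tags = {"Header": "h1", "Para": "p"}
    return "\n".join(f"<{tags[kind]}>{text}</{tags[kind]}>" for kind, text in ast)


def write_plain(ast):
    """Writer: upper-case headers, pass paragraphs through."""
    return "\n".join(text.upper() if kind == "Header" else text for kind, text in ast)


READERS = {"markdown": read_markdown}
WRITERS = {"html": write_html, "plain": write_plain}


def convert(text, source, target):
    """Any reader can be paired with any writer because both speak the AST."""
    return WRITERS[target](READERS[source](text))


def pairwise_translators(n):
    """Routines needed to translate directly between every ordered pair of n formats."""
    return n * (n - 1)


def hub_modules(n):
    """Modules needed with a shared AST: one reader and one writer per format."""
    return 2 * n
```

Adding a new entry to WRITERS immediately makes that format available to every existing reader, which is the property that the design is meant to provide.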
In practice, the most effective way to start using Pandoc for a new document is to write it in Pandoc's extended dialect of Markdown. This is because the readers for different formats vary in their abilities, but the Markdown reader is fully featured and complete. That variability is necessarily the case, because not every document format can be reliably translated into an AST. For example, LaTeX and Typst can embody arbitrary programs, whose output cannot be determined without running their respective compilers.
If a Markdown document is saved in a file called example.md, this command would translate it into a Typst document:
pandoc example.md -o example.typ
The output format is inferred from the file extension. To translate it into HTML, a .html extension can be used instead. Another option is to specify the output format with the "-t" flag, for example "-t html".
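The rule that an explicit format wins and the file extension is the fallback can be illustrated with a small Python sketch. The lookup table below is an illustrative subset that I chose for the example, and the function name output_format is hypothetical; Pandoc's own table of extensions is considerably longer.

```python
from pathlib import Path

# Illustrative subset only; not Pandoc's real extension table.
EXTENSION_FORMATS = {
    ".md": "markdown",
    ".html": "html",
    ".typ": "typst",
    ".tex": "latex",
    ".docx": "docx",
}


def output_format(output_path, explicit=None):
    """Return the explicit format (like "-t") if given, else guess from the extension."""
    if explicit is not None:
        return explicit
    suffix = Path(output_path).suffix.lower()
    if suffix not in EXTENSION_FORMATS:
        raise ValueError(f"cannot infer a format from {output_path!r}")
    return EXTENSION_FORMATS[suffix]
```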
Pandoc's Markdown dialect incorporates additions to the original specification that make it more suitable for expressing complex documents, while retaining the signature advantage of being pleasant to parse by eye. In addition to everything in traditional Markdown, it can handle ordered lists with letters or Roman numerals, task lists, definition lists, various styles of tables, LaTeX mathematics, footnotes, bibliographic citations, and quite a bit more.
Authors can also insert raw markup into the Markdown source, which is passed directly to the output file, bypassing Pandoc processing. Pandoc supplies syntax for raw markup that allows different fragments to be inserted in the output, depending on the format. For example, this code listing renders the name of Donald Knuth's typesetting engine in plain text, HTML, and (La)TeX; the appropriate fragment is used depending on the output format passed to Pandoc on the command line (or implied by the output file extension).
The `TeX`{=plain}
`T<span style="position:relative;top:0.2em">E</span>X`{=html}
`\TeX\ `{=latex} typesetting system.
The figure below shows all three versions, as they appear in a terminal, a web browser, and a PDF viewer, respectively. The PDF version was generated by LuaLaTeX processing the LaTeX output.
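The selection of fragments by output format can be mimicked with a short Python sketch. It is only a model of the behavior described above, assuming that each raw fragment is written as a backtick-quoted span followed by a {=format} tag; the regular expression and the function name select_raw are my own and say nothing about how Pandoc actually parses its input.

```python
import re

# Matches `content`{=format}; a simplification that ignores nested backticks.
RAW_SPAN = re.compile(r"`([^`]*)`\{=(\w+)\}")


def select_raw(source, target):
    """Keep raw spans tagged for the target format and drop all other raw spans."""
    def keep_or_drop(match):
        content, fmt = match.group(1), match.group(2)
        return content if fmt == target else ""

    return RAW_SPAN.sub(keep_or_drop, source)


# A shortened version of the listing above, as a raw string.
SAMPLE = r"The `TeX`{=plain}`\TeX\ `{=latex} system."
```

Calling select_raw(SAMPLE, "plain") keeps only the plain-text spelling, while select_raw(SAMPLE, "latex") keeps only the LaTeX macro.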
Citations
Pandoc has sophisticated support for citations and bibliographies. This feature is what makes it possible for scientists and scholars to write in Pandoc's extended Markdown rather than directly in LaTeX. Pandoc borrows the relevant syntax and mechanisms from LaTeX, using similar markup to indicate citations, the same BibTeX text-file databases of references, and the same standard Citation Style Language (CSL) bibliography style files.
As a simple example of how this works, here is a fragment of a document written in Pandoc's Markdown:
Free software [@fsfs] is important for physics [@heisenbergBeyond] and other sciences.
The bits inside the square brackets indicate references to resources defined in a BibTeX database. This is a text file containing the bibliographic data for papers, books, websites, and anything else that an author might want to refer to. Here is the relevant part of the database for this document:
@article{fsfs,
  author = {Lee Phillips},
  journal = {LWN},
  month = {June},
  title = {{T}he {I}mportance of {F}ree {S}oftware to {S}cience},
  url = {https://lwn.net/Articles/1023299/},
  year = {2025}
}
@book{heisenbergBeyond,
  address = {New York},
  author = {Werner Heisenberg},
  isbn = {06-131622-9},
  publisher = {Harper {and} Row},
  title = {{P}hysics and {B}eyond},
  url = {https://www.amazon.com/Physics-Beyond-Encounters-Conversations-Perspectives/dp/0061316229/},
  year = {1971}
}
The fragment can then be processed with Pandoc to create output, in any of the formats that it supports, with references and a bibliography. The command for doing so is:
pandoc --citeproc --bibliography=refs.bib refsExample.md -o refsExample.html
In this command, the "--citeproc" flag tells Pandoc to use its citation machinery and "--bibliography=" points it to the BibTeX file. The input file is refsExample.md, the file extension indicating Markdown. The output will be in HTML5 because of the extension for the output file. The result will be formatted using the default bibliography style, which happens to be the Chicago Manual of Style author-date format. Different styles can be used by downloading the desired CSL file (a good source is the Zotero style repository).
The figure below shows how the result appears in my web browser, first using the default bibliography style and, below that, using a numerical superscript style. The Pandoc command line is the same in both cases; the only change was the installation of a different CSL file in the default location in the filesystem (another option is to use a "--csl" flag to point Pandoc at the desired file).
In a real project the bibliography would be set off from the main text by a header and, probably, a different font, both of which can be conveniently added by Pandoc.
We can even have plain text output with citations, for those occasions when our emails or website comments require maximum pedantry. Here's the output of the same command with the "-t plain" flag added to request plain text:
Free software¹ is important for physics² and other sciences.
1. Phillips, Lee. (2025). The Importance of Free Software to Science. LWN. https://lwn.net/Articles/1023299/
2. Heisenberg, Werner. (1971). Physics and Beyond. Harper and Row. New York. ISBN 06-131622-9.
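The basic idea behind a numbered citation style, replacing each key with a number assigned in order of first appearance and collecting a matching reference list, can be sketched in Python. This is a toy and not Pandoc's citation processor: the REFERENCES dictionary is a hand-written stand-in for the BibTeX database, the function name number_citations is invented, and only the simplest [@key] form is handled (real citations can carry several keys, prefixes, and page locators).

```python
import re

# Hypothetical, pre-formatted stand-in for the BibTeX entries shown earlier.
REFERENCES = {
    "fsfs": "Phillips, Lee. (2025). The Importance of Free Software to Science. LWN.",
    "heisenbergBeyond": "Heisenberg, Werner. (1971). Physics and Beyond. Harper and Row.",
}


def number_citations(text, references):
    """Replace each [@key] with a bracketed number and build the reference list."""
    order = []  # citation keys in order of first appearance

    def replace(match):
        key = match.group(1)
        if key not in references:
            raise KeyError(f"unknown citation key: {key}")
        if key not in order:
            order.append(key)
        return f"[{order.index(key) + 1}]"

    body = re.sub(r"\[@([^\]\s;]+)\]", replace, text)
    bibliography = [f"{n}. {references[key]}" for n, key in enumerate(order, start=1)]
    return body, bibliography
```

A key that is cited twice keeps the number it received the first time, so the reference list contains each work only once.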
Because it is a command-line program, Pandoc can be flexibly incorporated into various workflows and used in concert with other tools. I have a keyboard shortcut defined in my editor that processes any selected text through Pandoc, using "--citeproc", and places the plain-text output in the clipboard, ready for pasting into emails or comment boxes.
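A workflow of that kind might look something like the following Python sketch, which pipes text to Pandoc on standard input and returns the plain-text result. It is not my actual editor configuration: the function names are invented for this example, it assumes that a pandoc executable is on the PATH, and the clipboard step is left out because it depends on the platform. Only flags mentioned in this article, plus "-f" to name the input format, are used.

```python
import subprocess


def citeproc_command(bibliography, csl=None):
    """Build the argument list for converting Markdown on stdin to plain text."""
    command = [
        "pandoc",
        "--citeproc",
        f"--bibliography={bibliography}",
        "-f", "markdown",
        "-t", "plain",
    ]
    if csl is not None:
        # Optional CSL style file, as with the "--csl" flag described earlier.
        command.append(f"--csl={csl}")
    return command


def to_plain_text(markdown, bibliography, csl=None):
    """Run Pandoc (assumed to be installed) and return its standard output."""
    result = subprocess.run(
        citeproc_command(bibliography, csl),
        input=markdown,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Pandoc reports a failure
    )
    return result.stdout
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems when file names contain spaces.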
Pandoc filters
A Pandoc filter is a Lua program that acts directly upon the AST. Filters can make changes to document elements and can perform arbitrary computations based on the contents of those elements. For example, a user might create a filter to number all of the paragraphs in a report, or to print a timestamp under an article's title.
There is also an older type of filter that manipulates a JSON serialization of the AST. These filters can be written in any language, but they have extra dependencies and are slower, as they require serialization and deserialization steps. The current recommendation is to write filters in Lua.
Using Lua filters involves no dependencies, as Pandoc has a Lua interpreter built in. In fact, entering "pandoc lua" in the terminal opens a standard Lua read-eval-print loop (REPL).
Here's an example of a simple filter that changes all underlined elements to Strong elements, which are rendered in most output formats as boldface:
function Underline(elem)
  return pandoc.Strong(elem.content)
end
To apply this filter, save the code above as filter.lua in the current directory and invoke Pandoc with the argument "--lua-filter filter.lua".
The function doesn't make much sense outside of the Pandoc context, but filters have access to an API that, somewhat magically, turns function names into routines that walk the AST and process nodes matching the name of the function. Another example is a filter that I wrote that allows me to input equations using the more concise Typst math syntax, rather than LaTeX, even when I need to produce a LaTeX document.
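For comparison, the traversal that the Lua API performs implicitly can be written out by hand in one of the older JSON filters. The following is a minimal sketch in Python, assuming that the JSON serialization represents each element as an object with a "t" key for its type and a "c" key for its contents; the function name walk and the file name in the usage note are my own choices.

```python
#!/usr/bin/env python3
import json
import sys


def walk(node):
    """Return a copy of the JSON AST with every Underline node turned into Strong."""
    if isinstance(node, list):
        return [walk(item) for item in node]
    if isinstance(node, dict):
        if node.get("t") == "Underline":
            # Same contents, different element type; recurse for nested underlines.
            return {"t": "Strong", "c": walk(node["c"])}
        return {key: walk(value) for key, value in node.items()}
    return node  # strings, numbers, and other leaves pass through unchanged


def main():
    """Read the serialized document on stdin and write the modified one to stdout."""
    document = json.load(sys.stdin)
    json.dump(walk(document), sys.stdout)


# Pandoc passes the output format as the first argument to a JSON filter;
# without it, assume the file is being imported or tested and do nothing.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Saved as, say, underline_to_strong.py and made executable, this could be applied with "--filter ./underline_to_strong.py". The extra serialization steps visible in main() are the overhead that makes the Lua version the recommended one.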
Filters are a powerful means to bend Pandoc to serve nearly any document-processing task, limited only by the imagination and patience of the author/programmer. Other potential applications might be to retrieve information from the Internet to insert into documents, to invoke external programs to create illustrations, or to assemble outlines.
Custom readers and writers
Although Pandoc can handle an impressive variety of document formats, it would not make sense to include every highly specialized or ad hoc markup system in existence. For more arcane applications, Pandoc provides the ability to create custom readers and custom writers. A custom reader is a Lua program that parses input text and translates it into Pandoc's AST; a writer does the reverse. To make this more convenient, Pandoc includes Lua's LPeg parsing library.
Some examples of custom readers are pandoc-fountain, which ingests screenplays written using Fountain markup, the IDML Pandoc reader that understands the markup language used in Adobe InDesign projects, and Lean.lua that can handle Lean files. Interesting custom writers include Pandoc Terminal Writer for colorful pretty-printing on the terminal, Pandoc to PreTeXt that creates markup in the PreTeXt format, which is a structural markup language for textbooks and papers, and a custom writer that emits the specialized format used in a discussion forum.
Changes for 3.9
Pandoc 3.9 was released in early February. The most prominent new feature is the ability to compile Pandoc to Wasm, allowing it to run entirely in a web browser. The project has provided an example page that allows users to do general document conversions via a web form. Pandoc in the browser leads to some interesting possibilities. For example, it should be practical to construct pages that accept user content in nearly any markup format, either in comment boxes or as wiki-style editable text, and convert it in realtime to valid HTML5 for inclusion in the page. Another possibility might be educational sites that, for instance, teach Typst or LaTeX with live feedback.
The new release brings with it dozens of small improvements and bug fixes. Most of these are technical minutiae having to do with corner cases, layout subtleties, or obscure markup formats.
There are two enhancements that may be of wider interest. One is broader support for specifying PDF standards (PDF/A, etc.), as long as LuaLaTeX is used to produce the output. The other is a significant improvement to citation handling that allows the author to reset citation history when needed, typically at the beginnings of book chapters. Standard practice is to abbreviate citations after the first mention (omitting publication details, etc.), but at the beginning of a new chapter there may be a desire to see the full citation again.
Installations and origins
Pandoc is a free program, released under the GPLv2 (or later). It is available in the package repositories of most Linux distributions, but those versions may lag behind the latest release. The source, as well as binaries for Wasm, Windows, macOS, and Linux for both major architectures, is available on the web site. Those who want to use Pandoc to create PDFs must also install LaTeX, Typst, or another program that can convert one of Pandoc's output formats to PDF. Pandoc will use LaTeX by default, but the user can specify another program.
The Pandoc documentation is quite good, and is all I've ever needed to support my extensive and daily use of the system.
Pandoc was created by John MacFarlane, a professor of philosophy at UC Berkeley specializing in the philosophy of language. The project is distinguished by being one of the few widely used Haskell programs. In an interview from 2023 its creator said:
I think Pandoc is, in general, written in fairly simple Haskell. I don't use too many complicated things in there. And that's partly due to the fact that I wasn't a very sophisticated Haskeller when I started, and I'm still not that much more sophisticated now.
Conclusions
Pandoc is indispensable for anyone who needs to create different kinds of written documents. With Pandoc, I can write for a publication that insists on Word documents without ever touching a GUI program, turn a set of notes into a web page, or have a book's illustrations automatically generated. Pandoc is fast and never, in my experience, fails to do exactly what it is supposed to do.
The program is actively developed and continually improved in its GitHub repository. Contributions are welcomed, and those interested in helping are not necessarily excluded by a lack of intimacy with Haskell. As MacFarlane remarked in the interview: "it's possible for people who are generally familiar with computer languages to look at some Pandoc code sometimes, even if they don't know Haskell, and figure out what might be needed".
Pandoc is most powerful as a command-line tool, where it can be incorporated into scripts and adapted to various document-processing pipelines. However, its new ability to be embedded in web pages is an interesting development that may lead to a wider variety of uses and, perhaps as a secondary effect, to an expanded interest in Haskell.
| Index entries for this article | |
|---|---|
| GuestArticles | Phillips, Lee |
