
Pandoc: a workhorse for document conversion

April 1, 2026

This article was contributed by Lee Phillips

Pandoc is a document-conversion program that can translate among a myriad of formats, including LaTeX, HTML, Office Open XML (docx), plain text, and Markdown. It is also extensible by writing Lua filters that can manipulate the document structure and perform arbitrary computations. Pandoc has appeared in various LWN articles over the years, such as my look at Typst and at the importance of free software to science in 2025, but we have missed providing an overview of the tool. The February release of Pandoc 3.9, which comes with the ability to compile the program to WebAssembly (Wasm), allowing Pandoc to run in web browsers, will likely also be of interest.

Problems solved

Writers, scholars, scientists, and many others often need to create versions of their documents in various formats. For example, a mathematician, after preparing a paper formatted with LaTeX, may want to generate an HTML version to share on the web, and later create a set of slides for a conference. Sections of the paper might be useful in a grant application, but the funding agency might require a Microsoft Word format. Creating these various incarnations would seem to require a great deal of redundant work, especially if the paper contains equations, figures, and references.

This is the problem that Pandoc was created to solve. Using Pandoc, our hypothetical mathematician can write the paper once and have it quickly and automatically converted to any of over 40 output formats. Others may simply have a document in some format and want it translated to a different format to use it in some other way. Pandoc also lurks behind other projects and online services that require document conversion, such as the popular Quarto "scientific and technical publishing system". The list of Pandoc input and output formats can be found in its README file.

A program that directly translated among N formats would have to incorporate N² translators. Instead of implementing the resulting hundreds of translation routines, Pandoc relies on a modular architecture. It ingests documents using "readers" specialized for each input format, whose job is to translate into Pandoc's internal abstract syntax tree (AST). Output is emitted by "writers" that translate the internal representation into the desired output format.

This strategy turns the N² problem into a 2N problem. It also makes it easier to enlarge Pandoc's repertoire: a contributor who would like the system to be able to read a new format need only write a Lua program to convert that format into the internal representation, after which Pandoc will be able to translate the format into any of its many outputs. Similarly, a writer module for a new format gives Pandoc the ability to convert any of the document types that it already understands into the new one. (I've simplified the situation somewhat, in that the set of document formats that Pandoc can read is not identical to the set of formats that it can write, but they have a large intersection.)
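
To make the counting concrete, here is a small Python sketch of the hub-and-spoke idea. It is a toy model, not Pandoc's implementation: the two "readers", the two "writers", the tuple-based tree, and all of the function names are invented for illustration, and the HTML writer does no escaping. Strictly speaking, direct translation among N formats needs N(N-1) ordered pairs, which grows roughly as the square of N.

```python
# Toy hub-and-spoke converter: every reader targets one shared tree,
# and every writer starts from it. All names here are hypothetical.

def read_markdown_ish(text):
    """Parse lines starting with '# ' as headings, the rest as paragraphs."""
    tree = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("# "):
            tree.append(("heading", line[2:]))
        else:
            tree.append(("para", line))
    return tree

def read_plain(text):
    """Treat every non-empty line as a paragraph."""
    return [("para", line) for line in text.splitlines() if line.strip()]

def write_html(tree):
    """Emit one HTML element per node (no escaping; this is only a toy)."""
    tags = {"heading": "h1", "para": "p"}
    return "\n".join(f"<{tags[kind]}>{body}</{tags[kind]}>" for kind, body in tree)

def write_plain(tree):
    """Emit plain text, marking headings by upper-casing them."""
    return "\n".join(body.upper() if kind == "heading" else body
                     for kind, body in tree)

READERS = {"markdown": read_markdown_ish, "plain": read_plain}
WRITERS = {"html": write_html, "plain": write_plain}

def convert(text, source, target):
    """Any reader can be paired with any writer via the shared tree."""
    return WRITERS[target](READERS[source](text))

def translators_needed(n):
    """Return (direct pairwise translators, readers plus writers) for n formats."""
    return n * (n - 1), 2 * n

print(convert("# Title\nSome text", "markdown", "html"))
for n in (5, 20, 40):
    direct, hub = translators_needed(n)
    print(f"{n} formats: {direct} direct translators vs {hub} readers and writers")
```

Adding a third reader to this sketch would make its format convertible to both outputs without touching either writer, which is the property described above.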

In practice, the most effective way to start using Pandoc for a new document is to write it in Pandoc's extended dialect of Markdown. This is because the readers for different formats vary in their abilities, but the Markdown reader is fully featured and complete. That variability is necessarily the case, because not every document format can be reliably translated into an AST. For example, LaTeX and Typst can embody arbitrary programs, whose output cannot be determined without running their respective compilers.

If a Markdown document is saved in a file called example.md, this command would translate it into a Typst document:

    pandoc example.md -o example.typ

The output format is inferred from the file extension. To translate it into HTML, a .html extension can be used instead. Another option is to specify the output format with the "-t" flag, for example "-t html".

Pandoc's Markdown dialect incorporates additions to the original specification that make it more suitable for expressing complex documents, while retaining the signature advantage of being pleasant to parse by eye. In addition to everything in traditional Markdown, it can handle ordered lists with letters or Roman numerals, task lists, definition lists, various styles of tables, LaTeX mathematics, footnotes, bibliographic citations, and quite a bit more.

Authors can also insert raw markup into the Markdown source, which is passed directly to the output file, bypassing Pandoc processing. Pandoc supplies syntax for raw markup that allows different fragments to be inserted in the output, depending on the format. For example, this code listing renders the name of Donald Knuth's typesetting engine in plain text, HTML, and (La)TeX; the appropriate fragment is used depending on the output format passed to Pandoc on the command line (or implied by the output file extension).

   The `TeX`{=plain}
     `T<span style="position:relative;top:0.2em">E</span>X`{=html}
     `\TeX\ `{=latex} typesetting system.  

The figure below shows all three versions, as they appear in a terminal, a web browser, and a PDF viewer, respectively. The PDF version was generated by LuaLaTeX processing the LaTeX output.

[Output from raw markup]

Citations

Pandoc has sophisticated support for citations and bibliographies. This feature is what makes it possible for scientists and scholars to write in Pandoc's extended Markdown rather than directly in LaTeX. Pandoc borrows the relevant syntax and mechanisms from LaTeX, using similar markup to indicate citations, the same BibTeX text-file databases of references, and the same standard Citation Style Language (CSL) bibliography style files.

As a simple example of how this works, here is a fragment of a document written in Pandoc's Markdown:

   Free software [@fsfs] is important for physics [@heisenbergBeyond] and
   other sciences.  

The bits inside the square brackets indicate references to resources defined in a BibTeX database. This is a text file containing the bibliographic data for papers, books, websites, and anything else that an author might want to refer to. Here is the relevant part of the database for this document:

   @article{fsfs,
    author = {Lee Phillips},
    journal = {LWN},
    month = {June},
    title = {{T}he {I}mportance of {F}ree {S}oftware to {S}cience},
    url = {https://lwn.net/Articles/1023299/},
    year = {2025}
   }
   
   @book{heisenbergBeyond,
    address = {New York},
    author = {Werner Heisenberg},
    isbn = {06-131622-9},
    publisher = {Harper {and} Row},
    title = {{P}hysics and {B}eyond},
    url = {https://www.amazon.com/Physics-Beyond-Encounters-Conversations-Perspectives/dp/0061316229/},
    year = {1971}
   }  

The fragment can then be processed with Pandoc to create output, in any of the formats that it supports, with references and a bibliography. The command for doing so is:

   pandoc --citeproc --bibliography=refs.bib refsExample.md -o refsExample.html

In this command, the "--citeproc" flag tells Pandoc to use its citation machinery and "--bibliography=" points it to the BibTeX file. The input file is refsExample.md, the file extension indicating Markdown. The output will be in HTML5 because of the extension for the output file. The result will be formatted using the default bibliography style, which happens to be the Chicago Manual of Style author-date format. Different styles can be used by downloading the desired CSL file (a good source is the Zotero style repository).

The figure below shows how the result appears in my web browser, first using the default bibliography style and, below that, using a numerical superscript style. The Pandoc command line is the same in both cases; the only change was the installation of a different CSL file in the default location in the filesystem (another option is to use a "--csl" flag to point Pandoc at the desired file).

[Citations]

In a real project the bibliography would be set off from the main text by a header and, probably, a different font, both of which can be conveniently added by Pandoc.

We can even have plain text output with citations, for those occasions when our emails or website comments require maximum pedantry. Here's the output of the same command with the "-t plain" flag added to request plain text:

   Free software¹ is important for physics² and other sciences.
   
   1. Phillips, Lee. (2025). The Importance of Free Software to Science.
   LWN. https://lwn.net/Articles/1023299/
   
   2. Heisenberg, Werner. (1971). Physics and Beyond. Harper and Row. New
   York. ISBN 06-131622-9.

Because it is a command-line program, Pandoc can be flexibly incorporated into various workflows and used in concert with other tools. I have a keyboard shortcut defined in my editor that processes any selected text through Pandoc, using "--citeproc", and places the plain-text output in the clipboard, ready for pasting into emails or comment boxes.
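
A workflow of this kind can be scripted in a few lines. The following Python sketch is one hypothetical way to do it, not the editor shortcut described above: the function names and the refs.bib path are placeholders, and it assumes that a pandoc executable is on the PATH. Only the command-line options already mentioned in this article, plus the long forms "--from" and "--to", are used.

```python
import subprocess

def pandoc_plain_args(bibliography, csl=None):
    """Build the argument list for Markdown-to-plain-text with citations."""
    args = ["pandoc", "--from=markdown", "--to=plain",
            "--citeproc", f"--bibliography={bibliography}"]
    if csl is not None:
        args.append(f"--csl={csl}")  # optional citation style file
    return args

def markdown_to_plain(text, bibliography, csl=None):
    """Pipe text through pandoc and return its plain-text output.

    Raises subprocess.CalledProcessError if pandoc exits with an error,
    and FileNotFoundError if pandoc is not installed.
    """
    result = subprocess.run(
        pandoc_plain_args(bibliography, csl),
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout

# Show the command that would be run, without actually running it.
print(" ".join(pandoc_plain_args("refs.bib")))
```

Keeping the argument list in its own function makes it easy to inspect or log the exact command before handing any text to Pandoc.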

Pandoc filters

A Pandoc filter is a Lua program that acts directly upon the AST. Filters can make changes to document elements and can perform arbitrary computations based on the contents of those elements. For example, a user might create a filter to number all of the paragraphs in a report, or to print a timestamp under an article's title.

There is also an older type of filter that manipulates a JSON serialization of the AST. These filters can be written in any language, but they have extra dependencies and are slower, as they require serialization and deserialization steps. The current recommendation is to write filters in Lua.
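
As a hedged illustration of the JSON route, the following Python sketch numbers the top-level paragraphs of a document, one of the filter ideas mentioned above. It assumes the JSON serialization that "pandoc -t json" emits, in which each node is an object with a type tag "t" and, where applicable, contents "c". The sample document and the function name are made up for the example, and a real filter would read the JSON from standard input and write the result to standard output rather than using an inline sample.

```python
import json

def number_paragraphs(doc):
    """Prefix each top-level Para with "[n]" and a space, in document order."""
    count = 0
    for block in doc["blocks"]:
        if block["t"] == "Para":
            count += 1
            label = [{"t": "Str", "c": f"[{count}]"}, {"t": "Space"}]
            block["c"] = label + block["c"]
    return doc

# A hand-written stand-in for what `pandoc -t json` would produce
# (the API version number here is only a placeholder).
sample = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [{"t": "Str", "c": "First."}]},
        {"t": "HorizontalRule"},
        {"t": "Para", "c": [{"t": "Str", "c": "Second."}]},
    ],
}

# The document is modified in place and returned for convenience.
print(json.dumps(number_paragraphs(sample), indent=2))
```

Paragraphs nested inside other blocks, such as block quotes or list items, are deliberately ignored here to keep the sketch short. A script of this kind, once wired to standard input and output, would be applied with Pandoc's "--filter" option.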

Using Lua filters involves no dependencies, as Pandoc has a Lua interpreter built in. In fact, entering "pandoc lua" in the terminal opens a standard Lua read-eval-print loop (REPL).

Here's an example of a simple filter that changes all underlined elements to Strong elements, which are rendered in most output formats as boldface:

   function Underline(elem)
       return pandoc.Strong(elem.content)
   end

To apply this filter, the program can be invoked with the argument "--lua-filter filter.lua" if filter.lua in the current directory contains the code above.

The function doesn't make much sense outside of the Pandoc context, but filters have access to an API that, somewhat magically, turns function names into routines that walk the AST and process nodes matching the name of the function. Another example is a filter that I wrote that allows me to input equations using the more concise Typst math syntax, rather than LaTeX, even when I need to produce a LaTeX document.
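
The name-based dispatch can be pictured with a toy model. The Python sketch below is not how Pandoc is implemented; it merely imitates the idea on a simplified tree that borrows the "t" (type) and "c" (contents) keys of Pandoc's JSON serialization. The walk function and the handler table are hypothetical names chosen for the illustration.

```python
def walk(node, handlers):
    """Rebuild the tree bottom-up, calling the handler named after each node type."""
    if isinstance(node, list):
        return [walk(child, handlers) for child in node]
    if isinstance(node, dict) and "t" in node:
        node = dict(node)  # shallow copy so the input is left untouched
        if isinstance(node.get("c"), list):
            node["c"] = walk(node["c"], handlers)
        handler = handlers.get(node["t"])
        return handler(node) if handler else node
    return node  # strings, numbers, and other leaves pass through

def Underline(elem):
    """Mirror of the Lua filter above: swap the wrapper, keep the contents."""
    return {"t": "Strong", "c": elem["c"]}

# The table maps node-type names to the functions that handle them.
handlers = {"Underline": Underline}

para = {"t": "Para", "c": [
    {"t": "Str", "c": "very"},
    {"t": "Space"},
    {"t": "Underline", "c": [{"t": "Str", "c": "important"}]},
]}

print(walk(para, handlers))
```

In this model, the author of a filter only supplies functions such as Underline; the traversal and the matching of names to node types are handled by the surrounding machinery.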

Filters are a powerful means to bend Pandoc to serve nearly any document-processing task, limited only by the imagination and patience of the author/programmer. Other potential applications might be to retrieve information from the Internet to insert into documents, to invoke external programs to create illustrations, or to assemble outlines.

Custom readers and writers

Although Pandoc can handle an impressive variety of document formats, it would not make sense to include every highly specialized or ad hoc markup system in existence. For more arcane applications, Pandoc provides the ability to create custom readers and custom writers. A custom reader is a Lua program that parses input text and translates it into Pandoc's AST; a writer does the reverse. To make this more convenient, Pandoc includes Lua's LPeg parsing library.

Some examples of custom readers are pandoc-fountain, which ingests screenplays written using Fountain markup, the IDML Pandoc reader that understands the markup language used in Adobe InDesign projects, and Lean.lua that can handle Lean files. Interesting custom writers include Pandoc Terminal Writer for colorful pretty-printing on the terminal, Pandoc to PreTeXt that creates markup in the PreTeXt format, which is a structural markup language for textbooks and papers, and a custom writer that emits the specialized format used in a discussion forum.

Changes for 3.9

Pandoc 3.9 was released in early February. The most prominent new feature is the ability to compile Pandoc to Wasm, allowing it to run entirely in a web browser. The project has provided an example page that allows users to do general document conversions via a web form. Pandoc in the browser leads to some interesting possibilities. For example, it should be practical to construct pages that accept user content in nearly any markup format, either in comment boxes or as wiki-style editable text, and convert it in realtime to valid HTML5 for inclusion in the page. Another possibility might be educational sites that, for instance, teach Typst or LaTeX with live feedback.

The new release brings with it dozens of small improvements and bug fixes. Most of these are technical minutiae having to do with corner cases, layout subtleties, or obscure markup formats.

There are two enhancements that may be of wider interest. One is wider support for specifying PDF standards (PDF/A, etc.), as long as LuaLaTeX is used to produce the output. The other is a significant improvement to citation handling that allows the author to reset citation history when needed, typically at the beginnings of book chapters. Standard practice is to abbreviate citations after the first mention (omitting publication details, etc.). But at the beginning of a new chapter, there may be a desire to see the full citation again.

Installations and origins

Pandoc is a free program, released under the GPLv2 (or later). It is available in the package repositories of most Linux distributions, but those versions may lag behind the latest release. The latest release is available on the web site as source, along with binaries for Wasm, Windows, macOS, and Linux for both major architectures. Those who want to use Pandoc to create PDFs must also install LaTeX, Typst, or another program that can convert one of Pandoc's output formats to PDF. Pandoc will use LaTeX by default, but the user can specify another program.

The Pandoc documentation is quite good, and is all I've ever needed to support my extensive and daily use of the system.

Pandoc was created by John MacFarlane, a professor of philosophy at UC Berkeley specializing in the philosophy of language. The project is distinguished by being one of the few widely used Haskell programs. In an interview from 2023 its creator said:

I think Pandoc is, in general, written in fairly simple Haskell. I don't use too many complicated things in there. And that's partly due to the fact that I wasn't a very sophisticated Haskeller when I started, and I'm still not that much more sophisticated now.

Conclusions

Pandoc is indispensable for anyone who needs to create different kinds of written documents. With Pandoc, I can write for a publication that insists on Word documents without ever touching a GUI program, turn a set of notes into a web page, or have a book's illustrations automatically generated. Pandoc is fast and never, in my experience, fails to do exactly what it is supposed to do.

The program is actively developed and continually improved in its GitHub repository. Contributions are welcomed, and those interested in helping are not necessarily excluded by a lack of intimacy with Haskell. As MacFarlane remarked in the interview: "it's possible for people who are generally familiar with computer languages to look at some Pandoc code sometimes, even if they don't know Haskell, and figure out what might be needed".

Pandoc is most powerful as a command-line tool, where it can be incorporated into scripts and adapted to various document-processing pipelines. However, its new ability to be embedded in web pages is an interesting development that may lead to a wider variety of uses and, perhaps as a secondary effect, to an expanded interest in Haskell.



Yes, the Haskell compiler (GHC) can compile to JS or Wasm now

Posted Apr 1, 2026 20:55 UTC (Wed) by dcoutts (subscriber, #5387) [Link]

> The February release of Pandoc 3.9, which comes with the ability to compile the program to WebAssembly (Wasm), allowing Pandoc to run in web browsers, will likely also be of interest.

Much credit for this goes to the many contributors to GHC (the Haskell compiler) who have been working on the JS and Wasm backends for GHC in the last few years. The wasm backend was included in ghc 9.6 in 2023, and has matured significantly over the last few major releases. This now allows more-or-less any Haskell program to be compiled to JS or Wasm. The Wasm backend supports FFI for interacting with JS, or with C code libraries compiled to wasm.

A bit of history: ghcjs was started around 15 years ago or so, as a fork of ghc to compile to JS. (Indeed I used an early version of ghcjs as one of the organisers of the 2014 ICFP programming contest to provide contestants with a web-based reference simulator for the task.) More recently, the ghcjs author and other GHC contributors have been working on integrating ghcjs as a proper backend in the mainline ghc, and concurrently a project for a wasm backend was started too, which shares much of the same infrastructure.

Pandoc also is invaluable for cheap-and-dirty retrieval-augmented generation.

Posted Apr 4, 2026 23:30 UTC (Sat) by ejr (subscriber, #51652) [Link] (1 responses)

If you're just playing with LLMs and want to feed in relevant documentation as needed, pandoc can produce good Markdown output for just about anything. And almost all the embedding / vector search gizmos can consume Markdown without effort.

The result may not provide the world's best semantic context, but it sure does function with very little effort. And with a grand total of one outside tool: Pandoc.

Want to grab information from texinfo files? Pandoc. Docbook? Pandoc. That epub you bought from O'Reilly? Pandoc. The vendors' spec sheets in some random version of Word? Pandoc. A bibliography database shared among your peers (well, BibTeX or CSLJSON)? Pandoc. Jira markup? Pandoc. The Jupyter notebook demo? Pandoc.

Seriously, it's wonderfully ridiculous! And if it's a different format, there's some formatter for it somewhere that'll turn it into something Pandoc can manipulate. There are all sorts of fancier gizmos in doclings, LangChain, etc., but this is just a command you stick in a script and feed to their default ingest.

(Aside: When building RAG assistance for an LLM, etc., do keep in mind that you are explicitly choosing documents that you include. Making a conscious *decision* should affect the licensing of the output... e.g. I will not distribute parts of my Emacs configuration because the GFDL of the docs doesn't combine with the GPL of the implementation. I *chose* to include those pieces, so the licensing absolutely applies in my non-lawyer opinion.)

Pandoc also is invaluable for cheap-and-dirty retrieval-augmented generation.

Posted Apr 9, 2026 17:53 UTC (Thu) by ejr (subscriber, #51652) [Link]

Self-correction: Pandoc doesn't handle texinfo natively, but it does a great job with texi2any's other output formats.

Still nothing better than BibTeX format?

Posted Apr 5, 2026 16:18 UTC (Sun) by ceplm (subscriber, #41334) [Link] (7 responses)

> @article{fsfs,
> author = {Lee Phillips},
> journal = {LWN},
> month = {June},
> title = {{T}he {I}mportance of {F}ree {S}oftware to {S}cience},
> url = {https://lwn.net/Articles/1023299/},
> year = {2025}
> }

Can we, please, all agree that this is completely ridiculous to use in this year of Our Lord 2026? Using TeX groups for preserving capitalisation was a bad hack in 1985 and it is not any better now. When will Pandoc be finally able to read directly from Zotero?

Still nothing better than BibTeX format?

Posted Apr 7, 2026 13:38 UTC (Tue) by leephillips (subscriber, #100450) [Link] (1 responses)

It may be ugly, but this is the format that can be reliably used by the most tools, from LaTeX to Typst. And, unless I have to make a manual adjustment (or write an article) I never need to look at it with my bare eyes.

Still nothing better than BibTeX format?

Posted Apr 7, 2026 14:07 UTC (Tue) by daroc (editor, #160859) [Link]

I believe Zotero can export BibTeX information as well. So adding a direct connection is "just" a matter of smoothing out the workflow, and not of enabling some new possibility, which might make it a less urgent prospect for developers contributing to Pandoc.

Still nothing better than BibTeX format?

Posted Apr 7, 2026 22:42 UTC (Tue) by jschrod (subscriber, #1646) [Link] (4 responses)

This usage of braces is not a property of the BibTeX format.

It's a habit caused by a relic, the behavior of old default BibTeX formatting styles. Other styles exist that implement capitalization differently.

Please note that in the TeX world, almost nobody uses BibTeX, the program, any more. In BibLaTeX, one realizes capitalization with LaTeX macros, as one wants it.

Still nothing better than BibTeX format?

Posted Apr 9, 2026 17:52 UTC (Thu) by ejr (subscriber, #51652) [Link] (3 responses)

Actually... The rules for title capitalization depend on publication venue, locale, and other things. It's completely nuts and not at all designed for an acronym-heavy field. Or when people's names may coincide with articles that won't be capitalized (e.g. The). Everything that "must" preserve the acronyms needs some form of the brace-quoting. CSL JSON uses brace quotes as well.

Still nothing better than BibTeX format?

Posted Apr 9, 2026 18:25 UTC (Thu) by jschrod (subscriber, #1646) [Link] (2 responses)

I don't dispute the need for acronym markup, nor for markup of person names in titles (which is a rare thing) or for markup of titles in languages that don't follow English capitalization rules. I'm from Germany after all. Braces have established themselves as good, readable markup for these cases.

I dispute the statement that it's necessary (and/or sensible) to put the first character of every word in the title in braces, which is a completely different horse.

> The rules for title capitalization depend on publication venue, locale, and other things.

Being the author of xindy, I know more about different rules for capitalization than I ever wanted. They are among the reasons why I invented merge rules in xindy.
See, I wrote my first BibTeX style file in 1986 (Oren published BibTeX in 1985...) -- I know exactly what you mean. These problems contribute to my statement that usage of BibTeX doesn't make sense any more. BibLaTeX is more flexible. (Although correct sorting is often still a problem.)

Still nothing better than BibTeX format?

Posted Apr 9, 2026 19:35 UTC (Thu) by ejr (subscriber, #51652) [Link]

Hah! xindy is incredibly useful, thank you.

I mentally lump BibTeX into BibLaTeX, although so many processors still rely on ye olde BibTeX. I've been trying to move to CSL JSON + citeproc. It isn't trivial. Staying with BibTeX *is* trivial for lots of people in computing. I don't have great ideas on how to move the needle, or at least I don't have time to really delve into bib -> json conversion.

Although... The "document understanding models" may make much of this irrelevant. If you just cite the *document* and let the agent gather the rest of the information... hmm. I suspect many of us already do that manually via BibTeX exports, etc. But given the success of LLM-ish OCR systems in responding to tax form queries (as a timely example), the entire input format problem may go away-enough. Then there's the human feedback loop of making sure papers *can* be cited that way. We'd close that pretty quickly, I suspect.

In some of my fields, we can rely on incredibly well-curated BibTeX files maintained by a few. But, as people may notice, the few keep growing... fewer. And few are stepping up to continue their efforts.

Still nothing better than BibTeX format?

Posted Apr 9, 2026 21:21 UTC (Thu) by ceplm (subscriber, #41334) [Link]

I don’t write anything significant anymore (twenty years ago I switched from academia to being a programmer/packager), so just from a distance I am fascinated that there is not greater traction in the LaTeX world for BibTeX replacements like Zotero or for example papis (https://github.com/papis/papis). Any thoughts?


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds