Pandoc is also invaluable for cheap-and-dirty retrieval-augmented generation.
Posted Apr 4, 2026 23:30 UTC (Sat) by ejr (subscriber, #51652)
Parent article: Pandoc: a workhorse for document conversion
The result may not provide the world's best semantic context, but it sure does function with very little effort. And with a grand total of one outside tool: Pandoc.
Want to grab information from texinfo files? Pandoc. Docbook? Pandoc. That epub you bought from O'Reilly? Pandoc. The vendors' spec sheets in some random version of Word? Pandoc. A bibliography database shared among your peers (well, BibTeX or CSLJSON)? Pandoc. Jira markup? Pandoc. The Jupyter notebook demo? Pandoc.
Seriously, it's wonderfully ridiculous! And if it's a different format, there's some formatter for it somewhere that'll turn it into something Pandoc can manipulate. There are all sorts of fancier gizmos in Docling, LangChain, etc., but this is just a command you stick in a script, then feed the output to their default ingest.
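For the curious, here is roughly what "a command you stick in a script" could look like. This is only a sketch under my own assumptions: the extension list, the function names, and the `.txt` naming scheme are all made up for illustration, and plain text is just one reasonable target for an ingest step. The only Pandoc options used are `--to plain`, `--wrap=none`, and `--output`; Pandoc guesses the input format from the file extension.

```python
import subprocess
from pathlib import Path

# Illustrative (assumed) set of extensions to pick up; adjust to taste.
EXTENSIONS = {".docx", ".epub", ".ipynb"}


def pandoc_command(src: Path, dest: Path) -> list[str]:
    """Build the pandoc invocation that turns src into plain text at dest."""
    return [
        "pandoc", str(src),
        "--to", "plain",      # plain-text writer
        "--wrap=none",        # keep each paragraph on one line
        "--output", str(dest),
    ]


def plan_conversions(in_dir: Path, out_dir: Path) -> list[tuple[Path, Path]]:
    """Pair every matching file under in_dir with a .txt path under out_dir."""
    plan = []
    for src in sorted(in_dir.rglob("*")):
        if src.is_file() and src.suffix.lower() in EXTENSIONS:
            rel = src.relative_to(in_dir)
            # Append ".txt" to the full name so a.docx and a.epub do not collide.
            plan.append((src, out_dir / rel.with_name(rel.name + ".txt")))
    return plan


def convert_tree(in_dir: Path, out_dir: Path) -> list[Path]:
    """Run pandoc on every planned file; requires pandoc on PATH."""
    written = []
    for src, dest in plan_conversions(in_dir, out_dir):
        dest.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(pandoc_command(src, dest), check=True)
        written.append(dest)
    return written
```

Calling something like `convert_tree(Path("docs"), Path("corpus"))` (both directory names hypothetical) would leave a tree of text files that whatever default loader you use can pick up as-is.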
(Aside: When building RAG assistance for an LLM, etc., do keep in mind that you are explicitly choosing documents that you include. Making a conscious *decision* should affect the licensing of the output... e.g. I will not distribute parts of my Emacs configuration because the GFDL of the docs doesn't combine with the GPL of the implementation. I *chose* to include those pieces, so the licensing absolutely applies in my non-lawyer opinion.)
