Grammar tools for Emacs

By Nathan Willis
June 29, 2016

Last week, we noted that grammar-checking constitutes a daunting challenge for humans and software alike. Consequently, it is hardly surprising that automated tools to assess grammatical correctness remain few and far between. While there are several options available for simple stylistic checks (which we examined last week), deep grammatical analysis typically requires connecting to a dedicated grammar engine.

As was the case with our previous installment, the utilities described are available for Emacs, but the landscape is more or less equivalent for other editors. The set of tools is also limited to English, although few other languages offer a significantly better palette of options.

Broadly speaking, grammar engines for natural language operate in the same manner as software tools designed to process programming languages. Individual tokens in a document first need to be identified and classified, then the various sentences and paragraphs (or, potentially, shorter snippets) can be parsed and put into a syntax tree. That analogy starts to break down, however, because natural language, unlike a well-designed programming language, does not have strict, formal rules that define correct syntax. English in particular attracts a great deal of argument where many so-called "grammar rules" are concerned. LWN commenters on last week's story even provided several examples of such vigorous debate.

The upshot is that researchers devote a lot of time to modeling natural languages and to developing rule sets that produce satisfactory results. But this research is ongoing and does not have a definite stopping point, so the software written by experts in the field is always a moving target—in addition, of course, to each program reflecting a different approach to the problem.

What that means for Emacs users is that any grammar-checking utility linked to a grammar engine is linked to one particular engine from one particular team of researchers. Thus, choosing between tools can demand considerable testing time, on top of any assessment about the suitability of the software available for running the engine itself.

Link Grammar

The Link Grammar Parser was originally written at Carnegie Mellon University, but maintenance as a free-software project has subsequently been taken over by a team from the AbiWord project. The code is hosted at GitHub and the latest release is version 5.3.7, from May 7. In addition to English, there are data sets available to use Link Grammar with Russian, German, Lithuanian, Arabic, Farsi, Hebrew, Vietnamese, Indonesian, Kazakh, and Turkish.

In 2010, there was an Emacs tool designed to work with Link Grammar: Baoqiu Cui's grammar checker, which was included (along with one other utility) in his bcui-emacs project on Google Code. Link Grammar provides bindings for Common Lisp (among several languages), but Cui's implementation utilizes a small standalone C++ program to call Link Grammar's API.

Unfortunately, that program is badly out of date; the Link Grammar API has changed (Cui's code was released during the Link Grammar 4.x era), and Cui's C++ wrapper appears not to have undergone much in the way of testing even when it was initially published. While that is understandable (it was, after all, a personal experiment), it is a disappointing dead end.

Nevertheless, the possibility exists that Cui's grammar.el code could be revived and brought up to speed. The Emacs script was designed to provide a minor mode that performed grammar-checking on the fly, processing newly typed sentences every time the user pauses for more than a specified time period (by default, three seconds). But it never provided much in the way of configuration options for working with the Link Grammar engine; a revival attempt would likely need to be honed for performance, which would be a tricky task unto itself.
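
The pause-then-check behavior described above can be sketched with an Emacs idle timer. This is not Cui's code: every name beginning with my-grammar- is hypothetical, and the "check" is a placeholder that merely echoes the sentence at point where a real implementation would hand it to a grammar engine.

    (require 'thingatpt)

    ;; Hypothetical names throughout; none of this comes from grammar.el.
    (defvar my-grammar-idle-delay 3
      "Seconds of idle time to wait before checking.")

    (defvar-local my-grammar--timer nil
      "Idle timer created for this buffer, or nil when the mode is off.")

    (defun my-grammar-check-sentence (buffer)
      "Placeholder check: echo the sentence at point in BUFFER."
      (when (buffer-live-p buffer)
        (with-current-buffer buffer
          (let ((bounds (bounds-of-thing-at-point 'sentence)))
            (when bounds
              (message "Would check: %s"
                       (buffer-substring-no-properties
                        (car bounds) (cdr bounds))))))))

    (define-minor-mode my-grammar-mode
      "Check the sentence at point whenever Emacs has been idle for a while."
      :lighter " Gram"
      ;; Always drop any old timer first, so toggling never stacks timers.
      (when my-grammar--timer
        (cancel-timer my-grammar--timer)
        (setq my-grammar--timer nil))
      (when my-grammar-mode
        (setq my-grammar--timer
              (run-with-idle-timer my-grammar-idle-delay t
                                   #'my-grammar-check-sentence
                                   (current-buffer)))))

Enabling my-grammar-mode in a text buffer and pausing for three seconds should echo the current sentence. A fuller version would also cancel the timer when the buffer is killed and would skip sentences that have not changed since the last check, which is where the performance tuning mentioned above would come in.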

Langtool

LanguageTool is a Java-based free-software grammar-checking engine. The project actively maintains plugins for LibreOffice and OpenOffice as well as add-ons for Firefox and Chrome/Chromium. LanguageTool requires Java 8 and it supports a lengthy list of languages. It is worth noting that LanguageTool and Link Grammar operate in entirely different manners. Whereas Link Grammar attempts to parse sentences grammatically and records errors that it encounters along the way, LanguageTool checks text against a database of error patterns (n-grams).

Masahiro Hayashi's langtool.el is a front-end for LanguageTool. To use it, one must set the langtool-language-tool-jar variable to the path of the LanguageTool .jar file, then load the script in Emacs. The script supplies functions to process a text region or buffer, looking for possible errors. Each error is highlighted, and the langtool-show-message-at-point command will display an explanation in the echo area for the error under the cursor.
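
A minimal init-file sketch of that setup might look like the following. The .jar path is hypothetical and the key bindings are arbitrary examples rather than defaults shipped by langtool.el; the sketch also assumes that langtool-check, langtool-check-done, and langtool-show-message-at-point are available as interactive commands in the installed version.

    ;; Hypothetical path; point it at your own LanguageTool download.
    (setq langtool-language-tool-jar
          (expand-file-name "~/opt/LanguageTool/languagetool-commandline.jar"))
    (require 'langtool)

    ;; Example bindings only (not langtool.el defaults).
    (global-set-key (kbd "C-c g c") #'langtool-check)                 ; run a check
    (global-set-key (kbd "C-c g m") #'langtool-show-message-at-point) ; explain error at point
    (global-set-key (kbd "C-c g d") #'langtool-check-done)            ; clear the highlights

The variable is set before the library is loaded simply to keep all of the configuration in one place; langtool.el only needs the path to be valid by the time a check is started.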

The script does not, however, perform on-the-fly grammar checking. That means a potentially annoying wait is involved for every check, since LanguageTool can take ten or fifteen seconds to scan even a moderately sized document. Checking a file as one works means running repeated scans and waiting that amount of time for every run.

In my own tests, LanguageTool flagged a few obvious errors (such as "Grammar be hard"), but nothing particularly tricky to catch by eye. The bigger issue, however, is that LanguageTool also flags what it considers spelling mistakes and questionable punctuation (such as "dumb" quotation marks), thus creating a great many false positives. This is especially true if one is writing in a markup language like HTML or uses a lot of peculiar terms (or free-software project names).

There does not seem to be an easy way to disable the spelling checks; a dictionary is built into each language's grammar module. On the plus side, the langtool.el script does make it easy to switch between multiple languages (with its langtool-switch-default-language function), and regional variations are supported (e.g., en-US and en-GB).
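
LanguageTool itself can be told to skip individual rules by ID, so one possible workaround is to pass a list of disabled rules through the front-end. The fragment below is a sketch under two assumptions: that the installed langtool.el defines a langtool-disabled-rules variable, and that the rule IDs shown match the LanguageTool release in use. Both should be checked locally before relying on it.

    ;; Assumption: this langtool.el release defines `langtool-disabled-rules'.
    ;; The rule IDs are examples; verify them against your LanguageTool version.
    (setq langtool-default-language "en-US")
    (setq langtool-disabled-rules
          '("MORFOLOGIK_RULE_EN_US"   ; dictionary-based spelling rule for en-US
            "EN_QUOTES"))             ; typographic ("smart") quotation marks

If the variable is honored, the next check should no longer report spelling or quotation-mark complaints, leaving only the grammar rules.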

In the long term, LanguageTool's n-gram database of error patterns should flag fewer potential errors than a full-blown natural language parser like Link Grammar. But amending LanguageTool's database can only be done by editing the XML rules for the language in question.

ATD

Another option available to Emacs users willing to do a bit of legwork is After the Deadline (ATD), an open-source grammar checker that is offered as a remote web service run by Automattic (the company behind WordPress.com). The ATD service itself is free "for your personal use", while commercial users are told to run their own server.

The source code for the server is available under the GPLv2 and is Java-based, because ATD, under the hood, uses LanguageTool for its grammar engine. But the service boasts that it uses a custom rule set that improves on that employed by the competition. On the source-code site, only a minimal language data set is available, however; the page notes that "I can’t release all data we use".

Leandro Lisboa Penz has developed atdtool, a command-line Python utility that queries the ATD web service. On its own, atdtool could be used as a post-processor for text files, but in a 2012 discussion on Reddit, user "mrdbr" pointed out how easy it is to hook atdtool into Emacs's compilation command. Mrdbr's suggestion is a short function:

    (defun check-grammar ()
      "Checks the current buffer with atdtool"
      (interactive)
      (compile (concat "atdtool " (shell-quote-argument (buffer-file-name)))))

which works—and rather quickly—although it is far from convenient. The ATD service returns XML output by default, and hooking into the Emacs compile command results in a nicely compact report, but one that would need to be processed separately to be of immediate use.
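
One rough edge in that function is that it fails on a buffer that is not visiting a file and checks a stale copy if the buffer has unsaved changes. A hedged variant follows; the name check-grammar-saved is hypothetical, and it assumes (as the original does) that an atdtool executable is on the PATH and accepts a file name as its only argument.

    ;; Hypothetical variant of the function above; only the guards are new.
    (defun check-grammar-saved ()
      "Save the current buffer if needed, then run atdtool on its file."
      (interactive)
      (unless buffer-file-name
        (user-error "Buffer is not visiting a file; save it first"))
      (when (buffer-modified-p)
        (save-buffer))
      (compile (concat "atdtool " (shell-quote-argument buffer-file-name))))

The report still lands in the compilation buffer as before, so stepping through errors one at a time would require teaching Emacs how to parse atdtool's output, which this sketch does not attempt.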

ATD does seem to catch more subtle errors than LanguageTool on its own, and it runs considerably faster than the standalone LanguageTool .jar file (at least on a desktop-class machine).

Here, too, there seems to be an opportunity for interested developers to push the state of the art forward. ATD currently supports English, French, German, Portuguese, and Spanish. The limitation, however, is the ATD service's vague "personal use" terms of service. It is possible to run one's own ATD instance, but without ATD's custom grammar the results would presumably be less useful.

TextLint

A final note on the grammar-checking front is Damien Cassou's TextLint project, which provides a set of tools to hook Emacs and other editors into Cassou's TextLint parser.

TextLint is a "style checker" like the ones we discussed in the first installment of this series; it uses rules based on The Elements of Style and On Writing Well. Like mrdbr's atdtool function, however, TextLint uses Emacs's compilation function to run a separate program and capture the output. So, in practice, it has more in common with the external grammar-engine tools than it does with the simple blacklist-based style checkers.

Cassou's program is derived from the SmallLint code checker for Smalltalk and uses PetitParser, a tool for building grammar parsers. Although it catches only the stylistic "errors" discussed last week (which not everyone agrees are worthwhile), it is perhaps notable because it does what several in the Reddit discussion were hoping for from atdtool: it allows the user to examine errors one by one.

Unclear the future is

The more complete grammar checkers discussed here offer some hope to users who find simple style checkers less than satisfactory, but there is clearly still room for improvement. Some Emacs users may have an aversion to running an external Java tool (given that language's reputation for bloat and security issues in years past), much less connecting to a remote web service, but that appears to be where the most complete results are to be found.

It is a bit unfortunate, too, that no one is actively pursuing further development of a Link-Grammar–based utility, because it would be interesting to compare that project's results against the model used by both LanguageTool and ATD. The latter project would do well to reconsider opening up its data set, since contributions by the community could only improve its usefulness. Considering how rarely the subject of grammar-checking comes up in free-software circles, it may simply be a matter of no one having pushed the issue.

For now, users in need of a grammar checker with Emacs integration should look to langtool.el. While there are several opportunities to develop interesting alternatives, someone would need to step up to do the work. As to whether or not there is any chance of that happening in the foreseeable future, well, there ain't no tellin'.

Posted Jun 30, 2016 12:34 UTC (Thu) by nbecker (subscriber, #35200)

What about proselint?

Posted Jun 30, 2016 18:33 UTC (Thu) by tfx2 (guest, #92047)

While strange that proselint wasn't mentioned in this series, there was a review on proselint back in March.

Posted Jun 30, 2016 19:22 UTC (Thu) by dnaber (guest, #56178)

I'm the author of LanguageTool and I'd like to comment on a few things:

> LanguageTool checks text against a database of error patterns (n-grams)

This is a bit misleading: LT scans the text for error patterns, but this has nothing to do with n-grams. The n-gram database can optionally be used to detect common word confusions like there/their or typos like just/juts. LT has more than 500 of these pairs for English when the n-gram data is available. Unfortunately, the n-gram data is huge and further slows down LT, which is why it should usually be used in server mode. LT provides an embedded HTTP(S) server which also avoids the issue of Java's slow start-up. With that, checking a single English sentence takes about 15ms on my computer.

> In my own tests, LanguageTool flagged a few obvious errors (such as "Grammar be hard"), but nothing particularly tricky to catch by eye.

It's not clear to me whether the n-gram data was used in this test, but for English it should make a difference (although the data is used for statistics, so there's no *guarantee* that errors are found).

> creating a great many false positives

With LT you can activate/deactivate any rule easily, so this seems to be a limitation of the add-on for Emacs.

> because ATD, under the hood, uses LanguageTool for its grammar engine

Mhh, I don't think this is true for English, only for non-English languages. AtD for English was developed independently of LT.