Grammar tools for Emacs
Last week, we noted that grammar-checking constitutes a daunting challenge for humans and software alike. Consequently, it is hardly surprising that automated tools to assess grammatical correctness remain few and far between. While there are several options available for simple stylistic checks (which we examined last week), deep grammatical analysis typically requires connecting to a dedicated grammar engine.
As was the case with our previous installment, the utilities described are available for Emacs, but the landscape is more or less equivalent for other editors. The set of tools is also limited to English, although few other languages offer a significantly better palette of options.
Broadly speaking, grammar engines for natural language operate in the same manner as software tools designed to process programming languages. Individual tokens in a document first need to be identified and classified, then the various sentences and paragraphs (or, potentially, shorter snippets) can be parsed and put into a syntax tree. That analogy starts to break down, however, because natural language does not have the strict, formal rules that define the correct syntax—unlike well-designed programming languages. English in particular attracts a great deal of argument where many so-called "grammar rules" are concerned. LWN commenters on last week's story even provided several examples of such vigorous debate.
The upshot is that researchers devote a lot of time to modeling natural languages and to developing rule sets that produce satisfactory results. But this research is ongoing and does not have a definite stopping point, so the software written by experts in the field is always a moving target—in addition, of course, to each program reflecting a different approach to the problem.
What that means for Emacs users is that any grammar-checking utility linked to a grammar engine is linked to one particular engine from one particular team of researchers. Thus, choosing between tools can demand considerable testing time, on top of any assessment about the suitability of the software available for running the engine itself.
Link Grammar
The Link Grammar Parser was originally written at Carnegie Mellon University, but maintenance as a free-software project has subsequently been taken over by a team from the Abiword project. The code is hosted at GitHub and the latest release was version 5.3.7, from May 7. In addition to English, there are data sets available to use Link Grammar with Russian, German, Lithuanian, Arabic, Farsi, Hebrew, Vietnamese, Indonesian, Kazakh, and Turkish.
In 2010, there was an Emacs tool designed to work with Link Grammar: Baoqiu Cui's grammar checker, which was included (along with one other utility) in his bcui-emacs project on Google Code. Link Grammar provides bindings for Common Lisp (among several languages), but Cui's implementation utilizes a small standalone C++ program to call Link Grammar's API.
Unfortunately, that program is badly out of date; the Link Grammar API has changed (Cui's code was released during the Link Grammar 4.x era), and Cui's C++ wrapper appears not to have undergone much testing even when it was initially published. While that is understandable (it was, after all, a personal experiment), it is a disappointing dead end.
Nevertheless, the possibility exists that Cui's grammar.el code could be revived and brought up to speed. The Emacs script was designed to provide a minor mode that performed grammar-checking on the fly, processing newly typed sentences every time the user pauses for more than a specified time period (by default, three seconds). But it never provided much in the way of configuration options for working with the Link Grammar engine; a revival attempt would likely need to be honed for performance, which would be a tricky task unto itself.
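The idle-time approach described above can be sketched with Emacs's built-in idle timers. This is a minimal illustration only, not Cui's grammar.el code: every my-grammar-* name is hypothetical, and the "check" merely echoes the sentence at point where a real tool would hand the text to a grammar engine.

```elisp
;;; -*- lexical-binding: t -*-
;; Hypothetical sketch; none of the my-grammar-* names come from grammar.el.

(defvar my-grammar-idle-delay 3
  "Seconds of idle time to wait before checking.")

(defvar-local my-grammar--timer nil
  "Idle timer belonging to the current buffer, or nil.")

(defun my-grammar-check-sentence ()
  "Placeholder check: report the sentence at point.
A real implementation would pass the text to a grammar engine."
  (let ((sentence (thing-at-point 'sentence t)))
    (when sentence
      (message "Would check: %s" sentence))))

(defun my-grammar--on-idle (buffer)
  "Run the check if BUFFER is still live and is the current buffer."
  (when (and (buffer-live-p buffer)
             (eq buffer (current-buffer)))
    (my-grammar-check-sentence)))

(define-minor-mode my-grammar-mode
  "Toy minor mode that checks the sentence at point when Emacs is idle."
  :lighter " Gram"
  ;; Always drop any existing timer so that toggling never leaks timers.
  (when my-grammar--timer
    (cancel-timer my-grammar--timer)
    (setq my-grammar--timer nil))
  (when my-grammar-mode
    (setq my-grammar--timer
          (run-with-idle-timer my-grammar-idle-delay t
                               #'my-grammar--on-idle (current-buffer)))))
```

A real revival would probably also need to cancel the timer when the buffer is killed and to avoid re-checking text that has not changed, which is where the performance tuning mentioned above would come in.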
Langtool
LanguageTool is a Java-based free-software grammar-checking engine. The project actively maintains plugins for LibreOffice and OpenOffice as well as add-ons for Firefox and Chrome/Chromium. LanguageTool requires Java 8 and it supports a lengthy list of languages. It is worth noting that LanguageTool and Link Grammar operate in entirely different manners. Whereas Link Grammar attempts to parse sentences grammatically and records errors that it encounters along the way, LanguageTool checks text against a database of error patterns (n-grams).
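To make the contrast with a parser concrete, here is a toy illustration of the pattern-matching approach in Emacs Lisp. It is an assumption-laden sketch, not LanguageTool's rule format or engine: the two rules and the toy-grammar-* names are invented for this example, and a real rule set would be far larger and more careful.

```elisp
;;; -*- lexical-binding: t -*-
;; Toy sketch of pattern-based checking; the rules below are invented
;; for illustration and are not taken from LanguageTool.

(defvar toy-grammar-patterns
  '(("\\bcould of\\b" . "Did you mean \"could have\"?")
    ("\\b\\(\\w+\\) \\1\\b" . "Repeated word"))
  "Alist mapping a regexp for a known error pattern to a message.")

(defun toy-grammar-check (text)
  "Return a list of (POSITION . MESSAGE) for each pattern found in TEXT.
The list is sorted by position."
  (let ((case-fold-search t)
        (hits '()))
    (dolist (rule toy-grammar-patterns)
      (let ((start 0))
        ;; Collect every occurrence of this rule's pattern.
        (while (string-match (car rule) text start)
          (push (cons (match-beginning 0) (cdr rule)) hits)
          (setq start (match-end 0)))))
    (sort hits (lambda (a b) (< (car a) (car b))))))
```

Nothing here builds a syntax tree: the checker knows only the mistakes it has been told about, which is why the quality of such a tool depends so heavily on its rule set.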
Masahiro Hayashi's langtool.el is a front-end for LanguageTool. To use it, one must set the script's langtool-language-tool-jar variable to the path of the LanguageTool .jar file, then load and execute the script in Emacs. The script supplies functions to process a text region or buffer, looking for possible errors. Each error is highlighted, and switching on the langtool-show-message-at-point option will display an explanation in the Emacs status line whenever the cursor is placed over an error.
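A minimal init-file fragment for that setup might look like the following. The .jar path is a placeholder and the key bindings are arbitrary choices; the fragment also assumes that langtool.el is already on the load path and that langtool-show-message-at-point can be called as a command, as the langtool.el documentation describes.

```elisp
;; The path below is a placeholder; point it at the real LanguageTool .jar.
(setq langtool-language-tool-jar "/path/to/languagetool-commandline.jar")
(setq langtool-default-language "en-US")
(require 'langtool)

;; Key bindings are arbitrary choices for illustration.
(global-set-key (kbd "C-x 4 w") #'langtool-check)       ; check region or buffer
(global-set-key (kbd "C-x 4 W") #'langtool-check-done)  ; clear the highlights
(global-set-key (kbd "C-x 4 4") #'langtool-show-message-at-point)
```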
The script does not, however, perform on-the-fly grammar checking. That means a potentially annoying wait is involved for every check, since LanguageTool can take ten or fifteen seconds to scan even a moderately sized document. Checking a file as one works means running repeated scans and waiting that amount of time for every run.
In my own tests, LanguageTool flagged a few obvious errors (such as "Grammar be hard"), but nothing particularly tricky to catch by eye. The bigger issue, however, is that LanguageTool also flags what it considers spelling mistakes and applies some questionable punctuation rules (such as flagging "dumb" quotation marks), thus creating a great many false positives. This is especially true if one is writing in a markup language like HTML or uses a lot of peculiar terms (or free-software project names).
There does not seem to be an easy way to disable the spelling checks; a dictionary is built into each language's grammar module. On the plus side, the langtool.el script does make it easy to switch between multiple languages (with its langtool-switch-default-language function), and regional variations are supported (e.g., en-US and en-GB).
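For readers who want to experiment, more recent versions of langtool.el document a langtool-disabled-rules variable that is passed through to LanguageTool. Whether it was available in the version tested here is not confirmed, and the rule IDs below are assumptions based on LanguageTool's English rule names, so treat this as a sketch to verify against the installed versions.

```elisp
;; Assumption: the installed langtool.el provides `langtool-disabled-rules'.
;; The rule IDs are believed to match LanguageTool's English rules, but
;; should be checked against the rule list of the installed release.
(setq langtool-disabled-rules
      '("MORFOLOGIK_RULE_EN_US"   ; English (US) spelling rule
        "EN_QUOTES"))             ; "smart" quotation-mark rule

;; Regional variants are selected by language code.
(setq langtool-default-language "en-GB")
```

Switching languages interactively is done with M-x langtool-switch-default-language, as noted above.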
In the long term, LanguageTool's n-gram database of error patterns should flag fewer potential errors than a full-blown natural language parser like Link Grammar. But amending LanguageTool's database can only be done by editing the XML rules for the language in question.
ATD
Another option available to Emacs users willing to do a bit of legwork is After the Deadline (ATD), an open-source grammar checker that is offered as a remote web service run by Automattic (the company behind WordPress.com). The ATD service itself is free "for your personal use", while commercial users are told to run their own server. The source code for the server is available under the GPLv2 and is Java-based, because ATD, under the hood, uses LanguageTool for its grammar engine. But the service boasts that it uses a custom rule set that improves on that employed by the competition. On the source-code site, only a minimal language data set is available, however; the page notes that "I can’t release all data we use".
Leandro Lisboa Penz has developed atdtool, a command-line Python utility that queries the ATD web service. On its own, atdtool could be used as a post-processor for text files, but in a 2012 discussion on Reddit, user "mrdbr" pointed out how easy it is to hook atdtool into Emacs's compilation command. Mrdbr's suggestion is a short function:
    (defun check-grammar ()
      "Checks the current buffer with atdtool"
      (interactive)
      (compile (concat "atdtool " (shell-quote-argument (buffer-file-name)))))
which works, and rather quickly, although it is far from convenient.
The ATD service returns XML output by default, and hooking into the Emacs compile command results in a nicely compact report, but one that would need to be processed separately to be of immediate use.
ATD does seem to catch more subtle errors than LanguageTool on its own, and it runs considerably faster than the standalone LanguageTool .jar file (at least on a desktop-class machine).
Here, too, there seems to be an opportunity for interested developers to push the state of the art forward. ATD currently supports English, French, German, Portuguese, and Spanish. The limitation, however, is the ATD service's vague "personal use" terms of service. Without ATD's custom grammar, it is possible to run one's own ATD instance, but the results would presumably be less useful.
TextLint
A final note on the grammar-checking front is Damien Cassou's TextLint project, which provides a set of tools to hook Emacs and other editors into Cassou's TextLint parser.
TextLint is a "style checker" like the ones we discussed in the first installment of this series; it uses rules based on The Elements Of Style and On Writing Well. Like mrdbr's atdtool function, however, TextLint uses Emacs's compilation function to run a separate program and capture the output. So, in practice, it has more in common with the external grammar-engine tools than it does with the simple blacklist-based style checkers.
Cassou's program is derived from the SmallLint code checker for SmallTalk and uses PetitParser, a tool for building grammar parsers. Although it catches only the stylistic "errors" discussed last week (which not everyone agrees are worthwhile), it is perhaps notable because it does what several in the Reddit discussion were hoping for from atdtool: it allows the user to examine errors one by one.
Unclear the future is
The more complete grammar checkers discussed here offer some hope to users who find simple style checkers less than satisfactory, but there is clearly still room for improvement. Some Emacs users may have an aversion to running an external Java tool (given that language's reputation for bloat and security issues in years past), much less connecting to a remote web service, but that appears to be where the most complete results are to be found.
It is a bit unfortunate, too, that no one is actively pursuing further development of a Link-Grammar–based utility, because it would be interesting to compare that project's results against the model used by both LanguageTool and ATD. The latter project would do well to reconsider opening up its data set, since contributions by the community could only improve its usefulness. Considering how rarely the subject of grammar-checking comes up in free-software circles, it may simply be a matter of no one having pushed the issue.
For now, users in need of a grammar checker with Emacs integration should look to langtool.el. While there are several opportunities to develop interesting alternatives, someone would need to step up to do the work. As to whether or not there is any chance of that happening in the foreseeable future, well, there ain't no tellin'.
Posted Jun 30, 2016 12:34 UTC (Thu)
by nbecker (subscriber, #35200)
Posted Jun 30, 2016 19:22 UTC (Thu)
by dnaber (guest, #56178)
> LanguageTool checks text against a database of error patterns (n-grams)
This is a bit misleading: LT scans the text for error patterns, but this has nothing to do with n-grams. The n-gram database can optionally be used to detect common word confusions like there/their or typos like just/juts. LT has more than 500 of these pairs for English when the n-gram data is available. Unfortunately, the n-gram data is huge and further slows down LT, which is why it should usually be used in server mode. LT provides an embedded HTTP(S) server which also avoids the issue of Java's slow start-up. With that, checking a single English sentence takes about 15ms on my computer.
> In my own tests, LanguageTool flagged a few obvious errors (such as "Grammar be hard"), but nothing particularly tricky to catch by eye.
It's not clear to me whether the n-gram data was used in this test, but for English it should make a difference (although the data is used for statistics, so there's no *guarantee* that errors are found).
> creating a great many false positives
With LT you can activate/deactivate any rule easily, so this seems to be a limitation of the add-on for Emacs.
> because ATD, under the hood, uses LanguageTool for its grammar engine
Mhh, I don't think this is true for English, only for non-English languages. AtD for English was developed independently of LT.