Leading items
Grammar tools for Emacs
Last week, we noted that grammar-checking constitutes a daunting challenge for humans and software alike. Consequently, it is hardly surprising that automated tools to assess grammatical correctness remain few and far between. While there are several options available for simple stylistic checks (which we examined last week), deep grammatical analysis typically requires connecting to a dedicated grammar engine.
As was the case with our previous installment, the utilities described are available for Emacs, but the landscape is more or less equivalent for other editors. The set of tools is also limited to English, although few other languages offer a significantly better palette of options.
Broadly speaking, grammar engines for natural language operate in the same manner as software tools designed to process programming languages. Individual tokens in a document first need to be identified and classified, then the various sentences and paragraphs (or, potentially, shorter snippets) can be parsed and put into a syntax tree. That analogy starts to break down, however, because natural language does not have the strict, formal rules that define the correct syntax—unlike well-designed programming languages. English in particular attracts a great deal of argument where many so-called "grammar rules" are concerned. LWN commenters on last week's story even provided several examples of such vigorous debate.
The upshot is that researchers devote a lot of time to modeling natural languages and to developing rule sets that produce satisfactory results. But this research is ongoing and does not have a definite stopping point, so the software written by experts in the field is always a moving target—in addition, of course, to each program reflecting a different approach to the problem.
What that means for Emacs users is that any grammar-checking utility linked to a grammar engine is linked to one particular engine from one particular team of researchers. Thus, choosing between tools can demand considerable testing time, on top of any assessment about the suitability of the software available for running the engine itself.
Link Grammar
The Link Grammar Parser was originally written at Carnegie Mellon University, but maintenance as a free-software project has subsequently been taken over by a team from the Abiword project. The code is hosted at GitHub and the latest release was version 5.3.7, from May 7. In addition to English, there are data sets available to use Link Grammar with Russian, German, Lithuanian, Arabic, Farsi, Hebrew, Vietnamese, Indonesian, Kazakh, and Turkish.
In 2010, there was an Emacs tool designed to work with Link Grammar: Baoqiu Cui's grammar checker, which was included (along with one other utility) in his bcui-emacs project on Google Code. Link Grammar provides bindings for Common Lisp (among several languages), but Cui's implementation utilizes a small standalone C++ program to call Link Grammar's API.
Unfortunately, that program is badly out of date; the Link Grammar API has changed (Cui's code was released during the Link Grammar 4.x era), and Cui's C++ wrapper appears not to have undergone much testing even when it was initially published. While understandable (it was, after all, a personal experiment), it is a disappointing dead end.
Nevertheless, the possibility exists that Cui's grammar.el code could be revived and brought up to speed. The Emacs script was designed to provide a minor mode that performed grammar-checking on the fly, processing newly typed sentences every time the user pauses for more than a specified time period (by default, three seconds). But it never provided much in the way of configuration options for working with the Link Grammar engine; a revival attempt would likely need to be honed for performance, which would be a tricky task unto itself.
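The pause-triggered checking described above could be approximated with an Emacs idle timer. What follows is a minimal sketch, not Cui's code: every name prefixed with my-grammar is hypothetical, and the check itself is a placeholder that only prints a message where a real mode would hand text to a grammar engine.

```elisp
(defvar my-grammar-idle-delay 3
  "Seconds of idle time before a check is triggered (hypothetical name).")

(defvar-local my-grammar--idle-timer nil
  "Idle timer owned by `my-grammar-idle-mode' in this buffer.")

(defun my-grammar--check-buffer (buffer)
  "Placeholder check for BUFFER; a real mode would call a grammar engine."
  (when (buffer-live-p buffer)
    (with-current-buffer buffer
      (message "Grammar check would run on %s" (buffer-name)))))

(define-minor-mode my-grammar-idle-mode
  "Run a placeholder grammar check whenever Emacs has been idle for a while."
  :lighter " Gram"
  ;; Always drop any existing timer first, so that toggling the mode
  ;; never leaves two timers running for the same buffer.
  (when my-grammar--idle-timer
    (cancel-timer my-grammar--idle-timer)
    (setq my-grammar--idle-timer nil))
  ;; When the mode is being enabled, start a repeating idle timer that
  ;; fires after `my-grammar-idle-delay' seconds without user input.
  (when my-grammar-idle-mode
    (setq my-grammar--idle-timer
          (run-with-idle-timer my-grammar-idle-delay t
                               #'my-grammar--check-buffer
                               (current-buffer)))))
```

Enabling my-grammar-idle-mode in a buffer starts the timer, and disabling it cancels the timer. An idle timer, rather than a hook that runs on every keystroke, keeps the check out of the way while the user is typing, which matters when each check is expensive.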
Langtool
LanguageTool is a Java-based free-software grammar-checking engine. The project actively maintains plugins for LibreOffice and OpenOffice as well as add-ons for Firefox and Chrome/Chromium. LanguageTool requires Java 8 and it supports a lengthy list of languages. It is worth noting that LanguageTool and Link Grammar operate in entirely different manners. Whereas Link Grammar attempts to parse sentences grammatically and records errors that it encounters along the way, LanguageTool checks text against a database of error patterns (n-grams).
Masahiro Hayashi's langtool.el is a front-end for LanguageTool. To use it, one must set the langtool-language-tool-jar variable to the path of the LanguageTool .jar file in the script, then load and execute it in Emacs. The script supplies functions to process a text region or buffer, looking for possible errors. Each error is highlighted, and switching on the langtool-show-message-at-point option will display an explanation in the Emacs status line whenever you place the cursor over an error.
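A configuration along the following lines is one way to wire this up in an init file. It is a sketch under assumptions: the .jar path is a placeholder, the key bindings are arbitrary, and it presumes that langtool.el is already on the load-path and that langtool-check, langtool-check-done, and langtool-show-message-at-point are available as interactive commands.

```elisp
;; Placeholder path: point this at the LanguageTool command-line .jar
;; on the local system.
(setq langtool-language-tool-jar "/path/to/languagetool-commandline.jar")

;; Assumes langtool.el is already on `load-path'.
(require 'langtool)

;; Arbitrary key choices; any free bindings would do.
(global-set-key (kbd "C-c g c") #'langtool-check)       ; check buffer or region
(global-set-key (kbd "C-c g d") #'langtool-check-done)  ; clear the highlights
(global-set-key (kbd "C-c g m") #'langtool-show-message-at-point)
```

Setting the variable in the init file, rather than editing the script itself, means the setting survives upgrades of langtool.el.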
The script does not, however, perform on-the-fly grammar checking. That means a potentially annoying wait is involved for every check, since LanguageTool can take ten or fifteen seconds to scan even a moderately sized document. Checking a file as one works means running repeated scans and waiting that amount of time for every run.
In my own tests, LanguageTool flagged a few obvious errors (such as "Grammar be hard"), but nothing particularly tricky to catch by eye. The bigger issue, however, is that LanguageTool also flags what it considers spelling mistakes and questionable punctuation rules (like flagging "dumb" quotation marks), thus creating a great many false positives. This is especially true if one is writing in a markup language like HTML or uses a lot of peculiar terms (or free-software project names).
There does not seem to be an easy way to disable the spelling checks; a dictionary is built into each language's grammar module. On the plus side, the langtool.el script does make it easy to switch between multiple languages (with its langtool-switch-default-language function), and regional variations are supported (e.g., en-US and en-GB).
In the long term, LanguageTool's n-gram database of error patterns should flag fewer potential errors than a full-blown natural language parser like Link Grammar. But amending LanguageTool's database can only be done by editing the XML rules for the language in question.
ATD
Another option available to Emacs users willing to do a bit of
legwork is After the Deadline
(ATD), an open-source grammar checker that is offered as a remote web
service run by Automattic (the company behind WordPress.com). The
ATD service itself is free "for your personal use", while commercial
users are told to run their own server. The source code
for the server is available under the GPLv2 and is Java-based, because
ATD, under the hood, uses LanguageTool for its grammar engine. But
the service boasts that it
uses a custom rule set that improves on that employed by the
competition. On the source-code site, only a minimal language data set is available,
however; the page notes that "I can’t release all data we use".
Leandro Lisboa Penz has developed atdtool, a command-line Python
utility that queries the ATD web service. On its own, atdtool could
be used as a post-processor for text files, but in a 2012 discussion
on Reddit, user "mrdbr" pointed out how easy it is to hook atdtool
into Emacs's compilation command. Mrdbr's suggestion is a short
function:
    (defun check-grammar ()
      "Checks the current buffer with atdtool"
      (interactive)
      (compile (concat "atdtool " (shell-quote-argument (buffer-file-name)))))
which works—and rather quickly—although it is far from
convenient.
The ATD service returns XML output by default, and
hooking into the Emacs compile command results in a nicely
compact report, but one that would need to be processed separately
to be of immediate use.
ATD does seem to catch more subtle errors than LanguageTool on its
own, and it runs considerably faster than the standalone LanguageTool
.jar file (at least on a desktop-class machine).
Here, too, there seems to be an opportunity for interested
developers to push the state of the art forward. ATD currently
supports English, French, German, Portuguese, and Spanish. The
limitation, however, is the ATD service's vague "personal use" terms
of service. Without ATD's custom grammar, it is possible to run one's
own ATD instance, but the results would presumably be less useful.
TextLint
A final note on the grammar-checking front is Damien Cassou's TextLint project,
which provides a set of tools to hook Emacs and other editors into
Cassou's TextLint
parser.
TextLint is a "style checker" like the ones we discussed in the
first installment of this series; it uses rules based on The
Elements Of Style and On Writing Well. Like
mrdbr's atdtool function, however, TextLint uses Emacs's compilation
function to run a separate program and capture the output. So, in
practice, it has more in common with the external grammar-engine tools
than it does with the simple blacklist-based style checkers.
Cassou's program is derived from the SmallLint code checker for
SmallTalk and uses PetitParser,
a tool for building grammar parsers. Although it
catches only the stylistic "errors" discussed last week (which not
everyone agrees are worthwhile), it is perhaps notable because it does
what several in the Reddit discussion were hoping for from atdtool: it
allows the user to examine errors one by one.
Unclear the future is
The more complete grammar checkers discussed here offer some hope
to users who find simple style checkers less than satisfactory, but
there is clearly still room for improvement. Some Emacs users may
have an aversion to running an external Java tool (given
that language's reputation for bloat and security issues in years
past), much less connecting to a remote web service, but that appears
to be where the most complete results are to be found.
It is a bit unfortunate, too, that no one is actively pursuing
further development of a Link-Grammar–based utility, because it
would be interesting to compare that project's results against the
model used by both LanguageTool and ATD. The latter project would do
well to reconsider opening up its data set, since contributions by the
community could only improve its usefulness. Considering how rarely
the subject of grammar-checking comes up in free-software circles, it
may simply be a matter of no one having pushed the issue.
For now, users in need of a grammar checker with Emacs integration
should look to langtool.el. While there are several opportunities to
develop interesting alternatives, someone would need to step up to do
the work. As to whether or not there is any chance of that happening in the
foreseeable future, well, there ain't no tellin'.
Companies and FOSS
The relationship between free and open-source software (FOSS) communities
and companies that use (and sometimes contribute to) their code is complex
at times. Lynn Root and Noa Resare gave a presentation at PyCon 2016 in
Portland to describe some of the problems and complications of that
relationship and to suggest some ideas of how to make it better.
Both Root and Resare work for Spotify, Root as a site-reliability engineer
and Resare as a back-end engineer. Root has also been one of the leaders
of the PyLadies mentoring
organization and, just that day, had finished her term as a member of the
board of directors of the Python
Software Foundation. Resare also works on "cultural challenges" within
Spotify, trying to help the company to be a better free-software community
member.
They are the two engineering representatives to a FOSS board within the
company, a committee that grew out of some employees' passion for
free software. The board also has a patent
engineer and a company lawyer on it to try to balance the interests of all the
different constituencies in the company.
Spotify builds a service that it sells to customers, so contributing to
FOSS is not what makes it money, Root said. Many in the
audience are probably working for companies in a similar position; not
everyone can work for "Red Hat, Docker, or Puppet Labs", where developing
and maintaining FOSS is part of what makes them money. But "first and
foremost", she and Resare are open-source developers, she said; the talk is
meant to
help companies be "good citizens". They have both worked hard to get their
employer to give back to the community and want to share what they have
learned.
Root referenced a blog
post by Ian Cordasco, who is a developer for many different Python projects, that talked about how "corporations and OSS do not
mix". In it, he noted that there are a lot of big companies and even the
US government that use his
projects, but instead of helping out, they make his life much harder. They
ping him relentlessly on IRC, open duplicate bugs, and email him to try to
get him to fix a problem that is affecting them. He also noted that when
these organizations do actually contribute, the contributions are often
misguided or even
detrimental to the project.
That post resonated with Root, though it made her sad. Much of what he
wrote is "very true", she said, and she has experienced some of those same
things along the way. Cordasco made some "moral arguments" about why
companies should be giving back to the communities. Overall, though, the
post made her angry, which in turn made her "proactive" about trying to help solve
the problems; thus the talk.
Tragedy of the commons
The underlying issue is the "tragedy of the
commons", in which entities acting independently and
rationally in their own self-interest end up behaving in ways that are
contrary to the common good. The term is commonly used to describe
problems like human overpopulation or environmental and land-use conflicts.
She quoted the Wikipedia page on the subject
and said that the tragedy of the commons is an "apt theory when
describing corporations using
open source".
Companies recognize the benefits of using FOSS, and may even realize that
they should give back to the communities behind that software, but as
Cordasco pointed out, they typically don't. On the other hand, they expect
those communities to support and maintain the software. Individual
free-software
users can see that this is not particularly workable, so they are often
willing to help out with financial contributions via Kickstarter and other
funding options.
While the maintainers for these projects have built up a lot of respect and
goodwill within
the community, they can generally only "cash in" on that once or twice
before "donor fatigue" sets in.
Since these donations are coming from individuals, they generally don't result
in a sustainable salary. Companies see FOSS as an almost complete net
positive, she said, but an individual user who doesn't contribute
generally has a low impact on the project; companies have a much greater
effect.
At that point, Resare took over by noting that what Root had presented was
"pretty sad stuff". He wondered if things had to be that way or if there
are ways to make it better. The relevant question is: "what is
self-interest?" One way to get companies to make more and better
contributions would be to "nudge the understanding" of self-interest in a
direction that leads to that.
He suggested that requiring altruism from companies is not the right path.
Trying to use guilt to encourage better behavior might work here or there,
but it would be better to get the companies to understand where their real
self-interest lies.
There are clear-cut costs to bugs in FOSS. He pointed to the Heartbleed bug in OpenSSL as an example.
Heartbleed woke people and companies up to the fact that OpenSSL was being
maintained by one person, "not even full-time". That resulted in the
formation of the Core
Infrastructure Initiative (CII), which gives companies a way to help ensure
that critical projects have the funding they need for maintenance.
These bug costs are real even for small organizations that aren't part of
the CII. It really is in
their own self-interest to pitch in. Beyond bugs, the usability of various
FOSS tools could be improved. That lack of usability has real costs
in terms of developer and operations-staff time.
Practical steps
Resare had some "practical steps" that developers within organizations that
are not contributing could take to help change that. The first step is to
connect with other like-minded people in the organization. Set up an email
list or Slack channel (though he cautioned that Spotify had just switched
from IRC
to Slack and that has been frustrating for him) to discuss ideas about
contributing more. If there are projects that the company could contribute
to or internal projects that could be released as open source, the group
could start listing those. He also suggested making a list of what is
blocking progress.
Another important piece is to "become friends with legal". At Spotify,
there is a large legal team with lots of experience working with the music
industry, which gave it preconceived ideas of how to deal with
intellectual property. Giving away code didn't really enter into its
thinking. It took two months and multiple meetings to get permission to
sign the contributor agreement for Apache CloudStack, since the legal team
did not see the benefit. That made him realize that more effort was needed
to help the legal department understand those benefits.
Finding ways for the company to financially contribute to projects is
another avenue, though it is "no picnic". Developers are generally focused
on code and such, but getting funding will often require writing documents
and going to meetings. However, providing some funding for a project is an easy
way to get the project to prioritize features and bug fixes that the
company needs.
Blog posts and presentations are another way to get the message out about
these kinds of problems. Both internally and externally, organizations can
learn about the problems and solutions that way.
As Cordasco pointed out, making bad contributions is almost worse than not
contributing at all. So companies should be thinking about what
kinds of contributions they will make. Providing high-quality maintenance for
some part of the project is far superior to adding features,
Resare said. Consider the existing community and its needs when deciding
on what to work on.
In addition, companies should be thinking about the long
term for the project and for any contributions they might make to it.
Root then stepped up to describe ways to help convince company
management "to do the right thing". For one thing, it is definitely much
easier to get bugs fixed or features added if the company is working with
the community. The community is far more interested in, and helpful to, those
it is already familiar with.
Another incentive to invest in FOSS communities is recruitment. It is much
easier to sell the company to potential employees if it is already visible
in the
community. She had Spotify on her short list because it supports
diversity efforts—PyLadies in particular—and because she had seen a talk at a
conference about how the company was using certain tools. In addition, if
the company's involvement with well-known languages, tools, frameworks, and
so on is out in the open, potential employees who know and like those
choices will have more interest. That can reduce the time needed to train
new employees.
There are some examples to learn from in terms of organizations moving to
open-source solutions, Root said. The city of Munich was
able to break its Microsoft lock-in starting back in 2006. That both
increased security and saved the city some money, but the main benefit was
to give the decision-making about upgrades and changes back to the
city, rather than being beholden to its vendor. And, these days, Microsoft
itself has embraced open source, which is a "huge shift in their cultural
thinking", Root said.
In conclusion, Resare said, there is hope that things can get better. He
believes that companies can be nudged into making more and better
contributions. There are free-software allies at all companies that employ
developers, he said. Over the last twenty years, software projects have
shown that if you release early and often, and fix things here and there as
you go, you can make amazing progress. Similarly, small changes to
organizations and company cultures can build up over time so that companies
and FOSS can work even better together.
[ I would like to thank LWN subscribers for supporting my travel to
Portland for PyCon. ]
Collabora Online
Those interested in free-and-open-source alternatives to
proprietary online collaborative document-editing software (such as
Google Docs or Microsoft's
Office
Online) have a number of options to choose from. For simple
collaborative text editing, there are solutions like Etherpad and Gobby. But for a more comprehensive
suite for writing documents, crafting slideshow presentations, or making
spreadsheets, the only game in town had been OnlyOffice or, more recently, the Open365 beta.
Another option has
recently arrived
with the
announcement
of a 1.0 release and demo version of Collabora Online, which is an adaptation of
LibreOffice for the cloud.
Collabora Online's origins date back to October 2011, with a demonstration
[WebM video] at a conference by long-time LibreOffice developer Michael
Meeks. This eventually led to the release
of the Collabora Online Development Edition (CODE) in December
2015.
Accessing the demo requires filling out a brief
form.
Diving
into it reveals a promising suite of software, but one currently lacking in
many features that users may take for granted. There are examples of slide
presentations, spreadsheets, and text documents on the home page; clicking
on the Collabora Online link in the dropdown menu leads to options to
create a new document, spreadsheet, or presentation.
Using the demo is something of a bare-bones experience. The
document editor feels like a traditional "What You See Is What You Get" (WYSIWYG) editor, but a
number of features that users may expect are not there. For
example, there's no automated spell-checking available, nor is there an
option for setting paragraphs to have double-spacing.
A reviewer of the document
(for example, a co-author collaborating on a draft of an academic paper)
can mark up the text with comments, but there is no comprehensive "track
changes" option, which is a hindrance. Normally, collaborative tools allow an
individual to make direct deletions and additions to the text while having
those changes clearly marked, so the original author can see exactly
what changes were made—and accept or reject said changes as
they wish. This type of editing is essential to collaborative online
document crafting (roughly analogous to using diff on source code) and
it was surprising to not see it in a 1.0 release. Documents can be exported
to PDF, ODT, DOC, or DOCX formats.
The slideshow presentation editor is similarly minimal. One can add,
delete, or clone slides, insert graphics and tables into
slides, and run the final presentation in full-screen mode for displaying to an
audience, but that's it. There is no spell-check, there are no mechanisms
for fancy transitions from slide to slide, there is no clip art gallery
included for those needing images to insert into the presentation, nor is
there a video export option to allow for easy publishing on the Web (as,
say, an HTML5 video). One can export the final project into PDF, ODP, PPT,
and PPTX formats.
Spreadsheet editing also has the basics. One is greeted with a grid of
cells to input and manage data, as usual. There can be up to three
worksheets for each spreadsheet instance; other programs (including
LibreOffice itself) allow
for many more. This is limiting for a number of use cases, such as
businesses that need to track several different accounts, types of
inventory, or other items that cannot be easily categorized into three
sheets of data.
Furthermore, the user interface to the spreadsheet functions is lacking.
There is a button for a SUM() function allowing users to
add up the values in a range of cells.
But other functions, such as statistical, financial, conditional, and
engineering functions, must be input manually, unlike other spreadsheet
programs. The
help menu does not provide an explanation of how to use these functions, so
they must be learned elsewhere, such as from LibreOffice Calc. PDF, ODS,
XLS, and XLSX formats are available for exporting.
Given that these are the early days of a new product, it is understandable
that
there may be missing features. An email exchange with Meeks revealed that
the main bottleneck for reintroducing these features is
realtime collaboration: "While the functionality is all there under the
hood, there are a number of compromises here around how much UX [user
experience] surface we
can expose before we implement collaboration: the more UX operations, the
bigger the collaboration problem matrix gets. We plan to focus on
collaboration before expanding the UX to include lots of the dialogs." One
can begin exploring this collaborative editing in the 1.0 demo by
connecting to the same document from two different browsers. Documents can
be marked up with comments or textual/graphical changes in realtime while
the other browser shows the changes remotely.
Those interested in getting involved can download the Collabora Online Development
Edition
(CODE): a virtual machine image based on openSUSE with the latest
revisions to the Collabora Online codebase, running in a custom ownCloud
instance supplied with it. This allows developers to work on the code
while offline.
Collabora Online is made of five major pieces: most of the
LibreOffice codebase; LibreOfficeKit, which is an API that allows adding
the
vast majority of the LibreOffice code to the browser; a WebSocket daemon
to manage and serve traffic to an online instance of the office suite;
a Node.js back-end for realtime document rendering; and an ownCloud
plugin. Despite the turmoil
for ownCloud following a fork of the project, it continues
to be actively developed, with a recent 9.1 beta release.
The two projects are closely aligned; Collabora has just announced
the release
of Collabora Online for ownCloud Enterprise.
The
client-side component is written in JavaScript, and the LibreOfficeKit API
is based on Leaflet, an open-source
JavaScript library for interactive maps. Here, the various elements of the
office-software instances, such as the toolbar, timestamps of the last
modification to the files, and the address of the WebSocket hosting server,
are analogized to mapping information. The API translates user modification
of the document into C++ operations, such as saveAs(), loadDocument(), and
paste(), which are sent to the server hosting the LibreOffice
instance via
"loolwsd", the LibreOffice OnLine WebSocket Daemon. The daemon updates the
document on both the client and server in realtime.
The client-side
code is permissively licensed under the "two-clause BSD" license, while the
server-side code is copylefted under MPL 2.0.
Collabora intends to make money from subscription and service
contracts, which would be used to fund further
development. Collabora offers
to provide customization to the software as needed by customers, such as
for user-interface changes to incorporate a corporate
logo.
Contributors to Collabora Online, including those wanting to submit
patches, are encouraged to join the IRC chat on #libreoffice-dev at
irc.freenode.org, or to subscribe to the LibreOffice
mailing list. The project uses Bugzilla
for bug tracking and Gerrit
for code review. Writers are encouraged to help with documentation on the
project's wiki.
The code itself can be found at a freedesktop.org
repository; API documentation is available
as well.
While there is plenty of room for the project to grow, it is likely that
Collabora Online will soon become an attractive, full-featured open-source
office suite. The ability to do basic editing is already there, which may
be enough for some. With the project backed by a multinational
business boasting over ten years of experience, millions of lines of code
written, and clients including some of the largest information technology
corporations in the world as well as the UK
government, it will be exciting to see what is
in store for the future.
Page editor: Jonathan Corbet
