
Leading items

Grammar tools for Emacs

By Nathan Willis
June 29, 2016

Last week, we noted that grammar-checking constitutes a daunting challenge for humans and software alike. Consequently, it is hardly surprising that automated tools to assess grammatical correctness remain few and far between. While there are several options available for simple stylistic checks (which we examined last week), deep grammatical analysis typically requires connecting to a dedicated grammar engine.

As was the case with our previous installment, the utilities described are available for Emacs, but the landscape is more-or-less equivalent for other editors. The set of tools is also limited to English, although few other languages offer a significantly better palette of options.

Broadly speaking, grammar engines for natural language operate in the same manner as software tools designed to process programming languages. Individual tokens in a document first need to be identified and classified, then the various sentences and paragraphs (or, potentially, shorter snippets) can be parsed and put into a syntax tree. That analogy starts to break down, however, because natural language does not have the strict, formal rules that define the correct syntax—unlike well-designed programming languages. English in particular attracts a great deal of argument where many so-called "grammar rules" are concerned. LWN commenters on last week's story even provided several examples of such vigorous debate.
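As a rough illustration of the tokenize, classify, and parse pipeline described above, here is a minimal Python sketch. The lexicon, the tag names, and the single "noun phrase, verb, noun phrase" sentence shape are all assumptions made for the example; no real grammar engine is this simple, and none of the tools discussed below is implemented this way.

```python
import re

# Toy lexicon (hypothetical): a real engine would use a large dictionary
# plus statistical models rather than a hand-written table.
LEXICON = {
    "the": "DET", "a": "DET",
    "editor": "NOUN", "grammar": "NOUN", "sentence": "NOUN",
    "checks": "VERB", "parses": "VERB",
}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def classify(tokens):
    """Pair each token with a part-of-speech tag, or UNKNOWN."""
    return [(tok, LEXICON.get(tok, "UNKNOWN")) for tok in tokens]

def parse_np(tagged, i):
    """Parse an optional determiner followed by a noun, starting at index i."""
    start = i
    if i < len(tagged) and tagged[i][1] == "DET":
        i += 1
    if i < len(tagged) and tagged[i][1] == "NOUN":
        return ("NP", [tok for tok, _ in tagged[start:i + 1]]), i + 1
    return None, start

def parse_sentence(text):
    """Return a small syntax tree for NP VERB NP, or None if there is no parse."""
    tagged = classify(tokenize(text))
    subject, i = parse_np(tagged, 0)
    if subject is None or i >= len(tagged) or tagged[i][1] != "VERB":
        return None
    verb = tagged[i][0]
    obj, j = parse_np(tagged, i + 1)
    if obj is None or j != len(tagged):
        return None
    return ("S", subject, ("VP", verb, obj))

print(parse_sentence("The editor checks the grammar"))
print(parse_sentence("Editor the checks grammar the"))
```

The first call yields a nested tuple standing in for a syntax tree; the second yields None because the words do not fit the one sentence shape this sketch knows about. That rigidity is precisely where the analogy with programming languages breaks down: a great deal of acceptable English would fail to parse under any small, strict rule set.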

The upshot is that researchers devote a lot of time to modeling natural languages and to developing rule sets that produce satisfactory results. But this research is ongoing and does not have a definite stopping point, so the software written by experts in the field is always a moving target—in addition, of course, to each program reflecting a different approach to the problem.

What that means for Emacs users is that any grammar-checking utility linked to a grammar engine is linked to one particular engine from one particular team of researchers. Thus, choosing between tools can demand considerable testing time, on top of any assessment about the suitability of the software available for running the engine itself.

Link Grammar

The Link Grammar Parser was originally written at Carnegie Mellon University, but maintenance as a free-software project has subsequently been taken over by a team from the AbiWord project. The code is hosted at GitHub and the latest release was version 5.3.7, from May 7. In addition to English, there are data sets available to use Link Grammar with Russian, German, Lithuanian, Arabic, Farsi, Hebrew, Vietnamese, Indonesian, Kazakh, and Turkish.

In 2010, there was an Emacs tool designed to work with Link Grammar: Baoqiu Cui's grammar checker, which was included (along with one other utility) in his bcui-emacs project on Google Code. Link Grammar provides bindings for Common Lisp (among several languages), but Cui's implementation utilizes a small standalone C++ program to call Link Grammar's API.

Unfortunately, that program is badly out of date; the Link Grammar API has changed (Cui's code was released during the Link Grammar 4.x era), and Cui's C++ wrapper appears to not have undergone much in the way of testing even when it was initially published. While understandable (it was, after all, a personal experiment), it is a disappointing dead-end.

Nevertheless, the possibility exists that Cui's grammar.el code could be revived and brought up to speed. The Emacs script was designed to provide a minor mode that performed grammar-checking on the fly, processing newly typed sentences every time the user pauses for more than a specified time period (by default, three seconds). But it never provided much in the way of configuration options for working with the Link Grammar engine; a revival attempt would likely need to be honed for performance, which would be a tricky task unto itself.

Langtool

LanguageTool is a Java-based free-software grammar-checking engine. The project actively maintains plugins for LibreOffice and OpenOffice as well as add-ons for Firefox and Chrome/Chromium. LanguageTool requires Java 8 and it supports a lengthy list of languages. It is worth noting that LanguageTool and Link Grammar operate in entirely different manners. Whereas Link Grammar attempts to parse sentences grammatically and records errors that it encounters along the way, LanguageTool checks text against a database of error patterns (n-grams).
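To make the pattern-matching approach concrete, the following Python sketch checks text against a tiny table of error patterns. It is only an illustration of the general idea: the three patterns and their messages are invented for this example, and the regular-expression format bears no relation to the rule format LanguageTool actually uses.

```python
import re

# Hypothetical error patterns: (compiled regex, explanation).
ERROR_PATTERNS = [
    (re.compile(r"\b(\w+) \1\b", re.IGNORECASE), "repeated word"),
    (re.compile(r"\ba(?= [aeiou])", re.IGNORECASE), "use 'an' before a vowel"),
    (re.compile(r"\b(?:could|should|would) of\b", re.IGNORECASE),
     "use 'have', not 'of'"),
]

def check(text):
    """Return (offset, matched text, explanation) tuples, sorted by offset."""
    hits = []
    for pattern, message in ERROR_PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(0), message))
    return sorted(hits)

for hit in check("I should of used a editor that that checks grammar."):
    print(hit)
```

A checker of this kind never needs to understand the sentence; it only needs a pattern that matches the mistake. The flip side is that it can flag only the errors someone has already thought to describe.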

Masahiro Hayashi's langtool.el is a front-end for LanguageTool. To use it, one must set the langtool-language-tool-jar variable to the path of the LanguageTool .jar file in the script, then load and execute it in Emacs. The script supplies functions to process a text region or buffer, looking for possible errors. Each error is highlighted, and switching on the langtool-show-message-at-point option will display an explanation in the Emacs status line whenever you place the cursor over an error.

[Langtool.el]

The script does not, however, perform on-the-fly grammar checking. That means a potentially annoying wait is involved for every check, since LanguageTool can take ten or fifteen seconds to scan even a moderately sized document. Checking a file as one works means running repeated scans and waiting that amount of time for every run.

In my own tests, LanguageTool flagged a few obvious errors (such as "Grammar be hard"), but nothing particularly tricky to catch by eye. The bigger issue, however, is that LanguageTool also flags what it considers spelling mistakes and questionable punctuation rules (like flagging "dumb" quotation marks), thus creating a great many false positives. This is especially true if one is writing in a markup language like HTML or uses a lot of peculiar terms (or free-software project names).

There does not seem to be an easy way to disable the spelling checks; a dictionary is built into each language's grammar module. On the plus side, the langtool.el script does make it easy to switch between multiple languages (with its langtool-switch-default-language function), and regional variations are supported (e.g., en-US and en-GB).

In the long term, LanguageTool's n-gram database of error patterns should flag fewer potential errors than a full-blown natural language parser like Link Grammar. But amending LanguageTool's database can only be done by editing the XML rules for the language in question.

ATD

Another option available to Emacs users willing to do a bit of legwork is After the Deadline (ATD), an open-source grammar checker that is offered as a remote web service run by Automattic (the company behind WordPress.com). The ATD service itself is free "for your personal use", while commercial users are told to run their own server.

The source code for the server is available under the GPLv2 and is Java-based, because ATD, under the hood, uses LanguageTool for its grammar engine. But the service boasts that it uses a custom rule set that improves on that employed by the competition. On the source-code site, only a minimal language data set is available, however; the page notes that "I can’t release all data we use".

Leandro Lisboa Penz has developed atdtool, a command-line Python utility that queries the ATD web service. On its own, atdtool could be used as a post-processor for text files, but in a 2012 discussion on Reddit, user "mrdbr" pointed out how easy it is to hook atdtool into Emacs's compilation command. Mrdbr's suggestion is a short function:

    (defun check-grammar ()
      "Checks the current buffer with atdtool"
      (interactive)
      (compile (concat "atdtool " (shell-quote-argument (buffer-file-name)))))

which works—and rather quickly—although it is far from convenient. The ATD service returns XML output by default, and hooking into the Emacs compile command results in a nicely compact report, but one that would need to be processed separately to be of immediate use.
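As a sketch of what that separate processing step might look like, the Python fragment below converts an XML error report into "file:line: message" lines, a shape that Emacs's compilation mode can generally recognize. The sample response and its element names (results, error, string, type, suggestions, option) are assumptions made for illustration rather than a documented ATD schema, and the line number is recovered by simply searching the text for the flagged phrase.

```python
import xml.etree.ElementTree as ET

# Hypothetical response: the element names are assumptions for this
# example, not a documented schema.
SAMPLE_XML = """<results>
  <error>
    <string>be hard</string>
    <type>grammar</type>
    <suggestions><option>is hard</option></suggestions>
  </error>
</results>"""

def report_lines(xml_text, filename, document_text):
    """Turn each <error> element into a 'file:line: message' string."""
    lines = []
    for error in ET.fromstring(xml_text).iter("error"):
        phrase = error.findtext("string", default="")
        kind = error.findtext("type", default="unknown")
        options = [opt.text for opt in error.iter("option") if opt.text]
        offset = document_text.find(phrase)
        # Line numbers are 1-based; fall back to line 1 if the phrase is absent.
        line_no = document_text.count("\n", 0, offset) + 1 if offset >= 0 else 1
        hint = f" (try: {', '.join(options)})" if options else ""
        lines.append(f"{filename}:{line_no}: {kind}: '{phrase}'{hint}")
    return lines

text = "Emacs is an editor.\nGrammar be hard.\n"
for line in report_lines(SAMPLE_XML, "draft.txt", text):
    print(line)
```

Searching for the first occurrence of the phrase is a shortcut; a phrase that appears more than once in a document would need the surrounding context to be located reliably.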

[ATD output]

ATD does seem to catch more subtle errors than LanguageTool on its own, and it runs considerably faster than the standalone LanguageTool .jar file (at least on a desktop-class machine).

Here, too, there seems to be an opportunity for interested developers to push the state of the art forward. ATD currently supports English, French, German, Portuguese, and Spanish. The limitation, however, is the ATD service's vague "personal use" terms of service. Without ATD's custom grammar, it is possible to run one's own ATD instance, but the results would presumably be less useful.

TextLint

A final note on the grammar-checking front is Damien Cassou's TextLint project, which provides a set of tools to hook Emacs and other editors into Cassou's TextLint parser.

TextLint is a "style checker" like the ones we discussed in the first installment of this series; it uses rules based on The Elements Of Style and On Writing Well. Like mrdbr's atdtool function, however, TextLint uses Emacs's compilation function to run a separate program and capture the output. So, in practice, it has more in common with the external grammar-engine tools than it does with the simple blacklist-based style checkers.

Cassou's program is derived from the SmallLint code checker for Smalltalk and uses PetitParser, a tool for building grammar parsers. Although it catches only the stylistic "errors" discussed last week (which not everyone agrees are worthwhile), it is perhaps notable because it does what several in the Reddit discussion were hoping for from atdtool: it allows the user to examine errors one by one.

Unclear the future is

The more complete grammar checkers discussed here offer some hope to users who find simple style checkers less than satisfactory, but there is clearly still room for improvement. Some Emacs users may have an aversion to running an external Java tool (given that language's reputation for bloat and security issues in years past), much less connecting to a remote web service, but that appears to be where the most complete results are to be found.

It is a bit unfortunate, too, that no one is actively pursuing further development of a Link-Grammar–based utility, because it would be interesting to compare that project's results against the model used by both LanguageTool and ATD. The latter project would do well to reconsider opening up its data set, since contributions by the community could only improve its usefulness. Considering how rarely the subject of grammar-checking comes up in free-software circles, it may simply be a matter of no one having pushed the issue.

For now, users in need of a grammar checker with Emacs integration should look to langtool.el. While there are several opportunities to develop interesting alternatives, someone would need to step up to do the work. As to whether or not there is any chance of that happening in the foreseeable future, well, there ain't no tellin'.


Companies and FOSS

By Jake Edge
June 29, 2016

PyCon 2016

The relationship between free and open-source software (FOSS) communities and companies that use (and sometimes contribute to) their code is complex at times. Lynn Root and Noa Resare gave a presentation at PyCon 2016 in Portland to describe some of the problems and complications of that relationship and to suggest some ideas of how to make it better.

Both Root and Resare work for Spotify, Root as a site-reliability engineer and Resare as a back-end engineer. Root has also been one of the leaders of the PyLadies mentoring organization and, just that day, had finished her term as a member of the board of directors of the Python Software Foundation. Resare also works on "cultural challenges" within Spotify, trying to help the company to be a better free-software community member.

They are the two engineering representatives to a FOSS board within the company, which is a committee that originated out of the passion about free software that some employees had. The board also has a patent engineer and company lawyer on it to try to balance the interests of all the different constituencies in the company.

Spotify builds a service that it sells to customers, so contributing to FOSS is not what makes it money, Root said. Many in the audience are probably working for companies in a similar position; not everyone can work for "Red Hat, Docker, or Puppet Labs", where developing and maintaining FOSS is part of what makes them money. But "first and foremost", she and Resare are open-source developers, she said; the talk is meant to try to help companies be "good citizens". They have both worked hard to get their employer to give back to the community and want to share what they have learned.

[Lynn Root & Noa Resare]

Root referenced a blog post by Ian Cordasco, who is a developer for many different Python projects, that talked about how "corporations and OSS do not mix". In it, he noted that there are a lot of big companies and even the US government that use his projects, but instead of helping out, they make his life much harder. They ping him relentlessly on IRC, open duplicate bugs, and email him to try to get him to fix a problem that is affecting them. He also noted that when these organizations do actually contribute, the contributions are often misguided or even detrimental to the project.

That post resonated with Root, though it made her sad. Much of what he wrote is "very true", she said, and she has experienced some of those same things along the way. Cordasco made some "moral arguments" about why companies should be giving back to the communities. Overall, though, the post made her angry, which also made her "proactive" to try to help solve the problems—thus the talk.

Tragedy of the commons

The underlying issue is the "tragedy of the commons", in which entities acting independently and rationally in their own self-interest end up behaving in ways that are contrary to the common good. The concept is commonly used to describe problems like human overpopulation or environmental and land-use conflicts. She quoted the Wikipedia page on the topic and said that the tragedy of the commons is an "apt theory when describing corporations using open source".

Companies recognize the benefits of using FOSS, and may even realize that they should give back to the communities behind that software, but as Cordasco pointed out, they typically don't. On the other hand, they expect those communities to support and maintain the software. Individual free-software users can see that this is not particularly workable, so they are often willing to help out with financial contributions via Kickstarter and other funding options.

While the maintainers for these projects have built up a lot of respect and goodwill within the community, they can generally only "cash in" on that once or twice before "donor fatigue" sets in. Since these donations are coming from individuals, they generally don't result in a sustainable salary. Companies see FOSS as an almost complete net positive, she said, but an individual user that doesn't contribute generally has a low impact on the project; companies have a much greater effect.

At that point, Resare took over by noting that what Root had presented was "pretty sad stuff". He wondered if things had to be that way or if there are ways to make it better. The relevant question is: "what is self-interest?" One way to get companies to make more and better contributions would be to "nudge the understanding" of self-interest in a direction that leads to that.

He suggested that requiring altruism from companies is not the right path. Trying to use guilt to encourage better behavior might work here or there, but it would be better to get the companies to understand where their real self-interest lies.

There are clear-cut costs to bugs in FOSS. He pointed to the Heartbleed bug in OpenSSL as an example. Heartbleed woke people and companies up to the fact that OpenSSL was being maintained by one person, "not even full-time". That resulted in the formation of the Core Infrastructure Initiative (CII), which gives companies a way to help ensure that critical projects have the funding they need for maintenance.

These bug costs are real even for small organizations that aren't part of the CII. It really is in their own self-interest to pitch in. Beyond bugs, the usability of various FOSS tools could be improved. That lack of usability has real costs in terms of developer and operations-staff time.

Practical steps

Resare had some "practical steps" that developers within organizations that are not contributing could take to help change that. The first step is to connect with other like-minded people in the organization. Set up an email list or Slack channel (though he cautioned that Spotify had just switched from IRC to Slack and that has been frustrating for him) to discuss ideas about contributing more. If there are projects that the company could contribute to or internal projects that could be released as open source, the group could start listing those. He also suggested making a list of what is blocking progress.

Another important piece is to "become friends with legal". At Spotify, there is a large legal team with lots of experience in working with the music industry, which gave it pre-conceived ideas of how to deal with intellectual property. Giving away code didn't really enter into their thinking. It took two months and multiple meetings to get permission to sign the contributor agreement for Apache CloudStack, since the legal team did not see the benefit. That made him realize that more effort was needed to help the legal department understand those benefits.

Finding ways for the company to financially contribute to projects is another avenue, though it is "no picnic". Developers are generally focused on code and such, but getting funding will often require writing documents and going to meetings. However, providing some funding for a project is an easy way to get the project to prioritize features and bug fixes that the company needs.

Blog posts and presentations are another way to get the message out about these kinds of problems. Both internally and externally, organizations can learn about the problems and solutions that way.

As Cordasco pointed out, making bad contributions is almost worse than not contributing at all. So companies should be thinking about what kinds of contributions they will make. Providing high-quality maintenance for some part of the project is far superior to adding features, Resare said. Consider the existing community and its needs when deciding on what to work on. In addition, companies should be thinking about the long term for the project and for any contributions they might make to it.

Root then stepped up to describe ways to help convince company management "to do the right thing". For one thing, it is definitely much easier to get bugs fixed or features added if the company is working with the community. The community is far more interested in, and helpful to, those it is already familiar with.

Another incentive to invest in FOSS communities is recruitment. It is much easier to sell the company to potential employees if it is already visible in the community. She had Spotify on her short list because it supports diversity efforts—PyLadies in particular—and because she had seen a talk at a conference about how the company was using certain tools. In addition, if the company's involvement with well-known languages, tools, frameworks, and so on is out in the open, potential employees who know and like those choices will have more interest. That can reduce the time needed to train new employees.

There are some examples to learn from in terms of organizations moving to open-source solutions, Root said. The city of Munich was able to break its Microsoft lock-in starting back in 2006. That both increased security and saved the city some money, but the main benefit was to give the decision-making about upgrades and changes back to the city, rather than being beholden to its vendor. And, these days, Microsoft itself has embraced open source, which is a "huge shift in their cultural thinking", Root said.

In conclusion, Resare said, there is hope that things can get better. He believes that companies can be nudged into making more and better contributions. There are free-software allies at all companies that employ developers, he said. Over the last twenty years, software projects have shown that if you release early and often, and fix things here and there as you go, you can make amazing progress. Similarly, small changes to organizations and company cultures can build up over time so that companies and FOSS can work even better together.

[ I would like to thank LWN subscribers for supporting my travel to Portland for PyCon. ]


Collabora Online

June 29, 2016

This article was contributed by Adam Saunders

Those interested in free-and-open-source alternatives to proprietary online collaborative document-editing software (such as Google Docs or Microsoft's Office Online) have a number of options to choose from. For simple collaborative text editing, there are solutions like Etherpad and Gobby. But for a more comprehensive suite for writing documents, crafting slideshow presentations, or making spreadsheets, the only options had been OnlyOffice and, more recently, the Open365 beta.

Another option has recently arrived with the announcement of a 1.0 release and demo version of Collabora Online, which is an adaptation of LibreOffice for the cloud. Collabora Online's origins date back to October 2011, with a demonstration [WebM video] at a conference by long-time LibreOffice developer Michael Meeks. This eventually led to the release of the Collabora Online Development Edition (CODE) in December 2015.

Accessing the demo requires filling out a brief form. Diving into it reveals a promising suite of software, but one currently lacking in many features that users may take for granted. There are examples of slide presentations, spreadsheets, and text documents on the home page; clicking on the Collabora Online link in the dropdown menu leads to options to create a new document, spreadsheet, or presentation. Using the demo is something of a bare-bones experience. The document editor feels like a traditional "What You See Is What You Get" (WYSIWYG) editor, but a number of features that users may expect are not there. For example, there's no automated spell-checking available, nor is there an option for setting paragraphs to have double-spacing.

[Word processing]

A reviewer of the document (for example, a co-author collaborating on a draft of an academic paper) can mark up the text with comments, but there is no comprehensive "track changes" option, which is a hindrance. Normally, collaborative tools allow an individual to make direct deletions and additions to the text while having those changes clearly marked, so the original author can see exactly what changes were made—and accept or reject said changes as they wish. This type of editing is essential to collaborative online document crafting (roughly analogous to using diff on source code) and it was surprising to not see it in a 1.0 release. Documents can be exported to PDF, ODT, DOC, or DOCX formats.

The slideshow presentation editor is similarly minimal. One can add, delete, or clone slides, insert graphics and tables into slides, and run the final presentation in full-screen mode for displaying to an audience, but that's it. There is no spell-check, there are no mechanisms for fancy transitions from slide to slide, there is no clip art gallery included for those needing images to insert into the presentation, nor is there a video export option to allow for easy publishing on the Web (as, say, an HTML5 video). One can export the final project into PDF, ODP, PPT, and PPTX formats.

[Presentation]

Spreadsheet editing also has the basics. One is greeted with a grid of cells to input and manage data, as usual. There can be up to three worksheets for each spreadsheet instance; other programs (including LibreOffice itself) allow for many more. This is limiting for a number of use cases, such as businesses that need to track several different accounts, types of inventory, or other items that cannot be easily categorized into three sheets of data.

Furthermore, the user interface to the spreadsheet functions is lacking. There is a button for a SUM() function allowing users to add up the values in a range of cells. But other functions, such as statistical, financial, conditional, and engineering functions, must be input manually, unlike other spreadsheet programs. The help menu does not provide an explanation of how to use these functions, so they must be learned elsewhere, such as from LibreOffice Calc. PDF, ODS, XLS, and XLSX formats are available for exporting.

[Spreadsheet]

Given that these are the early days of a new product, it is understandable that there may be missing features. An email exchange with Meeks revealed that the main bottleneck for reintroducing these features is realtime collaboration: "While the functionality is all there under the hood, there are a number of compromises here around how much UX [user experience] surface we can expose before we implement collaboration: the more UX operations, the bigger the collaboration problem matrix gets. We plan to focus on collaboration before expanding the UX to include lots of the dialogs." One can begin exploring this collaborative editing in the 1.0 demo by connecting to the same document from two different browsers. Documents can be marked up with comments or textual/graphical changes in realtime while the other browser shows the changes remotely.

Those interested in getting involved can download the Collabora Online Development Edition (CODE): a virtual machine image based on openSUSE with the latest revisions to the Collabora Online codebase, running in a custom ownCloud instance supplied with it. This allows developers to work on the code while offline.

Collabora Online is made of five major pieces: most of the LibreOffice codebase; LibreOfficeKit, which is an API that allows adding the vast majority of the LibreOffice code to the browser; a WebSocket daemon to manage and serve traffic to an online instance of the office suite; a Node.js back-end for realtime document rendering; and an ownCloud plugin. Despite the turmoil for ownCloud following a fork of the project, it continues to be actively developed, with a recent 9.1 beta release. The two projects are closely aligned; Collabora has just announced the release of Collabora Online for ownCloud Enterprise.

The client-side component is written in JavaScript, and the LibreOfficeKit API is based on Leaflet, an open-source JavaScript library for interactive maps. Here, the various elements of the office-software instances, such as the toolbar, timestamps of the last modification to the files, and the address of the WebSocket hosting server, are analogized to mapping information. The API translates user modifications of the document into C++ operations, such as saveAs(), loadDocument(), and paste(), which are sent to the server hosting the LibreOffice instance via "loolwsd", the LibreOffice OnLine WebSocket Daemon. The daemon updates the document on both the client and server in realtime. The client-side code is permissively licensed under the "two-clause BSD" license, while the server-side code is copylefted under MPL 2.0.

Collabora intends to make money from subscription and service contracts, which would be used to fund further development. Collabora offers to provide customization to the software as needed by customers, such as for user-interface changes to incorporate a corporate logo.

Contributors to Collabora Online, including those wanting to submit patches, are encouraged to join the IRC chat on #libreoffice-dev at irc.freenode.org, or to subscribe to the LibreOffice mailing list. The project uses Bugzilla for bug tracking and Gerrit for code review. Writers are encouraged to help with documentation on the project's wiki. The code itself can be found at a freedesktop.org repository; API documentation is available as well.

While there is plenty of room for the project to grow, it is likely that Collabora Online will soon become an attractive, full-featured open-source office suite. The ability to do basic editing is already there, which may be enough for some. With the project backed by a multinational business boasting over ten years of experience, millions of lines of code written, and clients including some of the largest information technology corporations in the world as well as the UK government, it will be exciting to see what is in store for the future.


Page editor: Jonathan Corbet


Copyright © 2016, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds