December 14, 2011
This article was contributed by Nathan Willis
Understanding human language is a notoriously complicated,
hard-to-reduce problem without simple software solutions. On the plus
side, this is one of the few things that prevent the machines from becoming
our masters, but it also complicates all manner of natural-language
processing tasks, from grammar-checking to speech recognition. High on
that list of tasks is automated (or semi-automated) translation from one
human language to another. Most of us are well acquainted with the
benefits of automatically translated web pages, so it is easy to imagine
other areas that would see a boost from better automatic translation: email, instant messaging, video subtitles, or code comments and documentation, to name just a few.
When we covered the Transifex collaborative translation tool in September, I started reading up on open source machine translation (MT) engines. MT is not a particularly fast-paced area of study; most of the projects are focused on academic research rather than on producing user-friendly applications and software modules. In addition, a large percentage of the projects are also focused on either one particular language, or on a particular language pair — which underscores one of the topic's major hurdles: once you stray outside of related languages, our forms of communication can become so different that it is hard for humans to accurately translate between them, much less machines.
Still, there are a few free software projects that both produce usable
code and cover a broad enough set of languages to play around with. The
leader of the pack is Apertium, a
GPL-licensed package originally developed with funding from the Spanish
government. Apertium is a rule-based (or transfer-based) MT system, meaning that it takes the input text in one language and attempts to break it down into an intermediate, abstract representation. The abstract message is then used to compose an equivalent text in the output language.
Apertium 101
In 2004, Apertium started off targeting the native languages of Spain
(Castilian Spanish, Catalan, Basque, Galician, Asturian, and Occitan), but
has subsequently branched out, and now boasts stable support for more than 20 languages. Currently the supported languages are all of European origin, but the unstable set includes considerably more, several of which hail from other continents.
The engine itself requires that language support be built in pairs, with each module consisting of three pieces: one "morphological dictionary" for each of the two languages (which includes both words and rules for how they are inflected), and a third file that holds word mappings between the languages. The project's New Language Pair HOWTO goes into further detail about the XML-based syntax of the dictionaries, and how they mark up various parts of speech and language features. Despite the one-to-one nature of the translation engine itself, the various front-end tools written for Apertium can chain together several of these one-to-one relationships and translate between many more languages. However, there are several languages that only exist in one language pair and thus effectively form closed loops — such as Norwegian Nynorsk and Norwegian Bokmål.
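To give a feel for the format, here is a condensed monolingual dictionary sketch in the style of the New Language Pair HOWTO's English noun example. It is illustrative only: a real dictionary also declares an alphabet and defines far more symbols and paradigms.

```xml
<dictionary>
  <sdefs>
    <sdef n="n"/>   <!-- noun -->
    <sdef n="sg"/>  <!-- singular -->
    <sdef n="pl"/>  <!-- plural -->
  </sdefs>
  <pardefs>
    <!-- A paradigm: how regular nouns of this class inflect -->
    <pardef n="beer__n">
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <!-- One lemma, attached to its inflection paradigm -->
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
  </section>
</dictionary>
```

The left side of each pair holds the surface ending and the right side holds the lemma's grammatical tags, so the same entry serves both analysis ("beers" to beer plus noun-plural tags) and generation in the opposite direction.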
The engine's documentation is actually quite thorough, although that
does not make it simple to dig in and start translating using the bundled
tools. The translation process involves a number of discrete steps: simply
turning the input into the intermediate form requires parsing the input to
strip out formatting, breaking the input text into segments, finding the
base form and appropriate morphological tags for
each word, and choosing the best-fit match for each of the ambiguous bits
(both for words that could resolve to more than one match, such as homographs and heteronyms, and for sentences and phrases where the meaning itself is unclear). After that, the engine tries to match the intermediate representation to matching base words in the target dictionary, then it has to rearrange and grammatically reconstruct the message into words, chunks, and sentences in the target language, and ultimately re-apply any original formatting.
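The stages above can be sketched in miniature. The following toy Python fragment is purely illustrative: its dictionaries, tag names, and single transfer rule are invented for this example and bear no relation to Apertium's actual compiled data, but it shows the analysis, transfer, and generation shape of a rule-based system.

```python
# Toy sketch of a rule-based translation pipeline (illustrative only;
# the real engine uses compiled finite-state transducers and XML rules).

# 1. Morphological analysis: map each surface form to lemma + tags.
ANALYSES = {
    "la":     ("el", "det.def.f.sg"),
    "casa":   ("casa", "n.f.sg"),
    "blanca": ("blanco", "adj.f.sg"),
}

# 2. Bilingual dictionary: map source lemma + part of speech to target lemma.
BILINGUAL = {
    ("el", "det"): "the",
    ("casa", "n"): "house",
    ("blanco", "adj"): "white",
}

# 3. Generation: inflect the target lemma (trivial in this toy example).
def generate(lemma, tags):
    return lemma

def translate(sentence):
    analysed = [ANALYSES[w] for w in sentence.lower().split()]
    out = []
    i = 0
    while i < len(analysed):
        lemma, tags = analysed[i]
        pos = tags.split(".")[0]
        # Structural transfer: Spanish noun-adjective order becomes
        # English adjective-noun order.
        if pos == "n" and i + 1 < len(analysed) and \
           analysed[i + 1][1].startswith("adj"):
            adj_lemma, adj_tags = analysed[i + 1]
            out.append(generate(BILINGUAL[(adj_lemma, "adj")], adj_tags))
            out.append(generate(BILINGUAL[(lemma, "n")], tags))
            i += 2
        else:
            out.append(generate(BILINGUAL[(lemma, pos)], tags))
            i += 1
    return " ".join(out)

print(translate("la casa blanca"))   # -> "the white house"
```

Even at this scale, the division of labor is visible: the dictionaries carry the vocabulary knowledge, while a small set of transfer rules handles word order, which is why the quality of a language pair's data matters so much.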
Each step in the translation process is implemented as a separate
module; Apertium uses finite-state transducers (FSTs), which are finite
state machines that have separate input and output "tapes." The online documentation
does not go into as much detail about the various modules as the
downloadable PDF manual does; if you are interested in learning more about Apertium's lexical processing capabilities, the manual is the place to start.
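For a sense of what a finite-state transducer does, here is a minimal hand-written sketch in Python. The transition table and tag strings are invented for illustration; lttoolbox's real FSTs are compiled from the XML dictionaries rather than written by hand, and are vastly larger.

```python
# A finite-state transducer: a state machine whose transitions read a
# symbol from an input tape and write a string to an output tape.

# Transitions: (state, input_symbol) -> (next_state, output_string)
# This tiny FST analyses "cat" and "cats" into lemma + tags.
TRANSITIONS = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t"),
    (3, "s"): (4, ""),       # consume the plural "s", emit nothing
}

# Final states, and the tag string emitted on acceptance.
FINALS = {3: "<n><sg>", 4: "<n><pl>"}

def analyse(word):
    state, out = 0, []
    for ch in word:
        if (state, ch) not in TRANSITIONS:
            return None      # unknown word: no path through the machine
        state, emit = TRANSITIONS[(state, ch)]
        out.append(emit)
    if state not in FINALS:
        return None
    return "".join(out) + FINALS[state]

print(analyse("cats"))   # -> "cat<n><pl>"
```

Because the two tapes are symmetric, the same machine run "backward" performs generation, turning a lemma and tags back into a surface form; that reversibility is what lets one dictionary serve both directions of a pair.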
However, it seems clear that a substantial portion of Apertium's skill on any particular translation task is bound up in the quality of the language pairs, not in the core modules themselves. Dictionary creation takes up more of the documentation than hacking on the code itself does, and with good reason: the quality of a dictionary sets both a lower and an upper bound on the quality of the possible output. Thus, the project encourages users to create new language pairs, contribute to existing pairs, and tackle tricky phenomena in languages. The manual, too, spends as much time discussing good language data as it does the functionality of the modules, and the project posts statistics on the quality of the language pairs rather than on the individual modules of the engine.
Your first translation(s)
The best way to assess Apertium's readiness for real-world use is to run it on a text of your own, of course. The code is available for Linux, Windows, and Mac OS X. The latest release of the engine is 3.2, from September 2010. It is packaged in Debian (and thus is automatically available for most Debian derivatives, including Ubuntu), and the project provides installation instructions for other popular distributions. The necessary pieces include the apertium package, libapertium, and auxiliary tools provided in lttoolbox and liblttoolbox. But the language pairs are provided as separate packages, each updated independently; they average around 1MB in size, so if space is available the safest bet is to install them all.
The code and language pairs are also available through Subversion.
If building from source, you must compile and install the
lttoolbox packages first, followed by Apertium itself, and the
language modules last. Some language pairs also require that an additional
Constraint Grammar module be installed separately. The wiki lists the
Breton-French and Nynorsk-Bokmål pairs as examples, but there could be others.
The apertium package includes a command-line tool of the same
name. To translate a single file, the syntax is:
apertium path_to_language_pair_directory text_format translation_pair < infile > outfile
The translation_pair argument indicates the direction of the translation (e.g. en-es or es-en). The text format can be txt, html, or rtf. You can also omit the infile and outfile arguments and enter text directly, followed by Ctrl-D.
The project also runs an online translation server on the Demo page of its web site — as well as a bleeding-edge demo bolstered with the unstable language pairs, found at http://xixona.dlsi.ua.es/testing/. The stable demo server can also translate web pages, perform dictionary lookups, and translate uploaded documents (including OpenOffice/LibreOffice ODTs in addition to the formats supported by the command line tool).
For those of us on Linux, an experimental D-Bus service is also available, which is intended to make Apertium more accessible to desktop application developers. The simplest demo application using the D-Bus option is Apertium-tolk, a GTK+ text translation tool written in Python. You simply type text into the top pane of the window, select the appropriate language pair from the drop-down selector, and Apertium generates translated text in the bottom pane. Apertium-tolk did not recognize accented characters in my tests (which appears to be a recurring
issue with Python expecting a different character encoding), but for input text without accents, it produced results on par with the demo web application — with significantly less lag time.
Translation in practice
The demo server and Apertium-tolk are not designed for "serious" work, however. One of the recurring themes on MT sites is the notion that an automatic translation can never replace a human translation — it may serve as a tool to bootstrap a translation or save time on long texts, but an unattended translation is simply not feasible. Consequently, MT tools like Apertium are designed to be incorporated into human-translators' workflows. Those workflows are often very task-specific: someone translating an ebook has different requirements than someone translating gettext using Transifex, and both have different requirements than someone translating subtitles in a video.
On that front, it looks like Apertium has been making inroads in recent years. There are two projects for video subtitle editing: Apertium Subtitles, which is developed entirely by the project, and a separate extension for the Gaupol editor. I tested both on some randomly selected .SRT subtitle files, and (as with Apertium-tolk) both ran into trouble with text encoding and were unable to figure out a good portion of the words.
Regrettably, that makes it virtually impossible to estimate the accuracy
of a translation language pair, because removing multiple words from a
sentence derails the morphological analyzer (the module whose job it is to
decide how to break sentences up into meaningful chunks). This does not appear
to be
a bug in the Apertium core; just a problem some users hit with the GUI
front-ends. Because that step comes before translating individual words, when it fails the stages of the process that follow it fail, too. You can always fall back on the command-line tool (which did not hiccup on accented characters in my tests), but for translation projects that do not revolve around one of the supported file formats, that is hardly a realistic option.
But outside developers are using Apertium to build some other
interesting front-end
services. At least two are focused on online text; there is an
AGPL-licensed "social translation" system called Tradubi, which focuses on building and
customizing language pairs, and a WordPress translation plugin for posts called Transposh. Virtaal is a tool for translating .po text (and, according to the site, will soon tackle other things). It makes up for the highly technical vocabulary found in software strings by augmenting the MT engine with pre-loaded suggestions. Although it has not proven itself practical in the field yet, one of my particular favorites is a plugin for XChat that calls the Apertium D-Bus service to translate other users' comments.
In its current state, Apertium is not a free drop-in replacement for
Google Translate, but its capabilities are impressive — the
third-party tools built on top of it exhibit a degree of polish that the demo applications do not, and you can easily imagine them growing into truly useful services. The developer base is small, but active — the mailing list traffic is steady, there is a roadmap, and the project has participated in Google's Summer of Code.
The one piece that still seems to be missing is a straightforward way for users to contribute useful data back into the language dictionaries. For example, one of the test texts I used on all of the Apertium front-ends was the opening paragraph of Don Quixote, and to my surprise, Apertium was unable to translate some very common words (including cuyo and hidalgo from the first sentence...). Most of the proprietary translation web services provide a simple "suggest a better translation" mechanism; not only does Apertium not offer such a feature, but building and refining the language pairs is an arcane, manual, and closed-off process. Most have individual maintainers, which makes sense from a management perspective, but considering how much of Apertium's intelligence resides in its language pairs, building better tools to improve them would go a long way toward improving the project as a whole.