Apertium: An open source translation engine

December 14, 2011

This article was contributed by Nathan Willis

Understanding human language is a notoriously complicated, hard-to-reduce problem without simple software solutions. On the plus side, this is one of the few things that prevent the machines from becoming our masters, but it also complicates all manner of natural-language processing tasks, from grammar-checking to speech recognition. High on that list of tasks is automated (or semi-automated) translation from one human language to another. Most of us are well acquainted with the benefits of automatically-translated web pages, so it is easy to imagine other areas that would see a boost from better automatic translation: email, instant messaging, video subtitles, or code comments and documentation, to name just a few.

When we covered the Transifex collaborative translation tool in September, I started reading up on open source machine translation (MT) engines. MT is not a particularly fast-paced area of study; most of the projects are focused on academic research rather than on producing user-friendly applications and software modules. In addition, a large percentage of the projects are also focused on either one particular language, or on a particular language pair — which underscores one of the topic's major hurdles: once you stray outside of related languages, our forms of communication can become so different that it is hard for humans to accurately translate between them, much less machines.

Still, there are a few free software projects that both produce usable code and cover a broad enough set of languages to play around with. The leader of the pack is Apertium, a GPL-licensed package originally developed with funding from the Spanish government. Apertium is a rule-based (or transfer-based) MT system, meaning that it takes the input text in one language and attempts to break it down into an intermediate, abstract representation. The abstract message is then used to compose an equivalent text in the output language.

Apertium 101

In 2004, Apertium started off targeting the native languages of Spain (Castilian Spanish, Catalan, Basque, Galician, Asturian, and Occitan), but has subsequently branched out, and now boasts stable support for more than 20 languages. Currently the supported languages are all of European origin, but the unstable set includes considerably more, several of which hail from other continents.

The engine itself requires that language support be built in pairs, with each module consisting of three pieces: a "morphological dictionary" for each of the two languages (which lists words along with rules for how they are inflected), and a third, bilingual dictionary that holds the word mappings between the two. The project's New Language Pair HOWTO goes into further detail about the XML-based syntax of the dictionaries, and how they mark up various parts of speech and language features. Despite the one-to-one nature of the translation engine itself, the various front-end tools written for Apertium can chain several of these pairs together to translate between languages that share no direct pair. However, several languages appear in only a single pair and thus effectively form closed loops; Norwegian Nynorsk and Norwegian Bokmål are one such example.
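
To give a sense of what the HOWTO covers, here is a heavily condensed sketch of a monolingual dictionary, patterned after the HOWTO's example entries; the single paradigm and the handful of tags shown are purely illustrative rather than taken from a shipping pair:

    <dictionary>
      <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
      <sdefs>
        <sdef n="n"/>   <!-- noun -->
        <sdef n="sg"/>  <!-- singular -->
        <sdef n="pl"/>  <!-- plural -->
      </sdefs>
      <pardefs>
        <pardef n="beer__n">  <!-- inflection paradigm: add "s" for the plural -->
          <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
          <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
        </pardef>
      </pardefs>
      <section id="main" type="standard">
        <e lm="beer"><i>beer</i><par n="beer__n"/></e>
      </section>
    </dictionary>

The bilingual dictionary that ties a pair together uses the same <e>/<p>/<l>/<r> entry structure, except that the left and right sides hold the corresponding words (and tags) from the two languages.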

The engine's documentation is actually quite thorough, although that does not make it simple to dig in and start translating with the bundled tools. The translation process involves a number of discrete steps: simply turning the input into the intermediate form requires parsing the input to strip out formatting, breaking the text into segments, finding the base form and appropriate morphological tags for each word, and choosing the best-fit match for each of the ambiguous bits (both for words that could resolve to more than one match, such as homographs and heteronyms, and for sentences and phrases where the meaning itself is unclear). After that, the engine maps the intermediate representation onto base words in the target dictionary, rearranges and grammatically reconstructs the message into words, chunks, and sentences in the target language, and ultimately re-applies any original formatting.

Each step in the translation process is implemented as a separate module; Apertium uses finite-state transducers (FSTs), which are finite state machines that have separate input and output "tapes." The online documentation does not go into as much detail about the various modules as is available in the downloadable PDF manual; if you are interested in learning more about Apertium's lexical processing capabilities, the manual is the place to start.
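
In practice, the apertium command-line wrapper described below chains these modules into an ordinary Unix pipeline. A rough, hand-rolled sketch for a hypothetical xx-yy pair might look like the following; the file names follow the conventions used in the documentation, and real pairs are wired up through a "modes" file and may insert additional transfer stages:

    apertium-destxt < input.txt |                # strip and protect formatting
      lt-proc xx-yy.automorf.bin |               # morphological analysis
      apertium-tagger -g xx-yy.prob |            # part-of-speech disambiguation
      apertium-pretransfer |
      apertium-transfer apertium-xx-yy.xx-yy.t1x xx-yy.t1x.bin xx-yy.autobil.bin |
      lt-proc -g xx-yy.autogen.bin |             # morphological generation
      apertium-retxt > output.txt                # restore the original formatting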

However, it seems clear that a substantial portion of Apertium's skill on any particular translation task is bound up in the quality of the language pairs, not in the core modules themselves. Dictionary creation takes up more of the documentation than hacking on the code itself does, and with good reason: the quality of a dictionary sets both the lower and upper bounds on the quality of the possible output. Thus, the project encourages users to create new language pairs, contribute to existing pairs, and tackle tricky phenomena in languages. The manual, too, spends as much time discussing good language data as it does the functionality of the modules, and the project posts statistics on the quality of the language pairs rather than on the individual modules of the engine.

Your first translation(s)

The best way to assess Apertium's readiness for real-world use is to run it on a text of your own, of course. The code is available for Linux, Windows, and Mac OS X. The latest release of the engine is 3.2, from September 2010. It is packaged in Debian (and thus is automatically available for most Debian derivatives, including Ubuntu), and the project provides installation instructions for other popular distributions. The necessary pieces include the apertium package, libapertium, and auxiliary tools provided in lttoolbox and liblttoolbox. But the language pairs are provided as separate packages, each updated independently; they average around 1MB in size, so if space is available the safest bet is to install them all.
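
On a Debian or Ubuntu system, for instance, the basic pieces can be pulled in with something along these lines; the language-pair package shown is just an example, and exact package names can vary between releases:

    # engine, lexical toolbox, and one example language pair
    sudo apt-get install apertium lttoolbox apertium-en-es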

The code and language pairs are also available through Subversion. If building from source, you must compile and install the lttoolbox packages first, followed by Apertium itself, and the language modules last. Some language pairs also require that an additional Constraint Grammar module be installed separately; the wiki lists the Breton-French and Nynorsk-Bokmål pairs as examples, but there could be others.
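
For source builds, that order translates into something like the following, assuming the usual autotools layout of the Subversion checkouts (the pair directory shown is just an example):

    # build and install the lexical toolbox, then the engine, then a pair
    (cd lttoolbox && ./autogen.sh && make && sudo make install)
    (cd apertium && ./autogen.sh && make && sudo make install)
    sudo ldconfig    # make sure the freshly installed libraries are found
    (cd apertium-en-es && ./autogen.sh && make && sudo make install)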

The apertium package includes a command-line tool of the same name. To translate a single file, the syntax is:

    apertium path_to_language_pair_directory text_format translation_pair < infile > outfile
The translation_pair argument indicates the direction of the translation (e.g. en-es or es-en). The text format can be txt, html, or rtf. You can also omit the infile and outfile arguments and enter text directly, followed by Ctrl-D.
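
As a concrete example, translating a short English sentence to Spanish might look like this, following the syntax above and assuming the pair data is installed under /usr/share/apertium (the exact path is system-dependent):

    echo "The cat is on the table." | apertium /usr/share/apertium/apertium-en-es txt en-es

By default, any word the pair does not recognize is passed through with a leading asterisk, which makes gaps in the dictionaries easy to spot.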

The project also runs an online translation server on the Demo page of its web site — as well as a bleeding-edge demo bolstered with the unstable language pairs, found at http://xixona.dlsi.ua.es/testing/. The stable demo server can also translate web pages, perform dictionary lookups, and translate uploaded documents (including OpenOffice/LibreOffice ODTs in addition to the formats supported by the command line tool).

[apertium-tolk]

For those of us on Linux, an experimental D-Bus service is also available, which is intended to make Apertium more accessible to desktop application developers. The simplest demo application using the D-Bus option is Apertium-tolk, a GTK+ text translation tool written in Python. You simply type text into the top pane of the window, select the appropriate language pair from the drop-down selector, and Apertium generates translated text in the bottom pane. Apertium-tolk did not recognize accented characters in my tests (which appears to be a recurring issue with Python expecting a different character encoding), but for input text without accents, it produced results on par with the demo web application — with significantly less lag time.

Translation in practice

The demo server and Apertium-tolk are not designed for "serious" work, however. One of the recurring themes on MT sites is the notion that an automatic translation can never replace a human translation — it may serve as a tool to bootstrap a translation or save time on long texts, but an unattended translation is simply not feasible. Consequently, MT tools like Apertium are designed to be incorporated into human-translators' workflows. Those workflows are often very task-specific: someone translating an ebook has different requirements than someone translating gettext using Transifex, and both have different requirements than someone translating subtitles in a video.

[Apertium Subtitles]

On that front, it looks like Apertium has been making inroads in recent years. There are two projects for video subtitle editing: Apertium Subtitles, which is developed entirely by the project, and a separate extension for the Gaupol editor. I tested both on some randomly selected .SRT subtitle files and, as with Apertium-tolk, both ran into trouble with text encoding and were unable to figure out a good portion of the words.

Regrettably, that makes it virtually impossible to estimate the accuracy of a translation language pair, because removing multiple words from a sentence derails the morphological analyzer (the module whose job it is to decide how to break sentences up into meaningful chunks). This does not appear to be a bug in the Apertium core, just a problem some users hit with the GUI front-ends. Because that step comes before translating individual words, when it fails, the stages of the process that follow it fail, too. You can always fall back on the command-line tool (which did not hiccup on accented characters in my tests), but for translation projects that do not revolve around one of the supported file formats, that is hardly a realistic option.

But outside developers are using Apertium to build some other interesting front-end services. At least two are focused on online text: an AGPL-licensed "social translation" system called Tradubi, which focuses on building and customizing language pairs, and Transposh, a WordPress plugin that translates posts. Virtaal is a tool for translating .po files (and, according to the site, will soon tackle other things). It makes up for the highly technical vocabulary found in software strings by augmenting the MT engine with pre-loaded suggestions. Although it has not proven itself practical in the field yet, one of my particular favorites is a plugin for XChat that calls the Apertium D-Bus service to translate other users' comments.

In its current state, Apertium is not a free drop-in replacement for Google Translate, but its capabilities are impressive: the third-party tools built on top of it exhibit a degree of polish that the demo applications do not, and you can easily imagine them growing into truly useful services. The developer base is small but active: the mailing list traffic is steady, there is a roadmap, and the project has participated in Google's Summer of Code.

The one piece that still seems to be missing is a straightforward way for users to contribute useful data back into the language dictionaries. For example, one of the test texts I used on all of the Apertium front-ends was the opening paragraph of Don Quixote, and to my surprise, Apertium was unable to translate some very common words (including cuyo and hidalgo from the first sentence...). Most of the proprietary translation web services provide a simple "suggest a better translation" mechanism; not only does Apertium not offer such a feature, but building and refining the language pairs is an arcane, manual, and closed-off process. Most pairs have individual maintainers, which makes sense from a management perspective, but considering how much of Apertium's intelligence resides in its language pairs, building better tools to improve them would go a long way toward improving the project as a whole.



Not quite there yet I'm afraid

Posted Dec 15, 2011 12:33 UTC (Thu) by rvfh (subscriber, #31018)

http://xixona.dlsi.ua.es/testing/ just does not work at all... Judging by one of the screenshots, many features are missing, like treating names that start with a capital letter as not-to-be-translated; witness "Hidalgo of the Stain" (ROTFL)

Thanks for the nice article anyway.

Metal and "let's translate English to English"

Posted Dec 15, 2011 14:29 UTC (Thu) by davecb (subscriber, #1574)

In a previous life, I sat next to the techie for the Siemens "Metal" translator suite. In their biggest production project, they took an interesting approach, which is probably applicable to any translation, not just Metal or Apertium.

They first translated from English to English.

The translation pairs put odd terms into the standard form the readers would expect, removed a large number of idioms, and in general translated "stylish" English into boring, dead-simple English.

Then, and only then, did they translate into French and German. The result was boring but very clear French and German, with very few bizarre phrases popping up.

Some days I wish I had that language module available in my current job, to apply to legislation and caselaw...

--dave

Metal and "let's translate English to English"

Posted Dec 19, 2011 13:06 UTC (Mon) by sorpigal (subscriber, #36106)

It seems like you could build an English-English pair for Apertium and perhaps reproduce this behavior without too much effort.

Apertium: An open source translation engine

Posted Dec 16, 2011 7:46 UTC (Fri) by kleptog (subscriber, #1183)

Thanks for the article. Sounds like a really cool program. I've always been interested in languages, and this combines that interest with programming.

One thing I'm surprised about is that there's no mention of using existing translations to help construct new ones. For example, there are masses of documents online that have been translated into various languages. Open source projects are full of PO files. Translated books are also online. Surely it must be possible to use these to provide hints for missing words in the dictionary?

Apertium: An open source translation engine

Posted Dec 16, 2011 16:02 UTC (Fri) by n8willis (editor, #43041)

That would be statistical or example-based MT. The approaches are different enough that they don't mix into a single algorithm, although a complete MT tool could certainly incorporate more than one method and then somehow attempt to merge the results.

Nate

Apertium: An open source translation engine

Posted Dec 16, 2011 23:38 UTC (Fri) by jimregan (guest, #81866)

Combining SMT, EBMT and RBMT is quite an active area of research at the moment. It goes by a number of names, but 'Hybrid MT' is the most popular. Hybrid systems which used Apertium as one of their components did quite well in the WMT evaluations this year and last.

Mikel Forcada (the instigator of the Apertium project) maintains a list of open source MT systems of all kinds here: http://www.computing.dcu.ie/~mforcada/fosmt.html

Apertium: An open source translation engine

Posted Dec 16, 2011 23:27 UTC (Fri) by jimregan (guest, #81866)

Apertium is a rule-based machine translation system, so we don't directly use existing translations. We do, however, indirectly use them to build dictionaries and rules.

Apertium: An open source translation engine

Posted Dec 19, 2011 13:47 UTC (Mon) by Velmont (guest, #46433)

I've been using the Apertium Bokmål-Nynorsk translation pair for political reasons. Norway's biggest newspaper has forbidden Nynorsk, and I've set up a site that does a translation using Apertium. It uses quite a lot of RAM, so I needed to cache quite aggressively, but it works quite OK.

People quickly notice many faults, but the site is mostly there to gather attention, not actually to be used: http://nynorskvg.no/

I hung around in the project's IRC channel for a bit, and the development community is very welcoming and nice.

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds