
Leading items

Apertium: An open source translation engine

December 14, 2011

This article was contributed by Nathan Willis

Understanding human language is a notoriously complicated, hard-to-reduce problem without simple software solutions. On the plus side, this is one of the few things that prevent the machines from becoming our masters, but it also complicates all manner of natural-language processing tasks, from grammar-checking to speech recognition. High on that list of tasks is automated (or semi-automated) translation from one human language to another. Most of us are well acquainted with the benefits of automatically-translated web pages, so it is easy to imagine other areas that would see a boost from better automatic translation: email, instant messaging, video subtitles, or code comments and documentation, to name just a few.

When we covered the Transifex collaborative translation tool in September, I started reading up on open source machine translation (MT) engines. MT is not a particularly fast-paced area of study; most of the projects are focused on academic research rather than on producing user-friendly applications and software modules. In addition, a large percentage of the projects are also focused on either one particular language, or on a particular language pair — which underscores one of the topic's major hurdles: once you stray outside of related languages, our forms of communication can become so different that it is hard for humans to accurately translate between them, much less machines.

Still, there are a few free software projects that both produce usable code and cover a broad enough set of languages to play around with. The leader of the pack is Apertium, a GPL-licensed package originally developed with funding from the Spanish government. Apertium is a rule-based (or transfer-based) MT system, meaning that it takes the input text in one language and attempts to break it down into an intermediate, abstract representation. The abstract message is then used to compose an equivalent text in the output language.

Apertium 101

In 2004, Apertium started off targeting the native languages of Spain (Castilian Spanish, Catalan, Basque, Galician, Asturian, and Occitan), but has subsequently branched out, and now boasts stable support for more than 20 languages. Currently the supported languages are all of European origin, but the unstable set includes considerably more, several of which hail from other continents.

The engine itself requires that language support be built in pairs, with each module consisting of three pieces: one "morphological dictionary" for each of the two languages (which includes both words and rules for how they are inflected), and a third file that holds word mappings between the languages. The project's New Language Pair HOWTO goes into further detail about the XML-based syntax of the dictionaries, and how they mark up various parts of speech and language features. Despite the one-to-one nature of the translation engine itself, the various front-end tools written for Apertium can chain together several of these one-to-one relationships and translate between many more languages. However, there are several languages that only exist in one language pair and thus effectively form closed loops — such as Norwegian Nynorsk and Norwegian Bokmål.
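To make the chaining idea concrete, here is a minimal sketch of how a front-end might search for a route between two languages. It is not taken from any Apertium tool: the pair list is a made-up sample, the function names are hypothetical, and the sketch assumes every pair works in both directions, which is not guaranteed for real language pairs.

```python
from collections import deque

# Hypothetical sample of installed pairs, invented for illustration.
# Assumption: each pair can be used in both directions.
PAIRS = [("es", "ca"), ("es", "en"), ("es", "gl"), ("en", "eo"), ("nn", "nb")]


def build_graph(pairs):
    """Turn a list of language pairs into an undirected adjacency map."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def find_chain(pairs, source, target):
    """Breadth-first search for the shortest chain of languages.

    Returns a list such as ["ca", "es", "en"], or None when the two
    languages are not connected by any sequence of pairs.
    """
    graph = build_graph(pairs)
    if source not in graph or target not in graph:
        return None
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in sorted(graph[path[-1]]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

With this sample data, `find_chain(PAIRS, "ca", "eo")` routes Catalan to Esperanto through Spanish and English, while a request from Catalan to Bokmål returns `None`, because the Nynorsk-Bokmål pair is cut off from the rest of the graph. Each extra hop is another full translation, so a real front-end would presumably prefer the shortest chain to limit accumulated errors.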

The engine's documentation is actually quite thorough, although that does not make it simple to dig in and start translating using the bundled tools. The translation process involves a number of discrete steps: simply turning the input into the intermediate form requires parsing the input to strip out formatting, breaking the input text into segments, finding the base form and appropriate morphological tags for each word, and choosing the best-fit match for each of the ambiguous bits (both for words that could resolve to more than one match, such as homographs and heteronyms, and for sentences and phrases where the meaning itself is unclear). After that, the engine tries to match the intermediate representation to matching base words in the target dictionary, then it has to rearrange and grammatically reconstruct the message into words, chunks, and sentences in the target language, and ultimately re-apply any original formatting.
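The sequence of stages is easier to follow in miniature. The toy below walks a three-word Spanish phrase through the same order of steps: strip formatting, segment, analyze, disambiguate, look up target words, reorder, generate, and restore formatting. Everything in it is an assumption made for illustration: the dictionaries are tiny Python tables rather than Apertium's XML files, the tag names are simplified, and the single disambiguation and reordering rules stand in for far richer machinery.

```python
import re

# Toy data, invented for illustration only.
ANALYSES = {  # surface form -> candidate (lemma, tags) readings
    "el": [("el", "det")],
    "los": [("el", "det")],
    "gato": [("gato", "n.sg")],
    "gatos": [("gato", "n.pl")],
    "negro": [("negro", "n.sg"), ("negro", "adj")],   # ambiguous
    "negros": [("negro", "n.pl"), ("negro", "adj")],  # ambiguous
}
BILINGUAL = {"el": "the", "gato": "cat", "negro": "black"}
IRREGULAR_FORMS = {("cat", "n.pl"): "cats"}


def deformat(text):
    """Split off one wrapping tag pair, if present."""
    match = re.fullmatch(r"(<\w+>)(.*)(</\w+>)", text)
    return match.groups() if match else ("", text, "")


def analyze(word):
    """Return every reading of a word; unknown words get one dummy reading."""
    return ANALYSES.get(word.lower(), [(word, "unknown")])


def disambiguate(candidates, previous_tags):
    """Prefer an adjective reading right after a noun, else the first reading."""
    if previous_tags.startswith("n"):
        for lemma, tags in candidates:
            if tags == "adj":
                return lemma, tags
    return candidates[0]


def reorder(units):
    """Structural transfer: Spanish noun + adjective becomes adjective + noun."""
    out = list(units)
    i = 0
    while i < len(out) - 1:
        if out[i][1].startswith("n") and out[i + 1][1] == "adj":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out


def generate(lemma, tags):
    """Produce a surface form; unknown words are flagged with an asterisk."""
    if tags == "unknown":
        return "*" + lemma
    return IRREGULAR_FORMS.get((lemma, tags), lemma)


def translate(text):
    prefix, body, suffix = deformat(text)
    units, previous_tags = [], ""
    for word in body.split():                      # segmentation
        lemma, tags = disambiguate(analyze(word), previous_tags)
        units.append((BILINGUAL.get(lemma, lemma), tags))  # lexical transfer
        previous_tags = tags
    words = [generate(lemma, tags) for lemma, tags in reorder(units)]
    return prefix + " ".join(words) + suffix       # re-apply formatting
```

Calling `translate("<p>el gato negro</p>")` exercises every stage: "negro" has two readings, the noun before it tips the choice toward the adjective, and the reordering rule then moves it in front of "cat". A word missing from the toy dictionary passes through marked with an asterisk, which mirrors how a gap in the language data, rather than in the engine, degrades the output.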

Each step in the translation process is implemented as a separate module; Apertium uses finite-state transducers (FSTs), which are finite state machines that have separate input and output "tapes." The online documentation does not go into as much detail about the various modules as the downloadable PDF manual does; if you are interested in learning more about Apertium's lexical processing capabilities, the manual is the place to start.
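For readers who have not met transducers before, the following sketch shows the two-tape idea on a single word. It is a hand-built, deterministic toy and not how lttoolbox represents or compiles its transducers; the state numbering, the function names, and the `<n><sg>` style tag notation are all assumptions chosen for readability.

```python
def build_transducer():
    """Build a tiny transducer that analyzes "gato" and "gatos".

    transitions maps (state, input symbol) to (next state, output text);
    finals maps each accepting state to the tags emitted at end of word.
    """
    transitions = {}
    state = 0
    for letter in "gato":
        # Copy the stem letters from the input tape to the output tape.
        transitions[(state, letter)] = (state + 1, letter)
        state += 1
    # After the stem (state 4), a plural "s" is consumed but not copied.
    transitions[(4, "s")] = (5, "")
    finals = {4: "<n><sg>", 5: "<n><pl>"}
    return transitions, finals


def transduce(word, transitions, finals, start=0):
    """Read word on the input tape and return the output tape, or None."""
    state, output = start, []
    for symbol in word:
        step = transitions.get((state, symbol))
        if step is None:
            return None          # no transition: the word is not recognized
        state, emitted = step
        output.append(emitted)
    if state not in finals:
        return None              # stopped in a non-accepting state
    return "".join(output) + finals[state]
```

Feeding "gatos" through `transduce` consumes five input symbols and writes the base form plus tags, `gato<n><pl>`, to the output; "gat" is rejected because the machine stops short of an accepting state. Swapping the roles of the two tapes would turn the same machine from an analyzer into a generator, which is one reason the formalism suits both ends of a translation pipeline.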

However, it seems clear that a substantial portion of Apertium's skill on any particular translation task is bound up in the quality of the language-pairs, and not with the core modules themselves. Dictionary creation takes up more of the documentation than hacking on the code itself does — and with good reason. The quality of a dictionary sets lower- and upper-bounds on the quality of the output possible. Thus, the project encourages users to create new language pairs, contribute to existing pairs, and tackle tricky phenomena in languages. The manual, too, spends as much time discussing good language data as it does the functionality of the modules — and the project posts statistics on the quality of the language pairs, rather than the individual modules of the engine.

Your first translation(s)

The best way to assess Apertium's readiness for real-world use is to run it on a text of your own, of course. The code is available for Linux, Windows, and Mac OS X. The latest release of the engine is 3.2, from September 2010. It is packaged in Debian (and thus is automatically available for most Debian derivatives, including Ubuntu), and the project provides installation instructions for other popular distributions. The necessary pieces include the apertium package, libapertium, and auxiliary tools provided in lttoolbox and liblttoolbox. But the language pairs are provided as separate packages, each updated independently; they average around 1MB in size, so if space is available the safest bet is to install them all.

The code and language pairs are also available through Subversion. If building from source, you must compile and install the lttoolbox packages first, followed by Apertium itself, and the language modules last. Some language pairs also require an additional Constraint Grammar module be installed separately. The wiki lists the Breton-French and Nynorsk-Bokmål pairs, as examples — but there could be others.

The apertium package includes a command-line tool of the same name. To translate a single file, the syntax is:

    apertium path_to_language_pair_directory text_format translation_pair < infile > outfile

The translation_pair argument indicates the direction of the translation (e.g. en-es or es-en). The text format can be txt, html, or rtf. You can also omit the infile and outfile arguments and enter text directly, followed by Ctrl-D.
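For batch jobs it can be handy to drive the command-line tool from a script. The sketch below is one way to do that from Python; the helper names are hypothetical. It assumes the installed `apertium` script takes the language-pair directory and the text format through `-d` and `-f` options, so check the tool's usage message on your system before relying on it.

```python
import subprocess

# The three formats named in the text above.
SUPPORTED_FORMATS = {"txt", "html", "rtf"}


def build_command(pair_dir, text_format, direction):
    """Assemble the argument list for one translation run.

    Assumption: the directory and format are passed with -d and -f.
    """
    if text_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported text format: {text_format!r}")
    return ["apertium", "-d", pair_dir, "-f", text_format, direction]


def translate_file(pair_dir, text_format, direction, infile, outfile):
    """Translate infile into outfile, mirroring the shell redirections."""
    command = build_command(pair_dir, text_format, direction)
    with open(infile, "rb") as source, open(outfile, "wb") as target:
        # check=True raises CalledProcessError if apertium exits non-zero.
        subprocess.run(command, stdin=source, stdout=target, check=True)
```

Opening both files in binary mode hands the bytes to the tool untouched, which sidesteps any character-encoding guesswork in the wrapper itself.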

The project also runs an online translation server on the Demo page of its web site — as well as a bleeding-edge demo bolstered with the unstable language pairs, found at http://xixona.dlsi.ua.es/testing/. The stable demo server can also translate web pages, perform dictionary lookups, and translate uploaded documents (including OpenOffice/LibreOffice ODTs in addition to the formats supported by the command line tool).

[apertium-tolk]

For those of us on Linux, an experimental D-Bus service is also available, which is intended to make Apertium more accessible to desktop application developers. The simplest demo application using the D-Bus option is Apertium-tolk, a GTK+ text translation tool written in Python. You simply type text into the top pane of the window, select the appropriate language pair from the drop-down selector, and Apertium generates translated text in the bottom pane. Apertium-tolk did not recognize accented characters in my tests (which appears to be a recurring issue with Python expecting a different character encoding), but for input text without accents, it produced results on par with the demo web application — with significantly less lag time.

Translation in practice

The demo server and Apertium-tolk are not designed for "serious" work, however. One of the recurring themes on MT sites is the notion that an automatic translation can never replace a human translation — it may serve as a tool to bootstrap a translation or save time on long texts, but an unattended translation is simply not feasible. Consequently, MT tools like Apertium are designed to be incorporated into human-translators' workflows. Those workflows are often very task-specific: someone translating an ebook has different requirements than someone translating gettext using Transifex, and both have different requirements than someone translating subtitles in a video.

[Apertium Subtitles]

On that front, it looks like Apertium has been making inroads in recent years. There are two projects for video subtitle editing: Apertium Subtitles, which is developed entirely by the project, and a separate extension for the Gaupol editor. I tested both on some randomly-selected .SRT subtitles files, and (as with Apertium-tolk), both ran into trouble with text encoding and were unable to figure out a good portion of the words.

Regrettably, that makes it virtually impossible to estimate the accuracy of a translation language pair, because removing multiple words from a sentence derails the morphological analyzer (the module whose job it is to decide how to break sentences up into meaningful chunks). This does not appear to be a bug in the Apertium core, just a problem some users hit with the GUI front-ends. Because that step comes before translating individual words, when it fails the stages of the process that follow it fail, too. You can always fall back on the command-line tool (which did not hiccup on accented characters in my tests), but for translation projects that do not revolve around one of the supported file formats, that is hardly a realistic option.

But outside developers are using Apertium to build some other interesting front-end services. At least two are focused on online text; there is an AGPL-licensed "social translation" system called Tradubi, which focuses on building and customizing language pairs, and a WordPress translation plugin for posts called Transposh. Virtaal is a tool for translating .po text (and, according to the site, will soon tackle other things). It makes up for the highly technical vocabulary found in software strings by augmenting the MT engine with pre-loaded suggestions. Although it has not proven itself practical in the field yet, one of my particular favorites is a plugin for XChat that calls the Apertium D-Bus service to translate other users' comments.

In its current state, Apertium is not a free drop-in-replacement for Google Translate, but its capabilities are impressive — the third-party tools built on top of it exhibit a degree of polish that the demo applications do not, and you can easily imagine them growing into truly useful services. The developer base is small, but active — the mailing list traffic is steady, there is a roadmap, and the project has participated in Google's Summer of Code.

The one piece that still seems to be missing is a straightforward way for users to contribute useful data back into the language dictionaries. For example, one of the test texts I used on all of the Apertium front-ends was the opening paragraph of Don Quixote, and to my surprise, Apertium was unable to translate some very common words (including cuyo and hidalgo from the first sentence...). Most of the proprietary translation web services provide a simple "suggest a better translation" mechanism; not only does Apertium not offer such a feature, but building and refining the language pairs is an arcane, manual, and closed-off process. Most have individual maintainers, which makes sense from a management perspective, but considering how much of Apertium's intelligence resides in its language pairs, building better tools to improve them would go a long way toward improving the project as a whole.


An update on the Ada Initiative

December 13, 2011

This article was contributed by Valerie Aurora (formerly Henson)

The Ada Initiative is a non-profit dedicated to increasing the participation of women in open technology and culture. In other words, we want more women in open source, Wikipedia, and the rest of our brave new Internet world. A lot of people agree with that goal - at least that's what our first Ada Initiative survey told us. (Note that the Ada Initiative has absolutely nothing to do with the Ada programming language, other than sharing a namesake. Your author wrote Ada 95 for a living once and sincerely hopes to never touch another "bondage-and-discipline" language again.)

LWN readers might remember us from our launch announcement back in February, as well as our first "Seed 100" fundraising campaign. This article is an update on what Ada Initiative has done since its founding, what we're doing next, and what you can do to help.

Accomplishments this year

The most surprising and visible change in open tech/culture communities over the last year was the widespread adoption of some form of code of conduct or anti-harassment policy by over 30 conferences and organizations. (The exact number is hard to determine since some organizations adopted the policy for all their events and put on dozens of events per year.) Many major Linux and open source conferences have a policy: all Linux Foundation events, including Plumbers, LinuxCon, and Kernel Summit, linux.conf.au, all Ubuntu Developer Summits, several PyCons, OSBridge, all O'Reilly conferences (pledged), and many more. The idea that everyone should be able to attend a conference without expecting to be harassed or threatened is spreading to fan, science-fiction, and open culture events as well. Given the level of controversy a year ago, this shows a strong change in public opinion across a broad swath of the open technology and culture community.

We ran two surveys this year. First was the Ada Initiative Census (part 1 and part 2), with over 2800 responses (about 1600 from women). We ran this survey to find out what people thought about women in open technology and culture, which communities had more women than others, and if people felt that having more women was a good goal or not. A lot of people had told us that they wanted more women in open tech/culture, and that they felt many communities weren't very welcoming of women, but it was good to get statistical confirmation from several thousand people.

Our second survey asked attendees of the Grace Hopper Celebration (mainly young women in university computer science programs) about their attitudes towards careers in open source. It was an extremely simple and open-ended survey; nonetheless, two common themes appeared: most believed that you couldn't get paid to write open source software (or that it paid much less than closed source), and that the "personalities" and culture of open source were intimidating and unpleasant. This is important information to know so that efforts to recruit new college graduates into open source jobs can be successful.

We also achieved our goal of being the "go-to" organization for advice on how to respond to incidents of harassment in a way that says women are a welcome and valued part of the community. Often it's merely an issue of raising awareness: Most people simply don't know that harassment and bad behavior are happening. If you're a famous, well-known, influential member of the open source community, you're likely to be treated very well, and if you do run into any obnoxiousness, you have a lot of friends willing to come to your aid quickly. You won't have much of an idea of how newcomers are treated, or people who look different from you, or don't have as many friends as you, unless you go looking for their stories. One place to find these stories is on the "Timeline of sexist incidents in geek communities" maintained on the Geek Feminism wiki.

We organized our first AdaCamp for January 14, 2012, in Melbourne, Australia (the Saturday before linux.conf.au 2012 in nearby Ballarat). AdaCamp is a small invitation-only unconference bringing together a variety of people to collaborate on ways to increase the participation of women in open technology and culture. We have people from open source, but are making a strong effort to bring in people from Wikipedia, fan culture, and other areas. Applications are currently open; if you know someone who should attend, encourage them to apply! For our North American AdaCampers, our next AdaCamp is tentatively planned to coincide with Wikimania in Washington D.C. in July.

We wrote a first draft of Ada's Advice, a guide to useful resources for people who want to help women in open tech/culture, organized by the role of the person looking for advice: parent of a young daughter, employer looking to hire more women, women in open tech/culture themselves. I'm constantly trying to find that link to that one article that I vaguely remember being somewhere on the Geek Feminism Wiki and failing; this is my solution. We are also planning to write short summaries of books and longer articles, as well as some original content and updating older content (such as generalizing HOWTO Encourage Women in Linux to open tech/culture overall). We think that people shouldn't have to read ten books before they can start helping women effectively.

Ada's Careers is a project in the planning stage. This is our answer to the abandoned job postings mailing list - you know, the one you create after recruiters keep trying to post jobs to your development mailing list and then no one ever reads again? Well, we want to create a career development community: a place where women hang out all the time because it helps them at all stages of their careers, not just when they are looking for a new job. Finally, we'll have an answer to "Where do I put my job posting where qualified women will read it?"

Another project we'd like to run is First Patch Week. Often, experience writing open source is a prerequisite to getting a job in open source. At the same time, women face extra barriers to getting that unpaid experience, from local user group meetings that are often uncomfortable for women to attend to IRC servers where users perceived to be female are 25 times more likely [PDF] to get a nasty private message than those perceived to be male. We want to partner with an open source company to donate a week of their programmers' time to mentor women through the process of creating and submitting their first patch to an open source project. This will be an expensive and time-consuming project to run the first time, but will get easier as we repeat it, and will have a major, direct effect on the number of women available and qualified to be hired by open source companies.

We have some other project ideas but these are the ones we're most likely to do soon. What project do you want to see finished next? Leave us a comment telling us what your favorite is.

The not-so-fun stuff: Paperwork and government regulations

I'm a kernel programmer by training, so it's not that surprising that I found myself comparing the process of incorporating a U.S. non-profit with booting a kernel. You have to bootstrap from a couple of people with an idea for a non-profit to a legally registered corporation with strict oversight by a board of directors, with every step along the way properly authorized and recorded. It may not be the best analogy to explain how to found a non-profit, since most people don't know how the boot process works either, but since this is an article for LWN I can get away with it.

The non-profit/boot analogy goes thus: (1) file articles of incorporation (BIOS) and bylaws (bootloader), (2) take "action by incorporator" to appoint the board of directors (secondary bootloader), (3) board votes for standard "startup" motions (kernel initialization), then (4) board meets regularly to vote on new motions, elect new board members, and delegate tasks (servicing interrupts, running processes).

The "articles of incorporation" are paperwork you send to a state government declaring that you are a non-profit corporation. The articles of incorporation describe the ground rules of the corporation and don't change. The bylaws, which can change, are filed at the same time as the articles of incorporation and describe how the corporation is governed - stuff like how the board of directors is elected.

To me, the most obscure part of the bootstrapping process was the "action by incorporator." Sure, the bylaws say how your board of directors elects new directors, but how do you get your board in the first place? What happens is that the person who filed the articles of incorporation (me, in this case) writes down who they appoint to the board of directors, states they relinquish all rights as incorporator, and then signs and dates the document. Presto, the corporation now has a board of directors in complete control.

From there on out, everything is governed by votes by the board of directors. The board usually delegates a lot of stuff to the officers so it doesn't have to meet every time the hosting bill has to be paid. There is an initial set of standard motions that most corporations pass that is similar to kernel initialization, allowing the officers to do things like hire lawyers and buy liability insurance. After that, the board meets routinely and as-needed (which is like responding to timer ticks or servicing interrupts) to vote on new motions. We even have an equivalent of AppArmor or SELinux: We have to make detailed yearly reports to the U.S. tax service on our finances and management, beginning with filing an incredibly complex and expensive application for tax-exempt status.

The annoying stuff: Fundraising

Fundraising is a lot like funding a startup except that no one gets rich. We began in classic self-funded startup fashion: For 7 months we lived on our savings and part-time consulting work. We also had angel funders who trusted us enough to give us money on faith: Linux Australia, Puppet Labs, and the Ceph division of DreamHost. Next we raised a round of "seed funding": 100 donors of $512 or more in our Seed 100 round (actually, 103 because we couldn't close the drive fast enough). We've nearly used up our startup capital and have started our first general fundraising drive, open to both small individual donors and large corporate donors. If you like the work we're doing, and want to see things like Ada's Advice and First Patch Week become a reality, please donate now and tell your friends about us too!

We're still debating the long-term funding model for the Ada Initiative. Should companies who benefit financially from open source and open culture fund most of the Ada Initiative? Should we rely on lots of small individual donors like Wikimedia? Should we sell t-shirts? Tell us what you think in the comments!


2011 Linux and free software timeline - Q3

Here is LWN's fourteenth annual timeline of significant events in the Linux and free software world for the year.

We will be breaking the timeline up into quarters, and this is our report on July-September 2011. Next week, we will put out the timeline for the last quarter of 2011.


This is version 0.8 of the 2011 timeline. There are almost certainly some errors or omissions; if you find any, please send them to timeline@lwn.net.

LWN subscribers have paid for the development of this timeline, along with previous timelines and the weekly editions. If you like what you see here, or elsewhere on the site, please consider subscribing to LWN.

For those with a nostalgic bent, our timeline index page has links to the previous thirteen timelines and some other retrospective articles going all the way back to 1998.

July

A backdoor is found in the vsftpd source code (LWN blurb).

Most well-adjusted people would not stand up in a crowd of people and start calling people around them idiots. Just because there is a monitor and a network cable separating you from the crowd doesn't make it ok, and I am tired of it.

-- Rasmus Lerdorf

[Open Hardware logo] CERN releases version 1.1 of its Open Hardware License (announcement).

Project Harmony releases version 1.0 of its contributor agreements (LWN blurb, agreements).

Nortel sells a huge pile of patents covering networking and lots more to a consortium made up of Apple, EMC, Ericsson, Microsoft, Research In Motion, and Sony. Google also unsuccessfully bid on the patents (Reuters article).

The VLC media player reports that companies are bundling it with adware/spyware, which is an increasing problem for free software projects (announcement, LWN article).

I am quite at ease not participating in netfilter/iptables anymore while the discussion about IPv6 NAT becomes an issue again: I always indicated "over my dead body", and now that I am no longer in charge, nobody will have to kill me ;)

-- Harald Welte

[CentOS logo]

CentOS 6.0 is released, eight months after RHEL 6 (announcement, release notes).

The realtime kernel tree moves to 3.0 after being based on 2.6.33 for a long time (3.0-rc7-rt0 announcement).

IBM promises to contribute the Symphony fork of OpenOffice.org (OOo) to the Apache OOo project (announcement).

Oracle acquires Ksplice, Inc., makers of the ksplice no-reboot kernel patching product (announcement, LWN article: Ksplice and CentOS).

As already mentioned several times, there are no special landmark features or incompatibilities related to the version number change, it's simply a way to drop an inconvenient numbering system in honor of twenty years of Linux.

-- Linus Torvalds announces 3.0

Linux 3.0 is released without any major changes that some might assume come with the move from 2.6.x (announcement, KernelNewbies summary, and Who wrote 3.0).

Mozilla announces the "Boot to Gecko" standalone operating system, which is based on Linux (announcement, LWN coverage).

Several versions of Emacs ship without all of the source code, which does not comply with the GPL, though the FSF itself is not violating the license (LWN coverage).

[digiKam logo] The digiKam software collection 2.0.0 is released; digiKam SC is a photo editor and related tools (announcement, LWN review).

KDE Software Compilation 4.7 is released (announcement).

DebConf 2011 is held July 24-30 in Banja Luka, Bosnia and Herzegovina.

August

[Desktop summit logo]

The second Desktop Summit is held in Berlin, Germany, August 6-12; it is a combination of GNOME's GUADEC and KDE's Akademy conferences (LWN coverage: Companies and open source, Copyright assignments, Desktop crypto consolidation, Service design, Plasma Active).

Every time I get frustrated with doing paperwork, I simply imagine having the job of estimating how much time it takes to do paperwork, and I feel better immediately.

-- Valerie Aurora

Samba 3.6.0 is released (announcement).

Debian celebrates its 18th birthday, just two years younger than Linux itself (announcement).

Google announces its intent to acquire Motorola Mobility, mostly for its patents, it would seem (announcement).

The first release candidate of the Mozilla Public License 2.0 is released (announcement, an LWN look at the update process).

But if you want to be taken seriously as a researcher, you should publish your code! Without publication of your *code* research in your area cannot be reproduced by others, so it is not science.

-- Guido van Rossum

[LinuxCon NA logo]

LinuxCon North America is held August 17-19 in Vancouver, Canada and celebrates 20 years of Linux (LWN coverage: Clay Shirky on collaboration, Largest desktop Linux deployment, FreedomBox, x86 platform drivers, MeeGo architecture update, ConnMan, and Mobile Linux patent landscape).

COSCUP 2011 is held in Taipei, Taiwan August 20-21 (LWN coverage: Year of the Linux tablet?).

[xkcd password strength] A serious denial-of-service attack against Apache web servers is seen in the wild (announcement, LWN coverage).

HP announces it is dropping its webOS devices (press release).

The 20th anniversary of the first Linux post is August 25; the now-famous "just a hobby" post to comp.os.minix.

The Certificate Authority system as it stands today is a house of cards and we're witnessing in public what many have known for years in private. The entire system is soaked in petrol and waiting for a light.

-- Jacob Appelbaum

DigiNotar issues fraudulent SSL/TLS certificates for several domains including google.com in July, but it is discovered in August (LWN blurb and coverage).

The kernel.org server is found to be compromised; the compromise affects various Linux Foundation servers as well; it will take some time for things to get back to normal (LWN coverage).

[Mandriva logo] Mandriva 2011 ("Hydrogen") is released (announcement, release notes).

September

The Linux Plumbers Conference is held in Santa Rosa, California, September 7-9 (LWN coverage: Development model diversity, Booting and systemd, Making the net go faster, Coping with hardware diversity, Bufferbloat update, and Control groups).

No developer ever thinks their change is going to break anything for anyone. It's the QA Law of What Could Possibly Go Wrong.

-- Adam Williamson

The Linux Security Summit is held with Plumbers (LWN coverage: LSM roundtable and Kernel hardening roundtable).

[PostgreSQL logo] PostgreSQL 9.1 is released (announcement, LWN article).

[Qt logo] The Qt Project is announced for more open governance of the free software UI toolkit (announcement).

Coherent vision isn't something that the kernel community really values.

-- Neil Brown

The openSUSE conference is held in Nürnberg, Germany September 11-14 (Conference wrap-up).

[OpenShot logo] The OpenShot video editor releases version 1.4 (announcement).

UEFI "secure boot" and Microsoft's mandate of it for Windows 8 hardware starts to concern free operating system developers (Matthew Garrett blog posts: Part 1, Part 2; LWN article).

Not spending as much time sitting in meetings and fighting with other vendors is one of the competitive advantages PostgreSQL development has vs. the "big guys". There needs to be a pretty serious problem with your process before adding bureaucracy to it is anything but a backwards move. And standardization tends to attract lots of paperwork. Last thing you want to be competing with a big company on is doing that sort of big company work.

-- Greg Smith

GNOME 3.2 is released (announcement, release notes).

[digiKam logo] PulseAudio 1.0 is released (announcement, release notes).

Tizen, the successor to MeeGo, is announced, which incorporates technology from the LiMo project; the announcement comes less than a month after Intel says it is "fully committed" to MeeGo (announcement, LWN coverage).

The Berlios code repository announces that it will shut down at the end of the year (announcement, LWN coverage).


Page editor: Jonathan Corbet


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds