December 14, 2011
This article was contributed by Nathan Willis
Understanding human language is a notoriously complicated,
hard-to-reduce problem without simple software solutions. On the plus
side, this is one of the few things that prevent the machines from becoming
our masters, but it also complicates all manner of natural-language
processing tasks, from grammar-checking to speech recognition. High on
that list of tasks is automated (or semi-automated) translation from one
human language to another. Most of us are well acquainted with the
benefits of automatically-translated web pages, so it is easy to imagine
other areas that would see a boost from better automatic translation: email, instant messaging, video subtitles, or code comments and documentation, to name just a few.
When we covered the Transifex collaborative translation tool in September, I started reading up on open source machine translation (MT) engines. MT is not a particularly fast-paced area of study; most of the projects are focused on academic research rather than on producing user-friendly applications and software modules. In addition, a large percentage of the projects are also focused on either one particular language, or on a particular language pair — which underscores one of the topic's major hurdles: once you stray outside of related languages, our forms of communication can become so different that it is hard for humans to accurately translate between them, much less machines.
Still, there are a few free software projects that both produce usable
code and cover a broad enough set of languages to play around with. The
leader of the pack is Apertium, a
GPL-licensed package originally developed with funding from the Spanish
government. Apertium is a rule-based (or transfer-based) MT system, meaning that it takes the input text in one language and attempts to break it down into an intermediate, abstract representation. The abstract message is then used to compose an equivalent text in the output language.
Apertium 101
In 2004, Apertium started off targeting the native languages of Spain
(Castilian Spanish, Catalan, Basque, Galician, Asturian, and Occitan), but
has subsequently branched out, and now boasts stable support for more than 20 languages. Currently the supported languages are all of European origin, but the unstable set includes considerably more, several of which hail from other continents.
The engine itself requires that language support be built in pairs, with each module consisting of three pieces: one "morphological dictionary" for each of the two languages (which includes both words and rules for how they are inflected), and a third file that holds word mappings between the languages. The project's New Language Pair HOWTO goes into further detail about the XML-based syntax of the dictionaries, and how they mark up various parts of speech and language features. Despite the one-to-one nature of the translation engine itself, the various front-end tools written for Apertium can chain together several of these one-to-one relationships and translate between many more languages. However, there are several languages that only exist in one language pair and thus effectively form closed loops — such as Norwegian Nynorsk and Norwegian Bokmål.
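One way to picture the chaining is as path-finding over a graph whose edges are the installed language pairs. What follows is a minimal sketch under stated assumptions, not code from Apertium or its front-ends: the PAIRS list is invented for illustration, find_chain is a hypothetical helper, and every pair is treated as usable in both directions, which may not hold for real pairs.

```python
from collections import deque

# Hypothetical set of installed pairs (illustration only), treated as
# bidirectional for simplicity.
PAIRS = [("es", "ca"), ("es", "en"), ("en", "gl"), ("nn", "nb")]


def build_graph(pairs):
    """Turn a list of language pairs into an adjacency map."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def find_chain(pairs, source, target):
    """Return the shortest list of languages linking source to target, or None."""
    graph = build_graph(pairs)
    if source not in graph or target not in graph:
        return None
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # Sorted so that the result is deterministic when several routes tie.
        for nxt in sorted(graph[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

With this toy data, find_chain(PAIRS, "ca", "gl") routes through Spanish and English, while find_chain(PAIRS, "nn", "en") returns None because the two Norwegian variants connect only to each other.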
The engine's documentation is actually quite thorough, although that
does not make it simple to dig in and start translating using the bundled
tools. The translation process involves a number of discrete steps: simply
turning the input into the intermediate form requires parsing the input to
strip out formatting, breaking the input text into segments, finding the
base form and appropriate morphological tags for
each word, and choosing the best-fit match for each of the ambiguous bits
(both for words that could resolve to more than one match, such as homographs and heteronyms, and for sentences and phrases where the meaning itself is unclear). After that, the engine tries to match the intermediate representation to matching base words in the target dictionary, then it has to rearrange and grammatically reconstruct the message into words, chunks, and sentences in the target language, and ultimately re-apply any original formatting.
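The stages described above can be mimicked with a toy pipeline. This is only a sketch of the general shape, not Apertium's implementation: the dictionaries are invented, the function names are hypothetical, the disambiguation step is skipped entirely, and the formatting is thrown away rather than restored. Marking unknown words with an asterisk is simply the convention this sketch uses.

```python
import re

# Toy data, invented for illustration only.
ANALYSES = {
    "los": ("el", "det.pl"),
    "gatos": ("gato", "n.pl"),
    "negros": ("negro", "adj.pl"),
}
BILINGUAL = {"el": "the", "gato": "cat", "negro": "black"}


def deformat(html):
    """Strip markup (a real system would remember it and restore it later)."""
    return re.sub(r"<[^>]+>", "", html)


def analyse(text):
    """Map each surface word to (lemma, tags); unknown words get a '*' prefix."""
    return [ANALYSES.get(w, ("*" + w, "unk")) for w in text.lower().split()]


def transfer(units):
    """Swap lemmas via the bilingual dictionary, then apply one reordering
    rule: a noun followed by an adjective becomes adjective then noun."""
    out = [(BILINGUAL.get(lemma, lemma), tags) for lemma, tags in units]
    i = 0
    while i < len(out) - 1:
        if out[i][1].startswith("n") and out[i + 1][1].startswith("adj"):
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out


def generate(units):
    """Inflect for the target language: only plural nouns take an 's' here."""
    return " ".join(
        lemma + "s" if tags == "n.pl" else lemma for lemma, tags in units
    )


def translate(html):
    """Run the stages in order: deformat, analyse, transfer, generate."""
    return generate(transfer(analyse(deformat(html))))
```

Calling translate("<b>Los gatos negros</b>") yields "the black cats"; a word missing from the toy dictionary, such as "perros", passes through as "*perros".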
Each step in the translation process is implemented as a separate
module; Apertium uses finite-state transducers (FSTs), which are finite
state machines that have separate input and output "tapes." The online documentation
does not go into as much detail about the various modules as the downloadable
PDF manual does; if you are interested in learning more about Apertium's lexical processing capabilities, the manual is the place to start.
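To make the two-tape idea concrete, here is a tiny hand-built transducer that reads a word on the input tape and writes an analysis on the output tape. It is a sketch for illustration only: the transition table, the tag notation, and the transduce function are assumptions of mine and are not taken from lttoolbox.

```python
# Transitions: (state, input_symbol) -> (next_state, output_string)
TRANSITIONS = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t"),
    (3, "s"): (4, "<n><pl>"),
}

# Accepting states, each with the output emitted if the input ends there.
FINALS = {3: "<n><sg>", 4: ""}


def transduce(word):
    """Run the transducer over word; return the output tape, or None if
    the word is rejected."""
    state, output = 0, []
    for symbol in word:
        step = TRANSITIONS.get((state, symbol))
        if step is None:
            return None  # no transition for this symbol: reject
        state, emitted = step
        output.append(emitted)
    if state not in FINALS:
        return None  # input ended in a non-accepting state
    return "".join(output) + FINALS[state]
```

Here transduce("cats") returns "cat<n><pl>" and transduce("cat") returns "cat<n><sg>", while anything else is rejected with None. A real morphological dictionary would compile many thousands of entries into one shared machine rather than listing transitions by hand.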
However, it seems clear that a substantial portion of Apertium's skill on any particular translation task is bound up in the quality of the language-pairs, and not with the core modules themselves. Dictionary creation takes up more of the documentation than hacking on the code itself does — and with good reason. The quality of a dictionary sets lower- and upper-bounds on the quality of the output possible. Thus, the project encourages users to create new language pairs, contribute to existing pairs, and tackle tricky phenomena in languages. The manual, too, spends as much time discussing good language data as it does the functionality of the modules — and the project posts statistics on the quality of the language pairs, rather than the individual modules of the engine.
Your first translation(s)
The best way to assess Apertium's readiness for real-world use is to run it on a text of your own, of course. The code is available for Linux, Windows, and Mac OS X. The latest release of the engine is 3.2, from September 2010. It is packaged in Debian (and thus is automatically available for most Debian derivatives, including Ubuntu), and the project provides installation instructions for other popular distributions. The necessary pieces include the apertium package, libapertium, and auxiliary tools provided in lttoolbox and liblttoolbox. But the language pairs are provided as separate packages, each updated independently; they average around 1MB in size, so if space is available the safest bet is to install them all.
The code and language pairs are also available through Subversion.
If building from source, you must compile and install the
lttoolbox packages first, followed by Apertium itself, and the
language modules last. Some language pairs also require an additional
Constraint Grammar module be installed separately. The wiki lists the
Breton-French and Nynorsk-Bokmål pairs, as examples — but there could be others.
The apertium package includes a command-line tool of the same
name. To translate a single file, the syntax is:
apertium path_to_language_pair_directory text_format translation_pair < infile > outfile
The translation_pair argument indicates the direction of the translation (e.g. en-es or es-en). The text format can be txt, html, or rtf. You can also omit the infile and outfile arguments and enter text directly, followed by Ctrl-D.
The project also runs an online translation server on the Demo page of its web site — as well as a bleeding-edge demo bolstered with the unstable language pairs, found at http://xixona.dlsi.ua.es/testing/. The stable demo server can also translate web pages, perform dictionary lookups, and translate uploaded documents (including OpenOffice/LibreOffice ODTs in addition to the formats supported by the command line tool).
For those of us on Linux, an experimental D-Bus service is also available, which is intended to make Apertium more accessible to desktop application developers. The simplest demo application using the D-Bus option is Apertium-tolk, a GTK+ text translation tool written in Python. You simply type text into the top pane of the window, select the appropriate language pair from the drop-down selector, and Apertium generates translated text in the bottom pane. Apertium-tolk did not recognize accented characters in my tests (which appears to be a recurring
issue with Python expecting a different character encoding), but for input text without accents, it produced results on par with the demo web application — with significantly less lag time.
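I did not track down the cause in Apertium-tolk itself, but the general failure mode is easy to reproduce. The sketch below assumes, purely for illustration, that text written as UTF-8 is read back as Latin-1; misread is a hypothetical helper, not part of any Apertium tool.

```python
def misread(text, written_as="utf-8", read_as="latin-1"):
    """Encode text one way and decode it another, as a mismatched reader would."""
    return text.encode(written_as).decode(read_as)


original = "Bokmål"
garbled = misread(original)  # the single accented letter becomes two characters

# Plain ASCII survives the round trip, which would explain why unaccented
# input translates normally while accented words are not recognized.
unaffected = misread("hidalgo")
```

Because the garbled form no longer matches any dictionary entry, a mismatch like this would make every accented word look unknown to the engine.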
Translation in practice
The demo server and Apertium-tolk are not designed for "serious" work, however. One of the recurring themes on MT sites is the notion that an automatic translation can never replace a human translation — it may serve as a tool to bootstrap a translation or save time on long texts, but an unattended translation is simply not feasible. Consequently, MT tools like Apertium are designed to be incorporated into human-translators' workflows. Those workflows are often very task-specific: someone translating an ebook has different requirements than someone translating gettext using Transifex, and both have different requirements than someone translating subtitles in a video.
On that front, it looks like Apertium has been making inroads in recent years. There are two projects for video subtitle editing: Apertium Subtitles, which is developed entirely by the project, and a separate extension for the Gaupol editor. I tested both on some randomly selected .SRT subtitle files and, as with Apertium-tolk, both ran into trouble with text encoding and were unable to recognize a good portion of the words.
Regrettably, that makes it virtually impossible to estimate the accuracy
of a translation language pair, because removing multiple words from a
sentence derails the morphological analyzer (the module whose job it is to
decide how to break sentences up into meaningful chunks). This does not appear
to be
a bug in the Apertium core; just a problem some users hit with the GUI
front-ends. Because that step comes before translating individual words, when it fails the stages of the process that follow it fail, too. You can always fall back on the command-line tool (which did not hiccup on accented characters in my tests), but for translation projects that do not revolve around one of the supported file formats, that is hardly a realistic option.
But outside developers are using Apertium to build some other
interesting front-end
services. At least two are focused on online text; there is an
AGPL-licensed "social translation" system called Tradubi, which focuses on building and
customizing language pairs, and a WordPress translation plugin for posts called Transposh. Virtaal is a tool for translating .po text (and, according to the site, will soon tackle other things). It makes up for the highly technical vocabulary found in software strings by augmenting the MT engine with pre-loaded suggestions. Although it has not proven itself practical in the field yet, one of my particular favorites is a plugin for XChat that calls the Apertium D-Bus service to translate other users' comments.
In its current state, Apertium is not a free drop-in-replacement for
Google Translate, but its capabilities are impressive — the
third-party tools built on top of it exhibit a degree of polish that the demo applications do not, and you can easily imagine them growing into truly useful services. The developer base is small, but active — the mailing list traffic is steady, there is a roadmap, and the project has participated in Google's Summer of Code.
The one piece that still seems to be missing is a straightforward way for users to contribute useful data back into the language dictionaries. For example, one of the test texts I used on all of the Apertium front-ends was the opening paragraph of Don Quixote, and to my surprise, Apertium was unable to translate some very common words (including cuyo and hidalgo from the first sentence...). Most of the proprietary translation web services provide a simple "suggest a better translation" mechanism; not only does Apertium not offer such a feature, but building and refining the language pairs is an arcane, manual, and closed-off process. Most have individual maintainers, which makes sense from a management perspective, but considering how much of Apertium's intelligence resides in its language pairs, building better tools to improve them would go a long way toward improving the project as a whole.
December 13, 2011
This article was contributed by Valerie Aurora (formerly Henson)
The Ada Initiative is a
non-profit dedicated to increasing the participation of women in open
technology and culture. In other words, we want more women in open
source, Wikipedia, and the rest of our brave new Internet world. A
lot of people agree with that goal - at least that's what
our first
Ada Initiative survey told us. (Note that the Ada Initiative has
absolutely nothing to do with the Ada programming language, other than
sharing a namesake. Your author wrote Ada 95 for a living once and
sincerely hopes to never touch another "bondage-and-discipline"
language again.)
LWN readers might remember us from
our launch announcement
back in February, as well as our first "Seed 100" fundraising
campaign.
This article is an update on what Ada Initiative has done since its
founding, what we're doing next, and what you can do to help.
Accomplishments this year
The most surprising and visible change in open tech/culture
communities over the last year was
the widespread
adoption of some form of code of conduct or anti-harassment policy
by over
30 conferences and organizations. (The exact number is hard to
determine since some organizations adopted the policy for all their
events and put on dozens of events per year.) Many major Linux and
open source conferences have a policy: all Linux Foundation events,
including Plumbers, LinuxCon, and Kernel Summit, linux.conf.au, all
Ubuntu Developer Summits, several PyCons, OSBridge, all O'Reilly
conferences (pledged), and many more. The idea that everyone should
be able to attend a conference without expecting to be harassed or
threatened is spreading to fan, science-fiction, and open culture
events as well. Given
the level of controversy a
year ago, this shows a strong change in public opinion across a
broad swath of the open technology and culture community.
We ran two surveys this year. First was the Ada Initiative Census
(part
1 and
part
2), with over 2800 responses (about 1600 from women). We ran this
survey to find out what people thought about women in open technology
and culture, which communities had more women than others, and if people felt that having more
women was a good goal or not. A lot of people had told us that they
wanted more women in open tech/culture, and that they felt many
communities weren't very welcoming of women, but it was good to get
statistical confirmation from several thousand people.
Our second survey asked attendees of the Grace Hopper Celebration
(mainly young women in university computer science programs) about
their attitudes towards careers in open source. It was an extremely
simple and open-ended survey; nonetheless, two common themes appeared:
most believed you couldn't get paid to write open source software or
that it paid much less than closed source, and that the
"personalities" and culture of open source were intimidating and
unpleasant. This is important information to know so that efforts to
recruit new college graduates into open source jobs can be successful.
We also achieved our goal of being the "go-to" organization for advice
on how to
respond
to incidents of harassment in a way that says women are a welcome
and valued part of the community. Often it's merely an issue of
raising awareness: Most people simply don't know that harassment and
bad behavior are happening. If you're a famous, well-known,
influential member of the open source community, you're likely to be
treated very well, and if you do run into any obnoxiousness, you have
a lot of friends willing to come to your aid quickly. You won't have
much of an idea of how newcomers are treated, or people who look
different from you, or don't have as many friends as you, unless you
go looking for their stories. One place to find these stories is on
the "Timeline of sexist incidents in geek communities" maintained on the
Geek Feminism wiki.
We organized our
first AdaCamp
for January 14, 2012, in Melbourne, Australia (the Saturday
before linux.conf.au 2012 in nearby
Ballarat). AdaCamp is a small invitation-only unconference bringing
together a variety of people to collaborate on ways to increase the
participation of women in open technology and culture. We have people
from open source, but are making a strong effort to bring in people
from Wikipedia, fan culture, and other
areas. Applications
are currently open; if you know someone who should attend,
encourage them to apply! For our North American AdaCampers, our next
AdaCamp is tentatively planned to coincide with Wikimania in
Washington D.C. in July.
We wrote a first draft of Ada's Advice, a guide to useful resources
for people who want to help women in open tech/culture, organized by
the role of the person looking for advice: parent of a young daughter,
employer looking to hire more women, women in open tech/culture
themselves. I'm constantly trying to find that link to that one
article that I vaguely remember being somewhere
on the Geek Feminism Wiki and
failing; this is my solution. We are also planning to write short summaries
of books and longer articles, along with some original content, and to
update older content (such as
generalizing HOWTO
Encourage Women in Linux to open tech/culture overall). We think
that people shouldn't have to read ten books before they can start
helping women effectively.
Ada's Careers is a project in the planning stage. This is our answer
to the abandoned job postings mailing list - you know, the one you
create after recruiters keep trying to post jobs to your development
mailing list and then no one ever reads again? Well, we want to
create a career development community: a place where women hang out
all the time because it helps them at all stages of their careers, not
just when they are looking for a new job. Finally, we'll have an
answer to "Where do I put my job posting where qualified women will
read it?"
Another project we'd like to run is First Patch Week. Often,
experience writing open source is a prerequisite to getting a job in
open source. At the same time, women face extra barriers to getting
that unpaid experience, from local user group meetings
that are often uncomfortable for women to attend, to
IRC servers where users perceived to be female are 25 times more
likely to get a nasty private message than those perceived to be
male [PDF]. We want
to partner with an open source company to donate a week of their
programmers' time to mentor women through the process of creating and
submitting their first patch to an open source project. This will be
an expensive and time-consuming project to run the first time, but
will get easier as we repeat it, and will have a major, direct effect
on the number of women available and qualified to be hired by open
source companies.
We have some other project ideas but these are the ones we're most
likely to do soon. What project do you want to see finished next?
Leave us a comment telling us what your favorite is.
The not-so-fun stuff: Paperwork and government regulations
I'm a kernel programmer by training, so it's not that surprising that
I found myself comparing the process of incorporating a
U.S. non-profit with booting a kernel. You have to bootstrap from a
couple of people with an idea for a non-profit to a legally registered
corporation with strict oversight by a board of directors, with every
step along the way properly authorized and recorded. It may not be
the best analogy to explain how to found a non-profit, since most
people don't know how the boot process works either, but since
this is an article for LWN I can get away with it.
The non-profit/boot analogy goes thus: (1) file articles of
incorporation (BIOS) and bylaws (bootloader), (2) take "action by
incorporator" to appoint the board of directors (secondary
bootloader), (3) board votes for standard "startup" motions (kernel
initialization), then (4) board meets regularly to vote on new
motions, elect new board members, and delegate tasks (servicing
interrupts, running processes).
The "articles of incorporation" are paperwork you send to a state
government declaring that you are a non-profit corporation. The
articles of incorporation describe the ground rules of the corporation
and don't change. The bylaws, which can change, are filed at the same
time as the articles of incorporation and describe how the corporation
is governed - stuff like how the board of directors is elected.
To me, the most obscure part of the bootstrapping process was the
"action by incorporator." Sure, the bylaws say how your board of
directors elects new directors, but how do you get your board in the
first place? What happens is that the person who filed the articles
of incorporation (me, in this case) writes down who they appoint to the
board of directors, states they relinquish all rights as incorporator,
and then signs and dates the document. Presto, the corporation now
has a board of directors in complete control.
From there on out, everything is governed by votes by the board of
directors. The board usually delegates a lot of stuff to the officers
so it doesn't have to meet every time the hosting bill has to be paid.
There is an initial set of standard motions that most corporations
pass that is similar to kernel initialization, allowing the officers
to do things like hire lawyers and buy liability insurance. After
that, the board meets routinely and as-needed (which is like responding to
timer ticks or servicing interrupts) to vote on
new motions. We even have an equivalent of AppArmor or SELinux: We
have to make detailed yearly reports to the U.S. tax service on our
finances and management, beginning with filing an incredibly complex
and expensive application for tax-exempt status.
The annoying stuff: Fundraising
Fundraising is a lot like funding a startup except that no one gets rich.
We began in classic self-funded startup fashion: For 7 months we lived
on our savings and part-time consulting work. We also
had angel
funders who trusted us enough to give us money on faith: Linux
Australia, Puppet Labs, and the Ceph division of DreamHost. Next we
raised a round of "seed funding": 100 donors of $512 or more in our
Seed
100 round (actually, 103 because we couldn't close the drive fast
enough). We've nearly used up our startup capital and have started our
first
general fundraising drive, open
to both
small individual donors
and large corporate
donors. If you like the work we're doing, and want to see things
like Ada's Advice and First Patch Week become a reality, please donate
now and tell your friends about us too!
We're still debating the long-term funding model for the Ada
Initiative. Should companies who benefit financially from open source
and open culture fund most of the Ada Initiative? Should we rely on
lots of small individual donors like Wikimedia? Should we sell
t-shirts? Tell us what you think in the comments!
Here is LWN's fourteenth annual timeline of significant events in the Linux
and free software world for the year.
We will be breaking the timeline up into quarters, and this is our
report on July-September 2011. Next week, we will
put out the timeline for the last quarter of 2011.
This is version 0.8 of the 2011 timeline. There are almost certainly some
errors or omissions; if you find any, please send them to timeline@lwn.net.
LWN subscribers have paid for the development of this timeline, along with
previous timelines and the weekly editions. If you like what you see here,
or elsewhere on the site, please consider subscribing to LWN.
For those with a nostalgic bent, our timeline index page has links
to the previous thirteen timelines and some other retrospective articles
going all the way back to 1998.
A backdoor is found in the vsftpd source code (LWN blurb).
Most well-adjusted people would not stand up in a crowd of people and start
calling people around them idiots. Just because there is a monitor and a
network cable separating you from the crowd doesn't make it ok, and I am
tired of it.
-- Rasmus Lerdorf
CERN releases version 1.1 of its Open Hardware
License (announcement).
Project Harmony releases version 1.0 of its contributor agreements
(LWN blurb, agreements).
Nortel sells a huge pile of patents covering networking and lots
more to a consortium made up of Apple, EMC, Ericsson, Microsoft, Research
In Motion, and Sony. Google also unsuccessfully bid on the patents (Reuters
article).
The VLC media player reports that companies are bundling it with
adware/spyware, which is an increasing problem for free software
projects (announcement,
LWN article).
I am quite at ease not participating in netfilter/iptables anymore while
the discussion about IPv6 NAT becomes an issue again: I always indicated
"over my dead body", and now that I am no longer in charge, nobody will
have to kill me ;)
-- Harald Welte
CentOS 6.0 is released, eight months after RHEL 6 (announcement, release
notes).
The realtime kernel tree moves to 3.0 after being based on 2.6.33
for a long time (3.0-rc7-rt0
announcement).
IBM promises to contribute the Symphony fork of OpenOffice.org (OOo) to the Apache
OOo project (announcement).
Oracle acquires Ksplice, Inc., makers of the ksplice no-reboot
kernel patching product (announcement,
LWN article: Ksplice and CentOS).
As already mentioned several times, there are no special landmark features
or incompatibilities related to the version number change, it's simply a
way to drop an inconvenient numbering system in honor of twenty years of
Linux.
-- Linus Torvalds announces 3.0
Linux 3.0 is released without any major changes that some might
assume come with the move from 2.6.x (announcement, KernelNewbies summary, and
Who wrote 3.0).
Mozilla announces the "Boot to Gecko" standalone operating system,
which is based on Linux (announcement,
LWN coverage).
Several versions of Emacs ship without all of the source code, which
does not comply with the GPL, though the FSF itself is not violating the
license (LWN coverage).
The digiKam software collection 2.0.0 is released; digiKam SC is a
photo editor and related tools (announcement, LWN review).
KDE Software Compilation 4.7 is released (announcement).
DebConf 2011 is held July 24-30 in Banja Luka, Bosnia and
Herzegovina.
The second Desktop Summit is held in Berlin, Germany, August 6-12;
it is a combination of GNOME's GUADEC and KDE's Akademy conferences (LWN
coverage: Companies and open source,
Copyright assignments, Desktop crypto consolidation, Service design, and Plasma Active).
Every time I get frustrated with doing paperwork, I simply imagine having
the job of estimating how much time it takes to do paperwork, and I feel
better immediately.
-- Valerie Aurora
Samba 3.6.0 is released (announcement).
Debian celebrates its 18th birthday, just two years younger than
Linux itself (announcement).
Google announces its intent to acquire Motorola Mobility,
mostly for its patents, it would seem (announcement).
The first release candidate of the Mozilla Public License 2.0 is
released (announcement,
an LWN look at the update process).
But if you want to be taken seriously as a researcher, you should
publish your code! Without publication of your *code* research in your
area cannot be reproduced by others, so it is not science.
-- Guido van Rossum
LinuxCon North America is held August 17-19 in Vancouver, Canada
and celebrates 20 years of Linux (LWN coverage: Clay Shirky on
collaboration, Largest desktop Linux
deployment, FreedomBox, x86 platform drivers, MeeGo architecture update, ConnMan, and Mobile
Linux patent landscape).
COSCUP 2011 is held in Taipei, Taiwan August 20-21 (LWN coverage: Year of the Linux tablet?).
A serious denial-of-service attack against Apache web servers is seen in
the wild (announcement, LWN coverage).
HP announces it is dropping its webOS devices (press
release).
The 20th anniversary of the first Linux post is August 25: the
now-famous "just a hobby" post
to comp.os.minix.
The Certificate Authority system as it stands today is a house of cards and
we're witnessing in public what many have known for years in private. The
entire system is soaked in petrol and waiting for a light.
-- Jacob Appelbaum
DigiNotar issues fraudulent SSL/TLS certificates for several domains,
including google.com, in July, but this is not discovered until August (LWN blurb and coverage).
The kernel.org server is found to be compromised; the compromise
affects various Linux
Foundation servers as well; it will take some time for things to get back
to normal (LWN coverage).
Mandriva 2011 ("Hydrogen") is released (announcement,
release notes).
The Linux Plumbers Conference is held in Santa Rosa, California,
September 7-9 (LWN coverage: Development model
diversity, Booting and systemd, Making the net go faster, Coping with hardware diversity, Bufferbloat update, and Control groups).
No developer ever thinks their change is going to break anything for
anyone. It's the QA Law of What Could Possibly Go Wrong.
-- Adam Williamson
The Linux Security Summit is held with Plumbers (LWN coverage: LSM roundtable and Kernel hardening roundtable).
PostgreSQL 9.1 is released (announcement, LWN article).
The Qt Project is announced for more open governance of the free
software UI toolkit (announcement).
Coherent vision isn't something that the kernel community really values.
-- Neil Brown
The openSUSE conference is held in Nürnberg, Germany September
11-14 (Conference
wrap-up).
The OpenShot video editor releases version 1.4 (announcement).
UEFI "secure boot" and Microsoft's mandate of it for Windows 8 hardware
start to concern free operating system developers (Matthew Garrett blog posts:
Part 1, Part 2; LWN article).
Not spending as much time sitting in meetings and fighting with other
vendors is one of the competitive advantages PostgreSQL development has
vs. the "big guys". There needs to be a pretty serious problem with your
process before adding bureaucracy to it is anything but a backwards
move. And standardization tends to attract lots of paperwork. Last thing
you want to be competing with a big company on is doing that sort of big
company work.
-- Greg Smith
GNOME 3.2 is released (announcement,
release notes).
PulseAudio 1.0 is released (announcement, release notes).
Tizen, the successor to MeeGo, is announced, which incorporates
technology from the LiMo project; the announcement comes less than a month
after Intel says it is "fully committed" to MeeGo (announcement, LWN coverage).
The Berlios code repository announces that it will shut down at the
end of the year (announcement,
LWN coverage).
Page editor: Jonathan Corbet