
Archiveopteryx

By Jonathan Corbet
March 16, 2010
Your editor, like many LWN readers, deals in large quantities of electronic mail. As a result, tools which can help with the mail flood are always of interest. One tool which has been on the radar for some time is Archiveopteryx, a database-backed mail store which is meant to deal with high mail volumes. Archiveopteryx does not seem to have a hugely high profile, but it does have a dedicated user base and a steady development pace; Archiveopteryx 3.1.3 was released on March 10.

The idea behind Archiveopteryx is simple enough: build a mail store around the PostgreSQL database, then provide access to it through the usual protocols. Installation is relatively easy for a site which already has PostgreSQL in place; a simple "make install" does the bulk of the work. A straightforward configuration file allows for control over protocols, ports, etc., and there is an administrative program which can be used to set up users within the mail store.

On the protocol side, Archiveopteryx supports POP and IMAP for access to email. It can handle mail receipt directly through SMTP, but that is not normally how one would do things; there is still value in having a real mail transfer agent in the process. The preferred mode is to use the LMTP protocol to accept mail from the MTA; there is also a command-line utility which can be used for that purpose if need be. The installation instructions include straightforward recipes for configuring Archiveopteryx to work with a number of MTAs. Archiveopteryx also supports the Sieve filtering standard and the associated protocol for managing scripts.
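
As a rough illustration of the LMTP hand-off described above, here is a minimal sketch using the LMTP client in Python's standard library. The host, the port number (2026), and the addresses are assumptions made for the example, not values taken from the article; the actual listener address should be read from the local configuration file.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a minimal message to hand to the mail store."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def deliver_via_lmtp(msg: EmailMessage, host: str = "127.0.0.1", port: int = 2026) -> dict:
    """Inject one message over LMTP and return any refused recipients.

    The default host and port are assumptions for this sketch.
    """
    # smtplib.LMTP greets the server with LHLO rather than EHLO;
    # apart from that it is used like smtplib.SMTP.
    with smtplib.LMTP(host, port) as client:
        return client.send_message(msg)


# Building the message needs no server; delivery does.
example = build_message("mta@example.org", "user@example.org", "test", "hello")
```

Calling `deliver_via_lmtp(example)` only makes sense with an LMTP listener running at the assumed address; an MTA would normally perform this step itself.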

Those who set up a large-scale mail store can be expected to have some archived mail sitting around. Archiveopteryx provides an aoximport tool for importing this email into the system. Your editor found it to be overly simple and inflexible, though. It is unable to create subfolders when importing an entire folder tree (they must already be in place or the import fails), and it failed to import the bulk of the messages when working with a Dovecot-managed maildir mailbox. The importer, perhaps, is like the Debian installer: users tend to only need it once, so it gets relatively little work once the basic functionality is in place.

Archiveopteryx works well as an IMAP server, and it is indeed fast when dealing with folders containing many messages. Operations like deleting or refiling groups of messages go notably faster than with Dovecot on the same server. On the other hand, your editor was unable to get the Sieve script functionality to work at all; this is probably more a matter of incomplete configuration than fundamental problems with Archiveopteryx itself, but it was still a discouraging development.

That ties into the biggest disappointment with Archiveopteryx, though, which is probably totally unjustified: your editor would like this tool to be something that it is not. If one is going to go to the trouble of storing all of one's email into a complex database, it would be nice to be able to do fast, complex searches on that email. That way, the next time it becomes necessary to, say, collect linux-kernel zombie posts, a quick search will do. Archiveopteryx seems to have a search feature built into it, but actually using that feature appears to be limited to exporting messages with the aoxexport tool. The IMAP protocol is not particularly friendly toward the implementation of fast, server-side searching, but it still seems like something better should be possible.

All that should not detract from what Archiveopteryx does well: store and serve email in large volumes using standard protocols. As a tool for ISPs and for others needing to make email available to lots of users, it seems highly useful; it is clearly meant to scale in ways that servers like Dovecot are not.

There is one remaining problem, though: the future of Archiveopteryx is not entirely assured. For years, this program has been developed by a company called Oryx, which offered commercial support for it. In June, 2009, though, the developers behind Oryx announced that the company was shutting down, with the final closure expected in October of this year. They say:

So we're gradually closing down Oryx, BUT NOT ARCHIVEOPTERYX. We'll relicense it using either the BSD or Apache 2 licenses and continue making new releases for years to come. We both feel obliged to keep the existing archives viable.

(The code is currently licensed under OSLv3).

A sense of obligation may keep Archiveopteryx going for a while, but if it's going to be something that people can count on for years into the future, it will have to develop a more active development community. Archiveopteryx has the look of a solidly company-controlled project - the project's git repository is overwhelmingly dominated by commits from the two principal developers. Such projects are always at a bit of risk if the backing company runs into trouble. But Archiveopteryx is free software, and highly useful free software at that; it seems like its user community should be able to carry it forward.



Archiveopteryx

Posted Mar 18, 2010 6:54 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

quote: The IMAP protocol is not particularly friendly toward the implementation of fast, server-side searching,

that depends on exactly what you are looking for in terms of searching.

IMAP can do exact match searches in one folder very well (although many implementations do not do very well)

there is even a spec (not widely implemented in either server or client) for enhanced searches.

what it does not do is searches across different folders

Archiveopteryx

Posted Mar 19, 2010 23:29 UTC (Fri) by cras (guest, #7000) [Link]

Standard IMAP protocol requires that searches support substring matching. This is difficult to
support in a fast way, and about none of the commonly used full text search indexers support
searching substrings. Of course, many IMAP servers (like GMail among others) simply ignore this
requirement. https://datatracker.ietf.org/doc/draft-ietf-morg-fuzzy-se... is anyway supposed
to help with this.
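
To make the substring problem concrete, here is a small sketch contrasting a word-based inverted index (roughly how full-text indexers work) with the substring semantics that IMAP SEARCH requires. The sample messages and function names are hypothetical, and this is a toy model rather than how any particular server implements search.

```python
import re
from collections import defaultdict


def build_word_index(messages: dict) -> dict:
    """Map each whole word to the set of message ids containing it."""
    index = defaultdict(set)
    for msg_id, text in messages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(msg_id)
    return index


def indexed_search(index: dict, term: str) -> set:
    """Fast lookup, but only whole words can match."""
    return set(index.get(term.lower(), set()))


def substring_search(messages: dict, term: str) -> set:
    """IMAP-style substring match: correct, but scans every message."""
    needle = term.lower()
    return {msg_id for msg_id, text in messages.items() if needle in text.lower()}


messages = {
    1: "Patch for the scheduler",
    2: "Rescheduling the meeting",
    3: "Kernel zombie processes",
}
index = build_word_index(messages)

print(indexed_search(index, "schedul"))       # empty: no whole word matches
print(substring_search(messages, "schedul"))  # messages 1 and 2
```

The word index answers quickly but misses "schedul" entirely, while the scan finds both messages at the cost of reading everything; that is the trade-off described above.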

Archiveopteryx vs. Cyrus

Posted Mar 18, 2010 7:34 UTC (Thu) by dlang (guest, #313) [Link] (2 responses)

any idea how this compares with Cyrus in terms of features and performance with large amounts of mail?

what does Archiveopteryx consider massive amounts of mail (in terms of messages in a single folder, total number of users, and total number of messages)?

Archiveopteryx vs. Cyrus

Posted Mar 18, 2010 14:16 UTC (Thu) by nye (subscriber, #51576) [Link] (1 responses)

The FAQ says:

"Archiveopteryx's bottleneck is the number of deliveries per minute, everything else is irrelevant. ... On fast PC hardware, Archiveopteryx currently handles in the neighbourhood of 4000 deliveries per minute."

Of course, that doesn't directly answer your question...
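
For scale, the quoted FAQ figure can be converted into per-second and per-day rates. This is plain arithmetic on the quoted number, assuming the rate could be sustained around the clock, which the FAQ does not claim.

```python
DELIVERIES_PER_MINUTE = 4000  # figure quoted from the FAQ above

per_second = DELIVERIES_PER_MINUTE / 60
per_day = DELIVERIES_PER_MINUTE * 60 * 24  # assumes a sustained rate

print(f"{per_second:.1f} deliveries/second")
print(f"{per_day:,} deliveries/day")
```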

Archiveopteryx vs. Cyrus

Posted Mar 18, 2010 18:07 UTC (Thu) by dlang (guest, #313) [Link]

Actually, that makes me suspicious that they have not really scaled this up.

if a user has 500,000 messages in their INBOX there are things that are definitely going to be slower (sorting, threading, etc) than if they have 100 messages in their INBOX, saying that this is irrelevant tells me they probably haven't tested it to such extremes.

Yes, they can push a lot of this off to postgres, and appropriate indexing can greatly speed up searching and sorting, but it comes at a cost of slowing down adding or deleting messages (not to mention that it could require a lot of different indexes, many for single bit flags, so this can take a significant amount of storage, and therefore I/O to access)

Dbmail

Posted Mar 18, 2010 9:13 UTC (Thu) by furlongm (subscriber, #34572) [Link] (1 responses)

dbmail (http://dbmail.org/) is another alternative that also has similar
goals and is released under the GPL

Dbmail

Posted Mar 18, 2010 18:18 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

DBMail drives me nuts because it maintains a single seen flag per message instead of a seen flag for each message for each user; doing it DBMail's way makes shared IMAP folders practically useless.
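
The difference between the two flag models can be sketched as follows. The class and user names are hypothetical and only illustrate the data-structure choice; they do not reflect DBMail's or any other server's actual schema.

```python
class PerMessageFlags:
    """One seen flag per message, shared by every user of the folder."""

    def __init__(self):
        self._seen = set()  # message ids

    def mark_seen(self, user: str, message_id: int) -> None:
        self._seen.add(message_id)  # the user is ignored

    def is_seen(self, user: str, message_id: int) -> bool:
        return message_id in self._seen


class PerUserFlags:
    """One seen flag per (user, message) pair."""

    def __init__(self):
        self._seen = set()  # (user, message id) pairs

    def mark_seen(self, user: str, message_id: int) -> None:
        self._seen.add((user, message_id))

    def is_seen(self, user: str, message_id: int) -> bool:
        return (user, message_id) in self._seen


shared = PerMessageFlags()
shared.mark_seen("alice", 1)
print(shared.is_seen("bob", 1))    # True: bob never read it, yet it shows as seen

per_user = PerUserFlags()
per_user.mark_seen("alice", 1)
print(per_user.is_seen("bob", 1))  # False: bob still sees it as unread
```

With the shared model, the first reader of a shared folder marks every message as read for everyone else, which is the problem described above.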

Archiveopteryx

Posted Mar 18, 2010 12:40 UTC (Thu) by pohly (guest, #6319) [Link] (4 responses)

http://notmuchmail.org/ by Carl Worth and others is another candidate. It is based on the idea of indexing mails with Xapian, then doing all of the normal operations via tagging and fast searches.

Actually receiving and sending mail has to be handled separately. There's a command line and Emacs interface. It is clearly not something for normal users, but looks promising.

Notmuch

Posted Mar 18, 2010 13:37 UTC (Thu) by corbet (editor, #1) [Link] (3 responses)

Notmuch looks very interesting; I suspect it's the mail system of my future. It's not a server-side solution, though. It requires local storage of all email and isn't (yet) well suited to those of us who, say, want to deal with our mail from our phones occasionally. That said, there will be a notmuch article here sooner or later.

Notmuch

Posted Mar 18, 2010 16:21 UTC (Thu) by wingo (guest, #26929) [Link]

I eagerly await this article. I suspect it will be my future mail too, but haven't invested the time to switch.

I'll probably stay with offlineimap though, provided I can get the tags to mirror somehow.

Notmuch

Posted Mar 18, 2010 16:23 UTC (Thu) by RobSeace (subscriber, #4435) [Link]

I use something that looks similar to NotMuch, which you may wish to compare it against, too: mairix... For my very simple needs, I've been happy with mairix...

Notmuch

Posted Mar 19, 2010 7:27 UTC (Fri) by pohly (guest, #6319) [Link]

In a recent presentation of notmuch, Carl Worth mentioned that there might be an IMAP server exposing some of search functionality at some point. The full power of notmuch might not be available without IMAP protocol extensions, though.

Archiveopteryx

Posted Mar 18, 2010 13:04 UTC (Thu) by paulj (subscriber, #341) [Link] (6 responses)

Operations like deleting or refiling groups of messages go notably faster than with Dovecot on the same server.

I found Dovecot to be very, *very* slow, in comparison to good old UW-IMAP. Perhaps that was because I stuck with mbox and Dovecot is optimised for some other mailbox format. However, Dovecot was simply unusable. It was much easier to switch back to UW-IMAP than to convert all my email and risk discovering that Dovecot was still slow.

UW-IMAP is responsive enough for me, even though I run it on a 700MHz Athlon with 512MB of RAM, and an INBOX of 24k. It remains useable even with mailboxes of 100k in size - on this hardware. Opening such folders can take a minute or two though, but once open access and search is fast (MUA potentially may be a factor here). I found my UW-IMAP server, on its ancient hardware, was faster than the corporate Netscape iPlanet derived IMAP at my previous employer, even though my mailboxes there generally were 1/10th the size (greater latency to server perhaps was one factor there - one was at home, the other was a WAN link away, even in the office).

I'm very interested in faster IMAP servers, but Dovecot does not seem to be it (for me).

Alpine's beautifully user-friendly, yet powerful, query interface is great for searching. It's very expressive and allows for chains of different criteria to be applied in succession. Also, Alpine supports tagging mail (before a certain well-known web mail thing did, I think), which can help to keep track of things.

The one thing I need to do is switch to saving my outgoing mail in my Inbox.

Archiveopteryx

Posted Mar 18, 2010 19:51 UTC (Thu) by vonbrand (subscriber, #4458) [Link] (2 responses)

I seem to recall severe security (and other) problems with UW-IMAP, and the project is dead. We were forced to move to dovecot, and then redo the mailboxes. The end result was much faster (but there was a hardware upgrade too, so take this with a grain of salt).

Archiveopteryx

Posted Mar 19, 2010 14:24 UTC (Fri) by paulj (subscriber, #341) [Link] (1 responses)

Yeah, I remember that too, but I thought that was partly with RedHat/Fedora
not packaging updates anymore? I have the vague notion licencing issues were
also involved, but am not certain. However, it appeared back in the main
fedora repository. The upstream code seems to be maintained wrt
minor bug fixing and security updates. No active development otherwise
though.

Archiveopteryx

Posted Mar 19, 2010 18:19 UTC (Fri) by vonbrand (subscriber, #4458) [Link]

In our case, we carried an old version of UW-IMAP around for some time, as the later versions just couldn't be coerced to build on Red Hat, or if they did, did not work. In the end we just ditched it for the official dovecot package.

Archiveopteryx

Posted Mar 19, 2010 22:58 UTC (Fri) by cras (guest, #7000) [Link] (2 responses)

Wonder if you were using some ancient Dovecot version, because typically people say that
switching from UW-IMAP to Dovecot increased their disk I/O performance by 10x or so with mbox.
But yes, mbox is kind of a second-class citizen there and it's possible you were just using a buggy
version.. There is also mbox_very_dirty_syncs=yes setting to make mbox much faster when it has
had external changes (although defaults are identical to UW-IMAP's behavior).

Archiveopteryx

Posted Mar 20, 2010 3:17 UTC (Sat) by paulj (subscriber, #341) [Link] (1 responses)

Yeah, this was probably about 4 or 5 years ago - so some kind of 0.99
version I guess. I may try out dovecot again when I next have time if you
think mbox performance should have improved and be on a par with UW-IMAP.

Archiveopteryx

Posted Mar 20, 2010 12:48 UTC (Sat) by cras (guest, #7000) [Link]

Yeah, v0.99 was pretty bad in all ways. All of the mail accessing and indexing code was redesigned
for v1.0. They could almost be thought of as different products..

Archiveopteryx

Posted Mar 18, 2010 14:14 UTC (Thu) by nye (subscriber, #51576) [Link] (1 responses)

From the overview:

"Apart from being designed to handle large volumes of mail accumulated over a long time, Archiveopteryx keeps the future in mind by storing only the canonical form of each message in the database, correcting common syntax errors at delivery."

I'm not sure what exactly this means, but it makes me a little nervous. One person's 'canonicalisation' is another person's 'corruption'. Can anybody shed some light on the subject?

Archiveopteryx

Posted Mar 19, 2010 23:24 UTC (Fri) by cras (guest, #7000) [Link]

I don't know specifics, but I've read a few messages about this from Arnt. I believe this mainly
means that when the message contents are already violating RFCs, Archiveopteryx changes them to
RFC-compliant form using whatever heuristics / built-in rules that typically fix the problem. For
example many messages have 8bit characters in subject, even though this isn't permitted. So
instead of typical GIGO, it's GI-something-better-than-GO.

Archiveopteryx

Posted Mar 18, 2010 14:20 UTC (Thu) by jengelh (subscriber, #33263) [Link] (1 responses)

corbet: I think it's LMTP, not LTMP.

Archiveopteryx

Posted Mar 18, 2010 14:28 UTC (Thu) by jake (editor, #205) [Link]

> I think it's LMTP, not LTMP.

Indeed, fixed now, thanks!

jake

Archiveopteryx

Posted Mar 18, 2010 17:11 UTC (Thu) by ernstp (guest, #13694) [Link]

Thunderbird 3 and its new GLODA database is very nice, it's a client side database that is really amazing!

Archiveopteryx

Posted Mar 18, 2010 17:16 UTC (Thu) by jospoortvliet (guest, #33164) [Link]

I wonder how this compares to Akonadi, which afaik is like a client-side cache. Using MySQL (but has a plugin structure for the database) it can at least easily handle my 80K+ mailboxes pretty decently, even on slow(er) hardware...

Try Zimbra as a storage engine

Posted Mar 18, 2010 19:15 UTC (Thu) by dowdle (subscriber, #659) [Link]

Jon,

The kinds of things you want to do with email once it is in a database... already sounds like what I've been doing with Zimbra for over 4 years now. It has a very extensive search capability and the ability to store searches. To me having a ton of email is no good at all if you can't find what you are looking for.

Zimbra is certainly more than a way to import some email from a source and then search it... it is a complete mail solution with shared documents, calendars, and even IM... but you can do what you want with it... or at least I'd be surprised after looking at it if you thought the search was missing something.

Cyrus

Posted Mar 19, 2010 10:42 UTC (Fri) by ringerc (subscriber, #3071) [Link]

Cyrus IMAPd is an amazingly powerful IMAP/POP3 server that handles huge piles of mail without any great hassle at my work. I've been using it for years without complaint or issue.

It maintains a header cache for fast IMAP header downloads. It can do full-text indexing and server-side IMAP search with squatter. It supports sieve (obviously) and does so well. Mail is delivered into the system with LMTP.

You might want to look into it if you're dealing with truly huge volumes of mail. The server-side search in particular is a life-saver when dealing with big list archives.

Dovecot

Posted Mar 19, 2010 23:16 UTC (Fri) by cras (guest, #7000) [Link]

I'm Dovecot's primary author.

I actually like Archiveopteryx, as well as all the other IMAP servers. I wish there were more of them.
It's boring when there isn't much competition.

Anyway, since Dovecot's scalability was mentioned, I'd just like to mention that Dovecot's upcoming
v2.0 adds support for a high-performance mailbox format called "mdbox"
(http://wiki.dovecot.org/MailboxFormat/dbox) that scales a lot better than Maildir. For typical users
it shouldn't be a problem to have something like a million messages in a mailbox.

Also if the problem is searching performance, Dovecot supports full text search indexing with Solr.


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds