Development

PIM Data Synchronization: Why is it so hard?

May 18, 2009

This article was contributed by Patrick Ohly

Recently, a SyncEvolution user described SyncML and some of its challenges in an LWN article. In this article, the author of that software adds more information for developers who want to synchronize data. Particular attention is given to Personal Information Management (PIM) data (such as contacts, events, tasks, notes) and how SyncEvolution and the underlying Synthesis SyncML Engine deal with the problems inherent in synchronizing that kind of data.

It is worth pointing out that SyncML as a protocol is data format agnostic and not limited to PIM. PIM is just the most common usage today. In addition, PIM data is particularly difficult to synchronize.

Synthesis just made its core technology available under the LGPL v2.1 and 3.0. SyncEvolution 0.9 beta 1 is the first open source program using libsynthesis. Together, these two projects are the core building blocks for data synchronization in the Moblin project.

PIM Synchronization: Challenges

Whenever your author hears the 1974 Steely Dan song "Rikki Don't Lose That Number", he wonders whether the advice to "send it off in a letter to yourself" is still valid today, or whether we can trust our software to keep important phone numbers safe and sound. Why is synchronization software for PIM data still not 100% reliable?

Database replication is a well understood problem. But PIM data is special in many ways. First, there is no globally unique identifier (GUID) for items. When comparing two databases for the first time, there is often no additional meta-information beyond the data itself. Without a GUID in the item data, it is hard to determine whether two items from different databases refer to the same entity, particularly because the content and/or representation of the same logical entity often differs between databases.

For PIM data, the storage and exchange formats are typically vCard 2.1/3.0, vCalendar 1.0 and iCalendar 2.0. At first glance it seems that items in these formats have a GUID. But the UID property in a vCard is not mandatory in version 2.1 or version 3.0 of the standard. Even if it is used by a particular program, it is not guaranteed to be globally unique and therefore cannot be relied upon when comparing two different databases. The same problem exists with vCalendar 1.0, which is still the most common format used by consumer devices for events and tasks. Only iCalendar 2.0 specifies a mandatory, global UID property because it is required for exchanging meeting invitations.
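
Because UID is optional in vCard, a receiver has to check for it rather than assume it. The following is a minimal sketch of such a check; the function names and the sample cards are made up for illustration, and the parser is deliberately simplified (it skips folded continuation lines and does not handle grouping prefixes or escaping).

```python
def vcard_properties(text):
    """Return the upper-cased property names found in a vCard.

    Simplified on purpose: folded continuation lines are skipped and
    parameters (everything after the first ';') are stripped.
    """
    names = []
    for line in text.splitlines():
        if not line or line[0] in " \t" or ":" not in line:
            continue
        head = line.split(":", 1)[0]
        names.append(head.split(";", 1)[0].upper())
    return names


def has_uid(text):
    """True if the card carries a UID property at all."""
    return "UID" in vcard_properties(text)


# Hypothetical sample data: a perfectly valid vCard 2.1 without any UID.
CARD_WITHOUT_UID = """BEGIN:VCARD
VERSION:2.1
N:Doe;John
TEL;CELL:+1 555 0100
END:VCARD"""

# The same card with a UID that is only meaningful to the program that wrote it.
CARD_WITH_UID = CARD_WITHOUT_UID.replace("VERSION:2.1", "VERSION:2.1\nUID:local-42")
```

Even when `has_uid` returns true, the value may be unique only within the program that created it, so it still cannot serve as a key across databases.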

Without a GUID, one has to compare the content of items to identify matches. But the PIM data formats allow many different more or less complete implementations and representations of the same information. One side of an implementation might support just the bare minimum of information for a contact (for example, name and phone number), while the other side may support everything defined in the standard (photo, arbitrary number of email addresses, and phone numbers), plus non-standard extensions such as spouse and instant messaging handles. A simple byte comparison, without any understanding of the semantics of the data, is therefore not good enough.
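
To illustrate why a byte comparison fails where a semantic one can succeed, here is a small sketch of content-based matching. It is a heuristic invented for this example, not the algorithm used by any particular engine: the dictionary layout, the function names, and the matching rule (same normalized name plus one shared phone number) are all assumptions.

```python
import re


def normalize_phone(number):
    """Keep only digits so '+1 (555) 0100' and '15550100' compare equal."""
    return re.sub(r"\D", "", number)


def normalize_name(name):
    """Ignore case and extra whitespace."""
    return " ".join(name.lower().split())


def probably_same_contact(a, b):
    """Heuristic: same normalized name and at least one shared phone number.

    Fields that only one side supports (photo, spouse, ...) are ignored,
    so a minimal item can still match a much richer one.
    """
    if normalize_name(a["name"]) != normalize_name(b["name"]):
        return False
    phones_a = {normalize_phone(p) for p in a.get("phones", [])}
    phones_b = {normalize_phone(p) for p in b.get("phones", [])}
    return bool(phones_a & phones_b)


# Hypothetical items: a bare-minimum phone entry and a richer desktop entry.
on_phone = {"name": "John  Doe", "phones": ["+1 555 0100"]}
on_desktop = {
    "name": "john doe",
    "phones": ["+1 (555) 0100", "+1 555 0199"],
    "spouse": "Jane",
}
```

The two sample items differ in almost every byte, yet `probably_same_contact(on_phone, on_desktop)` treats them as the same person.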

After identifying matching items, there is a third problem: if the items differ in some properties, which item is more up-to-date? There are REV and LAST-MODIFIED properties but, again, support for them is not guaranteed. Worse, both items might have been created or updated independently, so that each has valid information the other does not have (new phone number added on my cell phone, address changed on my desktop).
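
The cell phone and desktop example can be made concrete with a property-level merge. This is only a sketch under simple assumptions: items are flat dictionaries, no timestamps are available, and the function name and sample values are invented. Properties present on only one side are combined; properties that differ on both sides cannot be resolved automatically and are reported.

```python
def merge_items(a, b):
    """Merge two flat property dicts; return (merged, conflicts).

    A property found on one side only is taken over. A property that
    differs on both sides is a conflict: side `a` is kept arbitrarily
    and the property name is reported so that someone can decide.
    """
    merged, conflicts = {}, []
    for key in sorted(set(a) | set(b)):
        if key in a and key in b:
            merged[key] = a[key]
            if a[key] != b[key]:
                conflicts.append(key)
        elif key in a:
            merged[key] = a[key]
        else:
            merged[key] = b[key]
    return merged, conflicts


# Hypothetical data: work number added on the cell phone,
# address changed on the desktop.
cell = {"FN": "John Doe", "ADR": "1 Old Road",
        "TEL;CELL": "+1 555 0100", "TEL;WORK": "+1 555 0123"}
desktop = {"FN": "John Doe", "ADR": "2 New Street",
           "TEL;CELL": "+1 555 0100"}

merged, conflicts = merge_items(cell, desktop)
```

Here the new work number survives the merge, while `ADR` ends up in `conflicts`: without a reliable modification time there is no way to tell which address is the current one.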

Fourth, it is necessary to support these data formats to be interoperable with existing devices. One cannot simply choose a custom data format that avoids the previous three issues. Neither is it possible to make assumptions about the implementation of a peer and what it may or may not support.

Fifth, not knowing enough about a peer is particularly problematic when receiving an item back from that peer. If a property is missing that was sent to the peer earlier, does that mean that the user has removed this piece of information or that the peer was unable to store it? In the former case, the property must also be removed locally. In the latter, it needs to be preserved while updating the other properties. Only allowing one-way synchronization avoids this problem, but is also considerably less useful.
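
One way to tell the two cases apart is to know which properties the peer is able to store. The sketch below assumes that this set is known (for example, because the peer declared it); the function name and the sample data are hypothetical.

```python
def apply_peer_update(local, received, peer_supports):
    """Update a local item with an item that came back from a peer.

    - A property sent by the peer overwrites the local value.
    - A property the peer supports but did not send was removed by the
      user, so it is removed locally as well.
    - A property the peer cannot store is preserved unchanged.
    """
    updated = dict(local)
    updated.update(received)
    for key in peer_supports:
        if key not in received:
            updated.pop(key, None)
    return updated


local = {"FN": "John Doe", "TEL": "+1 555 0100",
         "PHOTO": "<jpeg data>", "NOTE": "met at a conference"}
received = {"FN": "John Doe", "TEL": "+1 555 0199"}
peer_supports = {"FN", "TEL", "NOTE"}  # assumption: peer cannot store PHOTO

result = apply_peer_update(local, received, peer_supports)
```

In this example the phone number is updated, the note is dropped because the peer could have sent it but did not, and the photo is kept because the peer never had a place to store it.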

SyncML

The SyncML protocol (aka Open Mobile Alliance Data Synchronization, OMA-DS) itself is fairly simple, at least up to and including version 1.2.1, the latest version supported by Synthesis and most other implementations. SyncML defines a general message format with encodings, both as XML and the more compact WAP Binary XML (WBXML). The exchange of these messages over HTTP (as POST and reply) and Bluetooth is also standardized. A typical session requires three message exchanges. When sending many small data items (~2KB) with WBXML as the encoding, the measured data overhead for the SyncML protocol was 8%, 2.5 times less than for XML.

During a sync session, a client simply talks to a server. The protocol is intentionally not symmetric. A client is fairly simple to implement and usually only talks to one server. The server implements all of the advanced logic like tracking the state of multiple clients, matching items and merging data.

A client has to keep track of local changes (added/removed/updated) between sync sessions, using its own locally unique identifier (LUID) to refer to items. It then has to be able to export, import, update, and delete items. The server needs to maintain a mapping between GUIDs and the corresponding LUIDs that are used by each client.
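
The two bookkeeping tasks can be sketched as follows. This is an illustration only: the "revision token" per item (a hash or modification time), the class name, and the method names are assumptions, not part of the SyncML standard or of any real implementation.

```python
def detect_changes(previous, current):
    """Compare two snapshots mapping LUID -> revision token.

    Returns (added, removed, updated) lists of LUIDs, as a client
    would need them at the start of an incremental sync.
    """
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    updated = sorted(
        luid for luid in set(current) & set(previous)
        if current[luid] != previous[luid]
    )
    return added, removed, updated


class LuidMap:
    """Server-side table linking each client's LUIDs to server GUIDs."""

    def __init__(self):
        self._to_guid = {}
        self._to_luid = {}

    def bind(self, client, luid, guid):
        self._to_guid[(client, luid)] = guid
        self._to_luid[(client, guid)] = luid

    def guid_for(self, client, luid):
        return self._to_guid.get((client, luid))

    def luid_for(self, client, guid):
        return self._to_luid.get((client, guid))
```

The map is keyed by client because LUIDs are only unique within one client: two devices may both call an item "1" and mean different contacts.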

At the start of a sync session, client and server authenticate each other and negotiate which databases they want to synchronize (identified by a Uniform Resource Identifier (URI)) and which data formats are acceptable (MIME types). In theory, this information could be used to configure clients automatically. In practice, it is often necessary to configure manually because the information is only sent for URIs that are listed explicitly, leading to a chicken-and-egg problem.

The information about supported data types can be detailed enough to describe which properties of the different PIM formats are supported by a client or server. The Synthesis engine generates this information for its peer automatically from the configuration (more on that below) and uses the information received from its peer to merge updates intelligently. Other servers (such as my.funambol.com) check this information only to determine whether specific properties, like PHOTO, are supported and then hard-code the rest of the data handling.

As part of getting the client and server ready for a synchronization, both agree on the sync modes for active databases. The standard specifies one-way and two-way synchronization, both incremental (the only changes sent are those made since the last synchronization) and complete (all currently existing items have to be sent). For the initial session or after a failure, the complete "slow" mode is used to get client and server in sync (again). That is, the client sends all of its items, then the server compares those items against its own data and sends back changes to the client. As explained above, this matching is problematic; therefore, the standard also supports complete sync modes where one side is told to wipe out all data before receiving items ("refresh from client" and "refresh from server").
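
The choice of mode and its effect on what a client sends can be sketched like this. The enumeration values and both function names are illustrative, and the fallback rule is a simplification: any non-refresh request falls back to a slow sync when there is no reliable previous session.

```python
from enum import Enum


class SyncMode(Enum):
    TWO_WAY = "two-way"
    SLOW = "slow"
    ONE_WAY_FROM_CLIENT = "one-way-from-client"
    ONE_WAY_FROM_SERVER = "one-way-from-server"
    REFRESH_FROM_CLIENT = "refresh-from-client"
    REFRESH_FROM_SERVER = "refresh-from-server"


def effective_mode(requested, has_synced_before, last_session_ok):
    """Fall back to a slow sync on first contact or after a failure."""
    refresh = {SyncMode.REFRESH_FROM_CLIENT, SyncMode.REFRESH_FROM_SERVER}
    if requested in refresh:
        return requested  # one side is wiped anyway, no matching needed
    if not has_synced_before or not last_session_ok:
        return SyncMode.SLOW
    return requested


def client_items_to_send(mode, all_luids, changed_luids):
    """Which items the client transmits in the given mode."""
    if mode in (SyncMode.SLOW, SyncMode.REFRESH_FROM_CLIENT):
        return sorted(all_luids)       # complete: everything
    if mode in (SyncMode.ONE_WAY_FROM_SERVER, SyncMode.REFRESH_FROM_SERVER):
        return []                      # client only receives
    return sorted(changed_luids)       # incremental: changes only
```

The cost difference is visible in `client_items_to_send`: a slow sync transmits the whole database, while an incremental sync only transmits what changed since the last session.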

A session concludes when both sides have sent their changes and some meta-information (for example, new LUIDs assigned by the client). The standard defines a mechanism for suspending a session mid-stream, then resuming it later. The same resume mechanism can also be used to recover from an unexpected loss of connection. This is an optional feature of the standard, supported by the Synthesis implementation but not all servers. Without this feature, a slow sync is necessary to keep client and server reliably in sync.

Synthesis SyncML Engine + SyncEvolution

[Moblin sync components]

Both the Synthesis SyncML Engine and SyncEvolution are implemented in C++. Synthesis paid particular attention to portability of its code to platforms with less capable compilers. Therefore, the choice of C++ features used is intentionally limited (no hard dependency on exception support, moderate use of templates, and the standard template library). SyncEvolution is less restricted and uses both exceptions for error reporting and the "resource acquisition is initialization" (RAII) design pattern to track resources, plus Boost templates (but no Boost libraries at this point).

The diagram shows how the two projects interact and fit into the Moblin infrastructure. A graphical interface is under development. Solid boxes represent executables and dotted boxes represent libraries. The engine itself is compiled into a library with two stable, plain C APIs:

  • an API for a SyncML client user interface, like SyncEvolution
  • a database API for plug-ins, which connect the engine with local data

The BSD-licensed SDK provides a glue mechanism that can be linked statically to access these APIs in C++, without tying clients closely to the implementation. Bindings for Java are available from Synthesis under a commercial license. The official documentation for this is the Synthesis "SDK and Plug-in Interface" reference manual [PDF], which is included in the open source distribution.

The same API is also designed for use of the engine in a SyncML server, but this part of the API is not completely implemented at this time. The code for the server role exists and is used in Synthesis's server products. It is mostly identical with the code that is used by clients.

The engine uses the same XML-based configuration [PDF] mechanism in both roles. Data format support is not hard-coded in the engine. Instead, the XML configuration defines datatypes and their mapping to the standard formats. So, when using Synthesis for both client and server, the definition of a custom data format has to be written only once. The engine can automatically store data defined this way in a relational database using the Open Database Connectivity (ODBC) API, so it might not even be necessary to write C or C++ code.

Because of the shared engine, clients automatically have some of the features normally found only in SyncML servers, like interpreting device information and intelligently updating only those properties of an item really supported by a peer. With well written peers, this goes a long way towards solving problems four (making assumptions about the peer) and five (getting incomplete items back). For cases where the information provided by a peer is insufficient, the engine also has the possibility of making item parsing and generation depend, for example, on the peer name and/or firmware version.

The engine itself does not implement a particular message transport, which minimizes system dependencies and allows adding custom transports without changing the engine. The client calling the engine is responsible for receiving messages, which are then processed by the engine, and for sending messages generated by the engine. SyncEvolution provides that part for HTTP(S), using either libsoup or libcurl, depending on how it was configured during compilation. It also provides a command line tool, which configures a client and runs a sync session, something which is currently missing in the Synthesis open source release itself. The SyncEvolution 0.9 beta 1 source tar ball includes a copy of the engine source and compiles everything with one "configure; make; make install" invocation. This is the most convenient way of getting started using the source code.
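
The division of labor between engine and transport can be pictured with a small loop. This is a conceptual sketch only and does not use the real libsynthesis API: `run_session`, `ScriptedEngine`, and `loopback_transport` are invented stand-ins. The point is that the transport is just a callable that delivers one message and returns the reply, so it can be swapped without touching the engine.

```python
def run_session(engine, send):
    """Drive one session: send what the engine produces, feed back replies.

    `send` is any callable taking an outgoing message and returning the
    peer's reply; an HTTP POST or a Bluetooth transfer would both fit.
    Returns the number of message exchanges.
    """
    exchanges = 0
    outgoing = engine.start()
    while outgoing is not None:
        reply = send(outgoing)
        exchanges += 1
        outgoing = engine.process(reply)
    return exchanges


class ScriptedEngine:
    """Toy engine that emits a fixed list of messages, then stops."""

    def __init__(self, messages):
        self._pending = list(messages)
        self.replies = []

    def _next(self):
        return self._pending.pop(0) if self._pending else None

    def start(self):
        return self._next()

    def process(self, reply):
        self.replies.append(reply)
        return self._next()


def loopback_transport(message):
    """Stand-in for a real transport: acknowledges every message."""
    return b"ack:" + message


engine = ScriptedEngine([b"init", b"sync", b"map"])
count = run_session(engine, loopback_transport)
```

Replacing `loopback_transport` with a function that performs a real network request would change nothing in `run_session` or in the engine.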

Originally, SyncEvolution was a tool for the Evolution mail and PIM application, but it was always meant to be more flexible and can be compiled without depending on Evolution or GNOME. The Evolution backend is just one of many. Plain files (used for KDE synchronization) and Mac OS X Address Book are also supported. More backends could be added as described in this blog article. The file backend synchronizes files inside a directory and is portable, so it can be compiled on different platforms. When adding these backends, your author dodged the bullet of having to rename the project by reinterpreting the name as "SyncEvolution - the missing link".

Another important component of SyncEvolution is a CPPUnit-based testing system, which runs local database access tests, as well as integration tests with real SyncML servers. With the help of the "synccompare" comparison tool, it checks for data modifications when importing and exporting items locally and sending items back and forth.

Data Modeling and Handling

After introducing Synthesis and SyncEvolution, the question remains: how are the tricky PIM data handling issues solved? The Synthesis Engine merely provides the infrastructure for data handling. The actual data modeling and processing of specific kinds of data is defined entirely in the XML configuration [PDF]. The standard PIM formats are supported out of the box in the "syncclient_sample_config.xml": for each kind of data, it defines one "field list", a more or less flat key/value representation of items that is easier to process automatically than the often complex standard PIM formats. The conversion to and from these formats is defined via "profiles" which map the internal fields to the corresponding properties in the external formats.
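
The relationship between a field list and a profile can be illustrated in a few lines. The real engine expresses this in its XML configuration; the Python below only mimics the idea, and the field names, the profile layout, and both functions are assumptions made for this example. Line folding, escaping, and CRLF line endings of real vCards are ignored.

```python
# Profile: external property -> internal fields, joined by ';' in that order.
VCARD_PROFILE = [
    ("N", ("last_name", "first_name")),
    ("TEL;CELL", ("cell_phone",)),
    ("EMAIL", ("email",)),
]


def generate(fields, profile):
    """Encode a flat field dict as a (simplified) vCard 2.1."""
    lines = ["BEGIN:VCARD", "VERSION:2.1"]
    for prop, names in profile:
        values = [fields.get(name, "") for name in names]
        if any(values):  # skip properties with no data at all
            lines.append(prop + ":" + ";".join(values))
    lines.append("END:VCARD")
    return "\n".join(lines)


def parse(text, profile):
    """Decode a (simplified) vCard back into the flat field dict."""
    by_prop = dict(profile)
    fields = {}
    for line in text.splitlines():
        prop, _, value = line.partition(":")
        if prop in by_prop:
            for name, part in zip(by_prop[prop], value.split(";")):
                if part:
                    fields[name] = part
    return fields


contact = {"first_name": "John", "last_name": "Doe", "cell_phone": "+1 555 0100"}
card = generate(contact, VCARD_PROFILE)
```

All merging and comparison logic can then operate on the flat dictionary, while the profile alone knows how the external format is laid out.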

There are two ways to define different external representations: it is possible to disable parts of a profile conditionally depending on runtime parameters, as well as to define multiple profiles which use the same field list. Data conversion is then done by parsing with one profile and encoding with another. The semantics associated with a profile definition are sufficient to generate the SyncML Device Information from that profile automatically and to determine which fields need to be updated when importing an updated item. Merging and comparison are also completely configurable. For more complicated cases, the engine can invoke scripts defined in a C-like language, for example to post-process an item just received from a peer.
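
As a rough sketch of what can be derived from a profile, the following uses two hypothetical profiles that share one field list. The profile contents and function names are invented for illustration and do not reflect the engine's XML syntax.

```python
FULL_PROFILE = {
    "N": ("last_name", "first_name"),
    "TEL": ("phone",),
    "EMAIL": ("email",),
    "X-SPOUSE": ("spouse",),  # non-standard extension
}

# A second profile over the same field list, for a less capable peer.
MINIMAL_PROFILE = {prop: FULL_PROFILE[prop] for prop in ("N", "TEL")}


def supported_properties(profile):
    """Property names to advertise in the device information."""
    return sorted(profile)


def covered_fields(profile):
    """Internal fields that this profile is able to represent."""
    return {name for names in profile.values() for name in names}


def restrict_to_profile(fields, profile):
    """Fields that survive encoding with the given profile."""
    covered = covered_fields(profile)
    return {key: value for key, value in fields.items() if key in covered}


item = {"first_name": "John", "last_name": "Doe",
        "phone": "+1 555 0100", "spouse": "Jane"}
sent_to_minimal_peer = restrict_to_profile(item, MINIMAL_PROFILE)
```

`covered_fields` also answers the import question: when an item comes back from a peer that uses `MINIMAL_PROFILE`, only those fields are candidates for being updated, and the rest stay as they are.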

Open Issues + Ideas

High on the list of items to work on next is integrating a SyncML server into the Linux desktop. Synthesis offers a "traditional" HTTP server for Linux and Windows, but this is designed for interaction with remote devices over the Internet. For a local desktop, Bluetooth is perhaps more important. Such a desktop server could also offer a GUI that the user can use to control it and interactively influence its operation. During merge operations, current Internet-based SyncML servers are limited to executing hard-coded heuristics and face the difficult choice between duplicating or dropping information. A local server could ask the user to help with merging conflicting items.

The data conversion routines in the Synthesis Engine are currently tied to a SyncML session context. After some non-trivial, but doable, code refactoring, these routines could also be exposed as a set of simple API calls. This may be useful in various projects, like OpenSync.

The goal is to continue with SyncEvolution and Synthesis, not just as open source, but also as open projects, with as much communication on public channels as possible. We are actively seeking involvement and feedback as we get these projects going and as we figure out how to do all of this properly.

System Applications

Database Software

PostgreSQL Weekly News

The May 17, 2009 edition of the PostgreSQL Weekly News is online with the latest PostgreSQL DBMS articles and resources.

Full Story (comments: none)

SQLite version 3.6.14.1 released

Version 3.6.14.1 of the SQLite DBMS has been announced. "SQLite version 3.6.14.1 is a patch release to version 3.6.14 with minimal changes that fixes three bugs. Upgrading is only necessary for users who are impacted by one or more of those bugs. "

SQLObject 0.9.11 announced

Version 0.9.11 of SQLObject has been announced; it includes minor bug fixes. "SQLObject is an object-relational mapper. Your database tables are described as classes, and rows are instances of those classes. SQLObject is meant to be easy to use and quick to get started with."

Full Story (comments: none)

SQLObject 0.10.6 announced

Version 0.10.6 of SQLObject has been announced. "I'm pleased to announce version 0.10.6, a minor bugfix release of 0.10 branch of SQLObject."

Full Story (comments: none)

web2py 1.62.1 released

Version 1.62.1 of web2py has been announced; it includes a number of new capabilities. "web2py includes the only Database Abstraction Layer / ORM that works on both the Google App Engine and relational databases (sqlite, mysql, postgresql, mssql, firebird, oracle, db2). You write once and it runs everywhere. You DO NOT NEED to use the Google API to access the Google Datastore as you do when you use other web frameworks on GAE. web2py writes SQL for you (and you don't even need to see it) and automatically creates a web based interface to your data."

Full Story (comments: 1)

Device Drivers

FFADO 2.0 release candidate 2 (1.999.42) is available

Version 1.999.42 of FFADO, a FireWire audio device interface, has been announced. "This release candidate contains a few reliability improvements and bugfixes that should get some field testing before we can release the official 2.0. I would therefore like to ask all users and packagers to upgrade as soon as possible such that we can release sooner rather than later. If we get to about 100 registered users without significant significant bug reports I feel confident that we're good to go. So happy testing!"

Full Story (comments: none)

Virtualization Software

Xen 3.4 Hypervisor now available

Version 3.4 of Xen Hypervisor has been announced. "The new Xen 3.4 hypervisor offers significant enhancements in Xen Client Initiative (XCI), Reliability - Availability - Serviceability (RAS) and Power Management."

Full Story (comments: none)

Web Site Development

nginx 0.6.37 released

Version 0.6.37 of the nginx HTTP server and mail proxy server has been announced. See the CHANGES document for release details.

Desktop Applications

Accessibility

AxTk adds speech output to wxWidgets

The AxTk project has been launched. "AxTk is a toolkit for building highly accessible applications with speech output. AxTk is built on top of wxWidgets and so is cross-platform. The developer can opt to speech-enable an existing wxWidgets UI, or use a new menu-based interface which is easier for a vision impaired user. AxTk also contains a text to speech wrapper class, wxTextToSpeech, with handlers for a variety of speech engines including SAPI, Mac Speech Synthesis Manager, eSpeak and Cepstral."

Audio Applications

Audacious 2.0.1 released

Version 2.0.1 of Audacious is available; it includes a bug fix in the equalizer code. "Audacious is an advanced audio player. It is free, lightweight, based on GTK2, runs on Linux and many other *nix platforms and is focused on audio quality and supporting a wide range of audio codecs."

ladspa-lgv-plugins 0.1 released

Version 0.1 of ladspa-lgv-plugins has been announced. "First release of the pretentiously named "ladspa-lgv-plugins" includes at the moment just one very simple but very convenient plugin: monomix. No more plugins are scheduled in the near future, but never say never, so I kept it open. It is not rocket science, just a way of easily deal with recordings that use full stereo separation. monomix combines the left and right channels of an stereo audio input into a dual mono output. Its interface offers four preset combinations (L+R, L-R, L-only, R-only) and one customizable mode. This allows to eliminate, isolate or enhance different channels on stereo recordings."

Full Story (comments: none)

TkEca 4.4.0 released

Version 4.4.0 of TkEca, a TCL/TK frontend for the ECASOUND audio editor, has been announced. Changes include: "- LADSPA Loading Window. - Main Window is visible during the LADSPA plug-ins loading. - Message if Tkeca is unable to find ecasoundrc file. - Processing Mixdown window. - Master volume control. - Filenames with spaces are allowed. - Certain buttons get disabled during mixdown. - Compressor utility is no longer available for mixdow, it was useless and buggy. - Bug: Certain buttons remain disabled after exporting .ecs. - Bug: Track Properties windows always reset to default values. - Bug: Always applying compressor due an "IF" error."

Full Story (comments: none)

New LADSPA effect plugin: WubFlip

The WubFlip LADSPA effect plugin has been announced. "I have made a LADSPA effect plugin called WubFlip. It's a sort of distortion that might be useful for dirty lofi beats and synths."

Full Story (comments: none)

Desktop Environments

GNOME Software Announcements

The following new GNOME software has been announced this week: You can find more new GNOME software releases at gnomefiles.org.

openSUSE KDE Community Week Brings Distro And KDE Closer (KDE.News)

KDE.news reports on KDE work during the just-completed openSUSE community week. "As SUSE has opened up and become openSUSE over the last couple of years, the team has adopted a pragmatic bug policy so that bugs which are definitely not specific to openSUSE are moved upstream to bugs.kde.org. This is in everybody's best interest since the bugs end up where there are developers most able to fix them quickly, and our expert team of bug triagers improve the quality of the reported bugs by filtering out packaging issues, broken patches, reports against old versions and reports caused by underlying system problems. Keeping bugs hanging around distro bug trackers for months where maintainers' limited resources mean they get limited attention is frustrating for users who take the time to make reports."

KDE Software Announcements

The following new KDE software has been announced this week: You can find more new KDE software releases at kde-apps.org.

Xorg Software Announcements

The following new Xorg software has been announced this week: More information can be found on the X.Org Foundation wiki.

Desktop Publishing

rst2pdf 0.10 released

Version 0.10 of rst2pdf has been announced. "It's my pleasure to announce the release of rst2pdf version 0.10. This version includes many bugfixes and some new features compared to the previous 0.9 version. Rst2pdf is a tool to generate PDF files directly from restructured text sources via reportlab."

Full Story (comments: none)

GUI Packages

pyFltk 1.1.4 released

Version 1.1.4 of pyFltk, the Python bindings to the FLTK cross-platform GUI toolkit, has been announced. "This is a maintenance release of pyFltk, supporting fltk-1.1.9 and Python2.6. Changes include some bug fixes and fixes for several compilation issues on OSX and 64bit systems."

wxPython 2.8.10.1 released

Version 2.8.10.1 of wxPython, a Python interface to the wxWindows GUI toolkit, has been announced. "This release fixes the problem with using Python 2.6's default manifest, and updates wxcairo to work with the latest PyCairo."

Full Story (comments: none)

Math Applications

mystic 0.1a2 released

Version 0.1a2 of mystic, a simple model-independent inversion framework, has been announced. "Primarily a bugfix and documentation release."

Full Story (comments: none)

Music Applications

Denemo 0.8.4 released

Version 0.8.4 of the Denemo music notation editor has been announced. "Some of the new features are improvements for scripting support and user-created commands and improved MIDI-output. Official support, beneath our website, is avaible via our IRC channel #denemo on irc.freenode.net."

Full Story (comments: none)

Languages and Tools

Caml

Caml Weekly News

The May 19, 2009 edition of the Caml Weekly News is out with new articles about the Caml language.

Full Story (comments: none)

Python

Python-URL! - weekly Python news and links

The May 14, 2009 edition of the Python-URL! is online with a new collection of Python article links.

Full Story (comments: none)

Tcl/Tk

Tcl-URL! - weekly Tcl news and links

The May 20, 2009 edition of the Tcl-URL! is online with new Tcl/Tk articles and resources.

Full Story (comments: none)

Libraries

User space RCU library relicensed to LGPLv2.1

Mathieu Desnoyers has announced that liburcu, a user-space implementation of Read-copy update (RCU) is now available under the LGPL v2.1. RCU is a technique used by the Linux kernel to handle concurrent access to data structures without locking. It was patented originally by Sequent and is now held by IBM, which previously had licensed it only for GPL-licensed implementations. "liburcu is a LGPLv2.1 userspace RCU (read-copy-update) library. This data synchronization library provides read-side access which scales linearly with the number of cores. It does so by allowing multiples copies of a given data structure to live at the same time, and by monitoring the data structure accesses to detect grace periods after which memory reclamation is possible." Click below for the full announcement.

Full Story (comments: 24)

Miscellaneous

DeHaan: Recognizing and Avoiding Common Open Source Community Pitfalls

Michael DeHaan looks at open source development misconceptions in a posting to his blog. In it, he looks at around a dozen different "stumbling blocks" that projects might run up against. "What typically isn't written though are about some of the misconceptions — A lot of folks contributors appear overnight out of the woodwork, that users grow on trees, and that it's possible to direct community members as if they were employees — not so, of course, and folks get disappointed or discouraged when it doesn't happen."

Page editor: Forrest Cook

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds