May 18, 2009
This article was contributed by Patrick Ohly
Recently, a SyncEvolution user
described SyncML and some of its challenges in
an LWN article. In this article, the
author of that software adds more information for developers who want to
synchronize data. Particular attention is given to Personal Information
Management (PIM) data (such as contacts, events, tasks, notes) and how
SyncEvolution and the underlying
Synthesis
SyncML Engine deal with the problems inherent in synchronizing that kind
of data.
It is worth pointing out that SyncML as a protocol is data format
agnostic and not limited to PIM. PIM is just the most common usage today.
In addition, PIM data is particularly difficult to synchronize.
Synthesis just made its
core technology available under the LGPL v2.1 and 3.0. SyncEvolution 0.9 beta 1 is the
first open source program using libsynthesis. Together, these two projects
are the core building
blocks for data synchronization in the Moblin project.
PIM Synchronization: Challenges
Whenever your author hears the 1974 Steeley Dan song "Rikki don't lose that
number", he wonders whether the advice to "send it off in a letter to
yourself" is still valid today, or whether we can trust our software to
keep important phone numbers safe and sound. Why is it that synchronization
software for PIM data still not 100% reliable?
Database replication is a well understood problem. But PIM data is special
in many ways. First, there is no globally unique identifier (GUID) for
items. When comparing two databases for the first time, there is often no
additional meta-information, beyond the data itself. Without a GUID in the
item data, it is hard to determine whether two items from different
databases refer to the same entity. In particular, because the
content and/or representation of the same logical entity is often different
in different databases.
For PIM data, the storage and exchange formats are typically vCard
2.1/3.0, vCalendar 1.0 and iCalendar 2.0. At first glance it seems that
items in these formats have a GUID. But the UID property in a vCard is not
mandatory in version
2.1 or version 3.0 of the
standard. Even if it is used by a particular program, it is not guaranteed
to be globally unique and therefore cannot be relied upon when comparing
two different databases. The same problem exists with vCalendar 1.0, which is still
the most common format used by consumer devices for events and tasks. Only
iCalendar 2.0 specifies a
mandatory, global UID property because it is required for exchanging meeting invitations.
Without a GUID, one has to compare the content of items to identify
matches. But the PIM data formats allow many different more or less
complete implementations and representations of the same
information. One side of an implementation might support just the bare
minimum of information for a contact (for example, name and phone number),
while the other side may support everything defined in the standard (photo,
arbitrary number of email addresses, and phone numbers), plus
non-standard
extensions such as spouse and instant messaging handles.
A simple byte comparison, without any understanding of the semantics of
the data, is therefore not good enough.
After identifying matching items, there is a third problem: if the items
differ in some properties, which item is more up-to-date?
There are REV and
LAST-MODIFIED properties but again, support for them is not
guaranteed. Worse, both items might have been created or updated
; independently
so that each has valid information the other does not have (new phone
number added on my cell phone, address changed on my desktop).
Fourth, it is necessary to support these data formats to be interoperable
with existing devices. One cannot simply choose a custom data format that
avoids the previous three issues. Neither is it possible to make
assumptions about the implementation of a peer and what it may or may not
support.
Fifth, not knowing enough about a peer is particularly problematic when
receiving an item back from that peer. If a property is missing that was sent
to the peer earlier, does that mean that the user has removed this piece of
information or that the peer was unable to store it? In the former case,
the property must also be removed locally. In the latter, it needs to be
preserved while updating the other properties. Only allowing one-way
synchronization avoids this problem, but is also considerably less useful.
SyncML
The
SyncML protocol (aka Open Mobile Alliance Data Synchronization, OMA-DS)
itself is fairly simple, at least up to and including version 1.2.1, the
latest version supported by Synthesis and most other
implementations. SyncML defines a general message format with encodings,
both as XML and the more compact WAP Binary XML (WBXML). The exchange of
these messages over HTTP (as POST and reply) and Bluetooth is also
standardized. A typical session requires three message exchanges. When
sending many small data items (~2KB) with WBXML as the encoding, the
measured data overhead for the SyncML protocol was 8%, 2.5 times less
than for XML.
During a sync session, a client simply talks to a server. The protocol is
intentionally not symmetric. A client is fairly simple to implement and
usually only talks to one server. The server implements all of the advanced
logic like tracking the state of multiple clients, matching items and
merging data.
A client has to keep track of local changes (added/removed/updated) between
sync sessions, using its own locally unique identifier (LUID) to refer to
items. It then has to be able to export, import, update, and delete items. The
server needs to maintain a mapping between GUIDs and the corresponding LUIDs
that are used by each client.
At the start of a sync session, client and server authenticate each other
and negotiate which databases they want to synchronize (identified by a
Uniform Resource Identifier (URI)) and which data formats are acceptable
(MIME types). In theory, this information could be used to configure
clients automatically. In practice, it is often necessary to configure
manually because the information is only sent for
URIs that are listed explicitly, leading to a chicken-and-egg problem.
The information about supported data types can be detailed enough to
describe which properties of the different PIM formats are supported by a
client or server. The Synthesis engine generates this information for its
peer automatically from the configuration (more on that below) and uses the
information received from its peer to merge updates intelligently. Other
servers (such as my.funambol.com)
check this information only to determine whether specific
properties, like PHOTO, are supported and then hard-code the rest of
the data handling.
As part of getting the client and server ready for a synchronization, both
agree on the sync modes for active databases. The standard specifies
one-way and two-way synchronization, both incremental (the only changes
sent are those made since the last synchronization) and complete (all
currently existing items have to be sent). For the initial session or after
a failure, this "slow" mode is used to get client and server in sync
(again). That is, the client sends all of its items then the server
compares those items against its own data and sends back changes to the
client. As explained above, this matching is problematic therefore the
standard also supports complete sync modes, where one side is told to wipe
out all data before receiving items ("refresh from client" and "refresh
from server").
A session concludes when both sides have sent their changes and some meta-information (for example, new LUIDs assigned by the client). The standard
defines a mechanism for suspending a session mid-stream, then resuming
it later. The same resume mechanism can also be used to recover from an
unexpected loss of connection. This is an optional feature of the standard,
supported by the Synthesis implementation but not all servers. Without this
feature, a slow sync is necessary to keep client and server reliably in
sync.
Synthesis SyncML Engine + SyncEvolution
Both the Synthesis SyncML Engine and SyncEvolution are implemented in
C++. Synthesis paid particular attention to portability of its code to
platforms with less capable compilers. Therefore, the choice of C++
features used is intentionally limited (no hard dependency on exception
support, moderate use of templates, and the standard template
library). SyncEvolution is less restricted and uses both exceptions for
error reporting and the "resource acquisition is initialization" (RAII)
design pattern to track resources, plus Boost templates (but no Boost
libraries at this point).
The diagram shows how the two projects interact and fit into the Moblin
infrastructure. A graphical interface is under development. Solid boxes represent
executables and dotted boxes represent libraries. The engine itself is
compiled into a library with two stable, plain C APIs:
- an API for a SyncML client user interface, like SyncEvolution
- a database API for plug-ins, which connect the engine with local data
The BSD-licensed SDK provides a glue mechanism that can be linked
statically to access these APIs in C++, without tying clients closely to
the implementation. Bindings for Java are available from Synthesis under a
commercial license. The official documentation for this is the Synthesis "SDK
and Plug-in Interface" reference manual [PDF], which is included in the
open source distribution.
The same API is also designed for use of the engine in a SyncML server, but
this part of the API is not completely implemented at this time. The
code for the server role exists and is used in Synthesis's server
products. It is mostly identical with the code that is used by clients.
The engine uses the same
XML-based
configuration [PDF] mechanism in both roles. Data format support is
not hard-coded in the engine. Instead, the XML configuration defines
datatypes and their mapping to the standard formats. So, when using
Synthesis for both client and server, the definition of a custom data
format has to be written only once. The engine can automatically store data
defined this way in a relational database using the
Open Database
Connectivity (ODBC) API, so it might not even be necessary to write C
or
C++ code.
Because of the shared engine, clients automatically have some of the
features normally found only in SyncML servers, like interpreting device
information and intelligently updating only those properties of an item
really supported by a peer. With well written peers, this goes a long way
towards solving problems four (making assumptions about the peer) and five
(getting incomplete items back). For cases where the information provided
by a peer is insufficient, the engine also has the possibility of making
item parsing and generation depend, for example, on the peer name and/or
firmware version.
The engine itself does not implement a particular message transport,
which minimizes system dependencies and allows adding custom transports
without changing the engine. The client calling the engine is responsible
for receiving messages, which are then processed by the engine, and for
sending messages generated by the engine. SyncEvolution provides that part
for HTTP(S), using either libsoup or libcurl, depending on how it was
configured during compilation. It also provides a command
line tool, which configures a client and runs a sync session, something
which is currently missing in the Synthesis open source release itself. The
SyncEvolution 0.9 beta 1 source tar
ball includes a copy of the engine source and compiles everything with
one "configure; make; make install" invocation. This is the most
convenient
way of getting started using the source code.
Originally, SyncEvolution was a tool for the Evolution mail and PIM
application, but it was always meant to be more flexible and can be
compiled without depending on Evolution or GNOME. The Evolution backend is
just one of many. Plain files (used for
KDE
synchronization) and Mac OS X Address Book are also supported. More
backends could be added as described in this
blog article. The file backend synchronizes files inside a directory and is portable, so it can be compiled on different
platforms.
When adding these backends, your
author dodged the bullet of having to rename the project by reinterpreting
the name as "SyncEvolution - the missing link".
Another important component of SyncEvolution is a CPPUnit-based testing
system, which runs local database access tests, as well as integration
tests with real SyncML servers. With the help of the "synccompare"
comparison tool, it checks for data modifications when importing and
exporting items locally and sending items back and forth.
Data Modeling and Handling
After introducing Synthesis and SyncEvolution, there is still the question
is still: how are the tricky PIM data handling issues solved?
The Synthesis Engine merely provides the infrastructure for data
handling. The actual data modeling and processing of specific kinds of data
is defined entirely in the
XML
configuration [PDF]. The standard PIM formats are supported out of the box
in the
"syncclient_sample_config.xml": for each kind of data, it defines one
"field list", a more or less flat key/value representation of items that
is easier to process automatically than the often complex standard PIM
formats.
The conversion to and from these formats is defined via "profiles" which
map the internal fields to the corresponding properties in the external
formats.
There are two ways to define different external representations:
it is possible to disable parts of a profile conditionally depending on
runtime parameters as well as define multiple profiles which use the same
field list. Then data conversion is done by parsing with one profile and
encoding with another.
The semantic associated with a profile definition is sufficient to generate
the SyncML Device Information from that profile automatically and to
determine which fields need to be updated when importing an updated
item. Merging and comparison is also completely configurable. For more
complicated cases, the engine can invoke scripts defined in a C-like
language, for example to post-process an item just received from a peer.
Open Issues + Ideas
High on the list of items to work on next is to integrate a SyncML server
into the Linux desktop. Synthesis offers a "traditional" http server for
Linux and Windows, but this is designed for interaction with remote devices
over the Internet. For a local desktop, Bluetooth is perhaps more
important. Such a desktop server could also offer a GUI that the user can
use to control it and interactively influence its operation. During merge
operations, current Internet-based SyncML servers are limited to executing
hard coded heuristics and have the difficult choice between
duplicating
or dropping information. A local server could ask the user to
help with merging conflicting items.
The data conversion routines in the Synthesis Engine are currently tied to
a SyncML session context. After some non-trivial, but doable, code
refactoring, these routines could also be exposed as a set of simple API
calls. This may be useful in various projects, like
OpenSync.
The goal is to continue with SyncEvolution and Synthesis, not just as open
source, but also as open projects, with as much communication on public
channels as possible. We are actively
seeking involvement and feedback as we get these projects going and as we
figure out how to do all of this properly.
(
Log in to post comments)