April 9, 2008
This article was contributed by Diego Pettenò
[Editor's note: This article, which looks at the interactions of software projects and distribution providers, is presented in three parts.
Part 1 introduced the concepts which are developed further here, in part 2.]
Technical needs
Under the name
technical needs we're going to look at a series of
requests that distributors often have to make to the original
developers of the software they want to package. Not all of these requests
are made by all distributors. Some will care more about one particular
aspect than another. Some might apply only to non-mainstream
distributions, and some distributions might just want to take care of
philosophical needs and leave the technical side entirely alone, even if
such distributions aren't exactly common.
Most of the technical needs described in this article are present in
the policies set forth by Debian (written) and Gentoo (mostly unwritten),
and they apply to other distributions as well. Some of these needs won't be encoded
in any policy and are often not requested explicitly by the developers.
Those are mostly details that make a distributor's life easier. These
details may not be mandatory, but it's still worth considering them. The
easier the life of the downstream maintainer is, the easier it is
for the software to be packaged.
Also, it's important to note that when a distribution makes a request,
it might not be alone. Other distributions might want to take
advantage of the same change, but haven't had time to request it,
or simply preferred to wait until some issues were resolved before
packaging the software. Don't just ignore the request because the
distribution which contacted you already took care of the issue by
patching your software. Acknowledge the request and apply the patch;
it will make both your life and theirs easier in the long term.
Sane version information
Distributions often rely on the
version information provided by the original software developers. This
usually means that they don't expect huge changes between version
x.y.z and version x.y.z+1.
One very common scheme for versions is the major, minor,
micro version, which in the example above would be respectively
x, y and z (it's a common misconception that y is the
major version component).
The way this kind of scheme is usually applied relates to the
compatibility of the programming interface (API and ABI). Changes in the
software warrant increments of various version components depending on the
amount of changes in the interfaces:
- adding zero or more interfaces, without changing or removing
previous interfaces, or the behaviour expected from them - meaning
the software is entirely compatible with the older version - usually
only requires an increment of the micro version;
- changing or removing interfaces, usually deprecated ones - in such a
way that older software might need to be adapted, but not
rewritten - usually requires an increment of the minor version;
- changing the interface entirely - requiring users of the
software to rewrite their code, or otherwise make major structural
changes - usually requires an increment of the major version.
Obviously, increasing one component will usually involve resetting the
version components to its right to zero.
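As a purely illustrative example (the numbers are made up), a release
history following this scheme might look like:
    1.2.3 -> 1.2.4   bug fixes only, no interface changes
    1.2.4 -> 1.3.0   new interfaces added, a few deprecated ones changed
    1.3.0 -> 2.0.0   interfaces reworked, dependent software needs porting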
There might be other components, too. For instance, if the source
archive has to be regenerated without any code change (missing file,
updated addresses for the maintainers or the homepages), rather than
changing the version entirely, a suffix might just be added at the end
of the version, making it, for instance, 1.2.3a or 1.2.3c. If just
a security issue has been fixed, it could also be expressed by adding
a nano component to the version, like 1.3.34.1, to
emphasize that there is no change other than the security fix.
The source archives for the software should be named after both the
project and the version, resulting in names like
foobar-1.3.4.tar.gz. Having different versions of the same
software that don't have the same naming causes confusion.
It is quite important for the distributions that source archives not be
changed without changing the name: distributions usually make sure that
the checksum (usually MD5, but often nowadays SHA1) of the archive is
the one they recorded, and changing the tarball without notice often
leads to failed builds.
There is a similar issue with the naming of the directory inside the
archive. Most distributions assume that the source is included inside
a directory with the same name as the archive (minus the extension),
but often enough the archive contains sources not organised in a
directory at all, or organised in a directory named after the project
without the version. Similarly, if possible, the directory name should
also include any suffixes, to avoid distributions having to handle
extra cases when they are present.
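As a quick sanity check, this is what a distributor's tooling typically
expects to see for the hypothetical foobar archive used above (the
checksum is recorded at packaging time and verified on every build):
    $ sha1sum foobar-1.3.4.tar.gz
    <recorded checksum>  foobar-1.3.4.tar.gz
    $ tar tzf foobar-1.3.4.tar.gz | head -n 1
    foobar-1.3.4/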
Distribution methods like Ruby Gems and Python Eggs mandate similar
version schemes for their packages for the same reason Free Software
distributions prefer them: it makes it easier to compare versions
and know when something has to be updated.
Internal libraries
One common issue considered by both Debian and
Gentoo policies relates to the use of internal copies of
libraries. Sometimes the software needs some uncommon libraries to
work properly. These libraries are unlikely to be found on users'
systems, so users would have to download and install them
separately - not an easy task for new users. For that reason, a few
projects keep an internal copy of the libraries they want to use,
and use that internal copy unconditionally.
Adding an internal copy of a library seems cheap to the original
developers, and it's convenient for users to download and install a single
package; however, this causes a large number of problems for the
distributors. The first problem is that they might have to patch the same
bug several times. Take zlib as a practical
example: a very common library implementing the classic deflate
compression algorithm. It's a very small library that a lot of projects
have imported internally over the years. Not too long ago, a serious
security issue was found in the code of zlib, and all the distributors
had to patch it out as fast as they could. In a perfect world, patching
zlib and then rebuilding everything that linked to it would have sufficed.
Unfortunately, we're not in a perfect world. A lot of software was packaged
with internal copies of the library, requiring each of those packages to be
patched to make sure the issue was solved.
There are many other implications of using internal bundled copies
of libraries, and most of them are critical for distributors. These
problems become even more complex when the internal copies of
libraries are modified to better suit the application's use of
them. In those cases, even though the source might be advertised as
being part of another library, it is actually different from that
library, and replacing it might be impossible, or may cause
further problems.
- The code is no longer shared between programs: not only
the source code, which requires extra work to fix bugs and security
issues, but also executable code and data. When shared libraries are
used, the memory used by processes loading them is reduced, as they
will share code and part of the data. This cannot be done when using
static libraries or, worse, internal copies of libraries.
- Symbols may collide during loading: modern Linux
and Unix systems use the ELF format for programs and libraries. This
format provides a so-called flat namespace in which the symbols
(data and functions) are looked up. When a library bundles internal
copies of other libraries, the two definitions of the same symbol
might collide, and just one of them will be used. If the interface
used by the library has changed subtly, it is possible that this will
lead the program down an execution path that was not intended and is
not safe.
- Distribution-specific changes need to be duplicated: as
will be discussed later on, distributions sometimes need to make
changes to source code, to fix bugs (security-related or not) or,
for instance, to change file paths. Internal copies require
downstream maintainers to repeat these changes multiple times.
For this reason, a good compromise between the needs of the original
authors and the needs of the distributions is to treat internal copies of
libraries as
untouchable, thus disallowing any changes to their
interface or behaviour. That way those users who get the package directly
from upstream still have only one package to download and build. The
distributions, which want to share code as much as possible, should have a
way to ask the build system to use the system copy of that library. An easy
way to implement that is to provide a
--with-system-libfoo
option at the
./configure call (for
autoconf, for
instance), or a
WITH_SYSTEM_LIBFOO handle on the
make command line.
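As a minimal sketch of how such a switch could look with autoconf
(libfoo, the option and the variables are hypothetical; the fallback
assumes the bundled copy lives in a libfoo/ subdirectory built with
libtool, and that the system library ships a pkg-config file):
    dnl configure.ac fragment: let the packager pick the system libfoo
    AC_ARG_WITH([system-libfoo],
      [AS_HELP_STRING([--with-system-libfoo],
        [use the system copy of libfoo instead of the bundled one])],
      [], [with_system_libfoo=no])

    AS_IF([test "x$with_system_libfoo" = "xyes"],
      [PKG_CHECK_MODULES([LIBFOO], [libfoo])],
      [LIBFOO_CFLAGS='-I$(top_srcdir)/libfoo'
       LIBFOO_LIBS='$(top_builddir)/libfoo/libfoo.la'])
    AC_SUBST([LIBFOO_CFLAGS])
    AC_SUBST([LIBFOO_LIBS])
The rest of the build then refers only to LIBFOO_CFLAGS and LIBFOO_LIBS,
regardless of which copy was selected.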
By allowing the distributions to use the system copies of libraries,
the developers are still preserving the ability for the user not to
install extra dependencies, while also giving the distributions the
control they need without having to change the original code, sometimes
in conflicting ways. It is important for the upstream authors not to
change the behaviour of bundled libraries, as the distributions will most
likely want to use a shared system library instead. Modifications made to
a bundled library will likely cause problems for users who get the package
from their distribution's repository, where it has been built against the
shared system library.
An easy choice for optional dependencies
Almost all distributions
prefer having a choice about the optional dependencies of a
package. Source-based distributions (like Gentoo and FreeBSD's ports
system) offer the same choices as the original project, or more. Gentoo's
USE flags and FreeBSD's
knobs offer the user a choice of which
options will be enabled. Binary distributions (like Debian or RedHat)
might want to choose options to ensure that the final binary package does
not try to use dependencies that are not present in their official
repositories.
Again, if a project does not provide an easy way to control whether
some optional dependency is used, most distributions will either try
to work around that problem (by forcing cache discovery variables, for
example) or change the build system themselves to get the choice to
enable or disable the dependency. This creates problems similar to the ones
discussed above: different distributions might make slightly different
changes, which may cause errors when merging them in, and they might
make mistakes that introduce new bugs.
As above, it's just a matter of providing a switch in the build system
(like a --disable-feature or --without-feature in
autoconf, or a WITHOUT_FEATURE knob for
make). If the software has a plug-in infrastructure, binary
distributions might also just package the different plug-ins in different
packages, allowing the user to choose which ones to install. Software
without plug-in structures might require building different packages with
different feature sets. For instance, if a piece of software can use either
OpenSSL or GnuTLS as its SSL/TLS implementation, then the distribution
might create two packages, each linking to one or the other. The user could then
choose between the two.
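Going back to the WITHOUT_FEATURE knob mentioned above, for a
hand-written Makefile something along these lines is usually enough
(GNU make syntax; GnuTLS is just an example feature, and the variable
names are made up):
    # Build GnuTLS support unless the packager passes WITHOUT_GNUTLS=1
    ifeq ($(WITHOUT_GNUTLS),)
    CPPFLAGS += -DHAVE_GNUTLS
    LIBS     += -lgnutls
    endif
A downstream maintainer can then simply run make WITHOUT_GNUTLS=1 to
build the package without that dependency.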
When an optional dependency is discovered by the build system,
used if present and ignored if not, without a way to tell the software
not to build the optional feature even though the library is present on
the system, we're talking about an automagic dependency: the
package, optionally using another, discovers its presence automatically,
without allowing the user (or the downstream maintainer) to ask not to
use it. This
kind of dependency is usually a problem just for source distributions, as
they build the software on users' systems, which may or may not have the
same configuration as the developer working on the build scripts. Binary
distributions, on the other hand, build their code in controlled
environments with only the stated dependencies installed. It might
still confuse one of their developers into thinking that a given
dependency is mandatory, after seeing it enabled in a local build and
not finding an option to disable it.
In general, automagic dependencies should be avoided; a
soft failure default is usually equivalent for the casual user
- enable the dependency if found, disable it if not found, but
still give a way to tell the build system to disable it even when it is
found. This preserves the behaviour intended by the original developers,
while also providing the control that (source) distributions want to have
over what is built.
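With autoconf, one way to express that is a three-state option, sketched
below under the assumption that the optional library (here called libfoo)
ships a pkg-config file; the option and variable names are made up, and
AM_CONDITIONAL assumes automake is in use:
    dnl --with-foo=auto (default): use libfoo if found, skip it otherwise
    dnl --with-foo=yes: fail if libfoo is missing; --with-foo=no: never use it
    AC_ARG_WITH([foo],
      [AS_HELP_STRING([--with-foo], [build libfoo support (default: auto)])],
      [], [with_foo=auto])

    have_foo=no
    AS_IF([test "x$with_foo" != "xno"],
      [PKG_CHECK_MODULES([FOO], [libfoo], [have_foo=yes],
        [AS_IF([test "x$with_foo" = "xyes"],
          [AC_MSG_ERROR([libfoo support requested but libfoo was not found])])])])
    AM_CONDITIONAL([HAVE_FOO], [test "x$have_foo" = "xyes"])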
Control over how the software is built
Another problem
shared by both binary and source distributions is having control over
how the software is built. For binary distributions this usually means
being able to impose options on the compiler, linker and other tools
during the package build process, so that packages respect the
distribution's standard options.
For source distributions, this means allowing the user to choose the
options to pass to the compiler, linker, assembler and other build
tools, on a package-by-package basis.
This does not mean that the distributions want to force-feed extra
optimisations into software that might be fragile. This seems to be the
biggest concern of developers who don't want to provide a way to change
the options used at compile time.
Distributions might want to reduce the optimisations used, or they might
just wish to enable (or disable) warnings to more easily spot
problems with their packages. Distributions might also want to build
debug information, or remove debug messages, and so on. There is a huge
number of possible combinations.
When the distributions want to reduce optimisation, that might be
because they need to create packages which work on older architectures
not compatible with these optimisations, or because they know that some of
these optimisations are not going to work in their environment. They
might know that their version of the compiler does not support the
optimisation, or there could be other reasons. Usually, the distribution
knows the best way to handle the package for its own environment.
This also leads to a compromise between upstream developers and downstream
maintainers: the former should provide their own default options and
optimisations, leaving a way to override these defaults as the
distributions see fit. On the other hand, distributions should try their
best to determine when problems might be caused by their own
choice of optimisations, and should not expect upstream
developers to fix problems that they have caused with that
choice. This way, it's usually possible to keep the relationship
between upstream and downstream on good terms even when the set of
optimisations used is totally different.
More often than not, the problem is not even the developers' willingness
to provide an override, but rather a problem of actually
making such an override work. While most distribution developers
can fix these problems with relative ease, original developers would
probably want to facilitate the work of their distributors by checking
their own releases to ensure that passing very minimal options to the
compiler works as intended. A common mistake is hard-setting
CFLAGS (or similar variables) in the configure.ac
file for autoconf (which otherwise has proper support for
user-chosen options).
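A sketch of the difference, for an autoconf/automake setup (the flags
shown are arbitrary examples and WARN_CFLAGS is a made-up variable):
    ## configure.ac - problematic: clobbers whatever CFLAGS the packager set
    # CFLAGS="-O3 -fomit-frame-pointer"
    ## better: keep project defaults in their own substituted variable
    AC_SUBST([WARN_CFLAGS], ["-Wall -Wextra"])

    ## Makefile.am - automake's default rules put the user's CFLAGS after
    ## AM_CFLAGS, so the packager's options still win
    AM_CFLAGS = $(WARN_CFLAGS)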
While we're talking about compiler optimisations, it's important to note
that for some software, e.g. number-crunching software (multimedia
applications, cryptography tools, etc.), enabling extra optimisations is
desirable. Even so, it should be possible to disable extensive
optimisation. These optimisations are usually fragile, and only work in
particular environments (compiler type and version, and architectures), so
having a way for distributors to decide what they actually want to enable
is a very real need.
But having a way to provide options to the compiler (CFLAGS and
CXXFLAGS, for C and C++ respectively) is not all that is
needed: most modern distributions might want access to the options
used by the linker (LDFLAGS) to change the kind of hash
tables to be generated, or to enforce particular security measures. For
custom-prepared build systems, it's a common mistake to ignore this need,
or to support it in the wrong way. Linker options should go before the
list of object files, which in turn should go before the list of libraries
to link to. This is another common mistake that distributors can fix with
relative ease, but it would be better taken care of by the original
developers; otherwise the same fix has to be repeated by (almost) all
distributions.
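For a hand-written Makefile rule, the ordering described above looks
roughly like this (the program name and variables are placeholders):
    # linker flags first, then the object files, then the libraries
    foobar: $(OBJS)
            $(CC) $(CFLAGS) $(LDFLAGS) -o $@ $(OBJS) $(LIBS)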
[This ends part 2 of this article. Stay tuned for part 3, which will cover
the philosophical concerns and present some conclusions.]