By Jake Edge
August 22, 2007
A recent outage at Voice over IP (VoIP) provider Skype has caused quite a
stir. For nearly two days, users of the VoIP software could not make
calls, which set off a storm of blog postings wondering about the cause.
Skype released an official
explanation that did not ring true to some, leading to further
speculation.
Sometime early Thursday, 16 August, Skype users could no longer
authenticate and connect to the network. On Friday, right in the middle of
the outage, a posting to
Bugtraq purported to have information about the vulnerability that was
being exploited to cause the outage. Skype has since categorically denied
that any attack was responsible, but suspicions persist that the reported
denial-of-service (DoS) vulnerability was the actual cause of
the outage.
On Monday, Skype posted the following to their Heartbeat blog:
On Thursday, 16th August 2007, the Skype peer-to-peer network became
unstable and suffered a critical disruption. The disruption was triggered
by a massive restart of our users' computers across the globe within a very short
timeframe as they re-booted after receiving a routine set of patches
through Windows Update.
The high number of restarts affected Skype's network resources. This caused a
flood of log-in requests, which, combined with the lack of peer-to-peer
network resources, prompted a chain reaction that had a critical impact.
Though they never blamed Microsoft or the updates themselves, many in the
media did it for them, which led Skype to clarify their explanation of the
outage.
The new message provided more details, but remained silent on one
of the central puzzles: why did updates on Tuesday cause an outage
starting on Thursday? While they acknowledge a bug in their
software, there is no mention of how the situation was resolved;
presumably it was through an automatic update of their own. Overall, the
explanations are fairly thin on technical detail, which leaves others to
fill in the holes with conjecture.
Skype has many millions of users – the software is available
for Windows, OS X and x86 Linux – who rely on the no-cost PC-to-PC calling
as well as the other services that Skype does charge for. Hopefully the free
users are not depending on the service, but there are
companies that use Skype exclusively; an outage for two weekdays must have
been rather painful. Certainly the landline and cellular phone companies
have had their problems along the way, but those tend to be regional
rather than worldwide.
All software even minimally more complicated than "hello world" has bugs,
and those bugs will be triggered in surprising ways. Taking the Skype
"perfect storm" explanation at face value, it is nearly amazing that
millions of reboots could result in a network storm so severe that it would
take two days to resolve. Somehow, in the interface between
Skype's centralized authentication and their P2P routing code, things went
horribly awry. It does, however, give one pause about the power of the
near-monoculture in desktop operating systems.
It is hard, but not completely impossible, to imagine a similar
scenario for Linux boxes. To start with, it is uncommon that a software
upgrade requires a reboot. Within the Linux user community, there is a
wide range of kernel versions running, so even if there were a critical
security fix that required "all" Linux kernels to be upgraded, it would not
be very synchronized – the distributions tend to have different
response times. This is a bit of a double-edged sword, of course: those
varying response times could leave a hole that a worm or attacker could
exploit. But, because Linux boxes are controlled by their owners rather
than by their distribution provider, synchronized reboots are probably not
a major cause for concern.
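To put rough numbers on the difference between synchronized and staggered restarts, here is a minimal sketch. It is not a model of Skype's network: the client count and both time windows are hypothetical figures chosen only for illustration, and it measures nothing beyond the arrival spike of login requests, not the chain reaction Skype describes.

```python
import random
from collections import Counter


def peak_logins_per_minute(num_clients, window_minutes, seed=0):
    """Return the login count of the busiest minute when each client
    restarts (and logs back in) at a random minute within the window."""
    rng = random.Random(seed)
    per_minute = Counter(
        rng.randrange(window_minutes) for _ in range(num_clients)
    )
    return max(per_minute.values())


# Hypothetical figures, not taken from Skype or from the article.
CLIENTS = 100_000

# Monoculture case: everyone reboots within the same ten minutes.
synchronized = peak_logins_per_minute(CLIENTS, window_minutes=10)

# Mixed case: the same reboots spread over three days.
staggered = peak_logins_per_minute(CLIENTS, window_minutes=3 * 24 * 60)

print("peak logins/minute, synchronized:", synchronized)
print("peak logins/minute, staggered:   ", staggered)
```

The same number of clients produces a far higher peak when the restarts are compressed into a short window, and that peak is what an authentication service has to be sized for.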
Beyond monocultural issues, there is the question of how a P2P system can
be taken down by the lack of a centralized resource, in this case
credentials from an authentication server. That introduces a single point
of failure into what is supposed to be a robust architecture, resistant to
exactly those kinds of problems. There are also those who wonder if the
outage was caused by an "upgrade" mandated by the US government so that
they can more easily monitor Skype calls.
Skype is proprietary and closed source; there is no easy way to
determine whether the problem has been fixed, or even whether the problem
is being accurately described. If Skype decides, or is forced, to change
their software to be more easily monitored, it will be hard to detect. It
might look an awful lot like a multi-day outage that clears up somewhat
mysteriously. Trusting closed source software for vital communications is
not the best of plans, at least when there are alternatives.
Free software would not necessarily avoid these kinds of problems, but
a completely decentralized network with multiple clients sharing a
protocol, but little else, would certainly be more resistant to this kind
of outage. More importantly, it would also be more transparent. Over
time, projects like openwengo, Linphone, Asterisk and others can
hopefully provide those benefits to a larger audience.