The Skype outage
A recent outage at Voice over IP (VoIP) provider Skype has caused quite a stir. For nearly two days, users of the VoIP software could not make calls, which set off a storm of blog postings wondering about the cause. Skype released an official explanation that did not ring true to some, leading to further speculation.
Sometime early Thursday, 16 August, Skype users could no longer authenticate and connect to the network. On Friday, right in the middle of the outage, a posting to Bugtraq purported to have information about the vulnerability that was being exploited to cause the outage. Skype has since categorically denied that any attack was responsible, but suspicions persist that the reported denial-of-service (DoS) vulnerability was actually to blame.
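On Monday, Skype posted an explanation to their Heartbeat blog. It attributed the disruption to a massive restart of users' computers across the globe, within a very short timeframe, as they rebooted after receiving a routine set of patches through Windows Update. The post went on: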
The high number of restarts affected Skype's network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.
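Skype published no figures, so the following Python sketch is only a toy model of the feedback loop described in that quote. It rests on invented assumptions. Each online peer contributes a fixed share of login capacity, since in a P2P network the peers themselves are the resource. Every offline client retries on every step. When demand exceeds capacity, a fixed share of the online peers is overwhelmed and drops off, which means those peers have to log in again themselves. The function name and all parameters are hypothetical.

```python
def simulate_login_storm(total, restarted, capacity_pct, drop_pct, steps):
    """Return the number of online peers after each time step.

    total        -- clients in the network
    restarted    -- clients knocked offline at once (e.g. by a mass reboot)
    capacity_pct -- logins served per step, as a percentage of online peers
    drop_pct     -- percentage of online peers that fall over in any step
                    where login demand exceeds the available capacity
    """
    online = total - restarted
    history = [online]
    for _ in range(steps):
        offline = total - online                   # every offline client retries
        capacity = online * capacity_pct // 100    # capacity scales with online peers
        served = min(offline, capacity)
        # Overload knocks some serving peers offline too: the feedback loop.
        dropped = online * drop_pct // 100 if offline > capacity else 0
        online = online + served - dropped
        history.append(online)
    return history


if __name__ == "__main__":
    # Invented numbers: 1000 clients, each online peer can serve half a
    # login per step, and 60% of peers drop out when overloaded.
    print(simulate_login_storm(1000, 200, 50, 60, steps=5))
    # [800, 1000, 1000, 1000, 1000, 1000]
    print(simulate_login_storm(1000, 400, 50, 60, steps=5))
    # [600, 540, 486, 438, 395, 355]
```

With these made-up parameters, the network absorbs a restart wave in a single step as long as roughly two thirds of the peers stay online. Below that point, every step loses more peers to overload than it gains from successful logins, so the population keeps shrinking instead of recovering.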
The new message provided more details, but it remained silent on one of the central puzzles: why did updates on Tuesday cause an outage starting on Thursday? Skype acknowledges a bug in their software, yet there is no mention of how the situation was resolved, presumably through an automatic update of their own. Overall, the explanations are thin on technical detail, which leaves others to fill in the holes with conjecture.
There are many millions of Skype users – the software is available for Windows, OS X and x86 Linux – for the no-cost PC-to-PC calling as well as the other services that Skype does charge for. Hopefully the free users are not depending on the service, but there are companies which use Skype exclusively; an outage for two weekdays must have been rather painful. Certainly the landline and cellular phone companies have had their problems along the way, but those tend to be regional rather than worldwide.
All software even minimally more complicated than "hello world" has bugs, and those bugs will be triggered in surprising ways. Taking the Skype "perfect storm" explanation at face value, it is rather surprising that millions of reboots could result in a network storm so severe that it would take two days to resolve. Somehow, in the interface between Skype's centralized authentication and their P2P routing code, things went horribly awry. It does, however, give one pause about the power of the near-monoculture in desktop operating systems.
It is hard, but not completely impossible, to imagine a similar scenario for Linux boxes. To start with, it is uncommon for a software upgrade to require a reboot. Within the Linux user community, there is a wide range of kernel versions running. Even if a critical security fix required "all" Linux kernels to be upgraded, the upgrades would not be very synchronized, because the distributions tend to have different response times. This is a bit of a double-edged sword, of course: those varying response times could leave a hole that a worm or attacker could exploit. But because Linux boxes are controlled by their owners rather than by their distribution provider, synchronized reboots are probably not a major cause for concern.
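The synchronization argument comes down to simple arithmetic: the same number of restarts is far easier on a network when spread over more days. The Python sketch below tallies restarts per day for groups of machines that each restart evenly over a few days. The client counts, release days and spreads are invented for illustration and are not measurements of any real vendor or distribution. The model also ignores the security cost of slower fixes mentioned above.

```python
from collections import Counter


def daily_restarts(groups):
    """Tally restarts per day for (clients, first_day, spread_days) groups.

    Each group is assumed to restart evenly over spread_days consecutive
    days beginning at first_day; any remainder from the integer division
    lands on the group's last day.
    """
    per_day = Counter()
    for clients, first_day, spread_days in groups:
        share, remainder = divmod(clients, spread_days)
        for offset in range(spread_days):
            per_day[first_day + offset] += share
        per_day[first_day + spread_days - 1] += remainder
    return per_day


# Hypothetical scenario 1: one vendor ships a patch that nearly every
# machine installs, with reboots spread over two days.
single_vendor = daily_restarts([(1_000_000, 0, 2)])

# Hypothetical scenario 2: the same machines split across four
# distributions that ship the fix on different days.
four_distros = daily_restarts([
    (250_000, 0, 2),
    (250_000, 2, 2),
    (250_000, 5, 2),
    (250_000, 9, 2),
])

print("peak, single vendor:", max(single_vendor.values()))  # 500000
print("peak, four distros: ", max(four_distros.values()))   # 125000
```

The total number of restarts is identical in both scenarios. Only the peak differs, and the peak is what a login service has to be sized for.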
Beyond monocultural issues, there is the question of how a P2P system can be taken down by the lack of a centralized resource, in this case credentials from an authentication server. That provides a single point of failure to what is supposed to be a robust architecture, resistant to exactly those kinds of problems. There are also those who wonder if the outage was caused by an "upgrade" mandated by the US government so that they can more easily monitor Skype calls.
Skype is proprietary and closed source; there is no easy way to determine whether the problem has been fixed, or even whether the problem is being accurately described. If Skype decides, or is forced, to change their software to be more easily monitored, it will be hard to detect. It might look an awful lot like a multi-day outage that clears up somewhat mysteriously. Trusting closed source software for vital communications is not the best of plans, at least when there are alternatives.
Free software would not necessarily avoid these kinds of problems. However, a completely decentralized network, with multiple clients that share a protocol and little else, would certainly be more resistant to this kind of outage. More importantly, it would also be more transparent. Over time, projects like openwengo, Linphone, Asterisk and others can hopefully provide those benefits to a larger audience.
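The following Python sketch illustrates the federated model that SIP-based projects rely on, as one example of such a decentralized design. Addresses have the form user@domain, each domain has its own registrar, and a call is routed by looking up the callee's domain. Real SIP locates a domain's server through DNS (RFC 3263). Here a plain dictionary stands in for that step. The `Registrar` class, the `resolve` function and the example domains and addresses are hypothetical simplifications, not any project's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Registrar:
    """Hypothetical stand-in for one provider's registrar."""
    up: bool = True
    bindings: dict = field(default_factory=dict)  # user -> current contact

    def register(self, user, contact):
        self.bindings[user] = contact


def resolve(address, registrars):
    """Look up 'user@domain'; return its contact, or None if unreachable.

    Real SIP finds a domain's server through DNS (RFC 3263); the
    registrars dict is a simplified substitute for that lookup.
    """
    user, _, domain = address.partition("@")
    registrar = registrars.get(domain)
    if registrar is None or not registrar.up:
        return None
    return registrar.bindings.get(user)


if __name__ == "__main__":
    registrars = {"example.org": Registrar(), "example.net": Registrar()}
    registrars["example.org"].register("alice", "192.0.2.10:5060")
    registrars["example.net"].register("bob", "198.51.100.7:5060")

    registrars["example.org"].up = False  # one provider has an outage
    print(resolve("alice@example.org", registrars))  # None
    print(resolve("bob@example.net", registrars))    # 198.51.100.7:5060
```

The point of the sketch is the failure boundary. When one registrar goes down, only that provider's users become unreachable, whereas a single central directory takes everyone down with it.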
Index entries for this article | |
---|---|
Security | Internet/Voice over IP (VoIP) |
The Skype outage
Posted Aug 22, 2007 17:09 UTC (Wed)
by drag (guest, #31333)
> Beyond monocultural issues, there is the question of how a P2P system can be taken down by the lack of a centralized resource, in this case credentials from an authentication server. That provides a single point of failure to what is supposed to be a robust architecture, resistant to exactly those kinds of problems. There are also those who wonder if the outage was caused by an "upgrade" mandated by the US government so that they can more easily monitor Skype calls.
Well, both reasons, the suspicious activity on the part of Skype and the single point of failure, are exactly why something like a P2P protocol that depends on a central resource is fundamentally flawed.
The way the Internet works, the way it's designed, is that it's nearly totally decentralized. If any part of the Internet is suddenly gone, then the rest has the ability to route around it.
This decentralization, this peer-to-peer nature of the Internet, is fundamental to why it's so robust.
As far as VoIP goes, Skype is the biggest, but there are other companies that people paid for business accounts that ended up being fly-by-night folks. One day everything works and everybody is happy; the next day the network is down and nobody is returning emails or phone calls.
For VoIP to work, like the web, it needs to be open. It needs to be open, completely decentralized, with standardized protocols with multiple implementations. Signing up with people like Verizon or Skype should be completely optional, and if people do not sign up, they should not lose any compatibility.
Plus, integration of voice and video messaging into instant messaging is natural in my eyes. IM makes an effective replacement for email, and it provides a way to leave messages and locate other people. This way people can use a combination of text, voice, and video to communicate in the most effective manner. It's a no-brainer.
I don't know how technically effective XMPP or SIP is. But it seems to me that having open source XMPP and SIP that can integrate into online forum software or other CMS types would be very effective. Something where people can just download it and get a basic setup running in one evening. Open source Windows, Linux, and OS X clients and servers. Users sign up to websites, get an account, and that account can be reached in several optional ways.
Probably something like that can help IRC and email go the way of Gopher.
The Skype outage
Posted Aug 23, 2007 6:19 UTC (Thu)
by allesfresser (guest, #216)
Um, I think it will be a while before email 'goes the way of Gopher'.
The Skype outage
Posted Aug 23, 2007 6:34 UTC (Thu)
by drag (guest, #31333)
Oh, it will take a long, long time.
But it will happen. All good things must come to an end sometime. It will probably remain as a sysadmin tool for a long time since it's simple, but it's been showing its age for... oh... about 10 years now. :)
The Skype outage
Posted Aug 22, 2007 17:33 UTC (Wed)
by zooko (guest, #2589)
See the discussion on the p2p-hackers list:
http://lists.zooko.com/mailman/listinfo/p2p-hackers
There are some observations of the behavior of the Skype network during the outage period, and the p2p hackers are pretty good at inferring what might have gone on.
The Skype outage
Posted Aug 23, 2007 2:44 UTC (Thu)
by nlucas (guest, #33793)
One of the recurring questions is why the problem only spread one or two days after the Tuesday update, and I didn't see anyone notice that not even Microsoft is so naive as to have the entire world doing updates at the same time.
Related to that, one Microsoft security problem I find extremely disturbing is that it is normal for security updates for small countries like Portugal to arrive, sometimes, a month later than the first updates, giving the bad guys lots of time to build zero-day exploits for the "little guys". The reason is probably the resources the quality-control team needs to certify the localized versions of the update, and there isn't enough economic incentive (not enough licenses sold, given the country's "cultural" piracy rate) to have an authorized local team.
So it's perfectly possible that one of the peaks of reboots just happened one or two days later, as the localized versions of the update became available.
The Skype outage: LWN FUD?
Posted Aug 23, 2007 3:49 UTC (Thu)
by bvdm (guest, #42755)
The increasing amount of FUD originating from within the FLOSS community is a concern to me. This article contains two dubious points:
1. The borderline insinuation that Skype may have introduced wiretapping and used this outage as cover. What is the basis for suspecting this? Does it make sense for a company to suffer a huge PR hit if they could just have phased it in over time? No, it does not. This is Fear and Doubt.
2. Harping on Windows reboots when it is completely beside the point. The trigger was that Skype was restarted on all those machines, not that the machines rebooted. Red Hat pushing out an update to an open-source P2P client could just as easily lead to a problem like this. This is Uncertainty and Doubt.
The further we go down the FUD path, the more we will lose our *technical* credibility, which is FLOSS's strongest selling point.
The Skype outage: LWN FUD?
Posted Aug 23, 2007 5:15 UTC (Thu)
by jamesh (guest, #1159)
For (2), it was a quote from Skype themselves saying that the reboots were triggered by a Windows Update being pushed out. I don't think you can count Skype as part of the FLOSS community.
The Skype outage: LWN FUD?
Posted Aug 23, 2007 5:50 UTC (Thu)
by bvdm (guest, #42755)
You have missed my point. The article based a different argument on the reboot issue.
The Skype outage: LWN FUD?
Posted Aug 23, 2007 6:57 UTC (Thu)
by man_ls (guest, #15091)
I think people in the "FLOSS community" (what a horrible name) have always spread FUD, even among themselves. It has been used extensively in KDE versus GNOME, Free Software vs. Open Source, against Sendmail... And just remember how some outside targets have been derailed: Microsoft, Apple, Sun. Deserved or not.
The only difference I gather might be that we are not a bunch of loony geeks anymore: people are actually listening now. And I agree that this new responsibility should not be taken lightly (to put it in a Stan-Lee-esque way).
The Skype outage: LWN FUD?
Posted Aug 23, 2007 9:24 UTC (Thu)
by pcampe (guest, #28223)
>1. The borderline insinuation that Skype may have introduced wiretapping and
>used this outage as cover. What is the basis for suspecting this?
Gimme the source code, and I'll stop saying that. In the meantime, I use Skype only for personal use; no business could be conducted _secretly_ over Skype unless otherwise proven.
The Skype outage: LWN FUD?
Posted Aug 23, 2007 9:41 UTC (Thu)
by ekj (guest, #1524)
So, what do you use instead?
I mean this as an honest question.
You don't trust Skype, fine. But do you then trust your cellphone manufacturer, the company running your cell network, and all the operators of other telephone networks that your call passes through instead?
The Skype outage: LWN FUD?
Posted Aug 23, 2007 11:56 UTC (Thu)
by tajyrink (subscriber, #2750)
SIP with ZRTP/SRTP works quite fine. Only three-way conferences are possible with the Twinkle client without server support, though. I think setting up an encrypted conference room on a SIP server is more complicated, but someone may know better.
Centralization
Posted Aug 23, 2007 5:13 UTC (Thu)
by eru (subscriber, #2753)
From what I gathered from the article, Skype seems to have a central system for authentication. I think the problem is right there. This does not scale. Compare this with cellular phone systems, where the responsibility for authenticating a handset is distributed among the various operators. Each will of course have somewhere a database of their own subscribers, but no global database of all cellular users is needed.
Centralization
Posted Aug 23, 2007 10:01 UTC (Thu)
by lacostej (guest, #2760)
> but no global database of all cellular users is needed.
Same for VoIP. Skype has their own database, OpenWengo has theirs, ekiga has theirs, etc...
A better comparison might be with DNS, which is also a directory. DNS is distributed and has several root servers. But information takes time to be updated, something that a phone network wouldn't want.
I am not sure how Skype's servers are networked; there are probably several of them and some redundancy. But as always, the peak load an infrastructure can support is often only some percentage above the maximum peak actually experienced on the network. If they need to support 1000 logins per second, maybe their network can today support 5000 or 10,000, which is usually sufficient, except when 90% of their clients restart within 48h...
They've made a mistake and they will probably learn from it.
But on the other hand, I'd rather have Skype disappear and get everyone on an open standard technology, be it SIP or anything else that stands the test of time.
Centralization
Posted Aug 29, 2007 22:25 UTC (Wed)
by job (guest, #670)
> Same for VoIP. Skype has their own database, OpenWengo has theirs, ekiga has theirs
That's just wrong. My email relies solely on my own server and not on somebody else's database. My VoIP as well. If you would call me using the standard protocol, SIP, no other server than my own would be involved (don't actually do this, it's probably night over here and I'll be cranky).
There is absolutely zero reason to use this proprietary crap, and to rely on a third party with whom you probably don't even have a valid service contract.
Centralization
Posted Aug 30, 2007 4:27 UTC (Thu)
by lacostej (guest, #2760)
> If you would call me using the standard protocol, SIP, no other server
> than my own would be involved
That's because you're running your own dynamic/static DNS, right? Or because you give your IP to someone before they call you?
The problem is still there for most people.
Your email address doesn't change when you move. The email is dropped into a mailbox, and you as a user go and fetch it from there.
For a phone call, the call has to reach you, so you have to register your current location first.
I don't see how one can change that without a big impact on the network, apart from generalizing dynamic DNS, maybe.
Am I missing something?
Centralization
Posted Aug 31, 2007 4:05 UTC (Fri)
by rqosa (subscriber, #24136)
> Am I missing something?
What you're missing is that a SIP client can potentially be used with any SIP registrar, and can make calls to users on other registrars. Thus, SIP is decentralized, like email, and unlike Skype.
Centralization
Posted Aug 31, 2007 5:27 UTC (Fri)
by lacostej (guest, #2760)
I still don't get your point. Thanks for being patient!!
I mean, I think I understand your point, but I feel that we're both right. Yes, you can use other registrars, but when your registrar is down, you lose part of your service.
SIP still needs you to log in to a registrar server: http://www.voip-info.org/wiki/view/SIP+registrar+server
It is interoperable, but within one service provider you still have to register. As I initially said, one must do it with ekiga and openwengo (two different SIP providers).
So now you tell me that if this service is dead, you can register somewhere else. Yes, but how usable is that? E.g. I have a Sipura SIP phone at home.
If the registrar service (telio) dies,
1- I have to use a browser to change the config and point to a different SIP server (e.g. to the ekiga registrar)
2- if I change SIP server, I will still be able to call other parties, but
* I may need to pay some new fees that my current SIP service already covers in my agreement
* furthermore, people won't be able to call me at my landline number (because my SIP provider gives me a landline-to-SIP bridge).
* people won't be able to call me at my usual SIP address
So yes, I can call, but I lose some features.
It looks to me to be the same as with Skype.
So I wonder again: am I missing something?
Centralization
Posted Aug 31, 2007 16:44 UTC (Fri)
by njs (subscriber, #40338)
The important thing about email and SIP isn't that you can change providers at the drop of a hat -- indeed, in both cases, you're generally at the mercy of your provider's technical acumen when it comes to scaling, uptime, etc.
The important thing about email and SIP is that there are lots and lots of providers, and they interoperate. So if your current provider sucks consistently, you can get fed up and switch to one that works better; because there are lots and lots of providers competing, there almost certainly *is* one that works better, and because they interoperate, you can switch even if your friends don't. (You can even run your own registrar and switch to that.) With Skype, everyone's pretty much stuck, and at the mercy of the one company. They might do pretty well as companies go (for now), but it's unlikely they'll do as well as any company that managed to climb to the top over heavy competition -- why should they bother, they're making money now.
(See also Jabber vs. proprietary IM protocols, and any article on vendor lock-in...)
Centralization
Posted Aug 31, 2007 17:21 UTC (Fri)
by lacostej (guest, #2760)
This I already know. You don't need to convince me of the benefits of using :) I don't use Skype much; I have a SIP-enabled phone at home. I just hope that an openmoko-like phone can fulfill my needs for an open phone.
But the initial discussion in this thread was:
> but no global database of all cellular users is needed.
to which I responded that the problem was the same today with SIP.
Can we agree on the two following sentences:
* the fact that Skype is a lock-in solution and doesn't interoperate with other VoIP solutions
* SIP also requires some sort of "global database" for each particular service provider, thus forcing you to log in to a registrar to make use of *your* SIP id, meaning that if your registrar is down, you cannot be reached (but can make calls by reconfiguring your phone to use a different service provider)
Centralization
Posted Aug 31, 2007 19:47 UTC (Fri)
by njs (subscriber, #40338)
Fair enough -- I admit I was picking the parts of the argument that made sense to me and responding to those, myself :-)
The initial post in this thread was not one that made any sense to me. Scalability is completely a red herring here; scalability is merely a technical concern. Do we have the hardware and knowledge to build a database that can scale to millions of subscribers? Yes, obviously, we do it right now. (AT&T might be only one of many cell phone providers, but I bet they're still routing more calls per second than Skype on its best day.) There are reasons that we have more than one cellular operator, but scalability is just not one of them.
Then the discussion went off in a different direction that I also don't understand. What is a global database for each particular service provider? I mean, if there are multiple service providers, your database is either global or particular to one of them; it can't be both at once... For Skype, there effectively is only one service provider, so it makes sense to talk about their database being global, but for SIP there are hundreds, and you can run your own if you want (just like you can run your own email server). Like you say, SIP is a little more complicated than email because a call request needs to be deliverable in real time, but this is just a technical detail.
I sort of get the impression that other posters were similarly confused about this part, and also guessed at what you were trying to say and then replied based on those guesses.
Centralization
Posted Sep 1, 2007 5:45 UTC (Sat)
by rqosa (subscriber, #24136)
My point was that, with SIP, it's unlikely that there would ever be an outage for all SIP users; the failure of one registrar doesn't cause an outage for users on others. I think that was what the original poster meant (except that post referred to cellular phone systems, rather than SIP).
Leave MS out of the game
Posted Aug 23, 2007 11:57 UTC (Thu)
by sstein (guest, #15028)
Folks, do me a favor. Leave Microsoft out of this. It was a bug in Skype's software and has nothing to do with the large number of Windows installations and their reboots. You might argue that this wouldn't have happened if Skype was open source, but blaming Microsoft for it is just stupid.
Regards,
Sebastian
Leave MS out of the game
Posted Aug 24, 2007 7:12 UTC (Fri)
by intgr (subscriber, #39733)
Windows Updates causing reboots is the official theory from Skype; it's not made up by LWN.net. I thought the article was quite straightforward in conveying that.
Leave MS out of the game
Posted Aug 24, 2007 8:32 UTC (Fri)
by sstein (guest, #15028)
Skype officially said it was not Microsoft's fault:
http://heartbeat.skype.com/2007/08/the_microsoft_connecti...
Drawing conclusions from the Skype outage like it wouldn't have happened if the world was not running Windows is just ridiculous and a very poor argument.
Sorry, I'm a long-term LWN reader, but the conclusions drawn in this article are just crap.
Anyway, if you are convinced that Microsoft is responsible for everything bad happening in the world, I will not try to change your mind.
Sebastian
Leave MS out of the game
Posted Aug 24, 2007 14:00 UTC (Fri)
by jake (editor, #205)
> Drawing conclusions from the Skype outage like it wouldn't have happened if the world was not running Windows is just ridiculous and a very poor argument.
and, unless i am missing something, one not made by the article. i just reread it because that certainly was not my intent and i just don't see where the article says that, or anything like it. Skype went down because of a bug in their software, at least according to them, that was *triggered* by the reboots.
i have my, relatively unfavorable, opinions of Microsoft, but i try hard *not* to attribute to them things that aren't their fault.
the only word that might be seen as an indictment of Microsoft would seem to be "monocultural" ... even that isn't their fault, though they have worked hard to make it true.
Leave MS out of the game
Posted Aug 24, 2007 14:33 UTC (Fri)
by sstein (guest, #15028)
> the only word that might be seen as an indictment of Microsoft would seem to be "monocultural" ... even that isn't their fault, though they have worked hard to make it true.
Well, you implicitly refer to MS in the following sentence:
> It does, however, give one pause about the power of the near-monoculture in desktop operating systems.
The paragraph after this sentence explains why this couldn't happen to Linux, so you contrast Linux with the near-monoculture. The problem is not the monoculture but that MS uses a fixed, regular date for updates. So if Ubuntu also decided to introduce a regular update cycle, it could cause the same problem. It is not a problem related to MS.
Sebastian
Leave MS out of the game
Posted Aug 30, 2007 12:22 UTC (Thu)
by Cato (guest, #7643)
To take your scenario of 'Ubuntu introducing a regular update cycle', here's why it would not have the same effect:
1. The vast majority of Ubuntu updates don't require a system reboot (only kernel/driver updates do).
2. In the absence of reboots, and assuming an always-on Skype client, only a Skype client update could cause a mass of Skype nodes to disappear and reconnect.
So even if Ubuntu or some other distro became a monoculture as popular as Windows and adopted regular update cycles (rather than its current day-by-day updates), it wouldn't have the same effect on the population of connected Skype clients.
It's not fair to blame Microsoft for this outage, but it is reasonable to point out that the monthly reboot cycle due to Windows Update is a factor in stepping on this Skype bug, and that even a monoculture of a single Linux distro would not have had the same impact.
Leave MS out of the game
Posted Sep 6, 2007 16:39 UTC (Thu)
by sstein (guest, #15028)
Again, the outage was forced by restarting the Skype client, not by rebooting Windows. If all Skype users were using Linux, the same would have happened. It is not related to Windows at all.
Sebastian
The Skype outage
Posted Aug 24, 2007 17:04 UTC (Fri)
by dps (guest, #5725)
Hmmm... well, I for one know that some outages can be caused by critical staff who do things that cannot be delegated to other people. The system administrator in places where there is only one, which probably means everywhere except vast outfits, is a prime example. Where that person is me, only I know the root passwords (and how to do things once you have a root shell).
The Skype people might have imagined that rather few people understand that sort of problem and looked for another reason.
Skype is irrelevant
Posted Aug 29, 2007 22:32 UTC (Wed)
by job (guest, #670)
Are there actually Linux users out there using this?
It's not only proprietary software; it uses lots of anti-debugging techniques to hide what it's doing. It also uses a proprietary protocol. And the whole purpose of this software is to bypass your firewall as efficiently as possible.
I cannot imagine a worse product. Whoever is responsible for network security and lets this crap on their network doesn't deserve their job. I could not imagine a better botnet base. It may be true that that has not happened (who knows?), but why wait until the damage is done?
SIP won the fight over H.323 as the preferred standard protocol for VoIP almost ten years ago. Get with the program already.