Reliability: Unix and Linux beat Windows (heise online)

Heise online looks at a Yankee Group report on server reliability. "Yankee Group market researchers have presented the results of a study on server and operation system reliability for 2007. Compared to the previous year, annual downtime on all systems has dropped sharply – except on Windows server 2000 and 2003. With between one and two hours of downtime per year, enterprise versions of Linux all performed at a similar level; at some five hours downtime, Debian was the poorest performer. Windows Server 2000 and 2003 had downtimes of ten and nine hours per year, respectively (99.9 per cent accessibility)."
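
As a quick sanity check on the "99.9 per cent" figure, the sketch below converts annual downtime into an availability percentage. It assumes a 365-day year of 8,760 hours. The downtime values are the approximate ones quoted above, the report's exact inputs are not known, and the function name is only illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760; assumes a non-leap year


def availability_percent(downtime_hours_per_year: float) -> float:
    """Share of the year a server was reachable, as a percentage."""
    return 100.0 * (1.0 - downtime_hours_per_year / HOURS_PER_YEAR)


# Approximate annual downtime figures taken from the summary above.
quoted = [
    ("enterprise Linux, low end", 1),
    ("enterprise Linux, high end", 2),
    ("Debian", 5),
    ("Windows Server 2003", 9),
    ("Windows Server 2000", 10),
]
for label, hours in quoted:
    print(f"{label}: {availability_percent(hours):.3f}%")
```

By this arithmetic, nine to ten hours a year is indeed about 99.9 per cent, and even the five hours reported for Debian stays above 99.9 per cent.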

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 17, 2008 20:33 UTC (Thu) by alvieboy (guest, #51617) [Link]

# cat /etc/issue && uptime
Debian GNU/Linux testing/unstable \n \l

 21:28:19 up 336 days (...cut...)

Last reboot was due to a power supply failure.

I can't understand why Debian comes in last in reliability.

. Álvaro

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 17, 2008 21:16 UTC (Thu) by tv (subscriber, #32991) [Link]

Because people using Debian don't care enough about redundant power supply?

Seriously, it seems silly to analyse downtime by OS without addressing the differences in the
hardware configuration.

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 17, 2008 23:28 UTC (Thu) by ajross (subscriber, #4563) [Link]

And the whole notion of "downtime" as a metric is fundamentally flawed anyway.  The
overwhelming majority of "downtime" I've witnessed happens due to configuration goofs on the
part of the IT staff, followed by software failures in a distant second place and hardware
failures well behind that.  Stuff just doesn't break much in the modern world.

Notice who wrote this, though.

Posted Apr 18, 2008 13:07 UTC (Fri) by dmarti (subscriber, #11625) [Link]

"Debian administration department, Joe speaking."

"Hi, this is Laura DiDio.  You may remember me from such media brouhahas as the SCO case.  I
have some questions about your company's use of Linux."

*click*

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 18, 2008 0:20 UTC (Fri) by dlang (subscriber, #313) [Link]

over the last few months I've had a rash of hiccups (failover then fail back) on boxes that
appear to be related to 447 days of uptime. I had about a hundred systems hit this.

you know, I really should upgrade more frequently ;-)

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 18, 2008 2:47 UTC (Fri) by yarikoptic (subscriber, #36795) [Link]

what 447 days of uptime issue? I couldn't google it up.
my file server (quite a busy one) celebrates a year of uptime today, so I started to worry
that I will have to reboot it soon, and since it is running Debian, I am worrying it will be
such a long downtime period (probably not long enough for me to run to grab a cup of
coffee)!!! ;-)

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 18, 2008 7:02 UTC (Fri) by dlang (subscriber, #313) [Link]

I don't know the specific bug, but it was pretty consistent.

running 2.6.9 and an older heartbeat (1.2.x), right around 447 days of uptime heartbeat would
report a large delay in receiving a packet, long enough that it would declare the other system
dead (taking over), and then the flow would start again and the systems would realize they were
both active for a few seconds. in my case I don't have shared drives, so the only harm was the
failover/failback flop (~15 seconds of outage)

everything seemed to continue to work after that.

I figured that this was close enough to the 497-day mark when 32-bit counters wrap that I wrote
it off to some interaction with this and moved on.
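
For reference, the 497-day figure can be reproduced with a little arithmetic. This sketch assumes an unsigned 32-bit tick counter incremented 100 times per second; that tick rate is an assumption, not something stated in the post, and nothing here explains the 447-day hiccup itself.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_wrap(ticks_per_second: int, bits: int = 32) -> float:
    """Days until an unsigned counter of the given width overflows."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


# Assumed tick rates; 100 Hz gives the familiar "497 days".
for hz in (100, 1000):
    print(f"{hz} ticks/s: wraps after {days_until_wrap(hz):.1f} days")
```

A counter ticking ten times faster wraps ten times sooner, so the wrap point depends on the tick rate the kernel in question was built with.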

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 18, 2008 3:45 UTC (Fri) by einstein (subscriber, #2052) [Link]

Unix variants in general have extremely low downtime requirements compared to pc operating
systems. Same goes for linux in my experience. Linux servers with hundreds of days uptime are
the norm here -

plemlp01: /home/jjs
(tty/dev/pts/0): bash: 113 > ruptime -t
plemlp05      up 611+06:38,     0 users,  load 0.02, 0.02, 0.00
plemlp04      up 611+06:31,     3 users,  load 0.08, 0.05, 0.01
plemlp03      up 611+06:16,     1 user,   load 0.00, 0.00, 0.00
plifsp05      up 607+00:23,     0 users,  load 0.02, 0.05, 0.05
plemlp01      up 607+00:04,     0 users,  load 0.11, 0.08, 0.08
plemlp02      up 607+00:00,     0 users,  load 0.52, 0.39, 0.35
plemlp01: /home/jjs
(tty/dev/pts/0): bash: 114 > cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (i586)
VERSION = 10

But this is nothing new - I had a couple of busy production mail/dns servers back near the
turn of the millennium which were up for over 2 years - but that was on the 2.2.17 kernel, and
uptime would wrap around at 497 days. No problems whatsoever in the performance, they kept
running for hundreds more days til we brought them down for hardware upgrades.

At any rate I'm glad the mainstream "IT pundits" are catching on.

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 19, 2008 23:29 UTC (Sat) by kruemelmo (subscriber, #8279) [Link]

Any uptime above a couple of weeks is proof of poor system administration: either no
updates are done, or rebooting is not tested after changes.

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 24, 2008 3:19 UTC (Thu) by lysse (subscriber, #3190) [Link]

> Any uptime above a couple of weeks is proof of poor system administration

Even I can think of 3 or 4 cases for which this is trivially, obviously not true. I'm sure you
can too.

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 25, 2008 10:36 UTC (Fri) by jschrod (subscriber, #1646) [Link]

So you tell us that (1) you didn't apply kernel security updates and (2) you didn't test (at
appropriate maintenance windows) that your system changes (i.e., security updates, config
changes) will not endanger reboot functionality, thus severely increasing the risk associated
with these systems. And you're even proud of it. Gosh, I don't even know where to start.

Excuse me if I'm too frank for you, but: if I found you at one of my customer's sites as
a sysadmin, the first thing that I would try to do is to get your sysadmin
privileges revoked, as you're obviously a security and risk hazard. (With the unprofessional
mindset shown, you probably wouldn't work at my company in the first place.)

Any properly run data center will reboot its servers at regular, scheduled intervals. This
doesn't mean that user service is interrupted; that's what HA is for. (I design data centers
and the processes around them for a living. And I have seen enough sh*t hitting the fan and have
been called to a customer's site in the middle of the night too often to know why such reboot
processes are necessary, for *all* operating systems.)

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 17, 2008 21:23 UTC (Thu) by dmarti (subscriber, #11625) [Link]

Hey, wait a minute. I thought Laura DiDio wasn't really qualified to understand all this server operating system stuff.

Silly metric

Posted Apr 18, 2008 5:29 UTC (Fri) by wilreichert (subscriber, #17680) [Link]

Individual server downtime is completely irrelevant when you have properly architected
redundant systems.

Silly metric

Posted Apr 18, 2008 22:53 UTC (Fri) by man_ls (subscriber, #15091) [Link]

It is still a good metric to find out how to properly engineer your redundant system, isn't it?
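
To make that point concrete, here is a minimal sketch of how per-server downtime feeds into the design of a redundant setup. It assumes that servers fail independently of each other and that the service is down only when every server is down at once. Both are simplifying assumptions: shared power, network or configuration mistakes break independence in practice. The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def redundant_downtime_hours(per_server_downtime_hours: float, servers: int) -> float:
    """Expected hours per year during which ALL servers are down together.

    Assumes independent failures, which real installations only approximate.
    """
    unavailability = per_server_downtime_hours / HOURS_PER_YEAR
    return (unavailability ** servers) * HOURS_PER_YEAR


# Per-server downtime values roughly matching those quoted in the article.
for hours in (1, 5, 9):
    pair = redundant_downtime_hours(hours, servers=2)
    print(f"{hours} h/year per server -> {pair * 3600:.1f} s/year for a pair")
```

Under these assumptions a pair's joint downtime grows with the square of the per-server figure, so the individual number still determines how many replicas a given availability target needs.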

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 18, 2008 6:07 UTC (Fri) by kripkenstein (subscriber, #43281) [Link]

I wasn't surprised to see Linux outperforming Windows, but what *did* surprise me was Ubuntu
besting Debian.

Since Ubuntu is based on Debian unstable, and much more fast-moving than Debian in general, I
always assumed Ubuntu would be less reliable. But in this survey it had ~1 hour downtime vs.
~5 for Debian.

Can anyone explain this? Random thoughts of mine include that Ubuntu has commercial support
(so people end up using it more correctly perhaps), or maybe that Ubuntu issues patches after
more testing. But I have no idea, this is surprising to me.

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 18, 2008 6:34 UTC (Fri) by jordip (subscriber, #47356) [Link]

The answer is probably the people using those systems.
From what I read around the Internet and hear from my own friends, Debian administrators tend
to be more hackish than administrators of Ubuntu or other distros.
So they tend to have more resources for solving problems, but on the other hand they may prefer
their own methods over what the distro has (or they implement what the distro doesn't have).
As everybody is wrong sometimes, they will make the system fail more often than people who
just use what the distro provides, which at least has been tested by someone other than yourself.

Don't get me wrong: even if those systems have a couple more bugs to address, some
of them are pure art in what they achieve.

Debian reliability on the report

Posted Apr 18, 2008 22:50 UTC (Fri) by man_ls (subscriber, #15091) [Link]

The answer lies more in the atrocious reporting methodology. Centering just on one sentence, the relevant slide (a ridiculous pdf with 2 pages) says literally "OpenSource Linux (e.g. Debian)". Where:
  • "OpenSource" does not exist: according to its proponents it is "open source".
  • Most Linii (or Linuxes, if you prefer) on the page are "open source": Red Hat, SUSE (not "SuSe" as it appears, which is not even "SuSE" as in the original German), Mandriva, Ubuntu, etc. No reason to single out Debian.
  • "Open source Linux", even if referring to "Community distro", should include everything from Fedora to Debian to Gentoo to home-grown distros.
  • Even if limited to distros maintained by the community, there are a lot of them including Debian, Gentoo, Puppy Linux...
  • Even if limited to Debian, the report makes clear that "24% of the respondents reporting they had at least one Debian server in their network", which does not look like actual production servers. Rather it might be experimental or unmaintained machines.
  • Debian administrators are not "hackish". Depending on how you define downtime, something which is of course absent from the report or the slides, Debian servers do not suffer downtime just because of bugs; once they are in production they stay there for years. What is true is that Debian administrators may work in more precarious conditions than others: I used to run several experimental servers and my only downtimes were related to grid failures, something which a proper setup would have avoided.
  • And as pointed out above, Laura DiDio is not likely to elicit honest responses from a variety of Debian administrators.
There are many other problems with the report. Just see how "Other Linux" varies wildly from one year to the next, or how last year's "Unix (AIX, Solaris, HPUX)" with 6.54 hours of downtime has magically transmogrified into three categories with less than 2 hours each. I seriously doubt proprietary Unix has become 4 times as reliable in this time. And "Customizations" have a mixed effect on reliability, even worse when combined with last year's data.

It is therefore hard to extract meaningful conclusions from the figures reported. If I had to, I'd say that people are still reluctant to run mission-critical services on Debian, and that is why Debian servers are more likely to suffer downtime than the Red Hat counterparts. In this case it is no reflection on the quality of the OS, but on how much the machines are cared for. And that is maybe why proprietary Unices have grown in reliability: only legacy services are still running on these servers, and they are the most reliable since they do little work and are never modified. More demanding applications have migrated to Linux in many cases.

Reliability: Unix and Linux beat Windows (heise online)

Posted May 3, 2008 18:24 UTC (Sat) by anton (guest, #25547) [Link]

> what *did* surprise me was Ubuntu besting Debian.

My explanation is: A company that pays for round-the-clock staff usually will want a commercial enterprise distro installed, not a community project like Debian. So the larger downtimes of Debian systems compared to others are not due to a higher failure rate, but due to a larger average downtime when a failure occurs; e.g., if the system crashes on Friday night, it may have between several hours and 2.5 days of downtime ahead. Yes, a hardware watchdog would help availability, but for some servers availability is not everything.

> Since Ubuntu is based on Debian unstable, and much more fast-moving than Debian in general, I always assumed Ubuntu would be less reliable.

Ubuntu does its own testing, so the difference may not be so big.
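
The distinction between how often a system fails and how long it stays down each time can be written as simple arithmetic. The numbers below are purely illustrative and are not taken from the report: one failure per year in both cases, restored within an hour where staff are on call, and after 60 hours (the 2.5 days mentioned above) where nobody notices until Monday.

```python
def annual_downtime_hours(failures_per_year: float, hours_to_restore: float) -> float:
    """Annual downtime as failure rate times average time to restore."""
    return failures_per_year * hours_to_restore


# Hypothetical inputs: identical failure rate, different response times.
staffed = annual_downtime_hours(failures_per_year=1.0, hours_to_restore=1.0)
unstaffed = annual_downtime_hours(failures_per_year=1.0, hours_to_restore=60.0)

print(f"round-the-clock staff: {staffed} h/year")
print(f"weekend unattended:    {unstaffed} h/year")
```

With the same failure rate, the reported downtime differs by a factor of 60 in this made-up case, which is why a downtime figure alone says little about the operating system.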

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 18, 2008 7:14 UTC (Fri) by BackSeat (subscriber, #1886) [Link]

Why should Windows be less reliable than Linux? Don't get me wrong, I welcome the news! But really, is the difference between 9 or 10 hours of downtime a year and 1-5 hours down to the operating system?

I harbour a (very generalised) belief that Windows sysadmins are less technical than Linux sysadmins. For example, in discussing a problem whereby mail wasn't getting to a server, I asked what happened when telnet'ing to port 25 of said server. The reply was, "Mail doesn't use telnet". Many Windows sysadmins are able to find their way around a myriad of point 'n' click dialog boxes to configure what is sometimes complex technology, but when that fails to work they have problems, which are not made any easier by the difficulty of grep'ing a series of ASCII log files. I think the combination of a reduced understanding of the underlying technology coupled with a lack of tools for looking at what is really happening mean that it takes longer to resolve Windows problems.

And no, this isn't a troll - I'm well aware that there are some very technically competent Windows sysadmins just as there are incompetent Linux ones; however, overall my experience is as above.

BS

How they're different (part of the answer)

Posted Apr 18, 2008 12:08 UTC (Fri) by tialaramex (subscriber, #21167) [Link]

Microsoft has historically been concerned about corner cases with changes to core system
software or its configuration†, so they nearly always trigger a reboot on updates or major
config changes.

Unix (and thus Linux) has historically assumed that the operator knows what they're doing, and
will choose to reboot if it's appropriate.

This cuts both ways: you can get mysterious problems which resolve when you reboot your Unix
server, only for you to realise days, weeks or even years later that the cause was a daemon
still using configuration from a file that had been updated and just not re-read.

Meanwhile your Windows administrator cousin finds that despite her best efforts she's losing
an hour a month to reboots for changes that would most likely have caused zero downtime in
Linux.

† This is a self-fulfilling prophecy of course. Having decided that it's OK to handle such
changes only by rebooting, more and more of the system software comes to rely on this
behavior, and so when Microsoft did try to respond to feedback about "too many reboots" it
struggled to do anything about them. In particular the inability to replace a file while it is
open, combined with the requirement for many core services to hold open files they are using,
results in Windows needing an entire boot-time subsystem dedicated to replacing such files
while there's still a chance, and of course anything which uses this subsystem requires a
reboot...
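
The Unix side of that contrast can be seen in a few lines. This is a minimal sketch assuming a POSIX system: one handle keeps a file open, standing in for a long-running service, while a new version is renamed over it. The file name `libexample.so` is made up for illustration. On Windows the replace step would generally be expected to fail while the file is open, which is the limitation described above.

```python
import os
import tempfile


def replace_while_open(directory: str):
    """Replace a file that is still open and report what each side sees."""
    path = os.path.join(directory, "libexample.so")  # hypothetical file name
    with open(path, "w") as f:
        f.write("version 1")

    # Stands in for a running service that keeps the old file open.
    reader = open(path)

    # Write the new version next to it, then rename it over the old name.
    new_path = path + ".new"
    with open(new_path, "w") as f:
        f.write("version 2")
    os.replace(new_path, path)

    with reader:
        old_view = reader.read()  # the open handle still refers to the old file
    with open(path) as f:
        new_view = f.read()       # a fresh open sees the replacement
    return old_view, new_view


with tempfile.TemporaryDirectory() as d:
    print(replace_while_open(d))
```

The running process keeps working with the old contents until it is restarted, so the update needs no reboot, but it also has no effect on that process until the restart happens.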

How they're different (part of the answer)

Posted Apr 20, 2008 7:37 UTC (Sun) by Cato (subscriber, #7643) [Link]

One weak point in Ubuntu and probably other distros is that they don't seem to automatically
either restart services (with a warning) or tell the desktop user that 'XYZ app is open - be
sure to restart it soon to pick up the latest update'.  Ideally such warnings would provide a
URL to the distro security advisory so the user can decide whether to restart urgently or at
end of day/week etc.

I'm curious to know if any Unix/Linux systems do this.
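
On Linux, one way such a check could work is to look for processes that still map a shared library whose file has since been deleted or replaced. The sketch below assumes the `/proc/<pid>/maps` format, in which the kernel appends " (deleted)" to the path of a mapped file that no longer exists under that name. The function names and the ".so" filter are illustrative choices, and reading other users' processes usually requires root.

```python
import os

DELETED_SUFFIX = " (deleted)"


def deleted_libraries(maps_text: str) -> list:
    """Return shared-library paths that are still mapped but gone from disk."""
    found = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, device, inode, optional path.
        fields = line.strip().split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a path
        path = fields[5]
        if path.endswith(DELETED_SUFFIX) and ".so" in path:
            found.add(path[: -len(DELETED_SUFFIX)])
    return sorted(found)


def processes_needing_restart(proc_root: str = "/proc") -> dict:
    """Map each PID to the stale libraries it still uses (best effort)."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "maps"), errors="replace") as f:
                stale = deleted_libraries(f.read())
        except OSError:
            continue  # process exited, or we lack permission to read it
        if stale:
            result[int(entry)] = stale
    return result


if os.path.isdir("/proc"):
    for pid, libs in sorted(processes_needing_restart().items()):
        print(pid, ", ".join(libs))
```

A desktop notifier or a package hook could run a check like this after an upgrade and suggest which applications or services to restart. Linking the result to the matching security advisory would need package metadata that this sketch does not touch.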

Reliability: Unix and Linux beat Windows (heise online)

Posted Apr 20, 2008 7:35 UTC (Sun) by Cato (subscriber, #7643) [Link]

Windows reliability is getting much better - in the Win95 days I wrote an uptime script
showing my Win95 laptop had an average uptime of 11 hours, including times when I was asleep!
My WinXP laptop is now much less likely to lock up (mis-rendering fonts and becoming unusable)
now that it has 2 GB RAM - it had 1 GB, but Windows Process Explorer / Task Manager consistently
misreports that it has plenty of RAM when in fact it is struggling (some RAM usage must be
unreported)...  So having more than adequate hardware is important to Windows - by contrast,
home Linux boxes with 96, 192 or 512 MB RAM run Linux with GUI apps very reliably and never
crash, and very rarely lock up.

On server admin skills, you're right about many Windows admins, I think, in that they never
need to learn so much, but in larger organisations the best admins have to master registry
hacks, patch update management, and many other very technical areas, in order to keep a large
number of Windows systems working.  

Generally, I think it's a lot more effort to keep a desktop or server box running with Windows
than with Linux (what with antivirus, antispyware, anti-rootkit, defragging, and Windows updates,
which sometimes take 100% CPU) - however, it can be more effort to get Linux working with
recent video cards and WiFi adapters.

Large IT depts have a big challenge in testing the various Windows patches before rolling them
out to all desktops/servers, because these patches often break things in unpredictable ways,
particularly for mission-critical apps - hence there's a multi-week "patch lag" between patch
release by MS and rollout by IT, leaving systems more vulnerable.  This means the IT dept has
to invest in more centralised security - there are systems that sit on the network and
dynamically block exploit attempts specific to MS patches that are released but not yet
applied, for example.  I can't see this patch-lag issue happening with Linux - the updates I
get from Ubuntu simply work (with the odd exception such as the broken xorg update a while
back) and I'm sure Debian stable is far better at this.

The end result is that my Windows laptop (managed by the IT dept mostly) is quite reliable,
but due to patch-lag is quite vulnerable for many weeks to vulnerabilities in Flash, MS
Office, etc that are remotely exploitable and therefore rated 'highly critical' by Secunia
(see http://secunia.com - they have a great free vulnerability scanning tool if you have to use
Windows).

What is uptime really?

Posted Apr 20, 2008 12:08 UTC (Sun) by Milan (guest, #26716) [Link]

An uptime of 300+ days tells us only that the machine has been left unmaintained and vulnerable,
even if it is used for "nothing", because these days a computer connected to a network is not
safe even when it is connected only to a private LAN.
Also, statistics without an explained methodology are worth nothing, because we are not able to
verify the numbers or the methodology.
Downtime in general is tightly connected to TCO (Total Cost of Ownership) because it shows
how much time an administrator must sit and work to get the machine up again.
The reason why MS Windows machines have longer downtime is tightly connected to the inability to
replace files while they are open (as Unix can) and to bring a service back up by restarting
only that one service (and not the whole computer). Also, the problem of multiple library (DLL)
versions is not solved at the system level in MS Windows.
The longer downtime for Debian compared to other distributions may be related to the fact that
Debian is late with security updates (as was presented here a year ago or so) and does not offer
the latest security technologies, and thus the admin must solve the problem by reinstalling or by
cleaning up the machine (yes, Debian is more "stable"... and older).
Also, do not mix up uptime and failover (clusters): uptime means the uptime of one machine, not
the uninterrupted reachability of a service (backed by a cluster, redundancy, round-robin DNS or
something similar).

What is uptime really?

Posted Apr 20, 2008 17:54 UTC (Sun) by ibukanov (subscriber, #3942) [Link]

> An uptime of 300+ days tells us only that the machine has been left unmaintained and vulnerable...

Such uptime tells only that the kernel was not updated. It does not mean that the rest of the
system was not properly maintained. If a vulnerability in the kernel can not be exploited over
a network connection, a decision to skip updating the kernel is very reasonable if the system
is a file or web server with no local user accounts.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.