
Building a Self-Healing Network (O'ReillyNet)

Greg Retkowski writes about self-healing networks on O'Reilly. "Wouldn't it be nice if your network services could detect their own failures and gracefully restart? Sure, you could have cron or FAM jobs always checking them, but that's so unrefined. Instead, consider Greg Retkowski's solution: building a small Cfengine and NAGIOS combination to detect and recover from failure."

Building a Self-Healing Network (O'ReillyNet)

Posted May 26, 2006 20:54 UTC (Fri) by mtaht (subscriber, #11087) [Link]

Jeeze, and to think I knew greg way back in the day...

Unrefined?

Posted May 26, 2006 22:30 UTC (Fri) by AnswerGuy (subscriber, #1256) [Link]

So Nagios+cfengine is somehow "more refined" than cron jobs and shell scripts?

Why?

Also, simply detecting a service failure and restarting it is often plastering over the real problem ... a workaround rather than a solution. Someone still needs to do root cause analysis and remediation of most issues.

This is just like the use of panic= options for the kernel ... and watchdog timers. Sure, they can force the reboot and hopefully recover service availability from some failure conditions. In other cases they'll simply loop ... endlessly rebooting as the same error condition persists and prevents the service from ever fully coming back up.
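One way to avoid the endless-restart loop described above is to cap the number of automatic restarts and then hand the problem to a human. The following is a minimal sketch under that assumption; `supervise`, `is_healthy`, and `restart` are hypothetical names for illustration, not part of Nagios, cfengine, or any watchdog tool.

```python
import time
from typing import Callable


def supervise(is_healthy: Callable[[], bool],
              restart: Callable[[], None],
              max_restarts: int = 3,
              base_delay: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> str:
    """Restart a failed service a bounded number of times, then escalate.

    Returns "healthy" if the service is (or becomes) healthy, or
    "escalate" once max_restarts attempts have failed, so that someone
    can do root cause analysis instead of the loop running forever.
    """
    if is_healthy():
        return "healthy"
    for attempt in range(max_restarts):
        restart()
        # Exponential backoff: wait 1x, 2x, 4x ... the base delay
        # before re-checking, to give the service time to come up.
        sleep(base_delay * (2 ** attempt))
        if is_healthy():
            return "healthy"
    return "escalate"
```

The caller supplies the health probe and the restart action, so the same loop could wrap an init script or a monitoring check. The point of the "escalate" result is that a persistent fault ends up as an alert rather than as a silent restart cycle.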

We have a long way to go before we can claim "self-healing."

I would also say that integration of fault-tolerance features into the available Linux distributions and configuration management is sorely lagging. It's lagging on all levels ... from simple disk mirroring and NIC teaming up through fail-over clustering.

We've developed a large number of the low-level tools (the md drivers and mdadm command, the S.M.A.R.T. monitoring tools, heartbeat daemons, etc).

However, our OS installation and management tools are not integrated with these lower level features.

During installation why doesn't your OS detect that you have a pair of identical hard drives and offer to mirror them ... automatically installing the necessary mdadm monitoring daemon? Why don't they detect larger sets of (almost) identical drives and present RAID options based on "best practices" heuristics? Why don't they automatically configure LVM and snapshots? Why aren't OS upgrades and package installations integrated with (optional) snapshot creations (to allow an admin to quickly roll a system back to the pre-upgrade/installation state)? Why aren't the S.M.A.R.T., lmsensors, and IPMI monitoring tools automatically configured?

How do you configure Nagios or Ganglia or any such tools to warn you that one of your RAID arrays is running in degraded mode? How do you configure them to reflect when one of the hot spares is fully synchronized back into the array? How about warning when a primary has failed over to one or more of the backup servers? Why don't our imaging tools and monitoring systems talk to one another so that a new server registers itself and each of its services (and relevant configuration facets --- such as cluster membership and RAID/LVM arrangements) and the monitoring services just work?

Likewise for backup servers and intrusion detection systems? Why do I have to separately configure each of those for each new server (and generally one at a time)? Why do I have to be an expert on each service and package in order to get the monitoring system to monitor the correct aspects of the system?

Recently I noticed that the Debian AIDE package has adopted a sort of modular configuration management system. There's an /etc/aide/aide.conf.d/ directory and each Debian package can drop its own intrusion detection settings thereto. So each package can specify: here are my logs, which should change in these ways ... here are my invariant files ... here are configuration and data files ... lock files, etc.

It's a nice step. However, it needs to be taken to the next level (for something like Samhain + Beltane/Yule ... or the Prelude IDS) and similar approaches need to be implemented for backups and for logging/alert management. (Imagine a log file monitoring tool which had an "alerts.d/" directory ... so each package could register a list of known syslog messages that their components might generate along with some hinting about how that alert should be treated, and perhaps some sort of "mrtg/rrd" and "nagios/ganglia/OpenNMS" directories where commands and thresholds could be set: if you see an ethernet interface with X utilization average for Y time interval or Z collisions/errors in that time, then note it as a capacity issue to all monitoring services.)
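To make the "alerts.d/" idea concrete, here is a rough sketch of how a log monitor might load such drop-in files. Everything about the format is an assumption made for illustration: one file per package, one rule per line, written as a severity word followed by a regular expression. No existing tool is claimed to use this layout.

```python
import re
from pathlib import Path
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    package: str         # drop-in file name, assumed to be the package name
    severity: str        # hint such as "ignore", "warning", "critical"
    pattern: re.Pattern  # compiled regex matched against syslog messages


def load_rules(alerts_dir: Path) -> List[Rule]:
    """Read every drop-in file in the (hypothetical) alerts.d directory."""
    rules = []
    for path in sorted(alerts_dir.iterdir()):
        if not path.is_file():
            continue
        for raw in path.read_text().splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # skip malformed lines with no regex
            severity, regex = parts
            rules.append(Rule(path.name, severity, re.compile(regex)))
    return rules


def classify(message: str, rules: List[Rule]) -> Optional[Rule]:
    """Return the first rule whose pattern matches, or None if unknown."""
    for rule in rules:
        if rule.pattern.search(message):
            return rule
    return None
```

Messages that match no rule come back as None, which a monitor could treat as "unknown, show to a human". Carrying a severity per rule is what allows hints beyond a plain harmless/problematic split.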

How about a means by which each package could register its data and configuration directories/files with a backup system? A couple of the core use models for a backup/recovery system are to restore only the configuration or only the data.
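As a sketch of what such registration could look like, suppose each package declared its paths under a "config" or "data" key. The manifest structure, the `restore_set` helper, and the example entries below are all hypothetical and only illustrate selecting a restore by kind.

```python
from typing import Dict, Iterable, List

# Hypothetical per-package manifests; the paths are illustrative examples
# of what a package might register, not output from any real tool.
MANIFESTS: Dict[str, Dict[str, List[str]]] = {
    "postfix": {"config": ["/etc/postfix"], "data": ["/var/spool/postfix"]},
    "mysql-server": {"config": ["/etc/mysql"], "data": ["/var/lib/mysql"]},
}


def restore_set(manifests: Dict[str, Dict[str, List[str]]],
                kinds: Iterable[str]) -> List[str]:
    """Collect the registered paths of the requested kinds, sorted and de-duplicated."""
    wanted = list(kinds)
    paths = set()
    for entry in manifests.values():
        for kind in wanted:
            paths.update(entry.get(kind, []))
    return sorted(paths)
```

With this shape, a configuration-only restore is `restore_set(MANIFESTS, ["config"])` and a full restore passes both kinds, so the admin does not need to know each package's layout.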

Ideally one should be able to install a distribution and start running backups. In the event of a failure they should be able to re-install/re-image (automated by installation/configuration profile that was automatically saved during initial installation and updated by subsequent package management events) ... and then restore just the configuration and data.

The user and/or admin should NOT have to be an expert in every package on the system to know which parts are invariant from the installation, which parts are configuration and which parts are data!

Another common use model arises when testing a new version of the OS or migrating a service from one system (perhaps a 32-bit system) to another (let's say a 64-bit system ... or one on a completely different architecture). Ideally the same distribution is available on the target and one should be able to simply apply the installation profile, merge the configurations and restore/migrate the data. (Note that the configuration often has to be merged rather than simply restored and that data often must be converted to a new format in these cases).

By backups I don't necessarily nor exclusively mean "tapes" or CDR/DVD-R. In fact a distribution and network configuration management system that was aware of something like MogileFS would be truly innovative!

(Considering that data integrity and recoverability is the most important aspect of systems administration it's pretty pathetic how poorly our tools serve us in this area. Overall our distributions and tools just tend to help us "toss everything into place" and maybe help a little to "get it running." But the area where the most refinement is still needed is in saving and restoring the actual data that "it" is working on!).

We, as a community, have a long way to go on all these counts before we can claim to have "refined" our approach to systems administration. (And pointing out that the rest of the software industry hasn't done any better is a poor excuse. The commercial/proprietary software industry is separated into silos due to forces that don't apply to the free software community. We can do much better without "permission" from various executives vying for sweetheart "strategic partnerships").

JimD

Unrefined?

Posted May 27, 2006 14:14 UTC (Sat) by stuart (subscriber, #623) [Link]

the bonding driver does the network teaming thing no?

The Debian installer can use lvm and md can't it?

Unrefined?

Posted May 30, 2006 6:50 UTC (Tue) by AnswerGuy (subscriber, #1256) [Link]

You missed my point.

md/RAID (or hardware RAID for that matter) and LVM and NIC teaming are all low level redundancy and scaling tools. They are supported (to varying degrees) by many distribution installation tools.

However, they are not consistently supported by monitoring and configuration management tools and there isn't enough integration between the installation and monitoring and config management systems.

Why doesn't my installer set up a fragment of configuration code that can be picked up by my Nagios installation (possibly by registering the details with some server) so that Nagios automatically knows how to monitor my RAID array or teamed NIC links ... so I can get minor alerts when they are running in degraded mode? Why should I have to go through all those details for all my systems? Can't we make them smart enough so that my installation options for one system are automatically incorporated or detected by various other systems that need them?
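A sketch of the kind of fragment an installer could emit follows. It writes Nagios `define service` object definitions from the RAID and bonding choices made at install time. The `generic-service` template and the `check_md_raid` and `check_bond` command names are assumptions; they would have to exist in the receiving Nagios configuration.

```python
from typing import Iterable


def nagios_service(host: str, description: str, check_command: str,
                   template: str = "generic-service") -> str:
    """Render one Nagios service object definition as text."""
    lines = [
        "define service {",
        f"    use                  {template}",
        f"    host_name            {host}",
        f"    service_description  {description}",
        f"    check_command        {check_command}",
        "}",
    ]
    return "\n".join(lines) + "\n"


def fragments_for_host(host: str, md_arrays: Iterable[str],
                       bonds: Iterable[str]) -> str:
    """Build a config fragment covering each md array and bonded interface."""
    parts = []
    for array in md_arrays:
        # check_md_raid is a hypothetical command; "!" separates its argument.
        parts.append(nagios_service(host, f"RAID {array}", f"check_md_raid!{array}"))
    for bond in bonds:
        # check_bond is likewise hypothetical.
        parts.append(nagios_service(host, f"Bond {bond}", f"check_bond!{bond}"))
    return "\n".join(parts)
```

The installer already knows which arrays and bonds it created, so calling something like `fragments_for_host("web01", ["md0"], ["bond0"])` and dropping the result where the Nagios server can pick it up would remove the per-host manual step.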

Don't even get me started on how much work we need to do to rationally cope with different virtualization and load balancing and diskless workstation and even diskless server systems.

JimD

Unrefined?

Posted Jun 1, 2006 19:52 UTC (Thu) by greg@rage.net (guest, #38143) [Link]

Hello AnswerGuy and thanks for the feedback.

I'll provide some background on why we chose to go with Nagios/Cfengine in our environment.

At Avvenu, we are running several in-house daemons. When I run Apache I expect it to be rock solid; however, when it's an in-house application you can expect that QA will not find every instance of instability in the daemon. We wanted an insurance policy that would restart daemons in the event that a user did something odd to wedge a server. In our case we went with cfengine & nagios because we were using the extensive power of both tools; NAGIOS was doing a great job at monitoring our server farm, and cfengine was managing all follow-on system configuration after the base-kickstart. It made sense to integrate these two tools rather than write & maintain another codebase of scripts to manage fault detection & restart.

If you are running a single webserver the NAGIOS & Cfengine system is likely overkill; if you manage a whole farm then it makes more sense. While no Linux distribution currently automagically determines how you want your disks RAID'ed or NICs bonded, we've configured these things once in a kickstart config so that each subsequent machine automatically gets the proper fault-tolerant configuration.

Another neat thing about our environment: we actually do have machines automatically set up their nagios monitoring for hosts as they are kickstarted, based on their network function. Sometime perhaps I'll write this up.

Unrefined?

Posted Jun 4, 2006 12:40 UTC (Sun) by kreutzm (subscriber, #4700) [Link]

Another part with a drop-in directory in Debian is logcheck (but only the regexps there, so the package can only say: This is harmless and this is problematic, but no shades of grey).

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds