
Kernel Fault injection framework using SystemTap

From:  Anup C Shan <anupcshan@gmail.com>
To:  systemtap@sources.redhat.com
Subject:  [RFC 1/5] Kernel Fault injection framework using SystemTap
Date:  Fri, 11 Jul 2008 15:34:55 +0530
Message-ID:  <48773047.1050906@gmail.com>
Cc:  kghoshnitk@gmail.com, akinobu.mita@gmail.com, k-tanaka@ce.jp.nec.com

Hi.

We have designed a tapset for fault injection. It is meant to ease the
process of injecting faults into the kernel. As use cases, we have
ported the in-kernel fault injection for slab and page_alloc to this
framework. Refer to Documentation/fault-injection/ in the kernel tree.

We have also modified the existing SCSI fault-injection systemtap script 
(http://sourceforge.net/projects/scsifaultinjtst/) to use this framework.

Please find the tapset file and readme attached. The use-case scripts
are in the follow-up mails.

Comments and suggestions are welcome.

Please suggest the right location for these tapset scripts in the
SystemTap source tree.

Thanks,
Kushal & Anup




Introduction
------------

This tapset provides a framework to facilitate fault injections for
testing the kernel. The framework can be used by systemtap scripts to
actually inject faults. The framework processes the command line 
arguments and controls the fault injection process.

The following generic parameters are used to set up fault injection:
	a) failtimes - maximum number of times the process can be failed
	   (0 = no limit)
	b) interval - number of successful hits between potential failures
	   (0 to fail every time)
	c) probability - probability of failure at a potential failure point,
	   as a percentage from 1 to 100 (0 disables this check)
	d) taskfilter - 0 to fail all processes, 1 to fail only the processes
	   selected by pid
	e) space - number of successful hits before the first failure
	f) verbosity - controls the amount of output generated by the script
	g) totaltime - duration of the fault injection session in milliseconds
	h) debug - print debug information for the script
	i) pid - process ID of a process to inject failures into. Repeat the
	   option to add more processes. A process can also be specified
	   using the -x option.

These parameters are registered in the tapset using the fij_add_option()
function, which also sets the default value and the help text of each
parameter. The generic parameters are stored in the fij_params[] array and can
be accessed as fij_params["variable_name"]. If you do not specify a parameter
on the command line, its default value is used.
Using fij_load_param(), your script can also assign script-specific default
values to the generic parameters.
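
The default-then-override behaviour can be modelled outside the kernel. The
following Python sketch only illustrates the bookkeeping and is not the tapset
itself; add_option() and apply_argument() are hypothetical stand-ins for
fij_add_option() and the tapset's argument handling, and non-numeric values are
out of scope here.

```python
# Illustrative model only: mirrors how the tapset stores defaults and
# lets "name=value" command-line arguments override them.
params = {}       # stands in for fij_params[]
params_help = {}  # stands in for fij_paramshelp[]


def add_option(name, default, help_text):
    """Register a parameter with its default value and help text."""
    params[name] = default
    params_help[name] = help_text


def apply_argument(arg):
    """Apply one 'name=value' argument; return True if it was accepted."""
    name, sep, value = arg.partition("=")
    if not sep or name not in params:
        return False  # the tapset logs a warning and ignores these
    params[name] = int(value)
    return True


add_option("failtimes", 0, "Number of times to fail (0 = no limit)")
add_option("totaltime", 1000, "Duration of the session in milliseconds")
apply_argument("failtimes=5")  # overrides the default
# totaltime was not given on the command line, so it keeps its default.
```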

You can also define mandatory parameters, which are specific to a script and
depend on the kernel subsystem under test. These parameters must be specified
on the command line when the script is run.
Examples are device numbers and inode numbers, which cannot be given default
values.

Such parameters can be registered using the fij_add_necessary_option() function.
Calling this function adds the variable to the fij_mandatoryparams[] array.
If any of these parameters is not specified on the command line, an error is
reported and the script is aborted. Once specified, the variable can be
accessed as fij_params["variable_name"].
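
As a rough illustration of the mandatory-parameter rule, the Python sketch
below models the check in isolation. It is not the tapset code; parse() and the
"device" parameter are hypothetical names chosen for the example.

```python
# Illustrative model only: a mandatory parameter has no default, so it
# must appear on the command line or the run is aborted.
def parse(args, defaults, mandatory):
    """Return (params, missing) after applying 'name=value' arguments."""
    params = dict(defaults)
    pending = set(mandatory)
    for arg in args:
        name, sep, value = arg.partition("=")
        if not sep:
            continue  # not a name=value pair; ignored in this sketch
        if name in params:
            params[name] = int(value)
        elif name in pending:
            params[name] = int(value)  # promoted to an ordinary parameter
            pending.discard(name)
    return params, sorted(pending)


defaults = {"failtimes": 0}
ok_params, ok_missing = parse(["device=8", "failtimes=3"], defaults, ["device"])
bad_params, bad_missing = parse(["failtimes=3"], defaults, ["device"])
# ok_missing is empty, so the script would proceed;
# bad_missing names "device", so the script would report an error and exit.
```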

The framework controls the fault injection using the fij_should_fail() and
fij_done_fail() functions. Your script should probe the kernel routine that is
to be subjected to fault injection. The user-defined probe handler invokes
fij_should_fail(), which returns 1 if it is time to inject a failure, or 0
otherwise. Your script can inject the fault in various ways, such as faking an
error by changing the return value or modifying data structures.
fij_done_fail() must be called immediately after the fault is injected, so that
the tapset can account for it. fij_done_fail() must not be called if no fault
was injected.
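
The call protocol can be sketched as a small Python model. This is an
assumption-laden illustration, not the tapset: the FaultInjector class and the
handler() function are hypothetical, only the failtimes and probability checks
are modelled, and probability is treated as an exact percentage.

```python
# Illustrative model only: the call protocol between a probe handler and
# the tapset. Only the failtimes and probability checks are modelled.
import random


class FaultInjector:
    def __init__(self, failtimes=0, probability=0, rng=None):
        self.failtimes = failtimes      # 0 = no limit
        self.probability = probability  # percent; 0 disables the check
        self.rng = rng or random.Random()
        self.probe_hits = 0
        self.fail_count = 0

    def should_fail(self):
        """Return True when the handler should inject a fault now."""
        self.probe_hits += 1
        if self.failtimes and self.fail_count >= self.failtimes:
            return False
        if self.probability and self.rng.randrange(100) >= self.probability:
            return False
        return True

    def done_fail(self):
        """Record that a fault was actually injected."""
        self.fail_count += 1


def handler(injector, real_result):
    """Hypothetical probe handler: fake an error when told to."""
    if injector.should_fail():
        injector.done_fail()   # only called because a fault is injected
        return -12             # -ENOMEM instead of the real result
    return real_result


injector = FaultInjector(failtimes=2)
results = [handler(injector, 0) for _ in range(5)]
```

With failtimes=2 and the probability check disabled, the first two hits are
failed and the remaining three return the real result.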

fij_logger() - This is a wrapper for the SystemTap log() function with an added
verbosity parameter. The message is displayed only if the value of the global
fij_verbosity is greater than or equal to the verbosity passed to the
function.
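
The verbosity gate amounts to a single comparison. The Python sketch below
models it with a hypothetical make_logger() helper; the messages are taken from
the tapset, but the process ID and name in them are made up for the example.

```python
# Illustrative model only: a message is shown when the configured
# verbosity is at least the level attached to the message.
def make_logger(verbosity, sink):
    def logger(min_verbosity, msg):
        if verbosity >= min_verbosity:
            sink.append(msg)
    return logger


shown = []
log = make_logger(verbosity=1, sink=shown)
log(0, "Failed process 1234 - dd")     # shown: 1 >= 0
log(1, "Option interval has value 0")  # shown: 1 >= 1
log(2, "Probe hit")                    # hidden: 1 < 2
```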


How to use the tapset
---------------------

1) Write a begin probe that adds user-defined parameters and default values.
2) Write probes for fault injection. Call fij_should_fail() before injecting
   the fault and fij_done_fail() after the fault is injected.


Description of code flow
------------------------

1) begin(less than -1000) in the user script [OPTIONAL] - Preinitialization. 
   As of now, this is not necessary.

2) begin(-1000) in the tapset - This function initialises counters and
   registers all generic parameters with global defaults.

3) begin in the user script - User defined default parameters are supplied 
   here. Also any script specific parameters are registered at this stage.

4) begin(1000) in the tapset - Command line arguments are parsed and
   parameters assigned appropriate values.

5) begin(more than 1000) in the user script [OPTIONAL] - This can be used
   to copy values of arguments from the fij_params[] array to local/global
   variables for easy referencing.

6) Script starts executing. It is interrupted every 10 milliseconds to
   check if script has run for the stipulated length of time.

7) When function/statement probes are hit, the script must invoke
   fij_should_fail() function to check if the conditions for failure have been
   satisfied.

8) Fail the function using suitable methods (changing return values,
   setting fake values to variables...)

9) Call fij_done_fail() function to inform tapset that fault has been injected.

10) The script exits either when it calls the exit() function or when the
    timeout is hit. At this point, statistics of the experiment are printed.
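
The ordering of the begin probes in steps 2 to 5 relies on SystemTap running
begin probes in increasing order of their sequence number, with a plain begin
treated as sequence 0. The Python sketch below only models that ordering; the
value 1001 is an arbitrary choice for "more than 1000".

```python
# Illustrative model only: begin probes run in increasing order of their
# sequence number, and a plain "begin" counts as sequence 0.
begin_probes = [
    (1000, "tapset: parse command line"),
    (0, "script: register script-specific options"),
    (-1000, "tapset: register generic options"),
    (1001, "script: copy parameter values into globals"),
]

order = [action for _, action in sorted(begin_probes)]
for sequence, action in sorted(begin_probes):
    print(f"begin({sequence}): {action}")
```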

%{
#include <linux/random.h>
%}

global fij_params	//Array of all parameters. (except fij_pids_to_fail)
global fij_paramshelp	//Array of help information for all parameters
global fij_mandatoryparams	
			//Array of mandatory parameters 

global fij_pids_to_fail	//Array of pids subject to fault injection
global fij_failcount	//Number of times failed so far
global fij_probehits	//Number of times the probe has been hit
global fij_intervalcount
			//Number of successful probe hits
global fij_aborted	//Boolean value to check whether the fault injection
			//procedure needs to continue or not
			//Needed for help option

global fij_failtimes
global fij_verbosity
global fij_debug
global fij_taskfilter
global fij_interval
global fij_probability
global fij_space
global fij_totaltime

function fij_random:long()
%{
	THIS->__retvalue = random32();
%}

function fij_add_process_to_fail(procid:long)
{
	fij_logger(1, sprintf("Adding process %d to the fail list", procid))
	fij_pids_to_fail[procid] = 1
}

/*
 * Add an option to the parameters list
 * This option can be provided on the command line as opt = value
 */
function fij_add_option(opt:string, defval, help:string)
{
	fij_params[opt] = defval
	fij_paramshelp[opt] = help
}

/*
 * Add an option to the necessary parameters list
 * This option MUST be provided on the command line
 */
function fij_add_necessary_option(opt:string, help:string)
{
	fij_mandatoryparams[opt] = 1
	fij_paramshelp[opt] = help
}

function fij_print_help()
{
	fij_logger(0, "Usage : stap script.stp [ option1=value1 [ option2=value2 [ ...]]]")
	fij_logger(0, "Options : ")
	fij_logger(0, "\tpid\r\t\t\t\t : PID of a process to be failed. Use this option repeatedly to add multiple processes to fail")
	
	foreach (option in fij_params) {
		fij_logger(0, sprintf("\t%s\r\t\t\t\t : %s", option,
							fij_paramshelp[option]))
	}

	needed_options_counter = 0
	foreach (option in fij_mandatoryparams) {
		if (needed_options_counter == 0) {
			fij_logger(0, "Necessary options : ")
			needed_options_counter++
		}
		fij_logger(0, sprintf("\t%s\r\t\t\t\t : %s", option,
							fij_paramshelp[option]))
	}

	fij_logger(0, "For help : stap script.stp help")
	fij_aborted = 1
}

function fij_process_argument(arg:string)
{
	if (isinstr(arg, "=") == 1) {
		parameter=tokenize(arg, "=")
		value_in_str = tokenize("", "=")
		value = strtol(value_in_str, 10)
		if (parameter in fij_params) {
			fij_params[parameter] = value
			fij_logger(1, sprintf("Parameter %s is assigned value %d",
					parameter, fij_params[parameter]))
		} else if (parameter in fij_mandatoryparams) {
			fij_add_option(parameter, value,
						fij_paramshelp[parameter])
			delete fij_mandatoryparams[parameter]
		} else if (parameter == "pid") {
			fij_add_process_to_fail(value)
		} else
			fij_logger(0, sprintf("WARNING : Argument %s is not found in parameter list. Ignoring..", parameter))
	}
	else
		fij_logger(0, sprintf("WARNING : Invalid command line argument : %s",
									arg))
}

function fij_show_params()
{
	fij_logger(1, "Status of parameters :")
	foreach (option in fij_params)
		fij_logger(1, sprintf("Option %s has value %d", option,
							fij_params[option]))
}

/*
 * Parse command line arguments
 */
function fij_parse_command_line_args()
{
	for (i = 1; i <= argc ; i++) {
		if (argv[i] == "help") {
			fij_print_help()
			return 0
		} else
			fij_process_argument(argv[i])
	}
	
	foreach (parameter in fij_mandatoryparams) {
		fij_logger(0, sprintf("ERROR: Necessary command line parameter %s not specified", parameter))
		fij_aborted = 1
	}
}

/*
 * Load script specific default parameters
 * This function is called by the script using this tapset to set custom
 * default values
 */
function fij_load_param(arg_times:long, arg_interval:long, arg_probability:long,
			arg_taskfilter:long, arg_space:long, arg_verbose:long,
							arg_totaltime:long)
{
	fij_add_option("failtimes", arg_times,
			"Number of times to fail (0 = no limit)")
	fij_add_option("interval", arg_interval, "Number of successful hits between potential failures (0 to fail every time)")
	fij_add_option("probability", arg_probability,
		"Probability of failure (1<=probability<=100) (0 to disable)")
	fij_add_option("taskfilter", arg_taskfilter, "0=>Fail all processes, 1=>Fail processes based on pid command line argument or -x option.")
	fij_add_option("space", arg_space, "Number of successful hits before the first failure (0 to disable)")
	fij_add_option("verbosity", arg_verbose, "0=>Success or Failure messages, 1=>Print parameter status, 2=>All probe hits, backtrace and register states")
	fij_add_option("totaltime", arg_totaltime, "Duration of fault injection session in milliseconds (Default : 1000 milliseconds)")
}

/*
 * Modified log function with an additional verbosity parameter
 * The message is printed only if the current fij_verbosity parameter
 * is greater than or equal to the minimum verbosity specified. A minverbosity
 * value of 100 is reserved for debug messages, which are also printed when
 * fij_debug is set.
 */
function fij_logger(minverbosity:long, msg:string)
{
	if (fij_verbosity >= minverbosity)
		log(msg)
	else if (minverbosity == 100 && fij_debug == 1)
		log(msg)
}

/*
 * Checks whether the specified constraints for failure have been met
 * Returns 1 if process must be failed,  else returns 0
 */
function fij_should_fail:long()
{
	fij_probehits++

	fij_logger(2, "Probe hit")

	if (fij_taskfilter != 0) {
		if(!(pid() in fij_pids_to_fail)) {
			fij_logger(100, sprintf("Skipping because wrong process %d - %s probed",
						pid(), execname()))
			return 0
		} else	
			fij_logger(100, sprintf("Continuing with probing process %d - %s %d",
						pid(), execname(), target()))
	}

	if (fij_failcount == 0) {
		if (fij_space != 0)	{
			if (fij_intervalcount < fij_space) {
				fij_logger(100, sprintf("Skipping on space : %d",
								fij_intervalcount))
				fij_intervalcount++
				return 0
			} else {
				fij_intervalcount = 0
				fij_logger(100, sprintf("Done skipping on space"))
				fij_space = 0
			}
		}
	}

	if (fij_failtimes != 0 && fij_failcount >= fij_failtimes) {
		fij_logger(100, sprintf("Failed %d times already. Skipping..",
					fij_failcount))
		return 0
	}

	if (fij_interval != 0) {
		if (fij_intervalcount != 0) {
			fij_logger(100, sprintf("Skipping on interval : %d",
								fij_intervalcount))
			fij_intervalcount++
			fij_intervalcount %= fij_interval
			return 0
		} else
			fij_intervalcount++
	}

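	if (fij_probability != 0) {
		//fij_random() % 100 lies in 0..99, so skip unless it is below
		//fij_probability; this fails exactly fij_probability% of hits
		if (fij_random() % 100 >= fij_probability) {
			fij_logger(100, sprintf("Skipping on probability"))
			return 0
		} else
			fij_logger(100, sprintf("Continuing on probability"))
	}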

	return 1
}

/*
 * Post injection cleanup
 * This function MUST be called after the process has been failed
 */
function fij_done_fail()
{
	fij_failcount++
	fij_logger(0, sprintf("Failed process %d - %s", pid(), execname()))
	
	if (fij_params["verbosity"] >= 2 && fij_verbosity != 100) {
		print_backtrace()
		print_regs()
	}
}

function fij_display_stats()
{
	fij_logger(0, sprintf("Probe was hit %d times.", fij_probehits))
	fij_logger(0, sprintf("Function was failed %d times.", fij_failcount))
}

/*
 * The first begin function
 * Initialises counters and adds generic parameters to the parameters list
 * In case the script requires a begin function to be executed prior to this,
 * parameter of less than -1000 must be specified to begin()
 */
probe begin(-1000)
{
	fij_failcount = 0
	fij_probehits = 0
	fij_intervalcount = 0
	fij_aborted = 0

	fij_load_param(0, 0, 0, 0, 0, 0, 1000)	//Loading default values
	fij_add_option("debug", 0, "Display debug information (1 to enable; also shown with verbosity=100)")

	fij_mandatoryparams["initialize"] = 1;	//Needed to register mandatory
						//params as an array
	delete fij_mandatoryparams["initialize"]
}

/*
 * The last begin function
 * Does parsing of command line arguments
 * In case the script requires a begin function to be executed after all
 * initialization,  parameter of greater than 1000 must be specified to begin()
 * Eg: when you need to copy one of the fij_params[] options into a local variable 
 * after parsing command line args
 */
probe begin(1000)
{
	if (target()!=0)
		fij_add_process_to_fail(target())

	fij_parse_command_line_args()

	fij_failtimes = fij_params["failtimes"]
	fij_interval = fij_params["interval"]
	fij_probability = fij_params["probability"]
	fij_taskfilter = fij_params["taskfilter"]
	fij_space = fij_params["space"]
	fij_verbosity = fij_params["verbosity"]
	fij_totaltime = fij_params["totaltime"]
	fij_debug = fij_params["debug"]

	if (fij_aborted)
		exit()
	else
		fij_show_params()
}

probe end
{
	if (!fij_aborted)
		fij_display_stats()
}

//Check every 10 ms if the stipulated execution time has expired
probe timer.ms(10)
{
	fij_totaltime -= 10
	if (fij_totaltime <= 0)
		exit()
}


