Hi,

From discussions with Per Stenström I understand that ARTES would like a
clarification of what SICS intends to do in the part of its PAMP application
that concerns fault tolerance. I have tried to put this together below. If
you have questions or would like further clarification, you are welcome to
email or call 070-7531254.

(PS I am flying to the USA on Tuesday and will be away for about 2 1/2 weeks.)

Best regards,

Peter Magnusson
SICS



Full System Simulation as a Fault Injection Platform

The growth in the demand for dependable systems continues to be a challenge
for computer architects and systems engineers. Computers employed in
critical applications typically incorporate redundancy to tolerate faults,
as well as capabilities for detecting, locating, isolating, and recovering
from errors. A fault-tolerant system's ability to handle errors must be
validated to ensure that the system will provide the desired level of
reliability. Fault injection - the deliberate insertion of faults into an
operational system to determine its response - offers an effective solution
to this problem (Clark and Pradhan, 1995).

Fault injection experiments can be classified according to three general
attributes: system abstraction, fault model and injection method, and
dependability measure (Clark and Pradhan, 1995). Fault-injection studies
have traditionally been performed on actual systems. The continuing
integration of system logic and the increasing use of commercial
off-the-shelf (COTS) components make real systems harder to observe, so
this approach becomes more difficult with each system generation
(Kanawati et al, 1995). This development has
prompted an increasing use of simulation as a fault injection platform
(Goswami et al, 1997).

As a system abstraction for fault-injection experiments, the simulation
approaches that have been described in the open literature have avoided
modeling a complete system, even at the functional level. Such a simulation
would include a complete set of device models at the kernel-level interface,
i.e. at the level of operating system device drivers. It would be capable of
running unmodified operating system binaries, with a complete and realistic
set of application programs, server processes, libraries, etc., running on
top of the operating system.

The avoidance of a complete system model is understandable, since building
one is a significant engineering effort. Only recently have simulators in
academia
been capable of running a realistic workload: SimICS in 1997 (Magnusson et
al, 1998) and SimOS in 1998 (Herrod, 1998). Both projects have absorbed
double-digit man-year efforts.

The work behind SimICS and SimOS was motivated by computer architecture
studies. Previous techniques were limited to toy benchmarks, and there was a
need to run realistic workloads: real operating systems, the full SPECint95
suite, TPC-C and TPC-D benchmarks with commercial database servers, and
internet servers.

A complete system model at the instruction level can provide a platform for
fault-injection studies for a category of tests that have not been possible
to explore before, namely the interaction of actual unmodified real-world
application binaries with the operating system as well as with all of the
hardware elements. Such a platform has several benefits: it would allow
fault-injection studies of nearly finalized complete systems before they
are available and regardless of their observability, and it would simplify
and generalize the applicable error models compared with what existing
software-based tools such as FERRARI or DEPEND can support.

We propose to extend SimICS with support for fault-injection experiments.
The purpose is to provide the first of a new and different class of
simulation platforms for fault tolerance researchers, and to explore the
applicability of existing fault models and injection methods, as well as
dependability measures.
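
One commonly used dependability measure is error-detection coverage, the
fraction of activated faults that the system detects. The sketch below
estimates it from a list of per-run outcomes, with a normal-approximation
confidence interval. The outcome labels and the sample numbers are made up
for illustration; they are not results from any experiment, and the function
is not part of SimICS.

```python
import math
from collections import Counter

# Hypothetical outcome labels, one per injection run.
DETECTED, UNDETECTED, NOT_ACTIVATED = "detected", "undetected", "not_activated"


def coverage(outcomes, z=1.96):
    """Estimate coverage = detected / activated with a normal-approximation
    interval (z=1.96 gives roughly 95%). Returns (estimate, low, high)."""
    counts = Counter(outcomes)
    activated = counts[DETECTED] + counts[UNDETECTED]
    if activated == 0:
        raise ValueError("no activated faults; coverage is undefined")
    c = counts[DETECTED] / activated
    half = z * math.sqrt(c * (1.0 - c) / activated)
    # Clamp the interval to the valid range for a proportion.
    return c, max(0.0, c - half), min(1.0, c + half)


# Illustrative data only: 90 detected, 10 undetected, 25 never activated.
sample = [DETECTED] * 90 + [UNDETECTED] * 10 + [NOT_ACTIVATED] * 25
est, low, high = coverage(sample)
print(f"coverage = {est:.2f} (95% CI {low:.3f} to {high:.3f})")
```

Runs in which the fault was never activated are excluded from the
denominator, since they say nothing about the detection mechanisms.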

In terms of level of abstraction, SimICS sits one level above detailed
processor and device models. For example, existing fault-injection studies
of gate-level models could provide a statistical model of how errors are
reflected at the register transfer level. This model could then be plugged
directly into SimICS, extending the test to a much larger workload.
Alternatively, a detailed fault model could act as the "master", using
SimICS to generate stimuli from a real workload and feeding the effects of
hardware errors back into SimICS.
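
As a rough illustration of the first approach, the sketch below applies a
statistical register-level fault model to a toy instruction-level machine.
Everything in it is hypothetical: ToyMachine, RegisterFaultModel, and the
probabilities are invented for illustration, and none of it is the SimICS
interface. It only shows the shape of the idea: a distribution obtained
elsewhere (for example from gate-level studies) decides when and where a bit
is flipped, and the final state is compared with a fault-free run.

```python
import random
from dataclasses import dataclass, field

WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


@dataclass
class ToyMachine:
    """Hypothetical stand-in for an instruction-level simulator."""
    regs: list = field(default_factory=lambda: [0] * 4)

    def step(self, i):
        # Toy workload: accumulate the step index in r0, derive r1 from r0.
        self.regs[0] = (self.regs[0] + i) & WORD_MASK
        self.regs[1] = self.regs[0] ^ 0xFFFF


@dataclass
class RegisterFaultModel:
    """Hypothetical statistical model: a per-step fault probability and
    per-register weights (assumed inputs, e.g. from gate-level studies)."""
    p_fault_per_step: float
    reg_weights: list

    def maybe_inject(self, machine, rng):
        # Return (register, bit) if a fault was injected, else None.
        if rng.random() >= self.p_fault_per_step:
            return None
        reg = rng.choices(range(len(machine.regs)), weights=self.reg_weights)[0]
        bit = rng.randrange(WORD_BITS)
        machine.regs[reg] ^= 1 << bit  # single-bit flip
        return reg, bit


def run(steps, model=None, seed=0):
    """Run the toy workload, optionally under a fault model."""
    rng = random.Random(seed)
    machine = ToyMachine()
    injected = []
    for i in range(steps):
        machine.step(i)
        if model is not None:
            fault = model.maybe_inject(machine, rng)
            if fault is not None:
                injected.append((i, *fault))
    return machine.regs, injected


golden, _ = run(100)
model = RegisterFaultModel(p_fault_per_step=0.05, reg_weights=[4, 2, 1, 1])
faulty, injected = run(100, model, seed=1)
print("faults injected:", len(injected))
print("state differs from golden run:", faulty != golden)
```

In this toy workload a flip in r1 is overwritten on the next step, so unless
it lands on the final step it is masked, while a flip in r0 persists. A real
workload would show the same distinction between masked and propagated
faults at far larger scale.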

The work will be done by one or two PhD candidates. The industrial and
technical advisors will be Bengt Werner and Peter Magnusson at
SICS/Virtutech. The scientific advisor will be Seif Haridi at SICS/KTH.


References

Clark, Jeffrey A., and Dhiraj K. Pradhan. "Fault Injection - A Method for
Validating Computer-System Dependability". Computer, June 1995, pp 47-56.

Goswami, Kumar K., Ravishankar K. Iyer, and Luke Young. "DEPEND: A
Simulation-Based Environment for System Level Dependability Analysis."

Herrod, Stephen A. Personal communication. See also "Using Complete Machine
Simulation to Understand Computer System Behavior", Stephen A. Herrod, Ph.D.
Thesis, Stanford University, February 1998.

Kanawati, Ghani A., Nasser A. Kanawati, and Jacob A. Abraham. "FERRARI: A
Flexible Software-Based Fault and Error Injection System." IEEE Transactions
on Computers, Vol 44, No 2, February 1995, pp 248-260.

Magnusson, Peter S., Fredrik Dahlgren, Håkan Grahn, Magnus Karlsson, Fredrik
Larsson, Fredrik Lundholm, Andreas Moestedt, Jim Nilsson, Per Stenström, and
Bengt Werner. "SimICS/sun4m: A Virtual Workstation". In Usenix Annual
Technical Conference, June 15-18, 1998, New Orleans, Louisiana.