Date: Mon, 7 Dec 1998 10:23:07 +0100 (MET)
From: Johan Karlsson

Please find below my response to the reviewer comments concerning my proposal "Node-level Fault-tolerance for Fixed Priority Scheduling".

Best regards,
Johan Karlsson

-----------------------------------------------------------------------
Johan Karlsson, Universitetslektor (Senior Lecturer), Tekn Dr
Institutionen för datorteknik (Dept. of Computer Engineering)
Chalmers tekniska högskola, 412 96 Göteborg
E-mail: johan@ce.chalmers.se   Tel: 031 772 16 70   Fax: 031 772 36 63
-----------------------------------------------------------------------

Comments and clarifications in response to the reviewer comments concerning the project proposal "Node-level Fault-tolerance for Fixed Priority Scheduling" submitted to ARTES.

First I would like to clarify the overall goal of our research. We consider the problem of implementing fault-tolerant hard real-time systems using low-cost techniques. We propose the combined use of behavior-based error detection and massive time redundancy. Recent developments in microprocessor technology have made it economically feasible to use high-performance microprocessors also in low-cost applications; MIPS are cheap! This has made it possible to consider the use of massive time redundancy to achieve high dependability in applications with low or moderate requirements on processing power. Massive time redundancy makes use of double or triple execution of tasks, checkpointing, and rollback or roll-forward recovery techniques. Except for double execution, these techniques have previously not been commonly used in hard real-time systems because they impose a rather large time overhead.

Reviewer comment: I don't find any novelty in this proposal. The problem itself is not motivated. There are few details about the fault tolerance techniques to be studied (no novelty?), and even if they accomplish what they say I don't really see much as an outcome. I feel that they failed to motivate any key problems here.

Response: The key research problem that we intend to investigate is validation of the failure mode assumptions made in fault-tolerant fixed-priority scheduling. A few papers have described how time-redundant execution can be incorporated into fixed-priority scheduling. A recent paper by Ghosh et al. [1] considers time redundancy in rate-monotonic scheduling; the authors cite only three other papers that have addressed this problem. Most papers dealing with fault-tolerant scheduling rely on naive assumptions about system behavior in the presence of faults. It is often assumed that error detection coverage is perfect, that error detection latency is negligible, or that recovery time is deterministic or negligible. These assumptions are not realistic, especially when low-cost fault-tolerance techniques are used. To deal with this kind of uncertainty, Powell has defined the notion of assumption coverage [2]. In reality, error detection coverage, error detection latency, and recovery time are random variables that depend on where and when a fault occurs, and on how the workload exercises the hardware (workload dependency). We intend to use fault injection to estimate assumption coverage, error detection coverage, error detection latency, recovery time, etc., and to create a model in which these parameters are used to estimate the probability that any task will miss its deadline given that a fault has occurred. I am not aware of any experimental research that has addressed this important problem.
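To make the intended parameter estimation concrete, the sketch below shows, in C, how such estimates could be computed from a log of fault-injection outcomes. It is only an illustration: the record layout, the field names, and the values in the example log are placeholders, not the instrumentation or the data we will actually use.

/*
 * Illustrative sketch only: how estimates of detection coverage,
 * detection latency, recovery time, and P(deadline miss | fault)
 * could be derived from a log of fault-injection outcomes.
 * Record layout and example values are placeholders.
 */
#include <stdio.h>

struct outcome {
    int    detected;        /* error caught by a behavior-based mechanism  */
    double latency_ms;      /* time from fault activation to detection     */
    double recovery_ms;     /* time spent in rollback / re-execution       */
    int    deadline_missed; /* 1 if the affected task overran its deadline */
};

int main(void)
{
    /* Placeholder records standing in for real injection experiments. */
    struct outcome log[] = {
        { 1, 0.4, 2.1, 0 },
        { 1, 1.2, 2.0, 0 },
        { 0, 0.0, 0.0, 1 },
        { 1, 0.7, 2.3, 1 },
    };
    int n = sizeof log / sizeof log[0];
    int i, detected = 0, missed = 0;
    double lat = 0.0, rec = 0.0;

    for (i = 0; i < n; i++) {
        detected += log[i].detected;
        missed   += log[i].deadline_missed;
        if (log[i].detected) {
            lat += log[i].latency_ms;
            rec += log[i].recovery_ms;
        }
    }
    printf("estimated detection coverage: %.2f\n", (double) detected / n);
    printf("mean detection latency:       %.2f ms\n", lat / detected);
    printf("mean recovery time:           %.2f ms\n", rec / detected);
    printf("P(deadline miss | fault):     %.2f\n", (double) missed / n);
    return 0;
}

In the real experiments these quantities would of course be estimated per task and per error detection mechanism, with the workload dependency discussed above taken into account.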
Another key problem is the trade-off between the probability of fail-silence violations in the value domain and the probability of missing a deadline. We will use double execution of tasks and result comparison to detect data errors that escape the behavior-based error detection mechanisms (e.g., hardware exceptions). However, our previous research has shown that up to 95% of transient faults can be detected by behavior-based mechanisms. When an error is detected by these mechanisms in, say, the second execution of a task, we have the possibility to use the result of the first execution without re-executing the second. This will decrease the probability of missing a deadline, but it will increase the probability of producing erroneous results (because we rely on a single instance of the result). We intend to investigate different scheduling policies that make on-line decisions about whether to run single or double execution of tasks that have been affected by a fault. For data errors detected by result comparison, we will consider running a third execution in order to take a majority vote over the three results. We will identify criteria that the scheduler can use to determine whether it is safe to run another execution of a task, or whether the node must be shut down in order to prevent a fail-silence violation in either the value or the time domain. (A sketch of what such a decision rule could look like is given at the end of this letter, just before the reference list.) Other key problems are how to implement time-redundant execution of the kernel program, and the optimal placement of checkpoints.

Reviewer comment: I find their terminology confusing, e.g., they use the terms static versus fixed priority scheduling without defining them. However, fixed priority scheduling is a type of static scheduling so this is confusing at best.

Response: I use the same terminology as Tindell [3] and Poledna [4]. The term fixed-priority scheduling is clearly defined in Tindell's work. Both Tindell and Poledna use the term "dynamic" when scheduling decisions are made on-line, and the term "static" when schedules are calculated off-line. I am aware that other researchers use the term "dynamic" exclusively to denote techniques that make scheduling decisions based on dynamic priorities, e.g., earliest deadline first (EDF) scheduling.

Reviewer comment: I also don't understand their claim that fixed priority scheduling and FT has not been researched; in what sense? I don't agree with this at all;

Response: I claimed that it has not been researched extensively, and this was perhaps a bit misleading. There have been several papers on scheduling theory and a few implementation studies, but very few papers have addressed the problem of validating the failure mode assumptions of fault-tolerant fixed-priority scheduling. In particular, I am not aware of any fault injection paper that has addressed this problem.

Reviewer comment: You cannot generalize from 1 system.

Response: Yes, it is difficult to draw general conclusions from experiments. But in most scientific disciplines experimentation is necessary in order to assess the accuracy of theoretical models and to validate the assumptions on which the models are based. Computer science is no exception. We expect that the fault injection methodology for parameter estimation and the scheduling techniques that we will develop in this project could be used in a wide range of systems that make use of behavior-based error detection and massive time redundancy. We also believe that our results could provide valuable insight to scheduling theoreticians. However, the specific results regarding, for example, detection coverage or the probability of missing a deadline will of course only be valid for our target system.
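As promised above, the following C sketch illustrates what a purely slack-based decision rule for a doubly executed task hit by a fault could look like. The names, the task-state fields, and the use of remaining slack as the sole criterion are assumptions made for this illustration only; they are not the policy we propose to derive in the project.

/*
 * Illustrative sketch of an on-line decision after a fault has affected a
 * doubly executed task.  Names and the slack-based criterion are
 * assumptions for this example only.
 */
#include <stdio.h>

enum action {
    ACCEPT_RESULT,    /* deliver a result without any further execution     */
    RE_EXECUTE,       /* run the affected execution again                    */
    THIRD_EXECUTION,  /* run a third time and take a majority vote           */
    SHUT_DOWN_NODE    /* preserve fail silence by stopping the node          */
};

struct task_state {
    double slack_ms;       /* time to deadline minus remaining demand        */
    double exec_ms;        /* worst-case time of one more execution          */
    int exception_raised;  /* error caught by behavior-based detection       */
    int results_disagree;  /* the two executions produced different results  */
};

static enum action decide(const struct task_state *t)
{
    if (t->exception_raised) {
        /* Error detected in one replica: re-run it if the deadline allows,
         * otherwise accept the result of the unaffected execution (cheaper,
         * but a higher risk of a value-domain fail-silence violation). */
        return (t->slack_ms >= t->exec_ms) ? RE_EXECUTE : ACCEPT_RESULT;
    }
    if (t->results_disagree) {
        /* Data error found by result comparison: a third execution allows a
         * majority vote; without time for it, shutting the node down avoids
         * emitting a possibly erroneous value. */
        return (t->slack_ms >= t->exec_ms) ? THIRD_EXECUTION : SHUT_DOWN_NODE;
    }
    return ACCEPT_RESULT;  /* both executions agree */
}

int main(void)
{
    struct task_state t = { 5.0, 2.0, 1, 0 };  /* placeholder values */
    printf("decision: %d\n", decide(&t));
    return 0;
}

The criteria we will actually study may, for example, also take the criticality of the task and the state of other ready tasks into account.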
References

[1] S. Ghosh, R. Melhem and D. Mossé, "Fault-Tolerant Rate-Monotonic Scheduling", The Journal of Real-Time Systems, vol. 15, no. 2, Kluwer Academic Publishers, pp. 149-181, Sept. 1998.

[2] D. Powell, "Failure Mode Assumptions and Assumption Coverage", Proc. of 22nd Int. Symp. on Fault-Tolerant Computing, IEEE Computer Society Press, Boston, MA, USA, pp. 386-395, July 1992.

[3] K. W. Tindell, Fixed Priority Scheduling of Hard Real-Time Systems, Ph.D. thesis, Univ. of York, UK, 1994.

[4] S. Poledna, Fault-Tolerant Real-Time Systems: The Problem of Replica Determinism, Kluwer Academic Publishers, 1996.