Achieving Fault-Tolerance in Operating System Design and Implementation

OSAGU, JESSICA CHINEZIE
OBAFEMI AWOLOWO UNIVERSITY, ILE-IFE, NIGERIA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Introduction
Fault-tolerant computing is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults. A fault-tolerant system may tolerate one or more types of fault, including: (i) transient, intermittent, or permanent hardware faults; (ii) software and hardware design errors; (iii) operator errors; and (iv) externally induced upsets or physical damage. An extensive methodology has been developed in this field over the past thirty years, and a number of fault-tolerant machines have been built. Most deal with random hardware faults, while a smaller number deal with software, design, and operator faults to varying degrees. A large amount of supporting research has been reported.

Research on fault tolerance and dependable systems covers a wide spectrum of applications, including embedded real-time systems, commercial transaction systems, transportation systems, and military and space systems, to name a few. The supporting research includes system architecture, design techniques, coding theory, testing, validation, proof of correctness, modelling, software reliability, operating systems, parallel processing, and real-time processing. These areas often draw on widely diverse core expertise, ranging from formal logic, the mathematics of stochastic modelling, and graph theory to hardware design and software engineering.
Recent developments include the adaptation of existing fault-tolerance techniques to RAID disks, where information is striped across several disks to improve bandwidth and a redundant disk holds encoded information so that data can be reconstructed if a disk fails. Another area is the use of application-based fault-tolerance techniques to detect errors in high-performance parallel processors.
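
The RAID idea described above can be illustrated with a small sketch. It assumes the simplest possible encoding, a single parity block computed as the byte-wise XOR of the data blocks (as in single-parity RAID levels). The text does not say which encoding is meant, so this is an illustration and not a description of any particular product. The function names (xor_blocks, stripe, reconstruct) and the block size are hypothetical and chosen only for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of several equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe(data, n_data_disks, block_size):
    """Split data into one block per data disk and compute a parity block.

    Assumption for this sketch: the data fits in a single stripe and is
    zero-padded up to n_data_disks * block_size bytes.
    """
    capacity = n_data_disks * block_size
    if len(data) > capacity:
        raise ValueError("data does not fit in one stripe")
    padded = data.ljust(capacity, b"\0")
    blocks = [padded[i * block_size:(i + 1) * block_size]
              for i in range(n_data_disks)]
    return blocks, xor_blocks(blocks)


def reconstruct(blocks, parity, failed_index):
    """Rebuild the block of one failed disk from the survivors and the parity."""
    survivors = [b for i, b in enumerate(blocks) if i != failed_index]
    return xor_blocks(survivors + [parity])


# Four data disks plus one redundant (parity) disk.
blocks, parity = stripe(b"fault-tolerant", n_data_disks=4, block_size=4)

# Simulate the loss of disk 2 and recover its contents.
recovered = reconstruct(blocks, parity, failed_index=2)
```

Because XOR is its own inverse, combining the parity block with the surviving blocks yields exactly the missing block. A single parity block tolerates the loss of any one disk, but not two at once.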

