Achieving Fault-Tolerance in Operating System Design and Implementation

OSAGU, JESSICA CHINEZIE
OBAFEMI AWOLOWO UNIVERSITY, ILE-IFE, NIGERIA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ACHIEVING FAULT-TOLERANCE IN OPERATING SYSTEM DESIGN AND IMPLEMENTATION

Introduction
Fault-tolerant computing is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults. A fault-tolerant system may tolerate one or more types of fault, including (i) transient, intermittent or permanent hardware faults, (ii) software and hardware design errors, (iii) operator errors, and (iv) externally induced upsets or physical damage. An extensive methodology has been developed in this field over the past thirty years, and a number of fault-tolerant machines have been built. Most deal with random hardware faults, while a smaller number deal with software, design and operator faults to varying degrees. A large amount of supporting research has been reported.

Research on fault tolerance and dependable systems covers a wide spectrum of applications, including embedded real-time systems, commercial transaction systems, transportation systems, and military and space systems, to name a few. The supporting research includes system architecture, design techniques, coding theory, testing, validation, proof of correctness, modelling, software reliability, operating systems, parallel processing, and real-time processing. These areas draw on widely diverse core expertise, ranging from formal logic, the mathematics of stochastic modelling and graph theory to hardware design and software engineering.
Recent developments include the adaptation of existing fault-tolerance techniques to RAID disks, where information is striped across several disks to improve bandwidth and a redundant disk holds encoded information so that data can be reconstructed if a disk fails. Another area is the use of application-based fault-tolerance techniques to detect errors in high-performance parallel processors.
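
The RAID idea described above can be illustrated with a minimal sketch. It assumes the simplest encoding, a single parity block formed by byte-wise XOR of the data blocks (the text does not say which encoding is used), and it models disks as in-memory byte strings. The function names `xor_blocks`, `stripe` and `reconstruct` are hypothetical and chosen only for this illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe(data, n_data_disks, block_size):
    """Split one stripe of data across n_data_disks blocks and append a parity block.

    Assumption for this sketch: the data fits in a single stripe and is
    zero-padded to fill it.
    """
    capacity = n_data_disks * block_size
    if len(data) > capacity:
        raise ValueError("data does not fit in one stripe")
    padded = data.ljust(capacity, b"\x00")
    blocks = [padded[i * block_size:(i + 1) * block_size]
              for i in range(n_data_disks)]
    return blocks + [xor_blocks(blocks)]


def reconstruct(disks, failed_index):
    """Rebuild the block of one failed disk by XOR-ing all surviving blocks."""
    survivors = [block for i, block in enumerate(disks) if i != failed_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    disks = stripe(b"fault-tolerant", n_data_disks=3, block_size=5)
    lost = 1  # pretend the second data disk has failed
    recovered = reconstruct(disks, lost)
    print(recovered == disks[lost])
```

Because XOR is its own inverse, any single missing block, data or parity, equals the XOR of the remaining ones. A scheme of this kind tolerates exactly one disk failure per stripe; surviving more failures requires stronger codes.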



