Fault Tolerant Systems
MODULE CODE
CREDIT VALUE
Module Aims
Aim 1
Provide an introduction to fault tolerant systems, mainly from the hardware point of view.
Aim 2
Provide knowledge of how to achieve fault tolerance through redundancy.
Module Content
Introduction
- Overview and motivation, basic principles and fundamental concepts
- Dependability concepts: dependable systems, techniques for achieving dependability, dependability measures
- Fundamental definitions: faults, errors, failures, fault/error models
- Fault models and error manifestations, permanent vs. transient faults
- Fault tolerant strategies and design techniques: fault detection, masking, containment, location, reconfiguration, and recovery; redundancy techniques (hardware, software, time, and information redundancy)
- Reliability and availability analysis
Hardware Redundancy
- Passive/active hardware redundancy
- Modular redundancy, voting techniques
- Fault tolerance at processor level: Byzantine Generals problem, consensus protocols, checkpointing and recovery
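As an illustration of the modular redundancy and voting topics listed above, the following is a minimal sketch of a majority voter together with the standard reliability expression for triple modular redundancy (TMR). It assumes a perfect voter and independent modules with identical reliability; the function names `majority_vote` and `tmr_reliability` are hypothetical and chosen only for this example.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value agreed on by a strict majority of modules, or None.

    `outputs` is a non-empty list of hashable module outputs.
    """
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def tmr_reliability(r):
    """Reliability of a TMR system, assuming a perfect voter.

    The system works if at least two of three independent modules,
    each with reliability r, work: 3r^2(1 - r) + r^3 = 3r^2 - 2r^3.
    """
    return 3 * r**2 - 2 * r**3


# One faulty module is masked by the other two.
voted = majority_vote([1, 1, 0])

# No strict majority: the voter cannot mask the fault.
undecided = majority_vote([1, 2, 3])
```

Under these assumptions TMR improves on a single module only when the module reliability exceeds 0.5; at exactly 0.5 the two are equal.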
Information Redundancy
- Coding
- Resilient disk systems (RAID)
- Algorithm-Based Fault Tolerance (ABFT)
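As an illustration of the information redundancy topics listed above, the following is a minimal sketch of XOR parity as used in parity-based RAID levels: one parity block is stored alongside the data blocks, and any single lost block can be rebuilt from the survivors. The helper name `xor_blocks` and the sample data are assumptions made for this example, and all blocks are assumed to have equal length.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, as if striped across three disks (sample data).
data = [b"ABCD", b"EFGH", b"IJKL"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# Simulate losing the second block and rebuild it from the rest plus parity.
recovered = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, the same function computes the parity and reconstructs a missing block. A single parity block tolerates only one lost block per stripe.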
Other Topics
- Failures in error-correcting mechanisms, circuits of “noisy” gates
- Fault tolerant networks; measures of network resilience/reliability
- Software fault tolerance
- Fault-tolerant dynamic systems (non-concurrent error detection and identification)
- Distributed function calculation/consensus, tolerance to faulty/malicious agents
Learning Outcomes
On successful completion of this module, a student will be able to:
Teaching Methods
The module examines the fundamental aspects of fault tolerant systems. Lectures will be delivered on campus to provide the formal taught content, including concepts, techniques and information. Lectures present the principles and techniques for designing fault-tolerant digital systems, including combinational logic, dynamic systems, and networks. The practical/tutorial sessions supplement and support the lectures, allowing a discovery approach to learning. As part of these practical sessions, students will work through various exercises and use cases.
Web links to relevant research material will be provided to students in support of the syllabus. Students will prepare and share summaries of technologies and system components. Students will also discuss case studies and explore their implications.
The assessment is designed to assess both the students’ comprehension of theoretical topics relevant to Fault Tolerant Systems (exam) and their research skills in evaluating the modern trends in this topic (coursework).
Assessment Methods
This module is assessed through an exam and an assignment.