At the early stages of the development, decisions were made to adopt network of workstations as the hardware platform for FADI. These systems are cost effective, and can grow in small increments over a large number of sizes (highly scalable), which made them the most popular distributed systems setup in academic, scientific and industrial establishments. Although they concede in performance to dedicated parallel systems (e.g massively parallel systems), their performance-cost ratio is higher. Moreover, the performance of networked systems can be boosted by using faster communication links (FDDI, ATM) and by limiting the number of multi-user access to the software.
FADI is designed to support a class of applications that are computationally intensive but do not have stringent real-time constrains. These applications require days or weeks of computations to execute on a network with dozens of workstations. Therefore, they do not justify the use of expensive high-performance fault-tolerant systems, yet they necessitate the prevention of the loss of the results of the long-running computations.
To achieve the goals of the project, we developed an integrated environment, that encompasses all aspects of modern fault-tolerant distributed computing: automatic remote process allocation, detection of hardware errors, and a technique to recover distributed user processes running under FADI from these errors. The environment modules were designed in a modular fashion which supports the parallel execution of its control and monitoring tasks. In addition, the whole system is message driven which dramatically reduces the response time to events occurring in the distributed environment. On the other hand this emphasises the importance of adopting a flexible and efficient message passing interface to facilitate the communication between the control and monitoring processes of FADI in one hand and between the distributed application processes on the other.
PVM (Parallel virtual Machine) was adopted as the underlying communication medium for FADI. This message passing system has more users than any other parallel programming environment, and is a de-facto standard for message passing environments. In addition to FADI's communication layer essential requirements of message passing and process synchronization, the PVM library offers a rich set of programming tools that makes it extremely attractive for any parallel programmer. Among these tools are: automatic spawning of user-tasks on the distributed system host, balancing of the computing load on the network nodes, dynamic process group operations, and many more.
The Error Detection Mechanism (EDM)
The detection of an erroneous system state is the starting point and probably the most crucial for all fault-tolerance strategies. Hence, an efficient, user-transparent error detection mechanism was developed for FADI. The EDM covers processor node crashes and hardware transient failures (e.g bus errors, temporary memory flips, etc.). The modular design of the EDM allowed the integration of user-programmed (application- specific) error checks into the detectable errors database.
A central error detection policy was implemented where the main detection processes run on a fail-safe processing node. Considering the distributed nature of the application programs, this implies that the latency of detecting the errors might be affected by message traffic in the in the communication network. Therefore, an on-line mechanism was implemented to dynamically measure the round-trip time of the underlying network and calculate the error latency accordingly.
A paper documenting the design, implementation, and testing of the EDDS system was presented at the 7th European Simulation Symposium ESS'95.
Backup and Recovery of Application processes
Checkpointing and rollback was adopted as the backup and recovery methodology for distributed user-application running in FADI, in favour of process replication fault- tolerance methods. The main reason is the heavy cost of the redundant hardware where the replicas execute. Therefore, their use remain confined mainly to large budget mission critical systems with stringent reliability requirements.
On the other hand, the checkpointing/rollback method incurs an overhead on the check pointed application because checkpoints are taken during failure-free operation. Roll-back recovery also requires certain actions to be taken to ensure consistent recovery when processes crash. Nevertheless, this degradation in performance is graceful, and is quite admissible for a large class of business and scientific applications, which although very wide spread and challenging, do not have stringent real-time constrains.
To implement FADI non-blocking checkpointing mechanism, Condor's bytestream checkpointing was used as the basic building block. Condor's checkpointing code was modified to cater for FADI's requirements for reliability and distribution of processing. PVM message passing routines were integrated with the checkpointing mechanism and the integrity of passed interprocess messages was ensured. A module performing the rollback of user-files was implemented. It uses a combination of copy-shadowing and file size bookkeeping to undo modifications to user-files upon rollbacks.
Performance measurements showed that the developed non-blocking checkpointing technique significantly reduces the checkpointing overhead. With this method, an exact copy(thread) of the checkpointed program is forked, and the forked thread performs all the checkpointing routines without suspending the execution of the application code.
A paper discussing the implemented process checkpointing mechanism is to appear in the proceedings of the 11th European Simulation Multiconference ESM 97.
An Algorithm for Checkpointing/Rollback in Distributed Message Passing Systems
The non-blocking checkpointing and rollback mechanism was implemented to recover stand-alone applications from hardware faults. The novel algorithm was developed to coordinate that mechanism in the context of distributed computing so that it covers possible inter-process communications taking place.
Extensive research has been carried out in algorithms and methodologies of reliable distributed computing for message-passing systems. This resulted in the development of a novel approach based on a hybrid technique that takes advantage of the low failure free overhead of consistent checkpointing methods with logging messages that cross the recovery line to avoid blocking the application process during the checkpointing protocol. The low failure-free overhead is at the expense of a longer rollback time which is admissible because of the lengthy execution time of the targeted application in FADI.
In contrast with similar algorithms, this technique is tolerant to errors occurring whilst messages are in transit, i.e messages are delivered to the destination (queued at message passing daemon or transport protocol thread), but not yet requested (consumed) by the receiving task. The algorithm requires only one global checkpoint to be recorded in stable storage and avoids multiple rollbacks (domino effect).
The Graphical User Interface
The purpose of FADI GUI is to provide the application programmer with a user-friendly form for inputting the specifications of the distributed system, and to aid in on-line monitoring of the application tasks and the underlying hardware platform (computer nodes).
The GUI was developed using TCL/TK, a powerful rapid application development tool that is built on the top of X11 interactive window system.
Checkpointing Projects on the Web
Commercial Fault-Tolerant Distributed & Parallel Processing Systems
ISIS Distributed Systems
Simulation Modelling - home page