FADI : A Fault-Tolerant Environment for Distributed Processing Systems
The field of distributed computing has witnessed an explosive expansion during the
last decade. As the use of distributed computing systems for large scale computations
is growing, so is the requirement to increase their reliability. Distributed systems
are especially vulnerable to failures because of the inter-dependency of their
processing nodes. Consequently it is essential that distributed systems adopt measures
that enhance their fault tolerance not only to recover the failed computing node, but
to prevent the waste of processing accomplished on the whole distributed system when
one of its nodes fails.
The purpose of the project is to develop a fault-tolerant distributed environment (FADI)
that provides distributed system users and parallel programmers with an integrated
processing environment, where they can reliably execute their concurrent (distributed)
applications despite errors that might occur in the underlying hardware.
At the early stages of the development, decisions were made to adopt network of
workstations as the hardware platform for FADI. These systems are cost effective, and
can grow in small increments over a large number of sizes (highly scalable), which made
them the most popular distributed systems setup in academic, scientific and industrial
establishments. Although they concede in performance to dedicated parallel systems
(e.g massively parallel systems), their performance-cost ratio is higher. Moreover, the
performance of networked systems can be boosted by using faster communication links
(FDDI, ATM) and by limiting the number of multi-user access to the software.
FADI is designed to support a class of applications that are computationally intensive
but do not have stringent real-time constrains. These applications require days or weeks
of computations to execute on a network with dozens of workstations. Therefore, they do
not justify the use of expensive high-performance fault-tolerant systems, yet they
necessitate the prevention of the loss of the results of the long-running computations.
To achieve the goals of the project, we have developed an
integrated environment (illustrated above),
that encompasses all aspects of modern fault-tolerant distributed computing: automatic
remote process allocation, detection of hardware errors, and a structured method for
error recovery of user processes running under FADI. The environment modules
were designed so as to support the parallel execution of its control
and monitoring tasks. In addition, the whole system is message driven which dramatically
reduces the response time to events occurring in the distributed environment. On the other
hand this emphasises the importance of adopting a flexible and efficient message passing
interface to facilitate the communication between the control and monitoring processes of
FADI on one hand and between the distributed application processes on the other.
Integrated environment for the study of fault-tollerance
PVM (Parallel virtual
Machine) was adopted as the underlying communication medium for FADI. This message passing
system has more users than any other parallel programming environment, and is a de-facto
standard for message passing environments. In addition to FADI's communication layer
essential requirements of message passing and process synchronization, the PVM library
offers a rich set of programming tools that makes it extremely attractive for any parallel
programmer. Among these tools are: automatic spawning of user-tasks on the distributed
system host, balancing of the computing load on the network nodes, dynamic process group
operations, and many more.
FADI's Main Modules
The Error Detection Mechanism (EDM)
The detection of an erroneous system state is the starting point and probably the most
crucial for all fault-tolerance strategies. Hence, an efficient, user-transparent error
detection mechanism was developed for FADI. The
EDM covers processor node
crashes and hardware transient failures (e.g bus errors, temporary memory flips, etc.).
The modular design of the EDM allowed the integration of user-programmed (application-
specific) error checks into the detectable errors database.
A central error detection policy was implemented where the main detection processes
run on a fail-safe processing node. Considering the distributed nature of the application
programs, this implies that the latency of detecting the errors might be affected by
message traffic in the in the communication network. Therefore, an on-line
implemented to dynamically measure the round-trip time of the underlying network and
calculate the error latency accordingly.
Backup and Recovery of Application processes
We adopted checkpointing and rollback as the backup and recovery methodology for
distributed user-application running in FADI, in preference to replication-based fault-tolerance methods.
The main reason for this design choice is the desire to avoid heavy cost of the redundant hardware where
the replicas execute. However, the savings on the hardware infrastructure come at the expense of the
computing time since the checkpoints are taken during failure-free operation.
Roll-back recovery also contributes to the overhead by requiring certain actions to ensure consistent
recovery when processes crash. Nevertheless, this degradation in performance is
graceful, and is quite admissible for a large class of business and scientific
applications which are computationally intensive but do not have stringent
The basic building block of FADI's non-blocking checkpointing is the processing of bytestreams.
The additional layer of software caters for FADI's requirements for
reliability and distribution of processing. PVM message passing routines were
integrated with the checkpointing mechanism and the integrity of interprocess
message-passing was verified. A module performing the rollback of user-files used
a combination of copy-shadowing and file size bookkeeping to undo modifications
to user-files upon rollbacks.
Performance measurements showed that the developed non-blocking checkpointing
technique significantly reduces the checkpointing overhead. With this method, an exact
copy(thread) of the checkpointed program is forked, and the forked thread performs all
the checkpointing routines without suspending the execution of the application code.
FADI's Bytestream-based non-blocking checkpointing
Computational performance of non-blocking checkpointing
The Graphical User Interface
The graphical user interface to FADI provides an intuitive interface to the applications programmer.
Its main function is to facilitate specifications of the distributed system
and to aid in on-line monitoring of the execution of the application tasks and the underlying
hardware platforms (computer nodes).
Interface to the distributed system specification facility
The GUI was developed using TCL/TK, a powerful
rapid application development tool that is built on the top of the X11 window system.
Interface to the distributed system monitoring facility
- Osman T., Bargiela A., Error detection for reliable distributed simulations, Proc. of European Simulation Symposium, Proc. of 7th European Simulation Symposium ESS-95, Erlangen-Nuremberg, October 1995, ISBN 1-56555-083-8, pp. 358-362, PDF
- Osman T., Bargiela A., FADI: A fault tolerant environment for open distributed computing, IEE Proceedings Software , vol. 147, no. 3, June 2000, pp. 91-99, doi:10.1049/ip-sen:20000702, PDF
- Osman T., Bargiela A., Reliable Distributed Computing for Decision Support Systems, Measurement and Control , Vol. 32, No. 4, May 1999, pp.115-118
- Wagealla W., Osman T. Bargiela A., Error detection algorithm for agent-based distributed applications, Proc. Agent-Based Simulation Workshop 2001, Passau, Germany May, 2001
- Osman T., Bargiela A., Process Checkpointing in an Open Distributed Environment, Proc. of European Simulation Multiconference, ESM'97, Istambul, June 1997, ISBN 1-565555-115 X, pp. 536-540, PDF
- Argile A., Peytchev E., Bargiela A., Kosonen I., DIME: A Shared Memory Environment for Distributed Simulation, Monitoring and Control of Urban Traffic, Proceedings of European Simulation Symposium ESS'96, Genoa, 1996, pp.152-156, PDF
- Argile A., Bargiela A., XDSM - an X11 based virtual distributed shared memory system, Proc. of 2nd Int. Conf. on Software for Supercomputers and Multiprocessors, SMS-94, Moscow, 1994, pp. 250-259, PDF
- Argile A., Bargiela A., Using the X11 Windows System to Support the Bulk Synchronous Model of Parallel Computation, Proc. IFIP/IEEE Int. Conf. on Distributed Platforms, Dresden, 1996, ISBN 3 86012 023 9, pp. 308-313.
- Argile A., Bargiela A., Using X11 Windows to provide shared task memory in distributed computer systems, in Inte¬≠grated Computer Applications, Ed. Coulbeck B., Vol. 1, J Wiley, 1993, ISBN 0 471 94232 4, pp. 305-316.
- Bargiela A., Parallel and distributed telemetry data processing, Proceedings of Parallel Computing and Transputers Conference, PCAT‚Äô93, Brisbaine, 1993, ISBN 90 5199 1495, pp. 269-275.
- Peytchev E., Bargiela A., On-line mobile passenger information system, Proc. Int. Workshop on Harbour, Maritime and Industrial Logistics Modelling and Simulation - HMS, Genoa, Sept. 1999, PDF
- Bargiela A, Peytchev E., Berry R., Experiences with a distributed traffic telematics environment - portable travel information system, Proc. IEEE - AFRICON-99 Conf., Sept. 1999, ISBN 0-7803-5546-6, pp.7-12, PDF
- Peytchev E., Bargiela A., Traffic Telematics Software Environment, Proc. European Simulation Symposium ESS-98, Oct. 1998, ISBN 1-56555-147-8, pp.378-382, PDF
- Kosonen I., Bargiela A., Claramunt C., A distributed traffic monitoring and information system, Proc. European Simulation Symposium ESS-98, Oct. 1998, ISBN 1-56555-147-8, pp.355-361, PDF
Last update: 07/03/01