XDSM-Based Implementation of Bulk Synchronous Parallel (BSP) processing
This research builds upon work done on using X11 to provide
heterogeneous distributed shared memory (XDSM) in order to
simplify distributed programming by hiding the details of the
distributed communication from the application programmer.
Using more than one processor to improve the performance
of parallel programming systems results in serious implementation and
financial problems as the size of the system is scaled up.
A practical industrial and commercial compromise is to provide a software
solution using processors in existing, loosely coupled, computer networks,
rather than by providing hardwired, closely coupled, solutions.
However, this approach must overcome the problems
caused by differences in data representation and operating
system access protocols used by different computers. Because X11
is widely supported on a variety of different systems, using X11
in the form of XDSM to support distributed shared memory can
overcome both of these problems. However, to gain wider
acceptance as vehicles for parallel applications, distributed
systems must also provide a complete programming environment.
This requires the development of suitable parallel processing
models, such as the Bulk Synchronous Parallel (BSP) programming model.
BSP uses the concept of a logical architecture based on separate
processors with externally synchronized communications (figure 1). This
architectural representation is provided by the underlying support
software, irrespective of the real physical architecture.
Figure 1. The BSP System
The BSP computation model is based on the concept of a superstep,
during which processes perform computations using local data.
Processes operate asynchronously within supersteps, which are
bounded (terminated) by barrier synchronizations which ensure
the completion of all current and outstanding remote I/O
operations (reading and writing). Normally, all currently
synchronizing processes are blocked until all other processes
have synchronized. The parameter g is the ratio of computation to
communication, a measure of the computational efficiency of the system.
In distributed systems, relative communication costs tend to increase as the system is scaled up.
BSP attempts to hide communication costs by making the
assumption that the number of parallel processes is likely to
exceed the number of physical processors. If processors can
multi-task, then while a process waits for communications to
complete, another process (e.g. a communications support
process) can be scheduled in its place.
This technique of improving performance by masking
communication costs, overlapping them with
computation, is termed parallel slackness (figure 2).
Figure 2. Using Parallel Slackness to improve processor utilization
By using a BSP model which does not force all processes to
synchronize, together with weaker coherency and causality
constraints, a more flexible BSP model is obtained. This permits
the reading and writing processes to operate at different rates,
and prevents the blocking of faster processes by processes which
have yet to synchronize. If reading and writing are concurrent
(occurring in the same superstep), inconsistency can occur
between different memory segments. Therefore while readers may
concurrently read without loss of consistency, writers must be
exclusive - writing in separate supersteps so that only one
process can write at a time.
XDSM is able to provide the necessary BSP facilities:
- Processors (processor/memory modules) are represented by
application program tasks, each with its own local memory space.
These represent virtual rather than physical processors and may
be mapped onto physical processors in order to satisfy the
requirements of parallel slackness.
- Communication routing is provided by X11 using the
shared memory interface. This is abstracted to represent the
main communication channel.
- Synchronization is coordinated by a task which provides
memory access ordering (concurrent reading, exclusive writing)
using FIFO queued access. Readers adjacent to the head of the
queue gain simultaneous permission to read, but only the writer
at the head of the queue may write. Communication between the
synchronizer and processor tasks uses X11 client messages,
transmitted via the X11 event propagation and handling mechanism.
Figure 3. BSP X11/XDSM Based System
XDSM-ML (XDSM Metalanguage) is implemented on the XDSM
platform (figure 3) and provides a BSP programming interface which
hides both synchronization and data transfer, and provides:
- Task BSP initialization - connection to X11 and mapping of
local to global shared memory.
- Superstep configuration - start (with setting of
synchronization periodicity) and end.
- Shared memory access - load from (read) and store to (write)
global shared memory.
- Task BSP termination - X11 disconnection.
BSP performance tests were done using SPARCstation SLCs with 16
MB of RAM, connected by a 10 Mbit/s Ethernet. The tests used from
1 to 5 clients, each on a separate SPARCstation running UNIX. The
synchronizer ran on the shared memory host machine under UNIX
and X11 Release 4, using asynchronous TCP/IP communications.
Testing consisted of clients reading or writing 1 megabyte of
data using a 32 kilobyte shared memory block. The BSP test
program was compared with a program using X11 shared memory. To
provide a performance baseline, a comparison was done using a
socket based (TCP/IP protocol) shared memory server and clients
which asynchronously transferred the 1 megabyte of data in 32
kilobyte packets. The results indicate that for a 32 kilobyte
blocksize, the cost of the basic X11 protocol is comparable to
the cost of the underlying TCP/IP protocol (figure 4).
The BSP read time is
roughly twice as slow, due to synchronization overheads.
Comparing BSP write times must take into account the access
locking mechanisms used to provide memory coherency. While BSP
reading is equivalent to reading without any access locking, BSP
writing is exclusive and is therefore closer to using a queued
FIFO access locked mutual exclusion protocol. Both BSP and X11
using FIFO access locking are four times slower, whereas X11
writing without access locking is not excessively slower.
Figure 4. Times for Data Transfer
Estimates of g,
the computational efficiency of the system, suggest that it is
comparable with TCP and X11, and may be a fixed value (figure 5).
Figure 5. Estimates of g
We consider that the performance of this BSP system is adequate
for the class of real time system represented by water supply
simulation and decision support systems.
- Argile A., An Investigation of X11 based intertask communication and HCI in water network simulations, M.Sc Thesis, Nottingham Polytechnic, 1991.
- Argile A., Bargiela A., Using the X11 Windows System to Support the Bulk Synchronous Model of Parallel Computation, Proc. IFIP/IEEE Int. Conf. on Distributed Platforms, Dresden, 1996, ISBN 3 86012 023 9, pp. 308-313.
- Argile A., Bargiela A., Using X11 Windows to provide shared task memory in distributed computer systems, in Integrated Computer Applications, Ed. Coulbeck B., Vol. 1, J Wiley, 1993, ISBN 0 471 94232 4, pp. 305-316.
- Argile A., Bargiela A., XDSM - an X11 based virtual distributed shared memory system, Proc. of 2nd Int. Conf. on Software for Supercomputers and Multiprocessors, SMS-94, Moscow, 1994, pp. 250-259, PDF
- Hartley J., Bargiela A., TPML: Parallel meta-language for scientific and engineering computations using transputers, Proc. of 2nd Int. Conf. on Software for Supercomputers and Multiprocessors, SMS-94, Moscow, 1994, PDF
- Hartley J., Bargiela A., XTPML - Simplifying the Development of Parallel Programs for Implementation on Various Transputer Architectures, Proc. European Simulation Symposium ESS-98, Oct. 1998, ISBN 1-56555-147-8, pp.119-123, PDF
- Hartley J., Bargiela A., Probabilistic Simulation of Large-scale Water Distribution Systems, Proceedings of European Simulation Symposium ES-96, Genoa, October 1996, ISBN 1-565555-099-4 (Vol.2), pp.403-407, PDF
- Hartley J., Bargiela A., Cant R., Parallel simulation of large scale water distribution systems, Proceedings of Modelling and Simulation Conference ESM-95, Prague, June 1995, ISBN 1-56555-080-3, pp. 723-727, PDF
- Hartley J., Bargiela A., Parallel State Estimation with Confidence Limit Analysis, Parallel Algorithms and Applications , Vol. 11, No.1-2, 1997, pp. 155-167, PDF
- Bargiela A., Nonlinear network tearing algorithm for transputer system implementation, Proc. of Int. Conf. TAPA-92, Melbourne, November 1992, ISBN 905199115 0, pp. 19-24
- Bargiela A., Hosseinzaman A., Parallel simulation of nonlinear networks using diakoptics, in Parallel Computing and Transputer Applications, M. Valero (ed.), IOS Press/CIMNE, Barcelona, 1992, ISBN 84 87867 138, pp. 1463-1473.
Last update: 1/2/96