XDSM-Based Implementation of BSP
Introduction
This research builds upon work done on using X11 to provide
heterogeneous distributed shared memory (
XDSM
) in order to
simplify distributed programming by hiding the details of the
distributed communication from the application programmer.
Using more than one processor to improve the performance
of parallel programming systems results in serious implementation and
financial problems as the size of the system is scaled up.
A practical industrial and commercial compromise is to provide a software
solution using processors in existing, loosely coupled, computer networks,
rather than by providing hardwired, closely coupled, solutions.
However, this approach must overcome the problems
caused by differences in data representation and operating
system access protocols used by different computers. Because X11
is widely supported on a variety of different systems, using X11
in the form of
XDSM
to support distributed shared memory can
overcome both of these problems. However, to gain wider
acceptance as vehicles for parallel applications, distributed
systems must also provide a complete programming environment.
This requires the development of suitable parallel processing
models such as the Bulk Synchronous Parallel (BSP) programming
model.
BSP uses the concept of a logical architecture based on seperate
processors with externally synchronized communications (figure 1). This
architectural representation is provided by the underlying support
software, irrespective of the real physical architecture.
Figure 1. The BSP System
The BSP computation model is based on the concept of a superstep,
during which processes perform computations using local data.
Processes operate asynchronously within supersteps, which are
bounded (terminated) by barrier synchronizations which ensure
the completion of all current and outstanding remote I/O
operations (reading and writing). Normally, until all other
processes synchronize, all currently synchronizing processes are
blocked.
g is the ratio of computation to communication - the
computational efficiency of the system.
In distributed systems relative communication costs tend to increase as the
system scales.
BSP attempts to hide communication costs by make the
assumption that the number of parallel processes is likely to
exceed the number of physical processors. If processors can
multi-task, then while a process waits for communications to
complete, another process (e.g. a communications support
process) can be scheduled in its place.
This system of improving performance by masking
communication costs by simultaneously overlaying them (synchronizing) with
computation costs is termed parallel slackness (figure 2).
Figure 2. Using Parallel Slackness to improve processor
utilization
Project description
By using a BSP model which does not force all processes to
synchronize, together with the utilization of weaker coherency
and causality, a more flexible BSP model is allowed. This permits
the reading and writing processes to operate at different rates,
and prevents the blocking of faster processes by processes which
have yet to synchronize. If reading and writing are concurrent
(occurring in the same superstep), inconsistency can occur
between different memory segments. Therefore while readers may
concurrently read without loss of consistency, writers must be
exclusive - writing in separate supersteps so that only one
process can write at a time.
XDSM
is able to provide the necessary BSP facilities -
- Processors (processor/memory modules) are represented by
application program tasks, each with its own local memory space.
These represent virtual rather than physical processors and may
be mapped onto physical processors in order to satisfy
operational requirements.
- Communication routing is provided by X11 using the
XDSM
shared memory interface. This is abstracted to represent the
main communication channel.
- Synchronization is coordinated by a task which provides
memory access ordering (concurrent reading, exclusive writing)
using FIFO queued access. Readers adjacent to the head of the
queue gain simultaneous permission to read, but only the writer
at the head of the queue may write. Communication between the
synchronizer and processor tasks uses X11 xclient messages,
transmitted via the X11 graphics event propagation and handling
system.
Figure 3. BSP X11/XDSM Based System
XDSM-ML (XDSM Metalanguage) is implemented on the
XDSM
platform (figure 3) and provides a BSP programming interface similar to which
hides both synchronization and data transfer, and provides:
- Task BSP initialization - connection to X11 and mapping of
local to global shared memory.
- Superstep configuration - start (with setting of
synchronization periodicity) and end.
- Shared memory access - load from (read) and store to (write)
global shared memory.
- Task BSP termination - X11 disconnection.
Performance
BSP performance tests were done using SLC Sparcstations with 16
Mb RAM, connected by a 10 Mb rated ethernet. The tests used from
1 to 5 clients, each on a separate Sparcstation running UNIX. The
synchronizer was on the shared memory host machine running UNIX
and X11 version 4 using asynchronous TCP/IP communications.
Testing consisted of clients reading or writing 1 megabyte of
data using a 32 kilobyte shared memory block. The BSP test
program was compared with a program using X11 shared memory. To
provide a performance baseline, a comparison was done using a
socket based (TCP/IP protocol) shared memory server and clients
which asynchronously transferred the 1 megabyte of data in 32
kilobyte packets. The results indicate that for a 32 kilobyte
blocksize, the cost of the basic X11 protocol is comparable to
the cost of the underlying TCP/IP protocol (figure 4).
The BSP read time is
in the order of twice as slow due to synchronization overheads.
Comparing BSP write times must take into account the access
locking mechanisms used to provide memory coherency. While BSP
reading is equivalent to reading without any access locking, BSP
writing is exclusive and is therefore closer to using a queued
FIFO access locked mutual exclusion protocol. Both BSP and X11
using FIFO access locking are four times slower, whereas X11
writing without access locking is not excessively slower.
Figure 4. Times for Data Transfer
Estimates of g,
the computational efficiency of the system, suggest that it is
comparable with TCP and X11, and may be a fixed value (figure 5).
Figure 5. Estimates of g
Conclusion
We consider that the performance of this BSP system is adequate
for the class of real time system represented by water supply
simulation and decision support systems.
References
- ARGILE.A.D.S., An Investigation of X11 based intertask
communication and HCI in water network simulations, M.Sc Thesis,
Nottingham Polytechnic, 1991.
- ARGILE.A.D.S., and BARGIELA.A., Using the X11 Windows System to
Support the Bulk Synchronous Model of Parallel Computation,
IFIP/IEEE International Conference on Distributed Platforms,
Dresden, 1996.
- ARGILE.A.D.S., and BARGIELA.A., Using X11 Windows to Provide
Shared Task Memory in Distributed Computer Systems, Leicester
Conference on Integrated Computer Applications in Water Supply,
1993.
- ARGILE.A.D.S., and BARGIELA.A., Distributed Water Monitoring
supported by X11 Windows, Handout produced for the Leicester
Conference on Integrated Computer Applications in Water Supply,
1993.
- ARGILE.A.D.S., and BARGIELA.A., XDSM - an X11 based Virtual
Distributed Shared Memory System, Proceedings 2nd Int. Conf.
SMSTPE, 1994.
Copyright (c) 1996 RTTS
Last update: 1/2/96
Return to project home page