XDSM-Based Implementation of BSP

Andrew Argile


Introduction

This research builds upon work done on using X11 to provide heterogeneous distributed shared memory ( XDSM ) in order to simplify distributed programming by hiding the details of the distributed communication from the application programmer.

Using more than one processor to improve the performance of parallel programming systems results in serious implementation and financial problems as the size of the system is scaled up. A practical industrial and commercial compromise is to provide a software solution using processors in existing, loosely coupled, computer networks, rather than by providing hardwired, closely coupled, solutions. However, this approach must overcome the problems caused by differences in data representation and operating system access protocols used by different computers. Because X11 is widely supported on a variety of different systems, using X11 in the form of XDSM to support distributed shared memory can overcome both of these problems. However, to gain wider acceptance as vehicles for parallel applications, distributed systems must also provide a complete programming environment. This requires the development of suitable parallel processing models such as the Bulk Synchronous Parallel (BSP) programming model.

BSP uses the concept of a logical architecture based on seperate processors with externally synchronized communications (figure 1). This architectural representation is provided by the underlying support software, irrespective of the real physical architecture.

Figure 1. The BSP System



The BSP computation model is based on the concept of a superstep, during which processes perform computations using local data. Processes operate asynchronously within supersteps, which are bounded (terminated) by barrier synchronizations which ensure the completion of all current and outstanding remote I/O operations (reading and writing). Normally, until all other processes synchronize, all currently synchronizing processes are blocked. g is the ratio of computation to communication - the computational efficiency of the system. In distributed systems relative communication costs tend to increase as the system scales. BSP attempts to hide communication costs by make the assumption that the number of parallel processes is likely to exceed the number of physical processors. If processors can multi-task, then while a process waits for communications to complete, another process (e.g. a communications support process) can be scheduled in its place. This system of improving performance by masking communication costs by simultaneously overlaying them (synchronizing) with computation costs is termed parallel slackness (figure 2).

Figure 2. Using Parallel Slackness to improve processor utilization


Project description

By using a BSP model which does not force all processes to synchronize, together with the utilization of weaker coherency and causality, a more flexible BSP model is allowed. This permits the reading and writing processes to operate at different rates, and prevents the blocking of faster processes by processes which have yet to synchronize. If reading and writing are concurrent (occurring in the same superstep), inconsistency can occur between different memory segments. Therefore while readers may concurrently read without loss of consistency, writers must be exclusive - writing in separate supersteps so that only one process can write at a time. XDSM is able to provide the necessary BSP facilities -
  1. Processors (processor/memory modules) are represented by application program tasks, each with its own local memory space. These represent virtual rather than physical processors and may be mapped onto physical processors in order to satisfy operational requirements.

  2. Communication routing is provided by X11 using the XDSM shared memory interface. This is abstracted to represent the main communication channel.

  3. Synchronization is coordinated by a task which provides memory access ordering (concurrent reading, exclusive writing) using FIFO queued access. Readers adjacent to the head of the queue gain simultaneous permission to read, but only the writer at the head of the queue may write. Communication between the synchronizer and processor tasks uses X11 xclient messages, transmitted via the X11 graphics event propagation and handling system.


Figure 3. BSP X11/XDSM Based System


XDSM-ML (XDSM Metalanguage) is implemented on the XDSM platform (figure 3) and provides a BSP programming interface similar to which hides both synchronization and data transfer, and provides:
  1. Task BSP initialization - connection to X11 and mapping of local to global shared memory.
  2. Superstep configuration - start (with setting of synchronization periodicity) and end.
  3. Shared memory access - load from (read) and store to (write) global shared memory.
  4. Task BSP termination - X11 disconnection.

Performance

BSP performance tests were done using SLC Sparcstations with 16 Mb RAM, connected by a 10 Mb rated ethernet. The tests used from 1 to 5 clients, each on a separate Sparcstation running UNIX. The synchronizer was on the shared memory host machine running UNIX and X11 version 4 using asynchronous TCP/IP communications. Testing consisted of clients reading or writing 1 megabyte of data using a 32 kilobyte shared memory block. The BSP test program was compared with a program using X11 shared memory. To provide a performance baseline, a comparison was done using a socket based (TCP/IP protocol) shared memory server and clients which asynchronously transferred the 1 megabyte of data in 32 kilobyte packets. The results indicate that for a 32 kilobyte blocksize, the cost of the basic X11 protocol is comparable to the cost of the underlying TCP/IP protocol (figure 4). The BSP read time is in the order of twice as slow due to synchronization overheads. Comparing BSP write times must take into account the access locking mechanisms used to provide memory coherency. While BSP reading is equivalent to reading without any access locking, BSP writing is exclusive and is therefore closer to using a queued FIFO access locked mutual exclusion protocol. Both BSP and X11 using FIFO access locking are four times slower, whereas X11 writing without access locking is not excessively slower.

Figure 4. Times for Data Transfer


Estimates of g, the computational efficiency of the system, suggest that it is comparable with TCP and X11, and may be a fixed value (figure 5).

Figure 5. Estimates of g


Conclusion

We consider that the performance of this BSP system is adequate for the class of real time system represented by water supply simulation and decision support systems.

References



Copyright (c) 1996 RTTS
Last update: 1/2/96


Project Return to project home page