The basic entities of the TPML are:
The computation and communication actions do not have to be ordered in time. The barrier synchronization concludes the superstep: it has the function of ensuring that all one-sided communications are properly concluded. The figure below shows this in diagrammatic form. The processes are not regarded as having a particular linear order (from left to right or otherwise), and may be mapped to processors in any way. A further aspect of the BSP model is overdecomposition of the problem and oversubscription of the processors: the problem is divided into more logical processes than there are physical processors, and processes are randomly assigned to processors. This strategy can be shown statistically to lead to almost perfect load balancing, both of work and of communication.
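The statistical claim can be illustrated with a small simulation (a sketch under an assumed task-size distribution; nothing here is part of TPML): randomly assigning many variable-size logical processes to a few processors keeps the busiest processor's load close to the average.

```python
import random

def random_assignment(num_procs, num_tasks, seed=0):
    """Assign num_tasks logical processes, each carrying a random
    amount of work, to num_procs physical processors at random."""
    rng = random.Random(seed)
    loads = [0.0] * num_procs
    for _ in range(num_tasks):
        work = rng.uniform(0.5, 1.5)          # assumed task-size spread
        loads[rng.randrange(num_procs)] += work
    return loads

# Heavy overdecomposition: 1000 logical processes on 8 processors.
loads = random_assignment(8, 1000)
avg = sum(loads) / len(loads)
assert max(loads) < 1.3 * avg    # busiest processor stays near the mean
assert min(loads) > 0.7 * avg
```

With only a handful of processes per processor the spread would be much wider; the near-perfect balance depends on the degree of overdecomposition.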
In many parallel programming systems, communications are considered at the level of individual actions: sending and receiving a message, memory to memory transfer, etc. This is difficult to work with, since there are many simultaneous communication actions in a parallel program, and their interactions are typically complex. In particular, it is difficult to say much about the time any single communication action will take to complete.
The BSP model considers communication actions en masse. This has the effect that an upper bound can be given on the time taken to communicate a set of data. BSP considers all communication actions of a superstep as one unit, and assumes all messages have a fixed size. The maximum number of incoming or outgoing messages for a superstep is denoted by h. The ability of a communication network to deliver data is captured by a parameter g, defined such that it takes time hg for a processor to deliver h messages of size 1. A message of length m obviously takes longer to send than a message of size 1, but the BSP model does not make a distinction between one message of length m and m messages of length 1: in either case the cost is mg. A superstep in which each processor sends or receives at most h messages of length m therefore has communication cost mhg.
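The cost accounting can be written out as a small sketch (the function names are illustrative; h, m and g follow the text above):

```python
def comm_cost(h, g):
    """Time for one processor to deliver h unit-size messages: h*g."""
    return h * g

def superstep_comm_cost(h, m, g):
    """Cost when each processor exchanges at most h messages of length m:
    each length-m message counts as m unit messages, giving m*h*g."""
    return comm_cost(h * m, g)

# One message of length 10 costs the same as ten messages of length 1:
assert superstep_comm_cost(1, 10, g=4) == superstep_comm_cost(10, 1, g=4) == 40
```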
The one-sided communication of the BSP model requires a global barrier synchronization. Barriers are potentially costly, but have a number of attractions. They do not introduce the possibility of deadlock or livelock, since barriers do not create circular data dependencies. Therefore tools to detect and deal with them are unnecessary. Barriers also permit novel forms of fault tolerance.
From the viewpoint of application programs, the metalanguage enables a logical specification of parallel tasks and intertask communication by the programmer. This intertask communication is performed while hiding from the programmer the message-transport complexities inherent to a specific physical processor connectivity and/or processor-memory arrangement. The metalanguage approach to the implementation of the BSP processing model was adopted in order to make full use of the available software development tools.
The syntax of the TPML, provided for use within the application code, includes four sets of commands for the respective management of: tasks, intertask communication, barrier synchronisation, and the termination of tasks and of the metalanguage itself.
The user includes the code for the task named `task_id' within these expressions. When a number of identical tasks are distributed over the transputer network, it is not necessary to duplicate the code for each task. Instead, the initialising command is given two arguments: the first naming the task, and the second naming the task whose code is to be duplicated.
The tpml_send command specifies to which task a data packet is to be sent.
By explicitly specifying the number of elements in that array, the volume of data on the communication links is minimised.
To receive a data packet, using the tpml_receive command, TPML requires the programmer to specify the name of the array to which the incoming data packet is to be assigned.
The TPML commands tpml_send and tpml_receive, combined with the router task, enable messages to be sent point to point, from one task to another.
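The behaviour of the send/receive pair can be pictured with a toy sketch in Python (the class and names are illustrative, not the TPML implementation): packets are addressed by task name, and only the specified number of array elements travels on the link.

```python
import queue

class ToyRouter:
    """Illustrative stand-in for the TPML router task: one inbox per
    task, delivering data packets point to point by task name."""
    def __init__(self, task_names):
        self.inbox = {name: queue.Queue() for name in task_names}

    def send(self, dest, data, length):
        # as with tpml_send, only `length` elements travel on the link
        self.inbox[dest].put(list(data[:length]))

    def receive(self, task):
        # as with tpml_receive, block until a packet arrives
        return self.inbox[task].get()

router = ToyRouter(["master", "worker1"])
router.send("worker1", [1, 2, 3, 4, 5], 3)   # send only the first 3 elements
assert router.receive("worker1") == [1, 2, 3]
```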
The superstep facility serves to ensure that access to shared data by the parallel tasks is limited. The `step_id' parameter (supplied by the programmer) identifies a superstep within each parallel task.
Between the brackets c>>tpml_step_begin and c>>tpml_step_end, each task may perform different computations, but all tasks must reach the end of their current, common superstep before any one of them can proceed to the next. The superstep facility thus provides a barrier synchronisation for the participating tasks. Two examples of superstep implementation follow:
Example 1:

c>>tpml_task_begin
c>>tpml_step_begin
      :
c>>tpml_receive(A)
c>>tpml_receive(B)
      :
c>>tpml_send(C)
c>>tpml_send(D)
      :
c>>tpml_step_end
c>>tpml_task_end

Example 2:

c>>tpml_task_begin
c>>tpml_step_begin
      :
c>>tpml_receive(A)
c>>tpml_receive(B)
      :
c>>tpml_step_end
c>>tpml_step_begin
      :
c>>tpml_send(A)
c>>tpml_send(D)
      :
c>>tpml_step_end
c>>tpml_task_end
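The barrier behaviour of the superstep brackets can be emulated with a short sketch (a Python illustration, not TPML: threads stand in for tasks, and threading.Barrier plays the role of the tpml_step_end synchronisation).

```python
import threading

NUM_TASKS = 3
SUPERSTEPS = 2
barrier = threading.Barrier(NUM_TASKS)   # stands in for tpml_step_end
log = []                                 # list.append is atomic under the GIL

def task(task_id):
    for step_id in range(SUPERSTEPS):
        # ... local computation and sends for this superstep ...
        log.append((task_id, step_id))
        barrier.wait()   # no task proceeds until all have finished the step

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every superstep-0 record precedes every superstep-1 record:
assert all(step == 0 for _, step in log[:NUM_TASKS])
assert all(step == 1 for _, step in log[NUM_TASKS:])
```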
These are included to indicate the end of user code to the TPML parser and to terminate the execution of specific tasks, as required.
c>>tpml_task_begin(master)
c     declaration of all variables and parameters
      generate data
c
c     send out work packets
c --- the data array is being sent to `worker1'
c --- the data array being sent is `command1'
c --- the length of the data array is `length'
c>>tpml_send(worker1, command1, length)
c>>tpml_send(worker2, command2, length)
c
c     program will wait here till a packet arrives
c --- `results1' will become the data array being received
c>>tpml_receive(results1)
c>>tpml_receive(results2)
c     output results
c
      end
c>>tpml_task_end
c
c>>tpml_task_begin(worker1)
c     declaration of all variables and parameters
c
c     program will wait here until a packet arrives
c --- `command' will become the data array being received
c>>tpml_receive(command)
c     handle data
c
c     send back the sorted data
c --- the data array is being sent to `master'
c --- the data array being sent is `results'
c --- the length of the data array is `length'
c>>tpml_send(master, results, length)
c
      end
c>>tpml_task_end
c
c --- this duplicates the code of `worker1' into `worker2'
c>>tpml_task_begin(worker2, worker1)
c>>tpml_task_end
c
c --- this instructs the TPML that the application code is complete
c>>tpml_end
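The same master/worker pattern can be sketched outside TPML (a Python illustration with queues standing in for the routed links; a sort stands in for the "handle data" step of the listing, and the work-packet contents are invented for the example):

```python
import queue
import threading

to_worker = {"worker1": queue.Queue(), "worker2": queue.Queue()}
to_master = queue.Queue()

def worker(name):
    command = to_worker[name].get()       # cf. c>>tpml_receive(command)
    results = sorted(command)             # "handle data": here, a sort
    to_master.put((name, results))        # cf. c>>tpml_send(master, results, length)

def master():
    to_worker["worker1"].put([3, 1, 2])   # cf. c>>tpml_send(worker1, command1, length)
    to_worker["worker2"].put([6, 5, 4])   # cf. c>>tpml_send(worker2, command2, length)
    # cf. c>>tpml_receive(results1) / c>>tpml_receive(results2)
    return dict(to_master.get() for _ in range(2))

workers = [threading.Thread(target=worker, args=(w,)) for w in to_worker]
for t in workers:
    t.start()
out = master()
for t in workers:
    t.join()
assert out == {"worker1": [1, 2, 3], "worker2": [4, 5, 6]}
```

As in the listing, the master blocks on each receive until a result packet arrives, regardless of which worker finishes first.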
The TPML-generated configuration file for the specific physical connectivity of transputers is listed below:
processor HOST
processor ROOT
wire ? ROOT HOST                              ! physical wire connection between
processor P001                                ! the root and host
wire ? P001 ROOT
processor P002
wire ? P002 ROOT
!
Task tafserver Ins=1 Outs=1
Task filter Ins=2 Outs=2 Data=50k
Task multiplexer File=filemux Ins=3 Outs=3 Data=8k
Task tpml_router1 File=router1 Ins=4 Outs=4   ! tpml_router1 has task image file router_1.b4
Task tpml_router2 File=router2 Ins=2 Outs=2   ! tpml_router2 has 2 input and
Task tpml_router3 File=router3 Ins=2 Outs=2   ! output ports
Task step File=step Ins=1 Outs=1 Data=500k
Task master File=master Ins=3 Outs=3 Data=500k
Task worker1 File=worker1 Ins=1 Outs=1 Data=500k
Task worker2 File=worker2 Ins=1 Outs=1 Data=500k
!
Place tafserver Host
Place filter Root
Place master Root                             ! master is placed on the root
Place worker1 p001                            ! processor
Place worker2 p002
Place tpml_router1 Root
Place tpml_router2 p001
Place tpml_router3 p002
!
Connect ? tafserver filter
Connect ? filter tafserver
Connect ? filter multiplexer
Connect ? multiplexer filter
Connect ? multiplexer application_m
Connect ? master multiplexer                  ! logical connection from master
Connect ? multiplexer step                    ! to task multiplexer
Connect ? step multiplexer
Connect ? tpml_router1 master
Connect ? master tpml_router1
Connect ? tpml_router1 step
Connect ? step tpml_router1
Connect ? tpml_router1 tpml_router2           ! tpml_router1 has output port 2,
Connect ? tpml_router2 tpml_router1           ! tpml_router2 has input port 0
Connect ? tpml_router2 worker1
Connect ? worker1 tpml_router2
Connect ? tpml_router1 tpml_router3
Connect ? tpml_router3 tpml_router1
Connect ? tpml_router3 worker2
Connect ? worker2 tpml_router3
The TPML-generated code is placed on the processors according to the Place statements of the configuration file.
TPML derives appropriate switching vectors for each router, which ensures that the optimal route is followed from one task to another, given the placement of these tasks on the processors. The connectivity between the routers mirrors the physical connectivity of the processing nodes, while the logical connectivity of tasks is modelled by the appropriate values of the switching vectors.
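The switching-vector idea can be sketched as follows (a toy model, not the actual TPML implementation: each router holds a table mapping destination task to the next hop, and a packet hops from router to router until it reaches the router owning the destination; the topology mirrors the configuration listing above):

```python
# Toy switching vectors for the topology of the listing:
# tpml_router1 on Root, tpml_router2 on p001, tpml_router3 on p002.
# "local" means the destination task is attached to this router.
switching = {
    "tpml_router1": {"master": "local", "step": "local",
                     "worker1": "tpml_router2", "worker2": "tpml_router3"},
    "tpml_router2": {"worker1": "local", "master": "tpml_router1",
                     "step": "tpml_router1", "worker2": "tpml_router1"},
    "tpml_router3": {"worker2": "local", "master": "tpml_router1",
                     "step": "tpml_router1", "worker1": "tpml_router1"},
}

def route(start_router, dest_task):
    """Follow the switching vectors from start_router until the router
    owning dest_task is reached; return the routers traversed."""
    path, router = [start_router], start_router
    while switching[router][dest_task] != "local":
        router = switching[router][dest_task]
        path.append(router)
    return path

# A packet from worker1 to worker2 must pass through the root router:
assert route("tpml_router2", "worker2") == [
    "tpml_router2", "tpml_router1", "tpml_router3"]
```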