Bulk synchronous parallel
The Bulk Synchronous Parallel (BSP) abstract computer is a bridging model for designing parallel algorithms. A bridging model "is intended neither as a hardware nor a programming model but something in between".[1] It serves a purpose similar to the Parallel Random Access Machine (PRAM) model. BSP differs from PRAM by not taking communication and synchronization for granted. An important part of analysing a BSP algorithm rests on quantifying the synchronisation and communication needed.
BSP was developed by Leslie Valiant during the 1980s. The definitive article [1] was published in 1990.
The model
A BSP computer consists of processors connected by a communication network. Each processor has a fast local memory, and may follow different threads of computation. A BSP computation proceeds in a series of global supersteps. A superstep consists of three components:
- Concurrent computation: Several computations take place on every participating processor. Each process uses only values stored in the local memory of the processor. The computations are independent in the sense that each proceeds asynchronously with respect to all the others.
- Communication: The processes exchange data between themselves. This exchange takes the form of one-sided put and get calls, rather than two-sided send and receive calls.
- Barrier synchronisation: When a process reaches this point (the barrier), it waits until all other processes have finished their communication actions.
The computation and communication actions do not have to be ordered in time. The barrier synchronization concludes the superstep: it has the function of ensuring that all one-sided communications are properly concluded. This global synchronization is not needed in models based on two-sided communication, since these synchronize processes implicitly.
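The superstep structure described above can be sketched as a small sequential simulation. This is only an illustration under stated assumptions, not a BSP library: `run_superstep`, `shift_right` and the `Put` tuple are hypothetical names introduced here, and the "processes" are simulated one after another rather than run in parallel. The point it shows is that one-sided puts are buffered during the superstep and become visible only once the barrier has been reached.

```python
from typing import Callable, Dict, List, Tuple

# A pending one-sided write: (destination process, key, value).
Put = Tuple[int, str, int]

NUM_PROCESSES = 4  # assumed size of the simulated machine


def run_superstep(
    memories: List[Dict[str, int]],
    step: Callable[[int, Dict[str, int]], List[Put]],
) -> None:
    """Run one superstep over all simulated processes.

    `step(pid, local)` computes using only `local` and returns the puts it
    wants to issue. Puts are buffered and applied only after every process
    has finished, which plays the role of the barrier.
    """
    pending: List[Put] = []
    for pid, local in enumerate(memories):
        # Computation phase: each process sees only its own local memory.
        pending.extend(step(pid, local))
    # Barrier reached: the buffered puts now take effect.
    for dest, key, value in pending:
        memories[dest][key] = value


def shift_right(pid: int, local: Dict[str, int]) -> List[Put]:
    """Hypothetical step: put this process's value into its right neighbour."""
    return [((pid + 1) % NUM_PROCESSES, "received", local["x"])]


memories = [{"x": 10 * pid} for pid in range(NUM_PROCESSES)]
run_superstep(memories, shift_right)
print([m["received"] for m in memories])  # [30, 0, 10, 20]
```

Because the puts are applied only after the loop over all processes, no process can observe a value written during the same superstep, which is the guarantee the barrier provides in the model.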
The figure below shows this in a diagrammatic form. The processes are not regarded as having a particular linear order (from left to right or otherwise), and may be mapped to processors in any way.
A further aspect of the BSP model is that of overdecomposition of the problem and oversubscription of the processors: the problem is divided into more logical processes than there are physical processors, and processes are randomly assigned to processors. This strategy can be shown statistically to lead to almost perfect load balancing, both of work and communication.
Communication
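The effect of random assignment can be explored with a short experiment. The sketch below makes simplifying assumptions that are not in the text: every logical process has the same unit cost, only work (not communication) is counted, and `imbalance` is a hypothetical helper name. It reports the ratio of the most heavily loaded processor to the average load for increasing degrees of overdecomposition.

```python
import random
from collections import Counter


def imbalance(num_processes: int, num_processors: int, seed: int = 0) -> float:
    """Return (maximum load) / (mean load) after random assignment.

    Each logical process is assumed to cost one unit of work.
    """
    rng = random.Random(seed)
    load = Counter(rng.randrange(num_processors) for _ in range(num_processes))
    mean = num_processes / num_processors
    return max(load.values()) / mean


# More logical processes per processor should push the ratio towards 1.
for factor in (1, 10, 100, 1000):
    print(factor, round(imbalance(64 * factor, 64), 2))
```

A ratio of 1 would mean perfectly even load; the experiment is only suggestive and does not replace the statistical argument referred to above.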
In many parallel programming systems, communications are considered at the level of individual actions: sending and receiving a message, memory to memory transfer, etc. This is difficult to work with, since there are many simultaneous communication actions in a parallel program, and their interactions are typically complex. In particular, it is difficult to say much about the time any single communication action will take to complete.
The BSP model considers communication actions en masse. This has the effect that an upper bound on the time taken to communicate a set of data can be given. BSP considers all communication actions of a superstep as one unit, and assumes all messages have a fixed size.
The maximum number of incoming or outgoing messages for a superstep is denoted by $h$. The ability of a communication network to deliver data is captured by a parameter $g$, defined such that it takes time $hg$ for a processor to deliver $h$ messages of size 1.
A message of length $m$ obviously takes longer to send than a message of size 1. However, the BSP model does not make a distinction between a message length of $m$ or $m$ messages of length 1. In either case the cost is said to be $mg$.
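As a concrete illustration of this accounting, the sketch below computes the communication cost of one superstep from a table of message counts. It writes $h$ for the maximum number of incoming or outgoing unit-size messages at any process and $g$ for the network parameter, and it takes "incoming or outgoing" to mean the larger of the two counts, which is one reading of the definition. The names `h_relation` and `communication_cost` are hypothetical.

```python
from typing import List


def h_relation(messages: List[List[int]]) -> int:
    """Return h for one superstep.

    messages[i][j] is the number of unit-size messages that process i
    sends to process j during the superstep.
    """
    p = len(messages)
    sent = [sum(messages[i]) for i in range(p)]
    received = [sum(messages[i][j] for i in range(p)) for j in range(p)]
    return max(max(sent), max(received))


def communication_cost(messages: List[List[int]], g: float) -> float:
    """Cost of the communication phase: h * g."""
    return h_relation(messages) * g


# Three processes; process 2 receives 4 messages, the largest count.
messages = [
    [0, 2, 1],
    [0, 0, 3],
    [1, 0, 0],
]
print(h_relation(messages))               # 4
print(communication_cost(messages, 2.5))  # 10.0
```

A single message of length 3 from process 1 to process 2 is counted here exactly like three messages of length 1, matching the model's treatment of message length.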
The parameter $g$ is dependent on the following factors:
- The protocols used to interact within the communication network.
- Buffer management by both the processors and the communication network.
- The routing strategy used in the network.
- The BSP runtime system.
A value for $g$ is, in practice, determined empirically for each parallel computer. Note that $g$ is not the normalised single-word delivery time, but the single-word delivery time under continuous traffic conditions.
Barriers
The one-sided communication of the BSP model requires a global barrier synchronization. Barriers are potentially costly, but have a number of attractions. They do not introduce the possibility of deadlock or livelock, since barriers do not create circular data dependencies. Therefore tools to detect and deal with them are unnecessary. Barriers also permit novel forms of fault tolerance.
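The behaviour of a global barrier can be demonstrated with Python's standard `threading.Barrier`. This is a minimal sketch using threads as stand-ins for BSP processes; the constants and the `worker` function are illustrative assumptions, and appending to a log stands in for real computation and communication.

```python
import threading
from typing import List, Tuple

NUM_PROCESSES = 4
NUM_SUPERSTEPS = 3

barrier = threading.Barrier(NUM_PROCESSES)
log: List[Tuple[int, int]] = []  # (superstep, process id) in order of execution
log_lock = threading.Lock()


def worker(pid: int) -> None:
    for step in range(NUM_SUPERSTEPS):
        with log_lock:
            log.append((step, pid))  # stand-in for this superstep's work
        # Wait until every process has finished the current superstep.
        barrier.wait()


threads = [threading.Thread(target=worker, args=(pid,)) for pid in range(NUM_PROCESSES)]
for t in threads:
    t.start()
for t in threads:
    t.join()

steps = [step for step, _ in log]
print(steps == sorted(steps))  # True: no process starts step k+1 before all finish step k
```

Within a superstep the order of the processes in the log may vary from run to run, but entries from different supersteps never interleave.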
The cost of barrier synchronization is influenced by two issues:
- The cost imposed by the variation in the completion time of the participating concurrent computations. Take the example where all but one of the processes have completed their work for this superstep, and are waiting for the last process, which still has a lot of work to complete. The best that an implementation can do is ensure that each process works on roughly the same problem size.
- The cost of reaching a globally consistent state in all of the processors. This depends on the communication network, but also on whether there is special-purpose hardware available for synchronizing, and on the way in which interrupts are handled by processors.
The cost of a barrier synchronization is denoted by $l$. In practice, a value of $l$ is determined empirically.
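The text does not say how the empirical values are obtained. One plausible approach, sketched below as an assumption rather than a prescribed method, is to time supersteps that do no local computation for several message counts and fit a straight line: the slope estimates the network parameter (written $g$ here) and the intercept estimates the barrier cost (written $l$ here). The function name `fit_g_and_l` is hypothetical and the timings are made-up numbers, not measurements of any real machine.

```python
from typing import List, Tuple


def fit_g_and_l(hs: List[float], times: List[float]) -> Tuple[float, float]:
    """Least-squares fit of times ~ g * h + l; returns (g, l)."""
    n = len(hs)
    mean_h = sum(hs) / n
    mean_t = sum(times) / n
    cov = sum((h - mean_h) * (t - mean_t) for h, t in zip(hs, times))
    var = sum((h - mean_h) ** 2 for h in hs)
    g = cov / var
    l_sync = mean_t - g * mean_h
    return g, l_sync


# Synthetic timings for illustration only, generated from t = 2h + 50.
hs = [0.0, 100.0, 200.0, 400.0, 800.0]
times = [50.0, 250.0, 450.0, 850.0, 1650.0]

g, l_sync = fit_g_and_l(hs, times)
print(g, l_sync)
```

With real measurements the points would not lie exactly on a line, and the traffic pattern used for timing would need to reflect continuous traffic conditions.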
The presence of barriers makes the BSP model mostly a theoretical one: barriers are expensive on large computers, and increasingly so as the scale grows.[2] In fact, there is a large body of literature[3] on removing synchronization points from existing algorithms.
The cost of a BSP algorithm
The cost of a superstep is determined as the sum of three terms: the cost of the longest running local computation, the cost of global communication between the processors, and the cost of the barrier synchronisation at the end of the superstep. The cost of one superstep for $p$ processors is
$$\max_{i=1}^{p}(w_i) + \max_{i=1}^{p}(h_i g) + l$$
where $w_i$ is the cost for the local computation in process $i$, and $h_i$ is the number of messages sent or received by process $i$. Note that homogeneous processors are assumed here. It is more common for the expression to be written as $w + hg + l$, where $w$ and $h$ are maxima. The cost of the algorithm is then the sum of the costs of each superstep:
$$W + Hg + Sl = \sum_{s=1}^{S} w_s + g \sum_{s=1}^{S} h_s + Sl$$
where $S$ is the number of supersteps.
$W$, $H$, and $S$ are usually modelled as functions that vary with problem size. These three characteristics of a BSP algorithm are usually described in terms of asymptotic notation, e.g. $H \in O(n/p)$.
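The cost accounting can be checked numerically with a small sketch. It writes $w_i$ and $h_i$ for the per-process computation cost and message count, $g$ for the network parameter and $l$ for the barrier cost, and computes both the per-superstep cost and the total over all supersteps. The function names and the numbers are illustrative assumptions.

```python
from typing import List, Tuple

# One superstep: (w_i for each process, h_i for each process).
Superstep = Tuple[List[float], List[int]]


def superstep_cost(w: List[float], h: List[int], g: float, l_sync: float) -> float:
    """max_i(w_i) + max_i(h_i) * g + l for one superstep."""
    return max(w) + max(h) * g + l_sync


def algorithm_cost(supersteps: List[Superstep], g: float, l_sync: float) -> float:
    """Sum of the costs of all supersteps."""
    return sum(superstep_cost(w, h, g, l_sync) for w, h in supersteps)


# Made-up example with three processes and two supersteps.
supersteps = [
    ([100.0, 120.0, 90.0], [4, 2, 3]),
    ([50.0, 40.0, 60.0], [1, 5, 2]),
]
g, l_sync = 2.0, 30.0

total = algorithm_cost(supersteps, g, l_sync)

# The same total in aggregate form: W + H*g + S*l.
W = sum(max(w) for w, _ in supersteps)
H = sum(max(h) for _, h in supersteps)
S = len(supersteps)
print(total, W + H * g + S * l_sync)  # 258.0 258.0
```

The two printed values agree because the per-superstep maxima are summed before being scaled by the machine parameters, which is what the aggregate form expresses.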
Extensions and uses
BSP has been extended by many authors to address concerns about BSP's unsuitability for modelling specific architectures or computational paradigms. One example of this is the decomposable BSP model. The model has also been used in the creation of a number of new programming languages, including BSML (Bulk Synchronous Parallel ML), and programming models, including BSPLib,[4] Apache Hama, Apache Giraph, and Pregel.[5]
See also
- Computer cluster
- Concurrent computing
- Concurrency
- Dataflow programming
- Grid computing
- Parallel computing
- ScientificPython
- LogP machine
- Automatic mutual exclusion
References
- Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33 Issue 8, Aug. 1990
- http://jointlab.ncsa.illinois.edu/events/workshop3/pdf/presentations/Gropp-Update-on-Libraries.pdf
- E.F. D'Azevedo and V.L. Eijkhout and C.H. Romine, A Matrix Framework for Conjugate Gradient Methods and Some Variants of CG with Less Synchronization Overhead, Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, 1993, pp. 644-646
- BSPlib
- Pregel
External links
- D.B. Skillicorn, Jonathan Hill, W. F. McColl, Questions and answers about BSP (1996)
- BSP Worldwide
- BSP related papers
- WWW Resources on BSP Computing
- (French) Bulk Synchronous Parallel ML ((English) official website)