User:Armin Wiebigke/sandbox

From Wikipedia, the free encyclopedia

The two-tree broadcast (abbreviated 2tree-broadcast or 23-broadcast) is an algorithm to implement a broadcast communication pattern. The algorithm can also be adapted to perform a reduction or prefix sum.

Algorithm

A broadcast sends a message from a specified root processor to all other processors. Binary tree broadcasting uses a binary tree to model the communication between the processors. Each processor corresponds to one node in the tree, and the root processor is the root of the tree. To broadcast a message M, the root sends M to its two children. Each processor waits until it receives M and then sends M to its children. The broadcast can be pipelined by splitting the message into k blocks, which are then broadcast consecutively. In such a binary tree, the leaves only receive data and never send any. If communication is bidirectional (full-duplex), meaning each processor can send one message and receive one message at the same time, the leaves use only half of the available bandwidth.
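The following minimal sketch lists the messages of an unpipelined binary tree broadcast and shows which processors never send. The heap-style numbering (children of node i are 2i + 1 and 2i + 2, root at 0) and the function names are assumptions made for illustration; the text above does not prescribe a numbering.

```python
def children(i, p):
    """Children of node i in a heap-numbered binary tree with p nodes (assumed layout)."""
    return [c for c in (2 * i + 1, 2 * i + 2) if c < p]


def broadcast_sends(p):
    """Return the (sender, receiver) pairs of a binary tree broadcast, level by level."""
    sends = []
    frontier = [0]  # the root already holds the message
    while frontier:
        next_frontier = []
        for node in frontier:
            for child in children(node, p):
                sends.append((node, child))
                next_frontier.append(child)
        frontier = next_frontier
    return sends


sends = broadcast_sends(7)
senders = {sender for sender, _ in sends}
# Leaves receive the message but never forward it.
leaves = [i for i in range(7) if i not in senders]
print(sends)
print(leaves)
```

With seven processors, four of them are leaves, so more than half of the processors leave their outgoing link idle.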

Two-tree broadcast with seven processors, including the edge coloring. T1 in red, T2 in blue. The last processor is the root.

The idea of the two-tree broadcast is to use two binary trees T1 and T2 and communicate on both concurrently.[1] The trees are constructed so that the interior nodes of one tree correspond to leaf nodes of the other tree. The data to be broadcast is split into blocks of equal size. In each step of the algorithm, each processor receives one block and sends the previous block to one of its children in the tree in which it is an interior node. A schedule is needed so that no processor has to send or receive two messages in the same step. To create such a schedule, the edges of both trees are colored with 0 and 1 such that

  • no processor is connected to its parent nodes in T1 and T2 using edges of the same color
  • no processor is connected to its child nodes in T1 or T2 using edges of the same color.

Edges with color 0 are used in even steps, edges with color 1 are used in odd steps. This schedule allows each processor to send one message and receive one message in each step, fully utilizing the available bandwidth.
Assume that processor i wants to broadcast a message. The two trees are constructed over the remaining processors. Processor i sends blocks alternately to the roots of the two trees, so each tree broadcasts one half of the message.
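The two coloring conditions can be checked mechanically. The sketch below is an illustration under stated assumptions: each tree is stored as a hypothetical mapping `child -> (parent, color)`, the example covers six tree processors plus broadcast root 6, and the concrete coloring was worked out by hand for this example rather than taken from the source figure. The second condition is checked across both trees at once, which also forces the root's two outgoing edges to differ.

```python
from collections import defaultdict


def is_valid_coloring(t1, t2):
    """Check the two edge-coloring conditions for trees given as child -> (parent, color)."""
    # Condition 1: a processor's parent edges in T1 and T2 have different colors,
    # so it never has to receive two blocks in the same step.
    for node in t1.keys() & t2.keys():
        if t1[node][1] == t2[node][1]:
            return False
    # Condition 2: the edges from a processor to its children have different colors,
    # so it never has to send two blocks in the same step.
    outgoing = defaultdict(list)
    for tree in (t1, t2):
        for _child, (parent, color) in tree.items():
            outgoing[parent].append(color)
    return all(len(colors) == len(set(colors)) for colors in outgoing.values())


# Hand-made example: processors 0..5 in the trees, processor 6 is the broadcast root.
T1 = {3: (6, 0), 1: (3, 0), 5: (3, 1), 0: (1, 1), 2: (1, 0), 4: (5, 0)}
T2 = {2: (6, 1), 4: (2, 1), 0: (2, 0), 5: (4, 0), 3: (4, 1), 1: (0, 1)}
print(is_valid_coloring(T1, T2))
```

Edges colored 0 would then be used in even steps and edges colored 1 in odd steps.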

Analysis

(maybe different communication models)

Let p be the number of processes, numbered from 0 to p - 1.

Construction of the trees

Two-trees of size 6, 12, 7, 9 using mirroring (top) and shifting (bottom). T1 in red, T2 in blue.

Let h = ⌈log(p + 2)⌉. T1 and T2 can be constructed as trees of height h - 1, such that both trees form an in-order numbering of the processors, with the following method:

If $p = 2^h - 2$, T1 is a complete binary tree of height h - 1 except that the rightmost leaf is missing. Otherwise, T1 consists of a complete binary tree of height h - 2 covering processing elements (PEs) $[0, 2^{h-1} - 2]$, a recursively constructed tree covering PEs $[2^{h-1}, p - 1]$, and a root at PE $2^{h-1} - 1$ whose children are the roots of the left and the right subtree.
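The recursive rule for T1 can be sketched as follows. This is an illustration, not the reference implementation: the function names and the parent-map representation are hypothetical, and the sketch places the root of each range at the largest offset of the form 2^k - 1 that lies inside the range. That coincides with the rule above whenever the prescribed root index exists; how to handle the remaining case (p + 1 a power of two, where the result is simply a complete tree) is an assumption of this sketch.

```python
def build_inorder_tree(lo, n, parent, parents):
    """Build an in-order numbered tree on PEs lo..lo+n-1; record each node's parent.

    Returns the root of the subtree, or None if the range is empty.
    """
    if n == 0:
        return None
    # Largest offset of the form 2^k - 1 inside the range (assumed edge-case handling).
    root = lo + (1 << (n.bit_length() - 1)) - 1
    parents[root] = parent
    # Left part: a complete binary tree covering lo..root-1.
    build_inorder_tree(lo, root - lo, root, parents)
    # Right part: recursively constructed tree covering root+1..lo+n-1.
    build_inorder_tree(root + 1, lo + n - root - 1, root, parents)
    return root


def build_t1(p):
    """Return (root, parents) for T1 on p processors; parents[root] is None."""
    parents = {}
    root = build_inorder_tree(0, p, None, parents)
    return root, parents


root, parents = build_t1(6)
print(root, parents)
```

For p = 6 this yields a root at PE 3 with interior nodes 1, 3 and 5, which is the complete tree of height 2 without its rightmost leaf.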

There are two ways to construct T2. With shifting, T2 is first constructed like T1, except that it contains an additional processor. Then T2 is shifted by one position to the left and the leftmost leaf is removed. With mirroring, T2 is the mirror image of T1 (with the mirror axis between processes p/2-1 and p/2). Mirroring only works for even p.
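As a small illustration of mirroring, the sketch below reflects a parent map of T1 for p = 6 (written out by hand from the construction rule) and checks that every processor is an interior node in exactly one of the two trees. The parent-map representation and helper names are assumptions made for this example.

```python
def mirror(parents, p):
    """Mirror an in-order numbered tree: PE v becomes PE p - 1 - v."""
    def flip(v):
        return None if v is None else p - 1 - v
    return {flip(child): flip(parent) for child, parent in parents.items()}


def interior_nodes(parents):
    """Nodes that appear as a parent, i.e. nodes with at least one child."""
    return {parent for parent in parents.values() if parent is not None}


# T1 for p = 6: complete binary tree of height 2 without its rightmost leaf.
T1 = {0: 1, 1: 3, 2: 1, 3: None, 4: 5, 5: 3}
T2 = mirror(T1, 6)

print(sorted(interior_nodes(T1)))
print(sorted(interior_nodes(T2)))
```

The interior nodes of T1 are the odd-numbered PEs and those of T2 the even-numbered PEs, so the two sets are disjoint and together cover all six processors.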

It can be proven that a coloring with the desired properties exists for all p.[1] When mirroring is used to construct T2, each processor can independently compute the color of its incident edges in O(log p) time.[1]

Communication time

Communication model: A message of size n has a communication time of α + βn, where α represents the startup overhead to send the message and β represents the transmission time per data element.[2]

Suppose the message of size m is split into 2k blocks. Each communication step takes time $\alpha + \beta m/(2k)$. Let h = log p be the height of the communication structure with the root at processor i and the two trees below it. After 2h steps, the first data block has reached every node in both trees. Afterwards, each processor receives one block in every step until it has received all blocks. The total number of steps is 2h + 2k, resulting in a total communication time of $(2h + 2k)(\alpha + \beta m/(2k))$. Using an optimal $k = k^* = \sqrt{\beta m h/(2\alpha)}$, the total communication time is $\beta m + 2\alpha \log p + \sqrt{8\alpha\beta m \log p}$.
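The optimal block count follows from a short calculation. The derivation below treats k as a real number, which is a simplifying assumption; in practice k would be rounded to an integer.

```latex
\begin{align*}
T(k) &= (2h + 2k)\left(\alpha + \frac{\beta m}{2k}\right)
      = \beta m + 2\alpha h + 2\alpha k + \frac{\beta m h}{k} \\
\frac{dT}{dk} &= 2\alpha - \frac{\beta m h}{k^2} = 0
      \quad\Longrightarrow\quad k^* = \sqrt{\frac{\beta m h}{2\alpha}} \\
T(k^*) &= \beta m + 2\alpha h + 2\sqrt{2\alpha\beta m h}
      = \beta m + 2\alpha \log p + \sqrt{8\alpha\beta m \log p}
\end{align*}
```

The last step substitutes h = log p. At the optimum, the two k-dependent terms are equal: each contributes $\sqrt{2\alpha\beta m h}$.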

Comparison to similar algorithms

In a linear pipeline broadcast, the message is split into k blocks. In each step, each processor i receives one block from processor i-1 and sends one block to processor i+1. The linear pipeline has optimal throughput, but its startup time is in O(p).[3] For large p, the O(log p) startup time of the two-tree broadcast is lower, while the throughput is identical.

A binomial tree broadcast communicates along a binomial tree. Each process receives the message being broadcast (the root already has the message) and then sends the message to its children. A binomial tree broadcast has only half the startup time of the two-tree broadcast, but a factor of log(p) more communication.[4] The binomial tree broadcast is faster than the two-tree broadcast for small messages, but slower for large messages.
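A binomial tree broadcast can be simulated in a few lines. The sketch assumes the common distance-doubling layout with root 0 (in each round, every processor that already holds the message sends it to the processor `dist` positions ahead); this layout and the function name are assumptions, since the text does not fix a numbering.

```python
def binomial_broadcast_rounds(p):
    """Return, per round, the (sender, receiver) pairs of a binomial tree broadcast from 0."""
    have = {0}  # processors that already hold the message
    rounds = []
    dist = 1
    while dist < p:
        sends = [(src, src + dist) for src in sorted(have) if src + dist < p]
        have.update(dst for _, dst in sends)
        rounds.append(sends)
        dist *= 2  # the set of informed processors doubles every round
    return rounds


for number, sends in enumerate(binomial_broadcast_rounds(8)):
    print(number, sends)
```

Every round forwards the whole message, so the log p rounds each cost a full message transfer; this is the extra communication factor mentioned above.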

Fibonacci trees of height one to five

A pipelined binary tree broadcast splits the message into k blocks and broadcasts the blocks consecutively over a binary tree. By using a Fibonacci tree instead of a simple complete binary tree, the startup latency can be reduced to αlog(p).[5] A Fibonacci tree of height h consists of a root that has a Fibonacci tree of height h-1 as its left child and a Fibonacci tree of height h-2 as its right child. The pipelined Fibonacci tree broadcast has half the startup time of the two-tree broadcast, but also only half of the bandwidth. It is faster for small messages, while the two-tree broadcast is faster for large messages.
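The recursive definition determines how many processors a Fibonacci tree of a given height covers. The sketch below assumes base cases that the text does not state: a tree of height 1 is a single node and a tree of height 0 is empty. Other conventions shift the numbers by one height.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fibonacci_tree_size(height):
    """Number of nodes in a Fibonacci tree (base cases are assumptions of this sketch)."""
    if height <= 0:
        return 0  # assumed: empty tree
    if height == 1:
        return 1  # assumed: single node
    # Root plus a left subtree of height h-1 and a right subtree of height h-2.
    return 1 + fibonacci_tree_size(height - 1) + fibonacci_tree_size(height - 2)


sizes = [fibonacci_tree_size(h) for h in range(1, 6)]
print(sizes)
```

Under these assumptions the sizes grow like the Fibonacci numbers, so the height needed to cover p processors is logarithmic in p.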

ESBT

If p is a power of two, there is an elegant optimal broadcasting algorithm[11] based on log p edge-disjoint spanning binomial trees (ESBT) in a hypercube.

Applications

Usage for other communication primitives

Reduction

Two-tree reduction with seven processors. The last processor is the root. T1 in red, T2 in blue.

A reduction (MPI_Reduce) computes $\bigoplus_{i=0}^{p-1} M_i$, where $M_i$ is a vector of length m originally available at processor i and $\oplus$ is a binary operation that is associative, but not necessarily commutative. The result is stored at a specified root processor r.

Assume that r = 0 or r = p-1. In this case the communication is identical to the broadcast, except that the communication direction is reversed. Each process receives two blocks from its children, reduces them with its own block, and sends the result to its parent. The root takes turns receiving blocks from the roots of T1 and T2 and reduces them with its own data. The communication time is the same as for the broadcast, and the amount of data reduced per processor is 2m.
If the reduce operation is commutative, the result can be achieved for any root by renumbering the processors. If the operation is not commutative and the root is not 0 or p-1, then 2βm is a lower bound for the communication time.[1] In this case the result is first computed with processor 0 as the root, and then the result is sent to processor r.
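The in-order numbering is what keeps a non-commutative reduction correct: every node combines the result of its left subtree, its own value, and the result of its right subtree, in that order. The sketch below is a sequential simulation on a single tree (T1 for six processors, written out by hand, with processor 6 as the reduction root), using string concatenation as an associative but non-commutative operation. The data layout and names are assumptions made for illustration; blocks, pipelining and the second tree are left out.

```python
def tree_reduce(node, children, values, op):
    """Reduce the subtree rooted at node in in-order: left result, own value, right result."""
    left, right = children.get(node, (None, None))
    acc = values[node]
    if left is not None:
        acc = op(tree_reduce(left, children, values, op), acc)
    if right is not None:
        acc = op(acc, tree_reduce(right, children, values, op))
    return acc


# T1 for processors 0..5 as node -> (left child, right child); the tree root is PE 3.
children = {3: (1, 5), 1: (0, 2), 5: (4, None)}
values = ["a", "b", "c", "d", "e", "f", "g"]  # value held by each processor 0..6


def concat(x, y):
    return x + y


partial = tree_reduce(3, children, values, concat)
# The reduction root r = p - 1 = 6 appends its own value last.
result = concat(partial, values[6])
print(result)
```

Because the last processor holds the rightmost operand, combining its value at the end preserves the operand order.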

Prefix sum

A prefix sum (MPI_Scan) computes $\bigoplus_{i=0}^{j} M_i$ for each processor j, where $M_i$ is a vector of length m originally available at processor i and $\oplus$ is a binary associative operation. Using an in-order binary tree, a prefix sum can be computed by first performing an up-phase in which each interior node computes a partial sum $\bigoplus_{i=l}^{r} M_i$ for its leftmost and rightmost leaves l and r, followed by a down-phase in which prefixes of the form $\bigoplus_{i=0}^{l-1} M_i$ are sent down the tree and allow each processor to finish computing its prefix sum.[6][1] The communication in the up-phase is equivalent to a reduction to processor 0, and the communication in the down-phase is equivalent to a broadcast from processor 0. The total communication time is about twice the communication time of the two-tree broadcast.[1]
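The up-phase and down-phase can be sketched as a sequential simulation on one in-order tree. The example reuses a hand-written tree for six processors and string concatenation as the operation; the function name, the child-map layout and the use of `None` for an empty prefix are assumptions of this sketch, and no actual message passing or pipelining is modeled.

```python
def scan_on_inorder_tree(root, children, values, op):
    """Inclusive prefix sums over an in-order numbered tree (sequential sketch)."""
    left_sum = {}  # partial sum of each node's left subtree, kept from the up-phase

    def up(node):
        left, right = children.get(node, (None, None))
        total = values[node]
        if left is not None:
            left_sum[node] = up(left)
            total = op(left_sum[node], total)
        if right is not None:
            total = op(total, up(right))
        return total  # partial sum over all PEs in this subtree

    result = {}

    def down(node, prefix):
        # prefix is the sum of everything left of this subtree (None if empty).
        left, right = children.get(node, (None, None))
        here = values[node]
        if node in left_sum:
            here = op(left_sum[node], here)
        if prefix is not None:
            here = op(prefix, here)
        result[node] = here
        if left is not None:
            down(left, prefix)  # the left subtree starts at the same position
        if right is not None:
            down(right, here)   # the right subtree starts just after this node

    up(root)
    down(root, None)
    return [result[i] for i in range(len(values))]


children = {3: (1, 5), 1: (0, 2), 5: (4, None)}
prefixes = scan_on_inorder_tree(3, children, list("abcdef"), lambda x, y: x + y)
print(prefixes)
```

Each processor ends up with the concatenation of all values up to and including its own, in processor order.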

References

  1. Sanders, Peter; Speck, Jochen; Träff, Jesper Larsson (2009). "Two-tree algorithms for full bandwidth broadcast, reduction and scan". Parallel Computing. 35 (12): 581–594. doi:10.1016/j.parco.2009.09.001. ISSN 0167-8191.
  2. Hockney, Roger W. (1994). "The communication challenge for MPP: Intel Paragon and Meiko CS-2". Parallel Computing. 20 (3): 389–398. doi:10.1016/S0167-8191(06)80021-9. ISSN 0167-8191.
  3. Pješivac-Grbović, Jelena; Angskun, Thara; Bosilca, George; Fagg, Graham E.; Gabriel, Edgar; Dongarra, Jack J. (2007). "Performance analysis of MPI collective operations". Cluster Computing. 10 (2): 127–143. doi:10.1007/s10586-007-0012-0. ISSN 1386-7857.
  4. Chan, Ernie; Heimlich, Marcel; Purkayastha, Avi; Van De Geijn, Robert (2007). "Collective Communication: Theory, Practice, and Experience". Concurrency and Computation: Practice and Experience. 19 (13): 1749–1783. doi:10.1002/cpe.v19:13. ISSN 1532-0626.
  5. Bruck, Jehoshua; Cypher, Robert; Ho, C-T (1992). "Multiple message broadcasting with generalized Fibonacci trees". Parallel and Distributed Processing, 1992. Proceedings of the Fourth IEEE Symposium on. IEEE: 424–431. doi:10.1109/SPDP.1992.242714.
  6. Sanders, Peter; Träff, Jesper Larsson (2006). "Parallel Prefix (Scan) algorithms for MPI". European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting. 4192. Springer: 49–57. doi:10.1007/11846802_15. ISSN 0302-9743.
