User:Henrik/sandbox/Parallel computing notes
Understanding data dependencies is one of the foundations of knowing how to implement parallel algorithms. No program can run quicker than the longest chain of dependent calculations (known as the critical path), since the fact that the calculations are dependent introduces an ordering of executions. Fortunately, most algorithms do not consist of a long chain of dependent calculations and little else, but instead there are many opportunities for executing independent calculations in parallel.
Instruction level parallelism
Let Pi and Pj be two program fragments. Bernstein’s conditions describe when the two are independent and can be executed in parallel. Let Ii be all of the input variables to Pi and Oi the output variables, and likewise for Pj. P i and Pj are independent if they satisfy
Violation of the ﬁrst condition introduces a ﬂow dependency, corresponding to ﬁrst statement producing a result used by the second statement. The second condition represents an anti-dependency, when the ﬁrst statement overwrites a variable needed by the second expression. The third, and ﬁnal condition q is an output dependency. When two variables write to the same location, the ﬁnal output must be the one of the second statement.
Consider the following function:
1: function Dep(a, b) 2: c := a·b 3: d := 2·c 4: end function
Operation 2 in Dep(a,b) can't be executed before (or even in parallel) operation 3, due to the fact that operation 3 uses a result from operation 2. It violates condition 1, and thus introduces a ﬂow dependency.
1: function NoDep(a, b) 2: c := a·b 3: d := 2·b 4: e := a+b 5: end function
In this example there are no dependencies between the instructions, so they can all be run in parallel.
Pipelining is a method for simultaneously executing several multi-stage problems using one execution unit, similar to a factory assembly line. Consider an algorithm with a 7 step dependency chain which should be applied to a set of data. This could be executed using 7 execution units, each executing one of the instructions. It can be observed that since there are data dependencies between every stage in our calculation, most of the hardware will sit unused unless pipelining is used. In a pipelined architecture, multiple pieces of data ﬂow through the calculation steps simultaneously. As soon as one stage of the calculation is ﬁnished it moves on to the next step, making room for the result of the previous execution step to move into the newly vacated one. After an initial latency of filling the pipeline, the program can produce a new result on every step, assuming the pipeline can be kept full.
Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action the processor performs on that instruction in that stage. In other words, a processor with N pipeline stages can have up to N different instructions at different stages of completion. The canonical example of a pipelined processor is a RISC processor with five stages: instruction fetch, decode, execute, memory access, write back. The Pentium 4 processor had a 35-stage pipeline.
If the same sequence of operations is run over and over again with different data values, another way of lowering the total run time is to duplicate the execution units and have each execution unit process a subset of the data.
<<< figure here >>>
- K. Hwang and F. A. Briggs. Computer architecture and parallel processing. McGraw-Hill, 1984.
- Yale Patt. "The Microprocessor Ten Years From Now: What Are The Challenges, How Do We Meet Them? (wmv). Distinguished Lecturer talk at Carnegie Mellon University, April 2004. Retrieved on November 7, 2007.