|Look up speedup in Wiktionary, the free dictionary.|
In the field of computer architecture, speedup is a metric for relative performance improvement when executing a task. The notion of speedup was established by Amdahl's law, which was particularly focused in the context of parallel processing. However, speedup can be used more generally to show the effect of any performance enhancement.
Speedup can be defined for two different types of values: throughput and latency. Throughput will be given in the general form of completions per unit of time. In computer architecture the common throughput metric is instructions per cycle, denoted IPC. The reciprocal of this is cycles per instruction or CPI; this is a latency quantity because it is the length of time between successive completions or occurrences. Speedup is defined differently for each type so that it is a consistent metric. One of the most common measurements in computer architecture — the execution time of a program — can be seen as a latency quantity because it is in seconds per program.
For latency values, speedup is defined by the following formula:
- is the resultant speedup.
- is the old execution time, i.e., without the improvement.
- is the new execution time, i.e., with the improvement.
For throughput values, which are also called performance quantities, the enhanced performance will be in the numerator and the original performance will be in the denominator. Notice that speedup is a unit-less quantity (the units cancel). This is because it is a relative quantity, i.e., we are comparing two specific instances of execution. Speedup is only useful when the experimental data is run on the same system, just with the slight tweak for which the speedup test is being run.
Speedup in Parallel Contexts
When applied in the parallel case, speedup can be predicted from Amdahl's Law.
Using Execution Times
We are testing the effectiveness of a branch predictor on the execution of a program. First, we execute the program with the standard branch predictor on the processor, which yields an execution time of 2.25 seconds. Next, we execute the program with our modified (and hopefully improved) branch predictor on the same processor, which produces an execution time of 1.50 seconds. Using our speedup formula, we know
Our new branch predictor has provided a 1.5x speedup over the original.
Using Cycles per Instruction
We have the same circumstance as above, but we are measuring CPI instead. First, we execute the program with the standard branch predictor, which yields a CPI of 2. Next, we execute the program with our modified branch predictor, which yields a CPI of 3. Using our speedup formula, we know
We achieve the same 1.5x speedup, though we measured different quantities.
Let be the speedup for processors. Linear speedup or ideal speedup is obtained when . When running an algorithm with linear speedup, doubling the number of processors doubles the speed. As this is ideal, it is considered very good scalability.
Efficiency is a performance metric defined as
It is a value, typically between zero and one, estimating how well-utilized the processors are in solving the problem, compared to how much effort is wasted in communication and synchronization. Algorithms with linear speedup and algorithms running on a single processor have an efficiency of 1, while many difficult-to-parallelize algorithms have efficiency such as  that approaches zero as the number of processors increases.
In engineering contexts, efficiency is more often used for graphs than speedup, since
- all of the area in the graph is useful (whereas in a speedup curve 1/2 of the space is wasted)
- it is easy to see how well parallelization is working
- there is no need to plot a "perfect speedup" line
In marketing contexts, speedup curves are more often used, largely because they go up and to the right and thus appear better to the less-informed.
Sometimes a speedup of more than p when using p processors is observed in parallel computing, which is called super-linear speedup. Super-linear speedup rarely happens and often confuses beginners, who believe the theoretical maximum speedup should be p when p processors are used.
One possible reason for super-linear speedup is the cache effect resulting from the different memory hierarchies of a modern computer: In parallel computing, not only do the numbers of processors change, but so does the size of accumulated caches from different processors. With the larger accumulated cache size, more or even all of the working set can fit into caches and the memory access time reduces dramatically, which causes the extra speedup in addition to that from the actual computation.
An analogous situation occurs when searching large datasets, such as the genomic data searched by BLAST implementations. There the accumulated RAM from each of the nodes in a cluster enables the dataset to move from disk into RAM thereby drastically reducing the time required by e.g. mpiBLAST to search it.
Super-linear speedups can also occur when performing backtracking in parallel: An exception in one thread can cause several other threads to backtrack early, before they reach the exception themselves.
- Martin, Milo, Performance and Benchmarking, retrieved 5 June 2014
- Hennessy, John L.; David A., Patterson (2012). Computer Architecture: A Quantitive Approach. Waltham, MA: Morgan Kaufmann. pp. 46–47. ISBN 978-0-12-383872-8.
- Baer, Jean-Loup (2010). Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors. New York: Cambridge University Press. p. 10. ISBN 978-0-521-76992-1.
- John Benzi; M. Damodaran (2007). "Parallel Computational Fluid Dynamics 2007: Implementations and Experiences on Large Scale and Grid Computing". Parallel Computational Fluid Dynamics. Springer. p. 95. Retrieved 2013-03-21.