# Speedup

"Speed up" redirects here. For other uses, see Speed Up (disambiguation).
Not to be confused with velocity.

In the field of computer architecture, speedup is a metric for relative performance improvement when executing a task. The notion of speedup was established by Amdahl's law, which was particularly focused in the context of parallel processing. However, speedup can be used more generally to show the effect of any performance enhancement.

## Definition

Speedup can be defined for two different types of values: throughput and latency.[1] Throughput will be given in the general form of completions per unit of time. In computer architecture the common throughput metric is instructions per cycle, denoted IPC. The reciprocal of this is cycles per instruction or CPI; this is a latency quantity because it is the length of time between successive completions or occurrences. Speedup is defined differently for each type so that it is a consistent metric. One of the most common measurements in computer architecture — the execution time of a program — can be seen as a latency quantity because it is in seconds per program.

For latency values, speedup is defined by the following formula:[2]

$S = \frac{T_{old}}{T_{new}}$

where:

• $S \$ is the resultant speedup.
• $T_{old} \$ is the old execution time, i.e., without the improvement.
• $T_{new} \$ is the new execution time, i.e., with the improvement.

For throughput values, which are also called performance quantities, the enhanced performance will be in the numerator and the original performance will be in the denominator.[3] Notice that speedup is a unit-less quantity (the units cancel). This is because it is a relative quantity, i.e., we are comparing two specific instances of execution. Speedup is only useful when the experimental data is run on the same system, just with the slight tweak for which the speedup test is being run.

$S = \frac{P_{new}}{P_{old}}$

where:

• $S \$ is the resultant speedup.
• $P_{old} \$ is the old performance, i.e., without the improvement.
• $P_{new} \$ is the new performance, i.e., with the improvement.

### Speedup in Parallel Contexts

When applied in the parallel case, speedup can be predicted from Amdahl's Law.

## Examples

### Using Execution Times

We are testing the effectiveness of a branch predictor on the execution of a program. First, we execute the program with the standard branch predictor on the processor, which yields an execution time of 2.25 seconds. Next, we execute the program with our modified (and hopefully improved) branch predictor on the same processor, which produces an execution time of 1.50 seconds. Using our speedup formula, we know

$S = \frac{T_{old}}{T_{new}} = \frac{2.25 \ \mathrm{s}}{1.50 \ \mathrm{s}} = 1.5$

Our new branch predictor has provided a 1.5x speedup over the original.

### Using Cycles per Instruction

We have the same circumstance as above, but we are measuring Cycles per Instruction (CPI) instead. First, we execute the program with the standard branch predictor, which yields a CPI of 3. Next, we execute the program with our modified branch predictor, which yields a CPI of 2. Using our speedup formula, we know

$S = \frac{CPI_{old}}{CPI_{new}} = \frac{3 \ \mathrm{CPI}}{2 \ \mathrm{CPI}} = 1.5$

We can also perform a calculation using the performance metric Instructions per Cycle (IPC), which is the inverse of CPI. To calculate speedup using IPC (performance) instead of CPI (latency), we know

$S = \frac{IPC_{new}}{IPC_{old}} = \frac{0.333 \ \mathrm{IPC}}{0.5 \ \mathrm{IPC}} = 1.5$

We achieve the same 1.5x speedup, though we measured different quantities.

Let $S_p$ be the speedup for $p$ processors. Linear speedup or ideal speedup is obtained when $\,S_p = p$. When running an algorithm with linear speedup, doubling the number of processors doubles the speed. As this is ideal, it is considered very good scalability.

Efficiency is a performance metric defined as

$E_p = \frac{S_p}{p} = \frac{T_1}{pT_p}$.

It is a value, typically between zero and one, estimating how well-utilized the processors are in solving the problem, compared to how much effort is wasted in communication and synchronization. Algorithms with linear speedup and algorithms running on a single processor have an efficiency of 1, while many difficult-to-parallelize algorithms have efficiency such as $\frac{1}{\ln p}$[citation needed] that approaches zero as the number of processors increases.

In engineering contexts, efficiency is more often used for graphs than speedup, since

• all of the area in the graph is useful (whereas in a speedup curve 1/2 of the space is wasted)
• it is easy to see how well parallelization is working
• there is no need to plot a "perfect speedup" line

In marketing contexts, speedup curves are more often used, largely because they go up and to the right and thus appear better to the less-informed.

## Super-linear speedup

Sometimes a speedup of more than p when using p processors is observed in parallel computing, which is called super-linear speedup. Super-linear speedup rarely happens and often confuses beginners, who believe the theoretical maximum speedup should be p when p processors are used.

One possible reason for super-linear speedup in low-level computations is the cache effect resulting from the different memory hierarchies of a modern computer: In parallel computing, not only do the numbers of processors change, but so does the size of accumulated caches from different processors. With the larger accumulated cache size, more or even all of the working set can fit into caches and the memory access time reduces dramatically, which causes the extra speedup in addition to that from the actual computation.[4]

An analogous situation occurs when searching large datasets, such as the genomic data searched by BLAST implementations. There the accumulated RAM from each of the nodes in a cluster enables the dataset to move from disk into RAM thereby drastically reducing the time required by e.g. mpiBLAST to search it.

Super-linear speedups can also occur when performing backtracking in parallel: An exception in one thread can cause several other threads to backtrack early, before they reach the exception themselves.[citation needed]

Super-linear speedups can also occur in parallel implementations of branch-and-bound for optimization:[5] the processing of one node by one processor may affect the work other processors need to do for the other nodes.

## References

1. ^ Martin, Milo, Performance and Benchmarking (PDF), retrieved 5 June 2014
2. ^ Hennessy, John L.; David A., Patterson (2012). Computer Architecture: A Quantitive Approach. Waltham, MA: Morgan Kaufmann. pp. 46–47. ISBN 978-0-12-383872-8.
3. ^ Baer, Jean-Loup (2010). Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors. New York: Cambridge University Press. p. 10. ISBN 978-0-521-76992-1.
4. ^ John Benzi; M. Damodaran (2007). "Parallel Three Dimensional Direct Simulation Monte Carlo for Simulating Micro Flows". Parallel Computational Fluid Dynamics 2007: Implementations and Experiences on Large Scale and Grid Computing. Parallel Computational Fluid Dynamics. Springer. p. 95. Retrieved 2013-03-21.
5. ^ http://mat.tepper.cmu.edu/blog/?p=534#comment-3029