Speedup

In parallel computing, speedup refers to how much faster a parallel algorithm is than a corresponding sequential algorithm.

Definition

Speedup is defined by the following formula:

    $S_p = \frac{T_1}{T_p}$

where:

  • $p$ is the number of processors
  • $T_1$ is the execution time of the sequential algorithm
  • $T_p$ is the execution time of the parallel algorithm with $p$ processors

Linear speedup or ideal speedup is obtained when $S_p = p$. When running an algorithm with linear speedup, doubling the number of processors doubles the speed. As this is ideal, it is considered very good scalability.
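
As a concrete illustration, the following Python sketch computes $S_p$ from wall-clock times; the timing numbers are invented purely for illustration. Under linear speedup, $S_p = p$ at every processor count.

    def speedup(t_sequential, t_parallel):
        """S_p = T_1 / T_p."""
        return t_sequential / t_parallel

    # Hypothetical wall-clock times (seconds) for 1, 2, 4, and 8 processors.
    times = {1: 64.0, 2: 32.0, 4: 16.0, 8: 8.0}
    t1 = times[1]

    for p, tp in times.items():
        print(f"p={p}: S_p = {speedup(t1, tp):.1f}")
    # Prints S_p equal to p at every count: linear (ideal) speedup.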

Efficiency is a performance metric defined as

    $E_p = \frac{S_p}{p} = \frac{T_1}{p T_p}$.

It is a value, typically between zero and one, estimating how well-utilized the processors are in solving the problem, compared to how much effort is wasted in communication and synchronization. Algorithms with linear speedup and algorithms running on a single processor have an efficiency of 1, while many difficult-to-parallelize algorithms have efficiency such as $\frac{1}{\ln p}$[citation needed] that approaches zero as the number of processors increases.
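
Continuing the sketch above, efficiency follows directly from speedup. The $S_p = p/\ln p$ curve below stands in for the hard-to-parallelize case mentioned in the text; the processor counts are arbitrary.

    import math

    def efficiency(s_p, p):
        """E_p = S_p / p."""
        return s_p / p

    for p in (4, 16, 256, 4096):
        linear = efficiency(p, p)              # linear speedup: S_p = p
        hard = efficiency(p / math.log(p), p)  # hypothetical S_p = p / ln(p)
        print(f"p={p:5d}: linear E_p = {linear:.2f}, hard E_p = {hard:.3f}")
    # The linear case stays at 1.0; the 1/ln(p) case decays toward zero.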

In engineering contexts, efficiency curves are plotted more often than speedup curves, since

  • all of the area in the graph is useful (whereas in a speedup curve 1/2 of the space is wasted)
  • it is easy to see how well parallelization is working
  • there is no need to plot a "perfect speedup" line

In marketing contexts, speedup curves are more often used [citation needed], largely because they go up and to the right and thus appear better to the less-informed.

Super linear speedup

Sometimes a speedup of more than p is observed when using p processors in parallel computing; this is called super linear speedup. Super linear speedup rarely happens and often confuses beginners, who believe the theoretical maximum speedup should be p when p processors are used.

One possible reason for super linear speedup is the cache effect resulting from the different memory hierarchies of a modern computer: in parallel computing, not only does the number of processors change, but so does the size of the accumulated cache from the different processors. With the larger accumulated cache size, more of the working set, or even all of it, can fit into caches, and memory access time drops dramatically, which causes extra speedup beyond that from the computation alone.[1]
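
One way to see the mechanism is a toy cost model; all sizes and latencies below are invented. Each processor holds $1/p$ of the working set, and once that share fits into its cache, the per-access cost drops, so the speedup jumps past $p$.

    # Toy model of the cache effect (all numbers hypothetical): each
    # processor works on working_set/p data; if that share fits in its
    # cache, accesses are fast, otherwise slow.
    CACHE_PER_PROC = 4      # MB of cache per processor
    WORKING_SET = 16        # MB total
    FAST, SLOW = 1.0, 10.0  # cost per MB of data accessed

    def run_time(p):
        share = WORKING_SET / p
        cost = FAST if share <= CACHE_PER_PROC else SLOW
        return share * cost

    t1 = run_time(1)
    for p in (1, 2, 4, 8):
        print(f"p={p}: speedup = {t1 / run_time(p):.1f}")
    # At p=4 the per-processor share (4 MB) first fits in cache, so the
    # speedup jumps to 40 with only 4 processors: superlinear.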

An analogous situation occurs when searching large datasets, such as the genomic data searched by BLAST implementations. There the accumulated RAM of the nodes in a cluster enables the dataset to move from disk into RAM, thereby drastically reducing the time required by, e.g., mpiBLAST to search it.

Super linear speedups can also occur when performing backtracking in parallel: An exception in one thread can cause several other threads to backtrack early, before they reach the exception themselves[citation needed].
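
The effect can be sketched with a simplified model, where the search-space size and solution position are invented and finding a solution stands in for the exception that stops the other threads: the worker whose chunk contains the solution reaches it after far fewer steps than a sequential scan, while the rest stop immediately.

    # Toy model of superlinear speedup in parallel search (all numbers
    # hypothetical). Sequential search scans the space in order; the
    # parallel version splits it into p chunks and stops every worker
    # as soon as one of them succeeds.
    SPACE = 1_000_000
    TARGET = 800_123          # position of the first solution

    def sequential_steps():
        return TARGET + 1     # visits every candidate up to the target

    def parallel_steps(p):
        # Worker i scans [i*chunk, (i+1)*chunk); all stop on success.
        chunk = SPACE // p
        owner = TARGET // chunk
        return TARGET - owner * chunk + 1  # wall-clock steps (workers run concurrently)

    t1 = sequential_steps()
    for p in (2, 4, 8):
        print(f"p={p}: speedup = {t1 / parallel_steps(p):.1f}")
    # Every printed speedup exceeds p: superlinear.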

References

  1. ^ John Benzi (2007). "Parallel Three Dimensional Direct Simulation Monte Carlo for Simulating Micro Flows". Parallel Computational Fluid Dynamics 2007: Implementations and Experiences on Large Scale and Grid Computing. Parallel Computational Fluid Dynamics. Springer. p. 95. Retrieved 2013-03-21.
  • Frank Dehne (1994-06-10). "Scalable Parallel Computational Geometry". In Michel Cosnard, Afonso Ferreira, Joseph Peters (eds.). Parallel and Distributed Computing. Springer. Retrieved 2013-03-20.
