OpenMP is an API that makes it easy to parallelize C++ programs on symmetric multi-processor (e.g., multi-core) architectures. It consists of a set of preprocessor directives (#pragmas) and library routines. Since the compiler is made aware of your intentions, the modifications to a single-threaded program are often trivial, and the resulting code is faster and more portable than if you had used the operating system's native threading facilities.
If you know OpenMP, you can often achieve a near-linear speedup (e.g., 4x on a quad-core machine) with only a few minutes' work. Since the number of cores per CPU is likely to increase geometrically over the next decade, the reasons for learning this API are only going to get more compelling.
Getting started
Since OpenMP is implemented by the compiler, you need to pass extra flags so that it recognizes the OpenMP preprocessor directives during compilation.
GCC
GCC 4.2 supports OpenMP 2.5. To enable it, pass the -fopenmp flag:
g++ -fopenmp my_program.cpp -o my_program
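You can check at compile time whether OpenMP was actually enabled by testing the standard _OPENMP macro, which conforming compilers define to their supported version date:

#include <iostream>

int main()
{
#ifdef _OPENMP
    std::cout << "OpenMP enabled, version macro: " << _OPENMP << std::endl;
#else
    std::cout << "OpenMP not enabled" << std::endl;
#endif
    return 0;
}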
Visual C++
MS Visual Studio 2005/2008 supports OpenMP 2.0. To enable it, pass the /openmp flag [1].
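For command-line builds, the invocation looks something like this (a sketch assuming the cl compiler from a Visual Studio command prompt):

cl /openmp my_program.cpp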
Intel
The Intel C++ Compiler 10.1 supports OpenMP 2.5 on Windows, Linux, and Mac OS X. Earlier versions probably work as well. To enable it, pass the -openmp flag (on Linux and Mac OS X) or /Qopenmp (on Windows).
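For example, on Linux the build would look something like this (assuming icpc, Intel's C++ compiler driver):

icpc -openmp my_program.cpp -o my_program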
Examples
Parallel for
Two directives suffice for many applications. This example illustrates them both.
#include <iostream>
using namespace std;

void ExpensiveFunction(int i)
{
    // lots of work ...
}

int main()
{
    #pragma omp parallel for num_threads(4)
    for(int i = 0; i < 128; i++)
    {
        #pragma omp critical
        {
            cout << "doing " << i << endl;
            cout.flush();
        }
        ExpensiveFunction(i);
    }
    return 0;
}
#pragma omp parallel for instructs the compiler to parallelize the for-loop on the following line. The loop termination condition doesn't need to be a compile-time constant, but it must not change while the loop executes. Variables declared inside the parallel for are private to the worker threads; variables declared outside are shared.
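As a minimal sketch of these sharing rules (the variable names here are just for illustration):

int shared_total = 0;                  // declared outside: shared by all threads
#pragma omp parallel for num_threads(4)
for(int i = 0; i < 128; i++)
{
    int local = i * i;                 // declared inside: private to each thread
    #pragma omp critical
    shared_total += local;             // writes to shared data must be synchronized
}

(In real code, a reduction(+:shared_total) clause would be the idiomatic way to compute this sum; the critical section just keeps the sharing explicit.)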
This directive takes several arguments, including the number of threads (illustrated) and the scheduling policy. If each call to ExpensiveFunction() takes a highly variable amount of time, your code might benefit from a dynamic scheduling policy:
#pragma omp parallel for num_threads(4) schedule(dynamic, 16)
In this case, each thread is dynamically assigned the next block of 16 iterations once it completes its current block.
#pragma omp critical declares the next code block (or single statement) to be a critical section, i.e., only one of the worker threads can be inside that section at a time, allowing us to serialize access to code that is not thread-safe. Without the critical block in this case, the output of the print statements may become garbled as multiple threads print to the screen at once.
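Critical sections can also be named: sections with different names may execute concurrently, while all unnamed critical sections share a single global lock. A sketch (assuming this appears in a loop body where i, total, and result are defined):

#pragma omp critical(io)
{
    cout << "progress: " << i << endl;  // excludes only other (io) sections
}
#pragma omp critical(stats)
{
    total += result;                    // guarded by a lock independent of (io)
}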
Thread-specific initialization
Occasionally, we may need objects that are local to each worker thread but persist across loop iterations.
class NonThreadSafeCache
{
    // ...
};

int main()
{
    #pragma omp parallel num_threads(4)
    {
        // each worker thread constructs its own private instance
        NonThreadSafeCache *c = new NonThreadSafeCache();
        #pragma omp for schedule(dynamic, 16)
        for(int i = 0; i < 128; i++)
        {
            // do expensive stuff involving c...
        }
        delete c;
    }
    return 0;
}
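Equivalently, and with less room for leaks, the cache can be declared as a stack object inside the parallel region so each thread's copy is destroyed automatically when it leaves the block; a sketch under the same assumptions:

#pragma omp parallel num_threads(4)
{
    NonThreadSafeCache c;              // one private instance per worker thread
    #pragma omp for schedule(dynamic, 16)
    for(int i = 0; i < 128; i++)
    {
        // do expensive stuff involving c...
    }
}                                      // c's destructor runs here, once per thread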
Heterogeneous parallelism
One thing to note is that the #pragma omp parallel for introduced in the first example is really a combined directive consisting of a #pragma omp for nested inside a #pragma omp parallel. Aside from parallelizing loop iterations, other coarser-grained directives are available inside a #pragma omp parallel scope.
void ExpensiveFunction1()
{
    // ...
}

void ExpensiveFunction2()
{
    // ...
}

void ExpensiveFunction3()
{
    // ...
}

int main()
{
    #pragma omp parallel
    {
        #pragma omp sections nowait
        {
            #pragma omp section
            {
                // do expensive stuff
            }
            #pragma omp section
            ExpensiveFunction1();
            #pragma omp section
            ExpensiveFunction2();
            #pragma omp section
            ExpensiveFunction3();
        }
    }
    return 0;
}
#pragma omp sections introduces a region containing multiple sections that can be executed in parallel. The number of sections is independent of the number of threads; some threads may execute multiple sections and others may execute none. There is an implied barrier at the end of the sections directive: all threads wait until all sections have been processed before moving on, unless the nowait argument is passed (as in the example above).
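To see which worker ends up running which section, you can query the runtime with omp_get_thread_num(), one of the library routines mentioned at the start; a minimal sketch:

#include <omp.h>
#include <iostream>
using namespace std;

int main()
{
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            #pragma omp critical
            cout << "section A on thread " << omp_get_thread_num() << endl;
        }
        #pragma omp section
        {
            #pragma omp critical
            cout << "section B on thread " << omp_get_thread_num() << endl;
        }
    }
    return 0;
}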
Gotchas
- OpenMP #pragmas always introduce a new scope, so the following wouldn't work:
#pragma omp critical
MyClass *x = new MyClass();
x->DoStuff(); // x not declared in this scope
Instead, do:
MyClass *x;
#pragma omp critical
x = new MyClass();
x->DoStuff();
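If more than one statement needs to be protected, you can also attach an explicit block to the directive:

MyClass *x;
#pragma omp critical
{
    x = new MyClass();
    x->DoStuff();
}

(Note this also serializes the DoStuff() call itself, which may or may not be what you want.)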