OpenMP - Parallel programming on shared memory systems
This document provides a guide to the usage and availability of OpenMP on the HPC systems at LRZ.
An abstract description of OpenMP
OpenMP is a parallelization method available for the programming languages Fortran, C and C++, which is targeted toward use on shared memory systems. Since the OpenMP standard was developed with support from many vendors, programs parallelized with OpenMP should be widely portable.
The most recent unified standard, version 3.0, for the Fortran, C and C++ base languages was released in May 2008. However, implementations of its new features (especially tasking) will take some time to arrive; most compilers currently support the 2.5 standard.
The OpenMP parallelization model
From the operating system's point of view, OpenMP functionality is based on the use of threads, while the user's job simply consists of inserting suitable parallelization directives into their code. These directives should not influence the sequential functionality of the code; this is supported by their taking the form of Fortran comments and C/C++ preprocessor pragmas, respectively. An OpenMP-aware compiler, however, is capable of transforming the code blocks marked by OpenMP directives into threaded code; at run time the user can then decide (by setting suitable environment variables) what resources should be made available to the parallel parts of the executable, and how they are organized or scheduled. The following image illustrates this.
However, the hardware and operating mode of the computing system limit how OpenMP parallel programs can be used. It is usually not sensible to share processors with other applications, because load imbalance and/or memory contention will degrade the scalability of the code. For much the same reasons, it is in many cases not useful to generate more threads than there are CPUs available. Correspondingly, you need to be aware of your computing center's policies regarding the usage of multiprocessing resources.
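The directive-based approach described above can be sketched in C as follows. This is a minimal illustration, not code from the LRZ material; the function name sum_of_squares is an assumption made for the example. The pragma is honored when the compiler's OpenMP switch is set and ignored otherwise, so the serial result is unchanged.

```c
/* Sum of squares 0^2 + 1^2 + ... + (n-1)^2.
 * With OpenMP enabled, the loop iterations are divided among the threads
 * and the per-thread partial sums are combined by the reduction clause.
 * Without OpenMP the pragma is ignored and the loop runs serially,
 * producing the same result. */
long long sum_of_squares(long n)
{
    long long total = 0;
    long i;

#pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; ++i) {
        total += (long long)i * i;
    }
    return total;
}
```

The number of threads used by the parallel loop is not fixed in the source; it can be chosen at run time, for instance through the OMP_NUM_THREADS environment variable.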
At run time the following situation presents itself: Certain regions of the application can be executed in parallel, the rest - which should be as small as possible - will be executed serially (i.e., by one CPU with one thread).
Program execution always starts in serial mode; as soon as the first parallel region is reached, a team of threads is formed ("forked") based on the user's requirements (4 threads in the image above), and each thread executes the code enclosed in the parallel region. Of course it is necessary to impose a suitable division of the work to be done among the threads ("work sharing"). At the end of the parallel region all threads are synchronized ("join"), and the following serial code is only worked on by the master thread, while the slave threads wait (shaded yellow squares) until a new parallel region begins. This alternation between serial and parallel regions can be repeated arbitrarily often; the threads are only terminated when the application finishes.
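The fork, work sharing and join steps can be made visible with a small C sketch. The function name record_owners is hypothetical and chosen only for this illustration; omp_get_thread_num and omp_get_num_threads are standard OpenMP library routines. The _OPENMP guard keeps the code compilable as a serial program.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Store in owner[i] the ID of the thread that processed iteration i and
 * return the size of the team that executed the parallel region.
 * In a serial build every entry is 0 and the team size is 1. */
int record_owners(int *owner, int n)
{
    int team_size = 1;
    int i;

    /* Fork: a team of threads is formed here. */
#pragma omp parallel
    {
        int me = 0;
#ifdef _OPENMP
        me = omp_get_thread_num();
        /* Exactly one thread records the team size for the caller. */
#pragma omp single
        team_size = omp_get_num_threads();
#endif
        /* Work sharing: the iterations are divided among the team. */
#pragma omp for
        for (i = 0; i < n; ++i) {
            owner[i] = me;
        }
    }
    /* Join: past the closing brace only the master thread continues. */
    return team_size;
}
```

Printing the owner array after a call with, say, four threads shows how the default schedule of the implementation in use distributed the iterations.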
The OpenMP standard also allows nesting of parallelism: A thread in a team of threads may generate a sub-team; some OpenMP implementations however do not allow the use of more than one thread below the top nesting level.
A priori, the number of threads used need not match the number of CPUs available for a job. However, to achieve good performance it is necessary to determine and possibly enforce an optimal assignment of CPUs to threads. This may involve additional functionality in the operating system, and is not covered by the OpenMP standard.
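As a rough illustration of keeping the thread count within the available CPUs, the following C sketch caps a requested value at the processor count reported by the standard routine omp_get_num_procs. The helper name capped_thread_count is an assumption for this example. It only limits the number of threads; binding threads to particular CPUs is a separate matter that depends on the operating system or the OpenMP implementation.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Return the thread count to use: the requested value, but never more
 * than the number of processors reported by the OpenMP runtime and never
 * fewer than one. A serial build assumes a single processor. */
int capped_thread_count(int requested)
{
    int procs = 1;
#ifdef _OPENMP
    procs = omp_get_num_procs();
#endif
    if (requested < 1) {
        return 1;
    }
    return requested < procs ? requested : procs;
}
```

In an OpenMP build the result could be passed to omp_set_num_threads before the first parallel region is entered.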
Comparison with other parallelization methods
In contrast to using MPI (which usually requires a lot of work), one can very quickly obtain a functioning parallel program with OpenMP in many cases. However, in order to achieve good scalability and high performance it will be necessary to use suitable tools to perform further optimization. Even then, scalability of the resulting code will not always be on par with the corresponding code parallelized with MPI.
| | MPI | OpenMP | proprietary | HPF |
|---|---|---|---|---|
| portable | yes | yes | no | yes |
| scalable | yes | partially | partially | partially |
| supports data parallelism | no | yes | yes | yes |
| supports incremental parallelization | no | yes | yes | partially |
| serial functionality intact? | no | yes | yes | yes |
| correctness verifiable? | no | yes | ? | ? |
On high performance computing systems with a combined shared and distributed memory architecture, using MPI and OpenMP in a complementary manner is one possible strategy for parallelization ("hybrid parallel programs"). Schematically, one obtains the following hierarchy of parallelism:
- The job to be done is subdivided into large chunks, which are distributed to (fat) compute nodes using MPI.
- Each chunk of work is then further subdivided by suitable OpenMP directives. Hence each compute node generates a team of threads, each thread working on part of a chunk.
- On the lowest level, e.g. the loops within part of a chunk, the well-known optimization methods (either by compiler or by manual optimization) should be used to obtain good single CPU or single thread performance. The method used will depend on the hardware (e.g., RISC-like vs. vector-like).
Note that the hardware architecture may also influence the OpenMP parallelization method itself. Furthermore, unrestricted intermixing of MPI and OpenMP requires a thread-safe implementation of MPI; the level of thread support provided can be determined by calling the MPI_Init_thread subroutine with suitable parameters. Depending on the result, care may be required to respect the limitations inherent in the various threading support levels defined by the MPI standard.
Overview of OpenMP functionality
A partial description of OpenMP functionality is provided by the LRZ OpenMP presentation (900 kByte PDF), with a separate presentation (500 kByte PDF) on optimization and performance issues.
Remarks on the usage of OpenMP
OpenMP compilers at LRZ and RZG
OpenMP enabled compilers are available on all HPC systems at LRZ and RZG:
- The Intel Compiler suite presently supports OpenMP 2.5, and will support OpenMP 3.0 with the upcoming 11.0 release. These compilers are available on both Itanium and x86_64 based systems.
- The IBM compilers support OpenMP 2.5 on Power or PowerPC based systems.
- The PGI and Pathscale compiler suites support OpenMP on x86_64 based systems.
- GCC supports OpenMP provided that at least version 4.2 is used. OpenMP 3.0 support is targeted for the 4.4 release.
Compiler switches for LRZ and RZG platforms
For activation of OpenMP directives at least one additional compiler switch is required.
| Vendor | Compiler calls | OpenMP option |
|---|---|---|
| Intel | ifort / icc / icpc | -openmp |
| IBM | xlf[95,2003]_r, xlc_r, xlC_r | -qsmp=omp |
| PGI | pgf90 / pgcc / pgCC | -mp |
| Pathscale | pathf90 / pathcc / pathCC | -mp |
Controlling the Run Time environment
Stub library, module and include file for Fortran
In order to keep code compilable for the serial functionality, any OpenMP function calls or declarations should also be decorated with an active comment:
implicit none
...
!$ integer OMP_GET_THREAD_NUM
!$ external OMP_GET_THREAD_NUM
...
mythread = 0
!$ mythread = OMP_GET_THREAD_NUM()
Note that without the IMPLICIT NONE statement and with the declaration missing, OMP_GET_THREAD_NUM would be implicitly typed as REAL, which is the wrong type!
If you do not wish to do this, it is also possible to use
- either an include file omp_lib.h
- or a Fortran 90 module omp_lib.f90
for compilation of the serial code. For linkage one also needs a stub library. All this is provided in the Intel compilers via the option -openmp-stubs, which will otherwise produce purely serial code. The above code can then be written as follows:
Fortran 77 style:

implicit none
...
include 'omp_lib.h'
...
mythread = OMP_GET_THREAD_NUM()

Fortran 90 style:

use omp_lib
implicit none
...
...
mythread = OMP_GET_THREAD_NUM()
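For C and C++ sources a comparable effect can be sketched with the _OPENMP preprocessor macro, which the OpenMP standard defines whenever OpenMP compilation is enabled. The example below is an illustration under that assumption, not part of the Intel stub library: it supplies its own serial stand-ins for two standard routines, and the function name describe_calling_thread is hypothetical.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins used when the compiler's OpenMP switch is not set. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Report the ID of the calling thread and the size of its current team.
 * Called from serial code this yields ID 0 and a team of one thread,
 * in both the OpenMP build and the serial build. */
void describe_calling_thread(int *id, int *count)
{
    *id = omp_get_thread_num();
    *count = omp_get_num_threads();
}
```

Because the call sites are identical in both builds, the rest of the code needs no further conditional compilation.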
OpenMP extensions supported by the Intel Compilers
The Intel Fortran and C/C++ compilers on x86_64 as well as IA64 provide some additional functionality described in the following.
Run time control: Environment variables

| Name | Explanation | Default value |
|---|---|---|
| KMP_AFFINITY | Schedule threads to cores or threads (logical CPUs) in a user-controlled manner. | see the description on Intel's web site |
| KMP_ALL_THREADS | Maximum number of threads available to a parallel region. | max(32, 4*OMP_NUM_THREADS, 4*(No. of processors)) |
| KMP_BLOCKTIME | Interval after which an inactive thread is put to sleep, in milliseconds. | 200 milliseconds |
| KMP_LIBRARY | Select the execution mode for the OpenMP runtime library. | throughput |
| KMP_MONITOR_STACKSIZE | Stack size in bytes for the monitor thread. | max(32768, system minimum thread stack size) |
| KMP_STACKSIZE | Stack size (in bytes) usable for each thread. Change if your application segfaults for no apparent reason. You may also need to increase your shell stack limit appropriately. | 2 MByte on IA32, 4 MByte on IA64 |
Notes:
- Setting suitable postfixes where appropriate allows you to specify units. For example, KMP_STACKSIZE=6m sets a value of 6 MByte.
- There are also some extension routine calls, e.g. kmp_set_stacksize_s(...) with an implementation dependent integer kind as argument, which can be used instead of the environment variables described above. However, this will usually not be portable, and its usage is hence discouraged except for specific needs.
NUMA-related directives
For Fortran compiler releases 9.1.36 and higher, Intel has implemented an additional proprietary directive which supports correctly distributed memory initialization and NUMA pre-fetching. This directive is of the form
!DIR$ MEMORYTOUCH (array-name [ , schedule-type [ ( chunk-size ) ]] [ , init-type])
where the parameter names have the following meaning:
- array-name is an array of type INTEGER(4), INTEGER(8), REAL(4) or REAL(8)
- schedule-type is one of STATIC, GUIDED, RUNTIME or DYNAMIC, and should be consistent with the OpenMP conforming processing of the subsequent parallel loops.
- chunk-size is an integer expression
- init-type is one of LOAD or STORE
If init-type is LOAD, the compiler generates an OpenMP loop which fetches elements of array-name into a temporary variable. If init-type is STORE, the compiler generates an OpenMP loop which sets elements of array-name to zero. Examples:
!DIR$ memorytouch (A)
!DIR$ memorytouch (A , LOAD)
!DIR$ memorytouch (A , STATIC (load+jf(3)) )
!DIR$ memorytouch (A , GUIDED (20), STORE)
While the MEMORYTOUCH directive is accepted on all platforms, at present it is meaningful only on certain Itanium-based systems with NUMA designs and when OpenMP is enabled.
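Where the proprietary directive is not available, a similar placement effect is commonly sought with a portable "first touch" initialization loop. The C sketch below uses that substitute technique, not the MEMORYTOUCH directive itself, and assumes that the operating system places each memory page near the thread that first writes to it. The function names alloc_first_touch and add_scalar are hypothetical.

```c
#include <stdlib.h>

/* Allocate n doubles and zero them in a parallel loop, so that each
 * thread is the first to write the elements it will work on later
 * (comparable to the STORE variant described above).
 * Returns NULL if the allocation fails. */
double *alloc_first_touch(long n)
{
    double *a = malloc((size_t)n * sizeof *a);
    long i;

    if (a == NULL) {
        return NULL;
    }
#pragma omp parallel for schedule(static)
    for (i = 0; i < n; ++i) {
        a[i] = 0.0;
    }
    return a;
}

/* Add a value to every element, using the same static schedule as the
 * initialization so each thread revisits the elements it initialized. */
void add_scalar(double *a, long n, double value)
{
    long i;

#pragma omp parallel for schedule(static)
    for (i = 0; i < n; ++i) {
        a[i] += value;
    }
}
```

As with the schedule-type argument of the directive, the schedule of the initialization loop should be consistent with the schedule of the loops that process the array afterwards.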
References, examples and documentation
- Intel Compiler documentation on the LRZ web site
  - Note that the links on this page lead to a password-protected area. Issue the command get_manuals_passwd when logged in to one of the LRZ HPC systems to obtain access information.
- OpenMP home page
  - The central source of information about OpenMP
- OpenMP Specifications
  - For Fortran, C, and C++
- Compilers and tools
  - Various vendors' implementations and add-ons
- OpenMP at HLRS
Acknowledgments go to Isabel Loebich and Michael Resch, Höchstleistungsrechenzentrum Stuttgart, for a very stimulating OpenMP workshop and the permission to reuse material from this workshop in this document.