| // This file does not contain any code; it just contains additional text and formatting
|
| // for doxygen.
|
|
|
|
| //===----------------------------------------------------------------------===//
| //
| // The LLVM Compiler Infrastructure
| //
| // This file is dual licensed under the MIT and the University of Illinois Open
| // Source Licenses. See LICENSE.txt for details.
| //
| //===----------------------------------------------------------------------===//
|
|
|
| /*! @mainpage Intel® OpenMP* Runtime Library Interface
|
| @section sec_intro Introduction
|
|
|
| This document describes the interface provided by the
|
| Intel® OpenMP\other runtime library to the compiler.
|
| Routines that are directly called as simple functions by user code are
|
| not currently described here, since their definition is in the OpenMP
|
| specification available from http://openmp.org
|
|
|
| The aim here is to explain the interface from the compiler to the runtime.
|
|
|
| The overall design is described, and each function in the interface
|
| has its own description. (At least, that is the ambition; we may not be there yet.)
|
|
|
| @section sec_building Building the Runtime
|
| For the impatient, we cover building the runtime as the first topic here.
|
|
|
| A top-level Makefile is provided that attempts to derive a suitable
|
| configuration for the most commonly used environments. To see the
|
| default settings, type:
|
| @code
|
| % make info
|
| @endcode
|
|
|
| You can change the Makefile's behavior with the following options:
|
|
|
| - <b>omp_root</b>: The path to the top-level directory containing the top-level
|
| Makefile. By default, this will take on the value of the
|
| current working directory.
|
|
|
| - <b>omp_os</b>: Operating system. By default, the build will attempt to
|
| detect this. Currently supports "linux", "macos", and
|
| "windows".
|
|
|
| - <b>arch</b>: Architecture. By default, the build will attempt to
|
| detect this if not specified by the user. Currently
|
| supported values are
|
| - "32" for IA-32 architecture
|
| - "32e" for Intel® 64 architecture
|
| - "mic" for Intel® Many Integrated Core Architecture.
|
| If "mic" is specified, "icc" is used as the
|
| compiler, together with the appropriate k1om binutils. The
|
| necessary packages must be installed on the build machine
|
| for this to be possible, but an
|
| Intel® Xeon Phi™
|
| coprocessor is not required to build the library.
|
|
|
| - <b>compiler</b>: Which compiler to use for the build. Defaults to "icc"
|
| or "icl" depending on the value of omp_os. Also supports
|
| "gcc" when omp_os is "linux" for gcc\other versions
|
| 4.6.2 and higher. For icc on OS X\other, OS X\other versions
|
| greater than 10.6 are not supported currently. Also, icc
|
| version 13.0 is not supported. The selected compiler should be
|
| installed and in the user's path. The corresponding
|
| Fortran compiler should also be in the path.
|
|
|
| - <b>mode</b>: Library mode: default is "release". Also supports "debug".
|
|
|
| To use any of the options above, simply add <option_name>=<value> to the make command line. For
|
| example, if you want to build with gcc instead of icc, type:
|
| @code
|
| % make compiler=gcc
|
| @endcode
|
|
|
| Under the hood of the top-level Makefile, the runtime is built by
|
| a Perl script that in turn drives a detailed runtime system make. The
|
| script can be found at <tt>tools/build.pl</tt>, and will print
|
| information about all its flags and controls if invoked as
|
| @code
|
| % tools/build.pl --help
|
| @endcode
|
|
|
| If invoked with no arguments, it will try to build a set of libraries
|
| that are appropriate for the machine on which the build is happening.
|
| Many further options are available for building out of tree and for configuring
|
| library features. Consult the <tt>--help</tt> output for details.
|
|
|
| @section sec_supported Supported RTL Build Configurations
|
|
|
| The architectures supported are IA-32 architecture, Intel® 64, and
|
| Intel® Many Integrated Core Architecture. The build configurations
|
| supported are shown in the table below.
|
|
|
| <table border="1">
|
| <tr><th></th><th>icc/icl</th><th>gcc</th></tr>
|
| <tr><td>Linux\other OS</td><td>Yes(1,5)</td><td>Yes(2,4)</td></tr>
|
| <tr><td>OS X\other</td><td>Yes(1,3,4)</td><td>No</td></tr>
|
| <tr><td>Windows\other OS</td><td>Yes(1,4)</td><td>No</td></tr>
|
| </table>
|
| (1) On IA-32 architecture and Intel® 64, icc/icl versions 12.x
|
| are supported (12.1 is recommended).<br>
|
| (2) gcc version 4.6.2 is supported.<br>
|
| (3) For icc on OS X\other, OS X\other version 10.5.8 is supported.<br>
|
| (4) Intel® Many Integrated Core Architecture not supported.<br>
|
| (5) On Intel® Many Integrated Core Architecture, icc/icl versions 13.0 or later are required.
|
|
|
| @section sec_frontend Front-end Compilers that work with this RTL
|
|
|
| The following compilers are known to do compatible code generation for
|
| this RTL: icc/icl, gcc. Code generation is discussed in more detail
|
| later in this document.
|
|
|
| @section sec_outlining Outlining
|
|
|
| The runtime interface is based on the idea that the compiler
|
| "outlines" sections of code that are to run in parallel into separate
|
| functions that can then be invoked in multiple threads. For instance,
|
| simple code like this
|
|
|
| @code
|
| void foo()
|
| {
|
| #pragma omp parallel
|
| {
|
| ... do something ...
|
| }
|
| }
|
| @endcode
|
| is converted into something that looks conceptually like this (the
|
| names used here are merely illustrative; the real library function
|
| names are introduced later, once some further issues have been discussed):
|
|
|
| @code
|
| static void outlinedFooBody()
|
| {
|
| ... do something ...
|
| }
|
|
|
| void foo()
|
| {
|
| __OMP_runtime_fork(outlinedFooBody, (void*)0); // Not the real function name!
|
| }
|
| @endcode
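
To make the calling convention concrete, here is a minimal sketch of the same transformation in compilable C. Everything in it is hypothetical: `mock_runtime_fork`, `outlined_fn` and the fixed team size of four are illustrative stand-ins, not part of the runtime library interface. The mock also calls the outlined body serially, once per simulated thread, where a real runtime would run the calls concurrently.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical signature of an outlined parallel-region body. The real
// entry points take different arguments.
typedef void (*outlined_fn)(int thread_id, void *arg);

// Hypothetical stand-in for the runtime's fork routine. A real runtime
// would run `body` concurrently on a team of threads; this mock simply
// calls it once per simulated thread and returns when all calls are done.
static void mock_runtime_fork(outlined_fn body, void *arg, int num_threads)
{
    assert(body != NULL && num_threads > 0);
    for (int tid = 0; tid < num_threads; ++tid)
        body(tid, arg);
}

// The body of the parallel region, outlined by hand.
static void outlined_foo_body(int thread_id, void *arg)
{
    int *visits = (int *)arg;
    visits[thread_id] += 1; // stands in for "... do something ..."
}

// What remains of foo() after outlining: a single call into the runtime.
// Returns how many simulated threads executed the region body.
static int foo(void)
{
    int visits[4] = {0, 0, 0, 0};
    mock_runtime_fork(outlined_foo_body, visits, 4);
    return visits[0] + visits[1] + visits[2] + visits[3];
}
```

Calling `foo()` runs the region body once for each of the four simulated threads. Each simulated thread writes only its own slot of `visits`, so the same body would also be safe if the calls ran in parallel.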
|
|
|
| @subsection SEC_SHAREDVARS Addressing shared variables
|
|
|
| In real uses of the OpenMP\other API there are normally references
|
| from the outlined code to shared variables that are in scope in the containing function.
|
| Therefore the outlined function must be able to address
|
| these variables. There are two alternative ways of doing
|
| this, only the first of which the runtime currently supports.
|
|
|
| @subsubsection SEC_SEC_OT Current Technique
|
| The technique currently supported by the runtime library is to receive
|
| a separate pointer to each shared variable that can be accessed from
|
| the outlined function. This is what is shown in the @ref SEC_WORKSHARING_EXAMPLE.
|
|
|
| We hope soon to provide an additional interface to support the
|
| alternative implementation described in the next section. The
|
| alternative implementation has performance advantages for small
|
| parallel regions that have many shared variables.
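
The pointer-per-variable scheme can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `mock_fork_call2`, `microtask2_fn` and `MOCK_NUM_THREADS` are hypothetical names, the mock handles exactly two shared variables, and it runs the simulated threads serially so that the unsynchronized updates are safe.

```c
#include <assert.h>
#include <stdarg.h>

// Hypothetical microtask type for a region with two shared variables.
typedef void (*microtask2_fn)(int *gtid, int *btid,
                              void *shared0, void *shared1);

enum { MOCK_NUM_THREADS = 3 };

// Hypothetical fork routine. It collects `argc` pointers from the variadic
// argument list (the marshalling work done at the fork point) and then
// passes each pointer to the microtask as a separate argument.
static void mock_fork_call2(int argc, microtask2_fn task, ...)
{
    void *args[2] = {0, 0};
    va_list ap;

    assert(argc == 2); // this sketch supports exactly two shared variables
    va_start(ap, task);
    for (int i = 0; i < argc; ++i)
        args[i] = va_arg(ap, void *);
    va_end(ap);

    // Serial stand-in for running the team of threads.
    for (int tid = 0; tid < MOCK_NUM_THREADS; ++tid) {
        int gtid = tid, btid = tid;
        task(&gtid, &btid, args[0], args[1]);
    }
}

// Outlined body: each shared variable arrives as its own pointer.
static void outlined_body(int *gtid, int *btid, void *shared0, void *shared1)
{
    int *count = (int *)shared0;
    long *sum = (long *)shared1;
    (void)btid;
    *count += 1;
    *sum += *gtid;
}

// Parent function: passes the address of every shared variable at the fork.
static long parent(int *count_out)
{
    int count = 0;
    long sum = 0;
    mock_fork_call2(2, outlined_body, (void *)&count, (void *)&sum);
    *count_out = count;
    return sum;
}
```

The loop over `va_arg` is the per-variable cost at the fork point: it grows with the number of shared variables, which is the overhead the alternative technique aims to avoid.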
|
|
|
| @subsubsection SEC_SEC_PT Future Technique
|
| The idea is to treat the outlined function as though it
|
| were a lexically nested function, and pass it a single argument which
|
| is the pointer to the parent's stack frame. Provided that the compiler
|
| knows the layout of the parent frame when it is generating the outlined
|
| function it can then access the up-level variables at appropriate
|
| offsets from the parent frame. This is a classical compiler technique
|
| from the 1960s to support languages like Algol (and its descendants)
|
| that support lexically nested functions.
|
|
|
| The main benefit of this technique is that there is no code required
|
| at the fork point to marshal the arguments to the outlined function.
|
| Since the runtime knows statically that exactly one argument must be passed to the
|
| outlined function, it can easily copy it to the thread's stack
|
| frame. Therefore the performance of the fork code is independent of
|
| the number of shared variables that are accessed by the outlined
|
| function.
|
|
|
| If it is hard to determine the stack layout of the parent while generating the
|
| outlined code, it is still possible to use this approach by collecting all of
|
| the variables in the parent that are accessed from outlined functions into
|
| a single `struct` which is placed on the stack, and whose address is passed
|
| to the outlined functions. In this way the offsets of the shared variables
|
| are known (since they are inside the struct) without needing to know
|
| the complete layout of the parent stack frame. From the point of view
|
| of the runtime the two techniques are equivalent, since in either
|
| case it only has to pass a single argument to the outlined function to allow
|
| it to access shared variables.
|
|
|
| A scheme like this is how gcc\other generates outlined functions.
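
The `struct` variant described above can be sketched like this. The names `mock_fork_frame`, `frame_task_fn` and `parent_shared` are hypothetical and chosen only for illustration, and the mock again runs its simulated threads serially, so no synchronization is shown.

```c
#include <assert.h>

// Hypothetical outlined-function type: one pointer gives access to all
// shared variables, however many there are.
typedef void (*frame_task_fn)(int thread_id, void *frame);

// Hypothetical fork routine. It forwards a single pointer, so its cost
// does not depend on the number of shared variables.
static void mock_fork_frame(frame_task_fn task, void *frame, int num_threads)
{
    assert(task != 0 && num_threads > 0);
    for (int tid = 0; tid < num_threads; ++tid) // serial stand-in for a team
        task(tid, frame);
}

// All variables of the parent that the outlined code accesses are
// collected in one struct, so their offsets are known without knowing
// the full layout of the parent's stack frame.
struct parent_shared {
    int count;
    long sum;
};

static void outlined_body_frame(int thread_id, void *frame)
{
    struct parent_shared *shared = (struct parent_shared *)frame;
    shared->count += 1;
    shared->sum += thread_id;
}

// Parent function: the struct lives on its stack and only its address
// is handed to the fork routine.
static long parent_with_frame(int num_threads, int *count_out)
{
    struct parent_shared shared = {0, 0};
    mock_fork_frame(outlined_body_frame, &shared, num_threads);
    *count_out = shared.count;
    return shared.sum;
}
```

Adding a third shared variable here means adding a member to `parent_shared`; the fork call itself is unchanged.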
|
|
|
| @section SEC_INTERFACES Library Interfaces
|
| The library functions used for specific parts of the OpenMP\other language implementation
|
| are documented in different modules.
|
|
|
| - @ref BASIC_TYPES fundamental types used by the runtime in many places
|
| - @ref DEPRECATED functions that are in the library but are no longer required
|
| - @ref STARTUP_SHUTDOWN functions for initializing and finalizing the runtime
|
| - @ref PARALLEL functions for implementing `omp parallel`
|
| - @ref THREAD_STATES functions for supporting thread state inquiries
|
| - @ref WORK_SHARING functions for work sharing constructs such as `omp for`, `omp sections`
|
| - @ref THREADPRIVATE functions to support thread private data, copyin, etc.
|
| - @ref SYNCHRONIZATION functions to support `omp critical`, `omp barrier`, `omp master`, reductions, etc.
|
| - @ref ATOMIC_OPS functions to support atomic operations
|
| - Documentation on tasking has still to be written...
|
|
|
| @section SEC_EXAMPLES Examples
|
| @subsection SEC_WORKSHARING_EXAMPLE Work Sharing Example
|
| This example shows the code generated for a parallel for with reduction and dynamic scheduling.
|
|
|
| @code
|
| extern float foo( void );
|
|
|
| int main () {
|
| int i;
|
| float r = 0.0;
|
| #pragma omp parallel for schedule(dynamic) reduction(+:r)
|
| for ( i = 0; i < 10; i ++ ) {
|
| r += foo();
|
| }
|
| }
|
| @endcode
|
|
|
| The transformed code looks like this.
|
| @code
|
| extern float foo( void );
|
|
|
| int main () {
|
| static int zero = 0;
|
| auto int gtid;
|
| auto float r = 0.0;
|
| __kmpc_begin( & loc3, 0 );
|
| // The gtid is not actually required in this example so could be omitted;
|
| // We show its initialization here because it is often required for calls into
|
| // the runtime and should be locally cached like this.
|
| gtid = __kmpc_global_thread_num( & loc3 );
|
| __kmpc_fork_call( & loc7, 1, main_7_parallel_3, & r );
|
| __kmpc_end( & loc0 );
|
| return 0;
|
| }
|
|
|
| struct main_10_reduction_t_5 { float r_10_rpr; };
|
|
|
| static kmp_critical_name lck = { 0 };
|
| static ident_t loc10; // loc10.flags should contain KMP_IDENT_ATOMIC_REDUCE bit set
|
| // if compiler has generated an atomic reduction.
|
|
|
| void main_7_parallel_3( int *gtid, int *btid, float *r_7_shp ) {
|
| auto int i_7_pr;
|
| auto int lower, upper, liter, incr;
|
| auto struct main_10_reduction_t_5 reduce;
|
| reduce.r_10_rpr = 0.F;
|
| liter = 0;
|
| __kmpc_dispatch_init_4( & loc7,*gtid, 35, 0, 9, 1, 1 );
|
| while ( __kmpc_dispatch_next_4( & loc7, *gtid, & liter, & lower, & upper, & incr ) ) {
|
| for( i_7_pr = lower; upper >= i_7_pr; i_7_pr ++ )
|
| reduce.r_10_rpr += foo();
|
| }
|
| switch( __kmpc_reduce_nowait( & loc10, *gtid, 1, 4, & reduce, main_10_reduce_5, & lck ) ) {
|
| case 1:
|
| *r_7_shp += reduce.r_10_rpr;
|
| __kmpc_end_reduce_nowait( & loc10, *gtid, & lck );
|
| break;
|
| case 2:
|
| __kmpc_atomic_float4_add( & loc10, *gtid, r_7_shp, reduce.r_10_rpr );
|
| break;
|
| default:;
|
| }
|
| }
|
|
|
| void main_10_reduce_5( struct main_10_reduction_t_5 *reduce_lhs,
|
| struct main_10_reduction_t_5 *reduce_rhs )
|
| {
|
| reduce_lhs->r_10_rpr += reduce_rhs->r_10_rpr;
|
| }
|
| @endcode
|
|
|
| @defgroup BASIC_TYPES Basic Types
|
| Types that are used throughout the runtime.
|
|
|
| @defgroup DEPRECATED Deprecated Functions
|
| Functions in this group are for backwards compatibility only, and
|
| should not be used in new code.
|
|
|
| @defgroup STARTUP_SHUTDOWN Startup and Shutdown
|
| These functions are for library initialization and shutdown.
|
|
|
| @defgroup PARALLEL Parallel (fork/join)
|
| These functions are used for implementing <tt>\#pragma omp parallel</tt>.
|
|
|
| @defgroup THREAD_STATES Thread Information
|
| These functions return information about the currently executing thread.
|
|
|
| @defgroup WORK_SHARING Work Sharing
|
| These functions are used for implementing
|
| <tt>\#pragma omp for</tt>, <tt>\#pragma omp sections</tt>, <tt>\#pragma omp single</tt> and
|
| <tt>\#pragma omp master</tt> constructs.
|
|
|
| When handling loops, there are separate functions for each loop index type, distinguished by the name suffixes
|
| `_4` (signed 32-bit), `_4u` (unsigned 32-bit), `_8` (signed 64-bit) and `_8u` (unsigned 64-bit). The semantics of all of these functions are the same,
|
| so they are only described once.
|
|
|
| Static loop scheduling is handled by @ref __kmpc_for_static_init_4 and friends. Only a single call is needed,
|
| since the iterations to be executed by any given thread can be determined as soon as the loop parameters are known.
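
The following sketch shows why a single call suffices for static scheduling: each thread's bounds follow directly from the loop parameters, the thread number and the team size. The function `static_block_bounds` is hypothetical and implements only one possible block partitioning of a unit-stride loop. It is not the library's algorithm and does not reproduce the interface of the real static init functions.

```c
#include <assert.h>

// Hypothetical helper: computes the inclusive bounds [*lower, *upper] of
// the block of iterations assigned to thread `tid` out of `nthreads`, for
// a unit-stride loop running from `lb` to `ub` inclusive. Returns 1 if the
// thread has work and 0 if it has none.
static int static_block_bounds(int tid, int nthreads, int lb, int ub,
                               int *lower, int *upper)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads);
    if (ub < lb)
        return 0; // empty loop

    int trip = ub - lb + 1;      // total number of iterations
    int small = trip / nthreads; // minimum block size
    int extra = trip % nthreads; // the first `extra` threads get one more

    int begin = lb + tid * small + (tid < extra ? tid : extra);
    int count = small + (tid < extra ? 1 : 0);
    if (count == 0)
        return 0; // more threads than iterations

    *lower = begin;
    *upper = begin + count - 1;
    return 1;
}
```

For a loop over 0 to 9 on four threads, this partitioning assigns iterations 0 to 2, 3 to 5, 6 to 7 and 8 to 9. No thread needs to communicate with any other to find its block.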
|
|
|
| Dynamic scheduling is handled by the @ref __kmpc_dispatch_init_4 and @ref __kmpc_dispatch_next_4 functions.
|
| The init function is called once in each thread outside the loop, while the next function is called each
|
| time that the previous chunk of work has been exhausted.
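
The init/next protocol can be sketched with a small mock dispatcher. The names `mock_dispatch`, `mock_dispatch_init` and `mock_dispatch_next` are hypothetical, and the sketch makes a simplifying assumption: the dispatch state has a single caller. In a real runtime the state is shared by the whole team and handing out a chunk must be thread-safe.

```c
#include <assert.h>

// Hypothetical dispatch state for a unit-stride loop.
struct mock_dispatch {
    int next;  // first iteration not yet handed out
    int upper; // last iteration of the loop (inclusive)
    int chunk; // number of iterations per chunk
};

// Called once, before the loop: records the loop parameters.
static void mock_dispatch_init(struct mock_dispatch *d,
                               int lb, int ub, int chunk)
{
    assert(chunk > 0);
    d->next = lb;
    d->upper = ub;
    d->chunk = chunk;
}

// Called whenever the previous chunk is exhausted. Returns 1 and sets the
// inclusive bounds of the next chunk, or returns 0 when no work is left.
static int mock_dispatch_next(struct mock_dispatch *d, int *lower, int *upper)
{
    if (d->next > d->upper)
        return 0;
    *lower = d->next;
    *upper = d->next + d->chunk - 1;
    if (*upper > d->upper)
        *upper = d->upper; // the last chunk may be short
    d->next = *upper + 1;
    return 1;
}

// Usage in the shape of the generated code: an outer while loop fetches
// chunks, an inner for loop executes the iterations of each chunk.
static int sum_with_dynamic_chunks(int lb, int ub, int chunk)
{
    struct mock_dispatch d;
    int lower = 0, upper = 0, sum = 0;

    mock_dispatch_init(&d, lb, ub, chunk);
    while (mock_dispatch_next(&d, &lower, &upper))
        for (int i = lower; i <= upper; ++i)
            sum += i;
    return sum;
}
```

The outer `while` and inner `for` mirror the loop structure of the transformed code in the Work Sharing Example.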
|
|
|
| @defgroup SYNCHRONIZATION Synchronization
|
| These functions are used for implementing barriers.
|
|
|
| @defgroup THREADPRIVATE Thread private data support
|
| These functions support copyin/out and thread private data.
|
|
|
| @defgroup TASKING Tasking support
|
| These functions are used to implement tasking constructs.
|
|
|
| */
|
|
|