runtime/doc/doxygen/libomp_interface.h - openmp - Git at Google

 // This file does not contain any code; it just contains additional text and formatting
 // for doxygen.


 //===----------------------------------------------------------------------===//
 //
 //                     The LLVM Compiler Infrastructure
 //
 // This file is dual licensed under the MIT and the University of Illinois Open
 // Source Licenses. See LICENSE.txt for details.
 //
 //===----------------------------------------------------------------------===//


 /*! @mainpage Intel&reg;&nbsp; OpenMP* Runtime Library Interface
 @section sec_intro Introduction

 This document describes the interface provided by the
 Intel&reg;&nbsp;OpenMP\other runtime library to the compiler.
 Routines that are directly called as simple functions by user code are
 not currently described here, since their definition is in the OpenMP
 specification available from http://openmp.org

 The aim here is to explain the interface from the compiler to the runtime.

 The overall design is described, and each function in the interface
 has its own description. (At least, that's the ambition, we may not be there yet).

 @section sec_building Building the Runtime
 For the impatient, we cover building the runtime as the first topic here.

 A top-level Makefile is provided that attempts to derive a suitable
 configuration for the most commonly used environments.  To see the
 default settings, type:
 @code
 % make info
 @endcode

 You can change the Makefile's behavior with the following options:

  - <b>omp_root</b>:    The path to the top-level directory containing the top-level
              Makefile.  By default, this will take on the value of the
              current working directory.

  - <b>omp_os</b>:      Operating system.  By default, the build will attempt to
              detect this. Currently supports "linux", "macos", and
              "windows".

  - <b>arch</b>:        Architecture. By default, the build will attempt to
 	     detect this if not specified by the user. Currently
 	     supported values are
              - "32" for IA-32 architecture
              - "32e" for Intel&reg;&nbsp;64 architecture
              - "mic" for Intel&reg;&nbsp;Many Integrated Core Architecture (
              If "mic" is specified then "icc" will be used as the
              compiler, and appropriate k1om binutils will be used. The
              necessary packages must be installed on the build machine
              for this to be possible, but an
 	     Intel&reg;&nbsp;Xeon Phi&trade;&nbsp;
              coprocessor is not required to build the library).

  - <b>compiler</b>:    Which compiler to use for the build.  Defaults to "icc"
              or "icl" depending on the value of omp_os. Also supports
              "gcc" when omp_os is "linux" for gcc\other versions
              4.6.2 and higher. For icc on OS X\other, OS X\other versions
 	     greater than 10.6 are not supported currently. Also, icc
 	     version 13.0 is not supported. The selected compiler should be
              installed and in the user's path. The corresponding
              Fortran compiler should also be in the path.

  - <b>mode</b>:        Library mode: default is "release".  Also supports "debug".

 To use any of the options above, simple add &lt;option_name&gt;=&lt;value&gt;.  For
 example, if you want to build with gcc instead of icc, type:
 @code
 % make compiler=gcc
 @endcode

 Underneath the hood of the top-level Makefile, the runtime is built by
 a perl script that in turn drives a detailed runtime system make.  The
 script can be found at <tt>tools/build.pl</tt>, and will print
 information about all its flags and controls if invoked as
 @code
 % tools/build.pl --help
 @endcode

 If invoked with no arguments, it will try to build a set of libraries
 that are appropriate for the machine on which the build is happening.
 There are many options for building out of tree, and configuring library
 features that can also be used. Consult the <tt>--help</tt> output for details.

 @section sec_supported Supported RTL Build Configurations

 The architectures supported are IA-32 architecture, Intel&reg;&nbsp; 64, and
 Intel&reg;&nbsp; Many Integrated Core Architecture.  The build configurations
 supported are shown in the table below.

 <table border=1>
 <tr><th> <th>icc/icl<th>gcc
 <tr><td>Linux\other OS<td>Yes(1,5)<td>Yes(2,4)
 <tr><td>OS X\other<td>Yes(1,3,4)<td>No
 <tr><td>Windows\other OS<td>Yes(1,4)<td>No
 </table>
 (1) On IA-32 architecture and Intel&reg;&nbsp; 64, icc/icl versions 12.x
     are supported (12.1 is recommended).<br>
 (2) gcc version 4.6.2 is supported.<br>
 (3) For icc on OS X\other, OS X\other version 10.5.8 is supported.<br>
 (4) Intel&reg;&nbsp; Many Integrated Core Architecture not supported.<br>
 (5) On Intel&reg;&nbsp; Many Integrated Core Architecture, icc/icl versions 13.0 or later are required.

 @section sec_frontend Front-end Compilers that work with this RTL

 The following compilers are known to do compatible code generation for
 this RTL: icc/icl, gcc.  Code generation is discussed in more detail
 later in this document.

 @section sec_outlining Outlining

 The runtime interface is based on the idea that the compiler
 "outlines" sections of code that are to run in parallel into separate
 functions that can then be invoked in multiple threads.  For instance,
 simple code like this

 @code
 void foo()
 {
 #pragma omp parallel
     {
         ... do something ...
     }
 }
 @endcode
 is converted into something that looks conceptually like this (where
 the names used are merely illustrative; the real library function
 names will be used later after we've discussed some more issues...)

 @code
 static void outlinedFooBody()
 {
     ... do something ...
 }

 void foo()
 {
     __OMP_runtime_fork(outlinedFooBody, (void*)0);   // Not the real function name!
 }
 @endcode

 @subsection SEC_SHAREDVARS Addressing shared variables

 In real uses of the OpenMP\other API there are normally references
 from the outlined code  to shared variables that are in scope in the containing function.
 Therefore the containing function must be able to address
 these variables. The runtime supports two alternate ways of doing
 this.

 @subsubsection SEC_SEC_OT Current Technique
 The technique currently supported by the runtime library is to receive
 a separate pointer to each shared variable that can be accessed from
 the outlined function.  This is what is shown in the example below.

 We hope soon to provide an alternative interface to support the
 alternate implementation described in the next section. The
 alternative implementation has performance advantages for small
 parallel regions that have many shared variables.

 @subsubsection SEC_SEC_PT Future Technique
 The idea is to treat the outlined function as though it
 were a lexically nested function, and pass it a single argument which
 is the pointer to the parent's stack frame. Provided that the compiler
 knows the layout of the parent frame when it is generating the outlined
 function it can then access the up-level variables at appropriate
 offsets from the parent frame.  This is a classical compiler technique
 from the 1960s to support languages like Algol (and its descendants)
 that support lexically nested functions.

 The main benefit of this technique is that there is no code required
 at the fork point to marshal the arguments to the outlined function.
 Since the runtime knows statically how many arguments must be passed to the
 outlined function, it can easily copy them to the thread's stack
 frame.  Therefore the performance of the fork code is independent of
 the number of shared variables that are accessed by the outlined
 function.

 If it is hard to determine the stack layout of the parent while generating the
 outlined code, it is still possible to use this approach by collecting all of
 the variables in the parent that are accessed from outlined functions into
 a single `struct` which is placed on the stack, and whose address is passed
 to the outlined functions. In this way the offsets of the shared variables
 are known (since they are inside the struct) without needing to know
 the complete layout of the parent stack-frame. From the point of view
 of the runtime either of these techniques is equivalent, since in either
 case it only has to pass a single argument to the outlined function to allow
 it to access shared variables.

 A scheme like this is how gcc\other generates outlined functions.

 @section SEC_INTERFACES Library Interfaces
 The library functions used for specific parts of the OpenMP\other language implementation
 are documented in different modules.

  - @ref BASIC_TYPES fundamental types used by the runtime in many places
  - @ref DEPRECATED  functions that are in the library but are no longer required
  - @ref STARTUP_SHUTDOWN functions for initializing and finalizing the runtime
  - @ref PARALLEL functions for implementing `omp parallel`
  - @ref THREAD_STATES functions for supporting thread state inquiries
  - @ref WORK_SHARING functions for work sharing constructs such as `omp for`, `omp sections`
  - @ref THREADPRIVATE functions to support thread private data, copyin etc
  - @ref SYNCHRONIZATION functions to support `omp critical`, `omp barrier`, `omp master`, reductions etc
  - @ref ATOMIC_OPS functions to support atomic operations
  - Documentation on tasking has still to be written...

 @section SEC_EXAMPLES Examples
 @subsection SEC_WORKSHARING_EXAMPLE Work Sharing Example
 This example shows the code generated for a parallel for with reduction and dynamic scheduling.

 @code
 extern float foo( void );

 int main () {
     int i;
     float r = 0.0;
     #pragma omp parallel for schedule(dynamic) reduction(+:r)
     for ( i = 0; i < 10; i ++ ) {
         r += foo();
     }
 }
 @endcode

 The transformed code looks like this.
 @code
 extern float foo( void );

 int main () {
     static int zero = 0;
     auto int gtid;
     auto float r = 0.0;
     __kmpc_begin( & loc3, 0 );
     // The gtid is not actually required in this example so could be omitted;
     // We show its initialization here because it is often required for calls into
     // the runtime and should be locally cached like this.
     gtid = __kmpc_global thread num( & loc3 );
     __kmpc_fork call( & loc7, 1, main_7_parallel_3, & r );
     __kmpc_end( & loc0 );
     return 0;
 }

 struct main_10_reduction_t_5 { float r_10_rpr; };

 static kmp_critical_name lck = { 0 };
 static ident_t loc10; // loc10.flags should contain KMP_IDENT_ATOMIC_REDUCE bit set
                       // if compiler has generated an atomic reduction.

 void main_7_parallel_3( int *gtid, int *btid, float *r_7_shp ) {
     auto int i_7_pr;
     auto int lower, upper, liter, incr;
     auto struct main_10_reduction_t_5 reduce;
     reduce.r_10_rpr = 0.F;
     liter = 0;
     __kmpc_dispatch_init_4( & loc7,*gtid, 35, 0, 9, 1, 1 );
     while ( __kmpc_dispatch_next_4( & loc7, *gtid, & liter, & lower, & upper, & incr ) ) {
         for( i_7_pr = lower; upper >= i_7_pr; i_7_pr ++ )
           reduce.r_10_rpr += foo();
     }
     switch( __kmpc_reduce_nowait( & loc10, *gtid, 1, 4, & reduce, main_10_reduce_5, & lck ) ) {
         case 1:
            *r_7_shp += reduce.r_10_rpr;
            __kmpc_end_reduce_nowait( & loc10, *gtid, & lck );
            break;
         case 2:
            __kmpc_atomic_float4_add( & loc10, *gtid, r_7_shp, reduce.r_10_rpr );
            break;
         default:;
     }
 }

 void main_10_reduce_5( struct main_10_reduction_t_5 *reduce_lhs,
                        struct main_10_reduction_t_5 *reduce_rhs )
 {
     reduce_lhs->r_10_rpr += reduce_rhs->r_10_rpr;
 }
 @endcode

 @defgroup BASIC_TYPES Basic Types
 Types that are used throughout the runtime.

 @defgroup DEPRECATED Deprecated Functions
 Functions in this group are for backwards compatibility only, and
 should not be used in new code.

 @defgroup STARTUP_SHUTDOWN Startup and Shutdown
 These functions are for library initialization and shutdown.

 @defgroup PARALLEL Parallel (fork/join)
 These functions are used for implementing <tt>\#pragma omp parallel</tt>.

 @defgroup THREAD_STATES Thread Information
 These functions return information about the currently executing thread.

 @defgroup WORK_SHARING Work Sharing
 These functions are used for implementing
 <tt>\#pragma omp for</tt>, <tt>\#pragma omp sections</tt>, <tt>\#pragma omp single</tt> and
 <tt>\#pragma omp master</tt> constructs.

 When handling loops, there are different functions for each of the signed and unsigned 32 and 64 bit integer types
 which have the name suffixes `_4`, `_4u`, `_8` and `_8u`. The semantics of each of the functions is the same,
 so they are only described once.

 Static loop scheduling is handled by  @ref __kmpc_for_static_init_4 and friends. Only a single call is needed,
 since the iterations to be executed by any give thread can be determined as soon as the loop parameters are known.

 Dynamic scheduling is handled by the @ref __kmpc_dispatch_init_4 and @ref __kmpc_dispatch_next_4 functions.
 The init function is called once in each thread outside the loop, while the next function is called each
 time that the previous chunk of work has been exhausted.

 @defgroup SYNCHRONIZATION Synchronization
 These functions are used for implementing barriers.

 @defgroup THREADPRIVATE Thread private data support
 These functions support copyin/out and thread private data.

 @defgroup TASKING Tasking support
 These functions support are used to implement tasking constructs.

 */
	// This file does not contain any code; it just contains additional text and formatting
	// for doxygen.


	//===----------------------------------------------------------------------===//
	//
	// The LLVM Compiler Infrastructure
	//
	// This file is dual licensed under the MIT and the University of Illinois Open
	// Source Licenses. See LICENSE.txt for details.
	//
	//===----------------------------------------------------------------------===//


	/! @mainpage Intel®  OpenMP Runtime Library Interface
	@section sec_intro Introduction

	This document describes the interface provided by the
	Intel® OpenMP\other runtime library to the compiler.
	Routines that are directly called as simple functions by user code are
	not currently described here, since their definition is in the OpenMP
	specification available from http://openmp.org

	The aim here is to explain the interface from the compiler to the runtime.

	The overall design is described, and each function in the interface
	has its own description. (At least, that's the ambition, we may not be there yet).

	@section sec_building Building the Runtime
	For the impatient, we cover building the runtime as the first topic here.

	A top-level Makefile is provided that attempts to derive a suitable
	configuration for the most commonly used environments. To see the
	default settings, type:
	@code
	% make info
	@endcode

	You can change the Makefile's behavior with the following options:

	- <b>omp_root</b>: The path to the top-level directory containing the top-level
	Makefile. By default, this will take on the value of the
	current working directory.

	- <b>omp_os</b>: Operating system. By default, the build will attempt to
	detect this. Currently supports "linux", "macos", and
	"windows".

	- <b>arch</b>: Architecture. By default, the build will attempt to
	detect this if not specified by the user. Currently
	supported values are
	- "32" for IA-32 architecture
	- "32e" for Intel® 64 architecture
	- "mic" for Intel® Many Integrated Core Architecture (
	If "mic" is specified then "icc" will be used as the
	compiler, and appropriate k1om binutils will be used. The
	necessary packages must be installed on the build machine
	for this to be possible, but an
	Intel® Xeon Phi™
	coprocessor is not required to build the library).

	- <b>compiler</b>: Which compiler to use for the build. Defaults to "icc"
	or "icl" depending on the value of omp_os. Also supports
	"gcc" when omp_os is "linux" for gcc\other versions
	4.6.2 and higher. For icc on OS X\other, OS X\other versions
	greater than 10.6 are not supported currently. Also, icc
	version 13.0 is not supported. The selected compiler should be
	installed and in the user's path. The corresponding
	Fortran compiler should also be in the path.

	- <b>mode</b>: Library mode: default is "release". Also supports "debug".

	To use any of the options above, simple add <option_name>=<value>. For
	example, if you want to build with gcc instead of icc, type:
	@code
	% make compiler=gcc
	@endcode

	Underneath the hood of the top-level Makefile, the runtime is built by
	a perl script that in turn drives a detailed runtime system make. The
	script can be found at <tt>tools/build.pl</tt>, and will print
	information about all its flags and controls if invoked as
	@code
	% tools/build.pl --help
	@endcode

	If invoked with no arguments, it will try to build a set of libraries
	that are appropriate for the machine on which the build is happening.
	There are many options for building out of tree, and configuring library
	features that can also be used. Consult the <tt>--help</tt> output for details.

	@section sec_supported Supported RTL Build Configurations

	The architectures supported are IA-32 architecture, Intel®  64, and
	Intel®  Many Integrated Core Architecture. The build configurations
	supported are shown in the table below.

	<table border=1>
	<tr><th> <th>icc/icl<th>gcc
	<tr><td>Linux\other OS<td>Yes(1,5)<td>Yes(2,4)
	<tr><td>OS X\other<td>Yes(1,3,4)<td>No
	<tr><td>Windows\other OS<td>Yes(1,4)<td>No
	</table>
	(1) On IA-32 architecture and Intel®  64, icc/icl versions 12.x
	are supported (12.1 is recommended).<br>
	(2) gcc version 4.6.2 is supported.<br>
	(3) For icc on OS X\other, OS X\other version 10.5.8 is supported.<br>
	(4) Intel®  Many Integrated Core Architecture not supported.<br>
	(5) On Intel®  Many Integrated Core Architecture, icc/icl versions 13.0 or later are required.

	@section sec_frontend Front-end Compilers that work with this RTL

	The following compilers are known to do compatible code generation for
	this RTL: icc/icl, gcc. Code generation is discussed in more detail
	later in this document.

	@section sec_outlining Outlining

	The runtime interface is based on the idea that the compiler
	"outlines" sections of code that are to run in parallel into separate
	functions that can then be invoked in multiple threads. For instance,
	simple code like this

	@code
	void foo()
	{
	#pragma omp parallel
	{
	... do something ...
	}
	}
	@endcode
	is converted into something that looks conceptually like this (where
	the names used are merely illustrative; the real library function
	names will be used later after we've discussed some more issues...)

	@code
	static void outlinedFooBody()
	{
	... do something ...
	}

	void foo()
	{
	__OMP_runtime_fork(outlinedFooBody, (void*)0); // Not the real function name!
	}
	@endcode

	@subsection SEC_SHAREDVARS Addressing shared variables

	In real uses of the OpenMP\other API there are normally references
	from the outlined code to shared variables that are in scope in the containing function.
	Therefore the containing function must be able to address
	these variables. The runtime supports two alternate ways of doing
	this.

	@subsubsection SEC_SEC_OT Current Technique
	The technique currently supported by the runtime library is to receive
	a separate pointer to each shared variable that can be accessed from
	the outlined function. This is what is shown in the example below.

	We hope soon to provide an alternative interface to support the
	alternate implementation described in the next section. The
	alternative implementation has performance advantages for small
	parallel regions that have many shared variables.

	@subsubsection SEC_SEC_PT Future Technique
	The idea is to treat the outlined function as though it
	were a lexically nested function, and pass it a single argument which
	is the pointer to the parent's stack frame. Provided that the compiler
	knows the layout of the parent frame when it is generating the outlined
	function it can then access the up-level variables at appropriate
	offsets from the parent frame. This is a classical compiler technique
	from the 1960s to support languages like Algol (and its descendants)
	that support lexically nested functions.

	The main benefit of this technique is that there is no code required
	at the fork point to marshal the arguments to the outlined function.
	Since the runtime knows statically how many arguments must be passed to the
	outlined function, it can easily copy them to the thread's stack
	frame. Therefore the performance of the fork code is independent of
	the number of shared variables that are accessed by the outlined
	function.

	If it is hard to determine the stack layout of the parent while generating the
	outlined code, it is still possible to use this approach by collecting all of
	the variables in the parent that are accessed from outlined functions into
	a single `struct` which is placed on the stack, and whose address is passed
	to the outlined functions. In this way the offsets of the shared variables
	are known (since they are inside the struct) without needing to know
	the complete layout of the parent stack-frame. From the point of view
	of the runtime either of these techniques is equivalent, since in either
	case it only has to pass a single argument to the outlined function to allow
	it to access shared variables.

	A scheme like this is how gcc\other generates outlined functions.

	@section SEC_INTERFACES Library Interfaces
	The library functions used for specific parts of the OpenMP\other language implementation
	are documented in different modules.

	- @ref BASIC_TYPES fundamental types used by the runtime in many places
	- @ref DEPRECATED functions that are in the library but are no longer required
	- @ref STARTUP_SHUTDOWN functions for initializing and finalizing the runtime
	- @ref PARALLEL functions for implementing `omp parallel`
	- @ref THREAD_STATES functions for supporting thread state inquiries
	- @ref WORK_SHARING functions for work sharing constructs such as `omp for`, `omp sections`
	- @ref THREADPRIVATE functions to support thread private data, copyin etc
	- @ref SYNCHRONIZATION functions to support `omp critical`, `omp barrier`, `omp master`, reductions etc
	- @ref ATOMIC_OPS functions to support atomic operations
	- Documentation on tasking has still to be written...

	@section SEC_EXAMPLES Examples
	@subsection SEC_WORKSHARING_EXAMPLE Work Sharing Example
	This example shows the code generated for a parallel for with reduction and dynamic scheduling.

	@code
	extern float foo( void );

	int main () {
	int i;
	float r = 0.0;
	#pragma omp parallel for schedule(dynamic) reduction(+:r)
	for ( i = 0; i < 10; i ++ ) {
	r += foo();
	}
	}
	@endcode

	The transformed code looks like this.
	@code
	extern float foo( void );

	int main () {
	static int zero = 0;
	auto int gtid;
	auto float r = 0.0;
	__kmpc_begin( & loc3, 0 );
	// The gtid is not actually required in this example so could be omitted;
	// We show its initialization here because it is often required for calls into
	// the runtime and should be locally cached like this.
	gtid = __kmpc_global thread num( & loc3 );
	__kmpc_fork call( & loc7, 1, main_7_parallel_3, & r );
	__kmpc_end( & loc0 );
	return 0;
	}

	struct main_10_reduction_t_5 { float r_10_rpr; };

	static kmp_critical_name lck = { 0 };
	static ident_t loc10; // loc10.flags should contain KMP_IDENT_ATOMIC_REDUCE bit set
	// if compiler has generated an atomic reduction.

	void main_7_parallel_3( int gtid, int btid, float *r_7_shp ) {
	auto int i_7_pr;
	auto int lower, upper, liter, incr;
	auto struct main_10_reduction_t_5 reduce;
	reduce.r_10_rpr = 0.F;
	liter = 0;
	__kmpc_dispatch_init_4( & loc7,*gtid, 35, 0, 9, 1, 1 );
	while ( __kmpc_dispatch_next_4( & loc7, *gtid, & liter, & lower, & upper, & incr ) ) {
	for( i_7_pr = lower; upper >= i_7_pr; i_7_pr ++ )
	reduce.r_10_rpr += foo();
	}
	switch( __kmpc_reduce_nowait( & loc10, *gtid, 1, 4, & reduce, main_10_reduce_5, & lck ) ) {
	case 1:
	*r_7_shp += reduce.r_10_rpr;
	__kmpc_end_reduce_nowait( & loc10, *gtid, & lck );
	break;
	case 2:
	__kmpc_atomic_float4_add( & loc10, *gtid, r_7_shp, reduce.r_10_rpr );
	break;
	default:;
	}
	}

	void main_10_reduce_5( struct main_10_reduction_t_5 *reduce_lhs,
	struct main_10_reduction_t_5 *reduce_rhs )
	{
	reduce_lhs->r_10_rpr += reduce_rhs->r_10_rpr;
	}
	@endcode

	@defgroup BASIC_TYPES Basic Types
	Types that are used throughout the runtime.

	@defgroup DEPRECATED Deprecated Functions
	Functions in this group are for backwards compatibility only, and
	should not be used in new code.

	@defgroup STARTUP_SHUTDOWN Startup and Shutdown
	These functions are for library initialization and shutdown.

	@defgroup PARALLEL Parallel (fork/join)
	These functions are used for implementing <tt>\#pragma omp parallel</tt>.

	@defgroup THREAD_STATES Thread Information
	These functions return information about the currently executing thread.

	@defgroup WORK_SHARING Work Sharing
	These functions are used for implementing
	<tt>\#pragma omp for</tt>, <tt>\#pragma omp sections</tt>, <tt>\#pragma omp single</tt> and
	<tt>\#pragma omp master</tt> constructs.

	When handling loops, there are different functions for each of the signed and unsigned 32 and 64 bit integer types
	which have the name suffixes `_4`, `_4u`, `_8` and `_8u`. The semantics of each of the functions is the same,
	so they are only described once.

	Static loop scheduling is handled by @ref __kmpc_for_static_init_4 and friends. Only a single call is needed,
	since the iterations to be executed by any give thread can be determined as soon as the loop parameters are known.

	Dynamic scheduling is handled by the @ref __kmpc_dispatch_init_4 and @ref __kmpc_dispatch_next_4 functions.
	The init function is called once in each thread outside the loop, while the next function is called each
	time that the previous chunk of work has been exhausted.

	@defgroup SYNCHRONIZATION Synchronization
	These functions are used for implementing barriers.

	@defgroup THREADPRIVATE Thread private data support
	These functions support copyin/out and thread private data.

	@defgroup TASKING Tasking support
	These functions support are used to implement tasking constructs.

	*/