MicroBenchmarks/LCALS/README-LCALS_instructions.txt - llvm-test-suite - Git at Google

 //
 // See README-LCALS_license.txt for access and distribution restrictions
 //
 ================================================================================
 ================================================================================
 LCALS: Livermore Compiler Analysis Loop Suite
  by Rich Hornung (hornung1@llnl.gov),
     Center for Applied Scientific Computing,
     Lawrence Livermore National Laboratory
 ================================================================================
 ================================================================================

  o This code is under continuing development.  Go to http://codesign.llnl.gov
    to acquire the latest released version.

  o This loop suite is designed to measure performance for a variety of loops
    using different compilers and platforms.  In particular, the suite
    helps to understand compiler optimization, run-time performance issues,
    and platform capabilities.  The suite is also useful as a source of
    example code snippets for interactions with compiler developers.

  o The loops in the suite are partitioned into three subsets based on their
    origins (and also to avoid having them all in a single source file).  Each
    loop is implemented using multiple software constructs (i.e., referred
    to herein as "variants").  The three loop subsets are:

    -  Subset A: Loops representative of those found in application codes.
       They are implemented in source files named runA<variant>Loops.cxx.

    -  Subset B: Basic loops that help to illustrate compiler optimization
       issues. They are implemented in source files named runB<variant>Loops.cxx

    -  Subset C: Loops extracted from "Livermore Loops coded in C" developed by
       Steve Langer, which were derived from the Fortran version by Frank
       McMahon.  They are implemented in source files runC<variant>Loops.cxx

    Please see the contents of the loop source files to understand the
    differences among the variants.

  o New loops may be added to the suite by inserting them into appropriate
    loop source files and modifying a few other files that control suite
    execution and parametrization.  Details are provided below.

  o Various parameters can be adjusted to control how loops are defined and run.

    -- Each loop may be run with different loop lengths (currently up to three
       lengths for each loop) and will be sampled some number of times to
       generate execution timing data.  Loop length and sampling parameters
       may be modified to evaluate different platform performance
       characteristics.  Details are provided below.

  o Various run time statistics can be generated for analysis.  Currently,
    these include: min run time, max run time, average run time,
    standard deviation across run times, and average execution time relative
    to a reference loop variant.  Here, run time is the time required to
    execute the loop for one "sampling" pass through the suite.  See below.


 --------------------------------------------------------------------------------
 Loop kernels and variants:

  o Each loop in the suite is defined by its traditional C/C++ for-loop
    "kernel".  Then, each loop appears in multiple variants that use different
    programming and execution constructs.

  o Loops that emply traditional C/C++ for-loop syntax are referred to as
    "Raw" variants.  The "Raw" variant of each loop represents the version
    obtained from its original source, plus minor modifications necessary
    to plug into the loop suite framework.  For example, the loops in the
    runCRawLoops.cxx file are essentially verbatim from the Livermore Loops
    Coded in C" suite mentioned above.  Typically, the "Raw" loops serve as
    reference implemenation for runtime comparisons.

  o Other variants use loop traversal C++ template methods and represent the
    loop body as a lambda function or functor class.  One of the main goals
    of the suite is to assess how SIMD vectorization, OpenMP multithreading,
    etc. work with these different loop implementation choices.

    Note that only a subset of the loops in the suite appear in the OpenMP
    variants since many of the loops do not benefit from thread parallelism
    due to OpenMP overheads.  OpenMP loops are implmented in source files
    named runOMP<variant>Loops.cxx; in particular, they are not broken out
    into separate source files based on the subsets described above.

  o Although all loop bodies contain only C-syntax, the loop framework
    uses C++ classes and templates. So a C++ compiler is required to compile
    the code.  All C++ compilers should be able to compile the framework
    code and "Raw" loop variants.

  o Not all compilers implement the OpenMP standard. Thus, those loop variants
    may not be compiled and run depending on the compiler being used.

  o The intent of the C++ lambda and functor loop variants is to evaluate
    compilers in the context of C++ abstraction layers using template methods.
    Not all compilers support standard C++ lambda expressions at this time.
    Thus, the lambda variants of the loops may not be compiled and run
    depending on the compiler being used.


 ******************** Test Suite Note ***********************
 *                                                          *
 *     Below is the original build instructions, the        *
 *     test suite replaces this build system with the       *
 *     llvm test-suite CMake system.  The control of        *
 *     loop suite and timing has been altered to use        *
 *     the google benchmark library included in the         *
 *     MicroBenchmarks directory of the llvm test-suite.    *
 *                                                          *
 ************************************************************
 --------------------------------------------------------------------------------
 Compiling and running the loop suite:

  The loop suite is typically compiled by typing 'make' and then executed as

     ./lcals.exe <optional output directory>

  o The executable generated by the Makefile accepts an optional argument
    which is the name of a directory for placing output files that contain
    detailed timing, checksum, and FOM (when specified) results.  Some of
    these files provide a summary of loop suite performance.  Othere
    contain subsets of this information in comma-delimited text files that may
    be imported into Microsoft Excel to generate spreadsheets and plots.
    When no output directory is given, a summary of the results is printed
    to standard output.

  o LCALS is highly parametrized to explore many compilation and execution
    options. Exercising the full range of options can be achieved by making
    straightforward modifications in a few files, as describe below:

    -- Makefile: This file contains a simple build system for the code.
                 It has a variety of configurations for current LLNL
                 computing systems. Building for other platforms or changing
                 any compiler options can done by modifying this file.

    -- LCALS_rules.mk: This file contains "-D" compilation options that
                 conrol some aspects of LCALS parametrization. The effect of
                 these options is described in the comments in this file.
                 It is also helpful to see how they are used in the
                 LCALSParams.hxx file.

    -- main.cxx: The main program determines many of the LCALS execution
                 options, such as which loops are run (kernels and variants).

    -- LCALSSuite.cxx: The routine defineLoopSuiteRunInfo() in this file
                 defines loop lengths and sampling parameters for each loop
                 in the suite.  It also defines loop weights used in Figure
                 of Merit (FOM) calculations.

    -- LCALSSuite.hxx: This file contains '#define' preprocessor directives
                 that can be used to turn on/off compilation of individual
                 loop kernels and loop variants in the suite. This can be
                 helpful for generating assembly code in small doses.

  o Details on many of these items are given in the next section.


 --------------------------------------------------------------------------------
 Controlling loop suite execution and timing output:

  o The execution of the loop suite follows the pattern described here:

    Iterate over specified number of passes through the loop suite {

       Iterate over specified loop variants to run {

          Iterate over loop lengths to run (e.g., long, medium, short) {

             Iterate over each loop specified to run {

                TIMER_START()
                Iterate over specified number of samples (for loop and length) {

                   Execute loop variant and length.

                }
                TIMER_STOP()

             }  // end iteration over loops to run

          } // end iteration over loop lengths

       } // end iteration over loop variants

    } // end iteration over suite passes

  o The loop suite is parametrized so that its execution may be controlled
    by editing various items in a small number of source and header files
    as described below:

    -- Set number of passes through the suite by setting the variable
       'num_suite_passes' in main.cxx.

    -- Set loop variants to run by adding the corresponding enumeration
       constants to the vector 'run_variants' in main.cxx.  To prevent a
       variant from running, simply comment out the line which adds the
       corresponding enum value to the vector.

       NOTE: The first entry added this array indicates the reference variant
             for relative execution time statistics.

       NOTE: An additional argument may be given to the exectuable to run
             loops outside of the standard LCASL benchmark.  This requires
             that "BUILD_MISC" is defined in the Makefile.

    -- Set which loop lengths to run by setting the appropriate entry in
       the array 'run_loop_length' in main.cxx (true/false for each length).

    -- Set which loop kernels will run be setting entries in the array
       'run_loop' in main.cxx (true/false for each loop).

    -- The lengths and number of samples per pass for each loop are set
       in the routine defineLoopSuiteRunInfo() in LCALSSuite.cxx.

       NOTE: The "samples per pass" values for each loop were determined
       manually to give approximately 1 second of execution time for its
       serial raw variant on an Intel ES-2670 node. To reduce or increase the
       total suite execution time, or change the loop lengths used, change
       the 'sample_frac' and/or 'loop_length_factor' variables in
       main.cxx.  All default loop lengths will be multiplied by the
       loop_length_factor value.  The sample count for each loop will be
       multiplied by sample_frac/loop_length_factor.

    -- The "LoopKernelID" and "LoopLength" enumeration types in the file
       LCALSSuite.hxx are used to identify loops and loop lengths
       in the suite.  Macros are also provided in that file to conditionally
       compile each loop in the suite.

       The way in which the loops are compiled can influence execution times.
       For example, some compilers perform optimizations for loops compiled
       individually that they do not perform when the same loop is compiled as
       part of a larger suite.

  o All loop forms use the same data arrays, which are pre-allocated based
    on the loop lengths.  To help with SIMD vectorization and ensure corretness
    data arrays are allocated to be aligned width SIMD vector unit boundaries.
    This can be changed by setting the 'LCALS_DATA_ALIGN' constant in the
    file LCALSParams.hxx.

  o To minimize the effects of execution of each loop on the others,
    data caches are flushed before each loop is run.

    -- Data cache size is set for some LLNL platforms based on hostname.
       If unknown, a warning message will appear when loop suite is run.
       Please edit main.cxx to set the largest data cache size for other
       platforms.

  o A simple checksum mechanism is provided to verify that different variants
    of each loop, and implementation changes made to individual loops, generate
    the same numerical results.  "-D" compiler options are provided in the
    LCALS_rules.mk file to control this behavior.  Note that certain levels
    and types of compiler optimization will cause slight differences in
    checksums due to changes in operation order, for example.  Thus, the
    checksums may only be a qualitative indicator of correct execution.

    -- Note that the routines loopInit() and loopFinalize() in LCALSSuite.cxx
       initialize data and compute result checksums for each loop.  These
       must remain consistent with the data used in each loop for correctness.


  o There are two mechanisms available to generate execution timing data for
    loops in the suite.  The choice is made by defining/undefining the
    associated "-D" option in the LCALS_rules.mk file.  See that file for
    more information.


 --------------------------------------------------------------------------------
 Figures of Merit:

  o The program output includes a Figure of Merit (FOM) value for each loop
    variant and loop length that is executed.  The intent of the FOM is to
    complement execution timing data with another measure of performance and
    compiler optimization.  Using the FOM values and total loop suite execution
    time information in the Figure of Merit report, one can compare different
    compilers' abilities to optimize on a given platform, performance of
    different optimization levels for a given compiler, or potential performance
    of different architectures, etc.

  o In the FOM calculation, execution time for each loop is weighted by a
    factor defined in the loop setup routines.  The loops are partitioned into
    six classes depending on their structure; e.g., data-parallel, order-
    dependent, etc.  The weight for each loop class indicates its relative
    importance based on code constructs we want the suite to emphasize
    and how easy we think it should be for a compiler to optimize.  Each loop
    in the suite is given a weight, w_i (i is the loop id), based on which
    class it exists in.  Loop classes and weights are defined in the file
    LCALSSuite.cxx.

  o The FOM is calculated as follows.

    - Relative FOM (FOM_rel).  The aim of the FOM_rel value is to measure
      a compiler's ability to optimize different loop constructs.

      -- When the code is executed, a reference loop execution time, t_ref, is
         computed using a loop that any compiler should be able to optimize
         well and which should run faster than any loop in the suite.
         To help insure this, two simple loops are run, an element-wise vector
         product and a vector dot product. Then, t_ref is the minimum execution
         time between the two.

      -- After the suite is run, FOM_rel is calulated as:

         FOM_rel = W * t_ref / Sum_i [ w_i * t_i ]

         The denominator is a weighted sum of execution times for the loops
         that were run; t_i is the run time for loop i.  W = Sum_i ( w_i ) is
         the sum of loop weights.

      -- Note that FOM_rel is a dimensionless quantity that satisfies
         0 <= FOM_rel <= 1, and FOM_rel increases as loop execution times
         decrease.  In the ideal case, where each loop executes as fast as the
         reference loop (which should be impossible), t_i = t_ref for each i.
         So FOM_rel = 1.
	//
	// See README-LCALS_license.txt for access and distribution restrictions
	//
	================================================================================
	================================================================================
	LCALS: Livermore Compiler Analysis Loop Suite
	by Rich Hornung (hornung1@llnl.gov),
	Center for Applied Scientific Computing,
	Lawrence Livermore National Laboratory
	================================================================================
	================================================================================

	o This code is under continuing development. Go to http://codesign.llnl.gov
	to acquire the latest released version.

	o This loop suite is designed to measure performance for a variety of loops
	using different compilers and platforms. In particular, the suite
	helps to understand compiler optimization, run-time performance issues,
	and platform capabilities. The suite is also useful as a source of
	example code snippets for interactions with compiler developers.

	o The loops in the suite are partitioned into three subsets based on their
	origins (and also to avoid having them all in a single source file). Each
	loop is implemented using multiple software constructs (i.e., referred
	to herein as "variants"). The three loop subsets are:

	- Subset A: Loops representative of those found in application codes.
	They are implemented in source files named runA<variant>Loops.cxx.

	- Subset B: Basic loops that help to illustrate compiler optimization
	issues. They are implemented in source files named runB<variant>Loops.cxx

	- Subset C: Loops extracted from "Livermore Loops coded in C" developed by
	Steve Langer, which were derived from the Fortran version by Frank
	McMahon. They are implemented in source files runC<variant>Loops.cxx

	Please see the contents of the loop source files to understand the
	differences among the variants.

	o New loops may be added to the suite by inserting them into appropriate
	loop source files and modifying a few other files that control suite
	execution and parametrization. Details are provided below.

	o Various parameters can be adjusted to control how loops are defined and run.

	-- Each loop may be run with different loop lengths (currently up to three
	lengths for each loop) and will be sampled some number of times to
	generate execution timing data. Loop length and sampling parameters
	may be modified to evaluate different platform performance
	characteristics. Details are provided below.

	o Various run time statistics can be generated for analysis. Currently,
	these include: min run time, max run time, average run time,
	standard deviation across run times, and average execution time relative
	to a reference loop variant. Here, run time is the time required to
	execute the loop for one "sampling" pass through the suite. See below.


	--------------------------------------------------------------------------------
	Loop kernels and variants:

	o Each loop in the suite is defined by its traditional C/C++ for-loop
	"kernel". Then, each loop appears in multiple variants that use different
	programming and execution constructs.

	o Loops that emply traditional C/C++ for-loop syntax are referred to as
	"Raw" variants. The "Raw" variant of each loop represents the version
	obtained from its original source, plus minor modifications necessary
	to plug into the loop suite framework. For example, the loops in the
	runCRawLoops.cxx file are essentially verbatim from the Livermore Loops
	Coded in C" suite mentioned above. Typically, the "Raw" loops serve as
	reference implemenation for runtime comparisons.

	o Other variants use loop traversal C++ template methods and represent the
	loop body as a lambda function or functor class. One of the main goals
	of the suite is to assess how SIMD vectorization, OpenMP multithreading,
	etc. work with these different loop implementation choices.

	Note that only a subset of the loops in the suite appear in the OpenMP
	variants since many of the loops do not benefit from thread parallelism
	due to OpenMP overheads. OpenMP loops are implmented in source files
	named runOMP<variant>Loops.cxx; in particular, they are not broken out
	into separate source files based on the subsets described above.

	o Although all loop bodies contain only C-syntax, the loop framework
	uses C++ classes and templates. So a C++ compiler is required to compile
	the code. All C++ compilers should be able to compile the framework
	code and "Raw" loop variants.

	o Not all compilers implement the OpenMP standard. Thus, those loop variants
	may not be compiled and run depending on the compiler being used.

	o The intent of the C++ lambda and functor loop variants is to evaluate
	compilers in the context of C++ abstraction layers using template methods.
	Not all compilers support standard C++ lambda expressions at this time.
	Thus, the lambda variants of the loops may not be compiled and run
	depending on the compiler being used.


	****************** Test Suite Note *********************
	* *
	* Below is the original build instructions, the *
	* test suite replaces this build system with the *
	* llvm test-suite CMake system. The control of *
	* loop suite and timing has been altered to use *
	* the google benchmark library included in the *
	* MicroBenchmarks directory of the llvm test-suite. *
	* *
	************************************************************
	--------------------------------------------------------------------------------
	Compiling and running the loop suite:

	The loop suite is typically compiled by typing 'make' and then executed as

	./lcals.exe <optional output directory>

	o The executable generated by the Makefile accepts an optional argument
	which is the name of a directory for placing output files that contain
	detailed timing, checksum, and FOM (when specified) results. Some of
	these files provide a summary of loop suite performance. Othere
	contain subsets of this information in comma-delimited text files that may
	be imported into Microsoft Excel to generate spreadsheets and plots.
	When no output directory is given, a summary of the results is printed
	to standard output.

	o LCALS is highly parametrized to explore many compilation and execution
	options. Exercising the full range of options can be achieved by making
	straightforward modifications in a few files, as describe below:

	-- Makefile: This file contains a simple build system for the code.
	It has a variety of configurations for current LLNL
	computing systems. Building for other platforms or changing
	any compiler options can done by modifying this file.

	-- LCALS_rules.mk: This file contains "-D" compilation options that
	conrol some aspects of LCALS parametrization. The effect of
	these options is described in the comments in this file.
	It is also helpful to see how they are used in the
	LCALSParams.hxx file.

	-- main.cxx: The main program determines many of the LCALS execution
	options, such as which loops are run (kernels and variants).

	-- LCALSSuite.cxx: The routine defineLoopSuiteRunInfo() in this file
	defines loop lengths and sampling parameters for each loop
	in the suite. It also defines loop weights used in Figure
	of Merit (FOM) calculations.

	-- LCALSSuite.hxx: This file contains '#define' preprocessor directives
	that can be used to turn on/off compilation of individual
	loop kernels and loop variants in the suite. This can be
	helpful for generating assembly code in small doses.

	o Details on many of these items are given in the next section.


	--------------------------------------------------------------------------------
	Controlling loop suite execution and timing output:

	o The execution of the loop suite follows the pattern described here:

	Iterate over specified number of passes through the loop suite {

	Iterate over specified loop variants to run {

	Iterate over loop lengths to run (e.g., long, medium, short) {

	Iterate over each loop specified to run {

	TIMER_START()
	Iterate over specified number of samples (for loop and length) {

	Execute loop variant and length.

	}
	TIMER_STOP()

	} // end iteration over loops to run

	} // end iteration over loop lengths

	} // end iteration over loop variants

	} // end iteration over suite passes

	o The loop suite is parametrized so that its execution may be controlled
	by editing various items in a small number of source and header files
	as described below:

	-- Set number of passes through the suite by setting the variable
	'num_suite_passes' in main.cxx.

	-- Set loop variants to run by adding the corresponding enumeration
	constants to the vector 'run_variants' in main.cxx. To prevent a
	variant from running, simply comment out the line which adds the
	corresponding enum value to the vector.

	NOTE: The first entry added this array indicates the reference variant
	for relative execution time statistics.

	NOTE: An additional argument may be given to the exectuable to run
	loops outside of the standard LCASL benchmark. This requires
	that "BUILD_MISC" is defined in the Makefile.

	-- Set which loop lengths to run by setting the appropriate entry in
	the array 'run_loop_length' in main.cxx (true/false for each length).

	-- Set which loop kernels will run be setting entries in the array
	'run_loop' in main.cxx (true/false for each loop).

	-- The lengths and number of samples per pass for each loop are set
	in the routine defineLoopSuiteRunInfo() in LCALSSuite.cxx.

	NOTE: The "samples per pass" values for each loop were determined
	manually to give approximately 1 second of execution time for its
	serial raw variant on an Intel ES-2670 node. To reduce or increase the
	total suite execution time, or change the loop lengths used, change
	the 'sample_frac' and/or 'loop_length_factor' variables in
	main.cxx. All default loop lengths will be multiplied by the
	loop_length_factor value. The sample count for each loop will be
	multiplied by sample_frac/loop_length_factor.

	-- The "LoopKernelID" and "LoopLength" enumeration types in the file
	LCALSSuite.hxx are used to identify loops and loop lengths
	in the suite. Macros are also provided in that file to conditionally
	compile each loop in the suite.

	The way in which the loops are compiled can influence execution times.
	For example, some compilers perform optimizations for loops compiled
	individually that they do not perform when the same loop is compiled as
	part of a larger suite.

	o All loop forms use the same data arrays, which are pre-allocated based
	on the loop lengths. To help with SIMD vectorization and ensure corretness
	data arrays are allocated to be aligned width SIMD vector unit boundaries.
	This can be changed by setting the 'LCALS_DATA_ALIGN' constant in the
	file LCALSParams.hxx.

	o To minimize the effects of execution of each loop on the others,
	data caches are flushed before each loop is run.

	-- Data cache size is set for some LLNL platforms based on hostname.
	If unknown, a warning message will appear when loop suite is run.
	Please edit main.cxx to set the largest data cache size for other
	platforms.

	o A simple checksum mechanism is provided to verify that different variants
	of each loop, and implementation changes made to individual loops, generate
	the same numerical results. "-D" compiler options are provided in the
	LCALS_rules.mk file to control this behavior. Note that certain levels
	and types of compiler optimization will cause slight differences in
	checksums due to changes in operation order, for example. Thus, the
	checksums may only be a qualitative indicator of correct execution.

	-- Note that the routines loopInit() and loopFinalize() in LCALSSuite.cxx
	initialize data and compute result checksums for each loop. These
	must remain consistent with the data used in each loop for correctness.


	o There are two mechanisms available to generate execution timing data for
	loops in the suite. The choice is made by defining/undefining the
	associated "-D" option in the LCALS_rules.mk file. See that file for
	more information.


	--------------------------------------------------------------------------------
	Figures of Merit:

	o The program output includes a Figure of Merit (FOM) value for each loop
	variant and loop length that is executed. The intent of the FOM is to
	complement execution timing data with another measure of performance and
	compiler optimization. Using the FOM values and total loop suite execution
	time information in the Figure of Merit report, one can compare different
	compilers' abilities to optimize on a given platform, performance of
	different optimization levels for a given compiler, or potential performance
	of different architectures, etc.

	o In the FOM calculation, execution time for each loop is weighted by a
	factor defined in the loop setup routines. The loops are partitioned into
	six classes depending on their structure; e.g., data-parallel, order-
	dependent, etc. The weight for each loop class indicates its relative
	importance based on code constructs we want the suite to emphasize
	and how easy we think it should be for a compiler to optimize. Each loop
	in the suite is given a weight, w_i (i is the loop id), based on which
	class it exists in. Loop classes and weights are defined in the file
	LCALSSuite.cxx.

	o The FOM is calculated as follows.

	- Relative FOM (FOM_rel). The aim of the FOM_rel value is to measure
	a compiler's ability to optimize different loop constructs.

	-- When the code is executed, a reference loop execution time, t_ref, is
	computed using a loop that any compiler should be able to optimize
	well and which should run faster than any loop in the suite.
	To help insure this, two simple loops are run, an element-wise vector
	product and a vector dot product. Then, t_ref is the minimum execution
	time between the two.

	-- After the suite is run, FOM_rel is calulated as:

	FOM_rel = W * t_ref / Sum_i [ w_i * t_i ]

	The denominator is a weighted sum of execution times for the loops
	that were run; t_i is the run time for loop i. W = Sum_i ( w_i ) is
	the sum of loop weights.

	-- Note that FOM_rel is a dimensionless quantity that satisfies
	0 <= FOM_rel <= 1, and FOM_rel increases as loop execution times
	decrease. In the ideal case, where each loop executes as fast as the
	reference loop (which should be impossible), t_i = t_ref for each i.
	So FOM_rel = 1.