MultiSource/Applications/aha/read.me - llvm-test-suite - Git at Google

                          Effect of Improvements

 Changing from calculating the correct answer for each new program to
 calculating them in advance and storing in a table, reduced the
 execution time by about 2.7%.

 An 8% improvement resulted from adding the "commutative" bit to the five
 commutative operations (add, mul, and, or, xor).  Perhaps more
 importantly, it reduced the printout of essentially duplicate solutions.

 An improvement by a factor of 2.58 (25.3/9.8) resulted from ensuring
 that the last register operand of the last instruction, when this
 instruction is created, refers to the result of the immediately
 preceding instruction.

 Continued the above idea for other register operands, i.e., ensured that
 SOME operand of the last instruction always refers to the result of the
 immediately preceding instruction.  Got an improvement by a factor of
 1.04 (9.8/9.4).

 3/16/02:  Got a factor of 1.85 by having it simulate the program only
 from the last changed instruction to the end, which means that usually
 only the last instruction is simulated.  Also, changed the trial
 value(s) so they "stick" at the last failed one(s).  When a trial value
 is changed, which happens after a success, the whole program must be
 simulated.

 3/16/02:  Got a factor of 1.010 by moving the assignment to
 computed_result inside the loop just ahead of where it was.  (The loop
 is usually executed only once.)

 3/16/02:  Got a factor of 1.020 by computing corr_res only when sticky_i
 and/or sticky_j change.

 3/17/02:  Tried making "numi" a constant defined with #define.  Got a 5%
 improvement.  Decided not to do this.

 3/19/02:  Got a factor of 1.166 by inlining "increment."

 3/23/02:  Took 1614 secs (26.9 min) to search with numi = 4.

 9/19/02:  Got a factor of 1.131 by requiring that immediate values be in
 the order 0, -1, 1, ... and using isa.opndstart[3] to avoid certain
 silly cases like ADD of 0, ADD of -1 (we do a subtract of 1), AND of 0
 or -1, etc.  This was made kind of necessary because the compare ops
 should have an immediate value of 0 as a possibility, whereas for most
 other ops, immediate 0 would never be used.

 9/22/02:  Changed shift immediate amounts to be given in an array
 (shimmed), so that fewer than 31 values can be specified.  This gave no
 change to execution time if all 31 are specified (1.222 secs for
 absolute value problem on a basic RISC, running on my 1.8 mHz Thinkpad).
 If only 4 values are specified, e.g. 1, 2, 30, and 31, the execution
 dropped to 0.450 secs, a factor of 2.71 improvement.
    I don't quite understand this, because the number of evaluations of
 the third instruction reduced from 14.2 million to 2.74 million, a
 factor of 5.18.
    It's partly explained by the program load time.  If you run aha with
 an argument of 1, it takes 0.140 secs.  Thus the time to start and end
 the program is about that amount.  So the ration of actual execution
 time is (1.222 - 0.140)/(0.450 - 0.140) = 3.49.  Closer to 5.18, but not
 very close.

 9/24/02:  Put ALL immediates (both ordinary and shift amounts) in the
 registers.  This did not affect the execution time (if compiled -O2),
 but it allowed deleting the operand type info in the isa table, and
 simplified the code a little (by 77 lines).
    Execution time for the standard run is now 0.591 secs on my 667 mHz
 machine (compiled -O2, which I guess I'll use from now on).

 9/25/02:  Made Aha! measure and print its own execution time, using
 clock().  I believe this is user + system time for the Aha! process,
 rather than wall clock time.  Found that -O2 and -O3 make no difference;
 the assembly language files search.s and check.s are identical.  Am
 using -O2.  The standard job runs in from 0.520 to 0.540 seconds process
 time on my 667 mHz office machine.
    The number of instruction evaluations is 62248 + 82618 + 2743328 (for
 the first, second, and third instruction resp.), or 2888194 total.  This
 corresponds to 122 cycles per evaluation.

 9/30/02:  Before today, the program consisted of three .cc files:  aha,
 search, and check.  Made it all one file (aha.cc), mainly because of
 problems with C in defining a preset array of values and not requiring
 the user to also set a variable equal to the number of values in the
 array.  No change to execution time.  Build (mainly compilation) time
 dropped from about 2.0 secs to 1.2 secs.  This change also permits
 inlining fix_operands, but trying that did not change execution time
 measurably (so it is not inlined now).

 10/14/02:  Before today, the incrementing of instructions was done with
 the rightmost operand varying most rapidly.  Today it was changed so
 that the leftmost operand varies most rapidly.  This simplifies the
 handling of commutative ops, and permits a few other minor
 simplifications.
    This gave a factor of 1.05 improvement in execution time.  Quite
 minor, but the program is a little simpler and I think it will simplify
 more complicated optimizations that may be done, such as (somehow)
 avoiding programs that have an instruction whose result is unused.
    A preliminary investigation of this shows that for a typical RISC
 instruction set, 39% of three-instruction programs have an unused
 result, and 70% of four-instruction programs have an unused result.
 This is compared to the present program, which ensures only that the
 second from last computed result does not go unused.  Thus there is hay
 to be made here.
    An attempt to skip ALL these silly programs resulted in a net
 increase in execution time, because it was implemented inefficiently.
 It seems to be hard to devise an efficient way to do this.  Some
 compromise might be practical, such as ensuring only that the second and
 third from last results are not both unused.

 10/15/02:  Changed the program as just mentioned, i.e., to ensure that
 instruction n (the last) uses the result of instruction n-1 and, if
 instruction n-1 does not use the result of instruction n-2, then the
 last instruction does.  This improved execution time by a factor of 1.4
 for three-instruction programs, and a factor of 1.8 for four-instruction
 programs.

 4/22/03: Ran Aha! on a two-input problem with n = 5 and 17 instructions
 enabled. Was searching for 5-instruction programs to compute the average
 of two signed integers (without overflowing). Shut it off after 144
 hours (6 days). I should make it display a "progress report" for such
 long jobs, such as printing out the first instruction in the list each
 time a new opcode is selected for it. Otherwise, you don't know if it
 somehow got into an infinite loop and you have no idea how long the run
 will take.
	Effect of Improvements

	Changing from calculating the correct answer for each new program to
	calculating them in advance and storing in a table, reduced the
	execution time by about 2.7%.

	An 8% improvement resulted from adding the "commutative" bit to the five
	commutative operations (add, mul, and, or, xor). Perhaps more
	importantly, it reduced the printout of essentially duplicate solutions.

	An improvement by a factor of 2.58 (25.3/9.8) resulted from ensuring
	that the last register operand of the last instruction, when this
	instruction is created, refers to the result of the immediately
	preceding instruction.

	Continued the above idea for other register operands, i.e., ensured that
	SOME operand of the last instruction always refers to the result of the
	immediately preceding instruction. Got an improvement by a factor of
	1.04 (9.8/9.4).

	3/16/02: Got a factor of 1.85 by having it simulate the program only
	from the last changed instruction to the end, which means that usually
	only the last instruction is simulated. Also, changed the trial
	value(s) so they "stick" at the last failed one(s). When a trial value
	is changed, which happens after a success, the whole program must be
	simulated.

	3/16/02: Got a factor of 1.010 by moving the assignment to
	computed_result inside the loop just ahead of where it was. (The loop
	is usually executed only once.)

	3/16/02: Got a factor of 1.020 by computing corr_res only when sticky_i
	and/or sticky_j change.

	3/17/02: Tried making "numi" a constant defined with #define. Got a 5%
	improvement. Decided not to do this.

	3/19/02: Got a factor of 1.166 by inlining "increment."

	3/23/02: Took 1614 secs (26.9 min) to search with numi = 4.

	9/19/02: Got a factor of 1.131 by requiring that immediate values be in
	the order 0, -1, 1, ... and using isa.opndstart[3] to avoid certain
	silly cases like ADD of 0, ADD of -1 (we do a subtract of 1), AND of 0
	or -1, etc. This was made kind of necessary because the compare ops
	should have an immediate value of 0 as a possibility, whereas for most
	other ops, immediate 0 would never be used.

	9/22/02: Changed shift immediate amounts to be given in an array
	(shimmed), so that fewer than 31 values can be specified. This gave no
	change to execution time if all 31 are specified (1.222 secs for
	absolute value problem on a basic RISC, running on my 1.8 mHz Thinkpad).
	If only 4 values are specified, e.g. 1, 2, 30, and 31, the execution
	dropped to 0.450 secs, a factor of 2.71 improvement.
	I don't quite understand this, because the number of evaluations of
	the third instruction reduced from 14.2 million to 2.74 million, a
	factor of 5.18.
	It's partly explained by the program load time. If you run aha with
	an argument of 1, it takes 0.140 secs. Thus the time to start and end
	the program is about that amount. So the ration of actual execution
	time is (1.222 - 0.140)/(0.450 - 0.140) = 3.49. Closer to 5.18, but not
	very close.

	9/24/02: Put ALL immediates (both ordinary and shift amounts) in the
	registers. This did not affect the execution time (if compiled -O2),
	but it allowed deleting the operand type info in the isa table, and
	simplified the code a little (by 77 lines).
	Execution time for the standard run is now 0.591 secs on my 667 mHz
	machine (compiled -O2, which I guess I'll use from now on).

	9/25/02: Made Aha! measure and print its own execution time, using
	clock(). I believe this is user + system time for the Aha! process,
	rather than wall clock time. Found that -O2 and -O3 make no difference;
	the assembly language files search.s and check.s are identical. Am
	using -O2. The standard job runs in from 0.520 to 0.540 seconds process
	time on my 667 mHz office machine.
	The number of instruction evaluations is 62248 + 82618 + 2743328 (for
	the first, second, and third instruction resp.), or 2888194 total. This
	corresponds to 122 cycles per evaluation.

	9/30/02: Before today, the program consisted of three .cc files: aha,
	search, and check. Made it all one file (aha.cc), mainly because of
	problems with C in defining a preset array of values and not requiring
	the user to also set a variable equal to the number of values in the
	array. No change to execution time. Build (mainly compilation) time
	dropped from about 2.0 secs to 1.2 secs. This change also permits
	inlining fix_operands, but trying that did not change execution time
	measurably (so it is not inlined now).

	10/14/02: Before today, the incrementing of instructions was done with
	the rightmost operand varying most rapidly. Today it was changed so
	that the leftmost operand varies most rapidly. This simplifies the
	handling of commutative ops, and permits a few other minor
	simplifications.
	This gave a factor of 1.05 improvement in execution time. Quite
	minor, but the program is a little simpler and I think it will simplify
	more complicated optimizations that may be done, such as (somehow)
	avoiding programs that have an instruction whose result is unused.
	A preliminary investigation of this shows that for a typical RISC
	instruction set, 39% of three-instruction programs have an unused
	result, and 70% of four-instruction programs have an unused result.
	This is compared to the present program, which ensures only that the
	second from last computed result does not go unused. Thus there is hay
	to be made here.
	An attempt to skip ALL these silly programs resulted in a net
	increase in execution time, because it was implemented inefficiently.
	It seems to be hard to devise an efficient way to do this. Some
	compromise might be practical, such as ensuring only that the second and
	third from last results are not both unused.

	10/15/02: Changed the program as just mentioned, i.e., to ensure that
	instruction n (the last) uses the result of instruction n-1 and, if
	instruction n-1 does not use the result of instruction n-2, then the
	last instruction does. This improved execution time by a factor of 1.4
	for three-instruction programs, and a factor of 1.8 for four-instruction
	programs.

	4/22/03: Ran Aha! on a two-input problem with n = 5 and 17 instructions
	enabled. Was searching for 5-instruction programs to compute the average
	of two signed integers (without overflowing). Shut it off after 144
	hours (6 days). I should make it display a "progress report" for such
	long jobs, such as printing out the first instruction in the list each
	time a new opcode is selected for it. Otherwise, you don't know if it
	somehow got into an infinite loop and you have no idea how long the run
	will take.