| llvm-mca - LLVM Machine Code Analyzer |
| ------------------------------------- |
| |
| llvm-mca is a performance analysis tool that uses information which is already |
| available in LLVM (e.g., scheduling models) to statically measure the |
| performance of machine code on a specific CPU. |
| |
| Performance is measured in terms of throughput as well as processor resource |
| consumption. The tool currently works for processors with an out-of-order |
| backend, for which there is a scheduling model available in LLVM. |
| |
| The main goal of this tool is not just to predict the performance of the code |
| when run on the target, but also to help with diagnosing potential performance |
| issues. |
| |
| Given an assembly code sequence, llvm-mca estimates the IPC (Instructions Per |
| Cycle), as well as hardware resource pressure. The analysis and reporting style |
| were inspired by the IACA tool from Intel. |
| |
| The presence of long data dependency chains, as well as poor usage of hardware |
| resources may lead to bottlenecks in the backend. The tool is able to generate |
| a detailed report which should help with identifying and analyzing sources of |
| bottlenecks. |
| |
| Scheduling models are mostly used to compute instruction latencies, to obtain |
| read-advance information, and to understand how processor resources are used by |
| instructions. By design, the quality of the performance analysis conducted by |
| the tool is inevitably affected by the quality of the target scheduling models. |
| However, scheduling models intentionally do not describe all processor details, |
| since the goal is just to enable the scheduling of machine instructions during |
| compilation. That means there are processor details which are not important for |
| the purpose of scheduling instructions (and therefore not described by the |
| scheduling model), but are very important for this tool. |
| |
| A few examples of details that are missing in scheduling models are: |
| - Actual dispatch width (it often differs from the issue width). |
| - Number of read/write ports in the register file(s). |
| - Length of the load/store queue in the LSUnit. |
| |
| It is also very difficult to find a "good" abstract model to describe the |
| behavior of out-of-order processors. So, we have to keep in mind that all of |
| these aspects are going to affect the quality of the static analysis performed |
| by the tool. |
| |
| An extensive list of known limitations is reported in one of the last sections |
| of this document. There is also a section related to design problems which must |
| be addressed (hopefully with the help of the community). At the moment, the |
| tool has been mostly tested for x86 targets, but there are still several |
| limitations, some of which could be overcome by integrating extra information |
| into the scheduling models. |
| |
| How the tool works |
| ------------------ |
| |
| The tool takes assembly code as input. Assembly code is parsed into a sequence |
| of MCInst with the help of the existing LLVM target assembly parsers. The parsed |
| sequence of MCInst is then analyzed by a 'Pipeline' module to generate a |
| performance report. |
| |
| The Pipeline module internally emulates the execution of the machine code |
| sequence for a number of iterations (100 by default). At the end of this |
| process, the pipeline collects a number of statistics which are then printed out |
| in the form of a report. |
| |
| Here is an example of performance report generated by the tool for a dot-product |
| of two packed float vectors of four elements. The analysis is conducted for |
| the x86 target, CPU btver2 (the iteration count was raised to 300 with flag |
| -iterations=300): |
| |
| /////////////////// |
| |
| Iterations: 300 |
| Instructions: 900 |
| Total Cycles: 610 |
| Dispatch Width: 2 |
| IPC: 1.48 |
| |
| |
| Resources: |
| [0] - JALU0 |
| [1] - JALU1 |
| [2] - JDiv |
| [3] - JFPM |
| [4] - JFPU0 |
| [5] - JFPU1 |
| [6] - JLAGU |
| [7] - JSAGU |
| [8] - JSTC |
| [9] - JVIMUL |
| |
| |
| Resource pressure per iteration: |
| [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] |
| - - - - 2.00 1.00 - - - - |
| |
| Resource pressure by instruction: |
| [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions: |
| - - - - - 1.00 - - - - vmulps %xmm0, %xmm1, %xmm2 |
| - - - - 1.00 - - - - - vhaddps %xmm2, %xmm2, %xmm3 |
| - - - - 1.00 - - - - - vhaddps %xmm3, %xmm3, %xmm4 |
| |
| |
| Instruction Info: |
| [1]: #uOps |
| [2]: Latency |
| [3]: RThroughput |
| [4]: MayLoad |
| [5]: MayStore |
| [6]: HasSideEffects |
| |
| [1] [2] [3] [4] [5] [6] Instructions: |
| 1 2 1.00 vmulps %xmm0, %xmm1, %xmm2 |
| 1 3 1.00 vhaddps %xmm2, %xmm2, %xmm3 |
| 1 3 1.00 vhaddps %xmm3, %xmm3, %xmm4 |
| |
| /////////////////// |
| |
| According to this report, the dot-product kernel has been executed 300 times, |
| for a total of 900 instructions dynamically executed. |
| |
| The report is structured in three main sections. The first section collects a |
| few performance numbers; the goal of this section is to give a very quick |
| overview of the performance throughput. In this example, the two important |
| performance indicators are the predicted total number of cycles and the IPC. |
| IPC is probably the most important throughput indicator. A big delta between the |
| Dispatch Width and the computed IPC is an indicator of potential performance |
| issues. |
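The relationship between these indicators can be checked with simple arithmetic. Below is a minimal sketch (not part of llvm-mca) using the numbers from the example report above:

```python
# Numbers taken from the example report above.
iterations = 300
instructions = 900          # 3 instructions per iteration
total_cycles = 610
dispatch_width = 2

# IPC is the number of dynamically executed instructions per cycle.
ipc = instructions / total_cycles
print(f"IPC: {ipc:.2f}")            # IPC: 1.48

# The dispatch width is an upper bound on the achievable IPC; a large
# delta between the two hints at a potential performance problem.
delta = dispatch_width - ipc
print(f"Delta: {delta:.2f}")        # Delta: 0.52
```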
| |
| The second section is the so-called "resource pressure view". This view reports |
| the average number of resource cycles consumed every iteration by instructions |
| for every processor resource unit available on the target. Information is |
| structured in two tables. The first table reports the number of resource cycles |
| spent on average every iteration. The second table correlates the resource |
| cycles to the machine instructions in the sequence. For example, in every |
| iteration of the dot-product, instruction 'vmulps' always executes on resource |
| unit [5] (JFPU1 - floating point pipeline #1), consuming an average of 1 |
| resource cycle per iteration. Note that on Jaguar, vector FP multiply can |
| only be issued to pipeline JFPU1, while horizontal FP adds can only be issued |
| to pipeline JFPU0. |
| |
| The third (and last) section of the report shows the latency and reciprocal |
| throughput of every instruction in the sequence. That section also reports extra |
| information related to the number of micro opcodes, and opcode properties |
| (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects'). |
| |
| The resource pressure view helps with identifying bottlenecks caused by high |
| usage of specific hardware resources. Situations with resource pressure mainly |
| concentrated on a few resources should, in general, be avoided. Ideally, |
| pressure should be uniformly distributed between multiple resources. |
| |
| Timeline View |
| ------------- |
| |
| A detailed report of each instruction's state transitions over time can be |
| enabled using the command line flag '-timeline'. This prints an extra section |
| in the report which contains the so-called "timeline view". Below is the |
| timeline view for the dot-product example from the previous section. |
| |
| /////////////// |
| Timeline view: |
| 012345 |
| Index 0123456789 |
| |
| [0,0] DeeER. . . vmulps %xmm0, %xmm1, %xmm2 |
| [0,1] D==eeeER . . vhaddps %xmm2, %xmm2, %xmm3 |
| [0,2] .D====eeeER . vhaddps %xmm3, %xmm3, %xmm4 |
| |
| [1,0] .DeeE-----R . vmulps %xmm0, %xmm1, %xmm2 |
| [1,1] . D=eeeE---R . vhaddps %xmm2, %xmm2, %xmm3 |
| [1,2] . D====eeeER . vhaddps %xmm3, %xmm3, %xmm4 |
| |
| [2,0] . DeeE-----R . vmulps %xmm0, %xmm1, %xmm2 |
| [2,1] . D====eeeER . vhaddps %xmm2, %xmm2, %xmm3 |
| [2,2] . D======eeeER vhaddps %xmm3, %xmm3, %xmm4 |
| |
| |
| Average Wait times (based on the timeline view): |
| [0]: Executions |
| [1]: Average time spent waiting in a scheduler's queue |
| [2]: Average time spent waiting in a scheduler's queue while ready |
| [3]: Average time elapsed from WB until retire stage |
| |
| [0] [1] [2] [3] |
| 0. 3 1.0 1.0 3.3 vmulps %xmm0, %xmm1, %xmm2 |
| 1. 3 3.3 0.7 1.0 vhaddps %xmm2, %xmm2, %xmm3 |
| 2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4 |
| /////////////// |
| |
| The timeline view is interesting because it shows how instructions change |
| state during execution. It also gives an idea of how the tool "sees" |
| instructions executed on the target. |
| |
| The timeline view is structured in two tables. The first table shows how |
| instructions change in state over time (measured in cycles); the second table |
| (named "Average Wait times") reports useful timing statistics which should help |
| diagnose performance bottlenecks caused by long data dependencies and |
| sub-optimal usage of hardware resources. |
| |
| An instruction in the timeline view is identified by a pair of indices, where |
| the 'first' index identifies an iteration, and the 'second' index is the actual |
| instruction index (i.e., where it appears in the code sequence). |
| |
| Excluding the first and last column, the remaining columns are in cycles. |
| Cycles are numbered sequentially starting from 0. The following characters are |
| used to describe the state of an instruction: |
| |
| D : Instruction dispatched. |
| e : Instruction executing. |
| E : Instruction executed. |
| R : Instruction retired. |
| = : Instruction already dispatched, waiting to be executed. |
| - : Instruction executed, waiting to be retired. |
| |
| Based on the timeline view from the example, we know that: |
| - Instruction [1, 0] was dispatched at cycle 1. |
| - Instruction [1, 0] started executing at cycle 2. |
| - Instruction [1, 0] reached the write back stage at cycle 4. |
| - Instruction [1, 0] was retired at cycle 10. |
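These cycle numbers can be read off mechanically from the timeline characters. A small, hypothetical Python helper (not part of llvm-mca) illustrating how each event maps to a column:

```python
def timeline_events(row):
    """Given a timeline row such as '.DeeE-----R', return the cycles of
    the dispatch ('D'), first-execution ('e'), write-back ('E') and
    retire ('R') events. Columns are cycle numbers starting from 0."""
    return (row.index('D'), row.index('e'), row.index('E'), row.index('R'))

# Instruction [1, 0] from the timeline view above:
print(timeline_events('.DeeE-----R'))   # (1, 2, 4, 10)
```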
| |
| Instruction [1, 0] (i.e., the vmulps from iteration #1) doesn't have to wait in |
| the scheduler's queue for the operands to become available. By the time the |
| vmulps is dispatched, operands are already available, and pipeline JFPU1 is |
| ready to serve another instruction. So the instruction can be immediately |
| issued on the JFPU1 pipeline. That is demonstrated by the fact that the |
| instruction only spent 1cy in the scheduler's queue. |
| |
| There is a gap of 5 cycles between the write-back stage and the retire event. |
| That is because instructions must retire in program order, so [1,0] has to wait |
| for [0, 2] to be retired first (i.e., it has to wait until cycle 10). |
| |
| In the dot-product example, all instructions are in a RAW (Read After Write) |
| dependency chain. Register %xmm2 written by the vmulps is immediately used by |
| the first vhaddps, and register %xmm3 written by the first vhaddps is used by |
| the second vhaddps. Long data dependencies negatively affect the ILP |
| (Instruction Level Parallelism). |
| |
| In the dot-product example, there are also anti-dependencies (Write After |
| Read) introduced by instructions from different iterations. However, those |
| dependencies can be removed at register renaming stage (at the cost of |
| allocating register aliases, and therefore consuming temporary registers). |
| |
| Table "Average Wait times" helps diagnose performance issues that are caused by |
| the presence of long latency instructions and potentially long data dependencies |
| which may limit the ILP. Note that the tool by default assumes at least 1cy |
| between the dispatch event and the issue event. |
| |
| When the performance is limited by data dependencies and/or long latency |
| instructions, the number of cycles spent while in the "ready" state is expected |
| to be very small when compared with the total number of cycles spent in the |
| scheduler's queue. So the difference between the two counters is a good |
| indicator of how big of an impact data dependencies had on the execution of |
| instructions. When performance is mostly limited by the lack of hardware |
| resources, the delta between the two counters is small. However, the number of |
| cycles spent in the queue tends to be larger (i.e., more than 1-3cy), |
| especially when compared with other low-latency instructions. |
| |
| Extra statistics to further diagnose performance issues. |
| -------------------------------------------------------- |
| |
| Flag '-verbose' enables extra statistics and performance counters for the |
| dispatch logic, the reorder buffer, the retire control unit and the register |
| file. |
| |
| Below is an example of verbose output generated by the tool for the dot-product |
| example discussed in the previous sections. |
| |
| /////////////////// |
| Iterations: 300 |
| Instructions: 900 |
| Total Cycles: 610 |
| Dispatch Width: 2 |
| IPC: 1.48 |
| |
| |
| Dynamic Dispatch Stall Cycles: |
| RAT - Register unavailable: 0 |
| RCU - Retire tokens unavailable: 0 |
| SCHEDQ - Scheduler full: 272 |
| LQ - Load queue full: 0 |
| SQ - Store queue full: 0 |
| GROUP - Static restrictions on the dispatch group: 0 |
| |
| |
| Register Alias Table: |
| Total number of mappings created: 900 |
| Max number of mappings used: 35 |
| |
| |
| Dispatch Logic - number of cycles where we saw N instructions dispatched: |
| [# dispatched], [# cycles] |
| 0, 24 (3.9%) |
| 1, 272 (44.6%) |
| 2, 314 (51.5%) |
| |
| |
| Schedulers - number of cycles where we saw N instructions issued: |
| [# issued], [# cycles] |
| 0, 7 (1.1%) |
| 1, 306 (50.2%) |
| 2, 297 (48.7%) |
| |
| |
| Retire Control Unit - number of cycles where we saw N instructions retired: |
| [# retired], [# cycles] |
| 0, 109 (17.9%) |
| 1, 102 (16.7%) |
| 2, 399 (65.4%) |
| |
| |
| Scheduler's queue usage: |
| JALU01, 0/20 |
| JFPU01, 18/18 |
| JLSAGU, 0/12 |
| /////////////////// |
| |
| Based on the verbose report, the pipeline was only able to dispatch two |
| instructions 51.5% of the time. The dispatch group was limited to one |
| instruction 44.6% of the cycles, which corresponds to 272 cycles. |
| |
| If we look at section "Dynamic Dispatch Stall Cycles", we can see how counter |
| SCHEDQ reports 272 cycles. Counter SCHEDQ is incremented every time the |
| dispatch logic is unable to dispatch a full group of two instructions because |
| the scheduler's queue is full. |
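The percentages in the dispatch histogram are simply each bucket's share of the total cycle count, and the one-instruction bucket matches counter SCHEDQ exactly. A quick check using the numbers from the report above:

```python
# Cycles in which 0, 1 or 2 instructions were dispatched (from the
# "Dispatch Logic" table above).
dispatch_histogram = {0: 24, 1: 272, 2: 314}

total_cycles = sum(dispatch_histogram.values())
print(total_cycles)                                    # 610

# Share of cycles where the group was limited to one instruction;
# these 272 cycles are exactly the SCHEDQ stall cycles.
print(f"{dispatch_histogram[1] / total_cycles:.1%}")   # 44.6%
```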
| |
| Section "Scheduler's queue usage" shows how the maximum number of buffer entries |
| (i.e., scheduler's queue entries) used at runtime for resource JFPU01 reached |
| its maximum. Note that AMD Jaguar implements three schedulers: |
| * JALU01 - A scheduler for ALU instructions |
| * JLSAGU - A scheduler for address generation |
| * JFPU01 - A scheduler floating point operations. |
| |
| The dot-product is a kernel of three floating point instructions (a vector |
| multiply followed by two horizontal adds). That explains why only the floating |
| point scheduler appears to be used according to section "Scheduler's queue |
| usage". |
| |
| A full scheduler's queue is either caused by data dependency chains, or by a |
| sub-optimal usage of hardware resources. Sometimes, resource pressure can be |
| mitigated by rewriting the kernel using different instructions that consume |
| different scheduler resources. Schedulers with a small queue are less resilient |
| to bottlenecks caused by the presence of long data dependencies. |
| |
| In this example, we can conclude that the IPC is mostly limited by data |
| dependencies, and not by resource pressure. |
| |
| LLVM-MCA instruction flow |
| ------------------------- |
| |
| This section describes the instruction flow through the out-of-order backend, |
| as well as the functional units involved in the process. |
| |
| An instruction goes through a default sequence of stages: |
| - Dispatch (Instruction is dispatched to the schedulers). |
| - Issue (Instruction is issued to the processor pipelines). |
| - Write Back (Instruction is executed, and results are written back). |
| - Retire (Instruction is retired; writes are architecturally committed). |
| |
| The tool only models the out-of-order portion of a processor. Therefore, the |
| instruction fetch and decode stages are not modeled. Performance bottlenecks in |
| the frontend are not diagnosed by this tool. The tool assumes that instructions |
| have all been decoded and placed in a queue. Also, the tool doesn't know |
| anything about branch prediction. |
| |
| The long term plan is to make this process customizable, so that processors |
| can define their own sequence of stages. This is future work. |
| |
| Instruction Dispatch |
| -------------------- |
| |
| During the Dispatch stage, instructions are picked in program order from a queue |
| of already decoded instructions, and dispatched in groups to the hardware |
| schedulers. The dispatch logic is implemented by class DispatchStage in file |
| DispatchStage.h. |
| |
| The size of a dispatch group depends on the availability of hardware resources, |
| and it cannot exceed the value of field 'DispatchWidth' in class DispatchStage. |
| Note that field DispatchWidth defaults to the value of field 'IssueWidth' from |
| the scheduling model. |
| |
| Users can override the DispatchWidth value with flag "-dispatch=<N>" (where 'N' |
| is an unsigned quantity). |
| |
| An instruction can be dispatched if: |
| - The size of the dispatch group is smaller than DispatchWidth |
| - There are enough entries in the reorder buffer |
| - There are enough temporary registers to do register renaming |
| - Schedulers are not full. |
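The four conditions above can be condensed into a single predicate. The function name and signature below are hypothetical (the real checks live in class DispatchStage); this is only a sketch of the logic:

```python
def can_dispatch(group_size, dispatch_width,
                 free_rob_entries, uops,
                 free_temporaries, renames_needed,
                 scheduler_full):
    """Return True if one more instruction (consuming `uops` reorder
    buffer entries and `renames_needed` temporary registers) can join
    the current dispatch group."""
    return (group_size < dispatch_width          # room in the group
            and free_rob_entries >= uops         # reorder buffer entries
            and free_temporaries >= renames_needed  # register renaming
            and not scheduler_full)              # scheduler not full

# A 2-wide dispatch group that still has room:
print(can_dispatch(1, 2, 64, 1, 8, 1, False))   # True
# Scheduler's queue is full -> dispatch stall:
print(can_dispatch(0, 2, 64, 1, 8, 1, True))    # False
```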
| |
| Since r329067, scheduling models can optionally specify which register files |
| are available on the processor. Class DispatchStage (see DispatchStage.h) |
| uses that information to initialize register file descriptors. |
| |
| By default, if the model doesn't describe register files, the tool |
| (optimistically) assumes a single register file with an unbounded number of |
| temporary registers. Users can limit the number of temporary registers that |
| are globally available for register renaming using flag |
| `-register-file-size=<N>`, where N is the number of temporaries. A value of |
| zero for N means 'unbounded'. Knowing how many temporaries are available for |
| register renaming, the tool can predict dispatch stalls caused by the lack of |
| temporaries. |
| |
| The number of reorder buffer entries consumed by an instruction depends on the |
| number of micro-opcodes it specifies in the target scheduling model (see field |
| 'NumMicroOpcodes' of TableGen class ProcWriteResources and its derived classes; |
| TargetSchedule.td). |
| |
| The reorder buffer is implemented by class RetireControlUnit (see |
| DispatchStage.h). Its goal is to track the progress of instructions that are |
| "in-flight", and retire instructions in program order. The number of entries |
| in the reorder buffer defaults to the value of field 'MicroOpBufferSize' from |
| the target scheduling model. |
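The behavior described above (micro-op entries reserved at dispatch, instructions flagged when executed, retirement strictly in program order) can be sketched as follows. This is a simplified model with hypothetical names, not the actual LLVM class:

```python
from collections import deque

class RetireControlUnit:
    """Simplified reorder buffer: entries are reserved at dispatch and
    released when instructions retire, strictly in program order."""
    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.used = 0
        self.queue = deque()                 # [instr_id, uops, executed]

    def dispatch(self, instr_id, uops):
        if self.used + uops > self.num_entries:
            return False                     # stall: retire tokens unavailable
        self.used += uops
        self.queue.append([instr_id, uops, False])
        return True

    def on_executed(self, instr_id):
        for entry in self.queue:
            if entry[0] == instr_id:
                entry[2] = True

    def retire_cycle(self, retire_width=2):
        retired = []
        # Only the oldest instructions may retire, and only if executed.
        while self.queue and self.queue[0][2] and len(retired) < retire_width:
            instr_id, uops, _ = self.queue.popleft()
            self.used -= uops
            retired.append(instr_id)
        return retired

rcu = RetireControlUnit(num_entries=8)
rcu.dispatch('vmulps', 1)
rcu.on_executed('vmulps')
print(rcu.retire_cycle())                    # ['vmulps']
```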
| |
| Instructions that are dispatched to the schedulers consume scheduler buffer |
| entries. The tool queries the scheduling model to figure out the set of |
| buffered resources consumed by an instruction. Buffered resources are treated |
| like "scheduler" resources, and the field 'BufferSize' (from the processor |
| resource TableGen definition) defines the size of the scheduler's queue. |
| |
| Zero latency instructions (for example NOP instructions) don't consume scheduler |
| resources. However, those instructions still reserve a number of slots in the |
| reorder buffer. |
| |
| Instruction Issue |
| ----------------- |
| |
| As mentioned in the previous section, each scheduler resource implements a queue |
| of instructions. An instruction has to wait in the scheduler's queue until |
| input register operands become available. Only at that point does the |
| instruction become eligible for execution and may be issued (potentially |
| out-of-order) to a pipeline for execution. |
| |
| Instruction latencies can be computed by the tool with the help of the |
| scheduling model; latency values are defined by the scheduling model through |
| ProcWriteResources objects. |
| |
| Class Scheduler (see file Scheduler.h) knows how to emulate multiple processor |
| schedulers. A Scheduler is responsible for tracking data dependencies, and for |
| dynamically selecting which processor resources are consumed by instructions. |
| |
| Internally, the Scheduler class delegates the management of processor resource |
| units and resource groups to the ResourceManager class. ResourceManager is also |
| responsible for selecting resource units that are effectively consumed by |
| instructions. For example, if an instruction consumes 1cy of a resource group, |
| the ResourceManager object selects one of the available units from the group; by |
| default, it uses a round-robin selector to guarantee that resource usage is |
| uniformly distributed between all units of a group. |
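The round-robin selection policy described above can be sketched in a few lines. The class name is hypothetical; ResourceManager's real implementation differs, but the distribution property is the same:

```python
class RoundRobinSelector:
    """Pick the next unit of a resource group in round-robin order, so
    usage is uniformly distributed between all units of the group."""
    def __init__(self, units):
        self.units = list(units)
        self.next_idx = 0

    def select(self):
        unit = self.units[self.next_idx]
        self.next_idx = (self.next_idx + 1) % len(self.units)
        return unit

# A hypothetical two-unit FP group, like JFPU0/JFPU1 on Jaguar:
jfpu = RoundRobinSelector(['JFPU0', 'JFPU1'])
print([jfpu.select() for _ in range(4)])   # ['JFPU0', 'JFPU1', 'JFPU0', 'JFPU1']
```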
| |
| Internally, class Scheduler implements three instruction queues: |
| - WaitQueue: a queue of instructions whose operands are not ready yet. |
| - ReadyQueue: a queue of instructions ready to execute. |
| - IssuedQueue: a queue of instructions executing. |
| |
| Depending on operand availability, instructions that are dispatched to the |
| Scheduler are either placed into the WaitQueue or into the ReadyQueue. |
| |
| Every cycle, class Scheduler checks if instructions can be moved from the |
| WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be |
| issued to the underlying pipelines. The algorithm prioritizes older |
| instructions over younger instructions. |
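One simulated cycle over the three queues can be sketched as below. This is a hypothetical helper, not the Scheduler class itself; instruction ids are assumed to increase in program order, so sorting models the age-based priority:

```python
def scheduler_cycle(wait_q, ready_q, issued_q, operands_ready, issue_width):
    """Move instructions whose operands became available from wait_q to
    ready_q, then issue up to issue_width instructions, oldest first."""
    for instr in list(wait_q):
        if operands_ready(instr):
            wait_q.remove(instr)
            ready_q.append(instr)
    ready_q.sort()                       # older (lower id) first
    issued = []
    while ready_q and len(issued) < issue_width:
        instr = ready_q.pop(0)
        issued_q.append(instr)
        issued.append(instr)
    return issued

# Instruction 1 is already ready; instruction 2's operands arrive now:
wait_q, ready_q, issued_q = [2, 3], [1], []
issued = scheduler_cycle(wait_q, ready_q, issued_q,
                         operands_ready=lambda i: i == 2, issue_width=2)
print(issued)                            # [1, 2]
```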
| |
| Objects of class ResourceState (see Scheduler.h) describe processor resources. |
| There is an instance of class ResourceState for every processor resource |
| specified by the scheduling model. A ResourceState object for a processor |
| resource with multiple units dynamically tracks the availability of every single |
| unit. For example, the ResourceState of a resource group tracks the |
| availability of every resource in that group. Internally, ResourceState |
| implements a round-robin selector to dynamically pick the next unit to use from |
| the group. |
| |
| Write-Back and Retire Stage |
| --------------------------- |
| |
| Issued instructions are moved from the ReadyQueue to the IssuedQueue. There, |
| instructions wait until they reach the write-back stage. At that point, they |
| get removed from the queue and the retire control unit is notified. |
| |
| Upon the "instruction executed" event, the retire control unit flags the |
| instruction as "ready to retire". |
| |
| Instructions are retired in program order; an "instruction retired" event is |
| sent to the register file, which frees the temporary registers allocated for |
| the instruction at register renaming stage. |
| |
| Load/Store Unit and Memory Consistency Model |
| -------------------------------------------- |
| |
| The tool attempts to emulate out-of-order execution of memory operations. Class |
| LSUnit (see file LSUnit.h) emulates a load/store unit implementing queues for |
| speculative execution of loads and stores. |
| |
| Each load (or store) consumes an entry in the load (or store) queue. The number |
| of slots in the load/store queues is unknown by the tool, since there is no |
| mention of it in the scheduling model. In practice, users can specify flag |
| `-lqueue=N` (respectively `-squeue=N`) to limit the number of entries in the |
| load (store) queue to exactly N (an unsigned value). If N is zero, then the |
| tool assumes an unbounded queue (this is the default). |
| |
| LSUnit implements a relaxed consistency model for memory loads and stores. The |
| rules are: |
| 1) A younger load is allowed to pass an older load only if there is no |
| intervening store in between the two loads. |
| 2) A younger store is not allowed to pass an older store. |
| 3) A younger store is not allowed to pass an older load. |
| 4) A younger load is allowed to pass an older store provided that the load does |
| not alias with the store. |
| |
| By default, this class conservatively (i.e., pessimistically) assumes that loads |
| always may-alias store operations. Essentially, this LSUnit doesn't perform |
| any sort of alias analysis to rule out cases where loads and stores don't |
| overlap with each other. The downside of this approach, however, is that younger |
| loads are never allowed to pass older stores. To make it possible for a |
| younger load to pass an older store, users can use the command line flag |
| -noalias. Under 'noalias', a younger load is always allowed to pass an older |
| store. |
| |
| Note that, in the case of write-combining memory, rule 2 could be relaxed a bit |
| to allow reordering of non-aliasing store operations. That being said, at the |
| moment, there is no way to further relax the memory model (flag -noalias is the |
| only option). Essentially, there is no option to specify a different memory |
| type (for example: write-back, write-combining, write-through; etc.) and |
| consequently to weaken or strengthen the memory model. |
| |
| Other limitations are: |
| * LSUnit doesn't know when store-to-load forwarding may occur. |
| * LSUnit doesn't know anything about the cache hierarchy and memory types. |
| * LSUnit doesn't know how to identify serializing operations and memory fences. |
| |
| No assumption is made on the store buffer size. As mentioned before, LSUnit |
| conservatively assumes a may-alias relation between loads and stores, and it |
| doesn't attempt to identify cases where store-to-load forwarding would occur in |
| practice. |
| |
| LSUnit doesn't attempt to predict whether a load or store hits or misses the L1 |
| cache. It only knows if an instruction "MayLoad" and/or "MayStore". For loads, |
| the scheduling model provides an "optimistic" load-to-use latency (which usually |
| matches the load-to-use latency for when there is a hit in the L1D). |
| |
| Class MCInstrDesc in LLVM doesn't know about serializing operations, nor |
| memory-barrier-like instructions. LSUnit conservatively assumes that an |
| instruction which has both 'MayLoad' and 'UnmodeledSideEffects' behaves like a |
| "soft" load-barrier. That means, it serializes loads without forcing a flush of |
| the load queue. Similarly, instructions flagged with both 'MayStore' and |
| 'UnmodeledSideEffects' are treated like store barriers. A full memory barrier |
| is a 'MayLoad' and 'MayStore' instruction with 'UnmodeledSideEffects'. This is |
| inaccurate, but it is the best that we can do at the moment with the current |
| information available in LLVM. |
| |
| A load/store barrier consumes one entry of the load/store queue. A load/store |
| barrier enforces ordering of loads/stores. A younger load cannot pass a load |
| barrier. Also, a younger store cannot pass a store barrier. A younger load has |
| to wait for the memory/load barrier to execute. A load/store barrier is |
| "executed" when it becomes the oldest entry in the load/store queue(s). That |
| also means, by construction, all the older loads/stores have been executed. |
| |
| In conclusion, the full set of rules is: |
| 1. A store may not pass a previous store. |
| 2. A load may not pass a previous store unless flag 'NoAlias' is set. |
| 3. A load may pass a previous load. |
| 4. A store may not pass a previous load (regardless of flag 'NoAlias'). |
| 5. A load has to wait until an older load barrier is fully executed. |
| 6. A store has to wait until an older store barrier is fully executed. |
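Rules 1 through 4 reduce to a small predicate on the pair of operation kinds; barrier rules 5 and 6 are omitted for brevity. A hypothetical sketch (llvm-mca's LSUnit implements this differently):

```python
def may_pass(younger, older, noalias=False):
    """May a younger memory operation bypass an older one?
    Operations are 'load' or 'store'; `noalias` models flag -noalias.
    (Barrier rules 5 and 6 are not modeled here.)"""
    if younger == 'store':
        return False        # rules 1 and 4: a store never passes
    if older == 'store':
        return noalias      # rule 2: a load passes a store only with -noalias
    return True             # rule 3: a load may pass a previous load

print(may_pass('load', 'load'))                  # True
print(may_pass('load', 'store'))                 # False
print(may_pass('load', 'store', noalias=True))   # True
```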
| |
| Known limitations |
| ----------------- |
| Previous sections described cases where the tool is missing information to give |
| an accurate report. For example, the first sections of this document explained |
| how the lack of knowledge about the processor negatively affects the performance |
| analysis. The lack of knowledge is often a consequence of how scheduling models |
| are defined; as mentioned before, scheduling models intentionally don't describe |
| processors in fine details. That being said, the LLVM machine model can be |
| extended to expose more details, as long as they are opt-in for targets. |
| |
| The accuracy of the performance analysis is also affected by assumptions made by |
| the processor model used by the tool. |
| |
| Most recent Intel and AMD processors implement a dedicated LoopBuffer/OpCache |
| in the hardware frontend to speed up throughput in the presence of tight |
| loops. The presence of these buffers complicates the decoding logic and |
| requires knowledge of the branch predictor too. Class 'SchedMachineModel' in |
| TableGen provides a field named 'LoopMicroOpBufferSize' which is used to |
| describe loop buffers. However, the purpose of that field is to enable loop |
| unrolling of tight loops; essentially, it affects the cost model used by the |
| loop-unroll pass. |
| |
| In its current state, the tool only describes the out-of-order portion of a |
| processor, and consequently doesn't try to predict the frontend throughput. |
| That being said, the tool could certainly be extended in the future to also |
| account for the hardware frontend when doing performance analysis. This would |
| inevitably require extra (extensive) processor knowledge related to all the |
| available decoding paths in the hardware frontend, as well as branch |
| prediction. |
| |
| Currently, the tool assumes a zero-latency "perfect" fetch&decode |
| stage; the full sequence of decoded instructions is immediately visible to the |
| dispatch logic from the start. |
| |
| The tool doesn't know about simultaneous multithreading. According to the tool, |
| processor resources are not statically/dynamically partitioned. Processor |
| resources are fully available to the hardware thread executing the |
| microbenchmark. |
| |
| The execution model implemented by this tool assumes that instructions are |
| firstly dispatched in groups to hardware schedulers, and then issued to |
| pipelines for execution. The model assumes dynamic scheduling of instructions. |
| Instructions are placed in a queue and potentially executed out-of-order (based |
| on operand availability). The dispatch stage is distinct from the issue stage. |
| This will change in the future; as mentioned in the first section, the end |
| goal is to let processors customize the process. |
| |
| This model doesn't correctly describe processors where the dispatch/issue is a |
| single stage. This is what happens for example in VLIW processors, where |
| instructions are packaged and statically scheduled at compile time; it is up to |
| the compiler to predict the latency of instructions and package issue groups |
| accordingly. For such targets, there is no dynamic scheduling done by the |
| hardware. |
| |
| Existing classes (DispatchStage, Scheduler, etc.) could be extended/adapted to |
| support processors with a single dispatch/issue stage. The execution flow would |
| require some changes in how existing components (i.e., DispatchStage, |
| Scheduler, etc.) interact. This can be a future development. |
| |
| The following sections describe other known limitations. The goal is not to |
| provide an extensive list of limitations; we want to report what we believe are |
| the most important limitations, and suggest possible methods to overcome them. |
| |
| Load/Store barrier instructions and serializing operations |
| ---------------------------------------------------------- |
| Section "Load/Store Unit and Memory Consistency Model" already mentioned how |
| LLVM doesn't know about serializing operations and memory barriers. Most of it |
| boils down to the fact that class MCInstrDesc (intentionally) doesn't expose |
| those properties. Instead, both serializing operations and memory barriers |
| "have side-effects" according to MCInstrDesc. That is because, at least for |
| scheduling purposes, knowing that an instruction has unmodeled side effects is |
| often enough to treat the instruction like a compiler scheduling barrier. |
| |
| A performance analysis tool could use extra knowledge of barriers and |
| serializing operations to generate a more accurate performance report. One |
| way to expose this information is by reserving a couple of bits in the |
| 'Flags' field of class MCInstrDesc: one bit for barrier operations, and |
| another bit to mark instructions as serializing operations. |
| |
| Lack of support for instruction itineraries |
| ------------------------------------------- |
| The current version of the tool doesn't know how to process instruction |
| itineraries. This is probably one of the most important limitations, since it |
| affects a few out-of-order processors in LLVM. |
| |
| As mentioned in section 'Instruction Issue', class Scheduler delegates to an |
| instance of class ResourceManager the handling of processor resources. |
| ResourceManager is where most of the scheduling logic is implemented. |
| |
| Adding support for instruction itineraries requires that we teach |
| ResourceManager how to handle functional units and instruction stages. This |
| development can be a future extension, and it would probably require a few |
| changes to the ResourceManager interface. |
| |
| Instructions that affect control flow are not correctly modeled |
| --------------------------------------------------------------- |
| Examples of instructions that affect control flow include returns, indirect |
| branches, and calls. The tool doesn't try to predict/evaluate branch targets. |
| In particular, the tool doesn't model any sort of branch prediction, nor does it |
| attempt to track changes to the program counter. The tool always assumes that |
| the input assembly sequence is the body of a microbenchmark (a simple loop |
| executed for a number of iterations). The "next" instruction in sequence is |
| always the next instruction to dispatch. |
| |
| Call instructions default to an arbitrarily high latency of 100cy. A warning is |
| generated if the tool encounters a call instruction in the sequence. Return |
| instructions are not evaluated, and therefore control flow is not affected. |
| However, the tool still queries the processor scheduling model to obtain latency |
| information for instructions that affect the control flow. |
| |
| Known limitations on X86 processors |
| ----------------------------------- |
| |
| 1) Partial register updates versus full register updates. |
| |
| On x86-64, a 32-bit GPR write fully updates the super-register. Example: |
| add %edi, %eax ## eax += edi |
| |
| Here, register %eax aliases the lower half of 64-bit register %rax. On x86-64, |
| register %rax is fully updated by the 'add' (the upper half of %rax is zeroed). |
| Essentially, it "kills" any previous definition of (the upper half of) register |
| %rax. |
| |
| On the other hand, 8/16-bit register writes only perform a so-called "partial |
| register update". Example: |
| add %di, %ax ## ax += di |
| |
| Here, register %eax is only partially updated. To be more specific, the lower |
| half of %eax is set, and the upper half is left unchanged. There is also no |
| change in the upper 48 bits of register %rax. |
| |
| To get accurate performance analysis, the tool has to know which instructions |
| perform a partial register update, and which instructions fully update the |
| destination's super-register. |
| |
| One way to expose this information is (again) via TableGen. For example, we |
| could add a flag in the TableGen instruction class to tag instructions that |
| perform partial register updates. Something like this: 'bit |
| hasPartialRegisterUpdate = 1'. However, this would force a `let |
| hasPartialRegisterUpdate = 0` on several instruction definitions. |
| |
| Another approach is to have an MCSubtargetInfo hook similar to this: |
| virtual bool updatesSuperRegisters(unsigned short opcode) { return false; } |
| |
| Targets would be able to override this method if needed. Again, this is just |
| an idea; the plan is to address this in a future development. |
| |
| 2) Macro-op fusion. |
| |
| The tool doesn't know about macro-op fusion. On modern x86 processors, a |
| 'cmp/test' followed by a conditional jump is fused into a single macro |
| operation. The advantage is that the fused pair only consumes a single slot |
| in the dispatch group. |
| |
| As a future development, the tool should be extended to address macro-fusion. |
| Ideally, we could have LLVM generate a table enumerating all the opcode pairs |
| that can be fused together. That table could be exposed to the tool via the |
| MCSubtargetInfo interface. This is just an idea; there may be better ways to |
| implement this. |
| |
| 3) Intel processors: mixing legacy SSE with AVX instructions. |
| |
| On modern Intel processors with AVX, mixing legacy SSE code with AVX code |
| negatively impacts performance. The tool is not aware of this issue, and the |
| performance penalty is not accounted for in the analysis. This is something |
| that we would like to improve in the future. |
| |
| 4) Zero-latency register moves and zero idioms. |
| |
| Most modern AMD/Intel processors know how to optimize out register-register |
| moves and zero idioms at the register renaming stage. The tool doesn't know |
| about these patterns, and this may negatively impact the performance analysis. |
| |
| Known design problems |
| --------------------- |
| This section describes two design issues that currently affect the tool. The |
| long-term plan is to fix both. These limitations would be easily addressed if |
| we taught the tool how to directly manipulate MachineInstr objects (instead |
| of MCInst objects). |
| |
| 1) Variant instructions not correctly modeled. |
| |
| The tool doesn't know how to analyze instructions with a "variant" scheduling |
| class descriptor. A variant scheduling class needs to be resolved dynamically. |
| The "actual" scheduling class often depends on the subtarget, as well as |
| properties of the specific MachineInstr object. |
| |
| Unfortunately, the tool manipulates MCInst, and it doesn't know anything about |
| MachineInstr. As a consequence, the tool cannot use the existing machine |
| subtarget hooks that are normally used to resolve the variant scheduling class. |
| This is a major design issue which mostly affects ARM/AArch64 targets. It |
| mostly boils down to the fact that the existing scheduling framework was meant |
| to work for MachineInstr. |
| |
| When the tool encounters a "variant" instruction, it assumes a generic 1cy |
| latency. However, the tool is not able to tell which processor resources are |
| effectively consumed by the variant instruction. |
| |
| 2) MCInst and MCInstrDesc. |
| |
| Performance analysis tools require data dependency information to correctly |
| predict the runtime performance of the code. This tool must always be able to |
| obtain the set of implicit/explicit register defs/uses for every instruction of |
| the input assembly sequence. |
| |
| In the first section of this document, it was mentioned how the tool takes as |
| input an assembly sequence. That sequence is parsed into a MCInst sequence with |
| the help of assembly parsers available from the targets. |
| |
| An MCInst is a very low-level instruction representation. The tool can |
| inspect the MCOperand sequence of an MCInst to identify register operands. |
| However, there is no way to distinguish register operands that are |
| definitions from register operands that are uses. |
| |
| In LLVM, class MCInstrDesc is used to fully describe target instructions and |
| their operands. The opcode of a machine instruction (a MachineInstr object) can |
| be used to query the instruction set through method `MCInstrInfo::get' to obtain |
| the associated MCInstrDesc object. |
| |
| However, class MCInstrDesc describes properties and operands of MachineInstr |
| objects. Essentially, MCInstrDesc is not meant to be used to describe MCInst |
| objects. To be more specific, MCInstrDesc objects are automatically generated |
| via TableGen from the instruction set description in the target .td files. For |
| example, field `MCInstrDesc::NumDefs' is always equal to the cardinality of the |
| `(outs)` set from the TableGen instruction definition. |
| |
| By construction, register definitions always appear at the beginning of the |
| MachineOperands list in MachineInstr. Basically, the (outs) are the first |
| operands of a MachineInstr, and the (ins) will come after in the machine operand |
| list. Knowing the number of register definitions is enough to identify |
| all the register operands that are definitions. |
| |
| In a normal compilation process, MCInst objects are generated from MachineInstr |
| objects through a lowering step. By default the lowering logic simply iterates |
| over the machine operands of a MachineInstr, and converts/expands them into |
| equivalent MCOperand objects. |
| |
| The default lowering strategy has the advantage of preserving all of the |
| above-mentioned assumptions on the machine operand sequence: register |
| definitions would still be at the beginning of the MCOperand sequence, and |
| register uses would come after. |
| |
| Targets may still define custom lowering routines for specific opcodes. Some of |
| these routines may lower operands in a way that potentially breaks (some of) the |
| assumptions on the machine operand sequence which were valid for MachineInstr. |
| Luckily, this is not the most common form of lowering done by the targets, and |
| the vast majority of MachineInstr objects are lowered using the default |
| strategy, which preserves the original machine operand sequence. This is |
| especially true |
| for x86, where the custom lowering logic always preserves the original (i.e., |
| from the MachineInstr) operand sequence. |
| |
| This tool currently works under the strong (and potentially incorrect) |
| assumption that register def/uses in a MCInst can always be identified by |
| querying the machine instruction descriptor for the opcode. This assumption made |
| it possible to develop this tool and get good numbers at least for the |
| processors available in the x86 backend. |
| |
| That being said, the analysis is still potentially incorrect for other targets. |
| So we plan (with the help of the community) to find a proper mechanism to |
| map, when possible, MCOperand indices back to MachineOperand indices of the |
| equivalent MachineInstr. This would be equivalent to describing the changes |
| made by the lowering step which affected the operand sequence. For example, |
| we could have an index for every register MCOperand (or -1, if the operand |
| didn't exist in the original MachineInstr). A mapping could look like this: |
| <0,1,3,2>. Here, MCOperand #2 was obtained from the lowering of |
| MachineOperand #3, and so on. |
| |
| This information could be automatically generated via TableGen for all the |
| instructions whose custom lowering step breaks assumptions made by the tool on |
| the register operand sequence (in general, these instructions should be a |
| minority of a target's instruction set). Unfortunately, we don't have that |
| information now. As a consequence, we assume that the number of explicit |
| register definitions is the same number specified in MCInstrDesc. We also |
| assume that register definitions always come first in the operand sequence. |
| |
| In conclusion, these are the strong assumptions currently made by the tool: |
| * The number of explicit and implicit register definitions in a MCInst |
| matches the number of explicit and implicit definitions specified by the |
| MCInstrDesc object. |
| * Register uses always come after register definitions. |
| * If an opcode specifies an optional definition, then the optional |
| definition is always the last register operand in the sequence. |
| |
| Note that some of the information accessible from the MCInstrDesc is always |
| valid for MCInst. For example: implicit register defs, implicit register uses |
| and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties still apply to |
| MCInst. The tool knows about this, and uses that information during its |
| analysis. |
| |
| Future work |
| ----------- |
| * Address limitations (described in section "Known limitations"). |
| * Let processors specify the selection strategy for processor resource groups |
| and resources with multiple units. The tool currently uses a round-robin |
| selector to pick the next resource to use. |
| * Address limitations specifically described in section "Known limitations on |
| X86 processors". |
| * Address design issues identified in section "Known design problems". |
| * Define a standard interface for "Views". This would let users customize the |
| performance report generated by the tool. |
| |
| When interfaces are mature/stable: |
| * Move the logic into a library. This will enable a number of other |
| interesting use cases. |
| |
| Work is currently tracked on https://bugs.llvm.org. llvm-mca bugs are tagged |
| with prefix [llvm-mca]. You can easily find the full list of open bugs if you |
| search for that tag. |