comparison docs/CommandGuide/llvm-mca.rst @ 147:c2174574ed3a
LLVM 10
author | Shinji KONO <kono@ie.u-ryukyu.ac.jp> |
---|---|
date | Wed, 14 Aug 2019 16:55:33 +0900 |
parents | |
children |
134:3a76565eade5 | 147:c2174574ed3a |
---|---|
1 llvm-mca - LLVM Machine Code Analyzer | |
2 ===================================== | |
3 | |
4 .. program:: llvm-mca | |
5 | |
6 SYNOPSIS | |
7 -------- | |
8 | |
9 :program:`llvm-mca` [*options*] [input] | |
10 | |
11 DESCRIPTION | |
12 ----------- | |
13 | |
14 :program:`llvm-mca` is a performance analysis tool that uses information | |
15 available in LLVM (e.g. scheduling models) to statically measure the performance | |
16 of machine code in a specific CPU. | |
17 | |
18 Performance is measured in terms of throughput as well as processor resource | |
19 consumption. The tool currently works for processors with an out-of-order | |
20 backend, for which there is a scheduling model available in LLVM. | |
21 | |
22 The main goal of this tool is not just to predict the performance of the code | |
23 when run on the target, but also help with diagnosing potential performance | |
24 issues. | |
25 | |
26 Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions | |
27 Per Cycle (IPC), as well as hardware resource pressure. The analysis and | |
28 reporting style were inspired by the IACA tool from Intel. | |
29 | |
30 For example, you can compile code with clang, output assembly, and pipe it | |
31 directly into :program:`llvm-mca` for analysis: | |
32 | |
33 .. code-block:: bash | |
34 | |
35 $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2 | |
36 | |
37 Or for Intel syntax: | |
38 | |
39 .. code-block:: bash | |
40 | |
41 $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2 | |
42 | |
43 Scheduling models are not just used to compute instruction latencies and | |
44 throughput, but also to understand what processor resources are available | |
45 and how to simulate them. | |
46 | |
47 By design, the quality of the analysis conducted by :program:`llvm-mca` is | |
48 inevitably affected by the quality of the scheduling models in LLVM. | |
49 | |
50 If you see that the performance report is not accurate for a processor, | |
51 please `file a bug <https://bugs.llvm.org/enter_bug.cgi?product=libraries>`_ | |
52 against the appropriate backend. | |
53 | |
54 OPTIONS | |
55 ------- | |
56 | |
57 If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard | |
58 input. Otherwise, it will read from the specified filename. | |
59 | |
60 If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output | |
61 to standard output if the input is from standard input. If the :option:`-o` | |
62 option specifies "``-``", then the output will also be sent to standard output. | |
63 | |
64 | |
65 .. option:: -help | |
66 | |
67 Print a summary of command line options. | |
68 | |
69 .. option:: -o <filename> | |
70 | |
71 Use ``<filename>`` as the output filename. See the summary above for more | |
72 details. | |
73 | |
74 .. option:: -mtriple=<target triple> | |
75 | |
76 Specify a target triple string. | |
77 | |
78 .. option:: -march=<arch> | |
79 | |
80 Specify the architecture for which to analyze the code. It defaults to the | |
81 host default target. | |
82 | |
83 .. option:: -mcpu=<cpuname> | |
84 | |
85 Specify the processor for which to analyze the code. By default, the cpu name | |
86 is autodetected from the host. | |
87 | |
88 .. option:: -output-asm-variant=<variant id> | |
89 | |
90 Specify the output assembly variant for the report generated by the tool. | |
91 On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly | |
92 format, and a value of 1 selects the Intel assembly format, for the code | |
93 printed out by the tool in the analysis report. | |
94 | |
95 .. option:: -print-imm-hex | |
96 | |
97 Prefer hex format for numeric literals in the output assembly printed as part | |
98 of the report. | |
99 | |
100 .. option:: -dispatch=<width> | |
101 | |
102 Specify a different dispatch width for the processor. The dispatch width | |
103 defaults to field 'IssueWidth' in the processor scheduling model. If width is | |
104 zero, then the default dispatch width is used. | |
105 | |
106 .. option:: -register-file-size=<size> | |
107 | |
108 Specify the size of the register file. When specified, this flag limits how | |
109 many physical registers are available for register renaming purposes. A value | |
110 of zero for this flag means "unlimited number of physical registers". | |
111 | |
112 .. option:: -iterations=<number of iterations> | |
113 | |
114 Specify the number of iterations to run. If this flag is set to 0, then the | |
115 tool sets the number of iterations to a default value (i.e. 100). | |
116 | |
117 .. option:: -noalias=<bool> | |
118 | |
119 If set, the tool assumes that loads and stores don't alias. This is the | |
120 default behavior. | |
121 | |
122 .. option:: -lqueue=<load queue size> | |
123 | |
124 Specify the size of the load queue in the load/store unit emulated by the tool. | |
125 By default, the tool assumes an unbounded number of entries in the load queue. | |
126 A value of zero for this flag is ignored, and the default load queue size is | |
127 used instead. | |
128 | |
129 .. option:: -squeue=<store queue size> | |
130 | |
131 Specify the size of the store queue in the load/store unit emulated by the | |
132 tool. By default, the tool assumes an unbounded number of entries in the store | |
133 queue. A value of zero for this flag is ignored, and the default store queue | |
134 size is used instead. | |
135 | |
136 .. option:: -timeline | |
137 | |
138 Enable the timeline view. | |
139 | |
140 .. option:: -timeline-max-iterations=<iterations> | |
141 | |
142 Limit the number of iterations to print in the timeline view. By default, the | |
143 timeline view prints information for up to 10 iterations. | |
144 | |
145 .. option:: -timeline-max-cycles=<cycles> | |
146 | |
147 Limit the number of cycles in the timeline view. By default, the number of | |
148 cycles is set to 80. | |
149 | |
150 .. option:: -resource-pressure | |
151 | |
152 Enable the resource pressure view. This is enabled by default. | |
153 | |
154 .. option:: -register-file-stats | |
155 | |
156 Enable register file usage statistics. | |
157 | |
158 .. option:: -dispatch-stats | |
159 | |
160 Enable extra dispatch statistics. This view collects and analyzes instruction | |
161 dispatch events, as well as static/dynamic dispatch stall events. This view | |
162 is disabled by default. | |
163 | |
164 .. option:: -scheduler-stats | |
165 | |
166 Enable extra scheduler statistics. This view collects and analyzes instruction | |
167 issue events. This view is disabled by default. | |
168 | |
169 .. option:: -retire-stats | |
170 | |
171 Enable extra retire control unit statistics. This view is disabled by default. | |
172 | |
173 .. option:: -instruction-info | |
174 | |
175 Enable the instruction info view. This is enabled by default. | |
176 | |
177 .. option:: -show-encoding | |
178 | |
179 Enable the printing of instruction encodings within the instruction info view. | |
180 | |
181 .. option:: -all-stats | |
182 | |
183 Print all hardware statistics. This enables extra statistics related to the | |
184 dispatch logic, the hardware schedulers, the register file(s), and the retire | |
185 control unit. This option is disabled by default. | |
186 | |
187 .. option:: -all-views | |
188 | |
189 Enable all the views. | |
190 | |
191 .. option:: -instruction-tables | |
192 | |
193 Prints resource pressure information based on the static information | |
194 available from the processor model. This differs from the resource pressure | |
195 view because it doesn't require that the code is simulated. It instead prints | |
196 the theoretical uniform distribution of resource pressure for every | |
197 instruction in sequence. | |
198 | |
199 .. option:: -bottleneck-analysis | |
200 | |
201 Print information about bottlenecks that affect the throughput. This analysis | |
202 can be expensive, and it is disabled by default. Bottlenecks are highlighted | |
203 in the summary view. | |
204 | |
205 | |
206 EXIT STATUS | |
207 ----------- | |
208 | |
209 :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed | |
210 to standard error, and the tool returns 1. | |
211 | |
212 USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS | |
213 --------------------------------------------- | |
214 :program:`llvm-mca` allows for the optional usage of special code comments to | |
215 mark regions of the assembly code to be analyzed. A comment starting with | |
216 substring ``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment | |
217 starting with substring ``LLVM-MCA-END`` marks the end of a code region. For | |
218 example: | |
219 | |
220 .. code-block:: none | |
221 | |
222 # LLVM-MCA-BEGIN | |
223 ... | |
224 # LLVM-MCA-END | |
225 | |
226 If no user-defined region is specified, then :program:`llvm-mca` assumes a | |
227 default region which contains every instruction in the input file. Every region | |
228 is analyzed in isolation, and the final performance report is the union of all | |
229 the reports generated for every code region. | |
230 | |
231 Code regions can have names. For example: | |
232 | |
233 .. code-block:: none | |
234 | |
235 # LLVM-MCA-BEGIN A simple example | |
236 add %eax, %eax | |
237 # LLVM-MCA-END | |
238 | |
239 The code from the example above defines a region named "A simple example" with a | |
240 single instruction in it. Note how the region name doesn't have to be repeated | |
241 in the ``LLVM-MCA-END`` directive. In the absence of overlapping regions, | |
242 an anonymous ``LLVM-MCA-END`` directive always ends the currently active | |
243 user-defined region. | |
244 | |
245 Example of nesting regions: | |
246 | |
247 .. code-block:: none | |
248 | |
249 # LLVM-MCA-BEGIN foo | |
250 add %eax, %edx | |
251 # LLVM-MCA-BEGIN bar | |
252 sub %eax, %edx | |
253 # LLVM-MCA-END bar | |
254 # LLVM-MCA-END foo | |
255 | |
256 Example of overlapping regions: | |
257 | |
258 .. code-block:: none | |
259 | |
260 # LLVM-MCA-BEGIN foo | |
261 add %eax, %edx | |
262 # LLVM-MCA-BEGIN bar | |
263 sub %eax, %edx | |
264 # LLVM-MCA-END foo | |
265 add %eax, %edx | |
266 # LLVM-MCA-END bar | |
267 | |
268 Note that multiple anonymous regions cannot overlap. Also, overlapping regions | |
269 cannot have the same name. | |
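
The marker rules above can be illustrated with a small region scanner. This is only a sketch of the rules as described in this section, not the parser used by :program:`llvm-mca`; the function name ``find_regions`` is hypothetical, and the choice to let an anonymous ``LLVM-MCA-END`` close the most recently opened region, and to ignore regions left open at the end of the input, are assumptions made for the example.

```python
def find_regions(asm: str) -> dict:
    """Map each region name to the list of instructions it contains."""
    active = {}    # open regions, in the order they were opened
    finished = {}  # closed regions
    for line in asm.splitlines():
        text = line.strip()
        if not text:
            continue
        if text.startswith("#"):
            comment = text[1:].strip()
            if comment.startswith("LLVM-MCA-BEGIN"):
                name = comment[len("LLVM-MCA-BEGIN"):].strip()
                if name in active:
                    # Overlapping regions cannot share a name
                    # (this also covers two anonymous regions).
                    raise ValueError(f"region {name!r} is already open")
                active[name] = []
            elif comment.startswith("LLVM-MCA-END"):
                name = comment[len("LLVM-MCA-END"):].strip()
                if name == "" and name not in active and active:
                    # Assumption: an anonymous END closes the most
                    # recently opened region.
                    name = next(reversed(active))
                if name not in active:
                    raise ValueError(f"unmatched LLVM-MCA-END {name!r}")
                finished[name] = active.pop(name)
            continue
        # An instruction belongs to every region that is currently open.
        for instructions in active.values():
            instructions.append(text)
    # Regions still open here are ignored by this sketch.
    return finished


overlapping = """
# LLVM-MCA-BEGIN foo
add %eax, %edx
# LLVM-MCA-BEGIN bar
sub %eax, %edx
# LLVM-MCA-END foo
add %eax, %edx
# LLVM-MCA-END bar
"""
regions = find_regions(overlapping)
for name, instructions in regions.items():
    print(name, instructions)
```

Applied to the overlapping example, the ``sub`` instruction is attributed to both ``foo`` and ``bar``, which is what makes the two regions overlap rather than nest.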
270 | |
271 There is no support for marking regions from high-level source code, like C or | |
272 C++. As a workaround, inline assembly directives may be used: | |
273 | |
274 .. code-block:: c++ | |
275 | |
276 int foo(int a, int b) { | |
277 __asm volatile("# LLVM-MCA-BEGIN foo"); | |
278 a += 42; | |
279 __asm volatile("# LLVM-MCA-END"); | |
280 a *= b; | |
281 return a; | |
282 } | |
283 | |
284 However, this interferes with optimizations like loop vectorization and may have | |
285 an impact on the code generated. This is because the ``__asm`` statements are | |
286 seen as real code having important side effects, which limits how the code | |
287 around them can be transformed. If users want to make use of inline assembly | |
288 to emit markers, then the recommendation is to always verify that the output | |
289 assembly is equivalent to the assembly generated in the absence of markers. | |
290 The `Clang options to emit optimization reports <https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_ | |
291 can also help in detecting missed optimizations. | |
292 | |
293 HOW LLVM-MCA WORKS | |
294 ------------------ | |
295 | |
296 :program:`llvm-mca` takes assembly code as input. The assembly code is parsed | |
297 into a sequence of MCInst with the help of the existing LLVM target assembly | |
298 parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module | |
299 to generate a performance report. | |
300 | |
301 The Pipeline module simulates the execution of the machine code sequence in a | |
302 loop of iterations (default is 100). During this process, the pipeline collects | |
303 a number of execution related statistics. At the end of this process, the | |
304 pipeline generates and prints a report from the collected statistics. | |
305 | |
306 Here is an example of a performance report generated by the tool for a | |
307 dot-product of two packed float vectors of four elements. The analysis is | |
308 conducted for target x86, cpu btver2. The result below can be produced with | |
309 the following command, using the example located at | |
310 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s``: | |
311 | |
312 .. code-block:: bash | |
313 | |
314 $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s | |
315 | |
316 .. code-block:: none | |
317 | |
318 Iterations: 300 | |
319 Instructions: 900 | |
320 Total Cycles: 610 | |
321 Total uOps: 900 | |
322 | |
323 Dispatch Width: 2 | |
324 uOps Per Cycle: 1.48 | |
325 IPC: 1.48 | |
326 Block RThroughput: 2.0 | |
327 | |
328 | |
329 Instruction Info: | |
330 [1]: #uOps | |
331 [2]: Latency | |
332 [3]: RThroughput | |
333 [4]: MayLoad | |
334 [5]: MayStore | |
335 [6]: HasSideEffects (U) | |
336 | |
337 [1] [2] [3] [4] [5] [6] Instructions: | |
338 1 2 1.00 vmulps %xmm0, %xmm1, %xmm2 | |
339 1 3 1.00 vhaddps %xmm2, %xmm2, %xmm3 | |
340 1 3 1.00 vhaddps %xmm3, %xmm3, %xmm4 | |
341 | |
342 | |
343 Resources: | |
344 [0] - JALU0 | |
345 [1] - JALU1 | |
346 [2] - JDiv | |
347 [3] - JFPA | |
348 [4] - JFPM | |
349 [5] - JFPU0 | |
350 [6] - JFPU1 | |
351 [7] - JLAGU | |
352 [8] - JMul | |
353 [9] - JSAGU | |
354 [10] - JSTC | |
355 [11] - JVALU0 | |
356 [12] - JVALU1 | |
357 [13] - JVIMUL | |
358 | |
359 | |
360 Resource pressure per iteration: | |
361 [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] | |
362 - - - 2.00 1.00 2.00 1.00 - - - - - - - | |
363 | |
364 Resource pressure by instruction: | |
365 [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions: | |
366 - - - - 1.00 - 1.00 - - - - - - - vmulps %xmm0, %xmm1, %xmm2 | |
367 - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm2, %xmm2, %xmm3 | |
368 - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm3, %xmm3, %xmm4 | |
369 | |
370 According to this report, the dot-product kernel has been executed 300 times, | |
371 for a total of 900 simulated instructions. The total number of simulated micro | |
372 opcodes (uOps) is also 900. | |
373 | |
374 The report is structured in three main sections. The first section collects a | |
375 few performance numbers; the goal of this section is to give a very quick | |
376 overview of the performance throughput. Important performance indicators are | |
377 **IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal | |
378 Throughput). | |
379 | |
380 Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched | |
381 to the out-of-order backend every simulated cycle. | |
382 | |
383 IPC is computed by dividing the total number of simulated instructions by the total | |
384 number of cycles. | |
385 | |
386 Field *Block RThroughput* is the reciprocal of the block throughput. Block | |
387 throughput is a theoretical quantity computed as the maximum number of blocks | |
388 (i.e. iterations) that can be executed per simulated clock cycle in the absence | |
389 of loop-carried dependencies. Block throughput is bounded from above by the | |
390 dispatch rate and the availability of hardware resources. | |
391 | |
392 In the absence of loop-carried data dependencies, the observed IPC tends to a | |
393 theoretical maximum which can be computed by dividing the number of instructions | |
394 of a single iteration by the `Block RThroughput`. | |
395 | |
396 Field 'uOps Per Cycle' is computed by dividing the total number of simulated micro | |
397 opcodes by the total number of cycles. A delta between Dispatch Width and this | |
398 field is an indicator of a performance issue. In the absence of loop-carried | |
399 data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical | |
400 maximum throughput which can be computed by dividing the number of uOps of a | |
401 single iteration by the `Block RThroughput`. | |
402 | |
403 Field *uOps Per Cycle* is bounded from above by the dispatch width. That is | |
404 because the dispatch width limits the maximum size of a dispatch group. Both IPC | |
405 and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The | |
406 availability of hardware resources affects the resource pressure distribution, | |
407 and it limits the number of instructions that can be executed in parallel every | |
408 cycle. A delta between Dispatch Width and the theoretical maximum uOps per | |
409 Cycle (computed by dividing the number of uOps of a single iteration by the | |
410 `Block RThroughput`) is an indicator of a performance bottleneck caused by the | |
411 lack of hardware resources. | |
412 In general, the lower the Block RThroughput, the better. | |
413 | |
414 In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there | |
415 are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to | |
416 approach 1.50 when the number of iterations tends to infinity. The delta between | |
417 the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is | |
418 an indicator of a performance bottleneck caused by the lack of hardware | |
419 resources, and the *Resource pressure view* can help to identify the problematic | |
420 resource usage. | |
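
The arithmetic behind the summary fields can be reproduced from the numbers in the sample report. The snippet below is only an illustration of the formulas described above; the variable names are chosen for this example and the inputs are copied by hand from the report.

```python
# Figures copied from the summary section of the sample report.
iterations = 300
instructions = 900
total_cycles = 610
total_uops = 900
dispatch_width = 2
block_rthroughput = 2.0

# Observed throughput: totals divided by the number of simulated cycles.
ipc = instructions / total_cycles
uops_per_cycle = total_uops / total_cycles

# Theoretical maximum in the absence of loop-carried dependencies:
# uOps of a single iteration divided by the Block RThroughput.
uops_per_iteration = total_uops / iterations
max_uops_per_cycle = uops_per_iteration / block_rthroughput

print(f"IPC:                    {ipc:.2f}")
print(f"uOps Per Cycle:         {uops_per_cycle:.2f}")
print(f"Max uOps Per Cycle:     {max_uops_per_cycle:.2f}")
print(f"Delta to dispatch width: {dispatch_width - max_uops_per_cycle:.2f}")
```

The observed value (1.48 after 300 iterations) sits just below the theoretical maximum of 1.50, and the remaining gap of 0.50 to the dispatch width is the part attributed to the lack of hardware resources.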
421 | |
422 The second section of the report is the `instruction info view`. It shows the | |
423 latency and reciprocal throughput of every instruction in the sequence. It also | |
424 reports extra information related to the number of micro opcodes, and opcode | |
425 properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects'). | |
426 | |
427 Field *RThroughput* is the reciprocal of the instruction throughput. Throughput | |
428 is computed as the maximum number of instructions of a same type that can be | |
429 executed per clock cycle in the absence of operand dependencies. In this | |
430 example, the reciprocal throughput of a vector float multiply is 1 | |
431 cycle/instruction. That is because the FP multiplier JFPM is only available | |
432 from pipeline JFPU1. | |
433 | |
434 Instruction encodings are displayed within the instruction info view when flag | |
435 `-show-encoding` is specified. | |
436 | |
437 Below is an example of `-show-encoding` output for the dot-product kernel: | |
438 | |
439 .. code-block:: none | |
440 | |
441 Instruction Info: | |
442 [1]: #uOps | |
443 [2]: Latency | |
444 [3]: RThroughput | |
445 [4]: MayLoad | |
446 [5]: MayStore | |
447 [6]: HasSideEffects (U) | |
448 [7]: Encoding Size | |
449 | |
450 [1] [2] [3] [4] [5] [6] [7] Encodings: Instructions: | |
451 1 2 1.00 4 c5 f0 59 d0 vmulps %xmm0, %xmm1, %xmm2 | |
452 1 4 1.00 4 c5 eb 7c da vhaddps %xmm2, %xmm2, %xmm3 | |
453 1 4 1.00 4 c5 e3 7c e3 vhaddps %xmm3, %xmm3, %xmm4 | |
454 | |
455 The `Encoding Size` column shows the size in bytes of instructions. The | |
456 `Encodings` column shows the actual instruction encodings (byte sequences in | |
457 hex). | |
458 | |
459 The third section is the *Resource pressure view*. This view reports | |
460 the average number of resource cycles consumed every iteration by instructions | |
461 for every processor resource unit available on the target. Information is | |
462 structured in two tables. The first table reports the number of resource cycles | |
463 spent on average every iteration. The second table correlates the resource | |
464 cycles to the machine instruction in the sequence. For example, every iteration | |
465 of the instruction vmulps always executes on resource unit [6] | |
466 (JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle | |
467 per iteration. Note that on AMD Jaguar, vector floating-point multiply can | |
468 only be issued to pipeline JFPU1, while horizontal floating-point additions can | |
469 only be issued to pipeline JFPU0. | |
470 | |
471 The resource pressure view helps with identifying bottlenecks caused by high | |
472 usage of specific hardware resources. Situations with resource pressure mainly | |
473 concentrated on a few resources should, in general, be avoided. Ideally, | |
474 pressure should be uniformly distributed between multiple resources. | |
475 | |
476 Timeline View | |
477 ^^^^^^^^^^^^^ | |
478 The timeline view produces a detailed report of each instruction's state | |
479 transitions through an instruction pipeline. This view is enabled by the | |
480 command line option ``-timeline``. As instructions transition through the | |
481 various stages of the pipeline, their states are depicted in the view report. | |
482 These states are represented by the following characters: | |
483 | |
484 * D : Instruction dispatched. | |
485 * e : Instruction executing. | |
486 * E : Instruction executed. | |
487 * R : Instruction retired. | |
488 * = : Instruction already dispatched, waiting to be executed. | |
489 * \- : Instruction executed, waiting to be retired. | |
490 | |
491 Below is the timeline view for a subset of the dot-product example located in | |
492 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by | |
493 :program:`llvm-mca` using the following command: | |
494 | |
495 .. code-block:: bash | |
496 | |
497 $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s | |
498 | |
499 .. code-block:: none | |
500 | |
501 Timeline view: | |
502 012345 | |
503 Index 0123456789 | |
504 | |
505 [0,0] DeeER. . . vmulps %xmm0, %xmm1, %xmm2 | |
506 [0,1] D==eeeER . . vhaddps %xmm2, %xmm2, %xmm3 | |
507 [0,2] .D====eeeER . vhaddps %xmm3, %xmm3, %xmm4 | |
508 [1,0] .DeeE-----R . vmulps %xmm0, %xmm1, %xmm2 | |
509 [1,1] . D=eeeE---R . vhaddps %xmm2, %xmm2, %xmm3 | |
510 [1,2] . D====eeeER . vhaddps %xmm3, %xmm3, %xmm4 | |
511 [2,0] . DeeE-----R . vmulps %xmm0, %xmm1, %xmm2 | |
512 [2,1] . D====eeeER . vhaddps %xmm2, %xmm2, %xmm3 | |
513 [2,2] . D======eeeER vhaddps %xmm3, %xmm3, %xmm4 | |
514 | |
515 | |
516 Average Wait times (based on the timeline view): | |
517 [0]: Executions | |
518 [1]: Average time spent waiting in a scheduler's queue | |
519 [2]: Average time spent waiting in a scheduler's queue while ready | |
520 [3]: Average time elapsed from WB until retire stage | |
521 | |
522 [0] [1] [2] [3] | |
523 0. 3 1.0 1.0 3.3 vmulps %xmm0, %xmm1, %xmm2 | |
524 1. 3 3.3 0.7 1.0 vhaddps %xmm2, %xmm2, %xmm3 | |
525 2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4 | |
526 | |
527 The timeline view is interesting because it shows instruction state changes | |
528 during execution. It also gives an idea of how the tool processes instructions | |
529 executed on the target, and how their timing information might be calculated. | |
530 | |
531 The timeline view is structured in two tables. The first table shows | |
532 instructions changing state over time (measured in cycles); the second table | |
533 (named *Average Wait times*) reports useful timing statistics, which should | |
534 help diagnose performance bottlenecks caused by long data dependencies and | |
535 sub-optimal usage of hardware resources. | |
536 | |
537 An instruction in the timeline view is identified by a pair of indices, where | |
538 the first index identifies an iteration, and the second index is the | |
539 instruction index (i.e., where it appears in the code sequence). Since this | |
540 example was generated using 3 iterations (``-iterations=3``), the iteration | |
541 indices range from 0 to 2 inclusive. | |
542 | |
543 Excluding the first and last column, the remaining columns are in cycles. | |
544 Cycles are numbered sequentially starting from 0. | |
545 | |
546 From the example output above, we know the following: | |
547 | |
548 * Instruction [1,0] was dispatched at cycle 1. | |
549 * Instruction [1,0] started executing at cycle 2. | |
550 * Instruction [1,0] reached the write back stage at cycle 4. | |
551 * Instruction [1,0] was retired at cycle 10. | |
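
The four facts above can be read off mechanically, because each character position in a timeline row corresponds to one cycle. The helper below is a hypothetical sketch, not part of :program:`llvm-mca`; it assumes the row string starts at cycle 0 and contains only the timeline characters, without the index or the instruction text.

```python
def timeline_events(row: str) -> dict:
    """Return the cycle at which each pipeline event occurs in one row."""
    return {
        "dispatched": row.index("D"),         # dispatch
        "started executing": row.index("e"),  # first execution cycle
        "write-back": row.index("E"),         # execution completed
        "retired": row.index("R"),            # retirement
    }


# Timeline cells of instruction [1,0] from the example output.
events = timeline_events(".DeeE-----R")
for name, cycle in events.items():
    print(f"{name}: cycle {cycle}")
```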
552 | |
553 Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the | |
554 scheduler's queue for the operands to become available. By the time vmulps is | |
555 dispatched, operands are already available, and pipeline JFPU1 is ready to | |
556 serve another instruction. So the instruction can be immediately issued on the | |
557 JFPU1 pipeline. That is demonstrated by the fact that the instruction only | |
558 spent 1cy in the scheduler's queue. | |
559 | |
560 There is a gap of 5 cycles between the write-back stage and the retire event. | |
561 That is because instructions must retire in program order, so [1,0] has to wait | |
562 for [0,2] to be retired first (i.e., it has to wait until cycle 10). | |
563 | |
564 In the example, all instructions are in a RAW (Read After Write) dependency | |
565 chain. Register %xmm2 written by vmulps is immediately used by the first | |
566 vhaddps, and register %xmm3 written by the first vhaddps is used by the second | |
567 vhaddps. Long data dependencies negatively impact the ILP (Instruction Level | |
568 Parallelism). | |
569 | |
570 In the dot-product example, there are anti-dependencies introduced by | |
571 instructions from different iterations. However, those dependencies can be | |
572 removed at register renaming stage (at the cost of allocating register aliases, | |
573 and therefore consuming physical registers). | |
574 | |
575 Table *Average Wait times* helps diagnose performance issues that are caused by | |
576 the presence of long latency instructions and potentially long data dependencies | |
577 which may limit the ILP. Note that :program:`llvm-mca`, by default, assumes at | |
578 least 1cy between the dispatch event and the issue event. | |
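
Columns [1] and [3] of the *Average Wait times* table can be recomputed from the timeline rows of the example. The sketch below is an assumption-laden illustration: it treats the time in the scheduler's queue as the distance between ``D`` and the first ``e`` (which includes the minimum 1cy mentioned above), and the wait for retirement as the number of ``-`` characters. Both readings agree with the sample output but are inferred from it, not taken from the tool's source. Column [2] cannot be derived from the timeline alone and is left out.

```python
from statistics import mean

# D...R spans copied from the three-iteration timeline, grouped by
# instruction. Leading idle columns are omitted because only distances
# within a span matter here.
timelines = {
    "vmulps": ["DeeER", "DeeE-----R", "DeeE-----R"],
    "vhaddps (first)": ["D==eeeER", "D=eeeE---R", "D====eeeER"],
    "vhaddps (second)": ["D====eeeER", "D====eeeER", "D======eeeER"],
}


def queue_cycles(span: str) -> int:
    """Cycles between dispatch and the first execution cycle."""
    return span.index("e") - span.index("D")


def retire_wait_cycles(span: str) -> int:
    """Cycles spent executed but not yet retired."""
    return span.count("-")


averages = {
    name: (
        mean(queue_cycles(s) for s in spans),
        mean(retire_wait_cycles(s) for s in spans),
    )
    for name, spans in timelines.items()
}
for name, (queue, retire) in averages.items():
    print(f"{name}: queue {queue:.1f}, WB to retire {retire:.1f}")
```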
579 | |
580 When the performance is limited by data dependencies and/or long latency | |
581 instructions, the number of cycles spent while in the *ready* state is expected | |
582 to be very small when compared with the total number of cycles spent in the | |
583 scheduler's queue. The difference between the two counters is a good indicator | |
584 of how large of an impact data dependencies had on the execution of the | |
585 instructions. When performance is mostly limited by the lack of hardware | |
586 resources, the delta between the two counters is small. However, the number of | |
587 cycles spent in the queue tends to be larger (i.e., more than 1-3cy), | |
588 especially when compared to other low latency instructions. | |
589 | |
590 Bottleneck Analysis | |
591 ^^^^^^^^^^^^^^^^^^^ | |
592 The ``-bottleneck-analysis`` command line option enables the analysis of | |
593 performance bottlenecks. | |
594 | |
595 This analysis is potentially expensive. It attempts to correlate increases in | |
596 backend pressure (caused by pipeline resource pressure and data dependencies) to | |
597 dynamic dispatch stalls. | |
598 | |
599 Below is an example of ``-bottleneck-analysis`` output generated by | |
600 :program:`llvm-mca` for 500 iterations of the dot-product example on btver2. | |
601 | |
602 .. code-block:: none | |
603 | |
604 | |
605 Cycles with backend pressure increase [ 48.07% ] | |
606 Throughput Bottlenecks: | |
607 Resource Pressure [ 47.77% ] | |
608 - JFPA [ 47.77% ] | |
609 - JFPU0 [ 47.77% ] | |
610 Data Dependencies: [ 0.30% ] | |
611 - Register Dependencies [ 0.30% ] | |
612 - Memory Dependencies [ 0.00% ] | |
613 | |
614 Critical sequence based on the simulation: | |
615 | |
616 Instruction Dependency Information | |
617 +----< 2. vhaddps %xmm3, %xmm3, %xmm4 | |
618 | | |
619 | < loop carried > | |
620 | | |
621 | 0. vmulps %xmm0, %xmm1, %xmm2 | |
622 +----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ] | |
623 +----> 2. vhaddps %xmm3, %xmm3, %xmm4 ## REGISTER dependency: %xmm3 | |
624 | | |
625 | < loop carried > | |
626 | | |
627 +----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ] | |
628 | |
629 | |
630 According to the analysis, throughput is limited by resource pressure and not by | |
631 data dependencies. The analysis observed increases in backend pressure during | |
632 48.07% of the simulated run. Almost all those pressure increase events were | |
633 caused by contention on processor resources JFPA/JFPU0. | |
634 | |
635 The `critical sequence` is the most expensive sequence of instructions according | |
636 to the simulation. It is annotated to provide extra information about critical | |
637 register dependencies and resource interferences between instructions. | |
638 | |
639 Instructions from the critical sequence are expected to significantly impact | |
640 performance. By construction, the accuracy of this analysis is strongly | |
641 dependent on the simulation and (as always) on the quality of the processor | |
642 model in LLVM. | |
643 | |
644 | |
645 Extra Statistics to Further Diagnose Performance Issues | |
646 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
647 The ``-all-stats`` command line option enables extra statistics and performance | |
648 counters for the dispatch logic, the reorder buffer, the retire control unit, | |
649 and the register file. | |
650 | |
651 Below is an example of ``-all-stats`` output generated by :program:`llvm-mca` | |
652 for 300 iterations of the dot-product example discussed in the previous | |
653 sections. | |
654 | |
655 .. code-block:: none | |
656 | |
657 Dynamic Dispatch Stall Cycles: | |
658 RAT - Register unavailable: 0 | |
659 RCU - Retire tokens unavailable: 0 | |
660 SCHEDQ - Scheduler full: 272 (44.6%) | |
661 LQ - Load queue full: 0 | |
662 SQ - Store queue full: 0 | |
663 GROUP - Static restrictions on the dispatch group: 0 | |
664 | |
665 | |
666 Dispatch Logic - number of cycles where we saw N micro opcodes dispatched: | |
667 [# dispatched], [# cycles] | |
668 0, 24 (3.9%) | |
669 1, 272 (44.6%) | |
670 2, 314 (51.5%) | |
671 | |
672 | |
673 Schedulers - number of cycles where we saw N micro opcodes issued: | |
674 [# issued], [# cycles] | |
675 0, 7 (1.1%) | |
676 1, 306 (50.2%) | |
677 2, 297 (48.7%) | |
678 | |
679 Scheduler's queue usage: | |
680 [1] Resource name. | |
681 [2] Average number of used buffer entries. | |
682 [3] Maximum number of used buffer entries. | |
683 [4] Total number of buffer entries. | |
684 | |
685 [1] [2] [3] [4] | |
686 JALU01 0 0 20 | |
687 JFPU01 17 18 18 | |
688 JLSAGU 0 0 12 | |
689 | |
690 | |
691 Retire Control Unit - number of cycles where we saw N instructions retired: | |
692 [# retired], [# cycles] | |
693 0, 109 (17.9%) | |
694 1, 102 (16.7%) | |
695 2, 399 (65.4%) | |
696 | |
697 Total ROB Entries: 64 | |
698 Max Used ROB Entries: 35 ( 54.7% ) | |
699 Average Used ROB Entries per cy: 32 ( 50.0% ) | |
700 | |
701 | |
702 Register File statistics: | |
703 Total number of mappings created: 900 | |
704 Max number of mappings used: 35 | |
705 | |
706 * Register File #1 -- JFpuPRF: | |
707 Number of physical registers: 72 | |
708 Total number of mappings created: 900 | |
709 Max number of mappings used: 35 | |
710 | |
711 * Register File #2 -- JIntegerPRF: | |
712 Number of physical registers: 64 | |
713 Total number of mappings created: 0 | |
714 Max number of mappings used: 0 | |
715 | |
716 If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for | |
717 SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch | |
718 logic is unable to dispatch a full group because the scheduler's queue is full. | |
719 | |
720 Looking at the *Dispatch Logic* table, we see that the pipeline was only able to | |
721 dispatch two micro opcodes 51.5% of the time. The dispatch group was limited to | |
722 one micro opcode 44.6% of the cycles, which corresponds to 272 cycles. The | |
723 dispatch statistics are displayed by using either the command option | |
724 ``-all-stats`` or ``-dispatch-stats``. | |
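
The percentages in the *Dispatch Logic* table can be recomputed from the cycle
counts. The following is a minimal sketch, not part of :program:`llvm-mca`; the
variable and function names are illustrative assumptions, and only the cycle
counts are taken from the report above.

```python
# Dispatch histogram copied from the report above:
# number of micro opcodes dispatched -> number of cycles.
dispatch_histogram = {0: 24, 1: 272, 2: 314}

total_cycles = sum(dispatch_histogram.values())
total_uops = sum(n * cycles for n, cycles in dispatch_histogram.items())


def share(cycles, total):
    """Return the percentage of cycles, rounded to one decimal like the report."""
    return round(100.0 * cycles / total, 1)


percentages = {n: share(c, total_cycles) for n, c in dispatch_histogram.items()}
average_dispatched = total_uops / total_cycles

for n, c in dispatch_histogram.items():
    print(f"{n}, {c} ({percentages[n]}%)")
print(f"average micro opcodes dispatched per cycle: {average_dispatched:.2f}")
```

The counts add up to the 610 simulated cycles, and the weighted sum gives 900
dispatched micro opcodes, that is, about 1.48 micro opcodes per cycle against a
peak of two.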
725 | |
726 The next table, *Schedulers*, presents a histogram of the number of micro | |
727 opcodes issued per cycle. In this case, of the 610 simulated cycles, there | |
728 were 306 cycles (50.2%) in which a single micro opcode was issued, and 7 cycles | |
729 in which no micro opcodes were issued. | |
730 | |
731 The *Scheduler's queue usage* table shows the average and maximum number of | |
732 buffer entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01 | |
733 reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements | |
734 three schedulers: | |
735 | |
736 * JALU01 - A scheduler for ALU instructions. | |
737 * JFPU01 - A scheduler for floating point operations. | |
738 * JLSAGU - A scheduler for address generation. | |
739 | |
740 The dot-product is a kernel of three floating point instructions (a vector | |
741 multiply followed by two horizontal adds). That explains why only the floating | |
742 point scheduler appears to be used. | |
743 | |
744 A full scheduler queue is either caused by data dependency chains or by a | |
745 sub-optimal usage of hardware resources. Sometimes, resource pressure can be | |
746 mitigated by rewriting the kernel using different instructions that consume | |
747 different scheduler resources. Schedulers with a small queue are less resilient | |
748 to bottlenecks caused by the presence of long data dependencies. The scheduler | |
749 statistics are displayed by using the command option ``-all-stats`` or | |
750 ``-scheduler-stats``. | |
751 | |
752 The next table, *Retire Control Unit*, presents a histogram of the number of | |
753 instructions retired per cycle. In this case, of the 610 simulated cycles, | |
754 there were 399 cycles (65.4%) in which two instructions were retired, and 109 | |
755 cycles in which no instructions were retired. The retire statistics are | |
756 displayed by using the command option ``-all-stats`` or | |
757 ``-retire-stats``. | |
758 | |
759 The last table presented is *Register File statistics*. Each physical register | |
760 file (PRF) used by the pipeline is presented in this table. In the case of AMD | |
761 Jaguar, there are two register files, one for floating-point registers (JFpuPRF) | |
762 and one for integer registers (JIntegerPRF). The table shows that of the 900 | |
763 instructions processed, there were 900 mappings created. Since this dot-product | |
764 example utilized only floating point registers, the JFpuPRF was responsible for | |
765 creating the 900 mappings. However, we see that the pipeline only used a | |
766 maximum of 35 of 72 available register slots at any given time. We can conclude | |
767 that the floating point PRF was the only register file used for the example, and | |
768 that it was never resource constrained. The register file statistics are | |
769 displayed by using the command option ``-all-stats`` or | |
770 ``-register-file-stats``. | |
771 | |
772 In this example, we can conclude that the IPC is mostly limited by data | |
773 dependencies, and not by resource pressure. | |
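
The figures quoted in this walkthrough can be cross-checked with a few lines of
arithmetic. This is a minimal sketch; the helper name ``utilization`` and the
dictionary layout are assumptions made for illustration, while the numbers come
from the tables and text above.

```python
# Figures copied from the report and the text above.
instructions = 900   # instructions processed
cycles = 610         # simulated cycles


def utilization(used, capacity):
    """Percentage of a resource that was in use, rounded to one decimal."""
    return round(100.0 * used / capacity, 1)


ipc = round(instructions / cycles, 2)

# Peak occupancy of each structure: (maximum used, total available).
peak = {
    "JFPU01 scheduler queue": utilization(18, 18),
    "reorder buffer": utilization(35, 64),
    "JFpuPRF": utilization(35, 72),
}

print(f"IPC: {ipc}")
for name, percent in peak.items():
    print(f"{name}: {percent}% at peak")
```

Only the JFPU01 queue is saturated; the reorder buffer and the floating point
register file peak at roughly half of their capacity. That is consistent with
the earlier remark that a full scheduler queue can be caused by data dependency
chains.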
774 | |
775 Instruction Flow | |
776 ^^^^^^^^^^^^^^^^ | |
777 This section describes the instruction flow through the default pipeline of | |
778 :program:`llvm-mca`, as well as the functional units involved in the process. | |
779 | |
780 The default pipeline implements the following sequence of stages used to | |
781 process instructions. | |
782 | |
783 * Dispatch (Instruction is dispatched to the schedulers). | |
784 * Issue (Instruction is issued to the processor pipelines). | |
785 * Write Back (Instruction is executed, and results are written back). | |
786 * Retire (Instruction is retired; writes are architecturally committed). | |
787 | |
788 The default pipeline only models the out-of-order portion of a processor. | |
789 Therefore, the instruction fetch and decode stages are not modeled. Performance | |
790 bottlenecks in the frontend are not diagnosed. :program:`llvm-mca` assumes that | |
791 instructions have all been decoded and placed into a queue before the simulation | |
792 starts. Also, :program:`llvm-mca` does not model branch prediction. | |
793 | |
794 Instruction Dispatch | |
795 """""""""""""""""""" | |
796 During the dispatch stage, instructions are picked in program order from a | |
797 queue of already decoded instructions, and dispatched in groups to the | |
798 simulated hardware schedulers. | |
799 | |
800 The size of a dispatch group depends on the availability of the simulated | |
801 hardware resources. The processor dispatch width defaults to the value | |
802 of the ``IssueWidth`` in LLVM's scheduling model. | |
803 | |
804 An instruction can be dispatched if: | |
805 | |
806 * The size of the dispatch group is smaller than the processor's dispatch width. | |
807 * There are enough entries in the reorder buffer. | |
808 * There are enough physical registers to do register renaming. | |
809 * The schedulers are not full. | |
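
The four conditions above can be read as a single predicate. The sketch below
is a simplified model and not the :program:`llvm-mca` implementation:
``DispatchState``, ``blocking_reasons`` and ``can_dispatch`` are hypothetical
names, and it assumes that an instruction needs one reorder buffer entry per
micro opcode and one scheduler buffer entry.

```python
from dataclasses import dataclass


@dataclass
class DispatchState:
    """Hypothetical snapshot of the simulated resources during one cycle."""
    dispatch_width: int          # maximum size of a dispatch group
    group_size: int              # micro opcodes already in the current group
    free_rob_entries: int        # free reorder buffer entries
    free_phys_registers: int     # free physical registers for renaming
    free_scheduler_entries: int  # free scheduler buffer entries


def blocking_reasons(state, uops, registers_needed):
    """Return the list of conditions that prevent dispatching an instruction."""
    reasons = []
    if state.group_size >= state.dispatch_width:
        reasons.append("dispatch group full")
    if state.free_rob_entries < uops:
        reasons.append("not enough reorder buffer entries")
    if state.free_phys_registers < registers_needed:
        reasons.append("not enough physical registers")
    if state.free_scheduler_entries < 1:
        reasons.append("scheduler full")
    return reasons


def can_dispatch(state, uops, registers_needed):
    """An instruction can be dispatched only if nothing blocks it."""
    return not blocking_reasons(state, uops, registers_needed)


# Example: everything is available except scheduler buffer entries.
stalled = DispatchState(dispatch_width=2, group_size=1, free_rob_entries=10,
                        free_phys_registers=5, free_scheduler_entries=0)
print(blocking_reasons(stalled, uops=1, registers_needed=1))
```

In the example, the only blocking condition is the full scheduler, which is the
situation counted by SCHEDQ in the dot-product report.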
810 | |
811 Scheduling models can optionally specify which register files are available on | |
812 the processor. :program:`llvm-mca` uses that information to initialize register | |
813 file descriptors. Users can limit the number of physical registers that are | |
814 globally available for register renaming by using the command option | |
815 ``-register-file-size``. A value of zero for this option means *unbounded*. By | |
816 knowing how many registers are available for renaming, the tool can predict | |
817 dispatch stalls caused by the lack of physical registers. | |
818 | |
819 The number of reorder buffer entries consumed by an instruction depends on the | |
820 number of micro-opcodes specified for that instruction by the target scheduling | |
821 model. The reorder buffer is responsible for tracking the progress of | |
822 instructions that are "in-flight", and retiring them in program order. The | |
823 number of entries in the reorder buffer defaults to the value specified by field | |
824 ``MicroOpBufferSize`` in the target scheduling model. | |
825 | |
826 Instructions that are dispatched to the schedulers consume scheduler buffer | |
827 entries. :program:`llvm-mca` queries the scheduling model to determine the set | |
828 of buffered resources consumed by an instruction. Buffered resources are | |
829 treated like scheduler resources. | |
830 | |
831 Instruction Issue | |
832 """"""""""""""""" | |
833 Each processor scheduler implements a buffer of instructions. An instruction | |
834 has to wait in the scheduler's buffer until input register operands become | |
835 available. Only at that point does the instruction become eligible for | |
836 execution and may be issued (potentially out-of-order) for execution. | |
837 Instruction latencies are computed by :program:`llvm-mca` with the help of the | |
838 scheduling model. | |
839 | |
840 :program:`llvm-mca`'s scheduler is designed to simulate multiple processor | |
841 schedulers. The scheduler is responsible for tracking data dependencies, and | |
842 dynamically selecting which processor resources are consumed by instructions. | |
843 It delegates the management of processor resource units and resource groups to a | |
844 resource manager. The resource manager is responsible for selecting resource | |
845 units that are consumed by instructions. For example, if an instruction | |
846 consumes 1cy of a resource group, the resource manager selects one of the | |
847 available units from the group; by default, the resource manager uses a | |
848 round-robin selector to guarantee that resource usage is uniformly distributed | |
849 between all units of a group. | |
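
Round-robin selection within a resource group can be sketched as follows. This
is an illustrative model only: the class ``RoundRobinGroup``, its method names,
and the unit names ``unit0`` and ``unit1`` are assumptions, not identifiers
from :program:`llvm-mca` or from a scheduling model.

```python
class RoundRobinGroup:
    """Hypothetical resource group that hands out its units in round-robin order."""

    def __init__(self, units):
        self.units = list(units)
        # Cycle at which each unit becomes available again.
        self.busy_until = {unit: 0 for unit in self.units}
        self.next_index = 0  # round-robin cursor

    def select(self, now, cycles=1):
        """Reserve the next available unit for `cycles` cycles, or return None."""
        count = len(self.units)
        for offset in range(count):
            index = (self.next_index + offset) % count
            unit = self.units[index]
            if self.busy_until[unit] <= now:
                self.busy_until[unit] = now + cycles
                self.next_index = (index + 1) % count
                return unit
        return None  # every unit of the group is busy in this cycle


group = RoundRobinGroup(["unit0", "unit1"])
picks = [group.select(now=0), group.select(now=0), group.select(now=0)]
print(picks)
```

Two instructions that each consume 1cy of the group in the same cycle are
spread over both units; a third request in that cycle finds no free unit.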
850 | |
851 :program:`llvm-mca`'s scheduler internally groups instructions into three sets: | |
852 | |
853 * WaitSet: a set of instructions whose operands are not ready. | |
854 * ReadySet: a set of instructions ready to execute. | |
855 * IssuedSet: a set of instructions executing. | |
856 | |
857 Depending on operand availability, instructions that are dispatched to the | |
858 scheduler are either placed into the WaitSet or into the ReadySet. | |
859 | |
860 Every cycle, the scheduler checks if instructions can be moved from the WaitSet | |
861 to the ReadySet, and if instructions from the ReadySet can be issued to the | |
862 underlying pipelines. The algorithm prioritizes older instructions over younger | |
863 instructions. | |
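
The movement of instructions between the three sets can be illustrated with a
toy model. The sketch below is not the :program:`llvm-mca` scheduler: the names
``Instr`` and ``ToyScheduler`` are hypothetical, resource pressure is reduced
to a single assumed ``issue_width`` limit, and every instruction is assumed to
have a latency of at least one cycle.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    index: int                              # position in program order (lower is older)
    latency: int                            # cycles spent executing (at least 1)
    deps: set = field(default_factory=set)  # indices of producer instructions
    cycles_left: int = 0


class ToyScheduler:
    def __init__(self, issue_width):
        self.issue_width = issue_width
        self.wait_set, self.ready_set, self.issued_set = [], [], []
        self.completed = set()

    def dispatch(self, instr):
        """Place a dispatched instruction into the ReadySet or the WaitSet."""
        ready = instr.deps <= self.completed
        (self.ready_set if ready else self.wait_set).append(instr)

    def cycle(self):
        """Simulate one cycle and return the indices that finished executing."""
        # 1. Instructions in the IssuedSet make progress; finished ones leave it.
        for instr in self.issued_set:
            instr.cycles_left -= 1
        finished = [i.index for i in self.issued_set if i.cycles_left == 0]
        self.issued_set = [i for i in self.issued_set if i.cycles_left > 0]
        self.completed.update(finished)
        # 2. WaitSet -> ReadySet for instructions whose operands are now ready.
        self.ready_set += [i for i in self.wait_set if i.deps <= self.completed]
        self.wait_set = [i for i in self.wait_set if not i.deps <= self.completed]
        # 3. ReadySet -> IssuedSet, older instructions first.
        self.ready_set.sort(key=lambda i: i.index)
        to_issue = self.ready_set[:self.issue_width]
        self.ready_set = self.ready_set[self.issue_width:]
        for instr in to_issue:
            instr.cycles_left = instr.latency
            self.issued_set.append(instr)
        return finished


scheduler = ToyScheduler(issue_width=1)
for instr in (Instr(0, latency=2), Instr(1, latency=1, deps={0}), Instr(2, latency=1)):
    scheduler.dispatch(instr)
trace = [scheduler.cycle() for _ in range(4)]
print(trace)
```

Instruction 1 depends on instruction 0, so it stays in the WaitSet while the
younger but independent instruction 2 is issued ahead of it.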
864 | |
865 Write-Back and Retire Stage | |
866 """"""""""""""""""""""""""" | |
867 Issued instructions are moved from the ReadySet to the IssuedSet. There, | |
868 instructions wait until they reach the write-back stage. At that point, they | |
869 get removed from the queue and the retire control unit is notified. | |
870 | |
871 When an instruction is executed, the retire control unit flags it as | |
872 "ready to retire." | |
873 | |
874 Instructions are retired in program order. The register file is notified of the | |
875 retirement so that it can free the physical registers that were allocated for | |
876 the instruction during the register renaming stage. | |
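
In-order retirement can be sketched with a queue that only releases its oldest
entry once that entry has executed. This is a simplified model under stated
assumptions: ``ToyRetireControlUnit``, ``retire_width`` and the
``freed_registers`` counter (standing in for the notification sent to the
register file) are hypothetical names, not part of :program:`llvm-mca`.

```python
from collections import deque


class ToyRetireControlUnit:
    def __init__(self, rob_size, retire_width):
        self.free_entries = rob_size      # free reorder buffer entries
        self.retire_width = retire_width  # assumed limit on retirements per cycle
        self.rob = deque()                # in-flight instructions in program order
        self.freed_registers = 0          # registers handed back to the register file

    def reserve(self, index, uops, registers):
        """Allocate reorder buffer entries at dispatch; fail if there is no room."""
        if uops > self.free_entries:
            return False
        self.free_entries -= uops
        self.rob.append({"index": index, "uops": uops,
                         "registers": registers, "executed": False})
        return True

    def on_executed(self, index):
        """Flag an instruction as ready to retire."""
        for entry in self.rob:
            if entry["index"] == index:
                entry["executed"] = True

    def retire(self):
        """Retire executed instructions from the head of the buffer, in order."""
        retired = []
        while (self.rob and self.rob[0]["executed"]
               and len(retired) < self.retire_width):
            entry = self.rob.popleft()
            self.free_entries += entry["uops"]
            self.freed_registers += entry["registers"]
            retired.append(entry["index"])
        return retired


rcu = ToyRetireControlUnit(rob_size=4, retire_width=2)
for index in range(3):
    rcu.reserve(index, uops=1, registers=1)
rcu.on_executed(1)
rcu.on_executed(2)
blocked = rcu.retire()      # instruction 0 has not executed yet
rcu.on_executed(0)
first = rcu.retire()
second = rcu.retire()
print(blocked, first, second)
```

Instructions 1 and 2 finish early, but nothing retires until the older
instruction 0 has executed.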
877 | |
878 Load/Store Unit and Memory Consistency Model | |
879 """""""""""""""""""""""""""""""""""""""""""" | |
880 To model the out-of-order execution of memory operations, :program:`llvm-mca` | |
881 uses a load/store unit (LSUnit) that simulates the speculative execution of | |
882 loads and stores. | |
883 | |
884 Each load (or store) consumes an entry in the load (or store) queue. Users can | |
885 specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the | |
886 load and store queues respectively. The queues are unbounded by default. | |
887 | |
888 The LSUnit implements a relaxed consistency model for memory loads and stores. | |
889 The rules are: | |
890 | |
891 1. A younger load is allowed to pass an older load only if there are no | |
892 intervening stores or barriers between the two loads. | |
893 2. A younger load is allowed to pass an older store provided that the load does | |
894 not alias with the store. | |
895 3. A younger store is not allowed to pass an older store. | |
896 4. A younger store is not allowed to pass an older load. | |
897 | |
898 By default, the LSUnit optimistically assumes that loads do not alias with | |
899 store operations (``-noalias=true``). Under this assumption, younger loads are | |
900 always allowed to pass older stores. Essentially, the LSUnit does not attempt | |
901 to run any alias analysis to predict when loads and stores do not alias with | |
902 each other. | |
903 | |
904 Note that, in the case of write-combining memory, rule 3 could be relaxed to | |
905 allow reordering of non-aliasing store operations. That being said, at the | |
906 moment, there is no way to further relax the memory model (``-noalias`` is the | |
907 only option). Essentially, there is no option to specify a different memory | |
908 type (e.g., write-back, write-combining, write-through, etc.) and consequently | |
909 to weaken, or strengthen, the memory model. | |
910 | |
911 Other limitations are: | |
912 | |
913 * The LSUnit does not know when store-to-load forwarding may occur. | |
914 * The LSUnit does not know anything about cache hierarchy and memory types. | |
915 * The LSUnit does not know how to identify serializing operations and memory | |
916 fences. | |
917 | |
918 The LSUnit does not attempt to predict if a load or store hits or misses the L1 | |
919 cache. It only knows if an instruction "MayLoad" and/or "MayStore." For | |
920 loads, the scheduling model provides an "optimistic" load-to-use latency (which | |
921 usually matches the load-to-use latency for when there is a hit in the L1D). | |
922 | |
923 :program:`llvm-mca` does not know about serializing operations or memory-barrier-like | |
924 instructions. The LSUnit conservatively assumes that an instruction which | |
925 has both "MayLoad" and unmodeled side effects behaves like a "soft" | |
926 load-barrier. That means, it serializes loads without forcing a flush of the | |
927 load queue. Similarly, instructions that "MayStore" and have unmodeled side | |
928 effects are treated like store barriers. A full memory barrier is a "MayLoad" | |
929 and "MayStore" instruction with unmodeled side effects. This is inaccurate, but | |
930 it is the best that we can do at the moment with the current information | |
931 available in LLVM. | |
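
The conservative barrier classification described above depends on only three
properties of an instruction. The following sketch is illustrative; the
function names ``classify_memory_op`` and ``describe`` and their parameters are
assumptions, not an :program:`llvm-mca` interface.

```python
def classify_memory_op(may_load, may_store, has_unmodeled_side_effects):
    """Return (is_load_barrier, is_store_barrier) under the conservative rule."""
    is_load_barrier = may_load and has_unmodeled_side_effects
    is_store_barrier = may_store and has_unmodeled_side_effects
    return is_load_barrier, is_store_barrier


def describe(may_load, may_store, has_unmodeled_side_effects):
    """Human-readable label for the classification."""
    load_barrier, store_barrier = classify_memory_op(
        may_load, may_store, has_unmodeled_side_effects)
    if load_barrier and store_barrier:
        return "full memory barrier"
    if load_barrier:
        return "load barrier"
    if store_barrier:
        return "store barrier"
    return "not a barrier"


print(describe(True, False, True))   # MayLoad with unmodeled side effects
print(describe(True, True, True))    # MayLoad and MayStore with side effects
print(describe(True, False, False))  # ordinary load
```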
932 | |
933 A load/store barrier consumes one entry of the load/store queue. A load/store | |
934 barrier enforces ordering of loads/stores. A younger load cannot pass a load | |
935 barrier. Also, a younger store cannot pass a store barrier. A younger load | |
936 has to wait for the memory/load barrier to execute. A load/store barrier is | |
937 "executed" when it becomes the oldest entry in the load/store queue(s). That | |
938 also means, by construction, all of the older loads/stores have been executed. | |
939 | |
940 In conclusion, the full set of load/store consistency rules is: | |
941 | |
942 #. A store may not pass a previous store. | |
943 #. A store may not pass a previous load (regardless of ``-noalias``). | |
944 #. A store has to wait until an older store barrier is fully executed. | |
945 #. A load may pass a previous load. | |
946 #. A load may not pass a previous store unless ``-noalias`` is set. | |
947 #. A load has to wait until an older load barrier is fully executed. |
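
The six rules above can be expressed as a pairwise check between a younger and
an older memory operation. This is a minimal sketch of the rules as stated, not
the LSUnit implementation; ``MemOp``, ``may_pass`` and ``can_execute`` are
hypothetical names, and the ``noalias`` parameter mirrors the default of the
``-noalias`` option.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    is_load: bool = False
    is_store: bool = False
    is_load_barrier: bool = False
    is_store_barrier: bool = False


def may_pass(younger, older, noalias=True):
    """Return True if `younger` may execute before the still-pending `older`."""
    if younger.is_store:
        # Rules 1-3: a store never passes an older store, load or store barrier.
        if older.is_store or older.is_load or older.is_store_barrier:
            return False
    if younger.is_load:
        # Rule 6: a load waits for an older load barrier.
        if older.is_load_barrier:
            return False
        # Rule 5: a load passes an older store only when -noalias is set.
        if older.is_store and not noalias:
            return False
    # Rule 4: a load may pass an older load.
    return True


def can_execute(younger, older_pending, noalias=True):
    """A memory operation may execute if it may pass every older pending one."""
    return all(may_pass(younger, older, noalias) for older in older_pending)


LOAD = MemOp(is_load=True)
STORE = MemOp(is_store=True)
LOAD_BARRIER = MemOp(is_load=True, is_load_barrier=True)
STORE_BARRIER = MemOp(is_store=True, is_store_barrier=True)

print(can_execute(LOAD, [STORE, LOAD]))                 # noalias: load passes both
print(can_execute(LOAD, [STORE, LOAD], noalias=False))  # blocked by the older store
```

Checking a younger load against every older pending operation also covers the
case of an intervening store or barrier between two loads.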