annotate libc/benchmarks/RATIONALE.md @ 213:25ca0248ac32

...
author Shinji KONO <kono@ie.u-ryukyu.ac.jp>
date Sun, 11 Jul 2021 17:05:31 +0900
parents 2e18cbf3894f
children c4bab56944e8
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
207
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
1 # Benchmarking `llvm-libc`'s memory functions
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
2
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
3 ## Foreword
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
4
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
5 Microbenchmarks are valuable tools to assess and compare the performance of
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
6 isolated pieces of code. However they don't capture all interactions of complex
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
7 systems; and so other metrics can be equally important:
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
8
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
9 - **code size** (to reduce instruction cache pressure),
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
10 - **Profile Guided Optimization** friendliness,
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
11 - **hyperthreading / multithreading** friendliness.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
12
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
13 ## Rationale
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
14
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
15 The goal here is to satisfy the [Benchmarking
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
16 Principles](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Benchmarking_Principles).
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
17
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
18 1. **Relevance**: Benchmarks should measure relatively vital features.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
19 2. **Representativeness**: Benchmark performance metrics should be broadly
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
20 accepted by industry and academia.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
21 3. **Equity**: All systems should be fairly compared.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
22 4. **Repeatability**: Benchmark results can be verified.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
23 5. **Cost-effectiveness**: Benchmark tests are economical.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
24 6. **Scalability**: Benchmark tests should measure from single server to
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
25 multiple servers.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
26 7. **Transparency**: Benchmark metrics should be easy to understand.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
27
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
28 Benchmarking is a [subtle
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
29 art](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Challenges) and
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
30 benchmarking memory functions is no exception. Here we'll dive into
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
31 peculiarities of designing good microbenchmarks for `llvm-libc` memory
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
32 functions.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
33
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
34 ## Challenges
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
35
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
36 As seen in the [README.md](README.md#stochastic-mode) the microbenchmarking
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
37 facility should focus on measuring **low latency code**. If copying a few bytes
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
38 takes in the order of a few cycles, the benchmark should be able to **measure
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
39 accurately down to the cycle**.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
40
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
41 ### Measuring instruments
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
42
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
43 There are different sources of time in a computer (ordered from high to low resolution)
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
44 - [Performance
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
45 Counters](https://en.wikipedia.org/wiki/Hardware_performance_counter): used to
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
46 introspect the internals of the CPU,
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
47 - [High Precision Event
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
48 Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer): used to
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
49 trigger short lived actions,
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
50 - [Real-Time Clocks (RTC)](https://en.wikipedia.org/wiki/Real-time_clock): used
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
51 to keep track of the computer's time.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
52
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
53 In theory **Performance Counters** provide cycle accurate measurement via the
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
54 `cpu cycles` event. But as we'll see, they are not really practical in this
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
55 context.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
56
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
57 ### Performance counters and modern processor architecture
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
58
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
59 Modern CPUs are [out of
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
60 order](https://en.wikipedia.org/wiki/Out-of-order_execution) and
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
61 [superscalar](https://en.wikipedia.org/wiki/Superscalar_processor) as a
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
62 consequence it is [hard to know what is included when the counter is
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
63 read](https://en.wikipedia.org/wiki/Hardware_performance_counter#Instruction_based_sampling),
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
64 some instructions may still be **in flight**, some others may be executing
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
65 [**speculatively**](https://en.wikipedia.org/wiki/Speculative_execution). As a
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
66 matter of fact **on the same machine, measuring twice the same piece of code will yield
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
67 different results.**
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
68
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
69 ### Performance counters semantics inconsistencies and availability
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
70
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
71 Although they have the same name, the exact semantics of performance counters
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
72 are micro-architecture dependent: **it is generally not possible to compare two
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
73 micro-architectures exposing the same performance counters.**
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
74
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
75 Each vendor decides which performance counters to implement and their exact
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
76 meaning. Although we want to benchmark `llvm-libc` memory functions for all
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
77 available [target
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
78 triples](https://clang.llvm.org/docs/CrossCompilation.html#target-triple), there
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
79 are **no guarantees that the counter we're interested in is available.**
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
80
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
81 ### Additional imprecisions
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
82
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
83 - Reading performance counters is done through Kernel [System
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
84 calls](https://en.wikipedia.org/wiki/System_call). The System call itself
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
85 is costly (hundreds of cycles) and will perturbate the counter's value.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
86 - [Interruptions](https://en.wikipedia.org/wiki/Interrupt#Processor_response)
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
87 can occur during measurement.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
88 - If the system is already under monitoring (virtual machines or system wide
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
89 profiling) the kernel can decide to multiplex the performance counters
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
90 leading to lower precision or even completely missing the measurement.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
91 - The Kernel can decide to [migrate the
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
92 process](https://en.wikipedia.org/wiki/Process_migration) to a different
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
93 core.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
94 - [Dynamic frequency
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
95 scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling) can kick
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
96 in during the measurement and change the ticking duration. **Ultimately we
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
97 care about the amount of work over a period of time**. This removes some
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
98 legitimacy of measuring cycles rather than **raw time**.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
99
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
100 ### Cycle accuracy conclusion
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
101
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
102 We have seen that performance counters are: not widely available, semantically
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
103 inconsistent across micro-architectures and imprecise on modern CPUs for small
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
104 snippets of code.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
105
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
106 ## Design decisions
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
107
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
108 In order to achieve the needed precision we would need to resort on more widely
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
109 available counters and derive the time from a high number of runs: going from a
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
110 single deterministic measure to a probabilistic one.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
111
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
112 **To get a good signal to noise ratio we need the running time of the piece of
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
113 code to be orders of magnitude greater than the measurement precision.**
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
114
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
115 For instance, if measurement precision is of 10 cycles, we need the function
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
116 runtime to take more than 1000 cycles to achieve 1%
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
117 [SNR](https://en.wikipedia.org/wiki/Signal-to-noise_ratio).
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
118
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
119 ### Repeating code N-times until precision is sufficient
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
120
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
121 The algorithm is as follows:
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
122
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
123 - We measure the time it takes to run the code _N_ times (Initially _N_ is 10
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
124 for instance)
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
125 - We deduce an approximation of the runtime of one iteration (= _runtime_ /
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
126 _N_).
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
127 - We increase _N_ by _X%_ and repeat the measurement (geometric progression).
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
128 - We keep track of the _one iteration runtime approximation_ and build a
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
129 weighted mean of all the samples so far (weight is proportional to _N_)
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
130 - We stop the process when the difference between the weighted mean and the
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
131 last estimation is smaller than _ε_ or when other stopping conditions are
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
132 met (total runtime, maximum iterations or maximum sample count).
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
133
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
134 This method allows us to be as precise as needed provided that the measured
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
135 runtime is proportional to _N_. Longer run times also smooth out imprecision
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
136 related to _interrupts_ and _context switches_.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
137
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
138 Note: When measuring longer runtimes (e.g. copying several megabytes of data)
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
139 the above assumption doesn't hold anymore and the _ε_ precision cannot be
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
140 reached by increasing iterations. The whole benchmarking process becomes
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
141 prohibitively slow. In this case the algorithm is limited to a single sample and
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
142 repeated several times to get a decent 95% confidence interval.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
143
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
144 ### Effect of branch prediction
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
145
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
146 When measuring code with branches, repeating the same call again and again will
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
147 allow the processor to learn the branching patterns and perfectly predict all
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
148 the branches, leading to unrealistic results.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
149
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
150 **Decision: When benchmarking small buffer sizes, the function parameters should
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
151 be randomized between calls to prevent perfect branch predictions.**
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
152
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
153 ### Effect of the memory subsystem
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
154
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
155 The CPU is tightly coupled to the memory subsystem. It is common to see `L1`,
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
156 `L2` and `L3` data caches.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
157
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
158 We may be tempted to randomize data accesses widely to exercise all the caching
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
159 layers down to RAM but the [cost of accessing lower layers of
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
160 memory](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html)
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
161 completely dominates the runtime for small sizes.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
162
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
163 So to respect **Equity** and **Repeatability** principles we should make sure we
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
164 **do not** depend on the memory subsystem.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
165
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
166 **Decision: When benchmarking small buffer sizes, the data accessed by the
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
167 function should stay in `L1`.**
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
168
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
169 ### Effect of prefetching
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
170
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
171 In case of small buffer sizes,
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
172 [prefetching](https://en.wikipedia.org/wiki/Cache_prefetching) should not kick
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
173 in but in case of large buffers it may introduce a bias.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
174
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
175 **Decision: When benchmarking large buffer sizes, the data should be accessed in
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
176 a random fashion to lower the impact of prefetching between calls.**
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
177
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
178 ### Effect of dynamic frequency scaling
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
179
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
180 Modern processors implement [dynamic frequency
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
181 scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling). In so-called
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
182 `performance` mode the CPU will increase its frequency and run faster than usual
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
183 within [some limits](https://en.wikipedia.org/wiki/Intel_Turbo_Boost) : _"The
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
184 increased clock rate is limited by the processor's power, current, and thermal
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
185 limits, the number of cores currently in use, and the maximum frequency of the
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
186 active cores."_
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
187
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
188 **Decision: When benchmarking we want to make sure the dynamic frequency scaling
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
189 is always set to `performance`. We also want to make sure that the time based
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
190 events are not impacted by frequency scaling.**
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
191
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
192 See [REAME.md](REAME.md) on how to set this up.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
193
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
194 ### Reserved and pinned cores
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
195
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
196 Some operating systems allow [core
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
197 reservation](https://stackoverflow.com/questions/13583146/whole-one-core-dedicated-to-single-process).
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
198 It removes a set of perturbation sources like: process migration, context
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
199 switches and interrupts. When a core is hyperthreaded, both cores should be
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
200 reserved.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
201
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
202 ## Microbenchmarks limitations
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
203
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
204 As stated in the Foreword section a number of effects do play a role in
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
205 production but are not directly measurable through microbenchmarks. The code
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
206 size of the benchmark is (much) smaller than the hot code of real applications
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
207 and **doesn't exhibit instruction cache pressure as much**.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
208
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
209 ### iCache pressure
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
210
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
211 Fundamental functions that are called frequently will occupy the L1 iCache
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
212 ([illustration](https://en.wikipedia.org/wiki/CPU_cache#Example:_the_K8)). If
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
213 they are too big they will prevent other hot code to stay in the cache and incur
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
214 [stalls](https://en.wikipedia.org/wiki/CPU_cache#CPU_stalls). So the memory
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
215 functions should be as small as possible.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
216
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
217 ### iTLB pressure
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
218
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
219 The same reasoning goes for instruction Translation Lookaside Buffer
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
220 ([iTLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)) incurring
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
221 [TLB
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
222 misses](https://en.wikipedia.org/wiki/Translation_lookaside_buffer#TLB-miss_handling).
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
223
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
224 ## FAQ
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
225
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
226 1. Why don't you use Google Benchmark directly?
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
227
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
228 We reuse some parts of Google Benchmark (detection of frequency scaling, CPU
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
229 cache hierarchy informations) but when it comes to measuring memory
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
230 functions Google Benchmark have a few issues:
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
231
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
232 - Google Benchmark privileges code based configuration via macros and
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
233 builders. It is typically done in a static manner. In our case the
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
234 parameters we need to setup are a mix of what's usually controlled by
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
235 the framework (number of trials, maximum number of iterations, size
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
236 ranges) and parameters that are more tied to the function under test
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
237 (randomization strategies, custom values). Achieving this with Google
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
238 Benchmark is cumbersome as it involves templated benchmarks and
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
239 duplicated code. In the end, the configuration would be spread across
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
240 command line flags (via framework's option or custom flags), and code
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
241 constants.
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
242 - Output of the measurements is done through a `BenchmarkReporter` class,
Shinji KONO <kono@ie.u-ryukyu.ac.jp>
parents:
diff changeset
243 that makes it hard to access the parameters discussed above.