# Benchmarking `llvm-libc`'s memory functions

## Foreword

Microbenchmarks are valuable tools to assess and compare the performance of
isolated pieces of code. However, they don't capture all the interactions of
complex systems, so other metrics can be equally important:

 - **code size** (to reduce instruction cache pressure),
 - **Profile Guided Optimization** friendliness,
 - **hyperthreading / multithreading** friendliness.

## Rationale

The goal here is to satisfy the [Benchmarking
Principles](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Benchmarking_Principles).

 1. **Relevance**: Benchmarks should measure relatively vital features.
 2. **Representativeness**: Benchmark performance metrics should be broadly
    accepted by industry and academia.
 3. **Equity**: All systems should be fairly compared.
 4. **Repeatability**: Benchmark results can be verified.
 5. **Cost-effectiveness**: Benchmark tests are economical.
 6. **Scalability**: Benchmark tests should measure from a single server to
    multiple servers.
 7. **Transparency**: Benchmark metrics should be easy to understand.

Benchmarking is a [subtle
art](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Challenges) and
benchmarking memory functions is no exception. Here we'll dive into the
peculiarities of designing good microbenchmarks for `llvm-libc` memory
functions.

## Challenges

As seen in the [README.md](README.md#stochastic-mode), the microbenchmarking
facility should focus on measuring **low latency code**. If copying a few bytes
takes on the order of a few cycles, the benchmark should be able to **measure
accurately down to the cycle**.

### Measuring instruments

There are different sources of time in a computer (ordered from high to low
resolution):

 - [Performance
   Counters](https://en.wikipedia.org/wiki/Hardware_performance_counter): used
   to introspect the internals of the CPU,
 - [High Precision Event
   Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer): used to
   trigger short lived actions,
 - [Real-Time Clocks (RTC)](https://en.wikipedia.org/wiki/Real-time_clock):
   used to keep track of the computer's time.

In theory **Performance Counters** provide cycle-accurate measurement via the
`cpu cycles` event. But as we'll see, they are not really practical in this
context.

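For illustration, the sketch below shows roughly what reading the `cpu cycles`
counter looks like on Linux through the `perf_event_open` system call. It is
not part of `llvm-libc`; the `CountCycles` helper and its error handling are
simplified assumptions.

```c++
#include <cstdint>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Thin wrapper over the raw system call (glibc provides no dedicated wrapper).
static long PerfEventOpen(perf_event_attr *Attr) {
  return syscall(SYS_perf_event_open, Attr, /*pid=*/0, /*cpu=*/-1,
                 /*group_fd=*/-1, /*flags=*/0UL);
}

// Returns the number of cycles spent in `Snippet`, or -1 if the counter is
// unavailable. Enabling, disabling and reading the counter each go through a
// system call that perturbs the measurement.
int64_t CountCycles(void (*Snippet)()) {
  perf_event_attr Attr;
  std::memset(&Attr, 0, sizeof(Attr));
  Attr.size = sizeof(Attr);
  Attr.type = PERF_TYPE_HARDWARE;
  Attr.config = PERF_COUNT_HW_CPU_CYCLES;
  Attr.disabled = 1;
  Attr.exclude_kernel = 1;
  const int Fd = PerfEventOpen(&Attr);
  if (Fd < 0)
    return -1; // The event may simply not be exposed on this machine.
  ioctl(Fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(Fd, PERF_EVENT_IOC_ENABLE, 0);
  Snippet();
  ioctl(Fd, PERF_EVENT_IOC_DISABLE, 0);
  int64_t Cycles = -1;
  read(Fd, &Cycles, sizeof(Cycles)); // Kernel round trip: hundreds of cycles.
  close(Fd);
  return Cycles;
}
```
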
### Performance counters and modern processor architecture

Modern CPUs are [out of
order](https://en.wikipedia.org/wiki/Out-of-order_execution) and
[superscalar](https://en.wikipedia.org/wiki/Superscalar_processor); as a
consequence, it is [hard to know what is included when the counter is
read](https://en.wikipedia.org/wiki/Hardware_performance_counter#Instruction_based_sampling):
some instructions may still be **in flight**, some others may be executing
[**speculatively**](https://en.wikipedia.org/wiki/Speculative_execution). As a
matter of fact, **on the same machine, measuring the same piece of code twice
will yield different results.**

### Performance counter semantics inconsistencies and availability

Although they have the same name, the exact semantics of performance counters
are micro-architecture dependent: **it is generally not possible to compare two
micro-architectures exposing the same performance counters.**

Each vendor decides which performance counters to implement and their exact
meaning. Although we want to benchmark `llvm-libc` memory functions for all
available [target
triples](https://clang.llvm.org/docs/CrossCompilation.html#target-triple), there
are **no guarantees that the counter we're interested in is available.**

### Additional imprecisions

 - Reading performance counters is done through Kernel [System
   calls](https://en.wikipedia.org/wiki/System_call). The system call itself
   is costly (hundreds of cycles) and will perturb the counter's value.
 - [Interrupts](https://en.wikipedia.org/wiki/Interrupt#Processor_response)
   can occur during measurement.
 - If the system is already under monitoring (virtual machines or system wide
   profiling) the kernel can decide to multiplex the performance counters,
   leading to lower precision or even completely missing the measurement.
 - The kernel can decide to [migrate the
   process](https://en.wikipedia.org/wiki/Process_migration) to a different
   core.
 - [Dynamic frequency
   scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling) can kick
   in during the measurement and change the ticking duration. **Ultimately we
   care about the amount of work over a period of time**. This removes some of
   the legitimacy of measuring cycles rather than **raw time**.

### Cycle accuracy conclusion

We have seen that performance counters are not widely available, semantically
inconsistent across micro-architectures, and imprecise on modern CPUs for small
snippets of code.

## Design decisions

In order to achieve the needed precision we need to resort to more widely
available counters and derive the time from a high number of runs: going from a
single deterministic measurement to a probabilistic one.

**To get a good signal-to-noise ratio we need the running time of the piece of
code to be orders of magnitude greater than the measurement precision.**

For instance, if the measurement precision is 10 cycles, we need the function
runtime to be more than 1000 cycles to achieve a 1%
[SNR](https://en.wikipedia.org/wiki/Signal-to-noise_ratio).

### Repeating code N-times until precision is sufficient

The algorithm is as follows:

 - We measure the time it takes to run the code _N_ times (initially _N_ is 10,
   for instance).
 - We deduce an approximation of the runtime of one iteration (= _runtime_ /
   _N_).
 - We increase _N_ by _X%_ and repeat the measurement (geometric progression).
 - We keep track of the _one iteration runtime approximation_ and build a
   weighted mean of all the samples so far (weight is proportional to _N_).
 - We stop the process when the difference between the weighted mean and the
   last estimation is smaller than _ε_ or when other stopping conditions are
   met (total runtime, maximum iterations or maximum sample count).

This method allows us to be as precise as needed provided that the measured
runtime is proportional to _N_. Longer run times also smooth out imprecision
related to _interrupts_ and _context switches_.

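The loop below sketches this scheme. It is a simplified illustration rather
than the actual `llvm-libc` benchmarking code; `RunFunctionUnderTest`, the 40%
growth factor and the stopping conditions are assumptions chosen for the
example.

```c++
#include <chrono>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for one call to the function under test.
void RunFunctionUnderTest();

// Returns an estimate of the runtime of a single iteration, in nanoseconds.
double EstimateSingleIterationNs(double EpsilonNs, size_t MaxSamples) {
  using Clock = std::chrono::steady_clock;
  size_t N = 10;                       // Initial number of repetitions.
  constexpr double GrowthFactor = 1.4; // Increase N by X% = 40% per sample.
  double WeightedSum = 0.0;
  double TotalWeight = 0.0;
  for (size_t Sample = 0; Sample < MaxSamples; ++Sample) {
    const auto Start = Clock::now();
    for (size_t I = 0; I < N; ++I)
      RunFunctionUnderTest();
    const auto Stop = Clock::now();
    const double TotalNs =
        std::chrono::duration<double, std::nano>(Stop - Start).count();
    const double PerIterationNs = TotalNs / N; // Approximation for one call.
    WeightedSum += PerIterationNs * N;         // Weight is proportional to N.
    TotalWeight += N;
    const double WeightedMean = WeightedSum / TotalWeight;
    if (std::abs(WeightedMean - PerIterationNs) < EpsilonNs)
      return WeightedMean; // Converged within epsilon.
    N = static_cast<size_t>(N * GrowthFactor); // Geometric progression.
  }
  return WeightedSum / TotalWeight; // Stopping condition: max sample count.
}
```
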
Note: When measuring longer runtimes (e.g. copying several megabytes of data)
the proportionality assumption above doesn't hold anymore and the _ε_ precision
cannot be reached by increasing iterations: the whole benchmarking process
would become prohibitively slow. In this case the algorithm is limited to a
single sample, repeated several times to get a decent 95% confidence interval.

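As an illustration of that last step, here is a minimal sketch of the interval
computation (not the `llvm-libc` implementation; it assumes the per-sample
runtimes are roughly normally distributed and that there are at least two
samples).

```c++
#include <cmath>
#include <numeric>
#include <vector>

struct ConfidenceInterval {
  double Lower;
  double Upper;
};

// Computes an approximate 95% confidence interval of the mean runtime from
// repeated single-sample measurements (z ~= 1.96 for a normal distribution).
ConfidenceInterval Compute95PercentCi(const std::vector<double> &SamplesNs) {
  const double N = static_cast<double>(SamplesNs.size());
  const double Mean =
      std::accumulate(SamplesNs.begin(), SamplesNs.end(), 0.0) / N;
  double Variance = 0.0;
  for (const double Sample : SamplesNs)
    Variance += (Sample - Mean) * (Sample - Mean);
  Variance /= N - 1; // Unbiased sample variance.
  const double StandardError = std::sqrt(Variance / N);
  return {Mean - 1.96 * StandardError, Mean + 1.96 * StandardError};
}
```
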
### Effect of branch prediction

When measuring code with branches, repeating the same call again and again will
allow the processor to learn the branching patterns and perfectly predict all
the branches, leading to unrealistic results.

**Decision: When benchmarking small buffer sizes, the function parameters should
be randomized between calls to prevent perfect branch prediction.**

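A minimal sketch of this randomization is shown below. `MemcpyUnderTest`, the
fixed seed and drawing the parameters up front are assumptions for
illustration, not the actual harness; the point is that the measured loop
replays pre-drawn sizes so the branch predictor cannot lock onto a single
pattern.

```c++
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for the memory function being benchmarked.
void *MemcpyUnderTest(void *Dst, const void *Src, size_t Size);

void BenchmarkWithRandomizedSizes(char *Dst, const char *Src, size_t MaxSize,
                                  size_t NumCalls) {
  // Draw the sizes up front so the measured loop only replays them.
  std::mt19937 Rng(12345); // Fixed seed to keep runs repeatable.
  std::uniform_int_distribution<size_t> SizeDistribution(0, MaxSize);
  std::vector<size_t> Sizes(NumCalls);
  for (size_t &Size : Sizes)
    Size = SizeDistribution(Rng);
  // Measured loop: the size (and thus the taken branches) varies per call.
  for (const size_t Size : Sizes)
    MemcpyUnderTest(Dst, Src, Size);
}
```
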
### Effect of the memory subsystem

The CPU is tightly coupled to the memory subsystem. It is common to see `L1`,
`L2` and `L3` data caches.

We may be tempted to randomize data accesses widely to exercise all the caching
layers down to RAM, but the [cost of accessing lower layers of
memory](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html)
completely dominates the runtime for small sizes.

So to respect the **Equity** and **Repeatability** principles we should make
sure we **do not** depend on the memory subsystem.

**Decision: When benchmarking small buffer sizes, the data accessed by the
function should stay in `L1`.**

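A minimal sketch of how the small-size working set could be bounded is shown
below (Linux/glibc-specific; the halving heuristic and the fallback value are
assumptions for illustration).

```c++
#include <cstddef>
#include <unistd.h>

// Returns an upper bound for the benchmark's buffer size such that both the
// source and destination buffers comfortably fit in the L1 data cache.
size_t MaxSmallBufferSize() {
  const long L1DataCacheBytes = sysconf(_SC_LEVEL1_DCACHE_SIZE);
  if (L1DataCacheBytes <= 0)
    return 16 * 1024; // Conservative fallback when sysconf can't report it.
  // Halve the reported size to leave room for two buffers and other state.
  return static_cast<size_t>(L1DataCacheBytes) / 2;
}
```
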
### Effect of prefetching

For small buffer sizes,
[prefetching](https://en.wikipedia.org/wiki/Cache_prefetching) should not kick
in, but for large buffers it may introduce a bias.

**Decision: When benchmarking large buffer sizes, the data should be accessed in
a random fashion to lower the impact of prefetching between calls.**

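The sketch below shows one way to build such a random access pattern (an
illustration only; the block size and the fixed seed are assumptions): the
benchmark walks the buffer through shuffled block offsets so consecutive calls
do not touch adjacent memory.

```c++
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns the block offsets of a large buffer in a shuffled order, so that
// successive accesses are not contiguous and the hardware prefetcher cannot
// hide part of the memory latency.
std::vector<size_t> ShuffledBlockOffsets(size_t BufferSize, size_t BlockSize) {
  std::vector<size_t> Offsets(BufferSize / BlockSize);
  std::iota(Offsets.begin(), Offsets.end(), size_t(0)); // 0, 1, 2, ...
  for (size_t &Offset : Offsets)
    Offset *= BlockSize; // Convert block indices to byte offsets.
  std::mt19937 Rng(12345); // Fixed seed to keep runs repeatable.
  std::shuffle(Offsets.begin(), Offsets.end(), Rng);
  return Offsets;
}
```
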
### Effect of dynamic frequency scaling

Modern processors implement [dynamic frequency
scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling). In so-called
`performance` mode the CPU will increase its frequency and run faster than usual
within [some limits](https://en.wikipedia.org/wiki/Intel_Turbo_Boost): _"The
increased clock rate is limited by the processor's power, current, and thermal
limits, the number of cores currently in use, and the maximum frequency of the
active cores."_

**Decision: When benchmarking we want to make sure the dynamic frequency scaling
is always set to `performance`. We also want to make sure that the time-based
events are not impacted by frequency scaling.**

See [README.md](README.md) for how to set this up.

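As a sanity check, a benchmark runner could verify the governor before
measuring. The sketch below is a Linux-specific illustration (the sysfs path
covers `cpu0` only; checking every CPU is left out for brevity).

```c++
#include <fstream>
#include <string>

// Returns true if cpu0's frequency governor is set to `performance`.
// A real check would inspect every CPU that may run the benchmark.
bool IsPerformanceGovernor() {
  std::ifstream GovernorFile(
      "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
  std::string Governor;
  GovernorFile >> Governor;
  return Governor == "performance";
}
```
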
### Reserved and pinned cores

Some operating systems allow [core
reservation](https://stackoverflow.com/questions/13583146/whole-one-core-dedicated-to-single-process).
This removes a set of perturbation sources: process migration, context switches
and interrupts. When a core is hyperthreaded, both logical cores should be
reserved.

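On Linux, pinning the benchmark process to the reserved core can be done with
`sched_setaffinity`. The sketch below is an illustration; the choice of core id
and how it was isolated (e.g. via the `isolcpus` kernel parameter) are
assumptions outside the scope of this document.

```c++
#include <sched.h>

// cpu_set_t and sched_setaffinity are GNU extensions declared in <sched.h>.
// Pins the calling process to a single core so the scheduler does not migrate
// it during measurements. Returns true on success.
bool PinToCore(int CoreId) {
  cpu_set_t CpuSet;
  CPU_ZERO(&CpuSet);
  CPU_SET(CoreId, &CpuSet);
  // A pid of 0 designates the calling process.
  return sched_setaffinity(/*pid=*/0, sizeof(CpuSet), &CpuSet) == 0;
}
```
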
## Microbenchmark limitations

As stated in the Foreword section, a number of effects do play a role in
production but are not directly measurable through microbenchmarks. The code
size of the benchmark is (much) smaller than the hot code of real applications
and **doesn't exhibit instruction cache pressure as much**.

### iCache pressure

Fundamental functions that are called frequently will occupy the L1 iCache
([illustration](https://en.wikipedia.org/wiki/CPU_cache#Example:_the_K8)). If
they are too big they will prevent other hot code from staying in the cache and
incur [stalls](https://en.wikipedia.org/wiki/CPU_cache#CPU_stalls). So the
memory functions should be as small as possible.

### iTLB pressure

The same reasoning goes for the instruction Translation Lookaside Buffer
([iTLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)): functions
that are too big incur [TLB
misses](https://en.wikipedia.org/wiki/Translation_lookaside_buffer#TLB-miss_handling).

## FAQ

 1. Why don't you use Google Benchmark directly?

    We reuse some parts of Google Benchmark (detection of frequency scaling,
    CPU cache hierarchy information) but when it comes to measuring memory
    functions Google Benchmark has a few issues:

    - Google Benchmark favors code-based configuration via macros and
      builders, which is typically done statically. In our case the
      parameters we need to set up are a mix of what's usually controlled by
      the framework (number of trials, maximum number of iterations, size
      ranges) and parameters that are more tied to the function under test
      (randomization strategies, custom values). Achieving this with Google
      Benchmark is cumbersome as it involves templated benchmarks and
      duplicated code. In the end, the configuration would be spread across
      command line flags (via the framework's options or custom flags) and
      code constants.
    - Output of the measurements is done through a `BenchmarkReporter` class,
      which makes it hard to access the parameters discussed above.