# Benchmarking `llvm-libc`'s memory functions

## Foreword

Microbenchmarks are valuable tools to assess and compare the performance of
isolated pieces of code. However, they don't capture all the interactions of
complex systems, so other metrics can be equally important:

 - **code size** (to reduce instruction cache pressure),
 - **Profile Guided Optimization** friendliness,
 - **hyperthreading / multithreading** friendliness.

## Rationale

The goal here is to satisfy the [Benchmarking
Principles](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Benchmarking_Principles).

 1. **Relevance**: Benchmarks should measure relatively vital features.
 2. **Representativeness**: Benchmark performance metrics should be broadly
    accepted by industry and academia.
 3. **Equity**: All systems should be fairly compared.
 4. **Repeatability**: Benchmark results can be verified.
 5. **Cost-effectiveness**: Benchmark tests are economical.
 6. **Scalability**: Benchmark tests should measure from a single server to
    multiple servers.
 7. **Transparency**: Benchmark metrics should be easy to understand.

Benchmarking is a [subtle
art](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Challenges) and
benchmarking memory functions is no exception. Here we'll dive into the
peculiarities of designing good microbenchmarks for `llvm-libc` memory
functions.

## Challenges

As seen in the [README.md](README.md#stochastic-mode), the microbenchmarking
facility should focus on measuring **low latency code**. If copying a few bytes
takes on the order of a few cycles, the benchmark should be able to **measure
accurately down to the cycle**.

### Measuring instruments

There are different sources of time in a computer (ordered from high to low
resolution):

 - [Performance
   Counters](https://en.wikipedia.org/wiki/Hardware_performance_counter): used
   to introspect the internals of the CPU,
 - [High Precision Event
   Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer): used to
   trigger short lived actions,
 - [Real-Time Clocks (RTC)](https://en.wikipedia.org/wiki/Real-time_clock):
   used to keep track of the computer's time.

In theory **Performance Counters** provide cycle-accurate measurement via the
`cpu cycles` event. But as we'll see, they are not really practical in this
context.

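For illustration, the sketch below shows roughly what reading the `cpu cycles`
counter looks like on Linux through the `perf_event_open` system call. It is
not part of `llvm-libc`; the `CountCycles` helper and its error handling are
simplified assumptions.

```c++
#include <cstdint>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Thin wrapper over the raw system call (glibc provides no dedicated wrapper).
static long PerfEventOpen(perf_event_attr *Attr) {
  return syscall(SYS_perf_event_open, Attr, /*pid=*/0, /*cpu=*/-1,
                 /*group_fd=*/-1, /*flags=*/0UL);
}

// Returns the number of cycles spent in `Snippet`, or -1 if the counter is
// unavailable. Enabling, disabling and reading the counter each go through a
// system call that perturbs the measurement.
int64_t CountCycles(void (*Snippet)()) {
  perf_event_attr Attr;
  std::memset(&Attr, 0, sizeof(Attr));
  Attr.size = sizeof(Attr);
  Attr.type = PERF_TYPE_HARDWARE;
  Attr.config = PERF_COUNT_HW_CPU_CYCLES;
  Attr.disabled = 1;
  Attr.exclude_kernel = 1;
  const int Fd = PerfEventOpen(&Attr);
  if (Fd < 0)
    return -1; // The event may simply not be exposed on this machine.
  ioctl(Fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(Fd, PERF_EVENT_IOC_ENABLE, 0);
  Snippet();
  ioctl(Fd, PERF_EVENT_IOC_DISABLE, 0);
  int64_t Cycles = -1;
  read(Fd, &Cycles, sizeof(Cycles)); // Kernel round trip: hundreds of cycles.
  close(Fd);
  return Cycles;
}
```
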
### Performance counters and modern processor architecture

Modern CPUs are [out of
order](https://en.wikipedia.org/wiki/Out-of-order_execution) and
[superscalar](https://en.wikipedia.org/wiki/Superscalar_processor); as a
consequence, it is [hard to know what is included when the counter is
read](https://en.wikipedia.org/wiki/Hardware_performance_counter#Instruction_based_sampling):
some instructions may still be **in flight**, some others may be executing
[**speculatively**](https://en.wikipedia.org/wiki/Speculative_execution). As a
matter of fact, **on the same machine, measuring the same piece of code twice
will yield different results.**

### Performance counter semantics inconsistencies and availability

Although they have the same name, the exact semantics of performance counters
are micro-architecture dependent: **it is generally not possible to compare two
micro-architectures exposing the same performance counters.**

Each vendor decides which performance counters to implement and their exact
meaning. Although we want to benchmark `llvm-libc` memory functions for all
available [target
triples](https://clang.llvm.org/docs/CrossCompilation.html#target-triple), there
are **no guarantees that the counter we're interested in is available.**

### Additional imprecisions

 - Reading performance counters is done through Kernel [System
   calls](https://en.wikipedia.org/wiki/System_call). The system call itself
   is costly (hundreds of cycles) and will perturb the counter's value.
 - [Interrupts](https://en.wikipedia.org/wiki/Interrupt#Processor_response)
   can occur during measurement.
 - If the system is already under monitoring (virtual machines or system wide
   profiling) the kernel can decide to multiplex the performance counters,
   leading to lower precision or even completely missing the measurement.
 - The kernel can decide to [migrate the
   process](https://en.wikipedia.org/wiki/Process_migration) to a different
   core.
 - [Dynamic frequency
   scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling) can kick
   in during the measurement and change the ticking duration. **Ultimately we
   care about the amount of work over a period of time**. This removes some of
   the legitimacy of measuring cycles rather than **raw time**.

### Cycle accuracy conclusion

We have seen that performance counters are not widely available, semantically
inconsistent across micro-architectures, and imprecise on modern CPUs for small
snippets of code.

## Design decisions

In order to achieve the needed precision we need to resort to more widely
available counters and derive the time from a high number of runs: going from a
single deterministic measurement to a probabilistic one.

**To get a good signal-to-noise ratio we need the running time of the piece of
code to be orders of magnitude greater than the measurement precision.**

For instance, if the measurement precision is 10 cycles, we need the function
runtime to be more than 1000 cycles to achieve a 1%
[SNR](https://en.wikipedia.org/wiki/Signal-to-noise_ratio).

### Repeating code N-times until precision is sufficient

The algorithm is as follows:

 - We measure the time it takes to run the code _N_ times (initially _N_ is 10,
   for instance).
 - We deduce an approximation of the runtime of one iteration (= _runtime_ /
   _N_).
 - We increase _N_ by _X%_ and repeat the measurement (geometric progression).
 - We keep track of the _one iteration runtime approximation_ and build a
   weighted mean of all the samples so far (weight is proportional to _N_).
 - We stop the process when the difference between the weighted mean and the
   last estimation is smaller than _ε_ or when other stopping conditions are
   met (total runtime, maximum iterations or maximum sample count).

This method allows us to be as precise as needed provided that the measured
runtime is proportional to _N_. Longer run times also smooth out imprecision
related to _interrupts_ and _context switches_.

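The loop below sketches this scheme. It is a simplified illustration rather
than the actual `llvm-libc` benchmarking code; `RunFunctionUnderTest`, the 40%
growth factor and the stopping conditions are assumptions chosen for the
example.

```c++
#include <chrono>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for one call to the function under test.
void RunFunctionUnderTest();

// Returns an estimate of the runtime of a single iteration, in nanoseconds.
double EstimateSingleIterationNs(double EpsilonNs, size_t MaxSamples) {
  using Clock = std::chrono::steady_clock;
  size_t N = 10;                       // Initial number of repetitions.
  constexpr double GrowthFactor = 1.4; // Increase N by X% = 40% per sample.
  double WeightedSum = 0.0;
  double TotalWeight = 0.0;
  for (size_t Sample = 0; Sample < MaxSamples; ++Sample) {
    const auto Start = Clock::now();
    for (size_t I = 0; I < N; ++I)
      RunFunctionUnderTest();
    const auto Stop = Clock::now();
    const double TotalNs =
        std::chrono::duration<double, std::nano>(Stop - Start).count();
    const double PerIterationNs = TotalNs / N; // Approximation for one call.
    WeightedSum += PerIterationNs * N;         // Weight is proportional to N.
    TotalWeight += N;
    const double WeightedMean = WeightedSum / TotalWeight;
    if (std::abs(WeightedMean - PerIterationNs) < EpsilonNs)
      return WeightedMean; // Converged within epsilon.
    N = static_cast<size_t>(N * GrowthFactor); // Geometric progression.
  }
  return WeightedSum / TotalWeight; // Stopping condition: max sample count.
}
```
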
Note: When measuring longer runtimes (e.g. copying several megabytes of data)
the proportionality assumption above doesn't hold anymore and the _ε_ precision
cannot be reached by increasing iterations: the whole benchmarking process
would become prohibitively slow. In this case the algorithm is limited to a
single sample, repeated several times to get a decent 95% confidence interval.

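As an illustration of that last step, here is a minimal sketch of the interval
computation (not the `llvm-libc` implementation; it assumes the per-sample
runtimes are roughly normally distributed and that there are at least two
samples).

```c++
#include <cmath>
#include <numeric>
#include <vector>

struct ConfidenceInterval {
  double Lower;
  double Upper;
};

// Computes an approximate 95% confidence interval of the mean runtime from
// repeated single-sample measurements (z ~= 1.96 for a normal distribution).
ConfidenceInterval Compute95PercentCi(const std::vector<double> &SamplesNs) {
  const double N = static_cast<double>(SamplesNs.size());
  const double Mean =
      std::accumulate(SamplesNs.begin(), SamplesNs.end(), 0.0) / N;
  double Variance = 0.0;
  for (const double Sample : SamplesNs)
    Variance += (Sample - Mean) * (Sample - Mean);
  Variance /= N - 1; // Unbiased sample variance.
  const double StandardError = std::sqrt(Variance / N);
  return {Mean - 1.96 * StandardError, Mean + 1.96 * StandardError};
}
```
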
### Effect of branch prediction

When measuring code with branches, repeating the same call again and again will
allow the processor to learn the branching patterns and perfectly predict all
the branches, leading to unrealistic results.

**Decision: When benchmarking small buffer sizes, the function parameters should
be randomized between calls to prevent perfect branch prediction.**

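A minimal sketch of this randomization is shown below. `MemcpyUnderTest`, the
fixed seed and drawing the parameters up front are assumptions for
illustration, not the actual harness; the point is that the measured loop
replays pre-drawn sizes so the branch predictor cannot lock onto a single
pattern.

```c++
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for the memory function being benchmarked.
void *MemcpyUnderTest(void *Dst, const void *Src, size_t Size);

void BenchmarkWithRandomizedSizes(char *Dst, const char *Src, size_t MaxSize,
                                  size_t NumCalls) {
  // Draw the sizes up front so the measured loop only replays them.
  std::mt19937 Rng(12345); // Fixed seed to keep runs repeatable.
  std::uniform_int_distribution<size_t> SizeDistribution(0, MaxSize);
  std::vector<size_t> Sizes(NumCalls);
  for (size_t &Size : Sizes)
    Size = SizeDistribution(Rng);
  // Measured loop: the size (and thus the taken branches) varies per call.
  for (const size_t Size : Sizes)
    MemcpyUnderTest(Dst, Src, Size);
}
```
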
### Effect of the memory subsystem

The CPU is tightly coupled to the memory subsystem. It is common to see `L1`,
`L2` and `L3` data caches.

We may be tempted to randomize data accesses widely to exercise all the caching
layers down to RAM, but the [cost of accessing lower layers of
memory](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html)
completely dominates the runtime for small sizes.

So to respect the **Equity** and **Repeatability** principles we should make
sure we **do not** depend on the memory subsystem.

**Decision: When benchmarking small buffer sizes, the data accessed by the
function should stay in `L1`.**

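A minimal sketch of how the small-size working set could be bounded is shown
below (Linux/glibc-specific; the halving heuristic and the fallback value are
assumptions for illustration).

```c++
#include <cstddef>
#include <unistd.h>

// Returns an upper bound for the benchmark's buffer size such that both the
// source and destination buffers comfortably fit in the L1 data cache.
size_t MaxSmallBufferSize() {
  const long L1DataCacheBytes = sysconf(_SC_LEVEL1_DCACHE_SIZE);
  if (L1DataCacheBytes <= 0)
    return 16 * 1024; // Conservative fallback when sysconf can't report it.
  // Halve the reported size to leave room for two buffers and other state.
  return static_cast<size_t>(L1DataCacheBytes) / 2;
}
```
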
### Effect of prefetching

For small buffer sizes,
[prefetching](https://en.wikipedia.org/wiki/Cache_prefetching) should not kick
in, but for large buffers it may introduce a bias.

**Decision: When benchmarking large buffer sizes, the data should be accessed in
a random fashion to lower the impact of prefetching between calls.**

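The sketch below shows one way to build such a random access pattern (an
illustration only; the block size and the fixed seed are assumptions): the
benchmark walks the buffer through shuffled block offsets so consecutive calls
do not touch adjacent memory.

```c++
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns the block offsets of a large buffer in a shuffled order, so that
// successive accesses are not contiguous and the hardware prefetcher cannot
// hide part of the memory latency.
std::vector<size_t> ShuffledBlockOffsets(size_t BufferSize, size_t BlockSize) {
  std::vector<size_t> Offsets(BufferSize / BlockSize);
  std::iota(Offsets.begin(), Offsets.end(), size_t(0)); // 0, 1, 2, ...
  for (size_t &Offset : Offsets)
    Offset *= BlockSize; // Convert block indices to byte offsets.
  std::mt19937 Rng(12345); // Fixed seed to keep runs repeatable.
  std::shuffle(Offsets.begin(), Offsets.end(), Rng);
  return Offsets;
}
```
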
### Effect of dynamic frequency scaling

Modern processors implement [dynamic frequency
scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling). In so-called
`performance` mode the CPU will increase its frequency and run faster than usual
within [some limits](https://en.wikipedia.org/wiki/Intel_Turbo_Boost): _"The
increased clock rate is limited by the processor's power, current, and thermal
limits, the number of cores currently in use, and the maximum frequency of the
active cores."_

**Decision: When benchmarking we want to make sure the dynamic frequency scaling
is always set to `performance`. We also want to make sure that the time-based
events are not impacted by frequency scaling.**

See [README.md](README.md) for how to set this up.

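As a sanity check, a benchmark runner could verify the governor before
measuring. The sketch below is a Linux-specific illustration (the sysfs path
covers `cpu0` only; checking every CPU is left out for brevity).

```c++
#include <fstream>
#include <string>

// Returns true if cpu0's frequency governor is set to `performance`.
// A real check would inspect every CPU that may run the benchmark.
bool IsPerformanceGovernor() {
  std::ifstream GovernorFile(
      "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
  std::string Governor;
  GovernorFile >> Governor;
  return Governor == "performance";
}
```
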
### Reserved and pinned cores

Some operating systems allow [core
reservation](https://stackoverflow.com/questions/13583146/whole-one-core-dedicated-to-single-process).
This removes a set of perturbation sources: process migration, context switches
and interrupts. When a core is hyperthreaded, both logical cores should be
reserved.

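On Linux, pinning the benchmark process to the reserved core can be done with
`sched_setaffinity`. The sketch below is an illustration; the choice of core id
and how it was isolated (e.g. via the `isolcpus` kernel parameter) are
assumptions outside the scope of this document.

```c++
#include <sched.h>

// cpu_set_t and sched_setaffinity are GNU extensions declared in <sched.h>.
// Pins the calling process to a single core so the scheduler does not migrate
// it during measurements. Returns true on success.
bool PinToCore(int CoreId) {
  cpu_set_t CpuSet;
  CPU_ZERO(&CpuSet);
  CPU_SET(CoreId, &CpuSet);
  // A pid of 0 designates the calling process.
  return sched_setaffinity(/*pid=*/0, sizeof(CpuSet), &CpuSet) == 0;
}
```
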
## Microbenchmark limitations

As stated in the Foreword section, a number of effects do play a role in
production but are not directly measurable through microbenchmarks. The code
size of the benchmark is (much) smaller than the hot code of real applications
and **doesn't exhibit instruction cache pressure as much**.

### iCache pressure

Fundamental functions that are called frequently will occupy the L1 iCache
([illustration](https://en.wikipedia.org/wiki/CPU_cache#Example:_the_K8)). If
they are too big they will prevent other hot code from staying in the cache and
incur [stalls](https://en.wikipedia.org/wiki/CPU_cache#CPU_stalls). So the
memory functions should be as small as possible.

### iTLB pressure

The same reasoning goes for the instruction Translation Lookaside Buffer
([iTLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)): functions
that are too big incur [TLB
misses](https://en.wikipedia.org/wiki/Translation_lookaside_buffer#TLB-miss_handling).

## FAQ

 1. Why don't you use Google Benchmark directly?

    We reuse some parts of Google Benchmark (detection of frequency scaling,
    CPU cache hierarchy information) but when it comes to measuring memory
    functions Google Benchmark has a few issues:

    - Google Benchmark favors code-based configuration via macros and
      builders, which is typically done statically. In our case the
      parameters we need to set up are a mix of what's usually controlled by
      the framework (number of trials, maximum number of iterations, size
      ranges) and parameters that are more tied to the function under test
      (randomization strategies, custom values). Achieving this with Google
      Benchmark is cumbersome as it involves templated benchmarks and
      duplicated code. In the end, the configuration would be spread across
      command line flags (via the framework's options or custom flags) and
      code constants.
    - Output of the measurements is done through a `BenchmarkReporter` class,
      which makes it hard to access the parameters discussed above.