CbC/CbC_llvm: llvm/docs/SpeculativeLoadHardening.md annotate

author	anatofuz
date	Thu, 19 Mar 2020 17:02:53 +0900
# Speculative Load Hardening

### A Spectre Variant #1 Mitigation Technique

Author: Chandler Carruth - [chandlerc@google.com](mailto:chandlerc@google.com)

## Problem Statement

Recently, Google Project Zero and other researchers have found information leak
vulnerabilities by exploiting speculative execution in modern CPUs. These
exploits are currently broken down into three variants:
* GPZ Variant #1 (a.k.a. Spectre Variant #1): Bounds check (or predicate) bypass
* GPZ Variant #2 (a.k.a. Spectre Variant #2): Branch target injection
* GPZ Variant #3 (a.k.a. Meltdown): Rogue data cache load

For more details, see the Google Project Zero blog post and the Spectre research
paper:
* https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
* https://spectreattack.com/spectre.pdf

The core problem of GPZ Variant #1 is that speculative execution uses branch
prediction to select the path of instructions speculatively executed. This path
is speculatively executed with the available data, and may load from memory and
leak the loaded values through various side channels that survive even when the
speculative execution is unwound due to being incorrect. Mispredicted paths can
cause code to be executed with data inputs that never occur in correct
executions, making checks against malicious inputs ineffective and allowing
attackers to use malicious data inputs to leak secret data. Here is an example,
extracted and simplified from the Project Zero paper:
```
struct array {
  unsigned long length;
  unsigned char data[];
};
struct array *arr1 = ...; // small array
struct array *arr2 = ...; // array of size 0x400
unsigned long untrusted_offset_from_caller = ...;
if (untrusted_offset_from_caller < arr1->length) {
  unsigned char value = arr1->data[untrusted_offset_from_caller];
  unsigned long index2 = ((value&1)*0x100)+0x200;
  unsigned char value2 = arr2->data[index2];
}
```

The key to the attack is to call this with an `untrusted_offset_from_caller`
that is far outside of the bounds while the branch predictor predicts that it
will be in-bounds. In that case, the body of the `if` will be executed
speculatively, and may read secret data into `value` and leak it via a
cache-timing side channel when a dependent access is made to populate `value2`.
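
To make the dependent access concrete: `index2` in the example above can take only two values, so the cache line touched in `arr2->data` encodes the low bit of `value`. The following is a minimal sketch of that arithmetic only. The helper names `probe_index` and `recovered_bit` are hypothetical, and no timing measurement is performed here; `hot_offset` stands for whichever offset an attacker would observe to be cached.

```cpp
// Hypothetical helper: the offset into arr2->data that the victim code touches
// for a given (possibly secret) byte, using the same expression as the example.
constexpr unsigned long probe_index(unsigned char value) {
  return ((value & 1UL) * 0x100UL) + 0x200UL;
}

// Hypothetical helper: the bit an attacker would infer once a timing probe has
// shown which of the two candidate offsets (0x200 or 0x300) is cached.
constexpr unsigned recovered_bit(unsigned long hot_offset) {
  return hot_offset == 0x300UL ? 1u : 0u;
}
```

Repeating the attack with a different shift or mask applied to `value` would, under the same assumptions, reveal the remaining bits one at a time.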

## High Level Mitigation Approach

While several approaches are being actively pursued to mitigate specific
branches and/or loads inside especially risky software (most notably various OS
kernels), these approaches require manual and/or static analysis aided auditing
of code and explicit source changes to apply the mitigation. They are unlikely
to scale well to large applications. We are proposing a comprehensive
mitigation approach that would apply automatically across an entire program
rather than through manual changes to the code. While this is likely to have a
high performance cost, some applications may be in a good position to take this
performance / security tradeoff.

The specific technique we propose is to cause loads to be checked using
branchless code to ensure that they are executing along a valid control flow
path. Consider the following C-pseudo-code representing the core idea of a
predicate guarding potentially invalid loads:
```
void leak(int data);
void example(int* pointer1, int* pointer2) {
  if (condition) {
    // ... lots of code ...
    leak(*pointer1);
  } else {
    // ... more code ...
    leak(*pointer2);
  }
}
```

This would get transformed into something resembling the following:
```
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  if (condition) {
    // Assuming ?: is implemented using branchless logic...
    predicate_state = !condition ? all_zeros_mask : predicate_state;
    // ... lots of code ...
    //
    // Harden the pointer so it can't be loaded
    pointer1 &= predicate_state;
    leak(*pointer1);
  } else {
    predicate_state = condition ? all_zeros_mask : predicate_state;
    // ... more code ...
    //
    // Alternative: Harden the loaded value
    int value2 = *pointer2 & predicate_state;
    leak(value2);
  }
}
```

The result should be that if the `if (condition) {` branch is mis-predicted,
there is a data dependency on the condition used to zero out any pointers
prior to loading through them or to zero out all of the loaded bits. Even
though this code pattern may still execute speculatively, invalid speculative
executions are prevented from leaking secret data from memory (but note that
this data might still be loaded in safe ways, and some regions of memory are
required to not hold secrets, see below for detailed limitations). This
approach requires only that the underlying hardware have a way to implement a
branchless and unpredicted conditional update of a register's value. All modern
architectures have support for this, and in fact such support is necessary to
correctly implement constant time cryptographic primitives.
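
The masking arithmetic behind this transformation can be sketched in ordinary C++. This is only an illustration under stated assumptions: the helper names (`select_state`, `harden_pointer`, `harden_value`) are hypothetical, and source-level C++ cannot guarantee that a compiler emits branchless instructions, which is why the actual hardening has to be applied by the compiler when it generates machine code.

```cpp
#include <cstdint>
#include <limits>

constexpr std::uintptr_t kAllOnesMask = std::numeric_limits<std::uintptr_t>::max();
constexpr std::uintptr_t kAllZerosMask = 0;

// Hypothetical helper: select without a source-level branch.
// Returns all zeros when `poison` is true, otherwise returns `state` unchanged.
inline std::uintptr_t select_state(bool poison, std::uintptr_t state) {
  // poison == true  -> 1 - 1 == 0         (mask clears every bit)
  // poison == false -> 0 - 1 == all ones  (mask keeps every bit)
  const std::uintptr_t keep = static_cast<std::uintptr_t>(poison) - 1;
  return state & keep;
}

// Hypothetical helper: harden the address of a load. A poisoned state turns
// the pointer into a null pointer, so nothing useful can be read through it.
inline const int* harden_pointer(const int* p, std::uintptr_t state) {
  return reinterpret_cast<const int*>(reinterpret_cast<std::uintptr_t>(p) & state);
}

// Hypothetical helper: harden the loaded value instead. A poisoned state
// clears every loaded bit before the value can reach a side channel.
inline int harden_value(int v, std::uintptr_t state) {
  return static_cast<int>(static_cast<unsigned>(v) & static_cast<unsigned>(state));
}
```

With an all-ones state both helpers are the identity, so correctly predicted paths compute the same results as the unhardened code.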

Crucial properties of this approach:
* It is not preventing any particular side-channel from working. This is
  important as there are an unknown number of potential side channels and we
  expect to continue discovering more. Instead, it prevents the observation of
  secret data in the first place.
* It accumulates the predicate state, protecting even in the face of nested
  correctly predicted control flows.
* It passes this predicate state across function boundaries to provide
  [interprocedural protection](#interprocedural-checking).
* When hardening the address of a load, it uses a destructive or
  non-reversible modification of the address to prevent an attacker from
  reversing the check using attacker-controlled inputs.
* It does not completely block speculative execution, and merely prevents
  mis-speculated paths from leaking secrets from memory (and stalls
  speculation until this can be determined).
* It is completely general and makes no fundamental assumptions about the
  underlying architecture other than the ability to do branchless conditional
  data updates and a lack of value prediction.
* It does not require programmers to identify all possible secret data using
  static source code annotations, or to identify code vulnerable to a
  variant #1 style attack.
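
The accumulation property in the list above can be illustrated with a small sketch. The helper name `fold` is hypothetical and the code only models the arithmetic: once any branch on the path has been mispredicted, the state is zero and no later, correctly predicted branch can restore it.

```cpp
#include <cstdint>
#include <limits>

constexpr std::uintptr_t kAllOnes = std::numeric_limits<std::uintptr_t>::max();

// Hypothetical helper: fold one branch outcome into the running predicate state.
// `predicted_correctly` is false when the executing path contradicts the
// branch condition, that is, when execution is on a misspeculated path.
constexpr std::uintptr_t fold(std::uintptr_t state, bool predicted_correctly) {
  // predicted_correctly == true  -> 0 - 1 == all ones, state is kept
  // predicted_correctly == false -> 0 - 0 == zero, state is poisoned
  return state & (std::uintptr_t{0} - static_cast<std::uintptr_t>(predicted_correctly));
}
```

Because the update is an AND, a poisoned state stays poisoned through any number of nested branches, and masking an address with it always yields zero regardless of what attacker-controlled value the address started from.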

Limitations of this approach:
* It requires re-compiling source code to insert hardening instruction
  sequences. Only software compiled in this mode is protected.
* The performance is heavily dependent on a particular architecture's
  implementation strategy. We outline a potential x86 implementation below and
  characterize its performance.
* It does not defend against secret data already loaded from memory and
  residing in registers or leaked through other side-channels in
  non-speculative execution. Code dealing with this, e.g. cryptographic
  routines, already uses constant-time algorithms and code to prevent
  side-channels. Such code should also scrub registers of secret data following
  [these
  guidelines](https://github.com/HACS-workshop/spectre-mitigations/blob/master/crypto_guidelines.md).
* To achieve reasonable performance, many loads may not be checked, such as
  those with compile-time fixed addresses. This primarily consists of accesses
  at compile-time constant offsets of global and local variables. Code which
  needs this protection and intentionally stores secret data must ensure the
  memory regions used for secret data are necessarily dynamic mappings or heap
  allocations. This is an area which can be tuned to provide more comprehensive
  protection at the cost of performance.
* [Hardened loads](#hardening-the-address-of-the-load) may still load data from
  _valid_ addresses if not _attacker-controlled_ addresses. To prevent these
  from reading secret data, the low 2 GB of the address space and 2 GB above and
  below any executable pages should be protected.

Credit:
* The core idea of tracing misspeculation through data and marking pointers to
  block misspeculated loads was developed as part of a HACS 2018 discussion
  between Chandler Carruth, Paul Kocher, Thomas Pornin, and several other
  individuals.
* The core idea of masking out loaded bits was part of the original mitigation
  suggested by Jann Horn when these attacks were reported.


### Indirect Branches, Calls, and Returns

It is possible to attack control flow other than conditional branches with
variant #1 style mispredictions.
* A prediction towards a hot call target of a virtual method can lead to it
  being speculatively executed when an unexpected type is used (often called
  "type confusion").
* A hot case may be speculatively executed due to prediction instead of the
  correct case for a switch statement implemented as a jump table.
* A hot common return address may be predicted incorrectly when returning from
  a function.

These code patterns are also vulnerable to Spectre variant #2, and as such are
best mitigated with a
[retpoline](https://support.google.com/faqs/answer/7625886) on x86 platforms.
When a mitigation technique like retpoline is used, speculation simply cannot
proceed through an indirect control flow edge (or it cannot be mispredicted in
the case of a filled RSB) and so it is also protected from variant #1 style
attacks. However, some architectures, micro-architectures, or vendors do not
employ the retpoline mitigation, and on future x86 hardware (both Intel and
AMD) it is expected to become unnecessary due to hardware-based mitigation.

When not using a retpoline, these edges will need independent protection from
variant #1 style attacks. The analogous approach to that used for conditional
control flow should work:
```
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  switch (condition) {
  case 0:
    // Assuming ?: is implemented using branchless logic...
    predicate_state = (condition != 0) ? all_zeros_mask : predicate_state;
    // ... lots of code ...
    //
    // Harden the pointer so it can't be loaded
    pointer1 &= predicate_state;
    leak(*pointer1);
    break;

  case 1:
    predicate_state = (condition != 1) ? all_zeros_mask : predicate_state;
    // ... more code ...
    //
    // Alternative: Harden the loaded value
    int value2 = *pointer2 & predicate_state;
    leak(value2);
    break;

    // ...
  }
}
```

The core idea remains the same: validate the control flow using data-flow and
use that validation to check that loads cannot leak information along
misspeculated paths. Typically this involves passing the desired target of such
control flow across the edge and checking that it is correct afterwards. Note
that while it is tempting to think that this mitigates variant #2 attacks, it
does not. Those attacks go to arbitrary gadgets that don't include the checks.
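
One way to picture "passing the desired target across the edge and checking it afterwards" is the sketch below. It is an assumption-laden model, not the implementation: the names `check_target`, `address_of`, `handler_a`, and `handler_b` are hypothetical, the intended target is passed as an explicit argument, and a misprediction is simulated by passing the wrong intended address (in real execution a mismatch can only occur on a misspeculated path).

```cpp
#include <cstdint>
#include <limits>

constexpr std::uintptr_t kAllOnes = std::numeric_limits<std::uintptr_t>::max();

// Hypothetical helper: keep `state` only when the caller's intended target
// matches the code that is actually running; otherwise return zero.
inline std::uintptr_t check_target(std::uintptr_t state,
                                   std::uintptr_t intended,
                                   std::uintptr_t actual) {
  const std::uintptr_t mismatch = static_cast<std::uintptr_t>(intended != actual);
  return state & (mismatch - 1);  // mismatch == 0 -> all ones, 1 -> zero
}

using Handler = int (*)(std::uintptr_t, std::uintptr_t, const int*);

// Hypothetical helper: the address of a handler as an integer.
inline std::uintptr_t address_of(Handler fn) {
  return reinterpret_cast<std::uintptr_t>(fn);
}

int handler_a(std::uintptr_t intended, std::uintptr_t state, const int* p);
int handler_b(std::uintptr_t intended, std::uintptr_t state, const int* p);

// Each possible target re-validates the edge on entry, then hardens its load.
int handler_a(std::uintptr_t intended, std::uintptr_t state, const int* p) {
  state = check_target(state, intended, address_of(&handler_a));
  const unsigned masked = static_cast<unsigned>(*p) & static_cast<unsigned>(state);
  return static_cast<int>(masked);
}

int handler_b(std::uintptr_t intended, std::uintptr_t state, const int* p) {
  state = check_target(state, intended, address_of(&handler_b));
  const unsigned masked = static_cast<unsigned>(*p) & static_cast<unsigned>(state);
  return static_cast<int>(masked * 2u);
}
```

The check lives inside every legitimate target, which is also why it offers nothing against variant #2: an injected branch target is an arbitrary gadget that never runs `check_target`.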


### Variant #1.1 and #1.2 attacks: "Bounds Check Bypass Store"

Beyond the core variant #1 attack, there are techniques to extend this attack.
The primary technique is known as "Bounds Check Bypass Store" and is discussed
in this research paper: https://people.csail.mit.edu/vlk/spectre11.pdf

We will analyze these two variants independently. First, variant #1.1 works by
speculatively storing over the return address after a bounds check bypass. This
speculative store then ends up being used by the CPU during speculative
execution of the return, potentially directing speculative execution to
arbitrary gadgets in the binary. Let's look at an example.
```
unsigned char local_buffer[4];
unsigned char *untrusted_data_from_caller = ...;
unsigned long untrusted_size_from_caller = ...;
if (untrusted_size_from_caller < sizeof(local_buffer)) {
  // Speculative execution enters here with a too-large size.
  memcpy(local_buffer, untrusted_data_from_caller,
         untrusted_size_from_caller);
  // The stack has now been smashed, writing an attacker-controlled
  // address over the return address.
  minor_processing(local_buffer);
  return;
  // Control will speculate to the attacker-written address.
}
```

However, this can be mitigated by hardening the load of the return address just
like any other load. This is sometimes complicated because x86, for example,
implicitly loads the return address off the stack. However, the
implementation technique below is specifically designed to mitigate this
implicit load by using the stack pointer to communicate misspeculation between
functions. This additionally causes a misspeculation to have an invalid stack
pointer and never be able to read the speculatively stored return address. See
the detailed discussion below.

For variant #1.2, the attacker speculatively stores into the vtable or jump
table used to implement an indirect call or indirect jump. Because this is
speculative, this will often be possible even when these are stored in
read-only pages. For example:
```
class FancyObject : public BaseObject {
public:
  void DoSomething() override;
};
void f(unsigned long attacker_offset, unsigned long attacker_data) {
  FancyObject object = getMyObject();
  unsigned long *arr[4] = getFourDataPointers();
  if (attacker_offset < 4) {
    // We have bypassed the bounds check speculatively.
    unsigned long *data = arr[attacker_offset];
    // Now we have computed a pointer inside of `object`, the vptr.
    *data = attacker_data;
    // The vptr points to the virtual table and we speculatively clobber that.
    g(object); // Hand the object to some other routine.
  }
}
// In another file, we call a method on the object.
void g(BaseObject &object) {
  object.DoSomething();
  // This speculatively calls the address stored over the vtable.
}
```

Mitigating this requires hardening loads from these locations, or mitigating
the indirect call or indirect jump. Any of these is sufficient to block the
call or jump from using a speculatively stored value that has been read back.

For both of these, using retpolines would be equally sufficient. One possible
hybrid approach is to use retpolines for indirect call and jump, while relying
on SLH to mitigate returns.

Another approach that is sufficient for both of these is to harden all of the
speculative stores. However, as most stores aren't interesting and don't
inherently leak data, this is expected to be prohibitively expensive given the
attack it is defending against.
1d019706d866 LLVM10 anatofuz parents: diff changeset	314
1d019706d866 LLVM10 anatofuz parents: diff changeset	315
## Implementation Details

There are a number of complex details impacting the implementation of this
technique, both on a particular architecture and within a particular compiler.
We discuss proposed implementation techniques for the x86 architecture and the
LLVM compiler. These are primarily to serve as an example, as other
implementation techniques are very possible.


### x86 Implementation Details

On the x86 platform we break down the implementation into three core
components: accumulating the predicate state through the control flow graph,
checking the loads, and checking control transfers between procedures.


#### Accumulating Predicate State

Consider baseline x86 instructions like the following, which test three
conditions and, if all pass, load data from memory and potentially leak it
through some side channel:
```
# %bb.0:                                # %entry
        pushq   %rax
        testl   %edi, %edi
        jne     .LBB0_4
# %bb.1:                                # %then1
        testl   %esi, %esi
        jne     .LBB0_4
# %bb.2:                                # %then2
        testl   %edx, %edx
        je      .LBB0_3
.LBB0_4:                                # %exit
        popq    %rax
        retq
.LBB0_3:                                # %danger
        movl    (%rcx), %edi
        callq   leak
        popq    %rax
        retq
```

When we go to speculatively execute the load, we want to know whether any of
the dynamically executed predicates have been misspeculated. To track that,
along each conditional edge, we need to track the data which would allow that
edge to be taken. On x86, this data is stored in the flags register used by the
conditional jump instruction. Along both edges after this fork in control flow,
the flags register remains alive and contains data that we can use to build up
our accumulated predicate state. We accumulate it using the x86 conditional
move instruction, which also reads the flags register where the state resides.
These conditional move instructions are known to not be predicted on any x86
processors, making them immune to misprediction that could reintroduce the
vulnerability. When we insert the conditional moves, the code ends up looking
like the following:
```
# %bb.0:                                # %entry
        pushq   %rax
        xorl    %eax, %eax              # Zero out initial predicate state.
        movq    $-1, %r8                # Put all-ones mask into a register.
        testl   %edi, %edi
        jne     .LBB0_1
# %bb.2:                                # %then1
        cmovneq %r8, %rax               # Conditionally update predicate state.
        testl   %esi, %esi
        jne     .LBB0_1
# %bb.3:                                # %then2
        cmovneq %r8, %rax               # Conditionally update predicate state.
        testl   %edx, %edx
        je      .LBB0_4
.LBB0_1:
        cmoveq  %r8, %rax               # Conditionally update predicate state.
        popq    %rax
        retq
.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
```

Here we create the "empty" or "correct execution" predicate state by zeroing
`%rax`, and we create a constant "incorrect execution" predicate value by
putting `-1` into `%r8`. Then, along each edge coming out of a conditional
branch we do a conditional move that in a correct execution will be a no-op,
but if misspeculated, will replace `%rax` with the value of `%r8`.
Misspeculating any one of the three predicates will cause `%rax` to hold the
"incorrect execution" value from `%r8`, because we preserve the incoming value
when execution is correct rather than overwriting it.

We now have a value in `%rax` in each basic block that indicates if at some
point previously a predicate was mispredicted. And we have arranged for that
value to be particularly effective when used below to harden loads.

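The accumulation described above can also be modeled outside of assembly. The
following is a minimal C++ sketch, not the LLVM implementation:
`UpdatePredicateState` and `StateAtDanger` are hypothetical names, and the
branch-free select merely stands in for the unpredicted `cmovCC`. It walks the
three edges that lead to `%danger` in the example above.

```cpp
#include <cstdint>

// Hypothetical model of `cmovCC %r8, %rax`: when `condition` holds, the state
// is replaced by the all-ones "incorrect execution" value; otherwise the
// incoming state is preserved. Written without a branch to mirror the cmov.
inline std::uint64_t UpdatePredicateState(std::uint64_t state, bool condition) {
  const std::uint64_t poison = ~std::uint64_t{0};  // movq $-1, %r8
  const std::uint64_t select =
      std::uint64_t{0} - static_cast<std::uint64_t>(condition);
  return (poison & select) | (state & ~select);
}

// The arguments are the values the flags were computed from. Reaching %danger
// in a correct execution requires edi == 0, esi == 0, and edx == 0.
inline std::uint64_t StateAtDanger(int edi, int esi, int edx) {
  std::uint64_t state = 0;                        // xorl %eax, %eax
  state = UpdatePredicateState(state, edi != 0);  // cmovneq entering %then1
  state = UpdatePredicateState(state, esi != 0);  // cmovneq entering %then2
  state = UpdatePredicateState(state, edx != 0);  // cmovneq entering %danger
  return state;
}
```

Calling `StateAtDanger(0, 0, 0)` models a correct execution and leaves the
state at zero. Any nonzero argument models reaching `%danger` past a
mispredicted branch and yields the all-ones value, which later updates never
clear.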

##### Indirect Call, Branch, and Return Predicates

There is no analogous flag to use when tracing indirect calls, branches, and
returns. The predicate state must be accumulated through some other means.
Fundamentally, this is the reverse of the problem posed in CFI: we need to
check where we came from rather than where we are going. For function-local
jump tables, this is easily arranged by testing the input to the jump table
within each destination (not yet implemented; use retpolines):
```
        pushq   %rax
        xorl    %eax, %eax              # Zero out initial predicate state.
        movq    $-1, %r8                # Put all-ones mask into a register.
        jmpq    *.LJTI0_0(,%rdi,8)      # Indirect jump through table.
.LBB0_2:                                # %sw.bb
        cmpq    $0, %rdi                # Validate index used for jump table.
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
        jmp     _Z4leaki                # TAILCALL

.LBB0_3:                                # %sw.bb1
        cmpq    $1, %rdi                # Validate index used for jump table.
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
        jmp     _Z4leaki                # TAILCALL

.LBB0_5:                                # %sw.bb10
        cmpq    $2, %rdi                # Validate index used for jump table.
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
        jmp     _Z4leaki                # TAILCALL
        ...

        .section        .rodata,"a",@progbits
        .p2align        3
.LJTI0_0:
        .quad   .LBB0_2
        .quad   .LBB0_3
        .quad   .LBB0_5
        ...
```

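To illustrate the per-destination check, here is a small C++ model (a sketch
under assumptions, not LLVM's code). `CheckJumpTableEntry` is a hypothetical
helper, `expected_index` is the table slot that legitimately reaches the
destination block, and `actual_index` is the register value that fed the
indirect jump. Matching indices preserve the incoming state, while a mismatch,
which can only arise when the jump target was mispredicted, yields the
all-ones value.

```cpp
#include <cstdint>

// Models the compare-and-cmovne pair at the top of each destination block.
// Because the "incorrect execution" value is all-ones, replacing the state
// with it on a mismatch is the same as OR-ing in an all-ones mask.
inline std::uint64_t CheckJumpTableEntry(std::uint64_t state,
                                         std::uint64_t actual_index,
                                         std::uint64_t expected_index) {
  const std::uint64_t mismatch =
      std::uint64_t{0} -
      static_cast<std::uint64_t>(actual_index != expected_index);
  return state | mismatch;
}
```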
Returns have a simple mitigation technique on x86-64 (or other ABIs which have
what is called a "red zone" region beyond the end of the stack). This region is
guaranteed to be preserved across interrupts and context switches, making the
return address used in returning to the current code remain on the stack and
valid to read. We can emit code in the caller to verify that a return edge was
not mispredicted:
```
        callq   other_function
return_addr:
        cmpq    $return_addr, -8(%rsp)  # Validate return address.
        cmovneq %r8, %rax               # Update predicate state.
```

For an ABI without a "red zone" (and thus unable to read the return address
from the stack), we can compute the expected return address prior to the call
into a register preserved across the call and use that similarly to the above.

Indirect calls (and returns in the absence of a red zone ABI) pose the most
significant challenge to propagate. The simplest technique would be to define a
new ABI such that the intended call target is passed into the called function
and checked in the entry. Unfortunately, new ABIs are quite expensive to deploy
in C and C++. While the target function could be passed in TLS, we would still
require complex logic to handle a mixture of functions compiled with and
without this extra logic (essentially, making the ABI backwards compatible).
Currently, we suggest using retpolines here and will continue to investigate
ways of mitigating this.


##### Optimizations, Alternatives, and Tradeoffs

Merely accumulating predicate state involves significant cost. There are
several key optimizations we employ to minimize this and various alternatives
that present different tradeoffs in the generated code.

First, we work to reduce the number of instructions used to track the state:
* Rather than inserting a `cmovCC` instruction along every conditional edge in
  the original program, we track each set of condition flags we need to capture
  prior to entering each basic block and reuse a common `cmovCC` sequence for
  those.
  * We could further reuse suffixes when there are multiple `cmovCC`
    instructions required to capture the set of flags. Currently this is
    believed to not be worth the cost as paired flags are relatively rare and
    suffixes of them are exceedingly rare.
* A common pattern in x86 is to have multiple conditional jump instructions
  that use the same flags but handle different conditions. Naively, we could
  consider each fallthrough between them an "edge" but this causes a much more
  complex control flow graph. Instead, we accumulate the set of conditions
  necessary for fallthrough and use a sequence of `cmovCC` instructions in a
  single fallthrough edge to track it.

Second, we trade register pressure for simpler `cmovCC` instructions by
allocating a register for the "bad" state. We could read that value from memory
as part of the conditional move instruction; however, this creates more
micro-ops and requires the load-store unit to be involved. Currently, we place
the value into a virtual register and allow the register allocator to decide
when the register pressure is sufficient to make it worth spilling to memory
and reloading.


#### Hardening Loads

Once we have the predicate accumulated into a special value for correct vs.
misspeculated, we need to apply this to loads in a way that ensures they do not
leak secret data. There are two primary techniques for this: we can either
harden the loaded value to prevent observation, or we can harden the address
itself to prevent the load from occurring. These have significantly different
performance tradeoffs.


##### Hardening loaded values

The most appealing way to harden loads is to mask out all of the bits loaded.
The key requirement is that for each bit loaded, along the misspeculated path
that bit is always fixed at either 0 or 1 regardless of the value of the bit
loaded. The most obvious implementation uses either an `and` instruction with
an all-zero mask along misspeculated paths and an all-one mask along correct
paths, or an `or` instruction with an all-one mask along misspeculated paths
and an all-zero mask along correct paths. Other options, such as multiplying by
zero or using multiple shift instructions, are less appealing. For reasons we
elaborate on below, we end up suggesting you use `or` with an all-ones mask,
making the x86 instruction sequence look like the following:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        movl    (%rsi), %edi            # Load potentially secret data from %rsi.
        orl     %eax, %edi
```

Other useful patterns may be to fold the load into the `or` instruction itself
at the cost of a register-to-register copy.

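The two masking options can be compared in a short C++ sketch. The helper
names `HardenWithOr` and `HardenWithAnd` are hypothetical, and the sketch
assumes the predicate state encoding used above: zero on a correct path and
all-ones on a misspeculated one. Under that encoding the `or` variant can use
the state directly as its mask, whereas the `and` variant needs the inverted
state. On a correct path both return the loaded value unchanged; on a
misspeculated path every result bit is fixed regardless of what was loaded.

```cpp
#include <cstdint>

// `or` variant: the low 32 bits of the predicate state are the mask
// (orl %eax, %edi). A misspeculated path forces every bit to 1.
inline std::uint32_t HardenWithOr(std::uint32_t loaded, std::uint64_t state) {
  return loaded | static_cast<std::uint32_t>(state);
}

// `and` variant: the mask must be all-ones on a correct path, so the state
// has to be inverted first. A misspeculated path forces every bit to 0.
inline std::uint32_t HardenWithAnd(std::uint32_t loaded, std::uint64_t state) {
  return loaded & static_cast<std::uint32_t>(~state);
}
```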
There are some challenges with deploying this approach:
1. Many loads on x86 are folded into other instructions. Separating them would
   add very significant and costly register pressure with prohibitive
   performance cost.
1. Loads may not target a general purpose register, requiring extra
   instructions to map the state value into the correct register class, and
   potentially more expensive instructions to mask the value in some way.
1. The flags register on x86 is very likely to be live, and challenging to
   preserve cheaply.
1. There are many more values loaded than pointers & indices used for loads. As
   a consequence, hardening the result of a load requires substantially more
   instructions than hardening the address of the load (see below).

Despite these challenges, hardening the result of the load critically allows
the load to proceed and thus has dramatically less impact on the total
speculative / out-of-order potential of the execution. There are also several
interesting techniques to try and mitigate these challenges and make hardening
the results of loads viable in at least some cases. However, when hardening the
loaded value is unprofitable, we generally expect to fall back to the next
approach of hardening the address itself.


###### Loads folded into data-invariant operations can be hardened after the operation

The first key to making this feasible is to recognize that many operations on
x86 are "data-invariant". That is, they have no (known) observable behavior
differences due to the particular input data. These instructions are often used
when implementing cryptographic primitives dealing with private key data
because they are not believed to provide any side-channels. Similarly, we can
defer hardening until after them as they will not in-and-of-themselves
introduce a speculative execution side-channel. This results in code sequences
that look like:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        addl    (%rsi), %edi            # Load and accumulate without leaking.
        orl     %eax, %edi
```

While an addition happens to the loaded (potentially secret) value, that
doesn't leak any data and we then immediately harden it.


###### Hardening of loaded values deferred down the data-invariant expression graph

We can generalize the previous idea and sink the hardening down the expression
graph across as many data-invariant operations as desirable. This can use very
conservative rules for whether something is data-invariant. The primary goal
should be to handle multiple loads with a single hardening instruction:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        addl    (%rsi), %edi            # Load and accumulate without leaking.
        addl    4(%rsi), %edi           # Continue without leaking.
        addl    8(%rsi), %edi
        orl     %eax, %edi              # Mask out bits from all three loads.
```

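The same deferral can be expressed as a C++ sketch. `SumThenHarden` is a
hypothetical function that mirrors the three folded loads above, and it
assumes that 32-bit addition is data-invariant and that `state` is zero on a
correct path and all-ones otherwise.

```cpp
#include <cstdint>

// Accumulates three loaded values and applies a single hardening `or` at the
// end instead of one per load.
inline std::uint32_t SumThenHarden(const std::uint32_t* p, std::uint32_t acc,
                                   std::uint64_t state) {
  acc += p[0];  // addl (%rsi), %edi
  acc += p[1];  // addl 4(%rsi), %edi
  acc += p[2];  // addl 8(%rsi), %edi
  return acc | static_cast<std::uint32_t>(state);  // orl %eax, %edi
}
```

With the array `{1, 2, 3}` and an initial accumulator of `10`, a zero state
returns `16`, while an all-ones state returns `0xFFFFFFFF` no matter what was
loaded.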

###### Preserving the flags while hardening loaded values on Haswell, Zen, and newer processors

Sadly, there are no useful instructions on x86 that apply a mask to all 64 bits
without touching the flags register. However, we can harden loaded values that
are narrower than a word (fewer than 32 bits on 32-bit systems and fewer than
64 bits on 64-bit systems) by zero-extending the value to the full word size
and then shifting right by at least the number of original bits using the BMI2
`shrx` instruction:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        addl    (%rsi), %edi            # Load and accumulate 32 bits of data.
        shrxq   %rax, %rdi, %rdi        # Shift out all 32 bits loaded.
```

Because on x86 the zero-extend is free, this can efficiently harden the loaded
value.

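A C++ model of the shift-based hardening may make the arithmetic easier to
follow. `HardenWithShift` is a hypothetical name. The sketch relies on `shrx`
with 64-bit operands using only the low six bits of its count, so an all-ones
predicate state shifts by 63 and a zero state shifts by nothing.

```cpp
#include <cstdint>

// Zero-extends a 32-bit loaded value to 64 bits and shifts it right by the
// predicate state, as `shrxq %rax, %rdi, %rdi` would.
inline std::uint64_t HardenWithShift(std::uint32_t loaded,
                                     std::uint64_t state) {
  const std::uint64_t widened = loaded;  // Zero-extension to 64 bits.
  const unsigned count = static_cast<unsigned>(state & 63);
  return widened >> count;
}
```

Shifting a zero-extended 32-bit value right by 63 always produces zero, so
none of the loaded bits survive on a misspeculated path.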
1d019706d866 LLVM10 anatofuz parents: diff changeset	625
1d019706d866 LLVM10 anatofuz parents: diff changeset	626 ##### Hardening the address of the load
1d019706d866 LLVM10 anatofuz parents: diff changeset	627
1d019706d866 LLVM10 anatofuz parents: diff changeset	628 When hardening the loaded value is inapplicable, most often because the
1d019706d866 LLVM10 anatofuz parents: diff changeset	629 instruction directly leaks information (like `cmp` or `jmpq`), we switch to
1d019706d866 LLVM10 anatofuz parents: diff changeset	630 hardening the _address_ of the load instead of the loaded value. This avoids
1d019706d866 LLVM10 anatofuz parents: diff changeset	631 increasing register pressure by unfolding the load or paying some other high
1d019706d866 LLVM10 anatofuz parents: diff changeset	632 cost.
1d019706d866 LLVM10 anatofuz parents: diff changeset	633
1d019706d866 LLVM10 anatofuz parents: diff changeset	634 To understand how this works in practice, we need to examine the exact
1d019706d866 LLVM10 anatofuz parents: diff changeset	635 semantics of the x86 addressing modes which, in its fully general form, looks
1d019706d866 LLVM10 anatofuz parents: diff changeset	636 like `(%base,%index,scale)offset`. Here `%base` and `%index` are 64-bit
1d019706d866 LLVM10 anatofuz parents: diff changeset	637 registers that can potentially be any value, and may be attacker controlled,
1d019706d866 LLVM10 anatofuz parents: diff changeset	638 and `scale` and `offset` are fixed immediate values. `scale` must be `1`, `2`,
1d019706d866 LLVM10 anatofuz parents: diff changeset	639 `4`, or `8`, and `offset` can be any 32-bit sign extended value. The exact
1d019706d866 LLVM10 anatofuz parents: diff changeset	640 computation performed to find the address is then: `%base + (scale * %index) +
1d019706d866 LLVM10 anatofuz parents: diff changeset	641 offset` under 64-bit 2's complement modular arithmetic.
1d019706d866 LLVM10 anatofuz parents: diff changeset	642
1d019706d866 LLVM10 anatofuz parents: diff changeset	643 One issue with this approach is that, after hardening, the `%base + (scale *
1d019706d866 LLVM10 anatofuz parents: diff changeset	644 %index)` subexpression will compute a value near zero (`-1 + (scale * -1)`) and
1d019706d866 LLVM10 anatofuz parents: diff changeset	645 then a large, positive `offset` will index into memory within the first two
1d019706d866 LLVM10 anatofuz parents: diff changeset	646 gigabytes of address space. While these offsets are not attacker controlled,
1d019706d866 LLVM10 anatofuz parents: diff changeset	647 the attacker could choose to attack a load which happens to have the desired
1d019706d866 LLVM10 anatofuz parents: diff changeset	648 offset and then successfully read memory in that region. This significantly
1d019706d866 LLVM10 anatofuz parents: diff changeset	649 raises the burden on the attacker and limits the scope of attack but does not
1d019706d866 LLVM10 anatofuz parents: diff changeset	650 eliminate it. To fully close the attack we must work with the operating system
1d019706d866 LLVM10 anatofuz parents: diff changeset	651 to preclude mapping memory in the low two gigabytes of address space.
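
The arithmetic behind this concern can be sketched in C. This is only a model of the address computation, not code the compiler emits. The function names `effective_address` and `poisoned_address` are hypothetical, and the sketch assumes both registers have been poisoned to `-1` as described above.

```c
#include <stdint.h>

/* Hypothetical model of the x86-64 address computation
 * base + scale * index + offset; all arithmetic wraps modulo 2^64. */
static uint64_t effective_address(uint64_t base, uint64_t index,
                                  uint64_t scale, int32_t offset) {
    /* The 32-bit displacement is sign-extended before being added. */
    uint64_t disp = (uint64_t)(int64_t)offset;
    return base + scale * index + disp;
}

/* Address reached when both registers were poisoned to -1 (all ones). */
static uint64_t poisoned_address(uint64_t scale, int32_t offset) {
    return effective_address(UINT64_MAX, UINT64_MAX, scale, offset);
}
```

Under this model, `poisoned_address(1, 0x7fffffff)` evaluates to `0x7ffffffd`, which lies inside the low two gigabytes, while `poisoned_address(8, 0)` wraps to a small negative address.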
1d019706d866 LLVM10 anatofuz parents: diff changeset	652
1d019706d866 LLVM10 anatofuz parents: diff changeset	653
1d019706d866 LLVM10 anatofuz parents: diff changeset	654 ###### 64-bit load checking instructions
1d019706d866 LLVM10 anatofuz parents: diff changeset	655
1d019706d866 LLVM10 anatofuz parents: diff changeset	656 We can use the following instruction sequences to check loads. We set up `%r8`
1d019706d866 LLVM10 anatofuz parents: diff changeset	657 in these examples to hold the special value of `-1` which will be `cmov`ed over
1d019706d866 LLVM10 anatofuz parents: diff changeset	658 `%rax` in misspeculated paths.
1d019706d866 LLVM10 anatofuz parents: diff changeset	659
1d019706d866 LLVM10 anatofuz parents: diff changeset	660 Single register addressing mode:
1d019706d866 LLVM10 anatofuz parents: diff changeset	661 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	662 ...
1d019706d866 LLVM10 anatofuz parents: diff changeset	663
1d019706d866 LLVM10 anatofuz parents: diff changeset	664 .LBB0_4: # %danger
1d019706d866 LLVM10 anatofuz parents: diff changeset	665 cmovneq %r8, %rax # Conditionally update predicate state.
1d019706d866 LLVM10 anatofuz parents: diff changeset	666 orq %rax, %rsi # Mask the pointer if misspeculating.
1d019706d866 LLVM10 anatofuz parents: diff changeset	667 movl (%rsi), %edi
1d019706d866 LLVM10 anatofuz parents: diff changeset	668 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	669
1d019706d866 LLVM10 anatofuz parents: diff changeset	670 Two register addressing mode:
1d019706d866 LLVM10 anatofuz parents: diff changeset	671 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	672 ...
1d019706d866 LLVM10 anatofuz parents: diff changeset	673
1d019706d866 LLVM10 anatofuz parents: diff changeset	674 .LBB0_4: # %danger
1d019706d866 LLVM10 anatofuz parents: diff changeset	675 cmovneq %r8, %rax # Conditionally update predicate state.
1d019706d866 LLVM10 anatofuz parents: diff changeset	676 orq %rax, %rsi # Mask the pointer if misspeculating.
1d019706d866 LLVM10 anatofuz parents: diff changeset	677 orq %rax, %rcx # Mask the index if misspeculating.
1d019706d866 LLVM10 anatofuz parents: diff changeset	678 movl (%rsi,%rcx), %edi
1d019706d866 LLVM10 anatofuz parents: diff changeset	679 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	680
1d019706d866 LLVM10 anatofuz parents: diff changeset	681 This will result in a negative address near zero or in `offset` wrapping the
1d019706d866 LLVM10 anatofuz parents: diff changeset	682 address space back to a small positive address. Small, negative addresses will
1d019706d866 LLVM10 anatofuz parents: diff changeset	683 fault in user-mode for most operating systems, but targets which need the high
1d019706d866 LLVM10 anatofuz parents: diff changeset	684 address space to be user accessible may need to adjust the exact sequence used
1d019706d866 LLVM10 anatofuz parents: diff changeset	685 above. Additionally, the low addresses will need to be marked unreadable by the
1d019706d866 LLVM10 anatofuz parents: diff changeset	686 OS to fully harden the load.
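
As a rough illustration, the effect of the `orq` masking on an address register can be modeled in C. The helper name `harden_address_reg` is hypothetical, and the sketch assumes the predicate state is `0` on correctly executed paths and all ones (`-1`) when misspeculating.

```c
#include <stdint.h>

/* Hypothetical model of `orq %rax, %reg`.
 * state is assumed to be 0 on the correct path and all ones otherwise. */
static uint64_t harden_address_reg(uint64_t reg, uint64_t state) {
    /* OR with 0 leaves the register unchanged; OR with all ones forces
     * the register to -1 regardless of its (possibly attacker-chosen) value. */
    return reg | state;
}
```

With a state of `0` the register passes through untouched, so the load behaves normally; with a state of `-1` every address register becomes `-1`, whatever value it held before.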
1d019706d866 LLVM10 anatofuz parents: diff changeset	687
1d019706d866 LLVM10 anatofuz parents: diff changeset	688
1d019706d866 LLVM10 anatofuz parents: diff changeset	689 ###### RIP-relative addressing is even easier to break
1d019706d866 LLVM10 anatofuz parents: diff changeset	690
1d019706d866 LLVM10 anatofuz parents: diff changeset	691 There is a common addressing mode idiom that is substantially harder to check:
1d019706d866 LLVM10 anatofuz parents: diff changeset	692 addressing relative to the instruction pointer. We cannot change the value of
1d019706d866 LLVM10 anatofuz parents: diff changeset	693 the instruction pointer register and so we have the harder problem of forcing
1d019706d866 LLVM10 anatofuz parents: diff changeset	694 `%base + scale * %index + offset` to be an invalid address, by only changing
1d019706d866 LLVM10 anatofuz parents: diff changeset	695 `%index`. The only advantage we have is that the attacker also cannot modify
1d019706d866 LLVM10 anatofuz parents: diff changeset	696 `%base`. If we use the fast instruction sequence above, but only apply it to
1d019706d866 LLVM10 anatofuz parents: diff changeset	697 the index, we will always access `%rip + (scale * -1) + offset`. If the
1d019706d866 LLVM10 anatofuz parents: diff changeset	698 attacker can find a load for which this address happens to point to secret
1d019706d866 LLVM10 anatofuz parents: diff changeset	699 data, then they can reach it. However, the loader and base libraries can also
1d019706d866 LLVM10 anatofuz parents: diff changeset	700 simply refuse to map the heap, data segments, or stack within 2 GB of any of the
1d019706d866 LLVM10 anatofuz parents: diff changeset	701 text in the program, much like the low 2 GB of address space can be reserved.
1d019706d866 LLVM10 anatofuz parents: diff changeset	702
1d019706d866 LLVM10 anatofuz parents: diff changeset	703
1d019706d866 LLVM10 anatofuz parents: diff changeset	704 ###### The flag registers again make everything hard
1d019706d866 LLVM10 anatofuz parents: diff changeset	705
1d019706d866 LLVM10 anatofuz parents: diff changeset	706 Unfortunately, the technique of using `orq` instructions has a serious flaw on
1d019706d866 LLVM10 anatofuz parents: diff changeset	707 x86. The very thing that makes it easy to accumulate state, the flag registers
1d019706d866 LLVM10 anatofuz parents: diff changeset	708 containing predicates, causes serious problems here because they may be live
1d019706d866 LLVM10 anatofuz parents: diff changeset	709 and used by the loading instruction or subsequent instructions. On x86, the
1d019706d866 LLVM10 anatofuz parents: diff changeset	710 `orq` instruction sets the flags and will overwrite anything already there.
1d019706d866 LLVM10 anatofuz parents: diff changeset	711 This makes inserting them into the instruction stream very hazardous.
1d019706d866 LLVM10 anatofuz parents: diff changeset	712 Unfortunately, unlike when hardening the loaded value, we have no fallback here
1d019706d866 LLVM10 anatofuz parents: diff changeset	713 and so we must have a fully general approach available.
1d019706d866 LLVM10 anatofuz parents: diff changeset	714
1d019706d866 LLVM10 anatofuz parents: diff changeset	715 The first thing we must do when generating these sequences is try to analyze
1d019706d866 LLVM10 anatofuz parents: diff changeset	716 the surrounding code to prove that the flags are not in fact live or being
1d019706d866 LLVM10 anatofuz parents: diff changeset	717 used. Typically, they have been set by some other instruction which just happens
1d019706d866 LLVM10 anatofuz parents: diff changeset	718 to set the flags register (much like ours!) with no actual dependency. In those
1d019706d866 LLVM10 anatofuz parents: diff changeset	719 cases, it is safe to directly insert these instructions. Alternatively we may
1d019706d866 LLVM10 anatofuz parents: diff changeset	720 be able to move them earlier to avoid clobbering the used value.
1d019706d866 LLVM10 anatofuz parents: diff changeset	721
1d019706d866 LLVM10 anatofuz parents: diff changeset	722 However, this may ultimately be impossible. In that case, we need to preserve
1d019706d866 LLVM10 anatofuz parents: diff changeset	723 the flags around these instructions:
1d019706d866 LLVM10 anatofuz parents: diff changeset	724 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	725 ...
1d019706d866 LLVM10 anatofuz parents: diff changeset	726
1d019706d866 LLVM10 anatofuz parents: diff changeset	727 .LBB0_4: # %danger
1d019706d866 LLVM10 anatofuz parents: diff changeset	728 cmovneq %r8, %rax # Conditionally update predicate state.
1d019706d866 LLVM10 anatofuz parents: diff changeset	729 pushfq
1d019706d866 LLVM10 anatofuz parents: diff changeset	730 orq %rax, %rcx # Mask the pointer if misspeculating.
1d019706d866 LLVM10 anatofuz parents: diff changeset	731 orq %rax, %rdx # Mask the index if misspeculating.
1d019706d866 LLVM10 anatofuz parents: diff changeset	732 popfq
1d019706d866 LLVM10 anatofuz parents: diff changeset	733 movl (%rcx,%rdx), %edi
1d019706d866 LLVM10 anatofuz parents: diff changeset	734 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	735
1d019706d866 LLVM10 anatofuz parents: diff changeset	736 Using the `pushf` and `popf` instructions saves the flags register around our
1d019706d866 LLVM10 anatofuz parents: diff changeset	737 inserted code, but comes at a high cost. First, we must store the flags to the
1d019706d866 LLVM10 anatofuz parents: diff changeset	738 stack and reload them. Second, this causes the stack pointer to be adjusted
1d019706d866 LLVM10 anatofuz parents: diff changeset	739 dynamically, requiring a frame pointer be used for referring to temporaries
1d019706d866 LLVM10 anatofuz parents: diff changeset	740 spilled to the stack, etc.
1d019706d866 LLVM10 anatofuz parents: diff changeset	741
1d019706d866 LLVM10 anatofuz parents: diff changeset	742 On newer x86 processors we can use the `lahf` and `sahf` instructions to save
1d019706d866 LLVM10 anatofuz parents: diff changeset	743 all of the flags besides the overflow flag in a register rather than on the
1d019706d866 LLVM10 anatofuz parents: diff changeset	744 stack. We can then use `seto` and `add` to save and restore the overflow flag
1d019706d866 LLVM10 anatofuz parents: diff changeset	745 in a register. Combined, this will save and restore flags in the same manner as
1d019706d866 LLVM10 anatofuz parents: diff changeset	746 above but using two registers rather than the stack. That is still very
1d019706d866 LLVM10 anatofuz parents: diff changeset	747 expensive if slightly less expensive than `pushf` and `popf` in most cases.
1d019706d866 LLVM10 anatofuz parents: diff changeset	748
1d019706d866 LLVM10 anatofuz parents: diff changeset	749
1d019706d866 LLVM10 anatofuz parents: diff changeset	750 ###### A flag-less alternative on Haswell, Zen and newer processors
1d019706d866 LLVM10 anatofuz parents: diff changeset	751
1d019706d866 LLVM10 anatofuz parents: diff changeset	752 Starting with the BMI2 x86 instruction set extensions available on Haswell and
1d019706d866 LLVM10 anatofuz parents: diff changeset	753 Zen processors, there is an instruction for shifting that does not set any
1d019706d866 LLVM10 anatofuz parents: diff changeset	754 flags: `shrx`. We can use this and the `lea` instruction to implement analogous
1d019706d866 LLVM10 anatofuz parents: diff changeset	755 code sequences to the above ones. However, these are still very marginally
1d019706d866 LLVM10 anatofuz parents: diff changeset	756 slower, as there are fewer ports able to dispatch shift instructions in most
1d019706d866 LLVM10 anatofuz parents: diff changeset	757 modern x86 processors than there are for `or` instructions.
1d019706d866 LLVM10 anatofuz parents: diff changeset	758
1d019706d866 LLVM10 anatofuz parents: diff changeset	759 Fast, single register addressing mode:
1d019706d866 LLVM10 anatofuz parents: diff changeset	760 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	761 ...
1d019706d866 LLVM10 anatofuz parents: diff changeset	762
1d019706d866 LLVM10 anatofuz parents: diff changeset	763 .LBB0_4: # %danger
1d019706d866 LLVM10 anatofuz parents: diff changeset	764 cmovneq %r8, %rax # Conditionally update predicate state.
1d019706d866 LLVM10 anatofuz parents: diff changeset	765 shrxq %rax, %rsi, %rsi # Shift away bits if misspeculating.
1d019706d866 LLVM10 anatofuz parents: diff changeset	766 movl (%rsi), %edi
1d019706d866 LLVM10 anatofuz parents: diff changeset	767 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	768
1d019706d866 LLVM10 anatofuz parents: diff changeset	769 This will collapse the register to zero or one, and force everything but the
1d019706d866 LLVM10 anatofuz parents: diff changeset	770 offset in the addressing mode to be less than or equal to 9. This means the full
1d019706d866 LLVM10 anatofuz parents: diff changeset	771 address can only be guaranteed to be less than `(1 << 31) + 9`. The OS may wish
1d019706d866 LLVM10 anatofuz parents: diff changeset	772 to protect an extra page of the low address space to account for this.
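
The bound of 9 can be checked with a small C model of the shift. This is a sketch rather than the emitted code: the helper names are hypothetical, and it relies on the fact that a 64-bit `shrx` uses only the low six bits of its count, so a predicate state of `-1` shifts right by 63.

```c
#include <stdint.h>

/* Hypothetical model of `shrxq %state, %reg, %reg`.
 * Only the low six bits of the count are used for 64-bit operands. */
static uint64_t harden_address_reg_shrx(uint64_t reg, uint64_t state) {
    return reg >> (state & 63);
}

/* Largest value of base + scale * index once base and index have each
 * been collapsed to 0 or 1 and scale is at most 8. */
static uint64_t max_collapsed_sum(void) {
    const uint64_t max_reg = 1, max_scale = 8;
    return max_reg + max_scale * max_reg;
}
```

A state of `0` shifts by nothing and leaves the register intact, while a state of `-1` leaves only the former top bit, so each register is either 0 or 1 and their scaled sum never exceeds 9.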
1d019706d866 LLVM10 anatofuz parents: diff changeset	773
1d019706d866 LLVM10 anatofuz parents: diff changeset	774
1d019706d866 LLVM10 anatofuz parents: diff changeset	775 ##### Optimizations
1d019706d866 LLVM10 anatofuz parents: diff changeset	776
1d019706d866 LLVM10 anatofuz parents: diff changeset	777 A very large portion of the cost for this approach comes from checking loads in
1d019706d866 LLVM10 anatofuz parents: diff changeset	778 this way, so it is important to work to optimize this. However, beyond making
1d019706d866 LLVM10 anatofuz parents: diff changeset	779 the instruction sequences to apply the checks efficient (for example by
1d019706d866 LLVM10 anatofuz parents: diff changeset	780 avoiding `pushfq` and `popfq` sequences), the only significant optimization is
1d019706d866 LLVM10 anatofuz parents: diff changeset	781 to check fewer loads without introducing a vulnerability. We apply several
1d019706d866 LLVM10 anatofuz parents: diff changeset	782 techniques to accomplish that.
1d019706d866 LLVM10 anatofuz parents: diff changeset	783
1d019706d866 LLVM10 anatofuz parents: diff changeset	784
1d019706d866 LLVM10 anatofuz parents: diff changeset	785 ###### Don't check loads from compile-time constant stack offsets
1d019706d866 LLVM10 anatofuz parents: diff changeset	786
1d019706d866 LLVM10 anatofuz parents: diff changeset	787 We implement this optimization on x86 by skipping the checking of loads which
1d019706d866 LLVM10 anatofuz parents: diff changeset	788 use a fixed frame pointer offset.
1d019706d866 LLVM10 anatofuz parents: diff changeset	789
1d019706d866 LLVM10 anatofuz parents: diff changeset	790 The result of this optimization is that patterns like reloading a spilled
1d019706d866 LLVM10 anatofuz parents: diff changeset	791 register or accessing a global field don't get checked. This is a very
1d019706d866 LLVM10 anatofuz parents: diff changeset	792 significant performance win.
1d019706d866 LLVM10 anatofuz parents: diff changeset	793
1d019706d866 LLVM10 anatofuz parents: diff changeset	794
1d019706d866 LLVM10 anatofuz parents: diff changeset	795 ###### Don't check dependent loads
1d019706d866 LLVM10 anatofuz parents: diff changeset	796
1d019706d866 LLVM10 anatofuz parents: diff changeset	797 A core part of why this mitigation strategy works is that it establishes a
1d019706d866 LLVM10 anatofuz parents: diff changeset	798 data-flow check on the loaded address. However, this means that if the address
1d019706d866 LLVM10 anatofuz parents: diff changeset	799 itself was already loaded using a checked load, there is no need to check a
1d019706d866 LLVM10 anatofuz parents: diff changeset	800 dependent load provided it is within the same basic block as the checked load,
1d019706d866 LLVM10 anatofuz parents: diff changeset	801 and therefore has no additional predicates guarding it. Consider code like the
1d019706d866 LLVM10 anatofuz parents: diff changeset	802 following:
1d019706d866 LLVM10 anatofuz parents: diff changeset	803 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	804 ...
1d019706d866 LLVM10 anatofuz parents: diff changeset	805
1d019706d866 LLVM10 anatofuz parents: diff changeset	806 .LBB0_4: # %danger
1d019706d866 LLVM10 anatofuz parents: diff changeset	807 movq (%rcx), %rdi
1d019706d866 LLVM10 anatofuz parents: diff changeset	808 movl (%rdi), %edx
1d019706d866 LLVM10 anatofuz parents: diff changeset	809 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	810
1d019706d866 LLVM10 anatofuz parents: diff changeset	811 This will get transformed into:
1d019706d866 LLVM10 anatofuz parents: diff changeset	812 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	813 ...
1d019706d866 LLVM10 anatofuz parents: diff changeset	814
1d019706d866 LLVM10 anatofuz parents: diff changeset	815 .LBB0_4: # %danger
1d019706d866 LLVM10 anatofuz parents: diff changeset	816 cmovneq %r8, %rax # Conditionally update predicate state.
1d019706d866 LLVM10 anatofuz parents: diff changeset	817 orq %rax, %rcx # Mask the pointer if misspeculating.
1d019706d866 LLVM10 anatofuz parents: diff changeset	818 movq (%rcx), %rdi # Hardened load.
1d019706d866 LLVM10 anatofuz parents: diff changeset	819 movl (%rdi), %edx # Unhardened load due to dependent addr.
1d019706d866 LLVM10 anatofuz parents: diff changeset	820 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	821
1d019706d866 LLVM10 anatofuz parents: diff changeset	822 This doesn't check the load through `%rdi` as that pointer is dependent on a
1d019706d866 LLVM10 anatofuz parents: diff changeset	823 checked load already.
1d019706d866 LLVM10 anatofuz parents: diff changeset	824
1d019706d866 LLVM10 anatofuz parents: diff changeset	825
1d019706d866 LLVM10 anatofuz parents: diff changeset	826 ###### Protect large, load-heavy blocks with a single lfence
1d019706d866 LLVM10 anatofuz parents: diff changeset	827
1d019706d866 LLVM10 anatofuz parents: diff changeset	828 It may be worth using a single `lfence` instruction at the start of a block
1d019706d866 LLVM10 anatofuz parents: diff changeset	829 which begins with a (very) large number of loads that require independent
1d019706d866 LLVM10 anatofuz parents: diff changeset	830 protection and which require hardening the address of the load. However, this
1d019706d866 LLVM10 anatofuz parents: diff changeset	831 is unlikely to be profitable in practice. The latency hit of the hardening
1d019706d866 LLVM10 anatofuz parents: diff changeset	832 would need to exceed that of an `lfence` when correctly speculatively
1d019706d866 LLVM10 anatofuz parents: diff changeset	833 executed. But in that case, the `lfence` cost is a complete loss of speculative
1d019706d866 LLVM10 anatofuz parents: diff changeset	834 execution (at a minimum). So far, the evidence we have of the performance cost
1d019706d866 LLVM10 anatofuz parents: diff changeset	835 of using `lfence` indicates few if any hot code patterns where this trade off
1d019706d866 LLVM10 anatofuz parents: diff changeset	836 would make sense.
1d019706d866 LLVM10 anatofuz parents: diff changeset	837
1d019706d866 LLVM10 anatofuz parents: diff changeset	838
1d019706d866 LLVM10 anatofuz parents: diff changeset	839 ###### Tempting optimizations that break the security model
1d019706d866 LLVM10 anatofuz parents: diff changeset	840
1d019706d866 LLVM10 anatofuz parents: diff changeset	841 Several optimizations were considered which didn't pan out due to failure to
1d019706d866 LLVM10 anatofuz parents: diff changeset	842 uphold the security model. One in particular is worth discussing as many others
1d019706d866 LLVM10 anatofuz parents: diff changeset	843 will reduce to it.
1d019706d866 LLVM10 anatofuz parents: diff changeset	844
1d019706d866 LLVM10 anatofuz parents: diff changeset	845 We wondered whether only the first load in a basic block could be checked. If
1d019706d866 LLVM10 anatofuz parents: diff changeset	846 the check works as intended, it forms an invalid pointer that doesn't even
1d019706d866 LLVM10 anatofuz parents: diff changeset	847 virtual-address translate in the hardware. It should fault very early on in its
1d019706d866 LLVM10 anatofuz parents: diff changeset	848 processing. Maybe that would stop things in time for the misspeculated path to
1d019706d866 LLVM10 anatofuz parents: diff changeset	849 fail to leak any secrets. This doesn't end up working because the processor is
1d019706d866 LLVM10 anatofuz parents: diff changeset	850 fundamentally out-of-order, even in its speculative domain. As a consequence,
1d019706d866 LLVM10 anatofuz parents: diff changeset	851 the attacker could cause the initial address computation itself to stall and
1d019706d866 LLVM10 anatofuz parents: diff changeset	852 allow an arbitrary number of unrelated loads (including attacked loads of
1d019706d866 LLVM10 anatofuz parents: diff changeset	853 secret data) to pass through.
1d019706d866 LLVM10 anatofuz parents: diff changeset	854
1d019706d866 LLVM10 anatofuz parents: diff changeset	855
1d019706d866 LLVM10 anatofuz parents: diff changeset	856 #### Interprocedural Checking
1d019706d866 LLVM10 anatofuz parents: diff changeset	857
1d019706d866 LLVM10 anatofuz parents: diff changeset	858 Modern x86 processors may speculate into called functions and out of functions
1d019706d866 LLVM10 anatofuz parents: diff changeset	859 to their return address. As a consequence, we need a way to check loads that
1d019706d866 LLVM10 anatofuz parents: diff changeset	860 occur after a misspeculated predicate but where the load and the misspeculated
1d019706d866 LLVM10 anatofuz parents: diff changeset	861 predicate are in different functions. In essence, we need some interprocedural
1d019706d866 LLVM10 anatofuz parents: diff changeset	862 generalization of the predicate state tracking. A primary challenge to passing
1d019706d866 LLVM10 anatofuz parents: diff changeset	863 the predicate state between functions is that we would like to not require a
1d019706d866 LLVM10 anatofuz parents: diff changeset	864 change to the ABI or calling convention in order to make this mitigation more
1d019706d866 LLVM10 anatofuz parents: diff changeset	865 deployable, and further would like code mitigated in this way to be easily
1d019706d866 LLVM10 anatofuz parents: diff changeset	866 mixed with code not mitigated in this way and without completely losing the
1d019706d866 LLVM10 anatofuz parents: diff changeset	867 value of the mitigation.
1d019706d866 LLVM10 anatofuz parents: diff changeset	868
1d019706d866 LLVM10 anatofuz parents: diff changeset	869
1d019706d866 LLVM10 anatofuz parents: diff changeset	870 ##### Embed the predicate state into the high bit(s) of the stack pointer
1d019706d866 LLVM10 anatofuz parents: diff changeset	871
1d019706d866 LLVM10 anatofuz parents: diff changeset	872 We can use the same technique that allows hardening pointers to pass the
1d019706d866 LLVM10 anatofuz parents: diff changeset	873 predicate state into and out of functions. The stack pointer is trivially
1d019706d866 LLVM10 anatofuz parents: diff changeset	874 passed between functions and we can test for it having the high bits set to
1d019706d866 LLVM10 anatofuz parents: diff changeset	875 detect when it has been marked due to misspeculation. The callsite instruction
1d019706d866 LLVM10 anatofuz parents: diff changeset	876 sequence looks like (assuming a misspeculated state value of `-1`):
1d019706d866 LLVM10 anatofuz parents: diff changeset	877 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	878 ...
1d019706d866 LLVM10 anatofuz parents: diff changeset	879
1d019706d866 LLVM10 anatofuz parents: diff changeset	880 .LBB0_4: # %danger
1d019706d866 LLVM10 anatofuz parents: diff changeset	881 cmovneq %r8, %rax # Conditionally update predicate state.
1d019706d866 LLVM10 anatofuz parents: diff changeset	882 shlq $47, %rax
1d019706d866 LLVM10 anatofuz parents: diff changeset	883 orq %rax, %rsp
1d019706d866 LLVM10 anatofuz parents: diff changeset	884 callq other_function
1d019706d866 LLVM10 anatofuz parents: diff changeset	885 movq %rsp, %rax
1d019706d866 LLVM10 anatofuz parents: diff changeset	886 sarq $63, %rax # Sign extend the high bit to all bits.
1d019706d866 LLVM10 anatofuz parents: diff changeset	887 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	888
1d019706d866 LLVM10 anatofuz parents: diff changeset	889 This first puts the predicate state into the high bits of `%rsp` before calling
1d019706d866 LLVM10 anatofuz parents: diff changeset	890 the function and then reads it back out of high bits of `%rsp` afterward. When
1d019706d866 LLVM10 anatofuz parents: diff changeset	891 correctly executing (speculatively or not), these are all no-ops. When
1d019706d866 LLVM10 anatofuz parents: diff changeset	892 misspeculating, the stack pointer will end up negative. We arrange for it to
1d019706d866 LLVM10 anatofuz parents: diff changeset	893 remain a canonical address, but otherwise leave the low bits alone to allow
1d019706d866 LLVM10 anatofuz parents: diff changeset	894 stack adjustments to proceed normally without disrupting this. Within the
1d019706d866 LLVM10 anatofuz parents: diff changeset	895 called function, we can extract this predicate state and then reset it on
1d019706d866 LLVM10 anatofuz parents: diff changeset	896 return:
1d019706d866 LLVM10 anatofuz parents: diff changeset	897 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	898 other_function:
1d019706d866 LLVM10 anatofuz parents: diff changeset	899 # prolog
1d019706d866 LLVM10 anatofuz parents: diff changeset	901 movq %rsp, %rax
1d019706d866 LLVM10 anatofuz parents: diff changeset	902 sarq $63, %rax # Sign extend the high bit to all bits.
1d019706d866 LLVM10 anatofuz parents: diff changeset	903 # ...
1d019706d866 LLVM10 anatofuz parents: diff changeset	904
1d019706d866 LLVM10 anatofuz parents: diff changeset	905 .LBB0_N:
1d019706d866 LLVM10 anatofuz parents: diff changeset	906 cmovneq %r8, %rax # Conditionally update predicate state.
1d019706d866 LLVM10 anatofuz parents: diff changeset	907 shlq $47, %rax
1d019706d866 LLVM10 anatofuz parents: diff changeset	908 orq %rax, %rsp
1d019706d866 LLVM10 anatofuz parents: diff changeset	909 retq
1d019706d866 LLVM10 anatofuz parents: diff changeset	910 ```
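
The round trip of the predicate state through the stack pointer can be modeled in C as follows. This is an illustrative sketch only: the function names are hypothetical, the predicate state is assumed to be `0` or all ones, and the example stack pointer value is made up.

```c
#include <stdint.h>

/* Hypothetical model of shifting the state left by 47 and OR-ing it into
 * the stack pointer. A state of 0 leaves sp unchanged; a state of all ones
 * sets bits 47..63 and keeps the low 47 bits intact. */
static uint64_t merge_state_into_sp(uint64_t sp, uint64_t state) {
    return sp | (state << 47);
}

/* Hypothetical model of copying the stack pointer and arithmetically
 * shifting it right by 63, which replicates bit 63 into every bit. */
static uint64_t extract_state_from_sp(uint64_t sp) {
    return (uint64_t)0 - (sp >> 63);
}
```

For a user-space stack pointer such as `0x00007ffc12345678`, merging a state of `-1` yields `0xfffffffc12345678`: the upper bits are all ones, the low bits are preserved so stack adjustments still work, and extracting the state gives back `-1`.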
1d019706d866 LLVM10 anatofuz parents: diff changeset	911
1d019706d866 LLVM10 anatofuz parents: diff changeset	912 This approach is effective when all code is mitigated in this fashion, and can
1d019706d866 LLVM10 anatofuz parents: diff changeset	913 even survive very limited reaches into unmitigated code (the state will
1d019706d866 LLVM10 anatofuz parents: diff changeset	914 round-trip in and back out of an unmitigated function, it just won't be
1d019706d866 LLVM10 anatofuz parents: diff changeset	915 updated). But it does have some limitations. There is a cost to merging the
1d019706d866 LLVM10 anatofuz parents: diff changeset	916 state into `%rsp` and it doesn't insulate mitigated code from misspeculation in
1d019706d866 LLVM10 anatofuz parents: diff changeset	917 an unmitigated caller.
1d019706d866 LLVM10 anatofuz parents: diff changeset	918
1d019706d866 LLVM10 anatofuz parents: diff changeset	919 There is also an advantage to using this form of interprocedural mitigation: by
1d019706d866 LLVM10 anatofuz parents: diff changeset	920 forming these invalid stack pointer addresses we can prevent speculative
1d019706d866 LLVM10 anatofuz parents: diff changeset	921 returns from successfully reading speculatively written values to the actual
1d019706d866 LLVM10 anatofuz parents: diff changeset	922 stack. This works first by forming a data-dependency between computing the
1d019706d866 LLVM10 anatofuz parents: diff changeset	923 address of the return address on the stack and our predicate state. And even
1d019706d866 LLVM10 anatofuz parents: diff changeset	924 when satisfied, if a misprediction causes the state to be poisoned the
1d019706d866 LLVM10 anatofuz parents: diff changeset	925 resulting stack pointer will be invalid.
1d019706d866 LLVM10 anatofuz parents: diff changeset	926
1d019706d866 LLVM10 anatofuz parents: diff changeset	927
1d019706d866 LLVM10 anatofuz parents: diff changeset	928 ##### Rewrite API of internal functions to directly propagate predicate state
1d019706d866 LLVM10 anatofuz parents: diff changeset	929
1d019706d866 LLVM10 anatofuz parents: diff changeset	930 (Not yet implemented.)
1d019706d866 LLVM10 anatofuz parents: diff changeset	931
1d019706d866 LLVM10 anatofuz parents: diff changeset	932 We have the option with internal functions to directly adjust their API to
1d019706d866 LLVM10 anatofuz parents: diff changeset	933 accept the predicate as an argument and return it. This is likely to be
1d019706d866 LLVM10 anatofuz parents: diff changeset	934 marginally cheaper than embedding into `%rsp` for entering functions.
1d019706d866 LLVM10 anatofuz parents: diff changeset	935
1d019706d866 LLVM10 anatofuz parents: diff changeset	936
1d019706d866 LLVM10 anatofuz parents: diff changeset	937 ##### Use `lfence` to guard function transitions
1d019706d866 LLVM10 anatofuz parents: diff changeset	938
1d019706d866 LLVM10 anatofuz parents: diff changeset	939 An `lfence` instruction can be used to prevent subsequent loads from
1d019706d866 LLVM10 anatofuz parents: diff changeset	940 speculatively executing until all prior mispredicted predicates have resolved.
1d019706d866 LLVM10 anatofuz parents: diff changeset	941 We can use this broader barrier to speculative loads executing between
1d019706d866 LLVM10 anatofuz parents: diff changeset	942 functions. We emit it in the entry block to handle calls, and prior to each
1d019706d866 LLVM10 anatofuz parents: diff changeset	943 return. This approach also has the advantage of providing the strongest degree
1d019706d866 LLVM10 anatofuz parents: diff changeset	944 of mitigation when mixed with unmitigated code by halting all misspeculation
1d019706d866 LLVM10 anatofuz parents: diff changeset	945 entering a function which is mitigated, regardless of what occurred in the
1d019706d866 LLVM10 anatofuz parents: diff changeset	946 caller. However, such a mixture is inherently more risky. Whether this kind of
1d019706d866 LLVM10 anatofuz parents: diff changeset	947 mixture is a sufficient mitigation requires careful analysis.
1d019706d866 LLVM10 anatofuz parents: diff changeset	948
Unfortunately, experimental results indicate that the performance overhead of
this approach is very high for certain patterns of code. A classic example is
any form of recursive evaluation engine. The hot, rapid call and return
sequences exhibit dramatic performance loss when mitigated with `lfence`. This
component alone can regress performance by 2x or more, making it an unpleasant
tradeoff even when only used in a mixture of code.

##### Use an internal TLS location to pass predicate state

We can define a special thread-local value to hold the predicate state between
functions. This avoids direct ABI implications by using a side channel between
callers and callees to communicate the predicate state. It also allows implicit
zero-initialization of the state, which allows non-checked code to be the first
code executed.

However, this requires a load from TLS in the entry block, a store to TLS
before every call and every ret, and a load from TLS after every call. As a
consequence it is expected to be substantially more expensive even than using
`%rsp` and potentially `lfence` within the function entry block.

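The TLS traffic this implies can be modeled by hand in C. This is only a sketch: `slh_tls_state`, `tls_load`, `tls_store`, and the access counter are hypothetical names introduced for illustration, and a real implementation would emit these accesses in the backend rather than in source. Following the zero-initialization remark above, a zero state is taken to mean "not misspeculating".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical TLS slot for the predicate state; implicitly zero-initialized. */
static _Thread_local uintptr_t slh_tls_state;
/* Counts TLS accesses so the cost of the scheme is visible (illustration only). */
static _Thread_local unsigned tls_accesses;

static uintptr_t tls_load(void) { ++tls_accesses; return slh_tls_state; }
static void tls_store(uintptr_t s) { ++tls_accesses; slh_tls_state = s; }

static int callee(int x) {
    uintptr_t state = tls_load(); /* load in the entry block */
    int result = x + 1;
    tls_store(state);             /* store before ret */
    return result;
}

int caller(int x) {
    uintptr_t state = tls_load(); /* load in the entry block */
    tls_store(state);             /* store before the call */
    int y = callee(x);
    state = tls_load();           /* load after the call */
    assert(state == 0);           /* architecturally the state stays zero */
    tls_store(state);             /* store before ret */
    return y;
}

unsigned tls_access_count(void) { return tls_accesses; }
```

A single call through `caller` already performs six TLS accesses in this model, which is the overhead the paragraph above is concerned about.
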
##### Define a new ABI and/or calling convention

We could define a new ABI and/or calling convention to explicitly pass the
predicate state in and out of functions. This may be interesting if none of the
alternatives have adequate performance, but it makes deployment and adoption
dramatically more complex, and potentially infeasible.

## High-Level Alternative Mitigation Strategies

There are completely different alternative approaches to mitigating variant 1
attacks. [Most](https://lwn.net/Articles/743265/)
[discussion](https://lwn.net/Articles/744287/) so far focuses on mitigating
specific known attackable components in the Linux kernel (or other kernels) by
manually rewriting the code to contain an instruction sequence that is not
vulnerable. For x86 systems this is done by either injecting an `lfence`
instruction along the code path which would leak data if executed speculatively
or by rewriting memory accesses to have branch-less masking to a known safe
region. On Intel systems, `lfence` [will prevent the speculative load of secret
data](https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf).
On AMD systems `lfence` is currently a no-op, but can be made
dispatch-serializing by setting an MSR, and thus preclude misspeculation of the
code path ([mitigation G-2 +
V1-1](https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf)).

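The branch-less masking variant can be sketched in portable C as follows. The helper names `index_mask` and `read_clamped` are hypothetical, and the sketch assumes `size <= LONG_MAX`, a non-empty array, and an arithmetic right shift of negative values (which GCC and Clang provide). Production code typically uses inline assembly for the mask so that the compiler cannot turn it back into a branch; plain C like this carries no such guarantee.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Returns all-ones when index < size and 0 otherwise, without a branch.
   Assumes size <= LONG_MAX and arithmetic right shift of negative longs. */
static inline unsigned long index_mask(unsigned long index, unsigned long size) {
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (sizeof(long) * CHAR_BIT - 1));
}

/* Reads array[index], or returns -1 when index is out of bounds. */
int read_clamped(const int *array, unsigned long size, unsigned long index) {
    assert(array != NULL);
    if (index >= size)
        return -1;
    /* If the check above was bypassed speculatively, the mask is 0 and the
       access is redirected to element 0, a known safe location. */
    index &= index_mask(index, size);
    return array[index];
}
```
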
However, this relies on finding and enumerating all possible points in code
which could be attacked to leak information. While in some cases static
analysis is effective at doing this at scale, in many cases it still relies on
human judgement to evaluate whether code might be vulnerable. Especially for
software systems which receive less detailed scrutiny but remain sensitive to
these attacks, this seems like an impractical security model. We need an
automatic and systematic mitigation strategy.

### Automatic `lfence` on Conditional Edges

A natural way to scale up the existing hand-coded mitigations is simply to
inject an `lfence` instruction into both the target and fallthrough
destinations of every conditional branch. This ensures that no predicate or
bounds check can be bypassed speculatively. However, the performance overhead
of this approach is, simply put, catastrophic. Yet it remains the only truly
"secure by default" approach known prior to this effort and serves as the
baseline for performance.

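Written out by hand for a single branch, the transformation looks roughly like the C sketch below. `SPEC_FENCE` and `lookup_or_default` are hypothetical names, the sketch assumes GCC or Clang, and on targets without SSE2 the macro degrades to a compiler-only barrier. An automatic scheme would apply the same pattern to every conditional branch in the program, which is where the cost comes from.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_lfence */
#define SPEC_FENCE() _mm_lfence()
#else
/* Fallback: compiler-only barrier, NOT a hardware speculation fence. */
#define SPEC_FENCE() __asm__ volatile("" ::: "memory")
#endif

/* Returns table[index] when in bounds, otherwise `fallback`. */
int lookup_or_default(const int *table, size_t size, size_t index, int fallback) {
    assert(table != NULL);
    if (index < size) {
        SPEC_FENCE(); /* fence on the in-bounds edge, before the load */
        return table[index];
    } else {
        SPEC_FENCE(); /* fence on the out-of-bounds edge as well */
        return fallback;
    }
}
```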
One attempt to address the performance overhead of this and make it more
realistic to deploy is [MSVC's /Qspectre
switch](https://blogs.msdn.microsoft.com/vcblog/2018/01/15/spectre-mitigations-in-msvc/).
Their technique is to use static analysis within the compiler to only insert
`lfence` instructions into conditional edges at risk of attack. However,
[initial](https://arstechnica.com/gadgets/2018/02/microsofts-compiler-level-spectre-fix-shows-how-hard-this-problem-will-be-to-solve/)
[analysis](https://www.paulkocher.com/doc/MicrosoftCompilerSpectreMitigation.html)
has shown that this approach is incomplete and only catches a small and limited
subset of attackable patterns which happen to resemble very closely the initial
proofs of concept. As such, while its performance is acceptable, it does not
appear to be an adequate systematic mitigation.

## Performance Overhead

The performance overhead of this style of comprehensive mitigation is very
high. However, it compares very favorably with previously recommended
approaches such as the `lfence` instruction. Just as users can restrict the
scope of `lfence` to control its performance impact, this mitigation technique
could be restricted in scope as well.

However, it is important to understand what it would cost to get a fully
mitigated baseline. Here we assume targeting a Haswell (or newer) processor and
using all of the tricks to improve performance (so this leaves the low 2 GB
unprotected, along with +/- 2 GB surrounding any PC in the program). We ran
both Google's microbenchmark suite and a large highly-tuned server built using
ThinLTO and PGO. All were built with `-march=haswell` to give access to BMI2
instructions, and benchmarks were run on large Haswell servers. We collected
data both with an `lfence`-based mitigation and load hardening as presented
here. The summary is that mitigating with load hardening is 1.77x faster than
mitigating with `lfence`, and the overhead of load hardening compared to a
normal program is likely between a 10% overhead and a 50% overhead with most
large applications seeing a 30% overhead or less.

| Benchmark                              | `lfence` | Load Hardening | Mitigated Speedup |
| -------------------------------------- | -------: | -------------: | ----------------: |
| Google microbenchmark suite            |   -74.8% |         -36.4% |              2.5x |
| Large server QPS (using ThinLTO & PGO) |     -62% |           -29% |              1.8x |

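One way to read the "Mitigated Speedup" column is as the ratio of the throughput remaining under load hardening to the throughput remaining under `lfence`. That interpretation is an assumption, not something the table states, and `mitigated_speedup` is a hypothetical helper written only to check the arithmetic. Because the percentages in the table are rounded, recomputing from them gives roughly 2.52x and 1.87x, close to but not identical with the listed values.

```c
#include <assert.h>

/* Both arguments are fractional throughput changes relative to the
   unmitigated baseline, e.g. -0.62 for "-62%". Returns how many times
   faster the load-hardened build is than the lfence build. */
double mitigated_speedup(double lfence_delta, double hardening_delta) {
    assert(lfence_delta > -1.0 && hardening_delta > -1.0);
    return (1.0 + hardening_delta) / (1.0 + lfence_delta);
}
```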
Below is a visualization of the microbenchmark suite results which helps show
the distribution of results that is somewhat lost in the summary. The y-axis is
a log-scale speedup ratio of load hardening relative to `lfence` (up -> faster
-> better). Each box-and-whiskers represents one microbenchmark which may have
many different metrics measured. The red line marks the median, the box marks
the first and third quartiles, and the whiskers mark the min and max.

![Microbenchmark result visualization](speculative_load_hardening_microbenchmarks.png)

We don't yet have benchmark data on SPEC or the LLVM test suite, but we can
work on getting that. Still, the above should give a pretty clear
characterization of the performance, and specific benchmarks are unlikely to
reveal especially interesting properties.

### Future Work: Fine Grained Control and API-Integration

The performance overhead of this technique is likely to be very significant and
something users wish to control or reduce. There are interesting options here
that impact the implementation strategy used.

One particularly appealing option is to allow both opt-in and opt-out of this
mitigation at reasonably fine granularity such as on a per-function basis,
including intelligent handling of inlining decisions -- protected code can be
prevented from inlining into unprotected code, and unprotected code will become
protected when inlined into protected code. For systems where only a limited
set of code is reachable by externally controlled inputs, it may be possible to
limit the scope of mitigation through such mechanisms without compromising the
application's overall security. The performance impact may also be focused in a
few key functions that can be hand-mitigated in ways that have lower
performance overhead while the remainder of the application receives automatic
protection.

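Clang exposes per-function control of this kind through the `speculative_load_hardening` and `no_speculative_load_hardening` function attributes, alongside the `-mspeculative-load-hardening` flag. The sketch below shows how such opt-in and opt-out might look in C; the macros `SLH_ON` and `SLH_OFF` and both functions are hypothetical, and on compilers without the attributes the macros expand to nothing, so the code still builds but is not hardened.

```c
#include <assert.h>
#include <stddef.h>

/* Use the attributes only where the compiler reports support for them. */
#if defined(__has_attribute)
#  if __has_attribute(speculative_load_hardening)
#    define SLH_ON  __attribute__((speculative_load_hardening))
#    define SLH_OFF __attribute__((no_speculative_load_hardening))
#  endif
#endif
#ifndef SLH_ON
#  define SLH_ON  /* no compiler support: no hardening is applied */
#  define SLH_OFF
#endif

/* Reachable with externally controlled indices: opt in to hardening. */
SLH_ON int checked_get(const int *table, size_t size, size_t index) {
    assert(table != NULL);
    return index < size ? table[index] : 0;
}

/* Hot internal helper that never sees external input: opt out. */
SLH_OFF long sum_all(const int *table, size_t size) {
    assert(table != NULL);
    long total = 0;
    for (size_t i = 0; i < size; ++i)
        total += table[i];
    return total;
}
```
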
For either limiting the scope of mitigation or manually mitigating hot
functions, there needs to be some support for mixing mitigated and unmitigated
code without completely defeating the mitigation. For the first use case, it
would be particularly desirable that mitigated code remains safe when being
called during misspeculation from unmitigated code.

For the second use case, it may be important to connect the automatic
mitigation technique to explicit mitigation APIs such as what is described in
http://wg21.link/p0928 (or any other eventual API) so that there is a clean way
to switch from automatic to manual mitigation without immediately exposing a
hole. However, the design for how to do this is hard to come up with until the
APIs are better established. We will revisit this as those APIs mature.
