annotate llvm/docs/SpeculativeLoadHardening.md @ 164:fdfabb438fbf

...
author anatofuz
date Thu, 19 Mar 2020 17:02:53 +0900
# Speculative Load Hardening

### A Spectre Variant #1 Mitigation Technique

Author: Chandler Carruth - [chandlerc@google.com](mailto:chandlerc@google.com)

## Problem Statement

Recently, Google Project Zero and other researchers have found information leak
vulnerabilities by exploiting speculative execution in modern CPUs. These
exploits are currently broken down into three variants:
* GPZ Variant #1 (a.k.a. Spectre Variant #1): Bounds check (or predicate) bypass
* GPZ Variant #2 (a.k.a. Spectre Variant #2): Branch target injection
* GPZ Variant #3 (a.k.a. Meltdown): Rogue data cache load

For more details, see the Google Project Zero blog post and the Spectre research
paper:
* https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
* https://spectreattack.com/spectre.pdf

The core problem of GPZ Variant #1 is that speculative execution uses branch
prediction to select the path of instructions speculatively executed. This path
is speculatively executed with the available data, and may load from memory and
leak the loaded values through various side channels that survive even when the
speculative execution is unwound due to being incorrect. Mispredicted paths can
cause code to be executed with data inputs that never occur in correct
executions, making checks against malicious inputs ineffective and allowing
attackers to use malicious data inputs to leak secret data. Here is an example,
extracted and simplified from the Project Zero paper:
```
struct array {
  unsigned long length;
  unsigned char data[];
};
struct array *arr1 = ...; // small array
struct array *arr2 = ...; // array of size 0x400
unsigned long untrusted_offset_from_caller = ...;
if (untrusted_offset_from_caller < arr1->length) {
  unsigned char value = arr1->data[untrusted_offset_from_caller];
  unsigned long index2 = ((value&1)*0x100)+0x200;
  unsigned char value2 = arr2->data[index2];
}
```

The key to the attack is to call this with an `untrusted_offset_from_caller`
that is far outside of the bounds when the branch predictor will predict that it
will be in-bounds. In that case, the body of the `if` will be executed
speculatively, and may read secret data into `value` and leak it via a
cache-timing side channel when a dependent access is made to populate `value2`.

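To make the leak concrete, here is a minimal sketch of the arithmetic behind `index2`. It is not part of the original example: the helper names `probe_index`, `probe_cache_line`, and `recover_bit` are hypothetical, and the 64-byte cache line size is an assumption. One bit of `value` selects between two offsets that are 0x100 bytes apart, so the two possible accesses touch different cache lines, and an attacker who can tell which line became cached learns that bit.

```cpp
#include <cstdint>

// Assumption for illustration only: 64-byte cache lines.
constexpr std::uint64_t kCacheLineSize = 64;

// Mirrors the expression from the example: ((value & 1) * 0x100) + 0x200.
constexpr std::uint64_t probe_index(unsigned char value) {
  return ((value & 1u) * 0x100u) + 0x200u;
}

// Cache line, relative to the start of arr2->data, touched by the access.
constexpr std::uint64_t probe_cache_line(unsigned char value) {
  return probe_index(value) / kCacheLineSize;
}

// What an attacker infers after timing both candidate offsets:
// if offset 0x300 is the cached one, the low bit of `value` was 1.
constexpr unsigned recover_bit(std::uint64_t hot_offset) {
  return hot_offset == 0x300u ? 1u : 0u;
}

static_assert(probe_index(0x00) == 0x200, "low bit 0 selects offset 0x200");
static_assert(probe_index(0x01) == 0x300, "low bit 1 selects offset 0x300");
static_assert(probe_cache_line(0x00) != probe_cache_line(0x01),
              "the two offsets fall on different cache lines");
```

Repeating the same access pattern with a different bit selected (for example `value & 2`) would, under the same assumptions, recover the secret byte one bit at a time.
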
## High Level Mitigation Approach

While several approaches are being actively pursued to mitigate specific
branches and/or loads inside especially risky software (most notably various OS
kernels), these approaches require manual and/or static analysis aided auditing
of code and explicit source changes to apply the mitigation. They are unlikely
to scale well to large applications. We are proposing a comprehensive
mitigation approach that would apply automatically across an entire program
rather than through manual changes to the code. While this is likely to have a
high performance cost, some applications may be in a good position to take this
performance / security tradeoff.

The specific technique we propose is to cause loads to be checked using
branchless code to ensure that they are executing along a valid control flow
path. Consider the following C-pseudo-code representing the core idea of a
predicate guarding potentially invalid loads:
```
void leak(int data);
void example(int* pointer1, int* pointer2) {
  if (condition) {
    // ... lots of code ...
    leak(*pointer1);
  } else {
    // ... more code ...
    leak(*pointer2);
  }
}
```

This would get transformed into something resembling the following:
```
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  if (condition) {
    // Assuming ?: is implemented using branchless logic...
    predicate_state = !condition ? all_zeros_mask : predicate_state;
    // ... lots of code ...
    //
    // Harden the pointer so it can't be loaded
    pointer1 &= predicate_state;
    leak(*pointer1);
  } else {
    predicate_state = condition ? all_zeros_mask : predicate_state;
    // ... more code ...
    //
    // Alternative: Harden the loaded value
    int value2 = *pointer2 & predicate_state;
    leak(value2);
  }
}
```

The result should be that if the `if (condition) {` branch is mis-predicted,
there is a *data* dependency on the condition used to zero out any pointers
prior to loading through them or to zero out all of the loaded bits. Even
though this code pattern may still execute speculatively, *invalid* speculative
executions are prevented from leaking secret data from memory (but note that
this data might still be loaded in safe ways, and some regions of memory are
required to not hold secrets; see below for detailed limitations). This
approach only requires that the underlying hardware have a way to implement a
branchless and unpredicted conditional update of a register's value. All modern
architectures have support for this, and in fact such support is necessary to
correctly implement constant-time cryptographic primitives.

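The pseudo-code above is not valid C++ as written (for example, `&=` cannot be applied to a pointer). The following is a minimal, compilable sketch of the same masking idea. The helper names `mask_from`, `update_state`, `harden_pointer`, and `harden_value` are hypothetical, and this source-level version is only a model: a compiler is free to turn these expressions back into branches, so real hardening has to be emitted by the compiler backend.

```cpp
#include <cstdint>
#include <limits>

constexpr std::uintptr_t kAllOnesMask =
    std::numeric_limits<std::uintptr_t>::max();
constexpr std::uintptr_t kAllZerosMask = 0;

// All-ones when `ok` is true, all-zeros otherwise, computed arithmetically
// (0 - 1 wraps to all-ones for unsigned types).
constexpr std::uintptr_t mask_from(bool ok) {
  return kAllZerosMask - static_cast<std::uintptr_t>(ok);
}

// Accumulate the predicate state: once it becomes zero it stays zero.
constexpr std::uintptr_t update_state(std::uintptr_t state,
                                      bool edge_is_valid) {
  return state & mask_from(edge_is_valid);
}

// Harden the pointer: a zeroed state turns the address into zero.
inline const int *harden_pointer(const int *p, std::uintptr_t state) {
  return reinterpret_cast<const int *>(
      reinterpret_cast<std::uintptr_t>(p) & state);
}

// Alternative: harden the loaded value by zeroing every loaded bit.
constexpr int harden_value(int loaded, std::uintptr_t state) {
  return static_cast<int>(static_cast<unsigned>(loaded) &
                          static_cast<unsigned>(state));
}
```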
Crucial properties of this approach:
* It is not preventing any particular side-channel from working. This is
  important as there are an unknown number of potential side channels and we
  expect to continue discovering more. Instead, it prevents the observation of
  secret data in the first place.
* It accumulates the predicate state, protecting even in the face of nested
  *correctly* predicted control flows.
* It passes this predicate state across function boundaries to provide
  [interprocedural protection](#interprocedural-checking).
* When hardening the address of a load, it uses a *destructive* or
  *non-reversible* modification of the address to prevent an attacker from
  reversing the check using attacker-controlled inputs.
* It does not completely block speculative execution, and merely prevents
  *mis*-speculated paths from leaking secrets from memory (and stalls
  speculation until this can be determined).
* It is completely general and makes no fundamental assumptions about the
  underlying architecture other than the ability to do branchless conditional
  data updates and a lack of value prediction.
* It does not require programmers to identify all possible secret data using
  static source code annotations or code vulnerable to a variant #1 style
  attack.

Limitations of this approach:
* It requires re-compiling source code to insert hardening instruction
  sequences. Only software compiled in this mode is protected.
* The performance is heavily dependent on a particular architecture's
  implementation strategy. We outline a potential x86 implementation below and
  characterize its performance.
* It does not defend against secret data already loaded from memory and
  residing in registers or leaked through other side-channels in
  non-speculative execution. Code dealing with this, e.g., cryptographic
  routines, already uses constant-time algorithms and code to prevent
  side-channels. Such code should also scrub registers of secret data following
  [these
  guidelines](https://github.com/HACS-workshop/spectre-mitigations/blob/master/crypto_guidelines.md).
* To achieve reasonable performance, many loads may not be checked, such as
  those with compile-time fixed addresses. This primarily consists of accesses
  at compile-time constant offsets of global and local variables. Code which
  needs this protection and intentionally stores secret data must ensure the
  memory regions used for secret data are necessarily dynamic mappings or heap
  allocations. This is an area which can be tuned to provide more comprehensive
  protection at the cost of performance.
* [Hardened loads](#hardening-the-address-of-the-load) may still load data from
  _valid_ addresses if not _attacker-controlled_ addresses. To prevent these
  from reading secret data, the low 2 GB of the address space and 2 GB above and
  below any executable pages should be protected.

Credit:
* The core idea of tracing misspeculation through data and marking pointers to
  block misspeculated loads was developed as part of a HACS 2018 discussion
  between Chandler Carruth, Paul Kocher, Thomas Pornin, and several other
  individuals.
* The core idea of masking out loaded bits was part of the original mitigation
  suggested by Jann Horn when these attacks were reported.


### Indirect Branches, Calls, and Returns

It is possible to attack control flow other than conditional branches with
variant #1 style mispredictions.
* A prediction towards a hot call target of a virtual method can lead to it
  being speculatively executed when an unexpected type is used (often called
  "type confusion").
* A hot case may be speculatively executed due to prediction instead of the
  correct case for a switch statement implemented as a jump table.
* A hot common return address may be predicted incorrectly when returning from
  a function.

These code patterns are also vulnerable to Spectre variant #2, and as such are
best mitigated with a
[retpoline](https://support.google.com/faqs/answer/7625886) on x86 platforms.
When a mitigation technique like retpoline is used, speculation simply cannot
proceed through an indirect control flow edge (or it cannot be mispredicted in
the case of a filled RSB) and so it is also protected from variant #1 style
attacks. However, some architectures, micro-architectures, or vendors do not
employ the retpoline mitigation, and on future x86 hardware (both Intel and
AMD) it is expected to become unnecessary due to hardware-based mitigation.

When not using a retpoline, these edges will need independent protection from
variant #1 style attacks. The analogous approach to that used for conditional
control flow should work:
```
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
  uintptr_t predicate_state = all_ones_mask;
  switch (condition) {
  case 0:
    // Assuming ?: is implemented using branchless logic...
    predicate_state = (condition != 0) ? all_zeros_mask : predicate_state;
    // ... lots of code ...
    //
    // Harden the pointer so it can't be loaded
    pointer1 &= predicate_state;
    leak(*pointer1);
    break;

  case 1:
    predicate_state = (condition != 1) ? all_zeros_mask : predicate_state;
    // ... more code ...
    //
    // Alternative: Harden the loaded value
    int value2 = *pointer2 & predicate_state;
    leak(value2);
    break;

  // ...
  }
}
```

The core idea remains the same: validate the control flow using data-flow and
use that validation to check that loads cannot leak information along
misspeculated paths. Typically this involves passing the desired target of such
control flow across the edge and checking that it is correct afterwards. Note
that while it is tempting to think that this mitigates variant #2 attacks, it
does not. Those attacks go to arbitrary gadgets that don't include the checks.

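As a rough illustration of "passing the desired target across the edge and checking it afterwards", the sketch below models an indirect dispatch in plain C++. Everything in it is hypothetical (the `Handler` signature, the `handler` template, and `dispatch`); a misprediction is simulated by deliberately calling a different table slot than the intended one. It follows the all-ones-is-valid mask convention of the pseudo-code above.

```cpp
#include <cstdint>
#include <limits>

constexpr std::uintptr_t kAllOnes =
    std::numeric_limits<std::uintptr_t>::max();

// Hypothetical handler signature: the caller passes the index it intended to
// dispatch on, plus the predicate state accumulated so far.
using Handler = std::uintptr_t (*)(std::uintptr_t intended_index,
                                   std::uintptr_t state);

// Each destination knows its own index and validates the intended one against
// it, returning the updated predicate state.
template <std::uintptr_t MyIndex>
std::uintptr_t handler(std::uintptr_t intended_index, std::uintptr_t state) {
  // All-ones if we arrived at the intended destination, all-zeros otherwise.
  std::uintptr_t arrived_ok =
      std::uintptr_t{0} - static_cast<std::uintptr_t>(intended_index == MyIndex);
  return state & arrived_ok;
}

// `actual_slot` stands in for where the (possibly mispredicted) indirect
// branch really went; `intended_index` is the value carried across the edge.
inline std::uintptr_t dispatch(const Handler *table, std::uintptr_t actual_slot,
                               std::uintptr_t intended_index) {
  return table[actual_slot](intended_index, kAllOnes);
}
```

A destination reached with a mismatched index ends up with a zeroed predicate state, which the load hardening shown earlier would then apply to its pointers or loaded values.
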

### Variant #1.1 and #1.2 attacks: "Bounds Check Bypass Store"

Beyond the core variant #1 attack, there are techniques to extend this attack.
The primary technique is known as "Bounds Check Bypass Store" and is discussed
in this research paper: https://people.csail.mit.edu/vlk/spectre11.pdf

We will analyze these two variants independently. First, variant #1.1 works by
speculatively storing over the return address after a bounds check bypass. This
speculative store then ends up being used by the CPU during speculative
execution of the return, potentially directing speculative execution to
arbitrary gadgets in the binary. Let's look at an example.
```
unsigned char local_buffer[4];
unsigned char *untrusted_data_from_caller = ...;
unsigned long untrusted_size_from_caller = ...;
if (untrusted_size_from_caller < sizeof(local_buffer)) {
  // Speculative execution enters here with a too-large size.
  memcpy(local_buffer, untrusted_data_from_caller,
         untrusted_size_from_caller);
  // The stack has now been smashed, writing an attacker-controlled
  // address over the return address.
  minor_processing(local_buffer);
  return;
  // Control will speculate to the attacker-written address.
}
```

However, this can be mitigated by hardening the load of the return address just
like any other load. This is sometimes complicated because x86, for example,
*implicitly* loads the return address off the stack. However, the
implementation technique below is specifically designed to mitigate this
implicit load by using the stack pointer to communicate misspeculation between
functions. This additionally causes a misspeculation to have an invalid stack
pointer and never be able to read the speculatively stored return address. See
the detailed discussion below.

For variant #1.2, the attacker speculatively stores into the vtable or jump
table used to implement an indirect call or indirect jump. Because this is
speculative, this will often be possible even when these are stored in
read-only pages. For example:
```
class FancyObject : public BaseObject {
public:
  void DoSomething() override;
};
void f(unsigned long attacker_offset, unsigned long attacker_data) {
  FancyObject object = getMyObject();
  unsigned long *arr[4] = getFourDataPointers();
  if (attacker_offset < 4) {
    // We have bypassed the bounds check speculatively.
    unsigned long *data = arr[attacker_offset];
    // Now we have computed a pointer inside of `object`, the vptr.
    *data = attacker_data;
    // The vptr points to the virtual table and we speculatively clobber that.
    g(object); // Hand the object to some other routine.
  }
}
// In another file, we call a method on the object.
void g(BaseObject &object) {
  object.DoSomething();
  // This speculatively calls the address stored over the vtable.
}
```

Mitigating this requires hardening loads from these locations, or mitigating
the indirect call or indirect jump. Any of these is sufficient to block the
call or jump from using a speculatively stored value that has been read back.

For both of these, using retpolines would be equally sufficient. One possible
hybrid approach is to use retpolines for indirect call and jump, while relying
on SLH to mitigate returns.

Another approach that is sufficient for both of these is to harden all of the
speculative stores. However, as most stores aren't interesting and don't
inherently leak data, this is expected to be prohibitively expensive given the
attack it is defending against.


## Implementation Details

There are a number of complex details impacting the implementation of this
technique, both on a particular architecture and within a particular compiler.
We discuss proposed implementation techniques for the x86 architecture and the
LLVM compiler. These are primarily to serve as an example, as other
implementation techniques are very possible.


### x86 Implementation Details

On the x86 platform we break down the implementation into three core
components: accumulating the predicate state through the control flow graph,
checking the loads, and checking control transfers between procedures.


#### Accumulating Predicate State

Consider baseline x86 instructions like the following, which test three
conditions and, if all pass, load data from memory and potentially leak it
through some side channel:
```
# %bb.0:                                # %entry
        pushq   %rax
        testl   %edi, %edi
        jne     .LBB0_4
# %bb.1:                                # %then1
        testl   %esi, %esi
        jne     .LBB0_4
# %bb.2:                                # %then2
        testl   %edx, %edx
        je      .LBB0_3
.LBB0_4:                                # %exit
        popq    %rax
        retq
.LBB0_3:                                # %danger
        movl    (%rcx), %edi
        callq   leak
        popq    %rax
        retq
```

When we go to speculatively execute the load, we want to know whether any of
the dynamically executed predicates have been misspeculated. To track that,
along each conditional edge, we need to track the data which would allow that
edge to be taken. On x86, this data is stored in the flags register used by the
conditional jump instruction. Along both edges after this fork in control flow,
the flags register remains alive and contains data that we can use to build up
our accumulated predicate state. We accumulate it using the x86 conditional
move instruction, which also reads the flags register where the state resides.
These conditional move instructions are known to not be predicted on any x86
processors, making them immune to misprediction that could reintroduce the
vulnerability. When we insert the conditional moves, the code ends up looking
like the following:
```
# %bb.0:                                # %entry
        pushq   %rax
        xorl    %eax, %eax              # Zero out initial predicate state.
        movq    $-1, %r8                # Put all-ones mask into a register.
        testl   %edi, %edi
        jne     .LBB0_1
# %bb.2:                                # %then1
        cmovneq %r8, %rax               # Conditionally update predicate state.
        testl   %esi, %esi
        jne     .LBB0_1
# %bb.3:                                # %then2
        cmovneq %r8, %rax               # Conditionally update predicate state.
        testl   %edx, %edx
        je      .LBB0_4
.LBB0_1:
        cmoveq  %r8, %rax               # Conditionally update predicate state.
        popq    %rax
        retq
.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
```

anatofuz
parents:
diff changeset
394 Here we create the "empty" or "correct execution" predicate state by zeroing
anatofuz
parents:
diff changeset
395 `%rax`, and we create a constant "incorrect execution" predicate value by
anatofuz
parents:
diff changeset
396 putting `-1` into `%r8`. Then, along each edge coming out of a conditional
anatofuz
parents:
diff changeset
397 branch we do a conditional move that in a correct execution will be a no-op,
anatofuz
parents:
diff changeset
398 but if misspeculated, will replace the `%rax` with the value of `%r8`.
anatofuz
parents:
diff changeset
399 Misspeculating any one of the three predicates will cause `%rax` to hold the
anatofuz
parents:
diff changeset
400 "incorrect execution" value from `%r8` as we preserve incoming values when
anatofuz
parents:
diff changeset
401 execution is correct rather than overwriting it.
anatofuz
parents:
diff changeset
402
anatofuz
parents:
diff changeset
403 We now have a value in `%rax` in each basic block that indicates if at some
anatofuz
parents:
diff changeset
404 point previously a predicate was mispredicted. And we have arranged for that
anatofuz
parents:
diff changeset
405 value to be particularly effective when used below to harden loads.
anatofuz
parents:
diff changeset
406
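As a rough illustration of the accumulation described above, the following C
sketch models each `cmovCC` as a branchless select. It is only a model: C code
cannot observe misspeculation, so the hypothetical `followed_taken_edge` and
`condition_true` parameters stand in for the edge that execution followed and
the value the predicate really has. The names are assumptions made for this
example and are not part of the LLVM implementation.

```c
#include <stdint.h>

/* All-ones "incorrect execution" value, playing the role of %r8. */
#define SLH_POISON UINT64_MAX

/*
 * Model of one `cmovCC %r8, %rax` placed on a conditional edge.
 * followed_taken_edge: nonzero if execution went down the "taken" edge.
 * condition_true:      nonzero if the branch condition really holds.
 * When the two disagree the edge was misspeculated and the state is
 * replaced by the poison value; otherwise the incoming state is preserved.
 */
static uint64_t slh_update_state(uint64_t state, int followed_taken_edge,
                                 int condition_true) {
    uint64_t misspeculated =
        (uint64_t)((followed_taken_edge != 0) != (condition_true != 0));
    uint64_t select = (uint64_t)0 - misspeculated; /* 0 or all-ones */
    return (state & ~select) | (SLH_POISON & select);
}
```

Starting from a zero state, a single mismatching edge sets every bit, and later
matching edges leave the poisoned state untouched, which mirrors how `%rax`
behaves in the listing above.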

##### Indirect Call, Branch, and Return Predicates

There is no analogous flag to use when tracing indirect calls, branches, and
returns. The predicate state must be accumulated through some other means.
Fundamentally, this is the reverse of the problem posed in CFI: we need to
check where we came from rather than where we are going. For function-local
jump tables, this is easily arranged by testing the input to the jump table
within each destination (not yet implemented, use retpolines):
```
        pushq   %rax
        xorl    %eax, %eax              # Zero out initial predicate state.
        movq    $-1, %r8                # Put all-ones mask into a register.
        jmpq    *.LJTI0_0(,%rdi,8)      # Indirect jump through table.
.LBB0_2:                                # %sw.bb
        cmpq    $0, %rdi                # Validate index used for jump table.
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
        jmp     _Z4leaki                # TAILCALL

.LBB0_3:                                # %sw.bb1
        cmpq    $1, %rdi                # Validate index used for jump table.
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
        jmp     _Z4leaki                # TAILCALL

.LBB0_5:                                # %sw.bb10
        cmpq    $2, %rdi                # Validate index used for jump table.
        cmovneq %r8, %rax               # Conditionally update predicate state.
        ...
        jmp     _Z4leaki                # TAILCALL
        ...

        .section        .rodata,"a",@progbits
        .p2align        3
.LJTI0_0:
        .quad   .LBB0_2
        .quad   .LBB0_3
        .quad   .LBB0_5
        ...
```

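The per-destination check can be modeled in C as well. This is a hypothetical
sketch (the function name is invented for illustration, and the scheme itself
is described above as not yet implemented): each destination compares the index
that was used for the jump against the one table slot that legitimately leads
to it, and poisons the predicate state on a mismatch.

```c
#include <stdint.h>

/*
 * Model of the check at the top of a jump table destination.
 * index:         value that was used to index the jump table (%rdi above).
 * expected_slot: the table slot that legitimately reaches this block.
 * Returns the updated predicate state: all-ones on a mismatch, otherwise
 * the incoming state unchanged.
 */
static uint64_t slh_check_jump_table_entry(uint64_t state, uint64_t index,
                                           uint64_t expected_slot) {
    uint64_t mismatch = (uint64_t)(index != expected_slot);
    return state | ((uint64_t)0 - mismatch);
}
```
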
Returns have a simple mitigation technique on x86-64 (or other ABIs which have
what is called a "red zone" region beyond the end of the stack). This region is
guaranteed to be preserved across interrupts and context switches, so the
return address used to return to the current code remains on the stack and is
valid to read. We can emit code in the caller to verify that a return edge was
not mispredicted:
```
        callq   other_function
return_addr:
        cmpq    $return_addr, -8(%rsp)  # Validate return address.
        cmovneq %r8, %rax               # Update predicate state.
```

For an ABI without a "red zone" (and thus unable to read the return address
from the stack), we can compute the expected return address prior to the call
into a register preserved across the call and use that similarly to the above.

Indirect calls (and returns in the absence of a red zone ABI) pose the most
significant challenge for propagating predicate state. The simplest technique
would be to define a new ABI such that the intended call target is passed into
the called function and checked in the entry. Unfortunately, new ABIs are quite
expensive to deploy in C and C++. While the target function could be passed in
TLS, we would still require complex logic to handle a mixture of functions
compiled with and without this extra logic (essentially, making the ABI
backwards compatible). Currently, we suggest using retpolines here and will
continue to investigate ways of mitigating this.


##### Optimizations, Alternatives, and Tradeoffs

Merely accumulating predicate state involves significant cost. There are
several key optimizations we employ to minimize this and various alternatives
that present different tradeoffs in the generated code.

First, we work to reduce the number of instructions used to track the state:
* Rather than inserting a `cmovCC` instruction along every conditional edge in
  the original program, we track each set of condition flags we need to capture
  prior to entering each basic block and reuse a common `cmovCC` sequence for
  those.
  * We could further reuse suffixes when there are multiple `cmovCC`
    instructions required to capture the set of flags. Currently this is
    not believed to be worth the cost as paired flags are relatively rare and
    suffixes of them are exceedingly rare.
* A common pattern in x86 is to have multiple conditional jump instructions
  that use the same flags but handle different conditions. Naively, we could
  consider each fallthrough between them an "edge" but this causes a much more
  complex control flow graph. Instead, we accumulate the set of conditions
  necessary for fallthrough and use a sequence of `cmovCC` instructions in a
  single fallthrough edge to track it.

Second, we trade register pressure for simpler `cmovCC` instructions by
allocating a register for the "bad" state. We could read that value from memory
as part of the conditional move instruction; however, this creates more
micro-ops and requires the load-store unit to be involved. Currently, we place
the value into a virtual register and allow the register allocator to decide
when the register pressure is sufficient to make it worth spilling to memory
and reloading.


#### Hardening Loads

Once we have the predicate accumulated into a special value for correct vs.
misspeculated, we need to apply this to loads in a way that ensures they do not
leak secret data. There are two primary techniques for this: we can either
harden the loaded value to prevent observation, or we can harden the address
itself to prevent the load from occurring. These have significantly different
performance tradeoffs.


##### Hardening loaded values

The most appealing way to harden loads is to mask out all of the bits loaded.
The key requirement is that for each bit loaded, along the misspeculated path
that bit is always fixed at either 0 or 1 regardless of the value of the bit
loaded. The most obvious implementation uses either an `and` instruction with
an all-zero mask along misspeculated paths and an all-one mask along correct
paths, or an `or` instruction with an all-one mask along misspeculated paths
and an all-zero mask along correct paths. Other options, such as multiplying by
zero or using multiple shift instructions, are less appealing. For reasons we
elaborate on below, we end up suggesting you use `or` with an all-ones mask,
making the x86 instruction sequence look like the following:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        movl    (%rsi), %edi            # Load potentially secret data from %rsi.
        orl     %eax, %edi
```

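To make the two masking options concrete, here is a small C sketch. The helper
names are hypothetical; `state` is assumed to be the accumulated predicate
state, zero on a correct path and all-ones on a misspeculated one, so the `and`
variant has to use the complement of the state as its mask.

```c
#include <stdint.h>

/* `or` variant: every loaded bit is forced to 1 when state is all-ones. */
static uint32_t harden_value_or(uint32_t loaded, uint64_t state) {
    return loaded | (uint32_t)state;
}

/* `and` variant: every loaded bit is forced to 0 when state is all-ones.
   The mask is the complement of the predicate state. */
static uint32_t harden_value_and(uint32_t loaded, uint64_t state) {
    return loaded & ~(uint32_t)state;
}
```
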
Another useful pattern may be to fold the load into the `or` instruction
itself, at the cost of a register-to-register copy.

There are some challenges with deploying this approach:
1. Many loads on x86 are folded into other instructions. Separating them would
   add very significant and costly register pressure with prohibitive
   performance cost.
1. Loads may not target a general purpose register, requiring extra
   instructions to map the state value into the correct register class, and
   potentially more expensive instructions to mask the value in some way.
1. The flags registers on x86 are very likely to be live, and challenging to
   preserve cheaply.
1. There are many more values loaded than pointers & indices used for loads. As
   a consequence, hardening the result of a load requires substantially more
   instructions than hardening the address of the load (see below).

Despite these challenges, hardening the result of the load critically allows
the load to proceed and thus has dramatically less impact on the total
speculative / out-of-order potential of the execution. There are also several
interesting techniques to try and mitigate these challenges and make hardening
the results of loads viable in at least some cases. However, when hardening the
loaded value is unprofitable, we generally expect to fall back to the next
approach of hardening the address itself.


###### Loads folded into data-invariant operations can be hardened after the operation

The first key to making this feasible is to recognize that many operations on
x86 are "data-invariant". That is, they have no (known) observable behavior
differences due to the particular input data. These instructions are often used
when implementing cryptographic primitives dealing with private key data
because they are not believed to provide any side-channels. Similarly, we can
defer hardening until after them as they will not in and of themselves
introduce a speculative execution side-channel. This results in code sequences
that look like:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        addl    (%rsi), %edi            # Load and accumulate without leaking.
        orl     %eax, %edi
```

While an addition happens to the loaded (potentially secret) value, that
doesn't leak any data and we then immediately harden it.


###### Hardening of loaded values deferred down the data-invariant expression graph

We can generalize the previous idea and sink the hardening down the expression
graph across as many data-invariant operations as desirable. This can use very
conservative rules for whether something is data-invariant. The primary goal
should be to handle multiple loads with a single hardening instruction:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        addl    (%rsi), %edi            # Load and accumulate without leaking.
        addl    4(%rsi), %edi           # Continue without leaking.
        addl    8(%rsi), %edi
        orl     %eax, %edi              # Mask out bits from all three loads.
```

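The deferred form can be sketched in C as follows. This is only a model of the
data flow, with a hypothetical function name: the three loads are folded into
wrapping additions, which the text treats as data-invariant, and a single `or`
with the predicate state hardens the combined result.

```c
#include <stdint.h>

/*
 * acc:   running value (%edi in the listing above).
 * p:     pointer to at least three 32-bit values (%rsi above).
 * state: predicate state, 0 when correct and all-ones when misspeculated.
 */
static uint32_t sum3_then_harden(uint32_t acc, const uint32_t *p,
                                 uint64_t state) {
    acc += p[0];
    acc += p[1];
    acc += p[2];
    return acc | (uint32_t)state; /* one mask covers all three loads */
}
```

On a correct path the function returns the plain sum; on a misspeculated path
every bit of the result is set, regardless of what the three loads returned.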

###### Preserving the flags while hardening loaded values on Haswell, Zen, and newer processors

Sadly, there are no useful instructions on x86 that apply a mask to all 64 bits
without touching the flag registers. However, we can harden loaded values that
are narrower than a word (fewer than 32 bits on 32-bit systems and fewer than
64 bits on 64-bit systems) by zero-extending the value to the full word size
and then shifting right by at least the number of original bits using the BMI2
`shrx` instruction:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        addl    (%rsi), %edi            # Load and accumulate 32 bits of data.
        shrxq   %rax, %rdi, %rdi        # Shift out all 32 bits loaded.
```

Because on x86 the zero-extend is free, this can efficiently harden the loaded
value.

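A C model of this shift-based hardening, under the assumption that the
predicate state is all-ones when misspeculating: the 64-bit form of `shrx` uses
only the low six bits of its count, so an all-ones state shifts by 63 and a
zero state shifts by zero. The function name is hypothetical.

```c
#include <stdint.h>

static uint64_t harden_narrow_value(uint32_t loaded, uint64_t state) {
    uint64_t widened = (uint64_t)loaded;      /* zero-extend to 64 bits */
    unsigned count = (unsigned)(state & 63u); /* count as shrxq sees it */
    return widened >> count;                  /* 0 whenever count is 63 */
}
```

Since the widened value never has a bit set above bit 31, shifting right by 63
always produces zero, while a shift by zero returns the value unchanged.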

##### Hardening the address of the load

When hardening the loaded value is inapplicable, most often because the
instruction directly leaks information (like `cmp` or `jmpq`), we switch to
hardening the _address_ of the load instead of the loaded value. This avoids
increasing register pressure by unfolding the load or paying some other high
cost.

To understand how this works in practice, we need to examine the exact
semantics of the x86 addressing mode which, in its fully general form, looks
like `offset(%base,%index,scale)`. Here `%base` and `%index` are 64-bit
registers that can potentially be any value, and may be attacker controlled,
and `scale` and `offset` are fixed immediate values. `scale` must be `1`, `2`,
`4`, or `8`, and `offset` can be any 32-bit sign extended value. The exact
computation performed to find the address is then: `%base + (scale * %index) +
offset` under 64-bit 2's complement modular arithmetic.

One issue with this approach is that, after hardening, the `%base + (scale *
%index)` subexpression will compute a value near zero (`-1 + (scale * -1)`) and
then a large, positive `offset` will index into memory within the first two
gigabytes of address space. While these offsets are not attacker controlled,
the attacker could choose to attack a load which happens to have the desired
offset and then successfully read memory in that region. This significantly
raises the burden on the attacker and limits the scope of attack but does not
eliminate it. To fully close the attack we must work with the operating system
to preclude mapping memory in the low two gigabytes of address space.

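The address arithmetic above can be checked with a short C sketch. The helper
names are hypothetical; unsigned 64-bit arithmetic in C wraps modulo 2^64,
which matches the modular computation described for the addressing mode. The
second helper assumes both registers have already been replaced by the all-ones
value, as happens on a misspeculated path.

```c
#include <stdint.h>

/* base + (scale * index) + offset under 64-bit modular arithmetic. */
static uint64_t effective_address(uint64_t base, uint64_t index,
                                  unsigned scale, int32_t offset) {
    return base + (uint64_t)scale * index + (uint64_t)(int64_t)offset;
}

/* Address reached when base and index both hold all-ones (i.e. -1). */
static uint64_t poisoned_address(unsigned scale, int32_t offset) {
    return effective_address(UINT64_MAX, UINT64_MAX, scale, offset);
}
```

With a zero offset the poisoned address is a small negative value (for example
`-2` with a scale of 1). With the largest positive offset and a scale of 8 it
becomes `INT32_MAX - 9`, just below the two gigabyte mark, which is the region
the paragraph above proposes to keep unmapped.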

###### 64-bit load checking instructions

We can use the following instruction sequences to check loads. We set up `%r8`
in these examples to hold the special value of `-1` which will be `cmov`ed over
`%rax` in misspeculated paths.

Single register addressing mode:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        orq     %rax, %rsi              # Mask the pointer if misspeculating.
        movl    (%rsi), %edi
```

Two register addressing mode:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        orq     %rax, %rsi              # Mask the pointer if misspeculating.
        orq     %rax, %rcx              # Mask the index if misspeculating.
        movl    (%rsi,%rcx), %edi
```

This will result in a negative address near zero or in `offset` wrapping the
address space back to a small positive address. Small, negative addresses will
fault in user-mode for most operating systems, but targets which need the high
address space to be user accessible may need to adjust the exact sequence used
above. Additionally, the low addresses will need to be marked unreadable by the
OS to fully harden the load.


###### RIP-relative addressing is even easier to break

There is a common addressing mode idiom that is substantially harder to check:
addressing relative to the instruction pointer. We cannot change the value of
the instruction pointer register and so we have the harder problem of forcing
`%base + scale * %index + offset` to be an invalid address, by *only* changing
`%index`. The only advantage we have is that the attacker also cannot modify
`%base`. If we use the fast instruction sequence above, but only apply it to
the index, we will always access `%rip + (scale * -1) + offset`. If the
attacker can find a load whose address, computed this way, happens to point to
secret data, then they can reach it. However, the loader and base libraries can
also simply refuse to map the heap, data segments, or stack within 2 GB of any
of the text in the program, much like they can reserve the low 2 GB of address
space.


###### The flag registers again make everything hard

Unfortunately, the technique of using `orq`-instructions has a serious flaw on
x86. The very thing that makes it easy to accumulate state, the flag registers
containing predicates, causes serious problems here because they may be live
and used by the loading instruction or subsequent instructions. On x86, the
`orq` instruction **sets** the flags and will override anything already there.
This makes inserting them into the instruction stream very hazardous.
Unfortunately, unlike when hardening the loaded value, we have no fallback here
and so we must have a fully general approach available.

The first thing we must do when generating these sequences is try to analyze
the surrounding code to prove that the flags are not in fact live or being
used. Typically, they have been set by some other instruction which just
happens to set the flags register (much like ours!) with no actual dependency.
In those cases, it is safe to directly insert these instructions. Alternatively
we may be able to move them earlier to avoid clobbering the used value.

However, this may ultimately be impossible. In that case, we need to preserve
the flags around these instructions:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        pushfq
        orq     %rax, %rcx              # Mask the pointer if misspeculating.
        orq     %rax, %rdx              # Mask the index if misspeculating.
        popfq
        movl    (%rcx,%rdx), %edi
```

Using the `pushf` and `popf` instructions saves the flags register around our
inserted code, but comes at a high cost. First, we must store the flags to the
stack and reload them. Second, this causes the stack pointer to be adjusted
dynamically, requiring a frame pointer be used for referring to temporaries
spilled to the stack, etc.

On newer x86 processors we can use the `lahf` and `sahf` instructions to save
all of the flags besides the overflow flag in a register rather than on the
stack. We can then use `seto` and `add` to save and restore the overflow flag
in a register. Combined, this will save and restore flags in the same manner as
above but using two registers rather than the stack. That is still very
expensive, if slightly less so than `pushf` and `popf` in most cases.


###### A flag-less alternative on Haswell, Zen and newer processors

Starting with the BMI2 x86 instruction set extensions available on Haswell and
Zen processors, there is an instruction for shifting that does not set any
flags: `shrx`. We can use this and the `lea` instruction to implement analogous
code sequences to the above ones. However, these are still very marginally
slower, as there are fewer ports able to dispatch shift instructions in most
modern x86 processors than there are for `or` instructions.

Fast, single register addressing mode:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        shrxq   %rax, %rsi, %rsi        # Shift away bits if misspeculating.
        movl    (%rsi), %edi
```

This will collapse the register to zero or one, and cause everything but the
offset in the addressing mode to be less than or equal to 9 (at most
`1 + 8 * 1`). This means the full address can only be guaranteed to be less
than `(1 << 31) + 9`. The OS may wish to protect an extra page of the low
address space to account for this.

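The bound quoted above can be reproduced with a small C sketch. The names are
hypothetical, and the model assumes an all-ones predicate state on the
misspeculated path, so the shift count seen by the 64-bit `shrx` is 63.

```c
#include <stdint.h>

/* Model of `shrxq %rax, %reg, %reg`: only the low six bits of the count
   are used, so an all-ones state leaves just the former top bit. */
static uint64_t shrx_harden(uint64_t reg, uint64_t state) {
    return reg >> (unsigned)(state & 63u);
}

/* Largest address reachable after hardening both base and index. */
static uint64_t worst_case_hardened_address(void) {
    uint64_t base = shrx_harden(UINT64_MAX, UINT64_MAX);  /* 1 */
    uint64_t index = shrx_harden(UINT64_MAX, UINT64_MAX); /* 1 */
    return base + 8u * index + (uint64_t)INT32_MAX;       /* scale 8 */
}
```

The worst case is `1 + 8 * 1 + INT32_MAX`, which is one less than
`(1 << 31) + 9`, so the hardened address can spill at most a few bytes past
the low two gigabytes.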

##### Optimizations

A very large portion of the cost for this approach comes from checking loads in
this way, so it is important to work to optimize this. However, beyond making
the instruction sequences to *apply* the checks efficient (for example by
avoiding `pushfq` and `popfq` sequences), the only significant optimization is
to check fewer loads without introducing a vulnerability. We apply several
techniques to accomplish that.


###### Don't check loads from compile-time constant stack offsets

We implement this optimization on x86 by skipping the checking of loads which
use a fixed frame pointer offset.

The result of this optimization is that patterns like reloading a spilled
register or accessing a global field don't get checked. This is a very
significant performance win.


anatofuz
parents:
diff changeset
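As an illustration only, the rule can be modeled as a predicate over a memory
operand. The types and the name `load_needs_check` below are hypothetical and
are not taken from the LLVM implementation; the sketch assumes an operand is
exempt exactly when its address is the frame pointer plus a compile-time
constant.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of an x86 memory operand. */
enum base_reg { BASE_FRAME_POINTER, BASE_OTHER };

struct mem_operand {
    enum base_reg base;   /* register supplying the base address */
    bool has_index;       /* true if an index register participates */
    int32_t disp;         /* compile-time constant displacement */
};

/* A load needs a check unless its address is the frame pointer plus a
 * compile-time constant, so that no other register feeds the address. */
bool load_needs_check(const struct mem_operand *op) {
    return !(op->base == BASE_FRAME_POINTER && !op->has_index);
}
```

A spill reload such as `-16(%rbp)` is skipped under this model, while a load
through any other register, or one that adds an index register, is still
checked.
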
###### Don't check dependent loads

A core part of why this mitigation strategy works is that it establishes a
data-flow check on the loaded address. However, this means that if the address
itself was already loaded using a checked load, there is no need to check a
dependent load provided it is within the same basic block as the checked load,
and therefore has no additional predicates guarding it. Consider code like the
following:
```
        ...

.LBB0_4:                                # %danger
        movq    (%rcx), %rdi
        movl    (%rdi), %edx
```

This will get transformed into:
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        orq     %rax, %rcx              # Mask the pointer if misspeculating.
        movq    (%rcx), %rdi            # Hardened load.
        movl    (%rdi), %edx            # Unhardened load due to dependent addr.
```

This doesn't check the load through `%rdi` as that pointer is dependent on a
checked load already.

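The same shape can be written as a C model. This is a sketch, assuming the
predicate state is zero on a correct path and all-ones once a misprediction has
been detected; `harden_address` and `load_through` are hypothetical names.

```c
#include <stdint.h>

/* Mirrors `orq %rax, %rcx`: the address is unchanged when state is zero
 * and becomes all-ones when state is all-ones. */
uintptr_t harden_address(uintptr_t addr, uintptr_t state) {
    return addr | state;
}

int load_through(int *const *slot, uintptr_t state) {
    /* Hardened load: the address of `slot` is masked before it is used. */
    int *inner = *(int *const *)harden_address((uintptr_t)slot, state);
    /* Dependent load: its address is the result of the hardened load and
     * it sits in the same block, so it is not masked again. */
    return *inner;
}
```

Only the zero state can be exercised in an ordinary program, since a poisoned
state exists only during misspeculation.
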
###### Protect large, load-heavy blocks with a single lfence

It may be worth using a single `lfence` instruction at the start of a block
which begins with a (very) large number of loads that require independent
protection *and* which require hardening the address of the load. However, this
is unlikely to be profitable in practice. The latency hit of the hardening
would need to exceed that of an `lfence` when *correctly* speculatively
executed. But in that case, the `lfence` cost is a complete loss of speculative
execution (at a minimum). So far, the evidence we have of the performance cost
of using `lfence` indicates few if any hot code patterns where this trade-off
would make sense.


###### Tempting optimizations that break the security model

Several optimizations were considered which didn't pan out due to failure to
uphold the security model. One in particular is worth discussing as many others
will reduce to it.

We wondered whether only the *first* load in a basic block could be checked. If
the check works as intended, it forms an invalid pointer that doesn't even
pass virtual-address translation in the hardware. It should fault very early on
in its processing. Maybe that would stop things in time for the misspeculated
path to fail to leak any secrets. This doesn't end up working because the
processor is fundamentally out-of-order, even in its speculative domain. As a
consequence, the attacker could cause the initial address computation itself to
stall and allow an arbitrary number of unrelated loads (including attacked
loads of secret data) to pass through.


#### Interprocedural Checking

Modern x86 processors may speculate into called functions and out of functions
to their return address. As a consequence, we need a way to check loads that
occur after a misspeculated predicate but where the load and the misspeculated
predicate are in different functions. In essence, we need some interprocedural
generalization of the predicate state tracking. A primary challenge in passing
the predicate state between functions is that, to keep this mitigation
deployable, we would like to avoid requiring a change to the ABI or calling
convention. We would further like code mitigated in this way to mix easily with
unmitigated code without completely losing the value of the mitigation.


##### Embed the predicate state into the high bit(s) of the stack pointer

We can use the same technique that allows hardening pointers to pass the
predicate state into and out of functions. The stack pointer is trivially
passed between functions and we can test for it having the high bits set to
detect when it has been marked due to misspeculation. The call site instruction
sequence looks like (assuming a misspeculated state value of `-1`):
```
        ...

.LBB0_4:                                # %danger
        cmovneq %r8, %rax               # Conditionally update predicate state.
        shlq    $47, %rax               # Move the state into the high bits.
        orq     %rax, %rsp              # Merge the state into the stack pointer.
        callq   other_function
        movq    %rsp, %rax              # Read the stack pointer back.
        sarq    $63, %rax               # Sign extend the high bit to all bits.
```

This first puts the predicate state into the high bits of `%rsp` before calling
the function and then reads it back out of the high bits of `%rsp` afterward.
When correctly executing (speculatively or not), these are all no-ops. When
misspeculating, the stack pointer will end up negative. We arrange for it to
remain a canonical address, but otherwise leave the low bits alone to allow
stack adjustments to proceed normally without disrupting this. Within the
called function, we can extract this predicate state and then reset it on
return:
```
other_function:
        # prolog
        movq    %rsp, %rax              # Copy the incoming stack pointer.
        sarq    $63, %rax               # Sign extend the high bit to all bits.
        # ...

.LBB0_N:
        cmovneq %r8, %rax               # Conditionally update predicate state.
        shlq    $47, %rax               # Move the state into the high bits.
        orq     %rax, %rsp              # Merge the state into the stack pointer.
        retq
```

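The bit manipulation in these sequences can be modeled in C. This is a sketch
that assumes a state of either zero or all-ones and 48-bit virtual addresses;
the function names are hypothetical, and the signed conversion and right shift
rely on the usual two's-complement, arithmetic-shift behavior, which C leaves
implementation-defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Merge the predicate state (0 or all-ones) into the high bits of a stack
 * pointer value, mirroring `shlq $47, %rax; orq %rax, %rsp`. */
uint64_t merge_state_into_sp(uint64_t sp, uint64_t state) {
    return sp | (state << 47);
}

/* Recover the state, mirroring `movq %rsp, %rax; sarq $63, %rax`. */
uint64_t extract_state_from_sp(uint64_t sp) {
    return (uint64_t)((int64_t)sp >> 63);
}

/* With 48-bit virtual addresses, an address is canonical when bits 63..47
 * are all equal. */
bool is_canonical_48(uint64_t addr) {
    uint64_t high = addr >> 47;          /* top 17 bits */
    return high == 0 || high == 0x1FFFF;
}
```

For a user-space stack pointer such as `0x00007ffc12345678`, merging a zero
state is a no-op, while merging an all-ones state sets bits 47 through 63,
leaving the low 47 bits intact and the address canonical.
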
This approach is effective when all code is mitigated in this fashion, and can
even survive very limited reaches into unmitigated code (the state will
round-trip in and back out of an unmitigated function; it just won't be
updated). But it does have some limitations. There is a cost to merging the
state into `%rsp` and it doesn't insulate mitigated code from misspeculation in
an unmitigated caller.

There is also an advantage to using this form of interprocedural mitigation: by
forming these invalid stack pointer addresses we can prevent speculative
returns from successfully reading speculatively written values to the actual
stack. This works first by forming a data-dependency between computing the
address of the return address on the stack and our predicate state. And even
when satisfied, if a misprediction causes the state to be poisoned, the
resulting stack pointer will be invalid.


##### Rewrite API of internal functions to directly propagate predicate state

(Not yet implemented.)

We have the option with internal functions to directly adjust their API to
accept the predicate as an argument and return it. This is likely to be
marginally cheaper than embedding into `%rsp` for entering functions.

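Since this is not implemented, the following is purely a hypothetical sketch of
what such a rewritten signature could look like at the source level. The names
`checked_result`, `sum_checked`, and `sum_entry` are invented for illustration,
and the explicit `state` variable stands in for the register the compiler would
maintain.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: the value plus the predicate state on exit. */
struct checked_result {
    int value;
    uint64_t state;
};

/* Internal (static) function whose signature was rewritten to take the
 * caller's predicate state and hand the state back on return. */
static struct checked_result sum_checked(const int *data, size_t n,
                                         uint64_t state) {
    int total = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Mask each address with the state, as the hardened loads do. */
        const int *p = (const int *)((uintptr_t)&data[i] | (uintptr_t)state);
        total += *p;
    }
    struct checked_result r = { total, state };
    return r;
}

int sum_entry(const int *data, size_t n) {
    uint64_t state = 0;  /* zero whenever execution is architecturally correct */
    struct checked_result r = sum_checked(data, n, state);
    state = r.state;     /* continue with the state the callee returned */
    (void)state;
    return r.value;
}
```

In a real implementation the compiler would also update the state after each
conditional branch; that step has no source-level equivalent and is omitted
here.
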
##### Use `lfence` to guard function transitions

An `lfence` instruction can be used to prevent subsequent loads from
speculatively executing until all prior mispredicted predicates have resolved.
We can use this as a broader barrier against speculative loads executing across
function boundaries. We emit it in the entry block to handle calls, and prior
to each return. This approach also has the advantage of providing the strongest
degree of mitigation when mixed with unmitigated code by halting all
misspeculation entering a function which is mitigated, regardless of what
occurred in the caller. However, such a mixture is inherently riskier. Whether
this kind of mixture is a sufficient mitigation requires careful analysis.

Unfortunately, experimental results indicate that the performance overhead of
this approach is very high for certain patterns of code. A classic example is
any form of recursive evaluation engine. The hot, rapid call and return
sequences exhibit dramatic performance loss when mitigated with `lfence`. This
component alone can regress performance by 2x or more, making it an unpleasant
tradeoff even when only used in a mixture of code.

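The placement described above can be sketched at the source level. This
assumes an x86-64 target where the `_mm_lfence` intrinsic from `<emmintrin.h>`
is available; the `SPECULATION_BARRIER` macro and `read_element` are
hypothetical names, and on other targets the macro is an empty placeholder.

```c
#include <stddef.h>

#if defined(__x86_64__)
#include <emmintrin.h>                      /* _mm_lfence */
#define SPECULATION_BARRIER() _mm_lfence()
#else
#define SPECULATION_BARRIER() ((void)0)     /* placeholder on other targets */
#endif

int read_element(const int *table, size_t size, size_t index) {
    SPECULATION_BARRIER();   /* entry block: guards against the caller's state */
    int value = 0;
    if (index < size)
        value = table[index];
    SPECULATION_BARRIER();   /* prior to the return */
    return value;
}
```

A compiler pass would insert the barriers itself; writing them by hand only
shows where they would land.
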
##### Use an internal TLS location to pass predicate state

We can define a special thread-local value to hold the predicate state between
functions. This avoids direct ABI implications by using a side channel between
callers and callees to communicate the predicate state. It also allows implicit
zero-initialization of the state, which allows non-checked code to be the first
code executed.

However, this requires a load from TLS in the entry block, a store to TLS
before every call and every ret, and a load from TLS after every call. As a
consequence it is expected to be substantially more expensive even than using
`%rsp` and potentially `lfence` within the function entry block.

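The loads and stores this would require can be made visible in a C sketch. It
assumes a C11 compiler with `_Thread_local`; the variable
`tls_predicate_state` and both functions are hypothetical.

```c
#include <stdint.h>

/* Hypothetical thread-local slot. Like any object with static storage
 * duration it starts out zero-initialized. */
static _Thread_local uint64_t tls_predicate_state;

static int callee(int x) {
    uint64_t state = tls_predicate_state;   /* load in the entry block */
    int result = x + 1;
    tls_predicate_state = state;            /* store before ret */
    return result;
}

int caller(int x) {
    uint64_t state = tls_predicate_state;   /* load in the entry block */
    tls_predicate_state = state;            /* store before the call */
    int y = callee(x);
    state = tls_predicate_state;            /* load after the call */
    tls_predicate_state = state;            /* store before ret */
    return y * 2;
}
```

One call from `caller` to `callee` already costs three TLS loads and three TLS
stores in this model, which is the overhead the paragraph above refers to.
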
##### Define a new ABI and/or calling convention

We could define a new ABI and/or calling convention to explicitly pass the
predicate state in and out of functions. This may be interesting if none of the
alternatives have adequate performance, but it makes deployment and adoption
dramatically more complex, and potentially infeasible.


## High-Level Alternative Mitigation Strategies

There are completely different alternative approaches to mitigating variant 1
attacks. [Most](https://lwn.net/Articles/743265/)
[discussion](https://lwn.net/Articles/744287/) so far focuses on mitigating
specific known attackable components in the Linux kernel (or other kernels) by
manually rewriting the code to contain an instruction sequence that is not
vulnerable. For x86 systems this is done by either injecting an `lfence`
instruction along the code path which would leak data if executed speculatively
or by rewriting memory accesses to have branch-less masking to a known safe
region. On Intel systems, `lfence` [will prevent the speculative load of secret
data](https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf).
On AMD systems `lfence` is currently a no-op, but can be made
dispatch-serializing by setting an MSR, and thus preclude misspeculation of the
code path ([mitigation G-2 +
V1-1](https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf)).

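As a sketch of the branch-less masking idea (not the code of any particular
kernel), an index can be forced to zero whenever it is out of bounds. The names
`index_mask` and `load_masked` are hypothetical, and the sketch assumes
`size <= INT64_MAX` and an arithmetic right shift of negative values.

```c
#include <stddef.h>
#include <stdint.h>

/* All-ones when index < size, zero otherwise, computed without a branch. */
uint64_t index_mask(uint64_t index, uint64_t size) {
    return (uint64_t)(~(int64_t)(index | (size - 1 - index)) >> 63);
}

int load_masked(const int *table, size_t size, size_t index) {
    if (index >= size)
        return -1;
    /* Even if the branch above is mispredicted, an out-of-bounds index is
     * forced to 0, which stays inside the table. */
    index &= (size_t)index_mask(index, size);
    return table[index];
}
```

A production version would also need to stop the compiler from proving the
mask redundant after the bounds check and folding it away, for example with an
optimization barrier; that is omitted here.
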
However, this relies on finding and enumerating all possible points in code
which could be attacked to leak information. While in some cases static
analysis is effective at doing this at scale, in many cases it still relies on
human judgement to evaluate whether code might be vulnerable. Especially for
software systems which receive less detailed scrutiny but remain sensitive to
these attacks, this seems like an impractical security model. We need an
automatic and systematic mitigation strategy.


### Automatic `lfence` on Conditional Edges

A natural way to scale up the existing hand-coded mitigations is simply to
inject an `lfence` instruction into both the target and fallthrough
destinations of every conditional branch. This ensures that no predicate or
bounds check can be bypassed speculatively. However, the performance overhead
of this approach is, simply put, catastrophic. Yet it remains the only truly
"secure by default" approach known prior to this effort and serves as the
baseline for performance.

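Written by hand for a single bounds check, the transformation would look
roughly like the sketch below. It assumes an x86-64 target with the
`_mm_lfence` intrinsic from `<emmintrin.h>`; `SPECULATION_BARRIER` and
`checked_load` are hypothetical names, and on other targets the macro expands
to nothing.

```c
#include <stddef.h>

#if defined(__x86_64__)
#include <emmintrin.h>                      /* _mm_lfence */
#define SPECULATION_BARRIER() _mm_lfence()
#else
#define SPECULATION_BARRIER() ((void)0)     /* placeholder on other targets */
#endif

int checked_load(const int *table, size_t size, size_t index) {
    if (index < size) {
        SPECULATION_BARRIER();   /* first destination of the branch */
        return table[index];
    } else {
        SPECULATION_BARRIER();   /* second destination of the branch */
        return -1;
    }
}
```

Every conditional branch in the program pays for a barrier on whichever edge it
takes, which is why the overhead is so severe.
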
One attempt to address the performance overhead of this and make it more
realistic to deploy is [MSVC's /Qspectre
switch](https://blogs.msdn.microsoft.com/vcblog/2018/01/15/spectre-mitigations-in-msvc/).
Their technique is to use static analysis within the compiler to only insert
`lfence` instructions into conditional edges at risk of attack. However,
[initial](https://arstechnica.com/gadgets/2018/02/microsofts-compiler-level-spectre-fix-shows-how-hard-this-problem-will-be-to-solve/)
[analysis](https://www.paulkocher.com/doc/MicrosoftCompilerSpectreMitigation.html)
has shown that this approach is incomplete and only catches a small and limited
subset of attackable patterns which happen to resemble very closely the initial
proofs of concept. As such, while its performance is acceptable, it does not
appear to be an adequate systematic mitigation.


## Performance Overhead

The performance overhead of this style of comprehensive mitigation is very
high. However, it compares very favorably with previously recommended
approaches such as the `lfence` instruction. Just as users can restrict the
scope of `lfence` to control its performance impact, this mitigation technique
could be restricted in scope as well.

However, it is important to understand what it would cost to get a fully
mitigated baseline. Here we assume targeting a Haswell (or newer) processor and
using all of the tricks to improve performance (which leaves the low 2 GB
unprotected, along with the +/- 2 GB surrounding any PC in the program). We ran
both Google's microbenchmark suite and a large highly-tuned server built using
ThinLTO and PGO. All were built with `-march=haswell` to give access to BMI2
instructions, and benchmarks were run on large Haswell servers. We collected
data both with an `lfence`-based mitigation and load hardening as presented
here. The summary is that mitigating with load hardening is 1.77x faster than
mitigating with `lfence`, and the overhead of load hardening compared to a
normal program is likely between a 10% overhead and a 50% overhead with most
large applications seeing a 30% overhead or less.

| Benchmark                              | `lfence` | Load Hardening | Mitigated Speedup |
| -------------------------------------- | -------: | -------------: | ----------------: |
| Google microbenchmark suite            |   -74.8% |         -36.4% |          **2.5x** |
| Large server QPS (using ThinLTO & PGO) |     -62% |           -29% |          **1.8x** |

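One way to read the last column, offered as an interpretation rather than
something the text states, is as the ratio of the throughput that remains under
each mitigation. The function name `mitigated_speedup` is hypothetical.

```c
/* Throughput ratio of load hardening to `lfence`, given each one's
 * percentage change relative to the unmitigated baseline. */
double mitigated_speedup(double lfence_change_pct, double hardening_change_pct) {
    return (100.0 + hardening_change_pct) / (100.0 + lfence_change_pct);
}
```

Under that reading, the first row gives 63.6 / 25.2, about 2.52, and the second
row gives 71 / 38, about 1.87, in line with the reported 2.5x and 1.8x.
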
Below is a visualization of the microbenchmark suite results which helps show
the distribution of results that is somewhat lost in the summary. The y-axis is
a log-scale speedup ratio of load hardening relative to `lfence` (up -> faster
-> better). Each box-and-whiskers represents one microbenchmark which may have
many different metrics measured. The red line marks the median, the box marks
the first and third quartiles, and the whiskers mark the min and max.

![Microbenchmark result visualization](speculative_load_hardening_microbenchmarks.png)

We don't yet have benchmark data on SPEC or the LLVM test suite, but we can
work on getting that. Still, the above should give a pretty clear
characterization of the performance, and specific benchmarks are unlikely to
reveal especially interesting properties.


### Future Work: Fine-Grained Control and API-Integration

The performance overhead of this technique is likely to be very significant and
something users wish to control or reduce. There are interesting options here
that impact the implementation strategy used.

One particularly appealing option is to allow both opt-in and opt-out of this
mitigation at reasonably fine granularity such as on a per-function basis,
including intelligent handling of inlining decisions: protected code can be
prevented from inlining into unprotected code, and unprotected code will become
protected when inlined into protected code. For systems where only a limited
set of code is reachable by externally controlled inputs, it may be possible to
limit the scope of mitigation through such mechanisms without compromising the
application's overall security. The performance impact may also be focused in a
few key functions that can be hand-mitigated in ways that have lower
performance overhead while the remainder of the application receives automatic
protection.

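Per-function control of this kind could look like the sketch below. It assumes
a Clang that provides the `speculative_load_hardening` and
`no_speculative_load_hardening` function attributes, detected with
`__has_attribute`; on compilers without them the hypothetical `SLH_ON` and
`SLH_OFF` macros expand to nothing. The function names are invented for
illustration.

```c
#include <stddef.h>

#if defined(__has_attribute)
#  if __has_attribute(speculative_load_hardening)
#    define SLH_ON  __attribute__((speculative_load_hardening))
#    define SLH_OFF __attribute__((no_speculative_load_hardening))
#  endif
#endif
#ifndef SLH_ON
#  define SLH_ON      /* attribute unavailable: no per-function control */
#  define SLH_OFF
#endif

/* Reachable with an externally controlled `index`: opt in. */
SLH_ON int lookup_untrusted(const int *table, size_t size, size_t index) {
    return index < size ? table[index] : -1;
}

/* Hot loop over trusted data: opt out. */
SLH_OFF long sum_trusted(const int *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```

Whether opting a function out is safe depends on the analysis described above:
it must not be reachable with attacker-influenced inputs.
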
For both limiting the scope of mitigation and manually mitigating hot
functions, there needs to be some support for mixing mitigated and unmitigated
code without completely defeating the mitigation. For the first use case, it
would be particularly desirable that mitigated code remains safe when being
called during misspeculation from unmitigated code.

For the second use case, it may be important to connect the automatic
mitigation technique to explicit mitigation APIs such as what is described in
http://wg21.link/p0928 (or any other eventual API) so that there is a clean way
to switch from automatic to manual mitigation without immediately exposing a
hole. However, the design for how to do this is hard to come up with until the
APIs are better established. We will revisit this as those APIs mature.