# MLIR Rationale

This document is intended to capture some of the alternatives considered and
open debates in the design of MLIR, along with the rationale for certain
decisions we made. This is not intended to be a "finely groomed" document - we
prefer the ability to dump in interesting tidbits without worrying too much
about their consistency or readability.

[TOC]

## Abstract

MLIR is a compiler intermediate representation with similarities to traditional
three-address SSA representations (like
[LLVM IR](http://llvm.org/docs/LangRef.html) or
[SIL](https://github.com/apple/swift/blob/master/docs/SIL.rst)), but which
introduces notions from polyhedral loop optimization work as first class
concepts. This hybrid design is optimized to represent, analyze, and transform
high level dataflow graphs as well as target-specific code generated for high
performance data parallel systems. Beyond its representational capabilities, its
single continuous design provides a framework to lower from dataflow graphs to
high performance target specific code.

MLIR stands for one of "Multi-Level IR", "Multi-dimensional Loop IR", "Machine
Learning IR", or "Mid Level IR"; we prefer the first. This document only
provides the rationale behind MLIR -- its actual
[specification document](LangRef.md) and other content are hosted elsewhere.

## Introduction and Motivation

The Multi-Level Intermediate Representation (MLIR) is intended for easy
expression and optimization of computations involving deep loop nests and dense
matrices of high dimensionality. It is thus well-suited to deep learning
computations in particular. Yet it is general enough to also represent arbitrary
sequential computation. The representation allows high-level optimization and
parallelization for a wide range of parallel architectures including those with
deep memory hierarchies --- general-purpose multicores, GPUs, and specialized
neural network accelerators.

MLIR uses ideas drawn from the IRs of LLVM and Swift for lower level constructs
while combining them with ideas from the polyhedral abstraction to represent
loop nests, multidimensional data (tensors), and transformations on these
entities as first class concepts in the IR.

MLIR is a multi-level IR: it can represent code at multiple levels of
abstraction, from a domain-specific representation such as HLO or TensorFlow
graphs all the way down to the machine level. MLIR is able to represent
arbitrary control flow and arbitrary data accesses, and is general enough to
represent nearly all sequential computation. This is a key distinction from
existing polyhedral representation implementations (such as LLVM
[Polly](https://polly.llvm.org/)) that are able to use the polyhedral
abstraction only in a way isolated from the LLVM IR and only for affine loop
nests, i.e., portions of the code where array accesses, loop bounds, and
conditionals are regular (involve linear functions of loop iterators and
constant symbols). The presence of statically unpredictable data accesses or
control flow does not preclude representation in MLIR, but only limits to a
certain extent the ability to reason about and apply transformations using the
polyhedral abstraction.

Maps, sets, and relations with affine constraints are the core structures
underlying a polyhedral representation of high-dimensional loop nests and
multidimensional arrays. These structures are represented as textual
expressions in a form close to their mathematical form. These structures are
used to capture loop nests, tensor data structures, and how they are reordered
and mapped for a target architecture. All structured or "conforming" loops are
captured as part of the polyhedral information, and so are tensor variables,
their layouts, and subscripted accesses to these tensors in memory.

The information captured in the IR allows a compact expression of all loop
transformations, data remappings, explicit copying necessary for explicitly
addressed memory in accelerators, mapping to pre-tuned, expert-written
primitives, and mapping to specialized vector instructions. Loop transformations
that can be easily implemented include the body of affine transformations: these
subsume all traditional loop transformations (unimodular and non-unimodular)
such as loop tiling, interchange, permutation, skewing, scaling, relative
shifting, reversal, fusion, and distribution/fission. Transformations on data
layout such as padding and transforming to blocked layouts are also represented
well via affine layout maps.
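
As an illustration of the last point, here is a minimal sketch of how a blocked
layout can be expressed with an affine layout map. The map name `#tiled_2d`, the
32x32 block size, and the buffer shape are assumptions made up for this example.

```mlir
// Hypothetical blocked layout: element (d0, d1) lives in block
// (d0 floordiv 32, d1 floordiv 32) at offset (d0 mod 32, d1 mod 32).
#tiled_2d = (d0, d1) -> (d0 floordiv 32, d1 floordiv 32, d0 mod 32, d1 mod 32)

// A buffer carrying this layout in its memref type.
%buf = alloc() : memref<256x256xf32, #tiled_2d>
```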

MLIR's design allows a progressive lowering to target-specific forms. Besides
high-level transformations for loop nests and data layouts that a typical
mid-level optimizer is expected to deal with, MLIR is also designed to perform
certain low-level scheduling and mapping decisions that a typical backend IR is
entrusted with: these include mapping to specialized vector instructions,
auto-vectorization, and software pipelining. The need to support these
transformations stems from the fact that neural network accelerators have
specialized units that deal with large chunks of data whose computation maps
back to chunks of more than one loop of the loop nests as viewed by a program at
a level closer to the original specification. Such specialized units or
instructions operate on multidimensional data chunks from a programmer's
viewpoint. This makes it hard or infeasible for a backend operating on a very
low-level IR close to assembly to lift and reconstruct loops and perform such a
mapping. This is in contrast to classic instruction selection and scheduling in
today's compilers, which primarily deal only with the body of the innermost
loop. MLIR also facilitates automatic mapping to expert pre-tuned primitives or
vendor libraries operating on data at higher levels (or at the highest level) of
the memory hierarchy.

In summary, MLIR is convenient for and closed under the kind of transformations
needed to lower to general-purpose as well as specialized accelerators. It also
allows one to build modular and reusable target-independent and target-dependent
passes.

## Design Decisions

This section sheds light on some of the design decisions -- some of these are
indirectly implied by the specification document.

### Loads and stores

The 'load' and 'store' instructions are specifically crafted to fully resolve to
an element of a memref. These instructions take as arguments a memref and n
indices for an n-ranked memref (n+1 operands in total). This disallows the
equivalent of pointer arithmetic or the ability to index into the same memref in
other ways (something which C arrays allow for example). Furthermore, for the
affine constructs, the compiler can follow use-def chains (e.g. through
[affine.apply operations](Dialects/Affine.md#affineapply-operation) or through
the map attributes of [affine operations](Dialects/Affine.md#Operations)) to
precisely analyze references at compile-time using polyhedral techniques. This
is possible because of the
[restrictions on dimensions and symbols](Dialects/Affine.md#restrictions-on-dimensions-and-symbols).
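
As a minimal sketch (the buffer shape and names are made up for illustration),
every access names all indices of the memref, so each load and store resolves to
exactly one element:

```mlir
%A = alloc() : memref<16x32xf32>
affine.for %i = 0 to 16 {
  affine.for %j = 0 to 32 {
    // Reads and writes address exactly one element; there is no way to
    // form a pointer into the middle of %A.
    %v = affine.load %A[%i, %j] : memref<16x32xf32>
    affine.store %v, %A[%i, %j] : memref<16x32xf32>
  }
}
```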

A scalar of element-type (a primitive type or a vector type) that is stored in
memory is modeled as a 0-d memref. This is also necessary for scalars that are
live out of for loops and if conditionals in a function, for which we don't yet
have an SSA representation --
[an extension](#affineif-and-affinefor-extensions-for-escaping-scalars) to allow
that is described later in this doc.

### Symbols and types

The current MLIR disallows use of symbols in types. For example, when a tensor
or memref dimension is statically unknown, it is denoted in the type as '?'. An
SSA symbol is then bound to it when a memref is created. The actual value of the
unknown dimension can be queried using the "dim" builtin as shown below.

Example:

```mlir
func @foo(...) {
  %A = alloc(%N) : memref<8x?xf32, #lmap>
  ...
  call @bar(%A) : (memref<8x?xf32, #lmap>) -> ()
}

func @bar(%A : memref<8x?xf32, #lmap>) {
  // The type of %A indicates that %A has 8 rows and an unknown number of
  // columns. The number of columns is queried dynamically using the dim
  // instruction.
  %N = dim %A, 1 : memref<8x?xf32, #lmap>
  %cf1 = constant 1.0 : f32

  affine.for %i = 0 to 8 {
    affine.for %j = 0 to %N {
      // A[i, j] += 1
      %s1 = affine.load %A[%i, %j] : memref<8x?xf32, #lmap>
      %s2 = addf %s1, %cf1 : f32
      affine.store %s2, %A[%i, %j] : memref<8x?xf32, #lmap>
    }
  }
  return
}
```

An alternative design is to embed the reference to symbols directly in the
type - memref<8x%Nxf32>. We went for the current approach in MLIR because it
simplifies the design --- types remain immutable when the values of symbols
change.

### Block Arguments vs PHI nodes

MLIR Regions represent SSA using "[block arguments](LangRef.md#blocks)" rather
than [PHI instructions](http://llvm.org/docs/LangRef.html#i-phi) used in LLVM.
This choice is representationally identical (the same constructs can be
represented in either form) but block arguments have several advantages:

1.  LLVM PHI nodes always have to be kept at the top of a block, and
    transformations frequently have to manually skip over them. This is defined
    away with BB arguments.
1.  LLVM has a separate function Argument node. This is defined away with BB
    arguments, because the arguments to the entry block serve this purpose.
1.  Blocks of PHI nodes in LLVM execute atomically, which is surprising and
    confusing to compiler engineers, and it is easy to introduce bugs with this
    (very related to the
    "[lost copy](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.524.5461&rep=rep1&type=pdf)"
    problem in SSA lowering literature). With the BB argument representation,
    this confusion is defined away.
1.  The entry list of PHI nodes in LLVM is unordered, and some blocks have
    thousands of predecessors (e.g. unwind blocks). This can cause long compile
    time problems because transformations have to linearly scan this list. This
    is defined away with the BB argument representation.
1.  LLVM has no way to represent values that are available only in one successor
    but not the other, e.g. its invoke instruction cannot produce the exception
    value JUST on the exception edge. Instead, the
    [landingpad instruction](http://llvm.org/docs/LangRef.html#landingpad-instruction)
    is a hack used to represent this. MLIR doesn't make use of this capability,
    but SIL uses it extensively, e.g. in the
    [switch_enum instruction](https://github.com/apple/swift/blob/master/docs/SIL.rst#switch-enum).

For more context, block arguments were previously used in the Swift
[SIL Intermediate Representation](https://github.com/apple/swift/blob/master/docs/SIL.rst),
and are described in
[a talk on YouTube](https://www.youtube.com/watch?v=Ntj8ab-5cvE). The section of
interest [starts here](https://youtu.be/Ntj8ab-5cvE?t=596).
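
To make the representation concrete, here is a minimal sketch (the function name
and values are made up) of a value being passed to a successor block as a block
argument where LLVM would have used a PHI node:

```mlir
func @block_args(%cond : i1, %a : i32, %b : i32) -> i32 {
  // Each edge passes the value it wants ^bb1 to see; there is no PHI node.
  cond_br %cond, ^bb1(%a : i32), ^bb1(%b : i32)

^bb1(%v : i32):
  return %v : i32
}
```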

### Index type disallowed in vector/tensor/memref types

Index types are not allowed as elements of `vector`, `tensor` or `memref` type.
Index types are intended to be used for platform-specific "size" values and may
appear in subscripts, sizes of aggregate types and affine expressions. They are
also tightly coupled with `affine.apply` and affine.load/store operations;
having `index` type is a necessary precondition for a value to be acceptable by
these operations. While it may be useful to have `memref<?xindex>` to express
indirect accesses, e.g. sparse matrix manipulations or lookup tables, it creates
problems MLIR is not ready to address yet. MLIR needs to internally store
constants of aggregate types and emit code operating on values of those types,
which are subject to target-specific size and alignment constraints. Since MLIR
does not have a target description mechanism at the moment, it cannot reliably
emit such code. Moreover, some platforms may not support vectors of type
equivalent to `index`.

Indirect access use cases can be alternatively supported by providing an
`index_cast` instruction that allows for conversion between `index` and
fixed-width integer types, at the SSA value level. It has the additional benefit
of supporting smaller integer types, e.g. `i8` or `i16`, for small indices
instead of the (presumably larger) `index` type.
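
A minimal sketch of this pattern (the buffers `%indices` and `%data` and the
index `%k` are hypothetical): indices are stored as a fixed-width integer type
and converted with `index_cast` right before being used for addressing.

```mlir
// Load a stored i32 index, widen it to `index`, and use it to address %data.
%ival = load %indices[%k] : memref<128xi32>
%idx = index_cast %ival : i32 to index
%v = load %data[%idx] : memref<1024xf32>
```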

### Bit width of non-primitive types and `index` is undefined

The bit width of a compound type is not defined by MLIR; it may be defined by a
specific lowering pass. In MLIR, bit width is a property of certain primitive
_types_, in particular integers and floats. It is equal to the number that
appears in the type definition, e.g. the bit width of `i32` is `32`, as is the
bit width of `f32`. The bit width is not _necessarily_ related to the amount of
memory (in bytes) or the size of a register (in bits) that is necessary to store
the value of the given type. These quantities are target- and ABI-specific and
should be defined during the lowering process rather than imposed from above.
For example, `vector<3xi57>` is likely to be lowered to a vector of four 64-bit
integers, so that its storage requirement is `4 x 64 / 8 = 32` bytes, rather
than `(3 x 57) ceildiv 8 = 22` bytes as can be naively computed from the
bitwidth. Individual components of MLIR that allocate space for storing values
may use the bit size as the baseline and query the target description when it is
introduced.

The bit width is not defined for dialect-specific types at the MLIR level.
Dialects are free to define their own quantities for type sizes.

### Signless types

Integers in the builtin MLIR type system have a bitwidth (note that the `index`
type has a symbolic width equal to the machine word size), but they do not have
an intrinsic sign. This means that the "standard ops" operation set has things
like `addi` and `muli` which do two's complement arithmetic, but some other
operations get a sign, e.g. `divis` vs `diviu`.
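
A minimal sketch of what this looks like in the standard ops (the value names
are made up): the operands carry no sign, so addition has a single opcode, while
division names the interpretation explicitly.

```mlir
// One addition opcode regardless of how %a and %b are interpreted.
%sum = addi %a, %b : i32
// Division differs between interpretations, so the opcode carries the sign.
%qs = divis %a, %b : i32   // two's complement signed division
%qu = diviu %a, %b : i32   // unsigned division
```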

LLVM uses the [same design](http://llvm.org/docs/LangRef.html#integer-type),
which was introduced in a revamp rolled out
[in the LLVM 2.0 integer type](http://releases.llvm.org/2.0/docs/LangRef.html#t_derived).
Prior to that, from
[LLVM 1.0](http://releases.llvm.org/1.0/docs/LangRef.html#t_classifications) to
[1.9](http://releases.llvm.org/1.9/docs/LangRef.html#t_classifications), LLVM
used signed types like "sbyte" and "ubyte". This shift was important and has
served LLVM well over the years. The reason this is important is that it is a
good thing for an intermediate representation to represent the same computation
with the same instruction. Signed types got in the way, because (e.g.) an "add
of an sbyte" does the same computation as an "add of a ubyte", but the type
system made them look artificially different. This split also required casts
like "cast from sbyte to ubyte" which do nothing at the machine level. Removing
signs from the type system eliminated these problems, making the compiler
simpler.

More information about this split is available in an old
[talk on YouTube](https://www.youtube.com/watch?v=VeRaLPupGks) about LLVM 2.0.

Note that this rationale only applies to the "standard ops" dialect in which we
can express an opinion about its design. Other dialects generally try to model
an external system, and should aim to reflect its design as closely as possible.

### Splitting floating point vs integer operations

The MLIR "standard" operation set splits many integer and floating point
operations into different categories, for example `addf` vs `addi` and `cmpf` vs
`cmpi`
([following the design of LLVM](http://llvm.org/docs/LangRef.html#binary-operations)).
These instructions _are_ polymorphic on the number of elements in the type
though, for example `addf` is used with scalar floats, vectors of floats, and
tensors of floats (LLVM does the same thing with its scalar/vector types).
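
For instance, a minimal sketch of this shape polymorphism (operand names are
made up): the opcode distinguishes float from integer, while the trailing type
carries the element count.

```mlir
%x = addf %a, %b : f32                // scalar floats
%y = addf %v0, %v1 : vector<4xf32>    // vectors of floats
%z = addf %t0, %t1 : tensor<4x4xf32>  // tensors of floats
%w = addi %i0, %i1 : i32              // the integer counterpart is a separate op
```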

This split is important because floating point and integer operations are quite
different in practice: for example, floating point values include NaNs, so
[integer comparisons](http://llvm.org/docs/LangRef.html#icmp-instruction) and
[floating point comparisons](http://llvm.org/docs/LangRef.html#fcmp-instruction)
should use different comparison opcodes. On the arithmetic side of things,
floating point operations support rounding modes, floating point contractions,
["fast math"](http://llvm.org/docs/LangRef.html#fadd-instruction), and integers
may want to have two's complement overflow behavior or be undefined on
[various forms of wrapping](http://llvm.org/docs/LangRef.html#add-instruction)
for performance.

We are a long way from this sort of thing being a priority to care about in
MLIR, but since we have experience and know the right way to do this, we'd
rather design it in from the beginning.

Note that this rationale only applies to the "standard ops" dialect in which we
can express an opinion about its design. Other dialects generally try to model
an external system, and should aim to reflect its design as closely as possible.

### Specifying sign in integer comparison operations

Since integers are [signless](#signless-types), it is necessary to define the
sign for integer comparison operations. This sign indicates how to treat the
leading bit of the integer: as a sign bit (signed interpretation) or as a
regular value bit (unsigned interpretation). For example, comparing two `i4`
values `0b1000` and `0b0010` yields different results for unsigned (`8 > 2`) and
signed (`-8 < 2`) interpretations. This difference is only significant for
_order_ comparisons, but not for _equality_ comparisons. Indeed, for the latter
all bits must have the same value independently of the sign. Since both
arguments have exactly the same bit width and cannot be padded by this
operation, it is impossible to compare two values whose bit representations
would differ while the values are interpreted as equal.
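
A minimal sketch of the example above (value names are made up; `%lhs` holds
`0b1000` and `%rhs` holds `0b0010` as `i4`):

```mlir
%sgt = cmpi "sgt", %lhs, %rhs : i4  // signed:   -8 > 2 is false
%ugt = cmpi "ugt", %lhs, %rhs : i4  // unsigned:  8 > 2 is true
%eq = cmpi "eq", %lhs, %rhs : i4    // equality does not depend on the sign
```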

### Specifying comparison kind as attribute

Unlike arithmetic operations, comparison operators share several common
properties, e.g. they cannot be considered associative. In practice, comparisons
are sometimes implemented by the same instruction or its variants so it makes
sense to group them together at the IR level.

An alternative would be introducing ten distinct operators for all currently
supported kinds of integer comparisons. These operators would have increased the
number of "reserved" names used by standard operations as well as the size of
the C++ API while their implementations would have been mostly identical.

The comparison kind is internally an integer attribute. However, for the sake of
readability by humans, the custom assembly form accepts string literals that are
mapped to the underlying integer values: `cmpi "eq", %lhs, %rhs` better implies
integer equality comparison than `cmpi 0, %lhs, %rhs` where it is unclear what
gets compared to what else. This syntactic sugar is possible thanks to parser
logic redefinitions for the custom assembly form of non-builtin operations.
Supporting it in the full notation would have required changing how the main
parsing algorithm works and may have unexpected repercussions. While it would
have been possible to store the predicate as a string attribute, that would have
made it impossible to implement switching logic based on the comparison kind and
would have made attribute validity checks (one out of ten possible kinds) more
complex.
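
A minimal sketch of the two spellings of the same operation (the attribute name
`predicate` and the `eq = 0` encoding are stated here as assumptions about the
standard dialect, not as normative syntax):

```mlir
// Custom assembly form with a readable predicate string.
%r = cmpi "eq", %lhs, %rhs : i32

// Generic assembly form; the predicate is a plain integer attribute.
%r2 = "std.cmpi"(%lhs, %rhs) {predicate = 0 : i64} : (i32, i32) -> i1
```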

### 'select' operation to implement min/max

Although `min` and `max` operations are likely to occur as a result of
transforming affine loops in ML functions, we did not make them first-class
operations. Instead, we provide the `select` operation that can be combined with
`cmpi` to implement the minimum and maximum computation. Although they now
require two operations, they are likely to be emitted automatically during the
transformation inside MLIR. On the other hand, there are multiple benefits of
introducing `select`: standalone min/max would concern themselves with the
signedness of the comparison, already taken into account by `cmpi`; `select` can
support floats transparently if used after a float-comparison operation; the
lower-level targets provide `select`-like instructions making the translation
trivial.
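
For illustration, a minimal sketch of min/max built from the two operations
(value names are made up):

```mlir
// Signed maximum of %a and %b.
%gt = cmpi "sgt", %a, %b : i32
%max = select %gt, %a, %b : i32

// Unsigned minimum only changes the comparison predicate, not the select.
%lt = cmpi "ult", %a, %b : i32
%min = select %lt, %a, %b : i32
```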

This operation could have been implemented with additional control flow: `%r =
select %cond, %t, %f` is equivalent to

```mlir
^bb0:
  cond_br %cond, ^bb1(%t), ^bb1(%f)
^bb1(%r):
```

However, this control flow granularity is not available in the ML functions
where min/max, and thus `select`, are likely to appear. In addition, simpler
control flow may be beneficial for optimization in general.

### Regions

#### Attributes of type 'Block'

We considered representing regions through `ArrayAttr`s containing a list of a
special type `IRBlockAttr`, which in turn would contain a list of operations.
All attributes in MLIR are uniqued within the context, which would make the IR
inside the regions immortal for no good reason.

#### Use "inlined" functions as regions

We considered attaching a "force-inline" attribute on a function and/or a
function `call` operation. Even the minimal region support (use cases in
affine.for and affine.if existing before the regions) requires access to the
values defined in the dominating block, which is not supported by functions.
Conceptually, function bodies are instances of regions rather than the inverse;
regions can also be device kernels, alternative sections, etc.

#### Dedicated `region` operation

This would mean we have a special kind of operation that is allowed to have
regions while other operations are not. Such a distinction is similar to the
Stmt/Op difference we have had and chose to remove to make the IR simpler and
more flexible. It would also require analyses and passes to consider the
interplay between operations (e.g., an `affine.for` operation must be followed
by a region operation). Finally, a region operation can be introduced using the
current implementation, among other operations and without being special in any
sense.

#### Explicit capture of the values used in a region

Being able to use values defined outside the region implies that use-def chains
may contain uses from different nested regions. Consequently, IR transformations
and analyses can pull the instruction defining the value across region
boundaries, for example in case of TableGen-defined canonicalization patterns.
This would not be the case if all used values had been passed as region
arguments. One of the motivations for introducing regions in the IR is precisely
to enable cross-region analyses and transformations that are simpler than
inter-procedural transformations. Having uses from different regions appear in
the same use-def chain, as opposed to an additional data structure maintaining
correspondence between function call arguments as uses of the original
definitions and formal arguments as new definitions, enables such
simplification. Since individual operations now belong to blocks, which belong
to regions, it is always possible to check if the definition of the value
belongs to the same region as its particular use. The risk is that any IR
traversal will need to handle this situation explicitly, and it is easy to
forget such a check (or conversely it isn't easy to design the right check in a
tablegen pattern for example): traversing use-def chains potentially crosses
semantic barriers implicitly, making it possible to unknowingly break region
semantics. This is expected to be caught in the verifier after the
transformation.

At the same time, one may choose to pass certain or all values as region
arguments to explicitly break the use-def chains in the current proposal. This
can be combined with an attribute-imposed semantic requirement disallowing the
body of the region to refer to any value from outside it.

### Quantized integer operations

We haven't designed integer quantized operations in MLIR, but experience from
TensorFlow suggests that it is better to put information about the quantization
range/scale into the type itself, rather than having a single type like "qint8"
and putting the range/scale on attributes of the operation.

There are a few ways to do this with MLIR, including at least:

*   We could do the same thing TensorFlow does - and we will _have_ to support
    that model to some extent for compatibility.
*   We can encode the fp range of quantized integers directly into the types
    when they are constants. The best practice on this seems to be to encode the
    zero point as well as a scale factor. This ensures that 0.0 is always
    exactly representable, e.g. `qi8<-1.42, 31.23x>`.
*   We could theoretically encode dynamically determined ranges into the types
    using something like `qi8<?,?>` with the bounds being determined through the
    SSA dataflow graph dynamically - similar to how dynamic shapes are handled.

We will definitely need to do #1 for compatibility; we probably want to do #2,
and we should investigate #3 over time. That said, our short term plan is to get
more implementation experience with the rest of the system first, then come back
to re-examine the representation for quantized arithmetic when we have that
experience. When we do, we should chat with benoitjacob@ and
[read the paper](https://arxiv.org/abs/1712.05877).

### Dialect type extensions

This section describes the design decisions that shaped the dialect-extensible
type system present in MLIR.

#### Reserving dialect type kinds

Dialects that wish to define type extensions must reserve a range of type kinds
within a '.def' file within the core IR library. This means that every dialect
wishing to define custom types must modify this file, but it guarantees that all
type casting checks are performed in O(1) time.

#### Interactions between dialects

There are two different interactions between dialects that are important to
understand. When types of a dialect are:

*   In operations of other dialects

    -   For standard/builtin operations, only standard/builtin types are
        allowed. This restriction allows operations to clearly understand the
        invariants that they are working under.
    -   Outside of standard/builtin operations, dialects are expected to verify
        the allowable operation types per operation.

*   In types of other dialects

    -   For standard/builtin types, these types are allowed to contain types
        from other dialects. This simplifies the type system and removes the
        need for dialects to redefine all of the standard aggregate types, e.g.
        tensor, as well as the memref type. Dialects are expected to verify that
        a specific type is valid within a standard type, e.g. if a type can be
        an element of a tensor.
    -   For dialect types, the dialect is expected to verify any type
        invariants, e.g. if the standard tensor type can contain a specific type
        of that dialect.

#### Separating builtin and standard types

Following the separation between the built-in and standard dialect, it makes
sense to separate built-in types and standard dialect types. Built-in types are
required for the validity of the IR itself, e.g. the function type (which
appears in function signatures and generic assembly forms of operations).
Integer, float, vector, memref and tensor types, while important, are not
necessary for IR validity.

#### Unregistered types

MLIR supports unregistered operations in generic assembly form. MLIR also
supports a similar concept for types. When parsing, if the dialect for a dialect
type has not been registered, the type is modeled as an 'OpaqueType'. This
allows for types to be round-tripped without needing to link in the dialect
library that defined them. No additional information about opaque types, outside
of parsing/printing, will be available.

#### Dialect type syntax

Dialect-extended types are represented as string literals wrapped inside of the
dialect namespace. This means that the parser delegates to the dialect for
parsing specific type instances. This differs from the representation of
dialect-defined operations, which have an identifier name that the parser uses
to identify and parse them.

This representation was chosen for several reasons:

##### Dialects must provide custom type parsers

Dialect type parsing cannot plug into the existing parser infrastructure as
operations do with the OpAsmParser/Printer. Operations have a defined syntax
structure that is the same across all dialects. Types, on the other hand, may
have many different, and sometimes conflicting, parsing constraints that would
be difficult/unmaintainable to provide within a single interface.

This also has the added benefit of encouraging dialects to reuse existing
external type parsers. For example, an LLVM dialect may provide an MLIR LLVM
type that is simply a wrapper around LLVM types. The LLVM dialect would then use
the existing LLVM type parsing infrastructure.

Example:

```mlir
%s = "foo"() : () -> !llvm<"i32*">
```

##### Types do not always have canonical names

Unlike operations, types generally do not have a formal canonical name. For
example, function types have no defined keyword and integer types are defined by
a regular expression to support arbitrary bitwidth. Dialects with existing type
systems, e.g. LLVM, are likely to provide wrappers around their existing type
systems. For these wrapper types there is no simple canonical name; it's logical
to think of these types as existing within the namespace of the dialect. If a
dialect wishes to assign a canonical name to a type, it can be done via
[type aliases](LangRef.md#type-aliases).

### Tuple types

The MLIR type system provides first class support for defining
[tuple types](LangRef.md#tuple-type). This is due to the fact that `Tuple`
represents a universal concept that is likely to, and has already begun to,
present itself in many different dialects. Though this type is first class in
the type system, it merely serves to provide a common mechanism with which to
represent this concept in MLIR. As such, MLIR provides no standard operations
for interfacing with `tuple` types. It is up to dialect authors to provide
operations, e.g. extract_tuple_element, to interpret and manipulate them. When
possible, operations should prefer to use multiple results instead. These
provide a myriad of benefits, such as alleviating any need for tuple-extract
operations that merely get in the way of analysis and transformation.

### Assembly forms

MLIR supports both generic and custom assembly forms based on the following
considerations:

MLIR is an open system; it is designed to support modular and pluggable
dialects. Depending on whether there exists a corresponding dialect and whether
the dialect is plugged in, operations may or may not be registered into the MLIR
system. Yet we still need a way to investigate these operations. So the generic
assembly form is mandated by this aspect of the MLIR system. It provides a
default textual form for operations.

On the other hand, an assembly form exists to help developers investigate the
IR. The generic form serves as a safe fallback but it can be too verbose for
certain ops. Therefore, MLIR gives each dialect the choice to define a custom
assembly form for each operation according to the operation's semantics and
specific needs. The custom assembly form can de-duplicate information from the
operation to derive a more concise form, thus better facilitating the
comprehension of the IR.
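
A minimal sketch of the two forms of the same operation (the "std" namespace in
the generic form is an assumption about where the standard ops live):

```mlir
// Custom assembly form defined by the dialect.
%sum = addi %lhs, %rhs : i32

// Generic assembly form; it round-trips even when the dialect is not registered.
%sum2 = "std.addi"(%lhs, %rhs) : (i32, i32) -> i32
```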

## Examples

This section describes a few very simple examples that help understand how MLIR
represents computation.

### Non-affine control flow

```c
// A simple linear search in every row of a matrix
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    // dynamic control flow
    if (a[i][j] == key) {
      s[i] = j;
      break;
    }
  }
}
```

The presence of dynamic control flow leads to an inner non-affine function
nested in an outer function that uses affine loops.

```mlir
func @search(%A : memref<?x?xi32>, %S : memref<?xi32>, %key : i32) {
  %ni = dim %A, 0 : memref<?x?xi32>
  // This loop can be parallelized
  affine.for %i = 0 to %ni {
    call @search_body (%A, %S, %key, %i) : (memref<?x?xi32>, memref<?xi32>, i32, i32)
  }
  return
}

func @search_body(%A : memref<?x?xi32>, %S : memref<?xi32>, %key : i32, %i : i32) {
  %nj = dim %A, 1 : memref<?x?xi32>
  br ^bb1(0)

^bb1(%j : i32):
  %p1 = cmpi "slt", %j, %nj : i32
  cond_br %p1, ^bb2, ^bb5

^bb2:
  %v = affine.load %A[%i, %j] : memref<?x?xi32>
  %p2 = cmpi "eq", %v, %key : i32
  cond_br %p2, ^bb3(%j), ^bb4

^bb3(%j : i32):
  affine.store %j, %S[%i] : memref<?xi32>
  br ^bb5

^bb4:
  %jinc = addi %j, 1 : i32
  br ^bb1(%jinc)

^bb5:
  return
}
```

As per the [MLIR spec](LangRef.md), the restrictions on dimensions and symbol
identifiers to be used with the affine.apply operation only apply to accesses
inside `affine.for` and `affine.if` operations. However, an analysis of accesses
inside the called function (`@search_body`) is necessary to determine if the
`%i` loop could be parallelized: such function access analysis is
calling-context sensitive.

### Non-affine loop bounds

Loop bounds that are not affine lead to a nesting of functions as shown below.

```c
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    // Non-affine loop bound for k loop.
    for (k = 0; k < pow(2, j); k++)
      for (l = 0; l < N; l++) {
        // block loop body
        ...
      }
```

```mlir
func @outer_nest(%n : index) {
  affine.for %i = 0 to %n {
    affine.for %j = 0 to %n {
      %pow = call @pow(2, %j) : (index, index) -> index
      call @inner_nest(%pow, %n) : ...
    }
  }
  return
}

func @inner_nest(%m : index, %n : index) {
  affine.for %k = 0 to %m {
    affine.for %l = 0 to %n {
      ...
    }
  }
  return
}
```

### Reference 2D Convolution

The following example illustrates a reference implementation of a 2D
convolution, which uses an integer set `#domain` to represent valid input data
in a dilated convolution.

```mlir
// Dilation factors S0 and S1 can be constant folded if constant at compile time.
#domain = (d0, d1)[S0,S1,S2,S3]: (d0 % S0 == 0, d1 % S1 == 0, d0 >= 0, d1 >= 0,
                                  S3 - d0 - 1 >= 0, S4 - d1 - 1 >= 0)
// Identity map (shown here for illustration).
#map0 = (d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3, d4, d5, d6)

// Affine map from output to input coordinate space.
// d0 = output_h, d1 = output_w, d2 = kernel_h, d3 = kernel_w
// S0 = h_stride, S1 = w_stride, S2 = h_kernel_dilation, S3 = w_kernel_dilation
// S4 = h_pad_low, S5 = w_pad_low
//   %out0 = %0#1 * %h_stride + %0#4 * %h_kernel_dilation - %h_pad_low
//   %out1 = %0#2 * %w_stride + %0#5 * %w_kernel_dilation - %w_pad_low
#map1_0 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d0 * S0 + d2 * S2 - S4)
#map1_1 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d1 * S1 + d3 * S3 - S5)

// Semi-affine map to undilated input coordinate space.
// d0 = input_h, d1 = input_w, S0 = h_base_dilation, S1 = w_base_dilation.
#map2_0 = (d0, d1) [S0, S1] -> (d0 / S0)
#map2_1 = (d0, d1) [S0, S1] -> (d1 / S1)

// Conv2D shapes:
// input:  [batch, input_height, input_width, input_feature]
// kernel: [kernel_height, kernel_width, input_feature, output_feature]
// output: [batch, output_height, output_width, output_feature]
func @conv2d(%input: memref<16x1024x1024x3xf32, #lm0, /*scratchpad=*/1>,
             %kernel: memref<5x5x3x32xf32, #lm0, /*scratchpad=*/1>,
             %output: memref<16x512x512x32xf32, #lm0, /*scratchpad=*/1>) {
  affine.for %b = 0 to %batch {
    affine.for %oh = 0 to %output_height {
      affine.for %ow = 0 to %output_width {
        affine.for %of = 0 to %output_feature {
          affine.for %kh = 0 to %kernel_height {
            affine.for %kw = 0 to %kernel_width {
              affine.for %if = 0 to %input_feature {
                // Calculate input indices.
                %1_0 = affine.apply #map1_0 (%0#1, %0#2, %0#4, %0#5)
                    [%h_stride, %w_stride, %h_kernel_dilation,
                     %w_kernel_dilation, %h_pad_low, %w_pad_low]
                %1_1 = affine.apply #map1_1 (%0#1, %0#2, %0#4, %0#5)
                    [%h_stride, %w_stride, %h_kernel_dilation,
                     %w_kernel_dilation, %h_pad_low, %w_pad_low]

                // Check if access is not in padding.
                affine.if #domain(%1_0, %1_1)
                    [%h_base_dilation, %w_kernel_dilation, %h_bound, %w_bound] {
                  %2_0 = affine.apply #map2_0 (%1_0, %1_1)
                  %2_1 = affine.apply #map2_1 (%1_0, %1_1)
                  // Compute: output[output_indices] += input[input_indices] * kernel[kernel_indices]
                  call @multiply_accumulate(%input, %kernel, %output, %b, %oh, %ow, %of, %kh, %kw, %if, %2_0, %2_1)
                }
              }
            }
          }
        }
      }
    }
  }
  return
}
```

TODO (Add more examples showing the IR for a variety of interesting cases)

## Design alternatives and extensions

This is a list of some design alternatives and extensions that we discussed in
detail but did not include in the spec, or postponed for future consideration on
demand. We will revisit these discussions when we have more implementation
experience and learn more about the challenges and limitations of our current
design in practice.

### Polyhedral code representation alternatives: schedule lists vs schedule trees vs affine loop/if forms

The current MLIR represents polyhedral schedules using a tree of if/for loops.
We extensively debated the tradeoffs involved in the typical unordered
polyhedral instruction representation (where each instruction has
multidimensional schedule information), discussed the benefits of schedule tree
forms, and eventually decided to go with a syntactic tree of affine if/else
conditionals and affine for loops. Discussion of the tradeoff was captured in
this document:
[MLIR: The case for a simplified polyhedral form](RationaleSimplifiedPolyhedralForm.md).

At a high level, we have two alternatives here:

1.  Schedule tree representation instead of an affine loop AST form: The current
    proposal uses an affine loop and conditional tree form, which is syntactic
    and has no separation of domains as sets and schedules as multidimensional
    affine functions. A schedule tree form however makes polyhedral domains and
    schedules a first class concept in the IR allowing compact expression of
    transformations through the schedule tree without changing the domains of
    instructions. Such a representation also hides prologues, epilogues, partial
    tiles, complex loop bounds and conditionals making loop nests free of
    "syntax". Cost models instead look at domains and schedules. In addition, if
    necessary such a domain schedule representation can be normalized to
    explicitly propagate the schedule into domains and model all the cleanup
    code. An example and more detail on the schedule tree form is in the next
    section.
1.  Having two different forms of "affine regions": an affine loop tree form and
    a polyhedral schedule tree form. In the latter, ops could carry attributes
    capturing domain, scheduling, and other polyhedral code generation options
    with IntegerSet, AffineMap, and other attributes.

#### Schedule Tree Representation for Affine Regions

This representation is based on a simplified form of the domain/schedule
representation used by the polyhedral compiler community. Domains represent what
has to be executed while schedules represent the order in which domain elements
are interleaved. We model domains as non-piece-wise convex integer sets, and
schedules as affine functions; however, the former can be disjunctive, and the
latter can be piece-wise affine relations. In the schedule tree representation,
domains and schedules for instructions are represented in a tree-like structure
which is called a schedule tree. Each non-leaf node of the tree is an abstract
polyhedral dimension corresponding to an abstract fused loop for each ML
instruction that appears in that branch. Each leaf node is an ML Instruction.

```mlir
// A tiled matmul code (128x128x128) represented in schedule tree form

// #map0 = (d0, d1, d2, d3, d4, d5) -> (128*d0 + d3, 128*d1 + d4, 128*d2 + d5)
#intset_ij = (i, j) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0, -j + N-1 >= 0
#intset_ijk = (i, j, k) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0,
                                    -j + M-1 >= 0, k >= 0, -k + N - 1 >= 0
func @matmul(%A, %B, %C, %M, %N, %K) : (...) { // %M, N, K are symbols
  // t1, t2, t3, t4, t5, t6 are abstract polyhedral loops
  mldim %t1 : {S1,S2,S3,S4,S5} floordiv (i, 128) {
    mldim %t2 : {S1,S2,S3,S4,S5} floordiv (j, 128) {
      // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
      call dma_mem_to_scratchpad(%C, %i, %j, %M, %N, %K)
        with @intset_ij(%i, %j) [%M, %N, %K]
      mldim %t3 : {S2,S3,S4,S5} floordiv (k, 128) {
        // (%i, %j, %k) = affine.apply (d0, d1, d2)
        //                          -> (128*d0, 128*d1, 128*d2) (%t1, %t2, %t3)
        call dma_mem_to_scratchpad(%A, ...) with #inset_ijk (%i, %j, %k) [%M, %N, %K]
        // (%i, %j, %k) = affine.apply (d0, d1, d2)
        //                          -> (128*d0, 128*d1, 128*d2) (%t1, %t2, %t3)
        call dma_mem_to_scratchpad(%B, ...) with #inset_ijk (%i, %j, %k) [%M, %N, %K]
        mldim %t4 : {S4} i mod 128 {
          mldim %t5 : {S4} j mod 128 {
            mldim %t6 : {S4} k mod 128 {
              // (%i, %j, %k) = affine.apply #map0 (%t1, %t2, %t3, %t4, %t5, %t6)
              call matmul_body(A, B, C, %i, %j, %k, %M, %N, %K)
                  with #inset_ijk(%i, %j, %k) [%M, %N, %K]
            } // end mldim t6
          } // end mldim t5
        } // end mldim t4
      } // end mldim t3
      // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
      call $dma_scratchpad_to_mem_C ... with #intset(%i, %j) [%M, %N, %K]
    } // end mldim t2
  } // end mldim t1
  return
}
```

### Affine Relations

The current MLIR spec includes affine maps and integer sets, but not affine
relations. Affine relations are a natural way to model read and write access
information, which can be very useful to capture the behavior of opaque external
library calls, high-performance vendor libraries, or user-provided / user-tuned
routines.

An affine relation is a relation between input and output dimension identifiers
while being symbolic on a list of symbolic identifiers and with affine
constraints on the identifiers.

Syntax:

```
// Affine relation definition at the top of file
affine-rel-def ::= affine-rel-id `=` affine-relation-inline

affine-rel-id ::= `##` prefixed-id

affine-relation-inline ::=
  `(` input-dims `)` (`[` symbols `]`)? `->`
  `(` output-dims `)` : affine-constraint-conjunction

input-dims ::= bare-id-list
output-dims ::= bare-id-list
symbols ::= bare-id-list

affine-rel ::= affine-rel-id | affine-relation-inline

// Usage
affine-rel-spec ::= affine-rel dim-and-symbol-use-list
```

All identifiers appearing in input-dims, output-dims, and symbol-dims are
pairwise distinct. All affine-constraint non-terminals in the above syntax are
allowed to contain identifiers only from input-dims, output-dims, and
symbol-dims.

Affine relations are used to model read, write, may_read, and may_write sets of
functions in the IR. The output dimension identifiers correspond to the data
dimensions.

Example:

```mlir
// read relation: two elements ( d0 <= r0 <= d0+1 )
##aff_rel9 = (d0) -> (r0) : r0 - d0 >= 0, d0 - r0 + 1 >= 0

func @count (%A : memref<128xf32>, %pos : i32) -> f32
  reads: {%A ##aff_rel9 (%pos)}
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */ {
^bb0 (%0, %1: memref<128xf32>, i64):
  %val = affine.load %A [%pos]
  %val2 = affine.load %A [%pos + 1]
  %p = mulf %val, %val2 : f32
  return %p : f32
}
```

### Regions

#### Making function definition an operation

MLIR supports values of a Function type. Instead of having a first-class IR
concept for functions, one could define an operation with a body region that
defines a function value. The particularity of functions is that their names are
globally visible and can be referred to before being defined, unlike SSA values
that must be defined first. Implementing a "function definition" operation would
require relaxing some of the SSA constraints in a region, and also making the IR
Module a region as well. It would also affect the core infrastructure (e.g.,
function passes) only for the sake of concept unification.

#### Having types on a region

Instead of inspecting the types of arguments of the first block, one could give
the region itself a type. This type would be redundant with block argument
types, which must be present anyway, and would create room for type mismatches.
While functions do have types that are partly redundant with the arguments of
the first block in the function, this is necessary to support function
declarations that do not have a body which we can refer to in order to obtain
the argument types. A region is always contained in an operation or a function
that can be queried to obtain the "type" of the region if necessary.

A type on a region can be justified if Regions were to be considered separately
from the enclosing entity (operation or function) and had their own semantics
that should be checked.

#### Attaching attributes to regions

Regions could be annotated with dialect attributes to use attribute verification
hooks. An operation could take multiple regions as arguments, and each of them
may require different attributes. However, there are currently very few
practical cases where this would be necessary. Instead, one could simulate
per-region attributes with array attributes attached to the entity containing
the region (operation or function). This decreases the overall complexity of the
IR and enables more concise and op-specific forms, e.g., when all regions of an
op have the same attribute, which can then be mentioned only once. Since the
semantics of the region is entirely defined by the enclosing entity, it also
makes sense to have attributes attached to that entity rather than to the region
itself.

This can be reconsidered in the future if we see a non-negligible amount of use
cases.

### Read/Write/May_Read/May_Write sets for External Functions

Having read, write, may_read, and may_write sets for external functions
(including opaque ones, high-performance vendor libraries such as CuDNN, CuB,
MKL, FFT libraries, user-provided/optimized functions, or data movement runtimes
such as DMA ones) is a powerful feature. It allows the compiler to perform
analysis and composition/transformation in the presence of such calls and with
loops around such calls on sub-tensors. For user-provided or custom hand-tuned
functions, the read/write/may_read/may_write sets could be provided a priori by
a user as part of the external function signature or they could be part of a
database.

TODO: Design this, and update to use function attribute syntax.

Example:

```mlir
##rel9 ( ) [s0] -> (r0, r1) : 0 <= r0 <= 1023, 0 <= r1 <= s0 - 1

func @cblas_reduce_ffi(%M: memref<1024 x ? x f32, #layout_map0, /*mem=*/0>)
  -> f32 [
  reads: {%M, ##rel9() }
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */
]

func @dma_mem_to_scratchpad(%a : memref<1024 x f32, #layout_map0, /*mem=*/0>,
                            %b : memref<1024 x f32, #layout_map0, 1>,
                            %c : memref<1024 x f32, #layout_map0>) [
  reads: {%M, ##rel9() }
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */
]
```

### Memref Extensions

1.  Arbitrary polyhedral shapes for tensors: e.g., triangular shapes in tensor
    dimensions where there is symmetry: use integer set (affine constraints) to
    model tensor data space (instead of just extents). Requires some changes to
    the IR and the in-memory form.
1.  Layout maps

    1.  Allow piece-wise affine maps for layouts: allows clean modeling of
        boundary cases for images/tensors through padding, wrapping, mirroring,
        padding where padded values are the results of computation as opposed to
        data, padding in the interior as opposed to just boundaries.
    1.  Allow many-to-one layout maps: Index and layout maps in the current
        proposal are bijective. Extending them to many-to-one layout maps allows
        cleaner(?) modeling of broadcast/reduce style computations while reusing
        memory.

Proposal 2(a) requires non-trivial changes to the IR and the in-memory
representation. 2(b) requires no change, but impacts how cost models look at
index and layout maps.

### `affine.if` and `affine.for` Extensions for "Escaping Scalars"

We considered providing a representation for SSA values that are live out of
`if/else` conditional bodies and loop carried in `affine.for` loops. We
ultimately abandoned this approach due to its complexity. In the current design
of MLIR, scalar variables cannot escape for loops or if instructions. In
situations where escaping is necessary, we use zero-dimensional tensors and
memrefs instead of scalars.

**TODO**: This whole section is obsolete and should be updated to use block
arguments and a yield-like terminator in for/if instructions.

The abandoned design of supporting escaping scalars is as follows:

#### affine.for Instruction

Syntax:

```
[<out-var-list> =]
for %<index-variable-name> = <lower-bound> ... <upper-bound> step <step>
  [with <in-var-list>] { <loop-instruction-list> }
```

out-var-list is a comma-separated list of SSA values defined in the loop body
and used outside the loop body. in-var-list is a comma-separated list of SSA
values used inside the loop body and their initializers. loop-instruction-list
is a list of instructions that may also include a yield instruction.

Example:

```mlir
// Return the sum of the elements in the 1-dimensional memref A.
func i32 @sum(%A : memref<?xi32>, %N : i32) -> (i32) {
  %init = 0
  %result = affine.for %i = 0 to N with %tmp(%init) {
    %value = affine.load %A[%i]
    %sum = %value + %tmp
    yield %sum
  }
  return %result : i32
}
```

#### affine.if/else Instruction

Syntax:

```
<out-var-list> = affine.if (<cond-list>) {...} [else {...}]
```

Out-var-list is a list of SSA values defined by the if-instruction. The values
are arguments to the yield-instruction that occurs in both then and else clauses
when the else clause is present. When the if instruction contains only the if
clause, the escaping value defined in the then clause should be merged with the
value the variable had before the if instruction. The design captured here does
not handle this situation.

Example:

```mlir
// Compute the sum of half of the array
func i32 @sum_half(%A : memref<?xi32>, %N : i32) -> (i32) {
  %s0 = 0
  %s1 = affine.for %i = 1 ... N step 1 with %s2 (%s0) {
    %s3 = if (%i >= %N / 2) {
      %v0 = affine.load %A[%i]
      %s4 = %s2 + %v0
      yield %s4
    }
    yield %s3
  }
  return %s1 : i32
}
```

### Multithreading the compiler

People want compilers to go fast, and one simple way to do that is to
multi-thread them. There are multiple strategies for this, but a simple one is
to optimize and compile separate functions in parallel. LLVM's original pass
manager anticipated this demand, and the CallGraphSCCPass manager is even
designed to support this as well, but unfortunately, a few early design
decisions in LLVM prevent this from ever happening. Instead, things like ThinLTO
are forced to split programs into separate LLVM modules/contexts and optimize
those chunks independently.

The problem is that LLVM has several objects in its IR that are globally uniqued
and also mutable: notably constants like `i32 0`. In LLVM, these constants are
`Value`s, which allows them to be used as operands to instructions and means
that they also have SSA use lists. Because these things are uniqued, every
`i32 0` in any function shares a use list. This means that optimizing multiple
functions in parallel won't work (at least without some sort of synchronization
on the use lists, which would be unbearably inefficient).

MLIR now supports a multithreaded pass manager. We do this through several
design choices:

1.  MLIR makes extensive use of uniqued immutable data structures (affine
    expressions, types, etc. are all immutable, uniqued, and immortal).
2.  Constants are defined in per-function pools, instead of being globally
    uniqued.
3.  Functions themselves are not SSA values either, so they don't have the same
    problem as constants.
4.  FunctionPasses are copied (through their copy ctor) into one instance per
    thread, avoiding sharing of local state across threads.

This allows MLIR function passes to support efficient multithreaded compilation
and code generation.
|