CbC/CbC_llvm: mlir/docs/Dialects/Vector.md annotate

annotate mlir/docs/Dialects/Vector.md @ 194:f2ef29ba5fe2

try to remove setjmp.h start

author	Shinji KONO <kono@ie.u-ryukyu.ac.jp>
date	Mon, 22 Mar 2021 18:10:23 +0900
parents	0572611fdcc8
children	2e18cbf3894f

rev	line source
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	1 # 'vector' Dialect
0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	2
0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	3 [TOC]
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	4
1d019706d866 LLVM10 anatofuz parents: diff changeset	5 MLIR supports multi-dimensional `vector` types and custom operations on those
1d019706d866 LLVM10 anatofuz parents: diff changeset	6 types. A generic, retargetable, higher-order ``vector`` type (`n-D` with `n >
1d019706d866 LLVM10 anatofuz parents: diff changeset	7 1`) is a structured type, that carries semantic information useful for
1d019706d866 LLVM10 anatofuz parents: diff changeset	8 transformations. This document discusses retargetable abstractions that exist
1d019706d866 LLVM10 anatofuz parents: diff changeset	9 in MLIR today and operate on ssa-values of type `vector` along with pattern
1d019706d866 LLVM10 anatofuz parents: diff changeset	10 rewrites and lowerings that enable targeting specific instructions on concrete
1d019706d866 LLVM10 anatofuz parents: diff changeset	11 targets. These abstractions serve to separate concerns between operations on
1d019706d866 LLVM10 anatofuz parents: diff changeset	12 `memref` (a.k.a buffers) and operations on ``vector`` values. This is not a
1d019706d866 LLVM10 anatofuz parents: diff changeset	13 new proposal but rather a textual documentation of existing MLIR components
1d019706d866 LLVM10 anatofuz parents: diff changeset	14 along with a rationale.
1d019706d866 LLVM10 anatofuz parents: diff changeset	15
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	16 ## Positioning in the Codegen Infrastructure
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	17 The following diagram, recently presented with the [StructuredOps
1d019706d866 LLVM10 anatofuz parents: diff changeset	18 abstractions](https://drive.google.com/corp/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
1d019706d866 LLVM10 anatofuz parents: diff changeset	19 captures the current codegen paths implemented in MLIR in the various existing
1d019706d866 LLVM10 anatofuz parents: diff changeset	20 lowering paths.
1d019706d866 LLVM10 anatofuz parents: diff changeset	21 ![](https://user-images.githubusercontent.com/10148468/71177417-f78e4d80-2239-11ea-92ef-700f42ea503f.png)
1d019706d866 LLVM10 anatofuz parents: diff changeset	22
1d019706d866 LLVM10 anatofuz parents: diff changeset	23 The following diagram seeks to isolate `vector` dialects from the complexity
1d019706d866 LLVM10 anatofuz parents: diff changeset	24 of the codegen paths and focus on the payload-carrying ops that operate on std
1d019706d866 LLVM10 anatofuz parents: diff changeset	25 and `vector` types. This diagram is not to be taken as set in stone and
1d019706d866 LLVM10 anatofuz parents: diff changeset	26 representative of what exists today but rather illustrates the layering of
1d019706d866 LLVM10 anatofuz parents: diff changeset	27 abstractions in MLIR.
1d019706d866 LLVM10 anatofuz parents: diff changeset	28
1d019706d866 LLVM10 anatofuz parents: diff changeset	29 ![`vector` Abstractions in MLIR](https://user-images.githubusercontent.com/10148468/71176949-e85ad000-2238-11ea-9806-200843bc4943.png)
1d019706d866 LLVM10 anatofuz parents: diff changeset	30
1d019706d866 LLVM10 anatofuz parents: diff changeset	31 This separates concerns related to (a) defining efficient operations on
1d019706d866 LLVM10 anatofuz parents: diff changeset	32 `vector` types from (b) program analyses + transformations on `memref`, loops
1d019706d866 LLVM10 anatofuz parents: diff changeset	33 and other types of structured ops (be they `HLO`, `LHLO`, `Linalg` or other ).
1d019706d866 LLVM10 anatofuz parents: diff changeset	34 Looking a bit forward in time, we can put a stake in the ground and venture
1d019706d866 LLVM10 anatofuz parents: diff changeset	35 that the higher level of `vector`-level primitives we build and target from
1d019706d866 LLVM10 anatofuz parents: diff changeset	36 codegen (or some user/language level), the simpler our task will be, the more
1d019706d866 LLVM10 anatofuz parents: diff changeset	37 complex patterns can be expressed and the better performance will be.
1d019706d866 LLVM10 anatofuz parents: diff changeset	38
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	39 ## Components of a Generic Retargetable Vector-Level Dialect
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	40 The existing MLIR `vector`-level dialects are related to the following
1d019706d866 LLVM10 anatofuz parents: diff changeset	41 bottom-up abstractions:
1d019706d866 LLVM10 anatofuz parents: diff changeset	42
1d019706d866 LLVM10 anatofuz parents: diff changeset	43 1. Representation in `LLVMIR` via data structures, instructions and
1d019706d866 LLVM10 anatofuz parents: diff changeset	44 intrinsics. This is referred to as the `LLVM` level.
1d019706d866 LLVM10 anatofuz parents: diff changeset	45 2. Set of machine-specific operations and types that are built to translate
1d019706d866 LLVM10 anatofuz parents: diff changeset	46 almost 1-1 with the HW ISA. This is referred to as the Hardware Vector level;
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	47 a.k.a `HWV`. For instance, we have (a) the `NVVM` dialect (for `CUDA`) with
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	48 tensor core ops, (b) accelerator-specific dialects (internal), a potential
1d019706d866 LLVM10 anatofuz parents: diff changeset	49 (future) `CPU` dialect to capture `LLVM` intrinsics more closely and other
1d019706d866 LLVM10 anatofuz parents: diff changeset	50 dialects for specific hardware. Ideally this should be auto-generated as much
1d019706d866 LLVM10 anatofuz parents: diff changeset	51 as possible from the `LLVM` level.
1d019706d866 LLVM10 anatofuz parents: diff changeset	52 3. Set of virtual, machine-agnostic, operations that are informed by costs at
1d019706d866 LLVM10 anatofuz parents: diff changeset	53 the `HWV`-level. This is referred to as the Virtual Vector level; a.k.a
1d019706d866 LLVM10 anatofuz parents: diff changeset	54 `VV`. This is the level that higher-level abstractions (codegen, automatic
1d019706d866 LLVM10 anatofuz parents: diff changeset	55 vectorization, potential vector language, ...) targets.
1d019706d866 LLVM10 anatofuz parents: diff changeset	56
1d019706d866 LLVM10 anatofuz parents: diff changeset	57 The existing generic, retargetable, `vector`-level dialect is related to the
1d019706d866 LLVM10 anatofuz parents: diff changeset	58 following top-down rewrites and conversions:
1d019706d866 LLVM10 anatofuz parents: diff changeset	59
1d019706d866 LLVM10 anatofuz parents: diff changeset	60 1. MLIR Rewrite Patterns applied by the MLIR `PatternRewrite` infrastructure
1d019706d866 LLVM10 anatofuz parents: diff changeset	61 to progressively lower to implementations that match closer and closer to the
1d019706d866 LLVM10 anatofuz parents: diff changeset	62 `HWV`. Some patterns are "in-dialect" `VV -> VV` and some are conversions `VV
1d019706d866 LLVM10 anatofuz parents: diff changeset	63 -> HWV`.
1d019706d866 LLVM10 anatofuz parents: diff changeset	64 2. `Virtual Vector -> Hardware Vector` lowering is specified as a set of MLIR
1d019706d866 LLVM10 anatofuz parents: diff changeset	65 lowering patterns that are specified manually for now.
1d019706d866 LLVM10 anatofuz parents: diff changeset	66 3. `Hardware Vector -> LLVM` lowering is a mechanical process that is written
1d019706d866 LLVM10 anatofuz parents: diff changeset	67 manually at the moment and that should be automated, following the `LLVM ->
1d019706d866 LLVM10 anatofuz parents: diff changeset	68 Hardware Vector` ops generation as closely as possible.
1d019706d866 LLVM10 anatofuz parents: diff changeset	69
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	70 ## Short Description of the Existing Infrastructure
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	71
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	72 ### LLVM level
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	73 On CPU, the `n-D` `vector` type currently lowers to
1d019706d866 LLVM10 anatofuz parents: diff changeset	74 `!llvm<array<vector>>`. More concretely, `vector<4x8x128xf32>` lowers to
1d019706d866 LLVM10 anatofuz parents: diff changeset	75 `!llvm<[4 x [ 8 x [ 128 x float ]]]>`.
1d019706d866 LLVM10 anatofuz parents: diff changeset	76 There are tradeoffs involved related to how one can access subvectors and how
1d019706d866 LLVM10 anatofuz parents: diff changeset	77 one uses `llvm.extractelement`, `llvm.insertelement` and
1d019706d866 LLVM10 anatofuz parents: diff changeset	78 `llvm.shufflevector`. A [deeper dive section](#DeeperDive) discusses the
1d019706d866 LLVM10 anatofuz parents: diff changeset	79 current lowering choices and tradeoffs.
1d019706d866 LLVM10 anatofuz parents: diff changeset	80
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	81 ### Hardware Vector Ops
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	82 Hardware Vector Ops are implemented as one dialect per target.
1d019706d866 LLVM10 anatofuz parents: diff changeset	83 For internal hardware, we are auto-generating the specific HW dialects.
1d019706d866 LLVM10 anatofuz parents: diff changeset	84 For `GPU`, the `NVVM` dialect adds operations such as `mma.sync`, `shfl` and
1d019706d866 LLVM10 anatofuz parents: diff changeset	85 tests.
1d019706d866 LLVM10 anatofuz parents: diff changeset	86 For `CPU` things are somewhat in-flight because the abstraction is close to
1d019706d866 LLVM10 anatofuz parents: diff changeset	87 `LLVMIR`. The jury is still out on whether a generic `CPU` dialect is
1d019706d866 LLVM10 anatofuz parents: diff changeset	88 concretely needed, but it seems reasonable to have the same levels of
1d019706d866 LLVM10 anatofuz parents: diff changeset	89 abstraction for all targets and perform cost-based lowering decisions in MLIR
1d019706d866 LLVM10 anatofuz parents: diff changeset	90 even for `LLVM`.
1d019706d866 LLVM10 anatofuz parents: diff changeset	91 Specialized `CPU` dialects that would capture specific features not well
1d019706d866 LLVM10 anatofuz parents: diff changeset	92 captured by LLVM peephole optimizations of on different types that core MLIR
1d019706d866 LLVM10 anatofuz parents: diff changeset	93 supports (e.g. Scalable Vectors) are welcome future extensions.
1d019706d866 LLVM10 anatofuz parents: diff changeset	94
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	95 ### Virtual Vector Ops
0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	96 Some existing Standard and Vector Dialect on `n-D` `vector` types comprise:
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	97 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	98 %2 = std.addf %0, %1 : vector<3x7x8xf32> // -> vector<3x7x8xf32>
1d019706d866 LLVM10 anatofuz parents: diff changeset	99 %2 = std.mulf %0, %1 : vector<3x7x8xf32> // -> vector<3x7x8xf32>
1d019706d866 LLVM10 anatofuz parents: diff changeset	100 %2 = std.splat %1 : vector<3x7x8xf32> // -> vector<3x7x8xf32>
1d019706d866 LLVM10 anatofuz parents: diff changeset	101
1d019706d866 LLVM10 anatofuz parents: diff changeset	102 %1 = vector.extract %0[1]: vector<3x7x8xf32> // -> vector<7x8xf32>
1d019706d866 LLVM10 anatofuz parents: diff changeset	103 %1 = vector.extract %0[1, 5]: vector<3x7x8xf32> // -> vector<8xf32>
1d019706d866 LLVM10 anatofuz parents: diff changeset	104 %2 = vector.outerproduct %0, %1: vector<4xf32>, vector<8xf32> // -> vector<4x8xf32>
1d019706d866 LLVM10 anatofuz parents: diff changeset	105 %3 = vector.outerproduct %0, %1, %2: vector<4xf32>, vector<8xf32> // fma when adding %2
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	106 %3 = vector.strided_slice %0 {offsets = [2, 2], sizes = [2, 2], strides = [1, 1]}:
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	107 vector<4x8x16xf32> // Returns a slice of type vector<2x2x16xf32>
1d019706d866 LLVM10 anatofuz parents: diff changeset	108
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	109 %2 = vector.transfer_read %A[%0, %1]
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	110 {permutation_map = (d0, d1) -> (d0)}: memref<7x?xf32>, vector<4xf32>
1d019706d866 LLVM10 anatofuz parents: diff changeset	111
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	112 vector.transfer_write %f1, %A[%i0, %i1, %i2, %i3]
0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	113 {permutation_map = (d0, d1, d2, d3) -> (d3, d1, d0)} :
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	114 vector<5x4x3xf32>, memref<?x?x?x?xf32>
1d019706d866 LLVM10 anatofuz parents: diff changeset	115 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	116
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	117 The list of Vector is currently undergoing evolutions and is best kept
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	118 track of by following the evolution of the
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	119 [VectorOps.td](https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/Dialect/Vector/VectorOps.td)
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	120 ODS file (markdown documentation is automatically generated locally when
1d019706d866 LLVM10 anatofuz parents: diff changeset	121 building and populates the [Vector
1d019706d866 LLVM10 anatofuz parents: diff changeset	122 doc](https://github.com/llvm/llvm-project/blob/master/mlir/docs/Dialects/Vector.md)). Recent
1d019706d866 LLVM10 anatofuz parents: diff changeset	123 extensions are driven by concrete use cases of interest. A notable such use
1d019706d866 LLVM10 anatofuz parents: diff changeset	124 case is the `vector.contract` op which applies principles of the StructuredOps
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	125 abstraction to `vector` types.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	126
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	127 ### Virtual Vector Rewrite Patterns
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	128
1d019706d866 LLVM10 anatofuz parents: diff changeset	129 The following rewrite patterns exist at the `VV->VV` level:
1d019706d866 LLVM10 anatofuz parents: diff changeset	130
1d019706d866 LLVM10 anatofuz parents: diff changeset	131 1. The now retired `MaterializeVector` pass used to legalize ops on a
1d019706d866 LLVM10 anatofuz parents: diff changeset	132 coarse-grained virtual `vector` to a finer-grained virtual `vector` by
1d019706d866 LLVM10 anatofuz parents: diff changeset	133 unrolling. This has been rewritten as a retargetable unroll-and-jam pattern on
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	134 `vector` ops and `vector` types.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	135 2. The lowering of `vector_transfer` ops legalizes `vector` load/store ops to
1d019706d866 LLVM10 anatofuz parents: diff changeset	136 permuted loops over scalar load/stores. This should evolve to loops over
1d019706d866 LLVM10 anatofuz parents: diff changeset	137 `vector` load/stores + `mask` operations as they become available `vector` ops
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	138 at the `VV` level.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	139
1d019706d866 LLVM10 anatofuz parents: diff changeset	140 The general direction is to add more Virtual Vector level ops and implement
1d019706d866 LLVM10 anatofuz parents: diff changeset	141 more useful `VV -> VV` rewrites as composable patterns that the PatternRewrite
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	142 infrastructure can apply iteratively.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	143
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	144 ### Virtual Vector to Hardware Vector Lowering
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	145 For now, `VV -> HWV` are specified in C++ (see for instance the
1d019706d866 LLVM10 anatofuz parents: diff changeset	146 [SplatOpLowering for n-D
1d019706d866 LLVM10 anatofuz parents: diff changeset	147 vectors](https://github.com/tensorflow/mlir/commit/0a0c4867c6a6fcb0a2f17ef26a791c1d551fe33d)
1d019706d866 LLVM10 anatofuz parents: diff changeset	148 or the [VectorOuterProductOp
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	149 lowering](https://github.com/tensorflow/mlir/commit/957b1ca9680b4aacabb3a480fbc4ebd2506334b8)).
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	150
1d019706d866 LLVM10 anatofuz parents: diff changeset	151 Simple [conversion
1d019706d866 LLVM10 anatofuz parents: diff changeset	152 tests](https://github.com/llvm/llvm-project/blob/master/mlir/test/Conversion/VectorToLLVM/vector-to-llvm.mlir)
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	153 are available for the `LLVM` target starting from the Virtual Vector Level.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	154
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	155 ## Rationale
0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	156 ### Hardware as `vector` Machines of Minimum Granularity
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	157
1d019706d866 LLVM10 anatofuz parents: diff changeset	158 Higher-dimensional `vector`s are ubiquitous in modern HPC hardware. One way to
1d019706d866 LLVM10 anatofuz parents: diff changeset	159 think about Generic Retargetable `vector`-Level Dialect is that it operates on
1d019706d866 LLVM10 anatofuz parents: diff changeset	160 `vector` types that are a multiples of a "good" `vector` size so the HW can
1d019706d866 LLVM10 anatofuz parents: diff changeset	161 efficiently implement a set of high-level primitives
1d019706d866 LLVM10 anatofuz parents: diff changeset	162 (e.g. `vector<8x8x8x16xf32>` when HW `vector` size is say `vector<4x8xf32>`).
1d019706d866 LLVM10 anatofuz parents: diff changeset	163
1d019706d866 LLVM10 anatofuz parents: diff changeset	164 Some notable `vector` sizes of interest include:
1d019706d866 LLVM10 anatofuz parents: diff changeset	165
1d019706d866 LLVM10 anatofuz parents: diff changeset	166 1. CPU: `vector<HW_vector_size * k>`, `vector<core_count * k’ x
1d019706d866 LLVM10 anatofuz parents: diff changeset	167 HW_vector_size * k>` and `vector<socket_count x core_count * k’ x
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	168 HW_vector_size * k>`
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	169 2. GPU: `vector<warp_size * k>`, `vector<warp_size * k x float4>` and
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	170 `vector<warp_size * k x 4 x 4 x 4>` for tensor_core sizes,
0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	171 3. Other accelerators: n-D `vector` as first-class citizens in the HW.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	172
1d019706d866 LLVM10 anatofuz parents: diff changeset	173 Depending on the target, ops on sizes that are not multiples of the HW
1d019706d866 LLVM10 anatofuz parents: diff changeset	174 `vector` size may either produce slow code (e.g. by going through `LLVM`
1d019706d866 LLVM10 anatofuz parents: diff changeset	175 legalization) or may not legalize at all (e.g. some unsupported accelerator X
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	176 combination of ops and types).
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	177
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	178 ### Transformations Problems Avoided
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	179 A `vector<16x32x64xf32>` virtual `vector` is a coarse-grained type that can be
1d019706d866 LLVM10 anatofuz parents: diff changeset	180 “unrolled” to HW-specific sizes. The multi-dimensional unrolling factors are
1d019706d866 LLVM10 anatofuz parents: diff changeset	181 carried in the IR by the `vector` type. After unrolling, traditional
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	182 instruction-level scheduling can be run.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	183
1d019706d866 LLVM10 anatofuz parents: diff changeset	184 The following key transformations (along with the supporting analyses and
1d019706d866 LLVM10 anatofuz parents: diff changeset	185 structural constraints) are completely avoided by operating on a ``vector``
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	186 `ssa-value` abstraction:
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	187
1d019706d866 LLVM10 anatofuz parents: diff changeset	188 1. Loop unroll and unroll-and-jam.
1d019706d866 LLVM10 anatofuz parents: diff changeset	189 2. Loop and load-store restructuring for register reuse.
1d019706d866 LLVM10 anatofuz parents: diff changeset	190 3. Load to store forwarding and Mem2reg.
1d019706d866 LLVM10 anatofuz parents: diff changeset	191 4. Coarsening (raising) from finer-grained `vector` form.
1d019706d866 LLVM10 anatofuz parents: diff changeset	192
1d019706d866 LLVM10 anatofuz parents: diff changeset	193 Note that “unrolling” in the context of `vector`s corresponds to partial loop
1d019706d866 LLVM10 anatofuz parents: diff changeset	194 unroll-and-jam and not full unrolling. As a consequence this is expected to
1d019706d866 LLVM10 anatofuz parents: diff changeset	195 compose with SW pipelining where applicable and does not result in ICache blow
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	196 up.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	197
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	198 ### The Big Out-Of-Scope Piece: Automatic Vectorization
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	199 One important piece not discussed here is automatic vectorization
1d019706d866 LLVM10 anatofuz parents: diff changeset	200 (automatically raising from scalar to n-D `vector` ops and types). The TL;DR
1d019706d866 LLVM10 anatofuz parents: diff changeset	201 is that when the first "super-vectorization" prototype was implemented, MLIR
1d019706d866 LLVM10 anatofuz parents: diff changeset	202 was nowhere near as mature as it is today. As we continue building more
1d019706d866 LLVM10 anatofuz parents: diff changeset	203 abstractions in `VV -> HWV`, there is an opportunity to revisit vectorization
1d019706d866 LLVM10 anatofuz parents: diff changeset	204 in MLIR.
1d019706d866 LLVM10 anatofuz parents: diff changeset	205
1d019706d866 LLVM10 anatofuz parents: diff changeset	206 Since this topic touches on codegen abstractions, it is technically out of the
1d019706d866 LLVM10 anatofuz parents: diff changeset	207 scope of this survey document but there is a lot to discuss in light of
1d019706d866 LLVM10 anatofuz parents: diff changeset	208 structured op type representations and how a vectorization transformation can
1d019706d866 LLVM10 anatofuz parents: diff changeset	209 be reused across dialects. In particular, MLIR allows the definition of
1d019706d866 LLVM10 anatofuz parents: diff changeset	210 dialects at arbitrary levels of granularity and lends itself favorably to
1d019706d866 LLVM10 anatofuz parents: diff changeset	211 progressive lowering. The argument can be made that automatic vectorization on
1d019706d866 LLVM10 anatofuz parents: diff changeset	212 a loops + ops abstraction is akin to raising structural information that has
1d019706d866 LLVM10 anatofuz parents: diff changeset	213 been lost. Instead, it is possible to revisit vectorization as simple pattern
1d019706d866 LLVM10 anatofuz parents: diff changeset	214 rewrites, provided the IR is in a suitable form. For instance, vectorizing a
1d019706d866 LLVM10 anatofuz parents: diff changeset	215 `linalg.generic` op whose semantics match a `matmul` can be done [quite easily
1d019706d866 LLVM10 anatofuz parents: diff changeset	216 with a
1d019706d866 LLVM10 anatofuz parents: diff changeset	217 pattern](https://github.com/tensorflow/mlir/commit/bff722d6b59ab99b998f0c2b9fccd0267d9f93b5). In
1d019706d866 LLVM10 anatofuz parents: diff changeset	218 fact this pattern is trivial to generalize to any type of contraction when
1d019706d866 LLVM10 anatofuz parents: diff changeset	219 targeting the `vector.contract` op, as well as to any field (`+/*`, `min/+`,
1d019706d866 LLVM10 anatofuz parents: diff changeset	220 `max/+`, `or/and`, `logsumexp/+` ...) . In other words, by operating on a
1d019706d866 LLVM10 anatofuz parents: diff changeset	221 higher level of generic abstractions than affine loops, non-trivial
1d019706d866 LLVM10 anatofuz parents: diff changeset	222 transformations become significantly simpler and composable at a finer
1d019706d866 LLVM10 anatofuz parents: diff changeset	223 granularity.
1d019706d866 LLVM10 anatofuz parents: diff changeset	224
1d019706d866 LLVM10 anatofuz parents: diff changeset	225 Irrespective of the existence of an auto-vectorizer, one can build a notional
1d019706d866 LLVM10 anatofuz parents: diff changeset	226 vector language based on the VectorOps dialect and build end-to-end models
1d019706d866 LLVM10 anatofuz parents: diff changeset	227 with expressing `vector`s in the IR directly and simple
1d019706d866 LLVM10 anatofuz parents: diff changeset	228 pattern-rewrites. [EDSC](https://github.com/llvm/llvm-project/blob/master/mlir/docs/EDSC.md)s
1d019706d866 LLVM10 anatofuz parents: diff changeset	229 provide a simple way of driving such a notional language directly in C++.
1d019706d866 LLVM10 anatofuz parents: diff changeset	230
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	231 ## Bikeshed Naming Discussion
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	232 There are arguments against naming an n-D level of abstraction `vector`
1d019706d866 LLVM10 anatofuz parents: diff changeset	233 because most people associate it with 1-D `vector`s. On the other hand,
1d019706d866 LLVM10 anatofuz parents: diff changeset	234 `vector`s are first-class n-D values in MLIR.
1d019706d866 LLVM10 anatofuz parents: diff changeset	235 The alternative name Tile has been proposed, which conveys higher-D
1d019706d866 LLVM10 anatofuz parents: diff changeset	236 meaning. But it also is one of the most overloaded terms in compilers and
1d019706d866 LLVM10 anatofuz parents: diff changeset	237 hardware.
1d019706d866 LLVM10 anatofuz parents: diff changeset	238 For now, we generally use the `n-D` `vector` name and are open to better
1d019706d866 LLVM10 anatofuz parents: diff changeset	239 suggestions.
1d019706d866 LLVM10 anatofuz parents: diff changeset	240
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	241 ## DeeperDive
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	242
1d019706d866 LLVM10 anatofuz parents: diff changeset	243 This section describes the tradeoffs involved in lowering the MLIR n-D vector
1d019706d866 LLVM10 anatofuz parents: diff changeset	244 type and operations on it to LLVM-IR. Putting aside the [LLVM
1d019706d866 LLVM10 anatofuz parents: diff changeset	245 Matrix](http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html)
1d019706d866 LLVM10 anatofuz parents: diff changeset	246 proposal for now, this assumes LLVM only has built-in support for 1-D
1d019706d866 LLVM10 anatofuz parents: diff changeset	247 vector. The relationship with the LLVM Matrix proposal is discussed at the end
1d019706d866 LLVM10 anatofuz parents: diff changeset	248 of this document.
1d019706d866 LLVM10 anatofuz parents: diff changeset	249
1d019706d866 LLVM10 anatofuz parents: diff changeset	250 MLIR does not currently support dynamic vector sizes (i.e. SVE style) so the
1d019706d866 LLVM10 anatofuz parents: diff changeset	251 discussion is limited to static rank and static vector sizes
1d019706d866 LLVM10 anatofuz parents: diff changeset	252 (e.g. `vector<4x8x16x32xf32>`). This section discusses operations on vectors
1d019706d866 LLVM10 anatofuz parents: diff changeset	253 in LLVM and MLIR.
1d019706d866 LLVM10 anatofuz parents: diff changeset	254
1d019706d866 LLVM10 anatofuz parents: diff changeset	255 LLVM instructions are prefixed by the `llvm.` dialect prefix
1d019706d866 LLVM10 anatofuz parents: diff changeset	256 (e.g. `llvm.insertvalue`). Such ops operate exclusively on 1-D vectors and
1d019706d866 LLVM10 anatofuz parents: diff changeset	257 aggregates following the [LLVM LangRef](https://llvm.org/docs/LangRef.html).
1d019706d866 LLVM10 anatofuz parents: diff changeset	258 MLIR operations are prefixed by the `vector.` dialect prefix
1d019706d866 LLVM10 anatofuz parents: diff changeset	259 (e.g. `vector.insertelement`). Such ops operate exclusively on MLIR `n-D`
1d019706d866 LLVM10 anatofuz parents: diff changeset	260 `vector` types.
1d019706d866 LLVM10 anatofuz parents: diff changeset	261
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	262 ### Alternatives For Lowering an n-D Vector Type to LLVM
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	263 Consider a vector of rank n with static sizes `{s_0, ... s_{n-1}}` (i.e. an
1d019706d866 LLVM10 anatofuz parents: diff changeset	264 MLIR `vector<s_0x...s_{n-1}xf32>`). Lowering such an `n-D` MLIR vector type to
1d019706d866 LLVM10 anatofuz parents: diff changeset	265 an LLVM descriptor can be done by either:
1d019706d866 LLVM10 anatofuz parents: diff changeset	266
1d019706d866 LLVM10 anatofuz parents: diff changeset	267 1. Flattening to a `1-D` vector: `!llvm<"(s_0...s_{n-1})xfloat">` in the
1d019706d866 LLVM10 anatofuz parents: diff changeset	268 MLIR LLVM dialect.
1d019706d866 LLVM10 anatofuz parents: diff changeset	269 2. Nested aggregate type of `1-D` vector:
1d019706d866 LLVM10 anatofuz parents: diff changeset	270 `!llvm<"[s_0x[s_1x[...<s_{n-1}xfloat>]]]">` in the MLIR LLVM dialect.
1d019706d866 LLVM10 anatofuz parents: diff changeset	271 3. A mix of both.
1d019706d866 LLVM10 anatofuz parents: diff changeset	272
1d019706d866 LLVM10 anatofuz parents: diff changeset	273 There are multiple tradeoffs involved in choosing one or the other that we
1d019706d866 LLVM10 anatofuz parents: diff changeset	274 discuss. It is important to note that “a mix of both” immediately reduces to
1d019706d866 LLVM10 anatofuz parents: diff changeset	275 “nested aggregate type of 1-D vector” with a `vector.cast %0:
1d019706d866 LLVM10 anatofuz parents: diff changeset	276 vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that flattens the most
1d019706d866 LLVM10 anatofuz parents: diff changeset	277 "k" minor dimensions.
1d019706d866 LLVM10 anatofuz parents: diff changeset	278
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	279 ### Constraints Inherited from LLVM (see LangRef)
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	280 The first constraint was already mentioned: LLVM only supports `1-D` `vector`
1d019706d866 LLVM10 anatofuz parents: diff changeset	281 types natively.
1d019706d866 LLVM10 anatofuz parents: diff changeset	282 Additional constraints are related to the difference in LLVM between vector
1d019706d866 LLVM10 anatofuz parents: diff changeset	283 and aggregate types:
1d019706d866 LLVM10 anatofuz parents: diff changeset	284 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	285 “Aggregate Types are a subset of derived types that can contain multiple
1d019706d866 LLVM10 anatofuz parents: diff changeset	286 member types. Arrays and structs are aggregate types. Vectors are not
1d019706d866 LLVM10 anatofuz parents: diff changeset	287 considered to be aggregate types.”.
1d019706d866 LLVM10 anatofuz parents: diff changeset	288 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	289
1d019706d866 LLVM10 anatofuz parents: diff changeset	290 This distinction is also reflected in some of the operations. For `1-D`
1d019706d866 LLVM10 anatofuz parents: diff changeset	291 vectors, the operations `llvm.extractelement`, `llvm.insertelement`, and
1d019706d866 LLVM10 anatofuz parents: diff changeset	292 `llvm.shufflevector` apply, with direct support for dynamic indices. For `n-D`
1d019706d866 LLVM10 anatofuz parents: diff changeset	293 vectors with `n>1`, and thus aggregate types at LLVM level, the more
1d019706d866 LLVM10 anatofuz parents: diff changeset	294 restrictive operations `llvm.extractvalue` and `llvm.insertvalue` apply, which
1d019706d866 LLVM10 anatofuz parents: diff changeset	295 only accept static indices. There is no direct shuffling support for aggregate
1d019706d866 LLVM10 anatofuz parents: diff changeset	296 types.
1d019706d866 LLVM10 anatofuz parents: diff changeset	297
1d019706d866 LLVM10 anatofuz parents: diff changeset	298 The next sentence illustrates a recurrent tradeoff, also found in MLIR,
1d019706d866 LLVM10 anatofuz parents: diff changeset	299 between “value types” (subject to SSA use-def chains) and “memory types”
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	300 (subject to aliasing and side-effects):
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	301 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	302 “Structures in memory are accessed using ‘load’ and ‘store’ by getting a
1d019706d866 LLVM10 anatofuz parents: diff changeset	303 pointer to a field with the llvm.getelementptr instruction. Structures in
1d019706d866 LLVM10 anatofuz parents: diff changeset	304 registers are accessed using the llvm.extractvalue and llvm.insertvalue
1d019706d866 LLVM10 anatofuz parents: diff changeset	305 instructions.”
1d019706d866 LLVM10 anatofuz parents: diff changeset	306 ```
1d019706d866 LLVM10 anatofuz parents: diff changeset	307
1d019706d866 LLVM10 anatofuz parents: diff changeset	308 When transposing this to MLIR, `llvm.getelementptr` works on pointers to `n-D`
1d019706d866 LLVM10 anatofuz parents: diff changeset	309 vectors in memory. For `n-D`, vectors values that live in registers we can use
1d019706d866 LLVM10 anatofuz parents: diff changeset	310 `vector.extract` and `vector.insert` which do not accept dynamic indices. Note
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	311 that this is consistent with hardware considerations as discussed below.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	312
1d019706d866 LLVM10 anatofuz parents: diff changeset	313 An alternative is to use an LLVM `1-D` `vector` type for which one can use
1d019706d866 LLVM10 anatofuz parents: diff changeset	314 `llvm.extractelement`, `llvm.insertelement` and `llvm.shufflevector`. These
1d019706d866 LLVM10 anatofuz parents: diff changeset	315 operations accept dynamic indices. The implication is that one has to use a
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	316 flattened lowering of an MLIR n-D vector to an LLVM 1-D vector.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	317
1d019706d866 LLVM10 anatofuz parents: diff changeset	318 There are multiple tradeoffs involved that mix implications on the programming
1d019706d866 LLVM10 anatofuz parents: diff changeset	319 model, execution on actual HW and what is visible or hidden from codegen. They
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	320 are discussed in the following sections.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	321
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	322 ### Nested Aggregate
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	323 Pros:
1d019706d866 LLVM10 anatofuz parents: diff changeset	324
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	325 1. Natural encoding n-D vector -> (n-1)-D aggregate over 1-D vector.
0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	326 2. No need for linearization / delinearization logic inserted everywhere.
0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	327 3. `llvm.insertvalue`, `llvm.extractvalue` of `(n-k)-D` aggregate is natural.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	328 4. `llvm.insertelement`, `llvm.extractelement`, `llvm.shufflevector` over
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	329 `1-D` vector type is natural.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	330
1d019706d866 LLVM10 anatofuz parents: diff changeset	331 Cons:
1d019706d866 LLVM10 anatofuz parents: diff changeset	332
1d019706d866 LLVM10 anatofuz parents: diff changeset	333 1. `llvm.insertvalue` / `llvm.extractvalue` does not accept dynamic indices
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	334 but only static ones.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	335 2. Dynamic indexing on the non-most-minor dimension requires roundtrips to
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	336 memory.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	337 3. Special intrinsics and native instructions in LLVM operate on `1-D`
1d019706d866 LLVM10 anatofuz parents: diff changeset	338 vectors. This is not expected to be a practical limitation thanks to a
1d019706d866 LLVM10 anatofuz parents: diff changeset	339 `vector.cast %0: vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that
1d019706d866 LLVM10 anatofuz parents: diff changeset	340 flattens the most minor dimensions (see the bigger picture in implications on
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	341 codegen).
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	342
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	343 ### Flattened 1-D Vector Type
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	344
1d019706d866 LLVM10 anatofuz parents: diff changeset	345 Pros:
1d019706d866 LLVM10 anatofuz parents: diff changeset	346
1d019706d866 LLVM10 anatofuz parents: diff changeset	347 1. `insertelement` / `extractelement` / `shufflevector` with dynamic indexing
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	348 is possible over the whole lowered `n-D` vector type.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	349 2. Supports special intrinsics and native operations.
1d019706d866 LLVM10 anatofuz parents: diff changeset	350
1d019706d866 LLVM10 anatofuz parents: diff changeset	351 Cons:
1d019706d866 LLVM10 anatofuz parents: diff changeset	352 1. Requires linearization/delinearization logic everywhere, translations are
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	353 complex.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	354 2. Hides away the real HW structure behind dynamic indexing: at the end of the
1d019706d866 LLVM10 anatofuz parents: diff changeset	355 day, HW vector sizes are generally fixed and multiple vectors will be needed
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	356 to hold a vector that is larger than the HW.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	357 3. Unlikely peephole optimizations will result in good code: arbitrary dynamic
1d019706d866 LLVM10 anatofuz parents: diff changeset	358 accesses, especially at HW vector boundaries unlikely to result in regular
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	359 patterns.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	360
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	361 ### Discussion
0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	362 #### HW Vectors and Implications on the SW and the Programming Model
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	363 As of today, the LLVM model only support `1-D` vector types. This is
1d019706d866 LLVM10 anatofuz parents: diff changeset	364 unsurprising because historically, the vast majority of HW only supports `1-D`
1d019706d866 LLVM10 anatofuz parents: diff changeset	365 vector registers. We note that multiple HW vendors are in the process of
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	366 evolving to higher-dimensional physical vectors.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	367
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	368 In the following discussion, let's assume the HW vector size is `1-D` and the
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	369 SW vector size is `n-D`, with `n >= 1`. The same discussion would apply with
1d019706d866 LLVM10 anatofuz parents: diff changeset	370 `2-D` HW `vector` size and `n >= 2`. In this context, most HW exhibit a vector
1d019706d866 LLVM10 anatofuz parents: diff changeset	371 register file. The number of such vectors is fixed.
1d019706d866 LLVM10 anatofuz parents: diff changeset	372 Depending on the rank and sizes of the SW vector abstraction and the HW vector
1d019706d866 LLVM10 anatofuz parents: diff changeset	373 sizes and number of registers, an `n-D` SW vector type may be materialized by
1d019706d866 LLVM10 anatofuz parents: diff changeset	374 a mix of multiple `1-D` HW vector registers + memory locations at a given
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	375 point in time.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	376
1d019706d866 LLVM10 anatofuz parents: diff changeset	377 The implication of the physical HW constraints on the programming model are
1d019706d866 LLVM10 anatofuz parents: diff changeset	378 that one cannot index dynamically across hardware registers: a register file
1d019706d866 LLVM10 anatofuz parents: diff changeset	379 can generally not be indexed dynamically. This is because the register number
1d019706d866 LLVM10 anatofuz parents: diff changeset	380 is fixed and one either needs to unroll explicitly to obtain fixed register
1d019706d866 LLVM10 anatofuz parents: diff changeset	381 numbers or go through memory. This is a constraint familiar to CUDA
1d019706d866 LLVM10 anatofuz parents: diff changeset	382 programmers: when declaring a `private float a[4]`; and subsequently indexing
1d019706d866 LLVM10 anatofuz parents: diff changeset	383 with a dynamic value results in so-called local memory usage
1d019706d866 LLVM10 anatofuz parents: diff changeset	384 (i.e. roundtripping to memory).
1d019706d866 LLVM10 anatofuz parents: diff changeset	385
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	386 #### Implication on codegen
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	387 MLIR `n-D` vector types are currently represented as `(n-1)-D` arrays of `1-D`
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	388 vectors when lowered to LLVM.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	389 This introduces the consequences on static vs dynamic indexing discussed
1d019706d866 LLVM10 anatofuz parents: diff changeset	390 previously: `extractelement`, `insertelement` and `shufflevector` on `n-D`
1d019706d866 LLVM10 anatofuz parents: diff changeset	391 vectors in MLIR only support static indices. Dynamic indices are only
1d019706d866 LLVM10 anatofuz parents: diff changeset	392 supported on the most minor `1-D` vector but not the outer `(n-1)-D`.
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	393 For other cases, explicit load / stores are required.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	394
1d019706d866 LLVM10 anatofuz parents: diff changeset	395 The implications on codegen are as follows:
1d019706d866 LLVM10 anatofuz parents: diff changeset	396
1d019706d866 LLVM10 anatofuz parents: diff changeset	397 1. Loops around `vector` values are indirect addressing of vector values, they
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	398 must operate on explicit load / store operations over `n-D` vector types.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	399 2. Once an `n-D` `vector` type is loaded into an SSA value (that may or may
1d019706d866 LLVM10 anatofuz parents: diff changeset	400 not live in `n` registers, with or without spilling, when eventually lowered),
1d019706d866 LLVM10 anatofuz parents: diff changeset	401 it may be unrolled to smaller `k-D` `vector` types and operations that
1d019706d866 LLVM10 anatofuz parents: diff changeset	402 correspond to the HW. This level of MLIR codegen is related to register
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	403 allocation and spilling that occur much later in the LLVM pipeline.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	404 3. HW may support >1-D vectors with intrinsics for indirect addressing within
1d019706d866 LLVM10 anatofuz parents: diff changeset	405 these vectors. These can be targeted thanks to explicit `vector_cast`
1d019706d866 LLVM10 anatofuz parents: diff changeset	406 operations from MLIR `k-D` vector types and operations to LLVM `1-D` vectors +
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	407 intrinsics.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	408
1d019706d866 LLVM10 anatofuz parents: diff changeset	409 Alternatively, we argue that directly lowering to a linearized abstraction
1d019706d866 LLVM10 anatofuz parents: diff changeset	410 hides away the codegen complexities related to memory accesses by giving a
1d019706d866 LLVM10 anatofuz parents: diff changeset	411 false impression of magical dynamic indexing across registers. Instead we
1d019706d866 LLVM10 anatofuz parents: diff changeset	412 prefer to make those very explicit in MLIR and allow codegen to explore
1d019706d866 LLVM10 anatofuz parents: diff changeset	413 tradeoffs.
1d019706d866 LLVM10 anatofuz parents: diff changeset	414 Different HW will require different tradeoffs in the sizes involved in steps
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	415 1., 2. and 3.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	416
1d019706d866 LLVM10 anatofuz parents: diff changeset	417 Decisions made at the MLIR level will have implications at a much later stage
1d019706d866 LLVM10 anatofuz parents: diff changeset	418 in LLVM (after register allocation). We do not envision to expose concerns
1d019706d866 LLVM10 anatofuz parents: diff changeset	419 related to modeling of register allocation and spilling to MLIR
1d019706d866 LLVM10 anatofuz parents: diff changeset	420 explicitly. Instead, each target will expose a set of "good" target operations
1d019706d866 LLVM10 anatofuz parents: diff changeset	421 and `n-D` vector types, associated with costs that `PatterRewriters` at the
1d019706d866 LLVM10 anatofuz parents: diff changeset	422 MLIR level will be able to target. Such costs at the MLIR level will be
1d019706d866 LLVM10 anatofuz parents: diff changeset	423 abstract and used for ranking, not for accurate performance modeling. In the
1d019706d866 LLVM10 anatofuz parents: diff changeset	424 future such costs will be learned.
1d019706d866 LLVM10 anatofuz parents: diff changeset	425
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	426 #### Implication on Lowering to Accelerators
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	427 To target accelerators that support higher dimensional vectors natively, we
1d019706d866 LLVM10 anatofuz parents: diff changeset	428 can start from either `1-D` or `n-D` vectors in MLIR and use `vector.cast` to
1d019706d866 LLVM10 anatofuz parents: diff changeset	429 flatten the most minor dimensions to `1-D` `vector<Kxf32>` where `K` is an
1d019706d866 LLVM10 anatofuz parents: diff changeset	430 appropriate constant. Then, the existing lowering to LLVM-IR immediately
1d019706d866 LLVM10 anatofuz parents: diff changeset	431 applies, with extensions for accelerator-specific intrinsics.
1d019706d866 LLVM10 anatofuz parents: diff changeset	432
1d019706d866 LLVM10 anatofuz parents: diff changeset	433 It is the role of an Accelerator-specific vector dialect (see codegen flow in
1d019706d866 LLVM10 anatofuz parents: diff changeset	434 the figure above) to lower the `vector.cast`. Accelerator -> LLVM lowering
1d019706d866 LLVM10 anatofuz parents: diff changeset	435 would then consist of a bunch of `Accelerator -> Accelerator` rewrites to
1d019706d866 LLVM10 anatofuz parents: diff changeset	436 perform the casts composed with `Accelerator -> LLVM` conversions + intrinsics
1d019706d866 LLVM10 anatofuz parents: diff changeset	437 that operate on `1-D` `vector<Kxf32>`.
1d019706d866 LLVM10 anatofuz parents: diff changeset	438
1d019706d866 LLVM10 anatofuz parents: diff changeset	439 Some of those rewrites may need extra handling, especially if a reduction is
1d019706d866 LLVM10 anatofuz parents: diff changeset	440 involved. For example, `vector.cast %0: vector<K1x...xKnxf32> to
1d019706d866 LLVM10 anatofuz parents: diff changeset	441 vector<Kxf32>` when `K != K1 * … * Kn` and some arbitrary irregular
1d019706d866 LLVM10 anatofuz parents: diff changeset	442 `vector.cast %0: vector<4x4x17xf32> to vector<Kxf32>` may introduce masking
1d019706d866 LLVM10 anatofuz parents: diff changeset	443 and intra-vector shuffling that may not be worthwhile or even feasible,
1d019706d866 LLVM10 anatofuz parents: diff changeset	444 i.e. infinite cost.
1d019706d866 LLVM10 anatofuz parents: diff changeset	445
1d019706d866 LLVM10 anatofuz parents: diff changeset	446 However `vector.cast %0: vector<K1x...xKnxf32> to vector<Kxf32>` when `K =
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	447 K1 * … * Kn` should be close to a noop.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	448
1d019706d866 LLVM10 anatofuz parents: diff changeset	449 As we start building accelerator-specific abstractions, we hope to achieve
1d019706d866 LLVM10 anatofuz parents: diff changeset	450 retargetable codegen: the same infra is used for CPU, GPU and accelerators
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	451 with extra MLIR patterns and costs.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	452
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	453 #### Implication on calling external functions that operate on vectors
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	454 It is possible (likely) that we additionally need to linearize when calling an
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	455 external function.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	456
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	457 ### Relationship to LLVM matrix type proposal.
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	458 The LLVM matrix proposal was formulated 1 year ago but seemed to be somewhat
1d019706d866 LLVM10 anatofuz parents: diff changeset	459 stalled until recently. In its current form, it is limited to 2-D matrix types
1d019706d866 LLVM10 anatofuz parents: diff changeset	460 and operations are implemented with LLVM intrinsics.
1d019706d866 LLVM10 anatofuz parents: diff changeset	461 In contrast, MLIR sits at a higher level of abstraction and allows the
1d019706d866 LLVM10 anatofuz parents: diff changeset	462 lowering of generic operations on generic n-D vector types from MLIR to
1d019706d866 LLVM10 anatofuz parents: diff changeset	463 aggregates of 1-D LLVM vectors.
1d019706d866 LLVM10 anatofuz parents: diff changeset	464 In the future, it could make sense to lower to the LLVM matrix abstraction
1d019706d866 LLVM10 anatofuz parents: diff changeset	465 also for CPU even though MLIR will continue needing higher level abstractions.
1d019706d866 LLVM10 anatofuz parents: diff changeset	466
1d019706d866 LLVM10 anatofuz parents: diff changeset	467 On the other hand, one should note that as MLIR is moving to LLVM, this
1d019706d866 LLVM10 anatofuz parents: diff changeset	468 document could become the unifying abstraction that people should target for
1d019706d866 LLVM10 anatofuz parents: diff changeset	469 >1-D vectors and the LLVM matrix proposal can be viewed as a subset of this
1d019706d866 LLVM10 anatofuz parents: diff changeset	470 work.
1d019706d866 LLVM10 anatofuz parents: diff changeset	471
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	472 ### Conclusion
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	473 The flattened 1-D vector design in the LLVM matrix proposal is good in a
1d019706d866 LLVM10 anatofuz parents: diff changeset	474 HW-specific world with special intrinsics. This is a good abstraction for
1d019706d866 LLVM10 anatofuz parents: diff changeset	475 register allocation, Instruction-Level-Parallelism and
1d019706d866 LLVM10 anatofuz parents: diff changeset	476 SoftWare-Pipelining/Modulo Scheduling optimizations at the register level.
1d019706d866 LLVM10 anatofuz parents: diff changeset	477 However MLIR codegen operates at a higher level of abstraction where we want
1d019706d866 LLVM10 anatofuz parents: diff changeset	478 to target operations on coarser-grained vectors than the HW size and on which
1d019706d866 LLVM10 anatofuz parents: diff changeset	479 unroll-and-jam is applied and patterns across multiple HW vectors can be
1d019706d866 LLVM10 anatofuz parents: diff changeset	480 matched.
1d019706d866 LLVM10 anatofuz parents: diff changeset	481
1d019706d866 LLVM10 anatofuz parents: diff changeset	482 This makes “nested aggregate type of 1-D vector” an appealing abstraction for
1d019706d866 LLVM10 anatofuz parents: diff changeset	483 lowering from MLIR because:
1d019706d866 LLVM10 anatofuz parents: diff changeset	484
1d019706d866 LLVM10 anatofuz parents: diff changeset	485 1. it does not hide complexity related to the buffer vs value semantics and
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	486 the memory subsystem and
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	487 2. it does not rely on LLVM to magically make all the things work from a too
1d019706d866 LLVM10 anatofuz parents: diff changeset	488 low-level abstraction.
1d019706d866 LLVM10 anatofuz parents: diff changeset	489
1d019706d866 LLVM10 anatofuz parents: diff changeset	490 The use of special intrinsics in a `1-D` LLVM world is still available thanks
1d019706d866 LLVM10 anatofuz parents: diff changeset	491 to an explicit `vector.cast` op.
1d019706d866 LLVM10 anatofuz parents: diff changeset	492
1d019706d866 LLVM10 anatofuz parents: diff changeset	493 ## Operations
1d019706d866 LLVM10 anatofuz parents: diff changeset	494
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	495 [include "Dialects/VectorOps.md"]

Mercurial > hg > CbC > CbC_llvm

annotate mlir/docs/Dialects/Vector.md @ 194:f2ef29ba5fe2