Members/tobaru/cbc/CbC_llvm: docs/BigEndianNEON.rst annotate

annotate docs/BigEndianNEON.rst @ 107:a03ddd01be7e

resolve warnings

author	Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp>
date	Sun, 31 Jan 2016 17:34:49 +0900
parents	54457678186b
children

rev	line source
77 54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	1 ==============================================
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	2 Using ARM NEON instructions in big endian mode
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	3 ==============================================
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	4
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	5 .. contents::
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	6 :local:
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	7
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	8 Introduction
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	9 ============
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	10
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	11 Generating code for big endian ARM processors is for the most part straightforward. NEON loads and stores however have some interesting properties that make code generation decisions less obvious in big endian mode.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	12
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	13 The aim of this document is to explain the problem with NEON loads and stores, and the solution that has been implemented in LLVM.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	14
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	15 In this document the term "vector" refers to what the ARM ABI calls a "short vector", which is a sequence of items that can fit in a NEON register. This sequence can be 64 or 128 bits in length, and can consist of 8, 16, 32 or 64 bit items. This document refers to A64 instructions throughout, but mostly applies to the A32/ARMv7 instruction sets as well. The ABI format for passing vectors in A32 is slightly different from that in A64. Apart from that, the same concepts apply.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	16
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	17 Example: C-level intrinsics -> assembly
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	18 ---------------------------------------
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	19
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	20 It may be helpful first to illustrate how C-level ARM NEON intrinsics are lowered to instructions.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	21
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	22 This trivial C function takes a vector of four ints and sets the zero'th lane to the value "42"::
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	23
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	24 #include <arm_neon.h>
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	25 int32x4_t f(int32x4_t p) {
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	26 return vsetq_lane_s32(42, p, 0);
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	27 }
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	28
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	29 arm_neon.h intrinsics generate "generic" IR where possible (that is, normal IR instructions not ``llvm.arm.neon.*`` intrinsic calls). The above generates::
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	30
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	31 define <4 x i32> @f(<4 x i32> %p) {
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	32 %vset_lane = insertelement <4 x i32> %p, i32 42, i32 0
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	33 ret <4 x i32> %vset_lane
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	34 }
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	35
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	36 Which then becomes the following trivial assembly::
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	37
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	38 f: // @f
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	39 movz w8, #0x2a
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	40 ins v0.s[0], w8
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	41 ret
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	42
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	43 Problem
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	44 =======
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	45
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	46 The main problem is how vectors are represented in memory and in registers.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	47
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	48 First, a recap. The "endianness" of an item affects its representation in memory only. In a register, a number is just a sequence of bits - 64 bits in the case of AArch64 general purpose registers. Memory, however, is a sequence of addressable units of 8 bits in size. Any number greater than 8 bits must therefore be split up into 8-bit chunks, and endianness describes the order in which these chunks are laid out in memory.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	49
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	50 A "little endian" layout has the least significant byte first (lowest in memory address). A "big endian" layout has the most significant byte first. This means that when loading an item from big endian memory, the 8 bits at the lowest memory address must go in the most significant 8 bits of the register, and so forth.
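
As a minimal sketch of the byte ordering described above, the following C functions assemble a 32-bit value from four bytes in memory under each layout. The names ``load_be32`` and ``load_le32`` are hypothetical helpers chosen for illustration; they are not part of LLVM or of any ARM header.

```c
#include <stdint.h>

/* Big endian: the byte at the lowest address is the most significant.
   (Hypothetical helper, for illustration only.) */
uint32_t load_be32(const uint8_t *mem) {
    return ((uint32_t)mem[0] << 24) | ((uint32_t)mem[1] << 16) |
           ((uint32_t)mem[2] << 8)  |  (uint32_t)mem[3];
}

/* Little endian: the byte at the lowest address is the least significant.
   (Hypothetical helper, for illustration only.) */
uint32_t load_le32(const uint8_t *mem) {
    return ((uint32_t)mem[3] << 24) | ((uint32_t)mem[2] << 16) |
           ((uint32_t)mem[1] << 8)  |  (uint32_t)mem[0];
}
```

The same four bytes in memory therefore yield two different register values depending on the layout assumed.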
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	51
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	52 ``LDR`` and ``LD1``
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	53 ===================
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	54
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	55 .. figure:: ARM-BE-ldr.png
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	56 :align: right
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	57
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	58 Big endian vector load using ``LDR``.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	59
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	60
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	61 A vector is a consecutive sequence of items that are operated on simultaneously. To load a 64-bit vector, 64 bits need to be read from memory. In little endian mode, we can do this by just performing a 64-bit load - ``LDR d0, [foo]``. However, if we try this in big endian mode, the byte swapping causes the lane indices to end up swapped! The zero'th item as laid out in memory becomes the n'th lane in the vector.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	62
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	63 .. figure:: ARM-BE-ld1.png
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	64 :align: right
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	65
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	66 Big endian vector load using ``LD1``. Note that the lanes retain the correct ordering.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	67
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	68
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	69 Because of this, the instruction ``LD1`` performs a vector load but performs byte swapping not on the entire 64 bits, but on the individual items within the vector. This means that the register content is the same as it would have been on a little endian system.
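
The difference between the two loads can be sketched with a simplified software model. This is not NEON code: it assumes a 64-bit register held in a ``uint64_t``, four 16-bit lanes, and lane ``i`` occupying bits ``16*i`` to ``16*i + 15``. The function names are hypothetical.

```c
#include <stdint.h>

/* Model of a big endian `LDR d0, [mem]`: one 64-bit big endian load,
   so the byte at the lowest address lands in the top of the register. */
uint64_t ldr_d_be(const uint8_t *mem) {
    uint64_t reg = 0;
    for (int i = 0; i < 8; i++)
        reg = (reg << 8) | mem[i];
    return reg;
}

/* Model of a big endian `LD1 {v0.4h}, [mem]`: each 16-bit item is byte
   swapped on its own and placed in the lane matching its memory position. */
uint64_t ld1_4h_be(const uint8_t *mem) {
    uint64_t reg = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint64_t item = ((uint64_t)mem[2 * lane] << 8) | mem[2 * lane + 1];
        reg |= item << (16 * lane);
    }
    return reg;
}

/* Extract lane `lane` (0..3) from the modelled register. */
uint16_t lane_4h(uint64_t reg, int lane) {
    return (uint16_t)(reg >> (16 * lane));
}
```

With the four items ``0x000A, 0x000B, 0x000C, 0x000D`` stored in that order, the ``LDR`` model puts the first item in lane 3, while the ``LD1`` model keeps it in lane 0.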
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	70
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	71 It may seem that ``LD1`` should suffice to perform vector loads on a big endian machine. However, there are pros and cons to the two approaches that make the choice of register format less than simple.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	72
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	73 There are two options:
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	74
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	75 1. The content of a vector register is the same as if it had been loaded with an ``LDR`` instruction.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	76 2. The content of a vector register is the same as if it had been loaded with an ``LD1`` instruction.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	77
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	78 Because ``LD1 == LDR + REV`` and similarly ``LDR == LD1 + REV`` (on a big endian system), we can simulate either type of load with the other type of load plus a ``REV`` instruction. So we're not deciding which instructions to use, but which format to use (which will then influence which instruction is best to use).
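
The equivalence stated above can be checked on a small model. The sketch below assumes a 64-bit register holding two 32-bit lanes, with lane 0 in the low 32 bits; ``rev64_2s`` stands in for a ``REV`` that reverses the lane order. All function names are hypothetical and the code only simulates the big endian behaviour in portable C.

```c
#include <stdint.h>

/* Model of a big endian `LDR d0, [mem]`. */
uint64_t ldr_d_be(const uint8_t *mem) {
    uint64_t reg = 0;
    for (int i = 0; i < 8; i++)
        reg = (reg << 8) | mem[i];
    return reg;
}

/* Model of a big endian `LD1 {v0.2s}, [mem]`: byte swap per 32-bit item,
   item 0 goes to lane 0 (the low 32 bits). */
uint64_t ld1_2s_be(const uint8_t *mem) {
    uint64_t reg = 0;
    for (int lane = 0; lane < 2; lane++) {
        uint64_t item = 0;
        for (int b = 0; b < 4; b++)
            item = (item << 8) | mem[4 * lane + b];
        reg |= item << (32 * lane);
    }
    return reg;
}

/* Model of a lane reversal for two 32-bit lanes: swap the halves. */
uint64_t rev64_2s(uint64_t reg) {
    return (reg << 32) | (reg >> 32);
}
```

In this model ``rev64_2s(ldr_d_be(mem))`` equals ``ld1_2s_be(mem)`` and ``rev64_2s(ld1_2s_be(mem))`` equals ``ldr_d_be(mem)`` for any input, which is why the choice is about register format and not about which instructions are available.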
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	79
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	80 .. The 'clearer' container is required to make the following section header come after the floated
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	81 images above.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	82 .. container:: clearer
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	83
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	84 Note that throughout this section we only mention loads. Stores have exactly the same problems as their associated loads, so have been skipped for brevity.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	85
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	86
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	87 Considerations
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	88 ==============
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	89
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	90 LLVM IR Lane ordering
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	91 ---------------------
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	92
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	93 LLVM IR has first class vector types. In LLVM IR, the zero'th element of a vector resides at the lowest memory address. The optimizer relies on this property in certain areas, for example when concatenating vectors together. The intention is for arrays and vectors to have identical memory layouts - ``[4 x i8]`` and ``<4 x i8>`` should be represented the same in memory. Without this property there would be many special cases that the optimizer would have to cleverly handle.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	94
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	95 Use of ``LDR`` would break this lane ordering property. This doesn't preclude the use of ``LDR``, but we would have to do one of two things:
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	96
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	97 1. Insert a ``REV`` instruction to reverse the lane order after every ``LDR``.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	98 2. Disable all optimizations that rely on lane layout, and for every access to an individual lane (``insertelement``/``extractelement``/``shufflevector``) reverse the lane index.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	99
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	100 AAPCS
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	101 -----
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	102
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	103 The ARM procedure call standard (AAPCS) defines the ABI for passing vectors between functions in registers. It states:
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	104
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	105 When a short vector is transferred between registers and memory it is treated as an opaque object. That is a short vector is stored in memory as if it were stored with a single ``STR`` of the entire register; a short vector is loaded from memory using the corresponding ``LDR`` instruction. On a little-endian system this means that element 0 will always contain the lowest addressed element of a short vector; on a big-endian system element 0 will contain the highest-addressed element of a short vector.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	106
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	107 -- Procedure Call Standard for the ARM 64-bit Architecture (AArch64), 4.1.2 Short Vectors
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	108
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	109 The use of ``LDR`` and ``STR`` as the ABI defines has at least one advantage over ``LD1`` and ``ST1``. ``LDR`` and ``STR`` are oblivious to the size of the individual lanes of a vector. ``LD1`` and ``ST1`` are not - the lane size is encoded within them. This is important across an ABI boundary, because it would become necessary to know the lane width the callee expects. Consider the following code:
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	110
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	111 .. code-block:: c
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	112
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	113 /* callee.c */
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	114 void callee(uint32x2_t v) {
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	115 ...
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	116 }
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	117
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	118 /* caller.c */
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	119 extern void callee(uint32x2_t);
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	120 void caller() {
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	121 callee(...);
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	122 }
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	123
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	124 If ``callee`` changed its signature to take a ``uint16x4_t``, which is equivalent in register content, passing the vector in the ``LD1`` layout would break this code until ``caller`` was updated and recompiled.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	125
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	126 There is an argument that if the signatures of the two functions are different then the behaviour should be undefined. But there may be functions that are agnostic to the lane layout of the vector, and treating the vector as an opaque value (just loading it and storing it) would be impossible without a common format across ABI boundaries.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	127
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	128 So to preserve ABI compatibility, we need to use the ``LDR`` lane layout across function calls.
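
To see why the lane size matters across a call boundary, consider this simplified model of a big endian ``LD1`` into a 64-bit register, parameterised on the lane width. It is a sketch under stated assumptions: the register is a ``uint64_t``, lane 0 sits in the least significant bits, and ``ld1_d_be`` is a hypothetical name.

```c
#include <stdint.h>

/* Model of a big endian LD1 into a 64-bit register.
   lane_bytes is the lane width in bytes: 1, 2, 4 or 8.
   Each item is byte swapped on its own; item 0 goes to lane 0. */
uint64_t ld1_d_be(const uint8_t *mem, int lane_bytes) {
    uint64_t reg = 0;
    for (int lane = 0; lane < 8 / lane_bytes; lane++) {
        uint64_t item = 0;
        for (int b = 0; b < lane_bytes; b++)
            item = (item << 8) | mem[lane * lane_bytes + b];
        reg |= item << (8 * lane_bytes * lane);
    }
    return reg;
}
```

The same eight bytes produce different register contents for 32-bit lanes (``lane_bytes == 4``) and 16-bit lanes (``lane_bytes == 2``), so a caller and callee that disagree on the lane width would disagree on the register contents. With a single 64-bit lane the model gives the same result as a plain big endian 64-bit load, which does not depend on the element type.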
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	129
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	130 Alignment
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	131 ---------
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	132
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	133 In strict alignment mode, ``LDR qX`` requires its address to be 128-bit aligned, whereas ``LD1`` only requires it to be as aligned as the lane size. If we canonicalised on using ``LDR``, we'd still need to use ``LD1`` in some places to avoid alignment faults (the result of the ``LD1`` would then need to be reversed with ``REV``).
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	134
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	135 Most operating systems however do not run with alignment faults enabled, so this is often not an issue.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	136
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	137 Summary
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	138 -------
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	139
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	140 The following table summarises the instructions that are required to be emitted for each property mentioned above for each of the two solutions.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	141
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	142 +-------------------------------+-------------------------------+---------------------+
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	143 \| \| ``LDR`` layout \| ``LD1`` layout \|
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	144 +===============================+===============================+=====================+
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	145 \| Lane ordering \| ``LDR + REV`` \| ``LD1`` \|
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	146 +-------------------------------+-------------------------------+---------------------+
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	147 \| AAPCS \| ``LDR`` \| ``LD1 + REV`` \|
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	148 +-------------------------------+-------------------------------+---------------------+
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	149 \| Alignment for strict mode \| ``LDR`` / ``LD1 + REV`` \| ``LD1`` \|
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	150 +-------------------------------+-------------------------------+---------------------+
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	151
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	152 Neither approach is perfect, and choosing one boils down to choosing the lesser of two evils. Working around the lane ordering issue of the ``LDR`` layout would require changes to target-agnostic compiler passes and would result in a strange IR in which lane indices were reversed. It was decided that this was worse than the changes that would have to be made to support ``LD1``, so ``LD1`` was chosen as the canonical vector load instruction (and by inference, ``ST1`` for vector stores).
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	153
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	154 Implementation
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	155 ==============
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	156
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	157 There are 3 parts to the implementation:
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	158
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	159 1. Predicate ``LDR`` and ``STR`` instructions so that they are never allowed to be selected to generate vector loads and stores. The exception is one-lane vectors [1]_ - these by definition cannot have lane ordering problems so are fine to use ``LDR``/``STR``.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	160
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	161 2. Create code generation patterns for bitconverts that create ``REV`` instructions.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	162
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	163 3. Make sure appropriate bitconverts are created so that vector values get passed over call boundaries as 1-element vectors (which is the same as if they were loaded with ``LDR``).
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	164
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	165 Bitconverts
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	166 -----------
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	167
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	168 .. image:: ARM-BE-bitcastfail.png
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	169 :align: right
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	170
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	171 The main problem with the ``LD1`` solution is dealing with bitconverts (or bitcasts, or reinterpret casts). These are pseudo instructions that only change the compiler's interpretation of data, not the underlying data itself. A requirement is that if data is loaded and then saved again (called a "round trip"), the memory contents should be the same after the store as before the load. If a vector is loaded and is then bitconverted to a different vector type before storing, the round trip will currently be broken.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	172
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	173 Take for example this code sequence::
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	174
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	175 %0 = load <4 x i32>* %x
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	176 %1 = bitcast <4 x i32> %0 to <2 x i64>
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	177 store <2 x i64> %1, <2 x i64>* %y
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	178
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	179 This would produce a code sequence such as that in the figure on the right. The mismatched ``LD1`` and ``ST1`` cause the stored data to differ from the loaded data.
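
The broken round trip can be reproduced in a small byte-level model. The sketch below assumes a 128-bit register stored as 16 bytes with ``reg[0]`` as the least significant byte; ``ld1_q_be`` and ``st1_q_be`` are hypothetical names that simulate big endian ``LD1`` and ``ST1`` for a given lane width.

```c
#include <stdint.h>

/* Model of a big endian LD1 into a 128-bit register.
   reg[0] is the least significant byte of the register;
   lane_bytes is the lane width in bytes (4 for .4s, 8 for .2d). */
void ld1_q_be(uint8_t reg[16], const uint8_t mem[16], int lane_bytes) {
    for (int lane = 0; lane < 16 / lane_bytes; lane++)
        for (int j = 0; j < lane_bytes; j++)
            reg[lane * lane_bytes + j] =
                mem[lane * lane_bytes + (lane_bytes - 1 - j)];
}

/* Model of a big endian ST1: the exact inverse of ld1_q_be
   for the same lane width. */
void st1_q_be(uint8_t mem[16], const uint8_t reg[16], int lane_bytes) {
    for (int lane = 0; lane < 16 / lane_bytes; lane++)
        for (int j = 0; j < lane_bytes; j++)
            mem[lane * lane_bytes + (lane_bytes - 1 - j)] =
                reg[lane * lane_bytes + j];
}
```

Loading with a lane width of 4 and storing with a lane width of 8 swaps the two 32-bit words inside each 64-bit half of memory, whereas storing with the same lane width as the load restores the original bytes.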
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	180
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	181 .. container:: clearer
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	182
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	183 When we see a bitcast from type ``X`` to type ``Y``, what we need to do is to change the in-register representation of the data to be as if it had just been loaded by a ``LD1`` of type ``Y``.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	184
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	185 .. image:: ARM-BE-bitcastsuccess.png
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	186 :align: right
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	187
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	188 Conceptually this is simple - we can insert a ``REV`` undoing the ``LD1`` of type ``X`` (converting the in-register representation to the same as if it had been loaded by ``LDR``) and then insert another ``REV`` to change the representation to be as if it had been loaded by an ``LD1`` of type ``Y``.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	189
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	190 For the previous example, this would be::
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	191
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	192 LD1 v0.4s, [x]
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	193
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	194 REV64 v0.4s, v0.4s // There is no REV128 instruction, so it must be synthesized
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	195 EXT v0.16b, v0.16b, v0.16b, #8 // with a REV64 then an EXT to swap the two 64-bit elements.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	196
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	197 REV64 v0.2d, v0.2d
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	198 EXT v0.16b, v0.16b, v0.16b, #8
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	199
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	200 ST1 v0.2d, [y]
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	201
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	202 It turns out that these ``REV`` pairs can, in almost all cases, be squashed together into a single ``REV``. For the example above, a ``REV128 4s`` + ``REV128 2d`` is actually a ``REV64 4s``, as shown in the figure on the right.
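
The squashing of the two reversals can be verified on a byte-level model. This is a sketch only: the register is assumed to be 16 bytes with ``reg[0]`` least significant, and ``rev_lanes`` is a hypothetical helper that reverses the order of lanes within each group of bytes. A ``REV128`` corresponds to a group of 16 bytes and a ``REV64`` to a group of 8 bytes.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the order of the lanes inside each group of the register.
   group_bytes: 16 models a REV128, 8 models a REV64.
   lane_bytes:  4 for 32-bit lanes (s), 8 for 64-bit lanes (d).
   The bytes inside each lane are left untouched. */
void rev_lanes(uint8_t reg[16], int group_bytes, int lane_bytes) {
    uint8_t out[16];
    int lanes = group_bytes / lane_bytes;
    for (int g = 0; g < 16; g += group_bytes)
        for (int lane = 0; lane < lanes; lane++)
            memcpy(&out[g + (lanes - 1 - lane) * lane_bytes],
                   &reg[g + lane * lane_bytes], (size_t)lane_bytes);
    memcpy(reg, out, 16);
}
```

Applying ``rev_lanes(reg, 16, 4)`` followed by ``rev_lanes(reg, 16, 8)`` leaves the register in the same state as a single ``rev_lanes(reg, 8, 4)``, matching the claim that a ``REV128 4s`` followed by a ``REV128 2d`` is a ``REV64 4s``.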
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	203
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	204 .. [1] One lane vectors may seem useless as a concept but they serve to distinguish between values held in general purpose registers and values held in NEON/VFP registers. For example, an ``i64`` would live in an ``x`` register, but ``<1 x i64>`` would live in a ``d`` register.
54457678186b LLVM 3.6 Kaito Tokumori <e105711@ie.u-ryukyu.ac.jp> parents: diff changeset	205
