The Power Bi-Endian Vector Programming Model

To ensure portability of applications optimized to exploit the SIMD functions of Power ISA processors, this reference defines a set of functions and data types for SIMD programming. Compliant compilers will provide suitable support for these functions, preferably as built-in functions that translate to one or more Power ISA instructions.

Compilers are encouraged, but not required, to provide built-in functions to access individual instructions in the IBM Power® instruction set architecture. In most cases, each such built-in function should provide direct access to the underlying instruction.

However, to ease porting between little-endian (LE) and big-endian (BE) Power systems, and between Power and other platforms, it is preferable that some built-in functions provide the same semantics on both LE and BE Power systems, even if this means that the built-in functions are implemented with different instruction sequences for LE and BE. To achieve this, vector built-in functions provide a set of functions derived from the set of hardware functions provided by the Power SIMD instructions. Unlike traditional “hardware intrinsic” built-in functions, no fixed mapping exists between these built-in functions and the generated hardware instruction sequence. Rather, the compiler is free to generate optimized instruction sequences that implement the semantics of the program specified by the programmer using these built-in functions.

As we've seen, the Power SIMD instructions operate on groups of 1, 2, 4, 8, or 16 vector elements at a time in 128-bit registers. On a big-endian Power platform, vector elements are loaded from memory into a register so that the 0th element occupies the high-order bits of the register, and the (N – 1)th element occupies the low-order bits of the register. This is referred to as big-endian element order.
On a little-endian Power platform, vector elements are loaded from memory such that the 0th element occupies the low-order bits of the register, and the (N – 1)th element occupies the high-order bits. This is referred to as little-endian element order. Much of the information in this chapter was formerly part of Chapter 6 of the 64-Bit ELF V2 ABI Specification for Power.

Language Elements

The C and C++ languages are extended to use new identifiers vector, pixel, bool, __vector, __pixel, and __bool. These keywords are used to specify vector data types (see the Vector Types table). Because these identifiers may conflict with keywords in more recent language standards for C and C++, compilers may implement these in one of two ways:

1. __vector, __pixel, __bool, and bool are defined as keywords, with vector and pixel as predefined macros that expand to __vector and __pixel, respectively.
2. __vector, __pixel, and __bool are defined as keywords in all contexts, while vector, pixel, and bool are treated as keywords only within the context of a type declaration.

As a motivating example, the vector token is used as a type in the C++ Standard Template Library, and hence cannot be used as an unrestricted keyword, but can be used in the context-sensitive implementation. For example, vector char is distinct from std::vector in the context-sensitive implementation.

Vector literals may be specified using a type cast and a set of literal initializers in parentheses or braces. For example,

    vector int x = (vector int) (4, -1, 3, 6);
    vector double g = (vector double) { 3.5, -24.6 };

Current C compilers do not support literals for __int128 types. A vector __int128 constant can be constructed from smaller literals with appropriate cast-shift-or logic. For example,

    vector unsigned __int128 x = { (((unsigned __int128)0x1020304050607080) << 64) | 0x90A0B0C0D0E0F000 };
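The cast-shift-or construction can be tried on its own with the scalar unsigned __int128 type, which GCC and Clang provide on 64-bit targets. The following is a minimal sketch under that assumption; make_u128, u128_high, u128_low, and u128_round_trips are hypothetical helper names, not part of any vector API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: combine two 64-bit halves into one 128-bit value.
   The cast must happen before the shift, otherwise the shift would be
   performed in 64 bits and the high half would be lost. */
static unsigned __int128 make_u128(uint64_t high, uint64_t low)
{
    return ((unsigned __int128)high << 64) | low;
}

/* Split the value again so each half can be inspected or printed. */
static uint64_t u128_high(unsigned __int128 v) { return (uint64_t)(v >> 64); }
static uint64_t u128_low(unsigned __int128 v)  { return (uint64_t)v; }

/* Self-check: building and splitting must round-trip. */
static void u128_round_trips(uint64_t high, uint64_t low)
{
    unsigned __int128 v = make_u128(high, low);
    assert(u128_high(v) == high && u128_low(v) == low);
}
```

On a Power compiler with the vector extensions enabled, the value returned by make_u128 could serve as the single initializer of a vector unsigned __int128, as in the example above.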

Vector Data Types

Languages provide support for the data types in the Vector Types table to represent vector data types stored in vector registers.

For the C and C++ programming languages (and related/derived languages), the "Power SIMD C Types" listed in the leftmost column of the Vector Types table may be used when Power SIMD language extensions are enabled. Either vector or __vector may be used in the type name. Note that the ELFv2 ABI for Power also includes a vector _Float16 data type. As of this writing, no current compilers for Power have implemented such a type. This document does not include that type or any intrinsics related to it.

For the Fortran language, the Fortran Vector Data Types table gives a correspondence between Fortran and C/C++ language types.

The assignment operator always performs a byte-by-byte data copy for vector data types.

Like other C/C++ language types, vector types may be defined to have const or volatile properties. Vector data types can be defined as being in static, auto, and register storage.

Pointers to vector types are defined like pointers of other C/C++ types. Pointers to vector objects may be defined to have const and volatile properties. Pointers to vector objects must be addresses divisible by 16, as vector objects are always aligned on quadword (16-byte, or 128-bit) boundaries.

The preferred way to access vectors at an application-defined address is by using vector pointers and the C/C++ dereference operator *. As with other C/C++ data types, the array reference operator [] may be applied to a vector pointer, with the usual meaning of accessing the Nth vector object from that pointer. The dereference operator * may not be used to access data that is not aligned at least to a quadword boundary. Built-in functions such as vec_xl are provided for unaligned data access. Please refer to "Unaligned vector access" under "Examples and Limitations" for an example.

One vector type may be cast to another vector type without restriction. Such a cast is simply a reinterpretation of the bits, and does not change the data.
There are no default conversions for vector types.

Compilers are expected to recognize and optimize multiple operations that can be optimized into a single hardware instruction. For example, a load-and-splat hardware instruction (such as lxvdsx) might be generated for the following sequence:

    double *double_ptr;
    register vector double vd = vec_splats(*double_ptr);

Vector Types

Power SIMD C Types             sizeof  Alignment  Description
vector unsigned char           16      Quadword   Vector of 16 unsigned bytes.
vector signed char             16      Quadword   Vector of 16 signed bytes.
vector bool char               16      Quadword   Vector of 16 bytes with a value of either 0 or 2^8 – 1.
vector unsigned short          16      Quadword   Vector of 8 unsigned halfwords.
vector signed short            16      Quadword   Vector of 8 signed halfwords.
vector bool short              16      Quadword   Vector of 8 halfwords with a value of either 0 or 2^16 – 1.
vector pixel                   16      Quadword   Vector of 8 halfwords, each interpreted as a 1-bit channel and three 5-bit channels.
vector unsigned int            16      Quadword   Vector of 4 unsigned words.
vector signed int              16      Quadword   Vector of 4 signed words.
vector bool int                16      Quadword   Vector of 4 words with a value of either 0 or 2^32 – 1.
vector unsigned long long      16      Quadword   Vector of 2 unsigned doublewords.
  (also vector unsigned long *)
vector signed long long        16      Quadword   Vector of 2 signed doublewords.
  (also vector signed long *)
vector bool long long          16      Quadword   Vector of 2 doublewords with a value of either 0 or 2^64 – 1.
  (also vector bool long *)
vector unsigned __int128       16      Quadword   Vector of 1 unsigned quadword.
vector signed __int128         16      Quadword   Vector of 1 signed quadword.
vector float                   16      Quadword   Vector of 4 single-precision floats.
vector double                  16      Quadword   Vector of 2 double-precision floats.

* The vector long types are deprecated due to their ambiguity between 32-bit and 64-bit environments. The use of the vector long long types is preferred.

Vector Operators

In addition to the dereference and assignment operators, the Power Bi-Endian Vector Programming Model provides the usual operators that are valid on pointers; these operators are also valid for pointers to vector types.

The traditional C/C++ unary operators (+, -, and ~) are defined on vector types. The traditional C/C++ binary operators (+, -, *, %, /, shift, logical, and comparison) and the ternary operator (?:) are defined on like vector types. Other than ?:, these operators perform their operations "elementwise" on the base elements of the operands, as follows.

For unary operators, the specified operation is performed on each base element of the single operand to derive the result value placed into the corresponding element of the vector result. The result type of unary operations is the type of the single operand. For example,

    vector signed int a, b;
    a = -b;

produces the same result as

    vector signed int a, b;
    a = vec_neg (b);

For binary operators, the specified operation is performed on corresponding base elements of both operands to derive the result value for each vector element of the vector result. Both operands of the binary operators must have the same vector type with the same base element type. The result of binary operators is the same type as the type of the operands. For example,

    vector signed int a, b;
    a = a + b;

produces the same result as

    vector signed int a, b;
    a = vec_add (a, b);

For the ternary operator (?:), the first operand must be an integral type, used to select between the second and third operands, which must be of the same vector type. The result of the ternary operator will also have that type. For example,

    int test_value;
    vector signed int a, b, r;
    r = test_value ? a : b;

produces the same result as

    int test_value;
    vector signed int a, b, r;
    if (test_value)
      r = a;
    else
      r = b;

Further, the array reference operator may be applied to vector data types, yielding an l-value corresponding to the specified element in accordance with the vector element numbering rules (see "Vector Layout and Element Numbering"). An l-value may either be assigned a new value or accessed for reading its value. For example,

    vector signed int a;
    signed int b, c;
    b = a[0];
    a[3] = c;
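The operator semantics above can be exercised on any host with the GCC/Clang generic vector extension. This sketch swaps in that extension (the vector_size attribute) for the Power vector signed int type so it also builds on non-Power machines; the typedef v4si and the helper names are hypothetical.

```c
#include <assert.h>

/* Assumption: GCC/Clang generic vector extension, used here as a stand-in
   for "vector signed int". Four 32-bit ints in a 16-byte vector. */
typedef int v4si __attribute__((vector_size(16)));

/* Unary minus is applied to each element, like a = -b. */
static v4si negate(v4si b) { return -b; }

/* Binary plus adds corresponding elements, like a = a + b. */
static v4si add(v4si a, v4si b) { return a + b; }

/* The ternary operator selects a whole vector based on a scalar test. */
static v4si choose(int test_value, v4si a, v4si b)
{
    return test_value ? a : b;
}

/* The array reference operator reads one element. */
static int element(v4si a, int i)
{
    assert(i >= 0 && i < 4);
    return a[i];
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, add(a, b) holds 11 in element 0 and 44 in element 3, and choose(0, a, b) returns b unchanged.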

Vector Layout and Element Numbering

Vector data types consist of a homogeneous sequence of elements of the base data type specified in the vector data type. Individual elements of a vector can be addressed by a vector element number. To understand how vector elements are represented in memory and in registers, it is best to start with some simple concepts of endianness.

[Figure: Scalar Quantities and Endianness]

The figure "Scalar Quantities and Endianness" shows different representations of a 64-bit scalar integer with the hexadecimal value 0x0123456789ABCDEF. We say that the most significant byte (MSB) of this value is 0x01, and its least significant byte (LSB) is 0xEF. The scalar value is stored using eight bytes of memory. On a little-endian (LE) system, the LSB is stored at the lowest address of these eight bytes, and the MSB is stored at the highest address. On a big-endian (BE) system, the MSB is stored at the lowest address of these eight bytes, and the LSB is stored at the highest address. Regardless of the memory order, the register representation of the scalar value is identical; the MSB is located on the "left" end of the register, and the LSB is located on the "right" end.

Of course, the concept of "left" and "right" is a useful fiction; there is no guarantee that the circuitry of a hardware register is laid out this way. However, we will see, as we deal with vector elements, that the concepts of left and right are more natural for human understanding than byte and element significance. Indeed, most programming languages have operators, such as shift-left and shift-right, that use this same terminology.

Let's move from scalars to arrays, which are more interesting to us since we can use vector registers to operate on arrays, or portions of larger arrays. Suppose we have an array of bytes with values 0 through 15, as shown in the figure "Byte Arrays and Endianness". Note that each byte is a separate data element with only one possible representation in memory, so the array of bytes looks identical in memory, regardless of whether we are using a BE system or an LE system. But when we load these 16 bytes into a vector register, perhaps by using the ISA 3.0 lxv instruction, the byte at the lowest address on an LE system will be placed in the LSB of the vector register, but on a BE system will be placed in the MSB of the vector register. Thus the array elements appear "right to left" in the register on an LE system, and "left to right" in the register on a BE system.
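The memory layout of the scalar 0x0123456789ABCDEF can be inspected directly in portable C. The following is a minimal sketch; host_is_little_endian and load_u64 are hypothetical helper names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the host stores the LSB at the lowest address (LE),
   0 if it stores the MSB there (BE). */
static int host_is_little_endian(void)
{
    const uint64_t value = 0x0123456789ABCDEFULL;
    unsigned char bytes[sizeof value];

    memcpy(bytes, &value, sizeof value);
    /* Only the two byte orders discussed in the text are handled. */
    assert(bytes[0] == 0xEF || bytes[0] == 0x01);
    return bytes[0] == 0xEF;
}

/* Rebuild the register value from eight memory bytes. Whatever the memory
   order, the result has the MSB "on the left". */
static uint64_t load_u64(const unsigned char *p, int little_endian)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        int idx = little_endian ? 7 - i : i;   /* walk from MSB to LSB */
        v = (v << 8) | p[idx];
    }
    return v;
}
```

Feeding load_u64 the LE byte sequence EF CD AB 89 67 45 23 01 with little_endian set, or the BE sequence 01 23 45 67 89 AB CD EF with it clear, yields the same register value.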

[Figure: Byte Arrays and Endianness]

Things become even more interesting when we consider arrays of larger elements. In the figure "Word Arrays and Endianness", we see the layout of an array of four 32-bit integers, where the 0th element has hexadecimal value 0x00010203, the 1st element has value 0x04050607, the 2nd element has value 0x08090A0B, and the 3rd element has value 0x0C0D0E0F. The order of the array elements in memory is the same for both LE and BE systems, but the layout of each element itself is reversed. When the lxv instruction is used to load the memory into a vector register, again the low address is loaded into the LSB of the register for LE, but loaded into the MSB of the register for BE. The effect is that the array elements again appear right-to-left on an LE system and left-to-right on a BE system. Note that each 32-bit element of the array has its most significant bit "on the left" whether an LE or BE system is in use. This is of course necessary for proper arithmetic to be performed on the array elements by vector instructions.

[Figure: Word Arrays and Endianness]

Thus on a BE system, we number vector elements starting with 0 on the left, while on an LE system, we number vector elements starting with 0 on the right. We will informally refer to these as big-endian and little-endian vector element numberings and vector layouts.

This element numbering shall also be used by the [] accessor for vector elements provided as an extension of the C/C++ languages by some compilers, as well as for other language extensions or library constructs that directly or indirectly refer to elements by their element number.

Application programs may query the vector element ordering in use by testing the __VEC_ELEMENT_REG_ORDER__ macro. This macro has two possible values:

    __ORDER_LITTLE_ENDIAN__   Vector elements use little-endian element ordering.
    __ORDER_BIG_ENDIAN__      Vector elements use big-endian element ordering.

This is no longer as useful as it once was. The primary use case was for big-endian vector layout in little-endian environments, which is now deprecated as discussed in "Big-Endian Vector Layout in Little-Endian Environments (Deprecated)". It's generally equivalent to test for __BIG_ENDIAN__ or __LITTLE_ENDIAN__.

Remember that each element in a vector has the same representation in both big- and little-endian element orders. That is, an int is always 32 bits, with the sign bit in the high-order position. Programmers must be aware of this when programming with mixed data types, such as an instruction that multiplies two short elements to produce an int element. Always access entire elements to avoid potential endianness issues.
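A compile-time query along these lines might look like the sketch below. It assumes that where __VEC_ELEMENT_REG_ORDER__ is not defined (for example on a non-Power host), falling back to the GCC/Clang __BYTE_ORDER__ macro or to __LITTLE_ENDIAN__ / __BIG_ENDIAN__ is acceptable; the function names are hypothetical.

```c
#include <assert.h>

/* Returns 1 for little-endian element order, 0 for big-endian. */
static int vector_elements_little_endian(void)
{
#if defined(__VEC_ELEMENT_REG_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
    return __VEC_ELEMENT_REG_ORDER__ == __ORDER_LITTLE_ENDIAN__;
#elif defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
    /* Assumption: byte order stands in for element order. */
    return __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
#elif defined(__LITTLE_ENDIAN__)
    return 1;
#elif defined(__BIG_ENDIAN__)
    return 0;
#else
#error "cannot determine endianness at compile time"
#endif
}

/* Run-time byte-order probe used as a cross-check. */
static int host_bytes_little_endian(void)
{
    const unsigned int one = 1;
    return *(const unsigned char *)&one == 1;
}

/* With big-endian layout in LE environments deprecated, the element
   order is expected to agree with the byte order. */
static void check_element_order(void)
{
    assert(vector_elements_little_endian() == host_bytes_little_endian());
}
```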

Vector Built-In Functions

Some of the Power SIMD hardware instructions refer, implicitly or explicitly, to vector element numbers. For example, the vspltb instruction has as one of its inputs an index into a vector. The element at that index position is to be replicated in every element of the output vector. For another example, the vmuleuh instruction operates on the even-numbered elements of its input vectors. The hardware instructions define these element numbers using big-endian element order, even when the machine is running in little-endian mode. Thus, a built-in function that maps directly to the underlying hardware instruction, regardless of the target endianness, has the potential to confuse programmers on little-endian platforms.

It is more useful to define built-in functions that map to these instructions to use natural element order. That is, the explicit or implicit element numbers specified by such built-in functions should be interpreted using big-endian element order on a big-endian platform, and using little-endian element order on a little-endian platform.

The descriptions of the individual built-in functions contain notes on endian issues that apply to each built-in function. Furthermore, a built-in function requiring a different compiler implementation for big-endian than it uses for little-endian has a sample compiler implementation for both BE and LE. These sample implementations are only intended as examples; designers of a compiler are free to use other methods to implement the specified semantics.

Of course, most built-in functions operate only on corresponding sets of elements of input vectors to produce output vectors, and thus are not "endian-sensitive." A complete list of endian-sensitive built-in functions can be found in the table "Endian-Sensitive Built-In Functions".

Endian-Sensitive Built-In Functions (table entries are not reproduced here; two of the entries are marked "ISA 2.07 only")
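The translation from a natural element number to the big-endian element number used by the hardware is a simple reflection. The sketch below models that arithmetic only; hw_element_number is a hypothetical name, and an actual compiler is free to implement the built-in functions differently.

```c
#include <assert.h>

/* Map a natural element number to the big-endian element number that a
   hardware instruction expects. n is the number of elements in the
   vector (16, 8, 4, 2, or 1). */
static unsigned hw_element_number(unsigned natural, unsigned n,
                                  int little_endian)
{
    assert(natural < n);
    return little_endian ? n - 1 - natural : natural;
}
```

For a vector of four words, natural element 1 corresponds to hardware element 2 on a little-endian platform and to hardware element 1 on a big-endian platform. Because n is even for multi-element vectors, the even-numbered natural elements on a little-endian platform are the odd-numbered elements in the hardware's numbering.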

Extended Data Movement Functions

The built-in functions in the "VMX Memory Access Built-In Functions" table map to Altivec/VMX load and store instructions and provide access to the “auto-aligning” memory instructions of the VMX ISA where low-order address bits are discarded before performing a memory access. These instructions load and store data in accordance with the program's current endian mode, and do not need to be adapted by the compiler to reflect little-endian operation during code generation.

Before the bi-endian programming model was introduced, the vec_lvsl and vec_lvsr intrinsics were supported. These could be used in conjunction with VMX load and store instructions for unaligned access. The vec_lvsl and vec_lvsr interfaces are deprecated in accordance with the interfaces specified here. For compatibility, the built-in pseudo sequences published in previous VMX documents continue to work with little-endian data layout and the little-endian vector layout described in this document. However, the use of these sequences in new code is discouraged and usually results in worse performance. It is recommended that compilers issue a warning when these functions are used in little-endian environments.

VMX Memory Access Built-In Functions

Built-in Function   Corresponding Power Instructions   Implementation Notes
                    lvx                                Hardware works as a function of endian mode.
                    lvebx, lvehx, lvewx                Hardware works as a function of endian mode.
                    lvxl                               Hardware works as a function of endian mode.
                    stvx                               Hardware works as a function of endian mode.
                    stvebx, stvehx, stvewx             Hardware works as a function of endian mode.
                    stvxl                              Hardware works as a function of endian mode.

Instead of the deprecated vec_lvsl and vec_lvsr sequences, it is recommended that programmers use vector built-in functions such as vec_xl (see "Unaligned vector access") to access unaligned data streams. See the descriptions of these built-in functions for further description and implementation details.

Big-Endian Vector Layout in Little-Endian Environments (Deprecated)

Versions 1.0 through 1.4 of the 64-Bit ELFv2 ABI Specification for Power provided for optional compiler support for using big-endian element ordering in little-endian environments. This was initially deemed useful for porting certain libraries that assumed big-endian element ordering regardless of the endianness of their input streams. In practice, this introduced serious compiler complexity without much utility. Thus this support (previously controlled by switches -maltivec=be and/or -qaltivec=be) is now deprecated. Current versions of the GCC, Clang, and Open XL compilers do not implement this support.

Language-Specific Vector Support for Other Languages

Fortran

The Fortran Vector Data Types table shows the correspondence between the C/C++ types described in this document and their Fortran equivalents. In Fortran, the Boolean vector data types are represented by VECTOR(UNSIGNED(n)).

Fortran Vector Data Types

XL Fortran Vector Type   XL C/C++ Vector Type
VECTOR(INTEGER(1))       vector signed char
VECTOR(INTEGER(2))       vector signed short
VECTOR(INTEGER(4))       vector signed int
VECTOR(INTEGER(8))       vector signed long long, vector signed long *
VECTOR(INTEGER(16))      vector signed __int128
VECTOR(UNSIGNED(1))      vector unsigned char
VECTOR(UNSIGNED(2))      vector unsigned short
VECTOR(UNSIGNED(4))      vector unsigned int
VECTOR(UNSIGNED(8))      vector unsigned long long, vector unsigned long *
VECTOR(UNSIGNED(16))     vector unsigned __int128
VECTOR(REAL(4))          vector float
VECTOR(REAL(8))          vector double
VECTOR(PIXEL)            vector pixel

* The vector long types are deprecated due to their ambiguity between 32-bit and 64-bit environments. The use of the vector long long types is preferred.

Because the Fortran language does not support pointers, vector built-in functions that expect pointers to a base type take an array element reference to indicate the address of a memory location that is the subject of a memory access built-in function.

Because the Fortran language does not support type casts, the vec_convert and vec_concat built-in functions shown in the "Built-In Vector Conversion Functions" table are provided to perform bit-exact type conversions between vector types.

Built-In Vector Conversion Functions

VEC_CONCAT(ARG1, ARG2) (Fortran)
    Purpose: Concatenates two elements to form a vector.
    Result value: The resulting vector consists of the two scalar elements, ARG1 and ARG2, assigned to elements 0 and 1 (using the environment’s native endian numbering), respectively.
    Note: This function corresponds to the C/C++ vector constructor (vector type){a,b}. It is provided only for languages without vector constructors.
    Prototypes:
        vector signed long long vec_concat (signed long long, signed long long);
        vector unsigned long long vec_concat (unsigned long long, unsigned long long);
        vector double vec_concat (double, double);

VEC_CONVERT(V, MOLD)
    Purpose: Converts a vector to a vector of a given type.
    Class: Pure function
    Argument type and attributes:
        V     Must be an INTENT(IN) vector.
        MOLD  Must be an INTENT(IN) vector. If it is a variable, it need not be defined.
    Result type and attributes: The result is a vector of the same type as MOLD.
    Result value: The result is as if it were on the left-hand side of an intrinsic assignment with V on the right-hand side.

Examples and Limitations

Unaligned vector access

A common programming error is to cast a pointer to a base type (such as int) to a pointer of the corresponding vector type (such as vector int), and then dereference the pointer. This constitutes undefined behavior, because it casts a pointer with a smaller alignment requirement to a pointer with a larger alignment requirement. In the presence of undefined behavior, compilers may not produce the code you expect.

Thus, do not write the following:

    int a[4096];
    vector int x = *((vector int *) a);

Instead, write this:

    int a[4096];
    vector int x = vec_xl (0, a);
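A fuller version of the recommended pattern might look like the following sketch. It assumes a compiler that defines __VSX__ and provides altivec.h when targeting Power with VSX enabled; on any other host it falls back to memcpy so the sketch still builds. The helper name load4_unaligned is hypothetical.

```c
#include <assert.h>
#include <string.h>
#if defined(__VSX__)
#include <altivec.h>
#endif

/* Copy four ints starting at src, which needs only int alignment,
   into out. No vector pointer is ever formed from src. */
static void load4_unaligned(int *src, int out[4])
{
    assert(src != NULL && out != NULL);
#if defined(__VSX__)
    /* vec_xl has no quadword alignment requirement on its address. */
    __vector signed int v = vec_xl(0, src);
    memcpy(out, &v, sizeof v);
#else
    /* Assumption: portable stand-in for hosts without VSX. */
    memcpy(out, src, 4 * sizeof *src);
#endif
}
```

Calling load4_unaligned(a + 1, out) on an int array a reads elements 1 through 4 even though a + 1 is generally not aligned on a quadword boundary.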

vec_sld and vec_sro are not bi-endian

One oddity in the bi-endian vector programming model is that vec_sld has big-endian semantics for code compiled for both big-endian and little-endian targets. That is, any code that uses vec_sld without guarding it with a test on endianness is likely to be incorrect.

At the time that the bi-endian model was being developed, it was discovered that existing code in several Linux packages was using vec_sld in order to perform multiplies, or to otherwise shift portions of base elements left. A straightforward little-endian implementation of vec_sld would concatenate the two input vectors in reverse order and shift bytes to the right. This would only give compatible results for vector char types. Those using this intrinsic as a cheap multiply, or to shift bytes within larger elements, would see different results on little-endian versus big-endian with such an implementation. Therefore it was decided that vec_sld would not have a bi-endian implementation.

vec_sro is not bi-endian for similar reasons.
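The "cheap multiply" usage can be modeled in plain C on register images, that is, 16-byte arrays whose index 0 is the leftmost register byte. This is a sketch of the big-endian shift semantics only, not a compiler implementation; sld_model and word_at are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the big-endian shift: view a followed by b as 32 bytes,
   shift left by n bytes, and keep the leftmost 16 bytes. */
static void sld_model(const uint8_t a[16], const uint8_t b[16],
                      unsigned n, uint8_t r[16])
{
    assert(n < 16);
    for (unsigned i = 0; i < 16; i++) {
        unsigned k = i + n;
        r[i] = k < 16 ? a[k] : b[k - 16];
    }
}

/* Read the 32-bit word in the given slot, counting slots from the left.
   Each word has its most significant byte on the left. */
static uint32_t word_at(const uint8_t reg[16], unsigned slot)
{
    assert(slot < 4);
    const uint8_t *p = reg + 4 * slot;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | p[3];
}
```

Take the word vector {1, 2, 3, 4}. Its big-endian register image holds the words left to right, and its little-endian image holds them right to left. Shifting either image left by one byte against a zero vector multiplies every word by 256, so element 0 becomes 256 and element 3 becomes 1024 in both layouts. This is the behavior the existing code relied on, and it holds because the shift direction is fixed relative to the register rather than to the element numbering.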

Limitations on bi-endianness of vec_perm

The vec_perm intrinsic is bi-endian, provided that it is used to reorder entire elements of the input vectors.

To see why this is, let's examine the code generation for

    vector int t;
    vector int a = (vector int){0x00010203, 0x04050607, 0x08090a0b, 0x0c0d0e0f};
    vector int b = (vector int){0x10111213, 0x14151617, 0x18191a1b, 0x1c1d1e1f};
    vector char c = (vector char){0,1,2,3,28,29,30,31,12,13,14,15,20,21,22,23};
    t = vec_perm (a, b, c);

For big endian, a compiler should generate:

    vperm t,a,b,c

For little endian targeting a POWER8 system, a compiler should generate:

    vnand d,c,c
    vperm t,b,a,d

For little endian targeting a POWER9 system, a compiler should generate:

    vpermr t,b,a,c

Note that the vpermr instruction itself performs the modification of the permute control vector (PCV) c that the vnand instruction performs for POWER8. Because only the bottom 5 bits of each element of the PCV are read by the hardware, complementing the PCV has the effect of subtracting the original elements of the PCV from 31.

Note also that the PCV c has element values that are contiguous in groups of 4. This selects entire elements from the input vectors a and b to reorder. Thus the intent of the code is to select the first integer element of a, the last integer element of b, the last integer element of a, and the second integer element of b, in that order.

The big endian result is {0x00010203, 0x1c1d1e1f, 0x0c0d0e0f, 0x14151617}, as shown here:

    a  00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
    b  10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
    c   0  1  2  3 28 29 30 31 12 13 14 15 20 21 22 23
    t  00 01 02 03 1C 1D 1E 1F 0C 0D 0E 0F 14 15 16 17

For little endian, each element of the PCV is subtracted from 31, giving the modified PCV {31,30,29,28,3,2,1,0,19,18,17,16,11,10,9,8}. Since the elements appear in reverse order in a register when loaded from little-endian memory, the elements appear in the register from left to right as {8,9,10,11,16,17,18,19,0,1,2,3,28,29,30,31}. So the following vperm instruction will again select entire elements using the groups of 4 contiguous bytes, and the values of the integers will be reordered without compromising each integer's contents. The little-endian result matches the big-endian result, as shown. Observe that a and b switch positions for little endian code generation.

    b  1C 1D 1E 1F 18 19 1A 1B 14 15 16 17 10 11 12 13
    a  0C 0D 0E 0F 08 09 0A 0B 04 05 06 07 00 01 02 03
    c   8  9 10 11 16 17 18 19  0  1  2  3 28 29 30 31
    t  14 15 16 17 0C 0D 0E 0F 1C 1D 1E 1F 00 01 02 03

Now, suppose instead that the original PCV does not reorder entire integers at once:

    vector char c = (vector char){0,20,31,4,7,17,6,19,30,3,2,8,9,13,5,22};

The result of the big-endian implementation would be:

    t = {0x00141f04, 0x07110613, 0x1e030208, 0x090d0516};

    a  00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
    b  10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
    c   0 20 31  4  7 17  6 19 30  3  2  8  9 13  5 22
    t  00 14 1F 04 07 11 06 13 1E 03 02 08 09 0D 05 16

For little-endian, the modified PCV would be {31,11,0,27,24,14,25,12,1,28,29,23,22,18,26,9}, appearing in the register as {9,26,18,22,23,29,28,1,12,25,14,24,27,0,11,31}. The final little-endian result would be

    t = {0x071c1703, 0x10051204, 0x0b01001d, 0x15060e0a};

which bears no resemblance to the big-endian result.

    b  1C 1D 1E 1F 18 19 1A 1B 14 15 16 17 10 11 12 13
    a  0C 0D 0E 0F 08 09 0A 0B 04 05 06 07 00 01 02 03
    c   9 26 18 22 23 29 28  1 12 25 14 24 27  0 11 31
    t  15 06 0E 0A 0B 01 00 1D 10 05 12 04 07 1C 17 03

The lesson here is to only use vec_perm to reorder entire elements of a vector. If you must use vec_perm for another purpose, your code must include a test for endianness and separate algorithms for big- and little-endian. Examples of this may be seen in the Power Vector Library project.
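The walk-through above can be reproduced with a small software model. The sketch below treats each register as a 16-byte image whose index 0 is the leftmost byte, models vperm on those images, and applies the POWER8 little-endian sequence (vnand followed by vperm with swapped inputs). All function names are hypothetical, and this is a model of the described semantics rather than compiler output.

```c
#include <assert.h>
#include <stdint.h>

/* Place four 32-bit elements into a register image. Element 0 goes in
   the leftmost word for BE and in the rightmost word for LE. */
static void words_to_reg(const uint32_t w[4], int le, uint8_t reg[16])
{
    for (int e = 0; e < 4; e++) {
        int slot = le ? 3 - e : e;
        for (int k = 0; k < 4; k++)
            reg[4 * slot + k] = (uint8_t)(w[e] >> (24 - 8 * k));
    }
}

/* Inverse of words_to_reg. */
static void reg_to_words(const uint8_t reg[16], int le, uint32_t w[4])
{
    for (int e = 0; e < 4; e++) {
        const uint8_t *p = reg + 4 * (le ? 3 - e : e);
        w[e] = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8) | p[3];
    }
}

/* vperm model: each control byte selects one byte of a followed by b,
   using only its low 5 bits. */
static void vperm_model(const uint8_t a[16], const uint8_t b[16],
                        const uint8_t c[16], uint8_t t[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned idx = c[i] & 0x1F;
        t[i] = idx < 16 ? a[idx] : b[idx - 16];
    }
}

/* Model of t = vec_perm (a, b, pcv) for word vectors. */
static void vec_perm_model(const uint32_t a[4], const uint32_t b[4],
                           const uint8_t pcv[16], int le, uint32_t t[4])
{
    uint8_t ra[16], rb[16], rc[16], rt[16];

    assert(le == 0 || le == 1);
    words_to_reg(a, le, ra);
    words_to_reg(b, le, rb);
    for (int i = 0; i < 16; i++)
        rc[le ? 15 - i : i] = pcv[i];      /* byte elements reverse on LE */
    if (le) {
        for (int i = 0; i < 16; i++)
            rc[i] = (uint8_t)~rc[i];       /* vnand d,c,c */
        vperm_model(rb, ra, rc, rt);       /* vperm t,b,a,d */
    } else {
        vperm_model(ra, rb, rc, rt);       /* vperm t,a,b,c */
    }
    reg_to_words(rt, le, t);
}
```

Run with the first PCV, the model returns {0x00010203, 0x1c1d1e1f, 0x0c0d0e0f, 0x14151617} for both settings of le. Run with the second PCV, it returns the two different results quoted above.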