Vector Programming Interfaces

To ensure portability of applications optimized to exploit the SIMD functions of Power ISA processors, the ELF V2 ABI defines a set of functions and data types for SIMD programming. ELF V2-compliant compilers will provide suitable support for these functions, preferably as built-in functions that translate to one or more Power ISA instructions.

Compilers are encouraged, but not required, to provide built-in functions to access individual instructions in the IBM POWER® instruction set architecture. In most cases, each such built-in function should provide direct access to the underlying instruction.

However, to ease porting between little-endian (LE) and big-endian (BE) POWER systems, and between POWER and other platforms, it is preferable that some built-in functions provide the same semantics on both LE and BE POWER systems, even if this means that the built-in functions are implemented with different instruction sequences for LE and BE.

To achieve this, vector built-in functions provide a set of functions derived from the set of hardware functions provided by the Power vector SIMD instructions. Unlike traditional “hardware intrinsic” built-in functions, no fixed mapping exists between these built-in functions and the generated hardware instruction sequence. Rather, the compiler is free to generate optimized instruction sequences that implement the semantics of the program specified by the programmer using these built-in functions.

This is primarily applicable to the vector facility of the POWER ISA, also known as Power SIMD, consisting of the VMX (or Altivec) and VSX instructions. This set of instructions operates on groups of 2, 4, 8, or 16 vector elements at a time in 128-bit registers.

On a big-endian POWER platform, vector elements are loaded from memory into a register so that the 0th element occupies the high-order bits of the register, and the (N-1)th element occupies the low-order bits of the register. This is referred to as big-endian element order. On a little-endian POWER platform, vector elements are loaded from memory such that the 0th element occupies the low-order bits of the register, and the (N-1)th element occupies the high-order bits. This is referred to as little-endian element order.
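The two element orders can be pictured with a small model in portable C. This is only a sketch: the helper name register_slot is hypothetical, and the register is modeled as a row of slots numbered from the high-order (left) end. It does not use any vector hardware.

```c
#include <assert.h>

/* Hypothetical helper: for a vector of n elements, return the register slot
 * (0 = high-order/left-most, n-1 = low-order/right-most) that receives the
 * element with memory index elem. */
static int register_slot(int elem, int n, int little_endian)
{
    assert(n > 0 && elem >= 0 && elem < n);
    /* Big-endian element order: element 0 sits in the high-order slot.
     * Little-endian element order: element 0 sits in the low-order slot. */
    return little_endian ? (n - 1 - elem) : elem;
}
```

For a four-element vector, element 0 lands in slot 0 under big-endian element order and in slot 3 under little-endian element order.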

Vector Data Types

Languages provide support for the data types in to represent vector data types stored in vector registers. For the C and C++ programming languages (and related/derived languages), these data types may be accessed based on the type names listed in when Power ISA SIMD language extensions are enabled using either the vector or __vector keywords. For the Fortran language, the Fortran Vector Data Types table gives a correspondence of Fortran and C/C++ language types.

The assignment operator always performs a byte-by-byte data copy for vector data types. Like other C/C++ language types, vector types may be defined to have const or volatile properties. Vector data types can be defined as being in static, auto, and register storage.

Pointers to vector types are defined like pointers of other C/C++ types. Pointers to objects may be defined to have const and volatile properties. While the preferred alignment for vector data types is a multiple of 16 bytes, pointers may point to vector objects at an arbitrary alignment.

The preferred way to access vectors at an application-defined address is by using vector pointers and the C/C++ dereference operator *. Similar to other C/C++ data types, the array reference operator [] may be used to access vector objects with a vector pointer with the usual definition to access the n-th vector element from a vector pointer. The use of vector built-in functions such as vec_xl and vec_xst is discouraged except for languages where no dereference operators are available.

    vector char vca;
    vector char vcb;
    vector int via;
    int a[4];
    void *vp;

    via = *(vector int *) &a[0];
    vca = (vector char) via;
    vcb = vca;
    vca = *(vector char *)vp;
    *(vector char *)&a[0] = vca;

Compilers are expected to recognize and optimize multiple operations that can be optimized into a single hardware instruction. For example, a load and splat hardware instruction might be generated for the following sequence:

    double *double_ptr;
    register vector double vd = vec_splats(*double_ptr);
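The byte-by-byte copy semantics of assignment, and of dereferencing a vector pointer at an arbitrary alignment, can be modeled in portable C. The sketch below assumes a hypothetical 16-byte struct v128 standing in for a vector value and uses memcpy where real code would dereference a vector pointer. It illustrates the semantics only and says nothing about the instructions a compiler would choose.

```c
#include <string.h>

/* Hypothetical stand-in for a 128-bit vector value. */
typedef struct { unsigned char bytes[16]; } v128;

/* Models "v = *(vector T *)p": read 16 bytes from an address of any alignment. */
static v128 load_v128(const void *p)
{
    v128 v;
    memcpy(v.bytes, p, sizeof v.bytes);
    return v;
}

/* Models "*(vector T *)p = v": write 16 bytes to an address of any alignment. */
static void store_v128(void *p, v128 v)
{
    memcpy(p, v.bytes, sizeof v.bytes);
}
```

Because the model copies bytes without interpreting them, a value loaded from an int array and stored through a differently typed pointer keeps exactly the same bit pattern, which mirrors the cast between vector int and vector char in the example above.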

Vector Operators

In addition to the dereference and assignment operators, the Power SIMD Vector Programming API provides the usual operators that are valid on pointers; these operators are also valid for pointers to vector types.

The traditional C/C++ operators are defined on vector types with “do all” semantics for unary and binary +, unary and binary -, binary *, binary %, and binary / as well as the unary and binary logical and comparison operators.

For unary operators, the specified operation is performed on the corresponding base element of the single operand to derive the result value for each vector element of the vector result. The result type of unary operations is the type of the single input operand.

For binary operators, the specified operation is performed on the corresponding base elements of both operands to derive the result value for each vector element of the vector result. Both operands of the binary operators must have the same vector type with the same base element type. The result of binary operators is the same type as the type of the input operands.

Further, the array reference operator may be applied to vector data types, yielding an l-value corresponding to the specified element in accordance with the vector element numbering rules (see Vector Layout and Element Numbering). An l-value may either be assigned a new value or accessed for reading its value.
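The “do all” semantics can be sketched in portable C as an element-wise loop. The type vec_u32_model and the helper names below are hypothetical; they model a vector unsigned int with four elements and are not part of any vector API.

```c
#include <stdint.h>

#define MODEL_ELEMS 4

/* Hypothetical model of a "vector unsigned int": four 32-bit elements. */
typedef struct { uint32_t e[MODEL_ELEMS]; } vec_u32_model;

/* Binary "+": the operation is applied to corresponding elements of both
 * operands, and the result has the same type as the operands. */
static vec_u32_model model_add(vec_u32_model a, vec_u32_model b)
{
    vec_u32_model r;
    for (int i = 0; i < MODEL_ELEMS; i++)
        r.e[i] = a.e[i] + b.e[i];
    return r;
}

/* Unary "-": the operation is applied to each element of the single operand. */
static vec_u32_model model_neg(vec_u32_model a)
{
    vec_u32_model r;
    for (int i = 0; i < MODEL_ELEMS; i++)
        r.e[i] = 0u - a.e[i];   /* modular arithmetic on unsigned elements */
    return r;
}
```

With vector extensions enabled, the same effect would be written directly as a + b or -a on vector operands.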

Vector Layout and Element Numbering

Vector data types consist of a homogeneous sequence of elements of the base data type specified in the vector data type. Individual elements of a vector can be addressed by a vector element number. Element numbers can be established either by counting from the “left” of a register and assigning the left-most element the element number 0, or from the “right” of the register and assigning the right-most element the element number 0.

In big-endian environments, establishing element counts from the left makes the element stored at the lowest memory address the lowest-numbered element. Thus, when vectors and arrays of a given base data type are overlaid, vector element 0 corresponds to array element 0, vector element 1 corresponds to array element 1, and so forth.

In little-endian environments, establishing element counts from the right makes the element stored at the lowest memory address the lowest-numbered element. Thus, when vectors and arrays of a given base data type are overlaid, vector element 0 will correspond to array element 0, vector element 1 will correspond to array element 1, and so forth.

Consequently, the vector numbering schemes can be described as big-endian and little-endian vector layouts and vector element numberings. (The term “endian” comes from the endian debates presented in Gulliver's Travels by Jonathan Swift.)

For internal consistency, in the ELF V2 ABI, the default vector layout and vector element ordering in big-endian environments shall be big endian, and the default vector layout and vector element ordering in little-endian environments shall be little endian.

This element numbering shall also be used by the [] accessor method to vector elements provided as an extension of the C/C++ languages by some compilers, as well as for other language extensions or library constructs that directly or indirectly refer to elements by their element number.

Application programs may query the vector element ordering in use (that is, whether -qaltivec=be or -maltivec=be has been selected) by testing the __VEC_ELEMENT_REG_ORDER__ macro. This macro has two possible values:

    __ORDER_LITTLE_ENDIAN__: Vector elements use little-endian element ordering.
    __ORDER_BIG_ENDIAN__: Vector elements use big-endian element ordering.
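A minimal sketch of such a query follows. The enum and function names are hypothetical, and the fallback branch is an assumption for compilers that do not define __VEC_ELEMENT_REG_ORDER__ (for example, when building for a non-POWER target).

```c
/* Hypothetical result type for the query below. */
enum vec_order {
    VEC_ORDER_UNKNOWN = -1,  /* macro not provided by this compiler/target */
    VEC_ORDER_BIG = 0,
    VEC_ORDER_LITTLE = 1
};

/* Report the vector element ordering selected at compile time. */
static enum vec_order vec_element_order(void)
{
#if defined(__VEC_ELEMENT_REG_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    (__VEC_ELEMENT_REG_ORDER__ == __ORDER_LITTLE_ENDIAN__)
    return VEC_ORDER_LITTLE;
#elif defined(__VEC_ELEMENT_REG_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
    (__VEC_ELEMENT_REG_ORDER__ == __ORDER_BIG_ENDIAN__)
    return VEC_ORDER_BIG;
#else
    return VEC_ORDER_UNKNOWN;
#endif
}
```

Code that depends on explicit element numbers can branch on the result, or use the same preprocessor tests directly to select between code paths.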

Vector Built-in Functions

The Power language environments provide a well-known set of built-in functions for the Power SIMD instructions (including both Altivec/VMX and VSX). A full description of these built-in functions is beyond the scope of this ABI document. Most built-in functions are polymorphic, operating on a variety of vector types (vectors of signed characters, vectors of unsigned halfwords, and so forth).

Some of the Power SIMD (VMX/Altivec and/or VSX) hardware instructions refer, implicitly or explicitly, to vector element numbers. For example, the vspltb instruction has as one of its inputs an index into a vector. The element at that index position is to be replicated in every element of the output vector. For another example, the vmuleuh instruction operates on the even-numbered elements of its input vectors. The hardware instructions define these element numbers using big-endian element order, even when the machine is running in little-endian mode. Thus, a built-in function that maps directly to the underlying hardware instruction, regardless of the target endianness, has the potential to confuse programmers on little-endian platforms.

It is more useful to define built-in functions that map to these instructions to use natural element order. That is, the explicit or implicit element numbers specified by such built-in functions should be interpreted using big-endian element order on a big-endian platform, and using little-endian element order on a little-endian platform.

This ABI defines the following built-in functions to use natural element order. The Implementation Notes column suggests possible ways to implement little-endian (LE) versions of the built-in functions, although designers of a compiler are free to use other methods to implement the specified semantics as they see fit.
Endian-Sensitive Operations

Built-In Function | Corresponding POWER Instructions | Implementation Notes
vec_bperm | | For LE unsigned long long ARGs, swap halves of ARG2 and of the result.
vec_cntlz_lsbb | | For LE, use vctzlsbb.
vec_cnttz_lsbb | | For LE, use vclzlsbb.
vec_extract | None | vec_extract (v, 3) is equivalent to v[3].
vec_extract_fp32_from_shorth | | For LE, extract the left four elements.
vec_extract_fp32_from_shortl | | For LE, extract the right four elements.
vec_extract4b | | For LE, subtract the byte position from 12, and swap the halves of the result.
vec_first_match_index | | For LE, use vctz.
vec_first_match_index_or_eos | | For LE, use vctz.
vec_insert | None | vec_insert (x, v, 3) returns the vector v with element 3 modified to contain x.
vec_insert4b | | For LE, subtract the byte position from 12, and swap the halves of ARG2.
vec_mergee | vmrgew | Swap inputs and use vmrgow for LE. Phased in (see note).
vec_mergeh | vmrghb, vmrghh, vmrghw | Swap inputs and use vmrglb, and so on, for LE.
vec_mergel | vmrglb, vmrglh, vmrglw | Swap inputs and use vmrghb, and so on, for LE.
vec_mergeo | vmrgow | Swap inputs and use vmrgew for LE. Phased in (see note).
vec_mule | vmuleub, vmulesb, vmuleuh, vmulesh | Replace with vmuloub, and so on, for LE.
vec_mulo | vmuloub, vmulosb, vmulouh, vmulosh | Replace with vmuleub, and so on, for LE.
vec_pack | vpkuhum, vpkuwum | Swap input arguments for LE.
vec_packpx | vpkpx | Swap input arguments for LE.
vec_packs | vpkuhus, vpkshss, vpkuwus, vpkswss | Swap input arguments for LE.
vec_packsu | vpkuhus, vpkshus, vpkuwus, vpkswus | Swap input arguments for LE.
vec_perm | vperm | For LE, swap input arguments and complement the selection vector.
vec_splat | vspltb, vsplth, vspltw | Subtract the element number from N-1 for LE.
vec_sum2s | vsum2sws | For LE, swap elements 0 and 1, and elements 2 and 3, of the second input argument; then swap elements 0 and 1, and elements 2 and 3, of the result vector.
vec_sums | vsumsws | For LE, use element 3 in little-endian order from the second input vector, and place the result in element 3 in little-endian order of the result vector.
vec_unpackh | vupkhsb, vupkhpx, vupkhsh | Use vupklsb, and so on, for LE.
vec_unpackl | vupklsb, vupklpx, vupklsh | Use vupkhsb, and so on, for LE.
vec_xl_len_r | | For LE, the bytes are loaded left justified then shifted right 16-cnt bytes or rotated left cnt bytes. Let “cnt” be the number of bytes specified to be loaded by vec_xl_len_r.
vec_xst_len_r | | For LE, the bytes are shifted left 16-cnt bytes or rotated right cnt bytes so they are left justified to be stored. Let “cnt” be the number of bytes specified to be stored by vec_xst_len_r.

Note (phased in): This optional function is being phased in, and it may not be available on all implementations.
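The vec_perm note (swap the input arguments and complement the selection vector) can be checked with a byte-level model in portable C. This is a sketch under stated assumptions: hw_vperm models the hardware instruction with register bytes numbered from the left (big-endian order), vec_perm_le_reference states the natural little-endian semantics, and all function names are hypothetical.

```c
#include <stdint.h>

/* Model of hardware vperm. Arrays hold register bytes in left-to-right
 * (big-endian) order: index 0 is the high-order byte. */
static void hw_vperm(const uint8_t a[16], const uint8_t b[16],
                     const uint8_t c[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        int sel = c[i] & 0x1F;          /* low 5 bits index into a||b */
        out[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}

/* Natural little-endian semantics: arrays are indexed by little-endian
 * element number, which equals the byte offset in memory. */
static void vec_perm_le_reference(const uint8_t a[16], const uint8_t b[16],
                                  const uint8_t c[16], uint8_t out[16])
{
    for (int j = 0; j < 16; j++) {
        int sel = c[j] & 0x1F;
        out[j] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}

/* Little-endian element j lives at register position 15 - j. The mapping
 * is its own inverse, so one helper converts in both directions. */
static void flip16(const uint8_t in[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = in[15 - i];
}

/* LE vec_perm built from the hardware model, following the table note. */
static void vec_perm_le_via_hw(const uint8_t a[16], const uint8_t b[16],
                               const uint8_t c[16], uint8_t out[16])
{
    uint8_t ra[16], rb[16], rc[16], rout[16];
    flip16(a, ra);
    flip16(b, rb);
    flip16(c, rc);
    for (int i = 0; i < 16; i++)
        rc[i] = (uint8_t)~rc[i];        /* complement the selection vector */
    hw_vperm(rb, ra, rc, rout);         /* inputs are swapped */
    flip16(rout, out);
}
```

Complementing a 5-bit selector k yields 31 - k, which reverses the numbering of the 32-byte concatenation; swapping the inputs then puts each source vector back on the side where the reversed index expects it.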

Reminder: The assignment operator = is the preferred way to assign values from one vector data type to another vector data type in accordance with the C and C++ programming languages.

Extended Data Movement Functions

The built-in functions in the Altivec Memory Access Built-In Functions table map to Altivec/VMX load and store instructions and provide access to the “auto-aligning” memory instructions of the Altivec ISA, where low-order address bits are discarded before performing a memory access. These instructions load and store data in accordance with the program's current endian mode, and do not need to be adapted by the compiler to reflect little-endian operation during code generation.

Altivec Memory Access Built-In Functions

Built-In Function | Corresponding POWER Instructions | Implementation Notes
vec_ld | lvx | Hardware works as a function of endian mode.
vec_lde | lvebx, lvehx, lvewx | Hardware works as a function of endian mode.
vec_ldl | lvxl | Hardware works as a function of endian mode.
vec_st | stvx | Hardware works as a function of endian mode.
vec_ste | stvebx, stvehx, stvewx | Hardware works as a function of endian mode.
vec_stl | stvxl | Hardware works as a function of endian mode.
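The “auto-aligning” behavior can be illustrated with a one-line address computation. The sketch assumes the 16-byte quadword accesses of lvx and stvx, where the four low-order address bits are discarded; the helper name is hypothetical.

```c
#include <stdint.h>

/* Effective address actually accessed by an lvx/stvx-style instruction:
 * the four low-order bits are cleared, so the access is 16-byte aligned. */
static uintptr_t auto_aligned_address(uintptr_t ea)
{
    return ea & ~(uintptr_t)0xF;
}
```

A consequence is that vec_ld applied to a misaligned pointer silently reads the enclosing aligned quadword rather than the 16 bytes starting at the pointer.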

Previous versions of the Altivec built-in functions defined intrinsics to access the Altivec instructions lvsl and lvsr, which could be used in conjunction with vec_vperm and Altivec load and store instructions for unaligned access. The vec_lvsl and vec_lvsr interfaces are deprecated in accordance with the interfaces specified here. For compatibility, the built-in pseudo sequences published in previous VMX documents continue to work with little-endian data layout and the little-endian vector layout described in this document. However, the use of these sequences in new code is discouraged and usually results in worse performance. It is recommended (but not required) that compilers issue a warning when these functions are used in little-endian environments.

It is recommended that programmers use the assignment operator = or the vec_xl and vec_xst vector built-in functions to access unaligned data streams.

The set of extended mnemonics in the Optional Built-In Memory Access Functions table may be provided by some compilers and is not required by the Power SIMD programming interfaces. In particular, the assignment operator = will have the same effect of copying values between vector data types and provides a preferable method to assign values while giving the compiler more freedom to optimize data allocation. The only use for these functions is to support some coding patterns enabling big-endian vector layout code sequences in both big-endian and little-endian environments.

Memory access built-in functions that specify a vector element format (that is, the w4 and d2 forms) are deprecated. They will be phased out in future versions of this specification because vec_xl and vec_xst provide overloaded layout-specific memory access based on the specified vector data type.

Optional Built-In Memory Access Functions

Built-In Function | Corresponding POWER Instructions | Little-Endian Implementation Notes
vec_xl | lxvd2x | lxvd2x ; xxpermdi
vec_xlw4 (deprecated; see note) | lxvw4x | lxvd2x ; xxpermdi
vec_xld2 | lxvd2x | lxvd2x ; xxpermdi
vec_xst | stxvd2x | xxpermdi ; stxvd2x
vec_xstw4 | stxvw4x | xxpermdi ; stxvd2x
vec_xstd2 | stxvd2x | xxpermdi ; stxvd2x

Note (deprecated): The use of vector data type assignment and of the overloaded vec_xl and vec_xst vector built-in functions is the preferred form for assigning vector operations. Similarly, the use of __builtin_lxvd2x, __builtin_lxvw4x, __builtin_stxvd2x, and __builtin_stxvw4x, available in some compilers, is discouraged.
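The little-endian sequence lxvd2x ; xxpermdi from the table can be pictured with a doubleword-level model. This is a sketch only: the struct and function names are hypothetical, the register is modeled as two 64-bit halves with dw[0] as the high-order (left) half, and xxpermdi is modeled only in its doubleword-swapping form.

```c
#include <stdint.h>
#include <string.h>

/* Register model: dw[0] is the high-order (left) doubleword and dw[1] is
 * the low-order (right) doubleword of a 128-bit register. */
typedef struct { uint64_t dw[2]; } vsr_model;

/* Model of lxvd2x: the doubleword at the lower address goes to register
 * doubleword 0, the next one to register doubleword 1. */
static vsr_model model_lxvd2x(const void *mem)
{
    vsr_model r;
    memcpy(&r.dw[0], mem, 8);
    memcpy(&r.dw[1], (const unsigned char *)mem + 8, 8);
    return r;
}

/* Model of the xxpermdi form that swaps the two doublewords. */
static vsr_model model_swap_dw(vsr_model r)
{
    vsr_model s = { { r.dw[1], r.dw[0] } };
    return s;
}

/* vec_xl on LE as suggested in the table: lxvd2x ; xxpermdi. Afterwards the
 * doubleword at the lower address sits in the low-order half, making it
 * element 0 in little-endian element order. */
static vsr_model model_vec_xl_le(const void *mem)
{
    return model_swap_dw(model_lxvd2x(mem));
}
```

The store sequence xxpermdi ; stxvd2x is the mirror image: swap the doublewords first, then store them in register order.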

The two optional built-in vector functions in the Optional Fixed Data Layout Built-In Vector Functions table can be used to load and store vectors with a big-endian element ordering (that is, bytes from low to high memory will be loaded from left to right into a vector char variable), independent of the -qaltivec=be or -maltivec=be setting. For more information, see Big-Endian Vector Layout in Little-Endian Environments.

Optional Fixed Data Layout Built-In Vector Functions

Built-In Function | Corresponding POWER Instructions | Little-Endian Implementation Notes
vec_xl_be | lxvd2x | Use lxvd2x for vector long long, vector long, and vector double. Use lxvd2x followed by reversal of elements within each doubleword for all other data types.
vec_xst_be | stxvd2x | Use stxvd2x for vector long long, vector long, and vector double. Use stxvd2x following a reversal of elements within each doubleword for all other data types.
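For word-sized elements, the note “lxvd2x followed by reversal of elements within each doubleword” can be modeled as follows. The sketch assumes little-endian mode, models the register as four words numbered from the left, and uses hypothetical names throughout.

```c
#include <stdint.h>

/* Register model for "vector int": w[0] is the left-most (high-order) word. */
typedef struct { uint32_t w[4]; } vsr_words;

/* What lxvd2x leaves in the register in little-endian mode: each doubleword
 * is loaded as a little-endian 64-bit value, so the word at the lower
 * address ends up in the low-order (right) half of its doubleword. */
static vsr_words model_lxvd2x_le_words(const uint32_t mem[4])
{
    vsr_words r = { { mem[1], mem[0], mem[3], mem[2] } };
    return r;
}

/* Reverse the order of the two words within each doubleword. */
static vsr_words reverse_within_dw(vsr_words r)
{
    vsr_words s = { { r.w[1], r.w[0], r.w[3], r.w[2] } };
    return s;
}

/* Model of vec_xl_be for word elements: after the reversal, memory element 0
 * is the left-most word, which is big-endian element order. */
static vsr_words model_vec_xl_be_words(const uint32_t mem[4])
{
    return reverse_within_dw(model_lxvd2x_le_words(mem));
}
```

For doubleword elements no reversal is needed, because lxvd2x already places the doubleword at the lower address on the left.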

In addition to the hardware-specific vector built-in functions, implementations are expected to provide the interfaces listed in the Built-In Interfaces for Inserting and Extracting Elements from a Vector table.

Built-In Interfaces for Inserting and Extracting Elements from a Vector

Built-In Function | Implementation Notes
vec_extract | vec_extract (v, 3) is equivalent to v[3].
vec_insert | vec_insert (x, v, 3) returns the vector v with element 3 modified to contain x.
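The value semantics of these two interfaces can be sketched with an array-backed model. The type and function names are hypothetical, and the bounds check is an assumption of the model rather than a statement about how the real built-in functions treat out-of-range indexes.

```c
#include <assert.h>

/* Hypothetical model of a "vector int" with four elements. */
typedef struct { int e[4]; } vec_int_model;

/* Models vec_extract (v, i): equivalent to reading v[i]. */
static int model_extract(vec_int_model v, int i)
{
    assert(i >= 0 && i < 4);
    return v.e[i];
}

/* Models vec_insert (x, v, i): returns a copy of v with element i replaced.
 * The argument is passed by value, so the caller's vector is unchanged. */
static vec_int_model model_insert(int x, vec_int_model v, int i)
{
    assert(i >= 0 && i < 4);
    v.e[i] = x;
    return v;
}
```

Element numbers here follow the natural element order of the environment, so index 3 names the same array element on both big-endian and little-endian platforms.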

Environments may provide the optional built-in vector functions listed in the Optional Built-In Functions table to adjust for endian behavior by reversing the order of elements (reve) and bytes within elements (revb).

Optional Built-In Functions

Name | Description
vec_revb | Reverses the order of bytes within elements.
vec_reve | Reverses the order of elements.
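The effect of the two functions on a vector of four 32-bit elements can be sketched in portable C. The type and function names below are hypothetical models, not the built-in functions themselves.

```c
#include <stdint.h>

/* Hypothetical model of a vector of four 32-bit elements. */
typedef struct { uint32_t e[4]; } vec_u32;

/* Model of vec_reve: reverse the order of the elements. */
static vec_u32 model_reve(vec_u32 v)
{
    vec_u32 r = { { v.e[3], v.e[2], v.e[1], v.e[0] } };
    return r;
}

/* Model of vec_revb: reverse the order of the bytes within each element. */
static vec_u32 model_revb(vec_u32 v)
{
    vec_u32 r;
    for (int i = 0; i < 4; i++) {
        uint32_t x = v.e[i];
        r.e[i] = (x >> 24) | ((x >> 8) & 0x0000FF00u) |
                 ((x << 8) & 0x00FF0000u) | (x << 24);
    }
    return r;
}
```

Applying either function twice returns the original vector, which makes them convenient for converting data in both directions between layouts.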

Big-Endian Vector Layout in Little-Endian Environments

Because the vector layout and element numbering cannot be represented in source code in an endian-neutral manner, code originating from big-endian platforms may need to be compiled on little-endian platforms, or vice versa. To simplify such application porting, some compilers may provide an additional bridge mode.

Note that such support only works for homogeneous data being loaded into vector registers (that is, no unions or structs containing elements of different sizes) and when those vectors are loaded from and stored to memory with the element-size-specific built-in vector memory functions of the tables Altivec Built-In Vector Memory Access Functions (BE Layout in LE Mode) and Optional Built-In Memory Access Functions (BE Layout in LE Mode). That is because, in this mode, data within each element must be adjusted for little-endian data representation while providing a big-endian layout and numbering of vector elements within a vector.

Because of the internal contradiction of big-endian vector layouts and little-endian data, such an environment will have intrinsic limitations for the type of functionality that may be offered. However, it may provide a useful bridge in the porting of code using vector built-ins between environments having different data layout models.

Compiler designers may implement additional built-in functions or other mechanisms that use big-endian element ordering in little-endian mode. For example, the GCC and IBM XL compilers define the options -maltivec=be and -qaltivec=be, respectively, to allow programmers to specify that the built-ins will generate big-endian hardware instructions directly for the corresponding big-endian sequences in little-endian mode. To ensure consistent element operation in this mode, the lvx instructions and related instructions are changed to maintain a big-endian data layout in registers by adding appropriate permute sequences as shown in the Altivec Built-In Vector Memory Access Functions (BE Layout in LE Mode) table.

The selected vector element order is reflected in the __VEC_ELEMENT_REG_ORDER__ macro. See Vector Layout and Element Numbering.

Altivec Built-In Vector Memory Access Functions (BE Layout in LE Mode)

Built-In Function | Corresponding POWER Instructions | BE Vector Layout in Little-Endian Mode Implementation Notes
vec_ld | lvx | Reverse elements with a vperm after load for LE based on vector base type.
vec_lde | lvebx, lvehx, lvewx | Reverse elements with a vperm after load for LE based on vector base type.
vec_ldl | lvxl | Reverse elements with a vperm after load for LE based on vector base type.
vec_st | stvx | Reverse elements with a vperm before store for LE based on vector base type.
vec_ste | stvebx, stvehx, stvewx | Reverse elements with a vperm before store for LE based on vector base type.
vec_stl | stvxl | Reverse elements with a vperm before store for LE based on vector base type.

Access to memory instructions handling potentially unaligned accesses may be accomplished by using instructions (or instruction sequences) that perform a little-endian load of the underlying vector data type while maintaining big-endian element ordering. See the Optional Built-In Memory Access Functions (BE Layout in LE Mode) table.

Optional Built-In Memory Access Functions (BE Layout in LE Mode)

Built-In Function | Corresponding POWER Instructions | BE Vector Layout in Little-Endian Mode Implementation Notes
vec_xl | lxvd2x | Use lxvd2x for vector long long, vector long, and vector double.
vec_xlw4 (deprecated; see note) | lxvw4x | Use lxvw4x for vector int and vector float.
vec_xld2 | lxvd2x | Use lxvd2x, followed by reversal of elements within each doubleword, for all other data types.
vec_xst | stxvd2x | Use stxvd2x for vector long long, vector long, and vector double.
vec_xstw4 | stxvw4x | Use stxvw4x for vector int and vector float.
vec_xstd2 | stxvd2x | Use stxvd2x, following a reversal of elements within each doubleword, for all other data types.

Note (deprecated): The use of vector data type assignment and of the overloaded vec_xl and vec_xst vector built-in functions is the preferred form for assigning vector operations. Similarly, the use of __builtin_lxvd2x, __builtin_lxvw4x, __builtin_stxvd2x, and __builtin_stxvw4x, available in some compilers, is discouraged.

The use of -maltivec=be or -qaltivec=be in little-endian mode disables the transformations described in the Endian-Sensitive Operations table. The operation of the assignment operator is never changed by a setting such as -qaltivec=be or -maltivec=be.

Language-Specific Vector Support for Other Languages

Fortran

The Fortran Vector Data Types table shows the correspondence between the C/C++ types described in this document and their Fortran equivalents. In Fortran, the Boolean vector data types are represented by VECTOR(UNSIGNED(n)).

Because the Fortran language does not support pointers, vector built-in functions that expect pointers to a base type take an array element reference to indicate the address of a memory location that is the subject of a memory access built-in function.

Because the Fortran language does not support type casts, the vec_convert and vec_concat built-in functions shown in the Built-In Vector Conversion Function table are provided to perform bit-exact type conversions between vector types.

Built-In Vector Conversion Function

VEC_CONCAT (ARG1, ARG2) (Fortran), POWER ISA 3.0
    Purpose: Concatenates two elements to form a vector.
    Result value: The resulting vector consists of the two scalar elements, ARG1 and ARG2, assigned to elements 0 and 1 (using the environment’s native endian numbering), respectively.
    Note: This function corresponds to the C/C++ vector constructor (vector type){a,b}. It is provided only for languages without vector constructors.
    Prototypes (POWER ISA 3.0):
        vector signed long long vec_concat (signed long long, signed long long);
        vector unsigned long long vec_concat (unsigned long long, unsigned long long);
        vector double vec_concat (double, double);

VEC_CONVERT(V, MOLD)
    Purpose: Converts a vector to a vector of a given type.
    Class: Pure function
    Argument type and attributes:
        V: Must be an INTENT(IN) vector.
        MOLD: Must be an INTENT(IN) vector. If it is a variable, it need not be defined.
    Result type and attributes: The result is a vector of the same type as MOLD.
    Result value: The result is as if it were on the left-hand side of an intrinsic assignment with V on the right-hand side.
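For comparison, the bit-exact conversion that VEC_CONVERT provides corresponds in C to copying the bytes of one vector type into another without changing them. The sketch below models that with memcpy; the type and function names are hypothetical, and it assumes that float is a 32-bit type.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical models of "vector float" and "vector unsigned int". */
typedef struct { float e[4]; } vec_float_model;
typedef struct { uint32_t e[4]; } vec_uint_model;

/* The conversion is only meaningful when both types have the same size. */
_Static_assert(sizeof(vec_float_model) == sizeof(vec_uint_model),
               "vector models must have the same size");

/* Bit-exact conversion: the bytes are copied and no numeric conversion
 * takes place. */
static vec_uint_model model_convert_float_to_uint(vec_float_model v)
{
    vec_uint_model r;
    memcpy(&r, &v, sizeof r);
    return r;
}
```

On a platform with IEEE 754 single-precision floats, the element 1.0f therefore becomes the integer bit pattern 0x3F800000 rather than the integer 1.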

The Fortran Vector Data Types table gives a correspondence of Fortran and C/C++ language types.

Fortran Vector Data Types

XL Fortran Vector Type | XL C/C++ Vector Type
VECTOR(INTEGER(1)) | vector signed char
VECTOR(INTEGER(2)) | vector signed short
VECTOR(INTEGER(4)) | vector signed int
VECTOR(INTEGER(8)) | vector signed long long, vector signed long
VECTOR(INTEGER(16)) | vector signed __int128
VECTOR(UNSIGNED(1)) | vector unsigned char
VECTOR(UNSIGNED(2)) | vector unsigned short
VECTOR(UNSIGNED(4)) | vector unsigned int
VECTOR(UNSIGNED(8)) | vector unsigned long long, vector unsigned long
VECTOR(UNSIGNED(16)) | vector unsigned __int128
VECTOR(REAL(4)) | vector float
VECTOR(REAL(8)) | vector double
VECTOR(PIXEL) | vector pixel

Library Interfaces

printf and scanf of Vector Data Types

Support for vector variable input and output may be provided as an extension to the following POSIX library functions for the new vector conversion format strings:

    scanf, fscanf, sscanf, wsscanf
    printf, fprintf, sprintf, snprintf, wsprintf
    vprintf, vfprintf, vsprintf, vwsprintf

(One sample implementation for such an extended specification is libvecprintf.)

The size formatters are as follows:

    vl or lv: consumes one argument and modifies an existing integer conversion, resulting in vector signed int, vector unsigned int, or vector bool for output conversions or vector signed int * or vector unsigned int * for input conversions. The data is then treated as a series of four 4-byte components, with the subsequent conversion format applied to each.

    vh or hv: consumes one argument and modifies an existing short integer conversion, resulting in vector signed short or vector unsigned short for output conversions or vector signed short * or vector unsigned short * for input conversions. The data is treated as a series of eight 2-byte components, with the subsequent conversion format applied to each.

    v: consumes one argument and modifies a 1-byte integer, 1-byte character, or 4-byte floating-point conversion. If the conversion is a floating-point conversion, the result is vector float for output conversion or vector float * for input conversion. The data is treated as a series of four 4-byte floating-point components with the subsequent conversion format applied to each. If the conversion is an integer or character conversion, the result is either vector signed char, vector unsigned char, or vector bool char for output conversion, or vector signed char * or vector unsigned char * for input conversions. The data is treated as a series of sixteen 1-byte components, with the subsequent conversion format applied to each.

    vv: consumes one argument and modifies an 8-byte floating-point conversion. If the conversion is a floating-point conversion, the result is vector double for output conversion or vector double * for input conversion. The data is treated as a series of two 8-byte floating-point components with the subsequent conversion format applied to each. Integer and byte conversions are not defined for the vv modifier.

As new vector types are defined, new format codes should be defined to support scanf and printf of those types.

Any conversion format that can be applied to the singular form of a vector-data type can be used with a vector form:

    The %d, %x, %X, %u, %i, and %o integer conversions can be applied with the %lv, %vl, %hv, %vh, and %v vector-length qualifiers.
    The %c character conversion can be applied with the %v vector length qualifier.
    The %a, %A, %e, %E, %f, %F, %g, and %G float conversions can be applied with the %v vector length qualifier.

For input conversions, an optional separator character can be specified excluding white space preceding the separator. If no separator is specified, the default separator is a space including white space characters preceding the separator, unless the conversion is c, in which case the default separator is null.

For output conversions, an optional separator character can be specified immediately preceding the vector size conversion. If no separator is specified, the default separator is a space unless the conversion is c, in which case the default separator is null.
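The output rule for the vl qualifier (four 4-byte components, each formatted with the scalar conversion and separated by the separator character) can be modeled with standard snprintf. This is a sketch under stated assumptions: format_vec_int is a hypothetical helper, it prints element 0 first, and it is not the implementation used by libvecprintf or any C library.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical model of an output conversion such as "%vld": the vector is
 * treated as four 4-byte components, each printed with %d, with the
 * separator written between components.
 * Returns the number of characters written (excluding the terminating
 * null), or -1 on error or if the buffer is too small. */
static int format_vec_int(char *buf, size_t size, const int32_t v[4], char sep)
{
    int total = 0;
    for (int i = 0; i < 4; i++) {
        int n;
        if (i == 0)
            n = snprintf(buf + total, size - (size_t)total, "%d", (int)v[i]);
        else
            n = snprintf(buf + total, size - (size_t)total, "%c%d",
                         sep, (int)v[i]);
        if (n < 0 || (size_t)(total + n) >= size)
            return -1;
        total += n;
    }
    return total;
}
```

With the default separator (a space), a vector holding 1, -2, 3, 4 is rendered as the eight characters "1 -2 3 4".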