Introduction to Vector Programming on Power
A Brief History The history of vector programming on Power processors begins with the AIM (Apple, IBM, Motorola) alliance in the 1990s. The AIM partners developed the Power Vector Media Extension (VMX) to accelerate multimedia applications, particularly image processing. VMX is the name still used by IBM for this instruction set. Freescale (formerly Motorola) used the trademark "AltiVec," while Apple at one time called it "Velocity Engine." While VMX remains the most official name, the term AltiVec is still in common use today. Freescale's AltiVec Technology Programming Interface Manual (the "AltiVec PIM") is still available online for reference (see ). The original VMX specification provided for thirty-two 128-bit vector registers (VRs). Each register can be treated as containing sixteen 8-bit character values, eight 16-bit short integer values, four 32-bit integer values, or four 32-bit single-precision floating-point values (the "VMX data types"). Furthermore, the integer data types have signed, unsigned, and boolean variants. An extensive set of arithmetic, logical, comparison, conversion, memory access, and permute class operations were specified to operate on these registers. The AltiVec PIM documents intrinsic functions to be used by programmers to access the VMX instruction set. Because similar operations are provided for all the VMX data types, the PIM provides for overloaded intrinsics that can operate on different data types. However, such function overloading is not normally acceptable in the C programming language, so compilers compliant with the AltiVec PIM (such as GCC and Clang) were required to add special handling to their parsers to permit this. The PIM suggested (but did not mandate) the use of a header file, <altivec.h>, for implementations that provide AltiVec intrinsics. This is common practice for all compliant compilers today. The first chips incorporating the VMX instruction set were introduced by Freescale in 1999, and used primarly in Apple desktop computers. IBM's last desktop CPU (the PowerPC 970) also included AltiVec support, and was used in the Apple PowerMac G5. IBM initially omitted support for VMX from its server-class computers, but added support for it in the POWER6 server family. IBM extended VMX by introducing the Vector-Scalar Extension (VSX) for the POWER7 family of processors. VSX adds sixty-four 128-bit vector-scalar registers (VSRs); however, to optimize the amount of per-process register state, the registers overlap with the VRs and the scalar floating-point registers (FPRs) (see ). The VSRs can represent all the data types representable by the VRs, and can also be treated as containing two 64-bit integers or two 64-bit double-precision floating-point values. However, ISA support for two 64-bit integers in VSRs was limited until Version 2.07 (POWER8) of the Power ISA, and only the VRs are supported for these instructions. Both the VMX and VSX instruction sets have been expanded for the POWER8 and POWER9 processor families. Starting with POWER8, a VSR can now contain a single 128-bit integer; and starting with POWER9, a VSR can contain a single 128-bit IEEE floating-point value. Again, the ISA currently only supports 128-bit operations on values in the VRs. The VMX and VSX instruction sets together may be referred to as the Power SIMD (single-instruction, multiple-data) instructions.
Little-Endian Linux The Power architecture has supported operation in either big-endian (BE) or little-endian (LE) mode from the beginning. However, IBM's Power servers were only shipped with big-endian operating systems (AIX, Linux, i5/OS) prior to the introduction of POWER8. With POWER8, IBM began supporting little-endian Linux distributions for the first time, and introduced a new application binary interface (the 64-Bit ELFv2 ABI Specification ) that can be used for either big- or little-endian support. In practice, the ELFv2 ABI is currently used only for little-endian Linux. Although Power has always supported big- and little-endian memory accesses, the introduction of vector register support added a layer of complexity to programming for processors operating in different endian modes. Arrays of elements loaded into a VR or VSR will be indexed from left to right in the register in big-endian mode, but will be indexed from right to left in the register in little-endian mode. However, the VMX and VSX instructions originally assumed that elements will always be indexed from left to right in the register. This is an inconvenience that needs to be hidden from the application programmer wherever possible. To this end, IBM developed a bi-endian vector programming model (see ). The intrinsic functions provided for the bi-endian vector programming model are described in .
The Unified Vector Register Set In OpenPOWER-compliant processors, floating-point and vector operations are implemented using a unified vector-scalar model. As shown in and , there are 64 vector-scalar registers; each is 128 bits wide.
Floating-Point Registers as Part of VSRs
Vector Registers as Part of VSRs
The vector-scalar registers can be addressed with VSX instructions, for vector and scalar processing of all 64 registers, or with the "classic" Power floating-point instructions to refer to a 32-register subset of these, having 64 bits per register. They can also be addressed with VMX instructions to refer to a 32-register subset of 128-bit registers.
Where to Report Bugs This reference provides guidance on using vector intrinsics that are supported by all compatible compilers. If you find a problem when using one of the intrinsics with a compatible compiler, please report a bug! Bug reporting procedures differ depending on which compiler you're using. GCC. The reporting procedure for bugs against the GNU Compiler Collection is described at https://gcc.gnu.org/bugs/. The GCC bugzilla tracker is located at https://gcc.gnu.org/bugzilla/. Clang/LLVM. The reporting procedure for bugs against the Clang compiler is described at https://llvm.org/docs/HowToSubmitABug.html. The LLVM bug tracking system is located at https://bugs.llvm.org/enter_bug.cgi. The XL compilers. For XL compilers provided with the Linux Community Edition, you can provide feedback to the XL compiler team via email (compinfo@cn.ibm.com); for other editions of XL compilers, please open a Case.
Useful Links The following documents provide additional reference materials. 64-Bit ELF V2 ABI Specification - Power Architecture. https://openpowerfoundation.org/?resource_lib=64-bit-elf-v2-abi-specification-power-architecture AltiVec Technology Program Interface Manual. https://www.nxp.com/docs/en/reference-manual/ALTIVECPIM.pdf Intel Architecture Instruction Set Extensions and Future Features Programming Reference. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf Power Instruction Set Architecture, Version 3.0B Specification. https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0 POWER8 Processor User's Manual for the Single-Chip Module. https://ibm.ent.box.com/s/649rlau0zjcc0yrulqf4cgx5wk3pgbfk POWER9 Processor User's Manual. https://ibm.ent.box.com/s/tmklq90ze7aj8f4n32er1mu3sy9u8k3k Power Vector Library. https://github.com/open-power-sdk/pveclib POWER8 In-Core Cryptography: The Unofficial Guide. https://github.com/noloader/POWER8-crypto/blob/master/power8-crypto.pdf Using the GNU Compiler Collection. https://gcc.gnu.org/onlinedocs/gcc.pdf GCC's Assembler Syntax. Felix Cloutier. https://www.felixcloutier.com/documents/gcc-asm.html
Conformance to this Specification Vector programs on OpenPOWER systems should follow the guide and best practices for vector programming as outlined in and in . Compliant compilers on OpenPOWER systems should provide suitable support for intrinsic functions, preferably as built-in vector functions that translate to one or more Power ISA instructions as described in and in . Compliant compilers targeting a supported ISA level (2.7 or 3.0, for example) should provide support for all intrinsic functions valid for that ISA level, except where an intrinsic function is marked as phased in, deferred, or deprecated.