Dealing with AVX and AVX512

AVX is a bit easier for PowerISA and the ELF V2 ABI. First, we have lots (64) of vector registers and a superscalar vector pipeline (it can execute two or more independent 128-bit vector operations concurrently). Second, the ELF V2 ABI was designed to pass and return larger aggregates in vector registers:
Up to 12 qualified vector arguments can be passed in v2–v13. A qualified vector argument corresponds to either a vector data type or a member of a homogeneous aggregate of multiple like data types passed in up to eight vector registers.
Homogeneous floating-point or vector aggregate return values that consist of up to eight registers with up to eight elements will be returned in floating-point or vector registers that correspond to the parameter registers that would be used if the return value type were the first input parameter to a function.
So the ABI allows for passing up to three structures, each representing a 512-bit vector, and returning such (512-bit) structures, all in VMX registers. This can be extended further by spilling parameters (beyond 12 x 128-bit vectors) to the parameter save area, but we should not need that, as most intrinsics only use 2 or 3 operands.
Vector registers not needed for parameter passing, along with an additional 8 volatile vector registers, are available for scratch and local variables. All can be used by the application without requiring register spill to the save area. So most intrinsic operations on 256- or 512-bit vectors can be held within existing PowerISA vector registers.
For larger functions that might use multiple AVX 256- or 512-bit intrinsics and, as a result, push beyond the 20 volatile vector registers, the compiler will just allocate non-volatile vector registers by allocating a stack frame and spilling non-volatile vector registers to the save area (as needed in the function prologue). This frees up to 64 vectors (32 x 256-bit or 16 x 512-bit structs) for code optimization.
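To make the parameter-passing budget concrete, here is a minimal sketch of a three-operand function whose arguments and return value are each a 512-bit homogeneous aggregate. The names vec128_sketch, vec512_sketch, and sketch_madd_512 are hypothetical and exist only for this illustration. The 128-bit element type uses the generic GCC vector_size (16) extension so the sketch compiles on any GCC or Clang target; whether the arguments actually travel in v2–v13 depends on the target ABI and the compiler.

```c
/* Hypothetical 128-bit vector of four floats, built with the generic
   GCC vector extension so the sketch is not tied to one target.  */
typedef float vec128_sketch __attribute__ ((vector_size (16)));

/* Hypothetical 512-bit value: a homogeneous aggregate of four
   128-bit vectors, within the ABI limit of eight members.  */
typedef struct
{
  vec128_sketch part[4];
} vec512_sketch;

/* Three 512-bit operands need 3 x 4 = 12 vector arguments, which
   matches the 12 vector parameter registers (v2-v13) described
   above.  The 512-bit result is a homogeneous aggregate as well.  */
static vec512_sketch
sketch_madd_512 (vec512_sketch a, vec512_sketch b, vec512_sketch c)
{
  vec512_sketch r;
  for (int i = 0; i < 4; i++)
    r.part[i] = a.part[i] * b.part[i] + c.part[i]; /* element-wise a*b+c */
  return r;
}
```

A fourth 512-bit operand would exceed the 12 vector parameter registers, so under the rules quoted above it would be expected to go through the parameter save area instead.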
Based on the specifics of our ISA and ABI, we will not use __vector_size__ (32) or (64) in the PowerPC implementation of the __m256 and __m512 types. Instead, we will typedef structs of 2 or 4 vector (__m128) fields. This allows efficient handling of these larger data types without requiring new GCC language extensions. In the end, we should use the same type names and definitions as the GCC X86 intrinsic headers where possible. Where that is not possible, we can define new typedefs that provide the best mapping to the underlying PowerISA hardware.
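The struct-based approach might look roughly like the sketch below. It is an illustration under stated assumptions, not the actual header contents: the names sketch_m128, sketch_m256, sketch_m512, and sketch_mm256_add_ps are hypothetical, and sketch_m128 stands in for the existing __m128 type so that the example compiles on any GCC or Clang target.

```c
/* Hypothetical stand-in for the existing 128-bit __m128 type.  */
typedef float sketch_m128 __attribute__ ((vector_size (16)));

/* 256-bit type as a struct of two 128-bit fields.  */
typedef struct
{
  sketch_m128 lo;
  sketch_m128 hi;
} sketch_m256;

/* 512-bit type as a struct of four 128-bit fields.  */
typedef struct
{
  sketch_m128 q0;
  sketch_m128 q1;
  sketch_m128 q2;
  sketch_m128 q3;
} sketch_m512;

/* The structs carry no padding, so sizes match the x86 types.  */
_Static_assert (sizeof (sketch_m256) == 32, "256-bit struct is 32 bytes");
_Static_assert (sizeof (sketch_m512) == 64, "512-bit struct is 64 bytes");

/* A 256-bit add expressed as two independent 128-bit adds, one per
   field.  The name is modeled on _mm256_add_ps but is hypothetical.  */
static inline sketch_m256
sketch_mm256_add_ps (sketch_m256 a, sketch_m256 b)
{
  sketch_m256 r;
  r.lo = a.lo + b.lo;
  r.hi = a.hi + b.hi;
  return r;
}
```

Because the two half-width adds do not depend on each other, a superscalar vector pipeline is free to issue them concurrently, which is part of the reason the struct representation costs little compared with a native 256-bit type.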