Dealing with AVX and AVX512
  
  AVX is a bit easier for PowerISA and the ELF V2 ABI. First, we have 
  lots (64) of vector registers and a superscalar vector pipeline (it can execute 
  two or more independent 128-bit vector operations concurrently). Second, the ELF 
  V2 ABI was designed to pass and return larger aggregates in vector 
  registers:
  
    
      Up to 12 qualified vector arguments can be passed in 
      v2–v13.
    
    
      A qualified vector argument corresponds to:
        
          
            A vector data type
          
          
            A member of a homogeneous aggregate of multiple like data types 
            passed in up to eight vector registers.
          
          
            Homogeneous floating-point or vector aggregate return values 
            that consist of up to eight registers with up to eight elements will 
            be returned in floating-point or vector registers that correspond to 
            the parameter registers that would be used if the return value type 
            were the first input parameter to a function.
          
        
      
    
  
  So the ABI allows passing up to three structures, each 
  representing a 512-bit vector, and returning such (512-bit) structures, all in VMX 
  registers. This can be extended further by spilling parameters (beyond 12 x 
  128-bit vectors) to the parameter save area, but we should not need that, as 
  most intrinsics use only 2 or 3 operands. Vector registers not needed for 
  parameter passing, along with an additional 8 volatile vector registers, are 
  available for scratch and local variables. All can be used by the application 
  without requiring register spill to the save area. So most intrinsic operations 
  on 256- or 512-bit vectors can be held within existing PowerISA vector 
  registers.
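  The register budget described above can be sketched in portable GNU C. 
  This is only an illustration under stated assumptions: the names 
  sketch_v4sf, sketch_v512 and sketch_madd512 are hypothetical and do not 
  come from any existing header, and the generic __vector_size__ (16) type 
  stands in for a 128-bit PowerISA vector. On an ELF V2 target, the three 
  operands below would be expected to occupy 3 x 4 = 12 vector argument 
  registers, with the result returned in vector registers as well.

```c
#include <stddef.h>

/* Hypothetical 128-bit vector type, used here as a stand-in for a
   PowerISA vector of four floats. */
typedef float sketch_v4sf __attribute__((__vector_size__(16)));

/* One 512-bit value modeled as a homogeneous aggregate of four
   128-bit vectors (hypothetical name, for illustration only). */
typedef struct {
    sketch_v4sf part[4];
} sketch_v512;

_Static_assert(sizeof(sketch_v512) == 64,
               "four 128-bit vectors make one 512-bit value");

/* Three 512-bit operands: 3 * 4 = 12 vector members in total, which
   matches the 12 vector argument registers mentioned in the text. */
sketch_v512 sketch_madd512(sketch_v512 a, sketch_v512 b, sketch_v512 c)
{
    sketch_v512 r;
    for (size_t i = 0; i < 4; i++)
        r.part[i] = a.part[i] * b.part[i] + c.part[i];  /* element-wise a*b+c */
    return r;
}
```

  Whether the operands actually stay in registers depends on the target 
  ABI and the compiler; on other platforms the same source still compiles, 
  but the aggregates may be passed in memory.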
  For larger functions that might use multiple 256- or 512-bit AVX 
  intrinsics and, as a result, push beyond the 20 volatile vector registers, the 
  compiler will simply allocate a stack frame and spill non-volatile vector 
  registers to the save area (as needed, in the function prologue) so that it can 
  use them as well. This frees up to 64 vectors (32 x 256-bit or 16 x 
  512-bit structs) for code optimization. 
  Based on the specifics of our ISA and ABI, we will not use 
  __vector_size__ (32) or (64) in the PowerPC implementation of the 
  __m256 and __m512 
  types. Instead we will typedef structs of 2 or 4 vector (__m128) fields. This 
  allows efficient handling of these larger data types without requiring new GCC 
  language extensions. 
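  A minimal sketch of what such a typedef might look like follows. It is 
  not the actual header definition: sketch_m128, sketch_m256 and 
  sketch_mm256_add_ps are hypothetical names, the lo/hi field layout is an 
  assumption, and a generic __vector_size__ (16) type is used in place of 
  the real __m128 so that the example compiles on any GCC-compatible target.

```c
/* Hypothetical stand-in for __m128: four packed floats in 128 bits. */
typedef float sketch_m128 __attribute__((__vector_size__(16)));

/* Hypothetical 256-bit type built from two 128-bit fields, instead of
   a single __vector_size__ (32) type. The lo/hi split is an assumption
   made for this sketch. */
typedef struct {
    sketch_m128 lo;  /* assumed to hold elements 0..3 */
    sketch_m128 hi;  /* assumed to hold elements 4..7 */
} sketch_m256;

/* A 256-bit add expressed as two independent 128-bit adds, which a
   superscalar vector pipeline may be able to execute concurrently. */
static inline sketch_m256 sketch_mm256_add_ps(sketch_m256 a, sketch_m256 b)
{
    sketch_m256 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi;
    return r;
}
```

  A 512-bit type would follow the same pattern with four fields. Because 
  the two halves are independent, each 128-bit operation maps onto an 
  existing vector instruction with no new language extension involved.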
  In the end we should use the same type names and definitions as the 
  GCC X86 intrinsic headers where possible. Where that is not possible we can 
  define new typedefs that provide the best mapping to the underlying PowerISA 
  hardware.