GCC Vector Extensions
The GCC vector extensions share a common syntax but are implemented in a
target-specific way. Using the C vector extensions requires the __gnu_inline__
attribute to avoid syntax errors in case the user specified C standard
compliance (-std=c90, -std=c11, etc.) that would normally disallow such
extensions.
The GCC implementation for PowerPC64 Little Endian is (mostly)
functionally compatible with x86_64 vector extension usage. We can use the same
type definitions (at least for vector_size (16)), operations, syntax {...}
for vector initializers and constants, and array syntax [] for vector element
access. So simple arithmetic / logical operations on whole vectors should work
as is.
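As a minimal sketch of these shared features, the following defines two 16-byte vector types and exercises whole-vector arithmetic, a logical operation, a {...} initializer, and [] element access. The type and function names (v4sf_t, v4si_t, add_v4sf, and so on) are hypothetical and chosen for illustration; they are not taken from any GCC header.

```c
/* Hypothetical type names; the GCC intrinsic headers use names such as
   __v4sf and __v4si for the same kind of definition. */
typedef float v4sf_t __attribute__ ((__vector_size__ (16)));
typedef int   v4si_t __attribute__ ((__vector_size__ (16)));

/* Whole-vector arithmetic: one element-wise add, no explicit loop. */
static inline v4sf_t
add_v4sf (v4sf_t a, v4sf_t b)
{
  return a + b;
}

/* Whole-vector logical operation on integer elements. */
static inline v4si_t
and_v4si (v4si_t a, v4si_t b)
{
  return a & b;
}

/* {...} initializer: the first value listed becomes element [0]. */
static inline v4sf_t
make_v4sf (float e0, float e1, float e2, float e3)
{
  v4sf_t v = { e0, e1, e2, e3 };
  return v;
}

/* [] element access, here with a run-time index. */
static inline float
get_v4sf (v4sf_t v, int i)
{
  return v[i];
}
```

The same source should compile unchanged with GCC on x86_64 and on PowerPC64 Little Endian; only the generated instructions differ.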
The caveat is that the interface data type of the Intel Intrinsic may
not match the data types of the operation, so it may be necessary to cast the
operands to the specific type for the operation. This also applies to vector
initializers and accessing vector elements. You need to use the appropriate
type to get the expected results. Of course this applies to X86_64 as well. For
example:
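A minimal sketch of a packed and a scalar float add written this way follows. It assumes stand-in types m128_t and v4sf_t in place of __m128 and __v4sf, and hypothetical function names sketch_add_ps and sketch_add_ss, so that it compiles without any intrinsic header; it is modeled on the style of the GCC headers, not copied from them.

```c
/* Hypothetical stand-ins for the interface type (__m128) and the
   implementation type (__v4sf) of the intrinsic headers. */
typedef float m128_t __attribute__ ((__vector_size__ (16), __may_alias__));
typedef float v4sf_t __attribute__ ((__vector_size__ (16)));

/* Packed add: cast the operands to the implementation type, add all four
   floats element-wise, then cast the result back to the interface type. */
static inline m128_t
sketch_add_ps (m128_t a, m128_t b)
{
  return (m128_t) ((v4sf_t) a + (v4sf_t) b);
}

/* Scalar add: only element [0] is summed; elements [1] to [3] are
   passed through unchanged from the first operand. */
static inline m128_t
sketch_add_ss (m128_t a, m128_t b)
{
  v4sf_t r = (v4sf_t) a;
  r[0] = ((v4sf_t) a)[0] + ((v4sf_t) b)[0];
  return (m128_t) r;
}
```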
Note the cast from the interface type (__m128) to the implementation
type (__v4sf, defined in the intrinsic header) for the vector float add (+)
operation. This is enough for the compiler to select the appropriate vector add
instruction for the float type. Then the result (which is __v4sf) needs to be
cast back to the expected interface type (__m128).
Note also the use of array syntax ((__v4sf)__A)[0] to extract the lowest
(left most) element of a vector. The cast (__v4sf) ensures that the compiler
knows we are extracting the left most 32-bit float. The compiler ensures the
code generated matches the Intel behavior for PowerPC64 Little Endian.
Note: Here we are using logical left and logical right, which will not match
the PowerISA register view in Little Endian. Logical left is the left most
element for initializers {left, … , right}, storage order, and array order,
where the left most element is [0].
The code generation is complicated by the fact that PowerISA vector
registers are Big Endian (element 0 is the left most word of the vector) and
scalar loads / stores are also to / from the left most word / dword.
X86 scalar loads / stores are to / from the right most element for the
XMM vector register.
The PowerPC64 ELF V2 ABI mimics the X86 Little Endian behavior by placing
logical element [0] in the right most element of the vector register.
This may require the compiler to generate additional instructions
to place the scalar value in the expected position.
Application code with extensive use of scalar (vs packed) intrinsic loads /
stores should be flagged for rewrite to C code using existing scalar
types (float, double, int, long, etc.). The compiler may be able to
vectorize this scalar code using the native vector SIMD instruction set.
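As an illustration of such a rewrite, here is a minimal sketch of an element-wise float add written with plain scalar types. The function name add_floats and its signature are hypothetical. The restrict qualifiers are an assumption that the arrays do not overlap, which makes it easier for the compiler to vectorize the loop; whether it actually does depends on the optimization level (auto-vectorization is normally enabled at -O3) and on the target.

```c
#include <stddef.h>

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n).
   No intrinsics and no vector types are used; the loop is left for the
   compiler to vectorize with the native SIMD instruction set. */
void
add_floats (float *restrict dst, const float *restrict a,
            const float *restrict b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = a[i] + b[i];
}
```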
Another example is the set reverse order:
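A minimal sketch of a set-reverse-order operation follows, again assuming stand-in types m128_t and v4sf_t in place of __m128 and __v4sf and a hypothetical function name sketch_setr_ps. The first argument lands in element [0], the last in element [3].

```c
/* Hypothetical stand-ins for __m128 and __v4sf. */
typedef float m128_t __attribute__ ((__vector_size__ (16), __may_alias__));
typedef float v4sf_t __attribute__ ((__vector_size__ (16)));

/* Collect four scalars into a vector with a {...} initializer
   (a compound literal of the implementation type), then cast the
   result to the interface type. */
static inline m128_t
sketch_setr_ps (float z, float y, float x, float w)
{
  return (m128_t) (v4sf_t) { z, y, x, w };
}
```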
Note the use of initializer syntax to collect a set of scalars
into a vector. Code with constant initializer values will generate a vector
constant of the appropriate endianness. However, code with variables in the
initializer can get complicated as it often requires transfers between register
sets and perhaps format conversions. We can assume that the compiler will
generate the correct code, but if this class of intrinsics shows up as a hot spot,
a rewrite to native PPC vector built-ins may be appropriate. For example, an
initializer that replicates one variable to all the vector fields might not be
recognized as a “load and splat”, and making this explicit may help the
compiler generate better code.
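One way to make the splat explicit is sketched below, under stated assumptions. On PowerPC64 with AltiVec/VSX enabled it uses the vec_splats built-in from <altivec.h>; on any other target it falls back to the initializer form so the listing still compiles. The function name splat_float and the type name v4sf_t are hypothetical, and whether the explicit form actually produces better code should be checked against the generated assembly.

```c
#if defined (__powerpc64__) && defined (__ALTIVEC__)
#include <altivec.h>
#endif

/* Hypothetical stand-in for the __v4sf implementation type. */
typedef float v4sf_t __attribute__ ((__vector_size__ (16)));

/* Replicate one float into all four elements of a vector. */
static inline v4sf_t
splat_float (float x)
{
#if defined (__powerpc64__) && defined (__ALTIVEC__)
  /* Explicit splat through the native PPC vector built-in. */
  return (v4sf_t) vec_splats (x);
#else
  /* Portable fallback: the initializer form, which relies on the
     compiler to recognize the replicated variable as a splat. */
  return (v4sf_t) { x, x, x, x };
#endif
}
```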