Packed vs scalar intrinsics

So what is actually going on here? The vector code is clear enough if you know that the '+' operator is applied to each vector element. The intent of the X86 built-in is a little less clear, as the GCC documentation for __builtin_ia32_addsd is not very helpful (it is essentially nonexistent). So perhaps the Intel Intrinsic Guide will be more enlightening. To paraphrase:
From the _mm_add_pd description: for each double float element ([0] and [1], or bits [63:0] and [127:64]) of operands a and b, the sum is computed and the resulting vector is returned. From the _mm_add_sd description: add element 0 of the first operand (a[0]) to element 0 of the second operand (b[0]) and return the packed vector double {(a[0] + b[0]), a[1]}. Or said differently: the sum of the logical left-most halves of the operands is returned in the logical left-most half (element [0]) of the result, along with the logical right half (element [1]) of the first operand (unchanged) in the logical right half of the result.
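For example, a small hypothetical usage (on x86, via emmintrin.h) illustrating both semantics, with arbitrarily chosen values:

#include <emmintrin.h>

__m128d a = _mm_setr_pd (1.0, 2.0);   /* a[0] = 1.0,  a[1] = 2.0  */
__m128d b = _mm_setr_pd (10.0, 20.0); /* b[0] = 10.0, b[1] = 20.0 */
__m128d p = _mm_add_pd (a, b);        /* p = {11.0, 22.0}  */
__m128d s = _mm_add_sd (a, b);        /* s = {11.0, 2.0}; only element [0]
                                         is summed, a[1] passes through  */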
So the packed double is easy enough but the scalar double details are more complicated. One source of complication is that while both Instruction Set Architectures (SSE vs VSX) support scalar floating point operations in vector registers, the semantics are different. The vector bit and field numbering is different (reversed).

For Intel the scalar is always placed in the low order (right most) bits of the XMM register (and the low order address for load and store).

For PowerISA and VSX, scalar floating point operations and Floating Point Registers (FPRs) are in the low numbered bits, which is the left hand side of the vector / scalar register (VSR). For the PowerPC64 ELF V2 little endian ABI we also make a point of making the GCC vector extensions and vector built-ins appear to be little endian. So vector element 0 corresponds to the low order address and low order (right hand) bits of the vector register (VSR).

The handling of the non-scalar part of the register for scalar operations is also different. For the Intel ISA the scalar operations either leave the high order part of the XMM vector unchanged or in some cases force it to 0.0. For PowerISA, scalar operations on the combined FPR/VSR register leave the remainder (right half of the VSR) undefined.

To minimize confusion and use consistent nomenclature, I will try to use the terms logical left and logical right elements based on the order they appear in C vector initializers and element index order. So in the vector (__v2df){1.0, 2.0}, the value 1.0 is in the logical left element [0] and the value 2.0 is in the logical right element [1].

So let's look at how to implement these intrinsics for the PowerISA. For example, in this case we can use the GCC vector extension, like so:
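A minimal sketch of such implementations (the extern inline boilerplate follows the style of GCC's own intrinsic headers; the typedefs are the usual vector types):

typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_pd (__m128d __A, __m128d __B)
{
  /* The GCC vector extension applies '+' element-wise.  */
  return (__m128d) ((__v2df) __A + (__v2df) __B);
}

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_sd (__m128d __A, __m128d __B)
{
  /* Update only element [0]; __A[1] passes through unchanged.  */
  __A[0] = __A[0] + __B[0];
  return (__A);
}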
The packed double implementation operates on the vector as a whole.
The scalar double implementation operates on and updates only element [0] of the vector and leaves the __A[1] element unchanged.
From this source the GCC compiler generates the following code for the PPC64LE target. The packed vector double generated the corresponding VSX vector double add (xvadddp), but the scalar implementation is a bit more complicated:

0000000000000720 :
 720:	07 1b 42 f0 	xvadddp vs34,vs34,vs35
	...
0000000000000740 :
 740:	56 13 02 f0 	xxspltd vs0,vs34,1
 744:	57 1b 63 f0 	xxspltd vs35,vs35,1
 748:	03 19 60 f0 	xsadddp vs35,vs0,vs35
 74c:	57 18 42 f0 	xxmrghd vs34,vs34,vs35
	...

First, in the PPC64LE vector format, element [0] is not in the correct
position for the scalar operations. So the compiler generates vector splat
double (xxspltd) instructions to copy elements __A[0] and
__B[0] into position
for the VSX scalar add double (xsadddp) that follows. However the VSX scalar
operation leaves the other half of the VSR undefined (which does not match the
expected Intel semantics). So the compiler must generate a vector merge high double (xxmrghd) instruction to combine the original __A[1] element (from vs34) with the scalar add result from the vs35 element [1]. This merge swings the scalar result from the vs35[1] element into the vs34[0] position, while preserving the original vs34[1] (from __A[1]) element (copied to itself).

Fun
fact: The vector registers in PowerISA are decidedly Big Endian. But we decided
to make the PPC64LE ABI behave like a Little Endian system to make application
porting easier. This requires the compiler to manipulate the PowerISA vector intrinsics behind the scenes to get the correct Little Endian results. For example, the element selector [0|1] for vec_splat and the generation of vec_mergeh vs vec_mergel are reversed for Little Endian.

This technique applies to packed and scalar intrinsics for the usual arithmetic operators (add, subtract, multiply, divide). Using GCC vector
extensions in these intrinsic implementations provides the compiler more
opportunity to optimize the whole function.

Now we can look at a slightly more interesting (complicated) case.
Square root (sqrt) is not an arithmetic operator in C and is usually handled
with a library call or a compiler builtin. We really want to avoid a library
call and want to avoid any unexpected side effects. As you see below, the x86 implementations of the _mm_sqrt_pd and _mm_sqrt_sd intrinsics are based on GCC x86 built-ins.
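For reference, a sketch in the style of GCC's x86 emmintrin.h (illustrative; the exact built-ins used vary across GCC versions):

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_pd (__m128d __A)
{
  return (__m128d) __builtin_ia32_sqrtpd ((__v2df) __A);
}

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_sd (__m128d __A, __m128d __B)
{
  /* sqrtsd computes the sqrt of element [0] of its operand; the
     extra __A parameter supplies element [1] of the result.  */
  __v2df __z = __builtin_ia32_sqrtsd ((__v2df) __B);
  return (__m128d) { __z[0], __A[1] };
}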
For the packed vector sqrt, the PowerISA VSX has an equivalent vector
double square root instruction and GCC provides the vec_sqrt builtin. But the
scalar implementation involves an additional parameter and an extra move.
This seems intended to mimic the propagation of the __A[1] input to the logical right half of the XMM result that we saw with _mm_add_sd above.

The instinct is to extract the low scalar (__B[0])
from operand __B
and pass this to the GCC __builtin_sqrt () before recombining that scalar
result with __A[1] for the vector result. Unfortunately C language standards
force the compiler to call the libm sqrt function unless -ffast-math is
specified. The -ffast-math option is not commonly used and we want to avoid the
external library dependency for what should be only a few inline instructions.
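A sketch of this rejected approach (hypothetical code, reusing the types from the earlier sketch):

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_sd (__m128d __A, __m128d __B)
{
  /* Without -ffast-math the compiler must assume sqrt () can set
     errno, so this generates a call to the libm sqrt function.  */
  return (__m128d) { __builtin_sqrt (__B[0]), __A[1] };
}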
So this is not a good option.

Thinking outside the box: we do have an inline intrinsic for a (packed) vector double sqrt that we just implemented. However we need to ensure the other half of __B (__B[1]) does not cause any harmful side effects (like raising exceptions for NaN or negative values). The simplest solution is to vector splat __B[0] to both halves of a temporary value before taking the vec_sqrt. Then this result can be combined with __A[1] to return the final result. For example:
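A minimal sketch of this approach (assuming vec_sqrt from altivec.h and that the _mm_set1_pd and _mm_setr_pd intrinsics are already available):

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_pd (__m128d __A)
{
  /* VSX provides a full vector double square root.  */
  return (__m128d) vec_sqrt ((__v2df) __A);
}

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_sd (__m128d __A, __m128d __B)
{
  __v2df c;
  /* Splat __B[0] to both halves so __B[1] cannot raise spurious
     exceptions, then take the packed sqrt.  */
  c = (__v2df) _mm_sqrt_pd (_mm_set1_pd (__B[0]));
  /* Combine the scalar sqrt result with the original __A[1].  */
  return (__m128d) _mm_setr_pd (c[0], __A[1]);
}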
In this example we use _mm_set1_pd to splat the scalar __B[0], before passing that vector to our _mm_sqrt_pd implementation, then pass the sqrt result (c[0]) with __A[1] to _mm_setr_pd to combine the final result. You could also use the {c[0], __A[1]} initializer instead of _mm_setr_pd.

Now we can look at vector and scalar compares that add their own
complications: For example, the Intel Intrinsic Guide for
_mm_cmpeq_pd
describes comparing double elements [0|1] and returning either all 0s for not equal or all 1s (0xFFFFFFFFFFFFFFFF
or long long -1) for equal. The comparison result is intended as a select mask
(predicates) for selecting or ignoring specific elements in later operations.
The scalar version
_mm_cmpeq_sd
is similar except for the quirk
of only comparing element [0] and combining the result with __A[1] to return
the final vector result.

The packed vector implementation for PowerISA is simple, as VSX
provides the equivalent instruction and GCC provides the builtin
vec_cmpeq supporting the vector double type.
However the technique of using scalar comparison
operators on the __A[0] and __B[0]
does not work as the C comparison operators
return 0 or 1 results while we need the vector select mask (effectively 0 or
-1). Also we need to watch for sequences that mix scalar floats and integers,
generating if/then/else logic or requiring expensive transfers across register
banks.

In this case we are better off using explicit vector built-ins, following the _mm_add_sd and _mm_sqrt_sd implementations above as examples.
We can splat element [0] of each operand to temporaries, where we can safely use vec_cmpeq to generate the expected selector mask. Note that vec_cmpeq returns a bool long type, so we need to cast the result back to __v2df. Then we use the (__m128d){c[0], __A[1]} initializer to combine the comparison result with the original __A[1] input and cast to the required interface type. So we have this example:
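A sketch of these implementations (assuming altivec.h's vec_cmpeq and vec_splats built-ins support the vector double type):

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpeq_pd (__m128d __A, __m128d __B)
{
  /* The VSX compare generates the 0 / -1 doubleword select mask directly.  */
  return ((__m128d) vec_cmpeq ((__v2df) __A, (__v2df) __B));
}

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpeq_sd (__m128d __A, __m128d __B)
{
  __v2df a, b, c;
  /* Splat the [0] elements to temporaries so the compare cannot
     be affected by the [1] elements.  */
  a = vec_splats (__A[0]);
  b = vec_splats (__B[0]);
  /* vec_cmpeq returns a vector bool long type; cast back to __v2df.  */
  c = (__v2df) vec_cmpeq (a, b);
  /* Combine the mask in element [0] with the original __A[1].  */
  return ((__m128d) { c[0], __A[1] });
}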