Packed vs scalar intrinsics

So what is actually going on here? The vector code is clear enough if you know that the '+' operator is applied to each vector element. The intent of the X86 built-in is a little less clear, as the GCC documentation for __builtin_ia32_addsd is not very helpful (it is essentially nonexistent). So perhaps the Intel Intrinsic Guide will be more enlightening. To paraphrase:
From the _mm_add_pd description: for each double float element ([0] and [1], or bits [63:0] and [127:64]) of operands a and b, the sum is computed and the resulting vector is returned. From the _mm_add_sd description: add element 0 of the first operand (a[0]) to element 0 of the second operand (b[0]) and return the packed vector double {(a[0] + b[0]), a[1]}. Or said differently: the sum of the logical left-most halves of the operands is returned in the logical left-most half (element [0]) of the result, along with the logical right half (element [1]) of the first operand (unchanged) in the logical right half of the result.
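For example, a small hypothetical usage (on x86, via emmintrin.h) illustrating both semantics, with arbitrarily chosen values:

#include <emmintrin.h>

__m128d a = _mm_setr_pd (1.0, 2.0);   /* a[0] = 1.0,  a[1] = 2.0  */
__m128d b = _mm_setr_pd (10.0, 20.0); /* b[0] = 10.0, b[1] = 20.0 */
__m128d p = _mm_add_pd (a, b);        /* p = {11.0, 22.0}  */
__m128d s = _mm_add_sd (a, b);        /* s = {11.0, 2.0}; only element [0]
                                         is summed, a[1] passes through  */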
So the packed double is easy enough but the scalar double details are more complicated. One source of complication is that while both Instruction Set Architectures (SSE vs VSX) support scalar floating point operations in vector registers, the semantics are different. The vector bit and field numbering is different (reversed).

For Intel the scalar is always placed in the low order (right most) bits of the XMM register (and the low order address for load and store).

For PowerISA and VSX, scalar floating point operations and Floating Point Registers (FPRs) are in the low numbered bits, which is the left hand side of the vector / scalar register (VSR). For the PowerPC64 ELF V2 little endian ABI we also make a point of making the GCC vector extensions and vector built-ins appear to be little endian. So vector element 0 corresponds to the low order address and low order (right hand) bits of the vector register (VSR).

The handling of the non-scalar part of the register for scalar operations is also different. For the Intel ISA the scalar operations either leave the high order part of the XMM vector unchanged or in some cases force it to 0.0. For PowerISA, scalar operations on the combined FPR/VSR register leave the remainder (right half of the VSR) undefined.

To minimize confusion and use consistent nomenclature, I will try to use the terms logical left and logical right elements based on the order they appear in C vector initializers and element index order. So in the vector (__v2df){1.0, 2.0}, the value 1.0 is in the logical left element [0] and the value 2.0 is in the logical right element [1].

So let's look at how to implement these intrinsics for the PowerISA. For example, in this case we can use the GCC vector extension, like so:
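A minimal sketch of such implementations (the extern inline boilerplate follows the style of GCC's own intrinsic headers; the typedefs are the usual vector types):

typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_pd (__m128d __A, __m128d __B)
{
  /* The GCC vector extension applies '+' element-wise.  */
  return (__m128d) ((__v2df) __A + (__v2df) __B);
}

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_sd (__m128d __A, __m128d __B)
{
  /* Update only element [0]; __A[1] passes through unchanged.  */
  __A[0] = __A[0] + __B[0];
  return (__A);
}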
The packed double implementation operates on the vector as a whole.
The scalar double implementation operates on and updates only element [0] of the vector and leaves the __A[1] element unchanged.
From this source the GCC compiler generates the following code for the PPC64LE target. The packed vector double generated the corresponding VSX vector double add (xvadddp), but the scalar implementation is a bit more complicated:

0000000000000720 :
 720:	07 1b 42 f0 	xvadddp vs34,vs34,vs35
	...
0000000000000740 :
 740:	56 13 02 f0 	xxspltd vs0,vs34,1
 744:	57 1b 63 f0 	xxspltd vs35,vs35,1
 748:	03 19 60 f0 	xsadddp vs35,vs0,vs35
 74c:	57 18 42 f0 	xxmrghd vs34,vs34,vs35
	...

First, in the PPC64LE vector format, element [0] is not in the correct
position for the scalar operations. So the compiler generates vector splat
double (xxspltd) instructions to copy elements __A[0] and
__B[0] into position
for the VSX scalar add double (xsadddp) that follows. However the VSX scalar
operation leaves the other half of the VSR undefined (which does not match the
expected Intel semantics). So the compiler must generate a vector merge high double (xxmrghd) instruction to combine the original __A[1] element (from vs34) with the scalar add result from the vs35 element [1]. This merge swings the scalar result from the vs35[1] element into the vs34[0] position, while preserving the original vs34[1] (from __A[1]) element (copied to itself).

Fun
fact: The vector registers in PowerISA are decidedly Big Endian. But we decided
to make the PPC64LE ABI behave like a Little Endian system to make application
porting easier. This requires the compiler to manipulate the PowerISA vector intrinsics behind the scenes to get the correct Little Endian results. For example, the element selector [0|1] for vec_splat and the generation of vec_mergeh vs vec_mergel are reversed for Little Endian.

This technique applies to packed and scalar intrinsics for the usual arithmetic operators (add, subtract, multiply, divide). Using GCC vector
extensions in these intrinsic implementations provides the compiler more
opportunity to optimize the whole function.

Now we can look at a slightly more interesting (complicated) case.
Square root (sqrt) is not an arithmetic operator in C and is usually handled
with a library call or a compiler builtin. We really want to avoid a library
call and want to avoid any unexpected side effects. As you see below, the x86 implementations of the _mm_sqrt_pd and _mm_sqrt_sd intrinsics are based on GCC x86 built-ins.
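For reference, a sketch in the style of GCC's x86 emmintrin.h (illustrative; the exact built-ins used vary across GCC versions):

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_pd (__m128d __A)
{
  return (__m128d) __builtin_ia32_sqrtpd ((__v2df) __A);
}

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_sd (__m128d __A, __m128d __B)
{
  /* sqrtsd computes the sqrt of element [0] of its operand; the
     extra __A parameter supplies element [1] of the result.  */
  __v2df __z = __builtin_ia32_sqrtsd ((__v2df) __B);
  return (__m128d) { __z[0], __A[1] };
}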
For the packed vector sqrt, the PowerISA VSX has an equivalent vector
double square root instruction and GCC provides the vec_sqrt builtin. But the
scalar implementation involves an additional parameter and an extra move.
This seems intended to mimic the propagation of the __A[1] input to the logical right half of the XMM result that we saw with _mm_add_sd above.

The instinct is to extract the low scalar (__B[0])
from operand __B
and pass this to the GCC __builtin_sqrt () before recombining that scalar
result with __A[1] for the vector result. Unfortunately C language standards
force the compiler to call the libm sqrt function unless -ffast-math is
specified. The -ffast-math option is not commonly used and we want to avoid the
external library dependency for what should be only a few inline instructions.
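A sketch of this rejected approach (hypothetical code, reusing the types from the earlier sketch):

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_sd (__m128d __A, __m128d __B)
{
  /* Without -ffast-math the compiler must assume sqrt () can set
     errno, so this generates a call to the libm sqrt function.  */
  return (__m128d) { __builtin_sqrt (__B[0]), __A[1] };
}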
So this is not a good option.

Thinking outside the box: we do have an inline intrinsic for a (packed) vector double sqrt that we just implemented. However we need to ensure the other half of __B (__B[1]) does not cause any harmful side effects (like raising exceptions for NaN or negative values). The simplest solution is to vector splat __B[0] to both halves of a temporary value before taking the vec_sqrt. Then this result can be combined with __A[1] to return the final result. For example:
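A minimal sketch of this approach (assuming vec_sqrt from altivec.h and that the _mm_set1_pd and _mm_setr_pd intrinsics are already available):

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_pd (__m128d __A)
{
  /* VSX provides a full vector double square root.  */
  return (__m128d) vec_sqrt ((__v2df) __A);
}

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_sd (__m128d __A, __m128d __B)
{
  __v2df c;
  /* Splat __B[0] to both halves so __B[1] cannot raise spurious
     exceptions, then take the packed sqrt.  */
  c = (__v2df) _mm_sqrt_pd (_mm_set1_pd (__B[0]));
  /* Combine the scalar sqrt result with the original __A[1].  */
  return (__m128d) _mm_setr_pd (c[0], __A[1]);
}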
In this example we use _mm_set1_pd to splat the scalar __B[0], before passing that vector to our _mm_sqrt_pd implementation, then pass the sqrt result (c[0]) with __A[1] to _mm_setr_pd to combine the final result. You could also use the {c[0], __A[1]} initializer instead of _mm_setr_pd.

Now we can look at vector and scalar compares that add their own
complications: For example, the Intel Intrinsic Guide for
_mm_cmpeq_pd
describes comparing double elements [0|1] and returning either all 0s for not equal or all 1s (0xFFFFFFFFFFFFFFFF
or long long -1) for equal. The comparison result is intended as a select mask
(predicates) for selecting or ignoring specific elements in later operations.
The scalar version
_mm_cmpeq_sd
is similar except for the quirk
of only comparing element [0] and combining the result with __A[1] to return
the final vector result.

The packed vector implementation for PowerISA is simple, as VSX
provides the equivalent instruction and GCC provides the builtin
vec_cmpeq supporting the vector double type.
However the technique of using scalar comparison
operators on the __A[0] and __B[0]
does not work as the C comparison operators
return 0 or 1 results while we need the vector select mask (effectively 0 or
-1). Also we need to watch for sequences that mix scalar floats and integers,
generating if/then/else logic or requiring expensive transfers across register
banks.

In this case we are better off using explicit vector built-ins, following the _mm_add_sd and _mm_sqrt_sd implementations above as examples.
We can splat element [0] of each operand to temporaries, where we can safely use vec_cmpeq to generate the expected selector mask. Note that vec_cmpeq returns a bool long type, so we need to cast the result back to __v2df. Then we use the (__m128d){c[0], __A[1]} initializer to combine the comparison result with the original __A[1] input and cast to the required interface type. So we have this example:
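A sketch of these implementations (assuming altivec.h's vec_cmpeq and vec_splats built-ins support the vector double type):

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpeq_pd (__m128d __A, __m128d __B)
{
  /* The VSX compare generates the 0 / -1 doubleword select mask directly.  */
  return ((__m128d) vec_cmpeq ((__v2df) __A, (__v2df) __B));
}

extern __inline __m128d __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpeq_sd (__m128d __A, __m128d __B)
{
  __v2df a, b, c;
  /* Splat the [0] elements to temporaries so the compare cannot
     be affected by the [1] elements.  */
  a = vec_splats (__A[0]);
  b = vec_splats (__B[0]);
  /* vec_cmpeq returns a vector bool long type; cast back to __v2df.  */
  c = (__v2df) vec_cmpeq (a, b);
  /* Combine the mask in element [0] with the original __A[1].  */
  return ((__m128d) { c[0], __A[1] });
}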