<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

-->
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="sec_packed_vs_scalar_intrinsics">
<title>Packed vs scalar intrinsics</title>

<para>So what is actually going on here? The vector code is clear enough if
you know that the '+' operator is applied to each vector element. The intent of
the X86 built-in is a little less clear, as the GCC documentation for
<literal>__builtin_ia32_addsd</literal> is not very
helpful (nonexistent). So perhaps the
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_pd&expand=97">Intel Intrinsics Guide</link>
will be more enlightening. To paraphrase:
<blockquote>

<para>From the
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_pd&expand=97"><literal>_mm_add_pd</literal> description</link>:
each double float
element ([0] and [1], or bits [63:0] and [127:64]) of operands a and b is
added and the resulting vector is returned.</para>

<para>From the
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_sd&expand=97,130"><literal>_mm_add_sd</literal> description</link>:
add element 0 of the first operand
(a[0]) to element 0 of the second operand (b[0]) and return the packed vector
double {(a[0] + b[0]), a[1]}. Said differently, the sum of the logical left
halves of the operands is returned in the logical left half
(element [0]) of the result, along with the logical right half (element [1])
of the first operand (unchanged) in the logical right half of the result.</para></blockquote></para>

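<para>As a quick check on the two descriptions above, here is a minimal
reference model in plain C. It uses ordinary two-element arrays rather than
vector types, and the names <literal>ref_add_pd</literal> and
<literal>ref_add_sd</literal> are made up for this sketch; they are not part
of any header.</para>

```c
/* Reference model only: a[0] is the logical left element, a[1] the
   logical right element. The function names are hypothetical. */

/* Packed add: both elements are summed. */
static void ref_add_pd(const double a[2], const double b[2], double r[2])
{
    r[0] = a[0] + b[0];
    r[1] = a[1] + b[1];
}

/* Scalar add: only element [0] is summed; element [1] is copied
   unchanged from the first operand. b[1] is never read. */
static void ref_add_sd(const double a[2], const double b[2], double r[2])
{
    r[0] = a[0] + b[0];
    r[1] = a[1];
}
```

<para>The only difference between the two is what happens to element [1],
which is exactly the detail the rest of this section has to preserve.</para>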
<para>So the packed double is easy enough, but the scalar double details are
more complicated. One source of complication is that while both Instruction Set
Architectures (SSE and VSX) support scalar floating point operations in vector
registers, the semantics are different.</para>

<itemizedlist>
<listitem>
<para>The vector bit and field numbering is different (reversed).
<itemizedlist spacing="compact">
<listitem>
<para>For Intel the scalar is always placed in the low order (right most)
bits of the XMM register (and the low order address for load and store).</para>
</listitem>

<listitem>
<para>For PowerISA and VSX, scalar floating point operations and Floating
Point Registers (FPRs) are in the low numbered bits, which is the left hand
side of the vector / scalar register (VSR).</para>
</listitem>

<listitem>
<para>For the PowerPC64 ELF V2 little endian ABI we also make a point of
making the GCC vector extensions and vector built-ins appear to be little
endian. So vector element 0 corresponds to the low order address and low
order (right hand) bits of the vector register (VSR).</para>
</listitem>
</itemizedlist></para>
</listitem>
<listitem>
<para>The handling of the non-scalar part of the register for scalar
operations is different.
<itemizedlist spacing="compact">
<listitem>
<para>For Intel ISA the scalar operations either leave the high order part
of the XMM vector unchanged or in some cases force it to 0.0.</para>
</listitem>

<listitem>
<para>For PowerISA, scalar operations on the combined FPR/VSR register leave
the remainder (right half of the VSR) <emphasis role="bold">undefined</emphasis>.</para>
</listitem>
</itemizedlist></para>
</listitem>
</itemizedlist>

<para>To minimize confusion and use consistent nomenclature, I will try to
use the terms logical left and logical right elements, based on the order they
appear in a C vector initializer and the element index order. So in the vector
<literal>(__v2df){1.0, 2.0}</literal>, the value 1.0 is in the logical left element [0] and
the value 2.0 is in the logical right element [1].</para>

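<para>The naming convention above can be checked with a few lines of
portable GNU C. This is a minimal sketch that assumes a compiler supporting
the <literal>vector_size</literal> attribute (GCC or Clang); the type name
<literal>v2df_t</literal> and the helper <literal>store_pair</literal> are
illustrative stand-ins, not names from the intrinsic headers.</para>

```c
#include <string.h>

/* Two doubles in one 128-bit vector (GCC/Clang vector extension).
   v2df_t is a hypothetical stand-in for the __v2df type. */
typedef double v2df_t __attribute__((vector_size(16)));

/* Copy the vector to memory in address order, so that out[0] holds
   whatever lives at the low order address. */
static inline void store_pair(v2df_t v, double out[2])
{
    memcpy(out, &v, sizeof v);
}
```

<para>With <literal>v2df_t v = {1.0, 2.0}</literal>, both
<literal>v[0]</literal> and the double stored at the low order address are
1.0, so initializer order, element index order, and memory order all agree
on which element is the logical left one.</para>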
<para>So let's look at how to implement these intrinsics for the PowerISA.
For example, in this case we can use the GCC vector extension, like so:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_pd (__m128d __A, __m128d __B)
{
  return (__m128d) ((__v2df)__A + (__v2df)__B);
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_sd (__m128d __A, __m128d __B)
{
  __A[0] = __A[0] + __B[0];
  return (__A);
}]]></programlisting></para>

<para>The packed double implementation operates on the vector as a whole.
The scalar double implementation operates on and updates only element [0] of
the vector and leaves the <literal>__A[1]</literal> element unchanged.
From this source the GCC
compiler generates the following code for the PPC64LE target:</para>

<para>The packed vector double generates the corresponding VSX vector
double add (<literal>xvadddp</literal>). But the scalar implementation is a bit more complicated.
<programlisting><![CDATA[0000000000000720 <test_add_pd>:
 720:   07 1b 42 f0     xvadddp vs34,vs34,vs35
 ...

0000000000000740 <test_add_sd>:
 740:   56 13 02 f0     xxspltd vs0,vs34,1
 744:   57 1b 63 f0     xxspltd vs35,vs35,1
 748:   03 19 60 f0     xsadddp vs35,vs0,vs35
 74c:   57 18 42 f0     xxmrghd vs34,vs34,vs35
 ...
]]></programlisting></para>

<para>First, in the PPC64LE vector format, element [0] is not in the correct
position for the scalar operations. So the compiler generates vector splat
double (<literal>xxspltd</literal>) instructions to copy elements <literal>__A[0]</literal> and
<literal>__B[0]</literal> into position
for the VSX scalar add double (<literal>xsadddp</literal>) that follows. However, the VSX scalar
operation leaves the other half of the VSR undefined (which does not match the
expected Intel semantics). So the compiler must generate a vector merge high
double (<literal>xxmrghd</literal>) instruction to combine the original
<literal>__A[1]</literal> element (from <literal>vs34</literal>)
with the scalar add result from <literal>vs35</literal>
element [1]. This merge swings the scalar
result from the <literal>vs35[1]</literal> element into the
<literal>vs34[0]</literal> position, while preserving the
original <literal>vs34[1]</literal> (from <literal>__A[1]</literal>)
element (copied to itself).<footnote><para>Fun
fact: The vector registers in PowerISA are decidedly Big Endian. But we decided
to make the PPC64LE ABI behave like a Little Endian system to make application
porting easier. This requires the compiler to manipulate the PowerISA vector
intrinsics behind the scenes to get the correct Little Endian results. For
example, the element selector [0|1] for <literal>vec_splat</literal> and the
generation of <literal>vec_mergeh</literal> vs <literal>vec_mergel</literal>
are reversed for Little Endian.</para></footnote></para>

<para>This technique applies to packed and scalar intrinsics for the
usual arithmetic operators (add, subtract, multiply, divide). Using GCC vector
extensions in these intrinsic implementations gives the compiler more
opportunity to optimize the whole function.</para>

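<para>To make that concrete, here is a minimal sketch of the same pattern
applied to the other operators. It assumes only the GCC/Clang
<literal>vector_size</literal> extension, so it builds on any target. The
<literal>v2df_t</literal> type and the <literal>sketch_</literal> function
names are hypothetical stand-ins for <literal>__m128d</literal> and the real
intrinsic names.</para>

```c
/* Hypothetical stand-in for __m128d / __v2df. */
typedef double v2df_t __attribute__((vector_size(16)));

/* Packed forms: the C operator is applied to each element. */
static inline v2df_t sketch_sub_pd(v2df_t a, v2df_t b) { return a - b; }
static inline v2df_t sketch_mul_pd(v2df_t a, v2df_t b) { return a * b; }
static inline v2df_t sketch_div_pd(v2df_t a, v2df_t b) { return a / b; }

/* Scalar forms: only element [0] is computed; element [1] of the
   first operand passes through unchanged and b[1] is never used. */
static inline v2df_t sketch_sub_sd(v2df_t a, v2df_t b)
{
    a[0] = a[0] - b[0];
    return a;
}

static inline v2df_t sketch_mul_sd(v2df_t a, v2df_t b)
{
    a[0] = a[0] * b[0];
    return a;
}

static inline v2df_t sketch_div_sd(v2df_t a, v2df_t b)
{
    a[0] = a[0] / b[0];
    return a;
}
```

<para>Because the scalar forms never touch <literal>b[1]</literal>, a zero in
that element cannot cause a divide by zero in
<literal>sketch_div_sd</literal>.</para>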
<para>Now we can look at a slightly more interesting (complicated) case.
Square root (<literal>sqrt</literal>) is not an arithmetic operator in C and is usually handled
with a library call or a compiler built-in. We really want to avoid a library
call and want to avoid any unexpected side effects. As you see below, the
implementations of the
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_pd&expand=4926"><literal>_mm_sqrt_pd</literal></link> and
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_sd&expand=4926,4956"><literal>_mm_sqrt_sd</literal></link>
intrinsics are based on GCC x86 built-ins.
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_pd (__m128d __A)
{
  return (__m128d)__builtin_ia32_sqrtpd ((__v2df)__A);
}

/* Return pair {sqrt (B[0]), A[1]}. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_sd (__m128d __A, __m128d __B)
{
  __v2df __tmp = __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B);
  return (__m128d)__builtin_ia32_sqrtsd ((__v2df)__tmp);
}]]></programlisting></para>

<para>For the packed vector sqrt, the PowerISA VSX has an equivalent vector
double square root instruction and GCC provides the <literal>vec_sqrt</literal> built-in. But the
scalar implementation involves an additional parameter and an extra move.
This seems intended to mimic the propagation of the <literal>__A[1]</literal> input to the
logical right half of the XMM result that we saw with <literal>_mm_add_sd</literal> above.</para>

<para>The instinct is to extract the low scalar (<literal>__B[0]</literal>)
from operand <literal>__B</literal>
and pass this to the GCC <literal>__builtin_sqrt ()</literal> before recombining that scalar
result with <literal>__A[1]</literal> for the vector result. Unfortunately, C language standards
(under which <literal>sqrt</literal> may have to report a domain error through <literal>errno</literal>)
force the compiler to call the libm sqrt function unless <literal>-ffast-math</literal> is
specified. The <literal>-ffast-math</literal> option is not commonly used and we want to avoid the
external library dependency for what should be only a few inline instructions.
So this is not a good option.</para>

<para>Thinking outside the box: we do have an inline intrinsic for a
(packed) vector double sqrt that we just implemented. However, we need to
ensure the other half of <literal>__B</literal> (<literal>__B[1]</literal>)
does not cause any harmful side effects
(like raising exceptions for NaN or negative values). The simplest solution
is to vector splat <literal>__B[0]</literal> to both halves of a temporary
value before taking the <literal>vec_sqrt</literal>.
Then this result can be combined with <literal>__A[1]</literal> to return the final
result. For example:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_pd (__m128d __A)
{
  return (vec_sqrt (__A));
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_sd (__m128d __A, __m128d __B)
{
  __m128d c;
  c = _mm_sqrt_pd(_mm_set1_pd (__B[0]));
  return (_mm_setr_pd (c[0], __A[1]));
}]]></programlisting></para>

<para>In this example we use
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_set1_pd&expand=4926,4956,4926,4956,4652"><literal>_mm_set1_pd</literal></link>
to splat the scalar <literal>__B[0]</literal> before passing that vector to our
<literal>_mm_sqrt_pd</literal> implementation,
then pass the sqrt result (<literal>c[0]</literal>) with <literal>__A[1]</literal> to
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_setr_pd&expand=4679"><literal>_mm_setr_pd</literal></link>
to combine the final result. You could also use the <literal>{c[0], __A[1]}</literal>
initializer instead of <literal>_mm_setr_pd</literal>.</para>

<para>Now we can look at vector and scalar compares, which add their own
complications. For example, the Intel Intrinsics Guide for
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmpeq_pd&expand=779,788,779"><literal>_mm_cmpeq_pd</literal></link>
describes comparing double elements [0|1] and returning
either all 0 bits for not equal or all 1 bits (<literal>0xFFFFFFFFFFFFFFFF</literal>
or long long -1) for equal. The comparison result is intended as a select mask
(predicate) for selecting or ignoring specific elements in later operations.
The scalar version
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmpeq_sd&expand=779,788"><literal>_mm_cmpeq_sd</literal></link>
is similar except for the quirk
of only comparing element [0] and combining the result with <literal>__A[1]</literal> to return
the final vector result.</para>

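<para>To show what "select mask" means in practice, here is a minimal
sketch that compares two vectors and then uses the resulting mask to pick
elements with bitwise operations. It assumes the GCC/Clang vector extension,
where a vector comparison yields -1 (all 1 bits) or 0 in each element. The
type names and <literal>select_where_equal</literal> are hypothetical and do
not correspond to any Intel or PowerISA intrinsic.</para>

```c
/* Hypothetical stand-ins for the __v2df and __v2di types. */
typedef double    v2df_t __attribute__((vector_size(16)));
typedef long long v2di_t __attribute__((vector_size(16)));

/* For each element: return if_equal where a == b, otherwise
   if_not_equal. The casts between vector types reinterpret the
   bits; they do not convert values. */
static inline v2df_t select_where_equal(v2df_t a, v2df_t b,
                                        v2df_t if_equal,
                                        v2df_t if_not_equal)
{
    v2di_t mask = (v2di_t)(a == b);   /* -1 or 0 per element */
    v2di_t bits = (mask & (v2di_t)if_equal)
                | (~mask & (v2di_t)if_not_equal);
    return (v2df_t)bits;
}
```

<para>A mask of 0 or 1 would not work here, since ANDing with 1 keeps only
a single bit of the selected double.</para>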
<para>The packed vector implementation for PowerISA is simple, as VSX
provides the equivalent instruction and GCC provides the built-in
<literal>vec_cmpeq</literal> supporting the vector double type.
However, the technique of using scalar comparison
operators on <literal>__A[0]</literal> and <literal>__B[0]</literal>
does not work, as the C comparison operators
return 0 or 1 results while we need the vector select mask (effectively 0 or
-1). We also need to watch for sequences that mix scalar floats and integers,
generating if/then/else logic or requiring expensive transfers across register
banks.</para>

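<para>The 0-or-1 versus 0-or-(-1) difference is easy to see side by side.
The following is a minimal sketch assuming the GCC/Clang vector extension;
<literal>scalar_eq</literal> and <literal>vector_eq</literal> are
illustrative names only.</para>

```c
/* Hypothetical stand-ins for the __v2df and __v2di types. */
typedef double    v2df_t __attribute__((vector_size(16)));
typedef long long v2di_t __attribute__((vector_size(16)));

/* Scalar C comparison: the result is the int value 1 or 0. */
static inline int scalar_eq(double a, double b)
{
    return a == b;
}

/* Vector comparison: each element of the result is -1 (all 1 bits)
   where the inputs are equal and 0 where they are not. */
static inline v2di_t vector_eq(v2df_t a, v2df_t b)
{
    return (v2di_t)(a == b);
}
```

<para>Comparing 1.0 with 1.0 gives 1 from <literal>scalar_eq</literal> but
-1 in the matching element from <literal>vector_eq</literal>, which is the
form the select mask requires.</para>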
<para>In this case we are better off using explicit vector built-ins, following
the pattern of the <literal>_mm_add_sd</literal> and <literal>_mm_sqrt_sd</literal> examples.
We can use <literal>vec_splat</literal> from element [0] to temporaries
where we can safely use <literal>vec_cmpeq</literal> to generate the expected selector mask. Note
that <literal>vec_cmpeq</literal> returns a bool long type, so we need to cast the result back
to <literal>__v2df</literal>. Then use the
<literal>(__m128d){c[0], __A[1]}</literal> initializer to combine the
comparison result with the original <literal>__A[1]</literal> input and cast to the required
interface type. So we have this example:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpeq_pd (__m128d __A, __m128d __B)
{
  return ((__m128d)vec_cmpeq (__A, __B));
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpeq_sd(__m128d __A, __m128d __B)
{
  __v2df a, b, c;
  /* PowerISA VSX does not allow partial (for just left double)
   * results. So to ensure we don't generate spurious exceptions
   * (from the right double values) we splat the left double
   * before we do the operation. */
  a = vec_splat(__A, 0);
  b = vec_splat(__B, 0);
  c = (__v2df)vec_cmpeq(a, b);
  /* Then we merge the left double result with the original right
   * double from __A. */
  return ((__m128d){c[0], __A[1]});
}]]></programlisting></para>

</section>