|
|
|
|
<?xml version="1.0" encoding="UTF-8"?>
|
|
|
|
|
<!--
|
|
|
|
|
Copyright (c) 2017 OpenPOWER Foundation
|
|
|
|
|
|
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License");
|
|
|
|
|
you may not use this file except in compliance with the License.
|
|
|
|
|
You may obtain a copy of the License at
|
|
|
|
|
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
|
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software
|
|
|
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
|
|
See the License for the specific language governing permissions and
|
|
|
|
|
limitations under the License.
|
|
|
|
|
|
|
|
|
|
-->
|
|
|
|
|
<section xmlns="http://docbook.org/ns/docbook"
|
|
|
|
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
|
|
|
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
|
|
|
|
version="5.0"
|
|
|
|
|
xml:id="sec_gcc_vector_extensions">
|
|
|
|
|
<title>GCC Vector Extensions</title>
|
|
|
|
|
|
|
|
|
|
<para>The GCC vector extensions are common syntax but implemented in a
|
|
|
|
|
target specific way. Using the C vector extensions requires the
|
|
|
|
|
<literal>__gnu_inline__</literal>
|
|
|
|
|
attribute to avoid syntax errors in case the user specified C standard
|
|
|
|
|
compliance (<literal>-std=c90</literal>, <literal>-std=c11</literal>,
|
|
|
|
|
etc) that would normally disallow such
|
|
|
|
|
extensions. </para>
|
|
|
|
|
|
|
|
|
|
<para>The GCC implementation for PowerPC64 Little Endian is (mostly)
|
|
|
|
|
functionally compatible with x86_64 vector extension usage. We can use the same
|
|
|
|
|
type definitions (at least for vector_size (16)), operations, syntax
|
|
|
|
|
<literal><</literal><emphasis role="bold"><literal>{</literal></emphasis><literal>...</literal><emphasis role="bold"><literal>}</literal></emphasis><literal>></literal>
|
|
|
|
|
for vector initializers and constants, and array syntax
|
|
|
|
|
<literal><</literal><emphasis role="bold"><literal>[]</literal></emphasis><literal>></literal>
|
|
|
|
|
for vector element access. So simple arithmetic / logical operations
|
|
|
|
|
on whole vectors should work as is. </para>
|
|
|
|
|
|
|
|
|
|
<para>The caveat is that the interface data type of the Intel Intrinsic may
|
|
|
|
|
not match the data types of the operation, so it may be necessary to cast the
|
|
|
|
|
operands to the specific type for the operation. This also applies to vector
|
|
|
|
|
initializers and accessing vector elements. You need to use the appropriate
|
|
|
|
|
type to get the expected results. Of course this applies to X86_64 as well. For
|
|
|
|
|
example:
|
|
|
|
|
<programlisting><![CDATA[/* Perform the respective operation on the four SPFP values in A and B. */
|
|
|
|
|
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
|
|
|
|
|
_mm_add_ps (__m128 __A, __m128 __B)
|
|
|
|
|
{
|
|
|
|
|
return (__m128) ((__v4sf)__A + (__v4sf)__B);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/* Stores the lower SPFP value. */
|
|
|
|
|
extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
|
|
|
|
|
_mm_store_ss (float *__P, __m128 __A)
|
|
|
|
|
{
|
|
|
|
|
*__P = ((__v4sf)__A)[0];
|
|
|
|
|
}]]></programlisting></para>
|
|
|
|
|
|
|
|
|
|
<para>Note the cast from the interface type (<literal>__m128</literal>} to the implementation
|
|
|
|
|
type (<literal>__v4sf</literal>, defined in the intrinsic header) for the vector float add (+)
|
|
|
|
|
operation. This is enough for the compiler to select the appropriate vector add
|
|
|
|
|
instruction for the float type. Then the result (which is
|
|
|
|
|
<literal>__v4sf</literal>) needs to be
|
|
|
|
|
cast back to the expected interface type (<literal>__m128</literal>). </para>
|
|
|
|
|
|
|
|
|
|
<para>Note also the use of <emphasis>array syntax</emphasis> (<literal>__A)[0]</literal>)
|
|
|
|
|
to extract the lowest
|
|
|
|
|
(left most<footnote><para>Here we are using logical left and logical right
|
|
|
|
|
which will not match the PowerISA register view in Little endian. Logical left
|
|
|
|
|
is the left most element for initializers {left, … , right}, storage order
|
|
|
|
|
and array order where the left most element is [0].</para></footnote>)
|
|
|
|
|
element of a vector. The cast (<literal>__v4sf</literal>) insures that the compiler knows we are
|
|
|
|
|
extracting the left most 32-bit float. The compiler insures the code generated
|
|
|
|
|
matches the Intel behavior for PowerPC64 Little Endian. </para>
|
|
|
|
|
|
|
|
|
|
<para>The code generation is complicated by the fact that PowerISA vector
|
|
|
|
|
registers are Big Endian (element 0 is the left most word of the vector) and
|
|
|
|
|
scalar loads / stores are also to / from the right most word / dword.
|
|
|
|
|
X86 scalar loads / stores are to / from the right most element for the
|
|
|
|
|
XMM vector register.
|
|
|
|
|
The PowerPC64 ELF V2 ABI mimics the X86 Little Endian behavior by placing
|
|
|
|
|
logical element [0] in the right most element of the vector register. </para>
|
|
|
|
|
|
|
|
|
|
<para>This may require the compiler to generate additional instructions
|
|
|
|
|
to place the scalar value in the expected position.
|
|
|
|
|
Application code with extensive use of scalar (vs packed) intrinsic loads /
|
|
|
|
|
stores should be flagged for rewrite to C code using existing scalar
|
|
|
|
|
types (float, double, int, long, etc.). The compiler may be able the
|
|
|
|
|
vectorize this scalar code using the native vector SIMD instruction set.</para>
|
|
|
|
|
|
|
|
|
|
<para>Another example is the set reverse order:
|
|
|
|
|
<programlisting><![CDATA[/* Create the vector [Z Y X W]. */
|
|
|
|
|
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
|
|
|
|
|
_mm_set_ps (const float __Z, const float __Y, const float __X, const float __W)
|
|
|
|
|
{
|
|
|
|
|
return __extension__ (__m128)(__v4sf){ __W, __X, __Y, __Z };
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/* Create the vector [W X Y Z]. */
|
|
|
|
|
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
|
|
|
|
|
_mm_setr_ps (float __Z, float __Y, float __X, float __W)
|
|
|
|
|
{
|
|
|
|
|
return __extension__ (__m128)(__v4sf){ __Z, __Y, __X, __W };
|
|
|
|
|
}]]></programlisting></para>
|
|
|
|
|
|
|
|
|
|
<para>Note the use of <emphasis>initializer syntax</emphasis> used to collect a set of scalars
|
|
|
|
|
into a vector. Code with constant initializer values will generate a vector
|
|
|
|
|
constant of the appropriate endian. However code with variables in the
|
|
|
|
|
initializer can get complicated as it often requires transfers between register
|
|
|
|
|
sets and perhaps format conversions. We can assume that the compiler will
|
|
|
|
|
generate the correct code, but if this class of intrinsics shows up as a hot spot,
|
|
|
|
|
a rewrite to native PPC vector built-ins may be appropriate. For example
|
|
|
|
|
initializer of a variable replicated to all the vector fields might not be
|
|
|
|
|
recognized as a “load and splat” and making this explicit may help the
|
|
|
|
|
compiler generate better code.</para>
|
|
|
|
|
|
|
|
|
|
</section>
|
|
|
|
|
|