Programming-Guides/Vector_Intrinsics/sec_gcc_vector_extensions.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright (c) 2017 OpenPOWER Foundation
  
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
  
-->
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="sec_gcc_vector_extensions">
  <title>GCC Vector Extensions</title>
  
  <para>The GCC vector extensions are common syntax but implemented in a 
  target specific way. Using the C vector extensions requires the 
  <literal>__gnu_inline__</literal> 
  attribute to avoid syntax errors in case the user specified  C standard 
  compliance (<literal>-std=c90</literal>, <literal>-std=c11</literal>, 
  etc) that would normally disallow such 
  extensions. </para>

  <para>The GCC implementation for PowerPC64 Little Endian is (mostly) 
  functionally compatible with x86_64 vector extension usage. We can use the same 
  type definitions (at least for  vector_size (16)), operations, syntax 
  <literal>&lt;</literal><emphasis role="bold"><literal>{</literal></emphasis><literal>...</literal><emphasis role="bold"><literal>}</literal></emphasis><literal>&gt;</literal> 
  for vector initializers and constants, and array syntax 
  <literal>&lt;</literal><emphasis role="bold"><literal>[]</literal></emphasis><literal>&gt;</literal> 
  for vector element access. So simple arithmetic / logical operations 
  on whole vectors should work as is. </para>

  <para>The caveat is that the interface data type of the Intel Intrinsic may 
  not match the data types of the operation, so it may be necessary to cast the 
  operands to the specific type for the operation. This also applies to vector 
  initializers and accessing vector elements. You need to use the appropriate 
  type to get the expected results. Of course this applies to X86_64 as well. For 
  example:
  <programlisting><![CDATA[/* Perform the respective operation on the four SPFP values in A and B.  */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ps (__m128 __A, __m128 __B)
{
  return (__m128) ((__v4sf)__A + (__v4sf)__B);
}

/* Stores the lower SPFP value.  */
extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_store_ss (float *__P, __m128 __A)
{
  *__P = ((__v4sf)__A)[0];
}]]></programlisting></para>

  <para>Note the cast from the interface type (<literal>__m128</literal>} to the implementation 
  type (<literal>__v4sf</literal>, defined in the intrinsic header) for the vector float add (+) 
  operation. This is enough for the compiler to select the appropriate vector add 
  instruction for the float type. Then the result (which is 
  <literal>__v4sf</literal>) needs to be 
  cast back to the expected interface type (<literal>__m128</literal>). </para>

  <para>Note also the use of <emphasis>array syntax</emphasis> (<literal>__A)[0]</literal>)
   to extract the lowest 
  (left most<footnote><para>Here we are using logical left and logical right 
  which will not match the PowerISA register view in Little endian. Logical left 
  is the left most element for initializers {left, … , right}, storage order 
  and array  order where the left most element is [0].</para></footnote>) 
  element of a vector. The cast (<literal>__v4sf</literal>) insures that the compiler knows we are 
  extracting the left most 32-bit float. The compiler insures the code generated 
  matches the Intel behavior for PowerPC64 Little Endian. </para>

  <para>The code generation is complicated by the fact that PowerISA vector 
  registers are Big Endian (element 0 is the left most word of the vector) and 
  scalar loads / stores are also to / from the right most word / dword.
  X86 scalar loads / stores are to / from the right most element for the
  XMM vector register.
  The PowerPC64 ELF V2 ABI mimics the X86 Little Endian behavior by placing
  logical element [0] in the right most element of the vector register. </para>

  <para>This may require the compiler to generate additional instructions
  to place the scalar value in the expected position.
  Application code with extensive use of scalar (vs packed) intrinsic loads / 
  stores should be flagged for rewrite to C code using existing scalar 
  types (float, double, int, long, etc.). The compiler may be able the
  vectorize this scalar code using the native vector SIMD instruction set.</para>

  <para>Another example is the set reverse order:
  <programlisting><![CDATA[/* Create the vector [Z Y X W].  */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_set_ps (const float __Z, const float __Y, const float __X, const float __W)
{
  return __extension__ (__m128)(__v4sf){ __W, __X, __Y, __Z };
}

/* Create the vector [W X Y Z].  */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_setr_ps (float __Z, float __Y, float __X, float __W)
{
  return __extension__ (__m128)(__v4sf){ __Z, __Y, __X, __W };
}]]></programlisting></para>

  <para>Note the use of <emphasis>initializer syntax</emphasis> used to collect a set of scalars 
  into a vector. Code with constant initializer values will generate a vector 
  constant of the appropriate endian. However code with variables in the 
  initializer can get complicated as it often requires transfers between register 
  sets and perhaps format conversions. We can assume that the compiler will 
  generate the correct code, but if this class of intrinsics shows up as a hot spot, 
  a rewrite to native PPC vector built-ins may be appropriate. For example 
  initializer of a variable replicated to all the vector fields might not be 
  recognized as a “load and splat” and making this explicit may help the 
  compiler generate better code.</para>

</section>