<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<article lang="">
<section>

<title>1 Intel Intrinsic porting guide for PowerPC64LE</title>

<para>The goal of this project is to provide functional equivalents of the
Intel MMX, SSE, and AVX intrinsic functions that are commonly used in Linux
applications, and to make them (or equivalents) available for the PowerPC64LE
platform. These X86 intrinsics started with the Intel and Microsoft compilers
but were then ported to the GCC compiler. The GCC implementation is a set of
headers with inline functions. These inline functions provide an
implementation mapping from the Intel/Microsoft dialect intrinsic names to the
corresponding GCC x86 built-ins or directly to C language vector extension
syntax.</para>

<para/>
<para>The current proposal is to start with the existing X86 GCC intrinsic
headers and port them (copy and change the source) to POWER using C language
vector extensions and VMX and VSX built-ins. Another key assumption is that we
will be able to reuse many of the existing Intel DejaGNU test cases from
./gcc/testsuite/gcc.target/i386. This document is intended as a guide for
developers participating in this effort. However, it also provides guidance
and examples that should be useful to any developer who encounters X86
intrinsics in code being ported to another platform.</para>

<para/>

</section>
<section>

<title>1.1 Look at the source, Luke</title>

<para>So if this is a code porting activity, where is the source? All the
source code we need to look at is in the GCC source trees. You can either
clone the GCC source from git (https://gcc.gnu.org/wiki/GitMirror) or download
one of the recent Advance Toolchain (AT) source tars (for example:
ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/ubuntu/dists/xenial/at10.0/).
You will find the intrinsic headers in the ./gcc/config/i386/
sub-directory.</para>

<para/>

<para>If you have an Intel Linux workstation or laptop with GCC installed,
you already have these headers, if you want to take a look:</para>

<para/>
<para>But depending on the vintage of the distro, these may not be the
latest versions of the headers. Looking at the header source will tell you a
few things: the include structure (what other headers are implicitly
included), the types that are used at the API, and finally, how the API is
implemented.</para>

<para/>

</section>
<section>

<title>1.1.1 The structure of the intrinsic includes</title>

<para>The GCC x86 vector intrinsic functions were initially grouped by
technology, starting with MMX and continuing with SSE through SSE4.1, stacked
like a set of Russian dolls.</para>

<para/>
<para>Basically each higher-layer include needs typedefs and helper macros
defined by the lower-level intrinsic includes. mm_malloc.h simply provides
wrappers for posix_memalign and free. Then it gets a little weird, starting
with the crypto extensions. For AVX, AVX2, and AVX512 they must have decided
that the Russian-doll thing was getting out of hand. AVX and friends are split
across 14 files, but applications are not expected to include these
individually. Instead, immintrin.h includes everything Intel vector, including
all the AVX, AES, SSE, and MMX flavors.</para>

<para/>
<para>So what is the net? The include structure provides some strong clues
about the order in which we should approach this effort. For example, if you
need an intrinsic from SSE4 (smmintrin.h), you are likely to need type
definitions from SSE2 (emmintrin.h). So a bottom-up (MMX, SSE, SSE2, …)
approach seems like the best plan of attack. Also, saving the AVX parts for
later makes sense, as most are just wider forms of operations that already
exist in SSE.</para>

<para/>
<para>We should use the same include structure to implement our PowerISA
equivalent API headers. This will make porting easier (drop-in replacement) and
should get the application running quickly on POWER. Then we are in a position
to profile and analyze the resulting application. This will show any hot spots
where the simple one-to-one transformation results in bottlenecks and
additional tuning is needed. For these cases we should improve our tools (SDK
MA/SCA) to identify opportunities for, and perhaps propose, alternative
sequences that are better tuned to PowerISA and our micro-architecture.</para>

<para/>

</section>
<section>

<title>1.1.2 The types used for intrinsics</title>

<para>The type system for Intel intrinsics is a little strange. For example,
from xmmintrin.h:</para>

<para/>
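<para>The typedefs in question (from GCC's xmmintrin.h; lightly condensed
here so the snippet stands alone) look like this:</para>

<programlisting>
```c
/* The public API type, declared with the __may_alias__ attribute.  */
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));

/* Internal type used in the implementations, without __may_alias__.  */
typedef float __v4sf __attribute__ ((__vector_size__ (16)));
```
</programlisting>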
<para>So there is one set of types that are used in the function prototypes
of the API, and another set of internal types that are used in the
implementation. Notice the special attribute
<literal>__may_alias__</literal>. From the GCC documentation:</para>

<para>So there are a couple of issues here: 1) the API seems to force the
compiler to assume aliasing of any parameter passed by reference. Normally the
compiler assumes that parameters of different size do not overlap in storage,
which allows more optimization. 2) the data type used at the interface may not
be the correct type for the implied operation. So parameters of type __m128i
(which is defined as vector long long) are also used for parameters and return
values of vector [char | short | int] operations.</para>

<para/>
<para>This may not matter when using x86 built-ins but does matter when
the implementation uses C vector extensions or, in our case, PowerPC generic
vector built-ins (#2.1.3.2.<link linkend="">PowerISA Vector
Intrinsics|outline</link>). For the latter cases the type must be correct for
the compiler to generate the correct type (char, short, int, long) (<link
linkend="">#1.1.3.How the API is implemented.|outline</link>) for the generic
built-in operation. There is also concern that excessive use of
<literal>__may_alias__</literal> will limit compiler optimization. We are not
sure how important this attribute is to the correct operation of the API. So
at a later stage we should experiment with removing it from our implementation
for PowerPC.</para>

<para/>
<para>The good news is that PowerISA has good support for 128-bit vectors
and (with the addition of VSX) all the required vector data types (char,
short, int, long, float, double). However, Intel supports a wider variety of
vector sizes than PowerISA does. This started with the 64-bit MMX vector
support that preceded SSE and extends to the 256-bit and 512-bit vectors of
AVX, AVX2, and AVX512 that followed SSE.</para>

<para/>
<para>Within the GCC Intel intrinsic implementation these are all
implemented as vector attribute extensions of the appropriate size
(<literal>__vector_size__</literal> ({8 | 16 | 32 | 64})). For the PowerPC
target GCC currently only supports the native
<literal>__vector_size__</literal> (16). These we can support directly in
VMX/VSX registers and associated instructions. GCC will compile with other
<literal>__vector_size__</literal> values, but the resulting types are treated
as simple arrays of the element type. This does not allow the compiler to use
the vector registers and vector instructions for these (non-native) vectors.
So what is a programmer to do?</para>

</section>
<section>

<title>1.1.2.1 Dealing with MMX</title>

<para>MMX is actually the hard case. The __m64 type supports SIMD vector
int types (char, short, int, long). The Intel API defines __m64 as:</para>

<para/>
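<para>From GCC's mmintrin.h:</para>

<programlisting>
```c
/* A single 64-bit unit, declared as a vector of two ints.  */
typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__));
```
</programlisting>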
<para>This is problematic for the PowerPC target (vector_size (8) is not
really supported in GCC for PowerPC), and we would prefer to use a native
PowerISA type that can be passed in a single register. The PowerISA Rotate
Under Mask instructions can easily extract and insert integer fields of a
General Purpose Register (GPR). This implies that MMX integer types can be
handled as an internal union of arrays for the supported element types. So a
64-bit unsigned long long is the best type for parameter passing and return
values, especially for the 64-bit (_si64) operations, as these normally
generate a single PowerISA instruction.</para>

<para/>
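<para>Such a union might look like the following sketch. The type and field
names here are illustrative assumptions, not taken from any existing
header:</para>

<programlisting>
```c
/* Hedged sketch: modeling __m64 as a union of element arrays.  The whole
   64-bit value (as_m64) is what gets passed and returned in a GPR; the
   arrays give the implementation access to the individual elements.  */
typedef union
  {
    unsigned long long as_m64;      /* the whole 64-bit value, in a GPR  */
    char               as_char[8];
    short              as_short[4];
    int                as_int[2];
    long long          as_long_long;
    float              as_float[2];
    double             as_double;
  } __m64_union;
```
</programlisting>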
<para>The SSE extensions include some convert operations for __m128 to /
from __m64, and these include some int to / from float conversions. However,
in these cases the float operands always reside in SSE (XMM) registers (which
match the PowerISA vector registers) and the MMX registers only contain
integer values. POWER8 (PowerISA-2.07) has direct move instructions between
GPRs and VSRs. So these transfers are normally a single instruction and any
conversions can be handled in the vector unit.</para>

<para/>
<para>When transferring a __m64 value to a vector register we should also
execute a xxspltd instruction to ensure there is valid data in all the
element lanes before doing floating point operations. This avoids generating
extraneous floating point exceptions from uninitialized parts of the vector.
The top two lanes will have the floating point results that are in position
for direct transfer to a GPR or stored via Store Float Double (stfd). These
operations are internal to the intrinsic implementation and there is no
requirement to keep temporary vectors in correct little endian form.</para>

<para/>
<para>Also, for the smaller element sizes and higher element counts (MMX
_pi8 and _pi16 types) the number of Rotate Under Mask instructions required to
disassemble the 64-bit __m64 into elements, perform the element calculations,
and reassemble the elements into a single __m64 value can get large. In this
case we can generate shorter instruction sequences by transferring (via direct
move instruction) the GPR __m64 value to a vector register, performing the
SIMD operation there, then transferring the __m64 result back to a
GPR.</para>

<para/>

</section>
<section>

<title>1.1.2.2 Dealing with AVX and AVX512</title>

<para>AVX is a bit easier for PowerISA and the ELF V2 ABI. First, we have
lots (64) of vector registers and a superscalar vector pipeline (which can
execute two or more independent 128-bit vector operations concurrently).
Second, the ELF V2 ABI was designed to pass and return larger aggregates in
vector registers:</para>

<para/>
<orderedlist>

<listitem>

<para>Up to 12 qualified vector arguments can be passed in
v2–v13.</para>

</listitem>

<listitem>

<para>A qualified vector argument corresponds to a vector data type
(see the ELF V2 ABI for the full definition).</para>

</listitem>

</orderedlist>
<para>So the ABI allows for passing up to three structures, each
representing a 512-bit vector, and returning one such (512-bit) structure, all
in VMX registers. This can be extended further by spilling parameters (beyond
12 x 128-bit vectors) to the parameter save area, but we should not need that,
as most intrinsics only use 2 or 3 operands. Vector registers not needed for
parameter passing, along with an additional 8 volatile vector registers, are
available for scratch and local variables. All can be used by the application
without requiring register spill to the save area. So most intrinsic
operations on 256- or 512-bit vectors can be held within existing PowerISA
vector registers.</para>

<para/>
<para>For larger functions that might use multiple AVX 256-bit or 512-bit
intrinsics and, as a result, push beyond the 20 volatile vector registers, the
compiler will just use non-volatile vector registers, allocating a stack
frame and spilling non-volatile vector registers to the save area (as needed
in the function prologue). This frees up to 64 vectors (32 x 256-bit or 16 x
512-bit structs) for code optimization.</para>

<para/>
<para>Based on the specifics of our ISA and ABI we will not use
<literal>__vector_size__</literal> (32) or (64) in the PowerPC implementation
of the __m256 and __m512 types. Instead we will typedef structs of 2 or 4
vector (__m128) fields. This allows efficient handling of these larger data
types without requiring new GCC language extensions.</para>
<para/>

<para>In the end we should use the same type names and definitions as the
GCC X86 intrinsic headers where possible. Where that is not possible we can
define new typedefs that provide the best mapping to the underlying PowerISA
hardware.</para>

</section>

<section>

<title>1.1.3 How is this API implemented.</title>
<para>One pleasant surprise is that many (at least of the older Intel)
intrinsics are implemented directly in C vector extension code and/or as a
simple mapping to GCC target specific built-ins.</para>

</section>

<section>

<title>1.1.3.1 Some simple examples</title>

<para>For example, a vector double splat looks like this:</para>
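<para>From GCC's emmintrin.h (the typedef is repeated here so the snippet is
self-contained):</para>

<programlisting>
```c
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

/* Create a vector with both elements equal to F.  */
extern __inline __m128d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_set1_pd (double __F)
{
  return __extension__ (__m128d){ __F, __F };
}
```
</programlisting>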
<para>Another example:</para>
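<para>The packed double add, also from emmintrin.h (typedefs repeated for a
self-contained snippet):</para>

<programlisting>
```c
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));
typedef double __v2df  __attribute__ ((__vector_size__ (16)));

/* Add the two operands as vector double; note the casts to __v2df.  */
extern __inline __m128d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_pd (__m128d __A, __m128d __B)
{
  return (__m128d) ((__v2df)__A + (__v2df)__B);
}
```
</programlisting>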
<para>Note in the example above the cast to __v2df for the operation. Both
__m128d and __v2df are vector double, but __v2df does not have the
<literal>__may_alias__</literal> attribute. And one more example:</para>
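<para>The multiply low halfword intrinsic, again from emmintrin.h (typedefs
repeated for a self-contained snippet):</para>

<programlisting>
```c
typedef long long      __m128i __attribute__ ((__vector_size__ (16),
                                               __may_alias__));
typedef unsigned short __v8hu  __attribute__ ((__vector_size__ (16)));

/* Multiply the 8 halfword elements, keeping the low 16 bits of each
   product.  Both casts are needed: to drop __may_alias__ and to get an
   unsigned short multiply.  */
extern __inline __m128i
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_mullo_epi16 (__m128i __A, __m128i __B)
{
  return (__m128i) ((__v8hu)__A * (__v8hu)__B);
}
```
</programlisting>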
<para>Note this requires a cast for the compiler to generate the correct
code for the intended operation. The parameters and result are the generic
__m128i, which is a vector long long with the
<literal>__may_alias__</literal> attribute. But the operation is a vector
multiply low unsigned short (__v8hu). So not only do we use the cast to drop
the <literal>__may_alias__</literal> attribute but we also need to cast to the
correct (vector unsigned short) type for the specified operation.</para>

<para/>
<para>I have successfully copied these (and similar) source snippets over
to the PPC64LE implementation unchanged. This of course assumes the associated
types are defined and with compatible attributes.</para>

</section>

<section>

<title>1.1.3.2 Those extra attributes</title>

<para>You may have noticed there are some special attributes:</para>
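<para>The attribute string used on every intrinsic in the GCC headers looks
like this (add_example is an illustrative function, not an intrinsic):</para>

<programlisting>
```c
/* The function attributes used throughout the GCC intrinsic headers:
   __gnu_inline__    - GNU inline semantics, no out-of-line copy emitted
   __always_inline__ - inline even at -O0
   __artificial__    - debuggers treat the body as a unit and step over it  */
extern __inline int
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
add_example (int __A, int __B)
{
  return __A + __B;
}
```
</programlisting>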
<para>So far I have been using these attributes unchanged.</para>

<para/>

<para>But most intrinsics map the Intel intrinsic to one or more target
specific GCC builtins. For example:</para>
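<para>The emmintrin.h load pair looks like this. GCC 6-era headers wrote the
unaligned case as __builtin_ia32_loadupd (__P); newer headers express the
same movupd via an unaligned vector type, which is the form shown here so the
snippet is self-contained:</para>

<programlisting>
```c
typedef double __m128d   __attribute__ ((__vector_size__ (16),
                                         __may_alias__));
/* Unaligned version of the type (newer headers' replacement for the
   old __builtin_ia32_loadupd builtin).  */
typedef double __m128d_u __attribute__ ((__vector_size__ (16),
                                         __may_alias__, __aligned__ (1)));

/* Load two DPFP values from P.  The address must be 16-byte aligned;
   the compiler may use movapd, which traps on unaligned addresses.  */
extern __inline __m128d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_load_pd (double const *__P)
{
  return *(__m128d *)__P;
}

/* Load two DPFP values from P.  The address need not be aligned;
   this generates the movupd instruction.  */
extern __inline __m128d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_loadu_pd (double const *__P)
{
  return *(__m128d_u *)__P;
}
```
</programlisting>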
<para/>

<para>The first intrinsic (_mm_load_pd) is implemented as a C vector pointer
dereference, but from the comment the compiler is expected to use a movapd
instruction, which requires 16-byte alignment (and will raise a
general-protection exception if the address is not aligned). This implies that
there is a performance advantage for at least some Intel processors in keeping
the vector aligned. The second intrinsic uses the explicit GCC builtin
__builtin_ia32_loadupd to generate the movupd instruction, which handles
unaligned references.</para>

<para/>
<para>The opposite assumption applies to POWER and PPC64LE, where GCC
generates the VSX lxvd2x / xxswapd instruction sequence by default, which
allows unaligned references. The PowerISA equivalent for aligned vector access
is the VMX lvx instruction and the vec_ld builtin, which forces quadword
aligned access (by ignoring the low order 4 bits of the effective address).
The lvx instruction does not raise alignment exceptions, but perhaps should as
part of our implementation of the Intel intrinsic. This requires that we use
PowerISA VMX/VSX built-ins to ensure we get the expected results.</para>

<para/>

<para>The current prototype defines the following:</para>
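<para>A sketch of that prototype follows. It compiles only for a PowerPC
target with altivec.h available, and is written from the description in the
text, so details may differ from the actual prototype header:</para>

<programlisting>
```c
/* Hedged sketch of the PPC64LE prototype (assumes a POWER target).  */
#include <altivec.h>
#include <assert.h>

typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

/* Aligned load: assert the Intel alignment semantic, then use vec_ld
   (generates lvx).  */
extern __inline __m128d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_load_pd (double const *__P)
{
  assert (((unsigned long)__P & 0xfUL) == 0UL);
  return ((__m128d) vec_ld (0, (__vector unsigned char *)__P));
}

/* Unaligned load: vec_vsx_ld (lxvd2x / xxswapd on power8, lxv/lxvx on
   power9).  */
extern __inline __m128d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_loadu_pd (double const *__P)
{
  return (vec_vsx_ld (0, __P));
}
```
</programlisting>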
<para>The aligned load intrinsic adds an assert which checks alignment
(to match the Intel semantic) and uses the GCC builtin vec_ld (which generates
lvx). The assert generates extra code but this can be eliminated by defining
NDEBUG at compile time. The unaligned load intrinsic uses the GCC builtin
vec_vsx_ld (which for PPC64LE generates lxvd2x / xxswapd for power8 and will
simplify to lxv or lxvx for power9). And similarly for _mm_store_pd /
_mm_storeu_pd, using vec_st and vec_vsx_st. These concepts extend to the
load/store intrinsics for vector float and vector int.</para>

</section>
<section>

<title>1.1.3.3 How did I find this out?</title>

<para>The next question is where did I get the details above? The GCC
documentation for __builtin_ia32_loadupd provides minimal information (the
builtin name, parameters, and return types). Not very informative.</para>

<para/>
<para>Looking up the Intel intrinsic description is more informative. You
can Google the intrinsic name or use the <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel
Intrinsics Guide</link> for this. The Intrinsics Guide is interactive and
includes Intel (chip) technology and text based search capabilities. Clicking
on the intrinsic name opens a synopsis including: the underlying instruction
name, a text description, operation pseudo code, and in some cases performance
information (latency and throughput).</para>

<para/>
<para>The key is to get a description of the intrinsic (operand fields and
types, and which fields are updated for the result) and the underlying Intel
instruction. If the Intrinsics Guide is not clear you can look up the
instruction details in the “<link
xlink:href="https://software.intel.com/en-us/articles/intel-sdm">Intel® 64 and
IA-32 Architectures Software Developer’s Manual</link>”.</para>

<para/>
<para>Information about the PowerISA vector facilities is found in the
<link
xlink:href="https://openpowerfoundation.org/?resource_lib=ibm-power-isa-version-2-07-b"
>PowerISA Version 2.07B</link> (for POWER8; <link
xlink:href="https://www.docdroid.net/tWT7hjD/powerisa-v30.pdf.html">3.0 for
POWER9</link>) manual, Book I, Chapter 6 (Vector Facility) and Chapter 7
(Vector-Scalar Floating-Point Operations). Another good reference is the <link
xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">OpenPOWER
ELF V2 application binary interface</link> (ABI) document, Chapter 6 (Vector
Programming Interfaces) and Appendix A (Predefined Functions for Vector
Programming).</para>

<para/>
<para>Another useful document is the original <link
xlink:href="http://www.nxp.com/assets/documents/data/en/reference-manuals/ALTIVECPEM.pdf">AltiVec
Technology Programming Interface Manual</link>, with a user friendly structure
and many helpful diagrams. But alas, the PIM does not cover the recent
PowerISA (power7, power8, and power9) enhancements.</para>

</section>
<section>

<title>1.1.3.4 Examples implemented using other intrinsics</title>

<para>Some intrinsic implementations are defined in terms of other
intrinsics. For example:</para>
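<para>From GCC's emmintrin.h (typedef repeated so the snippet is
self-contained):</para>

<programlisting>
```c
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

/* Create a vector with W in the lower element and 0.0 in the upper.  */
extern __inline __m128d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_set_sd (double __F)
{
  return __extension__ (__m128d){ __F, 0.0 };
}

/* Load a double into the lower lane; defined in terms of _mm_set_sd.  */
extern __inline __m128d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_load_sd (double const *__P)
{
  return _mm_set_sd (*__P);
}
```
</programlisting>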
<para/>

<para>This notion of using part (one fourth or half) of the SSE XMM
register and leaving the rest unchanged (or forced to zero) is specific to SSE
scalar operations and can generate some complicated (sub-optimal) PowerISA
code. In this case _mm_load_sd passes the dereferenced double value to
_mm_set_sd, which uses C vector initializer notation to combine (merge) that
double scalar value with a scalar 0.0 constant into a vector double.</para>

<para/>
<para>While code like this should work as-is for PPC64LE, you should look
at the generated code and assess if it is reasonable. In this case the code
is not awful (a load double splat, a vector xor to generate 0.0s, then a
xxmrghd to combine __F and 0.0). Other examples may generate sub-optimal code
and justify a rewrite to PowerISA scalar or vector code (<link
xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions">GCC
PowerPC AltiVec Built-in Functions</link> or inline assembler).</para>

<para/>
<para>Net: try using the existing C code if you can, but check on what the
compiler generates. If the generated code is horrendous, it may be worth the
effort to write a PowerISA specific equivalent. For code making extensive use
of MMX or SSE scalar intrinsics you will be better off rewriting it to use
standard C scalar types and letting the GCC compiler handle the details
(see <link linkend="">#2.1.Preferred methods|outline</link>).</para>

</section>
<section>

<title>2 How do we work this?</title>

<para>The working assumption is to start with the existing GCC headers from
./gcc/config/i386/, then convert them to PowerISA and add them to
./gcc/config/rs6000/. I assume we will replicate the existing header structure
and retain the existing header file and intrinsic names. This also allows us
to reuse existing DejaGNU test cases from ./gcc/testsuite/gcc.target/i386,
modify them as needed for the POWER target, and add them to
./gcc/testsuite/gcc.target/powerpc.</para>

<para/>
<para>We can be flexible on the sequence in which headers/intrinsics and
test cases are ported. This should be based on customer need and on resolving
internal dependencies. This implies an oldest-to-newest / bottom-up (MMX,
SSE, SSE2, …) strategy. The assumption is that existing community and user
application codes are more likely to have optimized code for the previous
generation of ubiquitous (SSE, SSE2, ...) processors than for the latest (and
rare) SkyLake AVX512.</para>

<para/>
<para>I would start with an existing header from the current GCC
./gcc/config/i386/ and copy the header comment (including FSF copyright) down
to any vector typedefs used in the API or implementation. Skip the Intel
intrinsic implementation code for now, but add the ending #endif matching the
header's conditional guard against multiple inclusion. You can add #include
<alternative> as needed. For example:</para>
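<para>The resulting skeleton might look like the following sketch. The guard
macro follows the i386 emmintrin.h convention; the copyright comment is
elided here:</para>

<programlisting>
```c
/* Hedged sketch of a freshly started header port.  */
#ifndef _EMMINTRIN_H_INCLUDED
#define _EMMINTRIN_H_INCLUDED

/* SSE2 builds on the SSE types and helpers, so the real header starts
   with:  #include <xmmintrin.h>  */

/* Vector typedefs used by the API and the implementation.  */
typedef double    __m128d __attribute__ ((__vector_size__ (16),
                                          __may_alias__));
typedef long long __m128i __attribute__ ((__vector_size__ (16),
                                          __may_alias__));

/* ... intrinsic implementations are added here in small groups ...  */

#endif /* _EMMINTRIN_H_INCLUDED */
```
</programlisting>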
<para/>

<para>Then you can start adding small groups of related intrinsic
implementations to the header, to be compiled and the generated code examined.
Once you have what looks like reasonable code you can grep through
./gcc/testsuite/gcc.target/i386 for examples using the intrinsic names you
just added. You should be able to find functional tests for most X86
intrinsics.</para>

<para/>
<para>The <link
xlink:href="https://gcc.gnu.org/onlinedocs/gccint/Testsuites.html#Testsuites">GCC
testsuite</link> uses the DejaGNU test framework as documented in the <link
xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">GNU Compiler Collection
(GCC) Internals</link> manual. GCC adds its own DejaGNU directives and
extensions, which are embedded in the testsuite source as comments. Some are
platform specific and will need to be adjusted for tests that are ported to
our platform. For example:</para>

<para>should become something like:</para>
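<para>For instance, the directive translation might look like the following.
The exact options and effective-target names depend on the specific test, so
treat these as illustrative:</para>

<programlisting>
```c
/* An i386 test might carry DejaGNU directives like:

     { dg-do run }
     { dg-options "-O2 -msse2" }
     { dg-require-effective-target sse2 }

   which for the powerpc target become something like:

     { dg-do run }
     { dg-options "-O2 -mvsx" }
     { dg-require-effective-target vsx_hw }
*/
```
</programlisting>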
<para/>

<para>Repeat this process until you have equivalent implementations for all
the intrinsics in that header and associated test cases that execute without
error.</para>

</section>
<section>

<title>2.1 Preferred methods</title>

<para>As we will see there are multiple ways to implement the logic of
these intrinsics. Some implementation methods are preferred because they allow
the compiler to select instructions and provide the most flexibility for
optimization across the whole sequence. Other methods may be required to
deliver a specific semantic or to deliver better optimization than the current
compiler is capable of. Some methods are more portable across multiple
compilers (GCC, LLVM, ...). All of this should be taken into consideration for
each intrinsic implementation. In general we should use the following list as
a guide to these decisions:</para>

<orderedlist>
<listitem>

<para/>

</listitem>

<listitem>

<para>Use C vector arithmetic, logical, dereference, etc., operators in
preference to intrinsics.</para>

</listitem>

<listitem>

<para>Use the bi-endian interfaces from Appendix A of the ABI in
preference to other intrinsics when available, as these are designed for
portability among compilers.</para>

</listitem>

<listitem>

<para>Use other, less well documented intrinsics (such as
__builtin_vsx_*) when no better facility is available, in preference to
assembly.</para>

</listitem>

<listitem>

<para>If necessary, use inline assembly, but know what you're
doing.</para>

</listitem>

</orderedlist>

<para/>

</section>
<section>

<title>2.2 Prepare yourself</title>

<para>To port Intel intrinsics to POWER you will need to prepare yourself
with knowledge of PowerISA vector facilities and of how to access the
associated documentation.</para>

<para/>

<orderedlist>
<listitem>

<para><link
xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Vector-Extensions.html#Vector-Extensions">GCC
vector extension</link> syntax and usage. This is one of a set of GCC “<link
xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/C-Extensions.html#C-Extensions">Extensions
to the C language family</link>” that the intrinsic header implementation
depends on. As many of the GCC intrinsics for x86 are implemented via C vector
extensions, reading and understanding this code is an important part of the
porting process.</para>

</listitem>
<listitem>

<para>Intel (x86) intrinsic and type naming conventions and how to find
more information. The intrinsic name encodes some information about the
vector size and the type of the data, but the pattern is not always obvious.
Using the online <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#">Intel
Intrinsics Guide</link> to look up the intrinsic by name is a good first
step.</para>

</listitem>
<listitem>

<para>PowerISA vector facilities. The vector facilities of POWER8 are
extensive and cover the usual types and usual operations. However they have a
different history and organization from Intel's. Both (Intel and PowerISA)
have their quirks and in some cases the mapping may not be obvious. So
familiarizing yourself with the PowerISA Vector (VMX) and Vector Scalar
Extension (VSX) facilities is important.</para>

</listitem>

</orderedlist>

<para/>

</section>
<section>

<title>2.2.1 GCC Vector Extensions</title>

<para>The GCC vector extensions are a common syntax but are implemented in a
target specific way. Using the C vector extensions requires the __gnu_inline__
attribute to avoid syntax errors in case the user specified C standard
compliance (-std=c90, -std=c11, etc.) that would normally disallow such
extensions.</para>

<para/>
<para>The GCC implementation for PowerPC64 Little Endian is (mostly)
functionally compatible with x86_64 vector extension usage. We can use the
same type definitions (at least for vector_size (16)), operations, brace
syntax ({...}) for vector initializers and constants, and array syntax ([])
for vector element access. So simple arithmetic / logical operations on whole
vectors should work as is.</para>

<para/>
<para>The caveat is that the interface data type of the Intel intrinsic may
not match the data types of the operation, so it may be necessary to cast the
operands to the specific type for the operation. This also applies to vector
initializers and accessing vector elements. You need to use the appropriate
type to get the expected results. Of course this applies to x86_64 as well.
For example:</para>
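<para>A sketch matching the description (close to what the xmmintrin.h port
does, though the actual header may differ in detail):</para>

<programlisting>
```c
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));
typedef float __v4sf __attribute__ ((__vector_size__ (16)));

/* Packed float add: cast the __m128 operands to __v4sf, add, cast back.  */
extern __inline __m128
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ps (__m128 __A, __m128 __B)
{
  return (__m128) ((__v4sf)__A + (__v4sf)__B);
}

/* Extract the left-most (element [0]) float using array syntax.  */
extern __inline float
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cvtss_f32 (__m128 __A)
{
  return ((__v4sf)__A)[0];
}
```
</programlisting>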
<para>Note the cast from the interface type (__m128) to the implementation
type (__v4sf, defined in the intrinsic header) for the vector float add (+)
operation. This is enough for the compiler to select the appropriate vector
add instruction for the float type. Then the result (which is __v4sf) needs to
be cast back to the expected interface type (__m128).</para>

<para/>
|
|
|
|
|
<para>Note also the use of array syntax ((__v4sf)__A)[0] to extract the lowest (left most<footnote><para>Here we are using logical left and logical right which will not match the PowerISA register view in Little Endian. Logical left is the left most element for initializers {left, … , right}, storage order and array order where the left most element is [0].</para></footnote>) element of a vector. The cast (__v4sf) ensures that the compiler knows we are extracting the left most 32-bit float. The compiler ensures the code generated matches the Intel behavior for PowerPC64 Little Endian. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The code generation is complicated by the fact that PowerISA vector registers are Big Endian (element 0 is the left most word of the vector) while X86 scalar stores are from the left most (word/dword) element of the vector register. Application code with extensive use of scalar (vs packed) intrinsic loads / stores should be flagged for rewrite to native PPC code using existing scalar types (float, double, int, long, etc.). </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Another example is the set reverse order:</para>
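<para>A sketch of what such a “set reverse order” intrinsic (e.g. _mm_setr_pd) can look like when implemented with initializer syntax; the typedefs stand in for the header's own definitions:</para>

```c
#include <assert.h>

/* Stand-ins for the intrinsic header typedefs.  */
typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

/* Set-reverse-order collects two scalars into a vector in natural
   (initializer / array index) order: __W becomes element [0].  */
static inline __m128d
_mm_setr_pd (double __W, double __X)
{
  return (__m128d){ __W, __X };
}
```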
|
|
|
|
|
<para>Note the use of initializer syntax to collect a set of scalars into a vector. Code with constant initializer values will generate a vector constant in the appropriate endian order. However code with variables in the initializer can get complicated, as it often requires transfers between register sets and perhaps format conversions. We can assume that the compiler will generate correct code, but if this class of intrinsics shows up in a hot spot, a rewrite to native PPC vector built-ins may be appropriate. For example, an initializer that replicates a variable to all the vector fields might not be recognized as a “load and splat”, and making this explicit may help the compiler generate better code.</para>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.2.2 Intel Intrinsic functions</title>
|
|
|
|
|
<para>So what is an intrinsic function? From Wikipedia:</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>In <link
|
|
|
|
|
xlink:href="https://en.wikipedia.org/wiki/Compiler_theory">compiler theory</link>, an
|
|
|
|
|
intrinsic function is a function available for use in a given <link
|
|
|
|
|
xlink:href="https://en.wikipedia.org/wiki/Programming_language">programming
|
|
|
|
|
language</link> whose implementation is handled specially by the compiler.
|
|
|
|
|
Typically, it substitutes a sequence of automatically generated instructions
|
|
|
|
|
for the original function call, similar to an <link
|
|
|
|
|
xlink:href="https://en.wikipedia.org/wiki/Inline_function">inline function</link>.
|
|
|
|
|
Unlike an inline function though, the compiler has an intimate knowledge of the
|
|
|
|
|
intrinsic function and can therefore better integrate it and optimize it for
|
|
|
|
|
the situation. This is also called builtin function in many languages.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The “Intel Intrinsics” API provides access to the many instruction set extensions (Intel Technologies) that Intel has added (and continues to add) over the years. The intrinsics provided access to new instruction capabilities before the compilers could exploit them directly. Initially these intrinsic functions were defined for the Intel and Microsoft compilers and were eventually implemented and contributed to GCC.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The Intel Intrinsics have a specific type and naming structure. In this naming structure, function names start with a common prefix (MMX and SSE use the _mm_ prefix, while AVX added the _mm256_ and _mm512_ prefixes), then a short functional name (set, load, store, add, mul, blend, shuffle, …) and a suffix (_pd, _sd, _pi32...) with type and packing information. See <link linkend="">Appendix B</link> for the list of common intrinsic suffixes.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Oddly, many of the MMX/SSE operations are not vectors at all. There are a lot of scalar operations on a single float, double, or long long type. In effect these are scalars that can take advantage of the larger (XMM) register space. Also, in the Intel 32-bit architecture, these extensions provided IEEE754 float and double types, and 64-bit integers, that did not exist or were hard to implement in the base i386/387 instruction set. These scalar operations use a suffix starting with '_s' (_sd for scalar double float, _ss for scalar float, and _si64 for scalar long long).</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>True vector operations use the packed or extended packed suffixes, starting with '_p' or '_ep' (_pd for vector double, _ps for vector float, and _epi32 for vector int). The use of '_ep' seems to be reserved to disambiguate intrinsics that existed in the (64-bit vector) MMX extension from the extended (128-bit vector) SSE equivalent. For example _mm_add_pi32 is an MMX operation on a pair of 32-bit integers, while _mm_add_epi32 is an SSE2 operation on a vector of four 32-bit integers. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The GCC builtins for the <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions">i386.target</link> (includes x86 and x86_64) are not the same as the Intel Intrinsics. While they have similar intent and cover most of the same functions, they use a different naming convention (prefixed with __builtin_ia32_, then the function name with a type suffix) and use GCC vector type modes for operand types. For example:</para>
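<para>For example, the GCC x86 built-in documentation lists prototypes along the following lines (v2df is GCC's vector mode type for two doubles; this is a declaration listing, not compilable code):</para>

```c
/* Prototypes as listed in the GCC x86 built-in function documentation.  */
v2df __builtin_ia32_addpd (v2df, v2df);
v2df __builtin_ia32_addsd (v2df, v2df);
```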
|
|
|
|
|
<para>Note: A key difference between GCC builtins for i386 and PowerPC is that the x86 builtins have different names for each operation and type, while the PowerPC Altivec builtins tend to have a single generic builtin for each operation, across a set of compatible operand types. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>In GCC the Intel Intrinsic header (*intrin.h) files are implemented as a set of inline functions using the Intel Intrinsic API names and types. These functions are implemented as either GCC C vector extension code or via one or more GCC builtins for the i386 target. So let's take a look at some examples from GCC's SSE2 intrinsic header emmintrin.h:</para>
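<para>The following shows the shape of the x86 emmintrin.h implementations under discussion (paraphrased from GCC's header; __builtin_ia32_addsd exists only for the i386 target, so this fragment is x86-only):</para>

```c
extern __inline __m128d
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_pd (__m128d __A, __m128d __B)
{
  /* Packed double add via the C vector extension '+'.  */
  return (__m128d) ((__v2df)__A + (__v2df)__B);
}

extern __inline __m128d
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_sd (__m128d __A, __m128d __B)
{
  /* Scalar double add via the x86 builtin.  */
  return (__m128d) __builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
}
```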
|
|
|
|
|
<para/>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Note that _mm_add_pd is implemented directly as C vector extension code, while _mm_add_sd is implemented via the GCC builtin __builtin_ia32_addsd. From the discussion above we know the _pd suffix indicates a packed vector double while the _sd suffix indicates a scalar double in an XMM register. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.2.2.1 Packed vs scalar intrinsics</title>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>So what is actually going on here? The vector code is clear enough if you know that the '+' operator is applied to each vector element. The intent of the builtin is a little less clear, as the GCC documentation for __builtin_ia32_addsd is not very helpful (nonexistent). So perhaps the <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_pd&expand=97">Intel Intrinsic Guide</link> will be more enlightening. To paraphrase:</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>From the <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_pd&expand=97">_mm_add_pd description</link>: each double float element ([0] and [1], or bits [63:0] and [127:64]) of operands a and b is added, and the resulting vector is returned. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>From the <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_sd&expand=97,130">_mm_add_sd description</link>: add element 0 of the first operand (a[0]) to element 0 of the second operand (b[0]) and return the packed vector double {(a[0] + b[0]), a[1]}. Or said differently, the sum of the logical left most halves of the operands is returned in the logical left most half (element [0]) of the result, along with the logical right half (element [1]) of the first operand (unchanged) in the logical right half of the result.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>So the packed double is easy enough, but the scalar double details are more complicated. One source of complication is that while both Instruction Set Architectures (SSE vs VSX) support scalar floating point operations in vector registers, the semantics are different. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<orderedlist>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>The vector bit and field numbering is different (reversed).
|
|
|
|
|
</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>The handling of the non-scalar part of the register for scalar operations is different.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
</orderedlist>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>To minimize confusion and use consistent nomenclature, I will try to use the terms logical left and logical right elements based on the order they appear in C vector initializers and element index order. So in the vector (__v2df){1.0, 2.0}, the value 1.0 is in the logical left element [0] and the value 2.0 is in the logical right element [1].</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>So let's look at how to implement these intrinsics for the PowerISA. For example, in this case we can use the GCC vector extension, like so:</para>
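<para>A sketch of the port using only GCC vector extensions; the typedefs stand in for those in the ported headers, and this is illustrative rather than the authoritative ported source:</para>

```c
#include <assert.h>

/* Stand-ins for the intrinsic header typedefs.  */
typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

/* Packed double add: the '+' applies to both elements.  */
static inline __m128d
_mm_add_pd (__m128d __A, __m128d __B)
{
  return (__m128d) ((__v2df)__A + (__v2df)__B);
}

/* Scalar double add: operate on element [0] only and pass the
   __A[1] element through unchanged.  */
static inline __m128d
_mm_add_sd (__m128d __A, __m128d __B)
{
  __A[0] = __A[0] + __B[0];
  return (__A);
}
```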
|
|
|
|
|
<para/>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The packed double implementation operates on the vector as a whole. The scalar double implementation operates on and updates only the [0] element of the vector and leaves the __A[1] element unchanged. From this source the GCC compiler generates the following code for the PPC64LE target:</para>
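<para>The generated code should look something like the following hand-written illustration of the sequence described below; the actual register allocation and splat selectors produced by the compiler will vary:</para>

```asm
_mm_add_pd:                  # packed double add
        xvadddp 34,34,35     # both elements in one VSX vector add
        blr
_mm_add_sd:                  # scalar double add
        xxspltd 0,34,0       # splat __A[0] into scalar position
        xxspltd 35,35,0      # splat __B[0] into scalar position
        xsadddp 35,0,35      # VSX scalar add double
        xxmrghd 34,35,34     # merge result with original __A[1]
        blr
```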
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The packed vector double generated the corresponding VSX vector double add (xvadddp). But the scalar implementation is a bit more complicated. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>First, in the PPC64LE vector format, element [0] is not in the correct position for the scalar operations. So the compiler generates vector splat double (xxspltd) instructions to copy elements __A[0] and __B[0] into position for the VSX scalar add double (xsadddp) that follows. However the VSX scalar operation leaves the other half of the VSR undefined (which does not match the expected Intel semantics). So the compiler must generate a vector merge high double (xxmrghd) instruction to combine the original __A[1] element (from vs34) with the scalar add result from vs35 element [1]. This merge swings the scalar result from the vs35[1] element into the vs34[0] position, while preserving the original vs34[1] (from __A[1]) element (copied to itself).<footnote><para>Fun fact: The vector registers in PowerISA are decidedly Big Endian. But we decided to make the PPC64LE ABI behave like a Little Endian system to make application porting easier. This requires the compiler to manipulate the PowerISA vector intrinsics behind the scenes to get the correct Little Endian results. For example the element selector [0|1] for vec_splat and the generation of vec_mergeh vs vec_mergel are reversed for Little Endian.</para></footnote></para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>This technique applies to packed and scalar intrinsics for the usual arithmetic operators (add, subtract, multiply, divide). Using GCC vector extensions in these intrinsic implementations provides the compiler more opportunity to optimize the whole function. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Now we can look at a slightly more interesting (complicated) case. Square root (sqrt) is not an arithmetic operator in C and is usually handled with a library call or a compiler builtin. We really want to avoid library calls and any unexpected side effects. As you see below, the implementations of the <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_pd&expand=4926">_mm_sqrt_pd</link> and <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_sd&expand=4926,4956">_mm_sqrt_sd</link> intrinsics are based on GCC x86 built-ins. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>For the packed vector sqrt, the PowerISA VSX has an equivalent vector double square root instruction and GCC provides the vec_sqrt builtin. But the scalar implementation involves an additional parameter and an extra move. This seems intended to mimic the propagation of the __A[1] input to the logical right half of the XMM result that we saw with _mm_add_sd above.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The instinct is to extract the low scalar (__B[0]) from operand __B
|
|
|
|
|
and pass this to the GCC __builtin_sqrt () before recombining that scalar
|
|
|
|
|
result with __A[1] for the vector result. Unfortunately C language standards
|
|
|
|
|
force the compiler to call the libm sqrt function unless -ffast-math is
|
|
|
|
|
specified. The -ffast-math option is not commonly used and we want to avoid the
|
|
|
|
|
external library dependency for what should be only a few inline instructions.
|
|
|
|
|
So this is not a good option.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Thinking outside the box: we do have an inline intrinsic for a (packed) vector double sqrt that we just implemented. However we need to ensure the other half of __B (__B[1]) does not cause any harmful side effects (like raising exceptions for NaN or negative values). The simplest solution is to splat __B[0] to both halves of a temporary value before taking the vec_sqrt. Then this result can be combined with __A[1] to return the final result. For example:</para>
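<para>A sketch along those lines. This fragment is PowerPC-specific (it relies on altivec.h's vec_sqrt), and it defines local stand-in typedefs and set/setr helpers of the kind the ported headers provide:</para>

```c
#include <altivec.h>

typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

static inline __m128d
_mm_set1_pd (double __F)
{
  return (__m128d){ __F, __F };
}

static inline __m128d
_mm_setr_pd (double __W, double __X)
{
  return (__m128d){ __W, __X };
}

static inline __m128d
_mm_sqrt_pd (__m128d __A)
{
  return (__m128d) vec_sqrt ((__v2df)__A);
}

static inline __m128d
_mm_sqrt_sd (__m128d __A, __m128d __B)
{
  __v2df c;
  /* Splat __B[0] so the unused half cannot raise spurious exceptions.  */
  c = (__v2df) _mm_sqrt_pd (_mm_set1_pd (__B[0]));
  /* Combine sqrt(__B[0]) with the unchanged __A[1].  */
  return (__m128d) _mm_setr_pd (c[0], __A[1]);
}
```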
|
|
|
|
|
<para>In this example we use <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_set1_pd&expand=4926,4956,4926,4956,4652">_mm_set1_pd</link> to splat the scalar __B[0] before passing that vector to our _mm_sqrt_pd implementation, then pass the sqrt result (c[0]) with __A[1] to <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_setr_pd&expand=4679">_mm_setr_pd</link> to combine the final result. You could also use the {c[0], __A[1]} initializer instead of _mm_setr_pd.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Now we can look at vector and scalar compares that add their own complications. For example:</para>
|
|
|
|
|
<para>The Intel Intrinsic Guide for <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmpeq_pd&expand=779,788,779">_mm_cmpeq_pd</link> describes comparing double elements [0|1] and returning either all 0s for not equal or all 1s (0xFFFFFFFFFFFFFFFF, or long long -1) for equal. The comparison result is intended as a select mask (predicate) for selecting or ignoring specific elements in later operations. The scalar version <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmpeq_sd&expand=779,788">_mm_cmpeq_sd</link> is similar except for the quirk of only comparing element [0] and combining the result with __A[1] to return the final vector result.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The packed vector implementation for PowerISA is simple, as VSX provides the equivalent instruction and GCC provides the vec_cmpeq builtin supporting the vector double type. The technique of using scalar comparison operators on __A[0] and __B[0] does not work, as the C comparison operators return 0 or 1 results while we need the vector select mask (effectively 0 or -1). Also we need to watch for sequences that mix scalar floats and integers, generating if/then/else logic or requiring expensive transfers across register banks.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>In this case we are better off using explicit vector built-ins, following _mm_add_sd as an example. We can use vec_splat from element [0] to temporaries where we can safely use vec_cmpeq to generate the expected selector mask. Note that vec_cmpeq returns a bool long type, so we need to cast the result back to __v2df. Then use the (__m128d){c[0], __A[1]} initializer to combine the comparison result with the original __A[1] input and cast to the required interface type. So we have this example:</para>
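<para>A sketch of that sequence. This is PowerPC-specific (altivec.h built-ins); vec_splats is used here as the splat-from-scalar form, and the typedefs stand in for the header's own:</para>

```c
#include <altivec.h>

typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

static inline __m128d
_mm_cmpeq_sd (__m128d __A, __m128d __B)
{
  __v2df a, b, c;
  /* Splat element [0] to both halves so vec_cmpeq is safe to use.  */
  a = vec_splats (__A[0]);
  b = vec_splats (__B[0]);
  /* vec_cmpeq returns vector bool long long; cast back to __v2df.  */
  c = (__v2df) vec_cmpeq (a, b);
  /* Combine the compare mask with the unchanged __A[1].  */
  return (__m128d) { c[0], __A[1] };
}
```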
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Now let's look at a similar example that adds some surprising complexity. This is the compare not equal case, so we should be able to find the equivalent vec_cmpne builtin:</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.2.2.2 To vec_not or not</title>
|
|
|
|
|
<para>Well not exactly. Looking at the OpenPOWER ABI document we see a
|
|
|
|
|
reference to vec_cmpne for all numeric types. But when we look in the current
|
|
|
|
|
GCC 6 documentation we find that vec_cmpne is not on the list. So it is planned
|
|
|
|
|
in the ABI, but not implemented yet. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Looking at the PowerISA 2.07B we find a VSX Vector Compare Equal to Double-Precision but no Not Equal. In fact we see only vector double compare instructions for greater than and greater than or equal in addition to the equal compare. Not only can we not find a not equal, there are no less than or less than or equal compares either. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>So what is going on here? Partially this is the Reduced Instruction
|
|
|
|
|
Set Computer (RISC) design philosophy. In this case the compiler can generate
|
|
|
|
|
all the required compares using the existing vector instructions and simple
|
|
|
|
|
transforms based on Boolean algebra. So vec_cmpne(A,B) is simply vec_not
|
|
|
|
|
(vec_cmpeq(A,B)). And vec_cmplt(A,B) is simply vec_cmpgt(B,A) based on the
|
|
|
|
|
identity A < B iff B > A. Similarly vec_cmple(A,B) is implemented as
|
|
|
|
|
vec_cmpge(B,A).</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Wait a minute, there is no vec_not() either. We cannot find it in the PowerISA, the OpenPOWER ABI, or the GCC PowerPC Altivec Built-in documentation. There is no vec_move() either! How can this possibly work?</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>This is RISC philosophy again. We can always use a logical instruction (like bit-wise and or or) to effect a move, given that we also have nondestructive 3-register instruction forms. In the PowerISA most instructions have two input registers and a separate result register. So if the result register number is different from either input register then the inputs are not clobbered (nondestructive). Of course nothing prevents you from specifying the same register for both inputs or even all three registers (result and both inputs). And sometimes that is useful.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The statement B = vec_or (A,A) is effectively a vector move/copy from A to B. And A = vec_or (A,A) is obviously a nop (no operation). In fact the PowerISA defines the preferred nop and register move for vector registers in this way.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>It is also useful to have hardware implement the logical operators nor (not or) and nand (not and). The PowerISA provides these instructions for fixed-point and vector logical operations. So vec_not(A) can be implemented as vec_nor(A,A). So looking at the implementation of _mm_cmpneq_pd we propose the following:</para>
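<para>A sketch of the proposed implementation (PowerPC-specific; not-equal is built from vec_cmpeq followed by vec_nor, and the typedefs stand in for the header's own):</para>

```c
#include <altivec.h>

typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

static inline __m128d
_mm_cmpneq_pd (__m128d __A, __m128d __B)
{
  /* Compare equal, then invert the mask: vec_not(x) == vec_nor(x,x).  */
  __v2df temp = (__v2df) vec_cmpeq ((__v2df)__A, (__v2df)__B);
  return ((__m128d) vec_nor (temp, temp));
}
```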
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The Intel Intrinsics also include the not forms of the relational
|
|
|
|
|
compares:</para>
|
|
|
|
|
<para>The PowerISA, OpenPOWER ABI, and GCC PowerPC Altivec Built-in documentation do not provide any direct equivalents to the not greater than class of compares. Again, you don't really need them if you know Boolean algebra. We can use identities like {not (A < B) iff A >= B} and {not (A <= B) iff A > B}. So the PPC64LE implementation follows:</para>
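<para>A sketch for the not-greater-than class using those identities (PowerPC-specific; shown for the packed double forms, with stand-in typedefs):</para>

```c
#include <altivec.h>

typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

/* not (A > B) iff A <= B.  */
static inline __m128d
_mm_cmpngt_pd (__m128d __A, __m128d __B)
{
  return ((__m128d) vec_cmple ((__v2df)__A, (__v2df)__B));
}

/* not (A >= B) iff A < B.  */
static inline __m128d
_mm_cmpnge_pd (__m128d __A, __m128d __B)
{
  return ((__m128d) vec_cmplt ((__v2df)__A, (__v2df)__B));
}
```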
|
|
|
|
|
<para>These patterns repeat for the scalar version of the not compares. And
|
|
|
|
|
in general the larger pattern described in this chapter applies to the other
|
|
|
|
|
float and integer types with similar interfaces.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.2.2.3 Crossing lanes</title>
|
|
|
|
|
<para>We have seen that, most of the time, vector SIMD units prefer to keep computations in the same “lane” (element number) as the input elements. The only exceptions in the examples so far are the occasional splat (copy one element to all the other elements of the vector) operations. Splat is an example of the general category of “permute” operations (Intel would call this a “shuffle” or “blend”). Permutes select and rearrange the elements of (usually) a concatenated pair of vectors and deliver those selected elements, in a specific order, to a result vector. The selection and order of elements in the result is controlled by a third vector, either as a third input vector or an immediate field of the instruction.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>For example, consider the Intel intrinsics for <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd&expand=2757,4767,409,2757">Horizontal Add / Subtract</link> added with SSE3. These intrinsics add (subtract) adjacent element pairs across a pair of input vectors, placing the sums of adjacent elements in the result vector. For example <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd_ps&expand=2757,4767,409,2757,2757">_mm_hadd_ps</link>, which implements the operation for float:</para>
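<para>Paraphrasing the Intel Intrinsics Guide pseudo-code for _mm_hadd_ps (the ranges are bit positions within the 128-bit XMM value):</para>

```
dst[31:0]   := a[63:32]  + a[31:0]
dst[63:32]  := a[127:96] + a[95:64]
dst[95:64]  := b[63:32]  + b[31:0]
dst[127:96] := b[127:96] + b[95:64]
```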
|
|
|
|
|
<para>Horizontal Add (hadd) provides an incremental vector “sum across” operation commonly needed in matrix and vector transform math. Horizontal Add is incremental as you need three hadd instructions to sum across 4 vectors of 4 elements (7 for 8 x 8, 15 for 16 x 16, …).</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The PowerISA does not have a sum-across operation for float or double. We can use the vector float add instruction after we rearrange the inputs so that element pairs line up for the horizontal add. For example we would need to permute the input vectors {1, 2, 3, 4} and {101, 102, 103, 104} into vectors {2, 4, 102, 104} and {1, 3, 101, 103} before the vec_add. This requires two vector permutes to align the elements into the correct lanes for the vector add (to implement Horizontal Add). </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The PowerISA provides a generalized byte-level vector permute (vperm) based on a vector register pair as source input and a control vector. The control vector provides 16 indexes (0-31) to select bytes from the concatenated input vector register pair (VRA, VRB). A more specific set of permute (pack, unpack, merge, splat) operations (across element sizes) are encoded as separate instruction opcodes or instruction immediate fields.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Unfortunately only the general vec_perm can provide the realignment we need for the _mm_hadd_ps operation or any of the int or short variants of hadd. For example:</para>
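<para>A sketch of _mm_hadd_ps using vec_perm (PowerPC-specific). The two control vectors are byte indexes into the concatenated 32-byte {__X, __Y} pair, selecting the even and odd word elements respectively; the typedefs stand in for the header's own:</para>

```c
#include <altivec.h>

typedef float __v4sf __attribute__ ((__vector_size__ (16)));
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));

static inline __m128
_mm_hadd_ps (__m128 __X, __m128 __Y)
{
  /* Select words 0,2 of __X and 0,2 of __Y ...  */
  __vector unsigned char xform2 = {
    0x00, 0x01, 0x02, 0x03,  0x08, 0x09, 0x0A, 0x0B,
    0x10, 0x11, 0x12, 0x13,  0x18, 0x19, 0x1A, 0x1B };
  /* ... and words 1,3 of __X and 1,3 of __Y.  */
  __vector unsigned char xform1 = {
    0x04, 0x05, 0x06, 0x07,  0x0C, 0x0D, 0x0E, 0x0F,
    0x14, 0x15, 0x16, 0x17,  0x1C, 0x1D, 0x1E, 0x1F };
  /* Add the two realigned vectors to complete the horizontal add.  */
  return (__m128) vec_add (vec_perm ((__v4sf)__X, (__v4sf)__Y, xform2),
                           vec_perm ((__v4sf)__X, (__v4sf)__Y, xform1));
}
```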
|
|
|
|
|
<para/>
|
|
|
|
|
<para>This requires two permute control vectors: one to select the even word elements across __X and __Y, and another to select the odd word elements across __X and __Y. The results of these permutes (vec_perm) are inputs to the vec_add, which completes the hadd operation. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Fortunately the permute required for the double (64-bit) case (i.e. _mm_hadd_pd) reduces to the equivalent of vec_mergeh / vec_mergel doubleword (which are variants of VSX Permute Doubleword Immediate). So the implementation of _mm_hadd_pd can be simplified to this:</para>
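<para>A sketch of the simplified doubleword form (PowerPC-specific, using vec_mergeh / vec_mergel; the typedefs stand in for the header's own):</para>

```c
#include <altivec.h>

typedef double __v2df __attribute__ ((__vector_size__ (16)));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

static inline __m128d
_mm_hadd_pd (__m128d __X, __m128d __Y)
{
  /* Merge high picks {__X[0], __Y[0]}, merge low picks {__X[1], __Y[1]};
     one vector add then completes the horizontal add.  */
  return (__m128d) vec_add (vec_mergeh ((__v2df)__X, (__v2df)__Y),
                            vec_mergel ((__v2df)__X, (__v2df)__Y));
}
```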
|
|
|
|
|
<para>This eliminates the load of the control vectors required by the
|
|
|
|
|
previous example.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.2.3 PowerISA Vector facilities.</title>
|
|
|
|
|
<para>The PowerISA vector facilities (VMX and VSX) are extensive, but do not always provide a direct or obvious functional equivalent to the Intel Intrinsics. But not being obvious is not the same as impossible. It just requires some basic programming skills.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>It is a good idea to have an overall understanding of the vector capabilities of the PowerISA. You do not need to memorize every instruction, but it helps to know where to look. Both the PowerISA and OpenPOWER ABI have a specific structure and organization that can help you find what you are looking for. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>It also helps to understand the relationship between the PowerISA's low-level instructions and the higher abstraction of the vector intrinsics as defined by the OpenPOWER ABI's Vector Programming Interfaces and the de facto standard of GCC's PowerPC AltiVec Built-in Functions.</para>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.2.3.1 The PowerISA</title>
|
|
|
|
|
<para>The PowerISA is, for historical reasons, organized at the top level by the distinction between the older Vector Facility (Altivec / VMX) and the newer Vector-Scalar Floating-Point Operations (VSX). </para>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.2.3.1.1 The Vector Facility (VMX)</title>
|
|
|
|
|
<para>The original VMX supported SIMD integer byte, halfword, and word, and single float data types within a separate (from GPR and FPR) bank of 32 x 128-bit vector registers. These operations like to stay within their (SIMD) lanes except where the operation changes the element data size (integer multiply, pack, and unpack). </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>This is complemented by bit logical and shift / rotate / permute / merge instructions that operate on the vector as a whole. Some operations (permute, pack, merge, shift double, select) will select 128 bits from a pair of vectors (256 bits) and deliver a 128-bit vector result. These instructions will cross lanes or multiple registers to grab fields and assemble them into the single register result.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The PowerISA 2.07B Chapter 6 Vector Facility is organized starting with an overview (chapters 6.1 - 6.6):</para>
|
|
|
|
|
<para>Then a chapter on storage (load/store) access for vector and vector
|
|
|
|
|
elements:</para>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.2.3.1.1.1 Vector permute and formatting instructions</title>
|
|
|
|
|
<para>The vector permute and formatting chapter follows and is an important one to study. These operations operate on the byte, halfword, word (and with 2.07, doubleword) integer types, plus the special pixel type. The shift instructions in this chapter operate on the vector as a whole at either the bit or the byte (octet) level. This is an important chapter to study for moving PowerISA vector results into the vector elements that Intel Intrinsics expect:</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The Vector Integer instructions include the add / subtract / multiply / multiply add/sum / (no divide) operations for the standard integer types. There are instruction forms that provide signed, unsigned, modulo, and saturate results for most operations. The PowerISA 2.07 extension's add / subtract of 128-bit integers, with carry and extend to 256-bit, 512-bit, and beyond, is included here. There are signed / unsigned compares across the standard integer types (byte, … doubleword), the usual bit-wise logical operations, and the SIMD shift / rotate instructions that operate on the vector elements for various types.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The vector [single] float instructions are grouped into this chapter. This chapter does not include the double float instructions, which are described in the VSX chapter. VSX also includes additional float instructions that operate on the whole 64-register vector-scalar set.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The vector XOR based instructions are new with PowerISA 2.07 (POWER8)
|
|
|
|
|
and provide vector crypto and check-sum operations:</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The vector gather and bit permute instructions support bit-level rearrangement of bits within the vector, while the vector versions of count leading zeros and population count are useful to accelerate specific algorithms. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The Decimal Integer add / subtract instructions complement the Decimal Floating-Point instructions. They can also be used to accelerate some binary to/from decimal conversions. The VSCR instruction provides access to the Non-Java mode floating-point control and the saturation status. These instructions are not normally of interest in porting Intel intrinsics.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>With PowerISA 2.07B (POWER8) several major extensions were added to the Vector Facility:</para>
|
|
|
|
|
<orderedlist>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>Vector Crypto: Under “Vector Exclusive-OR-based Instructions”: AES [inverse] Cipher, SHA 256 / 512 Sigma, Polynomial Multiplication, and Permute and XOR instructions.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>64-bit Integer: signed and unsigned add / subtract, signed and
unsigned compare, even / odd 32 x 32 multiply with 64-bit product, signed /
unsigned max / min, rotate and shift left/right.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>Direct Move between GPRs and the FPRs / left half of Vector
Registers.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>128-bit integer add / subtract with carry / extend, direct
|
|
|
|
|
support for vector __int128 and multiple precision arithmetic.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>Decimal Integer add / subtract for 31-digit BCD.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>Miscellaneous SIMD extensions: Count Leading Zeros, Population
Count, bit gather / permute, and vector forms of eqv, nand, orc.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
</orderedlist>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The rationale for why these are included in the Vector Facilities
(VMX) (vs Vector-Scalar Floating-Point Operations (VSX)) has more to do with
how the instructions were encoded than with the type of operations or the ISA
version of introduction. This is primarily a trade-off between the bits
required for register selection and the bits for extended op-code space within
a fixed 32-bit instruction. Basically, accessing 32 vector registers requires
5 bits per register, while accessing all 64 vector-scalar registers requires
6 bits per register. When you consider that most vector instructions require
3, and some (select, fused multiply-add) require 4, register operands, the
impact on op-code space is significant. The larger register set of VSX was
justified by queuing theory for larger HPC matrix codes using double float,
while 32 registers are sufficient for most applications.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>So by definition the VMX instructions are restricted to the original
|
|
|
|
|
32 vector registers while VSX instructions are encoded to access all 64
|
|
|
|
|
floating-point scalar and vector double registers. This distinction can be
|
|
|
|
|
troublesome when programming at the assembler level, but the compiler and
|
|
|
|
|
compiler built-ins can hide most of this detail from the programmer. </para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.2.3.1.2 Vector-Scalar Floating-Point Operations (VSX)</title>
|
|
|
|
|
<para>With PowerISA 2.06 (POWER7) we extended the vector SIMD capabilities
|
|
|
|
|
of the PowerISA:</para>
|
|
|
|
|
<orderedlist>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>Extend the available vector and floating-point scalar register
|
|
|
|
|
sets from 32 registers each to a combined 64 x 64-bit scalar floating-point and
|
|
|
|
|
64 x 128-bit vector registers.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>Enable scalar double float operations on all 64 scalar
|
|
|
|
|
registers.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>Enable vector double and vector float operations for all 64
|
|
|
|
|
vector registers.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>Enable super-scalar execution of vector instructions and support
2 independent vector floating-point pipelines for parallel execution of 4 x
64-bit floating-point fused multiply-adds (FMAs) and 8 x 32-bit FMAs per
cycle.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
</orderedlist>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>With PowerISA 2.07 (POWER8) we added single-precision scalar
floating-point instructions to VSX. This completes the floating-point
computational set for VSX. This ISA release also clarified how these operate in
the Little Endian storage model.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>While the focus was on enhanced floating-point computation (for High
Performance Computing), VSX also extended the ISA with additional storage
access, logical, and permute (merge, splat, shift) instructions. This was
necessary to extend these operations to cover all 64 VSX registers, and it
improves unaligned storage access for vectors (not available in VMX).</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The PowerISA 2.07B Chapter 7. Vector-Scalar Floating-Point Operations
is organized starting with an introduction and overview (sections 7.1 - 7.5).
The early sections (7.1 and 7.2) describe the layout of the 64 VSX registers
and how they relate to (overlap and inter-operate with) the existing
floating-point scalar registers (FPRs) and vector registers (VMX VRs).</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The definitions given in “7.1.1.1 Compatibility with Category
Floating-Point and Category Decimal Floating-Point Operations” and
“7.1.1.2 Compatibility with Category Vector Operations” are worth
careful study.</para>
|
|
|
|
|
<para>Note: the reference to scalar element 0 above is from the big endian
register perspective of the ISA. In the PPC64LE ABI implementation, and for the
purpose of porting Intel intrinsics, this is logical element 1. Intel SSE
scalar intrinsics operate on logical element [0], which is in the wrong
position for PowerISA FPU and VSX scalar floating-point operations. Another
important note is what happens to the other half of the VSR when you execute a
scalar floating-point instruction (The contents of doubleword 1 of a VSR …
are undefined.)</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The compiler will hide some of this detail when generating code for
little endian vector element [] notation and most vector built-ins. For example
vec_splat (A, 0) is transformed for PPC64LE to xxspltd VRT,VRA,1. What the
compiler cannot hide is the different placement of scalars within vector
registers.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Vector registers (VRs) 0-31 overlay, and can be accessed from, vector
scalar registers (VSRs) 32-63. The ABI also specifies that VR2-13 are used to
pass parameters and return values. In some cases the same (or similar)
operations exist in both VMX and VSX instruction forms, while in other cases
operations exist only for VMX (byte-level permute and shift) or VSX (vector
double).</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>So register selection that avoids unnecessary vector moves, follows
the ABI, and maintains the correct instruction-specific register numbering
can be tricky. Getting the <link
xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Machine-Constraints.html#Machine-Constraints">GCC register constraint</link> annotations right for inline
assembler using vector instructions is challenging, even for experts. So only
experts should be writing assembler, and then only in extraordinary
circumstances. You should leave these details to the compiler (using vector
extensions and vector built-ins) whenever possible.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The next sections get into the details of floating-point
representation, operations, and exceptions: basically, the implementation
details for the IEEE754R and C/C++ language standards that most developers only
access via higher level APIs. Most programmers will not need this level of
detail, but it is there if needed.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Finally, there is an overview of the VSX storage access instructions
for big and little endian and for aligned and unaligned data addresses. This
includes diagrams that illuminate the differences.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Section 7.6 starts with a VSX Instruction Set Summary, which is the
place to start to get a feel for the types and operations supported. The
emphasis on floating-point, both scalar and vector (especially vector double),
is pronounced. Many of the scalar and single-precision vector instructions look
like duplicates of what we have seen in the Chapter 4 Floating-Point and
Chapter 6 Vector facilities. The difference here is new instruction encodings
to access the full 64 VSX register space.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>In addition, a small number of logical instructions are
included to support predication (selecting / masking vector elements based on
compare results), along with a set of permute, merge, shift, and splat
instructions that operate on VSX word (float) and doubleword (double) elements.
As mentioned for VMX section 6.8, these instructions are good to study as they
are useful for realigning elements from PowerISA vector results to the layout
required for Intel intrinsics.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The VSX Instruction Descriptions section contains the detailed
description for each VSX category instruction. The table entries from the
Instruction Set Summary are formatted in the document as hyperlinks to the
corresponding instruction descriptions.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.2.3.2 PowerISA Vector Intrinsics</title>
|
|
|
|
|
<para>The OpenPOWER ELF V2 application binary interface (ABI): Chapter 6.
Vector Programming Interfaces and Appendix A. Predefined Functions for Vector
Programming document the current and proposed vector built-ins we expect all
C/C++ compilers to implement.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Some of these operations are endian sensitive, and the compiler needs
to make corresponding adjustments as it generates code for endian-sensitive
built-ins. There is a good overview of this in the OpenPOWER ABI section 6.4.
Vector Built-in Functions.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Appendix A is organized (sorted) by built-in name, output type, then
parameter types. Most built-ins are generic, as the named operation (add,
sub, mul, cmpeq, ...) applies to multiple types.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>So the generic vec_add built-in applies to all the signed and unsigned
integer types (char, short, int, and long) plus the float and double
floating-point types. The compiler looks at the parameter type to select the
vector instruction (or instruction sequence) that implements the (add)
operation on that type. The compiler infers the output result type from the
operation and input parameters and will complain if the target variable type is
not compatible. For example:</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>This is one key difference between PowerISA built-ins and Intel
|
|
|
|
|
Intrinsics (Intel Intrinsics are not generic and include type information in
|
|
|
|
|
the name). This is why it is so important to understand the vector element
|
|
|
|
|
types and to add the appropriate type casts to get the correct results.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The de facto standard implementation is GCC, as defined in the include
file <altivec.h> and documented in the GCC online documentation in <link
xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions">6.59.20 PowerPC
AltiVec Built-in Functions</link>. The header file name and section title
reflect the origin of the Vector Facility, but recent versions of GCC altivec.h
include built-ins for the newer PowerISA 2.06 and 2.07 VMX plus VSX extensions.
This is a work in progress, where your (older) distro GCC compiler may not
include built-ins for the latest PowerISA 3.0 or ABI edition. So before you use
a built-in you find in the ABI Appendix A, check the specific <link
xlink:href="https://gcc.gnu.org/onlinedocs/">GCC online documentation</link> for the
GCC version you are using.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.2.3.3 How vector elements change size and type</title>
|
|
|
|
|
<para>Most vector built-ins return the same vector type as the (first)
input parameters, but there are exceptions. Examples include: conversions
between types, compares, pack, unpack, merge, and integer multiply
operations.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Converting floats to / from integers will change the type and sometimes
change the element size as well (double ↔ int and float ↔ long). For
VMX the conversions are always the same size (float ↔ [unsigned] int). But
VSX allows conversion of 64-bit (long or double) to / from 32-bit (float or
int) with the inherent size changes. The PowerISA VSX defines a 4-element
vector layout where little endian elements 0, 2 are used for input/output and
elements 1, 3 are undefined. The OpenPOWER ABI Appendix A defines vec_double
and vec_float with even/odd and high/low extensions as programming aids. These
are not included in GCC 7 or earlier but are planned for GCC 8.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Compare operations produce either vector bool <input element
type> (effectively bit masks) or predicates (the condition code for all and
any are represented as an int truth variable). When a predicate compare (i.e.
vec_all_eq, vec_any_gt) is used in an if statement, the condition code is
used directly in the conditional branch and the int truth value is not
generated.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Pack operations pack integer elements into the next smaller (half)
integer-sized elements. Pack operations include signed and unsigned saturate
and unsigned modulo forms. As the packed result will be half the size (in
bits), pack instructions require 2 vectors (256 bits) as input and generate a
single 128-bit vector result.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Unpack operations expand integer elements into the next larger size
elements. The integers are always treated as signed values and sign-extended.
The processor design avoids instructions that return multiple register values.
So the PowerISA defines unpack-high and unpack-low forms, where each instruction
takes (the high or low) half of the vector elements and extends them to fill
the vector output. Element order is maintained, and an unpack high / low
sequence with the same input vector has the effect of unpacking to a 256-bit
result in two vector registers.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Merge operations resemble shuffling two (vector) card decks
together, alternating (element) cards in the result. As we are merging from
2 vectors (256 bits) into 1 vector (128 bits) and the elements do not change
size, we have merge-high and merge-low instruction forms for each (byte,
halfword and word) integer type. The merge-high operation alternates elements
from the (vector register left) high halves of the two input vectors. The
merge-low operation alternates elements from the (vector register right) low
halves of the two input vectors.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>For PowerISA 2.07 we added vector merge word even / odd instructions.
Instead of high or low elements, the shuffle is from the even or odd numbered
elements of the two input vectors. Passing the same vector to both inputs of a
merge produces splat-like results for each doubleword half, which is handy in
some convert operations.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Integer multiply has the potential to generate twice as many bits in
the product as the input. A multiply of 2 int (32-bit) values produces a long
(64 bits). Normal C language * operations ignore this and discard the top
32 bits of the result. However, in some computations it is useful to preserve
the doubled product precision for intermediate computation before reducing the
final result back to the original precision.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The PowerISA VMX instruction set took the latter approach, i.e. keep
all the product bits until the programmer explicitly asks for the truncated
result. So the vector integer multiplies are split into even/odd forms across
signed and unsigned byte, halfword and word inputs. This requires two
instructions (given the same inputs) to generate the full vector multiply
across 2 vector registers and 256 bits. Again, as POWER processors are
super-scalar, this pair of instructions should execute in parallel.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The set of expanded product values can either be used directly in
further (doubled precision) computation or merged/packed into a single
vector at the smaller bit size. This is what the compiler will generate for C
vector extension multiply of vector integer types.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.2.4 Some more Intrinsic examples</title>
|
|
|
|
|
<para>The intrinsic <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvtpd_ps&expand=1624">_mm_cvtpd_ps</link> converts a packed vector double into
a packed vector single float. Since only 2 doubles fit into a 128-bit vector,
only 2 floats are returned and occupy only half (64 bits) of the XMM register.
For this intrinsic the 64 bits are packed into the logical left half of the
register and the logical right half of the register is set to zero (as per the
Intel cvtpd2ps instruction).</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The PowerISA provides the VSX Vector round and Convert
Double-Precision to Single-Precision format (xvcvdpsp) instruction. In the ABI
this is vec_floato (vector double). This instruction converts each double
element, then transfers converted element 0 to float element 1, and converted
element 1 to float element 3. Float elements 0 and 2 are undefined (the
hardware can do whatever). This does not match the expected results for
_mm_cvtpd_ps.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>So we need to re-position the results to word elements 0 and 2, which
allows a pack operation to deliver the correct format. Here the merge-odd
splats element 1 to 0 and element 3 to 2. The pack operation combines the low
half of each doubleword from the vector result and a vector of zeros to
generate the required format.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>This technique is also used to implement <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvttpd_epi32&expand=1624,1859">_mm_cvttpd_epi32</link>, which converts a packed
vector double into a packed vector int. The PowerISA instruction xvcvdpsxws
uses a similar layout for the result as xvcvdpsp and requires the same
fix-up.</para>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.3 Profound differences </title>
|
|
|
|
|
<para>We have already mentioned above a number of architectural differences
that affect porting of codes containing Intel intrinsics to POWER. One issue is
that Intel supports multiple vector extensions with different vector widths
(64, 128, 256, and 512 bits) while the PowerISA only supports vectors of
128 bits. Another is the difference in how the respective ISAs support scalars
in vector registers. In the text above we propose workable alternatives for the
PowerPC port. There are also differences in the handling of floating-point
exceptions and rounding modes that may impact the application's performance or
behavior.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.3.1 Floating Point Exceptions</title>
|
|
|
|
|
<para>Nominally both ISAs support the IEEE754 specifications, but there are
some subtle differences. Both architectures define a status and control
register to record exceptions and enable / disable floating-point exceptions
for program interrupt or default action. Intel has the MXCSR and PowerISA has
the FPSCR, which basically do the same thing but with different bit
layouts.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Intel provides the _mm_setcsr / _mm_getcsr intrinsics to allow direct
access to the MXCSR. In the early days, before the OS POSIX run-times were
updated to manage the MXCSR, this might have been useful. Today this would be
highly discouraged, with a strong preference to use the POSIX APIs
(feclearexceptflag, fegetexceptflag, fesetexceptflag, ...) instead.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>If we implement _mm_setcsr / _mm_getcsr at all, we should simply
redirect the implementation to use the POSIX APIs from <fenv.h>. But it
might be simpler just to replace these intrinsics with macros that generate
#error.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The Intel MXCSR does have some non-standard (beyond POSIX/IEEE754)
quirks: the Flush-To-Zero and Denormals-Are-Zeros flags. These simplify the
hardware response to what should be a rare condition (underflows where the
result cannot be represented in the exponent range and precision of the format)
by simply returning a signed 0.0 value. The intrinsic header implementation
does provide constant masks for _MM_DENORMALS_ZERO_ON (<pmmintrin.h>) and
_MM_FLUSH_ZERO_ON (<xmmintrin.h>), so technically these are available to
users of the Intel Intrinsics API.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>The VMX Vector facility provides a separate Vector Status and Control
Register (VSCR) with a Non-Java Mode control bit. This control combines the
flush-to-zero semantics for floating-point underflow and denormal values. But
this control only applies to VMX vector float instructions and does not apply
to VSX scalar floating-point or vector double instructions. The FPSCR does
define a Floating-Point non-IEEE mode, which is optional in the architecture.
This would apply to scalar and VSX floating-point operations if it were
implemented, but it was largely intended for embedded processors and is not
implemented in the POWER processor line.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>As flush-to-zero is primarily a performance enhancement and is
clearly outside the IEEE754 standard, it may be best to simply ignore this
option for the intrinsic port.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.3.2 Floating-point rounding modes</title>
|
|
|
|
|
<para>The Intel (x86 / x86_64) and PowerISA architectures both support the
4 IEEE754 rounding modes. Again, while the Intel Intrinsic API allows the
application to change rounding modes via updates to the MXCSR, it is a bad idea
and should be replaced with the POSIX APIs (fegetround and
fesetround).</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.3.3 Performance</title>
|
|
|
|
|
<para>The performance of a ported intrinsic depends on the specifics of the
intrinsic and the context it is used in. Many of the SIMD operations have
equivalent instructions in both architectures. For example the vector float and
vector double operations match very closely. However, the SSE and VSX scalars
have subtle differences in how the scalar is positioned within the vector
registers and what happens to the rest (the non-scalar part) of the register
(previously discussed <link linkend="">here</link>). This requires additional
PowerISA instructions to preserve the non-scalar portion of the vector
registers. This may or may not be important to the logic of the program being
ported, but we have to handle the case where it is.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>This is where the context of how the intrinsic is used starts to
matter. If the scalar intrinsics are used within a larger program, the compiler
may be able to eliminate the redundant register moves, as the results are never
used. In other cases, common set-up (like permute vectors or bit masks) can
be factored out and hoisted out of the loop. So it is very important to let the
compiler do its job with higher optimization levels (-O3,
-funroll-loops).</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.3.3.1 Using SSE float and double scalars</title>
|
|
|
|
|
<para>“Hand” optimization with SSE scalar float / double intrinsics is no
longer necessary. This was important when SSE was initially introduced and
compiler support was limited or nonexistent. Also, SSE scalar float / double
provided additional (16) registers and IEEE754 compliance not available from
the 8087 floating-point architecture that preceded it. So application
developers were motivated to use SSE instructions versus what the compiler was
generating at the time.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>Modern compilers can now generate and optimize these (SSE
scalar) instructions for Intel from C standard scalar code. Of course PowerISA
supported IEEE754 float and double and had 32 dedicated floating-point
registers from the start (and now 64 with VSX). So replacing an Intel-specific
scalar intrinsic implementation with the equivalent C language scalar
implementation is usually a win; it allows the compiler to apply the latest
optimization and tuning for the latest generation processor, and it is portable
to other platforms where the compiler can also apply the latest optimization
and tuning for that processor's latest generation.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>2.3.3.2 Using MMX intrinsics</title>
|
|
|
|
|
<para>MMX was the first and oldest SIMD extension and initially filled a
need for wider (64-bit) integer operations and additional registers. This was
back when processors were 32-bit and 8 x 32-bit registers were starting to
cramp our programming style. Now 64-bit processors, larger register sets, and
128-bit (or larger) vector SIMD extensions are common. There is simply no good
reason to write new code using the (now) very limited MMX
capabilities.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
<para>We recommend that existing MMX codes be rewritten to use the newer
SSE and VMX/VSX intrinsics, or the more portable GCC builtin vector
support, or, in the case of si64 operations, C scalar code. The MMX si64
scalars are just (64-bit) operations on long long int types, and any
modern C compiler can handle this type. The char / short / int SIMD operations
should all be promoted to 128-bit SIMD operations on GCC builtin vectors. Both
changes will improve cross-platform portability and performance.</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>Appendix A: Document References</title>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>A.1 OpenPOWER and Power documents</title>
|
|
|
|
|
<para>
|
|
|
|
|
<link
xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">OpenPOWER™ Technical Specification</link>
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
<link
xlink:href="https://openpowerfoundation.org/?resource_lib=ibm-power-isa-version-2-07-b">Power ISA™ Version 2.07 B</link>
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
<link xlink:href="https://www.docdroid.net/tWT7hjD/powerisa-v30.pdf.html">Power
ISA™ Version 3.0</link>
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
<link
xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">Power Architecture 64-bit ELF ABI Specification (AKA OpenPower
ABI for Linux Supplement)</link>
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
<link
xlink:href="http://www.nxp.com/assets/documents/data/en/reference-manuals/ALTIVECPEM.pdf">AltiVec™ Technology Programming Environments Manual</link>
|
|
|
|
|
</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>A.2 Intel Documents</title>
|
|
|
|
|
<para>
|
|
|
|
|
<link xlink:href="https://software.intel.com/en-us/articles/intel-sdm">Intel®
|
|
|
|
|
64 and IA-32 Architectures Software Developer’s Manual</link>
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
<link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel™ Intrinsics Guide</link>
|
|
|
|
|
</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>A.3 GNU Compiler Collection (GCC) documents</title>
|
|
|
|
|
<para>
|
|
|
|
|
<link xlink:href="https://gcc.gnu.org/onlinedocs/">GCC online
|
|
|
|
|
documentation</link>
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/">GCC Manual
|
|
|
|
|
(GCC 6.3)</link>
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
<link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">GCC Internals
|
|
|
|
|
Manual</link>
|
|
|
|
|
</para>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section>
|
|
|
|
|
<title>Appendix B: Intel Intrinsic suffixes</title>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>B.1 MMX</title>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>B.2 SSE</title>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>B.3 SSE2</title>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>B.4 AVX/AVX2 __m256_*</title>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>B.5 AVX512 __m512_*</title>
|
|
|
|
|
<para/>
|
|
|
|
|
</section>
|
|
|
|
|
|
|
|
|
|
</article>
|