<?xml version="1.0" encoding="UTF-8"?>
|
|
|
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
|
|
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
|
|
|
<article lang="">
|
|
|
<section>
|
|
|
<title>1 Intel Intrinsic porting guide for Power64LE.</title>
|
|
|
<para>The goal of this project is to provide functional equivalents of the
Intel MMX, SSE, and AVX intrinsic functions that are commonly used in Linux
applications, and to make them (or equivalents) available for the PowerPC64LE
platform. These X86 intrinsics started with the Intel and Microsoft compilers
but were then ported to the GCC compiler. The GCC implementation is a set of
headers with inline functions. These inline functions provide an implementation
mapping from the Intel/Microsoft dialect intrinsic names to the corresponding
GCC Intel built-ins, or implement them directly via C language vector extension
syntax.</para>
|
|
|
<para/>
|
|
|
<para>The current proposal is to start with the existing X86 GCC intrinsic
headers and port them (copy and change the source) to POWER using C language
vector extensions and VMX and VSX built-ins. Another key assumption is that we
will be able to use many of the existing Intel DejaGNU test cases in
./gcc/testsuite/gcc.target/i386. This document is intended as a guide for
developers participating in this effort. However, it also provides
guidance and examples that should be useful to developers who may encounter X86
intrinsics in code that they are porting to another platform.</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>1.1 Look at the source Luke</title>
|
|
|
<para>So if this is a code porting activity, where is the source? All the
source code we need to look at is in the GCC source trees. You can either clone
the GCC source with git (https://gcc.gnu.org/wiki/GitMirror) or download one of
the recent Advance Toolchain (AT) source tars (for example:
ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/ubuntu/dists/xenial/at10.0/).
You will find the intrinsic headers in the ./gcc/config/i386/
sub-directory.</para>
|
|
|
<para/>
|
|
|
<para>If you have an Intel Linux workstation or laptop with GCC installed,
you already have these headers if you want to take a look:</para>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para>But depending on the vintage of the distro, these may not be the
latest versions of the headers. Looking at the header source will tell you a
few things: the include structure (what other headers are implicitly
included), the types that are used at the API, and finally, how the API is
implemented.</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>1.1.1 The structure of the intrinsic includes</title>
|
|
|
<para>The GCC x86 intrinsic functions for vector operations were initially
grouped by technology (MMX and SSE). The headers start with MMX and continue
with SSE through SSE4.1, stacked like a set of Russian dolls.</para>
|
|
|
<para/>
|
|
|
<para>Basically, each higher-layer include needs typedefs and helper macros
defined by the lower-level intrinsic includes. mm_malloc.h simply provides
wrappers for posix_memalign and free. Then it gets a little weird, starting
with the crypto extensions. For AVX, AVX2, and AVX512 they must have decided
that the Russian dolls thing was getting out of hand. AVX et al. is split
across 14 files, but they do not want applications to include these
individually. So immintrin.h includes everything Intel vector, including all
the AVX, AES, SSE, and MMX flavors.</para>
|
|
|
<para/>
|
|
|
<para>So what is the net? The include structure provides some strong clues
about the order in which we should approach this effort. For example, if you
need an intrinsic from SSE4.1 (smmintrin.h), you are likely to need the type
definitions from SSE2 (emmintrin.h). So a bottom-up (MMX, SSE, SSE2, …)
approach seems like the best plan of attack. Also, saving the AVX parts for
later makes sense, as most are just wider forms of operations that already
exist in SSE.</para>
|
|
|
<para/>
|
|
|
<para>We should use the same include structure to implement our PowerISA
|
|
|
equivalent API headers. This will make porting easier (drop-in replacement) and
|
|
|
should get the application running quickly on POWER. Then we are in a position
|
|
|
to profile and analyze the resulting application. This will show any hot spots
|
|
|
where the simple one-to-one transformation results in bottlenecks and
|
|
|
additional tuning is needed. For these cases we should improve our tools (SDK
|
|
|
MA/SCA) to identify opportunities for, and perhaps propose, alternative
|
|
|
sequences that are better tuned to PowerISA and our micro-architecture.</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>1.1.2 The types used for intrinsics</title>
|
|
|
<para>The type system for Intel intrinsics is a little strange. For example
|
|
|
from xmmintrin.h:</para>
|
|
|
<para/>
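<para>As a minimal sketch of the pattern this section discusses (not a copy of
xmmintrin.h), an API type and an internal type can be declared with GCC vector
attributes as shown here. The sk_ names are hypothetical stand-ins chosen to
avoid clashing with the real headers; the only difference between the two
types is the <literal>__may_alias__</literal> attribute.</para>
<programlisting><![CDATA[
```c
/* API-facing type: four floats in 16 bytes, allowed to alias any other type.
   Hypothetical stand-in for the role __m128 plays in the API. */
typedef float sk_m128 __attribute__ ((__vector_size__ (16), __may_alias__));

/* Internal type: same layout, but without __may_alias__.
   Hypothetical stand-in for the role __v4sf plays in the implementation. */
typedef float sk_v4sf __attribute__ ((__vector_size__ (16)));

/* Read the first lane of 16-byte aligned storage.  The pointer cast is the
   kind of access that __may_alias__ exists to make safe. */
static inline float
sk_first_lane (const void *p)
{
  const sk_m128 *v = (const sk_m128 *) p;
  return ((sk_v4sf) *v)[0];
}
```
]]></programlisting>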
|
|
|
<para>So there is one set of types that are used in the function prototypes
of the API, and the internal types that are used in the implementation. Notice
the special attribute <literal>__may_alias__</literal>. According to the GCC
documentation, accesses through pointers to a type with this attribute are not
subject to type-based alias analysis and are instead assumed to be able to
alias any other type of object. So there are a couple of issues here: 1) the
API seems to force the compiler to assume aliasing of any parameter passed by
reference. Normally the compiler assumes that parameters of different types do
not overlap in storage, which allows more optimization. 2) the data type used
at the interface may not be the correct type for the implied operation. So
parameters of type __m128i (which is defined as vector long long) are also used
for parameters and return values of vector [char | short | int].</para>
|
|
|
<para/>
|
|
|
<para>This may not matter when using x86 built-ins, but it does matter when
the implementation uses C vector extensions or, in our case, PowerPC generic
vector built-ins (<link linkend="">2.1.3.2 PowerISA Vector
Intrinsics</link>). For the latter cases the type must be correct for the
compiler to select the correct element type (char, short, int, long) form of
the generic builtin operation (<link linkend="">1.1.3 How the API is
implemented</link>). There is also concern that excessive use of
<literal>__may_alias__</literal> will limit compiler optimization. We are not
sure how important this attribute is to the correct operation of the API. So at
a later stage we should experiment with removing it from our implementation for
PowerPC.</para>
|
|
|
<para/>
|
|
|
<para>The good news is that PowerISA has good support for 128-bit vectors
and (with the addition of VSX) all the required vector data types (char, short,
int, long, float, double). However, Intel supports a wider variety of vector
sizes than PowerISA does. This started with the 64-bit MMX vector support that
preceded SSE and extends to the 256-bit and 512-bit vectors of AVX, AVX2, and
AVX512 that followed SSE.</para>
|
|
|
<para/>
|
|
|
<para>Within the GCC Intel intrinsic implementation these are all
implemented as vector attribute extensions of the appropriate size
(<literal>__vector_size__</literal> (8), (16), (32), or (64)). For the PowerPC
target, GCC currently only supports the native
<literal>__vector_size__</literal> (16). These we can support directly in
VMX/VSX registers and associated instructions. GCC will compile with other
<literal>__vector_size__</literal> values, but the resulting types are treated
as simple arrays of the element type. This does not allow the compiler to use
the vector registers and vector instructions for these (non-native) vectors. So
what is a programmer to do?</para>
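<para>Before answering, the size behaviour described above can be observed with
a small experiment. This is a sketch under stated assumptions: the sk_ type and
function names are hypothetical, and the code relies only on the generic GCC
vector extensions, so it says nothing by itself about which instructions a
given target selects.</para>
<programlisting><![CDATA[
```c
/* Hypothetical names; the same element type at three vector sizes. */
typedef int sk_v2si __attribute__ ((__vector_size__ (8)));   /* MMX sized */
typedef int sk_v4si __attribute__ ((__vector_size__ (16)));  /* SSE, VMX/VSX sized */
typedef int sk_v8si __attribute__ ((__vector_size__ (32)));  /* AVX sized */

/* Element-wise add on a 256-bit type.  Operands are passed by pointer so the
   sketch does not depend on how a target passes 32-byte vectors by value. */
static inline void
sk_add_v8si (const sk_v8si *a, const sk_v8si *b, sk_v8si *r)
{
  *r = *a + *b;
}
```
]]></programlisting>
<para>Compiling a caller of sk_add_v8si with -S for each target is one way to
check whether a given size is mapped to vector registers or handled as an array
of elements.</para>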
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>1.1.2.1 Dealing with MMX</title>
|
|
|
<para>MMX is actually the hard case. The __m64 type supports SIMD vector
|
|
|
int types (char, short, int, long). The Intel API defines __m64 as:</para>
|
|
|
<para/>
|
|
|
<para>This is problematic for the PowerPC target (it is not really supported
in GCC), and we would prefer to use a native PowerISA type that can be passed
in a single register. The PowerISA Rotate Under Mask instructions can easily
extract and insert integer fields of a General Purpose Register (GPR). This
implies that MMX integer types can be handled as an internal union of arrays
for the supported element types. So a 64-bit unsigned long long is the best
type for parameter passing and return values, especially for the 64-bit (_si64)
operations, as these normally generate a single PowerISA instruction.</para>
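<para>The union idea can be sketched as follows. This is an illustration, not
the project's implementation: sk_m64, sk_m64_union and sk_add_pi16 are
hypothetical names, and the lane loop is left to the compiler to turn into
whatever extract and insert sequence it prefers.</para>
<programlisting><![CDATA[
```c
/* Hypothetical stand-in for a GPR-friendly __m64. */
typedef unsigned long long sk_m64;

/* Internal view of the 64 bits as arrays of each supported element type. */
typedef union
{
  sk_m64 as_m64;
  unsigned char as_u8[8];
  unsigned short as_u16[4];
  unsigned int as_u32[2];
} sk_m64_union;

/* Wrapping add of four 16-bit lanes, in the spirit of _mm_add_pi16. */
static inline sk_m64
sk_add_pi16 (sk_m64 a, sk_m64 b)
{
  sk_m64_union x, y, r;
  int i;

  x.as_m64 = a;
  y.as_m64 = b;
  for (i = 0; i < 4; i++)
    r.as_u16[i] = (unsigned short) (x.as_u16[i] + y.as_u16[i]);
  return r.as_m64;
}
```
]]></programlisting>
<para>Because every lane is handled independently, a carry out of one 16-bit
lane never reaches its neighbour, which is the property a plain 64-bit add
would not give.</para>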
|
|
|
<para/>
|
|
|
<para>The SSE extensions include some convert operations for __m128 to /
from __m64, and this includes some int to / from float conversions. However, in
these cases the float operands always reside in SSE (XMM) registers (which
match the PowerISA vector registers) and the MMX registers only contain integer
values. POWER8 (PowerISA-2.07) has direct move instructions between GPRs and
VSRs. So these transfers are normally a single instruction and any conversions
can be handled in the vector unit.</para>
|
|
|
<para/>
|
|
|
<para>When transferring a __m64 value to a vector register we should also
execute an xxspltd instruction to ensure there is valid data in all four
element lanes before doing floating point operations. This avoids generating
extraneous floating point exceptions that might be generated by uninitialized
parts of the vector. The top two lanes will have the floating point results
that are in position for direct transfer to a GPR or to be stored via Store
Float Double (stfd). These operations are internal to the intrinsic
implementation and there is no requirement to keep temporary vectors in correct
Little Endian form.</para>
|
|
|
<para/>
|
|
|
<para>Also, for the smaller element sizes and higher element counts (MMX
_pi8 and _pi16 types), the number of Rotate Under Mask instructions required to
disassemble the 64-bit __m64 into elements, perform the element calculations,
and reassemble the elements into a single __m64 value can get large. In this
case we can generate shorter instruction sequences by transferring (via a
direct move instruction) the GPR __m64 value to a vector register, performing
the SIMD operation there, then transferring the __m64 result back to a
GPR.</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>1.1.2.2 Dealing with AVX and AVX512</title>
|
|
|
<para>AVX is a bit easier for PowerISA and the ELF V2 ABI. First we have
|
|
|
lots (64) of vector registers and a super scalar vector pipe-line (can execute
|
|
|
two or more independent 128-bit vector operations concurrently). Second the ELF
|
|
|
V2 ABI was designed to pass and return larger aggregates in vector
|
|
|
registers:</para>
|
|
|
<para/>
|
|
|
<orderedlist>
|
|
|
<listitem>
|
|
|
<para>Up to 12 qualified vector arguments can be passed in
|
|
|
v2–v13.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>A qualified vector argument corresponds to:</para>
|
|
|
</listitem>
|
|
|
</orderedlist>
|
|
|
<para>So the ABI allows for passing up to three structures, each
representing a 512-bit vector, and returning such a (512-bit) structure, all in
VMX registers. This can be extended further by spilling parameters (beyond 12 x
128-bit vectors) to the parameter save area, but we should not need that, as
most intrinsics only use 2 or 3 operands. Vector registers not needed for
parameter passing, along with an additional 8 volatile vector registers, are
available for scratch and local variables. All can be used by the application
without requiring register spill to the save area. So most intrinsic operations
on 256- or 512-bit vectors can be held within existing PowerISA vector
registers.</para>
|
|
|
<para/>
|
|
|
<para>For larger functions that might use multiple AVX 256 or 512-bit
|
|
|
intrinsics and, as a result, push beyond the 20 volatile vector registers, the
|
|
|
compiler will just allocate non-volatile vector registers by allocating a stack
|
|
|
frame and spilling non-volatile vector registers to the save area (as needed in
|
|
|
the function prologue). This frees up to 64 vectors (32 x 256-bit or 16 x
|
|
|
512-bit structs) for code optimization. </para>
|
|
|
<para/>
|
|
|
<para>Based on the specifics of our ISA and ABI we will not use
<literal>__vector_size__</literal> (32) or (64) in the PowerPC implementation
of the __m256 and __m512 types. Instead we will typedef structs of 2 or 4
vector (__m128) fields. This allows efficient handling of these larger data
types without requiring new GCC language extensions.</para>
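<para>A minimal sketch of the struct approach, assuming a 256-bit type built
from two 128-bit halves. The type names, the field names lo and hi, and the
function sk_mm256_add_ps are all hypothetical; the real headers may lay the
struct out differently.</para>
<programlisting><![CDATA[
```c
typedef float sk_m128 __attribute__ ((__vector_size__ (16), __may_alias__));
typedef float sk_v4sf __attribute__ ((__vector_size__ (16)));

/* 256-bit value as two native 128-bit halves.  Field names are hypothetical. */
typedef struct
{
  sk_m128 lo;
  sk_m128 hi;
} sk_m256;

/* Packed float add on the 256-bit struct: one 128-bit add per half. */
static inline sk_m256
sk_mm256_add_ps (sk_m256 a, sk_m256 b)
{
  sk_m256 r;

  r.lo = (sk_m128) ((sk_v4sf) a.lo + (sk_v4sf) b.lo);
  r.hi = (sk_m128) ((sk_v4sf) a.hi + (sk_v4sf) b.hi);
  return r;
}
```
]]></programlisting>
<para>The two half-width adds are independent, so a superscalar vector pipeline
is free to issue them concurrently.</para>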
|
|
|
<para/>
|
|
|
<para>In the end we should use the same type names and definitions as the
|
|
|
GCC X86 intrinsic headers where possible. Where that is not possible we can
|
|
|
define new typedefs that provide the best mapping to the underlying PowerISA
|
|
|
hardware.</para>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>1.1.3 How is this API implemented.</title>
|
|
|
<para>One pleasant surprise is that many (at least for the older Intel)
|
|
|
Intrinsics are implemented directly in C vector extension code and/or a simple
|
|
|
mapping to GCC target specific builtins. </para>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>1.1.3.1 Some simple examples</title>
|
|
|
<para>For example, a vector double splat looks like this:</para>
|
|
|
<para>Another example:</para>
|
|
|
<para>Note in the example above the cast to __v2df for the operation. Both
__m128d and __v2df are vector double, but __v2df does not have the
<literal>__may_alias__</literal> attribute. And one more example:</para>
|
|
|
<para>Note this requires a cast for the compiler to generate the correct
code for the intended operation. The parameters and result are the generic
__m128i, which is a vector long long with the <literal>__may_alias__</literal>
attribute. But the operation is a vector multiply low unsigned short (__v8hu).
So not only do we use the cast to drop the <literal>__may_alias__</literal>
attribute, but we also need to cast to the correct (vector unsigned short) type
for the specified operation.</para>
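<para>The three patterns described in this section (a splat through a vector
initializer, an arithmetic operator on the internal type, and a cast to the
element type the operation really needs) can be sketched as below. The code is
modeled on the style of the GCC headers but is not copied from them; all sk_
names are hypothetical, and plain static inline stands in for the attribute
set the real headers use.</para>
<programlisting><![CDATA[
```c
typedef double sk_m128d __attribute__ ((__vector_size__ (16), __may_alias__));
typedef double sk_v2df __attribute__ ((__vector_size__ (16)));
typedef long long sk_m128i __attribute__ ((__vector_size__ (16), __may_alias__));
typedef unsigned short sk_v8hu __attribute__ ((__vector_size__ (16)));

/* Splat: replicate one double into both lanes with a vector initializer. */
static inline sk_m128d
sk_set1_pd (double f)
{
  return __extension__ (sk_m128d){ f, f };
}

/* Add: cast to the internal type, use the C operator, cast back. */
static inline sk_m128d
sk_add_pd (sk_m128d a, sk_m128d b)
{
  return (sk_m128d) ((sk_v2df) a + (sk_v2df) b);
}

/* Multiply low: the interface type is vector long long, but the operation
   must be done on eight unsigned shorts, so the cast changes the element
   type as well as dropping __may_alias__. */
static inline sk_m128i
sk_mullo_epi16 (sk_m128i a, sk_m128i b)
{
  return (sk_m128i) ((sk_v8hu) a * (sk_v8hu) b);
}
```
]]></programlisting>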
|
|
|
<para/>
|
|
|
<para>I have successfully copied these (and similar) source snippets over
|
|
|
to the PPC64LE implementation unchanged. This of course assumes the associated
|
|
|
types are defined and with compatible attributes.</para>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>1.1.3.2 Those extra attributes</title>
|
|
|
<para>You may have noticed there are some special attributes:</para>
|
|
|
<para>So far I have been using these attributes unchanged.</para>
|
|
|
<para/>
|
|
|
<para>But most intrinsics map the Intel intrinsic to one or more target
|
|
|
specific GCC builtins. For example:</para>
|
|
|
<para/>
|
|
|
<para>The first intrinsic (_mm_load_pd) is implemented as a C vector pointer
reference, but from the comment it assumes the compiler will use a movapd
instruction, which requires 16-byte alignment (and will raise a
general-protection exception if not aligned). This implies that there is a
performance advantage for at least some Intel processors to keep the vector
aligned. The second intrinsic uses the explicit GCC builtin
__builtin_ia32_loadupd to generate the movupd instruction, which handles
unaligned references.</para>
|
|
|
<para/>
|
|
|
<para>The opposite assumption applies to POWER and PPC64LE, where GCC
generates the VSX lxvd2x / xxswapd instruction sequence by default, which
allows unaligned references. The PowerISA equivalent for aligned vector access
is the VMX lvx instruction and the vec_ld builtin, which forces quadword
aligned access (by ignoring the low order 4 bits of the effective address). The
lvx instruction does not raise alignment exceptions, but perhaps our
implementation of the Intel intrinsic should. This requires that we use
PowerISA VMX/VSX built-ins to ensure we get the expected results.</para>
|
|
|
<para/>
|
|
|
<para>The current prototype defines the following:</para>
|
|
|
<para>The aligned load intrinsic adds an assert which checks alignment
(to match the Intel semantic) and uses the GCC builtin vec_ld (which generates
an lvx). The assert generates extra code, but this can be eliminated by
defining NDEBUG at compile time. The unaligned load intrinsic uses the GCC
builtin vec_vsx_ld (which for PPC64LE generates lxvd2x / xxswapd for power8 and
will simplify to lxv or lxvx for power9). The same applies to _mm_store_pd /
_mm_storeu_pd, using vec_st and vec_vsx_st. These concepts extend to the
load/store intrinsics for vector float and vector int.</para>
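<para>The aligned and unaligned load semantics can be sketched in portable C as
follows. This is an assumption-laden illustration rather than the prototype
itself: it uses a pointer dereference and memcpy where the text above describes
vec_ld and vec_vsx_ld, and the sk_ names are hypothetical.</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef double sk_m128d __attribute__ ((__vector_size__ (16), __may_alias__));

/* Aligned load: check the 16-byte alignment the Intel semantic requires,
   then load through a vector pointer.  Define NDEBUG to drop the check. */
static inline sk_m128d
sk_load_pd (const double *p)
{
  assert (((uintptr_t) p & 0xfUL) == 0);
  return *(const sk_m128d *) p;
}

/* Unaligned load: copy the 16 bytes without assuming any alignment. */
static inline sk_m128d
sk_loadu_pd (const double *p)
{
  sk_m128d r;

  memcpy (&r, p, sizeof r);
  return r;
}
```
]]></programlisting>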
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>1.1.3.3 How did I find this out?</title>
|
|
|
<para>The next question is: where did I get the details above? The GCC
documentation for __builtin_ia32_loadupd provides minimal information (the
builtin name, parameters and return types). Not very informative.</para>
|
|
|
<para/>
|
|
|
<para>Looking up the Intel intrinsic description is more informative. You
can Google the intrinsic name or use the <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel
Intrinsic Guide</link> for this. The Intrinsic Guide is interactive and
includes Intel (chip) technology and text-based search capabilities. Clicking
on the intrinsic name opens a synopsis that includes the underlying instruction
name, a text description, operation pseudo code, and in some cases performance
information (latency and throughput).</para>
|
|
|
<para/>
|
|
|
<para>The key is to get a description of the intrinsic (operand fields and
|
|
|
types, and which fields are updated for the result) and the underlying Intel
|
|
|
instruction. If the Intrinsic guide is not clear you can look up the
|
|
|
instruction details in the “<link
|
|
|
xlink:href="https://software.intel.com/en-us/articles/intel-sdm">Intel® 64 and IA-32
|
|
|
Architectures Software Developer’s Manual</link>”.</para>
|
|
|
<para/>
|
|
|
<para>Information about the PowerISA vector facilities is found in the
<link
xlink:href="https://openpowerfoundation.org/?resource_lib=ibm-power-isa-version-2-07-b"
>PowerISA Version 2.07B</link> manual for POWER8 (and <link
xlink:href="https://www.docdroid.net/tWT7hjD/powerisa-v30.pdf.html">3.0 for
POWER9</link>), Book I, Chapter 6. Vector Facility and Chapter 7.
Vector-Scalar Floating-Point Operations. Another good reference is the <link
xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">OpenPOWER
ELF V2 application binary interface</link> (ABI)
document, Chapter 6. Vector Programming Interfaces and Appendix A. Predefined
Functions for Vector Programming.</para>
|
|
|
<para/>
|
|
|
<para>Another useful document is the original <link
xlink:href="http://www.nxp.com/assets/documents/data/en/reference-manuals/ALTIVECPEM.pdf">AltiVec
Technology Programming Interface Manual</link> (PIM), which has a
user-friendly structure and many helpful diagrams. But alas, the PIM does not
cover the recent PowerISA (power7, power8, and power9) enhancements.</para>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>1.1.3.4 Examples implemented using other intrinsics</title>
|
|
|
<para>Some intrinsic implementations are defined in terms of other
intrinsics. For example:</para>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para>This notion of using part (one fourth or half) of the SSE XMM
|
|
|
register and leaving the rest unchanged (or forced to zero) is specific to SSE
|
|
|
scalar operations and can generate some complicated (sub-optimal) PowerISA
|
|
|
code. In this case _mm_load_sd passes the dereferenced double value to
|
|
|
_mm_set_sd which uses C vector initializer notation to combine (merge) that
|
|
|
double scalar value with a scalar 0.0 constant into a vector double.</para>
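<para>The layering described here can be sketched as one intrinsic calling
another. The code is modeled on the description above rather than copied from
the headers, and the sk_ names are hypothetical.</para>
<programlisting><![CDATA[
```c
typedef double sk_m128d __attribute__ ((__vector_size__ (16), __may_alias__));

/* Build a vector whose lowest element is f and whose other element is 0.0. */
static inline sk_m128d
sk_set_sd (double f)
{
  return __extension__ (sk_m128d){ f, 0.0 };
}

/* Scalar load, defined in terms of the set operation above. */
static inline sk_m128d
sk_load_sd (const double *p)
{
  return sk_set_sd (*p);
}
```
]]></programlisting>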
|
|
|
<para/>
|
|
|
<para>While code like this should work as-is for PPC64LE, you should look
at the generated code and assess if it is reasonable. In this case the code
is not awful (a load double splat, a vector xor to generate 0.0s, then an
xxmrghd to combine __F and 0.0). Other examples may generate sub-optimal code
and justify a rewrite to PowerISA scalar or vector code (<link
xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions">GCC
PowerPC AltiVec Built-in Functions</link> or inline assembler).</para>
|
|
|
<para/>
|
|
|
<para>Net: try using the existing C code if you can, but check what the
compiler generates. If the generated code is horrendous, it may be worth the
effort to write a PowerISA specific equivalent. For codes making extensive use
of MMX or SSE scalar intrinsics you will be better off rewriting to use
standard C scalar types and letting the GCC compiler handle the details
(see <link linkend="">2.1 Preferred methods</link>).</para>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2 How do we work this?</title>
|
|
|
<para>The working assumption is to start with the existing GCC headers from
./gcc/config/i386/, then convert them to PowerISA and add them to
./gcc/config/rs6000/. I assume we will replicate the existing header structure
and retain the existing header file and intrinsic names. This also allows us to
reuse existing DejaGNU test cases from ./gcc/testsuite/gcc.target/i386, modify
them as needed for the POWER target, and add them to
./gcc/testsuite/gcc.target/powerpc.</para>
|
|
|
<para/>
|
|
|
<para>We can be flexible about the sequence in which headers/intrinsics and
test cases are ported. This should be based on customer need and on resolving
internal dependencies. This implies an oldest-to-newest / bottom-up (MMX,
SSE, SSE2, …) strategy. The assumption is that existing community and user
application codes are more likely to have optimized code for
previous-generation ubiquitous (SSE, SSE2, ...) processors than for the latest
(and rare) SkyLake AVX512.</para>
|
|
|
<para/>
|
|
|
<para>I would start with an existing header from the current GCC
./gcc/config/i386/ and copy the header comment (including the FSF copyright)
down to any vector typedefs used in the API or implementation. Skip the Intel
intrinsic implementation code for now, but add the ending #endif matching the
header's conditional guard against multiple inclusion. You can add #include
&lt;alternative&gt; as needed. For example:</para>
|
|
|
<para/>
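<para>The shape of such a starting header might look like the following sketch.
It is only an outline under stated assumptions: the guard macro and the sk_
type names are hypothetical placeholders, and the copied header comment is
represented by a short note rather than reproduced.</para>
<programlisting><![CDATA[
```c
/* The header comment and FSF copyright notice copied from the i386
   original would go here. */

#ifndef SK_EMMINTRIN_H_INCLUDED   /* hypothetical guard name */
#define SK_EMMINTRIN_H_INCLUDED

/* Alternative includes needed for the target would go here. */

/* Internal types, without __may_alias__. */
typedef double sk_v2df __attribute__ ((__vector_size__ (16)));
typedef long long sk_v2di __attribute__ ((__vector_size__ (16)));

/* API types, with __may_alias__. */
typedef long long sk_m128i __attribute__ ((__vector_size__ (16), __may_alias__));
typedef double sk_m128d __attribute__ ((__vector_size__ (16), __may_alias__));

/* Intrinsic implementations are added here later, in small related groups. */

#endif /* SK_EMMINTRIN_H_INCLUDED */
```
]]></programlisting>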
|
|
|
<para>Then you can start adding small groups of related intrinsic
|
|
|
implementations to the header to be compiled and examine the generated code.
|
|
|
Once you have what looks like reasonable code you can grep through
|
|
|
./gcc/testsuite/gcc.target/i386 for examples using the intrinsic names you
|
|
|
just added. You should be able to find functional tests for most X86
|
|
|
intrinsics. </para>
|
|
|
<para/>
|
|
|
<para>The <link
|
|
|
xlink:href="https://gcc.gnu.org/onlinedocs/gccint/Testsuites.html#Testsuites">GCC
|
|
|
testsuite</link> uses the DejaGNU test framework as documented in the <link
|
|
|
xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">GNU Compiler Collection (GCC)
|
|
|
Internals</link> manual. GCC adds its own DejaGNU directives and extensions,
|
|
|
that are embedded in the testsuite source as comments. Some are platform
|
|
|
specific and will need to be adjusted for tests that are ported to our
|
|
|
platform. For example</para>
|
|
|
<para>should become something like</para>
|
|
|
<para/>
|
|
|
<para>Repeat this process until you have equivalent implementations for all
|
|
|
the intrinsics in that header and associated test cases that execute without
|
|
|
error. </para>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.1 Preferred methods</title>
|
|
|
<para>As we will see, there are multiple ways to implement the logic of
these intrinsics. Some implementation methods are preferred because they allow
the compiler to select instructions and provide the most flexibility for
optimization across the whole sequence. Other methods may be required to
deliver a specific semantic or to deliver better optimization than the current
compiler is capable of. Some methods are more portable across multiple
compilers (GCC, LLVM, ...). All of this should be taken into consideration for
each intrinsic implementation. In general we should use the following list as a
guide to these decisions:</para>
|
|
|
<orderedlist>
|
|
|
<listitem>
|
|
|
<para/>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Use C vector arithmetic, logical, dereference, etc., operators in
|
|
|
preference to intrinsics.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Use the bi-endian interfaces from Appendix A of the ABI in
|
|
|
preference to other intrinsics when available, as these are designed for
|
|
|
portability among compilers.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Use other, less well documented intrinsics (such as
|
|
|
__builtin_vsx_*) when no better facility is available, in preference to
|
|
|
assembly.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>If necessary, use inline assembly, but know what you're
|
|
|
doing.</para>
|
|
|
</listitem>
|
|
|
</orderedlist>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2 Prepare yourself</title>
|
|
|
<para>To port Intel intrinsics to POWER you will need to prepare yourself
|
|
|
with knowledge of PowerISA vector facilities and how to access the associated
|
|
|
documentation.</para>
|
|
|
<para/>
|
|
|
<orderedlist>
|
|
|
<listitem>
|
|
|
<para><link
xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Vector-Extensions.html#Vector-Extensions">GCC
vector extension</link> syntax and usage. This is one of a set
of GCC “<link
xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/C-Extensions.html#C-Extensions">Extensions
to the C Language Family</link>” that the intrinsic header implementation
depends on. As many of the GCC intrinsics for x86 are implemented via C vector
extensions, reading and understanding this code is an important part of the
porting process.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Intel (x86) intrinsic and type naming conventions and how to find
|
|
|
more information. The intrinsic name encodes some information about the
|
|
|
vector size and type of the data, but the pattern is not always obvious.
|
|
|
Using the online <link
|
|
|
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#">Intel
|
|
|
Intrinsic Guide</link> to look up the intrinsic by name is a good first
|
|
|
step.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>PowerISA Vector facilities. The Vector facilities of POWER8 are
|
|
|
extensive and cover the usual types and usual operations. However it has a
|
|
|
different history and organization from Intel. Both (Intel and PowerISA) have
|
|
|
their quirks and in some cases the mapping may not be obvious. So familiarizing
|
|
|
yourself with the PowerISA Vector (VMX) and Vector Scalar Extensions (VSX) is
|
|
|
important.</para>
|
|
|
</listitem>
|
|
|
</orderedlist>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.1 GCC Vector Extensions</title>
|
|
|
<para>The GCC vector extensions are common syntax but implemented in a
target specific way. The intrinsic headers that use them declare each function
with the __gnu_inline__ attribute, so that it keeps GNU inline semantics even
when the user specifies C standard compliance (-std=c90, -std=c11, etc.), and
use the __extension__ keyword so that GNU syntax such as vector initializers
does not draw diagnostics in modes that would normally disallow such
extensions.</para>
|
|
|
<para/>
|
|
|
<para>The GCC implementation for PowerPC64 Little Endian is (mostly)
functionally compatible with x86_64 vector extension usage. We can use the same
type definitions (at least for vector_size (16)), operations,
<literal>{...}</literal> syntax for vector initializers and constants, and
<literal>[]</literal> array syntax for vector element access. So simple
arithmetic / logical operations on whole vectors should work as is.</para>
|
|
|
<para/>
|
|
|
<para>The caveat is that the interface data type of the Intel Intrinsic may
|
|
|
not match the data types of the operation, so it may be necessary to cast the
|
|
|
operands to the specific type for the operation. This also applies to vector
|
|
|
initializers and accessing vector elements. You need to use the appropriate
|
|
|
type to get the expected results. Of course this applies to X86_64 as well. For
|
|
|
example:</para>
|
|
|
<para>Note the cast from the interface type (__m128) to the implementation
type (__v4sf, defined in the intrinsic header) for the vector float add (+)
operation. This is enough for the compiler to select the appropriate vector add
instruction for the float type. Then the result (which is __v4sf) needs to be
cast back to the expected interface type (__m128).</para>
|
|
|
<para/>
|
|
|
<para>Note also the use of array syntax
(<literal>((__v4sf)__A)[0]</literal>) to extract the lowest
(left most<footnote><para>Here we are using logical left and logical right,
which will not match the PowerISA register view in Little Endian. Logical left
is the left most element for initializers {left, … , right}, storage order,
and array order, where the left most element is [0].</para></footnote>)
element of a vector. The cast (__v4sf) ensures that the compiler knows we are
extracting the left most 32-bit float. The compiler ensures the code generated
matches the Intel behavior for PowerPC64 Little Endian.</para>
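<para>The two patterns discussed above, a whole-vector operator with casts and
an element extract with array syntax, can be sketched as follows. The sk_ names
are hypothetical and the code relies only on the generic GCC vector
extensions.</para>
<programlisting><![CDATA[
```c
typedef float sk_m128 __attribute__ ((__vector_size__ (16), __may_alias__));
typedef float sk_v4sf __attribute__ ((__vector_size__ (16)));

/* Packed float add: cast to the implementation type, add, cast back. */
static inline sk_m128
sk_add_ps (sk_m128 a, sk_m128 b)
{
  return (sk_m128) ((sk_v4sf) a + (sk_v4sf) b);
}

/* Extract the lowest (logical left most) float with array syntax. */
static inline float
sk_cvtss_f32 (sk_m128 a)
{
  return ((sk_v4sf) a)[0];
}
```
]]></programlisting>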
|
|
|
<para/>
|
|
|
<para>The code generation is complicated by the fact that PowerISA vector
registers are Big Endian (element 0 is the left most word of the vector) and
X86 scalar stores are from the left most (word/dword) element of the vector
register. Application code with extensive use of scalar (vs packed) intrinsic
loads / stores should be flagged for rewrite to native PPC code using existing
scalar types (float, double, int, long, etc.).</para>
|
|
|
<para/>
|
|
|
<para>Another example is the set reverse order:</para>
|
|
|
<para>Note the use of initializer syntax to collect a set of scalars
into a vector. Code with constant initializer values will generate a vector
constant of the appropriate endian. However, code with variables in the
initializer can get complicated, as it often requires transfers between
register sets and perhaps format conversions. We can assume that the compiler
will generate the correct code, but if this class of intrinsics shows up as a
hot spot, a rewrite to native PPC vector built-ins may be appropriate. For
example, an initializer of a variable replicated to all the vector fields might
not be recognized as a “load and splat”, and making this explicit may help
the compiler generate better code.</para>
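<para>The initializer patterns discussed here can be sketched as below. This is
an illustration with hypothetical sk_ names: the first function places its
arguments in storage order, and the second replicates one variable into every
lane, which is the case the text suggests may deserve an explicit splat on
POWER.</para>
<programlisting><![CDATA[
```c
typedef float sk_m128 __attribute__ ((__vector_size__ (16), __may_alias__));
typedef float sk_v4sf __attribute__ ((__vector_size__ (16)));

/* Set in "reverse" order: the first argument lands in element [0]. */
static inline sk_m128
sk_setr_ps (float z, float y, float x, float w)
{
  return __extension__ (sk_m128)(sk_v4sf){ z, y, x, w };
}

/* Replicate one variable into all four lanes via the initializer. */
static inline sk_m128
sk_set1_ps (float f)
{
  return __extension__ (sk_m128)(sk_v4sf){ f, f, f, f };
}
```
]]></programlisting>
<para>If sk_set1_ps showed up as a hot spot, comparing its generated code with
a version built on an explicit splat built-in would be the natural next
experiment.</para>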
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.2 Intel Intrinsic functions</title>
|
|
|
<para>So what is an intrinsic function? From Wikipedia:</para>
|
|
|
<para/>
|
|
|
<para>In <link
|
|
|
xlink:href="https://en.wikipedia.org/wiki/Compiler_theory">compiler theory</link>, an
|
|
|
intrinsic function is a function available for use in a given <link
|
|
|
xlink:href="https://en.wikipedia.org/wiki/Programming_language">programming
|
|
|
language</link> whose implementation is handled specially by the compiler.
|
|
|
Typically, it substitutes a sequence of automatically generated instructions
|
|
|
for the original function call, similar to an <link
|
|
|
xlink:href="https://en.wikipedia.org/wiki/Inline_function">inline function</link>.
|
|
|
Unlike an inline function though, the compiler has an intimate knowledge of the
|
|
|
intrinsic function and can therefore better integrate it and optimize it for
|
|
|
the situation. This is also called builtin function in many languages.</para>
|
|
|
<para/>
|
|
|
<para>The “Intel Intrinsics” API provides access to the many
|
|
|
instruction set extensions (Intel Technologies) that Intel has added (and
|
|
|
continues to add) over the years. The intrinsics provided access to new
|
|
|
instruction capabilities before the compilers could exploit them directly.
|
|
|
Initially these intrinsic functions were defined for the Intel and Microsoft
compilers and were eventually implemented and contributed to GCC.</para>
|
|
|
<para/>
|
|
|
<para>The Intel Intrinsics have a specific type and naming structure. In
|
|
|
this naming structure, functions start with a common prefix (MMX and SSE use
the _mm_ prefix, while AVX added the _mm256_ and _mm512_ prefixes), then a short
functional name (set, load, store, add, mul, blend, shuffle, …) and a suffix
(_pd, _sd, _pi32...) with type and packing information. See <link
linkend="">Appendix B</link> for the list of common intrinsic suffixes.</para>
|
|
|
<para/>
|
|
|
<para>Oddly many of the MMX/SSE operations are not vectors at all. There
|
|
|
are a lot of scalar operations on a single float, double, or long long type. In
|
|
|
effect these are scalars that can take advantage of the larger (xmm) register
|
|
|
space. Also in the Intel 32-bit architecture they provided IEEE754 float and
|
|
|
double types, and 64-bit integers that did not exist or were hard to implement
in the base i386/387 instruction set. These scalar operations use a suffix
|
|
|
starting with '_s' (_sd for scalar double float, _ss scalar float, and _si64
|
|
|
for scalar long long).</para>
|
|
|
<para/>
|
|
|
<para>True vector operations use the packed or extended packed suffixes,
|
|
|
starting with '_p' or '_ep' (_pd for vector double, _ps for vector float, and
|
|
|
_epi32 for vector int). The use of '_ep' seems to be reserved to disambiguate
|
|
|
intrinsics that existed in the (64-bit vector) MMX extension from the extended
|
|
|
(128-bit vector) SSE equivalent. For example _mm_add_pi32 is an MMX operation on
a pair of 32-bit integers, while _mm_add_epi32 is an SSE2 operation on a vector
|
|
|
of 4 32-bit integers. </para>
|
|
|
<para/>
|
|
|
<para>The GCC builtins for the <link
xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions">i386 target</link>
(which includes x86 and x86_64) are not
the same as the Intel Intrinsics. While they have similar intent and cover most
of the same functions, they use different naming (prefixed with
__builtin_ia32_, then function name with type suffix) and use GCC vector type
modes for operand types. For example:</para>
|
|
|
<para>Note: A key difference between GCC builtins for i386 and PowerPC is
that the x86 builtins have different names for each operation and type while the
PowerPC AltiVec builtins tend to have a single generic builtin for each
|
|
|
operation, across a set of compatible operand types. </para>
|
|
|
<para/>
|
|
|
<para>In GCC the Intel Intrinsic header (*intrin.h) files are implemented
|
|
|
as a set of inline functions using the Intel Intrinsic API names and types.
|
|
|
These functions are implemented as either GCC C vector extension code or via
|
|
|
one or more GCC builtins for the i386 target. So let's take a look at some
|
|
|
examples from GCC's SSE2 intrinsic header emmintrin.h:</para>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para>Note that _mm_add_pd is implemented directly as C vector
extension code, while _mm_add_sd is implemented via the GCC builtin
__builtin_ia32_addsd. From the discussion above we know the _pd suffix
indicates a packed vector double while the _sd suffix indicates a scalar double
in an XMM register. </para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.2.1 Packed vs scalar intrinsics</title>
|
|
|
<para/>
|
|
|
<para>So what is actually going on here? The vector code is clear enough if
you know that the '+' operator is applied to each vector element. The intent of
the builtin is a little less clear, as the GCC documentation for
__builtin_ia32_addsd is not very helpful (nonexistent). So perhaps the <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_pd&expand=97">Intel Intrinsic Guide</link>
will be more enlightening. To
paraphrase:</para>
|
|
|
<para/>
|
|
|
<para>From the <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_pd&expand=97">_mm_add_pd description</link>:
each double float
element ([0] and [1], or bits [63:0] and [127:64]) of operands a and b is
added and the resulting vector is returned. </para>
|
|
|
<para/>
|
|
|
<para>From the <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_sd&expand=97,130">_mm_add_sd description</link>:
add element 0 of the first operand
(a[0]) to element 0 of the second operand (b[0]) and return the packed vector
double {(a[0] + b[0]), a[1]}. Or said differently, the sum of the logical left
most halves of the operands is returned in the logical left most half
(element [0]) of the result, along with the logical right half (element [1])
of the first operand (unchanged) in the logical right half of the result.</para>
|
|
|
<para/>
|
|
|
<para>So the packed double is easy enough but the scalar double details are
|
|
|
more complicated. One source of complication is that while both Instruction Set
|
|
|
Architectures (SSE vs VSX) support scalar floating point operations in vector
|
|
|
registers, the semantics are different. </para>
|
|
|
<para/>
|
|
|
<orderedlist>
|
|
|
<listitem>
|
|
|
<para>The vector bit and field numbering is different (reversed).
|
|
|
</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>The handling of the non-scalar part of the register for scalar
operations is different.</para>
|
|
|
</listitem>
|
|
|
</orderedlist>
|
|
|
<para/>
|
|
|
<para>To minimize confusion and use consistent nomenclature, I will try to
|
|
|
use the terms logical left and logical right elements based on the order they
|
|
|
appear in C vector initializers and element index order. So in the vector
(__v2df){1.0, 2.0}, the value 1.0 is in the logical left element [0] and
the value 2.0 is in the logical right element [1].</para>
|
|
|
<para/>
|
|
|
<para>So let's look at how to implement these intrinsics for the PowerISA.
|
|
|
For example in this case we can use the GCC vector extension, like so:</para>
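<para>A minimal sketch of the packed and scalar double add, assuming only the
GCC generic vector extension. The names v2df, add_pd_sketch and add_sd_sketch
are illustrative and are not the names used in the GCC headers:</para>

```c
typedef double v2df __attribute__ ((vector_size (16)));

/* Packed add: '+' is applied to every element of the vector. */
static inline v2df
add_pd_sketch (v2df a, v2df b)
{
  return a + b;
}

/* Scalar add: only logical element [0] is updated; a[1] passes through. */
static inline v2df
add_sd_sketch (v2df a, v2df b)
{
  a[0] = a[0] + b[0];
  return a;
}
```

<para>For inputs {1.0, 2.0} and {10.0, 20.0} the packed form gives
{11.0, 22.0} while the scalar form gives {11.0, 2.0}.</para>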
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para>The packed double implementation operates on the vector as a whole.
|
|
|
The scalar double implementation operates on and updates only the [0] element of
the vector and leaves the __A[1] element unchanged. From this source the GCC
compiler generates the following code for the PPC64LE target:</para>
|
|
|
<para/>
|
|
|
<para>The packed vector double generated the corresponding VSX vector
|
|
|
double add (xvadddp). But the scalar implementation is a bit more complicated.
|
|
|
</para>
|
|
|
<para/>
|
|
|
<para>First, in the PPC64LE vector format, element [0] is not in the correct
|
|
|
position for the scalar operations. So the compiler generates vector splat
|
|
|
double (xxspltd) instructions to copy elements __A[0] and __B[0] into position
|
|
|
for the VSX scalar add double (xsadddp) that follows. However the VSX scalar
|
|
|
operation leaves the other half of the VSR undefined (which does not match the
|
|
|
expected Intel semantics). So the compiler must generate a vector merge high
|
|
|
double (xxmrghd) instruction to combine the original __A[1] element (from vs34)
|
|
|
with the scalar add result from vs35 element [1]. This merge swings the scalar
|
|
|
result from vs35[1] element into the vs34[0] position, while preserving the
|
|
|
original vs34[1] (from __A[1]) element (copied to itself).<footnote><para>Fun
|
|
|
fact: The vector registers in PowerISA are decidedly Big Endian. But we decided
|
|
|
to make the PPC64LE ABI behave like a Little Endian system to make application
|
|
|
porting easier. This requires the compiler to manipulate the PowerISA vector
intrinsics behind the scenes to get the correct Little Endian results. For
example the element selector [0|1] for vec_splat and the generation of
vec_mergeh vs vec_mergel are reversed for Little
|
|
|
Endian.</para></footnote></para>
|
|
|
<para/>
|
|
|
<para>This technique applies to packed and scalar intrinsics for the
|
|
|
usual arithmetic operators (add, subtract, multiply, divide). Using GCC vector
|
|
|
extensions in these intrinsic implementations provides the compiler more
|
|
|
opportunity to optimize the whole function. </para>
|
|
|
<para/>
|
|
|
<para>Now we can look at a slightly more interesting (complicated) case.
|
|
|
Square root (sqrt) is not an arithmetic operator in C and is usually handled
with a library call or a compiler builtin. We really want to avoid library
|
|
|
calls and want to avoid any unexpected side effects. As you see below the
|
|
|
implementations of the <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_pd&expand=4926">_mm_sqrt_pd</link> and <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_sd&expand=4926,4956">_mm_sqrt_sd</link>
intrinsics are based on GCC x86
built-ins. </para>
|
|
|
<para/>
|
|
|
<para>For the packed vector sqrt, the PowerISA VSX has an equivalent vector
|
|
|
double square root instruction and GCC provides the vec_sqrt builtin. But the
|
|
|
scalar implementation involves an additional parameter and an extra move.
|
|
|
This seems intended to mimic the propagation of the __A[1] input to the
|
|
|
logical right half of the XMM result that we saw with _mm_add_sd above.</para>
|
|
|
<para/>
|
|
|
<para>The instinct is to extract the low scalar (__B[0]) from operand __B
|
|
|
and pass this to the GCC __builtin_sqrt () before recombining that scalar
|
|
|
result with __A[1] for the vector result. Unfortunately C language standards
|
|
|
force the compiler to call the libm sqrt function unless -ffast-math is
|
|
|
specified. The -ffast-math option is not commonly used and we want to avoid the
|
|
|
external library dependency for what should be only a few inline instructions.
|
|
|
So this is not a good option.</para>
|
|
|
<para/>
|
|
|
<para>Thinking outside the box; we do have an inline intrinsic for a
|
|
|
(packed) vector double sqrt, that we just implemented. However we need to
|
|
|
ensure the other half of __B (__B[1]) does not cause any harmful side effects
|
|
|
(like raising exceptions for NAN or negative values). The simplest solution
|
|
|
is to splat __B[0] to both halves of a temporary value before taking the
|
|
|
vec_sqrt. Then this result can be combined with __A[1] to return the final
|
|
|
result. For example:</para>
|
|
|
<para>In this example we use <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_set1_pd&expand=4926,4956,4926,4956,4652">_mm_set1_pd</link>
to splat the
scalar __B[0], before passing that vector to our _mm_sqrt_pd implementation,
then pass the sqrt result (c[0]) with __A[1] to <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_setr_pd&expand=4679">_mm_setr_pd</link>
to combine the final result. You could also use
the {c[0], __A[1]} initializer instead of _mm_setr_pd.</para>
|
|
|
<para/>
|
|
|
<para>Now we can look at vector and scalar compares that add their own
complications. For example:</para>
|
|
|
<para>The Intel Intrinsic Guide for <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmpeq_pd&expand=779,788,779">_mm_cmpeq_pd</link>
describes comparing double
elements [0|1] and returning either all 0s for not equal or all 1s (0xFFFFFFFFFFFFFFFF
or long long -1) for equal. The comparison result is intended as a select mask
(predicates) for selecting or ignoring specific elements in later operations.
The scalar version <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmpeq_sd&expand=779,788">_mm_cmpeq_sd</link>
is similar except for the quirk
of only comparing element [0] and combining the result with __A[1] to return
the final vector result.</para>
|
|
|
<para/>
|
|
|
<para>The packed vector implementation for PowerISA is simple as VSX
|
|
|
provides the equivalent instruction and GCC provides the vec_cmpeq builtin
|
|
|
supporting the vector double type. The technique of using scalar comparison
|
|
|
operators on the __A[0] and __B[0] does not work as the C comparison operators
|
|
|
return 0 or 1 results while we need the vector select mask (effectively 0 or
|
|
|
-1). Also we need to watch for sequences that mix scalar floats and integers,
|
|
|
generating if/then/else logic or requiring expensive transfers across register
|
|
|
banks.</para>
|
|
|
<para/>
|
|
|
<para>In this case we are better off using explicit vector built-ins for
_mm_add_sd as an example. We can use vec_splat from element [0] to temporaries
where we can safely use vec_cmpeq to generate the expected selector mask. Note
that the vec_cmpeq returns a bool long type so we need to cast the result back
to __v2df. Then use the (__m128d){c[0], __A[1]} initializer to combine the
comparison result with the original __A[1] input and cast to the required
|
|
|
interface type. So we have this example:</para>
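<para>A sketch of the same splat, compare and recombine sequence. To keep it
independent of the target it assumes the GCC vector extension '==' operator
(which yields an all-ones or all-zeros integer lane) in place of vec_cmpeq, and
initializers in place of vec_splat. The names cmpeq_pd_sketch, cmpeq_sd_sketch
and lane_bits are hypothetical:</para>

```c
typedef double v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

/* Packed compare: each lane becomes all 1s (equal) or all 0s (not equal).
   The cast only reinterprets the mask bits as a double vector. */
static inline v2df
cmpeq_pd_sketch (v2df a, v2df b)
{
  return (v2df) (a == b);
}

/* Scalar compare: splat element [0] of each operand, compare the
   temporaries, then combine mask lane [0] with the original a[1]. */
static inline v2df
cmpeq_sd_sketch (v2df a, v2df b)
{
  v2df a0 = { a[0], a[0] };
  v2df b0 = { b[0], b[0] };
  v2df c = (v2df) (a0 == b0);
  return (v2df) { c[0], a[1] };
}

/* Helper for inspecting a mask lane as a 64-bit integer. */
static inline long long
lane_bits (v2df v, int i)
{
  v2di bits = (v2di) v;
  return bits[i];
}
```

<para>Splatting first means the unused lane holds a copy of element [0], so
the packed compare never sees whatever happened to be in __A[1] or
__B[1].</para>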
|
|
|
<para/>
|
|
|
<para>Now let's look at a similar example that adds some surprising
|
|
|
complexity. This is the compare not equal case so we should be able to find the
|
|
|
equivalent vec_cmpne builtin:</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.2.2 To vec_not or not</title>
|
|
|
<para>Well not exactly. Looking at the OpenPOWER ABI document we see a
|
|
|
reference to vec_cmpne for all numeric types. But when we look in the current
|
|
|
GCC 6 documentation we find that vec_cmpne is not on the list. So it is planned
|
|
|
in the ABI, but not implemented yet. </para>
|
|
|
<para/>
|
|
|
<para>Looking at the PowerISA 2.07B we find a VSX Vector Compare Equal to
|
|
|
Double-Precision but no Not Equal. In fact we see only vector double compare
|
|
|
instructions for greater than and greater than or equal in addition to the
|
|
|
equal compare. Not only can't we find a not equal, there are no less than or
less than or equal compares either. </para>
|
|
|
<para/>
|
|
|
<para>So what is going on here? Partially this is the Reduced Instruction
|
|
|
Set Computer (RISC) design philosophy. In this case the compiler can generate
|
|
|
all the required compares using the existing vector instructions and simple
|
|
|
transforms based on Boolean algebra. So vec_cmpne(A,B) is simply vec_not
|
|
|
(vec_cmpeq(A,B)). And vec_cmplt(A,B) is simply vec_cmpgt(B,A) based on the
|
|
|
identity A < B iff B > A. Similarly vec_cmple(A,B) is implemented as
|
|
|
vec_cmpge(B,A).</para>
|
|
|
<para/>
|
|
|
<para>Wait a minute, there is no vec_not() either. We cannot find it in the
PowerISA, the OpenPOWER ABI, or the GCC PowerPC Altivec Built-in documentation.
There is no vec_move() either! How can this possibly work?</para>
|
|
|
<para/>
|
|
|
<para>This is RISC philosophy again. We can always use a logical
|
|
|
instruction (like bitwise and or or) to effect a move given that we also have
nondestructive 3 register instruction forms. In the PowerISA most instructions
|
|
|
have two input registers and a separate result register. So if the result
|
|
|
register number is different from either input register then the inputs are
|
|
|
not clobbered (nondestructive). Of course nothing prevents you from specifying
|
|
|
the same register for both inputs or even all three registers (result and both
|
|
|
inputs). And sometimes it is useful.</para>
|
|
|
<para/>
|
|
|
<para>The statement B = vec_or (A,A) is effectively a vector move/copy
from A to B. And A = vec_or (A,A) is obviously a nop (no operation). In fact the
PowerISA defines the preferred nop and register move for vector registers in
|
|
|
this way.</para>
|
|
|
<para/>
|
|
|
<para>It is also useful to have hardware implement the logical operators
|
|
|
nor (not or) and nand (not and). The PowerISA provides these instructions for
fixed point and vector logical operations. So vec_not(A) can be implemented as
vec_nor(A,A). So looking at the implementation of _mm_cmpneq_pd we propose the
|
|
|
following:</para>
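<para>A sketch of compare not equal built from compare equal plus nor. It
assumes the GCC vector extension operators ('==', '|', '~') as stand-ins for
vec_cmpeq and vec_nor so that it is not tied to one target. The names
nor_sketch, cmpneq_pd_sketch and lane_bits are hypothetical:</para>

```c
typedef double v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

/* Bitwise nor; nor (a, a) is the same as not (a). */
static inline v2di
nor_sketch (v2di a, v2di b)
{
  return ~(a | b);
}

/* Not equal is the complement of the equal mask. */
static inline v2df
cmpneq_pd_sketch (v2df a, v2df b)
{
  v2di eq = (v2di) (a == b);
  return (v2df) nor_sketch (eq, eq);
}

/* Helper for inspecting a mask lane as a 64-bit integer. */
static inline long long
lane_bits (v2df v, int i)
{
  v2di bits = (v2di) v;
  return bits[i];
}
```

<para>Since an equal compare is false for a NaN operand, the complemented mask
is all 1s in that lane, which is what a not equal compare should
report.</para>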
|
|
|
<para/>
|
|
|
<para>The Intel Intrinsics also include the not forms of the relational
|
|
|
compares:</para>
|
|
|
<para>The PowerISA, the OpenPOWER ABI, and the GCC PowerPC Altivec Built-in
|
|
|
documentation do not provide any direct equivalents to the not greater than
|
|
|
class of compares. Again you don't really need them if you know Boolean
|
|
|
algebra. We can use identities like {not (A < B) iff A >= B} and {not (A
|
|
|
<= B) iff A > B}. So the PPC64LE implementation follows:</para>
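<para>A sketch of the not forms expressed through these identities, using the
GCC vector extension relational operators as stand-ins for vec_cmpge and
vec_cmpgt. All function names are hypothetical. Note the assumption: the
identities hold for ordered values only, so a lane holding a NaN would give a
different mask than a literal "not less than" test:</para>

```c
typedef double v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

static inline v2df
cmpge_pd_sketch (v2df a, v2df b)
{
  return (v2df) (a >= b);
}

static inline v2df
cmpgt_pd_sketch (v2df a, v2df b)
{
  return (v2df) (a > b);
}

/* not (A < B) iff A >= B */
static inline v2df
cmpnlt_pd_sketch (v2df a, v2df b)
{
  return cmpge_pd_sketch (a, b);
}

/* not (A <= B) iff A > B */
static inline v2df
cmpnle_pd_sketch (v2df a, v2df b)
{
  return cmpgt_pd_sketch (a, b);
}

/* not (A >= B) iff A < B iff B > A */
static inline v2df
cmpnge_pd_sketch (v2df a, v2df b)
{
  return cmpgt_pd_sketch (b, a);
}

/* not (A > B) iff A <= B iff B >= A */
static inline v2df
cmpngt_pd_sketch (v2df a, v2df b)
{
  return cmpge_pd_sketch (b, a);
}

/* Helper for inspecting a mask lane as a 64-bit integer. */
static inline long long
lane_bits (v2df v, int i)
{
  v2di bits = (v2di) v;
  return bits[i];
}
```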
|
|
|
<para>These patterns repeat for the scalar version of the not compares. And
|
|
|
in general the larger pattern described in this chapter applies to the other
|
|
|
float and integer types with similar interfaces.</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.2.3 Crossing lanes</title>
|
|
|
<para>We have seen that, most of the time, vector SIMD units prefer to keep
|
|
|
computations in the same “lane” (element number) as the input elements. The
|
|
|
only exception in the examples so far are the occasional splat (copy one
|
|
|
element to all the other elements of the vector) operations. Splat is an
|
|
|
example of the general category of “permute” operations (Intel would call
|
|
|
this a “shuffle” or “blend”). Permutes select and rearrange the
elements of (usually) a concatenated pair of vectors and delivers those
selected elements, in a specific order, to a result vector. The selection and
order of elements in the result is controlled by a third vector, either as a third
input vector or an immediate field of the instruction.</para>
|
|
|
<para/>
|
|
|
<para>For example, consider the Intel intrinsics for <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd&expand=2757,4767,409,2757">Horizontal Add / Subtract</link>
added with
SSE3. These intrinsics add (subtract) adjacent element pairs, across a pair of
input vectors, placing the sum of the adjacent elements in the result vector.
For example <link
xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd_ps&expand=2757,4767,409,2757,2757">_mm_hadd_ps</link>,
which implements
the operation on float:</para>
|
|
|
<para>Horizontal Add (hadd) provides an incremental vector “sum across”
|
|
|
operation commonly needed in matrix and vector transform math. Horizontal Add
|
|
|
is incremental as you need three hadd instructions to sum across 4 vectors of 4
|
|
|
elements ( 7 for 8 x 8, 15 for 16 x 16, …).</para>
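<para>A reference model of the float horizontal add, and of the three-step
sum across four vectors, may make the counting concrete. This is a sketch
using GCC vector extension initializers; hadd_ps_ref and sum_across_4x4 are
hypothetical names:</para>

```c
typedef float v4sf __attribute__ ((vector_size (16)));

/* Sum adjacent pairs: two results from a, then two results from b. */
static inline v4sf
hadd_ps_ref (v4sf a, v4sf b)
{
  return (v4sf) { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
}

/* Three horizontal adds reduce four vectors of four elements:
   element [i] of the result is the sum of all elements of v<i>. */
static inline v4sf
sum_across_4x4 (v4sf v0, v4sf v1, v4sf v2, v4sf v3)
{
  v4sf h01 = hadd_ps_ref (v0, v1);
  v4sf h23 = hadd_ps_ref (v2, v3);
  return hadd_ps_ref (h01, h23);
}
```

<para>With inputs {1, 2, 3, 4} and {101, 102, 103, 104} a single hadd gives
{3, 7, 203, 207}.</para>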
|
|
|
<para/>
|
|
|
<para>The PowerISA does not have a sum-across operation for float or
|
|
|
double. We can use the vector float add instruction after we rearrange the
|
|
|
inputs so that element pairs line up for the horizontal add. For example we
|
|
|
would need to permute the input vectors {1, 2, 3, 4} and {101, 102, 103, 104}
|
|
|
into vectors {2, 4, 102, 104} and {1, 3, 101, 103} before the vec_add. This
|
|
|
requires two vector permutes to align the elements into the correct lanes for
|
|
|
the vector add (to implement Horizontal Add). </para>
|
|
|
<para/>
|
|
|
<para>The PowerISA provides a generalized byte-level vector permute (vperm)
that takes a vector register pair as source and a control vector. The control
vector provides 16 indexes (0-31) to select bytes from the concatenated input
vector register pair (VRA, VRB). A more specific set of permute (pack, unpack,
merge, splat) operations (across element sizes) are encoded as separate
|
|
|
instruction opcodes or instruction immediate fields.</para>
|
|
|
<para/>
|
|
|
<para>Unfortunately only the general vec_perm can provide the realignment
|
|
|
we need for the _mm_hadd_ps operation or any of the int, short variants of hadd.
|
|
|
For example:</para>
|
|
|
<para/>
|
|
|
<para>This requires two permute control vectors; one to select the even
|
|
|
word elements across __X and __Y, and another to select the odd word elements
|
|
|
across __X and __Y. The results of these permutes (vec_perm) are inputs to the
|
|
|
vec_add and completes the hadd operation. </para>
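<para>The two control vectors can be illustrated with a scalar model of a
byte-level permute. This is only a sketch: perm_bytes_sketch is a hypothetical
helper that numbers the 32 source bytes in storage order, standing in for
vec_perm, and the control values assume 4-byte float elements:</para>

```c
#include <string.h>

typedef float v4sf __attribute__ ((vector_size (16)));

/* Each control byte (0-31) selects one byte from the 32-byte
   concatenation of x followed by y, numbered in storage order. */
static inline v4sf
perm_bytes_sketch (v4sf x, v4sf y, const unsigned char ctl[16])
{
  unsigned char src[32], dst[16];
  v4sf r;
  memcpy (src, &x, 16);
  memcpy (src + 16, &y, 16);
  for (int i = 0; i < 16; i++)
    dst[i] = src[ctl[i] & 0x1f];
  memcpy (&r, dst, 16);
  return r;
}

/* Horizontal add as two permutes followed by one vector add. */
static inline v4sf
hadd_ps_sketch (v4sf x, v4sf y)
{
  /* Words 0 and 2 of x, then words 0 and 2 of y. */
  static const unsigned char even_words[16] =
    { 0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27 };
  /* Words 1 and 3 of x, then words 1 and 3 of y. */
  static const unsigned char odd_words[16] =
    { 4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31 };
  return perm_bytes_sketch (x, y, even_words)
         + perm_bytes_sketch (x, y, odd_words);
}
```

<para>For {1, 2, 3, 4} and {101, 102, 103, 104} the permutes produce
{1, 3, 101, 103} and {2, 4, 102, 104}, and the add gives
{3, 7, 203, 207}.</para>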
|
|
|
<para/>
|
|
|
<para>Fortunately the permute required for the double (64-bit) case (i.e.
|
|
|
_mm_hadd_pd) reduces to the equivalent of vec_mergeh / vec_mergel doubleword
|
|
|
(which are variants of VSX Permute Doubleword Immediate). So the
|
|
|
implementation of _mm_hadd_pd can be simplified to this:</para>
|
|
|
<para>This eliminates the load of the control vectors required by the
|
|
|
previous example.</para>
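<para>A sketch of the doubleword case, with the two merges modeled by GCC
vector extension initializers in logical element order. The names
mergeh_sketch, mergel_sketch and hadd_pd_sketch are hypothetical stand-ins for
vec_mergeh, vec_mergel and the intrinsic:</para>

```c
typedef double v2df __attribute__ ((vector_size (16)));

/* Logical element [0] of each input. */
static inline v2df
mergeh_sketch (v2df x, v2df y)
{
  return (v2df) { x[0], y[0] };
}

/* Logical element [1] of each input. */
static inline v2df
mergel_sketch (v2df x, v2df y)
{
  return (v2df) { x[1], y[1] };
}

/* Result is { x[0] + x[1], y[0] + y[1] }. */
static inline v2df
hadd_pd_sketch (v2df x, v2df y)
{
  return mergeh_sketch (x, y) + mergel_sketch (x, y);
}
```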
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.3 PowerISA Vector facilities.</title>
|
|
|
<para>The PowerISA vector facilities (VMX and VSX) are extensive, but do
not always provide a direct or obvious functional equivalent to the Intel
Intrinsics. But being not obvious is not the same as impossible. It just
requires some basic programming skills.</para>
|
|
|
<para/>
|
|
|
<para>It is a good idea to have an overall understanding of the vector
capabilities of the PowerISA. You do not need to memorize every instruction but
it helps to know where to look. Both the PowerISA and OpenPOWER ABI have a
specific structure and organization that can help you find what you are looking
for. </para>
|
|
|
<para/>
|
|
|
<para>It also helps to understand the relationship between the PowerISA's
low level instructions and the higher abstraction of the vector intrinsics as
defined by the OpenPOWER ABI's Vector Programming Interfaces and the de facto
standard of GCC's PowerPC AltiVec Built-in Functions.</para>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.3.1 The PowerISA</title>
|
|
|
<para>The PowerISA is, for historical reasons, organized at the top level
by the distinction between the older Vector Facility (Altivec / VMX) and the newer
|
|
|
Vector-Scalar Floating-Point Operations (VSX). </para>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.3.1.1 The Vector Facility (VMX)</title>
|
|
|
<para>The original VMX supported SIMD integer byte, halfword, and word, and
|
|
|
single float data types within a separate (from GPR and FPR) bank of 32 x
|
|
|
128-bit vector registers. These operations like to stay within their (SIMD)
|
|
|
lanes except where the operation changes the element data size (integer
|
|
|
multiply, pack, and unpack). </para>
|
|
|
<para/>
|
|
|
<para>This is complemented by bit logical and shift / rotate / permute /
merge instructions that operate on the vector as a whole. Some operations
(permute, pack, merge, shift double, select) will select 128 bits from a pair
of vectors (256-bits) and deliver a 128-bit vector result. These instructions
will cross lanes or multiple registers to grab fields and assemble them into
|
|
|
the single register result.</para>
|
|
|
<para/>
|
|
|
<para>The PowerISA 2.07B Chapter 6. Vector Facility is organized starting
with an overview (chapters 6.1 - 6.6):</para>
|
|
|
<para>Then a chapter on storage (load/store) access for vector and vector
|
|
|
elements:</para>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.3.1.1.1 Vector permute and formatting instructions</title>
|
|
|
<para>The vector Permute and formatting chapter follows and is an important
one to study. These operations operate on the byte, halfword, word (and with
2.07 doubleword) integer types, plus the special Pixel type. The shift
instructions in this chapter operate on the vector as a whole at either the bit
or the byte (octet) level. This is an important chapter to study for moving
|
|
|
PowerISA vector results into the vector elements that Intel Intrinsics
|
|
|
expect:</para>
|
|
|
<para/>
|
|
|
<para>The Vector Integer instructions include the add / subtract / Multiply
|
|
|
/ Multiply Add/Sum / (no divide) operations for the standard integer types.
|
|
|
There are instruction forms that provide signed, unsigned, modulo, and
|
|
|
saturate results for most operations. The PowerISA 2.07 extension for add /
subtract of 128-bit integers, with carry and extend to 256-bit, 512-bit and beyond,
is included here. There are signed / unsigned compares across the standard
integer types (byte, .. doubleword), the usual bit-wise logical operations,
and the SIMD shift / rotate instructions that operate on the vector elements
for various types.</para>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para>The vector [single] float instructions are grouped into this chapter.
|
|
|
This chapter does not include the double float instructions which are described
|
|
|
in the VSX chapter. VSX also includes additional float instructions that operate
on the whole 64-register vector-scalar set.</para>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para>The vector XOR based instructions are new with PowerISA 2.07 (POWER8)
|
|
|
and provide vector crypto and check-sum operations:</para>
|
|
|
<para/>
|
|
|
<para>The vector gather and bit permute support bit level rearrangement of
|
|
|
bits within the vector, while the vector versions of the count leading zeros
and population count are useful to accelerate specific algorithms.</para>
|
|
|
<para/>
|
|
|
<para>The Decimal Integer add / subtract instructions complement the
|
|
|
Decimal Floating-Point instructions. They can also be used to accelerate some
binary to/from decimal conversions. The VSCR instructions provide access to
the Non-Java mode floating-point control and the saturation status. These
instructions are not normally of interest in porting Intel intrinsics.</para>
|
|
|
<para/>
|
|
|
<para>With PowerISA 2.07B (POWER8) several major extensions were added to
|
|
|
the Vector Facility:</para>
|
|
|
<orderedlist>
|
|
|
<listitem>
|
|
|
<para>Vector Crypto: Under “Vector Exclusive-OR-based Instructions”,
AES [inverse] Cipher, SHA 256 / 512
Sigma, Polynomial Multiplication, and Permute and XOR instructions.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>64-bit Integer; signed and unsigned add / subtract, signed and
|
|
|
unsigned compare, Even / Odd 32 x 32 multiply with 64-bit product, signed /
|
|
|
unsigned max / min, rotate and shift left/right.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Direct Move between GPRs and the FPRs / Left half of Vector
|
|
|
Registers.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>128-bit integer add / subtract with carry / extend, direct
|
|
|
support for vector __int128 and multiple precision arithmetic.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Decimal Integer add / subtract for 31 digit BCD.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Miscellaneous SIMD extensions: Count leading Zeros, Population
|
|
|
count, bit gather / permute, and vector forms of eqv, nand, orc.</para>
|
|
|
</listitem>
|
|
|
</orderedlist>
|
|
|
<para/>
|
|
|
<para>The rationale for why these are included in the Vector Facilities
(VMX) (vs Vector-Scalar Floating-Point Operations (VSX)) has more to do with
how the instructions were encoded than with the type of operations or the ISA
version of introduction. This is primarily a trade-off between the bits
required for register selection vs bits for extended op-code space within a
fixed 32-bit instruction. Basically accessing 32 vector registers requires
5-bits per register, while accessing all 64 vector-scalar registers requires
6-bits per register. When you consider that most vector instructions require 3
|
|
|
and some (select, fused multiply-add) require 4 register operand forms, the
|
|
|
impact on op-code space is significant. The larger register set of VSX was
|
|
|
justified by queuing theory of larger HPC matrix codes using double float,
|
|
|
while 32 registers are sufficient for most applications.</para>
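<para>The encoding trade-off can be put into numbers with a small sketch. It
assumes a 6-bit primary opcode in every 32-bit instruction and ignores any
other fields an instruction form may carry; the function names are
hypothetical:</para>

```c
enum { INSN_BITS = 32, PRIMARY_OPCODE_BITS = 6 };

/* Bits spent selecting registers. */
static inline int
register_field_bits (int operands, int bits_per_register)
{
  return operands * bits_per_register;
}

/* Bits left over for extended op-code space under the stated assumptions. */
static inline int
bits_left_for_extended_opcode (int operands, int bits_per_register)
{
  return INSN_BITS - PRIMARY_OPCODE_BITS
         - register_field_bits (operands, bits_per_register);
}
```

<para>Under these assumptions a 3 operand form leaves 11 bits with 5-bit
register fields but only 8 bits with 6-bit fields, and a 4 operand form drops
from 6 bits to 2.</para>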
|
|
|
<para/>
|
|
|
<para>So by definition the VMX instructions are restricted to the original
|
|
|
32 vector registers while VSX instructions are encoded to access all 64
|
|
|
floating-point scalar and vector double registers. This distinction can be
|
|
|
troublesome when programming at the assembler level, but the compiler and
|
|
|
compiler built-ins can hide most of this detail from the programmer. </para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.3.1.2 Vector-Scalar Floating-Point Operations (VSX)</title>
|
|
|
<para>With PowerISA 2.06 (POWER7) we extended the vector SIMD capabilities
|
|
|
of the PowerISA:</para>
|
|
|
<orderedlist>
|
|
|
<listitem>
|
|
|
<para>Extend the available vector and floating-point scalar register
|
|
|
sets from 32 registers each to a combined 64 x 64-bit scalar floating-point and
|
|
|
64 x 128-bit vector registers.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Enable scalar double float operations on all 64 scalar
|
|
|
registers.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Enable vector double and vector float operations for all 64
|
|
|
vector registers.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Enable super-scalar execution of vector instructions and support
|
|
|
2 independent vector floating point pipelines for parallel execution of 4 x
|
|
|
64-bit Floating point Fused Multiply Adds (FMAs) and 8 x 32-bit (FMAs) per
|
|
|
cycle.</para>
|
|
|
</listitem>
|
|
|
</orderedlist>
|
|
|
<para/>
|
|
|
<para>With PowerISA 2.07 (POWER8) we added single-precision scalar
|
|
|
floating-point instructions to VSX. This completes the floating-point
|
|
|
computational set for VSX. This ISA release also clarified how these operate in
|
|
|
the Little Endian storage model.</para>
|
|
|
<para/>
|
|
|
<para>While the focus was on enhanced floating-point computation (for High
|
|
|
Performance Computing), VSX also extended the ISA with additional storage
|
|
|
access, logical, and permute (merge, splat, shift) instructions. This was
|
|
|
necessary to extend these operations to cover all 64 VSX registers, and it improves
|
|
|
unaligned storage access for vectors (not available in VMX).</para>
|
|
|
<para/>
|
|
|
<para>Chapter 7, Vector-Scalar Floating-Point Operations, of PowerISA 2.07B
opens with an introduction and overview (sections 7.1 to 7.5). The early
sections (7.1 and 7.2) describe the layout of the 64 VSX registers and how
they relate to (overlap and inter-operate with) the existing floating-point
scalar registers (FPRs) and VMX vector registers (VRs).</para>
|
|
|
<para/>
|
|
|
<para>The definitions given in “7.1.1.1 Compatibility with Category
|
|
|
Floating-Point and Category Decimal Floating-Point Operations”, and
|
|
|
“7.1.1.2 Compatibility with Category Vector Operations” </para>
|
|
|
<para>Note: the reference to scalar element 0 above is from the big endian
register perspective of the ISA. In the PPC64LE ABI implementation, and for the
purpose of porting Intel intrinsics, this is logical element 1. Intel SSE
scalar intrinsics operate on logical element [0], which is in the wrong
position for PowerISA FPU and VSX scalar floating-point operations. Another
important point is what happens to the other half of the VSR when you execute a
scalar floating-point instruction (The contents of doubleword 1 of a VSR …
are undefined.)</para>
|
|
|
<para/>
|
|
|
<para>The compiler hides some of this detail when generating code for
little endian vector element [] notation and for most vector built-ins. For
example, vec_splat (A, 0) is transformed for PPC64LE to xxspltd VRT,VRA,1. What
the compiler cannot hide is the different placement of scalars within vector
registers.</para>
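<para>As a minimal sketch of the [] element notation, the following uses the
GCC/Clang vector_size extension rather than the altivec.h built-ins, so it
should compile on any target. The type name v2df and both helper names are
illustrative assumptions and do not come from any header. Logical element 0 is
always the first array element; the hardware element number is left to the
compiler.</para>

```c
/* 128-bit vector of two doubles (GCC/Clang vector extension). */
typedef double v2df __attribute__((vector_size(16)));

/* Build a vector from memory: vector element [i] is array element i. */
static inline v2df load2(const double *p)
{
    v2df r = { p[0], p[1] };
    return r;
}

/* Replicate logical element 0 into both lanes using [] notation.
   The compiler picks the instruction and the hardware element number. */
static inline v2df splat_element0(v2df a)
{
    v2df r = { a[0], a[0] };
    return r;
}
```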
|
|
|
<para/>
|
|
|
<para>Vector registers (VRs) 0-31 overlay and can be accessed from vector
scalar registers (VSRs) 32-63. The ABI also specifies that VR2-13 are used to
pass parameters and return values. In some cases the same (or similar)
operations exist in both VMX and VSX instruction forms, while in other cases
operations only exist for VMX (byte level permute and shift) or VSX (vector
double).</para>
|
|
|
<para/>
|
|
|
<para>So register selection that avoids unnecessary vector moves and follows
the ABI, while maintaining the correct instruction-specific register numbering,
can be tricky. Writing the <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Machine-Constraints.html#Machine-Constraints">GCC register constraint</link>
annotations for inline assembler that uses vector instructions is challenging,
even for experts. So only experts should be writing assembler, and then only in
extraordinary circumstances. You should leave these details to the compiler
(using vector extensions and vector built-ins) whenever possible.</para>
|
|
|
<para/>
|
|
|
<para>The next sections get into the details of floating-point
representation, operations, and exceptions. These are basically the
implementation details behind the IEEE754R and C/C++ language standards, which
most developers only access via higher-level APIs. Most programmers will not
need this level of detail, but it is there if needed.</para>
|
|
|
<para/>
|
|
|
<para>Finally, there is an overview of the VSX storage access instructions for
big and little endian, and for aligned and unaligned data addresses. This
includes diagrams that illuminate the differences.</para>
|
|
|
<para/>
|
|
|
<para>Section 7.6 starts with a VSX Instruction Set Summary, which is the
place to start to get a feel for the types and operations supported. The
emphasis on floating-point, both scalar and vector (especially vector double),
is pronounced. Many of the scalar and single-precision vector instructions look
like duplicates of what we have seen in the Chapter 4 Floating-Point and
Chapter 6 Vector facilities. The difference here is the new instruction
encodings, which access the full 64 VSX register space.</para>
|
|
|
<para/>
|
|
|
<para>In addition, a small number of logical instructions are included to
support predication (selecting / masking vector elements based on compare
results), along with a set of permute, merge, shift, and splat instructions
that operate on VSX word (float) and doubleword (double) elements. As mentioned
for VMX section 6.8, these instructions are good to study as they are useful
for realigning elements from PowerISA vector results to the layout required for
Intel Intrinsics.</para>
|
|
|
<para/>
|
|
|
<para/>
|
|
|
<para>The VSX Instruction Descriptions section contains the detailed
description of each VSX category instruction. The table entries in the
Instruction Set Summary are formatted in the document as hyperlinks to the
corresponding instruction descriptions.</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.3.2 PowerISA Vector Intrinsics</title>
|
|
|
<para>The OpenPOWER ELF V2 application binary interface (ABI): Chapter 6.
|
|
|
Vector Programming Interfaces and Appendix A. Predefined Functions for Vector
|
|
|
Programming document the current and proposed vector built-ins we expect all
|
|
|
C/C++ compilers to implement. </para>
|
|
|
<para/>
|
|
|
<para>Some of these operations are endian sensitive and the compiler needs
|
|
|
to make corresponding adjustments as it generates code for endian sensitive
|
|
|
built-ins. There is a good overview for this in the OpenPOWER ABI section 6.4.
|
|
|
Vector Built-in Functions.</para>
|
|
|
<para/>
|
|
|
<para>Appendix A is organized (sorted) by built-in name, output type, then
|
|
|
parameter types. Most built-ins are generic, as the named operation (add,
|
|
|
sub, mul, cmpeq, ...) applies to multiple types. </para>
|
|
|
<para/>
|
|
|
<para>So the vec_add built-in applies to all the signed and unsigned
integer types (char, short, int, and long) plus float and double floating-point
|
|
|
types. The compiler looks at the parameter type to select the vector
|
|
|
instruction (or instruction sequence) that implements the (add) operation on
|
|
|
that type. The compiler infers the output result type from the operation and
|
|
|
input parameters and will complain if the target variable type is not
|
|
|
compatible. For example:</para>
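<para>As a minimal, portable sketch of this type-driven selection, the
following uses the + operator of the GCC/Clang vector extension, which is
generic over element types in the same spirit as vec_add. The type names v4si
and v4sf and the helper names are illustrative assumptions. The same source
operator yields an integer add for v4si and a floating-point add for v4sf, and
the result type follows the operands.</para>

```c
typedef int   v4si __attribute__((vector_size(16)));
typedef float v4sf __attribute__((vector_size(16)));

/* Same operator; the instruction is selected from the operand type. */
static inline v4si add_int(v4si a, v4si b)   { return a + b; }
static inline v4sf add_float(v4sf a, v4sf b) { return a + b; }

/* The result type follows the inputs. With GCC defaults a line such as
       v4sf bad = add_int(x, y);
   is rejected as an incompatible vector type; an explicit cast is needed
   when the reinterpretation is really intended. */
```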
|
|
|
<para/>
|
|
|
<para>This is one key difference between PowerISA built-ins and Intel
|
|
|
Intrinsics (Intel Intrinsics are not generic and include type information in
|
|
|
the name). This is why it is so important to understand the vector element
|
|
|
types and to add the appropriate type casts to get the correct results.</para>
|
|
|
<para/>
|
|
|
<para>The de facto standard implementation is GCC, as defined in the include
file &lt;altivec.h&gt; and documented in the GCC online documentation in <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions">6.59.20 PowerPC
AltiVec Built-in Functions</link>. The header file name and section title
reflect the origin of the Vector Facility, but recent versions of the GCC
altivec.h include built-ins for the newer PowerISA 2.06 and 2.07 VMX plus VSX
extensions. This is a work in progress, and your (older) distro GCC compiler
may not include built-ins for the latest PowerISA 3.0 or ABI edition. So before
you use a built-in you find in the ABI Appendix A, check the specific <link xlink:href="https://gcc.gnu.org/onlinedocs/">GCC online documentation</link> for the
GCC version you are using.</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.3.3 How vector elements change size and type</title>
|
|
|
<para>Most vector built-ins return the same vector type as the (first)
input parameter, but there are exceptions. Examples include conversions
between types, compares, pack, unpack, merge, and integer multiply
operations.</para>
|
|
|
<para/>
|
|
|
<para>Converting floats to or from integers will change the type and
sometimes change the element size as well (double ↔ int and float ↔ long).
For VMX the conversions are always the same size (float ↔ [unsigned] int). But
VSX allows conversion of 64-bit (long or double) to or from 32-bit (float or
int) with the inherent size changes. For these conversions the PowerISA VSX
defines a 4 element vector layout where only two of the four word elements are
used for input/output (elements 0 and 2 in the big endian numbering of the ISA,
which are elements 1 and 3 in little endian numbering) and the other two are
undefined. The OpenPOWER ABI Appendix A defines vec_double and vec_float with
even/odd and high/low extensions as programming aids. These are not included in
GCC 7 or earlier but are planned for GCC 8.</para>
|
|
|
<para/>
|
|
|
<para>Compare operations produce either vector bool &lt;input element
type&gt; (effectively bit masks) or predicates (the condition code for all and
any, represented as an int truth variable). When a predicate compare (e.g.
vec_all_eq, vec_any_gt) is used in an if statement, the condition code is
used directly in the conditional branch and the int truth value is not
generated.</para>
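<para>The two kinds of compare result can be sketched with the portable
GCC/Clang vector extension, where an element-wise == yields a mask vector whose
lanes are all ones (-1) or zero. The helper names below are illustrative
assumptions; on POWER the corresponding built-ins would be vec_cmpeq, vec_sel,
vec_all_eq and vec_any_eq.</para>

```c
typedef int v4si __attribute__((vector_size(16)));

/* Mask form: each result lane is -1 (all ones) where equal, else 0. */
static inline v4si cmpeq_mask(v4si a, v4si b)
{
    return a == b;
}

/* Predication: take lanes of a where the mask is set, lanes of b elsewhere. */
static inline v4si select_by_mask(v4si mask, v4si a, v4si b)
{
    return (a & mask) | (b & ~mask);
}

/* Predicate forms: reduce the mask to a single int truth value. */
static inline int all_eq(v4si a, v4si b)
{
    v4si m = (a == b);
    return (m[0] & m[1] & m[2] & m[3]) != 0;
}

static inline int any_eq(v4si a, v4si b)
{
    v4si m = (a == b);
    return (m[0] | m[1] | m[2] | m[3]) != 0;
}
```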
|
|
|
<para/>
|
|
|
<para>Pack operations pack integer elements into the next smaller (half)
|
|
|
integer sized elements. Pack operations include signed and unsigned saturate
|
|
|
and unsigned modulo forms. As the packed result will be half the size (in
|
|
|
bits), pack instructions require 2 vectors (256-bits) as input and generate a
|
|
|
single 128-bit vector result.</para>
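<para>A scalar reference model can make the pack forms concrete. The sketch
below is plain C, not the hardware instructions; the function names are
illustrative assumptions, and elements are numbered in array order with the
first input filling the first half of the result.</para>

```c
#include <stdint.h>

/* Clamp a 32-bit value into the signed 16-bit range. */
static int16_t saturate_s16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Signed saturate form: two 4 x 32-bit inputs become one 8 x 16-bit result. */
static void pack_s32_saturate(const int32_t a[4], const int32_t b[4],
                              int16_t out[8])
{
    for (int i = 0; i < 4; i++) {
        out[i]     = saturate_s16(a[i]);
        out[i + 4] = saturate_s16(b[i]);
    }
}

/* Unsigned modulo form: keep only the low 16 bits of each element. */
static void pack_u32_modulo(const uint32_t a[4], const uint32_t b[4],
                            uint16_t out[8])
{
    for (int i = 0; i < 4; i++) {
        out[i]     = (uint16_t)a[i];
        out[i + 4] = (uint16_t)b[i];
    }
}
```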
|
|
|
<para/>
|
|
|
<para>Unpack operations expand integer elements into the next larger size
elements. The integers are always treated as signed values and sign-extended.
The processor design avoids instructions that return multiple register values.
So the PowerISA defines unpack-high and unpack-low forms, where the instruction
takes the high or low half of the vector elements and extends them to fill the
vector output. Element order is maintained, and an unpack high / low sequence
with the same input vector has the effect of unpacking to a 256-bit result in
two vector registers.</para>
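<para>The unpack pair can be modelled in plain C as follows. This is a sketch
of the semantics only; the function name and the half parameter are
illustrative assumptions, and the halves are selected in array order because
which half the ISA calls high depends on the endian view.</para>

```c
#include <stdint.h>

/* Sign-extend one half of an 8 x 16-bit vector to 4 x 32-bit.
   half = 0 takes elements 0..3, half = 1 takes elements 4..7. */
static void unpack_half_s16(const int16_t in[8], int half, int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)in[4 * half + i];  /* conversion sign-extends */
}
```

<para>Calling the helper twice on the same input, once for each half, gives
the full 256-bit result in two 128-bit outputs with element order
preserved.</para>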
|
|
|
<para/>
|
|
|
<para>Merge operations resemble shuffling two (vectors) card decks
|
|
|
together, alternating (elements) cards in the result. As we are merging from
|
|
|
2 vectors (256-bits) into 1 vector (128-bits) and the elements do not change
|
|
|
size, we have merge high and merge low instruction forms for each (byte,
|
|
|
halfword and word) integer type. The merge high operations alternate elements
|
|
|
from the (vector register left) high half of the two input vectors. The merge
|
|
|
low operations alternate elements from the (vector register right) low half of
|
|
|
the two input vectors. </para>
|
|
|
<para/>
|
|
|
<para>For PowerISA 2.07 we added vector merge word even / odd instructions.
|
|
|
Instead of high or low elements the shuffle is from the even or odd numbered
|
|
|
elements of the two input vectors. Passing the same vector to both inputs to
|
|
|
merge produces splat like results for each doubleword half, which is handy in
|
|
|
some convert operations. </para>
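<para>A small model of the even / odd merge, written with the portable
GCC/Clang vector extension, shows the splat-like effect. The helper names are
illustrative assumptions and elements are numbered in array order; on POWER the
ABI built-ins for this are vec_mergee and vec_mergeo.</para>

```c
typedef int v4si __attribute__((vector_size(16)));

/* Interleave the even-numbered elements of a and b. */
static inline v4si merge_even(v4si a, v4si b)
{
    v4si r = { a[0], b[0], a[2], b[2] };
    return r;
}

/* Interleave the odd-numbered elements of a and b. */
static inline v4si merge_odd(v4si a, v4si b)
{
    v4si r = { a[1], b[1], a[3], b[3] };
    return r;
}
```

<para>Passing the same vector twice, merge_odd(v, v) gives { v[1], v[1],
v[3], v[3] }, that is, each doubleword half holds two copies of its odd
element.</para>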
|
|
|
<para/>
|
|
|
<para>Integer multiply has the potential to generate twice as many bits in
|
|
|
the product as input. A multiply of 2 int (32-bit) values produces a long
|
|
|
(64-bits). Normal C language * operations ignore this and discard the top
|
|
|
32-bits of the result. However in some computations it is useful to preserve the
|
|
|
double product precision for intermediate computation before reducing the final
|
|
|
result back to the original precision. </para>
|
|
|
<para/>
|
|
|
<para>The PowerISA VMX instruction set took the latter approach, i.e. keep all
the product bits until the programmer explicitly asks for the truncated result.
So the vector integer multiplies are split into even/odd forms across signed
and unsigned byte, halfword and word inputs. This requires two instructions
(given the same inputs) to generate the full vector multiply across 2 vector
registers and 256-bits. Again, as POWER processors are super-scalar, this pair
of instructions should execute in parallel.</para>
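<para>The even/odd split can be sketched as a portable model using the
GCC/Clang vector extension. The helper names are illustrative assumptions and
elements are numbered in array order; the point is only that each call widens
two of the four inputs so that no product bits are lost.</para>

```c
#include <stdint.h>

typedef int32_t v4si __attribute__((vector_size(16)));
typedef int64_t v2di __attribute__((vector_size(16)));

/* Full 64-bit products of the even-numbered 32-bit elements. */
static inline v2di mul_even_s32(v4si a, v4si b)
{
    v2di r = { (int64_t)a[0] * b[0], (int64_t)a[2] * b[2] };
    return r;
}

/* Full 64-bit products of the odd-numbered 32-bit elements. */
static inline v2di mul_odd_s32(v4si a, v4si b)
{
    v2di r = { (int64_t)a[1] * b[1], (int64_t)a[3] * b[3] };
    return r;
}
```

<para>Together the two results hold all four double-width products across
256 bits, whereas a plain element-wise multiply of the two v4si inputs would
keep only the low 32 bits of each product.</para>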
|
|
|
<para/>
|
|
|
<para>The set of expanded product values can either be used directly in
|
|
|
further (doubled precision) computation or merged/packed into a single
|
|
|
vector at the smaller bit size. This is what the compiler will generate for C
|
|
|
vector extension multiply of vector integer types.</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.2.4 Some more Intrinsic examples</title>
|
|
|
<para>The intrinsic <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvtpd_ps&amp;expand=1624">_mm_cvtpd_ps</link> converts a packed vector double into
a packed vector single float. Since only 2 doubles fit into a 128-bit vector,
only 2 floats are returned and they occupy only half (64-bits) of the XMM
register. For this intrinsic the 64 bits are packed into the logical left half
of the register and the logical right half of the register is set to zero (as
per the Intel cvtpd2ps instruction).</para>
|
|
|
<para/>
|
|
|
<para>The PowerISA provides the VSX Vector round and Convert
Double-Precision to Single-Precision format (xvcvdpsp) instruction. In the ABI
this is vec_floato (vector double). This instruction converts each double
element, then transfers converted element 0 to float element 1, and converted
element 1 to float element 3. Float elements 0 and 2 are undefined (the
hardware can do whatever). This does not match the expected results for
_mm_cvtpd_ps.</para>
|
|
|
<para/>
|
|
|
<para>So we need to re-position the results to word elements 0 and 2, which
|
|
|
allows a pack operation to deliver the correct format. Here the merge odd
|
|
|
splats element 1 to 0 and element 3 to 2. The Pack operation combines the low
|
|
|
half of each doubleword from the vector result and vector of zeros to generate
|
|
|
the required format.</para>
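<para>The three steps (convert, merge odd, pack with zeros) can be traced
with a plain C model. This is a sketch of the data movement only, not the
actual header implementation: all function names are illustrative assumptions,
elements use little endian (array order) numbering, and a sentinel value stands
in for the undefined elements.</para>

```c
/* Step 1: model of the raw convert result. Values land in elements 1
   and 3; elements 0 and 2 are undefined (a sentinel stands in here). */
static void convert_raw(const double in[2], float t[4])
{
    const float undefined = -999.0f;  /* hypothetical stand-in value */
    t[0] = undefined;
    t[1] = (float)in[0];
    t[2] = undefined;
    t[3] = (float)in[1];
}

/* Step 2: merge odd of t with itself copies element 1 to 0 and 3 to 2. */
static void merge_odd_self(const float t[4], float m[4])
{
    m[0] = t[1];
    m[1] = t[1];
    m[2] = t[3];
    m[3] = t[3];
}

/* Step 3: pack the low word of each doubleword of m, then of a zero vector. */
static void pack_low_words_with_zero(const float m[4], float out[4])
{
    out[0] = m[0];
    out[1] = m[2];
    out[2] = 0.0f;
    out[3] = 0.0f;
}

/* Whole sequence: the result layout expected from _mm_cvtpd_ps. */
static void cvtpd_ps_model(const double in[2], float out[4])
{
    float t[4];
    float m[4];
    convert_raw(in, t);
    merge_odd_self(t, m);
    pack_low_words_with_zero(m, out);
}
```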
|
|
|
<para/>
|
|
|
<para>This technique is also used to implement <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvttpd_epi32&amp;expand=1624,1859">_mm_cvttpd_epi32</link>, which converts a packed
vector double into a packed vector int. The PowerISA instruction xvcvdpsxws
uses a similar layout for the result as xvcvdpsp and requires the same fix
up.</para>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.3 Profound differences </title>
|
|
|
<para>We have already mentioned above a number of architectural differences
that affect porting of codes containing Intel intrinsics to POWER. One issue is
that Intel supports multiple vector extensions with different vector widths
(64, 128, 256, and 512-bits) while the PowerISA only supports vectors of
128-bits. Another is the difference in how the respective ISAs support scalars
in vector registers. In the text above we propose workable alternatives for the
PowerPC port. There are also differences in the handling of floating-point
exceptions and rounding modes that may impact the application's performance or
behavior.</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.3.1 Floating Point Exceptions</title>
|
|
|
<para>Nominally both ISAs support the IEEE754 specifications, but there are
some subtle differences. Both architectures define a status and control
register to record exceptions and to enable / disable floating-point exceptions
for program interrupt or default action. Intel has the MXCSR and PowerISA has
the FPSCR, which basically do the same thing but with different bit
layouts.</para>
|
|
|
<para/>
|
|
|
<para>Intel provides _mm_setcsr / _mm_getcsr intrinsics to allow direct
access to the MXCSR. In the early days, before the OS POSIX run-times were
updated to manage the MXCSR, this might have been useful. Today this would be
highly discouraged, with a strong preference to use the POSIX APIs
(feclearexcept, fegetexceptflag, fesetexceptflag, ...) instead.</para>
|
|
|
<para/>
|
|
|
<para>If we implement _mm_setcsr / _mm_getcsr at all, we should simply
|
|
|
redirect the implementation to use the POSIX APIs from <fenv.h>. But it
|
|
|
might be simpler just to replace these intrinsics with macros that generate
|
|
|
#error.</para>
|
|
|
<para/>
|
|
|
<para>The Intel MXCSR does have some non-standard quirks (outside
POSIX/IEEE754): the Flush-To-Zero and Denormals-Are-Zeros flags. These simplify
the hardware response to what should be a rare condition (underflows where the
result cannot be represented in the exponent range and precision of the format)
by simply returning a signed 0.0 value. The intrinsic header implementation
does provide constant masks for _MM_DENORMALS_ZERO_ON (&lt;pmmintrin.h&gt;) and
_MM_FLUSH_ZERO_ON (&lt;xmmintrin.h&gt;), so technically it is available to users
of the Intel Intrinsics API.</para>
|
|
|
<para/>
|
|
|
<para>The VMX Vector facility provides a separate Vector Status and Control
|
|
|
register (VSCR) with a Non-Java Mode control bit. This control combines the
|
|
|
flush-to-zero semantics for floating-point underflow and denormal values. But
|
|
|
this control only applies to VMX vector float instructions and does not apply
|
|
|
to VSX scalar floating-point or vector double instructions. The FPSCR does
|
|
|
define a Floating-Point non-IEEE mode which is optional in the architecture.
|
|
|
This would apply to Scalar and VSX floating-point operations if it was
|
|
|
implemented. This was largely intended for embedded processors and is not
|
|
|
implemented in the POWER processor line.</para>
|
|
|
<para/>
|
|
|
<para>As the flush-to-zero is primarily a performance enhancement and is
|
|
|
clearly outside the IEEE754 standard, it may be best to simply ignore this
|
|
|
option for the intrinsic port.</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.3.2 Floating-point rounding modes</title>
|
|
|
<para>The Intel (x86 / x86_64) and PowerISA architectures both support the
|
|
|
4 IEEE754 rounding modes. Again while the Intel Intrinsic API allows the
|
|
|
application to change rounding modes via updates to the MXCSR it is a bad idea
|
|
|
and should be replaced with the POSIX APIs (fegetround and fesetround). </para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.3.3 Performance</title>
|
|
|
<para>The performance of a ported intrinsic depends on the specifics of the
|
|
|
intrinsic and the context it is used in. Many of the SIMD operations have
|
|
|
equivalent instructions in both architectures. For example the vector float and
|
|
|
vector double match very closely. However the SSE and VSX scalars have subtle
|
|
|
differences in how the scalar is positioned within the vector registers and what
|
|
|
happens to the rest (non-scalar part) of the register (previously discussed in
|
|
|
<link linkend="">here</link>). This requires additional PowerISA instructions
|
|
|
to preserve the non-scalar portion of the vector registers. This may or may not
|
|
|
be important to the logic of the program being ported, but we have to handle the
|
|
|
case where it is. </para>
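<para>What preserving the non-scalar portion means can be shown with a
portable model of an SSE scalar add such as _mm_add_ss, where only element 0 is
computed and elements 1 to 3 pass through from the first operand. The type and
helper names are illustrative assumptions, and the sketch uses the GCC/Clang
vector extension rather than POWER built-ins.</para>

```c
typedef float v4sf __attribute__((vector_size(16)));

/* Scalar add in element 0; elements 1..3 are carried over from a.
   On POWER the carry-over is the part that costs extra instructions. */
static inline v4sf add_scalar_keep_upper(v4sf a, v4sf b)
{
    v4sf r = a;
    r[0] = a[0] + b[0];
    return r;
}
```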
|
|
|
<para/>
|
|
|
<para>This is where the context of how the intrinsic is used starts to
matter. If the scalar intrinsics are used within a larger program, the compiler
may be able to eliminate the redundant register moves as the results are never
used. In other cases, common set up (like permute vectors or bit masks) can be
combined and hoisted out of the loop. So it is very important to let the
compiler do its job with higher optimization levels (-O3,
-funroll-loops).</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.3.3.1 Using SSE float and double scalars</title>
|
|
|
<para>“Hand” optimization with SSE scalar float / double intrinsics is no
longer necessary. This was important when SSE was initially introduced and
compiler support was limited or nonexistent. Also SSE scalar float / double
provided additional (16) registers and IEEE754 compliance, not available from
the 8087 floating-point architecture that preceded it. So application
developers were motivated to use SSE instructions versus what the compiler was
generating at the time.</para>
|
|
|
<para/>
|
|
|
<para>Modern compilers can now generate and optimize these (SSE scalar)
instructions for Intel from C standard scalar code. Of course PowerISA
supported IEEE754 float and double and had 32 dedicated floating-point
registers from the start (and now 64 with VSX). So replacing an Intel specific
scalar intrinsic implementation with the equivalent C language scalar
implementation is usually a win: it allows the compiler to apply the latest
optimization and tuning for the latest generation processor, and it is portable
to other platforms where the compiler can also apply the latest optimization
and tuning for that processor's latest generation.</para>
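<para>As a minimal illustration, a loop that might once have been hand
written with _mm_mul_sd, _mm_add_sd and _mm_cvtsd_f64 can simply be stated in
standard C. The function name dot is an illustrative assumption; the compiler
is then free to choose scalar or vector instructions for whichever target it is
building for.</para>

```c
#include <stddef.h>

/* Plain C scalar code: no intrinsics, portable to any platform. */
static double dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```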
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>2.3.3.2 Using MMX intrinsics</title>
|
|
|
<para>MMX was the first and oldest SIMD extension and initially filled a
need for wider (64-bit) integers and additional registers. This is back when
processors were 32-bit and 8 x 32-bit registers were starting to cramp our
programming style. Now 64-bit processors, larger register sets, and 128-bit (or
larger) vector SIMD extensions are common. There is simply no good reason to
write new code using the (now) very limited MMX capabilities.</para>
|
|
|
<para/>
|
|
|
<para>We recommend that existing MMX codes be rewritten to use the newer
SSE and VMX/VSX intrinsics, or the more portable GCC builtin vector support,
or in the case of si64 operations, C scalar code. The MMX si64 scalars are just
(64-bit) operations on long long int types, and any modern C compiler can
handle this type. The char, short, and int SIMD operations should all be
promoted to 128-bit SIMD operations on GCC builtin vectors. Both will improve
cross platform portability and performance.</para>
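<para>Both rewrites can be sketched in a few lines. The names below are
illustrative assumptions: add_si64 stands in for an MMX si64 add such as
_mm_add_si64, and add_u16x8 does in one 128-bit operation what two 64-bit
_mm_add_pi16 operations would do. Unsigned element types are used so that the
wrap-around of the MMX adds is well defined in C.</para>

```c
#include <stdint.h>

/* si64 case: an ordinary 64-bit scalar add. */
static inline uint64_t add_si64(uint64_t a, uint64_t b)
{
    return a + b;
}

/* Eight 16-bit lanes in one 128-bit GCC builtin vector. */
typedef uint16_t v8hu __attribute__((vector_size(16)));

/* Element-wise add with wrap-around in each 16-bit lane. */
static inline v8hu add_u16x8(v8hu a, v8hu b)
{
    return a + b;
}
```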
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>Appendix A: Document References</title>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>A.1 OpenPOWER and Power documents</title>
|
|
|
<para>
|
|
|
<link xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">OpenPOWER™ Technical Specification</link>
|
|
|
</para>
|
|
|
<para>
|
|
|
<link xlink:href="https://openpowerfoundation.org/?resource_lib=ibm-power-isa-version-2-07-b">Power ISA™ Version 2.07 B</link>
|
|
|
</para>
|
|
|
<para>
|
|
|
<link xlink:href="https://www.docdroid.net/tWT7hjD/powerisa-v30.pdf.html">Power ISA™ Version 3.0</link>
|
|
|
</para>
|
|
|
<para>
|
|
|
<link xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">Power Architecture 64-bit ELF ABI Specification (AKA OpenPower ABI for Linux Supplement)</link>
|
|
|
</para>
|
|
|
<para>
|
|
|
<link xlink:href="http://www.nxp.com/assets/documents/data/en/reference-manuals/ALTIVECPEM.pdf">AltiVec™ Technology Programming Environments Manual</link>
|
|
|
</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>A.2 Intel Documents</title>
|
|
|
<para>
|
|
|
<link xlink:href="https://software.intel.com/en-us/articles/intel-sdm">Intel®
|
|
|
64 and IA-32 Architectures Software Developer’s Manual</link>
|
|
|
</para>
|
|
|
<para>
|
|
|
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel™ Intrinsics Guide</link>
|
|
|
</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>A.3 GNU Compiler Collection (GCC) documents</title>
|
|
|
<para>
|
|
|
<link xlink:href="https://gcc.gnu.org/onlinedocs/">GCC online
|
|
|
documentation</link>
|
|
|
</para>
|
|
|
<para>
|
|
|
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/">GCC Manual
|
|
|
(GCC 6.3)</link>
|
|
|
</para>
|
|
|
<para>
|
|
|
<link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">GCC Internals
|
|
|
Manual</link>
|
|
|
</para>
|
|
|
<para/>
|
|
|
</section>
|
|
|
|
|
|
<section>
|
|
|
<title>Appendix B: Intel Intrinsic suffixes</title>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>B.1 MMX</title>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>B.2 SSE</title>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>B.3 SSE2</title>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>B.4 AVX/AVX2 __m256_*</title>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title>B.5 AVX512 __m512_*</title>
|
|
|
<para/>
|
|
|
</section>
|
|
|
|
|
|
</article>
|