|
|
|
<!--
|
|
|
|
Copyright (c) 2019 OpenPOWER Foundation
|
|
|
|
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License");
|
|
|
|
you may not use this file except in compliance with the License.
|
|
|
|
You may obtain a copy of the License at
|
|
|
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software
|
|
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
|
See the License for the specific language governing permissions and
|
|
|
|
limitations under the License.
|
|
|
|
|
|
|
|
-->
|
|
|
|
<chapter version="5.0" xml:lang="en" xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
|
|
|
|
xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_techniques">
|
|
|
|
|
|
|
|
<!-- Chapter Title goes here. -->
|
|
|
|
<title>Vector Programming Techniques</title>
|
|
|
|
|
|
|
|
<section>
|
|
|
|
<title>Help the Compiler Help You</title>
|
|
|
|
<para>
|
|
|
|
The best way to use vector intrinsics is often <emphasis>not to
|
|
|
|
use them at all</emphasis>.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
This may seem counterintuitive at first. Aren't vector
|
|
|
|
intrinsics the best way to ensure that the compiler does exactly
|
|
|
|
what you want? Well, sometimes. But the problem is that the
|
|
|
|
best instruction sequence today may not be the best instruction
|
|
|
|
sequence tomorrow. As the Power ISA moves forward, new
|
|
|
|
instruction capabilities appear, and the old code you wrote can
|
|
|
|
easily become obsolete. Then you start having to create
|
|
|
|
different versions of the code for different levels of the
|
|
|
|
Power ISA, and it can quickly become difficult to maintain.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
Most often programmers use vector intrinsics to increase the
|
|
|
|
performance of loop kernels that dominate the performance of an
|
|
|
|
application or library. However, modern compilers are often
|
|
|
|
able to optimize such loops to use vector instructions without
|
|
|
|
having to resort to intrinsics, using an optimization called
|
|
|
|
autovectorization (or auto-SIMD). Your first focus when writing
|
|
|
|
loop kernels should be on making the code amenable to
|
|
|
|
autovectorization by the compiler. Start by writing the code
|
|
|
|
naturally, using scalar memory accesses and data operations, and
|
|
|
|
see whether the compiler autovectorizes your code. If not, here
|
|
|
|
are some steps you can try:
|
|
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
<emphasis role="underline">Check your optimization
|
|
|
|
level</emphasis>. Different compilers enable
|
|
|
|
autovectorization at different optimization levels. For
|
|
|
|
example, at this writing the GCC compiler requires
|
|
|
|
<code>-O3</code> to enable autovectorization by default.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
<emphasis role="underline">Consider using
|
|
|
|
<code>-ffast-math</code></emphasis>. This option assumes
|
|
|
|
that certain fussy aspects of IEEE floating-point can be
|
|
|
|
ignored, such as the presence of Not-a-Numbers (NaNs),
|
|
|
|
signed zeros, and so forth. <code>-ffast-math</code> may
|
|
|
|
also affect precision of results that may not matter to your
|
|
|
|
application. Turning on this option can simplify the
|
|
|
|
control flow of loops generated for your application by
|
|
|
|
removing tests for NaNs and so forth. (Note that
|
|
|
|
<code>-Ofast</code> turns on both -O3 and -ffast-math in
|
|
|
|
GCC.)
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
<emphasis role="underline">Align your data wherever
|
|
|
|
possible</emphasis>. For most effective auto-vectorization,
|
|
|
|
arrays of data should be aligned on at least a 16-byte
|
|
|
|
boundary, and pointers to that data should be identified as
|
|
|
|
having the appropriate alignment. For example:
|
|
|
|
</para>
|
|
|
|
<programlisting> float fdata[4096] __attribute__((aligned(16)));</programlisting>
|
|
|
|
<para>
|
|
|
|
ensures that the compiler can use an efficient, aligned
|
|
|
|
vector load to bring data from <code>fdata</code> into a
|
|
|
|
vector register. Autovectorization will appear more
|
|
|
|
profitable to the compiler when data is known to be
|
|
|
|
aligned.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
You can also declare pointers to point to aligned data,
|
|
|
|
which is particularly useful in function arguments:
|
|
|
|
</para>
|
|
|
|
<programlisting> void foo (__attribute__((aligned(16))) double * aligned_fptr)</programlisting>
|
|
|
|
</listitem>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
<emphasis role="underline">Tell the compiler when data can't
|
|
|
|
overlap</emphasis>. In C and C++, use of pointers can cause
|
|
|
|
compilers to pessimistically analyze which memory references
|
|
|
|
can refer to the same memory. This can prevent important
|
|
|
|
optimizations, such as reordering memory references, or
|
|
|
|
keeping previously loaded values in memory rather than
|
|
|
|
reloading them. Inefficiently optimized scalar loops are
|
|
|
|
less likely to be autovectorized. You can annotate your
|
|
|
|
pointers with the <code>restrict</code> or
|
|
|
|
<code>__restrict__</code> keyword to tell the compiler that
|
|
|
|
your pointers don't "alias" with any other memory
|
|
|
|
references. (<code>restrict</code> can be used only in C
|
|
|
|
when compiling for the C99 standard or later.
|
|
|
|
<code>__restrict__</code> is a language extension, available
|
|
|
|
in both GCC and Clang, that can be used for both C and C++.)
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
Suppose you have a function that takes two pointer
|
|
|
|
arguments, one that points to data your function writes to, and
|
|
|
|
one that points to data your function reads from. By
|
|
|
|
default, the compiler may believe that the data being read
|
|
|
|
and written could overlap. To disabuse the compiler of this
|
|
|
|
notion, do the following:
|
|
|
|
</para>
|
|
|
|
<programlisting> void foo (double *__restrict__ outp, double *__restrict__ inp)</programlisting>
|
|
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
</section>
|
|
|
|
|
|
|
|
<section>
|
|
|
|
<title>Use Portable Intrinsics</title>
|
|
|
|
<para>
|
|
|
|
If you can't convince the compiler to autovectorize your code,
|
|
|
|
or you want to access specific processor features not
|
|
|
|
appropriate for autovectorization, you should use intrinsics.
|
|
|
|
However, you should go out of your way to use intrinsics that
|
|
|
|
are as portable as possible, in case you need to change
|
|
|
|
compilers in the future.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
This reference provides intrinsics that are guaranteed to be
|
|
|
|
portable across compliant compilers. In particular, both the
|
|
|
|
GCC and Clang compilers for Power implement the intrinsics in
|
|
|
|
this manual. The compilers may each implement many more
|
|
|
|
intrinsics, but the ones in this manual are the only ones
|
|
|
|
guaranteed to be portable. So if you are using an interface not
|
|
|
|
described here, you should look for an equivalent one in this
|
|
|
|
manual and change your code to use that.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
Where an intrinsic may not be available from all compilers or at
|
|
|
|
all ISA levels, this information is called out in the
|
|
|
|
description of the intrinsic in <xref
|
|
|
|
linkend="VIPR.reference.vecfns" />.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
There are also other vector APIs that may be of use to you (see
|
|
|
|
<xref linkend="VIPR.techniques.apis" />). In particular, the
|
|
|
|
Power Vector Library (see <xref
|
|
|
|
linkend="VIPR.techniques.pveclib" />) provides additional
|
|
|
|
portability across compiler versions, as well as interfaces that
|
|
|
|
hide cases where assembly language is needed.
|
|
|
|
</para>
|
|
|
|
</section>
|
|
|
|
|
|
|
|
<section>
|
|
|
|
<title>Use Assembly Code Sparingly</title>
|
|
|
|
<para>
|
|
|
|
Sometimes the compiler will absolutely not cooperate in giving
|
|
|
|
you the code you need. You might not get the instruction you
|
|
|
|
want, or you might get extra instructions that are slowing down
|
|
|
|
your ideal performance. When that happens, the first thing you
|
|
|
|
should do is report this to the compiler community! This will
|
|
|
|
allow them to get the problem fixed in the next release of the
|
|
|
|
compiler. See <xref linkend="VIPR.intro.reporting" /> if you
|
|
|
|
need to report an issue.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
In the meanwhile, though, what are your options? As a
|
|
|
|
workaround, your best option may be to use assembly code. There
|
|
|
|
are two ways to go about this. Using inline assembly is
|
|
|
|
generally appropriate only for very small snippets of code (1-5
|
|
|
|
instructions, say). If you want to write a whole function in
|
|
|
|
assembly code, though, it is better to create a separate
|
|
|
|
<code>.s</code> or <code>.S</code> file. The only difference in
|
|
|
|
these two file types is that a <code>.S</code> file will be
|
|
|
|
processed by the C preprocessor before being assembled.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
Assembly programming is beyond the scope of this manual.
|
|
|
|
Getting inline assembly correct can be quite tricky, and it is
|
|
|
|
best to look at existing examples to learn how to use it
|
|
|
|
properly. However, there is a good introduction to inline
|
|
|
|
assembly in <emphasis>Using the GNU Compiler
|
|
|
|
Collection</emphasis>, in section 6.47 at the time of this
|
|
|
|
writing. Felix Cloutier has also written a very nice guide.
|
|
|
|
See <xref linkend="VIPR.intro.links" />.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
If you write a function entirely in assembly, you are
|
|
|
|
responsible for following the calling conventions established by
|
|
|
|
the ABI (see <xref linkend="VIPR.intro.links" />). Again, it is
|
|
|
|
best to look at examples. One place to find well-written
|
|
|
|
<code>.S</code> files is in the GLIBC project.
|
|
|
|
</para>
|
|
|
|
</section>
|
|
|
|
|
|
|
|
<section xml:id="VIPR.techniques.apis">
|
|
|
|
<title>Other Vector Programming APIs</title>
|
|
|
|
<para>In addition to the intrinsic functions provided in this
|
|
|
|
reference, programmers should be aware of other vector programming
|
|
|
|
API resources.</para>
|
|
|
|
<section>
|
|
|
|
<title>x86 Vector Portability Headers</title>
|
|
|
|
<para>
|
|
|
|
Recent versions of the GCC and Clang open source compilers
|
|
|
|
provide "drop-in" portability headers for portions of the
|
|
|
|
Intel Architecture Instruction Set Extensions (see <xref
|
|
|
|
linkend="VIPR.intro.links" />). These headers mirror the APIs
|
|
|
|
of Intel headers having the same names. Support is provided
|
|
|
|
for the MMX and SSE layers, up through SSE4. At this time, no
|
|
|
|
support for the AVX layers is envisioned.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
The portability headers provide the same semantics as the
|
|
|
|
corresponding Intel APIs, but using VMX and VSX instructions
|
|
|
|
to emulate the Intel vector instructions. It should be
|
|
|
|
emphasized that these headers are provided for portability,
|
|
|
|
and will not necessarily perform optimally (although in many
|
|
|
|
cases the performance is very good). Using these headers is
|
|
|
|
often a good first step in porting a library using Intel
|
|
|
|
intrinsics to Power, after which more detailed rewriting of
|
|
|
|
algorithms is usually desirable for best performance.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
Access to the portability APIs occurs automatically when
|
|
|
|
including one of the corresponding Intel header files, such as
|
|
|
|
<code><mmintrin.h></code>.
|
|
|
|
</para>
|
|
|
|
</section>
|
|
|
|
<section xml:id="VIPR.techniques.pveclib">
|
|
|
|
<title>The Power Vector Library (pveclib)</title>
|
|
|
|
<para>The Power Vector Library, also known as
|
|
|
|
<code>pveclib</code>, is a separate project available from
|
|
|
|
github (see <xref linkend="VIPR.intro.links" />). The
|
|
|
|
<code>pveclib</code> project builds on top of the intrinsics
|
|
|
|
described in this manual to provide higher-level vector
|
|
|
|
interfaces that are highly portable. The goals of the project
|
|
|
|
include:
|
|
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
Providing equivalent functions across versions of the
|
|
|
|
Power ISA. For example, the <emphasis>Vector
|
|
|
|
Multiply-by-10 Unsigned Quadword</emphasis> operation
|
|
|
|
introduced in Power ISA 3.0 (POWER9) can be implemented
|
|
|
|
using a few vector instructions on earlier Power ISA
|
|
|
|
versions.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
Providing equivalent functions across compiler versions.
|
|
|
|
For example, intrinsics provided in later versions of the
|
|
|
|
compiler can be implemented as inline functions with
|
|
|
|
inline asm in earlier compiler versions.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
<listitem>
|
|
|
|
<para>
|
|
|
|
Providing higher-order functions not provided directly by
|
|
|
|
the Power ISA. One example is a vector SIMD implementation
|
|
|
|
for ASCII <code>__isalpha</code> and similar functions.
|
|
|
|
Another example is full <code>__int128</code>
|
|
|
|
implementations of <emphasis>Count Leading
|
|
|
|
Zeroes</emphasis>, <emphasis>Population Count</emphasis>,
|
|
|
|
and <emphasis>Multiply</emphasis>.
|
|
|
|
</para>
|
|
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
</section>
|
|
|
|
</section>
|
|
|
|
|
|
|
|
</chapter>
|