diff --git a/Intrinsics_Reference/ch_biendian.xml b/Intrinsics_Reference/ch_biendian.xml
index 958ba41..5846956 100644
--- a/Intrinsics_Reference/ch_biendian.xml
+++ b/Intrinsics_Reference/ch_biendian.xml
@@ -769,7 +769,7 @@ register vector double vd = vec_splats(*double_ptr);
introduced serious compiler complexity without much utility.
Thus this support (previously controlled by switches
-maltivec=be and/or -qaltivec=be) is
- now deprecated. Current versions of the gcc and clang
+ now deprecated. Current versions of the GCC and Clang
open-source compilers do not implement this support.
@@ -1146,8 +1146,8 @@ register vector double vd = vec_splats(*double_ptr);
elements using the groups of 4 contiguous bytes, and the
values of the integers will be reordered without compromising
each integer's contents. The fact that the little-endian
- result matches the big-endian result is left as an exercise to
- the reader.
+ result matches the big-endian result is left as an exercise
+ for the reader.
Now, suppose instead that the original PCV does not reorder
diff --git a/Intrinsics_Reference/ch_intro.xml b/Intrinsics_Reference/ch_intro.xml
index b2bb054..49a1946 100644
--- a/Intrinsics_Reference/ch_intro.xml
+++ b/Intrinsics_Reference/ch_intro.xml
@@ -54,10 +54,9 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_intro">
provides for overloaded intrinsics that can operate on different
data types. However, such function overloading is not normally
acceptable in the C programming language, so compilers compliant
- with the AltiVec PIM (such as gcc and
- clang) were required to add special handling to
- their parsers to permit this. The PIM suggested (but did not
- mandate) the use of a header file,
+ with the AltiVec PIM (such as GCC and Clang) were required to
+ add special handling to their parsers to permit this. The PIM
+ suggested (but did not mandate) the use of a header file,
<altivec.h>, for implementations that provide
AltiVec intrinsics. This is common practice for all compliant
compilers today.
@@ -208,6 +207,15 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_intro">
+
+
+ Using the GNU Compiler Collection.
+
+ https://gcc.gnu.org/onlinedocs/gcc.pdf
+
+
+
+
diff --git a/Intrinsics_Reference/ch_techniques.xml b/Intrinsics_Reference/ch_techniques.xml
index 3f8f4c1..892c5f9 100644
--- a/Intrinsics_Reference/ch_techniques.xml
+++ b/Intrinsics_Reference/ch_techniques.xml
@@ -23,45 +23,181 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_techniques">
Help the Compiler Help You
- Start with scalar code, which is the most portable. Use various
- tricks for helping the compiler vectorize scalar code. Make
- sure you align your data on 16-byte boundaries wherever
- possible, and tell the compiler it's aligned. Use __restrict__
- pointers to promise data does not alias.
+ The best way to use vector intrinsics is often not to
+ use them at all.
+
+ This may seem counterintuitive at first. Aren't vector
+ intrinsics the best way to ensure that the compiler does exactly
+ what you want? Well, sometimes. But the problem is that the
+ best instruction sequence today may not be the best instruction
+ sequence tomorrow. As the PowerISA moves forward, new
+ instruction capabilities appear, and the old code you wrote can
+ easily become obsolete. Then you start having to create
+ different versions of the code for different levels of the
+ PowerISA, and it can quickly become difficult to maintain.
+
+
+ Most often programmers use vector intrinsics to increase the
+ performance of loop kernels that dominate the performance of an
+ application or library. However, modern compilers are often
+ able to optimize such loops to use vector instructions without
+ having to resort to intrinsics, using an optimization called
+ autovectorization (or auto-SIMD). Your first focus when writing
+ loop kernels should be on making the code amenable to
+ autovectorization by the compiler. Start by writing the code
+ naturally, using scalar memory accesses and data operations, and
+ see whether the compiler autovectorizes your code. If not, here
+ are some steps you can try:
+
+
+
+
+ Check your optimization
+ level. Different compilers enable
+ autovectorization at different optimization levels. For
+ example, at this writing the GCC compiler requires
+ -O3 to enable autovectorization by default.
+
+
+
+
+ Consider using
+ -ffast-math. This option assumes
+ that certain fussy aspects of IEEE floating-point can be
+ ignored, such as the presence of not-a-number values (NaNs),
+ signed zeros, and so forth. -ffast-math may
+ also affect the precision of results in ways that may not
+ matter to your application. Turning on this option can simplify the
+ control flow of loops generated for your application by
+ removing tests for NaNs and so forth. (Note that
+ -Ofast turns on both -O3 and -ffast-math in
+ GCC.)
+
+
+
+
+ Align your data wherever
+ possible. For most effective autovectorization,
+ arrays of data should be aligned on at least a 16-byte
+ boundary, and pointers to that data should be identified as
+ having the appropriate alignment. For example:
+
+ float fdata[4096] __attribute__((aligned(16)));
+
+ ensures that the compiler can use an efficient, aligned
+ vector load to bring data from fdata into a
+ vector register. Autovectorization will appear more
+ profitable to the compiler when data is known to be
+ aligned.
+
+
+ You can also declare pointers to point to aligned data,
+ which is particularly useful in function arguments:
+
+ void foo (__attribute__((aligned(16))) double * aligned_fptr)
+
+
+
+ Tell the compiler when data can't
+ overlap. In C and C++, use of pointers can cause
+ compilers to pessimistically analyze which memory references
+ can refer to the same memory. This can prevent important
+ optimizations, such as reordering memory references, or
+ keeping previously loaded values in memory rather than
+ reloading them. Inefficiently optimized scalar loops are
+ less likely to be autovectorized. You can annotate your
+ pointers with the restrict or
+ __restrict__ keyword to tell the compiler that
+ your pointers don't "alias" with any other memory
+ references. (restrict can be used only in C
+ when compiling for the C99 standard or later.
+ __restrict__ is a language extension, available
+ in both GCC and Clang, that can be used for both C and C++.)
+
+
+ Suppose you have a function that takes two pointer
+ arguments, one that points to data your function writes to, and
+ one that points to data your function reads from. By
+ default, the compiler may believe that the data being read
+ and written could overlap. To disabuse the compiler of this
+ notion, do the following:
+
+ void foo (double *__restrict__ outp, double *__restrict__ inp)
+
+ Use Portable Intrinsics
- Individual compilers may provide other intrinsic support. Only
- the intrinsics in this manual are guaranteed to be portable
- across compliant compilers.
+ If you can't convince the compiler to autovectorize your code,
+ or you want to access specific processor features not
+ appropriate for autovectorization, you should use intrinsics.
+ However, you should go out of your way to use intrinsics that
+ are as portable as possible, in case you need to change
+ compilers in the future.
+
+
+ This reference provides intrinsics that are guaranteed to be
+ portable across compliant compilers. In particular, both the
+ GCC and Clang compilers for POWER implement the intrinsics in
+ this manual. The compilers may each implement many more
+ intrinsics, but the ones in this manual are the only ones
+ guaranteed to be portable. So if you are using an interface not
+ described here, you should look for an equivalent one in this
+ manual and change your code to use that.
- Some compilers may provide compatibility headers for use with
- other architectures. Recent GCC and Clang compilers support
- compatibility headers for the lower levels of the x86 vector
- architecture. These can be used initially for ease of porting,
- but for best performance, it is preferable to rewrite important
- sections of code with native Power intrinsics.
+ There are also other vector APIs that may be of use to you (see
+ ). In particular, the
+ POWER Vector Library (see ) provides additional
+ portability across compiler versions.
Use Assembly Code Sparingly
- filler
-
- Inline Assembly
- filler
-
-
- Assembly Files
- filler
-
+
+ Sometimes the compiler will absolutely not cooperate in giving
+ you the code you need. You might not get the instruction you
+ want, or you might get extra instructions that are slowing down
+ your ideal performance. When that happens, the first thing you
+ should do is report this to the compiler community! This will
+ allow them to get the problem fixed in the next release of the
+ compiler.
+
+
+ In the meantime, though, what are your options? As a
+ workaround, your best option may be to use assembly code. There
+ are two ways to go about this. Using inline assembly is
+ generally appropriate only for very small snippets of code (1-5
+ instructions, say). If you want to write a whole function in
+ assembly code, though, it is better to create a separate
+ .s or .S file. The only difference between
+ these two file types is that a .S file will be
+ processed by the C preprocessor before being assembled.
+
+
+ Assembly programming is beyond the scope of this manual.
+ Getting inline assembly correct can be quite tricky, and it is
+ best to look at existing examples to learn how to use it
+ properly. However, there is a good introduction to inline
+ assembly in Using the GNU Compiler
+ Collection (see ),
+ in section 6.47 at the time of this writing.
+
+
+ If you write a function entirely in assembly, you are
+ responsible for following the calling conventions established by
+ the ABI (see ). Again, it is
+ best to look at examples. One place to find well-written
+ .S files is in the glibc project.
+
-
+ Other Vector Programming APIs
In addition to the intrinsic functions provided in this
reference, programmers should be aware of other vector programming
@@ -69,14 +205,13 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_techniques">
x86 Vector Portability Headers
- Recent versions of the gcc and clang
- open source compilers provide "drop-in" portability headers
- for portions of the Intel Architecture Instruction Set
- Extensions (see ). These
- headers mirror the APIs of Intel headers having the same
- names. Support is provided for the MMX and SSE layers, up
- through SSE4. At this time, no support for the AVX layers is
- envisioned.
+ Recent versions of the GCC and Clang open source compilers
+ provide "drop-in" portability headers for portions of the
+ Intel Architecture Instruction Set Extensions (see ). These headers mirror the APIs
+ of Intel headers having the same names. Support is provided
+ for the MMX and SSE layers, up through SSE4. At this time, no
+ support for the AVX layers is envisioned.
The portability headers provide the same semantics as the
@@ -95,7 +230,7 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_techniques">
<mmintrin.h>.
-
+ The POWER Vector Library (pveclib)
The POWER Vector Library, also known as
pveclib, is a separate project available from
diff --git a/Intrinsics_Reference/ch_vec_reference.xml b/Intrinsics_Reference/ch_vec_reference.xml
index a18fcdf..7117f70 100644
--- a/Intrinsics_Reference/ch_vec_reference.xml
+++ b/Intrinsics_Reference/ch_vec_reference.xml
@@ -23,8 +23,95 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="VIPR.vec-ref">
How to Use This Reference
- Brief description of the format of the entries, the cross-reference
- index, and so forth.
+ This chapter contains reference material for each supported
+ vector intrinsic. The information for each intrinsic includes:
+
+
+
+
+ The intrinsic name and extended name;
+
+
+
+
+ A type-free example of the intrinsic's usage;
+
+
+
+
+ A description of the intrinsic's purpose;
+
+
+
+
+ A description of the value(s) returned from the intrinsic,
+ if any;
+
+
+
+
+ A description of any unusual characteristics of the
+ intrinsic when different target endiannesses are in force.
+ If the semantics of the intrinsic in big-endian and
+ little-endian modes are identical, the description will read
+ "None.";
+
+
+
+
+ Optionally, additional explanatory notes about the
+ intrinsic; and
+
+
+
+
+ A table of supported type signatures for the intrinsic.
+
+
+
+
+ Most intrinsics are overloaded, supporting multiple type
+ signatures. The types of the input arguments always determine
+ the type of the result argument; that is, it is not possible to
+ define two intrinsic overloads with the same input argument
+ types and different result argument types.
+
+
+ The type-free example of the intrinsic's usage uses the
+ convention that r represents
+ the result of the intrinsic, and a, b,
+ etc., represent the input arguments. The allowed type
+ combinations of these variables are shown as rows in the table
+ of supported type signatures.
+
+
+ Each row contains at least one example implementation. This
+ shows one way that a conforming compiler might achieve the
+ intended semantics of the intrinsic, but compilers are not
+ required to generate this code specifically. The letters
+ r, a, b,
+ etc., in the examples represent vector registers containing the
+ values of those variables. The letters t, u,
+ etc., represent vector registers containing temporary
+ intermediate results. The same register is assumed to be used
+ for each instance of one of these letters.
+
+
+ When implementations differ for big- and little-endian targets,
+ separate example implementations are shown for each endianness.
+
+
+ The implementations show which vector instructions are used in
+ the implementation of a particular intrinsic. When trying to
+ determine which intrinsic to use, it can be useful to have a
+ cross-reference from a specific vector instruction to the
+ intrinsics whose implementations make use of it. This manual
+ contains such a cross-reference () for the programmer's
+ convenience.