Finish all the front matter!

6 years ago · 2817b77c5a
parent 7a3454dc78
commit 2817b77c5a
4 changed files with 272 additions and 42 deletions
--- a/Intrinsics_Reference/ch_biendian.xml
+++ b/Intrinsics_Reference/ch_biendian.xml
@@ -769,7 +769,7 @@ register vector double vd = vec_splats(*double_ptr);</programlisting>
 	introduced serious compiler complexity without much utility.
 	Thus this support (previously controlled by switches
 	<code>-maltivec=be</code> and/or <code>-qaltivec=be</code>) is
-	now deprecated.  Current versions of the gcc and clang
+	now deprecated.  Current versions of the GCC and Clang
 	open-source compilers do not implement this support.
      </para>
    </section>
@@ -1146,8 +1146,8 @@ register vector double vd = vec_splats(*double_ptr);</programlisting>
 	elements using the groups of 4 contiguous bytes, and the
 	values of the integers will be reordered without compromising
 	each integer's contents.  The fact that the little-endian
-	result matches the big-endian result is left as an exercise to
-	the reader.
+	result matches the big-endian result is left as an exercise
+	for the reader.
      </para>
      <para>
 	Now, suppose instead that the original PCV does not reorder
--- a/Intrinsics_Reference/ch_intro.xml
+++ b/Intrinsics_Reference/ch_intro.xml
@@ -54,10 +54,9 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_intro">
      provides for overloaded intrinsics that can operate on different
      data types.  However, such function overloading is not normally
      acceptable in the C programming language, so compilers compliant
-      with the AltiVec PIM (such as <code>gcc</code> and
-      <code>clang</code>) were required to add special handling to
-      their parsers to permit this.  The PIM suggested (but did not
-      mandate) the use of a header file,
+      with the AltiVec PIM (such as GCC and Clang) were required to
+      add special handling to their parsers to permit this.  The PIM
+      suggested (but did not mandate) the use of a header file,
      <code>&lt;altivec.h&gt;</code>, for implementations that provide
      AltiVec intrinsics.  This is common practice for all compliant
      compilers today.
@@ -208,6 +207,15 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_intro">
 	  </emphasis>
 	</para>
      </listitem>
+      <listitem>
+	<para>
+	  <emphasis>Using the GNU Compiler Collection.</emphasis>
+	  <emphasis>
+	    <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc.pdf">https://gcc.gnu.org/onlinedocs/gcc.pdf
+	    </link>
+	  </emphasis>
+	</para>
+      </listitem>
    </itemizedlist>
  </section>

--- a/Intrinsics_Reference/ch_techniques.xml
+++ b/Intrinsics_Reference/ch_techniques.xml
@@ -23,45 +23,181 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_techniques">
  <section>
    <title>Help the Compiler Help You</title>
    <para>
-      Start with scalar code, which is the most portable.  Use various
-      tricks for helping the compiler vectorize scalar code.  Make
-      sure you align your data on 16-byte boundaries wherever
-      possible, and tell the compiler it's aligned.  Use __restrict__
-      pointers to promise data does not alias.
+      The best way to use vector intrinsics is often <emphasis>not to
+      use them at all</emphasis>.
    </para>
+    <para>
+      This may seem counterintuitive at first.  Aren't vector
+      intrinsics the best way to ensure that the compiler does exactly
+      what you want?  Well, sometimes.  But the problem is that the
+      best instruction sequence today may not be the best instruction
+      sequence tomorrow.  As the PowerISA moves forward, new
+      instruction capabilities appear, and the old code you wrote can
+      easily become obsolete.  Then you start having to create
+      different versions of the code for different levels of the
+      PowerISA, and it can quickly become difficult to maintain.
+    </para>
+    <para>
+      Most often programmers use vector intrinsics to increase the
+      performance of loop kernels that dominate the performance of an
+      application or library.  However, modern compilers are often
+      able to optimize such loops to use vector instructions without
+      having to resort to intrinsics, using an optimization called
+      autovectorization (or auto-SIMD).  Your first focus when writing
+      loop kernels should be on making the code amenable to
+      autovectorization by the compiler.  Start by writing the code
+      naturally, using scalar memory accesses and data operations, and
+      see whether the compiler autovectorizes your code.  If not, here
+      are some steps you can try:
+    </para>
+    <itemizedlist>
+      <listitem>
+	<para>
+	  <emphasis role="underline">Check your optimization
+	  level</emphasis>.  Different compilers enable 
+	  autovectorization at different optimization levels.  For
+	  example, at this writing the GCC compiler requires
+	  <code>-O3</code> to enable autovectorization by default.
+	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  <emphasis role="underline">Consider using
+	  <code>-ffast-math</code></emphasis>.  This option assumes
+	  that certain fussy aspects of IEEE floating-point can be
+	  ignored, such as the presence of Not-a-Numbers (NaNs),
+	  signed zeros, and so forth.  <code>-ffast-math</code> may
+	  also change the precision of results in ways that may not
+	  matter to your application.  Turning on this option can
+	  simplify the control flow of loops generated for your
+	  application by removing tests for NaNs and so forth.  (Note
+	  that <code>-Ofast</code> turns on both <code>-O3</code> and
+	  <code>-ffast-math</code> in GCC.)
+	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  <emphasis role="underline">Align your data wherever
+	  possible</emphasis>.  For the most effective autovectorization,
+	  arrays of data should be aligned on at least a 16-byte
+	  boundary, and pointers to that data should be identified as
+	  having the appropriate alignment.  For example:
+	</para>
+	<programlisting>  float fdata[4096] __attribute__((aligned(16)));</programlisting>
+	<para>
+	  ensures that the compiler can use an efficient, aligned
+	  vector load to bring data from <code>fdata</code> into a
+	  vector register.  Autovectorization will appear more
+	  profitable to the compiler when data is known to be
+	  aligned.
+	</para>
+	<para>
+	  You can also declare pointers to point to aligned data,
+	  which is particularly useful in function arguments:
+	</para>
+	<programlisting>  void foo (__attribute__((aligned(16))) double * aligned_fptr)</programlisting>
+      </listitem>
+      <listitem>
+	<para>
+	  <emphasis role="underline">Tell the compiler when data can't
+	  overlap</emphasis>.  In C and C++, use of pointers can cause
+	  compilers to pessimistically analyze which memory references
+	  can refer to the same memory.  This can prevent important
+	  optimizations, such as reordering memory references, or
+	  keeping previously loaded values in registers rather than
+	  reloading them.  Inefficiently optimized scalar loops are
+	  less likely to be autovectorized.  You can annotate your
+	  pointers with the <code>restrict</code> or
+	  <code>__restrict__</code> keyword to tell the compiler that
+	  your pointers don't "alias" with any other memory
+	  references.  (<code>restrict</code> can be used only in C
+	  when compiling for the C99 standard or later.
+	  <code>__restrict__</code> is a language extension, available
+	  in both GCC and Clang, that can be used for both C and C++.)
+	</para>
+	<para>
+	  Suppose you have a function that takes two pointer
+	  arguments, one that points to data your function writes to, and
+	  one that points to data your function reads from.  By
+	  default, the compiler may believe that the data being read
+	  and written could overlap.  To disabuse the compiler of this
+	  notion, do the following:
+	</para>
+	<programlisting>  void foo (double *__restrict__ outp, double *__restrict__ inp)</programlisting>
+      </listitem>
+    </itemizedlist>
  </section>

  <section>
    <title>Use Portable Intrinsics</title>
    <para>
-      Individual compilers may provide other intrinsic support.  Only
-      the intrinsics in this manual are guaranteed to be portable
-      across compliant compilers.
+      If you can't convince the compiler to autovectorize your code,
+      or you want to access specific processor features not
+      appropriate for autovectorization, you should use intrinsics.
+      However, you should go out of your way to use intrinsics that
+      are as portable as possible, in case you need to change
+      compilers in the future.
+    </para>
+    <para>
+      This reference provides intrinsics that are guaranteed to be
+      portable across compliant compilers.  In particular, both the
+      GCC and Clang compilers for POWER implement the intrinsics in
+      this manual.  The compilers may each implement many more
+      intrinsics, but the ones in this manual are the only ones
+      guaranteed to be portable.  So if you are using an interface not
+      described here, you should look for an equivalent one in this
+      manual and change your code to use that.
    </para>
    <para>
-      Some compilers may provide compatibility headers for use with
-      other architectures.  Recent GCC and Clang compilers support
-      compatibility headers for the lower levels of the x86 vector
-      architecture.  These can be used initially for ease of porting,
-      but for best performance, it is preferable to rewrite important
-      sections of code with native Power intrinsics.
+      There are also other vector APIs that may be of use to you (see
+      <xref linkend="VIPR.techniques.apis" />).  In particular, the
+      POWER Vector Library (see <xref
+      linkend="VIPR.techniques.pveclib" />) provides additional
+      portability across compiler versions.
    </para>
  </section>

  <section>
    <title>Use Assembly Code Sparingly</title>
-    <para>filler</para>
-    <section>
-      <title>Inline Assembly</title>
-      <para>filler</para>
-    </section>
-    <section>
-      <title>Assembly Files</title>
-      <para>filler</para>
-    </section>
+    <para>
+      Sometimes the compiler will absolutely not cooperate in giving
+      you the code you need.  You might not get the instruction you
+      want, or you might get extra instructions that are slowing down
+      your ideal performance.  When that happens, the first thing you
+      should do is report this to the compiler community!  This will
+      allow them to get the problem fixed in the next release of the
+      compiler.
+    </para>
+    <para>
+      In the meantime, though, what are your options?  As a
+      workaround, your best option may be to use assembly code.  There
+      are two ways to go about this.  Using inline assembly is
+      generally appropriate only for very small snippets of code (1-5
+      instructions, say).  If you want to write a whole function in
+      assembly code, though, it is better to create a separate
+      <code>.s</code> or <code>.S</code> file.  The only difference
+      between these two file types is that a <code>.S</code> file will be
+      processed by the C preprocessor before being assembled.
+    </para>
+    <para>
+      Assembly programming is beyond the scope of this manual.
+      Getting inline assembly correct can be quite tricky, and it is
+      best to look at existing examples to learn how to use it
+      properly.  However, there is a good introduction to inline
+      assembly in <emphasis>Using the GNU Compiler
+      Collection</emphasis> (see <xref linkend="VIPR.intro.links" />),
+      in section 6.47 at the time of this writing.
+    </para>
+    <para>
+      If you write a function entirely in assembly, you are
+      responsible for following the calling conventions established by
+      the ABI (see <xref linkend="VIPR.intro.links" />).  Again, it is
+      best to look at examples.  One place to find well-written
+      <code>.S</code> files is in the GLIBC project.
+    </para>
  </section>

-  <section>
+  <section xml:id="VIPR.techniques.apis">
    <title>Other Vector Programming APIs</title>
    <para>In addition to the intrinsic functions provided in this
    reference, programmers should be aware of other vector programming
@@ -69,14 +205,13 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_techniques">
    <section>
      <title>x86 Vector Portability Headers</title>
      <para>
-	Recent versions of the <code>gcc</code> and <code>clang</code>
-	open source compilers provide "drop-in" portability headers
-	for portions of the Intel Architecture Instruction Set
-	Extensions (see <xref linkend="VIPR.intro.links" />).  These
-	headers mirror the APIs of Intel headers having the same
-	names.  Support is provided for the MMX and SSE layers, up
-	through SSE4.  At this time, no support for the AVX layers is
-	envisioned.
+	Recent versions of the GCC and Clang open-source compilers
+	provide "drop-in" portability headers for portions of the
+	Intel Architecture Instruction Set Extensions (see <xref
+	linkend="VIPR.intro.links" />).  These headers mirror the APIs
+	of Intel headers having the same names.  Support is provided
+	for the MMX and SSE layers, up through SSE4.  At this time, no
+	support for the AVX layers is envisioned.
      </para>
      <para>
 	The portability headers provide the same semantics as the
@@ -95,7 +230,7 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_techniques">
 	<code>&lt;mmintrin.h&gt;</code>.
      </para>
    </section>
-    <section>
+    <section xml:id="VIPR.techniques.pveclib">
      <title>The POWER Vector Library (pveclib)</title>
      <para>The POWER Vector Library, also known as
      <code>pveclib</code>, is a separate project available from
--- a/Intrinsics_Reference/ch_vec_reference.xml
+++ b/Intrinsics_Reference/ch_vec_reference.xml
@@ -23,8 +23,95 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="VIPR.vec-ref">
  <section>
    <title>How to Use This Reference</title>
    <para>
-      Brief description of the format of the entries, the cross-reference
-      index, and so forth.
+      This chapter contains reference material for each supported
+      vector intrinsic.  The information for each intrinsic includes:
+    </para>
+    <itemizedlist>
+      <listitem>
+	<para>
+	  The intrinsic name and extended name;
+	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  A type-free example of the intrinsic's usage;
+	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  A description of the intrinsic's purpose;
+	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  A description of the value(s) returned from the intrinsic,
+	  if any;
+	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  A description of any unusual characteristics of the
+	  intrinsic when different target endiannesses are in force.
+	  If the semantics of the intrinsic in big-endian and
+	  little-endian modes are identical, the description will read
+	  "None.";
+	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  Optionally, additional explanatory notes about the
+	  intrinsic; and
+	</para>
+      </listitem>
+      <listitem>
+	<para>
+	  A table of supported type signatures for the intrinsic.
+	</para>
+      </listitem>
+    </itemizedlist>
+    <para>
+      Most intrinsics are overloaded, supporting multiple type
+      signatures.  The types of the input arguments always determine
+      the type of the result; that is, it is not possible to
+      define two intrinsic overloads with the same input argument
+      types and different result types.
+    </para>
+    <para>
+      The type-free example of the intrinsic's usage uses the
+      convention that <emphasis role="bold">r</emphasis> represents
+      the result of the intrinsic, and <emphasis
+      role="bold">a</emphasis>, <emphasis role="bold">b</emphasis>,
+      etc., represent the input arguments.  The allowed type
+      combinations of these variables are shown as rows in the table
+      of supported type signatures.
+    </para>
+    <para>
+      Each row contains at least one example implementation.  This
+      shows one way that a conforming compiler might achieve the
+      intended semantics of the intrinsic, but compilers are not
+      required to generate this code specifically.  The letters
+      <emphasis role="bold">r</emphasis>, <emphasis
+      role="bold">a</emphasis>, <emphasis role="bold">b</emphasis>,
+      etc., in the examples represent vector registers containing the
+      values of those variables.  The letters <emphasis
+      role="bold">t</emphasis>, <emphasis role="bold">u</emphasis>,
+      etc., represent vector registers containing temporary
+      intermediate results.  The same register is assumed to be used
+      for each instance of one of these letters.
+    </para>
+    <para>
+      When implementations differ for big- and little-endian targets,
+      separate example implementations are shown for each endianness.
+    </para>
+    <para>
+      The implementations show which vector instructions are used in
+      the implementation of a particular intrinsic.  When trying to
+      determine which intrinsic to use, it can be useful to have a
+      cross-reference from a specific vector instruction to the
+      intrinsics whose implementations make use of it.  This manual
+      contains such a cross-reference (<xref
+      linkend="section_isa_intrin_xref" />) for the programmer's
+      convenience.
    </para>
  </section>
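
As an illustration of the "Help the Compiler Help You" advice added to ch_techniques.xml in this commit, here is a minimal sketch in C of a loop kernel written with autovectorization in mind. The names scale_add, scale_add_demo, in_data, and out_data are hypothetical and do not appear in the manual. __builtin_assume_aligned is a GCC and Clang builtin, used here as one possible way to state pointer alignment inside a function body. Whether the loop is actually vectorized depends on the compiler, its version, and the optimization level.

```c
#include <stddef.h>

#define N 1024

/* Arrays aligned on a 16-byte boundary, as the text recommends. */
static double in_data[N] __attribute__((aligned(16)));
static double out_data[N] __attribute__((aligned(16)));

/* Hypothetical kernel: out[i] = k * in[i] + out[i].
 * __restrict__ promises that outp and inp do not overlap.
 * __builtin_assume_aligned (GCC/Clang builtin) promises that both
 * pointers are 16-byte aligned; passing unaligned pointers would be
 * undefined behavior. */
static void scale_add(double *__restrict__ outp,
                      const double *__restrict__ inp,
                      double k, size_t n)
{
    double *out = __builtin_assume_aligned(outp, 16);
    const double *in = __builtin_assume_aligned(inp, 16);

    /* Plain scalar loop; the compiler is free to vectorize it. */
    for (size_t i = 0; i < n; i++)
        out[i] = k * in[i] + out[i];
}

/* Hypothetical driver: fills the arrays, runs the kernel, and
 * returns the sum of the results so the work is observable. */
static double scale_add_demo(void)
{
    for (size_t i = 0; i < N; i++) {
        in_data[i] = (double)i;
        out_data[i] = 1.0;
    }

    scale_add(out_data, in_data, 2.0, N);

    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        sum += out_data[i];
    return sum;
}
```

With GCC, compiling at -O3 and adding -fopt-info-vec is one way to ask the compiler to report which loops it vectorized. The kernel itself stays ordinary scalar C, so it remains portable if the loop is not vectorized.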