<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright (c) 2017 OpenPOWER Foundation
  
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
  
-->
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="sec_crossing_lanes">
  <title>Crossing lanes</title>
  
  <para>Vector SIMD units prefer to keep 
  computations in the same “lane” (element number) as the input elements. The 
  only exception in the examples so far are the occasional vector splat (copy one 
  element to all the other elements of the vector) operations. Splat is an 
  example of the general category of “permute” operations (Intel would call 
  this a “shuffle” or “blend”). </para>

 <para>Permutes select and rearrange the 
  elements of an input vector (or a concatenated pair of vectors) and deliver those 
  selected elements, in a specific order, to a result vector. The selection and 
  order of elements in the result is controlled by a third operand, either as a 3rd 
  input vector or as an immediate field of the instruction.</para>

  <para>For example, consider the Intel intrisics for 
  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd&amp;expand=2757,4767,409,2757">Horizontal Add / Subtract</link> 
  added with SSE3. These instrinsics add (subtract) adjacent element pairs across a pair of 
  input vectors, placing the sum of the adjacent elements in the result vector. 
  For example 
  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd_ps&amp;expand=2757,4767,409,2757,2757">_mm_hadd_ps</link>  
  which implements the operation on float:
  <programlisting><![CDATA[	result[0] = __A[1] + __A[0];
	result[1] = __A[3] + __A[2];
	result[2] = __B[1] + __B[0];
	result[3] = __B[3] + __B[2];]]></programlisting></para>

  <para>Horizontal Add (hadd) provides an incremental vector “sum across” 
  operation commonly needed in matrix and vector transform math. Horizontal Add 
  is incremental as you need three hadd instructions to sum across 4 vectors of 4 
  elements ( 7 for 8 x 8, 15 for 16 x 16, …).</para>
 
  <para>The PowerISA does not have a sum-across operation for float or 
  double. We can user the vector float add instruction after we rearrange the 
  inputs so that element pairs line up for the horizontal add. For example we 
  would need to permute the input vectors {1, 2, 3, 4} and {101, 102, 103, 104} 
  into vectors {2, 4, 102, 104} and {1, 3, 101, 103} before 
  the  <literal>vec_add</literal>. This 
  requires two vector permutes to align the elements into the correct lanes for 
  the vector add (to implement Horizontal Add).  </para>

  <para>The PowerISA provides generalized byte-level vector permute (vperm) 
  based on a vector register pair (32 bytes) source as input and a (16-byte) control vector. 
  The control 
  vector provides 16 indexes (0-31) to select bytes from the concatenated input 
  vector register pair (VRA, VRB). There are also predefined permutes (splat, pack, unpack, 
  merge) operations (across element sizes) that are encoded as separate 
  instruction op-codes or instruction immediate fields.</para>

  <para>Unfortunately only the general <literal>vec_perm</literal> 
  can provide the realignment 
  we need for the _mm_hadd_ps operation or any of the int, short variants of hadd. 
  For example:
  <programlisting><![CDATA[extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_hadd_ps (__m128 __X, __m128 __Y)
{
    __vector unsigned char xform2 = {
        0x00, 0x01, 0x02, 0x03,  0x08, 0x09, 0x0A, 0x0B,  
        0x10, 0x11, 0x12, 0x13,  0x18, 0x19, 0x1A, 0x1B
      };
    __vector unsigned char xform1 = {
        0x04, 0x05, 0x06, 0x07,  0x0C, 0x0D, 0x0E, 0x0F,  
        0x14, 0x15, 0x16, 0x17,  0x1C, 0x1D, 0x1E, 0x1F
      };
    return (__m128) vec_add (vec_perm ((__v4sf) __X, (__v4sf) __Y, xform1),
                             vec_perm ((__v4sf) __X, (__v4sf) __Y, xform2));
}]]></programlisting></para>

  <para>This requires two permute control vectors; one to select the even 
  word elements across <literal>__X</literal> and <literal>__Y</literal>, 
  and another to select the odd word elements 
  across <literal>__X</literal> and <literal>__Y</literal>. 
  The results of these permutes (<literal>vec_perm</literal>) are inputs to the 
  <literal>vec_add</literal> that completes the horizontal add operation. </para>

  <para>Fortunately the permute required for the double (64-bit) case 
  (<literal>_mm_hadd_pd</literal>) reduces to the equivalent of 
  <literal>vec_mergeh</literal> / <literal>vec_mergel</literal>  doubleword 
  (which are variants of  VSX Permute Doubleword Immediate). So the 
  implementation of _mm_hadd_pd can be simplified to this:
  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_hadd_pd (__m128d __X, __m128d __Y)
{
	return (__m128d) vec_add (vec_mergeh ((__v2df) __X, (__v2df)__Y),
			                  vec_mergel ((__v2df) __X, (__v2df)__Y));
}]]></programlisting></para>

  <para>This eliminates the load of the control vectors required by the 
  previous example.</para>

</section>