<?xml version="1.0" encoding="UTF-8"?>
<!--
 Copyright (c) 2017 OpenPOWER Foundation

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
         xmlns:xi="http://www.w3.org/2001/XInclude"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         version="5.0"
         xml:id="sec_crossing_lanes">

  <title>Crossing lanes</title>

  <para>Vector SIMD units prefer to keep computations in the same “lane”
  (element number) as the input elements. The only exceptions in the examples
  so far are the occasional vector splat (copy one element to all the other
  elements of the vector) operations. Splat is an example of the general
  category of “permute” operations (Intel would call this a “shuffle” or
  “blend”).</para>

  <para>Permutes select and rearrange the elements of an input vector (or a
  concatenated pair of vectors) and deliver those selected elements, in a
  specific order, to a result vector. The selection and order of elements in
  the result are controlled by an additional operand, supplied either as a
  third input vector or as an immediate field of the instruction.</para>

  <para>For example, consider the Intel intrinsics for
  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd&amp;expand=2757,4767,409,2757">Horizontal Add / Subtract</link>
  added with SSE3. These intrinsics add (subtract) adjacent element pairs
  across a pair of input vectors, placing the sums (differences) of the
  adjacent elements in the result vector. For example,
  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd_ps&amp;expand=2757,4767,409,2757,2757">_mm_hadd_ps</link>
  implements this operation on float elements:
  <programlisting><![CDATA[ result[0] = __A[1] + __A[0];
 result[1] = __A[3] + __A[2];
 result[2] = __B[1] + __B[0];
 result[3] = __B[3] + __B[2];]]></programlisting></para>

  <para>Horizontal Add (hadd) provides an incremental vector “sum across”
  operation commonly needed in matrix and vector transform math. Horizontal
  Add is incremental in that you need three hadd instructions to sum across
  4 vectors of 4 elements (7 for 8 x 8, 15 for 16 x 16, …).</para>
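
  <para>As a minimal sketch of that incremental pattern (the function name
  <literal>sum_across_4x4</literal> is hypothetical, and a header providing
  <literal>_mm_hadd_ps</literal>, or a compatible implementation, is
  assumed), three hadd operations reduce four 4-element float vectors to
  their four sums:
  <programlisting><![CDATA[#include <pmmintrin.h> /* assumed to provide _mm_hadd_ps (or a compatible implementation) */

/* Sum across four (4 x float) vectors with three horizontal adds.
   Result element [i] holds the sum of the four elements of vi.  */
__m128
sum_across_4x4 (__m128 v0, __m128 v1, __m128 v2, __m128 v3)
{
  __m128 t0 = _mm_hadd_ps (v0, v1); /* {v0[1]+v0[0], v0[3]+v0[2], v1[1]+v1[0], v1[3]+v1[2]} */
  __m128 t1 = _mm_hadd_ps (v2, v3); /* likewise for v2 and v3 */

  return _mm_hadd_ps (t0, t1);      /* {sum(v0), sum(v1), sum(v2), sum(v3)} */
}]]></programlisting></para>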

  <para>The PowerISA does not have a sum-across operation for float or
  double. We can use the vector float add instruction after we rearrange the
  inputs so that element pairs line up for the horizontal add. For example,
  we would need to permute the input vectors {1, 2, 3, 4} and
  {101, 102, 103, 104} into vectors {2, 4, 102, 104} and {1, 3, 101, 103}
  before the <literal>vec_add</literal>. This requires two vector permutes to
  align the elements into the correct lanes for the vector add (to implement
  Horizontal Add).</para>
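
  <para>With the inputs rearranged this way, the element-wise
  <literal>vec_add</literal> delivers the horizontal sums in the expected
  lanes:
  <programlisting><![CDATA[  {2, 4, 102, 104}
+ {1, 3, 101, 103}
= {3, 7, 203, 207}   /* = {1+2, 3+4, 101+102, 103+104} */]]></programlisting></para>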

  <para>The PowerISA provides a generalized byte-level vector permute (vperm)
  that takes a vector register pair (32 bytes) as its source and a (16-byte)
  control vector. The control vector provides 16 indexes (0-31) that select
  bytes from the concatenated input vector register pair (VRA, VRB). There
  are also predefined permute operations (splat, pack, unpack, merge, across
  element sizes) that are encoded as separate instruction op-codes or
  instruction immediate fields.</para>
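
  <para>As a small illustration of the byte-level selection (this sketch is
  purely illustrative and not part of the hadd implementation that follows),
  supplying the same register for both inputs and a control vector of
  descending indexes reverses the bytes of a vector:
  <programlisting><![CDATA[#include <altivec.h>

/* Reverse the 16 bytes of a vector.  Each control byte is an index (0-31)
   into the concatenated 32-byte {a, a} source pair.  */
__vector unsigned char
reverse_bytes (__vector unsigned char a)
{
  const __vector unsigned char rev = {
    0x0F, 0x0E, 0x0D, 0x0C, 0x0B, 0x0A, 0x09, 0x08,
    0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00
  };

  return vec_perm (a, a, rev);
}]]></programlisting></para>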

  <para>Unfortunately only the general <literal>vec_perm</literal> can
  provide the realignment we need for the <literal>_mm_hadd_ps</literal>
  operation or any of the int and short variants of hadd. For example:
  <programlisting><![CDATA[extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_hadd_ps (__m128 __X, __m128 __Y)
{
  __vector unsigned char xform2 = {
    0x00, 0x01, 0x02, 0x03, 0x08, 0x09, 0x0A, 0x0B,
    0x10, 0x11, 0x12, 0x13, 0x18, 0x19, 0x1A, 0x1B
  };
  __vector unsigned char xform1 = {
    0x04, 0x05, 0x06, 0x07, 0x0C, 0x0D, 0x0E, 0x0F,
    0x14, 0x15, 0x16, 0x17, 0x1C, 0x1D, 0x1E, 0x1F
  };

  return (__m128) vec_add (vec_perm ((__v4sf) __X, (__v4sf) __Y, xform1),
                           vec_perm ((__v4sf) __X, (__v4sf) __Y, xform2));
}]]></programlisting></para>

  <para>This requires two permute control vectors: one to select the even
  word elements across <literal>__X</literal> and <literal>__Y</literal>, and
  another to select the odd word elements across <literal>__X</literal> and
  <literal>__Y</literal>. The results of these permutes
  (<literal>vec_perm</literal>) are the inputs to the
  <literal>vec_add</literal> that completes the horizontal add
  operation.</para>

  <para>Fortunately, the permute required for the double (64-bit) case
  (<literal>_mm_hadd_pd</literal>) reduces to the equivalent of the
  doubleword <literal>vec_mergeh</literal> / <literal>vec_mergel</literal>
  operations (which are variants of VSX Permute Doubleword Immediate). So the
  implementation of <literal>_mm_hadd_pd</literal> can be simplified to this:
  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_hadd_pd (__m128d __X, __m128d __Y)
{
  return (__m128d) vec_add (vec_mergeh ((__v2df) __X, (__v2df) __Y),
                            vec_mergel ((__v2df) __X, (__v2df) __Y));
}]]></programlisting></para>

  <para>This eliminates the load of the control vectors required by the
  previous example.</para>

</section>