|
|
|
@ -1034,18 +1034,149 @@ register vector double vd = vec_splats(*double_ptr);</programlisting>
|
|
|
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section>
|
|
|
|
|
<title>Examples</title>
|
|
|
|
|
<para>filler</para>
|
|
|
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section>
|
|
|
|
|
<title>Limitations</title>
|
|
|
|
|
<para>
|
|
|
|
|
<code>vec_sld</code>
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
<code>vec_perm</code>
|
|
|
|
|
</para>
|
|
|
|
|
<title>Examples and Limitations</title>
|
|
|
|
|
<section>
|
|
|
|
|
<title>Unaligned vector access</title>
|
|
|
|
|
<para>
|
|
|
|
|
A common programming error is to cast a pointer to a base type
|
|
|
|
|
(such as <code>int</code>) to a pointer of the corresponding
|
|
|
|
|
vector type (such as <code>vector int</code>), and then
|
|
|
|
|
dereference the pointer. This constitutes undefined behavior,
|
|
|
|
|
because it casts a pointer with a smaller alignment
|
|
|
|
|
requirement to a pointer with a larger alignment requirement.
|
|
|
|
|
Compilers may not produce the code you expect in the presence
of undefined behavior.
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
Thus, do not write the following:
|
|
|
|
|
</para>
|
|
|
|
|
<programlisting> int a[4096];
 vector int x = *((vector int *) a);</programlisting>
|
|
|
|
|
<para>
|
|
|
|
|
Instead, write this:
|
|
|
|
|
</para>
|
|
|
|
|
<programlisting> int a[4096];
 vector int x = vec_xl (0, a);</programlisting>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>vec_sld is not bi-endian</title>
|
|
|
|
|
<para>
|
|
|
|
|
One oddity in the bi-endian vector programming model is that
|
|
|
|
|
<code>vec_sld</code> has big-endian semantics for code
|
|
|
|
|
compiled for both big-endian and little-endian targets. As a
result, any code that uses <code>vec_sld</code> without guarding
it with a test on endianness is likely to be incorrect.
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
At the time that the bi-endian model was being developed, it
|
|
|
|
|
was discovered that existing code in several Linux packages
|
|
|
|
|
was using <code>vec_sld</code> in order to perform multiplies,
|
|
|
|
|
or to otherwise shift portions of base elements left. A
|
|
|
|
|
straightforward little-endian implementation of
|
|
|
|
|
<code>vec_sld</code> would concatenate the two input vectors
|
|
|
|
|
in reverse order and shift bytes to the right. This would
|
|
|
|
|
only give compatible results for <code>vector char</code>
|
|
|
|
|
types. Those using this intrinsic as a cheap multiply, or to
|
|
|
|
|
shift bytes within larger elements, would see different
|
|
|
|
|
results on little-endian versus big-endian with such an
|
|
|
|
|
implementation. Therefore it was decided that
|
|
|
|
|
<code>vec_sld</code> would not have a bi-endian
|
|
|
|
|
implementation.
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
<code>vec_sro</code> is not bi-endian for similar reasons.
|
|
|
|
|
</para>
|
|
|
|
|
</section>
|
|
|
|
|
<section>
|
|
|
|
|
<title>Limitations on bi-endianness of vec_perm</title>
|
|
|
|
|
<para>
|
|
|
|
|
The <code>vec_perm</code> intrinsic is bi-endian, provided
|
|
|
|
|
that it is used to reorder entire elements of the input
|
|
|
|
|
vectors.
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
To see why, consider the code generation for the following:
|
|
|
|
|
</para>
|
|
|
|
|
<programlisting> vector int t;
 vector int a = (vector int){0x00010203, 0x04050607, 0x08090a0b, 0x0c0d0e0f};
 vector int b = (vector int){0x10111213, 0x14151617, 0x18191a1b, 0x1c1d1e1f};
 vector char c = (vector char){0,1,2,3,28,29,30,31,12,13,14,15,20,21,22,23};
 t = vec_perm (a, b, c);</programlisting>
|
|
|
|
|
<para>
|
|
|
|
|
For big endian, a compiler should generate:
|
|
|
|
|
</para>
|
|
|
|
|
<programlisting> vperm t,a,b,c</programlisting>
|
|
|
|
|
<para>
|
|
|
|
|
For little endian targeting a POWER8 system, a compiler should
|
|
|
|
|
generate:
|
|
|
|
|
</para>
|
|
|
|
|
<programlisting> vnand d,c,c
 vperm t,b,a,d</programlisting>
|
|
|
|
|
<para>
|
|
|
|
|
For little endian targeting a POWER9 system, a compiler should
|
|
|
|
|
generate:
|
|
|
|
|
</para>
|
|
|
|
|
<programlisting> vpermr t,b,a,c</programlisting>
|
|
|
|
|
<para>
|
|
|
|
|
Note that the <code>vpermr</code> instruction itself performs the
modification of the permute control vector (PCV) <code>c</code> that
the POWER8 sequence performs with the <code>vnand</code> instruction.
The <code>vnand</code> complements every bit of the PCV. Because only
the bottom 5 bits of each element of the PCV are read by the hardware,
this has the effect of subtracting each original element of the PCV
from 31.
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
Note also that the PCV <code>c</code> has element values that
|
|
|
|
|
are contiguous in groups of 4. This selects entire elements
|
|
|
|
|
from the input vectors <code>a</code> and <code>b</code> to
|
|
|
|
|
reorder. Thus the intent of the code is to select the first
|
|
|
|
|
integer element of <code>a</code>, the last integer element of
|
|
|
|
|
<code>b</code>, the last integer element of <code>a</code>,
|
|
|
|
|
and the second integer element of <code>b</code>, in that
|
|
|
|
|
order.
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
For little endian, each element of the original PCV is subtracted
from 31, giving the modified PCV {31,30,29,28,3,2,1,0,19,18,17,16,11,10,9,8}.
|
|
|
|
|
Since the elements appear in reverse order in a register when
|
|
|
|
|
loaded from little-endian memory, the elements appear in the
|
|
|
|
|
register from left to right as
|
|
|
|
|
{8,9,10,11,16,17,18,19,0,1,2,3,28,29,30,31}. So the following
|
|
|
|
|
<code>vperm</code> instruction will again select entire
|
|
|
|
|
elements using the groups of 4 contiguous bytes, and the
|
|
|
|
|
values of the integers will be reordered without compromising
|
|
|
|
|
each integer's contents. The fact that the little-endian
|
|
|
|
|
result matches the big-endian result is left as an exercise for
|
|
|
|
|
the reader.
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
Now, suppose instead that the original PCV does not reorder
|
|
|
|
|
entire integers at once:
|
|
|
|
|
</para>
|
|
|
|
|
<programlisting> vector char c = (vector char){0,20,31,4,7,17,6,19,30,3,2,8,9,13,5,22};</programlisting>
|
|
|
|
|
<para>
|
|
|
|
|
The result of the big-endian implementation would be:
|
|
|
|
|
</para>
|
|
|
|
|
<programlisting> t = {0x00141f04, 0x07110613, 0x1e030208, 0x090d0516};</programlisting>
|
|
|
|
|
<para>
|
|
|
|
|
For little-endian, the modified PCV would be
|
|
|
|
|
{31,11,0,27,24,14,25,12,1,28,29,23,22,18,26,9}, appearing in
|
|
|
|
|
the register as
|
|
|
|
|
{9,26,18,22,23,29,28,1,12,25,14,24,27,0,11,31}. The final
|
|
|
|
|
little-endian result would be
|
|
|
|
|
</para>
|
|
|
|
|
<programlisting> t = {0x071c1703, 0x10051204, 0x0b01001d, 0x15060e0a};</programlisting>
|
|
|
|
|
<para>
|
|
|
|
|
which bears no resemblance to the big-endian result.
|
|
|
|
|
</para>
|
|
|
|
|
<para>
|
|
|
|
|
The lesson here is to use <code>vec_perm</code> only to
reorder entire elements of a vector. If you must use <code>vec_perm</code>
|
|
|
|
|
for another purpose, your code must include a test for
|
|
|
|
|
endianness and separate algorithms for big- and
|
|
|
|
|
little-endian.
|
|
|
|
|
</para>
|
|
|
|
|
</section>
|
|
|
|
|
</section>
|
|
|
|
|
|
|
|
|
|
</chapter>
|
|
|
|
|