diff --git a/Intrinsics_Reference/ch_biendian.xml b/Intrinsics_Reference/ch_biendian.xml
index d313167..958ba41 100644
--- a/Intrinsics_Reference/ch_biendian.xml
+++ b/Intrinsics_Reference/ch_biendian.xml
@@ -1034,18 +1034,149 @@ register vector double vd = vec_splats(*double_ptr);
-
-
- Limitations
-
- vec_sld
-
-
- vec_perm
-
+ Examples and Limitations
+
+ Unaligned vector access
+
+ A common programming error is to cast a pointer to a base type
+ (such as int) to a pointer of the corresponding vector type
+ (such as vector int), and then dereference the pointer. This
+ constitutes undefined behavior, because it casts a pointer with
+ a smaller alignment requirement to a pointer with a larger
+ alignment requirement. Compilers may not produce code that you
+ expect in the presence of undefined behavior.
+
+
+ Thus, do not write the following:
+
+ int a[4096];
+ vector int x = *((vector int *) a);
+
+ Instead, write this:
+
+ int a[4096];
+ vector int x = vec_xl (0, a);
+
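+ For readers without a POWER toolchain at hand, the same idea can
+ be sketched portably. This is only an illustration under stated
+ assumptions: v4si is a hypothetical stand-in for vector int built
+ with the GCC/Clang vector_size extension, and load_unaligned_v4si
+ is a made-up helper name. On POWER, vec_xl remains the form to
+ use.

```c
#include <string.h>

/* Hypothetical stand-in for `vector int`; relies on the GCC/Clang
   vector_size extension so the sketch builds on non-POWER hosts. */
typedef int v4si __attribute__ ((vector_size (16)));

/* Load four ints from an address that is only guaranteed to be
   int-aligned.  memcpy places no alignment requirement on its
   source, so no pointer cast to the vector type is needed. */
static v4si load_unaligned_v4si (const int *p)
{
  v4si v;
  memcpy (&v, p, sizeof v);
  return v;
}
```

+ The point of the sketch is that the int pointer is never
+ converted to a pointer with a stricter alignment requirement.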
+
+ vec_sld is not bi-endian
+
+ One oddity in the bi-endian vector programming model is that
+ vec_sld has big-endian semantics for code compiled for both
+ big-endian and little-endian targets. That is, any code that
+ uses vec_sld without guarding it with a test on endianness is
+ likely to be incorrect.
+
+
+ At the time that the bi-endian model was being developed, it
+ was discovered that existing code in several Linux packages
+ was using vec_sld in order to perform multiplies, or to
+ otherwise shift portions of base elements left. A
+ straightforward little-endian implementation of vec_sld would
+ concatenate the two input vectors in reverse order and shift
+ bytes to the right. This would only give compatible results
+ for vector char types. Those using this intrinsic as a cheap
+ multiply, or to shift bytes within larger elements, would see
+ different results on little-endian versus big-endian with such
+ an implementation. Therefore it was decided that vec_sld would
+ not have a bi-endian implementation.
+
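+ To make the big-endian semantics concrete, here is a minimal
+ scalar sketch of vec_sld (a, b, n). It assumes the register-image
+ view, with bytes written left to right, and a shift count n in
+ the range 0 to 15. The name sld_model is hypothetical and the
+ code is a model, not the intrinsic itself.

```c
#include <stdint.h>

/* Scalar model of vec_sld (a, b, n) on register images written
   left to right: concatenate a and b into 32 bytes, shift left by
   n bytes (0 <= n <= 15), and keep the leftmost 16 bytes. */
static void sld_model (const uint8_t a[16], const uint8_t b[16],
                       unsigned n, uint8_t out[16])
{
  for (unsigned i = 0; i < 16; i++) {
    unsigned src = i + n;              /* index into a followed by b */
    out[i] = src < 16 ? a[src] : b[src - 16];
  }
}
```

+ In this model the operation is the same on both targets. What
+ differs is how the elements of a and b are laid out in the
+ register, which is why unguarded uses are likely to be incorrect.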
+
+ vec_sro is not bi-endian for similar reasons.
+
+
+
+ Limitations on bi-endianness of vec_perm
+
+ The vec_perm intrinsic is bi-endian, provided that it is used
+ to reorder entire elements of the input vectors.
+
+
+ To see why, examine the code generation for the following:
+
+ vector int t;
+ vector int a = (vector int){0x00010203, 0x04050607, 0x08090a0b, 0x0c0d0e0f};
+ vector int b = (vector int){0x10111213, 0x14151617, 0x18191a1b, 0x1c1d1e1f};
+ vector char c = (vector char){0,1,2,3,28,29,30,31,12,13,14,15,20,21,22,23};
+ t = vec_perm (a, b, c);
+
+ For big endian, a compiler should generate:
+
+ vperm t,a,b,c
+
+ For little endian targeting a POWER8 system, a compiler should
+ generate:
+
+ vnand d,c,c
+ vperm t,b,a,d
+
+ For little endian targeting a POWER9 system, a compiler should
+ generate:
+
+ vpermr t,b,a,c
+
+ Note that the vpermr instruction itself performs the
+ modification of the permute control vector (PCV) c that
+ required the vnand instruction for POWER8. Because only the
+ bottom 5 bits of each element of the PCV are read by the
+ hardware, complementing the PCV has the effect of subtracting
+ each original element of the PCV from 31.
+
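+ The equivalence between complementing and subtracting from 31
+ can be checked with a short scalar sketch. The helper name
+ nand_self is hypothetical. It models a single byte lane of
+ vnand d,c,c.

```c
/* Model of one byte lane of "vnand d,c,c": NAND of a value with
   itself is simply its bitwise complement. */
static unsigned char nand_self (unsigned char c)
{
  return (unsigned char) ~(c & c);
}

/* The permute hardware reads only the bottom 5 bits of each PCV
   element, so this is the index it actually sees. */
static unsigned effective_index (unsigned char pcv_byte)
{
  return pcv_byte & 0x1fu;
}
```

+ For any byte x, effective_index (nand_self (x)) equals
+ 31 - effective_index (x), because a 5-bit value and its
+ complement always sum to 31.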
+
+ Note also that the PCV c has element values that are
+ contiguous in groups of 4. This selects entire elements from
+ the input vectors a and b to reorder. Thus the intent of the
+ code is to select the first integer element of a, the last
+ integer element of b, the last integer element of a, and the
+ second integer element of b, in that order.
+
+
+ For little endian, each element of the PCV is subtracted from
+ 31, giving the modified PCV
+ {31,30,29,28,3,2,1,0,19,18,17,16,11,10,9,8}.
+ Since the elements appear in reverse order in a register when
+ loaded from little-endian memory, the elements appear in the
+ register from left to right as
+ {8,9,10,11,16,17,18,19,0,1,2,3,28,29,30,31}. So the following
+ vperm instruction will again select entire elements using the
+ groups of 4 contiguous bytes, and the values of the integers
+ will be reordered without compromising each integer's
+ contents. The fact that the little-endian result matches the
+ big-endian result is left as an exercise to the reader.
+
+
+ Now, suppose instead that the original PCV does not reorder
+ entire integers at once:
+
+ vector char c = (vector char){0,20,31,4,7,17,6,19,30,3,2,8,9,13,5,22};
+
+ The result of the big-endian implementation would be:
+
+ t = {0x00141f04, 0x07110613, 0x1e030208, 0x090d0516};
+
+ For little-endian, the modified PCV would be
+ {31,11,0,27,24,14,25,12,1,28,29,23,22,18,26,9}, appearing in
+ the register as
+ {9,26,18,22,23,29,28,1,12,25,14,24,27,0,11,31}. The final
+ little-endian result would be
+
+ t = {0x071c1703, 0x10051204, 0x0b01001d, 0x15060e0a};
+
+ which bears no resemblance to the big-endian result.
+
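+ Both PCVs discussed in this section can be checked with a
+ portable scalar model of the generated code. This is a sketch,
+ not the intrinsic: registers are modeled as 16-byte arrays
+ written left to right, the little-endian path follows the POWER8
+ sequence (vnand, then vperm with swapped inputs), and all
+ function names are hypothetical.

```c
#include <stdint.h>

/* Place four 32-bit elements into a register image written left to
   right.  On little endian, element 0 sits in the rightmost slot. */
static void int_to_reg (const uint32_t e[4], int little, uint8_t r[16])
{
  for (int i = 0; i < 4; i++) {
    int slot = little ? 3 - i : i;
    for (int k = 0; k < 4; k++)
      r[4 * slot + k] = (uint8_t) (e[i] >> (24 - 8 * k));
  }
}

/* Inverse of int_to_reg: read the elements back out of the image. */
static void reg_to_int (const uint8_t r[16], int little, uint32_t e[4])
{
  for (int i = 0; i < 4; i++) {
    int slot = little ? 3 - i : i;
    e[i] = 0;
    for (int k = 0; k < 4; k++)
      e[i] = (e[i] << 8) | r[4 * slot + k];
  }
}

/* Model of vperm: result byte i is byte (pcv[i] & 31) of the
   32 bytes formed by x followed by y. */
static void vperm (const uint8_t x[16], const uint8_t y[16],
                   const uint8_t pcv[16], uint8_t t[16])
{
  for (int i = 0; i < 16; i++) {
    unsigned sel = pcv[i] & 31u;
    t[i] = sel < 16 ? x[sel] : y[sel - 16];
  }
}

/* Model of t = vec_perm (a, b, c) for vector int inputs. */
static void vec_perm_model (const uint32_t a[4], const uint32_t b[4],
                            const uint8_t c[16], int little,
                            uint32_t t[4])
{
  uint8_t ra[16], rb[16], rc[16], rt[16];
  int_to_reg (a, little, ra);
  int_to_reg (b, little, rb);
  for (int i = 0; i < 16; i++)
    /* LE: bytes are reversed in the register, then complemented. */
    rc[i] = little ? (uint8_t) ~c[15 - i] : c[i];
  if (little)
    vperm (rb, ra, rc, rt);     /* vperm t,b,a,d */
  else
    vperm (ra, rb, rc, rt);     /* vperm t,a,b,c */
  reg_to_int (rt, little, t);
}
```

+ Calling vec_perm_model with little set to 0 and then to 1 gives
+ the same four integers for the first PCV, and the two different
+ results shown above for the second.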
+
+ The lesson here is to use vec_perm only to reorder entire
+ elements of a vector. If you must use vec_perm for another
+ purpose, your code must include a test for endianness and
+ separate algorithms for big- and little-endian.
+
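+ One way to apply this rule during development is to check a
+ constant PCV before relying on it. The sketch below is an
+ illustration only. The helper pcv_moves_whole_elements is a
+ hypothetical name, and it assumes an element size that divides
+ 16 (1, 2, 4, 8, or 16 bytes).

```c
#include <stdint.h>

/* Return 1 if every group of elem_size bytes in the PCV selects one
   whole, naturally aligned element of the 32 input bytes, and 0
   otherwise.  elem_size is assumed to divide 16. */
static int pcv_moves_whole_elements (const uint8_t pcv[16],
                                     unsigned elem_size)
{
  for (unsigned i = 0; i < 16; i += elem_size) {
    unsigned first = pcv[i] & 31u;     /* hardware reads 5 bits */
    if (first % elem_size != 0)
      return 0;                        /* starts inside an element */
    for (unsigned k = 1; k < elem_size; k++)
      if ((pcv[i + k] & 31u) != first + k)
        return 0;                      /* bytes are not contiguous */
  }
  return 1;
}
```

+ With an element size of 4, the first PCV in this section passes
+ the check and the second one fails it.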
+