diff --git a/Intrinsics_Reference/ch_biendian.xml b/Intrinsics_Reference/ch_biendian.xml
index d313167..958ba41 100644
--- a/Intrinsics_Reference/ch_biendian.xml
+++ b/Intrinsics_Reference/ch_biendian.xml
@@ -1034,18 +1034,149 @@ register vector double vd = vec_splats(*double_ptr);
- Examples
- filler
-
-
-
- Limitations
-
- vec_sld
-
-
- vec_perm
-
+ Examples and Limitations
+
+ Unaligned vector access
+
+ A common programming error is to cast a pointer to a base type
+ (such as int) to a pointer of the corresponding
+ vector type (such as vector int), and then
+ dereference the pointer. This constitutes undefined behavior,
+ because it casts a pointer with a smaller alignment
+ requirement to a pointer with a larger alignment requirement.
+ In the presence of undefined behavior, compilers may not
+ produce the code that you expect.
+
+
+ Thus, do not write the following:
+
+ int a[4096];
+ vector int x = *((vector int *) a);
+
+ Instead, write this:
+
+ int a[4096];
+ vector int x = vec_xl (0, a);
+
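The alignment mismatch described above can be observed without vector hardware. The following is a minimal portable C11 sketch, not part of the patch: `model_v4i32`, `is_vector_aligned`, and `load_v4i32` are hypothetical names, and the struct is only a stand-in for `vector int` under the assumption that the vector type requires 16-byte alignment. The `memcpy`-based load makes no alignment assumption about its source, which is the property the text relies on when it recommends `vec_xl`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 16-byte vector type; NOT the real `vector int`. */
typedef struct { _Alignas(16) int32_t lane[4]; } model_v4i32;

/* Nonzero when p meets the 16-byte alignment that dereferencing a
   vector pointer would assume. */
static int is_vector_aligned(const void *p)
{
    return ((uintptr_t)p % _Alignof(model_v4i32)) == 0;
}

/* Alignment-agnostic load: memcpy places no alignment requirement on src,
   so this is well defined for any valid pointer to four ints. */
static model_v4i32 load_v4i32(const int32_t *src)
{
    model_v4i32 v;
    memcpy(v.lane, src, sizeof v.lane);
    return v;
}
```

An `int` array is only guaranteed the alignment of `int`, so `is_vector_aligned` may return zero for it or for any address inside it; `load_v4i32` is valid in either case.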
+
+ vec_sld is not bi-endian
+
+ One oddity in the bi-endian vector programming model is that
+ vec_sld has big-endian semantics for code
+ compiled for both big-endian and little-endian targets. As a
+ result, any code that uses vec_sld without guarding
+ it with a test on endianness is likely to be incorrect.
+
+
+ At the time that the bi-endian model was being developed, it
+ was discovered that existing code in several Linux packages
+ was using vec_sld to perform multiplies,
+ or to otherwise shift portions of base elements left. A
+ straightforward little-endian implementation of
+ vec_sld would concatenate the two input vectors
+ in reverse order and shift bytes to the right. This would
+ only give compatible results for vector char
+ types. Code using this intrinsic as a cheap multiply, or to
+ shift bytes within larger elements, would see different
+ results on little-endian versus big-endian with such an
+ implementation. Therefore it was decided that
+ vec_sld would not have a bi-endian
+ implementation.
+
+
+ vec_sro is not bi-endian for similar reasons.
+
+
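To make the incompatibility concrete, here is a small portable C model, offered as a sketch rather than as the real intrinsic. The names `model_v4u32` and `model_sld` are hypothetical. With `little_endian` set to 0 the model shifts bytes in big-endian order, which is the behavior the text attributes to `vec_sld`. With it set to 1 the model shifts bytes in little-endian memory order, which is assumed here to correspond to the "straightforward little-endian implementation" that was rejected.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model type: four 32-bit elements, element 0 first. */
typedef struct { uint32_t e[4]; } model_v4u32;

/* Serialize elements to bytes in the chosen byte order. */
static void to_bytes(const model_v4u32 *v, int little_endian, uint8_t out[16])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            int shift = little_endian ? 8 * j : 24 - 8 * j;
            out[4 * i + j] = (uint8_t)(v->e[i] >> shift);
        }
}

/* Rebuild elements from bytes in the chosen byte order. */
static model_v4u32 from_bytes(const uint8_t in[16], int little_endian)
{
    model_v4u32 v = {{0, 0, 0, 0}};
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            int shift = little_endian ? 8 * j : 24 - 8 * j;
            v.e[i] |= (uint32_t)in[4 * i + j] << shift;
        }
    return v;
}

/* Concatenate a and b as 32 bytes, drop the first n bytes, keep the next 16. */
static model_v4u32 model_sld(model_v4u32 a, model_v4u32 b, unsigned n,
                             int little_endian)
{
    uint8_t cat[32];
    assert(n <= 15);
    to_bytes(&a, little_endian, cat);
    to_bytes(&b, little_endian, cat + 16);
    return from_bytes(cat + n, little_endian);
}
```

With `a = {1, 2, 3, 4}`, a zero second operand, and `n = 1`, the big-endian model yields `{0x100, 0x200, 0x300, 0x400}`, so every element is multiplied by 256. The memory-order little-endian model yields `{0x02000000, 0x03000000, 0x04000000, 0}`, which is why code using the shift as a cheap multiply would break under such an implementation.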
+
+ Limitations on bi-endianness of vec_perm
+
+ The vec_perm intrinsic is bi-endian, provided
+ that it is used to reorder entire elements of the input
+ vectors.
+
+
+ To see why, consider the code generation for the following:
+
+ vector int t;
+ vector int a = (vector int){0x00010203, 0x04050607, 0x08090a0b, 0x0c0d0e0f};
+ vector int b = (vector int){0x10111213, 0x14151617, 0x18191a1b, 0x1c1d1e1f};
+ vector char c = (vector char){0,1,2,3,28,29,30,31,12,13,14,15,20,21,22,23};
+ t = vec_perm (a, b, c);
+
+ For big endian, a compiler should generate:
+
+ vperm t,a,b,c
+
+ For little endian targeting a POWER8 system, a compiler should
+ generate:
+
+ vnand d,c,c
+ vperm t,b,a,d
+
+ For little endian targeting a POWER9 system, a compiler should
+ generate:
+
+ vpermr t,b,a,c
+
+ Note that the vpermr instruction itself performs the
+ modification of the permute control vector (PCV) c
+ that the vnand instruction performs for POWER8.
+ Because only the bottom 5 bits of each element of the PCV are
+ read by the hardware, complementing the PCV has the effect of
+ subtracting each of its original elements from 31.
+
+
+ Note also that the PCV c has element values that
+ are contiguous in groups of 4. This selects entire elements
+ from the input vectors a and b to
+ reorder. Thus the intent of the code is to select the first
+ integer element of a, the last integer element of
+ b, the last integer element of a,
+ and the second integer element of b, in that
+ order.
+
+
+ For little endian, each element of the PCV is subtracted from
+ 31, giving the modified PCV
+ {31,30,29,28,3,2,1,0,19,18,17,16,11,10,9,8}.
+ Since the elements appear in reverse order in a register when
+ loaded from little-endian memory, the elements appear in the
+ register from left to right as
+ {8,9,10,11,16,17,18,19,0,1,2,3,28,29,30,31}. Thus the subsequent
+ vperm instruction will again select entire
+ elements using the groups of 4 contiguous bytes, and the
+ values of the integers will be reordered without compromising
+ each integer's contents. The fact that the little-endian
+ result matches the big-endian result is left as an exercise for
+ the reader.
+
+
+ Now, suppose instead that the original PCV does not reorder
+ entire integers at once:
+
+ vector char c = (vector char){0,20,31,4,7,17,6,19,30,3,2,8,9,13,5,22};
+
+ The result of the big-endian implementation would be:
+
+ t = {0x00141f04, 0x07110613, 0x1e030208, 0x090d0516};
+
+ For little-endian, the modified PCV would be
+ {31,11,0,27,24,14,25,12,1,28,29,23,22,18,26,9}, appearing in
+ the register as
+ {9,26,18,22,23,29,28,1,12,25,14,24,27,0,11,31}. The final
+ little-endian result would be
+
+ t = {0x071c1703, 0x10051204, 0x0b01001d, 0x15060e0a};
+
+ which bears no resemblance to the big-endian result.
+
+
+ The lesson here is to use vec_perm only to
+ reorder entire elements of a vector. If you must use vec_perm
+ for another purpose, your code must include a test for
+ endianness and separate algorithms for big- and
+ little-endian.
+
+
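The register-level reasoning above can be checked with a small portable C simulation. This is a sketch under stated assumptions, not compiler output: `model_vperm`, `int_reg`, `reg_int`, `char_reg`, and `model_vec_perm` are hypothetical helpers. They assume that `vperm` picks byte `pcv[i] & 31` from the 32-byte concatenation of its two source registers, and that a little-endian load places element 0 in the rightmost register position. The little-endian path models the POWER8 `vnand`/`vperm` sequence shown in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Model of vperm: result byte i is byte (pcv[i] & 31) of x || y,
   with registers viewed left to right. */
static void model_vperm(const uint8_t x[16], const uint8_t y[16],
                        const uint8_t pcv[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned idx = pcv[i] & 31u;
        out[i] = idx < 16 ? x[idx] : y[idx - 16];
    }
}

/* Register image of four ints: element 0 is leftmost on big endian and
   rightmost on little endian; each element is stored MSB first. */
static void int_reg(const uint32_t e[4], int little_endian, uint8_t reg[16])
{
    for (int i = 0; i < 4; i++) {
        int k = little_endian ? 3 - i : i;
        for (int j = 0; j < 4; j++)
            reg[4 * k + j] = (uint8_t)(e[i] >> (24 - 8 * j));
    }
}

/* Inverse of int_reg: recover the four ints from a register image. */
static void reg_int(const uint8_t reg[16], int little_endian, uint32_t e[4])
{
    for (int i = 0; i < 4; i++) {
        int k = little_endian ? 3 - i : i;
        e[i] = 0;
        for (int j = 0; j < 4; j++)
            e[i] |= (uint32_t)reg[4 * k + j] << (24 - 8 * j);
    }
}

/* Register image of sixteen chars: reversed on little endian. */
static void char_reg(const uint8_t c[16], int little_endian, uint8_t reg[16])
{
    for (int i = 0; i < 16; i++)
        reg[little_endian ? 15 - i : i] = c[i];
}

/* Model of t = vec_perm(a, b, c) for the chosen endianness. */
static void model_vec_perm(const uint32_t a[4], const uint32_t b[4],
                           const uint8_t c[16], int little_endian,
                           uint32_t t[4])
{
    uint8_t ra[16], rb[16], rc[16], rt[16];
    int_reg(a, little_endian, ra);
    int_reg(b, little_endian, rb);
    char_reg(c, little_endian, rc);
    if (little_endian) {
        for (int i = 0; i < 16; i++)
            rc[i] = (uint8_t)~rc[i];      /* vnand d,c,c   */
        model_vperm(rb, ra, rc, rt);      /* vperm t,b,a,d */
    } else {
        model_vperm(ra, rb, rc, rt);      /* vperm t,a,b,c */
    }
    reg_int(rt, little_endian, t);
}
```

Running the model on the two PCVs from the text reproduces the values given there: for the whole-element PCV both endiannesses produce `{0x00010203, 0x1c1d1e1f, 0x0c0d0e0f, 0x14151617}`, while for the byte-scrambling PCV the big-endian and little-endian results diverge as listed above.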