Crossing lanes

Vector SIMD units prefer to keep computations in the same “lane” (element number) as the input elements. The only exceptions in the examples so far are the occasional vector splat operations, which copy one element to all the other elements of the vector. Splat is one example of the general category of “permute” operations (Intel would call this a “shuffle” or “blend”). Permutes select and rearrange the elements of an input vector (or a concatenated pair of vectors) and deliver the selected elements, in a specific order, to a result vector. The selection and order of elements in the result are controlled by a third operand, either a third input vector or an immediate field of the instruction.

For example, consider the Intel intrinsics for Horizontal Add / Subtract added with SSE3. These intrinsics add (subtract) adjacent element pairs across a pair of input vectors, placing the sum (difference) of each adjacent pair in the result vector. For example, _mm_hadd_ps implements the operation on float: for inputs X = {x0, x1, x2, x3} and Y = {y0, y1, y2, y3} it returns {x0 + x1, x2 + x3, y0 + y1, y2 + y3}.

Horizontal Add (hadd) provides an incremental vector “sum across” operation commonly needed in matrix and vector transform math. It is incremental because you need three hadd instructions to sum across 4 vectors of 4 elements (7 for 8 x 8, 15 for 16 x 16, …).

The PowerISA does not have a sum-across operation for float or double. We can use the vector float add instruction after we rearrange the inputs so that element pairs line up for the horizontal add. For example, we would need to permute the input vectors {1, 2, 3, 4} and {101, 102, 103, 104} into vectors {2, 4, 102, 104} and {1, 3, 101, 103} before the vec_add. This requires two vector permutes to align the elements into the correct lanes for the vector add that implements Horizontal Add.

The PowerISA provides a generalized byte-level vector permute (vperm) that takes a vector register pair (32 bytes) as its source and a 16-byte control vector.
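
The horizontal add semantics, and the "rearrange, then add lane by lane" view of it, can be sketched in portable scalar C. This is only an illustrative model, not the intrinsic itself: the names model_hadd_ps, model_hadd_ps_by_lanes and hadd_example are hypothetical helpers introduced here, and plain float arrays stand in for vector registers.

```c
#include <assert.h>

/* Scalar reference for the horizontal add on float: sum the adjacent
   pairs of x, then the adjacent pairs of y.
   (model_hadd_ps is a hypothetical name used only for this sketch.) */
void model_hadd_ps(const float x[4], const float y[4], float r[4])
{
    r[0] = x[0] + x[1];
    r[1] = x[2] + x[3];
    r[2] = y[0] + y[1];
    r[3] = y[2] + y[3];
}

/* Same result in two steps: first gather the odd and even elements into
   matching lanes (the job of the two permutes), then do an ordinary
   lane-wise add (the job of vec_add). */
void model_hadd_ps_by_lanes(const float x[4], const float y[4], float r[4])
{
    const float odd[4]  = { x[1], x[3], y[1], y[3] };
    const float even[4] = { x[0], x[2], y[0], y[2] };
    for (int i = 0; i < 4; i++)
        r[i] = odd[i] + even[i];
}

/* The example from the text: {1, 2, 3, 4} and {101, 102, 103, 104}. */
void hadd_example(void)
{
    const float x[4] = { 1, 2, 3, 4 };
    const float y[4] = { 101, 102, 103, 104 };
    float a[4], b[4];

    model_hadd_ps(x, y, a);
    model_hadd_ps_by_lanes(x, y, b);
    for (int i = 0; i < 4; i++)
        assert(a[i] == b[i]);   /* both views agree: {3, 7, 203, 207} */
}
```

With the inputs from the text, the rearranged vectors are {2, 4, 102, 104} and {1, 3, 101, 103}, and both functions produce {3, 7, 203, 207}.
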
The vperm control vector provides 16 indexes (0-31), one per result byte, each selecting a byte from the concatenated input vector register pair (VRA, VRB). There are also predefined permute operations (splat, pack, unpack, merge), across element sizes, that are encoded as separate instruction opcodes or instruction immediate fields.

Unfortunately only the general vec_perm can provide the realignment we need for the _mm_hadd_ps operation or any of the int and short variants of hadd. This requires two permute control vectors: one to select the even word elements across __X and __Y, and another to select the odd word elements across __X and __Y. The results of these permutes (vec_perm) are the inputs to the vec_add that completes the horizontal add operation.

Fortunately the permute required for the double (64-bit) case (_mm_hadd_pd) reduces to the equivalent of vec_mergeh / vec_mergel doubleword (which are variants of VSX Permute Doubleword Immediate). So the implementation of _mm_hadd_pd can be simplified to a vec_add of the vec_mergeh and vec_mergel results, which eliminates the load of the control vectors required by the _mm_hadd_ps implementation.
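
The two approaches described above can be modeled in portable scalar C. This is a hedged sketch rather than the actual implementation: model_vperm, model_hadd_ps_vperm, model_hadd_pd_merge, even_words and odd_words are hypothetical names, the byte model assumes a 4-byte float, and bytes are indexed in memory order so that the element order matches the text. Real code would use vec_perm, vec_mergeh, vec_mergel and vec_add on vector types instead of these array loops.

```c
#include <assert.h>
#include <string.h>

/* Scalar model of a byte-level permute: each control byte picks one byte
   (index 0-31) from the 32-byte concatenation of a and b. */
static void model_vperm(const void *a, const void *b,
                        const unsigned char ctl[16], void *result)
{
    unsigned char src[32], out[16];
    memcpy(src, a, 16);
    memcpy(src + 16, b, 16);
    for (int i = 0; i < 16; i++)
        out[i] = src[ctl[i] & 0x1F];   /* only the low 5 bits select */
    memcpy(result, out, 16);
}

/* Control vector selecting the even words: x[0], x[2], y[0], y[2]. */
static const unsigned char even_words[16] = {
    0x00, 0x01, 0x02, 0x03,  0x08, 0x09, 0x0A, 0x0B,
    0x10, 0x11, 0x12, 0x13,  0x18, 0x19, 0x1A, 0x1B
};

/* Control vector selecting the odd words: x[1], x[3], y[1], y[3]. */
static const unsigned char odd_words[16] = {
    0x04, 0x05, 0x06, 0x07,  0x0C, 0x0D, 0x0E, 0x0F,
    0x14, 0x15, 0x16, 0x17,  0x1C, 0x1D, 0x1E, 0x1F
};

/* Float case: two general permutes feed one lane-wise add. */
void model_hadd_ps_vperm(const float x[4], const float y[4], float r[4])
{
    float even[4], odd[4];
    model_vperm(x, y, even_words, even);
    model_vperm(x, y, odd_words, odd);
    for (int i = 0; i < 4; i++)
        r[i] = even[i] + odd[i];       /* stands in for vec_add */
}

/* Double case: a merge-high gives {x[0], y[0]} and a merge-low gives
   {x[1], y[1]}, so no control vector has to be loaded. */
void model_hadd_pd_merge(const double x[2], const double y[2], double r[2])
{
    const double merge_high[2] = { x[0], y[0] };
    const double merge_low[2]  = { x[1], y[1] };
    for (int i = 0; i < 2; i++)
        r[i] = merge_high[i] + merge_low[i];   /* stands in for vec_add */
}
```

For the inputs {1, 2, 3, 4} and {101, 102, 103, 104}, model_hadd_ps_vperm yields {3, 7, 203, 207}. The double model needs no byte table at all, which mirrors why the merge form avoids loading control vectors.
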