Dealing with MMX
MMX is actually the harder case. The __m64
type supports SIMD operations on vectors of the
integer types (char, short, int, long). The Intel API defines
__m64 as:
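In GCC's x86 mmintrin.h the definition is (approximately — reproduced here as the block the text refers to):

```c
/* A 64-bit vector type that the compiler may place in an MMX register.  */
typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__));
```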
This is problematic for the PowerPC target (a type not really supported in
GCC for PowerPC), and we would prefer to use a native PowerISA type that can be
passed in a single register. The PowerISA Rotate Under Mask instructions can easily
extract and insert integer fields of a General Purpose Register (GPR). This
implies that MMX integer types can be handled as an internal union of arrays for
the supported element types. So a 64-bit unsigned long long is the best type
for parameter passing and return values, especially for the 64-bit (_si64)
operations as these normally generate a single PowerISA instruction.
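The "internal union of arrays" approach can be sketched as follows; the union layout and the function name here are illustrative, not the actual implementation:

```c
#include <stdint.h>

/* Illustrative union viewing one 64-bit __m64 value as arrays of the
   supported element types.  */
typedef union
{
  uint64_t as_m64;
  int8_t   as_char[8];
  int16_t  as_short[4];
  int32_t  as_int[2];
} __m64_union;

/* Lane-wise 16-bit add on GPR-resident data, the way a _mm_add_pi16
   style operation could be built from extract/insert of GPR fields.  */
static uint64_t
add_pi16 (uint64_t a, uint64_t b)
{
  __m64_union ua, ub, ur;
  ua.as_m64 = a;
  ub.as_m64 = b;
  for (int i = 0; i < 4; i++)
    ur.as_short[i] = (int16_t) (ua.as_short[i] + ub.as_short[i]);
  return ur.as_m64;
}
```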
So for the PowerPC implementation we will define
__m64 as:
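Approximately, following GCC's PowerPC mmintrin.h (reconstructed here, as the definition does not appear in this text):

```c
/* __m64 as a plain 64-bit scalar, passed and returned in a single GPR.  */
typedef __attribute__ ((__aligned__ (8))) unsigned long long __m64;

/* With this definition a 64-bit (_si64) operation reduces to a single
   GPR instruction; the function name is illustrative.  */
static __m64
my_add_si64 (__m64 a, __m64 b)
{
  return a + b;
}
```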
The SSE extensions include some copy / convert operations between
__m128 and __m64, and this includes some int to / from float conversions. However, in
these cases the float operands always reside in SSE (XMM) registers (which
match the PowerISA vector registers) and the MMX registers only contain integer
values. POWER8 (PowerISA-2.07) has direct move instructions between GPRs and
VSRs. So these transfers are normally a single instruction and any conversions
can be handled in the vector unit.
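The conversion step can be sketched with GCC's generic vector extensions; __builtin_convertvector is an assumption about the toolchain (GCC 9+ or Clang), and the type names are illustrative:

```c
#include <stdint.h>

typedef int32_t v4si __attribute__ ((vector_size (16)));
typedef float   v4sf __attribute__ ((vector_size (16)));

/* Once the integer data has been moved into a vector register, the
   int -> float conversion runs entirely in the vector unit.  */
static v4sf
cvt_epi32_ps (v4si v)
{
  return __builtin_convertvector (v, v4sf);
}
```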
When transferring a __m64 value to a vector register we should also
execute an xxspltd instruction to ensure there is valid data in all four
float element lanes before doing floating point operations. This avoids causing
extraneous floating point exceptions that might be generated by uninitialized
parts of the vector. The top two lanes will have the floating point results
that are in position for direct transfer to a GPR or stored via Store Float
Double (stfd). These operations are internal to the intrinsic implementation and
there is no requirement to keep temporary vectors in correct Little Endian
form.
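The splat step can be sketched with GCC's generic vector extensions (the function name is illustrative, and the instruction mapping is an expectation, not guaranteed codegen):

```c
#include <stdint.h>

typedef uint64_t v2du __attribute__ ((vector_size (16)));

/* Replicate the 64-bit __m64 value into both doubleword lanes.  On
   POWER8 this is expected to compile to a direct move (mtvsrd) plus a
   splat (xxspltd), so every float lane holds defined data.  */
static v2du
splat_m64 (uint64_t m)
{
  return (v2du) { m, m };
}
```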
Also, for the smaller element sizes and higher element counts (the MMX
_pi8 and _pi16 types), the number of Rotate Under Mask instructions required to
disassemble the 64-bit __m64 into elements, perform the element calculations,
and reassemble the elements into a single __m64 value can grow large. In these
cases we can generate shorter instruction sequences by transferring (via a
direct move instruction) the GPR __m64 value to a vector register, performing
the SIMD operation there, then transferring the __m64 result back to a GPR.
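The vector-register path can be sketched portably with GCC vector extensions (the function name is illustrative; on POWER8 the copies are expected to become direct GPR-to-VSR moves rather than loads and stores):

```c
#include <stdint.h>

typedef int8_t v8qi __attribute__ ((vector_size (8)));

/* Vector path for an 8 x 8-bit add: move the two __m64 values into
   vector form, add lane-wise, then move the result back.  */
static uint64_t
add_pi8_vec (uint64_t a, uint64_t b)
{
  v8qi va, vb, vr;
  __builtin_memcpy (&va, &a, 8);
  __builtin_memcpy (&vb, &b, 8);
  vr = va + vb;
  uint64_t r;
  __builtin_memcpy (&r, &vr, 8);
  return r;
}
```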