<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright (c) 2017 OpenPOWER Foundation

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

-->
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="sec_handling_mmx">
<title>Dealing with MMX</title>
  <para>MMX is actually the harder case. The <literal>__m64</literal>
  type supports SIMD vectors of the integer element types
  (char, short, int, long). The Intel API defines
  <literal>__m64</literal> as:
  <programlisting><![CDATA[typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__));]]></programlisting></para>
  <para>This is problematic for the PowerPC target (this vector type is not
  really supported in GCC for that target), and we would prefer to use a native
  PowerISA type that can be passed in a single register. The PowerISA Rotate
  Under Mask instructions can easily extract and insert integer fields of a
  General Purpose Register (GPR). This implies that MMX integer types can be
  handled as an internal union of arrays of the supported element types. So a
  64-bit unsigned long long is the best type for parameter passing and return
  values, especially for the 64-bit (<literal>_si64</literal>) operations, as
  these normally generate a single PowerISA instruction.
  <phrase revisionflag="added">So for the PowerPC implementation we will define
  <literal>__m64</literal> as:</phrase>
  <programlisting><![CDATA[typedef __attribute__ ((__aligned__ (8))) unsigned long long __m64;]]></programlisting></para>
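  <para>As a minimal sketch of the idea above, the following portable C
  treats a 64-bit integer as four 16-bit fields. The names
  <literal>m64_t</literal>, <literal>m64_extract_u16</literal>,
  <literal>m64_insert_u16</literal> and <literal>m64_and_si64</literal> are
  hypothetical and are not part of any intrinsic header. The sketch assumes
  that element 0 is the least significant field; a compiler may map the shifts
  and masks to rotate-and-mask instructions, but that is not guaranteed.</para>
  <programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the PowerPC __m64: one 64-bit GPR value. */
typedef uint64_t m64_t;

/* Hypothetical helper: read 16-bit element idx (0 = least significant). */
static inline uint16_t m64_extract_u16(m64_t v, unsigned idx)
{
    return (uint16_t)(v >> (16u * (idx & 3u)));
}

/* Hypothetical helper: replace 16-bit element idx, leaving the rest intact. */
static inline m64_t m64_insert_u16(m64_t v, uint16_t e, unsigned idx)
{
    unsigned shift = 16u * (idx & 3u);
    m64_t mask = (m64_t)0xFFFFu << shift;

    return (v & ~mask) | ((m64_t)e << shift);
}

/* A whole-register (_si64 style) operation is a single integer operation. */
static inline m64_t m64_and_si64(m64_t a, m64_t b)
{
    return a & b;
}
```
]]></programlisting>
  <para>With these helpers,
  <literal>m64_insert_u16 (0x1111222233334444ULL, 0xABCD, 1)</literal> yields
  <literal>0x11112222ABCD4444</literal>, while the whole-register AND needs no
  field handling at all.</para>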
  <para>The SSE extensions include some copy / convert operations for
  <literal>__m128</literal> to /
  from <literal>__m64</literal>, and this includes some int to / from float
  conversions. However, in these cases the float operands always reside in SSE
  (XMM) registers (which match the PowerISA vector registers) and the MMX
  registers only contain integer values. POWER8 (PowerISA-2.07) has direct move
  instructions between GPRs and VSRs. So these transfers are normally a single
  instruction and any conversions can be handled in the vector unit.</para>
  <para>When transferring a <literal>__m64</literal> value to a vector register
  we should also execute a doubleword splat (<literal>xxspltd</literal>)
  instruction to ensure there is valid data in all four float element lanes
  before doing floating point operations. This avoids extraneous floating point
  exceptions that might be generated by uninitialized parts of the vector. The
  top two lanes then hold the floating point results, in position for a direct
  transfer to a GPR or a store via Store Floating-Point Double
  (<literal>stfd</literal>). These operations are internal to the intrinsic
  implementation and there is no requirement to keep temporary vectors in
  correct Little Endian form.</para>
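  <para>The effect of the splat can be modeled in portable C. The sketch below
  is an illustration only, not the intrinsic implementation: the names
  <literal>m64_t</literal>, <literal>vec128_model</literal>,
  <literal>splat_dw_model</literal> and
  <literal>cvt_i32x4_to_f32_model</literal> are hypothetical, and a plain
  struct stands in for a 128-bit vector register. It shows that after the
  splat every one of the four lanes holds a defined integer before the
  conversion to float.</para>
  <programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the PowerPC __m64: one 64-bit GPR value. */
typedef uint64_t m64_t;

/* Hypothetical model of a 128-bit vector register as two doublewords. */
typedef struct { uint64_t dw[2]; } vec128_model;

/* Model of a doubleword splat: copy the 64-bit value into both halves. */
static vec128_model splat_dw_model(m64_t v)
{
    vec128_model r = { { v, v } };
    return r;
}

/* Model of a four-lane int-to-float conversion over the whole register. */
static void cvt_i32x4_to_f32_model(const vec128_model *in, float out[4])
{
    int32_t lanes[4];

    /* Reinterpret the 16 bytes as four 32-bit lanes in memory order. */
    memcpy(lanes, in->dw, sizeof lanes);
    for (int i = 0; i < 4; i++)
        out[i] = (float)lanes[i];
}
```
]]></programlisting>
  <para>Because both doublewords are identical after the splat, lanes 2 and 3
  repeat lanes 0 and 1, so no lane is converted from uninitialized data. Which
  lane holds which element depends on the byte order of the host.</para>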
  <para>Also, for the smaller element sizes and higher element counts (MMX
  <literal>_pi8</literal> and <literal>_pi16</literal> types),
  the number of Rotate Under Mask instructions required to
  disassemble the 64-bit <literal>__m64</literal>
  into elements, perform the element calculations,
  and reassemble the elements into a single <literal>__m64</literal>
  value can get larger. In this
  case we can generate shorter instruction sequences by transferring (via a
  direct move instruction) the GPR <literal>__m64</literal> value to
  a vector register, performing the
  SIMD operation there, and then transferring the <literal>__m64</literal>
  result back to a GPR.</para>
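  <para>The trade-off can be sketched with a 16-bit element add written both
  ways. This is a hedged illustration, not the header's code: the names
  <literal>m64_t</literal>, <literal>v8u16_model</literal>,
  <literal>add_pi16_gpr</literal> and <literal>add_pi16_vec</literal> are
  hypothetical, the vector path assumes the GCC / Clang
  <literal>vector_size</literal> extension, and <literal>memcpy</literal>
  stands in for the direct move instructions.</para>
  <programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the PowerPC __m64: one 64-bit GPR value. */
typedef uint64_t m64_t;

/* Hypothetical 128-bit vector of eight unsigned 16-bit lanes (GNU extension). */
typedef uint16_t v8u16_model __attribute__ ((vector_size (16)));

/* GPR path: extract, add, and reinsert each of the four 16-bit fields. */
static m64_t add_pi16_gpr(m64_t a, m64_t b)
{
    m64_t r = 0;

    for (unsigned i = 0; i < 4; i++) {
        unsigned shift = 16u * i;
        uint16_t ea = (uint16_t)(a >> shift);
        uint16_t eb = (uint16_t)(b >> shift);
        uint16_t sum = (uint16_t)(ea + eb);   /* wraps modulo 2^16 */

        r |= (m64_t)sum << shift;
    }
    return r;
}

/* Vector path: move both operands to vectors, do one SIMD add, move back. */
static m64_t add_pi16_vec(m64_t a, m64_t b)
{
    uint64_t abuf[2] = { a, a };   /* splatted so every lane is defined */
    uint64_t bbuf[2] = { b, b };
    uint64_t rbuf[2];
    v8u16_model va, vb, vr;

    memcpy(&va, abuf, sizeof va);
    memcpy(&vb, bbuf, sizeof vb);
    vr = va + vb;                  /* lane-wise add, wraps modulo 2^16 */
    memcpy(rbuf, &vr, sizeof rbuf);
    return rbuf[0];
}
```
]]></programlisting>
  <para>Both functions return the same value for any inputs. The GPR path
  needs work for every element, while the vector path has a fixed cost of two
  transfers in, one add, and one transfer out, which is why it tends to win as
  the element count grows.</para>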
</section>