|
|
<?xml version="1.0" encoding="UTF-8"?>
|
|
|
<!--
|
|
|
Copyright (c) 2017 OpenPOWER Foundation
|
|
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License");
|
|
|
you may not use this file except in compliance with the License.
|
|
|
You may obtain a copy of the License at
|
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software
|
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
See the License for the specific language governing permissions and
|
|
|
limitations under the License.
|
|
|
|
|
|
-->
|
|
|
<section xmlns="http://docbook.org/ns/docbook"
|
|
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
|
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
|
|
version="5.0"
|
|
|
xml:id="sec_intel_intrinsic_types">
|
|
|
<title>The types used for intrinsics</title>
|
|
|
|
|
|
<para>The type system for Intel intrinsics is a little strange. For example
|
|
|
from xmmintrin.h:
|
|
|
<programlisting><![CDATA[/* The Intel API is flexible enough that we must allow aliasing with other
|
|
|
vector types, and their scalar components. */
|
|
|
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));
|
|
|
|
|
|
/* Internal data types for implementing the intrinsics. */
|
|
|
typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]></programlisting></para>
|
|
|
|
|
|
<para>So there is one set of types that are used in the function prototypes
|
|
|
of the API, and the internal types that are used in the implementation. Notice
|
|
|
the special attribute <literal>__may_alias__</literal>. From the GCC documentation:
|
|
|
|
|
|
<blockquote><para>
|
|
|
Accesses through pointers to types with this attribute are not subject
|
|
|
to type-based alias analysis, but are instead assumed to be able to alias any
|
|
|
other type of objects. ... This extension exists to support some vector APIs,
|
|
|
in which pointers to one vector type are permitted to alias pointers to a
|
|
|
different vector type.</para></blockquote></para>
|
|
|
|
|
|
<para>There are a couple of issues here:
|
|
|
<itemizedlist spacing="compact">
|
|
|
<listitem>
|
|
|
<para>The use of __may_alias__ in the API seems to force the compiler to assume
|
|
|
aliasing of any parameter passed by reference.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>
|
|
|
The GCC vector builtin type system (example above) is slightly different
|
|
|
syntax from the original Altivec __vector types. Internally the two typedef forms
|
|
|
may represent the same 128-bit vector type,
|
|
|
but for early source parsing and overloaded vector builtins they are
|
|
|
handled differently.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>The data type used at the interface may not be
|
|
|
the correct type for the implied operation.</para>
|
|
|
</listitem>
|
|
|
</itemizedlist>
|
|
|
Normally the compiler assumes that parameters of different size do
|
|
|
not overlap in storage, which allows more optimization.
|
|
|
However parameters for different vector element sizes
|
|
|
[char | short | int | long] are all passed and returned as type <literal>__m128i</literal>
|
|
|
(defined as vector long long). </para>
|
|
|
|
|
|
<para>This may not matter when using x86 built-ins but does matter when
|
|
|
the implementation uses C vector extensions or in our case using PowerPC
|
|
|
overloaded
|
|
|
vector built-ins
|
|
|
(<xref linkend="sec_powerisa_vector_intrinsics"/>).
|
|
|
For the latter cases the type must be correct for
|
|
|
the compiler to generate the correct code for the
|
|
|
type (char, short, int, long)
|
|
|
(<xref linkend="sec_api_implemented"/>) for
|
|
|
overloaded builtin operations.
|
|
|
There is also concern that excessive use of
|
|
|
<literal>__may_alias__</literal>
|
|
|
will limit compiler optimization. We are not sure how important this attribute
|
|
|
is to the correct operation of the API. So at a later stage we should
|
|
|
experiment with removing it from our implementation for PowerPC.</para>
|
|
|
|
|
|
<para>The good news is that PowerISA has good support for 128-bit vectors
|
|
|
and (with the addition of VSX) all the required vector data (char, short, int,
|
|
|
long, float, double) types. However Intel supports a wider variety of the
|
|
|
vector sizes than PowerISA does. This started with the 64-bit MMX vector
|
|
|
support that preceded SSE and extends to 256-bit and 512-bit vectors of AVX,
|
|
|
AVX2, and AVX512 that followed SSE.</para>
|
|
|
|
|
|
<para>Within the GCC Intel intrinsic implementation these are all
|
|
|
implemented as vector attribute extensions of the appropriate size (
|
|
|
<literal>__vector_size__</literal> ({8 | 16 | 32, and 64}). For the PowerPC target GCC currently
|
|
|
only supports the native <literal>__vector_size__</literal> ( 16 ). These we can support directly
|
|
|
in VMX/VSX registers and associated instructions.</para>
|
|
|
|
|
|
<para>GCC will compile code with
|
|
|
other <literal>__vector_size__</literal> values, but the resulting types are treated as simple
|
|
|
arrays of the element type. This does not allow the compiler to use the vector
|
|
|
registers for parameter passing and return values.
|
|
|
For example this intrinsic from immintrin.h:
|
|
|
<programlisting><![CDATA[typedef double __m256d __attribute__ ((__vector_size__ (32), __may_alias__));
|
|
|
|
|
|
extern __inline __m256d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
|
|
|
_mm256_add_pd (__m256d __A, __m256d __B)
|
|
|
{
|
|
|
return (__m256d) ((__v4df)__A + (__v4df)__B);
|
|
|
}
|
|
|
]]></programlisting></para>
|
|
|
<para>And test case:
|
|
|
<programlisting><![CDATA[__m256d
|
|
|
test_mm256_add_pd (__m256d __A, __m256d __B)
|
|
|
{
|
|
|
return (_mm256_add_pd (__A, __B));
|
|
|
}
|
|
|
]]></programlisting></para>
|
|
|
<para>Current GCC generates:
|
|
|
<programlisting><![CDATA[0000000000000970 <test_mm256_add_pd>:
|
|
|
970: 10 00 20 39 li r9,16
|
|
|
974: 98 26 80 7d lxvd2x vs12,0,r4
|
|
|
978: 98 2e 40 7d lxvd2x vs10,0,r5
|
|
|
97c: 20 00 e0 38 li r7,32
|
|
|
980: f8 ff e1 fb std r31,-8(r1)
|
|
|
984: b1 ff 21 f8 stdu r1,-80(r1)
|
|
|
988: 30 00 00 39 li r8,48
|
|
|
98c: 98 4e 04 7c lxvd2x vs0,r4,r9
|
|
|
990: 98 4e 65 7d lxvd2x vs11,r5,r9
|
|
|
994: 00 53 8c f1 xvadddp vs12,vs12,vs10
|
|
|
998: 00 00 c1 e8 ld r6,0(r1)
|
|
|
99c: 78 0b 3f 7c mr r31,r1
|
|
|
9a0: 00 5b 00 f0 xvadddp vs0,vs0,vs11
|
|
|
9a4: c1 ff c1 f8 stdu r6,-64(r1)
|
|
|
9a8: 98 3f 9f 7d stxvd2x vs12,r31,r7
|
|
|
9ac: 98 47 1f 7c stxvd2x vs0,r31,r8
|
|
|
9b0: 98 3e 9f 7d lxvd2x vs12,r31,r7
|
|
|
9b4: 98 46 1f 7c lxvd2x vs0,r31,r8
|
|
|
9b8: 50 00 3f 38 addi r1,r31,80
|
|
|
9bc: f8 ff e1 eb ld r31,-8(r1)
|
|
|
9c0: 98 1f 80 7d stxvd2x vs12,0,r3
|
|
|
9c4: 98 4f 03 7c stxvd2x vs0,r3,r9
|
|
|
9c8: 20 00 80 4e blr]]></programlisting></para>
|
|
|
|
|
|
<para>The compiler treats the parameters and return value
|
|
|
as scalar arrays, which are passed by reference.
|
|
|
The operation is vectorized in this case,
|
|
|
but the 256-bit result is returned through storage.</para>
|
|
|
|
|
|
<para>This is not what we want to see for a simple 4 by double add.
|
|
|
It would be better if we can pass and return
|
|
|
MMX (<xref linkend="sec_handling_mmx"/>) and AVX (<xref linkend="sec_handling_avx"/>)
|
|
|
values as PowerPC registers and avoid the storage references.
|
|
|
If we can get the parameter and return values as registers,
|
|
|
this example will reduce to:
|
|
|
<programlisting><![CDATA[0000000000000970 <test_mx256_add_pd>:
|
|
|
970: xvadddp vs34,vs34,vs36
|
|
|
974: xvadddp vs35,vs35,vs37
|
|
|
978: blr]]></programlisting></para>
|
|
|
|
|
|
<para>So the PowerISA VMX/VSX facilities and GCC compiler support for
|
|
|
128-bit/16-byte vectors and associated vector built-ins
|
|
|
are well matched to implementing equivalent X86 SSE intrinsic functions.
|
|
|
However implementing the older MMX (64-bit) and the latest
|
|
|
AVX (256 / 512-bit) extensions requires more thought and some
|
|
|
ingenuity.</para>
|
|
|
|
|
|
<xi:include href="sec_handling_mmx.xml"/>
|
|
|
<xi:include href="sec_handling_avx.xml"/>
|
|
|
|
|
|
</section>
|
|
|
|