You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
Programming-Guides/Vector_Intrinsics/sec_intel_intrinsic_functio...

123 lines
6.5 KiB
XML

<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_intel_intrinsic_functions">
<title>Intel Intrinsic functions</title>
<para>So what is an intrinsic function? From Wikipedia:
<blockquote><para>In <link xlink:href="https://en.wikipedia.org/wiki/Compiler_theory">compiler theory</link>, an
<emphasis role="bold">intrinsic function</emphasis> is a function available for use in a given
<link xlink:href="https://en.wikipedia.org/wiki/Programming_language">programming
language</link> whose implementation is handled specially by the compiler.
Typically, it substitutes a sequence of automatically generated instructions
for the original function call, similar to an
<link xlink:href="https://en.wikipedia.org/wiki/Inline_function">inline function</link>.
Unlike an inline function though, the compiler has an intimate knowledge of the
intrinsic function and can therefore better integrate it and optimize it for
the situation. This is also called builtin function in many languages.</para></blockquote></para>
<para>The “Intel Intrinsics” API provides access to the many
instruction set extensions (Intel Technologies) that Intel has added (and
continues to add) over the years. The intrinsics provided access to new
instruction capabilities before the compilers could exploit them directly.
Initially these intrinsic functions where defined for the Intel and Microsoft
compiler and where eventually implemented and contributed to GCC.</para>
<para>The Intel Intrinsics have a specific type and naming structure. In
this naming structure, functions starts with a common prefix (MMX and SSE use
'_mm' prefix, while AVX added the '_mm256' '_mm512' prefixes), then a short
functional name ('set', 'load', 'store', 'add', 'mul', 'blend', 'shuffle', '…') and a suffix
('_pd', '_sd', '_pi32'...) with type and packing information. See
<xref linkend="app_intel_suffixes"/> for the list of common intrisic suffixes.</para>
<para>Oddly many of the MMX/SSE operations are not vectors at all. There
are a lot of scalar operations on a single float, double, or long long type. In
effect these are scalars that can take advantage of the larger (xmm) register
space. Also in the Intel 32-bit architecture they provided IEEE754 float and
double types, and 64-bit integers that did not exist or were hard to implement
in the base i386/387 instruction set. These scalar operations use a suffix
starting with '_s' (<literal>_sd</literal> for scalar double float,
<literal>_ss</literal> scalar float, and <literal>_si64</literal>
for scalar long long).</para>
<para>True vector operations use the packed or extended packed suffixes,
starting with '_p' or '_ep' (<literal>_pd</literal> for vector double,
<literal>_ps</literal> for vector float, and
<literal>_epi32</literal> for vector int). The use of '_ep'  
seems to be reserved to disambiguate
intrinsics that existed in the (64-bit vector) MMX extension from the extended
(128-bit vector) SSE equivalent. For example
<emphasis role="bold"><literal>_mm_add_pi32</literal></emphasis> is a MMX operation on
a pair of 32-bit integers, while
<emphasis role="bold"><literal>_mm_add_epi32</literal></emphasis> is an SSE2 operation on vector
of 4 32-bit integers.</para>
<para>The GCC  builtins for the
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions">i386.target</link>
(includes x86 and x86_64) are not
the same as the Intel Intrinsics. While they have similar intent and cover most
of the same functions, they use a different naming (prefixed with
<literal>__builtin_ia32_</literal>, then function name with type suffix) and uses GCC vector type
modes for operand types. For example:
<programlisting><![CDATA[v8qi __builtin_ia32_paddb (v8qi, v8qi)
v4hi __builtin_ia32_paddw (v4hi, v4hi)
v2si __builtin_ia32_paddd (v2si, v2si)
v2di __builtin_ia32_paddq (v2di, v2di)]]></programlisting></para>
<note><para>A key difference between GCC built-ins for i386 and PowerPC is
that the x86 built-ins have different names of each operation and type while the
PowerPC altivec built-ins tend to have a single generic built-in for  each
operation, across a set of compatible operand types. </para></note>
<para>In GCC the Intel Intrinsic header (*intrin.h) files are implemented
as a set of inline functions using the Intel Intrinsic API names and types.
These functions are implemented as either GCC C vector extension code or via
one or more GCC builtins for the i386 target. So lets take a look at some
examples from GCC's SSE2 intrinsic header emmintrin.h:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_pd (__m128d __A, __m128d __B)
{
return (__m128d) ((__v2df)__A + (__v2df)__B);
}
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_sd (__m128d __A, __m128d __B)
{
return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
}]]></programlisting></para>
<para>Note that the  
<emphasis role="bold"><literal>_mm_add_pd</literal></emphasis> is implemented direct as GCC C vector
extension code., while
<emphasis role="bold"><literal>_mm_add_sd</literal></emphasis> is implemented via the GCC builtin
<emphasis role="bold"><literal>__builtin_ia32_addsd</literal></emphasis>. From the
discussion above we know the <literal>_pd</literal> suffix
indicates a packed vector double while the <literal>_sd</literal> suffix indicates a scalar double
in a XMM register. </para>
<xi:include href="sec_packed_vs_scalar_intrinsics.xml"/>
<xi:include href="sec_vec_or_not.xml"/>
<xi:include href="sec_crossing_lanes.xml"/>
</section>