<?xml version="1.0" encoding="UTF-8"?> <!-- Copyright (c) 2017 OpenPOWER Foundation Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. --> <section xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0" xml:id="sec_intel_intrinsic_functions"> <title>Intel Intrinsic functions</title> <para>So what is an intrinsic function? From Wikipedia: <blockquote><para>In <link xlink:href="https://en.wikipedia.org/wiki/Compiler_theory">compiler theory</link>, an <emphasis role="bold">intrinsic function</emphasis> is a function available for use in a given <link xlink:href="https://en.wikipedia.org/wiki/Programming_language">programming language</link> whose implementation is handled specially by the compiler. Typically, it substitutes a sequence of automatically generated instructions for the original function call, similar to an <link xlink:href="https://en.wikipedia.org/wiki/Inline_function">inline function</link>. Unlike an inline function though, the compiler has an intimate knowledge of the intrinsic function and can therefore better integrate it and optimize it for the situation. This is also called builtin function in many languages.</para></blockquote></para> <para>The “Intel Intrinsics” API provides access to the many instruction set extensions (Intel Technologies) that Intel has added (and continues to add) over the years. The intrinsics provided access to new instruction capabilities before the compilers could exploit them directly. Initially these intrinsic functions where defined for the Intel and Microsoft compiler and where eventually implemented and contributed to GCC.</para> <para>The Intel Intrinsics have a specific type and naming structure. In this naming structure, functions starts with a common prefix (MMX and SSE use '_mm' prefix, while AVX added the '_mm256' '_mm512' prefixes), then a short functional name ('set', 'load', 'store', 'add', 'mul', 'blend', 'shuffle', '…') and a suffix ('_pd', '_sd', '_pi32'...) with type and packing information. See <xref linkend="app_intel_suffixes"/> for the list of common intrisic suffixes.</para> <para>Oddly many of the MMX/SSE operations are not vectors at all. There are a lot of scalar operations on a single float, double, or long long type. In effect these are scalars that can take advantage of the larger (xmm) register space. Also in the Intel 32-bit architecture they provided IEEE754 float and double types, and 64-bit integers that did not exist or were hard to implement in the base i386/387 instruction set. These scalar operations use a suffix starting with '_s' (<literal>_sd</literal> for scalar double float, <literal>_ss</literal> scalar float, and <literal>_si64</literal> for scalar long long).</para> <para>True vector operations use the packed or extended packed suffixes, starting with '_p' or '_ep' (<literal>_pd</literal> for vector double, <literal>_ps</literal> for vector float, and <literal>_epi32</literal> for vector int). The use of '_ep' seems to be reserved to disambiguate intrinsics that existed in the (64-bit vector) MMX extension from the extended (128-bit vector) SSE equivalent. For example <emphasis role="bold"><literal>_mm_add_pi32</literal></emphasis> is a MMX operation on a pair of 32-bit integers, while <emphasis role="bold"><literal>_mm_add_epi32</literal></emphasis> is an SSE2 operation on vector of 4 32-bit integers.</para> <para>The GCC builtins for the <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions">i386.target</link> (includes x86 and x86_64) are not the same as the Intel Intrinsics. While they have similar intent and cover most of the same functions, they use a different naming (prefixed with <literal>__builtin_ia32_</literal>, then function name with type suffix) and uses GCC vector type modes for operand types. For example: <programlisting><![CDATA[v8qi __builtin_ia32_paddb (v8qi, v8qi) v4hi __builtin_ia32_paddw (v4hi, v4hi) v2si __builtin_ia32_paddd (v2si, v2si) v2di __builtin_ia32_paddq (v2di, v2di)]]></programlisting></para> <note><para>A key difference between GCC built-ins for i386 and PowerPC is that the x86 built-ins have different names of each operation and type while the PowerPC altivec built-ins tend to have a single overloaded built-in for each operation, across a set of compatible operand types. </para></note> <para>In GCC the Intel Intrinsic header (*intrin.h) files are implemented as a set of inline functions using the Intel Intrinsic API names and types. These functions are implemented as either GCC C vector extension code or via one or more GCC builtins for the i386 target. So lets take a look at some examples from GCC's SSE2 intrinsic header emmintrin.h: <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_add_pd (__m128d __A, __m128d __B) { return (__m128d) ((__v2df)__A + (__v2df)__B); } extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_add_sd (__m128d __A, __m128d __B) { return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B); }]]></programlisting></para> <para>Note that the <emphasis role="bold"><literal>_mm_add_pd</literal></emphasis> is implemented direct as GCC C vector extension code., while <emphasis role="bold"><literal>_mm_add_sd</literal></emphasis> is implemented via the GCC builtin <emphasis role="bold"><literal>__builtin_ia32_addsd</literal></emphasis>. From the discussion above we know the <literal>_pd</literal> suffix indicates a packed vector double while the <literal>_sd</literal> suffix indicates a scalar double in a XMM register. </para> <xi:include href="sec_packed_vs_scalar_intrinsics.xml"/> <xi:include href="sec_vec_or_not.xml"/> <xi:include href="sec_crossing_lanes.xml"/> </section>