|
|
<?xml version="1.0" encoding="UTF-8"?>
|
|
|
<!--
|
|
|
Copyright (c) 2017 OpenPOWER Foundation
|
|
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License");
|
|
|
you may not use this file except in compliance with the License.
|
|
|
You may obtain a copy of the License at
|
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software
|
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
See the License for the specific language governing permissions and
|
|
|
limitations under the License.
|
|
|
|
|
|
-->
|
|
|
<section xmlns="http://docbook.org/ns/docbook"
|
|
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
|
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
|
|
version="5.0"
|
|
|
xml:id="sec_intel_intrinsic_includes">
|
|
|
<title>The structure of the intrinsic includes</title>
|
|
|
|
|
|
<para>The GCC x86 intrinsic functions for vector were initially grouped by
|
|
|
technology (MMX and SSE), which starts with MMX and continues with SSE through
|
|
|
SSE4.1 stacked like a set of Russian dolls.</para>
|
|
|
|
|
|
<para>Basically each higher layer include needs typedefs and helper macros
|
|
|
defined by the lower level intrinsic includes. mm_malloc.h simply provides
|
|
|
wrappers for posix_memalign and free. Then it gets a little weird, starting
|
|
|
with the crypto extensions:
|
|
|
|
|
|
<programlisting><![CDATA[wmmintrin.h (AES) includes emmintrin.h]]></programlisting></para>
|
|
|
|
|
|
<para>For AVX, AVX2, and AVX512 they must have decided
|
|
|
that the Russian Dolls thing was getting out of hand. AVX et al. is split
|
|
|
across 14 files:
|
|
|
|
|
|
<programlisting><![CDATA[#include <avxintrin.h>
|
|
|
#include <avx2intrin.h>
|
|
|
#include <avx512fintrin.h>
|
|
|
#include <avx512erintrin.h>
|
|
|
#include <avx512pfintrin.h>
|
|
|
#include <avx512cdintrin.h>
|
|
|
#include <avx512vlintrin.h>
|
|
|
#include <avx512bwintrin.h>
|
|
|
#include <avx512dqintrin.h>
|
|
|
#include <avx512vlbwintrin.h>
|
|
|
#include <avx512vldqintrin.h>
|
|
|
#include <avx512ifmaintrin.h>
|
|
|
#include <avx512ifmavlintrin.h>
|
|
|
#include <avx512vbmiintrin.h>
|
|
|
#include <avx512vbmivlintrin.h>]]></programlisting>
|
|
|
|
|
|
but they do not want the applications to include these
|
|
|
individually.</para>
|
|
|
|
|
|
<para>So <emphasis role="bold">immintrin.h</emphasis> includes everything Intel vector, including all the
|
|
|
AVX, AES, SSE, and MMX flavors.
|
|
|
<programlisting><![CDATA[#ifndef _IMMINTRIN_H_INCLUDED
|
|
|
# error "Never use <avxintrin.h> directly; include <immintrin.h> instead."
|
|
|
#endif]]></programlisting></para>
|
|
|
|
|
|
<para>So why is this interesting? The include structure provides some strong clues
|
|
|
about the order that we should approach this effort. For example if you need
|
|
|
to use intrinsics from SSE4 (smmintrin.h) you are likely to need to type definitions
|
|
|
from SSE (emmintrin.h). So a bottoms up (MMX, SSE, SSE2, …) approach seems
|
|
|
like the best plan of attack. Also saving the AVX parts for later make sense,
|
|
|
as most are just wider forms of operations that already exist in SSE.</para>
|
|
|
|
|
|
<para>We should use the same include structure to implement our PowerISA
|
|
|
equivalent API headers. This will make porting easier (drop-in replacement) and
|
|
|
should get the application running quickly on POWER. Then we will be in a position
|
|
|
to profile and analyze the resulting application. This will show any hot spots
|
|
|
where the simple one-to-one transformation results in bottlenecks and
|
|
|
additional tuning is needed. For these cases we should improve our tools (SDK
|
|
|
MA/SCA) to identify opportunities for, and perhaps propose, alternative
|
|
|
sequences that are better tuned to PowerISA and our micro-architecture.</para>
|
|
|
|
|
|
</section>
|
|
|
|