|
|
<?xml version="1.0" encoding="UTF-8"?>
|
|
|
<!--
|
|
|
Copyright (c) 2017 OpenPOWER Foundation
|
|
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License");
|
|
|
you may not use this file except in compliance with the License.
|
|
|
You may obtain a copy of the License at
|
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software
|
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
See the License for the specific language governing permissions and
|
|
|
limitations under the License.
|
|
|
|
|
|
-->
|
|
|
<section xmlns="http://docbook.org/ns/docbook"
|
|
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
|
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
|
|
version="5.0"
|
|
|
xml:id="sec_extra_attributes">
|
|
|
<title>Those extra attributes</title>
|
|
|
|
|
|
<para>You may have noticed there are some special attributes:
|
|
|
|
|
|
<literallayout>__gnu_inline__
|
|
|
|
|
|
This attribute should be used with a function that is also declared with the
|
|
|
inline keyword. It directs GCC to treat the function as if it were defined in
|
|
|
gnu90 mode even when compiling in C99 or gnu99 mode.
|
|
|
|
|
|
If the function is declared extern, then this definition of the function is used
|
|
|
only for inlining. In no case is the function compiled as a standalone function,
|
|
|
not even if you take its address explicitly. Such an address becomes an external
|
|
|
reference, as if you had only declared the function, and had not defined it. This
|
|
|
has almost the effect of a macro. The way to use this is to put a function
|
|
|
definition in a header file with this attribute, and put another copy of the
|
|
|
function, without extern, in a library file. The definition in the header file
|
|
|
causes most calls to the function to be inlined.
|
|
|
|
|
|
__always_inline__
|
|
|
|
|
|
Generally, functions are not inlined unless optimization is specified. For func-
|
|
|
tions declared inline, this attribute inlines the function independent of any
|
|
|
restrictions that otherwise apply to inlining. Failure to inline such a function
|
|
|
is diagnosed as an error.
|
|
|
|
|
|
__artificial__
|
|
|
|
|
|
This attribute is useful for small inline wrappers that if possible should appear
|
|
|
during debugging as a unit. Depending on the debug info format it either means
|
|
|
marking the function as artificial or using the caller location for all instructions
|
|
|
within the inlined body.
|
|
|
|
|
|
__extension__
|
|
|
|
|
|
... -pedantic’ and other options cause warnings for many GNU C extensions.
|
|
|
You can prevent such warnings within one expression by writing __extension__</literallayout></para>
|
|
|
|
|
|
<para>So far I have been using these attributes unchanged.</para>
|
|
|
|
|
|
<para>But most intrinsics map the Intel intrinsic to one or more target
|
|
|
specific GCC builtins. For example:
|
|
|
<programlisting><![CDATA[/* Load two DPFP values from P. The address must be 16-byte aligned. */
|
|
|
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
|
|
|
_mm_load_pd (double const *__P)
|
|
|
{
|
|
|
return *(__m128d *)__P;
|
|
|
}
|
|
|
|
|
|
/* Load two DPFP values from P. The address need not be 16-byte aligned. */
|
|
|
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
|
|
|
_mm_loadu_pd (double const *__P)
|
|
|
{
|
|
|
return __builtin_ia32_loadupd (__P);
|
|
|
}]]></programlisting></para>
|
|
|
|
|
|
<para>The first intrinsic (_mm_load_pd ) is implement as a C vector pointer
|
|
|
reference, but from the comment assumes the compiler will use a
|
|
|
<emphasis role="bold">movapd</emphasis>
|
|
|
instruction that requires 16-byte alignment (will raise a general-protection
|
|
|
exception if not aligned). This implies that there is a performance advantage
|
|
|
for at least some Intel processors to keep the vector aligned. The second
|
|
|
intrinsic uses the explicit GCC builtin
|
|
|
<emphasis role="bold"><literal>__builtin_ia32_loadupd</literal></emphasis> to generate the
|
|
|
<emphasis role="bold"><literal>movupd</literal></emphasis> instruction which handles unaligned references.</para>
|
|
|
|
|
|
<para>The opposite assumption applies to POWER and PPC64LE, where GCC
|
|
|
generates the VSX <emphasis role="bold"><literal>lxvd2x</literal></emphasis> /
|
|
|
<emphasis role="bold"><literal>xxswapd</literal></emphasis>
|
|
|
instruction sequence by default, which
|
|
|
allows unaligned references. The PowerISA equivalent for aligned vector access
|
|
|
is the VMX <emphasis role="bold"><literal>lvx</literal></emphasis> instruction and the
|
|
|
<emphasis role="bold"><literal>vec_ld</literal></emphasis> builtin, which forces quadword
|
|
|
aligned access (by ignoring the low order 4 bits of the effective address). The
|
|
|
<emphasis role="bold"><literal>lvx</literal></emphasis> instruction does not raise
|
|
|
alignment exceptions, but perhaps should as part
|
|
|
of our implementation of the Intel intrinsic. This requires that we use
|
|
|
PowerISA VMX/VSX built-ins to insure we get the expected results.</para>
|
|
|
|
|
|
<para>The current prototype defines the following:
|
|
|
<programlisting><![CDATA[/* Load two DPFP values from P. The address must be 16-byte aligned. */
|
|
|
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
|
|
|
_mm_load_pd (double const *__P)
|
|
|
{
|
|
|
assert(((unsigned long)__P & 0xfUL) == 0UL);
|
|
|
return ((__m128d)vec_ld(0, (__v16qu*)__P));
|
|
|
}
|
|
|
|
|
|
/* Load two DPFP values from P. The address need not be 16-byte aligned. */
|
|
|
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
|
|
|
_mm_loadu_pd (double const *__P)
|
|
|
{
|
|
|
return (vec_vsx_ld(0, __P));
|
|
|
}]]></programlisting></para>
|
|
|
|
|
|
<para>The aligned load intrinsic adds an assert which checks alignment
|
|
|
(to match the Intel semantic) and uses the GCC builtin
|
|
|
<emphasis role="bold"><literal>vec_ld</literal></emphasis> (generates an
|
|
|
<emphasis role="bold"><literal>lvx</literal></emphasis>). The assert
|
|
|
generates extra code but this can be eliminated by defining
|
|
|
<emphasis role="bold"><literal>NDEBUG</literal></emphasis> at compile time.
|
|
|
The unaligned load intrinsic uses the GCC builtin
|
|
|
<literal>vec_vsx_ld</literal> (for PPC64LE generates
|
|
|
<emphasis role="bold"><literal>lxvd2x</literal></emphasis> /
|
|
|
<emphasis role="bold"><literal>xxswapd</literal></emphasis> for POWER8 and will
|
|
|
simplify to <emphasis role="bold"><literal>lxv</literal></emphasis>
|
|
|
or <emphasis role="bold"><literal>lxvx</literal></emphasis>
|
|
|
for POWER9). And similarly for <emphasis role="bold"><literal>__mm_store_pd</literal></emphasis> /
|
|
|
<emphasis role="bold"><literal>__mm_storeu_pd</literal></emphasis>, using
|
|
|
<emphasis role="bold"><literal>vec_st</literal></emphasis>
|
|
|
and <emphasis role="bold"><literal>vec_vsx_st</literal></emphasis>. These concepts extent to the
|
|
|
load/store intrinsics for vector float and vector int.</para>
|
|
|
|
|
|
</section>
|
|
|
|