|
|
<?xml version="1.0" encoding="UTF-8"?>
|
|
|
<!--
|
|
|
Copyright (c) 2017 OpenPOWER Foundation
|
|
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License");
|
|
|
you may not use this file except in compliance with the License.
|
|
|
You may obtain a copy of the License at
|
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software
|
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
See the License for the specific language governing permissions and
|
|
|
limitations under the License.
|
|
|
|
|
|
-->
|
|
|
<section xmlns="http://docbook.org/ns/docbook"
|
|
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
|
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
|
|
version="5.0"
|
|
|
xml:id="sec_power_vector_permute_format">
|
|
|
<title>Vector permute and formatting instructions</title>
|
|
|
|
|
|
<para>The vector Permute and formatting chapter follows and is an important
|
|
|
one to study. These operate on the byte, halfword, word (and with
|
|
|
PowerISA 2.07 doubleword) integer types,
|
|
|
plus special pixel type. </para>
|
|
|
|
|
|
<para>The shift
|
|
|
instructions in this chapter operate on the vector as a whole at either the bit
|
|
|
or the byte (octet) level. This is an important chapter to study for moving
|
|
|
PowerISA vector results into the vector elements that Intel Intrinsics
|
|
|
expect:
|
|
|
|
|
|
<literallayout><literal>6.8 Vector Permute and Formatting Instructions . . . . . . . . . . . 249
|
|
|
6.8.1 Vector Pack and Unpack Instructions . . . . . . . . . . . . . 249
|
|
|
6.8.2 Vector Merge Instructions . . . . . . . . . . . . . . . . . . 256
|
|
|
6.8.3 Vector Splat Instructions . . . . . . . . . . . . . . . . . . 259
|
|
|
6.8.4 Vector Permute Instruction . . . . . . . . . . . . . . . . . . 260
|
|
|
6.8.5 Vector Select Instruction . . . . . . . . . . . . . . . . . . 261
|
|
|
6.8.6 Vector Shift Instructions . . . . . . . . . . . . . . . . . . 262</literal></literallayout></para>
|
|
|
|
|
|
<para>The Vector Integer instructions include the add / subtract / Multiply
|
|
|
/ Multiply Add/Sum / (no divide) operations for the standard integer types.
|
|
|
There are instruction forms that provide signed, unsigned, modulo, and
|
|
|
saturate results for most operations. PowerISA 2.07 extends vector integer
|
|
|
operations to add / subtract quadword (128-bit) integers with carry and extend.
|
|
|
This supports extended binary integer arithmetic to 256, 512-bit and beyond.
|
|
|
There are signed / unsigned compares across the standard
|
|
|
integer types (byte, .. doubleword); the usual bit-wise logical operations;
|
|
|
and the SIMD shift / rotate instructions that operate on the vector elements
|
|
|
for various integer types.
|
|
|
|
|
|
<literallayout><literal>6.9 Vector Integer Instructions . . . . . . . . . . . . . . . . . . 264
|
|
|
6.9.1 Vector Integer Arithmetic Instructions . . . . . . . . . . . . 264
|
|
|
6.9.2 Vector Integer Compare Instructions. . . . . . . . . . . . . . 294
|
|
|
6.9.3 Vector Logical Instructions . . . . . . . . . . . . . . . . . 300
|
|
|
6.9.4 Vector Integer Rotate and Shift Instructions . . . . . . . . . 302</literal></literallayout></para>
|
|
|
|
|
|
<para>The vector [single] float instructions are grouped into this chapter.
|
|
|
This chapter does not include the double float instructions, which are described
|
|
|
in the VSX chapter. VSX also includes additional float instructions that operate
|
|
|
on the whole 64 register vector-scalar set.
|
|
|
|
|
|
<literallayout><literal>6.10 Vector Floating-Point Instruction Set . . . . . . . . . . . . . 306
|
|
|
6.10.1 Vector Floating-Point Arithmetic Instructions . . . . . . . . 306
|
|
|
6.10.2 Vector Floating-Point Maximum and Minimum Instructions . . . 308
|
|
|
6.10.3 Vector Floating-Point Rounding and Conversion Instructions. . 309
|
|
|
6.10.4 Vector Floating-Point Compare Instructions . . . . . . . . . 313
|
|
|
6.10.5 Vector Floating-Point Estimate Instructions . . . . . . . . . 316</literal></literallayout></para>
|
|
|
|
|
|
<para>The vector XOR based instructions are new with PowerISA 2.07 (POWER8)
|
|
|
and provide vector crypto and check-sum operations:
|
|
|
|
|
|
<literallayout><literal>6.11 Vector Exclusive-OR-based Instructions . . . . . . . . . . . . 318
|
|
|
6.11.1 Vector AES Instructions . . . . . . . . . . . . . . . . . . . 318
|
|
|
6.11.2 Vector SHA-256 and SHA-512 Sigma Instructions . . . . . . . . 320
|
|
|
6.11.3 Vector Binary Polynomial Multiplication Instructions. . . . . 321
|
|
|
6.11.4 Vector Permute and Exclusive-OR Instruction . . . . . . . . . 323</literal></literallayout></para>
|
|
|
|
|
|
<para>The vector gather and bit permute instructions support bit-level rearrangement of
|
|
|
bits with in the vector, while the vector versions of the count leading zeros
|
|
|
and population count instructions are useful to accelerate specific algorithms.
|
|
|
|
|
|
<literallayout><literal>6.12 Vector Gather Instruction . . . . . . . . . . . . . . . . . . . 324
|
|
|
6.13 Vector Count Leading Zeros Instructions . . . . . . . . . . . . 325
|
|
|
6.14 Vector Population Count Instructions. . . . . . . . . . . . . . 326
|
|
|
6.15 Vector Bit Permute Instruction . . . . . . . . . . . . . . . . 327</literal></literallayout></para>
|
|
|
|
|
|
<para>The Decimal Integer add / subtract (fixed point) instructions complement the
|
|
|
Decimal Floating-Point instructions. They can also be used to accelerate some
|
|
|
binary to/from decimal conversions. The VSCR instructions provide access to
|
|
|
the Non-Java mode floating-point control and the saturation status. These
|
|
|
instructions are not normally of interest in porting Intel intrinsics.
|
|
|
|
|
|
<literallayout><literal>6.16 Decimal Integer Arithmetic Instructions . . . . . . . . . . . . 328
|
|
|
6.17 Vector Status and Control Register Instructions . . . . . . . . 331</literal></literallayout></para>
|
|
|
|
|
|
<para>With PowerISA 2.07B (Power8) several major extensions were added to
|
|
|
the Vector Facility:</para>
|
|
|
|
|
|
<itemizedlist spacing="compact">
|
|
|
<listitem>
|
|
|
<para>Vector Crypto: Under “Vector Exclusive-OR-based Instructions
|
|
|
Vector Exclusive-OR-based Instructions”, AES [inverse] Cipher, SHA 256 / 512
|
|
|
Sigma, Polynomial Multiplication, and Permute and XOR instructions.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>64-bit Integer; signed and unsigned add / subtract, signed and
|
|
|
unsigned compare, Even / Odd 32 x 32 multiple with 64-bit product, signed /
|
|
|
unsigned max / min, rotate and shift left/right.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Direct Move between GPRs and the FPRs / Left half of Vector
|
|
|
Registers.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>128-bit integer add / subtract with carry / extend, direct
|
|
|
support for vector <literal>__int128</literal> and multiple precision arithmetic.</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Decimal Integer add / subtract for 31 digit Binary Coded Decimal (BCD).</para>
|
|
|
</listitem>
|
|
|
<listitem>
|
|
|
<para>Miscellaneous SIMD extensions: Count leading Zeros, Population
|
|
|
count, bit gather / permute, and vector forms of eqv, nand, orc.</para>
|
|
|
</listitem>
|
|
|
</itemizedlist>
|
|
|
|
|
|
<para>The rationale for these being included in the Vector Facilities
|
|
|
(VMX) (vs Vector-Scalar Floating-Point Operations (VSX)) has more to do with
|
|
|
how the instructions were encoded than with the type of operations or the ISA
|
|
|
version of introduction. This is primarily a trade-off between the bits
|
|
|
required for register selection versus the bits for extended op-code space within a
|
|
|
fixed 32-bit instruction. </para>
|
|
|
|
|
|
<para>Basically accessing 32 vector registers requires
|
|
|
5 bits per register, while accessing all 64 vector-scalar registers require
|
|
|
6 bits per register. When you consider that most vector instructions require
|
|
|
3 and some (select, fused multiply-add) require 4 register operand forms, the
|
|
|
impact on op-code space is significant. The larger register set of VSX was
|
|
|
justified by queueing theory of larger HPC matrix codes using double float,
|
|
|
while 32 registers are sufficient for most applications.</para>
|
|
|
|
|
|
<para>So by definition the VMX instructions are restricted to the original
|
|
|
32 vector registers while VSX instructions are encoded to access all 64
|
|
|
floating-point scalar and vector double registers. This distinction can be
|
|
|
troublesome when programming at the assembler level, but the compiler and
|
|
|
compiler built-ins can hide most of this detail from the programmer. </para>
|
|
|
|
|
|
</section>
|
|
|
|