Matrix-Multiply Assist (MMA) Intrinsic Reference
Introduction
Version 3.1 of the Power Instruction Set Architecture introduced instructions to accelerate matrix multiplication computations. These instructions operate both on the VSRs and on eight new 512-bit accumulator registers (ACCs). Intrinsic functions to access these instructions are described in this chapter.
Although the ACCs are treated as separate registers from the VSRs, each ACC[i] may use its associated VSRs 4i to 4i+3 as scratch space. That is, when ACC[i] contains defined data, the contents of VSRs 4i to 4i+3 are undefined until an xxmfacc instruction is used to copy the contents of ACC[i] to the VSRs. Writing to a VSR associated with ACC[i] that contains defined data will cause ACC[i] to become undefined.
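The aliasing rule above can be pictured with a small plain-C model. This is a conceptual sketch only, not the real register file; the type and helper names (acc_model, write_vsr, xxmfacc_model) are invented for illustration.

```c
#include <string.h>

/* Conceptual model: ACC[i] and its backing VSRs 4i..4i+3 hold 512 bits
 * between them, and only one view is defined at a time. */
typedef struct {
    unsigned char acc[64];  /* the 512-bit accumulator                  */
    unsigned char vsr[64];  /* its four 128-bit VSRs, concatenated      */
    int acc_defined;        /* nonzero while the ACC holds live data    */
} acc_model;

/* Writing any backing VSR makes the accumulator contents undefined. */
static void write_vsr(acc_model *m, int n, const unsigned char b[16]) {
    memcpy(&m->vsr[16 * n], b, 16);
    m->acc_defined = 0;
}

/* Model of xxmfacc: copy the accumulator out to its VSRs. */
static void xxmfacc_model(acc_model *m) {
    memcpy(m->vsr, m->acc, 64);
}
```

Under this model, code must call xxmfacc_model before reading the VSRs, and any write_vsr invalidates the accumulator view.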
This reference is not intended to be a complete introduction to MMA concepts. The reader is directed to the Matrix-Multiply Assist Best Practices Guide and to the Power ISA.
Type Support
Many of the MMA instructions operate on aligned pairs of vectors (that is, an even-numbered vector and the next-higher-numbered vector), or on aligned quads of vectors (that is, a vector number divisible by four and the three next-higher-numbered vectors). Compilers that support the MMA intrinsic functions must define two types, __vector_pair and __vector_quad, to represent these concepts. Pointers and references to these types must also be supported where these concepts exist in the source language.
Intrinsic Functions
The intrinsics in this section are not overloaded. Each is
presented with its prototype and the instruction it represents.
The string "vuc" is used as shorthand for "vector unsigned
char" throughout.
Memory Access
Load and store vector pairs.
lxvp: __builtin_vsx_lxvp
stxvp: __builtin_vsx_stxvp
lxvpx: __builtin_vsx_lxvp
stxvpx: __builtin_vsx_stxvp
Prototype
Instruction
__vector_pair __builtin_vsx_lxvp (signed long a, const __vector_pair* b)
lxvp r,a(b) or lxvpx r,b,a
void __builtin_vsx_stxvp (__vector_pair s, signed long a, __vector_pair* b)
stxvp s,a(b) or stxvpx s,b,a
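In plain-C terms, the effect of these intrinsics can be modeled as moving 32 contiguous bytes at effective address b + a. This is a conceptual sketch only (the helper names are invented); the real intrinsics produce the listed instructions and operate on the opaque __vector_pair type.

```c
#include <string.h>

/* Model of lxvp/lxvpx: load 32 contiguous bytes at address b + a. */
static void lxvp_model(unsigned char pair[32], long a, const void *b) {
    memcpy(pair, (const unsigned char *)b + a, 32);
}

/* Model of stxvp/stxvpx: store 32 contiguous bytes at address b + a. */
static void stxvp_model(const unsigned char pair[32], long a, void *b) {
    memcpy((unsigned char *)b + a, pair, 32);
}
```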
Assembly and Disassembly of Large Types
The following intrinsics are used to construct __vector_pair and __vector_quad objects from 128-bit vectors, and to deconstruct them into such vectors. The disassembly interfaces place the results into arrays of vectors using natural element order. The build interfaces treat the vector input arguments as if they form an array of vectors, with the first vector argument being array element 0 in natural element order, the second vector argument being array element 1, and so forth. The assemble interfaces are deprecated because they do not give consistent results for big- and little-endian targets; users should use the build interfaces instead.
Prototype
Notes
void __builtin_mma_assemble_acc (__vector_quad*, vuc, vuc, vuc, vuc)  (Deprecated)
void __builtin_mma_build_acc (__vector_quad*, vuc, vuc, vuc, vuc)
void __builtin_mma_disassemble_acc (void*, __vector_quad*)
void __builtin_vsx_assemble_pair (__vector_pair*, vuc, vuc)  (Deprecated)
void __builtin_vsx_build_pair (__vector_pair*, vuc, vuc)
void __builtin_vsx_disassemble_pair (void*, __vector_pair*)
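The ordering rule for the build and disassemble interfaces can be sketched in plain C. This is a conceptual model only (the helper names are invented, and the opaque __vector_quad type is stood in for by a 64-byte array): the vector arguments behave as if stored into an array of vectors in natural element order.

```c
#include <string.h>

typedef unsigned char vec16[16];

/* Model of __builtin_mma_build_acc: arguments v0..v3 become array
 * elements 0..3 of the 512-bit destination, in natural element order. */
static void build_acc_model(unsigned char acc[64], const vec16 v0,
                            const vec16 v1, const vec16 v2, const vec16 v3) {
    memcpy(acc +  0, v0, 16);
    memcpy(acc + 16, v1, 16);
    memcpy(acc + 32, v2, 16);
    memcpy(acc + 48, v3, 16);
}

/* Model of __builtin_mma_disassemble_acc: the inverse mapping back
 * into an array of four vectors. */
static void disassemble_acc_model(vec16 out[4], const unsigned char acc[64]) {
    memcpy(out, acc, 64);
}
```

A build followed by a disassemble round-trips the four input vectors unchanged.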
Accumulator Clear Operation
This intrinsic function initializes an accumulator to zeros.
xxsetaccz: __builtin_mma_xxsetaccz
Prototype
Instruction
void __builtin_mma_xxsetaccz (__vector_quad* a)
xxsetaccz a
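In plain-C terms (a conceptual model, not the intrinsic itself), clearing an accumulator is equivalent to zeroing all 512 bits:

```c
#include <string.h>

/* Model of __builtin_mma_xxsetaccz: all 64 bytes (512 bits) of the
 * accumulator become zero. */
static void xxsetaccz_model(unsigned char acc[64]) {
    memset(acc, 0, 64);
}
```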
Conversion Operations
These instructions convert between vectors of single precision
and bfloat16 types.
xvcvbf16spn: __builtin_vsx_xvcvbf16spn
xvcvspbf16: __builtin_vsx_xvcvspbf16
Prototype
Instruction
vuc __builtin_vsx_xvcvbf16spn (vuc a)
xvcvbf16spn a
vuc __builtin_vsx_xvcvspbf16 (vuc a)
xvcvspbf16 a
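A bfloat16 value is the high 16 bits of an IEEE binary32 value, so the per-element semantics can be modeled in scalar plain C. This is a sketch under assumptions: the real instructions operate on whole vectors, and NaN handling and the vector lane layout are omitted here.

```c
#include <stdint.h>
#include <string.h>

/* Model of xvcvbf16spn for one element: widen bfloat16 to single
 * precision. This direction is exact. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Model of xvcvspbf16 for one element: narrow single precision to
 * bfloat16 with round-to-nearest-even. */
static uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);  /* round to nearest even */
    return (uint16_t)(bits >> 16);
}
```

Narrowing discards 16 bits of significand, so round-tripping float -> bfloat16 -> float loses precision for values that need more than 8 significand bits.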
Outer Product Operations
Each of these intrinsics generates an instruction to perform
an outer product operation.
pmxvbf16ger2: __builtin_mma_pmxvbf16ger2
pmxvbf16ger2nn: __builtin_mma_pmxvbf16ger2nn
pmxvbf16ger2np: __builtin_mma_pmxvbf16ger2np
pmxvbf16ger2pn: __builtin_mma_pmxvbf16ger2pn
pmxvbf16ger2pp: __builtin_mma_pmxvbf16ger2pp
pmxvf16ger2: __builtin_mma_pmxvf16ger2
pmxvf16ger2nn: __builtin_mma_pmxvf16ger2nn
pmxvf16ger2np: __builtin_mma_pmxvf16ger2np
pmxvf16ger2pn: __builtin_mma_pmxvf16ger2pn
pmxvf16ger2pp: __builtin_mma_pmxvf16ger2pp
pmxvf32ger: __builtin_mma_pmxvf32ger
pmxvf32gernn: __builtin_mma_pmxvf32gernn
pmxvf32gernp: __builtin_mma_pmxvf32gernp
pmxvf32gerpn: __builtin_mma_pmxvf32gerpn
pmxvf32gerpp: __builtin_mma_pmxvf32gerpp
pmxvf64ger: __builtin_mma_pmxvf64ger
pmxvf64gernn: __builtin_mma_pmxvf64gernn
pmxvf64gernp: __builtin_mma_pmxvf64gernp
pmxvf64gerpn: __builtin_mma_pmxvf64gerpn
pmxvf64gerpp: __builtin_mma_pmxvf64gerpp
pmxvi16ger2: __builtin_mma_pmxvi16ger2
pmxvi16ger2pp: __builtin_mma_pmxvi16ger2pp
pmxvi16ger2s: __builtin_mma_pmxvi16ger2s
pmxvi16ger2spp: __builtin_mma_pmxvi16ger2spp
pmxvi4ger8: __builtin_mma_pmxvi4ger8
pmxvi4ger8pp: __builtin_mma_pmxvi4ger8pp
pmxvi8ger4: __builtin_mma_pmxvi8ger4
pmxvi8ger4pp: __builtin_mma_pmxvi8ger4pp
pmxvi8ger4spp: __builtin_mma_pmxvi8ger4spp
xvbf16ger2: __builtin_mma_xvbf16ger2
xvbf16ger2nn: __builtin_mma_xvbf16ger2nn
xvbf16ger2np: __builtin_mma_xvbf16ger2np
xvbf16ger2pn: __builtin_mma_xvbf16ger2pn
xvbf16ger2pp: __builtin_mma_xvbf16ger2pp
xvf16ger2: __builtin_mma_xvf16ger2
xvf16ger2nn: __builtin_mma_xvf16ger2nn
xvf16ger2np: __builtin_mma_xvf16ger2np
xvf16ger2pn: __builtin_mma_xvf16ger2pn
xvf16ger2pp: __builtin_mma_xvf16ger2pp
xvf32ger: __builtin_mma_xvf32ger
xvf32gernn: __builtin_mma_xvf32gernn
xvf32gernp: __builtin_mma_xvf32gernp
xvf32gerpn: __builtin_mma_xvf32gerpn
xvf32gerpp: __builtin_mma_xvf32gerpp
xvf64ger: __builtin_mma_xvf64ger
xvf64gernn: __builtin_mma_xvf64gernn
xvf64gernp: __builtin_mma_xvf64gernp
xvf64gerpn: __builtin_mma_xvf64gerpn
xvf64gerpp: __builtin_mma_xvf64gerpp
xvi16ger2: __builtin_mma_xvi16ger2
xvi16ger2pp: __builtin_mma_xvi16ger2pp
xvi16ger2s: __builtin_mma_xvi16ger2s
xvi16ger2spp: __builtin_mma_xvi16ger2spp
xvi4ger8: __builtin_mma_xvi4ger8
xvi4ger8pp: __builtin_mma_xvi4ger8pp
xvi8ger4: __builtin_mma_xvi8ger4
xvi8ger4pp: __builtin_mma_xvi8ger4pp
xvi8ger4spp: __builtin_mma_xvi8ger4spp
Prototype
Instruction
void __builtin_mma_pmxvbf16ger2 (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvbf16ger2 a,b,c,d,e,f
void __builtin_mma_pmxvbf16ger2nn (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvbf16ger2nn a,b,c,d,e,f
void __builtin_mma_pmxvbf16ger2np (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvbf16ger2np a,b,c,d,e,f
void __builtin_mma_pmxvbf16ger2pn (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvbf16ger2pn a,b,c,d,e,f
void __builtin_mma_pmxvbf16ger2pp (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvbf16ger2pp a,b,c,d,e,f
void __builtin_mma_pmxvf16ger2 (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvf16ger2 a,b,c,d,e,f
void __builtin_mma_pmxvf16ger2nn (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvf16ger2nn a,b,c,d,e,f
void __builtin_mma_pmxvf16ger2np (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvf16ger2np a,b,c,d,e,f
void __builtin_mma_pmxvf16ger2pn (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvf16ger2pn a,b,c,d,e,f
void __builtin_mma_pmxvf16ger2pp (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvf16ger2pp a,b,c,d,e,f
void __builtin_mma_pmxvf32ger (__vector_quad* a, vuc b, vuc c,
const int d, const int e)
pmxvf32ger a,b,c,d,e
void __builtin_mma_pmxvf32gernn (__vector_quad* a, vuc b, vuc c,
const int d, const int e)
pmxvf32gernn a,b,c,d,e
void __builtin_mma_pmxvf32gernp (__vector_quad* a, vuc b, vuc c,
const int d, const int e)
pmxvf32gernp a,b,c,d,e
void __builtin_mma_pmxvf32gerpn (__vector_quad* a, vuc b, vuc c,
const int d, const int e)
pmxvf32gerpn a,b,c,d,e
void __builtin_mma_pmxvf32gerpp (__vector_quad* a, vuc b, vuc c,
const int d, const int e)
pmxvf32gerpp a,b,c,d,e
void __builtin_mma_pmxvf64ger (__vector_quad* a, __vector_pair b,
vuc c, const int d, const int e)
pmxvf64ger a,b,c,d,e
void __builtin_mma_pmxvf64gernn (__vector_quad* a, __vector_pair b,
vuc c, const int d, const int e)
pmxvf64gernn a,b,c,d,e
void __builtin_mma_pmxvf64gernp (__vector_quad* a, __vector_pair b,
vuc c, const int d, const int e)
pmxvf64gernp a,b,c,d,e
void __builtin_mma_pmxvf64gerpn (__vector_quad* a, __vector_pair b,
vuc c, const int d, const int e)
pmxvf64gerpn a,b,c,d,e
void __builtin_mma_pmxvf64gerpp (__vector_quad* a, __vector_pair b,
vuc c, const int d, const int e)
pmxvf64gerpp a,b,c,d,e
void __builtin_mma_pmxvi16ger2 (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvi16ger2 a,b,c,d,e,f
void __builtin_mma_pmxvi16ger2pp (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvi16ger2pp a,b,c,d,e,f
void __builtin_mma_pmxvi16ger2s (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvi16ger2s a,b,c,d,e,f
void __builtin_mma_pmxvi16ger2spp (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvi16ger2spp a,b,c,d,e,f
void __builtin_mma_pmxvi4ger8 (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvi4ger8 a,b,c,d,e,f
void __builtin_mma_pmxvi4ger8pp (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvi4ger8pp a,b,c,d,e,f
void __builtin_mma_pmxvi8ger4 (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvi8ger4 a,b,c,d,e,f
void __builtin_mma_pmxvi8ger4pp (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvi8ger4pp a,b,c,d,e,f
void __builtin_mma_pmxvi8ger4spp (__vector_quad* a, vuc b, vuc c,
const int d, const int e, const int f)
pmxvi8ger4spp a,b,c,d,e,f
void __builtin_mma_xvbf16ger2 (__vector_quad* a, vuc b, vuc c)
xvbf16ger2 a,b,c
void __builtin_mma_xvbf16ger2nn (__vector_quad* a, vuc b, vuc c)
xvbf16ger2nn a,b,c
void __builtin_mma_xvbf16ger2np (__vector_quad* a, vuc b, vuc c)
xvbf16ger2np a,b,c
void __builtin_mma_xvbf16ger2pn (__vector_quad* a, vuc b, vuc c)
xvbf16ger2pn a,b,c
void __builtin_mma_xvbf16ger2pp (__vector_quad* a, vuc b, vuc c)
xvbf16ger2pp a,b,c
void __builtin_mma_xvf16ger2 (__vector_quad* a, vuc b, vuc c)
xvf16ger2 a,b,c
void __builtin_mma_xvf16ger2nn (__vector_quad* a, vuc b, vuc c)
xvf16ger2nn a,b,c
void __builtin_mma_xvf16ger2np (__vector_quad* a, vuc b, vuc c)
xvf16ger2np a,b,c
void __builtin_mma_xvf16ger2pn (__vector_quad* a, vuc b, vuc c)
xvf16ger2pn a,b,c
void __builtin_mma_xvf16ger2pp (__vector_quad* a, vuc b, vuc c)
xvf16ger2pp a,b,c
void __builtin_mma_xvf32ger (__vector_quad* a, vuc b, vuc c)
xvf32ger a,b,c
void __builtin_mma_xvf32gernn (__vector_quad* a, vuc b, vuc c)
xvf32gernn a,b,c
void __builtin_mma_xvf32gernp (__vector_quad* a, vuc b, vuc c)
xvf32gernp a,b,c
void __builtin_mma_xvf32gerpn (__vector_quad* a, vuc b, vuc c)
xvf32gerpn a,b,c
void __builtin_mma_xvf32gerpp (__vector_quad* a, vuc b, vuc c)
xvf32gerpp a,b,c
void __builtin_mma_xvf64ger (__vector_quad* a, __vector_pair b, vuc c)
xvf64ger a,b,c
void __builtin_mma_xvf64gernn (__vector_quad* a, __vector_pair b, vuc c)
xvf64gernn a,b,c
void __builtin_mma_xvf64gernp (__vector_quad* a, __vector_pair b, vuc c)
xvf64gernp a,b,c
void __builtin_mma_xvf64gerpn (__vector_quad* a, __vector_pair b, vuc c)
xvf64gerpn a,b,c
void __builtin_mma_xvf64gerpp (__vector_quad* a, __vector_pair b, vuc c)
xvf64gerpp a,b,c
void __builtin_mma_xvi16ger2 (__vector_quad* a, vuc b, vuc c)
xvi16ger2 a,b,c
void __builtin_mma_xvi16ger2pp (__vector_quad* a, vuc b, vuc c)
xvi16ger2pp a,b,c
void __builtin_mma_xvi16ger2s (__vector_quad* a, vuc b, vuc c)
xvi16ger2s a,b,c
void __builtin_mma_xvi16ger2spp (__vector_quad* a, vuc b, vuc c)
xvi16ger2spp a,b,c
void __builtin_mma_xvi4ger8 (__vector_quad* a, vuc b, vuc c)
xvi4ger8 a,b,c
void __builtin_mma_xvi4ger8pp (__vector_quad* a, vuc b, vuc c)
xvi4ger8pp a,b,c
void __builtin_mma_xvi8ger4 (__vector_quad* a, vuc b, vuc c)
xvi8ger4 a,b,c
void __builtin_mma_xvi8ger4pp (__vector_quad* a, vuc b, vuc c)
xvi8ger4pp a,b,c
void __builtin_mma_xvi8ger4spp (__vector_quad* a, vuc b, vuc c)
xvi8ger4spp a,b,c
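To make the "ger" (outer-product accumulate) semantics concrete, the single precision case can be modeled in plain C. This is a sketch under assumptions, not the instruction definition: xvf32gerpp treats each vector operand as four floats, forms their 4x4 outer product, and adds it to the accumulator ("pp": positive multiply, positive accumulate); rounding detail and the accumulator register layout are ignored.

```c
/* Model of xvf32gerpp: rank-1 update of a 4x4 single precision
 * accumulator with the outer product of two 4-element vectors. */
static void xvf32gerpp_model(float acc[4][4], const float a[4],
                             const float b[4]) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            acc[i][j] += a[i] * b[j];
}
```

The other variants follow the same pattern: the suffix chooses the signs applied to the product and the accumulator (np, pn, nn), the non-pm forms without a suffix overwrite rather than accumulate, and the narrower integer types perform a rank-2, rank-4, or rank-8 update per element pair.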