Most states set opsel_a directly to select the operand for the A input
of the main adder. The exception is the EXC_RESULT state, which uses
r.opsel_a set by the previous cycle to indicate which input operand to
use as the result.
In order to make timing, ensure that the controls that select the
inputs to the main adder (opsel_*, etc.) don't depend on any
complicated functions of the data (such as px_nz, pcmpb_eq, pcmpb_lt,
etc.), but are as far as possible constant for each state. There is
now a control called set_r for whether the result is written to r.r,
which enables us to avoid setting opsel_b or opsel_r conditionally in
some cases.
Also, to avoid a data-dependent setting of msel_2 in IDIV_DODIV state,
the IDIV_NR1 and IDIV_NR2 states have been reworked so that completion
of the required number of iterations is checked in IDIV_NR1 state, and
at that point, if the inverse estimate is < 0.5, we go to IDIV_USE0_5
state in order to use 0.5 as the estimate. This means that in the
normal case, the inverse estimate is already in Y when we get to
IDIV_DODIV state. IDIV_USE0_5 has been reworked to put R (which will
contain 0.5) into Y as the inverse estimate. That means that
IDIV_DODIV state doesn't have any data-dependent logic to put either P
or R into Y.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Since r.x is mostly set from the value in r.r and only once from
anything else (r.b.mantissa), move the check to before the input
multiplexer for the main adder, so it works on r.r rather than
whatever is selected by r.opsel_a.
For the case in DO_FRSP where we have B selected by r.opsel_a, we add
a new state so that we now get B into R and then check the low bits of
R.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead use things derived from the instruction in the first cycle,
such as r.is_multiply, r.is_addition, etc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The architecture specifies that an invalid operation exception for
signalling NaN (VXSNAN) can occur in the same instruction as an
invalid operation exception for infinity times zero (VXIMZ) in the
case of a multiply-add instruction where B is a signalling NaN, and
one of A and C is infinity and the other is zero. This moves the
invalid operation tests around so as to handle this case correctly.
It also restructures the infinity and NaN cases to simplify the logic
a little.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
By starting out with result_sign = +/- sign of B, we avoid the need to
flip the result sign in a few places.
This also simplifies DO_FMADD state a bit by having DO_ZERO_DEN go to
DO_FMUL state for floating multiply-add where B is zero. (The
RENORM_A2 and RENORM_C2 states already do this.)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of operating on result_sign directly, the state machine now
sets a control variable "rsgn_op" that then directs a tiny ALU to do
what's required.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the computation of r.result_sign out of the various
states for most instructions. Now the sign is mostly computed in the
first cycle (when e_in.valid is true).
The set of operations done on r.result_sign in the state machine is
now restricted to five (other than no change): invert, xor with
r.is_subtract, or set to the sign of A, B or C.
Similarly r.is_subtract and r.negate are computed in the first cycle
now.
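As an illustration only, the restricted set of sign operations can be
modelled as a small function. The enum member names and the function
signature below are hypothetical; only the list of operations comes
from the description above.

```python
from enum import Enum, auto


class RsgnOp(Enum):
    """Hypothetical encoding of the sign-operation control."""
    NOP = auto()           # leave result_sign unchanged
    INVERT = auto()        # flip result_sign
    XOR_SUBTRACT = auto()  # xor result_sign with is_subtract
    SIGN_A = auto()        # copy the sign of operand A
    SIGN_B = auto()        # copy the sign of operand B
    SIGN_C = auto()        # copy the sign of operand C


def next_result_sign(op, result_sign, is_subtract, sign_a, sign_b, sign_c):
    """Return the new 1-bit result sign for the selected operation."""
    if op is RsgnOp.INVERT:
        return result_sign ^ 1
    if op is RsgnOp.XOR_SUBTRACT:
        return result_sign ^ is_subtract
    if op is RsgnOp.SIGN_A:
        return sign_a
    if op is RsgnOp.SIGN_B:
        return sign_b
    if op is RsgnOp.SIGN_C:
        return sign_c
    return result_sign  # RsgnOp.NOP
```

The point of the sketch is that each state only has to name an
operation; the sign arithmetic itself lives in one place.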
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With ftdiv, we weren't setting result_exp to B.exponent before
testing result_exp in state FTDIV_1; the fix is to transfer B.exponent
to result_exp in state DO_FTDIV.
With ftsqrt, we were setting bit 1 of the destination CR field to 0
always, due to a typo.
Also move a couple of statements around to try to get slightly simpler
logic.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Renormalization of the divisor for fdiv[s] was adjusting the result
exponent in the wrong direction, making the result smaller in
magnitude than it should be by a power of 2. Fix this by negating
r.shift in the RENORM_B2 state and then subtracting it in the LOOKUP
cycle.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The sign recorded in FPRF was sometimes wrong because we weren't doing
the modifications that were done in pack_dp when setting FPRF (FPSCR
field). These modifications are: set sign for zero result of
subtraction based on rounding mode; negate result for fnmadd/sub;
but don't modify sign of NaNs.
Instead we now do these modifications in the main state machine code
and put the result in an 'rsign' variable that is used to set
v.res_sign, then r.res_sign is used in the next cycle both for setting
FPRF and in the pack_dp functions. That simplifies pack_dp and lets
us get rid of r.res_negate, r.res_subtract and r.res_rmode.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds an 'is_signed' signal to MultiplyInputType to indicate
whether the data1 and data2 fields are to be interpreted as signed or
unsigned numbers.
The 'not_result' field is replaced by a 'subtract' field which
provides a more intuitive interface for requesting that the product be
subtracted from the addend rather than added, i.e. subtract = 1 gives
C - A * B, vs. subtract = 0 giving C + A * B. (Previously the users
of the multipliers got the same effect by complementing the addend and
setting not_result = 1.)
The is_32bit field is removed because it is no longer used now that we
have a separate 32-bit multiplier.
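The equivalence between the old not_result convention and the new
subtract field follows from two's complement arithmetic:
~(~C + A * B) = C - A * B. A minimal sketch, assuming a 128-bit
result width and hypothetical function names:

```python
MASK = (1 << 128) - 1  # assumed result width for this sketch


def mul_old(a, b, addend, not_result):
    """Old-style interface: optionally complement the whole result."""
    r = (a * b + addend) & MASK
    return (~r & MASK) if not_result else r


def mul_new(a, b, addend, subtract):
    """New-style interface: subtract=1 gives C - A * B."""
    p = a * b
    return (addend - p if subtract else addend + p) & MASK


def old_style_subtract(a, b, c):
    """How a user obtained C - A * B before: complement C, set not_result."""
    return mul_old(a, b, ~c & MASK, True)
```

Both forms agree modulo 2**128, including when the product exceeds
the addend and the result wraps.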
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When a floating-point subtraction results in a zero result, the sign
of the result is required to be positive in all rounding modes except
the round to minus infinity mode, when it is negative. Consolidate
the logic for doing this in one place, in the pack_dp function,
instead of having it at each place where a zero result is generated.
Since fnmadd[s] and fnmsub[s] negate the result after this rule has
been applied, we use the r.negate signal to indicate a negation which
is now done in pack_dp. Thus the EXC_RESULT state no longer uses
r.negate, and in fact doesn't set v.result_sign at all; that is now
done in the states that lead into EXC_RESULT.
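The sign rule consolidated here can be sketched as follows. This is
not the pack_dp code; the function name and arguments are
hypothetical, and the rounding-mode values assume the usual FPSCR RN
encoding (0b11 = round toward minus infinity).

```python
ROUND_NEAREST, ROUND_ZERO, ROUND_PLUS_INF, ROUND_MINUS_INF = 0, 1, 2, 3


def final_sign(sign, is_zero, is_subtract, negate, rmode):
    """Return the sign bit to pack into the result.

    is_subtract is assumed to mean the operation was an effective
    subtraction; negate is assumed to be set for fnmadd[s]/fnmsub[s].
    """
    if is_zero and is_subtract:
        # An exact zero from a subtraction is +0 except when rounding
        # toward minus infinity, where it is -0.
        sign = 1 if rmode == ROUND_MINUS_INF else 0
    if negate:
        # The negation is applied after the zero-sign rule.
        sign ^= 1
    return sign
```

Applying the negation last is what lets the zero-sign rule live in a
single place.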
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Do more decoding of the instruction ahead of the IDLE state
processing so that the IDLE state code becomes much simpler.
To make the decoding easier, we now use four insn_type_t codes for
floating-point operations rather than two. This also rearranges the
insn_type_t values a little to get the 4 FP opcode values to differ
only in the bottom 2 bits, and put OP_DIV, OP_DIVE and OP_MOD next to
them.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the large case statement sets values for a set of control
signals, which then control multiplexers and adders that generate
values for v.result_exp and v.shift. The plan is for the case
statement to turn into a microcode ROM eventually.
The value of v.result_exp is the sum of two values, either of which
can be negated (but not both). The first value can be chosen from the
result exponent, A exponent, B exponent arithmetically shifted right
one bit, or 0. The second value can be chosen from new_exp (which is
r.result_exp - r.shift), B exponent, C exponent or a constant. The
choices for the constant are 0, 56, the maximum exponent (max_exp) or
the exponent bias for trap-enabled overflow conditions (bias_exp).
These choices are controlled by the signals re_sel1, re_neg1, re_sel2
and re_neg2, and the sum is written into v.result_exp if re_set_result
is 1.
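A behavioural sketch of the v.result_exp path described above. The
signal names re_sel1, re_neg1, re_sel2, re_neg2 and re_set_result are
from the text; the string selector values and the way the constant is
passed in are assumptions made for illustration.

```python
def next_result_exp(r_result_exp, r_shift, a_exp, b_exp, c_exp,
                    re_sel1, re_neg1, re_sel2, re_neg2, re_set_result,
                    constant=0):
    """Model the two-input exponent adder feeding v.result_exp."""
    if re_neg1 and re_neg2:
        raise ValueError("only one adder input may be negated")
    new_exp = r_result_exp - r_shift
    # First input: result exponent, A exponent, B exponent / 2, or 0.
    in1 = {"RESULT": r_result_exp,
           "A": a_exp,
           "B_HALF": b_exp >> 1,   # arithmetic shift right by one bit
           "ZERO": 0}[re_sel1]
    # Second input: new_exp, B exponent, C exponent, or a constant
    # (0, 56, max_exp or bias_exp in the text; supplied by the caller).
    in2 = {"NEW_EXP": new_exp,
           "B": b_exp,
           "C": c_exp,
           "CONST": constant}[re_sel2]
    total = (-in1 if re_neg1 else in1) + (-in2 if re_neg2 else in2)
    # The sum is only written when re_set_result is set.
    return total if re_set_result else r_result_exp
```

For example, selecting A for the first input and a negated B for the
second yields A.exponent - B.exponent.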
For v.shift we also compute the sum of two values, either of which
can be negated (but not both). The first value can be chosen from
new_exp, B exponent, r.shift, or 0. The second value can be chosen
from the A exponent or a constant. The possible constants are 0, 1,
4, 8, 32, 52, 56, 63, 64, or the minimum exponent (min_exp). These
choices are controlled by the signals rs_sel1, rs_neg1, rs_sel2 and
rs_neg2. After the adder there is a multiplexer which selects either
the sum or a shift count for normalization (derived from a
count-leading-zeroes operation on R) to be written into v.shift. The
count-leading-zeroes result does not go through the adder for timing
reasons.
In order to simplify the logic and help improve timing, settings of
the control signals have been made unconditional in a state in many
places, even if those settings are only required when some condition
is met.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This is in preparation for an explicit exponent data path. The fix is
that fre[s] needs to negate the exponent after renormalization rather
than before; otherwise the exponent adjustment done by the
renormalization is in the wrong direction.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
fpu.vhdl:513:18:warning: declaration of "result" hides signal "result" [-Whide]
    variable result : std_ulogic_vector(63 downto 0);
Signed-off-by: Joel Stanley <joel@jms.id.au>
This starts the process of removing SPRs from the register file by
moving SRR0/1, SPRG0-3, HSRR0/1 and HSPRG0/1 out of the register file
and putting them into execute1. They are stored in a pair of small
RAM arrays, referred to as "even" and "odd". The reason for having
two arrays is so that two values can be read and written in each
cycle. For example, SRR0 and SRR1 can be written in parallel by an
interrupt and read in parallel by the rfid instruction.
The addresses in the RAM which will be accessed are determined in the
decode2 stage. We have one write address for both sides, but two read
addresses, since in future we will want to be able to read CTR at the
same time as either LR or TAR.
We now have a connection from writeback to execute1 which carries the
partial SRR1 value for an interrupt. SRR0 comes from the execute
pipeline; we no longer need to carry instruction addresses along the
LSU and FPU pipelines. Since SRR0 and SRR1 can be written in the same
cycle now, we don't need the little state machine in writeback any
more.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
- Arrange for XER to be written for OE=1 forms
- Arrange for condition codes to be set for RC=1 forms
(including correct handling for 32-bit mode)
- Don't instantiate the divider if we have an FPU.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds logic to the FPU to accomplish 64-bit integer divisions.
No instruction actually uses this yet.
The algorithm used is to obtain an estimate of the reciprocal of the
divisor using the lookup table and refine it by one to three
iterations of the Newton-Raphson algorithm (the number of iterations
depends on the number of significant bits in the dividend). Then the
reciprocal is multiplied by the dividend to get the quotient estimate.
The remainder is calculated as dividend - quotient * divisor. If the
remainder is greater than or equal to the divisor, the quotient is
incremented, or if a modulo operation is being done, the divisor is
subtracted from the remainder. The inverse estimate after refinement
is good enough that the quotient estimate is always equal to or one
less than the true quotient.
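The shape of the algorithm can be sketched in software as below. This
is an illustration, not the FPU data path: the 80 fraction bits, the
10-bit stand-in for the lookup table and the fixed three iterations
are assumptions chosen so that the estimate is never above the true
reciprocal.

```python
F = 80  # fraction bits kept for the reciprocal (assumed for this sketch)


def reciprocal_estimate(dn):
    """Under-estimate of 2**64 / dn scaled by 2**F, 2**63 <= dn < 2**64."""
    top = dn >> 54                     # top 10 bits stand in for the table
    x = (1 << (F + 10)) // (top + 1)   # never above the true reciprocal
    for _ in range(3):                 # Newton-Raphson: x <- x * (2 - D * x)
        t = -((-(dn * x)) >> 64)       # D * x, rounded up
        x = (x * ((2 << F) - t)) >> F  # rounded down: stays an under-estimate
    return x


def divide(dividend, divisor):
    """Unsigned 64-bit divide returning (quotient, remainder)."""
    assert 0 <= dividend < (1 << 64) and 0 < divisor < (1 << 64)
    k = divisor.bit_length()
    x = reciprocal_estimate(divisor << (64 - k))  # normalized divisor
    quotient = (dividend * x) >> (F + k)          # quotient estimate
    remainder = dividend - quotient * divisor
    if remainder >= divisor:                      # estimate was one too low
        quotient += 1
        remainder -= divisor
    return quotient, remainder
```

Because every rounding in the reciprocal goes in the same direction,
a single conditional correction is enough; no loop is needed.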
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the representation of the R, A, B and C registers in the
FPU from 10.54 format (10 bits to the left of the binary point and 54
bits to the right) to 8.56 format, to match the representation used in
the P and Y registers and the multiplier operands. This eliminates
the need for shifting when R, A, B or C is an input to the multiplier
and will make it easier to implement integer division in the FPU.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the FPU able to stall other units at execute stage 2 and be
stalled by other units (specifically the LSU).
This means that the completion and writeback for an instruction can
now end up being deferred until the second cycle of a following
instruction, i.e. the cycle when the state machine has gone through
IDLE state into one of the DO_* states, which means we need to latch
the destination FPR number, CR mask, etc. from the previous
instruction so that we present the correct information to writeback.
The advantage of this is that we can get rid of the in_progress signal
from the LSU.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
While this is not an issue in VHDL, I noticed this when running
a script over the source and we may as well fix it.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This changes the way GPR hazards are detected and tracked. Instead of
having a model of the pipeline in gpr_hazard.vhdl, which has to mirror
the behaviour of the real pipeline exactly, we now assign a 2-bit tag
to each instruction and record which GSPR the instruction writes.
Subsequent instructions that need to use the GSPR get the tag number
and stall until the value with that tag is being written back to the
register file.
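A toy model of the tag scheme, for illustration only. The class and
method names are hypothetical, and the behaviour when all tags are in
use (raising an error) is an assumption; it stands in for whatever
the pipeline really does in that case.

```python
class HazardTracker:
    """Track which register each in-flight tag will write."""

    NUM_TAGS = 4  # 2-bit tag

    def __init__(self):
        self.tag_reg = [None] * self.NUM_TAGS  # register written per tag
        self.next_tag = 0

    def issue(self, dest_reg):
        """Allocate a tag for an instruction writing dest_reg."""
        tag = self.next_tag
        if self.tag_reg[tag] is not None:
            raise RuntimeError("no free tag")  # assumed: issue must wait
        self.tag_reg[tag] = dest_reg
        self.next_tag = (tag + 1) % self.NUM_TAGS
        return tag

    def pending_tag(self, src_reg):
        """Return the newest tag that will write src_reg, or None."""
        for i in range(1, self.NUM_TAGS + 1):
            tag = (self.next_tag - i) % self.NUM_TAGS
            if self.tag_reg[tag] == src_reg:
                return tag
        return None

    def writeback(self, tag):
        """The value for this tag has reached the register file."""
        self.tag_reg[tag] = None
```

A consumer that gets a tag back from pending_tag() stalls until
writeback() is called for that tag.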
For now, the forwarding paths are disabled. That gives about an 8%
reduction in coremark performance.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of using the mask generator in the rounding process, this uses
simpler logic to add in a 1 at the appropriate position (bit 2 or bit
31, depending on precision) and mask off the low-order bits. Since
there are only two positions at which the masking and incrementing
need to be done, we don't need the full generality of the mask
generator. This reduces the amount of logic and improves timing.
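A minimal sketch of the simplified increment-and-mask step. The bit
positions 2 and 31 are from the text above; the function name and the
round_up argument (the rounding decision, computed elsewhere) are
hypothetical.

```python
def round_mantissa(r, single_precision, round_up):
    """Add 1 at the rounding position if requested, then clear low bits."""
    lsb = 31 if single_precision else 2  # least significant kept bit
    if round_up:
        r += 1 << lsb
    # Mask off everything below the kept precision.
    return r & ~((1 << lsb) - 1)
```

With only two possible positions, the mask and the increment are
constants selected by the precision.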
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At present there is a state transition in the handling of the fmadd
instructions where the next state depends on the sign bit of the
multiplier result. This creates a critical path which doesn't make
timing on the A7-100. To fix this, we make the state transition
independent of the sign of the multiplier result, which improves
timing, but means we take one more cycle to do a fmadd-family
instruction in some cases.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The masking enabled by opsel_amask is only used when rounding, to trim
the rounded result to the required precision. We now do the masking
after the adder rather than before (on the A input). This gives the
same result and helps timing. The path from r.shift through the mask
generator and adder to v.r was showing up as a critical path.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves longmask into the reg_type record, meaning that it now
needs to be decided a cycle earlier, in order to help timing.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves opsel_a into the reg_type record, meaning that the A
multiplexer input now needs to be decided a cycle earlier. This helps
timing by eliminating the combinatorial path from r.state and other
things to opsel_a and thence to in_a and result.
This means that some things now take an extra cycle, in particular
some of the exception cases such as when one or both operands are
NaNs. The NaN handling has been moved out to its own state, which
simplifies the logic for exception cases in other places. Additions
or subtractions where FRB's exponent is smaller than FRA's will
also take an extra cycle.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements fmadd, fmsub, fnmadd, fnmsub and their
single-precision counterparts. The single-precision versions operate
the same as the double-precision versions until the final rounding and
overflow/underflow steps.
This adds an S register to store the low bits of the product. S
shifts into R on left shifts, and can be negated, but doesn't do any
other arithmetic.
This adds a test for the double-precision versions of these
instructions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the floating square-root calculation using a table
lookup of the inverse square root approximation, followed by three
iterations of Goldschmidt's algorithm, which gives estimates of both
sqrt(FRB) and 1/sqrt(FRB). Then the residual is calculated as
FRB - R * R and that is multiplied by the 1/sqrt(FRB) estimate to get
an adjustment to R. The residual and the adjustment can be negative,
and since we have an unsigned multiplier, the upper bits can be wrong.
In practice the adjustment fits into an 8-bit signed value, and the
bottom 8 bits of the adjustment product are correct, so we sign-extend
them, divide by 4 (because R is in 10.54 format) and add them to R.
Finally the residual is calculated again and compared to 2*R+1 to see
if a final increment is needed. Then the result is rounded and
written back.
This implements fsqrts as fsqrt, but with rounding to single precision
and underflow/overflow calculation using the single-precision exponent
range. This could be optimized later.
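The Goldschmidt iteration itself can be sketched in floating point as
below. This only shows the coupled update of the two estimates; the
fixed-point formats, the residual correction and the final increment
are left out, and the function name and initial estimate y0 are
illustrative.

```python
def goldschmidt_sqrt(b, y0, iterations=3):
    """Refine y0 ~= 1/sqrt(b); return (sqrt estimate, 1/sqrt estimate)."""
    g = b * y0     # converges to sqrt(b)
    h = y0 / 2.0   # converges to 1 / (2 * sqrt(b))
    for _ in range(iterations):
        r = 0.5 - g * h  # residual of g * h from its limit 1/2
        g = g + g * r
        h = h + h * r
    return g, 2.0 * h
```

For instance, goldschmidt_sqrt(2.0, 0.7) starts from an estimate that
is about 1% off and converges quadratically towards sqrt(2) and
1/sqrt(2).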
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements frsqrte by table lookup. We first normalize the input
if necessary and adjust so that the exponent is even, giving us a
mantissa value in the range [1.0, 4.0), which is then used to look up
an entry in a 768-entry table. The 768 entries are appended to the
table for reciprocal estimates, giving a table of 1024 entries in
total. frsqrtes is implemented identically to frsqrte.
The estimate supplied is accurate to 1 part in 1024 or better.
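The exponent adjustment and table indexing can be sketched as
follows. The layout assumed here (reciprocal entries at indices 0 to
255, inverse square root entries at 256 to 1023, one entry per 1/256
step of the mantissa) is a guess for illustration and is not taken
from the table itself.

```python
def rsqrt_table_index(mantissa, exponent):
    """Return (table index, even exponent) for a normalized input.

    mantissa is in [1.0, 2.0) and exponent is the unbiased exponent.
    """
    if exponent % 2 != 0:
        # Make the exponent even by moving one factor of 2 into the
        # mantissa, which then lies in [1.0, 4.0).
        mantissa *= 2.0
        exponent -= 1
    index = 256 + int((mantissa - 1.0) * 256)  # assumed layout
    return index, exponent
```

With an even exponent e, the inverse square root of m * 2**e is
(1 / sqrt(m)) * 2**(-e / 2), so only the mantissa needs a lookup.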
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This just returns the value from the inverse lookup table. The result
is accurate to better than one part in 512 (the architecture requires
1/256).
This also adds a simple test, which relies on the particular values in
the inverse lookup table, so it is not a general test.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements floating-point division A/B by a process that starts
with normalizing both inputs if necessary. Then an estimate of 1/B
from a lookup table is refined by 3 Newton-Raphson iterations and then
multiplied by A to get a quotient. The remainder is calculated as
A - R * B (where R is the result, i.e. the quotient) and the remainder
is compared to 0 and to B to see whether the quotient needs to be
incremented by 1. The calculations of 1 / B are done with 56 fraction
bits and intermediate results are truncated rather than rounded,
meaning that the final estimate of 1 / B is always correct or a little
bit low, never too high, and thus the calculated quotient is correct
or 1 unit too low. Doing the estimate of 1 / B with sufficient
precision that the quotient is always correct to the last bit without
needing any adjustment would require many more bits of precision.
This implements fdivs by computing a double-precision quotient and
then rounding it to single precision. It would be possible to
optimize this by e.g. doing only 2 iterations of Newton-Raphson and
then doing the remainder calculation and adjustment at single
precision rather than double precision.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the fmul and fmuls instructions.
For fmul[s] with denormalized operands we normalize the inputs
before doing the multiplication, to eliminate the need for doing
count-leading-zeroes on P. This adds 3 or 5 cycles to the
execution time when one or both operands are denormalized.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>