For multiply and divide operations, execute1 now records the
destination GPR number, RC and OE from the instruction, and the
XER value. This means that the multiply and divide units don't
need to record those values and then send them back to execute1.
This makes the interface to those units a bit simpler. They
simply report an overflow signal along with the result value, and
execute1 takes care of updating XER if necessary.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

With this, the divider is a unit that execute1 sends operands to and
which sends its results back to execute1, which then sends them to
writeback. Execute1 now sends a stall signal when it gets a divide
or modulus instruction until it gets a valid signal back from the
divider. Divide and modulus instructions are no longer marked as
single-issue.
The data formatting step that used to be done in decode2 for div
and mod instructions is now done in execute1. We also do the
absolute value operation in that same cycle instead of taking an
extra cycle inside the divider for signed operations with a
negative operand.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

This adds code to writeback to format data and test the result
against zero for the purpose of setting CR0. The data formatter
is able to shift and mask by bytes and do byte reversal and sign
extension. It can also put together bytes from two input
doublewords to support unaligned loads (including unaligned
byte-reversed loads).
The data formatter starts with an 8:1 multiplexer that is able
to direct any byte of the input to any byte of the output. This
lets us rotate the data and simultaneously byte-reverse it.
The rotated/reversed data goes to a register for the unaligned
cases that overlap two doublewords. Then there is per-byte logic
that does trimming, sign extension, and splicing together bytes
from a previous input doubleword (stored in data_latched) and the
current doubleword. Finally the 64-bit result is tested to set
CR0 if rc = 1.
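
The formatting steps above can be sketched as a small behavioral model.
This is only an illustration in Python, not the VHDL in writeback: it
computes the same kind of result (byte select across two doublewords,
byte reversal, sign extension, CR0 test) without modelling the 8:1
multiplexers or the data_latched register. The names format_data and
cr0_bits, their parameters, and the little-endian byte-lane numbering
are assumptions made for the example.

```python
MASK64 = (1 << 64) - 1

def format_data(first_dword, second_dword, byte_offset, length,
                byte_reverse=False, sign_extend=False):
    """Hypothetical model: pick `length` bytes starting at `byte_offset`.

    `second_dword` supplies the bytes that spill past the first
    doubleword in the unaligned case. Byte 0 is the least significant
    byte of `first_dword` (an assumption of this sketch).
    """
    combined = (second_dword << 64) | first_dword
    value = (combined >> (8 * byte_offset)) & ((1 << (8 * length)) - 1)
    if byte_reverse:
        # Swap the byte order of the selected field.
        value = int.from_bytes(value.to_bytes(length, "little"), "big")
    if sign_extend and (value >> (8 * length - 1)) & 1:
        # Fill all bits above the field with ones.
        value |= MASK64 & ~((1 << (8 * length)) - 1)
    return value & MASK64

def cr0_bits(result):
    """Return the LT, GT, EQ bits of CR0 as a 4-bit field (SO left 0)."""
    signed = result - (1 << 64) if result >> 63 else result
    return ((signed < 0) << 3) | ((signed > 0) << 2) | ((signed == 0) << 1)
```

For example, format_data(0x1122334455667788, 0, 0, 2, byte_reverse=True,
sign_extend=True) selects 0x7788, reverses it to 0x8877 and sign-extends
it, and cr0_bits of that value reports LT.
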
This removes the RC logic from the execute2, multiply and divide
units, and the shift/mask/byte-reverse/sign-extend logic from
loadstore2.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

This adds logic to detect the cases where the quotient of the
division overflows the range of the output representation, and
return all zeroes in those cases, which is what POWER9 does.
To do this, we extend the dividend register by 1 bit and we do
an extra step in the division process to get the 2^64 bit of the
quotient, which ends up in the 'overflow' signal. This catches all
the cases where dividend >= 2^64 * divisor, including the case
where divisor = 0, and the divde/divdeu cases where |RA| >= |RB|.
Then, in the output stage, we also check that the result fits in
the representable range, which depends on whether the division is
a signed division or not, and whether it is a 32-bit or 64-bit
division. If the quotient is >= 2^64 or the result doesn't fit in the
representable range, write_data is set to 0 and write_cr_data to
0x20000000 (i.e. cr0.eq = 1).
POWER9 sets the top 32 bits of the result to zero for 32-bit signed
divisions, and sets CR0 when RC=1 according to the 64-bit value
(i.e. CR0.LT is always 0 for 32-bit signed divisions, even if the
32-bit result is negative). However, modsw with a negative result
sets the top 32 bits to all 1s. We follow suit.
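
The two overflow checks for div* instructions can be sketched as a
behavioral model. This is a rough Python illustration, not the divider's
VHDL: it uses integer division instead of the bit-serial loop, covers
only the divide (not modulus) instructions, and the function name
divide and its flag parameters are assumptions made for the example.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def divide(ra, rb, is_signed=False, is_32bit=False, is_extended=False):
    """Hypothetical model returning (write_data, overflow) for div*."""
    bits = 32 if is_32bit else 64
    mask = (1 << bits) - 1
    a = to_signed(ra, bits) if is_signed else ra & mask
    b = to_signed(rb, bits) if is_signed else rb & mask
    neg_result = (a < 0) != (b < 0)
    # Extended divides (divde, divwe, ...) shift the dividend left.
    dividend = abs(a) << (bits if is_extended else 0)
    divisor = abs(b)
    # First check: the 2^64 bit of the quotient would be set.
    # This is also true whenever divisor == 0.
    if dividend >= (divisor << 64):
        return 0, True
    quotient = dividend // divisor
    # Second check: the result must fit the representable range.
    if is_signed:
        limit = (1 << (bits - 1)) if neg_result else (1 << (bits - 1)) - 1
    else:
        limit = mask
    if quotient > limit:
        return 0, True
    if neg_result:
        quotient = -quotient
    # 32-bit results leave the top 32 bits of the 64-bit value zero.
    return quotient & mask, False
```

With this model, divide(5, 0) and a divdeu-style call with RA >= RB both
report overflow with a zero result, while a 32-bit signed -7 / 2 returns
0x00000000FFFFFFFD.
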
This updates divider_tb to check the invalid cases as well as the
valid cases.
This also fixes a small bug where the reset signal for the divider
was driven from rst when it should have been driven from core_rst.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

This moves the negation of negative operands for signed divide and
modulus operations out of the decode2 stage and into the divider.
If either of the operands for a signed divide or modulus operation
is negative, the divider now takes an extra cycle to negate the
operands that are negative.
The interface to the divider now has an 'is_signed' signal rather
than a 'neg_result' signal, and the dividend and divisor can be
negative, so divider_tb had to be updated for the new interface.
The reason for doing this is that one of the worst timing violations
on the Arty A7-100 at 100MHz involved the carry chain in the adders
that did the negation of the dividend and divisor in the decode stage.
Moving the negations to a separate cycle fixes that and also seems to
reduce the total number of slice LUTs used.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

This adds a divider unit, connected to the core in much the same way
that the multiplier unit is connected. The division algorithm is
very simple-minded, taking 64 clock cycles for any division (even
32-bit division instructions).
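
A one-bit-per-cycle divider of this kind is commonly built as a
shift-and-subtract (restoring) loop. The sketch below shows that general
technique in Python; it is an assumption about the style of algorithm,
not a transcription of the unit's VHDL, and the function name and
parameters are made up for the example.

```python
def shift_subtract_divide(dividend, divisor, steps=64):
    """Restoring division: one quotient bit per iteration ("cycle").

    Assumes 0 <= dividend < 2**steps and divisor > 0.
    Returns (quotient, remainder).
    """
    remainder = 0
    quotient = 0
    for i in reversed(range(steps)):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:
            # The divisor fits: subtract it and record a 1 bit.
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

The loop always runs for the full 64 steps regardless of operand size,
which matches the fixed latency described above.
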
The decoding is simplified by making use of regularities in the
instruction encoding for div* and mod* instructions. Instead of
having PPC_* encodings from the first-stage decoder for each of the
different div* and mod* instructions, we now just have PPC_DIV and
PPC_MOD, and the inputs to the divider that indicate what sort of
division operation to do are derived from instruction word bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>