This replaces loadstore2 with a dcache.
The dcache unit is loosely based on the icache one (same basic cache
layout), but has some significant logic additions to deal with stores,
loads with update, non-cacheable accesses and other differences due to
operating in the execution part of the pipeline rather than the fetch
part.
The cache is store-through, though a hit with an existing line will
update the line rather than invalidate it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since the condition setting got moved to writeback, execute2 does
nothing aside from wasting a cycle. This removes it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the exts[bhw] instructions do the sign extension in the
writeback stage using the sign-extension logic there instead of
having unique sign extension logic in execute1. This requires
passing the data length and sign extend flag from decode2 down
through execute1 and execute2 and into writeback. As a side bonus
we reduce the number of values in insn_type_t by two.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to writeback to format data and test the result
against zero for the purpose of setting CR0. The data formatter
is able to shift and mask by bytes and do byte reversal and sign
extension. It can also put together bytes from two input
doublewords to support unaligned loads (including unaligned
byte-reversed loads).
The data formatter starts with an 8:1 multiplexer that is able
to direct any byte of the input to any byte of the output. This
lets us rotate the data and simultaneously byte-reverse it.
The rotated/reversed data goes to a register for the unaligned
cases that overlap two doublewords. Then there is per-byte logic
that does trimming, sign extension, and splicing together bytes
from a previous input doubleword (stored in data_latched) and the
current doubleword. Finally the 64-bit result is tested to set
CR0 if rc = 1.
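As a rough illustration of the formatter described above, here is a
behavioural sketch in Python, not the VHDL in this patch. It covers only
the single-doubleword case (no splicing with data_latched), and the names
format_data and cr0_flags, the parameter names, and the little-endian
byte numbering are all assumptions made for the example. The SO bit of
CR0 is left out.

```python
def format_data(dword, shift, length, byte_reverse=False, sign_extend=False):
    """Select `length` bytes starting at byte `shift` and widen to 64 bits."""
    src = dword.to_bytes(8, "little")  # src[0] is the least significant byte
    out = []
    for i in range(length):
        # Per-output-byte 8:1 mux: one index does both rotate and reverse.
        k = (length - 1 - i) if byte_reverse else i
        out.append(src[(shift + k) % 8])
    # Trimming and sign extension: upper bytes become 0x00 or 0xFF.
    fill = 0xFF if (sign_extend and out[-1] & 0x80) else 0x00
    out += [fill] * (8 - length)
    return int.from_bytes(bytes(out), "little")


def cr0_flags(result):
    """Return (LT, GT, EQ) for a 64-bit result treated as signed."""
    negative = bool(result >> 63)
    zero = result == 0
    return (negative, not negative and not zero, zero)
```

The point of the single index computation per output byte is that
rotation and byte reversal cost one mux level rather than two.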
This removes the RC logic from the execute2, multiply and divide
units, and the shift/mask/byte-reverse/sign-extend logic from
loadstore2.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common
invert on the input and output. This saves us about 200 LUTs.
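The consolidation can be pictured with a small Python model (a sketch
only, not the VHDL). The table and function names are made up for the
example. Inverting the RB operand for andc/orc follows the Power ISA
definitions of those instructions; that execute1 inverts exactly this
operand is an assumption.

```python
MASK64 = (1 << 64) - 1

# instruction -> (base op, invert input, invert output); hypothetical table
LOGICAL_OPS = {
    "and":  ("and", False, False),
    "andc": ("and", True,  False),
    "nand": ("and", False, True),
    "or":   ("or",  False, False),
    "orc":  ("or",  True,  False),
    "nor":  ("or",  False, True),
    "xor":  ("xor", False, False),
    "eqv":  ("xor", False, True),
}


def logical(insn, rs, rb):
    """Evaluate one logical instruction using three base ops plus inverts."""
    base, invert_in, invert_out = LOGICAL_OPS[insn]
    if invert_in:
        rb ^= MASK64          # common invert on the input
    if base == "and":
        result = rs & rb
    elif base == "or":
        result = rs | rb
    else:
        result = rs ^ rb
    if invert_out:
        result ^= MASK64      # common invert on the output
    return result & MASK64
```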
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The goal is to have the icache fit in BRAM by latching the output
into a register. In order to avoid timing issues, we need to give
the BRAM a full cycle on reads, and thus we source the BRAM address
directly from the NIA latched by fetch1.
(Note: This will be problematic if/when we want to hash the address.
We'll probably be better off having fetch1 latch a fully hashed address
along with the normal one, so the icache can use the former to address
the BRAM and pass the latter along.)
One difficulty is that we cannot really stall the icache without adding
more combo logic that would break the "one full cycle" BRAM model. This
means that on stalls from decode, by the time we stall fetch1, it has
already gone to the next address, which the icache is already latching.
We work around this by having a "stash" buffer in fetch2 that will stash
away the icache output on a stall, and override the output of the icache
with the content of the stash buffer when unstalling.
This requires a rewrite of the stop/step debug logic as well. We now
do most of the hard work in fetch1 which makes more sense.
Note: Vivado is still not inferring a built-in output register for the
BRAMs. I don't want to add another cycle... I don't fully understand why
it wouldn't be able to treat current_row as such but clearly it won't. At
least the timing seems good enough now for 100MHz, possibly more.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This changes the names of the mul_32bit and mul_signed fields of
decode_rom_t to is_32bit and is_signed, so they can be used with
other types of operations besides multiplies.
This plumbs the is_32bit and is_signed flags down into execute1,
though they are not used at this point.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This aims to simplify the logic between the instruction image and
the register file read address ports and reduce the size of the decode
tables. With this patch, the input_reg_a column of the decode tables
can only select RA or zeroes, the input_reg_b column can only select
RB or a constant (0, -1, or an immediate value from the instruction),
and the input_reg_c column can only select RS or zeroes.
That means that the rotate/shift/logical ops now have their first
input coming in via the input_reg_c column. That means we need to
add a read_data3 field to the Decode2ToExecuteType record, but that
will go away again when we split out the rotate/mask/logical ops to
their own unit.
As a related but not tightly connected change, this patch also sets
the read1_enable signal to the register file to be 0 when RA=0 and the
input_reg_a for the instruction is RA_OR_ZERO (previously it was 1).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
All of the PPC add and subtract instructions, including carrying
and extended versions, do much the same arithmetic operation:
result = (I xor A) + B + C
where A is the value from RA, I provides a logical inversion of A
(i.e. I is 0 or -1), B is either from RB or is a constant 0 or -1,
and C is 0, 1 or the carry bit from XER (CA).
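The equation can be checked with a small Python model of a 64-bit
datapath. This is a sketch only: op_add and its parameter names are
invented for the example, and the per-instruction settings in the
comments are the usual two's complement identities rather than a copy
of the decode table.

```python
MASK64 = (1 << 64) - 1


def op_add(ra, rb, invert_a, carry_in):
    """Compute (I xor A) + B + C on 64 bits; return (result, carry_out)."""
    a = (ra ^ MASK64) if invert_a else ra   # I = -1 inverts A, I = 0 passes it
    total = (a & MASK64) + (rb & MASK64) + carry_in
    return total & MASK64, total >> 64


# Example operand selections (B = MASK64 stands for the constant -1):
#   add:    I = 0,  B = RB, C = 0
#   subf:   I = -1, B = RB, C = 1     (RB - RA = ~RA + RB + 1)
#   addme:  I = 0,  B = -1, C = CA
#   subfme: I = -1, B = -1, C = CA
```

For instance, subf with RA=5 and RB=7 is op_add(5, 7, True, 1), which
gives a result of 2 with carry out set.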
To consolidate all the add/subtract instructions into a single
OP_ADD, we add a column to decode_rom_t to indicate when A should
be inverted, and change the input_carry field to a 3-state selector
to select C in the equation above.
This also adds a new "CONST_M1" value for input_reg_b_t to indicate
that B is a constant -1. This allows us to implement addme and
subfme.
The addex instruction appears not to exist, so the comments referring
to it are removed.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The register file is currently implemented as a whole pile of individual
1-bit registers instead of LUT memory which is a huge waste of FPGA
space.
This is caused by the output signal exposing the register file to the
outside world for simulation debug.
This removes that output, and moves the dumping of the register file
to the register file module itself. This saves about 8% of the FPGA on
the little Arty A7-35T.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The const* fields of decode_rom_t drove multiplexers in decode2 that
picked out various instruction fields and put them into the const*
fields of the Decode2ToExecute1Type record, from where they were
used in execute1. However, the code in execute1 can just as easily
use the appropriate fields of the original instruction word, since
that is now available in execute1. This therefore changes the
code to do that, resulting in smaller decode tables.
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of doing mfctr, mflr, mftb, mtctr, mtlr as separate ops,
just pass down mfspr and mtspr ops with the spr number and let
execute1 decode which SPR we're addressing. This will help reduce
the number of instruction bits decode1 needs to look at.
In fact we now pass down the whole instruction from decode2 to
execute1. We will need more bits of the instruction in future,
and the tools should just optimize away any that we don't end
up using. Since the 'aa' bit was just a copy of an instruction
bit, we can now remove it from the record.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the negation of negative operands for signed divide and
modulus operations out of the decode2 stage and into the divider.
If either of the operands for a signed divide or modulus operation
is negative, the divider now takes an extra cycle to negate the
operands that are negative.
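In behavioural terms the signed case reduces to an unsigned divide with
sign fix-ups, roughly as in the Python sketch below. The function name,
the want_remainder flag and the sign rule for the remainder (it takes
the sign of the dividend) are assumptions for the example. Divide by
zero and overflow are not modelled, and the extra cycle is not counted.

```python
MASK64 = (1 << 64) - 1


def divide(dividend, divisor, is_signed, want_remainder=False):
    """Divide two 64-bit register values; operands and result are raw bits."""
    neg_dividend = is_signed and (dividend >> 63) == 1
    neg_divisor = is_signed and (divisor >> 63) == 1
    # The extra step: negate whichever operands are negative.
    if neg_dividend:
        dividend = (-dividend) & MASK64
    if neg_divisor:
        divisor = (-divisor) & MASK64
    quotient, remainder = divmod(dividend, divisor)  # unsigned core
    if want_remainder:
        result = -remainder if neg_dividend else remainder
    else:
        result = -quotient if neg_dividend != neg_divisor else quotient
    return result & MASK64
```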
The interface to the divider now has an 'is_signed' signal rather
than a 'neg_result' signal, and the dividend and divisor can be
negative, so divider_tb had to be updated for the new interface.
The reason for doing this is that one of the worst timing violations
on the Arty A7-100 at 100MHz involved the carry chain in the adders
that did the negation of the dividend and divisor in the decode stage.
Moving the negations to a separate cycle fixes that and also seems to
reduce the total number of slice LUTs used.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a divider unit, connected to the core in much the same way
that the multiplier unit is connected. The division algorithm is
very simple-minded, taking 64 clock cycles for any division (even
32-bit division instructions).
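The message does not spell out the algorithm. One simple scheme that
produces one quotient bit per cycle, and so takes 64 steps regardless
of operand size, is restoring shift-and-subtract division. The Python
sketch below shows that scheme as an assumption about what
"simple-minded" means here, not as a transcription of the VHDL.

```python
def divide_unsigned_64(dividend, divisor):
    """Restoring division: 64 iterations, one quotient bit per iteration."""
    quotient = 0
    remainder = 0
    for bit in range(63, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Subtract the divisor if it fits, and record a 1 in the quotient.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```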
The decoding is simplified by making use of regularities in the
instruction encoding for div* and mod* instructions. Instead of
having PPC_* encodings from the first-stage decoder for each of the
different div* and mod* instructions, we now just have PPC_DIV and
PPC_MOD, and the inputs to the divider that indicate what sort of
division operation to do are derived from instruction word bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This module adds some simple core controls:
reset, stop, start, step
along with icache clear and reading the NIA and core
status bits.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We only need two write ports for load with update instructions.
Having two write ports just for this instruction is expensive.
For now we will force them to be the only instruction in the
pipeline, and take two cycles of writeback.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
It simulated fine, but didn't synthesize. Fix some obvious issues
to get us going again.
Fixes: 9fbaea6f08 ("Rework CR file and add forwarding")
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Right now we continually print all 3 possible GPRs an instruction
may be using. Add signals so we only print GPRs when they are
actually read. This should hopefully optimise away when synthesized.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Handle the CR as a single field with per-nibble enables. Forward any
writes in the same cycle.
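A minimal Python model of this behaviour, for illustration only: the
class and method names are invented, and mapping mask bit n to the
nibble at bits 4n+3..4n is an assumption about the numbering.

```python
class CRFile:
    """32-bit CR held as one field, written through an 8-bit nibble mask."""

    def __init__(self):
        self.cr = 0

    def cycle(self, write_enable, write_mask, write_data):
        """Model one clock: return the value a reader sees, then commit."""
        read = self.cr
        if write_enable:
            merged = self.cr
            for n in range(8):
                if (write_mask >> n) & 1:
                    nibble = 0xF << (4 * n)
                    merged = (merged & ~nibble) | (write_data & nibble)
            read = merged      # forward the write to a same-cycle read
            self.cr = merged   # the stored value updates for later cycles
        return read
```

Without the forwarding line, a reader in the same cycle as a write
would see the stale value and would need a stall instead.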
If this proves to be an issue for timing, we may want to revisit
this in the future. For now, it keeps things simple.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The decode2 stage was spaghetti code and needed cleaning up.
Create a series of functions to pull fields from a ppc instruction
and also a series of helpers to extract values for the execution
units.
As suggested by Paul, we should pass all signals to the execution
units and only set the valid signal conditionally, which should
use fewer resources.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>