This removes the constraint that loads and stores are single-issue,
at the expense of a stall of at least 2 cycles for every load and
store.
To do this, we plumb the existing stall signal that was generated
in dcache to core, where it gets ORed with the stall signal from
execute1. Execute1 generates a stall signal for the first two
cycles of each load and store, and dcache generates the stall
signal in the 3rd and subsequent cycles if it needs to.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This allows us to use the bypass at the input of execute1 for the
address and data operands for loadstore1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This enables back-to-back execution of integer instructions where
the first instruction writes a GPR and the second reads the same
GPR. This is done with a set of multiplexers at the start of
execute1 which enable any of the three input operands to be taken
from the output of execute1 (i.e. r.e.write_data) rather than the
input from decode2 (i.e. e_in.read_data[123]).
This also requires changes to the hazard detection and handling.
Decode2 generates a signal indicating that the GPR being written
is available for bypass, which is true for instructions that are
executed in execute1 (rather than loadstore1/dcache). The
gpr_hazard module stores this "bypassable" bit, and if the same
GPR needs to be read by a subsequent instruction, it outputs a
"use_bypass" signal rather than generating a stall. The
use_bypass signal is then latched at the output of decode2 and
passed down to execute1 to control the input multiplexer.
At the moment there is no bypass on the inputs to loadstore1, but that
is OK because all load and store instructions are marked as
single-issue.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the divider is a unit that execute1 sends operands to and
which sends its results back to execute1, which then send them to
writeback. Execute1 now sends a stall signal when it gets a divide
or modulus instruction until it gets a valid signal back from the
divider. Divide and modulus instructions are no longer marked as
single-issue.
The data formatting step that used to be done in decode2 for div
and mod instructions is now done in execute1. We also do the
absolute value operation in that same cycle instead of taking an
extra cycle inside the divider for signed operations with a
negative operand.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the multiplier isn't a separate pipe that decode2 issues
instructions to, but rather is a unit that execute1 sends operands
to and which sends the result back to execute1, which then sends it
to writeback. Execute1 now sends a stall signal when it gets a
multiply instruction until it gets a valid signal back from the
multiplier.
This all means that we no longer need to mark the multiply
instructions as single-issue.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Right now our test cases fold the SPRs into the GPRs. That makes
debugging fails more difficult than it needs to be, so print
out the CTR, LR and CR.
We still need to print the XER, but that is in two spots in microwatt
and will take some more work.
This also adds many instructions to the tests that we have added
lately including overflow instructions, CR logicals and mt/mfxer.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This stores the most common SPRs in the register file.
This includes CTR and LR and a not yet final list of others.
The register file is set to 64 entries for now. Specific types
are defined that can represent a GPR index (gpr_index_t) or
a GPR/SPR index (gspr_index_t) along with conversion functions
between the two.
On order to deal with some forms of branch updating both LR and
CTR, we introduced a delayed update of LR after a branch link.
Note: We currently stall the pipeline on such a delayed branch,
but we could avoid stalling fetch in that specific case as we
know we have a branch delay. We could also limit that to the
specific case where we need to update both CTR and LR.
This allows us to make bcreg, mtspr and mfspr pipelined. decode1
will automatically force the single issue flag on mfspr/mtspr to
a "slow" SPR.
[paulus@ozlabs.org - fix direction of decode2.stall_in]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Vivado by default tries to flatten the module hierarchy to improve
placement and timing. However this makes debugging timing issues
really hard as the net names in the timing report can be pretty
bogus.
This adds a generic that can be used to control attributes to stop
vivado from flattening the main core components. The resulting design
will have worst timing overall but it will be easier to understand
what the worst timing path are and address them.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We don't yet have a proper snooper for the icache, so for now make
icbi just flush the whole thing
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Adding lines seems to add only little extra as the BRAMs aren't
full, 2 ways is our current comprimise to limit pressure on small
FPGAs. We could go to 64 lines for a little more, but timing is
becoming a bit too right to my linking on the tags/LRU path of
the icache, so let's leave it at 32 for now.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This replaces loadstore2 with a dcache
The dcache unit is losely based on the icache one (same basic cache
layout), but has some significant logic additions to deal with stores,
loads with update, non-cachable accesses and other differences due to
operating in the execution part of the pipeline rather than the fetch
part.
The cache is store-through, though a hit with an existing line will
update the line rather than invalidate it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since the condition setting got moved to writeback, execute2 does
nothing aside from wasting a cycle. This removes it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds logic to detect the cases where the quotient of the
division overflows the range of the output representation, and
return all zeroes in those cases, which is what POWER9 does.
To do this, we extend the dividend register by 1 bit and we do
an extra step in the division process to get a 2^64 bit of the
quotient, which ends up in the 'overflow' signal. This catches all
the cases where dividend >= 2^64 * divisor, including the case
where divisor = 0, and the divde/divdeu cases where |RA| >= |RB|.
Then, in the output stage, we also check that the result fits in
the representable range, which depends on whether the division is
a signed division or not, and whether it is a 32-bit or 64-bit
division. If dividend >= 2^64 or the result doesn't fit in the
representable range, write_data is set to 0 and write_cr_data to
0x20000000 (i.e. cr0.eq = 1).
POWER9 sets the top 32 bits of the result to zero for 32-bit signed
divisions, and sets CR0 when RC=1 according to the 64-bit value
(i.e. CR0.LT is always 0 for 32-bit signed divisions, even if the
32-bit result is negative). However, modsw with a negative result
sets the top 32 bits to all 1s. We follow suit.
This updates divider_tb to check the invalid cases as well as the
valid case.
This also fixes a small bug where the reset signal for the divider
was driven from rst when it should have been driven from core_rst.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds support for set associativity to the icache. It can still
be direct mapped by setting NUM_WAYS to 1.
The replacement policy uses a simple tree-PLRU for each set.
This is only lightly tested, tests pass but I have to double check
that we are using the ways effectively and not creating duplicates.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We only ever access the cache memory for at most the wishbone bus
width at a time. So having the BRAMs organized as a cache-line-wide
port is a waste of resources.
Instead, use a wishbone-wide memory and store a line as consecutive
rows in the BRAM.
This significantly improves BRAM usage in the FPGA as we can now use
more rows in the BRAM blocks. It also saves a few LUTs and muxes.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The goal is to have the icache fit in BRAM by latching the output
into a register. In order to avoid timing issues , we need to give
the BRAM a full cycle on reads, and thus we souce the BRAM address
directly from fetch1 latched NIA.
(Note: This will be problematic if/when we want to hash the address,
we'll probably be better off having fetch1 latch a fully hashed address
along with the normal one, so the icache can use the former to address
the BRAM and pass the latter along)
One difficulty is that we cannot really stall the icache without adding
more combo logic that would break the "one full cycle" BRAM model. This
means that on stalls from decode, by the time we stall fetch1, it has
already gone to the next address, which the icache is already latching.
We work around this by having a "stash" buffer in fetch2 that will stash
away the icache output on a stall, and override the output of the icache
with the content of the stash buffer when unstalling.
This requires a rewrite of the stop/step debug logic as well. We now
do most of the hard work in fetch1 which makes more sense.
Note: Vivado is still not inferring an built-in output register for the
BRAMs. I don't want to add another cycle... I don't fully understand why
it wouldn't be able to treat current_row as such but clearly it won't. At
least the timing seems good enough now for 100Mhz, possibly more.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The register file is currently implemented as a whole pile of individual
1-bit registers instead of LUT memory which is a huge waste of FPGA
space.
This is caused by the output signal exposing the register file to the
outside world for simulation debug.
This removes that output, and moves the dumping of the register file
to the register file module itself. This saves about 8% of fpga on
the little Arty A7-35T.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This gets the CI going again, but we will want to fix the test
harness since it's useful to be able to debug the core after it
executes an illegal instruction.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
I'm seeing an issue on my version of ghdl:
core.vhdl:137:24:error: actual expression must be globally static
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds a divider unit, connected to the core in much the same way
that the multiplier unit is connected. The division algorithm is
very simple-minded, taking 64 clock cycles for any division (even
32-bit division instructions).
The decoding is simplified by making use of regularities in the
instruction encoding for div* and mod* instructions. Instead of
having PPC_* encodings from the first-stage decoder for each of the
different div* and mod* instructions, we now just have PPC_DIV and
PPC_MOD, and the inputs to the divider that indicate what sort of
division operation to do are derived from instruction word bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This module adds some simple core controls:
reset, stop, start, step
along with icache clear and reading the NIA and core
status bits
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org