Commit Graph

68 Commits (3510071d9a8dde12056f90dacb15c34eb6601971)

Author SHA1 Message Date
Benjamin Herrenschmidt b513f0fb48 dcache: Add a dcache
This replaces loadstore2 with a dcache

The dcache unit is losely based on the icache one (same basic cache
layout), but has some significant logic additions to deal with stores,
loads with update, non-cachable accesses and other differences due to
operating in the execution part of the pipeline rather than the fetch
part.

The cache is store-through, though a hit with an existing line will
update the line rather than invalidate it.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Paul Mackerras f49a5a99a5 Remove execute2 stage
Since the condition setting got moved to writeback, execute2 does
nothing aside from wasting a cycle.  This removes it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras d4f51e08c8 divider: Return 0 for invalid and overflow cases, like P9 does
This adds logic to detect the cases where the quotient of the
division overflows the range of the output representation, and
return all zeroes in those cases, which is what POWER9 does.
To do this, we extend the dividend register by 1 bit and we do
an extra step in the division process to get a 2^64 bit of the
quotient, which ends up in the 'overflow' signal.  This catches all
the cases where dividend >= 2^64 * divisor, including the case
where divisor = 0, and the divde/divdeu cases where |RA| >= |RB|.

Then, in the output stage, we also check that the result fits in
the representable range, which depends on whether the division is
a signed division or not, and whether it is a 32-bit or 64-bit
division.  If dividend >= 2^64 or the result doesn't fit in the
representable range, write_data is set to 0 and write_cr_data to
0x20000000 (i.e. cr0.eq = 1).

POWER9 sets the top 32 bits of the result to zero for 32-bit signed
divisions, and sets CR0 when RC=1 according to the 64-bit value
(i.e. CR0.LT is always 0 for 32-bit signed divisions, even if the
32-bit result is negative).  However, modsw with a negative result
sets the top 32 bits to all 1s.  We follow suit.

This updates divider_tb to check the invalid cases as well as the
valid case.

This also fixes a small bug where the reset signal for the divider
was driven from rst when it should have been driven from core_rst.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt b56b46b7d1 icache: Set associative icache
This adds support for set associativity to the icache. It can still
be direct mapped by setting NUM_WAYS to 1.

The replacement policy uses a simple tree-PLRU for each set.

This is only lightly tested, tests pass but I have to double check
that we are using the ways effectively and not creating duplicates.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt d40c1c1a25 icache: Use narrower block RAMs
We only ever access the cache memory for at most the wishbone bus
width at a time. So having the BRAMs organized as a cache-line-wide
port is a waste of resources.

Instead, use a wishbone-wide memory and store a line as consecutive
rows in the BRAM.

This significantly improves BRAM usage in the FPGA as we can now use
more rows in the BRAM blocks. It also saves a few LUTs and muxes.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt d415e5544a fetch/icache: Fit icache in BRAM
The goal is to have the icache fit in BRAM by latching the output
into a register. In order to avoid timing issues , we need to give
the BRAM a full cycle on reads, and thus we souce the BRAM address
directly from fetch1 latched NIA.

(Note: This will be problematic if/when we want to hash the address,
we'll probably be better off having fetch1 latch a fully hashed address
along with the normal one, so the icache can use the former to address
the BRAM and pass the latter along)

One difficulty is that we cannot really stall the icache without adding
more combo logic that would break the "one full cycle" BRAM model. This
means that on stalls from decode, by the time we stall fetch1, it has
already gone to the next address, which the icache is already latching.

We work around this by having a "stash" buffer in fetch2 that will stash
away the icache output on a stall, and override the output of the icache
with the content of the stash buffer when unstalling.

This requires a rewrite of the stop/step debug logic as well. We now
do most of the hard work in fetch1 which makes more sense.

Note: Vivado is still not inferring an built-in output register for the
BRAMs. I don't want to add another cycle... I don't fully understand why
it wouldn't be able to treat current_row as such but clearly it won't. At
least the timing seems good enough now for 100Mhz, possibly more.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt ec1868f7d2 register_file: Move GPRs into distributed RAM
The register file is currently implemented as a whole pile of individual
1-bit registers instead of LUT memory which is a huge waste of FPGA
space.

This is caused by the output signal exposing the register file to the
outside world for simulation debug.

This removes that output, and moves the dumping of the register file
to the register file module itself. This saves about 8% of fpga on
the little Arty A7-35T.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Anton Blanchard b57325ce29 Merge branch 'divider' of https://github.com/paulusmack/microwatt 5 years ago
Anton Blanchard 6cae10eebd Terminate test on illegal instruction
This gets the CI going again, but we will want to fix the test
harness since it's useful to be able to debug the core after it
executes an illegal instruction.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard 8c5dcc8c4c Fix ghdl error
I'm seeing an issue on my version of ghdl:

  core.vhdl:137:24:error: actual expression must be globally static

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Paul Mackerras d5bc6c8824 Add a divider unit and a testbench for it
This adds a divider unit, connected to the core in much the same way
that the multiplier unit is connected.  The division algorithm is
very simple-minded, taking 64 clock cycles for any division (even
32-bit division instructions).

The decoding is simplified by making use of regularities in the
instruction encoding for div* and mod* instructions.  Instead of
having PPC_* encodings from the first-stage decoder for each of the
different div* and mod* instructions, we now just have PPC_DIV and
PPC_MOD, and the inputs to the divider that indicate what sort of
division operation to do are derived from instruction word bits.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt 98f0994698 Add core debug module
This module adds some simple core controls:

  reset, stop, start, step

along with icache clear and reading the NIA and core
status bits

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org
5 years ago
Anton Blanchard 89849a6856 Add a simple direct mapped icache
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard 4528ef2b43 Reformat core.vhdl 5 years ago
Anton Blanchard 92a7152370 Rework pipeline, add stall and flush signals
This adds stall and flush signals to the pipeline.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Benjamin Herrenschmidt c97b080d8c Move wishbone arbiter out of the core
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Anton Blanchard 147b259691 Use a better input signal in writeback
w_in comes from the execution unit, it makes more sense to call
it e_in.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard 5a29cb4699 Initial import of microwatt
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago