We have all the machinery in place to implement the neg instruction
as OP_ADD. Doing that means we can ditch OP_NEG, and saves about
66 slice LUTs on the A7-100.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Anything that isn't a load or store and anything that doesn't read the
CR can go as soon as its inputs are ready.
While we could also allow SPR read/write and carry read/write, we plan
to change them to be read in decode2 and written in writeback soon and
they will need separate hazard detection to be added.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Check GPRs against any writers in the pipeline.
All instructions are still marked single in pipeline at
this stage.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
By using 4:1 multiplexers rather than 2:1, this cuts the number of
levels of multiplexing from 4 to 2 and also reduces the total number
of slice LUTs required. Because we are now handling 4 bits at each
level, including the bottom level, the logic to do the priority
encoding can be factored out into a function that is used at each
level.
This rearranges the logic so that the encoding and selection of bits
is done whether or not the input operand is zero, and the if statement
testing whether the input is zero only affects what is assigned to
result. With this we don't get the inferred latches and we can go
back to using signals rather than variables.
Also add some comments about what is being done.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The shared variable used for FIFO memory is not VHDL 2008 compliant.
I can't see why it needs to be a shared variable since reads and writes
update top and bottom synchronously, meaning they don't need same cycle
access to the FIFO memory.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds logic to detect the cases where the quotient of the
division overflows the range of the output representation, and
return all zeroes in those cases, which is what POWER9 does.
To do this, we extend the dividend register by 1 bit and we do
an extra step in the division process to get a 2^64 bit of the
quotient, which ends up in the 'overflow' signal. This catches all
the cases where dividend >= 2^64 * divisor, including the case
where divisor = 0, and the divde/divdeu cases where |RA| >= |RB|.
Then, in the output stage, we also check that the result fits in
the representable range, which depends on whether the division is
a signed division or not, and whether it is a 32-bit or 64-bit
division. If dividend >= 2^64 or the result doesn't fit in the
representable range, write_data is set to 0 and write_cr_data to
0x20000000 (i.e. cr0.eq = 1).
POWER9 sets the top 32 bits of the result to zero for 32-bit signed
divisions, and sets CR0 when RC=1 according to the 64-bit value
(i.e. CR0.LT is always 0 for 32-bit signed divisions, even if the
32-bit result is negative). However, modsw with a negative result
sets the top 32 bits to all 1s. We follow suit.
This updates divider_tb to check the invalid cases as well as the
valid case.
This also fixes a small bug where the reset signal for the divider
was driven from rst when it should have been driven from core_rst.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
outstanding can only ever be -1 to 2 at the moment (0 or 1 on a
rising clock edge). Vivado is synthesizing a much wider adder
which is silly. Constrain it with a range statement. This should
be good for timing and saves us about 85 LUTs.
This will get relaxed when we add more pipelining, but only by a
few bits.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The current code simulates correctly, but produces miscompares when synthesized
onto an FPGA. On closer inspection GHDL synthesis complains about inferred
latches and there does seem to be issues.
Convert it to variables that are always initialized to zero at the start of the
process.
Fixes: 24a4a796ce ("execute: Consolidate count-leading/trailing-zeroes implementations")
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds combinatorial logic that does 32-bit and 64-bit count
leading and trailing zeroes in one unit, and consolidates the
four instructions under a single OP_CNTZ opcode.
This saves 84 slice LUTs on the Arty A7-100.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common
invert on the input and output. This saves us about 200 LUTs.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The current scheme has UART0 repeating throughout the UART address range.
This patch tightens the address checking so that it only occurs once.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
This adds support for set associativity to the icache. It can still
be direct mapped by setting NUM_WAYS to 1.
The replacement policy uses a simple tree-PLRU for each set.
This is only lightly tested, tests pass but I have to double check
that we are using the ways effectively and not creating duplicates.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We only ever access the cache memory for at most the wishbone bus
width at a time. So having the BRAMs organized as a cache-line-wide
port is a waste of resources.
Instead, use a wishbone-wide memory and store a line as consecutive
rows in the BRAM.
This significantly improves BRAM usage in the FPGA as we can now use
more rows in the BRAM blocks. It also saves a few LUTs and muxes.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The goal is to have the icache fit in BRAM by latching the output
into a register. In order to avoid timing issues , we need to give
the BRAM a full cycle on reads, and thus we souce the BRAM address
directly from fetch1 latched NIA.
(Note: This will be problematic if/when we want to hash the address,
we'll probably be better off having fetch1 latch a fully hashed address
along with the normal one, so the icache can use the former to address
the BRAM and pass the latter along)
One difficulty is that we cannot really stall the icache without adding
more combo logic that would break the "one full cycle" BRAM model. This
means that on stalls from decode, by the time we stall fetch1, it has
already gone to the next address, which the icache is already latching.
We work around this by having a "stash" buffer in fetch2 that will stash
away the icache output on a stall, and override the output of the icache
with the content of the stash buffer when unstalling.
This requires a rewrite of the stop/step debug logic as well. We now
do most of the hard work in fetch1 which makes more sense.
Note: Vivado is still not inferring an built-in output register for the
BRAMs. I don't want to add another cycle... I don't fully understand why
it wouldn't be able to treat current_row as such but clearly it won't. At
least the timing seems good enough now for 100Mhz, possibly more.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>