This computes the address sent to the MMU separately from that sent
to the dcache. This means that the address sent to the MMU doesn't
have the delay through the lsu_sum adder, making it available earlier.
The path through the lsu_sum adder and through the MMU to the MMU
done and err outputs showed up as a critical path on some builds.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the calculation of busy as simple as possible and dependent
only on register outputs. The timing of busy is critical, as it gates
the valid signal for the next instruction, and therefore any delays
in dropping busy at the end of a load or store directly impact the
timing of a host of other paths.
This also separates the 'done without error' and 'done with error'
cases from the MMU into separate signals that are both driven directly
from registers.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the incrementing or decrementing of r1.acks_pending
to the cycle after a strobe is output or an ack is seen on the
wishbone, and simplifies the logic that determines whether the
cycle is now complete. This means that the path from seeing
req_op equal to OP_STORE_HIT or OP_STORE_MISS to setting r1.state
and r1.cyc now just involves the stbs_done bit rather than a more
complex calculation involving the possibly incremented r1.acks_pending.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes d_out.valid and m_out.done come directly from registers in
order to improve timing. The inputs to the registers are set by the
same conditions that cause r1.hit_load_valid, r1.slow_valid,
r1.error_done and r1.stcx_fail to be set.
Note that the STORE_WAIT_ACK state doesn't test r1.mmu_req but assumes
that the request came from loadstore1. This is because we normally
have r1.full = 0 in this state, which means that r1.mmu_req can
change at any time.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds "if LOG_LENGTH > 0 generate" to the places in the core
where log output data is latched, so that when LOG_LENGTH = 0 we
don't create the logic to collect the data which won't be stored.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This uses an algorithm for count leading/trailing zeroes that is
faster on FPGAs, which makes timing easier. cntlz* and cnttz*
still take two cycles, though.
For count trailing zeroes, we compute x & -x, which for non-zero x
has a single 1 bit in the position of the least-significant 1 bit
in x. This one-hot representation can then be converted to a bit
number with six 32-input OR gates. For count leading zeroes, we
simply do a bit-reversal on x and then use the same algorithm.
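
As an illustration only, the algorithm described above can be modelled in
software. This is a minimal Python sketch, not the VHDL in this commit: the
names WIDTH, onehot_to_index, cnttz, cntlz and bit_reverse are hypothetical,
and returning 64 for a zero input is an assumption (the text above only
covers non-zero x).

```python
WIDTH = 64                      # assumed operand width (6-bit result index)
MASK = (1 << WIDTH) - 1


def onehot_to_index(onehot):
    """Convert a one-hot 64-bit value to the number of its set bit.

    Result bit k is the OR of the 32 input bits whose position has bit k
    set, which models six 32-input OR gates.
    """
    index = 0
    for k in range(6):
        gate = any((onehot >> pos) & 1
                   for pos in range(WIDTH) if (pos >> k) & 1)
        index |= int(gate) << k
    return index


def cnttz(x):
    """Count trailing zeroes using the x & -x one-hot trick."""
    x &= MASK
    if x == 0:
        return WIDTH            # assumption: zero input yields the width
    lowest_one = x & -x & MASK  # isolates the least-significant 1 bit
    return onehot_to_index(lowest_one)


def bit_reverse(x):
    """Reverse the order of the 64 bits of x."""
    return int(format(x & MASK, "064b")[::-1], 2)


def cntlz(x):
    """Count leading zeroes by bit-reversing and counting trailing zeroes."""
    return cnttz(bit_reverse(x))
```

In this model each of the six result bits depends on exactly 32 of the 64
one-hot inputs, which is why the conversion needs no priority encoder.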
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the l_out.done signal come from a clean latch, which
improves timing. The cost is that TLB load and invalidation
operations to the dcache now signal done back to loadstore1 one
cycle later than before, but that doesn't seem to affect overall
performance noticeably.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This improves timing by setting r1.wb.{adr,dat,sel} to the next
request when doing a write cycle on the wishbone before we know
whether the next request has a TLB and cache hit or not, i.e.
without depending on req_op. r1.wb.stb still depends on req_op.
This contains a workaround for what is probably a bug elsewhere,
in that changing r1.wb.sel unconditionally once we see stall=0
from the wishbone causes incorrect behaviour. Making it
conditional on there being a valid following request appears
to fix the problem.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This puts the inputs to the TLB PLRU through a register stage, so
the TLB PLRU update is done in the cycle after the TLB tag
matching rather than the same cycle. This improves timing.
The PLRU output is only used when writing the TLB in response to
a tlbwe request from the MMU, and that doesn't happen within one
cycle of a virtual-mode load or store, so the fact that the
TLB victim way information is delayed by one cycle doesn't
create any problems.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The computation of two_dwords from r.second_bytes has shown up as
part of a critical path at times. Instead we add a 'last_dword'
flag to the reg_stage_t record which tells us more directly
whether a valid flag coming in from dcache means that the
instruction is done, thereby shortening the path to the busy output
back to execute1.
This also simplifies some of the trim_ctl logic. The two_dwords = 0
case could never have use_second(i) = 1 for any of the bytes being
transferred, so "not use_second(i)" is always 1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This eliminates a dependency of r.f.redirect_nia on the carry out
from the main adder in the case of a conditional trap instruction.
We can set r.f.redirect_nia unconditionally, even if no interrupt
is generated.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This does the PLRU update based on r1.cache_hit and r1.hit_way rather
than req_op and req_hit_way, which means there is now a register
between the TLB and cache tag lookup and the PLRU update, which should
help with timing.
The PLRU victim selection now becomes valid one cycle later, in the
cycle where r1.write_tag = 1. We now have replace_way coming from
the PLRU when r1.write_tag = 1 and from r1.store_way at other times,
and we use that instead of r1.store_way in situations where we need
it to be valid in the first cycle of the RELOAD_WAIT_ACK state.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This does the PLRU update based on r.hit_valid and r.hit_way rather
than req_is_hit and req_hit_way, which means there is now a register
between the TLB and cache tag lookup and the PLRU update, which
should help with timing.
As a result, the PLRU victim way selection becomes valid one cycle
later, in the cycle when r.state = CLR_TAG. So we have to use the
PLRU output directly in the CLR_TAG state and r.store_way in the
WAIT_ACK state.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This commit enhances hello_world.bin output by printing
an ASCII lightbulb, which turns out to be Microwatt's logo,
instead of simply a "Hello World" text message.
Signed-off-by: Gustavo Romero <gustavo.romero@protonmail.com>
The PVR is a privileged read-only SPR. Test reading and writing in both
supervisor and problem state. In supervisor state reading returns
Microwatt's assigned PVR number and writing is a no-op. In problem state
both reading and writing cause privileged instruction interrupts.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
This regenerates litedram for all targets (genesys2 is new in this
build) using the latest LiteX.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Some changes in LiteX broke us. Adapt the build system and
increase the init RAM size to 24KB.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
litedram ignores a couple of signals of its "pseudo-axi" port;
this adds a bit of documentation around it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Make the DRAM data lines and user port width configurable. Also
don't hard-wire a dependency on the wishbone data width.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This implements in the L2 cache the feature already in the L1s
allowing a request to be completed before the end of a refill
using partial line valid bits, and starting a refill from the
row of the first miss on that line instead of the beginning of
the line.
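
As a rough illustration of the idea, the following Python sketch models a
refill that starts at the missed row and marks rows valid as they arrive.
It is not taken from the L2 cache code: the function names and the
wrap-around order after the last row are assumptions.

```python
def refill_order(miss_row, rows_per_line):
    """Rows in the order they are fetched, starting at the missed row.

    Wrapping back to row 0 after the last row is an assumption.
    """
    return [(miss_row + i) % rows_per_line for i in range(rows_per_line)]


def rows_fetched_before_hit(start_row, wanted_row, rows_per_line):
    """Count how many rows arrive before wanted_row can be served.

    One valid bit per row lets a request complete as soon as its row is
    present, without waiting for the whole line.
    """
    valid = [False] * rows_per_line
    for count, row in enumerate(refill_order(start_row, rows_per_line), 1):
        valid[row] = True           # partial line valid bit for this row
        if valid[wanted_row]:
            return count
    raise ValueError("wanted_row is not part of the line")
```

With 8 rows per line and a miss on row 5, starting the refill at row 5
serves the request after one row, whereas starting at row 0 would need six.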
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This fixes up a few issues with parameters:
Only arty has "has_uart1" since we haven't added plumbing for a second UART
anywhere else. Also "uart_is_16550" was missing on one of the nexys_video
targets, and nexys_video toplevel was missing LOG_LENGTH.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When using an FPGA which routes the SPI clock via STARTUPE2 as is
done on the Nexys Video (or optionally on Arty), the HW needs at
least 3 beats of that clock to complete the switch from the internal
config clock to the one we provide.
This works around it by having the SPI controller send 8 dummy
clocks at boot time with CS held high.
Without this, flash identification will fail on those boards.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This means we can synthesize at 40MHz (where we currently make timing) and
our UART still works at 115200 baud.
Tested working hello world unmodified with ECP5 eval board. Orange
Crab is updated but is untested.
Signed-off-by: Michael Neuling <mikey@neuling.org>
This allows these targets:
FPGA_TARGET=ORANGE-CRAB make microwatt.bit
FPGA_TARGET=ECP5-EVN make microwatt.bit
The default is ORANGE-CRAB, as before.
ECP5-EVN is tested on real hardware. The console only works at 38400 so
needs this in console.c and a recompile of hello_world to work:
-#define UART_FREQ 115200
+#define UART_FREQ 38400
With this 'FPGA_TARGET=ECP5-EVN make prog' works on the ECP5 dev board.
Signed-off-by: Michael Neuling <mikey@neuling.org>
This is useful for specifying "-noflatten", which helps CI stay under
the 8GB limit.
Normally the AUTONAME stage of yosys will take around 10GB if
operating on the whole design. With -noflatten, AUTONAME occurs only
per VHDL entity, so only consumes around 3GB of memory. This gets us
under the limitations on GitHub Actions.
More discussion here:
https://github.com/antonblanchard/microwatt/pull/209#issuecomment-652186078
Signed-off-by: Michael Neuling <mikey@neuling.org>
These are needed for synthesis that doesn't use fusesoc natively.
These were pulled in via 'fusesoc fetch ::uart16550:1.5.5-r1'
Signed-off-by: Michael Neuling <mikey@neuling.org>
nextpnr will leave an output file around even when it errors out, so
build to a tmp file and move it when we succeed so we don't confuse
make.
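
The same build-to-temp-then-move pattern, sketched in Python rather than in
the Makefile this commit changes. The helper name build_atomically and the
make_cmd callback are hypothetical, and no real nextpnr options are shown;
the caller supplies the full command line.

```python
import os
import subprocess
import tempfile


def build_atomically(make_cmd, final_path):
    """Run a tool that writes to a temporary path; rename only on success.

    make_cmd is a caller-supplied function that takes the temporary output
    path and returns the argv list to run (hypothetical interface).
    """
    out_dir = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    os.close(fd)
    try:
        result = subprocess.run(make_cmd(tmp_path))
        if result.returncode != 0:
            return False                    # final_path is left untouched
        os.replace(tmp_path, final_path)    # move into place on success
        return True
    finally:
        # Remove any partial output the tool left behind after a failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Because the final file only appears after a successful run, a later make
invocation never sees a stale or partial output as up to date.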
Signed-off-by: Michael Neuling <mikey@neuling.org>
This adds a path to allow the CR result of one instruction to be
forwarded to the next instruction, so that sequences such as
cmp; bc can avoid having a 1-cycle bubble.
Forwarding is not available for dot-form (Rc=1) instructions,
since the CR result for them is calculated in writeback. The
decode.output_cr field is used to identify those instructions
that compute the CR result in execute1.
For some reason, the multiply instructions incorrectly had
output_cr = 1 in the decode tables. This fixes that.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This latches the redirect signal inside execute1, so that it is sent
a cycle later to fetch1 (and to decode/icache as flush). This breaks
a long combinatorial chain from the branch and interrupt detection
in execute1 through the redirect/flush signals all the way back to
fetch1, icache and decode.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>