This adds a true random number generator for the Xilinx FPGAs which
uses a set of chaotic ring oscillators to generate random bits and
then passes them through a Linear Hybrid Cellular Automaton (LHCA) to
remove bias, as described in "High Speed True Random Number Generators
in Xilinx FPGAs" by Catalin Baetoniu of Xilinx Inc., in:
https://pdfs.semanticscholar.org/83ac/9e9c1bb3dad5180654984604c8d5d8137412.pdf
This requires adding a .xdc file to tell vivado that the combinatorial
loops that form the ring oscillators are intentional. The same
code should work on other FPGAs as well if their tools can be told to
accept the combinatorial loops.
For simulation, the random.vhdl module gets compiled in, which uses
the pseudorand() function to generate random numbers.
Synthesis using yosys uses nonrandom.vhdl, which always signals an
error, causing darn to return 0xffff_ffff_ffff_ffff.
This adds an implementation of the darn instruction. Darn can return
either raw or conditioned random numbers. On Xilinx FPGAs, reading a
raw random number gives the output of the ring oscillators, and
reading a conditioned random number gives the output of the LHCA.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Means we can synthesize at 40Mhz (where we currently make timing) and
our UART still works at 115200 baud.
Tested working hello world unmodified with ECP5 eval board. Orange
Crab is updated but is untested.
Signed-off-by: Michael Neuling <mikey@neuling.org>
This allows these targets
FPGA_TARGET=ORANGE-CRAB make microwatt.bit
FPGA_TARGET=ECP5-EVN make microwatt.bit
Default is ORANGE-CRAB as before
ECP5-EVN is tested on real hardware. The console only works at 38400 so
needs this in console.c and a recompile of hello_world to work:
-#define UART_FREQ 115200
+#define UART_FREQ 38400
With this 'FPGA_TARGET=ECP5-EVN make prog' works on the ECP5 dev board.
Signed-off-by: Michael Neuling <mikey@neuling.org>
This is useful to specify "-noflatten" which helps CI stay under 8GB
limit.
Normally the AUTONAME stage of yosys will take around 10GB if
operating on the whole design. With -noflatten, AUTONAME occurs only
per VHDL entity, so only consumes around 3GB of memory. This gets us
under the limitations on github actions.
More discussion here:
https://github.com/antonblanchard/microwatt/pull/209#issuecomment-652186078
Signed-off-by: Michael Neuling <mikey@neuling.org>
nextpnr will leave an output file around even when it errors out, so
build to a tmp file and move it when we succeed so we don't confuse
make.
Signed-off-by: Michael Neuling <mikey@neuling.org>
The fetch2 stage existed primarily to provide a stash buffer for the
output of icache when a stall occurred. However, we can get the same
effect -- of having the input to decode1 stay unchanged on a stall
cycle -- by using the read enable of the BRAMs in icache, and by
adding logic to keep the outputs unchanged on a clock cycle when
stall_in = 1. This reduces branch and interrupt latency by one
cycle.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This require the s25fl128s.vhd flash model and FMF libraries,
which will be built when passed to the Makefile via the
FLASH_MODEL_PATH argument. Otherwise a dummy module is used
which ties MISO to '1'.
The model isn't included as I'm not sure its licence (GPL) is
at this point, but it can be obtained from
https://github.com/ozbenh/microspi
FLASH_MODEL_PATH=<path to microspi>/model
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The rewrite of the Makefile to use "ghdl -c" somewhat broke building
the unisim library as ghdl doesn't yet support putting files in
separate libraries from a single command line invocation.
The workaround at the time was to put the entire project in "unisim"
which is ... weird and will break if we try to add another library
such as fmf.
This fixes it by generating the library separately using "ghdl -i"
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We don't run these but we should.
The SOC tests have bit rotted. We need to fix them but leave them out
for now.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Support for this has bitrotted and would require refactoring of L2 to
be brought back. It's also not really needed anymore now that we ship
pre-generated litedram and that LiteX supports what we do.
So take it out, which simplifies some of the scripts as well. This also
fixes up CSR alignment the sim model.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The test bench test simple access forms for now, it's a starting point
but it already helped find/fix a bug.
Includes a litedram update to be able to operate the sim model without
inits.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds a cache between the wishbone and litedram with the following
features (at this point, it's still evolving)
- 128 bytes line width in order to have a reasonable amount of
litedram pipelining on the 128-bit wide data port.
- Configurable geometry otherwise
- Stores are acked immediately on wishbone whether hit or miss
(minus a 2 cycles delay if there's a previous load response in the
way) and sent to LiteDRAM via 8 entries (configurable) store queue
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds a simulated litedram model along with the necessary
Makefile gunk to verilate it and wrap it for use by ghdl.
The core_dram_tb test bench is a variant of core_tb with
LiteDRAM simulated. It's not built by default, an explicit
make core_dram_tb
is necessary as to not require verilator to be installed for
the normal build process (also it's slow'ish).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We still need to a way to our FPGA target on the command line, but this
at least gets us down to a common Makefile.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Instead of building each file one by one (and having to track all
the dependencies manually), use the ghdl -c command that does
analysis and elaboration in one go.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
These provides some info about the SoC (though it's still somewhat
incomplete and needs more work, see comments).
There's also a control register for selecting DRAM vs. BRAM at 0
(and for soft-resetting the SoC but that isn't wired up yet).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds a new module to implement an MMU. At the moment it doesn't
do very much. Tlbie instructions now get sent by loadstore1 to mmu,
which sends them to dcache, rather than loadstore1 sending them
directly to dcache. TLB misses from dcache now get sent by loadstore1
to mmu, which currently just returns an error. Loadstore1 then
generates a DSI in response to the error return from mmu.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This comes in two parts:
- A generator script which uses LiteX to generate litedram cores
along with their init files for various boards (currently Arty and
Nexys-video). This comes with configs for arty and nexys_video.
- A fusesoc "generator" which uses pre-generated litedram cores
The generation process is manual on purpose. This include pre-generated
cores for the two above boards.
This is done so that one doesn't have to install LiteX to build
microwatt. In addition, the generator script or wrapper vhdl tend to
break when LiteX changes significantly which happens.
This is still rather standalone and hasn't been plumbed into the SoC
or the FPGA toplevel files yet.
At this point LiteDRAM self-initializes using a built-in VexRiscv
"Minimum" core obtained from LiteX and included in this commit. There
is some plumbing to generate and cores that are initialized by Microwatt
directly but this isn't working yet and so isn't enabled yet.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In preparation for adding a TLB to the dcache, this plumbs the
insn_type from execute1 through to loadstore1, so that we can have
other operations besides loads and stores (e.g. tlbie) going to
loadstore1 and thence to the dcache. This also plumbs the unit field
of the decode ROM from decode2 through to execute1 to simplify the
logic around which ops need to go to loadstore1.
The load and store data formatting are now not conditional on the
op being OP_LOAD or OP_STORE. This eliminates the inferred latches
clocked by each of the bits of r.op that we were getting previously.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
New unified ICP and ICS XICS compliant interrupt controller.
Configurable number of hardware sources.
Fixed hardware source number based on hardware line taken. All
hardware interrupts are a fixed priority. Level interrupts supported
only.
Hardwired to 0xc0004000 in SOC (UART is kept at 0xc0002000).
Signed-off-by: Michael Neuling <mikey@neuling.org>
This adds test cases for:
- sc, illegals and decrementer exceptions
- decrementer overflow
- rfid
- mt/mf sprg0/1 srr0/1
- mtdec
- mtmsrd
- sc
It also adds these test cases to make check/check_light
Signed-off-by: Michael Neuling <mikey@neuling.org>
Some distros don't have a version of ghdl with the LLVM or GCC backend,
so add a Docker image as an alternative.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
GHDL doesn't seem to have a way to specify the location of the object
file it writes, so right now they are all ending up in the root
directory. The Makefile rules did not reflect that, so make would
continually the files in fpga/*
Fix the rules to match what GHDL is doing.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
With this, the divider is a unit that execute1 sends operands to and
which sends its results back to execute1, which then send them to
writeback. Execute1 now sends a stall signal when it gets a divide
or modulus instruction until it gets a valid signal back from the
divider. Divide and modulus instructions are no longer marked as
single-issue.
The data formatting step that used to be done in decode2 for div
and mod instructions is now done in execute1. We also do the
absolute value operation in that same cycle instead of taking an
extra cycle inside the divider for signed operations with a
negative operand.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the multiplier isn't a separate pipe that decode2 issues
instructions to, but rather is a unit that execute1 sends operands
to and which sends the result back to execute1, which then sends it
to writeback. Execute1 now sends a stall signal when it gets a
multiply instruction until it gets a valid signal back from the
multiplier.
This all means that we no longer need to mark the multiply
instructions as single-issue.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This stores the most common SPRs in the register file.
This includes CTR and LR and a not yet final list of others.
The register file is set to 64 entries for now. Specific types
are defined that can represent a GPR index (gpr_index_t) or
a GPR/SPR index (gspr_index_t) along with conversion functions
between the two.
On order to deal with some forms of branch updating both LR and
CTR, we introduced a delayed update of LR after a branch link.
Note: We currently stall the pipeline on such a delayed branch,
but we could avoid stalling fetch in that specific case as we
know we have a branch delay. We could also limit that to the
specific case where we need to update both CTR and LR.
This allows us to make bcreg, mtspr and mfspr pipelined. decode1
will automatically force the single issue flag on mfspr/mtspr to
a "slow" SPR.
[paulus@ozlabs.org - fix direction of decode2.stall_in]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This replaces the simple_ram_behavioural and mw_soc_memory modules
with a common wishbone_bram_wrapper.vhdl that interfaces the
pipelined WB with a lower-level RAM module, along with an FPGA
and a sim variants of the latter.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This replaces loadstore2 with a dcache
The dcache unit is losely based on the icache one (same basic cache
layout), but has some significant logic additions to deal with stores,
loads with update, non-cachable accesses and other differences due to
operating in the execution part of the pipeline rather than the fetch
part.
The cache is store-through, though a hit with an existing line will
update the line rather than invalidate it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since the condition setting got moved to writeback, execute2 does
nothing aside from wasting a cycle. This removes it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to writeback to format data and test the result
against zero for the purpose of setting CR0. The data formatter
is able to shift and mask by bytes and do byte reversal and sign
extension. It can also put together bytes from two input
doublewords to support unaligned loads (including unaligned
byte-reversed loads).
The data formatter starts with an 8:1 multiplexer that is able
to direct any byte of the input to any byte of the output. This
lets us rotate the data and simultaneously byte-reverse it.
The rotated/reversed data goes to a register for the unaligned
cases that overlap two doublewords. Then there is per-byte logic
that does trimming, sign extension, and splicing together bytes
from a previous input doubleword (stored in data_latched) and the
current doubleword. Finally the 64-bit result is tested to set
CR0 if rc = 1.
This removes the RC logic from the execute2, multiply and divide
units, and the shift/mask/byte-reverse/sign-extend logic from
loadstore2.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Check GPRs against any writers in the pipeline.
All instructions are still marked single in pipeline at
this stage.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds combinatorial logic that does 32-bit and 64-bit count
leading and trailing zeroes in one unit, and consolidates the
four instructions under a single OP_CNTZ opcode.
This saves 84 slice LUTs on the Arty A7-100.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common
invert on the input and output. This saves us about 200 LUTs.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds support for set associativity to the icache. It can still
be direct mapped by setting NUM_WAYS to 1.
The replacement policy uses a simple tree-PLRU for each set.
This is only lightly tested, tests pass but I have to double check
that we are using the ways effectively and not creating duplicates.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds a new entity 'rotator' which contains combinatorial logic
for rotating and masking 64-bit values. It implements the operations
of the rlwinm, rlwnm, rlwimi, rldicl, rldicr, rldic, rldimi, rldcl,
rldcr, sld, slw, srd, srw, srad, sradi, sraw and srawi instructions.
It consists of a 3-stage 64-bit rotator using 4:1 multiplexors at
each stage, two mask generators, output logic and control logic.
The insn_type_t values used for these instructions have been reduced
to just 5: OP_RLC, OP_RLCL and OP_RLCR for the rotate and mask
instructions (clear both left and right, clear left, clear right
variants), OP_SHL for left shifts, and OP_SHR for right shifts.
The control signals for the rotator are derived from the opcode
and from the is_32bit and is_signed fields of the decode_rom_t.
The rotator is instantiated as an entity in execute1 so that we can
be sure we only have one of it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of doing mfctr, mflr, mftb, mtctr, mtlr as separate ops,
just pass down mfspr and mtspr ops with the spr number and let
execute1 decode which SPR we're addressing. This will help reduce
the number of instruction bits decode1 needs to look at.
In fact we now pass down the whole instruction from decode2 to
execute1. We will need more bits of the instruction in future,
and the tools should just optimize away any that we don't end
up using. Since the 'aa' bit was just a copy of an instruction
bit, we can now remove it from the record.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a divider unit, connected to the core in much the same way
that the multiplier unit is connected. The division algorithm is
very simple-minded, taking 64 clock cycles for any division (even
32-bit division instructions).
The decoding is simplified by making use of regularities in the
instruction encoding for div* and mod* instructions. Instead of
having PPC_* encodings from the first-stage decoder for each of the
different div* and mod* instructions, we now just have PPC_DIV and
PPC_MOD, and the inputs to the divider that indicate what sort of
division operation to do are derived from instruction word bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This module adds some simple core controls:
reset, stop, start, step
along with icache clear and reading the NIA and core
status bits
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org
This adds a debug module off the DMI (debug) bus which can act as a
wishbone master to generate read and write cycles.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds a simple bus that can be mastered from an external
system via JTAG, which will be used to hookup various debug
modules.
It's loosely based on the RiscV model (hence the DMI name).
The module currently only supports hooking up to a Xilinx BSCANE2
but it shouldn't be too hard to adapt it to support different TAPs
if necessary.
The JTAG protocol proper is not exactly the RiscV one at this point,
though I might still change it.
This comes with some sim variants of Xilinx BSCANE2 and BUFG and a
test bench.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The old reset code was overly complicated and never worked properly.
Replace it with a simpler sequence that uses a couple of shift registers
to assert resets:
- Wait a number of external clock cycles before removing reset from
the PLL.
- After the PLL locks and the external reset button isn't pressed,
wait a number of PLL clock cycles before removing reset from the SOC.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The decode2 stage was spaghetti code and needed cleaning up.
Create a series of functions to pull fields from a ppc instruction
and also a series of helpers to extract values for the execution
units.
As suggested by Paul, we should pass all signals to the execution
units and only set the valid signal conditionally, which should
use less resources.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>