Commit Graph

80 Commits (d1e8e62feee3e2d2d01a69f1486e8099e8438b31)

Author SHA1 Message Date
Paul Mackerras d1e8e62fee Remove option for "short" 16x16 bit multiplier
Now that we have a 33 bit x 33 bit signed multiplier in execute1,
there is really no need for the 16 bit multiplier.  The coremark
results are just as good without it as with it.  This removes the
option for the sake of simplicity.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 595a758400 execute1: Add a pipelined 33-bit signed multiplier
This adds a pipelined 33-bit by 33-bit signed multiplier with one
cycle latency to the execute pipeline, and uses it for the mullw,
mulhw and mulhwu instructions.  Because it has one cycle of latency we
can assume that its result is available in the second execute stage
without needing to add busy logic to the second stage.

This adds both a generic version of the multiplier and a
Xilinx-specific version using four DSP slices of the Artix-7.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 21ab36a0c0 Pre-decode instructions when writing them to icache
This splits out the decoding done in the decode0 step into a separate
predecoder, used when writing instructions into the icache.  The
icache now holds 36 bits per instruction rather than 32.  For valid
instructions, those 36 bits comprise the bottom 26 bits of the
instruction word, a 9-bit insn_code value (which uniquely identifies
the instruction), and a zero in the MSB.  For illegal instructions,
the MSB is one and the full instruction word is in the bottom 32 bits.
Having the full instruction word available for illegal instructions
means that it can be printed in the log when simulating, or in future
could be placed in the HEIR register.

If we don't have an FPU, then the floating-point instructions are
regarded as illegal.  In that case, the insn_code values would fit
into 8 bits, which could be used in future to reduce the size of
decode_rom from 512 to 256 entries.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 2491aa7fc5 core: Make popcnt* take two cycles
This moves the calculation of the result for popcnt* into the
countbits unit, renamed from countzero, so that we can take two cycles
to get the result.  The motivation for this is that the popcnt*
calculation was showing up as a critical path.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Benjamin Herrenschmidt da0189af1e Add support for QMTech Wukong v2 board
For now only the V2 of the board (slightly different pinout)
and only the A100T variant. I also haven't added GPIOs or anything
else on the PMODs really.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
3 years ago
Paul Mackerras 734e4c4a52 core: Add a short multiplier
This adds an optional 16 bit x 16 bit signed multiplier and uses it
for multiply instructions that return the low 64 bits of the product
(mull[dw][o] and mulli, but not maddld) when the operands are both in
the range -2^15 .. 2^15 - 1.   The "short" 16-bit multiplier produces
its result combinatorially, so a multiply that uses it executes in one
cycle.  This improves the coremark result by about 4%, since coremark
does quite a lot of multiplies and they almost all have operands that
fit into 16 bits.

The presence of the short multiplier is controlled by a generic at the
execute1, SOC, core and top levels.  For now, it defaults to off for
all platforms, and can be enabled using the --has_short_mult flag to
fusesoc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Michael Neuling 2bd00f5119
Merge pull request #315 from paulusmack/pmu
Add basic PMU implementation
3 years ago
Paul Mackerras a7873b45f7 core: Add a basic performance monitor unit (PMU) implementation
This is the start of an implementation of a PMU according to PowerISA
v3.0B.  Things not implemented yet include most architected events,
the BHRB, event-based branches, thresholding, MMCR0[TBCC] field, etc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Anton Blanchard 7cfbcd5514 litesdcard: Add Nexys Video support
This board has a reset line that needs to be held low to power up the
SD card hardware.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
3 years ago
Anton Blanchard 9caaa3fc46 litesdcard: Use vendor not board type
litesdcard provides a macro per vendor (eg xilinx, lattice) and not per
board, so modify the fusesoc generator to take a vendor. This will make
it easier to add litesdcard to more boards.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
3 years ago
Anton Blanchard 458dfe01a6 Add liteeth support to Nexys Video
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
3 years ago
Michael Neuling 84473eda1b Merge pull request #277 from paulus/gpio
A few cleanups. GPIO IRQ number is now 4 as 3 is now taken by the SD card.
3 years ago
Paul Mackerras 21ed730514 arty_a7: Add litesdcard interface
This adds litesdcard.v generated from the litex/litesdcard project,
along with logic in top-arty.vhdl to connect it into the system.
There is now a DMA wishbone coming in to soc.vhdl which is narrower
than the other wishbone masters (it has 32-bit data rather than
64-bit) so there is a widening/narrowing adapter between it and the
main wishbone master arbiter.

Also, litesdcard generates a non-pipelined wishbone for its DMA
connection, which needs to be converted to a pipelined wishbone.  We
have a latch on both the incoming and outgoing sides of the wishbone
in order to help make timing (at the cost of two extra cycles of
latency).

litesdcard generates an interrupt signal which is wired up to input 3
of the ICS (IRQ 19).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras f06ffcf9b7 Add a GPIO controller and use it to drive the shield I/O pins on the Arty
This adds a GPIO controller which provides 32 bits of I/O.  The
registers are modelled on the set used by the gpio-ftgpio010.c driver
in the Linux kernel.  Currently there is no interrupt capability
implemented, though an interrupt line from the GPIO subsystem to the
XICS has been connected.

For the Arty A7 board, GPIO lines 0 to 13 are connected to the pins
labelled IO0 to IO13 on the "shield" connector, GPIO lines 14 to 29
connect to IO26 to IO41, GPIO line 30 connects to the pin labelled A
(aka IO42), and GPIO line 31 is connected to LED 7.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras ae2afeca5c core: Track CR hazards and bypasses using tags
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras c0b45e153b core: Track GPR hazards using tags that propagate through the pipelines
This changes the way GPR hazards are detected and tracked.  Instead of
having a model of the pipeline in gpr_hazard.vhdl, which has to mirror
the behaviour of the real pipeline exactly, we now assign a 2-bit tag
to each instruction and record which GSPR the instruction writes.
Subsequent instructions that need to use the GSPR get the tag number
and stall until the value with that tag is being written back to the
register file.

For now, the forwarding paths are disabled.  That gives about a 8%
reduction in coremark performance.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 0fb207be60 fetch1: Implement a simple branch target cache
This implements a cache in fetch1, where each entry stores the address
of a simple branch instruction (b or bc) and the target of the branch.
When fetching sequentially, if the address being fetched matches the
cache entry, then fetching will be redirected to the branch target.
The cache has 1024 entries and is direct-mapped, i.e. indexed by bits
11..2 of the NIA.

The bus from execute1 now carries information about taken and
not-taken simple branches, which fetch1 uses to update the cache.
The cache entry is updated for both taken and not-taken branches, with
the valid bit being set if the branch was taken and cleared if the
branch was not taken.

If fetching is redirected to the branch target then that goes down the
pipe as a predicted-taken branch, and decode1 does not do any static
branch prediction.  If fetching is not redirected, then the next
instruction goes down the pipe as normal and decode1 does its static
branch prediction.

In order to make timing, the lookup of the cache is pipelined, so on
each cycle the cache entry for the current NIA + 8 is read.  This
means that after a redirect (from decode1 or execute1), only the third
and subsequent sequentially-fetched instructions will be able to be
predicted.

This improves the coremark value on the Arty A7-100 from about 180 to
about 190 (more than 5%).

The BTC is optional.  Builds for the Artix 7 35-T part have it off by
default because the extra ~1420 LUTs it takes mean that the design
doesn't fit on the Arty A7-35 board.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 856e9e955f core: Add framework for an FPU
This adds the skeleton of a floating-point unit and implements the
mffs and mtfsf instructions.

Execute1 sends FP instructions to the FPU and receives busy,
exception, FP interrupt and illegal interrupt signals from it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 45cd8f4fc3 core: Add support for floating-point loads and stores
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.

We now have the FP, FE0 and FE1 bits in MSR.  FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.

The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file.  It
defaults to true for all except the A7-35 boards.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Michael Neuling 6d6cf59bb7
Merge pull request #235 from paulusmack/master
More instructions and a random number generator
4 years ago
Boris Shingarov 679c547e5f fpga: Add support for Genesys2
Signed-off-by: Boris Shingarov <shingarov@labware.com>
4 years ago
Benjamin Herrenschmidt dbb137437c acorn: Add support for the Acorn CLE 215+
This is a NiteFury based PCIe M2 form-factor board originally
used for mining. It contains a speed grade 2 Artix 7 200T,
1GB of DDR3 and 32MB of flash.

The serial port is routed to pin 2 (RX) and 3 (TX) of the P2
connector (pin 1 is GND).

Note: Only 16MB of flash is currently usable until code is added
to configure the flash controller to use 4-bytes address commands
on that part.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Paul Mackerras 1a7aebeef8 Add random number generator and implement the darn instruction
This adds a true random number generator for the Xilinx FPGAs which
uses a set of chaotic ring oscillators to generate random bits and
then passes them through a Linear Hybrid Cellular Automaton (LHCA) to
remove bias, as described in "High Speed True Random Number Generators
in Xilinx FPGAs" by Catalin Baetoniu of Xilinx Inc., in:

https://pdfs.semanticscholar.org/83ac/9e9c1bb3dad5180654984604c8d5d8137412.pdf

This requires adding a .xdc file to tell vivado that the combinatorial
loops that form the ring oscillators are intentional.  The same
code should work on other FPGAs as well if their tools can be told to
accept the combinatorial loops.

For simulation, the random.vhdl module gets compiled in, which uses
the pseudorand() function to generate random numbers.

Synthesis using yosys uses nonrandom.vhdl, which always signals an
error, causing darn to return 0xffff_ffff_ffff_ffff.

This adds an implementation of the darn instruction.  Darn can return
either raw or conditioned random numbers.  On Xilinx FPGAs, reading a
raw random number gives the output of the ring oscillators, and
reading a conditioned random number gives the output of the LHCA.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Benjamin Herrenschmidt b0241d9f2d corefile/nexys_video: Parameter fixes
This fixes up a few issues with parameters:

Only arty has "has_uart1" since we haven't added plumbing for a second UART
anywhere else. Also "uart_is_16550" was mixing on one of the nexys_video
targets, and nexys_video toplevel was missing LOG_LENGTH.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt fb5c16d05e uart: Make 16550 the default
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt 7575b1e0c2 uart: Import and hook up opencore 16550 compatible UART
This imports via fusesoc a 16550 compatible (ie "standard") UART,
and wires it up optionally in the SoC instead of the potato one.

This also adds support for a second UART (which is always a
16550) to Arty, wired to JC "bottom" port.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt 8366710217 liteeth: Hook up LiteX LiteEth ethernet controller
Currently only generated for Arty.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Paul Mackerras 64efd494e5 fpga: Add a xilinx_specific fileset to microwatt.core
At present this just has the Xilinx-specific multiplier code, but
might in future have other things.

This also adds the xilinx_specific fileset to the synth target.
Without that it was failing because there was no multiplier.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 78de4fef72 Make LOG_LENGTH configurable per FPGA variant
This plumbs the LOG_LENGTH parameter (which controls how many entries
the core log RAM has) up to the top level so that it can be set on
the fusesoc command line and have different default values on
different FPGAs.

It now defaults to 512 entries generally and on the Artix-7 35 parts,
and 2048 on the larger Artix-7 FPGAs.  It can be set to 0 if desired.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 0809bc898b multiply: Use DSP48 slices for multiplication on Xilinx FPGAs
This adds a custom implementation of the multiplier which uses 16
DSP48E1 slices to do a 64x64 bit multiplication in 2 cycles.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras b5a7dbb78d core: Remove fetch2 pipeline stage
The fetch2 stage existed primarily to provide a stash buffer for the
output of icache when a stall occurred.  However, we can get the same
effect -- of having the input to decode1 stay unchanged on a stall
cycle -- by using the read enable of the BRAMs in icache, and by
adding logic to keep the outputs unchanged on a clock cycle when
stall_in = 1.  This reduces branch and interrupt latency by one
cycle.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Benjamin Herrenschmidt cc4dcb3597 spi: Add SPI Flash controller
This adds an SPI flash controller which supports direct
memory-mapped access to the flash along with a manual
mode to send commands.

The direct mode can be set via generic to default to single
wire or quad mode. The controller supports normal, dual and quad
accesses with configurable commands, clock divider, dummy clocks
etc...

The SPI clock can be an even divider of sys_clk starting at 2
(so max 50Mhz with our typical Arty designs).

A flash offset is carried via generics to syscon to tell SW about
which portion of the flash is reserved for the FPGA bitfile. There
is currently no plumbing to make the CPU reset past that address (TBD).

Note: Operating at 50Mhz has proven unreliable without adding some
delay to the sampling of the input data. I'm working in improving
this, in the meantime, I'm leaving the default set at 25 Mhz.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt a3857aac94 litedram: Add an L2 cache with store queue
This adds a cache between the wishbone and litedram with the following
features (at this point, it's still evolving)

  - 128 bytes line width in order to have a reasonable amount of
litedram pipelining on the 128-bit wide data port.

  - Configurable geometry otherwise

  - Stores are acked immediately on wishbone whether hit or miss
(minus a 2 cycles delay if there's a previous load response in the
way) and sent to LiteDRAM via 8 entries (configurable) store queue

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt bf1b98b958 litedram: Add support for booting without BRAM
This adds an option to disable the main BRAM and instead copy a
payload stashed along with the init code in the secondary BRAM
into DRAM and boot from there

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Paul Mackerras c164a2f4ea Merge branch 'mmu'
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Benjamin Herrenschmidt 025cf5efe8 syscon: Add syscon registers
These provides some info about the SoC (though it's still somewhat
incomplete and needs more work, see comments).

There's also a control register for selecting DRAM vs. BRAM at 0
(and for soft-resetting the SoC but that isn't wired up yet).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt 2cef3005cd fpga: Hookup nexys-video to litedram
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt 3ac815823c fpga: Hookup Arty to litedram
The old toplevel.vhdl becomes top-generic.vhdl, which is to be used
by platforms that do not have a litedram option.

Arty has its own top-arty.vhdl which supports litedram and is now
hooked up

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Paul Mackerras 8160f4f821 Add framework for implementing an MMU
This adds a new module to implement an MMU.  At the moment it doesn't
do very much.  Tlbie instructions now get sent by loadstore1 to mmu,
which sends them to dcache, rather than loadstore1 sending them
directly to dcache.  TLB misses from dcache now get sent by loadstore1
to mmu, which currently just returns an error.  Loadstore1 then
generates a DSI in response to the error return from mmu.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Benjamin Herrenschmidt 982cf166dd litedram: Add basic support for LiteX LiteDRAM
This comes in two parts:

 - A generator script which uses LiteX to generate litedram cores
along with their init files for various boards (currently Arty and
Nexys-video). This comes with configs for arty and nexys_video.

 - A fusesoc "generator" which uses pre-generated litedram cores

The generation process is manual on purpose. This include pre-generated
cores for the two above boards.

This is done so that one doesn't have to install LiteX to build
microwatt. In addition, the generator script or wrapper vhdl tend to
break when LiteX changes significantly which happens.

This is still rather standalone and hasn't been plumbed into the SoC
or the FPGA toplevel files yet.

At this point LiteDRAM self-initializes using a built-in VexRiscv
"Minimum" core obtained from LiteX and included in this commit. There
is some plumbing to generate and cores that are initialized by Microwatt
directly but this isn't working yet and so isn't enabled yet.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt 0f97b320f6 Change default frequency to 100Mhz
LiteDRAM at the moment pretty much enforces 100Mhz, and our software
isn't quite yet adaptable, so switch out default to 100Mhz accross
the board. Recent timing improvements should make it a non-issue.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt f124dc4a40 xics: Add missing fusesoc core file
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Joel Stanley 6a3d2d95df Set default RAM to be 16K in microwatt.core
This allows it to run hello world out of the box.

Signed-off-by: Joel Stanley <joel@jms.id.au>
4 years ago
Benjamin Herrenschmidt 8e0389b973 ram: Rework main RAM interface
This replaces the simple_ram_behavioural and mw_soc_memory modules
with a common wishbone_bram_wrapper.vhdl that interfaces the
pipelined WB with a lower-level RAM module, along with an FPGA
and a sim variants of the latter.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt d2762e70e5 Add option to not flatten hierarchy
Vivado by default tries to flatten the module hierarchy to improve
placement and timing. However this makes debugging timing issues
really hard as the net names in the timing report can be pretty
bogus.

This adds a generic that can be used to control attributes to stop
vivado from flattening the main core components. The resulting design
will have worst timing overall but it will be easier to understand
what the worst timing path are and address them.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt b513f0fb48 dcache: Add a dcache
This replaces loadstore2 with a dcache

The dcache unit is losely based on the icache one (same basic cache
layout), but has some significant logic additions to deal with stores,
loads with update, non-cachable accesses and other differences due to
operating in the execution part of the pipeline rather than the fetch
part.

The cache is store-through, though a hit with an existing line will
update the line rather than invalidate it.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Paul Mackerras f49a5a99a5 Remove execute2 stage
Since the condition setting got moved to writeback, execute2 does
nothing aside from wasting a cycle.  This removes it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Anton Blanchard 813f834012 Add CR hazard detection
To keep things simple we treat the CR as a single entity.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard bdc26b7527 Add GPR hazard detection
Check GPRs against any writers in the pipeline.

All instructions are still marked single in pipeline at
this stage.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard d5346d0abf Separate issue control into its own unit
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago