microwatt

Commit Graph

Author	SHA1	Message	Date
Benjamin Herrenschmidt	6068b635ae	Fix plru_tb to use the new plrufn and take out the old plru.vhdl This reworks (and simplifies) plru_tb to use the new plrufn module instead of the old (and now unused) plru module. The latter is now removed completely. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2 years ago
Paul Mackerras	86212dc879	icache: Split PLRU into storage and logic Rather than having update and decode logic for each individual PLRU as well as a register to store the current PLRU state, we now put the PLRU state in a little RAM, which will typically use LUT RAM on FPGAs, and have just a single copy of the logic to calculate the pseudo-LRU way and to update the PLRU state. This logic is in the plrufn module and is just combinatorial logic. A new module was created for this as other parts of the system are still using plru.vhdl. The PLRU RAM in the icache is read asynchronously in the cycle after the cache tag matching is done. At the end of that cycle the PLRU RAM entry is updated if the access was a cache hit, or a victim way is calculated and stored if the access was a cache miss and miss handling is starting in this cycle. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Joel Stanley	13aa52dfa7	antmicro-artix-dc-scm: Add liteeth As with the DRAM configuration, the DC-SCM board uses the same PHY as the Nexys Video and works with its generated VHDL. Signed-off-by: Joel Stanley <joel@jms.id.au>	2 years ago
Paul Mackerras	9b184ff569	antmicro-artix-dc-scm: Add DRAM support This uses the exact same gateware as the nexys video, since the DRAM connection is identical to the nexys video down to the pin assignments on the FPGA. The only minor difference is that the DRAM chip on the dc-scm is a MT41K256M16TW vs. a ...HA part on the nexys video. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> [joel: rebase and tweaks] Signed-off-by: Joel Stanley <joel@jms.id.au>	2 years ago
Michael Neuling	d92af779eb	Add Antmicro Artix DC SCM hello world support works with: fusesoc build --target=antmicro-artix-dc-scm microwatt --ram_init_file=../hello_world/hello_world.hex Signed-off-by: Michael Neuling <mikey@neuling.org> [joel: Fixes and updates] Signed-off-by: Joel Stanley <joel@jms.id.au>	2 years ago
Dan Horák	1ddbacb67f	syscon: Implement a register for storing git hash info It also stores the dirty status so that's known. This does some Makefile tricks so that we only rebuild when the git hash changes. This avoids rebuilding the world every time we run make. Also adds fusesoc generator, so that should continue to work as before. Signed-off-by: Dan Horák <dan@danny.cz> Signed-off-by: Michael Neuling <mikey@neuling.org>	2 years ago
Paul Mackerras	d1e8e62fee	Remove option for "short" 16x16 bit multiplier Now that we have a 33 bit x 33 bit signed multiplier in execute1, there is really no need for the 16 bit multiplier. The coremark results are just as good without it as with it. This removes the option for the sake of simplicity. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	595a758400	execute1: Add a pipelined 33-bit signed multiplier This adds a pipelined 33-bit by 33-bit signed multiplier with one cycle latency to the execute pipeline, and uses it for the mullw, mulhw and mulhwu instructions. Because it has one cycle of latency we can assume that its result is available in the second execute stage without needing to add busy logic to the second stage. This adds both a generic version of the multiplier and a Xilinx-specific version using four DSP slices of the Artix-7. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	21ab36a0c0	Pre-decode instructions when writing them to icache This splits out the decoding done in the decode0 step into a separate predecoder, used when writing instructions into the icache. The icache now holds 36 bits per instruction rather than 32. For valid instructions, those 36 bits comprise the bottom 26 bits of the instruction word, a 9-bit insn_code value (which uniquely identifies the instruction), and a zero in the MSB. For illegal instructions, the MSB is one and the full instruction word is in the bottom 32 bits. Having the full instruction word available for illegal instructions means that it can be printed in the log when simulating, or in future could be placed in the HEIR register. If we don't have an FPU, then the floating-point instructions are regarded as illegal. In that case, the insn_code values would fit into 8 bits, which could be used in future to reduce the size of decode_rom from 512 to 256 entries. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	2491aa7fc5	core: Make popcnt* take two cycles This moves the calculation of the result for popcnt* into the countbits unit, renamed from countzero, so that we can take two cycles to get the result. The motivation for this is that the popcnt* calculation was showing up as a critical path. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Benjamin Herrenschmidt	da0189af1e	Add support for QMTech Wukong v2 board For now only the V2 of the board (slightly different pinout) and only the A100T variant. I also haven't added GPIOs or anything else on the PMODs really. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	3 years ago
Paul Mackerras	734e4c4a52	core: Add a short multiplier This adds an optional 16 bit x 16 bit signed multiplier and uses it for multiply instructions that return the low 64 bits of the product (mull[dw][o] and mulli, but not maddld) when the operands are both in the range -2^15 .. 2^15 - 1. The "short" 16-bit multiplier produces its result combinatorially, so a multiply that uses it executes in one cycle. This improves the coremark result by about 4%, since coremark does quite a lot of multiplies and they almost all have operands that fit into 16 bits. The presence of the short multiplier is controlled by a generic at the execute1, SOC, core and top levels. For now, it defaults to off for all platforms, and can be enabled using the --has_short_mult flag to fusesoc. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Michael Neuling	2bd00f5119	Merge pull request #315 from paulusmack/pmu Add basic PMU implementation	3 years ago
Paul Mackerras	a7873b45f7	core: Add a basic performance monitor unit (PMU) implementation This is the start of an implementation of a PMU according to PowerISA v3.0B. Things not implemented yet include most architected events, the BHRB, event-based branches, thresholding, MMCR0[TBCC] field, etc. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Anton Blanchard	7cfbcd5514	litesdcard: Add Nexys Video support This board has a reset line that needs to be held low to power up the SD card hardware. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	3 years ago
Anton Blanchard	9caaa3fc46	litesdcard: Use vendor not board type litesdcard provides a macro per vendor (eg xilinx, lattice) and not per board, so modify the fusesoc generator to take a vendor. This will make it easier to add litesdcard to more boards. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	3 years ago
Anton Blanchard	458dfe01a6	Add liteeth support to Nexys Video Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	3 years ago
Michael Neuling	84473eda1b	Merge pull request #277 from paulus/gpio A few cleanups. GPIO IRQ number is now 4 as 3 is now taken by the SD card.	4 years ago
Paul Mackerras	21ed730514	arty_a7: Add litesdcard interface This adds litesdcard.v generated from the litex/litesdcard project, along with logic in top-arty.vhdl to connect it into the system. There is now a DMA wishbone coming in to soc.vhdl which is narrower than the other wishbone masters (it has 32-bit data rather than 64-bit) so there is a widening/narrowing adapter between it and the main wishbone master arbiter. Also, litesdcard generates a non-pipelined wishbone for its DMA connection, which needs to be converted to a pipelined wishbone. We have a latch on both the incoming and outgoing sides of the wishbone in order to help make timing (at the cost of two extra cycles of latency). litesdcard generates an interrupt signal which is wired up to input 3 of the ICS (IRQ 19). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	f06ffcf9b7	Add a GPIO controller and use it to drive the shield I/O pins on the Arty This adds a GPIO controller which provides 32 bits of I/O. The registers are modelled on the set used by the gpio-ftgpio010.c driver in the Linux kernel. Currently there is no interrupt capability implemented, though an interrupt line from the GPIO subsystem to the XICS has been connected. For the Arty A7 board, GPIO lines 0 to 13 are connected to the pins labelled IO0 to IO13 on the "shield" connector, GPIO lines 14 to 29 connect to IO26 to IO41, GPIO line 30 connects to the pin labelled A (aka IO42), and GPIO line 31 is connected to LED 7. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	ae2afeca5c	core: Track CR hazards and bypasses using tags Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	c0b45e153b	core: Track GPR hazards using tags that propagate through the pipelines This changes the way GPR hazards are detected and tracked. Instead of having a model of the pipeline in gpr_hazard.vhdl, which has to mirror the behaviour of the real pipeline exactly, we now assign a 2-bit tag to each instruction and record which GSPR the instruction writes. Subsequent instructions that need to use the GSPR get the tag number and stall until the value with that tag is being written back to the register file. For now, the forwarding paths are disabled. That gives about an 8% reduction in coremark performance. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0fb207be60	fetch1: Implement a simple branch target cache This implements a cache in fetch1, where each entry stores the address of a simple branch instruction (b or bc) and the target of the branch. When fetching sequentially, if the address being fetched matches the cache entry, then fetching will be redirected to the branch target. The cache has 1024 entries and is direct-mapped, i.e. indexed by bits 11..2 of the NIA. The bus from execute1 now carries information about taken and not-taken simple branches, which fetch1 uses to update the cache. The cache entry is updated for both taken and not-taken branches, with the valid bit being set if the branch was taken and cleared if the branch was not taken. If fetching is redirected to the branch target then that goes down the pipe as a predicted-taken branch, and decode1 does not do any static branch prediction. If fetching is not redirected, then the next instruction goes down the pipe as normal and decode1 does its static branch prediction. In order to make timing, the lookup of the cache is pipelined, so on each cycle the cache entry for the current NIA + 8 is read. This means that after a redirect (from decode1 or execute1), only the third and subsequent sequentially-fetched instructions will be able to be predicted. This improves the coremark value on the Arty A7-100 from about 180 to about 190 (more than 5%). The BTC is optional. Builds for the Artix 7 35-T part have it off by default because the extra ~1420 LUTs it takes mean that the design doesn't fit on the Arty A7-35 board. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	856e9e955f	core: Add framework for an FPU This adds the skeleton of a floating-point unit and implements the mffs and mtfsf instructions. Execute1 sends FP instructions to the FPU and receives busy, exception, FP interrupt and illegal interrupt signals from it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	45cd8f4fc3	core: Add support for floating-point loads and stores This extends the register file so it can hold FPR values, and implements the FP loads and stores that do not require conversion between single and double precision. We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores cause a FP unavailable interrupt if MSR[FP] = 0. The FPU facilities are optional and their presence is controlled by the HAS_FPU generic passed down from the top-level board file. It defaults to true for all except the A7-35 boards. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Michael Neuling	6d6cf59bb7	Merge pull request #235 from paulusmack/master More instructions and a random number generator	4 years ago
Boris Shingarov	679c547e5f	fpga: Add support for Genesys2 Signed-off-by: Boris Shingarov <shingarov@labware.com>	4 years ago
Benjamin Herrenschmidt	dbb137437c	acorn: Add support for the Acorn CLE 215+ This is a NiteFury based PCIe M2 form-factor board originally used for mining. It contains a speed grade 2 Artix 7 200T, 1GB of DDR3 and 32MB of flash. The serial port is routed to pin 2 (RX) and 3 (TX) of the P2 connector (pin 1 is GND). Note: Only 16MB of flash is currently usable until code is added to configure the flash controller to use 4-byte address commands on that part. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Paul Mackerras	1a7aebeef8	Add random number generator and implement the darn instruction This adds a true random number generator for the Xilinx FPGAs which uses a set of chaotic ring oscillators to generate random bits and then passes them through a Linear Hybrid Cellular Automaton (LHCA) to remove bias, as described in "High Speed True Random Number Generators in Xilinx FPGAs" by Catalin Baetoniu of Xilinx Inc., in: https://pdfs.semanticscholar.org/83ac/9e9c1bb3dad5180654984604c8d5d8137412.pdf This requires adding a .xdc file to tell vivado that the combinatorial loops that form the ring oscillators are intentional. The same code should work on other FPGAs as well if their tools can be told to accept the combinatorial loops. For simulation, the random.vhdl module gets compiled in, which uses the pseudorand() function to generate random numbers. Synthesis using yosys uses nonrandom.vhdl, which always signals an error, causing darn to return 0xffff_ffff_ffff_ffff. This adds an implementation of the darn instruction. Darn can return either raw or conditioned random numbers. On Xilinx FPGAs, reading a raw random number gives the output of the ring oscillators, and reading a conditioned random number gives the output of the LHCA. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	b0241d9f2d	corefile/nexys_video: Parameter fixes This fixes up a few issues with parameters: Only arty has "has_uart1" since we haven't added plumbing for a second UART anywhere else. Also "uart_is_16550" was missing on one of the nexys_video targets, and the nexys_video toplevel was missing LOG_LENGTH. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	fb5c16d05e	uart: Make 16550 the default Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	7575b1e0c2	uart: Import and hook up opencore 16550 compatible UART This imports via fusesoc a 16550 compatible (ie "standard") UART, and wires it up optionally in the SoC instead of the potato one. This also adds support for a second UART (which is always a 16550) to Arty, wired to JC "bottom" port. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	8366710217	liteeth: Hook up LiteX LiteEth ethernet controller Currently only generated for Arty. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Paul Mackerras	64efd494e5	fpga: Add a xilinx_specific fileset to microwatt.core At present this just has the Xilinx-specific multiplier code, but might in future have other things. This also adds the xilinx_specific fileset to the synth target. Without that it was failing because there was no multiplier. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	78de4fef72	Make LOG_LENGTH configurable per FPGA variant This plumbs the LOG_LENGTH parameter (which controls how many entries the core log RAM has) up to the top level so that it can be set on the fusesoc command line and have different default values on different FPGAs. It now defaults to 512 entries generally and on the Artix-7 35 parts, and 2048 on the larger Artix-7 FPGAs. It can be set to 0 if desired. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0809bc898b	multiply: Use DSP48 slices for multiplication on Xilinx FPGAs This adds a custom implementation of the multiplier which uses 16 DSP48E1 slices to do a 64x64 bit multiplication in 2 cycles. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b5a7dbb78d	core: Remove fetch2 pipeline stage The fetch2 stage existed primarily to provide a stash buffer for the output of icache when a stall occurred. However, we can get the same effect -- of having the input to decode1 stay unchanged on a stall cycle -- by using the read enable of the BRAMs in icache, and by adding logic to keep the outputs unchanged on a clock cycle when stall_in = 1. This reduces branch and interrupt latency by one cycle. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	cc4dcb3597	spi: Add SPI Flash controller This adds an SPI flash controller which supports direct memory-mapped access to the flash along with a manual mode to send commands. The direct mode can be set via generic to default to single wire or quad mode. The controller supports normal, dual and quad accesses with configurable commands, clock divider, dummy clocks etc. The SPI clock can be an even divider of sys_clk starting at 2 (so max 50MHz with our typical Arty designs). A flash offset is carried via generics to syscon to tell SW about which portion of the flash is reserved for the FPGA bitfile. There is currently no plumbing to make the CPU reset past that address (TBD). Note: Operating at 50MHz has proven unreliable without adding some delay to the sampling of the input data. I'm working on improving this; in the meantime, I'm leaving the default set at 25MHz. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	a3857aac94	litedram: Add an L2 cache with store queue This adds a cache between the wishbone and litedram with the following features (at this point, it's still evolving): - 128 bytes line width in order to have a reasonable amount of litedram pipelining on the 128-bit wide data port. - Configurable geometry otherwise - Stores are acked immediately on wishbone whether hit or miss (minus a 2-cycle delay if there's a previous load response in the way) and sent to LiteDRAM via an 8-entry (configurable) store queue Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	bf1b98b958	litedram: Add support for booting without BRAM This adds an option to disable the main BRAM and instead copy a payload stashed along with the init code in the secondary BRAM into DRAM and boot from there Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Paul Mackerras	c164a2f4ea	Merge branch 'mmu' Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	025cf5efe8	syscon: Add syscon registers These provide some info about the SoC (though it's still somewhat incomplete and needs more work, see comments). There's also a control register for selecting DRAM vs. BRAM at 0 (and for soft-resetting the SoC but that isn't wired up yet). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	2cef3005cd	fpga: Hookup nexys-video to litedram Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	3ac815823c	fpga: Hookup Arty to litedram The old toplevel.vhdl becomes top-generic.vhdl, which is to be used by platforms that do not have a litedram option. Arty has its own top-arty.vhdl which supports litedram and is now hooked up Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Paul Mackerras	8160f4f821	Add framework for implementing an MMU This adds a new module to implement an MMU. At the moment it doesn't do very much. Tlbie instructions now get sent by loadstore1 to mmu, which sends them to dcache, rather than loadstore1 sending them directly to dcache. TLB misses from dcache now get sent by loadstore1 to mmu, which currently just returns an error. Loadstore1 then generates a DSI in response to the error return from mmu. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	982cf166dd	litedram: Add basic support for LiteX LiteDRAM This comes in two parts: - A generator script which uses LiteX to generate litedram cores along with their init files for various boards (currently Arty and Nexys-video). This comes with configs for arty and nexys_video. - A fusesoc "generator" which uses pre-generated litedram cores. The generation process is manual on purpose. This includes pre-generated cores for the two above boards. This is done so that one doesn't have to install LiteX to build microwatt. In addition, the generator script or the wrapper VHDL tends to break when LiteX changes significantly, which does happen. This is still rather standalone and hasn't been plumbed into the SoC or the FPGA toplevel files yet. At this point LiteDRAM self-initializes using a built-in VexRiscv "Minimum" core obtained from LiteX and included in this commit. There is some plumbing to generate cores that are initialized by Microwatt directly, but this isn't working yet and so isn't enabled yet. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	0f97b320f6	Change default frequency to 100MHz LiteDRAM at the moment pretty much enforces 100MHz, and our software isn't quite yet adaptable, so switch our default to 100MHz across the board. Recent timing improvements should make it a non-issue. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	f124dc4a40	xics: Add missing fusesoc core file Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Joel Stanley	6a3d2d95df	Set default RAM to be 16K in microwatt.core This allows it to run hello world out of the box. Signed-off-by: Joel Stanley <joel@jms.id.au>	5 years ago
Benjamin Herrenschmidt	8e0389b973	ram: Rework main RAM interface This replaces the simple_ram_behavioural and mw_soc_memory modules with a common wishbone_bram_wrapper.vhdl that interfaces the pipelined WB with a lower-level RAM module, along with FPGA and sim variants of the latter. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
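Commit 86212dc879 splits the icache PLRU into storage (a small per-set RAM) and a single copy of combinatorial logic in plrufn, but it does not say which pseudo-LRU scheme that logic uses. The Python sketch below assumes the common binary-tree PLRU. It mirrors the storage/logic split: a table of per-set states stands in for the LUT RAM, and two pure functions compute the victim way and the updated state. WAYS, NUM_SETS and all function names are hypothetical.

```python
WAYS = 4        # hypothetical associativity (must be a power of two)
NUM_SETS = 64   # hypothetical number of icache sets


def _levels(ways: int) -> int:
    """Depth of the PLRU tree; a tree for N ways has N - 1 node bits."""
    if ways < 2 or ways & (ways - 1):
        raise ValueError("ways must be a power of two >= 2")
    return ways.bit_length() - 1


def plru_victim(state: int, ways: int) -> int:
    """Walk from the root, following each node bit, to the pseudo-LRU way."""
    node, way = 0, 0
    for _ in range(_levels(ways)):
        bit = (state >> node) & 1      # 0 -> go left, 1 -> go right
        way = (way << 1) | bit
        node = 2 * node + 1 + bit      # heap-style child index
    return way


def plru_update(state: int, way: int, ways: int) -> int:
    """Mark `way` most recently used: point every node on its path away from it."""
    node = 0
    for level in reversed(range(_levels(ways))):
        bit = (way >> level) & 1       # which subtree the accessed way lives in
        if bit:
            state &= ~(1 << node)      # accessed way is on the right, point left
        else:
            state |= 1 << node         # accessed way is on the left, point right
        node = 2 * node + 1 + bit
    return state


# Storage/logic split: one small table of per-set states, one copy of the logic.
plru_ram = [0] * NUM_SETS


def on_hit(set_index: int, way: int) -> None:
    plru_ram[set_index] = plru_update(plru_ram[set_index], way, WAYS)


def choose_victim(set_index: int) -> int:
    return plru_victim(plru_ram[set_index], WAYS)


for w in range(WAYS):
    on_hit(5, w)
print(choose_victim(5))
```

With a tree PLRU, the way just accessed can never be chosen as the next victim. This is the property a hardware implementation would typically be checked against.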
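Commit 595a758400 uses a single 33-bit by 33-bit signed multiplier for mullw, mulhw and mulhwu. A likely reason for the 33rd bit is this: a 32-bit unsigned operand, zero-extended to 33 bits, remains non-negative in 33-bit two's complement. That lets signed and unsigned 32x32 products share one signed multiplier. The behavioural Python model below checks this idea. It does not model the one-cycle pipeline or the DSP-slice mapping, and the helper names are hypothetical.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def extend_to_33(x: int, signed: bool) -> int:
    """Widen a 32-bit operand to a 33-bit pattern (sign- or zero-extend)."""
    x &= MASK32
    if signed and x >> 31:
        x |= 1 << 32  # copy the sign bit into bit 32
    return x


def mul33_signed(a33: int, b33: int) -> int:
    """Behavioural 33x33-bit signed multiply (no pipelining modelled)."""
    return to_signed(a33, 33) * to_signed(b33, 33)


def mul32(a: int, b: int, signed: bool) -> int:
    """64-bit product of two 32-bit operands, as an unsigned bit pattern."""
    return mul33_signed(extend_to_33(a, signed), extend_to_33(b, signed)) & MASK64


def high_word(a: int, b: int, signed: bool) -> int:
    """High 32 bits of the product: signed for mulhw, unsigned for mulhwu."""
    return (mul32(a, b, signed) >> 32) & MASK32


print(hex(high_word(0xFFFFFFFF, 0xFFFFFFFF, signed=False)))
print(hex(high_word(0xFFFFFFFF, 0xFFFFFFFF, signed=True)))
```

The same operand bits give different high words depending only on how they are extended to 33 bits. No separate unsigned multiplier is needed.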
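Commit 21ab36a0c0 describes the 36-bit predecoded icache word. A valid instruction keeps its bottom 26 bits plus a 9-bit insn_code with a zero MSB. An illegal instruction sets the MSB and keeps the full 32-bit word. The commit gives the field widths but not their exact positions, so the Python sketch below assumes insn_code sits in bits 34..26, directly under the MSB. The function names are hypothetical. Dropping the top 6 bits presumably works because the Power primary opcode lives there and is made redundant by insn_code.

```python
from typing import Optional, Tuple

ILLEGAL_FLAG = 1 << 35          # MSB of the 36-bit word
LOW26_MASK = (1 << 26) - 1      # low instruction bits kept for valid instructions
CODE_MASK = (1 << 9) - 1        # 9-bit insn_code
WORD_MASK = 0xFFFF_FFFF


def encode_predecoded(insn: int, insn_code: Optional[int]) -> int:
    """Pack a 32-bit instruction into 36 bits; insn_code=None means illegal."""
    insn &= WORD_MASK
    if insn_code is None:
        # Illegal: MSB set, full instruction word kept in bits 31..0.
        return ILLEGAL_FLAG | insn
    if not 0 <= insn_code <= CODE_MASK:
        raise ValueError("insn_code must fit in 9 bits")
    # Valid: MSB clear, insn_code above the low 26 bits (assumed layout).
    return (insn_code << 26) | (insn & LOW26_MASK)


def decode_predecoded(word: int) -> Tuple[bool, int, int]:
    """Return (illegal, insn_code, payload) from a 36-bit predecoded word."""
    if word & ILLEGAL_FLAG:
        return True, 0, word & WORD_MASK
    return False, (word >> 26) & CODE_MASK, word & LOW26_MASK


valid = encode_predecoded(0x7C0802A6, 0x123)
illegal = encode_predecoded(0xDEADBEEF, None)
print(hex(valid), decode_predecoded(valid))
print(hex(illegal), decode_predecoded(illegal))
```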
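The Arty A7 pin assignment in commit f06ffcf9b7 skips shield pins IO14 to IO25, so GPIO 14 lands on IO26 rather than IO14. The small Python lookup below makes that offset explicit. The function name is hypothetical; the mapping itself is taken directly from the commit message.

```python
def arty_gpio_connection(gpio: int) -> str:
    """Return the Arty A7 connection for a GPIO line, per commit f06ffcf9b7."""
    if 0 <= gpio <= 13:
        return f"IO{gpio}"          # shield pins IO0..IO13
    if 14 <= gpio <= 29:
        return f"IO{gpio + 12}"     # shield pins IO26..IO41
    if gpio == 30:
        return "A (IO42)"
    if gpio == 31:
        return "LED 7"
    raise ValueError("the controller has 32 GPIO lines (0..31)")


for line in (0, 13, 14, 29, 30, 31):
    print(line, arty_gpio_connection(line))
```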
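Commit 0fb207be60 describes a 1024-entry, direct-mapped branch target cache indexed by bits 11..2 of the NIA. Each entry holds a branch address and its target. The entry is written for both taken and not-taken simple branches, with the valid bit tracking whether the branch was taken. The Python model below captures that lookup and update behaviour. It assumes the full branch address is compared on lookup, and it ignores the pipelined NIA + 8 read and the interaction with decode1's static prediction. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

BTC_ENTRIES = 1024  # entry count from the commit message


@dataclass
class BtcEntry:
    valid: bool = False
    branch_addr: int = 0
    target: int = 0


def btc_index(nia: int) -> int:
    """Direct-mapped index: NIA bits 11..2 (10 bits, 1024 entries)."""
    return (nia >> 2) & (BTC_ENTRIES - 1)


class BranchTargetCache:
    def __init__(self) -> None:
        self.entries: List[BtcEntry] = [BtcEntry() for _ in range(BTC_ENTRIES)]

    def update(self, branch_addr: int, target: int, taken: bool) -> None:
        """Record a resolved simple branch; valid is set only if it was taken."""
        e = self.entries[btc_index(branch_addr)]
        e.valid = taken
        e.branch_addr = branch_addr
        e.target = target

    def predict(self, nia: int) -> Optional[int]:
        """Return the predicted target, or None to keep fetching sequentially."""
        e = self.entries[btc_index(nia)]
        if e.valid and e.branch_addr == nia:
            return e.target
        return None


btc = BranchTargetCache()
btc.update(0x1000, target=0x2000, taken=True)
print(btc.predict(0x1000), btc.predict(0x2000))
```

Because the cache is direct-mapped, addresses 4 KiB apart (such as 0x1000 and 0x2000) share an entry. The address comparison stops one from hijacking the other's prediction.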


86 Commits (964b97e85cd60cdac29553cd308854a505f0404b)