microwatt

Commit Graph

Author	SHA1	Message	Date
Anton Blanchard	53ccf89d26	Use a record for cache parameters The number of generics we pass down from the top level is getting a bit unwieldy. Paul suggests using records to group them. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Paul Mackerras	ae2afeca5c	core: Track CR hazards and bypasses using tags Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	c0b45e153b	core: Track GPR hazards using tags that propagate through the pipelines This changes the way GPR hazards are detected and tracked. Instead of having a model of the pipeline in gpr_hazard.vhdl, which has to mirror the behaviour of the real pipeline exactly, we now assign a 2-bit tag to each instruction and record which GSPR the instruction writes. Subsequent instructions that need to use the GSPR get the tag number and stall until the value with that tag is being written back to the register file. For now, the forwarding paths are disabled. That gives about a 8% reduction in coremark performance. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0fb207be60	fetch1: Implement a simple branch target cache This implements a cache in fetch1, where each entry stores the address of a simple branch instruction (b or bc) and the target of the branch. When fetching sequentially, if the address being fetched matches the cache entry, then fetching will be redirected to the branch target. The cache has 1024 entries and is direct-mapped, i.e. indexed by bits 11..2 of the NIA. The bus from execute1 now carries information about taken and not-taken simple branches, which fetch1 uses to update the cache. The cache entry is updated for both taken and not-taken branches, with the valid bit being set if the branch was taken and cleared if the branch was not taken. If fetching is redirected to the branch target then that goes down the pipe as a predicted-taken branch, and decode1 does not do any static branch prediction. If fetching is not redirected, then the next instruction goes down the pipe as normal and decode1 does its static branch prediction. In order to make timing, the lookup of the cache is pipelined, so on each cycle the cache entry for the current NIA + 8 is read. This means that after a redirect (from decode1 or execute1), only the third and subsequent sequentially-fetched instructions will be able to be predicted. This improves the coremark value on the Arty A7-100 from about 180 to about 190 (more than 5%). The BTC is optional. Builds for the Artix 7 35-T part have it off by default because the extra ~1420 LUTs it takes mean that the design doesn't fit on the Arty A7-35 board. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	856e9e955f	core: Add framework for an FPU This adds the skeleton of a floating-point unit and implements the mffs and mtfsf instructions. Execute1 sends FP instructions to the FPU and receives busy, exception, FP interrupt and illegal interrupt signals from it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	45cd8f4fc3	core: Add support for floating-point loads and stores This extends the register file so it can hold FPR values, and implements the FP loads and stores that do not require conversion between single and double precision. We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores cause a FP unavailable interrupt if MSR[FP] = 0. The FPU facilities are optional and their presence is controlled by the HAS_FPU generic passed down from the top-level board file. It defaults to true for all except the A7-35 boards. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Michael Neuling	6d6cf59bb7	Merge pull request #235 from paulusmack/master More instructions and a random number generator	4 years ago
Boris Shingarov	679c547e5f	fpga: Add support for Genesys2 Signed-off-by: Boris Shingarov <shingarov@labware.com>	4 years ago
Benjamin Herrenschmidt	dbb137437c	acorn: Add support for the Acorn CLE 215+ This is a NiteFury based PCIe M2 form-factor board originally used for mining. It contains a speed grade 2 Artix 7 200T, 1GB of DDR3 and 32MB of flash. The serial port is routed to pin 2 (RX) and 3 (TX) of the P2 connector (pin 1 is GND). Note: Only 16MB of flash is currently usable until code is added to configure the flash controller to use 4-bytes address commands on that part. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Paul Mackerras	1a7aebeef8	Add random number generator and implement the darn instruction This adds a true random number generator for the Xilinx FPGAs which uses a set of chaotic ring oscillators to generate random bits and then passes them through a Linear Hybrid Cellular Automaton (LHCA) to remove bias, as described in "High Speed True Random Number Generators in Xilinx FPGAs" by Catalin Baetoniu of Xilinx Inc., in: https://pdfs.semanticscholar.org/83ac/9e9c1bb3dad5180654984604c8d5d8137412.pdf This requires adding a .xdc file to tell vivado that the combinatorial loops that form the ring oscillators are intentional. The same code should work on other FPGAs as well if their tools can be told to accept the combinatorial loops. For simulation, the random.vhdl module gets compiled in, which uses the pseudorand() function to generate random numbers. Synthesis using yosys uses nonrandom.vhdl, which always signals an error, causing darn to return 0xffff_ffff_ffff_ffff. This adds an implementation of the darn instruction. Darn can return either raw or conditioned random numbers. On Xilinx FPGAs, reading a raw random number gives the output of the ring oscillators, and reading a conditioned random number gives the output of the LHCA. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	b0241d9f2d	corefile/nexys_video: Parameter fixes This fixes up a few issues with parameters: Only arty has "has_uart1" since we haven't added plumbing for a second UART anywhere else. Also "uart_is_16550" was mixing on one of the nexys_video targets, and nexys_video toplevel was missing LOG_LENGTH. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	fb5c16d05e	uart: Make 16550 the default Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	7575b1e0c2	uart: Import and hook up opencore 16550 compatible UART This imports via fusesoc a 16550 compatible (ie "standard") UART, and wires it up optionally in the SoC instead of the potato one. This also adds support for a second UART (which is always a 16550) to Arty, wired to JC "bottom" port. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	8366710217	liteeth: Hook up LiteX LiteEth ethernet controller Currently only generated for Arty. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Paul Mackerras	64efd494e5	fpga: Add a xilinx_specific fileset to microwatt.core At present this just has the Xilinx-specific multiplier code, but might in future have other things. This also adds the xilinx_specific fileset to the synth target. Without that it was failing because there was no multiplier. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	78de4fef72	Make LOG_LENGTH configurable per FPGA variant This plumbs the LOG_LENGTH parameter (which controls how many entries the core log RAM has) up to the top level so that it can be set on the fusesoc command line and have different default values on different FPGAs. It now defaults to 512 entries generally and on the Artix-7 35 parts, and 2048 on the larger Artix-7 FPGAs. It can be set to 0 if desired. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	0809bc898b	multiply: Use DSP48 slices for multiplication on Xilinx FPGAs This adds a custom implementation of the multiplier which uses 16 DSP48E1 slices to do a 64x64 bit multiplication in 2 cycles. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b5a7dbb78d	core: Remove fetch2 pipeline stage The fetch2 stage existed primarily to provide a stash buffer for the output of icache when a stall occurred. However, we can get the same effect -- of having the input to decode1 stay unchanged on a stall cycle -- by using the read enable of the BRAMs in icache, and by adding logic to keep the outputs unchanged on a clock cycle when stall_in = 1. This reduces branch and interrupt latency by one cycle. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	cc4dcb3597	spi: Add SPI Flash controller This adds an SPI flash controller which supports direct memory-mapped access to the flash along with a manual mode to send commands. The direct mode can be set via generic to default to single wire or quad mode. The controller supports normal, dual and quad accesses with configurable commands, clock divider, dummy clocks etc... The SPI clock can be an even divider of sys_clk starting at 2 (so max 50Mhz with our typical Arty designs). A flash offset is carried via generics to syscon to tell SW about which portion of the flash is reserved for the FPGA bitfile. There is currently no plumbing to make the CPU reset past that address (TBD). Note: Operating at 50Mhz has proven unreliable without adding some delay to the sampling of the input data. I'm working in improving this, in the meantime, I'm leaving the default set at 25 Mhz. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	a3857aac94	litedram: Add an L2 cache with store queue This adds a cache between the wishbone and litedram with the following features (at this point, it's still evolving) - 128 bytes line width in order to have a reasonable amount of litedram pipelining on the 128-bit wide data port. - Configurable geometry otherwise - Stores are acked immediately on wishbone whether hit or miss (minus a 2 cycles delay if there's a previous load response in the way) and sent to LiteDRAM via 8 entries (configurable) store queue Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	bf1b98b958	litedram: Add support for booting without BRAM This adds an option to disable the main BRAM and instead copy a payload stashed along with the init code in the secondary BRAM into DRAM and boot from there Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Paul Mackerras	c164a2f4ea	Merge branch 'mmu' Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	025cf5efe8	syscon: Add syscon registers These provides some info about the SoC (though it's still somewhat incomplete and needs more work, see comments). There's also a control register for selecting DRAM vs. BRAM at 0 (and for soft-resetting the SoC but that isn't wired up yet). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	2cef3005cd	fpga: Hookup nexys-video to litedram Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	3ac815823c	fpga: Hookup Arty to litedram The old toplevel.vhdl becomes top-generic.vhdl, which is to be used by platforms that do not have a litedram option. Arty has its own top-arty.vhdl which supports litedram and is now hooked up Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Paul Mackerras	8160f4f821	Add framework for implementing an MMU This adds a new module to implement an MMU. At the moment it doesn't do very much. Tlbie instructions now get sent by loadstore1 to mmu, which sends them to dcache, rather than loadstore1 sending them directly to dcache. TLB misses from dcache now get sent by loadstore1 to mmu, which currently just returns an error. Loadstore1 then generates a DSI in response to the error return from mmu. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	982cf166dd	litedram: Add basic support for LiteX LiteDRAM This comes in two parts: - A generator script which uses LiteX to generate litedram cores along with their init files for various boards (currently Arty and Nexys-video). This comes with configs for arty and nexys_video. - A fusesoc "generator" which uses pre-generated litedram cores The generation process is manual on purpose. This include pre-generated cores for the two above boards. This is done so that one doesn't have to install LiteX to build microwatt. In addition, the generator script or wrapper vhdl tend to break when LiteX changes significantly which happens. This is still rather standalone and hasn't been plumbed into the SoC or the FPGA toplevel files yet. At this point LiteDRAM self-initializes using a built-in VexRiscv "Minimum" core obtained from LiteX and included in this commit. There is some plumbing to generate and cores that are initialized by Microwatt directly but this isn't working yet and so isn't enabled yet. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	0f97b320f6	Change default frequency to 100Mhz LiteDRAM at the moment pretty much enforces 100Mhz, and our software isn't quite yet adaptable, so switch out default to 100Mhz accross the board. Recent timing improvements should make it a non-issue. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	f124dc4a40	xics: Add missing fusesoc core file Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Joel Stanley	6a3d2d95df	Set default RAM to be 16K in microwatt.core This allows it to run hello world out of the box. Signed-off-by: Joel Stanley <joel@jms.id.au>	5 years ago
Benjamin Herrenschmidt	8e0389b973	ram: Rework main RAM interface This replaces the simple_ram_behavioural and mw_soc_memory modules with a common wishbone_bram_wrapper.vhdl that interfaces the pipelined WB with a lower-level RAM module, along with an FPGA and a sim variants of the latter. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	d2762e70e5	Add option to not flatten hierarchy Vivado by default tries to flatten the module hierarchy to improve placement and timing. However this makes debugging timing issues really hard as the net names in the timing report can be pretty bogus. This adds a generic that can be used to control attributes to stop vivado from flattening the main core components. The resulting design will have worst timing overall but it will be easier to understand what the worst timing path are and address them. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	b513f0fb48	dcache: Add a dcache This replaces loadstore2 with a dcache The dcache unit is losely based on the icache one (same basic cache layout), but has some significant logic additions to deal with stores, loads with update, non-cachable accesses and other differences due to operating in the execution part of the pipeline rather than the fetch part. The cache is store-through, though a hit with an existing line will update the line rather than invalidate it. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Paul Mackerras	f49a5a99a5	Remove execute2 stage Since the condition setting got moved to writeback, execute2 does nothing aside from wasting a cycle. This removes it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	813f834012	Add CR hazard detection To keep things simple we treat the CR as a single entity. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	bdc26b7527	Add GPR hazard detection Check GPRs against any writers in the pipeline. All instructions are still marked single in pipeline at this stage. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	d5346d0abf	Separate issue control into its own unit Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	9b8c094cf6	Fix cmod-a7 frequency The cmod-a7 is ignoring the clk_frequency parameter and running at 100 MHz. Fix it. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	3c6e66dc96	Merge pull request #83 from paulusmack/logical execute: Consolidate count-leading/trailing-zeroes implementations	5 years ago
Anton Blanchard	4b7b702e01	Merge pull request #81 from antonblanchard/logical Consolidate logical instructions	5 years ago
Paul Mackerras	24a4a796ce	execute: Consolidate count-leading/trailing-zeroes implementations This adds combinatorial logic that does 32-bit and 64-bit count leading and trailing zeroes in one unit, and consolidates the four instructions under a single OP_CNTZ opcode. This saves 84 slice LUTs on the Arty A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	b8fb721b81	Consolidate logical instructions Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common invert on the input and output. This saves us about 200 LUTs. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Benjamin Herrenschmidt	b56b46b7d1	icache: Set associative icache This adds support for set associativity to the icache. It can still be direct mapped by setting NUM_WAYS to 1. The replacement policy uses a simple tree-PLRU for each set. This is only lightly tested, tests pass but I have to double check that we are using the ways effectively and not creating duplicates. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	004eb074c9	plru: Add a simple PLRU module Tested in sim only for now Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Paul Mackerras	f7c393ba7e	Add a rotate/mask/shift unit and use it in execute1 This adds a new entity 'rotator' which contains combinatorial logic for rotating and masking 64-bit values. It implements the operations of the rlwinm, rlwnm, rlwimi, rldicl, rldicr, rldic, rldimi, rldcl, rldcr, sld, slw, srd, srw, srad, sradi, sraw and srawi instructions. It consists of a 3-stage 64-bit rotator using 4:1 multiplexors at each stage, two mask generators, output logic and control logic. The insn_type_t values used for these instructions have been reduced to just 5: OP_RLC, OP_RLCL and OP_RLCR for the rotate and mask instructions (clear both left and right, clear left, clear right variants), OP_SHL for left shifts, and OP_SHR for right shifts. The control signals for the rotator are derived from the opcode and from the is_32bit and is_signed fields of the decode_rom_t. The rotator is instantiated as an entity in execute1 so that we can be sure we only have one of it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	9961d70dfb	Improve PLL/MMCM clocks configuration We can now pass both the input clock and target clock frequency via generics. Add support for both 50Mhz and 100Mhz target freqs for both cases. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	492bf06740	corefile: Remove duplicate wishbone_debug_master It's both in core and soc, it should only be in the latter Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	ab5c6ab9ac	fpga: Arty A7's don't need multiple filesets the XDC is identical between variants, so is the fileset Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Paul Mackerras	1158c81500	fpga: Add definitions for Arty A7-100 board These are a copy of the A7-35 definitions with 35 changed to 100. The A7-100 uses the same .xdc file (arty_a7-35.xdc) as the A7-35 since the only difference between the two is the FPGA part; the hardware and connections on the two boards are identical. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	b57325ce29	Merge branch 'divider' of https://github.com/paulusmack/microwatt	5 years ago

1 2

67 Commits (cache-tlb-parameters-2)