microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	6687aae4d6	core: Implement a simple branch predictor This implements a simple branch predictor in the decode1 stage. If it sees that the instruction is b or bc and the branch is predicted to be taken, it sends a flush and redirect upstream (to icache and fetch1) to redirect fetching to the branch target. The prediction is sent downstream with the branch instruction, and execute1 now only sends a flush/redirect upstream if the prediction was wrong. Unconditional branches are always predicted to be taken, and conditional branches are predicted to be taken if and only if the offset is negative. Branches that take the branch address from a register (bclr, bcctr) are predicted not taken, as we don't have any way to predict the branch address. Since we can now have a mflr being executed immediately after a bl or bcl, we now track the update to LR in the hazard tracker, using the second write register field that is used to track RA updates for update-form loads and stores. For those branches that update LR but don't write any other result (i.e. that don't decrementer CTR), we now write back LR in the same cycle as the instruction rather than taking a second cycle for the LR writeback. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b3799c432b	decode1: Add a stash buffer to the output This means that the busy signal from execute1 (which can be driven combinatorially from mmu or dcache) now stops at decode1 and doesn't go on to icache or fetch1. This helps with timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6701e7346b	core: Use a busy signal rather than a stall This changes the instruction dependency tracking so that we can generate a "busy" signal from execute1 and loadstore1 which comes along one cycle later than the current "stall" signal. This will enable us to signal busy cycles only when we need to from loadstore1. The "busy" signal from execute1/loadstore1 indicates "I didn't take the thing you gave me on this cycle", as distinct from the previous stall signal which meant "I took that but don't give me anything next cycle". That means that decode2 proactively gives execute1 a new instruction as soon as it has taken the previous one (assuming there is a valid instruction available from decode1), and that then sits in decode2's output until execute1 can take it. So instructions are issued by decode2 somewhat earlier than they used to be. Decode2 now only signals a stall upstream when its output buffer is full, meaning that we can fill up bubbles in the upstream pipe while a long instruction is executing. This gives a small boost in performance. This also adds dependency tracking for rA updates by update-form load/store instructions. The GPR and CR hazard detection machinery now has one extra stage, which may not be strictly necessary. Some of the code now really only applies to PIPELINE_DEPTH=1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	f80da65799	core: Double the dcache and icache sizes This makes the dcache and icache both be 8kB. This still only uses one BRAM per way per cache on the Artix-7, since the BRAMs were only half-used previously. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b5a7dbb78d	core: Remove fetch2 pipeline stage The fetch2 stage existed primarily to provide a stash buffer for the output of icache when a stall occurred. However, we can get the same effect -- of having the input to decode1 stay unchanged on a stall cycle -- by using the read enable of the BRAMs in icache, and by adding logic to keep the outputs unchanged on a clock cycle when stall_in = 1. This reduces branch and interrupt latency by one cycle. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	49a4d9f67a	Add core logging This logs 256 bits of data per cycle to a ring buffer in BRAM. The data collected can be read out through 2 new SPRs or through the debug interface. The new SPRs are LOG_ADDR (724) and LOG_DATA (725). LOG_ADDR contains the buffer write pointer in the upper 32 bits (in units of entries, i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of doublewords, i.e. 8 bytes). Reading LOG_DATA gives the doubleword from the buffer at the read pointer and increments the read pointer. Setting bit 31 of LOG_ADDR inhibits the trace log system from writing to the log buffer, so the contents are stable and can be read. There are two new debug addresses which function similarly to the LOG_ADDR and LOG_DATA SPRs. The log is frozen while either or both of the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set. The buffer defaults to 2048 entries, i.e. 64kB. The size is set by the LOG_LENGTH generic on the core_debug module. Software can determine the length of the buffer because the length is ORed into the buffer write pointer in the upper 32 bits of LOG_ADDR. Hence the length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)). There is a program to format the log entries in a somewhat readable fashion in scripts/fmt_log/fmt_log.c. The log_entry struct in that file describes the layout of the bits in the log entries. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	b863791e38	icache: Fix icbi potentially clobbering the icache (#192 ) icbi currently just resets the icache. This has some nasty side effects such as also clearing the TLB, but also the wishbone interface. That means that any ongoing cycle will be dropped. However, most of our slaves don't handle that well and will continue sending acks for already issued requests. Under some circumstances we can thus restart an icache load and get spurious ack/data from the wishbone left over from the "cancelled" sequence. This has broken booting Linux for me. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	f86fb74bfe	irq: Simplify xics->core irq input Use a simple wire. common.vhdl types are better kept for things local to the core. We can add more wires later if we need to for HV irqs etc... Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	4e78b8078e	Merge branch 'master' into litedram	5 years ago
Benjamin Herrenschmidt	acbdd396a5	soc/core: Add reset latches This adds one-cycle latches to the various resets out of the soc and into the various core modules. It seems to help vivado P&R a bit and has shown to avoid timing violations under some circumstances. Interestingly those resets never seem to appear in the bad timing path. It looks like those long resets simply impose placement constraints that Vivado satisfies at the expense of timing elsewhere. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Paul Mackerras	c164a2f4ea	Merge branch 'mmu' Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	3d4712ad43	Add TLB to icache This adds a direct-mapped TLB to the icache, with 64 entries by default. Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along with redirects to indicate whether instruction addresses should be translated through the TLB, and fetch1 sends that on to icache. Similarly a "priv_mode" signal is sent to indicate the privilege mode for instruction fetches. This means that changes to MSR[IR] or MSR[PR] don't take effect until the next redirect, meaning an isync, rfid, branch, etc. The icache uses a hash of the effective address (i.e. next instruction address) to index the TLB. The hash is an XOR of three fields of the address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and 24--29 of the address. TLB invalidations simply invalidate the indexed TLB entry without checking the contents. If the icache detects a TLB miss with virt_mode=1, it will send a fetch_failed indication through fetch2 to decode1, which will turn it into a special OP_FETCH_FAILED opcode with unit=LDST. That will get sent down to loadstore1 which will currently just raise a Instruction Storage Interrupt (0x400) exception. One bit in the PTE obtained from the TLB is used to check whether an instruction access is allowed -- the privilege bit (bit 3). If bit 3 is 1 and priv_mode=0, then a fetch_failed indication is sent down to fetch2 and to decode1, which generates an OP_FETCH_FAILED. Any PTEs with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put into the iTLB since such PTEs would not allow execution by any context. Tlbie operations get sent from mmu to icache over a new connection. Unfortunately the privileged instruction tests are broken for now. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	8160f4f821	Add framework for implementing an MMU This adds a new module to implement an MMU. At the moment it doesn't do very much. Tlbie instructions now get sent by loadstore1 to mmu, which sends them to dcache, rather than loadstore1 sending them directly to dcache. TLB misses from dcache now get sent by loadstore1 to mmu, which currently just returns an error. Loadstore1 then generates a DSI in response to the error return from mmu. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	42d0fcc511	Implement data storage interrupts This adds a path from loadstore1 back to execute1 for reporting errors, and machinery in execute1 for generating data storage interrupts at vector 0x300. If dcache is given two requests in successive cycles and the first encounters an error (e.g. a TLB miss), it will now cancel the second request. Loadstore1 now responds to errors reported by dcache by sending an exception signal to execute1 and returning to the idle state. Execute1 then writes SRR0 and SRR1 and jumps to the 0x300 Data Storage Interrupt vector. DAR and DSISR are held in loadstore1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	dd2e71930c	debug: Provide a way to examine GPRs, fast SPRs and MSR This provides commands on the debug interface to read the value of the MSR or any of the 64 GSPR register file entries. The GSPR values are read using the B port of the register file in a cycle when decode2 is not using it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	6853d22203	core: Add alternate reset address An external signal can control whether the core will start executing at the standard or the alternate reset address. This will be used when litedram is initialized by microwatt itself, to route the reset to the built-in init code secondary block RAM. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Michael Neuling	b4f20c20b9	XICS interrupt controller New unified ICP and ICS XICS compliant interrupt controller. Configurable number of hardware sources. Fixed hardware source number based on hardware line taken. All hardware interrupts are a fixed priority. Level interrupts supported only. Hardwired to 0xc0004000 in SOC (UART is kept at 0xc0002000). Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Paul Mackerras	b349cc891a	loadstore1: Move logic from dcache to loadstore1 So that the dcache could in future be used by an MMU, this moves logic to do with data formatting, rA updates for update-form instructions, and handling of unaligned loads and stores out of dcache and into loadstore1. For now, dcache connects only to loadstore1, and loadstore1 now has the connection to writeback. Dcache generates a stall signal to loadstore1 which indicates that the request presented in the current cycle was not accepted and should be presented again. However, loadstore1 doesn't currently use it because we know that we can never hit the circumstances where it might be set. For unaligned transfers, loadstore1 generates two requests to dcache back-to-back, and then waits to see two acks back from dcache (cycles where d_in.valid is true). Loadstore1 now has a FSM for tracking how many acks we are expecting from dcache and for doing the rA update cycles when necessary. Handling for reservations and conditional stores is still in dcache. Loadstore1 now generates its own stall signal back to decode2, so we no longer need the logic in execute1 that generated the stall for the first two cycles. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	1a244d3470	Remove single-issue constraint for most loads and stores This removes the constraint that loads and stores are single-issue, at the expense of a stall of at least 2 cycles for every load and store. To do this, we plumb the existing stall signal that was generated in dcache to core, where it gets ORed with the stall signal from execute1. Execute1 generates a stall signal for the first two cycles of each load and store, and dcache generates the stall signal in the 3rd and subsequent cycles if it needs to. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	5422007f83	Plumb loadstore1 input from execute1 not decode2 This allows us to use the bypass at the input of execute1 for the address and data operands for loadstore1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b14d982011	execute: Implement bypass from output of execute1 to input This enables back-to-back execution of integer instructions where the first instruction writes a GPR and the second reads the same GPR. This is done with a set of multiplexers at the start of execute1 which enable any of the three input operands to be taken from the output of execute1 (i.e. r.e.write_data) rather than the input from decode2 (i.e. e_in.read_data[123]). This also requires changes to the hazard detection and handling. Decode2 generates a signal indicating that the GPR being written is available for bypass, which is true for instructions that are executed in execute1 (rather than loadstore1/dcache). The gpr_hazard module stores this "bypassable" bit, and if the same GPR needs to be read by a subsequent instruction, it outputs a "use_bypass" signal rather than generating a stall. The use_bypass signal is then latched at the output of decode2 and passed down to execute1 to control the input multiplexer. At the moment there is no bypass on the inputs to loadstore1, but that is OK because all load and store instructions are marked as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	39d18d2738	Make divider hang off the side of execute1 With this, the divider is a unit that execute1 sends operands to and which sends its results back to execute1, which then send them to writeback. Execute1 now sends a stall signal when it gets a divide or modulus instruction until it gets a valid signal back from the divider. Divide and modulus instructions are no longer marked as single-issue. The data formatting step that used to be done in decode2 for div and mod instructions is now done in execute1. We also do the absolute value operation in that same cycle instead of taking an extra cycle inside the divider for signed operations with a negative operand. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	2167186b5f	Make multiplier hang off the side of execute1 With this, the multiplier isn't a separate pipe that decode2 issues instructions to, but rather is a unit that execute1 sends operands to and which sends the result back to execute1, which then sends it to writeback. Execute1 now sends a stall signal when it gets a multiply instruction until it gets a valid signal back from the multiplier. This all means that we no longer need to mark the multiply instructions as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	dcee60a729	Fix a ghdlsynth issue in icache ghdlsynth doesn't like the debug statement, so wrap it in a generate. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	467630573c	Dump CTR, LR and CR on sim termination, and update our tests Right now our test cases fold the SPRs into the GPRs. That makes debugging fails more difficult than it needs to be, so print out the CTR, LR and CR. We still need to print the XER, but that is in two spots in microwatt and will take some more work. This also adds many instructions to the tests that we have added lately including overflow instructions, CR logicals and mt/mfxer. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Benjamin Herrenschmidt	e4f475e17f	sprs: Store common SPRs in register file This stores the most common SPRs in the register file. This includes CTR and LR and a not yet final list of others. The register file is set to 64 entries for now. Specific types are defined that can represent a GPR index (gpr_index_t) or a GPR/SPR index (gspr_index_t) along with conversion functions between the two. On order to deal with some forms of branch updating both LR and CTR, we introduced a delayed update of LR after a branch link. Note: We currently stall the pipeline on such a delayed branch, but we could avoid stalling fetch in that specific case as we know we have a branch delay. We could also limit that to the specific case where we need to update both CTR and LR. This allows us to make bcreg, mtspr and mfspr pipelined. decode1 will automatically force the single issue flag on mfspr/mtspr to a "slow" SPR. [paulus@ozlabs.org - fix direction of decode2.stall_in] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	d2762e70e5	Add option to not flatten hierarchy Vivado by default tries to flatten the module hierarchy to improve placement and timing. However this makes debugging timing issues really hard as the net names in the timing report can be pretty bogus. This adds a generic that can be used to control attributes to stop vivado from flattening the main core components. The resulting design will have worst timing overall but it will be easier to understand what the worst timing path are and address them. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	247d7d4aa0	Merge pull request #113 from mikey/exec-sim-remove Remove SIM generic from execute1	5 years ago
Michael Neuling	bd4ac06243	Remove SIM generic from execute1 This does nothing, so remove. Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Benjamin Herrenschmidt	742b21480e	insn: Simplistic implementation of icbi We don't yet have a proper snooper for the icache, so for now make icbi just flush the whole thing Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	265fbf894b	icache/dcache: Make both caches 32 lines, 2 ways Adding lines seems to add only little extra as the BRAMs aren't full, 2 ways is our current comprimise to limit pressure on small FPGAs. We could go to 64 lines for a little more, but timing is becoming a bit too right to my linking on the tags/LRU path of the icache, so let's leave it at 32 for now. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	b513f0fb48	dcache: Add a dcache This replaces loadstore2 with a dcache The dcache unit is losely based on the icache one (same basic cache layout), but has some significant logic additions to deal with stores, loads with update, non-cachable accesses and other differences due to operating in the execution part of the pipeline rather than the fetch part. The cache is store-through, though a hit with an existing line will update the line rather than invalidate it. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Paul Mackerras	f49a5a99a5	Remove execute2 stage Since the condition setting got moved to writeback, execute2 does nothing aside from wasting a cycle. This removes it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d4f51e08c8	divider: Return 0 for invalid and overflow cases, like P9 does This adds logic to detect the cases where the quotient of the division overflows the range of the output representation, and return all zeroes in those cases, which is what POWER9 does. To do this, we extend the dividend register by 1 bit and we do an extra step in the division process to get a 2^64 bit of the quotient, which ends up in the 'overflow' signal. This catches all the cases where dividend >= 2^64 * divisor, including the case where divisor = 0, and the divde/divdeu cases where \|RA\| >= \|RB\|. Then, in the output stage, we also check that the result fits in the representable range, which depends on whether the division is a signed division or not, and whether it is a 32-bit or 64-bit division. If dividend >= 2^64 or the result doesn't fit in the representable range, write_data is set to 0 and write_cr_data to 0x20000000 (i.e. cr0.eq = 1). POWER9 sets the top 32 bits of the result to zero for 32-bit signed divisions, and sets CR0 when RC=1 according to the 64-bit value (i.e. CR0.LT is always 0 for 32-bit signed divisions, even if the 32-bit result is negative). However, modsw with a negative result sets the top 32 bits to all 1s. We follow suit. This updates divider_tb to check the invalid cases as well as the valid case. This also fixes a small bug where the reset signal for the divider was driven from rst when it should have been driven from core_rst. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	b56b46b7d1	icache: Set associative icache This adds support for set associativity to the icache. It can still be direct mapped by setting NUM_WAYS to 1. The replacement policy uses a simple tree-PLRU for each set. This is only lightly tested, tests pass but I have to double check that we are using the ways effectively and not creating duplicates. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	d40c1c1a25	icache: Use narrower block RAMs We only ever access the cache memory for at most the wishbone bus width at a time. So having the BRAMs organized as a cache-line-wide port is a waste of resources. Instead, use a wishbone-wide memory and store a line as consecutive rows in the BRAM. This significantly improves BRAM usage in the FPGA as we can now use more rows in the BRAM blocks. It also saves a few LUTs and muxes. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	d415e5544a	fetch/icache: Fit icache in BRAM The goal is to have the icache fit in BRAM by latching the output into a register. In order to avoid timing issues , we need to give the BRAM a full cycle on reads, and thus we souce the BRAM address directly from fetch1 latched NIA. (Note: This will be problematic if/when we want to hash the address, we'll probably be better off having fetch1 latch a fully hashed address along with the normal one, so the icache can use the former to address the BRAM and pass the latter along) One difficulty is that we cannot really stall the icache without adding more combo logic that would break the "one full cycle" BRAM model. This means that on stalls from decode, by the time we stall fetch1, it has already gone to the next address, which the icache is already latching. We work around this by having a "stash" buffer in fetch2 that will stash away the icache output on a stall, and override the output of the icache with the content of the stash buffer when unstalling. This requires a rewrite of the stop/step debug logic as well. We now do most of the hard work in fetch1 which makes more sense. Note: Vivado is still not inferring an built-in output register for the BRAMs. I don't want to add another cycle... I don't fully understand why it wouldn't be able to treat current_row as such but clearly it won't. At least the timing seems good enough now for 100Mhz, possibly more. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	ec1868f7d2	register_file: Move GPRs into distributed RAM The register file is currently implemented as a whole pile of individual 1-bit registers instead of LUT memory which is a huge waste of FPGA space. This is caused by the output signal exposing the register file to the outside world for simulation debug. This removes that output, and moves the dumping of the register file to the register file module itself. This saves about 8% of fpga on the little Arty A7-35T. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	b57325ce29	Merge branch 'divider' of https://github.com/paulusmack/microwatt	5 years ago
Anton Blanchard	6cae10eebd	Terminate test on illegal instruction This gets the CI going again, but we will want to fix the test harness since it's useful to be able to debug the core after it executes an illegal instruction. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	8c5dcc8c4c	Fix ghdl error I'm seeing an issue on my version of ghdl: core.vhdl:137:24:error: actual expression must be globally static Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	d5bc6c8824	Add a divider unit and a testbench for it This adds a divider unit, connected to the core in much the same way that the multiplier unit is connected. The division algorithm is very simple-minded, taking 64 clock cycles for any division (even 32-bit division instructions). The decoding is simplified by making use of regularities in the instruction encoding for div* and mod* instructions. Instead of having PPC_* encodings from the first-stage decoder for each of the different div* and mod* instructions, we now just have PPC_DIV and PPC_MOD, and the inputs to the divider that indicate what sort of division operation to do are derived from instruction word bits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	98f0994698	Add core debug module This module adds some simple core controls: reset, stop, start, step along with icache clear and reading the NIA and core status bits Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org	5 years ago
Anton Blanchard	89849a6856	Add a simple direct mapped icache Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	4528ef2b43	Reformat core.vhdl	5 years ago
Anton Blanchard	92a7152370	Rework pipeline, add stall and flush signals This adds stall and flush signals to the pipeline. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Benjamin Herrenschmidt	c97b080d8c	Move wishbone arbiter out of the core Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	147b259691	Use a better input signal in writeback w_in comes from the execution unit, it makes more sense to call it e_in. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	5a29cb4699	Initial import of microwatt Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago

49 Commits (6687aae4d659e79c429c60ebbc07bfac7686365a)