microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	06c13d4988	decode1: Work out register addresses in decode1 This adds some relatively simple logic to decode1 to compute the GPR/FPR addresses that an instruction will access. It always computes three addresses regardless of whether the instruction will actually use all of them. The main things it computes are whether the instruction uses the RS field or the RC field for the 3rd operand, and whether the operands are FPRs or GPRs (it is possible for RS to be an FPR but RA and RB to be GPRs, as for example with stfdx). At the moment all we do with these computed register addresses is to assert that they are identical to the ones coming from decode2 one cycle later. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	af814a0d5e	Provide debug access to SPRs in loadstore1 and mmu They are accessible as GSPR 0x3c - PID, 0x3d - PTCR, 0x3e - DSISR and 0x3f - DAR. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	d0f319290f	Restore debug access to SPRs This provides access to the SPRs via the JTAG DMI interface. For now they are still accessed as if they were GPR/FPRs using the same numbering as before (GPRs at 0 - 0x1f, SPRs at 0x20 - 0x2d, FPRs at 0x40 - 0x5f). For XER, debug reads now report the full value, not just the bits that were previously stored in the register file. The "slow" SPR mux is not used for debug reads. Decode2 determines on each cycle whether a debug SPR access will happen next cycle, based on whether there is a request and whether the current instruction accesses the SPR RAM. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	337b104250	Move LR, CTR and TAR out of the register file By putting CTR on the odd side and LR and TAR on the even side, we can read and write CTR for bdnz-style instructions in parallel with reading LR or TAR for indirect branches and writing LR for branches with LK=1. Thus we don't need to double up any of these instructions, giving a simplification in decode2. We now have logic for printing LR and CTR at the end of a simulation in execute1, in addition to the similar logic in register_file and cr_file. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	bc4d02cb0d	Start removing SPRs from register file This starts the process of removing SPRs from the register file by moving SRR0/1, SPRG0-3, HSRR0/1 and HSPRG0/1 out of the register file and putting them into execute1. They are stored in a pair of small RAM arrays, referred to as "even" and "odd". The reason for having two arrays is so that two values can be read and written in each cycle. For example, SRR0 and SRR1 can be written in parallel by an interrupt and read in parallel by the rfid instruction. The addresses in the RAM which will be accessed are determined in the decode2 stage. We have one write address for both sides, but two read addresses, since in future we will want to be able to read CTR at the same time as either LR or TAR. We now have a connection from writeback to execute1 which carries the partial SRR1 value for an interrupt. SRR0 comes from the execute pipeline; we no longer need to carry instruction addresses along the LSU and FPU pipelines. Since SRR0 and SRR1 can be written in the same cycle now, we don't need the little state machine in writeback any more. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	2f45e545ed	decode2: Rework to make the stall_out signal come from a register At present the busy/stall signal going to decode1 depends on whether control thinks it can issue the current instruction, and that depends on completion and bypass signals coming from execute1 and writeback. To improve the timing of stall_out, this rearranges decode2 so that stall_out is asserted when we have a valid instruction that couldn't be issued in the previous cycle. This means that decode1 could give us a new instruction when we haven't issued the previous instruction. This in turn means that we can only use d_in in the first cycle of processing an instruction. After the first cycle, we get register addresses etc. from dc2 rather than d_in. Then, to avoid the need to read register operands from register_file in each cycle until the instruction issues, we bring the bypass path for data being written to the register file into decode2 explicitly rather than having it in register_file. A new process called decode2_addrs does the process of calling decode_input_reg_* and decode_output_reg and sets up the register file addresses. This was split out (and decode_input_reg_* reworked) to try to reduce the number of passes through the decode2_1 process that need to be done in simulation. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	9a8a8e50f8	FPU: Add stage-2 stall ability to FPU This makes the FPU able to stall other units at execute stage 2 and be stalled by other units (specifically the LSU). This means that the completion and writeback for an instruction can now end up being deferred until the second cycle of a following instruction, i.e. the cycle when the state machine has gone through IDLE state into one of the DO_* states, which means we need to latch the destination FPR number, CR mask, etc. from the previous instruction so that we present the correct information to writeback. The advantage of this is that we can get rid of the in_progress signal from the LSU. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	4b6148ada6	Add a bypass path from the execute2 stage This enables some instructions to issue earlier and thus improves performance, at the cost of some extra multiplexers in decode2. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	204fedc63f	Move XER low bits out of register file Besides the overflow and status carry bits, XER has 18 bits which need to retain the value written by mtxer (in case software wants to emulate the move-assist instructions (lswi, lswx, stswi, stswx)). Until now these bits (and others) have been stored in the GPR file as a "fast" SPR, but this causes complications because XER is not really a fast SPR. Instead, we now store these 18 bits in the 'ctrl' signal, which exists in execute1. This will enable us to simplify the data path in future, and has the added bonus that with a little bit of plumbing, we can get the full XER value printed when dumping registers at the end of a simulation. Therefore this changes scripts/run_test.sh to remove the greps which exclude XER from the comparison of actual and expected register results. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Anton Blanchard	a527d9b959	core: Remove unused icache_inv signal Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	2 years ago
Anton Blanchard	a750365ffa	Remove some FPGA style signal inits These don't work on the ASIC flow, so remove them and initialise them explicitly where required. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	2 years ago
Paul Mackerras	734e4c4a52	core: Add a short multiplier This adds an optional 16 bit x 16 bit signed multiplier and uses it for multiply instructions that return the low 64 bits of the product (mull[dw][o] and mulli, but not maddld) when the operands are both in the range -2^15 .. 2^15 - 1. The "short" 16-bit multiplier produces its result combinatorially, so a multiply that uses it executes in one cycle. This improves the coremark result by about 4%, since coremark does quite a lot of multiplies and they almost all have operands that fit into 16 bits. The presence of the short multiplier is controlled by a generic at the execute1, SOC, core and top levels. For now, it defaults to off for all platforms, and can be enabled using the --has_short_mult flag to fusesoc. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	65c43b488b	PMU: Add several more events This implements most of the architected PMU events. The ones missing are mostly the ones that depend on which level of the cache hierarchy data is fetched from. The events implemented here, and their raw event codes, are: Floating-point operation completed (100f4) Load completed (100fc) Store completed (200f0) Icache miss (200fc) ITLB miss (100f6) ITLB miss resolved (400fc) Dcache load miss (400f0) Dcache load miss resolved (300f8) Dcache store miss (300f0) DTLB miss (300fc) DTLB miss resolved (200f6) No instruction available and none being executed (100f8) Instruction dispatched (200f2, 300f2, 400f2) Taken branch instruction completed (200fa) Branch mispredicted (400f6) External interrupt taken (200f8) Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	a7873b45f7	core: Add a basic performance monitor unit (PMU) implementation This is the start of an implementation of a PMU according to PowerISA v3.0B. Things not implemented yet include most architected events, the BHRB, event-based branches, thresholding, MMCR0[TBCC] field, etc. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	231003f7c7	icache: Snoop writes to memory by other agents This makes the icache snoop writes to memory in the same way that the dcache does, thus making DMA cache-coherent for the icache as well as the dcache. This also simplifies the logic for the WAIT_ACK state by removing the stbs_done variable, since is_last_row(r.store_row, r.end_row_ix) can only be true when stbs_done is true. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	eb7eba2d92	dcache: Snoop writes to memory by other agents This adds a path where the wishbone that goes out to memory and I/O also gets fed back to the dcache, which looks for writes that it didn't initiate, and invalidates any cache line that gets written to. This involves a second read port on the cache tag RAM for looking up the snooped writes, and effectively a second write port on the cache valid bit array to clear bits corresponding to snoop hits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Anton Blanchard	2d21b95f87	Pass icache/dcache/tlb parameters down from soc We want much smaller caches and tlbs when building for sky130, so allow the toplevel file to override the defaults. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Paul Mackerras	3cd3449b4b	core: Move redirect and interrupt delivery logic to writeback This moves the logic for redirecting fetching and writing SRR0 and SRR1 to writeback. The aim is that ultimately units other than execute1 can send their interrupts to writeback along with their instruction completions, so that there can be multiple instructions in flight without needing execute1 to keep track of the address of each outstanding instruction. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	ae2afeca5c	core: Track CR hazards and bypasses using tags Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	d290d2a9bb	core: Restore bypass path from execute1 This changes the bypass path. Previously it went from after execute1's output to after decode2's output. Now it goes from before execute1's output register to before decode2's output register. The reason is that the new path will be simpler to manage when there are possibly multiple instructions in flight. This means that the bypassing can be managed inside decode2 and control. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	c0b45e153b	core: Track GPR hazards using tags that propagate through the pipelines This changes the way GPR hazards are detected and tracked. Instead of having a model of the pipeline in gpr_hazard.vhdl, which has to mirror the behaviour of the real pipeline exactly, we now assign a 2-bit tag to each instruction and record which GSPR the instruction writes. Subsequent instructions that need to use the GSPR get the tag number and stall until the value with that tag is being written back to the register file. For now, the forwarding paths are disabled. That gives about an 8% reduction in coremark performance. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0fb207be60	fetch1: Implement a simple branch target cache This implements a cache in fetch1, where each entry stores the address of a simple branch instruction (b or bc) and the target of the branch. When fetching sequentially, if the address being fetched matches the cache entry, then fetching will be redirected to the branch target. The cache has 1024 entries and is direct-mapped, i.e. indexed by bits 11..2 of the NIA. The bus from execute1 now carries information about taken and not-taken simple branches, which fetch1 uses to update the cache. The cache entry is updated for both taken and not-taken branches, with the valid bit being set if the branch was taken and cleared if the branch was not taken. If fetching is redirected to the branch target then that goes down the pipe as a predicted-taken branch, and decode1 does not do any static branch prediction. If fetching is not redirected, then the next instruction goes down the pipe as normal and decode1 does its static branch prediction. In order to make timing, the lookup of the cache is pipelined, so on each cycle the cache entry for the current NIA + 8 is read. This means that after a redirect (from decode1 or execute1), only the third and subsequent sequentially-fetched instructions will be able to be predicted. This improves the coremark value on the Arty A7-100 from about 180 to about 190 (more than 5%). The BTC is optional. Builds for the Artix 7 35-T part have it off by default because the extra ~1420 LUTs it takes mean that the design doesn't fit on the Arty A7-35 board. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Anton Blanchard	659be2780f	Fully initialize FPU buses when FPU is disabled Some of the bits in the FPU buses end up as z state. Yosys flags them, so we may as well clean it up. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Paul Mackerras	856e9e955f	core: Add framework for an FPU This adds the skeleton of a floating-point unit and implements the mffs and mtfsf instructions. Execute1 sends FP instructions to the FPU and receives busy, exception, FP interrupt and illegal interrupt signals from it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	45cd8f4fc3	core: Add support for floating-point loads and stores This extends the register file so it can hold FPR values, and implements the FP loads and stores that do not require conversion between single and double precision. We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores cause a FP unavailable interrupt if MSR[FP] = 0. The FPU facilities are optional and their presence is controlled by the HAS_FPU generic passed down from the top-level board file. It defaults to true for all except the A7-35 boards. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	893d2bc6a2	core: Don't generate logic for log data when LOG_LENGTH = 0 This adds "if LOG_LENGTH > 0 generate" to the places in the core where log output data is latched, so that when LOG_LENGTH = 0 we don't create the logic to collect the data which won't be stored. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	78de4fef72	Make LOG_LENGTH configurable per FPGA variant This plumbs the LOG_LENGTH parameter (which controls how many entries the core log RAM has) up to the top level so that it can be set on the fusesoc command line and have different default values on different FPGAs. It now defaults to 512 entries generally and on the Artix-7 35 parts, and 2048 on the larger Artix-7 FPGAs. It can be set to 0 if desired. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6687aae4d6	core: Implement a simple branch predictor This implements a simple branch predictor in the decode1 stage. If it sees that the instruction is b or bc and the branch is predicted to be taken, it sends a flush and redirect upstream (to icache and fetch1) to redirect fetching to the branch target. The prediction is sent downstream with the branch instruction, and execute1 now only sends a flush/redirect upstream if the prediction was wrong. Unconditional branches are always predicted to be taken, and conditional branches are predicted to be taken if and only if the offset is negative. Branches that take the branch address from a register (bclr, bcctr) are predicted not taken, as we don't have any way to predict the branch address. Since we can now have a mflr being executed immediately after a bl or bcl, we now track the update to LR in the hazard tracker, using the second write register field that is used to track RA updates for update-form loads and stores. For those branches that update LR but don't write any other result (i.e. that don't decrement CTR), we now write back LR in the same cycle as the instruction rather than taking a second cycle for the LR writeback. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b3799c432b	decode1: Add a stash buffer to the output This means that the busy signal from execute1 (which can be driven combinatorially from mmu or dcache) now stops at decode1 and doesn't go on to icache or fetch1. This helps with timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6701e7346b	core: Use a busy signal rather than a stall This changes the instruction dependency tracking so that we can generate a "busy" signal from execute1 and loadstore1 which comes along one cycle later than the current "stall" signal. This will enable us to signal busy cycles only when we need to from loadstore1. The "busy" signal from execute1/loadstore1 indicates "I didn't take the thing you gave me on this cycle", as distinct from the previous stall signal which meant "I took that but don't give me anything next cycle". That means that decode2 proactively gives execute1 a new instruction as soon as it has taken the previous one (assuming there is a valid instruction available from decode1), and that then sits in decode2's output until execute1 can take it. So instructions are issued by decode2 somewhat earlier than they used to be. Decode2 now only signals a stall upstream when its output buffer is full, meaning that we can fill up bubbles in the upstream pipe while a long instruction is executing. This gives a small boost in performance. This also adds dependency tracking for rA updates by update-form load/store instructions. The GPR and CR hazard detection machinery now has one extra stage, which may not be strictly necessary. Some of the code now really only applies to PIPELINE_DEPTH=1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	f80da65799	core: Double the dcache and icache sizes This makes the dcache and icache both be 8kB. This still only uses one BRAM per way per cache on the Artix-7, since the BRAMs were only half-used previously. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b5a7dbb78d	core: Remove fetch2 pipeline stage The fetch2 stage existed primarily to provide a stash buffer for the output of icache when a stall occurred. However, we can get the same effect -- of having the input to decode1 stay unchanged on a stall cycle -- by using the read enable of the BRAMs in icache, and by adding logic to keep the outputs unchanged on a clock cycle when stall_in = 1. This reduces branch and interrupt latency by one cycle. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	49a4d9f67a	Add core logging This logs 256 bits of data per cycle to a ring buffer in BRAM. The data collected can be read out through 2 new SPRs or through the debug interface. The new SPRs are LOG_ADDR (724) and LOG_DATA (725). LOG_ADDR contains the buffer write pointer in the upper 32 bits (in units of entries, i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of doublewords, i.e. 8 bytes). Reading LOG_DATA gives the doubleword from the buffer at the read pointer and increments the read pointer. Setting bit 31 of LOG_ADDR inhibits the trace log system from writing to the log buffer, so the contents are stable and can be read. There are two new debug addresses which function similarly to the LOG_ADDR and LOG_DATA SPRs. The log is frozen while either or both of the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set. The buffer defaults to 2048 entries, i.e. 64kB. The size is set by the LOG_LENGTH generic on the core_debug module. Software can determine the length of the buffer because the length is ORed into the buffer write pointer in the upper 32 bits of LOG_ADDR. Hence the length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)). There is a program to format the log entries in a somewhat readable fashion in scripts/fmt_log/fmt_log.c. The log_entry struct in that file describes the layout of the bits in the log entries. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	b863791e38	icache: Fix icbi potentially clobbering the icache (#192) icbi currently just resets the icache. This has some nasty side effects, such as clearing not only the TLB but also the wishbone interface. That means that any ongoing cycle will be dropped. However, most of our slaves don't handle that well and will continue sending acks for already issued requests. Under some circumstances we can thus restart an icache load and get spurious ack/data from the wishbone left over from the "cancelled" sequence. This has broken booting Linux for me. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	f86fb74bfe	irq: Simplify xics->core irq input Use a simple wire. common.vhdl types are better kept for things local to the core. We can add more wires later if we need to for HV irqs etc... Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	4e78b8078e	Merge branch 'master' into litedram	5 years ago
Benjamin Herrenschmidt	acbdd396a5	soc/core: Add reset latches This adds one-cycle latches to the various resets out of the soc and into the various core modules. It seems to help vivado P&R a bit and has shown to avoid timing violations under some circumstances. Interestingly those resets never seem to appear in the bad timing path. It looks like those long resets simply impose placement constraints that Vivado satisfies at the expense of timing elsewhere. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Paul Mackerras	c164a2f4ea	Merge branch 'mmu' Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	3d4712ad43	Add TLB to icache This adds a direct-mapped TLB to the icache, with 64 entries by default. Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along with redirects to indicate whether instruction addresses should be translated through the TLB, and fetch1 sends that on to icache. Similarly a "priv_mode" signal is sent to indicate the privilege mode for instruction fetches. This means that changes to MSR[IR] or MSR[PR] don't take effect until the next redirect, meaning an isync, rfid, branch, etc. The icache uses a hash of the effective address (i.e. next instruction address) to index the TLB. The hash is an XOR of three fields of the address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and 24--29 of the address. TLB invalidations simply invalidate the indexed TLB entry without checking the contents. If the icache detects a TLB miss with virt_mode=1, it will send a fetch_failed indication through fetch2 to decode1, which will turn it into a special OP_FETCH_FAILED opcode with unit=LDST. That will get sent down to loadstore1 which will currently just raise an Instruction Storage Interrupt (0x400) exception. One bit in the PTE obtained from the TLB is used to check whether an instruction access is allowed -- the privilege bit (bit 3). If bit 3 is 1 and priv_mode=0, then a fetch_failed indication is sent down to fetch2 and to decode1, which generates an OP_FETCH_FAILED. Any PTEs with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put into the iTLB since such PTEs would not allow execution by any context. Tlbie operations get sent from mmu to icache over a new connection. Unfortunately the privileged instruction tests are broken for now. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	8160f4f821	Add framework for implementing an MMU This adds a new module to implement an MMU. At the moment it doesn't do very much. Tlbie instructions now get sent by loadstore1 to mmu, which sends them to dcache, rather than loadstore1 sending them directly to dcache. TLB misses from dcache now get sent by loadstore1 to mmu, which currently just returns an error. Loadstore1 then generates a DSI in response to the error return from mmu. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	42d0fcc511	Implement data storage interrupts This adds a path from loadstore1 back to execute1 for reporting errors, and machinery in execute1 for generating data storage interrupts at vector 0x300. If dcache is given two requests in successive cycles and the first encounters an error (e.g. a TLB miss), it will now cancel the second request. Loadstore1 now responds to errors reported by dcache by sending an exception signal to execute1 and returning to the idle state. Execute1 then writes SRR0 and SRR1 and jumps to the 0x300 Data Storage Interrupt vector. DAR and DSISR are held in loadstore1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	dd2e71930c	debug: Provide a way to examine GPRs, fast SPRs and MSR This provides commands on the debug interface to read the value of the MSR or any of the 64 GSPR register file entries. The GSPR values are read using the B port of the register file in a cycle when decode2 is not using it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	6853d22203	core: Add alternate reset address An external signal can control whether the core will start executing at the standard or the alternate reset address. This will be used when litedram is initialized by microwatt itself, to route the reset to the built-in init code secondary block RAM. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Michael Neuling	b4f20c20b9	XICS interrupt controller New unified ICP and ICS XICS compliant interrupt controller. Configurable number of hardware sources. Fixed hardware source number based on hardware line taken. All hardware interrupts are a fixed priority. Level interrupts supported only. Hardwired to 0xc0004000 in SOC (UART is kept at 0xc0002000). Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Paul Mackerras	b349cc891a	loadstore1: Move logic from dcache to loadstore1 So that the dcache could in future be used by an MMU, this moves logic to do with data formatting, rA updates for update-form instructions, and handling of unaligned loads and stores out of dcache and into loadstore1. For now, dcache connects only to loadstore1, and loadstore1 now has the connection to writeback. Dcache generates a stall signal to loadstore1 which indicates that the request presented in the current cycle was not accepted and should be presented again. However, loadstore1 doesn't currently use it because we know that we can never hit the circumstances where it might be set. For unaligned transfers, loadstore1 generates two requests to dcache back-to-back, and then waits to see two acks back from dcache (cycles where d_in.valid is true). Loadstore1 now has a FSM for tracking how many acks we are expecting from dcache and for doing the rA update cycles when necessary. Handling for reservations and conditional stores is still in dcache. Loadstore1 now generates its own stall signal back to decode2, so we no longer need the logic in execute1 that generated the stall for the first two cycles. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	1a244d3470	Remove single-issue constraint for most loads and stores This removes the constraint that loads and stores are single-issue, at the expense of a stall of at least 2 cycles for every load and store. To do this, we plumb the existing stall signal that was generated in dcache to core, where it gets ORed with the stall signal from execute1. Execute1 generates a stall signal for the first two cycles of each load and store, and dcache generates the stall signal in the 3rd and subsequent cycles if it needs to. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	5422007f83	Plumb loadstore1 input from execute1 not decode2 This allows us to use the bypass at the input of execute1 for the address and data operands for loadstore1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b14d982011	execute: Implement bypass from output of execute1 to input This enables back-to-back execution of integer instructions where the first instruction writes a GPR and the second reads the same GPR. This is done with a set of multiplexers at the start of execute1 which enable any of the three input operands to be taken from the output of execute1 (i.e. r.e.write_data) rather than the input from decode2 (i.e. e_in.read_data[123]). This also requires changes to the hazard detection and handling. Decode2 generates a signal indicating that the GPR being written is available for bypass, which is true for instructions that are executed in execute1 (rather than loadstore1/dcache). The gpr_hazard module stores this "bypassable" bit, and if the same GPR needs to be read by a subsequent instruction, it outputs a "use_bypass" signal rather than generating a stall. The use_bypass signal is then latched at the output of decode2 and passed down to execute1 to control the input multiplexer. At the moment there is no bypass on the inputs to loadstore1, but that is OK because all load and store instructions are marked as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	39d18d2738	Make divider hang off the side of execute1 With this, the divider is a unit that execute1 sends operands to and which sends its results back to execute1, which then sends them to writeback. Execute1 now sends a stall signal when it gets a divide or modulus instruction until it gets a valid signal back from the divider. Divide and modulus instructions are no longer marked as single-issue. The data formatting step that used to be done in decode2 for div and mod instructions is now done in execute1. We also do the absolute value operation in that same cycle instead of taking an extra cycle inside the divider for signed operations with a negative operand. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	2167186b5f	Make multiplier hang off the side of execute1 With this, the multiplier isn't a separate pipe that decode2 issues instructions to, but rather is a unit that execute1 sends operands to and which sends the result back to execute1, which then sends it to writeback. Execute1 now sends a stall signal when it gets a multiply instruction until it gets a valid signal back from the multiplier. This all means that we no longer need to mark the multiply instructions as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
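
The "Add core logging" commit (49a4d9f67a) states that the buffer length is ORed into the write pointer in the upper 32 bits of LOG_ADDR, so the length can be recovered as 1 << (31 - clz(LOG_ADDR)). A minimal Python sketch of that arithmetic follows. It assumes clz counts leading zeros over the full 64-bit register value; the names clz64 and log_buffer_entries are hypothetical helpers and do not come from the repository.

```python
def clz64(value: int) -> int:
    """Count leading zero bits of a 64-bit unsigned value (assumed width)."""
    if not 0 <= value < (1 << 64):
        raise ValueError("value must fit in 64 bits")
    return 64 - value.bit_length()


def log_buffer_entries(log_addr: int) -> int:
    """Recover the log buffer length in entries from a raw LOG_ADDR value.

    Per the commit message, the length is ORed into the write pointer held
    in the upper 32 bits, so the topmost set bit of LOG_ADDR marks the length.
    """
    if (log_addr >> 32) == 0:
        # No length bit present, e.g. when the log is configured out.
        raise ValueError("upper 32 bits of LOG_ADDR are zero")
    return 1 << (31 - clz64(log_addr))


# Illustrative value: 2048-entry buffer, write pointer 5, read pointer 17.
example = ((2048 | 5) << 32) | 17
print(log_buffer_entries(example))
```

Since each entry is 32 bytes according to the same commit, multiplying the result by 32 gives the buffer size in bytes (64kB for the default 2048 entries).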

76 Commits (01f8ad55efffd21ba0371c8c834c035814cc0d19)