Commit Graph

147 Commits (master)

Paul Mackerras 73b6004ac6 icache: Use next real address to index icache
Now that we are translating the fetch effective address to a real
address one cycle earlier, we can use the real address to index the
icache array.
This has the benefit that the set size can be larger than a page,
enabling us to configure the icache to be larger without having to
increase its associativity.  Previously the set size was limited to
the page size to avoid aliasing problems.  Thus for example a 32kB
icache would need to be 8-way associative, resulting in large numbers
of LUTs being used for tag comparisons in FPGA implementations, and
poor timing.  With this change, a 32kB icache can be 1 or 2-way
associative, which means deeper and narrower tag and data RAMs and
fewer tag comparators.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
7 months ago
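
The aliasing constraint mentioned above is simple arithmetic: when a cache
is indexed by the effective address, the set size (cache size divided by
associativity) must not exceed the page size.  A small illustrative Python
sketch (the 4kB page size is from the commit; the helper name is made up):

    # Minimum associativity for an effective-address-indexed cache: the set
    # size (cache_bytes / ways) must not exceed the page size, or two
    # effective-address aliases of one real line could land in different sets.
    def min_ways(cache_bytes: int, page_bytes: int = 4096) -> int:
        return max(1, cache_bytes // page_bytes)

    print(min_ways(32 * 1024))   # -> 8, matching the 8-way example above
    # Indexing with the real address removes the constraint, so the same
    # 32kB cache can be 1- or 2-way associative.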
Paul Mackerras f9e5622327 Move iTLB from icache to fetch1
This moves the address translation step for instruction fetches one
cycle earlier, so that it now happens in the fetch1 stage.  There is
now a 2-entry mini translation cache ("ERAT", or effective to real
address translation cache) which operates on the output of the
multiplexer that selects the instruction address for the next cycle.
The ERAT consists of two effective address registers and two
corresponding real address registers.  They store the page number part
of the addresses for a 4kB page size, which is the smallest page size
supported by the architecture.

If the effective address doesn't match either of the EA registers, and
address translation is enabled, then i_out.req goes low for two cycles
while the iTLB is looked up.  Experimentally, this delay results in a
0.1% drop in coremark performance; allowing two cycles for the lookup
results in better timing.  The result from the iTLB is placed into the
least recently used ERAT entry and then used to translate the address
as normal.  If address translation is not enabled then the EA is used
directly as the real address.

The iTLB structure is the same as it was before; direct mapped,
indexed using a hashed EA.

The "fetch failed" signal, which indicates a TLB miss or protection
violation, is now generated in fetch1 and passed through icache.
When it is asserted, fetch1 goes into a stalled state until a PTE
arrives from the MMU (which gets put into both the iTLB and the ERAT),
or an interrupt or redirect occurs.

Any TLB invalidations from the MMU invalidate the whole ERAT.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
7 months ago
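
A behavioural Python model (not the actual VHDL; the names are invented) of
the 2-entry ERAT described above: two EA-page/RA-page register pairs with a
single LRU indicator, filled from the iTLB on a miss and flushed entirely on
any TLB invalidation.

    PAGE_SHIFT = 12  # 4kB pages, the smallest page size supported

    class Erat:
        def __init__(self):
            self.ea_page = [None, None]   # effective page numbers
            self.ra_page = [None, None]   # corresponding real page numbers
            self.lru = 0                  # index of the least recently used entry

        def translate(self, ea, itlb_lookup):
            page, offset = ea >> PAGE_SHIFT, ea & ((1 << PAGE_SHIFT) - 1)
            for i in (0, 1):
                if self.ea_page[i] == page:
                    self.lru = 1 - i      # the other entry becomes LRU
                    return (self.ra_page[i] << PAGE_SHIFT) | offset
            # Miss: in hardware i_out.req drops for two cycles while the iTLB
            # is looked up; the result replaces the LRU entry.
            ra_page = itlb_lookup(page)
            victim = self.lru
            self.ea_page[victim], self.ra_page[victim] = page, ra_page
            self.lru = 1 - victim
            return (ra_page << PAGE_SHIFT) | offset

        def invalidate_all(self):
            self.ea_page = [None, None]
            self.ra_page = [None, None]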
Paul Mackerras f34a54d295 fetch1: Streamline next NIA generation further
This reduces the number of possible sources for the next NIA from 4
down to 3, by routing interrupt vector addresses through the
r_int.next_nia register, as is already done for reset.  This adds one
extra cycle of latency when taking interrupts.  During this extra cycle,
i_out.req is 0.

Writeback now no longer combines redirects (branches, rfid, isync)
with interrupts; they are presented separately to fetch1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
7 months ago
Paul Mackerras e92d49375f fetch1: Reorganize fetch1 to provide an asynchronous early next NIA to icache
This adds a next_nia field to the Fetch1ToIcacheType record, which
provides an indication of what will be in the nia field on the next
non-stalled cycle.  This is intended to be as fast as possible, being
a selection from two redirect addresses (from writeback and decode1)
or an internal register (r_int.next_nia).  Reset addresses and
predicted branch targets come through this internal register.

The rearrangement here has the side effect that we can now use the BTC
on the first instruction after a taken branch, whereas previously the
BTC was only active starting with the second instruction after a taken
branch.  This provides a slight improvement in performance.

This also fixes a buglet in icache where it would assert its stall
output when i_in.req was false.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
7 months ago
Paul Mackerras 1c4b5def36 Improve timing of redirect_nia going from writeback to fetch1
This gets rid of the adder in writeback that computes redirect_nia.
Instead, the main adder in the ALU is used to compute the branch
target for relative branches.  We now decode b and bc differently
depending on the AA field, generating INSN_brel, INSN_babs, INSN_bcrel
or INSN_bcabs as appropriate.  Each one has a separate entry in the
decode table in decode1; the *rel versions use CIA as the A input.
The bclr/bcctr/bctar and rfid instructions now select ramspr_result
for the main result mux to get the redirect address into
ex1.e.write_data.

For branches which are predicted taken but not actually taken, we need
to redirect to the following instruction.  We also need to do that for
isync.  We do this in the execute2 stage since whether or not to do it
depends on the branch result.  The next_nia computation is moved to
the execute2 stage and comes in via a new leg on the secondary result
multiplexer, making next_nia available ultimately in ex2.e.write_data.
This also means that the next_nia leg of the primary result
multiplexer is gone.  Incrementing last_nia by 4 for sc (so that SRR0
points to the following instruction) is also moved to execute2.

Writing CIA+4 to LR was previously done through the main result
multiplexer.  Now it comes in explicitly in the ramspr write logic.

Overall this removes the br_offset and abs_br fields and the logic to
add br_offset and next_nia, and one leg of the primary result
multiplexer, at the cost of a few extra control signals between
execute1 and execute2 and some multiplexing for the ramspr write side
and an extra input on the secondary result multiplexer.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
7 months ago
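
A minimal Python illustration of the target selection described above (names
invented; the real logic lives in the ALU adder and the ramspr path): the
*rel decode variants feed CIA into the A input of the main adder, the *abs
variants do not, and a predicted-taken branch that resolves as not taken
redirects to the following instruction via next_nia in execute2.

    MASK64 = (1 << 64) - 1

    def branch_target(cia: int, displacement: int, absolute: bool) -> int:
        a = 0 if absolute else cia       # INSN_babs/bcabs vs INSN_brel/bcrel
        return (a + displacement) & MASK64

    def not_taken_redirect(cia: int) -> int:
        return (cia + 4) & MASK64        # redirect to the following instruction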
Paul Mackerras c4492c843a Implement interrupts for prefixed instructions
This arranges to generate an illegal instruction type program
interrupt for illegal prefixed instructions, that is, those where the
suffix is not a legal value given the prefix, or the prefix has a
reserved value in the subtype field.  This implementation doesn't
generate an interrupt for the invalid 8LS:D and MLS:D instruction
forms where R = 1 and RA != 0.  (In those cases it uses (RA) as the
addend, i.e. it ignores the R bit.)

This detects the case where the address of an instruction prefix is
equal to 60 modulo 64 (i.e. the suffix would lie in the next 64-byte
block), and generates an alignment interrupt in that case.

This also arranges to set bit 34 of SRR1 when an interrupt occurs due
to a prefixed instruction, for those interrupts where that is required
(i.e. trace, alignment, floating-point unavailable, data storage, data
segment, and most cases of program interrupt).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
10 months ago
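
The boundary check is straightforward; a tiny Python sketch (the function
name is made up) of the mod-64 test described above, assuming a prefix at
offset 60 within a 64-byte block would place its suffix across the boundary:

    def prefix_alignment_fault(prefix_addr: int) -> bool:
        # A prefix whose address is 60 mod 64 would have its suffix in the
        # next 64-byte block, so an alignment interrupt is generated.
        return (prefix_addr % 64) == 60

    assert prefix_alignment_fault(0x103c)        # 0x3c == 60
    assert not prefix_alignment_fault(0x1038)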
Paul Mackerras 39ca675ce3 Decode prefixed instructions
This adds logic to do basic decoding of the prefixed instructions
defined in PowerISA v3.1B which are in the SFFS (Scalar Fixed plus
Floating-Point Subset) compliancy subset.  In PowerISA v3.1B SFFS,
there are 14 prefixed load/store instructions plus the prefixed no-op
instruction (pnop).  The prefixed load/store instructions all use an
extended version of D-form, which has an extra 18 bits of displacement
in the prefix, plus an 'R' bit which enables PC-relative addressing.

When decode1 sees an instruction word where the insn_code is
INSN_prefix (i.e. the primary opcode was 1), it stores the prefix word
and sends nothing down to decode2 in that cycle.  When the next valid
instruction word arrives, it is interpreted as a suffix, meaning that
its insn_code gets modified before being used to look up the decode
table.

The insn_code values are rearranged so that the values for
instructions which are the suffix of a valid prefixed instruction are
all at even indexes, and the corresponding prefixed instructions
follow immediately, so that an insn_code value can be converted to the
corresponding prefixed value by setting the LSB of the insn_code
value.  There are two prefixed instructions, pld and pstd, for which
the suffix is not a valid SFFS instruction by itself, so these have
been given dummy insn_code values which decode as illegal (INSN_op57
and INSN_op61).

For a prefixed instruction, decode1 examines the type and subtype
fields of the prefix and checks that the suffix is valid for the type
and subtype.  This check doesn't affect which entry of the decode
table is used; the result is passed down to decode2, and will in
future be acted upon in execute1.

The instruction address passed down to decode2 is the address of the
prefix.  To enable this, part of the instruction address is saved when
the prefix is seen, and then the instruction address received from
icache is partly overlaid by the saved prefix address.  Because
prefixed instructions are not permitted to cross 64-byte boundaries,
we only need to save bits 5:2 of the instruction address to do this.  If the
alignment restriction ever gets relaxed, we will then need to save
more bits of the address.

Decode2 has been extended to handle the R bit of the prefix (in 8LS
and MLS forms) and to be able to generate the 34-bit immediate value
from the prefix and suffix.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
10 months ago
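
A small Python sketch (not the VHDL; helper names invented) of the two
mechanisms described above: converting a suffix insn_code to its prefixed
variant by setting the LSB, and reconstructing the prefix address by
overlaying the saved bits 5:2.

    def prefixed_code(suffix_code: int) -> int:
        # Suffix codes sit at even indexes; the prefixed variant follows
        # immediately, so the conversion is just setting the LSB.
        return suffix_code | 1

    def prefix_address(suffix_addr: int, saved_bits_5_2: int) -> int:
        # Prefixed instructions cannot cross a 64-byte boundary, so only
        # bits 5:2 of the address need to be saved and overlaid.
        return (suffix_addr & ~0x3c) | ((saved_bits_5_2 & 0xf) << 2)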
Paul Mackerras 7af0e001ad Move insn_codes for mcrfs, mtfsb0/1 and mtfsfi
This moves the insn_code values for mcrfs, mtfsb0/1 and mtfsfi into
the region used for floating-point instructions.  This means that in
no-FPU implementations, they will get turned into illegal instructions
in predecode.  We then don't need the code in execute1 that makes FP
instructions illegal in no-FPU implementations.

We also remove the NONE value for unit_t, since it was only ever used
with insn_type = OP_ILLEGAL, and the check for unit = NONE was
redundant with the check for insn_type = OP_ILLEGAL.  Thus the check
for unit = NONE is no longer needed and is removed here.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
10 months ago
Paul Mackerras e02d8060ed Change the multiplier interface to support signed multipliers
This adds an 'is_signed' signal to MultiplyInputType to indicate
whether the data1 and data2 fields are to be interpreted as signed or
unsigned numbers.

The 'not_result' field is replaced by a 'subtract' field which
provides a more intuitive interface for requesting that the product be
subtracted from the addend rather than added, i.e. subtract = 1 gives
C - A * B, vs. subtract = 0 giving C + A * B.  (Previously the users
of the multipliers got the same effect by complementing the addend and
setting not_result = 1.)

The is_32bit field is removed because it is no longer used now that we
have a separate 32-bit multiplier.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
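
Stated as a formula, the interface now computes C + A*B or C - A*B, with A
and B optionally treated as signed.  A behavioural Python sketch (the field
names follow MultiplyInputType as described above; the function itself is
only an illustration and the 64/128-bit widths are assumptions):

    def multiply(a: int, b: int, addend: int, is_signed: bool, subtract: bool) -> int:
        def to_signed(x, bits=64):
            return x - (1 << bits) if x >= (1 << (bits - 1)) else x
        if is_signed:
            a, b = to_signed(a), to_signed(b)
        product = a * b
        result = (addend - product) if subtract else (addend + product)
        return result & ((1 << 128) - 1)     # wrap to the result width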
Paul Mackerras 21ab36a0c0 Pre-decode instructions when writing them to icache
This splits out the decoding done in the decode0 step into a separate
predecoder, used when writing instructions into the icache.  The
icache now holds 36 bits per instruction rather than 32.  For valid
instructions, those 36 bits comprise the bottom 26 bits of the
instruction word, a 9-bit insn_code value (which uniquely identifies
the instruction), and a zero in the MSB.  For illegal instructions,
the MSB is one and the full instruction word is in the bottom 32 bits.
Having the full instruction word available for illegal instructions
means that it can be printed in the log when simulating, or in future
could be placed in the HEIR register.

If we don't have an FPU, then the floating-point instructions are
regarded as illegal.  In that case, the insn_code values would fit
into 8 bits, which could be used in future to reduce the size of
decode_rom from 512 to 256 entries.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
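
A sketch of the 36-bit encoding described above (the exact field placement
within the 36 bits is an assumption; the commit only states which fields are
present):

    def encode_icache_word(insn: int, insn_code: int, illegal: bool) -> int:
        if illegal:
            # MSB set, full 32-bit instruction word in the low bits.
            return (1 << 35) | (insn & 0xFFFFFFFF)
        # MSB clear, 9-bit insn_code, bottom 26 bits of the instruction word.
        return (insn_code << 26) | (insn & 0x3FFFFFF)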
Paul Mackerras c3ee10f013 decode1: Split instruction decoding into two steps
This reduces the block RAM requirements for instruction decoding by
splitting it into two steps.  The first, in a new pipeline stage
called decode0 (implemented by code in decode1.vhdl) maps the
instruction to a 9-bit instruction code using major and row decode
ROMs.  The second maps the 9-bit code to the final decode_rom_t (about
44 bits wide).  Branch prediction done in decode is now done in
decode0 rather than decode1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 932da4c114 FPU: Simplify IDLE state code
Do more decoding of the instruction ahead of the IDLE state
processing so that the IDLE state code becomes much simpler.
To make the decoding easier, we now use four insn_type_t codes for
floating-point operations rather than two.  This also rearranges the
insn_type_t values a little to get the 4 FP opcode values to differ
only in the bottom 2 bits, and put OP_DIV, OP_DIVE and OP_MOD next to
them.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 7a60c118ed loadstore1: Simplify address generation in OP_FETCH_FAILED case
Instead of having a multiplexer in loadstore1 in order to be able to
put the instruction address into v.addr, we now set decode.input_reg_a
to CIA in the decode table entry for OP_FETCH_FAILED.  That means that
the operand selection machinery in decode2 will supply the instruction
address to loadstore1 on the lv.addr1 input and no special case is
needed in loadstore1.  This saves a few LUTs (~40 on the Artix-7).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 795b6e2a6b Remove leftover logic for 16-byte loads and stores
This removes some logic that was previously added for the 16-byte
loads and stores (lq, lqarx, stq, stqcx.) and not completely removed
in commit c9e838b656 ("Remove support for lq, stq, lqarx and
stqcx.", 2022-06-04).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Michael Neuling caf458be37 Metavalue cleanup for common.vhdl
This affects other files which have been included here.

Signed-off-by: Michael Neuling <mikey@neuling.org>
2 years ago
Paul Mackerras d6121cd636 Use register addresses from decode1 for dependency tracking
This improves timing a little because the register addresses now come
directly from a latch instead of being calculated by
decode_input_reg_*.  The asserts that check that the two are the same
are now in decode2 rather than register_file.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 1d7de2f1da register_file: Make read access to register file synchronous
With this, the register RAM is read synchronously using the addresses
supplied by decode1.  That means the register RAM can now be block RAM
rather than LUT RAM.

Debug accesses are done via the B port on cycles when decode1
indicates that there is no valid instruction or the instruction
doesn't use a [F]RB operand.

We latch the addresses being read in each cycle and use the same
address next cycle if stalled.  Data that is being written is latched
and a multiplexer on each read port then supplies the latched write
data if the read address for that port equals the write address.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
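
A behavioural Python model (not the VHDL; names invented, one read port
shown) of the synchronous read with write forwarding described above: read
addresses are registered, the RAM output appears the following cycle, and a
multiplexer substitutes the latched write data when the registered read
address matches the latched write address.

    class SyncRegFile:
        def __init__(self, entries=64):
            self.ram = [0] * entries
            self.rd_addr = 0          # read address registered last cycle
            self.wr_addr = None       # write address latched last cycle
            self.wr_data = 0          # write data latched last cycle

        def cycle(self, read_addr, write=None):
            # Output for the address registered in the previous cycle,
            # with the forwarding multiplexer applied.
            if self.wr_addr is not None and self.rd_addr == self.wr_addr:
                dout = self.wr_data
            else:
                dout = self.ram[self.rd_addr]
            # Commit last cycle's write, then register this cycle's inputs.
            if self.wr_addr is not None:
                self.ram[self.wr_addr] = self.wr_data
            self.rd_addr = read_addr
            self.wr_addr, self.wr_data = write if write else (None, 0)
            return dout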
Paul Mackerras 06c13d4988 decode1: Work out register addresses in decode1
This adds some relatively simple logic to decode1 to compute the
GPR/FPR addresses that an instruction will access.  It always computes
three addresses regardless of whether the instruction will actually
use all of them.  The main things it computes are whether the
instruction uses the RS field or the RC field for the 3rd operand, and
whether the operands are FPRs or GPRs (it is possible for RS to be an
FPR but RA and RB to be GPRs, as for example with stfdx).

At the moment all we do with these computed register addresses is to
assert that they are identical to the ones coming from decode2 one
cycle later.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras af814a0d5e Provide debug access to SPRs in loadstore1 and mmu
They are accessible as GSPR 0x3c - PID, 0x3d - PTCR, 0x3e - DSISR
and 0x3f - DAR.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras d0f319290f Restore debug access to SPRs
This provides access to the SPRs via the JTAG DMI interface.  For now
they are still accessed as if they were GPR/FPRs using the same
numbering as before (GPRs at 0 - 0x1f, SPRs at 0x20 - 0x2d, FPRs at
0x40 - 0x5f).

For XER, debug reads now report the full value, not just the bits that
were previously stored in the register file.  The "slow" SPR mux is
not used for debug reads.

Decode2 determines on each cycle whether a debug SPR access will
happen next cycle, based on whether there is a request and whether the
current instruction accesses the SPR RAM.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras fdb3ef6874 Finish off taking SPRs out of register file
With this, the register file now contains 64 entries, for 32 GPRs and
32 FPRs, rather than the 128 it had previously.  Several things get
simplified - decode1 no longer has to work out the ispr{1,2,o} values,
decode_input_reg_{a,b,c} no longer have the t = SPR case, etc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 337b104250 Move LR, CTR and TAR out of the register file
By putting CTR on the odd side and LR and TAR on the even side, we can
read and write CTR for bdnz-style instructions in parallel with
reading LR or TAR for indirect branches and writing LR for branches
with LK=1.  Thus we don't need to double up any of these instructions,
giving a simplification in decode2.

We now have logic for printing LR and CTR at the end of a simulation
in execute1, in addition to the similar logic in register_file and
cr_file.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras bc4d02cb0d Start removing SPRs from register file
This starts the process of removing SPRs from the register file by
moving SRR0/1, SPRG0-3, HSRR0/1 and HSPRG0/1 out of the register file
and putting them into execute1.  They are stored in a pair of small
RAM arrays, referred to as "even" and "odd".  The reason for having
two arrays is so that two values can be read and written in each
cycle.  For example, SRR0 and SRR1 can be written in parallel by an
interrupt and read in parallel by the rfid instruction.

The addresses in the RAM which will be accessed are determined in the
decode2 stage.  We have one write address for both sides, but two read
addresses, since in future we will want to be able to read CTR at the
same time as either LR or TAR.

We now have a connection from writeback to execute1 which carries the
partial SRR1 value for an interrupt.  SRR0 comes from the execute
pipeline; we no longer need to carry instruction addresses along the
LSU and FPU pipelines.  Since SRR0 and SRR1 can be written in the same
cycle now, we don't need the little state machine in writeback any
more.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
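
An illustrative Python model (not the VHDL; names invented) of the even/odd
RAM pair described above: one shared write address and two read addresses,
so that e.g. SRR0 and SRR1 can be written together by an interrupt and read
together by rfid.

    class RamSpr:
        def __init__(self, size=8):
            self.even = [0] * size    # e.g. SRR0 side
            self.odd = [0] * size     # e.g. SRR1 side

        def write(self, addr, even_data=None, odd_data=None):
            # One write address shared by both sides.
            if even_data is not None:
                self.even[addr] = even_data
            if odd_data is not None:
                self.odd[addr] = odd_data

        def read(self, even_addr, odd_addr):
            # Two independent read addresses, one per side.
            return self.even[even_addr], self.odd[odd_addr]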
Paul Mackerras 73cc5167ec Use FPU for division instructions if we have an FPU
- Arrange for XER to be written for OE=1 forms
- Arrange for condition codes to be set for RC=1 forms
  (including correct handling for 32-bit mode)
- Don't instantiate the divider if we have an FPU.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras a95f8aab38 FPU: Add integer division logic to FPU
This adds logic to the FPU to accomplish 64-bit integer divisions.
No instruction actually uses this yet.

The algorithm used is to obtain an estimate of the reciprocal of the
divisor using the lookup table and refine it by one to three
iterations of the Newton-Raphson algorithm (the number of iterations
depends on the number of significant bits in the dividend).  Then the
reciprocal is multiplied by the dividend to get the quotient estimate.
The remainder is calculated as dividend - quotient * divisor.  If the
remainder is greater than or equal to the divisor, the quotient is
incremented, or if a modulo operation is being done, the divisor is
subtracted from the remainder.  The inverse estimate after refinement
is good enough that the quotient estimate is always equal to or one
less than the true quotient.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
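
A behavioural Python sketch of the algorithm described above (a software
model, not the FPU data path; the 8-bit initial estimate and the fixed three
iterations are illustrative assumptions, whereas the real implementation
varies the iteration count with the size of the dividend):

    K = 64  # fixed-point scale: recip ~= 2**K / divisor

    def coarse_recip(d: int) -> int:
        # Stand-in for the reciprocal lookup table: form a (slightly low)
        # estimate from the top 8 bits of the normalized divisor.
        shift = d.bit_length()
        top = d >> (shift - 8) if shift > 8 else d << (8 - shift)
        return (1 << (K + 8 - shift)) // (top + 1)

    def refine(r: int, d: int) -> int:
        # One Newton-Raphson step for the reciprocal: r' = r * (2 - d*r).
        return (r * ((2 << K) - d * r)) >> K

    def divide(n: int, d: int, modulo: bool = False) -> int:
        r = coarse_recip(d)
        for _ in range(3):
            r = refine(r, d)
        q = (n * r) >> K                 # quotient estimate (never too high)
        rem = n - q * d
        if rem >= d:                     # estimate was one low: correct it
            q, rem = q + 1, rem - d
        return rem if modulo else q

    assert all(divide(n, d) == n // d
               for n in range(1, 2000, 7) for d in range(1, 99))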
Paul Mackerras 2f45e545ed decode2: Rework to make the stall_out signal come from a register
At present the busy/stall signal going to decode1 depends on whether
control thinks it can issue the current instruction, and that depends
on completion and bypass signals coming from execute1 and writeback.

To improve the timing of stall_out, this rearranges decode2 so that
stall_out is asserted when we have a valid instruction that couldn't
be issued in the previous cycle.  This means that decode1 could give
us a new instruction when we haven't issued the previous instruction.

This in turn means that we can only use d_in in the first cycle of
processing an instruction.  After the first cycle, we get register
addresses etc. from dc2 rather than d_in.

Then, to avoid the need to read register operands from register_file
in each cycle until the instruction issues, we bring the bypass path
for data being written to the register file into decode2 explicitly
rather than having it in register_file.

A new process called decode2_addrs does the process of calling
decode_input_reg_* and decode_output_reg and sets up the register file
addresses.  This was split out (and decode_input_reg_* reworked) to
try to reduce the number of passes through the decode2_1 process that
need to be done in simulation.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 9a8a8e50f8 FPU: Add stage-2 stall ability to FPU
This makes the FPU able to stall other units at execute stage 2 and be
stalled by other units (specifically the LSU).

This means that the completion and writeback for an instruction can
now end up being deferred until the second cycle of a following
instruction, i.e. the cycle when the state machine has gone through
IDLE state into one of the DO_* states, which means we need to latch
the destination FPR number, CR mask, etc. from the previous
instruction so that we present the correct information to writeback.

The advantage of this is that we can get rid of the in_progress signal
from the LSU.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras e030a500e8 Allow integer instructions and load/store instructions to execute together
Execute1 and loadstore1 now send each other stall signals that
indicate that a valid instruction in stage 2 can't complete in this
cycle, and hence any valid instruction in stage 1 in the other unit
can't move to stage 2.  With this in place, an ALU instruction can
move into stage 1 while a LSU instruction is in stage 2.

Since the FPU doesn't yet have a way to stall completion, we can't yet
start FPU instructions while any LSU or ALU instruction is in
progress.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 3510071d9a Add a second execute stage to the pipeline
This adds a second execute stage to the pipeline, in order to match up
the length of the pipeline through loadstore and dcache with the
length through execute1.  This will ultimately enable us to get rid of
the 1-cycle bubble that we currently have when issuing ALU
instructions after one or more LSU instructions.

Most ALU instructions execute in the first stage, except for
count-zeroes and popcount instructions (which take two cycles and do
some of their work in the second stage) and mfspr/mtspr to "slow" SPRs
(TB, DEC, PVR, LOGA/LOGD, CFAR).  Multiply and divide/mod instructions
take several cycles but the instruction stays in the first stage (ex1)
and ex1.busy is asserted until the operation is complete.

There is currently a bypass from the first stage but not the second
stage.  Performance is down somewhat because of that and because this
doesn't yet eliminate the bubble between LSU and ALU instructions.

The forwarding of XER common bits has been changed somewhat because
now there is another pipeline stage between ex1 and the committed
state in cr_file.  The simplest thing for now is to record the last
value written and use that, unless there has been a flush, in which
case the committed state (obtained via e_in.xerc) is used.

Note that this fixes what was previously a benign bug in control.vhdl,
where it was possible for control to forget an instruction's dependency
on a value from a previous instruction (a GPR or the CR) if this
instruction writes the value and the instruction gets to the point
where it could issue but is blocked by the busy signal from execute1.
In that situation, control may incorrectly not indicate that a bypass
should be used.  That didn't matter previously because, for ALU and
FPU instructions, there was only one previous instruction in flight
and once the current instruction could issue, the previous instruction
was completing and the correct value would be obtained from
register_file or cr_file.  For loadstore instructions there could be
two being executed, but because there are no bypass paths, failing to
indicate use of a bypass path is fine.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 813e2317bf execute1: Restructure to separate out execution of side effects
We now have a record that represents the actions taken in executing an
instruction, and a process that computes that for the incoming
instruction.  We no longer have 'current' or 'r.cur_instr', instead
things like the destination register are put into r.e in the first
cycle of an instruction and not reinitialized in subsequent busy
cycles.

For mfspr and mtspr, we now decode "slow" SPR numbers (those SPRs that
are not stored in the register file) to a new "spr_selector" record
in decode1 (excluding those in the loadstore unit).  With this, the
result for mfspr is determined in the data path.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 204fedc63f Move XER low bits out of register file
Besides the overflow and status carry bits, XER has 18 bits which need
to retain the value written by mtxer, in case software wants to
emulate the move-assist instructions (lswi, lswx, stswi, stswx).
Until now these bits (and others) have been stored in the GPR file as
a "fast" SPR, but this causes complications because XER is not really
a fast SPR.

Instead, we now store these 18 bits in the 'ctrl' signal, which exists
in execute1.  This will enable us to simplify the data path in future,
and has the added bonus that with a little bit of plumbing, we can get
the full XER value printed when dumping registers at the end of a
simulation.

Therefore this changes scripts/run_test.sh to remove the greps which
exclude XER from the comparison of actual and expected register
results.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Anton Blanchard 0b39947f8d Remove unused sequential signal from Fetch1ToIcacheType
GHDL synthesis is flagging a warning about this.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
2 years ago
Benjamin Herrenschmidt d745995207 Introduce real_addr_t and addr_to_real()
This moves REAL_ADDR_BITS out of the caches and defines a real_addr_t
type for a real address, along with an addr_to_real() conversion helper.

It makes the VHDL a bit more readable.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
3 years ago
Paul Mackerras 54b0e8b8c8 core: Predict not-taken conditional branches using BTC
This adds a bit to the BTC to store whether the corresponding branch
instruction was taken last time it was encountered.  That lets us pass
a not-taken prediction down to decode1, which for backwards direct
branches inhibits it from redirecting fetch to the target of the
branch.  This increases coremark by about 2%.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 65c43b488b PMU: Add several more events
This implements most of the architected PMU events.  The ones missing
are mostly the ones that depend on which level of the cache hierarchy
data is fetched from.  The events implemented here, and their raw
event codes, are:

    Floating-point operation completed (100f4)
    Load completed (100fc)
    Store completed (200f0)
    Icache miss (200fc)
    ITLB miss (100f6)
    ITLB miss resolved (400fc)
    Dcache load miss (400f0)
    Dcache load miss resolved (300f8)
    Dcache store miss (300f0)
    DTLB miss (300fc)
    DTLB miss resolved (200f6)
    No instruction available and none being executed (100f8)
    Instruction dispatched (200f2, 300f2, 400f2)
    Taken branch instruction completed (200fa)
    Branch mispredicted (400f6)
    External interrupt taken (200f8)

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras a7873b45f7 core: Add a basic performance monitor unit (PMU) implementation
This is the start of an implementation of a PMU according to PowerISA
v3.0B.  Things not implemented yet include most architected events,
the BHRB, event-based branches, thresholding, MMCR0[TBCC] field, etc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 64e3ce7134 execute1: Handle interrupts during sequences of load/store operations
At present the logic prevents any interrupts from being handled while
there is a load/store instruction (one that has unit=LDST) being
executed.  However, load/store instructions can still get sent to
loadstore1.  Thus an instruction which should generate an interrupt
such as a floating-point unavailable interrupt will instead get
executed.

To fix this, when we detect that an interrupt should be generated but
loadstore1 is still executing a previous instruction, we don't execute
any new instructions, and set a new r.intr_pending flag.  That results
in busy_out being asserted (meaning that no further instructions will
come in from decode2).  When loadstore1 has finished the instructions
it has, the interrupt gets sent to writeback.  If one of the
instructions in loadstore1 generates an interrupt in the meantime, the
l_in.interrupt signal gets asserted and that clears r.intr_pending, so
the interrupt we detected gets discarded.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 18120f153d MMU: Implement a vestigial partition table
This implements a 1-entry partition table, so that instead of getting
the process table base address from the PRTBL SPR, the MMU now reads
the doubleword pointed to by the PTCR register plus 8 to get the
process table base address.  The partition table entry is cached.

Having the PTCR and the vestigial partition table reduces the amount
of software change required in Linux for Microwatt support.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 17fd069640 core: Allow multiple loadstore instructions to be in flight
The idea here is that we can have multiple instructions in progress at
the same time as long as they all go to the same unit, because that
unit will keep them in order.  If we get an instruction for a
different unit, we wait for all the previous instructions to finish
before executing it.  Since the loadstore unit is the only one that is
currently pipelined, this boils down to saying that loadstore
instructions can go ahead while l_in.in_progress = 1 but other
instructions have to wait until it is 0.

This gives a 2% increase on coremark performance on the Arty A7-100
(from ~190 to ~194).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras f636bb7c39 dcache: Fix bugs in pipelined operation
This fixes two bugs which show up when multiple operations are in
flight in the dcache, and adds a 'hold' input which will be needed
when loadstore1 is pipelined.

The first bug is that dcache needs to sample the data for a store on
the cycle after the store request comes in even if the store request
is held up because of a previous request (e.g. if the previous request
is a load miss or a dcbz).

The second bug is that a load request coming in for a cache line being
refilled needs to be handled immediately in the case where it is for
the row whose data arrives on the same cycle.  If it is not, then it
will be handled as a separate cache miss and the cache line will be
refilled again into a different way, leading to two ways both being
valid for the same tag.  This can lead to data corruption, in the
scenario where subsequent writes go to one of the ways and then that
way gets displaced but the other way doesn't.  This bug could in
principle show up even without having multiple operations in flight in
the dcache.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras acb3d2d745 core: Send FPU interrupts to writeback rather than execute1
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 29221315e9 core: Send loadstore1 interrupts to writeback rather than execute1
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 3cd3449b4b core: Move redirect and interrupt delivery logic to writeback
This moves the logic for redirecting fetching and writing SRR0 and
SRR1 to writeback.  The aim is that ultimately units other than
execute1 can send their interrupts to writeback along with their
instruction completions, so that there can be multiple instructions
in flight without needing execute1 to keep track of the address
of each outstanding instruction.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 4fd8d9509c execute1: Move CR result to data path process
Also work out in decode2 whether the instruction sets the XER common
bits.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras ae2afeca5c core: Track CR hazards and bypasses using tags
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras d290d2a9bb core: Restore bypass path from execute1
This changes the bypass path.  Previously it went from after
execute1's output to after decode2's output.  Now it goes from before
execute1's output register to before decode2's output register.  The
reason is that the new path will be simpler to manage when there are
possibly multiple instructions in flight.  This means that the
bypassing can be managed inside decode2 and control.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras c0b45e153b core: Track GPR hazards using tags that propagate through the pipelines
This changes the way GPR hazards are detected and tracked.  Instead of
having a model of the pipeline in gpr_hazard.vhdl, which has to mirror
the behaviour of the real pipeline exactly, we now assign a 2-bit tag
to each instruction and record which GSPR the instruction writes.
Subsequent instructions that need to use the GSPR get the tag number
and stall until the value with that tag is being written back to the
register file.

For now, the forwarding paths are disabled.  That gives about an 8%
reduction in coremark performance.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
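
An illustrative Python model (not the VHDL; names invented) of the tag
scheme described above: each instruction that writes a GSPR is given a 2-bit
tag, and a consumer of that GSPR stalls until a writeback carrying the
matching tag appears.

    class TagTracker:
        def __init__(self):
            self.writer_tag = {}     # GSPR number -> tag of in-flight writer
            self.next_tag = 0

        def issue(self, dest_gspr):
            tag = self.next_tag
            self.next_tag = (self.next_tag + 1) & 3     # 2-bit tag
            if dest_gspr is not None:
                self.writer_tag[dest_gspr] = tag
            return tag

        def source_ready(self, src_gspr, writeback_tag):
            # Ready if there is no in-flight writer, or the value with the
            # matching tag is being written back this cycle.
            tag = self.writer_tag.get(src_gspr)
            return tag is None or tag == writeback_tag

        def writeback(self, gspr, tag):
            if self.writer_tag.get(gspr) == tag:
                del self.writer_tag[gspr]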
Paul Mackerras a1d7b54f76 core: Crack branches that update both CTR and LR
This uses the instruction doubling machinery to convert conditional
branch instructions that update both CTR and LR (e.g., bdnzl, bdnzlrl)
into two instructions, of which the first updates CTR and determines
whether the branch is taken, and the second updates LR and does the
redirect if necessary.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 4c61a71a62 core: Crack update-form loads into two internal ops
This uses the instruction-doubling machinery to send load with update
instructions down to loadstore1 as two separate ops, rather than
one op with two destinations.  This will help to simplify the value
tracking mechanisms.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 0fb207be60 fetch1: Implement a simple branch target cache
This implements a cache in fetch1, where each entry stores the address
of a simple branch instruction (b or bc) and the target of the branch.
When fetching sequentially, if the address being fetched matches the
cache entry, then fetching will be redirected to the branch target.
The cache has 1024 entries and is direct-mapped, i.e. indexed by bits
11..2 of the NIA.

The bus from execute1 now carries information about taken and
not-taken simple branches, which fetch1 uses to update the cache.
The cache entry is updated for both taken and not-taken branches, with
the valid bit being set if the branch was taken and cleared if the
branch was not taken.

If fetching is redirected to the branch target then that goes down the
pipe as a predicted-taken branch, and decode1 does not do any static
branch prediction.  If fetching is not redirected, then the next
instruction goes down the pipe as normal and decode1 does its static
branch prediction.

In order to make timing, the lookup of the cache is pipelined, so on
each cycle the cache entry for the current NIA + 8 is read.  This
means that after a redirect (from decode1 or execute1), only the third
and subsequent sequentially-fetched instructions will be able to be
predicted.

This improves the coremark value on the Arty A7-100 from about 180 to
about 190 (more than 5%).

The BTC is optional.  Builds for the Artix 7 35-T part have it off by
default because the extra ~1420 LUTs it takes mean that the design
doesn't fit on the Arty A7-35 board.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
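
A behavioural Python sketch of the BTC described above (not the VHDL; the
tag is assumed to be the remaining upper address bits):

    class BTC:
        SIZE = 1024                  # direct-mapped, indexed by NIA bits 11..2

        def __init__(self):
            self.entries = [None] * self.SIZE   # None or (tag, target)

        @staticmethod
        def index_tag(addr):
            return (addr >> 2) & (BTC.SIZE - 1), addr >> 12

        def update(self, branch_addr, target, taken):
            # Updated for both taken and not-taken branches; only a taken
            # branch leaves a valid entry behind.
            idx, tag = self.index_tag(branch_addr)
            self.entries[idx] = (tag, target) if taken else None

        def predict(self, nia):
            # In hardware the lookup is pipelined (the entry for NIA + 8 is
            # read each cycle); this model just looks up the given address.
            idx, tag = self.index_tag(nia)
            entry = self.entries[idx]
            return entry[1] if entry and entry[0] == tag else None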