microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	1a244d3470	Remove single-issue constraint for most loads and stores This removes the constraint that loads and stores are single-issue, at the expense of a stall of at least 2 cycles for every load and store. To do this, we plumb the existing stall signal that was generated in dcache to core, where it gets ORed with the stall signal from execute1. Execute1 generates a stall signal for the first two cycles of each load and store, and dcache generates the stall signal in the 3rd and subsequent cycles if it needs to. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	441160d865	execute1: Use truth table embedded in instruction for CR logical ops It turns out that CR logical instructions have the truth table of the operation embedded in the instruction word. This means that we can collect the two input operand bits into a 2-bit value and use that as the index to select the appropriate bit from the instruction word. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	e08ca4ab8e	countzero: Add a register to help make timing This adds a register in the middle of the countzero computation, so that we now have two cycles to count leading or trailing zeroes instead of just one. Execute1 now outputs a one-cycle stall signal when it encounters a cntlz* or cnttz* instruction. With this, the countzero path no longer fails timing on the Artix-7 at 100MHz. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	5422007f83	Plumb loadstore1 input from execute1 not decode2 This allows us to use the bypass at the input of execute1 for the address and data operands for loadstore1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b14d982011	execute: Implement bypass from output of execute1 to input This enables back-to-back execution of integer instructions where the first instruction writes a GPR and the second reads the same GPR. This is done with a set of multiplexers at the start of execute1 which enable any of the three input operands to be taken from the output of execute1 (i.e. r.e.write_data) rather than the input from decode2 (i.e. e_in.read_data[123]). This also requires changes to the hazard detection and handling. Decode2 generates a signal indicating that the GPR being written is available for bypass, which is true for instructions that are executed in execute1 (rather than loadstore1/dcache). The gpr_hazard module stores this "bypassable" bit, and if the same GPR needs to be read by a subsequent instruction, it outputs a "use_bypass" signal rather than generating a stall. The use_bypass signal is then latched at the output of decode2 and passed down to execute1 to control the input multiplexer. At the moment there is no bypass on the inputs to loadstore1, but that is OK because all load and store instructions are marked as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	0c714f1be6	execute: Move popcnt and prty instructions into the logical unit This implements logic in the logical entity to calculate the results of the popcnt* and prty* instructions. We now have one insn_type_t value for the 3 popcnt variants and one for the two prty variants, using the length field of the decode_rom_t to distinguish between them. The implementations in logical.vhdl using recursive algorithms rather than the simple functions in ppc_fx_insns.vhdl. This gives a saving of about 140 slice LUTs on the A7-100 and improves timing slightly. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d2ca625b3b	execute: Do comparisons using the main adder This handles OP_CMP like a subtraction; the main adder computes ~RA + RB + 1, and the condition codes are computed from the results. A direct comparison of the two input operands is used to calculate the EQ bit of the condition result. The LT and GT bits are computed from the MSB of the subtraction result, the carry out from the subtraction, and the MSBs of the operands. For a 32-bit comparison, the 32-bit carry and bit 31 of the result and input operands are used; for a 64-bit comparison, the 64-bit carry and bit 63 of the operands and result are used. It turns out to be more convenient to use the 'signed' field of the decode table to distinguish signed from unsigned comparisons, rather than the insn_type. Therefore this uses OP_CMP for both cmp and cmpl, which also has the benefit of reducing the number of values in insn_type_t. Doing this saves over 200 slice LUTs on the Arty A7-100 and improves timing slightly as well. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d956846667	execute1: Move EXTS* instruction back into execute1 This moves the sign extension done by the extsb, extsh and extsw instructions back into execute1. This means that we no longer need any data formatting in writeback for results coming from execute1, so this modifies writeback so the data formatter inputs come directly from the loadstore unit output. The condition code updates for RC=1 form instructions are now done on the value from execute1 rather than the output of the data formatter, which should help timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	c9a2076dd3	execute1: Remember dest GPR, RC, OE, XER for slow operations For multiply and divide operations, execute1 now records the destination GPR number, RC and OE from the instruction, and the XER value. This means that the multiply and divide units don't need to record those values and then send them back to execute1. This makes the interface to those units a bit simpler. They simply report an overflow signal along with the result value, and execute1 takes care of updating XER if necessary. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	39d18d2738	Make divider hang off the side of execute1 With this, the divider is a unit that execute1 sends operands to and which sends its results back to execute1, which then send them to writeback. Execute1 now sends a stall signal when it gets a divide or modulus instruction until it gets a valid signal back from the divider. Divide and modulus instructions are no longer marked as single-issue. The data formatting step that used to be done in decode2 for div and mod instructions is now done in execute1. We also do the absolute value operation in that same cycle instead of taking an extra cycle inside the divider for signed operations with a negative operand. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	2167186b5f	Make multiplier hang off the side of execute1 With this, the multiplier isn't a separate pipe that decode2 issues instructions to, but rather is a unit that execute1 sends operands to and which sends the result back to execute1, which then sends it to writeback. Execute1 now sends a stall signal when it gets a multiply instruction until it gets a valid signal back from the multiplier. This all means that we no longer need to mark the multiply instructions as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	ad3db18dce	Fix a ghdysynth inferred latch error in execute It should never happen in practise, but ghdlsynth is complaining about an inferred latch here. Fix it Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	cc8a9e7893	Upper 32 bits of XER should read as 0s From the architecture: bits 0:31 and 35:43 are treated as reserved and return 0s when read using mfxer Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Tom Vijlbrief	c05441bf47	Implement CRNOR and friends Signed-off-by: Tom Vijlbrief <tvijlbrief@gmail.com>	5 years ago
Benjamin Herrenschmidt	e4f475e17f	sprs: Store common SPRs in register file This stores the most common SPRs in the register file. This includes CTR and LR and a not yet final list of others. The register file is set to 64 entries for now. Specific types are defined that can represent a GPR index (gpr_index_t) or a GPR/SPR index (gspr_index_t) along with conversion functions between the two. On order to deal with some forms of branch updating both LR and CTR, we introduced a delayed update of LR after a branch link. Note: We currently stall the pipeline on such a delayed branch, but we could avoid stalling fetch in that specific case as we know we have a branch delay. We could also limit that to the specific case where we need to update both CTR and LR. This allows us to make bcreg, mtspr and mfspr pipelined. decode1 will automatically force the single issue flag on mfspr/mtspr to a "slow" SPR. [paulus@ozlabs.org - fix direction of decode2.stall_in] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	ec9b27660f	execute: Copy XER[SO] to CR for cmp[i] and cmpl[i] instructions We were copying in XER[SO] for the dot-form instructions but not the explicit compare instructions. Fix this. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	501b6daf9b	Add basic XER support The carry is currently internal to execute1. We don't handle any of the other XER fields. This creates type called "xer_common_t" that contains the commonly used XER bits (CA, CA32, SO, OV, OV32). The value is stored in the CR file (though it could be a separate module). The rest of the bits will be implemented as a separate SPR and the two parts reconciled in mfspr/mtspr in latter commits. We always read XER in decode2 (there is little point not to) and send it down all pipeline branches as it will be needed in writeback for all type of instructions when CR0:SO needs to be updated (such forms exist for all pipeline branches even if we don't yet implement them). To avoid having to track XER hazards, we forward it back in EX1. This assumes that other pipeline branches that can modify it (mult and div) are running single issue for now. One additional hazard to beware of is an XER:SO modifying instruction in EX1 followed immediately by a store conditional. Due to our writeback latency, the store will go down the LSU with the previous XER value, thus the stcx. will set CR0:SO using an obsolete SO value. I doubt there exist any code relying on this behaviour being correct but we should account for it regardless, possibly by ensuring that stcx. remain single issue initially, or later by adding some minimal tracking or moving the LSU into the same pipeline as execute. Missing some obscure XER affecting instructions like addex or mcrxrx. [paulus@ozlabs.org - fix CA32 and OV32 for OP_ADD, fix order of arguments to set_ov] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	83a8bb0238	spr: Cleanup decoding of SPR numbers Use a function to obtain the integer number and use constants with the architected numbers. Replace std_match with a case statement. This also has the side effect of returning 0 instead of some random previous result on mfspr of an unknown SPR. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	247d7d4aa0	Merge pull request #113 from mikey/exec-sim-remove Remove SIM generic from execute1	5 years ago
Michael Neuling	bd4ac06243	Remove SIM generic from execute1 This does nothing, so remove. Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Benjamin Herrenschmidt	742b21480e	insn: Simplistic implementation of icbi We don't yet have a proper snooper for the icache, so for now make icbi just flush the whole thing Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	a0d95e791e	insn: Implement isync instruction The instruction works by redirecting fetch to nia+4 (hopefully using the same adder used to generate LR) and doing a backflush. Along with being single issue, this should guarantee that the next instruction only gets fetched after the pipe's been emptied. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	e67924f55e	isel takes a CR bit, not a CR field Fix a GHDL assert in isel. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Benjamin Herrenschmidt	bddc9327cc	execute1: Remove mux on "write_data" and "rc" outputs Only "write_enable" needs to change, this shrinks the core a bit more Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	da0bd89c43	crhelpers: Constraint "crnum" integer This seems to save quite a few LUTs Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	4437487ad0	execute1: Reformat No functional change Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	4433118c91	Merge pull request #105 from paulusmack/writeback Writeback	5 years ago
Paul Mackerras	f49a5a99a5	Remove execute2 stage Since the condition setting got moved to writeback, execute2 does nothing aside from wasting a cycle. This removes it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	9646fe28b0	Do sign-extension instructions in writeback instead of execute1 This makes the exts[bhw] instructions do the sign extension in the writeback stage using the sign-extension logic there instead of having unique sign extension logic in execute1. This requires passing the data length and sign extend flag from decode2 down through execute1 and execute2 and into writeback. As a side bonus we reduce the number of values in insn_type_t by two. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	86c53aa3f7	Implement neg using OP_ADD We have all the machinery in place to implement the neg instruction as OP_ADD. Doing that means we can ditch OP_NEG, and saves about 66 slice LUTs on the A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	57b7f1ed71	Don't infer latch for newcrf Always initialize newcrf to avoid inferring a latch. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	24a4a796ce	execute: Consolidate count-leading/trailing-zeroes implementations This adds combinatorial logic that does 32-bit and 64-bit count leading and trailing zeroes in one unit, and consolidates the four instructions under a single OP_CNTZ opcode. This saves 84 slice LUTs on the Arty A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	b8fb721b81	Consolidate logical instructions Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common invert on the input and output. This saves us about 200 LUTs. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	f7c393ba7e	Add a rotate/mask/shift unit and use it in execute1 This adds a new entity 'rotator' which contains combinatorial logic for rotating and masking 64-bit values. It implements the operations of the rlwinm, rlwnm, rlwimi, rldicl, rldicr, rldic, rldimi, rldcl, rldcr, sld, slw, srd, srw, srad, sradi, sraw and srawi instructions. It consists of a 3-stage 64-bit rotator using 4:1 multiplexors at each stage, two mask generators, output logic and control logic. The insn_type_t values used for these instructions have been reduced to just 5: OP_RLC, OP_RLCL and OP_RLCR for the rotate and mask instructions (clear both left and right, clear left, clear right variants), OP_SHL for left shifts, and OP_SHR for right shifts. The control signals for the rotator are derived from the opcode and from the is_32bit and is_signed fields of the decode_rom_t. The rotator is instantiated as an entity in execute1 so that we can be sure we only have one of it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	7fe84220a5	decode: Avoid multiplexing from instruction reg fields to regfile address ports This aims to simplify the logic between the instruction image and the register file read address ports and reduce the size of the decode tables. With this patch, the input_reg_a column of the decode tables can only select RA or zeroes, the input_reg_b column can only select RB or a constant (0, -1, or an immediate value from the instruction), and the input_reg_c columns can only select RS or zeroes. That means that the rotate/shift/logical ops now have their first input coming in via the input_reg_c column. That means we need to add a read_data3 field to the Decode2ToExecuteType record, but that will go away again when we split out the rotate/mask/logical ops to their own unit. As a related but not tightly connected change, this patch also sets the read1_enable signal to the register file be 0 when RA=0 and the input_reg_a for the instruction is RA_OR_ZERO (previously it was 1). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	96b402a4bf	Consolidate add/subtract instructions into a single op All of the PPC add and subtract instructions, including carrying and extended versions, do much the same arithmetic operation: result = (I xor A) + B + C where A is the value from RA, I provides a logical inversion of A (i.e. I is 0 or -1), B is either from RB or is a constant 0 or -1, and C is 0, 1 or the carry bit from XER (CA). To consolidate all the add/subtract instructions into a single OP_ADD, we add a column to decode_rom_t to indicate when A should be inverted, and change the input_carry field to a 3-state selector to select C in the equation above. This also adds a new "CONST_M1" value for input_reg_b_t to indicate that B is a constant -1. This allows us to implement addme and subfme. The addex instruction appears not to exist, so the comments referring to it are removed. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	58b06eb5f3	decode: Remove const fields from decode_rom_t The const* fields of decode_rom_t drove multiplexers in decode2 that picked out various instruction fields and put them into the const* fields of the Decode2ToExecute1Type record, from where they were used in execute1. However, the code in execute1 can just as easily use the appropriate fields of the original instruction word, since that is now available in execute1. This therefore changes the code to do that, resulting in smaller decode tables. Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	bbae2d1eda	decode: Index minor op table with insn bits for opcode 31 This changes decode_op_31_array from being indexed by a ppc_insn_t (which is derived from the instruction word by a whole series of if/elsif statements) to being indexed directly by bits 10...1 of the instruction word. With this we no longer need ppc_insn. This then means that the decode1 stage doesn't distinguish between mfcr and mfocrf, or between mtcrf and mtocrf, since those are distinguished by the value in bit 20 of the instruction. To accommodate that, execute1 changes so that the one op value (OP_MFCR) does either the mfcr or the mfocrf behaviour depending on bit 20 of the instruction word; and similarly for mtcrf/mtocrf. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	21d3f8a5ed	decode: Index minor op table with insn bits for opcode 30 This comprises the 64-bit rotate and mask instructions. In order to reduce the table index to 3 bits, we combine rldcl and rdlcr into a single op (OP_RLDCX), and choose the right mask at execute time based on bit 1 of the instruction word. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	00e9f801f6	decode: Index minor op table with insn bits for opcode 19 This changes the decoding of major opcode 19 from using the ppc_insn_t index to using bits of the instruction word directly. Opcode 19 has a 10-bit minor opcode field (bits 10..1) but the space is sparsely filled. Therefore we index a table of single-bit entries with the 10-bit minor opcode to filter out the illegal minor opcodes, and index a table using just 3 bits -- 5, 3 and 2 -- of the instruction to get the decode entry. This groups together all the instructions in 4 columns of the opcode map as a single entry. That means that mcrf and all the CR logical ops get grouped together, and bcctr, bclr and bctar get grouped together. At present the CR logical ops are not implemented, so their grouping has no impact. The code for bclr and bcctr in execute1 is now common, using a single op, and it now determines the branch address by looking at bit 10 of the instruction word at execute time. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	c9e92483b8	decode: Push mtspr/mfspr register decoding down into execute1 Instead of doing mfctr, mflr, mftb, mtctr, mtlr as separate ops, just pass down mfspr and mtspr ops with the spr number and let execute1 decode which SPR we're addressing. This will help reduce the number of instruction bits decode1 needs to look at. In fact we now pass down the whole instruction from decode2 to execute1. We will need more bits of the instruction in future, and the tools should just optimize away any that we don't end up using. Since the 'aa' bit was just a copy of an instruction bit, we can now remove it from the record. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	3e6f656a90	Add MCRF instruction Hopefully it's not too timing catastrophic. The variable newcrf will be handy for the other CR ops when we implement them I suspect. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	554ae88540	Implement absolute branches Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	80a0e7fcf3	execute1: simplify flush_out It's always set when f_out.redirect is set, so may as well set it once at the end. It's all combo from the register. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	b57325ce29	Merge branch 'divider' of https://github.com/paulusmack/microwatt	5 years ago
Anton Blanchard	5a6f8d26d1	Rename OP_SUBFC -> OP_SUBFE, OP_ADDC -> OP_ADDE These were somewhat badly named. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	d5bc6c8824	Add a divider unit and a testbench for it This adds a divider unit, connected to the core in much the same way that the multiplier unit is connected. The division algorithm is very simple-minded, taking 64 clock cycles for any division (even 32-bit division instructions). The decoding is simplified by making use of regularities in the instruction encoding for div* and mod* instructions. Instead of having PPC_* encodings from the first-stage decoder for each of the different div* and mod* instructions, we now just have PPC_DIV and PPC_MOD, and the inputs to the divider that indicate what sort of division operation to do are derived from instruction word bits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	6d85920068	execute1 no longer needs sim_console Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Michael Neuling	1e1b799382	Remove FIXME comment This was mistakenly left behind in `4d5abfb430` ("Remove dynamic ranges from code") Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Anton Blanchard	a2df2a10a2	Remove sim console We can force all existing code to use the UART console by passing 0 in bit zero of the sim config register. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	a8f8c54b77	Move debug execute output into decode2 This covers all units, and we avoid double printing. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	92a7152370	Rework pipeline, add stall and flush signals This adds stall and flush signals to the pipeline. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Michael Neuling	4d5abfb430	Remove dynamic ranges from code Some VHDL compilers like verific [1] don't like these, so let's remove them. Lots of random code changes, but passes make check. Also add basic script to run verific and generate verilog. 1. https://www.verific.com/ Signed-off-by: Michael Neuling <mikey@neuling.org>	5 years ago
Anton Blanchard	0fd18c2455	Add srd and srw Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	73daacbcd4	Add sim only divw Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	5a29cb4699	Initial import of microwatt Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago

1 2 3 4

156 Commits (35e0dbed345011bafb651fa7d168de550e6fd6e7)