microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	2661b9b985	decode1: Mark subfic as pipelined This seems just to have been missed in commit `f291efa266` ("decode1: Mark ALU ops using carry as pipelined"). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	0c714f1be6	execute: Move popcnt and prty instructions into the logical unit This implements logic in the logical entity to calculate the results of the popcnt* and prty* instructions. We now have one insn_type_t value for the 3 popcnt variants and one for the two prty variants, using the length field of the decode_rom_t to distinguish between them. The implementations in logical.vhdl using recursive algorithms rather than the simple functions in ppc_fx_insns.vhdl. This gives a saving of about 140 slice LUTs on the A7-100 and improves timing slightly. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d2ca625b3b	execute: Do comparisons using the main adder This handles OP_CMP like a subtraction; the main adder computes ~RA + RB + 1, and the condition codes are computed from the results. A direct comparison of the two input operands is used to calculate the EQ bit of the condition result. The LT and GT bits are computed from the MSB of the subtraction result, the carry out from the subtraction, and the MSBs of the operands. For a 32-bit comparison, the 32-bit carry and bit 31 of the result and input operands are used; for a 64-bit comparison, the 64-bit carry and bit 63 of the operands and result are used. It turns out to be more convenient to use the 'signed' field of the decode table to distinguish signed from unsigned comparisons, rather than the insn_type. Therefore this uses OP_CMP for both cmp and cmpl, which also has the benefit of reducing the number of values in insn_type_t. Doing this saves over 200 slice LUTs on the Arty A7-100 and improves timing slightly as well. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	39d18d2738	Make divider hang off the side of execute1 With this, the divider is a unit that execute1 sends operands to and which sends its results back to execute1, which then send them to writeback. Execute1 now sends a stall signal when it gets a divide or modulus instruction until it gets a valid signal back from the divider. Divide and modulus instructions are no longer marked as single-issue. The data formatting step that used to be done in decode2 for div and mod instructions is now done in execute1. We also do the absolute value operation in that same cycle instead of taking an extra cycle inside the divider for signed operations with a negative operand. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	2167186b5f	Make multiplier hang off the side of execute1 With this, the multiplier isn't a separate pipe that decode2 issues instructions to, but rather is a unit that execute1 sends operands to and which sends the result back to execute1, which then sends it to writeback. Execute1 now sends a stall signal when it gets a multiply instruction until it gets a valid signal back from the multiplier. This all means that we no longer need to mark the multiply instructions as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Tom Vijlbrief	c05441bf47	Implement CRNOR and friends Signed-off-by: Tom Vijlbrief <tvijlbrief@gmail.com>	5 years ago
Benjamin Herrenschmidt	e4f475e17f	sprs: Store common SPRs in register file This stores the most common SPRs in the register file. This includes CTR and LR and a not yet final list of others. The register file is set to 64 entries for now. Specific types are defined that can represent a GPR index (gpr_index_t) or a GPR/SPR index (gspr_index_t) along with conversion functions between the two. On order to deal with some forms of branch updating both LR and CTR, we introduced a delayed update of LR after a branch link. Note: We currently stall the pipeline on such a delayed branch, but we could avoid stalling fetch in that specific case as we know we have a branch delay. We could also limit that to the specific case where we need to update both CTR and LR. This allows us to make bcreg, mtspr and mfspr pipelined. decode1 will automatically force the single issue flag on mfspr/mtspr to a "slow" SPR. [paulus@ozlabs.org - fix direction of decode2.stall_in] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	d04887fdcd	decode1: Add OE=1 forms of add/sub, mul and div instructions Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	f291efa266	decode1: Mark ALU ops using carry as pipelined There is no reason not to that I can think of Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	742b21480e	insn: Simplistic implementation of icbi We don't yet have a proper snooper for the icache, so for now make icbi just flush the whole thing Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	a0d95e791e	insn: Implement isync instruction The instruction works by redirecting fetch to nia+4 (hopefully using the same adder used to generate LR) and doing a backflush. Along with being single issue, this should guarantee that the next instruction only gets fetched after the pipe's been emptied. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	4433118c91	Merge pull request #105 from paulusmack/writeback Writeback	5 years ago
Paul Mackerras	9646fe28b0	Do sign-extension instructions in writeback instead of execute1 This makes the exts[bhw] instructions do the sign extension in the writeback stage using the sign-extension logic there instead of having unique sign extension logic in execute1. This requires passing the data length and sign extend flag from decode2 down through execute1 and execute2 and into writeback. As a side bonus we reduce the number of values in insn_type_t by two. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	86c53aa3f7	Implement neg using OP_ADD We have all the machinery in place to implement the neg instruction as OP_ADD. Doing that means we can ditch OP_NEG, and saves about 66 slice LUTs on the A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	813f834012	Add CR hazard detection To keep things simple we treat the CR as a single entity. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	bb65d0b899	Remove issue restrictions on a number of instructions Anything that isn't a load or store and anything that doesn't read the CR can go as soon as its inputs are ready. While we could also allow SPR read/write and carry read/write, we plan to change them to be read in decode2 and written in writeback soon and they will need separate hazard detection to be added. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	10a990bba8	mod* doesn't have an RC form The RC bit should be ignored for mod* instructions. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	c41da84226	decode: Handle icbi We will need a proper handler for icbi, but in the meantime treat it as a nop. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	24a4a796ce	execute: Consolidate count-leading/trailing-zeroes implementations This adds combinatorial logic that does 32-bit and 64-bit count leading and trailing zeroes in one unit, and consolidates the four instructions under a single OP_CNTZ opcode. This saves 84 slice LUTs on the Arty A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	b8fb721b81	Consolidate logical instructions Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common invert on the input and output. This saves us about 200 LUTs. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	f7c393ba7e	Add a rotate/mask/shift unit and use it in execute1 This adds a new entity 'rotator' which contains combinatorial logic for rotating and masking 64-bit values. It implements the operations of the rlwinm, rlwnm, rlwimi, rldicl, rldicr, rldic, rldimi, rldcl, rldcr, sld, slw, srd, srw, srad, sradi, sraw and srawi instructions. It consists of a 3-stage 64-bit rotator using 4:1 multiplexors at each stage, two mask generators, output logic and control logic. The insn_type_t values used for these instructions have been reduced to just 5: OP_RLC, OP_RLCL and OP_RLCR for the rotate and mask instructions (clear both left and right, clear left, clear right variants), OP_SHL for left shifts, and OP_SHR for right shifts. The control signals for the rotator are derived from the opcode and from the is_32bit and is_signed fields of the decode_rom_t. The rotator is instantiated as an entity in execute1 so that we can be sure we only have one of it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	90b6e27380	Generalize the mul_32bit and mul_signed fields of decode_rom_t This changes the names of the mul_32bit and mul_signed fields of decode_rom_t to is_32bit and is_signed, so they can be used with other types of operations besides multiplies. This plumbs the is_32bit and is_signed flags down into execute1, though they are not used at this point. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	7fe84220a5	decode: Avoid multiplexing from instruction reg fields to regfile address ports This aims to simplify the logic between the instruction image and the register file read address ports and reduce the size of the decode tables. With this patch, the input_reg_a column of the decode tables can only select RA or zeroes, the input_reg_b column can only select RB or a constant (0, -1, or an immediate value from the instruction), and the input_reg_c columns can only select RS or zeroes. That means that the rotate/shift/logical ops now have their first input coming in via the input_reg_c column. That means we need to add a read_data3 field to the Decode2ToExecuteType record, but that will go away again when we split out the rotate/mask/logical ops to their own unit. As a related but not tightly connected change, this patch also sets the read1_enable signal to the register file be 0 when RA=0 and the input_reg_a for the instruction is RA_OR_ZERO (previously it was 1). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	96b402a4bf	Consolidate add/subtract instructions into a single op All of the PPC add and subtract instructions, including carrying and extended versions, do much the same arithmetic operation: result = (I xor A) + B + C where A is the value from RA, I provides a logical inversion of A (i.e. I is 0 or -1), B is either from RB or is a constant 0 or -1, and C is 0, 1 or the carry bit from XER (CA). To consolidate all the add/subtract instructions into a single OP_ADD, we add a column to decode_rom_t to indicate when A should be inverted, and change the input_carry field to a 3-state selector to select C in the equation above. This also adds a new "CONST_M1" value for input_reg_b_t to indicate that B is a constant -1. This allows us to implement addme and subfme. The addex instruction appears not to exist, so the comments referring to it are removed. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	b0f302ecf4	decode: Make all update-form indexed loads and stores use RA_OR_ZERO Experimentation on POWER9 indicates that the invalid form of lbzux with RA=0 uses just RB as the address, not R0 + RB. Extrapolating this to all update-form loads and stores with RA=0, change all the update-form loads and stores to use RA_OR_ZERO rather than RA. This then means that all decode ROM entries with insn_type = LDST have input_reg_a = RA_OR_ZERO. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	58b06eb5f3	decode: Remove const fields from decode_rom_t The const* fields of decode_rom_t drove multiplexers in decode2 that picked out various instruction fields and put them into the const* fields of the Decode2ToExecute1Type record, from where they were used in execute1. However, the code in execute1 can just as easily use the appropriate fields of the original instruction word, since that is now available in execute1. This therefore changes the code to do that, resulting in smaller decode tables. Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	143d0ae9e4	decode: Fix larx/stcx instructions to use RA_OR_ZERO not RA The l?arx and st?cx. instructions are defined to use the normal indexed mode address calculations, i.e. (RA\|0) + RB. Fix their entries in the decode table to say RA_OR_ZERO rather than RA. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	bbae2d1eda	decode: Index minor op table with insn bits for opcode 31 This changes decode_op_31_array from being indexed by a ppc_insn_t (which is derived from the instruction word by a whole series of if/elsif statements) to being indexed directly by bits 10...1 of the instruction word. With this we no longer need ppc_insn. This then means that the decode1 stage doesn't distinguish between mfcr and mfocrf, or between mtcrf and mtocrf, since those are distinguished by the value in bit 20 of the instruction. To accommodate that, execute1 changes so that the one op value (OP_MFCR) does either the mfcr or the mfocrf behaviour depending on bit 20 of the instruction word; and similarly for mtcrf/mtocrf. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	21d3f8a5ed	decode: Index minor op table with insn bits for opcode 30 This comprises the 64-bit rotate and mask instructions. In order to reduce the table index to 3 bits, we combine rldcl and rdlcr into a single op (OP_RLDCX), and choose the right mask at execute time based on bit 1 of the instruction word. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	00e9f801f6	decode: Index minor op table with insn bits for opcode 19 This changes the decoding of major opcode 19 from using the ppc_insn_t index to using bits of the instruction word directly. Opcode 19 has a 10-bit minor opcode field (bits 10..1) but the space is sparsely filled. Therefore we index a table of single-bit entries with the 10-bit minor opcode to filter out the illegal minor opcodes, and index a table using just 3 bits -- 5, 3 and 2 -- of the instruction to get the decode entry. This groups together all the instructions in 4 columns of the opcode map as a single entry. That means that mcrf and all the CR logical ops get grouped together, and bcctr, bclr and bctar get grouped together. At present the CR logical ops are not implemented, so their grouping has no impact. The code for bclr and bcctr in execute1 is now common, using a single op, and it now determines the branch address by looking at bit 10 of the instruction word at execute time. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	e30a87593a	decode: Start moving towards decoding by major opcode first With this, we have a table for most major opcodes and separate tables for each major opcode that has further decoding required. These tables are still mostly indexed by the ppc_insn_t values, however. A few things are still decoded completely at the top level: nop, attn and sim_config. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	c9e92483b8	decode: Push mtspr/mfspr register decoding down into execute1 Instead of doing mfctr, mflr, mftb, mtctr, mtlr as separate ops, just pass down mfspr and mtspr ops with the spr number and let execute1 decode which SPR we're addressing. This will help reduce the number of instruction bits decode1 needs to look at. In fact we now pass down the whole instruction from decode2 to execute1. We will need more bits of the instruction in future, and the tools should just optimize away any that we don't end up using. Since the 'aa' bit was just a copy of an instruction bit, we can now remove it from the record. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	3e6f656a90	Add MCRF instruction Hopefully it's not too timing catastrophic. The variable newcrf will be handy for the other CR ops when we implement them I suspect. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Benjamin Herrenschmidt	554ae88540	Implement absolute branches Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	b57325ce29	Merge branch 'divider' of https://github.com/paulusmack/microwatt	5 years ago
Anton Blanchard	5a6f8d26d1	Rename OP_SUBFC -> OP_SUBFE, OP_ADDC -> OP_ADDE These were somewhat badly named. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	d5bc6c8824	Add a divider unit and a testbench for it This adds a divider unit, connected to the core in much the same way that the multiplier unit is connected. The division algorithm is very simple-minded, taking 64 clock cycles for any division (even 32-bit division instructions). The decoding is simplified by making use of regularities in the instruction encoding for div* and mod* instructions. Instead of having PPC_* encodings from the first-stage decoder for each of the different div* and mod* instructions, we now just have PPC_DIV and PPC_MOD, and the inputs to the divider that indicate what sort of division operation to do are derived from instruction word bits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Benjamin Herrenschmidt	98f0994698	Add core debug module This module adds some simple core controls: reset, stop, start, step along with icache clear and reading the NIA and core status bits Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org	5 years ago
Anton Blanchard	7bb88d5321	Merge pull request #59 from antonblanchard/trap-decode Fix make check	5 years ago
Anton Blanchard	427effdaa9	Fix make check We need to finish support for all the trap instructions, but for now we at least need a decode entry for tw, so we know to stall until the previous instruction completes. Some of our test cases were failing because the trap executed before the previous instruction completed. All these trap instructions need to be resolved at completion, not in execute. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	9867fb6149	Add a decode for the nop instruction We want these to go out without any GPR dependencies, so add a specific entry in decode for them. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	acdb2ea157	No need to gate nia or insn in decode1 Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	b9e28598b4	Explicitly check against '1' in if statements nvc doesn't like what I think is a VHDL 2008 construct. Lets just check against '1' explicitly. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	a2df2a10a2	Remove sim console We can force all existing code to use the UART console by passing 0 in bit zero of the sim config register. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	92a7152370	Rework pipeline, add stall and flush signals This adds stall and flush signals to the pipeline. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	9687034d78	Add a decode bit to mark an instruction as single through the pipeline This is used by the pipelining patches. Mark everyone as single through the pipeline to start. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Benjamin Herrenschmidt	b0ade2857f	decode1 array fix header Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	9fbaea6f08	Rework CR file and add forwarding Handle the CR as a single field with per nibble enables. Forward any writes in the same cycle. If this proves to be an issue for timing, we may want to revisit this in the future. For now, it keeps things simple. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Anton Blanchard	5a29cb4699	Initial import of microwatt Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago

49 Commits (a2bf039a70ac2cb4c05035b9a6a75bb7b8ccc7cc)