microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	1c4b5def36	Improve timing of redirect_nia going from writeback to fetch1 This gets rid of the adder in writeback that computes redirect_nia. Instead, the main adder in the ALU is used to compute the branch target for relative branches. We now decode b and bc differently depending on the AA field, generating INSN_brel, INSN_babs, INSN_bcrel or INSN_bcabs as appropriate. Each one has a separate entry in the decode table in decode1; the *rel versions use CIA as the A input. The bclr/bcctr/bctar and rfid instructions now select ramspr_result for the main result mux to get the redirect address into ex1.e.write_data. For branches which are predicted taken but not actually taken, we need to redirect to the following instruction. We also need to do that for isync. We do this in the execute2 stage since whether or not to do it depends on the branch result. The next_nia computation is moved to the execute2 stage and comes in via a new leg on the secondary result multiplexer, making next_nia available ultimately in ex2.e.write_data. This also means that the next_nia leg of the primary result multiplexer is gone. Incrementing last_nia by 4 for sc (so that SRR0 points to the following instruction) is also moved to execute2. Writing CIA+4 to LR was previously done through the main result multiplexer. Now it comes in explicitly in the ramspr write logic. Overall this removes the br_offset and abs_br fields and the logic to add br_offset and next_nia, and one leg of the primary result multiplexer, at the cost of a few extra control signals between execute1 and execute2 and some multiplexing for the ramspr write side and an extra input on the secondary result multiplexer. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	7 months ago
Paul Mackerras	06ff486567	icache: Restore primary opcode to instruction word The icache stores a predecoded insn_code value for each instruction, and so as to fit in 36 bits, omits the primary opcode (the most significant 6 bits) of each instruction. Previously, for valid instructions, the primary opcode field of the instruction delivered to decode1 was a part-representation of the insn_code value rather than the actual primary opcode. This adds a lookup table to compute the primary opcode from the insn_code and deliver it in the instruction words supplied to decode1. In order that each insn_code can be associated with a single primary opcode value, the various no-operation instructions with primary opcode 31 (the reserved no-ops and dss, dst and dstst) have been given a new insn_code, INSN_rnop, leaving INSN_nop for the preferred no-op (ori r0,r0,0). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	7 months ago
Paul Mackerras	b50170cd1d	Implement byte reversal instructions This implements the byte-reverse halfword, word and doubleword instructions: brh, brw, and brd. These instructions were added to the ISA in version 3.1. They use a new OP_BREV insn_type value. The logic for these instructions is implemented in logical.vhdl. In order to avoid going over 64 insn_type values, OP_AND and OP_OR were combined into OP_LOGIC, which is like OP_AND except that the RS input can be inverted as well as the RB input. The various forms of OR instruction are then implemented using the identity a OR b = NOT (NOT a AND NOT b) The 'is_signed' field of the instruction decode table is used to indicate that RS should be inverted. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	7 months ago
Paul Mackerras	39ca675ce3	Decode prefixed instructions This adds logic to do basic decoding of the prefixed instructions defined in PowerISA v3.1B which are in the SFFS (Scalar Fixed plus Floating-Point Subset) compliancy subset. In PowerISA v3.1B SFFS, there are 14 prefixed load/store instructions plus the prefixed no-op instruction (pnop). The prefixed load/store instructions all use an extended version of D-form, which has an extra 18 bits of displacement in the prefix, plus an 'R' bit which enables PC-relative addressing. When decode1 sees an instruction word where the insn_code is INSN_prefix (i.e. the primary opcode was 1), it stores the prefix word and sends nothing down to decode2 in that cycle. When the next valid instruction word arrives, it is interpreted as a suffix, meaning that its insn_code gets modified before being used to look up the decode table. The insn_code values are rearranged so that the values for instructions which are the suffix of a valid prefixed instruction are all at even indexes, and the corresponding prefixed instructions follow immediately, so that an insn_code value can be converted to the corresponding prefixed value by setting the LSB of the insn_code value. There are two prefixed instructions, pld and pstd, for which the suffix is not a valid SFFS instruction by itself, so these have been given dummy insn_code values which decode as illegal (INSN_op57 and INSN_op61). For a prefixed instruction, decode1 examines the type and subtype fields of the prefix and checks that the suffix is valid for the type and subtype. This check doesn't affect which entry of the decode table is used; the result is passed down to decode2, and will in future be acted upon in execute1. The instruction address passed down to decode2 is the address of the prefix. To enable this, part of the instruction address is saved when the prefix is seen, and then the instruction address received from icache is partly overlaid by the saved prefix address. Because prefixed instructions are not permitted to cross 64-byte boundaries, we only need to save bits 5:2 of the instruction to do this. If the alignment restriction ever gets relaxed, we will then need to save more bits of the address. Decode2 has been extended to handle the R bit of the prefix (in 8LS and MLS forms) and to be able to generate the 34-bit immediate value from the prefix and suffix. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	10 months ago
Paul Mackerras	7af0e001ad	Move insn_codes for mcrfs, mtfsb0/1 and mtfsfi This moves the insn_code values for mcrfs, mtfsb0/1 and mtfsfi into the region used for floating-point instructions. This means that in no-FPU implementations, they will get turned into illegal instructions in predecode. We then don't need the code in execute1 that makes FP instructions illegal in no-FPU implementations. We also remove the NONE value for unit_t, since it was only ever used with insn_type = OP_ILLEGAL, and the check for unit = NONE was redundant with the check for insn_type = OP_ILLEGAL. Thus the check for unit = NONE is no longer needed and is removed here. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	10 months ago
Paul Mackerras	58e799b350	predecode: Add more comments to row_predecode_rom and insn_code values This adds comments to row_predecode_rom to aid understanding how the columns in the second half of the table are allocated to different primary opcodes, and to the insn_code values to assist in locating the code with a given numeric value. No code change. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	26dc1e879c	Eliminate use of primary opcode outside of decode1 This changes code that previously looked at the primary opcode (bits 26 to 31) of the instruction to use other methods, in places other than in stage0 of decode1. * Extend rc_t to have a new value, RCOE, indicating that the instruction has both Rc and OE bits. * Decode2 now tells execute1 whether the instruction has a third operand, used for distinguishing between multiply and multiply-add instructions. * The invert_a field of the decode ROM is overloaded for load/store instructions to indicate cache-inhibited loads and stores. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	c9aea45ffe	decode1: Divide insn_code values into ranges to indicate register usage This lets us compute r_out.reg_*_addr and r_out.read_2_enable values without needing access to the primary opcode value. We also have that non-FP instructions are < 256. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	c3ee10f013	decode1: Split instruction decoding into two steps This reduces the block RAM requirements for instruction decoding by splitting it into two steps. The first, in a new pipeline stage called decode0 (implemented by code in decode1.vhdl) maps the instruction to a 9-bit instruction code using major and row decode ROMs. The second maps the 9-bit code to the final decode_rom_t (about 44 bits wide). Branch prediction done in decode is now done in decode0 rather than decode1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	932da4c114	FPU: Simplify IDLE state code Do more decoding of the instruction ahead of the IDLE state processing so that the IDLE state code becomes much simpler. To make the decoding easier, we now use four insn_type_t codes for floating-point operations rather than two. This also rearranges the insn_type_t values a little to get the 4 FP opcode values to differ only in the bottom 2 bits, and put OP_DIV, OP_DIVE and OP_MOD next to them. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	fdb3ef6874	Finish off taking SPRs out of register file With this, the register file now contains 64 entries, for 32 GPRs and 32 FPRs, rather than the 128 it had previously. Several things get simplified - decode1 no longer has to work out the ispr{1,2,o} values, decode_input_reg_{a,b,c} no longer have the t = SPR case, etc. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	c9e838b656	Remove support for lq, stq, lqarx and stqcx. They are optional in SFFS (scalar fixed-point and floating-point subset), are not needed for running Linux, and add complexity, so remove them. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	4c61a71a62	core: Crack update-form loads into two internal ops This uses the instruction-doubling machinery to send load with update instructions down to loadstore1 as two separate ops, rather than one op with two destinations. This will help to simplify the value tracking mechanisms. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	89a67a18d0	decode: Add a facility field to the instruction decode tables This makes it simpler to work out when to deliver a FPU unavailable interrupt. This also means we can get rid of the OP_FPLOAD and OP_FPSTORE insn_type values. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	4b2c23703c	core: Implement quadword loads and stores This implements the lq, stq, lqarx and stqcx. instructions. These instructions all access two consecutive GPRs; for example the "lq %r6,0(%r3)" instruction will load the doubleword at the address in R3 into R7 and the doubleword at address R3 + 8 into R6. To cope with having two GPR sources or destinations, the instruction gets repeated at the decode2 stage, that is, for each lq/stq/lqarx/stqcx. coming in from decode1, two instructions get sent out to execute1. For these instructions, the RS or RT register gets modified on one of the iterations by setting the LSB of the register number. In LE mode, the first iteration uses RS\|1 or RT\|1 and the second iteration uses RS or RT. In BE mode, this is done the other way around. In order for decode2 to know what endianness is currently in use, we pass the big_endian flag down from icache through decode1 to decode2. This is always in sync with what execute1 is using because only rfid or an interrupt can change MSR[LE], and those operations all cause a flush and redirect. There is now an extra column in the decode tables in decode1 to indicate whether the instruction needs to be repeated. Decode1 also enforces the rule that lq with RT = RT and lqarx with RA = RT or RB = RT are illegal. Decode2 now passes a 'repeat' flag and a 'second' flag to execute1, and execute1 passes them on to loadstore1. The 'repeat' flag is set for both iterations of a repeated instruction, and 'second' is set on the second iteration. Execute1 does not take asynchronous or trace interrupts on the second iteration of a repeated instruction. Loadstore1 uses 'next_addr' for the second iteration of a repeated load/store so that we access the second doubleword of the memory operand. Thus loadstore1 accesses the doublewords in increasing memory order. For 16-byte loads this means that the first iteration writes GPR RT\|1. It is possible that RA = RT\|1 (this is a legal but non-preferred form), meaning that if the memory operand was misaligned, the first iteration would overwrite RA but then the second iteration might take a page fault, leading to corrupted state. To avoid that possibility, 16-byte loads in LE mode take an alignment interrupt if the operand is not 16-byte aligned. (This is the case anyway for lqarx, and we enforce it for lq as well.) Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	e6a5f237bc	FPU: Implement fmul[s] This implements the fmul and fmuls instructions. For fmul[s] with denormalized operands we normalize the inputs before doing the multiplication, to eliminate the need for doing count-leading-zeroes on P. This adds 3 or 5 cycles to the execution time when one or both operands are denormalized. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b628af6176	FPU: Implement fmr and related instructions This implements fmr, fneg, fabs, fnabs and fcpsgn and adds tests for them. This adds logic to unpack and repack floating-point data from the 64-bit packed form (as stored in memory and the register file) into the unpacked form in the fpr_reg_type record. This is not strictly necessary for fmr et al., but will be useful for when we do actual arithmetic. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	856e9e955f	core: Add framework for an FPU This adds the skeleton of a floating-point unit and implements the mffs and mtfsf instructions. Execute1 sends FP instructions to the FPU and receives busy, exception, FP interrupt and illegal interrupt signals from it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	45cd8f4fc3	core: Add support for floating-point loads and stores This extends the register file so it can hold FPR values, and implements the FP loads and stores that do not require conversion between single and double precision. We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores cause a FP unavailable interrupt if MSR[FP] = 0. The FPU facilities are optional and their presence is controlled by the HAS_FPU generic passed down from the top-level board file. It defaults to true for all except the A7-35 boards. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	83816cb9e3	core: Implement BCD Assist instructions addg6s, cdtbcd, cbcdtod To avoid adding too much logic, this moves the adder used by OP_ADD out of the case statement in execute1.vhdl so that the result can be used by OP_ADDG6S as well. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	5fafdc56ef	core: Implement the addex instruction The addex instruction is like adde but uses the XER[OV] bit for the carry in and out rather than XER[CA]. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	290b05f97d	core: Implement the maddhd, maddhdu and maddld instructions These instructions use major opcode 4 and have a third GPR input operand, so we need a decode table for major opcode 4 and some plumbing to get the RC register operand read. The multiply-add instructions use the same insn_type_t values as the regular multiply instructions, and we distinguish in execute1 by looking at the major opcode. This turns out to be convenient because we don't have to add any cases in the code that handles the output of the multiplier, and it frees up some insn_type_t values. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	fa77a6f683	core: Implement the mcrxrx instruction This also removes OP_MCRXR, as the mcrxr instruction was removed in version 3.0B of the Power ISA, having been phased-out for the server architecture since v2.02. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4a4a98d4b9	core: Do addpcis using the main adder (#189 ) By adding logic to decode2 to be able to send the instruction address down the A input, and making CONST_DX_HI (renamed to CONST_DXHI4) add 4 to the immediate value (easy since the bottom 16 bits were zero), we can do addpcis using the main adder. This reduces the width of the result mux and frees up one value in insn_type_t, since we can now use OP_ADD for addpcis. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Shawn Anastasio	e606772aeb	Implement the addpcis instruction This commit adds support for the addpcis instruction from ISA 3.0. A new input_reg_b_t type, CONST_DX_HI, was added to support the shifted immediate value used in DX-Form instructions. Signed-off-by: Shawn Anastasio <shawn@anastas.io>	4 years ago
Paul Mackerras	3d4712ad43	Add TLB to icache This adds a direct-mapped TLB to the icache, with 64 entries by default. Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along with redirects to indicate whether instruction addresses should be translated through the TLB, and fetch1 sends that on to icache. Similarly a "priv_mode" signal is sent to indicate the privilege mode for instruction fetches. This means that changes to MSR[IR] or MSR[PR] don't take effect until the next redirect, meaning an isync, rfid, branch, etc. The icache uses a hash of the effective address (i.e. next instruction address) to index the TLB. The hash is an XOR of three fields of the address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and 24--29 of the address. TLB invalidations simply invalidate the indexed TLB entry without checking the contents. If the icache detects a TLB miss with virt_mode=1, it will send a fetch_failed indication through fetch2 to decode1, which will turn it into a special OP_FETCH_FAILED opcode with unit=LDST. That will get sent down to loadstore1 which will currently just raise a Instruction Storage Interrupt (0x400) exception. One bit in the PTE obtained from the TLB is used to check whether an instruction access is allowed -- the privilege bit (bit 3). If bit 3 is 1 and priv_mode=0, then a fetch_failed indication is sent down to fetch2 and to decode1, which generates an OP_FETCH_FAILED. Any PTEs with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put into the iTLB since such PTEs would not allow execution by any context. Tlbie operations get sent from mmu to icache over a new connection. Unfortunately the privileged instruction tests are broken for now. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	750b3a8e28	dcache: Implement data TLB This adds a TLB to dcache, providing the ability to translate addresses for loads and stores. No protection mechanism has been implemented yet. The MSR_DR bit controls whether addresses are translated through the TLB. The TLB is a fixed-pagesize, set-associative cache. Currently the page size is 4kB and the TLB is 2-way set associative with 64 entries per set. This implements the tlbie instruction. RB bits 10 and 11 control whether the whole TLB is invalidated (if either bit is 1) or just a single entry corresponding to the effective page number in bits 12-63 of RB. As an extension until we get a hardware page table walk, a tlbie instruction with RB bits 9-11 set to 001 will load an entry into the TLB. The TLB entry value is in RS in the format of a radix PTE. Currently there is no proper handling of TLB misses. The load or store will not be performed but no interrupt is generated. In order to make timing at 100MHz on the Arty A7-100, we compare the real address from each way of the TLB with the tag from each way of the cache in parallel (requiring # TLB ways * # cache ways comparators). Then the result is selected based on which way hit in the TLB. That avoids a timing path going through the TLB EA comparators, the multiplexer that selects the RA, and the cache tag comparators. The hack where addresses of the form 0xc------- are marked as cache-inhibited is kept for now but restricted to real-mode accesses. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	278ac5e0eb	Remove sim_config instruction It's not used any more, and it's not in the ISA. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	381149b2cc	Consolidate trap variants under a single OP_TRAP This replaces OP_TD, OP_TDI, OP_TW and OP_TWI with a single OP_TRAP, distinguishing the cases by the input_reg_b and is_32bit fields of the decode ROM. This adds the twi and td cases to the decode tables. For now we make all of the trap instructions unconditionally generate a trap-type program interrupt if the TO field of the instruction is all ones, and do nothing otherwise. This reduces the number of values in insn_type_t from 65 to 62, meaning that an insn_type_t can now be encoded in 6 bits rather than 7. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	fe077a116a	Rename OP_MCRF to OP_CROP and trim insn_type_t OP_MCRF covers the CR logical ops as well as mcrf since commit `c05441bf47` ("Implement CRNOR and friends"), so this renames OP_MCRF to OP_CROP. The OP_* values for the individual CR logical ops (OP_CRAND, etc.) are not used, so remove them from insn_type_t. No functional change. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Michael Neuling	5ef5604b65	Add sc, illegal and decrementer exceptions and some supervisor state This adds the following exceptions: - 0x700 program check (for illegal instructions) - 0x900 decrementer - 0xc00 system call This also adds some supervisor state: - decremeter - msr (SPRG0/1 and SRR0/1 already exist as fast SPRs) It also adds some supporting instructions: - rfid - mtmsrd - mfmsr - sc MSR state is added but only EE is used in this patch set. Other bits are read/written but are not used at all. This adds a 2 stage state machine to execute1.vhdl. This state machine allows fast SPRS SRR0/1 to be written in different cycles. This state machine can be extended later to add DAR and DSISR SPR writing for more complex exceptions like page faults. Signed-off-by: Michael Neuling <mikey@neuling.org>	4 years ago
Paul Mackerras	0c714f1be6	execute: Move popcnt and prty instructions into the logical unit This implements logic in the logical entity to calculate the results of the popcnt* and prty* instructions. We now have one insn_type_t value for the 3 popcnt variants and one for the two prty variants, using the length field of the decode_rom_t to distinguish between them. The implementations in logical.vhdl using recursive algorithms rather than the simple functions in ppc_fx_insns.vhdl. This gives a saving of about 140 slice LUTs on the A7-100 and improves timing slightly. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	d2ca625b3b	execute: Do comparisons using the main adder This handles OP_CMP like a subtraction; the main adder computes ~RA + RB + 1, and the condition codes are computed from the results. A direct comparison of the two input operands is used to calculate the EQ bit of the condition result. The LT and GT bits are computed from the MSB of the subtraction result, the carry out from the subtraction, and the MSBs of the operands. For a 32-bit comparison, the 32-bit carry and bit 31 of the result and input operands are used; for a 64-bit comparison, the 64-bit carry and bit 63 of the operands and result are used. It turns out to be more convenient to use the 'signed' field of the decode table to distinguish signed from unsigned comparisons, rather than the insn_type. Therefore this uses OP_CMP for both cmp and cmpl, which also has the benefit of reducing the number of values in insn_type_t. Doing this saves over 200 slice LUTs on the Arty A7-100 and improves timing slightly as well. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	39d18d2738	Make divider hang off the side of execute1 With this, the divider is a unit that execute1 sends operands to and which sends its results back to execute1, which then send them to writeback. Execute1 now sends a stall signal when it gets a divide or modulus instruction until it gets a valid signal back from the divider. Divide and modulus instructions are no longer marked as single-issue. The data formatting step that used to be done in decode2 for div and mod instructions is now done in execute1. We also do the absolute value operation in that same cycle instead of taking an extra cycle inside the divider for signed operations with a negative operand. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	2167186b5f	Make multiplier hang off the side of execute1 With this, the multiplier isn't a separate pipe that decode2 issues instructions to, but rather is a unit that execute1 sends operands to and which sends the result back to execute1, which then sends it to writeback. Execute1 now sends a stall signal when it gets a multiply instruction until it gets a valid signal back from the multiplier. This all means that we no longer need to mark the multiply instructions as single-issue. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	e4f475e17f	sprs: Store common SPRs in register file This stores the most common SPRs in the register file. This includes CTR and LR and a not yet final list of others. The register file is set to 64 entries for now. Specific types are defined that can represent a GPR index (gpr_index_t) or a GPR/SPR index (gspr_index_t) along with conversion functions between the two. On order to deal with some forms of branch updating both LR and CTR, we introduced a delayed update of LR after a branch link. Note: We currently stall the pipeline on such a delayed branch, but we could avoid stalling fetch in that specific case as we know we have a branch delay. We could also limit that to the specific case where we need to update both CTR and LR. This allows us to make bcreg, mtspr and mfspr pipelined. decode1 will automatically force the single issue flag on mfspr/mtspr to a "slow" SPR. [paulus@ozlabs.org - fix direction of decode2.stall_in] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	797b1bb045	decode: Reformat decode_types.vhdl Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	4433118c91	Merge pull request #105 from paulusmack/writeback Writeback	5 years ago
Paul Mackerras	9646fe28b0	Do sign-extension instructions in writeback instead of execute1 This makes the exts[bhw] instructions do the sign extension in the writeback stage using the sign-extension logic there instead of having unique sign extension logic in execute1. This requires passing the data length and sign extend flag from decode2 down through execute1 and execute2 and into writeback. As a side bonus we reduce the number of values in insn_type_t by two. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	86c53aa3f7	Implement neg using OP_ADD We have all the machinery in place to implement the neg instruction as OP_ADD. Doing that means we can ditch OP_NEG, and saves about 66 slice LUTs on the A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	24a4a796ce	execute: Consolidate count-leading/trailing-zeroes implementations This adds combinatorial logic that does 32-bit and 64-bit count leading and trailing zeroes in one unit, and consolidates the four instructions under a single OP_CNTZ opcode. This saves 84 slice LUTs on the Arty A7-100. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Anton Blanchard	b8fb721b81	Consolidate logical instructions Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common invert on the input and output. This saves us about 200 LUTs. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Paul Mackerras	f7c393ba7e	Add a rotate/mask/shift unit and use it in execute1 This adds a new entity 'rotator' which contains combinatorial logic for rotating and masking 64-bit values. It implements the operations of the rlwinm, rlwnm, rlwimi, rldicl, rldicr, rldic, rldimi, rldcl, rldcr, sld, slw, srd, srw, srad, sradi, sraw and srawi instructions. It consists of a 3-stage 64-bit rotator using 4:1 multiplexors at each stage, two mask generators, output logic and control logic. The insn_type_t values used for these instructions have been reduced to just 5: OP_RLC, OP_RLCL and OP_RLCR for the rotate and mask instructions (clear both left and right, clear left, clear right variants), OP_SHL for left shifts, and OP_SHR for right shifts. The control signals for the rotator are derived from the opcode and from the is_32bit and is_signed fields of the decode_rom_t. The rotator is instantiated as an entity in execute1 so that we can be sure we only have one of it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	90b6e27380	Generalize the mul_32bit and mul_signed fields of decode_rom_t This changes the names of the mul_32bit and mul_signed fields of decode_rom_t to is_32bit and is_signed, so they can be used with other types of operations besides multiplies. This plumbs the is_32bit and is_signed flags down into execute1, though they are not used at this point. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	7fe84220a5	decode: Avoid multiplexing from instruction reg fields to regfile address ports This aims to simplify the logic between the instruction image and the register file read address ports and reduce the size of the decode tables. With this patch, the input_reg_a column of the decode tables can only select RA or zeroes, the input_reg_b column can only select RB or a constant (0, -1, or an immediate value from the instruction), and the input_reg_c columns can only select RS or zeroes. That means that the rotate/shift/logical ops now have their first input coming in via the input_reg_c column. That means we need to add a read_data3 field to the Decode2ToExecuteType record, but that will go away again when we split out the rotate/mask/logical ops to their own unit. As a related but not tightly connected change, this patch also sets the read1_enable signal to the register file be 0 when RA=0 and the input_reg_a for the instruction is RA_OR_ZERO (previously it was 1). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	96b402a4bf	Consolidate add/subtract instructions into a single op All of the PPC add and subtract instructions, including carrying and extended versions, do much the same arithmetic operation: result = (I xor A) + B + C where A is the value from RA, I provides a logical inversion of A (i.e. I is 0 or -1), B is either from RB or is a constant 0 or -1, and C is 0, 1 or the carry bit from XER (CA). To consolidate all the add/subtract instructions into a single OP_ADD, we add a column to decode_rom_t to indicate when A should be inverted, and change the input_carry field to a 3-state selector to select C in the equation above. This also adds a new "CONST_M1" value for input_reg_b_t to indicate that B is a constant -1. This allows us to implement addme and subfme. The addex instruction appears not to exist, so the comments referring to it are removed. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	58b06eb5f3	decode: Remove const fields from decode_rom_t The const* fields of decode_rom_t drove multiplexers in decode2 that picked out various instruction fields and put them into the const* fields of the Decode2ToExecute1Type record, from where they were used in execute1. However, the code in execute1 can just as easily use the appropriate fields of the original instruction word, since that is now available in execute1. This therefore changes the code to do that, resulting in smaller decode tables. Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	bbae2d1eda	decode: Index minor op table with insn bits for opcode 31 This changes decode_op_31_array from being indexed by a ppc_insn_t (which is derived from the instruction word by a whole series of if/elsif statements) to being indexed directly by bits 10...1 of the instruction word. With this we no longer need ppc_insn. This then means that the decode1 stage doesn't distinguish between mfcr and mfocrf, or between mtcrf and mtocrf, since those are distinguished by the value in bit 20 of the instruction. To accommodate that, execute1 changes so that the one op value (OP_MFCR) does either the mfcr or the mfocrf behaviour depending on bit 20 of the instruction word; and similarly for mtcrf/mtocrf. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	21d3f8a5ed	decode: Index minor op table with insn bits for opcode 30 This comprises the 64-bit rotate and mask instructions. In order to reduce the table index to 3 bits, we combine rldcl and rdlcr into a single op (OP_RLDCX), and choose the right mask at execute time based on bit 1 of the instruction word. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	00e9f801f6	decode: Index minor op table with insn bits for opcode 19 This changes the decoding of major opcode 19 from using the ppc_insn_t index to using bits of the instruction word directly. Opcode 19 has a 10-bit minor opcode field (bits 10..1) but the space is sparsely filled. Therefore we index a table of single-bit entries with the 10-bit minor opcode to filter out the illegal minor opcodes, and index a table using just 3 bits -- 5, 3 and 2 -- of the instruction to get the decode entry. This groups together all the instructions in 4 columns of the opcode map as a single entry. That means that mcrf and all the CR logical ops get grouped together, and bcctr, bclr and bctar get grouped together. At present the CR logical ops are not implemented, so their grouping has no impact. The code for bclr and bcctr in execute1 is now common, using a single op, and it now determines the branch address by looking at bit 10 of the instruction word at execute time. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago

1 2

60 Commits (master)