Commit Graph

60 Commits (master)

Author SHA1 Message Date
Paul Mackerras 1c4b5def36 Improve timing of redirect_nia going from writeback to fetch1
This gets rid of the adder in writeback that computes redirect_nia.
Instead, the main adder in the ALU is used to compute the branch
target for relative branches.  We now decode b and bc differently
depending on the AA field, generating INSN_brel, INSN_babs, INSN_bcrel
or INSN_bcabs as appropriate.  Each one has a separate entry in the
decode table in decode1; the *rel versions use CIA as the A input.
The bclr/bcctr/bctar and rfid instructions now select ramspr_result
for the main result mux to get the redirect address into
ex1.e.write_data.

For branches which are predicted taken but not actually taken, we need
to redirect to the following instruction.  We also need to do that for
isync.  We do this in the execute2 stage since whether or not to do it
depends on the branch result.  The next_nia computation is moved to
the execute2 stage and comes in via a new leg on the secondary result
multiplexer, making next_nia available ultimately in ex2.e.write_data.
This also means that the next_nia leg of the primary result
multiplexer is gone.  Incrementing last_nia by 4 for sc (so that SRR0
points to the following instruction) is also moved to execute2.

Writing CIA+4 to LR was previously done through the main result
multiplexer.  Now it comes in explicitly in the ramspr write logic.

Overall this removes the br_offset and abs_br fields and the logic to
add br_offset and next_nia, and one leg of the primary result
multiplexer, at the cost of a few extra control signals between
execute1 and execute2 and some multiplexing for the ramspr write side
and an extra input on the secondary result multiplexer.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
7 months ago
Paul Mackerras 06ff486567 icache: Restore primary opcode to instruction word
The icache stores a predecoded insn_code value for each instruction,
and so as to fit in 36 bits, omits the primary opcode (the most
significant 6 bits) of each instruction.  Previously, for valid
instructions, the primary opcode field of the instruction delivered to
decode1 was a part-representation of the insn_code value rather than
the actual primary opcode.  This adds a lookup table to compute the
primary opcode from the insn_code and deliver it in the instruction
words supplied to decode1.

In order that each insn_code can be associated with a single primary
opcode value, the various no-operation instructions with primary
opcode 31 (the reserved no-ops and dss, dst and dstst) have been given
a new insn_code, INSN_rnop, leaving INSN_nop for the preferred no-op
(ori r0,r0,0).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
7 months ago
Paul Mackerras b50170cd1d Implement byte reversal instructions
This implements the byte-reverse halfword, word and doubleword
instructions: brh, brw, and brd.  These instructions were added to the
ISA in version 3.1.  They use a new OP_BREV insn_type value.  The
logic for these instructions is implemented in logical.vhdl.

In order to avoid going over 64 insn_type values, OP_AND and OP_OR
were combined into OP_LOGIC, which is like OP_AND except that the RS
input can be inverted as well as the RB input.  The various forms of
OR instruction are then implemented using the identity

    a OR b = NOT (NOT a AND NOT b)

The 'is_signed' field of the instruction decode table is used to
indicate that RS should be inverted.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
7 months ago
Paul Mackerras 39ca675ce3 Decode prefixed instructions
This adds logic to do basic decoding of the prefixed instructions
defined in PowerISA v3.1B which are in the SFFS (Scalar Fixed plus
Floating-Point Subset) compliancy subset.  In PowerISA v3.1B SFFS,
there are 14 prefixed load/store instructions plus the prefixed no-op
instruction (pnop).  The prefixed load/store instructions all use an
extended version of D-form, which has an extra 18 bits of displacement
in the prefix, plus an 'R' bit which enables PC-relative addressing.

When decode1 sees an instruction word where the insn_code is
INSN_prefix (i.e. the primary opcode was 1), it stores the prefix word
and sends nothing down to decode2 in that cycle.  When the next valid
instruction word arrives, it is interpreted as a suffix, meaning that
its insn_code gets modified before being used to look up the decode
table.

The insn_code values are rearranged so that the values for
instructions which are the suffix of a valid prefixed instruction are
all at even indexes, and the corresponding prefixed instructions
follow immediately, so that an insn_code value can be converted to the
corresponding prefixed value by setting the LSB of the insn_code
value.  There are two prefixed instructions, pld and pstd, for which
the suffix is not a valid SFFS instruction by itself, so these have
been given dummy insn_code values which decode as illegal (INSN_op57
and INSN_op61).

For a prefixed instruction, decode1 examines the type and subtype
fields of the prefix and checks that the suffix is valid for the type
and subtype.  This check doesn't affect which entry of the decode
table is used; the result is passed down to decode2, and will in
future be acted upon in execute1.

The instruction address passed down to decode2 is the address of the
prefix.  To enable this, part of the instruction address is saved when
the prefix is seen, and then the instruction address received from
icache is partly overlaid by the saved prefix address.  Because
prefixed instructions are not permitted to cross 64-byte boundaries,
we only need to save bits 5:2 of the instruction to do this.  If the
alignment restriction ever gets relaxed, we will then need to save
more bits of the address.

Decode2 has been extended to handle the R bit of the prefix (in 8LS
and MLS forms) and to be able to generate the 34-bit immediate value
from the prefix and suffix.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
10 months ago
Paul Mackerras 7af0e001ad Move insn_codes for mcrfs, mtfsb0/1 and mtfsfi
This moves the insn_code values for mcrfs, mtfsb0/1 and mtfsfi into
the region used for floating-point instructions.  This means that in
no-FPU implementations, they will get turned into illegal instructions
in predecode.  We then don't need the code in execute1 that makes FP
instructions illegal in no-FPU implementations.

We also remove the NONE value for unit_t, since it was only ever used
with insn_type = OP_ILLEGAL, and the check for unit = NONE was
redundant with the check for insn_type = OP_ILLEGAL.  Thus the check
for unit = NONE is no longer needed and is removed here.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
10 months ago
Paul Mackerras 58e799b350 predecode: Add more comments to row_predecode_rom and insn_code values
This adds comments to row_predecode_rom to aid understanding how the
columns in the second half of the table are allocated to different
primary opcodes, and to the insn_code values to assist in locating the
code with a given numeric value.  No code change.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 26dc1e879c Eliminate use of primary opcode outside of decode1
This changes code that previously looked at the primary opcode (bits
26 to 31) of the instruction to use other methods, in places other
than in stage0 of decode1.

* Extend rc_t to have a new value, RCOE, indicating that the
  instruction has both Rc and OE bits.

* Decode2 now tells execute1 whether the instruction has a third
  operand, used for distinguishing between multiply and multiply-add
  instructions.

* The invert_a field of the decode ROM is overloaded for load/store
  instructions to indicate cache-inhibited loads and stores.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras c9aea45ffe decode1: Divide insn_code values into ranges to indicate register usage
This lets us compute r_out.reg_*_addr and r_out.read_2_enable values
without needing access to the primary opcode value.  We also have that
non-FP instructions are < 256.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras c3ee10f013 decode1: Split instruction decoding into two steps
This reduces the block RAM requirements for instruction decoding by
splitting it into two steps.  The first, in a new pipeline stage
called decode0 (implemented by code in decode1.vhdl) maps the
instruction to a 9-bit instruction code using major and row decode
ROMs.  The second maps the 9-bit code to the final decode_rom_t (about
44 bits wide).  Branch prediction done in decode is now done in
decode0 rather than decode1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 932da4c114 FPU: Simplify IDLE state code
Do more decoding of the instruction ahead of the IDLE state
processing so that the IDLE state code becomes much simpler.
To make the decoding easier, we now use four insn_type_t codes for
floating-point operations rather than two.  This also rearranges the
insn_type_t values a little to get the 4 FP opcode values to differ
only in the bottom 2 bits, and put OP_DIV, OP_DIVE and OP_MOD next to
them.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras fdb3ef6874 Finish off taking SPRs out of register file
With this, the register file now contains 64 entries, for 32 GPRs and
32 FPRs, rather than the 128 it had previously.  Several things get
simplified - decode1 no longer has to work out the ispr{1,2,o} values,
decode_input_reg_{a,b,c} no longer have the t = SPR case, etc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras c9e838b656 Remove support for lq, stq, lqarx and stqcx.
They are optional in SFFS (scalar fixed-point and floating-point
subset), are not needed for running Linux, and add complexity, so
remove them.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 4c61a71a62 core: Crack update-form loads into two internal ops
This uses the instruction-doubling machinery to send load with update
instructions down to loadstore1 as two separate ops, rather than
one op with two destinations.  This will help to simplify the value
tracking mechanisms.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 89a67a18d0 decode: Add a facility field to the instruction decode tables
This makes it simpler to work out when to deliver a FPU unavailable
interrupt.  This also means we can get rid of the OP_FPLOAD and
OP_FPSTORE insn_type values.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 4b2c23703c core: Implement quadword loads and stores
This implements the lq, stq, lqarx and stqcx. instructions.

These instructions all access two consecutive GPRs; for example the
"lq %r6,0(%r3)" instruction will load the doubleword at the address
in R3 into R7 and the doubleword at address R3 + 8 into R6.  To cope
with having two GPR sources or destinations, the instruction gets
repeated at the decode2 stage, that is, for each lq/stq/lqarx/stqcx.
coming in from decode1, two instructions get sent out to execute1.

For these instructions, the RS or RT register gets modified on one
of the iterations by setting the LSB of the register number.  In LE
mode, the first iteration uses RS|1 or RT|1 and the second iteration
uses RS or RT.  In BE mode, this is done the other way around.  In
order for decode2 to know what endianness is currently in use, we
pass the big_endian flag down from icache through decode1 to decode2.
This is always in sync with what execute1 is using because only rfid
or an interrupt can change MSR[LE], and those operations all cause
a flush and redirect.

There is now an extra column in the decode tables in decode1 to
indicate whether the instruction needs to be repeated.  Decode1 also
enforces the rule that lq with RT = RT and lqarx with RA = RT or
RB = RT are illegal.

Decode2 now passes a 'repeat' flag and a 'second' flag to execute1,
and execute1 passes them on to loadstore1.  The 'repeat' flag is set
for both iterations of a repeated instruction, and 'second' is set
on the second iteration.  Execute1 does not take asynchronous or
trace interrupts on the second iteration of a repeated instruction.

Loadstore1 uses 'next_addr' for the second iteration of a repeated
load/store so that we access the second doubleword of the memory
operand.  Thus loadstore1 accesses the doublewords in increasing
memory order.  For 16-byte loads this means that the first iteration
writes GPR RT|1.  It is possible that RA = RT|1 (this is a legal
but non-preferred form), meaning that if the memory operand was
misaligned, the first iteration would overwrite RA but then the
second iteration might take a page fault, leading to corrupted state.
To avoid that possibility, 16-byte loads in LE mode take an
alignment interrupt if the operand is not 16-byte aligned.  (This
is the case anyway for lqarx, and we enforce it for lq as well.)

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras e6a5f237bc FPU: Implement fmul[s]
This implements the fmul and fmuls instructions.

For fmul[s] with denormalized operands we normalize the inputs
before doing the multiplication, to eliminate the need for doing
count-leading-zeroes on P.  This adds 3 or 5 cycles to the
execution time when one or both operands are denormalized.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras b628af6176 FPU: Implement fmr and related instructions
This implements fmr, fneg, fabs, fnabs and fcpsgn and adds tests
for them.

This adds logic to unpack and repack floating-point data from the
64-bit packed form (as stored in memory and the register file) into
the unpacked form in the fpr_reg_type record.  This is not strictly
necessary for fmr et al., but will be useful for when we do actual
arithmetic.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 856e9e955f core: Add framework for an FPU
This adds the skeleton of a floating-point unit and implements the
mffs and mtfsf instructions.

Execute1 sends FP instructions to the FPU and receives busy,
exception, FP interrupt and illegal interrupt signals from it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 45cd8f4fc3 core: Add support for floating-point loads and stores
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.

We now have the FP, FE0 and FE1 bits in MSR.  FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.

The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file.  It
defaults to true for all except the A7-35 boards.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 83816cb9e3 core: Implement BCD Assist instructions addg6s, cdtbcd, cbcdtod
To avoid adding too much logic, this moves the adder used by OP_ADD
out of the case statement in execute1.vhdl so that the result can
be used by OP_ADDG6S as well.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 5fafdc56ef core: Implement the addex instruction
The addex instruction is like adde but uses the XER[OV] bit for the
carry in and out rather than XER[CA].

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 290b05f97d core: Implement the maddhd, maddhdu and maddld instructions
These instructions use major opcode 4 and have a third GPR input
operand, so we need a decode table for major opcode 4 and some
plumbing to get the RC register operand read.

The multiply-add instructions use the same insn_type_t values as the
regular multiply instructions, and we distinguish in execute1 by
looking at the major opcode.  This turns out to be convenient because
we don't have to add any cases in the code that handles the output of
the multiplier, and it frees up some insn_type_t values.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras fa77a6f683 core: Implement the mcrxrx instruction
This also removes OP_MCRXR, as the mcrxr instruction was removed in
version 3.0B of the Power ISA, having been phased-out for the server
architecture since v2.02.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 4a4a98d4b9
core: Do addpcis using the main adder (#189)
By adding logic to decode2 to be able to send the instruction address
down the A input, and making CONST_DX_HI (renamed to CONST_DXHI4) add
4 to the immediate value (easy since the bottom 16 bits were zero),
we can do addpcis using the main adder.  This reduces the width of the
result mux and frees up one value in insn_type_t, since we can now use
OP_ADD for addpcis.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Shawn Anastasio e606772aeb Implement the addpcis instruction
This commit adds support for the addpcis instruction from ISA 3.0.

A new input_reg_b_t type, CONST_DX_HI, was added to support the
shifted immediate value used in DX-Form instructions.

Signed-off-by: Shawn Anastasio <shawn@anastas.io>
4 years ago
Paul Mackerras 3d4712ad43 Add TLB to icache
This adds a direct-mapped TLB to the icache, with 64 entries by default.
Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along
with redirects to indicate whether instruction addresses should be
translated through the TLB, and fetch1 sends that on to icache.
Similarly a "priv_mode" signal is sent to indicate the privilege
mode for instruction fetches.  This means that changes to MSR[IR]
or MSR[PR] don't take effect until the next redirect, meaning an
isync, rfid, branch, etc.

The icache uses a hash of the effective address (i.e. next instruction
address) to index the TLB.  The hash is an XOR of three fields of the
address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and
24--29 of the address.  TLB invalidations simply invalidate the
indexed TLB entry without checking the contents.

If the icache detects a TLB miss with virt_mode=1, it will send a
fetch_failed indication through fetch2 to decode1, which will turn it
into a special OP_FETCH_FAILED opcode with unit=LDST.  That will get
sent down to loadstore1 which will currently just raise a Instruction
Storage Interrupt (0x400) exception.

One bit in the PTE obtained from the TLB is used to check whether an
instruction access is allowed -- the privilege bit (bit 3).  If bit 3
is 1 and priv_mode=0, then a fetch_failed indication is sent down to
fetch2 and to decode1, which generates an OP_FETCH_FAILED.  Any PTEs
with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put
into the iTLB since such PTEs would not allow execution by any
context.

Tlbie operations get sent from mmu to icache over a new connection.

Unfortunately the privileged instruction tests are broken for now.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 750b3a8e28 dcache: Implement data TLB
This adds a TLB to dcache, providing the ability to translate
addresses for loads and stores.  No protection mechanism has been
implemented yet.  The MSR_DR bit controls whether addresses are
translated through the TLB.

The TLB is a fixed-pagesize, set-associative cache.  Currently
the page size is 4kB and the TLB is 2-way set associative with 64
entries per set.

This implements the tlbie instruction.  RB bits 10 and 11 control
whether the whole TLB is invalidated (if either bit is 1) or just
a single entry corresponding to the effective page number in bits
12-63 of RB.

As an extension until we get a hardware page table walk, a tlbie
instruction with RB bits 9-11 set to 001 will load an entry into
the TLB.  The TLB entry value is in RS in the format of a radix PTE.

Currently there is no proper handling of TLB misses.  The load or
store will not be performed but no interrupt is generated.

In order to make timing at 100MHz on the Arty A7-100, we compare
the real address from each way of the TLB with the tag from each way
of the cache in parallel (requiring # TLB ways * # cache ways
comparators).  Then the result is selected based on which way hit in
the TLB.  That avoids a timing path going through the TLB EA
comparators, the multiplexer that selects the RA, and the cache tag
comparators.

The hack where addresses of the form 0xc------- are marked as
cache-inhibited is kept for now but restricted to real-mode accesses.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 278ac5e0eb Remove sim_config instruction
It's not used any more, and it's not in the ISA.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 381149b2cc Consolidate trap variants under a single OP_TRAP
This replaces OP_TD, OP_TDI, OP_TW and OP_TWI with a single OP_TRAP,
distinguishing the cases by the input_reg_b and is_32bit fields of
the decode ROM.  This adds the twi and td cases to the decode tables.

For now we make all of the trap instructions unconditionally generate
a trap-type program interrupt if the TO field of the instruction is
all ones, and do nothing otherwise.

This reduces the number of values in insn_type_t from 65 to 62,
meaning that an insn_type_t can now be encoded in 6 bits rather
than 7.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras fe077a116a Rename OP_MCRF to OP_CROP and trim insn_type_t
OP_MCRF covers the CR logical ops as well as mcrf since commit
c05441bf47 ("Implement CRNOR and friends"), so this renames
OP_MCRF to OP_CROP.  The OP_* values for the individual CR logical
ops (OP_CRAND, etc.) are not used, so remove them from insn_type_t.

No functional change.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Michael Neuling 5ef5604b65 Add sc, illegal and decrementer exceptions and some supervisor state
This adds the following exceptions:
 - 0x700 program check (for illegal instructions)
 - 0x900 decrementer
 - 0xc00 system call

This also adds some supervisor state:
 - decremeter
 - msr
(SPRG0/1 and SRR0/1 already exist as fast SPRs)

It also adds some supporting instructions:
 - rfid
 - mtmsrd
 - mfmsr
 - sc

MSR state is added but only EE is used in this patch set. Other bits
are read/written but are not used at all.

This adds a 2 stage state machine to execute1.vhdl. This state machine
allows fast SPRS SRR0/1 to be written in different cycles. This state
machine can be extended later to add DAR and DSISR SPR writing for
more complex exceptions like page faults.

Signed-off-by: Michael Neuling <mikey@neuling.org>
4 years ago
Paul Mackerras 0c714f1be6 execute: Move popcnt and prty instructions into the logical unit
This implements logic in the logical entity to calculate the results
of the popcnt* and prty* instructions.  We now have one insn_type_t
value for the 3 popcnt variants and one for the two prty variants,
using the length field of the decode_rom_t to distinguish between
them.  The implementations in logical.vhdl using recursive
algorithms rather than the simple functions in ppc_fx_insns.vhdl.

This gives a saving of about 140 slice LUTs on the A7-100 and
improves timing slightly.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras d2ca625b3b execute: Do comparisons using the main adder
This handles OP_CMP like a subtraction; the main adder computes
~RA + RB + 1, and the condition codes are computed from the results.
A direct comparison of the two input operands is used to calculate the
EQ bit of the condition result.  The LT and GT bits are computed from
the MSB of the subtraction result, the carry out from the subtraction,
and the MSBs of the operands.  For a 32-bit comparison, the 32-bit
carry and bit 31 of the result and input operands are used; for a
64-bit comparison, the 64-bit carry and bit 63 of the operands and
result are used.

It turns out to be more convenient to use the 'signed' field of
the decode table to distinguish signed from unsigned comparisons,
rather than the insn_type.  Therefore this uses OP_CMP for both
cmp and cmpl, which also has the benefit of reducing the number
of values in insn_type_t.

Doing this saves over 200 slice LUTs on the Arty A7-100 and improves
timing slightly as well.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 39d18d2738 Make divider hang off the side of execute1
With this, the divider is a unit that execute1 sends operands to and
which sends its results back to execute1, which then send them to
writeback.  Execute1 now sends a stall signal when it gets a divide
or modulus instruction until it gets a valid signal back from the
divider.  Divide and modulus instructions are no longer marked as
single-issue.

The data formatting step that used to be done in decode2 for div
and mod instructions is now done in execute1.  We also do the
absolute value operation in that same cycle instead of taking an
extra cycle inside the divider for signed operations with a
negative operand.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 2167186b5f Make multiplier hang off the side of execute1
With this, the multiplier isn't a separate pipe that decode2 issues
instructions to, but rather is a unit that execute1 sends operands
to and which sends the result back to execute1, which then sends it
to writeback.  Execute1 now sends a stall signal when it gets a
multiply instruction until it gets a valid signal back from the
multiplier.

This all means that we no longer need to mark the multiply
instructions as single-issue.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Benjamin Herrenschmidt e4f475e17f sprs: Store common SPRs in register file
This stores the most common SPRs in the register file.

This includes CTR and LR and a not yet final list of others.

The register file is set to 64 entries for now. Specific types
are defined that can represent a GPR index (gpr_index_t) or
a GPR/SPR index (gspr_index_t) along with conversion functions
between the two.

On order to deal with some forms of branch updating both LR and
CTR, we introduced a delayed update of LR after a branch link.

Note: We currently stall the pipeline on such a delayed branch,
but we could avoid stalling fetch in that specific case as we
know we have a branch delay. We could also limit that to the
specific case where we need to update both CTR and LR.

This allows us to make bcreg, mtspr and mfspr pipelined. decode1
will automatically force the single issue flag on mfspr/mtspr to
a "slow" SPR.

[paulus@ozlabs.org - fix direction of decode2.stall_in]

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Benjamin Herrenschmidt 797b1bb045 decode: Reformat decode_types.vhdl
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Anton Blanchard 4433118c91
Merge pull request #105 from paulusmack/writeback
Writeback
5 years ago
Paul Mackerras 9646fe28b0 Do sign-extension instructions in writeback instead of execute1
This makes the exts[bhw] instructions do the sign extension in the
writeback stage using the sign-extension logic there instead of
having unique sign extension logic in execute1.  This requires
passing the data length and sign extend flag from decode2 down
through execute1 and execute2 and into writeback.  As a side bonus
we reduce the number of values in insn_type_t by two.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 86c53aa3f7 Implement neg using OP_ADD
We have all the machinery in place to implement the neg instruction
as OP_ADD.  Doing that means we can ditch OP_NEG, and saves about
66 slice LUTs on the A7-100.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 24a4a796ce execute: Consolidate count-leading/trailing-zeroes implementations
This adds combinatorial logic that does 32-bit and 64-bit count
leading and trailing zeroes in one unit, and consolidates the
four instructions under a single OP_CNTZ opcode.

This saves 84 slice LUTs on the Arty A7-100.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Anton Blanchard b8fb721b81 Consolidate logical instructions
Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common
invert on the input and output. This saves us about 200 LUTs.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Paul Mackerras f7c393ba7e Add a rotate/mask/shift unit and use it in execute1
This adds a new entity 'rotator' which contains combinatorial logic
for rotating and masking 64-bit values.  It implements the operations
of the rlwinm, rlwnm, rlwimi, rldicl, rldicr, rldic, rldimi, rldcl,
rldcr, sld, slw, srd, srw, srad, sradi, sraw and srawi instructions.
It consists of a 3-stage 64-bit rotator using 4:1 multiplexors at
each stage, two mask generators, output logic and control logic.

The insn_type_t values used for these instructions have been reduced
to just 5: OP_RLC, OP_RLCL and OP_RLCR for the rotate and mask
instructions (clear both left and right, clear left, clear right
variants), OP_SHL for left shifts, and OP_SHR for right shifts.
The control signals for the rotator are derived from the opcode
and from the is_32bit and is_signed fields of the decode_rom_t.

The rotator is instantiated as an entity in execute1 so that we can
be sure we only have one of it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 90b6e27380 Generalize the mul_32bit and mul_signed fields of decode_rom_t
This changes the names of the mul_32bit and mul_signed fields of
decode_rom_t to is_32bit and is_signed, so they can be used with
other types of operations besides multiplies.

This plumbs the is_32bit and is_signed flags down into execute1,
though they are not used at this point.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 7fe84220a5 decode: Avoid multiplexing from instruction reg fields to regfile address ports
This aims to simplify the logic between the instruction image and
the register file read address ports and reduce the size of the decode
tables.  With this patch, the input_reg_a column of the decode tables
can only select RA or zeroes, the input_reg_b column can only select
RB or a constant (0, -1, or an immediate value from the instruction),
and the input_reg_c columns can only select RS or zeroes.

That means that the rotate/shift/logical ops now have their first
input coming in via the input_reg_c column.  That means we need to
add a read_data3 field to the Decode2ToExecuteType record, but that
will go away again when we split out the rotate/mask/logical ops to
their own unit.

As a related but not tightly connected change, this patch also sets
the read1_enable signal to the register file be 0 when RA=0 and the
input_reg_a for the instruction is RA_OR_ZERO (previously it was 1).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 96b402a4bf Consolidate add/subtract instructions into a single op
All of the PPC add and subtract instructions, including carrying
and extended versions, do much the same arithmetic operation:

	result = (I xor A) + B + C

where A is the value from RA, I provides a logical inversion of A
(i.e. I is 0 or -1), B is either from RB or is a constant 0 or -1,
and C is 0, 1 or the carry bit from XER (CA).

To consolidate all the add/subtract instructions into a single
OP_ADD, we add a column to decode_rom_t to indicate when A should
be inverted, and change the input_carry field to a 3-state selector
to select C in the equation above.

This also adds a new "CONST_M1" value for input_reg_b_t to indicate
that B is a constant -1.  This allows us to implement addme and
subfme.

The addex instruction appears not to exist, so the comments referring
to it are removed.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 58b06eb5f3 decode: Remove const fields from decode_rom_t
The const* fields of decode_rom_t drove multiplexers in decode2 that
picked out various instruction fields and put them into the const*
fields of the Decode2ToExecute1Type record, from where they were
used in execute1.  However, the code in execute1 can just as easily
use the appropriate fields of the original instruction word, since
that is now available in execute1.  This therefore changes the
code to do that, resulting in smaller decode tables.

Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras bbae2d1eda decode: Index minor op table with insn bits for opcode 31
This changes decode_op_31_array from being indexed by a ppc_insn_t
(which is derived from the instruction word by a whole series of
if/elsif statements) to being indexed directly by bits 10...1 of
the instruction word.  With this we no longer need ppc_insn.

This then means that the decode1 stage doesn't distinguish between
mfcr and mfocrf, or between mtcrf and mtocrf, since those are
distinguished by the value in bit 20 of the instruction.  To
accommodate that, execute1 changes so that the one op value (OP_MFCR)
does either the mfcr or the mfocrf behaviour depending on bit 20
of the instruction word; and similarly for mtcrf/mtocrf.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 21d3f8a5ed decode: Index minor op table with insn bits for opcode 30
This comprises the 64-bit rotate and mask instructions.  In order to
reduce the table index to 3 bits, we combine rldcl and rdlcr into a
single op (OP_RLDCX), and choose the right mask at execute time based
on bit 1 of the instruction word.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 00e9f801f6 decode: Index minor op table with insn bits for opcode 19
This changes the decoding of major opcode 19 from using the ppc_insn_t
index to using bits of the instruction word directly.  Opcode 19 has
a 10-bit minor opcode field (bits 10..1) but the space is sparsely
filled.  Therefore we index a table of single-bit entries with the
10-bit minor opcode to filter out the illegal minor opcodes, and
index a table using just 3 bits -- 5, 3 and 2 -- of the instruction
to get the decode entry.  This groups together all the instructions
in 4 columns of the opcode map as a single entry.  That means that
mcrf and all the CR logical ops get grouped together, and bcctr, bclr
and bctar get grouped together.  At present the CR logical ops are not
implemented, so their grouping has no impact.

The code for bclr and bcctr in execute1 is now common, using a single
op, and it now determines the branch address by looking at bit 10 of
the instruction word at execute time.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago