This uses the instruction-doubling machinery to send load with update
instructions down to loadstore1 as two separate ops, rather than
one op with two destinations. This will help to simplify the value
tracking mechanisms.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes it simpler to work out when to deliver a FPU unavailable
interrupt. This also means we can get rid of the OP_FPLOAD and
OP_FPSTORE insn_type values.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the lq, stq, lqarx and stqcx. instructions.
These instructions all access two consecutive GPRs; for example the
"lq %r6,0(%r3)" instruction will load the doubleword at the address
in R3 into R7 and the doubleword at address R3 + 8 into R6. To cope
with having two GPR sources or destinations, the instruction gets
repeated at the decode2 stage, that is, for each lq/stq/lqarx/stqcx.
coming in from decode1, two instructions get sent out to execute1.
For these instructions, the RS or RT register gets modified on one
of the iterations by setting the LSB of the register number. In LE
mode, the first iteration uses RS|1 or RT|1 and the second iteration
uses RS or RT. In BE mode, this is done the other way around. In
order for decode2 to know what endianness is currently in use, we
pass the big_endian flag down from icache through decode1 to decode2.
This is always in sync with what execute1 is using because only rfid
or an interrupt can change MSR[LE], and those operations all cause
a flush and redirect.
There is now an extra column in the decode tables in decode1 to
indicate whether the instruction needs to be repeated. Decode1 also
enforces the rule that lq with RT = RT and lqarx with RA = RT or
RB = RT are illegal.
Decode2 now passes a 'repeat' flag and a 'second' flag to execute1,
and execute1 passes them on to loadstore1. The 'repeat' flag is set
for both iterations of a repeated instruction, and 'second' is set
on the second iteration. Execute1 does not take asynchronous or
trace interrupts on the second iteration of a repeated instruction.
Loadstore1 uses 'next_addr' for the second iteration of a repeated
load/store so that we access the second doubleword of the memory
operand. Thus loadstore1 accesses the doublewords in increasing
memory order. For 16-byte loads this means that the first iteration
writes GPR RT|1. It is possible that RA = RT|1 (this is a legal
but non-preferred form), meaning that if the memory operand was
misaligned, the first iteration would overwrite RA but then the
second iteration might take a page fault, leading to corrupted state.
To avoid that possibility, 16-byte loads in LE mode take an
alignment interrupt if the operand is not 16-byte aligned. (This
is the case anyway for lqarx, and we enforce it for lq as well.)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the fmul and fmuls instructions.
For fmul[s] with denormalized operands we normalize the inputs
before doing the multiplication, to eliminate the need for doing
count-leading-zeroes on P. This adds 3 or 5 cycles to the
execution time when one or both operands are denormalized.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements fmr, fneg, fabs, fnabs and fcpsgn and adds tests
for them.
This adds logic to unpack and repack floating-point data from the
64-bit packed form (as stored in memory and the register file) into
the unpacked form in the fpr_reg_type record. This is not strictly
necessary for fmr et al., but will be useful for when we do actual
arithmetic.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds the skeleton of a floating-point unit and implements the
mffs and mtfsf instructions.
Execute1 sends FP instructions to the FPU and receives busy,
exception, FP interrupt and illegal interrupt signals from it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.
We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.
The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file. It
defaults to true for all except the A7-35 boards.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
To avoid adding too much logic, this moves the adder used by OP_ADD
out of the case statement in execute1.vhdl so that the result can
be used by OP_ADDG6S as well.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The addex instruction is like adde but uses the XER[OV] bit for the
carry in and out rather than XER[CA].
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
These instructions use major opcode 4 and have a third GPR input
operand, so we need a decode table for major opcode 4 and some
plumbing to get the RC register operand read.
The multiply-add instructions use the same insn_type_t values as the
regular multiply instructions, and we distinguish in execute1 by
looking at the major opcode. This turns out to be convenient because
we don't have to add any cases in the code that handles the output of
the multiplier, and it frees up some insn_type_t values.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This also removes OP_MCRXR, as the mcrxr instruction was removed in
version 3.0B of the Power ISA, having been phased-out for the server
architecture since v2.02.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
By adding logic to decode2 to be able to send the instruction address
down the A input, and making CONST_DX_HI (renamed to CONST_DXHI4) add
4 to the immediate value (easy since the bottom 16 bits were zero),
we can do addpcis using the main adder. This reduces the width of the
result mux and frees up one value in insn_type_t, since we can now use
OP_ADD for addpcis.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This commit adds support for the addpcis instruction from ISA 3.0.
A new input_reg_b_t type, CONST_DX_HI, was added to support the
shifted immediate value used in DX-Form instructions.
Signed-off-by: Shawn Anastasio <shawn@anastas.io>
This adds a direct-mapped TLB to the icache, with 64 entries by default.
Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along
with redirects to indicate whether instruction addresses should be
translated through the TLB, and fetch1 sends that on to icache.
Similarly a "priv_mode" signal is sent to indicate the privilege
mode for instruction fetches. This means that changes to MSR[IR]
or MSR[PR] don't take effect until the next redirect, meaning an
isync, rfid, branch, etc.
The icache uses a hash of the effective address (i.e. next instruction
address) to index the TLB. The hash is an XOR of three fields of the
address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and
24--29 of the address. TLB invalidations simply invalidate the
indexed TLB entry without checking the contents.
If the icache detects a TLB miss with virt_mode=1, it will send a
fetch_failed indication through fetch2 to decode1, which will turn it
into a special OP_FETCH_FAILED opcode with unit=LDST. That will get
sent down to loadstore1 which will currently just raise a Instruction
Storage Interrupt (0x400) exception.
One bit in the PTE obtained from the TLB is used to check whether an
instruction access is allowed -- the privilege bit (bit 3). If bit 3
is 1 and priv_mode=0, then a fetch_failed indication is sent down to
fetch2 and to decode1, which generates an OP_FETCH_FAILED. Any PTEs
with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put
into the iTLB since such PTEs would not allow execution by any
context.
Tlbie operations get sent from mmu to icache over a new connection.
Unfortunately the privileged instruction tests are broken for now.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a TLB to dcache, providing the ability to translate
addresses for loads and stores. No protection mechanism has been
implemented yet. The MSR_DR bit controls whether addresses are
translated through the TLB.
The TLB is a fixed-pagesize, set-associative cache. Currently
the page size is 4kB and the TLB is 2-way set associative with 64
entries per set.
This implements the tlbie instruction. RB bits 10 and 11 control
whether the whole TLB is invalidated (if either bit is 1) or just
a single entry corresponding to the effective page number in bits
12-63 of RB.
As an extension until we get a hardware page table walk, a tlbie
instruction with RB bits 9-11 set to 001 will load an entry into
the TLB. The TLB entry value is in RS in the format of a radix PTE.
Currently there is no proper handling of TLB misses. The load or
store will not be performed but no interrupt is generated.
In order to make timing at 100MHz on the Arty A7-100, we compare
the real address from each way of the TLB with the tag from each way
of the cache in parallel (requiring # TLB ways * # cache ways
comparators). Then the result is selected based on which way hit in
the TLB. That avoids a timing path going through the TLB EA
comparators, the multiplexer that selects the RA, and the cache tag
comparators.
The hack where addresses of the form 0xc------- are marked as
cache-inhibited is kept for now but restricted to real-mode accesses.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This replaces OP_TD, OP_TDI, OP_TW and OP_TWI with a single OP_TRAP,
distinguishing the cases by the input_reg_b and is_32bit fields of
the decode ROM. This adds the twi and td cases to the decode tables.
For now we make all of the trap instructions unconditionally generate
a trap-type program interrupt if the TO field of the instruction is
all ones, and do nothing otherwise.
This reduces the number of values in insn_type_t from 65 to 62,
meaning that an insn_type_t can now be encoded in 6 bits rather
than 7.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
OP_MCRF covers the CR logical ops as well as mcrf since commit
c05441bf47 ("Implement CRNOR and friends"), so this renames
OP_MCRF to OP_CROP. The OP_* values for the individual CR logical
ops (OP_CRAND, etc.) are not used, so remove them from insn_type_t.
No functional change.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds the following exceptions:
- 0x700 program check (for illegal instructions)
- 0x900 decrementer
- 0xc00 system call
This also adds some supervisor state:
- decremeter
- msr
(SPRG0/1 and SRR0/1 already exist as fast SPRs)
It also adds some supporting instructions:
- rfid
- mtmsrd
- mfmsr
- sc
MSR state is added but only EE is used in this patch set. Other bits
are read/written but are not used at all.
This adds a 2 stage state machine to execute1.vhdl. This state machine
allows fast SPRS SRR0/1 to be written in different cycles. This state
machine can be extended later to add DAR and DSISR SPR writing for
more complex exceptions like page faults.
Signed-off-by: Michael Neuling <mikey@neuling.org>
This implements logic in the logical entity to calculate the results
of the popcnt* and prty* instructions. We now have one insn_type_t
value for the 3 popcnt variants and one for the two prty variants,
using the length field of the decode_rom_t to distinguish between
them. The implementations in logical.vhdl using recursive
algorithms rather than the simple functions in ppc_fx_insns.vhdl.
This gives a saving of about 140 slice LUTs on the A7-100 and
improves timing slightly.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This handles OP_CMP like a subtraction; the main adder computes
~RA + RB + 1, and the condition codes are computed from the results.
A direct comparison of the two input operands is used to calculate the
EQ bit of the condition result. The LT and GT bits are computed from
the MSB of the subtraction result, the carry out from the subtraction,
and the MSBs of the operands. For a 32-bit comparison, the 32-bit
carry and bit 31 of the result and input operands are used; for a
64-bit comparison, the 64-bit carry and bit 63 of the operands and
result are used.
It turns out to be more convenient to use the 'signed' field of
the decode table to distinguish signed from unsigned comparisons,
rather than the insn_type. Therefore this uses OP_CMP for both
cmp and cmpl, which also has the benefit of reducing the number
of values in insn_type_t.
Doing this saves over 200 slice LUTs on the Arty A7-100 and improves
timing slightly as well.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the divider is a unit that execute1 sends operands to and
which sends its results back to execute1, which then send them to
writeback. Execute1 now sends a stall signal when it gets a divide
or modulus instruction until it gets a valid signal back from the
divider. Divide and modulus instructions are no longer marked as
single-issue.
The data formatting step that used to be done in decode2 for div
and mod instructions is now done in execute1. We also do the
absolute value operation in that same cycle instead of taking an
extra cycle inside the divider for signed operations with a
negative operand.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the multiplier isn't a separate pipe that decode2 issues
instructions to, but rather is a unit that execute1 sends operands
to and which sends the result back to execute1, which then sends it
to writeback. Execute1 now sends a stall signal when it gets a
multiply instruction until it gets a valid signal back from the
multiplier.
This all means that we no longer need to mark the multiply
instructions as single-issue.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This stores the most common SPRs in the register file.
This includes CTR and LR and a not yet final list of others.
The register file is set to 64 entries for now. Specific types
are defined that can represent a GPR index (gpr_index_t) or
a GPR/SPR index (gspr_index_t) along with conversion functions
between the two.
On order to deal with some forms of branch updating both LR and
CTR, we introduced a delayed update of LR after a branch link.
Note: We currently stall the pipeline on such a delayed branch,
but we could avoid stalling fetch in that specific case as we
know we have a branch delay. We could also limit that to the
specific case where we need to update both CTR and LR.
This allows us to make bcreg, mtspr and mfspr pipelined. decode1
will automatically force the single issue flag on mfspr/mtspr to
a "slow" SPR.
[paulus@ozlabs.org - fix direction of decode2.stall_in]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the exts[bhw] instructions do the sign extension in the
writeback stage using the sign-extension logic there instead of
having unique sign extension logic in execute1. This requires
passing the data length and sign extend flag from decode2 down
through execute1 and execute2 and into writeback. As a side bonus
we reduce the number of values in insn_type_t by two.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We have all the machinery in place to implement the neg instruction
as OP_ADD. Doing that means we can ditch OP_NEG, and saves about
66 slice LUTs on the A7-100.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds combinatorial logic that does 32-bit and 64-bit count
leading and trailing zeroes in one unit, and consolidates the
four instructions under a single OP_CNTZ opcode.
This saves 84 slice LUTs on the Arty A7-100.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common
invert on the input and output. This saves us about 200 LUTs.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds a new entity 'rotator' which contains combinatorial logic
for rotating and masking 64-bit values. It implements the operations
of the rlwinm, rlwnm, rlwimi, rldicl, rldicr, rldic, rldimi, rldcl,
rldcr, sld, slw, srd, srw, srad, sradi, sraw and srawi instructions.
It consists of a 3-stage 64-bit rotator using 4:1 multiplexors at
each stage, two mask generators, output logic and control logic.
The insn_type_t values used for these instructions have been reduced
to just 5: OP_RLC, OP_RLCL and OP_RLCR for the rotate and mask
instructions (clear both left and right, clear left, clear right
variants), OP_SHL for left shifts, and OP_SHR for right shifts.
The control signals for the rotator are derived from the opcode
and from the is_32bit and is_signed fields of the decode_rom_t.
The rotator is instantiated as an entity in execute1 so that we can
be sure we only have one of it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the names of the mul_32bit and mul_signed fields of
decode_rom_t to is_32bit and is_signed, so they can be used with
other types of operations besides multiplies.
This plumbs the is_32bit and is_signed flags down into execute1,
though they are not used at this point.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This aims to simplify the logic between the instruction image and
the register file read address ports and reduce the size of the decode
tables. With this patch, the input_reg_a column of the decode tables
can only select RA or zeroes, the input_reg_b column can only select
RB or a constant (0, -1, or an immediate value from the instruction),
and the input_reg_c columns can only select RS or zeroes.
That means that the rotate/shift/logical ops now have their first
input coming in via the input_reg_c column. That means we need to
add a read_data3 field to the Decode2ToExecuteType record, but that
will go away again when we split out the rotate/mask/logical ops to
their own unit.
As a related but not tightly connected change, this patch also sets
the read1_enable signal to the register file be 0 when RA=0 and the
input_reg_a for the instruction is RA_OR_ZERO (previously it was 1).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
All of the PPC add and subtract instructions, including carrying
and extended versions, do much the same arithmetic operation:
result = (I xor A) + B + C
where A is the value from RA, I provides a logical inversion of A
(i.e. I is 0 or -1), B is either from RB or is a constant 0 or -1,
and C is 0, 1 or the carry bit from XER (CA).
To consolidate all the add/subtract instructions into a single
OP_ADD, we add a column to decode_rom_t to indicate when A should
be inverted, and change the input_carry field to a 3-state selector
to select C in the equation above.
This also adds a new "CONST_M1" value for input_reg_b_t to indicate
that B is a constant -1. This allows us to implement addme and
subfme.
The addex instruction appears not to exist, so the comments referring
to it are removed.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The const* fields of decode_rom_t drove multiplexers in decode2 that
picked out various instruction fields and put them into the const*
fields of the Decode2ToExecute1Type record, from where they were
used in execute1. However, the code in execute1 can just as easily
use the appropriate fields of the original instruction word, since
that is now available in execute1. This therefore changes the
code to do that, resulting in smaller decode tables.
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes decode_op_31_array from being indexed by a ppc_insn_t
(which is derived from the instruction word by a whole series of
if/elsif statements) to being indexed directly by bits 10...1 of
the instruction word. With this we no longer need ppc_insn.
This then means that the decode1 stage doesn't distinguish between
mfcr and mfocrf, or between mtcrf and mtocrf, since those are
distinguished by the value in bit 20 of the instruction. To
accommodate that, execute1 changes so that the one op value (OP_MFCR)
does either the mfcr or the mfocrf behaviour depending on bit 20
of the instruction word; and similarly for mtcrf/mtocrf.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This comprises the 64-bit rotate and mask instructions. In order to
reduce the table index to 3 bits, we combine rldcl and rdlcr into a
single op (OP_RLDCX), and choose the right mask at execute time based
on bit 1 of the instruction word.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the decoding of major opcode 19 from using the ppc_insn_t
index to using bits of the instruction word directly. Opcode 19 has
a 10-bit minor opcode field (bits 10..1) but the space is sparsely
filled. Therefore we index a table of single-bit entries with the
10-bit minor opcode to filter out the illegal minor opcodes, and
index a table using just 3 bits -- 5, 3 and 2 -- of the instruction
to get the decode entry. This groups together all the instructions
in 4 columns of the opcode map as a single entry. That means that
mcrf and all the CR logical ops get grouped together, and bcctr, bclr
and bctar get grouped together. At present the CR logical ops are not
implemented, so their grouping has no impact.
The code for bclr and bcctr in execute1 is now common, using a single
op, and it now determines the branch address by looking at bit 10 of
the instruction word at execute time.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of doing mfctr, mflr, mftb, mtctr, mtlr as separate ops,
just pass down mfspr and mtspr ops with the spr number and let
execute1 decode which SPR we're addressing. This will help reduce
the number of instruction bits decode1 needs to look at.
In fact we now pass down the whole instruction from decode2 to
execute1. We will need more bits of the instruction in future,
and the tools should just optimize away any that we don't end
up using. Since the 'aa' bit was just a copy of an instruction
bit, we can now remove it from the record.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Hopefully it's not too timing catastrophic. The variable newcrf will
be handy for the other CR ops when we implement them I suspect.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds a divider unit, connected to the core in much the same way
that the multiplier unit is connected. The division algorithm is
very simple-minded, taking 64 clock cycles for any division (even
32-bit division instructions).
The decoding is simplified by making use of regularities in the
instruction encoding for div* and mod* instructions. Instead of
having PPC_* encodings from the first-stage decoder for each of the
different div* and mod* instructions, we now just have PPC_DIV and
PPC_MOD, and the inputs to the divider that indicate what sort of
division operation to do are derived from instruction word bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We can force all existing code to use the UART console
by passing 0 in bit zero of the sim config register.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>