This adds a path to allow the CR result of one instruction to be
forwarded to the next instruction, so that sequences such as
cmp; bc can avoid having a 1-cycle bubble.
Forwarding is not available for dot-form (Rc=1) instructions,
since the CR result for them is calculated in writeback. The
decode.output_cr field is used to identify those instructions
that compute the CR result in execute1.
For some reason, the multiply instructions incorrectly had
output_cr = 1 in the decode tables. This fixes that.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This latches the redirect signal inside execute1, so that it is sent
a cycle later to fetch1 (and to decode/icache as flush). This breaks
a long combinatorial chain from the branch and interrupt detection
in execute1 through the redirect/flush signals all the way back to
fetch1, icache and decode.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the CFAR SPR as a slow SPR stored in 'ctrl'. Taken
branches and rfid update it to the address of the branch or rfid
instruction.
To simplify the logic, this makes rfid use the branch logic to
generate its redirect (requiring SRR0 to come in to execute1 on
the B input and SRR1 on the A input), and the masking of the bottom
2 bits of NIA is moved to fetch1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reduces the number of different things that are assigned to
the result variable.
- The computations for the popcnt, prty, cmpb and exts instruction
families are moved into the logical unit.
- The result of mfspr from the slow SPRs is computed in 'spr_val'
before being assigned to 'result'.
- Writes to LR as a result of a blr or bclr instruction are done
through the exc_write path to writeback.
This eases timing considerably.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements a simple branch predictor in the decode1 stage. If it
sees that the instruction is b or bc and the branch is predicted to be
taken, it sends a flush and redirect upstream (to icache and fetch1)
to redirect fetching to the branch target. The prediction is sent
downstream with the branch instruction, and execute1 now only sends
a flush/redirect upstream if the prediction was wrong. Unconditional
branches are always predicted to be taken, and conditional branches
are predicted to be taken if and only if the offset is negative.
Branches that take the branch address from a register (bclr, bcctr)
are predicted not taken, as we don't have any way to predict the
branch address.
Since we can now have a mflr being executed immediately after a bl
or bcl, we now track the update to LR in the hazard tracker, using
the second write register field that is used to track RA updates for
update-form loads and stores.
For those branches that update LR but don't write any other result
(i.e. that don't decrementer CTR), we now write back LR in the same
cycle as the instruction rather than taking a second cycle for the
LR writeback.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reduces the number of cycles where loadstore1 asserts its busy
output, leading to increased throughput of loads and stores. Loads
that hit in the cache can now be executed at the rate of one every two
cycles. Stores take 4 cycles assuming the wishbone slave responds
with an ack the cycle after we assert strobe.
To achieve this, the state machine code is split into two parts, one
for when we have an existing instruction in progress, and one for
starting a new instruction. We can now combinatorially clear busy and
start a new instruction in the same cycle that we get a done signal
from the dcache; in other words we are completing one instruction and
potentially writing back results in the same cycle that we start a new
instruction and send its address and data to the dcache.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the instruction dependency tracking so that we can
generate a "busy" signal from execute1 and loadstore1 which comes
along one cycle later than the current "stall" signal. This will
enable us to signal busy cycles only when we need to from loadstore1.
The "busy" signal from execute1/loadstore1 indicates "I didn't take
the thing you gave me on this cycle", as distinct from the previous
stall signal which meant "I took that but don't give me anything
next cycle". That means that decode2 proactively gives execute1
a new instruction as soon as it has taken the previous one (assuming
there is a valid instruction available from decode1), and that then
sits in decode2's output until execute1 can take it. So instructions
are issued by decode2 somewhat earlier than they used to be.
Decode2 now only signals a stall upstream when its output buffer is
full, meaning that we can fill up bubbles in the upstream pipe while a
long instruction is executing. This gives a small boost in
performance.
This also adds dependency tracking for rA updates by update-form
load/store instructions.
The GPR and CR hazard detection machinery now has one extra stage,
which may not be strictly necessary. Some of the code now really
only applies to PIPELINE_DEPTH=1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This puts the logic that selects which bits of the multiplier result
get written into the destination GPR into execute1, moved out from
multiply.
The multiplier is now expected to do an unsigned multiplication of
64-bit operands, optionally negate the result, detect 32-bit
or 64-bit signed overflow of the result, and return a full 128-bit
result.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This logs 256 bits of data per cycle to a ring buffer in BRAM. The
data collected can be read out through 2 new SPRs or through the
debug interface.
The new SPRs are LOG_ADDR (724) and LOG_DATA (725). LOG_ADDR contains
the buffer write pointer in the upper 32 bits (in units of entries,
i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of
doublewords, i.e. 8 bytes). Reading LOG_DATA gives the doubleword
from the buffer at the read pointer and increments the read pointer.
Setting bit 31 of LOG_ADDR inhibits the trace log system from writing
to the log buffer, so the contents are stable and can be read.
There are two new debug addresses which function similarly to the
LOG_ADDR and LOG_DATA SPRs. The log is frozen while either or both of
the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set.
The buffer defaults to 2048 entries, i.e. 64kB. The size is set by
the LOG_LENGTH generic on the core_debug module. Software can
determine the length of the buffer because the length is ORed into the
buffer write pointer in the upper 32 bits of LOG_ADDR. Hence the
length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)).
There is a program to format the log entries in a somewhat readable
fashion in scripts/fmt_log/fmt_log.c. The log_entry struct in that
file describes the layout of the bits in the log entries.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
By adding logic to decode2 to be able to send the instruction address
down the A input, and making CONST_DX_HI (renamed to CONST_DXHI4) add
4 to the immediate value (easy since the bottom 16 bits were zero),
we can do addpcis using the main adder. This reduces the width of the
result mux and frees up one value in insn_type_t, since we can now use
OP_ADD for addpcis.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This commit adds support for the addpcis instruction from ISA 3.0.
A new input_reg_b_t type, CONST_DX_HI, was added to support the
shifted immediate value used in DX-Form instructions.
Signed-off-by: Shawn Anastasio <shawn@anastas.io>
Use a simple wire. common.vhdl types are better kept for things
local to the core. We can add more wires later if we need to for
HV irqs etc...
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
- Changing use of others in core files to satisfy VCS
- Adding workaround for VCS subtype constraint inconsistencies in common.vhdl
Signed-off-by: Jonathan Balkind <jbalkind@princeton.edu>
Slbia (with IH=7) is used in the Linux kernel to flush the ERATs
(our iTLB/dTLB), so make it do that.
This moves the logic to work out whether to flush a single entry
or the whole TLB from dcache and icache into mmu. We now invalidate
all dTLB and iTLB entries when the AP (actual pagesize) field of
RB is non-zero on a tlbie[l], as well as when IS is non-zero.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This hooks up the connections so that an OP_FETCH_FAILED coming down
to loadstore1 will get sent to the MMU for it to do a radix tree walk
for the instruction address. The MMU then sends the resulting PTE to
the icache module to be installed in the iTLB. If no valid PTE can
be found, the MMU sends an error signal back to loadstore1 which sends
it on to execute1 to generate an ISI.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a direct-mapped TLB to the icache, with 64 entries by default.
Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along
with redirects to indicate whether instruction addresses should be
translated through the TLB, and fetch1 sends that on to icache.
Similarly a "priv_mode" signal is sent to indicate the privilege
mode for instruction fetches. This means that changes to MSR[IR]
or MSR[PR] don't take effect until the next redirect, meaning an
isync, rfid, branch, etc.
The icache uses a hash of the effective address (i.e. next instruction
address) to index the TLB. The hash is an XOR of three fields of the
address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and
24--29 of the address. TLB invalidations simply invalidate the
indexed TLB entry without checking the contents.
If the icache detects a TLB miss with virt_mode=1, it will send a
fetch_failed indication through fetch2 to decode1, which will turn it
into a special OP_FETCH_FAILED opcode with unit=LDST. That will get
sent down to loadstore1 which will currently just raise a Instruction
Storage Interrupt (0x400) exception.
One bit in the PTE obtained from the TLB is used to check whether an
instruction access is allowed -- the privilege bit (bit 3). If bit 3
is 1 and priv_mode=0, then a fetch_failed indication is sent down to
fetch2 and to decode1, which generates an OP_FETCH_FAILED. Any PTEs
with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put
into the iTLB since such PTEs would not allow execution by any
context.
Tlbie operations get sent from mmu to icache over a new connection.
Unfortunately the privileged instruction tests are broken for now.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
A data segment interrupt (DSegI) occurs when an address to be
translated by the MMU is outside the range of the radix tree
or the top two bits of the address (the quadrant) are 01 or 10.
This is detected in a new state of the MMU state machine, and
is sent back to loadstore1 as an error, which sends it on to
execute1 to generate an interrupt to the 0x380 vector.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds logic to the dcache to check the permissions encoded in
the PTE that it gets from the dTLB. The bits that are checked are:
R must be 1
C must be 1 for a store
EAA(0) - if this is 1, MSR[PR] must be 0
EAA(2) must be 1 for a store
EAA(1) | EAA(2) must be 1 for a load
In addition, ATT(0) is used to indicate a cache-inhibited access.
This now implements DSISR bits 36, 38 and 45.
(Bit numbers above correspond to the ISA, i.e. using big-endian
numbering.)
MSR[PR] is now conveyed to loadstore1 for use in permission checking.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a path from loadstore1 back to execute1 for reporting
errors, and machinery in execute1 for generating data storage
interrupts at vector 0x300.
If dcache is given two requests in successive cycles and the
first encounters an error (e.g. a TLB miss), it will now cancel
the second request.
Loadstore1 now responds to errors reported by dcache by sending
an exception signal to execute1 and returning to the idle state.
Execute1 then writes SRR0 and SRR1 and jumps to the 0x300 Data
Storage Interrupt vector. DAR and DSISR are held in loadstore1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a TLB to dcache, providing the ability to translate
addresses for loads and stores. No protection mechanism has been
implemented yet. The MSR_DR bit controls whether addresses are
translated through the TLB.
The TLB is a fixed-pagesize, set-associative cache. Currently
the page size is 4kB and the TLB is 2-way set associative with 64
entries per set.
This implements the tlbie instruction. RB bits 10 and 11 control
whether the whole TLB is invalidated (if either bit is 1) or just
a single entry corresponding to the effective page number in bits
12-63 of RB.
As an extension until we get a hardware page table walk, a tlbie
instruction with RB bits 9-11 set to 001 will load an entry into
the TLB. The TLB entry value is in RS in the format of a radix PTE.
Currently there is no proper handling of TLB misses. The load or
store will not be performed but no interrupt is generated.
In order to make timing at 100MHz on the Arty A7-100, we compare
the real address from each way of the TLB with the tag from each way
of the cache in parallel (requiring # TLB ways * # cache ways
comparators). Then the result is selected based on which way hit in
the TLB. That avoids a timing path going through the TLB EA
comparators, the multiplexer that selects the RA, and the cache tag
comparators.
The hack where addresses of the form 0xc------- are marked as
cache-inhibited is kept for now but restricted to real-mode accesses.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This arranges for some mfspr and mtspr to get sent to loadstore1
instead of being handled in execute1. In particular, DAR and DSISR
are handled this way. They are therefore "slow" SPRs.
While we're at it, fix the spelling of HEIR and remove mention of
DAR and DSISR from the comments in execute1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This provides commands on the debug interface to read the value of
the MSR or any of the 64 GSPR register file entries. The GSPR values
are read using the B port of the register file in a cycle when
decode2 is not using it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Mfspr from an unimplemented SPR should be a no-op in privileged state,
so in this case we need to write back whatever was previously in the
destination register. For problem state, both mtspr and mfspr to
unimplemented SPRs should cause a program interrupt.
There are special cases in the architecture for SPRs 0, 4 5 and 6
which we still don't implement.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This mainly required the addition of an entry to the opcode 31 decode
table and a 32-bit sign-extender in the rotator.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
During slow instructions such as multiply or divide, if a decrementer
(or other asynchronous) interrupt becomes pending, it disrupts the
logic that keeps stall asserted until the end of the slow
instruction, and the interrupt logic starts trying to deliver the
interrupt before the slow instruction has finished.
To fix that, make the interrupt logic wait until it sees e_in.valid
set before setting exception to 1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
In preparation for adding a TLB to the dcache, this plumbs the
insn_type from execute1 through to loadstore1, so that we can have
other operations besides loads and stores (e.g. tlbie) going to
loadstore1 and thence to the dcache. This also plumbs the unit field
of the decode ROM from decode2 through to execute1 to simplify the
logic around which ops need to go to loadstore1.
The load and store data formatting are now not conditional on the
op being OP_LOAD or OP_STORE. This eliminates the inferred latches
clocked by each of the bits of r.op that we were getting previously.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds logic to execute1 to check, when MSR[PR] = 1, whether each
instruction arriving to be executed is a privileged instruction.
If it is, a privileged-instruction type program interrupt is generated.
For the mtspr and mfspr instructions, we need to look at bit 20 of the
instruction (bit 4 of the SPR number) to determine if the SPR is
privileged.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes our treatment of the MSR conform better with the ISA.
- On reset, initialize the MSR to have the SF and LE bits set and
all the others reset. For good measure initialize r properly too.
- Fix the bit numbering in msr_copy (the code was using big-endian
bit numbers, not little-endian).
- Use constants like MSR_EE to index MSR bits instead of expressions
like '63 - 48', for readability.
- Set MSR[SF, LE] and clear MSR[PR, IR, DR, RI] on interrupts.
- Copy the relevant fields for rfid instead of using msr_copy, because
the partial function fields of the MSR should be left unchanged,
not zeroed. Our implementation of rfid is like the architecture
description of hrfid, because we don't implement hypervisor mode.
- Return the whole MSR for mfmsr.
- Implement the L field for mtmsrd (L=1 copies just EE and RI).
- For mtmsrd with L=0, leave out the HV, ME and LE bits as per the arch.
- For mtmsrd and rfid, if PR ends up set, then also set EE, IR and DR
as per the arch.
- A few other minor tidyups (no semantic change).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
New unified ICP and ICS XICS compliant interrupt controller.
Configurable number of hardware sources.
Fixed hardware source number based on hardware line taken. All
hardware interrupts are a fixed priority. Level interrupts supported
only.
Hardwired to 0xc0004000 in SOC (UART is kept at 0xc0002000).
Signed-off-by: Michael Neuling <mikey@neuling.org>
This fixes a bug in the logic where we would still send a load
or store instruction to loadstore1 even though we have decided
to take an asynchronous interrupt.
Reported-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This decodes attn using entry 0 of the major_decode_rom_array table
instead of a special case in the decode1_1 process. This means that
only the major opcode (the top 6 bits) is checked at decode time.
To make sure the instruction is attn not some random illegal pattern,
we now check bits 1-10 of the instruction at execute time and
generate an illegal instruction interrupt if those bits are not
0100000000.
This reduces LUT consumption by 42 LUTs on the Arty A7-100.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This decodes sc using entry 17 of the major_decode_rom_array table
instead of a special case in the decode1_1 process. This means that
only the major opcode (the top 6 bits) is checked at decode time.
To make sure that the instruction is sc not scv, we now check bit
1 of the instruction at execute time and generate an illegal
instruction interrupt if it is 0 (indicating scv). The level field
of the sc instruction is now ignored.
This reduces LUT consumption by 31 LUTs on the Arty A7-100.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the trap instructions (tw, twi, td, tdi) using
much of the same code as is used for the cmp/cmpl instructions.
A 5-bit comparison value is generated, and for cmp/cmpl, the
appropriate 3 bits are used to update the destination CR, and for
trap instructions, the comparison value is ANDed with the TO
field, and an exception is generated if any bit of the result
is 1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This replaces OP_TD, OP_TDI, OP_TW and OP_TWI with a single OP_TRAP,
distinguishing the cases by the input_reg_b and is_32bit fields of
the decode ROM. This adds the twi and td cases to the decode tables.
For now we make all of the trap instructions unconditionally generate
a trap-type program interrupt if the TO field of the instruction is
all ones, and do nothing otherwise.
This reduces the number of values in insn_type_t from 65 to 62,
meaning that an insn_type_t can now be encoded in 6 bits rather
than 7.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes some simplifications to the interrupt logic which will
help with later commits.
- When irq_valid is set, don't set exception to 1 until we have a
valid instruction. That means we can remove the if e_in.valid = '1'
test from the exception = '1' block.
- Don't assert stall_out on the first cycle of delivering an
interrupt. If we do get another instruction in the next cycle,
nothing will happen because we have ctrl.irq_state set and we
will just continue writing the interrupt registers.
- Make sure we deliver as many completions as we got instructions,
otherwise the outstanding instruction count in control.vhdl gets
out of sync.
- In writeback, make sure all of the other write enables are ignored
when e_in.exc_write_enable is set.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
OP_MCRF covers the CR logical ops as well as mcrf since commit
c05441bf47 ("Implement CRNOR and friends"), so this renames
OP_MCRF to OP_CROP. The OP_* values for the individual CR logical
ops (OP_CRAND, etc.) are not used, so remove them from insn_type_t.
No functional change.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds separate fields in Execute1ToWritebackType for use in
writing SRR0/1 (and in future other SPRs) on an interrupt. With
this, we make timing once again on the Arty A7-100 -- previously
we were missing by 0.2ns, presumably due to the result mux being
wider than before.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds the following exceptions:
- 0x700 program check (for illegal instructions)
- 0x900 decrementer
- 0xc00 system call
This also adds some supervisor state:
- decremeter
- msr
(SPRG0/1 and SRR0/1 already exist as fast SPRs)
It also adds some supporting instructions:
- rfid
- mtmsrd
- mfmsr
- sc
MSR state is added but only EE is used in this patch set. Other bits
are read/written but are not used at all.
This adds a 2 stage state machine to execute1.vhdl. This state machine
allows fast SPRS SRR0/1 to be written in different cycles. This state
machine can be extended later to add DAR and DSISR SPR writing for
more complex exceptions like page faults.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Currently we decode attn but we just mark it as an illegal.
This adds a separate case statement in execute 1 for attn to terminate
the core. Illegals also do this currently but we are soon implementing
a 0x700 execption for them.
Signed-off-by: Michael Neuling <mikey@neuling.org>
This adds support for lbzcix, lhzcix, lwzcix, ldcix, stbcix, sthcix,
stwcix and stdcix. The temporary hack where accesses to addresses of
the form 0xc??????? are made non-cacheable is left in for now to avoid
making existing programs non-functional.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
So that the dcache could in future be used by an MMU, this moves
logic to do with data formatting, rA updates for update-form
instructions, and handling of unaligned loads and stores out of
dcache and into loadstore1. For now, dcache connects only to
loadstore1, and loadstore1 now has the connection to writeback.
Dcache generates a stall signal to loadstore1 which indicates that
the request presented in the current cycle was not accepted and
should be presented again. However, loadstore1 doesn't currently
use it because we know that we can never hit the circumstances
where it might be set.
For unaligned transfers, loadstore1 generates two requests to
dcache back-to-back, and then waits to see two acks back from
dcache (cycles where d_in.valid is true).
Loadstore1 now has a FSM for tracking how many acks we are
expecting from dcache and for doing the rA update cycles when
necessary. Handling for reservations and conditional stores is
still in dcache.
Loadstore1 now generates its own stall signal back to decode2,
so we no longer need the logic in execute1 that generated the stall
for the first two cycles.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This involves plumbing the (existing) 'reserve' and 'rc' bits in
the decode tables down to dcache, and 'rc' and 'store_done' bits
from dcache to writeback.
It turns out that we had 'RC' set in the 'rc' column for several
ordinary stores and for the attn instruction. This corrects them
to 'NONE', and sets the 'rc' column to 'ONE' for the conditional
stores.
In writeback we now have logic to set CR0 when the input from dcache
has rc = 1.
In dcache we have the reservation itself, which has a valid bit
and the address down to cache line granularity. We don't currently
store the reservation length. For a store conditional which fails,
we set a 'cancel_store' signal which inhibits the write to the
cache and prevents the state machine from starting a bus cycle or
going to the STORE_WAIT_ACK state. Instead we set r1.stcx_fail
which causes the instruction to complete in the next cycle with
rc=1 and store_done=0.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This removes the constraint that loads and stores are single-issue,
at the expense of a stall of at least 2 cycles for every load and
store.
To do this, we plumb the existing stall signal that was generated
in dcache to core, where it gets ORed with the stall signal from
execute1. Execute1 generates a stall signal for the first two
cycles of each load and store, and dcache generates the stall
signal in the 3rd and subsequent cycles if it needs to.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
It turns out that CR logical instructions have the truth table of
the operation embedded in the instruction word. This means that we
can collect the two input operand bits into a 2-bit value and use
that as the index to select the appropriate bit from the instruction
word.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a register in the middle of the countzero computation,
so that we now have two cycles to count leading or trailing zeroes
instead of just one. Execute1 now outputs a one-cycle stall signal
when it encounters a cntlz* or cnttz* instruction. With this,
the countzero path no longer fails timing on the Artix-7 at 100MHz.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>