This implements most of the architected PMU events. The ones missing
are mostly the ones that depend on which level of the cache hierarchy
data is fetched from. The events implemented here, and their raw
event codes, are:
Floating-point operation completed (100f4)
Load completed (100fc)
Store completed (200f0)
Icache miss (200fc)
ITLB miss (100f6)
ITLB miss resolved (400fc)
Dcache load miss (400f0)
Dcache load miss resolved (300f8)
Dcache store miss (300f0)
DTLB miss (300fc)
DTLB miss resolved (200f6)
No instruction available and none being executed (100f8)
Instruction dispatched (200f2, 300f2, 400f2)
Taken branch instruction completed (200fa)
Branch mispredicted (400f6)
External interrupt taken (200f8)
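For illustration, each of these events reduces to a one-cycle pulse that increments a counter whenever counting is not frozen. The sketch below is a minimal, hypothetical PMC along those lines; the entity, port names and 32-bit width are illustrative, not the actual PMU code.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity pmc_sketch is
        port (
            clk         : in  std_ulogic;
            freeze      : in  std_ulogic;                      -- e.g. MMCR0[FC]
            event_pulse : in  std_ulogic;                      -- e.g. "load completed"
            count       : out std_ulogic_vector(31 downto 0)
        );
    end entity;

    architecture behave of pmc_sketch is
        signal ctr : unsigned(31 downto 0) := (others => '0');
    begin
        count <= std_ulogic_vector(ctr);

        process (clk)
        begin
            if rising_edge(clk) then
                -- count one event per cycle while counting is not frozen
                if freeze = '0' and event_pulse = '1' then
                    ctr <= ctr + 1;
                end if;
            end if;
        end process;
    end architecture;
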
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This is the start of an implementation of a PMU according to PowerISA
v3.0B. Things not implemented yet include most architected events,
the BHRB, event-based branches, thresholding, MMCR0[TBCC] field, etc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At present the logic prevents any interrupts from being handled while
there is a load/store instruction (one that has unit=LDST) being
executed. However, load/store instructions can still get sent to
loadstore1. Thus an instruction that should generate an interrupt,
such as a floating-point unavailable interrupt, will instead get
executed.
To fix this, when we detect that an interrupt should be generated but
loadstore1 is still executing a previous instruction, we don't execute
any new instructions, and set a new r.intr_pending flag. That results
in busy_out being asserted (meaning that no further instructions will
come in from decode2). When loadstore1 has finished the instructions
it has, the interrupt gets sent to writeback. If one of the
instructions in loadstore1 generates an interrupt in the meantime, the
l_in.interrupt signal gets asserted and that clears r.intr_pending, so
the interrupt we detected gets discarded.
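Roughly, the new flag behaves as in the stand-alone sketch below; the port names are illustrative, and the real logic lives in execute1's state record rather than a separate entity.

    library ieee;
    use ieee.std_logic_1164.all;

    entity intr_pending_sketch is
        port (
            clk              : in  std_ulogic;
            interrupt_detect : in  std_ulogic;  -- execute1 wants to take an interrupt
            lsu_in_progress  : in  std_ulogic;  -- loadstore1 still has instructions
            lsu_interrupt    : in  std_ulogic;  -- l_in.interrupt
            intr_pending     : out std_ulogic;
            busy_out         : out std_ulogic
        );
    end entity;

    architecture behave of intr_pending_sketch is
        signal pending : std_ulogic := '0';
    begin
        intr_pending <= pending;
        -- while the interrupt is pending, accept nothing more from decode2
        busy_out <= pending;

        process (clk)
        begin
            if rising_edge(clk) then
                if lsu_interrupt = '1' then
                    -- an interrupt from loadstore1 wins; discard ours
                    pending <= '0';
                elsif pending = '1' and lsu_in_progress = '0' then
                    -- loadstore1 has drained; the interrupt goes to writeback
                    pending <= '0';
                elsif interrupt_detect = '1' and lsu_in_progress = '1' then
                    pending <= '1';
                end if;
            end if;
        end process;
    end architecture;
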
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements a 1-entry partition table, so that instead of getting
the process table base address from the PRTBL SPR, the MMU now reads
the doubleword at the address in the PTCR register plus 8 to get the
process table base address. The partition table entry is cached.
Having the PTCR and the vestigial partition table reduces the amount
of software change required in Linux for Microwatt support.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The idea here is that we can have multiple instructions in progress at
the same time as long as they all go to the same unit, because that
unit will keep them in order. If we get an instruction for a
different unit, we wait for all the previous instructions to finish
before executing it. Since the loadstore unit is the only one that is
currently pipelined, this boils down to saying that loadstore
instructions can go ahead while l_in.in_progress = 1 but other
instructions have to wait until it is 0.
This gives a 2% increase on coremark performance on the Arty A7-100
(from ~190 to ~194).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This fixes two bugs which show up when multiple operations are in
flight in the dcache, and adds a 'hold' input which will be needed
when loadstore1 is pipelined.
The first bug is that dcache needs to sample the data for a store on
the cycle after the store request comes in even if the store request
is held up because of a previous request (e.g. if the previous request
is a load miss or a dcbz).
The second bug is that a load request coming in for a cache line being
refilled needs to be handled immediately in the case where it is for
the row whose data arrives on the same cycle. If it is not, then it
will be handled as a separate cache miss and the cache line will be
refilled again into a different way, leading to two ways both being
valid for the same tag. This can lead to data corruption, in the
scenario where subsequent writes go to one of the ways and then that
way gets displaced but the other way doesn't. This bug could in
principle show up even without having multiple operations in flight in
the dcache.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the logic for redirecting fetching and writing SRR0 and
SRR1 to writeback. The aim is that ultimately units other than
execute1 can send their interrupts to writeback along with their
instruction completions, so that there can be multiple instructions
in flight without needing execute1 to keep track of the address
of each outstanding instruction.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the bypass path. Previously it went from after
execute1's output to after decode2's output. Now it goes from before
execute1's output register to before decode2's output register. The
reason is that the new path will be simpler to manage when there are
possibly multiple instructions in flight. This means that the
bypassing can be managed inside decode2 and control.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the way GPR hazards are detected and tracked. Instead of
having a model of the pipeline in gpr_hazard.vhdl, which has to mirror
the behaviour of the real pipeline exactly, we now assign a 2-bit tag
to each instruction and record which GSPR the instruction writes.
Subsequent instructions that need to use the GSPR get the tag number
and stall until the value with that tag is being written back to the
register file.
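The idea, reduced to a stand-alone sketch with 2-bit tags and a 6-bit GSPR number space (names, widths and the table organisation are illustrative, not the actual hazard-tracking code):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity tag_track_sketch is
        port (
            clk         : in  std_ulogic;
            -- instruction being issued
            issue_valid : in  std_ulogic;
            issue_tag   : in  std_ulogic_vector(1 downto 0);
            issue_dest  : in  std_ulogic_vector(5 downto 0);
            -- source operand being looked up
            read_gspr   : in  std_ulogic_vector(5 downto 0);
            stall       : out std_ulogic;
            -- writeback announcing completion of a tag
            wb_valid    : in  std_ulogic;
            wb_tag      : in  std_ulogic_vector(1 downto 0)
        );
    end entity;

    architecture behave of tag_track_sketch is
        type tag_entry is record
            valid : std_ulogic;
            tag   : std_ulogic_vector(1 downto 0);
        end record;
        type tag_array is array (0 to 63) of tag_entry;
        signal tags : tag_array := (others => ('0', "00"));
    begin
        -- stall while the newest writer of the source GSPR hasn't written back
        stall <= tags(to_integer(unsigned(read_gspr))).valid;

        process (clk)
        begin
            if rising_edge(clk) then
                -- a completed tag clears every GSPR it was recorded against
                if wb_valid = '1' then
                    for i in tags'range loop
                        if tags(i).tag = wb_tag then
                            tags(i).valid <= '0';
                        end if;
                    end loop;
                end if;
                -- record the tag of the newest writer of the destination GSPR
                if issue_valid = '1' then
                    tags(to_integer(unsigned(issue_dest))) <= ('1', issue_tag);
                end if;
            end if;
        end process;
    end architecture;
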
For now, the forwarding paths are disabled. That gives about an 8%
reduction in coremark performance.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This uses the instruction doubling machinery to convert conditional
branch instructions that update both CTR and LR (e.g., bdnzl, bdnzlrl)
into two instructions, of which the first updates CTR and determines
whether the branch is taken, and the second updates LR and does the
redirect if necessary.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This uses the instruction-doubling machinery to send load with update
instructions down to loadstore1 as two separate ops, rather than
one op with two destinations. This will help to simplify the value
tracking mechanisms.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements a cache in fetch1, where each entry stores the address
of a simple branch instruction (b or bc) and the target of the branch.
When fetching sequentially, if the address being fetched matches the
cache entry, then fetching will be redirected to the branch target.
The cache has 1024 entries and is direct-mapped, i.e. indexed by bits
11..2 of the NIA.
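In outline, the lookup and update work as in the following simplified sketch; the entity and signal names are illustrative, and the real BTC also handles the NIA + 8 pipelining described below.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity btc_sketch is
        port (
            clk        : in  std_ulogic;
            nia        : in  std_ulogic_vector(63 downto 0);
            hit        : out std_ulogic;
            target     : out std_ulogic_vector(63 downto 0);
            -- update information from execute1
            upd_valid  : in  std_ulogic;                      -- simple branch completed
            upd_taken  : in  std_ulogic;                      -- sets or clears the valid bit
            upd_addr   : in  std_ulogic_vector(63 downto 0);  -- address of the branch
            upd_target : in  std_ulogic_vector(63 downto 0)
        );
    end entity;

    architecture behave of btc_sketch is
        type btc_entry is record
            valid  : std_ulogic;
            tag    : std_ulogic_vector(63 downto 12);  -- address bits above the index
            target : std_ulogic_vector(63 downto 0);
        end record;
        type btc_array is array (0 to 1023) of btc_entry;
        signal btc : btc_array := (others => ('0', (others => '0'), (others => '0')));
    begin
        process (clk)
            variable e : btc_entry;
        begin
            if rising_edge(clk) then
                -- direct-mapped lookup, indexed by bits 11..2 of the NIA
                e := btc(to_integer(unsigned(nia(11 downto 2))));
                if e.valid = '1' and e.tag = nia(63 downto 12) then
                    hit <= '1';
                else
                    hit <= '0';
                end if;
                target <= e.target;

                -- update: valid is set for taken and cleared for not-taken branches
                if upd_valid = '1' then
                    btc(to_integer(unsigned(upd_addr(11 downto 2)))) <=
                        (valid => upd_taken, tag => upd_addr(63 downto 12),
                         target => upd_target);
                end if;
            end if;
        end process;
    end architecture;
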
The bus from execute1 now carries information about taken and
not-taken simple branches, which fetch1 uses to update the cache.
The cache entry is updated for both taken and not-taken branches, with
the valid bit being set if the branch was taken and cleared if the
branch was not taken.
If fetching is redirected to the branch target then that goes down the
pipe as a predicted-taken branch, and decode1 does not do any static
branch prediction. If fetching is not redirected, then the next
instruction goes down the pipe as normal and decode1 does its static
branch prediction.
In order to make timing, the lookup of the cache is pipelined, so on
each cycle the cache entry for the current NIA + 8 is read. This
means that after a redirect (from decode1 or execute1), only the third
and subsequent sequentially-fetched instructions will be able to be
predicted.
This improves the coremark value on the Arty A7-100 from about 180 to
about 190 (more than 5%).
The BTC is optional. Builds for the Artix 7 35-T part have it off by
default because the extra ~1420 LUTs it takes mean that the design
doesn't fit on the Arty A7-35 board.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This breaks up the enormous if .. elsif .. case .. elsif statement in
execute1 in order to try to make it simpler and more understandable.
We now have decode2 deciding whether the instruction has a value to be
written back to a register (GPR, GSPR, FPR, etc.) rather than
individual cases in execute1 setting result_en. The computation of
the data to be written back is now independent of detection of various
exception conditions. We now have: an if block that determines
whether any exception condition exists which prevents the next
instruction from being executed; then the case statement, which
performs actions such as setting carry/overflow bits, determining
whether a trap exception exists, doing branches, etc.; and then an if
statement for all the r.busy = 1 cases (continuing execution of an
instruction which was started in a previous cycle, or writing SRR1
for an interrupt).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds an explicit multiplexer feeding v.e.write_data in execute1,
with the select lines determined in the previous cycle based on the
insn_type. Similarly, for multiply and divide instructions, there is
now an explicit multiplexer.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes timing easier and also means that store floating-point
single precision instructions no longer need to take an extra cycle.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes it simpler to work out when to deliver a FPU unavailable
interrupt. This also means we can get rid of the OP_FPLOAD and
OP_FPSTORE insn_type values.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the lq, stq, lqarx and stqcx. instructions.
These instructions all access two consecutive GPRs; for example the
"lq %r6,0(%r3)" instruction will load the doubleword at the address
in R3 into R7 and the doubleword at address R3 + 8 into R6. To cope
with having two GPR sources or destinations, the instruction gets
repeated at the decode2 stage, that is, for each lq/stq/lqarx/stqcx.
coming in from decode1, two instructions get sent out to execute1.
For these instructions, the RS or RT register gets modified on one
of the iterations by setting the LSB of the register number. In LE
mode, the first iteration uses RS|1 or RT|1 and the second iteration
uses RS or RT. In BE mode, this is done the other way around. In
order for decode2 to know what endianness is currently in use, we
pass the big_endian flag down from icache through decode1 to decode2.
This is always in sync with what execute1 is using because only rfid
or an interrupt can change MSR[LE], and those operations all cause
a flush and redirect.
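The register-number selection described above reduces to the sketch below, assuming 5-bit GPR numbers; the entity and signal names are illustrative, not the actual decode2 code.

    library ieee;
    use ieee.std_logic_1164.all;

    entity lq_regnum_sketch is
        port (
            rt         : in  std_ulogic_vector(4 downto 0);  -- RT from the instruction (even)
            second     : in  std_ulogic;                     -- 1 on the second iteration
            big_endian : in  std_ulogic;                     -- current MSR[LE] = 0
            rt_eff     : out std_ulogic_vector(4 downto 0)   -- register actually used
        );
    end entity;

    architecture behave of lq_regnum_sketch is
    begin
        -- LE: first iteration uses RT|1, second uses RT;
        -- BE: first iteration uses RT, second uses RT|1.
        rt_eff <= rt(4 downto 1) & (rt(0) or not (second xor big_endian));
    end architecture;
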
There is now an extra column in the decode tables in decode1 to
indicate whether the instruction needs to be repeated. Decode1 also
enforces the rule that lq with RA = RT and lqarx with RA = RT or
RB = RT are illegal.
Decode2 now passes a 'repeat' flag and a 'second' flag to execute1,
and execute1 passes them on to loadstore1. The 'repeat' flag is set
for both iterations of a repeated instruction, and 'second' is set
on the second iteration. Execute1 does not take asynchronous or
trace interrupts on the second iteration of a repeated instruction.
Loadstore1 uses 'next_addr' for the second iteration of a repeated
load/store so that we access the second doubleword of the memory
operand. Thus loadstore1 accesses the doublewords in increasing
memory order. For 16-byte loads this means that the first iteration
writes GPR RT|1. It is possible that RA = RT|1 (this is a legal
but non-preferred form), meaning that if the memory operand was
misaligned, the first iteration would overwrite RA but then the
second iteration might take a page fault, leading to corrupted state.
To avoid that possibility, 16-byte loads in LE mode take an
alignment interrupt if the operand is not 16-byte aligned. (This
is the case anyway for lqarx, and we enforce it for lq as well.)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Some of the bits in the FPU buses end up in the Z state. Yosys
flags them, so we may as well clean them up.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds the skeleton of a floating-point unit and implements the
mffs and mtfsf instructions.
Execute1 sends FP instructions to the FPU and receives busy,
exception, FP interrupt and illegal interrupt signals from it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to loadstore1 to convert between single-precision and
double-precision formats, and implements the lfs* and stfs*
instructions. The conversion processes are described in Power ISA
v3.1 Book 1 sections 4.6.2 and 4.6.3.
These conversions take one cycle, so lfs* and stfs* are one cycle
slower than lfd* and stfd*.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.
We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.
The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file. It
defaults to true for all except the A7-35 boards.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Trace interrupts occur when the MSR[TE] field is non-zero and an
instruction other than rfid has been successfully completed. A trace
interrupt occurs before the next instruction is executed or any
asynchronous interrupt is taken.
Since the trace interrupt is defined to set SRR1 bits depending on
whether the traced instruction is a load or an instruction treated as
a load, or a store or an instruction treated as a store, we need to
make sure the treated-as-a-load instructions (icbi, icbt, dcbt, dcbst,
dcbf) and the treated-as-a-store instructions (dcbtst, dcbz) have the
correct opcodes in decode1. Several of them were previously marked as
OP_NOP.
We don't yet implement the SIAR or SDAR registers, which should be set
by trace interrupts.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Load-and-reserve and store-conditional instructions are required to
generate an alignment interrupt (0x600 vector) if their EA is not
aligned. Implement this.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
In 32-bit mode, effective addresses are truncated to 32 bits, both for
instruction fetches and data accesses, and CR0 is set for Rc=1 (record
form) instructions based on the lower 32 bits of the result rather
than all 64 bits.
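As a stand-alone sketch of the Rc=1 comparison (SO comes from XER; the entity, port names and assumption that a single mode bit selects the width are illustrative, not the actual execute1/writeback code):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity cr0_sketch is
        port (
            result : in  std_ulogic_vector(63 downto 0);
            msr_sf : in  std_ulogic;                    -- 1 = 64-bit mode
            so     : in  std_ulogic;                    -- XER[SO]
            cr0    : out std_ulogic_vector(3 downto 0)  -- LT, GT, EQ, SO
        );
    end entity;

    architecture behave of cr0_sketch is
    begin
        process (all)
            variable sign : std_ulogic;
            variable zero : boolean;
        begin
            if msr_sf = '1' then
                sign := result(63);
                zero := unsigned(result) = 0;
            else
                -- 32-bit mode: compare only the low 32 bits
                sign := result(31);
                zero := unsigned(result(31 downto 0)) = 0;
            end if;
            if zero then
                cr0 <= "001" & so;
            elsif sign = '1' then
                cr0 <= "100" & so;
            else
                cr0 <= "010" & so;
            end if;
        end process;
    end architecture;
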
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Big-endian mode affects both instruction fetches and data accesses.
For instruction fetches, we byte-swap each word read from memory when
writing it into the icache data RAM, and use a tag bit to indicate
whether each cache line contains instructions in BE or LE form.
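The per-word swap is the usual byte reversal, as in this small sketch (a hypothetical helper, not the actual icache code):

    library ieee;
    use ieee.std_logic_1164.all;

    package byteswap_sketch is
        function byteswap32(w : std_ulogic_vector(31 downto 0)) return std_ulogic_vector;
    end package;

    package body byteswap_sketch is
        -- reverse the four bytes of one instruction word
        function byteswap32(w : std_ulogic_vector(31 downto 0)) return std_ulogic_vector is
        begin
            return w(7 downto 0) & w(15 downto 8) & w(23 downto 16) & w(31 downto 24);
        end function;
    end package body;
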
For data accesses, we simply need to invert the existing byte_reverse
signal in BE mode. The only thing to be careful of is to get the sign
bit from the correct place when doing a sign-extending load that
crosses two doublewords of memory.
For now, interrupts unconditionally set MSR[LE]. We will need some
sort of interrupt-little-endian bit somewhere, perhaps in LPCR.
This also fixes a debug report statement in fetch1.vhdl.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the interface to the multiplier more general so an instance
of it can be used in the FPU. It now has a 128-bit addend that is
added on to the product. Instead of an input to negate the output,
it now has a "not_result" input to complement the output. Execute1
uses not_result=1 and addend=-1 to get the effect of negating the
output. The interface is defined this way because this is what can
be done easily with the Xilinx DSP slices in xilinx-mult.vhdl.
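The negation trick relies on the two's-complement identity not(x) = -x - 1, so not(x + (-1)) = -x. A stand-alone sketch of the output stage, with illustrative names and widths (not the actual multiply/xilinx-mult code):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity mult_sketch is
        port (
            a, b       : in  unsigned(63 downto 0);
            addend     : in  unsigned(127 downto 0);
            not_result : in  std_ulogic;
            result     : out unsigned(127 downto 0)
        );
    end entity;

    architecture behave of mult_sketch is
    begin
        process (all)
            variable prod : unsigned(127 downto 0);
        begin
            prod := (a * b) + addend;
            if not_result = '1' then
                -- with addend = -1 this yields -(a * b)
                result <= not prod;
            else
                result <= prod;
            end if;
        end process;
    end architecture;
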
This also adds clock enable signals to the DSP slices, mostly for the
sake of reducing power consumption.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the calculation of busy as simple as possible and dependent
only on register outputs. The timing of busy is critical, as it gates
the valid signal for the next instruction, and therefore any delays
in dropping busy at the end of a load or store directly impact the
timing of a host of other paths.
This also separates the 'done without error' and 'done with error'
cases from the MMU into separate signals that are both driven directly
from registers.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a path to allow the CR result of one instruction to be
forwarded to the next instruction, so that sequences such as
cmp; bc can avoid having a 1-cycle bubble.
Forwarding is not available for dot-form (Rc=1) instructions,
since the CR result for them is calculated in writeback. The
decode.output_cr field is used to identify those instructions
that compute the CR result in execute1.
For some reason, the multiply instructions incorrectly had
output_cr = 1 in the decode tables. This fixes that.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This latches the redirect signal inside execute1, so that it is sent
a cycle later to fetch1 (and to decode/icache as flush). This breaks
a long combinatorial chain from the branch and interrupt detection
in execute1 through the redirect/flush signals all the way back to
fetch1, icache and decode.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the CFAR SPR as a slow SPR stored in 'ctrl'. Taken
branches and rfid update it to the address of the branch or rfid
instruction.
To simplify the logic, this makes rfid use the branch logic to
generate its redirect (requiring SRR0 to come in to execute1 on
the B input and SRR1 on the A input), and the masking of the bottom
2 bits of NIA is moved to fetch1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Move the external interrupt generation to a separate module
"ICS" (source controller), which has a register per source containing
currently only the priority control.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This implements a simple branch predictor in the decode1 stage. If it
sees that the instruction is b or bc and the branch is predicted to be
taken, it sends a flush and redirect upstream (to icache and fetch1)
to redirect fetching to the branch target. The prediction is sent
downstream with the branch instruction, and execute1 now only sends
a flush/redirect upstream if the prediction was wrong. Unconditional
branches are always predicted to be taken, and conditional branches
are predicted to be taken if and only if the offset is negative.
Branches that take the branch address from a register (bclr, bcctr)
are predicted not taken, as we don't have any way to predict the
branch address.
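The prediction rule itself is tiny; as a stand-alone sketch (opcode values as defined by the ISA, entity and signal names illustrative, not the actual decode1 code):

    library ieee;
    use ieee.std_logic_1164.all;

    entity predict_sketch is
        port (
            insn            : in  std_ulogic_vector(31 downto 0);
            predicted_taken : out std_ulogic
        );
    end entity;

    architecture behave of predict_sketch is
    begin
        process (all)
            variable op : std_ulogic_vector(5 downto 0);
        begin
            op := insn(31 downto 26);
            predicted_taken <= '0';             -- bclr, bcctr and non-branches
            if op = "010010" then               -- b (opcode 18): always predicted taken
                predicted_taken <= '1';
            elsif op = "010000" then            -- bc (opcode 16): taken iff BD is negative
                predicted_taken <= insn(15);    -- sign bit of the branch displacement
            end if;
        end process;
    end architecture;
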
Since we can now have a mflr being executed immediately after a bl
or bcl, we now track the update to LR in the hazard tracker, using
the second write register field that is used to track RA updates for
update-form loads and stores.
For those branches that update LR but don't write any other result
(i.e. that don't decrement CTR), we now write back LR in the same
cycle as the instruction rather than taking a second cycle for the
LR writeback.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the instruction dependency tracking so that we can
generate a "busy" signal from execute1 and loadstore1 which comes
along one cycle later than the current "stall" signal. This will
enable us to signal busy cycles only when we need to from loadstore1.
The "busy" signal from execute1/loadstore1 indicates "I didn't take
the thing you gave me on this cycle", as distinct from the previous
stall signal which meant "I took that but don't give me anything
next cycle". That means that decode2 proactively gives execute1
a new instruction as soon as it has taken the previous one (assuming
there is a valid instruction available from decode1), and that then
sits in decode2's output until execute1 can take it. So instructions
are issued by decode2 somewhat earlier than they used to be.
Decode2 now only signals a stall upstream when its output buffer is
full, meaning that we can fill up bubbles in the upstream pipe while a
long instruction is executing. This gives a small boost in
performance.
This also adds dependency tracking for rA updates by update-form
load/store instructions.
The GPR and CR hazard detection machinery now has one extra stage,
which may not be strictly necessary. Some of the code now really
only applies to PIPELINE_DEPTH=1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The icache can now detect a hit on a line being refilled from memory,
as we have an array of individual valid bits per row for the line
that is currently being loaded. This enables the request that
initiated the refill to be satisfied earlier, and also enables
following requests to the same cache line to be satisfied before the
line is completely refilled. Furthermore, the refill now starts
at the row that is needed. This should reduce the latency for an
icache miss.
We now get a 'sequential' indication from fetch1, and use that to know
when we can deliver an instruction word using the other half of the
64-bit doubleword that was read last cycle. This doesn't make much
difference at the moment, but it frees up cycles where we could test
whether the next line is present in the cache so that we could
prefetch it if not.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This puts the logic that selects which bits of the multiplier result
get written into the destination GPR into execute1, moved out from
multiply.
The multiplier is now expected to do an unsigned multiplication of
64-bit operands, optionally negate the result, detect 32-bit
or 64-bit signed overflow of the result, and return a full 128-bit
result.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The fetch2 stage existed primarily to provide a stash buffer for the
output of icache when a stall occurred. However, we can get the same
effect -- of having the input to decode1 stay unchanged on a stall
cycle -- by using the read enable of the BRAMs in icache, and by
adding logic to keep the outputs unchanged on a clock cycle when
stall_in = 1. This reduces branch and interrupt latency by one
cycle.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Use a simple wire. common.vhdl types are better kept for things
local to the core. We can add more wires later if we need to for
HV irqs etc...
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
- Changing use of 'others' in core files to satisfy VCS
- Adding a workaround for VCS subtype constraint inconsistencies in common.vhdl
Signed-off-by: Jonathan Balkind <jbalkind@princeton.edu>
This adds the PID register and repurposes SPR 720 as the PRTBL
register, which points to the base of the process table. There
doesn't seem to be any point to implementing the partition table given
that we don't have hypervisor mode.
The MMU caches entry 0 of the process table internally (in pgtbl3)
plus the entry indexed by the value in the PID register (pgtbl0).
Both caches are invalidated by a tlbie[l] with RIC=2 or by a move to
PRTBL. The pgtbl0 cache is invalidated by a move to PID. The dTLB
and iTLB are cleared by a move to either PRTBL or PID.
Which of the two page table root pointers is used (pgtbl0 or pgtbl3)
depends on the MSB of the address being translated. Since the segment
checking ensures that address(63) = address(62), this is sufficient to
map quadrants 0 and 3.
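The selection itself is a single multiplexer, sketched here with illustrative names (not the actual mmu.vhdl code):

    library ieee;
    use ieee.std_logic_1164.all;

    entity pt_root_sketch is
        port (
            addr63 : in  std_ulogic;                        -- MSB of the address
            pgtbl0 : in  std_ulogic_vector(63 downto 0);    -- cached root for quadrant 0 (per-PID)
            pgtbl3 : in  std_ulogic_vector(63 downto 0);    -- cached root for quadrant 3 (entry 0)
            root   : out std_ulogic_vector(63 downto 0)
        );
    end entity;

    architecture behave of pt_root_sketch is
    begin
        -- segment checking guarantees addr(63) = addr(62), so one bit suffices
        root <= pgtbl3 when addr63 = '1' else pgtbl0;
    end architecture;
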
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Slbia (with IH=7) is used in the Linux kernel to flush the ERATs
(our iTLB/dTLB), so make it do that.
This moves the logic to work out whether to flush a single entry
or the whole TLB from dcache and icache into mmu. We now invalidate
all dTLB and iTLB entries when the AP (actual pagesize) field of
RB is non-zero on a tlbie[l], as well as when IS is non-zero.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This hooks up the connections so that an OP_FETCH_FAILED coming down
to loadstore1 will get sent to the MMU for it to do a radix tree walk
for the instruction address. The MMU then sends the resulting PTE to
the icache module to be installed in the iTLB. If no valid PTE can
be found, the MMU sends an error signal back to loadstore1 which sends
it on to execute1 to generate an ISI.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a direct-mapped TLB to the icache, with 64 entries by default.
Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along
with redirects to indicate whether instruction addresses should be
translated through the TLB, and fetch1 sends that on to icache.
Similarly a "priv_mode" signal is sent to indicate the privilege
mode for instruction fetches. This means that changes to MSR[IR]
or MSR[PR] don't take effect until the next redirect, meaning an
isync, rfid, branch, etc.
The icache uses a hash of the effective address (i.e. next instruction
address) to index the TLB. The hash is an XOR of three fields of the
address; with a 64-entry TLB, the fields are bits 12..17, 18..23 and
24..29 of the address. TLB invalidations simply invalidate the
indexed TLB entry without checking the contents.
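The hash is just an XOR reduction of the three fields, as in this sketch of a hypothetical helper matching the default 64-entry TLB (not the actual icache code):

    library ieee;
    use ieee.std_logic_1164.all;

    package itlb_hash_sketch is
        function hash_ea(addr : std_ulogic_vector(63 downto 0))
            return std_ulogic_vector;
    end package;

    package body itlb_hash_sketch is
        -- XOR of bits 12..17, 18..23 and 24..29 of the effective address
        function hash_ea(addr : std_ulogic_vector(63 downto 0))
            return std_ulogic_vector is
        begin
            return addr(17 downto 12) xor addr(23 downto 18) xor addr(29 downto 24);
        end function;
    end package body;
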
If the icache detects a TLB miss with virt_mode=1, it will send a
fetch_failed indication through fetch2 to decode1, which will turn it
into a special OP_FETCH_FAILED opcode with unit=LDST. That will get
sent down to loadstore1, which will currently just raise an
Instruction Storage Interrupt (0x400) exception.
One bit in the PTE obtained from the TLB is used to check whether an
instruction access is allowed -- the privilege bit (bit 3). If bit 3
is 1 and priv_mode=0, then a fetch_failed indication is sent down to
fetch2 and to decode1, which generates an OP_FETCH_FAILED. Any PTEs
with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put
into the iTLB since such PTEs would not allow execution by any
context.
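In terms of the PTE bits named above, the two checks amount to the following sketch (entity and signal names illustrative, not the actual icache logic):

    library ieee;
    use ieee.std_logic_1164.all;

    entity itlb_pte_check_sketch is
        port (
            pte          : in  std_ulogic_vector(63 downto 0);
            priv_mode    : in  std_ulogic;
            fetch_failed : out std_ulogic;   -- privilege check done at fetch time
            itlb_allowed : out std_ulogic    -- whether the PTE may enter the iTLB
        );
    end entity;

    architecture behave of itlb_pte_check_sketch is
    begin
        -- bit 3 is the privilege bit: fail a problem-state fetch from a privileged page
        fetch_failed <= pte(3) and not priv_mode;
        -- bit 0 (EAA[3], execute permission) and bit 8 (R) must both be set
        itlb_allowed <= pte(0) and pte(8);
    end architecture;
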
Tlbie operations get sent from mmu to icache over a new connection.
Unfortunately the privileged instruction tests are broken for now.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This is required by the architecture. It means that the error bits
reported in DSISR or SRR1 now come from the permission/RC check done
on the refetched PTE rather than the TLB entry. Unfortunately that
somewhat breaks the software-loaded TLB mode of operation in that
DSISR/SRR1 always report no PTE rather than permission error or
RC failure.
This also restructures the loadstore1 state machine a bit, combining
the FIRST_ACK_WAIT and LAST_ACK_WAIT states into a single state and
the MMU_LOOKUP_1ST and MMU_LOOKUP_LAST states likewise. We now have a
'dwords_done' bit to say whether the first transfer of two (for an
unaligned access) has been done.
The cache paradox error (where a non-cacheable access finds a hit in
the cache) is now the only cause of a DSI from the dcache. This
should probably be a machine check rather than a DSI, in fact.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>