Instead of having TLB invalidation and TLB load requests come through
the dcache main path, these operations are now done in one cycle
entirely based on signals from the MMU, and don't involve the TLB read
path or the dcache state machine at all. So that we know which way of
the TLB to affect for invalidations, loadstore1 now sends down a "TLB
probe" operation for tlbie instructions which goes through the dcache
pipeline and sets the r1.tlb_hit_* fields which are used in the
subsequent invalidation operation from the MMU (if it is a single-page
invalidation). TLB load operations write to the way identified by
r1.victim_way, which was set on the TLB miss that triggered the TLB
reload.
Since we are writing just one way of the TLB tags now, rather than
writing all ways with one way's value changed, we now pad each way to
a multiple of 8 bits so that byte write-enables can be used to select
which way gets written.
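As a rough sketch of the scheme (widths and names here are illustrative,
not the actual dcache code): each 45-bit way tag is padded to 48 bits, so
a way occupies exactly 6 bytes of the tag word and a write touches only
those bytes.

    library ieee;
    use ieee.std_logic_1164.all;

    entity tlb_tag_ram is
        port (
            clk    : in  std_ulogic;
            index  : in  natural range 0 to 31;           -- TLB set index
            way    : in  natural range 0 to 3;            -- way to write
            wr_en  : in  std_ulogic;
            wr_tag : in  std_ulogic_vector(44 downto 0);  -- assumed tag width
            rd_set : out std_ulogic_vector(4 * 48 - 1 downto 0)
        );
    end entity;

    architecture rtl of tlb_tag_ram is
        constant TAG_BITS : natural := 45;
        constant TAG_PAD  : natural := 48;   -- padded to a multiple of 8
        type ram_t is array (0 to 31) of
            std_ulogic_vector(4 * TAG_PAD - 1 downto 0);
        signal ram : ram_t := (others => (others => '0'));
    begin
        process(clk)
            variable padded : std_ulogic_vector(TAG_PAD - 1 downto 0);
        begin
            if rising_edge(clk) then
                if wr_en = '1' then
                    padded := (others => '0');
                    padded(TAG_BITS - 1 downto 0) := wr_tag;
                    -- write only the selected way's bytes; synthesis can
                    -- map this onto the RAM's byte write-enables
                    for w in 0 to 3 loop
                        if w = way then
                            for b in 0 to TAG_PAD / 8 - 1 loop
                                ram(index)(w * TAG_PAD + b * 8 + 7
                                           downto w * TAG_PAD + b * 8)
                                    <= padded(b * 8 + 7 downto b * 8);
                            end loop;
                        end if;
                    end loop;
                end if;
                rd_set <= ram(index);
            end if;
        end process;
    end architecture;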
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This rearranges the multiplexing of cache read data with forwarded
store data, with the aim of shortening the path from the req_hit_ways
signal to the r1.data_out register. The forwarding decisions are now
made for each way independently, and the results are then combined
according to which way detected a cache hit.
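Something along these lines (signal names and the way count are
illustrative, not the actual dcache code):

    library ieee;
    use ieee.std_logic_1164.all;

    entity rdata_fwd is
        port (
            cache_rdata : in  std_ulogic_vector(4 * 64 - 1 downto 0);
            fwd_data    : in  std_ulogic_vector(63 downto 0);
            fwd_way     : in  std_ulogic_vector(3 downto 0); -- forward for this way?
            hit_ways    : in  std_ulogic_vector(3 downto 0); -- at most one-hot
            data_out    : out std_ulogic_vector(63 downto 0)
        );
    end entity;

    architecture rtl of rdata_fwd is
    begin
        process(all)
            variable way_data : std_ulogic_vector(63 downto 0);
            variable acc      : std_ulogic_vector(63 downto 0);
        begin
            acc := (others => '0');
            for i in 0 to 3 loop
                -- forwarding decided per way, without waiting for the
                -- late-arriving hit information
                if fwd_way(i) = '1' then
                    way_data := fwd_data;
                else
                    way_data := cache_rdata(i * 64 + 63 downto i * 64);
                end if;
                -- AND-OR combine: only the hitting way contributes
                if hit_ways(i) = '1' then
                    acc := acc or way_data;
                end if;
            end loop;
            data_out <= acc;
        end process;
    end architecture;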
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With a slight rearrangement of the state machine in the dcache_slow
process, we can remove one of the two comparators that detect writes
by other entities to the reservation granule. The state machine now
sets the wishbone cyc signal on the transition from IDLE to DO_STCX
state. Once we see the wishbone stall signal at 0, we consider that we
own the wishbone and can assert stb to do the write, provided that
the stcx is to the reservation address and we haven't seen another
write to the reservation granule. We keep the comparator that
compares the snoop address delayed by one cycle, in order to make
timing easier, and the one (or more) cycle delay between cyc and stb
covers that one cycle delay in the kill_rsrv signal.
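A rough sketch of the handshake (state and signal names are
illustrative, and ack/completion handling is elided):

    library ieee;
    use ieee.std_logic_1164.all;

    entity stcx_wb is
        port (
            clk        : in  std_ulogic;
            stcx_start : in  std_ulogic; -- stcx request leaving IDLE
            addr_match : in  std_ulogic; -- stcx address == reservation address
            kill_rsrv  : in  std_ulogic; -- snooped granule write, 1 cycle late
            wb_stall   : in  std_ulogic;
            wb_cyc     : out std_ulogic;
            wb_stb     : out std_ulogic
        );
    end entity;

    architecture rtl of stcx_wb is
        type state_t is (IDLE, DO_STCX);
        signal state : state_t := IDLE;
        signal cyc   : std_ulogic := '0';
    begin
        wb_cyc <= cyc;
        -- stb can only be asserted once the bus has unstalled, so the
        -- >= 1 cycle between cyc and stb covers the delayed kill_rsrv
        wb_stb <= '1' when state = DO_STCX and cyc = '1' and wb_stall = '0'
                           and addr_match = '1' and kill_rsrv = '0'
                  else '0';

        process(clk)
        begin
            if rising_edge(clk) then
                case state is
                    when IDLE =>
                        if stcx_start = '1' then
                            cyc <= '1';   -- claim the bus on the transition
                            state <= DO_STCX;
                        end if;
                    when DO_STCX =>
                        if wb_stall = '0' then
                            -- write issued, or stcx failed; either way
                            -- drop the bus (ack handling elided)
                            cyc <= '0';
                            state <= IDLE;
                        end if;
                end case;
            end if;
        end process;
    end architecture;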
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The reset was originally added to reduce metavalue warnings in
simulation. It is not necessary for correct operation, and it showed up
as a critical path in synthesis for the Xilinx Artix-7. Remove it when
doing synthesis; for simulation we set the value read to X rather than
0 in order to catch any use of the previously reset value.
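Roughly (names illustrative):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity ram_no_reset is
        port (
            clk      : in  std_ulogic;
            rd_addr  : in  std_ulogic_vector(3 downto 0);
            rd_valid : in  std_ulogic; -- is the value being read meaningful?
            rd_data  : out std_ulogic_vector(63 downto 0)
        );
    end entity;

    architecture rtl of ram_no_reset is
        type ram_t is array (0 to 15) of std_ulogic_vector(63 downto 0);
        signal ram : ram_t;
    begin
        process(clk)
        begin
            if rising_edge(clk) then
                -- no reset on the read path for synthesis
                rd_data <= ram(to_integer(unsigned(rd_addr)));
                -- synthesis translate_off
                if rd_valid = '0' then
                    -- simulation only: any use of the stale value shows
                    -- up as X propagation instead of silently reading 0
                    rd_data <= (others => 'X');
                end if;
                -- synthesis translate_on
            end if;
        end process;
    end architecture;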
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This gets rid of some largish comparators in the dcache_request
process by matching the index and way that hit in the cache tags instead
of comparing tag values. That is, some tag comparisons can be
replaced by seeing if both tags hit in the same cache way.
When reloading a cache line, we now set it valid at the beginning of
the reload, so that tag matching produces hits we can compare. While
the reload is still
occurring, accesses to doublewords that haven't yet been read are
indicated with req_is_hit = 0 and req_hit_reload = 1 (i.e. are
considered to be a miss, at least for now).
For the comparison of whether a subsequent access is to the same page
as stores already being performed, in virtual mode (TLB being used) we
now compare the way and index of the hit in the TLB, and in real mode
we compare the effective address. If any new entry has been loaded
into the TLB since the access we're comparing against, then it is
considered to be a different page.
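The test reduces to something like this (types and widths are
illustrative, not the actual dcache code):

    library ieee;
    use ieee.std_logic_1164.all;

    package same_page_pkg is
        function same_page(virt_mode        : std_ulogic;
                           way_a, way_b     : std_ulogic_vector(1 downto 0);
                           idx_a, idx_b     : std_ulogic_vector(5 downto 0);
                           tlb_loaded_since : std_ulogic;
                           ea_a, ea_b       : std_ulogic_vector(63 downto 12))
            return boolean;
    end package;

    package body same_page_pkg is
        function same_page(virt_mode        : std_ulogic;
                           way_a, way_b     : std_ulogic_vector(1 downto 0);
                           idx_a, idx_b     : std_ulogic_vector(5 downto 0);
                           tlb_loaded_since : std_ulogic;
                           ea_a, ea_b       : std_ulogic_vector(63 downto 12))
            return boolean is
        begin
            if virt_mode = '1' then
                -- hitting the same TLB way and index implies the same
                -- page, unless the TLB has been written in between
                return way_a = way_b and idx_a = idx_b
                       and tlb_loaded_since = '0';
            else
                -- real mode: compare effective page numbers directly
                return ea_a = ea_b;
            end if;
        end function;
    end package body;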
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
A dcbz operation to memory that is mapped as non-cacheable in the page
tables doesn't cause an alignment interrupt, but neither was it
implemented properly in the dcache: it did perform the 8 writes to
memory, but it also incorrectly created a zero-filled line in the cache.
This fixes it so that dcbz to memory mapped non-cacheable doesn't
write the cache tag or set any line valid. We now have r1.reloading,
which is 1 in RELOAD_WAIT_ACK state only if the memory is cacheable and
the cache should therefore be updated (i.e. it is zero in
RELOAD_WAIT_ACK state if we are doing a non-cacheable dcbz).
We can now also remove the code in loadstore1 that checks for
non-cacheable dcbz, which only triggered when doing dcbz in real mode
to an address in the Cxxxxxxx range.
Also remove some unused variables and signals.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Rather than combining the results of the per-way comparators into
an encoded 'hit_way' variable, use the individual results directly
using AND-OR type networks where possible, in order to reduce
utilization and improve timing.
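For example, the cache-hit data selection becomes an AND-OR network
driven directly by the per-way comparator outputs (names illustrative):

    library ieee;
    use ieee.std_logic_1164.all;

    entity hit_mux is
        port (
            way_data : in  std_ulogic_vector(4 * 64 - 1 downto 0);
            hit_ways : in  std_ulogic_vector(3 downto 0); -- raw comparator outputs
            any_hit  : out std_ulogic;
            rd_data  : out std_ulogic_vector(63 downto 0)
        );
    end entity;

    architecture rtl of hit_mux is
    begin
        -- no encode to a 2-bit hit_way and decode back again
        any_hit <= or hit_ways;     -- VHDL-2008 unary reduction

        process(all)
            variable acc : std_ulogic_vector(63 downto 0);
        begin
            acc := (others => '0');
            for i in 0 to 3 loop
                -- AND each way's data with its hit bit and OR the
                -- results; hit_ways is at most one-hot, so this is a mux
                if hit_ways(i) = '1' then
                    acc := acc or way_data(i * 64 + 63 downto i * 64);
                end if;
            end loop;
            rd_data <= acc;
        end process;
    end architecture;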
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of a single global timebase register in the SoC, we now have a
timebase counter in each core. These counters are reset only by the SoC
reset, not the core reset, so they stay in sync even when some cores
are disabled (via the syscon cpu_ctrl register).
This implements mtspr to the TBLW and TBUW SPRs, which write the lower
and upper 32 bits of this core's timebase, respectively.
To fulfil the ISA's requirements that the platform provide (a) some
method for getting the timebases into sync and (b) some method for
preventing userspace from reading the timebase, this adds a syscon
register TB_CTRL with two read/write bits implemented;
bit 0 freezes all the timebases in the system when set, and bit 1
makes reading the timebase privileged (in all cores).
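The per-core counter then looks roughly like this (port names are
illustrative; the privileged-read bit only affects mfspr decode, which
is not shown):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity timebase is
        port (
            clk       : in  std_ulogic;
            soc_reset : in  std_ulogic; -- SoC reset, not the core reset
            tb_freeze : in  std_ulogic; -- TB_CTRL bit 0, common to all cores
            wr_lower  : in  std_ulogic; -- mtspr TBLW
            wr_upper  : in  std_ulogic; -- mtspr TBUW
            wr_data   : in  std_ulogic_vector(31 downto 0);
            tb        : out std_ulogic_vector(63 downto 0)
        );
    end entity;

    architecture rtl of timebase is
        signal count : unsigned(63 downto 0) := (others => '0');
    begin
        tb <= std_ulogic_vector(count);

        process(clk)
        begin
            if rising_edge(clk) then
                if soc_reset = '1' then
                    count <= (others => '0');
                elsif wr_lower = '1' then
                    count(31 downto 0) <= unsigned(wr_data);
                elsif wr_upper = '1' then
                    count(63 downto 32) <= unsigned(wr_data);
                elsif tb_freeze = '0' then
                    count <= count + 1;
                end if;
            end if;
        end process;
    end architecture;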
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
SPR numbers 808-811 do nothing when read or written; that is, mfspr
doesn't modify the destination register. This is accomplished in the
same way that privileged mfspr to an unimplemented SPR is made a
no-op, by supplying the old contents of the destination register as an
input and writing that same value back.
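In essence (names illustrative):

    library ieee;
    use ieee.std_logic_1164.all;

    entity mfspr_result is
        port (
            spr_num   : in  natural range 0 to 1023;
            spr_value : in  std_ulogic_vector(63 downto 0); -- real SPR data
            old_rt    : in  std_ulogic_vector(63 downto 0); -- current RT contents
            result    : out std_ulogic_vector(63 downto 0)
        );
    end entity;

    architecture rtl of mfspr_result is
    begin
        process(all)
        begin
            case spr_num is
                when 808 to 811 =>
                    -- no-op SPRs: "read" the destination register's old
                    -- contents, so writing the result back changes nothing
                    result <= old_rt;
                when others =>
                    result <= spr_value;
            end case;
        end process;
    end architecture;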
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Of the defined aspect bits (which are all read-write), only the NPHIE
and PHIE bits have any function at all, since Microwatt is an in-order
single-issue machine and never does any branch speculation. Also,
since there is no privileged non-hypervisor mode, the high 32 bits of
DEXCR do nothing.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves HASHKEYR and HASHPKEYR to the SPR RAM that also stores
things such as SRR0/1, LR and CTR. For hashst[p] and hashchk[p]
instructions, execute1 reads the relevant key register from the RAM
and sends it to loadstore1. This saves several LUTs.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of a single input_reg_b_t field in the decode table, which
selected both whether input B is a register or a constant and which
constant (immediate value) to use, we now have one field which selects
whether input B is immediate (constant), a GPR, or an FPR, and a
separate field to select which sort of immediate value to use. This
results in simpler logic and better timing.
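The split looks something like this (type, value, and field names are
illustrative, not the exact ones in the real decode table):

    package decode_b_pkg is
        -- where input B comes from
        type input_reg_b_t is (IMM, GPR, FPR);
        -- which immediate format to use when input B is IMM
        type immediate_t is (IMM_NONE, IMM_SI, IMM_SI_HI, IMM_UI,
                             IMM_UI_HI, IMM_D, IMM_DS);

        type decode_rom_t is record
            input_reg_b : input_reg_b_t;
            imm_sel     : immediate_t;
            -- ... other decode fields elided
        end record;
    end package;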
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
These provide facilities similar to hashst, hashchk and HASHKEYR, but
restricted to privileged mode.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Previously the computation of whether an instruction is privileged or
not was done based on the insn_type. However, that meant that l*cix
(OP_LOAD) and st*cix (OP_STORE) couldn't be made privileged, and
neither could tlbsync (OP_NOP).
Instead, this adds a field to the main instruction decode table to
indicate privileged instructions, and makes the cache-inhibited loads
and stores privileged, along with tlbsync.
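Roughly (field and function names illustrative):

    library ieee;
    use ieee.std_logic_1164.all;

    package priv_decode_pkg is
        type decode_rom_t is record
            privileged : std_ulogic;   -- new decode-table field
            -- ... other decode fields elided
        end record;

        -- true when a problem-state thread attempts a privileged
        -- instruction, regardless of its insn_type
        function priv_fault(d : decode_rom_t; msr_pr : std_ulogic)
            return boolean;
    end package;

    package body priv_decode_pkg is
        function priv_fault(d : decode_rom_t; msr_pr : std_ulogic)
            return boolean is
        begin
            return d.privileged = '1' and msr_pr = '1';
        end function;
    end package body;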
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
These are done in loadstore1. The HashDigest function is computed in
9 cycles; for 8 cycles, a state machine does 4 steps of key expansion
per cycle, and for each of 4 lanes of data, does 4 steps of ciphering;
then there is 1 cycle to combine the results into the final hash
value.
At present, hashchk does not overlap the computation of the hash with
fetching of data from memory (in the case of a cache miss).
The 'is_signed' field in the instruction decode table is used to
distinguish hashst and hashchk from ordinary loads and stores. We
have a new 'RBC' value for input_reg_c_t which says that we are
reading RB but we want the value to come in via the C port; this is
because we want the 5-bit immediate offset on the B port.
Note that in the list of insn_code values, hashst/chk have been put in
the section for instructions with an RB operand, which is not strictly
correct given that the B port is used for the immediate D operand;
however, adding them to the section for instructions without an RB
operand would have made that section exceed 128 entries, causing
changes to the padding needed. The only downside to having hashst/chk
where they are is that the debug logic can't use the RB port to read
GPR/FPRs when a hashst/chk instruction is being decoded.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of doing the address subtractions and subsequent logic for
DAWR hit detection in the second cycle of a load or store, this does
the subtractions in the first cycle and the remaining logic in the
second cycle. This improves timing.
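Schematically (names and widths illustrative; DAWRX masking and the
other match qualifiers are elided):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity dawr_match is
        port (
            clk        : in  std_ulogic;
            ea         : in  std_ulogic_vector(63 downto 3); -- doubleword address
            dawr_start : in  std_ulogic_vector(63 downto 3);
            dawr_end   : in  std_ulogic_vector(63 downto 3);
            enable     : in  std_ulogic;
            hit        : out std_ulogic
        );
    end entity;

    architecture rtl of dawr_match is
        signal ge_start, le_end : std_ulogic;  -- cycle-1 results
    begin
        process(clk)
        begin
            if rising_edge(clk) then
                -- cycle 1: only the wide subtractions/comparisons
                ge_start <= '1' when unsigned(ea) >= unsigned(dawr_start)
                            else '0';
                le_end   <= '1' when unsigned(ea) <= unsigned(dawr_end)
                            else '0';
                -- cycle 2: the remaining cheap combining logic
                hit <= ge_start and le_end and enable;
            end if;
        end process;
    end architecture;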
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
For the sake of overall timing in larger SoCs, remove the early_sel
optimization when there are more than 4 masters.
Also make the ack and stall signals to a particular master depend on
that master's cyc, not on the busy signal, which can depend on any
master's cyc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the server field in the XISRs (external interrupt
source registers), allowing each interrupt source to be directed to a
particular CPU. If the CPU number that is written is out of range,
CPU 0 is used.
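i.e. something like (names illustrative):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity irq_server is
        generic ( NCPUS : natural := 2 );
        port (
            wr_server : in  std_ulogic_vector(7 downto 0); -- server field written
            server    : out natural range 0 to 255         -- CPU to deliver to
        );
    end entity;

    architecture rtl of irq_server is
    begin
        process(all)
            variable s : natural range 0 to 255;
        begin
            s := to_integer(unsigned(wr_server));
            if s >= NCPUS then
                s := 0;    -- out-of-range CPU numbers fall back to CPU 0
            end if;
            server <= s;
        end process;
    end architecture;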
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds an 'NCPUS' generic parameter to the soc module, which then
includes that many CPU cores.
The cores have separate addresses on the DMI interconnect, meaning
that external JTAG debug tools can view and control the state of each
core individually.
The syscon module has a new 'cpu_ctrl' register, where byte 0 contains
individual enable bits for each core, and byte 1 indicates the number
of cores. If a core's enable bit is clear, the core is held in reset.
On system reset, the enable byte is set to 0x01, so only core 0 is
active.
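The control register and the per-core resets reduce to roughly this
(names illustrative):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity cpu_ctrl is
        generic ( NCPUS : natural range 1 to 8 := 2 );
        port (
            clk      : in  std_ulogic;
            soc_rst  : in  std_ulogic;
            wr_en    : in  std_ulogic;
            wr_data  : in  std_ulogic_vector(7 downto 0);
            rd_data  : out std_ulogic_vector(15 downto 0);
            core_rst : out std_ulogic_vector(NCPUS - 1 downto 0)
        );
    end entity;

    architecture rtl of cpu_ctrl is
        signal enables : std_ulogic_vector(7 downto 0) := x"01";
    begin
        -- byte 0: per-core enable bits; byte 1: number of cores
        rd_data <= std_ulogic_vector(to_unsigned(NCPUS, 8)) & enables;

        -- a core whose enable bit is clear is held in reset
        rst_gen : for i in 0 to NCPUS - 1 generate
            core_rst(i) <= soc_rst or not enables(i);
        end generate;

        process(clk)
        begin
            if rising_edge(clk) then
                if soc_rst = '1' then
                    enables <= x"01";  -- only core 0 enabled after reset
                elsif wr_en = '1' then
                    enables <= wr_data;
                end if;
            end if;
        end process;
    end architecture;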
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This does bperm in the bitsort unit instead of the logical unit, and
no longer tries to do it in a single cycle with eight 64-to-1
multiplexers. Instead it is now a state machine that takes 8 cycles and
uses only one 64-to-1 multiplexer. This helps
improve timing and reduces LUT usage.
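The sequential version is roughly as follows (names illustrative; the
ISA's big-endian byte/bit numbering is glossed over here):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity bperm_seq is
        port (
            clk    : in  std_ulogic;
            start  : in  std_ulogic;
            rs     : in  std_ulogic_vector(63 downto 0); -- index bytes
            rb     : in  std_ulogic_vector(63 downto 0); -- source bits
            done   : out std_ulogic;
            result : out std_ulogic_vector(7 downto 0)
        );
    end entity;

    architecture rtl of bperm_seq is
        signal count     : unsigned(2 downto 0) := (others => '0');
        signal busy      : std_ulogic := '0';
        signal idx_bytes : std_ulogic_vector(63 downto 0);
    begin
        process(clk)
            variable idx : unsigned(7 downto 0);
        begin
            if rising_edge(clk) then
                done <= '0';
                if start = '1' then
                    busy <= '1';
                    count <= "000";
                    idx_bytes <= rs;
                elsif busy = '1' then
                    -- one index byte per cycle through a single
                    -- 64-to-1 multiplexer
                    idx := unsigned(idx_bytes(7 downto 0));
                    if idx < 64 then
                        result(to_integer(count)) <=
                            rb(to_integer(idx(5 downto 0)));
                    else
                        result(to_integer(count)) <= '0';
                    end if;
                    idx_bytes <= x"00" & idx_bytes(63 downto 8);
                    count <= count + 1;
                    if count = 7 then
                        busy <= '0';
                        done <= '1';
                    end if;
                end if;
            end if;
        end process;
    end architecture;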
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of creating a 2-bit encoded bypass selector, we now have a
4-bit encoding where bits 1 to 3 enable separate bypass sources, and
bit 0 indicates if any bypass should be used. This results in
slightly simpler logic and better timing.
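The selection then looks like this (source names illustrative):

    library ieee;
    use ieee.std_logic_1164.all;

    entity bypass_mux is
        port (
            reg_data : in  std_ulogic_vector(63 downto 0);
            src1     : in  std_ulogic_vector(63 downto 0);
            src2     : in  std_ulogic_vector(63 downto 0);
            src3     : in  std_ulogic_vector(63 downto 0);
            -- bit 0: use some bypass; bits 1-3: one-hot source select
            sel      : in  std_ulogic_vector(3 downto 0);
            data_out : out std_ulogic_vector(63 downto 0)
        );
    end entity;

    architecture rtl of bypass_mux is
    begin
        process(all)
            variable b : std_ulogic_vector(63 downto 0);
        begin
            -- AND-OR select among the bypass sources via the one-hot bits
            b := (others => '0');
            if sel(1) = '1' then b := b or src1; end if;
            if sel(2) = '1' then b := b or src2; end if;
            if sel(3) = '1' then b := b or src3; end if;
            -- bit 0 makes the final choice a cheap 2-to-1 mux
            if sel(0) = '1' then
                data_out <= b;
            else
                data_out <= reg_data;
            end if;
        end process;
    end architecture;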
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The tags for the bypass data paths back to decode2 don't really need
to depend on the stall/busy inputs or on whether an exception might be
generated, since the bypass values won't be used until the instruction
gets executed. Therefore, this simplifies the expressions for
bypass_data.tag.valid and bypass_cr_data.tag.valid.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the DAWR0, DAWRX0, DAWR1, and DAWRX1 registers, which
provide the ability to set watchpoints on two ranges of data addresses
and take an interrupt when an access is made to either range.
The address comparisons are done in loadstore1 in the second cycle
(doing it in the first cycle turned out to have poor timing). If a
match is detected, a signal is sent to the dcache which causes the
access to fail and generate an error signal back to loadstore1, in
much the same way that a protection violation would, whereupon a data
storage interrupt is generated.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This arranges for SIAR and SDAR to be set when a trace interrupt
is triggered by a non-zero setting of the MSR[TE] field. According to
the ISA, SIAR should be set to the address of the instruction and SDAR
should be set to the effective address of its storage operand if any.
This also fixes setting of SDAR by the PMU when an alert occurs;
previously it was always just set to zero.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The tests that intentionally generate alignment interrupts now also
check that SRR0 is pointing to a l*arx or st*cx instruction.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When an alignment interrupt was being generated, loadstore1 was
setting the l_out.valid signal in one cycle and l_out.interrupt in the
next, for the same instruction. This meant that the offending
instruction completed and the interrupt was applied to the next
instruction, so SRR0 ended up pointing to the following instruction.
To fix this, when an access causing an alignment
interrupt is going into r2, we set r2.busy for one cycle and set
r2.one_cycle to 0 so that the complete signal doesn't get asserted.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
CIABR (Completed Instruction Address Breakpoint Register) is an SPR
that contains an instruction address. When the instruction at that
address completes, the CPU takes a Trace interrupt before executing
the next instruction (provided the instruction doesn't cause some
other interrupt and isn't an rfid, hrfid or rfscv instruction).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reports the CPU core number, currently always 0, but this will be
useful in future for distinguishing which CPU is which in a
multiprocessor system.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Previously we only put slow requests in r1.req, but that caused timing
problems because it meant the clock enable for all the registers in
r1.req depended on whether we have a TLB and cache hit or not. Now we
put any valid request (i.e. with req_go = 1) into r1.req, which has
better timing because req_go is a relatively simple function of
registered values (r0_full, r0_valid, r0.tlbie, r0.tlbld, r1.full,
r1.ls_error, d_in.hold). We still have to work out if we have a slow
request, but that is only needed for the D input of one register
(r1.full).
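In outline (the record contents are a stand-in, and the clearing of
r1.full on completion is elided):

    library ieee;
    use ieee.std_logic_1164.all;

    entity req_reg is
        port (
            clk         : in  std_ulogic;
            req_go      : in  std_ulogic; -- simple fn of registered values
            req_is_slow : in  std_ulogic; -- deeper logic, feeds one register
            req_in      : in  std_ulogic_vector(127 downto 0); -- r1.req stand-in
            req_out     : out std_ulogic_vector(127 downto 0);
            full        : out std_ulogic
        );
    end entity;

    architecture rtl of req_reg is
    begin
        process(clk)
        begin
            if rising_edge(clk) then
                if req_go = '1' then
                    -- the clock enable for the whole request record is
                    -- just req_go, independent of the TLB/cache hit
                    req_out <= req_in;
                    -- only this one register's D input sees the
                    -- slow-request computation
                    full <= req_is_slow;
                end if;
            end if;
        end process;
    end architecture;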
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements all the sync variants (sync, lwsync, ptesync, etc.) as
an LSU op that gets sent down to the dcache and completes once the
dcache state machine is idle.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>