1416 Commits (850b87c83fe5aa3345f5fde18a17cd8a813af86c)
 

Author SHA1 Message Date
Paul Mackerras 850b87c83f FPU: Get rid of r.madd_cmp and r.exp_cmp
This saves a few LUTs and simplifies the code a little.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras ba2add029a FPU: Remove need to set opsel_a one cycle ahead
Most states set opsel_a directly to select the operand for the A input
of the main adder.  The exception is the EXC_RESULT state, which uses
r.opsel_a set by the previous cycle to indicate which input operand to
use as the result.

In order to make timing, ensure that the controls that select the
inputs to the main adder (opsel_*, etc.) don't depend on any
complicated functions of the data (such as px_nz, pcmpb_eq, pcmpb_lt,
etc.), but are as far as possible constant for each state.  There is
now a control called set_r for whether the result is written to r.r,
which enables us to avoid setting opsel_b or opsel_r conditionally in
some cases.

Also, to avoid a data-dependent setting of msel_2 in IDIV_DODIV state,
the IDIV_NR1 and IDIV_NR2 states have been reworked so that completion
of the required number of iterations is checked in IDIV_NR1 state, and
at that point, if the inverse estimate is < 0.5, we go to IDIV_USE0_5
state in order to use 0.5 as the estimate.  This means that in the
normal case, the inverse estimate is already in Y when we get to
IDIV_DODIV state.  IDIV_USE0_5 has been reworked to put R (which will
contain 0.5) into Y as the inverse estimate.  That means that
IDIV_DODIV state doesn't have any data-dependent logic to put either P
or R into Y.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 2731384a4b FPU: Reduce misc_sel to 3 bits
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras cf866ce910 FPU: Simplify logic for setting r.x
Since r.x is mostly set from the value in r.r and only once from
anything else (r.b.mantissa), move the check to before the input
multiplexer for the main adder, so it works on r.r rather than
whatever is selected by r.opsel_a.

For the case in DO_FRSP where we have B selected by r.opsel_a, we add
a new state so that we now get B into R and then check the low bits of
R.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 4e5f856c55 FPU: Factor out some of the common elements of the DO_* states
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 2422585e14 FPU: Reduce use of r.insn inside the state machine
Instead use things derived from the instruction in the first cycle,
such as r.is_multiply, r.is_addition, etc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 7812a55b6c FPU: Reorganize NaN and infinity handling and improve arch compliance
The architecture specifies that an invalid operation exception for
signalling NaN (VXSNAN) can occur in the same instructions as an
invalid operation exception for infinity times zero (VXIMZ) in the
case of a multiply-add instruction where B is a signalling NaN, and
one of A and C is infinity and the other is zero.  This moves the
invalid operation tests around so as to handle this case correctly.
It also restructures the infinity and NaN cases to simplify the logic
a little.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 9ac71cfbf2 tests/fpu: Add more floating multiply-add tests
Add more tests to check that the result sign computations are correct.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras a3613d863b FPU: Simplify sign calculation in FP multiply-add instructions
By starting out with result_sign = +/- sign of B, we avoid the need to
flip the result sign in a few places.

This also simplifies DO_FMADD state a bit by having DO_ZERO_DEN go to
DO_FMUL state for floating multiply-add where B is zero.  (The
RENORM_A2 and RENORM_C2 states already do this.)

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 707dd619a0 FPU: Move NaN/infinity and zero/denorm handling out to separate states
This should simplify the DO_* states and hopefully be simpler overall.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 27b3e42353 FPU: Move result_sign computations from state machine to a data path
Instead of operating on result_sign directly, the state machine now
sets a control variable "rsgn_op" that then directs a tiny ALU to do
what's required.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 71b7df679b FPU: Calculate quieten_nan in first cycle
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 955fa561fb FPU: Move most result_sign computation out of state machine
This moves the computation of r.result_sign out of the various
states for most instructions.  Now the sign is mostly computed in the
first cycle (when e_in.valid is true).

The set of operations done on r.result_sign in the state machine is
now restricted to five (other than no change): invert, xor with
r.is_subtract, or set to the sign of A, B or C.

Similarly r.is_subtract and r.negate are computed in the first cycle
now.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
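
The restricted set of r.result_sign operations described in 955fa561fb above can be sketched as a small behavioural model. This is only an illustration in Python, not the VHDL from the repository: the SignOp names and their numeric values are hypothetical, since the log does not give the actual encoding, and signs are modelled as 0/1 integers.

```python
from enum import Enum


class SignOp(Enum):
    """Hypothetical names for the sign operations; the real encoding is not in the log."""
    NOP = 0           # no change
    INVERT = 1        # flip the result sign
    XOR_SUBTRACT = 2  # xor with is_subtract
    FROM_A = 3        # take the sign of operand A
    FROM_B = 4        # take the sign of operand B
    FROM_C = 5        # take the sign of operand C


def next_result_sign(op, result_sign, is_subtract, sign_a, sign_b, sign_c):
    """Return the next value of result_sign (0 or 1) for the given operation."""
    if op is SignOp.NOP:
        return result_sign
    if op is SignOp.INVERT:
        return result_sign ^ 1
    if op is SignOp.XOR_SUBTRACT:
        return result_sign ^ is_subtract
    if op is SignOp.FROM_A:
        return sign_a
    if op is SignOp.FROM_B:
        return sign_b
    if op is SignOp.FROM_C:
        return sign_c
    raise ValueError(f"unknown sign operation: {op!r}")
```

Keeping the state machine down to choosing one of a few operation codes, rather than computing the sign itself, is what allows the later commit 27b3e42353 to move the computation into a separate data path.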
Paul Mackerras c5abe3c0a9
Merge pull request #440 from paulusmack/compliance
More compliance improvements - [H]DEXCR, no-op SPRs, writable TB
2 months ago
Paul Mackerras 413907e4bc soc: Move timebase back into the core and enable writing to it
Instead of a single global timebase register in the SoC, we now have
a timebase counter in each core; however, now they are only reset by
the soc reset, not the core reset.  Thus they stay in sync even when
some cores are disabled (via the syscon cpu_ctrl register).

This implements mtspr to the TBLW and TBUW SPRs, which write the lower
and upper 32 bits of this core's timebase, respectively.

In order to fulfil the ISA's requirements that (a) some method for
getting the timebases into sync and (b) some method for preventing
userspace from reading the timebase be provided by the platform, this
adds a syscon register TB_CTRL with two read/write bits implemented;
bit 0 freezes all the timebases in the system when set, and bit 1
makes reading the timebase privileged (in all cores).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 months ago
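
As a rough illustration of the per-core timebase behaviour described in 413907e4bc above, the following Python model treats TBLW and TBUW writes as replacing the lower and upper 32 bits of a 64-bit counter. The class and constant names are hypothetical, "bit 0" of TB_CTRL is assumed to be the least significant bit, and the ordering of a write relative to a counter tick in the real hardware is not modelled.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1

# Assumed LSB-0 bit numbering for the TB_CTRL syscon register.
TB_CTRL_FREEZE = 1 << 0   # bit 0: freeze all timebases in the system
TB_CTRL_PRIV = 1 << 1     # bit 1: reading the timebase is privileged


class Timebase:
    """Behavioural sketch of one core's timebase counter (hypothetical model)."""

    def __init__(self):
        self.value = 0

    def tick(self, tb_ctrl):
        # The counter advances unless the global freeze bit is set.
        if not tb_ctrl & TB_CTRL_FREEZE:
            self.value = (self.value + 1) & MASK64

    def write_tblw(self, data):
        # mtspr TBLW: replace the lower 32 bits, keep the upper 32.
        self.value = (self.value & (MASK64 ^ MASK32)) | (data & MASK32)

    def write_tbuw(self, data):
        # mtspr TBUW: replace the upper 32 bits, keep the lower 32.
        self.value = ((data & MASK32) << 32) | (self.value & MASK32)

    @staticmethod
    def read_allowed(tb_ctrl, privileged):
        # With bit 1 set, only privileged code may read the timebase.
        return privileged or not (tb_ctrl & TB_CTRL_PRIV)
```

Under these assumptions, one way software might bring the timebases into sync is to set the freeze bit, write both halves on each core, and then clear the freeze bit.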
Paul Mackerras f705fc5e19 core: Implement reserved/no-op SPR numbers
SPR numbers 808 - 811 do nothing when read or written, that is, mfspr
doesn't modify the destination register.  This is accomplished in the
same way that privileged mfspr to an unimplemented SPR is made a
no-op, by supplying the old contents of the destination register as an
input and writing that same value back.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 months ago
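
The write-back trick described in f705fc5e19 above can be shown with a small Python model: for the no-op SPR numbers, the old contents of the destination register are supplied as the result, so the register file write leaves it unchanged. The function name, the list-based register file and the dict-based SPR file are assumptions made for illustration only.

```python
NOOP_SPRS = range(808, 812)  # SPR numbers 808 to 811 inclusive


def mfspr(gprs, rt, spr, spr_file):
    """Model of mfspr: gprs is a list of register values, spr_file maps SPR number to value."""
    if spr in NOOP_SPRS:
        # Feed the old destination value through as the result, so the
        # write-back below does not change the register.
        result = gprs[rt]
    else:
        result = spr_file[spr]
    gprs[rt] = result
    return result
```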
Paul Mackerras c49c32b5fe core: Implement DEXCR and HDEXCR registers
Of the defined aspect bits (which are all read-write), only the NPHIE
and PHIE bits have any function at all, since Microwatt is an in-order
single-issue machine and never does any branch speculation.  Also,
since there is no privileged non-hypervisor mode, the high 32 bits of
DEXCR do nothing.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 months ago
Paul Mackerras bae24b12e7
Merge pull request #439 from paulusmack/master
Update LiteX code for ethernet, SD card and DRAM
2 months ago
Paul Mackerras 3e0888ae35 litesdcard: Update generated code
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 months ago
Paul Mackerras 3fb0a9ed26 litedram: Update generated code
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 months ago
Paul Mackerras ab7105f438 liteeth: Update generated code
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 months ago
Paul Mackerras 370dbef593
Merge pull request #438 from paulusmack/master
Improve timing and utilization, remove warnings
2 months ago
Paul Mackerras f0c331b8b8 Arty A7: Reduce warnings from Vivado
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 1395bde3cc core: Store hash key SPRs in the SPR RAM
This moves HASHKEYR and HASHPKEYR to the SPR RAM that also stores
things such as SRR0/1, LR and CTR.  For hashst[p] and hashchk[p]
instructions, execute1 reads the relevant key register from the RAM
and sends it to loadstore1.  This saves several LUTs.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 2c7d1e5d9c decode: Split input B selection into two fields
Instead of a single input_reg_b_t field in the decode table which
selects both whether input B is a register or a constant and which
constant (immediate value) to use, we now have one field which selects
whether input B is immediate (constant), a GPR, or an FPR, and a
separate field to select which sort of immediate value to use.  This
results in simpler logic and better timing.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras e4e1a033bd
Merge pull request #437 from paulusmack/compliance
Implement fixed-point hash instructions
3 months ago
Paul Mackerras 8f537c13bc tests: Add a test for the hash instructions hash{st,chk}[p]
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 3bcc31fdda core: Implement hashstp and hashchkp instructions and HASHPKEYR register
These provide facilities similar to hashst, hashchk and HASHKEYR, but
restricted to privileged mode.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 00a3db8457 decode1: Indicate instruction privilege in main decode table
Previously the computation of whether an instruction is privileged or
not was done based on the insn_type.  However, that meant that l*cix
(OP_LOAD) and st*cix (OP_STORE) couldn't be made privileged, and
neither could tlbsync (OP_NOP).

Instead, this adds a field to the main instruction decode table to
indicate privileged instructions, and makes the cache-inhibited loads
and stores privileged, along with tlbsync.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 0a11e8455f core: Implement hashst and hashchk instructions
These are done in loadstore1.  The HashDigest function is computed in
9 cycles; for 8 cycles, a state machine does 4 steps of key expansion
per cycle, and for each of 4 lanes of data, does 4 steps of ciphering;
then there is 1 cycle to combine the results into the final hash
value.

At present, hashchk does not overlap the computation of the hash with
fetching of data from memory (in the case of a cache miss).

The 'is_signed' field in the instruction decode table is used to
distinguish hashst and hashchk from ordinary loads and stores.  We
have a new 'RBC' value for input_reg_c_t which says that we are
reading RB but we want the value to come in via the C port; this is
because we want the 5-bit immediate offset on the B port.

Note that in the list of insn_code values, hashst/chk have been put in
the section for instructions with an RB operand, which is not strictly
correct given that the B port is used for the immediate D operand;
however, adding them to the section for instructions without an RB
operand would have made that section exceed 128 entries, causing
changes to the padding needed.  The only downside to having hashst/chk
where they are is that the debug logic can't use the RB port to read
GPR/FPRs when a hashst/chk instruction is being decoded.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras e9b57ca5bf
Merge pull request #436 from paulusmack/smp
Implement SMP
3 months ago
Paul Mackerras 0a2d3b6f58 loadstore1: Split DAWR check across a clock edge
Instead of doing the address subtractions and subsequent logic for
DAWR hit detection in the second cycle of a load or store, this does
the subtractions in the first cycle and the remaining logic in the
second cycle.  This improves timing.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras d8423568b6 core: Evaluate rotator control signals in decode2
Hopefully this improves timing a bit.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras d1c7b654bb wishbone_arbiter: Remove early_sel optimization when > 4 masters
For the sake of overall timing in larger SoCs, remove the early_sel
optimization when there are more than 4 masters.

Also make the ack and stall signals to a particular master depend on
that master's cyc, not on the busy signal, which can depend on any
master's cyc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras bf55efec6d Arty A7: Add an option to select the number of CPU cores
Timing is currently not very good with 2 cores on the Arty A7-100.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 9bd6b3d175 xics: Implement destination server field in interrupt source registers
This implements the server field in the XISRs (external interrupt
source registers), allowing each interrupt source to be directed to a
particular CPU.  If the CPU number that is written is out of range,
CPU 0 is used.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 3924ed0f49 xics: Implement a presentation controller per CPU core
This is mainly in order to get IPIs.  All external interrupts still go
to CPU 0 for now.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 49fcbaa5b2 soc: Implement a global timebase across all cores
Now all cores see the same timebase value at any given instant.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras e0c5af9bb1 mw_debug: Add -c flag to select which CPU core to address
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 9a06b0c182 soc: Implement multiple CPU cores
This adds an 'NCPUS' generic parameter to the soc module, which then
includes that many CPU cores.

The cores have separate addresses on the DMI interconnect, meaning
that external JTAG debug tools can view and control the state of each
core individually.

The syscon module has a new 'cpu_ctrl' register, where byte 0 contains
individual enable bits for each core, and byte 1 indicates the number
of cores.  If a core's enable bit is clear, the core is held in reset.
On system reset, the enable byte is set to 0x01, so only core 0 is
active.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
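
The layout of the 'cpu_ctrl' register described in 9a06b0c182 above could be decoded by software along the following lines. This is a sketch under stated assumptions: byte 0 is taken to be the least significant byte, enable bit i is taken to correspond to core i, and the function name is hypothetical.

```python
def decode_cpu_ctrl(value):
    """Split a cpu_ctrl value into (number of cores, list of per-core enable flags)."""
    enables = value & 0xFF        # byte 0: one enable bit per core
    ncpus = (value >> 8) & 0xFF   # byte 1: number of cores present
    enabled = [bool((enables >> i) & 1) for i in range(ncpus)]
    return ncpus, enabled
```

For a two-core system just after reset, the enable byte is 0x01, so under these assumptions decode_cpu_ctrl(0x0201) reports two cores with only core 0 running; core 1 stays held in reset until its enable bit is set.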
Paul Mackerras 0020c13226
Merge pull request #435 from paulusmack/compliance
Improve architecture compliance of debug facilities
3 months ago
Paul Mackerras 23ff954059 core: Change bperm to a simpler and slower implementation
This does bperm in the bitsort unit instead of the logical unit, and
no longer tries to do it in a single cycle with eight 64-to-1
multiplexers.  Instead it is now a state machine in the bitsort unit,
takes 8 cycles, and only has one 64-to-1 multiplexer.  This helps
improve timing and reduces LUT usage.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
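
To illustrate the one-bit-per-cycle approach described in 23ff954059 above, here is a Python reference model of the bit-permute operation, written from my reading of the bpermd definition in the Power ISA (big-endian numbering, where byte 0 and bit 0 are the most significant). It is a behavioural sketch, not the VHDL state machine; the function names are hypothetical, and each call to bperm_step stands in for one cycle using a single 64-to-1 selection.

```python
def bperm_step(rs, rb, i, perm):
    """One 'cycle': select one bit of rb using byte i of rs as the index."""
    index = (rs >> (56 - 8 * i)) & 0xFF          # byte i of RS, byte 0 most significant
    # Index n selects bit n of RB in MSB-0 numbering; out-of-range indices give 0.
    bit = (rb >> (63 - index)) & 1 if index < 64 else 0
    return (perm << 1) | bit                     # perm bit 0 ends up as the MSB of the byte


def bperm(rs, rb):
    """Return the 8-bit permuted result, zero-extended to 64 bits."""
    perm = 0
    for i in range(8):                           # eight steps, one selected bit per step
        perm = bperm_step(rs, rb, i, perm)
    return perm
```

A single-cycle version would need all eight selections at once, which is the eight 64-to-1 multiplexers the commit message mentions.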
Paul Mackerras f6a839a86b control: Use a 1-hot encoding for bypass enables
Instead of creating a 2-bit encoded bypass selector, we now have a
4-bit encoding where bits 1 to 3 enable separate bypass sources, and
bit 0 indicates if any bypass should be used.  This results in
slightly simpler logic and better timing.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
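
The 4-bit bypass encoding described in f6a839a86b above can be sketched as follows. The numbering of bypass sources as 1 to 3, the function names, and the AND-OR style of operand selection are assumptions for illustration; the log only states what each bit means.

```python
def bypass_enables(source):
    """Encode a bypass choice: source is None (no bypass) or 1, 2 or 3."""
    if source is None:
        return 0b0000
    if source not in (1, 2, 3):
        raise ValueError("bypass source must be 1, 2 or 3")
    return (1 << source) | 1      # bit 0 flags that some bypass is in use


def select_operand(enables, regfile_value, bypass_values):
    """Pick the operand: register file value, or the enabled bypass source."""
    if not enables & 1:
        return regfile_value
    result = 0
    for n in (1, 2, 3):
        if (enables >> n) & 1:
            # With one enable bit set, OR-ing the gated values selects that source.
            result |= bypass_values[n]
    return result
```

With separate enable bits, each bypass source can be gated directly instead of first decoding a 2-bit selector, which is one plausible reason for the simpler logic the commit message reports.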
Paul Mackerras 52d8f28d03 execute1: Improve timing for execute bypass tag
The tags for the bypass data paths back to decode2 don't really need
to depend on the stall/busy inputs or on whether an exception might be
generated, since the bypass values won't be used until the instruction
gets executed.  Therefore, this simplifies the expressions for
bypass_data.tag.valid and bypass_cr_data.tag.valid.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 80bc9d5098 tests/trace: Add a few tests of DAWR (data watchpoint) functionality
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 5ddd8884fa core: Implement two data watchpoints
This implements the DAWR0, DAWRX0, DAWR1, and DAWRX1 registers, which
provide the ability to set watchpoints on two ranges of data addresses
and take an interrupt when an access is made to either range.

The address comparisons are done in loadstore1 in the second cycle
(doing it in the first cycle turned out to have poor timing).  If a
match is detected, a signal is sent to the dcache which causes the
access to fail and generate an error signal back to loadstore1, in
much the same way that a protection violation would, whereupon a data
storage interrupt is generated.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 09de0738de tests/trace: Add checks for SIAR and SDAR being set correctly
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras ff00dc1505 PMU: Fix setting of SIAR and SDAR on trace interrupt
This arranges for SIAR and SDAR to be set when a trace interrupt
is triggered by a non-zero setting of the MSR[TE] field.  According to
the ISA, SIAR should be set to the address of the instruction and SDAR
should be set to the effective address of its storage operand if any.
This also fixes setting of SDAR by the PMU when an alert occurs;
previously it was always just set to zero.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 23b183fb16 tests/reservation: Check that SRR0 is set correctly on alignment interrupt
The tests that intentionally generate alignment interrupts now also
check that SRR0 is pointing to an l*arx or st*cx instruction.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago
Paul Mackerras 622f8c81cc loadstore1: Fix setting of SRR0 on alignment interrupt
When an alignment interrupt was being generated, loadstore1 was
setting the l_out.valid signal in one cycle and l_out.interrupt in the
next, for the same instruction.  As a result, the offending
instruction completed and the interrupt was applied to the next
instruction, so SRR0 ended up pointing to the following
instruction.  To fix this, when an access causing an alignment
interrupt is going into r2, we set r2.busy for one cycle and set
r2.one_cycle to 0 so that the complete signal doesn't get asserted.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 months ago