Commit Graph

58 Commits (caravel-20210105)

Author SHA1 Message Date
Paul Mackerras e6a5f237bc FPU: Implement fmul[s]
This implements the fmul and fmuls instructions.

For fmul[s] with denormalized operands we normalize the inputs
before doing the multiplication, to eliminate the need for doing
count-leading-zeroes on P.  This adds 3 or 5 cycles to the
execution time when one or both operands are denormalized.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras b628af6176 FPU: Implement fmr and related instructions
This implements fmr, fneg, fabs, fnabs and fcpsgn and adds tests
for them.

This adds logic to unpack and repack floating-point data from the
64-bit packed form (as stored in memory and the register file) into
the unpacked form in the fpr_reg_type record.  This is not strictly
necessary for fmr et al., but will be useful for when we do actual
arithmetic.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 856e9e955f core: Add framework for an FPU
This adds the skeleton of a floating-point unit and implements the
mffs and mtfsf instructions.

Execute1 sends FP instructions to the FPU and receives busy,
exception, FP interrupt and illegal interrupt signals from it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 45cd8f4fc3 core: Add support for floating-point loads and stores
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.

We now have the FP, FE0 and FE1 bits in MSR.  FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.

The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file.  It
defaults to true for all except the A7-35 boards.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 6a80825e70 decode1: Avoid overriding fields of v.decode in decode1
In the cases where we need to override the values from the decode ROMs,
we now do that overriding after the clock edge (eating into decode2's
cycle) rather than before.  This helps timing a little.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 290b05f97d core: Implement the maddhd, maddhdu and maddld instructions
These instructions use major opcode 4 and have a third GPR input
operand, so we need a decode table for major opcode 4 and some
plumbing to get the RC register operand read.

The multiply-add instructions use the same insn_type_t values as the
regular multiply instructions, and we distinguish in execute1 by
looking at the major opcode.  This turns out to be convenient because
we don't have to add any cases in the code that handles the output of
the multiplier, and it frees up some insn_type_t values.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 893d2bc6a2 core: Don't generate logic for log data when LOG_LENGTH = 0
This adds "if LOG_LENGTH > 0 generate" to the places in the core
where log output data is latched, so that when LOG_LENGTH = 0 we
don't create the logic to collect the data which won't be stored.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 74062195ca execute1: Do forwarding of the CR result to the next instruction
This adds a path to allow the CR result of one instruction to be
forwarded to the next instruction, so that sequences such as
cmp; bc can avoid having a 1-cycle bubble.

Forwarding is not available for dot-form (Rc=1) instructions,
since the CR result for them is calculated in writeback.  The
decode.output_cr field is used to identify those instructions
that compute the CR result in execute1.

For some reason, the multiply instructions incorrectly had
output_cr = 1 in the decode tables.  This fixes that.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 6687aae4d6 core: Implement a simple branch predictor
This implements a simple branch predictor in the decode1 stage.  If it
sees that the instruction is b or bc and the branch is predicted to be
taken, it sends a flush and redirect upstream (to icache and fetch1)
to redirect fetching to the branch target.  The prediction is sent
downstream with the branch instruction, and execute1 now only sends
a flush/redirect upstream if the prediction was wrong.  Unconditional
branches are always predicted to be taken, and conditional branches
are predicted to be taken if and only if the offset is negative.
Branches that take the branch address from a register (bclr, bcctr)
are predicted not taken, as we don't have any way to predict the
branch address.

Since we can now have a mflr being executed immediately after a bl
or bcl, we now track the update to LR in the hazard tracker, using
the second write register field that is used to track RA updates for
update-form loads and stores.

For those branches that update LR but don't write any other result
(i.e. that don't decrementer CTR), we now write back LR in the same
cycle as the instruction rather than taking a second cycle for the
LR writeback.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 65a36cc0fc decode: Work out ispr1/ispr2 in parallel with decode ROM lookup
This makes the logic that calculates which SPRs are being accessed
work in parallel with the instruction decode ROM lookup instead of
being dependent on the opcode found in the decode ROM.  The reason
for doing that is that the path from icache through the decode ROM
to the ispr1/ispr2 fields has become a critical path.

Thus we are now using only a very partial decode of the instruction
word in the logic for isp1/isp2, and we therefore can no longer rely
on them being zero in all cases where no SPR is being accessed.
Instead, decode2 now ignores ispr1/ispr2 in all cases except when the
relevant decode.input_reg_a/b or decode.output_reg_a is set to SPR.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 6701e7346b core: Use a busy signal rather than a stall
This changes the instruction dependency tracking so that we can
generate a "busy" signal from execute1 and loadstore1 which comes
along one cycle later than the current "stall" signal.  This will
enable us to signal busy cycles only when we need to from loadstore1.

The "busy" signal from execute1/loadstore1 indicates "I didn't take
the thing you gave me on this cycle", as distinct from the previous
stall signal which meant "I took that but don't give me anything
next cycle".  That means that decode2 proactively gives execute1
a new instruction as soon as it has taken the previous one (assuming
there is a valid instruction available from decode1), and that then
sits in decode2's output until execute1 can take it.  So instructions
are issued by decode2 somewhat earlier than they used to be.

Decode2 now only signals a stall upstream when its output buffer is
full, meaning that we can fill up bubbles in the upstream pipe while a
long instruction is executing.  This gives a small boost in
performance.

This also adds dependency tracking for rA updates by update-form
load/store instructions.

The GPR and CR hazard detection machinery now has one extra stage,
which may not be strictly necessary.  Some of the code now really
only applies to PIPELINE_DEPTH=1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 49a4d9f67a Add core logging
This logs 256 bits of data per cycle to a ring buffer in BRAM.  The
data collected can be read out through 2 new SPRs or through the
debug interface.

The new SPRs are LOG_ADDR (724) and LOG_DATA (725).  LOG_ADDR contains
the buffer write pointer in the upper 32 bits (in units of entries,
i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of
doublewords, i.e. 8 bytes).  Reading LOG_DATA gives the doubleword
from the buffer at the read pointer and increments the read pointer.
Setting bit 31 of LOG_ADDR inhibits the trace log system from writing
to the log buffer, so the contents are stable and can be read.

There are two new debug addresses which function similarly to the
LOG_ADDR and LOG_DATA SPRs.  The log is frozen while either or both of
the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set.

The buffer defaults to 2048 entries, i.e. 64kB.  The size is set by
the LOG_LENGTH generic on the core_debug module.  Software can
determine the length of the buffer because the length is ORed into the
buffer write pointer in the upper 32 bits of LOG_ADDR.  Hence the
length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)).

There is a program to format the log entries in a somewhat readable
fashion in scripts/fmt_log/fmt_log.c.  The log_entry struct in that
file describes the layout of the bits in the log entries.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras afa82bea9c decode2: Reformat to 4-space indentation
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 4a4a98d4b9
core: Do addpcis using the main adder (#189)
By adding logic to decode2 to be able to send the instruction address
down the A input, and making CONST_DX_HI (renamed to CONST_DXHI4) add
4 to the immediate value (easy since the bottom 16 bits were zero),
we can do addpcis using the main adder.  This reduces the width of the
result mux and frees up one value in insn_type_t, since we can now use
OP_ADD for addpcis.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Shawn Anastasio e606772aeb Implement the addpcis instruction
This commit adds support for the addpcis instruction from ISA 3.0.

A new input_reg_b_t type, CONST_DX_HI, was added to support the
shifted immediate value used in DX-Form instructions.

Signed-off-by: Shawn Anastasio <shawn@anastas.io>
5 years ago
Paul Mackerras dd2e71930c debug: Provide a way to examine GPRs, fast SPRs and MSR
This provides commands on the debug interface to read the value of
the MSR or any of the 64 GSPR register file entries.  The GSPR values
are read using the B port of the register file in a cycle when
decode2 is not using it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 167e37d667 Plumb insn_type through to loadstore1
In preparation for adding a TLB to the dcache, this plumbs the
insn_type from execute1 through to loadstore1, so that we can have
other operations besides loads and stores (e.g. tlbie) going to
loadstore1 and thence to the dcache.  This also plumbs the unit field
of the decode ROM from decode2 through to execute1 to simplify the
logic around which ops need to go to loadstore1.

The load and store data formatting are now not conditional on the
op being OP_LOAD or OP_STORE.  This eliminates the inferred latches
clocked by each of the bits of r.op that we were getting previously.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 5d85ede97d dcache: Implement load-reserve and store-conditional instructions
This involves plumbing the (existing) 'reserve' and 'rc' bits in
the decode tables down to dcache, and 'rc' and 'store_done' bits
from dcache to writeback.

It turns out that we had 'RC' set in the 'rc' column for several
ordinary stores and for the attn instruction.  This corrects them
to 'NONE', and sets the 'rc' column to 'ONE' for the conditional
stores.

In writeback we now have logic to set CR0 when the input from dcache
has rc = 1.

In dcache we have the reservation itself, which has a valid bit
and the address down to cache line granularity.  We don't currently
store the reservation length.  For a store conditional which fails,
we set a 'cancel_store' signal which inhibits the write to the
cache and prevents the state machine from starting a bus cycle or
going to the STORE_WAIT_ACK state.  Instead we set r1.stcx_fail
which causes the instruction to complete in the next cycle with
rc=1 and store_done=0.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Anton Blanchard 098f10136d Fix a Diamond issue in decode2
By using a temporary we avoid a build issue in Diamond.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Paul Mackerras 5422007f83 Plumb loadstore1 input from execute1 not decode2
This allows us to use the bypass at the input of execute1 for the
address and data operands for loadstore1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras b14d982011 execute: Implement bypass from output of execute1 to input
This enables back-to-back execution of integer instructions where
the first instruction writes a GPR and the second reads the same
GPR.  This is done with a set of multiplexers at the start of
execute1 which enable any of the three input operands to be taken
from the output of execute1 (i.e. r.e.write_data) rather than the
input from decode2 (i.e. e_in.read_data[123]).

This also requires changes to the hazard detection and handling.
Decode2 generates a signal indicating that the GPR being written
is available for bypass, which is true for instructions that are
executed in execute1 (rather than loadstore1/dcache).  The
gpr_hazard module stores this "bypassable" bit, and if the same
GPR needs to be read by a subsequent instruction, it outputs a
"use_bypass" signal rather than generating a stall.  The
use_bypass signal is then latched at the output of decode2 and
passed down to execute1 to control the input multiplexer.

At the moment there is no bypass on the inputs to loadstore1, but that
is OK because all load and store instructions are marked as
single-issue.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras c9a2076dd3 execute1: Remember dest GPR, RC, OE, XER for slow operations
For multiply and divide operations, execute1 now records the
destination GPR number, RC and OE from the instruction, and the
XER value.  This means that the multiply and divide units don't
need to record those values and then send them back to execute1.
This makes the interface to those units a bit simpler.  They
simply report an overflow signal along with the result value, and
execute1 takes care of updating XER if necessary.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 39d18d2738 Make divider hang off the side of execute1
With this, the divider is a unit that execute1 sends operands to and
which sends its results back to execute1, which then send them to
writeback.  Execute1 now sends a stall signal when it gets a divide
or modulus instruction until it gets a valid signal back from the
divider.  Divide and modulus instructions are no longer marked as
single-issue.

The data formatting step that used to be done in decode2 for div
and mod instructions is now done in execute1.  We also do the
absolute value operation in that same cycle instead of taking an
extra cycle inside the divider for signed operations with a
negative operand.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 2167186b5f Make multiplier hang off the side of execute1
With this, the multiplier isn't a separate pipe that decode2 issues
instructions to, but rather is a unit that execute1 sends operands
to and which sends the result back to execute1, which then sends it
to writeback.  Execute1 now sends a stall signal when it gets a
multiply instruction until it gets a valid signal back from the
multiplier.

This all means that we no longer need to mark the multiply
instructions as single-issue.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 23ade0b1c3 decode2: Minor cleanup
Remove unused variable is_reg in decode_input_reg_a.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt e4f475e17f sprs: Store common SPRs in register file
This stores the most common SPRs in the register file.

This includes CTR and LR and a not yet final list of others.

The register file is set to 64 entries for now. Specific types
are defined that can represent a GPR index (gpr_index_t) or
a GPR/SPR index (gspr_index_t) along with conversion functions
between the two.

On order to deal with some forms of branch updating both LR and
CTR, we introduced a delayed update of LR after a branch link.

Note: We currently stall the pipeline on such a delayed branch,
but we could avoid stalling fetch in that specific case as we
know we have a branch delay. We could also limit that to the
specific case where we need to update both CTR and LR.

This allows us to make bcreg, mtspr and mfspr pipelined. decode1
will automatically force the single issue flag on mfspr/mtspr to
a "slow" SPR.

[paulus@ozlabs.org - fix direction of decode2.stall_in]

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt 501b6daf9b Add basic XER support
The carry is currently internal to execute1. We don't handle any of
the other XER fields.

This creates type called "xer_common_t" that contains the commonly
used XER bits (CA, CA32, SO, OV, OV32).

The value is stored in the CR file (though it could be a separate
module). The rest of the bits will be implemented as a separate
SPR and the two parts reconciled in mfspr/mtspr in latter commits.

We always read XER in decode2 (there is little point not to)
and send it down all pipeline branches as it will be needed in
writeback for all type of instructions when CR0:SO needs to be
updated (such forms exist for all pipeline branches even if we don't
yet implement them).

To avoid having to track XER hazards, we forward it back in EX1. This
assumes that other pipeline branches that can modify it (mult and div)
are running single issue for now.

One additional hazard to beware of is an XER:SO modifying instruction
in EX1 followed immediately by a store conditional. Due to our writeback
latency, the store will go down the LSU with the previous XER value,
thus the stcx. will set CR0:SO using an obsolete SO value.

I doubt there exist any code relying on this behaviour being correct
but we should account for it regardless, possibly by ensuring that
stcx. remain single issue initially, or later by adding some minimal
tracking or moving the LSU into the same pipeline as execute.

Missing some obscure XER affecting instructions like addex or mcrxrx.

[paulus@ozlabs.org - fix CA32 and OV32 for OP_ADD, fix order of
 arguments to set_ov]

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt 98bd8b73c0 control: Reduce pipeline depth to 1
To match our one stage execute.

This might change back if we end up adding 2 stages to match the
LSU, but in that case we'll want forwards as well.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Paul Mackerras 9646fe28b0 Do sign-extension instructions in writeback instead of execute1
This makes the exts[bhw] instructions do the sign extension in the
writeback stage using the sign-extension logic there instead of
having unique sign extension logic in execute1.  This requires
passing the data length and sign extend flag from decode2 down
through execute1 and execute2 and into writeback.  As a side bonus
we reduce the number of values in insn_type_t by two.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Anton Blanchard 813f834012 Add CR hazard detection
To keep things simple we treat the CR as a single entity.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard bdc26b7527 Add GPR hazard detection
Check GPRs against any writers in the pipeline.

All instructions are still marked single in pipeline at
this stage.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard d5346d0abf Separate issue control into its own unit
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Paul Mackerras 5c0ba90722 decode2: Fix 32-bit flag passed to divider
Previously the 32-bit flag passed to the divider was always wrong;
this fixes it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Anton Blanchard 4016f69e70 Limit outstanding range
outstanding can only ever be -1 to 2 at the moment (0 or 1 on a
rising clock edge). Vivado is synthesizing a much wider adder
which is silly. Constrain it with a range statement. This should
be good for timing and saves us about 85 LUTs.

This will get relaxed when we add more pipelining, but only by a
few bits.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard b8fb721b81 Consolidate logical instructions
Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common
invert on the input and output. This saves us about 200 LUTs.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Paul Mackerras f7c393ba7e Add a rotate/mask/shift unit and use it in execute1
This adds a new entity 'rotator' which contains combinatorial logic
for rotating and masking 64-bit values.  It implements the operations
of the rlwinm, rlwnm, rlwimi, rldicl, rldicr, rldic, rldimi, rldcl,
rldcr, sld, slw, srd, srw, srad, sradi, sraw and srawi instructions.
It consists of a 3-stage 64-bit rotator using 4:1 multiplexors at
each stage, two mask generators, output logic and control logic.

The insn_type_t values used for these instructions have been reduced
to just 5: OP_RLC, OP_RLCL and OP_RLCR for the rotate and mask
instructions (clear both left and right, clear left, clear right
variants), OP_SHL for left shifts, and OP_SHR for right shifts.
The control signals for the rotator are derived from the opcode
and from the is_32bit and is_signed fields of the decode_rom_t.

The rotator is instantiated as an entity in execute1 so that we can
be sure we only have one of it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 90b6e27380 Generalize the mul_32bit and mul_signed fields of decode_rom_t
This changes the names of the mul_32bit and mul_signed fields of
decode_rom_t to is_32bit and is_signed, so they can be used with
other types of operations besides multiplies.

This plumbs the is_32bit and is_signed flags down into execute1,
though they are not used at this point.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 7fe84220a5 decode: Avoid multiplexing from instruction reg fields to regfile address ports
This aims to simplify the logic between the instruction image and
the register file read address ports and reduce the size of the decode
tables.  With this patch, the input_reg_a column of the decode tables
can only select RA or zeroes, the input_reg_b column can only select
RB or a constant (0, -1, or an immediate value from the instruction),
and the input_reg_c columns can only select RS or zeroes.

That means that the rotate/shift/logical ops now have their first
input coming in via the input_reg_c column.  That means we need to
add a read_data3 field to the Decode2ToExecuteType record, but that
will go away again when we split out the rotate/mask/logical ops to
their own unit.

As a related but not tightly connected change, this patch also sets
the read1_enable signal to the register file be 0 when RA=0 and the
input_reg_a for the instruction is RA_OR_ZERO (previously it was 1).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 96b402a4bf Consolidate add/subtract instructions into a single op
All of the PPC add and subtract instructions, including carrying
and extended versions, do much the same arithmetic operation:

	result = (I xor A) + B + C

where A is the value from RA, I provides a logical inversion of A
(i.e. I is 0 or -1), B is either from RB or is a constant 0 or -1,
and C is 0, 1 or the carry bit from XER (CA).

To consolidate all the add/subtract instructions into a single
OP_ADD, we add a column to decode_rom_t to indicate when A should
be inverted, and change the input_carry field to a 3-state selector
to select C in the equation above.

This also adds a new "CONST_M1" value for input_reg_b_t to indicate
that B is a constant -1.  This allows us to implement addme and
subfme.

The addex instruction appears not to exist, so the comments referring
to it are removed.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 58b06eb5f3 decode: Remove const fields from decode_rom_t
The const* fields of decode_rom_t drove multiplexers in decode2 that
picked out various instruction fields and put them into the const*
fields of the Decode2ToExecute1Type record, from where they were
used in execute1.  However, the code in execute1 can just as easily
use the appropriate fields of the original instruction word, since
that is now available in execute1.  This therefore changes the
code to do that, resulting in smaller decode tables.

Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras c9e92483b8 decode: Push mtspr/mfspr register decoding down into execute1
Instead of doing mfctr, mflr, mftb, mtctr, mtlr as separate ops,
just pass down mfspr and mtspr ops with the spr number and let
execute1 decode which SPR we're addressing.  This will help reduce
the number of instruction bits decode1 needs to look at.

In fact we now pass down the whole instruction from decode2 to
execute1.  We will need more bits of the instruction in future,
and the tools should just optimize away any that we don't end
up using.  Since the 'aa' bit was just a copy of an instruction
bit, we can now remove it from the record.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt 3e6f656a90 Add MCRF instruction
Hopefully it's not too timing catastrophic. The variable newcrf will
be handy for the other CR ops when we implement them I suspect.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt 554ae88540 Implement absolute branches
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Paul Mackerras 25b9450475 divider: Do absolute-value ops in divider instead of decode
This moves the negation of negative operands for signed divide and
modulus operations out of the decode2 stage and into the divider.
If either of the operands for a signed divide or modulus operation
is negative, the divider now takes an extra cycle to negate the
operands that are negative.

The interface to the divider now has an 'is_signed' signal rather
than a 'neg_result' signal, and the dividend and divisor can be
negative, so divider_tb had to be updated for the new interface.

The reason for doing this is that one of the worst timing violations
on the Arty A7-100 at 100MHz involved the carry chain in the adders
that did the negation of the dividend and divisor in the decode stage.
Moving the negations to a separate cycle fixes that and also seems to
reduce the total number of slice LUTs used.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Anton Blanchard b57325ce29 Merge branch 'divider' of https://github.com/paulusmack/microwatt 5 years ago
Paul Mackerras d5bc6c8824 Add a divider unit and a testbench for it
This adds a divider unit, connected to the core in much the same way
that the multiplier unit is connected.  The division algorithm is
very simple-minded, taking 64 clock cycles for any division (even
32-bit division instructions).

The decoding is simplified by making use of regularities in the
instruction encoding for div* and mod* instructions.  Instead of
having PPC_* encodings from the first-stage decoder for each of the
different div* and mod* instructions, we now just have PPC_DIV and
PPC_MOD, and the inputs to the divider that indicate what sort of
division operation to do are derived from instruction word bits.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt 98f0994698 Add core debug module
This module adds some simple core controls:

  reset, stop, start, step

along with icache clear and reading the NIA and core
status bits

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org
5 years ago
Anton Blanchard d813ffb748 Fix spurious outstanding assert
Check it in the sequential process, not the combinatorial one.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard 80aa781454
Merge pull request #47 from antonblanchard/if-fix
Explicitly check against '1' in if statements
5 years ago
Anton Blanchard b9e28598b4 Explicitly check against '1' in if statements
nvc doesn't like what I think is a VHDL 2008 construct. Lets just
check against '1' explicitly.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago