This implements an alternative count-leading-zeroes algorithm which
uses less LUTs to generate the higher-order bits (2..5) of the
result.
By doing (v | -v) rather than (v & -v), we get a value which has ones
from the MSB down to the rightmost 1 bit in v and then zeroes down to
the LSB. This means that we can generate the MSB of the result (the
index of the rightmost 1 bit in v) just by looking at bits 63 and 31
of (v | -v), assuming that v is 64 bits. Bit 4 of the result requires
looking at bits 63, 47, 31 and 15. In contrast, each bit of the
result using (v & -v), which has a single 1, requires ORing together
32 bits.
It turns out that the minimum LUT usage comes from using (v & -v) to
generate bits 0 and 1 of the result, and using (v | -v) to generate
bits 2 to 5. This saves almost 60 6-input LUTs on the Artix-7.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This generates a series of io_cycle_* signals which are clean latches
and which become the 'cyc' signals of the wishbone buses going to
various peripherals (syscon, uarts, XICS, GPIO, etc.). Effectively
this is done by moving the address decoding into the slave_io_latch
process. The slave_io_type, which drives the multiplexer which
selects which wishbone to look for a response on, is reduced to just 8
values in the expectation that an 8-way multiplexer will use less
logic than one with more than 8 inputs.
With this timing is considerably better on the A7-100T.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This removes logic that I added some time ago with the thought that it
would enable us to do prefetching in the icache. This logic detects
when the fetch address is an odd multiple of 4 and the next address in
sequence from the previous cycle. In that case the instruction we
want is in the output register of the icache RAM already so there is
no need to do another read or any icache tag or TLB lookup.
However, this logic adds complexity, and removing it improves timing,
so this removes it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the calculation of the result for popcnt* into the
countbits unit, renamed from countzero, so that we can take two cycles
to get the result. The motivation for this is that the popcnt*
calculation was showing up as a critical path.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Orangecrab missed out on:
Make wishbone addresses be in units of doublewords or words
Author: Paul Mackerras <paulus@ozlabs.org>
Date: Wed Sep 15 18:18:09 2021 +1000
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Modifies litescard generate script to take a clock speed.
Regenerated verilog with latest litesdcard
e52c731 ("Bump year.")
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Reduce litedram NUM_LINES 64->8
This allows us to meet timing. Can probably
be improved in future with better BRAM usage.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
top-orangecrab0.2 is a copy of top-arty with various changes.
USRMCLK is added for the SPI clock
ethernet is removed
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Parameters are based on
https://github.com/gregdavill/OrangeCrab-test-sw/blob/main/hw/OrangeCrab-bitstream.py
and litex-boards orangecrab.py
rtt_nom and cmd_delay are overridden for OrangeCrab, we do the same here.
Generated with litedram and litex
62abf9c ("litedram_gen: Add block_until_ready port parameter to control blocking behaviour.")
add2746a ("tools/litex_cli: Rename wb to bus.")
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Recent litedram gets stuck at memtest unless block_until_ready=False.
(discussion in https://github.com/enjoy-digital/litedram/pull/292)
This change regenerates with latest litedram and litex
62abf9c ("litedram_gen: Add block_until_ready port parameter to control blocking behaviour.")
add2746a ("tools/litex_cli: Rename wb to bus.")
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Yosys changed command line behaviour following the v0.12 release. Work
around this by using read_verilog, which maintains the old behaviour.
This should work fine for current yosys and be compatible with
future releases.
See https://github.com/YosysHQ/yosys/issues/3109
Signed-off-by: Joel Stanley <joel@jms.id.au>
At present, code (such as simple_random) which produces serial port
output during the first few milliseconds of operation produces garbled
output. The reason is that the clock has not yet stabilized and is
running slow, resulting in the bit time of the serial characters being
too long.
The ECP5 data sheet says that the phase detector should be operated
between 10 and 400 MHz. The current code operates it at 2MHz.
Consequently, the PLL lock indication doesn't work, i.e. it is always
zero. The current code works around that by inverting it, i.e. taking
the "not locked" indication to mean "locked".
Instead, we now run it at 12MHz, chosen because the common external
clock inputs on ECP5 boards are 12MHz and 48MHz. Normally this would
mean that the available system clock frequencies would be multiples of
12MHz, but this is a little inconvenient as we use 40MHz on the Orange
Crab v0.21 boards. Instead, by using the secondary clock output for
feedback, we can have any divisor of the PLL frequency as the system
clock frequency.
The ECP5 data sheet says the PLL oscillator can run at 400 to 800
MHz. Here we choose 480MHz since that allows us to generate 40MHz and
48MHz easily and is a multiple of 12MHz.
With this, the lock signal works correctly, and the inversion can be
removed.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The existing orange crab target is for an older board with a
LFE5UM5G-85F device. Newer orange crab boards (v0.21) have a
LFE5U-85F device in the -8 speed grade, so make a new target for them
called ORANGE-CRAB-0.21.
Also add flags to ecppack to indicate that the bitstream should be
compressed and can be loaded at 38.8MHz.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
These convert addresses to/from wishbone addresses, and use them
in parts of the caches, in order to make the code a bit more readable.
Along the way, rename some functions in the caches to make it a bit
clearer what they operate on and fix a bug in the icache STOP_RELOAD state where
the wb address wasn't properly converted.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This moves REAL_ADDR_BITS out of the caches and defines a real_addr_t
type for a real address, along with a addr_to_real() conversion helper.
It makes the vhdl a bit more readable
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have a bug where an store near a dcbz can cause the dcbz to only zero
8 bytes. Add a test case for this.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Note: There are a few patches to upstream to fix an upstream breakage
of litedram standalone generator, and fix some issues with liteeth
in the way it's used on Wukong. All these have pending pull requests.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For now only the V2 of the board (slightly different pinout)
and only the A100T variant. I also haven't added GPIOs or anything
else on the PMODs really.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This fixes a bug where a dcbz can get incorrectly handled as an
ordinary 8-byte store if it arrives while the dcache state machine is
handling other stores with the same tag value (i.e. within the same
set-sized area of memory). The logic that says whether to include a
new store in the current wishbone cycle didn't take into account
whether the new store was a dcbz. This adds a "req.dcbz = '0'" factor
so that it does. This is necessary because dcbz is handled more like
a cache line refill (but writing to memory rather than reading) than
an ordinary store.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This fixes two bugs in the flash invalidation of the icache.
The first is that an instruction could get executed twice. The
i-cache RAM is 2 instructions (64 bits) wide, so one read can supply
results for 2 cycles. The fetch1 stage tells icache when the address
is equal to the address of the previous cycle plus 4, and in cases
where that is true, bit 2 of the address is 1, and the previous cycle
was a cache hit, we just use the second word of the doubleword read
from the cache RAM. However, the cache hit/miss logic also continues
to operate, so in the case where the first word hits but the second
word misses (because of an icache invalidation or a snoop occurring in
the first cycle), we supply the instruction from the data previously
read from the icache RAM but also stall fetch1 and start a cache
reload sequence, and subsequently supply the second instruction
again. This fixes the issue by inhibiting req_is_miss and stall_out
when use_previous is true.
The second bug is that if an icache invalidation occurs while
reloading a line, we continue to reload the line, and make it valid
when the reload finishes, even though some of the data may have been
read before the invalidation occurred. This adds a new state
STOP_RELOAD which we go to if an invalidation happens while we are in
CLR_TAG or WAIT_ACK state. In STOP_RELOAD state we don't request any
more reads from memory and wait for the reads we have previously
requested to be acked, and then go to IDLE state. Data returned is
still written to the icache RAM, but that doesn't matter because the
line is invalid and is never made valid.
Note that we don't have to worry about invalidations due to snooped
writes while reloading a line, because the wishbone arbiter won't
switch to another master once it has started sending our reload
requests to memory. Thus a store to memory will either happen before
any of our reads have got to memory, or after we have finished the
reload (in which case we will no longer be in WAIT_ACK state).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>