Execute1 and loadstore1 now send each other stall signals that
indicate that a valid instruction in stage 2 can't complete in this
cycle, and hence any valid instruction in stage 1 in the other unit
can't move to stage 2. With this in place, an ALU instruction can
move into stage 1 while a LSU instruction is in stage 2.
Since the FPU doesn't yet have a way to stall completion, we can't yet
start FPU instructions while any LSU or ALU instruction is in
progress.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

This implements an alternative count-leading-zeroes algorithm which
uses less LUTs to generate the higher-order bits (2..5) of the
result.
By doing (v | -v) rather than (v & -v), we get a value which has ones
from the MSB down to the rightmost 1 bit in v and then zeroes down to
the LSB. This means that we can generate the MSB of the result (the
index of the rightmost 1 bit in v) just by looking at bits 63 and 31
of (v | -v), assuming that v is 64 bits. Bit 4 of the result requires
looking at bits 63, 47, 31 and 15. In contrast, each bit of the
result using (v & -v), which has a single 1, requires ORing together
32 bits.
It turns out that the minimum LUT usage comes from using (v & -v) to
generate bits 0 and 1 of the result, and using (v | -v) to generate
bits 2 to 5. This saves almost 60 6-input LUTs on the Artix-7.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

This moves the calculation of the result for popcnt* into the
countbits unit, renamed from countzero, so that we can take two cycles
to get the result. The motivation for this is that the popcnt*
calculation was showing up as a critical path.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>