microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	d290d2a9bb	core: Restore bypass path from execute1 This changes the bypass path. Previously it went from after execute1's output to after decode2's output. Now it goes from before execute1's output register to before decode2's output register. The reason is that the new path will be simpler to manage when there are possibly multiple instructions in flight. This means that the bypassing can be managed inside decode2 and control. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	c0b45e153b	core: Track GPR hazards using tags that propagate through the pipelines This changes the way GPR hazards are detected and tracked. Instead of having a model of the pipeline in gpr_hazard.vhdl, which has to mirror the behaviour of the real pipeline exactly, we now assign a 2-bit tag to each instruction and record which GSPR the instruction writes. Subsequent instructions that need to use the GSPR get the tag number and stall until the value with that tag is being written back to the register file. For now, the forwarding paths are disabled. That gives about an 8% reduction in coremark performance. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	a1d7b54f76	core: Crack branches that update both CTR and LR This uses the instruction doubling machinery to convert conditional branch instructions that update both CTR and LR (e.g., bdnzl, bdnzlrl) into two instructions, of which the first updates CTR and determines whether the branch is taken, and the second updates LR and does the redirect if necessary. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4c61a71a62	core: Crack update-form loads into two internal ops This uses the instruction-doubling machinery to send load with update instructions down to loadstore1 as two separate ops, rather than one op with two destinations. This will help to simplify the value tracking mechanisms. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0fb207be60	fetch1: Implement a simple branch target cache This implements a cache in fetch1, where each entry stores the address of a simple branch instruction (b or bc) and the target of the branch. When fetching sequentially, if the address being fetched matches the cache entry, then fetching will be redirected to the branch target. The cache has 1024 entries and is direct-mapped, i.e. indexed by bits 11..2 of the NIA. The bus from execute1 now carries information about taken and not-taken simple branches, which fetch1 uses to update the cache. The cache entry is updated for both taken and not-taken branches, with the valid bit being set if the branch was taken and cleared if the branch was not taken. If fetching is redirected to the branch target then that goes down the pipe as a predicted-taken branch, and decode1 does not do any static branch prediction. If fetching is not redirected, then the next instruction goes down the pipe as normal and decode1 does its static branch prediction. In order to make timing, the lookup of the cache is pipelined, so on each cycle the cache entry for the current NIA + 8 is read. This means that after a redirect (from decode1 or execute1), only the third and subsequent sequentially-fetched instructions will be able to be predicted. This improves the coremark value on the Arty A7-100 from about 180 to about 190 (more than 5%). The BTC is optional. Builds for the Artix 7 35-T part have it off by default because the extra ~1420 LUTs it takes mean that the design doesn't fit on the Arty A7-35 board. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	f7b855dfc3	execute1: Improve timing on comparisons Using the main adder for comparisons has the disadvantage of creating a long path from the CA/OV bit forwarding to v.busy via the carry input of the adder, the comparison result, and determining whether a trap instruction would trap. Instead we now have dedicated comparators for the high and low words of a_in vs. b_in, and combine their results to get the signed and unsigned comparison results. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b0510fd1bb	core: Reorganize execute1 This breaks up the enormous if .. elsif .. case .. elsif statement in execute1 in order to try to make it simpler and more understandable. We now have decode2 deciding whether the instruction has a value to be written back to a register (GPR, GSPR, FPR, etc.) rather than individual cases in execute1 setting result_en. The computation of the data to be written back is now independent of detection of various exception conditions. We now have an if block determining if any exception condition exists which prevents the next instruction from being executed, then the case statement which performs actions such as setting carry/overflow bits, determining if a trap exception exists, doing branches, etc., then an if statement for all the r.busy = 1 cases (continuing execution of an instruction which was started in a previous cycle, or writing SRR1 for an interrupt). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	658feabfd4	core: Make result multiplexing explicit This adds an explicit multiplexer feeding v.e.write_data in execute1, with the select lines determined in the previous cycle based on the insn_type. Similarly, for multiply and divide instructions, there is now an explicit multiplexer. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	9ea1ab0215	execute1: Move branch adder after register This does the addition of the instruction NIA and the branch offset after the register at the output of execute1 rather than before. The propagation through the adder was showing up as a critical path on the A7-100. Performance is unaffected and now it makes timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	89a67a18d0	decode: Add a facility field to the instruction decode tables This makes it simpler to work out when to deliver a FPU unavailable interrupt. This also means we can get rid of the OP_FPLOAD and OP_FPSTORE insn_type values. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4b2c23703c	core: Implement quadword loads and stores This implements the lq, stq, lqarx and stqcx. instructions. These instructions all access two consecutive GPRs; for example the "lq %r6,0(%r3)" instruction will load the doubleword at the address in R3 into R7 and the doubleword at address R3 + 8 into R6. To cope with having two GPR sources or destinations, the instruction gets repeated at the decode2 stage, that is, for each lq/stq/lqarx/stqcx. coming in from decode1, two instructions get sent out to execute1. For these instructions, the RS or RT register gets modified on one of the iterations by setting the LSB of the register number. In LE mode, the first iteration uses RS|1 or RT|1 and the second iteration uses RS or RT. In BE mode, this is done the other way around. In order for decode2 to know what endianness is currently in use, we pass the big_endian flag down from icache through decode1 to decode2. This is always in sync with what execute1 is using because only rfid or an interrupt can change MSR[LE], and those operations all cause a flush and redirect. There is now an extra column in the decode tables in decode1 to indicate whether the instruction needs to be repeated. Decode1 also enforces the rule that lq with RA = RT and lqarx with RA = RT or RB = RT are illegal. Decode2 now passes a 'repeat' flag and a 'second' flag to execute1, and execute1 passes them on to loadstore1. The 'repeat' flag is set for both iterations of a repeated instruction, and 'second' is set on the second iteration. Execute1 does not take asynchronous or trace interrupts on the second iteration of a repeated instruction. Loadstore1 uses 'next_addr' for the second iteration of a repeated load/store so that we access the second doubleword of the memory operand. Thus loadstore1 accesses the doublewords in increasing memory order. For 16-byte loads this means that the first iteration writes GPR RT|1. It is possible that RA = RT|1 (this is a legal but non-preferred form), meaning that if the memory operand was misaligned, the first iteration would overwrite RA but then the second iteration might take a page fault, leading to corrupted state. To avoid that possibility, 16-byte loads in LE mode take an alignment interrupt if the operand is not 16-byte aligned. (This is the case anyway for lqarx, and we enforce it for lq as well.) Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	d5cf4acfdb	execute1: Update comments about XER forwarding This deletes some commentary that is now out of date and replaces it with a simple statement about the XER common bits being forwarded from the output of execute1 to the input. The comment being deleted talked about a hazard if an instruction that modifies XER[SO] is immediately followed by a store conditional. That is no longer a problem because the operands for loadstore1 are sent from execute1 (and therefore have the forwarded value) rather than decode2. This was in fact fixed in `5422007f83` ("Plumb loadstore1 input from execute1 not decode2", 2020-01-14). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Anton Blanchard	e1bac4d6e7	Reset TB and DECR We don't care what the values of TB and DECR are after reset, but we don't want the X state to propagate to other parts of the chip. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Paul Mackerras	e49192cb5b	execute1: Fix forwarding of result when doing delayed LR update Random execution testcases showed that a bdnzl which doesn't branch, followed immediately by a bdnz, uses the wrong value for CTR for the bdnz. Decode2 detects the read-after-write hazard on CTR and tells execute1 to use the bypass path. However, the bdnzl takes two cycles because it has to write back both CTR and LR, meaning that by the time the bdnz starts to execute, r.e.write_data no longer contains the CTR value, but instead contains zero. To fix this, we make execute1 maintain the written-back value of CTR in r.e.write_data across the cycle where LR is written back (this is possible because the LR writeback uses the exc_write_data path). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	27ac74a341	execute1: Fix writing LR for bdnzl/bdzl instructions Branch instructions which do a redirect and write both CTR and LR were not doing the write to LR due to a logic error. This fixes it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	1037c6aa2e	core: Implement mtmsr instruction This is like mtmsrd except it only alters the lower 32 bits of the MSR. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b0f7237b7f	execute1: Fix bug in trace interrupt vs. ITLB miss If an instruction fetch results in an instruction TLB miss, an OP_FETCH_FAILED instruction is sent down the pipe. If the MSR[TE] field is set for instruction tracing, the core currently considers that executing the OP_FETCH_FAILED counts as having executed one instruction and so generates a trace interrupt on the next valid instruction, meaning that the trace interrupt happens before the desired instruction rather than after it. Fix this by not tracing OP_FETCH_FAILED instructions. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	856e9e955f	core: Add framework for an FPU This adds the skeleton of a floating-point unit and implements the mffs and mtfsf instructions. Execute1 sends FP instructions to the FPU and receives busy, exception, FP interrupt and illegal interrupt signals from it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	9d285a265c	core: Add support for single-precision FP loads and stores This adds code to loadstore1 to convert between single-precision and double-precision formats, and implements the lfs* and stfs* instructions. The conversion processes are described in Power ISA v3.1 Book 1 sections 4.6.2 and 4.6.3. These conversions take one cycle, so lfs* and stfs* are one cycle slower than lfd* and stfd*. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	45cd8f4fc3	core: Add support for floating-point loads and stores This extends the register file so it can hold FPR values, and implements the FP loads and stores that do not require conversion between single and double precision. We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores cause a FP unavailable interrupt if MSR[FP] = 0. The FPU facilities are optional and their presence is controlled by the HAS_FPU generic passed down from the top-level board file. It defaults to true for all except the A7-35 boards. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b589d2d472	execute1: Implement trace interrupts Trace interrupts occur when the MSR[TE] field is non-zero and an instruction other than rfid has been successfully completed. A trace interrupt occurs before the next instruction is executed or any asynchronous interrupt is taken. Since the trace interrupt is defined to set SRR1 bits depending on whether the traced instruction is a load or an instruction treated as a load, or a store or an instruction treated as a store, we need to make sure the treated-as-a-load instructions (icbi, icbt, dcbt, dcbst, dcbf) and the treated-as-a-store instructions (dcbtst, dcbz) have the correct opcodes in decode1. Several of them were previously marked as OP_NOP. We don't yet implement the SIAR or SDAR registers, which should be set by trace interrupts. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6a80825e70	decode1: Avoid overriding fields of v.decode in decode1 In the cases where we need to override the values from the decode ROMs, we now do that overriding after the clock edge (eating into decode2's cycle) rather than before. This helps timing a little. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	eee90a0815	loadstore1: Generate alignment interrupts for unaligned larx/stcx Load-and-reserve and store-conditional instructions are required to generate an alignment interrupt (0x600 vector) if their EA is not aligned. Implement this. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	033ee909fd	core: Implement 32-bit mode In 32-bit mode, effective addresses are truncated to 32 bits, both for instruction fetches and data accesses, and CR0 is set for Rc=1 (record form) instructions based on the lower 32 bits of the result rather than all 64 bits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	2e7b371305	core: Implement big-endian mode Big-endian mode affects both instruction fetches and data accesses. For instruction fetches, we byte-swap each word read from memory when writing it into the icache data RAM, and use a tag bit to indicate whether each cache line contains instructions in BE or LE form. For data accesses, we simply need to invert the existing byte_reverse signal in BE mode. The only thing to be careful of is to get the sign bit from the correct place when doing a sign-extending load that crosses two doublewords of memory. For now, interrupts unconditionally set MSR[LE]. We will need some sort of interrupt-little-endian bit somewhere, perhaps in LPCR. This also fixes a debug report statement in fetch1.vhdl. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	83816cb9e3	core: Implement BCD Assist instructions addg6s, cdtbcd, cbcdtod To avoid adding too much logic, this moves the adder used by OP_ADD out of the case statement in execute1.vhdl so that the result can be used by OP_ADDG6S as well. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	5fafdc56ef	core: Implement the addex instruction The addex instruction is like adde but uses the XER[OV] bit for the carry in and out rather than XER[CA]. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	1a7aebeef8	Add random number generator and implement the darn instruction This adds a true random number generator for the Xilinx FPGAs which uses a set of chaotic ring oscillators to generate random bits and then passes them through a Linear Hybrid Cellular Automaton (LHCA) to remove bias, as described in "High Speed True Random Number Generators in Xilinx FPGAs" by Catalin Baetoniu of Xilinx Inc., in: https://pdfs.semanticscholar.org/83ac/9e9c1bb3dad5180654984604c8d5d8137412.pdf This requires adding a .xdc file to tell vivado that the combinatorial loops that form the ring oscillators are intentional. The same code should work on other FPGAs as well if their tools can be told to accept the combinatorial loops. For simulation, the random.vhdl module gets compiled in, which uses the pseudorand() function to generate random numbers. Synthesis using yosys uses nonrandom.vhdl, which always signals an error, causing darn to return 0xffff_ffff_ffff_ffff. This adds an implementation of the darn instruction. Darn can return either raw or conditioned random numbers. On Xilinx FPGAs, reading a raw random number gives the output of the ring oscillators, and reading a conditioned random number gives the output of the LHCA. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	290b05f97d	core: Implement the maddhd, maddhdu and maddld instructions These instructions use major opcode 4 and have a third GPR input operand, so we need a decode table for major opcode 4 and some plumbing to get the RC register operand read. The multiply-add instructions use the same insn_type_t values as the regular multiply instructions, and we distinguish in execute1 by looking at the major opcode. This turns out to be convenient because we don't have to add any cases in the code that handles the output of the multiplier, and it frees up some insn_type_t values. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	8edfbf638b	core: Implement the cmpeqb and cmprb instructions Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b739372f7e	core: Implement the bpermd instruction Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	cce34039c3	core: Implement the setb instruction Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	fa77a6f683	core: Implement the mcrxrx instruction This also removes OP_MCRXR, as the mcrxr instruction was removed in version 3.0B of the Power ISA, having been phased out for the server architecture since v2.02. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	081684273e	execute1: Use r.<field> not v.<field> in countzero code This simplifies logic and improves timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	f1238299bd	execute1: Take an extra cycle for OE=1 multiply instructions We now expect the overflow signal from the multiplier to come along one cycle later than the product. This breaks up a long combinatorial path and improves timing. This also changes some uses of v.<field> to r.<field> in the slow op logic, which should help timing as well. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	535341961d	multiplier: Generalize interface to the multiplier This makes the interface to the multiplier more general so an instance of it can be used in the FPU. It now has a 128-bit addend that is added on to the product. Instead of an input to negate the output, it now has a "not_result" input to complement the output. Execute1 uses not_result=1 and addend=-1 to get the effect of negating the output. The interface is defined this way because this is what can be done easily with the Xilinx DSP slices in xilinx-mult.vhdl. This also adds clock enable signals to the DSP slices, mostly for the sake of reducing power consumption. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	893d2bc6a2	core: Don't generate logic for log data when LOG_LENGTH = 0 This adds "if LOG_LENGTH > 0 generate" to the places in the core where log output data is latched, so that when LOG_LENGTH = 0 we don't create the logic to collect the data which won't be stored. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	9160e29c56	execute1: Ease timing on redirect_nia This eliminates a dependency of r.f.redirect_nia on the carry out from the main adder in the case of a conditional trap instruction. We can set r.f.redirect_nia unconditionally, even if no interrupt is generated. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Jordan Niethe	17fc77cef2	core: Implement PVR register Microwatt has been allocated a PVR version of 0x0063. Implement a PVR with this value. Signed-off-by: Jordan Niethe <jniethe5@gmail.com>	4 years ago
Paul Mackerras	74062195ca	execute1: Do forwarding of the CR result to the next instruction This adds a path to allow the CR result of one instruction to be forwarded to the next instruction, so that sequences such as cmp; bc can avoid having a 1-cycle bubble. Forwarding is not available for dot-form (Rc=1) instructions, since the CR result for them is calculated in writeback. The decode.output_cr field is used to identify those instructions that compute the CR result in execute1. For some reason, the multiply instructions incorrectly had output_cr = 1 in the decode tables. This fixes that. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0f0573903b	execute1: Add latch to redirect path This latches the redirect signal inside execute1, so that it is sent a cycle later to fetch1 (and to decode/icache as flush). This breaks a long combinatorial chain from the branch and interrupt detection in execute1 through the redirect/flush signals all the way back to fetch1, icache and decode. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	c2da82764f	core: Implement CFAR register This implements the CFAR SPR as a slow SPR stored in 'ctrl'. Taken branches and rfid update it to the address of the branch or rfid instruction. To simplify the logic, this makes rfid use the branch logic to generate its redirect (requiring SRR0 to come in to execute1 on the B input and SRR1 on the A input), and the masking of the bottom 2 bits of NIA is moved to fetch1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	76e2c7d81c	ex1: Add SPR_TBU support It's used by the boot wrapper in Linux and possibly some userspace programs. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Paul Mackerras	ec2fa61792	execute1: Reduce width of the result mux to help timing This reduces the number of different things that are assigned to the result variable. - The computations for the popcnt, prty, cmpb and exts instruction families are moved into the logical unit. - The result of mfspr from the slow SPRs is computed in 'spr_val' before being assigned to 'result'. - Writes to LR as a result of a blr or bclr instruction are done through the exc_write path to writeback. This eases timing considerably. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6687aae4d6	core: Implement a simple branch predictor This implements a simple branch predictor in the decode1 stage. If it sees that the instruction is b or bc and the branch is predicted to be taken, it sends a flush and redirect upstream (to icache and fetch1) to redirect fetching to the branch target. The prediction is sent downstream with the branch instruction, and execute1 now only sends a flush/redirect upstream if the prediction was wrong. Unconditional branches are always predicted to be taken, and conditional branches are predicted to be taken if and only if the offset is negative. Branches that take the branch address from a register (bclr, bcctr) are predicted not taken, as we don't have any way to predict the branch address. Since we can now have a mflr being executed immediately after a bl or bcl, we now track the update to LR in the hazard tracker, using the second write register field that is used to track RA updates for update-form loads and stores. For those branches that update LR but don't write any other result (i.e. that don't decrement CTR), we now write back LR in the same cycle as the instruction rather than taking a second cycle for the LR writeback. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	209aa9ce3f	loadstore1: Reduce busy cycles This reduces the number of cycles where loadstore1 asserts its busy output, leading to increased throughput of loads and stores. Loads that hit in the cache can now be executed at the rate of one every two cycles. Stores take 4 cycles assuming the wishbone slave responds with an ack the cycle after we assert strobe. To achieve this, the state machine code is split into two parts, one for when we have an existing instruction in progress, and one for starting a new instruction. We can now combinatorially clear busy and start a new instruction in the same cycle that we get a done signal from the dcache; in other words we are completing one instruction and potentially writing back results in the same cycle that we start a new instruction and send its address and data to the dcache. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	6701e7346b	core: Use a busy signal rather than a stall This changes the instruction dependency tracking so that we can generate a "busy" signal from execute1 and loadstore1 which comes along one cycle later than the current "stall" signal. This will enable us to signal busy cycles only when we need to from loadstore1. The "busy" signal from execute1/loadstore1 indicates "I didn't take the thing you gave me on this cycle", as distinct from the previous stall signal which meant "I took that but don't give me anything next cycle". That means that decode2 proactively gives execute1 a new instruction as soon as it has taken the previous one (assuming there is a valid instruction available from decode1), and that then sits in decode2's output until execute1 can take it. So instructions are issued by decode2 somewhat earlier than they used to be. Decode2 now only signals a stall upstream when its output buffer is full, meaning that we can fill up bubbles in the upstream pipe while a long instruction is executing. This gives a small boost in performance. This also adds dependency tracking for rA updates by update-form load/store instructions. The GPR and CR hazard detection machinery now has one extra stage, which may not be strictly necessary. Some of the code now really only applies to PIPELINE_DEPTH=1. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	9880fc7435	multiply: Move selection of result bits into execute1 This puts the logic that selects which bits of the multiplier result get written into the destination GPR into execute1, moved out from multiply. The multiplier is now expected to do an unsigned multiplication of 64-bit operands, optionally negate the result, detect 32-bit or 64-bit signed overflow of the result, and return a full 128-bit result. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	49a4d9f67a	Add core logging This logs 256 bits of data per cycle to a ring buffer in BRAM. The data collected can be read out through 2 new SPRs or through the debug interface. The new SPRs are LOG_ADDR (724) and LOG_DATA (725). LOG_ADDR contains the buffer write pointer in the upper 32 bits (in units of entries, i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of doublewords, i.e. 8 bytes). Reading LOG_DATA gives the doubleword from the buffer at the read pointer and increments the read pointer. Setting bit 31 of LOG_ADDR inhibits the trace log system from writing to the log buffer, so the contents are stable and can be read. There are two new debug addresses which function similarly to the LOG_ADDR and LOG_DATA SPRs. The log is frozen while either or both of the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set. The buffer defaults to 2048 entries, i.e. 64kB. The size is set by the LOG_LENGTH generic on the core_debug module. Software can determine the length of the buffer because the length is ORed into the buffer write pointer in the upper 32 bits of LOG_ADDR. Hence the length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)). There is a program to format the log entries in a somewhat readable fashion in scripts/fmt_log/fmt_log.c. The log_entry struct in that file describes the layout of the bits in the log entries. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4a4a98d4b9	core: Do addpcis using the main adder (#189) By adding logic to decode2 to be able to send the instruction address down the A input, and making CONST_DX_HI (renamed to CONST_DXHI4) add 4 to the immediate value (easy since the bottom 16 bits were zero), we can do addpcis using the main adder. This reduces the width of the result mux and frees up one value in insn_type_t, since we can now use OP_ADD for addpcis. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
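
Commit 49a4d9f67a ("Add core logging") gives the log buffer length as 1 << (31 - clz(LOG_ADDR)). The Python sketch below works through that formula. It is an illustration only, not code from the repository: the helper names are hypothetical, and it assumes that clz counts leading zeros of the full 64-bit SPR value, that the length is a power of two, and that the write pointer is always smaller than the length.

```python
def clz64(value: int) -> int:
    """Count leading zeros of value viewed as a 64-bit unsigned integer."""
    assert 0 <= value < (1 << 64)
    return 64 - value.bit_length()


def make_log_addr(length: int, write_ptr: int, read_ptr: int = 0) -> int:
    """Hypothetical model of the LOG_ADDR layout described in the commit.

    Upper 32 bits: write pointer (in entries) with the buffer length ORed in.
    Lower 32 bits: read pointer (in doublewords).
    """
    assert write_ptr < length  # assumption: the pointer never reaches the length bit
    return ((length | write_ptr) << 32) | read_ptr


def log_buffer_entries(log_addr: int) -> int:
    """Recover the buffer length in entries, as 1 << (31 - clz(LOG_ADDR))."""
    return 1 << (31 - clz64(log_addr))


if __name__ == "__main__":
    log_addr = make_log_addr(length=2048, write_ptr=5, read_ptr=40)
    entries = log_buffer_entries(log_addr)
    # Each entry is 256 bits, i.e. 32 bytes.
    print(entries, "entries,", entries * 32, "bytes")
```

With the default of 2048 entries, the length bit is the highest set bit of the upper word, so the formula returns 2048 whatever the write pointer is, which matches the 64kB figure in the commit message.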


141 Commits (d290d2a9bbddcfe52faa9427088bf6c4f225a711)
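
Commit 535341961d ("multiplier: Generalize interface to the multiplier") notes that execute1 uses not_result=1 and addend=-1 to negate the product. This works because, in two's complement, complementing (p - 1) yields -p. The Python sketch below checks the identity on a 128-bit model. The function names and the behavioural model are assumptions made for illustration, not the VHDL interface itself.

```python
MASK128 = (1 << 128) - 1  # results are truncated to 128 bits


def multiplier_model(a: int, b: int, addend: int, not_result: bool) -> int:
    """Assumed behavioural model: a * b + addend, optionally complemented."""
    total = (a * b + addend) & MASK128
    if not_result:
        total = ~total & MASK128
    return total


def negated_product(a: int, b: int) -> int:
    """Negate a * b using not_result=1 and addend=-1 (all ones in 128 bits)."""
    return multiplier_model(a, b, addend=MASK128, not_result=True)


if __name__ == "__main__":
    a, b = 3, 5
    # Both values are the 128-bit two's complement encoding of -15.
    print(hex(negated_product(a, b)))
    print(hex(-(a * b) & MASK128))
```

The same identity holds for a zero product: 0 - 1 is all ones, and its complement is 0.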