microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	af814a0d5e	Provide debug access to SPRs in loadstore1 and mmu They are accessible as GSPR 0x3c - PID, 0x3d - PTCR, 0x3e - DSISR and 0x3f - DAR. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	d0f319290f	Restore debug access to SPRs This provides access to the SPRs via the JTAG DMI interface. For now they are still accessed as if they were GPR/FPRs using the same numbering as before (GPRs at 0 - 0x1f, SPRs at 0x20 - 0x2d, FPRs at 0x40 - 0x5f). For XER, debug reads now report the full value, not just the bits that were previously stored in the register file. The "slow" SPR mux is not used for debug reads. Decode2 determines on each cycle whether a debug SPR access will happen next cycle, based on whether there is a request and whether the current instruction accesses the SPR RAM. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	fdb3ef6874	Finish off taking SPRs out of register file With this, the register file now contains 64 entries, for 32 GPRs and 32 FPRs, rather than the 128 it had previously. Several things get simplified - decode1 no longer has to work out the ispr{1,2,o} values, decode_input_reg_{a,b,c} no longer have the t = SPR case, etc. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	337b104250	Move LR, CTR and TAR out of the register file By putting CTR on the odd side and LR and TAR on the even side, we can read and write CTR for bdnz-style instructions in parallel with reading LR or TAR for indirect branches and writing LR for branches with LK=1. Thus we don't need to double up any of these instructions, giving a simplification in decode2. We now have logic for printing LR and CTR at the end of a simulation in execute1, in addition to the similar logic in register_file and cr_file. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	bc4d02cb0d	Start removing SPRs from register file This starts the process of removing SPRs from the register file by moving SRR0/1, SPRG0-3, HSRR0/1 and HSPRG0/1 out of the register file and putting them into execute1. They are stored in a pair of small RAM arrays, referred to as "even" and "odd". The reason for having two arrays is so that two values can be read and written in each cycle. For example, SRR0 and SRR1 can be written in parallel by an interrupt and read in parallel by the rfid instruction. The addresses in the RAM which will be accessed are determined in the decode2 stage. We have one write address for both sides, but two read addresses, since in future we will want to be able to read CTR at the same time as either LR or TAR. We now have a connection from writeback to execute1 which carries the partial SRR1 value for an interrupt. SRR0 comes from the execute pipeline; we no longer need to carry instruction addresses along the LSU and FPU pipelines. Since SRR0 and SRR1 can be written in the same cycle now, we don't need the little state machine in writeback any more. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	73cc5167ec	Use FPU for division instructions if we have an FPU - Arrange for XER to be written for OE=1 forms - Arrange for condition codes to be set for RC=1 forms (including correct handling for 32-bit mode) - Don't instantiate the divider if we have an FPU. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	a95f8aab38	FPU: Add integer division logic to FPU This adds logic to the FPU to accomplish 64-bit integer divisions. No instruction actually uses this yet. The algorithm used is to obtain an estimate of the reciprocal of the divisor using the lookup table and refine it by one to three iterations of the Newton-Raphson algorithm (the number of iterations depends on the number of significant bits in the dividend). Then the reciprocal is multiplied by the dividend to get the quotient estimate. The remainder is calculated as dividend - quotient * divisor. If the remainder is greater than or equal to the divisor, the quotient is incremented, or if a modulo operation is being done, the divisor is subtracted from the remainder. The inverse estimate after refinement is good enough that the quotient estimate is always equal to or one less than the true quotient. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	2f45e545ed	decode2: Rework to make the stall_out signal come from a register At present the busy/stall signal going to decode1 depends on whether control thinks it can issue the current instruction, and that depends on completion and bypass signals coming from execute1 and writeback. To improve the timing of stall_out, this rearranges decode2 so that stall_out is asserted when we have a valid instruction that couldn't be issued in the previous cycle. This means that decode1 could give us a new instruction when we haven't issued the previous instruction. This in turn means that we can only use d_in in the first cycle of processing an instruction. After the first cycle, we get register addresses etc. from dc2 rather than d_in. Then, to avoid the need to read register operands from register_file in each cycle until the instruction issues, we bring the bypass path for data being written to the register file into decode2 explicitly rather than having it in register_file. A new process called decode2_addrs does the process of calling decode_input_reg_* and decode_output_reg and sets up the register file addresses. This was split out (and decode_input_reg_* reworked) to try to reduce the number of passes through the decode2_1 process that need to be done in simulation. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	9a8a8e50f8	FPU: Add stage-2 stall ability to FPU This makes the FPU able to stall other units at execute stage 2 and be stalled by other units (specifically the LSU). This means that the completion and writeback for an instruction can now end up being deferred until the second cycle of a following instruction, i.e. the cycle when the state machine has gone through IDLE state into one of the DO_* states, which means we need to latch the destination FPR number, CR mask, etc. from the previous instruction so that we present the correct information to writeback. The advantage of this is that we can get rid of the in_progress signal from the LSU. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	e030a500e8	Allow integer instructions and load/store instructions to execute together Execute1 and loadstore1 now send each other stall signals that indicate that a valid instruction in stage 2 can't complete in this cycle, and hence any valid instruction in stage 1 in the other unit can't move to stage 2. With this in place, an ALU instruction can move into stage 1 while a LSU instruction is in stage 2. Since the FPU doesn't yet have a way to stall completion, we can't yet start FPU instructions while any LSU or ALU instruction is in progress. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	3510071d9a	Add a second execute stage to the pipeline This adds a second execute stage to the pipeline, in order to match up the length of the pipeline through loadstore and dcache with the length through execute1. This will ultimately enable us to get rid of the 1-cycle bubble that we currently have when issuing ALU instructions after one or more LSU instructions. Most ALU instructions execute in the first stage, except for count-zeroes and popcount instructions (which take two cycles and do some of their work in the second stage) and mfspr/mtspr to "slow" SPRs (TB, DEC, PVR, LOGA/LOGD, CFAR). Multiply and divide/mod instructions take several cycles but the instruction stays in the first stage (ex1) and ex1.busy is asserted until the operation is complete. There is currently a bypass from the first stage but not the second stage. Performance is down somewhat because of that and because this doesn't yet eliminate the bubble between LSU and ALU instructions. The forwarding of XER common bits has been changed somewhat because now there is another pipeline stage between ex1 and the committed state in cr_file. The simplest thing for now is to record the last value written and use that, unless there has been a flush, in which case the committed state (obtained via e_in.xerc) is used. Note that this fixes what was previously a benign bug in control.vhdl, where it was possible for control to forget an instruction's dependency on a value from a previous instruction (a GPR or the CR) if this instruction writes the value and the instruction gets to the point where it could issue but is blocked by the busy signal from execute1. In that situation, control may incorrectly not indicate that a bypass should be used. That didn't matter previously because, for ALU and FPU instructions, there was only one previous instruction in flight and once the current instruction could issue, the previous instruction was completing and the correct value would be obtained from register_file or cr_file. For loadstore instructions there could be two being executed, but because there are no bypass paths, failing to indicate use of a bypass path is fine. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	813e2317bf	execute1: Restructure to separate out execution of side effects We now have a record that represents the actions taken in executing an instruction, and a process that computes that for the incoming instruction. We no longer have 'current' or 'r.cur_instr', instead things like the destination register are put into r.e in the first cycle of an instruction and not reinitialized in subsequent busy cycles. For mfspr and mtspr, we now decode "slow" SPR numbers (those SPRs that are not stored in the register file) to a new "spr_selector" record in decode1 (excluding those in the loadstore unit). With this, the result for mfspr is determined in the data path. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	204fedc63f	Move XER low bits out of register file Besides the overflow and status carry bits, XER has 18 bits which need to retain the value written by mtxer (in case software wants to emulate the move-assist instructions lswi, lswx, stswi and stswx). Until now these bits (and others) have been stored in the GPR file as a "fast" SPR, but this causes complications because XER is not really a fast SPR. Instead, we now store these 18 bits in the 'ctrl' signal, which exists in execute1. This will enable us to simplify the data path in future, and has the added bonus that with a little bit of plumbing, we can get the full XER value printed when dumping registers at the end of a simulation. Therefore this changes scripts/run_test.sh to remove the greps which exclude XER from the comparison of actual and expected register results. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Anton Blanchard	0b39947f8d	Remove unused sequential signal from Fetch1ToIcacheType GHDL synthesis is flagging a warning about this. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	2 years ago
Benjamin Herrenschmidt	d745995207	Introduce real_addr_t and addr_to_real() This moves REAL_ADDR_BITS out of the caches and defines a real_addr_t type for a real address, along with an addr_to_real() conversion helper. It makes the VHDL a bit more readable. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	3 years ago
Paul Mackerras	54b0e8b8c8	core: Predict not-taken conditional branches using BTC This adds a bit to the BTC to store whether the corresponding branch instruction was taken last time it was encountered. That lets us pass a not-taken prediction down to decode1, which for backwards direct branches inhibits it from redirecting fetch to the target of the branch. This increases coremark by about 2%. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	65c43b488b	PMU: Add several more events This implements most of the architected PMU events. The ones missing are mostly the ones that depend on which level of the cache hierarchy data is fetched from. The events implemented here, and their raw event codes, are: Floating-point operation completed (100f4) Load completed (100fc) Store completed (200f0) Icache miss (200fc) ITLB miss (100f6) ITLB miss resolved (400fc) Dcache load miss (400f0) Dcache load miss resolved (300f8) Dcache store miss (300f0) DTLB miss (300fc) DTLB miss resolved (200f6) No instruction available and none being executed (100f8) Instruction dispatched (200f2, 300f2, 400f2) Taken branch instruction completed (200fa) Branch mispredicted (400f6) External interrupt taken (200f8) Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	a7873b45f7	core: Add a basic performance monitor unit (PMU) implementation This is the start of an implementation of a PMU according to PowerISA v3.0B. Things not implemented yet include most architected events, the BHRB, event-based branches, thresholding, MMCR0[TBCC] field, etc. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	64e3ce7134	execute1: Handle interrupts during sequences of load/store operations At present the logic prevents any interrupts from being handled while there is a load/store instruction (one that has unit=LDST) being executed. However, load/store instructions can still get sent to loadstore1. Thus an instruction which should generate an interrupt such as a floating-point unavailable interrupt will instead get executed. To fix this, when we detect that an interrupt should be generated but loadstore1 is still executing a previous instruction, we don't execute any new instructions, and set a new r.intr_pending flag. That results in busy_out being asserted (meaning that no further instructions will come in from decode2). When loadstore1 has finished the instructions it has, the interrupt gets sent to writeback. If one of the instructions in loadstore1 generates an interrupt in the meantime, the l_in.interrupt signal gets asserted and that clears r.intr_pending, so the interrupt we detected gets discarded. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	18120f153d	MMU: Implement a vestigial partition table This implements a 1-entry partition table, so that instead of getting the process table base address from the PRTBL SPR, the MMU now reads the doubleword pointed to by the PTCR register plus 8 to get the process table base address. The partition table entry is cached. Having the PTCR and the vestigial partition table reduces the amount of software change required in Linux for Microwatt support. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	17fd069640	core: Allow multiple loadstore instructions to be in flight The idea here is that we can have multiple instructions in progress at the same time as long as they all go to the same unit, because that unit will keep them in order. If we get an instruction for a different unit, we wait for all the previous instructions to finish before executing it. Since the loadstore unit is the only one that is currently pipelined, this boils down to saying that loadstore instructions can go ahead while l_in.in_progress = 1 but other instructions have to wait until it is 0. This gives a 2% increase on coremark performance on the Arty A7-100 (from ~190 to ~194). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	f636bb7c39	dcache: Fix bugs in pipelined operation This fixes two bugs which show up when multiple operations are in flight in the dcache, and adds a 'hold' input which will be needed when loadstore1 is pipelined. The first bug is that dcache needs to sample the data for a store on the cycle after the store request comes in even if the store request is held up because of a previous request (e.g. if the previous request is a load miss or a dcbz). The second bug is that a load request coming in for a cache line being refilled needs to be handled immediately in the case where it is for the row whose data arrives on the same cycle. If it is not, then it will be handled as a separate cache miss and the cache line will be refilled again into a different way, leading to two ways both being valid for the same tag. This can lead to data corruption, in the scenario where subsequent writes go to one of the ways and then that way gets displaced but the other way doesn't. This bug could in principle show up even without having multiple operations in flight in the dcache. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	acb3d2d745	core: Send FPU interrupts to writeback rather than execute1 Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	29221315e9	core: Send loadstore1 interrupts to writeback rather than execute1 Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	3cd3449b4b	core: Move redirect and interrupt delivery logic to writeback This moves the logic for redirecting fetching and writing SRR0 and SRR1 to writeback. The aim is that ultimately units other than execute1 can send their interrupts to writeback along with their instruction completions, so that there can be multiple instructions in flight without needing execute1 to keep track of the address of each outstanding instruction. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	4fd8d9509c	execute1: Move CR result to data path process Also work out in decode2 whether the instruction sets the XER common bits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	ae2afeca5c	core: Track CR hazards and bypasses using tags Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	d290d2a9bb	core: Restore bypass path from execute1 This changes the bypass path. Previously it went from after execute1's output to after decode2's output. Now it goes from before execute1's output register to before decode2's output register. The reason is that the new path will be simpler to manage when there are possibly multiple instructions in flight. This means that the bypassing can be managed inside decode2 and control. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	c0b45e153b	core: Track GPR hazards using tags that propagate through the pipelines This changes the way GPR hazards are detected and tracked. Instead of having a model of the pipeline in gpr_hazard.vhdl, which has to mirror the behaviour of the real pipeline exactly, we now assign a 2-bit tag to each instruction and record which GSPR the instruction writes. Subsequent instructions that need to use the GSPR get the tag number and stall until the value with that tag is being written back to the register file. For now, the forwarding paths are disabled. That gives about an 8% reduction in coremark performance. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	a1d7b54f76	core: Crack branches that update both CTR and LR This uses the instruction doubling machinery to convert conditional branch instructions that update both CTR and LR (e.g., bdnzl, bdnzlrl) into two instructions, of which the first updates CTR and determines whether the branch is taken, and the second updates LR and does the redirect if necessary. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	4c61a71a62	core: Crack update-form loads into two internal ops This uses the instruction-doubling machinery to send load with update instructions down to loadstore1 as two separate ops, rather than one op with two destinations. This will help to simplify the value tracking mechanisms. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	0fb207be60	fetch1: Implement a simple branch target cache This implements a cache in fetch1, where each entry stores the address of a simple branch instruction (b or bc) and the target of the branch. When fetching sequentially, if the address being fetched matches the cache entry, then fetching will be redirected to the branch target. The cache has 1024 entries and is direct-mapped, i.e. indexed by bits 11..2 of the NIA. The bus from execute1 now carries information about taken and not-taken simple branches, which fetch1 uses to update the cache. The cache entry is updated for both taken and not-taken branches, with the valid bit being set if the branch was taken and cleared if the branch was not taken. If fetching is redirected to the branch target then that goes down the pipe as a predicted-taken branch, and decode1 does not do any static branch prediction. If fetching is not redirected, then the next instruction goes down the pipe as normal and decode1 does its static branch prediction. In order to make timing, the lookup of the cache is pipelined, so on each cycle the cache entry for the current NIA + 8 is read. This means that after a redirect (from decode1 or execute1), only the third and subsequent sequentially-fetched instructions will be able to be predicted. This improves the coremark value on the Arty A7-100 from about 180 to about 190 (more than 5%). The BTC is optional. Builds for the Artix 7 35-T part have it off by default because the extra ~1420 LUTs it takes mean that the design doesn't fit on the Arty A7-35 board. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	b0510fd1bb	core: Reorganize execute1 This breaks up the enormous if .. elsif .. case .. elsif statement in execute1 in order to try to make it simpler and more understandable. We now have decode2 deciding whether the instruction has a value to be written back to a register (GPR, GSPR, FPR, etc.) rather than individual cases in execute1 setting result_en. The computation of the data to be written back is now independent of detection of various exception conditions. We now have an if block determining if any exception condition exists which prevents the next instruction from being executed, then the case statement which performs actions such as setting carry/overflow bits, determining if a trap exception exists, doing branches, etc., then an if statement for all the r.busy = 1 cases (continuing execution of an instruction which was started in a previous cycle, or writing SRR1 for an interrupt). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	658feabfd4	core: Make result multiplexing explicit This adds an explicit multiplexer feeding v.e.write_data in execute1, with the select lines determined in the previous cycle based on the insn_type. Similarly, for multiply and divide instructions, there is now an explicit multiplexer. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	6427cab46f	loadstore1/dcache: Send store data one cycle later This makes timing easier and also means that store floating-point single precision instructions no longer need to take an extra cycle. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	89a67a18d0	decode: Add a facility field to the instruction decode tables This makes it simpler to work out when to deliver a FPU unavailable interrupt. This also means we can get rid of the OP_FPLOAD and OP_FPSTORE insn_type values. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	4b2c23703c	core: Implement quadword loads and stores This implements the lq, stq, lqarx and stqcx. instructions. These instructions all access two consecutive GPRs; for example the "lq %r6,0(%r3)" instruction will load the doubleword at the address in R3 into R7 and the doubleword at address R3 + 8 into R6. To cope with having two GPR sources or destinations, the instruction gets repeated at the decode2 stage, that is, for each lq/stq/lqarx/stqcx. coming in from decode1, two instructions get sent out to execute1. For these instructions, the RS or RT register gets modified on one of the iterations by setting the LSB of the register number. In LE mode, the first iteration uses RS|1 or RT|1 and the second iteration uses RS or RT. In BE mode, this is done the other way around. In order for decode2 to know what endianness is currently in use, we pass the big_endian flag down from icache through decode1 to decode2. This is always in sync with what execute1 is using because only rfid or an interrupt can change MSR[LE], and those operations all cause a flush and redirect. There is now an extra column in the decode tables in decode1 to indicate whether the instruction needs to be repeated. Decode1 also enforces the rule that lq with RA = RT and lqarx with RA = RT or RB = RT are illegal. Decode2 now passes a 'repeat' flag and a 'second' flag to execute1, and execute1 passes them on to loadstore1. The 'repeat' flag is set for both iterations of a repeated instruction, and 'second' is set on the second iteration. Execute1 does not take asynchronous or trace interrupts on the second iteration of a repeated instruction. Loadstore1 uses 'next_addr' for the second iteration of a repeated load/store so that we access the second doubleword of the memory operand. Thus loadstore1 accesses the doublewords in increasing memory order. For 16-byte loads this means that the first iteration writes GPR RT|1. It is possible that RA = RT|1 (this is a legal but non-preferred form), meaning that if the memory operand was misaligned, the first iteration would overwrite RA but then the second iteration might take a page fault, leading to corrupted state. To avoid that possibility, 16-byte loads in LE mode take an alignment interrupt if the operand is not 16-byte aligned. (This is the case anyway for lqarx, and we enforce it for lq as well.) Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Anton Blanchard	659be2780f	Fully initialize FPU buses when FPU is disabled Some of the bits in the FPU buses end up as z state. Yosys flags them, so we may as well clean it up. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	3 years ago
Paul Mackerras	856e9e955f	core: Add framework for an FPU This adds the skeleton of a floating-point unit and implements the mffs and mtfsf instructions. Execute1 sends FP instructions to the FPU and receives busy, exception, FP interrupt and illegal interrupt signals from it. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	9d285a265c	core: Add support for single-precision FP loads and stores This adds code to loadstore1 to convert between single-precision and double-precision formats, and implements the lfs* and stfs* instructions. The conversion processes are described in Power ISA v3.1 Book 1 sections 4.6.2 and 4.6.3. These conversions take one cycle, so lfs* and stfs* are one cycle slower than lfd* and stfd*. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	45cd8f4fc3	core: Add support for floating-point loads and stores This extends the register file so it can hold FPR values, and implements the FP loads and stores that do not require conversion between single and double precision. We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores cause a FP unavailable interrupt if MSR[FP] = 0. The FPU facilities are optional and their presence is controlled by the HAS_FPU generic passed down from the top-level board file. It defaults to true for all except the A7-35 boards. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	b589d2d472	execute1: Implement trace interrupts Trace interrupts occur when the MSR[TE] field is non-zero and an instruction other than rfid has been successfully completed. A trace interrupt occurs before the next instruction is executed or any asynchronous interrupt is taken. Since the trace interrupt is defined to set SRR1 bits depending on whether the traced instruction is a load or an instruction treated as a load, or a store or an instruction treated as a store, we need to make sure the treated-as-a-load instructions (icbi, icbt, dcbt, dcbst, dcbf) and the treated-as-a-store instructions (dcbtst, dcbz) have the correct opcodes in decode1. Several of them were previously marked as OP_NOP. We don't yet implement the SIAR or SDAR registers, which should be set by trace interrupts. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	eee90a0815	loadstore1: Generate alignment interrupts for unaligned larx/stcx Load-and-reserve and store-conditional instructions are required to generate an alignment interrupt (0x600 vector) if their EA is not aligned. Implement this. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	033ee909fd	core: Implement 32-bit mode In 32-bit mode, effective addresses are truncated to 32 bits, both for instruction fetches and data accesses, and CR0 is set for Rc=1 (record form) instructions based on the lower 32 bits of the result rather than all 64 bits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	2e7b371305	core: Implement big-endian mode Big-endian mode affects both instruction fetches and data accesses. For instruction fetches, we byte-swap each word read from memory when writing it into the icache data RAM, and use a tag bit to indicate whether each cache line contains instructions in BE or LE form. For data accesses, we simply need to invert the existing byte_reverse signal in BE mode. The only thing to be careful of is to get the sign bit from the correct place when doing a sign-extending load that crosses two doublewords of memory. For now, interrupts unconditionally set MSR[LE]. We will need some sort of interrupt-little-endian bit somewhere, perhaps in LPCR. This also fixes a debug report statement in fetch1.vhdl. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	0fb8967290	core: Implement the TAR register and the bctar instruction Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	535341961d	multiplier: Generalize interface to the multiplier This makes the interface to the multiplier more general so an instance of it can be used in the FPU. It now has a 128-bit addend that is added on to the product. Instead of an input to negate the output, it now has a "not_result" input to complement the output. Execute1 uses not_result=1 and addend=-1 to get the effect of negating the output. The interface is defined this way because this is what can be done easily with the Xilinx DSP slices in xilinx-mult.vhdl. This also adds clock enable signals to the DSP slices, mostly for the sake of reducing power consumption. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	91cbeee77c	loadstore1: Generate busy signal earlier This makes the calculation of busy as simple as possible and dependent only on register outputs. The timing of busy is critical, as it gates the valid signal for the next instruction, and therefore any delays in dropping busy at the end of a load or store directly impact the timing of a host of other paths. This also separates the 'done without error' and 'done with error' cases from the MMU into separate signals that are both driven directly from registers. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Jordan Niethe	17fc77cef2	core: Implement PVR register Microwatt has been allocated a PVR version of 0x0063. Implement a PVR with this value. Signed-off-by: Jordan Niethe <jniethe5@gmail.com>	4 years ago
Paul Mackerras	74062195ca	execute1: Do forwarding of the CR result to the next instruction This adds a path to allow the CR result of one instruction to be forwarded to the next instruction, so that sequences such as cmp; bc can avoid having a 1-cycle bubble. Forwarding is not available for dot-form (Rc=1) instructions, since the CR result for them is calculated in writeback. The decode.output_cr field is used to identify those instructions that compute the CR result in execute1. For some reason, the multiply instructions incorrectly had output_cr = 1 in the decode tables. This fixes that. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
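
The integer division scheme described in commit a95f8aab38 (reciprocal estimate, Newton-Raphson refinement, multiply, then a remainder-based correction) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the FPU's logic: it handles unsigned 64-bit operands only, starts from a power-of-two guess instead of the lookup table, runs a fixed seven iterations rather than one to three, and uses a correction loop rather than relying on the single-increment bound stated in the commit. The names `FRAC_BITS`, `reciprocal` and `divmod_u64` are hypothetical.

```python
FRAC_BITS = 128  # fixed-point scale for the reciprocal (assumption of this sketch)


def reciprocal(divisor, iterations=7):
    """Return an underestimate of 2**FRAC_BITS / divisor via Newton-Raphson."""
    # Initial guess: the reciprocal of the next power of two above the divisor,
    # so the relative error starts at no more than 1/2.
    x = 1 << (FRAC_BITS - divisor.bit_length())
    for _ in range(iterations):
        # x <- x * (2 - divisor * x) in fixed point. Each step roughly squares
        # the relative error, and truncation keeps x at or below the true value.
        x = (x * ((2 << FRAC_BITS) - divisor * x)) >> FRAC_BITS
    return x


def divmod_u64(dividend, divisor):
    """Unsigned 64-bit divide/modulo using a refined reciprocal estimate."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    # Quotient estimate = dividend * (1 / divisor); never above the true quotient
    # because the reciprocal is an underestimate.
    quotient = (dividend * reciprocal(divisor)) >> FRAC_BITS
    # Remainder = dividend - quotient * divisor, as in the commit message.
    remainder = dividend - quotient * divisor
    # Correction step: bump the quotient while the remainder is still too large.
    while remainder >= divisor:
        quotient += 1
        remainder -= divisor
    return quotient, remainder
```

Because the reciprocal is never an overestimate, the remainder is never negative and the correction only ever has to move the quotient upwards, which mirrors the "equal to or one less than the true quotient" property the commit relies on.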

129 Commits (047be5c0c3b2f12c9321412518e17b7267fe14ea)
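
The branch target cache from commit 0fb207be60 (1024 entries, direct-mapped, indexed by bits 11..2 of the NIA, valid bit set on a taken branch and cleared on a not-taken one) can be modelled in a few lines. This is a behavioural sketch only, assuming a full-address comparison for the match; it does not model the pipelined lookup at NIA + 8 or the later taken/not-taken prediction bit, and the names `btc_index` and `BranchTargetCache` are hypothetical.

```python
BTC_ENTRIES = 1024  # direct-mapped, per the commit message


def btc_index(nia):
    """Index the cache with bits 11..2 of the next-instruction address."""
    return (nia >> 2) & (BTC_ENTRIES - 1)


class BranchTargetCache:
    def __init__(self):
        # Each entry is None or a (branch_address, target, valid) tuple.
        self.entries = [None] * BTC_ENTRIES

    def update(self, branch_addr, target, taken):
        """Record the outcome of a simple branch (b or bc) reported by execute."""
        # The entry is rewritten for both outcomes; valid is set only if taken.
        self.entries[btc_index(branch_addr)] = (branch_addr, target, taken)

    def predict(self, nia):
        """Return the predicted target for this fetch address, or None."""
        entry = self.entries[btc_index(nia)]
        if entry is None:
            return None
        branch_addr, target, valid = entry
        # Redirect only on a valid entry whose stored address matches the NIA.
        return target if valid and branch_addr == nia else None
```

Addresses that differ by a multiple of 4096 bytes share an index, so the stored branch address acts as the tag that keeps an aliasing fetch address from being redirected.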