microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	1da8476cf9	dcache: Simplify forwarding of load data while reloading a cache line This removes a dependency of req_is_hit and similar signals on the wishbone ack input, by removing use_forward_rl, and making idx_reload not dependent on wr_row_match and wishbone_in.ack. Previously if a load in r0 hit the doubleword being supplied from memory, that was treated as a hit and the data was forwarded via a multiplexer associated with the cache RAM. Now it is called a miss and completed by the logic in the RELOAD_WAIT_ACK state of the state machine. The only downside is that now the selection of data source in the dcache_fast_hit process depends on req_is_hit rather than r1.full. Overall this change seems to reduce the number of LUTs, and make timing easier on the ECP-5. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 months ago
Paul Mackerras	c938246cc8	dcache: Simplify addressing of the dcache TLB Instead of having TLB invalidation and TLB load requests come through the dcache main path, these operations are now done in one cycle entirely based on signals from the MMU, and don't involve the TLB read path or the dcache state machine at all. So that we know which way of the TLB to affect for invalidations, loadstore1 now sends down a "TLB probe" operation for tlbie instructions which goes through the dcache pipeline and sets the r1.tlb_hit_* fields which are used in the subsequent invalidation operation from the MMU (if it is a single-page invalidation). TLB load operations write to the way identified by r1.victim_way, which was set on the TLB miss that triggered the TLB reload. Since we are writing just one way of the TLB tags now, rather than writing all ways with one way's value changed, we now pad each way to a multiple of 8 bits so that byte write-enables can be used to select which way gets written. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 months ago
Paul Mackerras	5168242cd5	dcache: Rework forwarding data paths This rearranges the multiplexing of cache read data with forwarded store data with the aim of shortening the path from the req_hit_ways signal to the r1.data_out register. The forwarding decisions are now made for each way independently and the results then combined according to which way detected a cache hit. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	4278387b21	dcache: Simplify reservation logic With some slight arrangement of the state machine in the dcache_slow process, we can remove one of the two comparators that detect writes by other entities to the reservation granule. The state machine now sets the wishbone cyc signal on the transition from IDLE to DO_STCX state. Once we see the wishbone stall signal at 0, we consider we have the wishbone and we can assert stb to do the write provided that the stcx is to the reservation address and we haven't seen another write to the reservation granule. We keep the comparator that compares the snoop address delayed by one cycle, in order to make timing easier, and the one (or more) cycle delay between cyc and stb covers that one cycle delay in the kill_rsrv signal. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	26507450b7	dcache: Remove reset on read port of cache tag RAM The reset was added originally to reduce metavalue warnings in simulation, is not necessary for correct operation, and showed up as a critical path in synthesis for the Xilinx Artix-7. Remove it when doing synthesis; for simulation we set the value read to X rather than 0 in order to catch any use of the previously reset value. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	9645ab6e1f	dcache: Rework forwarding and same-page logic This gets rid of some largish comparators in the dcache_request process by matching index and way that hit in the cache tags instead of comparing tag values. That is, some tag comparisons can be replaced by seeing if both tags hit in the same cache way. When reloading a cache line, we now set it valid at the beginning of the reload, so that we get hits to compare. While the reload is still occurring, accesses to doublewords that haven't yet been read are indicated with req_is_hit = 0 and req_hit_reload = 1 (i.e. are considered to be a miss, at least for now). For the comparison of whether a subsequent access is to the same page as stores already being performed, in virtual mode (TLB being used) we now compare the way and index of the hit in the TLB, and in real mode we compare the effective address. If any new entry has been loaded into the TLB since the access we're comparing against, then it is considered to be a different page. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	2529bb66ad	dcache: Implement dcbz to non-cacheable memory properly A dcbz operation to memory that is mapped as non-cacheable in the page tables doesn't cause an alignment interrupt, but neither was it implemented properly in the dcache. It does do 8 writes to memory but it also creates a zero-filled line in the cache. This fixes it so that dcbz to memory mapped non-cacheable doesn't write the cache tag or set any line valid. We now have r1.reloading which is 1 only in RELOAD_WAIT_ACK state, but only if the memory is cacheable and therefore the cache should be updated (i.e. it is zero in RELOAD_WAIT_ACK state if we are doing a non-cacheable dcbz). We can now also remove the code in loadstore1 that checks for non-cacheable dcbz, which only triggered when doing dcbz in real mode to an address in the Cxxxxxxx range. Also remove some unused variables and signals. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	ec323897e3	dcache: Use expanded per-way TLB and cache tag hit information Rather than combining the results of the per-way comparators into an encoded 'hit_way' variable, use the individual results directly using AND-OR type networks where possible, in order to reduce utilization and improve timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 months ago
Paul Mackerras	5ddd8884fa	core: Implement two data watchpoints This implements the DAWR0, DAWRX0, DAWR1, and DAWRX1 registers, which provide the ability to set watchpoints on two ranges of data addresses and take an interrupt when an access is made to either range. The address comparisons are done in loadstore1 in the second cycle (doing it in the first cycle turned out to have poor timing). If a match is detected, a signal is sent to the dcache which causes the access to fail and generate an error signal back to loadstore1, in much the same way that a protection violation would, whereupon a data storage interrupt is generated. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	6 months ago
Paul Mackerras	d531e8aa10	dcache: Improve timing Previously we only put slow requests in r1.req, but that caused timing problems because it meant the clock enable for all the registers in r1.req depended on whether we have a TLB and cache hit or not. Now we put any valid request (i.e. with req_go = 1) into r1.req, which has better timing because req_go is a relatively simple function of registered values (r0_full, r0_valid, r0.tlbie, r0.tlbld, r1.full, r1.ls_error, d_in.hold). We still have to work out if we have a slow request, but that is only needed for the D input of one register (r1.full). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	6 months ago
Paul Mackerras	5121e0f392	core: Implement sync instructions This implements all the sync variants (sync, lwsync, ptesync, etc.) as a LSU op that gets sent down to the dcache and completes once the dcache state machine is idle. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	6 months ago
Paul Mackerras	00efcc2c3b	dcache: Make aligned quadword loads and stores actually be atomic This implements logic in the dcache to make aligned quadword loads and stores atomic with respect to other mechanisms that access memory. Such loads and stores are already marked with the atomic_qw bit in Loadstore1ToDcacheType. For quadword loads where the first dword access hits in the cache, we record the fact of the hit and the cache way used (r1.prev_hit and r1.prev_way). The second dword access then assumes a hit on the same way even if the cache line has been invalidated in the mean time by a snooped store. This gives the same effect as would loading both dwords at the time of the first dword load. For a lqarx, the reservation is set at the time of the first dword load, so if there is such a snooped store, the reservation will be invalid by the time the lqarx completes. If the first dword load hits on the cache line being refilled, so should the second, unless the refill finishes. In that case we set r1.prev_hit and r1.prev_way so the second load can use the line just refilled (but only if the first dword hit the line being refilled). For stores, the req.atomic_more flag is set on the first dword store, and that causes the STORE_WAIT_ACK state to wait for the next request without dropping cyc, so it is not possible for another wishbone master to insert an access between the writes of the two dwords to memory. For store-conditionals, DO_STCX state now transitions to STORE_WAIT_ACK state once the store has been accepted (stall is false). This means that the second store for a stqcx can be handled in the same way as the second store for a stq. Once the first store for a stqcx has succeeded, the second store is done unconditionally. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	6 months ago
Paul Mackerras	c2dcf4b334	dcache: Generate a DSI on larx/stcx to non-cacheable memory Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	6 months ago
Paul Mackerras	0fbeaa2a01	dcache: Use discrete req_op_* signals instead of an encoded req_op Hopefully this will improve timing by reducing unnecessary dependencies and giving more opportunities for routing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	6 months ago
Paul Mackerras	ba4614c5f4	dcache: Implement data cache touch and flush instructions This implements dcbf, dcbt and dcbtst in the dcache. The dcbst (data cache block store) instruction remains a no-op because our dcache is write-through and therefore never has modified data that could need to be written back. Dcbt (data cache block touch) and dcbtst (data cache block touch for store) behave similarly except that dcbtst is a no-op on a readonly page. Neither instruction ever causes an interrupt. If they miss in the cache and the page is cacheable, they are handled like a load miss except that they complete immediately the state machine starts handling the load miss rather than waiting for any data. Dcbf (data cache block flush) can cause a data storage interrupt. If it hits in the cache, the state machine goes to a new FLUSH_CYCLE state in which the cache line valid bit is cleared. In order to avoid having more than 8 values in op_t, this combines OP_STORE_MISS and OP_STORE_HIT into a single state. A new OP_NOP state is used for operations which can complete immediately without changing any dcache state (now used for dcbt/dcbtst causing access exception or on a non-cacheable page, or dcbf that misses the cache). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	6 months ago
Paul Mackerras	b181d28df2	dcache: Cancel reservation on snooped store This restructures the reservation machinery so that the reservation is cleared when a snooped store by another agent is done to the reservation address. The reservation address is now a real address rather than an effective address. For store-conditional, it is possible that a snooped store to the reservation address could come in even after we have asserted cyc and stb on the wishbone to do the store, and that should cause the store not to be performed. To achieve this, store-conditional now uses a separate state in the r1 state machine, which is set up so that losing the reservation due to a snooped store causes cyc and stb to be dropped immediately, and the store-conditional fails. For load-reserve, the reservation address is set at the end of cycle 1 and the reservation is made valid when the data is available. For lqarx, the reservation is made valid when the first doubleword of data is available. For the case where a snooped write comes in on cycle 0 of a larx and hits the same cache line, we detect that the index and way of the snooped write are the same as the index and way of the larx; it is done this way because reservation.addr is not set until the real address is available at the end of cycle 1. A hit on the same index and way causes reservation.valid to be set to 0 at the end of cycle 1. For a write in cycle 1, we compare the latched address in cycle 2 with the reservation address and clear reservation.valid at the end of cycle 2 if they match. In other words we compare the reservation address with both the address being written this cycle and the address being written in the previous cycle. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	6 months ago
Paul Mackerras	722f239c02	Reimplement quadword loads and stores This adds implementations of lq, plq, stq, pstq, lqarx and stqcx. Because register file addresses are now computed in decode1 before we have the decode table entry for the instruction, we have to check the icode directly to know when to read register RS|1 before RS (i.e. for stq and stqcx in LE mode, but not pstq). For the second instance of the instruction, loadstore1 uses the EA from the first instance + 8. It generates an alignment interrupt for unaligned lqarx and stqcx and for lq in LE mode with an unaligned address. (The reason for the latter case is that it writes RT|1 before RT, and if we have RA = RT|1 and the second instance traps, we will have overwritten RA.) Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	6 months ago
Paul Mackerras	9c3d14dd5a	dcache: Make reading of DTLB independent of d_in.valid This improves timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	8c5dabd67f	dcache: Make r1.acks_pending independent of r1.state With this, the logic that maintains r1.acks_pending operates in every state based on r1.wb and wishbone_in, rather than only operating in STORE_WAIT_ACK state. This makes things a bit clearer and improves timing slightly. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Benjamin Herrenschmidt	76f61ef823	dcache: Update PLRU on misses as well as hits The current dcache will not update the PLRU on a cache miss which is later satisfied during the reload process. Thus subsequent misses will potentially evict the same cache line. The same issue happens with dcbz which are treated more/less as load misses. This fixes it by triggering a PLRU update when r1.choose_victim, which is set on a miss for one cycle to snapshot the PLRU output. This means we will update the PLRU on the same cycle as we capture its output, which is fine (the new value will be visible on the next cycle). That way, a "miss" will result in a PLRU update to reflect that the entry being refilled is actually used (and will be used to serve subsequent load operations from the same cache line while being refilled). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	3 years ago
Benjamin Herrenschmidt	3edbbf5f18	Fix dcache_tb (and add dump of victim way to dcache) It bitrotted... more signals need to be initialized. This also adds a lot more accesses with different timing conditions allowing to test cases of hit during reloads, hit with reload forward, hit on idle cache etc... It also exposes a bug where the cache miss caused by the read of 0x140 uses the same victim way as previous cache miss of 0x40 (same index). This bug will need to be fixed separately, but at least this exposes it. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	3 years ago
Paul Mackerras	a1f5867919	dcache: Split PLRU into storage and logic Rather than having update and decode logic for each individual PLRU as well as a register to store the current PLRU state, we now put the PLRU state in a little RAM, which will typically use LUT RAM on FPGAs, and have just a single copy of the logic to calculate the pseudo-LRU way and to update the PLRU state. The PLRU RAM that applies to the data storage (as opposed to the TLB) is read asynchronously in the cycle after the cache tag matching is done. At the end of that cycle the PLRU RAM entry is updated if the access was a cache hit, or a victim way is calculated and stored if the access was a cache miss. It is possible that a cache miss doesn't start being handled until later, in which case the stored victim way is used later when the miss gets handled. Similarly for the TLB PLRU, the RAM is read asynchronously in the cycle after a TLB lookup is done, and either updated at the end of that cycle (for a hit), or a victim is chosen and stored for when the TLB miss is satisfied. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	cd2e174113	dcache: Fix compilation with NUM_WAYS and/or TLB_NUM_WAYS = 1 Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	6fe9dc9640	dcache: Reduce metavalue warnings Among other changes, this makes the things that were previously declared as signals of integer base type to be unsigned, since unsigned can carry metavalues, and hence we can get the checking for metavalues closer to the uses and therefore restrict the checking to the situations where the signal really ought to be well defined. We now have a couple more signals that indicate request validity to help with that. Non-fatal asserts have been sprinkled throughout to assist with determining the cause of warnings from library functions (primarily NUMERIC_STD.TO_INTEGER and NUMERIC_STD."="). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	795b6e2a6b	Remove leftover logic for 16-byte loads and stores This removes some logic that was previously added for the 16-byte loads and stores (lq, lqarx, stq, stqcx.) and not completely removed in commit `c9e838b656` ("Remove support for lq, stq, lqarx and stqcx.", 2022-06-04). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Paul Mackerras	bdd4d04162	Simplify flow control in the dcache and loadstore units Simplify the flow control by stalling the whole upstream pipeline when a stage can't proceed, instead of trying to let each stage progress independently when it can. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Anton Blanchard	39220be311	dcache: remove unused do_write signal Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	3 years ago
Benjamin Herrenschmidt	5cfa65e836	Introduce addr_to_wb() and wb_to_addr() helpers These convert addresses to/from wishbone addresses, and use them in parts of the caches, in order to make the code a bit more readable. Along the way, rename some functions in the caches to make it a bit clearer what they operate on and fix a bug in the icache STOP_RELOAD state where the wb address wasn't properly converted. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	d745995207	Introduce real_addr_t and addr_to_real() This moves REAL_ADDR_BITS out of the caches and defines a real_addr_t type for a real address, along with an addr_to_real() conversion helper. It makes the VHDL a bit more readable. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Paul Mackerras	70270c066a	dcache: Fix bug with dcbz closely following stores with the same tag This fixes a bug where a dcbz can get incorrectly handled as an ordinary 8-byte store if it arrives while the dcache state machine is handling other stores with the same tag value (i.e. within the same set-sized area of memory). The logic that says whether to include a new store in the current wishbone cycle didn't take into account whether the new store was a dcbz. This adds a "req.dcbz = '0'" factor so that it does. This is necessary because dcbz is handled more like a cache line refill (but writing to memory rather than reading) than an ordinary store. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	ca4eb46aea	Make wishbone addresses be in units of doublewords or words This makes the 64-bit wishbone buses have the address expressed in units of doublewords (64 bits), and similarly for the 32-bit buses the address is in units of words (32 bits). This is to comply with the wishbone spec. Previously the addresses on the wishbone buses were in units of bytes regardless of the bus data width, which is not correct and caused problems with interfacing with externally-generated logic. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Anton Blanchard	b29c58f3d1	dcache: Loads from non-cacheable PTEs load entire 64 bits A non-cacheable load should only load the data requested and no more. We do the right thing for real mode cache inhibited storage instructions, but when loading through a non-cacheable PTE we load the entire 64 bits regardless of the size. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Paul Mackerras	0b23a5e760	dcache: Simplify data input to improve timing Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	1a9834c506	dcache: Fix bug with forwarding of stores We have two stages of forwarding to cover the two cycles of latency between when something is written to BRAM and when that new data can be read from BRAM. When the writes to BRAM result from store instructions, the write may write only some bytes of a row (8 bytes) and not others, so we have a mask to enable only the written bytes to be forwarded. However, we only forward written data from either the first stage of forwarding or the second, not both. So if we have two stores in succession that write different bytes of the same row, and then a load from the row, we will only forward the data from the second store, and miss the data from the first store; thus the load will get the wrong value. To fix this, we make the decision on which forward stage to use for each byte individually. This results in a 4-input multiplexer feeding r1.data_out, with its inputs being the BRAM, the wishbone, the current write data, and the 2nd-stage forwarding register. Each byte of the multiplexer is separately controlled. The code for this multiplexer is moved to the dcache_fast_hit process since it is used for cache hits as well as cache misses. This also simplifies the BRAM code by ensuring that we can use the same source for the BRAM address and way selection for writes, whether we are writing store data or cache line refill data from memory. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	f812832ad7	dcache: Move way selection and forwarding earlier This moves the way multiplexer for the data from the BRAM, and the multiplexers for forwarding data from earlier stores or refills, before a clock edge rather than after, so that now the data output from the dcache comes from a clean latch. To do this we remove the extra latch on the output of the data BRAM (i.e. ADD_BUF=false) and rearrange the logic. The choice whether to forward or not now depends not on way comparisons but rather on tag comparisons, for the sake of timing. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	65c43b488b	PMU: Add several more events This implements most of the architected PMU events. The ones missing are mostly the ones that depend on which level of the cache hierarchy data is fetched from. The events implemented here, and their raw event codes, are: Floating-point operation completed (100f4) Load completed (100fc) Store completed (200f0) Icache miss (200fc) ITLB miss (100f6) ITLB miss resolved (400fc) Dcache load miss (400f0) Dcache load miss resolved (300f8) Dcache store miss (300f0) DTLB miss (300fc) DTLB miss resolved (200f6) No instruction available and none being executed (100f8) Instruction dispatched (200f2, 300f2, 400f2) Taken branch instruction completed (200fa) Branch mispredicted (400f6) External interrupt taken (200f8) Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	4c11c9c661	dcache: Simplify logic in RELOAD_WAIT_ACK state Since the expression is_last_row(r1.store_row, r1.end_row_ix) can only be true when stbs_done is true, there is no need to include stbs_done in the expression for the reload being completed, and hence no need to compute stbs_done in the RELOAD_WAIT_ACK state. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	eb7eba2d92	dcache: Snoop writes to memory by other agents This adds a path where the wishbone that goes out to memory and I/O also gets fed back to the dcache, which looks for writes that it didn't initiate, and invalidates any cache line that gets written to. This involves a second read port on the cache tag RAM for looking up the snooped writes, and effectively a second write port on the cache valid bit array to clear bits corresponding to snoop hits. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Paul Mackerras	f636bb7c39	dcache: Fix bugs in pipelined operation This fixes two bugs which show up when multiple operations are in flight in the dcache, and adds a 'hold' input which will be needed when loadstore1 is pipelined. The first bug is that dcache needs to sample the data for a store on the cycle after the store request comes in even if the store request is held up because of a previous request (e.g. if the previous request is a load miss or a dcbz). The second bug is that a load request coming in for a cache line being refilled needs to be handled immediately in the case where it is for the row whose data arrives on the same cycle. If it is not, then it will be handled as a separate cache miss and the cache line will be refilled again into a different way, leading to two ways both being valid for the same tag. This can lead to data corruption, in the scenario where subsequent writes go to one of the ways and then that way gets displaced but the other way doesn't. This bug could in principle show up even without having multiple operations in flight in the dcache. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	6427cab46f	loadstore1/dcache: Send store data one cycle later This makes timing easier and also means that store floating-point single precision instructions no longer need to take an extra cycle. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	4b2c23703c	core: Implement quadword loads and stores This implements the lq, stq, lqarx and stqcx. instructions. These instructions all access two consecutive GPRs; for example the "lq %r6,0(%r3)" instruction will load the doubleword at the address in R3 into R7 and the doubleword at address R3 + 8 into R6. To cope with having two GPR sources or destinations, the instruction gets repeated at the decode2 stage, that is, for each lq/stq/lqarx/stqcx. coming in from decode1, two instructions get sent out to execute1. For these instructions, the RS or RT register gets modified on one of the iterations by setting the LSB of the register number. In LE mode, the first iteration uses RS|1 or RT|1 and the second iteration uses RS or RT. In BE mode, this is done the other way around. In order for decode2 to know what endianness is currently in use, we pass the big_endian flag down from icache through decode1 to decode2. This is always in sync with what execute1 is using because only rfid or an interrupt can change MSR[LE], and those operations all cause a flush and redirect. There is now an extra column in the decode tables in decode1 to indicate whether the instruction needs to be repeated. Decode1 also enforces the rule that lq with RA = RT and lqarx with RA = RT or RB = RT are illegal. Decode2 now passes a 'repeat' flag and a 'second' flag to execute1, and execute1 passes them on to loadstore1. The 'repeat' flag is set for both iterations of a repeated instruction, and 'second' is set on the second iteration. Execute1 does not take asynchronous or trace interrupts on the second iteration of a repeated instruction. Loadstore1 uses 'next_addr' for the second iteration of a repeated load/store so that we access the second doubleword of the memory operand. Thus loadstore1 accesses the doublewords in increasing memory order. For 16-byte loads this means that the first iteration writes GPR RT|1. It is possible that RA = RT|1 (this is a legal but non-preferred form), meaning that if the memory operand was misaligned, the first iteration would overwrite RA but then the second iteration might take a page fault, leading to corrupted state. To avoid that possibility, 16-byte loads in LE mode take an alignment interrupt if the operand is not 16-byte aligned. (This is the case anyway for lqarx, and we enforce it for lq as well.) Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	784d409999	dcache: Add more commentary, no code change Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	128fe8ac26	dcache: Ease timing on wishbone data and byte selects This eliminates a path where the inputs to r1.wb.dat and r1.wb.sel depend on req_op, which depends on the TLB and cache hit detection. In fact they only need to depend on the nature of the request in r0.req (i.e. DCBZ, store, cacheable load, or non-cacheable load). This sets them at the beginning of the code for IDLE state rather than inside the req_op case statement. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	c180ed0af0	dcache: Output separate done-without-error and error-done signals This reduces the complexity of the logic in the places where these signals are used. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	56420e74f3	dcache: Ease timing on calculation of acks remaining This moves the incrementing or decrementing of r1.acks_pending to the cycle after a strobe is output or an ack is seen on the wishbone, and simplifies the logic that determines whether the cycle is now complete. This means that the path from seeing req_op equal to OP_STORE_HIT or OP_STORE_MISS to setting r1.state and r1.cyc now just involves the stbs_done bit rather than a more complex calculation involving the possibly incremented r1.acks_pending. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	dc8980d5a5	dcache: Improve timing of valid/done outputs This makes d_out.valid and m_out.done come directly from registers in order to improve timing. The inputs to the registers are set by the same conditions that cause r1.hit_load_valid, r1.slow_valid, r1.error_done and r1.stcx_fail to be set. Note that the STORE_WAIT_ACK state doesn't test r1.mmu_req but assumes that the request came from loadstore1. This is because we normally have r1.full = 0 in this state, which means that r1.mmu_req can change at any time. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	893d2bc6a2	core: Don't generate logic for log data when LOG_LENGTH = 0 This adds "if LOG_LENGTH > 0 generate" to the places in the core where log output data is latched, so that when LOG_LENGTH = 0 we don't create the logic to collect the data which won't be stored. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	1be6fbac33	dcache: Remove dependency of r1.wb.adr/dat/sel on req_op This improves timing by setting r1.wb.{adr,dat,sel} to the next request when doing a write cycle on the wishbone before we know whether the next request has a TLB and cache hit or not, i.e. without depending on req_op. r1.wb.stb still depends on req_op. This contains a workaround for what is probably a bug elsewhere, in that changing r1.wb.sel unconditionally once we see stall=0 from the wishbone causes incorrect behaviour. Making it conditional on there being a valid following request appears to fix the problem. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	c01e1c7b91	dcache: Update TLB PLRU one cycle later This puts the inputs to the TLB PLRU through a register stage, so the TLB PLRU update is done in the cycle after the TLB tag matching rather than the same cycle. This improves timing. The PLRU output is only used when writing the TLB in response to a tlbwe request from the MMU, and that doesn't happen within one cycle of a virtual-mode load or store, so the fact that the tlb victim way information is delayed by one cycle doesn't create any problems. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
Paul Mackerras	31587affb3	dcache: Do PLRU update one cycle later This does the PLRU update based on r1.cache_hit and r1.hit_way rather than req_op and req_hit_way, which means there is now a register between the TLB and cache tag lookup and the PLRU update, which should help with timing. The PLRU victim selection now becomes valid one cycle later, in the cycle where r1.write_tag = 1. We now have replace_way coming from the PLRU when r1.write_tag = 1 and from r1.store_way at other times, and we use that instead of r1.store_way in situations where we need it to be valid in the first cycle of the RELOAD_WAIT_ACK state. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	5 years ago
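
The store-forwarding bug described in commit 1a9834c506 ("dcache: Fix bug with forwarding of stores") may be easier to follow with a small software model. The sketch below is not the VHDL from the repository. The function names, the (data, mask) tuple layout and the omission of the wishbone input to the multiplexer are all assumptions made for illustration. It contrasts choosing one forwarding stage for the whole 8-byte row with choosing the source for each byte individually.

```python
ROW_BYTES = 8  # the commit message describes a row as 8 bytes


def forward_whole_stage(bram_row, newer, older):
    """Hypothetical model of the old behaviour: one stage is chosen for the
    whole row, so bytes written only by the other stage are lost."""
    out = bytearray(bram_row)
    stage = newer if newer is not None else older
    if stage is not None:
        data, mask = stage
        for i in range(ROW_BYTES):
            if (mask >> i) & 1:
                out[i] = data[i]
    return bytes(out)


def forward_per_byte(bram_row, newer, older):
    """Hypothetical model of the fix: each byte independently takes the
    newest stage whose mask covers it, else the (stale) BRAM byte."""
    out = bytearray(bram_row)
    for i in range(ROW_BYTES):
        for stage in (newer, older):  # the more recent store wins
            if stage is not None and (stage[1] >> i) & 1:
                out[i] = stage[0][i]
                break
    return bytes(out)


# Two back-to-back stores to different bytes of the same row, then a load.
stale = bytes(ROW_BYTES)                         # BRAM not yet updated
store_a = (bytes([0x11] * 4 + [0x00] * 4), 0x0F)  # older store: bytes 0-3
store_b = (bytes([0x00] * 4 + [0x22] * 4), 0xF0)  # newer store: bytes 4-7

buggy = forward_whole_stage(stale, store_b, store_a)
fixed = forward_per_byte(stale, store_b, store_a)
```

In this model `buggy` keeps the stale bytes 0-3 because only the newer store is forwarded, while `fixed` contains the data from both stores, which matches the failure scenario the commit message describes.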


79 Commits (master)