## 14.2 A 4R2W Register File for a 2.3GHz Wire-Speed POWER<sup>™</sup> Processor with Double-Pumped Write Operation

Gary S. Ditlow<sup>1</sup>, Robert K. Montoye<sup>1</sup>, Salvatore N. Storino<sup>2</sup>, Sherman M. Dance<sup>2</sup>, Sebastian Ehrenreich<sup>3</sup>, Bruce M. Fleischer<sup>1</sup>, Thomas W. Fox<sup>1</sup>, Kyle M. Holmes<sup>4</sup>, Junichi Mihara<sup>5</sup>, Yutaka Nakamura<sup>5</sup>, Shohji Onishi<sup>5</sup>, Robert Shearer<sup>2</sup>, Dieter Wendel<sup>6</sup>, Leland Chang<sup>1</sup>

<sup>1</sup>IBM T. J. Watson Research Center, Yorktown Heights, NY, <sup>2</sup>IBM Systems and Technology Group, Rochester, MN, <sup>3</sup>Hoerner & Sulger, Schwetzingen, Germany, <sup>4</sup>IBM Systems and Technology Group, Essex Junction, VT, <sup>5</sup>IBM Systems and Technology Group, Kyoto, Japan, <sup>6</sup>IBM Systems and Technology Group, Boeblingen, Germany

In multi-ported register files, memory cell size grows quadratically with the total number of ports due to wordline and bitline wiring. Reducing the number of physical access ports in a memory cell can thus lead to significant area and power savings as well as latency improvement. Double-pumped register files operate access ports twice in a single clock period to reduce area by halving the number of physical ports in the memory cell, a technique often confined to low-frequency applications. Replication of a memory cell in separate arrays halves the number of physical read ports in each copy. In this work, double-pumped write ports and replicated read ports are applied to a 4R2W register file in a high-performance microprocessor product [1]. This paper describes detailed implementation and measured hardware characteristics of this array and demonstrates a fast error correction scheme. The techniques used balance high efficiency and low latency and thus differ from previous work, in which double-pumped ports perform a write followed by a read in a very large register file [2] or where write ports are double-pumped without cell-level read port reduction [3].

Counter to intuition, double-pumping and cell duplication both work to reduce overall power and area by reducing the number of cell write and read ports, respectively. Due to more efficient wiring and contact sharing, a 2R1W register file cell is ~3 to 4× smaller than a 4R2W cell (Fig. 14.2.1), which reduces cell dimensions and thus both wordline and bitline lengths by nearly a factor of two. The single physical write port is double-pumped (early and late write operations on the rising and falling edge of the clock) to yield two effective write ports. To achieve fast 190ps read latency, the read ports are not double-pumped since a late read operation would not meet latency targets; instead, the 2R1W cell subarray is replicated (with common write operations occurring on two duplicate copies of the data) so that four read ports are functionally achieved while still maintaining low word and bitline capacitances. The two techniques are complementary and mutually inclusive, as both are needed to enable the small 2R1W cell size, which, in turn, enables double-pumped write operation and improved read performance. Even with subarray duplication, the 3 to 4× smaller cell size achieves a near 2× macro-level area reduction over a traditional 4R2W design. This area reduction also results in a corresponding decrease in leakage power. Due to reduced read bitline capacitance and smaller drivers, read power and read bitline latency can both be improved by ~2×. Write power is not dramatically affected as reduced write bitline capacitance balances subarray duplication.
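The port organization described above can be illustrated with a minimal behavioral sketch. It is not the authors' implementation: the class name, the entry count, and the choice that reads return the state left by earlier cycles (rather than modeling any same-cycle bypass) are assumptions made for illustration only.

```python
class DoublePumpedRegFile:
    """Behavioral sketch: two 2R1W copies presenting a 4R2W interface.

    Assumptions (not from the paper): the number of entries, and that reads
    issued in a cycle observe only data written in earlier cycles.
    """

    def __init__(self, entries=32):
        # Two duplicate subarrays; every write is applied to both.
        self.copies = [[0] * entries, [0] * entries]

    def _write_both(self, addr, data):
        for copy in self.copies:
            copy[addr] = data

    def cycle(self, early_write=None, late_write=None,
              reads=(None, None, None, None)):
        """Model one clock cycle.

        early_write / late_write: (addr, data) tuples or None.
        reads: four addresses (or None for an idle port).
        """
        # Read ports 0-1 use copy 0 and ports 2-3 use copy 1, so each
        # physical cell only needs two read ports.
        results = []
        for port, addr in enumerate(reads):
            copy = self.copies[port // 2]
            results.append(None if addr is None else copy[addr])

        # The single physical write port is used twice per cycle:
        # once off the rising clock edge, once off the falling edge.
        if early_write is not None:
            self._write_both(*early_write)
        if late_write is not None:
            self._write_both(*late_write)
        return results


rf = DoublePumpedRegFile(entries=8)
rf.cycle(early_write=(1, 0xAA), late_write=(2, 0xBB))
values = rf.cycle(reads=(1, 2, 2, 1))
```

In this sketch the two copies always hold identical data, which is the property the error recovery scheme relies on.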

As system reliability challenges mount in future technologies, error checking algorithms can add significantly to system-level register file access latency. The availability of two identical copies of the stored data enables error correction in contiguous bits using a simple interleaved parity code (Fig. 14.2.2). Parity bits are generated for small data blocks, which minimizes system-level write latency as compared with traditional ECC generation. If an error is detected in any of the four read ports while reading one of the two copies of the data, a state machine recovers the correct data from the other copy. In such a situation, the Instruction Unit stalls and flushes the pipeline and then inserts a new "select" instruction. This instruction reads the erroneous register from both arrays and rewrites each array with the correct data as chosen by the two parity values (if both arrays have errors, an unrecoverable error is signaled).
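A rough sketch of how interleaved parity plus a duplicate copy can recover from errors in contiguous bits is shown below. The word width, the number of parity groups, the use of even parity, and all function names are assumptions for illustration; the source gives none of these details.

```python
NUM_GROUPS = 4   # assumed number of interleaved parity groups
WIDTH = 32       # assumed data word width


class UnrecoverableError(Exception):
    """Raised when both copies fail their parity check."""


def interleaved_parity(word, width=WIDTH, groups=NUM_GROUPS):
    """Return one even-parity bit per group, packed into an int.

    Group g covers bits g, g + groups, g + 2*groups, ... so a burst of up
    to `groups` adjacent flipped bits lands in distinct groups and is
    always detected.
    """
    parity = 0
    for bit in range(width):
        if (word >> bit) & 1:
            parity ^= 1 << (bit % groups)
    return parity


def select_copy(copy_a, copy_b):
    """Pick the good data word from two (data, stored_parity) pairs.

    Mirrors the "select" step described above: whichever copy passes its
    parity check supplies the data used to rewrite both arrays.
    """
    data_a, parity_a = copy_a
    data_b, parity_b = copy_b
    if interleaved_parity(data_a) == parity_a:
        return data_a
    if interleaved_parity(data_b) == parity_b:
        return data_b
    raise UnrecoverableError("both copies have parity errors")


word = 0xDEADBEEF
stored_parity = interleaved_parity(word)
corrupted = word ^ (0b111 << 5)   # three adjacent bits flipped in one copy
recovered = select_copy((corrupted, stored_parity), (word, stored_parity))
```

Parity alone only detects the error; correction comes from having a second, independently stored copy to fall back on.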

In the double-pumped write path, write ports are operated twice per clock cycle using pulses (LCLK and DCLK) triggered off both edges of the global clock (Fig. 14.2.3). Through static CMOS multiplexors, these two pulses select and combine the early and late versions of the predecoded address. For both the MSB (most significant bits) and LSB (least significant bits) paths, the early and late addresses share predecode drivers and wires, which reduces area and capacitance and results in improved latency. Due to statistical process variation, merging pulses (a logical AND) tends to reduce worst-case pulse widths. To enable high frequency operation, predecoded versions of the MSB and LSB, which are both pulsed signals, are merged as late in the path as possible near the final WWL driver. All circuits in the write address path utilize skewed beta ratios in conjunction with wide WWL wires to minimize latency. To select and merge early and late data for the WBL, transmission-gate multiplexors, which minimize WBL switching caused by input activity, are utilized. A small feedback inverter is also used to latch the WBL state when neither the early nor the late signal is asserted.
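The WBL data selection described above can be summarized as a small logic-level sketch. The function name and the treatment of overlapping selects as an error are assumptions; the sketch ignores all electrical behavior and only captures the select-or-hold function.

```python
def wbl_next(prev, early_sel, early_data, late_sel, late_data):
    """Logic-level sketch of the write-bitline (WBL) mux with keeper.

    prev: current WBL value (0 or 1).
    early_sel / late_sel: select pulses for the early and late write data.
    Assumption: the two select pulses never overlap, so overlap is
    reported as an error here.
    """
    if early_sel and late_sel:
        raise ValueError("early and late selects must not overlap")
    if early_sel:
        return early_data   # early transmission gate conducts
    if late_sel:
        return late_data    # late transmission gate conducts
    return prev             # feedback inverter holds the last value


# One cycle: early write of 1, idle gap, late write of 0, idle gap.
wbl = 0
trace = []
for early_sel, late_sel in [(True, False), (False, False),
                            (False, True), (False, False)]:
    wbl = wbl_next(wbl, early_sel, 1, late_sel, 0)
    trace.append(wbl)
```

Because the line holds its value between pulses, input data activity outside the select windows does not toggle the WBL.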

To achieve a fast read path, the I/O block is placed between the top and bottom halves of each subarray. This enables short global RBLs with a maximum of five dots hierarchically coupled to local RBLs with eight memory cells (Fig. 14.2.4). For robustness, the global RBL uses a standard keeper while the local RBL uses a delayed keeper to speed evaluation. Further read latency improvement is achieved by the use of a fast dynamic latch with a built-in NAND function to combine the upper and lower global RBL signals as well as logical reoptimization of the bypass multiplexor to minimize further inversions from the inverted read port of the 2R1W cell. Predecoder latency is improved by skewing beta ratios in all circuits handling the MSB, which are clocked half-cycle pulses that set RWL timing. Predecoding of the LSB is performed with static gates, with the last stage of gain for each 16b LSB group placed close to the RWL driver to sharpen slew rates.

To achieve double-pumped write operation at multi-GHz-range frequencies, early and late pulse widths and separations were tuned to minimize latency while ensuring robustness. Critical timing margins (Fig. 14.2.5) are ensured between successive writes (early followed by late in the same cycle and late followed by next-cycle early) and in a read-after-write (late write followed by next-cycle read) through rigorous Monte Carlo-based statistical simulation. While the early and late WWL pulse widths must both be sufficient to complete a cell write operation, it must also be guaranteed that a WWL pulse falls sufficiently to avoid collision with an immediately succeeding WWL pulse. During this separation time, the WBL must transition while maintaining write data setup and hold times. A nominal WWL pulse width/separation target of 90/127ps was found to satisfy all critical timing requirements. Read-after-write critical timing must ensure that the late cell write has reached completion before assertion of the RWL. Process-induced variability and RC delays across the array were considered in this analysis as well as jitter on the back-edge of the clock (off which the late clock pulse is derived), which is especially important since such mid-cycle clock uncertainty is traditionally not well controlled in front-edge-triggered clock networks.
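As a quick arithmetic check, the nominal 90/127ps target can be compared with the half-cycle time at the 2.3GHz product frequency. This is a sketch under assumptions: it takes "separation" to mean the gap from the fall of one WWL pulse to the rise of the next, assumes the nominal target applies at 2.3GHz, and the function names are hypothetical.

```python
def half_cycle_ps(freq_ghz):
    """Half of the clock period in picoseconds."""
    return 1000.0 / freq_ghz / 2.0


def half_cycle_slack_ps(freq_ghz, width_ps=90.0, separation_ps=127.0):
    """Half-cycle time minus one WWL pulse plus one separation gap.

    Two write pulses per cycle means each pulse and the gap that follows
    it must fit within half a clock period.
    """
    return half_cycle_ps(freq_ghz) - (width_ps + separation_ps)


budget = half_cycle_ps(2.3)         # about 217.4 ps
slack = half_cycle_slack_ps(2.3)    # about 0.4 ps
```

Under these assumptions, one pulse plus one gap (217ps) accounts for essentially the whole half cycle at 2.3GHz, which is consistent with two evenly spaced write pulses per clock period.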

Measured results in 45nm SOI-CMOS demonstrate operation of the double-pumped write register file at up to 2.76GHz at a supply voltage of 0.9V (Fig. 14.2.6). This results in a read latency of 190ps, active power dissipation with all ports active of ~28mW, and leakage power of ~31mW. 1.6GHz operation is maintained down to a supply voltage of 0.7V.

## Acknowledgement:

The authors thank R. Redmond and R. R. Robertazzi for characterization assistance.

## References:

[1] C. Johnson, et al., "A Wire-Speed Power<sup>™</sup> Processor: 2.3GHz 45nm SOI with 16 Cores and 64 Threads," *ISSCC Dig. Tech. Papers*, pp. 104-105, Feb. 2010.

[2] E. S. Fetzer, et al., "A Fully Bypassed Six-Issue Integer Datapath and Register File on the Itanium-2 Microprocessor," *IEEE J. Solid-State Circuits*, vol. 37, no. 11, pp. 1433-1440, Nov. 2002.

[3] D. Wendel, et al., "The Implementation of POWER7<sup>™</sup>: A Highly Parallel and Scalable Multi-Core High-End Server Processor," *ISSCC Dig. Tech. Papers*, pp. 102-103, Feb. 2010.



Figure 14.2.1: Implementation comparison: Standard 4R2W cell vs. 2 copies of a 2R1W cell.






