# A 5.3GHz 8T-SRAM with Operation Down to 0.41V in 65nm CMOS

Leland Chang<sup>\*</sup>, Yutaka Nakamura<sup>†</sup>, Robert K. Montoye, Jun Sawada<sup>‡</sup>, Andrew K. Martin<sup>‡</sup>, Kiyofumi Kinoshita<sup>†</sup>, Fadi H. Gebara<sup>‡</sup>, Kanak B. Agarwal<sup>‡</sup>, Dhruva J. Acharyya, Wilfried Haensch, Kohji Hosokawa<sup>†</sup>, and Damir Jamsek<sup>‡</sup>

IBM T. J. Watson Research Center, Yorktown Heights, NY; <sup>†</sup>IBM TCS, Yasu, Japan; <sup>‡</sup>IBM Austin Research Lab, Austin, TX

e-mail: lelandc@us.ibm.com, phone: (914) 945-2329

## Abstract

A 32kb subarray demonstrates practical implementation of a 65nm node 8T-SRAM cell for variability tolerance in high-speed caches. Ideal cell stability allows single-supply operation down to 0.41V at 295MHz without dynamic voltage techniques. Despite a larger cell, array area is competitive with 6T-SRAM due to higher array efficiency. With an LSDL decoder, a gated diode sense amplifier, and design tradeoffs enabled by the 8T cell, 5.3GHz operation at 1.2V is achieved.

# Introduction

As variability concerns mount in future CMOS technologies, SRAM cell stability, which depends on delicately balanced transistor characteristics, inhibits low voltage cache operation without secondary [1] or dynamic [2] supplies. 6T-SRAM thus requires tradeoffs in power and performance to maintain stability and yield. These tradeoffs are eliminated by adding two transistors to the cell (Fig. 1) [3], which provides a disturb-free read mechanism and ideal cell stability while preserving a compact layout. In addition to 1R1W operation, read and write port separation allows each to be independently optimized. This permits improvement of the cell write margin, which, combined with excellent read stability, enables a robust, variation-tolerant SRAM cell. In this work, a high performance 32kb subarray is demonstrated in a 65nm SOI technology with a 0.9µm<sup>2</sup> 8T cell (Fig. 2). Practical array implementation techniques to achieve a design that is area competitive with 6T-SRAM are discussed. Array design to simultaneously achieve 5.3GHz performance and low voltage operation is described.

# Array Design

While the 8T cell removes read stability concerns, cell disturbs in unselected columns during a write event [4] must be addressed by the array architecture. These disturbs arise under column select, where cells on an asserted write wordline that are not being written are half-selected. Column select functionality, which can simplify the array floorplan at the cost of active power and performance, must therefore be disallowed. Instead, 8T arrays should be floorplanned so that all bits in a word are spatially adjacent. In this work, a 128b word is split into 32b quadrants – each with added bits for ECC and redundancy (Fig. 3).

Without column select, column decode circuitry can be removed to improve array efficiency. However, an area penalty arises because bits from different words cannot be physically interleaved. To provide protection from multi-bit soft errors, the ECC/parity code must instead be interleaved (e.g. separate codes for odd/even bits). For a 128b word, the extra bit penalty is small for ECC (~5%) and nearly negligible for parity.
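One way to see where penalties of this size can come from is to count check bits. The sketch below is illustrative only. It assumes a standard SEC-DED Hamming code and compares a single code over the full 128b word with two interleaved codes over the odd and even bits; the text does not state which code was used, and all function names are hypothetical. It also shows why interleaved parity catches an adjacent two-bit upset that a single parity bit would miss.

```python
def sec_check_bits(k):
    """Smallest r with 2**r >= k + r + 1 (Hamming single-error-correcting bound)."""
    r = 0
    while 2 ** r < k + r + 1:
        r += 1
    return r


def secded_check_bits(k):
    """SEC-DED adds one overall parity bit to the SEC code (assumed code type)."""
    return sec_check_bits(k) + 1


def single_parity(word):
    """Parity over all bits of the word."""
    return bin(word).count("1") % 2


def interleaved_parity(word, width=128):
    """Return (parity of even-indexed bits, parity of odd-indexed bits)."""
    even = odd = 0
    for i in range(width):
        bit = (word >> i) & 1
        if i % 2 == 0:
            even ^= bit
        else:
            odd ^= bit
    return even, odd


WORD_BITS = 128

# Check bits for one code over the whole word vs. two interleaved half-word codes.
single_code = secded_check_bits(WORD_BITS)
interleaved_code = 2 * secded_check_bits(WORD_BITS // 2)
extra_ecc = (interleaved_code - single_code) / (WORD_BITS + single_code)

# Parity: one bit for the whole word vs. one bit each for odd and even bits.
extra_parity = (2 - 1) / (WORD_BITS + 1)

# An adjacent two-bit upset (bits 40 and 41) on an arbitrary example word.
word = 0x0123456789ABCDEF0123456789ABCDEF
upset = word ^ (0b11 << 40)
missed_by_single = single_parity(word) == single_parity(upset)
caught_by_interleaved = interleaved_parity(word) != interleaved_parity(upset)

print(single_code, interleaved_code, round(100 * extra_ecc, 1), round(100 * extra_parity, 1))
print(missed_by_single, caught_by_interleaved)
```

Under these assumptions, interleaving raises the check-bit count from 9 to 16 per 128b word, which is roughly 5% more bits per word, while interleaved parity costs a single extra bit (under 1%).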

With separate read/write ports, a read/write multiplexer is not needed, which greatly simplifies local evaluation circuitry. In addition, read and write paths can be independently optimized by sharing the RBL and WBL across different numbers of bits. In Fig. 4, a short RBL (8b) forms a high-speed read path while a long WBL (512b) preserves array efficiency.
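The consequence of the two bitline lengths can be sketched with simple index arithmetic. The mapping below is a hypothetical illustration: it assumes rows are numbered 0 to 511 along a column and that consecutive rows share a local RBL, neither of which is stated in the text.

```python
WBL_CELLS = 512  # cells sharing one write bitline (from the text)
RBL_CELLS = 8    # cells sharing one local read bitline (from the text)

# Number of local read bitline segments stacked along one column.
SEGMENTS_PER_COLUMN = WBL_CELLS // RBL_CELLS


def locate(row):
    """Map a row index to (local RBL segment, position within that segment).

    Assumes consecutive rows share a segment; the real assignment may differ.
    """
    if not 0 <= row < WBL_CELLS:
        raise ValueError("row out of range")
    return divmod(row, RBL_CELLS)


print(SEGMENTS_PER_COLUMN, locate(0), locate(13), locate(511))
```

Each read therefore discharges a bitline loaded by only 8 cells, while a write drives a single bitline spanning all 64 segments, so local evaluation circuitry is needed per read segment but not per write segment.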

Improved 8T array efficiency thus offsets area penalties associated with a larger cell, an extra WL driver, and added ECC bits. As a result, 8T array area is competitive (+7% at equal performance) with that for 6T (Fig. 5). Higher speed arrays with larger 6T cells can show even smaller area penalties.

# Decoder and Sense Amplifier Design

Array peripherals are built in limited switch dynamic logic (LSDL) [5] for optimal performance, power, and area. A BIST structure is built by an automated LSDL synthesis tool. High-speed RWL and WWL decoders use a modified LSDL design (Fig. 6). The predecoder employs NOR-type gates to decode the most significant address bits and to generate a self-timed clock, Read/Write $\varphi_2$, for the WL driver, which itself is a NOR-type gate with a shared foot nFET. A 9-bit address is thus decoded with minimally stacked nFET evaluation networks and a single critical clock signal.
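A purely behavioral sketch of such a two-stage NOR-type decode is given below. It is not the circuit of Fig. 6: the 6b/3b split between predecoder and WL driver is a hypothetical choice, the function names are invented for illustration, and timing, precharge, and the LSDL latch are not modeled. Each NOR gate sees one "mismatch" input per address bit, so its output stays high only when every bit matches.

```python
ADDR_BITS = 9
MSB_BITS = 6                      # hypothetical split; not stated in the text
LSB_BITS = ADDR_BITS - MSB_BITS


def nor(*inputs):
    """NOR-type evaluation: output is 1 only if no input is high."""
    return int(not any(inputs))


def mismatches(value, target, bits):
    """Per-bit mismatch signals (true/complement address lines feeding the NOR)."""
    return [((value >> i) & 1) ^ ((target >> i) & 1) for i in range(bits)]


def predecode(addr):
    """Stage 1: one clock per group of rows, active only if the MSBs match."""
    msb = addr >> LSB_BITS
    return [nor(*mismatches(msb, g, MSB_BITS)) for g in range(1 << MSB_BITS)]


def wordlines(addr):
    """Stage 2: a WL fires if its group clock is active and its LSBs match."""
    clocks = predecode(addr)
    lsb = addr & ((1 << LSB_BITS) - 1)
    wl = []
    for row in range(1 << ADDR_BITS):
        group, local = row >> LSB_BITS, row & ((1 << LSB_BITS) - 1)
        # The group clock plays the role of the self-timed clock gating the driver.
        wl.append(clocks[group] & nor(*mismatches(lsb, local, LSB_BITS)))
    return wl


wl = wordlines(0b101100111)
print(sum(wl), wl.index(1))
```

In this model the evaluation networks are wide rather than tall: every gate is a parallel NOR of a few inputs, which is the behavioral counterpart of keeping nFET stacks minimal.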

A reduced-swing gated diode sense amplifier [6] accelerates global BL evaluation and decreases driver size. In Fig. 7, the sense amp output follows GBL until the clock, $\varphi$, initiates sensing by isolating the output node and amplifying its potential. The negative edge of $\overline{\varphi}$ pulls the output towards ground by capacitive coupling for a partially evaluated logical '0' on GBL. No coupling occurs for a logical '1' since GBL does not fall below the gated diode V<sub>T</sub> and there is no inversion capacitance.

# Experimental Results

At room temperature, measured hardware performs at up to 5.3GHz at 1.2V (Fig. 8). Ideal 8T cell stability enables 295MHz operation at a single 0.41V supply without dynamic voltage techniques. Voltage scaling reduces active power dissipation (Fig. 9) from 146mW (1.2V) to 1.1mW (0.41V).
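A quick arithmetic check relates these numbers to first-order dynamic power scaling. The sketch assumes that each power figure was measured at the maximum frequency quoted for that voltage, which the text does not state explicitly, and it uses the textbook approximation P ∝ C·V²·f with constant switched capacitance; leakage and short-circuit power are ignored.

```python
def energy_per_cycle_pj(power_mw, freq_mhz):
    """Energy per cycle in pJ: mW / MHz gives nJ, times 1000 gives pJ."""
    return power_mw / freq_mhz * 1e3


# Operating points quoted in the text (pairing power with frequency is assumed).
V_HI, F_HI_MHZ, P_HI_MW = 1.2, 5300.0, 146.0
V_LO, F_LO_MHZ, P_LO_MW = 0.41, 295.0, 1.1

e_hi = energy_per_cycle_pj(P_HI_MW, F_HI_MHZ)
e_lo = energy_per_cycle_pj(P_LO_MW, F_LO_MHZ)
energy_ratio = e_hi / e_lo

# First-order prediction of low-voltage power from P ~ C * V^2 * f.
v_squared_ratio = (V_HI / V_LO) ** 2
predicted_lo_mw = P_HI_MW * (V_LO / V_HI) ** 2 * (F_LO_MHZ / F_HI_MHZ)

print(round(e_hi, 1), round(e_lo, 2), round(energy_ratio, 1))
print(round(v_squared_ratio, 1), round(predicted_lo_mw, 2))
```

Under these assumptions, energy per cycle falls from about 27.5pJ to about 3.7pJ, and the C·V²·f estimate of roughly 0.95mW lies close to the measured 1.1mW; the remainder would be consistent with components that do not scale as V²·f, such as leakage.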

Optimization of write pass-gate and read stack strength improves low voltage operation (Fig. 10). Cell write margin is maintained by low-V<sub>T</sub>, short-L<sub>gate</sub> pass-gates and high-V<sub>T</sub>, long-L<sub>gate</sub>, and minimum-W storage inverters. Cell read performance is maintained by a low-V<sub>T</sub>, short-L<sub>gate</sub> read stack.

In the gated diode sense amp, the GBL sensing threshold is set by the pFET gated diode V<sub>T</sub>. A forward body bias lowers V<sub>T</sub>, raises this threshold, and speeds up GBL sensing (Fig. 11). This V<sub>T</sub> shift is especially beneficial in low voltage operation.

While a straight-line active cell layout (Fig. 2a) is lithography-friendly, SOI technology allows a butted junction [7] (Fig. 2b), which enables metal-1 (instead of metal-2) BL wiring. Performance of the two cells is similar (Fig. 12), which suggests comparable BL capacitance and variability tolerance.

# Summary

A practical 8T-SRAM design for high-speed, low-voltage caches is demonstrated. Elimination of cell stability concerns enables operation down to 0.41V and, combined with aggressive design techniques, performance up to 5.3GHz. Improved array efficiency allows area to be competitive with 6T-SRAM.

# Acknowledgement

This work was partially supported by the Maryland Procurement Office (MPO) – Contract 98230-04-C-0920.

# References

- [1] J. Davis, et al., ISSCC, 2006, p. 622.
- [2] K. Zhang, et al., ISSCC, 2005, p. 474.
- [3] L. Chang, et al., Symp. VLSI Tech., 2005, p. 128.
- [4] T. Suzuki, et al., Symp. VLSI Circ., 2006, p. 14.
- [5] R. Montoye, et al., ISSCC, 2003, p. 336.
- [6] W. K. Luk, et al., IEEE Trans. Circ. Sys., May 2005, p. 266.
- [7] E. Leobandung, et al., Symp. VLSI Tech., 2005, p. 126.



Figure 1: An 8T-SRAM cell adds a read stack to the standard 6T cell. This eliminates read stability issues and enables low voltage operation.



Figure 4: The local read BL is 8b long while the write BLs are 512b long. Domino sensing with a static NAND gate is used to drive the global BL.



Figure 2: 0.9µm<sup>2</sup> cell layouts in 65nm SOI technology: a) straight-line active with M2 BL, b) Butted-junction active with M1 BL [7].







Figure 3: Word-oriented floorplan of the



Figure 6: High-speed RWL/WWL decoder for a 9b address (addr<8:0>) using NOR-type LSDL logic.

… at 295MHz at a single 0.41V supply.









Figure 9: Active power drops from 146mW to 1.1mW when V<sub>DD</sub> is reduced from 1.2V to 0.41V.

Figure 10: Low V<sub>DD</sub> performance is improved with low Vt/short L for the write pass-gates and read stack.

Figure 11: A forward body bias on the sense amp gated diode improves speed by raising the sense threshold.

Figure 12: Results for cells with straight-line and butted junction active layouts are nearly identical.