
[IEEE 2006 Proceedings of the 32nd European Solid-State Circuits Conference - Montreaux, Switzerland (2006.09.19-2006.09.21)] 2006 Proceedings of the 32nd European Solid-State Circuits


A 9GHz 320x80bit Low Leakage Microcode Read Only Memory in 65nm CMOS

Steven K. Hsu, Amit Agarwal, Sanu K. Mathew, Ram K. Krishnamurthy

Circuit Research Labs Intel Corporation

Hillsboro, OR, USA [email protected]

Martin Hansson, Atila Alvandpour

Electronic Devices, Dept. of EE. Linköping University Linköping, Sweden [email protected]

Abstract— This paper describes a 320x80bit microcode ROM for 9GHz operation in 1.2V, 65nm CMOS technology. An extended pre-fabrication technique is proposed, which allows optimal programming of local/global merge circuitry, pre-charge devices, column-select multiplexers, and wordline driver strengths. The proposed technique enables 32% leakage and 15% total array power reduction without any delay or area penalty.

I. INTRODUCTION

Read only memories (ROMs) are essential components in high-performance microprocessors, DSPs, and special-purpose accelerators, storing fixed information such as microcode instructions for complex micro-operations or data coefficients for digital filters. ROM active/leakage power reduction has become important because of power-constrained high-performance designs and low-power portable systems requiring longer battery life [1]. Technology scaling results in exponentially increasing gate and sub-threshold leakage power, requiring more aggressive low-leakage power schemes [2]. The leakage power component constitutes 36% of the overall conventional ROM power and becomes an even larger component as leakage current increases for sub-65nm designs (Fig. 1). The majority of the leakage power in a conventional ROM design is dissipated in the drivers for wordline, column-select, and pre-charge signals, as shown in Fig. 2. Moreover, conventional ROM design takes a worst-case approach in which drivers, pre-charge devices, and column-select multiplexers are sized for a fully populated ROM array (all cells logic ‘1’). In a typical fabricated ROM, however, the memory array is sparsely populated, which leaves the drivers oversized.

This paper describes a 9GHz 25.6K microcode ROM implemented in a 1.2V, 65nm CMOS technology [3]. A programmable logic technique is proposed, which utilizes the data heuristics of the microcode to allow optimal programming of local/global merge circuitry, pre-charge devices, column-select multiplexers, and wordline driver strengths. The proposed technique enables low total power consumption of 101.1mW with 30.6mW leakage component and a dense layout occupying 0.088mm² (Fig. 11).

[Bar chart: normalized ROM array power, split into active and leakage components.]

Figure 1. ROM leakage power breakdown.

[Pie chart of leakage components: WL/CS drivers 47%, LBL 18%, LBL NAND 13%, PC drivers 12%, GBL 6%, SDL NAND 4%.]

Figure 2. ROM leakage components and total power.

The paper is organized as follows: Section II describes the organization of the implemented ROM, followed by a discussion in Section III of the stored ROM data heuristics. Section IV describes the proposed programmable logic technique, while comparison results are presented in Section V. Finally, the paper is concluded in Section VI.

1-4244-0303-4/06/$20.00 ©2006 IEEE.

II. ROM ORGANIZATION

Fig. 3 shows the organization of the 25.6K bit high-performance microcode ROM, consisting of 320-entries x 80-bits. A gate connection to the wordline (WL) and a drain connection to the bitline (BL) represent a bit-cell (BC) layout storing a logic ‘1’. Removing direct connections of the gate and drain from the WL and BL, respectively, represents a BC layout storing a logic ‘0’. Prior to fabrication, a ROM contains a fully populated array (all ‘1’) and is ready for programming. During microcode programming the ‘0’ BC layouts replace the appropriate ‘1’ BC layouts, thus finalizing the layout before fabrication.

A complete read operation is performed in two cycles. A 2-phase 50% duty-cycle clocking plan allows seamless time borrowing at the phase boundaries. In the first cycle, a partially decoded 9-bit address delivers 2x100 control signals, of which 2x80 drive the word-lines (WL) directly and 2x20 drive column-select (CS) multiplexers. In the next cycle, WL buffers drive across two 40-bit arrays and bitline evaluation starts.

[Block diagram showing Bank 0 through Bank 4, WL (8x) and CS (4x) drivers, clocked BCL, LBL, and GBL stages, and the SDL NAND output stage (80x).]

Figure 3. ROM Organization.

[Schematic: four (4x) bit-cell lines (BCL) with cells storing ‘0’ or ‘1’, the precharge (PC) devices for the bit-cell lines, the column-select (CS) MUX, and the LBL precharge device.]

Figure 4. LBL including CS-MUX and PC devices.

Fig. 4 shows a bit-cell line (BCL), which contains 8 BCs and individual pre-charge transistors. By using a 4-to-1 CS multiplexer 4 BCLs are merged together to form a 32-bit local bitline (LBL). Two LBLs are merged via a static NAND gate that drives a single global bitline (GBL) pull-down. Finally, a 2-way and a 3-way GBL are merged with a SDL-NAND gate, which forms a 5-way GBL producing a static output.
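The hierarchy described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant names and the `locate` helper are hypothetical, and the ordering of the address fields (GBL way, then LBL, then BCL, then bit-cell) is an assumption, since the paper does not give the decoder's bit assignment.

```python
# Hierarchy parameters taken from the text of Section II.
BCS_PER_BCL = 8      # bit-cells on one bit-cell line
BCLS_PER_LBL = 4     # BCLs merged by the 4-to-1 CS multiplexer
LBLS_PER_NAND = 2    # LBLs merged by one static NAND gate
GBL_WAYS = 5         # 2-way + 3-way GBL merged by the SDL NAND
BITS = 80            # output word width

# Entries reachable through one output bit.
ENTRIES = BCS_PER_BCL * BCLS_PER_LBL * LBLS_PER_NAND * GBL_WAYS


def locate(entry):
    """Split an entry index into (way, lbl, bcl, bc).

    The field order is an assumption made for illustration.
    """
    if not 0 <= entry < ENTRIES:
        raise ValueError("entry out of range")
    entry, bc = divmod(entry, BCS_PER_BCL)
    entry, bcl = divmod(entry, BCLS_PER_LBL)
    way, lbl = divmod(entry, LBLS_PER_NAND)
    return way, lbl, bcl, bc


# Node totals across all 80 output bits.
bcl_total = BITS * ENTRIES // BCS_PER_BCL
lbl_total = bcl_total // BCLS_PER_LBL
nand_total = lbl_total // LBLS_PER_NAND

print(ENTRIES, bcl_total, lbl_total, nand_total)
print(locate(319))
```

Under these assumptions the node totals agree with the "Total" column of Table I (3.2K BCLs, 800 LBLs, 400 NAND/GBL nodes).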

III. MICROCODE DATA HEURISTICS

The heuristics of the finalized ROM are shown in Table I. After programming, 80% of the bit-cells in the finalized ROM store a logic ‘0’, which removes up to 155 BC loads from a single WL, as shown in the histogram in Fig. 5. Conventional WL driver strengths allow for full population of 160 BC loads; given the actual programmed WL loads, however, the WL drivers end up oversized. Since 48% of the BCLs always read ‘0’ and remain constant at Vcc, 24% of the LBLs remain constant at Vcc, and 14% of the LBL-NAND gate outputs remain constant at Vss. Conversely, 2.5% of the BCLs always read a ‘1’ and remain constant at Vss, forcing 0.5% of the LBLs to remain constant at Vss. These constant-voltage BCL, LBL, and LBL-NAND output nodes never transition because of the programmed data; therefore, many pre-charge (PC) transistors, CS transistors, static NAND gates, and GBL pull-downs are unnecessary and perform no useful operation. These unused devices, together with the oversized WL drivers, contribute greatly to the total leakage, as shown in Fig. 2.

[Histogram: number of occurrences (0 to 40) versus number of bit-cells connected to WL (5 to 155).]

Figure 5. Histogram of WL loading after BC programming.

TABLE I. MICROCODE ROM HEURISTICS.

            Total    Storing ‘0’      Storing ‘1’
Bits        25.6K    20577 (80.4%)    5023 (19.6%)
BCL         3.2K     1544 (48%)       81 (2.5%)
LBL         800      193 (24%)        4 (0.5%)
NAND/GBL    400      56 (14%)         0 (0%)
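The kind of node classification summarized in Table I can be sketched as a scan over the programmed data. This is a minimal illustration rather than the authors' flow: the function name is hypothetical, the toy column is made-up data, and it assumes that the cells sharing a BCL, LBL, or NAND gate occupy consecutive entries of one output bit.

```python
def count_constant_groups(column, group_size):
    """Count groups of consecutive cells that are all '0' or all '1'.

    `column` holds the stored bits (0 or 1) of one output bit, indexed
    by entry. `group_size` is 8 for a BCL, 32 for an LBL, and 64 for
    the pair of LBLs feeding one LBL-NAND gate.
    """
    all_zero = all_one = 0
    for start in range(0, len(column), group_size):
        cells = column[start:start + group_size]
        if not any(cells):
            all_zero += 1   # node never discharges and stays at Vcc
        elif all(cells):
            all_one += 1    # treated in the text as constant at Vss
    return all_zero, all_one


# Toy 64-entry column (not real microcode): 32 zeros, 8 ones,
# then 24 alternating cells.
toy = [0] * 32 + [1] * 8 + [0, 1] * 12

bcl_counts = count_constant_groups(toy, 8)
lbl_counts = count_constant_groups(toy, 32)
nand_counts = count_constant_groups(toy, 64)
print(bcl_counts, lbl_counts, nand_counts)
```

A group counted as all-zero is a candidate for a direct connection to Vcc, and an all-one group for a direct connection to Vss.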


[Schematic: a bit-cell line storing only ‘0’ (4x) with its unused precharge transistor and unused CS-MUX device, and a clocked decoder (DEC) driving WL/CS through a driver whose legs marked * are removable to reduce driver strength.]

Figure 6. Proposed programmable logic technique.

IV. PROGRAMMABLE LOGIC TECHNIQUE

Based on the data heuristics of the ROM microcode presented in Table I, pre-fabrication layout programming need not be limited to BC programming. A programmable logic technique is proposed, which allows optimal programming of the local/global merge circuitry, PC devices, CS multiplexers, and WL driver layouts attached to nodes that remain at a constant Vcc or Vss. The proposed programmable logic technique is depicted in Fig. 6.

A. Removal of unused devices

A BCL node that always reads a ‘0’ allows a direct layout connection to Vcc, which removes the unnecessary PC devices and one CS device. Similarly, an LBL node that always reads a ‘0’ allows the removal of its 4 CS and PC devices, creating a direct connection to Vcc. When a BCL always reads a ‘1’, all 8 BCs and the PC devices are removed and a direct connection to Vss is made. CS and PC devices are likewise removed when an LBL always reads a ‘1’, creating a direct connection to Vss. BC, CS, and PC device removal achieves a 22% reduction of the LBL leakage component. Finally, the removal of unnecessary LBL-NAND merge gates and GBL pull-downs results in a 15% reduction of the respective leakage component.

B. Optimization of WL/CS and PC driver strengths

After removal of BC, CS, and PC devices, the loading on the respective CS/WL and PC drivers decreases because there is less gate load. Because loading varies across the WLs, 4 differently sized WL buffer layout designs are proposed, which drive 40, 80, 120, and 160 BCs with a delay less than or equal to that of the 160-BC worst case. Fig. 7 shows the number of occurrences of each WL driver strength; the majority of WL drivers drive only up to 40 BCs. Removing additional legs of the devices in the WL driver layout creates the 3 additional layout cells, as shown in Fig. 6. This enables the WL driver strength to be matched to the actual programmed load. Applying the same approach, three additional drivers are designed for each of the CS and BCL-PC drivers, with strengths optimized for 25%, 50%, and 75% of the fully populated load. The proposed driver optimization technique enables further leakage power reduction in the CS and PC drivers. Fig. 8 and Fig. 9 show the distribution of drivers before and after the reprogramming, which reduces WL/CS driver leakage by 52% and PC driver leakage by 25%.
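The driver selection described in this subsection amounts to choosing the smallest available driver variant that covers the programmed load. The sketch below is one hedged way to express that rule; `pick_driver` and the constant names are hypothetical, and the CS variant sizes of 20, 40, 60, and 80 are read from the axis of Fig. 8 (25%, 50%, 75%, and 100% of a full load of 80) rather than stated in the text.

```python
# Rated loads of the four WL buffer variants (number of BCs), from the text.
WL_DRIVER_SIZES = (40, 80, 120, 160)

# Assumed rated loads of the CS driver variants (number of CS transistors).
CS_DRIVER_SIZES = (20, 40, 60, 80)


def pick_driver(load, sizes=WL_DRIVER_SIZES):
    """Return the smallest driver variant whose rated load covers `load`."""
    if load < 0:
        raise ValueError("load must be non-negative")
    for size in sorted(sizes):
        if load <= size:
            return size
    raise ValueError("load exceeds the fully populated worst case")


# Example: programmed WL loads mapped to driver variants.
for load in (5, 40, 41, 155):
    print(load, "->", pick_driver(load))
```

A WL with only 5 connected BCs therefore gets the 40-BC variant, while one with 155 connected BCs still needs the full 160-BC driver.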

[Histogram: number of occurrences (0 to 140) versus number of BCs connected to WL (40, 80, 120, 160).]

Figure 7. WL driver histogram for the proposed ROM.

[Histogram: number of occurrences (0 to 50) versus number of CS transistors connected to driver (20, 40, 60, 80), for the conventional and proposed ROMs.]

Figure 8. CS driver histogram for conventional and proposed ROM.

[Histogram: number of occurrences (0 to 25) versus number of BCL precharge devices connected to driver (40, 80, 120, 160), for the conventional and proposed ROMs.]

Figure 9. PC driver histogram for conventional and proposed ROM.


V. COMPARISON RESULTS

The conventional and proposed ROM arrays have been implemented in a 1.2V, 65nm CMOS technology [3]. The conventional microcode ROM incorporates selective insertion of long channel transistors in the BC and GBL pull-down transistors. By using the proposed programmable logic and incorporating the microcode heuristics in Table I, the ROM array leakage reduces by 32%, from 45.1mW down to 30.6mW, and overall array power reduces by 15%, from 118.5mW down to 101.1mW, at a clock frequency of 9GHz, as shown in Table II. The power break-up in Fig. 10 further illustrates the leakage reduction, most of which occurs in the WL/CS drivers. Furthermore, in a fully nominal channel length ROM design, maximum frequency increases by 8% to 9.7GHz (Table II). In that design, the proposed programmable logic technique reduces the leakage power by 27%, from 61.7mW down to 44.8mW, and the total ROM array power by 14%, from 141.4mW down to 121.4mW.

TABLE II. ROM POWER AND FREQUENCY COMPARISONS.

Microcode ROM                                 Leakage Power (mW)   Total Power (mW)   Max. Freq. (GHz)
Inserted long channel length, Conventional    45.1                 118.5              9.0
Inserted long channel length, Proposed        30.6                 101.1              9.0
All nominal channel length, Conventional      61.7                 141.4              9.7
All nominal channel length, Proposed          44.8                 121.4              9.7
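The percentage reductions quoted in this section follow directly from the Table II entries. The short check below recomputes them; the helper name and dictionary keys are illustrative, and only the numeric values come from the paper.

```python
def reduction_pct(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# (conventional, proposed) pairs in mW, copied from Table II.
table_ii = {
    "long-channel leakage": (45.1, 30.6),
    "long-channel total": (118.5, 101.1),
    "nominal-channel leakage": (61.7, 44.8),
    "nominal-channel total": (141.4, 121.4),
}

for name, (conventional, proposed) in table_ii.items():
    print(f"{name}: {reduction_pct(conventional, proposed):.1f}% lower")

# Frequency gain of the all-nominal design over the long-channel design.
freq_gain_pct = 100.0 * (9.7 - 9.0) / 9.0
print(f"max. frequency: {freq_gain_pct:.1f}% higher")
```

Rounded to whole percentages, the results agree with the 32%, 15%, 27%, 14%, and 8% figures quoted in the text.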

[Stacked bar chart: normalized leakage power (0% to 100%) for the conventional and proposed ROMs, broken into PC drivers, SDL NAND, GBL, LBL NAND, LBL, and WL/CS drivers.]

Figure 10. ROM leakage power break-up.

[Layout plot with Array, Decoder, and Array regions labeled.]

Figure 11. Microcode ROM layout.

VI. CONCLUSION

A 9GHz 25.6K microcode ROM implemented in 1.2V, 65nm CMOS technology is presented. The proposed ROM design consumes 101.1mW total array power with a low leakage component of 30.6mW, showing a good sub-65nm scaling trend. An extended pre-fabrication programmable logic technique is proposed, which removes unused devices and enables optimized driver strengths. The proposed programmable logic technique enables 32% leakage power and 15% total array power reduction without delay or area penalty.

REFERENCES

[1] B.-D. Yang, L.-S. Kim, “A Low-Power ROM using Charge Recycling and Charge Sharing”, in Digest of Technical Papers IEEE International Solid-State Circuits Conference, vol. 1, pp. 108-109, February 2002.

[2] T. Ghani, K. Mistry, P. Packan, S. Thompson, M. Stettler, S. Tyagi, M. Bohr, “Scaling Challenges and Device Design Requirements for High Performance Sub-50 nm Gate Length Planar CMOS Transistors”, in Symposium on VLSI Technology Digest of Technical Papers, pp. 174-175, June 2000.

[3] P. Bai, et al., “A 65nm Logic Technology Featuring 35nm Gate Length, Enhanced Channel Strain, 8 Cu Interconnect Layers, Low-k ILD and 0.57 µm² SRAM Cell”, IEEE International Electron Devices Meeting Technical Digest, pp. 657-660, December 2004.
