
[IEEE International Symposium on VLSI Technology Systems and Applications - Taipei, Taiwan (8-10 June 1999)] 1999 International Symposium on VLSI Technology, Systems, and Applications



IMPLEMENTATION OF A HIGH SPEED MULTIPORT REGISTER FILE IN A 1.8 V, 0.25 µm CMOS BULK AND SOI TECHNOLOGY

R. V. Joshi, W. Hwang, S. Wilson*, G. Shahidi*, and C. T. Chuang

IBM T. J. Watson Research Center, Yorktown Heights, NY 10598

*IBM Microelectronic Division, East Fishkill, NY 12533

rvjoshi@us.ibm.com

(914-945-1118)

ABSTRACT

Experimental hardware results are presented for a high speed 8-port, 32-word x 64-bit register file in 1.8 V, 0.25 µm CMOS bulk and SOI silicon technology. The register file is designed for bulk technology but is also remapped and fabricated in SOI technology without any body contacts. The register file in SOI achieves more than 20% performance gain over its bulk counterpart.

I. INTRODUCTION

High speed microprocessors require high speed register files. The trend in superscalar and Very Long Instruction Word (VLIW) computer architecture creates a need for multiport register files. This paper presents hardware results for a high performance (> 660 MHz) register file with 8 ports (6 read and 2 write ports, 32 words x 64 bits). The register file is designed and implemented in a 1.8 V, 0.25 µm CMOS bulk technology and then mapped into SIMOX (Separation by Implanted Oxygen) Silicon-on-Insulator (SOI) technology without any body contacts [1]. The major performance advantages over bulk result from smaller diffusion capacitances, increased current drive due to lower threshold voltages, and a reduction in the back-bias effect of stacked devices. However, SOI introduces a number of concerns that are not present in bulk CMOS, most of them due to undesired floating body effects. Unique SOI design challenges such as history dependence of switching delays, lower noise margins, parasitic bipolar discharge current, and pulse stretching in dynamic circuits are considered. In this paper we describe the write ports and circuits in detail and show that SOI technology offers a performance gain of over 20% compared to bulk.

II. CIRCUITS

The architecture and read port circuitry of the register file were described in a recent publication [2]. Here we concentrate mainly on the modification of the self-resetting write operation and on circuit techniques.

Fig. 1 shows a schematic of the two write ports (A-port and B-port), representing the forward path with a 1-bit cross-section. The main stages are the receiver, generation of dual rail signals, selection of addresses using NORs, priority generation for port B, the decoder, and generation of true/complementary write wordlines. There are five input address bits for each write port, with the fifth one as the least significant bit (LSB). The addresses first go through a receiver, a dynamic inverter and then a static inverter. The write circuit is provided with a "write address timing signal" (WATS) which synchronizes the timing of all input addresses. This signal goes through a static NOR, a dynamic inverter and then a static inverter. The output signal (WATSOUT) is ANDed with the nodes "a" (inverted input addresses) and "an" (delayed input addresses), respectively. This generates "true" and "complement" dual rail signals per port which have the same arrival times and pulsewidths, so the write port does not have to rely on the timing of external dual rail signals. These true and complement signals go through a 5-way dynamic NOR. Since these NORs dispense with slow, stacked PFETs, the NOR stage can be made much faster than a static implementation of the same area.
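The address selection described above can be sketched as a purely behavioral model. This is an illustration only, not the authors' circuit: the function names `dual_rail` and `nor_select` are hypothetical, signals are treated as active-high 0/1 values, and all timing, precharge and reset behavior is ignored. It assumes each word's 5-way NOR receives, for every address bit, the rail that goes high when that bit does not match the word.

```python
def dual_rail(addr_bits, watsout):
    """Gate each address bit with WATSOUT to form (true, complement) rails.

    addr_bits: list of 0/1 values, last element is the LSB (as in the text).
    watsout:   0 in standby, 1 during evaluation (assumed polarity).
    """
    rails = []
    for a in addr_bits:
        true_rail = int(watsout and a)
        comp_rail = int(watsout and not a)
        rails.append((true_rail, comp_rail))
    return rails


def nor_select(rails):
    """Model one N-input NOR per word; returns a list of 2**N outputs."""
    n = len(rails)
    outputs = []
    for word in range(1 << n):
        inputs = []
        for i, (t, c) in enumerate(rails):
            want = (word >> (n - 1 - i)) & 1  # bit i of this word, LSB last
            # Feed the rail that is high on a mismatch, so any mismatch
            # pulls the NOR output low.
            inputs.append(c if want else t)
        outputs.append(int(not any(inputs)))
    return outputs


# Address 00101 (decimal 5) during evaluation: only word 5 stays high.
sel = nor_select(dual_rail([0, 0, 1, 0, 1], watsout=1))
print(sel.index(1), sum(sel))
```

In this model all 32 NOR outputs are high when WATSOUT is low, which mirrors a precharged dynamic node; the write enable gating is what keeps the wordlines inactive in standby.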

Fig. 2 shows write enable and priority signal generation. To enable writing, a global "write enable timing signal" (WETS) is provided in addition to the individual write enable for each port, i.e. WEAIN and WEBIN. WETS first goes through a NOR and is then ANDed with the inverted WEAIN and WEBIN signals to produce outputs at nodes "a1" and "b1". The delayed outputs at these nodes are called WEA and WEB and are used for decoding. The logic requires a write into port A when WEAIN is active, or into port B when WEBIN is active. When both signals are active, priority is given to port B. The data is invalid when both signals are in standby. The circuits are designed to implement this logic. WEBIN goes through a NOR, a dynamic inverter and then a static inverter to produce the WEB1 signal, which assures write priority for port B. The output of the B-port NOR (shown in Fig. 1, which selects the write addresses) is gated with the WEB1 signal to produce the "A-inhibit" signal, which is used along with the WEA signal to decode the output of the NOR in port A. These signals thus drive a 3-way pull-down tree which is tapered for performance. When the addresses for the A and B ports are equal, A-inhibit prevents writing into the A port and gives priority to the B port. The output of the B-port NOR is decoded using only the WEB signal. During evaluation one of the signals goes low and the other remains high, while in standby they are both high. Thus, using 5 write addresses along with the write enable signal (WEB), 32 write wordlines per port are generated. The non-inverted and inverted outputs of these decoders form the true and complement wordlines that trigger the complementary passgates for each write port.
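The priority rule can be sketched at the logic level as follows. This is a simplified model under stated assumptions, not the transistor-level circuit: `one_hot` and `write_wordlines` are hypothetical names, enables are treated as active-high, and the A-inhibit term is modeled per word as the B-port select gated with the B enable.

```python
def one_hot(addr, words=32):
    """Stand-in for the NOR address select: 1 only at the addressed word."""
    return [int(w == addr) for w in range(words)]


def write_wordlines(sel_a, sel_b, wea, web):
    """Return (wordlines_a, wordlines_b) with port B taking priority.

    sel_a, sel_b: per-word select outputs for ports A and B.
    wea, web:     write enables for ports A and B (assumed active-high).
    """
    # Port B is decoded using only its own enable.
    wl_b = [int(sb and web) for sb in sel_b]
    # Port A needs its select, its enable, and no A-inhibit on that word.
    wl_a = []
    for sa, sb in zip(sel_a, sel_b):
        a_inhibit = sb and web
        wl_a.append(int(sa and wea and not a_inhibit))
    return wl_a, wl_b


# Both ports enabled on the same word: only the B wordline fires.
wl_a, wl_b = write_wordlines(one_hot(3), one_hot(3), wea=1, web=1)
print(sum(wl_a), wl_b[3])
```

With different addresses both ports write in the same cycle, and with the B enable inactive port A writes even when the addresses match.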

Fig. 3 shows the reset chain for both write ports. The signals generated by this chain (CUT, RSWE, WRS1, WRS3 and WRS4) reset the dynamic nodes once the evaluation is carried out. The reset chain is triggered by the trailing edge of the WETS signal. The "CUT" signal adjusts the pulsewidth of global signals such as WETS and WATS by chopping the pulse if it is too wide. If the pulsewidth is narrower than specification, then the reset signals (such as RSWATS and RSWE) restore it to the desired width. WRS1, WRS2 and WRS4 reset the "true and complement generator" nodes shown in Fig. 1. The "reset", "evaluate" and "static-evaluate" signals are used in the diagnostic mode.

The data written into the cell by either write port A or B is read by 6 read ports with dual rail address inputs. The details of the read ports are described in reference [2]. The output of the register file is fed to a dynamic latch which converts the dynamic output to a static one [2].

III. TECHNOLOGY FEATURES

The SOI structure used for the register file and latch is shown in Fig. 4. The register file and latch are fabricated in a 0.25 µm technology. The devices are partially depleted without body contacts. The silicon thickness is around 221.3 nm and the buried oxide thickness is 400 nm. The junction depths for the pfet and nfet are around 150 nm and 170 nm, respectively. The gate oxide thickness is around 40 nm. For bulk, the junction depths and back end of the line processing are similar to those of the SOI technology. The technology data is summarized in TABLE 1.

IV. RESULTS AND DISCUSSIONS

Fig. 5 shows an optical photograph of a functional chip site fabricated in bulk and SOI. The register file is tested for two operations in the same cycle: read before write (RBW) and write before read (WBR). These two operations are generated by moving the read addresses earlier or later relative to the write addresses and the WETS and WATS signals. Fig. 6 plots cycle time as a function of supply voltage (Vdd) for WBR operation in the same cycle, for SOI and bulk respectively. Cycle time decreases with increasing Vdd and goes beyond the tester limit of 1.5 ns (666 MHz) for both SOI and bulk. While the SOI wafer functions all the way down to 1.2 V, the bulk wafer fails around 1.6 V. The nominal voltage is 1.9 V; the register file in SOI is functional down to 1.3 V with a cycle time of 2.4 ns (417 MHz). Capacitance and threshold voltage reduction are the main reasons for the performance gain over bulk. Very similar behaviour is obtained for RBW operation. These speeds are 20-25% faster than those of the corresponding bulk wafers.
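The cycle times and frequencies quoted above are related by f = 1/T. A small sketch of that conversion follows; the helper name `cycle_time_to_mhz` is hypothetical and used only for illustration.

```python
def cycle_time_to_mhz(cycle_ns: float) -> float:
    """Convert a cycle time in nanoseconds to a clock frequency in MHz."""
    if cycle_ns <= 0:
        raise ValueError("cycle time must be positive")
    return 1000.0 / cycle_ns


# Cycle times quoted in the text: tester limit and SOI at 1.3 V.
for cycle_ns in (1.5, 2.4):
    print(f"{cycle_ns} ns -> {cycle_time_to_mhz(cycle_ns):.1f} MHz")
```

The 1.5 ns tester limit corresponds to roughly 666.7 MHz and the 2.4 ns cycle to roughly 416.7 MHz, in line with the 666 MHz and 417 MHz figures in the text.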

A comparison of access time as a function of Vdd for SOI and bulk is shown in Fig. 7. At the nominal condition (1.9 V, 25 °C) the access time is close to 0.76 ns for SOI, while bulk shows an access time of 1.11 ns. Thus the gain in access time is more than 20%, which is expected from the capacitance reduction and the floating body effect.
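As a quick check on the "more than 20%" figure, the gain can be computed from the two access times quoted above. The sketch assumes the gain is defined as the reduction in access time relative to bulk, which the text does not state explicitly; `access_time_reduction` is a hypothetical helper name.

```python
def access_time_reduction(bulk_ns: float, soi_ns: float) -> float:
    """Fractional reduction in access time of SOI relative to bulk."""
    return (bulk_ns - soi_ns) / bulk_ns


# Nominal-condition access times from the text (1.9 V, 25 C).
gain = access_time_reduction(bulk_ns=1.11, soi_ns=0.76)
print(f"access time reduction: {gain:.1%}")
```

Under this definition the reduction is a little over 31%, consistent with the claim of a gain exceeding 20%.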

Hardware data for high temperatures and voltages is shown in Fig. 8, along with various input patterns. The waveforms at 2.7 V and 140 °C indicate normal behavior (RB2<4>: LSB; OB2<0>: far-end output of the register file). The register file passes various pattern tests at 3.5 V. These results demonstrate functionality of the register file at higher voltages, quite comparable to bulk wafers.

Fig. 9 shows the important function of pulse chopping for both bulk and SOI. Initially the input pulsewidth is 15 ns, which later widens internally due to the RC network and floating body effects in SOI, but is reduced to 0.8 ns by the pulse chopper circuit (not shown here) in the read path. Similarly, a starting input pulsewidth of 0.7 ns or less is kept at a 0.7 ns output pulsewidth. This is achieved by the write input circuit described in Fig. 1. Thus the read and write circuits maintain the output pulsewidth independent of the input pulsewidth. Implementation of such features improves the robustness of the register file remapped into SOI.
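The net effect of chopping and restoring can be idealized as follows. This is a deliberately simplified sketch, assuming the output pulse starts at the input's leading edge and ends a fixed target width later regardless of when the input falls; the real reset chain, its trigger edges and its delays are not modeled, and `reshape_pulse` is a hypothetical name.

```python
def reshape_pulse(rise_ns, fall_ns, target_width_ns):
    """Return (rise, fall) of the reshaped pulse, or None if no pulse.

    A wide input is chopped and a narrow input is restored, so the output
    width equals target_width_ns in both cases (idealized behavior).
    """
    if fall_ns <= rise_ns:
        return None  # no valid input pulse
    return (rise_ns, rise_ns + target_width_ns)


# Read path: a 15 ns input pulse is chopped to 0.8 ns.
print(reshape_pulse(0.0, 15.0, target_width_ns=0.8))
# Write path: a 0.5 ns input pulse is restored to 0.7 ns.
print(reshape_pulse(0.0, 0.5, target_width_ns=0.7))
```

Both cases yield the target width, which is the property the text relies on: output pulsewidth independent of input pulsewidth.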

Fig. 10 shows the pulsewidth variation as a function of Vdd for a cycle time of 5 ns. As the supply voltage increases from 1.5 V to 2 V, the delays across the devices decrease. As a result, the internal reset signal, which controls the pulsewidth chopping, arrives early and chops the read wordline, thus reducing the output pulsewidth. At lower voltages, however, the longer device delays delay the reset signal, producing a larger pulsewidth. This behavior is amplified for bulk wafers, especially at lower than nominal voltages. Since the threshold voltages are constant and there is no reduction in junction capacitance, the reset signal is delayed compared to SOI. As a result, the pulsewidth for bulk is larger than for SOI. Thus greater control over pulsewidth variation can be obtained for SOI due to device drive strengths at low voltages.

Fig. 11 shows the full function of the register file and the latch at 25 °C for both SOI and bulk for WBR operation. The access time (between read address and read output) at 25 °C is 0.760 ns for SOI and 1.11 ns for bulk. As the falling edge of the clock (CLKG) catches the active output of the register file, the latch converts the dynamic output to a static one and holds it until the next falling edge of the clock catches the standby state of the output. The delay between CLKG and the latch output at the far end, DD<0>, is close to 500 ps for SOI and 740 ps for bulk. Further evaluations show that the register file operates around 850 MHz.
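The capture-and-hold behavior of the latch can be sketched with a sampled-waveform model. This is a behavioral assumption for illustration only: `dynamic_to_static` is a hypothetical name, waveforms are equal-length lists of 0/1 samples, and the CLKG-to-output delay is ignored.

```python
def dynamic_to_static(clk, dyn_out):
    """Sample dyn_out on each falling edge of clk and hold it in between.

    clk, dyn_out: equal-length lists of 0/1 samples.
    Returns the held (static) output, starting from 0.
    """
    q = 0
    prev = clk[0]
    held = []
    for c, d in zip(clk, dyn_out):
        if prev == 1 and c == 0:  # falling clock edge
            q = d                 # catch the current register file output
        held.append(q)
        prev = c
    return held


clk = [1, 1, 0, 0, 1, 1, 0, 0]
dyn = [0, 1, 1, 0, 0, 0, 0, 0]  # short active pulse, then standby
print(dynamic_to_static(clk, dyn))
```

The short dynamic pulse is stretched into a level that lasts until the next falling edge finds the output back in standby.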

V. CONCLUSIONS

Implementation of a high performance multiport register file is demonstrated. Frequencies exceeding 660 MHz are shown for both bulk and SOI designs. It is also shown that the use of pulse chopping/restoring circuits, a robust forward/reset path and internal timing margins enable robust operation in SOI technology without body contacts. Full functionality is obtained at high temperatures and voltages, where breakdown is normally a concern for SOI. Comparison of SOI and bulk hardware shows that the performance gain (cycle and access time) exceeds 20% for SOI in 0.25 µm technology.

REFERENCES

[1] D. J. Schepis et al., "A 0.25 µm CMOS SOI Technology and Its Application to 4 Mb SRAM," IEDM Technical Digest, pp. 587-590, Dec. 1997.

[2] W. Henkels, W. Hwang and R. V. Joshi, "A 500 MHz 32-Word x 64-Bit 8-Port Self-Resetting CMOS Register File and Associated Dynamic-to-Static Latch," 1997 Symp. on VLSI Circuits Digest, pp. 41-42, June 1997.


Fig. 1. Forward path for write ports.

Fig. 2. Priority and write-enable generation.

Fig. 3. Reset for write port A and B.

Fig. 4. SOI device structure (buried oxide (BOX) on a p substrate).

Fig. 5. An optical micrograph of a fully functional chip in bulk. Labeled blocks: register file (6 read and 2 write ports), dynamic-to-static latch, read/write decode, array, reset, muxes, and other logic.

Fig. 6. Cycle time vs. Vdd.

Fig. 7. Access time vs. Vdd.


Fig. 8. SOI functionality at high temperature (2.7 V, 140 °C).

Fig. 9. Pulse chopping for bulk and SOI (1.9 V; 500 ps/div).

Fig. 10. Pulsewidth enlargement (horizontal axis: Vdd (V)).

Fig. 11. Functionality of register file and latch for (a) SOI (1.9 V, 25 °C; 1 ns/div), and (b) bulk (1.9 V, 25 °C; 2 ns/div). Traces: RB2<4>, OB2<0>, CLKG, DD<0>.

TABLE 1

P Channel: 170 nm
N Channel: 150 nm
Gate Oxide: 40 nm
No. of Metal Levels: 5
MCBAR: Local Interconnect
M1, M3: Horizontal (Wordline, Reset)
M2, M4: Vertical (Bitline)

Level    Metal    Sheet ρ    Line Capacitance (fF/µm)
MCBAR    W        0.45       0.263
M1       Al-Cu    0.084      0.210
M2       Al-Cu    0.055      0.224
M3       Al-Cu    0.055      0.224
M4       Al-Cu    0.055      0.224