[IEEE IEEE 1999 Custom Integrated Circuits Conference - San Diego, CA, USA (16-19 May 1999)] Proceedings of the IEEE 1999 Custom Integrated Circuits Conference (Cat. No.99CH36327)

A 110 MHz 350 mW 0.6 µm CMOS 16-State Generalized-Target Viterbi Detector for Disk Drive Read Channels

Srinath Sridharan and L. Richard Carley

Carnegie Mellon University, Department of Electrical and Computer Engineering, 5000 Forbes Ave., Pittsburgh, PA 15213 USA

Abstract - An architecture for efficiently implementing linear and nonlinear Viterbi detectors for magnetic read channels is presented. By employing generalized noiseless target values for the Viterbi trellis, the detector is better able to adapt to the actual binary data storage channel and less equalization is needed, resulting in a significant reduction in the probability of error. An implementation example is presented for the case of a 16-state Viterbi detector capable of handling any noiseless target of up to 5 adjacent non-zero values. In a 0.6 µm 3.0 V CMOS process, the design has been implemented with a die area of 9 mm² and consumes under 350 mW of power when operated at 110 MHz.

I. INTRODUCTION

Magnetic data storage read channels employ the Viterbi algorithm for detection of encoded binary data. It has been demonstrated that the maximum likelihood sequence detector (MLSD) is the optimum detector for signals corrupted with intersymbol interference (ISI) and additive white Gaussian noise (AWGN). An MLSD can be implemented as a matched filter followed by a discrete time equalizer and a Viterbi detector [1].

The span of the intersymbol interference (ISI) decides the complexity of the Viterbi detector, as the number of states in the detector increases exponentially with the ISI span. For a binary data storage channel (inputs to the magnetic recording channel can only be +1 or -1 for digital saturation recording), the output of a low pass filter at any sampling instant depends on a window of N transmitted signals, where N is the ISI span. This implies that the output can take only one of $2^N$ possible values if we ignore the noise and residual ISI that corrupt the signal. These possible signal levels are what we refer to as noiseless target values for the Viterbi trellis.
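As an illustration of how the $2^N$ noiseless target values arise in the linear case, the following minimal sketch enumerates every window of N symbols for a given channel response. The function name `noiseless_targets` and the 3-tap response are hypothetical and chosen only for this example; a nonlinear channel would replace the weighted sum with an arbitrary function of the window.

```python
from itertools import product


def noiseless_targets(response):
    """Map each window of N symbols (+1/-1) to its noiseless output.

    `response` is an assumed linear channel response of length N (the ISI span).
    """
    n = len(response)
    targets = {}
    for window in product((-1, +1), repeat=n):
        # window[k] is the symbol weighted by response[k]
        targets[window] = sum(h * a for h, a in zip(response, window))
    return targets


# Hypothetical 3-tap response: an ISI span of 3 gives 2**3 = 8 target values.
example = noiseless_targets([1.0, 0.5, -0.25])
print(len(example))
```

Each additional symbol of ISI span doubles the size of this table, which is the exponential growth in complexity described above.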

The MLSD computes the squared difference between the observed input and the noiseless target value that would be observed for all possible transmitted sequences. Thus at every symbol time there are $2^{N-1}$ sequences that need to be retained. The Viterbi algorithm ensures that these possible survivor sequences emerge from the same state if the detector decisions are taken with sufficient latency. The detector must compute the squared difference between the input and each of the $2^N$ possible noiseless target values at each symbol time. Out of the 2 paths entering a state in the trellis, the one with the smaller sum of squared differences computed with respect to the input sequence is retained. Notice that this computation can be significantly reduced by an appropriate choice of the noiseless target values.

Typically these noiseless target values are selected from a family of partial response (PR) polynomials $(1-D)(1+D)^X$; e.g., PR4 [2], EPR4 [3], or EEPR4 [4]. For EPR4 (X=2) the polynomial coefficients are 0, +1, -1, +2, -2. The noiseless target values in this case turn out to be powers of 2, reducing computation of the squared differences to just bit shift operations. While this equalization dramatically reduces detector complexity, it can also result in a loss in performance (SNR required for a given error rate) at high recording densities, because equalizing the channel response to match the chosen PR target both amplifies the noise power relative to the signal and correlates the noise signal in time. Moreover, equalization to a PR target assumes the channel response to be linear, which might not be true at high recording densities. Note that the Viterbi algorithm allows noiseless targets to be any nonlinear function of the N transmitted interfering symbols. By allowing the PR target to take non-integer coefficients, a better match to the channel is possible, which might result in additional performance gains. These noiseless target values can be obtained by running the least-mean-squares (LMS) algorithm on the forward equalizer and the Viterbi detector simultaneously.
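The text does not spell out the LMS update used for the target values. One plausible form, shown here purely as an assumption, treats the targets as a table and nudges the entry selected by the detector's decisions toward the delayed input sample. The names `lms_update_target` and `mu` (the step size) are hypothetical.

```python
def lms_update_target(targets, index, delayed_sample, mu=0.01):
    """One assumed LMS step on a table of noiseless target values.

    targets        : mutable list of target values
    index          : table entry selected by the detector decisions
    delayed_sample : input sample aligned with those decisions
    mu             : step size (an illustrative value, not from the paper)
    """
    error = delayed_sample - targets[index]
    targets[index] += mu * error  # move the target toward the observed sample
    return error


table = [0.0, 1.0]
err = lms_update_target(table, 1, 2.0, mu=0.5)
print(err, table)
```

The same error value would also drive the equalizer coefficient update, which is omitted from this sketch.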

The purpose of this paper is to demonstrate that a generalized Viterbi detector can be designed that satisfies typical constraints on ISI span, clock speed, die area, and power dissipation when implemented in a modern CMOS process. First we describe the basic architecture of a generalized Viterbi detector, including generation of the error signal needed for decision directed feedback loops. Then we describe a design example that was carried out in a 0.6 µm CMOS process. Finally we conclude with a discussion of the scalability of this architecture as CMOS VLSI continues to decrease in feature size.

0-7803-5443-5/99/$10.00 © 1999 IEEE


II. PROPOSED DETECTOR ARCHITECTURE

Fig. 1 shows the proposed detector architecture, with the main blocks being the branch metric unit (which evaluates the squared differences between the input and the noiseless target values), the add-compare-select (ACS) unit and the path trace-back unit. We now describe each of these blocks separately.

[Figure: block diagram. The digital input and the $2^N$ noiseless target values feed the branch metric unit, which computes $2^N$ squared differences between the input and the noiseless target values. Clocked latches pass these to the $2^{N-1}$ Viterbi add/compare/select units, whose outputs pass through clocked latches to the R-stage Viterbi path trace-back unit (R depends on the decision depth), which produces the binary decision output.]

Fig. 1. Proposed detector architecture.

A. Branch Metric Unit

We assume that the input to the Viterbi detector is an M-bit binary number, which can be generated directly by an A/D converter if all-analog equalization is used, or by a digital equalizer. Typically M is on the order of 6 bits for data storage channels. The $2^N$ noiseless target values are held in latches. For example, for an ISI span of 5 there will be 32 possible noiseless target values. The noiseless target values are also M bits wide. Thus 2M bits are needed to represent the squared difference without introducing truncation errors. However, by running extensive simulations we have determined that retaining P bits (P=5, the 7th bit through the 3rd bit for the case of M=6) for the squared error is sufficient and that additional bits only marginally decrease the bit error rate.

Consider the squared error computation for the case of M=6 and P=5 as shown in Fig. 2. First the modulus of the difference between the input and the noiseless target value is obtained. This value is compared with 11. If the modulus exceeds 11, then the square will exceed 121 (121 can be represented as the 7-bit number 1111001) and so the branch metric is saturated at 30 (11110, ignoring the two LSBs); otherwise the square is evaluated and the 7th bit through the 3rd bit are passed as the branch metric. All $2^N$ branch metrics are latched and then sent to the ACS unit.

[Figure labels: block repeated $2^N$ times; clock; branch metric.]

Fig. 2. Block diagram of Branch Metric unit.
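The saturating squared error computation described for M=6 and P=5 can be sketched in software as follows. This is a behavioral model under stated assumptions, not the hardware datapath: the function name `branch_metric` is hypothetical, and the input and target are assumed to be plain integers on the same scale.

```python
def branch_metric(sample: int, target: int) -> int:
    """5-bit saturating squared error for the M=6, P=5 case."""
    diff = abs(sample - target)  # modulus of the difference
    if diff > 11:
        # The square would exceed 121 (0b1111001), so saturate at 30 (0b11110).
        return 30
    # Keep the 7th through 3rd bits of the square: drop the two LSBs.
    return ((diff * diff) >> 2) & 0x1F


print(branch_metric(20, 9))  # difference of 11: 121 >> 2
```

Because a difference of exactly 11 already yields 30, the saturated value joins the unsaturated range without a jump.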

B. Add-Compare-Select Unit.

The ACS block typically consists of an array of $2^{N-1}$ ACS units, one corresponding to each state of the Viterbi detector. Each of the ACS units consumes the two squared errors associated with that node in the Viterbi trellis (as shown in Fig. 1). The path with the smaller accumulated sum of squared errors is retained. The decisions from the ACS units are then pipeline latched and passed to the path tracing circuit, which determines the detected bits from the sequence of decisions made by each of the ACS units.
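A minimal software sketch of one symbol time of the ACS array is given below. It assumes a conventional shift-register trellis in which a state holds the last N-1 bits and each branch is indexed by the N-bit window formed from the predecessor state and the new bit. This indexing, and the name `acs_step`, are assumptions for illustration and are not taken from the paper.

```python
def acs_step(path_metrics, branch_metrics, n):
    """One add-compare-select update over all 2**(n-1) states (n >= 2).

    path_metrics   : list of 2**(n-1) accumulated metrics
    branch_metrics : list of 2**n squared errors, indexed by the n-bit window
    Returns (new_metrics, decisions); decisions[t] is 0 or 1, naming the
    oldest bit of the surviving predecessor of state t.
    """
    num_states = 1 << (n - 1)
    new_metrics = [0] * num_states
    decisions = [0] * num_states
    for t in range(num_states):
        bit = t & 1                   # newest bit of the destination state
        p0 = t >> 1                   # predecessor whose oldest bit is 0
        p1 = p0 | (1 << (n - 2))      # predecessor whose oldest bit is 1
        m0 = path_metrics[p0] + branch_metrics[(p0 << 1) | bit]
        m1 = path_metrics[p1] + branch_metrics[(p1 << 1) | bit]
        if m1 < m0:
            new_metrics[t], decisions[t] = m1, 1
        else:
            new_metrics[t], decisions[t] = m0, 0
    return new_metrics, decisions


metrics, decisions = acs_step([10, 0, 0, 10], [1] * 8, n=3)
print(metrics, decisions)
```

In hardware the loop body corresponds to one ACS unit, and all units operate in parallel within a clock cycle.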

C. Path Trace Back Unit.

The decisions of the ACS unit at each symbol time are used to retain the $2^{N-1}$ survivor paths. The number of stages in the trace back unit (Fig. 3) is chosen such that there is a very high probability of all the survivor paths emerging from the same state, so that a decision can be taken without any ambiguity.

[Figure labels: decisions from ACS units; muxes.]

Fig. 3. Two stages of trace back unit for a 4-state Viterbi detector.
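The idea that all survivor paths merge when traced back far enough can be sketched in software. The code below assumes a shift-register trellis in which each stored decision bit names the oldest bit of the surviving predecessor state. It is a sequential model only and does not reproduce the latched mux stages of Fig. 3. The name `trace_back` is hypothetical.

```python
def trace_back(decision_history, start_state, n):
    """Follow stored ACS decisions backwards from `start_state`.

    decision_history : list (oldest first) of per-state decision lists
    n                : ISI span, so each state holds n-1 bits
    Returns the state reached at the oldest stored stage.
    """
    state = start_state
    for decisions in reversed(decision_history):
        d = decisions[state]
        # Drop the newest bit and restore the oldest bit chosen by the ACS.
        state = (state >> 1) | (d << (n - 2))
    return state


# With enough stages, every starting state leads back to the same state.
history = [[0, 0, 0, 0], [0, 0, 0, 0]]
print({trace_back(history, s, n=3) for s in range(4)})
```

When the set printed above contains a single state, the decision taken at that depth does not depend on which survivor is followed.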

D. Auxiliary Circuits for Decision Directed Feedback.

The error signal between the input and the observed noiseless target is needed for adaptation of the equalizer coefficients and noiseless target values, timing recovery, DC offset cancellation and automatic gain control. Since the ISI span is N, we need the last N-1 decisions of the detector and the present decision to determine which noiseless target value needs to be subtracted from the delayed input, as shown in Fig. 4. The T-stage delay line is chosen to match the delay between the time an input is received by the detector and the time a decision is generated for that input.

[Figure labels: target values; detector; T stage; M; error signal.]

Fig. 4. Block diagram for error signal generation.
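The error signal path can be modeled behaviorally as a delay line plus a table lookup indexed by the last N decided bits. The class name `ErrorSignal`, the bit ordering of the index, and the zero-filled initial delay line are assumptions made for this sketch.

```python
from collections import deque


class ErrorSignal:
    """Delayed input minus the target selected by the last n decisions."""

    def __init__(self, targets, n, delay):
        if delay < 1:
            raise ValueError("delay must be at least one stage")
        self.targets = targets                    # 2**n target values
        self.mask = (1 << n) - 1
        self.window = 0                           # last n decided bits
        self.delay_line = deque([0] * delay, maxlen=delay)

    def step(self, sample, decided_bit):
        delayed = self.delay_line[0]              # sample from `delay` steps ago
        self.delay_line.append(sample)            # oldest entry is discarded
        self.window = ((self.window << 1) | decided_bit) & self.mask
        return delayed - self.targets[self.window]


gen = ErrorSignal(targets=[0, 1, 2, 3], n=2, delay=2)
print([gen.step(s, b) for s, b in [(10, 1), (20, 0), (30, 1)]])
```

The `delay` parameter plays the role of the T-stage delay line and must equal the detector latency for the subtraction to pair each sample with its own decision.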

III. DESIGN EXAMPLE IN 0.6 µm CMOS

A design example of a 16-state Viterbi detector capable of handling a generalized ISI span of 5 (N=5) was carried out using a 0.6 µm CMOS fabrication technology operating at 3.0 V. For this implementation we chose the digital input to be a 6-bit signal (M=6). The squared errors were 5 bits wide, a width arrived at after extensive simulations at user densities between 2.5 and 3.0 user bits / PW50. The path trace back unit had a depth of 16 stages to keep the probability that the survivor paths had not converged acceptably low.

The squared error computation unit operated with less than 5 ns of delay, which was far less than the delay of the ACS block. The array of ACS units dominates the delay and consumes the most die area and power of all the blocks. An ACS unit adds the two incoming accumulated sums of squared errors associated with that state in the trellis to the two corresponding squared errors determined by the present sample, and then compares the two sums. Since we expect all the paths to have converged to the same state within 16 symbols, the maximum difference that can appear between two path metrics is the sum of 16 5-bit numbers, which gives us 9 bits. By adding one more bit, we can allow the accumulated path metrics to wrap around within 10 bits and still unambiguously determine the smaller of the two sums of squared errors. In our design example two adder blocks add two 5-bit squared errors to the corresponding 10-bit path metrics. The comparison can then be made by taking the difference of these two sums and passing the smaller of the two. This would imply that there are two full adder delays in series (first the add and then the comparison) that set the maximum clock speed. In order to reduce this delay, we start the comparison process along with the addition by using two layers of 3-input 2-output carry-save adders (CSAs) as shown in Fig. 5. The two-tier CSA tree reduces 4 bits in each position to 2 bits without any carry propagation. This is a standard technique for reducing the partial product array in Wallace tree multipliers. A decision can be taken by looking at the 10th bit of the difference of the two sums. This is evaluated by using a block carry look ahead (BCLA) chain to generate the carry into the 10th bit position and using a full adder to generate the decision as shown in Fig. 5. In this manner we can parallelize the comparison with the computation of the two sums.
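The wrap-around comparison can be checked with a short behavioral sketch. It assumes 10-bit path metrics and that the true difference between the two sums is below 512 in magnitude (16 symbols of at most 31 each give at most 496), so the top bit of the 10-bit difference acts as a sign bit. The sketch performs the add and the compare in sequence and does not model the carry-save adders or the BCLA chain. The name `acs_modular` is hypothetical.

```python
WIDTH = 10
MASK = (1 << WIDTH) - 1


def acs_modular(p1, b1, p2, b2):
    """Add-compare-select with path metrics that wrap within WIDTH bits.

    p1, p2 : accumulated path metrics (already reduced modulo 2**WIDTH)
    b1, b2 : squared errors for the current sample
    Returns (decision, survivor_metric); decision 0 keeps path 1, 1 keeps path 2.
    """
    s1 = (p1 + b1) & MASK
    s2 = (p2 + b2) & MASK
    diff = (s1 - s2) & MASK
    # The 10th bit of the difference is set when s1 - s2 is negative.
    s1_smaller = (diff >> (WIDTH - 1)) & 1
    return (0, s1) if s1_smaller else (1, s2)


# Path 1 has wrapped (1020 + 10 = 1030 -> 6), yet path 2 (1005) is still
# correctly identified as the smaller true sum.
print(acs_modular(1020, 10, 1000, 5))
```

Since only the top bit of the difference is needed, the hardware does not have to form the full difference, which is what allows the carry into the 10th bit position to be generated in parallel with the two sums.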

The final component is the path trace back unit. Since every stage in this unit is latched as shown in Fig. 3, the delay associated with this unit is extremely short. Fig. 6 shows the die photograph of the generalized Viterbi detector example with an ISI span of 5 symbols.

The use of generalized targets in the Viterbi trellis resulted in less high-frequency boost in the forward equalizer (as shown in Fig. 7) when compared with PR targets, reducing the probability of bit error by a factor of 4 to 5.

In order to reduce the design time for our explorations we used the EPOCH Datapath Compiler from Cascade Design Automation [5] to generate the layout from their CMOS standard cell library. We expect that significant improvements in die area, power dissipation, and maximum clock speed can be achieved by optimizing transistor sizes and improving the layout and floorplan.

Fig. 5. P1 and P2 are accumulated metrics and b1, b2 are the squared errors for the current input sample.

IV. CONCLUSIONS

In this paper, we have presented a new architecture for implementing Viterbi detectors for binary encoded data that allows the selection of arbitrary linear and non-linear noiseless target values for the recording channel response. In addition, auxiliary error signal generation circuits were described. The design is implemented in roughly 3 mm by 3 mm of die area and dissipates under 350 mW of power when operated at 110 MHz. Using generalized targets results in less noise enhancement by the front end equalizer, leading to a significant reduction in the bit error rate. Alternatively, this could be viewed as a reduction in the channel SNR required to obtain the same bit error rate as with PR targets.

Since this is a completely digital detector design, more advanced CMOS processes would reduce the die area and power dissipation and increase the maximum clock speed. Alternatively, by holding the die area constant we can hope to handle more ISI terms (6 and 7 symbols in the 0.35 µm and 0.25 µm processes) and proceed towards true MLSD detection by constructing generalized Viterbi detectors with large ISI span.

Fig. 6. Die photograph of the 16-state generalized Viterbi detector occupying 3 mm on each side with pad frame.

[Figure: equalizer frequency response curves, including one labeled "With PR target"; horizontal axis: frequency (rads/sec).]

Fig. 7. Front end equalizer frequency response.

V. ACKNOWLEDGEMENTS

We would like to thank Professor Herman Schmit for his help with the CAD Tool Flow. We would like to acknowledge the support of the CMU Data Storage Systems Center. This work was supported in part by NSF under Grant # ECD-8907068. The government has certain rights to this material.

VI. REFERENCES

[1] G. D. Forney, "Maximum-likelihood sequence estimation of digital sequences in the presence of intersymbol interference," IEEE Trans. Inform. Theory, Vol. IT-18, pp. 363-378, May 1972.

[2] D. Coker, R. L. Galbraith, G. J. Kerwin, J. W. Rae, and P. A. Ziperovich, "Implementation of PRML in a Rigid Disk Drive," IEEE Transactions on Magnetics, MAG-27, No. 6, pp. 4538-4544, November 1991.

[3] J. G. Chern, C. Conroy, R. Contreras, et al., "An EPRML Digital Read/Write Channel IC," Digests of Technical Papers, Int. Solid-State Circuits Conference, pp. 320-321, 1997.

[4] R. Yamasaki, M. Palmer, C. Tammel, et al., "A 1-7 code EEPR4 Read Channel IC with an analog Noise Whitened Detector," Digests of Technical Papers, International Solid State Circuits Conference, pp. 316-317, 1997.

[5] EPOCH Tool Suite, Cascade Design Automation, http://www.cdac.com.
