[IEEE 2009 IEEE Circuits and Systems International Conference on Testing and Diagnosis - Chengdu , China (2009.04.28-2009.04.29)] 2009 IEEE Circuits and Systems International Conference

A Low Overhead Fault Detection and Recovery Method for the Faults in Clock Generators

Nima Karimpour Darav
Islamic Azad University of Lahijan Branch
Lahijan, Iran
Email: [email protected]

Mohammad Amin Amiri
Iran University of Science and Technology
Tehran, Iran
Email: [email protected]

Alireza Ejlali
Sharif University of Technology
Tehran, Iran
Email: [email protected]

Abstract—In many synchronous digital systems, especially those used in mobile applications, the system is exposed to severe shaking that may lead to a failure in the clock generator. In this paper we present an effective method to tolerate faults on the clock signal that are due to defects in external oscillators. Our technique uses no Phase-Locked Loops (PLLs), no Delay-Locked Loops (DLLs), and no high-frequency oscillators, so it needs neither additional effort to meet Electro-Magnetic Compatibility (EMC) requirements nor extra hardware to implement DLLs. We have formally evaluated the meta-stability of our technique. This evaluation shows that our technique reliably meets the timing requirements. Furthermore, our simulation-based fault injection experiments show that our technique can tolerate all single faults on clock sources that lead to a permanent stuck-at fault and masks almost 49 percent of intermittent faults.

I. INTRODUCTION

Embedded digital systems that are employed in defense, industrial, and consumer applications often have severe fault-tolerance requirements. On the other hand, these systems are usually used in mobile applications where the system is exposed to severe shaking. This increases the physical defect rate, and some defects, such as pad detachment or breaking, may lead to the failure of the clock source. The reliability of clock signals is highly important because a digital system will definitely fail if its clock signal fails. Therefore, designing a highly reliable clock signal for these systems is a crucial issue.

As a new trend in SOC-based design [1], some device vendors produce chips that utilize internal clock sources [2], [3]. This mitigates the problem but cannot solve it completely, because in most applications there is still more than one chip that must be synchronized by a clock source (for example, systems that use at least a microprocessor and an external synchronous memory), and this necessitates the use of external clock sources. There have been some attempts to mitigate or solve the problem when an external clock generator is used [4], [5], [6]. In [4], a locked-loop circuit is provided to maintain phase alignment between a backup clock signal and a master clock signal. Failure modes within the master and backup clocks are determined through a DLL or a PLL embodied in the master and backup clock generators. Since PLLs and DLLs are inherently noise sensitive [7], vulnerability to noise is the main drawback of this solution.

Reference [5] presents a clock voter that utilizes a high-frequency clock input to vote between asynchronous clock inputs. In [5], when a fault is detected on a clock input, that clock input is prevented from contributing to the voted output. The use of a high-frequency clock input for voting is a drawback of that work, because a high-frequency clock input needs more effort to meet Electro-Magnetic Compatibility requirements and is more sensitive to noise.

Reference [6] presents a digital fault-tolerant clock that votes between clock generators. Each clock generator contains an oscillator with a frequency much higher than the frequency of the output clock. Although this seems to be a good approach to designing a fault-tolerant clock generator, like the solution presented in [5] it needs more effort to meet EMC/EMI requirements in its implementation because it uses a high-frequency oscillator. Moreover, for high-frequency applications, its implementation is not feasible because of technology limits.

In this paper we present a low-overhead solution for faults that are due to the disconnection of external oscillators or that lead to permanent stuck-at faults. Throughout this paper, we assume that only one clock may fail. Two circuits based on a primary clock with a backup are presented. The first circuit has some drawbacks, but it is the building block of the second one, which has none of the disadvantages of the first. Neither circuit uses PLLs, DLLs, or oscillators with a frequency much higher than the needed frequency. Moreover, our solution has a simple structure, so it can be implemented in any synchronous digital system.

The remainder of this paper is organized as follows. In Section II we describe our proposed solution in two subsections. In Section III we present the mechanism used to evaluate our second circuit and its results. In Section IV, our circuit is formally evaluated from the viewpoint of the meta-stability problem. Finally, Section V concludes the paper.

II. THE PROPOSED SOLUTION

We initially assume that the target digital system is synchronized by a single clock. We intend to design a fault-tolerant clock source by utilizing two regular clock sources with almost the same frequency. The clock sources follow the single stuck-at fault model, and only one clock source may fail. Moreover, we assume that after the clock source is recovered, the state of the system may be recovered by applying other methods such as roll-back checkpointing techniques [8], [9]. Since each clock source may have two types of fault (stuck-at-1 and stuck-at-0) and only one clock source may fail, the logic we need to design is a circuit that detects a stuck-at fault on the first clock (the primary) and then switches to the second clock (the backup) as the driver of the system. Fig. 1 shows the general structure of our solution.

Fig. 1. The general structure of our solution.

978-1-4244-2587-7/09/$25.00 ©2009 IEEE

A. The Building Block Circuit

The technique we propose here is to use clk2 as a hot spare clock source, where the primary clock source is clk1. When clk1 fails, clk2 is selected as the clock source of the system. The main concern is how to detect a fault on clk1 logically, given that a digital fault detector itself needs a clock source.

Fig. 2 shows a circuit that detects faults on clk1 and then masks them by switching the output from clk1 to clk2. The circuit samples clk1 and then compares two samples taken from two consecutive periods through an XOR gate. If there is a fault on clk1, the output of the XOR gate will be zero and the circuit will select clk2 as the main clock source of the system; otherwise the circuit keeps clk1 as the main clock source. Therefore, we can detect a fault on clk1 and mask it with clk2. In the following, we describe how the proposed circuit works.

The main concept behind the circuit in Fig. 2 is to sample clk1 twice per period by means of clk2 (at both of its edges), and then to examine two consecutive samples to detect a transition of clk1. If clk1 does not have a transition in a period, two consecutive samples will be the same, which means clk1 has failed. On the other hand, if clk1 has a transition in a period, two consecutive samples will be different, which means clk1 is still alive. To achieve this while avoiding sampling clk1 at each edge of clk2, we divide the frequency of clk1 by means of a T flip-flop (TFF in Fig. 2) so that sampling at a single edge is sufficient to detect a fault. After the frequency is divided, the output of the T flip-flop is sampled through a simple synchronizer [10] at the positive edges of clk2 (DFF1 and DFF2), and then two consecutive samples taken from two consecutive periods are compared through an XOR gate. If there is a fault on clk1, two consecutive samples will be the same, and the output of the XOR gate will go low.
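The benefit of dividing clk1 before sampling can be illustrated with a small behavioral sketch. The Python below is an illustration only and not the authors' circuit: it assumes ideal 50% duty-cycle clocks, integer time units, and exactly equal periods, and the helper names (`clk_level`, `divided_level`, `sample_at_rising_edges`) are hypothetical. Under these assumptions, sampling the raw clock at a single edge of clk2 always returns the same level, which cannot be told apart from a stuck-at fault, whereas the divided clock alternates on every sample.

```python
def clk_level(t, period):
    """Level of an ideal 50% duty-cycle clock with a rising edge at t = 0."""
    return 1 if (t % period) < period // 2 else 0


def divided_level(t, period):
    """Level of a T flip-flop output that toggles on every rising edge
    of the clock (rising edges at 0, period, 2*period, ...)."""
    return (t // period + 1) % 2


def sample_at_rising_edges(signal, period, sample_period, offset, n):
    """Sample signal(t, period) at n rising edges of the sampling clock."""
    return [signal(offset + k * sample_period, period) for k in range(n)]


# Assumed values: both clocks have period 1000, clk2 lags clk1 by 137 units.
raw = sample_at_rising_edges(clk_level, 1000, 1000, 137, 8)
divided = sample_at_rising_edges(divided_level, 1000, 1000, 137, 8)
print(raw)      # a healthy clk1 looks constant when sampled at one edge only
print(divided)  # the divided clk1 changes between consecutive samples
```

In this sketch a healthy clk1 yields identical raw samples, so only the divided signal lets an XOR of consecutive samples distinguish a live clock from a stuck one.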

This output can be used to detect the failure of clk1 as long as we assume that clk1 has exactly the same frequency as clk2. This assumption is far from reality, because the frequency tolerance of the two clock sources may cause the output of the XOR gate to go low occasionally, but not in consecutive periods. Fig. 3 shows a situation in which the output of the XOR gate goes low occasionally because of the frequency tolerance. To cope with this problem, the output of the XOR gate in Fig. 2 is sampled through a D flip-flop (DFF4), and the circuit detects the failure of clk1 by examining consecutive samples of the output of the XOR gate through an OR gate. When clk1 fails, the output of the OR gate is zero and the circuit selects clk2 as the clock source of the system. On the other hand, if clk2 fails, the circuit keeps clk1 as the clock source of the system and the failure of clk2 is masked. To make the output of the OR gate glitch-free, it passes through a D flip-flop (DFF5).
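The detection chain described above can be sketched as a cycle-level behavioral model. This is a minimal Python sketch under stated assumptions, not the authors' VHDL: time is in integer units, clk1 simply stops toggling at `fail_time`, the flip-flop holding the previous sample is assumed to be DFF3 (inferred from the numbering of the other flip-flops), and the start-up values are arbitrary, so the first few outputs should be ignored. The function names and the example periods are hypothetical.

```python
def tff_level(t, p1, fail_time=None):
    """Divided clk1 at time t. clk1 rises at 0, p1, 2*p1, ... and has no
    further edges from fail_time onward (stuck-at fault)."""
    last = t if fail_time is None else min(t, fail_time - 1)
    if last < 0:
        return 0
    return (last // p1 + 1) % 2


def simulate_detector(p1, p2, n_edges, fail_time=None, offset=137):
    """Step the flip-flop chain once per rising edge of clk2.

    Returns (xor_trace, sel_trace); sel is 1 while clk1 is considered alive.
    """
    d1 = d2 = d3 = 0   # DFF1, DFF2 (synchronizer) and assumed DFF3 (previous sample)
    d4 = d5 = 1        # DFF4 (previous XOR value) and DFF5 (registered OR output)
    xor_trace, sel_trace = [], []
    for k in range(n_edges):
        t = offset + k * p2
        x = d2 ^ d3        # do two consecutive synchronized samples differ?
        alive = x | d4     # low only if the XOR was low in two consecutive periods
        d1, d2, d3, d4, d5 = tff_level(t, p1, fail_time), d1, d2, x, alive
        xor_trace.append(x)
        sel_trace.append(d5)
    return xor_trace, sel_trace


# Assumed example: clk2 is 3% slower than clk1.
xor_ok, sel_ok = simulate_detector(1000, 1030, 200)
_, sel_bad = simulate_detector(1000, 1030, 60, fail_time=20000)
```

In the fault-free run the XOR output drops occasionally because of the frequency mismatch, yet the registered OR output stays high after start-up. In the faulty run the OR output goes low a few clk2 periods after clk1 stops, which is the point at which the circuit would switch to clk2.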

Since clk2 is a hot spare component, clk2 has the same failure rate as clk1. Furthermore, when a clock source is about to fail, several glitches usually occur on it. The proposed circuit operates correctly as long as there are no glitches on clk2. If some glitches occur on clk2 while it is failing, the output of the OR gate may go to zero and the circuit may incorrectly switch to clk2 as the clock source of the system. To avoid such situations, we propose another circuit, which is presented next.

B. The Main Circuit

In the circuit of Fig. 2, clk1 is sampled and clock switching is based on the transitions of clk1. As mentioned above, since the transitions of clk2 are neglected, the failure of clk2 may lead to the failure of the circuit and then the failure of the system. To overcome this problem, we propose the circuit of Fig. 4, which operates correctly even if clk2 experiences some glitches while it is failing. The circuit of Fig. 4 is an extension of the circuit of Fig. 2.

The circuit of Fig. 4 is composed of two main parts and a simple clock selector. Each part samples one clock by means of the other one, and then examines the samples using the same mechanism as the circuit of Fig. 2. Finally, the clock selection is performed based on Table I. In the absence of faults, the output of each part is high, which means the sampled clocks are alive. Other cases indicate the occurrence of a fault.

For the presented circuit, we can define three states. In the first state, the normal state, both clock sources generate clocks correctly, and clk1 is selected as the output of the circuit. In the second state, the transient state, there are some glitches on the faulty clock. In this case, the output of the circuit is not stabilized. In the final state, the steady state, the faulty clock has a permanent stuck-at fault. In this case, when the output of each part is low, the faulty clock cannot be located, and since there are no transitions on the faulty one, we determine the output by masking the faulty clock through an XOR gate, as shown in Table I.

Fig. 2. The structure of the basic circuit.

Fig. 3. The XOR output gets low because of the frequency tolerance.

Since there is no difference between clk1 and clk2 in sampling and the transitions of both clocks can be monitored, clock selection can be accomplished more reliably than in the previous circuit.

The circuit of Fig. 4 can mask all single faults on either clock source that lead to permanent faults. However, the circuit is guaranteed to operate correctly only as long as the faults follow the permanent stuck-at fault model, because intermittent faults may not be detected by the proposed circuits.
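The clock selector of Table I is small enough to write out directly. The following Python sketch is illustrative only; it assumes that SEL1 is the output of the part that monitors clk1 and SEL2 the output of the part that monitors clk2, which the text suggests but does not state explicitly, and the function name `select_clock` is hypothetical. It also shows why the XOR row still delivers a usable clock: a stuck clock is a constant, so XOR-ing it with the live clock yields either the live clock or its inverse.

```python
def select_clock(sel1, sel2, clk1, clk2):
    """Clock selector following Table I; all arguments are logic levels (0 or 1)."""
    if sel1:
        return clk1        # rows "high high" and "high low"
    if sel2:
        return clk2        # row "low high"
    return clk1 ^ clk2     # row "low low": the faulty clock cannot be located


live = [0, 1, 0, 1, 0, 1]                            # samples of the healthy clock
stuck0 = [select_clock(0, 0, 0, c) for c in live]    # other clock stuck at 0
stuck1 = [select_clock(0, 0, 1, c) for c in live]    # other clock stuck at 1
print(stuck0)   # same waveform as the live clock
print(stuck1)   # inverted live clock, which still toggles every period
```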

III. SIMULATION BASED FAULT INJECTION

We have used simulation-based fault injection to evaluate the latter circuit. We injected more than 100000 faults into our circuit by means of the method described below.

TABLE I
THE TRUTH TABLE OF THE CLOCK SELECTOR.

SEL1    SEL2    OUTPUT
high    high    CLK1
high    low     CLK1
low     high    CLK2
low     low     CLK1 XOR CLK2

To inject a fault, we defined three random variables (C, T, and L) with the uniform distribution. The first random variable, C, determines the clock (clk1 or clk2) into which faults are injected. The second random variable, T, determines the time duration of the transient state. Finally, the third random variable, L, determines the length of glitches during the transient state. We have implemented our circuit in VHDL and the stimulus in the SystemC language [11].
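Drawing the three fault parameters could look roughly like the sketch below. It is written in Python purely for illustration and is not the SystemC stimulus used in the experiments; the numeric ranges (`max_transient`, `max_glitch`), the time unit, and the names `Fault`, `draw_fault`, and `draw_campaign` are assumptions, since the paper does not give them.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    clock: str        # C: clock source that receives the fault ("clk1" or "clk2")
    transient: int    # T: duration of the transient (glitchy) state, in time units
    glitch_len: int   # L: length of each glitch during the transient state


def draw_fault(rng, max_transient=1000, max_glitch=50):
    """Draw one fault with uniformly distributed C, T, and L (assumed ranges)."""
    return Fault(
        clock=rng.choice(["clk1", "clk2"]),
        transient=rng.randint(0, max_transient),
        glitch_len=rng.randint(1, max_glitch),
    )


def draw_campaign(n, seed=0):
    """Draw a reproducible list of n faults for one injection campaign."""
    rng = random.Random(seed)
    return [draw_fault(rng) for _ in range(n)]


campaign = draw_campaign(1000, seed=1)
print(campaign[0])
```

Seeding the generator keeps a campaign reproducible, so that a fault which is not masked can be replayed against the circuit model.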

For faults that lead to permanent faults, our results show that the circuit masks all single faults correctly. On the other hand, in the case of intermittent faults, our circuit masks 49074 out of 100000 faults.

IV. META-STABILITY CONSIDERATIONS

To ensure reliable operation, digital circuits must meet the setup-time and hold-time requirements of flip-flops. If these timing requirements are violated for a certain flip-flop, the output of the flip-flop can oscillate or hover between high and low, which is called a meta-stable output [10].

In our last circuit, since the clock sources are not synchronized while they sample each other through some flip-flops, the asynchronous inputs may violate the timing requirements of a flip-flop and cause unpredictable operation. To reduce the effects of meta-stability, we use a multiple-stage synchronizer before sampling the divided clock sources, as shown in Fig. 2.

The Mean Time Between Failures (MTBF) caused by a meta-stable output is estimated with the following formula [12]:

$$\mathrm{MTBF} = \frac{e^{t_r/\tau}}{T_o \times f_{clk} \times f_D} \qquad (1)$$

Here, the constants $T_o$ and $\tau$ depend on the process technology used to manufacture the device. $f_{clk}$ is the clock frequency, $f_D$ is the number of asynchronous input changes per second applied to the flip-flops, and $t_r$ is the time delay for the meta-stability to resolve itself.

Fig. 4. The structure of our solution based on the circuit of Fig. 2.

For the circuit of Fig. 4, we have two asynchronous inputs, each with half the frequency of the clock sources, so:

$$f_D = 2 \times \left(\frac{f_{clk}}{2}\right) \qquad (2)$$

Therefore:

$$\mathrm{MTBF} = \frac{e^{t_r/\tau}}{T_o \times (f_{clk})^2} \qquad (3)$$

The above formula gives a lower bound for our circuit because there are more than two stages in the circuit. For a typical FPGA such as ProASICPLUS [13], a simple multiple-stage synchronizer with two stages gives an MTBF of 1 year when $f_{clk}$ = 100 MHz and $f_D$ = 12.5 MHz [14]. For our circuit, in which $f_{clk} = f_D$, the input rate $f_D$ is eight times higher at the same $f_{clk}$, so we obtain a lower bound for the MTBF of about 45 days. We believe the MTBF of our circuit is much higher than 45 days because our circuit has more than two stages.

Moreover, regardless of the frequency of the clock sources, to achieve a desired lower bound on the MTBF (e.g., 8 years), we can divide the sampling clocks down to the desired frequency (e.g., 12.5 MHz) before applying them to the sampling flip-flops. In this case, the recovery latency increases.
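The 45-day and 8-year figures can be reproduced by scaling the reference point from [14] with the inverse proportionality of (1) to the product of the clock frequency and the input change rate. The Python sketch below assumes that $t_r$, $\tau$, and $T_o$ are unchanged between the reference and the scaled case, which is a simplification; the function name `scale_mtbf` is hypothetical.

```python
def scale_mtbf(ref_mtbf, ref_fclk, ref_fd, fclk, fd):
    """Scale a reference MTBF to new frequencies, assuming t_r, tau and T_o
    are unchanged so that MTBF is proportional to 1 / (fclk * fd)."""
    return ref_mtbf * (ref_fclk * ref_fd) / (fclk * fd)


REF_DAYS = 365.0                       # reference point: 1 year
REF_FCLK, REF_FD = 100e6, 12.5e6       # at fclk = 100 MHz and fD = 12.5 MHz

# Undivided sampling: fclk = fD = 100 MHz.
undivided_days = scale_mtbf(REF_DAYS, REF_FCLK, REF_FD, 100e6, 100e6)

# Sampling clocks divided down to 12.5 MHz: fclk = fD = 12.5 MHz.
divided_days = scale_mtbf(REF_DAYS, REF_FCLK, REF_FD, 12.5e6, 12.5e6)

print(round(undivided_days, 1), "days")
print(round(divided_days / 365.0, 1), "years")
```

Under these assumptions the undivided case comes out at about 45.6 days and the divided case at 8 years, matching the bounds quoted in the text.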

V. CONCLUSION

In this paper, we have presented two circuits that use two clock generators to tolerate permanent faults. One clock generator is the primary clock and the other is used as a hot spare.

According to our simulation results, our second circuit masks all single faults that lead to permanent stuck-at faults. Moreover, it masks almost 49 percent of intermittent faults. The main contributions of our solution are that our circuits are simple and need no PLLs, no DLLs, and no clock generators with a higher frequency than the system needs.

However, our approach has some drawbacks, such as the clock synchronization problem when switching between the two clock generators and unpredictable behavior for intermittent faults. Addressing these drawbacks is the main focus of our future work.

REFERENCES

[1] ITRS. (2006) International technology roadmap for semiconductors. 00 ExecSum2006Update.pdf. [Online]. Available: http://www.itrs.net/Links/2006Update/FinalToPost/

[2] Actel Corp. (2007) Fusion Family of Mixed-Signal Flash FPGAs with Optional Soft ARM Support. Fusion GenDesc DS.pdf. [Online]. Available: http://www.actel.com/documents/

[3] Freescale Semiconductor Corp. (2007) MC9S08QD4 Series Features. [Online]. Available: http://www.freescale.com

[4] K.-Y. K. Chang and M. A. Horowitz, “Fault-tolerant clock generator,” U.S. Patent application 7 467 320, Nov. 09, 2006.

[5] C. W. Roleston, “Fault tolerant clock voter with recovery,” U.S. Patent application 672 647, Aug. 07, 2001.

[6] K. R. Driscoll, “Accurate digital fault tolerant clock,” U.S. Patent application 5 557 623, Sep. 09, 1996.

[7] A. Wilde, “The generalized delay-locked loop,” Wirel. Pers. Commun., vol. 8, no. 2, pp. 113–130, 1998.

[8] S. W. Kwak, B. J. Choi, and B. K. Kim, “An optimal checkpointing strategy for real-time control systems under transient fault,” IEEE Trans. Rel., vol. 50, pp. 293–301, Sep. 2001.

[9] T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, “Min-max checkpoint placement under incomplete failure information,” in International Conference on Dependable Systems and Network, Jul. 2004, pp. 721–730.

[10] L. Kleeman and A. Cantoni, “On the unavoidability of metastable behavior in digital systems,” IEEE Trans. Comput., vol. 36, pp. 109–112, Jan. 1987.

[11] (2006) SystemC Version 2.0. [Online]. Available: http://www.systemc.org

[12] T. J. Chaney, “Measured flip-flop responses to marginal triggering,” IEEE Trans. Comput., vol. 32, no. 12, pp. 1207–1209, 1983.

[13] Actel Corp. (2008, Sep.) ProASICPLUS Flash Family FPGAs. ProASICPlus DS.pdf. [Online]. Available: http://www.actel.com/documents/

[14] Actel Corp. (2008, Jan.) Metastability Characterization Report for Actel Flash FPGAs. Flash Metastability HBs.pdf. [Online]. Available: http://www.actel.com/documents/
