


TSINGHUA SCIENCE AND TECHNOLOGY ISSN 1007-0214 15/17 pp95-99 Volume 16, Number 1, February 2011

English Speech Recognition System on Chip*

LIU Hong (刘 鸿), QIAN Yanmin (钱彦旻), LIU Jia (刘 加)**

Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Abstract: An English speech recognition system was implemented on a chip, called a speech system-on-chip (SoC). The SoC includes an application-specific integrated circuit with a vector accelerator to improve performance. A sub-word recognition algorithm based on continuous density hidden Markov models runs on a very inexpensive speech chip. The algorithm extends a two-stage fixed-width beam-search baseline system with a variable beam-width pruning strategy and a frame-synchronous word-level pruning strategy that significantly reduce the recognition time. Tests show that this method reduces the recognition time nearly 6-fold and the memory size nearly 2-fold compared to the original system, with less than 1% accuracy degradation for a 600-word recognition task and a recognition accuracy of about 98%.

Key words: speaker-independent speech recognition; system-on-chip; mel-frequency cepstral coefficients (MFCC)

Introduction

Embedded speech recognition systems[1] are becoming more important with the rapid development of handheld portable devices. However, only a few products are yet available due to high chip costs. This paper describes an inexpensive English speech recognition system based on a chip. The chip includes a 16-bit microcontroller with a 16-bit coprocessor, 32 KB of RAM, and 16-bit A/D and D/A converters.

The recognition models are sub-word continuous density hidden Markov models (CHMM)[2,3] with mel-frequency cepstral coefficient (MFCC)[4] features. The recognition engine uses a two-pass beam search algorithm[5], which greatly improves the implementation efficiency and lowers the system cost. The system can be used in command-and-control devices such as consumer electronics, handheld devices, and household appliances[6]; thus the system has many applications. The recognition accuracy of the speech recognition SoC is more than 97%.

The hardware architecture of the speech recognition system-on-chip is designed for practical applications. All of the system's hardware is integrated into a single chip, which offers the best combination of performance, size, power consumption, cost, and reliability[7]. The speech recognition system-on-chip is composed of a general-purpose microcontroller, a vector accelerator, a 16-bit ADC/DAC, an analog filter circuit, audio input and output amplifiers, and a communication interface. The chip also includes a power management module and a clock. The application-specific integrated circuit (ASIC) has much greater computing power than a microcontroller unit (MCU) because of the vector accelerator. Unlike a DSP, this ASIC integrates the ADC, DAC, audio amplifier, and power management modules

Received: 2009-12-15; revised: 2010-06-10

* Supported by the National Natural Science Foundation of China and Microsoft Research Asia (No. 60776800), the National Natural Science Foundation of China and Research Grants Council (No. 60931160443), and the National High-Tech Research and Development (863) Program of China (Nos. 2006AA010101, 2007AA04Z223, 2008AA02Z414, and 2008AA040201)

** To whom correspondence should be addressed. E-mail: [email protected]; Tel: 86-10-62781847


without unnecessary circuits, which reduces the cost. Figure 1 shows a photograph of the unpackaged chip core. The block diagram is shown in Fig. 2, with the details of each block described below.

Fig. 1 Photograph of the chip

Fig. 2 SoC block diagram

1 Chip Software Design

The speech recognition process is shown in Fig. 3. The speech signal from the microphone is pre-amplified, filtered by a low-pass filter, and then sampled by the ADC at an 8 kHz sampling frequency. The signal is then segmented into frames, which form a

Fig. 3 Speech recognition system

continuous speech frame sequence. This sequence is then sent to the endpoint detection and feature extraction unit. After two-stage matching[8], the system outputs the recognition result. The result can be sent to the control circuit or to the LCD display.
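The framing step described above can be sketched as follows. The paper specifies only the 8 kHz sampling rate; the 25 ms frame length and 10 ms hop are common values assumed here for illustration, and the function name is hypothetical.

```python
import numpy as np

def frame_signal(samples, sample_rate=8000, frame_ms=25, hop_ms=10):
    """Split a 1-D speech signal into overlapping frames.

    Assumed parameters: 25 ms frames with a 10 ms hop, i.e. 200-sample
    frames advancing 80 samples at the paper's 8 kHz sampling rate.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # 200 samples at 8 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 80 samples at 8 kHz
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop_len)
    # Stack the frames into a (n_frames, frame_len) matrix
    return np.stack([samples[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])
```

A one-second utterance at 8 kHz then yields a 98-frame sequence, which downstream endpoint detection and feature extraction process frame by frame.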

1.1 Software level division

Hierarchical design[9] is often used for complex systems, regardless of the operating system or network structure. The lower levels provide basic services and low-level management. Each layer is encapsulated, so that the upper layer does not need to know the underlying implementation and the lower layer does not need to know how it is used. The layered structure increases the system's flexibility, with each module logically placed in the hierarchy, improving the system's applicability and flexibility and enhancing its reliability, robustness, and maintainability. The system software is divided into the driver layer, the universal module layer, the function module layer, and the scheduling layer, as shown in Fig. 4. The driver layer, which isolates the software from the hardware, includes all of the interrupt service routines and the peripheral drivers. The universal module layer includes a variety of basic operation modules which provide basic computing and operating services. The function module layer contains the various functional modules that form the core algorithms. The scheduling layer, which is the top layer, controls the super-loop of the global data maintenance system, whose core function is task scheduling.

Fig. 4 Software division

1.1.1 Driver layer The driver layer program enables direct operation of the hardware. The program modules at this level generally correspond to the actual hardware modules, such as the memory, peripheral interfaces, and communication interfaces. These functions provide an interface between the hardware modules and the


application program interface for the upper-level program. 1.1.2 Service layer The driver layer provides basic support for the hardware, but no extended or enhanced functionality. The service layer provides a powerful interface to the application level to further improve system performance; thus, the service layer makes better use of the hardware features. 1.1.3 Scheduling layer The various inputs select different sub-tasks such as speech recognition, speech coding, and speech decoding, which all have different response times for different asynchronous events. The scheduling layer schedules these various tasks. The whole system is designed to provide good real-time performance, with the scheduling layer providing seamless connectivity between system procedures so that applications complete without needing to consider how to schedule their execution. The scheduling also facilitates control of the DSP power consumption. 1.1.4 Application layer The application-level programs use the driver-layer, service-layer, and scheduling procedures provided through the API interface functions, so that the user can focus on the tasks, for example, the English command word recognition engine. Each application thus depends on the application layer procedures, with most, if not all, changes made only in the application layer program based on the application needs, while changes to the driver, service, and scheduling layers are relatively small. The driver, service, and scheduling layer programs serve as the system kernel layers, while the application layer program serves as the user program.
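The scheduling layer's super-loop dispatch can be illustrated with a minimal sketch. All names here are hypothetical, and the event source is modeled as an iterable rather than hardware interrupts; the paper does not give the actual scheduler code.

```python
def run_super_loop(tasks, events):
    """Minimal cooperative super-loop scheduler sketch.

    tasks: dict mapping event names (e.g. "recognize", "encode",
    "decode") to handler callables; events: an iterable standing in
    for the asynchronous event queue. Hypothetical names throughout.
    """
    for event in events:
        if event == "shutdown":
            break                    # leave the loop on a stop request
        handler = tasks.get(event)
        if handler is not None:
            handler()                # run the sub-task to completion
```

Each handler runs to completion before the loop polls again, which matches the cooperative, non-preemptive style typical of such embedded kernels.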

1.2 Two-pass beam search algorithm

The sub-word model in this system is based on the continuous density hidden Markov model (CHMM), with the output probability distribution of each state described by a Gaussian mixture model (GMM). The relationships between context phonemes are categorized as monophone, biphone, and triphone[9]. More complex models have higher recognition rates but take much longer, so they are not practical even though their recognition rates can reach nearly 100%. Faster systems using very simple models do not give satisfactory results. Therefore, a two-pass search strategy was adopted to balance accuracy and speed, as shown in Fig. 5. In the first stage, the search uses approximate models such as the monophone model with one Gaussian mixture[9]. This "fast match" generates a list of n-best hypotheses for the second search. The second stage is a detailed match among the n-best hypotheses using the triphone model with three Gaussian mixtures. To reduce the computations, the covariance matrices of the Gaussian mixture models are diagonal in both the fast match and detailed match stages. The output probability scores of all states are calculated and then matched one by one using the Viterbi method[10].
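The computational saving from diagonal covariances can be seen in a sketch of the state output probability. With a diagonal covariance, each mixture's Gaussian factorizes over feature dimensions, so no matrix inverse or determinant is needed. The function name and array layout are illustrative, not the paper's implementation.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.

    weights: (M,) mixture weights; means, variances: (M, D) per-mixture
    parameters. Only element-wise operations are needed, which is why
    diagonal covariances suit fixed-point embedded decoders.
    """
    d = x.shape[0]
    # Per-mixture log N(x; mu_m, diag(var_m)), computed dimension-wise
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    # Stable log-sum-exp over the M mixtures
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))
```

A Viterbi decoder would call this once per state per frame; the first pass would use one mixture (M = 1) and the second pass three, as described above.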

Fig. 5 Two-stage search structure

1.3 Front end feature extraction

Robust features must be used for embedded speech recognition systems. The mel-frequency cepstral coefficients have proved more robust in the presence of background noise than other feature parameters. Since the dynamic range of the MFCC features is limited, the MFCC can be computed with fixed-point arithmetic and is therefore well suited to embedded systems. The MFCC parameters chosen here offer the best trade-off between performance and computational requirements. Generally speaking, increased feature vector dimensions provide more information, but this increases the computational burden in the feature extraction and recognition stages[11,12].
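The difference (delta) features mentioned below are typically computed by linear regression over a few neighboring frames. A common regression formulation is sketched here; the window width of 2 is an assumption, as the paper does not state it.

```python
import numpy as np

def delta(features, width=2):
    """First-order regression deltas over a (T, D) feature sequence.

    Standard regression formula: d_t = sum_n n*(x[t+n] - x[t-n]) /
    (2 * sum_n n^2), with edge frames repeated at the boundaries.
    The width=2 window is an assumed, commonly used value.
    """
    T = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    return np.stack([
        sum(n * (padded[t + width + n] - padded[t + width - n])
            for n in range(1, width + 1)) / denom
        for t in range(T)
    ])
```

Applying the same operator to the delta sequence yields the second-order differences used in the feature vector described below.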

The final recognition results for different features


are shown in Fig. 6. With four Gaussian mixtures, the first stage needed twelve candidates for a 99% recognition rate. With 34 features, the highest recognition rate was 99.22%; with 22 features the recognition rate was 99.01%, which satisfies the system requirements. Therefore, a 22-dimension feature vector was chosen from the 12-dimension MFCC, 12-dimension difference MFCC, 12-dimension second-difference MFCC, and the normalized energy with its first- and second-order differences.

Fig. 6 Different model recognition rates

2 Evaluation Results

The tests used 40 volunteer speakers and a test vocabulary[13,14] composed of personal names, place names, and stock names. The vocabulary had a total of 600 phrases, each comprising 2 to 4 English words and spoken once by each speaker. The system recognition accuracy was tested by sampling input through the USB interface at 8 kHz. In this way, the test conditions were almost the same as real conditions, so the statistical recognition accuracy could be properly assessed.

The first recognition stage used a relatively simple acoustic model to produce a multi-candidate recognition result with a high recognition rate. Figure 7 shows the recognition rates for the multi-candidate results using the 600-phrase vocabulary. Although the first-stage model obtains only a 93.7% recognition rate with one candidate, the rate reaches 98% with six candidates. Thereafter, the upward trend of the recognition rate slows significantly as the number of candidates increases. In most cases, the correct result was among the first four candidates. Therefore, the twelve candidates from the first stage were used for the second-stage matching with a more complex acoustic model. The recognition rate then reached 99% in the second stage.

Fig. 7 Recognition rates for different candidates

The final system recognition rates for different vocabulary sizes are shown in Table 1. All the evaluations were performed in an office environment with a moderate noise level. Although the recognition rate decreased as the vocabulary size increased, the recognition accuracy was still about 98% with 600 phrases. The recognition time with the 600-phrase vocabulary in a real environment was approximately 0.7 RTF (real-time factor). Thus, this system can effectively handle a 600-phrase vocabulary.

Table 1 Speech recognition system performance

Test speech (20 males / 20 females)    Recognition accuracy (%)
150-phrase vocabulary                  99.2
300-phrase vocabulary                  98.8
600-phrase vocabulary                  98.1
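The real-time factor quoted above relates decoding time to audio duration; a quick sketch makes the definition concrete (the function name is illustrative):

```python
def real_time_factor(decode_seconds, audio_seconds):
    """RTF = decoding time / audio duration.

    RTF < 1 means the recognizer keeps up with the incoming speech;
    at the reported 0.7 RTF, a 10 s utterance decodes in about 7 s.
    """
    return decode_seconds / audio_seconds
```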

The best features of this medium-vocabulary recognition SoC are its low clock frequency (48 MHz) and small system resource requirements (48 KB). For the same performance, the IBM StrongARM needs a frequency of 200 MHz and 2.2 MB of memory, while the Siemens ARM920T needs a frequency of 100 MHz and 402 KB of memory.

3 Conclusions and Future Work

This paper describes an English speech recognition system implemented on an SoC platform. The system uses an ASIC with a vector accelerator and speech recognition software developed for the ASIC architecture. Tests show that the system attains high recognition accuracy (more than 98%) with a short response time (0.7 RTF). This ASIC design provides flexible, fast speech recognition for embedded applications with only 48 KB of RAM. Future work will improve the algorithm to reduce the recognition time and increase the system robustness in noisy environments with changes


in the memory and scheduling. The system can be used for Chinese or English on-chip recognition.

References

[1] Guo Bing, Shen Yan. SoC Technology and Its Application. Beijing: Tsinghua University Press, 2006. (in Chinese)

[2] Levy C, Linares G, Nocera P, et al. Reducing computation and memory cost for cellular phone embedded speech recognition system. In: Proceedings of the ICASSP. Montreal, Canada, 2004, 5: 309-312.

[3] Novak M, Hampl R, Krbec P, et al. Two-pass search strategy for large list recognition on embedded speech recognition platforms. In: Proceedings of the ICASSP. Hong Kong, China, 2003, 1: 200-203.

[4] Xu Haiyang, Fu Yan. Embedded Technology and Applications. Beijing: Machinery Industry Press, 2002. (in Chinese)

[5] Hoon C, Jeon P, Yun L, et al. Fast speech recognition to access a very large list of items on embedded devices. IEEE Trans. on Consumer Electronics, 2008, 54(2): 803-807.

[6] Yang Zhizuo, Liu Jia. An embedded system for speech recognition and compression. In: ISCIT2005. Beijing, China, 2005: 653-656.

[7] Westall F. Review of speech technologies for telecommunications. Electronics & Communication Engineering Journal, 1997, 9(5): 197-207.

[8] Shi Yuanyuan, Liu Jia, Liu Rensheng. Single-chip speech recognition system based on 8051 microcontroller core. IEEE Trans. on Consumer Electronics, 2001, 47(1): 149-154.

[9] Yang Haijie, Yao Jing, Liu Jia. A novel speech recognition system-on-chip. In: International Conference on Audio, Language and Image Processing 2008 (ICALIP 2008). Shanghai, China, 2008: 166-174.

[10] Lee T, Ching P C, Chan L W, et al. Tone recognition of isolated Cantonese syllables. IEEE Trans. on Speech and Audio Processing, 1995, 3(3): 204-209.

[11] Novak M, Hampl R, Krbec P, et al. Two-pass search strategy for large list recognition on embedded speech recognition platforms. In: ICASSP'04. Montreal, Canada, 2004: 200-203.

[12] Zhu Xuan, Wang Rui, Chen Yining. Acoustic model comparison for an embedded phoneme-based Mandarin name dialing system. In: Proceedings of International Symposium on Chinese Spoken Language Processing. Taipei, 2002: 83-86.

[13] Zhu Xuan, Chen Yining, Liu Jia, et al. Multi-pass decoding algorithm based on a speech recognition chip. Chinese Acta Electronica Sinica, 2004, 32(1): 150-153.

[14] Demeechai T, Mäkeläinen K. Integration of tonal knowledge into phonetic HMMs for recognition of speech in tone languages. Signal Processing, 2000, 80: 2241-2247.