
AN INTERACTIVE SPEECH CODING TOOL USING LABVIEW™

    Karthikeyan N. Ramamurthy, Jayaraman J. Thiagarajan and Andreas Spanias

    SenSIP Center, School of ECEE, Arizona State University, Tempe, AZ USA 85287-5706

    ABSTRACT

    Code Excited Linear Prediction (CELP) is a closed-loop

    analysis-by-synthesis speech coding algorithm that has been

    standardized in Federal Standard-1016. Variants of the

    CELP algorithm form the core of many speech coding

    standards that exist today. In this paper, we discuss the

    development of an interactive speech coding tool in National

Instruments LabVIEW™ software for the Federal Standard-

    1016 CELP algorithm. A brief description of the speech

    coding algorithm and the features of the LabVIEW speech

    coding tool are presented. Illustrations demonstrating the use

of the interactive software tool in analyzing the speech coding algorithm are provided. This tool can be used to

teach the various modules of CELP-based speech coders

to undergraduate and graduate students.

Index Terms— Speech coding, LabVIEW, code

    excited linear prediction, interactive tools.

    1. INTRODUCTION

    Speech coding is concerned with compact digital

    representations of voice signals for the purpose of efficient

    transmission or storage [1-6]. Linear predictive coding is the

    core of many speech coding standards that exist today [7,8].

Linear predictive coding relies on the source-system model of speech production, which is inspired by the human speech

    production mechanism. Voiced speech is produced by

    exciting the vocal tract filter with periodic impulses and

    unvoiced speech is generated using random pseudo-white

    noise excitation. The vocal tract is usually represented by a

    tenth-order digital all-pole filter. This source-system

    analysis-synthesis model is used in most standardized

    algorithms. In fact, the Levinson-Durbin linear prediction

    algorithm is embedded in every cell phone.
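The Levinson-Durbin recursion referred to above can be written compactly. The following Python sketch is purely illustrative (the paper's tool is built on MATLAB and LabVIEW, not Python), and the function name is our own:

```python
# Illustrative Levinson-Durbin recursion: solve the LP normal equations
# from autocorrelation values r[0..order]. FS-1016 uses order 10.

def levinson_durbin(r, order):
    a = [0.0] * (order + 1)   # predictor coefficients; a[0] unused
    e = r[0]                  # prediction error energy
    for i in range(1, order + 1):
        # reflection coefficient for stage i
        acc = r[i]
        for j in range(1, i):
            acc -= a[j] * r[i - j]
        k = acc / e
        # update coefficients a[1..i]
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        e *= (1.0 - k * k)    # error energy shrinks at every stable stage
    return a[1:], e
```

For the autocorrelation of a first-order AR source with coefficient 0.9, the recursion recovers a1 = 0.9 and drives the second coefficient to zero.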

    The closed-loop source-system encoders use linear

    prediction (LP) along with an excitation scheme determined

    by closed-loop analysis-by-synthesis (A-by-S) optimization.

    The excitation sequence that minimizes the perceptually-weighted (PW) mean-square-error (MSE) between the input

speech and the reconstructed speech is chosen as the optimal

excitation [9]. In the CELP algorithm [10,11], the excitation sequences

    are stored in two code books and the indices to the

    codebooks are chosen during the PW MSE minimization

    process. The adaptive code book (ACB) predicts the pitch

    delay using the long term predictor (LTP) and the stochastic

    code book (SCB) predicts the random component of the

    excitation. Other components of a generic CELP encoder

    include autocorrelation analysis and linear prediction, and

line spectral pair (LSP) computation. The CELP decoder

implements a subset of the encoder itself. A generic CELP

    encoder is illustrated in Figure 1.
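The A-by-S selection described above can be illustrated with a toy search: each candidate excitation is synthesized, scaled by its optimal gain, and scored against the target. Perceptual weighting by W(z) is omitted for brevity, and the filter, codebook, and function names here are invented for illustration:

```python
# Toy analysis-by-synthesis codebook search, as in a generic CELP encoder.

def synthesize(x, a):
    """All-pole filter 1/A(z): s[n] = x[n] + sum_j a[j] * s[n-j-1]."""
    s = []
    for n in range(len(x)):
        acc = x[n]
        for j, aj in enumerate(a):
            if n - j - 1 >= 0:
                acc += aj * s[n - j - 1]
        s.append(acc)
    return s

def search_codebook(target, codebook, a):
    """Return (index, gain) of the codebook vector minimizing the MSE."""
    best = None
    for i, x in enumerate(codebook):
        s = synthesize(x, a)
        energy = sum(v * v for v in s)
        if energy == 0.0:
            continue
        g = sum(t * v for t, v in zip(target, s)) / energy  # optimal gain
        err = sum((t - g * v) ** 2 for t, v in zip(target, s))
        if best is None or err < best[0]:
            best = (err, i, g)
    return best[1], best[2]
```

When the target is an exactly reproducible synthesized vector, the search recovers the generating codebook entry and its gain.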

LabVIEW™ [12] was chosen as the programming

    environment to implement the CELP algorithm as it has a

    rich set of signal processing and visualization functions, and

    real-time signal acquisition capabilities. Implementation of

speech coding algorithms involves integration of software and hardware components, which can be easily performed

    with LabVIEW. The graphical programming approach

    enables users to easily visualize and understand the basic

    blocks of the speech analysis-synthesis procedure. This

    speech coding tool is scalable in the sense that additional

    options and capabilities could be added.

    In this paper, we extend the work published in [13] and

    discuss the implementation of the Federal Standard 1016

    (FS-1016) version of the CELP speech coder in

LabVIEW™. Our main goal here is to introduce and

    demonstrate the concepts of speech coding to students and

    enhance their learning experience using an interactive visual

interface. We choose the CELP coder for analysis because it can be connected to several concepts covered in DSP

    classes including digital filter theory, estimation of

    periodicity, autocorrelation computation and filter stability.

    Exercises that expose students to the non-stationarity of the

    speech signal, the all-pole spectral modeling performed by

    LP analysis-synthesis and the distortion caused by

    quantization of LSP parameters will be developed. The tool

    can be used along with the books [4,10] that have

    demonstrations of the FS-1016 algorithm and exercises

    based on MATLAB. The speech coding tool is of value

    not only to undergraduate and graduate students but also to

    DSP practitioners. The tool can also be used in high school

science classes after some simplifications, for demonstrating the basic aspects of coding and transmission of speech.

Assessment instruments will be developed, and pre- and

post-quizzes and interviews will be conducted among the

students.

978-1-61284-227-1/11/$26.00 © 2011 IEEE. DSP/SPE 2011


    2. CELP BASED SPEECH CODING STANDARDS

    The speech coding standards based on CELP are surveyed in

    this section. In our survey, we divide the algorithms based

on CELP into three categories based on the chronology of their development, i.e., first-generation CELP (1986-1992),

    second-generation CELP (1993-1998), and third-generation

    CELP (1999-present). A detailed description of the FS-1016

    standard is also provided.

    2.1. Survey of Speech Coders

The first-generation CELP algorithms are generally

high-complexity, non-toll-quality coders that operate at bit rates

between 4.8 kb/s and 16 kb/s. Some of the first-generation

    CELP algorithms include: the FS-1016 CELP, the IS-54

    vector sum excited linear prediction (VSELP), the ITU-T

G.728 low-delay CELP, and the IS-96 Qualcomm CELP. The newer second- and third-generation A-by-S coders

    replaced most of these standardized CELP algorithms.

    The second-generation CELP algorithms are targeted

    for Internet audio streaming, voice-over-Internet-protocol

    (VoIP), teleconferencing applications, and secure

    communications. Some of the second-generation CELP

    algorithms include: the ITU-T G.723.1 dual-rate speech

    codec [14], the GSM EFR [15], the IS-127 Relaxed CELP

    (RCELP) [16], and the ITU-T G.729 CS-ACELP [17]. The

    algebraic CELP (ACELP) uses algebraic codes in place of

    the SCB and hence provides a huge reduction in

    computational complexity for code book search.

The third-generation (3G) CELP algorithms accommodate different bit rates and are multimodal. They

    are designed to operate in different modes: low-mobility,

high-mobility, indoor, etc., consistent with the vision of

    wideband wireless standards. There are at least two

    algorithms that have been developed and standardized for

these applications. The Global System for Mobile

Communications (GSM) community standardized the Adaptive

Multi-Rate (AMR) coder [18] in Europe and the Telecommunications

    Industry Association (TIA) has tested the Selectable Mode

    Vocoder (SMV) [19] in the U.S.

    2.2. The FS-1016 CELP Coder

    FS-1016 is a 4.8 kb/s CELP algorithm that was adopted in

    the late 1980s by the Department of Defense (DoD) for use

    in the third-generation secure telephone unit (STU-III). The

    CELP FS-1016 remains interesting for our study as it

    contains core elements of A-by-S algorithms that are still

    very useful. The synthesis configuration for the FS-1016

    CELP is shown in Figure 2. Speech is sampled at 8 kHz and

segmented into frames of 30 ms in the FS-1016 CELP. Each

frame is segmented into sub-frames of 7.5 ms. The excitation

    in CELP is formed by combining vectors from an adaptive

    and a stochastic codebook (gain-shape VQ). The excitation

vectors are selected in every sub-frame by minimizing the perceptually weighted error measure.
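The framing arithmetic above is easy to verify: at 8 kHz a 30 ms frame holds 240 samples and each 7.5 ms sub-frame holds 60, i.e., four sub-frames per frame. A small Python sketch (the helper name is ours):

```python
# FS-1016 framing arithmetic: 8 kHz sampling, 30 ms frames,
# 7.5 ms sub-frames (four sub-frames per frame).

FS = 8000
FRAME = int(FS * 30 / 1000)      # 240 samples per frame
SUBFRAME = int(FS * 7.5 / 1000)  # 60 samples per sub-frame

def split_frames(signal):
    """Yield (frame, sub-frames) pairs; a trailing partial frame is dropped."""
    for start in range(0, len(signal) - FRAME + 1, FRAME):
        frame = signal[start:start + FRAME]
        subs = [frame[k:k + SUBFRAME] for k in range(0, FRAME, SUBFRAME)]
        yield frame, subs
```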

    The codebooks are searched sequentially starting with

    the ACB. The ACB contains the history of past excitation

signals and the LTP lag search is carried out over 128 integer

    (20 to 147) and 128 non-integer delays. Only a subset of

    lags is searched in even sub-frames to reduce the

    computational complexity. The SCB contains 512 sparse

    and overlapping code vectors [20]. Each code vector

    consists of sixty samples and each sample is ternary valued

Figure 2. FS-1016 CELP synthesis: adaptive and stochastic codebook excitations, scaled by gains ga and gs, are summed and passed through the LP synthesis filter 1/A(z) and a postfilter to produce the output speech.

Figure 1. Block diagram of a generic CELP encoder: codebook excitation vectors x(i) with gain g(i) drive the LTP synthesis filter 1/AL(z) and the LP synthesis filter 1/A(z); the residual error between the input speech and the synthetic speech is perceptually weighted by W(z) and used for MSE minimization.



Figure 3. Block diagram illustrating the steps involved in building the speech coding tool: MATLAB implementation of the algorithm -> create a shared library using the MATLAB Compiler -> create a C++ wrapper function -> create a new shared library for LabVIEW -> build the user interface in LabVIEW.

(1, 0, -1) [21] to allow for fast convolution.

Ten short-term prediction parameters are encoded as

    LSPs on a frame-by-frame basis. LSPs are more amenable

    to quantization and hence they are transmitted instead of LP

    coefficients. Sub-frame LSPs are obtained by performing

    linear interpolation of frame LSPs. A short-term pole-zero

    postfilter is also part of the standard. The details on the bit

    allocations are given in the standard [11]. The computational

    complexity of FS-1016 CELP was estimated at 16 Million

    Instructions per Second (MIPS) for partially searched

    codebooks and the Diagnostic Rhyme Test (DRT) and Mean

Opinion Scores (MOS) were reported to be 91.5 and 3.2, respectively.
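The sub-frame LSP interpolation mentioned above can be sketched as a linear blend between the previous and the current frame's LSP vectors. The per-sub-frame weights below are illustrative, not the exact weights of the standard, and the function names are ours:

```python
# Sub-frame LSPs as a linear interpolation between the previous and
# current frame LSP vectors (illustrative weights, one per sub-frame).

def interpolate_lsps(prev_lsps, curr_lsps, weight):
    """Blend two LSP vectors: weight=0 -> previous frame, 1 -> current."""
    return [(1.0 - weight) * p + weight * c
            for p, c in zip(prev_lsps, curr_lsps)]

def subframe_lsps(prev_lsps, curr_lsps, n_subframes=4):
    """One interpolated LSP vector per 7.5 ms sub-frame."""
    weights = [(k + 1) / n_subframes for k in range(n_subframes)]
    return [interpolate_lsps(prev_lsps, curr_lsps, w) for w in weights]
```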

    3. LABVIEW SPEECH CODING TOOL

    In this section, we present the software tool developed for

    teaching speech coding theory and the CELP algorithm

    using the National Instruments LabVIEW package.

    Implementation of highly complex signal processing

    algorithms involves integration of several software and

    hardware components developed across different platforms.

    Hence, there is a need for a scalable framework that

provides flexibility for extensions and the ability to perform

    detailed analysis under different system conditions. Such a

    framework can be realized using two different approaches:

    a) hybrid programming and b) integration of existing

    software.

    Hybrid programming combines the inherent graphical

    programming functions of LabVIEW with the textual

programming using MathScript. The primary limitations of

    this approach are the speed of execution and the overhead

    involved in converting the external source code from the

native programming language to MathScript. The other

    approach is to integrate existing software and this requires a

    complete understanding of the underlying platform in which

    the native code was developed. The primary challenge in

    this case is to develop suitable software interfaces for

LabVIEW to communicate with the different components. The important limitation of this approach is that extension of

    the algorithm and modification of the native source code

    may be required. However, we base our speech coding tool

    on this approach, since hybrid programming is not fast

    enough to realize real-time applications.

    Figure 4. An example LabVIEW model using the built dll.

    3.1. Basic Framework

The framework has been built using shared libraries that

exploit an existing MATLAB implementation along with

LabVIEW's native functionalities. We make use of shared

    libraries that are built from the native implementation and

    integrated with software/hardware components developed in

    LabVIEW. The basic steps involved in building the speech

coding tool using LabVIEW are illustrated in Figure 3.

i) MATLAB implementation: The speech coding/processing algorithms are implemented using MATLAB [10]. This

    implementation includes functions from specific toolboxes.

    The inputs and outputs of the speech coding tool are

    identified.

    ii) Create shared library from MATLAB: The MATLAB

    compiler is used to build a shared C library of the algorithm.

    This requires the MATLAB Component Runtime (MCR).

    iii) Create C++ wrapper: A C++ wrapper is built to

    interface the MATLAB library and LabVIEW. This step

    includes the identification of the functions that are to be

    exposed to LabVIEW.

iv) Make library for LabVIEW: A new shared library is built

over the wrapper code. In effect, invoking a function of this

new library will implicitly invoke the functions in the

    MATLAB shared library.

v) Build user interface for the tool: The Call Library node is used

    to call the external shared library in LabVIEW. A graphical



    interface is developed to handle the inputs and outputs of

the library function.

    3.2. Challenges

    The primary challenge involved in this process is in the

creation of the wrapper function. In addition to

translating data types between MATLAB and

LabVIEW, it needs to account for memory management issues in

    LabVIEW. When LabVIEW loads a VI, it loads all the

    subVIs into memory. Specifically, it loads all the shared

    libraries (*.dll) used. The dlls are erased from memory only

    when the top-level VI is closed. This is a problem when

MATLAB libraries are used in our speech coding tool. We cannot initialize the MATLAB dll again once we have

    terminated the tool. This is because the dlls are not erased

    from the memory unless the LabVIEW application is

    restarted. This implies that we are only able to run the dll

once. Therefore, the solution we adopt is to split the

    initialize, execute and terminate functions of the MATLAB

libraries. Then, when running the tool in LabVIEW, we

initialize the libraries only once before running the dll

functions, and terminate them just before shutdown. Figure 4

    illustrates an example LabVIEW model that uses the built

    dll.
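The lifecycle split described above (initialize once, execute many times, terminate only at shutdown) can be expressed as a simple contract. The Python class below merely mimics that contract; no MATLAB dll is actually loaded, and the class name is ours:

```python
# Conceptual sketch of the initialize/execute/terminate split used for
# the MATLAB shared libraries. This stands in for the real dll calls.

class SharedLibrarySession:
    def __init__(self):
        self.initialized = False
        self.calls = 0

    def initialize(self):
        """Done exactly once, before any library call."""
        if self.initialized:
            raise RuntimeError("already initialized")
        self.initialized = True

    def execute(self, *args):
        """May be called repeatedly while the session is live."""
        if not self.initialized:
            raise RuntimeError("initialize() must be called first")
        self.calls += 1
        return args   # stand-in for the real coder invocation

    def terminate(self):
        """Done once, just before the tool shuts down."""
        self.initialized = False
```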

    4. USER INTERFACE OF THE TOOL

Figure 5 shows the user interface of the LabVIEW speech coding tool. The interface consists of multiple tabs that

    illustrate several modules of the FS-1016 algorithm. The

software can access either an audio (.wav) file or real-time

    speech input. The user also has options to change certain

    speech parameters to analyze the performance and behavior

    of the algorithm under different conditions. The

    preprocessed input speech is displayed and processed on a

    frame-by-frame basis. Frame-by-frame display is also used

    to view the spectra of the decoded output speech frames, the

    LP spectral envelopes before and after quantizing the LSPs,

    pole-zero plots of the synthesis filter, synthesized speech

waveforms, etc. The software has options to save the output

speech. The user can also analyze the subjective quality of these algorithms by listening to the synthesized speech with

    the aid of the playback feature.

    In the following sections, the various outputs obtained

    with the LabVIEW tool for a single frame of speech will be

    shown. The analyzed frame is a voiced frame with a pitch

    period of 65 samples.
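The pitch period of a voiced frame can be estimated, much as the LTP search does, by locating the autocorrelation peak over the FS-1016 integer lag range of 20 to 147 samples. This Python sketch is simplified (no non-integer lags, no perceptual weighting), and the function name is ours:

```python
# Estimate the pitch period of a frame by picking the lag with the
# largest autocorrelation over the FS-1016 integer lag range 20..147.

def estimate_pitch(frame, lo=20, hi=147):
    best_lag, best_val = lo, float("-inf")
    for lag in range(lo, min(hi, len(frame) - 1) + 1):
        # autocorrelation at this lag
        val = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag
```

For a synthetic pulse train with period 65 samples (like the analyzed frame), the estimator returns 65.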

    Figure 5. User interface of the LabVIEW speech coding tool.

Figure 6. (a) Input spectrum and (b) output spectrum for the given frame of speech data.



    4.1. Input and Output Spectra

    The Fourier magnitude spectra of the input speech frame and

    the output speech frame can be observed using the tool. In

    Figure 6, the spectra for the analyzed frame as obtained from

    the LabVIEW tool are shown. The user can analyze the

    spectra of any desired frame.
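The spectra displayed by the tool can be reproduced offline by windowing a frame and taking the magnitude of its DFT. A plain-Python DFT keeps this sketch self-contained; in practice an FFT routine would be used, and the function name is ours:

```python
# One-sided magnitude spectrum of a speech frame: Hamming window
# followed by a direct DFT (illustrative; use an FFT in practice).
import cmath
import math

def magnitude_spectrum(frame):
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
                for k, x in enumerate(frame)]   # Hamming window
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * t * f / n)
                    for t in range(n)))
            for f in range(n // 2 + 1)]         # bins 0..n/2
```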

    4.2. Quantized and Unquantized LP Spectra

    The LP spectra obtained before and after quantizing the

    LSPs for the given frame are shown in Figure 7. This feature

of the tool is very useful for analyzing the spectral distortion caused by the quantization of LSP parameters.

    The roots of the input LP polynomial obtained using the

    unquantized LSPs and the output polynomial obtained from

    the quantized LSPs are shown in Figure 9. It can be seen that

the output LP filter is still stable after quantization. This can be used to demonstrate the preservation of stability by

    quantizing LSPs instead of LP coefficients. Quantization of

    LP coefficients is more likely to result in an unstable filter.
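The stability claim above can be checked without explicit root finding: the synthesis filter 1/A(z) is stable iff every reflection coefficient produced by the step-down (reverse Levinson) recursion has magnitude below one. A Python sketch, assuming the predictor convention A(z) = 1 - sum_j a_j z^-j (the function name is ours):

```python
# Stability test for an LP synthesis filter 1/A(z) via the step-down
# recursion: stable iff all reflection coefficients satisfy |k| < 1.

def is_stable(a):
    """a = predictor coefficients a[1..p] with A(z) = 1 - sum a_j z^-j."""
    c = [1.0] + [-aj for aj in a]   # polynomial coefficients of A(z)
    for m in range(len(a), 0, -1):
        k = c[m]                    # reflection coefficient (c[0] stays 1)
        if abs(k) >= 1.0:
            return False
        # step down to the order-(m-1) polynomial
        c = [(c[j] - k * c[m - j]) / (1.0 - k * k) for j in range(m)]
    return True
```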

    4.3. Subjective Quality

    The subjective quality of the speech coder can be analyzed

    using the options to play back the postfiltered, high-pass

    filtered and non-postfiltered speech as shown in Figure 8.

    5. UTILITY IN EDUCATION AND ASSESSMENT

    The main educational objective of the LabVIEW speech

    coding tool developed in this paper is to introduce and

    demonstrate the concepts of speech coding, in particular

    coding based on analysis-by-synthesis methods. The

    interactive visual interface of LabVIEW is intuitive and the

    tabbed interface of the tool allows the students to visualize

    various concepts of speech coding simultaneously, which is

    not possible when text-based programming languages are

    used. The tool can also be extended easily to include other

    outputs that are useful for student learning. This can be used

    to demonstrate speech coding in a DSP class or in a more

    advanced speech coding class. The authors have written

    books on audio coding [4] and FS-1016 [10], which contain

    exercises and demonstrations of the FS-1016 in MATLAB.

    The proposed tool can be used along with these books for

    demonstrating speech coding concepts.

    The following exercises will be developed and

    presented to the students as a part of the proposed

    assessment. Assessment results will be generated after

    introducing the students to the speech coding tool in the

    DSP class at Arizona State University. The assessment

    results will include pre- and post-quizzes on the

    fundamentals of speech coding.

5.1. Analysis of Voiced/Unvoiced/Mixed Frames

    The students are required to identify voiced, unvoiced and

mixed frames from a speech file. They will plot the time-

domain input and output waveforms, Fourier spectra, and the

unquantized and quantized LP spectra. The time and

    frequency domain characteristics of the voiced/unvoiced and

    mixed frames will be analyzed.

    5.2. Subjective Quality Analysis

    The students will be asked to evaluate the performance of

    the FS-1016 coder with the speech files provided. The three

    speech outputs, (a) postfiltered speech, (b) non-postfiltered

    speech and, (c) highpass filtered speech will be listened to

    and the differences in subjective quality will be analyzed.

    The students will also provide a MOS, which is a measure of

    perceived speech quality.

    5.3. Pitch Forcing

    The students will have the option to force the pitch to a

    predefined value, using the Force Pitch option in the tool.

Different pitch period values, e.g., 40, 75 and 110 samples,

Figure 7. (a) Input LP spectrum and (b) output LP spectrum for the given frame of speech data.

    Figure 8. Options for subjective quality analysis.



    will be forced and the students will evaluate the perceptual

    quality of output speech.

    6. CONCLUSIONS

    In this paper, a LabVIEW speech coding tool that

    implements the FS-1016 algorithm was presented. The steps

    involved in creating the software tool from the existing

    MATLAB implementation of FS-1016 were described.

    The tool will be very useful to students and practitioners of

    DSP for teaching and understanding the principles behind

CELP-based speech coding algorithms.

7. ACKNOWLEDGEMENTS

    Portions of this work have been sponsored by the ASU

SenSIP Center National Instruments project and the NSF

    CCLI award 0443137.

    8. REFERENCES

[1] A. Spanias, "Speech Coding: A Tutorial Review," Proceedings of the IEEE, Vol. 82, No. 10, Oct. 1994.

[2] V. Atti, "A Simulation Tool for Introducing Algebraic CELP (ACELP) Coding Concepts in a DSP Course," IEEE 2002 DSP Workshop, Callaway, Georgia, Oct. 2002.

[3] A. Spanias and E.M. Painter, "A Software Tool for Introducing Speech Coding Fundamentals in a DSP Course," IEEE Trans. on Education, Vol. 39, No. 2, pp. 143-152, May 1996.

[4] A. Spanias, T. Painter and V. Atti, Audio Signal Processing and Coding, Wiley, Feb. 2007, ISBN 0-471-79147-4.

[5] A. Spanias, Digital Signal Processing: An Interactive Approach, Jan. 2007, ISBN 978-1-4243-2524-5.

[6] V. Atti, "Interactive On-line Undergraduate Laboratories Using J-DSP," IEEE Trans. on Education, Special Issue on Web-based Instruction, Vol. 48, No. 4, pp. 735-749, Nov. 2005.

[7] A. Spanias, "Speech Coding Standards," Chapter 3, pp. 25-44, invited, Ed. G. Gibson, Academic Press, 2000, ISBN 0-12-282160-2.

[8] FS-1016 CELP C Code Implementation, available at: ftp://svr-ftp.eng.cam.ac.uk/comp.speech/coding/celp_3.2a.tar.Z

[9] M.R. Schroeder and B. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates," Proc. ICASSP-85, p. 937, Apr. 1985.

[10] K. Ramamurthy and A. Spanias, MATLAB Software for the Code Excited Linear Prediction Algorithm: The Federal Standard-1016, Morgan & Claypool, 2010.

[11] J.P. Campbell Jr., T.E. Tremain and V.C. Welch, "The Federal Standard 1016 4800 bps CELP Voice Coder," Digital Signal Processing, Academic Press, Vol. 1, No. 3, pp. 145-155, 1991.

[12] LabVIEW Fundamentals, available at: http://www.ni.com/pdf/manuals/374029a.pdf

[13] A. Spanias, K. Natesan, J. Jayaraman and P. Spanias, "Work in Progress - Teaching Speech Signal Processing and Coding Using LabVIEW™," Proc. IEEE FIE, pp. T1C-22-T1C-23, Oct. 2007.

[14] ITU Recommendation G.723.1, "Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 and 6.3 kb/s," Draft 1995.

[15] TIA/EIA/IS-641, "Cellular/PCS Radio Interface - Enhanced Full-Rate Speech Codec," TIA, 1996.

[16] TIA/EIA/IS-127, "Enhanced Variable Rate Codec, Speech Service Option 3 for Wideband Spread Spectrum Digital Systems," TIA, 1997.

[17] ITU Study Group 15 Draft Recommendation G.729, "Coding of Speech at 8 kb/s Using Conjugate-Structure Algebraic-Code-Excited Linear-Prediction (CS-ACELP)," 1995.

[18] R. Ekudden, R. Hagen, I. Johansson and J. Svedburg, "The Adaptive Multi-Rate Speech Coder," Proc. IEEE Workshop on Speech Coding, pp. 117-119, Jun. 1999.

[19] Y. Gao et al., "The SMV Algorithm Selected by TIA and 3GPP2 for CDMA Applications," Proc. IEEE ICASSP-01, Vol. 2, pp. 709-712, May 2001.

[20] W.B. Kleijn, "Source-Dependent Channel Coding and its Application to CELP," in Advances in Speech Coding, Eds. B. Atal, V. Cuperman and A. Gersho, pp. 257-266, Kluwer Academic Publishers, 1990.

[21] D. Lin, "New Approaches to Stochastic Coding of Speech Sources at Very Low Bit Rates," Proc. EUSIPCO-86, p. 445, 1986.

    Figure 9. Roots of input (left) and output (right) LP

    polynomials.
