Chapter 16 Speech Synthesis Algorithms 16.1 Synthesis based on LPC 16.2 Synthesis based on formants 16.3 Synthesis based on homomorphic processing 16.4

Chapter 16 Speech Synthesis Algorithms

• 16.1 Synthesis based on LPC

• 16.2 Synthesis based on formants

• 16.3 Synthesis based on homomorphic processing

• 16.4 PSOLA (Pitch Synchronous Overlap-Add) Algorithm for Synthesis

• 16.5 Synthesis based on addition of sine functions

16.1 Synthesis based on LPC (1)

• $x(n) = \sum_{i=1}^{p} a_i\, x(n-i)$

• For every frame of the original speech, the $p$ coefficients $a_i$ are extracted by the LPC algorithm and stored in memory together with the first $p$ samples of the frame. When synthesis is required, the later samples can be generated with the formula above.
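The recursion above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the slide does not say which LPC estimation method is used, so `estimate_lpc` uses a plain least-squares fit, and the function names and the toy frame (a decaying resonance that obeys an order-2 recursion exactly) are hypothetical.

```python
import numpy as np

def estimate_lpc(frame, p):
    """Least-squares estimate of a_1..a_p such that x(n) ~ sum_i a_i x(n-i).

    Assumption: a least-squares fit stands in for "the LPC algorithm".
    """
    frame = np.asarray(frame, dtype=float)
    # The row for sample n holds x(n-1), x(n-2), ..., x(n-p).
    rows = np.array([frame[n - p:n][::-1] for n in range(p, len(frame))])
    targets = frame[p:]
    a, *_ = np.linalg.lstsq(rows, targets, rcond=None)
    return a

def synthesize_lpc(a, seed, length):
    """Regenerate `length` samples from the first p samples and the a_i."""
    p = len(a)
    x = np.zeros(length)
    x[:p] = seed
    for n in range(p, length):
        # x(n) = sum_{i=1..p} a_i x(n-i)
        x[n] = np.dot(a, x[n - p:n][::-1])
    return x

# Toy frame: a decaying resonance that follows an order-2 recursion exactly.
r, theta = 0.98, 0.3
true_a = np.array([2.0 * r * np.cos(theta), -r ** 2])
frame = synthesize_lpc(true_a, seed=[1.0, 0.5], length=200)

# Store only the coefficients and the first p samples, then rebuild the frame.
a_hat = estimate_lpc(frame, p=2)
rebuilt = synthesize_lpc(a_hat, seed=frame[:2], length=len(frame))
max_error = np.max(np.abs(rebuilt - frame))
```

For real speech the prediction is not exact, so a frame rebuilt this way only approximates the original.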

16.2 Synthesis based on formants (1)

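• The transfer characteristic of a formant filter:

• $y(n) = a\,x(n) - b\,y(n-1) - c\,y(n-2)$

• where $a = 1 + b + c$,

• $b = -2\exp(-\pi B T_s)\cos(2\pi F T_s)$

• $c = \exp(-2\pi B T_s)$

• $B$ is the bandwidth, $F$ is the resonance frequency of the filter, and $T_s$ is the sampling period (the reciprocal of the sampling frequency)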
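As a minimal sketch of the difference equation above, the following NumPy code computes the three coefficients and runs the recursion sample by sample. The function names and the example values (F = 500 Hz, B = 50 Hz, 8 kHz sampling) are assumptions made for illustration only.

```python
import numpy as np

def formant_coefficients(F, B, fs):
    """Coefficients a, b, c of y(n) = a x(n) - b y(n-1) - c y(n-2)."""
    Ts = 1.0 / fs  # sampling period
    b = -2.0 * np.exp(-np.pi * B * Ts) * np.cos(2.0 * np.pi * F * Ts)
    c = np.exp(-2.0 * np.pi * B * Ts)
    a = 1.0 + b + c  # makes the gain at 0 Hz equal to 1
    return a, b, c

def formant_filter(x, F, B, fs):
    """Run the second-order recursion over the input samples."""
    a, b, c = formant_coefficients(F, B, fs)
    y = np.zeros(len(x))
    y1 = y2 = 0.0  # y(n-1) and y(n-2)
    for n, xn in enumerate(x):
        yn = a * xn - b * y1 - c * y2
        y[n] = yn
        y2, y1 = y1, yn
    return y

def magnitude_response(F, B, fs, freqs):
    """|H| of the filter at the given frequencies in Hz."""
    a, b, c = formant_coefficients(F, B, fs)
    z_inv = np.exp(-2j * np.pi * np.asarray(freqs) / fs)
    return np.abs(a / (1.0 + b * z_inv + c * z_inv ** 2))

fs = 8000.0
freqs = np.arange(0.0, 4000.0, 1.0)
response = magnitude_response(500.0, 50.0, fs, freqs)
peak_frequency = freqs[np.argmax(response)]          # close to F
step_output = formant_filter(np.ones(4000), 500.0, 50.0, fs)  # settles at 1
```

The choice $a = 1 + b + c$ normalizes the gain at zero frequency to one, which is why the step response settles at 1.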

Synthesis based on formants (2)

• Placing several such filters in the formant range, with $F_1, F_2, F_3, \dots$ as their resonance frequencies, makes the whole system approximate the transfer characteristic of the vocal tract

• The formant filters can be connected in cascade (series) or in parallel
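The cascade connection can be sketched as follows: each resonator filters the output of the previous one, and the overall response is the product of the individual responses. The formant values, bandwidths, sampling rate and pulse-train excitation below are hypothetical choices for a vowel-like sound, not values from the slides; a parallel connection would instead sum the resonator outputs with per-formant gains.

```python
import numpy as np

def resonator_coefficients(F, B, fs):
    """a, b, c of one formant resonator (same formulas as on the slide)."""
    Ts = 1.0 / fs
    b = -2.0 * np.exp(-np.pi * B * Ts) * np.cos(2.0 * np.pi * F * Ts)
    c = np.exp(-2.0 * np.pi * B * Ts)
    return 1.0 + b + c, b, c

def cascade_synthesize(excitation, formants, fs):
    """Pass the excitation through the resonators one after another."""
    y = np.asarray(excitation, dtype=float)
    for F, B in formants:
        a, b, c = resonator_coefficients(F, B, fs)
        out = np.zeros(len(y))
        for n in range(len(y)):
            out[n] = a * y[n]
            if n >= 1:
                out[n] -= b * out[n - 1]
            if n >= 2:
                out[n] -= c * out[n - 2]
        y = out
    return y

def cascade_response(formants, fs, freqs):
    """Magnitude response of the cascade: product of the resonator responses."""
    z_inv = np.exp(-2j * np.pi * np.asarray(freqs) / fs)
    H = np.ones(len(z_inv), dtype=complex)
    for F, B in formants:
        a, b, c = resonator_coefficients(F, B, fs)
        H *= a / (1.0 + b * z_inv + c * z_inv ** 2)
    return np.abs(H)

fs = 10000.0
# Hypothetical (F, B) pairs in Hz for three formants.
formants = [(700.0, 60.0), (1200.0, 80.0), (2600.0, 120.0)]
excitation = np.zeros(2000)
excitation[::100] = 1.0  # pulse train with a 100-sample period (100 Hz pitch)
vowel = cascade_synthesize(excitation, formants, fs)

freqs = np.arange(0.0, 5000.0, 1.0)
response = cascade_response(formants, fs, freqs)
```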

Synthesis based on homomorphic processing (1)

• After homomorphic processing:

• $x(n) = e(n) + v(n)$

• For voiced speech, $e(n)$ is a periodic sequence. Suppose the period is $N$:

• $e(n) = \sum_{r=0}^{R} \delta(n - rN)$

• $e(n)$ is nonzero only at $n = mN$.

• It is therefore easy to separate out $e(n)$ and restore it.
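One common way to realize this separation is the real cepstrum, where the excitation contributes peaks at multiples of the pitch period and the vocal-tract part sits at low quefrency. The sketch below assumes that reading of "homomorphic processing"; the toy frame, the single-resonance impulse response, the lifter cutoff of 30 samples and the search range are all hypothetical.

```python
import numpy as np

def real_cepstrum(frame, n_fft=1024):
    """Inverse FFT of the log magnitude spectrum."""
    log_magnitude = np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-12)
    return np.fft.irfft(log_magnitude, n_fft)

# Toy voiced frame: pulse train of period N convolved with a decaying resonance.
fs = 8000
N = 80
frame_len = 400
excitation = np.zeros(frame_len)
excitation[::N] = 1.0
k = np.arange(frame_len)
h = 0.95 ** k * np.cos(2.0 * np.pi * 700.0 * k / fs)  # hypothetical vocal tract
voiced = np.convolve(excitation, h)[:frame_len] * np.hamming(frame_len)

cepstrum = real_cepstrum(voiced)
n_fft = len(cepstrum)

# Low-time lifter: keep quefrencies |n| < cutoff (vocal-tract part).
cutoff = 30
low_time = cepstrum.copy()
low_time[cutoff:n_fft - cutoff + 1] = 0.0
# What remains is dominated by the excitation part.
high_time = cepstrum - low_time

# Smoothed log spectral envelope from the low-time part.
log_envelope = np.fft.rfft(low_time).real

# The excitation part is concentrated at multiples of the period.
search = np.arange(40, 121)
estimated_period = int(search[np.argmax(np.abs(high_time[search]))])
```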

16.4 PSOLA (Pitch Synchronous Overlap-Add) Algorithm for Synthesis (1)

• This algorithm was proposed by F. Charpentier and E. Moulines at the end of the 1980s. Its advantages are relatively low computational complexity and good clarity and naturalness. In particular, TD-PSOLA (time-domain PSOLA) can meet real-time requirements.

• The principle of PSOLA

PSOLA Algorithm for Synthesis (2)

• The algorithm originates from overlap-add reconstruction of a signal from its short-time Fourier transform.

• The short-time Fourier transform of $x(n)$ is:

• $X_n(e^{j\omega}) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\, e^{-j\omega m}$

• For every $n$ this is a continuous function of frequency, so the representation is redundant and it is enough to take one sample every $R$ samples:

• $Y_r(e^{j\omega}) = X_n(e^{j\omega})\big|_{n=rR}$. Its inverse transform is

• $y_r(m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} Y_r(e^{j\omega})\, e^{j\omega m}\, d\omega = x(m)\, w(rR-m)$

• Adding the $y_r(m)$ gives

• $y(m) = \sum_{r=-\infty}^{\infty} y_r(m) = \sum_{r=-\infty}^{\infty} x(m)\, w(rR-m) = x(m) \sum_{r=-\infty}^{\infty} w(rR-m)$

• It can be shown that when $R \le N/4$ (with $N$ the window length)

• $\sum_{r} w(rR-m) \approx W(e^{j0})/R$,

• so $y(n) \approx x(n)\, W(e^{j0})/R$

• So $y(n)$ and $x(n)$ differ only by a constant factor!

• If a Hanning window is used, an exact relation can be derived:

• $\sum_{r=-\infty}^{\infty} w(rN/2 - m) \equiv 1$ for any $m$

• If $x(n)$ is voiced with period $N_p$, we can use a Hanning window to cut out segments two periods long ($2N_p$) and add them with a delay of $N_p$. Under the ideal periodic condition, the original signal can be restored: $x(n) = \sum_{r} w(rN_p - n)\, x(n)$
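Both overlap-add relations can be checked numerically. The sketch below assumes the periodic form of the Hanning window, $w(n) = 0.5 - 0.5\cos(2\pi n/N)$ for $n = 0, \dots, N-1$, for which the sums are exact; the window length of 256 and the helper names are arbitrary choices. Note that `np.hanning` returns the symmetric variant, for which the sum is only approximately constant.

```python
import numpy as np

def periodic_hann(N):
    """Hanning window w(n) = 0.5 - 0.5 cos(2 pi n / N), n = 0..N-1."""
    n = np.arange(N)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * n / N)

def overlap_sum(window, hop, n_frames):
    """Sum of n_frames copies of the window, each shifted by `hop` samples."""
    N = len(window)
    total = np.zeros(hop * (n_frames - 1) + N)
    for r in range(n_frames):
        total[r * hop:r * hop + N] += window
    return total

N = 256
w = periodic_hann(N)

# Hop R = N/2: the shifted windows sum to exactly 1.
half_hop = overlap_sum(w, N // 2, 20)
interior_half = half_hop[N:-N]  # skip the edges, where windows are missing

# Hop R = N/4: the sum equals W(e^{j0}) / R, where W(e^{j0}) = sum of w(n).
quarter_hop = overlap_sum(w, N // 4, 20)
interior_quarter = quarter_hop[N:-N]
expected_quarter = w.sum() / (N // 4)
```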


• In practice the ideal periodic condition does not hold and the reconstruction condition is not completely satisfied. Moreover, we want to change the pitch, duration and intensity, so we do not want to reconstruct the original signal exactly.

• By using PSOLA, we can minimize the mean-square spectral distance between the original and the synthesized signal

PSOLA Algorithm for Synthesis (6)

• $D[x(n), y(n)] = \int_{-\pi}^{\pi} \left| X_{t_m}(e^{j\omega}) - Y_{t_g}(e^{j\omega}) \right|^2 d\omega$

• where $t_m$ and $t_g$ are the pitch marks of $x(n)$ and $y(n)$ respectively

• The procedure for PSOLA

• 1. Pitch-synchronous analysis: mark the pitch as accurately as possible;

• 2. Time-scale change: for a given pitch adjustment parameter β and time adjustment parameter γ, determine the relation between the original pitch mark sequence and the synthesized pitch mark sequence;

• 3. Modify the analyzed short-time signals to create the synthesis short-time signals (TD-PSOLA only delays the signals; FD-PSOLA additionally adjusts them in the frequency domain);

• 4. Pitch-synchronous overlap-add processing to create the final version of the synthesized speech signal.


• Pitch-synchronous analysis: for unvoiced speech the marks are placed at a fixed period; for voiced segments the marks are placed at the correct pitch positions. This gives a series of pitch marks $\{t_m,\ m = 1, 2, \dots, M\}$

• Multiplying $x(n)$ by a series of window functions gives a series of short-time signals $x_m(n)$:

• $x_m(n) = w_m(t_m - n)\, x(n)$

• These $x_m(n)$ are an intermediate representation of the waveform.

• $w_m$ is a Hanning window. The window is longer than one pitch period and is centered on the pitch mark, so adjacent frames partly overlap.
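The analysis step can be sketched as follows. One assumption is made beyond the slide: each window is built from two Hanning halves, rising over the previous period and falling over the next one, so that adjacent windows sum to one even when the period varies. The function name and the returned `(start, segment)` layout are hypothetical.

```python
import numpy as np

def analysis_segments(x, marks):
    """Cut x into Hanning-windowed short-time signals centered on pitch marks.

    Returns a list of (start_index, segment) pairs, one per interior mark.
    Assumption: the window for mark t_m spans from t_(m-1) to t_(m+1).
    """
    x = np.asarray(x, dtype=float)
    segments = []
    for m in range(1, len(marks) - 1):
        left = marks[m] - marks[m - 1]    # previous pitch period
        right = marks[m + 1] - marks[m]   # next pitch period
        # Rising half over the previous period, falling half over the next.
        rise = 0.5 - 0.5 * np.cos(np.pi * np.arange(left) / left)
        fall = 0.5 + 0.5 * np.cos(np.pi * np.arange(right) / right)
        window = np.concatenate([rise, fall])
        start = marks[m] - left
        segments.append((start, window * x[start:marks[m] + right]))
    return segments

# Usage on a toy signal with a constant pitch period of 100 samples.
toy_signal = np.sin(2.0 * np.pi * np.arange(1000) / 100.0)
toy_marks = np.arange(0, 1000, 100)
toy_segments = analysis_segments(toy_signal, toy_marks)
```

Adding the segments back at their original positions restores the signal between the first and last interior marks, which is the ideal reconstruction case described earlier.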

PSOLA Algorithm for Synthesis (10)

• Time-scale changing

• To perform prosodic modification, we must determine the new pitch mark positions $t_q$ ($q = 1, \dots, Q$) on the synthesis axis and the mapping $t_m \to t_q$.

• The duration adjustment function $\gamma(n)$ and the pitch adjustment function $\beta(n)$ are the two parameters that determine the new marks and the mapping. They can be changed at the same time.

• Changing the pitch changes the pitch period, and with it the duration, so the duration must be adjusted to bring it back to the original duration. Both adjustments can also be done in one step.


• $x_m(n)$ is changed into $x_q(n)$ by the modification.

• Then the $x_q(n)$ are synthesized according to the new marks. This involves three operations: changing the number of short-time signals, changing the delay of the short-time signals, and changing each short-time signal itself.

• For TD-PSOLA, each synthesis short-time signal is simply a copy of an analysis short-time signal. First the analysis signals to be used are selected; then each is delayed by $\delta_q = t_q - t_m$: $x_q(n) = x_m(n + \delta_q) = x_m(n - t_m + t_q)$

• For FD-PSOLA, in addition to the processing above, $x_m(n + \delta_q)$ must be transformed in the frequency domain.
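A minimal sketch of the mark placement and mapping for TD-PSOLA follows. It makes several assumptions not stated on the slides: β and γ are constants rather than functions of time, β scales the pitch frequency (so the synthesis period is the local analysis period divided by β), γ scales the duration, and each synthesis mark is mapped to the nearest analysis mark. The function name is hypothetical.

```python
import numpy as np

def synthesis_marks(analysis_marks, beta, gamma):
    """Place synthesis marks t_q and map each one to an analysis mark index m.

    Assumptions: constant beta (pitch scale) and gamma (duration scale);
    the synthesis time t_q corresponds to the analysis time t_q / gamma.
    """
    analysis_marks = np.asarray(analysis_marks, dtype=float)
    periods = np.diff(analysis_marks)
    t_start, t_end = analysis_marks[0], analysis_marks[-1]
    out_end = t_start + gamma * (t_end - t_start)

    marks, mapping = [], []
    t_q = t_start
    while t_q <= out_end:
        # Position on the analysis axis that t_q corresponds to.
        t_src = t_start + (t_q - t_start) / gamma
        m = int(np.argmin(np.abs(analysis_marks - t_src)))
        marks.append(t_q)
        mapping.append(m)
        # Advance by the local pitch period, scaled by the pitch factor.
        local_period = periods[min(m, len(periods) - 1)]
        t_q += local_period / beta
    return np.array(marks), np.array(mapping)

t_m = np.arange(0, 1000, 100)  # analysis marks, period 100 samples
same_marks, same_map = synthesis_marks(t_m, beta=1.0, gamma=1.0)
slow_marks, slow_map = synthesis_marks(t_m, beta=1.0, gamma=2.0)
high_marks, high_map = synthesis_marks(t_m, beta=2.0, gamma=1.0)

# Delay applied to each selected analysis segment: delta_q = t_q - t_m.
slow_delays = slow_marks - t_m[slow_map]
```

With γ = 2 most analysis marks are used twice (segments are repeated), and with β = 2 the synthesis marks are spaced half a period apart while the total duration stays the same.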


• The overlap-add

• There are several ways to do the addition. One is:

• $x(n) = \dfrac{\sum_q \alpha_q\, x_q(n)\, w_q(t_q - n)}{\sum_q w_q^2(t_q - n)}$

• where the $\alpha_q$ are normalization factors and $w_q$ is the sequence of synthesis windows.

• Another, simpler way is:

• $x(n) = \dfrac{\sum_q \alpha_q\, x_q(n)}{\sum_q w_q(t_q - n)}$
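The whole TD-PSOLA chain with the simpler overlap-add formula can be sketched as below. The assumptions are the same as before and are not taken from the slides: constant β (pitch scale) and γ (duration scale), nearest-mark mapping, two-sided Hanning windows spanning the neighbouring marks, and $\alpha_q = 1$. Each selected segment, centered on $t_m$ in the original, is placed so that it is centered on $t_q$ in the output. All names are hypothetical.

```python
import numpy as np

def two_sided_hann(left, right):
    """Hanning window rising over `left` samples and falling over `right`."""
    rise = 0.5 - 0.5 * np.cos(np.pi * np.arange(left) / left)
    fall = 0.5 + 0.5 * np.cos(np.pi * np.arange(right) / right)
    return np.concatenate([rise, fall])

def td_psola(x, marks, beta=1.0, gamma=1.0):
    """Sketch of TD-PSOLA with constant pitch (beta) and duration (gamma) scales."""
    x = np.asarray(x, dtype=float)
    marks = np.asarray(marks, dtype=int)
    inner = marks[1:-1]  # marks that have a neighbour on both sides
    out_len = int(round(gamma * len(x)))
    numerator = np.zeros(out_len)    # sum_q x_q(n), with alpha_q = 1
    denominator = np.zeros(out_len)  # sum_q w_q(t_q - n)

    t_q = gamma * float(inner[0])
    while t_q / gamma <= inner[-1]:
        # Nearest analysis mark to the corresponding analysis time.
        m = 1 + int(np.argmin(np.abs(inner - t_q / gamma)))
        left = marks[m] - marks[m - 1]
        right = marks[m + 1] - marks[m]
        window = two_sided_hann(left, right)
        segment = window * x[marks[m] - left:marks[m] + right]

        # Place the copied segment so that its center lands on t_q.
        start = int(round(t_q)) - left
        lo, hi = max(start, 0), min(start + len(window), out_len)
        if hi > lo:
            numerator[lo:hi] += segment[lo - start:hi - start]
            denominator[lo:hi] += window[lo - start:hi - start]
        t_q += right / beta  # next synthesis mark

    y = np.zeros(out_len)
    covered = denominator > 1e-8
    y[covered] = numerator[covered] / denominator[covered]
    return y

# Usage: raise the pitch of a toy pulse train by a factor of two.
pulse_marks = np.arange(0, 1000, 100)
pulses = np.zeros(1000)
pulses[pulse_marks] = 1.0
higher = td_psola(pulses, pulse_marks, beta=2.0)
```

Dividing by the window sum matters once β differs from 1, because the shifted windows then no longer add up to a constant.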

16.5 Synthesis based on Sinusoidal Models (1)

• This technique starts from the spectral decomposition of the speech signal. The decomposition yields a series of frequencies, amplitudes and phases. By matching the frequency parameters and adjusting the amplitudes and phases, a new speech signal can be synthesized by adding the sine waves back together.

• Sinusoidal model of speech for analysis-by-synthesis

• The generation of speech can be seen as the result of passing a glottal excitation through a linear time-varying system.

• $s(t) = \int_{0}^{t} h(t-\tau, t)\, e(\tau)\, d\tau$

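Synthesis based on Sinusoidal Models (2)

• $e(t) = \sum_{l=1}^{L} a_l(t) \cos[\Omega_l(t)]$

• $\Omega_l(t) = \int_{0}^{t} \omega_l(\sigma)\, d\sigma + \varphi_l$

• $s(t) = \sum_{l=1}^{L} A_l(t) \cos[\theta_l(t)]$

• The transfer function $H(\omega, t)$ of the vocal tract is the Fourier transform of $h(t-\tau, t)$: $H(\omega, t) = M(\omega, t) \exp(j\psi(\omega, t))$

• $A_l(t) = a_l(t)\, M_l(t)$, $\theta_l(t) = \Omega_l(t) + \psi_l(t)$, where $M_l(t)$ and $\psi_l(t)$ are $M(\omega, t)$ and $\psi(\omega, t)$ evaluated at $\omega = \omega_l(t)$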
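A discrete-time sketch of the sum $s(t) = \sum_l A_l(t)\cos[\theta_l(t)]$ is given below. It assumes sample-wise amplitude and frequency tracks, frequencies in radians per sample, a cumulative sum in place of the phase integral, and a constant system phase folded into the initial phase; the function name and the example values are hypothetical.

```python
import numpy as np

def sinusoidal_synthesis(amplitudes, frequencies, initial_phases):
    """Sum of L sinusoids with time-varying amplitude and frequency.

    amplitudes, frequencies: arrays of shape (L, n_samples); frequencies are
    in radians per sample. initial_phases: length-L array (phase at n = 0).
    """
    amplitudes = np.asarray(amplitudes, dtype=float)
    frequencies = np.asarray(frequencies, dtype=float)
    phi = np.asarray(initial_phases, dtype=float)[:, None]
    # Discrete stand-in for the phase integral, arranged so theta[:, 0] = phi.
    theta = phi + np.cumsum(frequencies, axis=1) - frequencies[:, :1]
    return np.sum(amplitudes * np.cos(theta), axis=0)

# Two components: one steady, one gliding upward in frequency.
n_samples = 400
amps = np.vstack([np.full(n_samples, 1.0), np.linspace(0.5, 0.0, n_samples)])
freqs = np.vstack([np.full(n_samples, 0.2), np.linspace(0.4, 0.6, n_samples)])
speech_like = sinusoidal_synthesis(amps, freqs, [0.0, 1.0])
```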

• Analysis-by-synthesis of speech based on sinusoidal models

• 1. Estimation of the frequency, amplitude and phase parameters

• The conclusion is that the frequencies of the synthesized speech signal correspond to the frequencies at the peaks of the short-time Fourier transform (DFT) of the frame. The amplitudes and phases are those of the transform at these frequencies.

• In practice, we estimate the frequency, amplitude and phase parameters by peak extraction. Windowing produces a series of short-time speech signals. For good performance, the window length should be larger than two pitch periods of the current frame.

• The window used is a Hamming window of length 256 with 0%-50% overlap.

• After a 512-point FFT the spectrum is obtained. By peak extraction, the frequencies $\omega_l$, amplitudes $A_l$ and phases $\theta_l$ are obtained, $l = 1, \dots, L$; $L$ is generally 40-50.
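The peak-extraction step can be sketched as follows, using the Hamming window of length 256 and the 512-point FFT mentioned above. The details are assumptions: a peak is any local maximum of the magnitude spectrum, the strongest `max_peaks` are kept, amplitudes are scaled by the window sum, and phases are referred to the first sample of the frame. The function name and test tones are hypothetical.

```python
import numpy as np

def extract_peaks(frame, fs, n_fft=512, max_peaks=50):
    """Frequencies (Hz), amplitudes and phases at the spectral peaks of a frame."""
    window = np.hamming(len(frame))
    spectrum = np.fft.rfft(frame * window, n_fft)
    magnitude = np.abs(spectrum)

    # Local maxima of the magnitude spectrum (excluding the two end bins).
    bins = np.arange(1, len(magnitude) - 1)
    is_peak = (magnitude[bins] > magnitude[bins - 1]) & \
              (magnitude[bins] >= magnitude[bins + 1])
    peak_bins = bins[is_peak]

    # Keep the strongest peaks, then sort them by frequency.
    order = np.argsort(magnitude[peak_bins])[::-1][:max_peaks]
    kept = np.sort(peak_bins[order])

    frequencies = kept * fs / n_fft
    amplitudes = 2.0 * magnitude[kept] / window.sum()
    phases = np.angle(spectrum[kept])
    return frequencies, amplitudes, phases

# Toy frame: two sinusoids at 500 Hz and 1500 Hz, sampled at 8 kHz.
fs = 8000
n = np.arange(256)
frame = 1.0 * np.cos(2.0 * np.pi * 500.0 * n / fs + 0.3) \
      + 0.5 * np.cos(2.0 * np.pi * 1500.0 * n / fs - 1.0)
peak_freqs, peak_amps, peak_phases = extract_peaks(frame, fs, max_peaks=2)
```

The frequency resolution of this sketch is one FFT bin (fs / 512); interpolating around each peak would refine the estimates.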

• Frequency Matching

• Adjacent frames need frequency matching to make interpolation possible. After matching, the frequency tracks (matching loci) are obtained.
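The slides do not spell out the matching rule, so the sketch below uses a simple stand-in: greedy nearest-neighbour matching of the peak frequencies of two adjacent frames, accepting a pair only if the frequencies differ by at most a hypothetical `max_jump`. Peaks left unmatched are reported as tracks that end in the first frame or start in the second.

```python
def match_frequencies(prev_freqs, next_freqs, max_jump):
    """Greedy nearest-neighbour matching of peak frequencies between frames.

    Returns (pairs, ended, started): matched index pairs (i, j), indices of
    unmatched peaks in the previous frame, and unmatched peaks in the next.
    Assumption: this greedy rule is an illustration, not the slides' method.
    """
    # All candidate pairs within the allowed jump, closest first.
    candidates = sorted(
        (abs(fp - fn), i, j)
        for i, fp in enumerate(prev_freqs)
        for j, fn in enumerate(next_freqs)
        if abs(fp - fn) <= max_jump
    )
    used_prev, used_next, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_prev and j not in used_next:
            pairs.append((i, j))
            used_prev.add(i)
            used_next.add(j)
    ended = [i for i in range(len(prev_freqs)) if i not in used_prev]
    started = [j for j in range(len(next_freqs)) if j not in used_next]
    return sorted(pairs), ended, started

# Peak frequencies in Hz for two adjacent frames (made-up values).
frame_a = [200.0, 400.0, 610.0, 1000.0]
frame_b = [205.0, 395.0, 600.0, 1500.0]
pairs, ended, started = match_frequencies(frame_a, frame_b, max_jump=50.0)
```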

• Interpolation of amplitude and phase between two frames

• Experimental results.



