Kanru Hua (華侃如), June 19, 2016

The Past, Present and Future of Singing Voice Modeling

Motivation

“You are making too many assumptions, this thing won’t work on real speech signal.”

— Jont B. Allen

● What’s wrong with current and past research in this area?

● What’s our next step?

What’s in a Speech/Singing Synthesizer

Text / Music Score → Parameter Generator → Vocoder → Speech Audio

● Parameter Generator: generates pitch, duration, spectrum, ... from the input

● Vocoder: generates the waveform from the parameters

Part 1: History of Speech Analysis/Synthesis

(http://clas.mq.edu.au/speech/synthesis/history_synthesis/)

History of Math & Acoustics

● 1600–1700: Laws of forces/motions, foundation of calculus (Newton)

● 1700–1800: Wave equation, complex numbers (Bernoulli, Euler, d'Alembert)

● 1800–1900: Fourier/Laplace transform, analog circuits & electromagnetism (Gauss, Fourier, Laplace, Riemann, Cauchy, Kirchhoff, Heaviside)

● 1900–2000: Filtering theory, digital systems, sampling theory, ...

(http://www2.ling.su.se/staff/hartmut/kemplne.htm)

Source-Filter Model

Lung → Vocal Folds → Vocal Tract → Lip

Signal Generator (Source) → Filter 1 → Filter 2

Signal Generator → Filter 0 → Filter 1 → Filter 2

[Figure: source waveform over time (t) and filter frequency responses (f)]
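The "Signal Generator → Filter 1 → Filter 2" chain can be sketched in a few lines. This is only an illustration under stated assumptions, not anything taken from the slides: the signal generator is a unit impulse train, Filter 1 is a single two-pole resonance standing in for the vocal tract, and Filter 2 is a first difference standing in for lip radiation. All function names and parameter values are hypothetical.

```python
import numpy as np

def impulse_train(f0, fs, n):
    """Signal generator: one unit impulse per pitch period."""
    x = np.zeros(n)
    x[::int(round(fs / f0))] = 1.0
    return x

def resonator(x, freq, bandwidth, fs):
    """Filter 1 (assumed): a two-pole resonance, i.e. a single formant."""
    r = np.exp(-np.pi * bandwidth / fs)           # pole radius from bandwidth
    a1 = -2.0 * r * np.cos(2.0 * np.pi * freq / fs)
    a2 = r * r
    y = np.zeros(len(x))
    for i in range(len(x)):
        y[i] = x[i]
        if i >= 1:
            y[i] -= a1 * y[i - 1]
        if i >= 2:
            y[i] -= a2 * y[i - 2]
    return y

def lip_radiation(x):
    """Filter 2 (assumed): first difference, a common lip radiation approximation."""
    return np.diff(x, prepend=0.0)

fs = 16000
source = impulse_train(f0=200.0, fs=fs, n=1600)
speech = lip_radiation(resonator(source, freq=800.0, bandwidth=100.0, fs=fs))
```

The output keeps the 200 Hz harmonic structure of the source, while the filters only reshape the harmonic amplitudes. That separation is what the source-filter view relies on.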

20th Century, the Dawn of Speech Processing

● Cooley and Tukey (1965): Fast Fourier Transform

● Oppenheim (1969): one of the earliest digital implementations of speech analysis/synthesis

Input → Analysis → Pitch (source) + Cepstrum (vocal tract filter) → Synthesis → Output

[Figure: Spectrum]
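The analysis half of this diagram can be sketched with a plain cepstrum. The sketch below is an assumption-laden illustration of homomorphic analysis, not a reconstruction of Oppenheim's system: the pitch comes from the strongest high-quefrency peak, and the vocal tract envelope comes from the low-quefrency part of the cepstrum. The function name, lifter length, pitch search range and test signal are all hypothetical.

```python
import numpy as np

def cepstral_analysis(frame, fs, n_lifter=30, f0_range=(60.0, 500.0)):
    """Split one frame into a pitch estimate and a smooth log spectral envelope."""
    windowed = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(windowed)) + 1e-9)
    cepstrum = np.fft.ifft(log_mag).real

    # Pitch (source): strongest cepstral peak within the plausible period range.
    q_lo = int(fs / f0_range[1])
    q_hi = int(fs / f0_range[0])
    period = q_lo + int(np.argmax(cepstrum[q_lo:q_hi]))
    f0 = fs / period

    # Envelope (vocal tract filter): keep only low quefrencies (symmetric lifter).
    lifter = np.zeros(len(frame))
    lifter[:n_lifter] = 1.0
    lifter[-n_lifter + 1:] = 1.0
    log_envelope = np.fft.fft(cepstrum * lifter).real
    return f0, log_envelope

# Hypothetical test frame: 200 Hz impulse train through a decaying 700 Hz resonance.
fs, n = 16000, 1024
excitation = np.zeros(n)
excitation[::80] = 1.0
k = np.arange(200)
resonance = 0.97 ** k * np.cos(2.0 * np.pi * 700.0 / fs * k)
frame = np.convolve(excitation, resonance)[:n]

f0, log_envelope = cepstral_analysis(frame, fs)
```

Synthesis would then drive a filter built from `log_envelope` with a pulse train at `f0`. That step is omitted here.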

Family Tree of Speech A/S Algorithms

[Figure: family tree linking the following models, algorithms and software]

Models and algorithms:

● Source-Filter Model

● Phase Vocoder (Flanagan et al., 1966)

● Linear Prediction (Atal & Schroeder, 1967)

● Homomorphic Filtering (Oppenheim, 1969)

● LSP/LSF (Itakura, 1975)

● CELP (Atal & Schroeder, 1983)

● MGC/MLSA (Imai, et al., 1983)

● PSOLA (?, 1985)

● Sinusoidal Model (McAulay & Quatieri, 1986)

● SMS (Serra, 1989)

● Harmonic+Noise (Stylianou, 1993)

● STRAIGHT (Kawahara, 1998)

● Sine+Noise+Transient (Levine & Smith, 1998)

● NBVPM (Bonada, 2004)

● TANDEM-STRAIGHT (Kawahara & Morise, 2007)

● WBVPM (Bonada, 2008)

● Quasi-Harmonic Model (Pantazis, et al., 2008)

● WORLD1 (Morise, 2009)

● WORLD2 (Morise, 2013)

● Vocaine (Agiomyrgiannakis, 2015)

Software and applications:

● Autotune, Melodyne, Chiptune

● Vocaloid, Vocaloid 2+

● Sinsy, CeVIO

● NiaoNiao, tn_fnds

● Rocaloid 3, RUCE (Rocaloid 4)

Part 2: What’s Wrong

Quasi-static Assumption

Algorithms affected:

● Homomorphic Filtering

● PSOLA

● Linear Prediction & CELP & MLSA

● Sinusoidal Model

● Harmonic+Noise Model

● SMS & NBVPM

● WORLD & STRAIGHT (slightly)
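The slide does not spell the assumption out. Taking "quasi-static" in its usual sense, namely that pitch and spectrum are treated as constant within each analysis frame, a small experiment shows what happens when that fails. The sketch compares how much spectral energy stays near the strongest peak for a steady tone and for a tone that glides within one frame. The signals and the concentration measure are hypothetical choices made for illustration.

```python
import numpy as np

def peak_concentration(frame):
    """Fraction of spectral energy in the 3 bins around the strongest peak."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    k = int(np.argmax(power))
    return power[max(k - 1, 0):k + 2].sum() / power.sum()

fs, n = 16000, 1024
t = np.arange(n) / fs

# Steady tone: the quasi-static assumption holds inside the frame.
steady = np.sin(2.0 * np.pi * 440.0 * t)

# Gliding tone: frequency rises linearly from 440 Hz to 880 Hz inside the frame.
rate = (880.0 - 440.0) / (n / fs)                 # Hz per second
glide = np.sin(2.0 * np.pi * (440.0 * t + 0.5 * rate * t ** 2))

steady_score = peak_concentration(steady)
glide_score = peak_concentration(glide)
```

For the steady tone nearly all energy sits in one narrow peak. For the glide it is smeared across many bins, so a frame-wise model that expects one sharp peak per harmonic fits it poorly.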

Mis-represented Aperiodic Component

Popular belief:

1. Speech = periodic signal + aperiodic signal (breathing noise)

2. Aperiodic signal is filtered white noise

[Figure: waveform over time (t), labeled "Aperiodic" and "Periodic (Friction)"]

Algorithms affected:

● (Quasi-)Harmonic+Noise Model

● SMS & Sines+Noise+Transients Model

● WORLD & (TANDEM-)STRAIGHT

● Algorithms that do not model aperiodic component

○ Phase vocoder, CELP, MLSA, ...
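For reference, the "popular belief" above can be written down as a synthesis rule: a sum of harmonics plus white noise passed through a fixed filter. The sketch below implements exactly that belief, which is the representation this part of the talk criticizes, and not a recommended model. The function name, amplitudes and noise filter are hypothetical.

```python
import numpy as np

def harmonic_plus_noise(f0, fs, n, harmonic_amps, noise_fir, noise_gain, seed=0):
    """Speech = periodic part (harmonics) + aperiodic part (filtered white noise)."""
    t = np.arange(n) / fs
    periodic = np.zeros(n)
    for k, amp in enumerate(harmonic_amps, start=1):
        periodic += amp * np.sin(2.0 * np.pi * k * f0 * t)

    # Belief 2: the aperiodic part is white noise through a time-invariant filter,
    # so its statistics do not change over the course of a pitch period.
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n)
    aperiodic = noise_gain * np.convolve(white, noise_fir, mode="same")
    return periodic + aperiodic, periodic, aperiodic

signal, periodic, aperiodic = harmonic_plus_noise(
    f0=200.0, fs=16000, n=1600,
    harmonic_amps=[1.0, 0.5, 0.25],
    noise_fir=np.array([1.0, -0.9]),   # hypothetical high-pass coloring
    noise_gain=0.05,
)
```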

Over-simplified Source-Filter Model

Oscillator → Tract Filter → Lip Filter

Oscillator → Source Filter → Tract Filter

Assumption: the source filter is independent of pitch

Equivalent assumption: “When my pitch is higher by 12 semitones, my vocal folds still oscillate at the same speed.”

Affected algorithms: all of those listed in the Family Tree of Speech A/S Algorithms
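One way to read the equivalent assumption: if the source filter is fixed, each glottal pulse keeps the same absolute duration whatever the pitch. The sketch below is an illustration under that reading, with a hypothetical 40-sample pulse shape. It computes the fraction of each pitch period that the fixed pulse occupies at 100 Hz and at 200 Hz (12 semitones higher).

```python
import numpy as np

def glottal_train(f0, fs, n, pulse):
    """Oscillator (impulse train) followed by a fixed source filter (pulse shape)."""
    excitation = np.zeros(n)
    excitation[::int(round(fs / f0))] = 1.0
    return np.convolve(excitation, pulse)[:n]

fs = 16000
pulse = np.hanning(40)   # hypothetical pulse shape, fixed at 40 samples (2.5 ms)

pulse_fraction = {}      # share of each pitch period covered by the pulse
sources = {}
for f0 in (100.0, 200.0):
    period = fs / f0
    pulse_fraction[f0] = len(pulse) / period
    sources[f0] = glottal_train(f0, fs, 1600, pulse)
```

With a fixed source filter the pulse fills a quarter of the period at 100 Hz but half of it at 200 Hz. A pitch-independent source filter therefore forces the relative pulse shape to change with pitch, whether or not a real voice behaves that way.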

Part 3: Future: How to Fix & the Low Level Speech Model

“Neoclassical” Approaches to Speech Modeling

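[Figure: inverse filtering of the input into tract, source and lip components, plotted over time (t) and frequency (f)]

● OVE Synthesizer (Fant, 1953)

● Linear Prediction (Atal & Schroeder, 1967)

● LF Model (Liljencrants, Fant and Lin, 1985)

● ARX (Wen, et al., 1995)

● ARX-LF (Vincent, et al., 2005)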

● Degottex (2013): similar idea, but in the frequency domain

● Hua (2016, in progress): more robust under poor recording conditions; less sensitive to processed input

The Low Level Speech Model (new version)

Input Signal

Level 0 (Signal Level):

● Pitch

● Harmonic Model

● Noise Model: Spectrum; Channel 1 Energy, Channel 2 Energy, Channel 3 Energy, ... (a Harmonic Model per channel)

Level 1 (Acoustic Level):

● Glottal/Source Information (LF Model)

● Vocal Tract Filter

● Lip Filter

Output Signal

An acoustically meaningful speech model

Inverse Analysis of Speech

Original

Glottal Flow (Source Signal)
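The slides do not describe how the inverse analysis is computed. As a point of comparison only, the classical baseline from the list above, linear prediction, can be sketched as follows: estimate an all-pole filter from the signal, then apply its inverse to obtain a residual that roughly approximates the excitation. This is not the LLSM method, and the residual is not a glottal flow estimate. The function names and test signal are hypothetical.

```python
import numpy as np

def lpc(signal, order):
    """Autocorrelation-method linear prediction; returns [1, a1, ..., ap]."""
    n = len(signal)
    r = np.correlate(signal, signal, mode="full")[n - 1:n + order]  # lags 0..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:order + 1])
    return np.concatenate(([1.0], a))

def inverse_filter(signal, a):
    """Apply the FIR inverse of the estimated all-pole filter."""
    return np.convolve(signal, a)[:len(signal)]

# Hypothetical test signal: 200 Hz impulse train through one 800 Hz resonance.
fs = 16000
pole_radius = np.exp(-np.pi * 100.0 / fs)
true_a = np.array([1.0,
                   -2.0 * pole_radius * np.cos(2.0 * np.pi * 800.0 / fs),
                   pole_radius ** 2])
excitation = np.zeros(1600)
excitation[::80] = 1.0
speech = np.zeros(1600)
for i in range(1600):
    speech[i] = excitation[i]
    if i >= 1:
        speech[i] -= true_a[1] * speech[i - 1]
    if i >= 2:
        speech[i] -= true_a[2] * speech[i - 2]

a = lpc(speech, order=2)
residual = inverse_filter(speech, a)
```

On this idealized signal the residual collapses back to something close to the impulse train. Real recordings break the assumptions behind that result in the ways listed in Part 2.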

Pitch Shifting powered by LLSM

Original

50% Pitch

200% Pitch

Instants of vocal fold closure were revealed

References

● Oppenheim, A. V. "Speech Analysis-Synthesis System Based on Homomorphic Filtering." Journal of the Acoustical Society of America 45.2 (1969).

● Degottex, Gilles, et al. "Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis." Speech Communication 55.2 (2013): 278-294.

● Dunn, H. K. "The calculation of vowel resonances, and an electrical vocal tract." Journal of the Acoustical Society of America 22 (1950): 740-753.

● Pantazis, Yannis, and Yannis Stylianou. "Improving the modeling of the noise part in the harmonic plus noise model of speech." IEEE International Conference on Acoustics, Speech and Signal Processing (2008).