Kanru Hua (華侃如)
June 19, 2016
The Past, Present and Future of Singing Voice Modeling
Motivation
“You are making too many assumptions, this thing won’t work on real speech signal.”
— Jont B. Allen
● What’s wrong with current and past research in this area?
● What’s our next step?
What’s in a Speech/Singing Synthesizer
Text / Music Score → Parameter Generator → Vocoder → Speech Audio
● Parameter Generator: generates pitch, duration, spectrum… from the input
● Vocoder: generates the waveform from the parameters
Part 1
History of Speech Analysis/Synthesis
(http://clas.mq.edu.au/speech/synthesis/history_synthesis/)
History of Math & Acoustics
● 1600s: Newton. Law of Forces/Motions, Foundation of Calculus
● 1700s: Bernoulli, Euler, d’Alembert. Wave Equation, Complex Numbers
● 1800s: Gauss, Fourier, Laplace, Riemann, Cauchy, Kirchhoff, Heaviside. Fourier/Laplace Transform, Analog Circuits & Electromagnetism
● 1900s: Filtering Theory, Digital Systems, Sampling Theory, ...
(http://www2.ling.su.se/staff/hartmut/kemplne.htm)
Frequency Response
Source-Filter Model
Lung → Vocal Folds → Vocal Tract → Lip
Signal Generator (Source) → Filter 1 → Filter 2
Signal Generator → Filter 0 → Filter 1 → Filter 2
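The signal generator plus cascaded filters on this slide can be sketched in a few lines. This is only a minimal illustration, not a method from the talk: the impulse-train source, the two-pole resonators, and all frequencies and bandwidths are assumptions chosen for the example.

```python
import math

def impulse_train(f0, sr, n):
    """Source: one unit impulse per pitch period (a crude stand-in for glottal pulses)."""
    period = int(round(sr / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(x, freq, bandwidth, sr):
    """Filter: two-pole resonator y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth / sr)           # pole radius from bandwidth
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / sr)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        out = s + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y

def synthesize(f0=100.0, sr=8000, n=4000,
               formants=((700.0, 80.0), (1200.0, 100.0))):
    """Signal Generator -> Filter 1 -> Filter 2 (resonance values are hypothetical)."""
    signal = impulse_train(f0, sr, n)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, sr)
    return signal
```

Changing `f0` alters only the source, while changing `formants` alters only the filters; that separation is the whole point of the model.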
20th Century, the Dawn of Speech Processing
● Cooley and Tukey (1965): Fast Fourier Transform
● Oppenheim (1969): one of the earliest digital implementations of speech analysis/synthesis
[Diagram: Input → Analysis → Pitch (source) + Cepstrum (vocal tract filter) → Synthesis → Output; a further block is labeled Spectrum]
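The analysis half of the diagram can be sketched as follows. This is a simplified illustration in the spirit of cepstral (homomorphic) analysis, not a reconstruction of Oppenheim's system: the recursive radix-2 FFT, the lifter cutoff, and the peak-picking rule are all assumptions made for the example.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(spectrum):
    """Inverse FFT via the conjugation trick."""
    n = len(spectrum)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spectrum])]

def real_cepstrum(frame):
    """Cepstrum = inverse transform of the log magnitude spectrum."""
    log_mag = [math.log(abs(v) + 1e-12) for v in fft(frame)]
    return [v.real for v in ifft(log_mag)]

def split_cepstrum(cep, cutoff):
    """Low quefrencies approximate the vocal tract filter; the rest carries the source."""
    n = len(cep)
    low = [c if (i < cutoff or i > n - cutoff) else 0.0 for i, c in enumerate(cep)]
    high = [c - l for c, l in zip(cep, low)]
    return low, high

def pitch_period(cep, min_lag, max_lag):
    """Pitch period in samples: the strongest cepstral peak in [min_lag, max_lag)."""
    return max(range(min_lag, max_lag), key=lambda q: cep[q])
```

For a frame sampled at `sr` Hz, `sr / pitch_period(...)` gives a pitch estimate in Hz, and the `low` part can be transformed back to obtain a smooth spectral envelope.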
Family Tree of Speech A/S Algorithms
● Source-Filter Model
● Homomorphic Filtering (Oppenheim, 1969)
● STRAIGHT (Kawahara, 1998)
● TANDEM-STRAIGHT (Kawahara & Morise, 2007)
● WORLD1 (Morise, 2009)
● WORLD2 (Morise, 2013)
● PSOLA (?, 1985)
● Phase Vocoder (Flanagan et al., 1966)
● Sinusoidal Model (McAulay & Quatieri, 1986)
● SMS (Serra, 1989)
● Autotune
● CELP (Atal & Schroeder, 1983)
● LSP/LSF (Itakura, 1975)
● MGC/MLSA (Imai et al., 1983)
● Sinsy
● Melodyne
● NiaoNiao
● tn_fnds
● Harmonic+Noise (Stylianou, 1993)
● NBVPM (Bonada, 2004)
● WBVPM (Bonada, 2008)
● Vocaloid
● Vocaloid 2+
● RUCE (Rocaloid 4)
● Rocaloid 3
● Sine+Noise+Transient (Levine & Smith, 1998)
● CeVIO
● Quasi-Harmonic Model (Pantazis et al., 2008)
● Chiptune
● Vocaine (Agiomyrgiannakis, 2015)
● Linear Prediction (Atal & Schroeder, 1967)
Part 2
What’s Wrong
Quasi-static Assumption
Algorithms affected:
● Homomorphic Filtering
● PSOLA
● Linear Prediction & CELP & MLSA
● Sinusoidal Model
● Harmonic+Noise Model
● SMS & NBVPM
● WORLD & STRAIGHT (slightly)
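The slide names the assumption without defining it. The sketch below assumes that "quasi-static" means the signal parameters are treated as constant inside each analysis frame, and shows one way that can fail: a tone whose frequency glides within the frame no longer yields a sharp spectral peak. The frame length, frequencies, and the `peak_concentration` measure are all choices made for this illustration.

```python
import cmath
import math

def hann_dft_power(frame):
    """Power spectrum (bins 0..N/2) of a Hann-windowed frame, via a direct DFT."""
    n = len(frame)
    windowed = [s * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
                for i, s in enumerate(frame)]
    power = []
    for k in range(n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(windowed))
        power.append(abs(acc) ** 2)
    return power

def peak_concentration(frame):
    """Share of the total power that falls into the single strongest bin."""
    power = hann_dft_power(frame)
    return max(power) / sum(power)

def tone(freq, sr, n):
    """Sinusoid with constant frequency: the quasi-static case."""
    return [math.sin(2.0 * math.pi * freq * i / sr) for i in range(n)]

def chirp(f_start, f_end, sr, n):
    """Sinusoid whose frequency glides linearly from f_start to f_end within the frame."""
    rate = (f_end - f_start) / (n / sr)
    return [math.sin(2.0 * math.pi * (f_start * t + 0.5 * rate * t * t))
            for t in (i / sr for i in range(n))]

steady = peak_concentration(tone(1000.0, 8000, 256))
gliding = peak_concentration(chirp(500.0, 1500.0, 8000, 256))
print(f"steady tone: {steady:.2f}, gliding tone: {gliding:.2f}")
```

The steady tone keeps most of its power in one bin, while the gliding tone spreads it over many bins, so a frame-wise estimate of "the" frequency and amplitude becomes ambiguous.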
Mis-represented Aperiodic Component
Popular belief:
1. Speech = periodic signal + aperiodic signal (breathing noise)
2. Aperiodic signal is filtered white noise
[Figure: signal over time t, with parts labeled Aperiodic and Periodic (Friction)]
Algorithms affected:
● (Quasi-)Harmonic+Noise Model
● SMS & Sines+Noise+Transients Model
● WORLD & (TANDEM-)STRAIGHT
● Algorithms that do not model the aperiodic component
  ○ Phase vocoder, CELP, MLSA, ...
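For reference, the "popular belief" stated above can be written down directly. This is a sketch of the belief being criticized, not of any particular algorithm in the list: the harmonic amplitudes, the one-pole low-pass filter, and the noise gain are arbitrary assumptions.

```python
import math
import random

def harmonic_part(f0, amplitudes, sr, n):
    """Periodic component: a sum of sinusoids at integer multiples of f0."""
    return [sum(a * math.sin(2.0 * math.pi * f0 * (h + 1) * i / sr)
                for h, a in enumerate(amplitudes))
            for i in range(n)]

def filtered_noise(n, smoothing=0.8, gain=0.05, seed=0):
    """Aperiodic component: Gaussian white noise through a one-pole low-pass filter."""
    rng = random.Random(seed)
    out, state = [], 0.0
    for _ in range(n):
        state = smoothing * state + (1.0 - smoothing) * rng.gauss(0.0, 1.0)
        out.append(gain * state)
    return out

def harmonic_plus_noise(f0=100.0, sr=8000, n=1600):
    """Belief 1: speech = periodic + aperiodic. Belief 2: aperiodic = filtered white noise."""
    periodic = harmonic_part(f0, [1.0, 0.5, 0.25], sr, n)
    aperiodic = filtered_noise(n)
    mix = [p + a for p, a in zip(periodic, aperiodic)]
    return mix, periodic, aperiodic
```

In this sketch the noise is stationary: nothing ties its level or timing to the pitch period.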
Over-simplified Source-Filter Model
[Diagram: blocks labeled Oscillator, Source Filter, Tract Filter and Lip Filter]
Assumption: the source filter is independent of pitch
Equivalent assumption: “When my pitch is higher by 12 semitones, my vocal folds still oscillate at the same speed.”
Affected algorithms: all of those listed on page 11
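A little arithmetic makes the quoted assumption concrete. The sketch reads "the same speed" as a glottal pulse whose absolute duration stays fixed while the pitch changes; that reading, the 110 Hz starting pitch, and the 4 ms pulse length are assumptions for illustration only.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)

def pulse_fraction(f0_hz, pulse_ms):
    """Fraction of one pitch period occupied by a pulse of fixed absolute duration."""
    period_ms = 1000.0 / f0_hz
    return pulse_ms / period_ms

base_f0 = 110.0                                  # hypothetical starting pitch in Hz
shifted_f0 = base_f0 * semitones_to_ratio(12)    # 12 semitones up doubles the frequency
pulse_ms = 4.0                                   # hypothetical fixed pulse duration

for f0 in (base_f0, shifted_f0):
    print(f"{f0:.0f} Hz: period {1000.0 / f0:.2f} ms, "
          f"pulse fills {pulse_fraction(f0, pulse_ms):.0%} of the period")
```

With a pitch-independent source, the same 4 ms pulse goes from filling under half of each period to filling most of it one octave up, and would overlap its neighbours at still higher pitches.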
Part 3
Future: How to Fix & the Low Level Speech Model
“Neoclassical” Approaches to Speech Modeling
[Figure: Input and Inverse; panels labeled Tract, Source and Lip; axes t, f, f]
● Linear Prediction (Atal & Schroeder, 1967)
● ARX (Wen, et al., 1995)
● ARX-LF (Vincent, et al., 2005)
● LF Model (Liljencrants, Fant and Lin, 1985)
● OVE Synthesizer (Fant, 1953)
Degottex (2013): a similar idea, but in the frequency domain
Hua (2016, in progress): more robust under poor recording conditions; less sensitive to processed input
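Of the approaches named here, plain linear prediction is the simplest to sketch: estimate an all-pole filter from the signal, then run the signal through the inverse of that filter to recover an approximation of the excitation. The code below uses the autocorrelation method with the Levinson-Durbin recursion; it does not implement the ARX, LF-based or in-progress methods, and the AR(2) test signal is a made-up example.

```python
import random

def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the whole signal."""
    return [sum(x[i] * x[i - lag] for i in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a[1..order]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err

def lpc(x, order):
    """Coefficients such that x[n] is predicted by sum_k a_k * x[n-k]."""
    coeffs, _ = levinson_durbin(autocorrelation(x, order), order)
    return coeffs

def inverse_filter(x, coeffs):
    """Residual e[n] = x[n] - sum_k a_k * x[n-k]: an estimate of the excitation."""
    out = []
    for n in range(len(x)):
        pred = sum(a * x[n - k] for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(x[n] - pred)
    return out

def ar2_signal(n, a1=1.3, a2=-0.6, seed=0):
    """Hypothetical test input: white noise passed through a known all-pole filter."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n):
        x.append(a1 * x[-1] + a2 * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]
```

Running `lpc(ar2_signal(5000), 2)` should return values close to the 1.3 and -0.6 used to build the signal, and `inverse_filter` then removes most of the resonance, leaving something close to the white-noise input.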
The Low Level Speech Model (new version)
Level 0 (Signal Level)
Input Signal → Pitch, Harmonic Model, Noise Model → Output Signal
Noise Model: Spectrum; Channel 1 Energy, Channel 2 Energy, Channel 3 Energy, ... (each with a Harmonic Model)
Level 1 (Acoustic Level): an acoustically meaningful speech model
Glottal/Source Information (LF Model), Vocal Tract Filter, Lip Filter
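One way to picture the two levels is as nested per-frame records. The layout below is purely illustrative: every class and field name is hypothetical, the contents of each field are guesses at a plausible representation, and none of it is taken from an actual LLSM implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class HarmonicModel:
    """Sinusoids at multiples of the pitch (field names are illustrative)."""
    amplitudes: List[float] = field(default_factory=list)
    phases: List[float] = field(default_factory=list)

@dataclass
class NoiseModel:
    """Noise spectrum plus one energy description per channel."""
    spectrum: List[float] = field(default_factory=list)
    channel_energy: List[HarmonicModel] = field(default_factory=list)

@dataclass
class Level0Frame:
    """Signal level: what is measured directly from the input signal."""
    pitch_hz: float
    harmonics: HarmonicModel
    noise: NoiseModel

@dataclass
class Level1Frame:
    """Acoustic level: the same frame explained by source, tract and lip."""
    glottal_lf_params: List[float]
    vocal_tract_filter: List[float]
    lip_filter: List[float]

frame = Level0Frame(
    pitch_hz=220.0,
    harmonics=HarmonicModel(amplitudes=[1.0, 0.5], phases=[0.0, 0.0]),
    noise=NoiseModel(spectrum=[0.1, 0.05], channel_energy=[HarmonicModel()]),
)
```

Keeping the levels as separate types mirrors the slide: Level 0 can always be computed from the signal, while Level 1 is an interpretation layered on top of it.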
Inverse Analysis of Speech
Original
Glottal Flow (Source Signal)
Pitch Shifting powered by LLSM
Original
50% Pitch
200% Pitch
Instants of vocal fold closure were revealed
References
● Oppenheim, A. V. "Speech Analysis-Synthesis System Based on Homomorphic Filtering." Journal of the Acoustical Society of America 45.2 (1969).
● Degottex, Gilles, et al. "Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis." Speech Communication 55.2 (2013): 278-294.
● Dunn, H. K. "The calculation of vowel resonances, and an electrical vocal tract." Journal of the Acoustical Society of America 22 (1950): 740-753.
● Pantazis, Yannis, and Yannis Stylianou. "Improving the modeling of the noise part in the harmonic plus noise model of speech." IEEE International Conference on Acoustics, Speech and Signal Processing (2008).