Kanru Hua (華侃如)
June 19, 2016
The Past, Present and Future of Singing Voice Modeling
Motivation
“You are making too many assumptions, this thing won’t work on real speech signal.”
— Jont B. Allen
● What’s wrong with current and past research in this area?
● What’s our next step?
What’s in a Speech/Singing Synthesizer
Text / Music Score → Parameter Generator → Vocoder → Speech Audio
● Parameter Generator: generates pitch, duration, spectrum, … from the input
● Vocoder: generates the waveform from the parameters
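The two-stage structure can be sketched in a few lines. This is a toy illustration only: the score format, frame size, sample rate, and the single-sine "vocoder" are all assumptions made for the example, and a real parameter generator would also produce duration and spectrum.

```python
import math

SAMPLE_RATE = 16000   # Hz, assumed for this sketch
FRAME_SIZE = 80       # samples per parameter frame (5 ms at 16 kHz), assumed


def midi_to_hz(note):
    """Convert a MIDI note number to frequency in Hz (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def generate_parameters(score):
    """Toy parameter generator: (midi_note, seconds) pairs -> per-frame f0 in Hz."""
    f0_track = []
    for note, seconds in score:
        n_frames = int(round(seconds * SAMPLE_RATE / FRAME_SIZE))
        f0_track.extend([midi_to_hz(note)] * n_frames)
    return f0_track


def vocode(f0_track):
    """Toy vocoder: per-frame f0 -> waveform from one phase-continuous sine."""
    phase = 0.0
    samples = []
    for f0 in f0_track:
        step = 2.0 * math.pi * f0 / SAMPLE_RATE
        for _ in range(FRAME_SIZE):
            samples.append(0.5 * math.sin(phase))
            phase += step
    return samples


score = [(69, 0.25), (72, 0.25)]        # A4 then C5, a quarter second each
f0_track = generate_parameters(score)   # stage 1: input -> parameters
audio = vocode(f0_track)                # stage 2: parameters -> waveform
```

The only point of the sketch is the interface: the vocoder never sees the score, only the parameter track.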
Part 1
History of Speech Analysis/Synthesis
(http://clas.mq.edu.au/speech/synthesis/history_synthesis/)
History of Math & Acoustics
Timeline: 1600 to 2000
● Newton: law of forces/motions, foundation of calculus
● Bernoulli, Euler, d'Alembert: wave equation, complex number
● Gauss, Fourier, Laplace, Riemann, Cauchy, Kirchhoff, Heaviside: Fourier/Laplace transform, analog circuits & electromagnetism
● Filtering theory, digital systems, sampling theory, ...
(http://www2.ling.su.se/staff/hartmut/kemplne.htm)
Frequency Response
Source-Filter Model
Lung → Vocal Folds → Vocal Tract → Lip
[Figure: one time-domain plot (t) and two frequency-domain plots (f)]
Signal Generator (Source) → Filter 1 → Filter 2
Signal Generator → Filter 0 → Filter 1 → Filter 2
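A minimal sketch of the source-filter idea: an impulse train (the signal generator) is passed through two cascaded resonators. The sample rate, pitch, and resonance frequencies/bandwidths are hypothetical values chosen for illustration, not taken from the slides.

```python
import cmath
import math

SAMPLE_RATE = 16000  # Hz, assumed


def impulse_train(f0, n_samples):
    """Source: one unit impulse per pitch period."""
    period = int(round(SAMPLE_RATE / f0))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator_coeffs(freq, bandwidth):
    """Feedback coefficients (a1, a2) of a two-pole resonator."""
    r = math.exp(-math.pi * bandwidth / SAMPLE_RATE)
    theta = 2.0 * math.pi * freq / SAMPLE_RATE
    return (-2.0 * r * math.cos(theta), r * r)


def apply_filter(x, coeffs):
    """All-pole filter: y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    a1, a2 = coeffs
    y = []
    y1 = y2 = 0.0
    for sample in x:
        out = sample - a1 * y1 - a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y


def frequency_response(coeffs, freq):
    """H = 1 / (1 + a1*z^-1 + a2*z^-2) evaluated on the unit circle at freq Hz."""
    a1, a2 = coeffs
    z1 = cmath.exp(-2j * math.pi * freq / SAMPLE_RATE)
    return 1.0 / (1.0 + a1 * z1 + a2 * z1 * z1)


filter_1 = resonator_coeffs(700.0, 100.0)    # hypothetical first resonance
filter_2 = resonator_coeffs(1200.0, 120.0)   # hypothetical second resonance

source = impulse_train(100.0, 1600)
output = apply_filter(apply_filter(source, filter_1), filter_2)

# For a cascade, the overall frequency response is the product of the
# individual responses.
cascade_gain_700 = abs(frequency_response(filter_1, 700.0)
                       * frequency_response(filter_2, 700.0))
```

Because each stage is linear and time-invariant, the stages can be reordered or merged without changing the output, which is what makes the block diagrams above interchangeable.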
20th Century, the Dawn of Speech Processing
● Cooley and Tukey (1965): Fast Fourier Transform
● Oppenheim (1969): one of the earliest digital implementations of speech analysis/synthesis
[Diagram: Input → Analysis → Pitch (source) + Cepstrum (vocal tract filter) → Synthesis → Output, with an intermediate Spectrum stage]
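The split into pitch (source) and cepstrum (vocal tract filter) can be illustrated with a textbook real-cepstrum computation. This is a sketch under assumptions, not a reconstruction of Oppenheim's system: the test frame is synthetic, and the sample rate, frame length, lifter cutoff, and pitch search range are arbitrary choices.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz, assumed
N = 512              # frame length, assumed
TWIDDLES = [cmath.exp(-2j * math.pi * i / N) for i in range(N)]


def dft(x):
    """Naive O(N^2) DFT of a length-N sequence, adequate for one frame."""
    return [sum(x[t] * TWIDDLES[(k * t) % N] for t in range(N)) for k in range(N)]


def synth_frame():
    """Synthetic voiced frame: 100 Hz impulse train through a 1 kHz resonator."""
    r = math.exp(-math.pi * 200.0 / SAMPLE_RATE)
    a1 = -2.0 * r * math.cos(2.0 * math.pi * 1000.0 / SAMPLE_RATE)
    a2 = r * r
    y, y1, y2 = [], 0.0, 0.0
    for n in range(2 * N):
        x = 1.0 if n % 80 == 0 else 0.0
        out = x - a1 * y1 - a2 * y2
        y.append(out)
        y2, y1 = y1, out
    frame = y[N:]  # skip the start-up transient
    # Hann window
    return [s * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / N))
            for i, s in enumerate(frame)]


frame = synth_frame()
log_mag = [math.log(abs(v) + 1e-12) for v in dft(frame)]
# log_mag is real and even, so its inverse DFT equals its forward DFT / N.
cepstrum = [v.real / N for v in dft(log_mag)]

# Source: the pitch period shows up as a peak at high quefrency.
lo, hi = SAMPLE_RATE // 160, SAMPLE_RATE // 60   # search 60-160 Hz (assumed)
period = max(range(lo, hi + 1), key=lambda q: cepstrum[q])
f0_estimate = SAMPLE_RATE / period

# Filter: keep only low quefrencies, then transform back to a smooth
# log-magnitude envelope.
CUTOFF = 30  # lifter cutoff in samples, assumed
liftered = [c if (q < CUTOFF or q > N - CUTOFF) else 0.0
            for q, c in enumerate(cepstrum)]
envelope = [v.real for v in dft(liftered)]
```

The logarithm turns the source-filter product in the spectrum into a sum, so the slowly varying filter and the rapidly varying harmonic structure land in different quefrency ranges.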
Family Tree of Speech A/S Algorithms
[Figure: family tree diagram; its nodes are listed below]
Algorithms:
● Source-Filter Model
● Homomorphic Filtering (Oppenheim, 1969)
● Linear Prediction (Atal & Schroeder, 1967)
● LSP/LSF (Itakura, 1975)
● CELP (Atal & Schroeder, 1983)
● MGC/MLSA (Imai, et al., 1983)
● Phase Vocoder (Flanagan et al., 1966)
● PSOLA (?, 1985)
● Sinusoidal Model (McAulay & Quatieri, 1986)
● SMS (Serra, 1989)
● Harmonic+Noise (Stylianou, 1993)
● Sine+Noise+Transient (Levine & Smith, 1998)
● Quasi-Harmonic Model (Pantazis, et al., 2008)
● NBVPM (Bonada, 2004)
● WBVPM (Bonada, 2008)
● STRAIGHT (Kawahara, 1998)
● TANDEM-STRAIGHT (Kawahara & Morise, 2007)
● WORLD1 (Morise, 2009)
● WORLD2 (Morise, 2013)
● Vocaine (Agiomyrgiannakis, 2015)
Software and applications:
● Autotune
● Melodyne
● Sinsy
● CeVIO
● Vocaloid
● Vocaloid 2+
● NiaoNiao
● tn_fnds
● Rocaloid 3
● RUCE (Rocaloid 4)
● Chiptune
Part 2
What’s Wrong
Quasi-static Assumption
Algorithms affected:
● Homomorphic Filtering
● PSOLA
● Linear Prediction & CELP & MLSA
● Sinusoidal Model
● Harmonic+Noise Model
● SMS & NBVPM
● WORLD & STRAIGHT (slightly)
Mis-represented Aperiodic Component
Popular belief:
1. Speech = periodic signal + aperiodic signal (breathing noise)
2. Aperiodic signal is filtered white noise
[Figure: waveform over time (t), with labels "Aperiodic" and "Periodic (Friction)"]
Algorithms affected:
● (Quasi-)Harmonic+Noise Model
● SMS & Sines+Noise+Transients Model
● WORLD & (TANDEM-)STRAIGHT
● Algorithms that do not model aperiodic component
○ Phase vocoder, CELP, MLSA, ...
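For reference, the "popular belief" decomposition criticized above looks like this in code: a sum of harmonics plus white noise shaped by a filter. This is a sketch of the conventional model only, with hypothetical amplitudes, filter, and sample rate; the slides argue that it mis-represents the aperiodic component.

```python
import math
import random

SAMPLE_RATE = 16000  # Hz, assumed


def periodic_part(f0, n_harmonics, n_samples):
    """Sum of harmonics of f0 with 1/k amplitudes (hypothetical choice)."""
    return [sum(math.sin(2.0 * math.pi * k * f0 * n / SAMPLE_RATE) / k
                for k in range(1, n_harmonics + 1))
            for n in range(n_samples)]


def aperiodic_part(n_samples, pole=0.9, gain=0.05, seed=0):
    """White Gaussian noise through a one-pole filter:
    y[n] = gain * x[n] + pole * y[n-1]."""
    rng = random.Random(seed)
    y, prev = [], 0.0
    for _ in range(n_samples):
        prev = gain * rng.gauss(0.0, 1.0) + pole * prev
        y.append(prev)
    return y


f0 = 200.0
n_samples = 1600
voiced = periodic_part(f0, 10, n_samples)
noise = aperiodic_part(n_samples)
speech_like = [p + a for p, a in zip(voiced, noise)]
```

Note that the noise here is stationary within the frame and carries no relation to the pitch cycle; it is simply added on top of the harmonics.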
Over-simplified Source-Filter Model
[Diagram: block chains built from Oscillator, Source Filter, Tract Filter, and Lip Filter]
Assumption: the source filter is independent of pitch.
Equivalent assumption: “When my pitch is higher by 12 semitones, my vocal folds still oscillate at the same speed.”
Affected algorithms: all of those listed on page 11
Part 3
Future: How to Fix & the Low Level Speech Model
“Neoclassical” Approaches to Speech Modeling
[Diagram: Input → Inverse → Source (t), Tract (f), Lip (f)]
● OVE Synthesizer (Fant, 1953)
● Linear Prediction (Atal & Schroeder, 1967)
● LF Model (Liljencrants, Fant and Lin, 1985)
● ARX (Wen, et al., 1995)
● ARX-LF (Vincent, et al., 2005)
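Inverse analysis in its simplest form can be sketched with plain linear prediction, the earliest item in the list: estimate an all-pole filter from the signal, then apply its inverse to recover the excitation. The example below assumes a synthetic input and the autocorrelation method with Levinson-Durbin recursion; it is not the ARX-LF procedure or the author's method, which additionally model the glottal source.

```python
import math

SAMPLE_RATE = 8000  # Hz, assumed


def all_pole_filter(x, a):
    """Synthesis filter 1/A(z): y[n] = x[n] - sum_{j>=1} a[j] * y[n-j]."""
    y = []
    for n, sample in enumerate(x):
        acc = sample
        for j in range(1, len(a)):
            if n - j >= 0:
                acc -= a[j] * y[n - j]
        y.append(acc)
    return y


def autocorrelation(x, max_lag):
    """r[lag] = sum_n x[n] * x[n-lag] for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations; returns a[0..order] with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a


def inverse_filter(x, a):
    """Analysis filter A(z): e[n] = sum_j a[j] * x[n-j]."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


# Synthetic "speech": impulse train through one hypothetical resonance.
r_pole = math.exp(-math.pi * 300.0 / SAMPLE_RATE)
true_a = [1.0,
          -2.0 * r_pole * math.cos(2.0 * math.pi * 500.0 / SAMPLE_RATE),
          r_pole * r_pole]
excitation = [1.0 if n % 80 == 0 else 0.0 for n in range(800)]
signal = all_pole_filter(excitation, true_a)

estimated_a = levinson_durbin(autocorrelation(signal, 2), 2)
residual = inverse_filter(signal, estimated_a)  # approximates the excitation
```

Plain linear prediction lumps source, tract, and lip into one all-pole filter, so on real speech the residual is not a clean glottal flow; separating those contributions is what the later models in the list address.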
Degottex (2013): similar idea, but in the frequency domain
Hua (2016, in progress): more robust under poor recording conditions; less sensitive to processed input.
The Low Level Speech Model (new version)
Level 0 (Signal Level)
● Input Signal → Pitch, Harmonic Model, Noise Model → Output Signal
● Noise Model: Spectrum; Channel 1 Energy, Channel 2 Energy, Channel 3 Energy, ..., each with a Harmonic Model
Level 1 (Acoustic Level)
● Glottal/Source Information (LF Model)
● Vocal Tract Filter
● Lip Filter
An acoustically meaningful speech model
Inverse Analysis of Speech
[Figure: Original; Glottal Flow (Source Signal)]
Pitch Shifting powered by LLSM
[Figure: Original; 50% Pitch; 200% Pitch]
Instants of vocal fold closure were revealed
References
● Oppenheim, A. V. "Speech analysis-synthesis system based on homomorphic filtering." Journal of the Acoustical Society of America 45.2 (1969).
● Degottex, Gilles, et al. "Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis." Speech Communication 55.2 (2013): 278-294.
● Dunn, H. K. "The calculation of vowel resonances, and an electrical vocal tract." Journal of the Acoustical Society of America 22 (1950): 740-753.
● Pantazis, Yannis, and Yannis Stylianou. "Improving the modeling of the noise part in the harmonic plus noise model of speech." IEEE International Conference on Acoustics, Speech and Signal Processing (2008).