Upload
deeplearningjp2016
View
458
Download
4
Embed Size (px)
Citation preview
WAVENET A GENERATIVE MODEL FOR RAW AUDIO
WAVENETA GENERATIVE MODEL FOR RAW AUDIO
Aaron et al (Deep Mind)
arxiv
2016/9/12
concatenative Text to Speech(TTS)parametric TTSWaveNet
WaveNet
16,000 samples/sec
int1665,536 256-law companding transformation
WaveNet
WaveNet1
dilated causal convolutional layers
DilationinputDilationDilation1e.g.) 1,2,4,...,512,1,2,4,...,512,1,2,4,...,512.
RNN
dilated causal convolutional layers
RNN1
WaveNet
Resnetskip-connection
Conditional WaveNet
()
MULTI-SPEAKER SPEECH GENERATION VCTK 109 44
ID
receptive field size ()0.3sec(1516)
MULTI-SPEAKER SPEECH GENERATION
US parametricUS concatenateUS wavenet(parametricconcatenate)
CH parametricCH concatenateCH wavenet
TEXT-TO-SPEECH
TEXT-TO-SPEECH
MOS 1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent
TEXT-TO-SPEECH
No preference
TEXT-TO-SPEECH
Sample1Sample2
MUSICMagnaTagATune datasets: 200 (etc)
MagnaTagATune datasets()
MUSICYouTube piano dataset: 60
RNNCNN()