
[DL Reading Group] WaveNet: A Generative Model for Raw Audio


Aäron van den Oord et al. (DeepMind)

arXiv

2016/9/12

concatenative Text-to-Speech (TTS), parametric TTS, WaveNet

WaveNet

16,000 samples/sec

int16: 65,536 values → 256 values via the μ-law companding transformation
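The quantization step on this slide can be sketched as follows. This is a minimal illustration, assuming the standard μ-law formula f(x) = sign(x) · ln(1 + μ|x|) / ln(1 + μ) with μ = 255 and samples scaled to [-1, 1]. The function names and the rounding scheme are hypothetical, not taken from the slides.

```python
import math

MU = 255  # gives 256 quantization levels (assumed: mu = 255)


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift [-1, 1] to [0, mu] and round to the nearest integer class.
    return int((compressed + 1) / 2 * mu + 0.5)


def mu_law_decode(index, mu=MU):
    """Approximate inverse: integer class in [0, mu] back to [-1, 1]."""
    compressed = 2 * index / mu - 1
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)


def int16_to_class(sample):
    """Convert one 16-bit PCM sample (-32768..32767) to one of 256 classes."""
    return mu_law_encode(sample / 32768.0)


if __name__ == "__main__":
    for s in (-32768, -1000, 0, 1000, 32767):
        c = int16_to_class(s)
        print(s, "->", c, "->", round(mu_law_decode(c) * 32768))
```

The companding is non-linear: small amplitudes get finer resolution than large ones, which is why 256 classes can stand in for the 65,536 int16 values.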

WaveNet

WaveNet1

dilated causal convolutional layers

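Dilation: the filter skips input values with a fixed step, so it covers an area larger than its length. Dilation 1 is a standard convolution.
e.g.) 1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512.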
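A rough sketch of one dilated causal convolution and of the dilation schedule from the example above. It uses plain Python lists rather than a deep learning framework. The function names, the zero padding on the left, and the filter width of 2 in the usage lines are assumptions for illustration.

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation].

    Only the current and past samples are used (causal); positions
    before the start of the signal are treated as zero.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def dilation_schedule(max_dilation=512, repeats=3):
    """Build 1, 2, 4, ..., max_dilation, repeated `repeats` times."""
    block = []
    d = 1
    while d <= max_dilation:
        block.append(d)
        d *= 2
    return block * repeats


if __name__ == "__main__":
    signal = [1, 2, 3, 4]
    print(causal_dilated_conv(signal, [1, 1], dilation=1))  # standard convolution
    print(causal_dilated_conv(signal, [1, 1], dilation=2))  # skips every other input
    print(dilation_schedule()[:12])
```

With dilation 1 each output mixes neighbouring samples; with dilation 2 it mixes samples two steps apart, so stacking layers with doubling dilation grows the covered history quickly.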

RNN

dilated causal convolutional layers

RNN1

WaveNet

ResNet-style residual connections and skip connections

Conditional WaveNet


MULTI-SPEAKER SPEECH GENERATION

VCTK: 109 speakers, 44 hours

Conditioned on speaker ID

receptive field size: about 0.3 sec (1516)
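To relate a receptive field given in seconds to a stack of dilated layers, the arithmetic can be sketched as below. The filter width of 2 and the example schedule (1, 2, 4, ..., 512 repeated three times) are assumptions. The slides do not state the configuration used in this experiment, so the example is not expected to reproduce the 0.3 sec figure.

```python
SAMPLE_RATE = 16_000  # samples per second, as stated earlier in the slides


def receptive_field(dilations, filter_width=2):
    """Number of input samples seen by one output of the stacked layers.

    Each layer with dilation d and filter width k adds (k - 1) * d samples.
    """
    return 1 + sum((filter_width - 1) * d for d in dilations)


def seconds_to_samples(seconds, sample_rate=SAMPLE_RATE):
    """Convert a duration in seconds to a sample count."""
    return round(seconds * sample_rate)


if __name__ == "__main__":
    # Hypothetical schedule: 1, 2, 4, ..., 512 repeated three times.
    example = [2 ** i for i in range(10)] * 3
    rf = receptive_field(example)
    print("receptive field:", rf, "samples =", rf / SAMPLE_RATE, "sec")
    print("0.3 sec at 16 kHz =", seconds_to_samples(0.3), "samples")
```

Under these assumptions the example schedule covers 3,070 samples, a little under 0.2 sec, while 0.3 sec corresponds to 4,800 samples at 16,000 samples/sec.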

MULTI-SPEAKER SPEECH GENERATION

US parametric, US concatenative, US WaveNet (parametric, concatenative)

CH parametric, CH concatenative, CH WaveNet

TEXT-TO-SPEECH


MOS (Mean Opinion Score) scale: 1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent
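For reference, a MOS is the arithmetic mean of listener ratings on the five-point scale above. The sketch below uses made-up ratings purely for illustration. They are not results from the paper, and the function name is hypothetical.

```python
from statistics import mean

SCALE = {1: "Bad", 2: "Poor", 3: "Fair", 4: "Good", 5: "Excellent"}


def mean_opinion_score(ratings):
    """Average a list of integer ratings, rejecting values off the scale."""
    for r in ratings:
        if r not in SCALE:
            raise ValueError(f"rating {r!r} is not on the 1-5 scale")
    return mean(ratings)


if __name__ == "__main__":
    hypothetical_ratings = [4, 5, 3, 4]  # illustrative only, not real data
    print("MOS:", mean_opinion_score(hypothetical_ratings))
```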

TEXT-TO-SPEECH

No preference

TEXT-TO-SPEECH

Sample 1, Sample 2

MUSIC

MagnaTagATune dataset: about 200 hours (etc)

MagnaTagATune dataset

MUSIC

YouTube piano dataset: about 60 hours

RNN, CNN