Can Machines Learn Human Language Without a Teacher?
Hung-yi Lee
Unsupervised Learning
Abstractive Summarization
Chat-bot
Voice Conversion
Audio word2vec & Speech Recognition
Generative Adversarial Network (GAN)
https://www.youtube.com/playlist?list=PLJV_el3uVTsMd2G9ZjcpJn1YfnM9wVOBf
Abstractive Summarization
• Machines can now do abstractive summarization, writing summaries in their own words.
[Diagram: documents with their titles (Title 1, Title 2, Title 3) serve as training data; the machine then generates a title in its own words, without hand-crafted rules.]
Abstractive Summarization
• Input: transcriptions of audio, output: summary
[Diagram: an RNN encoder reads through the input word sequence w1 w2 w3 w4 (hidden states h1 … h4), the transcriptions of audio from automatic speech recognition (ASR); an RNN generator then emits the summary words wA wB … (states z1 z2 …).]
We need lots of labelled training data (supervised).
Unsupervised Abstractive Summarization
• A seq2seq model can now do abstractive summarization (writing summaries in its own words).
summary 1
summary 2
summary 3
seq2seq
Unsupervised Abstractive Summarization
[Diagram: generator G (a seq2seq model) maps a document to a word sequence, a candidate summary; reconstructor R (another seq2seq model) maps that word sequence back to the document.]
Only a large collection of documents is needed to train the model.
This is a seq2seq2seq auto-encoder, using a sequence of words as the latent representation.
Problem: the latent word sequence is not readable …
Unsupervised Abstractive Summarization
[Diagram: as above, generator G (seq2seq) produces a word sequence from the document and reconstructor R (seq2seq) rebuilds the document from it. In addition, a discriminator D, trained on human-written summaries, judges whether G's output is real or not; G learns to make the discriminator consider its output as real, so the word sequence becomes a readable summary.]
The REINFORCE algorithm is used, since the sampled word sequence is discrete and not directly differentiable.
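The REINFORCE update can be sketched with a toy example. Everything here is illustrative, not the paper's setup: a categorical "generator" picks one of four candidate outputs, and a fixed reward vector stands in for the discriminator's realness score.

```python
import numpy as np

# Toy REINFORCE sketch. A categorical policy over 4 discrete outputs is
# trained from scalar rewards only, with no gradient through the sampling.
rng = np.random.default_rng(0)
rewards = np.array([0.1, 0.2, 0.9, 0.3])  # action 2 is "most real/readable"
logits = np.zeros(4)
lr = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(500):
    p = softmax(logits)
    a = rng.choice(4, p=p)                  # sample one output
    r = rewards[a]                          # reward for that output
    baseline = p @ rewards                  # variance-reducing baseline
    grad_logp = -p                          # d log p(a) / d logits ...
    grad_logp[a] += 1.0                     # ... for a softmax policy
    logits += lr * (r - baseline) * grad_logp  # REINFORCE update
```

After training, the policy concentrates on the highest-reward output, even though no gradient ever flowed through the discrete choice itself.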
Unsupervised Abstractive Summarization
• Document (original in Chinese): Australia today signed bilateral anti-doping agreements with 13 countries, aiming to strengthen out-of-competition drug testing and to share research results …
• Summary:
• Human: Australia signs anti-doping agreements with 13 countries
• Unsupervised: Australia strengthens out-of-competition drug testing
• Document: The Republic of China Olympic Committee today received an invitation to the 1992 Winter Olympics; since chairman Chang Feng-hsu is currently on a goodwill visit to Central and South America, it has not yet been decided whether to send a team …
• Summary:
• Human: 1992 Winter Olympics invites us to participate
• Unsupervised: Olympic Committee receives Winter Olympics invitation
Thanks to Yau-Shian Wang for the experimental results.
Unsupervised Abstractive Summarization
• Document (original in Chinese): According to local media reports on the 27th, two provinces on Indonesia's Sumatra island have seen days of torrential rain; flooding has triggered landslides, and as of the 26th at least 60 people had died and more than 100 were missing …
• Summary:
• Human: Indonesian floods kill 60
• Unsupervised: 印尼門洪水泛濫導致塌雨 (garbled; roughly "Indonesia gate flooding causes collapse rain")
• Document: Hefei, in Anhui Province, recently set new rules for officials visiting the grassroots: travel light with a small entourage, no welcome-and-send-off ceremonies, no layers of accompanying officials …
• Summary:
• Human: Hefei rules that officials' grassroots visits be kept simple
• Unsupervised: 合肥領導幹部下基層做搞迎來送往規定:一律簡 (garbled; roughly "Hefei officials grassroots visits do welcome-send-off rules: all simple")
Semi-supervised Learning
[Figure: ROUGE-1 (25–34) vs. the number of document-summary pairs used (0, 10k, 500k), comparing WGAN, Reinforce, and Supervised training; the supervised model uses all 3.8M matched pairs. (unpublished result)]
Unsupervised Learning
Abstractive Summarization
Chat-bot
Voice Conversion
Audio word2vec & Speech Recognition
Generative Adversarial Network (GAN)
Sequence-to-sequence
[Diagram: the encoder reads input sentence c; the decoder generates output sentence x.]
Training data:
A: How are you ?
B: I’m good.
……
……
How are you ?
I’m good.
Seq2seq output: "Not bad. I'm John."
Training criterion: maximize likelihood, i.e., make the machine's response as close as possible to the human one.
Reinforcement Learning
• Machine obtains feedback from user
• Chat-bot learns to maximize the expected reward
[Example: responding "Bye bye ☺" to "How are you?" earns reward −10; responding "Hi ☺" to "Hello" earns reward 3.]
AlphaGo-style training!
• Let two agents talk to each other
Dialogue 1 (degenerate): "How old are you?" → "See you." → "See you." → "See you."
Dialogue 2 (better): "How old are you?" → "I am 16." → "I thought you were 12." → "What makes you think so?"
Humans are still needed to provide the reward.
https://arxiv.org/pdf/1606.01541.pdf
human dialogues
Generative Adversarial Network (GAN)
• Let two agents talk to each other
[Diagram: the two agents' dialogue ("How old are you?" → "See you." → "See you." → "See you.") is fed to a discriminator D that outputs real or not.]
Conditional GAN
https://arxiv.org/pdf/1701.06547.pdf
Personalized Chat-bot
• General chat-bots generate bland responses.
• Humans talk in different styles and with different sentiments to different people in different conditions.
• We want the chat-bot's responses to be controllable, so that chat-bots can be personalized in the future.
• Here we focus only on generating positive responses.
Example: Input: "How was your day today?" An optimistic chat-bot answers "It is wonderful today." rather than "It is terrible today."
[Chih-Wei Lee, et al., ICASSP, 2018]
Approaches
Type 1. System Modification: the chat-bot (encoder En, decoder De) itself is changed; its parameters are modified so that the input sentence maps directly to a positive response.
Type 2. Output Transformation: the chat-bot does not have to change; a transformation module rewrites its response sentence into a positive response.
Four approaches: 1. Persona-based, 2. Reinforcement learning, 3. Plug & Play, 4. Cycle GAN.
Approaches
• 1. Persona-Based Model
Training: an off-the-shelf sentiment classifier scores how positive the reference response is; e.g., input "How is today", reference response "Today is awesome" → score 0.9. [Jiwei Li, et al., ACL, 2016]
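One minimal way to read this training scheme: attach the classifier's score to the model input, so that a desired score can be supplied at test time. The `make_input` helper and the `<sent=…>` token are hypothetical, purely for illustration.

```python
# Hypothetical sketch of sentiment conditioning: during training the
# reference response's classifier score is appended to the input tokens;
# at test time the score is set by hand to request a positive reply.
def make_input(tokens, sentiment_score):
    # the "<sent=...>" token is an invented conditioning marker
    return tokens + ["<sent=%.1f>" % sentiment_score]

train_x = make_input(["how", "is", "today"], 0.9)  # reference was positive
test_x = make_input(["i", "love", "you"], 1.0)     # ask for a positive reply
```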
Similarly, input "How is today" with reference response "Today is bad" → score 0.1.
Testing: the desired sentiment score is supplied instead. Input "I love you": with score 1.0 the response is "I love you, too."; with score 0.0 it is "I am not ready to start a relationship."
Approaches
2. Reinforcement Learning
The sentiment classifier scores the chat-bot's response (e.g., input "How is today", response "Today is bad" → 0.1); this score is used as the reward, so positive responses receive positive reward, and the network parameters are updated with reinforcement learning.
Approaches
3. Plug & Play
The chat-bot itself does not change. Its response sentence is mapped by a VRAE (Variational Recurrent Auto-encoder) encoder to a code; the code is then adjusted into a new code such that the sentiment classifier's score is as large as possible while the new code stays as close as possible to the original; the VRAE decoder generates the positive response from the new code.
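The code-space search can be sketched as gradient ascent on the classifier score with a proximity penalty. The linear "classifier" `w`, the learning rate, and the penalty weight `lam` are all toy assumptions, not values from the work above.

```python
import numpy as np

# Plug & Play sketch: nudge the latent code toward a higher sentiment
# score while a quadratic penalty keeps it close to the original code.
w = np.array([1.0, -0.5])  # toy linear sentiment "classifier"

def step(code, orig, lr=0.1, lam=0.5):
    # gradient of  w.code - lam*||code - orig||^2  w.r.t. code
    grad = w - lam * 2.0 * (code - orig)
    return code + lr * grad

orig = np.array([0.0, 0.0])
code = orig.copy()
for _ in range(200):
    code = step(code, orig)

score_before = float(w @ orig)
score_after = float(w @ code)  # higher sentiment score after the search
```

The iteration converges to the point where the score gradient balances the proximity penalty, which is the trade-off described on the slide.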
Approaches
4. Cycle GAN
The response sentence is transformed between two domains, with no paired data needed (just like speakers A and B in voice conversion): one domain contains positive sentences (It is good. / It's a good day. / I love you.), the other negative sentences (It is bad. / It's a bad day. / I don't love you.).
Cycle GAN
[Diagram: G_X→Y maps a sentence from domain X to domain Y, and G_Y→X maps it back, with the reconstruction kept as close as possible to the input; the symmetric Y→X→Y cycle is trained the same way. Discriminator D_Y outputs a scalar: does the sentence belong to domain Y or not; D_X does the same for domain X.]
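The two "as close as possible" constraints are cycle-consistency losses. A minimal numeric sketch with linear stand-in generators (the matrices are arbitrary toys, not learned models):

```python
import numpy as np

# Cycle-consistency sketch: G_XY maps X -> Y, G_YX maps back, and the
# loss penalizes ||G_YX(G_XY(x)) - x||, so content survives the round trip.
A = np.array([[0.0, 1.0], [1.0, 0.0]])  # toy G_XY: swap coordinates
B = np.array([[0.0, 1.0], [1.0, 0.0]])  # toy G_YX: exact inverse of A

def cycle_loss(x, g_xy, g_yx):
    return float(np.linalg.norm(g_yx @ (g_xy @ x) - x))

x = np.array([1.0, -2.0])
good = cycle_loss(x, A, B)          # inverse pair: zero cycle loss
bad = cycle_loss(x, A, np.eye(2))   # identity does not undo the swap
```

When the backward generator undoes the forward one, the cycle loss vanishes; otherwise it is positive and pushes the two generators to stay mutually consistent.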
Cycle GAN
[Example cycles: "It is bad." → "It is good." → "It is bad."; "I love you." → "I hate you." → "I love you." The discriminators check whether the intermediate sentences are positive or negative.]
Sentences are sequences of discrete tokens, so the generators here operate on continuous word embeddings instead [Lee, et al., ICASSP, 2018].
Cycle GAN
• Negative sentence to positive sentence:
it's a crappy day → it's a great day
i wish you could be here → you could be here
it's not a good idea → it's good idea
i miss you → i love you
i don't love you → i love you
i can't do that → i can do that
i feel so sad → i happy
it's a bad day → it's a good day
it's a dummy day → it's a great day
sorry for doing such a horrible thing → thanks for doing a great thing
my doggy is sick → my doggy is my doggy
my little doggy is sick → my little doggy is my little doggy
Human Evaluation (objective evaluation shows the same trend)
Method               Sentiment  Grammar  Coherence
Seq2seq (Baseline)   0.16       1.00     0.55
Persona-based        0.71       0.75     0.24
Reinforcement        0.70       0.93     0.35
Plug & Play          0.48       0.43     0.15
Cycle GAN            0.63       0.91     0.44
Sentiment: Is the chat-bot response positive?
Grammar: Is the chat-bot response grammatically correct?
Coherence: Does the chat-bot output a suitable response?
Human annotators gave scores from 0 to 5, normalized to the range 0–1.
⚫ Modifying the chat-bot parameters leads to more distinctly positive responses.
⚫ Reinforcement learning and Cycle GAN make fewer grammatical errors.
⚫ Cycle GAN has better semantic coherence.
Example 1
• Input: yeah, i’m just less interested in the how-to and more in the what and why of the things we build
• Seq2seq: it’s not a good thing .
• Persona: I’m interested in this .
• RL: it’s a good one
• Plug and play: it’s not a wrong thing .
• Cycle GAN: it’s a good thing .
Example 2
• Input: always learning and self improving is a huge turn on for me so believe me i understand
• Seq2seq: i'm not sure if i can see you in a relationship
• Persona: and you're gonna have a word with your mother .
• RL: i love you so much
• Plug and play: i'm not sure if i can see you a a special
• Cycle GAN: i'm sure if i can see you in a relationship.
Unsupervised Learning
Abstractive Summarization
Chat-bot
Voice Conversion
Audio word2vec & Speech Recognition
Generative Adversarial Network (GAN)
Voice Conversion
In the past: parallel data. Speaker A and Speaker B say the same sentences ("How are you?", "Good morning").
Today: non-parallel data. Speakers A and B are talking about completely different things (A, in Chinese: "The weather is really nice", "See you"; B: "How are you?", "Good morning").
• Multi-target VC [Chou et al., arxiv 2018]
Voice Conversion
[Diagram, Stage 1: an encoder Enc maps input speech x to enc(x); a classifier C tries to recognize the speaker from enc(x); a decoder Dec combines enc(x) with a speaker representation y to reconstruct dec(enc(x), y), or with a different speaker y′ to produce the converted dec(enc(x), y′).
Stage 2: a generator G refines the converted output y″, and a combined discriminator/classifier D+C compares it with real data, judging fake/real (F/R) and speaker identity (ID).]
• Subjective evaluations
[Fig. 20: Preference test results]
1. The proposed method uses non-parallel data.
2. The multi-target VC approach outperforms using stage 1 only.
3. The multi-target VC approach is comparable to Cycle-GAN-VC in both naturalness and similarity.
[Audio demo: conversion examples between Speaker A and Speaker B]
Thanks to Ju-chieh Chou for the experimental results.
Unsupervised Learning
Abstractive Summarization
Chat-bot
Voice Conversion
Audio word2vec & Speech Recognition
Generative Adversarial Network (GAN)
Audio Word2Vector
[Diagram: a model maps each word-level audio segment to a vector; it learns from lots of audio without annotation.]
Audio Word2Vector (v1)
• The audio segments corresponding to words with similar pronunciations are close to each other.
ever ever
never
never
never
dog
dog
dogs
In the following discussion, assume that we already get the segmentation.
Query-by-example Spoken Term Detection
A user gives a spoken query (e.g., "Trump"); the system computes the similarity between the spoken query and the audio files at the acoustic level, and finds where the query term occurs in the spoken content.
Query-by-example Spoken Term Detection
Audio archive divided into variable-length audio segments
[Diagram: off-line, each audio segment is converted to a vector by Audio Word to Vector; on-line, the spoken query is converted the same way, similarities are computed, and the search result is returned.]
This is much faster than DTW.
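The off-line/on-line split is what makes this fast: all segment embeddings are precomputed, so answering a query costs one matrix-vector product instead of a DTW alignment per segment. A sketch with random stand-in embeddings (sizes and noise level are arbitrary):

```python
import numpy as np

# Embedding-based query-by-example search.
# Off-line: every archive segment is mapped to a fixed-size unit vector.
# On-line: the query embedding is scored against all of them at once.
rng = np.random.default_rng(1)

archive = rng.normal(size=(1000, 64))  # 1000 precomputed segment embeddings
archive /= np.linalg.norm(archive, axis=1, keepdims=True)

# the query: a noisy copy of segment 123 (stands in for another utterance
# of the same word)
query = archive[123] + 0.05 * rng.normal(size=64)
query /= np.linalg.norm(query)

scores = archive @ query   # cosine similarities, one matrix-vector product
best = int(scores.argmax())
```

With vectors precomputed, the on-line cost is O(N·d) for N segments of dimension d, versus a quadratic-time DTW alignment per segment.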
Sequence-to-sequence Auto-encoder
[Diagram: an RNN encoder reads the acoustic features x1 x2 x3 x4 of an audio segment and compresses them into a single vector, the vector we want.]
We use sequence-to-sequence auto-encoder here
The training is unsupervised.
[Diagram: from that vector, the RNN decoder generates y1 y2 y3 y4 to reconstruct the acoustic features x1 x2 x3 x4 of the audio segment.]
The RNN encoder and decoder are jointly trained.
Original Seq2seq Auto-encoder
[Diagram: the RNN encoder reads the input acoustic features; the RNN decoder reconstructs the input segment. The learned vector therefore includes phonetic information, speaker information, etc.]
Feature Disentangle
[Diagram: a phonetic encoder produces z, a speaker encoder produces e, and the RNN decoder reconstructs the input segment from z and e.]
Feature Disentangle
[Diagram: training the speaker encoder. For two segments x_i and x_j from the same speaker, the speaker embeddings e_i and e_j are pulled as close as possible; for segments from different speakers, their distance must be larger than a threshold. Assume that we know the speaker ID of the segments.]
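The two constraints can be written as a contrastive hinge loss. The shapes, the margin value, and the function name are assumptions for illustration only:

```python
import numpy as np

# Contrastive sketch of the speaker-encoder criterion: same-speaker
# embeddings are pulled together; different-speaker embeddings are
# pushed at least `margin` apart (hinge loss).
def speaker_loss(e_i, e_j, same_speaker, margin=1.0):
    d = np.linalg.norm(e_i - e_j)
    if same_speaker:
        return d ** 2                     # as close as possible
    return max(0.0, margin - d) ** 2      # at least `margin` apart
```

Identical same-speaker embeddings give zero loss; different-speaker embeddings only stop incurring loss once their distance exceeds the margin.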
Feature Disentangle
[Diagram: training the phonetic encoder. A speaker classifier takes the phonetic embeddings z_i and z_j of two audio segments and scores whether they come from the same speaker or from different speakers; the phonetic encoder learns to confuse the speaker classifier, so that z carries no speaker information. Inspired by domain-adversarial training.]
• At each time step, the RNN encoder determines whether it is right before a segment boundary.
• If so, it outputs a vector: the embedding for the audio segment that just ended.
Joint Learning of Segmentation and Seq2seq Auto-encoder
[Diagram: the RNN encoder turns each audio segment x into an embedding for x; where to segment is determined automatically.]
[Diagram: the RNN decoder reconstructs the input utterance from the sequence of segment embeddings.]
Joint Learning of Segmentation and Seq2seq Auto-encoder
• The learning criterion of the RNN encoder and decoder is the weighted sum of two terms:
• 1. the reconstruction error, and
• 2. the number of segments, i.e., the number of output embeddings.
The second term is necessary: if we only minimized the reconstruction error, the RNN encoder would output an embedding at every time step.
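A toy numeric illustration of this trade-off (all numbers are made up): with the segment-count penalty, a coarser segmentation can win even though it reconstructs worse.

```python
# Sketch of the weighted-sum criterion: reconstruction error plus a
# penalty on the number of segments. `lam` is an assumed weight.
def criterion(recon_error, num_segments, lam=0.5):
    return recon_error + lam * num_segments

# hypothetical numbers: more segments -> lower reconstruction error
fine = criterion(recon_error=1.0, num_segments=20)
coarse = criterion(recon_error=3.0, num_segments=5)
# the coarse segmentation wins despite worse reconstruction
```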
Deciding whether to output an embedding is a discrete choice, so the whole network is not differentiable; the REINFORCE algorithm is used for training.
Audio Word2Vector (v2)
• Audio word to vector with semantics
flower tree
dog
cat
cats
walk
walked
run
Audio Skip-gram
[Text skip-gram: the one-hot vector of the center word w_t is mapped through a linear layer to a semantic embedding, which is mapped through another linear layer to predict the context words w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2}.]
Audio Skip-gram
[The one-hot input is replaced with the phonetic encoder's output for the audio of w_t; networks with two hidden layers map it to the semantic embedding that predicts the context words w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2}.]
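The skip-gram objective above predicts context words from the center word. A minimal sketch of how the (center, context) training pairs are generated with a window of two:

```python
# Generate skip-gram (center, context) training pairs from a token list.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for k in range(lo, hi):
            if k != t:  # skip the center word itself
                pairs.append((center, tokens[k]))
    return pairs

pairs = skipgram_pairs(["i", "eat", "an", "apple"])
```

In the audio version sketched above, the center word's one-hot vector would be replaced by the phonetic embedding of its audio segment, but the pairing scheme is the same.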
Experiments [Chen, et al., arXiv, 2018]
• Spearman's rank correlation scores of audio word2vec and text word2vec
V1: phonetic information; V2: semantic information. Training corpus: LibriSpeech.
Word pair set   V1    V2
MEN             0.38  0.43
Mturk           0.37  0.47
RG65            0.12  0.16
RW              0.70  0.71
SimLex999       0.22  0.27
WS353           0.50  0.52
WS353R          0.47  0.53
WS353S          0.53  0.50
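For reference, Spearman's rank correlation (the score reported above) is the Pearson correlation of ranks. A minimal implementation, without tie handling:

```python
import numpy as np

# Spearman rank correlation: rank both lists, then compute the Pearson
# correlation of the ranks. Assumes no tied values.
def spearman(a, b):
    ra = np.argsort(np.argsort(a)).astype(float)  # ranks of a
    rb = np.argsort(np.argsort(b)).astype(float)  # ranks of b
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))
```

Because only ranks matter, any monotone transformation of the scores leaves the correlation unchanged, which is why it suits comparing embedding similarities against human judgments.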
[Diagram: Text Word2Vec is trained on a large collection of text ("… I eat an apple. You eat an orange. …"); Audio Word2Vec (v2) is trained on a large collection of audio ("I eat an apple", "you eat an orange"). Both embed words such as dog, cat, pig, I, you, orange, apple into spaces with similar semantic structure.]
Unsupervised Conditional Generation
Unsupervised Speech Recognition?
Labeled Pairs Top 1 Top 10 Top 100
0 0.00 0.00 0.02
1K 0.04 0.14 0.50
2K 0.11 0.45 0.76
5K 0.18 0.61 0.86
Audio Word2Vec (v2)
[Diagram: a conditional generator maps vectors from the Audio Word2Vec (v2) space (dog, cat, pig, I, you, orange, apple) into the text Word2Vec space, e.g., recovering the word "dog" from the audio embedding of "dog".]
Unsupervised Speech Recognition
[Idea: phone-level acoustic pattern discovery finds recurring patterns in audio, giving pattern sequences such as p1 p3 p2 / p1 p4 p3 p5 p5 / p1 p5 p4 p3 / p1 p2 p3 p4; a GAN then matches these against phoneme sequences from text (e.g., AY L AH V Y UW, G UH D B AY, HH AW AA R Y UW, T AY W AA N, AY M F AY N), learning correspondences such as a pattern = "AY". [Liu, et al., arXiv, 2018]]
Unsupervised Speech Recognition [Liu, et al., arXiv, 2018]
• Phoneme recognition. Audio: TIMIT; Text: WMT (using oracle phoneme boundaries).
[Figure: phoneme recognition accuracy of supervised training vs. unsupervised training with WGAN-GP and with Gumbel-softmax.]
Conclusion
Abstractive Summarization
Chat-bot
Voice Conversion
Audio word2vec & Speech Recognition
Towards Unsupervised Learning by GAN