Speech Processing 2018 (音情報処理論 2018)

Satoshi Nakamura, Shinnosuke Takamichi, Sakriani Sakti, Koichiro Yoshino

Satoshi Nakamura @ NAIST 2016, 2016/10/4


What is speech?

The most natural and most important means of conveying intent in human communication.

– Speech compression
– Speech generation
– Speech recognition
– Speech signal processing
– In addition, acoustic signal processing


Silicon Audio

Compression of speech and music.

http://ja.wikipedia.org/wiki/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB:Ipod_5th_Generation_white.jpg



Apple Siri

Spoken question answering: ask questions by voice.



VoiceTra+

Translates spoken utterances on the spot.



What are the difficulties?

Speech synthesis
– Record and play back?
– Synthesize speech from text?
– Any voice?
– Angry or crying voices?

Speech recognition
– What does it mean to recognize one person's voice?
– Different speakers, genders, children?
– Accents, dialects, loanwords?
– Emotional speech?
– Acoustic interference?

Spoken dialog
– What does it mean for a machine to converse like a human?


Lecture goals

What is speech?

Human interfaces using speech information
– Speech recognition
– Speech synthesis
– Spoken dialog
– Speech translation
– In addition, acoustic signal processing

How do these technologies work inside, and what is the state of the art?


Lecture schedule

1. 11/2 Introduction to speech and acoustic signal processing (Nakamura)
2. 11/6 Speech analysis: feature extraction (DFT, LPC, cepstrum analysis) (Takamichi)
3. 11/8 Fundamentals of acoustic signal processing (Takamichi)
4. 11/12 Fundamentals of speech coding (Nakamura)
5. 11/14 Speech recognition theory and systems (Sakti)
6. 11/16 Speech synthesis theory and systems (Takamichi)
7. 11/20 Spoken dialog system theory and systems (Yoshino)
8. 11/22 Q&A and final examination (Nakamura)


Grading criteria

Homework is assigned at every class.
• Submit the report within one week.

The final examination is held at the last class.

The grade is based on the sum of these two scores. As a rough guide:

0.3 × (attendance + homework) score % + 0.7 × final exam score %

GPA normalization is applied.


Today's content will also be on the exam.


References

Books in Japanese

– Speech Engineering (音声工学), Shuichi Itahashi (ed.), Morikita Publishing. Today's lecture draws on this book.
– Automatic Translation of Spoken Language (話し言葉の自動翻訳), Satoshi Nakamura et al., Corona Publishing, 2018
– Digital Signal Processing of Speech and Acoustic Signals (音声・音情報のディジタル信号処理), Kiyohiro Shikano, Satoshi Nakamura, Shiro Ise, Shoko-do
– Speech Recognition Systems (音声認識システム), Kiyohiro Shikano, Takeda, et al., Corona Publishing
– New Acoustics and Speech Engineering (新音響・音声工学), Furui, Kindai Kagaku-sha


Research on Speech

Human speech organs and the speech production mechanism

Sound propagation, acoustic engineering, acoustic signal processing

Human auditory organs and the hearing mechanism
⇒ Physiology, psychology

Language understanding and generation
⇒ Linguistics

Realization by computers
⇒ Computer science, informatics


Speech production and hearing

Speech chain

[Figure: the speech chain in speech production and hearing. Speaker: brain (linguistic level), nerve fibers and speech organs (physiological level); acoustic signals (acoustic level); listener: ears and nerve fibers (physiological level), brain (linguistic level).]

Speech Organs

Vowel system

[Figure 2.1: Classification of vowels]

Properties of vowels: formants

– Formant
– Formant frequency
– Formant bandwidth

Formant Frequencies of Japanese Vowels

[Figure 2.12: F1-F2 distribution in continuous speech]

Consonants

Semivowels: /j/ /w/

Plosives: /p,t,k/ (voiceless), /b,d,g/ (voiced)

[Figure 2.13: Voiceless and voiced plosives. Waveform of a voiceless plosive: silence, plosive, aspiration, vowel. Waveform of a voiced plosive: buzz, plosive, vowel.]

IPA

Allophones

Voiced and unvoiced sounds

Stops (plosives), nasals, fricatives, affricates


Co-articulation

The /a/ in /aoi/ (青い, "blue") has a different place of articulation from the /a/ in /aida/ (間, "between").

– The /a/ in /aoi/ is a back vowel, close to the following /o/
– The /a/ in /aida/ is articulated further forward, close to the following /i/

This is called co-articulation, or assimilation.

– Nasalization: in /namae/ (名前, "name"), the /a/ is nasalized
– Devoicing: in /akita/ and /yakusho/, the /i/ and /u/ between voiceless consonants are devoiced


Accent, Intonation

Pitch (tone) accent

Stress accent

Segmental phoneme

Paralinguistic information


Fundamental Frequency of Sentence Utterances

[Figure 2.19: Fundamental frequency during sentence utterances]

Niwa niwaniwatorigairu

Niwaniwaniwa torigairu

Niwa niwaniwa torigairu
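The slide shows fundamental frequency (F0) contours but not how F0 is measured. As a rough illustration only, the sketch below estimates F0 of a single frame from the peak of its normalized autocorrelation. The method, the function name `estimate_f0`, the search range of 70 to 400 Hz, and the synthetic test signal are all assumptions for this example, not the procedure used to produce the figure.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=70.0, f0_max=400.0):
    """Estimate F0 (Hz) of one frame from its normalized autocorrelation peak."""
    lag_min = int(sample_rate / f0_max)   # shortest period considered
    lag_max = int(sample_rate / f0_min)   # longest period considered
    best_lag, best_score = lag_min, -1.0
    for lag in range(lag_min, lag_max + 1):
        head = frame[:len(frame) - lag]
        tail = frame[lag:]
        dot = sum(a * b for a, b in zip(head, tail))
        norm = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        score = dot / norm if norm > 0.0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic voiced frame: a 125 Hz fundamental plus its second harmonic.
sample_rate = 8000
true_f0 = 125.0
frame = [math.sin(2 * math.pi * true_f0 * n / sample_rate)
         + 0.5 * math.sin(2 * math.pi * 2 * true_f0 * n / sample_rate)
         for n in range(400)]

f0 = estimate_f0(frame, sample_rate)
print(f"Estimated F0: {f0:.1f} Hz")
```

Running such an estimator frame by frame over an utterance would give a contour of the kind plotted in the figure; real speech additionally needs a voiced/unvoiced decision, which this sketch omits.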

Spoken Language

– Sentences are short
– Subjects and other elements are often omitted
– Contracted forms are used heavily
– Sentence-final particles such as /ne/, /sa/, /yo/ are attached
– The same words are often repeated
– Complex syntax is avoided
– Temporal factors (forgetting) are involved
– Fillers and hesitations such as /e-/, /a-/, /u-/, and laughter
– Slips of the tongue, restarts, and self-corrections are frequent


Auditory Organs

Cochlea

Cross-section of the cochlea

Basilar membrane vibration

Equal-loudness curves


Pitch perception (perception of frequency)

[Figure: mel scale plotted against the linear frequency scale (Hz)]
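The figure relates the mel scale to linear frequency in Hz, but the slide gives no formula. One widely used approximation is m = 2595 log10(1 + f/700); the sketch below assumes that variant, which may differ from the curve on the slide. The function names are chosen for this example.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert Hz to mel using the common 2595*log10(1 + f/700) approximation."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps in Hz become smaller steps in mel at high frequencies.
for f in (100, 500, 1000, 2000, 4000, 8000):
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")
```

With this formula 1000 Hz maps to roughly 1000 mel, and the mapping becomes increasingly compressive above that, which is the shape the mel-versus-Hz plot illustrates.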

Spectral Masking

Temporal Masking

Categorical Perception


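What are the difficulties?

Speech synthesis
– Record and play back
  • We want to modify the speech
  – We want to control the spectral structure and the sound source separately
– Synthesize speech from text
  • No intonation
  • Unnatural because of co-articulation effects
  – Modification is needed

Speech recognition
– Even the same person's speech differs every time
  • Temporal structure, spectral structure
– Speech differs between speakers, genders, and children
– Accents, dialects, loanwords
– Noise, reverberation

Spoken dialog
– What does it mean to converse like a human?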




Recent Progress in Speech Recognition

Satoshi Nakamura@NAIST, Invited Talk, 2017 ASJ Fall Meeting

Background
– Template matching, dynamic programming (dynamic time warping) [Sakoe 71]
– Hidden Markov model, N-gram language model [Mercer 83, etc.]
– Neural networks: TDNN [Waibel 89], LSTM [Hochreiter 97]
– Weighted Finite State Transducer [Mohri 2006]
– Collection of huge amounts of data, including data collection through trial services

Recent progress by deep learning
– DNN-HMM [Hinton 2012]
  • The DNN directly estimates the state posterior probabilities
– Connectionist Temporal Classification [Graves 2013]
  • Outputs a phone label for every frame
– Listen, Attend, and Spell [Chan 2016]
  • Attention-based encoder-decoder; the attention mechanism improves accuracy
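The first item in the background list, template matching with dynamic time warping, can be illustrated with a small sketch. It assumes one scalar feature per frame and an absolute-difference local cost, both simplifications for this example (real recognizers compare feature vectors); the template names and values are made up.

```python
def dtw_distance(template, query):
    """Accumulated cost of the best monotonic alignment between two sequences."""
    inf = float("inf")
    n, m = len(template), len(query)
    # cost[i][j]: best accumulated cost aligning template[:i] with query[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - query[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # template frame repeated
                                     cost[i][j - 1],      # query frame repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Hypothetical one-dimensional "word" templates.
templates = {"rise": [1, 2, 3, 4], "fall": [4, 3, 2, 1]}
query = [1, 1, 2, 3, 3, 4]  # a slower version of "rise"

best = min(templates, key=lambda name: dtw_distance(templates[name], query))
print(best)
```

The warping absorbs the difference in speaking rate, so the slower query still matches the "rise" template with zero accumulated cost.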


Speech Recognition System

[Figure: block diagram. The speech waveform passes through feature extraction to give the feature sequence X_s. A search algorithm combines a phone-level matcher (acoustic model P(X_s | Λ)), a word-level matcher (lexicon P(Λ | W)), and a sentence-level matcher (language model P(W)) and outputs the recognized word hypothesis Ŵ. The acoustic model is trained on a speech corpus and the language model on a text corpus by statistical learning.]

The most probable string of words (maximum-likelihood decoding of the word sequence):

$$\hat{W} = \arg\max_{W} P(W \mid X_s) = \arg\max_{W} P(W)\, P(X_s \mid W) \approx \arg\max_{W} P(W)\, P(\Lambda \mid W)\, P(X_s \mid \Lambda)$$

Here Λ stands for the phone (subword unit) sequence that links the lexicon to the acoustic model.

Five components:
– Feature extraction
– Acoustic model
– Pronunciation lexicon
– Language model
– Search algorithm
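The decoding rule on this slide picks the word string W that maximizes P(W) P(X_s | W). A minimal sketch of that argmax over a fixed list of hypotheses follows. The two hypotheses and all probability values are invented for illustration, and a real search algorithm explores a far larger space rather than enumerating candidates.

```python
import math

# Hypothetical scores for one utterance X_s; the numbers are made up.
# "lm" plays the role of P(W), "am" the role of P(X_s | W).
hypotheses = {
    "recognize speech":   {"lm": 0.010, "am": 0.0020},
    "wreck a nice beach": {"lm": 0.001, "am": 0.0030},
}

def decode(hyps):
    """Return argmax_W [log P(W) + log P(X_s | W)]."""
    return max(hyps, key=lambda w: math.log(hyps[w]["lm"]) + math.log(hyps[w]["am"]))

print(decode(hypotheses))
```

In this toy case the acoustic score alone would prefer the second hypothesis, and the language model tips the decision toward the first, which is the reason the two models are combined.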

2018/11/1 ©Prof. Satoshi Nakamura, NARA INSTITUTE OF SCIENCE AND TECHNOLOGY

Speech Recognition by DNN

Hybrid HMM-DNN

Attention-based encoder-decoder: end-to-end ASR

CNN: feature extraction; LSTM: temporal (sequence) modeling; DNN: high-accuracy discrimination [Sainath et al. 2015]

[Figures: TIMIT phone recognition results; results on Google data]


CTC: Connectionist Temporal Classification

Problem: training an RNN phone recognizer requires a label for every frame. Until now, an HMM has been used to provide them.

Connectionist Temporal Classification (CTC) [A. Graves et al 2006]

– Labels are assigned by dynamic programming during training.
– In training, the correct label sequence l is aligned to the input sequence x, summing over frame-level paths π:

$$P(\mathbf{l} \mid \mathbf{x}) = \sum_{\boldsymbol{\pi}} P(\mathbf{l} \mid \boldsymbol{\pi})\, P(\boldsymbol{\pi} \mid \mathbf{x})$$

Classical framewise RNN vs RNN-CTC

Model           WER
Classical RNN   14.0%
RNN+CTC         12.9%

Results on English Voice Search using 2000 hours of data [H. Sak et al 2015].
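The sum over paths in the CTC formula can be checked by brute force on a tiny example. The sketch below assumes the standard CTC mapping (merge repeated symbols, then remove blanks), so that P(l | π) is 1 when path π collapses to l and 0 otherwise. The per-frame posteriors are made up, and practical implementations use a forward-backward dynamic program instead of enumerating every path.

```python
from itertools import product

BLANK = "-"

def collapse(path):
    """CTC mapping: merge repeated symbols, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return "".join(s for s in merged if s != BLANK)

def ctc_label_probability(frame_posteriors, label):
    """P(l|x): total probability of all frame-level paths that collapse to l."""
    symbols = list(frame_posteriors[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_posteriors)):
        if collapse(path) == label:
            p = 1.0
            for t, s in enumerate(path):
                p *= frame_posteriors[t][s]   # frames are independent given x
            total += p
    return total

# Made-up per-frame posteriors over {a, b, blank} for 3 frames.
posteriors = [
    {"a": 0.6, "b": 0.1, BLANK: 0.3},
    {"a": 0.5, "b": 0.2, BLANK: 0.3},
    {"a": 0.1, "b": 0.6, BLANK: 0.3},
]

p_ab = ctc_label_probability(posteriors, "ab")
print(p_ab)
```

For the label "ab" and three frames, five paths contribute (aab, abb, ab-, a-b, -ab), which shows how one label sequence is trained without any frame-level alignment being given.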


Improvements in Speech Recognition Performance

Saon, et al., "English Conversational Telephone Speech Recognition by Humans and Machines", INTERSPEECH 2017

[1] R. P. Lippmann, "Speech recognition by machines and humans," Speech Communication, vol. 22, no. 1, pp. 1–15, 1997.


Recent Progress in Speech Synthesis

Formant synthesis, unit-based synthesis

Probabilistic model-based speech synthesis: HTS
– Speech synthesis in the HMM framework
– Tokuda, et al., "Speech parameter generation algorithms for HMM-based speech synthesis", ICASSP 2000

WaveNet
– Generates the waveform with a neural network that applies convolutions to the time-series signal
– van den Oord et al., "WaveNet: A Generative Model for Raw Audio", arXiv:1609.03499v2 [cs.SD] 19 Sep 2016

Tacotron
– Generates a spectrogram from character input, then generates the waveform with the Griffin-Lim algorithm
– Wang, et al., "Tacotron: Towards End-to-End Speech Synthesis", arXiv:1703.10135v2 [cs.CL] 6 Apr 2017
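The slide describes WaveNet only as a network that convolves over the time signal. In the cited paper those convolutions are causal and dilated, and the sketch below illustrates just that one operation. It leaves out the gated activations, residual connections, and output distribution, and the kernel weights and dilation schedule are chosen purely for the example.

```python
def causal_dilated_conv(signal, weights, dilation):
    """y[t] = sum_k weights[k] * signal[t - k*dilation]; samples before t=0 count as zero."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:          # causal: never look at future samples
                acc += w * signal[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Push a unit impulse through four layers with doubling dilation.
x = [1.0] + [0.0] * 15
for d in (1, 2, 4, 8):
    x = causal_dilated_conv(x, [1.0, 1.0], d)

print(x)
print(receptive_field(2, (1, 2, 4, 8)))
```

The impulse spreads over 16 output samples after only four layers, which shows why doubling the dilation lets a stack of small kernels cover a long stretch of waveform.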


Architecture of WaveNet

Architecture of Tacotron


Recent Progress in Machine Translation

Rule-based MT: translation rules are carefully written by linguists and workers with linguistic knowledge

Corpus-based MT:
– Example-based: rules are extracted automatically from a corpus [M. Nagao 84, Sato et al. 89, Sumita et al. 91]
– Statistical MT: additionally learns from the corpus the probability of how often each rule occurs. Noisy channel model [P. F. Brown et al. 93]
– Phrase-based SMT: uses the phrase, not the word, as the unit

Tree-to-string
– Statistical MT that learns relations between syntactic structures

Neural Machine Translation
– Combines an LSTM encoder and decoder to generate the translation

Attention NMT
– Weights the encoder outputs for the source word sequence before feeding them to the decoder, so that alignment is learned implicitly and decoding focuses on the related source words


Phrase-based SMT

Example: 友達とご飯を食べた → ate a meal with a friend

– Phrase alignment: 友達 / a friend, と / with, ご飯を / a meal, 食べた / ate
– Re-ordering
– Translation

Tree-to-string MT: use of syntactic structure

[Figure: a parser turns the source sentence into a syntax tree (nodes such as VP0-5, PP0-1, VP2-5, PP2-3, VP4-5, N0, P1, N2, P3, V4, SUF5); translation rules such as "x1 with x0" and "x1 x0" are applied to the tree to produce the output.]
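The phrase-based example on this slide (align phrases, translate, re-order) can be written out as a toy. The phrase table below is taken from the slide, while the target order is supplied by hand, which is an assumption for illustration: a real phrase-based system scores many segmentations and orderings with translation, reordering, and language models.

```python
# Toy phrase table using the phrase pairs shown on the slide.
phrase_table = {"友達": "a friend", "と": "with", "ご飯を": "a meal", "食べた": "ate"}

def translate_phrases(source_phrases, order):
    """Translate each source phrase, then emit the results in the given target order."""
    translated = [phrase_table[p] for p in source_phrases]
    return " ".join(translated[i] for i in order)

source = ["友達", "と", "ご飯を", "食べた"]

# Hand-picked reordering: verb first, then object, then "with a friend".
result = translate_phrases(source, order=[3, 2, 1, 0])
print(result)
```

Japanese places the verb last and English places it early, so the reordering step carries most of the difficulty for this language pair even when each phrase translation is simple.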


Statistical Translation Frameworks

Running example: he has a cold → 彼は風邪を引いている

Symbolic models
– Phrase-based MT [Koehn+ 03]: phrase pairs he / 彼は, has / 引いている, a cold / 風邪を
– Tree-to-String MT [Liu+ 06]: the source is parsed (S, NP, VP; PRP VBZ DET NN) before translation

Continuous-space (neural) models
– Encoder-Decoder [Sutskever+ 14]: reads "he has a cold <s>", then outputs 彼 は 風邪 を 引いている <s> one word at a time
– Attentional [Bahdanau+ 15]: attention weights a1, ..., a4 over the encoder states g1, ..., g4 are combined with the decoder states (h_{i-1}, h_i) to predict P(e_i | F, e_1, ..., e_{i-1})

Intelligent and Invisible Computing
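The attentional model in the figure weights the encoder states g1, ..., g4 with a1, ..., a4 before predicting the next word. A minimal sketch of that weighting follows. It scores each encoder state with a dot product for brevity, whereas the cited model of Bahdanau et al. uses a small feed-forward network for scoring; the vectors are made-up two-dimensional values.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(decoder_state, encoder_states):
    """Return attention weights a_j and the context vector sum_j a_j * g_j."""
    scores = [sum(h * g for h, g in zip(decoder_state, g_j)) for g_j in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * g_j[d] for a, g_j in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Hypothetical encoder states g1..g4 for "he has a cold" and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
weights, context = attend([1.0, 3.0], encoder_states)
print(weights)
print(context)
```

The weights act as a soft alignment: the source position with the largest weight is the one the decoder relies on most for the current output word.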



Neural MT: NMT Re-ranking

Input: Tomodachi to gohan wo tabeta

– NMT encodes the source as a vector representation and, given the history of MT results (e.g. "I ate"), predicts the next word: 0.5 a, 0.3 rice, 0.1 the, …
– Tree-to-string MT (T2S) produces candidate translations, and each is scored with its NMT probability:

NMT probability   T2S candidate
0.3               I ate a meal with my friend
0.5               I ate rice with my friend
0.1               I ate rice and my friend

– Take the best hypothesis: I ate rice with my friend

Good example by NMT:
Original: demo kensa ha kanari itai desuka?
Before: But quite sore test?
After: But the test hurts a lot?
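The re-ranking step on this slide amounts to scoring each tree-to-string candidate with the NMT model and keeping the highest-scoring one. The sketch below uses the three candidates and probabilities shown on the slide; the function name `rerank` is chosen for the example, and in practice the NMT score would be computed by a model rather than stored in a list.

```python
# N-best list from the tree-to-string system with NMT probabilities, as on the slide.
nbest = [
    ("I ate a meal with my friend", 0.3),
    ("I ate rice with my friend", 0.5),
    ("I ate rice and my friend", 0.1),
]

def rerank(candidates):
    """Sort candidate translations by NMT probability, best first."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)

best_sentence, best_prob = rerank(nbest)[0]
print(best_sentence)
```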


Speech-to-speech Translation System

Bridging people who speak different languages with speech translation technology

Multilingual speech recognition → spoken language translation → multilingual speech synthesis

Example (Japanese → English): 「私は学校に行く: Watashi wa Gakko he iku」 → "I go to school"

NAIST Open Lecture (公開講座), Satoshi Nakamura@AHC Lab

Invited Talk © U. Trento Satoshi Nakamura, NAIST

History of Speech Translation Research in Japan

[Figure: timeline with marks at 1986, 1992, 1999, 2006, 2008, 2010, 2011, and 2014; the organizations shown are ATR, NICT, and NAIST.]

Research phases, in chronological order:
– Fundamentals: translation of read speech. Syntactically correct expressions, clear utterances, limited domain. Ex. "Conference Registration"
– Daily conversation: standard expressions, unclear utterances, limited domain. Ex. "Hotel Reservation"
– Wider and real domain ("International Travel"): realistic expressions, noisy speech, Japanese-English and Japanese-Chinese speech translation
– More languages for translation

Technology: rule-based technology (hand-made rules) → corpus-based technology (large-scale corpus + machine learning)

Projects and activities:
– C-STAR: international consortium for joint research on speech translation; multilateral translation for 7 world languages
– IWSLT: evaluation campaign (workshop) for speech-to-speech translation technologies
– A-STAR: multilateral translation for 8 Asian languages; network-based S2ST; Cabinet Office project for accelerating the return of research results to society (社会還元加速PJ)
– VoiceTra: 21-language multilateral text translation
– U-STAR
– NICT GC PJ (the slide notes a start in November 2007)

iPhone Apps: VoiceTra, TexTra

• VoiceTra, a new network-based speech translation application for iPhone, was released on the App Store on July 29, 2010.
• 21 languages; speech input/output in 6 languages (Japanese, English, Chinese, Indonesian, Vietnamese, Malay)
• Around 0.8 million downloads and 10 million accesses as of 2012

* The text-translation application TexTra was released at the same time.

Supported languages: Japanese, English, Mandarin, Taiwanese Mandarin, German, French, Dutch, Danish, Italian, Spanish, Portuguese, Brazilian Portuguese, Russian, Arabic, Hindi, Indonesian, Malay, Thai, Tagalog, Vietnamese, Korean

※ Languages shown in red on the slide support voice input/output.
※ There is no text input support for Hindi or Vietnamese.

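Toward Simultaneous Interpretation: Simultaneous Incremental Speech Translation (InterSpeech 2013)

Problem: conventional methods wait for the end of the sentence, so they are slow.

Proposed method: translate phrase by phrase without waiting for the end of the sentence.

[Figure: two timelines. Conventional: speech recognition, translation, and speech synthesis run in sequence after the utterance. Proposed: speech recognition runs during the utterance, and translation and speech synthesis are repeated for each phrase.]

Example: ASR outputs こんにちは、 / 駅は / どこですか？; each phrase is passed to MT and TTS, giving "Hello, the station where is it?"

Delay: reduced

But, this is not easy!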
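The difference between sentence-level and phrase-level translation on this slide can be sketched with a generator that emits output as soon as each phrase arrives. The toy phrase translations come from the slide's example, while the phrase end times and the simple delay model (which ignores MT and TTS compute time) are assumptions for illustration.

```python
# Toy phrase translations from the slide's example; a real system would call an MT engine.
toy_mt = {"こんにちは、": "Hello,", "駅は": "the station", "どこですか？": "where is it?"}

def incremental_translate(asr_phrases):
    """Yield a translation as soon as each phrase arrives from ASR."""
    for phrase in asr_phrases:
        yield toy_mt[phrase]

def first_output_delay(phrase_end_times, incremental):
    """Earliest time translated output can start, ignoring MT/TTS compute time."""
    return phrase_end_times[0] if incremental else phrase_end_times[-1]

phrases = ["こんにちは、", "駅は", "どこですか？"]
end_times = [0.8, 1.5, 2.6]  # hypothetical phrase end times in seconds

output = " ".join(incremental_translate(phrases))
print(output)
print(first_output_delay(end_times, incremental=True),
      first_output_delay(end_times, incremental=False))
```

The output keeps the source word order ("the station where is it?"), which shows the trade-off the slide points to: translating before the sentence ends lowers delay but leaves less context for reordering.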

Results

[Figure: translation accuracy (RIBES, axis from 38 to 50; higher is more accurate) plotted against delay (sec, axis from 0 to 6; lower is faster). Series: the system (labeled LM+Tu) translating at the end of each phrase and at the end of each utterance, compared with human interpreters of A rank (4 years of experience) and B rank (1 year of experience).]

The system is roughly equivalent to a B-rank human interpreter with 1 year of experience.

That is all for today.

Thank you for your attention

