Multilingual End-to-End Speech Translation
Hirofumi Inaguma¹, Kevin Duh², Tatsuya Kawahara¹, Shinji Watanabe²
¹ Graduate School of Informatics, Kyoto University, Japan
² Center for Language and Speech Processing, Johns Hopkins University, Baltimore, USA

Summary
• We evaluated whether training data from other language pairs is helpful for the end-to-end speech translation (ST) task
• Source speech is translated directly into the target languages with a single sequence-to-sequence (S2S) model
Ø Many-to-many (M2M)
Ø One-to-many (O2M)
• Our models outperformed the bilingual end-to-end and pipeline speech translation models
• We performed transfer learning to a very low-resource ST task: Mboshi→French (4.4 h)
Ø Shared representations obtained from multilingual E2E-ST were more effective than those from the bilingual model

Background
• End-to-end speech-to-text translation (E2E-ST)
Ø Simplified architecture (no ASR decoder, LM, or MT encoder) → reduced number of parameters
Ø Avoids error propagation from ASR
• Multilingual speech-to-text translation: applications
Ø One-to-many (O2M): e.g., lectures, news reading
Ø Many-to-many (M2M): e.g., dialogues, meetings

Proposed method
• Universal sequence-to-sequence (S2S) model
Ø A single S2S architecture (a shared encoder/decoder)
Ø Does not generate the source sentence (no ASR)
• Target language biasing
Ø Feed a target language ID token (e.g., ⟨2fr⟩, ⟨2de⟩) to the translation decoder at the first step instead of ⟨sos⟩
• Why an end-to-end architecture for multilingual speech translation?
Ø The total number of parameters is drastically reduced
Ø No need to pre-define a mini-batch scheduler for the language cycle
Ø No need to change the existing architecture
Ø Potential for zero-shot translation (not investigated in this work)
• Conventional pipeline ST system
Ø Language identification → ASR → text normalization → MT

Data & Task

Translation           Corpus                                    #hours  #utterances  #words  #vocab  domain
Bilingual             (A) Fisher-CallHome Spanish (Es→En)       170     138 k        1.7 M   66      conversation
                      (B) Librispeech (En→Fr)                   99      45 k†        0.8 M   112     reading
                      (C) ST-TED (En→De)                        203     133 k        2.2 M   109     lecture
One-to-many (O2M)     (B) + (C) (En→{Fr, De})                   302     178 k        3.3 M   153     mixed
Many-to-many (M2Ma)   (A) + (B) ({En, Es}→{Fr, En})             269     183 k        2.8 M   121     mixed
Many-to-many (M2Mb)   (A) + (C) ({En, Es}→{De, En})             373     272 k        4.0 M   119     mixed
Many-to-many (M2Mc)   (A) + (B) + (C) ({En, Es}→{Fr, De, En})   472     317 k        5.1 M   157     mixed

Table 1: Statistics of each corpus. Each value is calculated after normalizing references and removing short and long utterances. †Two translation references are prepared per source speech utterance.

• One-to-many (O2M): Librispeech + ST-TED

Experimental evaluations

Fisher-CallHome Spanish (Es→En), BLEU:

Model                             Fisher-test  CH-evltest
MT
  Bi-NMT                          59.6         28.9
  (M2Ma) Multi-NMT                49.5         22.8
  (M2Mb) Multi-NMT                56.7         27.7
  (M2Mc) Multi-NMT                56.2         27.7
E2E-ST
  Bi-ST                           41.5         14.2
    + ASR pre-training            45.2         15.4
  (M2Ma) Multi-ST                 41.3         15.2
  (M2Mb) Multi-ST                 44.2         15.8
  (M2Mc) Multi-ST                 45.2         16.2
    + ASR pre-training            46.3         17.2
Pipeline-ST
  Mono-ASR → Bi-NMT               38.6         16.5
  (M2Ma) Multi-ASR → Bi-NMT       39.2         17.2
  (M2Mb) Multi-ASR → Bi-NMT       38.9         17.0
  (M2Mc) Multi-ASR → Bi-NMT       38.5         16.9

Librispeech (En→Fr), BLEU:

Model                             test
MT
  Bi-NMT                          18.3
  (O2M) Multi-NMT                 16.2
  (M2Ma) Multi-NMT                12.2
  (M2Mc) Multi-NMT                14.8
E2E-ST
  Bi-ST                           15.7
    + ASR pre-training            16.3
  (O2M) Multi-ST                  17.2
  (M2Ma) Multi-ST                 16.4
  (M2Mc) Multi-ST                 17.3
    + ASR pre-training            17.6
Pipeline-ST
  Mono-ASR → Bi-NMT               15.8
  (O2M) Multi-ASR → Bi-NMT        16.7
  (M2Ma) Multi-ASR → Bi-NMT       16.4
  (M2Mc) Multi-ASR → Bi-NMT       16.7

ST-TED (En→De), BLEU:

Model                             test
MT
  Bi-NMT                          23.0
  (O2M) Multi-NMT                 18.9
  (M2Mb) Multi-NMT                17.5
  (M2Mc) Multi-NMT                17.2
E2E-ST
  Bi-ST                           16.0
    + ASR pre-training            17.1
  (O2M) Multi-ST                  17.6
  (M2Mb) Multi-ST                 16.7
  (M2Mc) Multi-ST                 17.7
    + ASR pre-training            18.6
Pipeline-ST
  Mono-ASR → Bi-NMT               18.1
  (O2M) Multi-ASR → Bi-NMT        18.5
  (M2Mb) Multi-ASR → Bi-NMT       17.7
  (M2Mc) Multi-ASR → Bi-NMT       18.1
• Many-to-many (M2M)
Ø M2Ma: Fisher + Librispeech
Ø M2Mb: Fisher + ST-TED
Ø M2Mc: Fisher + Librispeech + ST-TED

Main results

Take-home message (many-to-many scenario)
• Additional training data from other language pairs was effective
• Multilingual E2E-ST models were not affected by the domain mismatch
• E2E-ST models gained more from multilingual training than the pipeline systems did

Take-home message (one-to-many scenario)
• O2M training was more effective than M2M training in terms of data efficiency
• However, using all training data (M2Mc) brought a further small gain
• O2M multilingual training benefits not only from the additional English speech data but also from the direct optimization

Problems of multilingual pipeline ST systems
Ø Data sparseness for low-resource directions
Ø Mis-identification of the source language in ASR
Ø Increased number of parameters
Ø Text normalization is needed per source language

Transfer learning to a very low-resource task: Mboshi→Fr (4.4 h)

Model                  BLEU (dev)
Bi-ST (Librispeech)    4.55
O2M-ST                 6.92
M2Ma-ST                5.50
M2Mc-ST                6.52

Take-home message (transfer learning)
• The multilingual E2E-ST seed was more effective than the bilingual one
• The O2M seed showed the best performance among all models
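The transfer-learning setup above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the parameter names (`encoder.w`, `decoder.embed`) are hypothetical, and arrays are simplified to plain lists. The idea is to seed the low-resource Mboshi→Fr model with a multilingual E2E-ST checkpoint by copying every parameter whose name and shape match, while parameters that cannot match (e.g., output embeddings for a new target vocabulary) keep their fresh initialization.

```python
# Hypothetical sketch of seeding a low-resource ST model from a
# multilingual checkpoint: copy parameters with matching name/shape.

def transfer_parameters(seed, target):
    """seed/target: dicts mapping parameter names to lists of floats."""
    copied = []
    for name, value in seed.items():
        if name in target and len(target[name]) == len(value):
            target[name] = list(value)  # overwrite with the seed weights
            copied.append(name)
    return copied

seed = {"encoder.w": [0.1, 0.2], "decoder.embed": [0.3, 0.4, 0.5]}
target = {"encoder.w": [0.0, 0.0], "decoder.embed": [0.0, 0.0]}  # new, smaller vocab
copied = transfer_parameters(seed, target)
# Only "encoder.w" is copied; "decoder.embed" shapes differ.
```

In a real toolkit this corresponds to a partial checkpoint load (e.g., non-strict state-dict loading), where the shared encoder/decoder weights carry over and mismatched layers are re-initialized.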

201912 ASRU2019 poster - sap.ist.i.kyoto-u.ac.jp




In multilingual NMT [16, 17, 18], an artificial token representing the target language (target language ID) is prepended to the source sentence. However, this is not suitable for the ST task since the ST encoder directly consumes speech features. Instead, we replace the start-of-sentence (⟨sos⟩) token in the decoder with a target language ID ⟨2lang⟩ (see Figure 1). For example, when English speech is translated to French text, ⟨sos⟩ is replaced with the French ID token ⟨2fr⟩.
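A minimal sketch of this biasing rule (the vocabulary and its integer IDs are hypothetical, not the paper's actual token table): decoding starts from a language ID token such as ⟨2fr⟩ rather than from ⟨sos⟩, which steers the shared decoder toward the requested target language.

```python
# Hypothetical special-token table; IDs are illustrative only.
SPECIAL_TOKENS = {"<sos>": 0, "<eos>": 1, "<2fr>": 2, "<2de>": 3, "<2en>": 4}

def initial_decoder_input(target_lang: str) -> int:
    """Return the token ID fed to the decoder at the first step."""
    token = f"<2{target_lang}>"
    if token not in SPECIAL_TOKENS:
        raise ValueError(f"unknown target language: {target_lang}")
    return SPECIAL_TOKENS[token]

# English speech -> French text: start decoding from <2fr>, not <sos>.
assert initial_decoder_input("fr") == SPECIAL_TOKENS["<2fr>"]
```

Since the encoder never sees text, this decoder-side token is the only place where the desired target language enters the model.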

3.3. Mixed data training

We train multilingual models with mixed training data from multiple languages. Thus, each mini-batch may contain utterances from different language pairs. We bucket all samples so that each mini-batch contains utterances with similar numbers of speech frames regardless of the language pair. As a result, we can use the same training scheme as in the conventional ASR and bilingual ST tasks.
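The mixed-data batching can be sketched as below (a simplified illustration, not the paper's actual data loader): pool the utterances of all language pairs, sort them by number of speech frames, and cut mini-batches so that each batch holds utterances of similar length with language pairs freely mixed.

```python
# Sketch of length-based bucketing over mixed language pairs.

def make_length_buckets(utterances, batch_size):
    """utterances: (utt_id, num_frames, lang_pair) tuples."""
    ordered = sorted(utterances, key=lambda u: u[1])  # sort by frame count
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

mixed_data = [
    ("es-en/utt1", 420, "Es-En"),
    ("en-fr/utt1", 130, "En-Fr"),
    ("en-de/utt1", 415, "En-De"),
    ("en-fr/utt2", 120, "En-Fr"),
]
batches = make_length_buckets(mixed_data, batch_size=2)
# Short utterances land together and long ones together,
# regardless of which language pair each came from.
```

Grouping by length keeps padding waste low, and because the grouping ignores the language pair, no per-language mini-batch scheduler is needed.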

4. DATA

We build our systems on three speech translation corpora: Fisher-CallHome Spanish, Librispeech, and the Speech-Translation TED (ST-TED) corpus. To the best of our knowledge, these are the only publicly available corpora recorded with a reasonable amount of real speech data⁶. The data statistics are summarized in Table 1.

4.1. Bilingual translation

(A) Fisher-CallHome Spanish: Es→En

This corpus contains about 170 hours of Spanish conversational telephone speech, the corresponding transcriptions, and the English translations⁷ [19]. Following [4, 19, 35], we report results on five evaluation sets: dev, dev2, and test in the Fisher corpus (with four references), and devtest and evltest in the CallHome corpus (with a single reference). We use Fisher/train as the training set and Fisher/dev as the validation set. All punctuation marks except apostrophes are removed during evaluation in the ST and MT tasks for comparison with previous work [4, 19].

(B) Librispeech: En→Fr

This corpus is a subset of the original Librispeech corpus [36] and contains 236 hours of English read speech, the corresponding transcriptions, and the French translations [20]. We use the clean 99 hours of speech data for the training set [5]. Translation references in the training set are augmented with Google Translate following [5], so we have two French references per utterance. We use the dev set as the validation set and report results on the test set.

⁶ We noticed a publicly available one-to-many multilingual ST corpus [34] right before submission. However, this dataset contains English speech only.

⁷ https://github.com/joshua-decoder/Fisher-CallHome-corpus

(C) Speech-Translation TED (ST-TED): En→De

This corpus contains 271 hours of English lecture speech, the corresponding transcriptions, and the German translations⁸. Since the original training set includes many noisy utterances due to low alignment quality, we adopt a data cleaning strategy. We first force-aligned all training utterances with the Gentle forced aligner⁹ based on Kaldi [37], then excluded every utterance in which not all words of the transcription were perfectly aligned with the corresponding audio signal [38]. This process reduced the training data from 171,121 to 137,660 utterances. We sampled two sets of 2k utterances from the cleaned training data as the validation and test sets, respectively (4k utterances in total). Note that all sets have no text overlap, are disjoint with respect to speakers, and the data splits are available in our code. We report results on this test set and on tst2013, one of the test sets provided in the IWSLT 2018 evaluation campaign. Since no human-annotated time alignments are provided for these test sets, we decided to sample the disjoint test set from the training data, for which alignment information is available.
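The keep-or-drop rule above can be sketched as a filter over the aligner's per-utterance output. The JSON field names here (`words`, `case`, `success`) follow our reading of Gentle's output format and should be treated as assumptions; the point is that an utterance survives only if every word of its transcription was aligned to the audio.

```python
import json

def is_fully_aligned(gentle_output: str) -> bool:
    """Keep an utterance only if every transcript word aligned (assumed
    Gentle-style JSON: {"words": [{"case": "success"}, ...]})."""
    words = json.loads(gentle_output).get("words", [])
    return bool(words) and all(w.get("case") == "success" for w in words)

kept = '{"words": [{"case": "success"}, {"case": "success"}]}'
dropped = '{"words": [{"case": "success"}, {"case": "not-found-in-audio"}]}'
assert is_fully_aligned(kept)
assert not is_fully_aligned(dropped)
```

Applied over the whole training set, a filter of this shape yields the 171,121 → 137,660 utterance reduction described above.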

4.2. Multilingual translation

We perform experiments in two scenarios: one-to-many (O2M) and many-to-many (M2M)¹⁰.

One-to-many (O2M)

For one-to-many (O2M) translation, speech utterances in a source language are translated into multiple target languages. We concatenate Librispeech (En→Fr) and ST-TED (En→De) and build models for En→{Fr, De} translation (see Table 1).

Many-to-many (M2M)

For many-to-many (M2M) translation, speech utterances in multiple source languages are translated into all target languages seen in training. We can regard this task as a more challenging optimization problem than O2M and M2O translation. We concatenate Librispeech (En→Fr) and Fisher-CallHome Spanish (Es→En), then build models for {En, Es}→{Fr, En} translation (M2Ma)¹¹. Other combinations, namely Fisher-CallHome Spanish and ST-TED ({En, Es}→{De, En}, M2Mb) and all three directions ({En, Es}→{Fr, De, En}, M2Mc), are also investigated.

⁸ https://sites.google.com/site/iwsltevaluation2018/Lectures-task
⁹ https://github.com/lowerquality/gentle

¹⁰ For the many-to-one (M2O) scenario, no suitable corpus combination exists among publicly available corpora, so we leave the exploration of this task for future work. However, O2M and M2M are the realistic scenarios for multilingual speech translation, as mentioned in Section 1.

¹¹ Readers might think that this scenario is not suitable for the M2M evaluation, since French does not appear on the source side as in the multilingual MT task [16]. However, such public corpora are not currently available.
