Multilingual End-to-End Speech Translation
Hirofumi Inaguma¹, Kevin Duh², Tatsuya Kawahara¹, Shinji Watanabe²
¹ Graduate School of Informatics, Kyoto University, Japan
² Center for Language and Speech Processing, Johns Hopkins University, Baltimore, USA

Summary
• We evaluated whether training data from other language pairs is helpful for the end-to-end speech translation (ST) task
• Source speech is translated directly into the target languages with a single sequence-to-sequence (S2S) model
Ø Many-to-many (M2M)
Ø One-to-many (O2M)
• Our models outperformed the bilingual end-to-end and pipeline speech translation models
• We performed transfer learning to a very low-resource ST task: Mboshi→French (4.4 h)
Ø Shared representations obtained from multilingual E2E-ST were more effective than those from the bilingual model

Background
• End-to-end speech-to-text translation (E2E-ST)
Ø Simplified architecture (no ASR decoder, LM, or MT encoder) → reduced number of parameters
Ø Avoids error propagation from ASR
• Multilingual speech-to-text translation: applications
Ø One-to-many (O2M): e.g., lectures, news reading
Ø Many-to-many (M2M): e.g., dialogues, meetings

Proposed method
• Universal sequence-to-sequence (S2S) model
Ø A single S2S architecture (a shared encoder/decoder)
Ø Does not generate the source sentence (no ASR)
• Target language biasing
Ø Feed a target language ID token (e.g., ⟨2fr⟩, ⟨2de⟩) to the translation decoder at the first step instead of ⟨sos⟩
• Why an end-to-end architecture for multilingual speech translation?
Ø The total number of parameters is drastically reduced
Ø No need to pre-define a mini-batch scheduler for the language cycle
Ø No need to change the existing architecture
Ø Potential for zero-shot translation (not investigated in this work)
• Conventional pipeline ST system
Ø Language identification → ASR → text normalization → MT

Data & Task

Translation           Corpus                                    #hours  #utterances  #words  #vocab  domain
Bilingual             (A) Fisher-CallHome Spanish (Es→En)       170     138 k        1.7 M   66      conversation
                      (B) Librispeech (En→Fr)                   99      45 k†        0.8 M   112     reading
                      (C) ST-TED (En→De)                        203     133 k        2.2 M   109     lecture
One-to-many (O2M)     (B) + (C) (En→{Fr, De})                   302     178 k        3.3 M   153     mixed
Many-to-many (M2Ma)   (A) + (B) ({En, Es}→{Fr, En})             269     183 k        2.8 M   121     mixed
Many-to-many (M2Mb)   (A) + (C) ({En, Es}→{De, En})             373     272 k        4.0 M   119     mixed
Many-to-many (M2Mc)   (A) + (B) + (C) ({En, Es}→{Fr, De, En})   472     317 k        5.1 M   157     mixed

Table 1: Statistics of each corpus. Each value is calculated after normalizing references and removing short and long utterances. †Two translation references are prepared per source speech utterance.

• One-to-many (O2M): Librispeech + ST-TED

Experimental evaluations

Fisher-CallHome Spanish (Es→En), BLEU:

Model                             Fisher-test  CH-evltest
MT
  Bi-NMT                          59.6         28.9
  (M2Ma) Multi-NMT                49.5         22.8
  (M2Mb) Multi-NMT                56.7         27.7
  (M2Mc) Multi-NMT                56.2         27.7
E2E-ST
  Bi-ST                           41.5         14.2
    + ASR pre-training            45.2         15.4
  (M2Ma) Multi-ST                 41.3         15.2
  (M2Mb) Multi-ST                 44.2         15.8
  (M2Mc) Multi-ST                 45.2         16.2
    + ASR pre-training            46.3         17.2
Pipeline-ST
  Mono-ASR → Bi-NMT               38.6         16.5
  (M2Ma) Multi-ASR → Bi-NMT       39.2         17.2
  (M2Mb) Multi-ASR → Bi-NMT       38.9         17.0
  (M2Mc) Multi-ASR → Bi-NMT       38.5         16.9

Librispeech (En→Fr), BLEU:

Model                             test
MT
  Bi-NMT                          18.3
  (O2M) Multi-NMT                 16.2
  (M2Ma) Multi-NMT                12.2
  (M2Mc) Multi-NMT                14.8
E2E-ST
  Bi-ST                           15.7
    + ASR pre-training            16.3
  (O2M) Multi-ST                  17.2
  (M2Ma) Multi-ST                 16.4
  (M2Mc) Multi-ST                 17.3
    + ASR pre-training            17.6
Pipeline-ST
  Mono-ASR → Bi-NMT               15.8
  (O2M) Multi-ASR → Bi-NMT        16.7
  (M2Ma) Multi-ASR → Bi-NMT       16.4
  (M2Mc) Multi-ASR → Bi-NMT       16.7

ST-TED (En→De), BLEU:

Model                             test
MT
  Bi-NMT                          23.0
  (O2M) Multi-NMT                 18.9
  (M2Mb) Multi-NMT                17.5
  (M2Mc) Multi-NMT                17.2
E2E-ST
  Bi-ST                           16.0
    + ASR pre-training            17.1
  (O2M) Multi-ST                  17.6
  (M2Mb) Multi-ST                 16.7
  (M2Mc) Multi-ST                 17.7
    + ASR pre-training            18.6
Pipeline-ST
  Mono-ASR → Bi-NMT               18.1
  (O2M) Multi-ASR → Bi-NMT        18.5
  (M2Mb) Multi-ASR → Bi-NMT       17.7
  (M2Mc) Multi-ASR → Bi-NMT       18.1
• Many-to-many (M2M)
Ø M2Ma: Fisher + Librispeech
Ø M2Mb: Fisher + ST-TED
Ø M2Mc: Fisher + Librispeech + ST-TED

Main results

Take-home message (many-to-many scenario)
• Additional training data from other language pairs was effective
• Multilingual E2E-ST models were not affected by the domain mismatch
• E2E-ST models gained more from multilingual training than the pipeline systems did

Take-home message (one-to-many scenario)
• O2M training was more effective than M2M training in terms of data efficiency
• However, using all training data (M2Mc) brought a further small gain
• O2M multilingual training benefits not only from the additional English speech data but also from the direct optimization

Problems of multilingual pipeline ST systems
Ø Data sparseness for low-resource directions
Ø Mis-identification of the source language in ASR
Ø Increased number of parameters
Ø Text normalization is needed per source language

Transfer learning to a very low-resource task: Mboshi→Fr (4.4 h)

Model                  BLEU (dev)
Bi-ST (Librispeech)    4.55
O2M-ST                 6.92
M2Ma-ST                5.50
M2Mc-ST                6.52

Take-home message (transfer learning)
• The multilingual E2E-ST seed was more effective than the bilingual one
• The O2M seed showed the best performance among all models
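The transfer-learning setup above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the parameter names (`encoder.w`, `decoder.embed`) are hypothetical, and arrays are simplified to plain lists. The idea is to seed the low-resource Mboshi→Fr model with a multilingual E2E-ST checkpoint by copying every parameter whose name and shape match, while parameters that cannot match (e.g., output embeddings for a new target vocabulary) keep their fresh initialization.

```python
# Hypothetical sketch of seeding a low-resource ST model from a
# multilingual checkpoint: copy parameters with matching name/shape.

def transfer_parameters(seed, target):
    """seed/target: dicts mapping parameter names to lists of floats."""
    copied = []
    for name, value in seed.items():
        if name in target and len(target[name]) == len(value):
            target[name] = list(value)  # overwrite with the seed weights
            copied.append(name)
    return copied

seed = {"encoder.w": [0.1, 0.2], "decoder.embed": [0.3, 0.4, 0.5]}
target = {"encoder.w": [0.0, 0.0], "decoder.embed": [0.0, 0.0]}  # new, smaller vocab
copied = transfer_parameters(seed, target)
# Only "encoder.w" is copied; "decoder.embed" shapes differ.
```

In a real toolkit this corresponds to a partial checkpoint load (e.g., non-strict state-dict loading), where the shared encoder/decoder weights carry over and mismatched layers are re-initialized.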

201912 ASRU2019 poster - sap.ist.i.kyoto-u.ac.jp




In multilingual NMT [16, 17, 18], an artificial token representing the target language (target language ID) is prepended to the source sentence. However, this is not suitable for the ST task since the ST encoder directly consumes speech features. Instead, we replace the start-of-sentence (⟨sos⟩) token in the decoder with a target language ID ⟨2lang⟩ (see Figure 1). For example, when English speech is translated to French text, ⟨sos⟩ is replaced with the French ID token ⟨2fr⟩.
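A minimal sketch of this biasing rule (the vocabulary and its integer IDs are hypothetical, not the paper's actual token table): decoding starts from a language ID token such as ⟨2fr⟩ rather than from ⟨sos⟩, which steers the shared decoder toward the requested target language.

```python
# Hypothetical special-token table; IDs are illustrative only.
SPECIAL_TOKENS = {"<sos>": 0, "<eos>": 1, "<2fr>": 2, "<2de>": 3, "<2en>": 4}

def initial_decoder_input(target_lang: str) -> int:
    """Return the token ID fed to the decoder at the first step."""
    token = f"<2{target_lang}>"
    if token not in SPECIAL_TOKENS:
        raise ValueError(f"unknown target language: {target_lang}")
    return SPECIAL_TOKENS[token]

# English speech -> French text: start decoding from <2fr>, not <sos>.
assert initial_decoder_input("fr") == SPECIAL_TOKENS["<2fr>"]
```

Since the encoder never sees text, this decoder-side token is the only place where the desired target language enters the model.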

3.3. Mixed data training

We train multilingual models with mixed training data from multiple languages. Thus, each mini-batch may contain utterances from different language pairs. We bucket all samples so that each mini-batch contains utterances with similar numbers of speech frames regardless of the language pair. As a result, we can use the same training scheme as in the conventional ASR and bilingual ST tasks.
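The mixed-data batching can be sketched as below (a simplified illustration, not the paper's actual data loader): pool the utterances of all language pairs, sort them by number of speech frames, and cut mini-batches so that each batch holds utterances of similar length with language pairs freely mixed.

```python
# Sketch of length-based bucketing over mixed language pairs.

def make_length_buckets(utterances, batch_size):
    """utterances: (utt_id, num_frames, lang_pair) tuples."""
    ordered = sorted(utterances, key=lambda u: u[1])  # sort by frame count
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

mixed_data = [
    ("es-en/utt1", 420, "Es-En"),
    ("en-fr/utt1", 130, "En-Fr"),
    ("en-de/utt1", 415, "En-De"),
    ("en-fr/utt2", 120, "En-Fr"),
]
batches = make_length_buckets(mixed_data, batch_size=2)
# Short utterances land together and long ones together,
# regardless of which language pair each came from.
```

Grouping by length keeps padding waste low, and because the grouping ignores the language pair, no per-language mini-batch scheduler is needed.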

4. DATA

We build our systems on three speech translation corpora: Fisher-CallHome Spanish, Librispeech, and the Speech-Translation TED (ST-TED) corpus. To the best of our knowledge, these are the only publicly available corpora recorded with a reasonable amount of real speech data⁶. The data statistics are summarized in Table 1.

4.1. Bilingual translation

(A) Fisher-CallHome Spanish: Es→En

This corpus contains about 170 hours of Spanish conversational telephone speech, the corresponding transcriptions, and the English translations⁷ [19]. Following [4, 19, 35], we report results on five evaluation sets: dev, dev2, and test in the Fisher corpus (with four references), and devtest and evltest in the CallHome corpus (with a single reference). We use Fisher/train as the training set and Fisher/dev as the validation set. All punctuation marks except apostrophes are removed during evaluation in the ST and MT tasks for comparison with previous work [4, 19].

(B) Librispeech: En→Fr

This corpus is a subset of the original Librispeech corpus [36] and contains 236 hours of English read speech, the corresponding transcriptions, and the French translations [20]. We use the clean 99 hours of speech data for the training set [5]. Translation references in the training set are augmented with Google Translate following [5], so we have two French references per utterance. We use the dev set as the validation set and report results on the test set.

⁶ We noticed a publicly available one-to-many multilingual ST corpus [34] right before submission. However, this dataset contains English speech only.

⁷ https://github.com/joshua-decoder/Fisher-CallHome-corpus

(C) Speech-Translation TED (ST-TED): En→De

This corpus contains 271 hours of English lecture speech, the corresponding transcriptions, and the German translations⁸. Since the original training set includes many noisy utterances due to low alignment quality, we adopt a data cleaning strategy. We first force-aligned all training utterances with the Gentle forced aligner⁹ based on Kaldi [37], then excluded every utterance in which not all words of the transcription were perfectly aligned with the corresponding audio signal [38]. This process reduced the training data from 171,121 to 137,660 utterances. We sampled two sets of 2k utterances from the cleaned training data as the validation and test sets, respectively (4k utterances in total). Note that all sets have no text overlap, are disjoint with respect to speakers, and the data splits are available in our code. We report results on this test set and on tst2013, one of the test sets provided in the IWSLT 2018 evaluation campaign. Since no human-annotated time alignments are provided for these test sets, we decided to sample the disjoint test set from the training data, for which alignment information is available.
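The keep-or-drop rule above can be sketched as a filter over the aligner's per-utterance output. The JSON field names here (`words`, `case`, `success`) follow our reading of Gentle's output format and should be treated as assumptions; the point is that an utterance survives only if every word of its transcription was aligned to the audio.

```python
import json

def is_fully_aligned(gentle_output: str) -> bool:
    """Keep an utterance only if every transcript word aligned (assumed
    Gentle-style JSON: {"words": [{"case": "success"}, ...]})."""
    words = json.loads(gentle_output).get("words", [])
    return bool(words) and all(w.get("case") == "success" for w in words)

kept = '{"words": [{"case": "success"}, {"case": "success"}]}'
dropped = '{"words": [{"case": "success"}, {"case": "not-found-in-audio"}]}'
assert is_fully_aligned(kept)
assert not is_fully_aligned(dropped)
```

Applied over the whole training set, a filter of this shape yields the 171,121 → 137,660 utterance reduction described above.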

4.2. Multilingual translation

We perform experiments in two scenarios: one-to-many (O2M) and many-to-many (M2M)¹⁰.

One-to-many (O2M)

For one-to-many (O2M) translation, speech utterances in a source language are translated into multiple target languages. We concatenate Librispeech (En→Fr) and ST-TED (En→De) and build models for En→{Fr, De} translation (see Table 1).

Many-to-many (M2M)

For many-to-many (M2M) translation, speech utterances in multiple source languages are translated into all target languages seen in training. We can regard this task as a more challenging optimization problem than O2M and M2O translation. We concatenate Librispeech (En→Fr) and Fisher-CallHome Spanish (Es→En), then build models for {En, Es}→{Fr, En} translation (M2Ma)¹¹. Other combinations, namely Fisher-CallHome Spanish and ST-TED ({En, Es}→{De, En}, M2Mb) and all three directions ({En, Es}→{Fr, De, En}, M2Mc), are also investigated.

⁸ https://sites.google.com/site/iwsltevaluation2018/Lectures-task
⁹ https://github.com/lowerquality/gentle

¹⁰ For the many-to-one (M2O) scenario, no suitable corpus combination exists among publicly available corpora, so we leave the exploration of this task for future work. However, O2M and M2M are the realistic scenarios for multilingual speech translation, as mentioned in Section 1.

¹¹ Readers might think that this scenario is not suitable for the M2M evaluation, since French does not appear on the source side as in the multilingual MT task [16]. However, such public corpora are not currently available.
