#5 Predicting Machine Translation Quality

Predicting machine translation quality

I am @bittlingmayer. My company is @SignalNLabs.

interests: translation quality, translation crowdsourcing, transliteration, browser translation integrations, topic classification, automatic source-side correction

previously @Google, @Adobe, @Cerner

Ciao!

Today’s topics

◉ Why translation quality?
◉ What is the problem?
◉ Our data model
◉ Our learning infra

Quality estimation?

sentence-level quality

good machine translation vs bad

Quality evaluation?

corpus-level quality given reference translations

machine translation vs human translation

Why quality?

Why is predicting quality useful?

Machine translation should not be a gamble.

$4.50 for 1M chars by machine

Optimisation Function

$10000 for 1M chars at 5¢/word by human
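The human-translation figure can be reproduced with simple arithmetic if one assumes an average of about five characters per word. That average is an assumption, not something stated on the slide, and the machine price is used as read from the slide. A minimal sketch:

```python
# Rough cost comparison for translating 1M characters.
# ASSUMPTION: about 5 characters per word on average (not from the slides).
CHARS = 1_000_000
CHARS_PER_WORD = 5           # assumed average
HUMAN_RATE_PER_WORD = 0.05   # 5 cents per word, from the slide
MACHINE_COST = 4.50          # per 1M chars, as read from the slide

words = CHARS / CHARS_PER_WORD
human_cost = words * HUMAN_RATE_PER_WORD
ratio = human_cost / MACHINE_COST

print(f"human: ${human_cost:,.0f}, machine: ${MACHINE_COST:.2f}, ratio: {ratio:,.0f}x")
```

Under these assumptions the gap is more than three orders of magnitude, which is why knowing when the machine output is good enough matters.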

Perfect Prediction == Perfect Translation

Reinforcement Learning

[Diagram: a loop between translator and predictor, labelled state, action [translations] and reward [scores, rankings]]
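One way to read "Perfect Prediction == Perfect Translation": if a predictor could score translations perfectly, picking the top-scored candidate would yield the best available translation. A minimal sketch of that selection step follows. `toy_predictor` is a hypothetical stand-in for illustration, not a real quality model:

```python
from typing import Callable, Sequence

def pick_best(source: str,
              candidates: Sequence[str],
              predictor: Callable[[str, str], float]) -> str:
    """Return the candidate translation with the highest predicted score."""
    return max(candidates, key=lambda target: predictor(source, target))

def toy_predictor(source: str, target: str) -> float:
    """Hypothetical scorer: an untranslated copy of the source scores 0.0."""
    return 0.0 if target.strip() == source.strip() else 0.5

source = "The car is driving."
candidates = ["The car is driving.", "Автомобиль вождения."]
best = pick_best(source, candidates, toy_predictor)
print(best)
```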

What’s the problem?

Is it really harder than self-driving cars?

Language is hard.

Context.

Data are dirty.

Bridging.

Payoff

What is solvable?

Effort

bad input

50% of errors

context/customisation

like a human

like Search, FB, Maps...

source-side ambiguity

ideally interactive

bad output

What is quality?

Can we quantify the quality of a translation?

Accuracy

What is sentence-level quality?

Fluency

Low Quality

Good Enough

Misleading

Human Quality

Recall vs Precision vs Accuracy

actual bad

predicted bad

Trivial 90% Accuracy Example

actual bad

predicted bad: 100%
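The trivial example can be worked through numerically. The sketch below assumes that 90% of the translations are actually bad, which is what the "90% accuracy" title implies but is not stated on its own line. A predictor that labels everything as bad then reaches 90% accuracy without learning anything:

```python
def confusion(actual, predicted, positive="bad"):
    """Count true/false positives and negatives for the given positive label."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

# ASSUMPTION: 90 of 100 translations are actually bad.
actual = ["bad"] * 90 + ["good"] * 10
predicted = ["bad"] * 100  # trivial predictor: everything is bad

tp, fp, fn, tn = confusion(actual, predicted)
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)       # of those predicted bad, how many are bad
recall = tp / (tp + fn)          # of the actual bad, how many were caught
good_recall = tn / (tn + fp)     # of the actual good, how many were kept

print(accuracy, precision, recall, good_recall)
```

Accuracy and recall look fine here, but not a single good translation is recognised, which is why accuracy alone is a weak benchmark.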

How does quality vary?

[Grid: source language (from English, from top languages, from other) against target language (to English, to top languages, to other)]

How does quality vary?

Wikipedia

news

dialogues, film subtitles, Coursera, Medium

“everyday” reviews, customer service

your children’s WhatsApp messages

my WhatsApp messages

Other concepts of quality?

How do we solve it?

With data and features

What is our data model?

(language pair) | source | target | score
en-zh | Hello | 您好 | 1.0
en-zh | The car is driving. | The car is driving. | 0.0
en-ru | The car is driving. | Автомобиль вождения. | 0.3
... | ... | ... | ...

What is our data model?

(language pair) | source | target | src_length_bytes | ... | trg_spam_prob | score
en-zh | Hello | 您好 | 5 | ... | 0.5 | 1.0
en-zh | The car is driving. | The car is driving. | 19 | ... | 0.2 | 0.0
en-ru | The car is driving. | Автомобиль вождения. | 19 | ... | 0.1 | 0.3
... | ... | ... | ... | ... | ... | ...
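A minimal sketch of this data model in code, for illustration only. The class name `Row`, the field `lang_pair` and the feature `trg_length_bytes` are hypothetical names; only `src_length_bytes` and the example sentences come from the slides:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Row:
    lang_pair: str                      # hypothetical name for the first column
    source: str
    target: str
    score: Optional[float] = None       # human score, None until labelled
    features: Dict[str, float] = field(default_factory=dict)

def add_length_features(row: Row) -> Row:
    """Attach byte-length features computed on the UTF-8 encoding."""
    row.features["src_length_bytes"] = len(row.source.encode("utf-8"))
    row.features["trg_length_bytes"] = len(row.target.encode("utf-8"))  # hypothetical feature
    return row

rows = [
    Row("en-zh", "Hello", "您好", 1.0),
    Row("en-zh", "The car is driving.", "The car is driving.", 0.0),
    Row("en-ru", "The car is driving.", "Автомобиль вождения.", 0.3),
]
rows = [add_length_features(r) for r in rows]

for r in rows:
    print(r.lang_pair, r.features["src_length_bytes"], r.score)
```

Keeping the language pair as an ordinary column is what lets one model stay language-agnostic: it is just another input feature.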

10-1000 features

signals engineered by us

1000-10M rows

sentences* hand-scored by linguists

language-agnostic

Language is just another feature.

Human scores

Evaluate many translations by hand

Human Evaluation Score Types

Labels

good/bad

0.0-1.0

multilabels

word-level labels

Ranking

rank multiple systems

Post-Edit

to comprehensible

to human quality

requires smaller dataset and budget

$0.001 / row @ 5x redundancy
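One plausible reading of "5x redundancy" is that each row is judged by five annotators and the judgments are combined into a single 0.0-1.0 score. A minimal sketch under that assumption follows. Averaging is an illustrative choice here, not necessarily the method actually used:

```python
from statistics import mean

def aggregate(labels):
    """Map good/bad judgments to 1.0/0.0 and average them into one score."""
    return mean(1.0 if label == "good" else 0.0 for label in labels)

# ASSUMPTION: five independent good/bad judgments for one row.
judgments = ["good", "good", "bad", "good", "bad"]
score = aggregate(judgments)
print(score)
```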

QuEst baseline features

quest.dcs.shef.ac.uk/quest_files/features_blackbox_baseline_17

◉ number of tokens in the source sentence
◉ number of tokens in the target sentence
◉ average source token length
◉ LM probability of source sentence
◉ LM probability of target sentence
◉ number of occurrences of the target word within the target hypothesis (averaged for all words in the hypothesis - type/token ratio)
◉ average number of translations per source word in the sentence (as given by IBM 1 table thresholded such that prob(t|s) > 0.2)
◉ average number of translations per source word in the sentence (as given by IBM 1 table thresholded such that prob(t|s) > 0.01) weighted by the inverse frequency of each word in the source corpus
◉ percentage of unigrams in quartile 1 of frequency (lower frequency words) in a corpus of the source language (SMT training corpus)
◉ percentage of unigrams in quartile 4 of frequency (higher frequency words) in a corpus of the source language
◉ percentage of bigrams in quartile 1 of frequency of source words in a corpus of the source language
◉ percentage of bigrams in quartile 4 of frequency of source words in a corpus of the source language
◉ percentage of trigrams in quartile 1 of frequency of source words in a corpus of the source language
◉ percentage of trigrams in quartile 4 of frequency of source words in a corpus of the source language
◉ percentage of unigrams in the source sentence seen in a corpus (SMT training corpus)
◉ number of punctuation marks in the source sentence
◉ number of punctuation marks in the target sentence

number of tokens

length

LM probability

number of occurrences of the target word within the target hypothesis

average number of translations per source word in the sentence

…

percentage of unigrams in quartile 1 of frequency (lower frequency words)

… percentage of unigrams in quartile n of frequency (higher frequency words)

…

percentage of trigrams in quartile 1 of frequency of source words

… percentage of trigrams in quartile n of frequency of source words

number of punctuation marks
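A few of the simpler features above need no language model or corpus and can be sketched directly. This is not the QuEst implementation: whitespace tokenisation, ASCII-only punctuation and the feature names are simplifying assumptions, and the type/token ratio is only a close relative of the occurrence-count feature:

```python
import string

def tokens(text):
    # ASSUMPTION: whitespace tokenisation; a real pipeline would use a proper tokeniser.
    return text.split()

def count_punct(text):
    # ASCII punctuation only, as an illustrative simplification.
    return sum(ch in string.punctuation for ch in text)

def simple_features(source, target):
    """Compute a handful of surface features for one source/target pair."""
    src, trg = tokens(source), tokens(target)
    return {
        "src_tokens": len(src),
        "trg_tokens": len(trg),
        "avg_src_token_length": sum(map(len, src)) / len(src) if src else 0.0,
        "trg_type_token_ratio": len(set(trg)) / len(trg) if trg else 0.0,
        "src_punct": count_punct(source),
        "trg_punct": count_punct(target),
    }

feats = simple_features("The car is driving.", "Автомобиль вождения.")
for name, value in feats.items():
    print(name, value)
```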

bad input signals

vot tak narod ho4et napisat'

Возможно, вы имели в виду: вот так народ хочет написать ("Did you mean: вот так народ хочет написать")

https://www.google.ru/search?newwindow=1&biw=1381&bih=760&q=%D0%B2%D0%BE%D1%82+%D1%82%D0%B0%D0%BA+%D0%BD%D0%B0%D1%80%D0%BE%D0%B4+%D1%85%D0%BE%D1%87%D0%B5%D1%82+%D0%BD%D0%B0%D0%BF%D0%B8%D1%81%D0%B0%D1%82%D1%8C%27&spell=1&sa=X&ved=0ahUKEwjV8Myr__TKAhWFKJoKHdG2A7IQBQgYKAA







human | vot tak narod ho4et napisat' | vot tak narod ho4et napisat'
search | вот так народ хочет написать | That's how people want to write
translation | Вот так народ хочет написать. | So people want to write.
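The transliterated example suggests two cheap bad-input signals: how much of the text is in the script expected for the source language, and whether digits stand in for letters inside words (as in "ho4et"). A minimal sketch follows. Both function names are hypothetical, and the signals are illustrations rather than the features actually used:

```python
import re
import unicodedata

def script_share(text, script="CYRILLIC"):
    """Fraction of letters whose Unicode name starts with the given script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(unicodedata.name(ch, "").startswith(script) for ch in letters)
    return hits / len(letters)

# A digit with a letter on both sides, e.g. the "4" in "ho4et".
DIGIT_IN_WORD = re.compile(r"[^\W\d_]\d[^\W\d_]")

def has_digit_in_word(text):
    return bool(DIGIT_IN_WORD.search(text))

translit = "vot tak narod ho4et napisat'"
cyrillic = "вот так народ хочет написать"

print(script_share(translit), has_digit_in_word(translit))
print(script_share(cyrillic), has_digit_in_word(cyrillic))
```

For a source declared as Russian, a low Cyrillic share is a strong hint that the input should be corrected before it is translated.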

bad output signals

ambiguity signals

translation signals

(source) | Google | Microsoft | Wiktionary | ...
Merry Christmas | Krismasi! | Krismasi Njema! | heri ya Krismasi; Krismasi njema | ...
... | ... | ... | ... | ...
eat apples | kula mapera | kula apples | ∅ | ...
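One way such a table could be turned into a signal is to measure how much the systems agree with each other. The sketch below uses character-level similarity from Python's `difflib` as the measure. That choice, and the function names, are assumptions for illustration, not the method described in the talk:

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalise(text):
    """Lowercase and drop everything except letters, digits and spaces."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def agreement(outputs):
    """Mean pairwise similarity in [0, 1]; None if fewer than two outputs."""
    texts = [normalise(o) for o in outputs if o]
    if len(texts) < 2:
        return None
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in combinations(texts, 2)]
    return sum(ratios) / len(ratios)

# Outputs taken from the table; None marks a missing entry (∅).
christmas = agreement(["Krismasi!", "Krismasi Njema!", "Krismasi njema"])
apples = agreement(["kula mapera", "kula apples", None])

print(christmas, apples)
```

High agreement does not prove a translation is right, but disagreement between independent sources is a cheap warning sign.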

lexical signals

sygnały leksykalne

char signals

sygnały znaków

syntactic signals

parse tree to sequence conversion

sequence to sequence learning

cross-lingual signals

outside signals

context/customisation signals

Other signals?

50-99+% accuracy

Depends on the benchmark! ;-)

1000-10M rows

10-1000 features

Data augmentation?

Can we use parallel corpora?

target

Onartutako gertaerak Aholkuak eta iradokizunak Etorkizuneko egitasmoei buruz galdetzea onespena eskatzea Laguntza eskatzea Jende galdetzea itxaron Norbait iritzia eskatzea Etorkizunari Garrantzia emanez informazio saihestea Bad pertsona … … ... Aditu batek ingelesez izatea Being Lucky zaharra izatea pobrea izatea ari irekietan aberatsa izatea Ziur izatea / zenbait ari kezkaturik Aspergarria! Your Mind aldatzeak Pertsonak txaloak Up Hipokresia kexu

source

받아 들여지는 사실 조언 및 제안 향후 계획에 대해 물어 승인 요청 도움을 요청 사람을 요구하는 대기 누군가의 의견을 물어 미래에 대한 태도 제공 정보 방지 나쁜 사람들 … … ... 영어 전문가 인 존재 럭키 오래 되 가난 안심되는 부자가되는 확인 인 / 특정 걱정되는 지루한! 당신의 마음을 변경 사람을 응원합니다 위선에 대해 불평

score

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 … … ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
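The table above treats every pair from a parallel corpus as a row with score 1.0. A minimal sketch of that conversion follows. The function name and the row layout are assumptions for illustration, and whether such rows are clean enough to learn from is the open question this slide poses:

```python
def rows_from_parallel(pairs, lang_pair, score=1.0):
    """Turn (source, target) pairs into scored rows, skipping empty sides."""
    return [
        {"lang_pair": lang_pair, "source": s.strip(), "target": t.strip(), "score": score}
        for s, t in pairs
        if s.strip() and t.strip()
    ]

# Toy input: one usable pair and one with a missing source side.
pairs = [("Hello", "您好"), ("", "orphan line")]
rows = rows_from_parallel(pairs, "en-zh")
print(rows)
```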

What is our learning infra?

H2O.ai deeplearning

Do we need deep learning?

Why doesn’t deep learning work for translation?

Want to learn more?

The real experts

◉ Dr. Lucia Specia
◉ quest.dcs.shef.ac.uk
◉ statmt.org/wmt15/quality-estimation-task.html

ACL 2016 will be held in Berlin in August.

Reading

Any questions?

You can find me at

◉ @bittlingmayer
◉ [email protected]

Thanks!
