Machine Translation Introduction to MT

Page 1: Machine Translation

Machine Translation

Introduction to MT

Page 2: Machine Translation

Dan Jurafsky

Machine Translation

• Fully automatic
• Helping human translators

Enter Source Text: 这 不过 是 一 个 时间 的 问题 .

Translation from Stanford’s Phrasal: This is only a matter of time.

Page 4: Machine Translation

Machine Translation

• The Story of the Stone (“The Dream of the Red Chamber”)
• Cao Xueqin 1792

• Chinese gloss: Dai-yu alone at bed on think-of-with-gratitude Bao-chai… again listen to window outside bamboo tip plantain leaf of on, rain sound sigh drop, clear cold penetrate curtain, not feeling again fall down tears come.

• Hawkes translation: As she lay there alone, Dai-yu’s thoughts turned to Bao-chai… Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.

Page 5: Machine Translation

Difficulties in Chinese to English translation

• Long Chinese sentences: 4 English sentences to 1 Chinese
• Chinese omits pronouns and has no articles (English the, a)
• Chinese has locative post-positions, English prepositions
  • Chinese bed on, window outside; English on the bed, outside the window
• Chinese rarely marks tense:
  • English as, turned to, had begun
  • Chinese tou, ‘penetrate’ -> English penetrated
• Chinese relative clauses are before the noun, English after
  • Chinese: [window outside bamboo on] rain
  • English: rain [on the bamboo outside the window]
• Stylistic and cultural differences
  • Chinese bamboo tip plantain leaf -> bamboos and plantains
  • Chinese rain sound sigh drop -> insistent rustle of the rain
  • Chinese ma ‘curtain’ -> curtains of her bed

Page 6: Machine Translation

Alignment in Machine Translation

Page 7: Machine Translation

Early MT History

1946 Booth and Weaver discuss MT in New York
1947-48 idea of dictionary-based direct translation
1947 Warren Weaver suggests translation by computer
1949 Weaver memorandum
1952 all 18 MT researchers in world meet at MIT
1954 IBM/Georgetown Demo Russian-English MT
1955-65 lots of labs take up MT

http://www.hutchinsweb.me.uk/PPF-TOC.htm

Page 8: Machine Translation

1949 Weaver memorandum

• http://www.mt-archive.info/Weaver-1949.pdf

• “There are certain invariant properties which are… common to all languages”

• ‘When I look at an article in Russian, I say “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”’

• “[If] one can see… N words on either side, then, if N is large enough, one can unambiguously decide the meaning of the central word.”
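Weaver's window idea can be made concrete with a small sketch. The function name `context_window`, the whitespace tokenization, and the example sentence are assumptions for illustration and do not come from the memorandum.

```python
def context_window(tokens, index, n):
    """Return up to n tokens on either side of tokens[index]."""
    left = tokens[max(0, index - n):index]    # clipped at the sentence start
    right = tokens[index + 1:index + 1 + n]   # clipped at the sentence end
    return left, right

# Illustrative sentence (not from the slides): which sense of "bank"?
tokens = "he sat on the bank of the river".split()
left, right = context_window(tokens, tokens.index("bank"), 2)
print(left, right)
```

With N = 2 the window misses "river"; a larger N brings it in, which is the intuition behind "if N is large enough".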

Page 9: Machine Translation

The History of MT: Pessimism

• 1959/1960
  • Yehoshua Bar-Hillel “Report on the state of MT in US and GB”
  • FAHQ (fully automatic high quality) MT too hard because we would have to encode all of human knowledge
  • Instead we should work on computer tools for human translators

Page 10: Machine Translation

The claim that fully automatic high quality MT is impossible

Yehoshua Bar-Hillel. 1960. A Demonstration of the Nonfeasibility of Fully Automatic High Quality Translation.

• Little John was looking for his toy box. Finally he found it. The box was in the pen. John was very happy.

• Pen1: Enclosure for small children (the sense intended here)
• Pen2: Writing utensil

Page 11: Machine Translation

• The box was in the pen.

Page 12: Machine Translation

The claim that fully automatic high quality MT is impossible

Yehoshua Bar-Hillel, 1960

“I now claim that no existing or imaginable program will enable an electronic computer to determine…”

Page 13: Machine Translation

The state of the art in MT

Page 14: Machine Translation

The state of the art in MT

Page 15: Machine Translation

History of MT: Further Pessimism

The ALPAC report

• Headed by John R. Pierce of Bell Labs
• Conclusions:
  • MT doesn’t work
  • MT a failure: all current MT work had to be post-edited
  • Intelligibility and informativeness worse than human
  • We don’t need MT anyhow
  • Already too many human translators from Russian
• Results: MT research suffered
  • Funding loss
  • Number of research labs declined
  • Association for Machine Translation and Computational Linguistics dropped MT from its name

Page 16: Machine Translation

MT in the modern age

• 1975-1985 Resurgence of MT in Europe and Japan
  • Domain-specific rule-based systems
• 1990-present
  • Rise of Statistical Machine Translation

Page 17: Machine Translation

Machine Translation

Introduction to MT

Page 18: Machine Translation

Machine Translation

Language Divergences

Page 19: Machine Translation

Language Similarities and Divergences

• Typology:
  • the study of systematic cross-linguistic similarities and differences
  • What are the dimensions along which human languages vary?

Page 20: Machine Translation

Syntactic Variation: Basic Word Orders

• SVO (Subject-Verb-Object) languages
  English, German, French, Mandarin
  I baked a pizza

• SOV languages
  Japanese, Hindi
  English: He adores listening to music
  Japanese: kare ha ongaku wo kiku no ga daisuki desu
  Gloss: he music to listening adores

• VSO languages
  Irish, Classical Arabic, Tagalog

In many languages one word order is more basic

Page 21: Machine Translation

Morphology

• Morpheme: “Minimal meaningful unit of language”
  Word = Morpheme + Morpheme + Morpheme +…
• Stems: (base form, root) hope+ing -> hoping, hop -> hopping
• Affixes
  • Prefixes: Antidisestablishmentarianism (anti-, dis-)
  • Suffixes: Antidisestablishmentarianism (-ment, -arian, -ism)
  • Infixes: hingi (borrow) – humingi (borrower) in Tagalog
  • Circumfixes: sagen (say) – gesagt (said) in German

Page 22: Machine Translation

Morphemes per Word

[Figure: languages placed on a morphemes-per-word axis running from isolating (near 1) to synthetic (toward 4)]

• Vietnamese: 1.06
• English: 1.68
• Yakut (Turkic): 2.17
• Swahili: 2.55
• West Greenlandic (Eskimo-Inuit): 3.72

Joseph Greenberg. 1954. A Quantitative Approach to the Morphological Typology of Language. IJAL 26:3.

Page 23: Machine Translation

Few morphemes per word: Cantonese

“He said this was the biggest building in the whole country”

Each word in this sentence has one morpheme (and one syllable):

keui wa chyuhn gwok jeui daaih gaan nguk haih li gaan
he say entire country most big bldg house is this bldg

Page 24: Machine Translation

Many Morphemes per word: Turkish

uygarlaştıramadıklarımızdanmışsınızcasına
uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
Behaving as if you are among those whom we could not cause to become civilized
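The morphemes-per-word measure can be computed directly from segmented text. The sketch below assumes morphemes within a word are joined by "+", as in the Turkish segmentation above; the function name `morphemes_per_word` is hypothetical.

```python
def morphemes_per_word(segmented_words):
    """Average number of '+'-separated morphemes per word."""
    counts = [len(word.split("+")) for word in segmented_words]
    return sum(counts) / len(counts)

# The Turkish word from this slide, already segmented into morphemes.
turkish = ["uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına"]
# The Cantonese sentence from the previous slide: one morpheme per word.
cantonese = "keui wa chyuhn gwok jeui daaih gaan nguk haih li gaan".split()

print(morphemes_per_word(turkish))    # 11.0
print(morphemes_per_word(cantonese))  # 1.0
```

A single word or sentence is of course not a corpus estimate; the figures on the morphemes-per-word slide are averages over running text.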

Page 25: Machine Translation

Word Segmentation

Are word boundaries marked in writing?

• Some writing systems: boundaries between words not marked
  • Chinese, Japanese, Thai
  • Word segmentation becomes an important part of text normalization for MT
• Some languages tend to have sentences that are quite long, closer to English paragraphs than sentences:
  • Modern Standard Arabic, Chinese
  • Sentence segmentation may be necessary for MT between these languages and languages like English
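The slides do not name a segmentation method. One simple baseline is greedy maximum matching, sketched here under the assumption of a tiny hand-made lexicon; the function name `max_match` and the lexicon contents are illustrative only.

```python
def max_match(text, lexicon):
    """Greedy left-to-right segmentation: take the longest lexicon entry each time."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy lexicon covering the example sentence from the opening slide.
lexicon = {"这", "不过", "是", "一", "个", "时间", "的", "问题"}
print(" ".join(max_match("这不过是一个时间的问题", lexicon)))
```

On this input the sketch recovers the spacing shown in the source text of the opening example. Greedy matching can go wrong when a longer dictionary entry overlaps the correct boundary, which is why practical segmenters are usually statistical.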

Page 26: Machine Translation

Inferential Load: cold vs. hot languages

• Hot languages:
  • Who did what to whom is marked explicitly
  • English
• Cold languages:
  • The hearer has more “figuring out” of who the various actors in the various events are
  • Japanese, Chinese

Balthasar Bickel. 2003. Referential density in discourse and syntactic typology. Language 79:2, 708-36

Page 27: Machine Translation

Inferential Load: The blue noun phrases (bracketed here) are not in the Chinese original

飓风丽塔已经减弱为第三级飓风,
Rita weakened and was downgraded to a Category 3 storm;
ø 迫近美国德克萨斯州和路易斯安那州,
[Rita/it/the storm] is moving close to Texas and Louisiana;
当局表示,
the authorities announced;
虽然 ø 在登陆前可能再稍微减弱,
although [Rita/it/the storm] might weaken again before landing,
但 ø 仍然会非常危险,
[Rita/it/the storm] is still very dangerous;
ø 预料 ø 会在当地时间星期六凌晨在德州和路易斯安那州之间登陆,
[the authorities] predict [Rita/it/the storm] will arrive at the Texas-Louisiana border on Saturday morning local time;
ø 直接吹袭休斯敦市东面的主要炼油设施。
[Rita/it/the storm] will directly hit the oil-refining industry east of Houston.

Page 28: Machine Translation

Lexical Divergences

• Word to phrases:
  • English computer science
  • French informatique
• Part of Speech divergences
  • English She likes to sing
  • German Sie singt gerne [She sings likefully]
  • English I’m hungry
  • Spanish Tengo hambre [I have hunger]

Page 29: Machine Translation

Lexical Specificity Divergences

• Grammatical specificity
  • Spanish: plural pronouns have gender (ellos/ellas)
  • English: plural pronouns no gender (they)
• So translating “they” from English to Spanish, need to figure out gender of the referent!

Page 30: Machine Translation

Lexical Divergences: Semantic Specificity

English brother
Mandarin gege (older brother), didi (younger brother)

English wall
German Wand (inside), Mauer (outside)

English fish
Spanish pez (the creature), pescado (fish as food)

Cantonese ngau
English cow, beef

Page 31: Machine Translation

Predicate Argument divergences

• English: The bottle floated out.
• Spanish: La botella salió flotando. [The bottle exited floating]

• Satellite-framed languages:
  • direction of motion is marked on the satellite
  • Crawl out, float off, jump down, walk over to, run after
  • Most of Indo-European, Hungarian, Finnish, Chinese

• Verb-framed languages:
  • direction of motion is marked on the verb
  • Spanish, French, Arabic, Hebrew, Japanese, Tamil, Polynesian, Mayan, Bantu families

L. Talmy. 1985. Lexicalization patterns: Semantic Structure in Lexical Form.

Page 32: Machine Translation

Predicate Argument divergences: Heads and Argument swapping

Heads:
English: X swim across Y
Spanish: X cruzar Y nadando

English: I like to eat
German: Ich esse gern

English: I’d prefer vanilla
German: Mir wäre Vanille lieber

Arguments:
Spanish: Y me gusta
English: I like Y

German: Der Termin fällt mir ein
English: I forget the date

Dorr, Bonnie J., "Machine Translation Divergences: A Formal Description and Proposed Solution," Computational Linguistics, 20:4, 597--633

Page 33: Machine Translation

Predicate-Argument Divergence Counts

Found divergences in 32% of sentences in UN Spanish/English Corpus

• Part of Speech: X tener hambre / Y have hunger (98%)
• Phrase/Light verb: X dar puñaladas a Z / X stab Z (83%)
• Structural: X entrar en Y / X enter Y (35%)
• Heads swap: X cruzar Y nadando / X swim across Y (8%)
• Arguments swap: X gustar a Y / Y likes X (6%)

B.Dorr et al. 2002. DUSTer: A Method for Unraveling Cross-Language Divergences for Statistical Word-Level Alignment

Page 34: Machine Translation

Machine Translation

Language Divergences

Page 35: Machine Translation

Machine Translation

Three classical methods for MT

Page 36: Machine Translation

3 Classical methods for MT

• Direct
• Transfer
• Interlingua

Page 37: Machine Translation

Three MT Approaches: Direct, Transfer, Interlingual

Page 38: Machine Translation

Direct Translation

• Proceed word-by-word through text
• Translating each word
• No intermediate structures except morphology
• Knowledge is in the form of
  • Huge bilingual dictionary
  • word-to-word translation information
• After word translation, can do simple reordering
  • Adjective ordering English -> French/Spanish
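The two stages of direct translation, word-by-word lookup followed by simple local reordering, can be sketched as follows. The three-entry English-to-Spanish lexicon, its part-of-speech tags, and the function name `direct_translate` are assumptions made for illustration; a real direct system would rely on a very large dictionary and would also handle morphology and agreement.

```python
# Toy English-to-Spanish lexicon: word -> (translation, part of speech).
LEXICON = {
    "the": ("la", "DET"),
    "green": ("verde", "ADJ"),
    "witch": ("bruja", "NOUN"),
}

def direct_translate(sentence):
    # Step 1: translate word by word; unknown words pass through unchanged.
    tagged = [LEXICON.get(w, (w, "UNK")) for w in sentence.lower().split()]
    # Step 2: simple local reordering, ADJ NOUN -> NOUN ADJ.
    i = 0
    while i < len(tagged) - 1:
        if tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN":
            tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
            i += 2  # skip past the pair just swapped
        else:
            i += 1
    return " ".join(word for word, _ in tagged)

print(direct_translate("the green witch"))  # la bruja verde
```

No parse tree is built at any point, which is what separates this approach from the transfer model.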

Page 39: Machine Translation

Direct MT Dictionary entry

Page 40: Machine Translation

Direct MT

Page 41: Machine Translation

Problems with direct MT

• German

• Chinese

Page 42: Machine Translation

The Transfer Model

• Idea: apply contrastive knowledge, i.e., knowledge about the difference between two languages

• Steps:
  • Analysis: Syntactically parse source language
  • Transfer: Rules to turn this parse into parse for target language
  • Generation: Generate target sentence from parse tree
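The transfer and generation steps can be sketched on a hand-built parse tree. Everything here is an assumption for illustration: the tuple-based tree encoding, the single reordering rule, the toy lexicon, and the function names `transfer` and `generate`. The analysis step is taken as already done.

```python
# A tree node is (label, children) for a phrase or (tag, word) for a leaf.
def transfer(tree):
    """Structural transfer: rewrite NP -> ADJ NOUN as NP -> NOUN ADJ, recursively."""
    label, body = tree
    if isinstance(body, str):  # leaf: nothing to reorder
        return tree
    children = [transfer(child) for child in body]
    if label == "NP":
        for i in range(len(children) - 1):
            if children[i][0] == "ADJ" and children[i + 1][0] == "NOUN":
                children[i], children[i + 1] = children[i + 1], children[i]
                break
    return (label, children)

def generate(tree, lexicon):
    """Generation: read off the leaves, translating each word with the lexicon."""
    label, body = tree
    if isinstance(body, str):
        return [lexicon.get(body, body)]
    return [word for child in body for word in generate(child, lexicon)]

# Analysis is assumed to have produced this parse of "the green witch".
parse = ("NP", [("DET", "the"), ("ADJ", "green"), ("NOUN", "witch")])
lexicon = {"the": "la", "green": "verde", "witch": "bruja"}
print(" ".join(generate(transfer(parse), lexicon)))  # la bruja verde
```

Because the rule is stated over tree labels rather than adjacent words, it applies wherever a matching noun phrase occurs in a larger parse.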

Page 43: Machine Translation

English to French

English: Adjective Noun
French: Noun Adjective

• This is not always true
  Route mauvaise ‘bad road, badly-paved road’
  Mauvaise route ‘wrong road’
• But is a reasonable first approximation
• Rule:

Page 44: Machine Translation

Transfer rules

Page 45: Machine Translation

Transferring the green witch….

Page 46: Machine Translation

Interlingua

• Instead of N² sets of transfer rules
• Use meaning as a representation language
  1. Parse source sentence into meaning representation
  2. Generate target sentence from meaning.
• Intuition: Use other NLP applications to do MT work
  • English book to Spanish: libro or reservar
  • Disambiguate book into concepts BOOKVOLUME and RESERVE
• Need 2N systems (a parser and generator for each language)
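The scaling argument can be checked with a few lines of arithmetic. The sketch counts one transfer rule set per ordered language pair, which gives N(N-1), a quantity that grows on the order of the N² quoted on the slide; the function names are hypothetical.

```python
def transfer_rule_sets(n):
    """One rule set per ordered (source, target) pair among n languages."""
    return n * (n - 1)

def interlingua_components(n):
    """One parser and one generator per language."""
    return 2 * n

for n in (3, 10, 20):
    print(n, transfer_rule_sets(n), interlingua_components(n))
```

For 10 languages this comes to 90 rule sets against 20 interlingua components, and the gap widens as languages are added.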

Page 47: Machine Translation

Interlingua for Mary did not slap the green witch

Page 48: Machine Translation

Machine Translation

Three classical methods for MT