
Page 1:

CS460/626 : Natural Language Processing/Speech, NLP and the Web

(Lecture 18– Alignment in SMT and Tutorial on Giza++ and Moses)

Pushpak Bhattacharyya, CSE Dept., IIT Bombay

15th Feb, 2011

Page 2:

Going forward from word alignment

Word alignment → Phrase alignment (going to bigger units of correspondence) → Decoding (best possible translation)

Page 3:

Abstract Problem

Given: $e_0 e_1 e_2 e_3 \ldots e_n e_{n+1}$ (Entities)

Goal: $l_0 l_1 l_2 l_3 \ldots l_n l_{n+1}$ (Labels)

The goal is to find the best possible label sequence.

Generative Model

$$L^* = \operatorname{argmax}_L P(L \mid E)$$

$$\operatorname{argmax}_L P(L \mid E) = \operatorname{argmax}_L P(L) \cdot P(E \mid L)$$
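The second equation follows from Bayes' rule; one step is left implicit on the slide:

$$P(L \mid E) = \frac{P(L)\, P(E \mid L)}{P(E)}$$

Since $P(E)$ does not depend on $L$, the denominator can be dropped inside the argmax.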

Page 4:

Simplification

Using the Markov assumption, the language model can be represented using bigrams.

Similarly, the translation model can also be represented in the following way:

$$P(L) = \prod_{i=1}^{n} P(l_i \mid l_{i-1})$$

$$P(E \mid L) = \prod_{i=0}^{n} P(e_i \mid l_i)$$
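A minimal sketch of how the two factors combine to score one candidate label sequence; the probability tables `bigram` and `lexicon` are hypothetical, not from the lecture:

def score(entities, labels, bigram, lexicon, start="^"):
    # P(L) * P(E|L): bigram[(prev, cur)] ~ P(cur | prev),
    # lexicon[(e, l)] ~ P(e | l); unseen events get a small floor.
    p, prev = 1.0, start
    for e, l in zip(entities, labels):
        p *= bigram.get((prev, l), 1e-9)   # language model term P(l_i | l_{i-1})
        p *= lexicon.get((e, l), 1e-9)     # translation model term P(e_i | l_i)
        prev = l
    return p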

Page 5:

Statistical Machine Translation

Finding the best possible English sentence given the foreign sentence

P(E) = Language Model
P(F|E) = Translation Model
E: English, F: Foreign Language

$$E^* = \operatorname{argmax}_E P(E \mid F) = \operatorname{argmax}_E P(E) \cdot P(F \mid E)$$

Page 6:

Problems in the framework

Labels are words of the target language
Very large in number

Who do you want to_go with?
With whom do you want to go?
आप किस के_साथ जाना चाहते_हो (Aap kis ke_sath jaana chahate_ho)

who → who
do → do
you → you
want → want
to_go → to_go
with → with
... and so on

Each word has multiple translation options.

Preposition Stranding

Page 7:

Columns of candidate target-language words are placed over the source-language words:

^ Aap kis ke_sath jaana chahate_ho .

Each column holds the translation options (who, do, you, want, to_go, with, ... and so on), with '^' and '.' marking the start and end of the sentence.

Find the best possible path from '^' to '.' using transition and observation probabilities.

Viterbi can be used
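A minimal Viterbi sketch over this trellis, assuming hypothetical `transition`, `emission`, and `candidates` tables; this is the textbook algorithm, not the Moses decoder:

# Textbook Viterbi over the translation trellis sketched above.
# States are candidate target words; transition[(prev, cur)] ~ P(cur | prev)
# and emission[(src, tgt)] ~ P(src | tgt). Unseen events get a small floor.
def viterbi(source, candidates, transition, emission):
    # Each trellis layer maps a state to (best probability, back-pointer).
    layers = [{"^": (1.0, None)}]
    for src in source:
        layer = {}
        for tgt in candidates[src]:  # assume every source word has options
            prob, prev = max(
                (p * transition.get((s, tgt), 1e-9), s)
                for s, (p, _) in layers[-1].items()
            )
            layer[tgt] = (prob * emission.get((src, tgt), 1e-9), prev)
        layers.append(layer)
    # Transition into the final '.' state, then follow back-pointers.
    prob, prev = max(
        (p * transition.get((s, "."), 1e-9), s)
        for s, (p, _) in layers[-1].items()
    )
    path = []
    for layer in reversed(layers[1:]):
        path.append(prev)
        prev = layer[prev][1]
    return list(reversed(path)), prob

Each layer keeps one best predecessor per candidate word, so the search is linear in sentence length rather than exponential in it.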

Page 8:

TUTORIAL ON Giza++ and Moses tools (delivered by Kushal Ladha)

Page 9:

Word-based alignment

For each word in the source language, align the words from the target language that this word possibly produces.

Based on IBM models 1-5
Model 1 is the simplest
As we go from model 1 to model 5, the models get more complex but more realistic
This is all that Giza++ does

Page 10:

Alignment

A function from target position to source position:


The alignment sequence is: 2, 3, 4, 5, 6, 6, 6
Alignment function A: A(1) = 2, A(2) = 3, ...
A different alignment function would give the sequence 1, 2, 1, 2, 3, 4, 3, 4 for A(1), A(2), ...

To allow spurious insertion, allow alignment with word 0 (NULL). Number of possible alignments: $(I+1)^J$.
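A small sketch of this representation in code; the sentence pair and alignment values are hypothetical:

# Alignment as a function from target position j (1-based) to a source
# position a_j in 0..I, where index 0 is the NULL word (spurious insertion).
source = ["NULL", "Aap", "kis", "ke_sath", "jaana", "chahate_ho"]  # I = 5
target = ["who", "do", "you", "want", "to_go", "with"]             # J = 6
a = [2, 0, 1, 5, 4, 3]   # hypothetical: a[j-1] is the source index for word j

I = len(source) - 1
J = len(target)
print((I + 1) ** J)      # number of possible alignments: (I+1)^J = 46656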

Page 11:

IBM Model 1: Generative Process

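The slide itself carries a figure of the generative story; for reference, the standard Model 1 formulation (not copied from the slide) for a foreign sentence of length J, an English sentence $e_0 \ldots e_I$ with $e_0 = \text{NULL}$, and an alignment $a_1 \ldots a_J$ is:

$$P(F, A \mid E) = \frac{\epsilon}{(I+1)^J} \prod_{j=1}^{J} t(f_j \mid e_{a_j})$$

and summing over all alignments gives

$$P(F \mid E) = \frac{\epsilon}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j \mid e_i)$$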

Page 12:

Training Alignment Models


Given a parallel corpus, for each sentence pair (F, E), learn the best alignment A and the component probabilities:
t(f|e) for Model 1
lexicon probability P(f|e) and alignment probability P(a_i | a_{i-1}, I) for the higher models

How to compute these probabilities if all you have is a parallel corpus?

Page 13:

Intuition: Interdependence of Probabilities


If you knew which words are probable translations of each other, then you could guess which alignments are probable and which are improbable.

If you were given alignments with probabilities, then you could compute the translation probabilities.

Looks like a chicken-and-egg problem.

The EM algorithm comes to the rescue.
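A minimal EM sketch for the Model 1 lexical probabilities t(f|e); this is the standard algorithm, and the two-sentence toy corpus is made up:

# EM for IBM Model 1: alternate between collecting expected word-pair
# counts under the current t(f|e) and re-estimating t(f|e) from them.
from collections import defaultdict

corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}  # uniform init

for _ in range(10):
    count = defaultdict(float)   # expected count of (f, e) co-occurrences
    total = defaultdict(float)   # expected count of e
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)   # normalise over this sentence pair
            for e in es:
                c = t[(f, e)] / z            # E-step: fractional count
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():          # M-step: re-normalise
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 3))  # converges towards 1.0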

Page 14:

Limitation: Only 1->Many Alignments allowed


Page 15:

Phrase-based alignment

More natural

Many-to-one mappings allowed
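A compact sketch of how consistent phrase pairs are typically extracted from a word alignment in phrase-based SMT; the toy alignment is hypothetical, and the full algorithm's extension over unaligned boundary words is omitted:

# Sketch: extract phrase pairs consistent with a word alignment.
# alignment is a set of (i, j) links over 0-based source/target positions;
# a pair is consistent if no alignment link crosses its boundaries.
def extract_phrases(n_src, alignment, max_len=4):
    pairs = []
    for i1 in range(n_src):
        for i2 in range(i1, min(i1 + max_len, n_src)):
            # target positions linked to the source span [i1, i2]
            js = [j for (i, j) in alignment if i1 <= i <= i2]
            if not js:
                continue
            j1, j2 = min(js), max(js)
            if j2 - j1 + 1 > max_len:
                continue
            # no link from the target span may point outside the source span
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append(((i1, i2), (j1, j2)))
    return pairs

# toy 3-word alignment: source positions 0, 1, 2 linked to target 0, 2, 1
links = {(0, 0), (1, 2), (2, 1)}
print(extract_phrases(3, links))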

Page 16:

Giza++ and Moses Package

http://cl.naist.jp/~eric-n/ubuntu-nlp/
Select your Ubuntu version
Browse the nlp folder
Download the Debian packages of giza++, moses, mkcls, srilm
Resolve all the dependencies and the packages get installed
For alternate installation, refer to http://www.statmt.org/moses_steps.html

Page 17:

Steps

Input: sentence-aligned parallel corpus
Output: target-side tagged data

Training
Tuning
Generate output on test corpus (decoding)

Page 18:

Training

Create a folder named corpus containing the test, train and tuning files.
Giza++ is used to generate the alignment.
The phrase table is generated after training.
Before training, a language model needs to be built on the target side:

mkdir lm;
/usr/bin/ngram-count -order 3 -interpolate -kndiscount -text $PWD/corpus/train_surface.hi -lm lm/train.lm;
/usr/share/moses/scripts/training/train-factored-phrase-model.perl -scripts-root-dir /usr/share/moses/scripts -root-dir . -corpus train.clean -e hi -f en -lm 0:3:$PWD/lm/train.lm:0;
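The training script writes its model files, including the phrase table and a moses.ini configuration, under the working directory; that model/moses.ini is what the tuning and testing commands below point at.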

Page 19:

Example

train.en:
h e l l o
h e l l o
w o r l d
c o m p o u n d w o r d
h y p h e n a t e d
o n e
b o o m
k w e e z l e b o t t e r

train.pr:
hh eh l ow
hh ah l ow
w er l d
k aa m p aw n d w er d
hh ay f ah n ey t ih d
ow eh n iy
b uw m
k w iy z l ah b aa t ah r
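This toy corpus appears to cast letter-to-phoneme conversion as translation: the source side spells each English word character by character, and the target side gives its phoneme sequence, so every character or phoneme is treated as a token.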

Page 20:

Sample from Phrase-table

b o ||| b aa ||| (0) (1) ||| (0) (1) ||| 1 0.666667 1 0.181818 2.718

b ||| b ||| (0) ||| (0) ||| 1 1 1 1 2.718
c o m p o ||| aa m p ||| (2) (0,1) (1) (0) (1) ||| (1,3) (1,2,4) (0) ||| 1 0.0486111 1 0.154959 2.718
c ||| p ||| (0) ||| (0) ||| 1 1 1 1 2.718
d w ||| d w ||| (0) (1) ||| (0) (1) ||| 1 0.75 1 1 2.718
d ||| d ||| (0) ||| (0) ||| 1 1 1 1 2.718
e b ||| ah b ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.6 2.718
e l l ||| ah l ||| (0) (1) (1) ||| (0) (1,2) ||| 1 1 0.5 0.5 2.718
e l l ||| eh l ||| (0) (0) (1) ||| (0,1) (2) ||| 1 0.111111 0.5 0.111111 2.718
e l ||| eh ||| (0) (0) ||| (0,1) ||| 1 0.111111 1 0.133333 2.718
e ||| ah ||| (0) ||| (0) ||| 1 1 0.666667 0.6 2.718
h e ||| hh ah ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.6 2.718
h ||| hh ||| (0) ||| (0) ||| 1 1 1 1 2.718
l e b ||| l ah b ||| (0) (1) (2) ||| (0) (1) (2) ||| 1 1 1 0.5 2.718
l e ||| l ah ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.5 2.718
l l o ||| l ow ||| (0) (0) (1) ||| (0,1) (2) ||| 0.5 1 1 0.227273 2.718
l l ||| l ||| (0) (0) ||| (0,1) ||| 0.25 1 1 0.833333 2.718
l o ||| l ow ||| (0) (1) ||| (0) (1) ||| 0.5 1 1 0.227273 2.718
l ||| l ||| (0) ||| (0) ||| 0.75 1 1 0.833333 2.718
m ||| m ||| (0) ||| (0) ||| 1 0.5 1 1 2.718
n d ||| n d ||| (0) (1) ||| (0) (1) ||| 1 1 1 1 2.718
n e ||| eh n iy ||| (1) (2) ||| () (0) (1) ||| 1 1 0.5 0.3 2.718
n e ||| n iy ||| (0) (1) ||| (0) (1) ||| 1 1 0.5 0.3 2.718
n ||| eh n ||| (1) ||| () (0) ||| 1 1 0.25 1 2.718
o o m ||| uw m ||| (0) (0) (1) ||| (0,1) (2) ||| 1 0.5 1 0.181818 2.718
o o ||| uw ||| (0) (0) ||| (0,1) ||| 1 1 1 0.181818 2.718
o ||| aa ||| (0) ||| (0) ||| 1 0.666667 0.2 0.181818 2.718
o ||| ow eh ||| (0) ||| (0) () ||| 1 1 0.2 0.272727 2.718
o ||| ow ||| (0) ||| (0) ||| 1 1 0.6 0.272727 2.718
w o r ||| w er ||| (0) (1) (1) ||| (0) (1,2) ||| 1 0.1875 1 0.424242 2.718
w ||| w ||| (0) ||| (0) ||| 1 0.75 1 1 2.718
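Each line has the format: source phrase ||| target phrase ||| word alignment (source to target) ||| word alignment (target to source) ||| scores. In the standard layout of this older Moses phrase-table format, the five scores are the inverse phrase translation probability φ(f|e), inverse lexical weighting lex(f|e), direct phrase translation probability φ(e|f), direct lexical weighting lex(e|f), and the constant phrase penalty e^1 ≈ 2.718.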

Page 21:

Tuning

Not a compulsory step, but it will improve the decoding quality by a small percentage.

mkdir tuning;
cp $WDIR/corpus/tun.en tuning/input;
cp $WDIR/corpus/tun.hi tuning/reference;
/usr/share/moses/scripts/training/mert-moses.pl $PWD/tuning/input $PWD/tuning/reference /usr/bin/moses $PWD/model/moses.ini --working-dir $PWD/tuning --rootdir /usr/share/moses/scripts

It will take around 1 hour on a server with 32 GB RAM.

Page 22:

Testing

mkdir evaluation;
/usr/bin/moses -config $WDIR/tuning/moses.ini -input-file $WDIR/corpus/test.en > evaluation/test.output;

The output will be in the evaluation/test.output file.

Sample Output (input → decoder output)

h o t → hh aa t
p h o n e → p|UNK hh ow eh n iy
b o o k → b uw k