CS460/626 : Natural Language Processing/Speech, NLP and the Web
(Lecture 18 – Alignment in SMT and Tutorial on Giza++ and Moses)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
15th Feb, 2011
Going forward from word alignment
Word alignment → Phrase alignment (going to bigger units of correspondence) → Decoding (best possible translation)
Abstract Problem
Given: $e_0 e_1 e_2 e_3 \ldots e_n e_{n+1}$ (Entities)
Goal: $l_0 l_1 l_2 l_3 \ldots l_n l_{n+1}$ (Labels)
The goal is to find the best possible label sequence.
Generative Model
$$L^* = \arg\max_L P(L \mid E)$$
$$\arg\max_L P(L \mid E) = \arg\max_L P(L) \cdot P(E \mid L)$$
Simplification
Using the Markov assumption, the language model can be represented using bigrams:
$$P(L) = \prod_{i=1}^{n} P(l_i \mid l_{i-1})$$
Similarly, the translation model can be represented in the following way:
$$P(E \mid L) = \prod_{i=0}^{n} P(e_i \mid l_i)$$
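As a rough illustration of this factorization (not from the lecture; the probability tables and the sentence below are hypothetical toy values), a minimal Python sketch that scores one candidate label sequence as P(L) · P(E|L):

# Minimal sketch: score a candidate label sequence L for entities E
# as P(L) * P(E|L) under the bigram + word-for-word simplification.
# All probability values here are hypothetical toy numbers.

bigram_lm = {            # P(l_i | l_{i-1})
    ("^", "whom"): 0.2,
    ("whom", "with"): 0.4,
    ("with", "."): 0.5,
}
translation = {          # P(e_i | l_i)
    ("kis", "whom"): 0.3,
    ("ke_sath", "with"): 0.6,
}

def score(entities, labels):
    p = 1.0
    prev = "^"
    for e, l in zip(entities, labels):
        p *= bigram_lm.get((prev, l), 1e-6)   # language model term
        p *= translation.get((e, l), 1e-6)    # translation model term
        prev = l
    return p * bigram_lm.get((prev, "."), 1e-6)   # sentence-end term

print(score(["kis", "ke_sath"], ["whom", "with"]))  # 0.2*0.3*0.4*0.6*0.5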
Statistical Machine Translation
Finding the best possible English sentence given the foreign sentence
$P(E)$ = Language Model; $P(F \mid E)$ = Translation Model; $E$: English, $F$: Foreign Language
$$E^* = \arg\max_E P(E \mid F) = \arg\max_E P(E) \cdot P(F \mid E)$$
Problems in the framework
Labels are words of the target language, and hence very large in number.
Example: Who do you want to_go with? ↔ With whom do you want to go? ↔ आप किस के_साथ जाना चाहते_हो (Aap kis ke_sath jaana chahate_ho)
Candidate labels include who, do, you, want, to_go, with, and so on.
Each word has multiple translation options.
Preposition Stranding
A column of candidate target-language words is stacked on each source-language word:
^ Aap kis ke_sath jaana chahate_ho .
with candidates such as who, do, you, want, to_go, with, … in each column, and ^ and . marking the start and end.
Find the best possible path from '^' to '.' using transition and observation probabilities. Viterbi can be used.
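A minimal sketch of such a Viterbi search over this kind of lattice (the transition/emission tables and the tiny lattice below are hypothetical toy values, not the lecture's):

# Viterbi over a lattice: one column of candidate labels per source word.
# transition[(prev, cur)] plays the language-model role and
# emission[(label, word)] the translation-model role; both are toy tables.

def viterbi(source_words, label_options, transition, emission):
    # best[label] = (prob of best path ending in label, that path)
    best = {"^": (1.0, ["^"])}
    for word in source_words:
        new_best = {}
        for label in label_options.get(word, []):
            # pick the best predecessor for this label
            cands = [
                (p * transition.get((prev, label), 1e-6)
                   * emission.get((label, word), 1e-6), path + [label])
                for prev, (p, path) in best.items()
            ]
            new_best[label] = max(cands)
        best = new_best
    # finally transition into the end marker '.'
    return max((p * transition.get((prev, "."), 1e-6), path + ["."])
               for prev, (p, path) in best.items())

prob, path = viterbi(
    ["kis", "ke_sath"],
    {"kis": ["who", "whom"], "ke_sath": ["with"]},
    {("^", "whom"): 0.3, ("whom", "with"): 0.4, ("with", "."): 0.5},
    {("whom", "kis"): 0.4, ("with", "ke_sath"): 0.6},
)
print(prob, path)   # best path from '^' to '.'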
TUTORIAL ON Giza++ and Moses tools (delivered by Kushal Ladha)
Word-based alignment
For each word in the source language, align the words from the target language that this word possibly produces.
Based on IBM Models 1-5; Model 1 is the simplest.
As we go from Model 1 to Model 5, the models get more complex but more realistic.
This is all that Giza++ does.
Alignment
A function from target position to source position.
For the example alignment, the alignment sequence is 2, 3, 4, 5, 6, 6, 6; as an alignment function A: A(1) = 2, A(2) = 3, … A different alignment function would give the sequence 1, 2, 1, 2, 3, 4, 3, 4 for A(1), A(2), …
To allow spurious insertion, allow alignment with word 0 (NULL). Number of possible alignments: $(I+1)^J$.
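A small Python illustration of both points (the sentence lengths are hypothetical; the sequence is the one above):

# The alignment sequence 2,3,4,5,6,6,6 as a function from target
# position j to source position A(j):
A = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 6, 7: 6}

# With NULL (source position 0) allowed, each of the J target words can
# align to any of the I+1 source positions, so there are (I+1)**J
# possible alignments; e.g. for I = 6, J = 7:
I, J = 6, 7
print((I + 1) ** J)   # 7**7 = 823543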
IBM Model 1: Generative Process
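The slide's figure is not reproduced here; as a rough sketch of Model 1's standard generative story (the lexicon table t(f|e) below is a hypothetical toy example):

import random

# IBM Model 1 generative story: given a source sentence e and a chosen
# target length J, pick each alignment a_j uniformly over the source
# positions (including NULL), then generate each target word f_j from
# the lexicon probability t(f | e_{a_j}).
t = {
    "NULL": {"hai": 1.0},
    "big": {"bada": 1.0},
    "house": {"ghar": 0.8, "makaan": 0.2},
}

def generate(e_sentence, J):
    source = ["NULL"] + e_sentence
    f_sentence = []
    for _ in range(J):
        a_j = random.randrange(len(source))          # uniform alignment
        words, probs = zip(*t[source[a_j]].items())
        f_sentence.append(random.choices(words, probs)[0])
    return f_sentence

print(generate(["big", "house"], J=3))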
Training Alignment Models
Given a parallel corpus, for each sentence pair (F, E), learn the best alignment A and the component probabilities: $t(f \mid e)$ for Model 1; the lexicon probability $P(f \mid e)$ and the alignment probability $P(a_i \mid a_{i-1}, I)$ for the higher models.
How do we compute these probabilities if all we have is a parallel corpus?
Intuition: Interdependence of Probabilities
If you knew which words are probable translations of each other, then you could guess which alignments are probable and which are improbable.
If you were given alignments with probabilities, then you could compute the translation probabilities.
Looks like a chicken-and-egg problem.
The EM algorithm comes to the rescue.
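A minimal sketch of this EM loop for Model 1 (the two-sentence corpus is a hypothetical toy example, not the lecture's data):

from collections import defaultdict

# E-step: with the current t(f|e), compute expected alignment counts;
# M-step: renormalize those counts into a new t(f|e). Iterating breaks
# the chicken-and-egg cycle described above.
corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]

f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization

for _ in range(10):
    count = defaultdict(float)   # expected count(f, e)
    total = defaultdict(float)   # expected count(e)
    for fs, es in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                delta = t[(f, e)] / norm     # posterior of aligning f to e
                count[(f, e)] += delta
                total[e] += delta
    for (f, e), c in count.items():          # M-step
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 3))   # converges towards 1.0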
Limitation: Only 1->Many Alignments allowed
Phrase-based alignment
More natural
Many-to-one mappings allowed
Giza++ and Moses Package
http://cl.naist.jp/~eric-n/ubuntu-nlp/
Select your Ubuntu version and browse the nlp folder.
Download the Debian packages of giza++, moses, mkcls, and srilm.
Resolve all the dependencies and they get installed.
For an alternate installation, refer to http://www.statmt.org/moses_steps.html
Steps
Input: sentence-aligned parallel corpus
Output: target-side tagged data
Training
Tuning
Generate output on test corpus (decoding)
Training
Create a folder named corpus containing the test, train, and tuning files.
Giza++ is used to generate the alignment.
The phrase table is generated after training.
Before training, a language model needs to be built on the target side:

mkdir lm
/usr/bin/ngram-count -order 3 -interpolate -kndiscount -text $PWD/corpus/train_surface.hi -lm lm/train.lm
/usr/share/moses/scripts/training/train-factored-phrase-model.perl -scripts-root-dir /usr/share/moses/scripts -root-dir . -corpus train.clean -e hi -f en -lm 0:3:$PWD/lm/train.lm:0
Example
The toy corpus below treats letter-to-phoneme conversion as a translation task: the "source language" is letters and the "target language" is phonemes.

train.en
h e l l o
h e l l o
w o r l d
c o m p o u n d w o r d
h y p h e n a t e d
o n e
b o o m
k w e e z l e b o t t e r
train.prh
hh eh l ow
hh ah l ow
w er l d
k aa m p aw n d w er d
hh ay f ah n ey t ih d
ow eh n iy
b uw m
k w iy z l ah b aa t ah r
Sample from Phrase-table
b o ||| b aa ||| (0) (1) ||| (0) (1) ||| 1 0.666667 1 0.181818 2.718
b ||| b ||| (0) ||| (0) ||| 1 1 1 1 2.718
c o m p o ||| aa m p ||| (2) (0,1) (1) (0) (1) ||| (1,3) (1,2,4) (0) ||| 1 0.0486111 1 0.154959 2.718
c ||| p ||| (0) ||| (0) ||| 1 1 1 1 2.718
d w ||| d w ||| (0) (1) ||| (0) (1) ||| 1 0.75 1 1 2.718
d ||| d ||| (0) ||| (0) ||| 1 1 1 1 2.718
e b ||| ah b ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.6 2.718
e l l ||| ah l ||| (0) (1) (1) ||| (0) (1,2) ||| 1 1 0.5 0.5 2.718
e l l ||| eh l ||| (0) (0) (1) ||| (0,1) (2) ||| 1 0.111111 0.5 0.111111 2.718
e l ||| eh ||| (0) (0) ||| (0,1) ||| 1 0.111111 1 0.133333 2.718
e ||| ah ||| (0) ||| (0) ||| 1 1 0.666667 0.6 2.718
h e ||| hh ah ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.6 2.718
h ||| hh ||| (0) ||| (0) ||| 1 1 1 1 2.718
l e b ||| l ah b ||| (0) (1) (2) ||| (0) (1) (2) ||| 1 1 1 0.5 2.718
l e ||| l ah ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.5 2.718
l l o ||| l ow ||| (0) (0) (1) ||| (0,1) (2) ||| 0.5 1 1 0.227273 2.718
l l ||| l ||| (0) (0) ||| (0,1) ||| 0.25 1 1 0.833333 2.718
l o ||| l ow ||| (0) (1) ||| (0) (1) ||| 0.5 1 1 0.227273 2.718
l ||| l ||| (0) ||| (0) ||| 0.75 1 1 0.833333 2.718
m ||| m ||| (0) ||| (0) ||| 1 0.5 1 1 2.718
n d ||| n d ||| (0) (1) ||| (0) (1) ||| 1 1 1 1 2.718
n e ||| eh n iy ||| (1) (2) ||| () (0) (1) ||| 1 1 0.5 0.3 2.718
n e ||| n iy ||| (0) (1) ||| (0) (1) ||| 1 1 0.5 0.3 2.718
n ||| eh n ||| (1) ||| () (0) ||| 1 1 0.25 1 2.718
o o m ||| uw m ||| (0) (0) (1) ||| (0,1) (2) ||| 1 0.5 1 0.181818 2.718
o o ||| uw ||| (0) (0) ||| (0,1) ||| 1 1 1 0.181818 2.718
o ||| aa ||| (0) ||| (0) ||| 1 0.666667 0.2 0.181818 2.718
o ||| ow eh ||| (0) ||| (0) () ||| 1 1 0.2 0.272727 2.718
o ||| ow ||| (0) ||| (0) ||| 1 1 0.6 0.272727 2.718
w o r ||| w er ||| (0) (1) (1) ||| (0) (1,2) ||| 1 0.1875 1 0.424242 2.718
w ||| w ||| (0) ||| (0) ||| 1 0.75 1 1 2.718
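Each line has five |||-separated fields: source phrase, target phrase, two word-alignment fields, and the scores. The score order is assumed here to follow the standard Moses layout of that era, phi(f|e), lex(f|e), phi(e|f), lex(e|f), and the constant phrase penalty e ≈ 2.718. A small parsing sketch:

def parse_phrase_table_line(line):
    # Split one phrase-table line into its ||| separated fields.
    src, tgt, align_src, align_tgt, scores = \
        [field.strip() for field in line.split("|||")]
    return {
        "source": src.split(),
        "target": tgt.split(),
        "alignments": (align_src, align_tgt),
        # assumed order: phi(f|e), lex(f|e), phi(e|f), lex(e|f), penalty
        "scores": [float(s) for s in scores.split()],
    }

entry = parse_phrase_table_line(
    "b o ||| b aa ||| (0) (1) ||| (0) (1) ||| 1 0.666667 1 0.181818 2.718")
print(entry["source"], "->", entry["target"], entry["scores"])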
Tuning
Not a compulsory step, but it will improve the decoding quality by a small percentage.
mkdir tuning
cp $WDIR/corpus/tun.en tuning/input
cp $WDIR/corpus/tun.hi tuning/reference
/usr/share/moses/scripts/training/mert-moses.pl $PWD/tuning/input $PWD/tuning/reference /usr/bin/moses $PWD/model/moses.ini --working-dir $PWD/tuning --rootdir /usr/share/moses/scripts
It will take around an hour on a server with 32 GB of RAM.
Testing
mkdir evaluation
/usr/bin/moses -config $WDIR/tuning/moses.ini -input-file $WDIR/corpus/test.en > evaluation/test.output
The output will be in the evaluation/test.output file.
Sample Output
h o t → hh aa t
p h o n e → p|UNK hh ow eh n iy
b o o k → b uw k