Decoding

Edinburgh MT, Lecture 6



We want to solve this problem:

$e^* = \arg\max_e p(e \mid f)$

Q: how many English sentences are there?

北 风 呼啸 。

the strong north wind .

Suppose we have 5 translations per word, and suppose every word has fertility 1. Then we have $5^n \cdot n!$ possible translations: for the four-word example above, that is $5^4 \cdot 4! = 625 \cdot 24 = 15{,}000$.


Given a sentence pair and an alignment, we can easily calculate $p(e, a \mid f)$: the probability of the English sentence and the alignment given the Chinese.
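As a concrete illustration, here is a minimal Python sketch of that calculation under the lecture's simple model: a bigram language model times word-for-word translation probabilities. All probability tables and values are invented for illustration; they are not from the lecture.

```python
# Score one (translation, alignment) pair: p(e, a | f).
# Each step multiplies a bigram LM term p(e_i | e_{i-1}) and a
# channel term p(f_{a_i} | e_i). All numbers here are toy values.

BIGRAM = {("<s>", "north"): 0.10, ("north", "wind"): 0.40,
          ("wind", "strong"): 0.05, ("strong", "."): 0.20}
TRANS = {("北", "north"): 0.8, ("风", "wind"): 0.7,
         ("呼啸", "strong"): 0.1, ("。", "."): 0.9}

def score(english, chinese, alignment):
    """alignment[i] = index of the Chinese word that english[i]
    translates (fertility 1: a bijection between positions)."""
    p, prev = 1.0, "<s>"
    for e, a in zip(english, alignment):
        p *= BIGRAM.get((prev, e), 1e-9) * TRANS.get((chinese[a], e), 1e-9)
        prev = e
    return p

print(score(["north", "wind", "strong", "."],
            ["北", "风", "呼啸", "。"], [0, 1, 2, 3]))
```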

Can we decode without enumerating all translations?

There are $5^n \cdot n!$ possible target sentences.

Key Idea

But there are only $O(5n)$ ways to start them: 5 candidate translations for each of the $n$ source words.

Maintain a coverage vector recording which source words have been translated so far. Starting from START, each possible first English word covers one source word:

p(north | START) · p(北 | north)
p(northern | START) · p(北 | northern)
p(strong | START) · p(呼啸 | strong)

Extending the hypothesis "north" by one more word:

p(wind | north) · p(风 | wind)
p(strong | north) · p(呼啸 | strong)

Dynamic Programming

Hypotheses that reach the same state, i.e. the same (coverage vector, last word) pair, can be recombined, keeping only the best. The resulting lattice, with each edge labelled by a weight and a word (or words), e.g. "north, 0.014", is a weighted finite-state automaton.

Amount of work: $O(5n \cdot 2^n)$. Bad, but much better than $5^n \cdot n!$.
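Here is a sketch of that dynamic program in Python. A search state is a (coverage bitmask, last English word) pair, giving at most $2^n$ coverages times $O(5n)$ last words. The candidate table and the bigram() stand-in are toy values, not the lecture's.

```python
# Viterbi decoding by dynamic programming over coverage vectors.
# best maps (coverage_bitmask, last_word) -> (score, backpointer);
# hypotheses reaching the same state are recombined, keeping the max.

CHINESE = ["北", "风", "呼啸", "。"]
CANDIDATES = {"北": [("north", 0.8), ("northern", 0.2)],
              "风": [("wind", 0.7)],
              "呼啸": [("strong", 0.1), ("howls", 0.6)],
              "。": [(".", 0.9)]}

def bigram(prev, word):
    # stand-in for a trained bigram language model
    table = {("<s>", "north"): 0.10, ("north", "wind"): 0.40,
             ("wind", "strong"): 0.05, ("strong", "."): 0.20,
             ("wind", "howls"): 0.30, ("howls", "."): 0.20}
    return table.get((prev, word), 0.001)

def decode(chinese):
    n = len(chinese)
    best = {(0, "<s>"): (1.0, None)}      # state -> (score, backpointer)
    frontier = [(0, "<s>")]
    for _ in range(n):                    # every hypothesis grows by one word
        new = {}
        for state in frontier:
            cov, prev = state
            s0 = best[state][0]
            for j in range(n):
                if cov & (1 << j):        # source word j already covered
                    continue
                for word, t_prob in CANDIDATES[chinese[j]]:
                    s = s0 * bigram(prev, word) * t_prob
                    key = (cov | (1 << j), word)
                    if s > new.get(key, (0.0, None))[0]:
                        new[key] = (s, state)   # recombination: keep the max
        best.update(new)
        frontier = list(new)
    full = (1 << n) - 1                   # goal: every source word covered
    goal = max((st for st in best if st[0] == full),
               key=lambda st: best[st][0])
    words, st = [], goal                  # follow backpointers to the start
    while best[st][1] is not None:
        words.append(st[1])
        st = best[st][1]
    return list(reversed(words)), best[goal][0]

print(decode(CHINESE))
```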

Weighted languages

• The lattice describing the set of all possible translations is a weighted finite-state automaton.

• So is the language model.

• Since weighted regular languages are closed under intersection, we can intersect the two devices and run shortest-path graph algorithms on the result.

• Taking their intersection multiplies the weights, which is equivalent to computing the probability under Bayes' rule: $p(e \mid f) \propto p(f \mid e)\,p(e)$.
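A minimal sketch of weighted intersection via the standard product construction (the dict-of-arc-lists representation is my own choice for illustration): a state of the result is a pair of states, and a transition on a symbol exists exactly when both machines have one, with the weights multiplied.

```python
# Weighted FSA intersection by the product construction.
# arcs[state] = list of (symbol, next_state, weight); weights multiply.

def intersect(arcs1, start1, finals1, arcs2, start2, finals2):
    start = (start1, start2)
    arcs, seen, stack = {}, {start}, [start]
    while stack:                          # explore reachable pair states
        q1, q2 = stack.pop()
        out = []
        for sym1, n1, w1 in arcs1.get(q1, []):
            for sym2, n2, w2 in arcs2.get(q2, []):
                if sym1 != sym2:
                    continue              # both machines must read the symbol
                nxt = (n1, n2)
                out.append((sym1, nxt, w1 * w2))
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        arcs[(q1, q2)] = out
    finals = {q for q in arcs if q[0] in finals1 and q[1] in finals2}
    return arcs, start, finals
```

Intersecting the translation lattice with a bigram language model (a WFSA whose states remember the previous word) attaches LM probabilities to every lattice path; a best-path search over the product then yields the best translation.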

Wait a second!

We want to solve this problem:

$e^* = \arg\max_e p(e \mid f) = \arg\max_e \sum_a p(e, a \mid f)$

But now we're solving this problem:

$e^* = \arg\max_e \max_a p(e, a \mid f)$

This is often called the Viterbi approximation.

Could we solve the original problem? We can sum over alignments by weighted determinization. How expensive is that?

nondeterministic: $O(5n \cdot 2^n)$ states
deterministic: $O(2^{5n \cdot 2^n})$ states after determinization
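To see the difference between the two objectives concretely, on a toy sentence we can simply enumerate every alignment (with fertility 1, these are the permutations of source positions) and compare the sum with the max. This reuses the score() function and toy tables from the first sketch above.

```python
# Compare the true objective (sum over alignments) with the Viterbi
# approximation (max over a single alignment) by brute force.
from itertools import permutations

english = ["north", "wind", "strong", "."]
chinese = ["北", "风", "呼啸", "。"]

probs = [score(english, chinese, a)
         for a in permutations(range(len(chinese)))]
print("sum over alignments: ", sum(probs))
print("best single alignment:", max(probs))
```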

Wait a second!

I made the simplest machine translation model I could think of

and it blew up in my face

is still far too much work.O(5n2n)Ok, let’s stick with the Viterbi approximation. But…

I made the simplest machine translation model I could think of

and it blew up in my face

is still far too much work.

Can we do better?

O(5n2n)Ok, let’s stick with the Viterbi approximation. But…

Can we do better?

北 风 呼啸 。
north wind the strong .

Think of the chosen English words as nodes in a graph. Each arc is weighted by a translation probability plus a bigram probability (as negative log probabilities, so lower is better). Objective: find the shortest path that visits each word exactly once.

北 风 呼啸 。
London Paris NY Tokyo .

Relabel the words as cities and the problem is familiar. Probably not: this is the traveling salesman problem. Even the Viterbi approximation is too hard to solve exactly!

Approximation: Pruning

Idea: prune states by accumulated path length, discarding hypotheses whose scores fall too far behind the best.

Reality: longer paths have lower probability, so hypotheses covering more source words would always lose to shorter ones!

Solution: group states by the number of covered words, and prune only within each group.

This is “stack” decoding: a linear-time approximation.
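A sketch of stack decoding, reusing the toy CHINESE, CANDIDATES, and bigram() from the dynamic-programming sketch above; the beam size is an arbitrary choice. Hypotheses are grouped by number of covered source words, so they are only ever compared with others of the same length.

```python
# "Stack" decoding: one stack per number of covered source words,
# pruned to the `beam` best hypotheses before extension. With a
# constant beam the search is linear in sentence length.
import heapq

def stack_decode(chinese, candidates, bigram, beam=10):
    n = len(chinese)
    # each hypothesis: (score, coverage_bitmask, last_word, words_so_far)
    stacks = [[] for _ in range(n + 1)]
    stacks[0] = [(1.0, 0, "<s>", [])]
    for m in range(n):
        # prune: only the beam-best hypotheses in stack m get extended
        for s0, cov, prev, words in heapq.nlargest(beam, stacks[m]):
            for j in range(n):
                if cov & (1 << j):
                    continue              # source word j already covered
                for word, t_prob in candidates[chinese[j]]:
                    s = s0 * bigram(prev, word) * t_prob
                    stacks[m + 1].append(
                        (s, cov | (1 << j), word, words + [word]))
    return max(stacks[n])                 # best complete hypothesis

print(stack_decode(CHINESE, CANDIDATES, bigram))
```

For brevity this sketch skips recombination; a real decoder would also merge hypotheses sharing a (coverage, last word) state before pruning each stack.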

Approximation: Distortion Limits

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。
the sky

(The source sentence means "Although the north wind howled, the sky remained very clear"; "the sky" is the partial hypothesis so far.)

Without restrictions, the number of vertices is $O(2^n)$. Now restrict reordering to a window of size $d = 4$: everything outside the window to the left must be covered, and everything outside the window to the right must be uncovered. The number of vertices drops to $O(n \cdot 2^d)$.
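That state count can be checked directly: a legal coverage vector is determined by the window position (at most $n$ choices) plus $d$ free bits inside the window. A small sketch (the counting loop and names are mine, for illustration):

```python
# Count coverage vectors that a distortion window of size d allows:
# covered left of the window, uncovered right of it, arbitrary inside.

def legal_coverages(n, d):
    states = set()
    for left in range(n - d + 1):        # window start position
        covered_left = (1 << left) - 1   # bits 0..left-1 all covered
        for bits in range(1 << d):       # any pattern inside the window
            states.add(covered_left | (bits << left))
    return states

for n in (6, 10, 14):
    print(f"n={n}: {len(legal_coverages(n, 4))} windowed states "
          f"vs {2 ** n} unrestricted")
```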

Summary

• We need every possible trick to make decoding fast.

• Viterbi approximation: from worse to bad.

• Dynamic programming: exact, but too slow.

• NP-completeness means efficient exact solutions are unlikely.

• Heuristic approximations: stack decoding, distortion limits.

• Tradeoff: we might not find the true argmax.