Decoding

Edinburgh MT, Lecture 6



We want to solve this problem:

$e^* = \arg\max_e p(e \mid f)$

Q: how many English sentences are there?

北 风 呼啸 。

the strong north wind .

Suppose we have 5 translations per word, and suppose every word has fertility 1. Then we have $5^n \cdot n!$ possible translations: for the four-word example above, that is $5^4 \cdot 4! = 625 \cdot 24 = 15{,}000$.


Given a sentence pair and an alignment, we can easily calculate $p(e, a \mid f)$: the probability of the English sentence and the alignment given the Chinese.
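As a concrete illustration, here is a minimal Python sketch of that calculation under the lecture's simple model: a bigram language model times word-for-word translation probabilities. All probability tables and values are invented for illustration; they are not from the lecture.

```python
# Score one (translation, alignment) pair: p(e, a | f).
# Each step multiplies a bigram LM term p(e_i | e_{i-1}) and a
# channel term p(f_{a_i} | e_i). All numbers here are toy values.

BIGRAM = {("<s>", "north"): 0.10, ("north", "wind"): 0.40,
          ("wind", "strong"): 0.05, ("strong", "."): 0.20}
TRANS = {("北", "north"): 0.8, ("风", "wind"): 0.7,
         ("呼啸", "strong"): 0.1, ("。", "."): 0.9}

def score(english, chinese, alignment):
    """alignment[i] = index of the Chinese word that english[i]
    translates (fertility 1: a bijection between positions)."""
    p, prev = 1.0, "<s>"
    for e, a in zip(english, alignment):
        p *= BIGRAM.get((prev, e), 1e-9) * TRANS.get((chinese[a], e), 1e-9)
        prev = e
    return p

print(score(["north", "wind", "strong", "."],
            ["北", "风", "呼啸", "。"], [0, 1, 2, 3]))
```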

Can we decode without enumerating all translations?

There are $5^n \cdot n!$ possible target sentences.

Key Idea

But there are only $O(5n)$ ways to start them: 5 candidate translations for each of the $n$ source words.

Maintain a coverage vector recording which source words have been translated so far. Starting from START, each possible first English word covers one source word:

p(north | START) · p(北 | north)
p(northern | START) · p(北 | northern)
p(strong | START) · p(呼啸 | strong)

Extending the hypothesis "north" by one more word:

p(wind | north) · p(风 | wind)
p(strong | north) · p(呼啸 | strong)

Dynamic Programming

Hypotheses that reach the same state, i.e. the same (coverage vector, last word) pair, can be recombined, keeping only the best. The resulting lattice, with each edge labelled by a weight and a word (or words), e.g. "north, 0.014", is a weighted finite-state automaton.

Amount of work: $O(5n \cdot 2^n)$. Bad, but much better than $5^n \cdot n!$.
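Here is a sketch of that dynamic program in Python. A search state is a (coverage bitmask, last English word) pair, giving at most $2^n$ coverages times $O(5n)$ last words. The candidate table and the bigram() stand-in are toy values, not the lecture's.

```python
# Viterbi decoding by dynamic programming over coverage vectors.
# best maps (coverage_bitmask, last_word) -> (score, backpointer);
# hypotheses reaching the same state are recombined, keeping the max.

CHINESE = ["北", "风", "呼啸", "。"]
CANDIDATES = {"北": [("north", 0.8), ("northern", 0.2)],
              "风": [("wind", 0.7)],
              "呼啸": [("strong", 0.1), ("howls", 0.6)],
              "。": [(".", 0.9)]}

def bigram(prev, word):
    # stand-in for a trained bigram language model
    table = {("<s>", "north"): 0.10, ("north", "wind"): 0.40,
             ("wind", "strong"): 0.05, ("strong", "."): 0.20,
             ("wind", "howls"): 0.30, ("howls", "."): 0.20}
    return table.get((prev, word), 0.001)

def decode(chinese):
    n = len(chinese)
    best = {(0, "<s>"): (1.0, None)}      # state -> (score, backpointer)
    frontier = [(0, "<s>")]
    for _ in range(n):                    # every hypothesis grows by one word
        new = {}
        for state in frontier:
            cov, prev = state
            s0 = best[state][0]
            for j in range(n):
                if cov & (1 << j):        # source word j already covered
                    continue
                for word, t_prob in CANDIDATES[chinese[j]]:
                    s = s0 * bigram(prev, word) * t_prob
                    key = (cov | (1 << j), word)
                    if s > new.get(key, (0.0, None))[0]:
                        new[key] = (s, state)   # recombination: keep the max
        best.update(new)
        frontier = list(new)
    full = (1 << n) - 1                   # goal: every source word covered
    goal = max((st for st in best if st[0] == full),
               key=lambda st: best[st][0])
    words, st = [], goal                  # follow backpointers to the start
    while best[st][1] is not None:
        words.append(st[1])
        st = best[st][1]
    return list(reversed(words)), best[goal][0]

print(decode(CHINESE))
```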

Weighted languages

• The lattice describing the set of all possible translations is a weighted finite-state automaton.

• So is the language model.

• Since weighted regular languages are closed under intersection, we can intersect the two devices and run shortest-path graph algorithms on the result.

• Taking their intersection multiplies the weights, which is equivalent to computing the probability under Bayes' rule: $p(e \mid f) \propto p(f \mid e)\,p(e)$.
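A minimal sketch of weighted intersection via the standard product construction (the dict-of-arc-lists representation is my own choice for illustration): a state of the result is a pair of states, and a transition on a symbol exists exactly when both machines have one, with the weights multiplied.

```python
# Weighted FSA intersection by the product construction.
# arcs[state] = list of (symbol, next_state, weight); weights multiply.

def intersect(arcs1, start1, finals1, arcs2, start2, finals2):
    start = (start1, start2)
    arcs, seen, stack = {}, {start}, [start]
    while stack:                          # explore reachable pair states
        q1, q2 = stack.pop()
        out = []
        for sym1, n1, w1 in arcs1.get(q1, []):
            for sym2, n2, w2 in arcs2.get(q2, []):
                if sym1 != sym2:
                    continue              # both machines must read the symbol
                nxt = (n1, n2)
                out.append((sym1, nxt, w1 * w2))
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        arcs[(q1, q2)] = out
    finals = {q for q in arcs if q[0] in finals1 and q[1] in finals2}
    return arcs, start, finals
```

Intersecting the translation lattice with a bigram language model (a WFSA whose states remember the previous word) attaches LM probabilities to every lattice path; a best-path search over the product then yields the best translation.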

Wait a second!

We want to solve this problem:

$e^* = \arg\max_e p(e \mid f) = \arg\max_e \sum_a p(e, a \mid f)$

But now we're solving this problem:

$e^* = \arg\max_e \max_a p(e, a \mid f)$

This is often called the Viterbi approximation.

Could we solve the original problem? We can sum over alignments by weighted determinization. How expensive is that?

nondeterministic: $O(5n \cdot 2^n)$ states
deterministic: $O(2^{5n \cdot 2^n})$ states after determinization
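To see the difference between the two objectives concretely, on a toy sentence we can simply enumerate every alignment (with fertility 1, these are the permutations of source positions) and compare the sum with the max. This reuses the score() function and toy tables from the first sketch above.

```python
# Compare the true objective (sum over alignments) with the Viterbi
# approximation (max over a single alignment) by brute force.
from itertools import permutations

english = ["north", "wind", "strong", "."]
chinese = ["北", "风", "呼啸", "。"]

probs = [score(english, chinese, a)
         for a in permutations(range(len(chinese)))]
print("sum over alignments: ", sum(probs))
print("best single alignment:", max(probs))
```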

Wait a second!

I made the simplest machine translation model I could think of

and it blew up in my face

is still far too much work.O(5n2n)Ok, let’s stick with the Viterbi approximation. But…

I made the simplest machine translation model I could think of

and it blew up in my face

is still far too much work.

Can we do better?

O(5n2n)Ok, let’s stick with the Viterbi approximation. But…

Can we do better?

北 风 呼啸 。
north wind the strong .

Think of the chosen English words as nodes in a graph. Each arc is weighted by a translation probability plus a bigram probability (as negative log probabilities, so lower is better). Objective: find the shortest path that visits each word exactly once.

北 风 呼啸 。
London Paris NY Tokyo .

Relabel the words as cities and the problem is familiar. Probably not: this is the traveling salesman problem. Even the Viterbi approximation is too hard to solve exactly!

Approximation: Pruning

Idea: prune states by accumulated path length, discarding hypotheses whose scores fall too far behind the best.

Reality: longer paths have lower probability, so hypotheses covering more source words would always lose to shorter ones!

Solution: group states by the number of covered words, and prune only within each group.

This is “stack” decoding: a linear-time approximation.
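A sketch of stack decoding, reusing the toy CHINESE, CANDIDATES, and bigram() from the dynamic-programming sketch above; the beam size is an arbitrary choice. Hypotheses are grouped by number of covered source words, so they are only ever compared with others of the same length.

```python
# "Stack" decoding: one stack per number of covered source words,
# pruned to the `beam` best hypotheses before extension. With a
# constant beam the search is linear in sentence length.
import heapq

def stack_decode(chinese, candidates, bigram, beam=10):
    n = len(chinese)
    # each hypothesis: (score, coverage_bitmask, last_word, words_so_far)
    stacks = [[] for _ in range(n + 1)]
    stacks[0] = [(1.0, 0, "<s>", [])]
    for m in range(n):
        # prune: only the beam-best hypotheses in stack m get extended
        for s0, cov, prev, words in heapq.nlargest(beam, stacks[m]):
            for j in range(n):
                if cov & (1 << j):
                    continue              # source word j already covered
                for word, t_prob in candidates[chinese[j]]:
                    s = s0 * bigram(prev, word) * t_prob
                    stacks[m + 1].append(
                        (s, cov | (1 << j), word, words + [word]))
    return max(stacks[n])                 # best complete hypothesis

print(stack_decode(CHINESE, CANDIDATES, bigram))
```

For brevity this sketch skips recombination; a real decoder would also merge hypotheses sharing a (coverage, last word) state before pruning each stack.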

Approximation: Distortion Limits

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。
the sky

(The source sentence means "Although the north wind howled, the sky remained very clear"; "the sky" is the partial hypothesis so far.)

Without restrictions, the number of vertices is $O(2^n)$. Now restrict reordering to a window of size $d = 4$: everything outside the window to the left must be covered, and everything outside the window to the right must be uncovered. The number of vertices drops to $O(n \cdot 2^d)$.
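That state count can be checked directly: a legal coverage vector is determined by the window position (at most $n$ choices) plus $d$ free bits inside the window. A small sketch (the counting loop and names are mine, for illustration):

```python
# Count coverage vectors that a distortion window of size d allows:
# covered left of the window, uncovered right of it, arbitrary inside.

def legal_coverages(n, d):
    states = set()
    for left in range(n - d + 1):        # window start position
        covered_left = (1 << left) - 1   # bits 0..left-1 all covered
        for bits in range(1 << d):       # any pattern inside the window
            states.add(covered_left | (bits << left))
    return states

for n in (6, 10, 14):
    print(f"n={n}: {len(legal_coverages(n, 4))} windowed states "
          f"vs {2 ** n} unrestricted")
```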

Summary

• We need every possible trick to make decoding fast.

• Viterbi approximation: from worse to bad.

• Dynamic programming: exact, but too slow.

• NP-completeness means efficient exact solutions are unlikely.

• Heuristic approximations: stack decoding, distortion limits.

• Tradeoff: we might not find the true argmax.