
Latent Structure Models for NLP

André Martins, Instituto de Telecomunicações & IST & Unbabel
Tsvetomila Mihaylova, Instituto de Telecomunicações
Nikita Nangia, NYU
Vlad Niculae, Instituto de Telecomunicações

deep-spin.github.io/tutorial

I. Introduction

Structured prediction and NLP

• Structured prediction: a machine learning framework for predicting structured, constrained, and interdependent outputs
• NLP deals with structured and ambiguous textual data:
  • machine translation
  • speech recognition
  • syntactic parsing
  • semantic parsing
  • information extraction
  • ...

Examples of structure in NLP

POS tagging (alternative taggings of "dog on wheels"):
  VERB PREP NOUN · NOUN PREP NOUN · NOUN DET NOUN

Dependency parsing:
  [figure: alternative dependency trees over "⋆ dog on wheels"]

Word alignments:
  [figure: alternative alignments of "dog on wheels" to the Dutch "hond op wielen"]

Exponentially many parse trees! Cannot enumerate.


NLP 5 years ago: Structured prediction and pipelines

• Big pipeline systems, connecting different structured predictors, trained separately
• Advantages: fast and simple to train; can rearrange pieces
• Disadvantage: linguistic annotations required for each component
• Bigger disadvantage: errors propagate through the pipeline


NLP today: End-to-end training

• Forget pipelines: train everything from scratch!
• No more error propagation or linguistic annotations!
• Treat everything as latent!


Representation learning

• Uncover hidden representations useful for the downstream task.
• Neural networks are well suited for this: deep computation graphs.
• Neural representations are unstructured, inscrutable. Language data has underlying structure!

[figure: a neural network mapping an input to positive/neutral/negative labels]


Latent structure models

• Seek structured hidden representations instead!

[figure: a network whose hidden representation is a discrete structure, mapping an input to positive/neutral/negative labels]


Latent structure models aren't so new!

• They have a very long history in NLP:
  • IBM models for SMT (latent word alignments) [Brown et al., 1993]
  • HMMs [Rabiner, 1989]
  • CRFs with hidden variables [Quattoni et al., 2007]
  • Latent PCFGs [Petrov and Klein, 2008, Cohen et al., 2012]
• Trained with EM, spectral learning, method of moments, ...
• Often with very strict assumptions (e.g., strong factorizations)
• Today, neural networks have opened up some new possibilities!

Why do we love latent structure models?

• The inferred latent variables can bring us some interpretability
• They offer a way of injecting prior knowledge as a structured bias
• Hopefully: higher predictive power with fewer model parameters (and a smaller carbon footprint!)


What this tutorial is about:

• Discrete, combinatorial latent structures
• Often the structure is inspired by some linguistic intuition
• We'll cover both:
  • RL methods (structure built incrementally, reward coming from the downstream task) vs. end-to-end differentiable approaches (global optimization, marginalization)
  • stochastic computation graphs vs. deterministic graphs
• All plugged into discriminative neural models.

This tutorial is not about:

• Continuous latent variables
• Deep generative learning: we won't cover GANs, VAEs, etc.
• There are (very good) recent tutorials on deep variational models for NLP:
  • "Variational Inference and Deep Generative Models" (Schulz and Aziz, ACL 2018)
  • "Deep Latent-Variable Models for Natural Language" (Kim, Wiseman, and Rush, EMNLP 2018)

Background

Unstructured vs structured

• To better explain the math, we'll often backtrack to unstructured models (where the latent variable is a categorical) before jumping to the structured ones

The unstructured case: Probability simplex

[figure: the simplex, with vertex [0, 0, 1] and interior point [.3, .2, .5] marked]

• Each vertex is an indicator vector, representing one class:
  z_c = [0, ..., 0, 1, 0, ..., 0], with the 1 in the c-th position.
• Points inside are probability vectors, a convex combination of classes:
  p ≥ 0, ∑_c p_c = 1.
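A quick numeric sketch of the simplex geometry above (plain Python, with made-up scores): softmax maps arbitrary real scores to an interior point of the simplex, while an indicator vector sits at a vertex.

```python
import math

def softmax(scores):
    """Map arbitrary real scores to a point inside the probability simplex."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

p = softmax([1.0, -2.0, 0.5])            # hypothetical scores
assert all(pc >= 0 for pc in p)          # nonnegative coordinates
assert abs(sum(p) - 1.0) < 1e-12         # coordinates sum to one

# A vertex of the simplex: an indicator vector selecting a single class.
z = [0, 0, 1]
assert sum(z) == 1 and all(zc in (0, 1) for zc in z)
```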


What's the analogue of the simplex for a structure?

• A structured object z can be represented as a bit vector.
• Example:
  • a dependency tree can be represented as an O(L²) vector indexed by arcs
  • each entry is 1 iff the arc belongs to the tree
  • structural constraints: not all bit vectors represent valid trees!

z₁ = [1,0,0,0,1,0,0,0,1]
z₂ = [0,0,1,0,0,1,1,0,0]
z₃ = [1,0,0,0,1,0,0,1,0]

[figure: the three corresponding dependency trees over "⋆ dog on wheels"]


The structured case: Marginal polytope [Wainwright and Jordan, 2008]

• Each vertex corresponds to one such bit vector z
• Points inside correspond to marginal distributions: convex combinations of structured objects,

  μ = p₁z₁ + ... + p_N z_N (exponentially many terms), with p ∈ ∆.

Example:
  p₁ = 0.2, z₁ = [1,0,0,0,1,0,0,0,1]
  p₂ = 0.7, z₂ = [0,0,1,0,0,1,1,0,0]
  p₃ = 0.1, z₃ = [1,0,0,0,1,0,0,1,0]
  ⇒ μ = [.3, 0, .7, 0, .3, .7, .7, .1, .2]

[figure: the marginal polytope M]
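The example above can be checked directly: the marginal vector μ is just the probability-weighted sum of the structure bit vectors.

```python
# Convex combination of structured bit vectors (the slide's example).
p = [0.2, 0.7, 0.1]
Z = [
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 1, 0],
]
# mu[i] = sum_k p_k * z_k[i]: the marginal probability that arc i is present.
mu = [sum(pk * zk[i] for pk, zk in zip(p, Z)) for i in range(9)]

expected = [0.3, 0.0, 0.7, 0.0, 0.3, 0.7, 0.7, 0.1, 0.2]
assert all(abs(a - b) < 1e-12 for a, b in zip(mu, expected))
```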


Unstructured vs Structured

• Unstructured case: the probability simplex ∆
• Structured case: the marginal polytope M

[figure: the simplex ∆ and the marginal polytope M side by side]


Computing the most likely structure is a very high-dimensional argmax

[figure: computation graph from input x to scores s, structure z, and output ŷ]

There are exponentially many structures (s cannot fit in memory; we cannot "loop" over s nor z)


Dealing with the combinatorial explosion

1. Incremental structures
  • Build the structure greedily, as a sequence of discrete choices (e.g., shift-reduce).
  • Scores (partial structure, action) tuples.
  • Advantages: flexible, rich histories.
  • Disadvantages: greedy; local decisions are suboptimal; error propagation.

2. Factorization into parts
  • Optimizes globally (e.g., Viterbi, Chu-Liu-Edmonds, Kuhn-Munkres).
  • Scores smaller parts.
  • Advantages: optimal, elegant, can handle hard & global constraints.
  • Disadvantages: strong assumptions.
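As a minimal sketch of the "factorization into parts" approach, here is a Viterbi-style global argmax over tag sequences, assuming (hypothetically) that the score factorizes into per-position emission scores plus pairwise transition scores; the toy inputs in the check are invented.

```python
def viterbi(emissions, transitions):
    """Globally best tag sequence when score(y) = sum of emissions + transitions."""
    n, K = len(emissions), len(emissions[0])
    best = [emissions[0][:]]             # best[i][k]: best score of a prefix ending in tag k
    back = []                            # backpointers for recovering the argmax
    for i in range(1, n):
        row, ptr = [], []
        for k in range(K):
            scores = [best[-1][j] + transitions[j][k] for j in range(K)]
            j = max(range(K), key=lambda t: scores[t])
            row.append(scores[j] + emissions[i][k])
            ptr.append(j)
        best.append(row)
        back.append(ptr)
    k = max(range(K), key=lambda t: best[-1][t])
    path = [k]
    for ptr in reversed(back):           # follow backpointers from the end
        k = ptr[k]
        path.append(k)
    return path[::-1]

# Toy check: 3 positions, 2 tags (all numbers hypothetical).
emissions = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
transitions = [[0.5, 0.0], [0.0, 0.5]]
assert viterbi(emissions, transitions) == [0, 1, 1]
```

Because the score decomposes over parts, this finds the exact argmax over all K^n sequences in O(nK²) time instead of enumerating them.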

The challenge of discrete choices

[figure: scores s mapped to a discrete choice z ∈ {1, ..., N}, between input x and output ŷ]

s = f_θ(x)    ŷ = g_φ(z, x)

∂L(ŷ, y)/∂w = ?   or, essentially, ∂z/∂s = ?


Discrete mappings are "flat"

[figure: the mapping from scores s to the discrete choice z is piecewise constant]

∂z/∂s = ?


Argmax

∂z/∂s = 0

[figure: z₁ as a function of s₁, with s₂ fixed: a step function that jumps from 0 to 1 at s₁ = s₂ and is flat everywhere else]

Example: Regression with latent categorization

[figure: computation graph: input x → embeddings u → scores s → latent category z → hidden v → output ŷ]

  u = (1/|x|) ∑_j E_{x_j}        s = W_s u
  z = e_c (one-hot indicator of the predicted topic c)
  v = tanh(W_v [u, z])           ŷ = W_y v
  L = (ŷ − y)²

The gradient with respect to W_s vanishes, since ∂z/∂s ≡ 0:

  ∂L/∂W_s = (∂L/∂ŷ)(∂ŷ/∂v)(∂v/∂z)(∂z/∂s)(∂s/∂W_s) = 0

Workarounds (circumventing the issue by bypassing discrete variables) vs. tackling discreteness end-to-end:

  Option 1. Pretrain the latent classifier W_s (with a supervised term − log exp(s_c)/Z)
  Option 2. Multi-task learning (the same − log exp(s_c)/Z term, trained jointly)
  Option 3. Stochasticity: with L = E_z(ŷ(z) − y)², we get ∂L/∂W_s ≠ 0
  Option 4. Gradient surrogates (e.g., straight-through: ∂z/∂s ← I)
  Option 5. Continuous relaxation (e.g., softmax: use v = tanh(W_v [u, p]))
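A tiny numeric sketch of the vanishing gradient above, with made-up 2-dimensional weights and a fixed input representation (all values here are hypothetical): a finite-difference probe shows that perturbing W_s leaves the loss unchanged when z is a hard argmax, while a softmax relaxation (Option 5) lets a gradient flow.

```python
import math

def softmax(s):
    m = max(s)
    e = [math.exp(x - m) for x in s]
    Z = sum(e)
    return [x / Z for x in e]

def loss(ws, hard):
    u = [0.3, -0.1]                      # fixed "sentence representation" (hypothetical)
    s = [ws[0] * u[0], ws[1] * u[1]]     # latent-topic scores, a stand-in for s = W_s u
    if hard:                             # z = one-hot argmax: non-differentiable
        c = max(range(2), key=lambda k: s[k])
        z = [1.0 if k == c else 0.0 for k in range(2)]
    else:                                # z ≈ p = softmax(s): continuous relaxation
        z = softmax(s)
    yhat = 2.0 * z[0] - 1.0 * z[1]       # toy downstream predictor
    return (yhat - 0.5) ** 2             # squared loss against target y = 0.5

ws, eps = [1.0, 1.0], 1e-3
# Finite-difference gradient with respect to the first weight:
hard_grad = (loss([ws[0] + eps, ws[1]], True) - loss([ws[0] - eps, ws[1]], True)) / (2 * eps)
soft_grad = (loss([ws[0] + eps, ws[1]], False) - loss([ws[0] - eps, ws[1]], False)) / (2 * eps)
assert hard_grad == 0.0                  # argmax is locally constant: no learning signal
assert abs(soft_grad) > 0.0              # the relaxation passes gradient through
```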



Dealing with discrete latent variables

1. Pre-train an external classifier
2. Multi-task learning
3. Stochastic latent variables (Part 2)
4. Gradient surrogates (Part 3)
5. Continuous relaxation (Part 4)

Roadmap of the tutorial

• Part 1: Introduction
• Part 2: Reinforcement learning
• Part 3: Gradient surrogates
(Coffee break)
• Part 4: End-to-end differentiable models
• Part 5: Conclusions

II. Reinforcement Learning Methods

Latent structure via marginalization

• Given a sentence-label pair (x, y) and its known parse tree z, we can make a prediction ŷ(z; x) and incur a loss L(ŷ(z; x), y), or simply L(z)
• But we don't know z!
• In this section: we jointly learn a structured prediction model π_θ(z | x) by optimizing the expected loss,

  E_{π_θ(z|x)} L(z)


But first, supervised SPINN

Stack-augmented Parser-Interpreter Neural Network (SPINN) [Bowman et al., 2016]

• Joint learning: combines a constituency parser and a sentence representation model.
• The parser, f_θ(x), is a transition-based shift-reduce parser. It looks at the top two elements of the stack and the top element of the buffer.
• A TreeLSTM combines the top two elements of the stack when the parser chooses the reduce action.

[figure: SPINN composing a sentence with shift and reduce actions, animated over several slides]


Shift-reduce parsing

We can write a shift-reduce style parse as a sequence of Bernoulli random variables,

  z = z₁, ..., z_{2L−1},  where z_j ∈ {0, 1} for all j ∈ [1, 2L − 1]

Shift-reduce parsing

A sequence of Bernoulli trials, but with conditional dependence:

  p(z₁, z₂, ..., z_{2L−1}) = ∏_{j=1}^{2L−1} p(z_j | z_{<j})
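The conditional dependence above also encodes validity: whether shift or reduce is allowed depends on the actions taken so far. A minimal sketch (with a hypothetical constant policy; a real model would condition the probability on the parser state):

```python
import random

def sample_transitions(L, policy=lambda state: 0.5, seed=0):
    """Sample a valid shift(=0)/reduce(=1) sequence of length 2L-1 for L tokens."""
    rng = random.Random(seed)
    buffer_size, stack_size, z = L, 0, []
    for _ in range(2 * L - 1):
        can_shift = buffer_size > 0
        can_reduce = stack_size >= 2
        if can_shift and can_reduce:
            # p(z_j | z_<j): here a placeholder policy over the current state
            a = 1 if rng.random() < policy((stack_size, buffer_size)) else 0
        else:
            a = 0 if can_shift else 1    # only one action is valid
        if a == 0:
            buffer_size -= 1
            stack_size += 1
        else:
            stack_size -= 1              # reduce merges the top two stack elements
        z.append(a)
    return z

z = sample_transitions(5)
# Any valid sequence has exactly L shifts and L-1 reduces.
assert len(z) == 9 and z.count(0) == 5 and z.count(1) == 4
```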

Latent structure learning with SPINN

• But now, remove syntactic supervision from SPINN.
• We model the parse, z, as a latent variable, with our parser as the score function estimator, f_θ(x).
• With shift-reduce parsing, we're making discrete decisions ⇒ REINFORCE is a "natural" solution.


Unsupervised SPINN

No syntactic supervision. The only reward comes from the downstream task, and we only get this reward after parsing the full sentence.

SPINN with REINFORCE [Williams, 1992]

Some basic terminology:
• The action space is z_j ∈ {shift, reduce}, and z is a sequence of actions.
• We train the parser network parameters, θ, with REINFORCE.
• The state, h, is the top two elements of the stack and the top element of the buffer.
• We learn a policy network π(z | h; θ).
• We maximize the reward R: performance on the downstream task, e.g., sentence classification.

NOTE: Only a single reward at the end of parsing.


Through the looking glass of REINFORCE

∇_θ E_{z∼π_θ(z|x)}[L(z)]
  = ∇_θ ∑_z L(z) π_θ(z | x)                     (by definition of expectation; how to evaluate?)
  = ∑_z L(z) ∇_θ π_θ(z | x)                     (swap gradient and sum)
  = ∑_z L(z) π_θ(z | x) ∇_θ log π_θ(z | x)      (log-derivative trick: ∇_θ π = π ∇_θ log π)
  = E_{z∼π_θ(z|x)}[L(z) ∇_θ log π_θ(z | x)]
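The identity above can be sanity-checked numerically on a toy categorical "structure space" (3 structures, softmax policy, invented losses): the Monte Carlo score-function estimate converges to the exact gradient of the expected loss.

```python
import math
import random

def softmax(s):
    m = max(s)
    e = [math.exp(x - m) for x in s]
    Z = sum(e)
    return [x / Z for x in e]

theta = [0.2, -0.4, 0.1]          # scores over 3 toy "structures" (hypothetical)
L = [1.0, 3.0, 0.5]               # a fixed loss per structure (hypothetical)

def expected_loss(th):
    p = softmax(th)
    return sum(Lz * pz for Lz, pz in zip(L, p))

# Exact gradient w.r.t. theta_0, by central finite differences:
eps = 1e-5
exact = (expected_loss([theta[0] + eps, *theta[1:]])
         - expected_loss([theta[0] - eps, *theta[1:]])) / (2 * eps)

# REINFORCE estimate of the same quantity: E[L(z) * d log pi(z) / d theta_0],
# where d log pi(z)/d theta_0 = 1[z=0] - pi(0) for a softmax policy.
rng = random.Random(0)
p = softmax(theta)
est, n = 0.0, 200_000
for _ in range(n):
    r = rng.random()
    z = 0 if r < p[0] else (1 if r < p[0] + p[1] else 2)
    est += L[z] * ((1.0 if z == 0 else 0.0) - p[0])
est /= n

assert abs(est - exact) < 0.02    # the score-function estimator is unbiased
```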


SPINN with REINFORCE, aka RL-SPINN

Yogatama et al. [2017] use REINFORCE to train SPINN!
However, this vanilla implementation isn't very effective at learning syntax: the model fails to solve a simple toy problem.


Toy problem: ListOps [Nangia and Bowman, 2018]

[max 2 9 [min 4 7 ] 0 ]   (evaluates to 9)

Toy problem: ListOps [Nangia and Bowman, 2018]

Accuracy and self-agreement:

  Model         | Accuracy μ(σ) | max  | Self F1
  LSTM          | 71.5 (1.5)    | 74.4 | -
  RL-SPINN      | 60.7 (2.6)    | 64.8 | 30.8
  Random Trees  | -             | -    | 30.1

F1 w.r.t. left-branching (LB), right-branching (RB), and ground-truth (GT) trees, and average tree depth:

  Model          | LB   | RB   | GT    | Avg. Depth
  48D RL-SPINN   | 64.5 | 16.0 | 32.1  | 14.6
  128D RL-SPINN  | 43.5 | 13.0 | 71.1  | 10.4
  GT Trees       | 41.6 | 8.8  | 100.0 | 9.6
  Random Trees   | 24.0 | 24.0 | 24.2  | 5.2

But why?


RL-SPINN's Troubles

This system faces at least two big problems:
1. High variance of gradients
2. Coadaptation



High variance

• We have a single reward at the end of parsing.
• We are sampling parses from a very large search space! Catalan number of binary trees.

3 tokens ⇒ 5 trees
5 tokens ⇒ 42 trees
10 tokens ⇒ 16796 trees

• And the policy is stochastic.

deep-spin.github.io/tutorial 41
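These counts are Catalan numbers, Cn = (1/(n+1)) · (2n choose n); a minimal sketch (plain Python, no dependencies) reproducing the numbers quoted on the slide:

```python
def catalan(n):
    # C_0 = 1; C_{k+1} = C_k * 2(2k + 1) / (k + 2), exact with integer division
    c = 1
    for k in range(n):
        c = c * 2 * (2 * k + 1) // (k + 2)
    return c

print(catalan(3), catalan(5), catalan(10))  # 5 42 16796
```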


High variance

So, sometimes the policy lands in a “rewarding state”:

[sm [sm [sm [max 5 6 ] 2 ] 0 ] 5 0 8 6 ]
Figure: Truth: 7; Pred: 7

Sometimes it doesn’t:

[max [med [med 1 [sm 3 1 3 ] 9 ] 6 ] 5 ]
Figure: Truth: 6; Pred: 5

deep-spin.github.io/tutorial 42

High variance

A Catalan number of parses means we need many, many samples to lower the variance!

Possible solutions:
1. Gradient normalization
2. Control variates, aka baselines

deep-spin.github.io/tutorial 43


Control variates

• A simple control variate: a moving average of recent rewards
• Parameters are updated using the advantage, the difference between the reward R and the baseline prediction b(x).

So,

∇ Ez∼π(z)[L(z)] = Ez∼π(z)[(L(z) − b(x)) ∇ log π(z)]

which we can do because

∑z b(x) ∇π(z) = b(x) ∑z ∇π(z) = b(x) ∇1 = 0

deep-spin.github.io/tutorial 44
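As an illustration of the baseline idea (our sketch, not the tutorial’s code; the toy rewards, learning rate, and decay are made up), a REINFORCE update on a 3-armed bandit where the advantage R − b(x) replaces the raw reward. Here we maximize a reward, i.e. R = −L:

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.zeros(3)                      # logits of a categorical policy over 3 actions
baseline, decay, lr = 0.0, 0.9, 0.1
reward = np.array([0.0, 0.0, 1.0])   # toy rewards: action 2 is best

for _ in range(2000):
    p = np.exp(s - s.max()); p /= p.sum()         # softmax policy
    z = rng.choice(3, p=p)                        # sample an action
    R = reward[z]
    advantage = R - baseline                      # control variate
    grad_logp = -p; grad_logp[z] += 1.0           # ∇s log π(z) = e_z − p
    s += lr * advantage * grad_logp               # REINFORCE ascent step
    baseline = decay * baseline + (1 - decay) * R # moving average of rewards

print(np.argmax(s))                               # policy concentrates on action 2
```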


Issues with SPINN with REINFORCE

This system faces two big problems:
1. High variance of gradients
2. Coadaptation

deep-spin.github.io/tutorial 45

Coadapta on problem

Learning composition function parameters φ with backpropagation, and parser parameters θ with REINFORCE.

Generally, φ will be learned more quickly than θ, making it harder to explore the parsing search space and optimize θ.

The cause: the difference in variance of the two gradient estimates.

Possible solution:
Proximal Policy Optimization (Schulman et al., 2017)

deep-spin.github.io/tutorial 46


Making REINFORCE+SPINN work

Havrylov et al. [2019] use:
1. An input-dependent control variate
2. Gradient normalization
3. Proximal Policy Optimization

They solve ListOps! However, the model does not learn English grammar.

deep-spin.github.io/tutorial 47


Should I? Shouldn’t I?

• Unbiased!
• In a simple setting, with enough tricks, it can work!

• High variance
• Has not yet been very effective at learning English syntax.

deep-spin.github.io/tutorial 48


III. Gradient Surrogates

So far:
• Tackled the expected loss in a stochastic computation graph

Eπθ(z|x)[L(z)]

• Optimized with the REINFORCE estimator.
• Struggled with variance & sampling.

In this section:
• Consider the deterministic alternative:

pick the “best” structure z(x) := argmaxz∈M πθ(z | x), incur loss L(z(x))

• 3A: try to optimize the deterministic loss directly
• 3B: use this strategy to reduce variance in the stochastic model.

deep-spin.github.io/tutorial 49


Recap: The argmax problem

z = argmax(s),  z ∈ {1, 2, . . . , N}

∂z/∂s = 0

(Figure: plotted against s1, the argmax output z1 is a step function — piecewise constant, so its gradient is zero almost everywhere.)

deep-spin.github.io/tutorial 50

Softmax

p = softmax(s),  pj = exp(sj)/Z

∂p/∂s = diag(p) − pp⊤

(Figure: plotted against s1, the softmax output p1 is a smooth, sigmoid-shaped curve.)

deep-spin.github.io/tutorial 51
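The Jacobian above is easy to check numerically; a small numpy sketch comparing diag(p) − pp⊤ against central finite differences:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

s = np.array([0.5, -1.0, 2.0])
p = softmax(s)
J = np.diag(p) - np.outer(p, p)        # ∂p/∂s = diag(p) − pp⊤

# Finite-difference check: column j approximates ∂p/∂s_j
eps = 1e-6
J_num = np.column_stack([
    (softmax(s + eps * np.eye(3)[j]) - softmax(s - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(J, J_num, atol=1e-6))   # True
```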

Straight-Through Estimator
[Hinton, 2012, Bengio et al., 2013]

• Forward: z = argmax(s)
• Backward: pretend z was some continuous p; ∂p/∂s ≠ 0
  • simplest: identity, p(s) = s, ∂p/∂s = I
  • others, e.g. softmax: p(s) = softmax(s), ∂p/∂s = diag(p) − pp⊤
• More explanation in a while

What about the structured case?

deep-spin.github.io/tutorial 52
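A minimal numpy sketch of the straight-through idea (the function names are ours): the forward pass emits a hard one-hot argmax, while the backward pass routes the incoming gradient through a chosen surrogate Jacobian:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def ste_forward(s):
    z = np.zeros_like(s)
    z[np.argmax(s)] = 1.0                # hard one-hot forward pass
    return z

def ste_backward(s, grad_z, surrogate="softmax"):
    # Pretend the forward pass had been a continuous p(s)
    if surrogate == "identity":
        return grad_z                    # ∂p/∂s = I
    p = softmax(s)
    J = np.diag(p) - np.outer(p, p)      # ∂p/∂s for the softmax surrogate
    return J @ grad_z

s = np.array([1.0, 3.0, 0.5])
z = ste_forward(s)
print(z)                                 # [0. 1. 0.]
g = ste_backward(s, np.array([0.0, 1.0, 0.0]))
print(np.all(g != 0))                    # gradient reaches every score: True
```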


Dealing with the combinatorial explosion

1. Incremental structures
• Build the structure greedily, as a sequence of discrete choices (e.g., shift-reduce).
• Scores (partial structure, action) tuples.
• Advantages: flexible, rich histories.
• Disadvantages: greedy; local decisions are suboptimal; error propagation.

2. Factorization into parts
• Optimizes globally (e.g. Viterbi, Chu-Liu-Edmonds, Kuhn-Munkres).
• Scores smaller parts.
• Advantages: optimal, elegant, can handle hard & global constraints.
• Disadvantages: strong assumptions.

deep-spin.github.io/tutorial 53

STE for incremental structures

• Build a structure as a sequence of discrete choices (e.g., shift-reduce)
• Assign a score to any (partial structure, action) tuple.
• In this case, we just apply the straight-through estimator at each step:
• Forward: take the highest-scoring action at each step
• Backward: pretend that we had used a differentiable surrogate function

Example: Latent Tree Learning with Differentiable Parsers: Shift-Reduce Parsing and Chart Parsing [Maillard and Clark, 2018] (STE through beam search).

deep-spin.github.io/tutorial 54


STE for the factorized approach

Requires a bit more work:
• Recap: the marginal polytope
• Predicting structures globally: Maximum A Posteriori (MAP)
• Deriving Straight-Through and SPIGOT

deep-spin.github.io/tutorial 55

The structured case: Marginal polytope[Wainwright and Jordan, 2008]

• Each vertex corresponds to one such bit vector z
• Points inside correspond to marginal distributions: convex combinations of structured objects

μ = p1z1 + . . . + pNzN  (exponentially many terms),  p ∈ ∆.

p1 = 0.2, z1 = [1,0,0,0,1,0,0,0,1]
p2 = 0.7, z2 = [0,0,1,0,0,1,1,0,0]
p3 = 0.1, z3 = [1,0,0,0,1,0,0,1,0]

⇒ μ = [.3, 0, .7, 0, .3, .7, .7, .1, .2]

(Figure: the marginal polytope M, with the structures z at its vertices.)

deep-spin.github.io/tutorial 56
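The example above is easy to verify; a small numpy sketch computing μ as the convex combination of the three bit vectors:

```python
import numpy as np

p = np.array([0.2, 0.7, 0.1])
Z = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 1, 0],
], dtype=float)

mu = p @ Z          # μ = p1·z1 + p2·z2 + p3·z3
print(mu)           # ≈ [.3, 0, .7, 0, .3, .7, .7, .1, .2]
```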


Predic ng structures from scores of parts

• η(i→ j): score of arc i→ j
• z(i→ j): is arc i→ j selected?
• Task-specific algorithm for the highest-scoring structure.

(Figure: pipeline η → z → output ŷ.)

deep-spin.github.io/tutorial 57


Algorithms for specific structures

Structure             Best structure (MAP)                   Marginals
Sequence tagging      Viterbi [Rabiner, 1989]                Forward-Backward [Rabiner, 1989]
Constituent trees     CKY [Kasami, 1966, Younger, 1967,      Inside-Outside [Baker, 1979]
                      Cocke and Schwartz, 1970]
Temporal alignments   DTW [Sakoe and Chiba, 1978]            Soft-DTW [Cuturi and Blondel, 2017]
Dependency trees      Max. Spanning Arborescence             Matrix-Tree [Kirchhoff, 1847]
                      [Chu and Liu, 1965, Edmonds, 1967]
Assignments           Kuhn-Munkres [Kuhn, 1955,              #P-complete [Valiant, 1979, Taskar, 2004]
                      Jonker and Volgenant, 1987]

deep-spin.github.io/tutorial 58

Structured Straight-Through

• Forward pass: find the highest-scoring structure:

z = argmaxz∈Z η⊤z

• Backward pass: pretend we used μ = η.

(Figure: forward η → z; backward, gradients flow as if through the continuous μ.)

deep-spin.github.io/tutorial 59
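A tiny sketch of this forward pass (our toy example: the structure set Z is small enough to enumerate, which real problems are not): score each structure as η⊤z, take the argmax, and on the backward pass treat the output as the continuous μ = η, i.e. an identity surrogate:

```python
import numpy as np

# Toy structure set: each row is one bit vector z (real Z is exponentially large)
Z = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
], dtype=float)
eta = np.array([0.2, 0.9, 0.4])       # scores of the parts

scores = Z @ eta                       # η⊤z for every structure
z = Z[np.argmax(scores)]               # forward: highest-scoring structure
print(z)                               # [0. 1. 1.]

grad_z = np.array([1.0, -2.0, 0.5])    # some downstream gradient ∂L/∂z
grad_eta = grad_z                      # backward: pretend μ = η, so ∂μ/∂η = I
```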

Straight-Through Estimator Revisited

• In the forward pass, z = argmax(s).
• If we had labels (multi-task learning): LMTL = L(y(z), y) + Lhid(s, ztrue)
• One choice: the perceptron loss Lhid(s, ztrue) = s⊤z − s⊤ztrue;  ∂Lhid/∂s = z − ztrue.
• We don’t have labels! Induce labels by “pulling back” the downstream target: the “best” (unconstrained) latent value would be argminz∈ℝD L(y(z), y)
• One gradient descent step starting from z: ztrue ← z − ∂L/∂z

∂LMTL/∂s = ∂L/∂s (= 0) + ∂Lhid/∂s = z − ztrue = z − (z − ∂L/∂z) = ∂L/∂z

[Martins and Niculae, 2019]

deep-spin.github.io/tutorial 60


Straight-Through in the structured case
[Peng et al., 2018, Martins and Niculae, 2019]

• Structured STE: a perceptron update with an induced annotation

argminμ∈ℝD L(y(μ), y) ≈ z − ∇zL(z) → ztrue

(one step of gradient descent)

• SPIGOT takes the constraints into account; it uses the induced annotation

argminμ∈M L(y(μ), y) ≈ ProjM(z − ∇zL(z)) → ztrue

(one step of projected gradient descent!)

• We discuss a generic way to compute the projection in part 4.

(Figure: STE steps from z to ztrue = z − ∇zL(z); SPIGOT projects z − ∇zL(z) back onto M to obtain ztrue.)

deep-spin.github.io/tutorial 61
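For intuition only (our sketch; the structured projection onto M is the subject of part 4): in the unstructured case, M is the probability simplex, where the Euclidean projection has a simple sort-based algorithm, and one SPIGOT-style step looks like:

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex (sort-based)
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(v)) + 1))[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# One SPIGOT-style step in the unstructured case, where M is the simplex:
z = np.array([0.0, 1.0, 0.0])        # current (one-hot) structure
grad = np.array([0.5, 0.8, -0.3])    # ∇zL(z), made up for the example
z_true = project_simplex(z - grad)   # ProjM(z − ∇zL(z))
print(z_true)                        # stays on the simplex: sums to 1, nonnegative
```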


Summary: Straight-Through Es mator

We saw how to use the Straight-Through Estimator to learn models with an argmax in the middle of the computation graph.

We were optimizing L(z(x)).

Now we will see how to apply STE to stochastic graphs, as an alternative to REINFORCE.

deep-spin.github.io/tutorial 62


Stochas c node in the computa on graph

Recall the stochastic objective:

Eπθ(z|x)[L(z)]

• REINFORCE (previous section). High variance.
• An alternative: the reparameterization trick [Kingma and Welling, 2014].

deep-spin.github.io/tutorial 63


Categorical reparameterization
[Jang et al., 2017, Maddison et al., 2016]

• Sampling a categorical value in the middle of the computation graph:
z ∼ πθ(z | x) ∝ exp sθ(z | x)
• What is the gradient of a sample, ∂z/∂θ?!
• Reparameterization: move the stochasticity out of the gradient path.
• Makes z deterministic w.r.t. s!

(Figure: the noise ε enters as a stochastic input; z is computed deterministically from s + ε.)

deep-spin.github.io/tutorial 64


As a result:

• The stochasticity is moved to an input.
• We can backpropagate through the deterministic path from s to z.

deep-spin.github.io/tutorial 64

Categorical reparameterization
[Jang et al., 2017, Maddison et al., 2016]

How do we sample from a categorical variable?

deep-spin.github.io/tutorial 65


Sampling from a categorical variable

We want to sample from a categorical variable with scores s (class i has score si).

1. Inverse transform sampling:
• p = softmax(s)
• ci = ∑j≤i pj  (with c0 = 0)
• u ∼ Uniform(0,1)
• return z = et s.t. ct−1 ≤ u < ct

2. The Gumbel-Max trick:
• ui ∼ Uniform(0,1)
• εi = − log(− log(ui))
• z = argmax(s + ε)

The two methods are equivalent. (Not obvious, but we will not prove it now.) The Gumbel-Max trick amounts to sampling ε from the standard Gumbel distribution G(0,1).

Derivation & more info: [Adams, 2013, Vieira, 2014]

We have an argmax again and cannot backpropagate!

deep-spin.github.io/tutorial 66
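The two samplers can be checked against each other empirically; a numpy sketch drawing many samples with each method and comparing the empirical frequencies to p = softmax(s):

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.array([0.5, 1.5, -0.2])
p = np.exp(s) / np.exp(s).sum()
n = 200_000

# 1. Inverse transform sampling: bucket u by the cumulative probabilities
u = rng.uniform(size=n)
inv = np.searchsorted(np.cumsum(p), u)

# 2. Gumbel-Max trick: argmax(s + ε), ε standard Gumbel
eps = -np.log(-np.log(rng.uniform(size=(n, 3))))
gum = np.argmax(s + eps, axis=1)

freq_inv = np.bincount(inv, minlength=3) / n
freq_gum = np.bincount(gum, minlength=3) / n
print(np.round(freq_inv, 2), np.round(freq_gum, 2))  # both ≈ softmax(s)
```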



Straight-Through Gumbel Estimator
Apply a variant of the Straight-Through Estimator to Gumbel-Max!
[Jang et al., 2017, Maddison et al., 2016]

• Forward: z = argmax(s + ε)
• Backward: pretend we had done p = softmax(s + ε)

[Diagram: s → + → z, with Gumbel noise 𝝐i = −log(−log(ui)), ui ∼ U(0,1); the backward pass uses p]

What about the structured case?

deep-spin.github.io/tutorial 67
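In numpy, the forward pass and its softmax surrogate look roughly like this (a sketch under our own naming and an assumed temperature parameter `tau`; a deep learning framework would wire `p`'s gradient to `z` automatically):

```python
import numpy as np

def st_gumbel_forward(s, u, tau=1.0):
    """Forward: a hard one-hot sample z via Gumbel-Max.
    Backward (not run here): pretend z had been p = softmax((s + eps) / tau)."""
    eps = -np.log(-np.log(u))          # Gumbel(0,1) noise from uniform draws u
    z = np.zeros_like(s)
    z[np.argmax(s + eps)] = 1.0        # hard forward output
    logits = (s + eps) / tau           # softmax surrogate used in the backward pass
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return z, p
```

In an autodiff framework the same trick is often written as `z_st = (z - p).detach() + p`, so the forward value is hard while gradients flow through `p` only.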


Dealing with the combinatorial explosion

1. Incremental structures
• Build the structure greedily, as a sequence of discrete choices (e.g., shift-reduce).
• Scores (partial structure, action) tuples.
• Advantages: flexible, rich histories.
• Disadvantages: greedy; local decisions are suboptimal; error propagation.

2. Factorization into parts
• Optimizes globally (e.g., Viterbi, Chu-Liu-Edmonds, Kuhn-Munkres).
• Scores smaller parts.
• Advantages: optimal, elegant, can handle hard & global constraints.
• Disadvantages: strong assumptions.

deep-spin.github.io/tutorial 68

Sampling from incremental structures

• Build a structure as a sequence of discrete choices (e.g., shift-reduce).
• Assign a score to any (partial structure, action) tuple.
• Reparameterize the scores with Gumbel-Max: now we have a deterministic node.
• Forward: the argmax of the reparameterized scores at each step.
• Backward: pretend we had used a differentiable surrogate function.

Example: Gumbel Tree-LSTM [Choi et al., 2018].

deep-spin.github.io/tutorial 69


Example: Gumbel Tree-LSTM
[Choi et al., 2018]

• Builds task-specific tree structures.
• Straight-Through Gumbel-Softmax at each step to select one arc.

[Figure: candidate merges of "The cat sat on" ("The cat", "cat sat", "sat on"),
scored v1 = 0.5, v2 = 0.1, v3 = 0.4 against a query vector q]

deep-spin.github.io/tutorial 70

Sampling from factorized models: Perturb-and-MAP
[Papandreou and Yuille, 2011, Corro and Titov, 2019a,b]

Reparameterize by perturbing the arc scores. (Inexact!)

• Sample from the standard Gumbel distribution: ε ∼ G(0,1).
• Perturb the arc scores with the Gumbel noise: η̃ = η + ε.
• Compute the MAP (task-specific algorithm): argmaxz∈Z η̃⊤z.
• Backward: we could use Straight-Through with Identity.

deep-spin.github.io/tutorial 71
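A minimal Perturb-and-MAP sketch, with the MAP step done by brute force over an explicit list of candidate structures; in practice the argmax would be a task-specific algorithm such as Viterbi or Chu-Liu-Edmonds. The function name and setup are ours, for illustration only.

```python
import numpy as np

def perturb_and_map(eta, structures, u):
    """Perturb the part scores eta with Gumbel noise, then take the MAP structure.
    `structures` is an (N, d) array of 0/1 indicator vectors z; MAP here is brute force."""
    eps = -np.log(-np.log(u))           # one Gumbel draw per part (inexact: not per structure)
    scores = structures @ (eta + eps)   # perturbed score of each candidate structure
    return structures[int(np.argmax(scores))]
```

Because the noise is added per part rather than per structure, this is only an approximate sampler, as the slide notes.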


Summary: Gradient surrogates

• Based on the Straight-Through Estimator.
• Can be used for stochastic or deterministic computation graphs.
• Forward pass: get an argmax (which might be structured).
• Backward pass: use a function which we hope is close to argmax.
• Examples:
  • Argmax for incremental structures and factorization into parts
  • Sampling from incremental structures and factorization into parts

deep-spin.github.io/tutorial 72

Gradient surrogates: Pros and cons

Pros
• Do not suffer from the high-variance problem of REINFORCE.
• Flexibility to select or sample a latent structure in the middle of the computation graph.
• Efficient computation.

Cons
• Gumbel sampling with Perturb-and-MAP is an approximation.
• Bias, due to the function mismatch in backpropagation
  (the next section will address this problem).

deep-spin.github.io/tutorial 73

Overview

E πθ(z|x) [L(z)]
• REINFORCE
• Straight-Through Gumbel (Perturb & MAP)
• SparseMAP

L(argmaxz πθ(z | x))
• Straight-Through
• SPIGOT

L(E πθ(z|x) [z])
• Structured Attention Nets
• SparseMAP

And more, after the break!

deep-spin.github.io/tutorial 74


IV. End-to-end Differentiable Relaxations

End-to-end differentiable relaxations

1. Digging into softmax
2. Alternatives to softmax
3. Generalizing to structured prediction
4. Stochasticity and global structures

Recall: Discrete choices & differentiability

[Figure: input x → scores s = fθ(x) → argmax selects one of z = 1, 2, . . . , N → output y = gφ(z, x).
The plot of p1 against s1 is a step function: ∂z/∂s = 0 or n/a.]

deep-spin.github.io/tutorial 76

One solution: smooth relaxations

[Figure: input x → scores s = fθ(x) → softmax → p → output y = gφ(p, x).
The plot of p1 against s1 is now smooth: ∂p/∂s is well defined.]

p = softmax(s) = E[z], i.e., replace E[f(z)] with f(E[z])

deep-spin.github.io/tutorial 76
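The smooth relaxation has a usable Jacobian everywhere: for softmax, ∂p/∂s = diag(p) − pp⊤. A quick numpy check (illustrative, with our own helper names), verifiable against finite differences:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())  # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(s):
    """dp/ds = diag(p) - p p^T: finite everywhere, unlike argmax's zero gradient."""
    p = softmax(s)
    return np.diag(p) - np.outer(p, p)
```

Note that each column of the Jacobian sums to zero, since p always sums to one.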


Overview

E πθ(z|x) [L(z)]
• REINFORCE
• Straight-Through Gumbel (Perturb & MAP)
• SparseMAP

L(argmaxz πθ(z | x))
• Straight-Through
• SPIGOT

L(E πθ(z|x) [z])
• Structured Attention Nets
• SparseMAP

And more, after the break!

deep-spin.github.io/tutorial 77


What is softmax?
Often defined via pi = exp si / ∑j exp sj, but where does it come from?

p ∈ ∆: probability distribution over choices
Expected score under p: Ei∼p[si] = p⊤s
argmax maximizes the expected score: argmaxp∈∆ p⊤s
Shannon entropy of p: H(p) = −∑i pi log pi
softmax maximizes expected score + entropy: argmaxp∈∆ p⊤s + H(p)

[Figure: the simplex with vertices (1,0,0), (0,1,0), (0,0,1) and objective contours for
s = [.7, .1, 1.5]. The linear objective p⊤s is maximized at the vertex p⋆ = [0,0,1];
adding H(p) pulls the maximizer into the interior of the simplex.]

deep-spin.github.io/tutorial 78

Variational form of softmax
Proposition. The unique solution to argmaxp∈∆ p⊤s + H(p) is given by pj = exp sj / ∑i exp si.

Explicit form of the optimization problem:
  maximize ∑j pjsj − pj log pj
  subject to p ≥ 0, p⊤1 = 1

Lagrangian:
  L(p, ν, τ) = −∑j (pjsj − pj log pj) − p⊤ν + τ(p⊤1 − 1)

Optimality conditions (KKT):
  0 = ∇pi L(p, ν, τ) = −si + log pi + 1 − νi + τ
  p⊤ν = 0,  p ∈ ∆,  ν ≥ 0

Thus log pi = si + νi − (τ + 1).
If pi = 0, the l.h.s. is −∞ while the r.h.s. is finite; thus pi > 0, and so νi = 0.
Hence pi = exp(si)/exp(τ + 1) = exp(si)/Z.

We must find Z such that ∑j pj = 1. Answer: Z = ∑j exp(sj).
So, pi = exp(si) / ∑j exp(sj).

Classic result, e.g., [Boyd and Vandenberghe, 2004, Wainwright and Jordan, 2008]

deep-spin.github.io/tutorial 79
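The proposition can also be checked numerically: the optimal value of p⊤s + H(p) is log ∑j exp sj, attained at softmax(s) and beaten by no other distribution. A sketch with our own helper names:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())  # shift for numerical stability
    return e / e.sum()

def objective(p, s):
    """The entropy-regularized objective p^T s + H(p)."""
    pz = p[p > 0]  # 0 log 0 = 0 convention
    return float(p @ s - np.sum(pz * np.log(pz)))
```

The optimal value equals the log-partition function log Z, which is why softmax and log-sum-exp appear together so often.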



The α-entmax generalized entropy interpolates in between softmax and sparsemax [Tsallis, 1988].
Used in Sparse Seq2Seq [Peters et al., 2019] (Mon 13:50, poster session 2D).

Further regularizers Ω give other useful mappings:
• fusedmax:   Ω(p) = ½∥p∥₂² + Σ_j |p_j − p_{j−1}|
• csparsemax: Ω(p) = ½∥p∥₂² + ι(a ≤ p ≤ b)
• csoftmax:   Ω(p) = Σ_j p_j log p_j + ι(a ≤ p ≤ b)

[Niculae and Blondel, 2017, Martins and Astudillo, 2016, Blondel et al., 2019, Martins and Kreutzer, 2017, Malaviya et al., 2018]

deep-spin.github.io/tutorial 80
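The sparsemax mapping above has a closed-form solution as a Euclidean projection onto the simplex [Martins and Astudillo, 2016]. A minimal NumPy sketch of the sorting-based projection (illustrative code, not the tutorial's own implementation):

```python
import numpy as np

def sparsemax(s):
    """Euclidean projection of a score vector s onto the probability simplex."""
    z = np.sort(s)[::-1]                    # scores in decreasing order
    k = np.arange(1, len(s) + 1)
    cssv = np.cumsum(z)
    support = z * k > cssv - 1              # coordinates kept in the support
    k_star = k[support][-1]
    tau = (cssv[support][-1] - 1) / k_star  # threshold subtracted from scores
    return np.maximum(s - tau, 0.0)

p = sparsemax(np.array([0.0, 0.5, 1.0]))    # some coordinates become exactly 0
```

Unlike softmax, the output assigns exactly zero probability to low-scoring entries, which is what makes the structured analogue (SparseMAP) tractable downstream.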

The structured case: Marginal polytope [Wainwright and Jordan, 2008]

• Each vertex of the marginal polytope M corresponds to one such bit vector z
• Points inside correspond to marginal distributions: convex combinations of structured objects

μ = p1 z1 + … + pN zN  (exponentially many terms),  p ∈ △.

Example:
p1 = 0.2, z1 = [1,0,0,0,1,0,0,0,1]
p2 = 0.7, z2 = [0,0,1,0,0,1,1,0,0]
p3 = 0.1, z3 = [1,0,0,0,1,0,0,1,0]
⇒ μ = [.3, 0, .7, 0, .3, .7, .7, .1, .2]

deep-spin.github.io/tutorial 81
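The numeric example above can be verified directly: the marginal vector μ is just a convex combination of the structure indicator vectors. A standalone sketch:

```python
import numpy as np

# Structures as bit vectors (vertices of the marginal polytope M)
Z = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 1, 0],
], dtype=float)
p = np.array([0.2, 0.7, 0.1])  # distribution over the three structures

mu = p @ Z                     # convex combination = marginal vector
```

Each coordinate of μ is the probability mass of the structures that contain that part, matching the slide's μ = [.3, 0, .7, 0, .3, .7, .7, .1, .2].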

[Niculae et al., 2018a]

argmax      argmax_{p∈△} p⊤s
softmax     argmax_{p∈△} p⊤s + H(p)
sparsemax   argmax_{p∈△} p⊤s − ½∥p∥²

MAP         argmax_{μ∈M} μ⊤η
marginals   argmax_{μ∈M} μ⊤η + H̃(μ)
SparseMAP   argmax_{μ∈M} μ⊤η − ½∥μ∥²

Just like softmax relaxes argmax, marginals relax MAP differentiably!
Unlike argmax/softmax, computation is not obvious!

deep-spin.github.io/tutorial 82

Algorithms for specific structures

                      Best structure (MAP)                              Marginals
Sequence tagging      Viterbi [Rabiner, 1989]                           Forward-Backward [Rabiner, 1989]
Constituent trees     CKY [Kasami, 1966, Younger, 1967,                 Inside-Outside [Baker, 1979]
                      Cocke and Schwartz, 1970]
Temporal alignments   DTW [Sakoe and Chiba, 1978]                       Soft-DTW [Cuturi and Blondel, 2017]
Dependency trees      Max. Spanning Arborescence                        Matrix-Tree [Kirchhoff, 1847]
                      [Chu and Liu, 1965, Edmonds, 1967]
Assignments           Kuhn-Munkres [Kuhn, 1955,                         #P-complete [Valiant, 1979, Taskar, 2004]
                      Jonker and Volgenant, 1987]

(the first three rows are solvable by dynamic programming)

deep-spin.github.io/tutorial 83

Derivatives of marginals 1: DP

Dynamic programming: marginals by Forward-Backward, Inside-Outside, etc.

• Alg. consists of differentiable ops: PyTorch autograd can handle it! (v. bad idea)
• Better book-keeping: Li and Eisner [2009], Mensch and Blondel [2018]
• With circular dependencies, this breaks! Can get an approximation: Stoyanov et al. [2011]

Marginals in a sequence tagging model.
1 input: d tags, n tokens, ηU ∈ R^{n×d}, ηV ∈ R^{d×d}
2 initialize α_1 = 0, β_n = 0
3 for i ∈ 2, …, n do   # forward log-probabilities
4   α_{i,k} = log Σ_{k′} exp(α_{i−1,k′} + (ηU)_{i,k} + (ηV)_{k′,k})   for all k
5 for i ∈ n − 1, …, 1 do   # backward log-probabilities
6   β_{i,k} = log Σ_{k′} exp(β_{i+1,k′} + (ηU)_{i+1,k′} + (ηV)_{k,k′})   for all k
7 Z = Σ_k exp(α_{n,k})   # partition function
8 return μ = exp(α + β − log Z)   # marginals

[Figure: tagging lattice with unary scores (ηU)_{i,NN}, (ηU)_{i,VB}, (ηU)_{i,PR} per token and transition scores ηV]

deep-spin.github.io/tutorial 84
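The pseudocode above translates almost line-for-line into NumPy. This sketch uses a common equivalent formulation that folds the unary scores into α (not the slides' exact indexing), and returns token-level tag marginals:

```python
import numpy as np

def logsumexp(a, axis=-1):
    # numerically stable log-sum-exp along one axis
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def crf_marginals(eta_u, eta_v):
    """Token-level marginals mu[i, k] = p(tag_i = k) via forward-backward."""
    n, d = eta_u.shape
    alpha = np.zeros((n, d))
    beta = np.zeros((n, d))
    alpha[0] = eta_u[0]
    for i in range(1, n):                  # forward log-messages
        alpha[i] = eta_u[i] + logsumexp(alpha[i - 1][:, None] + eta_v, axis=0)
    for i in range(n - 2, -1, -1):         # backward log-messages
        beta[i] = logsumexp(beta[i + 1] + eta_u[i + 1] + eta_v, axis=1)
    log_z = logsumexp(alpha[-1], axis=-1)  # log partition function
    return np.exp(alpha + beta - log_z)

# With zero transition scores, the marginals reduce to per-token softmax:
mu = crf_marginals(np.array([[1.0, 0.0], [0.0, 2.0]]), np.zeros((2, 2)))
```

Every operation here (add, exp, log, sum) is differentiable, which is exactly why autograd *can* run through it, even though specialized backward passes are preferable.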

Derivatives of marginals 2: Matrix-Tree

L(s): Laplacian of the edge score graph

Z = det L(s); the marginals μ are obtained from L(s)^{-1}

∇μ involves ∇L^{-1} = −L^{-1} (∂L/∂η) L^{-1}

deep-spin.github.io/tutorial 85
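The determinant-of-Laplacian connection above rests on Kirchhoff's Matrix-Tree theorem. A toy unweighted illustration (counting spanning trees of an undirected graph; the dependency-parsing variant uses a root-augmented directed Laplacian, which this sketch does not implement):

```python
import numpy as np

def spanning_tree_count(adj):
    """Kirchhoff's theorem: any cofactor of the Laplacian L = D - A
    equals the number of spanning trees of the graph."""
    lap = np.diag(adj.sum(axis=1)) - adj
    return round(np.linalg.det(lap[1:, 1:]))  # delete one row/column, take det

triangle = np.ones((3, 3)) - np.eye(3)  # complete graph K3
n_trees = spanning_tree_count(triangle)  # K3 has 3 spanning trees
```

With edge weights exp(η) in place of 0/1 entries, the same determinant sums over all spanning trees, giving the partition function Z used above.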

Structured Attention Networks [Kim et al., 2017]

input x → η → μ → output y

Sequence case:
• η(i): score of word i receiving attention
• η(i, i + 1): score of consecutive words receiving attention
• μ(i): probability of word i getting attention

Tree case ("dog on wheels"):
• η(dog → on): arc score (tree constraints)
• μ(dog → on): probability of arc

CRF marginals (from forward–backward) give attention weights ∈ (0, 1).
Similar idea for projective dependency trees with inside–outside,
and non-projective with the Matrix-Tree theorem [Liu and Lapata, 2018].

deep-spin.github.io/tutorial 86

Differentiable Perturb & Parse [Corro and Titov, 2019a,b]

Extending Gumbel-Softmax to structured stochastic models

• Forward pass: sample a structure z (approximately)
  z = argmax_{z∈Z} (η + ε)⊤z
• Backward pass: pretend we did marginal inference
  μ = argmax_{μ∈M} (η + ε)⊤μ + H̃(μ)
  (or some similar relaxation)

deep-spin.github.io/tutorial 87
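The forward pass above relies on the Gumbel-max trick: perturbing scores with i.i.d. Gumbel noise and taking an argmax yields exact samples in the unstructured categorical case (the structured case is only approximate). A minimal categorical sketch:

```python
import numpy as np

def gumbel_max_sample(scores, rng):
    """Sample an index with probability softmax(scores) via perturb-and-argmax."""
    eps = rng.gumbel(size=scores.shape)  # i.i.d. Gumbel(0, 1) noise
    return int(np.argmax(scores + eps))

rng = np.random.default_rng(0)
scores = np.array([0.0, 1.0, 2.0])
freqs = np.bincount([gumbel_max_sample(scores, rng) for _ in range(20000)],
                    minlength=3) / 20000
# empirical frequencies approximate softmax(scores)
```

In Perturb & Parse, the argmax over indices is replaced by a MAP call over structures, with noise added to the arc scores η.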

Back-propagating through marginals

Pros:
• Familiar algorithms for NLPers,
• (Structured Attention Networks:) All computations exact.

Cons:
• (Structured Attention Networks:) forward pass marginals are dense;
  (fixed by Perturb & MAP, at the cost of a rough approximation)
• Efficient & numerically stable back-propagation through DPs is tricky;
  (somewhat alleviated by Mensch and Blondel [2018])
• Not applicable when marginals are unavailable.
• Case-by-case algorithms required, can get tedious.

deep-spin.github.io/tutorial 88


SparseMAP solution [Niculae et al., 2018a]

μ⋆ = argmax_{μ∈M} μ⊤η − ½∥μ∥²

[Figure: μ⋆ decomposes as .6 × one parse tree + .4 × another]

(μ⋆ is unique, but may have multiple decompositions p. Active Set recovers a sparse one.)

deep-spin.github.io/tutorial 90

Algorithms for SparseMAP

μ⋆ = argmax_{μ∈M} μ⊤η − ½∥μ∥²

A quadratic objective with linear constraints (alas, exponentially many!).

Conditional Gradient [Frank and Wolfe, 1956, Lacoste-Julien and Jaggi, 2015]
• select a new corner of M: argmax_{μ∈M} μ⊤(η − μ^{(t−1)}), i.e., a MAP call with adjusted scores η̃ = η − μ^{(t−1)}
• update the (sparse) coefficients of p
• Update rules: vanilla, away-step, pairwise
• Quadratic objective: Active Set, a.k.a. Min-Norm Point [Wolfe, 1976]
  [Martins et al., 2015, Nocedal and Wright, 1999, Vinyes and Obozinski, 2017]

Active Set achieves finite & linear convergence!
Completely modular: just add MAP.

Backward pass
• ∂μ/∂η is sparse
• computing (∂μ/∂η)⊤ dy takes O(dim(μ) · nnz(p⋆))

deep-spin.github.io/tutorial 91
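Conditional gradient only ever touches the polytope through the MAP oracle. Over a tiny polytope with enumerable vertices we can watch it recover the SparseMAP solution; this is a vanilla variant with exact line search on the quadratic objective (illustrative only; the slides' Active Set method is more sophisticated):

```python
import numpy as np

def sparsemap_cg(eta, vertices, n_iter=50):
    """Maximize mu'eta - 0.5*||mu||^2 over conv(vertices) by conditional gradient.
    The 'MAP oracle' here is an argmax over the enumerated vertex set."""
    p = np.zeros(len(vertices))
    p[0] = 1.0
    mu = vertices[0].astype(float).copy()
    for _ in range(n_iter):
        grad = eta - mu                       # gradient of the objective at mu
        i = int(np.argmax(vertices @ grad))   # MAP call with adjusted scores
        d = vertices[i] - mu                  # search direction toward new corner
        denom = d @ d
        if denom < 1e-12:
            break
        gamma = np.clip((d @ grad) / denom, 0.0, 1.0)  # exact line search
        mu += gamma * d
        p *= 1.0 - gamma                      # keep the sparse coefficients p
        p[i] += gamma
    return mu, p

# On the simplex (vertices = one-hot vectors), SparseMAP reduces to sparsemax:
mu, p = sparsemap_cg(np.array([0.1, 0.2]), np.eye(2))
```

Note that p tracks an explicit sparse decomposition of μ over the visited corners, which is what the backward pass exploits.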

[Figure: latent word alignments between "A gentleman overlooking a neighborhood situation." and "A police officer watches a situation closely.", comparing the dense alignments of Chen et al. [2017] with the sparse alignments of Niculae et al. [2018a]]

Overview

E_{πθ(z|x)}[L(z)]:
• REINFORCE
• Straight-Through Gumbel (Perturb & MAP)
• SparseMAP

L(argmax_z πθ(z | x)):
• Straight-Through
• SPIGOT

L(E_{πθ(z|x)}[z]):
• Structured Attn. Nets
• SparseMAP

And more, after the break!

deep-spin.github.io/tutorial 94

Structured latent variables without sampling

EzL(z)=∑z∈Z

Ly

φ

(z)π

θ

(z | x)

How to de ne πθ? ∑h∈H

∂EL(z)

∂θidea 1 πθ(z) ∝ exp

fθ(z)

so maxidea 2 πθ(z) = 1 if z =MAP(fθ(·)) else 0 argmaxidea 3 SparseMAP

e.g., a TreeLSTM de ned by zsum overall possible trees

parsing model,using some scorer fθ(z; x)

Exponen ally large sum!

All methods we’ve seen require sampling; hard in general.STE / SPIGOT relax y in backward.

deep-spin.github.io/tutorial 95

Structured latent variables without sampling

EzL(z)=∑z∈Z

Lyφ(z)πθ(z | x)

How to de ne πθ? ∑h∈H

∂EL(z)

∂θidea 1 πθ(z) ∝ exp

fθ(z)

so maxidea 2 πθ(z) = 1 if z =MAP(fθ(·)) else 0 argmaxidea 3 SparseMAP

e.g., a TreeLSTM de ned by zsum overall possible trees

parsing model,using some scorer fθ(z; x)

Exponen ally large sum!

All methods we’ve seen require sampling; hard in general.STE / SPIGOT relax y in backward.

deep-spin.github.io/tutorial 95

Structured latent variables without sampling

EzL(z)=∑z∈Z

Lyφ(z)πθ(z | x)

How to de ne πθ? ∑h∈H

∂EL(z)

∂θidea 1 πθ(z) ∝ exp

fθ(z)

so maxidea 2 πθ(z) = 1 if z =MAP(fθ(·)) else 0 argmaxidea 3 SparseMAP

e.g., a TreeLSTM de ned by z

sum overall possible trees

parsing model,using some scorer fθ(z; x)

Exponen ally large sum!

All methods we’ve seen require sampling; hard in general.STE / SPIGOT relax y in backward.

deep-spin.github.io/tutorial 95

Structured latent variables without sampling

EzL(z)=∑z∈Z

Lyφ(z)πθ(z | x)

How to de ne πθ? ∑h∈H

∂EL(z)

∂θidea 1 πθ(z) ∝ exp

fθ(z)

so maxidea 2 πθ(z) = 1 if z =MAP(fθ(·)) else 0 argmaxidea 3 SparseMAP

e.g., a TreeLSTM de ned by z

sum overall possible trees

parsing model,using some scorer fθ(z; x)

Exponen ally large sum!

All methods we’ve seen require sampling; hard in general.STE / SPIGOT relax y in backward.

deep-spin.github.io/tutorial 95

Structured latent variables without sampling

EzL(z)=∑z∈Z

Lyφ(z)πθ(z | x)

How to de ne πθ? ∑h∈H

∂EL(z)

∂θidea 1 πθ(z) ∝ exp

fθ(z)

so maxidea 2 πθ(z) = 1 if z =MAP(fθ(·)) else 0 argmaxidea 3 SparseMAP

e.g., a TreeLSTM de ned by zsum overall possible trees

parsing model,using some scorer fθ(z; x)

Exponen ally large sum!

All methods we’ve seen require sampling; hard in general.STE / SPIGOT relax y in backward.

deep-spin.github.io/tutorial 95

Structured latent variables without sampling

EzL(z)=∑z∈Z

Lyφ(z)πθ(z | x)

How to de ne πθ?

∑h∈H

∂EL(z)

∂θ

idea 1

πθ(z) ∝ expfθ(z)

so max

idea 2

πθ(z) = 1 if z =MAP(fθ(·)) else 0 argmax

idea 3

SparseMAP

e.g., a TreeLSTM de ned by zsum overall possible trees

parsing model,using some scorer fθ(z; x)

Exponen ally large sum!

All methods we’ve seen require sampling; hard in general.STE / SPIGOT relax y in backward.

deep-spin.github.io/tutorial 95

Structured latent variables without sampling

EzL(z)=∑z∈Z

Lyφ(z)πθ(z | x)

How to de ne πθ? ∑h∈H

∂EL(z)

∂θ

idea 1

πθ(z) ∝ expfθ(z)

so max

idea 2

πθ(z) = 1 if z =MAP(fθ(·)) else 0 argmax

idea 3

SparseMAP

e.g., a TreeLSTM de ned by zsum overall possible trees

parsing model,using some scorer fθ(z; x)

Exponen ally large sum!

All methods we’ve seen require sampling; hard in general.STE / SPIGOT relax y in backward.

deep-spin.github.io/tutorial 95

Structured latent variables without sampling

EzL(z)=∑z∈Z

Lyφ(z)πθ(z | x)

How to de ne πθ? ∑h∈H

∂EL(z)

∂θidea 1

πθ(z) ∝ expfθ(z)

so max

idea 2

πθ(z) = 1 if z =MAP(fθ(·)) else 0 argmax

idea 3

SparseMAP

e.g., a TreeLSTM de ned by zsum overall possible trees

parsing model,using some scorer fθ(z; x)

Exponen ally large sum!

All methods we’ve seen require sampling; hard in general.STE / SPIGOT relax y in backward.

deep-spin.github.io/tutorial 95

Structured latent variables without sampling

EzL(z)=∑z∈Z

Lyφ(z)πθ(z | x)

How to de ne πθ? ∑h∈H

∂EL(z)

∂θidea 1 πθ(z) ∝ exp

fθ(z)

so maxidea 2

πθ(z) = 1 if z =MAP(fθ(·)) else 0 argmax

idea 3

SparseMAP

e.g., a TreeLSTM de ned by zsum overall possible trees

parsing model,using some scorer fθ(z; x)

Exponen ally large sum!

All methods we’ve seen require sampling; hard in general.STE / SPIGOT relax y in backward.

deep-spin.github.io/tutorial 95

Structured latent variables without sampling

EzL(z)=∑z∈Z

Lyφ(z)πθ(z | x)

How to de ne πθ? ∑h∈H

∂EL(z)

∂θidea 1 πθ(z) ∝ exp

fθ(z)

so maxidea 2

πθ(z) = 1 if z =MAP(fθ(·)) else 0 argmax

idea 3

SparseMAP

e.g., a TreeLSTM de ned by zsum overall possible trees

parsing model,using some scorer fθ(z; x)

Exponen ally large sum!

All methods we’ve seen require sampling; hard in general.STE / SPIGOT relax y in backward.

deep-spin.github.io/tutorial 95

Structured latent variables without sampling

EzL(z)=∑z∈Z

Lyφ(z)πθ(z | x)

How to de ne πθ? ∑h∈H

∂EL(z)

∂θidea 1 πθ(z) ∝ exp

fθ(z)

so maxidea 2

πθ(z) = 1 if z =MAP(fθ(·)) else 0 argmax

idea 3

SparseMAP

e.g., a TreeLSTM de ned by zsum overall possible trees

parsing model,using some scorer fθ(z; x)

Exponen ally large sum!

All methods we’ve seen require sampling; hard in general.STE / SPIGOT relax y in backward.

deep-spin.github.io/tutorial 95

Structured latent variables without sampling

EzL(z)=∑z∈Z

Lyφ(z)πθ(z | x)

How to de ne πθ? ∑h∈H

∂EL(z)

∂θidea 1 πθ(z) ∝ exp

fθ(z)

so maxidea 2

πθ(z) = 1 if z =MAP(fθ(·)) else 0 argmax

idea 3

SparseMAP

e.g., a TreeLSTM de ned by zsum overall possible trees

parsing model,using some scorer fθ(z; x)

Exponen ally large sum!

All methods we’ve seen require sampling; hard in general.

STE / SPIGOT relax y in backward.

deep-spin.github.io/tutorial 95

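The expectation above can be computed exactly only when Z is tiny. As a minimal sketch (toy scores and losses of my own, not from the tutorial), the snippet below enumerates a three-structure Z under idea 1's softmax distribution, and checks the sampling-based REINFORCE (score-function) gradient estimator against the exact gradient:

```python
import numpy as np

# Toy "structured" latent variable: Z has only 3 structures, so we can
# enumerate the expectation exactly and sanity-check REINFORCE against it.
rng = np.random.default_rng(0)

theta = np.array([0.5, -0.2, 0.1])   # scores f_theta(z), one per structure
losses = np.array([1.0, 0.3, 0.6])   # downstream loss L(z), one per structure

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

pi = softmax(theta)                  # idea 1: pi_theta(z) propto exp f_theta(z)
expected_loss = pi @ losses          # exact E_pi[L(z)]

# Exact gradient: dE[L]/dtheta_j = pi_j * (L_j - E[L]).
exact_grad = pi * (losses - expected_loss)

# REINFORCE: E[L(z) * grad_theta log pi_theta(z)], Monte Carlo estimate.
n = 200_000
samples = rng.choice(3, size=n, p=pi)
grad_log = np.eye(3)[samples] - pi   # grad of log-softmax for each sample
mc_grad = (losses[samples, None] * grad_log).mean(axis=0)
```

For real structured Z the enumeration is exponential, which is exactly why the slides turn to sampling, MAP, or marginal-based surrogates.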

Structured latent variables without sampling

(figure: a sparse combination of parse trees over a three-word sentence)

z̄ = .7 × z₁ + .3 × z₂ + 0 × z₃ + · · ·

E[L(z)] = .7 × L(z₁) + .3 × L(z₂)

recall our shorthand L(z) = L(yφ(z), y)

deep-spin.github.io/tutorial 96

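The sparse distribution above is the key: most structures get exactly zero probability, so the expectation needs only a handful of terms. As a hedged, unstructured illustration (toy scores of my own; SparseMAP generalizes this to structured Z via MAP oracles), here is sparsemax, the Euclidean projection of the score vector onto the probability simplex:

```python
import numpy as np

def sparsemax(s):
    # Projection onto the simplex (Martins and Astudillo, 2016):
    # find the threshold tau so that p = max(s - tau, 0) sums to 1.
    z = np.sort(s)[::-1]              # scores in decreasing order
    css = np.cumsum(z)
    k = np.arange(1, len(s) + 1)
    support = 1 + k * z > css         # prefix of coordinates kept in the support
    k_max = k[support][-1]
    tau = (css[support][-1] - 1) / k_max
    return np.maximum(s - tau, 0.0)

scores = np.array([2.0, 1.8, -1.0, -2.0])
p = sparsemax(scores)                 # low-scoring entries get exactly 0
```

Unlike softmax, which spreads mass over every structure, the zeros here are exact, which is what makes the sparse expectation on the slide computable term by term.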

Stanford Sentiment (Accuracy)
  Bigram Naive Bayes [Socher et al.]               83.1
  TreeLSTM w/ CoreNLP [Niculae et al., 2018b]      83.2
  TreeLSTM w/ SparseMAP [Niculae et al., 2018b]    84.7
  GCN w/ CoreNLP [Corro and Titov, 2019b]          83.8
  GCN w/ Perturb-and-MAP [Corro and Titov, 2019b]  84.6

Stanford Natural Language Inference (Accuracy)
  Simple Attention [Kim et al., 2017]              86.2
  Structured Attention [Kim et al., 2017]          86.8
  100D SAN [Liu and Lapata, 2018]                  86.8
  100D RL-SPINN [Yogatama et al.]                  80.5
  100D ST Gumbel-Tree [Choi et al., 2018]          82.6
  300D ST Gumbel-Tree [Choi et al., 2018]          85.6
  600D ST Gumbel-Tree [Choi et al., 2018]          86.0
  Latent Tree + 1 GCN [Corro and Titov, 2019b]     85.2
  Latent Tree + 2 GCN [Corro and Titov, 2019b]     86.2

deep-spin.github.io/tutorial 97

V. Conclusions

Is it syntax?!

• Unlike e.g. unsupervised parsing, the structures we learn are guided by a downstream objective (typically discriminative).
• They don't typically resemble grammatical structure (yet) [Williams et al., 2018] (future work: more inductive biases and constraints?)
• It is common to compare latent structures with parser outputs. But is this always a meaningful comparison?

deep-spin.github.io/tutorial 98


Syntax vs. Composition Order [Niculae et al., 2018b]

(figure: latent trees over "⋆ lovely and poignant ." — the highest-probability latent tree, p = 22.6%; the CoreNLP parse, p = 21.4%; remaining trees elided)

deep-spin.github.io/tutorial 99


Syntax vs. Composition Order [Niculae et al., 2018b]

(figure: latent trees over "⋆ a deep and meaningful film ." — the two highest-probability latent trees, p = 15.33% and p = 15.27%, with edge weights 1.0; the CoreNLP parse receives p = 0%)

deep-spin.github.io/tutorial 100


Overview

Eπθ(z|x)[L(z)]
• REINFORCE (SPL)
• Straight-Through Gumbel (Perturb & MAP) (SPL, MRG)
• SparseMAP (MAP+)

L(argmaxz πθ(z | x))
• Straight-Through (MAP, MRG)
• SPIGOT (MAP+)

L(Eπθ(z|x)[z])
• Structured Attn. Nets (MRG)
• SparseMAP (MAP+)

Computation:
SPL: Sampling. (Simple in incremental/unstructured cases, hard for most global structures.)
MAP: Finding the highest-scoring structure.
MRG: Marginal inference.

And more, after the break!

deep-spin.github.io/tutorial 101
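The straight-through row in the overview can be made concrete with a minimal unstructured sketch (toy numbers of my own, not the tutorial's code): the forward pass commits to the hard argmax, while the backward pass substitutes the softmax Jacobian as a surrogate gradient:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

s = np.array([1.5, 0.3, -0.8])          # scores over 3 candidate structures
p = softmax(s)

# Forward: the hard, discrete choice (one-hot argmax).
z_hard = np.eye(len(s))[np.argmax(s)]

# Toy downstream loss L(z) = w @ z, linear in z for simplicity.
w = np.array([0.2, -0.5, 0.9])
loss = w @ z_hard

# Backward: the true gradient through argmax is zero almost everywhere,
# so straight-through pretends z was the softmax and uses its Jacobian
# J[i, j] = p[i] * (delta_ij - p[j]) as a surrogate.
J = np.diag(p) - np.outer(p, p)
ste_grad = J @ w                        # surrogate dL/ds
```

The surrogate is biased (forward and backward disagree about z), which is the price paid for avoiding sampling; SPIGOT refines the same idea with a projection step.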

Conclusions

• Latent structure models are desirable for interpretability, structural bias, and higher predictive power with fewer parameters.
• Stochastic latent variables can be handled with RL or straight-through gradients.
• Deterministic argmax requires surrogate gradients (e.g. SPIGOT).
• Continuous relaxations of argmax include structured attention networks and SparseMAP.
• Intuitively, some of these different methods are trying to do similar things or require the same building blocks (e.g. SPIGOT and SparseMAP).
• ... and we didn't even get into deep generative models! These tools apply, but there are new challenges. [Corro and Titov, 2019a, Kim et al., 2019a,b, Kawakami et al., 2019]

deep-spin.github.io/tutorial 102

References I

Ryan Adams. The Gumbel-max trick for discrete distributions, 2013. URL https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/. Blog post.
James K. Baker. Trainable grammars for speech recognition. The Journal of the Acoustical Society of America, 65(S1):S132–S132, 1979.
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
Mathieu Blondel, André F.T. Martins, and Vlad Niculae. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In Proc. of AISTATS, 2019.
Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. A fast unified model for parsing and sentence understanding. In Proc. of ACL, 2016. doi: 10.18653/v1/P16-1139.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. In Proc. of ACL, 2017.
Jihun Choi, Kang Min Yoo, and Sang-goo Lee. Learning to compose task-specific tree structures. In Proc. of AAAI, 2018.

deep-spin.github.io/tutorial 103

References II

Yoeng-Jin Chu and Tseng-Hong Liu. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400, 1965.
William John Cocke and Jacob T. Schwartz. Programming languages and their compilers. Courant Institute of Mathematical Sciences, 1970.
Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster, and Lyle Ungar. Spectral learning of latent-variable PCFGs. In Proc. of ACL, 2012.
Caio Corro and Ivan Titov. Differentiable Perturb-and-Parse: Semi-supervised parsing with a structured variational autoencoder. In Proc. of ICLR, 2019a.
Caio Corro and Ivan Titov. Learning latent trees with stochastic perturbations and differentiable dynamic programming. In Proc. of ACL, 2019b.
Marco Cuturi and Mathieu Blondel. Soft-DTW: a differentiable loss function for time-series. In Proc. of ICML, 2017.
Jack Edmonds. Optimum branchings. J. Res. Nat. Bur. Stand., 71B:233–240, 1967.
Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Nav. Res. Log., 3(1-2):95–110, 1956.
Serhii Havrylov, Germán Kruszewski, and Armand Joulin. Cooperative learning of disjoint syntax and semantics. In Proc. of NAACL-HLT, 2019.
Geoffrey Hinton. Neural networks for machine learning. In Coursera video lectures, 2012.
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In Proc. of ICLR, 2017.

deep-spin.github.io/tutorial 104

References III

Roy Jonker and Anton Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340, 1987.
Tadao Kasami. An efficient recognition and syntax-analysis algorithm for context-free languages. Coordinated Science Laboratory Report no. R-257, 1966.
Kazuya Kawakami, Chris Dyer, and Phil Blunsom. Learning to discover, ground and use words with segmental neural language models. In Proc. of ACL, 2019.
Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks. In Proc. of ICLR, 2017.
Yoon Kim, Chris Dyer, and Alexander Rush. Compound probabilistic context-free grammars for grammar induction. In Proc. of ACL, 2019a.
Yoon Kim, Alexander Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. Unsupervised recurrent neural network grammars. In Proc. of NAACL-HLT, 2019b.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. 2014.
Gustav Kirchhoff. Ueber die Auflösung der Gleichungen, auf welche man bei der Untersuchung der linearen Vertheilung galvanischer Ströme geführt wird. Annalen der Physik, 148(12):497–508, 1847.
Harold W. Kuhn. The Hungarian method for the assignment problem. Nav. Res. Log., 2(1-2):83–97, 1955.
Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Proc. of NeurIPS, 2015.

deep-spin.github.io/tutorial 105

References IV

Zhifei Li and Jason Eisner. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proc. of EMNLP, 2009.
Yang Liu and Mirella Lapata. Learning structured text representations. TACL, 6:63–75, 2018.
Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. In Proc. of ICLR, 2016.
Jean Maillard and Stephen Clark. Latent tree learning with differentiable parsers: Shift-Reduce parsing and chart parsing. arXiv preprint arXiv:1806.00840, 2018.
Chaitanya Malaviya, Pedro Ferreira, and André F.T. Martins. Sparse and constrained attention for neural machine translation. In Proc. of ACL, 2018.
André F.T. Martins and Ramón Fernandez Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proc. of ICML, 2016.
André F.T. Martins and Julia Kreutzer. Learning what's easy: Fully differentiable neural easy-first taggers. In Proc. of EMNLP, 2017.
André F.T. Martins and Vlad Niculae. Notes on latent structure models and SPIGOT. arXiv preprint arXiv:1907.10348, 2019.
André F.T. Martins, Mário A.T. Figueiredo, Pedro M.Q. Aguiar, Noah A. Smith, and Eric P. Xing. AD3: Alternating directions dual decomposition for MAP inference in graphical models. JMLR, 16(1):495–545, 2015.
Arthur Mensch and Mathieu Blondel. Differentiable dynamic programming for structured prediction and attention. In Proc. of ICML, 2018.

deep-spin.github.io/tutorial 106

References V

Nikita Nangia and Samuel Bowman. ListOps: A diagnostic dataset for latent tree learning. In Proc. of NAACL SRW, 2018.
Vlad Niculae and Mathieu Blondel. A regularized framework for sparse and structured neural attention. In Proc. of NeurIPS, 2017.
Vlad Niculae, André F.T. Martins, Mathieu Blondel, and Claire Cardie. SparseMAP: Differentiable sparse structured inference. In Proc. of ICML, 2018a.
Vlad Niculae, André F.T. Martins, and Claire Cardie. Towards dynamic computation graphs via sparse latent structure. In Proc. of EMNLP, 2018b.
Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer New York, 1999.
George Papandreou and Alan L. Yuille. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In Proc. of ICCV, 2011.
Hao Peng, Sam Thomson, and Noah A. Smith. Backpropagating through structured argmax using a SPIGOT. In Proc. of ACL, 2018.
Ben Peters, Vlad Niculae, and André F.T. Martins. Sparse sequence-to-sequence models. In Proc. of ACL, 2019.
Slav Petrov and Dan Klein. Discriminative log-linear grammars with latent variables. In Advances in Neural Information Processing Systems, pages 1153–1160, 2008.
Ariadna Quattoni, Sybor Wang, Louis-Philippe Morency, Michael Collins, and Trevor Darrell. Hidden conditional random fields. IEEE Transactions on Pattern Analysis & Machine Intelligence, 29(10):1848–1852, 2007.

deep-spin.github.io/tutorial 107

References VI

Lawrence R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. P. IEEE, 77(2):257–286, 1989.
Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. on Acoustics, Speech, and Sig. Proc., 26:43–49, 1978.
Veselin Stoyanov, Alexander Ropson, and Jason Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In Proc. of AISTATS, 2011.
Ben Taskar. Learning structured prediction models: A large margin approach. PhD thesis, Stanford University, 2004.
Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479–487, 1988.
Leslie G. Valiant. The complexity of computing the permanent. Theor. Comput. Sci., 8(2):189–201, 1979.
Tim Vieira. Gumbel-max trick, 2014. URL https://timvieira.github.io/blog/post/2014/07/31/gumbel-max-trick/. Blog post.
Marina Vinyes and Guillaume Obozinski. Fast column generation for atomic norm regularization. In Proc. of AISTATS, 2017.
Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference, volume 1. Now Publishers, Inc., 2008.
Adina Williams, Andrew Drozdov, and Samuel R. Bowman. Do latent tree learning models identify meaningful structure in sentences? TACL, 2018.

deep-spin.github.io/tutorial 108

References VII

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8, 1992.
Philip Wolfe. Finding the nearest point in a polytope. Mathematical Programming, 11(1):128–149, 1976.
Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Wang Ling. Learning to compose words into sentences with reinforcement learning. In Proc. of ICLR, 2017.
Daniel H. Younger. Recognition and parsing of context-free languages in time n³. Information and Control, 10(2):189–208, 1967.

deep-spin.github.io/tutorial 109