Reinforcement Learning and Optimal Control
ASU, CSE 691, Winter 2020
Dimitri P. Bertsekas
[email protected]
Lecture 5




Outline

1 Multiagent Rollout

2 Deterministic Problem Rollout with Constraints

3 Combinatorial Applications - Examples



The Pure Form of Rollout: For a Suboptimal Base Policy $\pi$, Use $\tilde J_{k+\ell}(x_{k+\ell}) = J_{k+\ell,\pi}(x_{k+\ell})$

$$\min_{u_k,\,\mu_{k+1},\ldots,\mu_{k+\ell-1}} E\Big\{ g_k(x_k,u_k,w_k) + \sum_{m=k+1}^{k+\ell-1} g_m\big(x_m,\mu_m(x_m),w_m\big) + \tilde J_{k+\ell}(x_{k+\ell}) \Big\}$$

[Figure: approximation in value space with $\ell$-step lookahead. The first $\ell$ steps are optimized, and the base policy cost approximates the "future" beyond them.]

Approximations (each operation could be approximate):
- DP minimization: replace $E\{\cdot\}$ with nominal values (certainty equivalent control), or use limited simulation (Monte Carlo tree search).
- Computation of $\tilde J_{k+\ell}$: simple choices, parametric approximation, problem approximation, rollout.

One-step lookahead:
$$\min_{u_k} E\big\{ g_k(x_k,u_k,w_k) + \tilde J_{k+1}(x_{k+1}) \big\}$$
$\ell$-step lookahead:
$$\min_{u_k,\,\mu_{k+1},\ldots,\mu_{k+\ell-1}} E\Big\{ g_k(x_k,u_k,w_k) + \sum_{m=k+1}^{k+\ell-1} g_m\big(x_m,\mu_m(x_m),w_m\big) + \tilde J_{k+\ell}(x_{k+\ell}) \Big\}$$


Policy improvement property (when no truncation is used):
$$J_{k,\tilde\pi}(x_k) \le J_{k,\pi}(x_k), \quad \text{for all } x_k \text{ and } k,$$
where $\tilde\pi$ is the rollout policy.
The key issue in the practical application of rollout: too much computation.
- If the problem is deterministic, the computation is greatly reduced (no Monte Carlo is needed).
- Another computational bottleneck: large control spaces, e.g., the multiagent case, where controls have many components, $u_k = (u_k^1, \ldots, u_k^m)$. A computational sketch follows.
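To make the computational issue concrete, here is a minimal Python sketch (not from the slides) of the pure rollout step with one-step lookahead: each Q-factor is estimated by Monte Carlo simulation of the base policy. The callables `controls`, `step`, and `base_policy` are assumed problem-specific, with `step` sampling the disturbance $w_k$ internally.

```python
def rollout_control(x, k, controls, step, base_policy, horizon, n_sims=10):
    """One-step rollout: estimate the Q-factor of each control by Monte
    Carlo simulation of the base policy, then pick the minimizing one."""
    best_u, best_q = None, float("inf")
    for u in controls(x, k):
        q = 0.0
        for _ in range(n_sims):
            # First stage: apply the candidate control u (w_k is sampled
            # inside step).
            x_next, total = step(x, u, k)
            # Remaining stages: follow the base policy.
            for m in range(k + 1, horizon):
                x_next, c = step(x_next, base_policy(x_next, m), m)
                total += c
            q += total / n_sims
        if q < best_q:
            best_u, best_q = u, q
    return best_u
```

Every candidate control requires `n_sims` full simulations of the base policy, which is the "too much computation" issue above; for deterministic problems a single run per control suffices.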


Trading off Control Space Complexity with State Space Complexity

[Figure: the multiagent reformulation. From state $x_k$, the control components are applied one at a time through intermediate states $(x_k,u_k^1),\ (x_k,u_k^1,u_k^2),\ \ldots,\ (x_k,u_k^1,\ldots,u_k^{m-1})$, followed by the random transition to $x_{k+1}=f_k(x_k,u_k,w_k)$ with random cost $g_k(x_k,u_k,w_k)$.]

An equivalent reformulation, "unfolding" the control action: the control space is simplified at the expense of $m-1$ additional layers of states, and corresponding $m-1$ cost-to-go functions
$$J_k^1(x_k,u_k^1),\ J_k^2(x_k,u_k^1,u_k^2),\ \ldots,\ J_k^{m-1}(x_k,u_k^1,\ldots,u_k^{m-1})$$
- Multiagent (one-component-at-a-time) rollout is just standard rollout for the reformulated problem.
- The increase in the size of the state space does not adversely affect rollout.
- The cost improvement property is maintained.
- Complexity reduction: the one-step lookahead branching factor is reduced from $n^m$ to $nm$, where $n$ is the number of possible choices for each component $u_k^i$.
- The base policy for $u_k^i$ may depend only on $x_k$, or may also depend on $u_k^{i-1}, u_k^{i-2}, \ldots$ (see the sketch below).
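As an illustration of the reformulated lookahead, a minimal sketch of the one-component-at-a-time minimization; the helper `q_factor(x, u)` is an assumed Q-factor estimator for a complete control tuple (e.g., by simulating the base policy as in the earlier sketch).

```python
def multiagent_rollout_step(x, base_controls, component_choices, q_factor):
    """One-component-at-a-time rollout at state x.

    base_controls: tuple (u^1, ..., u^m) chosen by the base policy at x.
    component_choices(i): the n possible values of component u^i.
    q_factor(x, u): Q-factor estimate of the complete control tuple u.

    Component i is optimized with components 1..i-1 fixed at the values
    already chosen by rollout and components i+1..m fixed at the base
    policy values, so only n*m Q-factors are evaluated instead of n**m.
    """
    u = list(base_controls)
    for i in range(len(u)):
        best_val, best_q = u[i], float("inf")
        for v in component_choices(i):
            u[i] = v
            q = q_factor(x, tuple(u))
            if q < best_q:
                best_val, best_q = v, q
        u[i] = best_val
    return tuple(u)
```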


A Deterministic Example: Multi-Vehicle Routing

[Figure: two vehicles (Vehicle 1, Vehicle 2) on a graph with nodes 1-12, capacity = 1; the optimal solution is shown.]

- $n$ vehicles move along the arcs of a given graph.
- Some of the nodes of the graph include a task to be performed by the vehicles. Each task will be performed only once.
- Find a route for each vehicle so that all the tasks are collectively performed by the vehicles in minimum time.
- Cost function choice: for each stage there is a cost of one unit for each task that is pending at the end of the stage.
- Total cost: the number of stages that a task is pending, summed over the tasks.
- What is the state? What is the control? Why is this a good multiagent candidate? (One possible encoding is sketched below.)
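One way to make the state and cost concrete, as an illustrative sketch (the encoding is an assumption, not from the slides): the state is the pair (vehicle positions, set of pending tasks), the control is one move per vehicle, and the stage cost is the number of tasks still pending.

```python
def routing_step(positions, pending, moves, graph):
    """One stage of the multi-vehicle routing problem.

    positions: tuple with the current node of each vehicle.
    pending:   frozenset of nodes whose task is still pending.
    moves:     tuple with the next node of each vehicle (a neighbor in
               graph, or the current node to stay put).
    Returns (new_positions, new_pending, stage_cost).
    """
    assert all(m == p or m in graph[p] for p, m in zip(positions, moves))
    new_positions = tuple(moves)
    # A task is performed as soon as some vehicle reaches its node.
    new_pending = frozenset(pending) - set(new_positions)
    # One unit of cost for each task still pending at the end of the stage.
    return new_positions, new_pending, len(new_pending)
```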


Multi-Agent Rollout for Multi-Vehicle Routing

[Figure: the same two-vehicle grid example (capacity = 1), illustrating the routes of the base heuristic and the optimal solution.]

Base heuristic:
- Each vehicle makes a move towards the pending task that is closest (in terms of the number of moves needed to reach it).
- The vehicles make their selections independently, one at a time, in the order $1, \ldots, n$, taking into account the tasks that have already been performed.
Questions:
- What is the solution produced by the base heuristic?
- What is the solution produced by the one-vehicle-at-a-time rollout? (A sketch of both is given below.)
- Do we get cost improvement? What is the intuition?
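An illustrative sketch of the base heuristic and of the one-vehicle-at-a-time rollout step, assuming an undirected graph given as a dict of neighbor lists; `simulate_cost` is an assumed helper that runs the base heuristic to completion from the given configuration and returns the total cost.

```python
from collections import deque

def bfs_dist(graph, src):
    """Number of moves from src to every reachable node."""
    dist, frontier = {src: 0}, deque([src])
    while frontier:
        v = frontier.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                frontier.append(w)
    return dist

def nearest_task_move(graph, pos, pending):
    """Base heuristic: one move toward the closest pending task."""
    if not pending:
        return pos
    dist = bfs_dist(graph, pos)
    target = min(pending, key=lambda t: dist.get(t, float("inf")))
    if target == pos:
        return pos
    dist_t = bfs_dist(graph, target)  # undirected: distances to the target
    return min(graph[pos], key=lambda w: dist_t.get(w, float("inf")))

def one_vehicle_rollout_move(graph, positions, pending, i, simulate_cost):
    """Rollout for vehicle i: try each legal move (or staying put),
    complete the future with the base heuristic via simulate_cost,
    and keep the cheapest move."""
    candidates = list(graph[positions[i]]) + [positions[i]]
    def cost_of(move):
        trial = list(positions)
        trial[i] = move
        return simulate_cost(tuple(trial), pending)
    return min(candidates, key=cost_of)
```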


Constrained Deterministic Rollout

[Figure: rollout for a deterministic problem, illustrated on a traveling salesman problem. From the current partial tour, each candidate next city is evaluated by completing the tour with the nearest neighbor heuristic, and the resulting Q-factors $Q_{i,n}+R_{i,n}$ are compared.]

For every pair $(x_k,u_k)$, $u_k \in U_k(x_k)$, we generate a Q-factor
$$Q_k(x_k,u_k) = g_k(x_k,u_k) + H_{k+1}\big(f_k(x_k,u_k)\big)$$
using the base heuristic [$H_{k+1}(x_{k+1})$ is the heuristic cost starting from $x_{k+1}$].
- Select the control $u_k$ with minimal Q-factor, move to the next state $x_{k+1}$, and continue (see the sketch below).
- What if there are constraints, i.e., future control choices are constrained by past control choices?
- The base heuristic and rollout should then be modified (e.g., to avoid controls that compromise the feasibility of future controls).
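A minimal sketch of this deterministic rollout loop; `controls`, `f`, `g`, and `heuristic_cost` are assumed problem-specific callables, with `heuristic_cost(x, k)` playing the role of $H_k(x)$.

```python
def deterministic_rollout(x0, horizon, controls, f, g, heuristic_cost):
    """Pure rollout for a deterministic problem: at each state pick the
    control minimizing Q_k(x, u) = g_k(x, u) + H_{k+1}(f_k(x, u))."""
    x, trajectory, total = x0, [x0], 0.0
    for k in range(horizon):
        u = min(controls(x, k),
                key=lambda v: g(x, v, k) + heuristic_cost(f(x, v, k), k + 1))
        total += g(x, u, k)
        x = f(x, u, k)
        trajectory += [u, x]
    return trajectory, total
```

Since there is no disturbance, each Q-factor needs only a single run of the base heuristic, not Monte Carlo simulation.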


Constrained Example: Multi-Vehicle Routing with Constraints

[Figure: the two-vehicle routing example with capacity = 1 and its optimal solution.]

Examples of constraints:
- Vehicle capacity constraints (a limit on how many tasks some vehicles can perform).
- Vehicle specializations (some tasks can be performed only by some of the vehicles).
- Time windows (some tasks must be performed within specified time intervals).


How to Deal with Constraints Across Stages in Deterministic Problems

Consider a deterministic optimal control problem with system $x_{k+1} = f_k(x_k,u_k)$.
A complete trajectory is a sequence
$$T = (x_0, u_0, x_1, u_1, \ldots, u_{N-1}, x_N)$$
We consider the (very general) problem
$$\min_{T \in C} G(T)$$
where $G$ is a given cost function and $C$ is a given constraint set of trajectories.
We transform to an unconstrained problem in order to apply rollout ideas: redefine the state to be the partial trajectory
$$y_k = (x_0, u_0, x_1, \ldots, u_{k-1}, x_k)$$
The partial trajectory evolves according to a redefined system equation:
$$y_{k+1} = \big(y_k, u_k, f_k(x_k,u_k)\big)$$
The problem becomes to minimize $G(y_N)$ subject to the constraint $y_N \in C$ (see the sketch below).
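A one-function sketch of the redefined system equation, with the partial trajectory stored as a flat tuple (an illustrative encoding, not from the slides).

```python
def augmented_step(y, u, f):
    """Redefined system equation on the trajectory state: the state
    y_k = (x_0, u_0, ..., u_{k-1}, x_k) is stored as a flat tuple, and
    applying u_k appends (u_k, x_{k+1}) with x_{k+1} = f_k(x_k, u_k)."""
    k = len(y) // 2          # y_k holds k controls and k + 1 states
    x_next = f(y[-1], u, k)  # the current physical state is the last entry
    return y + (u, x_next)
```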


Rollout Algorithm - Partial Trajectory-Dependent Base Heuristic

[Figure: constrained rollout. The partial trajectory $y_k$ is extended by each control $u_k$, the base heuristic completes it with $R(y_{k+1})$, and the complete trajectory is $T_k(y_k,u_k) = \big(y_k, u_k, R(y_{k+1})\big)$.]

Given yk = (x0, u0, x1, u1, . . . , uk−1, xk), consider all controls uk and the corresponding next states xk+1.

Extend yk to obtain the partial trajectories yk+1 = (yk, uk, xk+1).

Run the base heuristic from each yk+1 to obtain the partial trajectory R(yk+1).

Join the partial trajectories yk+1 and R(yk+1) to obtain the complete trajectories, denoted Tk(yk, uk) = (yk, uk, R(yk+1)).

Find the set of controls Uk(yk) for which Tk(yk, uk) is feasible, i.e., Tk(yk, uk) ∈ C.

Choose the control ūk ∈ Uk(yk) according to the minimization

\[
\bar u_k \in \arg\min_{u_k \in U_k(y_k)} G\bigl(T_k(y_k, u_k)\bigr).
\]

(A code sketch of these steps follows the slide.)

Bertsekas Reinforcement Learning 12 / 22
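As a minimal sketch of these steps, here is one stage of the minimization in code; the helper names (extensions, complete, is_feasible, G) are assumptions for illustration, not notation fixed by the slides.

```python
def constrained_rollout_step(y, extensions, complete, is_feasible, G):
    """One stage of the constrained rollout algorithm above (helpers assumed).

    y           : partial trajectory y_k = (x0, u0, x1, ..., u_{k-1}, x_k)
    extensions  : extensions(y) -> iterable of (u_k, y_next), where y_next is
                  the extended partial trajectory y_{k+1} = (y, u_k, x_{k+1})
    complete    : complete(y_next) -> R(y_next), the base heuristic completion
    is_feasible : is_feasible(traj) -> True iff traj lies in the set C
    G           : G(traj) -> cost of a complete trajectory
    """
    best = None
    for u, y_next in extensions(y):
        traj = y_next + complete(y_next)   # T_k(y_k, u_k) = (y_{k+1}, R(y_{k+1}))
        if is_feasible(traj) and (best is None or G(traj) < G(best[2])):
            best = (u, y_next, traj)
    if best is None:
        raise RuntimeError("U_k(y_k) is empty: no feasible complete trajectory")
    u_bar, y_next, _ = best
    return u_bar, y_next      # chosen control and the new permanent trajectory
```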


Rollout Algorithm Properties


The notions of sequential consistency and sequential improvement apply. Part of their definition includes that the set of "feasible" controls Uk(yk) is nonempty for all k; see the notes.

There is a fortified version (it follows the best feasible trajectory found so far). It has the cost improvement property, assuming the base heuristic generates a feasible trajectory starting from the initial condition y0 = x0; a sketch of the fortified bookkeeping follows the slide.

There is a multiagent version that uses one-control-component-at-a-time minimization.

Additional variants are possible; see the notes.

Bertsekas Reinforcement Learning 13 / 22
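To make the fortified variant concrete, here is a minimal sketch of its bookkeeping, reusing the helper conventions of the earlier sketch; controls, step, complete, is_feasible, and G are illustrative assumptions, and the strict-improvement test is one reasonable reading of "follows the best feasible trajectory", not a definitive statement of the notes' definition.

```python
def fortified_rollout(x0, N, controls, step, complete, is_feasible, G):
    """Sketch of fortified constrained rollout (helper conventions as above).

    Trajectories are flat tuples (x0, u0, x1, ..., u_{N-1}, x_N).
    controls(y) -> candidate controls u_k at the partial trajectory y
    step(x, u)  -> next state reached by applying control u at state x
    """
    y = (x0,)
    best = y + complete(y)        # tentative best trajectory; must be feasible
    assert is_feasible(best), "base heuristic must be feasible from x0"
    for k in range(N):
        for u in controls(y):
            y2 = y + (u, step(y[-1], u))   # extended partial trajectory
            t = y2 + complete(y2)          # its base heuristic completion
            # Fortification: adopt a candidate only on strict improvement.
            if is_feasible(t) and G(t) < G(best):
                best = t
        y = best[: 2 * (k + 1) + 1]        # follow the stored trajectory one step
    return best
```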


Example: Multi-Vehicle Routing with Constraints

[Figure: task locations 1–12 served by Vehicle 1 and Vehicle 2, each of Capacity = 1; one panel shows the problem and another the optimal solution.]

The base heuristic moves each vehicle, one at a time, to the closest pending task (a sketch of this heuristic follows the slide).

What is the first move of vehicle 1 chosen by the base heuristic and by rollout?

What is the solution found by the base heuristic?

What is the solution found by rollout?

Bertsekas Reinforcement Learning 14 / 22
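Here is a minimal sketch of the base heuristic; the position and distance bookkeeping below is invented for illustration, and Capacity = 1 details such as returning to a depot between tasks are omitted.

```python
def nearest_task_heuristic(vehicles, tasks, dist):
    """Illustrative base heuristic: vehicles take turns, one at a time,
    each moving to its closest pending task until no tasks remain.

    vehicles : list of current vehicle positions (assumed nonempty)
    tasks    : iterable of pending task locations
    dist     : dist(a, b) -> travel cost between two locations
    Returns the (vehicle_index, task) assignments in service order.
    """
    pending = set(tasks)
    plan = []
    while pending:
        for i, pos in enumerate(vehicles):
            if not pending:
                break
            task = min(pending, key=lambda t: dist(pos, t))  # closest pending task
            pending.remove(task)
            vehicles[i] = task          # the vehicle relocates to the task
            plan.append((i, task))
    return plan
```

Rollout would wrap this heuristic: for the vehicle whose turn it is, try each candidate first move, complete the remaining schedule with the heuristic, and keep the move whose completed schedule has the least total cost.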


General Discrete Optimization Problem: Minimize G(u) subject to u ∈ C

[Figure: the general discrete optimization problem viewed as an N-stage deterministic DP — an artificial start state branches into sets of partial solutions (u1), (u1, u2), (u1, u2, u3), . . . , ending at the complete solutions u = (u1, . . . , uN) with cost G(u).]

[Figure labels also reference a three-dimensional assignment problem (3 node layers: jobs j, machines ℓ, workers w) and a 5-dimensional assignment problem (6 node layers).]

This is a special case of the constrained deterministic optimal control problem where each state xk can take only a single value.

A very broad range of problems, e.g., combinatorial, integer programming, etc.

Solution by constrained rollout applies. It provides an entry point to the use of neural nets in discrete optimization, through approximation in policy space.

Other methods: local/random search, genetic algorithms, etc. Rollout is different. (A toy worked example follows the slide.)

Bertsekas Reinforcement Learning 16 / 22
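As a toy illustration of the mapping (the instance, helper names, and greedy completion below are all invented), minimizing G(u) over u ∈ C is treated as fixing u1, . . . , uN one component at a time, with a base heuristic completing each partial solution:

```python
def rollout_discrete(N, choices, complete, in_C, G):
    """Sketch: constrained rollout for min G(u) subject to u in C.

    choices(u)  -> candidate values for the next component of the partial u
    complete(u) -> full solution produced by a base heuristic from partial u
    in_C(u)     -> feasibility test for a complete solution
    G(u)        -> cost of a complete solution
    """
    u = ()
    for k in range(N):
        best_cost, best_c = float("inf"), None
        for c in choices(u):
            full = complete(u + (c,))        # heuristic completion of (u, c)
            if in_C(full) and G(full) < best_cost:
                best_cost, best_c = G(full), c
        if best_c is None:
            raise RuntimeError("no feasible completion at stage %d" % k)
        u += (best_c,)                       # permanently fix component k
    return u

# Invented toy instance: choose exactly K of N items (u in {0,1}^N),
# minimizing total weight; the base heuristic pads with zeros and then
# sets the cheapest remaining slots to 1 until exactly K ones.
w = [5, 3, 8, 2]
N, K = len(w), 2

def complete(u):
    u, rest = list(u), range(len(u), N)
    u += [0] * len(rest)
    need = K - sum(u)
    for i in sorted(rest, key=lambda i: w[i])[:max(need, 0)]:
        u[i] = 1
    return tuple(u)

in_C = lambda u: sum(u) == K
G = lambda u: sum(wi for wi, ui in zip(w, u) if ui)

print(rollout_discrete(N, lambda u: (0, 1), complete, in_C, G))  # (0, 1, 0, 1)
```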


Facility Location: A Classical Integer Programming Problem

[Figure: clients i are connected to facilities j, with assignment variables xij and facility-opening variables zj = 0 or 1.]
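For reference, a standard formulation consistent with these figure labels is as follows; the assignment costs c_{ij} and opening costs f_j are generic symbols, not taken from the slides.

```latex
\begin{aligned}
\min_{x,\,z}\quad & \sum_i \sum_j c_{ij}\, x_{ij} \;+\; \sum_j f_j\, z_j \\
\text{subject to}\quad & \sum_j x_{ij} = 1 \quad \text{for each client } i, \\
& x_{ij} \le z_j, \qquad x_{ij} \ge 0, \qquad z_j \in \{0, 1\},
\end{aligned}
```

so each client i is fully assigned across facilities, and assignments can be made only to facilities that are opened (zj = 1).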


Place facilities at some locations to serve the needs of M “clients."

Client i = 1, . . . , M has a demand di for services that may be satisfied at a location j = 1, . . . , N at a cost aij per unit.

A facility placed at location j has capacity cj and costs bj . Here uj ∈ {0, 1}, with uj = 1 if a facility is placed at j.

Problem: Minimize \sum_{i=1}^{M}\sum_{j=1}^{N} a_{ij} x_{ij} + \sum_{j=1}^{N} b_j u_j subject to total demand satisfaction constraints.

Note: If the placement variables uj are known, the remaining problem is easily solvable (it is a linear “transportation" problem).

Bertsekas Reinforcement Learning 17 / 22
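The slide leaves the demand-satisfaction constraints implicit; written out in the standard capacitated facility location form (stated here for completeness, not taken from the slide), the full program is

\min_{x,\,u}\ \sum_{i=1}^{M}\sum_{j=1}^{N} a_{ij}\, x_{ij} \;+\; \sum_{j=1}^{N} b_j u_j
\quad\text{subject to}\quad
\sum_{j=1}^{N} x_{ij} = d_i \ (i = 1,\dots,M),\qquad
\sum_{i=1}^{M} x_{ij} \le c_j u_j \ (j = 1,\dots,N),\qquad
x_{ij} \ge 0,\ \ u_j \in \{0,1\}.

With u fixed, only the part in x remains, which is the linear transportation problem the note above refers to.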

Page 15: Facility Location Problem: Formulation for Constrained Rollout


Consider placements one location at a time.

Stage = Placement decision at a single location (N stages). State at time k = The placement choices at locations 1, . . . , k. Control = Choice between 1 (place) or 0 (not place) for the next facility. What are the constraints? Is this multiagent?

Base heuristic: Having fixed u1, . . . , uk , place a facility at every remaining location.

Rollout: Having fixed u1, . . . , uk , compare two possibilities:

- Set uk+1 = 1 (place facility at location k + 1), set uk+2 = · · · = uN = 1 (as per the base heuristic), and solve the remaining problem.

- Set uk+1 = 0 (don’t place facility at location k + 1), set uk+2 = · · · = uN = 1 (as per the base heuristic), and solve the remaining problem.

Select uk+1 = 1 or uk+1 = 0 depending on which yields feasibility and minimum cost.

Linear transportation problems can be solved with the auction algorithm; a rollout sketch follows below.

Bertsekas Reinforcement Learning 18 / 22
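A minimal sketch of this rollout, assuming dense cost data and SciPy. It solves the inner transportation problem with scipy.optimize.linprog rather than the auction algorithm mentioned above, and all function and variable names are illustrative, not from the lecture.

import numpy as np
from scipy.optimize import linprog

def transport_cost(a, d, c, open_mask):
    """Min cost of serving demands d from the open facilities (np.inf if infeasible).

    Solves: min sum_ij a[i,j] x[i,j]  s.t.  sum_j x[i,j] = d[i] (client i served),
            sum_i x[i,j] <= c[j] (capacity of open facility j), x >= 0.
    """
    M, N = a.shape
    open_idx = np.flatnonzero(open_mask)
    K = open_idx.size
    if K == 0 or c[open_idx].sum() < d.sum():
        return np.inf
    cost = a[:, open_idx].ravel()          # variables x[i, jj], client-major order
    A_eq = np.zeros((M, M * K))
    for i in range(M):
        A_eq[i, i * K:(i + 1) * K] = 1.0   # sum over open facilities = d[i]
    A_ub = np.zeros((K, M * K))
    for jj in range(K):
        A_ub[jj, jj::K] = 1.0              # sum over clients <= c[open_idx[jj]]
    res = linprog(cost, A_ub=A_ub, b_ub=c[open_idx], A_eq=A_eq, b_eq=d,
                  bounds=(0, None), method="highs")
    return res.fun if res.success else np.inf

def rollout_facility_location(a, b, c, d):
    """One-location-at-a-time rollout with the 'place everywhere else' base heuristic."""
    M, N = a.shape
    u = np.zeros(N, dtype=int)
    final_cost = np.inf
    for k in range(N):
        totals = []
        for choice in (1, 0):
            mask = u.copy()
            mask[k] = choice
            mask[k + 1:] = 1               # base heuristic completes the placement
            totals.append(b @ mask + transport_cost(a, d, c, mask.astype(bool)))
        u[k] = 1 if totals[0] <= totals[1] else 0  # infeasible completions cost inf
        final_cost = min(totals)
    return u, final_cost

Ties are broken toward placing, i.e., toward the base heuristic's own action, which is the tie-break consistent with rollout's cost-improvement argument.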

Page 16: RNA Folding

(Slide figure: the current partial folding is fed to the partial folding software, which produces a complete folding evaluated by the critic software; the per-nucleotide decisions are open, close, or null.)


Given a sequence of nucleotides, “fold" it in an “interesting" way (introduce pairings that result in an “interesting" structure).

Make a pairing decision at each nucleotide in sequence (open, close, do nothing).

Base heuristic: Given a partial folding, generate a complete folding (this is the partial folding software).

Two complete foldings can be compared by the critic software.

Note: There is no explicit cost function here (it is internal to the critic software).

Bertsekas Reinforcement Learning 19 / 22
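A minimal rollout sketch for this scheme. The callables complete_folding (standing in for the partial folding software) and better (standing in for the critic software) are assumptions; nothing here comes from the lecture beyond the open/close/null action set and the one-decision-per-nucleotide structure.

def rna_rollout(sequence, complete_folding, better):
    """Rollout over pairing decisions, one nucleotide at a time.

    complete_folding(partial): base heuristic that extends a partial folding to
        a complete one; assumed to return None if the last action is infeasible
        (e.g., 'close' with no open pair).
    better(f1, f2): critic comparator; True if folding f1 is preferred to f2.
        Note there is no explicit cost, only pairwise comparison.
    """
    ACTIONS = ("open", "close", "null")
    partial = []
    for _ in sequence:
        best_action, best_fold = None, None
        for action in ACTIONS:
            candidate = complete_folding(partial + [action])
            if candidate is None:          # infeasible action at this position
                continue
            if best_fold is None or better(candidate, best_fold):
                best_action, best_fold = action, candidate
        partial.append(best_action)
    return partial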

Page 17: Three-Dimensional Assignment

(Slide figure: the 3-dimensional assignment problem drawn as a graph with 3 node layers: jobs j, machines ℓ, workers w. A 5-dimensional assignment problem would have 6 node layers.)


The performance of a job j requires a single machine ℓ and a single worker w.

There is a given cost a_{jℓw} corresponding to the triplet (j, ℓ, w).

Given a set of m jobs, a set of m machines, and a set of m workers, we want to find a collection of m job-machine-worker triplets that has minimum total cost.

A favorable case is when the cost has separable form a_{jℓw} = β_{jℓ} + γ_{ℓw}.

Enforced separation heuristic:

- First solve an artificial 2-dimensional machines-to-workers assignment problem with costs γ_{ℓw} derived from a_{jℓw}, e.g., γ_{ℓw} = min_j a_{jℓw} (the “optimistic" assignment costs).

- Next solve the 2-dimensional jobs-to-machines assignment problem with costs β_{jℓ} specified by the machines-to-workers assignment and a_{jℓw}.

2-D assignment problems are easy (using the auction algorithm; see the notes). A sketch of the heuristic follows below.

Bertsekas Reinforcement Learning 20 / 22
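A minimal sketch of the enforced separation heuristic, assuming the triplet costs are given as a dense (m, m, m) NumPy array. It uses SciPy's linear_sum_assignment (a Hungarian-type solver) in place of the auction algorithm from the notes; the array layout and names are illustrative assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment

def enforced_separation(a):
    """Enforced separation heuristic for 3-D assignment.

    a: (m, m, m) array, a[j, l, w] = cost of triplet (job j, machine l, worker w).
    Returns (machine_of_job, worker_of_machine, total_cost).
    """
    m = a.shape[0]
    # Step 1: 'optimistic' machine-to-worker costs, minimizing out the job index.
    gamma = a.min(axis=0)                          # gamma[l, w] = min_j a[j, l, w]
    machines, workers = linear_sum_assignment(gamma)
    worker_of_machine = np.empty(m, dtype=int)
    worker_of_machine[machines] = workers
    # Step 2: job-to-machine costs induced by the fixed machine-worker pairing.
    beta = a[:, np.arange(m), worker_of_machine]   # beta[j, l] = a[j, l, w(l)]
    jobs, machine_of_job = linear_sum_assignment(beta)
    total = beta[jobs, machine_of_job].sum()
    return machine_of_job, worker_of_machine, total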

Page 18: Three-Dimensional Assignment: Use of Rollout


View as a constrained multistage problem.

Two stages; control at stage 1 = the jobs-to-machines assignment; control at stage 2 = the machines-to-workers assignment.

We view the stage 1 assignment as “multiagent": Assign one job at a time.

We view the stage 2 assignment as “single-agent": Assign all machines at once optimally (given the stage 1 assignment).

Base heuristic: Having fixed some stage 1 assignments, use the enforced separation heuristic for the remaining stage 1 assignments, and the stage 2 assignment.

Rollout: Fix each job assignment one-at-a-time, using the base heuristic to compare all machine options (see the sketch after this slide).

Bertsekas Reinforcement Learning 21 / 22
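A sketch of the resulting rollout, continuing the array conventions above. For brevity the base-heuristic completion uses the optimistic costs min_w a[j, ℓ, w] for the remaining stage-1 assignments, a simplification of the full enforced separation heuristic, before solving stage 2 optimally; this is an illustrative variant under those assumptions, not the lecture's exact procedure.

import numpy as np
from scipy.optimize import linear_sum_assignment

def base_heuristic_cost(a, machine_of_job):
    """Complete a partial job->machine map (-1 = unassigned); return total cost.

    Remaining stage-1 assignments use optimistic costs min_w a[j, l, w]
    (simplified enforced separation); stage 2 is then solved optimally.
    """
    m = a.shape[0]
    full = machine_of_job.copy()
    rem_jobs = np.flatnonzero(full < 0)
    used = set(full[full >= 0].tolist())
    rem_machines = np.array([l for l in range(m) if l not in used], dtype=int)
    if rem_jobs.size:
        beta = a[np.ix_(rem_jobs, rem_machines)].min(axis=2)  # optimistic stage 2
        r, c = linear_sum_assignment(beta)
        full[rem_jobs[r]] = rem_machines[c]
    # Stage 2: optimal machines-to-workers assignment given the stage-1 map.
    job_of_machine = np.empty(m, dtype=int)
    job_of_machine[full] = np.arange(m)
    cw = a[job_of_machine, np.arange(m), :]        # cw[l, w] = a[job at l, l, w]
    rows, cols = linear_sum_assignment(cw)
    return cw[rows, cols].sum()

def rollout_3d_assignment(a):
    """Multiagent rollout: fix one job's machine at a time via the base heuristic."""
    m = a.shape[0]
    machine_of_job = -np.ones(m, dtype=int)
    best_cost = np.inf
    for j in range(m):
        used = set(machine_of_job[machine_of_job >= 0].tolist())
        best_cost, best_l = np.inf, None
        for l in (l for l in range(m) if l not in used):
            trial = machine_of_job.copy()
            trial[j] = l
            cost = base_heuristic_cost(a, trial)
            if cost < best_cost:
                best_cost, best_l = cost, l
        machine_of_job[j] = best_l
    return machine_of_job, best_cost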

Page 19: About the Next Two Lectures

We will cover:

Parametric approximation architectures.

Neural networks and how we use them.

Approximation in value space and policy space using neural nets.

PLEASE READ AS MUCH OF CHAPTER 3 AS YOU CAN

PLEASE DOWNLOAD THE LATEST VERSION

Bertsekas Reinforcement Learning 22 / 22