Chapter 2 Dynamic Programming


    2.1 Closed-loop optimization of discrete-time systems: inventory control

    We consider the following inventory control problem: The problem is to minimize the expected cost of ordering quantities of a certain product in order to meet a stochastic demand for that product. The ordering is only possible at discrete time instants $t_0$ [...] 0, are placed at the last possible moment, i.e., at $t_k$, to take into account information about the development of the stock until $t_k$. In mathematical terms, this amounts to determining a policy or control rule $\pi := \{\mu_k\}_{k=0}^{N-1}$


  • Optimization Theory II, Spring 2007; Chapter 2 49

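    described by functions $\mu_k = \mu_k(x_k)$, $0 \leq k \leq N-1$, where the function value is the number of units of the product to be ordered at time $t_k$ if the stock is $x_k$. Hence, given a fixed initial stock $x_0$, the associated minimization problem reads as follows:

    $$\text{minimize} \quad J_\pi(x_0) := E\Big( R(x_N) + \sum_{k=0}^{N-1} \big( r(x_k) + c\,\mu_k(x_k) \big) \Big) , \tag{2.4a}$$

    $$\text{subject to} \quad x_{k+1} = x_k + \mu_k(x_k) - w_k , \quad 0 \leq k \leq N-1 , \tag{2.4b}$$

    $$\mu_k(x_k) \geq 0 , \quad 0 \leq k \leq N-1 . \tag{2.4c}$$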

    Dynamic Programming (DP) is concerned with the efficient solution of such closed-loop minimization problems.
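To make the closed-loop formulation (2.4) concrete, the following Python sketch estimates $J_\pi(x_0)$ for one fixed policy by Monte Carlo simulation. Everything specific in it is an assumption made for illustration and is not taken from the text: the function names, the base-stock rule `base_stock`, the uniform demand on {0, 1, 2}, the holding cost `abs`, and the zero terminal cost.

```python
import random

def simulate_cost(policy, x0, N, sample_demand, r, R, c, rng):
    """Return one sample of R(x_N) + sum_k (r(x_k) + c * mu_k(x_k))."""
    x, total = x0, 0.0
    for k in range(N):
        u = policy(k, x)                  # closed loop: the order depends on the current stock
        if u < 0:
            raise ValueError("orders must be non-negative, cf. (2.4c)")
        total += r(x) + c * u             # stage cost in (2.4a)
        x = x + u - sample_demand(rng)    # stock dynamics (2.4b)
    return total + R(x)

def estimate_cost(policy, x0, N, sample_demand, r, R, c, n_samples=10_000, seed=0):
    """Average many simulated costs to approximate the expectation in (2.4a)."""
    rng = random.Random(seed)
    samples = (simulate_cost(policy, x0, N, sample_demand, r, R, c, rng)
               for _ in range(n_samples))
    return sum(samples) / n_samples

# Hypothetical data, chosen only to make the sketch runnable.
def base_stock(k, x, level=2):
    """Order up to a fixed stock level (an assumed policy, not the optimal one)."""
    return max(0, level - x)

def uniform_demand(rng):
    return rng.choice([0, 1, 2])

estimate = estimate_cost(base_stock, x0=0, N=3, sample_demand=uniform_demand,
                         r=abs, R=lambda x: 0.0, c=1.0)
print(f"estimated expected cost: {estimate:.3f}")
```

Simulation only evaluates a given policy; finding the minimizing policy is what the DP algorithm of Section 2.2 addresses.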

    Remark: We note that minimization problems associated with deterministic discrete-time dynamical systems can be considered as well. Such systems will be dealt with in more detail in Chapter 2.3.

  • 50 Ronald H.W. Hoppe

    2.2 Dynamic Programming for discrete-time systems

    2.2.1 Setting of the problem

    For $N \in \mathbb{N}$, we consider sequences $\{S_k\}_{k=0}^{N}$, $\{C_k\}_{k=0}^{N-1}$, and $\{D_k\}_{k=0}^{N-1}$ of (random) state spaces $S_k$, $0 \leq k \leq N$, control spaces $C_k$, $0 \leq k \leq N-1$, and (random) disturbance spaces $D_k$, $0 \leq k \leq N-1$. Given an initial state $x_0 \in S_0$, we assume that the states $x_k \in S_k$, $0 \leq k \leq N$, evolve according to the discrete-time dynamic system

    $$x_{k+1} = f_k(x_k, u_k, w_k) , \quad 0 \leq k \leq N-1 ,$$

    where $f_k : S_k \times C_k \times D_k \to S_{k+1}$, $0 \leq k \leq N-1$. The controls $u_k$ satisfy

    $$u_k \in U_k(x_k) \subset C_k , \quad 0 \leq k \leq N-1 ,$$

    i.e., they depend on the state $x_k \in S_k$, and the random disturbances $w_k$, $0 \leq k \leq N-1$, are characterized by a probability distribution

    $$P_k(\cdot \mid x_k, u_k) , \quad 0 \leq k \leq N-1 ,$$

    which may explicitly depend on the state $x_k$ and on the control $u_k$, but not on the prior disturbances $w_j$, $0 \leq j \leq k-1$. At the time instants $0 \leq k \leq N-1$, decisions with respect to the choice of the controls have to be made by means of control laws

    $$\mu_k : S_k \to C_k , \quad u_k = \mu_k(x_k) , \quad 0 \leq k \leq N-1 ,$$

    leading to a control policy

    $$\pi = \{\mu_0, \mu_1, \dots, \mu_{N-1}\} .$$

    We refer to $\Pi$ as the set of admissible control policies. Finally, we introduce cost functionals

    $$g_k : S_k \times C_k \times D_k \to \mathbb{R} , \quad 0 \leq k \leq N-1 ,$$

    associated with the decisions at time instants $0 \leq k \leq N-1$, and a terminal cost

    $$g_N : S_N \to \mathbb{R} .$$

    We consider the minimization problem


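    $$\min_{\pi \in \Pi} J_\pi(x_0) , \qquad J_\pi(x_0) = E\Big[ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \Big] , \tag{2.5a}$$

    subject to

    $$x_{k+1} = f_k(x_k, \mu_k(x_k), w_k) , \quad 0 \leq k \leq N-1 , \tag{2.5b}$$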


    where the expectation $E$ is taken over all random states and random disturbances. For a given initial state $x_0$, the value

    $$J^*(x_0) = \min_{\pi \in \Pi} J_\pi(x_0) \tag{2.6}$$

    is referred to as the optimal cost function or the optimal value function. Moreover, if there exists an admissible policy $\pi^* \in \Pi$ such that

    $$J_{\pi^*}(x_0) = J^*(x_0) , \tag{2.7}$$

    then $\pi^*$ is called an optimal policy.
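For finite state, control, and disturbance sets, definitions (2.5) to (2.7) can be evaluated literally: compute $J_\pi(x_0)$ exactly for each admissible policy and keep the smallest value. The Python sketch below does this under simplifying assumptions that are not part of the text: the state set and the admissible controls do not depend on $k$, and all function names and the two-state toy problem at the end are hypothetical. The number of policies grows exponentially in $N$, so this is only a reference computation for tiny problems.

```python
from itertools import product

def policy_cost(x0, policy, N, f, g, gN, dist):
    """Exact J_pi(x0) from (2.5) for finite disturbance sets.

    policy[k] maps a state to a control; dist(k, x, u) returns (w, probability) pairs.
    """
    def cost_from(k, x):
        if k == N:
            return gN(x)                  # terminal cost g_N(x_N)
        u = policy[k][x]                  # u_k = mu_k(x_k)
        return sum(p * (g(k, x, u, w) + cost_from(k + 1, f(k, x, u, w)))
                   for w, p in dist(k, x, u))
    return cost_from(0, x0)

def brute_force_optimum(x0, N, states, U, f, g, gN, dist):
    """Minimize J_pi(x0) over all admissible policies, cf. (2.6) and (2.7)."""
    # Every control law picks one admissible control per state.
    laws = [dict(zip(states, choice))
            for choice in product(*(U(x) for x in states))]
    best_cost, best_policy = None, None
    for policy in product(laws, repeat=N):
        cost = policy_cost(x0, policy, N, f, g, gN, dist)
        if best_cost is None or cost < best_cost:
            best_cost, best_policy = cost, policy
    return best_cost, best_policy

# Hypothetical toy problem: two states, two controls, fair-coin disturbance.
toy_states = [0, 1]
toy_U = lambda x: [0, 1]
toy_f = lambda k, x, u, w: (x + u + w) % 2
toy_g = lambda k, x, u, w: u + 2 * x
toy_gN = lambda x: 2 * x
toy_dist = lambda k, x, u: [(0, 0.5), (1, 0.5)]

best_cost, best_policy = brute_force_optimum(0, 2, toy_states, toy_U,
                                             toy_f, toy_g, toy_gN, toy_dist)
print(best_cost, best_policy[0])
```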

    2.2.2 Bellman's principle of optimality

    Bellman's celebrated principle of optimality is the key to a constructive approach to the solution of the minimization problem (2.5), which readily leads to a powerful algorithmic tool: the Dynamic Programming algorithm (DP algorithm). For intermediate states $x_k \in S_k$, $1 \leq k \leq N-1$, that occur with positive probability, we consider the minimization subproblems

    $$\min_{\pi_k \in \Pi_k} J_{\pi_k}(x_k) , \qquad J_{\pi_k}(x_k) = E\Big[ g_N(x_N) + \sum_{\ell=k}^{N-1} g_\ell(x_\ell, \mu_\ell(x_\ell), w_\ell) \Big] , \tag{2.8a}$$

    subject to

    $$x_{\ell+1} = f_\ell(x_\ell, \mu_\ell(x_\ell), w_\ell) , \quad k \leq \ell \leq N-1 , \tag{2.8b}$$

    where $\pi_k = \{\mu_k, \mu_{k+1}, \dots, \mu_{N-1}\}$ and $\Pi_k$ is the set of admissible policies obtained from $\Pi$ by deleting the admissible control laws associated with the previous time instants $0 \leq \ell \leq k-1$. Bellman's optimality principle states that if

    $$\pi^* = \{\mu_0^*, \mu_1^*, \dots, \mu_{N-1}^*\}$$

    is an optimal policy for (2.5), then the truncated policy

    $$\pi_k^* = \{\mu_k^*, \mu_{k+1}^*, \dots, \mu_{N-1}^*\}$$

    is an optimal policy for the minimization subproblem (2.8). We refer to

    $$J_k^*(x_k) = J_{\pi_k^*}(x_k) \tag{2.9}$$

    as the optimal cost-to-go for state $x_k$ at the time instant $k$ to the final time $N$. For completeness, we further set $J_N^*(x_N) = g_N(x_N)$. The intuitive justification for Bellman's optimality principle is that if the truncated policy $\pi_k^*$ were not optimal for subproblem (2.8), then


    we would be able to reduce the optimal cost for (2.5) by switching to the optimal policy for (2.8) once $x_k$ is reached.

    2.2.3 The DP algorithm and its optimality

    Bellman's optimality principle strongly suggests solving the optimality subproblems (2.8) backwards in time, beginning with the terminal cost at the final time instant $N$ and then recursively computing the optimal cost-to-go for the subproblems $k = N-1, \dots, 0$. This leads to the so-called backward DP algorithm, which is characterized by the recursions

    $$J_N(x_N) = g_N(x_N) , \tag{2.10a}$$

    $$J_k(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k}\big[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \big] , \quad 0 \leq k \leq N-1 , \tag{2.10b}$$

    where $E_{w_k}$ means that the expectation is taken with respect to the probability distribution of $w_k$.
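The recursions (2.10) translate almost line by line into code when all sets are finite. The following Python function is a minimal sketch under that assumption; the name `backward_dp` and the callback signatures (`states`, `U`, `f`, `g`, `gN`, `dist`) are hypothetical conventions chosen for this illustration, not notation from the text.

```python
def backward_dp(N, states, U, f, g, gN, dist):
    """Backward DP recursion (2.10) for finite state, control and disturbance sets.

    states(k)     -> iterable of the states in S_k
    U(k, x)       -> iterable of the admissible controls U_k(x_k)
    f(k, x, u, w) -> next state, must lie in states(k + 1)
    g(k, x, u, w) -> stage cost, gN(x) -> terminal cost
    dist(k, x, u) -> list of (w, probability) pairs for the disturbance w_k

    Returns (J, mu) with J[k][x] the cost-to-go and mu[k][x] a minimizing control.
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states(N):
        J[N][x] = gN(x)                               # (2.10a)
    for k in range(N - 1, -1, -1):                    # k = N-1, ..., 0
        for x in states(k):
            best_u, best_val = None, float("inf")
            for u in U(k, x):
                # Expectation over w_k of stage cost plus cost-to-go, cf. (2.10b).
                val = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                          for w, p in dist(k, x, u))
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu
```

The work is proportional to $N$ times the number of (state, control, disturbance) triples per stage, in contrast to enumerating all policies, whose number grows exponentially in $N$.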

    Theorem 2.1 (Optimality of the backward DP algorithm). We assume that

    the random disturbance spaces $D_k$, $0 \leq k \leq N-1$, are finite or countable sets,

    the expectations of all terms in the cost functionals in (2.10b) are finite for every admissible policy.

    Then, there holds

    $$J_k^*(x_k) = J_k(x_k) , \quad 0 \leq k \leq N . \tag{2.11}$$

    Moreover, if there exist optimal policies $\mu_k^*$, $0 \leq k \leq N-1$, for (2.10b) such that $u_k^* = \mu_k^*(x_k)$, then $\pi^* = \{\mu_0^*, \mu_1^*, \dots, \mu_{N-1}^*\}$ is an optimal policy for (2.5).

    Proof. We note that for $k = N$ the assertion (2.11) holds true by definition. For $k < N$ and $\varepsilon > 0$, we define $\mu_k^\varepsilon(x_k) \in U_k(x_k)$ as the $\varepsilon$-suboptimal control satisfying

    $$E_{w_k}\big[ g_k(x_k, \mu_k^\varepsilon(x_k), w_k) + J_{k+1}(f_k(x_k, \mu_k^\varepsilon(x_k), w_k)) \big] \leq J_k(x_k) + \varepsilon . \tag{2.12}$$

    We denote by $J_k^\varepsilon(x_k)$ the expected cost at state $x_k$ and time instant $k$ with respect to the $\varepsilon$-optimal policy

    $$\pi_k^\varepsilon = \{\mu_k^\varepsilon(x_k), \mu_{k+1}^\varepsilon(x_{k+1}), \dots, \mu_{N-1}^\varepsilon(x_{N-1})\} .$$


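    We show by induction on $N-k \geq 1$ that

    $$J_k(x_k) \leq J_k^\varepsilon(x_k) \leq J_k(x_k) + (N-k)\,\varepsilon , \tag{2.13a}$$

    $$J_k^*(x_k) \leq J_k^\varepsilon(x_k) \leq J_k^*(x_k) + (N-k)\,\varepsilon , \tag{2.13b}$$

    $$J_k(x_k) = J_k^*(x_k) . \tag{2.13c}$$

    (i) Base case $k = N-1$: It follows readily from (2.12) that (2.13a) and (2.13b) hold true for $k = N-1$. The limit process $\varepsilon \to 0$ in (2.13a) and (2.13b) then yields that (2.13c) is satisfied for $k = N-1$.

    (ii) Induction hypothesis: We suppose that (2.13a)-(2.13c) hold true for some $k+1$.

    (iii) Proof for $k$: Using the definition of $J_k^\varepsilon(x_k)$, the induction hypothesis and (2.12), we get

    $$\begin{aligned}
    J_k^\varepsilon(x_k) &= E_{w_k}\big[ g_k(x_k, \mu_k^\varepsilon(x_k), w_k) + J_{k+1}^\varepsilon(f_k(x_k, \mu_k^\varepsilon(x_k), w_k)) \big] \\
    &\leq E_{w_k}\big[ g_k(x_k, \mu_k^\varepsilon(x_k), w_k) + J_{k+1}(f_k(x_k, \mu_k^\varepsilon(x_k), w_k)) \big] + (N-k-1)\,\varepsilon \\
    &\leq J_k(x_k) + \varepsilon + (N-k-1)\,\varepsilon = J_k(x_k) + (N-k)\,\varepsilon .
    \end{aligned}$$

    On the other hand, using the induction hypothesis once more, we obtain

    $$\begin{aligned}
    J_k^\varepsilon(x_k) &= E_{w_k}\big[ g_k(x_k, \mu_k^\varepsilon(x_k), w_k) + J_{k+1}^\varepsilon(f_k(x_k, \mu_k^\varepsilon(x_k), w_k)) \big] \\
    &\geq E_{w_k}\big[ g_k(x_k, \mu_k^\varepsilon(x_k), w_k) + J_{k+1}(f_k(x_k, \mu_k^\varepsilon(x_k), w_k)) \big] \\
    &\geq \min_{u_k \in U_k(x_k)} E_{w_k}\big[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \big] = J_k(x_k) .
    \end{aligned}$$

    The combination of the two preceding inequalities proves (2.13a) for $k$. Now, for every policy $\pi \in \Pi$ with truncations $\pi_k$ and $\pi_{k+1}$, using the induction hypothesis and (2.12), we have

    $$\begin{aligned}
    J_k^\varepsilon(x_k) &= E_{w_k}\big[ g_k(x_k, \mu_k^\varepsilon(x_k), w_k) + J_{k+1}^\varepsilon(f_k(x_k, \mu_k^\varepsilon(x_k), w_k)) \big] \\
    &\leq E_{w_k}\big[ g_k(x_k, \mu_k^\varepsilon(x_k), w_k) + J_{k+1}(f_k(x_k, \mu_k^\varepsilon(x_k), w_k)) \big] + (N-k-1)\,\varepsilon \\
    &\leq \min_{u_k \in U_k(x_k)} E_{w_k}\big[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \big] + (N-k)\,\varepsilon \\
    &\leq E_{w_k}\big[ g_k(x_k, \mu_k(x_k), w_k) + J_{\pi_{k+1}}(f_k(x_k, \mu_k(x_k), w_k)) \big] + (N-k)\,\varepsilon \\
    &= J_{\pi_k}(x_k) + (N-k)\,\varepsilon ,
    \end{aligned}$$

    where the last inequality uses $J_{k+1} = J_{k+1}^* \leq J_{\pi_{k+1}}$, which holds by (2.13c) for $k+1$. Minimizing over $\pi_k$ yields

    $$J_k^\varepsilon(x_k) \leq J_k^*(x_k) + (N-k)\,\varepsilon .$$

    On the other hand, by the definition of $J_k^*(x_k)$ we obviously have

    $$J_k^*(x_k) \leq J_k^\varepsilon(x_k) .$$

    Combining the preceding inequalities proves (2.13b) for $k$. Again, (2.13c) finally follows from $\varepsilon \to 0$ in (2.13a) and (2.13b).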


    Remark: The general case (where the disturbance spaces are not necessarily finite or countable) requires additional structures of the spaces $S_k$, $C_k$, and $D_k$ as well as measurability assumptions with regard to the functions $f_k$, $g_k$, and $\mu_k$ within the framework of measure-theoretic probability theory. We refer to [2] for details.

    2.2.4 Applications of the DP algorithm

    Example (Inventory control): We consider the following modification of the inventory control problem previously discussed in Chapter 2.1: We assume that the states $x_k$, the controls $u_k$, and the demands $w_k$ are non-negative integers which can take the values 0, 1 and 2, where the demand $w_k$ has the same probability distribution

    $$p(w_k = 0) = 0.1 , \quad p(w_k = 1) = 0.7 , \quad p(w_k = 2) = 0.2$$

    for all planning periods $(k, k+1)$. We further assume that the excess demand $w_k - x_k - u_k$ is lost and that there is an upper bound of 2 units on the stock that can be stored. Consequently, the equation for the evolution of the stock takes the form

    $$x_{k+1} = \max(0, x_k + u_k - w_k)$$

    under the constraint

    $$x_k + u_k \leq 2 .$$

    For the holding costs and the terminal cost we assume

    $$r(x_k) = (x_k + u_k - w_k)^2 , \quad R(x_N) = 0 ,$$

    and we suppose that the ordering cost is 1 per unit, i.e.,

    $$c = 1 .$$

    Therefore, the functions $g_k$, $0 \leq k \leq N$, are given by

    $$g_k(x_k, u_k, w_k) = u_k + (x_k + u_k - w_k)^2 , \quad 0 \leq k \leq N-1 ,$$

    $$g_N(x_N) = 0 .$$

    Finally, we suppose $x_0 = 0$ and for the planning horizon we take $N = 3$, so that the recursions (2.10) of the backward DP algorithm have the form

    $$J_3(x_3) = 0 , \tag{2.14a}$$

    $$J_k(x_k) = \min_{0 \leq u_k \leq 2 - x_k} E_{w_k}\big[ u_k + (x_k + u_k - w_k)^2 + J_{k+1}(\max(0, x_k + u_k - w_k)) \big] , \quad 0 \leq k \leq 2 . \tag{2.14b}$$
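Because only three stock levels, at most three order quantities, and three demand values occur, the recursion (2.14) is small enough to evaluate mechanically. The Python sketch below does so with memoized recursion; the demand probabilities, the capacity bound, and the horizon come from the example, while the names `P`, `CAPACITY`, `q_value`, `J`, and `mu` are hypothetical choices made for this illustration.

```python
from functools import lru_cache

P = {0: 0.1, 1: 0.7, 2: 0.2}    # demand distribution p(w_k = w) from the example
N = 3                            # planning horizon
CAPACITY = 2                     # storage bound x_k + u_k <= 2

def q_value(k, x, u):
    """Expected stage cost plus cost-to-go when ordering u at stock x in period k."""
    return sum(p * (u + (x + u - w) ** 2 + J(k + 1, max(0, x + u - w)))
               for w, p in P.items())

@lru_cache(maxsize=None)
def J(k, x):
    """Cost-to-go J_k(x_k) from (2.14)."""
    if k == N:
        return 0.0                                                   # (2.14a)
    return min(q_value(k, x, u) for u in range(CAPACITY - x + 1))    # (2.14b)

def mu(k, x):
    """A minimizing order quantity at stock x in period k."""
    return min(range(CAPACITY - x + 1), key=lambda u: q_value(k, x, u))

# Tabulate cost-to-go and order quantity, from the last period back to the first.
for k in (2, 1, 0):
    for x in range(CAPACITY + 1):
        print(f"k={k}  x={x}  J={J(k, x):.3f}  u*={mu(k, x)}")
```

The loop visits the periods in the order 2, 1, 0, so the printed table can be compared line by line with a hand computation of the same recursion.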

    We start with period 2, followed by period 1, and finish with period 0:


    Period 2: We comput