
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 1 9/4/2013

Metric spaces and topology

Content. Metric spaces and topology. Polish space. Arzelà-Ascoli Theorem. Convergence of mappings. Skorohod metric and Skorohod space.

1 Metric spaces. Open, closed and compact sets

When we discuss probability theory of random processes, the underlying sample spaces and $\sigma$-field structures become quite complex. It helps to have a unifying framework for discussing both random variables and stochastic processes, as well as their convergence, and such a framework is provided by metric spaces.

Definition 1. A metric space is a pair $(S,\rho)$ of a set $S$ and a function $\rho : S\times S\to\mathbb{R}_+$ such that for all $x,y,z\in S$ the following holds:

1. $\rho(x,y)=0$ if and only if $x=y$.

2. $\rho(x,y)=\rho(y,x)$ (symmetry).

3. $\rho(x,z)\le\rho(x,y)+\rho(y,z)$ (triangle inequality).

Examples of metric spaces include $S=\mathbb{R}^d$ with $\rho(x,y)=\big(\sum_{1\le j\le d}(x_j-y_j)^2\big)^{1/2}$, or $\rho(x,y)=\sum_{1\le j\le d}|x_j-y_j|$, or $\rho(x,y)=\max_{1\le j\le d}|x_j-y_j|$. These metrics are also called the $L_2$, $L_1$ and $L_\infty$ norms, and we write $\|x-y\|_1$, $\|x-y\|_2$, $\|x-y\|_\infty$, or simply $\|x-y\|$. More generally one can define $\|x-y\|_p=\big(\sum_{1\le j\le d}|x_j-y_j|^p\big)^{1/p}$, $p\ge 1$; $p=\infty$ essentially corresponds to $\|x-y\|_\infty$.

Problem 1. Show that $L_p$ is not a metric when $0<p<1$.
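These metrics are easy to experiment with numerically. The following sketch is my own illustration (not part of the original notes): it implements the $L_p$ metrics on $\mathbb{R}^d$ and spot-checks the triangle inequality.

```python
import math

def lp_metric(x, y, p):
    """L^p distance on R^d; p = math.inf gives the L^infinity (max) metric."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Spot-check the triangle inequality rho(x, z) <= rho(x, y) + rho(y, z).
x, y, z = (0.0, 0.0), (1.0, 2.0), (3.0, 4.0)
for p in (1, 2, math.inf):
    assert lp_metric(x, z, p) <= lp_metric(x, y, p) + lp_metric(y, z, p)
```

For $0<p<1$ the same formula fails the triangle inequality, which is the content of Problem 1.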

Another important example is $S=C[0,T]$, the space of continuous functions $x:[0,T]\to\mathbb{R}^d$, with $\rho(x,y)=\rho_T(x,y)=\sup_{0\le t\le T}\|x(t)-y(t)\|$, where $\|\cdot\|$ can be taken as any of the $L_p$ or $L_\infty$ norms. We will usually concentrate on the case $d=1$, in which case $\rho(x,y)=\rho_T(x,y)=\sup_{0\le t\le T}|x(t)-y(t)|$. The space $C[0,\infty)$ is also a metric space under $\rho(x,y)=\sum_{n\in\mathbb{N}}2^{-n}\min(\rho_n(x,y),1)$, where $\rho_n$ is the metric defined on $C[0,n]$. (Why did we have to use the $\min$ operator in the definition above?) We call $\rho_T$ and $\rho$ the uniform metric. We will also write $\|x-y\|_T$ or $\|x-y\|$ instead of $\rho_T$.

Problem 2. Establish that $\rho_T$ and $\rho$ defined above on $C[0,T]$ and $C[0,\infty)$ are metrics. Prove also that if $x_n\to x$ in the metric $\rho$ for $x_n,x\in C[0,\infty)$, then the restrictions $x_n',x'$ of $x_n,x$ onto $[0,T]$ satisfy $x_n'\to x'$ w.r.t. $\rho_T$.

Finally, let us give an example of a metric space from graph theory. Let $G=(V,E)$ be an undirected graph on nodes $V$ and edges $E$. Namely, each element (edge) of $E$ is a pair of nodes $(u,v)$, $u,v\in V$. For every two nodes $u$ and $v$, which are not necessarily connected by an edge, let $\rho(u,v)$ be the length of a shortest path connecting $u$ with $v$. Then it is easy to see that $\rho$ is a metric on the finite set $V$ (provided $G$ is connected, so that all distances are finite).
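As a quick illustration of my own (function names are hypothetical), the graph metric can be computed by breadth-first search, since every edge has length one:

```python
from collections import deque

def graph_distance(adjacency, u, v):
    """Length of a shortest path between u and v in an undirected graph,
    given as an adjacency dict; float('inf') if u and v are not connected."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        w = queue.popleft()
        if w == v:
            return dist[w]
        for z in adjacency[w]:
            if z not in dist:
                dist[z] = dist[w] + 1
                queue.append(z)
    return float("inf")

# A path graph 1 - 2 - 3 - 4: rho(1, 4) = 3, and rho is symmetric.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
assert graph_distance(adj, 1, 4) == 3 == graph_distance(adj, 4, 1)
```

The metric axioms can then be checked exhaustively on a small graph.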

Definition 2. A sequence $x_n\in S$ is said to converge to a limit $x\in S$ (we write $x_n\to x$) if $\lim_n\rho(x_n,x)=0$. A sequence $x_n\in S$ is Cauchy if for every $\epsilon>0$ there exists $n_0$ such that for all $n,n'>n_0$, $\rho(x_n,x_{n'})<\epsilon$. A metric space is defined to be complete if every Cauchy sequence converges to some limit $x$.

The space $\mathbb{R}^d$ is a complete space under all three metrics $L_1,L_2,L_\infty$. The space $\mathbb{Q}$ of rational points in $\mathbb{R}$ is not complete (why?). A subset $A\subset S$ is called dense if for every $x\in S$ there exists a sequence of points $x_n\in A$ such that $x_n\to x$. The set of rational values in $\mathbb{R}$ is dense and is countable. The set of irrational points in $\mathbb{R}$ is dense but not countable. The set of points $(q_1,\ldots,q_d)\in\mathbb{R}^d$ such that $q_i$ is rational for all $1\le i\le d$ is a countable dense subset of $\mathbb{R}^d$.

Definition 3. A metric space is defined to be separable if it contains a dense countable subset $A$. A metric space $S$ is defined to be a Polish space if it is complete and separable.

We just realized that $\mathbb{R}^d$ is Polish. Is there a countable dense subset of $C[0,T]$ or $C[0,\infty)$; namely, are these spaces Polish as well? The answer is yes, but we will get to this later.

Problem 3. Given a set $S$, consider the metric defined by $\rho(x,x)=0$, $\rho(x,y)=1$ for $x\ne y$. Show that $(S,\rho)$ is a metric space. Suppose $S$ is uncountable. Show that $S$ is not separable.

Given $x\in S$ and $r>0$, define the ball with radius $r$ to be $B(x,r)=\{y\in S:\rho(x,y)\le r\}$. A set $A\subset S$ is defined to be open if for every $x\in A$ there exists $\epsilon>0$ such that $B(x,\epsilon)\subset A$. A set $A$ is defined to be closed if $A^c=S\setminus A$ is open. The empty set is assumed to be open and closed by definition. It is easy to check that a union of open sets is open and an intersection of closed sets is closed. Also it is easy to check that a finite intersection of open sets is open and a finite union of closed sets is closed. Every interval $(a,b)$ is open, and every countable union of intervals $\cup_{i\ge 1}(a_i,b_i)$ is an open set. Show that a union of closed intervals $[a_i,b_i]$, $i\ge 1$, is not necessarily closed.

For every set $A$ define its interior $A^o$ as the union of all open sets $U\subset A$. This set is open (check). For every set $A$ define its closure $\bar A$ as the intersection of all closed sets $V\supset A$. This set is closed. For every set $A$ define its boundary $\partial A$ as $\bar A\setminus A^o$. Examples of open sets are open balls $B^o(x,r)=\{y\in S:\rho(x,y)<r\}\subset B(x,r)$ (check this). A set $K\subset S$ is defined to be compact if every sequence $x_n\in K$ contains a converging subsequence $x_{n_k}\to x$ with $x\in K$. It can be shown that $K\subset\mathbb{R}^d$ is compact if and only if $K$ is closed and bounded, namely $\sup_{x\in K}\|x\|<\infty$ (this applies to any $L_p$ metric). Prove that every compact set is closed.

Proposition 1. Given a metric space $(S,\rho)$, a set $K$ is compact iff every cover of $K$ by open sets contains a finite subcover. Namely, if $U_r$, $r\in R$, is a (possibly uncountable) family of open sets such that $K\subset\cup_r U_r$, then there exists a finite subset $r_1,\ldots,r_m\in R$ such that $K\subset\cup_{1\le i\le m}U_{r_i}$.

    We skip the proof of this fact, but it can be found in any book on topology.

Problem 4. Give an example of a closed bounded set $K\subset C[0,T]$ which is not compact.

Definition 4. Given two metric spaces $(S_1,\rho_1)$, $(S_2,\rho_2)$, a mapping $f:S_1\to S_2$ is defined to be continuous at $x\in S_1$ if for every $\epsilon>0$ there exists $\delta>0$ such that $f(B(x,\delta))\subset B(f(x),\epsilon)$. Equivalently, for every $y$ such that $\rho_1(x,y)<\delta$ we must have $\rho_2(f(x),f(y))<\epsilon$. And again equivalently, for every sequence $x_n\in S_1$ converging to $x\in S_1$, it is also true that $f(x_n)$ converges to $f(x)$.

A mapping $f$ is defined to be continuous if it is continuous at every $x\in S_1$. A mapping is uniformly continuous if for every $\epsilon>0$ there exists $\delta>0$ such that $\rho_1(x,y)<\delta$ implies $\rho_2(f(x),f(y))<\epsilon$.

Problem 5. Show that $f$ is a continuous mapping if and only if for every open set $U\subset S_2$, $f^{-1}(U)$ is an open set in $S_1$.

Proposition 2. Suppose $K\subset S_1$ is compact. If $f:K\to\mathbb{R}^d$ is continuous then it is also uniformly continuous. Also, there exists $x_0\in K$ satisfying $\|f(x_0)\|=\sup_{x\in K}\|f(x)\|$, for any norm $\|\cdot\|=\|\cdot\|_p$.


Proof. This is where the alternative characterization of compactness provided in Proposition 1 is useful. Fix $\epsilon>0$. For every $x\in K$ find $\delta=\delta(x)$ such that $f(B(x,\delta(x)))\subset B(f(x),\epsilon)$. This is possible by continuity of $f$. Then $K\subset\cup_{x\in K}B^o(x,\delta(x)/2)$ (recall that $B^o$ is the open version of $B$). Namely, we have an open cover of $K$. By Proposition 1, there exists a finite subcover $K\subset\cup_{1\le i\le k}B^o(x_i,\delta(x_i)/2)$. Let $\delta=\min_{1\le i\le k}\delta(x_i)$. This value is positive since $k$ is finite. Consider any two points $y,z\in K$ such that $\rho_1(y,z)<\delta/2$. We just showed that there exists $i$, $1\le i\le k$, such that $\rho_1(x_i,y)\le\delta(x_i)/2$. By the triangle inequality $\rho_1(x_i,z)<\delta(x_i)/2+\delta/2\le\delta(x_i)$. Namely, both $y$ and $z$ belong to $B^o(x_i,\delta(x_i))$. Then $f(y),f(z)\in B(f(x_i),\epsilon)$. By the triangle inequality we have $\|f(y)-f(z)\|\le\|f(y)-f(x_i)\|+\|f(z)-f(x_i)\|<2\epsilon$. We conclude that for every two points $y,z$ such that $\rho_1(y,z)<\delta/2$ we have $\|f(y)-f(z)\|<2\epsilon$. The uniform continuity is established. Notice that in this proof the only property of the target space $\mathbb{R}^d$ we used is that it is a metric space. In fact, this part of the proposition is true if $\mathbb{R}^d$ is replaced by any metric space $(S_2,\rho_2)$.

Now let us show the existence of $x_0\in K$ satisfying $\|f(x_0)\|=\sup_{x\in K}\|f(x)\|$. First let us show that $\sup_{x\in K}\|f(x)\|<\infty$. If this is not true, identify a sequence $x_n\in K$ such that $\|f(x_n)\|\to\infty$. Since $K$ is compact, there exists a subsequence $x_{n_k}$ which converges to some point $y\in K$. Since $f$ is continuous, $f(x_{n_k})\to f(y)$, but this contradicts $\|f(x_n)\|\to\infty$. Thus $\sup_{x\in K}\|f(x)\|<\infty$. Find a sequence $x_n$ satisfying $\lim_n\|f(x_n)\|=\sup_{x\in K}\|f(x)\|$. Since $K$ is compact there exists a converging subsequence $x_{n_k}\to x_0$. Again using continuity of $f$ we have $f(x_{n_k})\to f(x_0)$. But $\|f(x_{n_k})\|\to\sup_{x\in K}\|f(x)\|$. We conclude $\|f(x_0)\|=\sup_{x\in K}\|f(x)\|$.

We mentioned that the sets in $\mathbb{R}^d$ which are compact are exactly the bounded closed sets. What about $C[0,T]$? We will need a characterization of compact sets in this space later, when we analyze tightness properties and the construction of Brownian motion.

Given $x\in C[0,T]$ and $\delta>0$, define $w_x(\delta)=\sup_{s,t:|s-t|\le\delta}|x(s)-x(t)|$, the modulus of continuity of $x$. Observe that $x$ is uniformly continuous on $[0,T]$ if and only if for every $\epsilon>0$ there exists $\delta>0$ such that $w_x(\delta)<\epsilon$.
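On a finite grid, $w_x(\delta)$ can be approximated directly. The sketch below is my own illustration: a smooth function has a small modulus, while a function with a jump keeps $w_x(\delta)=1$ for every positive $\delta$ the grid can resolve.

```python
def modulus_of_continuity(x, delta, grid):
    """Approximate w_x(delta) = sup over |s - t| <= delta of |x(s) - x(t)| on a grid."""
    return max(abs(x(s) - x(t))
               for i, s in enumerate(grid)
               for t in grid[i:] if t - s <= delta)

grid = [i / 1000.0 for i in range(1001)]
smooth = lambda t: t * t                      # uniformly continuous on [0, 1]
jump = lambda t: 1.0 if t >= 0.5 else 0.0     # a jump: w_x stays 1 for all delta > 0
assert modulus_of_continuity(smooth, 0.01, grid) < 0.03
assert modulus_of_continuity(jump, 0.01, grid) == 1.0
```

This is exactly why condition (2) of the theorem below rules out jumps in a compact family.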

Theorem 1 (Arzelà-Ascoli Theorem). A set $A\subset C[0,T]$ is compact if and only if it is closed and

$$\sup_{x\in A}|x(0)|<\infty, \qquad (1)$$

    and

$$\lim_{\delta\to 0}\sup_{x\in A}w_x(\delta)=0. \qquad (2)$$

Proof. We only show that if $A$ is compact then (1) and (2) hold. The converse is established using similar mathematical analysis/topology arguments.

We already know that if $A$ is compact it needs to be closed. The assertion (1) follows from Proposition 2. We now show (2). For any $s,t\in[0,T]$ we have

$$|y(t)-y(s)|\le|y(t)-x(t)|+|x(t)-x(s)|+|x(s)-y(s)|\le|x(t)-x(s)|+2\|x-y\|.$$

Similarly we show that $|x(t)-x(s)|\le|y(t)-y(s)|+2\|x-y\|$. Therefore for every $\delta>0$,

$$|w_x(\delta)-w_y(\delta)|\le 2\|x-y\|. \qquad (3)$$

We now show (2). Check that (2) is equivalent to

$$\lim_n\sup_{x\in A}w_x(1/n)=0. \qquad (4)$$

Suppose $A$ is compact but (4) does not hold. Then we can find a subsequence $x_{n_i}\in A$, $i\ge 1$, such that $w_{x_{n_i}}(1/n_i)\ge c$ for some $c>0$. Since $A$ is compact, there is a further subsequence of $x_{n_i}$ which converges to some $x\in A$. To ease the notation we denote this subsequence again by $x_{n_i}$. Thus $\|x_{n_i}-x\|\to 0$. From (3) we obtain

$$|w_x(1/n_i)-w_{x_{n_i}}(1/n_i)|\le 2\|x-x_{n_i}\|<c/2$$

for all $i$ larger than some $i_0$. This implies that

$$w_x(1/n_i)\ge c/2, \qquad (5)$$

for all sufficiently large $i$. But $x$ is continuous on $[0,T]$, which implies it is uniformly continuous, as $[0,T]$ is compact. This contradicts (5).


2 Convergence of mappings

Given two metric spaces $(S_1,\rho_1)$, $(S_2,\rho_2)$, a sequence of mappings $f_n:S_1\to S_2$ is defined to be point-wise converging to $f:S_1\to S_2$ if for every $x\in S_1$ we have $\rho_2(f_n(x),f(x))\to 0$. A sequence $f_n$ is defined to converge to $f$ uniformly if

$$\lim_n\sup_{x\in S_1}\rho_2(f_n(x),f(x))=0.$$

Also, given $K\subset S_1$, the sequence $f_n$ is said to converge to $f$ uniformly on $K$ if the restrictions of $f_n$, $f$ onto $K$ give a uniform convergence. A sequence $f_n$ is said to converge to $f$ uniformly on compact sets (u.o.c.) if $f_n$ converges uniformly to $f$ on every compact set $K\subset S_1$.

Problem 6. Let $S_1=[0,\infty)$ and let $S_2$ be arbitrary. Show that $f_n$ converges to $f$ uniformly on compact sets if and only if for every $T>0$

$$\lim_n\sup_{0\le t\le T}\rho_2(f_n(t),f(t))=0.$$

Point-wise convergence does not imply uniform convergence even on compact sets. Consider $x_n(t)=nt$ for $t\in[0,1/n]$, $=n(2/n-t)$ for $t\in[1/n,2/n]$, and $=0$ for $t\in[2/n,1]$. Then $x_n$ converges to the zero function point-wise but not uniformly. Moreover, if $f_n$ is continuous and $f_n$ converges to $f$ point-wise, this does not imply in general that $f$ is continuous. Indeed, let $f_n(x)=1/(nx+1)$, $x\in[0,1]$. Then $f_n$ converges to $0$ point-wise everywhere except at $x=0$, where it converges to $1$. The limiting function is discontinuous. However, uniform convergence does imply continuity of the limit, as we are about to show.
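The tent functions $x_n$ above can be checked numerically. The snippet below is my own illustration: at a fixed $t$ the values are eventually $0$, while the peak value $x_n(1/n)=1$ never shrinks, so the convergence is not uniform.

```python
def tent(n):
    """x_n(t): rises linearly to 1 on [0, 1/n], falls to 0 on [1/n, 2/n], then 0."""
    def x(t):
        if t <= 1.0 / n:
            return n * t
        if t <= 2.0 / n:
            return n * (2.0 / n - t)
        return 0.0
    return x

# Point-wise: at fixed t = 0.3 the values are eventually 0.
assert tent(100)(0.3) == 0.0
# Not uniform: the peak value at t = 1/n is always 1.
assert all(abs(tent(n)(1.0 / n) - 1.0) < 1e-9 for n in (10, 100, 1000))
```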

Proposition 3. Suppose $f_n:S_1\to S_2$ is a sequence of continuous mappings which converges uniformly to $f$. Then $f$ is continuous as well.

Proof. Fix $x\in S_1$ and $\epsilon>0$. There exists $n_0$ such that for all $n>n_0$, $\sup_z\rho_2(f_n(z),f(z))<\epsilon/3$. Fix any such $n>n_0$. Since, by assumption, $f_n$ is continuous, there exists $\delta>0$ such that $\rho_2(f_n(x),f_n(y))<\epsilon/3$ for all $y\in B^o(x,\delta)$. Then for any such $y$ we have

$$\rho_2(f(x),f(y))\le\rho_2(f(x),f_n(x))+\rho_2(f_n(x),f_n(y))+\rho_2(f_n(y),f(y))<3\epsilon/3=\epsilon.$$

This proves continuity of $f$.

Theorem 2. The spaces $C[0,T]$, $C[0,\infty)$ are Polish.


Problem 7. Use Proposition 3 (or anything else useful) to prove that $C[0,T]$ is complete.

That $C[0,T]$ has a dense countable subset can be shown via approximations by polynomials with rational coefficients (we skip the details).

3 Skorohod space and Skorohod metric

The space $C[0,\infty)$ equipped with the uniform metric will be convenient when we discuss Brownian motion and its applications later in the course, since Brownian motion has continuous samples. Many important processes in practice, however, including queueing processes, storage, manufacturing, supply chain, etc., are not continuous, due to the discrete quantities involved. As a result we need to deal with probability concepts on spaces of not necessarily continuous functions.

Denote by $D[0,\infty)$ the space of all functions $x$ on $[0,\infty)$ taking values in $\mathbb{R}$, or in general any metric space $(S,\rho)$, such that $x$ is right-continuous and has left limits. Namely, for every $t_0$, $\lim_{t\uparrow t_0}x(t)$ and $\lim_{t\downarrow t_0}x(t)$ exist, and $\lim_{t\downarrow t_0}x(t)=x(t_0)$. As an example, think about a process describing the number of customers in a branch of a bank. This process is described by a piece-wise constant function. We adopt the convention that at a moment when a customer arrives/departs, the number of customers is identified with the number of customers right after the arrival/departure. This makes the process right-continuous. It also has left limits, since it is piece-wise constant.

Similarly, define $D[0,T]$ to be the space of right-continuous functions on $[0,T]$ with left limits (we will write shortly RCLL). On $D[0,T]$ and $D[0,\infty)$ we would like to define a metric which measures some proximity between the functions (processes). We can try to use the uniform metric again. Let us consider the following two processes $x,y\in D[0,T]$. Fix $\alpha\in[0,T)$ and $\delta>0$ such that $\alpha+\delta<T$, and define $x(z)=1\{z\ge\alpha\}$, $y(z)=1\{z\ge\alpha+\delta\}$. We see that $x$ and $y$ coincide everywhere except for the small interval $[\alpha,\alpha+\delta)$. It makes sense to assume that these processes are close to each other. Yet $\|x-y\|_T=1$. Thus the uniform metric is inadequate. For this reason Skorohod introduced the so-called Skorohod metric. Before we define the Skorohod metric let us discuss the idea behind it. The problem with the uniform metric was that the two processes $x,y$ described above were close to each other in the sense that one is a perturbed version of the other, where the amount of perturbation is $\delta$. In particular, consider the following piece-wise linear function $\lambda:[0,T]\to[0,T]$ given by

$$\lambda(t)=\frac{\alpha}{\alpha+\delta}\,t,\quad t\in[0,\alpha+\delta];\qquad \lambda(t)=\alpha+\frac{T-\alpha}{T-\alpha-\delta}\,(t-\alpha-\delta),\quad t\in[\alpha+\delta,T].$$



We see that $x(\lambda(t))=y(t)$. In other words, we rescaled the time axis $[0,T]$ by a small amount and made $y$ close to (in fact identical to) $x$. This motivates the following definition. From here on we use the following notation: $x\wedge y$ stands for $\min(x,y)$ and $x\vee y$ stands for $\max(x,y)$.

Definition 5. Let $\Lambda$ be the space of strictly increasing continuous functions from $[0,T]$ onto $[0,T]$. The Skorohod metric on $D[0,T]$ is defined by

$$s(x,y)=\inf_{\lambda\in\Lambda}\big(\|\lambda-I\|\vee\|x-y\circ\lambda\|\big),$$

for all $x,y\in D[0,T]$, where $I$ is the identity transformation and $\|\cdot\|$ is the uniform metric on $D[0,T]$.

Thus, per this definition, the distance between $x$ and $y$ is less than $\epsilon$ if there exists $\lambda\in\Lambda$ such that $\sup_{0\le t\le T}|\lambda(t)-t|<\epsilon$ and $\sup_{0\le t\le T}|x(t)-y(\lambda(t))|<\epsilon$.
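For the two step functions above, a single candidate $\lambda$ already certifies that the Skorohod distance is at most $\delta$, while the uniform distance is $1$. The sketch below is my own illustration; the grid evaluation is only an approximation of the suprema.

```python
# x = 1{t >= alpha}, y = 1{t >= alpha + delta} on [0, T]: the time change lam
# maps the jump of y onto the jump of x, certifying s(x, y) <= delta.
alpha, delta, T = 0.5, 0.05, 1.0
x = lambda t: 1.0 if t >= alpha else 0.0
y = lambda t: 1.0 if t >= alpha + delta else 0.0
b = alpha + delta

def lam(t):
    # piece-wise linear: [0, b] -> [0, alpha], then [b, T] -> [alpha, T]
    if t <= b:
        return alpha * t / b
    return alpha + (T - alpha) * (t - b) / (T - b)

grid = [T * i / 1000 for i in range(1001)]
time_change = max(abs(lam(t) - t) for t in grid)     # at most delta
matched = max(abs(x(lam(t)) - y(t)) for t in grid)   # 0: x o lam equals y
uniform = max(abs(x(t) - y(t)) for t in grid)        # 1 despite the closeness
assert time_change <= delta + 1e-9 and matched == 0.0 and uniform == 1.0
```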

Problem 8. Establish that $s$ is a metric on $D[0,T]$.

Proposition 4. The Skorohod metric and the uniform metric are equivalent on $C[0,T]$, in the sense that for $x_n,x\in C[0,T]$ the convergence $x_n\to x$ holds under the Skorohod metric if and only if it holds under the uniform metric.

Proof. Clearly $s(x,y)\le\|x-y\|$ (take $\lambda=I$). So convergence under the uniform metric implies convergence under the Skorohod metric. Suppose now $s(x_n,x)\to 0$. We need to show $\|x_n-x\|\to 0$.

Consider any sequence $\lambda_n\in\Lambda$ such that $\|\lambda_n-I\|\to 0$ and $\|x-x_n\circ\lambda_n\|\to 0$. Such a sequence exists since $s(x_n,x)\to 0$ (check). We have

$$\|x-x_n\|=\|x\circ\lambda_n-x_n\circ\lambda_n\|\le\|x\circ\lambda_n-x\|+\|x-x_n\circ\lambda_n\|.$$

The second summand on the right-hand side converges to zero by the choice of $\lambda_n$. Also, since $\lambda_n$ converges to $I$ uniformly, and $x$ is continuous on $[0,T]$ and therefore uniformly continuous on $[0,T]$, we have $\|x\circ\lambda_n-x\|\to 0$.

4 Additional reading materials

    Billingsley [1], Appendix M1-M10.

    References

    [1] P. Billingsley, Convergence of probability measures, Wiley-Interscience publication, 1999.


MIT OpenCourseWare, http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes, Fall 2013

    For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


    MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013

    Large Deviations for i.i.d. Random Variables

Content. Chernoff bound using exponential moment generating functions. Properties of moment generating functions. Legendre transforms.

1 Preliminary notes

The Weak Law of Large Numbers tells us that if $X_1,X_2,\ldots$ is an i.i.d. sequence of random variables with mean $\mu=E[X_1]<\infty$, then for every $\epsilon>0$

$$P\Big(\Big|\frac{X_1+\cdots+X_n}{n}-\mu\Big|>\epsilon\Big)\to 0,$$

as $n\to\infty$.

But how quickly does this convergence to zero occur? We can try to use the Chebyshev inequality, which says

$$P\Big(\Big|\frac{X_1+\cdots+X_n}{n}-\mu\Big|>\epsilon\Big)\le\frac{\operatorname{Var}(X_1)}{n\epsilon^2}.$$

This suggests a decay rate of order $1/n$ if we treat $\operatorname{Var}(X_1)$ and $\epsilon$ as constants. Is this an accurate rate? Far from so...

In fact, if a higher moment of $X_1$ is finite, for example $E[X_1^{2m}]<\infty$, then using a similar bound one can show that the decay rate is at least $1/n^m$ (exercise).

The goal of the large deviations theory is to show that in many interesting cases the decay rate is in fact exponential: $e^{-cn}$. The exponent $c>0$ is called the large deviations rate, and in many cases it can be computed explicitly or numerically.


2 Large deviations upper bound (Chernoff bound)

Consider an i.i.d. sequence with a common probability distribution function $F(x)=P(X\le x)$, $x\in\mathbb{R}$. Fix a value $a>\mu$, where $\mu$ is again the expectation corresponding to the distribution $F$. We consider the probability that the average of $X_1,\ldots,X_n$ exceeds $a$. The WLLN tells us that this happens with probability converging to zero as $n$ increases, and now we obtain an estimate on this probability. Fix a positive parameter $\theta>0$. We have

$$P\Big(\frac{\sum_{1\le i\le n}X_i}{n}>a\Big)=P\Big(\sum_{1\le i\le n}X_i>na\Big)=P\big(e^{\theta\sum_{1\le i\le n}X_i}>e^{\theta na}\big)\le\frac{E[e^{\theta\sum_{1\le i\le n}X_i}]}{e^{\theta na}}=\frac{E[\prod_i e^{\theta X_i}]}{(e^{\theta a})^n},$$

where the inequality is Markov's inequality. But recall that the $X_i$ are i.i.d. Therefore $E[\prod_i e^{\theta X_i}]=(E[e^{\theta X_1}])^n$. Thus we obtain the upper bound

$$P\Big(\frac{\sum_{1\le i\le n}X_i}{n}>a\Big)\le\Big(\frac{E[e^{\theta X_1}]}{e^{\theta a}}\Big)^n. \qquad (1)$$

Of course this bound is meaningful only if the ratio $E[e^{\theta X_1}]/e^{\theta a}$ is less than unity. We recognize $E[e^{\theta X_1}]$ as the moment generating function of $X_1$ and denote it by $M(\theta)$. For the bound to be useful, we need $E[e^{\theta X_1}]$ to be at least finite. If we could show that this ratio is less than unity, we would be done: exponentially fast decay of the probability would be established.

Similarly, suppose we want to estimate

$$P\Big(\frac{\sum_{1\le i\le n}X_i}{n}<a\Big),$$

for some $a<\mu$. Fixing now a negative $\theta<0$, we obtain

$$P\Big(\frac{\sum_{1\le i\le n}X_i}{n}<a\Big)=P\big(e^{\theta\sum_{1\le i\le n}X_i}>e^{\theta na}\big)\le\Big(\frac{M(\theta)}{e^{\theta a}}\Big)^n,$$


and now we need to find a negative $\theta$ such that $M(\theta)<e^{\theta a}$. In particular, we need to focus on $\theta$ for which the moment generating function is finite. For this purpose let $D(M)\triangleq\{\theta:M(\theta)<\infty\}$. Namely, $D(M)$ is the set of values $\theta$ for which the moment generating function is finite. Thus we call $D$ the domain of $M$.
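The bound (1) can be tested by simulation. The sketch below is my own illustration with arbitrary parameter choices: it uses i.i.d. Exp(1) variables, for which $M(\theta)=1/(1-\theta)$ for $\theta<1$, and the minimizing $\theta=1-1/a$ derived later in the lecture.

```python
import math
import random

def chernoff_ratio(mgf, a, theta):
    """One-step ratio M(theta) / e^{theta a}; the bound (1) is this ratio to the n."""
    return mgf(theta) * math.exp(-theta * a)

random.seed(0)
n, trials, a = 100, 1000, 1.2
mgf = lambda th: 1.0 / (1.0 - th)     # Exp(1) moment generating function, theta < 1
theta = 1.0 - 1.0 / a                 # minimizing theta for Exp(1), derived later
bound = chernoff_ratio(mgf, a, theta) ** n
hits = sum(
    sum(random.expovariate(1.0) for _ in range(n)) > a * n
    for _ in range(trials)
)
assert hits / trials <= bound < 1.0   # empirical tail frequency respects the bound
```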

    3 Moment generating function. Examples and properties

    Let us consider some examples of computing the moment generating functions.

Exponential distribution. Consider an exponentially distributed random variable $X$ with parameter $\lambda$. Then

$$M(\theta)=\int_0^\infty e^{\theta x}\lambda e^{-\lambda x}\,dx=\lambda\int_0^\infty e^{-(\lambda-\theta)x}\,dx.$$

When $\theta<\lambda$ this integral is equal to $\lambda\big[-\frac{1}{\lambda-\theta}e^{-(\lambda-\theta)x}\big]_0^\infty=\lambda/(\lambda-\theta)$. But when $\theta\ge\lambda$, the integral is infinite. Thus the moment generating function is finite iff $\theta<\lambda$, in which case $M(\theta)=\lambda/(\lambda-\theta)$. The domain of the moment generating function is $D(M)=(-\infty,\lambda)$.

Standard Normal distribution. When $X$ has the standard Normal distribution, we obtain

$$M(\theta)=E[e^{\theta X}]=\frac{1}{\sqrt{2\pi}}\int e^{\theta x}e^{-\frac{x^2}{2}}\,dx=\frac{1}{\sqrt{2\pi}}\int e^{-\frac{x^2-2\theta x+\theta^2}{2}}e^{\frac{\theta^2}{2}}\,dx=e^{\frac{\theta^2}{2}}\frac{1}{\sqrt{2\pi}}\int e^{-\frac{(x-\theta)^2}{2}}\,dx.$$

Introducing the change of variables $y=x-\theta$, we obtain that the integral is equal to $\frac{1}{\sqrt{2\pi}}\int e^{-\frac{y^2}{2}}\,dy=1$ (integral of the density of the standard Normal distribution). Therefore $M(\theta)=e^{\frac{\theta^2}{2}}$. We see that it is always finite and $D(M)=\mathbb{R}$.

In retrospect it is not surprising that in this case $M(\theta)$ is finite for all $\theta$. The density of the standard Normal distribution decays like $e^{-\frac{x^2}{2}}$, and this is faster than just exponential growth $e^{\theta x}$. So no matter how large $\theta$ is, the overall product is finite.

Poisson distribution. Suppose $X$ has a Poisson distribution with parameter $\lambda$. Then

$$M(\theta)=E[e^{\theta X}]=\sum_{m=0}^\infty e^{\theta m}e^{-\lambda}\frac{\lambda^m}{m!}=e^{-\lambda}\sum_{m=0}^\infty\frac{(\lambda e^{\theta})^m}{m!}=e^{\lambda(e^{\theta}-1)},$$

where we use the formula $\sum_{m\ge 0}\frac{t^m}{m!}=e^t$. Thus again $D(M)=\mathbb{R}$. This again has to do with the fact that $\lambda^m/m!$ decays at a rate similar to $1/m!$, which is faster than any exponential growth rate $e^{\theta m}$.
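These closed forms can be sanity-checked without any probability machinery. The sketch below is my own: it verifies the exponential case by numerical integration and the Poisson case by truncating the series.

```python
import math

# Exponential(lambda = 1): integrate e^{theta x} * e^{-x} on [0, 40] (midpoint rule);
# for theta = 0.5 the answer should be close to 1 / (1 - 0.5) = 2.
theta, lam = 0.5, 1.0
h = 1e-3
numeric = sum(
    math.exp(theta * (i + 0.5) * h) * lam * math.exp(-lam * (i + 0.5) * h) * h
    for i in range(int(40.0 / h))
)
assert abs(numeric - lam / (lam - theta)) < 1e-3

# Poisson(lambda = 2): truncate sum_m e^{theta m} e^{-lam} lam^m / m! at m = 100
# and compare with exp(lam * (e^theta - 1)).
lam = 2.0
series = sum(math.exp(theta * m) * math.exp(-lam) * lam ** m / math.factorial(m)
             for m in range(101))
assert abs(series - math.exp(lam * (math.exp(theta) - 1.0))) < 1e-9
```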

    We now establish several properties of the moment generating functions.

Proposition 1. The moment generating function $M(\theta)$ of a random variable $X$ satisfies the following properties:

(a) $M(0)=1$. If $M(\theta)<\infty$ for some $\theta>0$, then $M(\theta')<\infty$ for all $\theta'\in[0,\theta]$. Similarly, if $M(\theta)<\infty$ for some $\theta<0$, then $M(\theta')<\infty$ for all $\theta'\in[\theta,0]$. In particular, the domain $D(M)$ is an interval containing zero.

(b) Suppose $(\theta_1,\theta_2)\subset D(M)$. Then $M(\theta)$ as a function of $\theta$ is differentiable at every $\theta_0\in(\theta_1,\theta_2)$, and furthermore,

$$\frac{dM(\theta)}{d\theta}\Big|_{\theta=\theta_0}=E[Xe^{\theta_0 X}]<\infty.$$

Namely, the order of the differentiation and expectation operators can be interchanged.

Proof. Part (a) is left as an exercise. We now establish part (b). Fix any $\theta_0\in(\theta_1,\theta_2)$ and consider the $\theta$-indexed family of random variables

$$Y_\theta\triangleq\frac{\exp(\theta X)-\exp(\theta_0 X)}{\theta-\theta_0}.$$

Since $\frac{d}{d\theta}\exp(\theta x)=x\exp(\theta x)$, almost surely $Y_\theta\to X\exp(\theta_0 X)$ as $\theta\to\theta_0$. Thus to establish the claim it suffices to show that convergence of expectations holds as well, namely $\lim_{\theta\to\theta_0}E[Y_\theta]=E[X\exp(\theta_0 X)]$, and $E[X\exp(\theta_0 X)]<\infty$. For this purpose we will use the Dominated Convergence Theorem. Namely, we will identify a random variable $Z$ such that $|Y_\theta|\le Z$ almost surely for all $\theta$ in some interval $(\theta_0-\epsilon,\theta_0+\epsilon)$, and $E[Z]<\infty$.

Fix $\epsilon>0$ small enough so that $(\theta_0-\epsilon,\theta_0+\epsilon)\subset(\theta_1,\theta_2)$. Let $Z=\epsilon^{-1}\exp(\theta_0 X+\epsilon|X|)$. Using the Taylor expansion of the $\exp(\cdot)$ function, for every $\theta\in(\theta_0-\epsilon,\theta_0+\epsilon)$ we have

$$Y_\theta=\exp(\theta_0 X)\Big(X+\frac{1}{2!}(\theta-\theta_0)X^2+\frac{1}{3!}(\theta-\theta_0)^2X^3+\cdots+\frac{1}{n!}(\theta-\theta_0)^{n-1}X^n+\cdots\Big),$$

which gives

$$|Y_\theta|\le\exp(\theta_0 X)\Big(|X|+\frac{1}{2!}|\theta-\theta_0||X|^2+\cdots+\frac{1}{n!}|\theta-\theta_0|^{n-1}|X|^n+\cdots\Big)\le\exp(\theta_0 X)\Big(|X|+\frac{1}{2!}\epsilon|X|^2+\cdots+\frac{1}{n!}\epsilon^{n-1}|X|^n+\cdots\Big)=\exp(\theta_0 X)\,\epsilon^{-1}(\exp(\epsilon|X|)-1)\le\exp(\theta_0 X)\,\epsilon^{-1}\exp(\epsilon|X|)=Z.$$

It remains to show that $E[Z]<\infty$. We have

$$E[Z]=\epsilon^{-1}E[\exp(\theta_0 X+\epsilon X)1\{X\ge 0\}]+\epsilon^{-1}E[\exp(\theta_0 X-\epsilon X)1\{X<0\}]\le\epsilon^{-1}E[\exp((\theta_0+\epsilon)X)]+\epsilon^{-1}E[\exp((\theta_0-\epsilon)X)]=\epsilon^{-1}M(\theta_0+\epsilon)+\epsilon^{-1}M(\theta_0-\epsilon)<\infty,$$

since $\epsilon$ was chosen so that $(\theta_0-\epsilon,\theta_0+\epsilon)\subset(\theta_1,\theta_2)\subset D(M)$. This completes the proof of the proposition.

Problem 1.

(a) Establish part (a) of Proposition 1.

(b) Construct an example of a random variable for which the corresponding interval is trivial, $\{0\}$. Namely, $M(\theta)=\infty$ for every $\theta\ne 0$.

(c) Construct an example of a random variable $X$ such that $D(M)=[\theta_1,\theta_2]$ for some $\theta_1<0<\theta_2$. Namely, the domain $D$ is a non-zero-length closed interval containing zero.

Now suppose the i.i.d. sequence $X_i$, $i\ge 1$, is such that $0\in(\theta_1,\theta_2)\subset D(M)$, where $M$ is the moment generating function of $X_1$. Namely, $M$ is finite in a neighborhood of $0$. Let $a>\mu=E[X_1]$. Applying Proposition 1, let us differentiate the ratio $M(\theta)/e^{\theta a}$ with respect to $\theta$ at $\theta=0$:

$$\frac{d}{d\theta}\,\frac{M(\theta)}{e^{\theta a}}\Big|_{\theta=0}=\frac{E[X_1e^{\theta X_1}]e^{\theta a}-aE[e^{\theta X_1}]e^{\theta a}}{e^{2\theta a}}\Big|_{\theta=0}=\mu-a<0.$$

Note that $M(\theta)/e^{\theta a}=1$ when $\theta=0$. Therefore, for sufficiently small positive $\theta$, the ratio $M(\theta)/e^{\theta a}$ is smaller than unity, and (1) provides an exponential bound on the tail probability for the average of $X_1,\ldots,X_n$.

Similarly, if $a<\mu$, the ratio $M(\theta)/e^{\theta a}<1$ for sufficiently small negative $\theta$.

    We now summarize our ndings.

Theorem 1 (Chernoff bound). Given an i.i.d. sequence $X_1,\ldots,X_n$, suppose the moment generating function $M(\theta)$ is finite in some interval $(\theta_1,\theta_2)\ni 0$. Let $a>\mu=E[X_1]$. Then there exists $\theta>0$ such that $M(\theta)/e^{\theta a}<1$ and

$$P\Big(\frac{\sum_{1\le i\le n}X_i}{n}>a\Big)\le\Big(\frac{M(\theta)}{e^{\theta a}}\Big)^n.$$

Similarly, if $a<\mu$, then there exists $\theta<0$ such that $M(\theta)/e^{\theta a}<1$ and

$$P\Big(\frac{\sum_{1\le i\le n}X_i}{n}<a\Big)\le\Big(\frac{M(\theta)}{e^{\theta a}}\Big)^n.$$

How small can we make the ratio $M(\theta)/\exp(\theta a)$? We have some freedom in choosing $\theta$, as long as $E[e^{\theta X_1}]$ is finite. So we could try to find the $\theta$ which minimizes the ratio $M(\theta)/e^{\theta a}$. This is what we will do in the rest of the lecture. The surprising conclusion of the large deviations theory is very often that such a minimizing value exists and is tight: namely, it provides the correct decay rate! In this case we will be able to say

$$P\Big(\frac{\sum_{1\le i\le n}X_i}{n}>a\Big)\approx\exp(-I(a,\theta)n),$$

where $I(a,\theta)=-\log\big(M(\theta)/e^{\theta a}\big)=\theta a-\log M(\theta)$.

    6

    ( )

    ( )

4 Legendre transforms

Theorem 1 gave us a large deviations bound $(M(\theta)/e^{\theta a})^n$, which we rewrite as $e^{-n(\theta a-\log M(\theta))}$. We now study in more detail the exponent $\theta a-\log M(\theta)$.

Definition 1. The Legendre transform of a random variable $X$ is the function $I(a)\triangleq\sup_{\theta\in\mathbb{R}}(\theta a-\log M(\theta))$.

    Let us go over the examples of some distributions and compute their corresponding Legendre transforms.

Exponential distribution with parameter $\lambda$. Recall that $M(\theta)=\lambda/(\lambda-\theta)$ when $\theta<\lambda$ and $M(\theta)=\infty$ otherwise. Therefore

$$I(a)=\sup_{\theta<\lambda}\Big(\theta a-\log\frac{\lambda}{\lambda-\theta}\Big)=\sup_{\theta<\lambda}\big(\theta a-\log\lambda+\log(\lambda-\theta)\big).$$

Setting the derivative of $g(\theta)=\theta a-\log\lambda+\log(\lambda-\theta)$ equal to zero, we obtain the equation $a-1/(\lambda-\theta)=0$, which has the unique solution $\theta=\lambda-1/a$. For the boundary cases, $\theta a-\log\lambda+\log(\lambda-\theta)\to-\infty$ when either $\theta\to-\infty$ or $\theta\to\lambda$ (check). Therefore

$$I(a)=a(\lambda-1/a)-\log\lambda+\log(\lambda-\lambda+1/a)=a\lambda-1-\log\lambda+\log(1/a)=a\lambda-1-\log\lambda-\log a.$$
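The closed form can be cross-checked by brute-forcing the supremum over a grid of $\theta$ values. This sketch is my own illustration for $\lambda=1$ and $a=1.2$.

```python
import math

def legendre_numeric(log_mgf, a, lo, hi, steps=100000):
    """Brute-force sup over theta in [lo, hi] of theta * a - log M(theta)."""
    return max(
        (lo + (hi - lo) * i / steps) * a - log_mgf(lo + (hi - lo) * i / steps)
        for i in range(steps + 1)
    )

lam, a = 1.0, 1.2
log_mgf = lambda th: math.log(lam) - math.log(lam - th)   # log(lam/(lam-th)), th < lam
closed = a * lam - 1.0 - math.log(lam) - math.log(a)      # I(a) from the text
numeric = legendre_numeric(log_mgf, a, -5.0, 0.99)
assert abs(closed - numeric) < 1e-6    # optimum attained at theta = lam - 1/a
```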

The large deviations bound then tells us that when $a>1/\lambda$,

$$P\Big(\frac{\sum_{1\le i\le n}X_i}{n}>a\Big)\le e^{-(a\lambda-1-\log\lambda-\log a)n}.$$

    (.2log 1.2)nSay = 1 and a = 1.2. Then the approximation gives us e . Note that we can obtain an exact expression for this tail probability. Indeed, X1, X1 +X2, . . . , X1 +X2 +Xn, . . . are the events of a Poisson process with parameter = 1. Therefore we can compute the probability P( Xi > 1.2n) exactly: it is the probability that the Poisson 1in

    7

  • process has at most n 1 events before time 1.2n. Thus Xi1inP( > 1.2) = P( Xi > 1.2n)

    n 1in

    (1.2n)k 1.2n = e . k!

    0kn1

    It is not at all clear how revealing this expression is. In hindsight, we know that it is approximately e(.2log 1.2)n, obtained via large deviations theory.
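The agreement claimed here is easy to check numerically. The sketch below is my own: it evaluates the exact sum in a log-safe way and compares the per-$n$ decay rate with $I(1.2)=0.2-\log 1.2$.

```python
import math

def exact_tail(n, a):
    """P(X_1 + ... + X_n > a n) for i.i.d. Exp(1), i.e. P(Poisson(a n) <= n - 1)."""
    lam = a * n
    return sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
               for k in range(n))

n, a = 400, 1.2
rate = -math.log(exact_tail(n, a)) / n
I_a = a - 1.0 - math.log(a)            # = 0.2 - log 1.2
# Chernoff gives exact_tail <= exp(-n I(a)), so rate >= I_a; the gap is O(log(n)/n).
assert I_a <= rate < I_a + 0.01
```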

Standard Normal distribution. Recall that $M(\theta)=e^{\frac{\theta^2}{2}}$ when $X_1$ has the standard Normal distribution. The expected value is $\mu=0$. Thus we fix $a>0$ and obtain

$$I(a)=\sup_\theta\Big(\theta a-\frac{\theta^2}{2}\Big)=\frac{a^2}{2},$$

achieved at $\theta=a$. Thus for $a>0$, the large deviations theory predicts that

$$P\Big(\frac{\sum_{1\le i\le n}X_i}{n}>a\Big)\approx e^{-\frac{a^2}{2}n}.$$

Again we could compute this probability directly. We know that $\frac{\sum_{1\le i\le n}X_i}{n}$ is distributed as a Normal random variable with mean zero and variance $1/n$. Thus

$$P\Big(\frac{\sum_{1\le i\le n}X_i}{n}>a\Big)=\sqrt{\frac{n}{2\pi}}\int_a^\infty e^{-\frac{t^2n}{2}}\,dt.$$

After a little bit of technical work one can show that this integral is dominated by its part around $a$, namely $\sqrt{\frac{n}{2\pi}}\int_a^{a+\epsilon}e^{-\frac{t^2n}{2}}\,dt$, which is further approximated by the value of the integrand at $a$ itself, namely $\sqrt{\frac{n}{2\pi}}e^{-\frac{a^2}{2}n}$. This is consistent with the value given by the large deviations theory: the lower-order magnitude term $\sqrt{\frac{n}{2\pi}}$ simply disappears in the approximation on the log scale.
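With the error function this probability is computable exactly, and we can watch the finite-$n$ rate approach $a^2/2$ from above. The sketch below is my own illustration.

```python
import math

def normal_rate(n, a):
    """-(1/n) log P(mean of n standard normals > a), via P = erfc(a sqrt(n/2)) / 2."""
    return -math.log(0.5 * math.erfc(a * math.sqrt(n / 2.0))) / n

a = 1.0
rates = [normal_rate(n, a) for n in (50, 200, 800)]
# The rates decrease toward I(a) = a^2 / 2 = 0.5 as n grows.
assert rates[0] > rates[1] > rates[2] > 0.5
assert rates[2] - 0.5 < 0.01
```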

Poisson distribution. Suppose $X$ has a Poisson distribution with parameter $\lambda$. Recall that in this case $M(\theta)=e^{\lambda(e^{\theta}-1)}$. Then

$$I(a)=\sup_\theta\big(\theta a-\lambda(e^{\theta}-1)\big).$$

Setting the derivative to zero we obtain $\theta=\log(a/\lambda)$, and $I(a)=a\log(a/\lambda)-(a-\lambda)$. Thus for $a>\lambda$, the large deviations theory predicts that

$$P\Big(\frac{\sum_{1\le i\le n}X_i}{n}>a\Big)\approx e^{-(a\log(a/\lambda)-a+\lambda)n}.$$

In this case as well we can compute the large deviations probability explicitly. The sum $X_1+\cdots+X_n$ of Poisson random variables is also a Poisson random variable, with parameter $\lambda n$. Therefore

$$P\Big(\sum_{1\le i\le n}X_i>an\Big)=\sum_{m>an}e^{-\lambda n}\frac{(\lambda n)^m}{m!}.$$

But again it is hard to infer a more explicit rate of decay using this expression.
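Although the sum is analytically opaque, it is easy to evaluate. This sketch is my own: it confirms that the per-$n$ decay rate sits just above $I(a)=a\log(a/\lambda)-(a-\lambda)$.

```python
import math

def poisson_tail(lam_total, threshold, cutoff):
    """P(Poisson(lam_total) > threshold), truncating the sum at m = cutoff."""
    return sum(
        math.exp(-lam_total + m * math.log(lam_total) - math.lgamma(m + 1))
        for m in range(threshold + 1, cutoff + 1)
    )

lam, a, n = 1.0, 2.0, 100
p = poisson_tail(lam * n, int(a * n), 3 * n)   # P(sum of n Poisson(1) > 2n)
rate = -math.log(p) / n
I_a = a * math.log(a / lam) - (a - lam)        # = 2 log 2 - 1
assert I_a <= rate < I_a + 0.05                # Chernoff: rate >= I(a); gap O(log n / n)
```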

5 Additional reading materials

Chapter 0 of [2]. This is a non-technical introduction to the field which describes the motivation and various applications of the large deviations theory. Soft reading.

    Chapter 2.2 of [1].

    References

    [1] A. Dembo and O. Zeitouni, Large deviations techniques and applications, Springer, 1998.

    [2] A. Shwartz and A. Weiss, Large deviations for performance analysis, Chapman and Hall, 1995.



    MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 3 9/11/2013

Large Deviations Theory. Cramér's Theorem

    Content.

1. Cramér's Theorem.

    2. Rate function and properties.

    3. Change of measure technique.

1 Cramér's Theorem

We have established in the previous lecture that, under some assumptions on the moment generating function (MGF) $M(\theta)$, an i.i.d. sequence of random variables $X_i$, $1\le i\le n$, with mean $\mu$ satisfies $P(S_n\ge a)\le\exp(-nI(a))$, where $S_n=n^{-1}\sum_{1\le i\le n}X_i$ and $I(a)\triangleq\sup_\theta(\theta a-\log M(\theta))$ is the Legendre transform. The function $I(a)$ is also commonly called the rate function in the theory of Large Deviations. The bound implies

$$\limsup_n\frac{\log P(S_n\ge a)}{n}\le -I(a),$$

and we have indicated that the bound is tight. Namely, ideally we would like to establish the limit

$$\lim_n\frac{\log P(S_n\ge a)}{n}=-I(a).$$

Furthermore, we might be interested in more complicated rare events, beyond the interval $[a,\infty)$: for example, the likelihood $P(S_n\in A)$ for some set $A\subset\mathbb{R}$ not containing the mean value $\mu$. The Large Deviations theory says that, roughly speaking,

$$\lim_n\frac{1}{n}\log P(S_n\in A)=-\inf_{x\in A}I(x), \qquad (1)$$


but unfortunately this statement is not precisely correct. Consider the following example. Let $X$ be an integer-valued random variable, and $A=\{m/p:m\in\mathbb{Z},\ p\text{ is an odd prime}\}$. Then for prime $n$ we have $P(S_n\in A)=1$; but for $n=2^k$ we have $P(S_n\in A)=0$. As a result, the limit $\lim_n\frac{\log P(S_n\in A)}{n}$ in this case does not exist.

The sense in which the identity (1) holds is given by Cramér's Theorem below.

Theorem 1 (Cramér's Theorem). Given a sequence of i.i.d. real-valued random variables $X_i$, $i\ge 1$, with a common moment generating function $M(\theta)=E[\exp(\theta X_1)]$, the following holds:

(a) For any closed set $F\subset\mathbb{R}$,

$$\limsup_n\frac{1}{n}\log P(S_n\in F)\le-\inf_{x\in F}I(x).$$

(b) For any open set $U\subset\mathbb{R}$,

$$\liminf_n\frac{1}{n}\log P(S_n\in U)\ge-\inf_{x\in U}I(x).$$

We will prove the theorem only for the special case when $D(M)=\mathbb{R}$ (namely, the MGF is finite everywhere) and when the support of $X$ is the entire $\mathbb{R}$. Namely, for every $K>0$, $P(X>K)>0$ and $P(X<-K)>0$. For example, a Gaussian random variable satisfies this property.

To see the power of the theorem, let us apply it to the tail of $S_n$. In the following section we will establish that $I(x)$ is a non-decreasing function on the interval $[\mu, \infty)$. Furthermore, we will establish that if it is finite in some interval containing $x$, it is also continuous at $x$. Thus fix $a \ge \mu$ and suppose $I$ is finite in an interval containing $a$. Taking $F$ to be the closed set $[a,\infty)$ with $a > \mu$, we obtain from part (a)
$$\limsup_n \frac{1}{n}\log\mathbb{P}(S_n \in [a,\infty)) \le -\min_{x\ge a} I(x) = -I(a).$$
Applying the second part of Cramér's Theorem to the open set $(a,\infty)$, we obtain
$$\liminf_n \frac{1}{n}\log\mathbb{P}(S_n \in [a,\infty)) \ge \liminf_n \frac{1}{n}\log\mathbb{P}(S_n \in (a,\infty)) \ge -\inf_{x>a} I(x) = -I(a),$$
where the last equality uses the continuity of $I$ at $a$.


Thus in this special case the large deviations limit indeed exists:
$$\lim_n \frac{1}{n}\log\mathbb{P}(S_n \ge a) = -I(a).$$
The limit is insensitive to whether the inequality is strict, in the sense that we also have
$$\lim_n \frac{1}{n}\log\mathbb{P}(S_n > a) = -I(a).$$
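As a quick numeric sanity check (ours, not from the notes), for standard Gaussian summands the rate function is $I(a) = a^2/2$ and the tail probability is exact: $S_n \sim N(0, 1/n)$, so $-\frac{1}{n}\log\mathbb{P}(S_n \ge a)$ can be computed in closed form and should approach $I(a)$. A minimal sketch:

```python
import math

def rate_gaussian(a):
    # Legendre transform of log M(theta) = theta^2 / 2 is I(a) = a^2 / 2
    return a * a / 2.0

def tail_exponent(n, a):
    # For standard normal summands, S_n ~ N(0, 1/n) exactly, so
    # P(S_n >= a) = erfc(a * sqrt(n / 2)) / 2; return -(1/n) * log of it.
    p = 0.5 * math.erfc(a * math.sqrt(n / 2.0))
    return -math.log(p) / n

for n in (10, 100, 1000):
    print(n, round(tail_exponent(n, 1.0), 4))  # approaches I(1) = 0.5 from above
```

The correction to the exponent vanishes only like $\log n / n$, which is why the convergence in the printout is slow.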

    Properties of the rate function I

Before we prove this theorem, we will need to establish several properties of $I(x)$ and $M(\theta)$.

Proposition 1. The rate function $I$ satisfies the following properties:

(a) $I$ is a convex non-negative function satisfying $I(\mu) = 0$. Furthermore, it is non-decreasing on $[\mu,\infty)$ and non-increasing on $(-\infty,\mu]$. Finally, $I(x) = \sup_{\theta\ge 0}(\theta x - \log M(\theta))$ for every $x \ge \mu$, and $I(x) = \sup_{\theta\le 0}(\theta x - \log M(\theta))$ for every $x \le \mu$.

(b) Suppose in addition that $D(M) = \mathbb{R}$ and the support of $X_1$ is $\mathbb{R}$. Then $I$ is a finite continuous function on $\mathbb{R}$. Furthermore, for every $x \in \mathbb{R}$ we have $I(x) = \theta_0 x - \log M(\theta_0)$ for some $\theta_0 = \theta_0(x)$ satisfying
$$x = \frac{M'(\theta_0)}{M(\theta_0)}. \qquad (2)$$

Proof of part (a). Convexity is due to the fact that $I(x)$ is a point-wise supremum of linear functions of $x$. Precisely, consider $\lambda \in (0,1)$:
$$\begin{aligned}
I(\lambda x + (1-\lambda)y) &= \sup_\theta\big[\theta(\lambda x + (1-\lambda)y) - \log M(\theta)\big] \\
&= \sup_\theta\big[\lambda(\theta x - \log M(\theta)) + (1-\lambda)(\theta y - \log M(\theta))\big] \\
&\le \lambda\sup_\theta(\theta x - \log M(\theta)) + (1-\lambda)\sup_\theta(\theta y - \log M(\theta)) \\
&= \lambda I(x) + (1-\lambda) I(y).
\end{aligned}$$
This establishes the convexity. Now since $M(0) = 1$, we have $I(x) \ge 0\cdot x - \log M(0) = 0$, and the non-negativity is established. By Jensen's inequality, we have that
$$M(\theta) = \mathbb{E}[\exp(\theta X_1)] \ge \exp(\theta\,\mathbb{E}[X_1]) = \exp(\theta\mu).$$


Therefore $\log M(\theta) \ge \theta\mu$, namely $\theta\mu - \log M(\theta) \le 0$, implying $I(\mu) = 0 = \min_{x\in\mathbb{R}} I(x)$.

Furthermore, if $x > \mu$, then for $\theta < 0$ we have $\theta x - \log M(\theta) \le \theta x - \theta\mu = \theta(x-\mu) < 0$. This means that $\sup_\theta(\theta x - \log M(\theta))$ must be equal to $\sup_{\theta\ge0}(\theta x - \log M(\theta))$. Similarly we show that when $x < \mu$, we have $I(x) = \sup_{\theta\le0}(\theta x - \log M(\theta))$.

Next, the monotonicity follows from convexity. Specifically, the existence of real numbers $\mu \le x < y$ such that $I(x) > I(y) \ge I(\mu) = 0$ would violate convexity (check). This completes the proof of part (a).

Proof of part (b). For any $K > 0$ we have
$$\liminf_{\theta\to\infty}\frac{1}{\theta}\log M(\theta) = \liminf_{\theta\to\infty}\frac{1}{\theta}\log\int_{\mathbb{R}}\exp(\theta x)\,dP(x) \ge \liminf_{\theta\to\infty}\frac{1}{\theta}\log\int_{[K,\infty)}\exp(\theta x)\,dP(x) \ge \liminf_{\theta\to\infty}\frac{1}{\theta}\log\big(\exp(\theta K)\,\mathbb{P}([K,\infty))\big) = K + \liminf_{\theta\to\infty}\frac{1}{\theta}\log\mathbb{P}([K,\infty)) = K,$$
since $\mathrm{supp}(X_1) = \mathbb{R}$ implies $\mathbb{P}([K,\infty)) > 0$. Since $K$ is arbitrary,
$$\lim_{\theta\to\infty}\frac{1}{\theta}\log M(\theta) = \infty.$$
Similarly,
$$\lim_{\theta\to-\infty}\frac{1}{\theta}\log M(\theta) = -\infty.$$
Therefore, for each $x$, as $|\theta|\to\infty$ we have
$$\theta x - \log M(\theta) = \theta\Big(x - \frac{\log M(\theta)}{\theta}\Big) \to -\infty.$$
From the previous lecture we know that $M(\theta)$ is differentiable (hence continuous). Therefore the supremum of $\theta x - \log M(\theta)$ is achieved at some finite value $\theta_0 = \theta_0(x)$, namely
$$I(x) = \theta_0 x - \log M(\theta_0) < \infty,$$


where $\theta_0$ is found by setting the derivative of $\theta x - \log M(\theta)$ with respect to $\theta$ to zero; namely, $\theta_0$ must satisfy (2). Since $I$ is a finite convex function on $\mathbb{R}$, it is also continuous (verify this). This completes the proof of part (b).
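Equation (2) can be solved numerically for concrete distributions. The sketch below (our own illustration, not part of the notes) uses a Poisson random variable with mean $\lambda = 2$, for which both $\theta_0(x) = \log(x/\lambda)$ and $I(x) = x\log(x/\lambda) - x + \lambda$ have closed forms, and recovers them by bisection on (2):

```python
import math

lam = 2.0  # X ~ Poisson(2); log M(theta) = lam * (exp(theta) - 1)

def log_M(theta):
    return lam * (math.exp(theta) - 1.0)

def tilted_mean(theta):
    # M'(theta) / M(theta) = lam * exp(theta) for the Poisson MGF
    return lam * math.exp(theta)

def theta_0(x, lo=-30.0, hi=30.0):
    # bisection on the saddle-point equation (2): M'(t)/M(t) = x, x > 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = 3.5
t0 = theta_0(x)
I_num = t0 * x - log_M(t0)
I_exact = x * math.log(x / lam) - x + lam  # known closed form for Poisson
```

Bisection suffices because, as shown above, $M'(\theta)/M(\theta)$ is increasing on $D(M)$.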

Proof of Cramér's Theorem

Now we are equipped to prove Cramér's Theorem.

Proof of Cramér's Theorem. Part (a). Fix a closed set $F \subset \mathbb{R}$. Let $\delta_+ = \min\{x \in [\mu,+\infty)\cap F\}$ and $\delta_- = \max\{x \in (-\infty,\mu]\cap F\}$. Note that $\delta_+$ and $\delta_-$ exist since $F$ is closed. If $\delta_+ = \mu$ then $I(\delta_+) = I(\mu) = 0 = \min_{x\in\mathbb{R}} I(x)$; since $\log\mathbb{P}(S_n\in F) \le 0$, statement (a) follows trivially. Similarly, if $\delta_- = \mu$, we also have statement (a). Thus assume $\delta_- < \mu < \delta_+$. Then
$$\mathbb{P}(S_n \in F) \le \mathbb{P}(S_n \in [\delta_+,\infty)) + \mathbb{P}(S_n \in (-\infty,\delta_-]).$$
Define
$$x_n \triangleq \mathbb{P}(S_n \in [\delta_+,\infty)), \qquad y_n \triangleq \mathbb{P}(S_n \in (-\infty,\delta_-]).$$
We already showed that
$$\mathbb{P}(S_n \ge \delta_+) \le \exp(-n(\theta\delta_+ - \log M(\theta))), \qquad \theta \ge 0,$$
from which we have
$$\frac{1}{n}\log\mathbb{P}(S_n \ge \delta_+) \le -(\theta\delta_+ - \log M(\theta)), \qquad \theta \ge 0,$$
and therefore
$$\frac{1}{n}\log\mathbb{P}(S_n \ge \delta_+) \le -\sup_{\theta\ge0}(\theta\delta_+ - \log M(\theta)) = -I(\delta_+).$$
The last equality is due to the fact that the supremum in $I(\delta_+)$ is achieved at $\theta \ge 0$, which was established as a part of Proposition 1 (since $\delta_+ \ge \mu$). Thus we have
$$\limsup_n \frac{1}{n}\log\mathbb{P}(S_n \ge \delta_+) \le -I(\delta_+). \qquad (3)$$
Similarly, we have
$$\limsup_n \frac{1}{n}\log\mathbb{P}(S_n \le \delta_-) \le -I(\delta_-). \qquad (4)$$
Applying Proposition 1 we have $I(\delta_+) = \min_{x\ge\delta_+} I(x)$ and $I(\delta_-) = \min_{x\le\delta_-} I(x)$. Thus
$$\min\{I(\delta_+), I(\delta_-)\} = \inf_{x\in F} I(x). \qquad (5)$$
From (3)-(5) we have that
$$\limsup_n \frac{1}{n}\log x_n \le -\inf_{x\in F} I(x), \qquad \limsup_n \frac{1}{n}\log y_n \le -\inf_{x\in F} I(x), \qquad (6)$$
which implies that
$$\limsup_n \frac{1}{n}\log(x_n + y_n) \le -\inf_{x\in F} I(x)$$
(you are asked to establish the last implication as an exercise). We have established
$$\limsup_n \frac{1}{n}\log\mathbb{P}(S_n \in F) \le -\inf_{x\in F} I(x). \qquad (7)$$
The proof of the upper bound in statement (a) is complete.
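The exercise behind the step from (6) to (7) is the standard fact that a sum of two exponentially decaying sequences decays at the rate of the slower one: $\frac{1}{n}\log(e^{-na} + e^{-nb}) \to -\min(a,b)$. A tiny numeric illustration (ours):

```python
import math

def exponent_of_sum(n, a, b):
    # (1/n) * log(exp(-n*a) + exp(-n*b)) -> -min(a, b) as n grows
    return math.log(math.exp(-n * a) + math.exp(-n * b)) / n

print([round(exponent_of_sum(n, 2.0, 3.0), 6) for n in (1, 10, 100)])
```

The faster-decaying term contributes only a factor $1 + e^{-n(b-a)}$, whose logarithm vanishes after dividing by $n$.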

Proof of Cramér's Theorem. Part (b). Fix an open set $U \subset \mathbb{R}$. Fix $\epsilon > 0$ and find $y$ such that $I(y) \le \inf_{x\in U} I(x) + \epsilon$. It is sufficient to show that
$$\liminf_n \frac{1}{n}\log\mathbb{P}(S_n \in U) \ge -I(y), \qquad (8)$$
since it will imply
$$\liminf_n \frac{1}{n}\log\mathbb{P}(S_n \in U) \ge -\inf_{x\in U} I(x) - \epsilon,$$
and since $\epsilon > 0$ was arbitrary, it will imply the result.

Thus we now establish (8). Assume $y > \mu$; the case $y < \mu$ is treated similarly. Find $\theta_0 = \theta_0(y)$ such that
$$I(y) = \theta_0 y - \log M(\theta_0).$$
Such $\theta_0$ exists by Proposition 1. Since $y > \mu$, then again by Proposition 1 we may assume $\theta_0 \ge 0$.

We will use the change-of-measure technique to obtain the lower bound. For this, let $X^{\theta_0}$ be a random variable defined by
$$\mathbb{P}(X^{\theta_0} \le z) = \frac{1}{M(\theta_0)}\int_{-\infty}^{z}\exp(\theta_0 x)\,dP(x).$$
Now,
$$\mathbb{E}[X^{\theta_0}] = \frac{1}{M(\theta_0)}\int_{\mathbb{R}} x\exp(\theta_0 x)\,dP(x) = \frac{M'(\theta_0)}{M(\theta_0)} = y,$$
where the second equality was established in the previous lecture, and the last equality follows by the choice of $\theta_0$ and Proposition 1. Since $U$ is open we can find $\delta > 0$ small enough so that $(y-\delta, y+\delta) \subset U$. Thus, we have
$$\mathbb{P}(S_n \in U) \ge \mathbb{P}(S_n \in (y-\delta, y+\delta)) = \int_{|n^{-1}\sum_i x_i - y| < \delta} dP(x_1)\cdots dP(x_n).$$

MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes
Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.


    MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 4 9/16/2013

    Applications of the large deviation technique

    Content.

1. Insurance problem

2. Queueing problem

3. Buffer overflow probability

    Safety capital for an insurance company

Consider an insurance company which needs to decide on the amount of capital $S_0$ it needs to hold to avoid cash flow problems. Suppose the insurance premium per month is a fixed (non-random) quantity $C > 0$, and the claims are i.i.d. random variables $A_N \ge 0$ for the times $N = 1, 2, \ldots$ Then the capital at time $N$ is
$$S_N = S_0 + \sum_{n=1}^{N}(C - A_n).$$
The company wishes to avoid the situation where the cash flow $S_N$ is negative. Thus it needs to choose the capital $S_0$ so that $\mathbb{P}(\exists N: S_N \le 0) \le \delta$ for a small $\delta$. Obviously this involves a trade-off between the smallness of $\delta$ and the amount $S_0$. Let us assume that the upper bound $\delta = 0.001$, namely $0.1\%$, is acceptable (in fact this is pretty close to banking regulation standards). We have
$$\mathbb{P}(\exists N: S_N \le 0) = \mathbb{P}\Big(\min_N\Big(S_0 + \sum_{n=1}^{N}(C - A_n)\Big) \le 0\Big) = \mathbb{P}\Big(\max_N \sum_{n=1}^{N}(A_n - C) \ge S_0\Big).$$
If $\mathbb{E}[A_1] \ge C$, we have $\mathbb{P}(\max_N \sum_{n=1}^{N}(A_n - C) \ge S_0) = 1$. Thus the interesting case is $\mathbb{E}[A_1] < C$ (negative drift), and the goal is to determine the starting capital $S_0$ such that
$$\mathbb{P}\Big(\max_N \sum_{n=1}^{N}(A_n - C) \ge S_0\Big) \le \delta.$$


Buffer overflow in a queueing system

The following model is a variant of the classical so-called GI/GI/1 queueing system. In application to communication systems, this queueing system consists of a single server which processes some $C > 0$ number of communication packets per unit of time; here $C$ is a fixed deterministic constant. Let $A_n$ be the random number of packets arriving at time $n$, and let $Q_n$ be the queue length at time $n$ (assume $Q_0 = 0$). By recursion, we have that
$$Q_N = \max(Q_{N-1} + A_N - C,\ 0) = \max(Q_{N-2} + A_{N-1} + A_N - 2C,\ A_N - C,\ 0) = \max\Big(\max_{1\le n\le N}\sum_{k=0}^{n-1}(A_{N-k} - C),\ 0\Big).$$
Notice that in the distributional sense we have
$$Q_N \stackrel{d}{=} \max\Big(\max_{1\le n\le N}\sum_{k=1}^{n}(A_k - C),\ 0\Big).$$
In steady state, i.e. $N = \infty$, we have
$$Q \stackrel{d}{=} \max\Big(\max_{n\ge1}\sum_{k=1}^{n}(A_k - C),\ 0\Big).$$
Our goal is to design the size of the queue length storage (buffer) $B$ so that the likelihood that the number of packets in the queue exceeds $B$ is small. In communication applications this is important, since every packet not fitting into the buffer is dropped. Thus the goal is to find a buffer size $B > 0$ such that
$$\mathbb{P}(Q \ge B) = \mathbb{P}\Big(\max_{n\ge1}\sum_{k=1}^{n}(A_k - C) \ge B\Big) \le \delta,$$
where $\delta$ is the acceptable overflow probability (e.g. $\delta = 0.001$). If $\mathbb{E}[A_1] \ge C$, we have $\mathbb{P}(Q \ge B) = 1$. So the interesting case is again $\mathbb{E}[A_1] < C$ (negative drift).

Buffer overflow probability

We see that in both situations we need to estimate
$$\mathbb{P}\Big(\max_{n\ge1}\sum_{k=1}^{n}(A_k - C) \ge B\Big).$$
We will do this asymptotically as $B \to \infty$.


Theorem 1. Given an i.i.d. sequence $A_n \ge 0$, $n \ge 1$, and $C > \mathbb{E}[A_1]$, suppose
$$M(\theta) = \mathbb{E}[\exp(\theta A_1)] < \infty \quad \text{for all } \theta \in [0, \theta_0), \text{ for some } \theta_0 > 0.$$
Then
$$\lim_{B\to\infty}\frac{1}{B}\log\mathbb{P}\Big(\max_{n\ge1}\sum_{k=1}^{n}(A_k - C)\ge B\Big) = -\theta^*, \qquad \theta^* = \sup\{\theta > 0: M(\theta) < \exp(C\theta)\}.$$

Observe that since $A_n$ is non-negative, the MGF $\mathbb{E}[\exp(\theta A_n)]$ is finite for $\theta < 0$. Thus it is finite in an interval containing $\theta = 0$, and applying the result of Lecture 2 we can take the derivative of the MGF at zero:
$$\frac{d}{d\theta}M(\theta)\Big|_{\theta=0} = \mathbb{E}[A_1], \qquad \frac{d}{d\theta}\exp(C\theta)\Big|_{\theta=0} = C.$$
Since $\mathbb{E}[A_1] < C$, there exists a small enough $\theta > 0$ so that $M(\theta) < \exp(C\theta)$,

Figure 1: Illustration of the existence of $\theta$ such that $M(\theta) < \exp(C\theta)$: both curves equal $1$ at $\theta = 0$, $M(\theta)$ starts out below $\exp(C\theta)$ since $M'(0) = \mathbb{E}[A_1] < C$, and $\theta^*$ is where they cross again.

and thus the set of $\theta > 0$ for which this is the case is non-empty (see Figure 1). The theorem says that, roughly speaking,
$$\mathbb{P}\Big(\max_{n\ge1}\sum_{k=1}^{n}(A_k - C)\ge B\Big)\approx \exp(-\theta^* B)$$
when $B$ is large. Thus, given $\delta$, select $B$ such that $\exp(-\theta^* B) \le \delta$: we can set $B = \frac{1}{\theta^*}\log\frac{1}{\delta}$.


Example. Let $A$ be a random variable uniformly distributed on $[0, a]$ and let $C = 2$. Then the moment generating function of $A$ is
$$M(\theta) = \frac{1}{a}\int_0^a \exp(\theta t)\,dt = \frac{\exp(a\theta) - 1}{a\theta}.$$
Then
$$\theta^* = \sup\{\theta > 0: M(\theta) < \exp(C\theta)\} = \sup\Big\{\theta > 0: \frac{\exp(a\theta)-1}{a\theta} < \exp(2\theta)\Big\}.$$

Case 1: $a = 3$. We have $\theta^* = \sup\{\theta > 0: \frac{\exp(3\theta)-1}{3\theta} < \exp(2\theta)\} \approx 1.54078$.

Case 2: $a = 4$. The set $\{\theta > 0: \frac{\exp(4\theta)-1}{4\theta} < \exp(2\theta)\}$ is empty, since $\mathbb{E}[A] = 2 = C$.

Case 3: $a = 2$. The set $\{\theta > 0: \frac{\exp(2\theta)-1}{2\theta} < \exp(2\theta)\}$ is all of $\mathbb{R}_+$, and thus $\theta^* = \infty$, which implies that $\mathbb{P}(\max_n \sum_{k=1}^{n}(A_k - C) \ge B) = 0$ for every $B > 0$ by Theorem 1. Indeed, in this case $A_k \le 2 = C$ almost surely, so the random walk never exceeds a positive level.
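The threshold $\theta^*$ in Case 1 can be recovered numerically. The sketch below (an illustration we add, with the bisection bracket chosen by us) finds the crossing point of $\log M(\theta)$ and $C\theta$ for the uniform example, then sizes the buffer via $B = \frac{1}{\theta^*}\log\frac{1}{\delta}$ with $\delta = 0.001$:

```python
import math

def log_mgf_uniform(theta, a):
    # log of M(theta) = (exp(a*theta) - 1) / (a*theta) for Uniform[0, a]
    return math.log((math.exp(a * theta) - 1.0) / (a * theta))

def theta_star(a, C, hi=50.0):
    # h(t) = log M(t) - C*t is negative on (0, theta*) and positive after;
    # bisection for the crossing point theta*
    lo = 1e-9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if log_mgf_uniform(mid, a) - C * mid < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

ts = theta_star(3.0, 2.0)            # Case 1: should be about 1.54078
B = math.log(1.0 / 0.001) / ts       # buffer size for delta = 0.1%
print(round(ts, 5), round(B, 3))
```

With these numbers a buffer of roughly 4.5 (in the units of the arrival process) already drives the overflow probability down to about $0.1\%$.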

Proof of Theorem 1. We will first prove an upper bound and then a lower bound; combining them yields the result. For the upper bound, using the union bound and then the Chernoff bound with any $\theta > 0$, we have
$$\mathbb{P}\Big(\max_{n\ge1}\sum_{k=1}^{n}(A_k - C)\ge B\Big) \le \sum_{n=1}^{\infty}\mathbb{P}\Big(\sum_{k=1}^{n}(A_k - C)\ge B\Big) = \sum_{n=1}^{\infty}\mathbb{P}\Big(\frac{1}{n}\sum_{k=1}^{n}A_k \ge C + \frac{B}{n}\Big) \le \sum_{n=1}^{\infty}\exp\Big(-n\Big(\theta\Big(C+\frac{B}{n}\Big) - \log M(\theta)\Big)\Big) = \exp(-\theta B)\sum_{n=1}^{\infty}\exp(-n(\theta C - \log M(\theta))).$$
Fix any $\theta$ such that $\theta C > \log M(\theta)$, i.e. $M(\theta) < \exp(C\theta)$. Then the geometric series above is summable:
$$\sum_{n\ge1}\exp(-n(\theta C - \log M(\theta))) \le [1 - \exp(-(\theta C - \log M(\theta)))]^{-1},$$
and we obtain
$$\mathbb{P}\Big(\max_{n\ge1}\sum_{k=1}^{n}(A_k - C)\ge B\Big) \le \exp(-\theta B)\,[1 - \exp(-(\theta C - \log M(\theta)))]^{-1}.$$
Taking logarithms, dividing by $B$, and letting $B\to\infty$ gives
$$\limsup_{B\to\infty}\frac{1}{B}\log\mathbb{P}\Big(\max_{n\ge1}\sum_{k=1}^{n}(A_k - C)\ge B\Big) \le -\theta \quad \text{for every } \theta \text{ with } M(\theta) < \exp(C\theta),$$
and taking the supremum over all such $\theta$ yields the upper bound $-\theta^*$.


Next we derive the lower bound. For every $n$,
$$\mathbb{P}\Big(\max_{m\ge1}\sum_{k=1}^{m}(A_k - C)\ge B\Big) \ge \mathbb{P}\Big(\sum_{k=1}^{n}(A_k - C)\ge B\Big).$$
Fix $t > 0$ and take $n = \lceil Bt\rceil$. Since $\lceil Bt\rceil/t \ge B$,
$$\mathbb{P}\Big(\max_{m\ge1}\sum_{k=1}^{m}(A_k - C)\ge B\Big) \ge \mathbb{P}\Big(\sum_{k=1}^{\lceil Bt\rceil}(A_k - C)\ge \frac{\lceil Bt\rceil}{t}\Big).$$
Then we have
$$\liminf_{B\to\infty}\frac{1}{B}\log\mathbb{P}\Big(\max_{m\ge1}\sum_{k=1}^{m}(A_k - C)\ge B\Big) \ge \liminf_{B\to\infty}\frac{1}{B}\log\mathbb{P}\Big(\sum_{k=1}^{\lceil Bt\rceil}(A_k - C)\ge\frac{\lceil Bt\rceil}{t}\Big) = t\liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}\Big(\sum_{k=1}^{n}(A_k - C)\ge\frac{n}{t}\Big) = t\liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}\Big(\frac{1}{n}\sum_{k=1}^{n}A_k \ge C + \frac{1}{t}\Big) \ge -t\inf_{x>C+\frac{1}{t}}I(x) \quad\text{(by Cramér's Theorem)}.$$
Since $t > 0$ is arbitrary,
$$\liminf_{B\to\infty}\frac{1}{B}\log\mathbb{P}\Big(\max_{m\ge1}\sum_{k=1}^{m}(A_k - C)\ge B\Big) \ge -\inf_{t>0}\,t\inf_{x>C+\frac{1}{t}}I(x). \qquad (1)$$
We claim that
$$\inf_{t>0}\,t\inf_{x>C+\frac{1}{t}}I(x) = \inf_{t>0}\,t\,I\Big(C+\frac{1}{t}\Big).$$
Indeed, let $x^* = \inf\{x: I(x) = \infty\}$ (possibly $x^* = \infty$). If $x^* \le C$, then $I(C+\frac{1}{t}) = \infty$ for all $t > 0$ and both sides are infinite. Suppose $C < x^*$. If $t$ is such that $C+\frac{1}{t} \ge x^*$, then $\inf_{x>C+\frac{1}{t}}I(x) = \infty$, so such $t$ need not be considered. For $C+\frac{1}{t} < x^*$, the function $I$ is convex, non-decreasing and finite on $[\mathbb{E}[A_1], x^*)$, therefore continuous there, which gives
$$\inf_{x>C+\frac{1}{t}}I(x) = I\Big(C+\frac{1}{t}\Big),$$
and the claim follows. Thus we obtain
$$\liminf_{B\to\infty}\frac{1}{B}\log\mathbb{P}\Big(\max_{n\ge1}\sum_{k=1}^{n}(A_k - C)\ge B\Big) \ge -\inf_{t>0}\,t\,I\Big(C+\frac{1}{t}\Big).$$
An exercise in HW 2 shows that $\sup\{\theta>0: M(\theta)<\exp(C\theta)\} = \inf_{t>0}\,t\,I(C+\frac{1}{t})$, which completes the proof.
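The HW 2 identity can be checked numerically for a concrete case. The sketch below (our example, not from the notes) takes $A \sim \text{Exponential}(1)$ and $C = 2$, for which $M(\theta) = 1/(1-\theta)$ on $\theta < 1$ and $I(x) = x - 1 - \log x$, and compares the two sides:

```python
import math

C = 2.0  # service rate; A ~ Exponential(1), so E[A] = 1 < C

def I(x):
    # rate function of a unit-rate exponential: I(x) = x - 1 - log(x), x > 0
    return x - 1.0 - math.log(x)

# right-hand side: inf over t > 0 of t * I(C + 1/t), via a fine grid
rhs = min(t * I(C + 1.0 / t) for t in (k / 1000.0 for k in range(1, 20000)))

# left-hand side: largest theta in (0, 1) with M(theta) = 1/(1-theta) < exp(C*theta),
# found by bisection on h(theta) = -log(1 - theta) - C*theta
lo, hi = 1e-9, 1.0 - 1e-9
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if -math.log(1.0 - mid) - C * mid < 0:
        lo = mid
    else:
        hi = mid
lhs = 0.5 * (lo + hi)
print(round(lhs, 4), round(rhs, 4))  # the two sides should agree
```

Both sides come out near $0.797$, consistent with the claimed identity.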



MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 5 9/16/2013

Extension of LD to $\mathbb{R}^d$ and dependent processes. Gärtner-Ellis Theorem

Content.

1. Large Deviations in many dimensions

2. Gärtner-Ellis Theorem

3. Large Deviations for Markov chains

1 Large Deviations in $\mathbb{R}^d$

Most of the developments in this lecture follow the book by Dembo and Zeitouni [1]. Let $X_n \in \mathbb{R}^d$ be i.i.d. random variables and $A \subset \mathbb{R}^d$. Let $S_n = \sum_{1\le i\le n}X_i$. The large deviations question is now regarding the existence of the limit
$$\lim_n \frac{1}{n}\log\mathbb{P}\Big(\frac{S_n}{n}\in A\Big).$$
Given $\theta\in\mathbb{R}^d$, define $M(\theta) = \mathbb{E}[\exp(\langle\theta,X_1\rangle)]$, where $\langle\cdot,\cdot\rangle$ represents the inner product of two vectors: $\langle a,b\rangle = \sum_i a_i b_i$. Define $I(x) = \sup_{\theta\in\mathbb{R}^d}(\langle\theta,x\rangle - \log M(\theta))$, where again $I(x) = \infty$ is a possibility.

Theorem 1 (Cramér's theorem in multiple dimensions). Suppose $M(\theta) < \infty$ for all $\theta\in\mathbb{R}^d$. Then

(a) for every closed set $F\subset\mathbb{R}^d$,
$$\limsup_n \frac{1}{n}\log\mathbb{P}\Big(\frac{S_n}{n}\in F\Big)\le -\inf_{x\in F}I(x);$$

(b) for every open set $U\subset\mathbb{R}^d$,
$$\liminf_n \frac{1}{n}\log\mathbb{P}\Big(\frac{S_n}{n}\in U\Big)\ge -\inf_{x\in U}I(x).$$


Unfortunately, the theorem does not hold in full generality, and an additional condition such as $M(\theta) < \infty$ for all $\theta$ is needed. Known counterexamples are somewhat involved and can be found in a paper by Dinwoodie [2], which builds on an earlier work of Slaby [5]. The difficulty is that there is no longer a notion of monotonicity of $I(x)$ as a function of the vector $x$. This is not the tightest condition, and more general conditions are possible; see [1]. The proof of the theorem is skipped and can be found in [1].

Let us consider an example of application of Theorem 1. Let $X_n \stackrel{d}{=} N(0,\Sigma)$, where $d = 2$ and
$$\Sigma = \begin{pmatrix}1 & \tfrac12\\ \tfrac12 & 1\end{pmatrix}, \qquad F = \{(x_1,x_2): 2x_1 + x_2 \ge 5\}.$$
Goal: prove that the limit $\lim_n \frac{1}{n}\log\mathbb{P}(\frac{S_n}{n}\in F)$ exists, and compute it.

By the upper bound part,
$$\limsup_n\frac{1}{n}\log\mathbb{P}\Big(\frac{S_n}{n}\in F\Big)\le -\inf_{x\in F}I(x).$$
We have $M(\theta) = \mathbb{E}[\exp(\langle\theta,X\rangle)]$. Letting $\stackrel{d}{=}$ denote equality in distribution, we have
$$\langle\theta,X\rangle \stackrel{d}{=} N(0,\ \theta^T\Sigma\theta) = N(0,\ \theta_1^2 + \theta_1\theta_2 + \theta_2^2),$$
where $\theta = (\theta_1,\theta_2)$. Thus
$$M(\theta) = \exp\Big(\frac{1}{2}(\theta_1^2 + \theta_1\theta_2 + \theta_2^2)\Big),$$
$$I(x) = \sup_{\theta_1,\theta_2}\Big(\theta_1 x_1 + \theta_2 x_2 - \frac{1}{2}(\theta_1^2 + \theta_1\theta_2 + \theta_2^2)\Big).$$
Let
$$g(\theta_1,\theta_2) = \theta_1 x_1 + \theta_2 x_2 - \frac{1}{2}(\theta_1^2 + \theta_1\theta_2 + \theta_2^2).$$
From $\frac{\partial g}{\partial\theta_j} = 0$ we have that
$$x_1 - \theta_1 - \frac{1}{2}\theta_2 = 0, \qquad x_2 - \theta_2 - \frac{1}{2}\theta_1 = 0,$$
from which we have
$$\theta_1 = \frac{4}{3}x_1 - \frac{2}{3}x_2, \qquad \theta_2 = \frac{4}{3}x_2 - \frac{2}{3}x_1.$$
Then
$$I(x_1,x_2) = \frac{2}{3}(x_1^2 + x_2^2 - x_1x_2).$$

So we need to find
$$\inf_{x_1,x_2}\ \frac{2}{3}(x_1^2 + x_2^2 - x_1x_2) \quad \text{s.t. } 2x_1 + x_2 \ge 5 \quad (x\in F).$$
This is a convex non-linear optimization problem. Applying the Karush-Kuhn-Tucker conditions for $\min f$ s.t. $g \le 0$, we need a multiplier $\lambda \ge 0$ with
$$\nabla f + \lambda\nabla g = 0, \qquad \lambda g = 0,$$
where here $f(x) = \frac{2}{3}(x_1^2 + x_2^2 - x_1x_2)$ and $g(x) = 5 - 2x_1 - x_2$. This gives
$$\Big(\frac{4}{3}x_1 - \frac{2}{3}x_2,\ \frac{4}{3}x_2 - \frac{2}{3}x_1\Big) = \lambda(2,1), \qquad \lambda(2x_1 + x_2 - 5) = 0.$$
If $2x_1 + x_2 - 5 > 0$ (the constraint is inactive), then $\lambda = 0$ and hence $x_1 = x_2 = 0$, but this violates $2x_1 + x_2 \ge 5$. So the constraint is active: $2x_1 + x_2 = 5$, which implies $x_2 = 5 - 2x_1$. Thus we have a one-dimensional unconstrained minimization problem:
$$\min_{x_1}\ \frac{2}{3}\big(x_1^2 + (5-2x_1)^2 - x_1(5-2x_1)\big) = \min_{x_1}\ \frac{2}{3}\big(7x_1^2 - 25x_1 + 25\big),$$
which gives $x_1 = \frac{25}{14}$, $x_2 = \frac{10}{7}$ and $I(x_1,x_2) = \frac{25}{14}\approx 1.79$. Thus
$$\limsup_n\frac{1}{n}\log\mathbb{P}\Big(\frac{S_n}{n}\in F\Big)\le -\frac{25}{14}.$$
Applying the lower bound part of Cramér's Theorem, we obtain
$$\liminf_n\frac{1}{n}\log\mathbb{P}\Big(\frac{S_n}{n}\in F\Big) \ge \liminf_n\frac{1}{n}\log\mathbb{P}\Big(\frac{S_n}{n}\in F^o\Big) \ge -\inf_{2x_1+x_2>5}I(x_1,x_2) = -\frac{25}{14} \quad \text{(by continuity of } I\text{)}.$$
Combining, we obtain
$$\lim_n\frac{1}{n}\log\mathbb{P}\Big(\frac{S_n}{n}\in F\Big) = -\frac{25}{14} \approx -1.79.$$
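The constrained minimum can be double-checked by brute force along the active constraint $x_2 = 5 - 2x_1$ (a grid-search sketch we add for verification):

```python
def I(x1, x2):
    # rate function of the bivariate Gaussian example above
    return (2.0 / 3.0) * (x1 * x1 + x2 * x2 - x1 * x2)

# the minimum over F lies on the boundary 2*x1 + x2 = 5; scan x1 on a fine grid
vals = [(I(t, 5.0 - 2.0 * t), t) for t in (k / 10000.0 for k in range(-20000, 60001))]
best_val, best_x1 = min(vals)
print(round(best_val, 6), round(best_x1, 4))  # about 25/14 at x1 = 25/14
```

The scan confirms the KKT computation: the optimum sits at $x_1 = x_2^{}\cdot\frac54$ on the boundary, with value $25/14$.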

2 Gärtner-Ellis Theorem

The Gärtner-Ellis Theorem deals with large deviations events when the sequence $X_n$ is not necessarily independent. One immediate application of this theorem is large deviations for Markov chains, which we will discuss in the following section.

Let $X_n$ be a sequence of not necessarily independent random variables in $\mathbb{R}^d$. Then in general, for $S_n = \sum_{1\le k\le n}X_k$, the identity $\mathbb{E}[\exp(\langle\theta,S_n\rangle)] = (\mathbb{E}[\exp(\langle\theta,X_1\rangle)])^n$ does not hold. Nevertheless, there exists a broad set of conditions under which the large deviations bounds hold. Thus consider a general sequence of random variables $Y_n\in\mathbb{R}^d$, which stands for $(1/n)S_n$ in the i.i.d. case. Let $\Lambda_n(\theta) = \frac{1}{n}\log\mathbb{E}[\exp(n\langle\theta,Y_n\rangle)]$. Note that for the i.i.d. case

$$\Lambda_n(\theta) = \frac{1}{n}\log\mathbb{E}[\exp(n\langle\theta, n^{-1}S_n\rangle)] = \frac{1}{n}\log M^n(\theta) = \log M(\theta) = \log\mathbb{E}[\exp(\langle\theta,X_1\rangle)].$$
Loosely speaking, the Gärtner-Ellis Theorem says that when the convergence
$$\Lambda_n(\theta)\to\Lambda(\theta) \qquad (5)$$
takes place for some limiting function $\Lambda$, then under certain additional technical assumptions the large deviations principle holds with rate function
$$I(x) \triangleq \sup_{\theta\in\mathbb{R}^d}(\langle\theta,x\rangle - \Lambda(\theta)). \qquad (6)$$
Formally:

Theorem 2. Given a sequence of random variables $Y_n$, suppose the limit $\Lambda(\theta)$ in (5) exists for all $\theta\in\mathbb{R}^d$. Furthermore, suppose $\Lambda(\theta)$ is finite and differentiable


everywhere on $\mathbb{R}^d$. Then the following large deviations bounds hold for $I$ defined by (6):
$$\limsup_n\frac{1}{n}\log\mathbb{P}(Y_n\in F)\le -\inf_{x\in F}I(x) \quad \text{for any closed set } F\subset\mathbb{R}^d,$$
$$\liminf_n\frac{1}{n}\log\mathbb{P}(Y_n\in U)\ge -\inf_{x\in U}I(x) \quad \text{for any open set } U\subset\mathbb{R}^d.$$

As for Theorem 1, this is not the most general version of the theorem. The version above is established as Exercise 2.3.20 in [1]. More general versions can be found there as well.

Can we use a Chernoff-type argument to get an upper bound? In one dimension, for $\theta > 0$ we have
$$\mathbb{P}(Y_n\ge a) = \mathbb{P}(\exp(n\theta Y_n)\ge\exp(n\theta a)) \le \exp(-n(\theta a - \Lambda_n(\theta))),$$
so we can get an upper bound with exponent $-\sup_{\theta\ge0}(\theta a - \Lambda_n(\theta))$. In the i.i.d. case we used the fact that $\sup_{\theta\ge0}(\theta a - \log M(\theta)) = \sup_\theta(\theta a - \log M(\theta))$ when $a > \mu = \mathbb{E}[X_1]$. But now we are dealing with the multidimensional case, where such an identity does not even make sense.

3 Large Deviations for finite state Markov chains

Let $X_n$ be a finite state Markov chain with states $\Omega = \{1,2,\ldots,N\}$. The transition matrix of this Markov chain is $P = (P_{i,j},\ 1\le i,j\le N)$. We assume that the chain is irreducible: namely, there exists $m > 0$ such that $P^{(m)}_{i,j} > 0$ for all pairs of states $i,j$, where $P^{(m)}$ denotes the $m$-th power of $P$, representing the transition probabilities after $m$ steps. Our goal is to derive large deviations bounds for the empirical means of the Markov chain. Namely, let $f:\Omega\to\mathbb{R}^d$ be any function and let $Y_n = f(X_n)$. Our goal is to derive the large deviations bounds for $n^{-1}S_n$, where $S_n = \sum_{1\le k\le n}Y_k$. For this purpose we need to recall the Perron-Frobenius Theorem.

Theorem 3. Let $B = (B_{i,j},\ 1\le i,j\le N)$ denote a non-negative irreducible matrix: namely, $B_{i,j}\ge0$ for all $i,j$, and there exists $m$ such that all the elements of $B^m$ are strictly positive. Then $B$ possesses an eigenvalue $\lambda$, called the Perron-Frobenius eigenvalue, which satisfies the following properties.


1. $\lambda > 0$ is real.

2. For every eigenvalue $\hat\lambda$ of $B$, $|\hat\lambda|\le\lambda$, where $|\hat\lambda|$ is the modulus of the (possibly complex) $\hat\lambda$.

3. The left and right eigenvectors of $B$ corresponding to $\lambda$, denoted by $\pi$ and $\mu$, are unique up to a constant multiple and have strictly positive components.

This theorem can be found in many books on linear algebra, for example [4]. The following corollary to the Perron-Frobenius Theorem shows that the rate of growth of the sequence of matrices $B^n$ is essentially $\lambda^n$. Specifically:

Corollary 1. For every vector $\psi = (\psi_j,\ 1\le j\le N)$ with strictly positive elements, the following holds:
$$\lim_n\frac{1}{n}\log\sum_{1\le j\le N}B^n_{i,j}\psi_j = \lim_n\frac{1}{n}\log\sum_{1\le j\le N}\psi_jB^n_{j,i} = \log\lambda.$$

Proof. Let $\bar\psi = \max_j\psi_j$, $\underline\psi = \min_j\psi_j$, $\bar\mu = \max_j\mu_j$, $\underline\mu = \min_j\mu_j$, where $\mu$ is the right Perron-Frobenius eigenvector. We have
$$\frac{\underline\psi}{\bar\mu}\sum_j B^n_{i,j}\mu_j \le \sum_j B^n_{i,j}\psi_j \le \frac{\bar\psi}{\underline\mu}\sum_j B^n_{i,j}\mu_j.$$
Therefore, since $\sum_j B^n_{i,j}\mu_j = \lambda^n\mu_i$,
$$\lim_n\frac{1}{n}\log\sum_{1\le j\le N}B^n_{i,j}\psi_j = \lim_n\frac{1}{n}\log\sum_{1\le j\le N}B^n_{i,j}\mu_j = \lim_n\frac{1}{n}\log(\lambda^n\mu_i) = \log\lambda.$$
The second identity is established similarly, using the left eigenvector $\pi$.

Now, given a Markov chain $X_n$, a function $f:\Omega\to\mathbb{R}^d$ and a vector $\theta\in\mathbb{R}^d$, consider the modified matrix $P_\theta = (e^{\langle\theta,f(j)\rangle}P_{i,j},\ 1\le i,j\le N)$. Then $P_\theta$ is an irreducible non-negative matrix, since $P$ is such a matrix. Let $\lambda(P_\theta)$ denote its Perron-Frobenius eigenvalue.

Theorem 4. The sequence $\frac{1}{n}S_n = \frac{1}{n}\sum_{1\le k\le n}f(X_k)$ satisfies the large deviations bounds with rate function $I(x) = \sup_{\theta\in\mathbb{R}^d}(\langle\theta,x\rangle - \log\lambda(P_\theta))$. Specifically, for every state $i_0\in\Omega$, every closed set $F\subset\mathbb{R}^d$ and every open set $U\subset\mathbb{R}^d$, the following holds:
$$\limsup_n\frac{1}{n}\log\mathbb{P}(n^{-1}S_n\in F\,|\,X_0=i_0)\le -\inf_{x\in F}I(x),$$
$$\liminf_n\frac{1}{n}\log\mathbb{P}(n^{-1}S_n\in U\,|\,X_0=i_0)\ge -\inf_{x\in U}I(x).$$

Proof. We will show that the sequence of functions $\Lambda_n(\theta) = \frac{1}{n}\log\mathbb{E}[e^{\langle\theta,S_n\rangle}]$ has a limit which is finite and differentiable everywhere. Given the starting state $i_0$ we have
$$\log\mathbb{E}[e^{\langle\theta,S_n\rangle}] = \log\sum_{i_1,\ldots,i_n}P_{i_0,i_1}P_{i_1,i_2}\cdots P_{i_{n-1},i_n}\prod_{1\le k\le n}e^{\langle\theta,f(i_k)\rangle} = \log\sum_{1\le j\le N}P^n_\theta(i_0,j),$$
where $P^n_\theta(i,j)$ denotes the $(i,j)$-th entry of the matrix $P^n_\theta$. Taking $\psi_j = 1$ for all $j$ and applying Corollary 1, we obtain
$$\lim_n\Lambda_n(\theta) = \log\lambda(P_\theta).$$
Thus the Gärtner-Ellis Theorem can be applied, provided the differentiability of $\log\lambda(P_\theta)$ with respect to $\theta$ can be established. The Perron-Frobenius theory can in fact be used to show that such differentiability indeed takes place. Details can be found in the book by Lancaster [3], Theorem 7.7.1.

    References

[1] A. Dembo and O. Zeitouni, Large deviations techniques and applications, Springer, 1998.

[2] I. H. Dinwoodie, A note on the upper bound for i.i.d. large deviations, The Annals of Probability 19 (1991), no. 4, 1732-1736.

[3] P. Lancaster and M. Tismenetsky, Theory of matrices, vol. 2, Academic Press, New York, 1969.

[4] E. Seneta, Non-negative matrices and Markov chains, Springer-Verlag, New York, 1981.

[5] M. Slaby, On the upper bound for large deviations of sums of i.i.d. random vectors, The Annals of Probability (1988), 978-990.



MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 6 9/23/2013

    Brownian motion. Introduction

    Content.

    1. A heuristic construction of a Brownian motion from a random walk.

2. Definition and basic properties of a Brownian motion.

    1 Historical notes

1765: Jan Ingenhousz's observations of carbon dust in alcohol.

1828: Robert Brown observed that pollen grains suspended in water perform a continual swarming motion.

1900: Bachelier's work "The theory of speculation" on using Brownian motion to model stock prices.

1905: Einstein and Smoluchowski: physical interpretation of Brownian motion.

1920s: Wiener's concise mathematical description.

    2 Construction of a Brownian motion from a random walk

The developments in this lecture follow closely the book by Resnick [3]. In this section we provide a heuristic construction of a Brownian motion from a random walk. The derivation below is not a proof; we will provide a rigorous construction of a Brownian motion when we study the weak convergence theory.

The large deviations theory predicts exponential decay of the probabilities $\mathbb{P}(\sum_{1\le i\le n}X_i > an)$ when $a > \mu = \mathbb{E}[X_1]$ and $\mathbb{E}[e^{\theta X_1}]$ is finite. Naturally, the decay becomes slower the closer $a$ is to $\mu$. We considered only the case when $a$ is a constant. But what if $a$ is a function of $n$: $a = a_n$? The Central Limit Theorem tells us that the decay disappears when $a_n$ is of order $n^{-1/2}$. Recall:

Theorem 1 (CLT). Given an i.i.d. sequence $(X_n,\ n\ge1)$ with $\mathbb{E}[X_1] = \mu$ and $\mathrm{var}(X_1) = \sigma^2$, for every constant $a$
$$\lim_n\mathbb{P}\Big(\frac{\sum_{1\le i\le n}X_i - \mu n}{\sigma\sqrt{n}}\le a\Big) = \int_{-\infty}^{a}\frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}\,dt.$$

Now let us look at the sequence of partial sums $S_n = \sum_{1\le i\le n}(X_i - \mu)$. For simplicity assume $\mu = 0$, so that we look at $S_n = \sum_{1\le i\le n}X_i$. Can we say anything about $S_n$ as a function of $n$? In fact, let us make it a function of a real variable $t\in\mathbb{R}_+$ and rescale it by $\sqrt{n}$ as follows. Define
$$B_n(t) = \frac{\sum_{1\le i\le\lfloor nt\rfloor}X_i}{\sqrt{n}} \quad \text{for every } t\ge0.$$
Denote by $N(\mu,\sigma^2)$ the distribution function of a normal r.v. with mean $\mu$ and variance $\sigma^2$.

1. For every fixed $0\le s<t$, by the CLT the distribution of $B_n(t) - B_n(s) = \sum_{\lfloor ns\rfloor<i\le\lfloor nt\rfloor}X_i/\sqrt{n}$ converges to $N(0,\sigma^2(t-s))$.

2. Increments of $B_n$ over disjoint time intervals are independent, since they are sums over disjoint sets of the $X_i$.

3. Given a small $\epsilon > 0$, for every $t$ the difference $B_n(t+\epsilon) - B_n(t)$ converges in distribution to $N(0,\sigma^2\epsilon)$. When $\epsilon$ is very small, this difference is very close to zero with probability approaching 1 as $n\to\infty$. Namely, $B_n(t)$ is increasingly close to a continuous function as $n\to\infty$.

4. $B_n(0) = 0$ by definition.
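The rescaled-walk construction is easy to simulate. A minimal sketch (ours), using $\pm1$ steps so that $\mu = 0$ and $\sigma^2 = 1$; by the discussion above, $B_n(1)$ should be approximately $N(0,1)$:

```python
import math
import random

def B_n(xs, n, t):
    # B_n(t) = (sum of the first floor(n*t) steps) / sqrt(n)
    return sum(xs[: int(n * t)]) / math.sqrt(n)

random.seed(1)
n, trials = 400, 4000
samples = []
for _ in range(trials):
    xs = [random.choice((-1.0, 1.0)) for _ in range(n)]  # mean 0, variance 1
    samples.append(B_n(xs, n, 1.0))
mean = sum(samples) / trials
var = sum(s * s for s in samples) / trials
print(round(mean, 3), round(var, 3))  # close to 0 and 1 respectively
```

Repeating the experiment at two times $s < t$ would likewise show empirical variance of the increment close to $t - s$, matching item 1.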

3 Definition

The Brownian motion is the limit $B(t)$ of $B_n(t)$ as $n\to\infty$. We will delay answering questions about the existence of this limit, as well as the sense in which we take this limit (remember we are dealing with processes and not scalar values here), till future lectures, and for now simply postulate the existence of a process satisfying the properties above. In the definition below, for every continuous function $\omega\in C[0,\infty)$ we let $B(t,\omega)$ denote $\omega(t)$. This notation is more consistent with the standard convention of denoting Brownian motion by $B$ and its value at time $t$ by $B(t)$.

Definition 1 (Wiener measure). Given $\Omega = C[0,\infty)$, the Borel $\sigma$-field $\mathcal{B}$ defined on $C[0,\infty)$, and any value $\sigma > 0$, a probability measure $\mathbb{P}$ satisfying the following properties is called the Wiener measure:

1. $\mathbb{P}(B(0) = 0) = 1$.

2. $\mathbb{P}$ has the independent increments property. Namely, for every $0\le t_1 < \cdots < t_k < \infty$ and $x_1,\ldots,x_{k-1}\in\mathbb{R}$,
$$\mathbb{P}(\omega\in C[0,\infty): B(t_2,\omega)-B(t_1,\omega)\le x_1,\ \ldots,\ B(t_k,\omega)-B(t_{k-1},\omega)\le x_{k-1}) = \prod_{2\le i\le k}\mathbb{P}(\omega\in C[0,\infty): B(t_i,\omega)-B(t_{i-1},\omega)\le x_{i-1}).$$

3. For every $0\le s<t$ the distribution of $B(t)-B(s)$ is normal $N(0,\sigma^2(t-s))$. In particular, the variance is a linear function of the length of the time increment $t-s$, and the increments are stationary.

The stochastic process $B$ described by this probability space $(C[0,\infty),\mathcal{B},\mathbb{P})$ is called Brownian motion. When $\sigma = 1$, it is called the standard Brownian motion.


Theorem 2 (Existence of Wiener measure). For every $\sigma\ge0$ there exists a unique Wiener measure.

(What is the Wiener measure when $\sigma = 0$?) As we mentioned before, we delay the proof of this fundamental result. For now, just assume that the theorem holds and study the properties.

Remarks:

In the future we will not be explicitly writing samples $\omega$ when discussing Brownian motion. Also, when we say $B(t)$ is a Brownian motion, we understand it both as a Wiener measure or simply as a sample of it, depending on the context. There should be no confusion.

It turns out that for any given $\sigma$ such a probability measure is unique. On the other hand, note that if $B(t)$ is a Brownian motion, then $-B(t)$ is also a Brownian motion: simply check that all of the conditions of the Wiener measure hold. Why is there no contradiction?

Sometimes we will consider a Brownian motion which does not start at zero: $B(0) = x$ for some value $x\ne0$. We may define this process as $x + B(t)$, where $B$ is a Brownian motion.

Problem 1.

1. Let $\Omega$ be the space of all (not necessarily continuous) functions $\omega:\mathbb{R}_+\to\mathbb{R}$.

(i) Construct an example of a stochastic process in $\Omega$ which satisfies conditions (a)-(c) of the Brownian motion, but such that every path is almost surely discontinuous.

(ii) Construct an example of a stochastic process in $\Omega$ which satisfies conditions (a)-(c) of the Brownian motion, but such that every path is almost surely discontinuous at every point $t\in[0,1]$.

HINT: work with the Brownian motion.

2. Suppose $B(t)$ is a stochastic process defined on the set of all (not necessarily continuous) functions $x:\mathbb{R}_+\to\mathbb{R}$ satisfying properties (a)-(c) of Definition 1. Prove that for every $t\ge0$, $\lim_n B(t + 1/n) = B(t)$ almost surely.


Problem 2. Let $A_{\mathbb{R}}$ be the set of all homogeneous linear functions $x(t) = at$, where $a$ varies over all values $a\in\mathbb{R}$, and let $B(t)$ denote the standard Brownian motion. Prove that $\mathbb{P}(B\in A_{\mathbb{R}}) = 0$.

    4 Properties

    We now derive several properties of a Brownian motion. We assume that B(t) is a standard Brownian motion.

Joint distribution. Fix $0 < t_1 < t_2 < \cdots < t_k$. Let us find the joint density of the random vector $(B(t_1),\ldots,B(t_k))$ at $(x_1,\ldots,x_k)$, for given $x_1,\ldots,x_k\in\mathbb{R}$. It is equal to the joint density of $(B(t_1), B(t_2)-B(t_1),\ldots,B(t_k)-B(t_{k-1}))$ at $(x_1, x_2-x_1,\ldots,x_k-x_{k-1})$, which by the independent Gaussian increments property of the Brownian motion is equal to
$$\prod_{i=0}^{k-1}\frac{1}{\sqrt{2\pi(t_{i+1}-t_i)}}\exp\Big(-\frac{(x_{i+1}-x_i)^2}{2(t_{i+1}-t_i)}\Big),$$
with the convention $t_0 = 0$, $x_0 = 0$.
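The joint density above is straightforward to evaluate in code; a small sketch (ours), with the convention $t_0 = x_0 = 0$:

```python
import math

def bm_joint_density(ts, xs):
    # product of Gaussian increment densities; t_0 = 0, x_0 = 0 by convention
    t_prev, x_prev, dens = 0.0, 0.0, 1.0
    for t, x in zip(ts, xs):
        dt = t - t_prev
        dens *= math.exp(-(x - x_prev) ** 2 / (2.0 * dt)) / math.sqrt(2.0 * math.pi * dt)
        t_prev, x_prev = t, x
    return dens
```

For $k = 1$ this reduces to the $N(0, t_1)$ density, and the density is symmetric under $x\mapsto -x$, consistent with $-B$ being a Brownian motion.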

Differential Property. For any $s > 0$, $B_s(t) = B(t+s) - B(s)$, $t\ge0$, is a Brownian motion. Indeed, $B_s(0) = 0$ and the process has independent increments $B_s(t_2) - B_s(t_1) = B(t_2+s) - B(t_1+s)$, which have a Gaussian distribution with variance $t_2 - t_1$.

Scaling. For every $c$, $cB(t)$ is a Brownian motion with variance $\sigma^2 = c^2$. Indeed, continuity and the stationary independent increments properties, as well as the Gaussian distribution of the increments, follow immediately. The variance of the increment $cB(t_2) - cB(t_1)$ is $c^2(t_2 - t_1)$.

For every positive $c > 0$, $B(t/c)$ is a Brownian motion with variance $\sigma^2 = 1/c$. Indeed, the process is continuous, and the increments are stationary, independent, and Gaussian. For every $t_1 < t_2$, by the definition of the standard Brownian motion, the variance of $B(t_2/c) - B(t_1/c)$ is $(t_2 - t_1)/c$.

Combining these two properties, we obtain that $\sqrt{c}\,B(t/c)$ is also a standard Brownian motion.


Covariance. Fix $0\le s\le t$. Let us compute the covariance:
$$\mathrm{Cov}(B(t),B(s)) = \mathbb{E}[B(t)B(s)] - \mathbb{E}[B(t)]\mathbb{E}[B(s)] \stackrel{(a)}{=} \mathbb{E}[B(t)B(s)] = \mathbb{E}[(B(s) + B(t) - B(s))B(s)] \stackrel{(b)}{=} \mathbb{E}[B^2(s)] + \mathbb{E}[B(t)-B(s)]\,\mathbb{E}[B(s)] = s + 0 = s.$$
Here (a) follows since $\mathbb{E}[B(t)] = 0$ for all $t$, and (b) follows since by the definition of the standard Brownian motion we have $\mathbb{E}[B^2(s)] = s$, and by the independent increments property
$$\mathbb{E}[(B(t)-B(s))B(s)] = \mathbb{E}[(B(t)-B(s))(B(s)-B(0))] = \mathbb{E}[B(t)-B(s)]\,\mathbb{E}[B(s)-B(0)] = 0,$$
since the increments have a zero mean Gaussian distribution.
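The identity $\mathrm{Cov}(B(s),B(t)) = \min(s,t)$ can be checked by simulation, generating $B(t)$ as $B(s)$ plus an independent increment (a Monte Carlo sketch we add):

```python
import math
import random

random.seed(7)
s, t, trials = 0.5, 2.0, 200000
acc = 0.0
for _ in range(trials):
    b_s = random.gauss(0.0, math.sqrt(s))            # B(s) ~ N(0, s)
    b_t = b_s + random.gauss(0.0, math.sqrt(t - s))  # plus independent increment
    acc += b_s * b_t
cov = acc / trials
print(round(cov, 3))  # close to min(s, t) = 0.5
```

The simulation is exact in distribution for the pair $(B(s), B(t))$, so the only error is Monte Carlo noise of order $1/\sqrt{\text{trials}}$.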

Time reversal. Given a standard Brownian motion $B(t)$, consider the process $B^{(1)}$ defined by $B^{(1)}(t) = tB(1/t)$ for all $t > 0$ and $B^{(1)}(0) = 0$. In other words, we reverse time by the transformation $t\mapsto 1/t$. We claim that $B^{(1)}(t)$ is also a standard Brownian motion.

Proof. We need to verify properties (a)-(c) of Definition 1 plus continuity. The continuity at any point $t > 0$ follows immediately: $1/t$ is a continuous function and $B$ is continuous by assumption, therefore $tB(1/t)$ is continuous for all $t > 0$. The continuity at $t = 0$ is the most difficult part of the proof, and we delay it till the end. For now let us check (a)-(c).

(a) follows since we defined $B^{(1)}(0)$ to be zero.

We delay (b) till we establish normality in (c).

(c) Take any $s < t$. Write
$$tB\Big(\frac1t\Big) - sB\Big(\frac1s\Big) = (t-s)B\Big(\frac1t\Big) + sB\Big(\frac1t\Big) - sB\Big(\frac1s\Big).$$
The distribution of $B(\frac1s) - B(\frac1t)$ is Gaussian with zero mean and variance $\frac1s - \frac1t$, since $B$ is a standard Brownian motion. By the scaling property, the distribution of $sB(\frac1t) - sB(\frac1s)$ is zero mean Gaussian with variance $s^2(\frac1s - \frac1t)$. The distribution of $(t-s)B(\frac1t)$ is zero mean Gaussian with variance $(t-s)^2\frac1t$, and it is independent from $sB(\frac1t) - sB(\frac1s)$ by the independent increments property of the Brownian motion. Therefore $tB(\frac1t) - sB(\frac1s)$ is zero mean Gaussian with variance
$$s^2\Big(\frac1s - \frac1t\Big) + (t-s)^2\,\frac1t = t - s.$$
This proves (c).

We now return to (b). Take any $t_1 < t_2 < t_3$. We established in (c) that all the differences $B^{(1)}(t_2) - B^{(1)}(t_1)$, $B^{(1)}(t_3) - B^{(1)}(t_2)$, and $B^{(1)}(t_3) - B^{(1)}(t_1) = (B^{(1)}(t_3) - B^{(1)}(t_2)) + (B^{(1)}(t_2) - B^{(1)}(t_1))$ are zero mean Gaussian with variances $t_2 - t_1$, $t_3 - t_2$ and $t_3 - t_1$ respectively. In particular, the variance of $B^{(1)}(t_3) - B^{(1)}(t_1)$ is the sum of the variances of $B^{(1)}(t_3) - B^{(1)}(t_2)$ and $B^{(1)}(t_2) - B^{(1)}(t_1)$. This implies that the covariance of the summands is zero. Moreover, from the computation in (c) it is not difficult to establish that $B^{(1)}(t_3) - B^{(1)}(t_2)$ and $B^{(1)}(t_2) - B^{(1)}(t_1)$ are jointly Gaussian. Recall that two jointly Gaussian random variables are independent if and only if their covariance is zero.

It remains to prove the continuity at zero of B^(1)(t). We need to show the continuity almost surely, so that the zero-measure set corresponding to the samples ω ∈ C[0, ∞) where the continuity does not hold can be thrown away. Thus, we need to show that the probability measure of the set

A = {ω ∈ C[0, ∞) : lim_{t→0} t B(1/t, ω) = 0}

is equal to unity. We will use the Strong Law of Large Numbers (SLLN). First set t = 1/n and consider t B(1/t) = B(n)/n. Because of the independent Gaussian increments property, B(n) = Σ_{1≤i≤n} (B(i) − B(i−1)) is the sum of i.i.d. standard normal random variables. By the SLLN we then have B(n)/n → E[B(1) − B(0)] = 0 a.s.

We showed convergence to zero along the sequence t = 1/n almost surely. Now we need to take care of the other values of t, or equivalently, values s ∈ [n, n+1). For any such s we have

|B(s)/s − B(n)/n| ≤ |B(s) − B(n)|/s + |B(n)| |1/s − 1/n|
≤ (1/n) sup_{n≤s≤n+1} |B(s) − B(n)| + |B(n)|/n².

We know from the SLLN that B(n)/n → 0 a.s. Moreover, then

B(n)/n² → 0 a.s.   (1)

Now consider the second term and set Z_n = sup_{n≤s≤n+1} |B(s) − B(n)|. We claim that for every ε > 0,

P(Z_n/n > ε i.o.) = P(ω ∈ C[0, ∞) : Z_n(ω)/n > ε i.o.) = 0,   (2)

where i.o. stands for "infinitely often". Suppose (2) was indeed the case. The equality means that for almost all samples ω the inequality Z_n(ω)/n > ε happens for at most finitely many n. This means exactly that for almost all ω (that is, a.s.) Z_n(ω)/n → 0 as n → ∞. Combining with (1) we would conclude that a.s.

sup_{n≤s≤n+1} |B(s)/s − B(n)/n| → 0,

as n → ∞. Since we already know that B(n)/n → 0, we would conclude that a.s. lim_{s→∞} B(s)/s = 0, and this means almost sure continuity of B^(1)(t) at zero.

It remains to show (2). We observe that due to the independent stationary increments property, the distribution of Z_n is the same as that of Z_1. This is the distribution of the maximum of the absolute value of a standard Brownian motion during the interval [0, 1]. In the following lecture we will show that this maximum has finite expectation: E[|Z_1|] < ∞. On the other hand,

∞ > E[|Z_1|] = ∫_0^∞ P(|Z_1| > x) dx = Σ_{n≥0} ∫_{nε}^{(n+1)ε} P(|Z_1| > x) dx ≥ Σ_{n≥0} ε P(|Z_1| > (n+1)ε).

We conclude that the sum on the right-hand side is finite. Since the Z_n are identically distributed,

Σ_{n≥1} P(Z_n/n > ε) = Σ_{n≥1} P(Z_1 > nε) < ∞.

Thus the sum on the left-hand side is finite. Now we use the Borel-Cantelli Lemma to conclude that (2) indeed holds.

    5 Additional reading materials

Sections 6.1 and 6.4 from Chapter 6 of Resnick's book "Adventures in Stochastic Processes" in the course packet.

Durrett [2], Section 7.1. Billingsley [1], Chapter 8.


References

[1] P. Billingsley, Convergence of probability measures, Wiley-Interscience, 1999.

[2] R. Durrett, Probability: theory and examples, Duxbury Press, second edition, 1996.

[3] S. Resnick, Adventures in stochastic processes, Birkhäuser Boston, Inc., 1992.


MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes
Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 7 9/25/2013

The Reflection Principle. The Distribution of the Maximum. Brownian motion with drift

    Content.

    1. Quick intro to stopping times

2. Reflection principle

    3. Brownian motion with drift

    1 Technical preliminary: stopping times

Stopping times are, loosely speaking, rules by which we interrupt the process without looking at the process after it was interrupted. For example, "sell your stock the first time it hits $20 per share" is a stopping rule. Whereas "sell your stock one day before it hits $20 per share" is not a stopping rule, since we do not know the day (if any) when it hits this price.

Given a stochastic process {X_t}_{t≥0} with t ∈ Z_+ or t ∈ R_+, a random variable T is called a stopping time if for every time t the event {T ≤ t} is completely determined by the history {X_s}_{0≤s≤t}.

This is not a formal definition. The formal definition will be given later when we study filtration. Then we will give the definition in terms of the underlying (Ω, F, P). For now, though, let us just adopt this loose definition.
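To make the distinction concrete, here is a small sketch (a hypothetical simple-random-walk example, not from the notes) in which the first hitting time of a level is computed from the history alone: at each step, the decision to stop uses only the path seen so far.

```python
import random

def first_hitting_time(path, level):
    # A stopping rule: scan forward and stop the first time the level is
    # reached; the decision at time t uses only path[0..t].
    for t, x in enumerate(path):
        if x >= level:
            return t
    return None  # never hit within the observed horizon

rng = random.Random(1)
walk, x = [], 0
for _ in range(100_000):
    x += rng.choice([-1, 1])
    walk.append(x)

T = first_hitting_time(walk, 20)
print(T is not None)
```

By contrast, "one day before the hitting time" would be T − 1, which cannot be computed without peeking at the future of the path.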

2 The Reflection principle. The distribution of the maximum

    The goal of this section is to obtain the distribution of

M(t) = sup_{0≤s≤t} B(s)


for any given t. Surprisingly, the resulting expression is very simple and follows from one of the key properties of the Brownian motion – the reflection principle.

Given a > 0, define

T_a = inf{t : B(t) = a},

the first time when the Brownian motion hits level a. When no such time exists we define T_a = ∞, although we now show that it is finite almost surely.

Proposition 1. T_a < ∞ almost surely.

Proof. Note that if B hits some level b ≥ a almost surely, then by continuity and since B(0) = 0, it hits level a almost surely. Therefore, it suffices to prove that lim sup_{t→∞} B(t) = ∞ almost surely. This in turn will follow from lim sup_n B(n) = ∞ almost surely.

Problem 1. Prove that lim sup_n |B(n)| = ∞ almost surely.

The differential property of the Brownian motion suggests that

B(T_a + s) − B(T_a) = B(T_a + s) − a   (1)

is also a Brownian motion, independent from B(t), t ≤ T_a. The only issue here is that T_a is a random instance and the differential property was established for fixed times t. It turns out (we do not prove this) that the differential property also holds at the random time T_a, since it is a stopping time and is finite almost surely. The first fact is an immediate consequence of the definition: we can determine whether T_a ≤ t by looking at the path B(u), 0 ≤ u ≤ t. The almost sure finiteness was established in Proposition 1. The property (1) is called the strong independent increments property of the Brownian motion.

Theorem 1 (The reflection principle). Given a standard Brownian motion B(t), for every a ≥ 0

P(M(t) ≥ a) = 2P(B(t) ≥ a) = (2/√(2πt)) ∫_a^∞ e^{−x²/(2t)} dx.   (2)

Proof. We have

P(B(t) ≥ a) = P(B(t) ≥ a, M(t) ≥ a) + P(B(t) ≥ a, M(t) < a).


Note, however, that P(B(t) ≥ a, M(t) < a) = 0 since M(t) ≥ B(t). Now

P(B(t) ≥ a, M(t) ≥ a) = P(B(t) ≥ a | M(t) ≥ a) P(M(t) ≥ a) = P(B(t) ≥ a | T_a ≤ t) P(M(t) ≥ a).

We have that B(T_a + s) − a is a Brownian motion. Conditioned on T_a ≤ t,

P(B(t) ≥ a | T_a ≤ t) = P(B(T_a + (t − T_a)) − a ≥ 0 | T_a ≤ t) = 1/2,

since the Brownian motion satisfies P(B(t) ≥ 0) = 1/2 for every t. Applying this identity, we obtain

P(B(t) ≥ a) = (1/2) P(M(t) ≥ a).

    This establishes the required identity (2).
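The identity (2) is easy to probe by simulation. The sketch below (pure Python, grid parameters chosen by me for speed) estimates both sides for t = 1 and a = 1; note that a discrete grid slightly underestimates the continuous maximum, so the first estimate is biased a little low.

```python
import math
import random

def max_and_endpoint(t, steps, rng):
    # Simulate standard BM on a grid of `steps` points; return (max, B(t)).
    sd = math.sqrt(t / steps)
    b, m = 0.0, 0.0
    for _ in range(steps):
        b += rng.gauss(0.0, sd)
        m = max(m, b)
    return m, b

rng = random.Random(0)
t, a, n = 1.0, 1.0, 20_000
hit_max = hit_end = 0
for _ in range(n):
    m, b = max_and_endpoint(t, 200, rng)
    hit_max += m >= a
    hit_end += b >= a
p_max, p_end = hit_max / n, hit_end / n
print(round(p_max, 2), round(2 * p_end, 2))  # both ≈ 2(1 - N(1)) ≈ 0.32
```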

We now establish the joint probability distribution of M(t) and B(t).

Proposition 2. For every a > 0, y ≥ 0,

P(M(t) ≥ a, B(t) ≤ a − y) = P(B(t) > a + y).   (3)

Proof. We have

P(B(t) > a + y) = P(B(t) > a + y, M(t) ≥ a) + P(B(t) > a + y, M(t) < a)
= P(B(t) > a + y, M(t) ≥ a)
= P(B(T_a + (t − T_a)) − a > y | M(t) ≥ a) P(M(t) ≥ a).

But since B(T_a + (t − T_a)) − a, by the differential property, is also a Brownian motion, then, by symmetry,

P(B(T_a + (t − T_a)) − a > y | M(t) ≥ a) = P(B(T_a + (t − T_a)) − a < −y | M(t) ≥ a)
= P(B(t) < a − y | M(t) ≥ a).

We conclude

P(B(t) > a + y) = P(B(t) < a − y | M(t) ≥ a) P(M(t) ≥ a) = P(B(t) < a − y, M(t) ≥ a).



We now compute the Laplace transform of the hitting time T_a.

Proposition 3. For every λ > 0,

E[e^{−λT_a}] = e^{−a√(2λ)}.

Proof. We first compute the density of T_a. We have

P(T_a ≤ t) = P(M(t) ≥ a) = 2P(B(t) ≥ a) = (2/√(2πt)) ∫_a^∞ e^{−x²/(2t)} dx = 2(1 − N(a/√t)),

where N is the standard normal distribution function. By differentiating with respect to t we obtain that the density of T_a is given by

(a/√(2π)) t^{−3/2} e^{−a²/(2t)}.

Therefore

E[e^{−λT_a}] = ∫_0^∞ e^{−λt} (a/√(2π)) t^{−3/2} e^{−a²/(2t)} dt.

Computing this integral is a boring exercise in calculus. We just state the result, which is e^{−a√(2λ)}.
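The stated value of the integral can be checked numerically. The sketch below (midpoint-rule quadrature, with truncation parameters of my choosing) integrates e^{−λt} against the density above and compares with e^{−a√(2λ)}.

```python
import math

def laplace_of_hitting_time(a, lam, n=200_000, t_max=200.0):
    # Midpoint rule for integral_0^inf exp(-lam*t) f(t) dt with
    #   f(t) = a / sqrt(2*pi) * t**(-3/2) * exp(-a*a / (2*t)),
    # truncated at t_max (the exp(-lam*t) factor makes the tail negligible).
    dt = t_max / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * dt
        f = a / math.sqrt(2 * math.pi) * t ** (-1.5) * math.exp(-a * a / (2 * t))
        total += math.exp(-lam * t) * f * dt
    return total

a, lam = 1.0, 0.5
approx = laplace_of_hitting_time(a, lam)
exact = math.exp(-a * math.sqrt(2 * lam))
print(round(approx, 4), round(exact, 4))  # both ≈ exp(-1) ≈ 0.3679
```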

3 Brownian motion with drift

So far we considered a Brownian motion which is characterized by zero mean and some variance parameter σ². The standard Brownian motion is the special case σ = 1.

There is a natural way to extend this process to a non-zero mean process by considering B_μ(t) = μt + B(t), given a Brownian motion B(t). Some properties of B_μ(t) follow immediately. For example, given s < t, the increments B_μ(t) − B_μ(s) have mean μ(t − s) and variance σ²(t − s). Also, by the Time Reversal property of B (see the previous lecture) we know that lim_{t→∞} (1/t)B(t) = 0 almost surely. Therefore, almost surely

lim_{t→∞} B_μ(t)/t = μ.

When μ < 0 this means that M(∞) ≜ sup_{t≥0} B_μ(t) < ∞ almost surely. On the other hand, M(∞) ≥ 0 (why?).

Our goal now is to compute the probability distribution of M(∞).




Theorem 2. For every μ < 0, the distribution of M(∞) is exponential with parameter 2|μ|/σ². Namely, for every x ≥ 0,

P(M(∞) > x) = e^{−2|μ|x/σ²}.

The direct proof of this result can be found in Section 6.8 of Resnick's book [3]. The proof consists of two parts. We first show that the distribution of M(∞) is exponential. Then we compute its parameter.

    Later on we will study an alternative proof based on the optional stopping theory for martingale processes.
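Theorem 2 is also easy to probe by simulation. The sketch below (Euler discretization with horizon and step sizes of my choosing; the discrete grid slightly underestimates the true supremum) estimates E[M(∞)], which for an exponential with parameter 2|μ|/σ² equals σ²/(2|μ|) = 0.5 when μ = −1, σ = 1.

```python
import math
import random

def sup_drifted_bm(mu, sigma, rng, horizon=20.0, dt=0.01):
    # Approximate M = sup_t (mu*t + sigma*B(t)) for mu < 0 by simulating
    # on a grid up to a finite horizon; with mu = -1 the sup is attained
    # early with high probability, so the truncation error is small.
    sd = sigma * math.sqrt(dt)
    x = m = 0.0
    for _ in range(int(horizon / dt)):
        x += mu * dt + rng.gauss(0.0, sd)
        m = max(m, x)
    return m

rng = random.Random(0)
n = 2_000
samples = [sup_drifted_bm(-1.0, 1.0, rng) for _ in range(n)]
mean = sum(samples) / n
print(round(mean, 2))  # ≈ 0.5, biased slightly low by the discrete grid
```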

4 Additional reading materials

Sections 6.5 and 6.8 from Chapter 6 of Resnick's book "Adventures in Stochastic Processes".

    Sections 7.3 and 7.4 in Durrett [2]. Billingsley [1], Section 9.

References

    [1] P. Billingsley, Convergence of probability measures, Wiley-Interscience publication, 1999.

    [2] R. Durrett, Probability: theory and examples, Duxbury Press, second edition, 1996.

[3] S. Resnick, Adventures in stochastic processes, Birkhäuser Boston, Inc., 1992.




    MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 8 9/30/2013

    Quadratic variation property of Brownian motion

    Content.

    1. Unbounded variation of a Brownian motion.

    2. Bounded quadratic variation of a Brownian motion.

1 Unbounded variation of a Brownian motion

Any sequence of values 0 ≤ t_0 < t_1 < ⋯ < t_n ≤ T is called a partition Π = (t_0, …, t_n) of the interval [0, T]. Given a continuous function f : [0, T] → R, its total variation is defined to be

LV(f) ≜ sup_Π Σ_{1≤k≤n} |f(t_k) − f(t_{k−1})|,

where the supremum is taken over all possible partitions Π of the interval [0, T] for all n. A function f is defined to have bounded variation if its total variation is finite.

Theorem 1. Almost surely, no path of a Brownian motion has bounded variation. Namely, for every T > 0,

P(ω : LV(B(ω)) < ∞) = 0.

The main tool is the following result from real analysis, which we do not prove: if a function f has bounded variation on [0, T], then it is differentiable almost everywhere on [0, T]. We will now show that quite the opposite is true.
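The growth of the variation along refining partitions can be seen directly in simulation. The sketch below (pure Python, parameters mine) samples one Brownian path on a fine grid and evaluates the sum Σ|B(t_k) − B(t_{k−1})| on nested coarser grids; since the grids are nested, refinement can only increase the sum by the triangle inequality, and it grows roughly like √m for m grid points instead of converging.

```python
import math
import random

rng = random.Random(0)
T, n = 1.0, 4096

# One BM path sampled on a fine grid of n steps; coarser nested grids
# reuse the same path values.
path = [0.0]
for _ in range(n):
    path.append(path[-1] + rng.gauss(0.0, math.sqrt(T / n)))

variations = []
for m in (64, 256, 1024, 4096):
    step = n // m
    v = sum(abs(path[(k + 1) * step] - path[k * step]) for k in range(m))
    variations.append(v)
    print(m, round(v, 1))
# E[variation on m points] = sqrt(2m/pi), so the sum keeps growing.
```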

Proposition 1. Brownian motion is almost surely nowhere differentiable. Specifically,

P(∀ t ≥ 0 : lim sup_{h→0} |B(t + h) − B(t)|/h = ∞) = 1.


Proof. Fix T > 0, M > 0 and consider A(M, T) ⊂ C[0, ∞), the set of all paths ω ∈ C[0, ∞) such that there exists at least one point t ∈ [0, T] with

lim sup_{h→0} |B(t + h) − B(t)|/h ≤ M.

We claim that P(A(M, T)) = 0. This implies P(∪_{M≥1} A(M, T)) = 0, which is what we need. Then we take a union of the sets A(M, T) with increasing T and conclude that B is almost surely nowhere differentiable on [0, ∞). If ω ∈ A(M, T), then there exist t ∈ [0, T] and n such that |B(s) − B(t)| ≤ 2M|s − t| for all s ∈ (t − 2/n, t + 2/n). Now define A_n ⊂ C[0, ∞) to be the set of all paths ω such that for some t ∈ [0, T],

|B(s) − B(t)| ≤ 2M|s − t| for all s ∈ (t − 2/n, t + 2/n). Then

A_n ⊂ A_{n+1}   (1)

and

A(M, T) ⊂ ∪_n A_n.   (2)

Find k = max{j : j/n ≤ t}. Define

Y_k = max{ |B((k+2)/n) − B((k+1)/n)|, |B((k+1)/n) − B(k/n)|, |B(k/n) − B((k−1)/n)| }.

In other words, consider the maximum increment of the Brownian motion over these three short intervals. We claim that Y_k ≤ 6M/n for every path ω ∈ A_n.

To prove the required bound on Y_k we first consider

|B((k+2)/n) − B((k+1)/n)| ≤ |B((k+2)/n) − B(t)| + |B(t) − B((k+1)/n)|
≤ 2M(2/n) + 2M(1/n)
= 6M/n.

The other two differences are analyzed similarly.

Now consider the event B_n, which is the set of all paths ω such that Y_k(ω) ≤ 6M/n for some 0 ≤ k ≤ Tn. We showed that A_n ⊂ B_n. We claim that lim_n P(B_n) = 0. Combining this with (1), we conclude P(A_n) = 0. Combining with (2), this will imply that P(A(M, T)) = 0 and we will be done.

Now to obtain the required bound on P(B_n) we note that, since the increments of a Brownian motion are independent and identically distributed,

P(B_n) ≤ Σ_{0≤k≤Tn} P(Y_k ≤ 6M/n)
≤ Tn P( max{ |B(3/n) − B(2/n)|, |B(2/n) − B(1/n)|, |B(1/n) − B(0)| } ≤ 6M/n )
= Tn [P(|B(1/n)| ≤ 6M/n)]³.   (3)

Finally, we analyze this probability. We have

P(|B(1/n)| ≤ 6M/n) = P(|B(1)| ≤ 6M/√n).

Since B(1) has the standard normal distribution, its density at any point is at most 1/√(2π), so this probability is at most 2(6M)/√(2πn). We conclude that the expression in (3) is, ignoring constants, O(n(1/√n)³) = O(1/√n) and thus converges to zero as n → ∞. We proved lim_n P(B_n) = 0.

2 Bounded quadratic variation of a Brownian motion

Even though Brownian motion is nowhere differentiable and has unbounded total variation, it turns out that it has bounded quadratic variation. This observation is the cornerstone of Itô calculus, which we will study later in this course.

We again start with partitions Π = (t_0, …, t_n) of a fixed interval [0, T], but now consider instead

Q(Π, B) ≜ Σ_{1≤k≤n} (B(t_k) − B(t_{k−1}))²,

where we take (without loss of generality) t_0 = 0 and t_n = T. For every partition Π define

Δ(Π) = max_{1≤k≤n} |t_k − t_{k−1}|.

Theorem 2. Consider an arbitrary sequence of partitions Π_i, i = 1, 2, …. Suppose lim_i Δ(Π_i) = 0. Then

lim_i E[(Q(Π_i, B) − T)²] = 0.   (4)


Suppose in addition lim_i i²Δ(Π_i) = 0 (that is, the resolution Δ(Π_i) converges to zero faster than 1/i²). Then almost surely

Q(Π_i, B) → T.   (5)

In words, the standard Brownian motion almost surely has finite quadratic variation on [0, T], which is equal to T.

Proof. We will use the following fact. Let Z be a standard Normal random variable. Then E[Z⁴] = 3 (cute, isn't it?). The proof can be obtained using Laplace transforms of Normal random variables or integration by parts, and we skip the details.
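A one-line Monte Carlo check of the E[Z⁴] = 3 fact (sample size my choice):

```python
import random

rng = random.Random(0)
n = 1_000_000
# Empirical fourth moment of a standard normal sample.
m4 = sum(rng.gauss(0.0, 1.0) ** 4 for _ in range(n)) / n
print(round(m4, 1))  # ≈ 3.0
```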

Fix a partition Π = Π_i = (t_0, …, t_n) and let ξ_k = (B(t_k) − B(t_{k−1}))² − (t_k − t_{k−1}). Then, using the independent Gaussian increments property of Brownian motion, ξ_k is a sequence of independent zero-mean random variables. We have

Q(Π_i) − T = Σ_{1≤k≤n} ξ_k.

Now consider the second moment of this difference:

E(Q(Π_i) − T)² = Σ_{1≤k≤n} E(B(t_k) − B(t_{k−1}))⁴
− 2 Σ_{1≤k≤n} E(B(t_k) − B(t_{k−1}))² (t_k − t_{k−1}) + Σ_{1≤k≤n} (t_k − t_{k−1})².

Using the E[Z⁴] = 3 property, this expression becomes

3 Σ_{1≤k≤n} (t_k − t_{k−1})² − 2 Σ_{1≤k≤n} (t_k − t_{k−1})² + Σ_{1≤k≤n} (t_k − t_{k−1})²
= 2 Σ_{1≤k≤n} (t_k − t_{k−1})²
≤ 2 Δ(Π_i) Σ_{1≤k≤n} (t_k − t_{k−1})
= 2 Δ(Π_i) T.

Now if lim_i Δ(Π_i) = 0, then the bound converges to zero as well. This establishes the first part of the theorem.

To prove the second part, identify a sequence ε_i → 0 such that Δ(Π_i) = ε_i/i². By assumption, such a sequence exists. By Markov's inequality,

P((Q(Π_i) − T)² > 2ε_i) ≤ E(Q(Π_i) − T)²/(2ε_i) ≤ 2Δ(Π_i)T/(2ε_i) = T/i².   (6)

Since Σ_i T/i² < ∞, the sum of probabilities in (6) is finite. Then, applying the Borel-Cantelli Lemma, the probability that (Q(Π_i) − T)² > 2ε_i for infinitely many i is zero. Since ε_i → 0, this exactly means that almost surely lim_i Q(Π_i) = T.
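The convergence Q(Π_i, B) → T is visible even for moderate meshes. The sketch below (pure Python, grid sizes mine) samples one path on a fine grid with T = 2 and evaluates Q on nested coarser partitions; the variance bound 2Δ(Π)T from the proof predicts that the values concentrate near T as the mesh shrinks.

```python
import math
import random

rng = random.Random(0)
T, n = 2.0, 4096

# One BM path on a fine grid; coarser nested partitions reuse its values.
path = [0.0]
for _ in range(n):
    path.append(path[-1] + rng.gauss(0.0, math.sqrt(T / n)))

results = {}
for m in (64, 256, 1024, 4096):
    step = n // m
    sub = path[::step]  # nested partition with m intervals
    q = sum((b - a) ** 2 for a, b in zip(sub, sub[1:]))
    results[m] = q
    print(m, round(q, 2))
# Var(Q - T) = 2*T*T/m here, so the finest grid should land close to T = 2.
```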

3 Additional reading materials

Sections 6.11 and 6.12 of Chapter 6 in Resnick's book [1].

    References

[1] S. Resnick, Adventures in stochastic processes, Birkhäuser Boston, Inc., 1992.



MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 9 10/2/2013

Conditional expectations, filtration and martingales

    Content.

    1. Conditional expectations

    2. Martingales, sub-martingales and super-martingales

    1 Conditional Expectations

1.1 Definition

Recall how we define conditional expectations. Given a random variable X and an event A we define E[X|A] = E[X 1{A}]/P(A).

Also we can consider conditional expectations with respect to random variables. For simplicity say Y is a simple random variable on Ω taking values y_1, y_2, …, y_n with some probabilities

P(ω : Y(ω) = y_i) = p_i.

Now we define the conditional expectation E[X|Y] as a random variable which takes value E[X|Y = y_i] with probability p_i, where E[X|Y = y_i] should be understood as the expectation of X conditioned on the event {ω : Y(ω) = y_i}.

It turns out that one can define conditional expectation with respect to a σ-field. This notion will include both conditioning on events and conditioning on random variables as special cases.

Definition 1. Given Ω, two σ-fields G ⊂ F on Ω, and a probability measure P on (Ω, F), suppose X is a random variable measurable with respect to F but not necessarily with respect to G, and suppose X has a finite L1 norm (that is, E[|X|] < ∞). The conditional expectation E[X|G] is defined to be a random variable Y which satisfies the following properties:

    (a) Y is measurable with respect to G.

    1

(b) For every A ∈ G, we have E[X 1{A}] = E[Y 1{A}].

For simplicity, from now on we write Z ∈ F to indicate that Z is measurable with respect to F. Also let F(Z) denote the smallest σ-field with respect to which Z is measurable.

Theorem 1. The conditional expectation E[X|G] exists and is unique.

Uniqueness means that if Y′ ∈ G is any other random variable satisfying conditions (a) and (b), then Y′ = Y a.s. (with respect to the measure P). We will prove this theorem using the notion of the Radon-Nikodym derivative, whose existence we state without proof below. But before we do this, let us develop some intuition behind this definition.

    1.2 Simple properties

Consider the trivial case when G = {∅, Ω}. We claim that the constant value c = E[X] is E[X|G]. Indeed, any constant function is measurable with respect to any σ-field, so (a) holds. For (b), we have E[X 1{Ω}] = E[X] = c and E[c 1{Ω}] = E[c] = c; and E[X 1{∅}] = 0 and E[c 1{∅}] = 0.

As the other extreme, suppose G = F. Then we claim that X = E[X|G]. The condition (b) trivially holds. The condition (a) also holds because of the equality between the two σ-fields.

Let us go back to our example of conditional expectation with respect to an event A ⊂ Ω. Consider the associated σ-field G = {∅, A, A^c, Ω} (we established in the first lecture that this is indeed a σ-field). Consider a random variable Y : Ω → R defined as

Y(ω) = E[X|A] = E[X 1{A}]/P(A) ≜ c_1

for ω ∈ A and

Y(ω) = E[X|A^c] = E[X 1{A^c}]/P(A^c) ≜ c_2

for ω ∈ A^c. We claim that Y = E[X|G]. First, Y ∈ G. Indeed, assume for simplicity c_1 < c_2. Then {ω : Y(ω) ≤ x} = ∅ when x < c_1, = A for c_1 ≤ x < c_2, and = Ω when x ≥ c_2. Thus Y ∈ G. Then we need to check the equality E[X 1{B}] = E[Y 1{B}] for every B = ∅, A, A^c, Ω, which is straightforward to do. For example, say B = A. Then

E[X 1{A}] = E[X|A] P(A) = c_1 P(A).

On the other hand, we defined Y(ω) = c_1 for all ω ∈ A. Thus

E[Y 1{A}] = c_1 E[1{A}] = c_1 P(A).

    And the equality checks.

Suppose now G corresponds to some partition A_1, …, A_m of the sample space Ω. Given X ∈ F, using a similar analysis, we can check that Y = E[X|G] is the random variable which takes value E[X|A_j] on A_j, for j = 1, 2, …, m. You will recognize that this is one of our earlier examples, where we conditioned on a simple random variable Y to get E[X|Y]. In fact this generalizes as follows:

Given two random variables X, Y : Ω → R, suppose both are in F. Let G = G(Y) ⊂ F be the σ-field generated by Y. We define E[X|Y] to be E[X|G].
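For a finite sample space, the partition case can be computed explicitly. The sketch below (a toy example of my own construction) builds E[X|G] for a two-cell partition and verifies the defining property (b) on every set of G.

```python
from fractions import Fraction

# Toy example: Omega = {0,...,5} with the uniform measure, X(w) = w,
# and G generated by the partition A1 = {0,1,2}, A2 = {3,4,5}.
omegas = range(6)
p = {w: Fraction(1, 6) for w in omegas}
X = {w: Fraction(w) for w in omegas}
partition = [{0, 1, 2}, {3, 4, 5}]

# E[X|G] is constant on each cell, equal to the cell average E[X|Aj].
Y = {}
for cell in partition:
    mass = sum(p[w] for w in cell)
    avg = sum(X[w] * p[w] for w in cell) / mass
    for w in cell:
        Y[w] = avg

# Property (b): E[X 1{A}] = E[Y 1{A}] for every A in G (unions of cells).
for A in (set(), {0, 1, 2}, {3, 4, 5}, set(omegas)):
    assert sum(X[w] * p[w] for w in A) == sum(Y[w] * p[w] for w in A)

print(Y[0], Y[3])  # 1 4: the averages over the two cells
```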

    1.3 Proof of existence

    We now give a proof sketch of Theorem 1.

Proof. Given two probability measures P_1, P_2 defined on the same (Ω, F), P_2 is defined to be absolutely continuous with respect to P_1 if for every set A ∈ F, P_1(A) = 0 implies P_2(A) = 0.

    The following theorem is the main technical part for our proof. It involves using the familiar idea of change of measures.

Theorem 2 (Radon-Nikodym Theorem). Suppose P_2 is absolutely continuous with respect to P_1. Then there exists a non-negative random variable Y : Ω → R_+ such that for every A ∈ F,

P_2(A) = E_{P_1}[Y 1{A}].

The function Y is called the Radon-Nikodym (RN) derivative and is sometimes denoted dP_2/dP_1.


Problem 1. Prove that Y is unique up to measure zero. That is, if Y′ is also an RN derivative, then Y = Y′ a.s. w.r.t. P_1 and hence P_2.

We now use this theorem to establish the existence of conditional expectations. Thus we have G ⊂ F, P is a probability measure on F and X is measurable with respect to F. We will only consider the case X ≥ 0 with E[X] < ∞. We also assume that X is not almost surely zero, so that E[X] > 0. Consider a new probability measure P_2 on G defined as follows:

P_2(A) = E_P[X 1{A}]/E_P[X],  A ∈ G,

where we write E_P in place of E to emphasize that the expectation operator is with respect to the original measure P. Check that this is indeed a probability measure on (Ω, G). Now P also induces a probability measure on (Ω, G). We claim that P_2 is absolutely continuous with respect to P. Indeed, if P(A) = 0 then the numerator is zero. By the Radon-Nikodym Theorem there then exists Z, measurable with respect to G, such that for any A ∈ G,

P_2(A) = E_P[Z 1{A}].

We now take Y = Z E_P[X]. Then Y satisfies condition (a), since Z ∈ G, and condition (b) of being a conditional expectation, since for every set B ∈ G,

E_P[Y 1{B}] = E_P[X] E_P[Z 1{B}] = E_P[X 1{B}].

The second part, corresponding to the uniqueness property, is proved similarly to the uniqueness of the RN derivative (Problem 1).
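The change-of-measure construction can be traced on a finite sample space, where the RN derivative of P_2 with respect to P on G is just the ratio of cell probabilities. The sketch below (a toy example of my own construction) reproduces Y = Z E_P[X] and checks that it equals the cell averages, i.e. E[X|G].

```python
from fractions import Fraction

# Toy version of the existence proof: Omega = {0,...,3}, uniform P,
# X(w) = w + 1 >= 0, and G generated by the partition {0,1}, {2,3}.
omegas = range(4)
p = {w: Fraction(1, 4) for w in omegas}
X = {w: Fraction(w + 1) for w in omegas}
EX = sum(X[w] * p[w] for w in omegas)  # E_P[X] = 5/2
partition = [{0, 1}, {2, 3}]

Y = {}
for cell in partition:
    # P2(A) = E_P[X 1{A}] / E_P[X] restricted to G; on a cell the RN
    # derivative Z = dP2/dP is the constant P2(cell) / P(cell).
    p2_cell = sum(X[w] * p[w] for w in cell) / EX
    p_cell = sum(p[w] for w in cell)
    z = p2_cell / p_cell
    for w in cell:
        Y[w] = z * EX  # Y = Z * E_P[X]

# Y is exactly the cell average of X, i.e. E[X|G].
print(Y[0], Y[2])  # 3/2 7/2
```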

    2 Properties

    Here are some additional properties of conditional expectations.

    Linearity. E[aX + Y |G] = aE[X|G] + E[Y |G].

Monotonicity. If X_1 ≤ X_2 a.s., then E[X_1|G] ≤ E[X_2|G]. The proof idea is similar to the one you need to use for Problem 1.

    Independence.

Problem 2. Suppose X is independent from G. Namely, for every measurable A ⊂ R and B ∈ G,

P({X ∈ A} ∩ B) = P(X ∈ A) P(B).

Prove that E[X|G] = E[X].


Conditional Jensen's inequality. Let φ be a convex function with E[|X|], E[|φ(X)|] < ∞. Then φ(E[X|G]) ≤ E[φ(X)|G].

Proof. We use the following representation of a convex function, which we do not prove (see Durrett [1]). Let

A = {(a, b) ∈ Q² : ax + b ≤ φ(x) for all x}.

Then φ(x) = sup{ax + b : (a, b) ∈ A}.

Now we prove Jensen's inequality. For any pair of rationals a, b ∈ Q satisfying the bound ax + b ≤ φ(x)