214.9 (Wishart)


    1 Data Arrays and Decompositions

    1.1 Variance Matrices and Eigenstructure

Consider a p × p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is of interest in understanding patterns of association and underlying structure that may be lower dimensional, in the sense that highly correlated - collinear - variables may be driven by a common underlying but unobserved factor, or may simply be redundant measures of the same phenomenon.

Write

    V = E D E'

where D = diag(d_1, ..., d_p) is the diagonal matrix of eigenvalues of V and the corresponding eigenvectors are the columns of the orthogonal matrix E. Inversely, E'V E = D.

If V is the variance matrix of a generic random p-vector x, then E maps x to uncorrelated variates and back; that is, there exists a p-vector f such that V(f) = D and x = E f, or f = E'x. The representation x = E f may be referred to as a factor decomposition of x; the uncorrelated elements of f are factors that, through the linear combinations defined by the map E, generate the patterns of variation and association in the elements of x. The jth factor in f impacts the ith element of x through the weight E_{i,j}, and for this reason E may be referred to as the factor loadings matrix.

The factors with largest variances - the largest eigenvalues - play dominant roles in defining the levels of variation and patterns of association in the elements of x. Factor i contributes 100 d_i / Σ_{j=1}^p d_j % of the total variation in V, namely Σ_{j=1}^p d_j = tr(V).

If V is singular - rank deficient of rank r < p - the same structure exists but p − r of the eigenvalues are zero. Now D = diag(d_1, ..., d_r) represents the non-zero and positive eigenvalues, and E is no longer square but p × r with E'E = I, now the r × r identity. Further, x = E f and f = E'x where f is a factor vector with V(f) = D. This clearly represents the precise collinearities among the elements of x - there are only r free dimensions of variation. In non-singular cases, very small eigenvalues indicate a context of high collinearities, approaching singularity.

This decomposition - both the eigendecomposition of V and the resulting representation x = E f - is also known as the principal component decomposition. Principal component analysis (PCA) involves evaluation and exploration of the empirical factors computed based on a sample estimate of the variance matrix of a p-dimensional distribution.
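As a concrete illustration, the following Matlab sketch computes the eigenstructure of a small synthetic variance matrix and the implied factor transformation. The matrix V and all settings here are hypothetical, chosen only for the example.

    % Eigenstructure of a synthetic 3x3 variance matrix V
    s = diag([2 2 1]);                              % standard deviations
    R = [1.0 0.9 0.2; 0.9 1.0 0.3; 0.2 0.3 1.0];    % a valid correlation matrix
    V = s * R * s;                                  % positive definite variance matrix

    [E, D] = eig(V);                                % V = E*D*E', E orthogonal
    [d, ix] = sort(diag(D), 'descend');             % order factors by variance
    E = E(:, ix);  D = diag(d);

    pct = 100 * d / trace(V);                       % % of total variation per factor

    x = chol(V, 'lower') * randn(3, 1);             % one draw x ~ N(0, V)
    f = E' * x;                                     % factors: V(f) = D
    xr = E * f;                                     % reconstructs x exactly

Here the first factor, driven by the two highly correlated variables, accounts for most of tr(V).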

    1.2 Data Arrays, Sample Variances and Singular Value Decompositions

Consider the data array from n observations on p variables, denoted by the n × p matrix X whose rows are samples and columns are variables. Observation/case/sample i has values in the p-vector x_i, and x_i' is the ith row of X. The p × n matrix X' has variables as rows, and n samples as columns x_1, ..., x_n.

Assume the variables are centered - i.e., have zero mean, or that the sample means have been subtracted - so that sample covariances are represented in the p × p matrix V = S/n where S = X'X = Σ_{i=1}^n x_i x_i'. (The divisor could be taken as n − 1, as a matter of detail.)

V and S have the same eigenvectors, and eigenvalues that are the same up to the factor n; i.e., V = E D E' and S = E D_s E'

where D_s = nD. This holds whether or not S, and so V, is of full rank: E is p × r of rank r and D = diag(d_1, ..., d_r) with positive values. The rank r of S cannot, of course, exceed that of X, so r ≤ min(p, n). In particular, if p > n then r ≤ n < p. That is, the rank is at most the sample size when there are more variables than samples.


The singular value decomposition of the data matrix X' is

    X' = E F

where the r × n matrix F is such that F F' is diagonal. In fact, we see that F = E'X' so that F F' = E'S E = D_s = nD. The r elements √(n d_i) are also known as the singular values of X.

A more common form of the SVD is

    X' = E D_s^{1/2} F*

where the r × n matrix F* = D_s^{−1/2} F is such that F* F*' = I, the r × r identity. For example, the Matlab and R svd functions generate outputs in this form. The rows of F* simply represent standardized (unit variance) versions of the r factors in F.

In cases of p < n, both X' and F are p × n matrices (taking r = p in the full rank case), having more columns than rows - they are short, wide matrices.

In cases of p > n, r can be no more than the sample size. Then both X' and E are tall and skinny, with E being p × r and having possibly fewer than n columns in rank reduced cases.

Standard SVD routines in software packages generally produce redundant decompositions, and the computation is inefficient. For example, in cases with p > n, the standard Matlab svd function returns E of dimension p × p and D_s^{1/2} as p × n with the lower p − n rows filled with zeros. The function can be flagged to produce E of dimension p × n and just the reduced D_s^{1/2} with the n relevant eigenvalues. Check the documentation in Matlab and R; see also the cover Matlab function svd0 on the course web site.
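As a small illustration, the following Matlab sketch (dimensions and data hypothetical) uses the built-in economy-size form svd(·,'econ') and verifies the link to the eigenstructure of S; the course's svd0 is a cover function for the same purpose.

    n = 10;  p = 25;                       % a p > n example
    X = randn(n, p);
    X = X - repmat(mean(X, 1), n, 1);      % center the variables

    [E, Dh, Fs] = svd(X', 'econ');         % X' = E*Dh*Fs'; Dh = Ds^{1/2}
    Ds = Dh.^2;                            % Ds = nD, eigenvalues of S = X'X
    Fstar = Fs';                           % rows are the standardized factors
    S = X' * X;
    chk = norm(E' * S * E - Ds);           % near zero: E'SE = Ds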

Write F = (f_1, ..., f_n) so that x_i = E f_i and f_i = E'x_i. The f_i are the n sample values of the singular factor vectors, and E provides the loadings of the data variables on the singular factors.

Finally, consider the precision matrix corresponding to V. We have K = V^−, which is the regular inverse if V is non-singular, or the generalized inverse otherwise (recall that the generalized inverse satisfies V V^− V = V and V^− V V^− = V^−). With V = E D E' we have

    K = E D^− E'

where:

- if V is non-singular, then E is p × p and D^− = D^{−1} = diag(1/d_1, ..., 1/d_p);
- if V is singular of rank r < p, then E is p × r and D^− = diag(1/d_1, ..., 1/d_r).
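A minimal Matlab sketch of this construction, applicable to any variance matrix V (such as the synthetic example above); the tolerance is an arbitrary numerical choice, and pinv(V) gives the same answer in the singular case.

    [E, D] = eig((V + V') / 2);            % symmetrize for numerical safety
    d = diag(D);
    keep = d > 1e-10 * max(d);             % retain the positive eigenvalues
    E = E(:, keep);  d = d(keep);          % E is p x r with r positive values

    K = E * diag(1 ./ d) * E';             % inv(V) if r = p, else generalized inverse
    % checks: norm(V*K*V - V) and norm(K*V*K - K) are both near zero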

Note how the patterns of loadings of variables on factors, defined by the elements of E, also play major roles in defining the elements of the precision matrix. See the course data page for exploration of patterns of association in time series exchange rate returns, and some exploratory Matlab code.


    2 Wishart Distributions: Variance and Precision Matrices

The Wishart distributions arise as models for random variation and descriptions of uncertainty about variance and precision matrices. They are of particular interest in sampling and inference on covariance and association structure in multivariate normal models, and in ranges of extensions in regression and state space models.

    2.1 Definition and Structure

Suppose that Ω is a p × p symmetric matrix of random quantities

    Ω = ( ω_{1,1}  ω_{1,2}  ω_{1,3}  ...  ω_{1,p}
          ω_{1,2}  ω_{2,2}  ω_{2,3}  ...  ω_{2,p}
          ω_{1,3}  ω_{2,3}  ω_{3,3}  ...  ω_{3,p}
          ...      ...      ...      ...  ...
          ω_{1,p}  ω_{2,p}  ω_{3,p}  ...  ω_{p,p} ).

Suppose that the joint density of the p(p + 1)/2 univariate elements defining Ω is given by

    p(Ω) = c |Ω|^{(d−p−1)/2} exp{−tr(A^{−1}Ω)/2}

for some constant degrees of freedom d and p × p positive definite symmetric matrix A, and that this density is defined and non-zero only when Ω is positive definite, and hence non-singular. This is the p.d.f. of a Wishart distribution for Ω. The Wishart is a multivariate extension of the gamma distribution, as the form of the p.d.f. intimates.

Some notation, comments and key properties are noted (see Lauritzen, 1996, Graphical Models (O.U.P.), Appendix C, for good and detailed development of many aspects of the theory of normal and Wishart distributions).

The standard notation is Ω ∼ W_p(d, A).

The distribution is defined and proper for all real-valued degrees of freedom d ≥ p, and for integer degrees of freedom 0 < d < p. In the latter case, the distribution is singular, with the density defined and positive only on a reduced space of matrices of rank d < p. See the discussion of singular cases in a subsection below.

A is the location matrix parameter of the distribution. E(Ω) = dA and E(Ω^{−1}) = A^{−1}/(d − p − 1) (the latter only defined when d > p + 1). The normalizing constant c is given by

    c^{−1} = |A|^{d/2} 2^{dp/2} π^{p(p−1)/4} ∏_{i=1}^p Γ((d + 1 − i)/2).

In the exponent of the p.d.f., tr(A^{−1}Ω) = tr(ΩA^{−1}).

The distribution is proper and defined via the p.d.f. if and only if the degrees of freedom is no less than the dimension, d ≥ p, but it then applies for any real value of d, not only integer values.

The eigendecomposition of Ω is Ω = ΓΛΓ' where Γ is the p × p orthogonal matrix whose columns are eigenvectors of Ω, and Λ = diag(λ_1, ..., λ_p) are the positive eigenvalues. If (a_1, ..., a_p) are the (also positive) eigenvalues of A, then

    p(Ω) ∝ { ∏_{i=1}^p λ_i^{(d−p−1)/2} a_i^{−d/2} } exp{−tr(A^{−1}Ω)/2}.
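The p.d.f. is straightforward to evaluate on the log scale; the following Matlab function is a minimal sketch transcribing the density and normalizing constant above (the function name is ours, and Cholesky-based log-determinants are used purely for numerical stability; save as wishart_logpdf.m).

    function lp = wishart_logpdf(Omega, d, A)
    % log W_p(d, A) density at a positive definite Omega, with d >= p
      p   = size(A, 1);
      ldA = 2 * sum(log(diag(chol(A))));       % log |A|
      ldO = 2 * sum(log(diag(chol(Omega))));   % log |Omega|
      lc  = -(d/2)*ldA - (d*p/2)*log(2) - (p*(p-1)/4)*log(pi) ...
            - sum(gammaln((d + 1 - (1:p)) / 2));   % log of constant c
      lp  = lc + ((d - p - 1)/2)*ldO - trace(A \ Omega)/2;
    end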


The Wishart distribution is a multivariate version of the gamma distribution. Further, marginal distributions of diagonal elements and block diagonal elements of Ω are also Wishart distributed. Specifically:

- If p = 1, write Ω = ω and a = A, both now scalars. The p.d.f. shows that ω ∼ Ga(d/2, 1/(2a)), or ω = aκ where κ ∼ χ²_d.

- Partition Ω as

      Ω = ( Ω_{1,1}   Ω_{1,2}
            Ω_{1,2}'  Ω_{2,2} )

  where Ω_{1,1} is q × q with q < p, Ω_{2,2} is (p − q) × (p − q) and Ω_{1,2} is q × (p − q). Partition A conformably, with elements A_{1,1}, A_{2,2} and A_{1,2}. Then

      Ω_{1,1} ∼ W_q(d, A_{1,1}) and Ω_{2,2} ∼ W_{p−q}(d, A_{2,2}).

- The diagonal elements have gamma marginal distributions: ω_{i,i} ∼ Ga(d/2, 1/(2a_{i,i})) where a_{i,i} is the ith diagonal element of A. That is, ω_{i,i} = a_{i,i} κ_i where κ_i ∼ χ²_d. (A quick Monte Carlo check follows this list.)
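A minimal Monte Carlo check of this marginal (all settings arbitrary), using the construction of Section 2.3 below: for integer d, Ω = Z'Z with the d rows of Z drawn iid from N(0, A) has the W_p(d, A) distribution.

    p = 3;  d = 8;  M = 20000;
    A = cov(randn(50, p));                 % an arbitrary positive definite A
    L = chol(A, 'lower');
    w11 = zeros(M, 1);
    for m = 1:M
      Z = (L * randn(p, d))';              % d x p, rows iid N(0, A)
      w11(m) = Z(:, 1)' * Z(:, 1);         % (1,1) element of Omega = Z'Z
    end
    % omega_{1,1} = a_{1,1}*kappa with kappa ~ chi^2_d:
    [mean(w11), d * A(1,1)]                % means agree
    [var(w11),  2 * d * A(1,1)^2]          % variances agree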

These are just a few key properties of the Wishart distribution, there being much more theory of relevance in multivariate analysis and also statistical modelling that relates to the joint and conditional distributions of matrix sub-elements of Ω. In particular, Bayesian analysis of Gaussian graphical models relies heavily on such structure both for graphical model development and for specification of prior distributions over graphical models (see Lauritzen, 1996, Graphical Models (O.U.P.), Appendix C, for a summary of key theoretical results).

    2.2 Inverse Wishart Distributions and Notations

If Ω ∼ W_p(d, A) then the random variance matrix Σ = Ω^{−1} has an inverse Wishart distribution, denoted by Σ ∼ IW_p(d, A).

The density is derived by direct transformation, using the Jacobian |∂Ω/∂Σ| = |Σ|^{−(p+1)}.

The IW p.d.f. is

    p(Σ) = c |Σ|^{−(d+p+1)/2} exp{−tr(Σ^{−1}A^{−1})/2}

with normalising constant c as given in the previous subsection.

An alternative notation sometimes used for Wishart and inverse Wishart distributions refers to f = d − p + 1 as the degree of freedom parameter, rather than d. Notice that f > 0 when d ≥ p, so this convention has any positive value for the degree of freedom in these regular cases.

In this notation the powers of |Ω| and |Σ| in their p.d.f.s are then (d − p − 1)/2 = f/2 − 1 and −(d + p + 1)/2 = −(p + f/2), respectively.

Note that, since the distribution exists and is very useful and used in multivariate analysis for integer d < p, this convention leads to f ≤ 0 in those singular cases.


    2.3 Wishart Sampling Distributions for Sample Variance Matrices

The Wishart distribution arises naturally as the sampling distribution of (up to a constant) sample variance matrices in multivariate normal populations, as follows:

- Suppose n observations x_i ∼ N(0, Σ) with x_i ⊥ x_j for i ≠ j, and

      S = Σ_{i=1}^n x_i x_i' = X'X

  where X is the n × p data matrix whose rows are the x_i'. The usual sample variance matrix is then Σ̂ = S/n. This is a sufficient statistic for Σ and the MLE of Σ. We have

      (S|Σ) ∼ W_p(n, Σ)

  with E(S|Σ) = nΣ, so that Σ̂ is an unbiased estimate of Σ.

- Suppose n observations x_i ∼ N(μ, Σ) with x_i ⊥ x_j for i ≠ j, and

      S = Σ_{i=1}^n (x_i − x̄)(x_i − x̄)' = X̃'X̃

  where X̃ is the n × p centered data matrix whose rows are the (x_i − x̄)'. The usual sample variance matrix is then Σ̂ = S/(n − 1) and we have S ⊥ x̄ with

      (S|Σ) ∼ W_p(n − 1, Σ),

  and now E(S|Σ) = (n − 1)Σ, so that Σ̂ is an unbiased estimate of Σ (a simulation check of these expectations follows this list).

- Notice that when n < p the sum of squares matrix S is singular of rank n < p. The Wishart distribution then has support that is the subspace of non-negative definite symmetric p × p matrices of rank n, rather than the full space. Otherwise S is non-singular (with probability one) and the Wishart distribution is regular.
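The unbiasedness claims are easy to verify by simulation; a Matlab sketch with arbitrary Σ and sizes:

    p = 2;  n = 6;  M = 20000;
    Sigma = [1 0.5; 0.5 2];
    L = chol(Sigma, 'lower');
    S1 = zeros(p);  S2 = zeros(p);
    for m = 1:M
      X  = (L * randn(p, n))';             % n x p, rows iid N(0, Sigma)
      S1 = S1 + X' * X;                    % uncentered sum of squares
      Xc = X - repmat(mean(X, 1), n, 1);   % centered rows
      S2 = S2 + Xc' * Xc;
    end
    (S1 / M) ./ (n * Sigma)                % elementwise ratios near 1
    (S2 / M) ./ ((n - 1) * Sigma)          % elementwise ratios near 1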

    2.4 Wishart Priors and Posteriors in Multivariate Normal Models: Known Mean

Consider a random sample x_{1:n} from the p-dimensional normal distribution with zero mean, (x_i|Σ) ∼ N(0, Σ), and set Ω = Σ^{−1} for the precision matrix, supposing Σ and Ω to be non-singular.

The likelihood function is

    p(x_{1:n}|Ω) ∝ |Ω|^{n/2} exp{−tr(ΩS)/2}

where

    S = Σ_{i=1}^n x_i x_i' = X'X

and X is the n × p data matrix. Note that the likelihood function has the mathematical form of the Wishart density function earlier introduced.

The standard reference prior is p(Ω) ∝ |Ω|^{−(p+1)/2} over the space of positive definite symmetric matrices. This leads to the standard reference posterior for a normal precision matrix

    p(Ω|x_{1:n}) ∝ |Ω|^{(n−p−1)/2} exp{−tr(ΩS)/2}


so that (Ω|x_{1:n}) ∼ W_p(n, S^{−1}). Also, Σ has an inverse Wishart posterior distribution, (Σ|x_{1:n}) ∼ IW_p(n, S^{−1}). Posterior expectations are

    E(Ω|x_{1:n}) = nS^{−1} = Σ̂^{−1}

and

    E(Σ|x_{1:n}) = E(Ω^{−1}|x_{1:n}) = S/(n − p − 1) = (n/(n − p − 1))Σ̂

if n > p + 1. The sample variance matrix Σ̂ is the harmonic posterior mean of Σ.

The Wishart is also the conjugate proper prior for normal precision matrices, and much use of this fact is made in Bayesian analysis of Gaussian graphical models as well as state space modelling for multivariate time series. In particular, with a prior Ω ∼ W_p(d_0, A_0) where A_0 = S_0^{−1} for some prior sum of squares matrix S_0 and prior sample size d_0, the posterior based on the above likelihood function is W_p(d_n, A_n) where d_n = d_0 + n and A_n = (S_0 + S)^{−1}.
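In code the conjugate update is simple bookkeeping; a minimal sketch with hypothetical data and prior settings:

    p = 3;  n = 20;
    X  = randn(n, p);                      % illustrative zero-mean data
    S  = X' * X;
    d0 = p + 2;  S0 = eye(p);              % weak hypothetical prior W_p(d0, inv(S0))
    dn = d0 + n;                           % posterior degrees of freedom
    Sn = S0 + S;                           % posterior A_n = inv(Sn)
    EOmega = dn * inv(Sn);                 % E(Omega | x_{1:n}) = dn * A_n
    ESigma = Sn / (dn - p - 1);            % E(Sigma | x_{1:n}), since dn > p + 1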

    2.5 Standard Analysis of Multivariate Normal Models: Reference Analysis

Now consider a random sample x_{1:n} from the p-dimensional normal distribution (x_i|μ, Σ) ∼ N(μ, Σ), with all parameters to be estimated.

Write x̄ = Σ_{i=1}^n x_i/n and S = Σ_{i=1}^n (x_i − x̄)(x_i − x̄)'. The standard reference prior is p(μ, Ω) = p(μ)p(Ω) ∝ |Ω|^{−(p+1)/2}. It is easily verified that the resulting posterior is p(μ, Ω|x_{1:n}) = p(μ|Ω, x_{1:n}) p(Ω|x_{1:n}) where:

- (μ|Ω, x_{1:n}) ∼ N(x̄, Ω^{−1}/n);
- (Ω|x_{1:n}) ∼ W_p(n − 1, S^{−1}), where now S is the centered sum of squares with each x_i replaced by x_i − x̄.

The details of this derivation are similar to those of the fully conjugate, proper prior analysis framework now discussed, so are left as an exercise; a direct simulation sketch follows.
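A minimal simulation sketch of this reference posterior (data, sizes and names hypothetical): Ω is drawn from W_p(n − 1, S^{−1}) as a sum of outer products, valid for integer degrees of freedom, and then μ given Ω.

    n = 30;  p = 3;  M = 5000;
    X = randn(n, p) + 1;                   % illustrative data with true mean 1
    xbar = mean(X, 1)';
    Xc = X - repmat(xbar', n, 1);
    S  = Xc' * Xc;
    L  = chol(inv(S), 'lower');            % inv(S) = L*L'
    mus = zeros(M, p);
    for m = 1:M
      Z = (L * randn(p, n - 1))';          % (n-1) x p, rows iid N(0, inv(S))
      Omega = Z' * Z;                      % draw from W_p(n-1, inv(S))
      C = chol(inv(Omega) / n, 'lower');   % conditional variance of mu
      mus(m, :) = (xbar + C * randn(p, 1))';
    end
    mean(mus)                              % posterior mean of mu, close to xbar'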

    2.6 Standard Analysis of Multivariate Normal Models: Full Conjugate Analysis

The main discussion here is of the full conjugate proper prior analysis. This is used a good deal in linear models, mixture modelling with multivariate normal mixtures, graphical models and elsewhere.

A member of the class of conjugate normal/Wishart priors has the form p(μ|Ω)p(Ω) where:

- (μ|Ω) ∼ N(m_0, t_0 Ω^{−1}) for some mean vector m_0 and scalar t_0 > 0;
- Ω ∼ W_p(d_0, A_0) where A_0 = S_0^{−1} for some prior sum of squares matrix S_0 and prior sample size d_0.

The full likelihood function p(x_{1:n}|μ, Ω) can be manipulated into the form

    p(x_{1:n}|μ, Ω) ∝ |Ω|^{n/2} exp{−tr(ΩS)/2} exp{−(x̄ − μ)'(nΩ)(x̄ − μ)/2}

where x̄ and S are the sample mean and centered sum of squares as in the previous subsection. This uses two standard mathematical tricks: the decomposition Σ_{i=1}^n (x_i − μ)(x_i − μ)' = S + n(x̄ − μ)(x̄ − μ)', and the identity z'Ωz = tr(Ωzz').


The Bartlett decomposition of the standard Wishart distribution W_p(d, I) provides an efficient direct simulation algorithm, as well as useful theory. If we can efficiently simulate the standard Wishart, then the last point above shows how we can use that to create samples from any Wishart distribution. The Bartlett decomposition, and hence construction, is as follows:

For fixed dimension p and integer d ≥ p, generate independent normal and chi-square random quantities to define the upper triangular matrix

    U = ( η_1  z_{1,2}  z_{1,3}  ...  z_{1,p}
          0    η_2      z_{2,3}  ...  z_{2,p}
          0    0        η_3      ...  z_{3,p}
          ...  ...      ...      ...  ...
          0    0        0        ...  η_p )

where the non-zero entries are independent random quantities with:

- diagonal elements η_i = √κ_i where κ_i ∼ χ²_{d−i+1} for i = 1, ..., p;
- upper off-diagonal elements z_{i,j} ∼ N(0, 1) for i = 1, ..., p and j = i + 1, ..., p.

Then (Odell and Feiveson, JASA 1968), the random matrix Ω = U'U ∼ W_p(d, I). Hence, if A = P'P for any non-singular p × p matrix P, we can sample from W_p(d, A) by generating U and computing Ω = (UP)'(UP).
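A direct Matlab transcription of this construction (the function name is ours; integer d ≥ p is assumed so the chi-squares can be built from sums of squared normals; save as wishart_bartlett.m):

    function Omega = wishart_bartlett(d, A)
    % one draw Omega ~ W_p(d, A) via the Bartlett decomposition
      p = size(A, 1);
      U = triu(randn(p), 1);               % z_{i,j} ~ N(0,1) for j > i
      for i = 1:p
        U(i, i) = sqrt(sum(randn(d - i + 1, 1).^2));  % eta_i = sqrt of chi^2_{d-i+1}
      end
      P  = chol(A);                        % A = P'*P with P upper triangular
      UP = U * P;
      Omega = UP' * UP;                    % (UP)'(UP) ~ W_p(d, A)
    end

Averaging many such draws recovers E(Ω) = dA, a useful sanity check.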

Some uses of simulation include the ease with which posterior inference on complicated functions of Ω can be derived. For example, inference may be desired for:

- Correlations: the correlation between elements i and j of x is σ_{i,j}/√(σ_{i,i} σ_{j,j}), where the terms are the relevant entries in Σ = Ω^{−1}.

- Complete conditional regression coefficients and covariance selection. Recall that if x = (x_1, ..., x_p)' has a zero mean normal distribution with precision matrix Ω, then

      (x_i | x_{1:p\i}, Ω) ∼ N(m_i(x_{1:p\i}), 1/ω_{i,i})

  where

      m_i(x_{1:p\i}) = Σ_{j=1:p\i} γ_{i,j} x_j  and  γ_{i,j} = −ω_{i,j}/ω_{i,i}.

This last example shows that the posterior for Ω in a data analysis therefore immediately provides direct inferences, via simulation of the elements of the implied terms, for the partial regression coefficients in each of the p implied linear regressions; a simulation sketch follows below. This assumes, of course, a full model in the sense that each x_j has, with probability one, a non-zero coefficient in each regression. The study of covariance selection and Gaussian graphical models focuses on questions of just what variables are relevant as predictors in each of these p conditional distributions.
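A sketch of such simulation-based inference, reusing the hypothetical wishart_bartlett function above with an arbitrary (d, A) standing in for a posterior: each draw of Ω is transformed into the (1,2) correlation and the coefficient γ_{1,2}.

    p = 3;  d = 15;  M = 5000;
    A = inv(eye(p) + 0.5 * ones(p));       % an arbitrary location matrix
    rho = zeros(M, 1);  gam = zeros(M, 1);
    for m = 1:M
      Omega = wishart_bartlett(d, A);      % one (illustrative) posterior draw
      Sig = inv(Omega);
      rho(m) = Sig(1,2) / sqrt(Sig(1,1) * Sig(2,2));  % corr(x_1, x_2)
      gam(m) = -Omega(1,2) / Omega(1,1);   % coefficient of x_2 in E(x_1 | .)
    end
    qs = sort(rho);
    [mean(rho), qs(round(0.05*M)), qs(round(0.95*M))]  % mean and 90% interval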

    2.8 Reduced Rank Cases - Singular Wishart Distributions

Sometimes we are directly interested in singular (reduced rank, or rank deficient) variance matrices, and in cases that arise directly from location matrices A of reduced rank. For example, in the normal sampling model, suppose that X is rank deficient due to collinearities among the variables, so that S is singular. More often, A may be close to singular, in which case the modified method below is also the numerically stable choice.


The real utility arises in problems in which p > n, so that the rank of S is usually n, or maybe less than n, and certainly lower than p due to dimensionality.

The general framework of possibly reduced rank distributions also includes the regular Wishart as a special case.

Suppose that A has rank r ≤ p with eigendecomposition A = E B E', where E is p × r, E'E = I and B = diag(b_1, ..., b_r) with each b_i > 0. This allows A to be rank deficient.

The generalized inverse of A is A^− = E B^{−1} E'. Suppose Ω = PΨP' where P = E B^{1/2} and where Ψ ∼ W_r(n, I). Then Ω is rank deficient, and so singular, when r < p. In those cases, Ω has the singular Wishart distribution.

The p.d.f. is p(Ω) ∝ { ∏_{i=1}^r λ_i^{(n−r−1)/2} } exp{−tr(A^−Ω)/2}, where (λ_1, ..., λ_r) are the r positive eigenvalues of Ω.

Simulation is still direct: simulate a regular, non-singular Wishart Ψ ∼ W_r(n, I) and transform to the rank deficient Ω = PΨP'.

For the reference analysis of the normal variance/precision model, a singular sample variance matrix (arising, as indicated by example, in cases of p > n) leads to A^− = S. With S = X'X = E(nD)E' as earlier explored, this implies A = E B E' as above, where now B = (nD)^{−1}.
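A Matlab sketch putting these pieces together for the p > n reference case (all settings hypothetical): the rank-r eigenstructure of S gives E and B = D_s^{−1} = (nD)^{−1}, and a singular Wishart draw follows by transforming Ψ ∼ W_r(n, I).

    n = 5;  p = 8;
    X = randn(n, p);                       % p > n, so S = X'X is singular
    S = X' * X;
    [E, D] = eig((S + S') / 2);
    [ds, ix] = sort(diag(D), 'descend');
    r = sum(ds > 1e-10 * max(ds));         % numerical rank (here r = n)
    E = E(:, ix(1:r));
    B = diag(1 ./ ds(1:r));                % B = (nD)^{-1}, since ds = n*d_i
    P = E * sqrt(B);                       % p x r
    Z = randn(r, n);                       % Z*Z' ~ W_r(n, I)
    Omega = P * (Z * Z') * P';             % one rank-r singular Wishart draw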
