214.9 (Wishart)


    1 Data Arrays and Decompositions

    1.1 Variance Matrices and Eigenstructure

Consider a p × p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is of interest in understanding patterns of association and underlying structure that may be lower dimensional, in the sense that highly correlated - collinear - variables may be driven by a common underlying but unobserved factor, or may simply be redundant measures of the same phenomenon.

Write

    V = E D E'

where D = diag(d_1, ..., d_p) is the diagonal matrix of eigenvalues of V and the corresponding eigenvectors are the columns of the orthogonal matrix E. Inversely, E'V E = D.

If V is the variance matrix of a generic random p-vector x, then E maps x to uncorrelated variates and back; that is, there exists a p-vector f such that V(f) = D and x = E f, or f = E'x. The representation x = E f may be referred to as a factor decomposition of x; the uncorrelated elements of f are factors that, through the linear combinations defined by the map E, generate the patterns of variation and association in the elements of x. The jth factor in f impacts the ith element of x through the weight E_{i,j}, and for this reason E may be referred to as the factor loadings matrix.

The factors with largest variances - the largest eigenvalues - play dominant roles in defining the levels of variation and patterns of association in the elements of x. Factor i contributes 100 d_i / Σ_{j=1}^p d_j % of the total variation in V, namely Σ_{j=1}^p d_j = tr(V).

If V is singular - rank deficient of rank r < p - the same structure exists but p − r of the eigenvalues are zero. Now D = diag(d_1, ..., d_r) represents the non-zero and positive eigenvalues, and E is no longer square but p × r with E'E = I, now the r × r identity. Further, x = E f and f = E'x where f is a factor vector with V(f) = D. This clearly represents the precise collinearities among the elements of x - there are only r free dimensions of variation. In non-singular cases, very small eigenvalues indicate a context of high collinearities, approaching singularity.

This decomposition - both the eigendecomposition of V and the resulting representation x = E f - is also known as the principal component decomposition. Principal component analysis (PCA) involves evaluation and exploration of the empirical factors computed based on a sample estimate of the variance matrix of a p-dimensional distribution.
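As a concrete illustration, the following Matlab sketch computes the eigenstructure of a small synthetic variance matrix and the implied factor transformation. The matrix V and all settings here are hypothetical, chosen only for the example.

    % Eigenstructure of a synthetic 3x3 variance matrix V
    s = diag([2 2 1]);                              % standard deviations
    R = [1.0 0.9 0.2; 0.9 1.0 0.3; 0.2 0.3 1.0];    % a valid correlation matrix
    V = s * R * s;                                  % positive definite variance matrix

    [E, D] = eig(V);                                % V = E*D*E', E orthogonal
    [d, ix] = sort(diag(D), 'descend');             % order factors by variance
    E = E(:, ix);  D = diag(d);

    pct = 100 * d / trace(V);                       % % of total variation per factor

    x = chol(V, 'lower') * randn(3, 1);             % one draw x ~ N(0, V)
    f = E' * x;                                     % factors: V(f) = D
    xr = E * f;                                     % reconstructs x exactly

Here the first factor, driven by the two highly correlated variables, accounts for most of tr(V).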

    1.2 Data Arrays, Sample Variances and Singular Value Decompositions

Consider the data array from n observations on p variables, denoted by the n × p matrix X whose rows are samples and columns are variables. Observation/case/sample i has values in the p-vector x_i, and x_i' is the ith row of X. The p × n matrix X' has variables as rows, and n samples as columns x_1, ..., x_n.

Assume the variables are centered - i.e., have zero mean, or that the sample means have been subtracted - so that sample covariances are represented in the p × p matrix V = S/n where S = X'X = Σ_{i=1}^n x_i x_i'. (The divisor could be taken as n − 1, as a matter of detail.)

V and S have the same eigenvectors, and eigenvalues that are the same up to the factor n; i.e., V = E D E' and S = E D_s E'

where D_s = nD. This holds whether or not S, and so V, is of full rank: E is p × r of rank r and D = diag(d_1, ..., d_r) with positive values. The rank r of S cannot, of course, exceed that of X, so r ≤ min(p, n). In particular, if p > n then r ≤ n < p. That is, the rank is at most the sample size when there are more variables than samples.


The singular value decomposition of the data matrix X' is

    X' = E F

where the r × n matrix F is such that F F' is diagonal. In fact, we see that F = E'X' so that F F' = E'S E = D_s = nD. The r elements √(n d_i) are also known as the singular values of X.

A more common form of the SVD is

    X' = E D_s^{1/2} F*

where the r × n matrix F* = D_s^{−1/2} F is such that F* F*' = I, the r × r identity. For example, the Matlab and R svd functions generate outputs in this form. The rows of F* simply represent standardized (unit variance) versions of the r factors in F.

In cases of p < n, both X' and F are p × n matrices (taking r = p in the full rank case), having more columns than rows - they are short, wide matrices.

In cases of p > n, r can be no more than the sample size. Then both X' and E are tall and skinny, with E being p × r and having possibly fewer than n columns in rank reduced cases.

Standard SVD routines in software packages generally produce redundant decompositions, and the computation is inefficient. For example, in cases with p > n, the standard Matlab svd function returns E of dimension p × p and D_s^{1/2} as p × n with the lower p − n rows filled with zeros. The function can be flagged to produce E of dimension p × n and just the reduced D_s^{1/2} with the n relevant eigenvalues. Check the documentation in Matlab and R; see also the cover Matlab function svd0 on the course web site.
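As a small illustration, the following Matlab sketch (dimensions and data hypothetical) uses the built-in economy-size form svd(·,'econ') and verifies the link to the eigenstructure of S; the course's svd0 is a cover function for the same purpose.

    n = 10;  p = 25;                       % a p > n example
    X = randn(n, p);
    X = X - repmat(mean(X, 1), n, 1);      % center the variables

    [E, Dh, Fs] = svd(X', 'econ');         % X' = E*Dh*Fs'; Dh = Ds^{1/2}
    Ds = Dh.^2;                            % Ds = nD, eigenvalues of S = X'X
    Fstar = Fs';                           % rows are the standardized factors
    S = X' * X;
    chk = norm(E' * S * E - Ds);           % near zero: E'SE = Ds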

Write F = (f_1, ..., f_n) so that x_i = E f_i and f_i = E'x_i. The f_i are the n sample values of the singular factor vectors, and E provides the loadings of the data variables on the singular factors.

Finally, consider the precision matrix corresponding to V. We have K = V^−, which is the regular inverse if V is non-singular, or the generalized inverse otherwise (recall that the generalized inverse satisfies V V^− V = V and V^− V V^− = V^−). With V = E D E' we have

    K = E D^− E'

where:

- if V is non-singular, then E is p × p and D^− = D^{−1} = diag(1/d_1, ..., 1/d_p);
- if V is singular of rank r < p, then E is p × r and D^− = diag(1/d_1, ..., 1/d_r).
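A minimal Matlab sketch of this construction, applicable to any variance matrix V (such as the synthetic example above); the tolerance is an arbitrary numerical choice, and pinv(V) gives the same answer in the singular case.

    [E, D] = eig((V + V') / 2);            % symmetrize for numerical safety
    d = diag(D);
    keep = d > 1e-10 * max(d);             % retain the positive eigenvalues
    E = E(:, keep);  d = d(keep);          % E is p x r with r positive values

    K = E * diag(1 ./ d) * E';             % inv(V) if r = p, else generalized inverse
    % checks: norm(V*K*V - V) and norm(K*V*K - K) are both near zero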

Note how the patterns of loadings of variables on factors, defined by the elements of E, also play major roles in defining the elements of the precision matrix. See the course data page for exploration of patterns of association in time series exchange rate returns, and some exploratory Matlab code.


    2 Wishart Distributions: Variance and Precision Matrices

The Wishart distributions arise as models for random variation and descriptions of uncertainty about variance and precision matrices. They are of particular interest in sampling and inference on covariance and association structure in multivariate normal models, and in ranges of extensions in regression and state space models.

    2.1 Definition and Structure

Suppose that Ω is a p × p symmetric matrix of random quantities

    Ω = ( ω_{1,1}  ω_{1,2}  ω_{1,3}  ...  ω_{1,p}
          ω_{1,2}  ω_{2,2}  ω_{2,3}  ...  ω_{2,p}
          ω_{1,3}  ω_{2,3}  ω_{3,3}  ...  ω_{3,p}
          ...      ...      ...      ...  ...
          ω_{1,p}  ω_{2,p}  ω_{3,p}  ...  ω_{p,p} ).

Suppose that the joint density of the p(p + 1)/2 univariate elements defining Ω is given by

    p(Ω) = c |Ω|^{(d−p−1)/2} exp{−tr(A^{−1}Ω)/2}

for some constant degrees of freedom d and p × p positive definite symmetric matrix A, and that this density is defined and non-zero only when Ω is positive definite, and hence non-singular. This is the p.d.f. of a Wishart distribution for Ω. The Wishart is a multivariate extension of the gamma distribution, as the form of the p.d.f. intimates.

Some notation, comments and key properties are noted (see Lauritzen, 1996, Graphical Models (O.U.P.), Appendix C, for good and detailed development of many aspects of the theory of normal and Wishart distributions).

The standard notation is Ω ∼ W_p(d, A).

The distribution is defined and proper for all real-valued degrees of freedom d ≥ p, and for integer degrees of freedom 0 < d < p. In the latter case, the distribution is singular, with the density defined and positive only on a reduced space of matrices of rank d < p. See the discussion of singular cases in a subsection below.

A is the location matrix parameter of the distribution. E(Ω) = dA and E(Ω^{−1}) = A^{−1}/(d − p − 1) (the latter only defined when d > p + 1). The normalizing constant c is given by

    c^{−1} = |A|^{d/2} 2^{dp/2} π^{p(p−1)/4} ∏_{i=1}^p Γ((d + 1 − i)/2).

In the exponent of the p.d.f., tr(A^{−1}Ω) = tr(ΩA^{−1}).

The distribution is proper and defined via the p.d.f. if and only if the degrees of freedom is no less than the dimension, d ≥ p, but it then applies for any real value of d, not only integer values.

The eigendecomposition of Ω is Ω = ΓΛΓ' where Γ is the p × p orthogonal matrix whose columns are eigenvectors of Ω, and Λ = diag(λ_1, ..., λ_p) are the positive eigenvalues. If (a_1, ..., a_p) are the (also positive) eigenvalues of A, then

    p(Ω) ∝ { ∏_{i=1}^p λ_i^{(d−p−1)/2} a_i^{−d/2} } exp{−tr(A^{−1}Ω)/2}.
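The p.d.f. is straightforward to evaluate on the log scale; the following Matlab function is a minimal sketch transcribing the density and normalizing constant above (the function name is ours, and Cholesky-based log-determinants are used purely for numerical stability; save as wishart_logpdf.m).

    function lp = wishart_logpdf(Omega, d, A)
    % log W_p(d, A) density at a positive definite Omega, with d >= p
      p   = size(A, 1);
      ldA = 2 * sum(log(diag(chol(A))));       % log |A|
      ldO = 2 * sum(log(diag(chol(Omega))));   % log |Omega|
      lc  = -(d/2)*ldA - (d*p/2)*log(2) - (p*(p-1)/4)*log(pi) ...
            - sum(gammaln((d + 1 - (1:p)) / 2));   % log of constant c
      lp  = lc + ((d - p - 1)/2)*ldO - trace(A \ Omega)/2;
    end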


The Wishart distribution is a multivariate version of the gamma distribution. Further, marginal distributions of diagonal elements and block diagonal elements of Ω are also Wishart distributed. Specifically:

- If p = 1, write Ω = ω and a = A, both now scalars. The p.d.f. shows that ω ∼ Ga(d/2, 1/(2a)), or ω = aκ where κ ∼ χ²_d.

- Partition Ω as

      Ω = ( Ω_{1,1}   Ω_{1,2}
            Ω_{1,2}'  Ω_{2,2} )

  where Ω_{1,1} is q × q with q < p, Ω_{2,2} is (p − q) × (p − q) and Ω_{1,2} is q × (p − q). Partition A conformably, with elements A_{1,1}, A_{2,2} and A_{1,2}. Then

      Ω_{1,1} ∼ W_q(d, A_{1,1}) and Ω_{2,2} ∼ W_{p−q}(d, A_{2,2}).

- The diagonal elements have gamma marginal distributions: ω_{i,i} ∼ Ga(d/2, 1/(2a_{i,i})) where a_{i,i} is the ith diagonal element of A. That is, ω_{i,i} = a_{i,i} κ_i where κ_i ∼ χ²_d. (A quick Monte Carlo check follows this list.)
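A minimal Monte Carlo check of this marginal (all settings arbitrary), using the construction of Section 2.3 below: for integer d, Ω = Z'Z with the d rows of Z drawn iid from N(0, A) has the W_p(d, A) distribution.

    p = 3;  d = 8;  M = 20000;
    A = cov(randn(50, p));                 % an arbitrary positive definite A
    L = chol(A, 'lower');
    w11 = zeros(M, 1);
    for m = 1:M
      Z = (L * randn(p, d))';              % d x p, rows iid N(0, A)
      w11(m) = Z(:, 1)' * Z(:, 1);         % (1,1) element of Omega = Z'Z
    end
    % omega_{1,1} = a_{1,1}*kappa with kappa ~ chi^2_d:
    [mean(w11), d * A(1,1)]                % means agree
    [var(w11),  2 * d * A(1,1)^2]          % variances agree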

These are just a few key properties of the Wishart distribution, there being much more theory of relevance in multivariate analysis and also statistical modelling that relates to the joint and conditional distributions of matrix sub-elements of Ω. In particular, Bayesian analysis of Gaussian graphical models relies heavily on such structure both for graphical model development and for specification of prior distributions over graphical models (see Lauritzen, 1996, Graphical Models (O.U.P.), Appendix C, for a summary of key theoretical results).

    2.2 Inverse Wishart Distributions and Notations

If Ω ∼ W_p(d, A) then the random variance matrix Σ = Ω^{−1} has an inverse Wishart distribution, denoted by Σ ∼ IW_p(d, A).

The density is derived by direct transformation, using the Jacobian |∂Ω/∂Σ| = |Σ|^{−(p+1)}.

The IW p.d.f. is

    p(Σ) = c |Σ|^{−(d+p+1)/2} exp{−tr(Σ^{−1}A^{−1})/2}

with normalising constant c as given in the previous subsection.

An alternative notation sometimes used for Wishart and inverse Wishart distributions refers to f = d − p + 1 as the degree of freedom parameter, rather than d. Notice that f > 0 when d ≥ p, so this convention has any positive value for the degree of freedom in these regular cases.

In this notation the powers of |Ω| and |Σ| in their p.d.f.s are then (d − p − 1)/2 = f/2 − 1 and −(d + p + 1)/2 = −(p + f/2), respectively.

Note that, since the distribution exists and is very useful and used in multivariate analysis for integer d < p, this convention leads to f ≤ 0 in those singular cases.


    2.3 Wishart Sampling Distributions for Sample Variance Matrices

The Wishart distribution arises naturally as the sampling distribution of (up to a constant) sample variance matrices in multivariate normal populations, as follows:

- Suppose n observations x_i ∼ N(0, Σ) with x_i ⊥ x_j for i ≠ j, and

      S = Σ_{i=1}^n x_i x_i' = X'X

  where X is the n × p data matrix whose rows are the x_i'. The usual sample variance matrix is then Σ̂ = S/n. This is a sufficient statistic for Σ and the MLE of Σ. We have

      (S|Σ) ∼ W_p(n, Σ)

  with E(S|Σ) = nΣ, so that Σ̂ is an unbiased estimate of Σ.

- Suppose n observations x_i ∼ N(μ, Σ) with x_i ⊥ x_j for i ≠ j, and

      S = Σ_{i=1}^n (x_i − x̄)(x_i − x̄)' = X̃'X̃

  where X̃ is the n × p centered data matrix whose rows are the (x_i − x̄)'. The usual sample variance matrix is then Σ̂ = S/(n − 1) and we have S ⊥ x̄ with

      (S|Σ) ∼ W_p(n − 1, Σ),

  and now E(S|Σ) = (n − 1)Σ, so that Σ̂ is an unbiased estimate of Σ (a simulation check of these expectations follows this list).

- Notice that when n < p the sum of squares matrix S is singular of rank n < p. The Wishart distribution then has support that is the subspace of non-negative definite symmetric p × p matrices of rank n, rather than the full space. Otherwise S is non-singular (with probability one) and the Wishart distribution is regular.
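The unbiasedness claims are easy to verify by simulation; a Matlab sketch with arbitrary Σ and sizes:

    p = 2;  n = 6;  M = 20000;
    Sigma = [1 0.5; 0.5 2];
    L = chol(Sigma, 'lower');
    S1 = zeros(p);  S2 = zeros(p);
    for m = 1:M
      X  = (L * randn(p, n))';             % n x p, rows iid N(0, Sigma)
      S1 = S1 + X' * X;                    % uncentered sum of squares
      Xc = X - repmat(mean(X, 1), n, 1);   % centered rows
      S2 = S2 + Xc' * Xc;
    end
    (S1 / M) ./ (n * Sigma)                % elementwise ratios near 1
    (S2 / M) ./ ((n - 1) * Sigma)          % elementwise ratios near 1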

    2.4 Wishart Priors and Posteriors in Multivariate Normal Models: Known Mean

Consider a random sample x_{1:n} from the p-dimensional normal distribution with zero mean, (x_i|Σ) ∼ N(0, Σ), and set Ω = Σ^{−1} for the precision matrix, supposing Σ and Ω to be non-singular.

The likelihood function is

    p(x_{1:n}|Ω) ∝ |Ω|^{n/2} exp{−tr(ΩS)/2}

where

    S = Σ_{i=1}^n x_i x_i' = X'X

and X is the n × p data matrix. Note that the likelihood function has the mathematical form of the Wishart density function earlier introduced.

The standard reference prior is p(Ω) ∝ |Ω|^{−(p+1)/2} over the space of positive definite symmetric matrices. This leads to the standard reference posterior for a normal precision matrix

    p(Ω|x_{1:n}) ∝ |Ω|^{(n−p−1)/2} exp{−tr(ΩS)/2}


so that (Ω|x_{1:n}) ∼ W_p(n, S^{−1}). Also, Σ has an inverse Wishart posterior distribution, (Σ|x_{1:n}) ∼ IW_p(n, S^{−1}). Posterior expectations are

    E(Ω|x_{1:n}) = nS^{−1} = Σ̂^{−1}

and

    E(Σ|x_{1:n}) = E(Ω^{−1}|x_{1:n}) = S/(n − p − 1) = (n/(n − p − 1))Σ̂

if n > p + 1. The sample variance matrix Σ̂ is the harmonic posterior mean of Σ.

The Wishart is also the conjugate proper prior for normal precision matrices, and much use of this fact is made in Bayesian analysis of Gaussian graphical models as well as state space modelling for multivariate time series. In particular, with a prior Ω ∼ W_p(d_0, A_0) where A_0 = S_0^{−1} for some prior sum of squares matrix S_0 and prior sample size d_0, the posterior based on the above likelihood function is W_p(d_n, A_n) where d_n = d_0 + n and A_n = (S_0 + S)^{−1}.
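In code the conjugate update is simple bookkeeping; a minimal sketch with hypothetical data and prior settings:

    p = 3;  n = 20;
    X  = randn(n, p);                      % illustrative zero-mean data
    S  = X' * X;
    d0 = p + 2;  S0 = eye(p);              % weak hypothetical prior W_p(d0, inv(S0))
    dn = d0 + n;                           % posterior degrees of freedom
    Sn = S0 + S;                           % posterior A_n = inv(Sn)
    EOmega = dn * inv(Sn);                 % E(Omega | x_{1:n}) = dn * A_n
    ESigma = Sn / (dn - p - 1);            % E(Sigma | x_{1:n}), since dn > p + 1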

    2.5 Standard Analysis of Multivariate Normal Models: Reference Analysis

Now consider a random sample x_{1:n} from the p-dimensional normal distribution (x_i|μ, Σ) ∼ N(μ, Σ), with all parameters to be estimated.

Write x̄ = Σ_{i=1}^n x_i/n and S = Σ_{i=1}^n (x_i − x̄)(x_i − x̄)'. The standard reference prior is p(μ, Ω) = p(μ)p(Ω) ∝ |Ω|^{−(p+1)/2}. It is easily verified that the resulting posterior is p(μ, Ω|x_{1:n}) = p(μ|Ω, x_{1:n}) p(Ω|x_{1:n}) where:

- (μ|Ω, x_{1:n}) ∼ N(x̄, Ω^{−1}/n);
- (Ω|x_{1:n}) ∼ W_p(n − 1, S^{−1}), where now S is the centered sum of squares with each x_i replaced by x_i − x̄.

The details of this derivation are similar to those of the fully conjugate, proper prior analysis framework now discussed, so are left as an exercise; a direct simulation sketch follows.
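A minimal simulation sketch of this reference posterior (data, sizes and names hypothetical): Ω is drawn from W_p(n − 1, S^{−1}) as a sum of outer products, valid for integer degrees of freedom, and then μ given Ω.

    n = 30;  p = 3;  M = 5000;
    X = randn(n, p) + 1;                   % illustrative data with true mean 1
    xbar = mean(X, 1)';
    Xc = X - repmat(xbar', n, 1);
    S  = Xc' * Xc;
    L  = chol(inv(S), 'lower');            % inv(S) = L*L'
    mus = zeros(M, p);
    for m = 1:M
      Z = (L * randn(p, n - 1))';          % (n-1) x p, rows iid N(0, inv(S))
      Omega = Z' * Z;                      % draw from W_p(n-1, inv(S))
      C = chol(inv(Omega) / n, 'lower');   % conditional variance of mu
      mus(m, :) = (xbar + C * randn(p, 1))';
    end
    mean(mus)                              % posterior mean of mu, close to xbar'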

    2.6 Standard Analysis of Multivariate Normal Models: Full Conjugate Analysis

The main discussion here is of the full conjugate proper prior analysis. This is used a good deal in linear models, mixture modelling with multivariate normal mixtures, graphical models and elsewhere.

A member of the class of conjugate normal/Wishart priors has the form p(μ|Ω)p(Ω) where:

- (μ|Ω) ∼ N(m_0, t_0 Ω^{−1}) for some mean vector m_0 and scalar t_0 > 0;
- Ω ∼ W_p(d_0, A_0) where A_0 = S_0^{−1} for some prior sum of squares matrix S_0 and prior sample size d_0.

The full likelihood function p(x_{1:n}|μ, Ω) can be manipulated into the form

    p(x_{1:n}|μ, Ω) ∝ |Ω|^{n/2} exp{−tr(ΩS)/2} exp{−(x̄ − μ)'(nΩ)(x̄ − μ)/2}

where x̄ and S are the sample mean and centered sum of squares as in the previous subsection. This uses two standard mathematical tricks: the decomposition Σ_{i=1}^n (x_i − μ)(x_i − μ)' = S + n(x̄ − μ)(x̄ − μ)', and the identity z'Ωz = tr(Ωzz').


The Bartlett decomposition of the standard Wishart distribution W_p(d, I) provides an efficient direct simulation algorithm, as well as useful theory. If we can efficiently simulate the standard Wishart, then the last point above shows how we can use that to create samples from any Wishart distribution. The Bartlett decomposition, and hence construction, is as follows:

For fixed dimension p and integer d ≥ p, generate independent normal and chi-square random quantities to define the upper triangular matrix

    U = ( η_1  z_{1,2}  z_{1,3}  ...  z_{1,p}
          0    η_2      z_{2,3}  ...  z_{2,p}
          0    0        η_3      ...  z_{3,p}
          ...  ...      ...      ...  ...
          0    0        0        ...  η_p )

where the non-zero entries are independent random quantities with:

- diagonal elements η_i = √κ_i where κ_i ∼ χ²_{d−i+1} for i = 1, ..., p;
- upper off-diagonal elements z_{i,j} ∼ N(0, 1) for i = 1, ..., p and j = i + 1, ..., p.

Then (Odell and Feiveson, JASA 1968), the random matrix Ω = U'U ∼ W_p(d, I). Hence, if A = P'P for any non-singular p × p matrix P, we can sample from W_p(d, A) by generating U and computing Ω = (UP)'(UP).
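A direct Matlab transcription of this construction (the function name is ours; integer d ≥ p is assumed so the chi-squares can be built from sums of squared normals; save as wishart_bartlett.m):

    function Omega = wishart_bartlett(d, A)
    % one draw Omega ~ W_p(d, A) via the Bartlett decomposition
      p = size(A, 1);
      U = triu(randn(p), 1);               % z_{i,j} ~ N(0,1) for j > i
      for i = 1:p
        U(i, i) = sqrt(sum(randn(d - i + 1, 1).^2));  % eta_i = sqrt of chi^2_{d-i+1}
      end
      P  = chol(A);                        % A = P'*P with P upper triangular
      UP = U * P;
      Omega = UP' * UP;                    % (UP)'(UP) ~ W_p(d, A)
    end

Averaging many such draws recovers E(Ω) = dA, a useful sanity check.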

Some uses of simulation include the ease with which posterior inference on complicated functions of Ω can be derived. For example, inference may be desired for:

- Correlations: the correlation between elements i and j of x is σ_{i,j}/√(σ_{i,i} σ_{j,j}), where the terms are the relevant entries in Σ = Ω^{−1}.

- Complete conditional regression coefficients and covariance selection. Recall that if x = (x_1, ..., x_p)' has a zero mean normal distribution with precision matrix Ω, then

      (x_i | x_{1:p\i}, Ω) ∼ N(m_i(x_{1:p\i}), 1/ω_{i,i})

  where

      m_i(x_{1:p\i}) = Σ_{j=1:p\i} γ_{i,j} x_j  and  γ_{i,j} = −ω_{i,j}/ω_{i,i}.

This last example shows that the posterior for Ω in a data analysis therefore immediately provides direct inferences, via simulation of the elements of the implied terms, for the partial regression coefficients in each of the p implied linear regressions; a simulation sketch follows below. This assumes, of course, a full model in the sense that each x_j has, with probability one, a non-zero coefficient in each regression. The study of covariance selection and Gaussian graphical models focuses on questions of just what variables are relevant as predictors in each of these p conditional distributions.
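A sketch of such simulation-based inference, reusing the hypothetical wishart_bartlett function above with an arbitrary (d, A) standing in for a posterior: each draw of Ω is transformed into the (1,2) correlation and the coefficient γ_{1,2}.

    p = 3;  d = 15;  M = 5000;
    A = inv(eye(p) + 0.5 * ones(p));       % an arbitrary location matrix
    rho = zeros(M, 1);  gam = zeros(M, 1);
    for m = 1:M
      Omega = wishart_bartlett(d, A);      % one (illustrative) posterior draw
      Sig = inv(Omega);
      rho(m) = Sig(1,2) / sqrt(Sig(1,1) * Sig(2,2));  % corr(x_1, x_2)
      gam(m) = -Omega(1,2) / Omega(1,1);   % coefficient of x_2 in E(x_1 | .)
    end
    qs = sort(rho);
    [mean(rho), qs(round(0.05*M)), qs(round(0.95*M))]  % mean and 90% interval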

    2.8 Reduced Rank Cases - Singular Wishart Distributions

Sometimes we are directly interested in singular (reduced rank, or rank deficient) variance matrices, and in cases that arise directly from location matrices A of reduced rank. For example, in the normal sampling model, suppose that X is rank deficient due to collinearities among the variables, so that S is singular. More often, A may be close to singular, in which case the modified method below is also the numerically stable choice.


The real utility arises in problems in which p > n, so that the rank of S is usually n, or maybe less than n, and certainly lower than p due to dimensionality.

The general framework of possibly reduced rank distributions also includes the regular Wishart as a special case.

Suppose that A has rank r ≤ p with eigendecomposition A = E B E', where E is p × r, E'E = I and B = diag(b_1, ..., b_r) with each b_i > 0. This allows A to be rank deficient.

The generalized inverse of A is A^− = E B^{−1} E'. Suppose Ω = PΨP' where P = E B^{1/2} and where Ψ ∼ W_r(n, I). Then Ω is rank deficient, and so singular, when r < p. In those cases, Ω has the singular Wishart distribution.

The p.d.f. is p(Ω) ∝ { ∏_{i=1}^r λ_i^{(n−r−1)/2} } exp{−tr(A^−Ω)/2}, where (λ_1, ..., λ_r) are the r positive eigenvalues of Ω.

Simulation is still direct: simulate a regular, non-singular Wishart Ψ ∼ W_r(n, I) and transform to the rank deficient Ω = PΨP'.

For the reference analysis of the normal variance/precision model, a singular sample variance matrix (arising, as indicated by example, in cases of p > n) leads to A^− = S. With S = X'X = E(nD)E' as earlier explored, this implies A = E B E' as above, where now B = (nD)^{−1}.
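A Matlab sketch putting these pieces together for the p > n reference case (all settings hypothetical): the rank-r eigenstructure of S gives E and B = D_s^{−1} = (nD)^{−1}, and a singular Wishart draw follows by transforming Ψ ∼ W_r(n, I).

    n = 5;  p = 8;
    X = randn(n, p);                       % p > n, so S = X'X is singular
    S = X' * X;
    [E, D] = eig((S + S') / 2);
    [ds, ix] = sort(diag(D), 'descend');
    r = sum(ds > 1e-10 * max(ds));         % numerical rank (here r = n)
    E = E(:, ix(1:r));
    B = diag(1 ./ ds(1:r));                % B = (nD)^{-1}, since ds = n*d_i
    P = E * sqrt(B);                       % p x r
    Z = randn(r, n);                       % Z*Z' ~ W_r(n, I)
    Omega = P * (Z * Z') * P';             % one rank-r singular Wishart draw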
