
9-10 September

Latent variable models: a review of estimation methods

Irini Moustaki
London School of Economics

Conference to honor the scientific contributions of Professor Michael Browne

Ohio State University


Outline

• Modeling approaches for ordinal variables

• Review methods of estimation for latent variable models

• Look at a special case: longitudinal data

• Goodness-of-fit tests and model selection criteria

• Applications


Notation and general aim

Manifest variables are denoted by: x1, x2, . . . , xp.

Latent variables are denoted by: z1, z2, . . . , zq.

One wants to find a set of latent variables z_1, …, z_q, fewer in number than the observed variables (q < p), that contain essentially the same information.


Estimation Methods

• Full Maximum Likelihood estimation (Bock and Aitkin, 1981; Bartholomew and Knott, 1997)

– E-M algorithm
– Newton-Raphson maximization

• Limited information methods or composite likelihood: lower-order margins

• Markov Chain Monte Carlo methods (Albert, 1992; Albert and Chib, 1993; Baker, 1998; Patz and Junker, 1999a,b; Dunson, 2000, 2003; Fox and Glas, 2001; Shi and Lee, 1998; Lee and Song, 2003; Lee, 2007)


Full Maximum Likelihood

Since all variables are random and only x is observed:

f(x) = ∫ ⋯ ∫ g(x | z) h(z) dz

where

g(x | z) = ∏_{i=1}^{p} g(x_i | z)

and z_1, …, z_q are independent N(0, 1) variables. For a random sample of size n the log-likelihood is written as:

L = ∑_{h=1}^{n} log f(x_h)
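To make the integration concrete, here is a minimal Python sketch, not code from the talk, of evaluating this log-likelihood by Gauss-Hermite quadrature for a one-factor (q = 1) model with binary items and a two-parameter logistic g(x_i | z); the names log_likelihood, alpha0, alpha1 and the data layout X are assumptions of the example:

import numpy as np
from scipy.special import expit  # logistic cdf

def log_likelihood(alpha0, alpha1, X, n_quad=21):
    # Marginal log-likelihood of a one-factor two-parameter logistic model;
    # the single integral over z is approximated by Gauss-Hermite quadrature.
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    z = np.sqrt(2.0) * nodes            # abscissae rescaled for N(0, 1)
    w = weights / np.sqrt(np.pi)        # rescaled weights; they sum to 1
    # P(x_i = 1 | z) for each item i at each quadrature point
    p = expit(alpha0[:, None] + alpha1[:, None] * z[None, :])
    # g(x | z) = prod_i g(x_i | z), for each response pattern and each node
    gxz = np.prod(np.where(X[:, :, None] == 1, p[None], 1.0 - p[None]), axis=1)
    f = gxz @ w                         # f(x_h) = sum_k w_k g(x_h | z_k)
    return np.log(f).sum()              # L = sum_h log f(x_h)

# toy usage: 5 respondents, 3 binary items
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5, 3))
print(log_likelihood(np.zeros(3), np.ones(3), X))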


Notes

• q-dimensional integration (Laplace approximation, quadrature points, adaptive quadrature, Monte Carlo)

• Maximization (E-M, Newton-Raphson)


Composite likelihoods: a categorization based on Varin and Vidoni (2005)

• Omission methods: remove terms that make the evaluation of the full likelihood complicated (Besag, 1974; Azzalini, 1983).

• Subsetting methods: likelihoods composed of univariate, bivariate, trivariate, … margins (Cox and Reid, 2004).

• They all fall within the context of pseudo-likelihood methods (Godambe, 1960) or misspecified models (White, 1982).


Composite likelihoods based on marginal densities

• Key idea: the model still holds marginally for a specific set of variables.

• Use only information from the univariate and/or bivariate margins.

• E.g., maximize the sum of all the log bivariate likelihoods.

• The resulting estimator is consistent and asymptotically normal for large n.

• Aim: decrease the number of integrals required without losing too much precision.


Bayesian framework estimation

Let us denote by v′ = (z′, α′) the vector of all unknown parameters.

The joint posterior distribution of the parameter vector v is:

h(v | x) = g(x | v) ψ(v) / ∫ ⋯ ∫ g(x | v) ψ(v) dv ∝ g(x | v) ψ(v),  (1)

which is the likelihood of the data multiplied by the prior distribution ψ(v) of all model parameters (including the latent variables) and divided by a normalizing constant.

Calculating the normalizing constant requires multi-dimensional integration that can be computationally very heavy, if not infeasible.


The main steps of the Bayesian approach are:

1. Inference is based on h(v | x).

2. The mean, mode or any other percentile vector and the standard deviation of h(v | x) can be used as an estimator of v and its corresponding standard error.

3. Analytic evaluation of the above quantities is generally impossible. Alternatives include numerical evaluation, analytic approximations and Monte Carlo integration.

4. MCMC methods are used for sampling from the posterior distribution (Metropolis-Hastings algorithm, Gibbs sampling).


General Framework for ordinal variables

Let x_1, x_2, …, x_p denote the observed ordinal variables and let m_i denote the number of response categories of variable i. We write x_i = s to mean that x_i belongs to the ordered category s, s = 1, 2, …, m_i. There are ∏_{i=1}^{p} m_i possible response patterns.

Let x_r = (x_1 = s_1, x_2 = s_2, …, x_p = s_p) represent any one of these.

The model specifies the probability π_r = π_r(θ) > 0 of x_r, with ∑_r π_r = 1 for all θ, where θ is the vector of all model parameters.

The different approaches differ in the way π_r(θ) is specified and in the way the model is estimated.


Modelling approaches

Generalized latent variable framework (Samejima, 1969; Moustaki, 2003)

To take into account the ordinality of the items we model the cumulative probabilities:

γ_{i,s}(z) = P(x_i ≤ s | z) = π_{i1}(z) + π_{i2}(z) + ⋯ + π_{is}(z)

The response category probabilities are denoted by:

π_{i,s}(z) = γ_{i,s}(z) − γ_{i,s−1}(z),  s = 1, …, m_i

Proportional odds model:

ln[γ_{i,s}(z) / (1 − γ_{i,s}(z))] = α_{is} − ∑_{j=1}^{q} α_{ij} z_j,  i = 1, …, p


with α_{i1} < α_{i2} < ⋯ < α_{i,m_i−1} < α_{i,m_i} = ∞.
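As an illustration, a small sketch, with assumed argument names rather than code from the talk, of turning the cumulative proportional-odds model into the category probabilities π_{i,s}(z) for a single item i:

import numpy as np
from scipy.special import expit

def category_probs(alpha_s, alpha, z):
    # alpha_s: thresholds alpha_{i1} < ... < alpha_{i,m-1} (alpha_{i,m} = inf);
    # alpha: loadings alpha_{ij}; z: latent variable values (length q)
    gamma = expit(np.append(alpha_s, np.inf) - alpha @ z)  # gamma_{i,s}(z), gamma_{i,m} = 1
    return np.diff(gamma, prepend=0.0)   # pi_{i,s} = gamma_{i,s} - gamma_{i,s-1}

# four categories (3 finite thresholds), q = 2 factors; probabilities sum to 1
print(category_probs(np.array([-1.0, 0.0, 1.5]), np.array([0.8, 0.4]),
                     np.array([0.3, -0.2])))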

Let x_r = (x_1 = s_1, x_2 = s_2, …, x_p = s_p) represent a full response pattern. Under the assumption of conditional independence, the conditional probability, for given z, of the response pattern x_r is

g(x_r | z) = π_r(z) = ∏_{i=1}^{p} π_{i,s}(z) = ∏_{i=1}^{p} [γ_{i,s}(z) − γ_{i,s−1}(z)].  (2)

The unconditional probability π_r of the response pattern x_r is obtained by integrating π_r(z) over the q-dimensional factor space:

f(x_r) = π_r = ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} g(x_r | z) h(z) dz,  (3)

where h(z) is the density function of z.


Full Information ML (FIML)

ln L = ∑_r n_r ln π_r = N ∑_r p_r ln π_r,  (4)

where n_r is the frequency of response pattern r, N = ∑_r n_r is the sample size and p_r = n_r/N is the sample proportion of response pattern r.

• Maximization can be done with E-M or N-R.

• Full ML is computationally intensive for a large number of factors.

• Adaptive methods should be used.

• Pairwise likelihood does not have an advantage here.


MCMC estimation

Gibbs sampling (Geman and Geman, 1984; Gelfand and Smith, 1990) is a way of sampling from complex joint posterior distributions.

• Partition the vector of unknown parameters v into two components, v_1 = z and v_2 = α.

• Simulate from h(z^1 | α^0, x) and h(α^1 | z^1, x).

• The Gibbs sampler eventually produces a sequence of iterates v^0, v^1, … that forms a Markov chain converging to the desired posterior distribution. It can be summarized in the following two steps:


1. Start with initial guesses z^0, α^0.

2. Then simulate in the following order:
   • Draw z^1 from h(z | α^0, x).
   • Draw α^1 from h(α | z^1, x).

The conditional distributions are:

h(z | α, x) = g(x | z, α) h(z, α) / ∫ g(x | z, α) h(z, α) dz  (5)

h(α | z, x) = g(x | z, α) h(z, α) / ∫ g(x | z, α) h(z, α) dα  (6)
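The two steps can be sketched as follows; this is illustrative Python, where logpost_z and logpost_a are user-supplied unnormalized log versions of the conditionals (5)-(6) (an M-H update is used inside each step precisely because the normalizing constants are unavailable):

import numpy as np

def mh_step(current, logpost, step, rng):
    # one random-walk Metropolis update of a parameter block
    proposal = current + step * rng.standard_normal(current.shape)
    accept = np.log(rng.uniform()) < logpost(proposal) - logpost(current)
    return proposal if accept else current

def gibbs_sampler(logpost_z, logpost_a, z0, a0, n_iter=5000, step=0.1, seed=0):
    # alternate draws from h(z | alpha, x) and h(alpha | z, x)
    rng = np.random.default_rng(seed)
    z, a = np.asarray(z0, float), np.asarray(a0, float)  # initial guesses z^0, alpha^0
    chain = []
    for _ in range(n_iter):
        z = mh_step(z, lambda v: logpost_z(v, a), step, rng)  # z^{k+1} | alpha^k, x
        a = mh_step(a, lambda v: logpost_a(v, z), step, rng)  # alpha^{k+1} | z^{k+1}, x
        chain.append((z.copy(), a.copy()))
    return chain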


• When the normalizing constant is not in closed form, an M-H algorithm within Gibbs is needed.

• Computationally heavy.

• Convergence criteria should be used.

• Model selection criteria such as the Bayes factor require the calculation of the normalizing constant.

• MCMC methods give similar results to FIML (Kim, 2001; Wollack et al., 2002; Moustaki and Knott, 2005).


Example: compare E-M and MCMC

• Social Life Feelings scale (Bartholomew and Knott, 2007)

• MCMC solution obtained with BUGS

• A two-parameter model is fitted

              E-M                          MCMC
Item     α_{i0}         α_{i1}        α_{i0}         α_{i1}
1        -2.35 (0.13)   1.20 (0.14)   -2.37 (0.14)   1.21 (0.15)
2         0.80 (0.06)   0.71 (0.09)    0.79 (0.06)   0.70 (0.09)
3         0.99 (0.09)   1.53 (0.17)    0.99 (0.09)   1.52 (0.17)
4        -0.67 (0.13)   2.55 (0.41)   -0.74 (0.20)   2.86 (0.81)
5        -1.10 (0.07)   0.92 (0.10)   -1.10 (0.07)   0.92 (0.11)


The underlying variable approach assumes that the observed categorical variables are generated by a set of underlying unobserved continuous variables (Muthen, 1984; Joreskog, 1994; Lee, Poon and Bentler, 1990, 1992).

This approach employs the classical factor analysis model:

x*_i = λ_{i1} z_1 + λ_{i2} z_2 + ⋯ + λ_{iq} z_q + u_i,  i = 1, 2, …, p,  (7)

where x*_i is linked to the observed ordinal variable x_i through the thresholds τ^{(i)}_a:

x_i = a ⟺ τ^{(i)}_{a−1} < x*_i ≤ τ^{(i)}_a,  a = 1, 2, …, m_i,  (8)

with τ^{(i)}_0 = −∞, τ^{(i)}_1 < τ^{(i)}_2 < ⋯ < τ^{(i)}_{m_i−1}, τ^{(i)}_{m_i} = +∞.

• z_1, …, z_q, u_1, …, u_p are independent and normally distributed with z_j ∼ N(0, 1) and u_i ∼ N(0, ψ_i²).


• ψ_i² = 1 − ∑_{j=1}^{q} λ_{ij}².

• x*_1, …, x*_p are standardized multivariate normal with correlation matrix P = (ρ_{ij}), where ρ_{ij} = ∑_{l=1}^{q} λ_{il} λ_{jl}.

It follows that the probability π_r(θ) of a general p-dimensional response pattern is

π_r(θ) = Pr(x_1 = s_1, x_2 = s_2, …, x_p = s_p)
       = ∫_{τ^{(1)}_{s_1−1}}^{τ^{(1)}_{s_1}} ∫_{τ^{(2)}_{s_2−1}}^{τ^{(2)}_{s_2}} ⋯ ∫_{τ^{(p)}_{s_p−1}}^{τ^{(p)}_{s_p}} φ_p(u_1, u_2, …, u_p | P) du_1 du_2 ⋯ du_p,  (9)

where the integral in (9) is over the p-dimensional normal density function.


Estimation methods within the UVA

1. Full ML:

ln L = ∑_r n_r ln π_r = N ∑_r p_r ln π_r.  (10)

If there is no model, so that the π_r are unconstrained, the maximum of ln L is

ln L_1 = ∑_r n_r ln p_r = N ∑_r p_r ln p_r.

Instead of maximizing (10) it is convenient to minimize the fit function

F(θ) = ∑_r p_r [ln p_r − ln π_r(θ)] = ∑_r p_r ln[p_r / π_r(θ)].  (11)


2. Three-stage procedures leading to limited information methods.

3. Composite likelihood methods: use information from the univariate and bivariate margins to estimate the model (Joreskog and Moustaki, 2001).

From the multivariate normality of x*_1, …, x*_p, it follows that

π^{(g)}_a(θ) = ∫_{τ^{(g)}_{a−1}}^{τ^{(g)}_a} φ(u) du,  (12)

π^{(gh)}_{ab}(θ) = ∫_{τ^{(g)}_{a−1}}^{τ^{(g)}_a} ∫_{τ^{(h)}_{b−1}}^{τ^{(h)}_b} φ_2(u, v | ρ_{gh}) du dv,  (13)

where φ_2(u, v | ρ) is the density function of the standardized bivariate normal distribution with correlation ρ.
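A brief numerical sketch of (12) and (13), assuming scipy is available; clipping infinite thresholds to ±8 is a numerical convenience, since Φ and Φ_2 are effectively 0 or 1 beyond that range:

import numpy as np
from scipy.stats import norm, multivariate_normal

def uni_prob(tau, a):
    # pi_a^(g) of eq. (12): tau holds one variable's thresholds,
    # padded with -inf and +inf; categories are a = 1, ..., m
    return norm.cdf(tau[a]) - norm.cdf(tau[a - 1])

def biv_prob(tau_g, tau_h, a, b, rho):
    # pi_ab^(gh) of eq. (13): a bivariate normal rectangle probability,
    # obtained from the joint cdf Phi_2 by inclusion-exclusion
    F = lambda u, v: multivariate_normal.cdf(
        np.clip([u, v], -8.0, 8.0), mean=[0.0, 0.0],
        cov=[[1.0, rho], [rho, 1.0]])
    return (F(tau_g[a], tau_h[b]) - F(tau_g[a - 1], tau_h[b])
            - F(tau_g[a], tau_h[b - 1]) + F(tau_g[a - 1], tau_h[b - 1]))

tau = np.array([-np.inf, -0.5, 0.7, np.inf])   # m = 3 categories
print(uni_prob(tau, 2), biv_prob(tau, tau, 2, 2, 0.4))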


The method minimizes the sum of all univariate and bivariate fit functions (a sketch of the computation follows below):

F(θ) = ∑_{g=1}^{p} ∑_{a=1}^{m_g} p^{(g)}_a ln[p^{(g)}_a / π^{(g)}_a(θ)] + ∑_{g=2}^{p} ∑_{h=1}^{g−1} ∑_{a=1}^{m_g} ∑_{b=1}^{m_h} p^{(gh)}_{ab} ln[p^{(gh)}_{ab} / π^{(gh)}_{ab}(θ)],

• Only data in the univariate and bivariate margins are used.

• This approach is quite feasible: it can handle a large number of variables as well as a large number of factors.

• The asymptotic covariance matrix is often unstable in small samples, particularly if there are zero or small frequencies in the bivariate margins.

• The bivariate approach estimates the thresholds and the factor loadings in one single step from the univariate and bivariate margins, without the use of a weight matrix.
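Given marginal probabilities such as those in the previous sketch, together with the corresponding sample proportions, F(θ) is a plain sum of discrepancy terms; a minimal sketch (empty cells with p = 0 are skipped, since they contribute nothing):

import numpy as np

def fit_part(p, pi):
    # sum of p * ln(p / pi) over one set of margins; p are sample
    # proportions, pi the model probabilities for the same cells
    keep = p > 0
    return np.sum(p[keep] * np.log(p[keep] / pi[keep]))

# F(theta) = univariate part + bivariate part, each flattened over all
# margins and categories (arrays assumed built elsewhere):
# F = fit_part(p_uni, pi_uni) + fit_part(p_biv, pi_biv)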


Simulation study: small population

• p = 4 items satisfying a one-factor model.

• m_i = 5, i = 1, …, p.

• There are Q = 625 possible response patterns.

• The model has 20 parameters.

• The sample size is N = 800.

• In the probit-generated data there are 163 distinct response patterns in the sample (NOR).

• In the logit-generated data there are 305 distinct response patterns in the sample (POM).


• Logit gives more probability to a wider set of response patterns, whereas probit concentrates probability on a smaller set of response patterns.

Table 1: Small Population: Estimated Factor Loadings

        NOR-Generated Data        POM-Generated Data
True    UBN    NOR    POM         UBN    NOR    POM
.90     .90    .85    .90         .81    .72    .88
.80     .82    .82    .88         .61    .65    .82
.70     .71    .72    .89         .46    .48    .69
.60     .57    .58    .77         .31    .32    .51


Multivariate longitudinal ordinal data

• Study the evolution over time of traits, attitudes, ability, etc.

• Characteristics of longitudinal data:

– Extra dependencies to account for (within time / between time).
– Factor changes over time are modelled through a multivariate normal distribution. The covariance matrix imposed on the time-related latent variables accounts for the dependence of the item responses across time.
– A GLLVM that accounts for dependencies within time and across time, using time-specific factors and a heteroscedastic item-dependent random effect term (Dunson, 2003, JASA; Cagnone, Moustaki and Vasdekis, 2009, BJMSP).
– An autoregressive model is used for the structural part of the model.


• Within the Structural Equation Modelling (SEM) framework, longitudinal data are treated in different ways:

– Individual response curves over time by means of latent variable growth models. The main feature of this class of models is that the parameters of the curve, random intercept and random slope, can be viewed as latent variables.
– Allow for correlated errors.


Notation for longitudinal data

Let x_{1t}, x_{2t}, …, x_{pt} be the ordinal observed variables measured at time t (t = 1, …, T).

The m_i ordered categories of the i-th item measured at time t have response category probabilities π_{it(1)}(z_t, u_i), π_{it(2)}(z_t, u_i), …, π_{it(m_i)}(z_t, u_i).

• The time-dependent latent variable z_t accounts for the dependencies among the p items measured at time t.

• The random effect u_i accounts for the dependencies among the item i measured at different time points.


Model specification and assumptions

The systematic component:

η_{it(s)} = τ_{it(s)} − α_{it} z_t + u_i,  i = 1, …, p; s = 1, …, m_i; t = 1, …, T.  (14)

The link between the systematic component and the conditional means of the random component distributions is

η_{it(s)} = v_{it(s)}(γ_{it(s)})

where γ_{it(s)} = P(x_{it} ≤ s | z_t, u_i) and v_{it(s)}(·) is the link function.


• The associations among the items measured at time t are explained by the latent variable z_t:

Cov(η_{it}, η_{i′t}) = α_{it} α_{i′t} Var(z_t),  i ≠ i′  (15)

with Var(z_t) = φ^{2(t−1)} σ_1² + I(t ≥ 2) ∑_{k=1}^{t−1} φ^{2(k−1)}.

• The associations among the same item measured across time (x_{i1}, …, x_{iT}) are explained by the item-specific random effect u_i:

Cov(η_{it}, η_{it′}) = α_{it} α_{it′} Cov(z_t, z_{t′}) + σ_{u_i}²,  t < t′  (16)

with Cov(z_t, z_{t′}) = φ^{t+t′−2} σ_1² + I(t ≥ 2) ∑_{k=0}^{t−2} φ^{t′−t+2k}, where I(·) is the indicator function.


• The within-individual correlations are accounted for through modelling the covariance between the latent variables z_t and z_{t′}:

z_t = φ z_{t−1} + δ_t  (17)

with δ_t ∼ N(0, 1) and z_1 ∼ N(0, 1).

• u_i ∼ N(0, σ_u²)

Cov(η_{it}, η_{i′t′}) = α_{it} α_{i′t′} Cov(z_t, z_{t′}),  i ≠ i′, t < t′  (18)
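A quick simulation sketch of this structural part; the function name and parameter values are illustrative only:

import numpy as np

def simulate_latents(phi, sigma_u, n, T, p, seed=0):
    # z_t = phi * z_{t-1} + delta_t, delta_t ~ N(0, 1), z_1 ~ N(0, 1);
    # u_i ~ N(0, sigma_u^2) is shared by item i across all time points
    rng = np.random.default_rng(seed)
    z = np.empty((n, T))
    z[:, 0] = rng.standard_normal(n)
    for t in range(1, T):
        z[:, t] = phi * z[:, t - 1] + rng.standard_normal(n)
    u = sigma_u * rng.standard_normal((n, p))
    return z, u

z, u = simulate_latents(phi=0.8, sigma_u=np.sqrt(2.0), n=500, T=3, p=3)
print(np.corrcoef(z[:, 0], z[:, 2])[0, 1])   # correlation decays with |t - t'|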


Full ML Estimation with the EM-algorithm

For a random sample of size n the complete log-likelihood is written as:

L = ∑_{m=1}^{n} log f(x_m, z_m, u_m) = ∑_{m=1}^{n} [log g(x_m | z_m, u_m) + log h(z_m, u_m)]  (19)

g(x_m | z_m, u_m) = ∏_{t=1}^{T} ∏_{i=1}^{p} g(x_{mit} | z_{mt}, u_{mi})  (20)


Let us define ζ_m = (z_m, u_m) and write h(z_m, u_m) = h(ζ_m; φ, σ_u²). Assuming that their common distribution is multivariate normal, their log-likelihood is, apart from a constant,

log h(ζ_m; φ, σ_u²) = −(1/2) ln |Φ| − (1/2) ζ_m′ Φ^{−1} ζ_m  (21)

Φ = [ Γ  0
      0  Ω ]  (22)

where Ω = diag(σ_{u_1}², …, σ_{u_p}²) and the elements of Γ are such that its inverse has


a well-known special (tridiagonal) pattern:

Γ^{−1} =
[ 1/σ_1² + φ²   −φ        0         …   0         0         0
  −φ            1 + φ²    −φ        …   0         0         0
  0             −φ        1 + φ²    …   0         0         0
  ⋮             ⋮         ⋮         ⋱   ⋮         ⋮         ⋮
  0             0         0         …   −φ        1 + φ²    −φ
  0             0         0         …   0         −φ        1 ]
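The band pattern is easy to build directly; a numpy sketch (inverting the result recovers Γ itself):

import numpy as np

def gamma_inverse(phi, sigma2_1, T):
    # T x T tridiagonal inverse of Gamma for the AR(1) latent process:
    # 1/sigma_1^2 + phi^2 in the (1,1) cell, 1 + phi^2 on the interior
    # diagonal, 1 in the (T,T) cell, and -phi on the off-diagonals
    G = np.diag(np.full(T, 1.0 + phi**2))
    G[0, 0] = 1.0 / sigma2_1 + phi**2
    G[-1, -1] = 1.0
    i = np.arange(T - 1)
    G[i, i + 1] = G[i + 1, i] = -phi
    return G

print(np.linalg.inv(gamma_inverse(phi=0.8, sigma2_1=1.0, T=4)))  # = Gamma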


EM-algorithm

• E-step: the expected score function of the model parameters is computed. The expectation is with respect to the posterior distribution of (z_m, u_m) given the observations, h(z_m, u_m | x_m), for each individual.

• M-step: updated parameter estimates are obtained.

• Full ML is very computationally intensive as the number of items and factors increases.


Composite likelihood estimation

Vasdekis, Cagnone, Moustaki (working paper)

• The model specification remains the same.

• S_1 = {(i, j, t) : 1 ≤ i < j ≤ p; t = 1, …, T}

• S_2 = {(i, i′, t, t′) : 1 ≤ i ≤ p; 1 ≤ i′ ≤ p; 1 ≤ t < t′ ≤ T}


• π^{(a_it, a_i′t′)}_{it, i′t′, z_t, z_t′, u_i, u_i′}(θ): the joint density of (x_it, x_i′t′, z_t, z_t′, u_i, u_i′).

• π^{(a_it, a_i′t′)}_{it, i′t′}(θ) = P(x_it = a_it, x_i′t′ = a_i′t′; θ): the marginal probability density of any pair of observations.

• The bivariate probability for a pair of responses (x_it, x_i′t) is:

π^{(a_it, a_i′t)}_{it, i′t}(θ) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} π^{(a_it, a_i′t)}_{it, i′t, z_t, u_i, u_i′}(θ) dz_t du_i du_i′

• and for a pair of responses (x_it, x_i′t′):

π^{(a_it, a_i′t′)}_{it, i′t′}(θ) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} π^{(a_it)}_{it}(z_t, u_i) π^{(a_i′t′)}_{i′t′}(z_t′, u_i′) h(z_t, z_t′, u_i, u_i′) dz_t dz_t′ du_i du_i′


• The latter can be three-dimensional if it happens that i = i′. In any case, the maximum number of integrations we have to deal with is four.

The contribution of any given individual to the log pairwise likelihood is therefore (a sketch follows below):

pl(θ; y) = ∑_{S_1} log π^{(a_it, a_i′t)}_{it, i′t}(θ) + ∑_{S_2} log π^{(a_it, a_i′t′)}_{it, i′t′}(θ)  (23)

Since each component of equation (23) is a likelihood object, we can claim that ∇pl(θ) = 0 is an unbiased estimating equation under the usual regularity conditions. The maximum pairwise likelihood estimator θ_PL is consistent and asymptotically normally distributed (Arnold and Strauss, 1991) under regularity conditions found in Lehmann (1983), Crowder (1986) or Geys et al. (1997).

The maximization of the likelihood in (23) is done with the E-M algorithm.
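As a sketch, (23) is a direct double loop over the index sets S_1 and S_2; pair_prob is an assumed callable returning the bivariate probabilities defined above (e.g. by numerical integration):

import numpy as np
from itertools import combinations

def pl_contribution(x, theta, pair_prob):
    # x[t, i]: observed category of item i at time t for one individual;
    # pair_prob returns the bivariate probability of a pair of responses
    T, p = x.shape
    ll = 0.0
    for t in range(T):                        # S1: item pairs within time t
        for i, j in combinations(range(p), 2):
            ll += np.log(pair_prob(theta, (i, t), (j, t), x[t, i], x[t, j]))
    for t, t2 in combinations(range(T), 2):   # S2: any item pair, t < t'
        for i in range(p):
            for j in range(p):
                ll += np.log(pair_prob(theta, (i, t), (j, t2), x[t, i], x[t2, j]))
    return ll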


Standard Errors

For the pairwise likelihood defined in (23), the maximum pairwise likelihood estimator θ_PL = (τ_PL, ψ_PL) converges in probability to θ_0 and

θ_PL → N_q(θ_0, J^{−1}(θ_0) H(θ_0) J^{−1}(θ_0))

where J(θ) = E_y[∇² pl(θ; y)] and H(θ) = var(∇pl(θ; y)).

Heagerty and Lele (1998) and Varin and Vidoni (2005) give expressions for the estimation of J(θ) and H(θ).
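A sketch of the resulting sandwich (Godambe) covariance, assuming per-individual score vectors and a Hessian estimate of pl at θ_PL are available; sign and scaling conventions vary, see the references above:

import numpy as np

def sandwich_cov(scores, hessian):
    # J^{-1} H J^{-1}: scores is an (n, q) array of per-individual
    # gradients of pl at theta-hat (their outer products estimate H);
    # hessian is the q x q second-derivative matrix estimating J
    H = scores.T @ scores
    J_inv = np.linalg.inv(hessian)
    return J_inv @ H @ J_inv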


Model selection criteria and Goodness-of-fit tests

• Varin and Vidoni (2005) also propose a composite likelihood information criterion based on the expected Kullback-Leibler information between the true density of the data and the density estimated via pairwise likelihood:

pl(θ_PL; y) + tr(J H^{−1})  (24)

• Limited information goodness-of-fit tests have been developed by Reiser (1996) and Maydeu-Olivares and Joe (2005, 2006):

X²_e = e′ Σ⁺_e e  (25)

where Σ⁺_e is the Moore-Penrose inverse of Σ_e. X²_e has an asymptotic χ² distribution with degrees of freedom given by the rank of Σ_e.
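Given the residual vector e and an estimate of its covariance matrix, (25) is essentially one line with the Moore-Penrose inverse; a sketch:

import numpy as np
from scipy.stats import chi2

def limited_info_test(e, Sigma_e):
    # X_e^2 = e' Sigma_e^+ e of eq. (25), with degrees of freedom
    # equal to the rank of Sigma_e
    stat = float(e @ np.linalg.pinv(Sigma_e) @ e)
    df = np.linalg.matrix_rank(Sigma_e)
    return stat, df, chi2.sf(stat, df)   # statistic, df, p-value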


Application: British Household Panel Survey

BHPS: a study of social and economic change at the individual and household level in Britain.

The original data set consists of 6 ordinal items on social and political attitudes measured at 5 different waves (1992, 1994, 1996, 1998, 2000). The sample size was 3784. A random sample of 500 is considered.

The following three items for three waves (1994 = 1, 1996 = 2, 1998 = 3) are used:

• Private enterprise solves economic problems [Enterp]

• Govt. has obligation to provide jobs [Gov]

• Strong trade unions protect employees [Trunion]


• The response alternatives are: strongly agree, agree, neither agree nor disagree, disagree, strongly disagree.

• We collapsed the first two categories and the last two categories.

• Full ML and pairwise methods are used to estimate the model.

• Five quadrature points are used.


Results

Model 1: No equality constraints
Model 2: Threshold equality constraints
Model 3: Loading equality constraints
Model 4: Threshold and loading equality constraints

Model selection criteria

Models     AIC (FIML)   CLIC (Pairwise)
Model 1    8123.694     73114.35
Model 2    8154.829     73322.34
Model 3    8123.385     73113.41
Model 4    8153.579     73321.95


Results

Parameter estimates with standard errors in brackets, BHPS

           FIML                          Pairwise
Items      α_i           σ²_{u_i}        α_i           σ²_{u_i}
Enterp     1.00          1.71 (0.73)     1.00          2.46 (0.52)
Govern     0.86 (0.12)   5.09 (3.41)     1.05 (0.09)   5.54 (1.98)
TrUnion    1.01 (0.18)   5.45 (3.91)     1.28 (0.15)   5.63 (1.43)

Estimated covariance matrices of the time-dependent attitudinal latent variables:

Γ_FIML = [ 4.22  3.99  3.78        Γ_Pair = [ 4.25  3.64  3.11
           3.99  4.78  4.52                   3.64  4.11  3.52
           3.78  4.52  5.28 ]                 3.11  3.52  4.01 ]

φ_FIML = 0.95 (0.01)    φ_Pair = 0.86 (0.04)


Simulation: Full vs. Pairwise

Comparison between the two estimation methods under the following conditions:

• setting as in the BHPS application ⇒ 3 ordinal variables, 3 waves (with 4 items FIML was not able to converge),

• different sample sizes (200; 1000),

• 5 quadrature points,

• 200 replications.


Results for loadings (variances in parentheses)

                        Full ML                Pairwise ML
                        Mean          MSE      Mean          MSE
n=200    α₁ = 1.00      —             —        —             —
         α₂ = 0.80      0.84 (0.05)   0.05     0.84 (0.05)   0.05
         α₃ = 1.20      1.22 (0.10)   0.10     1.22 (0.07)   0.07
n=1000   α₁ = 1.00      —             —        —             —
         α₂ = 0.80      0.82 (0.01)   0.01     0.82 (0.01)   0.01
         α₃ = 1.20      1.19 (0.02)   0.02     1.21 (0.02)   0.02


Results for variances of item random effects

                           Full ML                Pairwise ML
                           Mean          MSE      Mean          MSE
n=200    φ = 0.80          0.79 (0.01)   0.01     0.81 (0.01)   0.01
         σ²_{u1} = 2.00    1.98 (0.69)   0.69     2.01 (0.54)   0.54
         σ²_{u2} = 5.00    5.42 (2.56)   2.73     5.16 (1.73)   1.76
         σ²_{u3} = 5.00    5.12 (2.51)   2.53     4.87 (2.06)   2.08
         σ²₁ = 4.00        4.32 (2.96)   3.07     4.04 (1.43)   1.43
n=1000   φ = 0.80          0.79 (0.00)   0.00     0.80 (0.00)   0.00
         σ²_{u1} = 2.00    1.98 (0.11)   0.11     2.03 (0.12)   0.12
         σ²_{u2} = 5.00    5.25 (0.38)   0.44     5.13 (0.31)   0.33
         σ²_{u3} = 5.00    5.02 (0.40)   0.40     4.89 (0.40)   0.41
         σ²₁ = 4.00        4.10 (0.35)   0.36     3.98 (0.34)   0.34


Conclusions

• Full ML vs composite likelihood methods: full ML is in some cases not possible even for moderate-size problems; composite likelihoods seem promising.

• Efficiency of the estimators remains to be studied (computationally demanding).

• Weights, either on separate marginal components or through inclusion of weighted versions of univariate densities, remain to be checked.

• Limited information goodness-of-fit tests are under investigation.

• MCMC methods work for all models. Caution is needed regarding convergence, parameterization and goodness-of-fit.
