ELE 538B: Sparsity, Structure and Inference
Gaussian Graphical Models and Graphical Lasso
Yuxin Chen
Princeton University, Spring 2017
Multivariate Gaussians
Consider a random vector x ∼ N(0, Σ) with pdf

    f(x) = 1/((2π)^{p/2} det(Σ)^{1/2}) · exp{ −(1/2) xᵀΣ⁻¹x } ∝ det(Θ)^{1/2} exp{ −(1/2) xᵀΘx }

where Σ = E[xxᵀ] ≻ 0 is the covariance matrix, and Θ = Σ⁻¹ is the inverse covariance matrix / precision matrix

Graphical lasso 5-2
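As a quick numerical sanity check (a sketch, not part of the slides), the following numpy snippet verifies the proportionality above: det(Θ)^{1/2} exp{−xᵀΘx/2} differs from the exact density only by the constant (2π)^{−p/2}, independent of x.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random covariance matrix Sigma ≻ 0 and its precision matrix Theta.
p = 4
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)
Theta = np.linalg.inv(Sigma)

def pdf(x):
    """Exact N(0, Sigma) density."""
    quad = x @ np.linalg.inv(Sigma) @ x
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5)

def unnormalized(x):
    """det(Theta)^{1/2} * exp{-x^T Theta x / 2}: proportional to the pdf."""
    return np.linalg.det(Theta) ** 0.5 * np.exp(-0.5 * x @ Theta @ x)

# The ratio is the constant (2*pi)^{-p/2}, independent of x.
ratios = [pdf(x) / unnormalized(x) for x in rng.standard_normal((3, p))]
print(ratios)
```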
Undirected graphical models
[Figure: an undirected graph illustrating x1 ⊥⊥ x4 | {x2, x3, x5, x6, x7, x8}]

• Represent a collection of variables x = [x1, · · · , xp]ᵀ by a vertex set V = {1, · · · , p}
• Encode conditional independence by a set E of edges:
  ◦ For any pair of vertices u and v,

      (u, v) ∉ E ⇐⇒ xu ⊥⊥ xv | x_{V∖{u,v}}
Gaussian graphical models
Fact 5.1 (Homework)
Consider a Gaussian vector x ∼ N(0, Σ). For any u and v,

    xu ⊥⊥ xv | x_{V∖{u,v}}

iff Θu,v = 0, where Θ = Σ⁻¹.
conditional independence ⇐⇒ sparsity
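Fact 5.1 can be checked numerically. The sketch below (illustrative; the matrix is made up) builds a sparse precision matrix and confirms that a zero entry of Θ corresponds to zero conditional covariance given all remaining coordinates, while a nonzero entry does not.

```python
import numpy as np

# A hand-made sparse precision matrix (illustrative): Theta[0, 3] = 0.
Theta = np.array([[2.0, 0.5, 0.0, 0.0],
                  [0.5, 2.0, 0.5, 0.0],
                  [0.0, 0.5, 2.0, 0.5],
                  [0.0, 0.0, 0.5, 2.0]])
Sigma = np.linalg.inv(Theta)

def cond_cov(Sigma, pair):
    """Covariance of x[pair] given all remaining coordinates (Schur complement).

    For a Gaussian this equals inv(Theta[pair, pair]), so its off-diagonal
    entry vanishes exactly when the corresponding entry of Theta is zero.
    """
    p = Sigma.shape[0]
    rest = [k for k in range(p) if k not in pair]
    S11 = Sigma[np.ix_(pair, pair)]
    S12 = Sigma[np.ix_(pair, rest)]
    S22 = Sigma[np.ix_(rest, rest)]
    return S11 - S12 @ np.linalg.inv(S22) @ S12.T

C03 = cond_cov(Sigma, [0, 3])   # Theta[0, 3] = 0  -> conditionally independent
C01 = cond_cov(Sigma, [0, 1])   # Theta[0, 1] != 0 -> conditionally dependent
print(C03[0, 1], C01[0, 1])
```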
Gaussian graphical models
Θ =
    ⎡ ∗ ∗ 0 0 ∗ 0 0 0 ⎤
    ⎢ ∗ ∗ 0 0 0 ∗ ∗ 0 ⎥
    ⎢ 0 0 ∗ 0 ∗ 0 0 ∗ ⎥
    ⎢ 0 0 0 ∗ 0 0 ∗ 0 ⎥
    ⎢ ∗ 0 ∗ 0 ∗ 0 0 ∗ ⎥
    ⎢ 0 ∗ 0 0 0 ∗ 0 0 ⎥
    ⎢ 0 ∗ 0 ∗ 0 0 ∗ 0 ⎥
    ⎣ 0 0 ∗ 0 ∗ 0 0 ∗ ⎦

(∗ denotes a nonzero entry)
Likelihoods for Gaussian models
Draw n i.i.d. samples x^(1), · · · , x^(n) ∼ N(0, Σ); then the log-likelihood (up to an additive constant) is

    ℓ(Θ) = (1/n) ∑_{i=1}^n log f(x^(i)) = (1/2) log det(Θ) − (1/(2n)) ∑_{i=1}^n x^(i)ᵀ Θ x^(i)
         = (1/2) log det(Θ) − (1/2) ⟨S, Θ⟩,

where S := (1/n) ∑_{i=1}^n x^(i) x^(i)ᵀ is the sample covariance; ⟨S, Θ⟩ = tr(SΘ)

Maximum likelihood estimation

    maximize_{Θ≻0}  log det(Θ) − ⟨S, Θ⟩
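The log-likelihood is easy to evaluate with numpy. A small sketch (synthetic data and constants are illustrative): in the classical regime n ≫ p, the maximizer of ℓ(Θ) is Θ̂ = S⁻¹, which beats nearby symmetric perturbations since ℓ is strictly concave.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (illustrative): classical regime with n >> p.
p, n = 3, 5000
Sigma_true = np.array([[1.0, 0.3, 0.0],
                       [0.3, 1.0, 0.2],
                       [0.0, 0.2, 1.0]])
X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=n)
S = X.T @ X / n                            # sample covariance

def loglik(Theta, S):
    """l(Theta) = (1/2) log det(Theta) - (1/2) <S, Theta>."""
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite"
    return 0.5 * logdet - 0.5 * np.trace(S @ Theta)

# The unpenalized MLE is Theta = S^{-1}: it beats nearby symmetric perturbations.
Theta_mle = np.linalg.inv(S)
best = loglik(Theta_mle, S)
for _ in range(20):
    E = 0.01 * rng.standard_normal((p, p))
    assert loglik(Theta_mle + (E + E.T) / 2, S) <= best
print(best)
```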
Challenge in high-dimensional regime
• Classical theory says the MLE converges to the truth as the sample size n → ∞
• In practice, we are often in the regime where the sample size n is small (n < p)
• In this regime, S is rank-deficient, and the MLE does not even exist
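A two-line numpy demonstration of the rank deficiency (dimensions are illustrative): with n < p the sample covariance has rank at most n, so it is singular and the MLE objective is unbounded.

```python
import numpy as np

rng = np.random.default_rng(3)

# High-dimensional regime: fewer samples than variables (n < p).
n, p = 10, 25
X = rng.standard_normal((n, p))
S = X.T @ X / n                 # p x p sample covariance

# rank(S) <= n < p, so S is singular: log det(Theta) - <S, Theta> is
# unbounded above over Theta ≻ 0, and the unpenalized MLE does not exist.
print(np.linalg.matrix_rank(S), p)
```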
Graphical lasso (Friedman, Hastie, & Tibshirani ’08)
Many pairs of variables are conditionally independent ⇐⇒ many missing links in the graphical model (sparsity)

Key idea: apply lasso by treating each node as a response variable

    maximize_{Θ≻0}  log det(Θ) − ⟨S, Θ⟩ − λ‖Θ‖₁,

where λ‖Θ‖₁ is the lasso penalty.

• It is a convex program! (homework)
• First-order optimality condition:

    Θ⁻¹ − S − λ∂‖Θ‖₁ = 0,   (5.1)

  where ∂‖Θ‖₁ is a subgradient of ‖Θ‖₁. Since Θi,i > 0, the diagonal entries of the subgradient equal 1, so

    ⟹ (Θ⁻¹)i,i = Si,i + λ,  1 ≤ i ≤ p
Blockwise coordinate descent

Idea: repeatedly cycle through all columns/rows and, in each step, optimize only a single column/row

Notation: use W to denote the working version of Θ⁻¹. Partition all matrices into 1 column/row vs. the rest:

    Θ = ⎡ Θ11  θ12 ⎤     S = ⎡ S11  s12 ⎤     W = ⎡ W11  w12 ⎤
        ⎣ θ12ᵀ θ22 ⎦         ⎣ s12ᵀ s22 ⎦         ⎣ w12ᵀ w22 ⎦
Blockwise coordinate descent
Blockwise step: suppose we fix all but the last row/column. It follows from (5.1) that

    0 ∈ W11β − s12 − λ∂‖θ12‖₁ = W11β − s12 + λ∂‖β‖₁,   (5.2)

where β = −θ12/θ̃22, since the matrix inverse formula gives

    ⎡ Θ11  θ12 ⎤⁻¹   ⎡ ∗   −(1/θ̃22) Θ11⁻¹θ12 ⎤
    ⎣ θ12ᵀ θ22 ⎦   = ⎣ ∗            ∗          ⎦

with θ̃22 = θ22 − θ12ᵀΘ11⁻¹θ12 > 0. (Since θ̃22 > 0, sign(β) = −sign(θ12), which justifies the sign flip in the subgradient.)

This coincides with the optimality condition for

    minimize_β  (1/2) ‖W11^{1/2}β − W11^{−1/2}s12‖₂² + λ‖β‖₁   (5.3)
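Expanding the square, (5.3) equals (1/2)βᵀW11β − s12ᵀβ + λ‖β‖₁ up to a constant, which can be minimized by coordinate descent with soft-thresholding. A minimal numpy sketch (random test data; not the authors' code), which also verifies the subgradient condition (5.2) at the solution:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding: the scalar proximal map of t * |.|"""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(W11, s12, lam, n_iter=500):
    """Coordinate descent for (5.3), expanded as
       minimize_b (1/2) b^T W11 b - s12^T b + lam * ||b||_1  (+ const)."""
    beta = np.zeros(len(s12))
    for _ in range(n_iter):
        for j in range(len(s12)):
            # partial residual excluding coordinate j
            r = s12[j] - W11[j] @ beta + W11[j, j] * beta[j]
            beta[j] = soft(r, lam) / W11[j, j]
    return beta

# Sanity check of the subgradient condition (5.2) on random data.
rng = np.random.default_rng(4)
A = rng.standard_normal((6, 6))
W11 = A @ A.T + 6 * np.eye(6)       # positive definite
s12 = rng.standard_normal(6)
lam = 0.3
beta = lasso_cd(W11, s12, lam)
grad = W11 @ beta - s12             # lies in [-lam, lam] coordinatewise,
print(np.max(np.abs(grad)))         # with |grad_j| = lam wherever beta_j != 0
```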
Blockwise coordinate descent
Algorithm 5.1 Block coordinate descent for graphical lasso
Initialize W = S + λI and fix its diagonal entries {wi,i}.
Repeat until convergence:
  for j = 1, · · · , p:
    (i) Partition W (resp. S) into 4 parts, where the upper-left part consists of all but the jth row/column
    (ii) Solve
         minimize_β  (1/2) ‖W11^{1/2}β − W11^{−1/2}s12‖₂² + λ‖β‖₁
    (iii) Update w12 = W11β
Set θ̂12 = −θ̂22β with θ̂22 = 1/(w22 − w12ᵀβ)
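Putting the pieces together, here is a compact numpy sketch in the spirit of Algorithm 5.1 (illustrative and unoptimized; the inner lasso is solved by coordinate descent, and the synthetic chain-structured precision matrix is made up). At convergence the working matrix W satisfies the optimality condition ‖W − S‖∞ ≤ λ with (W)i,i = Si,i + λ, and stays positive definite, as Lemma 5.2 below guarantees.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(W11, s12, lam, n_iter=200):
    # Coordinate descent for (1/2) b^T W11 b - s12^T b + lam * ||b||_1, i.e. (5.3).
    beta = np.zeros(len(s12))
    for _ in range(n_iter):
        for j in range(len(s12)):
            r = s12[j] - W11[j] @ beta + W11[j, j] * beta[j]
            beta[j] = soft(r, lam) / W11[j, j]
    return beta

def graphical_lasso(S, lam, n_sweeps=30):
    """Blockwise coordinate descent in the spirit of Algorithm 5.1."""
    p = S.shape[0]
    W = S + lam * np.eye(p)           # diagonal entries stay fixed from here on
    B = np.zeros((p, p))              # lasso solution for each column
    for _ in range(n_sweeps):
        for j in range(p):
            rest = [k for k in range(p) if k != j]
            W11 = W[np.ix_(rest, rest)]
            beta = lasso_cd(W11, S[rest, j], lam)
            w12 = W11 @ beta          # step (iii): update the off-diagonal block
            W[rest, j] = w12
            W[j, rest] = w12
            B[rest, j] = beta
    Theta = np.zeros((p, p))          # recover Theta from W and B
    for j in range(p):
        rest = [k for k in range(p) if k != j]
        theta22 = 1.0 / (W[j, j] - W[rest, j] @ B[rest, j])
        Theta[j, j] = theta22
        Theta[rest, j] = -theta22 * B[rest, j]
    return Theta, W

# Synthetic data from a chain-structured precision matrix (illustrative).
rng = np.random.default_rng(5)
p, n, lam = 5, 500, 0.1
Theta_true = np.eye(p) * 2.0 + np.eye(p, k=1) * 0.5 + np.eye(p, k=-1) * 0.5
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta_true), size=n)
S = X.T @ X / n
Theta, W = graphical_lasso(S, lam)
print(np.max(np.abs(W - S)))   # optimality: at most lam entrywise
```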
Blockwise coordinate descent
The only remaining thing is to ensure W ≻ 0. This is automatically satisfied:

Lemma 5.2 (Mazumder & Hastie ’12)
If we start with W ≻ 0 satisfying ‖W − S‖∞ ≤ λ, then every row/column update maintains positive definiteness of W.

• If we start with W^(0) = S + λI, then W^(t) will always be positive definite
Proof of Lemma 5.2
A key observation for the proof of Lemma 5.2:

Fact 5.3 (Lemma 2, Mazumder & Hastie ’12)
Solving (5.3) is equivalent to solving

    minimize_γ  (s12 + γ)ᵀ W11⁻¹ (s12 + γ)   s.t. ‖γ‖∞ ≤ λ   (5.4)

where the solutions of the two problems are related by β̂ = W11⁻¹(s12 + γ̂)

• Check that the optimality condition of (5.3) matches that of (5.4)
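One way to see Fact 5.3 concretely (a numerical sketch, not a proof): solve (5.3) by coordinate descent and (5.4) by projected gradient descent on random data, and check that the solutions satisfy β̂ = W11⁻¹(s12 + γ̂).

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(W11, s12, lam, n_iter=2000):
    # Coordinate descent for (5.3): (1/2) b^T W11 b - s12^T b + lam * ||b||_1.
    beta = np.zeros(len(s12))
    for _ in range(n_iter):
        for j in range(len(s12)):
            r = s12[j] - W11[j] @ beta + W11[j, j] * beta[j]
            beta[j] = soft(r, lam) / W11[j, j]
    return beta

def box_qp_pgd(W11, s12, lam, n_iter=5000):
    # Projected gradient for (5.4): min (s12+g)^T W11^{-1} (s12+g), ||g||_inf <= lam.
    W11_inv = np.linalg.inv(W11)
    step = 0.9 / (2.0 * np.linalg.eigvalsh(W11_inv).max())   # < 1/L step size
    gamma = np.zeros(len(s12))
    for _ in range(n_iter):
        grad = 2.0 * W11_inv @ (s12 + gamma)
        gamma = np.clip(gamma - step * grad, -lam, lam)      # project onto the box
    return gamma

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 5))
W11 = A @ A.T + 5 * np.eye(5)
s12 = rng.standard_normal(5)
lam = 0.25

beta_hat = lasso_cd(W11, s12, lam)
gamma_hat = box_qp_pgd(W11, s12, lam)
# Fact 5.3: the two solutions are related by beta = W11^{-1} (s12 + gamma).
print(np.max(np.abs(beta_hat - np.linalg.inv(W11) @ (s12 + gamma_hat))))
```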
Proof of Lemma 5.2

Suppose in the tth iteration one has ‖W^(t) − S‖∞ ≤ λ and W^(t) ≻ 0. By the Schur complement criterion,

    W^(t) ≻ 0  ⇐⇒  W11^(t) ≻ 0 and w22 − w12^(t)ᵀ (W11^(t))⁻¹ w12^(t) > 0

We only update w12, so it suffices to show

    w22 − w12^(t+1)ᵀ (W11^(t))⁻¹ w12^(t+1) > 0   (5.5)

Recall that w12^(t+1) = W11^(t) β^(t+1). It follows from Fact 5.3 that

    ‖w12^(t+1) − s12‖∞ ≤ λ   and   w12^(t+1)ᵀ (W11^(t))⁻¹ w12^(t+1) ≤ w12^(t)ᵀ (W11^(t))⁻¹ w12^(t),

where the second inequality holds because γ̂ = w12^(t+1) − s12 is optimal for (5.4) while γ = w12^(t) − s12 is feasible. Since w22 = s22 + λ remains unchanged, we establish (5.5).
References

[1] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, 2008.
[2] R. Mazumder and T. Hastie, “The graphical lasso: new insights and alternatives,” Electronic Journal of Statistics, 2012.
[3] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations, 2015.