Sparse estimation tutorial 2014

スパース推定概観：モデル・理論・応用

† 鈴木　大慈

†Tokyo Institute of TechnologyDepartment of Mathematical and Computing Sciences

2014年 9月 15日統計連合大会@東京大学

1 / 56

Outline

1 スパース推定のモデル

2 いろいろなスパース正則化

3 スパース推定の理論n≫ pの理論n≪ pの理論

4 高次元線形回帰の検定

5 スパース推定の最適化手法

2 / 56

高次元データでの問題意識

ゲノムデータ

金融データ

協調フィルタリング

コンピュータビジョン

音声認識

次元 d = 10000の時，サンプル数 n = 1000で推定ができるか？どのような条件があれば推定が可能か？

何らかの低次元性 (スパース性)を利用．

3 / 56

歴史: スパース推定の手法と理論

1992 Donoho and Johnstone Wavelet shrinkage(Soft-thresholding)

1996 Tibshirani Lassoの提案

2000 Knight and Fu Lassoの漸近分布(n≫ p)

2006 Candes and Tao, 圧縮センシングDonoho (制限等長性，完全復元，p ≫ n)

2009 Bickel et al., Zhang 制限固有値条件(Lassoのリスク評価, p ≫ n)

2013 van de Geer et al., スパース推定における検定Lockhart et al. (p ≫ n)

これ以前にも反射法地震探査や画像雑音除去，忘却付き構造学習に L1正則化は使われて

いた．詳しくは田中利幸 (2010)を参照.4 / 56

Outline






5 / 56

高次元データ解析

サンプル数 ≪ 次元

× 古典的数理統計学：サンプル数 ≫ 次元

バイオインフォテキストデータ画像データ6 / 56

高次元データ解析

サンプル数 ≪ 次元× 古典的数理統計学：サンプル数 ≫ 次元

バイオインフォテキストデータ画像データ6 / 56

スパース推定

サンプル数 ≪ 次元無駄な情報を切り落とす→スパース性

Lasso推定量R. Tsibshirani (1996). Regression shrinkage and selection via the lasso.J. Royal. Statist. Soc B., Vol. 58, No. 1, pages 267–288.引用数：10185 (2014年 5月 25日)

7 / 56

変数選択の問題（線形回帰）

デザイン行列 X = (Xij) ∈ Rn×p.p (次元)≫ n (サンプル数).真のベクトル β∗ ∈ Rp: 非ゼロ要素の個数がたかだか d 個 (スパース).

モデル : Y = Xβ∗ + ξ.

(Y ,X )から β∗を推定．実質推定しなくてはいけない変数の数は d 個→変数選択．

Mallows’ Cp, AIC:

βMC = argminβ∈Rp

∥Y − Xβ∥2 + 2σ2∥β∥0

ただし ∥β∥0 = |{j | βj = 0}|.→ 2p 個の候補を探索．NP-困難．

8 / 56

変数選択の問題（線形回帰）

デザイン行列 X = (Xij) ∈ Rn×p.p (次元)≫ n (サンプル数).真のベクトル β∗ ∈ Rp: 非ゼロ要素の個数がたかだか d 個 (スパース).

モデル : Y = Xβ∗ + ξ.

(Y ,X )から β∗を推定．実質推定しなくてはいけない変数の数は d 個→変数選択．

Mallows’ Cp, AIC:

βMC = argminβ∈Rp

∥Y − Xβ∥2 + 2σ2∥β∥0

ただし ∥β∥0 = |{j | βj = 0}|.→ 2p 個の候補を探索．NP-困難．

8 / 56

Lasso推定量

Mallows’ Cp 最小化: βMC = argminβ∈Rp

∥Y − Xβ∥2 + 2σ2∥β∥0.

問題点: ∥β∥0は凸関数ではない．連続でもない．沢山の局所最適解．→ 凸関数で近似．

Lasso [L1正則化]

βLasso = argminβ∈Rp

∥Y − Xβ∥2 + λ∥β∥1

ただし ∥β∥1 =∑p

j=1 |βj |.

→ 凸最適化！L1ノルムは L0ノルムの [−1, 1]p における凸包 (下から抑える最大の凸関数)

L1ノルムは要素数関数の Lovasz拡張

9 / 56

Lasso推定量のスパース性

p = n, X = I の場合．

βLasso = argminβ∈Rp

1

2∥Y − β∥2 + C∥β∥1

⇒ βLasso,i = argminb∈R

1

2(yi − b)2 + C |b|

=

{sign(yi )(yi − C ) (|yi | > C )

0 (|yi | ≤ C ).

小さいシグナルは 0に縮小される→スパース！

10 / 56

Lasso推定量のスパース性

β = argminβ∈Rp

1

n∥Xβ − Y ∥22 + λn

p∑j=1

|βj |.

11 / 56

スパース性の恩恵

β = argminβ∈Rp

1

n∥Xβ − Y ∥22 + λn

p∑j=1

|βj |.

Theorem (Lassoの収束レート)

ある条件のもと，定数 C が存在して高い確率で次の不等式が成り立つ：

∥β − β∗∥22 ≤ Cd log(p)

n.

※次元が高くても，たかだか log(p)でしか効いてこない．実質的な次元d が支配的．（「ある条件」については後で詳細を説明）

12 / 56

Outline






13 / 56

Lassoを一般化

Lasso:

minβ∈Rp

1

n

n∑i=1

(yi − x⊤i β)2 + ∥β∥1︸︷︷︸

正則化項

.

一般化したスパース正則化推定法:

minw∈Rp

1

n

n∑i=1

ℓ(zi , β) + ψ(β).

L1正則化項以外にどのような正則化項が有用であろうか？

14 / 56

Lassoを一般化

Lasso:

minβ∈Rp

1

n

n∑i=1

(yi − x⊤i β)2 + ∥β∥1︸︷︷︸

正則化項

.

一般化したスパース正則化推定法:

minw∈Rp

1

n

n∑i=1

ℓ(zi , β) + ψ(β).

L1正則化項以外にどのような正則化項が有用であろうか？

14 / 56

L1正則化によってスパースになる理由：座標軸に沿って尖っている．

正則化項の尖り方を工夫することで様々なスパース性が得られる．

15 / 56

グループ正則化

C∑g∈G

∥βg∥

重複なし重複あり

グループ内すべての変数が同時に 0になりやすい．より積極的にスパースにできる．

応用例：ゲノムワイド相関解析

16 / 56

グループ正則化の応用例

マルチタスク学習 Lounici et al. (2009)

T 個のタスクで同時に推定:

y(t)i ≈ x

(t)⊤i β(t) (i = 1, . . . , n(t), t = 1, . . . ,T ).

minβ(t)

T∑t=1

n(t)∑i=1

(yi − x(t)⊤i β(t))2 + C

p∑k=1

∥(β(1)k , . . . , β(T )k )∥︸︷︷︸


.

β(1)β(2) β(T)

タスク間共通で非ゼロな変数を選択17 / 56

グループ正則化の応用例

マルチタスク学習 Lounici et al. (2009)

T 個のタスクで同時に推定:

y(t)i ≈ x

(t)⊤i β(t) (i = 1, . . . , n(t), t = 1, . . . ,T ).

minβ(t)

T∑t=1

n(t)∑i=1

(yi − x(t)⊤i β(t))2 + C

p∑k=1

∥(β(1)k , . . . , β(T )k )∥︸︷︷︸


.

β(1)β(2) β(T)

タスク間共通で非ゼロな変数を選択17 / 56

トレースノルム正則化

W : M × N 行列．

∥W ∥Tr = Tr[(WW⊤)12 ] =

min{M,N}∑j=1

σj(W )

σj(W )はW の j 番目の特異値 (非負とする)．

特異値の和 = 特異値への L1正則化 → 特異値がスパース

特異値がスパース = 低ランク

18 / 56

例: 推薦システム

ランク 1と仮定する

映画 A 映画 B 映画 C · · · 映画 X

ユーザ 1 4 8 * · · · 2

ユーザ 2 2 * 2 · · · *

ユーザ 3 2 4 * · · · *

...

(e.g., Srebro et al. (2005), NetFlix Bennett and Lanning (2007))

19 / 56


ランク 1と仮定する

映画 A 映画 B 映画 C · · · 映画 X

ユーザ 1 4 8 4 · · · 2

ユーザ 2 2 4 2 · · · 1

ユーザ 3 2 4 2 · · · 1

...

(e.g., Srebro et al. (2005), NetFlix Bennett and Lanning (2007))

19 / 56


W*=

N

M

映画

ユーザ

→ 低ランク行列補完:

低ランク行列の Rademacher Complexity: Srebro et al. (2005).

Compressed sensing: Candes and Tao (2009), Candes and Recht(2009).

20 / 56

例: 縮小ランク回帰

縮小ランク回帰 (Anderson, 1951, Burket, 1964, Izenman, 1975)

マルチタスク学習 (Argyriou et al., 2008)

縮小ランク回帰

=

W*

n

Y X

N M

N

W

+

W ∗は低ランク.

21 / 56

スパース共分散選択

xk ∼ N(0,Σ) (i.i.d.,Σ ∈ Rp×p), Σ = 1n

∑nk=1 xkx

⊤k .

S = argminS:半正定対称

{− log(det(S)) + Tr[SΣ] + λ

p∑i ,j=1

|Si ,j |}.

(Meinshausen and B uhlmann, 2006, Yuan and Lin, 2007, Banerjee et al., 2008)

Σの逆行列 S を推定．Si ,j = 0 ⇔ X(i),X(j)が条件付き独立．ガウシアングラフィカルモデルが凸最適化で推定できる．

22 / 56

(一般化) Fused Lasso

ψ(β) = C∑

(i ,j)∈E

|βi − βj |.

(Tibshirani et al. (2005), Jacob et al. (2009))

Fused lassoによる遺伝子データ解析 (Tibshirani

and Taylor ‘11)

TV デノイジング (Chambolle ‘04)

23 / 56

非凸正則化

-3 -2 -1 0 1 2 30

0.5

1

1.5

2

2.5

3

3.5

L1

SCAD

MCP

Lq (q=0.5)

SCAD (Smoothly Clipped Absolute Deviation) (Fan and Li, 2001)

MCP (Minimax Concave Penalty) (Zhang, 2010)

Lq 正則化 (q < 1), Bridge正則化 (Frank and Friedman, 1993)

よりスパースな解．その代わり最適化は難しくなる．24 / 56

その他 L1正則化の拡張

Adaptive Lasso (Zou, 2006)

ある一致推定量 βがあるとして，それを利用．

ψ(β) = C

p∑j=1

|βj ||βj |γ

Lassoよりも小さいバイアス (漸近不偏)．オラクルプロパティ．

スパース加法モデル (Hastie and Tibshirani, 1999, Ravikumar et al., 2009)

f (x) =∑p

j=1 fj(xj)なる非線形関数を推定．fj ∈ Hj (Hj : 再生核ヒルベルト空間) とする．

ψ(f ) = C

p∑j=1

∥fj∥Hj

Group Lassoの一般化．Multiple Kernel Learningとも呼ばれる．

25 / 56

Outline






26 / 56

問題設定

簡単のため線形回帰を考える．

Y = Xβ∗ + ϵ.

Y ∈ Rn : 応答変数, X ∈ Rn×p : 説明変数, ϵ = [ϵ1, . . . , ϵn]⊤ ∈ Rn.

一般化線形回帰への一般化も可能．

27 / 56

n≫ pの理論

28 / 56

Lassoの漸近分布

pは固定，n→∞の漸近的振る舞いを考える．1nX

⊤Xp−→ C ≻ O.

ノイズ ϵi は平均 0分散 σ2とする.

Theorem (Lassoの漸近分布 (Knight and Fu, 2000))

λn√n→ λ0 ≥ 0なら

√n(β − β∗) d−→ argmin

uV (u),

ただし，V (u) = u⊤Cu − 2u⊤W + λ0

∑pj=1[ujsign(β

∗j )1(β

∗j = 0) + |uj |1(β∗j = 0)],

W ∼ N(0, σ2C ).

βは√n-consistentである．

β∗j = 0なる成分で βj は正の確率で 0となる．

第三項のせいで，漸近的にバイアスが残る．

β = argminβ1n∥Y − Xβ∥2 + λn

∑pj=1 |βj |.

29 / 56

Adaptive Lassoのオラクルプロパティ

βはある一致推定量．

β = argminβ

1

n∥Y − Xβ∥2 + λn

p∑j=1

|βj ||βj |γ

.

Theorem (Adaptive Lassoのオラクルプロパティ (Zou, 2006))

λn√n→ 0, λnn

1+γ2 →∞のとき，

1 limn→∞ P(J = J)→ 1 (J := {j | |βj | = 0}, J := {j | |β∗j | = 0}),

2√n(βJ − β∗J )

d−→ N(0, σ2C−1JJ ).

変数選択の一致性あり．漸近不偏，漸近正規性あり．ただし，β∗のある成分が原点に近づくような局所的な議論(β∗j = O(1/

√n) なる状況) については何も言っていないことに注意．

30 / 56

n≪ pの理論

31 / 56

Lassのリスクの上界

β = argminβ∈Rp

1

n∥Xβ − Y ∥22 + λn

p∑j=1

|βj |.

Theorem (Lassoの収束レート (Bickel et al., 2009, Zhang, 2009))

デザイン行列が Restricted eigenvalue condition (Bickel et al., 2009)かつmaxi ,j |Xij | ≤ 1を満たし，ノイズが E[eτξi ] ≤ eσ

2τ2/2 (∀τ > 0)を満たすなら, 確率 1− δで

∥β − β∗∥22 ≤ Cd log(p/δ)

n.

次元が高くても，たかだか log(p)でしか効いてこない．実質的な次元 d が支配的．

ノイズの条件はサブガウシアンの必要十分条件．

32 / 56

Lassoのminimax最適性

Theorem (ミニマクス最適レート (Raskutti and Wainwright, 2011))

ある条件のもと，確率 1/2以上で，

minβ:推定量

maxβ∗:d-スパース

∥β − β∗∥2 ≥ Cd log(p/d)

n.

Lassoはminimaxレートを達成する (d log(d)n の項を除いて)．

この結果をMultiple Kernel Learningに拡張した結果: Raskutti et al.(2012), Suzuki and Sugiyama (2013).

33 / 56

制限固有値条件 (Restricted eigenvalue condition)

A = 1nX

⊤X とする．

Definition (制限固有値条件 (RE(k ′,C )))

ϕRE(k′,C ) = ϕRE(k

′,C ,A) := infJ⊆{1,...,n},v∈Rp :

|J|≤k′,C∥vJ∥1≥∥vJc ∥1

v⊤Av

∥vJ∥22

に対し，ϕRE > 0が成り立つ．

ほぼスパースなベクトルに制限して定義した最小固有値．

J

34 / 56

適合性条件 (Compatibility condition)

A = 1nX

⊤X とする．

Definition (適合性条件 (COM(J ,C )))

ϕCOM(J,C ) = ϕCOM(J,C ,A) := infv∈Rp :

C∥vJ∥1≥∥vJc ∥1

kv⊤Av

∥vJ∥21

に対し，ϕCOM > 0が成り立つ．

|J| ≤ k ′なら，REよりも弱い条件．

35 / 56

制限等長性条件 (Restricted isometory condition)

Definition (制限等長性条件 (RI(k ′, δ)))

ある普遍定数 c が存在して，ある δ > 0に対し，

(1− δ)∥β∥2 ≤ ∥Xβ∥2 ≤ (1 + δ)∥β∥2

が全ての k ′-スパースなベクトル β ∈ Rp に対して成り立つ．

RE, COMよりも強い条件．

Johnson-Lindenstraussの補題.

圧縮センシングにおける完全復元の文脈でよく用いられる．

36 / 56

各条件の関係と収束レート

β : Lasso推定量.J := {j | β∗j = 0}. d := |J|.

強い 1n∥X (β − β∗)∥22 ∥β − β∗∥22 ∥β − β∗∥21

RI(2d , δ) → 圧縮センシングにおける完全復元⇓

RE(2d , 3) →d log(p)

n

d log(p)

n

d2 log(p)

n⇓

COM(J, 3) →d log(p)

n

d2 log(p)

n

d2 log(p)

n弱い

関連事項の詳細は Buhlmann and van de Geer (2011) に網羅されている．

37 / 56

制限固有値条件 (RE) が成立する確率

制限固有値条件はどれだけ成り立ちやすいか？

p次元確率変数 Z が等方的: E[⟨Z , z⟩2] = ∥z∥22 (∀z ∈ Rp).サブガウシアンノルム ∥Z∥ψ2 を次のように定義する:

∥Z∥ψ2 = supz∈Rp ,∥z∥=1 inft{t | E[exp(⟨Z , z⟩)2/t2] ≤ 2}.

1. Z = [Z1,Z2, . . . ,Zn]⊤ ∈ Rn×p の各行 Zi ∈ Rp が独立な等方的サブガ

ウシアン確率変数とする．2. ある半正定対称行列 Σ ∈ Rp×p を用いて X = ZΣであるとする．

Theorem (Rudelson and Zhou (2013))

∥Zi∥ψ2 ≤ κ (∀i)とする．ある普遍定数 c0が存在しm = c0maxi (Σi,i )

2

ϕ2RE(k,9,Σ)に対

し，n ≥ 4c0mκ4 log(60ep/(mκ))ならば

P

(ϕRE(k, 3, Σ) ≥

1

2ϕRE(k, 9,Σ)

)≥ 1− 2 exp(−n/(4c0κ4)).

つまり，真の分散共分散行列が制限固有値条件を満たす⇒ 経験的分散共分散行列も高い確率で同条件を満たす． 38 / 56

凸最適化を用いないスパース推定法の性質

情報量規準型推定量: Massart (2003), Bunea et al. (2007), Rigolletand Tsybakov (2011).

minβ∈Rp

∥Y − Xβ∥2 + Cσ2∥β∥0{1 + log

(p

∥β∥0

)}.

Bayes推定量: Dalalyan and Tsybakov (2008), Alquier and Lounici(2011), Suzuki (2012).

オラクル不等式: X に何も条件を課さずに次の不等式が成り立つ:

1

n∥Xβ∗ − X β∥2 ≤ Cσ2

d

nlog

(1 +

p

d

).

■ ミニマックス最適．■ 凸正則化推定法と大きなギャップ．■ 計算量と統計的性質のトレードオフ．

39 / 56

Outline






40 / 56

バイアス除去による方法

アイディア：Lasso推定量 βからバイアスを除去.(van de Geer et al., 2014, Javanmard and Montanari, 2014)� �

β = β +MX⊤(Y − X β)� �M が (X⊤X )−1なら，

β = β∗ + (X⊤X )−1X⊤ϵ.

→ バイアスなし， (漸近) 正規．

問題点: n≫ pのとき，X⊤X は非可逆．

41 / 56

M の求め方:

minM∈Rp×p

|ΣM⊤ − I |∞.

(| · |∞ は中身をベクトルとみなした無限大ノルム)

Theorem (Javanmard and Montanari (2014))

ϵi ∼ N(0, σ2) (i .i .d .)とする．

√n(β−β∗) = Z+∆, Z ∼ N(0, σ2MΣM⊤), ∆ =

√n(MΣ−I )(β∗−β).

また，X がランダムで分散共分散行列が正定な時，λn = cσ√

log(p)/nとすると，

∥∆∥∞ = Op

(d log(p)√

n

).

※ n≫ d2 log2(p)ならば∆ ≈ 0で，√n(β − β∗)はほぼ正規分布に従う．

→ 信頼区間の構築や検定ができる．

42 / 56

M の求め方:

minM∈Rp×p

|ΣM⊤ − I |∞.

(| · |∞ は中身をベクトルとみなした無限大ノルム)

Theorem (Javanmard and Montanari (2014))√n(β − β∗) = Z︸︷︷︸

正規分布

+ ∆︸︷︷︸X が非可逆であることによる残りカス

条件が良い時 0 へ収束

※ n≫ d2 log2(p)ならば∆ ≈ 0で，√n(β − β∗)はほぼ正規分布に従う．

→ 信頼区間の構築や検定ができる．

42 / 56

数値実験

(a) 人口データにおける 95%信頼区間.(n, p, d) = (1000, 600, 10).

(b) 人口データにおける p 値の CDF.(n, p, d) = (1000, 600, 10).

図は Javanmard and Montanari (2014)から引用．

43 / 56

共分散検定統計量 (Lockhart et al., 2014)

0 500 1000 1500 2000 2500 3000

−6

00

−4

00

−2

00

02

00

40

06

00

L1 Norm

Co

e"

cie

nts

0 2 4 6 8 10 10

J = supp(β(λk)), J∗ = supp(β∗),β(λk+1) := argmin

β:βJ∈R|J|,βJc=0

∥Y − XJβJ∥2 + λk+1∥βJ∥1.

J∗ ⊆ J ならば (β∗j = 0である)，

Tk =(⟨Y ,X β(λk+1)⟩ − ⟨Y ,X β(λk+1)⟩

)/σ2

d−→ Exp(1) (n, p →∞).

44 / 56

共分散検定統計量 (Lockhart et al., 2014)

0 500 1000 1500 2000 2500 3000

−6

00

−4

00

−2

00

02

00

40

06

00

L1 Norm

Co

e"

cie

nts

0 2 4 6 8 10 10

J = supp(β(λk)), J∗ = supp(β∗),β(λk+1) := argmin

β:βJ∈R|J|,βJc=0

∥Y − XJβJ∥2 + λk+1∥βJ∥1.

J∗ ⊆ J ならば (β∗j = 0である)，

Tk =(⟨Y ,X β(λk+1)⟩ − ⟨Y ,X β(λk+1)⟩

)/σ2

d−→ Exp(1) (n, p →∞).

44 / 56

Outline






45 / 56

スパース推定における最適化の問題意識

R(β) =n∑

i=1

ℓ(yi , x⊤i β)︸︷︷︸

f (β):ロス関数

+ ψ(β)︸︷︷︸正則化項

= f (β) + ψ(β)

ψが尖っている→微分不可能．尖っている関数の最適化は難しい．f はなめらかな場合が多い．ψの構造を利用すれば，あたかも R がなめらかであるかのように最適化可能．

典型例：L1正則化 ψ(β) = C∑p

j=1 |βj | → 座標ごとに分かれている.

一次元の最適化minb{(b − y)2 + C |b|}は簡単.

46 / 56

スパース推定における最適化の問題意識

R(β) =n∑

i=1

ℓ(yi , x⊤i β)︸︷︷︸

f (β):ロス関数

+ ψ(β)︸︷︷︸正則化項

= f (β) + ψ(β)

ψが尖っている→微分不可能．尖っている関数の最適化は難しい．f はなめらかな場合が多い．ψの構造を利用すれば，あたかも R がなめらかであるかのように最適化可能．

典型例：L1正則化 ψ(β) = C∑p

j=1 |βj | → 座標ごとに分かれている.

一次元の最適化minb{(b − y)2 + C |b|}は簡単.46 / 56

座標降下法

座標降下法の手順1 座標 j ∈ {1, . . . , p}を何らかの方法で選択．2 j 番目の座標 βj を更新．（以下は更新方法の例）

β(k+1)j ← argminβj

R([β(k)1 , . . . , βj , . . . , β

(k)p ]).

gj =∂f (β(k))

∂βjとして，

β(k+1)j ← argminβj

⟨gj , βj⟩+ ψj(βj) +ηk

2 ∥βj − β(k)j ∥.

座標を一つずつではなく複数個選ぶことも多い: ブロック座標降下法．

座標はあるルールに従って選んだり，ランダムに選んだりする．

47 / 56

座標降下法の収束レート

分解可能な正則化項 ψ(β) =∑p

j=1 ψj(βj)を考える．f は微分可能で勾配が L-Lipschitz連続 (∇f (β)−∇f (β′) ≤ L∥β − β′∥)→ このとき，f は「滑らか」であると言う．

サイクリック (Saha and Tewari, 2013)

R(β(k))− R(β) ≤ L∥β(0) − β∥2

2k= O(1/k).

ランダム選択 (Nesterov, 2012, Peter Richtarik, 2014)

加速法なし: O(1/k).Nesterovの加速法: O(1/k2) (Fercoq and Richtarik, 2013).f が α-強凸: O(exp(−C (α/L)k)).f が α-強凸+加速法: O(exp(−C

√α/Lk)) (Lin et al., 2014).

座標の選び方はランダムで良い．

48 / 56

大規模データにおける座標降下法

Hydra: 並列分散計算を用いた座標降下法 (Richtarik and Takac, 2013,Fercoq et al., 2014).

大規模 Lasso (p = 5× 108, n = 109) における Hydraの計算効率(Richtarik and Takac, 2013)．128ノード，4,096コア．

49 / 56

近接勾配法型手法

f (β)︸︷︷︸線形近似

+ψ(β)

gk ∈ ∂f (β(k)), gk = 1k

∑kτ=1 gτ .

近接勾配法:

β(k+1) = argminβ∈Rp

{g⊤k β + ψ(β) +

ηk2∥β − β(k)∥2

}.

正則化双対平均法 (Xiao, 2009, Nesterov, 2009):

β(k+1) = argminβ∈Rp

{g⊤k β + ψ(β) +

ηk2∥β∥2

}.

鍵となる計算は近接写像: prox(q|ψ) := argminx

{ψ(x) +

1

2∥x− q∥2

}.

L1正則化なら簡単に計算できる (Soft-thresholding関数)．50 / 56

近接写像の例: L1正則化

prox(q|C∥ · ∥1) = argminx

{C∥x∥1 +

1

2∥x− q∥2

}= (sign(qj)max(|qj | − C , 0))j .

→ Soft-thresholding関数. 解析解!

51 / 56

近接勾配法型手法の収束レート

f の性質滑らか非滑らか

強凸 exp(−√α/Lk)

1

k

非強凸1

k21√k

1 滑らかな場合，Nesterovの加速法を使った時の収束レートを示している (Nesterov, 2007, Zhang et al., 2010)．

2 加速法を使わなければ，それぞれ exp(−(α/L)k), 1k になる．

3 上のオーダーは勾配情報のみを用いる方法 (First order method) の中で最適．

52 / 56

拡張ラグランジアン型手法

minβ

f (β) + ψ(β)

⇔ minx ,y

f (x) + ψ(y) s.t. x = y .

最適化の難しさを分離．

拡張ラグランジアン: L(x , y , λ) = f (x) + ψ(y) + λ⊤(y − x) + ρ2∥y − x∥2.

乗数法 (Hestenes, 1969, Powell, 1969, Rockafellar, 1976)

1 (x (k+1), y (k+1)) = argminx ,y L(x , y , λ(k)).2 λ(k+1) = λ(k) − ρ(y (k+1) − x (k+1)).

x , y での同時最適化はやや面倒 → 交互方向乗数法．

53 / 56

交互方向乗数法

minx ,y

f (x) + ψ(y) s.t. x = y .

L(x , y , λ) = f (x) + ψ(y) + λ⊤(y − x) +ρ

2∥y − x∥2.

交互方向乗数法 (Gabay and Mercier, 1976)

x (k+1) = argminx

f (x)− λ(k)⊤x +ρ

2∥y (k) − x∥2

y (k+1) = argminy

ψ(y) + λ(k)⊤y +ρ

2∥y − x (k+1)∥2(= prox(x (k+1) − λ(k)/ρ|ψ/ρ))

λ(k+1) = λ(k) − ρ(x (k+1) − y (k+1))

x , y の同時最適化は交互に最適化することで回避．y の更新は近接写像．L1正則化のような場合は簡単．構造的正則化への拡張も容易．最適解に収束する保証あり．一般的にはO(1/k) (He and Yuan, 2012), 強凸ならば線形収束 (Deng and

Yin, 2012, Hong and Luo, 2012)． 54 / 56

確率的最適化

サンプル数の多い大規模データで有用．一回の更新にすべてのサンプルを読み込まないでも大丈夫.

■ オンライン型FOBOS (Duchi and Singer, 2009)

RDA (Xiao, 2009)

■ バッチ型SVRG (Stochastic Variance Reduced Gradient) (Johnson and Zhang, 2013)

SDCA (Stochastic Dual Coordinate Ascent) (Shalev-Shwartz and Zhang, 2013)

SAG (Stochastic Averaging Gradient) (Le Roux et al., 2013)

確率的交互方向乗数法: Suzuki (2013), Ouyang et al. (2013), Suzuki(2014).

55 / 56

まとめ

様々なスパースモデリングL1正則化グループ正則化トレースノルム正則化

Lassoの漸近的振る舞い漸近分布Adaptive Lassoのオラクルプロパティ制限固有値条件→ ∥β − β∗∥2 = Op(d log(p)/n)

検定バイアス除去法，共分散検定統計量

最適化手法座標降下法近接勾配法(交互方向) 乗数法

56 / 56

P. Alquier and K. Lounici. PAC-Bayesian bounds for sparse regressionestimation with exponential weights. Electronic Journal of Statistics, 5:127–145, 2011.

T. Anderson. Estimating linear restrictions on regression coefficients formultivariate normal distributions. Annals of Mathematical Statistics, 22:327–351, 1951.

A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectralregularization framework for multi-task structure learning. In Y. S.J.C. Platt, D. Koller and S. Roweis, editors, Advances in NeuralInformation Processing Systems 20, pages 25–32, Cambridge, MA,2008. MIT Press.

O. Banerjee, L. E. Ghaoui, and A. d’Aspremont. Model selection throughsparse maximum likelihood estimation for multivariate gaussian orbinary data. Journal of Machine Learning Research, 9:485–516, 2008.

J. Bennett and S. Lanning. The netflix prize. In Proceedings of KDD Cupand Workshop 2007, 2007.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lassoand Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

56 / 56

P. Buhlmann and S. van de Geer. Statistics for high-dimensional data.Springer, 2011.

F. Bunea, A. Tsybakov, and M. Wegkamp. Aggregation for gaussianregression. The Annals of Statistics, 35(4):1674–1697, 2007.

G. R. Burket. A study of reduced-rank models for multiple prediction,volume 12 of Psychometric monographs. Psychometric Society, 1964.

E. Candes and T. Tao. The power of convex relaxations: Near-optimalmatrix completion. IEEE Transactions on Information Theory, 56:2053–2080, 2009.

E. J. Candes and B. Recht. Exact matrix completion via convexoptimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

E. J. Candes and T. Tao. Near-optimal signal recovery from randomprojections: Universal encoding strategies? IEEE Transactions onInformation Theory, 52(12):5406–5425, 2006.

A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weightingsharp PAC-Bayesian bounds and sparsity. Machine Learning, 72:39–61,2008.

56 / 56

W. Deng and W. Yin. On the global and linear convergence of thegeneralized alternating direction method of multipliers. Technical report,Rice University CAAM TR12-14, 2012.

D. Donoho. Compressed sensing. IEEE Transactions of InformationTheory, 52(4):1289–1306, 2006.

D. L. Donoho and J. M. Johnstone. Ideal spatial adaptation by waveletshrinkage. Biometrika, 81(3):425–455, 1994.

J. Duchi and Y. Singer. Efficient online and batch learning using forwardbackward splitting. Journal of Machine Learning Research, 10:2873–2908, 2009.

J. Fan and R. Li. Variable selection via nonconcave penalized likelihoodand its oracle properties. Journal of the American StatisticalAssociation, 96(456), 2001.

O. Fercoq and P. Richtarik. Accelerated, parallel and proximal coordinatedescent. Technical report, 2013. arXiv:1312.5799.

O. Fercoq, Z. Qu, P. Richtarik, and M. Takac. Fast distributed coordinatedescent for non-strongly convex losses. In Proceedings of MLSP2014:IEEE International Workshop on Machine Learning for SignalProcessing, 2014.

56 / 56

I. E. Frank and J. H. Friedman. A statistical view of some chemometricsregression tools. Technometrics, 35(2):109–135, 1993.

D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinearvariational problems via finite-element approximations. Computers &Mathematics with Applications, 2:17–40, 1976.

T. Hastie and R. Tibshirani. Generalized additive models. Chapman &Hall Ltd, 1999.

B. He and X. Yuan. On the O(1/n) convergence rate of theDouglas-Rachford alternating direction method. SIAM J. NumericalAnalisis, 50(2):700–709, 2012.

M. Hestenes. Multiplier and gradient methods. Journal of OptimizationTheory & Applications, 4:303–320, 1969.

M. Hong and Z.-Q. Luo. On the linear convergence of the alternatingdirection method of multipliers. Technical report, 2012.arXiv:1208.3922.

A. J. Izenman. Reduced-rank regression for the multivariate linear model.Journal of Multivariate Analysis, pages 248–264, 1975.

56 / 56

L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap andgraph lasso. In Proceedings of the 26th International Conference onMachine Learning, 2009.

A. Javanmard and A. Montanari. Confidence intervals and hypothesistesting for high-dimensional regression. Journal of Machine Learning,page to appear, 2014.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent usingpredictive variance reduction. In C. Burges, L. Bottou, M. Welling,Z. Ghahramani, and K. Weinberger, editors, Advances in NeuralInformation Processing Systems 26, pages 315–323. Curran Associates,Inc., 2013. URL http://papers.nips.cc/paper/

4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.

pdf.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annalsof Statistics, 28(5):1356–1378, 2000.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method withan exponential convergence rate for strongly-convex optimization withfinite training sets. In Advances in Neural Information ProcessingSystems 25, 2013.

56 / 56

http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf



Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradientmethod and its application to regularized empirical risk minimization.Technical report, 2014. arXiv:1407.1296.

R. Lockhart, J. Taylor, R. J. Tibshirani, and R. Tibshirani. A significancetest for the lasso. The Annals of Statistics, 42(2):413–468, 2014.

K. Lounici, A. Tsybakov, M. Pontil, and S. van de Geer. Taking advantageof sparsity in multi-task learning. 2009.

P. Massart. Concentration Inequalities and Model Selection: Ecole d’etede Probabilites de Saint-Flour 23. Springer, 2003.

N. Meinshausen and P. B uhlmann. High-dimensional graphs and variableselection with the lasso. The Annals of Statistics, 34(3):1436–1462,2006.

Y. Nesterov. Gradient methods for minimizing composite objectivefunction. Technical Report 76, Center for Operations Research andEconometrics (CORE), Catholic University of Louvain (UCL), 2007.

Y. Nesterov. Primal-dual subgradient methods for convex problems.Mathematical Programming, Series B, 120:221–259, 2009.

56 / 56

Y. Nesterov. Efficiency of coordinate descent methods on huge-scaleoptimization problems. SIAM Journal on Optimization, 22(2):341–362,2012.

H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternatingdirection method of multipliers. In Proceedings of the 30th InternationalConference on Machine Learning, 2013.

M. T. Peter Richtarik. Iteration complexity of randomizedblock-coordinate descent methods for minimizing a composite function.Mathematical Programming, Series A, 144:1–38, 2014.

M. Powell. A method for nonlinear constraints in minimization problems.In R. Fletcher, editor, Optimization, pages 283–298. Academic Press,London, New York, 1969.

G. Raskutti and M. J. Wainwright. Minimax rates of estimation forhigh-dimensional linear regression over ℓq-balls. IEEE Transactions onInformation Theory, 57(10):6976–6994, 2011.

G. Raskutti, M. Wainwright, and B. Yu. Minimax-optimal rates for sparseadditive models over kernel classes via convex programming. Journal ofMachine Learning Research, 13:389–427, 2012.

56 / 56

P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additivemodels. Journal of the Royal Statistical Society: Series B, 71(5):1009–1030, 2009.

P. Richtarik and M. Takac. Distributed coordinate descent method forlearning with big data. Technical report, 2013. arXiv:1310.2059.

P. Rigollet and A. Tsybakov. Exponential screening and optimal rates ofsparse estimation. The Annals of Statistics, 39(2):731–771, 2011.

R. T. Rockafellar. Augmented Lagrangians and applications of theproximal point algorithm in convex programming. Mathematics ofOperations Research, 1:97–116, 1976.

M. Rudelson and S. Zhou. Reconstruction from anisotropic randommeasurements. IEEE Transactions of Information Theory, 39, 2013.

A. Saha and A. Tewari. On the non-asymptotic convergence of cycliccoordinate descent methods. SIAM Journal on Optimization, 23(1):576–601, 2013.

S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinateascent. Technical report, 2013. arXiv:1211.2717.

56 / 56

N. Srebro, N. Alon, and T. Jaakkola. Generalization error bounds forcollaborative prediction with low-rank matrices. In Advances in NeuralInformation Processing Systems (NIPS) 17, 2005.

T. Suzuki. Pac-bayesian bound for gaussian process regression andmultiple kernel additive model. In JMLR Workshop and ConferenceProceedings, volume 23, pages 8.1–8.20, 2012. Conference on LearningTheory (COLT2012).

T. Suzuki. Dual averaging and proximal gradient descent for onlinealternating direction multiplier method. In Proceedings of the 30thInternational Conference on Machine Learning, pages 392–400, 2013.

T. Suzuki. Stochastic dual coordinate ascent with alternating directionmethod of multipliers. In Proceedings of the 31th InternationalConference on Machine Learning, pages 736–744, 2014.

T. Suzuki and M. Sugiyama. Fast learning rate of multiple kernel learning:trade-off between sparsity and smoothness. The Annals of Statistics, 41(3):1381–1405, 2013.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal ofthe Royal Statistical Society, Series B, 58(1):267–288, 1996.

56 / 56

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsityand smoothness via the fused lasso. 67(1):91–108, 2005.

S. van de Geer, P. B uehlmann, Y. Ritov, and R. Dezeure. Onasymptotically optimal confidence regions and tests for high-dimensionalmodels. The Annals of Statistics, 42(3):1166–1202, 2014.

L. Xiao. Dual averaging methods for regularized stochastic learning andonline optimization. In Advances in Neural Information ProcessingSystems 23, 2009.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussiangraphical model. Biometrika, 94(1):19–35, 2007.

C.-H. Zhang. Nearly unbiased variable selection under minimax concavepenalty. The Annals of Statist, 38(2):894–942, 2010.

P. Zhang, A. Saha, and S. V. N. Vishwanathan. Regularized riskminimization by nesterov’s accelerated gradient methods: Algorithmicextensions and empirical studies. CoRR, abs/1011.0472, 2010.

T. Zhang. Some sharp performance bounds for least squares regressionwith l1 regularization. The Annals of Statistics, 37(5):2109–2144, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of theAmerican Statistical Association, 101(476):1418–1429, 2006.

56 / 56

田中利幸. 圧縮センシングの数理. IEICE Fundamentals Review, 4(1):39–47, 2010.

56 / 56

Data & Analytics

Sparse estimation tutorial 2014