Chapter 8

MLP Reading Group, Sparse Estimation Chapter 8: Trace Norm Regularization


Page 1: MLP Reading Group, Sparse Estimation Chapter 8: Trace Norm Regularization

Chapter 8

Page 2

8.1

8.1.1

8.1.2

8.1.3

8.2

8.2.1

8.2.2

8.2.3

8.2.4 prox


8.3

8.4

8.4.1

8.4.2

8.4.3

Page 3


Page 4

8.1

• The trace norm (also called the nuclear norm) is the Schatten 1-norm: the sum of the singular values of W.

• cf) the L1 norm of a vector is the sum of the absolute values of its entries.

• U, V below are the orthogonal factors of the singular value decomposition of W.

W = U \,\mathrm{diag}(\sigma_1, \dots, \sigma_d)\, V^T = \sum_{j=1}^d \sigma_j u_j v_j^T

\|W\|_* = \mathrm{tr}\!\left(\sqrt{W^T W}\right) = \sum_{j=1}^d \sigma_j(W)

The singular values satisfy \sigma_j \ge 0, with corresponding singular vectors u_j, v_j.
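The two expressions for the trace norm (sum of singular values, and tr√(W^T W)) can be checked numerically; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 4))

# Trace (nuclear) norm: the sum of the singular values of W.
sigma = np.linalg.svd(W, compute_uv=False)
trace_norm = sigma.sum()

# Equivalent form tr(sqrt(W^T W)): the eigenvalues of W^T W are the
# squared singular values of W, so the trace of its matrix square root
# is again the sum of the singular values.
eigvals = np.linalg.eigvalsh(W.T @ W)
trace_norm_alt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
```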

Page 5

• In the expansion below, W is low-rank exactly when only a few of the coefficients \sigma_j are nonzero.

• Penalizing \sum_j \sigma_j (the trace norm) therefore acts like the L1 norm on the vector of singular values: it drives many \sigma_j to zero and induces a low-rank W.

W = \sum_{j=1}^d \sigma_j u_j v_j^T

Page 6

8.1.1

• Matrix completion: only the entries (i, j) in an observed index set \Omega of the matrix Y are available (e.g. user × item ratings).

• Minimizing a loss on the observed entries with a trace norm penalty fills in the rest with a low-rank estimate:

\min_{W \in \mathbb{R}^{d_1 \times d_2}} \sum_{(i,j) \in \Omega} \ell(Y_{i,j}, W_{i,j}) + \lambda \|W\|_*

where Y_{i,j} denotes the observed (i, j) entry.
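As a concrete instance of this objective with squared loss \ell, the masked objective can be evaluated as below. A sketch: the names `completion_objective`, `mask` (for \Omega) and `lam` (for \lambda) are mine.

```python
import numpy as np

def completion_objective(W, Y, mask, lam):
    """0.5 * squared error over the observed set Omega plus lam * ||W||_*."""
    resid = (W - Y)[mask]
    trace_norm = np.linalg.svd(W, compute_uv=False).sum()
    return 0.5 * np.sum(resid ** 2) + lam * trace_norm

rng = np.random.default_rng(1)
Y = rng.standard_normal((6, 5))      # the partially observed matrix
mask = rng.random((6, 5)) < 0.5      # True on the observed entries (i, j) in Omega
obj_zero = completion_objective(np.zeros((6, 5)), Y, mask, lam=0.1)
```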

Page 7

8.1.1

• Multi-task learning: collect the T task weight vectors w_1, \dots, w_T as the columns of W.

• The trace norm couples the tasks: in the expansion below the tasks share the common factors u_j, and sparsity of the \sigma_j means the tasks lie in a common low-dimensional subspace.

\min_{w_1, \dots, w_T \in \mathbb{R}^{d_0}} \sum_{t=1}^T L_t(w_t) + \lambda \|W\|_*

W = \sum_{j=1}^d \sigma_j u_j v_j^T

Page 8

8.1.1

• Matrix completion, multi-task learning, etc. are all instances of one regularized minimization problem:

\min_{W \in \mathbb{R}^{d_1 \times d_2},\, b \in \mathbb{R}} f_\ell(\mathcal{X}(W) + \mathbf{1}_n b) + \lambda \|W\|_*

where \mathcal{X}(W) = (\langle X_i, W \rangle)_{i=1}^n is a linear observation operator.

Page 9

8.2

• 8.2.1: three equivalent characterizations of the trace norm.

• The first is duality with the spectral norm \|X\|; the second is a semidefinite program; the third is a minimum over factorizations W = UV^T.

\|W\|_* = \max_X \, \langle X, W \rangle \quad \text{subject to } \|X\| \le 1

\|W\|_* = \min_{P, Q} \, \frac{1}{2}\left(\mathrm{tr}(P) + \mathrm{tr}(Q)\right) \quad \text{subject to } \begin{pmatrix} P & W \\ W^T & Q \end{pmatrix} \succeq 0

\|W\|_* = \min_{U, V} \, \frac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) \quad \text{subject to } W = UV^T
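The minimum in the factorization characterization is attained by the balanced factorization built from the SVD, which gives a quick numerical check (a sketch; `A`, `B` are my names for the two factors):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((5, 4))
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Balanced factorization W = A B^T with A = U diag(sqrt(s)), B = V diag(sqrt(s)).
A = U * np.sqrt(s)
B = Vt.T * np.sqrt(s)

# At this factorization, (1/2)(||A||_F^2 + ||B||_F^2) equals the trace norm,
# since each squared Frobenius norm sums the singular values once.
half_sum = 0.5 * (np.linalg.norm(A, "fro") ** 2 + np.linalg.norm(B, "fro") ** 2)
trace_norm = s.sum()
```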

Page 10

8.2.2

• cf) Section 5.2: for a k-sparse vector, the L1 norm is at most \sqrt{k} times the L2 norm.

• Inequality (8.2) is the matrix analogue: for a rank-r matrix, the trace norm is at most \sqrt{r} times the Frobenius norm.

• The rank r plays the role of the sparsity level k, and the Frobenius norm plays the role of the L2 norm; this is the key step in transferring the L1 analysis to the trace norm.

\|W\|_* \le \sqrt{r}\, \|W\|_F

\|w\|_1 \le \sqrt{k}\, \|w\|_2
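Both inequalities are easy to verify numerically; a sketch for a random rank-r matrix and a k-sparse vector:

```python
import numpy as np

rng = np.random.default_rng(3)

# Rank-r matrix: product of d1 x r and r x d2 factors.
r = 3
W = rng.standard_normal((8, r)) @ rng.standard_normal((r, 6))
trace_norm = np.linalg.svd(W, compute_uv=False).sum()
frob = np.linalg.norm(W, "fro")

# k-sparse vector analogue.
k = 3
w = np.zeros(10)
w[:k] = rng.standard_normal(k)
l1, l2 = np.abs(w).sum(), np.linalg.norm(w)
```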

Page 11

8.2.3

• cf) Section 6.1: the variational representation of the L1 norm (shown below for comparison).

• cf) Section 6.3: this representation underlies the iterative algorithm of Section 8.4.1.

\|W\|_* = \min_{\Sigma \succeq 0} \, \frac{1}{2}\left(\mathrm{tr}(W^T \Sigma^{\dagger} W) + \mathrm{tr}(\Sigma)\right)

\|w\|_1 = \frac{1}{2} \sum_{j=1}^d \min_{\eta_j \ge 0} \left( \frac{w_j^2}{\eta_j} + \eta_j \right)
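For the matrix identity the minimum is attained at \Sigma = (W W^T)^{1/2}, and plugging that in recovers the trace norm exactly. A numerical sketch (I use a square W so that \Sigma is invertible and no pseudo-inverse is needed):

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal((4, 4))

# Optimal Sigma = (W W^T)^{1/2}, via the eigendecomposition of W W^T.
evals, evecs = np.linalg.eigh(W @ W.T)
Sigma = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

# (1/2)(tr(W^T Sigma^{-1} W) + tr(Sigma)) at the optimal Sigma:
# tr(Sigma^{-1} W W^T) = tr(Sigma^{-1} Sigma^2) = tr(Sigma) = ||W||_*.
value = 0.5 * (np.trace(W.T @ np.linalg.inv(Sigma) @ W) + np.trace(Sigma))
trace_norm = np.linalg.svd(W, compute_uv=False).sum()
```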

Page 12

8.2.4 prox

• cf) Section 6.2: the prox of the L1 norm is soft thresholding (shown below).

• The trace norm prox applies the same soft thresholding to the singular values; it is the building block of the algorithms in Section 8.4.

• Computing it requires one SVD of the input, Y = U \Sigma V^T.

\mathrm{prox}^{\mathrm{tr}}_{\lambda}(Y) = \operatorname*{argmin}_{W \in \mathbb{R}^{d_1 \times d_2}} \left( \frac{1}{2}\|Y - W\|_F^2 + \lambda \|W\|_* \right) = U \max(\Sigma - \lambda I_d, 0) V^T

where the max is taken elementwise on the diagonal. cf) the L1 prox:

\left[\mathrm{prox}^{\ell_1}_{\lambda}(y)\right]_j = \max(|y_j| - \lambda, 0)\, \frac{y_j}{|y_j|}
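The trace norm prox (singular value thresholding) and its L1 analogue (soft thresholding) each take a few lines; a sketch:

```python
import numpy as np

def prox_trace(Y, lam):
    """prox of lam * ||.||_* at Y: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def prox_l1(y, lam):
    """prox of lam * ||.||_1 at y: elementwise soft thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

rng = np.random.default_rng(5)
Y = rng.standard_normal((5, 4))
W = prox_trace(Y, 0.5)
```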

Page 13

8.3

• Statistical analysis of the trace-norm-regularized estimate of W (excerpt from Negahban and Wainwright below).

• The noise level \nu sets the regularization parameter \lambda_n and appears in the error bound.

NEGAHBAN AND WAINWRIGHT (excerpt, JMLR p. 1672):

... \frac{1}{n}\sum_{i=1}^n \xi_i \sqrt{R}\, X^{(i)} \sqrt{C}, and secondly, we need to understand how to choose the parameter r so as to achieve the tightest possible bound. When \Theta^* is exactly low-rank, then it is obvious that we should choose r = \mathrm{rank}(\Theta^*), so that the approximation error vanishes, more specifically, so that \sum_{j=r+1}^{d_r} \sigma_j(\sqrt{R}\,\Theta^* \sqrt{C}) = 0. Doing so yields the following result:

Corollary 1 (Exactly low-rank matrices) Suppose that the noise sequence \{\xi_i\} is i.i.d., zero-mean and sub-exponential, and \Theta^* has rank at most r, Frobenius norm at most 1, and spikiness at most \alpha_{\mathrm{sp}}(\Theta^*) \le \alpha^*. If we solve the SDP (7) with \lambda_n = 4\nu\sqrt{(d \log d)/n} then there is a numerical constant c_1' such that

|||\widehat{\Theta} - \Theta^*|||^2_{\omega(F)} \le c_1' (\nu^2 \vee L^2)(\alpha^*)^2 \frac{r d \log d}{n} + c_1 \frac{(\alpha^* L)^2}{n}   (10)

with probability greater than 1 - c_2 \exp(-c_3 \log d).

Note that this rate has a natural interpretation: since a rank r matrix of dimension d_r \times d_c has roughly r(d_r + d_c) free parameters, we require a sample size of this order (up to logarithmic factors) so as to obtain a controlled error bound. An interesting feature of the bound (10) is the term \nu^2 \vee 1 = \max\{\nu^2, 1\}, which implies that we do not obtain exact recovery as \nu \to 0. As we discuss at more length in Section 3.4, under the mild spikiness condition that we have imposed, this behavior is unavoidable due to lack of identifiability within a certain radius, as specified in the set C. For instance, consider the matrix \Theta^* and the perturbed version \widetilde{\Theta} = \Theta^* + \frac{1}{\sqrt{d_r d_c}} e_1 e_1^T. With high probability, we have X_n(\Theta^*) = X_n(\widetilde{\Theta}), so that the observations, even if they were noiseless, fail to distinguish between these two models. These types of examples, leading to non-identifiability, cannot be overcome without imposing fairly restrictive matrix incoherence conditions, as we discuss at more length in Section 3.4.

As with past work (Candes and Plan, 2010; Keshavan et al., 2010b), Corollary 1 applies to the case of matrices that have exactly rank r. In practical settings, it is more realistic to assume that the unknown matrix is not exactly low-rank, but rather can be well approximated by a matrix with low rank. One way in which to formalize this notion is via the \ell_q-"ball" of matrices

B_q(\rho_q) := \left\{ \Theta \in \mathbb{R}^{d_r \times d_c} \,\middle|\, \sum_{j=1}^{\min\{d_r, d_c\}} |\sigma_j(\sqrt{R}\,\Theta \sqrt{C})|^q \le \rho_q \right\}.   (11)

For q = 0, this set corresponds to the set of matrices with rank at most r = \rho_0, whereas for values q \in (0, 1], it consists of matrices whose (weighted) singular values decay at a relatively fast rate. By applying Theorem 2 to this matrix family, we obtain the following corollary:

Corollary 2 (Estimation of near low-rank matrices) Suppose that the noise \{\xi_i\} is zero-mean and sub-exponential. Consider a matrix \Theta^* \in B_q(\rho_q) with spikiness at most \alpha_{\mathrm{sp}}(\Theta^*) \le \alpha^*, and Frobenius norm at most one. With the same choice of \lambda_n as in Corollary 1, there is a universal constant c_1' such that

|||\widehat{\Theta} - \Theta^*|||^2_{\omega(F)} \le c_1 \rho_q \left( (\nu^2 \vee L^2)(\alpha^*)^2 \frac{d \log d}{n} \right)^{1 - q/2} + c_1 \frac{(\alpha^* L)^2}{n}   (12)

with probability greater than 1 - c_2 \exp(-c_3 \log d).

• Key rate: the squared error of \widehat{W} scales as \frac{r d \log d}{n}.


Page 14

8.4

• 8.4.1: an iterative algorithm based on the variational upper bound (cf. Sections 6.3 and 8.2.3).

• Alternate between updating the auxiliary matrix \Sigma and re-solving for W:

\Sigma^{t+1} = \left(W^t (W^t)^T\right)^{1/2}
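A minimal, runnable sketch of this alternating scheme, under assumptions of my own: I take the simplest loss \frac{1}{2}\|Y - W\|_F^2 so that the W-step has a closed form, and add a small \varepsilon I inside the square root (a standard smoothing, not in the slide) to keep \Sigma invertible. The function names are mine.

```python
import numpy as np

def msqrt(M):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    evals, evecs = np.linalg.eigh(M)
    return (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

def mm_trace_prox(Y, lam, eps=1e-8, iters=300):
    """Alternating minimization of
       0.5*||Y - W||_F^2 + (lam/2)*(tr((W W^T + eps I) Sigma^{-1}) + tr(Sigma)),
    an eps-smoothed version of 0.5*||Y - W||_F^2 + lam*||W||_*."""
    d1 = Y.shape[0]
    W = Y.copy()
    for _ in range(iters):
        # Sigma-step: closed-form minimizer (W W^T + eps I)^{1/2}.
        Sigma = msqrt(W @ W.T + eps * np.eye(d1))
        # W-step: the quadratic in W is minimized where (I + lam Sigma^{-1}) W = Y.
        W = np.linalg.solve(np.eye(d1) + lam * np.linalg.inv(Sigma), Y)
    return W

rng = np.random.default_rng(7)
Y = rng.standard_normal((5, 4))
W_mm = mm_trace_prox(Y, lam=0.5)
```

As \varepsilon \to 0 the iterate approaches the singular value thresholding solution of Section 8.2.4 for the same Y and \lambda.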

Page 15

8.4.2

• One iteration: a gradient step on the smooth loss \hat{L}, followed by the trace norm prox.

• The prox step is singular value thresholding, so each iteration costs one SVD.

W^{t+1} = \mathrm{prox}^{\mathrm{tr}}_{\lambda\eta_t}\!\left(W^t - \eta_t \nabla \hat{L}(W^t)\right)

Written in two steps, with W^{t+1/2} = U \Sigma V^T its SVD:

W^{t+1/2} = W^t - \eta_t \nabla \hat{L}(W^t)

W^{t+1} = \mathrm{prox}^{\mathrm{tr}}_{\lambda\eta_t}(W^{t+1/2}) = U \max(\Sigma - \lambda\eta_t I_d, 0) V^T
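Putting the pieces together for the matrix completion problem of Section 8.1.1 with squared loss (my choice of loss): the gradient of the smooth part is the residual on the observed entries, whose Lipschitz constant is 1, so a fixed step size \eta = 1 is safe. A sketch; the names `complete`, `mask`, `lam` are mine.

```python
import numpy as np

def prox_trace(Y, lam):
    """Singular value thresholding (the trace norm prox of Section 8.2.4)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def complete(Y, mask, lam, eta=1.0, iters=300):
    """Proximal gradient for 0.5*||mask*(W - Y)||_F^2 + lam*||W||_*."""
    W = np.zeros_like(Y)
    for _ in range(iters):
        grad = (W - Y) * mask            # gradient of the masked squared loss
        W = prox_trace(W - eta * grad, eta * lam)
    return W

rng = np.random.default_rng(6)
truth = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))  # rank-2 target
mask = rng.random((8, 8)) < 0.7                                    # observed set
W_hat = complete(truth, mask, lam=0.1)
```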

Page 16

8.4.2

• Only singular values above the threshold \lambda\eta_t survive the prox, so a full SVD is wasteful.

• Compute just the top k singular values and vectors (partial SVD).

• If all k computed singular values exceed the threshold, some surviving ones may have been missed: double k (k ← 2k) and recompute.

W^{t+1} = U \max(\Sigma - \lambda\eta_t I_d, 0) V^T

Page 17

8.4.3 DAL

The primal problem

\min_{W \in \mathbb{R}^{d_1 \times d_2},\, b \in \mathbb{R}} f_\ell(\mathcal{X}(W) + \mathbf{1}_n b) + \lambda \|W\|_*

has the dual

\min_{\alpha \in \mathbb{R}^n} f_\ell^*(-\alpha) + \delta_{\|\cdot\| \le \lambda}(\mathcal{X}^T(\alpha)) + \delta_{\{0\}}(\mathbf{1}_n^T \alpha)

where \mathcal{X}^T(\alpha) = \sum_{i=1}^n \alpha_i X_i and \mathcal{X}(W) = (\langle X_i, W \rangle)_{i=1}^n. The augmented Lagrangian gives the inner objective

\varphi_t(\alpha) = f_\ell^*(-\alpha) + \frac{1}{2\eta_t}\left\| \mathrm{prox}^{\mathrm{tr}}_{\lambda\eta_t}\!\left(W^t + \eta_t \mathcal{X}^T(\alpha)\right) \right\|_F^2 + \frac{1}{2\eta_t}\left(b^t + \eta_t \mathbf{1}_n^T \alpha\right)^2

Page 18

8.4.3 DAL

• The same construction as DAL for L1 regularization, with singular value thresholding in place of soft thresholding.

• The inner objective is differentiable, and its gradient involves the prox.

\varphi_t(\alpha) = f_\ell^*(-\alpha) + \frac{1}{2\eta_t}\left\| \mathrm{prox}^{\mathrm{tr}}_{\lambda\eta_t}\!\left(W^t + \eta_t \mathcal{X}^T(\alpha)\right) \right\|_F^2 + \frac{1}{2\eta_t}\left(b^t + \eta_t \mathbf{1}_n^T \alpha\right)^2

cf) the L1 case:

\varphi_t(\alpha) = f_\ell^*(-\alpha) + \frac{1}{2\eta_t}\left\| \mathrm{prox}^{\ell_1}_{\lambda\eta_t}\!\left(w^t + \eta_t X^T \alpha\right) \right\|_2^2

Page 19

8.4.3 DAL

• Inner step: approximately minimize \varphi_t over the dual variable \alpha.

• Outer step: update W and b through the trace norm prox.

\alpha^{t+1} \approx \operatorname*{argmin}_{\alpha \in \mathbb{R}^n} \varphi_t(\alpha)

W^{t+1} = \mathrm{prox}^{\mathrm{tr}}_{\lambda\eta_t}\!\left(W^t + \eta_t \mathcal{X}^T(\alpha^{t+1})\right)

b^{t+1} = b^t + \eta_t \mathbf{1}_n^T \alpha^{t+1}

Page 20

• Trace norm regularization is the matrix counterpart of L1 regularization: it penalizes \sum_j \sigma_j and induces a low-rank W.

• In the expansion below, sparsity of the coefficients \sigma_j over the index j is exactly low rank.

• Algorithms: the variational upper bound, proximal gradient (singular value thresholding), and DAL.

W = \sum_{j=1}^d \sigma_j u_j v_j^T