akira-tanimoto
Chapter 8: Trace Norm Regularization

8.1 The trace norm
  8.1.1 Applications: matrix completion, multi-task learning, general formulation
8.2 Properties of the trace norm
  8.2.1 Dual and variational characterizations
  8.2.2 Relation to the Frobenius norm
  8.2.3 Variational upper bound
  8.2.4 The prox operator
8.3 Theory
8.4 Optimization
  8.4.1 Iterative reweighting
  8.4.2 Proximal gradient method
  8.4.3 Dual augmented Lagrangian (DAL)
8.1 The trace norm

• Write the singular value decomposition of W ∈ R^{d1×d2} as

  W = U diag(σ_1, ..., σ_d) V^T = Σ_{j=1}^{d} σ_j u_j v_j^T

  with σ_j ≥ 0 and orthonormal singular vectors u_j, v_j.
• The trace norm (also called the nuclear norm) is the sum of the singular values:

  ‖W‖_* = tr(√(W^T W)) = Σ_{j=1}^{d} σ_j(W)

• It is the L1 norm of the vector of singular values (cf. the L1 norm of a vector), and the rank of W is the number of nonzero σ_j, so trace-norm regularization encourages low-rank solutions.
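As an illustrative check of the definition above (numbers below are arbitrary, not from the slides), the trace norm equals the sum of the singular values and the rank equals the number of nonzero ones:

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a rank-2 matrix W = sum_j sigma_j u_j v_j^T with orthonormal u_j, v_j.
U, _ = np.linalg.qr(rng.standard_normal((5, 2)))
V, _ = np.linalg.qr(rng.standard_normal((4, 2)))
W = U @ np.diag([3.0, 1.5]) @ V.T

sigma = np.linalg.svd(W, compute_uv=False)
trace_norm = sigma.sum()           # ||W||_* = sum_j sigma_j(W)
rank = int(np.sum(sigma > 1e-10))  # number of nonzero singular values

print(round(trace_norm, 6), rank)  # -> 4.5 2
```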
8.1.1 Application: matrix completion

• Only the entries Y_{i,j} with (i, j) ∈ Ω of a matrix (e.g., user × item ratings) are observed; the task is to predict the missing entries.
• Assuming the full matrix is (approximately) low-rank, estimate it by trace-norm-regularized empirical risk minimization:

  min_{W ∈ R^{d1×d2}} Σ_{(i,j)∈Ω} l(Y_{i,j}, W_{i,j}) + λ‖W‖_*
8.1.1 Application: multi-task learning

• Learn T related tasks with weight vectors w_1, ..., w_T ∈ R^{d0}, stacked as the columns of W = (w_1, ..., w_T):

  min_{w_1,...,w_T ∈ R^{d0}} Σ_{t=1}^{T} L_t(w_t) + λ‖W‖_*

• With W = Σ_{j=1}^{d} σ_j u_j v_j^T, a low-rank W means the tasks share a small number of latent directions u_j.
8.1.1 General formulation

• Matrix completion, multi-task learning, etc. can all be written with a linear design operator X and a loss f_l:

  min_{W ∈ R^{d1×d2}, b ∈ R} f_l(X(W) + 1_n b) + λ‖W‖_*

  where X(W) = (⟨X_i, W⟩)_{i=1}^{n}.
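A minimal numpy sketch of the design operator X(W) = (⟨X_i, W⟩)_{i=1}^{n} in this general formulation; the data, the squared loss, and all variable names here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d1, d2 = 6, 3, 4
Xs = rng.standard_normal((n, d1, d2))  # design matrices X_1, ..., X_n
W = rng.standard_normal((d1, d2))
b = 0.5

def X_op(W):
    # <X_i, W> = tr(X_i^T W) = elementwise product summed, for each i
    return np.einsum('nij,ij->n', Xs, W)

z = X_op(W) + b * np.ones(n)        # inputs to the loss f_l
y = rng.standard_normal(n)          # toy targets
loss = 0.5 * np.sum((z - y) ** 2)   # squared loss as an example of f_l
```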
8.2 Properties of the trace norm
8.2.1 Dual and variational characterizations

• Dual of the spectral norm:

  ‖W‖_* = max_X ⟨X, W⟩ subject to ‖X‖ ≤ 1

• Semidefinite characterization:

  ‖W‖_* = min_{P,Q} (1/2)(tr(P) + tr(Q)) subject to [ P  W ; W^T  Q ] ⪰ 0

• Factorization characterization:

  ‖W‖_* = min_{U,V} (1/2)(‖U‖_F^2 + ‖V‖_F^2) subject to W = U V^T
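The factorization characterization can be checked numerically: splitting the SVD as U = U0 √Σ, V = V0 √Σ gives a feasible pair whose objective equals the trace norm (an illustrative check, not a proof of optimality):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((5, 4))
U0, s, V0t = np.linalg.svd(W, full_matrices=False)

U = U0 * np.sqrt(s)    # scale each column u_j by sqrt(sigma_j)
V = V0t.T * np.sqrt(s)

feasible = np.allclose(U @ V.T, W)            # W = U V^T
obj = 0.5 * (np.sum(U**2) + np.sum(V**2))     # (||U||_F^2 + ||V||_F^2)/2

print(feasible, np.isclose(obj, s.sum()))     # -> True True
```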
8.2.2 Relation to the Frobenius norm

• Analogous to ‖w‖_1 ≤ √k ‖w‖_2 for a k-sparse vector (Section 5.2), a rank-r matrix W satisfies (Section 8.2)

  ‖W‖_* ≤ √r ‖W‖_F

• The trace norm plays the role of L1 on the singular values and the Frobenius norm the role of L2 → trace-norm regularization is the matrix counterpart of L1 regularization.
8.2.3 Variational upper bound

• As with the variational representation of the L1 norm (Sections 6.1, 6.3), the trace norm admits the representation

  ‖W‖_* = min_Σ (1/2)(tr(W^T Σ† W) + tr(Σ)) subject to Σ ⪰ 0

• Scalar analogue for the L1 norm:

  ‖w‖_1 = (1/2) Σ_{j=1}^{d} min_{η ∈ R^d : η_j ≥ 0} (w_j^2/η_j + η_j)
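The scalar identity above (minimum attained at η_j = |w_j|) can be checked on a grid; this is a rough numerical illustration with an arbitrary test vector, not a proof:

```python
import numpy as np

w = np.array([1.5, -2.0, 0.3])
etas = np.linspace(1e-3, 5.0, 200001)  # dense grid over eta > 0

# For each coordinate, min_{eta>0} (w^2/eta + eta) = 2|w| at eta = |w|.
mins = [np.min(wj**2 / etas + etas) for wj in w]
val = 0.5 * np.sum(mins)

print(np.isclose(val, np.abs(w).sum(), atol=1e-4))  # -> True
```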
8.2.4 The prox operator

• As with the prox operator of the L1 norm (Section 6.2); used by the proximal gradient method in Section 8.4.
• The trace-norm prox shrinks the singular values (singular value thresholding): with the SVD Y = U Σ V^T,

  prox^{tr}_λ(Y) = argmin_{W ∈ R^{d1×d2}} ( (1/2)‖Y − W‖_F^2 + λ‖W‖_* ) = U max(Σ − λ I_d, 0) V^T

  where max is taken elementwise.
• Compare the L1 prox (soft thresholding):

  [prox^{l1}_λ(y)]_j = max(|y_j| − λ, 0) · y_j/|y_j|
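A minimal numpy sketch of the two prox operators just defined, singular value thresholding alongside its L1 (soft-thresholding) analogue; input data and λ are arbitrary:

```python
import numpy as np

def prox_trace(Y, lam):
    # Singular value thresholding: shrink each singular value by lam.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def prox_l1(y, lam):
    # Elementwise soft thresholding, the vector analogue.
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

rng = np.random.default_rng(3)
Y = rng.standard_normal((6, 4))
W = prox_trace(Y, lam=1.0)
# Each singular value of W is the corresponding one of Y shrunk by 1.0.
print(np.linalg.svd(W, compute_uv=False))
```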
8.3 Theory

• Error bounds for the trace-norm-regularized estimator of a (near) low-rank matrix W, under noise level ν (excerpt from Negahban and Wainwright below).
NEGAHBAN AND WAINWRIGHT

(1/n) Σ_{i=1}^{n} ξ_i √R X^{(i)} √C, and secondly, we need to understand how to choose the parameter r so as to achieve the tightest possible bound. When Θ* is exactly low-rank, then it is obvious that we should choose r = rank(Θ*), so that the approximation error vanishes—more specifically, so that Σ_{j=r+1}^{d_r} σ_j(√R Θ* √C) = 0. Doing so yields the following result:

Corollary 1 (Exactly low-rank matrices) Suppose that the noise sequence {ξ_i} is i.i.d., zero-mean and sub-exponential, and Θ* has rank at most r, Frobenius norm at most 1, and spikiness at most α_sp(Θ*) ≤ α*. If we solve the SDP (7) with λ_n = 4ν √(d log d / n), then there is a numerical constant c'_1 such that

  |||Θ̂ − Θ*|||²_{ω(F)} ≤ c'_1 (ν² ∨ L²) (α*)² (r d log d)/n + c_1 (α* L)²/n    (10)

with probability greater than 1 − c_2 exp(−c_3 log d).

Note that this rate has a natural interpretation: since a rank r matrix of dimension d_r × d_c has roughly r(d_r + d_c) free parameters, we require a sample size of this order (up to logarithmic factors) so as to obtain a controlled error bound. An interesting feature of the bound (10) is the term ν² ∨ 1 = max{ν², 1}, which implies that we do not obtain exact recovery as ν → 0. As we discuss at more length in Section 3.4, under the mild spikiness condition that we have imposed, this behavior is unavoidable due to lack of identifiability within a certain radius, as specified in the set C. For instance, consider the matrix Θ* and the perturbed version Θ̃ = Θ* + (1/√(d_r d_c)) e_1 e_1^T. With high probability, we have X_n(Θ*) = X_n(Θ̃), so that the observations—even if they were noiseless—fail to distinguish between these two models. These types of examples, leading to non-identifiability, cannot be overcome without imposing fairly restrictive matrix incoherence conditions, as we discuss at more length in Section 3.4.

As with past work (Candes and Plan, 2010; Keshavan et al., 2010b), Corollary 1 applies to the case of matrices that have exactly rank r. In practical settings, it is more realistic to assume that the unknown matrix is not exactly low-rank, but rather can be well approximated by a matrix with low rank. One way in which to formalize this notion is via the ℓ_q-"ball" of matrices

  B_q(ρ_q) := { Θ ∈ R^{d_r × d_c} | Σ_{j=1}^{min{d_r, d_c}} |σ_j(√R Θ √C)|^q ≤ ρ_q }.    (11)

For q = 0, this set corresponds to the set of matrices with rank at most r = ρ_0, whereas for values q ∈ (0, 1], it consists of matrices whose (weighted) singular values decay at a relatively fast rate. By applying Theorem 2 to this matrix family, we obtain the following corollary:

Corollary 2 (Estimation of near low-rank matrices) Suppose that the noise {ξ_i} is zero-mean and sub-exponential. Consider a matrix Θ* ∈ B_q(ρ_q) with spikiness at most α_sp(Θ*) ≤ α* and Frobenius norm at most one. With the same choice of λ_n as in Corollary 1, there is a universal constant c'_1 such that

  |||Θ̂ − Θ*|||²_{ω(F)} ≤ c_1 ρ_q [ (ν² ∨ L²) (α*)² (d log d)/n ]^{1−q/2} + c_1 (α* L)²/n    (12)

with probability greater than 1 − c_2 exp(−c_3 log d).
• Key rate: the squared error of the estimate Ŵ scales as r d log d / n.
8.4 Optimization
8.4.1 Iterative reweighting

• Based on the variational upper bound of Section 8.2.3 (cf. Section 6.3 for the L1 case): alternately minimize over W with Σ fixed, and over Σ with W fixed.
• For fixed W = W^t, the optimal Σ has the closed form

  Σ^{t+1} = (W^t (W^t)^T)^{1/2}
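The update above sets Σ to the matrix square root of W W^T; an illustrative numpy check (with a random W) confirms that this choice makes the variational bound of Section 8.2.3 tight:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal((4, 6))

# Matrix square root via the SVD W = U diag(s) V^T:
# (W W^T)^{1/2} = U diag(s) U^T.
U, s, _ = np.linalg.svd(W, full_matrices=False)
Sigma = U @ np.diag(s) @ U.T

nuc = s.sum()  # ||W||_*
# Both terms of the variational objective equal ||W||_* at this Sigma,
# so (1/2)(tr(W^T Sigma^+ W) + tr(Sigma)) = ||W||_*.
print(np.isclose(np.trace(Sigma), nuc))                            # -> True
print(np.isclose(np.trace(W.T @ np.linalg.pinv(Sigma) @ W), nuc))  # -> True
```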
8.4.2 Proximal gradient method

• Iterate a gradient step on the smooth loss L followed by the trace-norm prox:

  W^{t+1} = prox^{tr}_{λη_t}(W^t − η_t ∇L(W^t))

• Equivalently, in two steps:

  W^{t+1/2} = W^t − η_t ∇L(W^t)
  W^{t+1} = prox^{tr}_{λη_t}(W^{t+1/2}) = U max(Σ^{t+1/2} − λη_t I_d, 0) V^T

  where W^{t+1/2} = U Σ^{t+1/2} V^T is the SVD.
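A minimal sketch of this iteration for matrix completion, assuming a squared loss on the observed entries (the data, step size, and λ below are illustrative choices):

```python
import numpy as np

def prox_trace(Y, lam):
    # Singular value thresholding (the trace-norm prox from Section 8.2.4).
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(5)
d1, d2, lam, eta = 8, 6, 0.5, 0.5
Y = rng.standard_normal((d1, d2))
mask = rng.random((d1, d2)) < 0.5   # observed entries Omega

def objective(W):
    return (0.5 * np.sum(mask * (W - Y) ** 2)
            + lam * np.linalg.svd(W, compute_uv=False).sum())

W = np.zeros((d1, d2))
for _ in range(50):
    grad = mask * (W - Y)                       # gradient of the smooth part
    W = prox_trace(W - eta * grad, eta * lam)   # proximal gradient step

# With eta below 1/L the objective decreases monotonically,
# so it cannot exceed its value at the starting point W = 0.
print(objective(W) <= objective(np.zeros((d1, d2))))  # -> True
```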
8.4.2 Proximal gradient method (efficiency)

• A full SVD at every iteration is expensive.
• Only the singular values larger than the threshold λ survive the prox, so it suffices to compute the top singular triplets.
• Strategy: compute the top-k singular values; if the k-th one still exceeds the threshold, double k (k ← 2k) and recompute.
• The iterate therefore stays in low-rank form:

  W^t = U max(Σ − λ I, 0) V^T
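A sketch of the doubling heuristic just described, using a truncated SVD routine; the function and parameter names are illustrative, and the final comparison against a full SVD is only a sanity check:

```python
import numpy as np
from scipy.sparse.linalg import svds

def prox_trace_topk(Y, lam, k0=2):
    # Compute only the top-k singular triplets; double k until the k-th
    # singular value falls below the threshold lam (k <- 2k).
    k = k0
    while True:
        U, s, Vt = svds(Y, k=k)          # top-k singular triplets
        order = np.argsort(s)[::-1]      # svds does not guarantee order
        U, s, Vt = U[:, order], s[order], Vt[order]
        if s[-1] < lam or k >= min(Y.shape) - 1:
            break
        k = min(2 * k, min(Y.shape) - 1)
    # Values at or below lam are zeroed by the shrinkage anyway.
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(6)
Y = rng.standard_normal((30, 20))
s_full = np.linalg.svd(Y, compute_uv=False)
lam = float(s_full[3])                   # threshold keeping the top 3 values

W = prox_trace_topk(Y, lam)
U_f, s_f, Vt_f = np.linalg.svd(Y, full_matrices=False)
W_full = U_f @ np.diag(np.maximum(s_f - lam, 0.0)) @ Vt_f
print(np.allclose(W, W_full, atol=1e-5))  # -> True
```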
8.4.3 Dual augmented Lagrangian (DAL)

• Dual of the trace-norm-regularized problem

  min_{W ∈ R^{d1×d2}, b ∈ R} f_l(X(W) + 1_n b) + λ‖W‖_*

  is

  min_{α ∈ R^n} f_l^*(−α) + δ_{‖·‖≤λ}(X^T(α)) + δ_{·=0}(1_n^T α)

  where δ_{‖·‖≤λ} is the indicator of the spectral-norm ball and X^T(α) = Σ_{i=1}^{n} α_i X_i is the adjoint of X(W) = (⟨X_i, W⟩)_{i=1}^{n}.
• Augmented Lagrangian function:

  φ_t(α) = f_l^*(−α) + (1/(2η_t)) ‖prox^{tr}_{λη_t}(W^t + η_t X^T(α))‖_F^2 + (1/(2η_t)) (b^t + η_t 1_n^T α)^2
8.4.3 DAL (comparison with the L1 case)

• Same structure as DAL for L1 regularization: replace the L1 prox with the trace-norm prox (plus the bias term).

  φ_t(α) = f_l^*(−α) + (1/(2η_t)) ‖prox^{tr}_{λη_t}(W^t + η_t X^T(α))‖_F^2 + (1/(2η_t)) (b^t + η_t 1_n^T α)^2

  φ_t(α) = f_l^*(−α) + (1/(2η_t)) ‖prox^{l1}_{λη_t}(w^t + η_t X^T α)‖_2^2
8.4.3 DAL (updates)

• Inner step: (approximately) minimize the augmented Lagrangian over the dual variable α:

  α^{t+1} ≈ argmin_{α ∈ R^n} φ_t(α)

  φ_t(α) = f_l^*(−α) + (1/(2η_t)) ‖prox^{tr}_{λη_t}(W^t + η_t X^T(α))‖_F^2 + (1/(2η_t)) (b^t + η_t 1_n^T α)^2

• Outer step: update the primal variables via the prox:

  W^{t+1} = prox^{tr}_{λη_t}(W^t + η_t X^T(α^{t+1}))
  b^{t+1} = b^t + η_t 1_n^T α^{t+1}
Summary

• Trace-norm regularization estimates a low-rank matrix W = Σ_{j=1}^{d} σ_j u_j v_j^T by penalizing the sum of the singular values σ_j.
• It is the matrix counterpart of L1 regularization; its variational representations, prox operator, and optimization methods (iterative reweighting, proximal gradient, DAL) all carry over from the L1 case.