An Overview of Advances In Sparse Representation Zhangyang (Atlas) Wang


Page 1

An Overview of Advances In Sparse Representation

Zhangyang (Atlas) Wang

Page 2

Why Sparsity Is (Still) Important: "scarcity makes things precious" (物以稀为贵)

Page 3

Definition: Sparsity

• A signal x ∈ ℂ^n is sparse when most of its entries are zero.

• Formally, x ∈ ℂ^n is s-sparse if it has at most s nonzero entries. One can think of an s-sparse signal as having only s degrees of freedom.

• In many cases, x is only approximately sparse (e.g., its magnitudes decay exponentially). Fortunately, most s-sparse results still generalize to this setting.
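A minimal numpy sketch (my own illustration, not from the slides) contrasting an exactly s-sparse vector with an approximately sparse one:

```python
# Minimal sketch (not from the slides): exact vs. approximate sparsity.
import numpy as np

x_sparse = np.array([0.0, 3.0, 0.0, -0.5, 0.0, 7.0, 0.0, 0.0])
print(np.count_nonzero(x_sparse))        # 3 -> x_sparse is 3-sparse

x_approx = 2.0 ** -np.arange(8)          # exponentially decaying magnitudes
energy = x_approx ** 2
print(energy[:3].sum() / energy.sum())   # ~0.98: a 3-term approximation already
                                         # captures almost all of the energy
```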

Page 4

Extension: Low-Dimensionality

• Low rank:
  • A matrix X ∈ ℂ^{m×n} has low rank if its rank r is (substantially) less than the ambient dimension min(m, n).
  • One can think of a rank-r matrix as having only r(m + n − r) degrees of freedom, as this is the dimension of the tangent space to the manifold of rank-r matrices.

• Higher-dimensional: tensors. See "Tensor Decomposition" and "Tensor Completion".
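A small numpy illustration (mine, with arbitrary sizes) of the rank-r degrees-of-freedom count and the truncated SVD behind it:

```python
# Sketch (mine): a rank-r matrix, its r(m + n - r) degrees of freedom,
# and its exact recovery from a truncated SVD.
import numpy as np

m, n, r = 50, 40, 3
X = np.random.randn(m, r) @ np.random.randn(r, n)   # rank r by construction
print(np.linalg.matrix_rank(X))                      # 3
print(r * (m + n - r))                               # 261 degrees of freedom,
                                                     # far fewer than m*n = 2000

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_r = (U[:, :r] * s[:r]) @ Vt[:r, :]                 # best rank-r approximation
print(np.allclose(X, X_r))                           # True
```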

Page 5

Example: Compressive Sensing

• An underdetermined system y = Ax: y ∈ ℂ^m, A ∈ ℂ^{m×n}, x ∈ ℂ^n, and m ≪ n.

• It does not take a professor to tell you that a unique solution x cannot be obtained in general.

• But it took many years before we realized that x can be recovered from its highly incomplete measurements y by tractable algorithms…

• … given that x is known to be sparse and A satisfies a suitable condition, such as mutual incoherence or the RIP condition.
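A hedged sketch (my own code, not from the talk) of such a recovery: ℓ1 minimization (basis pursuit) posed as a linear program over the split x = x⁺ − x⁻ and solved with scipy:

```python
# Sketch (mine): recover a sparse x from m << n random measurements y = A x
# by l1 minimization: min ||x||_1 s.t. A x = y, posed as a linear program.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, s = 200, 60, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)       # random Gaussian sensing matrix
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
y = A @ x_true

# Variables z = [x_plus; x_minus] >= 0, objective 1^T z, constraint A(x+ - x-) = y.
res = linprog(c=np.ones(2 * n), A_eq=np.hstack([A, -A]), b_eq=y,
              bounds=(0, None), method="highs")
x_hat = res.x[:n] - res.x[n:]
print(np.max(np.abs(x_hat - x_true)))              # close to 0 when recovery succeeds
```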

Page 6

Example: Sparse Coding-based Classification (SRC)

• Deep learning has ruled ImageNet since 2012…

• But do you know who ruled before deep learning stepped in?

2010 Winner: NEC-UIUC team

Driven by sparsity (#parameters: less than 1% of AlexNet)

Page 7

Applications of sparsity are found everywhere in science and technology

• Image, Speech, and Video Processing

• Pattern Classification/Clustering

• Matrix Completion

• Robust PCA

• System Identification

• Sensor Network Fusion

• MRI Phase Retrieval

• Quantum-State Tomography

The list goes on and on, and keeps on growing…

Page 8

The Insights of Sparsity

• Sparsity is important for both predictive accuracy and model interpretation.

• It yields an "organized" and "informative" representation, and usually implies statistical robustness.

• Example: a simple probability distribution vector

• Example: signal & noise under the Fourier transform
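A small numpy illustration (mine) of the Fourier example: a few sinusoids are sparse under the DFT, while white noise spreads its energy over all frequencies:

```python
# Sketch (mine): sparsity of a clean signal vs. noise in the Fourier domain.
import numpy as np

n = 1024
t = np.arange(n)
signal = np.sin(2 * np.pi * 50 * t / n) + 0.5 * np.sin(2 * np.pi * 120 * t / n)
noise = np.random.randn(n)

def energy_in_top_k(x, k=10):
    e = np.abs(np.fft.fft(x)) ** 2
    return np.sort(e)[-k:].sum() / e.sum()

print(energy_in_top_k(signal))   # ~1.0: almost all energy sits in a few DFT bins
print(energy_in_top_k(noise))    # small: energy is spread over all 1024 bins
```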

Page 9

The Insights of Sparsity

• Consider what sparsity means (informally): given a class of objects, if you can create a model in which the objects are represented much more compactly while still preserving high fidelity to the originals…

• … it means you have created a great model, in the sense that the redundancy of the original objects is reduced without losing information.

Page 10

The Insights of Sparsity

• Many computational benefits for algorithm speed, memory, and storage:

  • In SVMs, many algorithmic approaches exploit sparsity for fast optimization.

  • In LASSO-style regression, sparsity makes it possible to compute full regularization paths (a small outside example follows below).

  • ……
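A concrete outside example (not from the talk) of the regularization-path point, using scikit-learn's lasso_path:

```python
# Outside example (not from the talk): tracing the full LASSO regularization
# path; the solutions stay sparse along the whole path.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))
w_true = np.zeros(30)
w_true[:4] = [3.0, -2.0, 1.5, 1.0]                 # sparse ground-truth weights
y = X @ w_true + 0.1 * rng.standard_normal(100)

alphas, coefs, _ = lasso_path(X, y, n_alphas=50)   # coefs: (n_features, n_alphas)
active = np.count_nonzero(coefs, axis=0)           # active-set size at each alpha
print(active[::10])                                # grows as the penalty decreases
```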

Page 11

Sparse Recovery Algorithms: "how many ways are there to write the character 茴 in 茴香豆 (fennel beans)?" (i.e., many algorithms for one and the same problem)

Page 12

Problem Formulation
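The slide's own equations did not survive the export; as a reference for the later mentions of ℓ1-min (P1) and LASSO, the standard formulations are (my reconstruction):

```latex
% Standard sparse-recovery formulations (reconstruction, not the slide's own rendering)
(P_1):\quad \min_{x} \; \|x\|_1 \quad \text{s.t.}\quad Ax = y
\qquad\qquad
\text{(LASSO)}:\quad \min_{x} \; \tfrac{1}{2}\,\|Ax - y\|_2^2 + \lambda \|x\|_1
```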

Page 13

Primal-Dual Interior-Point Methods (PDIPM)

• The algorithm simultaneously optimizes the primal-dual pair of the linear programming problems (P) and (D).

• (P) can be converted to a family of logarithmic barrier problems.
• The primal-dual interior-point algorithm tracks the central trajectory of (P) and (D), defined by the KKT conditions.

Page 14

Gradient Projection Methods (GPM)

• GPM reformulates ℓ1-min as a quadratic program (QP).

• One can separate the positive coefficients x⁺ and the negative coefficients x⁻, writing x = x⁺ − x⁻ with x⁺, x⁻ ≥ 0.

• It can then be rewritten in the standard QP form (reconstructed below):
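The QP itself did not survive the export; the standard GPSR-style form it refers to, written over z = [x⁺; x⁻] (my reconstruction, with λ as the ℓ1 penalty weight), is:

```latex
% Reconstruction of the standard GPSR quadratic program (not the slide's own rendering)
\min_{z \ge 0} \; Q(z) = c^{\top} z + \tfrac{1}{2}\, z^{\top} B z,
\qquad
z = \begin{bmatrix} x^{+} \\ x^{-} \end{bmatrix},\quad
c = \lambda \mathbf{1} + \begin{bmatrix} -A^{\top} y \\ A^{\top} y \end{bmatrix},\quad
B = \begin{bmatrix} A^{\top} A & -A^{\top} A \\ -A^{\top} A & A^{\top} A \end{bmatrix}.
```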

Page 15

Gradient Projection Methods (GPM)

• The gradient of Q(z) is ∇Q(z) = c + Bz.

• This leads to a basic algorithm that searches from each iterate along the negative gradient, aided by a standard line-search procedure.

• A variant: the truncated Newton interior-point method (TNIPM):

  • A logarithmic barrier is constructed for the constraints. Using the primal barrier method, the optimal search direction is computed via Newton's method.

  • The Newton step is approximated with preconditioned conjugate gradients (PCG).

Page 16

Homotopy Methods (topological homotopy)

• Both PDIPM and GPM require the solution sequence to stay close to a "central path", which is often difficult to satisfy and computationally expensive.

• Homotopy methods exploit the fact that the objective of LASSO undergoes a homotopy from the ℓ2 constraint to the ℓ1 objective as the penalty coefficient decreases (the "solution path").

• It is only necessary to identify the "breakpoints" at which the nonzero support set changes.

Page 17

Homotopy Methods (topological homotopy)

• Pros: the homotopy algorithm provably solves ℓ1-min (P1) exactly, not approximately. For a k-sparse signal, homotopy methods can find it in k iterations.

• Cons: it may lose its computational competitiveness when the sparsity of x grows proportionally with the observation dimension d.

Page 18

Iterative Shrinkage and Thresholding Algorithm (ISTA)

• Each iteration applies a closed-form solution w.r.t. each scalar coefficient (element-wise soft thresholding).

• Related: FISTA, proximal gradient methods (PGM)
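A minimal ISTA sketch (my own code): each iteration takes a gradient step on the data-fidelity term and then applies the closed-form per-coefficient shrinkage:

```python
# Sketch (mine) of ISTA for min 0.5*||Ax - y||^2 + lam*||x||_1.
import numpy as np

def soft_threshold(v, t):
    """Closed-form scalar shrinkage: the prox of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)             # gradient of the smooth term
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

FISTA adds a momentum step on top of the same shrinkage; proximal gradient methods replace the shrinkage by the prox of another regularizer.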

Page 19

Alternating Direction Method of Multipliers (ADMM)

• Using the (augmented) Lagrangian, the problem is converted to an unconstrained form with two additional variables; an alternating minimization procedure is then performed.

• ADMM can also be applied to the dual problem of ℓ1-min. An implementation called YALL1 iterates in both the primal and dual spaces to converge.
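A hedged sketch (mine, not the YALL1 implementation) of ADMM for ℓ1-regularized least squares, with auxiliary variable z and scaled dual variable u:

```python
# Sketch (mine): ADMM for min 0.5*||Ax - y||^2 + lam*||z||_1  s.t.  x = z.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, y, lam, rho=1.0, n_iter=200):
    m, n = A.shape
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    AtA_rhoI = A.T @ A + rho * np.eye(n)        # reused by every x-update
    Aty = A.T @ y
    for _ in range(n_iter):
        x = np.linalg.solve(AtA_rhoI, Aty + rho * (z - u))   # quadratic subproblem
        z = soft_threshold(x + u, lam / rho)                 # l1 subproblem (shrinkage)
        u = u + x - z                                        # dual update
    return z
```

This shows only the primal splitting; YALL1 itself works with the dual problem and different splittings.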

Page 20

Connecting "Sparse" to "Deep": the wheel of history always turns back to the same spoke (历史的车轮总要转回同一根辐条)

Page 21

A Starting Point

• Regularized Least Squares (RLS)

  Y = argmin_Y ||X − DY||_2^2 + r(Y)

  X: input data;  Y: feature;  D: the basis (dictionary) for the feature representation;  r(Y): the regularization term that incorporates a problem-specific prior.

• Why RLS?
  • It represents a large family of feature learning models.
  • It is solvable by a similar class of algorithms.
  • It gives rise to the most popular building blocks in recent deep learning.


Page 22

Approximated Regression Machine (ARM)

General form of the iterative algorithm:  Y_{k+1} = N(L_1 X + L_2 Y_k)

Idea: unfold & truncate iterative algorithm


[Block diagram: X → L_1 → (+) → N → Y, with Y fed back through L_2 into the sum]
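A sketch (mine) of "unfold & truncate": k repetitions of the update above become a fixed-depth feed-forward computation whose operators L_1, L_2 (and the nonlinearity N) can then be trained end to end:

```python
# Sketch (mine): a k-iteration ARM as a feed-forward computation.
import numpy as np

def arm_forward(X, L1, L2, N, k):
    Y = N(L1 @ X)                 # first stage, starting from Y_0 = 0
    for _ in range(k):            # k further unfolded stages
        Y = N(L1 @ X + L2 @ Y)
    return Y                      # a k-iteration ARM: (k + 1) applications of N
```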

Page 23

Approximated Regression Machine (ARM)

• Example: unfold & truncate to k = 2 (a 2-iteration ARM)
• Training & inference are done in the same architecture
• Can be further tuned on training data
• End-to-end training & fast inference


A k-iteration ARM is a (k+1)-layer neural network!

[Block diagram of the 2-iteration ARM: X enters through L_1; three (+)→N stages follow, each stage's output fed back through L_2 into the next sum, producing Y]

Page 24

Approximated Regression Machine (ARM)

• ℓ1-RLS:  Y = argmin_Y ||X − DY||_2^2 + c||Y||_1

• Iterative algorithm:  Y_{k+1} = N(L_1 X + L_2 Y_k)

• L_1 X = D^T X,   L_2 Y_k = (I − D^T D) Y_k

• N(·): element-wise soft thresholding with threshold c, i.e. a two-sided shifted ReLU with breakpoints at ±c.
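A sketch (mine) of this nonlinearity and of one ℓ1-ARM iteration; soft thresholding at ±c equals a difference of two shifted ReLUs, so the whole iteration is built from linear maps and ReLU-style neurons:

```python
# Sketch (mine): soft thresholding via two shifted ReLUs, and one l1-ARM step.
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def soft_threshold(v, c):
    return relu(v - c) - relu(-v - c)     # = sign(v) * max(|v| - c, 0)

def l1_arm_step(X, D, c, Y):
    # Y_{k+1} = N(D^T X + (I - D^T D) Y_k), with N = soft thresholding at c
    return soft_threshold(D.T @ X + (np.eye(D.shape[1]) - D.T @ D) @ Y, c)
```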

Page 25

Approximated Regression Machine (ARM)

• Non-negative ℓ1-RLS:  Y = argmin_Y ||X − DY||_2^2 + c||Y||_1,  Y ≥ 0

• Iterative algorithm:  Y_{k+1} = N(L_1 X + L_2 Y_k)

• L_1 X = D^T X − c,   L_2 Y_k = (I − D^T D) Y_k


[Block diagram: X enters through L_1; three (+)→ReLU stages follow, each output fed back through L_2 into the next sum, producing Y. Caption: Non-negative ℓ1-ARM (2 iterations)]
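A compact sketch (mine) of the k-iteration non-negative ℓ1-ARM, i.e. a (k+1)-layer ReLU network whose weights come from the dictionary D and whose bias is the threshold c:

```python
# Sketch (mine): the non-negative l1-ARM is a stack of ReLU layers.
import numpy as np

def nonneg_l1_arm(X, D, c, k=2):
    I = np.eye(D.shape[1])
    Y = np.maximum(D.T @ X - c, 0.0)                      # first ReLU stage
    for _ in range(k):
        Y = np.maximum(D.T @ X - c + (I - D.T @ D) @ Y, 0.0)
    return Y
```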

Page 26

More Examples: ℓ0-, ℓ2-, ℓ∞-ARMs


• Nonlinearity transform N as a neuron:

  • ℓ0-RLS (form 1):  Y = argmin ||X − DY||_2^2 + c^2 ||Y||_0
  • ℓ2-RLS:  Y = argmin ||X − DY||_2^2 + c||Y||_2
  • ℓ∞-RLS:  Y = argmin ||X − DY||_2^2,  s.t. ||Y||_∞ ≤ c

• Nonlinearity transform N as pooling:

  • ℓ0-RLS (form 2):  Y = argmin ||X − DY||_2^2,  s.t. ||Y||_0 ≤ M
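A sketch (mine) of the "neuron" nonlinearities induced by the first three models above (ℓ0 form 2 is treated as pooling on the next slides); the exact threshold constants depend on the chosen scaling:

```python
# Sketch (mine): nonlinearities induced by l0 (form 1), l2 and l-infinity models.
import numpy as np

def hard_threshold(v, t):
    """l0-RLS (form 1): keep only entries whose magnitude exceeds a threshold."""
    return v * (np.abs(v) > t)

def l2_shrink(v, t):
    """l2-RLS (unsquared penalty): shrink the whole vector toward zero."""
    nv = np.linalg.norm(v)
    return np.maximum(1.0 - t / nv, 0.0) * v if nv > 0 else v

def linf_clip(v, c):
    """l-infinity constraint ||Y||_inf <= c: element-wise clipping (projection)."""
    return np.clip(v, -c, c)
```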

Page 27

Compare Non-linearities in ARMs

[Figure: plots of the nonlinearities N induced by each model: (a) tanh, (b) ReLU, (c) ℓ1, (d) ℓ0, (e) ℓ2, (f) ℓ∞]

Page 28

Compare Non-linearities in ARMs

ℓ0-RLS (form 2):  Y = argmin ||X − DY||_2^2,  s.t. ||Y||_0 ≤ M

• N(·): keep the M largest-magnitude coefficients ➡ max-M pooling
• A generalization of the well-known max pooling operator (M = 1)
• Explains its success in deep learning: sparse representation


[Figure: Max-M pooling (M = 2)]
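A sketch (mine) of that max-M pooling operator:

```python
# Sketch (mine): "max-M pooling" keeps the M largest-magnitude coefficients.
import numpy as np

def max_m_pooling(v, M):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-M:]     # indices of the M largest |v_i|
    out[idx] = v[idx]
    return out

print(max_m_pooling(np.array([0.2, -3.0, 1.1, 0.4]), M=2))   # keeps -3.0 and 1.1
```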

Page 29

Convolutional RLS

min_{Z_i} ||X − Σ_i F_i ∗ Z_i||_2^2 + Σ_i r(Z_i)

X: input data;  Z_i: feature maps;  F_i: the convolutional filter bank;  r(Z_i): the regularization term

• Two special cases of interest:

  • r(Z_i) = 0, ∀i  →  PCANet, a recently proposed baseline

  • r(Z_i) = λ||Z_i||_1, Z_i ≥ 0, ∀i  →  the non-negative convolutional ℓ1-ARM
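A hedged sketch (mine) of a single shrinkage step of the non-negative convolutional ℓ1-ARM, starting from zero feature maps and using the standard fact that the adjoint of convolution is correlation; one such step is exactly a convolutional layer followed by a ReLU:

```python
# Sketch (mine): one step of the non-negative convolutional l1-ARM
# (starting from zero feature maps) = convolution + bias + ReLU.
import numpy as np
from scipy.signal import correlate2d

def conv_arm_step(X, filters, lam):
    """X: 2-D input; filters: list of 2-D filters; lam: sparsity threshold."""
    return [np.maximum(correlate2d(X, F, mode="same") - lam, 0.0) for F in filters]

X = np.random.randn(32, 32)
filters = [np.random.randn(5, 5) for _ in range(4)]
Z = conv_arm_step(X, filters, lam=0.1)    # four sparse, non-negative feature maps
```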

Page 30

Towards General Feed-Forward Networks

• Y_{k+1} = N(L_1 X + L_2 Y_k)

  L_1 X = D^T X − c,   L_2 Y_k = (I − D^T D) Y_k,   N = ReLU

• What if we let k = 0?  →  a "reckless" 0-iteration ARM?

• Y = ReLU(D^T X − c)

• Equivalent to a fully-connected layer + ReLU!


[Block diagram: the unfolded non-negative ℓ1-ARM, X → L_1 → (+) → ReLU → … → Y, as on the previous slides]

Page 31

Stacked Approximated Regression Machine


• Non-negative ℓ1-RLS ➡ 0-iteration ARM ➡ fully-connected layer + neuron (ReLU)

• Non-negative convolutional ℓ1-RLS ➡ 0-iteration ARM ➡ convolutional layer + neuron (ReLU)

• Besides, ℓ0-RLS (form 2) ➡ max pooling

• Stacking (and/or jointly tuning) ARMs ➡ Stacked Approximated Regression Machine (SARM)

• Most currently popular deep models are special cases of SARM!

Page 32

Interpret Deep Networks with SARM


Each trainable layer (plus neuron) is a 0-iteration ARM of a non-negative (linear or convolutional) ℓ1-RLS.

Each ARM comes with its own set of parameters: a dictionary or a filter bank.

Each hidden layer outputs a one-step approximation of the solution to the original RLS model.

All 0-iteration ARMs are stacked into a SARM and tuned end to end to learn all parameters jointly.

Deep Model From A SARM Viewpoint

Page 33

Interpret Deep Networks with SARM


[Figure: mapping deep-network components to their SARM interpretation:
• Depth: infinitely deep?
• Bias: controls sparsity
• ReLU: enforces non-negative sparsity
• Pooling: enforces "hard" sparsity]

Page 34

Resemblance to Residual Learning

[Diagrams: the unfolded ARM (x passing through L_1 and N stages with intermediate outputs z_1, z_2, z_3 and L_2 feedback into each sum) drawn side by side with a residual block (x passing through W_1, W_2, W_3 and N stages, with a skip connection into the final sum producing a)]

(Expansion: L_2 = I − L_1 L_1^T)
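A worked expansion (mine) of why the unfolded update already contains an identity "skip" path, using the slide's expansion L_2 = I − L_1 L_1^T with L_1 = D^T:

```latex
% Worked expansion (mine): the unfolded ARM update carries an identity path.
Y_{k+1} \;=\; N\!\big(L_1 X + L_2 Y_k\big)
       \;=\; N\!\big(\,\underbrace{Y_k}_{\text{identity (skip) path}}
              \;+\; \underbrace{L_1 X - L_1 L_1^{\top} Y_k}_{\text{learned branch}}\,\big),
\qquad L_1 = D^{\top},\; L_2 = I - L_1 L_1^{\top}.
```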

Page 35

Solving End-To-End Optimization with SARM

• Example: Compressive Sensing (CS)


An end-to-end learning of all CS parameters:

Page 36

Deeply Optimized Compressive Sensing

• A feed-forward network pipeline to solve the bi-level optimization

• More examples solved: SVD (as a special case), the dual sparsity model, etc.


[Pipeline diagram: an encoder (Representation → Measurement) followed by a decoder (Recovery → Reconstruction); the learnable blocks include G, M, D, A and an ℓ1-ARM, mapping the input X to measurements Y and back to a reconstruction of X]
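A loose sketch (mine, reconstructed from the block diagram above; the names G, M, A, D are placeholders for the learnable CS parameters) of the encoder-decoder pipeline:

```python
# Loose sketch (mine): end-to-end CS pipeline with an unfolded l1-ARM decoder.
import numpy as np

def soft_threshold(v, c):
    return np.sign(v) * np.maximum(np.abs(v) - c, 0.0)

def cs_pipeline(X, G, M, A, D, c, k=3):
    z = G @ X                                  # encoder: sparsifying representation
    y = M @ z                                  # encoder: compressive measurements
    I = np.eye(A.shape[1])
    h = soft_threshold(A.T @ y, c)             # decoder: unfolded l1-ARM recovery
    for _ in range(k):
        h = soft_threshold(A.T @ y + (I - A.T @ A) @ h, c)
    return D @ h                               # decoder: reconstruction of X
```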

Page 37

Related Work

• Existing (albeit limited) work on interpreting deep learning

  • [S. Mallat et al., 2013] Scattering networks

  • [Y. Bengio et al., 2012-2016] Explaining dropout, initialization, …

  • [R. Baraniuk et al., 2015] A generative probabilistic theory

  • [U. Kamilov et al., 2016] Learning optimal nonlinearities for ISTA

  • [B. Xin et al., 2016] Maximal sparsity with deep networks?

• Correlating classical ML models with deep models

  • [Y. LeCun et al., 2010] Learned ISTA (LISTA)

  • [P. Sprechmann et al., 2015] Fixed-complexity basis pursuit

  • [R. Vemulapalli et al., 2015] Deep Gaussian random field

  • [S. Zheng et al., 2015] CRF as RNN

Page 38

Take-Home Points

• Nonparametric structured models, based on sparsity, low-dimensionality, etc., are powerful and flexible.

• While they may not always be the best model for a particular application, they are quite often surprisingly competitive.

• They bring more interpretability than "data-driven black boxes", and their algorithms are concise and beautiful.

• We may still embrace them in the age of deep learning (see my next talk).