Lecture 8: Dimensionality Reduction
Contents:
• Subset Selection & Shrinkage
• Ridge regression, Lasso
• PCA, PCR, PLS
• Comparison of Methods
Data From Human Movement
Measure arm movement and full-body movement of humans and anthropomorphic robots
Perform local dimensionality analysis with a growing variational mixture of factor analyzers
Dimensionality of Full Body Motion
About 8 dimensions in the space formed by joint positions, velocities, and accelerations are needed to learn an inverse dynamics model
[Figure: two plots, captioned "Local FA" and "Global PCA" — probability vs. dimensionality, and cumulative variance vs. number of retained dimensions (with markers at 32 and 105 dimensions).]
Dimensionality Reduction
Goals:
• fewer dimensions for subsequent processing
• better numerical stability due to removal of correlations
• simpler post-processing due to "advanced statistical properties" of the pre-processed data
• don't lose important information, only redundant or irrelevant information
• perform dimensionality reduction spatially localized for nonlinear problems (e.g., each RBF has its own local dimensionality reduction)
Subset Selection & Shrinkage Methods

Subset Selection
Refers to methods which select a subset of variables to include and discard the other dimensions based on some optimality criterion; regression is then performed on this reduced-dimensional input set. This is a discrete method: variables are either selected or discarded.
• Leaps and bounds procedure (Furnival & Wilson, 1974)
• Forward/backward stepwise selection

Shrinkage Methods
Refers to methods which reduce/shrink the redundant or irrelevant variables in a more continuous fashion.
• Ridge regression
• Lasso
• Derived input direction methods: PCR, PLS
Ridge Regression
Ridge regression shrinks the coefficients by imposing a penalty on their size. It minimizes a penalized residual sum of squares:
$$\hat{\mathbf{w}}^{\text{ridge}} = \arg\min_{\mathbf{w}} \left\{ \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 + \lambda \sum_{j=1}^{M} w_j^2 \right\}$$

where $\lambda \ge 0$ is a complexity parameter controlling the amount of shrinkage.

Equivalent representation:

$$\hat{\mathbf{w}}^{\text{ridge}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{M} w_j^2 \le s$$
Ridge Regression (cont’d)
Some notes:
• When there are many correlated variables in a regression problem, their coefficients can be poorly determined and exhibit high variance: e.g., a wildly large positive coefficient can be canceled by a similarly large negative coefficient on its correlated cousin.
• The bias term is not included in the penalty.
• When the inputs are orthogonal, the ridge estimates are just a scaled version of the least squares estimate
Matrix representation of the criterion and its solution:

$$J(\mathbf{w}, \lambda) = \tfrac{1}{2}\left[ (\mathbf{t} - X\mathbf{w})^T (\mathbf{t} - X\mathbf{w}) + \lambda\, \mathbf{w}^T \mathbf{w} \right]$$

$$\hat{\mathbf{w}}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T \mathbf{t}$$
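A minimal NumPy sketch of the closed-form solution above (not from the slides; the function name and synthetic data are illustrative), assuming the inputs are centred so that the unpenalized bias $w_0$ can be handled separately:

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Closed-form ridge solution: w = (X^T X + lam*I)^{-1} X^T t.
    Assumes X (N x M) has mean-zero columns; the bias w0 is not
    penalized and equals mean(t) for centred inputs."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ t)

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)                       # centre the inputs
t = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
w_ridge = ridge_fit(X, t, lam=1.0)        # shrunk relative to lam = 0
```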
Ridge Regression (cont’d)
Ridge regression shrinks the directions with the least variance the most.
Effective degrees of freedom:

$$\text{df}(\lambda) = \text{tr}\left[ X (X^T X + \lambda I)^{-1} X^T \right] = \sum_{j=1}^{M} \frac{d_j^2}{d_j^2 + \lambda}$$

Note: df(0) = M if there is no regularization. [Page 63: Elem. Stat. Learning]

Shrinkage factor: each direction is shrunk by

$$\frac{d_j^2}{d_j^2 + \lambda},$$

where $d_j^2$ refers to the corresponding eigenvalue of $X^T X$ (i.e., $d_j$ is a singular value of $X$). [Page 62: Elem. Stat. Learning]
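A small sketch (not from the slides) of computing df(λ) through the singular values of X; the data here is illustrative:

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom: sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)    # singular values of X
    return np.sum(d**2 / (d**2 + lam))

X = np.random.default_rng(1).normal(size=(50, 8))
print(ridge_df(X, 0.0))     # 8.0 = M: no regularization
print(ridge_df(X, 1e6))     # approaches 0 for large lambda
```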
Lasso
Lasso is also a shrinkage method like ridge regression. It minimizes a penalized residual sum of squares :
$$\hat{\mathbf{w}}^{\text{lasso}} = \arg\min_{\mathbf{w}} \left\{ \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 + \lambda \sum_{j=1}^{M} |w_j| \right\}$$

Equivalent representation:

$$\hat{\mathbf{w}}^{\text{lasso}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{M} |w_j| \le s$$
The L2 ridge penalty is replaced by the L1 lasso penalty. This makes the solution non-linear in t.
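Because the L1 penalty admits no closed form, the lasso is typically solved iteratively. Below is a minimal cyclic coordinate-descent sketch with soft-thresholding (a standard approach, though not one the slides specify; all names are illustrative, and the intercept is assumed handled by centring):

```python
import numpy as np

def soft_threshold(rho, lam):
    """S(rho, lam) = sign(rho) * max(|rho| - lam, 0)."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_cd(X, t, lam, n_iter=200):
    """Minimize 0.5 * ||t - X w||^2 + lam * sum_j |w_j|
    by cyclic coordinate descent on centred data."""
    N, M = X.shape
    w = np.zeros(M)
    col_sq = (X ** 2).sum(axis=0)             # x_j^T x_j per column
    for _ in range(n_iter):
        for j in range(M):
            r = t - X @ w + X[:, j] * w[j]    # residual excluding feature j
            w[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return w                                  # some entries exactly zero
```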
Derived Input Direction Methods
These methods essentially involve transforming the input directions to some low-dimensional representation and using these derived directions to perform the regression.
Example Methods
• Principal Components Regression (PCR)
• Based on input variance only
• Partial Least Squares (PLS)
• Based on input-output correlation and output variance
Principal Component Analysis
From earlier discussions:
PCA finds the eigenvectors of the correlation (covariance) matrix of the data
Here, we show how PCA can be learned by bottle-neck neural networks (auto-associator) and Hebbian learning frameworks
PCA Implementation
Hebbian learning obtains a measure of the familiarity of a new data point: the more familiar a data point, the larger the output.
[Figure: single linear unit with inputs $x_1, x_2, \ldots, x_d$, weight vector $\mathbf{w}$, and output $y = \mathbf{w}^T \mathbf{x}$.]

Hebbian update: $\mathbf{w}_{n+1} = \mathbf{w}_n + \alpha\, y\, \mathbf{x}$
Fixing the Instability (Oja’s rule)
Renormalization:

$$\mathbf{w}_{n+1} = \frac{\mathbf{w}_n + \alpha\, y\, \mathbf{x}}{\lVert \mathbf{w}_n + \alpha\, y\, \mathbf{x} \rVert}$$

Oja's rule:

$$\Delta\mathbf{w} = \mathbf{w}_{n+1} - \mathbf{w}_n = \alpha\, y\, (\mathbf{x} - y\, \mathbf{w}) = \alpha\, (y\mathbf{x} - y^2 \mathbf{w})$$

Verify Oja's rule (does it do the right thing??): with $y = \mathbf{w}^T \mathbf{x}$ and correlation matrix $C = E[\mathbf{x}\mathbf{x}^T]$,

$$E[\Delta\mathbf{w}] = \alpha\, E[y\mathbf{x} - y^2 \mathbf{w}] = \alpha \left( C\mathbf{w} - (\mathbf{w}^T C\, \mathbf{w})\, \mathbf{w} \right)$$

After convergence, $E[\Delta\mathbf{w}] = 0$:

$$C\mathbf{w} = (\mathbf{w}^T C\, \mathbf{w})\, \mathbf{w}, \quad \text{thus } \mathbf{w} \text{ is an eigenvector of } C$$

and multiplying from the left by $\mathbf{w}^T$ gives $\mathbf{w}^T C\, \mathbf{w} = (\mathbf{w}^T C\, \mathbf{w})(\mathbf{w}^T \mathbf{w})$, thus $\mathbf{w}^T \mathbf{w} = 1$.
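A minimal NumPy sketch of the rule above (not from the slides; the learning rate, epoch count, and synthetic data are illustrative), compared against batch PCA:

```python
import numpy as np

def oja_first_pc(X, alpha=0.01, n_epochs=50, seed=0):
    """Estimate the first principal component of mean-zero data X
    with Oja's rule: w <- w + alpha * y * (x - y * w), y = w^T x."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            y = w @ x
            w += alpha * y * (x - y * w)
    return w

# Compare against batch PCA on synthetic correlated data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])
X -= X.mean(axis=0)
w = oja_first_pc(X)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
print(abs(w @ Vt[0]))   # close to 1: same direction up to sign
```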
Application Examples of Oja’s Rule
Singular Value Decomposition (SVD): $X = U D V^T$

Oja's rule finds the direction (line) which minimizes the total squared distance from each point to its orthogonal projection onto the line.
PCA : Batch vs Stochastic
PCA in batch update: just subtract the mean from the data, then calculate the eigenvectors to obtain the principal components (Matlab function "eig").

PCA in stochastic update: stochastically estimate the mean and subtract it from the input data, then use Oja's rule on the mean-subtracted data.
$$\bar{\mathbf{x}}_{n+1} = \frac{n\,\bar{\mathbf{x}}_n + \mathbf{x}}{n+1} = \bar{\mathbf{x}}_n + \frac{1}{n+1}\,(\mathbf{x} - \bar{\mathbf{x}}_n)$$
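A one-line sketch of this incremental mean estimate (names illustrative, not from the slides):

```python
def update_mean(x_bar, x, n):
    """Incremental mean: x_bar_{n+1} = x_bar_n + (x - x_bar_n) / (n + 1),
    where x_bar_n is the mean of the first n points."""
    return x_bar + (x - x_bar) / (n + 1)
```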
Autoencoder as motivation for Oja’s Rule
Note that Oja’s rule looks like a supervised learning rule the update looks like a reverse delta-rule: it depends on the difference
between the actual input and the back-propagated output
$$\Delta\mathbf{w} = \mathbf{w}_{n+1} - \mathbf{w}_n = \alpha\, y\, (\mathbf{x} - y\,\mathbf{w})$$

[Figure: linear autoencoder with input $\mathbf{x}$, bottleneck output $y = \mathbf{w}^T \mathbf{x}$, and reconstruction $\hat{\mathbf{x}} = y\, \mathbf{w}$.]

Convert unsupervised learning into a supervised learning problem by trying to reconstruct the inputs from few features!
Learning in the Autoencoder
Minimize the cost

$$J = \tfrac{1}{2}\,(\mathbf{x} - \hat{\mathbf{x}})^T (\mathbf{x} - \hat{\mathbf{x}}) = \tfrac{1}{2}\,(\mathbf{x} - y\,\mathbf{w})^T (\mathbf{x} - y\,\mathbf{w}), \quad \text{where } \hat{\mathbf{x}} = y\,\mathbf{w},\; y = \mathbf{w}^T \mathbf{x}$$

Differentiating with respect to $\mathbf{w}$, both directly and through $y = \mathbf{w}^T \mathbf{x}$ (using $\mathbf{x}^T \mathbf{w} = y$):

$$-\frac{\partial J}{\partial \mathbf{w}} = y\,(\mathbf{x} - y\,\mathbf{w}) + (\mathbf{x}^T \mathbf{w})\,(\mathbf{x} - y\,\mathbf{w}) = 2y\,(\mathbf{x} - y\,\mathbf{w})$$

so the gradient-descent update is

$$\Delta\mathbf{w} = -\alpha\,\frac{\partial J}{\partial \mathbf{w}} \propto y\,(\mathbf{x} - y\,\mathbf{w})$$

This is Oja's rule!
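As a quick numerical sanity check (not from the slides; a sketch under the assumption $\lVert\mathbf{w}\rVert = 1$, where the exact gradient of $J$ is parallel to Oja's update direction):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
w = rng.normal(size=4)
w /= np.linalg.norm(w)                    # on the unit sphere

def J(w):
    """Autoencoder cost J = 0.5 * ||x - (w^T x) w||^2."""
    r = x - (w @ x) * w
    return 0.5 * r @ r

eps = 1e-6
grad = np.array([(J(w + eps * e) - J(w - eps * e)) / (2 * eps)
                 for e in np.eye(4)])     # central finite differences
oja = (w @ x) * (x - (w @ x) * w)         # Oja's update direction
cos = -grad @ oja / (np.linalg.norm(grad) * np.linalg.norm(oja))
print(cos)                                # ~1.0: descent direction matches Oja
```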
Discussion about PCA
• If the data is noisy, we may represent the noise instead of the data. Way out: factor analysis (which also handles noise in the input dimensions).
• PCA has no data-generating model.
• PCA has no probabilistic interpretation (not quite true!!).
• PCA ignores the possible influence of subsequent (e.g., supervised) learning steps.
• PCA is a linear method. Way out: nonlinear PCA.
• PCA can converge very slowly. Way out: EM versions of PCA.

But PCA is a very reliable method for dimensionality reduction if it is appropriate!
PCA preprocessing for Supervised Learning
In Batch Learning Notation for a Linear Network:
$$C = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^T = \frac{1}{N} X^T X, \quad \text{where } X \text{ contains mean-zero data}$$

$$U = [\mathbf{u}_1, \ldots, \mathbf{u}_k] = \text{the } k \text{ eigenvectors of } C \text{ with maximal eigenvalues}$$

Subsequent linear regression for the network weights:

$$W = (U^T X^T X\, U)^{-1} U^T X^T Y$$

NOTE: Inversion of the above matrix is very cheap since it is diagonal! No numerical problems!

Problems of this pre-processing: important regression data might have been clipped!
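A minimal NumPy sketch of this pipeline (not from the slides; function and variable names are illustrative):

```python
import numpy as np

def pca_regression(X, Y, k):
    """Regress Y on the top-k principal components of mean-zero X.
    Returns U (d x k) and W (k x m); predictions are X @ U @ W."""
    C = X.T @ X / len(X)                  # correlation matrix of mean-zero data
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :k]           # k eigenvectors, maximal eigenvalues
    Z = X @ U                             # projected inputs
    W = np.linalg.solve(Z.T @ Z, Z.T @ Y) # Z^T Z is diagonal, so this is cheap
    return U, W
```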
Principal Component Regression (PCR)
Build the matrix X and vector t:

$$X = (\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_n)^T, \quad \mathbf{t} = (t_1, t_2, \ldots, t_n)^T$$

Eigendecomposition:

$$X^T X = V D^2 V^T$$

Data projection:

$$U = [\mathbf{v}_1\; \mathbf{v}_2\; \cdots\; \mathbf{v}_k] \quad (\text{eigenvectors with the } k \text{ maximal eigenvalues})$$

Compute the linear model:

$$\hat{\mathbf{w}}^{\text{pcr}} = U\, (U^T X^T X\, U)^{-1} U^T X^T \mathbf{t}$$
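A compact sketch of PCR via the SVD (not from the slides; names illustrative). Unlike the preceding snippet, the coefficients are mapped back to the original input space:

```python
import numpy as np

def pcr(X, t, k):
    """Principal component regression: regress t on the top-k
    principal directions of X, then map back to input space."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    U = Vt[:k].T                           # top-k eigenvectors of X^T X
    Z = X @ U                              # projected data
    beta = np.linalg.solve(Z.T @ Z, Z.T @ t)
    return U @ beta                        # w_pcr in original coordinates
```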
PCA in Joint Data Space
A straightforward extension to take the supervised learning step into account:
• perform PCA in the joint data space
• extract the linear weight parameters from the PCA results
[Figure: PCA in the joint (x, y) data space.]
PCA in Joint Data Space: Formalization
$$\mathbf{z}_n = \begin{bmatrix} \mathbf{x}_n \\ \mathbf{y}_n \end{bmatrix}, \quad C = \frac{1}{N} \sum_{n=1}^{N} \mathbf{z}_n \mathbf{z}_n^T = \frac{1}{N} Z^T Z$$

$$U = [\mathbf{u}_1, \ldots, \mathbf{u}_k] = \text{the } k \text{ eigenvectors of } C \text{ with maximal eigenvalues}$$

Note: this new kind of linear network can actually tolerate noise in the input data! But only the same noise level in all (joint) dimensions!

The network weights become:

$$W = U_y (U_x^T U_x)^{-1} U_x^T = U_y (I - U_y^T U_y)^{-1} U_x^T, \quad \text{where } U = \begin{bmatrix} U_x \\ U_y \end{bmatrix}, \; U_x \in \mathbb{R}^{d \times k}, \; U_y \in \mathbb{R}^{c \times k}$$
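A minimal sketch of extracting these weights (not from the slides; names illustrative, data assumed mean-zero):

```python
import numpy as np

def joint_pca_weights(X, Y, k):
    """Weights W with y_hat = W @ x, from PCA in the joint (x, y) space."""
    Z = np.hstack([X, Y])                  # joint mean-zero data, N x (d + c)
    d = X.shape[1]
    C = Z.T @ Z / len(Z)
    eigvals, eigvecs = np.linalg.eigh(C)
    U = eigvecs[:, ::-1][:, :k]            # k eigenvectors, maximal eigenvalues
    Ux, Uy = U[:d], U[d:]                  # split into input and output blocks
    return Uy @ np.linalg.solve(Ux.T @ Ux, Ux.T)
```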
Partial Least Squares (PLS)
Partial Least Squares is a linear regression method that includes dimensionality reduction.

Build the matrix X and vector t:

$$X = (\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_n)^T, \quad \mathbf{t} = (t_1, t_2, \ldots, t_n)^T$$

Recursively compute the linear model:

Initialize: $X_{\text{res}} = X$, $\mathbf{t}_{\text{res}} = \mathbf{t}$

For $i = 1 : r$ (# of projections):

$$\mathbf{u}_i = X_{\text{res}}^T \mathbf{t}_{\text{res}} \quad \text{(projection direction)}$$
$$\mathbf{s} = X_{\text{res}}\, \mathbf{u}_i \quad \text{(projected data)}$$
$$\beta_i = \frac{\mathbf{s}^T \mathbf{t}_{\text{res}}}{\mathbf{s}^T \mathbf{s}} \quad \text{(univariate regression)}$$
$$\mathbf{t}_{\text{res}} \leftarrow \mathbf{t}_{\text{res}} - \beta_i\, \mathbf{s}, \quad X_{\text{res}} \leftarrow X_{\text{res}} - \mathbf{s}\, \mathbf{p}_i^T \;\text{ with }\; \mathbf{p}_i = \frac{X_{\text{res}}^T \mathbf{s}}{\mathbf{s}^T \mathbf{s}} \quad \text{(residuals)}$$

The projection direction is based on the input-output correlation.
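A direct NumPy transcription of this recursion for scalar outputs (a sketch, not the lecture's code; names illustrative):

```python
import numpy as np

def pls_fit(X, t, r):
    """Univariate PLS with r projections; returns directions, loadings,
    and regression coefficients for the recursion above."""
    X_res, t_res = X.copy(), t.copy()
    U, P, beta = [], [], []
    for _ in range(r):
        u = X_res.T @ t_res                # direction from input-output correlation
        s = X_res @ u                      # projected data
        b = (s @ t_res) / (s @ s)          # univariate regression coefficient
        p = X_res.T @ s / (s @ s)          # loading used for deflation
        t_res = t_res - b * s              # output residual
        X_res = X_res - np.outer(s, p)     # input residual (deflation)
        U.append(u); P.append(p); beta.append(b)
    return np.array(U).T, np.array(P).T, np.array(beta)

def pls_predict(x, U, P, beta):
    """Prediction for one mean-zero input, replaying the recursion."""
    x, y = x.copy(), 0.0
    for i in range(U.shape[1]):
        s = x @ U[:, i]
        y += beta[i] * s
        x = x - s * P[:, i]
    return y
```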
Comparing Dimensionality Reduction Methods
[Figure: fits by FA, PCR, PCA_J, and PLS with one, four, five, and six projections; five projections is labeled as correct.]
Comparison of Methods (I)
[Figure: prostate cancer data example (p. 57, Elem. Stat. Learning); the chosen model is the least complex model within one S.D. of the best.]
Comparison of Methods (II)
[Figure: further comparison on the prostate cancer data example (p. 57, Elem. Stat. Learning); again, the least complex model within one S.D. of the best is chosen.]
Generalization of Shrinkage Methods
$$\hat{\mathbf{w}}^{\text{gen}} = \arg\min_{\mathbf{w}} \left\{ \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 + \lambda \sum_{j=1}^{M} |w_j|^q \right\} \quad \text{for } q \ge 0$$

$$q = 0: \text{variable subset selection}, \quad q = 1: \text{lasso}, \quad q = 2: \text{ridge regression}$$

Contours of constant value of $\sum_{j=1}^{M} |w_j|^q$ for given values of q.
Error-Regularization Tradeoff
[Figure: error contours of the residual sum of squares together with the constraint regions $|w_1| + |w_2| \le t$ (lasso) and $w_1^2 + w_2^2 \le t^2$ (ridge regression).]