CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU
L28: kernel-based feature extraction
• Kernel PCA
• Kernel LDA
Principal Components Analysis
• As we saw in L9, PCA can only extract a linear projection of the data
  – To do so, we first compute the covariance matrix
    $C = \frac{1}{M}\sum_{j=1}^{M} x_j x_j^T$
  – Then, we find the eigenvectors and eigenvalues: $Cv = \lambda v$
  – And, finally, we project onto the eigenvectors with the largest eigenvalues: $y = [v_1\ v_2 \dots v_D]^T x$
• Can the kernel trick be used to perform this operation implicitly in a higher-dimensional space?
  – If so, this would be equivalent to performing non-linear PCA in the feature space
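As a point of reference before kernelizing, here is a minimal NumPy sketch of the linear PCA recipe above (center, compute the covariance matrix, eigen-decompose, project); the function name and interface are illustrative, not part of the lecture.

```python
import numpy as np

def linear_pca(X, d):
    """Linear PCA: project the M x N data matrix X onto its d leading principal directions."""
    Xc = X - X.mean(axis=0)                    # center the data
    C = (Xc.T @ Xc) / Xc.shape[0]              # covariance matrix C = (1/M) sum_j x_j x_j^T
    evals, evecs = np.linalg.eigh(C)           # eigenvalues returned in ascending order
    V = evecs[:, np.argsort(evals)[::-1][:d]]  # d eigenvectors with largest eigenvalues
    return Xc @ V                              # projections y = [v_1 ... v_d]^T x
```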
Kernel PCA
• To derive kernel PCA
  – We would first project the data into the high-dimensional feature space $F$
    $\Phi: \mathbb{R}^N \rightarrow F; \; x \rightarrow \varphi(x)$
  – Then we would compute the covariance matrix in $F$
    $C^F = \frac{1}{M}\sum_{j=1}^{M} \varphi(x_j)\varphi(x_j)^T$
    • where we have assumed that the data in $F$ is centered, $E[\varphi(x)] = 0$ (more on this later)
  – Then we would compute the principal components by solving the eigenvalue problem
    $C^F v = \lambda v$
  – The challenge is… how do we do this implicitly?
[Schölkopf et al., Neural Computation, 1998]
• Solution
  – As we saw in the snapshot PCA lecture, the eigenvectors can be expressed as linear combinations of the training data
    $C^F V = \frac{1}{M}\sum_{i=1}^{M}\varphi(x_i)\varphi(x_i)^T V = \lambda V \;\Rightarrow\; V = \frac{1}{M\lambda}\sum_{i=1}^{M}\varphi(x_i)\varphi(x_i)^T V = \sum_{i=1}^{M}\frac{\varphi(x_i)^T V}{M\lambda}\,\varphi(x_i) = \sum_{i=1}^{M}\alpha_i\varphi(x_i)$
  – We then take the dot product of both sides of $\lambda V = C^F V$ with $\varphi(x_k)$
    $\lambda\,\varphi(x_k)\cdot V = \varphi(x_k)\cdot C^F V$
  – which, combined with the previous expression, gives
    $\lambda\,\varphi(x_k)\cdot\sum_{i=1}^{M}\alpha_i\varphi(x_i) = \varphi(x_k)\cdot\frac{1}{M}\sum_{j=1}^{M}\varphi(x_j)\varphi(x_j)^T\sum_{i=1}^{M}\alpha_i\varphi(x_i)$
  – and regrouping terms yields
    $\lambda\sum_{i=1}^{M}\alpha_i\big(\varphi(x_k)\cdot\varphi(x_i)\big) = \frac{1}{M}\sum_{i=1}^{M}\alpha_i\sum_{j=1}^{M}\big(\varphi(x_k)\cdot\varphi(x_j)\big)\big(\varphi(x_j)\cdot\varphi(x_i)\big)$
– Defining an $M \times M$ matrix $K$ as
  $K_{ij} := \varphi(x_i)\cdot\varphi(x_j)$
– the previous expression becomes
  $M\lambda K\alpha = K^2\alpha$
– which can be solved through the eigenvalue problem
  $M\lambda\alpha = K\alpha$
• Normalization
  – To ensure that the eigenvectors $V$ are orthonormal, we then scale the eigenvectors $\alpha$
    $V^k\cdot V^k = 1 \;\Rightarrow\; \Big(\sum_{i=1}^{M}\alpha_i^k\varphi(x_i)\Big)\cdot\Big(\sum_{j=1}^{M}\alpha_j^k\varphi(x_j)\Big) = 1$
    $\Rightarrow\; \sum_{i,j=1}^{M}\alpha_i^k\alpha_j^k\,\varphi(x_i)\cdot\varphi(x_j) = 1 \;\Rightarrow\; \sum_{i,j=1}^{M}\alpha_i^k\alpha_j^k K_{ij} = 1 \;\Rightarrow\; \alpha^k\cdot K\alpha^k = 1$
  – which, since the $\alpha^k$ are eigenvectors of $K$, yields
    $\lambda_k\,(\alpha^k\cdot\alpha^k) = 1$, where $\lambda_k$ denotes the $k$-th eigenvalue of $K$
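A minimal sketch of this eigenvalue problem and normalization, assuming a Gaussian kernel; the helper names (`rbf_kernel`, `kernel_pca_fit`) and the `gamma` parameter are illustrative choices, not from the lecture.

```python
def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-gamma * ||x_i - y_j||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_pca_fit(X, gamma=1.0, tol=1e-12):
    """Solve M*lambda*alpha = K*alpha and rescale alpha so that lambda_k (alpha^k . alpha^k) = 1."""
    K = rbf_kernel(X, X, gamma)                      # M x M kernel matrix (centering is covered later)
    evals, alphas = np.linalg.eigh(K)                # eigenvalues of K, ascending order
    evals, alphas = evals[::-1], alphas[:, ::-1]     # sort by decreasing eigenvalue
    keep = evals > tol                               # discard (numerically) zero eigenvalues
    alphas = alphas[:, keep] / np.sqrt(evals[keep])  # enforce lambda_k * (alpha^k . alpha^k) = 1
    return K, alphas
```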
• To find the $k$-th principal component of a new sample $x$
  $V^k\cdot\varphi(x) = \sum_{i=1}^{M}\alpha_i^k\,\varphi(x_i)\cdot\varphi(x) = \sum_{i=1}^{M}\alpha_i^k\,K(x_i, x)$
  – Note that, when the kernel function is the dot product, the kernel PCA solution reduces to the snapshot PCA solution
  – However, unlike in snapshot PCA, here we are unable to find the eigenvectors explicitly, since they reside in the high-dimensional space $F$
    $V = \sum_{i=1}^{M}\alpha_i\varphi(x_i)$
  – This implies that kernel PCA can be used for feature extraction but CANNOT be used (at least directly) for reconstruction purposes
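Continuing the sketch, projecting a new sample only requires kernel evaluations against the training points (same hypothetical helpers as above):

```python
def kernel_pca_transform(X_train, alphas, x_new, gamma=1.0):
    """k-th principal component of x_new: sum_i alpha_i^k * K(x_i, x_new)."""
    k_vec = rbf_kernel(X_train, x_new[None, :], gamma).ravel()  # K(x_i, x_new) for i = 1..M
    return alphas.T @ k_vec                                     # one value per retained component
```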
Centering in the high-dimensional space
• Earlier we assumed that the data was centered in $F$; in general this will not be the case, so we work with the centered features
  $\tilde{\varphi}(x_i) := \varphi(x_i) - \frac{1}{M}\sum_{m=1}^{M}\varphi(x_m)$
  – So the kernel matrix in this centered space is
    $\tilde{K}_{ij} = \tilde{\varphi}(x_i)\cdot\tilde{\varphi}(x_j)$
  – And the eigenvalue problem that we need to solve is
    $\tilde{\lambda}\tilde{\alpha} = \tilde{K}\tilde{\alpha}$
  – Merging the first expression into the second one
    $\tilde{K}_{ij} = \Big(\varphi(x_i) - \frac{1}{M}\sum_{m=1}^{M}\varphi(x_m)\Big)\cdot\Big(\varphi(x_j) - \frac{1}{M}\sum_{n=1}^{M}\varphi(x_n)\Big)$
    $= K_{ij} - \frac{1}{M}\sum_{m=1}^{M}1_{im}K_{mj} - \frac{1}{M}\sum_{n=1}^{M}K_{in}1_{nj} + \frac{1}{M^2}\sum_{m,n=1}^{M}1_{im}K_{mn}1_{nj}$
    $= \big(K - 1_M K - K 1_M + 1_M K 1_M\big)_{ij}$
    • where $1_{ij} = 1$ (for all $i,j$) and $(1_M)_{ij} := 1/M$
  – So the centered kernel matrix can be computed from the uncentered one (see the sketch below)
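A one-function sketch of this centering identity; `one_M` stands for the $1_M$ matrix with all entries $1/M$ (illustrative name):

```python
def center_kernel(K):
    """Centered kernel matrix: K_tilde = K - 1_M K - K 1_M + 1_M K 1_M."""
    M = K.shape[0]
    one_M = np.full((M, M), 1.0 / M)
    return K - one_M @ K - K @ one_M + one_M @ K @ one_M
```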
• To project new test data $t_1, t_2, \dots, t_L$
  – First, we define two matrices
    $K^{test}_{ij} = \varphi(t_i)\cdot\varphi(x_j)$
    $\tilde{K}^{test}_{ij} = \Big(\varphi(t_i) - \frac{1}{M}\sum_{m=1}^{M}\varphi(x_m)\Big)\cdot\Big(\varphi(x_j) - \frac{1}{M}\sum_{n=1}^{M}\varphi(x_n)\Big)$
  – Then, we express $\tilde{K}^{test}$ in terms of $K^{test}$
    $\tilde{K}^{test} = K^{test} - 1'_M K - K^{test} 1_M + 1'_M K 1_M$
    • where $1'_M$ is an $L \times M$ matrix with all entries equal to $1/M$
  – From here, we can then find the principal components of the test data as
    $\tilde{V}^k\cdot\tilde{\varphi}(t) = \sum_{i=1}^{M}\tilde{\alpha}_i^k\,\tilde{\varphi}(x_i)\cdot\tilde{\varphi}(t) = \sum_{i=1}^{M}\tilde{\alpha}_i^k\,\tilde{K}(x_i, t)$
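The same idea for a batch of test points, sketched with the hypothetical helpers above; `K_train` is the uncentered training kernel matrix and `alphas` are the coefficients obtained from the centered problem:

```python
def kernel_pca_transform_test(X_train, K_train, alphas, T, gamma=1.0):
    """Project L test points (rows of T) onto the kernel principal components."""
    M, L = X_train.shape[0], T.shape[0]
    K_test = rbf_kernel(T, X_train, gamma)      # L x M matrix with entries K(t_i, x_j)
    one_M = np.full((M, M), 1.0 / M)
    one_LM = np.full((L, M), 1.0 / M)           # the 1'_M matrix (L x M, entries 1/M)
    K_test_c = K_test - one_LM @ K_train - K_test @ one_M + one_LM @ K_train @ one_M
    return K_test_c @ alphas                    # L x d matrix of projections
```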
Kernel PCA example
• Simple dataset with three modes, 20 samples per mode
• The (linear) PCA solution
  [Figure: data projected onto the first two principal components (PCA 1 and PCA 2)]
• The kernel PCA solution (Gaussian Kernel)
  [Figure: data projected onto kernel PCA components 1–4]
• More kernel PCA projections (out of 60)
  [Figure: data projected onto kernel PCA components 5–8]
Kernel LDA
• Assume a two-class discrimination problem, with $N_1$ and $N_2$ examples from classes $\omega_1$ and $\omega_2$, respectively
  – From L10, and under the homoscedastic Gaussian assumption, the optimum projection $v$ is obtained by maximizing the Rayleigh quotient
    $J(v) = \frac{v^T S_B v}{v^T S_W v}$
  – where
    $S_W = \sum_{i=1}^{2}\sum_{x\in\omega_i}(x - m_i)(x - m_i)^T$
    $S_B = (m_2 - m_1)(m_2 - m_1)^T$
    $m_i = \frac{1}{N_i}\sum_{j=1}^{N_i} x_j^i$
[Mika et al., 1999]
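For reference, a short sketch of this linear two-class LDA; it uses the standard closed-form direction $v \propto S_W^{-1}(m_2 - m_1)$, which maximizes the Rayleigh quotient above (the function name is illustrative, not from the lecture).

```python
def fisher_lda(X1, X2):
    """Two-class Fisher LDA: direction v maximizing J(v) = (v^T S_B v) / (v^T S_W v)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    v = np.linalg.solve(S_W, m2 - m1)                        # v proportional to S_W^{-1}(m2 - m1)
    return v / np.linalg.norm(v)
```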
• Can we solve this problem (implicitly) in a high-dimensional kernel space $F$ to yield a non-linear version of Fisher's LDA?
  – To do so, we would define the between-class and within-class scatter matrices in kernel space $F$ to obtain the following quotient
    $J(V) = \frac{V^T S_B^\Phi V}{V^T S_W^\Phi V}$
  – where now $V \in F$, and the class means and scatter matrices are defined in $F$ as
    $S_W^\Phi = \sum_{i=1}^{2}\sum_{x\in\omega_i}\big(\varphi(x) - m_i^\Phi\big)\big(\varphi(x) - m_i^\Phi\big)^T$
    $S_B^\Phi = \big(m_2^\Phi - m_1^\Phi\big)\big(m_2^\Phi - m_1^\Phi\big)^T$
    $m_i^\Phi = \frac{1}{N_i}\sum_{j=1}^{N_i}\varphi(x_j^i)$
– As earlier, we make use of the fact that the eigenvector $V$ can be expressed as a linear combination of the training data
  $V = \sum_{j=1}^{N}\alpha_j\varphi(x_j)$
– which, when multiplied by $m_i^\Phi$, yields
  $V^T m_i^\Phi = \Big(\sum_{j=1}^{N}\alpha_j\varphi(x_j)\Big)^T \frac{1}{N_i}\sum_{k=1}^{N_i}\varphi(x_k^i) = \frac{1}{N_i}\sum_{j=1}^{N}\sum_{k=1}^{N_i}\alpha_j K(x_j, x_k^i) = \alpha^T M_i$
  • where we have defined
    $(M_i)_j := \frac{1}{N_i}\sum_{k=1}^{N_i} K(x_j, x_k^i)$
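Given a precomputed $N \times N$ kernel matrix, each vector $M_i$ is just a per-row average over the columns belonging to class $\omega_i$; a minimal sketch, with `idx_i` denoting the (assumed) indices of class $\omega_i$:

```python
def class_kernel_mean(K, idx_i):
    """(M_i)_j = (1/N_i) * sum_k K(x_j, x_k^i): mean kernel value of each training
    point against the members of class omega_i."""
    return K[:, idx_i].mean(axis=1)     # length-N vector M_i
```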
– Merging this result with the definition of $S_B^\Phi$ yields the following expression for the numerator
  $V^T S_B^\Phi V = \alpha^T M\alpha$
  • where
    $M = (M_1 - M_2)(M_1 - M_2)^T$
– Likewise, merging with the definition of $S_W^\Phi$ yields
  $V^T S_W^\Phi V = \alpha^T N\alpha$
  – where
    $N := \sum_{j=1}^{2} K_j\big(I - 1_{N_j}\big)K_j^T$
  – where $I$ is an $N_j \times N_j$ identity matrix, $1_{N_j}$ is an $N_j \times N_j$ matrix with all entries equal to $1/N_j$, and $K_j$ is an $N \times N_j$ matrix such that
    $(K_j)_{nm} := K(x_n, x_m^j)$
– Combining these results, we obtain a new expression for the Rayleigh quotient
  $J(\alpha) = \frac{\alpha^T M\alpha}{\alpha^T N\alpha}$
  • which can be solved by finding the leading eigenvector of $N^{-1}M$
– And the projection of a new pattern $t$ is given by
  $V\cdot\varphi(t) = \sum_{i=1}^{N}\alpha_i K(x_i, t)$
• Regularization
  – To avoid numerical ill-conditioning, one may regularize matrix $N$ by adding a multiple of the identity matrix: $N_\mu = N + \mu I$
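The pieces above can be assembled into a minimal kernel Fisher discriminant sketch; it reuses the hypothetical `rbf_kernel` helper from the kernel PCA example and treats `mu` as the regularization constant (all names are illustrative, not from the lecture).

```python
def kernel_lda_fit(X1, X2, gamma=1.0, mu=1e-3):
    """Kernel LDA: coefficients alpha maximizing J(alpha) = (alpha^T M alpha) / (alpha^T N alpha)."""
    X = np.vstack([X1, X2])
    N1, N2 = X1.shape[0], X2.shape[0]
    K = rbf_kernel(X, X, gamma)                        # full N x N kernel matrix
    idx = [np.arange(N1), np.arange(N1, N1 + N2)]      # indices of each class
    M1, M2 = (K[:, i].mean(axis=1) for i in idx)       # (M_i)_j = (1/N_i) sum_k K(x_j, x_k^i)
    M_mat = np.outer(M1 - M2, M1 - M2)                 # M = (M_1 - M_2)(M_1 - M_2)^T
    N_mat = np.zeros((N1 + N2, N1 + N2))
    for i, Ni in zip(idx, (N1, N2)):
        Kj = K[:, i]                                   # N x N_j block
        N_mat += Kj @ (np.eye(Ni) - np.full((Ni, Ni), 1.0 / Ni)) @ Kj.T
    N_mat += mu * np.eye(N1 + N2)                      # regularization: N + mu * I
    evals, evecs = np.linalg.eig(np.linalg.solve(N_mat, M_mat))  # leading eigenvector of N^-1 M
    alpha = np.real(evecs[:, np.argmax(np.real(evals))])
    return alpha, X

def kernel_lda_project(alpha, X_train, t, gamma=1.0):
    """Projection of a new pattern t: sum_i alpha_i K(x_i, t)."""
    return float(alpha @ rbf_kernel(X_train, t[None, :], gamma).ravel())
```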
Kernel LDA examples
  [Figure: first example dataset]
  [Figure: LDA vs. kernel LDA projections of the first example dataset]
  [Figure: second example dataset]
  [Figure: LDA vs. kernel LDA projections of the second example dataset]