CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU

L28: kernel-based feature extraction

• Kernel PCA

• Kernel LDA


Principal Components Analysis

• As we saw in L9, PCA can only extract a linear projection of the data

– To do so, we first compute the covariance matrix

$C = \frac{1}{M}\sum_{j=1}^{M} x_j x_j^T$

– Then, we find the eigenvectors and eigenvalues: $Cv = \lambda v$

– And, finally, we project onto the eigenvectors with the largest eigenvalues: $y = [v_1\ v_2\ \dots\ v_D]^T x$

• Can the kernel trick be used to perform this operation implicitly in a higher-dimensional space?

– If so, this would be equivalent to performing non-linear PCA in the feature space
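Before kernelizing, a minimal NumPy sketch of the linear steps above; the function and variable names are illustrative, not from the lecture:

```python
# Linear PCA sketch: center, covariance, eigendecomposition, projection.
import numpy as np

def linear_pca(X, d):
    """X: (M, N) data matrix with one sample per row; d: number of components to keep."""
    Xc = X - X.mean(axis=0)                # center the data
    C = (Xc.T @ Xc) / Xc.shape[0]          # C = (1/M) sum_j x_j x_j^T
    lam, V = np.linalg.eigh(C)             # eigenvalues in ascending order
    V = V[:, ::-1][:, :d]                  # keep eigenvectors with the largest eigenvalues
    return Xc @ V                          # y = [v_1 ... v_d]^T x, one row per sample
```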


Kernel PCA

• To derive kernel PCA

– We would first project the data into the high-dimensional feature space $F$

$\Phi: \mathbb{R}^N \rightarrow F;\ x \mapsto X$

– Then we would compute the covariance matrix

$C^F = \frac{1}{M}\sum_{j=1}^{M} \varphi(x_j)\varphi(x_j)^T$

• where we have assumed that the data in $F$ is centered, $E[\varphi(x)] = 0$ (more on this later)

– Then we would compute the principal components by solving the eigenvalue problem

$C^F v = \lambda v$

– The challenge is… how do we do this implicitly?

Schölkopf et al. (Neural Computation, 1998)


• Solution

– As we saw in the snapshot PCA lecture, the eigenvectors can be expressed as linear combinations of the training data

$C^F V = \frac{1}{M}\sum_{i=1}^{M} \varphi(x_i)\varphi(x_i)^T V = \lambda V \;\Rightarrow$

$V = \frac{1}{M\lambda}\sum_{i=1}^{M} \varphi(x_i)\varphi(x_i)^T V = \sum_{i=1}^{M} \frac{\varphi(x_i)^T V}{M\lambda}\,\varphi(x_i) = \sum_{i=1}^{M} \alpha_i \varphi(x_i)$

– We then multiply both sides of $\lambda V = C^F V$ by $\varphi(x_k)$

$\lambda\,\varphi(x_k) \cdot V = \varphi(x_k) \cdot C^F V$

– which, combining with the previous expression

$\lambda\,\varphi(x_k) \cdot \sum_{i=1}^{M} \alpha_i \varphi(x_i) = \varphi(x_k) \cdot \frac{1}{M}\sum_{j=1}^{M} \varphi(x_j)\varphi(x_j)^T \sum_{i=1}^{M} \alpha_i \varphi(x_i)$

– and regrouping terms, yields

$\lambda \sum_{i=1}^{M} \alpha_i \left(\varphi(x_k) \cdot \varphi(x_i)\right) = \frac{1}{M}\sum_{i=1}^{M} \alpha_i\, \varphi(x_k) \cdot \sum_{j=1}^{M} \varphi(x_j)\left(\varphi(x_j) \cdot \varphi(x_i)\right)$


– Defining an 𝑀 × 𝑀 matrix 𝐾 as

$K_{ij} := \varphi(x_i) \cdot \varphi(x_j)$

– the previous expression becomes $M\lambda K\alpha = K^2\alpha$

– which can be solved through the eigenvalue problem

$M\lambda\,\alpha = K\alpha$

• Normalization

– To ensure that the eigenvectors $V$ are orthonormal, we then scale the eigenvectors $\alpha$

$V^k \cdot V^k = 1 \;\Rightarrow\; \sum_{i=1}^{M} \alpha_i^k \varphi(x_i) \cdot \sum_{j=1}^{M} \alpha_j^k \varphi(x_j) = 1$

$\Rightarrow\; \sum_{i,j=1}^{M} \alpha_i^k \alpha_j^k\, \varphi(x_i) \cdot \varphi(x_j) = 1 \;\Rightarrow\; \sum_{i,j=1}^{M} \alpha_i^k \alpha_j^k K_{ij} = 1 \;\Rightarrow\; \alpha^k \cdot K\alpha^k = 1$

– which, since 𝛼 are the eigenvectors of 𝐾, yields

$\lambda_k \left(\alpha^k \cdot \alpha^k\right) = 1$
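As a sketch (assuming the kernel matrix is already centered; variable names are mine, not from the slides), the eigenproblem and the rescaling of the α vectors look like this:

```python
# Solve M*lambda*alpha = K*alpha and rescale so that lambda_k (alpha^k . alpha^k) = 1,
# which makes the implicit eigenvectors V^k = sum_i alpha_i^k phi(x_i) unit-norm.
import numpy as np

def kpca_alphas(K, d):
    """K: (M, M) centered kernel matrix; d: number of components to keep."""
    lam, alpha = np.linalg.eigh(K)                      # ascending eigenvalues, columns alpha^k
    lam, alpha = lam[::-1][:d], alpha[:, ::-1][:, :d]   # keep the d leading components
    alpha = alpha / np.sqrt(lam)                        # assumes the d leading eigenvalues are > 0
    return lam, alpha
```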


• To find the k-th principal component of a new sample x

$V^k \cdot \varphi(x) = \sum_{i=1}^{M} \alpha_i^k\, \varphi(x_i) \cdot \varphi(x) = \sum_{i=1}^{M} \alpha_i^k K(x_i, x)$

– Note that, when the kernel function is the dot-product, the kernel PCA solution reduces to the snapshot PCA solution

– However, unlike in snapshot PCA, here we will be unable to recover the eigenvectors themselves, since they reside in the high-dimensional space $F$

$V = \sum_{i=1}^{M} \alpha_i \varphi(x_i)$

– This implies that kernel PCA can be used for feature extraction but CANNOT be used (at least directly) for reconstruction purposes
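A sketch of the projection step above, assuming the α vectors computed earlier and a user-supplied kernel function (names are illustrative):

```python
# Project a new sample onto the retained kernel principal components:
# (V^k . phi(x)) = sum_i alpha_i^k K(x_i, x)
import numpy as np

def kpca_project(x, X_train, alpha, kernel):
    """X_train: (M, N) training data; alpha: (M, d) rescaled eigenvectors; kernel: k(a, b) -> float."""
    k_x = np.array([kernel(x_i, x) for x_i in X_train])   # vector of K(x_i, x)
    return alpha.T @ k_x                                    # d component scores
```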


Centering in the high-dimensional space

• Earlier we assumed that the data was centered in $F$; to enforce this, we define the centered features

$\tilde{\varphi}(x_i) := \varphi(x_i) - \frac{1}{M}\sum_{m=1}^{M} \varphi(x_m)$

– So the kernel matrix in this centered space is

$\tilde{K}_{ij} = \tilde{\varphi}(x_i) \cdot \tilde{\varphi}(x_j)$

– And the eigenvalue problem that we need to solve is

$\tilde{\lambda}\,\tilde{\alpha} = \tilde{K}\tilde{\alpha}$

– Merging the first expression into the second one

$\tilde{K}_{ij} = \left(\varphi(x_i) - \frac{1}{M}\sum_{m=1}^{M}\varphi(x_m)\right) \cdot \left(\varphi(x_j) - \frac{1}{M}\sum_{n=1}^{M}\varphi(x_n)\right)$

$= K_{ij} - \frac{1}{M}\sum_{m=1}^{M} 1_{im}K_{mj} - \frac{1}{M}\sum_{n=1}^{M} K_{in}1_{nj} + \frac{1}{M^2}\sum_{m,n=1}^{M} 1_{im}K_{mn}1_{nj}$

$= \left(K - 1_M K - K 1_M + 1_M K 1_M\right)_{ij}$

• where $1_{ij} = 1$ (for all $i,j$) and $\left(1_M\right)_{ij} := 1/M$

– So the centered kernel matrix can be computed from the uncentered one
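In matrix form this is a one-liner; a sketch (the helper name is mine):

```python
# Center the kernel matrix without computing phi(x): K~ = K - 1_M K - K 1_M + 1_M K 1_M
import numpy as np

def center_kernel(K):
    M = K.shape[0]
    one_M = np.full((M, M), 1.0 / M)     # (1_M)_{ij} = 1/M
    return K - one_M @ K - K @ one_M + one_M @ K @ one_M
```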


• To project new test data $t_1, t_2, \ldots, t_L$

– First, we define two matrices

$K^{test}_{ij} = \varphi(t_i) \cdot \varphi(x_j)$

$\tilde{K}^{test}_{ij} = \left(\varphi(t_i) - \frac{1}{M}\sum_{m=1}^{M}\varphi(x_m)\right) \cdot \left(\varphi(x_j) - \frac{1}{M}\sum_{n=1}^{M}\varphi(x_n)\right)$

– Then, we express $\tilde{K}^{test}$ in terms of $K^{test}$

$\tilde{K}^{test} = K^{test} - 1'_M K - K^{test} 1_M + 1'_M K 1_M$

• where $1'_M$ is an $L \times M$ matrix with all entries equal to $1/M$

– From here, we can then find the principal components of test data as

$\tilde{V}^k \cdot \tilde{\varphi}(t) = \sum_{i=1}^{M} \tilde{\alpha}_i^k\, \tilde{\varphi}(x_i) \cdot \tilde{\varphi}(t) = \sum_{i=1}^{M} \tilde{\alpha}_i^k \tilde{K}(x_i, t)$
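A sketch of this test-side computation (the L×M matrix of test kernel values is assumed precomputed; names are illustrative):

```python
# Center the test kernel against the training data and project:
# K~test = Ktest - 1'_M K - Ktest 1_M + 1'_M K 1_M, then scores = K~test alpha
import numpy as np

def kpca_transform_test(K_test, K, alpha):
    """K_test: (L, M) with K_test[l, i] = K(t_l, x_i); K: (M, M) uncentered training kernel."""
    L, M = K_test.shape
    one_M = np.full((M, M), 1.0 / M)
    one_LM = np.full((L, M), 1.0 / M)          # the L x M matrix 1'_M
    K_test_c = K_test - one_LM @ K - K_test @ one_M + one_LM @ K @ one_M
    return K_test_c @ alpha                     # one row of component scores per test point
```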


Kernel PCA example

• Simple dataset with three modes, 20 samples per mode

[Figure: scatter plot of the three-mode dataset]


• The (linear) PCA solution

[Figure: data projected onto PCA 1 and PCA 2]


• The kernel PCA solution (Gaussian Kernel)

[Figure: data projected onto kernel PCA components 1–4]


• More kernel PCA projections (out of 60)

[Figure: data projected onto kernel PCA components 5–8]
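A hedged sketch of a comparable experiment using scikit-learn; the mode centers, noise level, and kernel width are assumptions, not the values behind the lecture figures:

```python
# Three Gaussian modes, 20 samples each; compare linear PCA with Gaussian-kernel PCA.
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])        # assumed mode centers
X = np.vstack([c + 0.05 * rng.standard_normal((20, 2)) for c in centers])

Y_pca = PCA(n_components=2).fit_transform(X)                     # linear projections
Y_kpca = KernelPCA(n_components=8, kernel="rbf", gamma=15.0).fit_transform(X)
# Columns of Y_kpca correspond to kernel PCA 1..8, as in the plots above.
```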


Kernel LDA

• Assume a two-class discrimination problem, with $N_1$ and $N_2$ examples from classes $\omega_1$ and $\omega_2$, respectively

– From L10, and under the homoscedastic Gaussian assumption, the optimum projection $v$ is obtained by maximizing the Rayleigh quotient

$J(v) = \frac{v^T S_B v}{v^T S_W v}$

– where

$S_W = \sum_{i=1}^{2} \sum_{x \in \omega_i} (x - m_i)(x - m_i)^T$

$S_B = (m_2 - m_1)(m_2 - m_1)^T$

$m_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_j^i$

[Mika et al., 1999]
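For reference, a minimal NumPy sketch of this (linear) two-class Fisher LDA; names are illustrative:

```python
# Two-class Fisher LDA: v maximizes J(v) = (v^T S_B v)/(v^T S_W v), i.e. v ~ S_W^{-1}(m_2 - m_1)
import numpy as np

def fisher_lda(X1, X2):
    """X1: (N1, D) class-1 samples; X2: (N2, D) class-2 samples; returns the projection v."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    v = np.linalg.solve(Sw, m2 - m1)
    return v / np.linalg.norm(v)
```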


• Can we solve this problem (implicitly) in a high-dimensional kernel space $F$ to yield a non-linear version of Fisher's LDA?

– To do so, we would define between-class and within-class covariance matrices in the kernel space $F$ to obtain the following quotient

$J(v) = \frac{v^T S_B^\Phi v}{v^T S_W^\Phi v}$

– where now $v \in F$, and mean and covariance are defined in $F$ as

$S_W^\Phi = \sum_{i=1}^{2} \sum_{x \in \omega_i} \left(\varphi(x) - m_i^\Phi\right)\left(\varphi(x) - m_i^\Phi\right)^T$

$S_B^\Phi = \left(m_2^\Phi - m_1^\Phi\right)\left(m_2^\Phi - m_1^\Phi\right)^T$

$m_i^\Phi = \frac{1}{N_i} \sum_{j=1}^{N_i} \varphi(x_j^i)$


– As earlier, we make use of the fact that the eigenvector $V$ can be expressed as a linear combination of the training data

$V = \sum_{j=1}^{N} \alpha_j \varphi(x_j)$

– which, when multiplied by $m_i^\Phi$, yields

$V^T m_i^\Phi = \left(\sum_{j=1}^{N} \alpha_j \varphi(x_j)\right)^T \frac{1}{N_i}\sum_{k=1}^{N_i} \varphi(x_k^i) = \frac{1}{N_i}\sum_{j=1}^{N}\sum_{k=1}^{N_i} \alpha_j K(x_j, x_k^i) = \alpha^T M_i$

• where we have defined

$\left(M_i\right)_j := \frac{1}{N_i}\sum_{k=1}^{N_i} K(x_j, x_k^i)$


– Merging this result with the definition of $S_B^\Phi$ yields the following expression for the numerator

$V^T S_B^\Phi V = \alpha^T M \alpha$

• where

$M = (M_1 - M_2)(M_1 - M_2)^T$

– Likewise, merging with the definition of $S_W^\Phi$ yields

$V^T S_W^\Phi V = \alpha^T N \alpha$

– where

$N := \sum_{j=1}^{2} K_j \left(I - 1_{N_j}\right) K_j^T$

– where $I$ is an $N_j \times N_j$ identity matrix, $1_{N_j}$ is an $N_j \times N_j$ matrix with all entries equal to $1/N_j$, and $K_j$ is an $N \times N_j$ matrix such that $\left(K_j\right)_{nm} := K(x_n, x_m^j)$
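A sketch of how $M$ and $N$ can be assembled from the full kernel matrix (the function and index arguments are mine, not from the slides):

```python
# Build M = (M_1 - M_2)(M_1 - M_2)^T and N = sum_j K_j (I - 1_{N_j}) K_j^T
# from the full N x N kernel matrix and the per-class column indices.
import numpy as np

def kfd_matrices(K_full, idx1, idx2):
    M_vecs, N_mat = [], np.zeros_like(K_full)
    for idx in (idx1, idx2):
        Kj = K_full[:, idx]                                  # (K_j)_{nm} = K(x_n, x_m^j)
        Nj = len(idx)
        M_vecs.append(Kj.mean(axis=1))                       # (M_i)_j = (1/N_i) sum_k K(x_j, x_k^i)
        N_mat = N_mat + Kj @ (np.eye(Nj) - np.full((Nj, Nj), 1.0 / Nj)) @ Kj.T
    d = M_vecs[0] - M_vecs[1]
    return np.outer(d, d), N_mat                              # the M matrix and the N matrix
```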


– Combining these results, we obtain a new expression for the Rayleigh quotient

$J(\alpha) = \frac{\alpha^T M \alpha}{\alpha^T N \alpha}$

• which can be solved by finding the leading eigenvector of $N^{-1}M$

– And the projection of a new pattern 𝑡 is given by

$V \cdot \varphi(t) = \sum_{i=1}^{N} \alpha_i K(x_i, t)$

• Regularization

– To avoid numerical ill-conditioning, one may regularize the matrix $N$ by adding a multiple of the identity matrix: $N_\mu = N + \mu I$
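A sketch of the final step, using the rank-one structure of $M$ (α ∝ N⁻¹(M₁ − M₂), which gives the leading eigenvector of N⁻¹M up to scale); the regularization value μ is an assumption:

```python
# Regularize N, solve for alpha, and project a new pattern t: V . phi(t) = sum_i alpha_i K(x_i, t)
import numpy as np

def kfd_direction(M1_vec, M2_vec, N_mat, mu=1e-3):
    """M1_vec, M2_vec: class-mean kernel vectors M_1, M_2; mu: regularization strength (assumed)."""
    N_reg = N_mat + mu * np.eye(N_mat.shape[0])     # N_mu = N + mu * I
    return np.linalg.solve(N_reg, M1_vec - M2_vec)  # alpha, up to an irrelevant scale factor

def kfd_project(t, X_train, alpha, kernel):
    return sum(a * kernel(x_i, t) for a, x_i in zip(alpha, X_train))
```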


Kernel LDA examples

[Figure: first example dataset]


[Figure: LDA vs. kernel LDA projections for the first dataset]


[Figure: second example dataset]


[Figure: LDA vs. kernel LDA projections for the second dataset]