29
28.06.12 1 CAC-2012 Robust SIMCA Robust SIMCA bearing on non bearing on non - - robust PCA robust PCA Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA Use and abuse of robust PCA

Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 1CAC-2012

Robust SIMCA Robust SIMCA bearing on nonbearing on non--robust PCArobust PCA

Alexey Pomerantsev & Oxana Rodionova

Semenov Institute of Chemical Physics, Moscow

Use and abuse of robust PCAUse and abuse of robust PCA

Page 2: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 2CAC-2012

Contaminated dataContaminated data

Rγ=(1–γ)R+γD

Few outliers Two groups

Page 3: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 3CAC-2012

Classical methods for contaminated data Classical methods for contaminated data

Masking effect Swamping effect

Page 4: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 4CAC-2012

Classical and robust statisticsClassical and robust statisticsRobustClassical

( )xmedian ~ =x

∑=

=I

iix

Ix

1

1

( )xs ~median1.4826 MAD −= x

∑=

−−

=I

ii xx

Is

1

22 )(1

1

Classical

Robust

outlier

Page 5: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 5CAC-2012

Extremes and OutliersExtremes and Outliers

)2(

),(

222 χ∝+

∝⎟⎠⎞

⎜⎝⎛

yx

Nyx I0

)1|2(222 α−χ≤+ −yx

( )Iyx 1222 )1(|2 β−χ≤+ −

α is Extreme significance

β is Outlier significance

α=0.01β=0.05

Page 6: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 6CAC-2012

ProblemProblem

OK

Regular data

OKRobustmethods

BADClassicalmethods

Contaminated data

?

Page 7: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 7CAC-2012

Principal Component AnalysisPrincipal Component Analysis

I

A

TA

A PA

EA+X I= × J

J

t

I

J

Page 8: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 8CAC-2012

PCA PCA robustificationrobustification

(1) Data pre-processing;

(2) Decomposition;

(3) Calculation of thresholds.

Page 9: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 9CAC-2012

Scores & Orthogonal DistancesScores & Orthogonal DistancesSD:

distance within the model

∑=

λ==

A

a a

iaiii

th1

21tt )( tTTt

OD:distance to the model

∑∑

+==

==K

Aaia

J

jiji tev

1

2

1

2

Page 10: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 10CAC-2012

Data Driven SIMCAData Driven SIMCA

SD OD

hu =

v (u1,...., uI )% (u0/N) χ2(N)

u0= ?

N = ?

J. Chemometrics 22 (2008) 601-609

Page 11: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 11CAC-2012

Tolerance AreasTolerance Areas

α is Extreme significance

β is Outlier significance

00 vvN

hhNz vh +=

)(2vh NNz +χ∝

)1|(2 α−+χ≤ −vh NNz

( )Ivh NNz 12 )1(| β−+χ≤ −

Page 12: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 12CAC-2012

Classical Data Driven (CDD) SIMCA Classical Data Driven (CDD) SIMCA Classical Method of Moments

2

20

0ˆ2intˆ,ˆusuNuu ==

∑∑==

−−

==I

iiu

I

ii uu

Isu

Iu

1

22

1)(

11,1

(u1,...., uI )% (u0/N) χ2(N)Given

Then

Where

Page 13: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 13CAC-2012

Robust Data Driven (RDD) SIMCA Robust Data Driven (RDD) SIMCA Robust Method of Moments

M=median(u) R=interquartile(u)

Given

Then

Where

[ ]⎪⎪⎩

⎪⎪⎨

χ−χ=

χ=⇐

−−

),25.0(),75.0(

),5.0(

~

~

220

200

NNNuR

NNuM

N

u

(u1,...., uI )% (u0/N) χ2(N)

u(1) ≤ u(2 )≤ .... ≤ u(I-1) ≤ u(I)

½ ½

¼ ¼

Page 14: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 14CAC-2012

Dual Data Driven (3D) SIMCA Dual Data Driven (3D) SIMCA Given

Then

X=TtP+Eh=(h1,...., hI) v=(v1,...., vI)

NoRDD SIMCA

YesCDD SIMCA

RDD SIMCACDD SIMCA

( ) ( )vh NvNh ˆˆˆˆ00

( ) ( )vh NvNh ~~~~00

( ) ( )vvhh NNNN ˆ~&ˆ~≈≈

Page 15: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 15CAC-2012

TOMCAT TOMCAT ToolBoxToolBox

M. Daszykowski, S. Serneels, K. Kaczmarek, P. Van Espen, C. Croux, B. Walczak, ChemoLab 85 (2007) 269-277

http://chemometria.us.edu.pl/RobustToolbox/

Robust PCArobust PCs, robust singular values

Robust classification rulesz-transformed robust OD and SD

)(

)(medianODROD

)(

)(medianSDRD

OD

OD

SD

SD

n

ii

n

ii

Q

Q

−=

−=

Robust pre-processingrobust centering & scaling

Page 16: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 16CAC-2012

Case study I. Simulated regular dataCase study I. Simulated regular data

),(N),(N 2I0εV0δ

εδx

σ∝∝

+=

The numbers of variables, J=3

The numbers of objects, I=100

The number of principal components, A=2

The δ properties are:

E(δ) = 0, v11= v22 = v33 = 0.28, rank(V) = 2.

The ε component properties are:

E(ε) = 0, σ=0.05

Page 17: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 17CAC-2012

SIMCA plotsSIMCA plots

ext=5

out=11

out=0out=12

CDD SIMCA TOMCAT

extreme area (α=0.05) outlier area (β=0.05)

Page 18: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 18CAC-2012

Totally in 10 regular data setsTotally in 10 regular data sets

70

60

Expected

Page 19: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 19CAC-2012

Case study II. Simulated data with outliersCase study II. Simulated data with outliers

),(N),(N 2I0εV0δ

εδx

σ∝∝

+=

The numbers of variables, J=3

The numbers of objects, I=100

The number of principal components, A=2

The δ properties are:

E(δ) = 0, v11= v22 = v33 = 0.28, rank(V) = 2.

The ε component properties are:

E(ε) = 0, σ=0.05 (first 97 objects)

E(ε) = 0, σ =0.2 (last 3 objects)

Page 20: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 20CAC-2012

SIMCA plotsSIMCA plots

Page 21: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 21CAC-2012

REFERENCE & RDDREFERENCE & RDD--SIMCASIMCA

Page 22: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 22CAC-2012

Totally in 10 data sets with outliersTotally in 10 data sets with outliers

Expected

Page 23: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 23CAC-2012

Case study II. Real world data with 2 groupsCase study II. Real world data with 2 groups

Substance in the closed PE bags,

82 drums measured by NIR.

Totally: 246 spectra

Group G1: 196 objects

Group G2: 50 objects ACA 642 (2009) 222-227

Page 24: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 24CAC-2012

Expected/observed number of extremesExpected/observed number of extremes

Clean subset G1 Contaminated dataset G1+G2

Expected number of extremes N=αI

Page 25: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 25CAC-2012

Results of separationResults of separation

Subset G1 revealed Subset G2 revealed

Page 26: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 26CAC-2012

Conclusion 1Conclusion 1Each tool has its purpose: classical methods are for regular data, whereas robust methods should be used for contaminated data. Do not expect that there exists a common tool that yields reasonable results in both cases.

Page 27: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 27CAC-2012

Conclusion 2Conclusion 2

Extreme objects play an important role in data analysis. These objects should not be confused with outliers. The number of extremes should be compared to the expected number, coupled with the significance level α.

Clean dataset Contaminated dataset

Page 28: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 28CAC-2012

Conclusion 3Conclusion 3The proposed Dual Data Driven PCA/SIMCA approach looks like a fine competitor to the pure classical and to the strictly robust methods. This technique has demonstrated a proper performance in the analysis of both regular and contaminated data sets.

Clean dataset Contaminated dataset

Page 29: Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov Institute of Chemical Physics, Moscow Use and abuse of robust PCA. 28.06.12 CAC-2012

28.06.12 29CAC-2012

Thank you for Thank you for your attentionyour attention

A Lawyer’s Mistake