Upload
others
View
13
Download
0
Embed Size (px)
Citation preview
28.06.12 1CAC-2012
Robust SIMCA Robust SIMCA bearing on nonbearing on non--robust PCArobust PCA
Alexey Pomerantsev & Oxana Rodionova
Semenov Institute of Chemical Physics, Moscow
Use and abuse of robust PCAUse and abuse of robust PCA
28.06.12 2CAC-2012
Contaminated dataContaminated data
Rγ=(1–γ)R+γD
Few outliers Two groups
28.06.12 3CAC-2012
Classical methods for contaminated data Classical methods for contaminated data
Masking effect Swamping effect
28.06.12 4CAC-2012
Classical and robust statisticsClassical and robust statisticsRobustClassical
( )xmedian ~ =x
∑=
=I
iix
Ix
1
1
( )xs ~median1.4826 MAD −= x
∑=
−−
=I
ii xx
Is
1
22 )(1
1
Classical
Robust
outlier
28.06.12 5CAC-2012
Extremes and OutliersExtremes and Outliers
)2(
),(
222 χ∝+
∝⎟⎠⎞
⎜⎝⎛
yx
Nyx I0
)1|2(222 α−χ≤+ −yx
( )Iyx 1222 )1(|2 β−χ≤+ −
α is Extreme significance
β is Outlier significance
α=0.01β=0.05
28.06.12 6CAC-2012
ProblemProblem
OK
Regular data
OKRobustmethods
BADClassicalmethods
Contaminated data
?
28.06.12 7CAC-2012
Principal Component AnalysisPrincipal Component Analysis
I
A
TA
A PA
EA+X I= × J
J
t
I
J
28.06.12 8CAC-2012
PCA PCA robustificationrobustification
(1) Data pre-processing;
(2) Decomposition;
(3) Calculation of thresholds.
28.06.12 9CAC-2012
Scores & Orthogonal DistancesScores & Orthogonal DistancesSD:
distance within the model
∑=
−
λ==
A
a a
iaiii
th1
21tt )( tTTt
OD:distance to the model
∑∑
+==
==K
Aaia
J
jiji tev
1
2
1
2
28.06.12 10CAC-2012
Data Driven SIMCAData Driven SIMCA
SD OD
hu =
v (u1,...., uI )% (u0/N) χ2(N)
u0= ?
N = ?
J. Chemometrics 22 (2008) 601-609
28.06.12 11CAC-2012
Tolerance AreasTolerance Areas
α is Extreme significance
β is Outlier significance
00 vvN
hhNz vh +=
)(2vh NNz +χ∝
)1|(2 α−+χ≤ −vh NNz
( )Ivh NNz 12 )1(| β−+χ≤ −
28.06.12 12CAC-2012
Classical Data Driven (CDD) SIMCA Classical Data Driven (CDD) SIMCA Classical Method of Moments
2
20
0ˆ2intˆ,ˆusuNuu ==
∑∑==
−−
==I
iiu
I
ii uu
Isu
Iu
1
22
1)(
11,1
(u1,...., uI )% (u0/N) χ2(N)Given
Then
Where
28.06.12 13CAC-2012
Robust Data Driven (RDD) SIMCA Robust Data Driven (RDD) SIMCA Robust Method of Moments
M=median(u) R=interquartile(u)
Given
Then
Where
[ ]⎪⎪⎩
⎪⎪⎨
⎧
χ−χ=
χ=⇐
−−
−
),25.0(),75.0(
),5.0(
~
~
220
200
NNNuR
NNuM
N
u
(u1,...., uI )% (u0/N) χ2(N)
u(1) ≤ u(2 )≤ .... ≤ u(I-1) ≤ u(I)
½ ½
¼ ¼
28.06.12 14CAC-2012
Dual Data Driven (3D) SIMCA Dual Data Driven (3D) SIMCA Given
Then
X=TtP+Eh=(h1,...., hI) v=(v1,...., vI)
NoRDD SIMCA
YesCDD SIMCA
RDD SIMCACDD SIMCA
( ) ( )vh NvNh ˆˆˆˆ00
( ) ( )vh NvNh ~~~~00
( ) ( )vvhh NNNN ˆ~&ˆ~≈≈
28.06.12 15CAC-2012
TOMCAT TOMCAT ToolBoxToolBox
M. Daszykowski, S. Serneels, K. Kaczmarek, P. Van Espen, C. Croux, B. Walczak, ChemoLab 85 (2007) 269-277
http://chemometria.us.edu.pl/RobustToolbox/
Robust PCArobust PCs, robust singular values
Robust classification rulesz-transformed robust OD and SD
)(
)(medianODROD
)(
)(medianSDRD
OD
OD
SD
SD
n
ii
n
ii
Q
Q
−=
−=
Robust pre-processingrobust centering & scaling
28.06.12 16CAC-2012
Case study I. Simulated regular dataCase study I. Simulated regular data
),(N),(N 2I0εV0δ
εδx
σ∝∝
+=
The numbers of variables, J=3
The numbers of objects, I=100
The number of principal components, A=2
The δ properties are:
E(δ) = 0, v11= v22 = v33 = 0.28, rank(V) = 2.
The ε component properties are:
E(ε) = 0, σ=0.05
28.06.12 17CAC-2012
SIMCA plotsSIMCA plots
ext=5
out=11
out=0out=12
CDD SIMCA TOMCAT
extreme area (α=0.05) outlier area (β=0.05)
28.06.12 18CAC-2012
Totally in 10 regular data setsTotally in 10 regular data sets
70
60
Expected
28.06.12 19CAC-2012
Case study II. Simulated data with outliersCase study II. Simulated data with outliers
),(N),(N 2I0εV0δ
εδx
σ∝∝
+=
The numbers of variables, J=3
The numbers of objects, I=100
The number of principal components, A=2
The δ properties are:
E(δ) = 0, v11= v22 = v33 = 0.28, rank(V) = 2.
The ε component properties are:
E(ε) = 0, σ=0.05 (first 97 objects)
E(ε) = 0, σ =0.2 (last 3 objects)
28.06.12 20CAC-2012
SIMCA plotsSIMCA plots
28.06.12 21CAC-2012
REFERENCE & RDDREFERENCE & RDD--SIMCASIMCA
28.06.12 22CAC-2012
Totally in 10 data sets with outliersTotally in 10 data sets with outliers
Expected
28.06.12 23CAC-2012
Case study II. Real world data with 2 groupsCase study II. Real world data with 2 groups
Substance in the closed PE bags,
82 drums measured by NIR.
Totally: 246 spectra
Group G1: 196 objects
Group G2: 50 objects ACA 642 (2009) 222-227
28.06.12 24CAC-2012
Expected/observed number of extremesExpected/observed number of extremes
Clean subset G1 Contaminated dataset G1+G2
Expected number of extremes N=αI
28.06.12 25CAC-2012
Results of separationResults of separation
Subset G1 revealed Subset G2 revealed
28.06.12 26CAC-2012
Conclusion 1Conclusion 1Each tool has its purpose: classical methods are for regular data, whereas robust methods should be used for contaminated data. Do not expect that there exists a common tool that yields reasonable results in both cases.
28.06.12 27CAC-2012
Conclusion 2Conclusion 2
Extreme objects play an important role in data analysis. These objects should not be confused with outliers. The number of extremes should be compared to the expected number, coupled with the significance level α.
Clean dataset Contaminated dataset
28.06.12 28CAC-2012
Conclusion 3Conclusion 3The proposed Dual Data Driven PCA/SIMCA approach looks like a fine competitor to the pure classical and to the strictly robust methods. This technique has demonstrated a proper performance in the analysis of both regular and contaminated data sets.
Clean dataset Contaminated dataset
28.06.12 29CAC-2012
Thank you for Thank you for your attentionyour attention
A Lawyer’s Mistake