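Principal Component Analysis
Principal component analysis (PCA): a method for reducing multidimensional data to a small number of dimensions that retain high explanatory power,
subject to the condition that the variables in the reduced space are pairwise uncorrelated (correlation coefficient 0).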
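\[
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_q \end{pmatrix}
  = P'x
  = \begin{pmatrix}
      p_{11} & p_{12} & \cdots & p_{1p} \\
      p_{21} & p_{22} & \cdots & p_{2p} \\
      \vdots & \vdots & \ddots & \vdots \\
      p_{q1} & p_{q2} & \cdots & p_{qp}
    \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix}
\]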
Jinseog Kim, Dongguk University
Principal Component Analysis (PCA)
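Principal Component Analysis
x = (x1, . . . , xp)′: a p-dimensional random vector
E(x) = µ
Cov(x) = Σ (covariance matrix)
Covariance of y = P′x:
Cov(y) = Cov(P′x) = P′Cov(x)P = P′ΣP.
For the components of (y1, . . . , yp)′ to be mutually uncorrelated, the covariance of y must have the following diagonal form:
\[
\mathrm{Cov}(y) = \begin{pmatrix}
  \lambda_1 & 0 & \cdots & 0 \\
  0 & \lambda_2 & \cdots & 0 \\
  \vdots & \vdots & \ddots & \vdots \\
  0 & \cdots & 0 & \lambda_p
\end{pmatrix}
\]
Each diagonal element λi is the variance of the univariate variable yi.
How can we find a linear transformation P that satisfies this condition?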
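Spectral decomposition of Σ
The spectral decomposition of Σ is the matrix factorization
Σ = ΓΛΓ′.
1 Γ is an orthogonal matrix whose columns are the eigenvectors of Σ
2 Λ is a diagonal matrix whose diagonal elements are the eigenvalues of Σ
That is,
\[
\Sigma = (e_1, \ldots, e_p)
\begin{pmatrix}
  \lambda_1 & 0 & \cdots & 0 \\
  0 & \lambda_2 & \cdots & 0 \\
  \vdots & \vdots & \ddots & \vdots \\
  0 & \cdots & 0 & \lambda_p
\end{pmatrix}
\begin{pmatrix} e_1' \\ \vdots \\ e_p' \end{pmatrix}.
\]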
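Computing Cov(y) using the spectral decomposition
Cov(y) = P′ΣP = P′ΓΛΓ′P
A sufficient condition for the covariance of y to be the diagonal matrix Λ:
P = Γ ⇒ P′ΓΛΓ′P = P′PΛP′P = IΛI = Λ
With this choice, y is called the principal components of x. The sum of the diagonal elements of the covariance matrix is preserved:
\[
\sum_{j=1}^{p} \mathrm{Var}(x_j) = \mathrm{tr}(\Sigma) = \mathrm{tr}(\Gamma\Lambda\Gamma') = \mathrm{tr}(\Gamma'\Gamma\Lambda) = \mathrm{tr}(\Lambda)
= \sum_{j=1}^{p} \lambda_j = \sum_{j=1}^{p} \mathrm{Var}(y_j)
\]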
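Example
Covariance matrix of the 3-dimensional random vector X = (X1, X2, X3)^T:
\[
\Sigma = \begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{pmatrix}.
\]
Eigenvalues and eigenvectors (spectral decomposition):
\[
\Gamma = \begin{pmatrix} 0.383 & 0 & 0.924 \\ -0.924 & 0 & 0.383 \\ 0 & 1 & 0 \end{pmatrix},
\quad
\Lambda = \begin{pmatrix} 5.83 & 0 & 0 \\ 0 & 2.00 & 0 \\ 0 & 0 & 0.17 \end{pmatrix}
\]
Principal component variables (y1, y2, y3)^T:
y1 = 0.383x1 − 0.924x2
y2 = x3
y3 = 0.924x1 + 0.383x2.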
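The decomposition in this example can be checked numerically. The following is a minimal sketch using base R's eigen(); the names Sigma, Gam and Lam are chosen here for illustration and do not come from the slides. Eigenvectors are determined only up to sign, so the signs returned by eigen() may differ from those shown above.

```r
# Covariance matrix from the example
Sigma <- matrix(c( 1, -2, 0,
                  -2,  5, 0,
                   0,  0, 2), nrow = 3, byrow = TRUE)

ed  <- eigen(Sigma)      # eigenvalues are returned in decreasing order
Gam <- ed$vectors        # columns are eigenvectors (signs are arbitrary)
Lam <- diag(ed$values)   # diagonal matrix of eigenvalues

round(ed$values, 2)      # compare with the diagonal of Lambda
round(abs(Gam), 3)       # compare with Gamma, ignoring signs

# Gamma should be orthogonal and should reproduce Sigma
recon_err <- max(abs(Gam %*% Lam %*% t(Gam) - Sigma))
orth_err  <- max(abs(crossprod(Gam) - diag(3)))

# Cov(y) = Gamma' Sigma Gamma should equal the diagonal matrix Lambda
cov_y <- t(Gam) %*% Sigma %*% Gam
```

Both recon_err and orth_err should be numerically zero, and the off-diagonal entries of cov_y should vanish up to rounding error.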
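Example
Sum of the variances of the original variables x = (x1, x2, x3)^T:
σ11 + σ22 + σ33 = 1 + 5 + 2 = 8
Sum of the variances of the principal components y = (y1, y2, y3)^T:
λ1 + λ2 + λ3 = 5.83 + 2 + 0.17 = 8
Proportion of variance of the principal components y = (y1, y2, y3)^T:
      Variance   Proportion   Cumulative proportion
y1    5.83       0.72875      0.72875
y2    2.00       0.25000      0.97875
y3    0.17       0.02125      1.00000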
R example: the Prestige data
Data on income and prestige for Canadian occupational categories
library(car)
head(Prestige)
##                     education income women prestige census
## gov.administrators      13.11  12351 11.16     68.8   1113
## general.managers        12.26  25879  4.02     69.1   1130
## accountants             12.77   9271 15.70     63.4   1171
## purchasing.officers     11.42   8865  9.11     56.8   1175
## chemists                14.62   8403 11.68     73.5   2111
## physicists              15.64  11030  5.13     77.6   2113
##                     type
## gov.administrators  prof
## general.managers    prof
## accountants         prof
## purchasing.officers prof
## chemists            prof
## physicists          prof
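car::Prestige
Variables
1 education: average education level of the occupation group in 1971
2 income: average income
3 women: percentage of women (%)
4 prestige: Pineo-Porter occupational prestige score, from a social survey conducted in the mid-1960s
5 census: occupation code
6 type: occupation type - bc (skilled manual); prof (professors, professional, technical); wc (clerical)
Analysis goals
1 The relationship between income and prestige
2 Prestige as a function of income, percentage of women, and occupation type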
R example
X <- cov(Prestige[,1:4])  # covariance matrix; var(...) gives the same result
X
##             education      income         women    prestige
## education    7.444408     6691.13      5.353965    39.90856
## income    6691.129509 18027855.55 -59411.383661 52223.07756
## women        5.353965   -59411.38   1006.471223   -64.58812
## prestige    39.908561    52223.08    -64.588116   295.99432
R example
# spectral decomposition
ei <- eigen(X)
ei
## $values
## [1] 1.802821e+07 8.287121e+02 1.298184e+02 1.816137e+00
##
## $vectors
##               [,1]        [,2]         [,3]          [,4]
## [1,] -0.0003711498 -0.03673138  0.125908181  9.913616e-01
## [2,] -0.9999903048 -0.00278663 -0.003409244 -4.463693e-05
## [3,]  0.0032956309 -0.98702896 -0.159679771 -1.628943e-02
## [4,] -0.0028967751 -0.15625901  0.979100545 -1.301417e-01
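R example
# spectral decomposition: check that Gamma' X Gamma is diagonal and preserves the trace
LAM <- diag(ei$values)    # Lambda: diagonal matrix of eigenvalues
GAM <- ei$vectors         # Gamma: matrix of eigenvectors
Z <- t(GAM)%*%X%*%GAM     # covariance of the principal components
diag(Z)
## [1] 1.802821e+07 8.287121e+02 1.298184e+02 1.816137e+00
sum(diag(Z))
## [1] 18029165
sum(diag(X))
## [1] 18029165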
PCA in R
princomp(formula, data = NULL, subset, na.action, ...)
princomp(x)
pr1 <- princomp(~education+income+women+prestige, data=Prestige)
pr1
## Call:
## princomp(formula = ~education + income + women + prestige, data = Prestige)
##
## Standard deviations:
##      Comp.1      Comp.2      Comp.3      Comp.4
## 4225.098580   28.645899   11.337797    1.341019
##
##  4 variables and  102 observations.
head(pr1$scores) # y-values
##                         Comp.1     Comp.2     Comp.3
## gov.administrators   -5553.167  -1.406157   5.719698
## general.managers    -19081.060 -32.071958 -39.073727
## accountants          -2473.167   3.551840  10.165270
## purchasing.officers  -2067.173  12.268629   5.969673
## chemists             -1605.218   8.292322  23.888252
## physicists           -4232.227   6.758757  20.120810
##                          Comp.4
## gov.administrators  -0.46491938
## general.managers    -1.83416108
## accountants         -0.03568959
## purchasing.officers -0.38962275
## chemists             0.58812686
## physicists           1.05516941
summary(pr1)
## Importance of components:
##                              Comp.1       Comp.2
## Standard deviation     4225.0985798 2.864590e+01
## Proportion of Variance    0.9999467 4.596509e-05
## Cumulative Proportion     0.9999467 9.999927e-01
##                              Comp.3       Comp.4
## Standard deviation     1.133780e+01 1.341019e+00
## Proportion of Variance 7.200465e-06 1.007333e-07
## Cumulative Proportion  9.999999e-01 1.000000e+00
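The standard deviations reported by princomp() are not exactly the square roots of the eigenvalues obtained from eigen(cov(...)): princomp() computes the covariance matrix with divisor n, while cov() uses n - 1. This accounts for 4225.10 here versus sqrt(1.802821e+07), which is about 4245.96, with n = 102. The following is a minimal sketch of that relationship; it uses the built-in USArrests data as a stand-in (an assumption made so the example runs without the car package), and the variable names are illustrative.

```r
# USArrests ships with base R (datasets package); used here only as a stand-in
x <- USArrests
n <- nrow(x)

ev <- eigen(cov(x))$values   # eigenvalues of the sample covariance (divisor n - 1)
pc <- princomp(x)            # princomp() uses divisor n

# Rescale the eigenvalues to divisor n before taking square roots
sd_from_eigen <- sqrt(ev * (n - 1) / n)
max_diff <- max(abs(unname(pc$sdev) - sd_from_eigen))
max_diff                     # should be numerically zero

# prcomp() uses divisor n - 1, so its sdev matches sqrt(ev) directly
max_diff_prcomp <- max(abs(prcomp(x)$sdev - sqrt(ev)))
```

The proportions of variance are unaffected by the choice of divisor, since the factor (n - 1) / n cancels in the ratio.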
PCA in R
PCA results: the linear transformation matrix (P = Γ)
loadings(pr1) # Gamma: matrix of eigenvectors
##
## Loadings:
##           Comp.1 Comp.2 Comp.3 Comp.4
## education                0.126  0.991
## income    -1.000
## women            -0.987 -0.160
## prestige         -0.156  0.979 -0.130
##
##                Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings      1.00   1.00   1.00   1.00
## Proportion Var   0.25   0.25   0.25   0.25
## Cumulative Var   0.25   0.50   0.75   1.00
PCA in R
PCA results: the data displayed along the first and second principal component axes
biplot(pr1)
[Figure: biplot of pr1. Axes: Comp.1 (horizontal) and Comp.2 (vertical). Occupations are plotted by name, with arrows for the variables education, income, women, and prestige.]
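PCA in R
1 Using a scree plot
Use the share of the total variance (variation) that the principal components explain. With p variables in total, compute the following ratio for each q and display it as a plot:
\[
\frac{\sum_{i=1}^{q} \mathrm{Var}(y_i)}{\sum_{i=1}^{p} \mathrm{Var}(y_i)}
= \frac{\sum_{i=1}^{q} \lambda_i}{\sum_{i=1}^{p} \lambda_i},
\quad q = 1, \ldots, p.
\]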
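As a sketch of this calculation, the ratios can be computed directly from the eigenvalues. The values below are the rounded eigenvalues from the earlier three-variable example; the names lambda, prop and cum_prop are illustrative.

```r
lambda   <- c(5.83, 2.00, 0.17)             # rounded eigenvalues from the earlier example
prop     <- lambda / sum(lambda)            # proportion of variance per component
cum_prop <- cumsum(lambda) / sum(lambda)    # the ratio above for q = 1, ..., p

# Collect the results in one table
data.frame(component  = paste0("y", seq_along(lambda)),
           variance   = lambda,
           proportion = prop,
           cumulative = cum_prop)
```

The cumulative column reproduces the variance table of that example. A call such as plot(cum_prop, type = "b") would draw the cumulative proportions against q.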
PCA in R: screeplot
screeplot(pr1, type="lines", main="Scree plot")
[Figure: scree plot titled "Scree plot". Vertical axis: Variances (0 to about 1.5e+07). Horizontal axis: Comp.1 through Comp.4.]
2004 New Car and Truck Data
428 cars from the 2004 model year, with 19 features.
http://ww2.amstat.org/publications/jse/datasets/04cars.dat.txt
2004 New Car and Truck Data
Variable     Meaning
Name         Vehicle Name
Sports       Binary indicator for being a sports car
SUV          Indicator for sports utility vehicle
Wagon        Indicator
Minivan      Indicator
Pickup       Indicator
AWD          Indicator for all-wheel drive
RWD          Indicator for rear-wheel drive
Retail       Suggested retail price (US$)
Dealer       Price to dealer (US$)
Engine       Engine size (liters)
Cylinders    Number of engine cylinders
Horsepower   Engine horsepower
CityMPG      City gas mileage
HighwayMPG   Highway gas mileage
Weight       Weight (pounds)
Wheelbase    Wheelbase (inches)
Length       Length (inches)
Width        Width (inches)
2004 New Car and Truck Data
cars04 <- read.csv("http://bigdata.dongguk.ac.kr/data/04cars.dat.csv", header = F)
head(cars04)
##                            V1 V2 V3 V4 V5
## 1          Chevrolet Aveo 4dr  0  0  0  0
## 2 Chevrolet Aveo LS 4dr hatch  0  0  0  0
## 3      Chevrolet Cavalier 2dr  0  0  0  0
## 4      Chevrolet Cavalier 4dr  0  0  0  0
## 5   Chevrolet Cavalier LS 2dr  0  0  0  0
## 6           Dodge Neon SE 4dr  0  0  0  0
##   V6 V7 V8    V9   V10 V11 V12 V13 V14 V15  V16 V17 V18 V19
## 1  0  0  0 11690 10965 1.6   4 103  28  34 2370  98 167  66
## 2  0  0  0 12585 11802 1.6   4 103  28  34 2348  98 153  66
## 3  0  0  0 14610 13697 2.2   4 140  26  37 2617 104 183  69
## 4  0  0  0 14810 13884 2.2   4 140  26  37 2676 104 183  68
## 5  0  0  0 16385 15357 2.2   4 140  26  37 2617 104 183  69
## 6  0  0  0 13670 12849 2.0   4 132  29  36 2581 105 174  67
2004 New Car and Truck Data
cars04.pca = prcomp(cars04[,9:19], scale.=TRUE)
biplot(cars04.pca,cex=0.4)
[Figure: biplot of cars04.pca. Axes: PC1 (horizontal) and PC2 (vertical). Observations are labeled by row number, with arrows for the variables V9 through V19.]
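The argument scale. = TRUE standardizes each variable to unit variance before the analysis, which amounts to an eigen decomposition of the correlation matrix rather than the covariance matrix. This matters when variables are measured on very different scales, as in the Prestige example, where income alone accounts for nearly all of the variance. The following is a minimal sketch of the equivalence; it uses a few columns of the built-in mtcars data as a stand-in (an assumption, so that the cars04 file is not needed), and the variable names are illustrative.

```r
# Stand-in numeric variables on very different scales (built-in mtcars data)
x <- mtcars[, c("mpg", "disp", "hp", "wt")]

pca_scaled <- prcomp(x, scale. = TRUE)   # PCA on standardized variables
ev_cor     <- eigen(cor(x))$values       # eigenvalues of the correlation matrix

# Variances of the scaled principal components equal the eigenvalues of cor(x)
max_diff <- max(abs(pca_scaled$sdev^2 - ev_cor))
max_diff                                 # should be numerically zero

# With standardized variables the total variance equals the number of variables
total_var <- sum(pca_scaled$sdev^2)
total_var
```

Whether to scale is a modelling choice: without scaling, the variable with the largest variance tends to dominate the first component.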
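Uses of PCA
Dimension reduction: data visualization in low dimensions
Multiple regression: when there is multicollinearity among the input variables
Factor analysis, discriminant analysis, cluster analysis, outlier detection, etc.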
Exercises
1 (Practice) Perform a principal component analysis on the UScrime data set in the MASS package.
2 Perform a principal component analysis on a data set of your own.
Data sets
http://ww2.amstat.org/publications/jse/jse_data_archive.htm