
Principal Component Analysis (PCA)

Jinseog Kim, Dongguk University

[email protected]


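Principal component analysis

Principal component analysis (PCA) is a method for reducing multidimensional data to a small number of dimensions that retain most of the explanatory power (variance) of the data.

The variables in the reduced space are constructed so that they are mutually uncorrelated (pairwise correlation 0).

$$
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_q \end{pmatrix}
= P'x =
\begin{pmatrix}
p_{11} & p_{12} & \cdots & p_{1p} \\
p_{21} & p_{22} & \cdots & p_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
p_{q1} & \cdots & \cdots & p_{qp}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix}
$$

Here $P'$ is the $q \times p$ matrix of coefficients.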


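Principal component analysis

$x = (x_1, \ldots, x_p)'$: a $p$-dimensional random vector with $E(x) = \mu$ and $\mathrm{Cov}(x) = \Sigma$ (the covariance matrix).

Covariance of $y = P'x$:

$$
\mathrm{Cov}(y) = \mathrm{Cov}(P'x) = P'\,\mathrm{Cov}(x)\,P = P'\Sigma P.
$$

For the components of $(y_1, \ldots, y_p)'$ to be uncorrelated, the covariance of $y$ must have the diagonal form

$$
\mathrm{Cov}(y) =
\begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \lambda_p
\end{pmatrix}
$$

Each diagonal element $\lambda_i$ is the variance of the univariate variable $y_i$.

How can we find a linear transformation $P$ that satisfies this condition?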
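The identity $\mathrm{Cov}(P'x) = P'\Sigma P$ also holds for sample covariance matrices, so it can be checked numerically. The following is a minimal sketch in R; the simulated data and the coefficient matrix `P` are arbitrary illustrative choices, not part of the lecture.

```r
# Numerical check of Cov(P'x) = P' Cov(x) P on simulated data.
# The data and the matrix P are arbitrary illustrative choices.
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 3), nrow = n)      # n observations of a 3-dimensional x
x[, 2] <- x[, 2] + 0.8 * x[, 1]          # introduce some correlation

P <- cbind(c(1, 1, 0), c(1, -1, 2))      # p x q coefficient matrix (p = 3, q = 2)
y <- x %*% P                             # row i holds y_i' = x_i' P, i.e. y_i = P' x_i

lhs <- cov(y)                            # sample covariance of y
rhs <- t(P) %*% cov(x) %*% P             # P' S P, with S the sample covariance of x
print(all.equal(lhs, rhs))
```

For an arbitrary `P` like this one, the off-diagonal entry of `cov(y)` is generally nonzero, which is why a specific choice of $P$ is needed.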


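Spectral decomposition of $\Sigma$

The spectral decomposition of $\Sigma$ is the matrix factorization

$$
\Sigma = \Gamma \Lambda \Gamma'.
$$

1. $\Gamma$ is an orthogonal matrix whose columns are the eigenvectors of $\Sigma$.
2. $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues of $\Sigma$.

That is,

$$
\Sigma = (e_1, \ldots, e_p)
\begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \lambda_p
\end{pmatrix}
\begin{pmatrix} e_1' \\ \vdots \\ e_p' \end{pmatrix}.
$$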


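Computing $\mathrm{Cov}(y)$ with the spectral decomposition

$$
\mathrm{Cov}(y) = P'\Sigma P = P'\Gamma\Lambda\Gamma'P
$$

A sufficient condition for the covariance of $y$ to be the diagonal matrix $\Lambda$:

$$
P = \Gamma \;\Rightarrow\; P'\Gamma\Lambda\Gamma'P = P'P\,\Lambda\,P'P = I\Lambda I = \Lambda
$$

With this choice, $y$ is called the principal components of $x$. The sum of the diagonal elements of the covariance matrix (the total variance) is then the same for $x$ and $y$:

$$
\sum_{j=1}^{p} \mathrm{Var}(x_j) = \mathrm{tr}(\Sigma) = \mathrm{tr}(\Gamma\Lambda\Gamma') = \mathrm{tr}(\Gamma'\Gamma\Lambda) = \mathrm{tr}(\Lambda) = \sum_{j=1}^{p} \lambda_j = \sum_{j=1}^{p} \mathrm{Var}(y_j)
$$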


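Covariance matrix of the 3-dimensional random vector $x = (x_1, x_2, x_3)^T$:

$$
\Sigma = \begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{pmatrix}.
$$

Eigenvalues and eigenvectors (spectral decomposition):

$$
\Gamma = \begin{pmatrix} 0.383 & 0 & 0.924 \\ -0.924 & 0 & 0.383 \\ 0 & 1 & 0 \end{pmatrix},
\qquad
\Lambda = \begin{pmatrix} 5.83 & 0 & 0 \\ 0 & 2.00 & 0 \\ 0 & 0 & 0.17 \end{pmatrix}
$$

Principal components $(y_1, y_2, y_3)^T$:

$$
\begin{aligned}
y_1 &= 0.383x_1 - 0.924x_2 \\
y_2 &= x_3 \\
y_3 &= 0.924x_1 + 0.383x_2
\end{aligned}
$$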
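The numbers in this example can be reproduced with `eigen()`. This is a minimal sketch; note that the sign of each eigenvector is arbitrary, so R may return some columns of $\Gamma$ with all signs flipped relative to the values shown above.

```r
# Spectral decomposition of the example covariance matrix.
Sigma <- matrix(c( 1, -2, 0,
                  -2,  5, 0,
                   0,  0, 2), nrow = 3, byrow = TRUE)

ei <- eigen(Sigma)           # eigenvalues are returned in decreasing order
Gamma <- ei$vectors          # columns are eigenvectors (sign is arbitrary)
Lambda <- diag(ei$values)

print(round(ei$values, 2))
print(round(Gamma, 3))

# Gamma Lambda Gamma' should reproduce Sigma, and Gamma' Sigma Gamma should be Lambda
print(all.equal(Gamma %*% Lambda %*% t(Gamma), Sigma))
print(all.equal(t(Gamma) %*% Sigma %*% Gamma, Lambda))
```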


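Sum of the variances of the original variables $x = (x_1, x_2, x_3)^T$:

$$
\sigma_{11} + \sigma_{22} + \sigma_{33} = 1 + 5 + 2 = 8
$$

Sum of the variances of the principal components $y = (y_1, y_2, y_3)^T$:

$$
\lambda_1 + \lambda_2 + \lambda_3 = 5.83 + 2 + 0.17 = 8
$$

Proportion of variance of the principal components $y = (y_1, y_2, y_3)^T$:

      Variance  Proportion  Cumulative proportion
y1    5.83      0.72875     0.72875
y2    2.00      0.25000     0.97875
y3    0.17      0.02125     1.00000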


R example: Prestige data

Data on income and prestige collected for Canadian occupational categories.

library(car)
head(Prestige)

##                     education income women prestige census
## gov.administrators      13.11  12351 11.16     68.8   1113
## general.managers        12.26  25879  4.02     69.1   1130
## accountants             12.77   9271 15.70     63.4   1171
## purchasing.officers     11.42   8865  9.11     56.8   1175
## chemists                14.62   8403 11.68     73.5   2111
## physicists              15.64  11030  5.13     77.6   2113
##                     type
## gov.administrators  prof
## general.managers    prof
## accountants         prof
## purchasing.officers prof
## chemists            prof
## physicists          prof


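car::Prestige

Variables

1. education: average education of the occupation's incumbents in 1971 (years)
2. income: average income
3. women: percentage of women (%)
4. prestige: Pineo-Porter occupational prestige score, from a social survey conducted in the mid-1960s
5. census: occupational code
6. type: occupation type: bc (blue collar); prof (professional, managerial, technical); wc (white collar)

Purpose of the analysis

1. The relationship between income and prestige
2. Prestige as a function of income, sex ratio, and occupation type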


R example

X <- cov(Prestige[, 1:4])  # covariance matrix; var(Prestige[, 1:4]) is equivalent

X

##             education      income         women    prestige
## education    7.444408     6691.13      5.353965    39.90856
## income    6691.129509 18027855.55 -59411.383661 52223.07756
## women        5.353965   -59411.38   1006.471223   -64.58812
## prestige    39.908561    52223.08    -64.588116   295.99432


R example

# spectral decomposition

ei <- eigen(X)
ei

## $values
## [1] 1.802821e+07 8.287121e+02 1.298184e+02 1.816137e+00
##
## $vectors
##               [,1]        [,2]         [,3]          [,4]
## [1,] -0.0003711498 -0.03673138  0.125908181  9.913616e-01
## [2,] -0.9999903048 -0.00278663 -0.003409244 -4.463693e-05
## [3,]  0.0032956309 -0.98702896 -0.159679771 -1.628943e-02
## [4,] -0.0028967751 -0.15625901  0.979100545 -1.301417e-01


R example

# spectral decomposition: the diagonal of t(GAM) %*% X %*% GAM holds the eigenvalues

LAM <- diag(ei$values)
GAM <- ei$vectors
Z <- t(GAM) %*% X %*% GAM
diag(Z)

## [1] 1.802821e+07 8.287121e+02 1.298184e+02 1.816137e+00

sum(diag(Z))

## [1] 18029165

sum(diag(X))

## [1] 18029165
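`LAM` is defined above but not used afterwards. A short sketch of how it might be used to confirm the decomposition $\Sigma = \Gamma\Lambda\Gamma'$ and the orthogonality of $\Gamma$ follows. As an assumption made only so that the block runs without the car package, the covariance matrix is re-entered by hand from the printed (rounded) output rather than computed from `Prestige`.

```r
# Covariance matrix of Prestige[, 1:4], re-entered from the printed output
# (rounded values; an assumption made so that the block runs without the car package).
vars <- c("education", "income", "women", "prestige")
X <- matrix(c(    7.444408,     6691.129509,      5.353965,    39.908561,
               6691.129509, 18027855.55,     -59411.383661, 52223.07756,
                  5.353965,   -59411.383661,   1006.471223,   -64.588116,
                 39.908561,    52223.07756,     -64.588116,   295.99432),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

ei  <- eigen(X)
LAM <- diag(ei$values)
GAM <- ei$vectors

# GAM is orthogonal: GAM' GAM = I
print(all.equal(crossprod(GAM), diag(4)))

# Spectral decomposition: GAM LAM GAM' reproduces X
X_rebuilt <- GAM %*% LAM %*% t(GAM)
print(all.equal(X_rebuilt, X, check.attributes = FALSE))
```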


PCA in R

princomp(formula, data = NULL, subset, na.action, ...)
princomp(x)

pr1 <- princomp(~ education + income + women + prestige, data = Prestige)
pr1

## Call:
## princomp(formula = ~education + income + women + prestige, data = Prestige)
##
## Standard deviations:
##      Comp.1      Comp.2      Comp.3      Comp.4
## 4225.098580   28.645899   11.337797    1.341019
##
##  4 variables and  102 observations.

head(pr1$scores)  # principal component scores (y values)

##                         Comp.1     Comp.2     Comp.3
## gov.administrators   -5553.167  -1.406157   5.719698
## general.managers    -19081.060 -32.071958 -39.073727
## accountants          -2473.167   3.551840  10.165270
## purchasing.officers  -2067.173  12.268629   5.969673
## chemists             -1605.218   8.292322  23.888252
## physicists           -4232.227   6.758757  20.120810
##                          Comp.4
## gov.administrators  -0.46491938
## general.managers    -1.83416108
## accountants         -0.03568959
## purchasing.officers -0.38962275
## chemists             0.58812686
## physicists           1.05516941

summary(pr1)

## Importance of components:
##                              Comp.1       Comp.2
## Standard deviation     4225.0985798 2.864590e+01
## Proportion of Variance    0.9999467 4.596509e-05
## Cumulative Proportion     0.9999467 9.999927e-01
##                              Comp.3       Comp.4
## Standard deviation     1.133780e+01 1.341019e+00
## Proportion of Variance 7.200465e-06 1.007333e-07
## Cumulative Proportion  9.999999e-01 1.000000e+00
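The standard deviations reported by `princomp()` are not exactly the square roots of the eigenvalues of `cov(...)`: `princomp()` uses the divisor $n$ for the covariance matrix, whereas `cov()` uses $n - 1$. The sketch below checks this relationship using only numbers copied from the printed output (the eigenvalues, $n = 102$, and the reported standard deviations), so it is accurate only up to the rounding of those printed values.

```r
# Eigenvalues of cov(Prestige[, 1:4]) and the sample size, copied from the printed output
lambda <- c(1.802821e+07, 8.287121e+02, 1.298184e+02, 1.816137e+00)
n <- 102

# princomp() uses divisor n, while cov() uses n - 1
sdev <- sqrt(lambda * (n - 1) / n)
print(sdev)

# Standard deviations reported by princomp(), copied from the printed output
sdev_princomp <- c(4225.098580, 28.645899, 11.337797, 1.341019)
print(all.equal(sdev, sdev_princomp, tolerance = 1e-5))
```

The proportions of variance are unaffected by the choice of divisor, because the factor $(n-1)/n$ cancels in the ratio.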


PCA in R

Result of the PCA: the linear transformation matrix ($P = \Gamma$)

loadings(pr1)  # Gamma: matrix of eigenvectors

##
## Loadings:
##           Comp.1 Comp.2 Comp.3 Comp.4
## education                0.126  0.991
## income    -1.000
## women            -0.987 -0.160
## prestige         -0.156  0.979 -0.130
##
##                Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings      1.00   1.00   1.00   1.00
## Proportion Var   0.25   0.25   0.25   0.25
## Cumulative Var   0.25   0.50   0.75   1.00
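Two details of this printout are easy to misread. Blank cells are loadings whose absolute value is below the print cutoff (0.1 by default), not exact zeros, and the "Proportion Var" row is simply $1/p$ per column (here 0.25), not the variance explained by each component, which is reported by `summary()` instead. The sketch below illustrates both points; as an assumption made so that it runs without the car package, the built-in `USArrests` data stands in for `Prestige`.

```r
# USArrests (shipped with base R) stands in for Prestige here, so that the
# block runs without the car package; this substitution is an assumption.
pr <- princomp(USArrests)

L <- unclass(loadings(pr))        # plain numeric matrix of loadings
print(round(L, 4))                # all entries, including the small ones

print(loadings(pr), cutoff = 0)   # same print method, nothing suppressed

# Each column is a unit-length eigenvector, so "SS loadings" is always 1
print(colSums(L^2))
```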


PCA in R

Result of the PCA: the data plotted along the first and second principal component axes

biplot(pr1)

[Figure: biplot of pr1 with Comp.1 on the horizontal axis and Comp.2 on the vertical axis. Points are labelled by occupation name; arrows show the variables education, income, women, and prestige.]


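PCA in R

1. Using the scree plot
   The scree plot is based on the share of the total variance (variation) explained by the principal components. With $p$ variables in total, the following quantity is computed and displayed graphically:

$$
\frac{\mathrm{Var}(y_1 + \cdots + y_q)}{\mathrm{Var}(y_1 + \cdots + y_p)} = \frac{\sum_{i=1}^{q} \lambda_i}{\sum_{i=1}^{p} \lambda_i}, \qquad q = 1, \ldots, p.
$$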
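For the Prestige example, this ratio can be computed directly from the eigenvalues. The sketch below uses the eigenvalues copied from the `eigen()` output shown earlier, so the results are accurate only up to the rounding of those printed values; they should agree with the "Proportion of Variance" and "Cumulative Proportion" rows of `summary(pr1)`.

```r
# Eigenvalues of cov(Prestige[, 1:4]), copied from the eigen() output shown earlier
lambda <- c(1.802821e+07, 8.287121e+02, 1.298184e+02, 1.816137e+00)

prop     <- lambda / sum(lambda)             # proportion of variance per component
cum_prop <- cumsum(lambda) / sum(lambda)     # proportion explained by the first q components

print(signif(prop, 7))
print(signif(cum_prop, 7))
```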


PCA in R: screeplot

screeplot(pr1, type="lines", main="Scree plot")

[Figure: scree plot titled "Scree plot". Vertical axis: Variances (0.0e+00 to 1.5e+07); horizontal axis: Comp.1 to Comp.4.]
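The scree plot is dominated by Comp.1 because income has a far larger variance (about 1.8e+07) than the other three variables, so a covariance-based PCA mostly reflects the measurement scale of income. One common alternative, not used in the slides at this point, is to base the PCA on the correlation matrix, for example with `princomp(..., cor = TRUE)` or `prcomp(..., scale. = TRUE)`. The sketch below shows the idea with `cov2cor()`; as an assumption made so that it runs without the car package, the covariance matrix is re-entered by hand from the printed (rounded) output.

```r
# Covariance matrix of Prestige[, 1:4], re-entered from the printed output
# (rounded values; an assumption made so that the block runs without the car package).
vars <- c("education", "income", "women", "prestige")
X <- matrix(c(    7.444408,     6691.129509,      5.353965,    39.908561,
               6691.129509, 18027855.55,     -59411.383661, 52223.07756,
                  5.353965,   -59411.383661,   1006.471223,   -64.588116,
                 39.908561,    52223.07756,     -64.588116,   295.99432),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

R <- cov2cor(X)              # correlation matrix derived from the covariance matrix
ei_cor <- eigen(R)

# Variances of the components of the standardized variables; they sum to p = 4
print(round(ei_cor$values, 3))
print(round(cumsum(ei_cor$values) / sum(ei_cor$values), 3))
```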


2004 New Car and Truck Data

428 cars from the 2004 model year, with 19 features.
http://ww2.amstat.org/publications/jse/datasets/04cars.dat.txt


2004 New Car and Truck Data

Variable      Meaning
Name          Vehicle name
Sports        Binary indicator for being a sports car
SUV           Indicator for sports utility vehicle
Wagon         Indicator
Minivan       Indicator
Pickup        Indicator
AWD           Indicator for all-wheel drive
RWD           Indicator for rear-wheel drive
Retail        Suggested retail price (US$)
Dealer        Price to dealer (US$)
Engine        Engine size (liters)
Cylinders     Number of engine cylinders
Horsepower    Engine horsepower
CityMPG       City gas mileage
HighwayMPG    Highway gas mileage
Weight        Weight (pounds)
Wheelbase     Wheelbase (inches)
Length        Length (inches)
Width         Width (inches)


2004 New Car and Truck Data

cars04 <- read.csv("http://bigdata.dongguk.ac.kr/data/04cars.dat.csv", header = FALSE)
head(cars04)

##                            V1 V2 V3 V4 V5
## 1          Chevrolet Aveo 4dr  0  0  0  0
## 2 Chevrolet Aveo LS 4dr hatch  0  0  0  0
## 3      Chevrolet Cavalier 2dr  0  0  0  0
## 4      Chevrolet Cavalier 4dr  0  0  0  0
## 5   Chevrolet Cavalier LS 2dr  0  0  0  0
## 6           Dodge Neon SE 4dr  0  0  0  0
##   V6 V7 V8    V9   V10 V11 V12 V13 V14 V15  V16 V17 V18 V19
## 1  0  0  0 11690 10965 1.6   4 103  28  34 2370  98 167  66
## 2  0  0  0 12585 11802 1.6   4 103  28  34 2348  98 153  66
## 3  0  0  0 14610 13697 2.2   4 140  26  37 2617 104 183  69
## 4  0  0  0 14810 13884 2.2   4 140  26  37 2676 104 183  68
## 5  0  0  0 16385 15357 2.2   4 140  26  37 2617 104 183  69
## 6  0  0  0 13670 12849 2.0   4 132  29  36 2581 105 174  67
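Because the file is read with `header = FALSE`, the columns are named V1 to V19. Assuming the columns appear in the same order as the variable table (which is consistent with the printed values, e.g. V11 = 1.6 for engine size and V12 = 4 for cylinders), descriptive names could be attached as sketched below. To keep the block runnable offline, only the first two printed rows are re-entered by hand; `cars04_head` and `var_names` are hypothetical names introduced for this illustration.

```r
# First two rows of cars04 as printed, re-entered by hand so that the
# block runs offline (an assumption; the lecture reads the full file from a URL).
cars04_head <- data.frame(
  V1 = c("Chevrolet Aveo 4dr", "Chevrolet Aveo LS 4dr hatch"),
  V2 = 0, V3 = 0, V4 = 0, V5 = 0, V6 = 0, V7 = 0, V8 = 0,
  V9 = c(11690, 12585), V10 = c(10965, 11802), V11 = c(1.6, 1.6),
  V12 = c(4, 4), V13 = c(103, 103), V14 = c(28, 28), V15 = c(34, 34),
  V16 = c(2370, 2348), V17 = c(98, 98), V18 = c(167, 153), V19 = c(66, 66)
)

# Column names in the order of the variable table (assumed to match V1..V19)
var_names <- c("Name", "Sports", "SUV", "Wagon", "Minivan", "Pickup", "AWD", "RWD",
               "Retail", "Dealer", "Engine", "Cylinders", "Horsepower", "CityMPG",
               "HighwayMPG", "Weight", "Wheelbase", "Length", "Width")
names(cars04_head) <- var_names

print(names(cars04_head)[9:19])   # the numeric variables in columns 9 to 19
```

Applying the same assignment to the full data (`names(cars04) <- var_names`) would make any later plot of columns 9 to 19 show variable names such as Retail and Horsepower instead of V9 to V19.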


2004 New Car and Truck Data

cars04.pca <- prcomp(cars04[, 9:19], scale. = TRUE)
biplot(cars04.pca, cex = 0.4)

[Figure: biplot of cars04.pca with PC1 on the horizontal axis and PC2 on the vertical axis. Points are labelled by row number; arrows show the variables V9 to V19.]


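Uses of PCA

Dimension reduction: data visualization in a low-dimensional space
Multiple regression: when the input variables are multicollinear
Factor analysis, discriminant analysis, cluster analysis, outlier detection, and so on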


Exercises

1. (Lab) Carry out a principal component analysis of the UScrime data set in the MASS package.

2. Carry out a principal component analysis of a data set you have obtained yourself.


Data sets

http://ww2.amstat.org/publications/jse/jse_data_archive.htm
