Cluster Analysis


What Is Cluster Analysis?

(1) Primary Objective

: To classify objects with diverse characteristics into relatively homogeneous groups, based on a set of variables chosen in advance

(2) Basic Principle

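: High internal (within-cluster) homogeneity and high external (between-cluster) heterogeneity

Clusters are formed so that the consumers within a cluster are similar to one another, while the consumers in one cluster differ from those in the other clusters.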


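(3) Application

ⅰ) Market segmentation / benefit segmentation
ⅱ) Understanding buying behavior : identifying characteristics by classifying homogeneous buyer groups
ⅲ) Identifying new-product opportunities : clustering brands and products
ⅳ) Selecting test markets
ⅴ) Data reduction

(4) Cluster vs. Factor Analysis

Cluster : classifies objects
Factor : classifies variables

(5) Cluster vs. Discriminant Analysis - Object Classification

Cluster : no prior information (classification criterion) about the clusters or groups is available (interdependence analysis)
Discriminant : prior information about the clusters or groups is available (dependence analysis)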


Cluster Analysis Procedure

Formulating the problem

Selecting a Distance Measure

Selecting a Clustering Procedure

Deciding on the Number of Clusters

Interpreting and Profiling Clusters

Assessing the Validity of Clustering


▣ Basic Concept

[Figure: two scatter plots, "An Ideal Clustering Situation" and "A Practical Clustering Situation", each with Variable 1 and Variable 2 as axes]


(1) Formulating the Problem

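: Select the variables on which the clustering will be based

ⅰ) The variables should describe the characteristics of the objects being clustered

ⅱ) The variables should relate to the objective of the cluster analysis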

(2) Similarity Measure

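: Distance measures are the most commonly used
(computed from the sum of squared differences between the responses to the given questions)

① Euclidean distance

$$ d_{ij}^{E} = \sqrt{\sum_{k=1}^{r} (X_{ik} - X_{jk})^2} \qquad (k = 1, \ldots, r) $$

X_ik : coordinate of object i on dimension k
X_jk : coordinate of object j on dimension k

② Squared Euclidean distance

$$ D_{ij} = \sum_{k=1}^{r} (X_{ik} - X_{jk})^2 $$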
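The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the two sample objects are assumptions, not part of the original material.

```python
import math

def squared_euclidean(x_i, x_j):
    """Sum of squared coordinate differences over all r dimensions."""
    if len(x_i) != len(x_j):
        raise ValueError("objects must have the same number of dimensions")
    return sum((a - b) ** 2 for a, b in zip(x_i, x_j))

def euclidean(x_i, x_j):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(x_i, x_j))

# Hypothetical objects measured on r = 2 dimensions
obj_i = (1.0, 2.0)
obj_j = (4.0, 6.0)
print(squared_euclidean(obj_i, obj_j))  # 25.0
print(euclidean(obj_i, obj_j))          # 5.0
```

The squared form skips the square root, so it preserves the ordering of distances while weighting large differences more heavily.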

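An example of Euclidean distance between two objects measured on two variables, X and Y.

[Figure: two objects plotted on the X and Y axes, Object 1 at (X1, Y1) and the second object at (X2, Y2); the legs of the right triangle between them are (X2 - X1) and (Y2 - Y1)]

$$ \text{Distance} = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} $$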

Normalized distance function

: Euclidean distance computed after normalizing the raw data (mean = 0, variance = 1), which removes the bias caused by differences in scale
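As a rough sketch of the normalized distance, the raw variables can be standardized first and the Euclidean distance computed afterwards. The data below are hypothetical, and the use of the population variance (dividing by n) is an assumption; dividing by n - 1 is equally common.

```python
import math

def standardize(rows):
    """Rescale every variable (column) to mean 0 and variance 1."""
    n, r = len(rows), len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(r)]
    # Population standard deviation (divide by n). A constant column
    # would give 0 here and must be dropped beforehand.
    stds = [math.sqrt(sum((row[k] - means[k]) ** 2 for row in rows) / n)
            for k in range(r)]
    return [tuple((row[k] - means[k]) / stds[k] for k in range(r))
            for row in rows]

def euclidean(x_i, x_j):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

# Hypothetical data: the second variable is the first on a 10x smaller scale
raw = [(10.0, 1.0), (20.0, 2.0), (30.0, 3.0)]
z = standardize(raw)

print(euclidean(raw[0], raw[2]))  # dominated by the large-scale variable
print(euclidean(z[0], z[2]))      # both variables now contribute equally
```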


③ City-block distance (Manhattan distance)

$$ d_{ij}^{C} = \sum_{k=1}^{r} |X_{ik} - X_{jk}| $$

[ Problems ]
ⅰ) Assumes that the variables are uncorrelated
ⅱ) The characteristics may be measured in different units (scales)

---------------------------------------------------------------------------------
Object  Purchase         Commercial          Pair  Distance  City-block
        Probability (%)  Viewing Time (min)        (min)     (second)
---------------------------------------------------------------------------------
A       60               3.0                 AB    25.25     61
B       65               3.5                 AC    10.00     153
C       64               4.0                 BC    4.25      40
---------------------------------------------------------------------------------
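The unit problem noted above can be made concrete with a small sketch. It uses only objects A and B from the table and a hypothetical `city_block` helper; it does not try to reproduce the distance columns of the table, only to show that changing the unit of one variable changes the distance.

```python
def city_block(x_i, x_j):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x_i, x_j))

# Objects A and B: (purchase probability in %, viewing time in minutes)
a_min, b_min = (60, 3.0), (65, 3.5)
# The same objects with viewing time converted to seconds
a_sec, b_sec = (60, 3.0 * 60), (65, 3.5 * 60)

print(city_block(a_min, b_min))  # 5.5
print(city_block(a_sec, b_sec))  # 35.0
```

With minutes, the purchase probability dominates the distance; with seconds, the viewing time does, although the objects themselves have not changed.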


④ Mahalanobis distance

ⅰ) Standardizes the data by scaling with the standard deviation

ⅱ) Pools the within-group variance-covariance in order to adjust for intercorrelations among the variables

ⅲ) Most appropriate when the variables are correlated with one another
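A minimal two-variable sketch of the Mahalanobis distance follows. It is an illustration under stated assumptions: the covariance is estimated from one hypothetical sample rather than pooled within groups, only two variables are handled (so the 2x2 inverse can be written out by hand), and all names are made up for this example.

```python
import math

def covariance_2d(rows):
    """Sample variances and covariance (divide by n - 1) of two variables."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(p, q, cov):
    """sqrt(d' S^-1 d) for a 2x2 covariance matrix S = [[sxx, sxy], [sxy, syy]]."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = p[0] - q[0], p[1] - q[1]
    # S^-1 = (1 / det) * [[syy, -sxy], [-sxy, sxx]]
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical, positively correlated sample
sample = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0), (5.0, 6.0)]
cov = covariance_2d(sample)
print(mahalanobis_2d(sample[0], sample[4], cov))
```

With unit variances and zero covariance the result reduces to the ordinary Euclidean distance, which is a convenient sanity check.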

⑤ Minkowski distance

$$ d_{ij}^{M} = \left[ \sum_{k=1}^{r} |X_{ik} - X_{jk}|^{p} \right]^{1/p} $$
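The Minkowski distance generalizes two of the earlier measures: p = 1 gives the city-block distance and p = 2 gives the Euclidean distance. A short sketch, with a hypothetical function name and sample objects:

```python
def minkowski(x_i, x_j, p):
    """Minkowski distance of order p (p >= 1)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x_i, x_j)) ** (1.0 / p)

obj_i, obj_j = (1.0, 2.0), (4.0, 6.0)
print(minkowski(obj_i, obj_j, 1))  # 7.0, the city-block distance
print(minkowski(obj_i, obj_j, 2))  # 5.0, the Euclidean distance
```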


(3) Clustering Algorithms

Clustering Procedures
    Hierarchical
        Agglomerative
            Linkage Methods
                Single Linkage
                Complete Linkage
                Average Linkage
            Variance Methods
                Ward's Method
            Centroid Methods
        Divisive
    Nonhierarchical
        Sequential Threshold
        Parallel Threshold
        Optimizing Partitioning


1) Hierarchical Cluster Procedure

① Agglomerative Procedure
: Starts with each object as its own cluster and successively merges nearby objects or clusters until a single cluster remains

ⅰ) Single Linkage : minimum distance rule
Clusters are merged according to the minimum distance between clusters or objects
ⅱ) Complete Linkage : maximum distance rule
ⅲ) Average Linkage
ⅳ) Ward's Method
● Within-cluster variance minimization rule
● Merges clusters so that the increase in the total within-cluster sum of squares is as small as possible
ⅴ) Centroid Method
● Merges the objects or clusters whose centroids (means) are closest
● Drawback : applicable only to metric data

② Divisive Method : starts from one large cluster and successively splits it
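The agglomerative rules differ only in how the distance between two clusters is scored. The sketch below writes each rule as a small function; the names and the two sample clusters are hypothetical. For Ward's method it uses the standard identity that merging clusters of sizes n1 and n2 raises the within-cluster sum of squares by n1*n2/(n1+n2) times the squared distance between their centroids.

```python
import math
from itertools import product

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def single_linkage(c1, c2):    # minimum distance rule
    return min(dist(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2):  # maximum distance rule
    return max(dist(p, q) for p, q in product(c1, c2))

def average_linkage(c1, c2):   # mean of all between-cluster pair distances
    return sum(dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))

def centroid_distance(c1, c2):
    return dist(centroid(c1), centroid(c2))

def ward_increase(c1, c2):
    """Increase in total within-cluster sum of squares if c1 and c2 merge."""
    n1, n2 = len(c1), len(c2)
    return n1 * n2 / (n1 + n2) * dist(centroid(c1), centroid(c2)) ** 2

# Two hypothetical clusters of two objects each
c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (3.0, 2.0)]
print(single_linkage(c1, c2))     # 3.0
print(complete_linkage(c1, c2))   # sqrt(13), about 3.61
print(average_linkage(c1, c2))
print(centroid_distance(c1, c2))  # 3.0
print(ward_increase(c1, c2))      # 9.0
```

At every step an agglomerative procedure merges the pair of clusters with the smallest score under the chosen rule.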

Dendrogram illustrating hierarchical clustering.

[Figure: dendrogram with observation numbers 01 to 08 on one axis and ticks 1 to 7 on the other]


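[Single Linkage]

[Figure: objects A, B, C, and D, before and after merging, with labeled distances 1.5, 1.55, 1.4, 1.2, and 1.3]

[Complete Linkage]

[Figure: objects A, B, C, and D, before and after merging, with labeled distances 1.5 and 1.55]

[Average Linkage]

[Figure: objects A, B, C, and D, before and after merging, with labeled distances 1.45 and 1.425]

[Ward Method] [Centroid Method]

[Figure: scatter-plot illustrations of the Ward Method and the Centroid Method]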


2) Nonhierarchical Clustering Procedures
= k-means clustering

ⅰ) Sequential threshold procedure
① Select one cluster center and assign every object within a prespecified distance to that cluster.
② Select a second cluster center and assign every object within the prespecified distance to that cluster.

ⅱ) Parallel threshold procedure
① Select several cluster centers at the outset and assign each object to the nearest center.
② The threshold distance can be adjusted.

ⅲ) Optimizing Partitioning Method
: Objects can later be reassigned to other clusters according to an overall optimizing criterion (e.g., the average within-cluster distance)
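The reassignment idea behind the optimizing partitioning method can be sketched as a k-means style loop: assign every object to its nearest center, recompute the centers as cluster means, and repeat until nothing changes. The function names, the starting centers, and the sample objects are all assumptions made for this illustration.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def optimize_partition(points, centers, max_iter=100):
    """Alternate nearest-center assignment and center updates until stable."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest center
        labels = [min(range(len(centers)), key=lambda g: sq_dist(p, centers[g]))
                  for p in points]
        # Recompute each center as the mean of its members
        new_centers = []
        for g, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == g]
            new_centers.append(mean_point(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical objects forming two well-separated groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = optimize_partition(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```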


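▣ Disadvantages of Nonhierarchical Clustering

① The number of clusters must be decided in advance
② The selection of cluster centers is arbitrary
③ The results depend on the order of the data

▣ Advantages of Nonhierarchical Clustering

① Cluster centers can be selected nonrandomly
② Clustering is fast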
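The order dependence listed among the disadvantages is easy to reproduce with a sequential threshold sketch. Here the first unassigned object is taken as the next cluster center, which is an assumption of this example (the slides do not say how a center is chosen), and the three one-dimensional objects are hypothetical.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def sequential_threshold(points, threshold):
    """Pick centers one at a time; each absorbs all unassigned objects within threshold."""
    unassigned = list(points)
    clusters = []
    while unassigned:
        center = unassigned[0]  # assumption: first unassigned object is the center
        members = [p for p in unassigned if dist(p, center) <= threshold]
        clusters.append(members)
        unassigned = [p for p in unassigned if p not in members]
    return clusters

# The same three objects presented in two different orders
print(sequential_threshold([(0.0,), (1.0,), (2.0,)], 1.0))  # two clusters
print(sequential_threshold([(1.0,), (0.0,), (2.0,)], 1.0))  # one cluster
```

Starting from the middle object puts all three objects within the threshold, while starting from an end object leaves the far object to form its own cluster.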


3) Choosing a Clustering Method : Hierarchical vs. Nonhierarchical

ⅰ) Hierarchical + Ward's Method + average linkage
⇒ An incorrect merge early in the process keeps affecting every later stage

ⅱ) Hierarchical + Nonhierarchical
① Obtain an initial clustering solution with a hierarchical procedure (Ward's Method + average linkage)
② Use the resulting number of clusters and the cluster centroids as input to the optimizing partitioning method
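The two-step combination can be sketched as follows: a Ward-style agglomeration supplies the cluster centroids, and those centroids seed a k-means style reassignment loop. This is a simplified illustration, not the procedure of any particular package; the function names and the six sample objects are hypothetical, and only Ward's rule is used for the hierarchical step.

```python
from itertools import combinations

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def ward_cost(c1, c2):
    """Increase in within-cluster sum of squares caused by merging c1 and c2."""
    n1, n2 = len(c1), len(c2)
    return n1 * n2 / (n1 + n2) * sq_dist(centroid(c1), centroid(c2))

def ward_clusters(points, k):
    """Step 1: merge the cheapest pair repeatedly until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

def k_means(points, centers, max_iter=100):
    """Step 2: reassign objects to the nearest center until stable."""
    labels = []
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda g: sq_dist(p, centers[g]))
                  for p in points]
        new_centers = []
        for g, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == g]
            new_centers.append(centroid(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def hybrid(points, k):
    seeds = [centroid(c) for c in ward_clusters(points, k)]
    return k_means(points, seeds)

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = hybrid(points, 2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the seeds already sit at the hierarchical centroids, the second step mainly lets objects that were merged badly early on move to a better cluster.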


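(4) Deciding on the Number of Clusters

ⅰ) Consider theoretical, conceptual, and practical purposes

ⅱ) Judge by the distances between clusters

ⅲ) In nonhierarchical clustering, plot the ratio

    (Within-Group Variance) / (Between-Group Variance)

against the number of clusters and use the point where the curve bends as the number of clusters

ⅳ) Judge by the number of cases in each cluster
(a cluster with only one case is undesirable)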
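The within-to-between ratio in ⅲ) can be computed for any given partition. The sketch below uses sums of squares rather than variances, which is an assumption of this example, and compares a good and a poor partition of four hypothetical objects.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def within_between_ratio(points, labels):
    """Within-group sum of squares divided by between-group sum of squares."""
    grand = centroid(points)
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    within = 0.0
    between = 0.0
    for members in groups.values():
        c = centroid(members)
        within += sum(sq_dist(p, c) for p in members)
        between += len(members) * sq_dist(c, grand)
    return within / between  # needs at least two groups with distinct centroids

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(within_between_ratio(points, [0, 0, 1, 1]))  # 0.04, tight and well separated
print(within_between_ratio(points, [0, 1, 0, 1]))  # 25.0, a poor partition
```

Computing this ratio for the solutions at successive numbers of clusters gives the curve whose bend is inspected.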


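(5) Interpreting the Clusters

ⅰ) Clusters are usually interpreted through their centroids

ⅱ) Discriminant analysis can be used

(6) Validation

ⅰ) Compare the results obtained with different distance measures

ⅱ) Compare the results obtained with different algorithms

ⅲ) Split the data randomly into two halves and compare the cluster centroids of each half

ⅳ) Randomly drop part of the data and compare the results for the remainder

ⅴ) Because nonhierarchical clustering depends on the order of the data, reorder the data, cluster several times, and choose the most stable result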


Examples

(1) Example 1

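■ Objective : to understand the characteristics of the 15 car models in the existing market before launching a new car
■ Classification criteria for the cars (from a preliminary survey) : exterior size and engine displacement
■ Exterior size and engine displacement are standardized

Standardized ratings of the car attributes

----------------------------------------------
Car model   Exterior size   Engine displacement
----------------------------------------------
A            2.50            2.50
B            2.25            2.00
C            3.00            2.00
D            2.50            1.75
E            0.25            1.00
F            0.50            0.50
G            0.25            0.25
H           -0.25            0.50
I           -0.25           -0.25
J            0.25           -0.50
K           -2.00           -1.50
L           -1.50           -1.75
M           -2.50           -2.00
N           -2.00           -2.25
O           -2.50           -2.50
----------------------------------------------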
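As a rough check on the data, single linkage can be run on the 15 standardized cars with a few lines of Python. The code is a sketch: the function names are hypothetical, and stopping at three clusters is an assumption made here for illustration, not a result stated in the example.

```python
import math
from itertools import combinations, product

# (exterior size, engine displacement), standardized, from the table
cars = {
    "A": (2.50, 2.50), "B": (2.25, 2.00), "C": (3.00, 2.00),
    "D": (2.50, 1.75), "E": (0.25, 1.00), "F": (0.50, 0.50),
    "G": (0.25, 0.25), "H": (-0.25, 0.50), "I": (-0.25, -0.25),
    "J": (0.25, -0.50), "K": (-2.00, -1.50), "L": (-1.50, -1.75),
    "M": (-2.50, -2.00), "N": (-2.00, -2.25), "O": (-2.50, -2.50),
}

def car_distance(a, b):
    (x1, y1), (x2, y2) = cars[a], cars[b]
    return math.hypot(x1 - x2, y1 - y2)

def single_link(c1, c2):
    """Minimum distance between any car in c1 and any car in c2."""
    return min(car_distance(a, b) for a, b in product(c1, c2))

def agglomerate(names, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[n] for n in names]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

groups = ["".join(sorted(g)) for g in agglomerate(sorted(cars), 3)]
print(groups)  # ['ABCD', 'EFGHIJ', 'KLMNO']
```

The three groups correspond to the large, mid-sized, and small cars in the table.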


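■ Two-dimensional plot of the car characteristics

[Figure 18-4] Two-dimensional plot of the car characteristics

[Figure: cars A through O plotted with X1 (exterior size) and X2 (engine displacement) as axes]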


[Figure 18-5] Result of single linkage

■ SPSS Quick Cluster → computes the classification cluster centers, then computes the mean of each cluster and uses those means again as input

[Dendrogram: leaf order A D B C E G J F H I K N M O L, with merge steps numbered 1 to 14]


[Figure 18-6] Result of complete linkage

■ SPSS Quick Cluster → computes the classification cluster centers, then computes the mean of each cluster and uses those means again as input

[Dendrogram: leaf order A D B C E G J F H I K N M O L, with numbered merge steps]


(2) Example 2

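■ Objective : classify customers according to how important they rate the company's characteristics

(Stage 1) Partitioning

Step 1 : Hierarchical Cluster Analysis

1) Similarity measure : squared Euclidean distances

2) Algorithm : Ward's method
⇒ minimizes within-cluster differences

3) Deciding the number of clusters
: the two-cluster solution is chosen as the best alternative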


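TABLE 7.2 Analysis of Agglomeration Coefficient for Hierarchical Cluster Analysis

------------------------------------------------------------
Number of    Percentage Change in Agglomeration
Clusters     Coefficient to Next Level
------------------------------------------------------------
10            8.9
 9            8.5
 8            9.2
 7            9.3
 6            9.3
 5           12.1
 4           17.0
 3           17.6
 2           61.9
 1            -
------------------------------------------------------------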
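One way to read Table 7.2 is as a stopping rule: the percentage listed for k clusters is the jump in the agglomeration coefficient caused by merging down to k - 1 clusters, so the largest jump suggests stopping at that k. The sketch below applies this reading to the table values; treating the single largest jump as the criterion is an assumption of this example.

```python
# Percentage change in the agglomeration coefficient to the next level (Table 7.2)
pct_change = {10: 8.9, 9: 8.5, 8: 9.2, 7: 9.3, 6: 9.3,
              5: 12.1, 4: 17.0, 3: 17.6, 2: 61.9}

# The largest jump marks the last solution before dissimilar clusters are forced together
best_k = max(pct_change, key=pct_change.get)
print(best_k)  # 2
```

This agrees with the two-cluster solution selected in Step 1.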


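Step 2 : Nonhierarchical Cluster Analysis → fine-tunes the results of the hierarchical procedure
⇒ Confirms the results of the hierarchical procedure

Results of Nonhierarchical Cluster Analysis with Initial Seed Points from Hierarchical Results

--------------------------------------------------------------------------
                            Mean Values*
Cluster   X1     X2     X3     X4     X5     X6     X7     Cluster Size
--------------------------------------------------------------------------
Classification cluster centers
1         4.40   1.39   8.70   5.09   2.94   2.65   5.91
2         2.43   3.22   6.74   5.69   2.87   2.87   8.10
Final cluster centers
1         4.38   1.58   8.90   4.92   2.96   2.52   5.90   52
2         2.57   3.21   6.80   5.60   2.87   2.82   8.13   48
--------------------------------------------------------------------------

Significance Testing of Differences Between Cluster Centers

------------------------------------------------------------------------------------------
Variables                   Cluster M.S.   df   Error M.S.   df     F Ratio    Probability
------------------------------------------------------------------------------------------
X1 Delivery speed            81.5631       1     .9298       98.0    87.7172   .000
X2 Price level               66.4571       1     .7661       98.0    86.7526   .000
X3 Price flexibility        109.6372       1     .8233       98.0   133.1750   .000
X4 Manufacturer's image      11.3023       1    1.1778       98.0     9.5959   .003
X5 Overall service             .1883       1     .5682       98.0      .3314   .566
X6 Sales force's image        2.1233       1     .5786       98.0     3.6697   .058
X7 Product quality          123.3719       1    1.2797       98.0    96.4042   .000
------------------------------------------------------------------------------------------

* X1 = Delivery speed : X2 = Price level : X3 = Price flexibility : X4 = Manufacturer's image : X5 = Overall service : X6 = Sales force's image : X7 = Product quality.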
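The F ratios in the significance table follow the usual one-way ANOVA relationship, F = cluster mean square / error mean square. The short check below recomputes them from the two mean-square columns; small discrepancies are expected because the printed mean squares are rounded.

```python
# Mean squares and reported F ratios copied from the significance table
cluster_ms = {"X1": 81.5631, "X2": 66.4571, "X3": 109.6372, "X4": 11.3023,
              "X5": 0.1883, "X6": 2.1233, "X7": 123.3719}
error_ms = {"X1": 0.9298, "X2": 0.7661, "X3": 0.8233, "X4": 1.1778,
            "X5": 0.5682, "X6": 0.5786, "X7": 1.2797}
reported_f = {"X1": 87.7172, "X2": 86.7526, "X3": 133.1750, "X4": 9.5959,
              "X5": 0.3314, "X6": 3.6697, "X7": 96.4042}

computed_f = {var: cluster_ms[var] / error_ms[var] for var in cluster_ms}
for var in cluster_ms:
    print(var, round(computed_f[var], 2), reported_f[var])
```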


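Group Means and Significance Level for Two-Group Nonhierarchical Cluster Solution

--------------------------------------------------------------------------
                                Cluster
Variables                       1        2        F Ratio    Significance
--------------------------------------------------------------------------
Stage Two : Interpretation
X1                              4.460    2.570    105.00     .0000
X2                              1.576    3.152     76.61     .0000
X3                              8.900    6.888    111.30     .0000
X4                              4.926    5.570      8.73     .0039
X5                              2.992    2.840      1.02     .3141
X6                              2.510    2.820      4.17     .0438
X7                              5.904    8.038     82.68     .0000

Stage Three : Profiling
Other variables of interest
X9 Usage level                  49.88    42.32     21.312    .0000
X10 Satisfaction level           5.16     4.38     26.545    .0000
--------------------------------------------------------------------------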


Stage Two : Interpretation

- See Table 7.4

- X5 shows no significant difference between the two groups

- Cluster 1 focuses on ⅰ) delivery speed ⅱ) price flexibility

Cluster 2 focuses on ⅰ) price ⅱ) manufacturer's image ⅲ) sales force image ⅳ) product quality
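The interpretation above can be retraced from the group means and significance levels of the two-group solution. The sketch below assumes a 0.05 significance cutoff, which the source does not state explicitly, and labels each variable by the cluster with the higher mean.

```python
# (cluster 1 mean, cluster 2 mean, significance) from the group means table
group_means = {
    "X1 Delivery speed":       (4.460, 2.570, 0.0000),
    "X2 Price level":          (1.576, 3.152, 0.0000),
    "X3 Price flexibility":    (8.900, 6.888, 0.0000),
    "X4 Manufacturer's image": (4.926, 5.570, 0.0039),
    "X5 Overall service":      (2.992, 2.840, 0.3141),
    "X6 Sales force's image":  (2.510, 2.820, 0.0438),
    "X7 Product quality":      (5.904, 8.038, 0.0000),
}
ALPHA = 0.05  # assumed cutoff

cluster_1 = [v for v, (m1, m2, p) in group_means.items() if p < ALPHA and m1 > m2]
cluster_2 = [v for v, (m1, m2, p) in group_means.items() if p < ALPHA and m2 > m1]
no_difference = [v for v, (_, _, p) in group_means.items() if p >= ALPHA]

print("Cluster 1:", cluster_1)
print("Cluster 2:", cluster_2)
print("No significant difference:", no_difference)
```

Under that cutoff the lists reproduce the profile given above: X1 and X3 for cluster 1, X2, X4, X6, and X7 for cluster 2, and X5 with no difference.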

Stage Three : Validation

- See Table 7.5 (checks the consistency of the results)

⇒ Compare with the clustering obtained from a randomly selected subset


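TABLE 7.5 Results of Nonhierarchical Cluster Analysis with Randomly Selected Initial Seed Points

--------------------------------------------------------------------------
                            Mean Values*
Cluster   X1     X2     X3     X4     X5     X6     X7     Cluster Size
--------------------------------------------------------------------------
Classification cluster centers
1         4.95   1.14   9.03   6.55   3.21   3.79   5.09
2         1.76   2.70   6.87   5.50   1.97   2.70   8.45
Final cluster centers
1         4.47   1.57   8.93   4.99   2.99   2.57   5.78   48
2         2.63   3.10   6.94   5.49   2.84   2.75   8.07   52
--------------------------------------------------------------------------

Significance Testing of Differences Between Cluster Centers

------------------------------------------------------------------------------
Variables   Cluster M.S.   df   Error M.S.   df     F Value    Probability
------------------------------------------------------------------------------
X1           84.3339       1     .9016       98.0    93.5415   .000
X2           58.6837       1     .8454       98.0    69.4175   .000
X3           98.5164       1     .9367       98.0   105.1700   .000
X4            6.2640       1    1.2292       98.0     5.0958   .026
X5             .5883       1     .5641       98.0     1.0428   .310
X6             .7477       1     .5927       98.0     1.2616   .264
X7          131.1200       1    1.2007       98.0   109.2055   .000
------------------------------------------------------------------------------

* X1 = Delivery speed : X2 = Price level : X3 = Price flexibility : X4 = Manufacturer's image : X5 = Overall service : X6 = Sales force's image : X7 = Product quality.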
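A simple way to judge the consistency that Table 7.5 is meant to show is to compare its final cluster centers with those obtained earlier from the hierarchical seed points. The sketch below reports the largest absolute difference per cluster; the choice of this particular summary is an assumption of the example.

```python
# Final cluster centers (X1 to X7) from the two nonhierarchical runs
hierarchical_seeds = {
    1: (4.38, 1.58, 8.90, 4.92, 2.96, 2.52, 5.90),
    2: (2.57, 3.21, 6.80, 5.60, 2.87, 2.82, 8.13),
}
random_seeds = {
    1: (4.47, 1.57, 8.93, 4.99, 2.99, 2.57, 5.78),
    2: (2.63, 3.10, 6.94, 5.49, 2.84, 2.75, 8.07),
}

max_diff = {
    c: max(abs(a - b) for a, b in zip(hierarchical_seeds[c], random_seeds[c]))
    for c in hierarchical_seeds
}
for c, d in max_diff.items():
    print(c, round(d, 2))
# 1 0.12
# 2 0.14
```

Differences of at most 0.14 on every variable suggest that the two-cluster solution does not depend much on the starting seed points.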
