כריית מידע -- Clustering

מידע -- Clusteringכריית

רוזנפלד" אבי ר ד

: הם דומים דברים הכללי הרעיוןדומים

דומים • דברים נאסוף איך–Regression, Classification (Supervised), k-nn– Clustering (Unsupervised) k-meand–Partitioning Algorithms (k-mean), Hierarchical

Algorithms•" " : קירבה להגדיר איך פתוחות שאלות

Euclideanמרחק – Manhattan (Judea Pearl)מרחק –אחריות – אופציות הרבה

)||...|||(|),( 22

22

2

11 pp jx

ix

jx

ix

jx

ixjid

השאלה סימן את לסווג ?איך

K-Nearest Neighborאמת • בזמן הסיווג את model freeבודקיםהשכנים • מספר את לקבוע צריכיםמהנקודה • המרחק לפי שקלול יש כלל בדרךדומה Case Based Reasoningאו CBRגם •לפי ) • משקל איזשהו או הרוב לפי הולכים בסיווג

הקרבה(איזשהו ) • או הרוב לפי יהיה הערך ברגרסיה

) הקרבה לפי משקל

1-Nearest Neighbor

3-Nearest Neighbor

7

k NEAREST NEIGHBOR

• Choosing the value of k:– If k is too small, sensitive to noise points– If k is too large, neighborhood may include points from other

classes– Choose an odd value for k, to eliminate ties

k = 3: Belongs to triangle class

k = 7: Belongs to square class

ICDM: Top Ten Data Mining Algorithms k nearest neighbor classification December 2006

?

k = 1: Belongs to square class

8

Remarks+Highly effective inductive inference method for

noisy training data and complex target functions

+Target function for a whole space may be described as a combination of less complex local approximations

+Learning is very simple- Classification is time consuming

Clustering K-MEAN: האלגוריתם הבסיסי ל Kבחר ערך רצוי של אשכולות: 1. Kמתוך אוכלוסיית המדגם שנבחרה (להלן הנקודות), בחר2.

נקודות אקראיות. נקודות אלו הם המרכזים ההתחלתיים של )Seedsהאשכולות(

קבע את המרחק האוקלידי של כל הנקודות מהמרכזים שנבחרו3.

K כל נקודה משויכת למרכז הקרוב אליה ביותר. בצורה זו קיבלנו 4.אשכולות זרים זה לזה.

בכל אשכול: קבע נקודות מרכז חדשה על ידי חישוב הממוצע 5.של כל הנקודות באשכול

אם נקודת המרכז שווה לנקודה הקודמת התהליך הסתיים , 6.3אחרת חזור ל

נקודות6דוגמא עם

Instance X Y

1 1.0 1.5

2 1.0 4.5

3 2.0 1.5

4 2.0 3.5

5 3.0 2.5

6 5.0 6.0

נקודות6דוגמא עם

1איטרציה C1,C2 להלן 1,3באופן אקראי נבחרו הנקודות •3,4,5,6 נבחרו הנקודות C2. למרכז 1,2 נבחרות נקודות C1למרכז •Distance= √(x1-x2)² + ( y1-y2 ( ²נוסחת המרחק: •

C1המרחק מ C2המרחק מ

0.00 1.00

3.00 3.16

1.00 0.00

2.24 2.00

2.24 1.41

6.02 5.41

בחירת מרכזים חדשים

C1ל •–X=(1.0+1.0)/2=1.0–Y=(1.5+4.5)/2=3.0

C2ל •–X=(2.0+2.0+3.0+5.0)/4.0=3.0–Y=(1.5+3.5+2.5+6.0)/4.0=3.375

2איטרציה C1(1.0, 3.0) C2(3.0, 3.375)נקודות המרכז החדשות: •4,5,6 יצטרפו : C2 ל 1,2,3 יצטרפו הנקודות: C1ל •

C1המרחק מ C2המרחק מ

1.5 2.74

1.5 2.29

1.8 2.125

1.12 1.01

2.06 0.875

5.00 3.30

התוצאה הסופית

CS583, Bing Liu, UIC 20

עם k-meansבעיותמראש • להגדיר המשתמש Kעלהממוצע • את לחשב שניתן מניחל • רגיש outliersמאוד

–Outliers מהאחרים הרחוקות נקודות הם–... טעות סתם להיות יכול


של OUTLIERדוגמא

22

Euclideanמרחק

• Euclidean distance:

• Properties of a metric d(i,j):–d(i,j) 0–d(i,i) = 0–d(i,j) = d(j,i)–d(i,j) d(i,k) + d(k,j)

)||...|||(|),( 22

22

2

11 pp jx

ix

jx

ix

jx

ixjid


Hierarchical Clustering• Produce a nested sequence of clusters, a tree, also

called Dendrogram.


Types of hierarchical clustering• Agglomerative (bottom up) clustering: It builds the

dendrogram (tree) from the bottom level, and – merges the most similar (or nearest) pair of clusters – stops when all the data points are merged into a single cluster

(i.e., the root cluster).

• Divisive (top down) clustering: It starts with all data points in one cluster, the root. – Splits the root into a set of child clusters. Each child cluster is

recursively divided further – stops when only singleton clusters of individual data points

remain, i.e., each cluster with only a single point


Agglomerative clustering

It is more popular then divisive methods.• At the beginning, each data point forms a

cluster (also called a node). • Merge nodes/clusters that have the least

distance.• Go on merging• Eventually all nodes belong to one cluster


Agglomerative clustering algorithm


An example: working of the algorithm


Measuring the distance of two clusters

• A few ways to measure distances of two clusters.

• Results in different variations of the algorithm.– Single link– Complete link– Average link– Centroids– …


Single link method• The distance between two

clusters is the distance between two closest data points in the two clusters, one data point from each cluster.

• It can find arbitrarily shaped clusters, but– It may cause the

undesirable “chain effect” by noisy points

Two natural clusters are split into two


Complete link method• The distance between two clusters is the distance of

two furthest data points in the two clusters. • It is sensitive to outliers because they are far away

EM Algorithm

• Initialize K cluster centers• Iterate between two steps

– Expectation step: assign points to clusters

–Maximation step: estimate model parameters

j

jijkikki cdwcdwcdP ) |Pr() |Pr() (

m

ik

ji

kiik cdP

cdPd

m 1 ) (

) (1

N

cdw i

ki

k

) Pr(

Documents

כריית מידע -- Clustering