
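Converting Categories to Numbers for Approximate Nearest Neighbor Search

郭煌政 (Huang-Cheng Kuo), Department of Computer Science and Information Engineering, National Chiayi University (嘉義大學資工系)
2004/10/20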

Outline

Introduction
Motivation
Measurement
Algorithms
Experiments
Conclusion

Introduction

Memory-Based Reasoning
– Case-Based Reasoning
– Instance-Based Learning

Given a training dataset and a new object, predict the class (target value) of the new object.

Focus on tabular data

Introduction

K Nearest Neighbor Search
– Compute the similarity between the new object and each object in the training dataset.
– Takes time linear in the size of the dataset

Similarity: Euclidean distance

Multi-dimensional Index
– Spatial data structure, such as an R-tree
– Numeric data
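
The linear scan described above can be sketched as follows. This is a minimal illustration, not the presenter's code: the function name `knn_linear_scan`, the `(features, label)` row layout, and the toy data are assumptions made for the example.

```python
import heapq
import math


def knn_linear_scan(train, query, k):
    """Return the k training rows closest to `query` by Euclidean distance.

    `train` is a list of (features, label) pairs. Every row is visited once,
    so the cost of one query grows linearly with the training set size.
    """
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return heapq.nsmallest(k, train, key=lambda row: euclidean(row[0], query))


# Toy data (hypothetical): two numeric features and a class label per row.
train = [((0.0, 0.0), "a"), ((1.0, 1.0), "a"), ((5.0, 5.0), "b"), ((6.0, 5.0), "b")]
neighbors = knn_linear_scan(train, (0.9, 0.8), k=2)
labels = [label for _, label in neighbors]
```

Because every training row is examined, the cost of a query grows with the dataset, which is the motivation for filtering through a multi-dimensional index first.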

Introduction

Indexing on Categorical Data?
– Linear order of the categories
– Existing correct ordering?
– Best ordering?

Store the mapped data in a multi-dimensional data structure as a filtering mechanism

Measurement for Ordering

Ordering Problem

Given an undirected, weighted, complete graph, a simple path that visits every vertex is an ordering of the vertices. The edge weights are the distances between pairs of vertices. The ordering problem is to find such a path, called the ordering path, of maximal value according to a given scoring function.

Measurement for Ordering

Relationship Scoring

Reasonable Ordering Score

In an ordering path <v1, v2, …, vn>, a 3-tuple <v_{i-1}, v_i, v_{i+1}> is reasonable if and only if

dist(v_{i-1}, v_{i+1}) ≥ dist(v_{i-1}, v_i) and
dist(v_{i-1}, v_{i+1}) ≥ dist(v_i, v_{i+1}).
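
A short sketch of how this test could be turned into a score for a whole path. The slide only defines when a single 3-tuple is reasonable; the sketch assumes the score is the fraction of consecutive 3-tuples that pass the test. The names `reasonable_ordering_score` and `positions` are hypothetical.

```python
def reasonable_ordering_score(path, dist):
    """Fraction of consecutive 3-tuples in `path` that are reasonable.

    Assumption: the score of an ordering path is the share of its
    3-tuples <a, b, c> with dist(a, c) >= dist(a, b) and dist(a, c) >= dist(b, c).
    """
    triples = list(zip(path, path[1:], path[2:]))
    if not triples:
        return 1.0  # assumption: fewer than 3 vertices counts as fully ordered
    reasonable = sum(
        1
        for a, b, c in triples
        if dist(a, c) >= dist(a, b) and dist(a, c) >= dist(b, c)
    )
    return reasonable / len(triples)


# Toy categories placed on a line (hypothetical data).
positions = {"A": 0.0, "B": 1.0, "C": 3.0, "D": 4.0}


def line_dist(u, v):
    return abs(positions[u] - positions[v])


score_sorted = reasonable_ordering_score(["A", "B", "C", "D"], line_dist)
score_mixed = reasonable_ordering_score(["A", "C", "B", "D"], line_dist)
```

In this toy case the path that follows the line scores 1.0, while the path that swaps the two middle categories scores 0.0.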

Measurement for Mapping

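Pairwise Difference Scoring
– Normalized distance matrix
– Mapping values of categories
– $\mathrm{dist}_m(v_i, v_j) = |\mathrm{mapping}(v_i) - \mathrm{mapping}(v_j)|$

$$\mathrm{rmse} = \sqrt{\frac{\sum_{i<j} \left(\mathrm{dist}(v_i, v_j) - \mathrm{dist}_m(v_i, v_j)\right)^2}{n(n-1)/2}}$$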
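
The pairwise difference score can be computed directly from a distance table and a mapping. The sketch below assumes the root mean squared error is taken over all n(n-1)/2 unordered pairs of categories; the function name, the `frozenset` keys, and the toy values are illustrative choices, not part of the slides.

```python
import math
from itertools import combinations


def mapping_rmse(categories, dist, mapping):
    """RMSE between original distances and distances of the mapped values.

    `dist` maps frozenset({a, b}) to the normalized distance between a and b;
    `mapping` maps each category to its number.
    """
    pairs = list(combinations(categories, 2))  # n(n-1)/2 unordered pairs
    squared_error = sum(
        (dist[frozenset((a, b))] - abs(mapping[a] - mapping[b])) ** 2
        for a, b in pairs
    )
    return math.sqrt(squared_error / len(pairs))


categories = ["A", "B", "C"]
dist = {
    frozenset(("A", "B")): 0.5,
    frozenset(("B", "C")): 0.5,
    frozenset(("A", "C")): 1.0,
}
rmse_good = mapping_rmse(categories, dist, {"A": 0.0, "B": 0.5, "C": 1.0})
rmse_bad = mapping_rmse(categories, dist, {"A": 0.0, "B": 1.0, "C": 0.5})
```

A mapping that reproduces every pairwise distance gives an RMSE of 0; swapping the numbers of B and C raises it.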

Algorithms

Prim-like Ordering
Kruskal-like Ordering
Divisive Ordering
GA Approach Ordering
– A vertex is a category
– A graph represents a distance matrix

Prim-like Ordering Algorithm

Prim’s Minimum Spanning Tree
– Initially, choose a least edge (u, v)
– Add the edge to the tree; S = {u, v}
– Choose a least edge connecting a vertex in S and a vertex, w, not in S
– Add the edge to the tree; add w to S
– Repeat until all vertices are in S

Prim-like Ordering Algorithm

Prim-like Ordering– Choose a least edge (u, v)

– Add the edge to the ordering path; S = {u, v}

– Choose a least edge connecting a vertex in S and a vertex, w, not in S

– If the edge creates a cycle on the path, discard the edge, and choose again

– Else, add the edge to the ordering path; Add w to S

– Repeat until all vertices are in S
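
One reading of these steps, sketched in Python. Assumption: an edge is kept only if it attaches the new vertex w to one of the two ends of the current path, so that the result stays a simple path; edges that touch interior vertices are the ones discarded. The names `prim_like_ordering` and `positions` are hypothetical.

```python
from collections import deque


def prim_like_ordering(vertices, dist):
    """Grow an ordering path greedily, starting from a least edge."""
    vertices = list(vertices)
    if len(vertices) < 2:
        return vertices
    # Start from a least edge (u, v).
    u, v = min(
        ((a, b) for i, a in enumerate(vertices) for b in vertices[i + 1:]),
        key=lambda edge: dist(*edge),
    )
    path = deque([u, v])
    remaining = set(vertices) - {u, v}
    while remaining:
        # Assumption: only the two ends of the path may accept a new vertex.
        candidates = [
            (dist(end, w), at_tail, w)
            for at_tail, end in ((False, path[0]), (True, path[-1]))
            for w in remaining
        ]
        _, at_tail, w = min(candidates)
        if at_tail:
            path.append(w)
        else:
            path.appendleft(w)
        remaining.remove(w)
    return list(path)


# Toy categories on a line (hypothetical data), given in scrambled order.
positions = {"A": 0.0, "B": 1.0, "C": 3.0, "D": 7.0, "E": 15.0}
prim_path = prim_like_ordering(
    ["C", "E", "A", "D", "B"], lambda a, b: abs(positions[a] - positions[b])
)
```

On this toy input the greedy path recovers the left-to-right order of the categories.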

Kruskal-like Ordering Algorithm

Kruskal's Minimum Spanning Tree
– Initially, choose a least edge (u, v)
– Add the edge to the tree; S = {u, v}
– Choose a least edge as long as the edge does not create a cycle in the tree
– Add the edge to the tree; add the two vertices to S
– Repeat until all vertices are in S

Kruskal-like Ordering Algorithm

Kruskal-like Ordering
– Initially, choose a least edge (u, v) and add it to the ordering path; S = {u, v}
– Choose a least edge as long as the edge does not create a cycle and the degree of either vertex on the path stays <= 2
– Add the edge to the ordering path; add the two vertices to S
– Repeat until all vertices are in S

A heap array can be used to speed up choosing the least edge
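
A possible sketch of the Kruskal-like steps with a heap. Two details are assumptions rather than something the slide states: cycles are detected with a union-find structure, and the loop stops once n - 1 edges have been accepted, so that separately grown fragments end up joined into a single path. All names are hypothetical.

```python
import heapq


def kruskal_like_ordering(vertices, dist):
    """Accept least edges that keep every degree <= 2 and create no cycle."""
    vertices = list(vertices)
    n = len(vertices)
    if n < 2:
        return vertices
    heap = [(dist(a, b), a, b) for i, a in enumerate(vertices) for b in vertices[i + 1:]]
    heapq.heapify(heap)  # heap array: repeated access to the least edge

    parent = {v: v for v in vertices}  # union-find forest (assumption)

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    neighbors = {v: [] for v in vertices}
    accepted = 0
    while accepted < n - 1:
        _, a, b = heapq.heappop(heap)
        if len(neighbors[a]) >= 2 or len(neighbors[b]) >= 2:
            continue  # would give a vertex degree 3
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # would close a cycle
        parent[root_a] = root_b
        neighbors[a].append(b)
        neighbors[b].append(a)
        accepted += 1

    # Walk the path from one of its two end vertices (degree 1).
    path, previous = [next(v for v in vertices if len(neighbors[v]) == 1)], None
    while len(path) < n:
        current = path[-1]
        following = next(w for w in neighbors[current] if w != previous)
        path.append(following)
        previous = current
    return path


positions = {"A": 0.0, "B": 1.0, "C": 3.0, "D": 7.0, "E": 15.0}
kruskal_path = kruskal_like_ordering(
    ["C", "E", "A", "D", "B"], lambda a, b: abs(positions[a] - positions[b])
)
```

The walk may start from either end, so the result is the ordering or its reverse; both describe the same path.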

Divisive Ordering Algorithm

Idea:
– Pick a central vertex, and split the remaining vertices
– Build a binary tree: vertices are the leaves

Central Vertex: the vertex $v_i \in C$ that attains

$$\min_{v_i \in C} \Big\{ \sum_{v_j \in C,\ j \neq i} \mathrm{dist}(v_i, v_j) \Big\}$$
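
Picking the central vertex can be sketched in a few lines, assuming it is the vertex of the set C whose summed distance to all other vertices of C is smallest. The names `central_vertex` and `positions` are hypothetical.

```python
def central_vertex(cluster, dist):
    """Return the vertex with the smallest total distance to the others."""
    return min(cluster, key=lambda v: sum(dist(v, w) for w in cluster if w != v))


# Toy categories on a line (hypothetical data).
positions = {"A": 0.0, "B": 1.0, "C": 3.0, "D": 7.0, "E": 15.0}
center = central_vertex(list(positions), lambda a, b: abs(positions[a] - positions[b]))
```

Here the totals are 26, 23, 21, 25, and 49 for A through E, so C is chosen.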

Divisive Ordering Algorithm

AR is closer to P than AL is.
BL is closer to P than BR is.

[Figure: tree diagram with central vertex P, subtree A (children AL, AR), and subtree B (children BL, BR)]

Clustering

Splitting a Set of Vertices into Two Groups
– Each group has at least one vertex
– Close (similar) vertices in the same group
– Distant vertices in different groups

Clustering Algorithms
– Two clusters

Clustering

Clustering
– Grouping a set of objects into classes of similar objects

Agglomerative Hierarchical Clustering Algorithm
– Singleton clusters
– Merge similar clusters

Clustering

Clustering Algorithm: Cluster Similarity
– Single link
  dist(Ci, Cj) = min(dist(p, q)), p in Ci, q in Cj
– Complete link
  dist(Ci, Cj) = max(dist(p, q)), p in Ci, q in Cj
– Average link (adopted in our study)
  dist(Ci, Cj) = avg(dist(p, q)), p in Ci, q in Cj
– Others
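
A minimal sketch of average-link agglomerative clustering that stops when two clusters remain, which is how a set of vertices could be split into two groups. It recomputes the cluster-to-cluster averages on every round for brevity, so it is meant to show the idea rather than an efficient implementation. Names and data are hypothetical.

```python
from itertools import combinations


def average_link_split(vertices, dist):
    """Merge the two closest clusters (average link) until two remain."""
    clusters = [[v] for v in vertices]  # start from singleton clusters

    def average_link(c1, c2):
        return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

    while len(clusters) > 2:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


positions = {"A": 0.0, "B": 1.0, "C": 2.5, "D": 10.0, "E": 12.5}
groups = average_link_split(
    list(positions), lambda a, b: abs(positions[a] - positions[b])
)
normalized_groups = sorted(sorted(group) for group in groups)
```

On this toy input the close categories A, B, C end up in one group and the distant D, E in the other.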

Clustering

Clustering Implementation Issues
– Which pair of clusters to merge: keep the cluster-to-cluster similarity for each pair
– Recursively partition sets of vertices while building the binary tree: a non-recursive version with a stack

GA Ordering Algorithm

Genetic Algorithm for Optimization Problems
Chromosome: solution
Population: pool of solutions
Genetic Operations
– Crossover
– Mutation

GA Ordering Algorithm

Encoding a Solution
– Binary string
– Ordered list of categories (used in our ordering problem)

Fitness Function
– Reasonable ordering score

Selecting Chromosomes for Crossover
– High fitness value => high probability

GA Ordering Algorithm

Crossover
– Single point
– Multiple points
– Mask

Crossing over AB | CDE and BD | AEC results in ABAEC and BDCDE => illegal

GA Ordering Algorithm

Repair Illegal Chromosome ABAEC
– AB*EC => fill D in the * position

Repair Illegal Chromosome ABABC
– AB**C
– D and E are missing
– Whichever one is closest to B is filled in the first * position
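
The crossover and repair steps can be sketched as below. The slide gives the repair rule only for these two examples; the generalization used here (each hole, from left to right, receives the missing category closest to the gene just before it) is an assumption, as are the function names and the toy distances.

```python
def single_point_crossover(parent1, parent2, point):
    """Swap the tails of two chromosomes at `point`."""
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]


def repair(child, categories, dist):
    """Replace repeated genes with the missing categories."""
    genes, seen, holes = list(child), set(), []
    for index, gene in enumerate(genes):
        if gene in seen:
            genes[index] = None  # the '*' position on the slide
            holes.append(index)
        else:
            seen.add(gene)
    missing = [c for c in categories if c not in seen]
    for index in holes:
        # Assumption: pick the missing category closest to the preceding gene.
        # A hole is never at index 0, and earlier holes are already filled.
        previous = genes[index - 1]
        best = min(missing, key=lambda c: dist(previous, c))
        genes[index] = best
        missing.remove(best)
    return genes


# Toy distances (hypothetical): categories placed on a line.
positions = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 3.0, "E": 5.0}


def line_dist(a, b):
    return abs(positions[a] - positions[b])


child1, child2 = single_point_crossover("ABCDE", "BDAEC", 2)
repaired1 = "".join(repair(child1, "ABCDE", line_dist))
repaired2 = "".join(repair("ABABC", "ABCDE", line_dist))
```

With these toy distances D is closer to B than E is, so ABABC is repaired by placing D in the first hole and E in the second.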

Mapping Function

Ordering Path <v1, v2, …, vn>

$$\mathrm{Mapping}(v_i) = \frac{\sum_{j=1}^{i-1} \mathrm{dist}(v_j, v_{j+1})}{\sum_{j=1}^{n-1} \mathrm{dist}(v_j, v_{j+1})}$$
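
A small sketch of the mapping step, assuming Mapping(vi) is the distance accumulated along the ordering path from v1 to vi divided by the total length of the path, so that the first category maps to 0 and the last to 1. The function name and toy data are hypothetical.

```python
def path_mapping(path, dist):
    """Map each category to its normalized cumulative distance along `path`."""
    steps = [dist(a, b) for a, b in zip(path, path[1:])]
    total = sum(steps)
    mapping, running = {}, 0.0
    for index, vertex in enumerate(path):
        mapping[vertex] = running / total if total else 0.0
        if index < len(steps):
            running += steps[index]
    return mapping


# Toy categories on a line (hypothetical data).
positions = {"A": 0.0, "B": 1.0, "C": 3.0, "D": 7.0}
mapping = path_mapping(
    ["A", "B", "C", "D"], lambda a, b: abs(positions[a] - positions[b])
)
```

The steps along this path are 1, 2, and 4, so the categories map to 0, 1/7, 3/7, and 1.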

Experiments

Synthetic Data (width/length = 5)

[Figure: Reasonable Ordering Score. x-axis: Number of Categories (10 to 500); y-axis: Percentage (0.4 to 1); series: Prim, Kruskal, divisive]

Experiments

Synthetic Data (width/length = 10)

[Figure: Reasonable Ordering Score. x-axis: Number of Categories (10 to 500); y-axis: Percentage (0.4 to 1); series: Prim, Kruskal, divisive]

Experiments

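Synthetic Data: Reasonable Ordering Score for Divisive Algorithm
– width/length = 5 => 0.82
– width/length = 10 => 0.9
– No ordering => 1/3 (in a random order, the middle vertex of a 3-tuple is the one opposite the longest of the three distances with probability 1/3)

The divisive algorithm is better than the Prim-like algorithm when the number of categories > 100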

Experiments

Synthetic Data (width/length = 5)

[Figure: Root Mean Squared Error. x-axis: Number of Categories (10 to 500); y-axis: rmse (0.02 to 0.22); series: Prim, Kruskal, divisive]

Experiments

Synthetic Data (width/length = 10)

[Figure: Root Mean Squared Error. x-axis: Number of Categories (10 to 500); y-axis: rmse (0 to 0.2); series: Prim, Kruskal, divisive]

Experiments

Divisive ordering is the best among the three ordering algorithms

For the divisive ordering algorithm on > 100 categories, RMSE scores are around 0.07 when width/length = 5, and around 0.05 when width/length = 10.

Prim-like ordering algorithm: 0.12 and 0.1, respectively.

Experiments

“Census-Income” dataset from the University of California, Irvine (UCI) KDD Archive

33 nominal attributes, 7 continuous attributes

Sample 5000 records for the training dataset.
Sample 2000 records for the approximate KNN search experiment.

Experiments

Distance Matrix: distance between two categories

V. Ganti, J. Gehrke, and R. Ramakrishnan, “CACTUS-Clustering Categorical Data Using Summaries,” ACM KDD, 1999

D = {d1, d2, …, dn} of n tuples.

D is a subset of D1 × D2 × … × Dk, where Di is a categorical domain, for 1 ≤ i ≤ k.

di = <ci1, ci2, …, cik>.

Experiments

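$$\mathrm{Pairs}^{i}_{x,y}(D) = \{\langle d_u, d_v \rangle \mid c_{ui} = x,\ c_{vi} = y\}, \quad x, y \in D_i,\ x \neq y$$

$$\mathrm{Links}^{i}_{x,y}(D) = \{\langle d_u, d_v, j \rangle \mid c_{ui} = x,\ c_{vi} = y,\ c_{uj} = c_{vj},\ 1 \le u \le n,\ 1 \le v \le n,\ j \neq i\}, \quad x, y \in D_i,\ x \neq y$$

$$\mathrm{dist}(x, y) = \frac{|\mathrm{Links}^{i}_{x,y}(D)|}{|\mathrm{Pairs}^{i}_{x,y}(D)|} \Big/ (k - 1)$$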
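
A small sketch of how such a category-to-category value could be computed from records. It assumes one reading of the formulas on this slide: Pairs counts the tuple pairs whose i-th attribute values are x and y, Links counts, for each such pair, the other attributes on which the two tuples agree, and the result is |Links| / |Pairs| / (k - 1). The function name and the toy records are hypothetical.

```python
def category_link_ratio(records, i, x, y):
    """Compute |Links| / |Pairs| / (k - 1) for categories x and y of attribute i."""
    k = len(records[0])
    rows_x = [r for r in records if r[i] == x]
    rows_y = [r for r in records if r[i] == y]
    pairs = len(rows_x) * len(rows_y)
    if pairs == 0 or k < 2:
        return 0.0  # assumption: undefined cases are reported as 0
    links = sum(
        1
        for u in rows_x
        for v in rows_y
        for j in range(k)
        if j != i and u[j] == v[j]
    )
    return links / pairs / (k - 1)


# Toy records (hypothetical): attribute 0 is the one being converted.
records = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "small", "round"),
    ("green", "large", "square"),
]
red_blue = category_link_ratio(records, 0, "red", "blue")
red_green = category_link_ratio(records, 0, "red", "green")
```

Under this reading, the value grows with how often tuples carrying the two categories agree on the remaining attributes: 0.75 for red and blue, 0.25 for red and green.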

Experiments

Approximate KNN – nominal attributes

[Figure: % of True KNN Retrieved (k * 2). x-axis: K (1 to 10); y-axis: 0.7 to 0.8; series: Kruskal, Prim, Divisive]

Experiments

Approximate KNN – nominal attributes

[Figure: % of True KNN Retrieved (k * 3). x-axis: K (1 to 10); y-axis: 0.77 to 0.87; series: Kruskal, Prim, Divisive]

Experiments

Approximate KNN – nominal attributes

[Figure: % of True KNN Retrieved (k * 6). x-axis: K (1 to 10); y-axis: 0.84 to 0.94; series: Kruskal, Prim, Divisive]

Experiments

Approximate KNN – all attributes

[Figure: % of True KNN Retrieved (k * 2). x-axis: K (1 to 10); y-axis: 0.7 to 0.8; series: Kruskal, Prim, Divisive]

Experiments

Approximate KNN – all attributes

[Figure: % of True KNN Retrieved (k * 3). x-axis: K (1 to 10); y-axis: 0.76 to 0.86; series: Kruskal, Prim, Divisive]

Experiments

Approximate KNN – all attributes

[Figure: % of True KNN Retrieved (k * 6). x-axis: K (1 to 10); y-axis: 0.85 to 0.95; series: Kruskal, Prim, Divisive]
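
The measure plotted in these charts could be computed along the following lines. This is a sketch under an assumed reading of the experiment: for each query, k * factor candidates are retrieved using distances between mapped values, and the score is the fraction of the true k nearest neighbors found among them. The names, the use of an x-coordinate as the "mapped" value, and the toy points are all illustrative.

```python
import math


def knn_recall(query, data, k, factor, true_dist, mapped_dist):
    """Fraction of the true k nearest neighbors among k * factor candidates."""
    exact = sorted(data, key=lambda r: true_dist(query, r))[:k]
    candidates = sorted(data, key=lambda r: mapped_dist(query, r))[: k * factor]
    hits = sum(1 for r in exact if r in candidates)
    return hits / k


# Toy setup (hypothetical): true distance uses both coordinates,
# the mapped distance sees only the first one.
points = {"A": (0.1, 0.0), "B": (0.0, 2.0), "C": (1.0, 0.0), "D": (3.0, 0.0)}
query = (0.0, 0.0)


def true_dist(q, name):
    return math.dist(q, points[name])


def mapped_dist(q, name):
    return abs(q[0] - points[name][0])


recall_k = knn_recall(query, list(points), 2, 1, true_dist, mapped_dist)
recall_2k = knn_recall(query, list(points), 2, 2, true_dist, mapped_dist)
```

Retrieving more candidates than k trades extra work for a higher share of the true neighbors, which matches the trend from k * 2 to k * 6 in the charts.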

Conclusion

Developed Ordering Algorithms
– Prim-like
– Kruskal-like
– Divisive
– GA-based

Devised Measurements
– Reasonable ordering score
– Root mean squared error

Conclusion

What next?
– New categories, new mapping function
– New index structure?
– Training the mapping function for a given ordering path.

Thank you.