
1

Converting Categories to Numbers for Approximate Nearest Neighbor Search

Department of Computer Science and Information Engineering, National Chiayi University, 郭煌政, 2004/10/20

2

Outline

Introduction
Motivation
Measurement
Algorithms
Experiments
Conclusion

3

Introduction

Memory-Based Reasoning
– Case-Based Reasoning
– Instance-Based Learning

Given a training dataset and a new object, predict the class (target value) of the new object.

Focus on tabular data

4

Introduction

K Nearest Neighbor Search
– Compute the similarity between the new object and each object in the training dataset
– Linear time in the size of the dataset

Similarity: Euclidean distance

Multi-dimensional Index
– Spatial data structures, such as the R-tree
– Designed for numeric data

5

Introduction

Indexing on Categorical Data?
– Impose a linear order on the categories
– Does a correct ordering already exist?
– What is the best ordering?

Store the mapped data in a multi-dimensional data structure as a filtering mechanism

6

Measurement for Ordering

Ordering Problem

Given an undirected weighted complete graph, a simple path through all vertices is an ordering of the vertices. The edge weights are the distances between pairs of vertices. The ordering problem is to find a path, called the ordering path, of maximal value according to a given scoring function.

7

Measurement for Ordering

Relationship Scoring: Reasonable Ordering Score

In an ordering path <v1, v2, …, vn>, the 3-tuple <vi-1, vi, vi+1> is reasonable if and only if

dist(vi-1, vi+1) ≥ dist(vi-1, vi) and

dist(vi-1, vi+1) ≥ dist(vi, vi+1).

The reasonable ordering score of a path is the fraction of its 3-tuples that are reasonable.
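A minimal Python sketch of this score, assuming dist is a nested dict indexed by vertex (the function name and data representation are illustrative choices):

```python
def reasonable_score(path, dist):
    """Fraction of reasonable 3-tuples along an ordering path.

    path: the ordering path <v1, ..., vn> as a list of vertices
    dist: dist[u][v] = distance between categories u and v
    """
    if len(path) < 3:
        return 1.0
    hits = 0
    for i in range(1, len(path) - 1):
        a, b, c = path[i - 1], path[i], path[i + 1]
        # <a, b, c> is reasonable when the outer pair is the most distant
        if dist[a][c] >= dist[a][b] and dist[a][c] >= dist[b][c]:
            hits += 1
    return hits / (len(path) - 2)
```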

8

Measurement for Mapping

Pairwise Difference Scoring
– Normalized distance matrix
– Mapping values of categories
– dist_m(vi, vj) = |mapping(vi) − mapping(vj)|

$$ \mathrm{rmse} = \sqrt{ \frac{\sum_{i<j} \bigl( \mathrm{dist}_m(v_i, v_j) - \mathrm{dist}(v_i, v_j) \bigr)^2}{n(n-1)/2} } $$
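A small sketch of this measure, assuming dist holds the normalized distance matrix and mapping the numeric values of the categories (names are illustrative):

```python
import itertools
import math

def mapping_rmse(categories, dist, mapping):
    """Root mean squared error between the normalized distances and the
    distances induced by the numeric mapping.

    dist: dist[u][v] = normalized distance between categories u and v
    mapping: dict from category to its numeric value in [0, 1]
    """
    n = len(categories)
    total = 0.0
    for u, v in itertools.combinations(categories, 2):
        dist_m = abs(mapping[u] - mapping[v])   # distance after mapping
        total += (dist_m - dist[u][v]) ** 2
    return math.sqrt(total / (n * (n - 1) / 2))
```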

9

Algorithms

Prim-like Ordering
Kruskal-like Ordering
Divisive Ordering
GA Approach Ordering

– A vertex is a category
– The graph represents the distance matrix

10

Prim-like Ordering Algorithm

Prim’s Minimum Spanning Tree
– Initially, choose a least edge (u, v)
– Add the edge to the tree; S = {u, v}
– Choose a least edge connecting a vertex in S and a vertex, w, not in S
– Add the edge to the tree; add w to S
– Repeat until all vertices are in S

11

Prim-like Ordering Algorithm

Prim-like Ordering
– Choose a least edge (u, v)
– Add the edge to the ordering path; S = {u, v}
– Choose a least edge connecting a vertex in S and a vertex, w, not in S
– If the edge creates a cycle on the path, discard the edge and choose again
– Else, add the edge to the ordering path; add w to S
– Repeat until all vertices are in S (see the sketch below)
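A Python sketch of the Prim-like procedure, under one reading of the discard rule: any edge that attaches somewhere other than the two current endpoints would break the simple path, so it is never a candidate. The per-step scan is kept simple rather than fast:

```python
def prim_like_ordering(vertices, dist):
    """Grow a simple path: repeatedly take the cheapest edge joining a
    path endpoint to a vertex not yet on the path."""
    # Start from the globally least edge.
    u, v = min(
        ((a, b) for i, a in enumerate(vertices) for b in vertices[i + 1:]),
        key=lambda e: dist[e[0]][e[1]],
    )
    path, in_path = [u, v], {u, v}
    while len(in_path) < len(vertices):
        # Cheapest edge from either endpoint to an unused vertex; edges
        # touching interior vertices would break the path, so they are
        # skipped (the "discard the edge, and choose again" rule).
        end, w = min(
            ((e, x) for e in (path[0], path[-1])
             for x in vertices if x not in in_path),
            key=lambda p: dist[p[0]][p[1]],
        )
        if end == path[0]:
            path.insert(0, w)
        else:
            path.append(w)
        in_path.add(w)
    return path
```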

12

Kruskal-like Ordering Algorithm

Kruskal’s Minimum Spanning Tree
– Initially, choose a least edge (u, v)
– Add the edge to the tree; S = {u, v}
– Choose a least edge as long as the edge does not create a cycle in the tree
– Add the edge to the tree; add the two vertices to S
– Repeat until all vertices are in S

13

Kruskal-like Ordering Algorithm

Kruskal-like Ordering
– Initially, choose a least edge (u, v) and add it to the ordering path; S = {u, v}
– Choose a least edge as long as the edge does not create a cycle and the degree of each of its endpoints on the path remains <= 2
– Add the edge to the ordering path; add the two vertices to S
– Repeat until all vertices are in S

A heap can be used to speed up choosing the least edge (a sketch follows below).
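A sketch of the Kruskal-like variant, with a union-find structure for the cycle test and a degree cap of 2 so the accepted edges stay a path; a plain sort stands in for the heap mentioned above:

```python
def kruskal_like_ordering(vertices, dist):
    """Scan edges cheapest-first, keeping an edge only if it creates no
    cycle and leaves every vertex with degree <= 2; the n-1 accepted
    edges then form a single path, which is walked and returned."""
    parent = {v: v for v in vertices}
    degree = {v: 0 for v in vertices}
    adj = {v: [] for v in vertices}

    def find(v):                        # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = sorted(
        ((a, b) for i, a in enumerate(vertices) for b in vertices[i + 1:]),
        key=lambda e: dist[e[0]][e[1]],
    )
    accepted = 0
    for u, v in edges:
        if accepted == len(vertices) - 1:
            break
        if degree[u] >= 2 or degree[v] >= 2 or find(u) == find(v):
            continue                    # would branch or close a cycle
        parent[find(u)] = find(v)
        degree[u] += 1
        degree[v] += 1
        adj[u].append(v)
        adj[v].append(u)
        accepted += 1

    # Walk the path starting from an endpoint (a vertex of degree 1).
    start = next(v for v in vertices if degree[v] <= 1)
    path, prev = [start], None
    while len(path) < len(vertices):
        nxt = next(w for w in adj[path[-1]] if w != prev)
        prev = path[-1]
        path.append(nxt)
    return path
```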

14

Divisive Ordering Algorithm

Idea:
– Pick a central vertex, and split the remaining vertices
– Build a binary tree: the vertices are the leaves

Central Vertex: the vertex of C attaining

$$ \min_{v_i \in C} \sum_{v_j \in C,\; j \ne i} \mathrm{dist}(v_i, v_j) $$
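Read this way, the central vertex is the medoid of C; a one-line sketch under that assumption:

```python
def central_vertex(C, dist):
    """The vertex of C with the least total distance to all the others
    (the medoid), used as the pivot for splitting."""
    return min(C, key=lambda vi: sum(dist[vi][vj] for vj in C if vj != vi))
```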

15

Divisive Ordering Algorithm

[Figure: a binary tree rooted at P with subtrees A (children AL, AR) and B (children BL, BR); AR is closer to P than AL is, and BL is closer to P than BR is, so the leaves read in order AL, AR, BL, BR.]

16

Clustering

Splitting a Set of Vertices into Two Groups
– Each group has at least one vertex
– Close (similar) vertices go in the same group; distant vertices go in different groups

Clustering Algorithms
– Two clusters

17

Clustering

Clustering
– Grouping a set of objects into classes of similar objects

Agglomerative Hierarchical Clustering Algorithm
– Start with singleton clusters
– Merge similar clusters

18

Clustering

Clustering Algorithm: Cluster Similarity
– Single link: dist(Ci, Cj) = min(dist(p, q)), p in Ci, q in Cj
– Complete link: dist(Ci, Cj) = max(dist(p, q)), p in Ci, q in Cj
– Average link (adopted in our study, sketched below): dist(Ci, Cj) = avg(dist(p, q)), p in Ci, q in Cj
– Others
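The average-link variant used in the study, as a small sketch (dist is again a nested dict; the function name is illustrative):

```python
def average_link(Ci, Cj, dist):
    """Average-link cluster distance: the mean of all point-to-point
    distances between a member of Ci and a member of Cj."""
    return sum(dist[p][q] for p in Ci for q in Cj) / (len(Ci) * len(Cj))
```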

19

Clustering

Clustering Implementation Issues
– Which pair of clusters to merge: keep the cluster-to-cluster similarity for each pair
– Recursively partition the sets of vertices while building the binary tree: a non-recursive version uses a stack (see the sketch below)
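A sketch of the stack-based, non-recursive tree construction; split is a placeholder for any 2-way clustering routine, e.g. average-link agglomerative clustering stopped at two clusters:

```python
def divisive_tree(vertices, dist, split):
    """Build the binary ordering tree iteratively with an explicit
    stack: pop a vertex set, split it in two, and push both halves
    until every leaf holds a single category.

    split(C, dist) -> (C1, C2) is any 2-way clustering routine.
    """
    root = {"members": list(vertices), "left": None, "right": None}
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node["members"]) <= 1:
            continue                    # a leaf: one category
        C1, C2 = split(node["members"], dist)
        node["left"] = {"members": C1, "left": None, "right": None}
        node["right"] = {"members": C2, "left": None, "right": None}
        stack.extend([node["left"], node["right"]])
    return root
```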

20

GA Ordering Algorithm

Genetic Algorithm for Optimization Problems

Chromosome: a solution
Population: a pool of solutions

Genetic Operations
– Crossover
– Mutation

21

GA Ordering Algorithm

Encoding a Solution
– Binary string
– Ordered list of categories (used in our ordering problem)

Fitness Function
– Reasonable ordering score

Selecting Chromosomes for Crossover
– High fitness value => high selection probability

22

GA Ordering Algorithm

Crossover
– Single point
– Multiple points
– Mask

Single-point crossover of AB | CDE and BD | AEC results in ABAEC and BDCDE => both illegal

23

GA Ordering Algorithm

Repair the Illegal Chromosome ABAEC
– AB*EC => fill D into the * position

Repair the Illegal Chromosome ABABC
– AB**C
– D and E are missing
– Whichever of them is closest to B is filled into the first * position (see the sketch below)
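A sketch of single-point crossover plus this repair rule, assuming dist is the category distance matrix as a nested dict; the tie-break (fill the gene closest to the left neighbor) follows the AB**C example, everything else is an illustrative choice:

```python
def crossover(p1, p2, point):
    """Single-point crossover: AB|CDE x BD|AEC -> ABAEC and BDCDE."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def repair(child, genes, dist):
    """Replace later duplicates with holes, then fill each hole with the
    missing gene closest to its left neighbor, first hole first."""
    child = list(child)
    seen, holes = set(), []
    for i, g in enumerate(child):
        if g in seen:
            child[i] = None             # the '*' positions
            holes.append(i)
        else:
            seen.add(g)
    missing = [g for g in genes if g not in seen]
    for i in holes:
        # A duplicate always has an earlier copy, so i >= 1; earlier
        # holes are filled first, so child[i-1] is a real gene.
        left = child[i - 1]
        best = min(missing, key=lambda g: dist[left][g])
        missing.remove(best)
        child[i] = best
    return child
```

For example, repair(list("ABAEC"), list("ABCDE"), dist) yields ABDEC, matching the first slide example.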

24

Mapping Function

Ordering Path <v1, v2, …, vn>

$$ \mathrm{Mapping}(v_i) = \frac{\sum_{j=1}^{i-1} \mathrm{dist}(v_j, v_{j+1})}{\sum_{j=1}^{n-1} \mathrm{dist}(v_j, v_{j+1})} $$
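A sketch of this mapping, which accumulates distance along the path and normalizes by the total path length, so Mapping(v1) = 0 and Mapping(vn) = 1:

```python
def mapping_values(path, dist):
    """Map each category on the ordering path into [0, 1]: cumulative
    path distance up to v_i, divided by the total path length."""
    total = sum(dist[path[j]][path[j + 1]] for j in range(len(path) - 1))
    values, acc = {}, 0.0
    for i, v in enumerate(path):
        if i > 0:
            acc += dist[path[i - 1]][v]
        values[v] = acc / total
    return values
```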

25

Experiments

Synthetic Data (width/length = 5)

Reasonable Ordering Score
[Chart: reasonable ordering score (percentage, 0.4 to 1.0) vs. number of categories (10 to 500) for the Prim-like, Kruskal-like, and divisive algorithms.]

26

Experiments

Synthetic Data (width/length = 10)

Reasonable Ordering Score
[Chart: reasonable ordering score (percentage, 0.4 to 1.0) vs. number of categories (10 to 500) for the Prim-like, Kruskal-like, and divisive algorithms.]

27

Experiments

Synthetic Data: Reasonable Ordering Score for the Divisive Algorithm
– width/length = 5 => 0.82
– width/length = 10 => 0.9
– No ordering => 1/3

The divisive algorithm is better than the Prim-like algorithm when the number of categories is > 100.

28

Experiments

Synthetic Data (width/length = 5)

Root Mean Squared Error
[Chart: RMSE (0.02 to 0.22) vs. number of categories (10 to 500) for the Prim-like, Kruskal-like, and divisive algorithms.]

29

Experiments

Synthetic Data (width/length = 10)

Root Mean Squared Error
[Chart: RMSE (0 to 0.2) vs. number of categories (10 to 500) for the Prim-like, Kruskal-like, and divisive algorithms.]

30

Experiments

Divisive ordering is the best of the three ordering algorithms.

For the divisive ordering algorithm on > 100 categories, RMSE scores are around 0.07 when width/length = 5 and around 0.05 when width/length = 10.

The Prim-like ordering algorithm scores about 0.12 and 0.1, respectively.

31

Experiments

“Census-Income” dataset from the University of California, Irvine (UCI) KDD Archive
– 33 nominal attributes, 7 continuous attributes
– Sampled 5000 records for the training dataset
– Sampled 2000 records for the approximate KNN search experiment

32

Experiments

Distance Matrix: distance between two categories

V. Ganti, J. Gehrke, and R. Ramakrishnan, “CACTUS: Clustering Categorical Data Using Summaries,” ACM KDD, 1999.

D = {d1, d2, …, dn} is a set of n tuples. D ⊆ D1 × D2 × … × Dk, where Di is a categorical domain for 1 ≤ i ≤ k, and di = <ci1, ci2, …, cik>.

33

Experiments

For categories x, y ∈ Di with x ≠ y:

$$ \mathrm{Pairs}_{x,y,i}(D) = \{\, \langle d_u, d_v \rangle \mid c_{ui} = x,\ c_{vi} = y \,\} $$

$$ \mathrm{Links}_{x,y,i}(D) = \{\, \langle j, d_u, d_v \rangle \mid c_{ui} = x,\ c_{vi} = y,\ c_{uj} = c_{vj},\ 1 \le j \le k,\ j \ne i \,\} $$

$$ \mathrm{dist}(x, y) = \frac{\,|\mathrm{Links}_{x,y,i}(D)| \,/\, |\mathrm{Pairs}_{x,y,i}(D)|\,}{k - 1} $$
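A brute-force sketch of this distance (quadratic over tuple pairs, which is workable for the 5000-record sample). The function name and 0-based attribute indexing are illustrative, and the value returned when no qualifying pair exists is an assumption, since the formula is undefined there:

```python
from itertools import combinations

def category_distance(D, i, x, y, k):
    """Distance between categories x and y of attribute i: over all
    tuple pairs taking values {x, y} on attribute i, count the links
    (agreements on some other attribute j != i), then normalize by the
    number of pairs and by k - 1.

    D: list of tuples, each of length k; attributes indexed 0..k-1.
    """
    pairs = links = 0
    for du, dv in combinations(D, 2):
        if {du[i], dv[i]} != {x, y}:
            continue
        pairs += 1
        links += sum(1 for j in range(k) if j != i and du[j] == dv[j])
    if pairs == 0:
        return 0.0                     # undefined case; arbitrary choice
    return (links / pairs) / (k - 1)
```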

34

Experiments

Approximate KNN – nominal attributes

% of True KNN Retrieved (k * 2)
[Chart: fraction of true KNN retrieved (0.70 to 0.80) vs. K = 1 to 10 for the Kruskal-like, Prim-like, and divisive algorithms.]

35

Experiments

Approximate KNN – nominal attributes

% of True KNN Retrieved (k * 3)
[Chart: fraction of true KNN retrieved (0.77 to 0.87) vs. K = 1 to 10 for the Kruskal-like, Prim-like, and divisive algorithms.]

36

Experiments

Approximate KNN – nominal attributes

% of True KNN Retrieved (k * 6)
[Chart: fraction of true KNN retrieved (0.84 to 0.94) vs. K = 1 to 10 for the Kruskal-like, Prim-like, and divisive algorithms.]

37

Experiments

Approximate KNN – all attributes

% of True KNN Retrieved (k * 2)
[Chart: fraction of true KNN retrieved (0.70 to 0.80) vs. K = 1 to 10 for the Kruskal-like, Prim-like, and divisive algorithms.]

38

Experiments

Approximate KNN – all attributes

% of True KNN Retrieved (k * 3)
[Chart: fraction of true KNN retrieved (0.76 to 0.86) vs. K = 1 to 10 for the Kruskal-like, Prim-like, and divisive algorithms.]

39

Experiments

Approximate KNN – all attributes

% of True KNN Retrieved (k * 6)
[Chart: fraction of true KNN retrieved (0.85 to 0.95) vs. K = 1 to 10 for the Kruskal-like, Prim-like, and divisive algorithms.]

40

Conclusion

Developed Ordering Algorithms
– Prim-like
– Kruskal-like
– Divisive
– GA-based

Devised Measurements
– Reasonable ordering score
– Root mean squared error

41

Conclusion

What next?
– New categories require a new mapping function
– A new index structure?
– Training the mapping function for a given ordering path

42

Thank you.