1
Converting Categories to Numbers for Approximate Nearest Neighbor Search
Huang-Cheng Kuo, Dept. of Computer Science and Information Engineering, National Chiayi University, 2004/10/20
2
Outline
Introduction
Motivation
Measurement
Algorithms
Experiments
Conclusion
3
Introduction
Memory-Based Reasoning
– Case-Based Reasoning
– Instance-Based Learning
Given a training dataset and a new object, predict the class (target value) of the new object.
Focus on tabular data
4
Introduction
K Nearest Neighbor Search
– Compute the similarity between the new object and each object in the training dataset
– Linear time in the size of the dataset
Similarity: Euclidean distance
Multi-dimensional Index
– Spatial data structure, such as the R-tree
– Numeric data
5
Introduction
Indexing on Categorical Data?
– Impose a linear order on the categories
– Does a correct ordering already exist?
– What is the best ordering?
Store the mapped data in a multi-dimensional data structure as a filtering mechanism
6
Measurement for Ordering
Ordering Problem
Given an undirected weighted complete graph, a simple path is an ordering of the vertices. The edge weights are the distances between pairs of vertices. The ordering problem is to find a path, called the ordering path, that is optimal according to a given scoring function.
7
Measurement for Ordering
Relationship Scoring: Reasonable Ordering Score
In an ordering path <v1, v2, …, vn>, the 3-tuple
<vi-1, vi, vi+1> is reasonable if and only if
dist(vi-1, vi+1) ≧ dist(vi-1, vi) and
dist(vi-1, vi+1) ≧ dist(vi, vi+1).
The score of a path is the fraction of its 3-tuples that are reasonable.
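A minimal Python sketch of this score, assuming (as the percentage plots later suggest) that a path's score is the fraction of its interior 3-tuples that are reasonable; `dist` is a category-by-category distance matrix and the function name is ours:

```python
def reasonable_score(path, dist):
    """Fraction of interior 3-tuples <v[i-1], v[i], v[i+1]> that are
    reasonable: the two outer vertices are at least as far apart as
    either of them is from the middle vertex."""
    triples = [(path[i - 1], path[i], path[i + 1])
               for i in range(1, len(path) - 1)]
    good = sum(
        1 for a, b, c in triples
        if dist[a][c] >= dist[a][b] and dist[a][c] >= dist[b][c]
    )
    return good / len(triples)
```

For categories lying on a line, the natural order scores 1 and a shuffled order scores lower.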
8
Measurement for Mapping
Pairwise Difference Scoring
– Normalized distance matrix
– Mapping values of categories
– dist_m(vi, vj) = |mapping(vi) − mapping(vj)|

$rmse = \sqrt{\dfrac{\sum_{i<j} \big(dist(v_i, v_j) - dist_m(v_i, v_j)\big)^2}{n(n-1)/2}}$
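As a sketch (names assumed), the pairwise-difference score can be computed directly from a normalized distance matrix and the mapped values, averaging over all n(n−1)/2 unordered pairs:

```python
import math

def mapping_rmse(dist, mapping):
    """Root mean squared error between the normalized category
    distances and the absolute differences of the mapped values."""
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    sq = sum((dist[i][j] - abs(mapping[i] - mapping[j])) ** 2
             for i, j in pairs)
    return math.sqrt(sq / len(pairs))
```

A perfect mapping (mapped differences equal the distances) gives an RMSE of 0.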
9
Algorithms
Prim-like Ordering
Kruskal-like Ordering
Divisive Ordering
GA-Based Ordering
– A vertex is a category
– A graph represents a distance matrix
10
Prim-like Ordering Algorithm
Prim’s Minimum Spanning Tree
– Initially, choose a least edge (u, v)
– Add the edge to the tree; S = {u, v}
– Choose a least edge connecting a vertex in S and a vertex, w, not in S
– Add the edge to the tree; add w to S
– Repeat until all vertices are in S
11
Prim-like Ordering Algorithm
Prim-like Ordering
– Choose a least edge (u, v)
– Add the edge to the ordering path; S = {u, v}
– Choose a least edge connecting a vertex in S and a vertex, w, not in S
– If the edge would create a cycle on the path, discard the edge and choose again
– Else, add the edge to the ordering path; add w to S
– Repeat until all vertices are in S
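One way to read these rules in code (an assumption on our part: the path is only ever extended at its two endpoints, which implicitly discards every edge that would create a cycle or a branch):

```python
def prim_like_ordering(dist):
    """Grow an ordering path greedily, Prim style: start from the
    overall least edge, then repeatedly attach the unvisited vertex
    that is cheapest to reach from either endpoint of the path."""
    n = len(dist)
    u, v = min(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda e: dist[e[0]][e[1]],
    )
    path, visited = [u, v], {u, v}
    while len(path) < n:
        # cheapest extension at either endpoint of the current path
        end, w = min(
            ((end, w) for end in (path[0], path[-1])
             for w in range(n) if w not in visited),
            key=lambda e: dist[e[0]][e[1]],
        )
        if end == path[0]:
            path.insert(0, w)
        else:
            path.append(w)
        visited.add(w)
    return path
```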
12
Kruskal-like Ordering Algorithm
Kruskal’s Minimum Spanning Tree
– Initially, choose a least edge (u, v)
– Add the edge to the tree; S = {u, v}
– Choose a least edge as long as the edge does not create a cycle in the tree
– Add the edge to the tree; add the two vertices to S
– Repeat until all vertices are in S
13
Kruskal-like Ordering Algorithm
Kruskal-like Ordering
– Initially, choose a least edge (u, v) and add it to the ordering path; S = {u, v}
– Choose a least edge as long as the edge does not create a cycle, and the degree of both its vertices on the path stays ≦ 2
– Add the edge to the ordering path; add the two vertices to S
– Repeat until all vertices are in S
A heap can be used to speed up choosing the least edge
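A sketch of this procedure (function and variable names are ours; a pre-sorted edge list stands in for the heap mentioned above, and union-find detects cycles):

```python
def kruskal_like_ordering(dist):
    """Scan edges in increasing weight, accepting an edge only if both
    endpoints still have degree <= 1 (so the path never branches) and
    the edge does not close a cycle.  The accepted edges form a path."""
    n = len(dist)
    edges = sorted(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda e: dist[e[0]][e[1]])
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = [0] * n
    adj = {i: [] for i in range(n)}
    accepted = 0
    for u, v in edges:
        if degree[u] >= 2 or degree[v] >= 2:
            continue  # would branch the path
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # would close a cycle
        parent[ru] = rv
        adj[u].append(v)
        adj[v].append(u)
        degree[u] += 1
        degree[v] += 1
        accepted += 1
        if accepted == n - 1:
            break
    # walk the finished path from one degree-1 endpoint
    start = next(i for i in range(n) if degree[i] == 1)
    path, prev = [start], None
    while len(path) < n:
        nxt = next(w for w in adj[path[-1]] if w != prev)
        prev = path[-1]
        path.append(nxt)
    return path
```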
14
Divisive Ordering Algorithm
Idea:
– Pick a central vertex, and split the remaining vertices
– Build a binary tree: the vertices are the leaves
Central Vertex:

$\arg\min_{v_i \in C} \sum_{v_j \in C,\ j \ne i} dist(v_i, v_j)$
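Assuming the central vertex is the vertex of C that minimizes its total distance to the other vertices of C (a medoid; the reconstruction of the formula is ours), a one-line sketch suffices:

```python
def central_vertex(C, dist):
    """Medoid of the vertex set C: the vertex whose summed distance
    to every other vertex in C is smallest."""
    return min(C, key=lambda vi: sum(dist[vi][vj] for vj in C if vj != vi))
```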
15
Divisive Ordering Algorithm
[Figure: binary tree with root P and subtrees A and B; A has leaves AL and AR, B has leaves BL and BR. AR is closer to P than AL is; BL is closer to P than BR is.]
16
Clustering
Splitting a Set of Vertices into Two Groups
– Each group has at least one vertex
– Close (similar) vertices in the same group
– Distant vertices in different groups
Clustering Algorithms
– Two clusters
17
Clustering
Clustering
– Grouping a set of objects into classes of similar objects
Agglomerative Hierarchical Clustering Algorithm
– Start with singleton clusters
– Merge similar clusters
18
Clustering
Clustering Algorithm: Cluster Similarity
– Single link: dist(Ci, Cj) = min(dist(p, q)), p in Ci, q in Cj
– Complete link: dist(Ci, Cj) = max(dist(p, q)), p in Ci, q in Cj
– Average link (adopted in our study): dist(Ci, Cj) = avg(dist(p, q)), p in Ci, q in Cj
– Others
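The three linkage rules above fit in one hypothetical helper (names are ours; `dist` is a vertex-by-vertex distance matrix):

```python
def cluster_distance(Ci, Cj, dist, link="average"):
    """Cluster-to-cluster distance for agglomerative clustering.
    Single link takes the closest pair, complete link the farthest,
    and average link (used in the study) the mean over all pairs."""
    vals = [dist[p][q] for p in Ci for q in Cj]
    if link == "single":
        return min(vals)
    if link == "complete":
        return max(vals)
    return sum(vals) / len(vals)
```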
19
Clustering
Clustering Implementation Issues
– Which pair of clusters to merge: keep the cluster-to-cluster similarity for each pair
– Recursively partition sets of vertices while building the binary tree: a non-recursive version uses a stack
20
GA Ordering Algorithm
Genetic Algorithm for Optimization Problems
Chromosome: a solution
Population: a pool of solutions
Genetic Operations
– Crossover
– Mutation
21
GA Ordering Algorithm
Encoding a Solution
– Binary string
– Ordered list of categories – used in our ordering problem
Fitness Function
– Reasonable ordering score
Selecting Chromosomes for Crossover
– High fitness value => high probability
22
GA Ordering Algorithm
Crossover
– Single point
– Multiple points
– Mask
Crossover of AB | CDE and BD | AEC
results in ABAEC and BDCDE => illegal
23
GA Ordering Algorithm
Repair Illegal Chromosome ABAEC
– AB*EC => fill D into the * position
Repair Illegal Chromosome ABABC
– AB**C
– D and E are missing
– Whichever is closer to B is filled into the first * position
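A sketch of single-point crossover plus this repair rule (names are ours; each blank is filled with the missing category closest to the blank's left neighbor, generalizing the "closest to B" rule above):

```python
def crossover_and_repair(p1, p2, point, dist):
    """Single-point crossover of two ordered-list chromosomes, then
    repair: blank every duplicated category (keeping its first
    occurrence) and fill each blank with the missing category closest
    to the blank's left neighbor."""
    child = p1[:point] + p2[point:]
    seen, blanks = set(), []
    for i, c in enumerate(child):
        if c in seen:
            child[i] = None          # second occurrence becomes a blank
            blanks.append(i)
        else:
            seen.add(c)
    missing = [c for c in p1 if c not in seen]
    for i in blanks:                 # ascending, so the left neighbor is filled
        left = child[i - 1]
        best = min(missing, key=lambda c: dist[left][c])
        child[i] = best
        missing.remove(best)
    return child
```

With categories A–E placed evenly on a line, repairing ABAEC yields ABDEC, matching the slide's example.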
24
Mapping Function
Ordering Path <v1, v2, …, vn>
Mapping(vi) =
1
01
1
01
),(
),(
n
jjj
i
jjj
vvdist
vvdist
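A direct transcription of the mapping (our reading of the formula: the numerator is the distance accumulated along the path up to v_i and the denominator is the total path length, so v_1 maps to 0 and v_n to 1):

```python
def mapping_values(path, dist):
    """Map each category on the ordering path into [0, 1] by its
    cumulative along-path distance, normalized by total path length."""
    total = sum(dist[path[j]][path[j + 1]] for j in range(len(path) - 1))
    values, acc = {path[0]: 0.0}, 0.0
    for j in range(len(path) - 1):
        acc += dist[path[j]][path[j + 1]]
        values[path[j + 1]] = acc / total
    return values
```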
25
Experiments
Synthetic Data (width/length = 5)
[Chart: reasonable ordering score (percentage) vs. number of categories (10–500) for the Prim-like, Kruskal-like, and divisive algorithms; values roughly 0.4–1.0.]
26
Experiments
Synthetic Data (width/length = 10)
[Chart: reasonable ordering score (percentage) vs. number of categories (10–500) for the Prim-like, Kruskal-like, and divisive algorithms; values roughly 0.4–1.0.]
27
Experiments
Synthetic Data: Reasonable Ordering Score for the Divisive Algorithm
– width/length = 5 => 0.82
– width/length = 10 => 0.9
– No ordering => 1/3
The divisive algorithm is better than the Prim-like algorithm when the number of categories > 100
28
Experiments
Synthetic Data (width/length = 5)
[Chart: root mean squared error vs. number of categories (10–500) for the Prim-like, Kruskal-like, and divisive algorithms; values roughly 0.02–0.22.]
29
Experiments
Synthetic Data (width/length = 10)
[Chart: root mean squared error vs. number of categories (10–500) for the Prim-like, Kruskal-like, and divisive algorithms; values roughly 0–0.2.]
30
Experiments
Divisive ordering is the best of the three ordering algorithms.
For the divisive ordering algorithm on > 100 categories, RMSE scores are around 0.07 when width/length = 5, and around 0.05 when width/length = 10.
Prim-like ordering algorithm: 0.12 and 0.1, respectively.
31
Experiments
“Census-Income” dataset from the University of California, Irvine (UCI) KDD Archive
33 nominal attributes, 7 continuous attributes
Sampled 5,000 records for the training dataset; sampled 2,000 records for the approximate KNN search experiment.
32
Experiments
Distance Matrix: distance between two categories
V. Ganti, J. Gehrke, and R. Ramakrishnan, “CACTUS - Clustering Categorical Data Using Summaries,” ACM KDD, 1999.
D = {d1, d2, …, dn} is a set of n tuples.
D ⊆ D1 × D2 × … × Dk, where Di is a categorical domain, for 1 ≦ i ≦ k.
di = <ci1, ci2, …, cik>.
33
Experiments
$Pairs_{x,y,i}(D) = \{ (d_u, d_v) \mid d_u, d_v \in D,\ c_{ui} = x,\ c_{vi} = y \},\quad x \ne y$

$Links_{x,y,i}(D) = \{ (d_u, d_v, j) \mid c_{ui} = x,\ c_{vi} = y,\ 1 \le j \le k,\ j \ne i,\ c_{uj} = c_{vj} \}$

$dist_i(x, y) = \dfrac{|Links_{x,y,i}(D)| \,/\, |Pairs_{x,y,i}(D)|}{k - 1}$
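Taking the formula as written (a pair of tuples contributes one link per other attribute on which the two tuples agree), a hypothetical transcription; `D` is a list of k-tuples and `i` the attribute under consideration:

```python
def category_distance(D, i, x, y, k):
    """Per-attribute category distance following the slide's
    CACTUS-based formula: count agreeing attribute positions j != i
    over all tuple pairs taking values x and y on attribute i,
    normalized by the pair count and by (k - 1)."""
    pairs = [(du, dv) for du in D for dv in D
             if du[i] == x and dv[i] == y]
    if not pairs:
        return 0.0
    links = sum(1 for du, dv in pairs
                for j in range(k) if j != i and du[j] == dv[j])
    return (links / len(pairs)) / (k - 1)
```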
iyx
34
Experiments
Approximate KNN – nominal attributes
[Chart: % of true KNN retrieved with a candidate pool of k * 2, for K = 1–10; series: Kruskal-like, Prim-like, divisive; values roughly 0.70–0.80.]
35
Experiments
Approximate KNN – nominal attributes
[Chart: % of true KNN retrieved with a candidate pool of k * 3, for K = 1–10; series: Kruskal-like, Prim-like, divisive; values roughly 0.77–0.87.]
36
Experiments
Approximate KNN – nominal attributes
[Chart: % of true KNN retrieved with a candidate pool of k * 6, for K = 1–10; series: Kruskal-like, Prim-like, divisive; values roughly 0.84–0.94.]
37
Experiments
Approximate KNN – all attributes
[Chart: % of true KNN retrieved with a candidate pool of k * 2, for K = 1–10; series: Kruskal-like, Prim-like, divisive; values roughly 0.70–0.80.]
38
Experiments
Approximate KNN – all attributes
[Chart: % of true KNN retrieved with a candidate pool of k * 3, for K = 1–10; series: Kruskal-like, Prim-like, divisive; values roughly 0.76–0.86.]
39
Experiments
Approximate KNN – all attributes
[Chart: % of true KNN retrieved with a candidate pool of k * 6, for K = 1–10; series: Kruskal-like, Prim-like, divisive; values roughly 0.85–0.95.]
40
Conclusion
Developed Ordering Algorithms
– Prim-like
– Kruskal-like
– Divisive
– GA-based
Devised Measurements
– Reasonable ordering score
– Root mean squared error
41
Conclusion
What next?
– New categories, new mapping function
– New index structure?
– Training a mapping function for a given ordering path
42
Thank you.