1
Converting Categories to Numbers for Approximate Nearest Neighbor Search
Huang-Cheng Kuo, Dept. of Computer Science and Information Engineering, National Chiayi University, 2004/10/20
2
Outline
Introduction
Motivation
Measurement
Algorithms
Experiments
Conclusion
3
Introduction
Memory-Based Reasoning
– Case-Based Reasoning
– Instance-Based Learning
Given a training dataset and a new object, predict the class (target value) of the new object.
Focus on tabular data
4
Introduction
K Nearest Neighbor Search
– Compute the similarity between the new object and each object in the training dataset
– Linear time in the size of the dataset
Similarity: Euclidean distance
Multi-dimensional Index
– Spatial data structure, such as the R-tree
– Numeric data
5
Introduction
Indexing on Categorical Data?
– Impose a linear order on the categories
– Does a correct ordering already exist?
– What is the best ordering?
Store the mapped data in a multi-dimensional data structure as a filtering mechanism
6
Measurement for Ordering
Ordering Problem
Given an undirected weighted complete graph, a simple path is an ordering of the vertices. The edge weights are the distances between pairs of vertices. The ordering problem is to find a path, called the ordering path, that is optimal according to a given scoring function.
7
Measurement for Ordering
Relationship Scoring: Reasonable Ordering Score
In an ordering path <v1, v2, …, vn>, the 3-tuple
<vi-1, vi, vi+1> is reasonable if and only if
dist(vi-1, vi+1) ≧ dist(vi-1, vi) and
dist(vi-1, vi+1) ≧ dist(vi, vi+1).
The score of a path is the fraction of its 3-tuples that are reasonable.
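A minimal Python sketch of this score, assuming (as the percentage plots later suggest) that a path's score is the fraction of its interior 3-tuples that are reasonable; `dist` is a category-by-category distance matrix and the function name is ours:

```python
def reasonable_score(path, dist):
    """Fraction of interior 3-tuples <v[i-1], v[i], v[i+1]> that are
    reasonable: the two outer vertices are at least as far apart as
    either of them is from the middle vertex."""
    triples = [(path[i - 1], path[i], path[i + 1])
               for i in range(1, len(path) - 1)]
    good = sum(
        1 for a, b, c in triples
        if dist[a][c] >= dist[a][b] and dist[a][c] >= dist[b][c]
    )
    return good / len(triples)
```

For categories lying on a line, the natural order scores 1 and a shuffled order scores lower.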
8
Measurement for Mapping
Pairwise Difference Scoring
– Normalized distance matrix
– Mapping values of categories
– dist_m(vi, vj) = |mapping(vi) − mapping(vj)|

$rmse = \sqrt{\dfrac{\sum_{i<j} \big(dist(v_i, v_j) - dist_m(v_i, v_j)\big)^2}{n(n-1)/2}}$
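As a sketch (names assumed), the pairwise-difference score can be computed directly from a normalized distance matrix and the mapped values, averaging over all n(n−1)/2 unordered pairs:

```python
import math

def mapping_rmse(dist, mapping):
    """Root mean squared error between the normalized category
    distances and the absolute differences of the mapped values."""
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    sq = sum((dist[i][j] - abs(mapping[i] - mapping[j])) ** 2
             for i, j in pairs)
    return math.sqrt(sq / len(pairs))
```

A perfect mapping (mapped differences equal the distances) gives an RMSE of 0.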
9
Algorithms
Prim-like Ordering
Kruskal-like Ordering
Divisive Ordering
GA-Based Ordering
– A vertex is a category
– A graph represents a distance matrix
10
Prim-like Ordering Algorithm
Prim’s Minimum Spanning Tree
– Initially, choose a least edge (u, v)
– Add the edge to the tree; S = {u, v}
– Choose a least edge connecting a vertex in S and a vertex, w, not in S
– Add the edge to the tree; add w to S
– Repeat until all vertices are in S
11
Prim-like Ordering Algorithm
Prim-like Ordering
– Choose a least edge (u, v)
– Add the edge to the ordering path; S = {u, v}
– Choose a least edge connecting a vertex in S and a vertex, w, not in S
– If the edge would create a cycle on the path, discard the edge and choose again
– Else, add the edge to the ordering path; add w to S
– Repeat until all vertices are in S
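One way to read these rules in code (an assumption on our part: the path is only ever extended at its two endpoints, which implicitly discards every edge that would create a cycle or a branch):

```python
def prim_like_ordering(dist):
    """Grow an ordering path greedily, Prim style: start from the
    overall least edge, then repeatedly attach the unvisited vertex
    that is cheapest to reach from either endpoint of the path."""
    n = len(dist)
    u, v = min(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda e: dist[e[0]][e[1]],
    )
    path, visited = [u, v], {u, v}
    while len(path) < n:
        # cheapest extension at either endpoint of the current path
        end, w = min(
            ((end, w) for end in (path[0], path[-1])
             for w in range(n) if w not in visited),
            key=lambda e: dist[e[0]][e[1]],
        )
        if end == path[0]:
            path.insert(0, w)
        else:
            path.append(w)
        visited.add(w)
    return path
```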
12
Kruskal-like Ordering Algorithm
Kruskal’s Minimum Spanning Tree
– Initially, choose a least edge (u, v)
– Add the edge to the tree; S = {u, v}
– Choose a least edge as long as the edge does not create a cycle in the tree
– Add the edge to the tree; add the two vertices to S
– Repeat until all vertices are in S
13
Kruskal-like Ordering Algorithm
Kruskal-like Ordering
– Initially, choose a least edge (u, v) and add it to the ordering path; S = {u, v}
– Choose a least edge as long as the edge does not create a cycle, and the degree of both its vertices on the path stays ≦ 2
– Add the edge to the ordering path; add the two vertices to S
– Repeat until all vertices are in S
A heap can be used to speed up choosing the least edge
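A sketch of this procedure (function and variable names are ours; a pre-sorted edge list stands in for the heap mentioned above, and union-find detects cycles):

```python
def kruskal_like_ordering(dist):
    """Scan edges in increasing weight, accepting an edge only if both
    endpoints still have degree <= 1 (so the path never branches) and
    the edge does not close a cycle.  The accepted edges form a path."""
    n = len(dist)
    edges = sorted(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda e: dist[e[0]][e[1]])
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = [0] * n
    adj = {i: [] for i in range(n)}
    accepted = 0
    for u, v in edges:
        if degree[u] >= 2 or degree[v] >= 2:
            continue  # would branch the path
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # would close a cycle
        parent[ru] = rv
        adj[u].append(v)
        adj[v].append(u)
        degree[u] += 1
        degree[v] += 1
        accepted += 1
        if accepted == n - 1:
            break
    # walk the finished path from one degree-1 endpoint
    start = next(i for i in range(n) if degree[i] == 1)
    path, prev = [start], None
    while len(path) < n:
        nxt = next(w for w in adj[path[-1]] if w != prev)
        prev = path[-1]
        path.append(nxt)
    return path
```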
14
Divisive Ordering Algorithm
Idea:
– Pick a central vertex, and split the remaining vertices
– Build a binary tree: the vertices are the leaves
Central Vertex:

$\arg\min_{v_i \in C} \sum_{v_j \in C,\ j \ne i} dist(v_i, v_j)$
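Assuming the central vertex is the vertex of C that minimizes its total distance to the other vertices of C (a medoid; the reconstruction of the formula is ours), a one-line sketch suffices:

```python
def central_vertex(C, dist):
    """Medoid of the vertex set C: the vertex whose summed distance
    to every other vertex in C is smallest."""
    return min(C, key=lambda vi: sum(dist[vi][vj] for vj in C if vj != vi))
```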
15
Divisive Ordering Algorithm
[Figure: binary tree with root P and subtrees A and B; A has leaves AL and AR, B has leaves BL and BR. AR is closer to P than AL is; BL is closer to P than BR is.]
16
Clustering
Splitting a Set of Vertices into Two Groups
– Each group has at least one vertex
– Close (similar) vertices in the same group
– Distant vertices in different groups
Clustering Algorithms
– Two clusters
17
Clustering
Clustering
– Grouping a set of objects into classes of similar objects
Agglomerative Hierarchical Clustering Algorithm
– Start with singleton clusters
– Merge similar clusters
18
Clustering
Clustering Algorithm: Cluster Similarity
– Single link: dist(Ci, Cj) = min(dist(p, q)), p in Ci, q in Cj
– Complete link: dist(Ci, Cj) = max(dist(p, q)), p in Ci, q in Cj
– Average link (adopted in our study): dist(Ci, Cj) = avg(dist(p, q)), p in Ci, q in Cj
– Others
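The three linkage rules above fit in one hypothetical helper (names are ours; `dist` is a vertex-by-vertex distance matrix):

```python
def cluster_distance(Ci, Cj, dist, link="average"):
    """Cluster-to-cluster distance for agglomerative clustering.
    Single link takes the closest pair, complete link the farthest,
    and average link (used in the study) the mean over all pairs."""
    vals = [dist[p][q] for p in Ci for q in Cj]
    if link == "single":
        return min(vals)
    if link == "complete":
        return max(vals)
    return sum(vals) / len(vals)
```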
19
Clustering
Clustering Implementation Issues
– Which pair of clusters to merge: keep the cluster-to-cluster similarity for each pair
– Recursively partition sets of vertices while building the binary tree: a non-recursive version uses a stack
20
GA Ordering Algorithm
Genetic Algorithm for Optimization Problems
Chromosome: a solution
Population: a pool of solutions
Genetic Operations
– Crossover
– Mutation
21
GA Ordering Algorithm
Encoding a Solution
– Binary string
– Ordered list of categories – used in our ordering problem
Fitness Function
– Reasonable ordering score
Selecting Chromosomes for Crossover
– High fitness value => high probability
22
GA Ordering Algorithm
Crossover
– Single point
– Multiple points
– Mask
Crossover of AB | CDE and BD | AEC
results in ABAEC and BDCDE => illegal
23
GA Ordering Algorithm
Repair Illegal Chromosome ABAEC
– AB*EC => fill D into the * position
Repair Illegal Chromosome ABABC
– AB**C
– D and E are missing
– Whichever is closer to B is filled into the first * position
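A sketch of single-point crossover plus this repair rule (names are ours; each blank is filled with the missing category closest to the blank's left neighbor, generalizing the "closest to B" rule above):

```python
def crossover_and_repair(p1, p2, point, dist):
    """Single-point crossover of two ordered-list chromosomes, then
    repair: blank every duplicated category (keeping its first
    occurrence) and fill each blank with the missing category closest
    to the blank's left neighbor."""
    child = p1[:point] + p2[point:]
    seen, blanks = set(), []
    for i, c in enumerate(child):
        if c in seen:
            child[i] = None          # second occurrence becomes a blank
            blanks.append(i)
        else:
            seen.add(c)
    missing = [c for c in p1 if c not in seen]
    for i in blanks:                 # ascending, so the left neighbor is filled
        left = child[i - 1]
        best = min(missing, key=lambda c: dist[left][c])
        child[i] = best
        missing.remove(best)
    return child
```

With categories A–E placed evenly on a line, repairing ABAEC yields ABDEC, matching the slide's example.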
24
Mapping Function
Ordering Path <v1, v2, …, vn>
Mapping(vi) =
1
01
1
01
),(
),(
n
jjj
i
jjj
vvdist
vvdist
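A direct transcription of the mapping (our reading of the formula: the numerator is the distance accumulated along the path up to v_i and the denominator is the total path length, so v_1 maps to 0 and v_n to 1):

```python
def mapping_values(path, dist):
    """Map each category on the ordering path into [0, 1] by its
    cumulative along-path distance, normalized by total path length."""
    total = sum(dist[path[j]][path[j + 1]] for j in range(len(path) - 1))
    values, acc = {path[0]: 0.0}, 0.0
    for j in range(len(path) - 1):
        acc += dist[path[j]][path[j + 1]]
        values[path[j + 1]] = acc / total
    return values
```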
25
Experiments
Synthetic Data (width/length = 5)
[Chart: reasonable ordering score (percentage) vs. number of categories (10–500) for the Prim-like, Kruskal-like, and divisive algorithms; values roughly 0.4–1.0.]
26
Experiments
Synthetic Data (width/length = 10)
[Chart: reasonable ordering score (percentage) vs. number of categories (10–500) for the Prim-like, Kruskal-like, and divisive algorithms; values roughly 0.4–1.0.]
27
Experiments
Synthetic Data: Reasonable Ordering Score for the Divisive Algorithm
– width/length = 5 => 0.82
– width/length = 10 => 0.9
– No ordering => 1/3
The divisive algorithm is better than the Prim-like algorithm when the number of categories > 100
28
Experiments
Synthetic Data (width/length = 5)
[Chart: root mean squared error vs. number of categories (10–500) for the Prim-like, Kruskal-like, and divisive algorithms; values roughly 0.02–0.22.]
29
Experiments
Synthetic Data (width/length = 10)
[Chart: root mean squared error vs. number of categories (10–500) for the Prim-like, Kruskal-like, and divisive algorithms; values roughly 0–0.2.]
30
Experiments
Divisive ordering is the best of the three ordering algorithms.
For the divisive ordering algorithm on > 100 categories, RMSE scores are around 0.07 when width/length = 5, and around 0.05 when width/length = 10.
Prim-like ordering algorithm: 0.12 and 0.1, respectively.
31
Experiments
“Census-Income” dataset from the University of California, Irvine (UCI) KDD Archive
33 nominal attributes, 7 continuous attributes
Sampled 5,000 records for the training dataset; sampled 2,000 records for the approximate KNN search experiment.
32
Experiments
Distance Matrix: distance between two categories
V. Ganti, J. Gehrke, and R. Ramakrishnan, “CACTUS - Clustering Categorical Data Using Summaries,” ACM KDD, 1999.
D = {d1, d2, …, dn} is a set of n tuples.
D ⊆ D1 × D2 × … × Dk, where Di is a categorical domain, for 1 ≦ i ≦ k.
di = <ci1, ci2, …, cik>.
33
Experiments
$Pairs_{x,y,i}(D) = \{ (d_u, d_v) \mid d_u, d_v \in D,\ c_{ui} = x,\ c_{vi} = y \},\quad x \ne y$

$Links_{x,y,i}(D) = \{ (d_u, d_v, j) \mid c_{ui} = x,\ c_{vi} = y,\ 1 \le j \le k,\ j \ne i,\ c_{uj} = c_{vj} \}$

$dist_i(x, y) = \dfrac{|Links_{x,y,i}(D)| \,/\, |Pairs_{x,y,i}(D)|}{k - 1}$
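Taking the formula as written (a pair of tuples contributes one link per other attribute on which the two tuples agree), a hypothetical transcription; `D` is a list of k-tuples and `i` the attribute under consideration:

```python
def category_distance(D, i, x, y, k):
    """Per-attribute category distance following the slide's
    CACTUS-based formula: count agreeing attribute positions j != i
    over all tuple pairs taking values x and y on attribute i,
    normalized by the pair count and by (k - 1)."""
    pairs = [(du, dv) for du in D for dv in D
             if du[i] == x and dv[i] == y]
    if not pairs:
        return 0.0
    links = sum(1 for du, dv in pairs
                for j in range(k) if j != i and du[j] == dv[j])
    return (links / len(pairs)) / (k - 1)
```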
iyx
34
Experiments
Approximate KNN – nominal attributes
[Chart: % of true KNN retrieved with a candidate pool of k * 2, for K = 1–10; series: Kruskal-like, Prim-like, divisive; values roughly 0.70–0.80.]
35
Experiments
Approximate KNN – nominal attributes
[Chart: % of true KNN retrieved with a candidate pool of k * 3, for K = 1–10; series: Kruskal-like, Prim-like, divisive; values roughly 0.77–0.87.]
36
Experiments
Approximate KNN – nominal attributes
[Chart: % of true KNN retrieved with a candidate pool of k * 6, for K = 1–10; series: Kruskal-like, Prim-like, divisive; values roughly 0.84–0.94.]
37
Experiments
Approximate KNN – all attributes
[Chart: % of true KNN retrieved with a candidate pool of k * 2, for K = 1–10; series: Kruskal-like, Prim-like, divisive; values roughly 0.70–0.80.]
38
Experiments
Approximate KNN – all attributes
[Chart: % of true KNN retrieved with a candidate pool of k * 3, for K = 1–10; series: Kruskal-like, Prim-like, divisive; values roughly 0.76–0.86.]
39
Experiments
Approximate KNN – all attributes
[Chart: % of true KNN retrieved with a candidate pool of k * 6, for K = 1–10; series: Kruskal-like, Prim-like, divisive; values roughly 0.85–0.95.]
40
Conclusion
Developed Ordering Algorithms
– Prim-like
– Kruskal-like
– Divisive
– GA-based
Devised Measurements
– Reasonable ordering score
– Root mean squared error
41
Conclusion
What next?
– New categories, new mapping function
– New index structure?
– Training a mapping function for a given ordering path
42
Thank you.