Branch-and-bound nearest neighbor searching over unbalanced trie-structured overlays


Master's thesis presentation by Michail Argyriou at the Technical University of Crete on branch-and-bound nearest neighbor searching over unbalanced trie-structured overlays.

Branch-and-bound nearest neighbor searching over unbalanced trie-structured overlays

Master’s Thesis Presentation

Technical University of Crete

4.2.2013

Author: Michail Argyriou

Supervisor: Ass’t Prof. Vasilis Samoladas

P2P Evolution

[Figure: timeline, 1998 to 2002, of three P2P architecture classes (centralized, semi-distributed, fully distributed), with DHTs marked at 2001.]

2

Distributed Hash Table (DHT)

3

DHT Frameworks Evolution

2003: PGrid
• Rectangular queries support
• Peers only on leaves
• High-dimensional queries support with space-filling curves

2006: VBI
• Height-balanced search tree limitation

2008: GRaSP
• No height-balanced search tree limitation
• Abstract types of data and queries
• Data: point, rectangular
• Queries: point, 3-sided, n-d rectangular

4

Nearest neighbor search

5

Given a distributed data set, how can we find the k data items most similar to a query?

“k-Nearest Neighbor Search”

6
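As a point of reference for the question above, here is a minimal centralized sketch of what a k-NN query returns. It is not the distributed algorithm of the presentation; the function name `knn_brute_force` and the choice of Euclidean distance as the similarity measure are assumptions made for illustration.

```python
import heapq
import math

def knn_brute_force(points, query, k):
    """Return the k points closest to `query`, nearest first (Euclidean distance)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
print(knn_brute_force(points, (0.9, 0.9), k=2))  # [(1.0, 1.0), (0.0, 0.0)]
```

Every point is examined, which is exactly what a distributed search tries to avoid.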

Applications

• GIS
• Distributed Databases
• Statistical Classification
• Recommendation Systems
• Cluster analysis
• Similarity Scores

7

Related Work

1. Naïve algorithm: Central peer collects data and performs k-NN searching

2. k-NN search algorithm over CAN

3. Distributed quadtree-based index: each quadtree block is uniquely identified by its centroid, which is mapped to Chord; a k-NN search algorithm runs on top of this index

8

Contents

GRaSP

k-NN

Evaluation

Conclusions

9

GRaSP

10

GRaSP Building the trie ...

Hierarchical space partition:

1. Peer p joins
2. Finds a bootstrapping peer q
3. Space region s(q) splits into s(q0) and s(q1)

11
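The join steps can be sketched as an operation on the set of trie leaves. This is a simplified illustration, not GRaSP's actual code: the `join` function, the dictionary representation, and the rule that the old owner keeps child 0 are all assumptions.

```python
def join(leaves, new_peer, bootstrap_id):
    """Peer `new_peer` joins via the leaf `bootstrap_id`, which splits in two.

    `leaves` maps each leaf ID (a bit string) to the peer owning that region.
    The old owner keeps child "0" and the newcomer takes child "1"; which
    child goes to whom is an arbitrary choice made for this sketch.
    """
    old_peer = leaves.pop(bootstrap_id)
    leaves[bootstrap_id + "0"] = old_peer
    leaves[bootstrap_id + "1"] = new_peer
    return leaves

leaves = {"": "q"}        # a single peer owns the whole space
join(leaves, "p1", "")    # leaves are now "0" (q) and "1" (p1)
join(leaves, "p2", "0")   # leaves are now "00" (q), "01" (p2) and "1" (p1)
print(sorted(leaves.items()))
```

Because a newcomer may pick any bootstrapping peer, repeated joins in one area make the trie deeper there, which is how the overlay becomes unbalanced.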

GRaSP Space Partition

[Figure: two split strategies, each shown against the region before the split: volume-balanced and data-balanced.]

13
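The two strategies can be contrasted on a one-dimensional region. The sketch assumes that "volume-balanced" means cutting at the midpoint and "data-balanced" means cutting at the median key; the slides name the strategies but do not define them, and all function names here are hypothetical.

```python
import statistics

def volume_balanced_cut(lo, hi, keys):
    """Cut [lo, hi) at its midpoint, so both halves have the same volume."""
    return (lo + hi) / 2

def data_balanced_cut(lo, hi, keys):
    """Cut at the median key, so both halves hold the same number of keys."""
    return statistics.median(keys)

def load(cut, keys):
    """Number of keys that fall left and right of the cut."""
    left = sum(1 for x in keys if x < cut)
    return left, len(keys) - left

keys = [1, 2, 3, 4, 9, 10]   # skewed data inside the region [0, 16)
for name, rule in [("volume", volume_balanced_cut), ("data", data_balanced_cut)]:
    cut = rule(0, 16, keys)
    print(name, cut, load(cut, keys))
```

On this skewed data the midpoint cut at 8.0 leaves 4 keys on one side and 2 on the other, while the median cut at 3.5 leaves 3 on each side.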

GRaSP Space Partition for a 3-sided query

14


GRaSP Data Insertion

A key k is inserted into every peer whose region contains k

17
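A small sketch of the insertion rule, assuming axis-aligned boxes for both regions and keys, and reading "contains" as "intersects" for rectangular keys so that one key can land on several peers. The region layout and the helper names are invented for the example.

```python
def overlaps(region, key):
    """True if two closed axis-aligned boxes, each (low corner, high corner), intersect."""
    (rlo, rhi), (klo, khi) = region, key
    return all(a <= d and c <= b for a, b, c, d in zip(rlo, rhi, klo, khi))

def owners(regions, key):
    """IDs of all peers whose region intersects the key."""
    return sorted(pid for pid, region in regions.items() if overlaps(region, key))

# Hypothetical partition of the square [0, 10] x [0, 10] among three peers.
regions = {
    "0":  ((0, 0), (5, 10)),
    "10": ((5, 0), (10, 5)),
    "11": ((5, 5), (10, 10)),
}
point_key = ((2, 3), (2, 3))   # a point is a box with equal corners
rect_key = ((4, 4), (6, 6))    # a rectangle straddling all three regions
print(owners(regions, point_key))  # ['0']
print(owners(regions, rect_key))   # ['0', '10', '11']
```

With closed boxes, a key lying exactly on a shared boundary is stored on both neighbors; a real system would have to pick a convention here.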

GRaSP Routing Tables

Each peer knows a peer in each complementary subtrie ...

Example: the complementary subtries of peer 0100 are 1, 00, 011 and 0101.

18
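The complementary subtries of a peer can be derived from its ID alone: keep the first i bits and flip the next one. The helper below is a hypothetical illustration of that rule.

```python
def complementary_subtries(peer_id):
    """Prefixes of the subtries that branch off the path from the root to `peer_id`."""
    flip = {"0": "1", "1": "0"}
    return [peer_id[:i] + flip[bit] for i, bit in enumerate(peer_id)]

print(complementary_subtries("0100"))  # ['1', '00', '011', '0101']
```

A peer at depth d therefore keeps d routing entries, one per subtrie.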

GRaSP Routing

“In order to route a message from peer p to peer q, the message is forwarded from p to a neighbor peer r included in a known subtrie closer to peer q. From r it is recursively forwarded to q.”

19
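The routing rule can be simulated on a small set of leaf IDs. This is a sketch under assumptions: the IDs form a complete prefix-free set, each routing entry simply points to the first peer found in the subtrie, and `build_tables` and `route` are hypothetical names.

```python
def build_tables(peer_ids):
    """Give every peer one contact in each of its complementary subtries."""
    flip = {"0": "1", "1": "0"}
    tables = {}
    for pid in peer_ids:
        table = {}
        for i, bit in enumerate(pid):
            prefix = pid[:i] + flip[bit]
            # Any peer in that subtrie will do; this sketch takes the first one.
            table[prefix] = next(x for x in peer_ids if x.startswith(prefix))
        tables[pid] = table
    return tables

def route(tables, source, target):
    """Return the list of peers a message passes through from source to target."""
    path = [source]
    current = source
    while current != target:
        # Length of the common prefix of the current peer and the target.
        n = 0
        while current[n] == target[n]:
            n += 1
        # The target lies in the complementary subtrie that starts at bit n.
        current = tables[current][target[:n + 1]]
        path.append(current)
    return path

peers = ["00", "0100", "0101", "011", "10", "11"]
tables = build_tables(peers)
print(route(tables, "11", "0101"))  # ['11', '00', '0100', '0101']
```

Each hop lengthens the prefix shared with the target by at least one bit, so the number of hops is bounded by the depth of the target leaf.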


Searching Algorithm

Branch-and-bound algorithm

Fringe: a priority queue PQ of candidate peers that may hold an answer better than the k-th answer found so far

1. Branch Step: expand PQ
2. Bound Step: prune PQ

21
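The two steps can be illustrated with a centralized simulation. It is only a sketch under several assumptions: the space is the unit square halved once per ID bit (a volume-balanced partition), the search starts from the trie root rather than from an arbitrary peer, the leaves form a complete prefix-free set, and all names are hypothetical. In the distributed setting the fringe would travel with the query message.

```python
import heapq
import math

def region(prefix, dims=2):
    """Box of a trie node: halve [0,1]^dims once per bit, cycling through the axes."""
    lo, hi = [0.0] * dims, [1.0] * dims
    for depth, bit in enumerate(prefix):
        axis = depth % dims
        mid = (lo[axis] + hi[axis]) / 2
        if bit == "0":
            hi[axis] = mid
        else:
            lo[axis] = mid
    return lo, hi

def min_dist(q, box):
    """Smallest Euclidean distance from point q to any point of the box."""
    lo, hi = box
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def knn(leaves, q, k):
    """`leaves` maps each leaf ID to the list of points stored at that peer."""
    best = []               # max-heap of (-distance, point), at most k entries
    fringe = [(0.0, "")]    # min-heap of (lower bound, trie prefix)
    visited = []
    while fringe:
        bound, prefix = heapq.heappop(fringe)
        if len(best) == k and bound >= -best[0][0]:
            break           # bound step: nothing left can improve the answer
        if prefix in leaves:    # a peer: examine its local data
            visited.append(prefix)
            for p in leaves[prefix]:
                heapq.heappush(best, (-math.dist(p, q), p))
                if len(best) > k:
                    heapq.heappop(best)     # drop the farthest candidate
        else:                   # branch step: expand into the two subtries
            for child in (prefix + "0", prefix + "1"):
                heapq.heappush(fringe, (min_dist(q, region(child)), child))
    return sorted((-d, p) for d, p in best), visited

leaves = {
    "00": [(0.1, 0.1), (0.35, 0.3)], "01": [(0.2, 0.9)], "10": [(0.9, 0.1)],
    "110": [(0.6, 0.6)], "111": [(0.9, 0.9)],
}
answers, visited = knn(leaves, (0.3, 0.3), k=1)
print(answers[0][1], visited)  # (0.35, 0.3) ['00']
```

For k=1 only peer 00 is visited, because every other entry in the fringe has a lower bound larger than the distance to the answer already found.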

Searching Algorithm

Parallel Searching vs Iterative Searching

Parallel Searching requires huge message state!

Iterative Searching prunes larger regions of the data space!

22

Searching Algorithm

23

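Searching Algorithm

Branch-and-bound algorithm

For each complementary subtrie, compare the distance from the query q to the subtrie's region with the distance from q to the current answer a:

• 1? d(q,s(1)) < d(q,a)
• 00? d(q,s(00)) > d(q,a)
• 011? d(q,s(011)) > d(q,a)
• 0101? d(q,s(0101)) < d(q,a)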

24
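The comparison on this slide can be written as a small predicate. The sketch assumes regions are axis-aligned boxes and that d is Euclidean, with d(q, s) taken as the distance from q to the closest point of the region; the function names and the numbers are illustrative only.

```python
import math

def min_dist(q, lo, hi):
    """Distance from point q to the closest point of the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def may_hold_better_answer(q, lo, hi, answer):
    """True if the region could contain a point closer to q than `answer`."""
    return min_dist(q, lo, hi) < math.dist(q, answer)

q, a = (0.3, 0.3), (0.35, 0.3)   # d(q, a) is about 0.05
print(may_hold_better_answer(q, (0.5, 0.0), (1.0, 1.0), a))   # False: region is 0.2 away
print(may_hold_better_answer(q, (0.0, 0.32), (0.5, 1.0), a))  # True: region is 0.02 away
```

A subtrie that fails the test can be dropped from the fringe without contacting any of its peers.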

Latency Complexity Theorem

Latency = |T| · O(log n)

Support Set T:

25

Latency Complexity Theorem

Proof

Peers visited: the peers in T, that is |T| peers

Find peer in the complementary subtrie: O(log n)

26
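One way to read the two proof ingredients together, assuming the peers of T are visited one after another and that reaching each next peer costs a single overlay lookup of O(log n) hops (the slides do not spell this step out):

```latex
\text{Latency} \;\le\; \sum_{t \in T} O(\log n) \;=\; |T| \cdot O(\log n)
```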


Performance Evaluation

Taking into account the number of dimensions

Low, Medium, High

28

Performance Evaluation

Metrics

• Data Fairness Index
• Latency
• Max Throughput
• Fringe Size (mean, max)

29
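The slides do not define the Data Fairness Index. A common choice for this kind of metric is Jain's fairness index over the number of data items stored per peer, and the sketch below assumes that definition; the sample numbers are made up.

```python
import statistics

def fairness_index(loads):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), between 1/n and 1.

    Undefined when every load is zero.
    """
    total = sum(loads)
    squares = sum(x * x for x in loads)
    return total * total / (len(loads) * squares)

print(fairness_index([10, 10, 10, 10]))  # 1.0, perfectly even
print(fairness_index([40, 0, 0, 0]))     # 0.25, one peer holds everything

# Fringe size is reported as a mean and a maximum over the queries.
fringe_sizes = [3, 5, 4, 8]              # hypothetical per-query fringe sizes
print(statistics.mean(fringe_sizes), max(fringe_sizes))
```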

Low dimensions

30

Low dimensions

Workloads

Datasets
• Greece, data-balanced partition, k=1/10/100
• Greece, volume-balanced partition, k=1

Querysets
• Synthetic queries
• For a network size of n peers we asked n/3 queries

31

Low dimensions

Which space partition is the best?

Volume-balanced

Data-balanced

32

Low dimensions

Which space partition is the best?

[Figure: Data FI vs Space Partition. Greece ..., volume-balanced partition and data-balanced partition.]

33

Low dimensions

Which space partition is the best?

[Figure: Latency vs Space Partition. Greece, k=1 ..., volume-balanced partition and data-balanced partition.]

34

Low dimensions

Which space partition is the best?

[Figure: Fringe Size vs Space Partition. Greece, k=1 ..., volume-balanced partition and data-balanced partition.]

35

Low dimensions

Which space partition is the best?

[Figure: Max Throughput vs Space Partition. Greece, k=1 ..., volume-balanced partition and data-balanced partition.]

36

Low dimensions

Which space partition is the best?

Volume-balanced

Data-balanced

37

Low dimensions

k ?

38

Low dimensions

How is the size of the fringe affected?

[Figure: Fringe Size vs k. Greece, data-balanced partition ..., panels for k=1, k=10 and k=100.]

39

Low dimensions

How is the latency affected?

[Figure: Latency vs k. Greece, data-balanced partition ..., panels for k=1, k=10 and k=100.]

40

Low dimensions

How is the Max. Throughput affected?

[Figure: Max Throughput vs k. Greece, data-balanced partition ..., panels for k=1, k=10 and k=100.]

41

Low dimensions

… efficient routing!

42

Medium dimensions

43

Medium dimensions

Workloads

Datasets
• Uniform, volume-balanced partition, k=1
• ColorMoments, data-balanced partition, k=1

Querysets
• Synthetic queries
• For a network size of n peers we asked n/3 queries

44

Medium dimensions

How is the size of the fringe affected?

45

Medium dimensions

How is the size of the fringe affected?

[Figure: fringe size. ColorMoments, data-balanced, k=1.]

46

Medium dimensions

How is the size of the fringe affected?

[Figure: Max. Fringe Size and Mean Fringe Size. Uniform, volume-balanced, k=1.]

47

Medium dimensions

Data Fairness Index

48

Medium dimensions

Data Fairness Index

[Figure: Data Fairness Index. ColorMoments, data-balanced, k=1.]

49

Medium dimensions

Data Fairness Index

[Figure: Data Fairness Index. Uniform, volume-balanced, k=1.]

50

Medium dimensions

Latency

51

Medium dimensions

Latency

[Figure: Latency. ColorMoments, data-balanced, k=1.]

52

Medium dimensions

Latency

[Figure: Latency. Uniform, volume-balanced, k=1.]

53

Medium dimensions

Latency

Latency is high but close to the optimum!

54

Medium dimensions

Max. Throughput

55

Medium dimensions

Max. Throughput

[Figure: Max. Throughput. ColorMoments, data-balanced, k=1.]

56

Medium dimensions

Max. Throughput

[Figure: Max. Throughput. Uniform, volume-balanced, k=1.]

57

Medium dimensions

… not efficient routing but near optimum!

It's still good enough for practical applications!

58

High dimensions

59

High dimensions

Curse of dimensionality

“When the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse.”

60
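One well-known symptom of this effect is that distances concentrate: the nearest and the farthest point end up at almost the same distance from the query, so distance bounds prune very little. The sketch below measures this on uniform random data; the function name, sample size and seed are arbitrary choices, and the experiment is not taken from the thesis.

```python
import math
import random

def contrast(dims, n_points=500, seed=0):
    """Ratio of nearest to farthest distance from a random query to uniform random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dims)])
        for _ in range(n_points)
    ]
    return min(dists) / max(dists)

for dims in (2, 10, 100):
    print(dims, round(contrast(dims), 3))
```

A ratio close to 0 means the nearest neighbor is much closer than most of the data; a ratio approaching 1 means almost every region looks equally promising to a branch-and-bound search.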


Conclusions

[Figure: layer diagram with the components API, Searching (k-NN), Trie, Data Ins/Rem, Space Partition, Query Types, Data Types and Metric Space.]

62

Future Work

• Approximate k-NN searching for high dimensions
• Redundancy

63

THANK YOU

QUESTIONS ?

64
