Branch-and-bound nearest neighbor searching over unbalanced trie-structured overlays

Master’s Thesis Presentation
Technical University of Crete
4.2.2013

Author: Michail Argyriou
Supervisor: Ass’t Prof. Vasilis Samoladas


Master’s thesis presentation by Mike Argyriou at the Technical University of Crete on branch-and-bound nearest neighbor searching over unbalanced trie-structured overlays.


P2P Evolution

[Figure: timeline of P2P systems from 1998 to 2002, grouped as centralized, semi-distributed, and fully-distributed; DHTs are marked at 2001.]

Distributed Hash Table (DHT)


DHT Frameworks Evolution

2003: PGrid

• Rectangular queries support
• Peers only on leaves
• High-dimensional queries support with space filling curves

2006: VBI

• Height-balanced search tree limitation

2008: GRaSP

• No height-balanced search tree limitation
• Abstract types of data and queries
• Data: point, rectangular
• Queries: point, 3-sided, n-d rectangular

Nearest neighbor search


Given a distributed data set, how can we find the k data items most similar to a query?

“k-Nearest Neighbor Search”
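As a point of reference for the question above, here is a minimal sketch of k-NN on a single machine, where all data sits in one list. Euclidean distance and the name `knn_brute_force` are assumptions made for illustration; the slides do not fix a distance function at this point.

```python
import heapq
import math

def knn_brute_force(points, query, k):
    """Return the k points closest to `query` under Euclidean distance."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Toy data set (illustrative values only).
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
nearest = knn_brute_force(points, query=(1.2, 1.0), k=2)
print(nearest)  # [(1.0, 1.0), (2.0, 2.0)]
```

The distributed setting is harder because no single peer holds `points`, so every distance evaluation may cost network hops.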


Applications

• GIS
• Distributed Databases
• Statistical Classification
• Recommendation Systems
• Cluster analysis
• Similarity Scores

Related Work

1. Naïve algorithm: Central peer collects data and performs k-NN searching

2. k-NN search algorithm over CAN

3. Distributed quadtree-based index over Chord: each quadtree block is uniquely identified by its centroid, which is mapped to Chord; a k-NN search algorithm runs on top of this index

Contents

GRaSP

k-NN

Evaluation

Conclusions


GRaSP


GRaSP Building the trie ...

Hierarchical space partition:

1. Peer p joins
2. Finds a bootstrapping peer q
3. Space region s(q) splits into s(q0) and s(q1)
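The join steps can be sketched with a toy model in which the overlay is just a mapping from leaf IDs (bit strings) to peers. The class `Trie`, and the choice that the existing peer keeps q0 while the newcomer takes q1, are assumptions for illustration and are not stated on the slide.

```python
class Trie:
    """Toy model: maps each leaf ID (a bit string) to the peer that owns it."""

    def __init__(self, first_peer):
        # The first peer owns the whole space, identified by the empty string.
        self.leaves = {"": first_peer}

    def join(self, new_peer, bootstrap_id):
        # Leaf q splits into q0 and q1. Assumption: the old peer keeps q0
        # and the joining peer receives q1.
        old_peer = self.leaves.pop(bootstrap_id)
        self.leaves[bootstrap_id + "0"] = old_peer
        self.leaves[bootstrap_id + "1"] = new_peer

trie = Trie("a")
trie.join("b", "")    # leaves: 0 -> a, 1 -> b
trie.join("c", "0")   # leaves: 00 -> a, 01 -> c, 1 -> b
trie.join("d", "01")  # leaves: 00 -> a, 010 -> c, 011 -> d, 1 -> b
print(sorted(trie.leaves))  # ['00', '010', '011', '1']
```

Because every join splits whichever leaf the bootstrapping peer owns, nothing forces the leaves to the same depth, which is how an unbalanced trie arises.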

GRaSP Space Partition

[Figure: two space-partition examples, each with a "Before" panel: volume-balanced and data-balanced.]
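One plausible reading of the two policies, reduced to one dimension: a volume-balanced split cuts a region into halves of equal size, while a data-balanced split cuts it so that both halves hold about the same number of keys. The function names and the 1-D simplification below are assumptions, not the GRaSP definitions.

```python
from statistics import median

def split_volume_balanced(lo, hi, keys):
    """Cut the interval [lo, hi) at its midpoint, ignoring the data."""
    return (lo + hi) / 2

def split_data_balanced(lo, hi, keys):
    """Cut at the median key so both halves hold about equally many keys."""
    return median(keys) if keys else (lo + hi) / 2

# Skewed toy data: most keys sit in the lower third of [0, 1).
keys = [0.1, 0.2, 0.25, 0.3, 0.9]
print(split_volume_balanced(0.0, 1.0, keys))  # 0.5
print(split_data_balanced(0.0, 1.0, keys))    # 0.25
```

On skewed data the midpoint cut leaves four keys on one side and one on the other, whereas the median cut yields a two/three split.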


GRaSP Space Partition for a 3-sided query


GRaSP Data Insertion

We insert a key k into all peers whose regions contain k
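A minimal sketch of this insertion rule, assuming point keys and regions modeled as closed axis-aligned boxes. The helper names `contains` and `insert_key` and the two-peer layout are hypothetical.

```python
def contains(region, point):
    """True if `point` lies inside the closed axis-aligned box `region` = (lo, hi)."""
    lo, hi = region
    return all(l <= x <= h for l, x, h in zip(lo, point, hi))

def insert_key(regions, storage, key):
    """Store `key` at every peer whose region contains it; return those peer IDs."""
    owners = [pid for pid, region in regions.items() if contains(region, key)]
    for pid in owners:
        storage[pid].append(key)
    return owners

# Hypothetical two-peer overlay splitting the unit square along x = 0.5.
regions = {"0": ((0.0, 0.0), (0.5, 1.0)), "1": ((0.5, 0.0), (1.0, 1.0))}
storage = {pid: [] for pid in regions}

print(insert_key(regions, storage, (0.2, 0.7)))  # ['0']
print(insert_key(regions, storage, (0.5, 0.3)))  # ['0', '1'] (on the shared border)
```

With closed boxes a key on a shared border is stored at both neighbors, which is one way a key can end up at more than one peer.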


GRaSP Routing Tables

Each peer knows a peer in each complementary subtrie ...

For peer 0100, the complementary subtries are 1, 00, 011 and 0101.
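The complementary subtries of a peer can be derived from its ID alone: for each bit position, keep the bits before it and flip that bit. The function name is an assumption; the output reproduces the example for peer 0100.

```python
def complementary_subtries(peer_id):
    """For each bit position i, keep the first i bits and flip bit i."""
    flip = {"0": "1", "1": "0"}
    return [peer_id[:i] + flip[peer_id[i]] for i in range(len(peer_id))]

print(complementary_subtries("0100"))  # ['1', '00', '011', '0101']
```

A peer at depth d therefore keeps d routing entries, one per complementary subtrie.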


GRaSP Routing

“In order to route a message from peer p to peer q, the message is forwarded from p to a neighbor peer r included in a known subtrie closer to peer q. From r it is recursively forwarded to q.”
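The routing rule can be sketched on a hypothetical four-peer overlay with leaves 00, 010, 011 and 1. Each routing table maps a complementary subtrie prefix to one known peer inside it; the table contents and the function names are assumptions chosen for illustration.

```python
def next_hop(routing_table, target_id):
    """Return the known neighbor in the complementary subtrie containing target_id."""
    for prefix, neighbor in routing_table.items():
        if target_id.startswith(prefix):
            return neighbor
    return None  # target lies in no complementary subtrie

def route(tables, source, target):
    """Forward hop by hop until the target peer is reached; return the path."""
    path = [source]
    while path[-1] != target:
        hop = next_hop(tables[path[-1]], target)
        if hop is None:
            raise ValueError(f"no route from {path[-1]} to {target}")
        path.append(hop)
    return path

# Hypothetical routing tables: complementary prefix -> one known peer in it.
tables = {
    "00":  {"1": "1", "01": "010"},
    "010": {"1": "1", "00": "00", "011": "011"},
    "011": {"1": "1", "00": "00", "010": "010"},
    "1":   {"0": "00"},
}
print(route(tables, "1", "011"))  # ['1', '00', '010', '011']
```

Each hop lands in a subtrie that shares a longer prefix with the target, so the path length is bounded by the depth of the target leaf.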


Searching Algorithm Branch-and-bound algorithm

Fringe: priority queue PQ of candidate peers that may hold an answer better than the k-th answer found so far

1. Branch Step: expand PQ
2. Bound Step: prune PQ
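The two steps can be sketched as a single-process, best-first search over a small space-partition tree. This is not the distributed protocol from the thesis: the tree layout, the box-shaped regions, Euclidean distance, and all names below are assumptions for illustration.

```python
import heapq
import math
from itertools import count

def mindist(query, box):
    """Smallest Euclidean distance from `query` to the axis-aligned box (lo, hi)."""
    lo, hi = box
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for l, x, h in zip(lo, query, hi)))

def knn_branch_and_bound(root, query, k):
    """Return (k nearest points, number of leaves visited)."""
    tie = count()  # tie-breaker so heapq never has to compare dicts
    fringe = [(mindist(query, root["box"]), next(tie), root)]
    best = []      # max-heap via negated distances: (-distance, point)
    leaves_visited = 0
    while fringe:
        bound, _, node = heapq.heappop(fringe)
        # Bound step: nothing left in the fringe can beat the k-th answer.
        if len(best) == k and bound >= -best[0][0]:
            break
        if "children" in node:
            # Branch step: expand the fringe with the sub-regions.
            for child in node["children"]:
                heapq.heappush(fringe, (mindist(query, child["box"]), next(tie), child))
            continue
        leaves_visited += 1
        for p in node["points"]:
            d = math.dist(p, query)
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    return [p for _, p in sorted(best, reverse=True)], leaves_visited

# Hypothetical unbalanced partition of the unit square into three leaves.
tree = {"box": ((0.0, 0.0), (1.0, 1.0)), "children": [
    {"box": ((0.0, 0.0), (0.5, 1.0)), "points": [(0.1, 0.1), (0.4, 0.8)]},
    {"box": ((0.5, 0.0), (1.0, 1.0)), "children": [
        {"box": ((0.5, 0.0), (1.0, 0.5)), "points": [(0.9, 0.1)]},
        {"box": ((0.5, 0.5), (1.0, 1.0)), "points": [(0.6, 0.9)]},
    ]},
]}
print(knn_branch_and_bound(tree, (0.2, 0.2), 1))  # ([(0.1, 0.1)], 1)
```

For k=1 the whole right half of the space is pruned after visiting one leaf, because its minimum distance to the query already exceeds the distance of the answer found.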


Searching Algorithm Parallel Searching vs Iterative Searching

Parallel Searching requires huge message state!

Iterative Searching prunes larger regions of the data space!


Searching Algorithm


Searching Algorithm Branch-and-bound algorithm

Each complementary subtrie of peer 0100 is compared with the current answer a:

1? d(q,s(1)) < d(q,a)
00? d(q,s(00)) > d(q,a)
011? d(q,s(011)) > d(q,a)
0101? d(q,s(0101)) < d(q,a)

Latency Complexity Theorem

Latency = |T| · O(log n)

Support Set T:


Latency Complexity Theorem Proof

[Figure legend: peers visited; peers in T.]

• The search visits |T| peers
• Find peer in the complementary subtrie: O(log n)
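Putting the two proof ingredients together gives the bound in one line. The numeric example afterwards is purely illustrative: the values of n and |T| and a hidden constant of 1 are assumptions, not results from the thesis.

```latex
\[
  \text{Latency}
  \;\le\; \sum_{t \in T} O(\log n)
  \;=\; |T| \cdot O(\log n)
\]
% Illustrative numbers only: with n = 1024 peers, |T| = 5 and a hidden
% constant of 1, the bound evaluates to 5 * log2(1024) = 5 * 10 = 50 hops.
```

The bound is additive because the iterative search contacts the peers of T one after another, paying one routed lookup for each.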


Performance Evaluation Taking into account number of dimensions

Low Medium High


Performance Evaluation Metrics

• Data Fairness Index
• Latency
• Max Throughput
• Fringe Size (mean, max)
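The slides do not define the Data Fairness Index. A common choice for such a metric is Jain's fairness index over the number of keys stored per peer, and the sketch below assumes that definition; the thesis may use a different one.

```python
def jain_fairness_index(loads):
    """Jain's index: (sum x)^2 / (n * sum x^2). 1.0 means perfectly even load."""
    total = sum(loads)
    squares = sum(x * x for x in loads)
    # Assumption: an all-zero load vector is treated as perfectly fair.
    return (total * total) / (len(loads) * squares) if squares else 1.0

print(jain_fairness_index([10, 10, 10, 10]))  # 1.0
print(jain_fairness_index([40, 0, 0, 0]))     # 0.25
```

Under this definition the index ranges from 1/n, when one peer holds everything, to 1, when all peers hold the same amount.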

Low dimensions

Low dimensions Workloads

Datasets

• Greece, data-balanced partition, k=1/10/100
• Greece, volume-balanced partition, k=1

Querysets

• Synthetic queries
• For a network size of n peers we asked n/3 queries

Low dimensions Which space partition is the best?

Volume-balanced

Data-balanced


Low dimensions Which space partition is the best?

[Figure: Data FI vs Space Partition. Workload: Greece ... Panels: volume-balanced partition, data-balanced partition.]

[Figure: Latency vs Space Partition. Workload: Greece, k=1 ... Panels: volume-balanced partition, data-balanced partition.]

[Figure: Fringe Size vs Space Partition. Workload: Greece, k=1 ... Panels: volume-balanced partition, data-balanced partition.]

[Figure: Max Throughput vs Space Partition. Workload: Greece, k=1 ... Panels: volume-balanced partition, data-balanced partition.]

Low dimensions Which space partition is the best?

Volume-balanced

Data-balanced


Low dimensions k ?


Low dimensions How is the size of the fringe affected?

[Figure: Fringe Size vs k. Workload: Greece, data-balanced partition ... Panels: k=1, k=10, k=100.]

Low dimensions How is the latency affected?

[Figure: Latency vs k. Workload: Greece, data-balanced partition ... Panels: k=1, k=10, k=100.]

Low dimensions How is the Max. Throughput affected?

[Figure: Max Throughput vs k. Workload: Greece, data-balanced partition ... Panels: k=1, k=10, k=100.]

Low dimensions … efficient routing!


Medium dimensions

Medium dimensions Workloads

Datasets

• Uniform, volume-balanced partition, k=1
• ColorMoments, data-balanced partition, k=1

Querysets

• Synthetic queries
• For a network size of n peers we asked n/3 queries

Medium dimensions How is the size of the fringe affected?

[Figure: fringe size. Workload: ColorMoments, data-balanced, k=1.]

[Figure: Max. Fringe Size and Mean Fringe Size. Workload: Uniform, volume-balanced, k=1.]

Medium dimensions Data Fairness Index

[Figure: Data Fairness Index. Workload: ColorMoments, data-balanced, k=1.]

[Figure: Data Fairness Index. Workload: Uniform, volume-balanced, k=1.]

Medium dimensions Latency

[Figure: latency. Workload: ColorMoments, data-balanced, k=1.]

[Figure: latency. Workload: Uniform, volume-balanced, k=1.]

Latency is high but close to the optimum!

Medium dimensions Max. Throughput

[Figure: Max. Throughput. Workload: ColorMoments, data-balanced, k=1.]

[Figure: Max. Throughput. Workload: Uniform, volume-balanced, k=1.]

Medium dimensions … not efficient routing but near optimum!

It's still good enough for practical applications!

High dimensions

High dimensions Curse of dimensionality

“When the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse.”
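One way to see the effect numerically, offered as an illustration that is not taken from the slides: the fraction of a unit hypercube covered by its inscribed ball shrinks rapidly with the dimension d, so a distance ball around a query covers almost none of the space and pruning by distance becomes ineffective. The function name is an assumption.

```python
import math

def inscribed_ball_fraction(d):
    """Fraction of the unit hypercube's volume covered by its inscribed ball (radius 0.5)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * 0.5 ** d

for d in (2, 3, 10, 50):
    print(d, inscribed_ball_fraction(d))
```

For d = 2 the fraction is π/4, while for d = 10 it is already below one percent.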


Conclusions

[Figure: system components as labeled on the slide: API; Searching (k-NN); Trie; Data Ins/Rem; Space Partition; Query Types; Data Types; Metric Space.]

Future Work

Approximate k-NN searching for high dimensions

Redundancy

THANK YOU

QUESTIONS ?
