Similarity join problem with Pass-Join-K using Hadoop, by Yu Haiyang

23/4/19 http://datamining.xmu.edu.cn 2/32

Outline

Background

The introduction of Pass-Join-K

Combining Pass-Join-K with Hadoop

Background

Similarity join: find all similar pairs from two sets.

Data cleaning

Query relaxation

Spellchecking

“PO BOX 23, Main St.” vs. “P.O. Box 23, Main St”

“information” vs. “imformation”

Background

How to define similarity?

Jaccard distance

Cosine distance

Edit distance

Background

Edit distance

The minimum number of edit operations (insertion, deletion, and substitution) needed to transform one string into another.

Baby → Body: substitution

Bod → Body: insertion

Background

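How does edit distance compare with the other two?

Accuracy: {“abcdefg”, “gfedcba”} contain exactly the same characters, so a measure that treats the strings as sets of characters cannot tell them apart, while edit distance can.

Verification time: O(mn) for edit distance versus O(m+n) for the other two.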
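
The accuracy point can be checked with a small sketch. It assumes that Jaccard similarity is computed over the set of characters of each string (the slides do not say which tokens are used), and the function name `jaccard_similarity` is only illustrative.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the character sets of two strings.

    Assumption: tokens are single characters; other tokenizations
    (words, q-grams) would give different values.
    """
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# The two strings are reverses of each other but share every character.
print(jaccard_similarity("abcdefg", "gfedcba"))
# A pair with only one shared character out of seven distinct ones.
print(jaccard_similarity("vldb", "icde"))
```

Because the set-based measure ignores character order, the reversed pair looks identical to it; edit distance is sensitive to order and does not have this blind spot.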

Background

Find similar pairs

We have two string sets: one is {vldb, sigmod, …}, the other is {pvldb, icde, …}.

Find some candidate pairs, and then verify these pairs.

{<vldb,pvldb>, <vldb,icde>, <vldb,..>, <sigmod,pvldb>, <sigmod,icde>, …}

<vldb,pvldb> Yes

<vldb,icde> No

Background

So we have to:

Find candidate pairs. There are O(N^2) of them if we do not prune any pairs.

Verify these pairs. Each verification costs O(mn).

Introduction of Pass-Join-K

Some obvious pruning techniques

Length-based: with threshold = 2, <“ab”, “abcee”> is pruned because the lengths differ by 3.

Shift-based: <“abcd”, “cdef”>

a b c d

c d e f
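
The length-based filter can be sketched as follows. The helper names `length_filter` and `candidate_pairs` are illustrative, and the threshold of 1 used for the conference-name sets is an assumption made only for the demonstration.

```python
from itertools import product


def length_filter(r: str, s: str, tau: int) -> bool:
    """Return True if the pair survives the filter.

    Every insertion or deletion changes the length by one, so two strings
    whose lengths differ by more than tau cannot be within edit distance tau.
    """
    return abs(len(r) - len(s)) <= tau


def candidate_pairs(left, right, tau):
    """Keep only the pairs of the cross product that pass the length filter."""
    return [(r, s) for r, s in product(left, right) if length_filter(r, s, tau)]


print(length_filter("ab", "abcee", 2))
print(candidate_pairs(["vldb", "sigmod"], ["pvldb", "icde"], 1))
```

The filter only removes pairs that cannot be similar; the surviving pairs still need verification.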

Introduction of Pass-Join-K

Partition-based pruning technique

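Suppose the threshold tau = 2 and k = 2, and we have the pair <“abcdefghijk”, “abdefghk”>. The first string is split into tau+k = 4 segments. tau edit operations can change at most tau segments, so a similar pair must still share at least k of them. Here only “def” matches, so the pair is pruned.

abc def ghi jk

ab def gh k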
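
A minimal sketch of this counting argument is given below. It is a simplification: it only checks whether each segment occurs anywhere in the other string and ignores positions, so it prunes less than a position-aware method would, but it never drops a pair that is truly within the threshold. The function names are illustrative.

```python
def matching_segments(segments, s):
    """Return the segments of r that occur somewhere in s as a substring."""
    return [segment for segment in segments if segment in s]


def survives_partition_filter(segments, s, k):
    """A pair can only be similar if at least k segments of r appear in s."""
    return len(matching_segments(segments, s)) >= k


segments_of_r = ["abc", "def", "ghi", "jk"]  # "abcdefghijk" split into tau + k = 4 parts
s = "abdefghk"

print(matching_segments(segments_of_r, s))
print(survives_partition_filter(segments_of_r, s, k=2))
```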

Introduction of Pass-Join-K

Partition Scheme

We have seen that the longer the substrings are, the harder they are to match.

So we break the string into tau+k parts, each of length length/(tau+k) (rounded down) or length/(tau+k)+1.

Introduction of Pass-Join-K

Partition Scheme

abc def ghi jk
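
The even partition can be sketched as below. One detail is an assumption: the slides do not say whether the longer segments come first or last, so this sketch puts them first because that reproduces the example “abc def ghi jk”. The function name `partition` is illustrative.

```python
def partition(string: str, tau: int, k: int) -> list[str]:
    """Split string into tau + k consecutive segments of nearly equal length.

    Each segment has length len(string) // (tau + k) or one more.
    Assumption: the longer segments are placed first.
    """
    parts = tau + k
    if len(string) < parts:
        raise ValueError("string is shorter than the number of segments")
    base, extra = divmod(len(string), parts)
    segments = []
    start = 0
    for index in range(parts):
        size = base + 1 if index < extra else base
        segments.append(string[start:start + size])
        start += size
    return segments


print(partition("abcdefghijk", tau=2, k=2))
```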

Introduction of Pass-Join-K

Partition Scheme

r = “abcdefghijk”, s = “abdefghk”

abc def ghi jk

[Figure: index for length 11 (L11) with segment slots 1, 2, 3, 4, each pointing to r; label “def”]

Introduction of Pass-Join-K

Substring Selection

Here we suppose tau = 3 and k = 1;

abc def ghi jk

a b d e f g h k

Introduction of Pass-Join-K

Substring Selection

Here we suppose tau = 3 and k = 1;

abc def ghi jk

a b d e f g h k

Introduction of Pass-Join-K

Substring Selection

Here we suppose tau = 3 and k = 1;

abc def ghi jk

a b d e f gh k

Introduction of Pass-Join-K

Substring Selection

Here we suppose tau = 3 and k = 1;

abc def ghi jk

abd efg hk

Introduction of Pass-Join-K

Substring Selection

Here we suppose tau = 3 and k = 1;

abc def ghi jk

a b d e f g h k

Introduction of Pass-Join-K

Substring Selection

So what we do is reduce the number of substrings. For more pruning techniques, please read our paper: 《Pass-Join-K多分段匹配的相似性连接算法》 (Pass-Join-K: a similarity join algorithm based on multi-segment matching)

Introduction of Pass-Join-K

Verification

DP (dynamic programming)

• $D(m,n) = \min\bigl(D(m,n-1)+1,\ D(m-1,n)+1,\ D(m-1,n-1)+\mathit{flag}\bigr)$, where flag = 0 when $s_m = r_n$ and flag = 1 otherwise; s and r are both strings.
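
The recurrence can be turned into a short sketch. It uses the standard base cases D(m,0) = m and D(0,n) = n, which the slide does not spell out, and keeps only two rows of the table; the function name `edit_distance` is illustrative.

```python
def edit_distance(s: str, r: str) -> int:
    """Edit distance between s and r by dynamic programming.

    prev holds row m-1 of the table D, curr holds row m.
    Time is O(len(s) * len(r)); memory is O(len(r)).
    """
    prev = list(range(len(r) + 1))  # D(0, n) = n
    for m in range(1, len(s) + 1):
        curr = [m] + [0] * len(r)   # D(m, 0) = m
        for n in range(1, len(r) + 1):
            flag = 0 if s[m - 1] == r[n - 1] else 1
            curr[n] = min(
                curr[n - 1] + 1,    # D(m, n-1) + 1
                prev[n] + 1,        # D(m-1, n) + 1
                prev[n - 1] + flag, # D(m-1, n-1) + flag
            )
        prev = curr
    return prev[len(r)]


print(edit_distance("Bod", "Body"))
print(edit_distance("abcdefghijk", "abdefghk"))
```

A pair is reported as similar when the returned value is at most tau.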

Introduction of Pass-Join-K

Verification

Here we suppose tau = 3 and k = 1;

abc def ghi jk

def e f g h k

Tauleft = 3

Tauright = 3 - 3 = 0

Combining Pass-Join-K with Hadoop

Inverted index tree in Hadoop

(abc,1,11,r) (def,2,11,r) (ghi,3,11,r) (jk,4,11,r)

abc def ghi jk

[Figure: index for length 11 (L11) with segment slots 1, 2, 3, 4, each pointing to r]
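
The index records above can be produced by a map-style function such as the sketch below. It is plain Python rather than Hadoop code, the name `index_records` is illustrative, and it assumes the longer segments come first so that the output matches the records shown.

```python
def index_records(string: str, string_id: str, tau: int, k: int):
    """Emit one (segment, segmentNumber, length, id) record per segment.

    Assumption: the string is split into tau + k nearly equal segments,
    longer segments first; segment numbers start at 1.
    """
    parts = tau + k
    base, extra = divmod(len(string), parts)
    records = []
    start = 0
    for segment_number in range(1, parts + 1):
        size = base + 1 if segment_number <= extra else base
        segment = string[start:start + size]
        records.append((segment, segment_number, len(string), string_id))
        start += size
    return records


for record in index_records("abcdefghijk", "r", tau=3, k=1):
    print(record)
```

In a MapReduce job the first three fields would naturally form the key, so that substrings generated from the other set with the same segment, segment number and length meet the indexed string in the same reducer.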

Combining Pass-Join-K with Hadoop

Substrings in Hadoop

Suppose tau = 3, k = 1, and s = “abdefghk”, length(s) = 8. We have to generate records such as (a,1,5,s), (a,2,6,s), (a,3,7,s), (ab,1,8,s), …, (ab,1,11,s), …

Combining Pass-Join-K with Hadoop

Substrings in Hadoop

Suppose tau = 3, k = 1, and s = “abdefghk”, length(s) = 8. We have to generate more than 2*tau*(tau+k)*m records, where m is the average number of substrings per segment, such as (a,1,5,s), (a,1,6,s), (a,1,7,s), (ab,1,8,s), …, (ab,1,11,s), …

Combining Pass-Join-K with Hadoop

Data flows in hadoop

Combining Pass-Join-K with Hadoop

How to improve the performance?

We know that as k increases, the number of pairs we need to verify decreases.

As k increases, more than (tau+k+1)/(tau+k) times as many records have to be transmitted in the Mapper phase.

Combining Pass-Join-K with Hadoop

Here we have two ways to improve our algorithm.

Find a dataset in which the number of candidate pairs is large enough, or make tau large enough.

Decrease the amount of data generated in the Mapper phase.

Combining Pass-Join-K with Hadoop

Decrease the data flows

Combining Pass-Join-K with Hadoop

Decrease the data flows

The inverted index record is formatted as (substring, segmentNumber, LengthInf, Id, flag).

• Each record's length is length(substring) + 4*sizeof(int), and the substring can sometimes be very long.

• Hash(substring) -> integer; the record length then becomes 5*sizeof(int).
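
The fixed-size record can be sketched as follows. Several details are assumptions made for illustration: the slides do not name the hash function, so CRC-32 from `zlib` stands in for it; sizeof(int) is taken to be 4 bytes; and Id and flag are treated as plain integers. The name `pack_record` is hypothetical.

```python
import struct
import zlib

# One unsigned 32-bit hash followed by four signed 32-bit integers,
# little-endian with no padding: 5 * 4 = 20 bytes per record.
RECORD_FORMAT = "<Iiiii"


def pack_record(substring: str, segment_number: int, length_inf: int,
                string_id: int, flag: int) -> bytes:
    """Replace the substring by a 32-bit hash and pack the record."""
    key = zlib.crc32(substring.encode("utf-8"))  # assumption: CRC-32 as the hash
    return struct.pack(RECORD_FORMAT, key, segment_number, length_inf,
                       string_id, flag)


short_record = pack_record("a", 1, 5, 7, 0)
long_record = pack_record("a" * 1000, 1, 5, 7, 0)
print(len(short_record), len(long_record))
```

Two different substrings can hash to the same integer, so hashed keys may introduce extra candidate pairs; these are removed by the verification step, which has to run anyway.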

Combining Pass-Join-K with Hadoop

Decrease the data flows

A substring generates several similar records, such as (a,1,5,s), (a,1,6,s), (a,1,7,s).

• Each substring generates tau+k similar records, so we combine them into one, for example (a,1,5,7,s). This reduces (tau+k)*4*sizeof(int) to 5*sizeof(int).
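
Combining the similar records can be sketched as below. The sketch merges records that agree on substring, segment number and id and whose lengths form a consecutive run into one record carrying the lowest and highest length; the name `merge_length_ranges` is illustrative, and the rule that only consecutive lengths are merged is an assumption.

```python
from itertools import groupby


def merge_length_ranges(records):
    """Merge (substring, segment, length, id) records into
    (substring, segment, min_length, max_length, id) records.

    Only runs of consecutive lengths are merged (assumption).
    """
    ordered = sorted(records, key=lambda rec: (rec[0], rec[1], rec[3], rec[2]))
    merged = []
    for (substring, segment, string_id), group in groupby(
            ordered, key=lambda rec: (rec[0], rec[1], rec[3])):
        lengths = [rec[2] for rec in group]
        low = high = lengths[0]
        for length in lengths[1:]:
            if length <= high + 1:
                high = max(high, length)  # extend the current run
            else:
                merged.append((substring, segment, low, high, string_id))
                low = high = length       # start a new run
        merged.append((substring, segment, low, high, string_id))
    return merged


print(merge_length_ranges([("a", 1, 5, "s"), ("a", 1, 6, "s"), ("a", 1, 7, "s")]))
```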

Combining Pass-Join-K with Hadoop

Decrease the data flows

So by using the two steps above, we have reduced (length(substring) + 4*sizeof(int))*(tau+k) to 5*sizeof(int).

Email: [email protected]