Similarity join problem with Pass-Join-K using Hadoop ---BY Yu Haiyang


23/4/19 http://datamining.xmu.edu.cn

Outline

Background

The introduction of Pass-Join-K

Combining Pass-Join-K with Hadoop


Background

Similarity join: find all similar pairs from two sets.

Data cleaning

Query relaxation

Spellchecking

“PO BOX 23, Main St.” vs. “P.O. Box 23, Main St”

“information” vs. “imformation”


Background

How to define similarity?

Jaccard distance

Cosine distance

Edit distance


Background

Edit distance

The minimum number of edit operations (insertion, deletion, and substitution) needed to transform one string into another.

Baby → Body (substitution)

Bod → Body (insertion)


Background

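How does the edit distance compare with the other two?

Accuracy: {“abcdefg”, “gfedcba”} contain the same characters in reverse order. Edit distance is sensitive to that order, while a measure computed over character sets is not.

Verification time: O(mn) for edit distance versus O(m+n) for the other two, where m and n are the string lengths.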
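
To make the accuracy point concrete, here is a minimal Python sketch. It assumes that Jaccard and cosine are computed over individual characters, which the slides do not specify; a q-gram tokenization would behave differently. Under that assumption both measures treat the two strings as identical, although their edit distance is 6 (six substitutions, since only the middle character stays in place).

```python
import math
from collections import Counter


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over character sets (assumed tokenization)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over character-count vectors (assumed tokenization)."""
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[ch] * count_b[ch] for ch in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Both measures cannot tell the reversed string from the original.
print(jaccard_similarity("abcdefg", "gfedcba"))
print(cosine_similarity("abcdefg", "gfedcba"))
```

Both functions run in time linear in the string lengths, which matches the O(m+n) cost on the slide.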


Background

Find similar pairs

We have two string sets: one is {vldb, sigmod, …}, the other is {pvldb, icde, …}.

Find some candidate pairs, and then verify these pairs.

{<vldb,pvldb>, <vldb,icde>, <vldb,..>, <sigmod,pvldb>, <sigmod,icde>, …}

<vldb,pvldb> Yes

<vldb,icde> No


Background

So we have to:

Find candidate pairs. There are O(N^2) of them if we do not prune any pairs.

Verify these pairs. Each verification costs O(mn).


Introduction of Pass-Join-K

Some obvious pruning techniques

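Length-based: with threshold = 2, the pair <“ab”, “abcee”> is pruned because the lengths differ by 3.

Shift-based: <“abcd”, “cdef”>, where the common substring “cd” is shifted by two positions:

a b c d
    c d e f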
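
The length-based rule can be sketched in a few lines of Python. It relies only on the fact that the edit distance between two strings is at least the difference of their lengths; the function name is illustrative and does not come from the slides.

```python
def passes_length_filter(r: str, s: str, tau: int) -> bool:
    """Keep a pair only if the length difference does not already exceed tau."""
    return abs(len(r) - len(s)) <= tau


# The pair from the slide: lengths 2 and 5 differ by 3, which exceeds tau = 2.
print(passes_length_filter("ab", "abcee", 2))  # False
```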



Partition-based pruning technique

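We suppose the threshold tau = 2 and K = 2, and we have the pair <“abcdefghijk”, “abdefghk”>.

r: abc def ghi jk

s: ab def gh k

The string r is cut into tau + K = 4 segments. Each edit can destroy at most one segment, so a similar pair must share at least K segments. Here only “def” matches, so the pair is pruned.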
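
A rough Python sketch of this check on the pair from the slide follows. It is deliberately simplified: it only asks whether each segment of r occurs anywhere in s and ignores segment positions, so it is an illustration under that assumption rather than the full filter. The function name is hypothetical.

```python
def matching_segments(segments: list, s: str) -> list:
    """Return the segments of r that occur somewhere in s (positions ignored)."""
    return [seg for seg in segments if seg in s]


segments_of_r = ["abc", "def", "ghi", "jk"]  # r = "abcdefghijk", tau + K = 4 parts
s = "abdefghk"
K = 2

matched = matching_segments(segments_of_r, s)
print(matched)            # ['def']
print(len(matched) >= K)  # False, so the pair is pruned
```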



Partition Scheme

We have seen that the longer the substrings are, the harder they are to match.

So we break the string into tau+k parts, each of length ⌊length/(tau+k)⌋ or ⌊length/(tau+k)⌋+1.



Partition Scheme

For example, with tau+k = 4 the string “abcdefghijk” is split as:

abc def ghi jk
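
The even partition can be sketched as follows. Placing the longer segments first is an assumption taken from the “abc def ghi jk” example; the slides do not state the order explicitly, and the function name is illustrative.

```python
def partition(s: str, num_segments: int) -> list:
    """Split s into num_segments parts whose lengths differ by at most one.

    Assumption: the longer parts come first, as in the slide example.
    """
    base, extra = divmod(len(s), num_segments)
    segments = []
    start = 0
    for i in range(num_segments):
        length = base + 1 if i < extra else base
        segments.append(s[start:start + length])
        start += length
    return segments


tau, k = 2, 2
print(partition("abcdefghijk", tau + k))  # ['abc', 'def', 'ghi', 'jk']
```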



Partition Scheme

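r = “abcdefghijk”, s = “abdefghk”

abc def ghi jk

[Figure: inverted index L11 for strings of length 11. Segments 1 to 4 (“abc”, “def”, “ghi”, “jk”) each point to r; the substring “def” is highlighted.]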



Substring Selection

Here we suppose tau = 3 and k = 1.

r: abc def ghi jk

s: a b d e f g h k

[Animation frames: the characters of s are regrouped step by step, first as “a b d e f gh k” and then as “abd efg hk”, against the segments of r.]



Substring Selection

So what we do is reduce the number of substrings. For more pruning techniques, please read our paper: 《Pass-Join-K多分段匹配的相似性连接算法》 (Pass-Join-K: a similarity join algorithm based on multi-segment matching).
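
One simple way to reduce the number of substrings is a position filter, sketched below in Python. It keeps only substrings of s that have the segment's length and start within tau positions of the segment's start in r, because a larger shift would already need more than tau insertions or deletions. This is only the basic shift-based idea; the selection rules in the paper may be tighter, and the function name and 0-based positions are assumptions made for illustration.

```python
def select_substrings(s: str, seg_start: int, seg_len: int, tau: int) -> list:
    """Substrings of s with length seg_len starting within tau of seg_start."""
    lo = max(0, seg_start - tau)
    hi = min(len(s) - seg_len, seg_start + tau)
    return [(p, s[p:p + seg_len]) for p in range(lo, hi + 1)]


s = "abdefghk"
tau = 3
# Segment "jk" of r = "abcdefghijk" starts at index 9 and has length 2.
print(select_substrings(s, 9, 2, tau))  # [(6, 'hk')]
# Segment "abc" starts at index 0 and has length 3.
print(select_substrings(s, 0, 3, tau))  # [(0, 'abd'), (1, 'bde'), (2, 'def'), (3, 'efg')]
```

Without the filter, s would contribute seven substrings of length 2 and six of length 3 for these two segments.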



Verification

DP (dynamic programming)

• $D(m,n) = \min\big(D(m,n-1)+1,\ D(m-1,n)+1,\ D(m-1,n-1)+\mathrm{flag}\big)$, where flag = 0 when $s_m = r_n$ and flag = 1 otherwise; s and r are both strings.
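
The recurrence can be written out directly as a table-filling routine. The sketch below is the textbook O(mn) version in Python, with the usual base cases D(i,0) = i and D(0,j) = j; it does not include any threshold-based shortcuts the authors may use.

```python
def edit_distance(s: str, r: str) -> int:
    """Fill the DP table D, where D[i][j] is the distance between s[:i] and r[:j]."""
    m, n = len(s), len(r)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        D[0][j] = j  # insert all j characters of r[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            flag = 0 if s[i - 1] == r[j - 1] else 1
            D[i][j] = min(D[i][j - 1] + 1,
                          D[i - 1][j] + 1,
                          D[i - 1][j - 1] + flag)
    return D[m][n]


def is_similar(s: str, r: str, tau: int) -> bool:
    """Verify a candidate pair against the threshold tau."""
    return edit_distance(s, r) <= tau


print(edit_distance("abcdefghijk", "abdefghk"))  # 3
print(is_similar("vldb", "pvldb", 1))            # True
```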



Verification


[Figure: r = “abc def ghi jk” and s with the matching segment “def” marked, annotated with Tau_left = 3 and Tau_right = 3 - 3 = 0.]



Inverted index tree in Hadoop

(abc,1,11,r) (def,2,11,r) (ghi,3,11,r) (jk,4,11,r)

[Figure: inverted index L11. Segments 1 to 4 of “abc def ghi jk” each point to r.]
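
As an in-memory illustration of how such records group into an index, the Python sketch below keys each record by (length, segment number, segment) and collects the string ids. In Hadoop this grouping would be done by the shuffle phase on the record key; the key layout and function name here are assumptions for illustration only.

```python
from collections import defaultdict

# Records in the slide's format: (segment, segmentNumber, length, id).
records = [("abc", 1, 11, "r"), ("def", 2, 11, "r"),
           ("ghi", 3, 11, "r"), ("jk", 4, 11, "r")]


def build_index(records):
    """Group string ids by (length, segment number, segment)."""
    index = defaultdict(list)
    for segment, seg_no, length, string_id in records:
        index[(length, seg_no, segment)].append(string_id)
    return index


index = build_index(records)
print(index.get((11, 2, "def"), []))  # ['r']
print(index.get((11, 2, "abd"), []))  # []
```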



Substrings in Hadoop

Suppose tau = 3, k = 1, and s = “abdefghk”, length(s) = 8. We have to generate more than 2*tau*(tau+k)*m records, where m is the average number of substrings per segment, such as (a,1,5,s), (a,1,6,s), (a,1,7,s), (ab,1,8,s), …, (ab,1,11,s), …



Data flows in Hadoop



How to improve the performance?

We know that as k increases, the number of pairs we need to verify decreases.

However, each time k increases by one, more than (tau+k+1)/(tau+k) times as many records have to be transferred in the Mapper phase.



Here we have two ways to improve our algorithm.

Find a dataset in which the number of candidate pairs is large enough, or make tau large enough.

Decrease the data generated in the Mapper phase.



Decrease the data flows




The inverted index record is formatted as (substring, segmentNumber, LengthInf, Id, flag).

• Each record's length is length(substring) + 4*sizeof(int), and the substring can sometimes be very long.

• Hash(substring) -> integer, so the record length becomes 5*sizeof(int).
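
The effect of hashing on record size can be sketched in Python as follows. The slides do not name a hash function, so `zlib.crc32` is used here purely as a stand-in, and the field order, 4-byte integers, and numeric id are likewise assumptions. Distinct substrings can collide under any such hash, so this sketch presumes that candidate pairs are still verified on the original strings.

```python
import struct
import zlib


def encode_record(substring: str, seg_no: int, length: int,
                  string_id: int, flag: int) -> bytes:
    """Pack a record as five 4-byte integers, replacing the substring by a hash."""
    hashed = zlib.crc32(substring.encode("utf-8"))  # unsigned 32-bit value
    return struct.pack("<I4i", hashed, seg_no, length, string_id, flag)


record = encode_record("abdefghk", 1, 11, 42, 0)
print(len(record))  # 20 bytes = 5 * sizeof(int), independent of substring length
```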




A substring generates several similar records, such as (a,1,5,s), (a,1,6,s), (a,1,7,s), …

• Each substring can generate tau+k similar records, so we combine them into one, for example (a,1,5,7,s). This reduces (tau+k)*4*sizeof(int) to 5*sizeof(int).
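
The combining step can be sketched in Python as a merge of records that differ only in their length field. The sketch assumes the combined record stores the smallest and largest length of a run of consecutive lengths, which is how the example (a,1,5,7,s) reads; the function name is illustrative.

```python
def combine_records(records):
    """Merge (substring, segNo, length, id) records with consecutive lengths
    into (substring, segNo, minLength, maxLength, id)."""
    groups = {}
    for substring, seg_no, length, string_id in records:
        groups.setdefault((substring, seg_no, string_id), []).append(length)

    combined = []
    for (substring, seg_no, string_id), lengths in groups.items():
        lengths = sorted(set(lengths))
        start = prev = lengths[0]
        for length in lengths[1:]:
            if length != prev + 1:  # gap found: close the current run
                combined.append((substring, seg_no, start, prev, string_id))
                start = length
            prev = length
        combined.append((substring, seg_no, start, prev, string_id))
    return combined


records = [("a", 1, 5, "s"), ("a", 1, 6, "s"), ("a", 1, 7, "s"), ("ab", 1, 8, "s")]
print(combine_records(records))  # [('a', 1, 5, 7, 's'), ('ab', 1, 8, 8, 's')]
```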



Decrease the data flows

So, by using the two steps above, we have reduced (length(substring)+4*sizeof(int))*(tau+k) to 5*sizeof(int).


Email: [email protected]
