An Overview of Similarity Query Processing 2014. 2. 26. Jongik Kim, Division of Computer Science and Engineering, Chonbuk National University

Table of Contents

01. Applications of similarity query processing

02. Problem Formulation

03. String Decomposition

04. Similarity Function

05. A naïve approach

06. Overlap Similarity

07. Similarity Query Processing with Inverted lists

08. Similarity Function Revisited

09. Filter and Verification Framework

10. Prefix Filtering based Approach

11. Exploiting Document Frequency Ordering

Some examples and figures in this presentation are taken from the following materials:

Marios Hadjieleftheriou and Chen Li, Efficient Approximate Search on String Collections (tutorial), ICDE 2009 and VLDB 2009

Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Efficient Similarity Joins for Near Duplicate Detection, WWW 2008 (slide)

Jongik Kim and Hongrae Lee, Efficient Exact Similarity Searches using Multiple Token Orderings, ICDE 2012 (slide)

Applications of similarity query processing (1/8)

Actual queries gathered by Google

Web Search

Should be “Niels Bohr”

Data Integration and data cleaning

Example: records whose values differ only by typos (informix vs. infromix, microsoft vs. mcrosoft, …)

Duplicate (Web) Document Detection

Identify spam

SPAM TEMPLATE

Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal e-mail address attached to ticket number 653-908-321-675 with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of 960.000.00 Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!!

Sincerely yours, <NAME> <AFFILIATION>

Detect Plagiarism

Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.

Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.

Recommendation of friends in an SNS service

The friends of a person can be represented as a binary vector

Friends vector: 1 0 0 1 1 0 0 1
Friends vector: 1 0 0 1 1 1 0 1

Read (a fragment of genome sequence) Alignment

Reference sequence: GCTGATGTGCCGCCTCACTCCGGTGG …

Short reads:
CACTCCTGTGG
CTCACTCCTGTGG
GCTGATGTGCCACCTCA
GATGTGCCACCTCACTC
GTGCCGCCTCACTCCTG
CTCCTGTGG

Supported by Oracle Text

CREATE TABLE engdict(word VARCHAR(20), len INT);

Create preferences for text indexing:

begin
  ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
end;
/

CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
  INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');

Usage:

SELECT * FROM engdict

WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;

Limitation: cannot handle errors in the first letters:

Katherine versus Catherine

Query Relaxation

Problem Formulation (1/2)

Find strings similar to a given string

Similar to: a domain-specific function returns a similarity value between two strings

Common similarity functions: Jaccard coefficient, cosine similarity, Dice similarity, edit distance

Problem Formulation (2/2)

Functions require set data

String Decomposition

Word tokens for a long string (e.g. a web page)

x = “yes as soon as possible”
y = “as soon as possible please”

word:   yes  as  soon  as1  possible  please
token:  A    B   C     D    E         F

(as1 denotes the second occurrence of “as”)

x = {A, B, C, D, E}
y = {B, C, D, E, F}

q-gram tokens for a short string (e.g. a keyword query)

x = “universal”

G(x, 2) = {un, ni, iv, ve, er, rs, sa, al}
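The two decompositions can be sketched in Python as follows. This is a minimal illustration, not code from the presentation: the function names `word_tokens` and `qgrams` are made up here, and numbering a repeated word by its occurrence (so the second "as" becomes "as1") is an assumption modelled on the token table.

```python
from collections import Counter


def word_tokens(text):
    """Split text into word tokens. A repeated word gets an occurrence
    suffix (the second "as" becomes "as1") so the result is a proper set."""
    seen = Counter()
    tokens = set()
    for word in text.split():
        count = seen[word]
        tokens.add(word if count == 0 else f"{word}{count}")
        seen[word] += 1
    return tokens


def qgrams(text, q):
    """Return all overlapping substrings of length q, in order of position."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


x = word_tokens("yes as soon as possible")
y = word_tokens("as soon as possible please")
print(sorted(x & y))           # the four tokens shared by x and y
print(qgrams("universal", 2))  # the eight 2-grams of "universal"
```

`qgrams` returns a list rather than a set, so a q-gram that occurs twice in a string is kept twice; whether duplicates should be kept or numbered depends on the similarity function used afterwards.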

Similarity Function

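For x = {A, B, C, D, E} and y = {B, C, D, E, F}:

Jaccard similarity: $J(x, y) = \frac{|x \cap y|}{|x \cup y|} = \frac{4}{6} \approx 0.67$

Cosine similarity: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x||y|}} = \frac{4}{5} = 0.8$

Dice similarity: $D(x, y) = \frac{2|x \cap y|}{|x| + |y|} = \frac{8}{10} = 0.8$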
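The three set-based functions translate directly into Python set operations. The sketch below assumes non-empty token sets (an empty set would cause a division by zero) and checks the values for the example sets x and y.

```python
import math


def jaccard(x, y):
    return len(x & y) / len(x | y)


def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))


x = set("ABCDE")  # {A, B, C, D, E}
y = set("BCDEF")  # {B, C, D, E, F}
print(round(jaccard(x, y), 2), cosine(x, y), dice(x, y))  # 0.67 0.8 0.8
```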

Edit Distance

ED(x, y) = minimum number of edit operations (insertion, deletion, substitution) to change x to y

x: Tom Hanks
y: Ton Hank
ED(x, y) = 2
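Edit distance is usually computed with the textbook dynamic-programming recurrence. The sketch below is one common formulation that keeps a single row of the table at a time; the slides do not prescribe a particular algorithm, so treat it as an illustration.

```python
def edit_distance(x, y):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(y) + 1))  # distance from "" to each prefix of y
    for i, cx in enumerate(x, start=1):
        curr = [i]                  # distance from x[:i] to ""
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The two operations in the example are the substitution of "m" by "n" and the deletion of the final "s".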

A naïve approach

Given a collection of strings C, a query string x, and a threshold t of a similarity function sim,

1. decompose each string in C and the query string into tokens.
2. output those strings y ∈ C such that sim(x, y) ≥ t.

This approach evaluates sim against every string in C for every query, so it is inefficient when C is large.
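As a baseline, the naïve approach is a linear scan. The sketch below assumes 2-gram token sets and Jaccard similarity; the function names and the tiny collection are illustrative choices, not part of the slides.

```python
def qgram_set(text, q=2):
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(x, y):
    return len(x & y) / len(x | y)


def naive_search(collection, query, t, sim=jaccard, q=2):
    """Compare the query against every string: one sim() call per string in C."""
    x = qgram_set(query, q)
    return [s for s in collection if sim(x, qgram_set(s, q)) >= t]


C = ["area", "artisan", "artist", "tisk"]
print(naive_search(C, "artist", 0.5))  # ['artisan', 'artist']
```

Every query costs |C| similarity computations, which is the cost the index-based methods try to avoid.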

Overlap Similarity (1/2)

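Overlap similarity: $O(x, y) = |x \cap y|$ (for the example sets x and y, $O(x, y) = 4$)

Given a similarity threshold t, each similarity constraint can be rewritten as a constraint on the overlap:

Jaccard: $J(x, y) = \frac{|x \cap y|}{|x \cup y|} = \frac{O(x, y)}{|x| + |y| - O(x, y)} \ge t \iff O(x, y) \ge \frac{t}{1 + t}(|x| + |y|)$

Cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x||y|}} \ge t \iff O(x, y) \ge t\sqrt{|x||y|}$

Dice: $D(x, y) = \frac{2|x \cap y|}{|x| + |y|} \ge t \iff O(x, y) \ge \frac{t(|x| + |y|)}{2}$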
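The conversion from a similarity threshold to an overlap threshold can be sketched as a small helper. `required_overlap` is a hypothetical name, and the threshold 0.6 is an arbitrary value picked for the example.

```python
import math


def required_overlap(sim, size_x, size_y, t):
    """Smallest (real-valued) overlap O(x, y) for which sim(x, y) >= t."""
    if sim == "jaccard":
        return t / (1 + t) * (size_x + size_y)
    if sim == "cosine":
        return t * math.sqrt(size_x * size_y)
    if sim == "dice":
        return t * (size_x + size_y) / 2
    raise ValueError(f"unknown similarity function: {sim}")


x, y = set("ABCDE"), set("BCDEF")
overlap = len(x & y)  # O(x, y) = 4
for name in ("jaccard", "cosine", "dice"):
    bound = required_overlap(name, len(x), len(y), 0.6)
    print(name, round(bound, 2), overlap >= bound)
```

With t = 0.6 all three bounds are below 4, which agrees with the similarity values 0.67, 0.8 and 0.8 of the example pair.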

Overlap Similarity (2/2)

Given an edit distance threshold d,

$ED(x, y) \le d \Rightarrow O(G(x, q), G(y, q)) \ge \max(|G(x, q)|, |G(y, q)|) - d \cdot q$

A single edit operation affects at most q q-grams, so d edit operations on x can mutate at most d × q q-grams of x.

x = “universal” and G(x, 2) = {un, ni, iv, ve, er, rs, sa, al}

2 edit operations on x mutate at most 2 × 2 = 4 q-grams.

Hence, y should contain at least |G(x, 2)| − 2 × 2 = 4 of the q-grams in G(x, 2).
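The q-gram bound gives a cheap necessary condition for ED(x, y) ≤ d. The sketch below counts shared q-grams as a multiset intersection, so repeated q-grams are not undercounted; that choice and the function names are assumptions of this example.

```python
from collections import Counter


def qgrams(text, q):
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def min_shared_qgrams(x, y, q, d):
    """Lower bound on the number of shared q-grams when ED(x, y) <= d."""
    return max(len(qgrams(x, q)), len(qgrams(y, q))) - d * q


def passes_count_filter(x, y, q, d):
    """Necessary (not sufficient) condition for ED(x, y) <= d."""
    shared = sum((Counter(qgrams(x, q)) & Counter(qgrams(y, q))).values())
    return shared >= min_shared_qgrams(x, y, q, d)


print(min_shared_qgrams("universal", "unversal", 2, 2))  # 8 - 2 * 2 = 4
print(passes_count_filter("universal", "unversal", 2, 1))
print(passes_count_filter("universal", "artist", 2, 2))
```

A pair that passes the filter still has to be verified with a real edit distance computation, because sharing enough q-grams does not guarantee a small edit distance.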

Similarity Query Processing with Inverted lists

ID  String   Record (token set)
1   area     {ar, re, ea}
2   artisan  {ar, rt, ti, is, sa, an}
3   artist   {ar, rt, ti, is, st}
4   tisk     {ti, is, sk}
…   …        …

Make inverted lists: each token maps to the list of IDs of the strings that contain it.

ar: 1, 2, 3
re: 1
ea: 1
is: 2, 3, 4
sk: 4
…

Query: “artist” = {ar, rt, ti, is, st}, overlap threshold: 4

Merge the inverted lists of the query tokens to count occurrences.

Answers of the query: 2: “artisan”, 3: “artist”

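Merge Algorithm – HeapMerge

The heads of the inverted lists are kept in a min-heap; IDs are popped in ascending order and the occurrences of each ID are counted.

Count threshold: count ≥ t, with t = 3

1: count 2 < t (X)
2: count 3 = t (O)
…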
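A heap-based merge over inverted lists can be sketched as below. The sketch relies on Python's `heapq.merge`, which maintains a min-heap over the heads of the sorted input lists; the function names and the four-string collection are illustrative, and the code is not taken from the presentation.

```python
import heapq
from collections import defaultdict


def qgram_set(text, q=2):
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def build_inverted_lists(strings, q=2):
    """Map each token to the ascending list of IDs of strings containing it."""
    lists = defaultdict(list)
    for sid in sorted(strings):
        for token in qgram_set(strings[sid], q):
            lists[token].append(sid)
    return lists


def heap_merge(lists, threshold):
    """Return the IDs that occur in at least `threshold` of the sorted lists."""
    result = []
    current, count = None, 0
    # heapq.merge yields all IDs in ascending order, so equal IDs are adjacent.
    for sid in heapq.merge(*lists):
        if sid == current:
            count += 1
        else:
            if current is not None and count >= threshold:
                result.append(current)
            current, count = sid, 1
    if current is not None and count >= threshold:
        result.append(current)
    return result


strings = {1: "area", 2: "artisan", 3: "artist", 4: "tisk"}
index = build_inverted_lists(strings)
query_lists = [index.get(tok, []) for tok in qgram_set("artist")]
print(heap_merge(query_lists, 4))  # [2, 3]
```

The merge reads every element of every query list once, so its cost is dominated by the longest lists; that is the motivation for the filtering techniques that follow.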

Similarity Function Revisited

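To determine the overlap threshold, we need to know the size of y, which varies from string to string in the collection.

Given a query x with a similarity threshold t, the following lower bounds, which depend only on x, hold for all y:

Jaccard: $J(x, y) \ge t \Rightarrow |x \cap y| \ge \frac{t}{1 + t}(|x| + |y|)$, and $t \le \frac{|x \cap y|}{|x \cup y|} \le \frac{|y|}{|x|}$ gives $|y| \ge t|x|$; hence $O(x, y) \ge t|x|$

Cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x||y|}} \ge t \Rightarrow O(x, y) \ge t^2|x|$

Dice: $D(x, y) = \frac{2|x \cap y|}{|x| + |y|} \ge t \Rightarrow O(x, y) \ge \frac{t|x|}{2 - t}$

Edit distance: $ED(x, y) \le t \Rightarrow O(G(x, q), G(y, q)) \ge |G(x, q)| - t \cdot q$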
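The query-only bounds can be collected in one helper that returns an integer overlap lower bound. The name `overlap_lower_bound`, the string keys and the rounding-up step are assumptions of this sketch; rounding up is safe because an overlap is always an integer.

```python
import math


def overlap_lower_bound(sim, size_x, t, q=None):
    """Overlap lower bound that depends only on the query size |x|."""
    if sim == "jaccard":
        bound = t * size_x
    elif sim == "cosine":
        bound = t * t * size_x
    elif sim == "dice":
        bound = t * size_x / (2 - t)
    elif sim == "edit":
        # Here size_x is |G(x, q)| and t is the edit distance threshold.
        return max(size_x - t * q, 0)
    else:
        raise ValueError(f"unknown similarity function: {sim}")
    # Round up to an integer; the small epsilon guards against float noise.
    return math.ceil(bound - 1e-9)


print(overlap_lower_bound("jaccard", 5, 0.8))    # 4
print(overlap_lower_bound("cosine", 5, 0.8))     # 4
print(overlap_lower_bound("dice", 5, 0.8))       # 4
print(overlap_lower_bound("edit", 8, 2, q=2))    # 4
```

For edit distance the bound can drop to zero for short strings or large thresholds, in which case it filters nothing.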

Filter and Verification Framework

FILTER
Find those strings that share at least α tokens with the query string, where α is an overlap lower bound.

VERIFICATION
Verify each string found in the filtering stage by directly applying the similarity function.

The filtering stage can itself be split into two steps:

FILTER
Quickly generate initial candidates using a minimum constraint.

REFINEMENT
Refine the candidates using α.

Prefix Filtering based Approach

Query x = “artist” {ar, rt, ti, is, st} and overlap threshold α = 4

Inverted lists for the query tokens: ar, is, rt, st, ti

Sort the tokens by their document frequencies (document frequency ordering), i.e. sort the lists by their sizes.

Prefix Lists: the first |G(x, 2)| – α + 1 lists

Suffix Lists: the remaining α – 1 lists

Filtering Phase (the prefix filtering)
Merge the prefix lists to generate candidates.

Refinement Phase
Search the suffix lists for each candidate.
Each candidate is looked up in every suffix list to check whether the list contains it.
Binary search is used because suffix lists are usually very long.
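The two phases can be sketched as follows. The filter is safe because a string that appears in none of the prefix lists can appear in at most α – 1 lists overall. In this sketch the prefix lists are merged with a plain dictionary count instead of a heap, and the lists are the ones that the four example strings (area, artisan, artist, tisk) would produce; both are simplifications chosen for the illustration.

```python
from bisect import bisect_left


def contains(sorted_ids, sid):
    """Binary search for sid in an ascending list of IDs."""
    pos = bisect_left(sorted_ids, sid)
    return pos < len(sorted_ids) and sorted_ids[pos] == sid


def prefix_filter_search(query_lists, alpha):
    """Return the IDs that appear in at least alpha of the inverted lists."""
    if alpha > len(query_lists):
        return []
    # Shortest lists first, i.e. ascending document frequency.
    lists = sorted(query_lists, key=len)
    n_prefix = len(lists) - alpha + 1
    prefix, suffix = lists[:n_prefix], lists[n_prefix:]

    # Filtering phase: every answer must occur in at least one prefix list.
    counts = {}
    for ids in prefix:
        for sid in ids:
            counts[sid] = counts.get(sid, 0) + 1

    # Refinement phase: probe each suffix list by binary search.
    answers = []
    for sid in sorted(counts):
        total = counts[sid] + sum(contains(ids, sid) for ids in suffix)
        if total >= alpha:
            answers.append(sid)
    return answers


# Lists for the tokens ar, rt, ti, is, st of the query "artist".
query_lists = [[1, 2, 3], [2, 3], [2, 3, 4], [2, 3, 4], [3]]
print(prefix_filter_search(query_lists, 4))  # [2, 3]
```

Only the two shortest lists are scanned in full; the three longer lists are touched by a handful of binary searches.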

Exploiting Document Frequency Ordering (1/2)

General Goal: minimize the number of candidates initially generated by making use of the document frequency ordering

Query x = “artist” {ar, rt, ti, is, st} and overlap threshold α = 4

Sort the tokens by their document frequencies, so that the prefix lists are the shortest lists; merging them yields the candidates.

We can reduce
1. the time for merging, because the prefix lists are short
2. the number of candidates, and therefore the time for verifying candidates

Exploiting Document Frequency Ordering (2/2)

Query x = {w1, w2} and overlap threshold α = 2

w2 is the prefix list; the number of candidates is 5.

Partition: the total number of candidates is 0.

Observation

By partitioning a data set, we can artificially modify the document frequencies of tokens in each partition.
We evaluate a query in each partition and take the union of the results.
We can reduce the number of candidates by utilizing different token orderings among partitions.
Because partitions have different token orderings, we need to sort the tokens of a query record in each partition.
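The effect of partitioning on the candidate count can be reproduced with toy data. The inverted lists below are hypothetical: they were chosen so that the counts come out as 5 and 0, and they are not the lists from the slide's figure, which cannot be recovered from the text. The helper names are also made up for this sketch.

```python
def count_candidates(inverted_lists, query_tokens, alpha):
    """Number of IDs in the prefix lists under document frequency ordering."""
    lists = sorted((inverted_lists.get(tok, []) for tok in query_tokens), key=len)
    n_prefix = len(lists) - alpha + 1
    return len({sid for ids in lists[:n_prefix] for sid in ids})


def restrict(inverted_lists, partition):
    """Inverted lists of one partition: keep only the IDs it contains."""
    return {tok: [sid for sid in ids if sid in partition]
            for tok, ids in inverted_lists.items()}


# Hypothetical data: w1 and w2 never occur in the same record.
index = {"w1": [1, 2, 3, 4, 5, 6], "w2": [7, 8, 9, 10, 11]}
query, alpha = ["w1", "w2"], 2

# Whole data set: w2 has the lower document frequency and is the prefix list.
print(count_candidates(index, query, alpha))  # 5

# Two partitions: in each one, the token with frequency 0 becomes the prefix.
partitions = [set(range(1, 7)), set(range(7, 12))]
print(sum(count_candidates(restrict(index, p), query, alpha)
          for p in partitions))               # 0
```

Each partition sorts the query tokens by its own document frequencies, which is why the token ordering has to be recomputed per partition.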
