Page 1

PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks

Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, Tianyi Wu
VLDB ’11

Summarized and presented by Kim Chungrim

Page 2

Contents

• Introduction
• Motivation & Terminology
  – Heterogeneous Information Network (HIN)
  – Network Schema
  – Meta Path
• Meta Path-based Similarity Search Framework
• PathSim: A Novel Meta Path-Based Similarity Measure
• Online Query Processing for Top-K Similarity Search
  – PathSim-baseline
  – PathSim-pruning
• Experiments
• Conclusions

Page 3

Introduction

• Logical networks involving multi-typed objects and multi-typed links that denote different relations are increasingly common
  – Bibliographic networks
  – Social media networks
  – Knowledge networks encoded in Wikipedia
• Similarity search in such networks is important, because similarity search is a primitive operation in databases and Web search engines.
• Similarity search has so far been studied only for traditional relational databases and homogeneous information networks
  – Personalized PageRank (P-PageRank)
  – SimRank
  – Random Walk (RW)
  – Pairwise Random Walk (PRW)
• There are few studies of similarity search on Heterogeneous Information Networks (HIN)

Page 4

Motivation

• When conventional similarity measures designed for homogeneous information networks are applied to an HIN, the subtle semantics carried by each type of link are ignored
• Limitations of current similarity/proximity measures defined on networks
  – They do NOT distinguish between different types of objects and different types of links in the network
  – Yet different types of objects and links carry different semantic meanings
    • Examples of such measures: personalized PageRank (P-PageRank), SimRank
• To distinguish the semantics of the different paths connecting two objects, a meta path-based similarity framework can be used

Page 5

Terminology

• HIN:
  – A network containing multi-typed objects, interconnected via multi-typed relationships
  – G = (V, E)
  – Examples
    • DBLP network: papers, authors, venues, terms
    • Flickr network: pictures, tags, users, groups
  – Sources
    • From online web services: online shopping websites, social media websites, bibliographic websites, …
    • From database systems: medical databases, university databases, police department databases, …

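• Network Schema:
  – Describes the entity types and the binary relations between them
  – $T_G = (\mathcal{A}, \mathcal{R})$, where $\mathcal{A}$ is the set of object types and $\mathcal{R}$ is the set of relations
  – Similar to the E-R model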

[Figure: an example DBLP network with nodes labeled Jim, P1, VLDB, Network, P2, Ann, and Data, and the DBLP network schema with object types Paper, Venue, Term, and Author]

Page 6

Terminology (cont.)

• Meta Path
  – Two objects can be connected via different connectivity paths
  – E.g., two authors can be connected by
    • “author-paper-author” (APA)
    • “author-paper-author-paper-author” (APAPA)
    • “author-paper-venue-paper-author” (APCPA, where C stands for the venue, i.e. the conference)
• Each connectivity path represents a different semantic meaning and implies different similarity semantics
• A meta path is a meta-level description of the topological connectivity between objects

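  – Given a network schema $T_G = (\mathcal{A}, \mathcal{R})$, a meta path $P$ can be defined as $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$
  – It can be considered as a new relation defined between types $A_1$ and $A_{l+1}$

[Figure: example DBLP network with nodes labeled Jim, P1, VLDB, Network, P2, Ann, and Data]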
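As a minimal sketch of the definition above, a network schema can be stored as a set of typed relations and a candidate meta path checked against it. The one-letter type names and the `SCHEMA` set are assumptions modeled on the DBLP example (A = author, P = paper, C = venue, T = term); `is_valid_meta_path` and `endpoint_types` are hypothetical helper names.

```python
# Hypothetical DBLP-style schema: each pair (source type, target type) is an
# allowed relation. A = author, P = paper, C = venue, T = term (assumed letters).
SCHEMA = {("A", "P"), ("P", "A"), ("P", "C"), ("C", "P"), ("P", "T"), ("T", "P")}


def is_valid_meta_path(path, schema=SCHEMA):
    """Return True if every consecutive pair of types in `path` is a schema relation."""
    return len(path) >= 2 and all((a, b) in schema for a, b in zip(path, path[1:]))


def endpoint_types(path):
    """A meta path defines a new relation between its first and last types."""
    return path[0], path[-1]


print(is_valid_meta_path("APCPA"))  # authors linked through a shared venue
print(is_valid_meta_path("AC"))     # invalid here: no direct author-venue relation
print(endpoint_types("APCPA"))
```

In this toy schema authors reach venues only through papers, so "AC" is rejected while "APC" is accepted.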

Page 7

Meta Path-based Similarity Search Framework

• Similarity definition
  – Meta path × similarity measure
• Conventional similarity measures
  – Path count: the number of path instances $p$ between $x$ and $y$ following $P$
    $s(x,y) = |\{p : p \in P\}|$
  – Random walk: the probability of a random walk that starts from $x$ and ends at $y$ following meta path $P$, which is the sum of the probabilities $Prob(p)$ of all the path instances $p$
    $s(x,y) = \sum_{p \in P} Prob(p)$
  – Pairwise random walk: for a meta path $P$ that can be decomposed into two shorter meta paths of the same length, $P = (P_1 P_2)$, the probability that walks starting from $x$ and from $y$ reach the same middle object $z$
    $s(x,y) = \sum_{(p_1 p_2) \in (P_1 P_2)} Prob(p_1)\, Prob(p_2^{-1})$, where $p_2^{-1}$ is the inverse of $p_2$

[Figure: a path instance $p$ connecting $x$ and $y$; two path instances $p_1$ and $p_2$ connecting $x$ and $y$ through a middle object $z$]
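The three conventional measures can be compared on a tiny example. The sketch below assumes the meta path APA (author-paper-author) over a made-up author-paper network; the data and the function names are illustrative only and are not taken from the paper.

```python
from collections import defaultdict

# Toy author -> papers adjacency (hypothetical data, for illustration only).
AUTHOR_PAPERS = {
    "a0": ["p0", "p1"],
    "a1": ["p1", "p2"],
    "a2": ["p2", "p3"],
}

# Inverse relation: paper -> authors.
PAPER_AUTHORS = defaultdict(list)
for author, papers in AUTHOR_PAPERS.items():
    for paper in papers:
        PAPER_AUTHORS[paper].append(author)


def path_count(x, y):
    """Number of A-P-A path instances between x and y (their shared papers)."""
    return sum(1 for p in AUTHOR_PAPERS[x] if y in PAPER_AUTHORS[p])


def random_walk(x, y):
    """Probability that a walk x -> paper -> author ends at y."""
    total = 0.0
    for p in AUTHOR_PAPERS[x]:
        if y in PAPER_AUTHORS[p]:
            total += (1 / len(AUTHOR_PAPERS[x])) * (1 / len(PAPER_AUTHORS[p]))
    return total


def pairwise_random_walk(x, y):
    """Probability that one-step walks from x and from y meet at the same paper z."""
    return sum(
        (1 / len(AUTHOR_PAPERS[x])) * (1 / len(AUTHOR_PAPERS[y]))
        for z in AUTHOR_PAPERS[x]
        if z in AUTHOR_PAPERS[y]
    )


print(path_count("a0", "a1"), random_walk("a0", "a1"), pairwise_random_walk("a0", "a1"))
```

Path count and pairwise random walk are symmetric in this setting, whereas the random-walk score generally is not, because each step is normalized by the degree of the node the walk leaves.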

Page 8

PathSim: A Novel Meta Path-Based Similarity Measure

• Similarity in terms of “peers”
  – Two similar peer objects should not only be strongly connected, but also share comparable visibility.
• Path count and random walk (RW)
  – Favor highly visible objects (objects with large degrees)
• Pairwise random walk (PRW)
  – Favors pure objects (objects with highly skewed scatterness in their in-links or out-links)
• PathSim
  – Favors “peers” (objects with similar visibility and strong connectivity under the given meta path)

Page 9

PathSim: A Novel Meta Path-Based Similarity Measure (cont)

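• Restricted to round-trip meta paths
  – A round-trip meta path is a path of the form $P = (P_l P_l^{-1})$
  – Guarantees a symmetric relation
  – $s(x,y) = \dfrac{2 \times |\{p_{x \to y} : p_{x \to y} \in P\}|}{|\{p_{x \to x} : p_{x \to x} \in P\}| + |\{p_{y \to y} : p_{y \to y} \in P\}|}$

[Figure: authors Mike and Jim linked to venues SIGMOD and VLDB, with path counts Mike–SIGMOD 2, Mike–VLDB 1, Jim–SIGMOD 50, Jim–VLDB 20]

$s(\text{Mike}, \text{Jim}) = \dfrac{2 \times (2 \times 50 + 1 \times 20)}{(2 \times 2 + 1 \times 1) + (50 \times 50 + 20 \times 20)} = \dfrac{240}{2905} \approx 0.0826$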
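The Mike and Jim example can be reproduced in a few lines. This is a sketch under one assumption: the assignment of the counts 2, 1, 50, 20 to authors and venues is reconstructed from the arithmetic on the slide, and the names `PAPERS_PER_VENUE`, `round_trip_count`, and `pathsim` are illustrative.

```python
# Path counts along the half path (author -> venue) for the Mike/Jim example.
# The author/venue assignment is an assumption reconstructed from the slide.
PAPERS_PER_VENUE = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
}


def round_trip_count(x, y, counts=PAPERS_PER_VENUE):
    """Round-trip path instances between x and y: dot product of their count vectors."""
    return sum(c * counts[y].get(venue, 0) for venue, c in counts[x].items())


def pathsim(x, y, counts=PAPERS_PER_VENUE):
    """PathSim: twice the x-y path count over the sum of the two self path counts."""
    denom = round_trip_count(x, x, counts) + round_trip_count(y, y, counts)
    return 2 * round_trip_count(x, y, counts) / denom if denom else 0.0


print(round(pathsim("Mike", "Jim"), 4))  # 0.0826
```

Although Jim shares far more raw paths with Mike than Mike shares with himself (120 versus 5), the large gap in visibility keeps the score low, which is the intended "peer" behavior.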

Page 10


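• Properties of PathSim
  – Symmetric: $s(x,y) = s(y,x)$
  – Self-maximum: $s(x,y) \in [0,1]$ and $s(x,x) = 1$
  – Balance of visibility: $s(x_i, x_j) \le \dfrac{2}{\sqrt{M_{ii}/M_{jj}} + \sqrt{M_{jj}/M_{ii}}}$, where $M_{ii}$ and $M_{jj}$ are the visibilities of $x_i$ and $x_j$ (the diagonal entries of the commuting matrix $M$)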
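One way to see why the balance-of-visibility bound holds is the Cauchy-Schwarz inequality. The short derivation below is a sketch, assuming the commuting matrix of the round-trip path factors as $M = W W^T$ with $W$ the commuting matrix of the half path; the symbol $W$ and the row vectors $w_i$ are notation introduced here for illustration.

```latex
% Assumption: M = W W^T, so M_{ij} = w_i \cdot w_j for rows w_i, w_j of W.
\begin{align*}
s(x_i, x_j) &= \frac{2 M_{ij}}{M_{ii} + M_{jj}} \\
  &\le \frac{2 \sqrt{M_{ii} M_{jj}}}{M_{ii} + M_{jj}}
  && \text{since } M_{ij} = w_i \cdot w_j \le \lVert w_i \rVert \, \lVert w_j \rVert = \sqrt{M_{ii} M_{jj}} \\
  &= \frac{2}{\sqrt{M_{ii}/M_{jj}} + \sqrt{M_{jj}/M_{ii}}}
  && \text{dividing numerator and denominator by } \sqrt{M_{ii} M_{jj}}
\end{align*}
```

With visibilities 5 and 2900 (the Mike and Jim counts), the bound evaluates to about 0.0829, just above the actual score of 0.0826: the more unequal the visibilities, the smaller the maximum similarity two objects can reach.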

Page 11


• Comparison with other measures.

Page 12

Online Query Processing for Top-K Similarity Search

• The top-k similarity search problem under PathSim
  – Given an HIN $G$ and its network schema $T_G$, and a round-trip meta path $P = (P_l P_l^{-1})$, top-k similarity search is defined as:
  – For a given query object $x \in A_1$, find the sorted $k$ objects $y$ of the same type $A_1$ that are most similar to $x$ under the PathSim similarity definition
• Major issues for online computation
  – Very large matrix multiplication: the commuting matrix has to be computed
  – Computing the commuting matrix online is too time-consuming
  – Full materialization of the commuting matrix is also expensive in time and space
• Solution
  – Partially materialize commuting matrices for short meta paths, and concatenate them online to obtain longer ones for a given query
  – Materialize the commuting matrix $M_{P_l}$ for the meta path $P_l$
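The partial-materialization idea can be sketched as follows: store only the half-path matrix offline, and at query time compute just the one row of the full commuting matrix that the query needs. The sparse dict-of-dicts layout, the toy counts, and the names `M_HALF`, `transpose`, and `query_row` are assumptions for illustration, not the paper's data structures.

```python
# Hypothetical materialized commuting matrix for the half path (author -> venue),
# stored sparsely as {row object: {column object: path count}}.
M_HALF = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Ann": {"KDD": 3},
}


def transpose(matrix):
    """Build the inverted index {column: {row: value}}; done once, offline."""
    out = {}
    for row, cols in matrix.items():
        for col, val in cols.items():
            out.setdefault(col, {})[row] = val
    return out


M_HALF_T = transpose(M_HALF)


def query_row(x):
    """Online step: compute only row x of M = M_half * M_half^T."""
    row = {}
    for col, val in M_HALF[x].items():
        for y, val_y in M_HALF_T[col].items():
            row[y] = row.get(y, 0) + val * val_y
    return row


print(query_row("Mike"))
```

Only objects that share at least one column with the query appear in the resulting row, so the same traversal also yields the candidate set for the query.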

Page 13

Online Query Processing for Top-K Similarity Search – PathSim-baseline

• Find the candidates by traversing the network from the query object $x$ following meta path $P$
• For each candidate $y$, calculate $s(x,y)$ using the partial commuting matrix $M_{P_l}$ (the commuting matrix of the half path $P_l$)
  – Calculate $M_{xy} = M_{P_l}(x,:)\, M_{P_l}(y,:)^T$ and scale it by the sum of visibilities, $M_{xx} + M_{yy}$
  – $M_{xx}$ and $M_{yy}$ can be pre-computed from $M_{P_l}$ and stored
• Sort the candidates $y$ by $s(x,y)$ and return the top-k objects
• This is still very time-consuming if the candidate set is very large!
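The baseline steps above can be sketched in plain Python. This is an illustration under stated assumptions: the counts in `M_HALF` are made up, candidates are found by scanning every row rather than by traversing the network, and `pathsim_baseline` is a hypothetical name.

```python
import heapq

# Hypothetical half-path commuting matrix (author -> venue path counts).
M_HALF = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Mary": {"SIGMOD": 2, "ICDE": 1},
    "Ann": {"KDD": 3},
}


def dot(u, v):
    """Sparse dot product of two {column: count} rows."""
    return sum(val * v.get(col, 0) for col, val in u.items())


def pathsim_baseline(x, k, m_half=M_HALF):
    """Score every connected candidate exactly, then keep the k best."""
    # Visibilities M(y, y); these would be pre-computed offline in practice.
    diag = {y: dot(row, row) for y, row in m_half.items()}
    scores = []
    for y, row in m_half.items():  # simplification: full scan instead of traversal
        m_xy = dot(m_half[x], row)
        if m_xy > 0:  # y is reachable from x along the round-trip meta path
            scores.append((2 * m_xy / (diag[x] + diag[y]), y))
    return heapq.nlargest(k, scores)


print(pathsim_baseline("Mike", 2))
```

The query object itself is kept in the result with score 1, and the cost is one exact dot product per candidate, which is what the pruning variant tries to avoid.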

Page 14

Online Query Processing for Top-K Similarity Search – PathSim-pruning

• The PathSim-pruning algorithm prunes candidates that are not promising
• Offline: generate co-clusters from the partial commuting matrix and store statistics for each block, to be used for deriving upper bounds on similarity
• Online: for each query
  – Calculate the upper bound of the similarity between the query object and each candidate cluster; prune the whole cluster if it is not promising
  – Calculate the upper bound of the similarity between the query and each candidate in the cluster; prune the candidate if it is not promising
  – Calculate the exact similarity between the query and each remaining candidate, and update the top-k list
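The prune-then-verify pattern can be illustrated with a much simpler bound than the one used by PathSim-pruning. The sketch below does not implement co-clustering or block statistics; it swaps in the per-candidate Cauchy-Schwarz bound $2\sqrt{M_{xx} M_{yy}} / (M_{xx} + M_{yy})$, which needs only the stored visibilities. The data and the name `pathsim_pruned` are hypothetical.

```python
import heapq
import math

# Hypothetical half-path commuting matrix (author -> venue path counts).
M_HALF = {
    "Mike": {"SIGMOD": 2, "VLDB": 1},
    "Jim": {"SIGMOD": 50, "VLDB": 20},
    "Mary": {"SIGMOD": 2, "ICDE": 1},
    "Bob": {"SIGMOD": 3, "VLDB": 1},
    "Ann": {"KDD": 3},
}


def dot(u, v):
    """Sparse dot product of two {column: count} rows."""
    return sum(val * v.get(col, 0) for col, val in u.items())


def pathsim_pruned(x, k, m_half=M_HALF):
    """Top-k search that skips candidates whose upper bound cannot beat the k-th best."""
    diag = {y: dot(row, row) for y, row in m_half.items()}  # offline in practice

    def upper_bound(y):
        # Cauchy-Schwarz bound; a stand-in for the co-cluster-based bound.
        return 2 * math.sqrt(diag[x] * diag[y]) / (diag[x] + diag[y])

    top = []        # min-heap of (score, object), at most k entries
    evaluated = 0   # number of exact similarity computations
    for y in sorted(m_half, key=upper_bound, reverse=True):
        if len(top) == k and upper_bound(y) <= top[0][0]:
            break  # every remaining candidate has an even smaller bound
        evaluated += 1
        score = 2 * dot(m_half[x], m_half[y]) / (diag[x] + diag[y])
        if score > 0:
            if len(top) < k:
                heapq.heappush(top, (score, y))
            else:
                heapq.heappushpop(top, (score, y))
    return sorted(top, reverse=True), evaluated


print(pathsim_pruned("Mike", 2))
```

On this toy data the query for Mike evaluates four of the five candidates exactly and never touches Jim, whose visibility is too different from Mike's to reach the top 2. A cluster-level bound, as in the slides, would apply the same test to whole groups of candidates at once.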

Page 15

Online Query Processing for Top-K Similarity Search – PathSim-pruning

Page 16

Experiments

• The DBLP network
  – As of Nov. 2009
  – Contains over 710K authors, 1.2M papers, 5K venues (conferences/journals), and around 70K terms appearing more than once (stopwords have been removed)
  – Called the full DBLP dataset
• Two subsets were created
  – DBIS dataset: contains all 464 venues and the top-5000 authors from the database and information system area
  – 4-area dataset: contains 20 venues and the top-5000 authors from 4 areas: database, data mining, machine learning, and information retrieval
    • Cluster labels are given for all 20 venues and a subset of 1713 authors

Page 17

Experiment - Effectiveness

• Labeled the top-15 results for 15 queries of venue type in the DBIS dataset
  – Each result object was labeled with a relevance score:
    • 0: non-relevant
    • 1: some-relevant
    • 2: very-relevant
• Used nDCG (normalized discounted cumulative gain) to evaluate the quality of each ranking algorithm
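For reference, nDCG over graded labels like the 0/1/2 scores above can be computed as below. This sketch assumes one common variant (linear gain, log2 position discount, ideal ranking taken from the same judged list); the slides do not say which variant was used.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount (one common variant)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the ranking divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


# Relevance labels of a returned ranking, in rank order (made-up example).
print(ndcg([2, 2, 1, 0]))  # already ideally ordered
print(ndcg([0, 1, 2]))     # best result ranked last
```

A ranking that places the most relevant results first scores 1, and any misordering lowers the score toward 0.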

Page 18

Experiment - Effectiveness

Page 19

Experiment - Efficiency

Page 20

Experiment - Efficiency

Page 21

Conclusion & Contribution

• The authors
  – Defined a meta path-based similarity framework for HINs
  – Proposed a new measure called PathSim, which is able to detect peer objects for the given meta path
  – Proposed an efficient co-clustering-based online search algorithm to support top-k search

Page 22

Summary / Discussion / Future work

• Network inference procedure assumes ad-hoc edge filtering
• Introduced a threshold on edges and a family of networks to find an optimal threshold for a certain prediction task
  – The prediction accuracies peak in a non-obvious yet relatively narrow threshold range
• Tested on too few datasets
• Not enough to give a solid conclusion
• Apply the method to a variety of networks
• Test various thresholds for more interests
