Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology. A novel document similarity measure based on earth mover’s distance. Presenter: Shao-Wei Cheng. Author: Xiaojun Wan. InfSci 2007.

Page 1: A novel document similarity measure  based on earth mover’s distance

Intelligent Database Systems Lab

國立雲林科技大學 National Yunlin University of Science and Technology

A novel document similarity measure based on earth mover’s distance

Presenter: Shao-Wei Cheng
Author: Xiaojun Wan

InfSci 2007

Page 2: A novel document similarity measure  based on earth mover’s distance

Outline

  • Motivation
  • Objective
  • Methodology
  • Experiments
  • Conclusion
  • Comments

Page 3: A novel document similarity measure  based on earth mover’s distance

Motivation

Measuring pair-wise document similarity is crucial for various text applications, including document clustering, document filtering, and nearest neighbor search.

Many similarity measures already exist:

  • VSM (vector space model): Cosine, Dice, Jaccard, Overlap
  • Information-theoretic measure
  • Retrieval models: BM25, NVSM, LM (language model)
  • OM-based (optimal matching, measured over subtopics): uses document structure information, but only with one-to-one matching

Page 4: A novel document similarity measure  based on earth mover’s distance

Objectives

  • Go beyond one-to-one matching: allow many-to-many matching.
  • Using more matching information yields a more natural measure.

Page 5: A novel document similarity measure  based on earth mover’s distance

Methodology

Framework

  • Document decomposition: TextTiling or sentence clustering
  • Similarity measure: the proposed EMD-based (earth mover’s distance) measure, which improves the OM-based measure to allow many-to-many matching

Page 6: A novel document similarity measure  based on earth mover’s distance

Methodology

TextTiling

  • Tokenization
  • Lexical score determination
  • Boundary identification
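The three TextTiling stages can be illustrated with a simplified sketch. It is not the full TextTiling algorithm and not the code used in the paper: it treats whole sentences as the comparison unit, and the window size `k`, the function names, and the cutoff (mean depth minus half a standard deviation) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Tokenization: lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(sentences, k=2):
    """Lexical score determination: compare up to k sentences on each side of every gap."""
    bags = [Counter(tokenize(s)) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - k):gap], Counter())
        right = sum(bags[gap:gap + k], Counter())
        scores.append(cosine(left, right))
    return scores


def depth_scores(scores):
    """Depth of each valley: drop from the nearest peak on the left plus on the right."""
    depths = []
    for i, score in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        depths.append((scores[left] - score) + (scores[right] - score))
    return depths


def find_boundaries(sentences, k=2):
    """Boundary identification: indices of sentences that start a new segment."""
    depths = depth_scores(gap_scores(sentences, k))
    if not depths:
        return []
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    cutoff = mean - std / 2  # assumed cutoff, chosen for illustration
    return [i + 1 for i, d in enumerate(depths) if d > 0 and d > cutoff]
```

With the four sentences "cats chase mice", "cats sleep often", "stocks rise today", "stocks fall tomorrow", the only deep valley lies between the second and third sentence, so `find_boundaries` returns `[2]`.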

Sentence clustering

  • Hierarchical agglomerative clustering algorithm.
  • Uses the average-link method to compute the similarity between clusters.
  • The merging threshold can be determined through cross-validation.
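A minimal sketch of average-link agglomerative clustering over sentences follows. The bag-of-words cosine similarity, the function names, and the default threshold value are assumptions for illustration; the slides only state that the threshold is tuned by cross-validation.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Bag-of-words term-frequency vector (an assumed sentence representation)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_link(cluster_a, cluster_b, vectors):
    """Average pairwise similarity between the members of two clusters."""
    total = sum(cosine(vectors[i], vectors[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def cluster_sentences(sentences, threshold=0.2):
    """Merge the two most similar clusters until no pair reaches the threshold."""
    vectors = [sentence_vector(s) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = average_link(clusters[a], clusters[b], vectors)
                if sim > best_sim:
                    best_sim, best_pair = sim, (a, b)
        if best_sim < threshold:
            break  # the merging threshold stops further merges
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters  # each cluster is a list of sentence indices (one subtopic)
```

Each returned cluster plays the role of one subtopic. Unlike TextTiling, the sentences in a cluster do not have to be adjacent in the document.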

Page 7: A novel document similarity measure  based on earth mover’s distance

Methodology

OM-based measure

  • Recast the similarity measure as an optimal matching problem.
  • Constraint of the optimal matching problem: no two edges share the same node.
  • Find the matching M (the best subset of the edge set E) that has the largest total weight.
  • The one-to-one matching may lose information.
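The optimal matching step can be sketched with a brute-force search over one-to-one assignments. This is only an illustration for small subtopic sets, not the solver used in the paper; the `weights` matrix (similarity between subtopic i of one document and subtopic j of the other) and the function name are assumptions.

```python
from itertools import permutations


def optimal_matching(weights):
    """Return (total_weight, pairs) of the best one-to-one matching.

    weights[i][j] is the non-negative edge weight between node i on the left
    and node j on the right. No two chosen edges share a node.
    """
    m, n = len(weights), len(weights[0])
    if m > n:
        # Match from the smaller side; swap the pairs back afterwards.
        transposed = [list(column) for column in zip(*weights)]
        total, pairs = optimal_matching(transposed)
        return total, sorted((i, j) for j, i in pairs)
    best_total, best_pairs = -1.0, []
    for columns in permutations(range(n), m):
        total = sum(weights[i][j] for i, j in enumerate(columns))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(columns))
    return best_total, best_pairs
```

When the two sides differ in size, the extra nodes on the larger side stay unmatched, which is one way the one-to-one constraint discards information.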

Page 8: A novel document similarity measure  based on earth mover’s distance

Methodology

EMD-based measure

  • Recast the similarity measure as a transportation problem.
  • The earth mover’s distance: find a flow F = [fij] that minimizes the overall cost.

The constraints:
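In the standard formulation of the earth mover's distance (assumed here, since the slide's own formulas are not reproduced), the flow must satisfy: fij >= 0; the flow leaving source i is at most its weight; the flow entering sink j is at most its weight; and the total flow equals the smaller of the two total weights. The EMD is then the minimum total cost divided by the total flow. The sketch below solves this transportation problem with a successive-shortest-path min-cost flow in pure Python. The names `supply`, `demand`, and `dist` are hypothetical: they stand for the subtopic weights of the two documents and a pairwise subtopic distance.

```python
def earth_movers_distance(supply, demand, dist, eps=1e-12):
    """Return (emd, flow) for the transportation problem via successive shortest paths."""
    m, n = len(supply), len(demand)
    source, sink = m + n, m + n + 1
    # Each edge is [target, residual capacity, cost, index of the reverse edge].
    graph = [[] for _ in range(m + n + 2)]

    def add_edge(u, v, capacity, cost):
        graph[u].append([v, capacity, cost, len(graph[v])])
        graph[v].append([u, 0.0, -cost, len(graph[u]) - 1])

    for i in range(m):
        add_edge(source, i, supply[i], 0.0)
        for j in range(n):
            add_edge(i, m + j, float("inf"), dist[i][j])
    for j in range(n):
        add_edge(m + j, sink, demand[j], 0.0)

    total_flow = total_cost = 0.0
    while True:
        # Bellman-Ford: cheapest path from source to sink in the residual graph.
        best = [float("inf")] * len(graph)
        prev = [None] * len(graph)
        best[source] = 0.0
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if best[u] == float("inf"):
                    continue
                for k, (v, capacity, cost, _rev) in enumerate(graph[u]):
                    if capacity > eps and best[u] + cost < best[v] - eps:
                        best[v], prev[v], changed = best[u] + cost, (u, k), True
            if not changed:
                break
        if prev[sink] is None:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck capacity along the path, then push that much flow.
        push, v = float("inf"), sink
        while v != source:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = sink
        while v != source:
            u, k = prev[v]
            edge = graph[u][k]
            edge[1] -= push
            graph[v][edge[3]][1] += push
            v = u
        total_flow += push
        total_cost += push * best[sink]

    # The flow on edge i -> j equals the residual capacity of its reverse edge.
    flow = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for v, _capacity, _cost, rev in graph[i]:
            if m <= v < m + n:
                flow[i][v - m] = graph[v][rev][1]
    emd = total_cost / total_flow if total_flow > 0 else 0.0
    return emd, flow
```

For supplies [0.6, 0.4], demands [0.3, 0.7], and distances [[0.1, 0.5], [0.6, 0.2]], the minimum-cost flow sends 0.3 from the first source to each sink and 0.4 from the second source to the second sink, giving an EMD of 0.26. The first source feeds two sinks, which is the many-to-many behaviour that one-to-one matching cannot express.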

Page 9: A novel document similarity measure  based on earth mover’s distance

Experiments

Performance comparison for different similarity measures. MAP: non-interpolated mean average precision.
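As a reminder of how the evaluation metric is computed, here is a minimal sketch of non-interpolated average precision and MAP. The input format (a list of 0/1 relevance flags in rank order per query) and the function names are assumptions; by default the sketch divides by the number of relevant documents found in the ranking, which matches the usual definition only when every relevant document appears in the list.

```python
def average_precision(relevance, total_relevant=None):
    """Non-interpolated average precision for one ranked list of 0/1 relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    denominator = total_relevant if total_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query average precision values."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

For the ranking [1, 0, 1, 0] the relevant documents sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2 = 5/6.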

Page 10: A novel document similarity measure  based on earth mover’s distance

Experiments

Influence of the document decomposition algorithm:

  • Sentence clustering algorithm
  • TextTiling

Page 11: A novel document similarity measure  based on earth mover’s distance

Conclusion

The proposed measure can overcome the one-to-one matching problem and the experimental results show the effectiveness and robustness of the EMD-based measure.

Future work

  • Combine the Cosine measure and the EMD-based measure in a re-ranking process.
  • Explore other document decomposition algorithms.

Page 12: A novel document similarity measure  based on earth mover’s distance

Comments

Advantage

  • Recasts document similarity measurement as another mathematical problem.

Drawback

Application

  • Clustering
  • Classification
  • Search engine
  • …