Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology

CONTOUR: an efficient algorithm for discovering discriminating subsequences

Jianyong Wang, Yuzhou Zhang, Lizhu Zhou, George Karypis, Charu C. Aggarwal

DMKD, Vol. 18, No. 1, 2009, pp. 1-29.

Presenter: Wei-Shen Tai

2009/3/11




N.Y.U.S.T.

I. M.


Outline

Introduction
Problem formulation
Efficiently mining summarization subsequences
Summarization subsequence based clustering
Empirical results
Conclusions
Comments


Motivation

Make frequent sequence mining more efficient.

Mining the complete set of frequent subsequences is very time-consuming for large sequence databases.

A straightforward way to obtain a subset of useful frequent subsequences is to apply any existing frequent sequence mining algorithm and then select from its output, which still pays the full enumeration cost.


Objective

Develop effective search space pruning methods.

Find the summarization subsequence that represents each original input sequence.


Problem formulation

Subsequence
If sequence Sα is contained in sequence Sβ, Sα is called a subsequence of Sβ.

Absolute support of a sequence
The number of input sequences in SDB that contain Sα, denoted by sup^SDB(Sα).

Summarization subsequences
A set of representative subsequences that serves as a concise summarization of the input sequences.

Internal similarity of micro-cluster Cλ

CABAC → BAC
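The containment and support definitions above can be sketched in Python as follows. This is a minimal illustration, assuming every sequence is a string of single-character events. The names `is_subsequence`, `support`, and the database `toy_sdb` are hypothetical and are not taken from the paper.

```python
def is_subsequence(sub, seq):
    """Return True if `sub` is contained in `seq` (order kept, gaps allowed)."""
    it = iter(seq)
    # `e in it` consumes the iterator up to the first match,
    # so later events can only match later positions.
    return all(e in it for e in sub)


def support(sdb, sub):
    """Absolute support: number of input sequences in `sdb` that contain `sub`."""
    return sum(1 for seq in sdb if is_subsequence(sub, seq))


# Hypothetical toy database, for illustration only.
toy_sdb = ["CABAC", "ABCB", "BAC", "ABBA"]

print(is_subsequence("BAC", "CABAC"))
print(support(toy_sdb, "BA"))
```

Each input sequence is counted at most once, no matter how many times the subsequence occurs inside it, which matches the "number of input sequences" wording of the definition.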


Efficiently mining summarization subsequences

Frequent subsequence enumeration
For each prefix, the mining algorithm builds its projected database and computes the set of locally frequent events.

Example with min_sup = 2: SDB|AA = {C, A}, but neither C nor A is locally frequent there, so they cannot be used to extend the prefix AA (AAC and AAA are not frequent).
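The prefix-by-prefix enumeration described above can be sketched as a generic prefix-projection miner, in the style popularized by PrefixSpan. This is not CONTOUR itself: it applies none of the pruning methods discussed on the following slides, and it assumes sequences are strings of single-character events. All function names are hypothetical.

```python
from collections import Counter


def project(projected_db, event):
    """For each sequence containing `event`, keep the suffix after its first occurrence."""
    return [s[s.index(event) + 1:] for s in projected_db if event in s]


def locally_frequent(projected_db, min_sup):
    """Events whose support inside the projected database reaches `min_sup`."""
    counts = Counter()
    for s in projected_db:
        counts.update(set(s))  # a sequence counts once per distinct event
    return {e: c for e, c in counts.items() if c >= min_sup}


def enumerate_frequent(sdb, min_sup, prefix="", projected_db=None, out=None):
    """Depth-first growth of prefixes; returns {frequent subsequence: support}."""
    if projected_db is None:
        projected_db = sdb
    if out is None:
        out = {}
    for e, sup in sorted(locally_frequent(projected_db, min_sup).items()):
        out[prefix + e] = sup
        enumerate_frequent(sdb, min_sup, prefix + e, project(projected_db, e), out)
    return out


# Hypothetical toy input, for illustration only.
print(enumerate_frequent(["ABC", "AC", "BC"], min_sup=2))
```

A prefix stops growing as soon as its projected database has no locally frequent event, which is the situation the AA example describes.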


Closed sequence-based optimization

BackScan search space pruning

Semi-maximum period
A piece of an input sequence lying between the first instance and the last instance of a prefix P (for example, prefix BB). A prefix of m events has a first, second, ..., m-th semi-maximum period.

If an event A appears in each of the first semi-maximum periods of BB, then ABB occurs in every input sequence that contains BB. ABB is the longer of the two, so BB need not be grown further.

[Figure: semi-maximum period example on the sequences ABCBA, ABCB and ACBB]
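The BackScan check can be sketched as below. The slide gives only an informal definition, so this sketch assumes the BIDE-style definition of semi-maximum periods: the i-th period runs from the end of the first instance of the first i-1 prefix events up to the i-th "last-in-first" appearance, and the first period starts at the beginning of the sequence. Sequences are assumed to be strings of single-character events, and all function names are hypothetical.

```python
def first_instance_ends(seq, prefix):
    """End index of the leftmost match of each prefix[:k] in seq, or None if absent."""
    ends, pos = [], -1
    for e in prefix:
        pos = seq.find(e, pos + 1)
        if pos < 0:
            return None
        ends.append(pos)
    return ends


def semi_maximum_periods(seq, prefix):
    """List of the 1st..m-th semi-maximum periods of `prefix` in `seq` (assumed definition)."""
    ends = first_instance_ends(seq, prefix)
    if ends is None:
        return None
    # Last-in-first appearances: match the prefix right-to-left
    # inside the first instance of the whole prefix.
    lf = [0] * len(prefix)
    bound = ends[-1] + 1
    for i in range(len(prefix) - 1, -1, -1):
        lf[i] = seq.rfind(prefix[i], 0, bound)
        bound = lf[i]
    return [seq[(0 if i == 0 else ends[i - 1] + 1):lf[i]] for i in range(len(prefix))]


def backscan_prunable(sdb, prefix):
    """True if some event occurs in every i-th semi-maximum period for some i."""
    periods = [p for p in (semi_maximum_periods(s, prefix) for s in sdb) if p is not None]
    if not periods:
        return False
    for i in range(len(prefix)):
        if set.intersection(*(set(p[i]) for p in periods)):
            return True
    return False


# Hypothetical toy input: A precedes BB in every sequence that contains BB.
print(backscan_prunable(["ABB", "ACBB"], "BB"))
```

In the toy input the first semi-maximum periods of BB are "A" and "AC", which share the event A, mirroring the ABB versus BB situation on the slide.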


Unpromising projected sequence pruning

Current Frequent Covering Subsequence (CFCS)
For an input sequence Si, the frequent covering subsequence with the largest weight discovered so far.

Trivial projected sequence
A short projected sequence may not contain a sufficient number of events to generate any summarization subsequence.

For example, for prefix p = C:5,
SDB|p = {PS1 = ABAC, PS3 = B, PS4 = BAC, PS5 = BBA, PS6 = BC},
CFCS1 = ABA:3, CFCS3 = ABCB:2, CFCS4 = BAC:2, CFCS5 = ABA:3, and CFCS6 = ABCB:2.
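One way to read the trivial projected sequence rule is as an upper-bound test: if the prefix plus everything left in the projected sequence cannot outweigh the current CFCS, the projected sequence cannot yield a better summarization subsequence. The sketch below makes that reading concrete under a strong simplifying assumption that every event has weight 1, so weight equals length. The exact criterion used by CONTOUR is not given on the slide, the numbers after the colons in the example are ignored, and the name `is_trivial` is hypothetical.

```python
def is_trivial(prefix, projected_seq, cfcs):
    """Assumed test: the best reachable weight cannot exceed the current CFCS weight."""
    # Assumption: unit event weights, so weight == number of events.
    upper_bound = len(prefix) + len(projected_seq)
    return upper_bound <= len(cfcs)


# Values copied from the example for prefix p = C.
prefix = "C"
projected = {1: "ABAC", 3: "B", 4: "BAC", 5: "BBA", 6: "BC"}
cfcs = {1: "ABA", 3: "ABCB", 4: "BAC", 5: "ABA", 6: "ABCB"}

trivial_ids = [i for i in projected if is_trivial(prefix, projected[i], cfcs[i])]
print(trivial_ids)
```

Under this assumed reading, PS3 and PS6 are too short to beat ABCB, while PS1, PS4 and PS5 can still produce something longer than their current CFCS.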


Further discussions

Event weight assignment
It is similar to the TF-IDF concept:

$w_{e_i} = \left( \frac{1}{sup^{SDB}(e_i)} \right)$

Multiple summarization subsequence mining
An input sequence may support multiple summarization subsequences.
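The TF-IDF-like event weighting mentioned on this slide can be illustrated with a small sketch. It assumes the weight of an event is simply the reciprocal of its support in SDB, so rare events weigh more than common ones. The exact formula in the paper may differ (for example by a logarithm or a normalization), and the function names are hypothetical.

```python
from collections import Counter


def event_supports(sdb):
    """Number of input sequences that contain each event."""
    counts = Counter()
    for seq in sdb:
        counts.update(set(seq))  # count each sequence once per distinct event
    return counts


def event_weights(sdb):
    """Assumed inverse-support weighting: w(e) = 1 / sup(e)."""
    return {e: 1.0 / sup for e, sup in event_supports(sdb).items()}


# Hypothetical toy input, for illustration only.
print(event_weights(["AB", "A", "AAB"]))
```

As with inverse document frequency, an event that appears in almost every input sequence says little about any one of them and receives a small weight.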


Summarization subsequence based clustering

Micro-cluster generation
Input sequences with the same summarization subsequence are grouped together.

Macro-cluster creation
An agglomerative hierarchical clustering paradigm is used to create K macro-clusters.

ABA ABCB CBAC
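Micro-cluster generation amounts to grouping input sequences by their summarization subsequence. The sketch below shows only that grouping step. The input mapping reuses the CFCS values from the pruning example purely as stand-in data, and the name `micro_clusters` is hypothetical. The macro-cluster step is omitted because it depends on the similarity equations, which are not reproduced in these slides.

```python
from collections import defaultdict


def micro_clusters(summaries):
    """Group sequence ids by their summarization subsequence."""
    clusters = defaultdict(list)
    for seq_id, summary in summaries.items():
        clusters[summary].append(seq_id)
    return dict(clusters)


# Stand-in data: sequence id -> summarization subsequence.
summaries = {1: "ABA", 3: "ABCB", 4: "BAC", 5: "ABA", 6: "ABCB"}
print(micro_clusters(summaries))
```

Each resulting group is one micro-cluster, and an agglomerative procedure would then repeatedly merge the most similar groups until K macro-clusters remain.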


Empirical results


Conclusions

CONTOUR
A set of summarization subsequences is a concise representation of the original sequence database.

It preserves much structural information and can be used to efficiently cluster the input sequences with high clustering quality.


Comments

Advantage
This method provides a more concise representation of the original sequences than feature selection methods.

The summarization subsequences can be readily used in most conventional sequence mining methods.

Drawback
In equations 1 and 2, the internal similarity is computed under a single summarization subsequence, so these equations may not be suitable when an input sequence has multiple summarization subsequences.

Application
Sequence pattern mining and clustering.