CONTOUR: an efficient algorithm for discovering discriminating subsequences

Intelligent Database Systems Lab

National Yunlin University of Science and Technology


Jianyong Wang, Yuzhou Zhang, Lizhu Zhou, George Karypis, Charu C. Aggarwal

DMKD, Vol. 18, No. 1, 2009, pp. 1-29.

Presenter : Wei-Shen Tai

2009/3/11

N.Y.U.S.T.

I. M.



Outline

Introduction
Problem formulation
Efficiently mining summarization subsequences
Summarization subsequence based clustering
Empirical results
Conclusions
Comments


Motivation

Make frequent sequence mining more efficient.

Mining the complete set of frequent subsequences is very time consuming for large sequence databases.

A naive way to obtain a subset of useful frequent subsequences is to apply an existing frequent sequence mining algorithm and then select from its output, which still pays the cost of mining the complete set.


Objective

Effective search space pruning methods.

Finding the summarization subsequence that represents each original input sequence.


Problem formulation

Subsequence
If sequence Sα is contained in sequence Sβ, Sα is called a subsequence of Sβ.

Absolute support of a sequence
The number of input sequences in SDB that contain Sα, denoted by supSDB(Sα).

Summarization subsequences
A set of representative subsequences that serves as a concise summarization of the input sequences.

Internal similarity of micro-cluster Cλ

CABAC→BAC
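The subsequence and support definitions on this slide can be sketched in a few lines. This is a minimal illustration, assuming sequences are plain strings of single-character events; the database `sdb` and the function names are hypothetical and are not taken from the paper.

```python
def is_subsequence(s_alpha, s_beta):
    """Return True if s_alpha is contained in s_beta (order kept, gaps allowed)."""
    remaining = iter(s_beta)
    # `event in remaining` consumes the iterator, so matches must occur in order.
    return all(event in remaining for event in s_alpha)


def support(sdb, s_alpha):
    """Absolute support: number of input sequences in sdb that contain s_alpha."""
    return sum(1 for seq in sdb if is_subsequence(s_alpha, seq))


# Hypothetical toy database, chosen only for illustration.
sdb = ["CABAC", "ABCB", "CBAC", "ABA"]
support_bac = support(sdb, "BAC")
support_ab = support(sdb, "AB")
```

Here `support_bac` counts the sequences that contain B, A and C in that order, regardless of what lies between them.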


Efficiently mining summarization subsequences

Frequent subsequence enumeration
For each prefix, the mining algorithm builds its projected database and computes the set of locally frequent events.

Example with min_sup = 2: SDB|AA = {C, A}. Neither C nor A is locally frequent, so neither can be used to extend the prefix AA (to AAC or AAA).
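The prefix-projection enumeration described above can be sketched as follows. This is a simplified PrefixSpan-style recursion, assuming string sequences with single-character events; the function names and the two-sequence database are hypothetical, and none of CONTOUR's pruning is included.

```python
from collections import Counter


def project(projected_db, event):
    """Keep, for each sequence containing `event`, the suffix after its first occurrence."""
    suffixes = []
    for seq in projected_db:
        pos = seq.find(event)
        if pos != -1:
            suffixes.append(seq[pos + 1:])
    return suffixes


def locally_frequent(projected_db, min_sup):
    """Events that occur in at least min_sup projected sequences."""
    counts = Counter()
    for seq in projected_db:
        counts.update(set(seq))  # count each event once per sequence
    return sorted(e for e, c in counts.items() if c >= min_sup)


def mine(projected_db, min_sup, prefix="", found=None):
    """Enumerate frequent subsequences by growing the prefix one event at a time."""
    if found is None:
        found = {}
    for event in locally_frequent(projected_db, min_sup):
        new_db = project(projected_db, event)
        found[prefix + event] = len(new_db)  # support of the extended prefix
        mine(new_db, min_sup, prefix + event, found)
    return found


# Hypothetical database whose projection on AA is {C, A}, as in the slide's example.
frequent = mine(["AAC", "AAA"], min_sup=2)
no_extension = locally_frequent(["C", "A"], min_sup=2)
```

With min_sup = 2, the recursion stops at prefix AA because `no_extension` is empty: C and A each occur in only one projected sequence.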


Closed sequence-based optimization

BackScan search space pruning

Semi-maximum period
A piece of a sequence lying between the first instance and the last instance of subsequence P (for example, prefix BB). A prefix of m events has a first, second, ..., m-th semi-maximum period.

If an event A appears in each of the first semi-maximum periods of BB, then ABB occurs wherever BB occurs, and ABB is the longer of the two.

ABCBA ABCBA ABCBA

→ABCB→ABCB

ABCB ACBB ABCB
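The BackScan check can be sketched as below. The slide gives only an informal definition of semi-maximum periods, so this sketch assumes the BIDE-style definition (based on the first instance of the prefix and its "last-in-first" appearances); that choice, the function names, and the three-sequence database are assumptions made for illustration.

```python
def first_instance_end(seq, prefix):
    """Index of the last event of the leftmost instance of prefix in seq, or None."""
    pos = -1
    for event in prefix:
        pos = seq.find(event, pos + 1)
        if pos == -1:
            return None
    return pos


def semi_maximum_periods(seq, prefix):
    """List of the 1st..n-th semi-maximum periods of prefix in seq (assumed BIDE-style)."""
    end = first_instance_end(seq, prefix)
    if end is None:
        return None
    n = len(prefix)
    # Last-in-first appearances, computed from the last prefix event backwards.
    lf = [0] * n
    lf[n - 1] = end
    for i in range(n - 2, -1, -1):
        lf[i] = seq.rfind(prefix[i], 0, lf[i + 1])
    periods = []
    for i in range(n):
        start = 0 if i == 0 else first_instance_end(seq, prefix[:i]) + 1
        periods.append(seq[start:lf[i]])
    return periods


def backscan_events(sdb, prefix):
    """Events present in the i-th semi-maximum period of every supporting sequence, for some i."""
    per_seq = [p for p in (semi_maximum_periods(s, prefix) for s in sdb) if p is not None]
    if not per_seq:
        return set()
    common = set()
    for i in range(len(prefix)):
        common |= set.intersection(*(set(p[i]) for p in per_seq))
    return common


# Hypothetical database in which every sequence contains BB.
pruning_events = backscan_events(["ABCB", "ACBB", "ABCB"], "BB")
```

In this toy database the first semi-maximum periods of BB are A, AC and A, so A is common to all of them, which matches the situation the slide describes for ABB and BB.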


Unpromising projected sequence pruning

Current Frequent Covering Subsequence (CFCS)
For an input sequence Si, the frequent subsequence covering Si that has the largest weight among those discovered so far.

Trivial projected sequence
A short projected sequence may not contain enough events to generate any summarization subsequence.

For example, for prefix p = C:5, SDB|p = {PS1 = ABAC, PS3 = B, PS4 = BAC, PS5 = BBA, PS6 = BC}, with CFCS1 = ABA:3, CFCS3 = ABCB:2, CFCS4 = BAC:2, CFCS5 = ABA:3, and CFCS6 = ABCB:2.
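The idea of a trivial projected sequence can be illustrated with the slide's own example data. The slide does not state the exact test, so this sketch assumes unit event weights and a simple length bound: a projected sequence is treated as trivial when the prefix plus everything left in the projected sequence cannot be longer than the current CFCS. Both the bound and the names `is_trivial` and `trivial_ids` are assumptions, and the paper's criterion may differ.

```python
# Data from the slide: prefix C, its projected sequences, and the current CFCS per input sequence.
prefix = "C"
projected = {1: "ABAC", 3: "B", 4: "BAC", 5: "BBA", 6: "BC"}
cfcs = {1: "ABA", 3: "ABCB", 4: "BAC", 5: "ABA", 6: "ABCB"}


def is_trivial(prefix, projected_seq, current_cfcs):
    """Assumed test: even using every remaining event cannot beat the current CFCS."""
    best_possible_length = len(prefix) + len(projected_seq)
    return best_possible_length <= len(current_cfcs)


trivial_ids = sorted(
    seq_id for seq_id in projected
    if is_trivial(prefix, projected[seq_id], cfcs[seq_id])
)
```

Under this assumed bound, PS3 and PS6 are too short to improve on ABCB and could be dropped from the projected database of C.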


Further discussions

Event weight assignment
It is similar to the TF-IDF concept: an event that occurs in fewer input sequences receives a larger weight.

$w_{e_i} = \frac{1}{sup^{SDB}(e_i)}$

Multiple summarization subsequence mining
An input sequence may support multiple summarization subsequences.
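An IDF-like event weighting can be sketched as follows, assuming the weight of an event is the reciprocal of its support in SDB. Treating the weight of a subsequence as the sum of its event weights is a further assumption, and the database and function names are hypothetical.

```python
from collections import Counter


def event_weights(sdb):
    """Assumed weighting: w(e) = 1 / (number of input sequences containing e)."""
    counts = Counter()
    for seq in sdb:
        counts.update(set(seq))  # count each event once per sequence
    return {event: 1.0 / count for event, count in counts.items()}


def sequence_weight(seq, weights):
    """Assumed subsequence weight: sum of the weights of its events."""
    return sum(weights[event] for event in seq)


# Hypothetical toy database, chosen only for illustration.
sdb = ["CABAC", "ABCB", "CBAC", "ABA"]
weights = event_weights(sdb)
weight_ab = sequence_weight("AB", weights)
```

In this toy database C occurs in three of the four sequences while A and B occur in all four, so C receives the largest weight.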


Summarization subsequence based clustering

Micro-cluster generation
Input sequences with the same summarization subsequence are grouped together.

Macro-cluster creation
The agglomerative hierarchical clustering paradigm is used to merge micro-clusters into K macro-clusters.

ABA ABCB CBAC
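Micro-cluster generation amounts to grouping input sequences by their summarization subsequence. The sketch below assumes each input sequence has already been assigned exactly one summarization subsequence; the mapping `summaries` and the sequence ids are hypothetical.

```python
from collections import defaultdict


def micro_clusters(summaries):
    """Group sequence ids that share the same summarization subsequence."""
    clusters = defaultdict(list)
    for seq_id, summary in summaries.items():
        clusters[summary].append(seq_id)
    return dict(clusters)


# Hypothetical assignment of summarization subsequences to input sequence ids.
summaries = {1: "ABA", 2: "ABCB", 3: "ABCB", 4: "CBAC", 5: "ABA"}
clusters = micro_clusters(summaries)
```

Each key of `clusters` is one micro-cluster; a macro-clustering step would then merge these groups, which requires a similarity measure that is not sketched here.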


Empirical results


Conclusions

CONTOUR
A set of summarization subsequences is a concise representation of the original sequence database.

It preserves much structural information and can be used to cluster the input sequences efficiently with high clustering quality.


Comments

Advantage
This method provides a more concise representation of the original sequences than feature selection methods.

The summarization subsequences can be used efficiently with most conventional sequence mining methods.

Drawback
In Equations 1 and 2, the internal similarity is computed for a single summarization subsequence, so these equations may not be suitable when an input sequence supports multiple summarization subsequences.

Application
Sequential pattern mining and clustering.
