Intelligent Database Systems Lab
國立雲林科技大學National Yunlin University of Science and Technology
CONTOUR: an efficient algorithm for discovering discriminating subsequences
Jianyong Wang, Yuzhou Zhang, Lizhu Zhou, George Karypis, Charu C. Aggarwal
DMKD, Vol. 18, No. 1, 2009, pp. 1-29.
Presenter: Wei-Shen Tai
2009/3/11
N.Y.U.S.T.
I. M.
Outline
Introduction
Problem formulation
Efficiently mining summarization subsequences
Summarization subsequence based clustering
Empirical results
Conclusions
Comments
Motivation
Make frequent sequence mining more efficient.
Mining the complete set of frequent subsequences from a large sequence database is very time-consuming.
A naive way to obtain a subset of useful frequent subsequences is to first apply an existing frequent sequence mining algorithm to mine the complete set, which inherits this cost.
Objective
Effective search space pruning methods.
Finding the summarization subsequences that represent the original input sequences.
Problem formulation
Subsequence
If sequence Sα is contained in sequence Sβ, Sα is called a subsequence of Sβ (for example, BAC is a subsequence of CABAC).
Absolute support of a sequence
The number of input sequences in SDB that contain Sα, denoted by supSDB(Sα).
Summarization subsequences
A set of representative subsequences that form a concise summarization of the input sequences.
Internal similarity of micro-cluster Cλ
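The containment test and absolute support above can be sketched in Python (a minimal sketch; modeling each sequence as a string of single-character events is an assumption of this illustration):

```python
def is_subsequence(sa, sb):
    """True if sa is contained in sb: same order, gaps allowed."""
    it = iter(sb)
    # 'e in it' consumes the iterator, so matches must occur in order
    return all(e in it for e in sa)

def absolute_support(sdb, sa):
    """sup_SDB(sa): number of input sequences in sdb containing sa."""
    return sum(is_subsequence(sa, s) for s in sdb)
```

For example, `is_subsequence("BAC", "CABAC")` is true, mirroring the CABAC → BAC example on the slide.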
Efficiently mining summarization subsequences
Frequent subsequence enumeration
For each prefix, the mining algorithm builds its projected database and computes the set of locally frequent events.
For example, with min_sup = 2, SDB|AA = {C, A}; these events are not locally frequent, so they cannot be used to extend the prefix AA (AAC and AAA are pruned).
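The prefix-projection step can be sketched as follows (the toy SDB is hypothetical; the projection rule, keeping the suffix after the first greedy occurrence of the prefix, follows the PrefixSpan style the slide describes):

```python
from collections import Counter

def project(sdb, prefix):
    """SDB|prefix: suffix of each sequence after the earliest (greedy)
    occurrence of prefix; sequences not containing prefix are dropped."""
    projected = []
    for seq in sdb:
        i = 0
        for pos, e in enumerate(seq):
            if i < len(prefix) and e == prefix[i]:
                i += 1
                if i == len(prefix):
                    projected.append(seq[pos + 1:])
                    break
    return projected

def locally_frequent(projected, min_sup=2):
    """Events whose support in the projected database reaches min_sup."""
    sup = Counter()
    for ps in projected:
        for e in set(ps):
            sup[e] += 1
    return {e for e, s in sup.items() if s >= min_sup}
```

With a toy database such as {AAC, AAA, AB}, the projection of prefix AA yields {C, A}, and no event is locally frequent at min_sup = 2, so no extension of AA is generated.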
Closed sequence-based optimization
BackScan search space pruning
Semi-maximum period: a subsequence between the first instance and the last instance of a prefix P (for example, prefix BB); the periods are numbered the first, second, ..., m-th semi-maximum periods.
If an event A appears in each of the first semi-maximum periods of BB, then ABB and BB occur in the same input sequences; since ABB is the longer pattern, BB can be pruned.
(Figure: instances of the prefixes highlighted in the example sequences ABCBA, ABCB, and ACBB.)
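The check behind BackScan can be sketched as follows (a simplified sketch: only the first semi-maximum period is examined, approximated as the segment before the earliest greedy instance of the prefix; the full algorithm also inspects the later periods):

```python
def first_semi_max_period(seq, prefix):
    """Segment of seq before the earliest (greedy) instance of prefix,
    or None if seq does not contain prefix."""
    i, start = 0, None
    for pos, e in enumerate(seq):
        if i < len(prefix) and e == prefix[i]:
            if i == 0:
                start = pos
            i += 1
            if i == len(prefix):
                return seq[:start]
    return None

def backscan_prunable(sdb, prefix):
    """Prefix can be pruned if some event occurs in the first
    semi-maximum period of every sequence that contains it:
    that event could always be prepended, so prefix is not closed."""
    common = None
    for seq in sdb:
        period = first_semi_max_period(seq, prefix)
        if period is None:
            continue  # this sequence does not contain the prefix
        common = set(period) if common is None else common & set(period)
        if not common:
            return False
    return bool(common)
```

For instance, if A precedes the first instance of BB in every sequence that contains BB, the prefix BB is prunable in favor of ABB.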
Unpromising projected sequence pruning
Current frequent covering subsequence (CFCS)
For an input sequence Si, the covering frequent subsequence with the largest weight discovered so far.
Trivial projected sequence
A short projected sequence may not contain a sufficient number of events to generate any summarization subsequence.
For example, with prefix p = C:5, SDB|p = {PS1 = ABAC, PS3 = B, PS4 = BAC, PS5 = BBA, PS6 = BC}, and CFCS1 = ABA:3, CFCS3 = ABCB:2, CFCS4 = BAC:2, CFCS5 = ABA:3, CFCS6 = ABCB:2.
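The unpromising check can be sketched with a simple upper bound (a hypothetical formulation: the prefix's weight plus the weights of all remaining events bounds any extension, so a projected sequence whose bound cannot beat its CFCS weight is skipped; the additive weight model and the names here are assumptions of this illustration):

```python
def is_unpromising(prefix_weight, projected_seq, event_weight, cfcs_weight):
    """A projected sequence is unpromising if even extending the prefix
    with every remaining event cannot exceed the weight of that input
    sequence's current frequent covering subsequence (CFCS)."""
    upper_bound = prefix_weight + sum(event_weight[e] for e in projected_seq)
    return upper_bound <= cfcs_weight
```

A short projected sequence such as PS3 = B contributes little to the bound, so it is pruned early, which is exactly the intuition behind trivial projected sequences.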
Further discussions
Event weight assignment
The weight is similar in spirit to the TF-IDF concept: an event that appears in fewer input sequences carries a larger weight, w_i = 1 / sup_SDB(e_i).
Multiple summarization subsequence mining
An input sequence may support multiple summarization subsequences.
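The TF-IDF-like weighting can be sketched as follows (a minimal sketch assuming the inverse-support form w(e) = 1 / sup_SDB(e), so rarer events weigh more):

```python
from collections import Counter

def event_weights(sdb):
    """IDF-like weight per event: inverse of the number of input
    sequences containing it (rarer events weigh more)."""
    sup = Counter()
    for seq in sdb:
        for e in set(seq):  # count each event once per sequence
            sup[e] += 1
    return {e: 1.0 / s for e, s in sup.items()}
```

For a toy database {AB, AC, A}, event A appears in all three sequences and gets weight 1/3, while B and C each get weight 1.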
Summarization subsequence based clustering
Micro-cluster generation
Input sequences with the same summarization subsequence are grouped together (e.g., micro-clusters for ABA, ABCB, and CBAC).
Macro-cluster creation
An agglomerative hierarchical clustering paradigm is used to merge the micro-clusters into K macro-clusters.
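The two-level clustering can be sketched as follows (micro-clusters group sequences by their summarization subsequence; the merge criterion here, Jaccard similarity over the events of the representative subsequences, is a stand-in for the paper's internal-similarity measure):

```python
from collections import defaultdict

def micro_clusters(assignments):
    """Group sequence ids by their summarization subsequence."""
    groups = defaultdict(list)
    for sid, subseq in assignments.items():
        groups[subseq].append(sid)
    return dict(groups)

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def macro_clusters(micro, k):
    """Agglomeratively merge the most similar micro-clusters until k remain."""
    clusters = [(rep, list(ids)) for rep, ids in micro.items()]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # merge the pair of clusters with the most similar representatives
        i, j = max(pairs, key=lambda p: jaccard(clusters[p[0]][0],
                                                clusters[p[1]][0]))
        merged = (clusters[i][0], clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return [sorted(ids) for _, ids in clusters]
```

With representatives ABA, ABCB, and CBAC, the last two share the same event set and merge first when K = 2.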
Empirical results
Conclusions
CONTOUR
A set of summarization subsequences is a concise representation of the original sequence database.
It preserves much of the structural information and can be used to cluster the input sequences efficiently with high clustering quality.
Comments
Advantage
This method provides a more concise representation of the original sequences than feature selection methods do.
The summarization subsequences can be adopted efficiently by most conventional sequence mining methods.
Drawback
In Equations 1 and 2, the internal similarity is computed with respect to a single summarization subsequence; these equations may not be suitable when an input sequence supports multiple summarization subsequences.
Application
Sequence pattern mining and clustering.