Multi-dimensional Sequential Pattern Mining
Helen Pinto, Jiawei Han, Jian Pei, Ke Wang, Qiming Chen, Umeshwar Dayal
From: 10th ACM International Conference on Information and Knowledge Management (CIKM 2001), Atlanta.
Presented by 阮士峰 (second-year in-service master's program, student ID 69121507)

Outline
  Why multidimensional sequential pattern mining?
  Problem definition
  UniSeq
  Algorithms Dim-Seq and Seq-Dim
  Experimental results
  Conclusions

Why Sequential Pattern Mining?
  Sequential pattern mining: finding time-related frequent patterns (frequent subsequences)
  Much data and many applications are time-related
    Customer shopping patterns, telephone calling patterns
    Natural disasters (e.g., earthquake, hurricane)
    Disease and treatment
    Stock market fluctuation
    Weblog click stream analysis
    DNA sequence analysis

Sequential Pattern: Basics
  A sequence database:

    Seq. ID  Sequence
    10       <(bd)cb(ac)>
    20       <(bf)(ce)b(fg)>
    30       <(ah)(bf)abf>
    40       <(be)(ce)d>
    50       <a(bd)bcb(ade)>

  A sequence: <(bd)cb(ac)>. Its elements are (bd), c, b and (ac).
  <ad(ae)> is a subsequence of <a(bd)bcb(ade)>.
  Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern.
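
The two statements above can be checked mechanically. The following is a minimal Python sketch, not taken from the slides. It assumes each sequence is stored as a list of elements and each element is written as a string of single-character items; the helper names contains and support are illustrative.

```python
def contains(seq, pat):
    """True if pat is a subsequence of seq; elements are itemsets."""
    i = 0
    for element in seq:
        # greedily match the next pattern element against this element
        if i < len(pat) and set(pat[i]) <= set(element):
            i += 1
    return i == len(pat)


def support(db, pat):
    """Number of sequences in db that contain pat."""
    return sum(contains(seq, pat) for seq in db.values())


# The sequence database from the slide, keyed by sequence ID.
SDB = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}

print(contains(SDB[50], ["a", "d", "ae"]))  # True: <ad(ae)> is a subsequence
print(support(SDB, ["bd", "c", "b"]))       # 2: <(bd)cb> meets min_sup = 2
```

Greedy left-to-right matching is enough here, because matching a pattern element at its earliest possible position never rules out a later match.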

Multi-Dimensional Sequence Database

    cid  Cust_grp      City      Age_grp  sequence
    10   Business      Boston    Middle   <(bd)cba>
    20   Professional  Chicago   Young    <(bf)(ce)(fg)>
    30   Business      Chicago   Middle   <(ah)abf>
    40   Education     New York  Retired  <(be)(ce)>

  If the support threshold is 2, P = (*,Chicago,*,<bf>) is an MD sequential pattern: it matches tuples 20 and 30.
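
As an illustration of what "matches" means, here is a small Python sketch under assumed conventions (they are not from the slides): a tuple is stored as (dimension values, sequence), "*" is the wildcard, and the names matches and contains are hypothetical helpers.

```python
def contains(seq, pat):
    """True if pat is a subsequence of seq; elements are itemsets."""
    i = 0
    for element in seq:
        if i < len(pat) and set(pat[i]) <= set(element):
            i += 1
    return i == len(pat)


# cid -> ((Cust_grp, City, Age_grp), sequence)
MD_DB = {
    10: (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    20: (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    30: (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    40: (("Education", "New York", "Retired"), ["be", "ce"]),
}


def matches(dims, seq, md_pattern):
    """An MD pattern is (d1, ..., dm, sequence pattern); '*' matches anything."""
    *dim_pattern, seq_pattern = md_pattern
    dims_ok = all(p == "*" or p == d for p, d in zip(dim_pattern, dims))
    return dims_ok and contains(seq, seq_pattern)


P = ("*", "Chicago", "*", ["b", "f"])
matching = [cid for cid, (dims, seq) in MD_DB.items() if matches(dims, seq, P)]
print(matching)  # [20, 30], so the support of P is 2
```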

Problem definition
  Sequential patterns are useful
    "try a 100 hour free internet access package", then "subscribe to 15 hours/month package", then "upgrade to 30 hours/month package", then "upgrade to unlimited package"
    Marketing, product design & development
  Problems: lack of focus
    Various groups of customers may have different patterns
  MD-sequential pattern mining: integrate multi-dimensional analysis and sequential pattern mining

UniSeq
  Embed MD information into sequences

  Table 1: SDB

    cid  Cust_grp      City      Age_grp  sequence
    10   Business      Boston    Middle   <(bd)cba>
    20   Professional  Chicago   Young    <(bf)(ce)(fg)>
    30   Business      Chicago   Middle   <(ah)abf>
    40   Education     New York  Retired  <(be)(ce)>

  Table 2: SDBMD

    cid  MD-extension of sequences
    10   <(Business,Boston,Middle)(bd)cba>
    20   <(Professional,Chicago,Young)(bf)(ce)(fg)>
    30   <(Business,Chicago,Middle)(ah)abf>
    40   <(Education,New York,Retired)(be)(ce)>

  Mine the extended sequence database using sequential pattern mining methods
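
The embedding step can be sketched in a few lines of Python. This is only an illustration under assumed data structures: tuples are stored as (dimension values, sequence), elements become Python tuples, and md_extend is a hypothetical helper name.

```python
# cid -> ((Cust_grp, City, Age_grp), sequence), as in Table 1
MD_DB = {
    10: (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    20: (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    30: (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    40: (("Education", "New York", "Retired"), ["be", "ce"]),
}


def md_extend(dims, seq):
    """Prepend the dimension values as one extra first element."""
    return [tuple(dims)] + [tuple(element) for element in seq]


# The MD-extended database of Table 2.
SDB_MD = {cid: md_extend(dims, seq) for cid, (dims, seq) in MD_DB.items()}

print(SDB_MD[20])
# [('Professional', 'Chicago', 'Young'), ('b', 'f'), ('c', 'e'), ('f', 'g')]
```

Because the dimension values sit in an ordinary first element, any sequential pattern miner that accepts itemset elements can process SDB_MD unchanged.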

UniSeq (cont.)
  Sequence database SDBMD can be mined using PrefixSpan.
  In its first scan of the database, PrefixSpan finds all the single-item frequent sequences (support threshold 2). These are <Business>:2, <Chicago>:2, <Middle>:2, <a>:2, <b>:4, <c>:3, <e>:2 and <f>:2.
  The complete set of sequential patterns can then be partitioned into 8 subsets, one per frequent item used as prefix.
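
The first scan amounts to counting, for every item, the number of sequences it occurs in. A minimal Python sketch follows; the layout of SDB_MD (first element a tuple of dimension values, the rest strings of single-character items) and the name frequent_items are assumptions made for illustration.

```python
from collections import Counter

SDB_MD = {
    10: [("Business", "Boston", "Middle"), "bd", "c", "b", "a"],
    20: [("Professional", "Chicago", "Young"), "bf", "ce", "fg"],
    30: [("Business", "Chicago", "Middle"), "ah", "a", "b", "f"],
    40: [("Education", "New York", "Retired"), "be", "ce"],
}


def frequent_items(db, min_sup):
    """Count each item once per sequence and keep those meeting min_sup."""
    counts = Counter()
    for seq in db.values():
        counts.update({item for element in seq for item in element})
    return {item: n for item, n in counts.items() if n >= min_sup}


length1 = frequent_items(SDB_MD, min_sup=2)
print(sorted(length1.items()))
# [('Business', 2), ('Chicago', 2), ('Middle', 2), ('a', 2), ('b', 4),
#  ('c', 3), ('e', 2), ('f', 2)]
```

This sketch treats dimension values and items as one vocabulary, so it relies on no dimension value being spelled like an item.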

UniSeq (cont.)
  Example: the <Chicago>-projected database contains two postfix sequences, <(bf)(ce)f> and <(_Middle)aabf> (infrequent items removed).

    cid  MD-extension of sequences
    20   <(Professional,Chicago,Young)(bf)(ce)(fg)>
    30   <(Business,Chicago,Middle)(ah)abf>

  Output the sequential pattern <Chicago>, then find the frequent items in its projected database. They are <b> and <f>, which form the sequential patterns <Chicago b>:2 and <Chicago f>:2 respectively.
  The <Chicago b>-projected database in turn contains the postfix sequences <(_f)f> and <f>, which share one frequent item, f.
  This yields <Chicago bf>:2, i.e. the MD sequential pattern (*,Chicago,*,<bf>).
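
The projection step in this example can be reproduced with the following Python sketch. It is an illustration only: the helper names project and keep_frequent, the "_" marker for a partial first element, and the hard-coded set of frequent items are all assumptions of this sketch.

```python
from collections import Counter

SDB_MD = {
    10: [("Business", "Boston", "Middle"), "bd", "c", "b", "a"],
    20: [("Professional", "Chicago", "Young"), "bf", "ce", "fg"],
    30: [("Business", "Chicago", "Middle"), "ah", "a", "b", "f"],
    40: [("Education", "New York", "Retired"), "be", "ce"],
}
FREQUENT = {"Business", "Chicago", "Middle", "a", "b", "c", "e", "f"}


def project(seq, item):
    """Postfix after the first occurrence of item, or None if absent."""
    for i, element in enumerate(seq):
        items = list(element)
        if item in items:
            rest = items[items.index(item) + 1:]
            head = [("_", *rest)] if rest else []  # "_" marks a partial element
            return head + [tuple(e) for e in seq[i + 1:]]
    return None


def keep_frequent(seq, frequent):
    """Drop infrequent items, then drop elements left empty."""
    out = []
    for element in seq:
        kept = tuple(x for x in element if x == "_" or x in frequent)
        if kept and kept != ("_",):
            out.append(kept)
    return out


projected = {}
for cid, seq in SDB_MD.items():
    postfix = project(seq, "Chicago")
    if postfix is not None:
        projected[cid] = keep_frequent(postfix, FREQUENT)

# Count each item once per postfix (the "_" marker itself is skipped).
local = Counter()
for seq in projected.values():
    local.update({x for element in seq for x in element if x != "_"})
frequent_local = {x: n for x, n in local.items() if n >= 2}

print(projected[20])   # [('b', 'f'), ('c', 'e'), ('f',)]
print(projected[30])   # [('_', 'Middle'), ('a',), ('a',), ('b',), ('f',)]
print(frequent_local)  # {'b': 2, 'f': 2}
```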

Mine Sequential Patterns by Prefix Projections
  Step 1: find length-1 sequential patterns
    <a>, <b>, <c>, <d>, <e>, <f>
  Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
    The ones having prefix <a>
    The ones having prefix <b>
    …
    The ones having prefix <f>

    SID  sequence
    10   <a(abc)(ac)d(cf)>
    20   <(ad)c(bc)(ae)>
    30   <(ef)(ab)(df)cb>
    40   <eg(af)cbc>

Find Seq. Patterns with Prefix <a>
  Only projections with respect to <a> need to be considered
    <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    Further partition into 6 subsets
      Having prefix <aa>
      …
      Having prefix <af>
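
The <a>-projected database listed above can be derived with a short Python sketch. The helpers project and fmt are illustrative names, and the sketch assumes that items inside an element are kept in alphabetical order, so the items after a in its element form the partial element written with "_".

```python
SDB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def project(seq, item):
    """Postfix of seq after the first occurrence of item, or None."""
    for i, element in enumerate(seq):
        if item in element:
            rest = element[element.index(item) + 1:]
            head = [("_", *rest)] if rest else []  # "_" marks a partial element
            return head + [tuple(e) for e in seq[i + 1:]]
    return None


def fmt(seq):
    """Render a sequence in the <...> notation used on the slides."""
    def render(element):
        text = "".join(element)
        return text if len(element) == 1 else f"({text})"
    return "<" + "".join(render(e) for e in seq) + ">"


a_projected = [fmt(project(seq, "a")) for seq in SDB.values()]
print(a_projected)
# ['<(abc)(ac)d(cf)>', '<(_d)c(bc)(ae)>', '<(_b)(df)cb>', '<(_f)cbc>']
```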

Completeness of PrefixSpan
  SDB
    Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
      Having prefix <a>
        <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
        Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
          Having prefix <aa>: <aa>-projected database …
          …
          Having prefix <af>: <af>-projected database …
      Having prefix <b>
        <b>-projected database …
      Having prefix <c>, …, <f>
        … …

Efficiency of PrefixSpan
  No candidate sequence needs to be generated
  Projected databases keep shrinking
  Major cost of PrefixSpan: constructing projected databases
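
To make these points concrete, here is a deliberately simplified Python sketch of prefix-projected pattern growth. It is not the full algorithm from the PrefixSpan paper: it assumes every element holds a single item (so a sequence is a plain string), it stores each projected database as offsets into the original sequences rather than as copied postfixes, and toy_db is hypothetical data that does not come from the slides.

```python
from collections import Counter


def prefixspan(db, min_sup):
    """Frequent subsequences of single-item sequences, with their supports."""
    patterns = {}

    def grow(prefix, projected):
        # projected: list of (sequence index, offset where the postfix starts)
        counts = Counter()
        for sid, pos in projected:
            counts.update(set(db[sid][pos:]))  # once per postfix
        # Only items that actually occur are considered: no candidates.
        for item, n in sorted(counts.items()):
            if n < min_sup:
                continue
            patterns[prefix + item] = n
            # The new projected database is never larger than the old one.
            next_projected = [
                (sid, db[sid].index(item, pos) + 1)
                for sid, pos in projected
                if item in db[sid][pos:]
            ]
            grow(prefix + item, next_projected)

    grow("", [(sid, 0) for sid in range(len(db))])
    return patterns


toy_db = ["abc", "acb", "bc"]  # hypothetical example data
result = prefixspan(toy_db, min_sup=2)
print(result)
# {'a': 2, 'ab': 2, 'ac': 2, 'b': 3, 'bc': 2, 'c': 3}
```

Handling itemset elements, as in the slides' examples, additionally requires the partial-element extensions written with "_", which this sketch leaves out.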

Dim-Seq
  First find MD-patterns
    E.g. (*,Chicago,*)
  Form projected sequence database
    <(bf)(ce)(fg)> and <(ah)abf> for (*,Chicago,*)
  Find sequential patterns in the projected database
    E.g. (*,Chicago,*,<bf>)

    cid  Cust_grp      City      Age_grp  sequence
    10   Business      Boston    Middle   <(bd)cba>
    20   Professional  Chicago   Young    <(bf)(ce)(fg)>
    30   Business      Chicago   Middle   <(ah)abf>
    40   Education     New York  Retired  <(be)(ce)>
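
The Dim-Seq order of operations can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the MD-pattern (*,Chicago,*) is taken as given rather than mined, the sequence step only checks the support of <bf> instead of running a full miner, and dim_match and contains are hypothetical helpers.

```python
def contains(seq, pat):
    """True if pat is a subsequence of seq; elements are itemsets."""
    i = 0
    for element in seq:
        if i < len(pat) and set(pat[i]) <= set(element):
            i += 1
    return i == len(pat)


def dim_match(dims, dim_pattern):
    """'*' in the pattern matches any value of that dimension."""
    return all(p in ("*", d) for p, d in zip(dim_pattern, dims))


MD_DB = {
    10: (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    20: (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    30: (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    40: (("Education", "New York", "Retired"), ["be", "ce"]),
}

# Steps 1 and 2: take an MD-pattern and form its projected sequence database.
md_pattern = ("*", "Chicago", "*")
projected = [seq for dims, seq in MD_DB.values() if dim_match(dims, md_pattern)]
print(projected)  # [['bf', 'ce', 'fg'], ['ah', 'a', 'b', 'f']]

# Step 3: look for sequential patterns inside the projected database.
bf_support = sum(contains(seq, ["b", "f"]) for seq in projected)
print(bf_support)  # 2, so (*,Chicago,*,<bf>) is an MD sequential pattern
```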

Seq-Dim
  Find sequential patterns
    E.g. <bf>
  Form projected MD-database
    E.g. (Professional,Chicago,Young) and (Business,Chicago,Middle) for <bf>
  Mine MD-patterns
    E.g. (*,Chicago,*,<bf>)
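
Seq-Dim reverses the order: sequences first, dimensions second. The Python sketch below illustrates this under simplifying assumptions that are not from the slides: the sequential pattern <bf> is taken as given, and the MD step only looks for frequent values of a single dimension rather than running a full BUC-like search.

```python
from collections import Counter


def contains(seq, pat):
    """True if pat is a subsequence of seq; elements are itemsets."""
    i = 0
    for element in seq:
        if i < len(pat) and set(pat[i]) <= set(element):
            i += 1
    return i == len(pat)


MD_DB = {
    10: (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    20: (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    30: (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    40: (("Education", "New York", "Retired"), ["be", "ce"]),
}

# Steps 1 and 2: take a sequential pattern and form its projected MD-database.
seq_pattern = ["b", "f"]
projected = [dims for dims, seq in MD_DB.values() if contains(seq, seq_pattern)]
print(projected)
# [('Professional', 'Chicago', 'Young'), ('Business', 'Chicago', 'Middle')]

# Step 3: mine MD-patterns, here only those fixing one dimension.
counts = Counter((i, value) for dims in projected for i, value in enumerate(dims))
frequent = {key: n for key, n in counts.items() if n >= 2}
print(frequent)  # {(1, 'Chicago'): 2}, i.e. (*,Chicago,*,<bf>)
```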

Dim-Seq and Seq-Dim
  The problem of multi-dimensional sequential pattern mining can be reduced to two sub-problems: sequential pattern mining and MD-pattern mining.
  As introduced before, sequential pattern mining can be done efficiently by PrefixSpan.
  For MD-pattern mining, we adopt a BUC-like algorithm.

BUC algorithm
  Kevin Beyer and Raghu Ramakrishnan. Bottom-up computation of sparse and iceberg CUBEs. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 359-370, May 31-June 3, 1999, Philadelphia, Pennsylvania, United States.

Mining MD-Patterns (BUC-like)
  BUC processing of the MD-pattern lattice:
    All
    (cust-grp,*,*)  (*,city,*)  (*,*,age-grp)
    (cust-grp,city,*)  (cust-grp,*,age-grp)
    (cust-grp,city,age-grp)
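
A BUC-like traversal can be sketched in Python as a recursive partition-and-prune procedure. This is an illustrative sketch rather than the algorithm as published by Beyer and Ramakrishnan or as implemented by the authors; the function name buc and its parameters are assumptions. It is run on the dimension values of the four-tuple example database with support threshold 2.

```python
from collections import defaultdict


def buc(rows, min_sup, pattern=None, start=0, out=None):
    """Enumerate MD-patterns with support >= min_sup, bottom-up."""
    if pattern is None:
        pattern, out = ["*"] * len(rows[0]), {}
    out[tuple(pattern)] = len(rows)
    for d in range(start, len(pattern)):
        # Partition the current rows on dimension d.
        groups = defaultdict(list)
        for row in rows:
            groups[row[d]].append(row)
        for value, part in groups.items():
            if len(part) >= min_sup:  # iceberg pruning: skip small partitions
                pattern[d] = value
                buc(part, min_sup, pattern, d + 1, out)
                pattern[d] = "*"
    return out


ROWS = [
    ("Business", "Boston", "Middle"),
    ("Professional", "Chicago", "Young"),
    ("Business", "Chicago", "Middle"),
    ("Education", "New York", "Retired"),
]

md_patterns = buc(ROWS, min_sup=2)
for pat, sup in md_patterns.items():
    print(pat, sup)
# ('*', '*', '*') 4
# ('Business', '*', '*') 2
# ('Business', '*', 'Middle') 2
# ('*', 'Chicago', '*') 2
# ('*', '*', 'Middle') 2
```

A partition that falls below the threshold is never refined further, which is what keeps the traversal cheap on sparse data.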

Experimental results
  Run on a Pentium III PC with 1 GB of main memory, using Microsoft Visual C++ 6.0.
  In this dataset, the number of items is set to 10,000 and the number of sequences is 10,000. The average number of items within each element is 2.5. The average number of elements in one sequence is 8.

[Figure: Scalability Over Dimensionality]
[Figure: Scalability Over Cardinality]
[Figure: Scalability Over Support Threshold]
[Figure: Scalability Over Database Size]

Pros & Cons of Algorithms
  Seq-Dim is efficient and scalable
    Fastest in most cases
  UniSeq is also efficient and scalable
    Fastest with low dimensionality
  Dim-Seq has poor scalability

Conclusions
  MD sequential pattern mining is interesting and useful
  Mining MD sequential patterns efficiently
    UniSeq, Dim-Seq, and Seq-Dim
  Future work
    Applications of sequential pattern mining

End of presentation

References
  R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, pages 487-499.
  R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, pages 3-14.
  K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg CUBEs. SIGMOD'99, pages 359-370.
  C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32-38, 1998.
  M. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential pattern mining with regular expression constraints. VLDB'99, pages 223-234.
  J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, pages 106-115.
  J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.
  J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, pages 1-12.
  H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional intertransaction association rules. DMKD'98, pages 12:1-12:7.
  H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
  B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, pages 412-421.
  J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224.
  R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.