Multi-dimensional Sequential Pattern Mining. Helen Pinto, Jiawei Han, Jian Pei, Ke Wang, Qiming Chen, Umeshwar Dayal. From: 10th ACM International Conference on Information and Knowledge Management (CIKM 2001), Atlanta. Presenter: 阮士峰 (69121507).

Page 1: Multi-dimensional Sequential Pattern Mining

Multi-dimensional Sequential Pattern Mining

Helen Pinto, Jiawei Han, Jian Pei, Ke Wang, Qiming Chen, Umeshwar Dayal

From: 10th ACM International Conference on Information and Knowledge Management (CIKM 2001), Atlanta.

Presented by 阮士峰 (student ID 69121507, second-year in-service master's program)

Page 2: Multi-dimensional Sequential Pattern Mining

Outline
  • Why multi-dimensional sequential pattern mining?
  • Problem definition
  • UniSeq
  • Algorithms Dim-Seq and Seq-Dim
  • Experimental results
  • Conclusions

Page 3: Multi-dimensional Sequential Pattern Mining

Why Sequential Pattern Mining?
  • Sequential pattern mining: finding time-related frequent patterns (frequent subsequences)
  • Many data and applications are time-related
      • Customer shopping patterns, telephone calling patterns
      • Natural disasters (e.g., earthquake, hurricane)
      • Disease and treatment
      • Stock market fluctuation
      • Weblog click stream analysis
      • DNA sequence analysis

Page 4: Multi-dimensional Sequential Pattern Mining

Sequential Pattern: Basics

A sequence database:

Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>

  • A sequence: <(bd)cb(ac)>. Its elements are (bd), c, b and (ac).
  • <ad(ae)> is a subsequence of <a(bd)bcb(ade)>.
  • Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern (it is contained in sequences 10 and 50).
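The containment and support checks on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the names `seq`, `is_subsequence` and `support` are hypothetical, and each element is modelled as a set of single-letter items.

```python
from typing import Dict, FrozenSet, List

Sequence = List[FrozenSet[str]]


def seq(*elements: str) -> Sequence:
    """Build a sequence from one string per element: seq("bd", "c") is <(bd)c>."""
    return [frozenset(element) for element in elements]


def is_subsequence(pattern: Sequence, sequence: Sequence) -> bool:
    """True if the pattern's elements are contained, in order, in distinct elements."""
    pos = 0
    for element in pattern:
        # Greedily pick the earliest remaining element that contains this one.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern: Sequence, database: Dict[int, Sequence]) -> int:
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database.values())


# The sequence database from the slide.
SDB = {
    10: seq("bd", "c", "b", "ac"),
    20: seq("bf", "ce", "b", "fg"),
    30: seq("ah", "bf", "a", "b", "f"),
    40: seq("be", "ce", "d"),
    50: seq("a", "bd", "b", "c", "b", "ade"),
}

print(is_subsequence(seq("a", "d", "ae"), SDB[50]))  # True
print(support(seq("bd", "c", "b"), SDB))             # 2
```

Greedy earliest matching is sufficient here: taking the first element that contains the current pattern element never rules out a later match.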

Page 5: Multi-dimensional Sequential Pattern Mining

Multi-Dimensional Sequence Database

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>

  • If min_sup = 2, P = (*,Chicago,*,<bf>) is an MD sequential pattern: P matches tuples 20 and 30.
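As a rough sketch of what "P matches a tuple" means, the following Python checks the dimension values against the wildcard pattern and then tests sequence containment. The names `matches` and `MD_DB`, and the encoding of each element as a string of single-letter items, are assumptions made for this example.

```python
WILDCARD = "*"


def is_subsequence(pattern, sequence):
    """Elements are strings of single-letter items, e.g. ["bf", "ce", "fg"]."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def matches(md_pattern, row):
    """True if the dimension part and the sequence part of the pattern both match."""
    dims, seq_pattern = md_pattern
    values, sequence = row
    dims_ok = all(p == WILDCARD or p == v for p, v in zip(dims, values))
    return dims_ok and is_subsequence(seq_pattern, sequence)


# The multi-dimensional sequence database from the slide.
MD_DB = {
    10: (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    20: (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    30: (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    40: (("Education", "New York", "Retired"), ["be", "ce"]),
}

P = (("*", "Chicago", "*"), ["b", "f"])
matching = [cid for cid, row in MD_DB.items() if matches(P, row)]
print(matching)  # [20, 30]
```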

Page 6: Multi-dimensional Sequential Pattern Mining

Problem definition
  • Sequential patterns are useful
      • "try a 100-hour free Internet access package", then "subscribe to a 15 hours/month package", then "upgrade to a 30 hours/month package", then "upgrade to the unlimited package"
      • Marketing, product design & development
  • Problems: lack of focus
      • Various groups of customers may have different patterns
  • MD-sequential pattern mining: integrate multi-dimensional analysis and sequential pattern mining

Page 7: Multi-dimensional Sequential Pattern Mining

UniSeq
  • Embed MD information into sequences

Table 1: SDB

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>

Table 2: SDBMD

cid  MD-extension of sequences
10   <(Business,Boston,Middle)(bd)cba>
20   <(Professional,Chicago,Young)(bf)(ce)(fg)>
30   <(Business,Chicago,Middle)(ah)abf>
40   <(Education,New York,Retired)(be)(ce)>

  • Mine the extended sequence database using sequential pattern mining methods
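The embedding step that turns Table 1 into Table 2 might look like the following Python sketch. `md_extend` is a hypothetical helper, and the sketch assumes dimension values never collide with item names (in practice one might prefix each value with its dimension name).

```python
# Table 1 (SDB): dimension values plus a sequence whose elements are strings
# of single-letter items.
MD_DB = {
    10: (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    20: (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    30: (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    40: (("Education", "New York", "Retired"), ["be", "ce"]),
}


def md_extend(dims, sequence):
    """Prepend the dimension values as one extra element in front of the sequence."""
    return [tuple(dims)] + [tuple(element) for element in sequence]


# Table 2 (SDBMD): every tuple becomes one ordinary sequence.
SDB_MD = {cid: md_extend(dims, sequence) for cid, (dims, sequence) in MD_DB.items()}

for cid, extended in SDB_MD.items():
    print(cid, extended)
```

After this step the dimension values are treated like ordinary items, so any sequential pattern miner can be run on `SDB_MD` unchanged.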

Page 8: Multi-dimensional Sequential Pattern Mining

UniSeq (cont.)
  • The sequence database SDBMD can be mined using PrefixSpan.
  • In the first scan of the database, PrefixSpan finds all frequent single-item sequences. These are <Business>:2, <Chicago>:2, <Middle>:2, <a>:2, <b>:4, <c>:3, <e>:2 and <f>:2.
  • The complete set of sequential patterns can then be partitioned into 8 subsets, one per frequent item.

cid  MD-extension of sequences
10   <(Business,Boston,Middle)(bd)cba>
20   <(Professional,Chicago,Young)(bf)(ce)(fg)>
30   <(Business,Chicago,Middle)(ah)abf>
40   <(Education,New York,Retired)(be)(ce)>
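The first scan described above amounts to counting, for each item, the number of sequences that contain it. A minimal Python sketch, with `frequent_items` as a hypothetical name and min_sup = 2 as on the earlier slides:

```python
from collections import Counter

# SDBMD: each sequence is a list of elements, each element a tuple of items.
SDB_MD = {
    10: [("Business", "Boston", "Middle"), ("b", "d"), ("c",), ("b",), ("a",)],
    20: [("Professional", "Chicago", "Young"), ("b", "f"), ("c", "e"), ("f", "g")],
    30: [("Business", "Chicago", "Middle"), ("a", "h"), ("a",), ("b",), ("f",)],
    40: [("Education", "New York", "Retired"), ("b", "e"), ("c", "e")],
}


def frequent_items(database, min_sup):
    """Map each item to its support, keeping only items with support >= min_sup."""
    counts = Counter()
    for sequence in database.values():
        # Build a set first so an item counts at most once per sequence.
        counts.update({item for element in sequence for item in element})
    return {item: n for item, n in counts.items() if n >= min_sup}


frequent = frequent_items(SDB_MD, min_sup=2)
print(sorted(frequent.items()))
```

On this database the result contains exactly the eight items listed on the slide.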

Page 9: Multi-dimensional Sequential Pattern Mining

UniSeq (cont.)
  • Example: the <Chicago>-projected database contains two postfix sequences (with infrequent items removed): <(bf)(ce)f> and <(_Middle)aabf>.

cid  MD-extension of sequences
20   <(Professional,Chicago,Young)(bf)(ce)(fg)>
30   <(Business,Chicago,Middle)(ah)abf>

  • Output the sequential pattern <Chicago>, then find the frequent items in its projected database. They are <b> and <f>, which form the sequential patterns <Chicago b>:2 and <Chicago f>:2, respectively.
  • The <Chicago b>-projected database contains the postfix sequences <(_f)f> and <f>, which share one frequent item, f.
  • This yields <Chicago bf>:2, i.e. the MD sequential pattern (*,Chicago,*,<bf>).

Page 10: Multi-dimensional Sequential Pattern Mining

Mine Sequential Patterns by Prefix Projections
  • Step 1: find length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  • Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
      • the ones having prefix <a>;
      • the ones having prefix <b>;
      • …
      • the ones having prefix <f>

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Page 11: Multi-dimensional Sequential Pattern Mining

Find Seq. Patterns with Prefix <a>
  • Only need to consider projections with respect to <a>
      • <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  • Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      • Further partition into 6 subsets: having prefix <aa>; … having prefix <af>

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
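The projection step for a one-item prefix can be sketched as follows. This is a simplified illustration rather than the full PrefixSpan algorithm: `project`, `show` and `length2_patterns` are hypothetical names, items are single letters kept in alphabetical order within an element, min_sup = 2 is assumed, and a leading "_" marks the partial element written (_x) on the slide.

```python
from collections import Counter

SDB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def project(sequence, item):
    """Postfix of sequence w.r.t. the one-item prefix <item>, or None if absent."""
    for i, element in enumerate(sequence):
        if item in element:
            rest = "".join(x for x in element if x > item)
            head = ["_" + rest] if rest else []
            return head + sequence[i + 1:]
    return None


def show(sequence):
    """Render a sequence in the slide notation, e.g. <(_d)c(bc)(ae)>."""
    return "<" + "".join(e if len(e) == 1 else "(" + e + ")" for e in sequence) + ">"


def length2_patterns(projected, item, min_sup):
    """Frequent length-2 patterns with prefix <item>, from its projected database."""
    seq_ext, item_ext = Counter(), Counter()
    for postfix in projected:
        s_items, i_items = set(), set()
        for element in postfix:
            if element.startswith("_"):
                i_items.update(element[1:])        # same element as the prefix item
            else:
                s_items.update(element)            # a later element: <item x>
                if item in element:                # item recurs: (item x) also possible
                    i_items.update(x for x in element if x > item)
        seq_ext.update(s_items)
        item_ext.update(i_items)
    patterns = {"<" + item + x + ">" for x, n in seq_ext.items() if n >= min_sup}
    patterns |= {"<(" + item + x + ")>" for x, n in item_ext.items() if n >= min_sup}
    return patterns


projected_a = [p for p in (project(s, "a") for s in SDB.values()) if p is not None]
print([show(p) for p in projected_a])
print(sorted(length2_patterns(projected_a, "a", min_sup=2)))
```

Running it reproduces the four postfix sequences and the six length-2 patterns listed on the slide.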

Page 12: Multi-dimensional Sequential Pattern Mining

Completeness of PrefixSpan

SDB:

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  • Having prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
      • Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
          • Having prefix <aa>: <aa>-proj. db
          • …
          • Having prefix <af>: <af>-proj. db
  • Having prefix <b>: <b>-projected database …
  • Having prefix <c>, …, <f>: …

Page 13: Multi-dimensional Sequential Pattern Mining

Efficiency of PrefixSpan
  • No candidate sequence needs to be generated
  • Projected databases keep shrinking
  • Major cost of PrefixSpan: constructing projected databases

Page 14: Multi-dimensional Sequential Pattern Mining

Dim-Seq
  • First find MD-patterns
      • E.g. (*,Chicago,*)
  • Form the projected sequence database
      • <(bf)(ce)(fg)> and <(ah)abf> for (*,Chicago,*)
  • Find sequential patterns in the projected database
      • E.g. (*,Chicago,*,<bf>)

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
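The Dim-Seq order of work (dimensions first, sequences second) can be illustrated with a small Python sketch. It only covers the one MD-pattern from the slide, and `project_sequences` is a hypothetical helper name; a real implementation would run a sequential pattern miner on each projected database instead of testing a single pattern.

```python
MD_DB = {
    10: (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    20: (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    30: (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    40: (("Education", "New York", "Retired"), ["be", "ce"]),
}


def is_subsequence(pattern, sequence):
    """Elements are strings of single-letter items."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def project_sequences(db, dims_pattern):
    """Sequences of all tuples whose dimension values match the MD-pattern."""
    return {
        cid: sequence
        for cid, (values, sequence) in db.items()
        if all(p == "*" or p == v for p, v in zip(dims_pattern, values))
    }


# Steps 1 and 2: take the MD-pattern (*,Chicago,*) and form its projected database.
projected = project_sequences(MD_DB, ("*", "Chicago", "*"))
# Step 3: look for sequential patterns inside it; here only <bf> is checked.
support_bf = sum(is_subsequence(["b", "f"], s) for s in projected.values())
print(projected, support_bf)
```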

Page 15: Multi-dimensional Sequential Pattern Mining

Seq-Dim
  • Find sequential patterns
      • E.g. <bf>
  • Form the projected MD-database
      • E.g. (Professional,Chicago,Young) and (Business,Chicago,Middle) for <bf>
  • Mine MD-patterns
      • E.g. (*,Chicago,*,<bf>)

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
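Seq-Dim reverses the order: sequences first, dimensions second. The sketch below follows the slide's example for <bf>. As a simplification it only counts single dimension values rather than mining every MD-pattern, and the names `project_dimensions` and `frequent_dimension_values` are assumptions made for this illustration.

```python
from collections import Counter

MD_DB = {
    10: (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    20: (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    30: (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    40: (("Education", "New York", "Retired"), ["be", "ce"]),
}


def is_subsequence(pattern, sequence):
    """Elements are strings of single-letter items."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def project_dimensions(db, seq_pattern):
    """Dimension values of all tuples whose sequence contains the pattern."""
    return [values for values, sequence in db.values()
            if is_subsequence(seq_pattern, sequence)]


def frequent_dimension_values(rows, min_sup):
    """Support of each (dimension index, value) pair that reaches min_sup."""
    counts = Counter((d, v) for row in rows for d, v in enumerate(row))
    return {key: n for key, n in counts.items() if n >= min_sup}


rows = project_dimensions(MD_DB, ["b", "f"])
print(rows)
print(frequent_dimension_values(rows, min_sup=2))  # {(1, 'Chicago'): 2}
```

The single surviving pair, dimension 1 (City) with value Chicago, corresponds to the MD sequential pattern (*,Chicago,*,<bf>).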

Page 16: Multi-dimensional Sequential Pattern Mining

Dim-Seq and Seq-Dim
  • The multi-dimensional sequential pattern mining problem can be reduced to two sub-problems: sequential pattern mining and MD-pattern mining.
  • As introduced before, sequential pattern mining can be done efficiently by PrefixSpan.
  • For MD-pattern mining, we adopt a BUC-like algorithm.

Page 17: Multi-dimensional Sequential Pattern Mining

BUC algorithm
  • Kevin Beyer and Raghu Ramakrishnan. Bottom-up computation of sparse and iceberg CUBEs. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 359-370, May 31-June 3, 1999, Philadelphia, Pennsylvania, United States.

Page 18: Multi-dimensional Sequential Pattern Mining

Mining MD-Patterns (BUC-like)

BUC processing:

All
  (cust-grp,*,*)   (*,city,*)   (*,*,age-grp)
  (cust-grp,city,*)   (cust-grp,*,age-grp)
  (cust-grp,city,age-grp)

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
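A BUC-like bottom-up search over the dimension values can be sketched as below. This is an illustrative simplification under stated assumptions, not the algorithm exactly as given by Beyer and Ramakrishnan or by the paper: `mine_md_patterns` is a hypothetical name, min_sup = 2 is assumed, and partitioning is done with a plain dictionary.

```python
from collections import defaultdict


def mine_md_patterns(rows, min_sup):
    """Return {pattern: support} for every frequent MD-pattern with a fixed value."""
    if not rows:
        return {}
    n_dims = len(rows[0])
    results = {}

    def expand(partition, pattern, start):
        # Fix one more dimension, only to the right of the last fixed one,
        # so that each pattern is generated exactly once.
        for d in range(start, n_dims):
            groups = defaultdict(list)
            for row in partition:
                groups[row[d]].append(row)
            for value, group in groups.items():
                if len(group) >= min_sup:  # prune partitions below min_sup
                    new_pattern = pattern[:d] + (value,) + pattern[d + 1:]
                    results[new_pattern] = len(group)
                    expand(group, new_pattern, d + 1)

    expand(rows, ("*",) * n_dims, 0)
    return results


# Dimension values (Cust_grp, City, Age_grp) of tuples 10, 20, 30 and 40.
ROWS = [
    ("Business", "Boston", "Middle"),
    ("Professional", "Chicago", "Young"),
    ("Business", "Chicago", "Middle"),
    ("Education", "New York", "Retired"),
]

patterns = mine_md_patterns(ROWS, min_sup=2)
for pattern, count in patterns.items():
    print(pattern, count)
```

Because a partition that falls below min_sup is never expanded, none of its more specific patterns are examined, which is the pruning idea behind BUC.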

Page 19: Multi-dimensional Sequential Pattern Mining

Experimental results
  • Experiments were run on a Pentium III PC with 1 GB of main memory, using Microsoft Visual C++ 6.0.
  • In this dataset, the number of items is set to 10,000 and the number of sequences is 10,000.
  • The average number of items within each element is 2.5.
  • The average number of elements in one sequence is 8.

Page 20: Multi-dimensional Sequential Pattern Mining

Scalability Over Dimensionality

Page 21: Multi-dimensional Sequential Pattern Mining

Scalability Over Cardinality

Page 22: Multi-dimensional Sequential Pattern Mining

Scalability Over Support Threshold

Page 23: Multi-dimensional Sequential Pattern Mining

Scalability Over Database Size

Page 24: Multi-dimensional Sequential Pattern Mining

Pros & Cons of Algorithms
  • Seq-Dim is efficient and scalable
      • Fastest in most cases
  • UniSeq is also efficient and scalable
      • Fastest with low dimensionality
  • Dim-Seq has poor scalability

Page 25: Multi-dimensional Sequential Pattern Mining

Conclusions
  • MD sequential pattern mining is interesting and useful
  • Mining MD sequential patterns efficiently
      • UniSeq, Dim-Seq, and Seq-Dim
  • Future work
      • Applications of sequential pattern mining

Page 26: Multi-dimensional Sequential Pattern Mining

End of presentation

Page 27: Multi-dimensional Sequential Pattern Mining

References (1)
  • R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, pages 487-499.
  • R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, pages 3-14.
  • K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg CUBEs. SIGMOD'99, pages 359-370.
  • C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32-38, 1998.
  • M. Garofalakis, R. Rastogi, and K. Shim. Spirit: Sequential pattern mining with regular expression constraints. VLDB'99, pages 223-234.
  • J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, pages 106-115.
  • J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.

Page 28: Multi-dimensional Sequential Pattern Mining

References (2)
  • J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, pages 1-12.
  • H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional intertransaction association rules. DMKD'98, pages 12:1-12:7.
  • H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
  • B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, pages 412-421.
  • J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224.
  • R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.