One Small Step for You, One Giant Leap for Me Jen-Wei Huang 黃仁暐 jwhuang@gmail.com National Taiwan University


  • http://www.wretch.cc/blog/EtudeBIKE

  • http://www.giant-bicycles.com/zh-TW/

  • http://cape7.pixnet.net/blog

  • http://www.wretch.cc/blog/orzboyz, http://blog.sina.com.tw/9winds/, http://atomcinema.pixnet.net/blog

  • http://www.amazon.com

  • http://www.hq.nasa.gov/office/pao/History/ap11ann/kippsphotos/apollo.html

  • A General Model for Sequential Pattern Mining with a Progressive Database

    Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen, National Taiwan University

    IEEE Trans. on Knowledge and Data Engineering, Vol. 20, No. 6, June 2008

  • Outline

    Introduction; Preliminaries; Algorithm Pisa; Experiments; Conclusions; Q & A

  • Introduction to SPM

    Sequential pattern mining (SPM) is the mining of frequently occurring patterns related to time or other sequences (J. Han, Data Mining: Concepts and Techniques).

    Given a set of sequences, find the complete set of frequent subsequences (J. Pei, PrefixSpan).

    Example: which items will a customer buy after having bought certain other items?

  • Time-related data

    Customers' buying behavior

    Natural phenomena

    Sensor network data

    Web access patterns

    Stock price changes

    DNA sequence applications

  • Definition

    Let $I = \{x_1, x_2, \ldots, x_n\}$ be a set of different items. An element $e$, denoted by $(x_i x_j \ldots)$, is a subset of the items in $I$ that appear in a sequence at the same time.

    A sequence $s$, denoted by $\langle e_1, e_2, \ldots, e_m \rangle$, is an ordered list of elements. A sequence database $Db$ contains a set of sequences, and $|Db|$ represents the number of sequences in $Db$.

  • Definition

    A sequence $\alpha = \langle a_1, a_2, \ldots, a_n \rangle$ is a subsequence of another sequence $\beta = \langle b_1, b_2, \ldots, b_m \rangle$ if there exists a set of integers $1 \le i_1 < i_2 < \cdots < i_n \le m$ such that $a_1 \subseteq b_{i_1}$, $a_2 \subseteq b_{i_2}$, ..., and $a_n \subseteq b_{i_n}$.
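
The subsequence definition can be checked mechanically. The following is a minimal sketch, assuming elements are represented as Python frozensets of item labels; the names `is_subsequence` and `seq` are illustrative and do not come from the slides.

```python
from typing import FrozenSet, List, Sequence

Element = FrozenSet[str]


def is_subsequence(alpha: Sequence[Element], beta: Sequence[Element]) -> bool:
    """Return True if alpha is a subsequence of beta.

    Greedy scan: each element of alpha is matched to the earliest
    not-yet-used element of beta that contains it as a subset.
    """
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later in beta
    return True


def seq(*elements: str) -> List[Element]:
    """Hypothetical helper: seq("30", "40 70") -> [{"30"}, {"40", "70"}]."""
    return [frozenset(e.split()) for e in elements]


print(is_subsequence(seq("30", "90"), seq("30", "40 70", "90")))  # True
print(is_subsequence(seq("40", "70"), seq("40 70")))              # False
```

The second call is False because the two items would have to be matched to two different elements of the longer sequence, and that sequence has only one.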

  • Definition

    Sequential pattern mining can be defined as: "Given a sequence database, $Db$, and a user-defined minimum support, $min\_sup$, find the complete set of subsequences whose occurrence frequencies are $\ge min\_sup \times |Db|$."
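
As a rough illustration of the support test, the sketch below counts how many sequences contain a pattern and compares the count with min_sup times the database size. It assumes the simplest case, where every element is a single item, so a sequence can be written as a string; the toy database and the function names are made up for the example.

```python
def contains(sequence, pattern):
    """True if the items of pattern appear in sequence in the same order."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so matches must occur in order.
    return all(item in remaining for item in pattern)


def support_count(db, pattern):
    """Number of sequences in db that contain pattern as a subsequence."""
    return sum(contains(s, pattern) for s in db)


def is_frequent(db, pattern, min_sup):
    """Apply the threshold from the definition: count >= min_sup * |Db|."""
    return support_count(db, pattern) >= min_sup * len(db)


# Hypothetical database of four single-item-element sequences.
db = ["abc", "acb", "bc", "ab"]
print(support_count(db, "ab"), is_frequent(db, "ab", 0.5))  # 3 True
print(support_count(db, "cb"), is_frequent(db, "cb", 0.5))  # 1 False
```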

  • Three Categories

    Depending on the management of the corresponding database, sequential pattern mining can be divided into three categories, namely sequential pattern mining with

    a static database,

    an incremental database,

    a progressive database.

  • How To Do Sequential Pattern Mining on a Static Database: An Overview

  • How?

    2006/03/24, jwhuang, National Taiwan University

    Apriori-like algorithms: AprioriAll by Agrawal et al.; GSP by R. Srikant et al.

    Partition-based algorithms: FreeSpan by J. Han et al.; PrefixSpan by J. Pei et al.

    Vertical format algorithms: SPADE by Zaki et al.; SPAM by Ayres et al.

  • Apriori-like Algorithms

    1. Sort phase: sort the database, with customer ID as the primary key and time as the secondary key.

    2. Litemset phase: count the frequency of each itemset, that is, the fraction of customers who bought the itemset.

  • Apriori-like Algorithms

    3. Transformation phase: transform each transaction into all the litemsets it contains, in the form of C01: C02: C03: C04: C05:

  • Transaction database:

        CID  Items
        2    10 20
        5    90
        2    30
        2    40 60 70
        4    30
        3    30 50 70
        1    30
        1    90
        4    40 70
        4    90
        3    10
        5    10
        1    40 70
        5    20
        2    90
        3    20

    Customer sequences after sorting:

        CID  Items
        1    30 90 {40 70}
        2    {10 20} 30 {40 60 70} 90
        3    {30 50 70} 10 20
        4    30 {40 70} 90
        5    90 10 20

    Itemset counts:

        Itemset      Count
        10           3
        20           3
        30           4
        40           3
        50           1
        60           1
        70           4
        90           4
        {10 20}      1
        {40 60}      1
        {40 70}      3
        {60 70}      1
        {40 60 70}   1
        {30 50}      1
        {30 70}      1
        {50 70}      1
        {30 50 70}   1
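
The sort and litemset phases can be reproduced on the example data with a short sketch. Two things are assumptions rather than statements from the slides: the row order of the transaction table is treated as the time order, and the minimum support count is taken to be 2 (a value consistent with the counts shown, but not given explicitly).

```python
from collections import defaultdict
from itertools import combinations

# (customer id, items bought together), in the row order of the table.
transactions = [
    (2, (10, 20)), (5, (90,)), (2, (30,)), (2, (40, 60, 70)),
    (4, (30,)), (3, (30, 50, 70)), (1, (30,)), (1, (90,)),
    (4, (40, 70)), (4, (90,)), (3, (10,)), (5, (10,)),
    (1, (40, 70)), (5, (20,)), (2, (90,)), (3, (20,)),
]

# Sort phase: sorted() is stable, so sorting by customer id alone keeps
# the (assumed) time order within each customer.
sequences = defaultdict(list)
for cid, items in sorted(transactions, key=lambda t: t[0]):
    sequences[cid].append(frozenset(items))

# Litemset phase: an itemset is counted at most once per customer.
support = defaultdict(int)
for elements in sequences.values():
    seen = set()
    for element in elements:
        for size in range(1, len(element) + 1):
            seen.update(frozenset(c) for c in combinations(sorted(element), size))
    for itemset in seen:
        support[itemset] += 1

MIN_COUNT = 2  # assumed threshold, not stated on the slides
litemsets = {s: n for s, n in support.items() if n >= MIN_COUNT}

for itemset, n in sorted(litemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Under these assumptions the surviving litemsets are 10, 20, 30, 40, 70, 90 and {40 70}.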

  • Litemsets and their new identifiers:

        Itemset   Count  New
        10        3      1
        20        3      2
        30        4      3
        40        3      4
        70        4      5
        90        4      6
        {40 70}   3      7

    Transformed database:

        CID  Items
        1    3 6 {4, 5, 7}
        2    {1, 2} 3 {4, 5, 7} 6
        3    {3, 5} 1 2
        4    3 {4, 5, 7} 6
        5    6 1 2
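
A minimal sketch of the transformation step on the same example: each element is replaced by the identifiers of all litemsets it contains, and elements that contain no litemset are dropped. The dictionary layout and the function name `transform` are illustrative choices.

```python
# Litemset -> new identifier, as in the mapping table.
litemset_ids = {
    frozenset({10}): 1, frozenset({20}): 2, frozenset({30}): 3,
    frozenset({40}): 4, frozenset({70}): 5, frozenset({90}): 6,
    frozenset({40, 70}): 7,
}

# Customer sequences after the sort phase.
customers = {
    1: [{30}, {90}, {40, 70}],
    2: [{10, 20}, {30}, {40, 60, 70}, {90}],
    3: [{30, 50, 70}, {10}, {20}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}, {10}, {20}],
}


def transform(sequence):
    """Replace each element by the ids of the litemsets it contains."""
    result = []
    for element in sequence:
        ids = {i for litemset, i in litemset_ids.items() if litemset <= element}
        if ids:  # an element without any litemset disappears
            result.append(ids)
    return result


transformed = {cid: transform(s) for cid, s in customers.items()}
for cid, s in transformed.items():
    print(cid, [sorted(e) for e in s])
```

For customer 2, the element {40 60 70} becomes {4, 5, 7} because it contains the litemsets 40, 70 and {40 70}, while 60 is not a litemset and is ignored.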

  • Apriori-like Algorithms

    4. Mining phase: run an Apriori-like algorithm.

    5. Maximal phase: find the maximal patterns.

  • Candidate sequences of length 2 and their counts in the transformed database:

        Sequence  Count    Sequence  Count    Sequence  Count
        1 2       2        3 4       3        5 6       2
        1 3       1        3 5       3        5 7       0
        1 4       1        3 6       3        6 1       1
        1 5       1        3 7       3        6 2       1
        1 6       1        4 1       0        6 3       0
        1 7       1        4 2       0        6 4       1
        2 1       0        4 3       0        6 5       1
        2 3       1        4 5       0        6 7       1
        2 4       1        4 6       2        7 1       0
        2 5       1        4 7       0        7 2       0
        2 6       1        5 1       1        7 3       0
        2 7       1        5 2       1        7 4       0
        3 1       1        5 3       0        7 5       0
        3 2       1        5 4       0        7 6       2
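
The counts in the mining phase can be recomputed from the transformed database. The sketch below enumerates candidates by brute force instead of using an Apriori-style join, which is a simplification chosen for brevity, and it again assumes a minimum support count of 2.

```python
from itertools import permutations

# Transformed database: one list of id-sets per customer.
db = [
    [{3}, {6}, {4, 5, 7}],
    [{1, 2}, {3}, {4, 5, 7}, {6}],
    [{3, 5}, {1}, {2}],
    [{3}, {4, 5, 7}, {6}],
    [{6}, {1}, {2}],
]


def supports(sequence, pattern):
    """True if the ids of pattern occur in order, each in a later element."""
    pos = 0
    for wanted in pattern:
        while pos < len(sequence) and wanted not in sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def count_all(length):
    """Support count of every ordered pattern of distinct ids 1..7."""
    return {p: sum(supports(s, p) for s in db)
            for p in permutations(range(1, 8), length)}


MIN_COUNT = 2  # assumed threshold, not stated on the slides
counts_2 = count_all(2)
frequent_2 = sorted(p for p, n in counts_2.items() if n >= MIN_COUNT)
frequent_3 = sorted(p for p, n in count_all(3).items() if n >= MIN_COUNT)
print(frequent_2)
print(frequent_3)
```

A real Apriori-like miner would build the length-3 candidates only from frequent length-2 patterns; on a database this small the brute-force enumeration gives the same frequent patterns.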

  • Therefore, the frequent sequential patterns are:

        Sequence  Count
        3 4 6     2
        3 5 6     2
        3 7 6     2

    Mappings:

        Itemset   Count  New
        10        3      1
        20        3      2
        30        4      3
        40        3      4
        70        4      5
        90        4      6
        {40 70}   3      7

  • According to the mappings, the original frequent sequential patterns are:

    After removing every pattern that is contained by another frequent pattern, the final maximal sequential patterns are:
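
The maximal phase can be sketched as follows. The input list is an assumption: it contains the length-2 and length-3 patterns whose count in the example tables is at least 2, which may not be exactly the set shown on the original slide. Each identifier is mapped back to its itemset, and a pattern is kept only if no other pattern contains it.

```python
# New identifier -> original itemset, from the mapping table.
id_to_itemset = {1: {10}, 2: {20}, 3: {30}, 4: {40}, 5: {70}, 6: {90}, 7: {40, 70}}

# Assumed input: patterns with count >= 2 in the example tables.
frequent = [
    (1, 2), (3, 4), (3, 5), (3, 6), (3, 7), (4, 6), (5, 6), (7, 6),
    (3, 4, 6), (3, 5, 6), (3, 7, 6),
]


def contained(alpha, beta):
    """True if alpha is a subsequence of beta (element-wise subset match)."""
    pos = 0
    for a in alpha:
        while pos < len(beta) and not a <= beta[pos]:
            pos += 1
        if pos == len(beta):
            return False
        pos += 1
    return True


# Map identifiers back to the original itemsets.
patterns = [tuple(frozenset(id_to_itemset[i]) for i in p) for p in frequent]

# Keep a pattern only if no different pattern contains it.
maximal = [p for p in patterns
           if not any(p != q and contained(p, q) for q in patterns)]

for p in maximal:
    print([sorted(e) for e in p])
```

With this input, the patterns that survive are 10 followed by 20, and 30 followed by {40 70} followed by 90. For instance, 30 40 90 and 30 70 90 are both contained by 30 {40 70} 90.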

  • Related Works

    Static database: AprioriAll by Agrawal et al.; GSP by R. Srikant et al.; SPADE by Zaki et al.; FreeSpan by J. Han et al.; PrefixSpan by J. Pei et al.; SPAM by Ayres et al.

  • Related Works

    Incremental database: ISM by Parthasarathy et al.; IncSP by Lin et al.; ISE by Masseglia et al.; IncSpan by Cheng et al.; MILE by Chen et al.

  • Motivation

    The assumption of having a static database may not hold in practice. Real-world data change on the fly.

    Sequential patterns found in an incremental database may be of little interest to users, who are usually more interested in recent data than in old data.

  • Motivation

    If a sequence receives no newly arriving elements, it still stays in the database and undesirably contributes to $|Db|$.

    As a result, new sequential patterns that appear frequently in the recent sequences may not be recognized as frequent sequential patterns.

  • Definition: Period of Interest

    The Period of Interest (abbreviated as POI) is a sliding window whose length is a user-specified time interval and which advances continuously as time goes by. The sequences having elements whose timestamps fall into the POI contribute to $|Db|$ for the current sequential patterns.
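
The effect of the POI on |Db| can be illustrated with a small sketch. It assumes that the window at time `now` covers the timestamps from `now - poi + 1` to `now` inclusive; the sequence ids, timestamps and function names are made up for the example.

```python
def current_db(db, now, poi):
    """Keep only the elements inside the POI and the sequences that still have any."""
    start = now - poi + 1  # assumed inclusive window [start, now]
    window = {}
    for sid, elements in db.items():
        recent = [(t, e) for t, e in elements if start <= t <= now]
        if recent:
            window[sid] = recent
    return window


def min_count(db, now, poi, min_sup):
    """Support threshold for the current POI: min_sup times the current |Db|."""
    return min_sup * len(current_db(db, now, poi))


# Hypothetical progressive database: sequence id -> [(timestamp, element), ...]
db = {
    "S01": [(1, "A"), (2, "B")],
    "S02": [(6, "C")],
    "S03": [(3, "A"), (7, "D")],
}

print(sorted(current_db(db, now=5, poi=5)))  # ['S01', 'S03']
print(sorted(current_db(db, now=7, poi=5)))  # ['S02', 'S03']
print(min_count(db, now=7, poi=5, min_sup=0.5))  # 1.0
```

At time 7 the sequence S01 has no element inside the window, so it no longer contributes to |Db|.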

  • [Figure: example progressive sequence database with elements A, B, C and D; axes: time and SID; POI = 5, min_supp = 0.5]

  • Outline

    Introduction; Preliminaries; Algorithm Pisa; Experiments; Conclusions; Q & A

  • Progressive Sequential Pattern

    The progressive sequential pattern mining problem is defined as follows: "Given a progressive sequence database, a user-specified period of interest, and a user-defined minimum support threshold, find the complete set of frequent subsequences whose occurrence frequencies are greater than or equal to the minimum support times the number of sequences in every period of interest of the database."

  • Naive Algorithm

    Use conventional static sequential pattern mining algorithms to mine sequential patterns separately from every POI, e.g., $Db^{1,5}$, $Db^{2,6}$, $Db^{3,7}$, $Db^{4,8}$, $Db^{5,9}$, etc.

    For a sequence database whose elements appear in an interval of $n$ timestamps, the total number of POIs in this interval is equal to $(n - POI + 1)$.
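
The window enumeration behind the naive approach can be sketched as follows. The static miner is passed in as a callback named `mine`, which is a placeholder for any static algorithm; the example database is hypothetical and `len` stands in for the miner only to keep the sketch runnable.

```python
def poi_windows(n, poi):
    """All (start, end) windows of length poi within timestamps 1..n."""
    return [(p, p + poi - 1) for p in range(1, n - poi + 2)]


def naive_progressive(db, n, poi, mine):
    """Re-mine every POI from scratch with a static miner (the naive approach)."""
    results = {}
    for p, q in poi_windows(n, poi):
        sub_db = {}
        for sid, elements in db.items():
            kept = [(t, e) for t, e in elements if p <= t <= q]
            if kept:
                sub_db[sid] = kept
        results[(p, q)] = mine(sub_db)
    return results


print(poi_windows(9, 5))  # [(1, 5), (2, 6), (3, 7), (4, 8), (5, 9)]

# Hypothetical database; `len` just reports |Db| for each window.
db = {
    "S01": [(1, "A"), (2, "B")],
    "S02": [(6, "C")],
    "S03": [(3, "A"), (7, "D")],
}
print(naive_progressive(db, n=7, poi=5, mine=len))
```

With n = 9 and POI = 5 there are 9 - 5 + 1 = 5 windows, and each one triggers a full run of the static miner, which is the cost the progressive approach tries to avoid.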

  • Prior Work

    The only prior work on progressive databases is GSP+ and MFS+, proposed by Zhang et al. and based on the static algorithms GSP and MFS (MFS was also proposed by the same authors).

    However, these algorithms still have to re-mine each sub-database using the static algorithms GSP and MFS.

    Moreover, the performance improvement of GSP+ and MFS+ over GSP and MFS is within 15%, as reported by their authors.

  • Algorithm DirApp

    DirApp stands for Direct Append.

    It consists of two procedures: Progressively Updating (abbreviated as PrUp) and Immediately Filtering (abbreviated as ImFi).

  • Procedure PrUp

    When progressively reading newly incoming elements, Procedure PrUp can

    update each sequence in the sequence database,

    generate candidate sequential patterns,

    calculate the occurrence frequencies of all candidate sequential patterns in the current POI.

  • Procedure ImFi

    DirApp uses Procedure ImFi to

    filter out obsolete data from the existing sequence database,

    prune away obsolete candidate sequential patterns from the candidate set,

    report the most up-to-date frequent sequential patterns to the user in every POI.

  • Example [figure: elements A, B, C, D; steps (1) to (4)]