Page 1

Laboratory Research and Results Presentation
Content and Knowledge Management Laboratory (B)

Data Mining Part

Director: Anthony J. T. Lee

Presenter: Wan-chuen Lin

Page 2

Outline

  • Introduction of basic data mining concepts related to our research topics
  • Brief description of doctoral research
  • Topic 1: Mining frequent itemsets with multi-dimensional constraints
  • Topic 2: Mining the inter-transactional association rules of multi-dimensional interval patterns
  • Topic 3: Inter-sequence association rules mining
  • Topic 4: Mining association rules among time-series data

Page 3

Introduction of Data Mining

Data mining is the task of discovering knowledge from large amounts of data.

One of the fundamental data mining problems, frequent itemset mining, covers a broad spectrum of mining topics, including association rules, sequential patterns, etc.

Frequent itemset mining is to discover all the itemsets whose supports in the database exceed a user-specified threshold.

Page 4

Introduction of Association Rules

An association rule has the form X ⇒ Y, where X and Y are both frequent itemsets in the given database and X ∩ Y = ∅.

The support of X ⇒ Y is the percentage of transactions in the given database that contain both X and Y, i.e., P(X ∪ Y).

The confidence of X ⇒ Y is the percentage of transactions in the given database containing X that also contain Y, i.e., P(Y|X).
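As a minimal sketch of the two measures above, the following computes the support and confidence of a rule X ⇒ Y over a list of transactions. The transactions and the function names are hypothetical and serve only as an illustration.

```python
def support(db, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)


def confidence(db, x, y):
    """P(Y | X): support of X together with Y, divided by the support of X."""
    return support(db, set(x) | set(y)) / support(db, x)


# Hypothetical transactions, used only for illustration.
db = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers"},
    {"milk"},
    {"beer", "milk"},
]

rule_support = support(db, {"diapers", "beer"})          # 2 of 5 transactions
rule_confidence = confidence(db, {"diapers"}, {"beer"})  # 2 of the 3 with diapers
```

Here the rule diapers ⇒ beer holds in 2 of the 5 transactions, and in 2 of the 3 transactions that contain diapers.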

Page 5

Introduction of Sequential Patterns

A sequence is an ordered list of itemsets, denoted by <s1s2…sl>, where sj is an itemset.

sj is also called an element of the sequence, and is denoted as (x1x2…xm), where xk is an item.

The support of a sequence α in a sequence database is the number of tuples containing α.

A sequence α is called a sequential pattern if support(α) ≥ min-support.
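The definitions above can be illustrated with a short sketch. It assumes the usual notion of containment: a tuple contains a sequence if each element of the sequence is a subset of a distinct element of the tuple, in the same order. The sequence database and the threshold are hypothetical.

```python
def contains(seq, pattern):
    """True if `pattern` is a subsequence of `seq`: each pattern element is a
    subset of a later and later element of `seq` (order is preserved)."""
    pos = 0
    for element in pattern:
        # Greedily skip to the earliest element of seq that covers this one.
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def support(seq_db, pattern):
    """Number of tuples in the sequence database that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in seq_db)


# Hypothetical sequence database; each sequence is a list of itemsets.
seq_db = [
    [{"a"}, {"a", "b"}, {"c"}],  # <a(ab)c>
    [{"a", "b"}, {"c"}],         # <(ab)c>
    [{"b"}, {"a"}],              # <ba>
]
MIN_SUPPORT = 2  # hypothetical threshold

is_pattern = support(seq_db, [{"a"}, {"c"}]) >= MIN_SUPPORT  # <ac>
```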

Page 6

Algorithm for Mining Frequent Itemsets

Apriori
  • Candidate set generation-and-test
  • Level-wise: it iteratively generates candidate k-itemsets from the previously found frequent (k-1)-itemsets, and then checks the supports of the candidates to form the frequent k-itemsets.

Lk-1 → (Join) → Ck → (Support Check) → Lk
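The level-wise loop can be sketched as follows. This is a compact illustration of the join and support-check steps rather than an optimized implementation; the transactions and the absolute threshold `min_count` are hypothetical.

```python
from itertools import combinations


def apriori(db, min_count):
    """Return {itemset: count} for every itemset found in at least
    `min_count` transactions."""
    db = [frozenset(t) for t in db]
    level = {frozenset([i]) for t in db for i in t}  # candidate 1-itemsets
    frequent, k = {}, 1
    while level:
        # Support check: count every candidate, keep the frequent ones (Ck -> Lk).
        counts = {c: sum(c <= t for t in db) for c in level}
        lk = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(lk)
        # Join: merge pairs of frequent k-itemsets into (k+1)-itemsets, then
        # prune candidates that have an infrequent k-subset (Lk -> Ck+1).
        k += 1
        level = {a | b for a in lk for b in lk if len(a | b) == k}
        level = {c for c in level
                 if all(frozenset(s) in lk for s in combinations(c, k - 1))}
    return frequent


# Hypothetical transactions, used only for illustration.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = apriori(db, min_count=3)
```

With these transactions, every single item and every pair is frequent, while {a, b, c} appears only twice and is discarded at the support check.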

Page 7

Algorithm for Mining Frequent Itemsets (cont’d)

FP-growth

The method constructs a compressed frequent pattern tree, called the FP-tree.

It uses a divide-and-conquer strategy that recursively decomposes the mining task into a set of smaller tasks on conditional databases, and concatenates the suffix itemset with the frequent itemsets generated from each conditional FP-tree.
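The divide-and-conquer recursion can be sketched without the tree itself. The version below is a simplification: it keeps each conditional database as a plain list of transactions instead of a compressed FP-tree, so it shows the pattern-growth idea (conditional databases plus suffix concatenation) but not the compression. The data and names are hypothetical.

```python
from collections import Counter


def pattern_growth(db, min_count, suffix=()):
    """Yield (itemset, count) for every frequent itemset that extends `suffix`."""
    counts = Counter(item for t in db for item in set(t))
    frequent = sorted(i for i, n in counts.items() if n >= min_count)
    for idx, item in enumerate(frequent):
        pattern = (item,) + suffix  # concatenate the item with the suffix
        yield frozenset(pattern), counts[item]
        # Conditional database of `item`: the transactions containing it,
        # restricted to frequent items earlier in the fixed order, so that
        # each itemset is generated exactly once.
        earlier = set(frequent[:idx])
        cond_db = [[x for x in t if x in earlier] for t in db if item in t]
        yield from pattern_growth(cond_db, min_count, pattern)


# Hypothetical transactions, used only for illustration.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = dict(pattern_growth(db, min_count=3))
```

No candidate itemsets are generated here: each recursive call only counts items inside a smaller conditional database.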

Page 8

Algorithm for Mining Sequential Patterns: PrefixSpan

It first finds the length-1 sequential patterns in the target database, and then partitions the database into smaller projected databases, one for the prefix given by each sequential pattern found so far.

The sequential patterns are mined by constructing the corresponding projected databases and mining each one recursively.

It preserves the element order of each tuple in the mining process.
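A reduced sketch of the prefix-projection idea is given below. It is deliberately limited to sequences whose elements are single items; elements that are itemsets, such as (ab), need extra bookkeeping that is omitted here. The sequences and the threshold are hypothetical.

```python
from collections import Counter


def prefixspan(db, min_count, prefix=()):
    """Yield (pattern, support) for sequences of single items.
    Itemset elements such as (ab) are not handled in this sketch."""
    # Count each item once per sequence of the (projected) database.
    counts = Counter(item for seq in db for item in set(seq))
    for item, n in sorted(counts.items()):
        if n < min_count:
            continue
        pattern = prefix + (item,)
        yield pattern, n
        # Projected database: the suffix after the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        yield from prefixspan(projected, min_count, pattern)


# Hypothetical sequences, written as strings of single-item elements.
db = ["abc", "acb", "bac", "ab"]
patterns = dict(prefixspan(db, min_count=3))
```

Because each projection keeps only what follows the prefix, the order of elements in every tuple is preserved throughout the recursion.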

Page 9

Brief Description of Doctoral Research

Mining calling path patterns in GSM networks
  • Two problems of mining calling path patterns
    • Mining PMFCPs
    • Mining periodic PMFCPs
  • Graph structures [(periodic) frequent calling path graph] and graph-based mining algorithms
    • Depth-first based
    • No candidate paths are generated, and the database is scanned only once if the whole graph structure can be held in main memory.

Page 10

Brief Description of Doctoral Research (cont’d)

Bioinformatic data mining
  • Gene clustering
  • Sequence comparisons, alignments and compression
    • DNA sequence
    • Protein sequence
  • Application
    • Phylogenetic tree to predict the function of a new protein
    • Relationship between DNA sequence & disease

Page 11

Topic 1: Mining Frequent Itemsets with Multi-dimensional Constraints

  • Frequent itemset mining often generates a very large number of frequent itemsets.
  • Only a subset of the frequent itemsets and association rules is of interest to users.
  • Users need additional post-processing to find the useful ones.
  • Constraint-based mining pushes user-specified constraints deep inside the mining process to improve performance.
  • With multi-dimensional items, constraints can be imposed on multiple dimensional attributes.

Page 12

Topic 1: Mining Frequent Itemsets with Multi-dimensional Constraints

Multi-dimensional Constraints

[Table: each item has an itemID and m attributes (dimensions) a1, a2, …, am.]

ik = (k1, k2, …, km); for an item A, A = iA = (A1, A2, …, Am), where A1 = A.a1.

Page 13

Topic 1: Mining Frequent Itemsets with Multi-dimensional Constraints

  • Multi-dimensional constraints can be categorized according to constraint properties:
    • anti-monotone, monotone, convertible and inconvertible
  • They can also be classified according to the number of sub-constraints included:
    • Single constraint against multiple dimensions, Ex: max(S.cost) ≤ min(S.price)
    • Conjunction and/or disjunction of multiple sub-constraints, Ex: a sub-constraint C1 comparing S.cost with v1, combined with a sub-constraint C2 comparing S.price with v2
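To make the single-constraint case concrete, the sketch below evaluates a constraint of the form max(S.cost) ≤ min(S.price) on itemsets. The comparison operator is an assumption here (it is the reading under which the constraint is anti-monotone), and the item attributes are hypothetical values chosen for illustration.

```python
from itertools import combinations

# Hypothetical item attributes; not taken from the slides.
ITEMS = {
    "A": {"cost": 40, "price": 60},
    "B": {"cost": 70, "price": 90},
    "C": {"cost": 50, "price": 55},
}


def satisfies(itemset, items=ITEMS):
    """Assumed constraint over two dimensions: max(S.cost) <= min(S.price)."""
    costs = [items[i]["cost"] for i in itemset]
    prices = [items[i]["price"] for i in itemset]
    return max(costs) <= min(prices)


def is_anti_monotone_on(universe):
    """Check, on this universe only, that every superset of a violating
    itemset also violates the constraint."""
    subsets = [frozenset(c) for r in range(1, len(universe) + 1)
               for c in combinations(sorted(universe), r)]
    return all(not satisfies(t)
               for s in subsets if not satisfies(s)
               for t in subsets if s < t)
```

Adding an item can only raise the maximum cost or lower the minimum price, so once an itemset violates this constraint all of its supersets do as well. That property is what allows the constraint to prune the search early.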

Page 14

Topic 1: Mining Frequent Itemsets with Multi-dimensional Constraints

  • We extend constraints to multi-dimensional itemsets and develop algorithms for mining frequent itemsets with multi-dimensional constraints by extending CFG (Constrained Frequent Pattern Growth).
  • Overview of our algorithm
    • Phase 1: Frequency check
    • Phase 2: Constraint check
    • Phase 3: Conditional database construction

Page 15

Example: Cam: max(S.cost) ≤ min(S.price)

Database: BECA, BEA, DA, BDA, BDE, BDECA, BEC, BDEC, DEC, BDC
  Frequent items: B, D, E, C, A
  C(BDECA)=false
  C(B)=true, C(D)=true, C(E)=true, C(C)=true, C(A)=true

A-conditional Database: BEC, BE, D, BD, BDEC
  Frequent items: B, D, E, C
  C(BDECA)=false
  C(BA)=false, C(DA)=true, C(EA)=true, C(CA)=false

EA-conditional Database: D
  Frequent items: ∅
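The conditional databases in this example can be reproduced with a short sketch of the construction phase. The constraint outcomes are taken from the example as given (only D and E satisfy the constraint together with A), because the cost and price values are not listed. The minimum count of 2 is an assumption that happens to be consistent with the frequent items shown.

```python
from collections import Counter


def frequent_items(db, min_count):
    """Items that occur in at least `min_count` transactions."""
    counts = Counter(x for t in db for x in set(t))
    return {x for x, n in counts.items() if n >= min_count}


def conditional_db(db, item, keep):
    """Keep the transactions containing `item`, drop `item` itself, retain
    only the items in `keep`, and discard transactions left empty."""
    projected = []
    for t in db:
        if item in t:
            rest = "".join(x for x in t if x != item and x in keep)
            if rest:
                projected.append(rest)
    return projected


DATABASE = ["BECA", "BEA", "DA", "BDA", "BDE", "BDECA", "BEC", "BDEC", "DEC", "BDC"]

# All items pass the frequency and constraint checks, so nothing is pruned yet.
a_db = conditional_db(DATABASE, "A", keep="BDEC")
# Only D and E passed the constraint check together with A.
ea_db = conditional_db(a_db, "E", keep="DE")
```

The single remaining transaction in the EA-conditional database makes D infrequent there, so the recursion stops at that branch.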

Page 16

Topic 2: Mining Inter-transactional Association Rules of Multi-dimensional Interval Patterns

A transaction could be the items bought by the same customer, the events that happened on the same day, and so on.

Intra-transactional association rules: associations among items within the same transaction.
  Ex: buy(X, diapers) => buy(X, beer) [support=80%]

Inter-transactional association rules: associations among items in different transactions.
  Ex: If the prices of IBM and SUN go up, Microsoft’s will most likely [80%] increase the next day.

Page 17

Topic 2: Mining Inter-transactional Association Rules of Multi-dimensional Interval Patterns

Interval data differ from point data in that they occupy regions of non-zero size.

Multi-dimensional intervals can be represented as line segments (1-D), rectangles (2-D), hyper-cubes (n-D), etc.

  • Extended item: denoted as (Location)<Size>
  • Reference point: the smallest (Location) among all (Location)<Size>.
  • Maxspan: a sliding window; only associations covered by it are considered.

Page 18

Example

There are two cubes in the 3-dimensional space: 0,2,1<1,1,1> and 1,1,0<2,2,1>.

Reference point: (0,1,0)

Relative to the reference point, the two items are denoted as 0,1,1<1,1,1> and 1,0,0<2,2,1>.

[Figure: the two cubes 0,2,1<1,1,1> and 1,1,0<2,2,1> drawn in the 3-dimensional space]
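The normalization in this example can be checked with a few lines of code. The sketch assumes that the "smallest (Location)" means the component-wise minimum over all locations, which is the reading that reproduces the reference point (0,1,0) above. The function name and the tuple representation are hypothetical.

```python
def normalize(items):
    """Shift every (location, size) item so that the reference point, taken as
    the component-wise minimum of all locations, becomes the origin."""
    dims = len(items[0][0])
    ref = tuple(min(loc[d] for loc, _ in items) for d in range(dims))
    shifted = [(tuple(c - r for c, r in zip(loc, ref)), size)
               for loc, size in items]
    return ref, shifted


# The two cubes from the example, as (location, size) pairs.
cubes = [((0, 2, 1), (1, 1, 1)), ((1, 1, 0), (2, 2, 1))]
ref, shifted = normalize(cubes)
```

Only the locations move; the sizes are unchanged, so two occurrences of the same spatial arrangement map to the same set of extended items.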

Page 19

Algorithm (Apriori-like) Example

Support: 10% (10% * 20 = 2)
Maxspan: 4

L1:
  0,0<1,1>
  0,0<1,2>
  0,0<1,3>
  0,0<2,1>

Page 20

Algorithm (Apriori-like) Example (cont’d)

Reminder: Apriori-like algorithm: Lk-1 → (Join) → Ck → (Support Check) → Lk

L2:
  {0,0<1,1>, 1,1<2,1>}, {1,0<1,1>, 0,1<1,2>}, {0,0<1,2>, 2,0<2,1>}, {0,0<1,3>, 3,0<1,2>}

L3:
  {3,0<1,1>, 2,1<1,2>, 0,3<1,3>}, {1,0<1,1>, 0,1<1,2>, 2,1<2,1>}, {3,0<1,1>, 0,3<1,3>, 4,1<2,1>}, {2,0<1,2>, 0,2<1,3>, 4,0<2,1>}

L4:
  {0,3<1,3>, 4,1<2,1>, 2,1<1,2>, 3,0<1,1>}

Page 21

Topic 3: Inter-sequence Association Rules Mining

Inter-sequence model

[Figure: ten transactions (Transaction ID 1 to 10) placed along a Transaction Time axis (1 to 10), each holding one sequence. Sequences shown: <c(ab)d(ad)>, <ab>, < >, <dd(ac)bd>, <bc>, <ceacc(ce)>, <acc>, <(bc)cb>, <e(ac)bac>, <b(ab)cc>.]

Page 22

Topic 3: Inter-sequence Association Rules Mining (cont’d)

  • Extended sequence (denoted as Δt<s1s2…sl>): a sequence s = <s1s2…sl> at time point Δt.
  • Algorithm:
    • Step 1: Use PrefixSpan to find all sequential patterns.
    • Step 2: Use an Apriori-like method to check whether some extended sequence set is large.
  • Use L-bucket (List-bucket) & C-bucket (candidate-bucket) to improve mining efficiency.

Page 23

Example

min_support = 3, maxspan = 2

The database:

  Tran. ID   Tran. Time   Sequence
  1          1            <c(ab)d(ad)>
  2          2            <(bc)cb>
  3          3            <e(ac)bac>
  4          4            <b(ab)cc>
  5          5            <(ab)c>
  6          6            <dd(ac)bd>
  7          7            <bc>
  8          8            <acc>
  9          9            <ab>
  10         10           <ceacc(ce)>

Sequential patterns (found by PrefixSpan):
  • <a>, <b>, <c>
  • <ab>, <(ab)>, <ac>, <ba>, <bc>, <cb>, <cc>
  • <acc>

Page 24

Example (cont’d)

PrefixSpan Result:
  <a>, <b>, <c>
  <ab>, <(ab)>, <ac>, <ba>, <bc>, <cb>, <cc>
  <acc>

L1:
  {Δ0<a>}, {Δ0<b>}, {Δ0<c>}

Candidates C2:
  {Δ0<a>, Δ1<a>}, {Δ0<a>, Δ2<a>},
  {Δ0<a>, Δ1<b>}, {Δ0<b>, Δ1<a>},
  {Δ0<a>, Δ2<b>}, {Δ0<b>, Δ2<a>},
  {Δ0<a>, Δ1<c>}, {Δ0<c>, Δ1<a>},
  {Δ0<a>, Δ2<c>}, {Δ0<c>, Δ2<a>},
  {Δ0<b>, Δ1<b>}, {Δ0<b>, Δ2<b>},
  {Δ0<b>, Δ1<c>}, {Δ0<c>, Δ1<b>},
  {Δ0<b>, Δ2<c>}, {Δ0<c>, Δ2<b>},
  {Δ0<c>, Δ1<c>}, {Δ0<c>, Δ2<c>}
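The candidate generation and support check for this example can be sketched as follows, using the ten sequences of the example database with maxspan = 2 and min_support = 3. Two points are assumptions made for the sketch: the support of an extended sequence set is counted as the number of start times t at which every Δk pattern is contained in the sequence at time t + k, and the helper names `parse`, `contains`, `candidates_c2` and `support` are hypothetical.

```python
from itertools import product


def parse(seq):
    """Parse '<c(ab)d>' into a list of itemsets: [{'c'}, {'a', 'b'}, {'d'}]."""
    elements, group = [], None
    for ch in seq.strip("<>"):
        if ch == "(":
            group = set()
        elif ch == ")":
            elements.append(group)
            group = None
        elif group is not None:
            group.add(ch)
        else:
            elements.append({ch})
    return elements


def contains(seq, pattern):
    """True if `pattern` is a subsequence of `seq` (element-wise subset, in order)."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def candidates_c2(l1, maxspan):
    """Pair every pattern at offset 0 with every pattern at offsets 1..maxspan."""
    return [((0, x), (dt, y))
            for x, y in product(l1, repeat=2)
            for dt in range(1, maxspan + 1)]


def support(db, ext_set):
    """Number of start times t such that each (dt, pattern) holds at time t + dt."""
    return sum(
        all(t + dt in db and contains(db[t + dt], parse(p)) for dt, p in ext_set)
        for t in db
    )


# Transaction time -> sequence, as in the example database.
DATABASE = {
    1: "<c(ab)d(ad)>", 2: "<(bc)cb>", 3: "<e(ac)bac>", 4: "<b(ab)cc>",
    5: "<(ab)c>", 6: "<dd(ac)bd>", 7: "<bc>", 8: "<acc>", 9: "<ab>",
    10: "<ceacc(ce)>",
}
db = {t: parse(s) for t, s in DATABASE.items()}

c2 = candidates_c2(["<a>", "<b>", "<c>"], maxspan=2)
large = [c for c in c2 if support(db, c) >= 3]
```

Under this way of counting, the 18 generated candidates match the list above, and each of them reaches min_support = 3.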

Page 25

Example (cont’d)

L2:
  {Δ0<ab>}, {Δ0<(ab)>}, {Δ0<ac>},
  {Δ0<ba>}, {Δ0<bc>},
  {Δ0<cb>}, {Δ0<cc>},
  {Δ0<a>, Δ1<a>}, {Δ0<a>, Δ2<a>},
  {Δ0<a>, Δ1<b>}, {Δ0<b>, Δ1<a>},
  {Δ0<a>, Δ2<b>}, {Δ0<b>, Δ2<a>},
  {Δ0<a>, Δ1<c>}, {Δ0<c>, Δ1<a>},
  {Δ0<a>, Δ2<c>}, {Δ0<c>, Δ2<a>},
  {Δ0<b>, Δ1<b>}, {Δ0<b>, Δ2<b>},
  {Δ0<b>, Δ1<c>}, {Δ0<c>, Δ1<b>},
  {Δ0<b>, Δ2<c>}, {Δ0<c>, Δ2<b>},
  {Δ0<c>, Δ1<c>}, {Δ0<c>, Δ2<c>}

PrefixSpan Result:
  <a>, <b>, <c>
  <ab>, <(ab)>, <ac>, <ba>, <bc>, <cb>, <cc>
  <acc>

[Diagram labels: C2; Apriori-like: Lk-1 → Ck → Lk]

Page 26

Topic 4: Mining Association Rules among Time-series Data

  • A line is an ordered and continuous list of the form {t1, t2, …, tm} describing the property of the subject over time.
  • Step 1: find the frequent lines and points in each line-set. (Apriori-like algorithm)
  • Step 2: use combinations of those frequent sets to find the associations among them. (inter-transaction association rules)

Page 27

Topic 4: Mining Association Rules among Time-series Data

Page 28

Time-series Data Approximation

For the algorithm’s efficiency

Equally partition the fluctuation rate into several classes.
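One way to read this approximation step is sketched below. It assumes that the fluctuation rate is the relative change between consecutive points and that the classes are equal-width bins over a fixed range, with out-of-range rates clamped to the first or last class. The range, the number of classes and the sample series are all hypothetical.

```python
def fluctuation_rates(series):
    """Relative change between consecutive points: (x[t+1] - x[t]) / x[t]."""
    return [(b - a) / a for a, b in zip(series, series[1:])]


def to_class(rate, low=-0.10, high=0.10, n_classes=5):
    """Map a rate to one of `n_classes` equal-width classes over [low, high];
    rates outside the range fall into the first or last class."""
    width = (high - low) / n_classes
    index = int((rate - low) // width)
    return max(0, min(n_classes - 1, index))


# Hypothetical series, used only for illustration.
series = [100, 105, 105, 98]
classes = [to_class(r) for r in fluctuation_rates(series)]
```

Replacing raw values with a handful of class labels shrinks the item alphabet, which keeps the number of candidates in the later Apriori-like steps manageable.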

Page 29

Step 1: Line Discovery (Apriori-like)

Step 2: Association Rule Mining

Page 30

Data Mining Part

Thank You!