Improving the Performance of the FP-growth Data Mining Algorithm on Huge Databases

CHEN-HUNG Lin, 2010.05.04

Department of Computer Science and Information Engineering, National University of Kaohsiung (Graduate Institute)

Master's thesis, graduate student: 黃正男

Contents

1. Introduction
2. Item Partition
3. Generation of Frequent Itemsets
4. Finding Cross-Group Frequent Itemsets
5. Conclusions

Introduction

Apriori may require repeated database scans and incur a high computational cost.

The Frequent-Pattern tree (FP-tree) may not fit in memory when all of its nodes are generated from a huge database.

[Figure: an FP-tree for items B, A, D held in memory, with nodes A: 5, B: 9, D: 3, and D: 4 below the root]

Introduction

TID   Domain items (A, B, C, D, E)
01    A, B, C, D, E
02    B, C, D, E
03    A, C, D, E
04    A, B, C, D

{A, B, C, D, E} is split into {A, B, C} and {D, E}, two independent groups.

Independent groups: the itemsets that cross groups are infrequent, e.g., ABD, BCE, ...

Introduction

TID   Domain items (A, B, C, D, E)
01    A, B, C
02    C, D, E
03    A, C, D, E
04    A, B, C, D

[Figure: graph over items A, B, C, D, E whose edges are labeled with 2-itemset counts (values 1, 2, and 3)]

min_support = 2: {A, B, C, D, E} = {A, B, C, D, E}

min_support = 3: {A, B, C, D, E} = {C, D}, {A}, {B}, {E}

Introduction

{A, B, C, D, E, F, G, H, I, J}
  -> {A, B, C, D, E, F, G, H} and {I, J}: independent groups

{A, B, C, D, E, F, G, H}: item number > threshold 3
  -> {A, B, C}, {D, E, F}, {G, H}: dependent groups

Each dependent group: FP-tree / FP-Growth -> FIT(A, B, C), FIT(D, E, F), FIT(G, H)

Merge -> all frequent itemsets

How to divide a big group?
How to find all missed frequent itemsets?


Item-partition algorithm

Start

TID   Items                    TID   Items
01    A, B, C, D, F, K         06    H, J, K
02    B, C, D, E, F            07    A, B, C, F, J, K
03    A, B, C, E, F, J, K      08    A, B, C, D, F
04    A, B, C, D, E, F         09    A, C, D, F, G, J, K
05    B, C, D, F, J, K         10    A, C, D, F, G

min_support = 5

Generate & Counts

2-itemset count   2-itemset count   2-itemset count   2-itemset count   2-itemset count
AB        5       BC        7       CE        3       DH        0       FH        0
AC        6       BD        5       CF        8       DJ        2       FJ        4
AD        5       BE        3       CG        2       DK        3       FK        5
AE        2       BF        6       CH        0       EF        3       GH        0
AF        6       BG        0       CJ        4       EG        0       GJ        1
AG        2       BH        0       CK        4       EH        0       GK        1
AH        0       BJ        3       DE        2       EJ        1       HJ        1
AJ        3       BK        4       DF        7       EK        1       HK        1
AK        4       CD        7       DG        2       FG        2       JK        5

Frequent 2-itemsets

AB, AC, AD, AF, BC, BD, BF, CD, CF, DF, FK, JK

Initially set

Partition: {A}, {B}, {C}, {D}, {E}, {F}, {G}, {H}, {J}, {K}

Merged

Frequent 2-itemset   Partition
-                    {A}, {B}, {C}, {D}, {E}, {F}, {G}, {H}, {J}, {K}
AB                   {A, B}, {C}, {D}, {E}, {F}, {G}, {H}, {J}, {K}
AC                   {A, B, C}, {D}, {E}, {F}, {G}, {H}, {J}, {K}
AD                   {A, B, C, D}, {E}, {F}, {G}, {H}, {J}, {K}
AF                   {A, B, C, D, F}, {E}, {G}, {H}, {J}, {K}
BC                   {A, B, C, D, F}, {E}, {G}, {H}, {J}, {K}
BD                   {A, B, C, D, F}, {E}, {G}, {H}, {J}, {K}
BF                   {A, B, C, D, F}, {E}, {G}, {H}, {J}, {K}
CD                   {A, B, C, D, F}, {E}, {G}, {H}, {J}, {K}
CF                   {A, B, C, D, F}, {E}, {G}, {H}, {J}, {K}
DF                   {A, B, C, D, F}, {E}, {G}, {H}, {J}, {K}
FK                   {A, B, C, D, F, K}, {E}, {G}, {H}, {J}
JK                   {A, B, C, D, F, J, K}, {E}, {G}, {H}

min_support = 5

{A, B, C, D, F, J, K},{E},{G},{H}

Check

Output & Exit
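The merging of single-item groups by frequent 2-itemsets can be sketched as follows. This is a minimal illustration, not the thesis implementation: the union-find structure and the name `item_partition` are assumptions made here, and the transactions are the ten listed on the slide as transcribed, so individual pair counts recomputed by this code may not match every entry in the count table.

```python
from collections import Counter
from itertools import combinations


def item_partition(transactions, min_support):
    """Group items so that every frequent 2-itemset lies inside one group."""
    items = sorted({item for t in transactions for item in t})

    # Generate and count all 2-itemsets in one pass over the transactions.
    pair_counts = Counter()
    for t in transactions:
        pair_counts.update(combinations(sorted(set(t)), 2))

    # Initially every item is its own group (union-find parent table).
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the two groups joined by each frequent 2-itemset.
    for (a, b), count in sorted(pair_counts.items()):
        if count >= min_support:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    # Largest group first, ties broken alphabetically.
    return sorted(groups.values(), key=lambda g: (-len(g), g))


# Transactions 01 to 10 from the slide, one string of item letters each.
transactions = [
    "ABCDFK", "BCDEF", "ABCEFJK", "ABCDEF", "BCDFJK",
    "HJK", "ABCFJK", "ABCDF", "ACDFGJK", "ACDFG",
]
partition = item_partition(transactions, min_support=5)
print(partition)
```

The order in which frequent pairs are merged does not change the final groups, since the result is the set of connected components of the frequent-pair graph.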

Refine-partition (β = 3)

Start

Set upper bound: upper bound = ∞

Set the score for each 2-itemset in {A, B, C, D, F, J, K} (a frequent 2-itemset scores 0, any other 2-itemset scores 1)

2-itemset score   2-itemset score   2-itemset score
AB        0       BD        0       CK        1
AC        0       BF        0       DF        0
AD        0       BJ        1       DJ        1
AF        0       BK        1       DK        1
AJ        1       CD        0       FJ        1
AK        1       CF        0       FK        0
BC        0       CJ        1       JK        0

Frequent 2-itemsets: AB, AC, AD, AF, BC, BD, BF, CD, CF, DF, FK, JK

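Set root node

{A, B, C, D, F, J, K}   LB = 0

Generate child nodes

{A, B, C, D, F, J, K}   LB = 0
  {A, B, C} {D, F, J, K}   LB = (to be calculated)
  {A, B, D} {C, F, J, K}   LB = (to be calculated)
  {A, J, K} {B, C, D, F}   LB = (to be calculated)
  {A, B} {C, D, F, J, K}   LB = (to be calculated)

{A, B, C, D, F, J, K} has 7 items. With β = 3, 7/3 = 2.333, so the items are split into 3 groups of 2 or 3 items each.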


Calculate the lower bound

Node: {A, B, C} {D, F, J, K}

Decided part {A, B, C}:
  {A, B} = 0, {A, C} = 0, {B, C} = 0
  Sdecide = 0 + 0 + 0 = 0

Undecided part {D, F, J, K}:
  {D, F} = 0, {D, J} = 1, {D, K} = 1, {F, J} = 1, {F, K} = 0, {J, K} = 0
  Sorted by score: {D, F} = 0, {F, K} = 0, {J, K} = 0, {D, J} = 1, {D, K} = 1, {F, J} = 1
  Sundecide = 0 + 0 + 0 = 0


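[Equation: the number of sorted 2-itemset scores summed for the undecided part, written in terms of C(·, 2), mod, R, and n]

{A, B, C} {D, F, J, K}   LB = 0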

Start
Set upper bound: upper bound = ∞
Set the score
Set root node
Calculate the lower bound
Stop node
Choose
Replace upper bound
End

{A, B, C, D, F, J, K}   LB = 0
  {A, B, C} {D, F, J, K}   LB = 0
  {A, B, D} {C, F, J, K}   LB = 0
  {A, J, K} {B, C, D, F}   LB = 2
  {A, B} {C, D, F, J, K}   LB = 5

upper bound = 0

  {A, B, C} {D, F} {J, K}   LB = 0
  {A, B, C} {D, K} {F, J}   LB = 2
  {A, B} {C, D, K} {F, J}   LB = 3

The proposed item-partition

Before refinement: {A, B, C, D, F, J, K}, {E}, {G}, {H}

After refinement: {A, B, C}, {D, F}, {J, K}, {E}, {G}, {H}
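The scoring used by the refinement can be illustrated with a short sketch. It assumes, based on the score table, that a frequent 2-itemset scores 0 and any other 2-itemset scores 1, and that the value of a fully decided node is the sum of the scores of all 2-itemsets inside its groups. The function names are hypothetical, and the lower bound of partially decided nodes is not modeled here.

```python
from itertools import combinations

# Frequent 2-itemsets of the running example (min_support = 5).
FREQUENT_PAIRS = {
    "AB", "AC", "AD", "AF", "BC", "BD",
    "BF", "CD", "CF", "DF", "FK", "JK",
}


def pair_score(a, b):
    """Score 0 for a frequent 2-itemset, 1 otherwise."""
    return 0 if "".join(sorted((a, b))) in FREQUENT_PAIRS else 1


def group_score(group):
    """Sum of the scores of every 2-itemset inside one group."""
    return sum(pair_score(a, b) for a, b in combinations(group, 2))


def partition_score(groups):
    """Total score of a fully decided partition."""
    return sum(group_score(g) for g in groups)


# The three fully decided nodes shown in the search tree.
for groups in (["ABC", "DF", "JK"], ["ABC", "DK", "FJ"], ["AB", "CDK", "FJ"]):
    print(groups, partition_score(groups))
```

Under these assumptions the three partitions score 0, 2, and 3, so {A, B, C} {D, F} {J, K} is the one that keeps no infrequent pair inside a group.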


Generation of Frequent Itemsets

STEP 1: Generate an initial MFPT with only the empty root node.

STEP 2: Set the initial count of each item in the given group to 0.

G = {A, B, D}

MFPT: root (no other nodes)

item   count
A      0
B      0
D      0

Algorithm (cont.)

STEP 3: Read a transaction from the given data set D and delete the items that do not appear in G.

STEP 4: If an item in G appears in the transaction, increase its count by 1.

STEP 5: Repeat STEPs 3 and 4 until all the transactions are processed.

Example with G = {A, B, D}:

TID   Items                    TID   Items kept after STEP 3
01    A, B, C, D, F, G    ->   01    A, B, D

Counts after TID 01:

item   count
A      1
B      1
D      1

Counts after all transactions:

item   count
A      5
B      9
D      7

Algorithm (cont.)

STEP 6: Compare the item counts with min_support and remove the items that are not frequent.

STEP 7: Sort the items in G according to their final counts.

STEP 8: Sequentially read a transaction T from the given data set D.

item   count
A      5
B      9
D      7

Sorted order = (B, D, A)

TID   Items
01    A, B, C, D, F, G

Algorithm (cont.)

STEP 9: Generate a tree path P from the transaction T with only the frequent items, according to the sorted order in STEP 7. Merge P into the MFPT in a way similar to FPT.

STEP 10: Increase the count of each node of P in the MFPT by 1 and add the transaction ID (TID) of T to the last node of P.

TID   Items
01    A, B, C, D, F, G

Path P for TID 01 in sorted order: B, D, A

MFPT after TID 01: root -> B: 1 -> D: 1 -> A: 1 (TIDs = 01)

Algorithm (cont.)

STEP 11: Repeat STEPs 8 to 10 until all transactions in D are processed.

root
  B: 9
    D: 7 (TIDs = 02, 04, 06, 10)
      A: 3 (TIDs = 01, 03, 08)
    A: 2 (TIDs = 05, 09)
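STEPs 1 to 11 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Node` class, `build_mfpt`, and the min_support value of 3 are hypothetical, and only transaction 01 comes from the slides. The other transactions are made up here so that the item counts (A 5, B 9, D 7) and the resulting tree agree with the example.

```python
from collections import Counter


class Node:
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.tids = []      # TIDs of transactions whose path ends at this node
        self.children = {}  # item -> child Node


def build_mfpt(transactions, group, min_support):
    # STEPs 2-5: count only the items that belong to the group.
    counts = Counter(
        item for items in transactions.values() for item in set(items) if item in group
    )
    # STEPs 6-7: drop infrequent items, sort the rest by descending count.
    order = sorted(
        (i for i in counts if counts[i] >= min_support), key=lambda i: (-counts[i], i)
    )
    rank = {item: r for r, item in enumerate(order)}

    root = Node()  # STEP 1: the MFPT starts as an empty root.
    # STEPs 8-11: insert one path per transaction.
    for tid, items in transactions.items():
        path = sorted((i for i in set(items) if i in rank), key=rank.get)
        if not path:
            continue
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
        node.tids.append(tid)  # the TID is stored on the last node of the path
    return root, order


# Hypothetical data set; only TID 01 is taken from the slides.
transactions = {
    "01": "ABCDFG", "02": "BDE", "03": "ABD", "04": "BD", "05": "AB",
    "06": "BD", "07": "CE", "08": "ABD", "09": "AB", "10": "BD",
}
root, order = build_mfpt(transactions, group={"A", "B", "D"}, min_support=3)
b = root.children["B"]
d = b.children["D"]
print(order, b.count, d.count, d.tids, d.children["A"].tids, b.children["A"].tids)
```

Storing each TID only on the last node of its path keeps the TID lists disjoint across nodes; the full TID list of an itemset can then be collected from the subtree below it.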

The Enumeration Tree

The enumerated order: (B, BD, BA, BDA, D, DA, A)

TID lists read from the MFPT (B: 9; D: 7 with TIDs 02, 04, 06, 10; A: 3 with TIDs 01, 03, 08; A: 2 with TIDs 05, 09):

{B}(01, 02, 03, 04, 05, 06, 08, 09, 10)

{BD}(01, 02, 03, 04, 06, 08, 10)

{BA}(01, 03, 05, 08, 09)

{BDA}(01, 03, 08)

{D}(01, 02, 03, 04, 06, 08, 10)

{DA}(01, 03, 08)

{A}(01, 03, 05, 08, 09)
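The enumerated order and the TID list of each itemset can be reproduced with a small sketch. Note the swapped-in technique: instead of walking the MFPT, this code intersects the TID sets of the single items B, D, and A given above, which yields the same lists for this example. The function names are hypothetical.

```python
from itertools import combinations


def enumerate_itemsets(order):
    """Yield itemsets in the order B, BD, BA, BDA, D, DA, A for order (B, D, A)."""
    for i, first in enumerate(order):
        rest = order[i + 1:]
        # For each leading item, list shorter extensions before longer ones.
        for k in range(len(rest) + 1):
            for tail in combinations(rest, k):
                yield (first,) + tail


def tid_list(itemset, item_tids):
    """TIDs that contain every item of the itemset."""
    return sorted(set.intersection(*(item_tids[i] for i in itemset)))


item_tids = {
    "B": {"01", "02", "03", "04", "05", "06", "08", "09", "10"},
    "D": {"01", "02", "03", "04", "06", "08", "10"},
    "A": {"01", "03", "05", "08", "09"},
}
for itemset in enumerate_itemsets(["B", "D", "A"]):
    print("".join(itemset), tid_list(itemset, item_tids))
```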


Dependent groups, each processed with FP-tree / FP-Growth:

{A, B, D} -> FIT(A, B, D)
  {BDA}(01:3, 03:3, 08:3)
  {BA}(01:2, 03:2, 05:2, 08:2, 09:2)
  …

{C, E, F} -> FIT(C, E, F)
  {CEF}(01:3, 02:3, 03:2, 04:1)
  {CF}(01:1, 03:1)
  …

{G, H, I} -> FIT(G, H, I)
  {GHI}(01:3, 02:3, 03:2, 04:1)
  {GI}(01:1, 03:1)
  …

Merge


[Figure: the overview flow from {A, B, C, D, E, F, G, H, I, J} to FIT(A, B, C), FIT(D, E, F), and FIT(G, H), with the FITs combined (X) in the merge step]


Itemsets from the first group:

A(10, 20, 30, 50, 60, 80)
AB(10, 20, 30, 50)
ABC(10, 20, 30)

CFI1 = ABC(10:3, 20:3, 30:3, 50:2, 60:1, 80:1)

Itemsets from the second group:

D(10, 20, 30, 35, 70, 80)
DE(10, 20, 30, 35)
DEF(10, 20, 30)

CFI2 = DEF(10:3, 20:3, 30:3, 35:2, 70:1, 80:1)
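The `TID:k` notation appears to record, for each TID, how many itemsets of the nested chain (A, AB, ABC) that transaction supports. The sketch below encodes and decodes a chain under that reading; the interpretation and the names `compress_chain` and `expand_chain` are assumptions drawn from the example values, not a definition given in the slides.

```python
def compress_chain(chain):
    """Encode nested TID lists (shortest itemset first) as {tid: depth}."""
    depth = {}
    for level, tids in enumerate(chain, start=1):
        for tid in tids:
            depth[tid] = level  # a deeper level overwrites a shallower one
    return depth


def expand_chain(depth, levels):
    """Recover the TID list of each level from the {tid: depth} encoding."""
    return [
        sorted(tid for tid, d in depth.items() if d >= level)
        for level in range(1, levels + 1)
    ]


# TID lists of A, AB, and ABC from the example.
chain = [[10, 20, 30, 50, 60, 80], [10, 20, 30, 50], [10, 20, 30]]
cfi = compress_chain(chain)
print(cfi)
print(expand_chain(cfi, levels=3))
```

The encoding relies on the chain being nested: every TID of a longer itemset must also appear in the TID lists of its prefixes, which holds because adding items can only remove supporting transactions.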

Combining A(10, 20, 30, 50, 60, 80) with D(10, 20, 30, 35, 70, 80), DE(10, 20, 30, 35), and DEF(10, 20, 30):

AD(10, 20, 30, 80)
ADE(10, 20, 30)
ADEF(10, 20, 30)

ADEF(10:3, 20:3, 30:3, 80:1)

Extending with B:

ABD(10, 20, 30)
ABDE(10, 20, 30)
ABDEF(10, 20, 30)

ABDEF(10:3, 20:3, 30:3)

Extending with C:

ABCD(10, 20, 30)
ABCDE(10, 20, 30)
ABCDEF(10, 20, 30)

ABCDEF(10:3, 20:3, 30:3)
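The cross-group step above amounts to intersecting the TID lists of one itemset from each group and keeping the combinations that are still frequent. A minimal sketch follows; the function name and the min_support value of 3 are assumptions made for illustration.

```python
def cross_group(fit1, fit2, min_support):
    """Combine every itemset of fit1 with every itemset of fit2 by TID intersection."""
    result = {}
    for set1, tids1 in fit1.items():
        for set2, tids2 in fit2.items():
            common = sorted(set(tids1) & set(tids2))
            if len(common) >= min_support:
                result[set1 + set2] = common
    return result


# TID lists from the example.
fit_abc = {
    "A": [10, 20, 30, 50, 60, 80],
    "AB": [10, 20, 30, 50],
    "ABC": [10, 20, 30],
}
fit_def = {
    "D": [10, 20, 30, 35, 70, 80],
    "DE": [10, 20, 30, 35],
    "DEF": [10, 20, 30],
}
cross = cross_group(fit_abc, fit_def, min_support=3)
for itemset, tids in cross.items():
    print(itemset, tids)
```

This sketch tests every pair of itemsets. Because the itemsets of each group form nested chains, a combination can be skipped as soon as a shorter combination on the same chain is already infrequent.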

First group:

{ABC}(01, 10)
{AB}(01, 05, 10, 11)
{A}(01, 05, 06, 10, 11)

Second group:

{DEF}(01, 10)
{DE}(01, 05, 10, 11)
{D}(01, 05, 07, 10, 11)

Combined (X):

{ADE}(01:2, 05:2, 10:2, 11:2)
{ABDE}(01:2, 05:2, 10:2, 11:2)

FIT(A, B, C)
FIT(A, B, C) X FIT(D, E, F)
FIT(A, B, C) X FIT(D, E, F) X FIT(G, H, I)
FIT(A, B, C) X FIT(D, E, F) X FIT(G, H, I) X FIT(J, K)
FIT(A, B, C) X FIT(D, E, F) X FIT(J, K)
FIT(A, B, C) X FIT(G, H, I)
FIT(A, B, C) X FIT(G, H, I) X FIT(J, K)
FIT(A, B, C) X FIT(J, K)
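The order in which the FITs are combined above matches a depth-first enumeration of all group combinations that start with the first group. A minimal sketch, with a hypothetical function name:

```python
def combinations_from(groups, start=0):
    """Depth-first list of group combinations that begin with groups[start]."""

    def extend(prefix, next_index):
        yield prefix
        # Extend the current combination with each later group in turn.
        for j in range(next_index, len(groups)):
            yield from extend(prefix + [groups[j]], j + 1)

    yield from extend([groups[start]], start + 1)


groups = ["FIT(A, B, C)", "FIT(D, E, F)", "FIT(G, H, I)", "FIT(J, K)"]
for combo in combinations_from(groups):
    print(" X ".join(combo))
```

A depth-first order lets the result of a shorter combination, such as FIT(A, B, C) X FIT(D, E, F), be reused directly when it is extended with the next group.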

Conclusions

This work focuses on solving or easing the mining problems caused by limited memory.

The proposed approach can be divided into three phases:

1. Item Partition
2. Generation of Frequent Itemsets
3. Finding Cross-Group Frequent Itemsets

Conclusions

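Advantages: the work can be distributed across multiple computers, and a huge database can be mined even with limited resources.

Disadvantages: the database cannot be shared, so each computer needs its own copy. During the merge of the data, only a few computers can run, so that step cannot be distributed.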
