
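Improving the Performance of the FP-growth Data Mining Algorithm on Huge Databases

CHEN-HUNG Lin, 2010.05.04

Department of Computer Science and Information Engineering (Graduate Institute), National University of Kaohsiung

Master's thesis; graduate student: 黃正男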

Contents

1. Introduction

2. Item Partition

3. Generation of Frequent Itemsets

4. Finding Cross-Group Frequent Itemsets

5. Conclusions

Introduction

Apriori may require iterative database scans and incur a high computational cost.

A Frequent-Pattern tree (FP-tree) built from a huge database may generate more nodes than memory can hold.

Introduction

Figure: FP-tree and Memory

Introduction

TID  Domain items (A, B, C, D, E)
01   A, B, C, D, E
02   B, C, D, E
03   A, C, D, E
04   A, B, C, D

{A, B, C, D, E}

{A, B, C} {D, E}

Independent group

Independent groups: itemsets that span more than one group are infrequent, e.g. ABD, BCE, …

Introduction

TID  Domain items (A, B, C, D, E)
01   A, B, C
02   C, D, E
03   A, C, D, E
04   A, B, C, D

min_support = 2: {A, B, C, D, E} = {A, B, C, D, E}

min_support = 3: {A, B, C, D, E} = {C, D}, {A}, {B}, {E}

Introduction

{A, B, C, D, E, F, G, H, I, J}

Independent groups: {A, B, C, D, E, F, G, H} and {I, J}

Item number > threshold 3: {A, B, C, D, E, F, G, H} is divided into the dependent groups {A, B, C}, {D, E, F}, {G, H}

FP-tree / FP-Growth on each dependent group: FIT(A, B, C), FIT(D, E, F), FIT(G, H)

Merge

All frequent itemsets

How do we divide a big group?

How do we find all the missing frequent itemsets?


Item-partition algorithm

TID  Items                   TID  Items
01   A, B, C, D, F, K        06   H, J, K
02   B, C, D, E, F           07   A, B, C, F, J, K
03   A, B, C, E, F, J, K     08   A, B, C, D, F
04   A, B, C, D, E, F        09   A, C, D, F, G, J, K
05   B, C, D, F, J, K        10   A, C, D, F, G

min_support = 5

Generate & Counts

2-itemset count   2-itemset count   2-itemset count   2-itemset count   2-itemset count
AB        5       BC        7       CE        3       DH        0       FH        0
AC        6       BD        5       CF        8       DJ        2       FJ        4
AD        5       BE        3       CG        2       DK        3       FK        5
AE        2       BF        6       CH        0       EF        3       GH        0
AF        6       BG        0       CJ        4       EG        0       GJ        1
AG        2       BH        0       CK        4       EH        0       GK        1
AH        0       BJ        3       DE        2       EJ        1       HJ        1
AJ        3       BK        4       DF        7       EK        1       HK        1
AK        4       CD        7       DG        2       FG        2       JK        5

Item-partition algorithm

Steps: Generate & Counts; Frequent 2-itemsets; Initially set; Merged; Output & Exit

min_support = 5

Frequent 2-itemset   Partition
-                    {A},{B},{C},{D},{E},{F},{G},{H},{J},{K}
AB                   {A, B},{C},{D},{E},{F},{G},{H},{J},{K}
AC                   {A, B, C},{D},{E},{F},{G},{H},{J},{K}
AD                   {A, B, C, D},{E},{F},{G},{H},{J},{K}
AF                   {A, B, C, D, F},{E},{G},{H},{J},{K}
BC                   {A, B, C, D, F},{E},{G},{H},{J},{K}
BD                   {A, B, C, D, F},{E},{G},{H},{J},{K}
BF                   {A, B, C, D, F},{E},{G},{H},{J},{K}
CD                   {A, B, C, D, F},{E},{G},{H},{J},{K}
CF                   {A, B, C, D, F},{E},{G},{H},{J},{K}
DF                   {A, B, C, D, F},{E},{G},{H},{J},{K}
FK                   {A, B, C, D, F, K},{E},{G},{H},{J}
JK                   {A, B, C, D, F, J, K},{E},{G},{H}

Result: {A, B, C, D, F, J, K},{E},{G},{H}

Output & Exit
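The merge table above can be reproduced with a small union-find sketch. This is only an illustration under assumptions: the slides do not name a data structure, and `parse_counts` and `item_partition` are hypothetical helper names. The counts are copied from the 2-itemset table, and a 2-itemset is treated as frequent when its count is at least min_support, which matches AB (count 5) being merged at min_support = 5.

```python
from collections import defaultdict

# 2-itemset counts copied from the table on the slide.
COUNTS_TEXT = """
AB 5 BC 7 CE 3 DH 0 FH 0
AC 6 BD 5 CF 8 DJ 2 FJ 4
AD 5 BE 3 CG 2 DK 3 FK 5
AE 2 BF 6 CH 0 EF 3 GH 0
AF 6 BG 0 CJ 4 EG 0 GJ 1
AG 2 BH 0 CK 4 EH 0 GK 1
AH 0 BJ 3 DE 2 EJ 1 HJ 1
AJ 3 BK 4 DF 7 EK 1 HK 1
AK 4 CD 7 DG 2 FG 2 JK 5
"""


def parse_counts(text):
    """Turn 'AB 5 BC 7 ...' into {'AB': 5, 'BC': 7, ...}."""
    tokens = text.split()
    return {pair: int(n) for pair, n in zip(tokens[::2], tokens[1::2])}


def item_partition(items, pair_counts, min_support):
    """Start with one group per item; merge the groups of every frequent pair."""
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair in sorted(pair_counts):
        if pair_counts[pair] >= min_support:
            a, b = find(pair[0]), find(pair[1])
            if a != b:
                parent[max(a, b)] = min(a, b)

    groups = defaultdict(list)
    for item in items:
        groups[find(item)].append(item)
    # Largest group first, then alphabetical.
    return sorted(groups.values(), key=lambda g: (-len(g), g))


pair_counts = parse_counts(COUNTS_TEXT)
partition = item_partition("ABCDEFGHJK", pair_counts, min_support=5)
print(partition)
```

The final grouping does not depend on the order in which the frequent 2-itemsets are processed, so only the end result is compared with the slide.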

Refine-partition

β = 3

set upper bound: upper bound = ∞

set the score for {A, B, C, D, F, J, K}:

2-itemset score   2-itemset score   2-itemset score
AB        0       BD        0       CK        1
AC        0       BF        0       DF        0
AD        0       BJ        1       DJ        1
AF        0       BK        1       DK        1
AJ        1       CD        0       FJ        1
AK        1       CF        0       FK        0
BC        0       CJ        1       JK        0
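The score table is consistent with a simple rule: a 2-itemset scores 0 when it is frequent and 1 otherwise. The sketch below rebuilds the table under that assumption; the rule is inferred from the numbers on the slides rather than stated there, and `pair_scores` is a hypothetical name.

```python
from itertools import combinations

# Frequent 2-itemsets at min_support = 5, taken from the merge table.
FREQUENT_PAIRS = {"AB", "AC", "AD", "AF", "BC", "BD",
                  "BF", "CD", "CF", "DF", "FK", "JK"}


def pair_scores(group, frequent_pairs):
    """Score every 2-itemset of the group: 0 if frequent, 1 if not."""
    return {
        a + b: 0 if a + b in frequent_pairs else 1
        for a, b in combinations(sorted(group), 2)
    }


scores = pair_scores("ABCDFJK", FREQUENT_PAIRS)
for pair, score in scores.items():
    print(pair, score)
```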

set root node

{A, B, C, D, F, J, K}  LB = 0

Generate child nodes

{A,B,C,D,F,J,K}  LB = 0

{A,B,C}{D,F,J,K}
{A,B,D}{C,F,J,K}
{A,J,K}{B,C,D,F}  LB =
{A,B}{C,D,F,J,K}  LB =

{A,B,C,D,F,J,K} has 7 items and β = 3; 7/3 = 2.333, so 3 groups are needed.

With 7/3 = 2.333, each group holds 2 or 3 items.
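The group-count arithmetic can be written out as a short sketch. It assumes the groups are kept as balanced as possible, which is one reading of "2 or 3 for each group"; `group_sizes` is a hypothetical helper name.

```python
import math


def group_sizes(n_items, beta):
    """Split n_items into ceil(n_items / beta) groups of near-equal size."""
    n_groups = math.ceil(n_items / beta)
    base, extra = divmod(n_items, n_groups)
    # 'extra' groups get one additional item.
    return [base + 1] * extra + [base] * (n_groups - extra)


sizes = group_sizes(7, 3)
print(sizes)
```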


Calculate the lower bound

{A,B,C}{D,F,J,K}

Decided part {A,B,C}: {A,B} = 0, {A,C} = 0, {B,C} = 0
S_decided = 0 + 0 + 0 = 0

Undecided part {D,F,J,K}: {D,F} = 0, {D,J} = 1, {D,K} = 1, {F,J} = 1, {F,K} = 0, {J,K} = 0
Sorted in ascending order: {D,F} = 0, {F,K} = 0, {J,K} = 0, {D,J} = 1, {D,K} = 1, {F,J} = 1
S_undecided = 0 + 0 + 0 = 0

{A,B,C}{D,F,J,K}  LB = 0

Steps: Start; set upper bound (upper bound = ∞); set the score; set root node; calculate the lower bound; choose node; replace upper bound; Stop

{A,B,C,D,F,J,K}  LB = 0

{A,B,C}{D,F,J,K}  LB = 0
{A,B,D}{C,F,J,K}  LB = 0
{A,J,K}{B,C,D,F}  LB = 2
{A,B}{C,D,F,J,K}  LB = 5

upper bound = 0

{A,B,C}{D,F}{J,K}  LB = 0
{A,B,C}{D,K}{F,J}  LB = 2
{A,B}{C,D,K}{F,J}  LB = 3
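For the three-group partitions listed above, the LB values agree with the number of infrequent 2-itemsets that fall inside a group. The sketch below checks that reading; it is an assumption inferred from the slide numbers, not a statement of the thesis's bound, and `partition_score` is a hypothetical name.

```python
from itertools import combinations

# Frequent 2-itemsets at min_support = 5, taken from the merge table.
FREQUENT_PAIRS = {"AB", "AC", "AD", "AF", "BC", "BD",
                  "BF", "CD", "CF", "DF", "FK", "JK"}


def partition_score(groups, frequent_pairs):
    """Count the within-group 2-itemsets that are not frequent."""
    return sum(
        1
        for group in groups
        for a, b in combinations(sorted(group), 2)
        if a + b not in frequent_pairs
    )


candidates = [
    ("ABC", "DF", "JK"),
    ("ABC", "DK", "FJ"),
    ("AB", "CDK", "FJ"),
]
results = {c: partition_score(c, FREQUENT_PAIRS) for c in candidates}
for groups, score in results.items():
    print(groups, score)
```

A score of 0 means every 2-itemset inside every group is frequent, which is why {A,B,C}{D,F}{J,K} can replace the upper bound with 0.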

The proposed item-partition

After the merge step: {A, B, C, D, F, J, K},{E},{G},{H}

After Refine-partition: {A,B,C}{D,F}{J,K},{E},{G},{H}


Generation of Frequent Itemsets

STEP 1: Generate an initial MFPT with only the empty root node.

STEP 2: Set the initial count of each item in the given group to 0.

G = {A, B, D}

item  count
A     0
B     0
D     0

Algorithm (cont.)

STEP 3: Read a transaction from the given data set D and delete the items that do not appear in G.

STEP 4: If an item in G appears in the transaction, increase its count by 1.

STEP 5: Repeat STEPs 3 and 4 until all the transactions are processed.

G = {A, B, D}

TID  Items
01   A, B, C, D, F, G

After deleting the items outside G:

TID  Items
01   A, B, D

Counts after transaction 01:

item  count
A     1
B     1
D     1

Counts after all transactions:

item  count
A     5
B     9
D     7

Algorithm (cont.)

STEP 6: Compare the item counts with min_support and remove the items that are not frequent.

STEP 7: Sort the items in G according to their final counts.

STEP 8: Sequentially read a transaction T from the given data set D.

item  count
A     5
B     9
D     7

Sorted order = (B, D, A)

TID  Items
01   A, B, C, D, F, G

Algorithm (cont.)

STEP 9: Generate a tree path P from the transaction T, keeping only the frequent items, in the sorted order from STEP 7. Merge P into the MFPT in a way similar to an FPT.

STEP 10: Increase the count of each node of P in the MFPT by 1 and add the transaction ID (TID) of T to the last node of P.

TID  Items
01   A, B, C, D, F, G

Figure: MFPT after transaction 01 (root; A: 1, TIDs = 01)

Algorithm (cont.)

STEP 11: Repeat STEPs 8 to 10 until all transactions in D are processed.

root
A: 3, TIDs = 01, 03,
A: 2, TIDs = 05,
D: 7, TIDs = 02, 04, 06, 10
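STEPs 1 to 11 can be sketched as follows. This is a minimal illustration, not the thesis implementation. The slides show only transaction 01 in full, so the other transactions here are hypothetical, chosen to reproduce the counts A 5, B 9, D 7. The value `min_support=5` and the names `Node` and `build_mfpt` are assumptions as well.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    item: str = None
    count: int = 0
    tids: list = field(default_factory=list)
    children: dict = field(default_factory=dict)


def build_mfpt(transactions, group, min_support):
    # STEPs 2-5: count each item of the group over all transactions.
    counts = Counter(
        item for items in transactions.values() for item in set(items) if item in group
    )
    # STEPs 6-7: drop infrequent items, sort the rest by descending count.
    order = sorted(
        (i for i in group if counts[i] >= min_support),
        key=lambda i: (-counts[i], i),
    )
    root = Node()  # STEP 1: empty root
    # STEPs 8-11: insert one path per transaction.
    for tid, items in transactions.items():
        path = [i for i in order if i in items]
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
        if path:
            node.tids.append(tid)  # TID goes on the last node of the path
    return root, counts, order


# Transaction 01 is from the slide; the others are hypothetical.
TRANSACTIONS = {
    "01": "ABCDFG", "02": "BD", "03": "ABD", "04": "BD", "05": "AB",
    "06": "BD", "07": "CF", "08": "ABD", "09": "AB", "10": "BD",
}

root, counts, order = build_mfpt(TRANSACTIONS, group="ABD", min_support=5)
print(order, dict(counts))
```

Storing the TID only on the last node of each path keeps every transaction in exactly one place in the tree.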

The Enumeration Tree

The enumerated order: (B, BD, BA, BDA, D, DA, A)

A: 3, TIDs = 01, 03,
A: 2, TIDs = 05, 09
D: 7, TIDs = 02, 04, 06, 10

{B}(01, 02, 03, 04, 05, 06, 08, 09, 10)
{BD}(01, 02, 03, 04, 06, 08, 10)
{BA}(01, 03, 05, 08, 09)
{BDA}(01, 03, 08)
{D}(01, 02, 03, 04, 06, 08, 10)
{DA}(01, 03, 08)
{A}(01, 03, 05, 08, 09)
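The enumerated order and the TID lists can be checked with a short sketch that intersects the TID sets of the single items. It assumes the TID list of an itemset is the intersection of its items' lists; `enumerate_itemsets` is a hypothetical name, and no support threshold is applied because this slide does not state one.

```python
from itertools import combinations

# TID sets of the single items, copied from the slide.
TIDSETS = {
    "B": {1, 2, 3, 4, 5, 6, 8, 9, 10},
    "D": {1, 2, 3, 4, 6, 8, 10},
    "A": {1, 3, 5, 8, 9},
}


def enumerate_itemsets(order, tidsets):
    """For each leading item, list its extensions by increasing size."""
    result = {}
    for start, first in enumerate(order):
        rest = order[start + 1:]
        for size in range(len(rest) + 1):
            for combo in combinations(rest, size):
                itemset = first + "".join(combo)
                tids = set.intersection(*(tidsets[i] for i in itemset))
                result[itemset] = sorted(tids)
    return result


itemsets = enumerate_itemsets("BDA", TIDSETS)
for itemset, tids in itemsets.items():
    print(itemset, tids)
```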

FIT(A,B,D)
{BDA}(01:3, 03:3, 08:3)
{BA}(01:2, 03:2, 05:2, 08:2, 09:2)

FIT(C,E,F)
{CEF}(01:3, 02:3, 03:2, 04:1)
{CF}(01:1, 03:1)

FIT(G,H,I)
{GHI}(01:3, 02:3, 03:2, 04:1)
{GI}(01:1, 03:1)

Dependent groups: {A,B,D}, {C,E,F}, {G,H,I}

FP-tree / FP-Growth


A(10,20,30,50,60,80)

AB(10,20,30,50)

ABC(10,20,30)

D(10,20,30,35,70,80)

DE(10,20,30,35)

DEF(10,20,30)

CFI1=ABC(10:3,20:3,30:3,50:2,60:1,80:1)

CFI1=DEF(10:3,20:3,30:3,35:2,70:1,80:1)
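The TID:k notation above appears to record, for each transaction, the length of the longest prefix of the itemset that the transaction contains. Under that reading (an interpretation of the slide, not a definition taken from it), the compressed entries can be rebuilt from the prefix TID lists. `compress_chain` is a hypothetical name.

```python
def compress_chain(itemset, prefix_tids):
    """Map each TID to the length of the longest prefix that contains it.

    Assumes the TID lists are nested: every TID of a longer prefix also
    appears in the lists of all shorter prefixes.
    """
    depth = {}
    for length in range(1, len(itemset) + 1):
        for tid in prefix_tids[itemset[:length]]:
            depth[tid] = length  # longer prefixes overwrite shorter ones
    return depth


ABC = compress_chain("ABC", {
    "A": [10, 20, 30, 50, 60, 80],
    "AB": [10, 20, 30, 50],
    "ABC": [10, 20, 30],
})
DEF = compress_chain("DEF", {
    "D": [10, 20, 30, 35, 70, 80],
    "DE": [10, 20, 30, 35],
    "DEF": [10, 20, 30],
})
print(ABC)
print(DEF)
```

One compressed entry replaces three TID lists, and any prefix list can be recovered by keeping the TIDs whose value is at least the prefix length.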

A(10,20,30,50,60,80) D(10,20,30,35,70,80)

AD(10,20,30,80)

DE(10,20,30,35)

ADE(10,20,30)

DEF(10,20,30)

ADEF(10,20,30)


ADEF(10:3,20:3,30:3,80:1)

ABDEF(10:3,20:3,30:3)


ABD(10,20,30)

ABDE(10,20,30)

ABDEF(10,20,30)

ABCD(10,20,30)

ABCDE (10,20,30)

ABCDEF (10,20,30)

ABCDEF(10:3,20:3,30:3)
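The cross-group itemsets listed above can be reproduced by combining the two compressed entries: a transaction supports a prefix of ABC joined with a prefix of DEF when its recorded depth is large enough in both. This is a sketch under the same prefix-length reading of TID:k; `cross_group` is a hypothetical name, and no support threshold is applied because this example does not state one.

```python
# Compressed entries copied from the slide (TID: prefix length).
ABC = {10: 3, 20: 3, 30: 3, 50: 2, 60: 1, 80: 1}
DEF = {10: 3, 20: 3, 30: 3, 35: 2, 70: 1, 80: 1}


def cross_group(name1, fit1, name2, fit2):
    """TID lists for every prefix of name1 joined with every prefix of name2."""
    result = {}
    for i in range(1, len(name1) + 1):
        for j in range(1, len(name2) + 1):
            tids = sorted(
                t for t in fit1 if fit1[t] >= i and fit2.get(t, 0) >= j
            )
            if tids:
                result[name1[:i] + name2[:j]] = tids
    return result


crossed = cross_group("ABC", ABC, "DEF", DEF)
for itemset, tids in crossed.items():
    print(itemset, tids)
```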

{ABC}(01, 10)

{AB}(01, 05,10, 11)

{A}(01, 05, 06,10, 11)

{DEF}(01, 10)

{DE}(01, 05,10, 11)

{D}(01, 05, 07,10, 11)

{ADE}(01:2, 05:2, 10:2, 11:2)


{ABDE}(01:2, 05:2, 10:2, 11:2)


FIT(A, B, C)

FIT(A, B, C) X FIT(D, E, F)

FIT(A, B, C) X FIT(D, E, F) X FIT(G, H, I)

FIT(A, B, C) X FIT(D, E, F) X FIT(G, H, I) X FIT(J, K)

FIT(A, B, C) X FIT(D, E, F) X FIT(J, K)

FIT(A, B, C) X FIT(G,H,I)

FIT(A,B,C) X FIT(G,H,I) X FIT(J,K)

FIT(A, B, C) X FIT(J, K)
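The order of the combinations above matches a depth-first enumeration of the groups that follow FIT(A, B, C). The sketch below generates that order; it covers only the combinations starting with the first group, as on the slide, and `group_combinations` is a hypothetical name.

```python
def group_combinations(groups):
    """Depth-first combinations that always start with the first group."""

    def extend(chosen, start):
        yield chosen
        for i in range(start, len(groups)):
            yield from extend(chosen + [groups[i]], i + 1)

    yield from extend([groups[0]], 1)


GROUPS = ["FIT(A, B, C)", "FIT(D, E, F)", "FIT(G, H, I)", "FIT(J, K)"]
combos = [" X ".join(c) for c in group_combinations(GROUPS)]
for combo in combos:
    print(combo)
```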

Conclusions

This work focuses on solving or easing the mining problems caused by limited memory.

The proposed approach can be divided into three phases:

1. Item Partition
2. Generation of Frequent Itemsets
3. Finding Cross-Group Frequent Itemsets

Conclusions

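Advantages: the work can be distributed across multiple computers, and a huge database can be mined with limited resources.

Disadvantages: the database cannot be shared, so each computer needs its own copy. During the data merge, only a few computers can run, so the merge cannot be distributed.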
