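Improving the Performance of the FP-growth Data Mining Algorithm on Huge Databases
CHEN-HUNG Lin, 2010.05.04
National University of Kaohsiung, Department of Computer Science and Information Engineering (Graduate Institute)
Master's thesis, graduate student: 黃正男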
Contents
1. Introduction
2. Item Partition
3. Generation of Frequent Itemsets
4. Finding Cross-Group Frequent Itemsets
5. Conclusions
Introduction
Apriori may require repeated database scans and incur a high computational cost.
A Frequent-Pattern tree (FP-tree) may not be able to hold in memory all the nodes generated from a huge database.
[Figure: an FP-tree over the items B, A, D, with nodes A: 5, B: 9, D: 3 and D: 4 below the root, shown against the available memory.]
TID   Items (domain: A, B, C, D, E)
01    A, B, C, D, E
02    B, C, D, E
03    A, C, D, E
04    A, B, C, D
{A, B, C, D, E} → {A, B, C}, {D, E} (independent groups)
Independent group: the itemsets that cross groups are infrequent, e.g. ABD, BCE, ...
TID   Items (domain: A, B, C, D, E)
01    A, B, C
02    C, D, E
03    A, C, D, E
04    A, B, C, D
[Figure: graph over the items A to E whose edges are weighted by the co-occurrence counts of the item pairs.]
min_support = 2: {A, B, C, D, E} = {A, B, C, D, E}
min_support = 3: {A, B, C, D, E} = {C, D}, {A}, {B}, {E}
{A, B, C, D, E, F, G, H, I, J}
→ independent groups: {A, B, C, D, E, F, G, H} and {I, J}
→ item number > threshold 3, so {A, B, C, D, E, F, G, H} is divided into dependent groups: {A, B, C}, {D, E, F}, {G, H}
→ FP-tree / FP-growth on each group: FIT(A, B, C), FIT(D, E, F), FIT(G, H)
→ Merge → all frequent itemsets
How to divide a big group?
How to find all the missed frequent itemsets?
Item-partition algorithm
Start
TID   Items                    TID   Items
01    A, B, C, D, F, K         06    H, J, K
02    B, C, D, E, F            07    A, B, C, F, J, K
03    A, B, C, E, F, J, K      08    A, B, C, D, F
04    A, B, C, D, E, F         09    A, C, D, F, G, J, K
05    B, C, D, F, J, K         10    A, C, D, F, G
min_support = 5
Generate & count the 2-itemsets
2-itemset count 2-itemset count 2-itemset count 2-itemset count 2-itemset count
AB 5 BC 7 CE 3 DH 0 FH 0
AC 6 BD 5 CF 8 DJ 2 FJ 4
AD 5 BE 3 CG 2 DK 3 FK 5
AE 2 BF 6 CH 0 EF 3 GH 0
AF 6 BG 0 CJ 4 EG 0 GJ 1
AG 2 BH 0 CK 4 EH 0 GK 1
AH 0 BJ 3 DE 2 EJ 1 HJ 1
AJ 3 BK 4 DF 7 EK 1 HK 1
AK 4 CD 7 DG 2 FG 2 JK 5
Frequent 2-itemsets: AB, AC, AD, AF, BC, BD, BF, CD, CF, DF, FK, JK
Initially set
Partition: {A},{B},{C},{D},{E},{F},{G},{H},{J},{K}
Merged
Frequent 2-itemset   Partition
-     {A},{B},{C},{D},{E},{F},{G},{H},{J},{K}
AB    {A, B},{C},{D},{E},{F},{G},{H},{J},{K}
AC    {A, B, C},{D},{E},{F},{G},{H},{J},{K}
AD    {A, B, C, D},{E},{F},{G},{H},{J},{K}
AF    {A, B, C, D, F},{E},{G},{H},{J},{K}
BC    {A, B, C, D, F},{E},{G},{H},{J},{K}
BD    {A, B, C, D, F},{E},{G},{H},{J},{K}
BF    {A, B, C, D, F},{E},{G},{H},{J},{K}
CD    {A, B, C, D, F},{E},{G},{H},{J},{K}
CF    {A, B, C, D, F},{E},{G},{H},{J},{K}
DF    {A, B, C, D, F},{E},{G},{H},{J},{K}
FK    {A, B, C, D, F, K},{E},{G},{H},{J}
JK    {A, B, C, D, F, J, K},{E},{G},{H}
min_support = 5
{A, B, C, D, F, J, K},{E},{G},{H}
Check
Output & Exit
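The merge step can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: it assumes the merging is equivalent to joining the groups of the two items of every frequent 2-itemset, and it uses a union-find structure (a swapped-in technique) to do so. The function name `partition_items` is hypothetical; the frequent 2-itemsets are taken from the slide as given.

```python
def partition_items(items, frequent_pairs):
    """Merge singleton groups along the frequent 2-itemsets (union-find)."""
    parent = {item: item for item in items}

    def find(x):
        # Walk up to the group representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in frequent_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # the two groups become one

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    # Larger groups first, then alphabetical, for a stable result.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


items = list("ABCDEFGHJK")
frequent_pairs = ["AB", "AC", "AD", "AF", "BC", "BD",
                  "BF", "CD", "CF", "DF", "FK", "JK"]
groups = partition_items(items, frequent_pairs)
print(groups)
```

The order in which the frequent 2-itemsets are processed does not change the final partition, only the intermediate rows of the merge table.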
Refine-partition (β = 3)
Start
Set upper bound: upper bound = ∞
Set the score for {A, B, C, D, F, J, K}
2-itemset score 2-itemset score 2-itemset score
AB 0 BD 0 CK 1
AC 0 BF 0 DF 0
AD 0 BJ 1 DJ 1
AF 0 BK 1 DK 1
AJ 1 CD 0 FJ 1
AK 1 CF 0 FK 0
BC 0 CJ 1 JK 0
A 2-itemset scores 0 if it is one of the frequent 2-itemsets (AB, AC, AD, AF, BC, BD, BF, CD, CF, DF, FK, JK) and 1 otherwise.
Set root node: {A, B, C, D, F, J, K}, LB = 0
{A, B, C, D, F, J, K} = 7 items, β = 3, 7/3 = 2.333 => 3 groups, with 2 or 3 items in each group
Generate child nodes:
{A,B,C}{D,F,J,K}
{A,B,D}{C,F,J,K}
{A,J,K}{B,C,D,F}
{A,B}{C,D,F,J,K}
Calculate the lower bound
Node {A,B,C}{D,F,J,K}
Decided part {A,B,C}: {A,B} = 0, {A,C} = 0, {B,C} = 0, so S_decided = 0 + 0 + 0 = 0
Undecided part {D,F,J,K}: {D,F} = 0, {D,J} = 1, {D,K} = 1, {F,J} = 1, {F,K} = 0, {J,K} = 0
Sorted in ascending order: {D,F} = 0, {F,K} = 0, {J,K} = 0, {D,J} = 1, {D,K} = 1, {F,J} = 1, so S_undecided = 0 + 0 + 0 = 0
[Formula: number of 2-itemset scores counted in S_undecided, written with the terms C(R, 2) and C(mod(n, R), 2)]
{A,B,C}{D,F,J,K}: LB = 0
Start → set upper bound (upper bound = ∞) → set the score → set root node → generate child nodes → calculate the lower bound → choose node / stop node / replace upper bound → End
{A,B,C,D,F,J,K}: LB = 0
Child nodes:
{A,B,C}{D,F,J,K}: LB = 0
{A,B,D}{C,F,J,K}: LB = 0
{A,J,K}{B,C,D,F}: LB = 2
{A,B}{C,D,F,J,K}: LB = 5
Nodes with every group decided:
{A,B,C}{D,F}{J,K}: LB = 0
{A,B,C}{D,K}{F,J}: LB = 2
{A,B}{C,D,K}{F,J}: LB = 3
Replace upper bound: upper bound = 0
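The lower-bound calculation can be illustrated with a short Python sketch. It assumes that a 2-itemset scores 0 when it is frequent and 1 otherwise, that the decided part contributes the sum of its within-group scores, and that the undecided part contributes the sum of its k smallest pair scores. The rule used for k below, (n // β) · C(β, 2) + C(n mod β, 2) for n undecided items, is an assumption chosen because it reproduces the three-term sum of the worked node {A,B,C}{D,F,J,K}; it need not match every bound shown on the slides. The names `score`, `group_score` and `lower_bound` are hypothetical.

```python
from itertools import combinations
from math import comb

FREQUENT = {"AB", "AC", "AD", "AF", "BC", "BD",
            "BF", "CD", "CF", "DF", "FK", "JK"}


def score(a, b):
    """0 if the 2-itemset is frequent, 1 otherwise (as in the score table)."""
    return 0 if "".join(sorted((a, b))) in FREQUENT else 1


def group_score(group):
    """Sum of the scores of all 2-itemsets inside one group."""
    return sum(score(a, b) for a, b in combinations(sorted(group), 2))


def lower_bound(decided, undecided, beta=3):
    """Decided groups pay their full score; the undecided items pay at least
    the k smallest scores among their own pairs."""
    s_decided = sum(group_score(g) for g in decided)
    n = len(undecided)
    # Assumption: number of within-group pairs the undecided items must still form.
    k = (n // beta) * comb(beta, 2) + comb(n % beta, 2)
    pair_scores = sorted(score(a, b)
                         for a, b in combinations(sorted(undecided), 2))
    return s_decided + sum(pair_scores[:k])


print(lower_bound(["ABC"], "DFJK"))           # worked node on the slide
print(lower_bound(["AJK"], "BCDF"))
print(lower_bound(["ABC", "DK", "FJ"], ""))   # every group decided
```

When every group is decided, the bound is simply the score of the partition, which is how {A,B,C}{D,F}{J,K} with score 0 becomes the new upper bound.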
The proposed item-partition
{A, B, C, D, F, J, K},{E},{G},{H}
{A, B, C},{D, F},{J, K},{E},{G},{H}
Generation of Frequent Itemsets
STEP 1: Generate an initial MFPT with only the empty root node.
STEP 2: Set the initial count of each item in the given group to 0.
Example: G = {A, B, D}; the MFPT consists of the root only.
item  count
A     0
B     0
D     0
Algorithm (cont.)
STEP 3: Read a transaction from the given data set D and delete the items that do not appear in G.
STEP 4: If an item in G appears in the transaction, increase its count by 1.
STEP 5: Repeat STEPs 3 and 4 until all the transactions are processed.
Example with G = {A, B, D}: transaction 01 (A, B, C, D, F, G) becomes (A, B, D).
Counts after transaction 01: A 1, B 1, D 1
Counts after all transactions: A 5, B 9, D 7
Algorithm (cont.)
STEP 6: Compare the item counts with min_support and remove the items that are not frequent.
STEP 7: Sort the items in G in descending order of their final counts.
STEP 8: Sequentially read a transaction T from the given data set D.
Example: the counts A 5, B 9, D 7 give sorted order = (B, D, A); T = transaction 01 (A, B, C, D, F, G).
Algorithm (cont.)
STEP 9: Generate a tree path P from the transaction T, using only the frequent items in the sorted order from STEP 7. Merge P into the MFPT in a way similar to FPT.
STEP 10: Increase the count of each node of P in the MFPT by 1 and add the transaction ID (TID) of T to the last node of P.
Example: transaction 01 (A, B, C, D, F, G) with sorted order = (B, D, A) gives the path
root
  B: 1
    D: 1
      A: 1, TIDs = 01
Algorithm (cont.)
STEP 11: Repeat STEPs 8 to 10 until all transactions in D are processed.
root
  B: 9
    D: 7, TIDs = 02, 04, 06, 10
      A: 3, TIDs = 01, 03, 08
    A: 2, TIDs = 05, 09
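STEPs 1 to 11 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis code: the `Node` class and `build_mfpt` function are hypothetical names, the transactions are the projections onto G = {A, B, D} reconstructed from the TID lists shown on the slides, and min_support = 5 is assumed for this group because the slides do not state it here.

```python
from collections import Counter


class Node:
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.tids = []      # TIDs of the transactions whose path ends here
        self.children = {}


def build_mfpt(transactions, group, min_support):
    # STEPs 2-5: count only the items that belong to the group.
    counts = Counter(i for items in transactions.values()
                     for i in items if i in group)
    # STEPs 6-7: drop infrequent items, sort the rest by descending count.
    order = sorted((i for i in group if counts[i] >= min_support),
                   key=lambda i: (-counts[i], i))
    root = Node()  # STEP 1: the tree starts as an empty root
    # STEPs 8-11: insert every transaction as a path in the sorted order.
    for tid, items in transactions.items():
        path = [i for i in order if i in items]
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1
        if path:
            node.tids.append(tid)  # STEP 10: TID goes to the last node of P
    return root, order


# Projected transactions (items outside G already removed); an assumption
# reconstructed from the TID lists of {A}, {B} and {D}.
transactions = {
    "01": "ABD", "02": "BD", "03": "ABD", "04": "BD", "05": "AB",
    "06": "BD", "08": "ABD", "09": "AB", "10": "BD",
}
root, order = build_mfpt(transactions, group="ABD", min_support=5)
b = root.children["B"]
print(order, b.count, b.children["D"].count, b.children["D"].tids)
```

Storing the TID only at the last node of each path keeps the tree small while still letting the TID list of any itemset be recovered by collecting the TIDs in the subtree below it.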
The Enumeration Tree
Sorted order = (B, D, A)
The enumerated order: (B, BD, BA, BDA, D, DA, A)
{B}(01, 02, 03, 04, 05, 06, 08, 09, 10)
{BD}(01, 02, 03, 04, 06, 08, 10)
{BA}(01, 03, 05, 08, 09)
{BDA}(01, 03, 08)
{D}(01, 02, 03, 04, 06, 08, 10)
{DA}(01, 03, 08)
{A}(01, 03, 05, 08, 09)
FIT(A, B, D)
{BDA}(01:3, 03:3, 08:3)
{BA}(01:2, 03:2, 05:2, 08:2, 09:2)
…
FIT(C, E, F)
{CEF}(01:3, 02:3, 03:2, 04:1)
{CF}(01:1, 03:1)
…
FIT(G, H, I)
{GHI}(01:3, 02:3, 03:2, 04:1)
{GI}(01:1, 03:1)
…
Dependent groups {A, B, D}, {C, E, F}, {G, H, I} → FP-tree / FP-growth on each group → Merge → all frequent itemsets
Finding Cross-Group Frequent Itemsets
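TID lists of the first group: A(10,20,30,50,60,80), AB(10,20,30,50), ABC(10,20,30)
TID lists of the second group: D(10,20,30,35,70,80), DE(10,20,30,35), DEF(10,20,30)
CFI1 = ABC(10:3, 20:3, 30:3, 50:2, 60:1, 80:1)
CFI1 = DEF(10:3, 20:3, 30:3, 35:2, 70:1, 80:1)
Crossing A with the second group by intersecting the TID lists:
A(10,20,30,50,60,80) and D(10,20,30,35,70,80) give AD(10,20,30,80)
AD(10,20,30,80) and DE(10,20,30,35) give ADE(10,20,30)
ADE(10,20,30) and DEF(10,20,30) give ADEF(10,20,30)
AD, ADE and ADEF are stored together as ADEF(10:3, 20:3, 30:3, 80:1)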
ADEF(10:3, 20:3, 30:3, 80:1) → ABDEF(10:3, 20:3, 30:3)
AD(10,20,30,80), ADE(10,20,30), ADEF(10,20,30)
ABD(10,20,30), ABDE(10,20,30), ABDEF(10,20,30)
ABCD(10,20,30), ABCDE(10,20,30), ABCDEF(10,20,30)
ABCDEF(10:3, 20:3, 30:3)
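The TID-list arithmetic of this example can be reproduced with Python sets. The sketch below assumes that a cross-group itemset's TID list is the intersection of the TID lists of its parts, and that the "TID:k" notation records, for each TID, the position k of the longest itemset in the chain that still contains it. The helper names `compact_form` and `cross` are hypothetical.

```python
def compact_form(chain):
    """Map each TID to the position of the longest itemset that contains it.

    `chain` lists TID sets from the shortest itemset to the longest,
    each one a subset of the previous one.
    """
    depth = {}
    for length, tids in enumerate(chain, start=1):
        for tid in tids:
            depth[tid] = length  # later (longer) itemsets overwrite earlier ones
    return depth


def cross(prefix_tids, chain):
    """Intersect one TID set with every TID set of another group's chain."""
    return [prefix_tids & tids for tids in chain]


A, AB, ABC = {10, 20, 30, 50, 60, 80}, {10, 20, 30, 50}, {10, 20, 30}
D, DE, DEF = {10, 20, 30, 35, 70, 80}, {10, 20, 30, 35}, {10, 20, 30}

abc_compact = compact_form([A, AB, ABC])
ad_chain = cross(A, [D, DE, DEF])        # AD, ADE, ADEF
adef_compact = compact_form(ad_chain)
abdef_compact = compact_form(cross(AB, [D, DE, DEF]))  # ABD, ABDE, ABDEF
print(abc_compact, adef_compact, abdef_compact)
```

The support of any cross-group itemset is then the size of its TID set, for example `len(ad_chain[0])` for AD.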
{ABC}(01, 10), {AB}(01, 05, 10, 11), {A}(01, 05, 06, 10, 11)
{DEF}(01, 10), {DE}(01, 05, 10, 11), {D}(01, 05, 07, 10, 11)
{ADE}(01:2, 05:2, 10:2, 11:2)
{ABDE}(01:2, 05:2, 10:2, 11:2)
FIT(A, B, C)
FIT(A, B, C) × FIT(D, E, F)
FIT(A, B, C) × FIT(D, E, F) × FIT(G, H, I)
FIT(A, B, C) × FIT(D, E, F) × FIT(G, H, I) × FIT(J, K)
FIT(A, B, C) × FIT(D, E, F) × FIT(J, K)
FIT(A, B, C) × FIT(G, H, I)
FIT(A, B, C) × FIT(G, H, I) × FIT(J, K)
FIT(A, B, C) × FIT(J, K)
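The order in which the group combinations are listed matches a depth-first enumeration of group subsets. The Python sketch below reproduces that order; the function name `group_combinations` is hypothetical, and the assumption that the enumeration continues in the same way with the combinations starting from FIT(D, E, F), FIT(G, H, I) and FIT(J, K) goes beyond what the slide shows.

```python
def group_combinations(groups, start=0, prefix=()):
    """Yield every non-empty combination of groups in depth-first order."""
    for i in range(start, len(groups)):
        current = prefix + (groups[i],)
        yield current
        # Extend the current combination before moving on to the next group.
        yield from group_combinations(groups, i + 1, current)


groups = ["FIT(A, B, C)", "FIT(D, E, F)", "FIT(G, H, I)", "FIT(J, K)"]
order = [" × ".join(c) for c in group_combinations(groups)]
for line in order[:8]:   # the eight combinations that start with FIT(A, B, C)
    print(line)
```

A depth-first order has the practical benefit that a combination can be abandoned as soon as it yields no frequent itemset, since none of its extensions can yield one either.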
Conclusions
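This work focuses on solving or easing the mining problems caused by memory limitations.
The proposed approach can be divided into three phases:
Item Partition
Generation of Frequent Itemsets
Finding Cross-Group Frequent Itemsets
Advantages: The work can be distributed across multiple computers. Huge databases can also be mined with limited resources.
Disadvantages: The database cannot be shared; each computer must have its own copy. During the data merge, only a few computers can run, so the merge cannot be distributed.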