Upload
lilah-melendez
View
71
Download
0
Embed Size (px)
DESCRIPTION
Hash-Based Algorithm for Mining Association Rules. Data Mining. Mining Association Rules. Mining Association Rules. Mining Association Rules Support Obtain Large Itemset Confidence Generate Association Rules. Apriori - رويكرد مبتني بر - PowerPoint PPT Presentation
Citation preview
1
Hash-Based Algorithm for Mining Association Rules
2
Data Mining Mining Association Rules
3
Mining Association Rules
Mining Association Rules Support
Obtain Large Itemset Confidence
Generate Association Rules
Apriori -رويكرد مبتني بر در Apriori ابتدا در ميان مجموعه ساختار هاي داده شده به دنبال زيرساختارهاي متناوبي با اندازه
رويكرد مبتني بر كوچك مي گرديم. پس از آن در هر مرحله با يك نود به يك زير ساختار متناوب، زير ساختار جديدي
.ايجاد مي شود براي افزودن نودها به يك زير ساختار متناوب، تنها نودهايي مورد استفاده قرار م يگيرند كه در
مرحله اول به عنوان نود متناوب شناخته شده باشند. با ايجاد زير ساختار جديد، مجموعه ساختارها براي مشخص شدن
تناوب يا عدم .تناوب زيرساختار جديد مورد پويش قرار م يگيرد
4
5
TID Items
100
A C D
200
B C E
300
A B C E
400
B E
D
ScanD
Itemset
Sup.
{A} 2
{B} 3
{C} 3
{D} 1
{E} 3
C1Itemset Sup.
{A} 2
{B} 3
{C} 3
{E} 3
L1
Itemset
{A B}
{A C}
{A E}
{B C}
{B E}
{C E}
ScanD
Itemset
Sup.
{A B} 1
{A C} 2
{A E} 1
{B C} 2
{B E} 3
{C E} 2
Itemset
Sup.
{A C} 2
{B C} 2
{B E} 3
{C E} 2
C2 C2 L2
Itemset
{B C E}
ScanD
Itemset Sup.
{B C E} 2
Itemset Sup.
{B C E} 2
C3 C3 L3
Apriori
Sup=2
6
Apriori Cont. Disadvantages
Inefficient Produce much more useless
candidates
7
DHP Prune useless candidates in advance Reduce database size at each iteration
Direct Hashing with EfficientPruning for Fast Data Mining
DHP
8
C1 Count
{A}
2
{B}
3
{C}
3
{D}
1
{E}
3
L1
{A}
{B}
{C}
{E}
Min sup=2
Making a hash table
100
{A C}
200
{B C},{B E},{C E}
300
{A B},{A C},{A E},{B C},{B E},{C E}
400
{B E}
H{[x y]}=((order of x )*10+(order of y)) mod 7;
{B E}
{C E}
{B C}
{B E}
{A C}
{C E}
{B C}
{B E}
{A B}
{A C}
2 0 2 0 3 1 2
0 1 2 3 4 5 6
1 0 1 0 1 0 1
Hash table H2
Hash address
The number of items hashed to bucket 0
Bit vector
TID Items
100
A C D
200
B C E
300
A B C E
400
B E
D
9
Perfect Hashing Schemes (PHS) for Mining Association Rules
10
Motivation Apriori and DHP produce Ci from Li-
1 that may be the bottleneck
Collisions in DHP
Designing a perfect hashing function for every transaction databases is a thorny problem
11
Definition Definition. A Join operation is to join two
different (k-1)-itemsets, , respectively, to produces a k-itemset, where
= p1p2…pk-1
= q1q2…qk-1 and p2=q1, p3=q2,…,pk-2=qk-3, pk-1=qk-2.
Example: ABC, BCD 3-itemsets of ABCD: ABC, ABD, ACD, BCD only one pair that satisfies the join definition
11kS
21kS
12
Algorithm PHS (Perfect Hashing and Data
Shrinking)
13
Example1 (sup=2)
TID Items
100 ACD
200 BCE
300 BCDE
400 BE
TID Items
100 (CD)
200 (BC) (BE)(CE)
300 (BC)(BD)(BE)(CD)(CE)(DE)
400 (BE)
Itemsets (BC)
(BD)
(BE)
(CD)
(CE)
(DE)
Support 2 1 3 2 2 1
Encoding A B C D
Original (BC) (BE) (CD) (CE)
Itemset
Sup.
{B} 3
{C} 3
{D} 2
{E} 3
L1
2 2( ) ( ) ( ( ) ( )) 1n n-index(X)hash X,Y C C index Y index X
14
Example2 (sup=2)
TID Items
100 Null
200 (AD)
300 (AC)(AD)
400 Null
Itemsets (AB)
(AC)
(AD)
(BC)
(BD)
(CD)
Support 0 1 2 0 0 0
Encoding A
Original (AD)
Decode -> (BC)(CE) = BCE
2 2( ) ( ) ( ( ) ( )) 1n n-index(X)hash X,Y C C index Y index X
15
Problem on Hash Table Consider a database contains p transactions,
which are comprised of unique items and are of equal length N, and the minimum support of 1.
Loading density :2( )
, ( 1)( 1)
N kpm N
pm k
16
How to Improve the Loading Density
Two level perfect hash scheme (parital hash)
A B C
Hash Table C D Null Null
Count 1 2
Itemsets (AB)
(AC)
(AD)
(BC)
(BD)
(CD)
Support 0 1 2 0 0 0
17
18
Experiments
T5I4D200K
0
20
40
60
80
100
1.5 1.25 1 0.75 0.5 0.25
Minimum Support (%)
Tim
e (
sec)
PHS DHP Apriori
T20I4D100K
0500
1000150020002500
1.25 1 0.75 0.5 0.25
Minimum Support (%)Tim
e (
sec)
PHS DHP MPHP
19
Experiments
˹
˺ ˹ ˹
˻ ˹ ˹
˼ ˹ ˹
~̊˹ ˹
�̊˹ ˹
˻ ˹ ˹ K ~̊˹ ˹ K �̊˹ ˹ K �̊˹ ˹ K ˺ ˹ ˹ ˹ K
Tim
e (s
ec)
Number of Transactions
Increasing Number of Transactions
T̊�I̊~(PHS) T˺ ˹ I̊�(PHS)
T̊�I̊~(DHP) T˺ ˹ I̊�(DHP)
20
Experiments
T15I8D500K
100
200
300
400500
600
700
800
1.5 1.25 1 0.75 0.5
Support (%)
Tim
e (
sec)
Direct Hash Partial Hash
T15I8D500K (sup=0.5%)
0
100
200
300
400
2 3 4 5 6 7 8 9 10
PassesM
emor
y us
age
(MB
)
Direct Hash Partial Hash
21
We examined in this paper the issue of mining associationrules among items in a large database of sales transactions.The problem of discovering large itemsets wassolved by constructing a candidate set of itemsets firstand then, identifying, within this candidate set, thoseitemsets that meet the large itemset requirement
Conclusions