8/3/2019 Pattern Mining Lec 2
1/30
Introduction to Data Mining
Saeed Salem
Department of Computer Science
North Dakota State University
cs.ndsu.edu/~salem
These slides are based on slides by Jiawei Han and Micheline Kamber and Mohammed Zaki
Mining Frequent Patterns, Association and Correlations

Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
  What products were often purchased together? Beer and diapers?!
  What are the subsequent purchases after buying a PC?
  What kinds of DNA are sensitive to this new drug?
  Can we automatically classify web documents?
Applications
  Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
Discloses an intrinsic and important property of data sets
Forms the foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: associative classification
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
Basic Concepts: Frequent Patterns and Association Rules

Itemset X = {x1, ..., xk}
Find all the rules X → Y with minimum support and confidence
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y
Let sup_min = 50%, conf_min = 50%
Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
  A → D (60%, 100%)
  D → A (60%, 75%)

[Figure: Venn diagram of customers who buy beer, buy diapers, or buy both]

Transaction-id  Items bought
10              A, B, D
20              A, C, D
30              A, D, E
40              B, E, F
50              B, C, D, E, F
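The support and confidence definitions above can be checked directly against the slide's transaction database. A minimal Python sketch (the helper names `support` and `confidence` are mine, not from the slides):

```python
db = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db.values()) / len(db)

def confidence(X, Y):
    """Conditional probability that a transaction with X also contains Y."""
    return support(set(X) | set(Y)) / support(X)

print(round(support({"A", "D"}), 2))       # 0.6  -> rule support 60%
print(round(confidence({"A"}, {"D"}), 2))  # 1.0  -> A -> D has confidence 100%
print(round(confidence({"D"}, {"A"}), 2))  # 0.75 -> D -> A has confidence 75%
```

The printed values reproduce the (60%, 100%) and (60%, 75%) figures on the slide.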
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, ..., a100} contains
  C(100, 1) + C(100, 2) + ... + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
A closed pattern is a lossless compression of frequent patterns
  Reduces the # of patterns and rules
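The two definitions can be applied mechanically once all frequent itemsets and their supports are known. A sketch (the input `freq` reuses the frequent patterns from the earlier example with min_sup = 50%; the helper names are mine):

```python
# Frequent itemsets with supports, from the earlier example (min_sup = 50%).
freq = {
    frozenset("A"): 3, frozenset("B"): 3, frozenset("D"): 4,
    frozenset("E"): 3, frozenset("AD"): 3,
}

def closed(freq):
    """X is closed if no frequent strict superset has the same support."""
    return {X: s for X, s in freq.items()
            if not any(X < Y and s == sy for Y, sy in freq.items())}

def maximal(freq):
    """X is a max-pattern if it has no frequent strict superset at all."""
    return {X: s for X, s in freq.items()
            if not any(X < Y for Y in freq)}

# A is dropped from the closed set: its superset AD has the same support (3).
print(sorted("".join(sorted(X)) for X in closed(freq)))   # ['AD', 'B', 'D', 'E']
# A and D are dropped from the max-patterns: both are contained in AD.
print(sorted("".join(sorted(X)) for X in maximal(freq)))  # ['AD', 'B', 'E']
```

Note that every max-pattern is closed, but not vice versa: D is closed (AD has a different support) yet not maximal.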
Closed Patterns and Max-Patterns
Exercise. DB = {<a1, ..., a100>, <a1, ..., a50>}, Min_sup = 1.
What is the set of closed itemsets?
What is the set of max-patterns?
What is the set of all frequent patterns?
!!
Closed Patterns and Max-Patterns
Exercise. DB = {<a1, ..., a100>, <a1, ..., a50>}, Min_sup = 1.
What is the set of closed itemsets?
  <a1, ..., a100>: 1 and <a1, ..., a50>: 2
What is the set of max-patterns?
  <a1, ..., a100>: 1
What is the set of all frequent patterns?
  All non-empty subsets of {a1, ..., a100}: 2^100 − 1 patterns. Too many to list!
Mining Frequent Patterns, Association and Correlations

Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary
Brute Force Approach for Mining Frequent Itemsets

BRUTEFORCE(D, I, minsup):
1  foreach X ⊆ I do
2      sup(X) ← COMPUTESUPPORT(X, D);
3      if sup(X) ≥ minsup then
4          print X, sup(X)
5      end
6  end

Problem: lots of wasteful computation, since many of the candidates may not be frequent.
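The pseudocode above can be transcribed almost line for line. A Python sketch (the database reuses the Apriori example TDB that appears later in this lecture; names are mine):

```python
from itertools import chain, combinations

# Transaction database and item universe (from the Apriori example TDB).
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
I = sorted(set().union(*D))

def brute_force(D, I, minsup):
    """Enumerate every non-empty subset of I (2^|I| - 1 candidates),
    count its support against D, and emit the frequent ones."""
    for X in chain.from_iterable(combinations(I, k)
                                 for k in range(1, len(I) + 1)):
        sup = sum(set(X) <= t for t in D)   # COMPUTESUPPORT(X, D)
        if sup >= minsup:
            yield frozenset(X), sup

print(sorted(("".join(sorted(X)), s) for X, s in brute_force(D, I, 2)))
```

With minsup = 2 this tests all 31 candidate subsets of {A, B, C, D, E} even though only 9 turn out to be frequent, which is exactly the wasted work the slide is pointing at.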
Scalable Methods for Mining Frequent Patterns
The downward closure property of frequent patterns:
Any subset of a frequent itemset must be frequent
  If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods: three major approaches
  Apriori (Agrawal & Srikant @ VLDB'94)
  Frequent pattern growth (FPgrowth: Han, Pei & Yin @ SIGMOD'00)
  Vertical data format approach (CHARM: Zaki & Hsiao @ SDM'02)
Apriori: A Candidate Generation-and-Test Approach

Apriori pruning principle: If there is any itemset which is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @ VLDB'94; Mannila, et al. @ KDD'94)
Method:
  Initially, scan DB once to get the frequent 1-itemsets
  Generate length (k+1) candidate itemsets from length-k frequent itemsets
  Test the candidates against DB
  Terminate when no frequent or candidate set can be generated
The Apriori Algorithm: An Example

Sup_min = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

C1 (after 1st scan):
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

C2 (after 2nd scan):
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3 (generated from L2):
{B, C, E}

L3 (after 3rd scan):
Itemset    sup
{B, C, E}  2
The Apriori Algorithm

Pseudo-code:
Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
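The pseudo-code above can be sketched compactly in Python. This is an illustrative implementation, not the authors' code; the candidate-generation step is done by set unions plus the Apriori prune, and the database matches the worked example (TDB, Sup_min = 2):

```python
from itertools import combinations

def apriori(db, min_sup):
    """Sketch of the Apriori loop: generate, prune, count, repeat."""
    items = sorted(set().union(*db))
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in db) >= min_sup}       # L1
    result = {X: sum(X <= t for t in db) for X in Lk}
    k = 1
    while Lk:
        # join step: unions of two frequent k-itemsets forming a (k+1)-set
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        counts = {c: sum(c <= t for t in db) for c in cands}
        Lk = {c for c in cands if counts[c] >= min_sup}
        result.update({c: counts[c] for c in Lk})
        k += 1
    return result

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(tdb, 2)
print(sorted(("".join(sorted(X)), s) for X, s in freq.items()))
# reproduces the worked example: A:2, B:3, C:3, E:3, AC:2, BC:2, BE:3, CE:2, BCE:2
```

Note how the prune step eliminates {A, B, C} and {A, C, E} at k = 2 before any database scan, because {A, B} and {A, E} are not in L2.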
Important Details of Apriori

How to generate candidates?
  Step 1: self-joining Lk
  Step 2: pruning
How to count supports of candidates?
Example of candidate generation: L3 = {abc, abd, acd, ace, bcd}
  Self-joining: L3 * L3
    abcd from abc and abd
    acde from acd and ace
  Pruning:
    acde is removed because ade is not in L3
  C4 = {abcd}
How to Generate Candidates?

Suppose the items in Lk-1 are listed in an order

Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
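The self-join and prune steps can be sketched in Python on the slide's earlier example L3 = {abc, abd, acd, ace, bcd}, with itemsets kept as sorted tuples so the "listed in an order" precondition holds (the function name `apriori_gen` is mine):

```python
from itertools import combinations

def apriori_gen(Lk):
    """Self-join itemsets agreeing on their first k-1 items, then prune
    any candidate with an infrequent k-subset."""
    k = len(next(iter(Lk)))
    joined = {p[:k-1] + (p[k-1], q[k-1])
              for p in Lk for q in Lk
              if p[:k-1] == q[:k-1] and p[k-1] < q[k-1]}
    # prune: every k-subset of a candidate must itself be in Lk
    return {c for c in joined
            if all(s in Lk for s in combinations(c, k))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))
# {('a', 'b', 'c', 'd')} -- acde is pruned because ade is not in L3
```

The `p[k-1] < q[k-1]` condition mirrors the final predicate of the SQL-style join above and ensures each candidate is generated exactly once.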
How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
The total number of candidates can be very large
One transaction may contain many candidates
An Example
Minsup = 3
Brute force: 2^5 − 1 = 31 candidates
Apriori: 21 candidates, out of which 19 are frequent.
Challenges of Frequent Pattern Mining
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
Bottleneck of Frequent-pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
To find the frequent itemset i1 i2 ... i100:
  # of scans: 100
  # of candidates: C(100, 1) + C(100, 2) + ... + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30!
Bottleneck: candidate generation and test
Can we avoid candidate generation?
CHARM: Mining by Exploring Vertical Data Format

Vertical format: t(AB) = {T11, T25, ...}
  tid-list: list of transaction ids containing an itemset
Deriving closed patterns based on vertical intersections
  t(X) = t(Y): X and Y always happen together
  t(X) ⊂ t(Y): a transaction having X always has Y
Using diffsets to accelerate mining
  Only keep track of differences of tids
  t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
  Diffset(XY, X) = {T2}
Eclat/MaxEclat (Zaki et al. @ KDD'97), VIPER (P. Shenoy et al. @ SIGMOD'00), CHARM (Zaki & Hsiao @ SDM'02)
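The tid-list and diffset ideas above reduce to plain set operations. A minimal sketch using the slide's toy tids (T1, T2, T3; the dictionary layout is mine):

```python
# Vertical representation: each itemset maps to its tid-list.
t = {
    "X":  {"T1", "T2", "T3"},   # t(X)
    "XY": {"T1", "T3"},         # t(XY)
}

# Support of an itemset is just the length of its tid-list.
print(len(t["XY"]))                 # 2

# Diffset(XY, X): tids that contain X but not XY.
print(t["X"] - t["XY"])             # {'T2'}

# Support can be recovered from the diffset without storing t(XY):
# sup(XY) = sup(X) - |Diffset(XY, X)|
print(len(t["X"]) - len(t["X"] - t["XY"]))   # 2
```

The payoff is memory: for dense databases the diffset {T2} is far smaller than the full tid-list it replaces, which is what makes CHARM's vertical mining scale.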
Mining Frequent Patterns, Association and Correlations

Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary
Mining Other Interesting Patterns
Flexible support constraints (Wang et al. @ VLDB'02)
  Some items (e.g., diamond) may occur rarely but are valuable
  Customized sup_min specification and application
Top-k closed frequent patterns (Han, et al. @ ICDM'02)
  Hard to specify sup_min, but top-k with length_min is more desirable
  Dynamically raise sup_min during FP-tree construction and mining, and select the most promising path to mine
Mining Other Interesting Patterns
Mining discriminatory itemsets: can be used as features for classifying transactions.
For an itemset X, the delta score is
  Δ(X) = abs(s_D(X) − s_D'(X))
where s_D(X) and s_D'(X) are the supports of X in datasets D and D'.
The higher the delta score, the more discriminatory the pattern X is.

Dataset D:
Transaction-id  Items
10              A, B, D
20              A, C, D
30              A, D, E
40              B, E, F
50              B, C, D, E, F

Dataset D':
Transaction-id  Items
10              A, B, C, D, F
20              A, C, D, F
30              A, C, D, E, F
40              B, C, E, F
50              B, C, D, E, F
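The delta score can be computed directly on the two databases above. A sketch (the labels `D` and `D2` and the helper names are mine; supports are relative frequencies):

```python
# The slide's two transaction databases.
D = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
     {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
D2 = [{"A", "B", "C", "D", "F"}, {"A", "C", "D", "F"},
      {"A", "C", "D", "E", "F"}, {"B", "C", "E", "F"},
      {"B", "C", "D", "E", "F"}]

def sup(X, db):
    """Relative support of itemset X in database db."""
    return sum(set(X) <= t for t in db) / len(db)

def delta(X):
    """Delta score: gap between the supports in the two databases."""
    return abs(sup(X, D) - sup(X, D2))

print(round(delta({"C"}), 2))   # 0.6 -> |2/5 - 5/5|: C discriminates the datasets
print(round(delta({"B"}), 2))   # 0.0 -> |3/5 - 3/5|: B does not
```

Here {C} is a strong discriminatory feature (it appears in every transaction of the second database but only two of the first), while {B} carries no discriminating signal.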
Which Measures Should Be Used?
Lift and χ² are not good measures for correlations in large transactional DBs
all-conf or coherence could be good measures (Omiecinski @ TKDE'03)
Both all-conf and coherence have the downward closure property
Efficient algorithms can be derived for mining (Lee et al. @ ICDM'03sub)
Frequent-Pattern Mining: Summary
Frequent pattern mining is an important task in data mining
Scalable frequent pattern mining methods
Apriori (Candidate generation & test)
Projection-based (FPgrowth, CLOSET+, ...)
Vertical format approach (CHARM, ...)
Mining a variety of rules and interesting patterns
Constraint-based mining
Mining sequential and structured patterns
Extensions and applications
Frequent-Pattern Mining: Research Problems
Mining fault-tolerant frequent, sequential and structured patterns
  Patterns allow limited faults (insertion, deletion, mutation)
Mining truly interesting patterns
  Surprising, novel, concise, ...
Application exploration
  E.g., DNA sequence analysis and bio-pattern classification
Invisible data mining
Application: Mining Version Histories to Guide Software Changes (Zimmermann et al., 2005)

Suggest and predict likely changes. Suppose a programmer has just made a change. What else does she have to change?
Prevent errors due to incomplete changes. If a programmer wants to commit changes but has missed a related change, issue a warning.