4. Data Mining: finding trends, patterns, and relationships in
data (Knowledge Discovery in Databases, KDD)
5. Data Mining techniques: Association Rule Mining, Sequential
Pattern Mining, Classification, Clustering
Classification: Supervised. Classification rules are learned from
training data and then applied to new data.
Clustering: Unsupervised.
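6. Association Rule Mining
Definition: finding relationships among the data items of a data set.
Glossary:
Support(A ⇒ B) = P(A ∪ B), the probability that a transaction contains both A and B
Confidence(A ⇒ B) = P(B | A), the probability that a transaction containing A also contains B
Example: Buys(X, computer) ⇒ Buys(X, financial_management_software)
A = Buys(X, computer)
B = Buys(X, financial_management_software)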
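The two glossary measures can be computed directly by counting transactions. The sketch below is only an illustration: the transactions and the `printer` item are hypothetical and do not come from the slides.

```python
# Hypothetical toy transactions; not taken from the slides.
transactions = [
    {"computer", "financial_management_software"},
    {"computer"},
    {"computer", "financial_management_software", "printer"},
    {"printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

A = {"computer"}
B = {"financial_management_software"}
rule_support = support(A | B, transactions)        # 2 of 4 transactions
rule_confidence = confidence(A, B, transactions)   # 2 of the 3 computer buyers
```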
7. Association Rule Mining (cont.)
Algorithm: association rules are found with Apriori, given a minimum
support and a minimum confidence, in two steps:
a. Find the frequent itemsets.
b. Use the frequent itemsets to generate the association rules that
satisfy minimum support and minimum confidence.
Apriori finds the frequent itemsets level by level. It first counts
each item to find the frequent 1-itemsets, then repeats two steps to
obtain the frequent k-itemsets:
a. Join the frequent (k-1)-itemsets to produce the candidate k-itemsets.
b. Count the support of each candidate itemset; the candidates that
reach minimum support are the frequent k-itemsets.
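The join and count steps can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `apriori`, the use of an absolute support count, and the nine toy transactions over items 1 to 5 are all assumptions made for this example. The extra check that every (k-1)-subset of a candidate is frequent is the usual Apriori pruning step.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {frozenset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # Frequent 1-itemsets: count every single item.
    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: n for c, n in count(items).items() if n >= min_support_count}
    frequent = dict(current)
    k = 2
    while current:
        prev = list(current)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count: keep the candidates that reach minimum support.
        current = {c: n for c, n in count(candidates).items()
                   if n >= min_support_count}
        frequent.update(current)
        k += 1
    return frequent

# Illustrative data assumed for this sketch: 9 transactions over items 1..5.
toy_transactions = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]
frequent = apriori(toy_transactions, min_support_count=2)
```

With these assumed transactions, {1,4}, {3,4}, {3,5} and {4,5} fall below the minimum support, and the largest frequent itemsets are {1,2,3} and {1,2,5}.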
8. Association Rule Mining (cont.) Example: 9 transactions over
5 items, with minimum support 2
9. Association Rule Mining (cont.)
10. Association Rule Mining (cont.) The candidate itemsets {1,4},
{3,4}, {3,5} and {4,5} have support below the minimum support, so
they are dropped.
11. Association Rule Mining (cont.)
Itemsets whose support reaches minimum support are the frequent
itemsets. Association rules are then generated from them with a
minimum confidence of 70%:
1&5=>2
2&5=>1
5=>1&2
These association rules satisfy both minimum support and minimum
confidence.
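Rule generation from a single frequent itemset can be sketched as below. The transactions are an assumption for illustration (the slides do not show them), and `rules_from_itemset` is a hypothetical helper name. Each non-empty proper subset of the itemset is tried as the antecedent, and the rule is kept when its confidence reaches the threshold.

```python
from itertools import combinations

# Illustrative transactions assumed for this sketch; not shown in the slides.
transactions = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]

def support_count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(set(itemset) <= t for t in transactions)

def rules_from_itemset(itemset, min_confidence):
    """Return (antecedent, consequent, confidence) for each rule that qualifies."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            antecedent = frozenset(antecedent)
            # confidence(A => B) = count(A and B) / count(A)
            conf = support_count(itemset) / support_count(antecedent)
            if conf >= min_confidence:
                rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

rules = rules_from_itemset({1, 2, 5}, min_confidence=0.7)
```

Under these assumed transactions the surviving rules are 5=>1&2, 1&5=>2 and 2&5=>1, each with confidence 1.0; the other three antecedents ({1}, {2}, {1,2}) give confidences of 0.5 or less.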
12. Sequential Pattern Mining Q. How to find the sequential
patterns?
14. Sequential Pattern Mining (cont.)
Step 2: Large Itemset
With minimum support of 2 customers, the large itemsets
(litemsets) are: (30), (40), (70), (90), (40 70)
(Diagram labels: Item, Itemset, Transaction)
15. Sequential Pattern Mining (cont.)
Step 3: Sequences
One sequence is supported by customers 1 and 4; another is
supported by customers 2 and 4.
(Diagram label: 3-Sequence)
16. Sequential Pattern Mining (cont.)
Step 4: Large Sequences
Q. Find the large sequences with minimum support set to 25%:
- Large sequences: , , , , , ,
17. Sequential Pattern Mining (cont.)
Step 5: Maximal Sequences
Q. Find the maximal sequences with minimum support of 2 customers:
- The answer set (the sequential patterns) is: ,
18. Sequential Pattern Mining (cont.)
The algorithm has five phases: Sort phase, Large itemset phase,
Transformation phase, Sequence phase, Maximal phase
Algorithms: AprioriAll, AprioriSome, DynamicSome
19. Sort phase: Sort the database with customer-id as the major
key and transaction-time as the minor key.
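A minimal sketch of the sort phase, assuming each database row is a hypothetical `(customer_id, transaction_time, items)` tuple with ISO-formatted dates. After sorting, consecutive rows of one customer form that customer's sequence.

```python
from operator import itemgetter

# Hypothetical rows: (customer_id, transaction_time, items).
rows = [
    (2, "1993-06-10", {10, 20}),
    (1, "1993-06-30", {90}),
    (2, "1993-06-15", {30}),
    (1, "1993-06-25", {30}),
]

# customer_id is the major key, transaction_time the minor key.
sorted_rows = sorted(rows, key=itemgetter(0, 1))

# Collect one time-ordered list of transactions per customer.
customer_sequences = {}
for customer_id, _time, items in sorted_rows:
    customer_sequences.setdefault(customer_id, []).append(items)
```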
20. Litemset phase: Find the large itemsets. This works as in
association rule mining, except that the support of an itemset is
counted over customers rather than over transactions. The large
itemsets are then given a mapping.
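Customer-based support can be sketched as follows. The customer sequences below are assumed for illustration, and the helper enumerates every subset of every transaction, which is only practical for toy data; it is not the Apriori-style search a real implementation would use.

```python
from itertools import combinations

# Hypothetical customer sequences: customer id -> time-ordered transactions.
customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

def large_itemsets(customer_sequences, min_customers):
    """Count each itemset once per customer, however many transactions contain it."""
    support = {}
    for transactions in customer_sequences.values():
        seen = set()
        for t in transactions:
            for r in range(1, len(t) + 1):
                seen.update(frozenset(c) for c in combinations(sorted(t), r))
        for itemset in seen:
            support[itemset] = support.get(itemset, 0) + 1
    return {s: n for s, n in support.items() if n >= min_customers}

litemsets = large_itemsets(customer_sequences, min_customers=2)
```

With this assumed data and a minimum support of 2 customers, the result is (30), (40), (70), (90) and (40 70).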
21. Transformation phase: Deleting non-large itemsets. Mapping
large itemsets to integers.
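The two operations of the transformation phase can be sketched together. The litemset list, the integer ids, and the `transform` helper are assumptions made for this example: each transaction is replaced by the ids of the litemsets it contains, and transactions containing none are deleted.

```python
# Hypothetical litemsets, mapped to the integers 1..5 in list order.
litemsets = [{30}, {40}, {70}, {40, 70}, {90}]
litemset_id = {frozenset(s): i for i, s in enumerate(litemsets, start=1)}

def transform(transactions):
    """Replace each transaction by the set of ids of the litemsets it contains."""
    transformed = []
    for t in transactions:
        ids = {i for s, i in litemset_id.items() if s <= t}
        if ids:  # a transaction containing no litemset is dropped
            transformed.append(ids)
    return transformed

# {10, 20} holds no litemset; {40, 60, 70} holds (40), (70) and (40 70).
result = transform([{10, 20}, {30}, {40, 60, 70}])
```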
22. Sequence phase: Use the set of litemsets to find the desired
sequences. Two families of algorithms:
Count-all: AprioriAll
Count-some: AprioriSome, DynamicSome
23. Maximal phase: Find the maximal sequences among the set of
large sequences. A large sequence is maximal if it is not a
sub-sequence of any other large sequence. In some algorithms, this
phase is combined with the sequence phase.
24. Maximal phase
Algorithm:
S: the set of all large sequences
n: the length of the longest sequence
for (k = n; k > 1; k--) do
    for each k-sequence sk do
        Delete from S all subsequences of sk
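The loop above can be sketched in Python, assuming a sequence is represented as a tuple of frozensets and that one sequence is a subsequence of another when its itemsets are contained, in order, in distinct itemsets of the other. The set of large sequences here is hypothetical, and `seq` is a helper name introduced only for this example.

```python
def seq(*itemsets):
    """Build a hashable sequence from itemsets (helper for this sketch)."""
    return tuple(frozenset(s) for s in itemsets)

def is_subsequence(a, b):
    """True if each itemset of `a` is contained in a distinct itemset of `b`, in order."""
    it = iter(b)
    return all(any(x <= y for y in it) for x in a)

def maximal_sequences(large_sequences):
    S = set(large_sequences)
    n = max(len(s) for s in S)
    for k in range(n, 1, -1):
        for sk in [s for s in S if len(s) == k]:
            if sk in S:  # sk may already have been deleted by a longer sequence
                S -= {s for s in S if s != sk and is_subsequence(s, sk)}
    return S

# Hypothetical set of large sequences.
large = {
    seq({30}), seq({40}), seq({70}), seq({90}), seq({40, 70}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {90}), seq({30}, {40, 70}),
}
maximal = maximal_sequences(large)
```

Because the loop stops at k = 2, as in the pseudocode, a 1-sequence is only removed when it is a subsequence of some longer sequence.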
25. AprioriAll: The basic method to mine sequential patterns.
Based on the Apriori algorithm. Counts all the large sequences,
including non-maximal sequences. Uses the Apriori-generate function
to generate candidate sequences.
26. Apriori Candidate Generation: Generate the candidates for
each pass using only the large sequences found in the previous pass.
Then make a pass over the data to find the support of the
candidates.
27. Apriori Candidate Generation
Algorithm:
Lk: the set of all large k-sequences
Ck: the set of candidate k-sequences
insert into Ck
    select p.litemset1, p.litemset2, ..., p.litemsetk-1, q.litemsetk-1
    from Lk-1 p, Lk-1 q
    where p.litemset1 = q.litemset1, ..., p.litemsetk-2 = q.litemsetk-2;
for all sequences c ∈ Ck do
    for all (k-1)-subsequences s of c do
        if (s ∉ Lk-1) then delete c from Ck;
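A minimal sketch of the join and prune steps, assuming each sequence is a tuple of integer litemset ids. The function name `apriori_generate` and the input set are illustrative. The join follows the where-clause literally, so a sequence may also be joined with itself; such candidates are removed by the prune step in this example.

```python
from itertools import combinations

def apriori_generate(large_prev):
    """Build the candidate k-sequences from the set of large (k-1)-sequences."""
    large_prev = set(large_prev)
    if not large_prev:
        return set()
    k = len(next(iter(large_prev))) + 1
    # Join: p and q agree on their first k-2 elements; append q's last element to p.
    candidates = {
        p + (q[-1],)
        for p in large_prev for q in large_prev
        if p[:-1] == q[:-1]
    }
    # Prune: delete c if any of its (k-1)-subsequences is not large.
    return {
        c for c in candidates
        if all(s in large_prev for s in combinations(c, k - 1))
    }

# Illustrative large 3-sequences (assumed input).
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_generate(L3)
```

Here the join produces candidates such as (1, 2, 3, 4), (1, 2, 4, 3), (1, 3, 4, 5) and (1, 3, 5, 4), and only (1, 2, 3, 4) survives pruning.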
28. AprioriAll (cont.)
L1 = {large 1-sequences}; // Result of the litemset phase
for (k = 2; Lk-1 ≠ ∅; k++) do
begin
    Ck = New candidates generated from Lk-1
    foreach customer-sequence c in the database do
        Increment the count of all candidates in Ck that are contained in c
    Lk = Candidates in Ck with minimum support.
end
Answer = Maximal Sequences in ∪k Lk;
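The whole loop can be sketched end to end, assuming the database has already been transformed so that each customer sequence is a list of sets of integer litemset ids. The data, the function names, and the choice to treat ids as opaque in the final maximal filter (it ignores containment between the underlying itemsets) are all assumptions of this sketch.

```python
from itertools import combinations

def contains(customer_sequence, candidate):
    """True if the candidate's ids occur, in order, in distinct transactions."""
    it = iter(customer_sequence)
    return all(any(item in t for t in it) for item in candidate)

def generate(large_prev):
    """Apriori candidate generation: join on the shared prefix, then prune."""
    k = len(next(iter(large_prev))) + 1
    joined = {p + (q[-1],) for p in large_prev for q in large_prev
              if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(s in large_prev for s in combinations(c, k - 1))}

def is_subsequence(a, b):
    it = iter(b)
    return all(x in it for x in a)

def apriori_all(customer_sequences, min_count):
    def count(c):
        return sum(contains(cs, c) for cs in customer_sequences)

    ids = {i for cs in customer_sequences for t in cs for i in t}
    # L1: large 1-sequences.
    level = {(i,): count((i,)) for i in ids}
    level = {c: n for c, n in level.items() if n >= min_count}
    large = dict(level)
    while level:
        # One pass over the data per level to count the new candidates.
        level = {c: count(c) for c in generate(set(level))}
        level = {c: n for c, n in level.items() if n >= min_count}
        large.update(level)
    answer = {s for s in large
              if not any(s != t and is_subsequence(s, t) for t in large)}
    return large, answer

# Hypothetical transformed database of five customers.
database = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
large, answer = apriori_all(database, min_count=2)
```

With this assumed database and a minimum of 2 supporting customers, the maximal sequences are (1, 2, 3, 4), (1, 3, 5) and (4, 5).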
29. Apriori Candidate Generation
Example: (Customer Sequences), with minimum support set to 25%
Next step: find the large 1-sequences
30. Example: Large 1-Sequence
(Sequence / Support table; support values: 4, 2, 4, 4, 2)
Next step: find the large 2-sequences
31. Example: Large 2-Sequence
(Sequence / Support table; support values: 2, 4, 3, 3, 2, 2, 3, 2, 2)
Next step: find the large 3-sequences
32. Example: Large 3-Sequence
(Sequence / Support table; support values: 2, 2, 3, 2, 2)
Next step: find the large 4-sequences
33. Example: Large 4-Sequence
(Sequence / Support table; support value: 2)
Next step: find the sequential pattern
34. Example: Find the maximal large sequences
(Four Sequence / Support tables; support values per table:
2; 4, 2, 4, 4, 2; 2, 4, 3, 3, 2, 2, 3, 2, 2; 2, 2, 3, 2, 2)
35. Classification
Definition: classification rules are learned from training data and
then used to classify new data.
Classification rules can be represented as a Decision Tree. Finding
the best Decision Tree is NP-Hard, so induction-based methods such as
Hunt's method are used.
36. Classification (cont.) Given a set of training cases T with
classes (C1, C2, ..., Ck), a Decision Tree is built from the cases.
37. Classification (cont.) Example: Training Data Set with the
attributes Outlook, Windy, Humidity
39. Clustering
Definition: Clustering groups data without predefined classes; it is
unsupervised learning.
Clustering Methods:
Partitioning methods: divide the data into k groups, under two
conditions (a, b) on how the data is grouped.
Hierarchical methods: build a hierarchical grouping, either
agglomerative (bottom up) or divisive (top down).
Density-based methods: form each cluster according to the density of
the data.
Grid-based methods: divide the data space into cells that form a
multiresolution grid data structure, and cluster on that grid
structure.
Model-based methods
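As an illustration of the agglomerative (bottom up) idea among the hierarchical methods, the sketch below starts with one cluster per point and repeatedly merges the two closest clusters. The one-dimensional points, the single-linkage distance, and the stopping rule (a target number of clusters) are assumptions chosen for brevity; the slides do not specify them.

```python
def agglomerative(points, num_clusters):
    """Bottom up: merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]

    def distance(a, b):
        # Single linkage (an assumption): gap between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > num_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]

result = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], num_clusters=3)
```

A divisive (top down) method would run in the opposite direction, starting from one cluster that holds all the data and splitting it.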