Date: 2006.11.25

Data mining 1

  • Upload
    ya-dori


  1. Date: 2006.11.25
  2. Data Mining: Association Rule Mining, Sequential Pattern Mining, Classification, Clustering
  3. Mining: Corporate Memory, Corporate Intelligence, Data Mining
  4. Data Mining: Trend, Pattern, Relationship (Knowledge Discovery in Databases, KDD)
  5. Data Mining techniques: Association Rule Mining, Sequential Pattern Mining, Classification, Clustering. Classification learns classification rules from training data and applies those rules to other data. Classification is Supervised; Clustering is Unsupervised.
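  6. Association Rule Mining. Definition: finding associations among the data items in a data set. Glossary: $\mathrm{Support}(A \Rightarrow B) = P(A \cup B)$, the fraction of transactions that contain both A and B; $\mathrm{Confidence}(A \Rightarrow B) = P(B \mid A)$. Example: Buys(X, computer) ⇒ Buys(X, financial_management_software), with A = Buys(X, computer) and B = Buys(X, financial_management_software); the rule is described by its Support(A ⇒ B) and Confidence(A ⇒ B).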
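The two glossary measures can be computed directly from a list of transactions. The following is a minimal sketch; the function names and the four purchase records are illustrative assumptions, not data from the slides.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """P(B | A): support of A and B together, divided by support of A."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(both, transactions) / support(a, transactions)


# Hypothetical purchase records (assumed for illustration only).
transactions = [
    frozenset({"computer", "financial_management_software"}),
    frozenset({"computer"}),
    frozenset({"computer", "financial_management_software", "printer"}),
    frozenset({"printer"}),
]

rule_support = support({"computer", "financial_management_software"}, transactions)
rule_confidence = confidence({"computer"}, {"financial_management_software"}, transactions)
```

With these assumed records, two of the four transactions contain both items and three contain a computer, so the rule has support 2/4 and confidence 2/3.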
  7. Association Rule Mining (cont.) Algorithm: Apriori. Given a minimum support and a minimum confidence, Association Rules are mined in two steps: a. find the frequent item sets; b. generate Association Rules from the frequent item sets. An Association Rule must meet both the minimum support and the minimum confidence. Apriori finds the frequent item sets level by level: it first finds the frequent 1-itemsets, then derives the frequent k-itemsets from the frequent (k-1)-itemsets: a. Join the frequent (k-1)-itemsets to form Candidate itemsets of k items; b. discard every Candidate itemset below the minimum support; those that meet the minimum support are the frequent k-itemsets.
  8. Association Rule Mining (cont.) Example: 9 transactions over 5 items, minimum support 2
  9. Association Rule Mining (cont.)
  10. Association Rule Mining (cont.) {1,4}, {3,4}, {3,5}, {4,5}: support is below the minimum support
  11. Association Rule Mining (cont.) The sets whose support meets the minimum support are the frequent itemsets. Association Rules with minimum confidence 70%: 1&5=>2, 2&5=>1, 5=>1&2. These Association Rules meet both the minimum support and the minimum confidence.
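The join, prune and rule-generation steps described for Apriori can be sketched as follows. This is a minimal illustration, not the presenter's implementation. The transaction table on the slides did not survive extraction, so the nine transactions here are an assumed stand-in, chosen to agree with the statements that are still legible (9 transactions, 5 items, minimum support 2). The names `apriori` and `rules` are likewise hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support count}; min_support is an absolute count."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    current = {s: n for s, n in count(items).items() if n >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        current = {s: n for s, n in count(candidates).items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules meeting min_confidence."""
    out = []
    for itemset, n in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = n / frequent[lhs]  # every subset of a frequent set is frequent
                if conf >= min_confidence:
                    out.append((lhs, itemset - lhs, conf))
    return out


# Assumed stand-in data: 9 transactions over items 1..5.
transactions = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]
frequent = apriori(transactions, min_support=2)
strong = rules(frequent, min_confidence=0.7)
```

Under these assumed transactions, {1,4}, {3,4}, {3,5} and {4,5} fall below the minimum support, and the three rules 1&5=>2, 2&5=>1 and 5=>1&2 derived from {1,2,5} each reach 100% confidence.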
  12. Sequential Pattern Mining. Q. How to find the sequential patterns?
  13. Sequential Pattern Mining (cont.) Step 1: Sort by Customer_Id and TransactionTime. Terms: Item, Itemset, Transaction
  14. Sequential Pattern Mining (cont.) Step 2: Large Itemset. With minimum support of 2 customers, the large itemsets (litemsets) are: (30), (40), (70), (90), (40 70)
  15. Sequential Pattern Mining (cont.) Step 3: Sequences. One sequence is supported by customers 1 and 4; another is supported by customers 2 and 4. 3-Sequence
  16. Sequential Pattern Mining (cont.) Step 4: Large Sequences. Q. Find the large sequences with minimum support set to 25%
  17. Sequential Pattern Mining (cont.) Step 5: Maximal Sequences (the Sequential Patterns). Q. Find the maximal sequences with minimum support of 2 customers
  18. Sequential Pattern Mining (cont.) The algorithm has five phases: Sort phase, Large itemset phase, Transformation phase, Sequence phase, Maximal phase. Algorithms: AprioriAll, AprioriSome, DynamicSome
  19. Sort phase: Sort the database with customer-id as the major key and transaction-time as the minor key.
  20. Litemset phase: Find the large itemsets. This works as in association rule mining, except that the support of an itemset is the number of customers, not the number of transactions, that contain the itemset. The large itemsets are then given a mapping.
  21. Transformation phase: Deleting non-large itemsets; Mapping large itemsets to integers
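The sort, litemset and transformation phases can be sketched together. This is a rough illustration under several assumptions. The transaction rows are hypothetical, chosen so that the large itemsets come out as (30), (40), (70), (90), (40 70) with a minimum support of 2 customers. The function names are invented for this sketch. The litemset phase enumerates every sub-itemset by brute force, which is only reasonable for tiny data.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows: (customer_id, transaction_time, items).
rows = [
    (2, 3, {40, 60, 70}), (1, 1, {30}), (4, 2, {40, 70}), (2, 1, {10, 20}),
    (3, 1, {30, 50, 70}), (1, 2, {90}), (4, 1, {30}), (2, 2, {30}),
    (4, 3, {90}), (5, 1, {90}),
]


def customer_sequences(rows):
    """Sort phase: customer id is the major key, transaction time the minor key."""
    seqs = defaultdict(list)
    for cust, _time, items in sorted(rows, key=lambda r: (r[0], r[1])):
        seqs[cust].append(frozenset(items))
    return dict(seqs)


def large_itemsets(seqs, min_customers):
    """Litemset phase: support = number of customers with a transaction containing the itemset."""
    supporters = defaultdict(set)
    for cust, transactions in seqs.items():
        for t in transactions:
            for r in range(1, len(t) + 1):
                for sub in combinations(sorted(t), r):
                    supporters[frozenset(sub)].add(cust)
    return {s: len(c) for s, c in supporters.items() if len(c) >= min_customers}


def transform(seqs, litemsets):
    """Transformation phase: map litemsets to integers and drop everything else."""
    ordered = sorted(litemsets, key=lambda s: (len(s), sorted(s)))
    ids = {s: i for i, s in enumerate(ordered, 1)}
    out = {}
    for cust, transactions in seqs.items():
        mapped = [{ids[s] for s in ids if s <= t} for t in transactions]
        mapped = [m for m in mapped if m]   # delete transactions with no litemset
        if mapped:                          # delete customers left with nothing
            out[cust] = mapped
    return ids, out


seqs = customer_sequences(rows)
litemsets = large_itemsets(seqs, min_customers=2)
ids, transformed = transform(seqs, litemsets)
```

Counting each customer at most once per itemset is what distinguishes this support from the per-transaction support used in association rule mining.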
  22. Sequence phase: Use the set of litemsets to find the desired sequences. Two families of algorithms: Count-all: AprioriAll; Count-some: AprioriSome, DynamicSome
  23. Maximal phase: Find the maximal sequences among the set of large sequences, that is, the large sequences that are not sub-sequences of any other large sequence. In some algorithms, this phase is combined with the sequence phase.
  24. Maximal phase. Algorithm, where S is the set of all large sequences and n is the length of the longest sequence:
      for (k = n; k > 1; k--) do
          for each k-sequence sk do
              Delete from S all subsequences of sk
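The deletion loop of the maximal phase can be sketched in a few lines. The representation is an assumption made for this sketch: a sequence is a tuple of frozensets, and one sequence is contained in another when its itemsets are subsets of distinct itemsets of the other, in the same order. The helper `seq` and the sample set of large sequences are hypothetical.

```python
def is_subsequence(small, big):
    """True if the itemsets of `small` fit, in order, inside distinct itemsets of `big`."""
    pos = 0
    for itemset in small:
        while pos < len(big) and not itemset <= big[pos]:
            pos += 1
        if pos == len(big):
            return False
        pos += 1
    return True


def maximal_sequences(large):
    """Delete every sequence that is a subsequence of a longer or equally long survivor."""
    remaining = set(large)
    longest = max((len(s) for s in remaining), default=0)
    for k in range(longest, 1, -1):
        for sk in [s for s in remaining if len(s) == k]:
            if sk not in remaining:       # already deleted as a subsequence of another
                continue
            remaining -= {s for s in remaining if s != sk and is_subsequence(s, sk)}
    return remaining


def seq(*itemsets):
    """Hypothetical helper: build a sequence from lists of items."""
    return tuple(frozenset(i) for i in itemsets)


# Assumed set of large sequences, for illustration only.
large = {
    seq([30]), seq([40]), seq([70]), seq([90]), seq([40, 70]),
    seq([30], [40]), seq([30], [70]), seq([30], [90]), seq([30], [40, 70]),
}
maximal = maximal_sequences(large)
```

For this assumed input, only the sequences (30)(90) and (30)(40 70) remain.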
  25. AprioriAll: The basic method to mine sequential patterns. Based on the Apriori algorithm. Counts all the large sequences, including non-maximal sequences. Uses the Apriori-generate function to generate candidate sequences.
  26. Apriori Candidate Generation: Generate the candidates for a pass using only the large sequences found in the previous pass. Then make a pass over the data to find the support of the candidates.
  27. Apriori Candidate Generation. Algorithm, where Lk is the set of all large k-sequences and Ck is the set of candidate k-sequences:
      insert into Ck
      select p.litemset1, p.litemset2, ..., p.litemsetk-1, q.litemsetk-1
      from Lk-1 p, Lk-1 q
      where p.litemset1 = q.litemset1, ..., p.litemsetk-2 = q.litemsetk-2;
      for all sequences c ∈ Ck do
          for all (k-1)-subsequences s of c do
              if (s ∉ Lk-1) then delete c from Ck;
  28. AprioriAll (cont.)
      L1 = {large 1-sequences}; // Result of the litemset phase
      for (k = 2; Lk-1 ≠ ∅; k++) do begin
          Ck = New candidates generated from Lk-1
          foreach customer-sequence c in the database do
              Increment the count of all candidates in Ck that are contained in c
          Lk = Candidates in Ck with minimum support.
      end
      Answer = Maximal Sequences in ∪k Lk;
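The candidate generation and the counting loop of AprioriAll can be sketched as follows. This is a minimal illustration under stated assumptions. Customer sequences are taken to be already transformed, so each transaction is a set of litemset ids. A candidate is a tuple of ids. The function names and the five customer sequences are hypothetical. The final maximal-sequence step is left out.

```python
def contains(customer_seq, candidate):
    """True if the candidate's litemset ids occur, in order, in distinct transactions."""
    pos = 0
    for lit in candidate:
        while pos < len(customer_seq) and lit not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def generate_candidates(prev_large, k):
    """Join L(k-1) with itself on the first k-2 litemsets, then prune."""
    joined = {p + (q[-1],) for p in prev_large for q in prev_large if p[:-1] == q[:-1]}
    # Prune: every (k-1)-subsequence (drop one position) must be large.
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev_large for i in range(k))}


def apriori_all(customer_seqs, min_customers):
    """Return {large sequence: number of supporting customers} for all lengths."""
    def count(cands):
        return {c: sum(contains(s, c) for s in customer_seqs) for c in cands}

    ids = {lit for s in customer_seqs for t in s for lit in t}
    large = {c: n for c, n in count({(i,) for i in ids}).items() if n >= min_customers}
    result = dict(large)
    k = 2
    while large:
        cands = generate_candidates(set(large), k)
        large = {c: n for c, n in count(cands).items() if n >= min_customers}
        result.update(large)
        k += 1
    return result


# Hypothetical transformed customer sequences (one list of id-sets per customer).
customer_seqs = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
large_sequences = apriori_all(customer_seqs, min_customers=2)
```

The join deliberately allows p and q to be the same sequence, as in the SQL-style formulation, so candidates that repeat a litemset are generated and then removed by pruning or counting when they lack support.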
  29. Apriori Candidate Generation. Example (Customer Sequences), with minimum support set to 25%. Next step: find the large 1-sequences
  30. Example: Large 1-Sequences. Support column: 4, 2, 4, 4, 2. Next step: find the large 2-sequences
  31. Example: Large 2-Sequences. Support column: 2, 4, 3, 3, 2, 2, 3, 2, 2. Next step: find the large 3-sequences
  32. Example: Large 3-Sequences. Support column: 2, 2, 3, 2, 2. Next step: find the large 4-sequences
  33. Example: Large 4-Sequences. Support column: 2. Next step: find the sequential patterns
  34. Example: Find the maximal large sequences among the large 1-, 2-, 3- and 4-sequences
  35. Classification. Definition: learn classification rules from training data, then use the Classification Rules to classify other data. A Decision Tree is one Classification technique, and a Decision Tree can be read as a set of Classification Rules. Finding the best Decision Tree is NP-Hard, so induction-based methods such as Hunt's method are used.
  36. Classification (cont.) A Decision Tree is built from a set T of Training Cases whose classes are (C1, C2, ..., Ck).
  37. Classification (cont.) Example: Training Data Set with attributes Outlook, Windy, Humidity
  38. Classification (cont.) Classification Rule example: Outlook = rain, Humidity = 95, Windy = false
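An induction-based tree builder in the spirit of Hunt's method can be sketched as a short recursion: stop when all cases share a class, otherwise choose a test, partition the cases and recurse. Everything specific in this sketch is an assumption. The slides' training table was lost, so the eight cases are invented. Humidity is treated as a category rather than a number. Choosing the test by lowest weighted entropy is a common criterion, not necessarily the one used in the presentation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def build_tree(cases, attributes, default=None):
    """cases: list of (attribute dict, class label). Returns a label or (attribute, branches)."""
    if not cases:
        return default                       # no cases: use the parent's majority class
    labels = [label for _, label in cases]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority                      # one class only, or nothing left to test

    def split_entropy(attr):
        groups = {}
        for row, label in cases:
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(cases) * entropy(g) for g in groups.values())

    best = min(attributes, key=split_entropy)
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row, _ in cases}:
        subset = [(row, label) for row, label in cases if row[best] == value]
        branches[value] = build_tree(subset, rest, majority)
    return (best, branches)


def classify(tree, row, default=None):
    """Follow the branches that match `row` until a class label is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        if row[attr] not in branches:
            return default
        tree = branches[row[attr]]
    return tree


# Hypothetical training cases (invented for illustration).
cases = [
    ({"outlook": "sunny", "windy": False, "humidity": "high"}, "dont_play"),
    ({"outlook": "sunny", "windy": True, "humidity": "high"}, "dont_play"),
    ({"outlook": "sunny", "windy": False, "humidity": "normal"}, "play"),
    ({"outlook": "overcast", "windy": True, "humidity": "high"}, "play"),
    ({"outlook": "overcast", "windy": False, "humidity": "normal"}, "play"),
    ({"outlook": "rain", "windy": False, "humidity": "high"}, "play"),
    ({"outlook": "rain", "windy": True, "humidity": "normal"}, "dont_play"),
    ({"outlook": "rain", "windy": False, "humidity": "normal"}, "play"),
]
tree = build_tree(cases, ["outlook", "windy", "humidity"])
prediction = classify(tree, {"outlook": "rain", "windy": False, "humidity": "high"})
```

Each path from the root to a leaf of the resulting tree reads as one classification rule, for example a test on outlook followed by a test on windy.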
  39. Clustering. Definition: Clustering is unsupervised learning. Clustering Methods:
      • Partitioning methods: split the data into k partitions
      • Hierarchical methods: agglomerative (bottom up) or divisive (top down)
      • Density-based methods: Density-based Clustering of the Data
      • Grid-based methods: Grid-based Clustering on a multiresolution grid data structure made of cells
      • Model-based methods
  40. Clustering (cont.) Applications of Clustering: Market Segmentation, Fraud Detection, Defect Analysis, Lapse Analysis
  41. Clustering (cont.) Questions when clustering: How many clusters should there be? Which records belong to which cluster? How are similarity (distance) and weighting defined?
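As one concrete instance of a partitioning method with a distance-based notion of similarity, here is a minimal k-means sketch. The slides do not name k-means, so the choice of algorithm is an assumption, as are the Euclidean distance, the use of the first k points as initial centers, and the sample points.

```python
from math import dist


def k_means(points, k, iterations=100):
    """Partition `points` (tuples of floats) into k clusters; returns (centroids, assignments)."""
    centroids = list(points[:k])             # assumption: first k points as initial centers
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:   # converged: no point changed cluster
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments


# Hypothetical 2-D records (invented for illustration).
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
```

The number of clusters k has to be supplied by the caller, and a different distance or weighting would change which records end up together.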
  42. Data Mining: Association Rule, Item, Clustering