jpark@cs.sungshin.ac.kr

- Data Mining in the KDD Process
- Association Rule Mining
- Association Rules in Transaction Databases
- Algorithm Apriori & DHP
- Generalized Association Rules
- Cyclic Association Rules and Negative Associations
- Interestingness Measurement
- Sequential Patterns and Path Traversal Patterns
- Homepages

Overview of the steps constituting the KDD process
Data -> [Selection] -> Target Data -> [Preprocessing] -> Preprocessed Data -> [Transformation] -> Transformed Data -> [Data Mining] -> Patterns -> [Interpretation/Evaluation] -> Knowledge

Types of Data-Mining Problems
- Prediction: Classification, Regression, Time Series
- Knowledge Discovery: Deviation Detection, Database Segmentation, Clustering, Association Rules, Summarization, Visualization, Text mining

Association Rule
Ex: the statement that 90% of transactions that purchase bread and butter also purchase milk.
- antecedent: bread and butter
- consequent: milk
Typical queries:
- Find all rules that have Diet Coke as consequent.
- Find all rules that have bagels in the antecedent.
- Find the best k rules that have bagels in the consequent.

I: a set of literals, called items.
T: a transaction, a set of items such that T ⊆ I.
An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
Notation: X ⇒ Y [support, confidence]

Transaction Databases
Applications: pattern association, market analysis, etc.
Given:
- data of transactions
- each transaction has a list of items purchased
Find all association rules: the presence of one set of items implies the presence of another set of items.
- e.g., people who purchased hammers also purchased nails.
Measurement of rule strength:
- Confidence: X & Y ⇒ Z has 90% confidence if 90% of customers who bought X and Y also bought Z.
- Support: useful rules (for business decisions) should have some minimum transaction support.
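The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names `support` and `confidence` and the toy hammer/nails transactions are assumptions made for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(X ∪ Y) / support(X); assumes support(X) > 0."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical toy database (not from the slides).
transactions = [
    frozenset({"hammer", "nails"}),
    frozenset({"hammer", "nails", "glue"}),
    frozenset({"hammer"}),
    frozenset({"glue"}),
]

rule_support = support({"hammer", "nails"}, transactions)
rule_confidence = confidence({"hammer"}, {"nails"}, transactions)
```

Here the rule hammer ⇒ nails is backed by two of four transactions, and two of the three hammer buyers also bought nails.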

Two Steps for Association Rules
1. Determining large itemsets
- Find all combinations of items that have transaction support above minimum support.
- Research has focused on this phase.
2. Generating rules
    for each large itemset L do
        for each subset c of L do
            if (support(L) / support(L - c) ≥ minimum confidence) then
                output the rule (L - c) ⇒ c,
                with confidence = support(L) / support(L - c) and support = support(L);
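The rule-generation step can be sketched as follows. The helper name `generate_rules` and the support counts in `supports` are assumptions for illustration (the counts correspond to a small four-transaction database, ACD, BCE, ABCE, BE, with minimum support 2). The dictionary must hold the support of every subset of each large itemset, which is guaranteed because every subset of a large itemset is itself large.

```python
from itertools import combinations


def generate_rules(supports, min_conf):
    """Return (antecedent, consequent, confidence, support) for every
    rule (L - c) => c whose confidence reaches min_conf."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for c in combinations(sorted(itemset), size):
                consequent = frozenset(c)
                antecedent = itemset - consequent
                conf = sup / supports[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, consequent, conf, sup))
    return rules


# Hypothetical support counts of all large itemsets.
supports = {
    frozenset("A"): 2, frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("AC"): 2, frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}

rules = generate_rules(supports, min_conf=1.0)
```

With a minimum confidence of 100%, only rules whose antecedent never occurs without the consequent survive, for example B ⇒ E and {B, C} ⇒ E.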

Candidate Itemsets -> [Scan Database] -> Large Itemsets
- How to generate candidate itemsets: Apriori method (join step + prune step)
- How to scan the database: focus on data structures to speed up scanning (hash tree, trie, hash table, etc.)
- Parameters: minimum support, minimum confidence

minimum support = 2

Algorithms for Mining Association Rules
- AIS (Agrawal et al., ACM SIGMOD, May 93)
- SETM (Swami et al., IBM Tech. Rep., Oct 93)
- Apriori (Agrawal et al., VLDB, Sept 94)
- OCD (Mannila et al., AAAI workshop on KDD, July 94)
- DHP (Park et al., ACM SIGMOD, May 95)
- PARTITION (Savasere et al., VLDB, Sept 95)
- Mining Generalized Association Rules (Srikant et al., VLDB, Sept 95)
- Sampling Approach (Toivonen, VLDB, Sept 96)
- DIC (dynamic itemset counting, Brin et al., ACM SIGMOD, May 97)
- Cyclic Association Rules (Özden et al., IEEE ICDE, Feb 98)
- Negative Associations (Savasere et al., IEEE ICDE, Feb 98)

Algorithm Apriori
Lk: set of large k-itemsets
Ck: set of candidate k-itemsets
Steps: C1 -> L1 -> C2 -> L2 -> ... -> Ck -> Lk
Input: transaction file. Output: large itemsets.

L1 = {large 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup}
end
Answer = ∪k Lk;
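The level-wise loop above can be sketched in Python. This is a simplified illustration under stated assumptions: candidates are formed from unions of two large (k-1)-itemsets rather than the prefix join of apriori-gen, counting is a plain scan instead of a hash tree, and the four-transaction database is a hypothetical example.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: count} for every itemset with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: n for s, n in counts.items() if n >= minsup}
    answer = dict(large)

    k = 2
    while large:
        prev = set(large)
        # Candidate generation: size-k unions whose (k-1)-subsets are all large.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(s) in prev for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Scan the database and count each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(large)
        k += 1
    return answer


# Hypothetical database, minimum support count of 2.
database = ["ACD", "BCE", "ABCE", "BE"]
result = apriori(database, minsup=2)
```

On this input the loop stops after the third pass, because no 4-itemset candidate can be formed from the single large 3-itemset {B, C, E}.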

Apriori-gen(Lk-1)
Join step:
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;
Prune step:
    forall itemsets c ∈ Ck do
        forall (k-1)-subsets s of c do
            if (s ∉ Lk-1) then
                delete c from Ck;

Ex: Generation of Candidate Itemsets: L3 -> C4
Join step:
- L3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}
- 4-itemsets after the join = {{1, 2, 3, 4}, {1, 3, 4, 5}}
Prune step:
- {1, 2, 3, 4}: 3-subsets = {{1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}}, all in L3
- {1, 3, 4, 5}: 3-subsets = {{1,3,4}, {1,3,5}, {1,4,5}, {3,4,5}}; {1,4,5}, {3,4,5} ∉ L3, so {1, 3, 4, 5} is pruned
C4 = {{1, 2, 3, 4}}
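The join and prune steps can be reproduced on this example with a short Python sketch. Representing itemsets as sorted tuples and returning both the joined and the pruned lists are choices made for this illustration; `apriori_gen` is a hypothetical helper name.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Return (joined, pruned) candidate k-itemsets from large (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in large_prev)
    if not prev:
        return [], []
    prev_set = set(prev)
    size = len(prev[0])  # this is k-1

    # Join step: pair itemsets sharing the first k-2 items, last items ordered.
    joined = [
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]

    # Prune step: drop candidates that have a (k-1)-subset outside L(k-1).
    pruned = [
        c for c in joined
        if all(s in prev_set for s in combinations(c, size))
    ]
    return joined, pruned


L3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
joined, C4 = apriori_gen(L3)
```

The join yields (1, 2, 3, 4) and (1, 3, 4, 5); the prune step removes the second because (1, 4, 5) and (3, 4, 5) are not in L3.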

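Data Structure for Ck: Hash Tree
Ex: C2 = {{A,B}, {A,C}, {A,T}, {B,C}, {B,D}, {C,D}}
[Figure: hash tree for C2. Level 1 branches on the first item (A, B, C); level 2 holds the candidate itemsets {A,B}, {A,C}, {A,T}, {B,C}, {B,D}, {C,D}.]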

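C2       count   L2
{A C}    2       {A C}
{B C}    2       {B C}
{B E}    3       {B E}
{C E}    2       {C E}

s = 2
Counting support in a hash tree
Trimming the database with L2 to obtain D3 (DHP), by the large 2-itemsets contained in each transaction:
- {A C}: Discard
- {B C} {B E} {C E}: Keep {B C E}
- {A C} {B C} {B E} {C E}: Keep {B C E}
- {B E}: Discard
D3 = {{B C E}, {B C E}}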
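The trimming rule behind this table can be sketched as follows, assuming the usual DHP criterion: an item stays in a transaction for pass k+1 only if it occurs in at least k of the large k-itemsets contained in that transaction, and a transaction is kept only if more than k items remain. The transaction contents (ACD, BCE, ABCE, BE) are an assumption chosen to match the rows above; the helper name is hypothetical.

```python
def trim_transaction(transaction, large_k, k):
    """Return the trimmed transaction for pass k+1, or None to discard it."""
    items = frozenset(transaction)
    contained = [c for c in large_k if c <= items]

    # Count, per item, how many contained k-itemsets it appears in.
    hits = {}
    for c in contained:
        for item in c:
            hits[item] = hits.get(item, 0) + 1

    kept = frozenset(i for i, n in hits.items() if n >= k)
    # A (k+1)-itemset needs at least k+1 surviving items.
    return kept if len(kept) > k else None


L2 = [frozenset("AC"), frozenset("BC"), frozenset("BE"), frozenset("CE")]
database = ["ACD", "BCE", "ABCE", "BE"]  # assumed transaction contents
D3 = [trim_transaction(t, L2, k=2) for t in database]
```

The first and last transactions are discarded, and the two in the middle are both reduced to {B, C, E}.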

Generalized Association Rules
Finding associations between items at any level of the taxonomy.
Rules:
- People who buy clothes tend to buy shoes.
- People who buy outerwear tend to buy shoes.
- People who buy jackets tend to buy shoes.

Problem Statement
I = {i1, i2, ..., im}: set of literals
D: set of transactions
T: a set of taxonomies, given as a DAG (directed acyclic graph)
X ⇒ Y [confidence, support], where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X. (X, Y: any level of taxonomy T)
Steps:
1. Find all sets of items whose support is greater than minimum support.
2. Generate association rules whose confidence is greater than minimum confidence.
3. Prune all uninteresting rules from this set with respect to the R-interesting measure.
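One common way to count support at every level of the taxonomy is to extend each transaction with the ancestors of its items before counting. The sketch below assumes that approach; the `parents` mapping is a hypothetical taxonomy built from the clothes/outerwear/jacket example, and the helper names are illustrative.

```python
def ancestors(item, parents):
    """All ancestors of `item` in a taxonomy given as {child: [parents]}."""
    seen = set()
    stack = list(parents.get(item, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def extend_transaction(transaction, parents):
    """Add every ancestor of every item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item, parents)
    return frozenset(extended)


# Hypothetical taxonomy (a DAG, so an item may have several parents).
parents = {"jacket": ["outerwear"], "outerwear": ["clothes"]}

extended = extend_transaction({"jacket", "shoes"}, parents)
```

After extension, the transaction supports rules about jackets, outerwear, and clothes alike, so an ordinary large-itemset pass can find rules at any level.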

Interestingness of Generalized Rules
Using a new interest measure, R-interesting: prunes 40% to 60% of the rules as redundant.
Example:
* Taxonomy: Skim milk is-a Milk.
* Milk ⇒ Cereal (8% support, 70% confidence); skim milk accounts for 1/4 of milk sales.
* Skim milk ⇒ Cereal:
  Expectation: 2% support, 70% confidence
  Actual support & confidence: 2% support, 70% confidence ==> redundant & uninteresting!
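The expectation in this example is simple arithmetic: the ancestor rule's support scaled by the descendant's share of the ancestor's sales. The sketch below assumes a rule counts as R-interesting when its actual support or confidence is at least R times the expected value; the threshold R = 1.1 and the function names are assumptions for illustration.

```python
def expected_support(ancestor_rule_support, descendant_share):
    """Expected support of the specialized rule, given the ancestor rule."""
    return ancestor_rule_support * descendant_share


def is_r_interesting(actual_sup, actual_conf, exp_sup, exp_conf, r):
    """True if support or confidence is at least r times its expectation."""
    return actual_sup >= r * exp_sup or actual_conf >= r * exp_conf


# Milk => Cereal has 8% support; skim milk is 1/4 of milk sales.
exp_sup = expected_support(0.08, 0.25)
# Only the antecedent was specialized, so expected confidence stays at 70%.
exp_conf = 0.70

interesting = is_r_interesting(0.02, 0.70, exp_sup, exp_conf, r=1.1)
```

Because the actual values equal the expected ones, the rule Skim milk ⇒ Cereal adds nothing beyond Milk ⇒ Cereal and is pruned.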

Cyclic Association Rules
Beer and chips are sold together primarily between 6PM and 9PM.
Association rules could also display regular hourly, daily, weekly, etc., variation that has the appearance of cycles.
An association rule X ⇒ Y holds in time unit ti if the support of X ⇒ Y in D[i] exceeds MinSup and the confidence of X ⇒ Y in D[i] exceeds MinConf.
It has a cycle c = (l, o), with a length l and an offset o.
Ex: coffee ⇒ doughnuts has a cycle (24, 7) if the unit of time is an hour and coffee ⇒ doughnuts holds during the interval 7AM-8AM every day (i.e., every 24 hours).
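Checking a cycle can be sketched as follows, assuming the per-unit results are already available as a list of booleans (`holds[i]` is true when the rule meets MinSup and MinConf in D[i]). The function name and the three-day sample are hypothetical.

```python
def has_cycle(holds, length, offset):
    """True if the rule holds in every time unit i with i % length == offset.

    Assumes 0 <= offset < length.
    """
    units = range(offset, len(holds), length)
    return len(units) > 0 and all(holds[i] for i in units)


# Three days of hourly units; the rule holds only from 7AM to 8AM each day.
holds = [hour % 24 == 7 for hour in range(72)]

daily_at_7 = has_cycle(holds, 24, 7)
daily_at_8 = has_cycle(holds, 24, 8)
twice_daily = has_cycle(holds, 12, 7)
```

The rule has the cycle (24, 7) but not (12, 7), since it fails at 7PM.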

Negative Association Rules
Ex: 60% of the customers who buy potato chips do not buy bottled water.
Negative rule: X ⇏ Y such that
(a) support(X) and support(Y) are greater than the minimum support MinSup; and
(b) the rule interest measure is greater than MinRI.
The interest measure RI of a negative association rule X ⇏ Y:
    RI = (E[support(X ∪ Y)] - support(X ∪ Y)) / support(X)
E[support(X)] is the expected support of an itemset X.
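A small numeric sketch of the interest measure, assuming the commonly cited form RI = (E[support(X ∪ Y)] - support(X ∪ Y)) / support(X). How the expected support is estimated is left outside the sketch; the numbers and the MinRI threshold below are hypothetical.

```python
def rule_interest(expected_sup_xy, actual_sup_xy, sup_x):
    """RI of a negative rule: shortfall of X ∪ Y relative to support(X)."""
    return (expected_sup_xy - actual_sup_xy) / sup_x


# Hypothetical figures: chips and water were expected together in 20% of
# transactions, actually co-occur in 5%, and chips alone appear in 25%.
ri = rule_interest(expected_sup_xy=0.20, actual_sup_xy=0.05, sup_x=0.25)

MIN_RI = 0.5  # assumed threshold
is_negative_rule = ri > MIN_RI
```

A large RI means the pair occurs far less often than expected, which is what makes the negative rule worth reporting.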

Incremental Updating, Parallel and Distributed Algorithms
- Fast updating algorithms, FUP (Cheung et al., IEEE ICDE, 96): partitioned derivation and incremental updating.
- PDM (Park et al., ACM CIKM, 95): uses a hashing technique (DHP-like) to identify candidate k-itemsets from the local databases.
- Count Distribution (Agrawal & Shafer, IEEE TKDE, Vol 8, No 6, 96): an extension of the Apriori algorithm; may require a lot of messages in count exchange.
- FDM (Cheung et al., IEEE TKDE, Vol 8, No 6, 96).
  Observation: if an itemset X is globally large, there exists a partition Di such that X and all its subsets are locally large at Di.
  Candidate sets are those which are also local candidates in some component database, plus some message-passing optimizations.
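The FDM observation can be checked on a toy example. The sketch below assumes minimum support is a fraction applied both globally and to each partition; the two partitions and the helper names are hypothetical, and the code only verifies the property rather than implementing FDM itself.

```python
from itertools import combinations


def is_large(itemset, transactions, min_frac):
    """True if at least min_frac of the transactions contain the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits >= min_frac * len(transactions)


def locally_large_somewhere(itemset, partitions, min_frac):
    """True if the itemset is large in at least one partition."""
    return any(is_large(itemset, part, min_frac) for part in partitions)


# Hypothetical database split across two sites.
partitions = [
    [frozenset("ACD"), frozenset("BCE")],
    [frozenset("ABCE"), frozenset("BE")],
]
whole = [t for part in partitions for t in part]
items = sorted(set().union(*whole))

# Every globally large itemset must be locally large at some site.
violations = [
    frozenset(c)
    for size in range(1, len(items) + 1)
    for c in combinations(items, size)
    if is_large(frozenset(c), whole, 0.5)
    and not locally_large_somewhere(frozenset(c), partitions, 0.5)
]
```

The list of violations is empty, which is why each site only needs to propose its own locally large itemsets as global candidates.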

When is Market Basket Analysis useful?
The following three rules are examples of real rules generated from real data:
- On Thursdays, grocery store consumers often purchase diapers and beer together.
  Useful rule: high quality, actionable information.
- Customers who purchase maintenance agreements are very likely to purchase large appliances.
  Trivial rule.
- When a new hardware store opens, one of the most commonly sold items is toilet rings.
  Inexplicable rule.

Interestingness Measurement for Association Rules (I)
Two popular measurements: support and confidence
- The longer the itemset, the lower its support.
Use taxonomy information for pruning redundant rules
- A rule is redundant if its support and confidence are close to their expected values based on an ancestor of the rule.
- Example: milk ⇒ cereal vs. skim milk ⇒ cereal.
- More effective than pruning based on statistical significance.
Interestingness of patterns
- If a pattern contradicts the set of hard beliefs of the user, then this pattern is always interesting to the user.
- The more a pattern affects the belief system, the more interesting it is.

Interestingness Measurement (II)
Improvement (Interest)
- How much better a rule is at predicting the result than just assuming the result in the first place.
- Measures co-occurrence rather than implication.
- Symmetric.
Conviction
- How far condition and result deviate from independence.

Range of measurement
Improvement
- Improvement = 1: condition and result items are completely independent.
- Improvement < 1: worse rule.
- Improvement > 1: better rule.
Conviction
- Conviction = 1: condition and result items are completely unrelated.
- Conviction > 1: better rule.
- Conviction = ∞: completely related rule.
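The ranges above can be checked numerically. The sketch assumes the standard definitions, which the transcript does not show: improvement = P(A ∧ B) / (P(A) P(B)) and conviction = P(A) P(¬B) / P(A ∧ ¬B), for a rule with condition A and result B. The probabilities used are hypothetical.

```python
import math


def improvement(p_a, p_b, p_ab):
    """P(A and B) / (P(A) * P(B)); symmetric in A and B."""
    return p_ab / (p_a * p_b)


def conviction(p_a, p_b, p_ab):
    """P(A) * P(not B) / P(A and not B); infinite if A never occurs without B."""
    p_a_not_b = p_a - p_ab
    if p_a_not_b == 0:
        return math.inf
    return p_a * (1 - p_b) / p_a_not_b


# Independent items: both measures equal 1.
indep = (improvement(0.5, 0.5, 0.25), conviction(0.5, 0.5, 0.25))

# A rule that always holds: conviction is infinite.
always = (improvement(0.5, 0.5, 0.5), conviction(0.5, 0.5, 0.5))
```

Unlike improvement, conviction changes when condition and result are swapped, so it reflects the direction of the rule.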

Sequential Patterns
Example of such a pattern:
- Customers typically rent Star Wars, then Empire Strikes Back, and then Return of the Jedi.
- Note that these rentals need not be consecutive.

Mining Sequential Patterns
- An itemset is a non-empty set of items.
- A sequence is an ordered list of itemsets.
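With these definitions, checking whether a customer's history supports a pattern can be sketched as below. It assumes the usual containment rule, which the slide does not spell out: a pattern is contained in a sequence if its itemsets can be matched, in order, to supersets among the sequence's itemsets, not necessarily adjacent ones. The rental data and the function name are hypothetical.

```python
def contains(sequence, pattern):
    """True if `pattern` (list of sets) is a subsequence of `sequence`."""
    matched = 0
    for itemset in sequence:
        # Greedily match the next pattern itemset against this transaction.
        if matched < len(pattern) and pattern[matched] <= itemset:
            matched += 1
    return matched == len(pattern)


rentals = [
    {"Star Wars"},
    {"Some Other Film"},
    {"Empire Strikes Back", "Popcorn"},
    {"Return of the Jedi"},
]
trilogy = [{"Star Wars"}, {"Empire Strikes Back"}, {"Return of the Jedi"}]

in_order = contains(rentals, trilogy)
reversed_order = contains(rentals, list(reversed(trilogy)))
```

The intervening rental does not break the match, which mirrors the note that rentals need not be consecutive.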

The Algorithm for Sequential Patterns, by Agrawal and Srikant