Association Rule Mining and Its Applications

Department of Computer Science, Sungshin Women's University

Jong Soo Park (박종수)
[email protected]


Contents

Data Mining in the KDD Process

Definition of Association Rules

Mining Association Rules in Transaction Databases

Algorithm Apriori & DHP

Generalized Association Rules

Cyclic Association Rules and Negative Associations.

Interestingness Measurement

Sequential Patterns and Path Traversal Patterns

Research Directions and Reference Homepages


[Figure: Overview of the steps constituting the KDD process. Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge]


Types of Data-Mining Problems

Prediction
– Classification
– Regression
– Time Series

Knowledge Discovery
– Deviation Detection
– Database Segmentation
– Clustering
– Association Rules
– Summarization
– Visualization
– Text mining


Association Rule

Ex: the statement that 90% of transactions that purchase bread and butter also purchase milk.

[Bread], [Butter] ⇒ [Milk] (12.5%, 90%)

Here [Bread], [Butter] is the antecedent and [Milk] is the consequent.

90%: confidence factor of the rule (not 100%)

12.5%: support for the rule, the fraction of transactions in the database that contain bread, butter, and milk

Find all rules that have “Diet Coke” as consequent.

Find all rules that have “bagels” in the antecedent.

Find the “best” k rules that have “bagels” in the consequent.


Definition of Association Rules

I: a set of literals called items.
T: a transaction, a set of items such that T ⊆ I.

An association rule is an implication of the form

X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅.

X ⇒ Y [support, confidence]

$$\mathrm{support} = \frac{\#\text{ of transactions containing all the items in } X \cup Y}{\text{total } \#\text{ of transactions in the database}}$$

$$\mathrm{confidence} = \frac{\#\text{ of transactions that contain both } X \text{ and } Y}{\#\text{ of transactions containing } X}$$
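The two ratios above can be checked by hand on a small database. The sketch below is only an illustration: it reuses the four transactions of the example database D from the Apriori walkthrough in these slides, and the function names `support` and `confidence` are chosen here, not taken from any library.

```python
# Example database D from the Apriori walkthrough (TIDs 100, 200, 300, 400).
D = [
    frozenset("ACD"),
    frozenset("BCE"),
    frozenset("ABCE"),
    frozenset("BE"),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X); assumes X occurs at least once."""
    both = frozenset(x) | frozenset(y)
    return support(both, transactions) / support(x, transactions)


print(support({"B", "E"}, D))       # B and E occur together in 3 of 4 transactions
print(confidence({"B"}, {"E"}, D))  # every transaction with B also has E
print(confidence({"C"}, {"E"}, D))  # 2 of the 3 transactions with C also have E
```

Dividing two support fractions gives the same value as dividing the raw transaction counts, because the database size cancels.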


Mining Association Rules in Transaction Databases

Applications: pattern association, market analysis, etc.

Given
– data of transactions
– each transaction has a list of items purchased

Find all association rules: the presence of one set of items implies the presence of another set of items.
– e.g., people who purchased hammers also purchased nails.

Measurement of rule strength
– Confidence: X & Y ⇒ Z has 90% confidence if 90% of customers who bought X and Y also bought Z.
– Support: useful rules (for business decisions) should have some minimum transaction support.


Two Steps for Association Rules

Determining “large itemsets”
– Find all combinations of items that have transaction support above minimum support.
– Research has focused on this phase.

Generating rules

for each large itemset L do
    for each non-empty proper subset c of L do
        if (support(L) / support(L - c) ≥ minimum confidence) then
            output the rule (L - c) ⇒ c,
            with confidence = support(L)/support(L - c)
            and support = support(L);
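The rule-generation loop can be sketched in a few lines. This is a minimal illustration, assuming the large itemsets and their support counts are already available in a dictionary; the name `generate_rules` and the data layout are choices made here. The counts are those of the example database D in these slides with minimum support 2.

```python
from itertools import combinations


def generate_rules(supports, min_conf):
    """Return (antecedent, consequent, support, confidence) for every rule
    (L - c) => c whose confidence reaches min_conf.

    `supports` maps each large itemset (a frozenset) to its support count.
    Every subset of a large itemset is itself large, so the lookup of
    supports[L - c] always succeeds.
    """
    rules = []
    for itemset, sup in supports.items():
        for size in range(1, len(itemset)):      # non-empty proper subsets c
            for c in combinations(sorted(itemset), size):
                consequent = frozenset(c)
                antecedent = itemset - consequent
                conf = sup / supports[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, consequent, sup, conf))
    return rules


# Support counts of the large itemsets of database D (minimum support = 2).
supports = {
    frozenset("A"): 2, frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("AC"): 2, frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}

rules = generate_rules(supports, min_conf=1.0)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), sup, conf)
```

With a minimum confidence of 100%, only rules whose antecedent never occurs without the consequent survive, for example B ⇒ E and E ⇒ B.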


[Figure: Candidate Itemsets → (scan database, minimum support) → Large Itemsets → (minimum confidence) → Association Rules]

How to generate candidate itemsets
– Apriori method: join step + prune step

Focus on data structures to speed up scanning the database
– Hash tree, trie, hash table, etc.


Database D (minimum support = 2)

TID   Items
100   A C D
200   B C E
300   A B C E
400   B E

Scan D → C1

Itemset   Sup.
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1

Itemset   Sup.
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1)

Itemset
{A B}
{A C}
{A E}
{B C}
{B E}
{C E}

Scan D → C2 with support counts

Itemset   Sup.
{A B}     1
{A C}     2
{A E}     1
{B C}     2
{B E}     3
{C E}     2

L2

Itemset   Sup.
{A C}     2
{B C}     2
{B E}     3
{C E}     2

C3 (generated from L2)

Itemset
{B C E}

Scan D → C3 with support counts

Itemset   Sup.
{B C E}   2

L3

Itemset   Sup.
{B C E}   2


Algorithms for Mining Association Rules

AIS (Agrawal et al., ACM SIGMOD, May ‘93)

SETM (Swami et al., IBM Tech. Rep., Oct ‘93)

Apriori (Agrawal et al., VLDB, Sept ‘94)

OCD (Mannila et al., AAAI workshop on KDD, July ‘94)

DHP (Park et al., ACM SIGMOD, May ‘95)

PARTITION (Savasere et al., VLDB, Sept ‘95)

Mining Generalized Association Rules (Srikant et al., VLDB, Sept ‘95)

Sampling Approach (Toivonen, VLDB, Sept ‘96)

DIC (dynamic itemset counting, Brin et al., ACM SIGMOD, May ‘97)

Cyclic Association Rules (Özden et al., IEEE ICDE, Feb ‘98)

Negative Associations (Savasere et al., IEEE ICDE, Feb ‘98)


Algorithm Apriori

Lk: set of large k-itemsets

Ck: set of candidate k-itemsets

Steps: C1 → L1 → C2 → L2 → ... → Ck → Lk

Input: transaction file; Output: large itemsets

L1 = {large 1-itemsets};
for ( k = 2; Lk-1 ≠ Ø; k++ ) do begin
    Ck = apriori-gen(Lk-1);
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup}
end
Answer = ∪k Lk;
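A compact sketch of the level-wise loop, run on the example database D with minimum support 2. It is not the original implementation: candidates are produced by a simplified union-based join followed by the prune test, which yields the same candidate set as the prefix-based join of apriori-gen, and candidates are counted by a plain subset test instead of a hash tree.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support count} for all large itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup}
    answer = dict(large)

    k = 2
    while large:
        prev = set(large)
        # Simplified join: unions of two large (k-1)-itemsets with exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be large.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Scan the database and count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(large)
        k += 1
    return answer


D = ["ACD", "BCE", "ABCE", "BE"]
result = apriori(D, minsup=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The run reproduces L1, L2 and L3 of the worked example, ending with {B C E} at support 2.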


Apriori-gen(Lk-1)

Join step

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
      p.itemk-1 < q.itemk-1;

Prune step

forall itemsets c ∈ Ck do
    forall (k-1)-subsets s of c do
        if ( s ∉ Lk-1 ) then
            delete c from Ck;


Ex: Generation of Candidate Itemsets

Example: generating C4 from L3.

Join step

Given L3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}},

candidate 4-itemsets = {{1, 2, 3, 4}, {1, 3, 4, 5}}

Prune step:

- 3-subsets of {1, 2, 3, 4} = {{1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}}

- 3-subsets of {1, 3, 4, 5} = {{1,3,4}, {1,3,5}, {1,4,5}, {3,4,5}}

Since {1,4,5} ∉ L3 and {3,4,5} ∉ L3, {1, 3, 4, 5} is pruned.

C4 = {{1, 2, 3, 4}}
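The join and prune steps of this example can be reproduced with a short sketch. Itemsets are represented as sorted tuples so that the "same first k-2 items, smaller last item" condition of the join is a direct tuple comparison; the function name `apriori_gen` and its return of both intermediate lists are choices made for this illustration.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Return (joined, pruned) candidate k-itemsets from the large (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in large_prev)
    prev_set = set(prev)
    if not prev:
        return [], []
    k = len(prev[0]) + 1

    # Join step: p and q agree on the first k-2 items and p's last item < q's last item.
    joined = [
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]

    # Prune step: drop a candidate if any of its (k-1)-subsets is not large.
    pruned = [
        c for c in joined
        if all(s in prev_set for s in combinations(c, k - 1))
    ]
    return joined, pruned


L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
joined, C4 = apriori_gen(L3)
print(joined)  # candidates after the join step
print(C4)      # candidates after the prune step
```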


Data Structure for Ck

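A hash tree is built for the candidate set at each level. Example: the hash tree for C2 = {{A,B}, {A,C}, {A,T}, {B,C}, {B,D}, {C,D}}

[Figure: hash tree for C2 with two levels of interior nodes (Level 1 and Level 2) that branch on items, and leaf nodes holding the candidates {A,B}, {A,C}, {A,T}, {B,C}, {B,D}, {C,D}]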


Example of generating the hash table H2 and the candidate 2-itemsets C2 (DHP)

Database D

TID   Items
100   A C D
200   B C E
300   A B C E
400   B E

s = 2

C2      count   L2
{A C}   2       {A C}
{B C}   2       {B C}
{B E}   3       {B E}
{C E}   2       {C E}

Counting support in a hash tree

Example of L2 and D3 (DHP)

TID   2-itemsets of L2 in the transaction   Result
100   {A C}                                 Discard
200   {B C} {B E} {C E}                     Keep {B C E}
300   {A C} {B C} {B E} {C E}               Keep {B C E}
400   {B E}                                 Discard

D3 = { <200, B C E>, <300, B C E> }
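The way D3 is derived from L2 can be sketched as follows. This is a simplified reading of the transaction-trimming idea, not the published DHP procedure: an item is kept only if it occurs in at least k of the k-itemsets found in the transaction, and a transaction is kept only if at least k+1 items remain. The function name `trim_database` is hypothetical.

```python
from itertools import combinations


def trim_database(database, itemsets_k, k):
    """Build the reduced database for pass k+1.

    database:   dict mapping TID -> set of items
    itemsets_k: set of frozensets of size k (here, L2 from the example)
    Simplified rule (an assumption of this sketch): keep an item only if it
    appears in at least k of the k-itemsets contained in the transaction,
    and keep the transaction only if at least k+1 items survive.
    """
    trimmed = {}
    for tid, items in database.items():
        occurrences = {}
        for combo in combinations(sorted(items), k):
            if frozenset(combo) in itemsets_k:
                for item in combo:
                    occurrences[item] = occurrences.get(item, 0) + 1
        kept = {item for item, n in occurrences.items() if n >= k}
        if len(kept) >= k + 1:
            trimmed[tid] = kept
    return trimmed


D = {100: set("ACD"), 200: set("BCE"), 300: set("ABCE"), 400: set("BE")}
L2 = {frozenset("AC"), frozenset("BC"), frozenset("BE"), frozenset("CE")}

D3 = trim_database(D, L2, k=2)
print(D3)
```

Transactions 100 and 400 are discarded because no item occurs in two of their 2-itemsets, and item A is dropped from transaction 300, which matches the D3 shown in the example.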


Generalized Association Rules

Finding associations between items at any level of the taxonomy.

Rules:
– People who buy clothes tend to buy shoes. ( )
– People who buy outerwear tend to buy shoes. ( o )
– People who buy jackets tend to buy shoes. ( )

[Figure: taxonomy. Clothes: Outerwear (Jackets, Ski Pants), Shirts. Footwear: Shoes, Hiking Boots.]


Problem Statement

Given I = {i1, i2, …, im}: a set of literals; D: a set of transactions;

T: a set of taxonomies, a DAG (Directed Acyclic Graph),

X ⇒ Y [confidence, support],

where X ⊂ I, Y ⊂ I, X ∩ Y = ∅,

and no item in Y is an ancestor of any item in X.

(X, Y: any level of taxonomy T)

Steps

1. Find all sets of items whose support is greater than minimum support.

2. Generate association rules whose confidence is greater than minimum confidence.

3. Prune all uninteresting rules from this set with respect to the R-interesting measure.


Interestingness of Generalized Rules

Using a new interest measure, R-interesting:

Prunes out 40% to 60% of the rules as “redundant” rules.

Example:
* Assumptions:
  – Taxonomy: Skim milk is-a Milk,
  – Milk ⇒ Cereal (8% support, 70% confidence),
  – sales of skim milk = 1/4 of the sales of milk.
* For Skim milk ⇒ Cereal:
  – Expectation: 2% support, 70% confidence
  – Actual support & confidence: about 2% support, 70% confidence
  ==> redundant & uninteresting!
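The expectation in this example is simple arithmetic: 8% × 1/4 = 2% support, with the confidence carried over from the ancestor rule. The sketch below works it through. The slides do not define R-interesting precisely, so the test used here (actual support or confidence at least R times the expected value) and the threshold R = 1.1 are assumptions of this illustration.

```python
import math


def expected_from_ancestor(ancestor_support, ancestor_confidence, share):
    """Expected (support, confidence) of a specialized rule whose antecedent
    item accounts for `share` of the ancestor item's sales."""
    return ancestor_support * share, ancestor_confidence


def is_r_interesting(actual_support, actual_confidence,
                     expected_support, expected_confidence, r):
    """Assumed reading of R-interesting: the rule beats its expectation
    by a factor of at least r on support or on confidence."""
    return (actual_support >= r * expected_support
            or actual_confidence >= r * expected_confidence)


# Milk => Cereal has 8% support and 70% confidence; skim milk is 1/4 of milk sales.
exp_sup, exp_conf = expected_from_ancestor(0.08, 0.70, 0.25)
print(exp_sup, exp_conf)

# Skim milk => Cereal actually has about 2% support and 70% confidence.
redundant = not is_r_interesting(0.02, 0.70, exp_sup, exp_conf, r=1.1)
print(redundant)
```

Under these assumptions the skim milk rule adds nothing beyond what the milk rule already predicts, so it is pruned as redundant.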


Cyclic Association Rules

Beer and chips are sold together primarily between 6PM and 9PM.

Association rules could also display regular hourly, daily, weekly, etc., variation that has the appearance of cycles.

An association rule X ⇒ Y holds in time unit ti,
– if the support of X ⇒ Y in D[i] exceeds MinSup and
– the confidence of X ⇒ Y in D[i] exceeds MinConf.
– It has a cycle c = (l, o), with a length l and an offset o.

“coffee ⇒ doughnuts” has a cycle (24, 7),
– if the unit of time is an hour and “coffee ⇒ doughnuts” holds during the interval 7AM-8AM every day (i.e., every 24 hours).
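A cycle (l, o) can be checked directly once it is known in which time units the rule holds. The sketch below assumes that information is given as a list of booleans, one per time unit starting at unit 0; the function name `has_cycle` and the three-day hourly data are made up for illustration.

```python
def has_cycle(holds, length, offset):
    """True if the rule holds in every time unit i with i = offset, offset + length, ...

    holds[i] says whether the rule satisfied MinSup and MinConf in D[i].
    """
    units = range(offset, len(holds), length)
    return len(units) > 0 and all(holds[i] for i in units)


# Hypothetical data: three days of hourly units; the rule holds only from 7AM to 8AM.
holds = [hour % 24 == 7 for hour in range(72)]

print(has_cycle(holds, 24, 7))  # holds at hours 7, 31 and 55
print(has_cycle(holds, 24, 8))  # never holds at 8AM
print(has_cycle(holds, 12, 7))  # fails at hour 19
```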


Negative Association Rules

A rule : “60% of the customers who buy potato chips do not buy bottled water.”

Negative rule: X ⇏ Y such that
– (a) support(X) and support(Y) are greater than the minimum support MinSup; and
– (b) the rule interest measure is greater than MinRI.

The interest measure RI of a negative association rule X ⇏ Y:

$$RI = \frac{E[\mathrm{support}(X \cup Y)] - \mathrm{support}(X \cup Y)}{\mathrm{support}(X)}$$

– E[support(X)] is the expected support of an itemset X.
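A small worked instance of the RI formula and the two conditions. All numbers below are hypothetical, and how the expected support is obtained is outside the scope of this sketch; it is simply passed in as a parameter.

```python
import math


def rule_interest(expected_support_xy, actual_support_xy, support_x):
    """RI = (E[support(X ∪ Y)] - support(X ∪ Y)) / support(X)."""
    return (expected_support_xy - actual_support_xy) / support_x


def is_negative_rule(support_x, support_y, expected_support_xy,
                     actual_support_xy, min_sup, min_ri):
    """Conditions (a) and (b) for a negative rule X =/=> Y."""
    if support_x <= min_sup or support_y <= min_sup:
        return False
    return rule_interest(expected_support_xy, actual_support_xy, support_x) > min_ri


# Hypothetical numbers: X is bought in 20% of transactions, X and Y were
# expected together in 15% of transactions but occur together in only 3%.
ri = rule_interest(0.15, 0.03, 0.20)
print(ri)
print(is_negative_rule(0.20, 0.30, 0.15, 0.03, min_sup=0.05, min_ri=0.5))
```

The larger the gap between expected and actual joint support, relative to how often X occurs, the stronger the negative association.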


Incremental Updating, Parallel and Distributed Algorithms

An incremental evaluation technique for mining association rules in databases (김의경 et al., Korea Information Science Society, Fall ‘95 conference proceedings).

Fast updating algorithms, FUP (Cheung et al., IEEE ICDE, ‘96).
– Partitioned derivation and incremental updating.

PDM (Park et al., ACM CIKM, ‘95):
– Uses a hashing technique (DHP-like) to identify candidate k-itemsets from the local databases.

Count Distribution (Agrawal & Shafer, IEEE TKDE, Vol 8, No 6, ‘96):
– An extension of the Apriori algorithm.
– May require a lot of messages in count exchange.

FDM (Cheung et al., IEEE TKDE, Vol 8, No 6, ‘96).
– Observation: if an itemset X is globally large, there exists a partition Di such that X and all its subsets are locally large at Di.
– The candidate sets are those which are also local candidates in some component database, plus some message passing optimizations.


When is Market Basket Analysis useful?

The following three rules are examples of real rules generated from real data:

– On Thursdays, grocery store consumers often purchase diapers and beer together.
  Useful rule: high quality, actionable information.

– Customers who purchase maintenance agreements are very likely to purchase large appliances.
  Trivial rule

– When a new hardware store opens, one of the most commonly sold items is toilet rings.
  Inexplicable rule


Interestingness Measurement for Association Rules (I)

Two popular measurements: support and confidence
– The longer the itemset, the lower the support.

Use taxonomy information for pruning redundant rules
– A rule is “redundant” if its support and confidence are close to their expected values based on an ancestor of the rule.
– Example: “milk ⇒ cereal” vs. “skim milk ⇒ cereal”.
– More effective than pruning based on statistical significance.

Interestingness of Patterns
– If a pattern contradicts the set of hard beliefs of the user, then this pattern is always interesting to the user.
– The more a pattern “affects” the belief system, the more interesting it is.


Interestingness Measurement (II)

Improvement (Interest)

$$\mathrm{improvement} = \frac{P(\text{condition and result})}{P(\text{condition})\,P(\text{result})}$$

– How much better a rule is at predicting the result than just assuming the result in the first place.
– Measures co-occurrence rather than implication.
– Symmetric.

Conviction

$$\mathrm{conviction} = \frac{P(\text{condition})\,P(\lnot\text{result})}{P(\text{condition and } \lnot\text{result})}$$

– How far “condition and ¬result” deviates from independence


Range of measurement

Improvement
– Improvement = 1: the items of the condition and the result are completely independent!
– Improvement < 1: worse rule!
– Improvement > 1: better rule!

Conviction
– Conviction = 1: the items of the condition and the result are completely unrelated.
– Conviction > 1: better rule!!
– Conviction = ∞: completely related rule
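Both measures and their ranges can be tried out on the example database D of these slides. The sketch takes the three probabilities as plain numbers; the function names are chosen here, and conviction is returned as infinity when the condition never occurs without the result.

```python
import math


def improvement(p_cond, p_result, p_both):
    """P(condition and result) / (P(condition) * P(result))."""
    return p_both / (p_cond * p_result)


def conviction(p_cond, p_result, p_both):
    """P(condition) * P(not result) / P(condition and not result)."""
    p_cond_not_result = p_cond - p_both
    if p_cond_not_result == 0:
        return math.inf          # the rule is never violated
    return p_cond * (1 - p_result) / p_cond_not_result


# B => E in database D: P(B) = P(E) = P(B and E) = 3/4.
print(improvement(0.75, 0.75, 0.75), conviction(0.75, 0.75, 0.75))

# C => E in database D: P(C) = P(E) = 3/4, P(C and E) = 2/4.
print(improvement(0.75, 0.75, 0.5), conviction(0.75, 0.75, 0.5))

# Independent items: P(both) = P(condition) * P(result).
print(improvement(0.5, 0.5, 0.25), conviction(0.5, 0.5, 0.25))
```

B ⇒ E scores above 1 on improvement and has infinite conviction, C ⇒ E falls below 1 on both, and the independent case gives exactly 1 on both.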


Sequential Patterns

Examples of such a pattern:

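– Customers typically rent “Star Wars”, then “Empire Strikes Back”, and then “Return of the Jedi”.

– Note that these rentals need not be consecutive.

– Course registration: Tourism and Leisure (1st semester) → The Capital Region and Housing Problems (2nd semester) → The Stock Market (3rd semester)

– Stock price patterns: Samsung Electronics stock rises → LG Electronics stock rises → Bohae Brewery stock rises

– Purchase patterns: suit → dress shirt → black shoes?

– Patterns in the order of disease onset in medical diagnosis

– Treatment and medication patterns in patient care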


Mining Sequential Patterns

An itemset is a non-empty set of items.
A sequence is an ordered list of itemsets.

Customer Id   Customer Sequence
1             <(30) (90)>
2             <(10 20) (30) (40 60 70)>
3             <(30 50 70)>
4             <(30) (40 70) (90)>
5             <(90)>

Sequential Patterns with support > 25%
<(30) (90)>
<(30) (40 70)>
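The support of a sequential pattern is the fraction of customers whose sequence contains it. A minimal sketch of the containment test, using the five customer sequences above: each itemset of the pattern must be a subset of a distinct itemset of the customer sequence, in the same order. The function names are chosen for this illustration.

```python
def contains(sequence, pattern):
    """True if `pattern` is contained in `sequence`.

    Greedy left-to-right matching: each pattern itemset is matched to the
    earliest remaining itemset of the sequence that includes it.
    """
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def sequence_support(pattern, sequences):
    """Fraction of customers whose sequence contains the pattern."""
    return sum(contains(s, pattern) for s in sequences.values()) / len(sequences)


customers = {
    1: [(30,), (90,)],
    2: [(10, 20), (30,), (40, 60, 70)],
    3: [(30, 50, 70)],
    4: [(30,), (40, 70), (90,)],
    5: [(90,)],
}

print(sequence_support([(30,), (90,)], customers))     # customers 1 and 4
print(sequence_support([(30,), (40, 70)], customers))  # customers 2 and 4
```

Both patterns are supported by 2 of the 5 customers, which is above the 25% threshold of the example.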


The Algorithm for Sequential Patterns
by Agrawal and Srikant, 1995 ICDE

Sort Phase
– major key: customer-id, minor key: transaction-time

Litemset Phase
– litemset = an itemset with minimum support

Transformation Phase
– A customer sequence is represented by a list of sets of litemsets

Sequence Phase (an application of the Apriori algorithm)
– Candidate sequences ==> Large sequences

Maximal Phase
– a sequence s is maximal if s is not contained in any other sequence


Mining Path Traversal Patterns

Understanding user access patterns in a distributed information providing environment such as WWW, Hitel, etc.

– helps improve the system design

– leads to better marketing decisions

Capturing user access patterns

– mining path traversal patterns

– capturing user traveling behavior

– improving the quality of such services


[Figure: an example traversal over the nodes A, B, C, D, E, G, H, W, O, U, V, with the moves numbered 1 to 15]

Maximal forward references: {ABCD, ABEGH, ABEGW, AOU, AOV}

Traversal patterns

1. Find large reference sequences.

2. Find maximal reference sequences.
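How maximal forward references are cut out of a traversal log can be sketched as follows. The traversal path used here, A B C D C B E G H G W A O U O V, is an assumption: it is not spelled out in the extracted slide, but it has 15 moves and yields exactly the five references listed above. Nodes are single letters, and a move to a node already on the current path is treated as a backward reference.

```python
def maximal_forward_references(path):
    """Split a traversal path into its maximal forward references."""
    current = []            # current forward path from the starting node
    result = []
    moving_forward = False
    for node in path:
        if node in current:
            # Backward reference: emit the path if we had been moving forward,
            # then cut the path back to the revisited node.
            if moving_forward:
                result.append("".join(current))
            current = current[: current.index(node) + 1]
            moving_forward = False
        else:
            current.append(node)
            moving_forward = True
    if moving_forward and current:
        result.append("".join(current))   # the traversal ended on a forward move
    return result


# Assumed traversal path, consistent with the references listed on the slide.
path = "ABCDCBEGHGWAOUOV"
print(maximal_forward_references(path))
```

The large reference sequences are then the consecutive subsequences that occur in enough of these maximal forward references, and the maximal reference sequences are the large ones not contained in any other.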


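Research Directions

Association rule mining
– Research on sampling approaches, parallel methods, distributed algorithms, etc.
– Research on data structures that manage candidate itemsets efficiently and are effective for scanning
– Measuring the interestingness or importance of rules
– Concrete methods for applying association rules in practice.

Other patterns
– Research on the definition and application of patterns
– Similarity search
– Research on path traversal patterns in the WWW, etc.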


Some Data Mining Systems and Homepages

• Quest (IBM Almaden: Agrawal, et al.):
– large DB-oriented association, classification, sequential patterns, similar sequences, etc.
– “http://www.almaden.ibm.com/cs/quest/”

• DBMiner (SFU: Han, et al.):
– Interactive, multi-level characterization, classification, association & prediction.
– “http://db.cs.sfu.ca/DBMiner/”

• KDD (GTE: Piatetsky-Shapiro, et al.):
– multi-strategy, strong rules, statistical approaches, etc.
– KD Mine: “http://info.gte.com/~kdd/index.html”

• Other Homepages for Data Mining
– Rakesh Agrawal: “http://www.almaden.ibm.com/cs/people/ragrawal/”
– Usama Fayyad: “http://www.research.microsoft.com/~fayyad/”
– Heikki Mannila: “http://www.cs.Helsinki.Fl/~mannila/”
– Jiawei Han: “http://fas.sfu.ca/cs/people/Faculty/Han/”
– Data Mining and Knowledge Discovery Journal (Editorial Board): “http://www.research.microsoft.com/research/datamine/”
