
Page 1: ppt

Association Rules over Interval Data

R. J. Miller and Y. Yang

May 14, 1999, DE Lab. 최영란

Page 2: ppt

Abstract

• Mining association rules over interval data
  – distance-based association rules
• Measures for mining nominal and ordinal data
  – support and confidence
  – don't capture the semantics of interval data
  – so a new definition of interest is needed

Page 3: ppt

• Distance-based association rules
  – for interval data, in a way that respects the quantitative properties
• Application of adaptive techniques
  – that find rules within given memory constraints

Page 4: ppt

Motive: example 1

• Problem
  – expensive cost
• Quantitative association rules (QAR)
  – (Attr = val) or (val1 ≤ Attr ≤ val2)
  – K-partial completeness
  – equi-depth partitioning
  – the relative distance between values, the density of an interval, and the distance between intervals are not considered (see the sketch below)
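A minimal sketch in Python (not from the paper) of equi-depth partitioning, to show why Goal 1 on the next slide is needed: every bucket holds the same number of values, but the distance between the values inside a bucket is ignored. The function name and the sample ages are illustrative only.

# Minimal sketch (not from the paper): equi-depth partitioning of a numeric
# attribute, as used by quantitative association rules (QAR). Each bucket
# holds the same number of values, regardless of how far apart they are.
def equi_depth_partition(values, num_buckets):
    """Split the sorted values into num_buckets buckets of (roughly) equal size."""
    ordered = sorted(values)
    size = len(ordered) // num_buckets
    buckets = [ordered[i * size:(i + 1) * size] for i in range(num_buckets - 1)]
    buckets.append(ordered[(num_buckets - 1) * size:])   # last bucket takes the remainder
    return buckets

ages = [20, 21, 22, 23, 24, 25, 60, 61, 80, 95]
for b in equi_depth_partition(ages, 2):
    print(b, "width =", b[-1] - b[0])
# [20, 21, 22, 23, 24] width = 4
# [25, 60, 61, 80, 95] width = 70   <- equal depth, but the spread inside the interval is ignored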

Page 5: ppt

Goal 1. In selecting intervals or groups of data to consider, we want a measure of interval quality that reflects the distance between data points.

Page 6: ppt

Motive: example 2

• Problem
  – interpretation of a rule

Page 7: ppt

• Classical association rule
  – Job = DBA ∧ Age = 30 ⇒ Salary = 40,000 (minsup: 50%, minconf: 60%) ----- (1)
  – based on exact set membership
  – cannot be used to express a rule such as "30-year-old DBAs earn about 40,000"

Goal 2. For interval data, a definition of an association rule C1 ⇒ C2 is required that models the following semantics: items in C1 will be close to satisfying C2.

Page 8: ppt

Motive: example 3

• Problem
  – measure of rule interest
  – Rule (1) should be assigned both higher support and higher confidence in R2 than in R1.

Goal 3. For interval data, the measures of rule interest, including the measures of rule frequency and strength of rule implication, should reflect the distance between data points.

Page 9: ppt

Adaptive Solution

• To reduce the amount of storage,
  – group values and only store counts of groups (sketched below)
    ex 1) one count for all cars rather than a separate count for Hondas, Fords, etc.
    ex 2) ages 20 - 30
  – in a height-balanced tree
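A minimal sketch (illustrative only, not the paper's data structure) of the idea above: keep one counter per group of values rather than one per distinct value. The grouping function and the records are hypothetical; in the paper the groups themselves live in a height-balanced tree, the CF tree of the later slides.

# Minimal sketch: one counter per group instead of one per distinct value.
from collections import Counter

def group_key(record):
    # collapse all car makes into one group, and ages into 10-year bands
    attr, value = record
    if attr == "car":
        return ("car", "*")                 # one count for all cars (ex 1)
    if attr == "age":
        lo = (value // 10) * 10
        return ("age", f"{lo}-{lo + 10}")   # e.g. ages 20-30 (ex 2)
    return record

records = [("car", "Honda"), ("car", "Ford"), ("age", 23), ("age", 27), ("age", 41)]
print(Counter(group_key(r) for r in records))
# Counter({('car', '*'): 2, ('age', '20-30'): 2, ('age', '40-50'): 1})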

Page 10: ppt
Page 11: ppt

Clusters

• Identify data groups (CX)
  – that are compact and isolated
• CX
  – a cluster defined on X, a specific set of attributes
  – d(CX[X]) ≤ d0X and |CX| ≥ s0
    • d0X : density threshold, s0 : frequency threshold
    • |CX| : number of tuples in the cluster
    • |X| : number of dimensions
  – diameter : average pairwise distance between tuples projected on X (sketched below)
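A minimal sketch of the cluster test above, assuming Euclidean distance between the projected tuples; d0 and s0 stand for the density and frequency thresholds from this slide, and the sample salaries are made up.

# Minimal sketch: a set of tuples (projected on the attributes X) qualifies as a
# cluster if its diameter is small enough and it contains enough tuples.
from itertools import combinations
from math import dist   # Euclidean distance, Python 3.8+

def diameter(points):
    """Average pairwise distance between the projected tuples."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def is_cluster(points, d0, s0):
    return len(points) >= s0 and diameter(points) <= d0

salaries = [(39_000,), (40_000,), (40_500,), (41_000,)]
print(diameter(salaries), is_cluster(salaries, d0=2_000, s0=3))   # ~1083.3 True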

Page 12: ppt


Rules

• CX1 ∧ … ∧ CXx ⇒ CY1 ∧ … ∧ CYy
  – Xi and Yj are disjoint
  – confidence = | (∩i CXi) ∩ (∩j CYj) | / | ∩i CXi |
  – support = | (∩i CXi) ∩ (∩j CYj) | (sketched below)
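A minimal sketch of the support and confidence formulas above, assuming each cluster is represented by the set of tuple ids it contains and reading the set operations as intersections (tuples that fall in every cluster on a side of the rule). The example clusters are made up.

# Minimal sketch: support counts the tuples satisfying both sides of the rule,
# confidence divides by the tuples satisfying the antecedent.
def rule_support_confidence(antecedent_clusters, consequent_clusters):
    """Both arguments are lists of sets of tuple ids, one set per cluster."""
    lhs = set.intersection(*antecedent_clusters)
    both = lhs & set.intersection(*consequent_clusters)
    support = len(both)
    confidence = support / len(lhs) if lhs else 0.0
    return support, confidence

# CX1 = tuples with Age near 30, CY1 = tuples with Salary near 40,000
cx1 = {1, 2, 3, 4, 5}
cy1 = {2, 3, 4, 6}
print(rule_support_confidence([cx1], [cy1]))   # (3, 0.6)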

Page 13: ppt


Algorithm

1. Standard clustering algorithm
   – to identify the intervals of interest
   – Birch
2. Standard association rule algorithm
   – to identify rules over those intervals

Page 14: ppt


Birch

• Balanced Iterative Reducing and Clustering using Hierarchies
  – clusters are incrementally identified and refined in a single pass using CFs
  – CF (Clustering Feature)
    • compact summary of a sub-cluster
    • for CX = {t1, …, tN}, CF(CX) = (N, Σi=1..N ti[X], Σi=1..N ti[X]²) (sketched below)
  – CF tree : height-balanced tree
    • internal nodes contain a list of (CF, pointer) entries
    • leaf nodes contain lists of CFs
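A minimal sketch of a Clustering Feature as defined on this slide: the triple (N, Σ ti[X], Σ ti[X]²), which can absorb new points and be merged by component-wise addition. The class layout is illustrative, not BIRCH's actual implementation.

# Minimal sketch of a CF: count, per-dimension linear sum, per-dimension square sum.
from dataclasses import dataclass, field

@dataclass
class CF:
    n: int = 0                               # number of tuples N
    ls: list = field(default_factory=list)   # linear sum  Σ ti[X]
    ss: list = field(default_factory=list)   # square sum  Σ ti[X]^2

    def add_point(self, x):
        if not self.ls:
            self.ls = [0.0] * len(x)
            self.ss = [0.0] * len(x)
        self.n += 1
        for j, v in enumerate(x):
            self.ls[j] += v
            self.ss[j] += v * v

    def merge(self, other):
        # assumes both CFs summarize points over the same attribute set X
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  [a + b for a, b in zip(self.ss, other.ss)])

cf = CF()
for t in [(20.0, 39_000.0), (22.0, 41_000.0)]:
    cf.add_point(t)
print(cf)   # CF(n=2, ls=[42.0, 80000.0], ss=[884.0, 3202000000.0])

Because N, the linear sum, and the square sum are all additive, quantities such as the centroid and the diameter test used in Phase I can be derived from the CFs alone, which is what makes the CF a compact summary of a sub-cluster.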

Page 15: ppt


Phase I : Identifying Clusters

1. Each data point is inserted into the CF tree
2. Follow the pointer of the closest CF at each level
3. At the leaf node (sketched below)
   – if diameter(target cluster) < threshold
     • the point is added
   – else
     • a new cluster is created
     • if the leaf node is full, split it
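A minimal sketch of the leaf-level decision in Phase I. The helper names are hypothetical, the descent through internal nodes and the node split are omitted, and raw points are stored for clarity (BIRCH itself keeps only the CFs); picking the target as the cluster whose diameter grows least stands in for "closest CF".

# Minimal sketch: a point joins the closest leaf cluster only if the resulting
# diameter stays under the threshold; otherwise it starts a new cluster.
def insert_point(leaf_clusters, point, diameter_threshold, diameter_fn):
    """leaf_clusters: list of lists of points reached by following the closest CFs."""
    best = min(leaf_clusters, key=lambda c: diameter_fn(c + [point]), default=None)
    if best is not None and diameter_fn(best + [point]) < diameter_threshold:
        best.append(point)               # the point is absorbed by the target cluster
    else:
        leaf_clusters.append([point])    # otherwise a new cluster is created
        # (if the leaf node were full, it would be split here)
    return leaf_clusters

# Example, reusing diameter() from the Clusters sketch:
# leaves = insert_point([], (40_000.0,), diameter_threshold=2_000, diameter_fn=diameter)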

Page 16: ppt


Phase II : Combining Clusters to Form Rules

1. Scan i
   – to find candidate i-itemsets
   – count the number of tuples containing each set of values of size i: (v1, v2, …, vi), where each vj ∈ CXj and all (i-1)-subsets are frequent
2. Prune i
   – discard itemsets whose count is < s0
   – to find the frequent i-itemsets (a level-wise sketch follows below)
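A minimal, Apriori-style sketch of Phase II, assuming clusters are keyed by name and represented as sets of tuple ids, as in the earlier sketches; counting a value set by intersecting the clusters' tuple-id sets is a simplification of the scan described above.

# Minimal sketch: level-wise generation of frequent itemsets over clusters.
from itertools import combinations

def frequent_itemsets(clusters, s0):
    """clusters: dict mapping a cluster name (e.g. 'Age:28-32') to a set of tuple ids."""
    frequent = {frozenset([c]) for c, tids in clusters.items() if len(tids) >= s0}
    all_frequent = set(frequent)
    i = 2
    while frequent:
        # candidate i-itemsets are unions of frequent (i-1)-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == i}
        # prune: every (i-1)-subset must be frequent and the joint count must reach s0
        frequent = {
            c for c in candidates
            if all(frozenset(s) in all_frequent for s in combinations(c, i - 1))
            and len(set.intersection(*(clusters[name] for name in c))) >= s0
        }
        all_frequent |= frequent
        i += 1
    return all_frequent

clusters = {"Age:28-32": {1, 2, 3, 4}, "Job:DBA": {2, 3, 4, 5}, "Salary:38-42k": {2, 3, 4}}
print(frequent_itemsets(clusters, s0=3))
# prints every frequent itemset, including the 3-itemset combining all three clusters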