
Page 1: ppt

Association Rules over Interval Data

R. J. Miller and Y. Yang

May 14, 1999, DE Lab. 최영란

Page 2: ppt

Abstract

• Mining association rules over interval data
  – distance-based association rules
• Measures for mining nominal and ordinal data
  – support and confidence
  – don't capture the semantics of interval data
  – so a new definition of interest is needed

Page 3: ppt

• Distance-based association rules
  – for interval data, in a way that respects the quantitative properties
• Application of adaptive techniques
  – that find rules within given memory constraints

Page 4: ppt

Motive: example 1

• Problem
  – expensive cost
• Quantitative association rules (QAR)
  – (Attr = val) or (val1 ≤ Attr ≤ val2)
  – K-partial completeness
  – equi-depth partitioning
  – the relative distance between values, the density of an interval, and the distance between intervals are not considered (see the sketch below)
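A minimal sketch in Python (not from the paper) of equi-depth partitioning, to show why Goal 1 on the next slide is needed: every bucket holds the same number of values, but the distance between the values inside a bucket is ignored. The function name and the sample ages are illustrative only.

# Minimal sketch (not from the paper): equi-depth partitioning of a numeric
# attribute, as used by quantitative association rules (QAR). Each bucket
# holds the same number of values, regardless of how far apart they are.
def equi_depth_partition(values, num_buckets):
    """Split the sorted values into num_buckets buckets of (roughly) equal size."""
    ordered = sorted(values)
    size = len(ordered) // num_buckets
    buckets = [ordered[i * size:(i + 1) * size] for i in range(num_buckets - 1)]
    buckets.append(ordered[(num_buckets - 1) * size:])   # last bucket takes the remainder
    return buckets

ages = [20, 21, 22, 23, 24, 25, 60, 61, 80, 95]
for b in equi_depth_partition(ages, 2):
    print(b, "width =", b[-1] - b[0])
# [20, 21, 22, 23, 24] width = 4
# [25, 60, 61, 80, 95] width = 70   <- equal depth, but the spread inside the interval is ignored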

Page 5: ppt

Goal 1. In selecting intervals or groups of data to consider, we want a measure of interval quality that reflects the distance between data points.

Page 6: ppt

Motive: example 2

• Problem
  – interpretation of a rule

Page 7: ppt

• Classical association rule
  – Job = DBA ∧ Age = 30 ⇒ Salary = 40,000 (minsup: 50%, minconf: 60%) ----- (1)
  – based on exact set membership
  – cannot be used to express a rule such as "30-year-old DBAs earn about 40,000"

Goal 2. For interval data, a definition of an association rule C1 ⇒ C2 is required that models the following semantics: items in C1 will be close to satisfying C2.

Page 8: ppt

Motive: example 3

• Problem
  – measure of rule interest
  – Rule (1) should be assigned both higher support and higher confidence in R2 than in R1.

Goal 3. For interval data, the measures of rule interest, including the measures of rule frequency and strength of rule implication, should reflect the distance between data points.

Page 9: ppt

Adaptive Solution

• To reduce the amount of storage,
  – group values and only store counts of groups (sketched below)
    ex 1) one count for all cars rather than a separate count for Hondas, Fords, etc.
    ex 2) ages 20 - 30
  – in a height-balanced tree
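A minimal sketch (illustrative only, not the paper's data structure) of the idea above: keep one counter per group of values rather than one per distinct value. The grouping function and the records are hypothetical; in the paper the groups themselves live in a height-balanced tree, the CF tree of the later slides.

# Minimal sketch: one counter per group instead of one per distinct value.
from collections import Counter

def group_key(record):
    # collapse all car makes into one group, and ages into 10-year bands
    attr, value = record
    if attr == "car":
        return ("car", "*")                 # one count for all cars (ex 1)
    if attr == "age":
        lo = (value // 10) * 10
        return ("age", f"{lo}-{lo + 10}")   # e.g. ages 20-30 (ex 2)
    return record

records = [("car", "Honda"), ("car", "Ford"), ("age", 23), ("age", 27), ("age", 41)]
print(Counter(group_key(r) for r in records))
# Counter({('car', '*'): 2, ('age', '20-30'): 2, ('age', '40-50'): 1})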

Page 10: ppt
Page 11: ppt

Clusters

• Identify data groups (CX)
  – that are compact and isolated
• CX
  – a cluster defined on X, a specific set of attributes
  – d(CX[X]) ≤ d0X and |CX| ≥ s0
    • d0X : density threshold, s0 : frequency threshold
    • |CX| : number of tuples in the cluster
    • |X| : number of dimensions
  – diameter : average pairwise distance between tuples projected on X (sketched below)
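A minimal sketch of the cluster test above, assuming Euclidean distance between the projected tuples; d0 and s0 stand for the density and frequency thresholds from this slide, and the sample salaries are made up.

# Minimal sketch: a set of tuples (projected on the attributes X) qualifies as a
# cluster if its diameter is small enough and it contains enough tuples.
from itertools import combinations
from math import dist   # Euclidean distance, Python 3.8+

def diameter(points):
    """Average pairwise distance between the projected tuples."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def is_cluster(points, d0, s0):
    return len(points) >= s0 and diameter(points) <= d0

salaries = [(39_000,), (40_000,), (40_500,), (41_000,)]
print(diameter(salaries), is_cluster(salaries, d0=2_000, s0=3))   # ~1083.3 True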

Page 12: ppt


Rules

• CX1 ∧ … ∧ CXx ⇒ CY1 ∧ … ∧ CYy
  – Xi and Yj are disjoint
  – confidence = | (∩i CXi) ∩ (∩j CYj) | / | ∩i CXi |
  – support = | (∩i CXi) ∩ (∩j CYj) | (sketched below)
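A minimal sketch of the support and confidence formulas above, assuming each cluster is represented by the set of tuple ids it contains and reading the set operations as intersections (tuples that fall in every cluster on a side of the rule). The example clusters are made up.

# Minimal sketch: support counts the tuples satisfying both sides of the rule,
# confidence divides by the tuples satisfying the antecedent.
def rule_support_confidence(antecedent_clusters, consequent_clusters):
    """Both arguments are lists of sets of tuple ids, one set per cluster."""
    lhs = set.intersection(*antecedent_clusters)
    both = lhs & set.intersection(*consequent_clusters)
    support = len(both)
    confidence = support / len(lhs) if lhs else 0.0
    return support, confidence

# CX1 = tuples with Age near 30, CY1 = tuples with Salary near 40,000
cx1 = {1, 2, 3, 4, 5}
cy1 = {2, 3, 4, 6}
print(rule_support_confidence([cx1], [cy1]))   # (3, 0.6)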

Page 13: ppt


Algorithm

1. Standard clustering algorithm
   – to identify the intervals of interest
   – Birch
2. Standard association rule algorithm
   – to identify rules over those intervals

Page 14: ppt


Birch

• Balanced Iterative Reducing and Clustering using Hierarchies
  – clusters are incrementally identified and refined in a single pass using CFs
  – CF (Clustering Feature)
    • compact summary of a sub-cluster
    • for CX = {t1, …, tN}, CF(CX) = (N, Σi=1..N ti[X], Σi=1..N ti[X]²) (sketched below)
  – CF tree : height-balanced tree
    • internal nodes contain a list of (CF, pointer) entries
    • leaf nodes contain lists of CFs
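A minimal sketch of a Clustering Feature as defined on this slide: the triple (N, Σ ti[X], Σ ti[X]²), which can absorb new points and be merged by component-wise addition. The class layout is illustrative, not BIRCH's actual implementation.

# Minimal sketch of a CF: count, per-dimension linear sum, per-dimension square sum.
from dataclasses import dataclass, field

@dataclass
class CF:
    n: int = 0                               # number of tuples N
    ls: list = field(default_factory=list)   # linear sum  Σ ti[X]
    ss: list = field(default_factory=list)   # square sum  Σ ti[X]^2

    def add_point(self, x):
        if not self.ls:
            self.ls = [0.0] * len(x)
            self.ss = [0.0] * len(x)
        self.n += 1
        for j, v in enumerate(x):
            self.ls[j] += v
            self.ss[j] += v * v

    def merge(self, other):
        # assumes both CFs summarize points over the same attribute set X
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  [a + b for a, b in zip(self.ss, other.ss)])

cf = CF()
for t in [(20.0, 39_000.0), (22.0, 41_000.0)]:
    cf.add_point(t)
print(cf)   # CF(n=2, ls=[42.0, 80000.0], ss=[884.0, 3202000000.0])

Because N, the linear sum, and the square sum are all additive, quantities such as the centroid and the diameter test used in Phase I can be derived from the CFs alone, which is what makes the CF a compact summary of a sub-cluster.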

Page 15: ppt


Phase I : Identifying Clusters

1. Each data point is inserted into the CF tree
2. Follow the pointer of the closest CF at each level
3. At the leaf node (sketched below)
   – if diameter(target cluster) < threshold
     • the point is added
   – else
     • a new cluster is created
     • if the leaf node is full, split it
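A minimal sketch of the leaf-level decision in Phase I. The helper names are hypothetical, the descent through internal nodes and the node split are omitted, and raw points are stored for clarity (BIRCH itself keeps only the CFs); picking the target as the cluster whose diameter grows least stands in for "closest CF".

# Minimal sketch: a point joins the closest leaf cluster only if the resulting
# diameter stays under the threshold; otherwise it starts a new cluster.
def insert_point(leaf_clusters, point, diameter_threshold, diameter_fn):
    """leaf_clusters: list of lists of points reached by following the closest CFs."""
    best = min(leaf_clusters, key=lambda c: diameter_fn(c + [point]), default=None)
    if best is not None and diameter_fn(best + [point]) < diameter_threshold:
        best.append(point)               # the point is absorbed by the target cluster
    else:
        leaf_clusters.append([point])    # otherwise a new cluster is created
        # (if the leaf node were full, it would be split here)
    return leaf_clusters

# Example, reusing diameter() from the Clusters sketch:
# leaves = insert_point([], (40_000.0,), diameter_threshold=2_000, diameter_fn=diameter)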

Page 16: ppt


Phase II : Combining Clusters to Form Rules

1. Scan i
   – to find candidate i-itemsets
   – count the number of tuples containing each set of values of size i: (v1, v2, …, vi), where each vj ∈ CXj and all (i-1)-subsets are frequent
2. Prune i
   – discard itemsets whose count is < s0
   – to find the frequent i-itemsets (a level-wise sketch follows below)
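A minimal, Apriori-style sketch of Phase II, assuming clusters are keyed by name and represented as sets of tuple ids, as in the earlier sketches; counting a value set by intersecting the clusters' tuple-id sets is a simplification of the scan described above.

# Minimal sketch: level-wise generation of frequent itemsets over clusters.
from itertools import combinations

def frequent_itemsets(clusters, s0):
    """clusters: dict mapping a cluster name (e.g. 'Age:28-32') to a set of tuple ids."""
    frequent = {frozenset([c]) for c, tids in clusters.items() if len(tids) >= s0}
    all_frequent = set(frequent)
    i = 2
    while frequent:
        # candidate i-itemsets are unions of frequent (i-1)-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == i}
        # prune: every (i-1)-subset must be frequent and the joint count must reach s0
        frequent = {
            c for c in candidates
            if all(frozenset(s) in all_frequent for s in combinations(c, i - 1))
            and len(set.intersection(*(clusters[name] for name in c))) >= s0
        }
        all_frequent |= frequent
        i += 1
    return all_frequent

clusters = {"Age:28-32": {1, 2, 3, 4}, "Job:DBA": {2, 3, 4, 5}, "Salary:38-42k": {2, 3, 4}}
print(frequent_itemsets(clusters, s0=3))
# prints every frequent itemset, including the 3-itemset combining all three clusters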