知识发现（数据挖掘 ) 第六章

23/4/21 AA12 关联规则史忠植 1

知识发现（数据挖掘 )

第六章

史忠植中国科学院计算技术研究所

关联规则 Association Rules


内容提要引言 Apriori 算法 Frequent-pattern tree 和 FP-growth 算法多维关联规则挖掘相关规则关联规则改进总结


关联规则

关联规则反映一个事物与其他事物之间的相互依存性和关联性。如果两个或者多个事物之间存在一定的关联关系，那么，其中一个事物就能够通过其他事物预测到。关联规则表示了项之间的关系。

示例 :cereal, milk fruit

“买谷类食品和牛奶的人也会买水果 .”

商店可以把牛奶和谷类食品作特价品以使人们买更多的水果 .


市场购物篮分析分析事务数据库表

我们是否可假定 ? Chips => Salsa Lettuce => Spinach

Person Basket

A Chips, Salsa, Cookies, Crackers, Coke, Beer

B Lettuce, Spinach, Oranges, Celery, Apples, Grapes

C Chips, Salsa, Frozen Pizza, Frozen Cake

D Lettuce, Spinach, Milk, Butter


基本概念通常 , 数据包含 :

TID Basket

事务 ID 项的子集


关联规则挖掘在事务数据库 , 关系数据库和其它信息

库中的项或对象的集合之间 , 发现频繁模式 , 关联 , 相关 , 或因果关系的结构 .

频繁模式 : 数据库中出现频繁的模式( 项集 , 序列 , 等等 )


基本概念项集

事务

关联规则

- 事务数据集 ( 例如右图 ) 事务标识 TID

每一个事务关联着一个标识 , 称作 TID.

IT

},...,,{ 21 miiiI

BAIBIA

BA

,,

D

Transaction-id Items bought

10 A, B, C

20 A, C

30 A, D

40 B, E, F


关联规则的度量支持度 s

D 中包含 A 和 B 的事务数与总的事务数的比值

规则 AB 在数据集 D 中的支持度为 s, 其中 s 表示D 中包含 AB ( 即同时包含 A 和B) 的事务的百分率 .

||||

||}|{||)(

D

TBADTBAs


关联规则的度量可信度 c

D 中同时包含 A 和 B 的事务数与只包含 A 的事务数的比值

||}|{||

||}|{||)(

TADT

TBADTBAc

规则 AB 在数据集 D 中的可信度为 c, 其中 c 表示 D 中包含 A 的事务中也包含 B 的百分率 . 即可用条件概率 P(B|A)表示 .

confidence(A B )=P(B|A)

条件概率 P(B|A) 表示 A 发生的条件下 B 也发生的概率 .

23/4/21 AA12 关联规则史忠植 10

关联规则的度量关联规则根据以下两个标准 ( 包含或排除 ): 最小支持度 – 表示规则中的所有项在事务中

出现的频度

最小可信度 - 表示规则中左边的项 (集 ) 的出现暗示着右边的项 (集 ) 出现的频度

23/4/21 AA12 关联规则史忠植 11

市场购物篮分析

I 是什么 ?事务 ID B的 T 是什么 ?s(Chips=>Salsa) 是什么 ?c(Chips=>Salsa) 是什么 ?

事务 ID 购物篮A Chips, Salsa, Cookies, Crackers, Coke, Beer

B Lettuce, Spinach, Oranges, Celery, Apples, Grapes

C Chips, Salsa, Frozen Pizza, Frozen Cake

D Lettuce, Spinach, Milk, Butter, Chips

23/4/21 AA12 关联规则史忠植 12

频繁项集项集 – 任意项的集合k- 项集 – 包含 k 个项的项集频繁 ( 或大 ) 项集 – 满足最小支持度的项集

若 I 包含 m 个项 , 那么可以产生多少个项集 ?

23/4/21 AA12 关联规则史忠植 13

强关联规则给定一个项集 , 容易生成关联规则 .

项集 : {Chips, Salsa, Beer} Beer, Chips => Salsa Beer, Salsa => Chips Chips, Salsa => Beer

强规则是有趣的强规则通常定义为那些满足最小支持度和最小可信

度的规则 .

23/4/21 AA12 关联规则史忠植 14

关联规则挖掘两个基本步骤

找出所有的频繁项集满足最小支持度

找出所有的强关联规则由频繁项集生成关联规则保留满足最小可信度的规则

23/4/21 AA12 关联规则史忠植 15

内容提要

引言 Apriori 算法 Frequent-pattern tree 和 FP-growth 算法多维关联规则挖掘相关规则关联规则改进总结

23/4/21 AA12 关联规则史忠植 16

Apriori 算法 IBM 公司 Almaden 研究中心的 R.Agrawal 等

人在 1993 年提出的 AIS和 SETM 。在 1994 年提出 Apriori和 AprioriTid。 Apriori

和 AprioriTid 算法利用前次过程中的数据项目集来生成新的候选数据项目集，减少了中间不必要的数据项目集的生成，提高了效率

23/4/21 AA12 关联规则史忠植 17

生成频繁项集Naïve algorithm

n <- |D|

for each subset s of I do

l <- 0

for each transaction T in D do

if s is a subset of T then

l <- l + 1

if minimum support <= l/n then

add s to frequent subsets

23/4/21 AA12 关联规则史忠植 18

生成频繁项集 naïve algorithm 的分析

I 的子集 : O(2m) 为每一个子集扫描 n 个事务测试 s为 T 的子集 : O(2mn)

随着项的个数呈指数级增长 ! 我们能否做的更好 ?

23/4/21 AA12 关联规则史忠植 19

Apriori 性质定理 (Apriori 性质 ): 若 A 是一个频繁项集 ,则 A 的每一个子集都是一个频繁项集 .

证明 :设 n 为事务数 . 假设 A是 l 个事务的子集 , 若 A’ A , 则 A’ 为 l’ (l’ l ) 个事务的子集 . 因此 , l/n ≥s( 最小支持度 ), l’/n ≥s 也成立 .

23/4/21 AA12 关联规则史忠植 20

Apriori 算法 Apriori 算法是一种经典的生成布尔型关联规则的

频繁项集挖掘算法 . 算法名字是缘于算法使用了频繁项集的性质这一先验知识 .

思想 : Apriori 使用了一种称作 level-wise 搜索的迭代方法 , 其中 k- 项集被用作寻找 (k+1)- 项集 .

首先 , 找出频繁 1- 项集 ,以 L1 表示 .L1 用来寻找L2, 即频繁 2- 项集的集合 .L2 用来寻找 L3, 以此类推 , 直至没有新的频繁 k- 项集被发现 . 每个 Lk都要求对数据库作一次完全扫描 ..

23/4/21 AA12 关联规则史忠植 21

生成频繁项集中心思想 : 由频繁 (k-1)- 项集构建候选 k- 项集方法

找到所有的频繁 1- 项集扩展频繁 (k-1)- 项集得到候选 k- 项集剪除不满足最小支持度的候选项集

23/4/21 AA12 关联规则史忠植 22

Apriori: 一种候选项集生成 - 测试方法

Apriori 剪枝原理 : 若任一项集是不频繁的 , 则其超集不应该被生成 / 测试 !

方法 : 由频繁 k- 项集生成候选 (k+1)- 项集 ,并且在 DB 中测试候选项集

性能研究显示了 Apriori 算法是有效的和可伸缩(scalablility)的 .

23/4/21 AA12 关联规则史忠植 23

The Apriori 算法—一个示例Database TDB

1st scan

C1L1

L2

C2 C2

2nd scan

C3 L33rd scan

Tid Items

10 A, C, D

20 B, C, E

30 A, B, C, E

40 B, E

Itemset sup

{A} 2

{B} 3

{C} 3

{D} 1

{E} 3

Itemset sup

{A} 2

{B} 3

{C} 3

{E} 3

Itemset

{A, B}

{A, C}

{A, E}

{B, C}

{B, E}

{C, E}

Itemset sup{A, B} 1{A, C} 2{A, E} 1{B, C} 2{B, E} 3{C, E} 2

Itemset sup{A, C} 2{B, C} 2{B, E} 3{C, E} 2

Itemset

{B, C, E}Itemset sup{B, C, E} 2

23/4/21 AA12 关联规则史忠植 24

Apriori 算法Algorithm: Apriori输入 : Database, D, of transactions; minimum support threshold,min_sup.输出 : L, freuqent itemsets in D.过程 :

Ck: Candidate itemset of size kLk : frequent itemset of size k

L1 = find_frequent_1_itemsets(D);for (k = 2; Lk+1 !=; k++) do begin{ Ck = apriori_gen(Lk-1 ,min_sup); for each transaction t in database D do{//scan D for counts

Ct =subset(Ck ,t);// get the subsets of t that are candidates For each candidate c Ct

c.count++;}

Lk = candidates in Ck with min_support }endreturn L=k Lk;

23/4/21 AA12 关联规则史忠植 25

Apriori 算法Procedure apriori_gen(Lk-1: frequent (k-1)-itemsets; min_sup:minimum

support threshold )for each itemset l1 Lk-1

for each itemset l2Lk-1

if(l1[1]=l2[1]) (l1[2]=l2[2]) … (l1[k-1]=l2[k-1]) Then{ c=join(l1,l2)//join step: generate candidates if has_infrequent_subset(c, Lk-1 ) then delete c;//prune step: remove unfruitful candidate else add c to Ck

}return Ck

23/4/21 AA12 关联规则史忠植 26

Apriori 算法Join is generate candidates set of itemsets Ck from

2 itemsets in Lk-1

Procedure join(p,q)

insert into Ck

select p.item1, p.item2,..., p.itemk-1,

q.itemk-1

from Lk-1 p, Lk-1 q

where p.item1=q.item1, ..., p.itemk-2=q.itemk-2,

p.itemk-1 = q.itemk-1

23/4/21 AA12 关联规则史忠植 27

Apriori 算法 Procedure has_infrequent_subset(c:candidate k-itemset;Lk-1:

frequent (k-1)-itemsets;)//use prior knowledge

for each (k-1)-subset s of c

if s Lk-1 then

return TRUE;

return FALSE.

23/4/21 AA12 关联规则史忠植 28

Apriori 算法如何生成候选项集 ?

步骤 1: 自连接 Lk

步骤 2: 剪枝如何计算候选项集的支持度 ? 候选项庥生成的示例

L3={ abc, abd, acd, ace, bcd }

自连接 : L3*L3

由 abc 和 abd 连接得到 abcd 由 acd 和 ace 连接得到 acde

剪枝 : 因为 ade 不在 L3中 acde 被剪除

C4={abcd}

23/4/21 AA12 关联规则史忠植 29

如何生成候选项集 ?

假定 Lk-1 中的项以一定顺序排列步骤 1: 自连接 Lk-1

insert into Ck

select p.item1, p.item2, …, p.itemk-1, q.itemk-1

from Lk-1 p, Lk-1 q

where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1

步骤 2: 剪枝forall itemsets c in Ck do

forall (k-1)-subsets s of c do

if (s is not in Lk-1) then delete c from Ck

23/4/21 AA12 关联规则史忠植 30

如何计算候选项集的支持度 ?

为何候选项集的支持度的计算是一个问题 ? 候选项集的总数可能是巨大的一个事务可能包含多个候选项集

方法 : 候选项集被存储在一个哈希树哈希树的叶子结点包含一个项集和计数的列表内部结点包含一个哈希表子集函数 : 找出包含在一个事务中的所有候选项集

23/4/21 AA12 关联规则史忠植 31

频繁模式挖掘的挑战挑战

多次扫描事务数据库巨大数量的候选项集繁重的计算候选项集的支持度工作

改进 Apriori: 大体的思路减少事务数据库的扫描次数缩减候选项集的数量使候选项集的支持度计算更加方便

23/4/21 AA12 关联规则史忠植 32

AprioriTid 算法 AprioriTid 算法由 Apriori 算法改进优点：只和数据库做一次交互，无须频繁访问

数据库将 Apirori 中的 Ck 扩展，内容由 {c}变为

{TID， c}， TID 用于唯一标识事务引入 Bk ，使得 Bk 对于事务的项目组织集合，而不是被动的等待 Ck 来匹配

23/4/21 AA12 关联规则史忠植 33

AprioriTid 算法举例：minsupp = 2 数据库：

TID 项目100 1 3 4

200 2 3 5

300 1 2 3 5

400 2 5

23/4/21 AA12 关联规则史忠植 34

AprioriTid 算法示例TID 项目集100 {1} {3} {4}

200 {2} {3} {5}

300 {1} {2} {3} {5}

400 {2} {5}

项集支持度

{1} 2

{2} 3

{3} 3

{5} 3

23/4/21 AA12 关联规则史忠植 35

ApioriTid 算法示例TID 项目集100 {{1 3}}

200 {{2 3} {2 5} {3 5} }

300 {{1 2} {1 3} {1 5} {2 3} {2 5} {3 5}}

400 {{2 5}}

项集支持度

{1 3} 2

{2 3} 2

{2 5} 3

{3 5} 2

23/4/21 AA12 关联规则史忠植 36

ApioriTid 算法示例TID 项目集100 空200 {{2 3 5}}

300 {{2 3 5 }}

400 空

23/4/21 AA12 关联规则史忠植 37

ApioriTid 算法上面图中分别为 Bk 和 Lk ，而 Ck 和 Apriori

算法产生的一样，因此没有写出来可以看到 Bk 由 Bk-1 得到，无须由数据库取数

据缺点：内存要求很大，事务过多的时候资源难

以满足

23/4/21 AA12 关联规则史忠植 38

内容提要


23/4/21 AA12 关联规则史忠植 39

频繁模式挖掘的瓶颈多次扫描数据库是高代价的长模式的挖掘需要多次扫描数据库以及生成许

多的候选项集找出频繁项集 i1i2…i100

扫描次数 : 100 候选项集的数量 : (100

1) + (1002) + … + (1

10

00

0) = 2100-1 = 1.27*1030 !

瓶颈 : 候选项集 - 生成 - 测试我们能否避免生成候选项集 ?

23/4/21 AA12 关联规则史忠植 40

不生成候选项集的频繁模式挖掘

利用局部频繁的项由短模式增长为长模式 “abc” 是一个频繁模式

得到所有包含 “ abc”的事务 : DB|abc

“d” 是 DB|abc 的一个局部频繁的项 abcd 是一个频繁模式

23/4/21 AA12 关联规则史忠植 41

FP Growth 算法 (Han, Pei, Yin 2000)

Apriori 算法的一个有问题的方面是其候选项集的生成指数级增长的来源

另一种方法是使用分而治之的策略 (divide and conquer)

思想 : 将数据库的信息压缩成一个描述频繁项相关信息的频繁模式树

23/4/21 AA12 关联规则史忠植 42

利用 FP-树进行频繁模式挖掘

思想 : 频繁模式增长递归地增长频繁模式借助模式和数据库划分

方法对每个频繁项 ,构建它的条件模式基 ,然后构建它的

条件 FP-树 . 对每个新创建的条件 FP-树重复上述过程直至结果 FP-树为空 , 或者它仅包含一个单一路径 .该路径将生成其所有的子路径的组合 , 每个组合都是一个频繁模式 .

23/4/21 AA12 关联规则史忠植 43

频繁 1-项集

最小支持度为 20% ( 计数为 2)

TID Items

1 I1,I2,I5

2 I2,I4

3 I2,I3,I6

4 I1,I2,I4

5 I1,I3

6 I2,I3

7 I1,I3

8 I1,I2,I3,I5

9 I1,I2,I3

Itemset Support count

{I1} 6

{I2} 7

{I3} 6

{I4} 2

{I5} 2

{I6} 1


{I1} 6

{I2} 7

{I3} 6

{I4} 2

{I5} 2

事务数据库支持度计数频繁 1- 项集

23/4/21 AA12 关联规则史忠植 44

FP-树构建


{I1} 6

{I2} 7

{I3} 6

{I4} 2

{I5} 2


{I2} 7

{I1} 6

{I3} 6

{I4} 2

{I5} 2

按支持度降序排列

23/4/21 AA12 关联规则史忠植 45

FP-树构建创建根结点

null 扫描数据库

事务 1: I1, I2, I5 排序 : I2, I1, I5

处理事务以项的顺序增加结点标注项及其计数

(I2,1)

(I1,1)

(I5,1)

1I5

0I4

0I3

1I1

1I2

维护索引表

23/4/21 AA12 关联规则史忠植 46

FP-树构建

null

(I2,2)

(I1,1)

(I5,1)

0I5

1I4

0I3

0I1

2I2

(I4,1)

TID Items

1 I1,I2,I5

2 I2,I4

3 I2,I3,I6

4 I1,I2,I4

5 I1,I3

6 I2,I3

7 I1,I3

8 I1,I2,I3,I5

9 I1,I2,I3

23/4/21 AA12 关联规则史忠植 47

FP-树构建

null

(I2,7)

(I1,4)

(I5,1)

2I5

2I4

6I3

6I1

7I2

(I4,1)

TID Items

1 I1,I2,I5

2 I2,I4

3 I2,I3,I6

4 I1,I2,I4

5 I1,I3

6 I2,I3

7 I1,I3

8 I1,I2,I3,I5

9 I1,I2,I3

(I3,2) (I3,2)

(I1,2)

(I3,2) (I4,1)

(I5,1)

23/4/21 AA12 关联规则史忠植 48

FP-树构建扫描事务数据库 D 一次 ,得到频繁项的集合 F及它

们的支持度 .将 F按支持度降序排列成 L,L 是频繁项的列表 .

创建 FP-树的根 , 标注其为 NULL.对D 中的每个事务进行以下操作 :

根据 L 中的次序对事务中的频繁项进行选择和排序 . 设事务中的已排序的频繁项列表为 [p|P], 其中p 表示第一个元素 ,P 表示剩余的列表 .调用insert_Tree([p|P],T).

23/4/21 AA12 关联规则史忠植 49

FP-树构建 Insert_Tree([p|P],T) If T has a child N such that N.item-name= p.item-name, then increment N’s count by 1; else create a new node N, and let its count be 1, its parent link be linked to T, and its node- link to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P,N) recursively.

23/4/21 AA12 关联规则史忠植 50

挖掘 FP-tree从索引表中的最后一个项开始找到所有包含该项的路径

沿着结点 -链接 (node-links)确定条件模式

路径中符合频度要求的模式构建 FP-tree C添加项至 C 中所有路径 , 生成频繁模式递归地挖掘 C (添加项 )从索引表和树中移除项

23/4/21 AA12 关联规则史忠植 51

挖掘 FP-Tree

null

(I2,7)

(I1,4)

(I5,1)

2I5

2I4

6I3

6I1

7I2

(I4,1)

(I3,2)

(I3,2)(I4,1)

(I5,1)

(I1,2)

(I3,2)

前缀路径(I2 I1,1)(I2 I1 I3, 1)

条件路径(I2 I1, 2)

条件 FP-tree

(I2 I1 I5, 2), (I2 I5, 2), (I1 I5, 2)

null

(I2,2)

(I1,2)

23/4/21 AA12 关联规则史忠植 52

挖掘 FP-Tree项条件模式基条件 FP-

tree生成的频繁模式

I5 {(I2 I1:1),(I2 I1 I3:1)} <I2:2,I1:2> I2 I5:2,

I1 I5:2,

I2 I1 I5:2

I4 {(I2 I1:1),(I2:1)} <I2:2> I2 I4:2

I3 {(I2 I1:2,(I2:2),(I1:2)} <I2:4,I1:2>,

<I1:2>

I2 I3:4,

I1 I3:2,

I2 I1 I3:2

I1 {(I2:4)} <I2:4> I2 I1:4

23/4/21 AA12 关联规则史忠植 53

挖掘 FP-TreeProcedure FP_growth(Tree,)(1) If Tree contains a single path P then

(2) for each combination (denote as ) of the nodes in the

path P

(3) generate pattern with support = minisup of nodes in ;

(4) Else for each ai in the header of Tree {

(5) generate pattern =ai with support =ai.support;

(6) construct , s conditional pattern base and then ’conditional

FP_tree Tree;

(7) IF Treeø then

(8) call FP_growth(Tree, );}

Three parallel algorithms: CD, DD, CaD based on Apriori

Discovering frequent itemsets (1) is much more expensive than generating rules (2)

Phase 1: Each node generates candidate k-itemsets locally from the

frequent (k-1)-itemsets how to partition?

Phase 2: The match candidates itemsets and transactions collect the

local counts how to distribute?

Phase 3: - determine the global counts for itemsets how to find?- find frequent k-itemsets and replicate in all nodes

并行关联规则挖掘

k-itemset An itemset having k items

LkSet of frequent k-itemsets (those with minimum support)Each member of this set has 2 fields: itemset and support count

CkSet of candidate k-itemsets (potentially frequent itemsets)Each member of this set has 2 fields: itemset and support count

Pi Processor with id-i

Di The dataset local to the processor Pi

D Ri The dataset local to the processor Pi after repartitioning

Cik

The candidate set maintained with the processor Pi during the kth pass (there are k items in each candidate)

并行关联规则挖掘

Objective: minimizing communication

Techniques:- Straight-forward parallelization of Apriori- Carry out redundant duplicate computations in parallel to

avoid communication- Only requires communicating count values (no data tuples

are exchanged)

Processors can scan the local data asynchronously in parallel

计数分布 CD

Algorithm:

Pass 1:(1) Each processor Pi generates its local candidate itemset Ci

1 depending on the items present in its local data partition Di

(2) Develop and Exchange local counts Ci1

(3) Develop global support counts C1

计数分布 CD

Algorithm:

Pass k>1:(1) Pi generates the complete Ck using the complete Lk-1 created at the end

of pass (k-1). Each processor has the identical Lk-1 thus generates identical Ck and puts its count values in a common order into a count array

(2) Pi makes a pass over data partition Di and develop local support counts for candidates in Ck

(3) Pi exchanges local Ck counts with all other processors to develop global Ck counts. All processors must synchronize.

(4) Pi computes Lk from Ck

(5) Pi independently decide to terminate or continue to the next pass

计数分布 CD

计数分布 CD

Disadvantages: CD does not exploit the aggregate memory of the system Must synchronize and develop global count at the end of

each pass

M over [Ck]N * |M|N

M over [Ck]|M|1

Usage of memory per processor

Total amount of memory

Number of processor

计数分布 CD

Objective: utilize aggregate main memory of the system effectively

Technique:•Partitions the candidates into disjoint sets, which are assigned to different processors. Each processor works with the entire dataset but only portion of the candidate set.

•Each processor counts mutually exclusive candidates. On a N-processor configuration, DD can count in a single pass candidate set that would require N pass in CD

数据分布 DD

Basic Idea

Ck

Ck1 Ck

2

Lk1 Lk

2

LkProcessor 1 Processor 2

Example: 2 processors

Data Distribution only processes a subset of Ck to utilize the aggregate memory

Exchange data to develop global counts for Ck

i

data

数据分布 DD

Algorithm:

Pass 1: Same as the CD algorithm

Pass k>1:(1) Pi generates Ck from Lk-1. It retains only 1/N of the itemsets forming Ci

k

(2) Pi develops support counts for itemsets in Cik for ALL transactions

(using local data pages and data pages received from other processors)

(3) At the end of the data pass, Pi calculates Lik using local Ci

k

(4) Processors exchange Lik so that every processor has the complete Lk

for generating Ck+1 for the next pass (requires processors to synchronize)

(5) Pi can independently decide whether to terminate or continue on to the next pass

数据分布 DD

Lik

Lik Li

k

Lk

数据分布 DD

Disadvantages:

heavy communication

Each processor must broadcast their local data and frequent itemsets to all other processors and synchronize in every pass.

数据分布 DD

Problem:CD and DD require processors to synchronize at the end of each pass

Basic Idea: Remove dependence among processors

• Data dependence

Complete transactions are required to compute support count (in CD)

• Frequent itemsets dependencyA global itemset Lk is needed during the pruning step of Apiori candidate generation algorithm(in DD)

候选分布 CaD

Remove Data Dependency• Each processor Pi works on Ck

i, a disjoint subset of Ck

• Pi derives global support counts for Cki from local

data.• Replicate data amongst processors in order to achieve

the above

Reduce Frequent itemset dependency• Does not wait for the complete pruning information to

arrive from other processors.• Prune the candidate set as much as possible• Late arriving pruning information is used in subsequent

passes.

候选分布 CaD

Algorithm:

Pass k<l: Use either the CD or DD algorithm

Pass k=l :(1) Partition Lk-1 among N processors(2) Pi generates Ci

k logically using only the Lik-1 partition (use standard pruning)

(3) Pi develops global counts for candidates in Cik and the database is repartitioned

into D Ri at the same time (requires communicating local data)(4) Pi receive Lj

k from all other processors needed for pruning Cik+1

(5) Pi computes Lik from Ci

k and asynchronously send it to the other N-1 processors

Pass k>l:(1) Pi collects all frequent itemsets sent by other processors(2) Pi generates Ci

k using local Lik-,, take care of pruning(Lj

l-1)(3) Pi passes over D Ri and counts Ci

k

(4) Pi computes Lik from Ci

k and asynchronously send it to the other N-1 processors

候选分布 CaD

Count Distribution attempts to minimize communication by replicating the candidate sets in each processor’s memory

Data Distribution maximizes the use of aggregate memory by allowing each processor works with the entire dataset but only portion of the candidate set

Candidate Distribution eliminates the synchronization costs at the end of every pass, maximizes the use of aggregate memory while limiting heavy communication to a single redistribution pass

并行关联算法比较

• Exchange data to repartition database (once only).

• Exchange Lki (require no

synchronization).

• Partition Lk

• Generate partial Cki

• Generate partial Lki

Candidate Distribution

• Exchange data to compute support counts for Ck

i

• Exchange Lki

• Synchronize at each pass

• Generate complete Ck

• Partition Ck

• Generate partial Lki

Data Distribution

• Exchange support counts for Cki

• Synchronize at each pass

• Generate complete Ck

• Generate complete Lk

Count Distribution

CommunicationNon-Communication


Advantages Disadvantages

Count Distribution (CD)

Low communication costOnly exchange counts, not data

Doesn’t exploit aggregate memoryMust synchronize at the end of every passWhen the candidate itemsets doesn’t fit the memory of each node?

Data Distribution (DD)

Better utilization of aggregate system memory

Need to broadcast its local data to every other processor at every passNeed to synchronize

Candidate Distribution (CaD)

Processors can proceed independently without synchronizing at the end of every pass

Communication of entire dataset needed for a single redistribution pass. The communication cost is higher than the synchronization cost savings


23/4/21 AA12 关联规则史忠植 72

由事务数据库构建 FP- 树

{}

f:4 c:1

b:1

p:1

b:1c:3

a:3

b:1m:2

p:2 m:1

Header Table

Item frequency head f 4c 4a 3b 3m 3p 3

min_support = 3

TID Items bought (ordered) frequent items100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}200 {a, b, c, f, l, m, o} {f, c, a, b, m}300 {b, f, h, j, o, w} {f, b}400 {b, c, k, s, p} {c, b, p}500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}

1. 扫描 DB 一次 , 找到频繁 1 项 (单一项模式 )

2. 按支持度降序对频繁项排序为 F-list

3. 再次扫描 DB,构建 FP-tree

F-list=f-c-a-b-m-p

23/4/21 AA12 关联规则史忠植 73

划分模式和数据库

频繁模式根据 F-list 可以被划分为若干子集 F-list=f-c-a-b-m-p 包含 p 的模式包含 m 但包含 p 的模式 … 包含 c 但不包含 a ,b, m, p 的模式模式 f

完整性和非冗余性

23/4/21 AA12 关联规则史忠植 74

从 P 的条件数据库找出包含 P 的模式

从 FP-tree 的索引表的频繁项开始沿着每个频繁项 p 的链接遍历 FP-tree 累积项 p 的所有转换前缀路径来形成的 p 的条件模式基

条件模式基项条件模式基c f:3

a fc:3

b fca:1, f:1, c:1

m fca:2, fcab:1

p fcam:2, cb:1

{}

f:4 c:1

b:1

p:1

b:1c:3

a:3

b:1m:2

p:2 m:1

Header Table

Item frequency head f 4c 4a 3b 3m 3p 3

23/4/21 AA12 关联规则史忠植 75

递归 : 挖掘每个条件 FP-tree

{}

f:3

c:3

a:3m- 条件 FP-tree

“am” 的条件模式基 : (fc:3)

{}

f:3

c:3am- 条件 FP-tree

“cm” 的条件模式基 : (f:3){}

f:3

cm- 条件 FP-tree

“cam” 的条件模式基 : (f:3)

{}

f:3

cam- 条件 FP-tree

23/4/21 AA12 关联规则史忠植 76

一个特例 : FP-tree 中的单一前缀路径

假定 ( 条件的 ) FP-tree T 有一个共享的单一前缀路径 P

挖掘可以分为两部分将单一前缀路径约简为一个结点将两部分的挖掘结果串联

a2:n2

a3:n3

a1:n1

{}

b1:m1C1:k1

C2:k2 C3:k3

b1:m1C1:k1

C2:k2 C3:k3

r1

+a2:n2

a3:n3

a1:n1

{}

r1 =

23/4/21 AA12 关联规则史忠植 77

通过 DB 投影 (Projection)使 FP-growth 可伸缩

FP-tree 不能全放入内存 ?—DB 投影首先将一个数据库划分成一个由若干投影

(Projected) 数据库组成的集合然后对每个投影数据库构建和挖掘 FP-tree Parallel projection vs. Partition projection 技术

Parallel projection is space costly

23/4/21 AA12 关联规则史忠植 78

Partition-based Projection

Parallel projection 需要很多磁盘空间

Partition projection 节省磁盘空间

Tran. DB fcampfcabmfbcbpfcamp

p-proj DB fcamcbfcam

m-proj DB fcabfcafca

b-proj DB fcb…

a-proj DBfc…

c-proj DBf…

f-proj DB …

am-proj DB fcfcfc

cm-proj DB fff

…

23/4/21 AA12 关联规则史忠植 79

改进途径使用哈希表存储候选 k- 项集的支持度计

数移除不包含频繁项集的事务对数据采样划分数据

若一个项集是频繁的 , 则它必定在某个数据分区中是频繁的 .

23/4/21 AA12 关联规则史忠植 80

FP-tree 结构的优点完整性

保持了频繁项集挖掘的完整信息没有打断任何事务的长模式

紧密性减少不相关的信息—不频繁的项没有了项按支持度降序排列 : 越频繁出现 ,越可能被共享决不会比原数据库更大 ( 不计结点链接和计数域 ) 对 Connect-4 数据库 , 压缩比率可以超过 100

23/4/21 AA12 关联规则史忠植 81

关联规则可视化 : 方格图 (Pane Graph)

23/4/21 AA12 关联规则史忠植 82

关联规则可视化 : 规则图 (Rule Graph)

23/4/21 AA12 关联规则史忠植 83

内容提要


23/4/21 AA12 关联规则史忠植 84

挖掘多种规则或规律

多层 (Multi-level),量化 (quantitative) 关联规则 ,

相关 (correlation) 和因果 (causality), 比率(ratio) 规则 , 序列 (sequential) 模式 ,浮现(emerging) 模式 , temporal associations, 局部周期 (partial periodicity)

分类 (classification),聚类 (clustering),冰山立方体 ( iceberg cubes), 等等 .

23/4/21 AA12 关联规则史忠植 85

多层关联规则项常常构成层次可伸缩的 (flexible) 支持度设置 : 在较低层的项预期有较低的支持度 .

事务数据库可以基于维度和层次编码探寻共享多层挖掘统一支持度

Milk[support = 10%]

2% Milk [support = 6%]

Skim Milk [support = 4%]

Level 1min_sup = 5%

Level 2min_sup = 5%

Level 1min_sup = 5%

Level 2min_sup = 3%

减少的支持度

23/4/21 AA12 关联规则史忠植 86

可伸缩的支持度约束的多层 / 多维 (ML/MD) 关联规则

为什么设置可伸缩的支持度 ? 实际生活中项的出现频率变化巨大

在一个商店购物篮中的钻石 ,手表 ,钢笔统一的支持度未必是一个有趣的模型

一个可伸缩模型较低层的 ,较多维的组合以及长模式通常具有较小的支持度总体规则应该要容易说明和理解特殊的项和特殊的项的组合可以特别设定 ( 最小支持度 ) 以及拥有更高的优先级

23/4/21 AA12 关联规则史忠植 87

多维关联规则

单维规则 :

buys(X, “milk”) buys(X, “bread”)

多维规则 : 2 个维度或谓词 ( predicates)

跨维度 (Inter-dimension) 关联规则 (无重复谓词 )

age(X,”19-25”) occupation(X,“student”) buys(X,“coke”)

混合维度 (hybrid-dimension) 关联规则 (重复谓词 )

age(X,”19-25”) buys(X, “popcorn”) buys(X, “coke”)

分类 (Categorical)属性有限的几个可能值 ,值之间不可排序

数量 (Quantitative)属性数值的 ,值之间有固有的排序

23/4/21 AA12 关联规则史忠植 88

多层关联规则 : 冗余滤除根据项之间的”先辈” (ancestor) 关系 , 一些规则可能

是冗余的 .

示例 milk wheat bread [support = 8%, confidence = 70%]

2% milk wheat bread [support = 2%, confidence = 72%]

我们说第 1 个规则是第 2 个规则的先辈 .

一个规则是冗余的 ,当其支持度接近基于先辈规则的”预期” (expected)值 .

23/4/21 AA12 关联规则史忠植 89

多层关联规则 : 逐步深化 (Progressive Deepening)

一个自上而下的 ,逐步深化的方法 : 首先挖掘高层的频繁项 :

milk (15%), bread (10%) 然后挖掘它们的较低层”较弱” (weaker) 频繁项 :

2% milk (5%), wheat bread (4%)

多层之间不同的最小支持度阈值导致了不同的算法 : 如果在多个层次间采用了相同的最小支持度 ,若 t 的任何一个

先辈都是非频繁的则扔弃 (toss)t. 如果在较低层采用了减少的最小支持度，则只检验那些先辈

的支持度是频繁的／不可忽略的派生（ descendents）即可．

23/4/21 AA12 关联规则史忠植 90

多维关联规则挖掘的技术

搜索频繁 k-谓词集 (predicate set): 示例 : {age, occupation, buys} 是一个 3-谓词集以 age处理的方式 , 技术可以如下分类

1. 利用数量属性的统计离散 (static discretization) 方法利用预先确定的概念层次对数量属性进行统计离散化2. 量化关联规则

基于数据的分布 , 数量属性被动态地离散化到不同的容器空间(bins)

3. 基于距离 (Distance-based) 的关联规则这是一个动态离散化的过程 ,该过程考虑数据点之间的距离

23/4/21 AA12 关联规则史忠植 91

数量属性的统计离散化

挖掘之前利用概念层次离散化数值被范围 (ranges)替代 .

关系数据库中 , 找出所有的频繁 k-谓词 (predicate) 集要求 k 或 k+1 次表扫描 .

数据立方体 (data cube)非常适合数据挖掘 .

N- 维立方体的 cells 与谓词集 (

predicate sets) 相对应 .

通过数据立方体挖掘会非常快速 .

(income)(age)

()

(buys)

(age, income) (age,buys) (income,buys)

(age,income,buys)

23/4/21 AA12 关联规则史忠植 92

量化关联规则

age(X,”30-34”) income(X,”24K - 48K”) buys(X,”high resolution TV”)

数值属性动态离散化这样挖掘的规则的可信度或紧密度最大化

2- 维量化关联规则 : Aquan1 Aquan2 Acat

示例

23/4/21 AA12 关联规则史忠植 93

Mining Distance-based Association Rules

Binning methods do not capture the semantics of interval data

Distance-based partitioning, more meaningful discretization considering: density/number of points in an interval “closeness” of points in an interval

Price($)Equi-width(width $10)

Equi-depth(depth 2)

Distance-based

7 [0,10] [7,20] [7,7]20 [11,20] [22,50] [20,22]22 [21,30] [51,53] [50,53]50 [31,40]51 [41,50]53 [51,60]

23/4/21 AA12 关联规则史忠植 94

Interestingness Measure: Correlations (Lift)

play basketball eat cereal [40%, 66.7%] is misleading

The overall percentage of students eating cereal is 75% which is

higher than 66.7%.

play basketball not eat cereal [20%, 33.3%] is more accurate,

although with lower support and confidence

Measure of dependent/correlated events: lift

Basketball Not basketball Sum (row)

Cereal 2000 1750 3750

Not cereal 1000 250 1250

Sum(col.) 3000 2000 5000

)()(

)(, BPAP

BAPcorr BA

23/4/21 AA12 关联规则史忠植 95

内容提要


23/4/21 AA12 关联规则史忠植 96

相关规则 (Correlation Rules)

“ Beyond Market Baskets,” Brin et al. 假设执行关联规则挖掘

c c row

t 20 5 25

t 70 5 75

col 90 10 100

tea => coffee20% support80% confidence

but 90% of the people buy coffee anyway!

23/4/21 AA12 关联规则史忠植 97

相关规则

一种度量是计算相关性若两个随机变量 A 和 B 是统计独立的

对 tea 和 coffee:

1)()(

)(

BPAP

BAP

89.0)()(

)(

cPtP

ctP

23/4/21 AA12 关联规则史忠植 98

相关规则利用 2 统计检验来测试独立性设 n 为购物篮的总数设 k 为考虑的项的总数设 r 为一个包含项 (ij, ij) 的规则设O(r) 表示包含规则 r 的购物篮的数量 ( 即频率 )

对单个项 ij, 设 E[ij] = O(ij) ( 反过来即为 n - E[ij])

E[r] = n * E[r1]/n * … * E[rk] / n

23/4/21 AA12 关联规则史忠植 99

相关规则 2 统计量定义为

Look up for significance value in a statistical textbook There are k-1 degrees of freedom If test fails cannot reject independence, otherwise

contigency table represents dependence.

Rr rE

rErO

][

])[)(( 22

23/4/21 AA12 关联规则史忠植 100

示例 Back to tea and coffee

E[t] = 25, E[t]=75, E[c]=90, E[c]=10 E[tc]=100 * 25/100 * 90 /100=22.5 O(tc) = 20 Contrib. to 2 = (20 - 22.5)2 / 22.5 = 0.278 Calculate for the rest to get 2=2.204 Not significant at 95% level (3.84 for k=2) Cannot reject independence assumption

c c row

t 20 5 25

t 70 5 75

col 90 10 100

23/4/21 AA12 关联规则史忠植 101

兴趣度（ Interest ）

If 2 test shows significance, then want to find most interesting cell(s) in table

I(r) = O(r)/E[r] Look for values far away from 1 I(tc) = 20/22.5 = 0.89 I(tc) = 5/2.5 = 2 I(tc) = 70/67.5 = 1.04 I(tc) = 5/7.5 = 0.66

23/4/21 AA12 关联规则史忠植 102

2 统计量的性质上封闭性 (Upward closed)

若一个 k- 项集是相关的 , 则其所有的超集也是相关的 .

寻找最小的相关的项集没有子集是相关的

能否将 a-priori and 2 统计量有效地结合 No generate and prune as in support-confidence

23/4/21 AA12 关联规则史忠植 103

其它度量 (Measures)

l(A B) P(A,B)

P(A)P(B)

TID Items

1 I1,I2,I5

2 I2,I4

3 I2,I3,I6

4 I1,I2,I4

5 I1,I3

6 I2,I3

7 I1,I3

8 I1,I2,I3,I5

9 I1,I2,I3

作用度 (Lift) 度量项之间的相关性 ,但没有检验

23/4/21 AA12 关联规则史忠植 104

其它度量 (Measures)

可信度 (Conviction) 度量一个规则的强度

A B (A B)

c(A B) P(A)P(B)

P(A,B)

TID Items

1 I1,I2,I5

2 I2,I4

3 I2,I3,I6

4 I1,I2,I4

5 I1,I3

6 I2,I3

7 I1,I3

8 I1,I2,I3,I5

9 I1,I2,I3

23/4/21 AA12 关联规则史忠植 105

内容提要


23/4/21 AA12 关联规则史忠植 106

关联规则改进 Lin 等人提出解决规则挖掘算法中的数据倾斜问题，从

而使算法具有较好的均衡性。 Park 等人提出把哈希表结构用于关联规则挖掘。

Agrawal 首先提出事务缩减技术， Han 和 Park 等人也分别在减小数据规模上做了一些工作。

抽样的方法是由 Toivonen 提出的。 Brin 等人采用动态项集计数方法求解频繁项集。 Aggarwal 提出用图论和格的理论求解频繁项集的方

法。 Prutax 算法就是用格遍历的办法求解频繁项集。

23/4/21 AA12 关联规则史忠植 107

关联规则改进关联规则模型有很多扩展，如顺序模型挖掘，在顺序

时间段上进行挖掘等。还有挖掘空间关联规则，挖掘周期性关联规则，挖掘

负关联规则，挖掘交易内部关联规则等。 Guralnik 提出顺序时间段问题的形式描述语言，以便描

述用户感兴趣的时间段，并且构建了有效的数据结构SP 树（顺序模式树）和自底向上的数据挖掘算法。

最大模式挖掘是 Bayardo 等人提出来的。

23/4/21 AA12 关联规则史忠植 108

关联规则改进随后人们开始探讨频率接近项集。 Pei 给出了一种有效的

数据挖掘算法。 B.Özden 等人的周期性关联规则是针对具有时间属性的事

务数据库，发现在规律性的时间间隔中满足最小支持度和信任度的规则。

贝尔实验室的 S.Ramaswamy 等人进一步发展了周期性关联规则，提出挖掘符合日历的关联规则（ Calendric

Association Rules ）算法，用以进行市场货篮分析。

23/4/21 AA12 关联规则史忠植 109

关联规则改进 T.Hannu 等人把负边界引入规则发现算法中，每次挖掘

不仅保存频繁项集，而且同时保存负边界，达到下次挖掘时减少扫描次数的目的。

Srikant 等人通过研究关联规则的上下文，提出规则兴趣度尺度用以剔除冗余规则。

Zakia 还用项集聚类技术求解最大的近似潜在频繁项集，然后用格迁移思想生成每个聚类中的频繁项集。

CAR ，也叫分类关联规则，是 Lin 等人提出的一种新的分类方法，是分类技术与关联规则思想相结合的产物，并给出解决方案和算法。

23/4/21 AA12 关联规则史忠植 110

关联规则改进

Cheung 等人提出关联规则的增量算法。 Thomas 等人把负边界的概念引入其中，进一步发展了增量算法。如，基于 Apriori 框架的并行和分布式数据挖掘算法。

Oates 等人将 MSDD 算法改造为分布式算法。还有其他的并行算法，如利用垂直数据库探求项集聚类等。

23/4/21 AA12 关联规则史忠植 111

ARCS (Association Rules Clustering System)

ARCS

1. Binning

2. Frequent Items

Set

3. Clustering

4. Optimization

23/4/21 AA12 关联规则史忠植 112

ARCS: 成功的应用聚类的概念到分类中 .

但仅限于基于 2- 维规则的分类 ,如 A B Classi 的格式所示

利用装箱 (Binning) 方法将数据属性值离散化 ,因此 ACRS 的准确度与使用的离散化程度强烈相关 .

ARCS ARCS 的特点的特点

23/4/21 AA12 关联规则史忠植 113

基于关联规则的分类(Classification Based on Association rules, CBA)

分类规则挖掘与关联规则挖掘目标

一个小的规则集作为分类器所有规则依照最小支持度与最小可信度

语法 (Syntax) X y X Y

23/4/21 AA12 关联规则史忠植 114

为何及如何结合

分类规则挖掘与关联规则挖掘都是实际应用中必需的 .

结合着眼于关联规则的一个特定子集 , 其右件限制为分类的类属性 . CARs: Class Association Rules

23/4/21 AA12 关联规则史忠植 115

CBA: 三个步骤

若有连续值 , 则离散化 .生成所有的 class association rules

(CARs)构建一个若干生成的 CARs 的分类器 .

23/4/21 AA12 关联规则史忠植 116

CAR 集

生成完整的 CARs 的集合 , 其满足用户指定的最小支持度与最小可信度约束 .

由 CARs构建一个分类器 .

23/4/21 AA12 关联规则史忠植 117

规则生成 : 基本概念

规则项 (Ruleitem) <condset, y> : 条件集 condset 是一个项集 , y 是一个类标签 (class label)

每个规则项表示一个规则 : condset->y

条件集支持度 (condsupCount) D 中包含 condset 的事例 (case) 数

规则支持度 (rulesupCount) D 中包含 condset 及标注类别 y 的事例 (case) 数

支持度 Support=(rulesupCount/|D|)*100% 可信度 Confidence=(rulesupCount/condsupCount)*100%

23/4/21 AA12 关联规则史忠植 118

规则生成 : 基本概念 (Cont.)

频繁规则项 (frequent ruleitems) 一个规则项是频繁的 ,当其支持度不小于最小支持度

minsup. 精确规则 (Accurate rule)

一个规则是精确的 ,当其可信度不小于最小可信度 minconf .

潜在规则 (Possible rule) 对所有包含同样条件集 condset 的规则项 , 可信度最大的规

则项为这一规则项集合的潜在规则 . 类别关联规则 class association rules (CARs) 包含

所有的潜在规则 possible rules (PRs) , 其即是频繁的又是精确的 .

23/4/21 AA12 关联规则史忠植 119

规则生成 : 一个示例

一个规则项 :<{(A,1),(B,1)},(class,1)> 假定

条件集 condset (condsupCount) 的支持度计数为 3, 规则项 ruleitem (rulesupCount) 的支持度计数为 2, |D|=10

则 (A,1),(B,1) -> (class,1) [supt=20% (rulesupCount/|D|)*100% confd=66.7% (rulesupCount/condsupCount)*100%]

23/4/21 AA12 关联规则史忠植 120

RG: 算法

1 F 1 = {large 1-ruleitems};

2 CAR 1 = genRules (F 1 );

3 prCAR 1 = pruneRules (CAR 1 ); //count the item and class occurrences to determine the frequent 1-ruleitems and prune it

4 for (k = 2; F k-1 Ø; k++) do

5 C k = candidateGen (F k-1 ); //generate the candidate ruleitems Ck

using the frequent ruleitems Fk-1

6 for each data case d D do //scan the database

7 C d = ruleSubset (C k , d); //find all the ruleitems in Ck whose condsets are supported by d

8 for each candidate c C d do9 c.condsupCount++;10 if d.class = c.class then

c.rulesupCount++; //update various support counts of the candidates in Ck

11 end12 end

23/4/21 AA12 关联规则史忠植 121

RG: 算法 (cont.)13 F k = {c C k | c.rulesupCount minsup};

//select those new frequent ruleitems to form Fk

14 CAR k = genRules(F k ); //select the ruleitems both accurate and frequent

15 prCAR k = pruneRules(CAR k );

16 end

17 CARs = k CAR k ;

18 prCARs = k prCAR k ;

23/4/21 AA12 关联规则史忠植 122

分类构建器 M1: 基本概念

给定两个规则 ri and rj, 定义 : ri rj 当 ri 的可信度大于 rj的 , 或者它们的可信度相同 ,但 ri 的支持度大于 rj的 , 或者它们的可信度与支持度都相同 , 但 ri 比 rj 生成的早 .

我们的分类器如下面格式所示 : <r1, r2, …, rn, default_class>,

where ri R, ra rb if b>a

23/4/21 AA12 关联规则史忠植 123

M1: 三个步骤

基本思想是选择 R 中一个优先规则 (high precedence) 的集合来覆盖 D.

对生成的规则集 R 排序 . 根据已排好的序列从 R 中选择为分类所用的规则并放入 C 每个选择的规则必须正确分类至少一个增加的事例 (case). 并选择默认的属性和计算误差 .

抛弃 C 中的不能改进分类器的准备度的规则 . 保留那些最小误差的规则的位置 ,抛弃序列中的其余规则 .

23/4/21 AA12 关联规则史忠植 124

M1: Algorithm 1 R = sort(R); //Step1:sort R according to the relation “” 2 for each rule r R in sequence do 3 temp = Ø; 4 for each case d D do //go through D to find those cases covered by each rule r 5 if d satisfies the conditions of r then 6 store d.id in temp and mark r if it correctly classifies d; 7 if r is marked then 8 insert r at the end of C; //r will be a potential rule because it can correctly classify

at least one case d 9 delete all the cases with the ids in temp from D; 10 selecting a default class for the current C; //the majority class in the remaining data 11 compute the total number of errors of C; 12 end 13 end // Step 2 14 Find the first rule p in C with the lowest total number of errors and drop all the rules

after p in C; 15 Add the default class associated with p to end of C, and return C (our classifier). //Step

3

23/4/21 AA12 关联规则史忠植 125

M1: 满足的两个条件

Each training case is covered by the rule with the highest precedence among the rules that can cover the case.

Every rule in C correctly classifies at least one remaining training case when it is chosen.

23/4/21 AA12 关联规则史忠植 126

MOUCLAS Assumption: MOUCLAS algorithm assumes that the initial

association rules can be agglomerated into clustering regions

The implication of the form of the Mouclas Pattern (so called MP) :

Cluster(D)t y where Cluster(D)t is a cluster of D, t = 1 to

m, and y is a class label.

23/4/21 AA12 关联规则史忠植 127

MOUCLAS Definitions:

Frequency of Mouclas Patterns

Accuracy of Mouclas Patterns

Reliability of Mouclas Patterns

Task of Mouclas:

To discover MPs that have support and confidence greater than the user-specified minimum support threshold (called minsup), and minimum confidence threshold (called minconf) and minimum reliability threshold (called minR) respectively, and to construct a classifier based upon MPs.

23/4/21 AA12 关联规则史忠植 128

The MOUCLAS algorithm

Two steps:

Step 1. Discovery of the frequent and accurate and reliable MPs.

Step 2. Construction of a classifier, called De-MP, based on MPs.

23/4/21 AA12 关联规则史忠植 129


The core of the first step in the Mouclas algorithm is to find all cluster_rules that satisfy minsup and minconf and minR. Let C denote the dataset D after dimensionality reduction processing. A cluster_rule represents a MP, namely a rule:

cluset y,

where cluset is a set of itemsets from a cluster Cluster(C)t, y is a class label, y Y.

23/4/21 AA12 关联规则史忠植 130


Algorithm of the first step: Mouclas Mining frequent and accurate and reliable Mouclas patterns (MPs)

Input: A training transaction database, D; minimum support threshold (minsup); minimum confidence threshold (minconf) ; minimum reliability threshold (minR)

Output: A set of frequent and accurate and reliable Mouclas patterns (MPs)

23/4/21 AA12 关联规则史忠植 131


Methods:

(1) Reduce the dimensionality of transactions d, which efficiently reduces the data size by removing irrelevant or redundant attributes (or dimensions) from the training data, and

(2) Identify the clusters of database C for all transactions d after dimensionality reduction on attributes Aj in database C, based on the Mountain function, which is a fuzzy set membership function, and specially capable of transforming quantitative values of attributes in transactions into linguistic terms, and

(3) Generate a set of MPs that are both frequent and accurate, namely, which satisfy the user-specified minimum support (called minsup) and minimum confidence (called minconf) and minimum reliability (called minR) constraints.

23/4/21 AA12 关联规则史忠植 132


1 X = reduceDim (I); // reduce the dimensionality on the set of all items I of in D

2 Cluster(C)t = genCluster (C); // identify the complete clusters of C

3 for each Cluster(C)t do

E = genClusterrules(cluset, class); // generate a set of candidate cluster_rules

4 for each transaction d C do

5 Ed = genSubClusterrules (E, d); // find all the cluster_rules in E whose cluset are supported

by d

6 for each e Ed do

7 e. clusupCount++; // accumulate the clusupCount of the cluset of cluster_rule e

8 if d.class = e.class then e.cisupCount++ // accumulate the cisupCount of cluster_rule e

supported by d

9 end

10 end

11 F = {e E e.cisupCount minsupi }; // construct the set of frequent cluster_rules

12 MP = genRules (F); //generate MP using the genRules function by minconf and minR

13 end

14 MPs = ∪ MP; // discover the final set of MPs

23/4/21 AA12 关联规则史忠植 133


The task of the second step in Mouclas algorithm: Using a heuristic method to generate a classifier, named De-MP,

where the discovered MPs can cover D and are organized according to a decreasing precedence based on their confidence and support.

Suppose R be the set of frequent and accurate and reliable MPs which are generated in the past step, and MPdefault_class denotes the default class, which has the lowest precedence. We can then present the De-MP classifier in the form of

<MP1, MP2, …, MPn, MPdefault_class>,

where MPi R, i = 1 to n, MPa ≻MPb if n b > a 1 and a, b i, C ∪ cluset of MPi

23/4/21 AA12 关联规则史忠植 134


Algorithm: Mouclas constructing De-MP Classifier

Input: A training database after dimensionality reduction, C; The set of frequent and accurate and reliable Mouclas patterns (MPs)

Output: De-MP Classifier

Methods:

(1) Identify the order of all discovered MPs based on the definition of precedence and sequence them according to decreasing precedence order.

(2) Determine possible MPs for De-MP classifier from R following the descending sequence of MPs.

(3) Discard the MPs which cannot contribute to the improvement of the accuracy of the De-MP classifier and keep the final set of MPs to construct the De-MP classifier.

23/4/21 AA12 关联规则史忠植 135


1 R = sort(R); // sort MPs based on their precedence 2 for each MP R in sequence do3 temp = ;4 for each transaction d C do5 if d satisfies the cluset of MP then6 store d.ID in temp;7 if MP correctly classifies d then 8 insert MP at the end of L;9 delete the transaction who has ID in temp from C; 10 selecting a default class for the current L; // determine the default class

based on majority class of remaining transactions in C11 end12 compute the total number of errors of L; // compute the total number of errors

that are made by the current L and the default class 13 end14 Find the first MP in L with the lowest total number of errors and discard all the

MPs after the MP in L;15 Add the default class associated with the above mentioned first MP to end of L;16 De-MP classifier = L

23/4/21 AA12 关联规则史忠植 136

Example of MOUCLAS application

The well logging data sets include attributes (well logging curves) of GR (gamma ray), RDEV (deep resistivity), RMEV (shallow resistivity), RXO (flushed zone resistivity), RHOB (bulk density), NPHI (neutron porosity), PEF (photoelectric factor) and DT (sonic travel time).

A hypothetically useful MP may suggest a relation between well logging data and the class label of oil/gas formation since.

23/4/21 AA12 关联规则史忠植 137

Summary

关联规则挖掘是数据挖掘中的一个基本工具几个算法

Apriori: 利用一个可证明的数学性质来改进性能 FP-Growth: 不再生成候选项集 , 利用有效的数据

结构相关 (correlation) 规则 : 在统计学的基础上评价有

趣度基于约束 (constrain) 的关联规则挖掘

23/4/21 AA12 关联规则史忠植 138

References

Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers

R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.

J. Han, J. Pei, and Y. Yin: “Mining frequent patterns without candidate generation”. In Proc. ACM-SIGMOD’2000, pp. 1-12, Dallas, TX, May 2000.

23/4/21 AA12 关联规则史忠植 139

References

S. Brin, R. Motwani, and C. Silverstein. Beyond market basket:

Generalizing association rules to correlations. SIGMOD'97, 265-

276, Tucson, Arizona.

J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based,

Multidimensional Data Mining", COMPUTER (special issues on

Data Mining), 32(8): 46-50, 1999.

Craig A. Struble. Association Rule Mining. Slides, Marquette

University

Yalei Hao . Markus Stumptner . Gerald Quirchmayr. Qing He . Data

Mining by MOUCLAS Algorithm for Petroleum Reservoir

Characterization from Well Logging Data. AIAI2004.

23/4/21 AA12 关联规则史忠植 140

www.intsci.ac.cn/shizz/

Questions?!Questions?!

Documents

知识发现（数据挖掘 ) 第六章