An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets



Page 1: An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

Page 2

Data Mining

Discover hidden patterns, correlations, association rules, etc., in large data sets

When is the discovery interesting, important, significant?

We develop a rigorous mathematical/statistical approach

Page 3

Frequent Itemsets

Dataset D of transactions t_j, each a subset of a base set of items I (t_j ⊆ I).

Support of an itemset X = number of transactions that contain X.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

support({Beer, Diaper}) = 3
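The support count can be checked directly on the five example transactions. This is a minimal sketch; the names `transactions` and `support` are illustrative and do not come from the slides.

```python
# Transactions from the example table, keyed by TID.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}


def support(itemset, dataset):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in dataset.values() if items <= t)


print(support({"Beer", "Diaper"}, transactions))  # 3 (TIDs 2, 3 and 4)
```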

Page 4

Frequent Itemsets

Discover all itemsets with significant support.

Fundamental primitive in data mining applications

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

support({Beer, Diaper}) = 3

Page 5

Significance

What support level makes an itemset significantly frequent?

    Minimize false positive and false negative discoveries

    Improve “quality” of subsequent analyses

How to narrow the search to focus only on significant itemsets?

    Reduce the possibly exponential time search

Page 6

Statistical Model

Input: D = a dataset of t transactions over |I| = n items.

    For i ∈ I, let n(i) be the support of {i} in D.

    f_i = n(i)/t = frequency of i in D

H0 Model: a random dataset of t transactions over |I| = n items.

    Item i is included in transaction j with probability f_i, independently of all other events.
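A random dataset under the H0 model can be sampled as sketched below, assuming transactions are stored as Python sets. The function names `item_frequencies` and `sample_null_dataset` are hypothetical, not from the slides.

```python
import random


def item_frequencies(dataset, items):
    """f_i = n(i) / t: the fraction of transactions that contain item i."""
    t = len(dataset)
    return {i: sum(1 for tr in dataset if i in tr) / t for i in items}


def sample_null_dataset(freqs, t, rng):
    """Draw t transactions; item i joins each one independently with probability f_i."""
    return [{i for i, f in freqs.items() if rng.random() < f} for _ in range(t)]


observed = [{"Bread", "Milk"}, {"Bread", "Beer"}, {"Milk"}, {"Bread"}]
freqs = item_frequencies(observed, {"Bread", "Milk", "Beer"})
null_dataset = sample_null_dataset(freqs, len(observed), random.Random(0))
```

The sampled dataset keeps the number of transactions and the expected item frequencies of the input, but has no correlations between items.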

Page 7

Statistical Tests

H0: null hypothesis, the support of no itemset is significant with respect to the random dataset of the H0 model

H1: alternative hypothesis, the support of itemsets X1, X2, X3, … is significant. It is unlikely that their support comes from the distribution of the random dataset

Significance level: α = Prob(rejecting H0 when it is true)

Page 8

Naïve Approach

Let X = {x1, x2, …, xr}; f_X = ∏_j f_{x_j} is the probability that a given itemset is in a given transaction

s_X = support of X, distributed s_X ∼ B(t, f_X)

Reject H0 if: p-value = Prob(B(t, f_X) ≥ s_X) ≤ α
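The binomial tail in the rejection rule can be evaluated numerically. The sketch below sums the upper tail in log space so that it stays stable for large t. The helper name `binomial_tail` and the example parameters (t = 1,000,000, f_X = 10^-6, s_X = 7) are assumptions chosen for illustration.

```python
import math


def binomial_tail(t, p, s):
    """P(B(t, p) >= s), summing the upper tail in log space for stability."""
    if s <= 0:
        return 1.0
    if s > t or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(s, t + 1):
        log_pmf = (math.lgamma(t + 1) - math.lgamma(i + 1) - math.lgamma(t - i + 1)
                   + i * math.log(p) + (t - i) * math.log1p(-p))
        term = math.exp(log_pmf)
        total += term
        # Past the mean the terms only shrink; stop once they no longer matter.
        if i >= t * p and term <= total * 1e-17:
            break
    return total


alpha = 0.05
p_value = binomial_tail(1_000_000, 1e-6, 7)
print(p_value, p_value <= alpha)
```

With these parameters the p-value is a little below 0.0001, so the naive test rejects H0 for this single itemset.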

Page 9

Naïve Approach

Variations:

    R = support / E[support in the random dataset]

    R = support − E[support in the random dataset]

    Z-value = (s − E[s]) / σ[s]

    many more…

Page 10

What’s wrong? – example

D has 1,000,000 transactions over 1000 items; each item has frequency 1/1000.

We observed that a pair {i,j} appears 7 times. Is this pair statistically significant?

In a random dataset drawn from the H0 model: E[support({i,j})] = 1 and Prob({i,j} has support ≥ 7) ≃ 0.0001

p-value 0.0001 - must be significant!

Page 11

What’s wrong? – example

There are 499,500 pairs; each has probability 0.0001 of appearing in at least 7 transactions of a random dataset

The expected number of pairs with support ≥ 7 in a random dataset is ≃ 50: not such a rare event!

Many false positive discoveries (flagging itemsets that are not significant)

Need to correct for multiplicity of hypotheses.
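The arithmetic behind the ≃ 50 figure is short enough to check. The sketch below uses the rounded per-pair probability 0.0001 quoted on the slide; the variable names are illustrative.

```python
import math

n_items = 1000
pairs = math.comb(n_items, 2)   # number of candidate pairs {i, j}
p_single = 0.0001               # rounded per-pair tail probability from the slide
expected_false_hits = pairs * p_single
print(pairs, expected_false_hits)
```

Roughly 50 pairs are expected to reach support 7 purely by chance, so a per-pair p-value of 0.0001 is not evidence of significance on its own.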

Page 12

Multi-Hypothesis test

Testing for significant itemsets of size k involves testing simultaneously for m = C(n, k) null hypotheses (one per itemset of size k over n items).

H0(X) = support of X conforms with the random dataset of the H0 model

s_X = support of X, distributed: s_X ∼ B(t, f_X)

How to combine m tests while minimizing false positive and false negative discoveries?

Page 13

Family Wise Error Rate (FWER)

Family Wise Error Rate (FWER) = probability of at least one false positive (flagging a non-significant itemset as significant)

Bonferroni method (union bound): test each null hypothesis with significance level α/m

Too conservative: many false negatives, i.e. it does not flag many significant itemsets.
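The Bonferroni rule amounts to one comparison per hypothesis. This is a minimal sketch; the function name `bonferroni_reject` and the sample p-values are made up for illustration.

```python
def bonferroni_reject(p_values, alpha):
    """Reject hypothesis j iff its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


decisions = bonferroni_reject([0.001, 0.02, 0.04], alpha=0.05)
print(decisions)  # [True, False, False]
```

With m = 3 the per-test threshold is 0.05/3 ≈ 0.0167, so only the first hypothesis is rejected. With m in the hundreds of thousands, the threshold becomes extremely small.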

Page 14

False Discovery Rate (FDR)

Less conservative approach

    V = number of false positive discoveries

    R = total number of rejected null hypotheses = number of itemsets flagged as significant

FDR = E[V/R] (with V/R = 0 when R = 0)

Test with level of significance α: reject the maximum number of null hypotheses such that FDR ≤ α

Page 15

Standard Multi-Hypothesis test

Page 16

Standard Multi-Hypothesis test

Less conservative than Bonferroni method: the i-th smallest p-value is compared with i·α/m instead of α/m

For m = C(n, k), it still needs a very small individual p-value to reject a hypothesis
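The body of the slide defining the standard test did not survive extraction. The threshold i·α/m matches the Benjamini-Hochberg step-up procedure, so the sketch below assumes that procedure. The function name `step_up_reject` and the sample p-values are illustrative.

```python
def step_up_reject(p_values, alpha):
    """Step-up rule (Benjamini-Hochberg form, assumed): find the largest rank i
    whose i-th smallest p-value is <= i * alpha / m, then reject the i smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    cutoff = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for j in order[:cutoff]:
        rejected[j] = True
    return rejected


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = step_up_reject(p_values, alpha=0.05)
print(sum(rejected))  # 2
```

On these eight p-values the step-up rule rejects two hypotheses, while the Bonferroni threshold α/m = 0.00625 would reject only the first.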

Page 17

Alternative Approach

Q(k, s_i) = observed number of itemsets of size k and support ≥ s_i

p-value = the probability of seeing at least Q(k, s_i) such itemsets in a random dataset drawn from the H0 model

Fewer hypotheses

How to compute the p-value? What is the distribution of the number of itemsets of size k and support ≥ s_i in a random dataset?

Page 18

Permutation Test

Simulations to estimate the probabilities: choose a data set at random and count

Main problem: m = C(n, k) hypotheses

    small probabilities are needed to reject a hypothesis

    a lot of simulations are needed to estimate those probabilities

Page 19

Main Contributions

Poisson approximation: let Q_{k,s} = number of itemsets of size k and support at least s in a random dataset drawn from the H0 model. For s ≥ s_min:

Q_{k,s} is well approximated by a Poisson distribution.

Based on the Poisson approximation – a powerful FDR multi-hypothesis test for significant frequent itemsets.

Page 20

Chen-Stein Method

A powerful technique for approximating the sum of dependent Bernoulli variables.

For an itemset X of k items let Z_X = 1 if X has support at least s, else Z_X = 0

Q_{k,s} = ∑_X Z_X (X of k items)

U ∼ Poisson(λ), with λ = E[Q_{k,s}]

I(X) = {Y : |Y| = k, Y ∩ X ≠ ∅}
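The formulas that followed on the slides were lost in extraction. For reference, the standard Chen-Stein bound (in the form given by Arratia, Goldstein and Gordon) can be written as below. This is the textbook statement applied to the quantities defined above, under the assumption that it is what the slides showed. The constant on the right-hand side depends on how the distance is normalized.

```latex
\[
  \lambda = \mathbb{E}[Q_{k,s}] = \sum_{X} p_X, \qquad p_X = \Pr(Z_X = 1),
\]
\[
  b_1 = \sum_{X} \sum_{Y \in I(X)} p_X \, p_Y, \qquad
  b_2 = \sum_{X} \sum_{\substack{Y \in I(X) \\ Y \neq X}} \mathbb{E}[Z_X Z_Y],
\]
\[
  \sup_{A} \bigl| \Pr(Q_{k,s} \in A) - \Pr(U \in A) \bigr| \;\le\; b_1 + b_2 .
\]
```

The general theorem has a third term b_3 that measures the dependence of Z_X on the variables outside I(X). Under the H0 model, itemsets disjoint from X are independent of X, so that term vanishes.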

Page 21

Chen-Stein Method (2)

Page 22

Approximation Result

Q_{k,s} is well approximated by a Poisson distribution for s ≥ s_min

Page 23

Monte-Carlo Estimate

To determine s_min for a given set of parameters (n, t, f_i):

    Choose m random datasets with the required parameters.

    For each dataset extract all itemsets with support at least s’ (≤ s_min)

    Find the minimum s such that Prob(b1(s) + b2(s) ≤ ε) ≥ 1 − δ

Page 24

New Statistical Test

Instead of testing the significance of the support of individual itemsets we test the significance of the number of itemsets with a given support

The null hypothesis distribution is specified by the Poisson approximation result

Reduces the number of simultaneous tests

More powerful test – fewer false negatives

Page 25

Test I

Define α_1, α_2, α_3, … such that ∑ α_i ≤ α

For i = 0, …, log(s_max − s_min) + 1:

s_i = s_min + 2^i

Q(k, si) = observed number of itemsets of size k and support ≥ si

H0(k,si) = “Q(k,si) conforms with Poisson(λi)”

Reject H0(k,si) if p-value < αi

Page 26

Test I

Let s* be the smallest s such that H0(k, s) is rejected by Test I

At significance level α, the number of itemsets with support ≥ s* is significant

Some itemsets with support ≥ s* could still be false positive
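Test I can be sketched as follows, assuming the Poisson means λ_i are already known (for example from simulation) and that the logarithm in the threshold schedule is base 2. The names `poisson_tail`, `support_thresholds` and `apply_test_one` and all example numbers are hypothetical.

```python
import math


def poisson_tail(q, lam):
    """P(Poisson(lam) >= q) for lam > 0, summing the upper tail in log space."""
    if q <= 0:
        return 1.0
    total, j = 0.0, q
    while True:
        term = math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
        total += term
        # Past the mean the terms only shrink; stop once they no longer matter.
        if j >= lam and term <= total * 1e-17:
            return total
        j += 1


def support_thresholds(s_min, s_max):
    """s_i = s_min + 2**i for i = 0, ..., floor(log2(s_max - s_min)) + 1."""
    last = int(math.log2(s_max - s_min)) + 1
    return [s_min + 2 ** i for i in range(last + 1)]


def apply_test_one(supports, q_obs, lambdas, alphas):
    """Return s*, the smallest s_i whose H0(k, s_i) is rejected, or None.
    `supports` must be in increasing order."""
    for s, q, lam, a in zip(supports, q_obs, lambdas, alphas):
        if poisson_tail(q, lam) < a:
            return s
    return None


alphas = [0.05 / 3] * 3   # sum of alpha_i equals alpha = 0.05
s_star = apply_test_one([11, 12, 14], [6, 5, 4], [5.0, 1.0, 0.1], alphas)
print(s_star)  # 12
```

In this made-up example, 6 itemsets against a Poisson mean of 5 is unremarkable, while 5 itemsets against a mean of 1 has p-value of about 0.0037, so the first rejection occurs at s = 12.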

Page 27

Test II

Define β1, β2, β3,… such that ∑ βi≤ β

Reject H0 (k,si) if: p-value < αi and Q(k,si)≥ λi / βi

Let s* be the minimum s such that H0(k,s) was rejected

If we flag all itemsets with support ≥ s* as significant, FDR ≤ β
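Test II adds the count condition Q(k, s_i) ≥ λ_i / β_i to the p-value condition. The sketch below assumes the Poisson means λ_i are given; the names `poisson_tail` and `apply_test_two` and the example numbers are hypothetical.

```python
import math


def poisson_tail(q, lam):
    """P(Poisson(lam) >= q) for lam > 0, summing the upper tail in log space."""
    if q <= 0:
        return 1.0
    total, j = 0.0, q
    while True:
        term = math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
        total += term
        # Past the mean the terms only shrink; stop once they no longer matter.
        if j >= lam and term <= total * 1e-17:
            return total
        j += 1


def apply_test_two(supports, q_obs, lambdas, alphas, betas):
    """Return s*, the smallest s_i with p-value < alpha_i and Q >= lambda_i / beta_i,
    or None. `supports` must be in increasing order."""
    for s, q, lam, a, b in zip(supports, q_obs, lambdas, alphas, betas):
        if poisson_tail(q, lam) < a and q >= lam / b:
            return s
    return None


alphas = [0.05 / 3] * 3   # sum of alpha_i equals alpha = 0.05
betas = [0.05 / 3] * 3    # sum of beta_i equals beta = 0.05
s_star = apply_test_two([11, 12, 14], [40, 30, 8], [5.0, 1.0, 0.1], alphas, betas)
print(s_star)  # 14
```

In this made-up example the p-value condition alone already holds at s = 11, but 40 < 5/β_1 = 300 and 30 < 1/β_2 = 60, so the first rejection occurs at s = 14, where 8 ≥ 0.1/β_3 ≈ 6.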

Page 28

Proof

Vi = false discoveries if H0(k,si) first rejected

Ei = “H0(k,si) rejected”

Page 29

Real Datasets

FIMI repository: http://fimi.cs.helsinki.fi/data/

    standard benchmarks

    m = avg. transaction length

Page 30

Experimental Results Poisson approximation

Poisson “regime” ≠ no itemsets expected

Page 31

Experimental Results Poisson approximation

not approximating the p-values of itemsets as hypotheses (small!)

finding the minimum s such that: Prob(b1(s) + b2(s) ≤ ε) ≥ 1 − δ

    fewer simulations

    less time per simulation (“few” itemsets)

Page 32

Experimental Results Test II: α = 0.05, β = 0.05

Rk,s* = num. itemsets of size k with support ≥ s*

Itemset of size 154 with support ≥ 7

Page 33

Experimental Results

Standard Multi-Hypothesis test: β = 0.05

R = size of the output of the Standard Multi-Hypothesis test

R_{k,s*} = size of the output of Test II

Page 34

Collaborators

Adam Kirsch (Harvard)

Michael Mitzenmacher (Harvard)

Andrea Pietracaprina (U. of Padova)

Geppino Pucci (U. of Padova)

Eli Upfal (Brown U.)

Fabio Vandin (U. of Padova)