Parsimonious Explanations of Change in Hierarchical Data Dhiman Barman, Flip Korn, Divesh Srivastava, Dimitrios Gunopulos, Neal Young and Deepak Agarwal


Page 1: Parsimonious Explanations of Change in Hierarchical Data

Dhiman Barman, Flip Korn, Divesh Srivastava, Dimitrios Gunopulos, Neal Young and Deepak Agarwal

Page 2: Background

• Dimensions in data warehouses are hierarchical
  • Variety of applications aggregate along hierarchy
  • e.g., population summarized by geographic location (state/county/city/zip)
• Existing OLAP tools for static data summarize and navigate via drill-down and roll-up operators
  • e.g., population of each city for the year 2005
• Want to summarize and explain changes
  • e.g., population in 2004 compared to population in 2005 across different locations

Page 3: Motivation

[Figure: Census 2004 and Census 2005 population tables]

• Output changes, e.g., whole California population doubled except LA County, which tripled

Page 4: Hierarchical Representation

• Two hierarchical summaries T1 and T2 correspond to two different snapshots in time
• Naïve: take point-to-point ratios R
  • Verbose and non-hierarchical
• Want holistic as well as hierarchical explanations

[Figure: trees T1 and T2 with leaf counts and aggregate counts, and the tree R of point-to-point ratios, e.g., 200/120 = 1.67 at the root]

Page 5: Problem Context

• Explanations can be verbose or parsimonious
• Census data of US population
  • Hierarchically organized as (state/county/city/zip)
  • 50 states, 81,000 leaves, 130,000 nodes

[Figure: example hierarchy with levels state (California), county (San Bernardino, LA County), city (Fontana, Victorville, LA, Pasadena) and zip codes (92334, 91729, 90001, 90002, 91101), annotated with counts]

Page 6: Ratio Tree

• Given trees T1 and T2, the Ratio Tree is a tree with the same structure as T1 or T2, where the value in leaf l equals value(l, T2)/value(l, T1)
• Assume the two trees are isomorphic and leaf counts > 0

[Figure: T1, T2 and the resulting Ratio Tree with leaf ratios 3, 3, 1, 2, 1]
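The definition above can be sketched in a few lines of Python. This is only an illustration: the nested-list encoding (an internal node is a list of children, a leaf is a count) and the function name `ratio_tree` are assumptions, and the leaf counts are the ones that appear in the example figure, placed in the tree as best they can be read.

```python
from fractions import Fraction

# Leaf counts from the running example, nested as state/county/city/zip.
# The placement of each count in the tree is an assumption.
T1 = [[[10], [20]], [[30, 20], [40]]]
T2 = [[[30], [60]], [[30, 40], [40]]]

def ratio_tree(t1, t2):
    """Return a tree shaped like t1 and t2 whose leaves hold value(l, T2) / value(l, T1)."""
    if isinstance(t1, list):
        if not isinstance(t2, list) or len(t1) != len(t2):
            raise ValueError("trees are not isomorphic")
        return [ratio_tree(a, b) for a, b in zip(t1, t2)]
    if isinstance(t2, list):
        raise ValueError("trees are not isomorphic")
    if t1 <= 0 or t2 <= 0:
        raise ValueError("leaf counts must be > 0")
    return Fraction(t2, t1)  # exact ratio, no floating-point rounding

R = ratio_tree(T1, T2)
print(R)
```

Exact fractions are used so that a later test such as "ratio equals 1" is not affected by rounding.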

Page 7: Naïve Solution: Bottom-up

• Not hierarchical
• Leaf weight is the ratio of corresponding leaf counts
• Non-leaf weights are 1; not part of explanation
• Verbose if a significant number of leaves have different counts

[Figure: Ratio Tree with leaf weights 3, 3, 1, 2, 1]

• 3 explanations found
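As a rough sketch of the bottom-up approach, the snippet below walks the leaves of the example ratio tree and reports every leaf whose ratio differs from 1. The nested-list encoding and the names `leaves` and `bottom_up_explanations` are illustrative assumptions.

```python
from fractions import Fraction

# Leaf ratios of the example ratio tree, nested as state/county/city/zip.
R = [[[3], [3]], [[1, 2], [1]]]

def leaves(tree):
    """Yield leaf values from left to right."""
    if isinstance(tree, list):
        for child in tree:
            yield from leaves(child)
    else:
        yield Fraction(tree)

def bottom_up_explanations(ratio_tree):
    """Bottom-up: each leaf with a ratio other than 1 is its own explanation."""
    return [(i, r) for i, r in enumerate(leaves(ratio_tree)) if r != 1]

explanations = bottom_up_explanations(R)
print(len(explanations))  # 3 for the example tree
```

The two leaves with ratio 3 share a parent county, yet they are reported separately, which is why this approach becomes verbose.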

Page 8: Naïve Solution: Top-down

• Hierarchical but not holistic
• Compute aggregate of subtree leaves for each node
• Root-to-node product of weights equals the node's aggregate ratio, for each node

[Figure: Ratio Tree annotated with top-down node weights, starting with 5/3 at the root]

• 7 explanations found
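The top-down rule can be sketched as follows: each node receives its aggregate ratio divided by its parent's aggregate ratio, so that the weights multiply out to the aggregate ratio along every root-to-node path. The encoding (nested lists, nodes addressed by a tuple of child indices) and the helper names are assumptions made for this illustration.

```python
from fractions import Fraction

# Leaf counts from the running example; their placement in the tree is an assumption.
T1 = [[[10], [20]], [[30, 20], [40]]]
T2 = [[[30], [60]], [[30, 40], [40]]]

def total(tree):
    """Aggregate count: sum of the leaf counts in the subtree."""
    return sum(total(c) for c in tree) if isinstance(tree, list) else tree

def top_down_weights(t1, t2, parent_ratio=Fraction(1), path=()):
    """Map each node (addressed by its path of child indices) to the weight that
    makes the root-to-node product equal total(T2) / total(T1) at that node."""
    ratio = Fraction(total(t2), total(t1))
    weights = {path: ratio / parent_ratio}
    if isinstance(t1, list):
        for i, (a, b) in enumerate(zip(t1, t2)):
            weights.update(top_down_weights(a, b, ratio, path + (i,)))
    return weights

weights = top_down_weights(T1, T2)
explanations = {p: w for p, w in weights.items() if w != 1}
print(weights[()], len(explanations))  # root weight and number of non-1 weights
```

On the example, the root weight is 200/120 = 5/3 and seven of the twelve nodes end up with a weight other than 1.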

Page 9: DIFF Solution [Sarawagi’99]

• DIFF operator is not parsimonious
• Tries to adjust whole tree while finding explanations
• 7 explanation weights with k=1; 4 with k=1.5

[Figure: DIFF weights on the example tree for k=1 and k=1.5]

Page 10: Finding Parsimonious Solutions

• Root-to-leaf explanations along an ancestor path P(n) from root to node n
  • $\mathrm{count}(\mathit{leaf}, T_2) = \prod_{n \in P(\mathit{leaf})} \mathrm{weight}(n) \cdot \mathrm{count}(\mathit{leaf}, T_1)$
• Underconstrained system of equations
• Parsimonious explanations: weight assignment with minimum number of non-1s
• Tolerance parameter k accommodates noise
  • Only weights $\notin [1/k, k]$, $k \ge 1$, are reported as explanations
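A small check may make the two conditions concrete: a weight assignment is valid when the product of weights along each root-to-leaf path turns the T1 count into the T2 count, and its cost is the number of weights outside [1/k, k]. The sketch below assumes a hypothetical encoding in which a node is a `(weight, children)` pair. The two assignments use the weight values shown for the running example with k=1 and k=1.5; where each weight sits in the tree is an assumption.

```python
from fractions import Fraction as F

# Hypothetical encoding: a node is (weight, [children]); a leaf has no children.
W_K1 = (1, [(3, [(1, [(1, [])]), (1, [(1, [])])]),
            (1, [(1, [(1, []), (2, [])]), (1, [(1, [])])])])
h = F(3, 2)
W_K15 = (F(8, 9), [(h, [(h, [(h, [])]), (h, [(h, [])])]),
                   (1, [(h, [(F(3, 4), []), (h, [])]), (F(9, 8), [(1, [])])])])
T1_LEAVES = [10, 20, 30, 20, 40]
T2_LEAVES = [30, 60, 30, 40, 40]

def leaf_products(node, acc=F(1)):
    """Yield the product of weights along each root-to-leaf path, left to right."""
    weight, children = node
    acc = acc * weight
    if not children:
        yield acc
    for child in children:
        yield from leaf_products(child, acc)

def all_weights(node):
    """Yield the weight of every node in the tree."""
    weight, children = node
    yield F(weight)
    for child in children:
        yield from all_weights(child)

def is_consistent(tree, t1_leaves, t2_leaves):
    """True if count(leaf, T2) == path product * count(leaf, T1) for every leaf."""
    products = list(leaf_products(tree))
    return len(products) == len(t1_leaves) == len(t2_leaves) and all(
        p * c1 == c2 for p, c1, c2 in zip(products, t1_leaves, t2_leaves))

def count_explanations(tree, k=1):
    """Number of weights outside the tolerance interval [1/k, k]."""
    k = F(k)
    return sum(1 for w in all_weights(tree) if not (1 / k <= w <= k))

print(is_consistent(W_K1, T1_LEAVES, T2_LEAVES), count_explanations(W_K1, 1))
print(is_consistent(W_K15, T1_LEAVES, T2_LEAVES), count_explanations(W_K15, h))
```

Both assignments satisfy the path-product equation. The first has two weights other than 1, while every weight of the second lies inside [2/3, 3/2], so it reports no explanations at k=1.5.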

Page 11: Weight Assignment

[Figure: weight tree with root w0; subtree w1 with children w11 (leaf w111) and w12 (leaf w121); subtree w2 with children w21 (leaves w211, w212) and w22 (leaf w221); leaf ratios r1, r2, r3, r4, r5]

$r_2 = w_0 \times w_1 \times w_{12} \times w_{121}$

[Formula relating the weights w to the tolerance k; not recoverable from the extracted text]

Page 12: Parsimonious: Example

• Parsimonious model explains changes optimally
• k > 1 case captures similar changes among leaves having near-equal aggregates
• 2 explanation weights with k=1; none with k=1.5

[Figure: parsimonious weights on the example tree. k=1: all weights are 1 except one weight of 3 and one weight of 2. k=1.5: weights 8/9, 3/2, 3/4, 9/8 and 1, all within [1/k, k]]
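For the k=1 case, the minimum number of non-1 weights can be found with a small dynamic program over the ratio tree. The sketch below is not taken from the slides and is not claimed to be the authors' algorithm. It rests on the observation that, with k=1, only the running product along a path matters, and that this product can be restricted to the distinct leaf ratios plus 1 without losing optimality. The encoding and names are assumptions.

```python
from fractions import Fraction

# Leaf ratios of the example ratio tree, nested as state/county/city/zip.
R = [[[3], [3]], [[1, 2], [1]]]

def leaf_values(tree):
    """Yield leaf ratios from left to right."""
    if isinstance(tree, list):
        for child in tree:
            yield from leaf_values(child)
    else:
        yield Fraction(tree)

def min_explanations(ratio_tree):
    """Minimum number of non-1 weights for k = 1.

    cost(node)[p] is the fewest non-1 weights strictly below `node` when the
    product of weights from the root down to and including `node` equals p.
    """
    states = set(leaf_values(ratio_tree)) | {Fraction(1)}
    inf = float("inf")

    def cost(node):
        if not isinstance(node, list):
            # Leaf: the path product must equal the leaf ratio exactly.
            return {p: (0 if p == node else inf) for p in states}
        table = {p: 0 for p in states}
        for child in node:
            child_cost = cost(child)
            for p in states:
                # The child keeps product p (weight 1, free) or moves to q (one non-1 weight).
                table[p] += min(child_cost[q] + (q != p) for q in states)
        return table

    root = cost(ratio_tree)
    # The root's own weight equals p, which counts unless p == 1.
    return min(root[p] + (p != 1) for p in states)

print(min_explanations(R))  # 2 for the example tree
```

On the example this returns 2, matching the count stated for k=1. The running time grows with the square of the number of distinct leaf ratios, so this is meant for small illustrations rather than the full census hierarchy.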

Page 13: Scale-up

• #explanations vs. #leaves
• k=1.05 has smallest #exp
  • Population counts do not change > 5% in 4-5 yrs
• Bottom-up close to parsimonious solution with k=1
  • Leaf counts are different in many leaves

[Figure: 8 samples from the same ratio tree; comparison between Year 2003 and 2004 of Census data]

Page 14: Effect of Threshold Parameter

• Populations for years 2003 and 2004 are compared
• #explanations decreases dramatically with k
• k=1 is a special case
• Extra tolerance for grouping similar ratios

Page 15: Stability

• $S_h^{k}$ = set of nodes at level h where “explanations” occur
• $S_c$ = stability of output at level h as tolerance changes from k − δk to k
• Average stability across all levels h
• Stability is > 0.6

$S_c = \frac{|S_h^{k} \cap S_h^{k-\delta k}|}{|S_h^{k}|}$
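A minimal sketch of how such a stability score could be computed, assuming it is the fraction of level-h explanation nodes at tolerance k that were also explanation nodes at the slightly smaller tolerance. That reading of the formula, the function names, the node identifiers and the convention for an empty set are all assumptions made for this illustration.

```python
def stability(explained_at_k, explained_at_smaller_k):
    """Overlap fraction |A ∩ B| / |A| for one level (assumed form of the metric)."""
    s_k = set(explained_at_k)
    if not s_k:
        return 1.0  # convention chosen here: an empty output counts as stable
    return len(s_k & set(explained_at_smaller_k)) / len(s_k)

def average_stability(per_level):
    """Average over levels; per_level maps level h to a pair of node sets."""
    values = [stability(a, b) for a, b in per_level.values()]
    return sum(values) / len(values)

# Hypothetical node identifiers, for illustration only.
per_level = {
    1: ({"CA"}, {"CA"}),
    2: ({"LA County", "San Bernardino", "Kern"}, {"LA County", "San Bernardino"}),
}
print(average_stability(per_level))
```

A score of 1 at a level means every explanation node reported at tolerance k was already reported at the smaller tolerance.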

Page 16: Future Work

• Global Budget on Error Tolerance
  • Bound on sum of node tolerances, where individual node weights can be unequally distributed
• Using a Prediction Model
  • Statistical model provides predictions and confidence intervals on counts, which are compared to observed counts
• Multiple Dimension Hierarchies
  • e.g., geography x Dewey Decimal Number