Parsimonious Explanations of Change in Hierarchical Data
Dhiman Barman, Flip Korn, Divesh Srivastava, Dimitrios Gunopulos, Neal Young, and Deepak Agarwal
Background
Dimensions in data warehouses are hierarchical
Variety of applications aggregate along the hierarchy
  e.g., population summarized by geographic location (state/county/city/zip)
Existing OLAP tools for static data summarize and navigate via drill-down and roll-up operators
  e.g., population of each city for the year 2005
Want to summarize and explain changes
  e.g., population in 2004 compared to population in 2005 across different locations
Motivation
[Figure: Census 2004 and Census 2005 snapshots]
Output changes, e.g., the population of the whole of California doubled, except LA County, which tripled
Hierarchical Representation
Two hierarchical summaries, T1 and T2, correspond to two different snapshots in time
Naïve: take point-to-point ratios R
  Verbose and non-hierarchical
Want holistic as well as hierarchical explanations
[Figure: trees T1 and T2 annotated with leaf counts and aggregate counts, and the tree R of point-to-point ratios T2/T1. T1 leaf counts: 10, 20, 30, 20, 40 (root aggregate 120). T2 leaf counts: 30, 60, 30, 40, 40 (root aggregate 200).]
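The point-to-point ratios can be reproduced with a short sketch. The node names (`root`, `A`, `z1`, ...) and the dictionary encoding are hypothetical; the leaf counts and the tree shape are read off the example figure (one root, two second-level nodes, four third-level nodes, five leaves).

```python
from fractions import Fraction

# Hypothetical encoding of the example hierarchy: node -> list of children.
CHILDREN = {
    "root": ["A", "B"],
    "A": ["a1", "a2"], "B": ["b1", "b2"],
    "a1": ["z1"], "a2": ["z2"], "b1": ["z3", "z4"], "b2": ["z5"],
}
# Leaf counts in the two snapshots, as read from the example figure.
T1_LEAVES = {"z1": 10, "z2": 20, "z3": 30, "z4": 20, "z5": 40}
T2_LEAVES = {"z1": 30, "z2": 60, "z3": 30, "z4": 40, "z5": 40}

def aggregates(leaf_counts):
    """Return {node: sum of the leaf counts in that node's subtree}."""
    totals = {}
    def visit(node):
        kids = CHILDREN.get(node, [])
        totals[node] = leaf_counts[node] if not kids else sum(visit(c) for c in kids)
        return totals[node]
    visit("root")
    return totals

agg1, agg2 = aggregates(T1_LEAVES), aggregates(T2_LEAVES)
# Point-to-point ratio R at every node: aggregate in T2 over aggregate in T1.
ratios = {node: Fraction(agg2[node], agg1[node]) for node in agg1}
print(ratios["root"], ratios["B"], ratios["A"], ratios["b1"])  # 5/3 11/9 3 7/5
```

Every node gets its own ratio, which is why this representation is verbose: nothing in it says that the change at `a1` and `a2` is already accounted for by the change at `A`.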
Problem Context
Explanations can be verbose or parsimonious
Census data of US population
  Hierarchically organized as (state/county/city/zip)
  50 states, 81,000 leaves, 130,000 nodes
[Figure: example hierarchy annotated with counts; levels are state (California), county (San Bernardino, LA County), city (Fontana, Victorville, LA, Pasadena), and zip codes (92334, 91729, 90001, 90002, 91101)]
Ratio Tree
Given trees T1 and T2, the Ratio Tree is a tree with the same structure as T1 or T2, in which the value in leaf l equals value(l, T2)/value(l, T1)
Assume the two trees are isomorphic and all leaf counts are > 0
[Figure: T1, T2, and the Ratio Tree T2/T1, whose leaf values are 3, 3, 1, 2, 1]
Naïve Solution: Bottom-up
Not hierarchical
Leaf weight is the ratio of corresponding leaf counts
Non-leaf weights are 1; not part of the explanation
Verbose if a significant number of leaves have different counts
[Figure: Ratio Tree with leaf weights 3, 3, 1, 2, 1]
• 3 explanations found
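As a minimal sketch of the bottom-up approach, the count of 3 can be checked directly from the example leaf counts. The leaf names `z1` to `z5` are hypothetical labels for the five leaves, in left-to-right order.

```python
from fractions import Fraction

# Leaf counts of the example trees; leaf names are hypothetical labels.
T1_LEAVES = {"z1": 10, "z2": 20, "z3": 30, "z4": 20, "z5": 40}
T2_LEAVES = {"z1": 30, "z2": 60, "z3": 30, "z4": 40, "z5": 40}

def bottom_up_explanations(t1, t2):
    """Leaf weight = T2 count / T1 count; only weights different from 1 are reported."""
    weights = {leaf: Fraction(t2[leaf], t1[leaf]) for leaf in t1}
    return {leaf: w for leaf, w in weights.items() if w != 1}

explanations = bottom_up_explanations(T1_LEAVES, T2_LEAVES)
print(len(explanations))  # 3
```

Internal nodes never appear in the result, so two sibling leaves that both tripled are reported separately rather than once at their common ancestor.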
Naïve Solution: Top-down
Hierarchical but not holistic
Compute the aggregate of subtree leaves for each node
For each node, the root-to-node product of weights equals the ratio of the node's aggregates (T2/T1)
[Figure: Ratio Tree annotated with top-down node weights; root weight 5/3]
• 7 explanations found
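The top-down weights can be derived mechanically: each node's weight is its own aggregate ratio divided by its parent's aggregate ratio. The sketch below assumes the example leaf counts and a hypothetical node naming, and recovers the count of 7.

```python
from fractions import Fraction

# Hypothetical encoding of the example hierarchy: node -> list of children.
CHILDREN = {
    "root": ["A", "B"],
    "A": ["a1", "a2"], "B": ["b1", "b2"],
    "a1": ["z1"], "a2": ["z2"], "b1": ["z3", "z4"], "b2": ["z5"],
}
T1_LEAVES = {"z1": 10, "z2": 20, "z3": 30, "z4": 20, "z5": 40}
T2_LEAVES = {"z1": 30, "z2": 60, "z3": 30, "z4": 40, "z5": 40}

def aggregates(leaf_counts):
    """Return {node: sum of the leaf counts in that node's subtree}."""
    totals = {}
    def visit(node):
        kids = CHILDREN.get(node, [])
        totals[node] = leaf_counts[node] if not kids else sum(visit(c) for c in kids)
        return totals[node]
    visit("root")
    return totals

def top_down_weights(agg1, agg2):
    """Assign weights so the root-to-node product equals agg2[node] / agg1[node]."""
    weights = {}
    def visit(node, parent_ratio):
        ratio = Fraction(agg2[node], agg1[node])
        weights[node] = ratio / parent_ratio  # what is left after the ancestors' effect
        for child in CHILDREN.get(node, []):
            visit(child, ratio)
    visit("root", Fraction(1))
    return weights

weights = top_down_weights(aggregates(T1_LEAVES), aggregates(T2_LEAVES))
non_ones = {node: w for node, w in weights.items() if w != 1}
print(len(non_ones))  # 7
```

The root's 5/3 forces corrections further down, for example 11/15 at `B` and 9/11 at `b2`, even where the leaf counts themselves did not change.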
DIFF Solution [Sarawagi’99]
DIFF operator is not parsimonious
Tries to adjust the whole tree while finding explanations
7 explanation weights with k=1; 4 with k=1.5
[Figure: DIFF weights on the example tree for k=1 and k=1.5]
Finding Parsimonious Solutions
Root-to-leaf explanations along an ancestor path P(n) from the root to node n
  $\mathrm{count}(\mathit{leaf}, T_2) = \Big(\prod_{n \in P(\mathit{leaf})} \mathrm{weight}(n)\Big) \cdot \mathrm{count}(\mathit{leaf}, T_1)$
Underconstrained system of equations
Parsimonious explanations: weight assignment with the minimum number of non-1s
Tolerance parameter k: weights outside [1/k, k], k ≥ 1, are reported as explanations, to accommodate noise
Weight Assignment
[Figure: tree with node weights w0; w1, w2; w11, w12, w21, w22; w111, w121, w211, w212, w221, and leaf ratios r1 to r5]
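$r_2 = w_0 \times w_1 \times w_{12} \times w_{121}$
Weights such as $w_1$ and $w_{12}$ are compared against the tolerance interval $[1/k, k]$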
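To make "minimum number of non-1s" concrete, here is a small exhaustive sketch for the exact case k=1. It is not the authors' algorithm, only an illustration under stated assumptions: the node names are hypothetical, the leaf ratios are those of the example, and at each internal node the search tries only root-to-node products equal to the incoming product or to some leaf ratio in the subtree (any other value cannot let a leaf below keep weight 1).

```python
from fractions import Fraction
from functools import lru_cache

# Hypothetical encoding of the example hierarchy: node -> list of children.
CHILDREN = {
    "root": ["A", "B"],
    "A": ["a1", "a2"], "B": ["b1", "b2"],
    "a1": ["z1"], "a2": ["z2"], "b1": ["z3", "z4"], "b2": ["z5"],
}
# Leaf ratios T2/T1 of the example.
RATIO = {"z1": Fraction(3), "z2": Fraction(3), "z3": Fraction(1),
         "z4": Fraction(2), "z5": Fraction(1)}

def leaf_ratios(node):
    """Set of leaf ratios found in the subtree rooted at node."""
    kids = CHILDREN.get(node, [])
    if not kids:
        return {RATIO[node]}
    return set().union(*(leaf_ratios(c) for c in kids))

@lru_cache(maxsize=None)
def min_non_ones(node, incoming):
    """Fewest non-1 weights in node's subtree, given the product of ancestor weights."""
    kids = CHILDREN.get(node, [])
    if not kids:
        # A leaf needs a non-1 weight unless the ancestors already produce its ratio.
        return 0 if incoming == RATIO[node] else 1
    best = None
    for product in {incoming} | leaf_ratios(node):
        # Moving the running product away from `incoming` costs one non-1 weight here.
        cost = (product != incoming) + sum(min_non_ones(c, product) for c in kids)
        best = cost if best is None else min(best, cost)
    return best

print(min_non_ones("root", Fraction(1)))  # 2
```

On the example this finds two non-1 weights, one at the internal node above the two tripled leaves and one at the leaf that doubled, compared with 3 for bottom-up and 7 for top-down.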
Parsimonious: Example
Parsimonious model explains changes optimally
k > 1 case captures similar changes among leaves having near-equal aggregates
2 explanation weights with k=1; none with k=1.5
[Figure: parsimonious weights on the example tree for k=1 and k=1.5]
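The k=1.5 claim can be checked numerically. The sketch below assumes a particular reading of the figure's weights (which weight sits on which node is an assumption, as are the node names) and verifies two things: every root-to-leaf product reproduces the leaf ratio exactly, and no weight falls outside [1/k, k].

```python
from fractions import Fraction as F

# Hypothetical node names; PARENT maps each non-root node to its parent.
PARENT = {"A": "root", "B": "root", "a1": "A", "a2": "A", "b1": "B", "b2": "B",
          "z1": "a1", "z2": "a2", "z3": "b1", "z4": "b1", "z5": "b2"}
# Leaf ratios T2/T1 of the example.
LEAF_RATIO = {"z1": F(3), "z2": F(3), "z3": F(1), "z4": F(2), "z5": F(1)}
# Weights for k=1.5 as read from the figure; the node placement is an assumption.
WEIGHTS_K15 = {"root": F(8, 9), "A": F(3, 2), "B": F(1),
               "a1": F(3, 2), "a2": F(3, 2), "b1": F(3, 2), "b2": F(1),
               "z1": F(3, 2), "z2": F(3, 2), "z3": F(3, 4), "z4": F(3, 2), "z5": F(9, 8)}

def path_product(leaf, weights):
    """Multiply the weights on the path from the leaf up to the root."""
    product, node = F(1), leaf
    while node is not None:
        product *= weights[node]
        node = PARENT.get(node)  # None once the root has been processed
    return product

def explanations(weights, k):
    """Weights outside the tolerance interval [1/k, k] count as explanations."""
    return {node: w for node, w in weights.items() if not (1 / k <= w <= k)}

products = {leaf: path_product(leaf, WEIGHTS_K15) for leaf in LEAF_RATIO}
print(products == LEAF_RATIO)                    # True
print(len(explanations(WEIGHTS_K15, F(3, 2))))   # 0
```

The tolerance lets many small adjustments (8/9, 3/4, 9/8, 3/2) absorb the change, so nothing is large enough to be reported.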
Scale-up
#explanations vs. #leaves
k=1.05 has the smallest #explanations
  Population counts do not change by more than 5% in 4-5 yrs
Bottom-up is close to the parsimonious solution with k=1
  Leaf counts are different in many leaves
[Figure: 8 samples from the same ratio tree; comparison between Year 2003 and Year 2004 of Census data]
Effect of Threshold Parameter
Populations for years 2003 and 2004 are compared
#explanations decreases dramatically with k
k=1 is a special case
Extra tolerance for grouping similar ratios
Stability
$S_h^k$ = set of nodes at level h where “explanations” occur
$S_c$ = stability of output at level h as tolerance changes from $k - \delta k$ to $k$
  $S_c = \dfrac{|S_h^{k-\delta k} \cap S_h^{k}|}{|S_h^{k}|}$
Average stability across all levels h
Stability is > 0.6
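A minimal sketch of the stability measure, assuming it is the fraction of level-h explanation nodes at tolerance k that were also explanation nodes at the slightly smaller tolerance. The node sets below are invented for illustration only, and treating an empty set as fully stable is an assumption.

```python
def stability(nodes_k, nodes_prev):
    """Share of explanation nodes at tolerance k also present at the smaller tolerance."""
    if not nodes_k:
        return 1.0  # assumption: no explanations at k counts as fully stable
    return len(nodes_k & nodes_prev) / len(nodes_k)

def average_stability(by_level_k, by_level_prev):
    """Average the per-level stability over all levels h present at tolerance k."""
    levels = list(by_level_k)
    per_level = [stability(by_level_k[h], by_level_prev.get(h, set())) for h in levels]
    return sum(per_level) / len(levels)

# Hypothetical explanation sets per level h, at tolerance k and at the smaller tolerance.
at_k = {1: {"A", "B"}, 2: {"b1"}}
at_prev = {1: {"A"}, 2: {"b1", "a1"}}
print(average_stability(at_k, at_prev))  # 0.75
```

Here level 1 contributes 0.5 (one of two nodes survives) and level 2 contributes 1.0, giving an average of 0.75.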
Future Work
Global Budget on Error Tolerance
  Bound on the sum of node tolerances, where individual node weights can be unequally distributed
Using a Prediction Model
  Statistical model provides predictions and confidence intervals on counts, which are compared to observed counts
Multiple Dimension Hierarchies
  e.g., geography x Dewey Decimal Number