
Something completely different…

Platon would like to host a BIT meeting in the autumn (in Århus)
- SOA or MDM
- Fine with me, but what do you all think?

Platon invites everyone to www.bi2006.dk, 7-8 June
- Special price for BIT members: 2995 kr.
- Registration via Jørgen Davidsen, jda@platon.net

Lineage Tracing in Data Warehouses

Torben Bach Pedersen

Based on work by Yingwei Cui and Jennifer Widom

Stanford University Database Group

3

Motivation: Data Warehousing

[Figure: Source 1, Source 2, and Source 3 hold the Students, Enrollments, and Courses tables and feed the Data Warehouse.]

Lucrative Fields (warehouse view):
Databases  $8800K
Theory     $320K
Networks   $800K

Wow?! Databases $8800K

4

Oh, I see...

[Figure: the Lineage Tracer traces the item "Databases $8800K" in the warehouse view Lucrative Fields back to the Courses, Enrollments, and Students tables in Source 1, Source 2, and Source 3.]

Lucrative Fields (Data Warehouse):
Databases  $8800K
Theory     $320K
Networks   $800K

Courses:
CS145  Databases
CS154  Theory
CS244  Networks
CS245  Databases

Enrollments:
CS145  Ted
CS154  Joe
CS244  Bob
CS145  Ann
CS245  Jane
…      …

Students:
Bob   MS   $1K
Jane  Web  $5K
Ann   BS   $1K
Joe   BS   $1K
Ted   Web  $5K
…     …    …

5

The Data Lineage Problem

Data warehouses integrate data from multiple sources for analysis and mining

Data lineage: given a data item o in the warehouse, which data items in the sources were used to derive o?

Sometimes called "drill-through" in industry
- "Drill-through" is often limited

6

Challenges

Warehouse of relational views over relational sources
- What is a good formal definition for lineage?
- How do we trace data lineage for arbitrary views?
- How do we make it efficient?

Warehouse defined by graph of data transformations
- No fixed, well-defined relational operators
- Large transformation sequences and graphs

7

Outline of Talk

Part 1: Lineage tracing for relational views

Part 2: Lineage tracing for general data transformations

8

Part 1: Lineage Tracing for Relational Views

Declarative definition of data lineage

Lineage tracing algorithms

Using auxiliary views for efficient lineage tracing

Experimental results (small sample)

9

Views We Consider

Relational algebra

Arbitrary use of aggregation

Set semantics

Also in thesis
- Set operators
- Bag semantics

[Figure: a view V defined over base relations R, S, and T.]

10

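Simple Lineage Example

select Y, sum(Z) from R natural join S where X > Z group by Y

$V = \alpha_{Y,\,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S))$

R:
X  Y
3  a
8  b

S:
Y  Z
a  2
b  0
b  9
b  6

T = $R \bowtie S$:
X  Y  Z
3  a  2
8  b  0
8  b  9
8  b  6

U = $\sigma_{X>Z}(T)$:
X  Y  Z
3  a  2
8  b  0
8  b  6

V = $\alpha_{Y,\,\mathrm{sum}(Z)}(U)$:
Y  sum
a  2
b  6

Highlighted lineage of (b, 6) in V: (8, b, 0) and (8, b, 6) in U and T; (8, b) in R; (b, 0) and (b, 6) in S.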

11

Lineage for Relational Operators

Unary relational operators: the definition took a long time

[Figure: op maps R to its output; the subset R* of R yields the output tuple t.]

Lineage of t according to op is the maximal subset $R^* \subseteq R$ such that

(1) $op(R^*) = \{t\}$: the output of R* through op is exactly t
(2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$: op applied to t* alone is nonempty
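The definition can be checked directly by brute force. The following is a minimal sketch, not the tracing algorithm from the talk: the function names are hypothetical, the operators are plain Python functions over sets of tuples, and the search is exponential, so it is only suitable for tiny tables such as the ones in the simple lineage example.

```python
from itertools import combinations

def lineage(op, relation, t):
    """Maximal subset R* of `relation` with op(R*) == {t} and
    op({t*}) nonempty for every t* in R* (empty set if none exists)."""
    # Condition (2) is a per-tuple test, so filter the candidates first.
    candidates = [r for r in relation if op({r})]
    # Try subsets from largest to smallest; the first hit is maximal.
    for size in range(len(candidates), 0, -1):
        for subset in combinations(candidates, size):
            if op(set(subset)) == {t}:
                return set(subset)
    return set()

def select_x_gt_z(rows):
    """Selection X > Z over (X, Y, Z) tuples."""
    return {(x, y, z) for (x, y, z) in rows if x > z}

def sum_z_by_y(rows):
    """Aggregation: group by Y, sum Z, giving (Y, sum) tuples."""
    totals = {}
    for (_, y, z) in rows:
        totals[y] = totals.get(y, 0) + z
    return set(totals.items())

T = {(3, "a", 2), (8, "b", 0), (8, "b", 9), (8, "b", 6)}
U = select_x_gt_z(T)

print(lineage(select_x_gt_z, T, (8, "b", 6)))
print(lineage(sum_z_by_y, U, ("b", 6)))
```

For the selection, condition (1) alone already isolates the single contributing tuple. For the aggregation, the "maximal" requirement is what pulls (8, b, 0) into the lineage of (b, 6) even though it adds nothing to the sum.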

12

Lineage for Relational Operators

Example 1: the two conditions ensure that only tuples contributing to t are included in the lineage

Lineage of t according to op is the maximal subset $R^* \subseteq R$ such that

(1) $op(R^*) = \{t\}$
(2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$

Input R, with op = $\sigma_{X>Z}$:
X  Y  Z
3  a  2
8  b  0
8  b  9
8  b  6

Output:
X  Y  Z
3  a  2
8  b  0
8  b  6

Highlighted: t = (8, b, 6) in the output; its lineage is R* = {(8, b, 6)}.

13

Lineage for Relational Operators

Example 2: the "maximal" requirement ensures that the (8, b, 0) tuple is included in the lineage of (b, 6)

Lineage of t according to op is the maximal subset $R^* \subseteq R$ such that

(1) $op(R^*) = \{t\}$
(2) $\forall t^* \in R^*: op(\{t^*\}) \neq \emptyset$

Input R, with op = $\alpha_{Y,\,\mathrm{sum}(Z)}$:
X  Y  Z
3  a  2
8  b  0
8  b  6

Output:
Y  sum
a  2
b  6

Highlighted: t = (b, 6) in the output; its lineage is R* = {(8, b, 0), (8, b, 6)}.

14

Lineage for Relational Operators

N-ary relational operators: the lineage is unique

Lineage of t according to op is the maximal subsets $R_i^* \subseteq R_i$ for i = 1..n such that

(1) $op(R_1^*, \dots, R_n^*) = \{t\}$
(2) $\forall t_i^* \in R_i^*: op(R_1, \dots, \{t_i^*\}, \dots, R_n) \neq \emptyset$

[Figure: op applied to R1 and R2, with the lineage subsets R1* and R2* marked inside them.]

15

Lineage for Relational Views

Lineage of a tuple set is the union of the lineage of each tuple in the set

Lineage for views is defined recursively, which gives a naive but inefficient algorithm (it needs to recompute or store all intermediate results)

[Figure: op1 computes the intermediate view U from R1 and R2, and op2 computes V from U. The tuple t in V traces back to U* in U, and U* traces back to R1* and R2*.]

Lineage of t is R1*, R2*

16

Lineage Tracing

Convert view into segmented normal form (SPJ+agg)

[Figure: each segment is an SPJ+agg expression over inputs E1 … En.]

Generate one tracing query for each segment

Apply tracing queries recursively

- (# non-top aggregations) + 1 tracing queries in total

Proof: lineage result is unaffected by normalization and segment-level tracing

17

Tracing Query for One Segment

$V = \alpha_{Y,\,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S))$

$TQ = \mathrm{Split}_{R,S}(\sigma_{Y=b}(\sigma_{X>Z}(R \bowtie S)))$

V:
Y  sum
a  2
b  6

R:
X  Y
3  a
8  b

S:
Y  Z
a  2
b  0
b  9
b  6

Tracing (b, 6) in V: $\sigma_{Y=b}(\sigma_{X>Z}(R \bowtie S))$ = {(8, b, 0), (8, b, 6)}

R* = {(8, b)}, S* = {(b, 0), (b, 6)}

Split = "unjoin": project over the R and S schemas
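The tracing query for this one segment can be sketched in a few lines. This is an illustrative reading of the slide, not the authors' implementation: relations are Python sets of tuples, and the function names `natural_join` and `tracing_query` are made up for the example.

```python
def natural_join(r, s):
    """Join R(X, Y) with S(Y, Z) on Y, giving (X, Y, Z) tuples."""
    return {(x, y, z) for (x, y) in r for (y2, z) in s if y == y2}

def tracing_query(r, s, y_value):
    """Split_{R,S}(sigma_{Y=y_value}(sigma_{X>Z}(R join S)))."""
    contributing = {(x, y, z) for (x, y, z) in natural_join(r, s)
                    if x > z and y == y_value}
    # Split ("unjoin"): project the joined tuples back onto each schema.
    r_star = {(x, y) for (x, y, _) in contributing}
    s_star = {(y, z) for (_, y, z) in contributing}
    return r_star, s_star

R = {(3, "a"), (8, "b")}
S = {("a", 2), ("b", 0), ("b", 9), ("b", 6)}

# Trace the view tuple (b, 6): select on the group-by value Y = b.
r_star, s_star = tracing_query(R, S, "b")
print(r_star, s_star)
```

Note that the aggregate value 6 itself is not needed for the trace; the group-by value b identifies the contributing tuples.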

18

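Recursive Tracing Procedure

$V = \alpha_{W,\,\mathrm{avg}(sum)}(\alpha_{Y,\,\mathrm{sum}(Z)}(\sigma_{X>Z}(R \bowtie S)) \bowtie T)$

$TQ_1 = \mathrm{Split}_{U,T}(\sigma_{W=q}(U \bowtie T))$
$TQ_2 = \mathrm{Split}_{R,S}(\sigma_{Y=b}(\sigma_{X>Z}(R \bowtie S)))$

V:
W  avg
p  4
q  6

U:
Y  sum
a  2
b  6

T:
Y  W
a  p
b  p
b  q

R:
X  Y
3  a
8  b

S:
Y  Z
a  2
b  0
b  9
b  6

Tracing (q, 6) in V: $TQ_1$ yields U* = {(b, 6)} and T* = {(b, q)}; $TQ_2$ then traces (b, 6) in U back to R and S.

R* = {(8, b)}, S* = {(b, 0), (b, 6)}, T* = {(b, q)}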
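The two-step trace can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names (`view_u`, `tq1`, `tq2`, `trace`) are hypothetical, relations are sets of tuples, and the intermediate view U is recomputed on demand rather than stored.

```python
R = {(3, "a"), (8, "b")}
S = {("a", 2), ("b", 0), ("b", 9), ("b", 6)}
T = {("a", "p"), ("b", "p"), ("b", "q")}

def view_u(r, s):
    """U = group by Y, sum Z over the join of R and S filtered by X > Z."""
    totals = {}
    for (x, y) in r:
        for (y2, z) in s:
            if y == y2 and x > z:
                totals[y] = totals.get(y, 0) + z
    return set(totals.items())

def tq1(u, t, w_value):
    """Split_{U,T}(sigma_{W=w_value}(U join T))."""
    joined = {(y, total, w) for (y, total) in u for (y2, w) in t
              if y == y2 and w == w_value}
    return ({(y, total) for (y, total, _) in joined},
            {(y, w) for (y, _, w) in joined})

def tq2(r, s, y_value):
    """Split_{R,S}(sigma_{Y=y_value}(sigma_{X>Z}(R join S)))."""
    joined = {(x, y, z) for (x, y) in r for (y2, z) in s
              if y == y2 and x > z and y == y_value}
    return ({(x, y) for (x, y, _) in joined},
            {(y, z) for (_, y, z) in joined})

def trace(w_value):
    """Trace the V tuple with W = w_value back to R, S, and T."""
    u_star, t_star = tq1(view_u(R, S), T, w_value)
    r_star, s_star = set(), set()
    # Lineage of a tuple set is the union of the lineage of each tuple.
    for (y, _) in u_star:
        r_part, s_part = tq2(R, S, y)
        r_star |= r_part
        s_star |= s_part
    return r_star, s_star, t_star

print(trace("q"))
```

The call to `view_u` inside `trace` is the recomputation of an intermediate result that the naive recursive algorithm has to pay for.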

19

Making It Efficient

Source accesses are usually expensive or impossible

Need some intermediate results for lineage tracing

Store auxiliary views at the warehouse
- Reduce or eliminate source accesses
- Reduce recomputation of intermediate results

20

Aux View Example

21

Aux View Example

22

Auxiliary Views

There are many possible auxiliary views

For single-segment views
- Identified 10 possible auxiliary view schemes
- Studied performance tradeoffs

For arbitrary views
- Hard optimization problem
- Exhaustive and heuristic algorithms
- Performance study

23

Single Segment Schemes

Store nothing (NO)

Store Base Tables (BT)

Store Lineage Views (LV)

Store Split Lineage Tables (SLT)

Store Partial Base Tables (PBT)

Store Base Table Projections (BP)

Store Lineage View Projections (LP)

Self-maintainable variations: LV-S, SLT-S, PBT-S

24

Auxiliary Views: Performance Tradeoffs

+ Always improve lineage tracing

- Must be maintained when sources change

+ Can also help with maintenance of original user views

25

Auxiliary View Schemes for Single-Segment Views

Parameters:
- 3-way SPJ view
- sources: 10MB each
- disk: 1Mbps
- network: 50kbps
- 1000 operations
- q/u ratio = 4

Measurements:
- tracing time
- maintenance time

26

Auxiliary View Selection Algorithms for Arbitrary Views

27

Part 2: Transformation Graphs

Lineage definition

Tracing algorithms

Combining transformations for lineage tracing

Experimental results (tiny sample)

[Figure: a graph of transformations T1 to T6 connecting Source 1, Source 2, and Source 3 to the Data Warehouse.]

28

Transformation Example

[Figure: transformations T1 to T7 turn the Order and Product sources into the warehouse table SalesJump: T1 (split), T2 (selection), T3 ("join"), T4 (pivot), T5 (projection), T6 (selection), T7 (projection).]

Order:
id  cust  date      prod-list
1   A     2/8/99    1(10),2(10)
2   C     4/5/99    2(5),3(10)
3   D     6/1/99    1(20),2(10)
4   B     8/6/99    1(10),3(5)
5   D     10/8/99   1(5),3(10)
6   B     12/1/99   2(10),3(10)

Product:
id  name  price  valid
1   imac  1200   10/1/98-
2   vaio  2400   6/1/98-9/1/99
2   vaio  1800   9/2/99-
3   palm  500    2/1/98-7/1/98
3   palm  400    7/2/98-9/1/99
3   palm  300    9/2/99-

SalesJump:
name  avg3  Q4
palm  2K    6K

Highlighted lineage of (palm, 2K, 6K) in SalesJump: the Product rows (3, palm, 400, 7/2/98-9/1/99) and (3, palm, 300, 9/2/99-), and the Order rows with id 2, 4, 5, and 6.

29

Lineage for General Transformations

A transformation can be an arbitrary program

[Figure: a transformation T whose lineage is unknown (??). T may be any of the following:]

select … from … where …
main(int argc, char** argv) {…}
sed "s/string1/string2/g" …

- One extreme: relational operators
- Another extreme: we know nothing about T
- Middle ground: based on transformation properties

30

Transformation Properties

Transformation classes

Additional properties
- Transformation subclasses
- Schema information
- Provided inverse or tracing procedure

31

Transformation Classes

dispatcher: $\forall I: T(I) = \bigcup_{i \in I} T(\{i\})$

Produces 0 or more output items per input item

Applying T to the complete input set is the same as applying it to each input item separately

Lineage: $T^*(o) = \{i \in I \mid o \in T(\{i\})\}$
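The dispatcher lineage formula translates almost literally into code. The sketch below is an illustration, not the TraceDS procedure from the original work: `split_order` is a stand-in for the order-splitting transformation T1 of the transformation example, and product lists are encoded as tuples of (pid, quant) pairs so that rows are hashable.

```python
def split_order(orders):
    """Stand-in for T1: one output row per entry in an order's product list."""
    out = set()
    for (oid, cust, date, prod_list) in orders:
        for (pid, quant) in prod_list:
            out.add((oid, cust, date, pid, quant))
    return out

def trace_dispatcher(transform, inputs, o):
    """T*(o) = {i in I | o in T({i})}: one call to T per input item."""
    return {i for i in inputs if o in transform({i})}

ORDERS = {
    (4, "B", "8/6/99", ((1, 10), (3, 5))),
    (5, "D", "10/8/99", ((1, 5), (3, 10))),
    (6, "B", "12/1/99", ((2, 10), (3, 10))),
}

traced = trace_dispatcher(split_order, ORDERS, (5, "D", "10/8/99", 3, 10))
print(traced)
```

The tracer never looks inside `split_order`; it only relies on the dispatcher property that each input item can be pushed through T on its own.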

32

Dispatcher Example

Order (input of T1):
id  cust  date      prod-list
1   A     2/8/99    1(10),2(10)
2   C     4/5/99    2(5),3(10)
3   D     6/1/99    1(20),2(10)
4   B     8/6/99    1(10),3(5)
5   D     10/8/99   1(5),3(10)
6   B     12/1/99   2(10),3(10)

O1 (output of T1):
id  cust  date      pid  quant
1   A     2/8/99    1    10
1   A     2/8/99    2    10
:   :     :
5   D     10/8/99   1    5
5   D     10/8/99   3    10
6   B     12/1/99   2    10
6   B     12/1/99   3    10

Highlighted: the Order row (5, D, 10/8/99, 1(5),3(10)) produces the O1 rows (5, D, 10/8/99, 1, 5) and (5, D, 10/8/99, 3, 10); tracing (5, D, 10/8/99, 3, 10) leads back to that Order row.

A non-relational operator, but a typical dispatcher

33

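Transformation Classes

dispatcher: $\forall I: T(I) = \bigcup_{i \in I} T(\{i\})$
Lineage: $T^*(o) = \{i \in I \mid o \in T(\{i\})\}$

aggregator: $\forall I$ with $T(I) = \{o_1, \dots, o_n\}$: there is a unique partition $I_1, \dots, I_n$ of $I$ such that $T(I_k) = \{o_k\}$
Lineage: $T^*(o_k) = I_k$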

34

Aggregator Example

O3 (input of T4):
oid  name  date      price  quant
1    imac  2/8/99    1200   10
1    vaio  2/8/99    2400   10
2    vaio  4/5/99    2400   5
2    palm  4/5/99    400    10
3    imac  6/1/99    1200   20
3    vaio  6/1/99    2400   10
4    imac  8/6/99    1200   10
4    palm  8/6/99    400    5
5    imac  10/8/99   1200   5
5    palm  10/8/99   300    10
6    vaio  12/1/99   1800   10
6    palm  12/1/99   300    10

O4 (output of T4):
name  Q1   Q2   Q3   Q4
imac  12K  24K  12K  6K
vaio  24K  12K  24K  18K
palm  0K   4K   2K   6K

Highlighted: the lineage of (palm, 0K, 4K, 2K, 6K) in O4 is the four palm rows of O3 (oid 2, 4, 5, and 6).

T4 computes quarterly sales per product by "pivoting"

Again, a non-relational operator, but a typical aggregator

35

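Transformation Classes

dispatcher: $\forall I: T(I) = \bigcup_{i \in I} T(\{i\})$
Lineage: $T^*(o) = \{i \in I \mid o \in T(\{i\})\}$

aggregator: $\forall I$ with $T(I) = \{o_1, \dots, o_n\}$: there is a unique partition $I_1, \dots, I_n$ of $I$ such that $T(I_k) = \{o_k\}$
Lineage: $T^*(o_k) = I_k$

black-box: all others
Lineage: $T^*(o) = I$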

36

Transformation Classes

Most transformations are dispatchers, aggregators, or their compositions

A transformation can be both dispatcher and aggregator
- Proof: Lineage definitions are then equivalent

Transformations can be relational operators
- Lineage definitions same as relational definitions

37

Transformation Properties

Transformation classes

Additional properties
- Transformation subclasses
- Schema information
- Provided inverse or tracing procedure

38

Transformation Subclasses

Permit more efficient lineage tracing

Filter is a special dispatcher
- Each input data item produces itself or nothing

Context-free aggregator
- Whether two input data items are in the same partition is independent of other items

Key-preserving aggregator
- Any subset of an input partition always produces the same output key

39

Tracing Example: Aggregators

Consider $T(I) = \{o_1, \dots, o_n\}$

Tracing the lineage of o for an aggregator
- Partition input I into $I_1, \dots, I_n$ such that $T(I_k) = \{o_k\}$
- Return $I_k$ such that $T(I_k) = \{o\}$

Tracing the lineage of o for a context-free aggregator
- Partition input I into $I_1, \dots, I_n$ such that $|T(I_k)| = 1$
- Return $I_k$ such that $T(I_k) = \{o\}$

- $2^n$ versus $n^2$ running time!
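One way to realise the context-free case is to build the partitions greedily with pairwise tests. The sketch below is an assumption-laden illustration, not the TraceCF procedure itself: it assumes that comparing a new item with a single member of a partition is enough to decide membership (which is what context-freeness suggests), and `total_by_name` is a hypothetical toy aggregator.

```python
def total_by_name(rows):
    """Toy aggregator: total sales (price * quant) per product name."""
    totals = {}
    for (_, name, price, quant) in rows:
        totals[name] = totals.get(name, 0) + price * quant
    return set(totals.items())

def trace_context_free(transform, inputs, o):
    """Partition `inputs` with pairwise tests, then return the part for o."""
    partitions = []
    for item in inputs:
        for part in partitions:
            # Context-free: a test against one member decides membership.
            if len(transform({part[0], item})) == 1:
                part.append(item)
                break
        else:
            partitions.append([item])
    for part in partitions:
        if transform(set(part)) == {o}:
            return set(part)
    return set()

O3 = {
    (2, "palm", 400, 10), (4, "palm", 400, 5),
    (5, "palm", 300, 10), (6, "palm", 300, 10),
    (3, "imac", 1200, 20), (2, "vaio", 2400, 5),
}

palm_lineage = trace_context_free(total_by_name, O3, ("palm", 12000))
print(palm_lineage)
```

Each item is compared with at most one representative per existing partition, so the number of calls to T stays polynomial in the input size, in contrast to searching over all candidate partitions.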

40

Schema Information

Input schema $A = (A_1, \dots, A_n)$ and key $A_{key}$

Output schema $B = (B_1, \dots, B_n)$ and key $B_{key}$

Schema mappings: $f(A) \to B$ and $A \leftarrow g(B)$

Transformations with special schema mappings
- Forward key-map: $f(A) \to B_{key}$
- Backward key-map: $A_{key} \leftarrow g(B)$
- Backward total-map: $A \leftarrow g(B)$
- More efficient tracing for these

41

Tracing Example: Forward Key-Maps

O3 (input of T4):
oid  name  date      price  quant
1    imac  2/8/99    1200   10
1    vaio  2/8/99    2400   10
2    vaio  4/5/99    2400   5
2    palm  4/5/99    400    10
3    imac  6/1/99    1200   20
3    vaio  6/1/99    2400   10
4    imac  8/6/99    1200   10
4    palm  8/6/99    400    5
5    imac  10/8/99   1200   5
5    palm  10/8/99   300    10
6    vaio  12/1/99   1800   10
6    palm  12/1/99   300    10

O4 (output of T4):
name  Q1   Q2   Q3   Q4
imac  12K  24K  12K  6K
vaio  24K  12K  24K  18K
palm  0K   4K   2K   6K

"name" is carried over as the key, so the trace of "palm" is easy: it is the O3 tuples with name = 'palm'
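With a forward key-map, tracing reduces to a single scan of the input that compares keys, with no calls to the transformation. A minimal sketch follows; the function name and the lambda used as the key mapping are illustrative assumptions, and only a few O3 rows are included.

```python
def trace_forward_key_map(inputs, key_of_input, output_key):
    """Return the input items whose mapped key equals the output item's key."""
    return [item for item in inputs if key_of_input(item) == output_key]

# A few O3 rows: (oid, name, date, price, quant).
O3 = [
    (1, "imac", "2/8/99", 1200, 10),
    (2, "palm", "4/5/99", 400, 10),
    (4, "palm", "8/6/99", 400, 5),
    (5, "palm", "10/8/99", 300, 10),
    (6, "vaio", "12/1/99", 1800, 10),
    (6, "palm", "12/1/99", 300, 10),
]

# The output key of O4 is "name", which is carried over from O3.
palm_rows = trace_forward_key_map(O3, lambda row: row[1], "palm")
print(palm_rows)
```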

42

Other Properties

Transformation author provides a tracing procedure

Provided transformation inverse $T^{-1}$
- If T is an aggregator, then o's lineage is $T^{-1}(\{o\})$
- Not always true for dispatchers or black-boxes

43

Tracing Procedures

Property                Procedure  # T Calls   # Accesses
dispatcher              TraceDS    O(|I|)      O(|I|)
aggregator              TraceAG    O(2^|I|)    O(2^|I|)
black-box               return I;  0           O(|I|)
filter                  return o;  0           0
context-free aggr.      TraceCF    O(|I|^2)    O(|I|^2)
key-preserving aggr.    TraceKP    O(|I|)      O(|I|)
forward key-map         TraceFM    0           O(|I|)
backward key-map        TraceBM    0           O(|I|)
backward total-map      TraceTM    0           0
provided tracing-proc.  provided   ?           ?

44

Property Hierarchy

[Figure: hierarchy of transformation properties rooted at ANY. Nodes: provided tracing-proc. or inverse; black-box; aggregator; dispatcher; context-free aggr.; key-preserving aggr.; filter; forward key-map; backward key-map; total-map.]

45

Summary of Our Approach for One Transformation

Properties are provided with transformations
- Specified by the transformation author
- Declared in prepackaged transformations
- Derived using recent techniques [Clio01, RB01]

The best property of a transformation is selected based on the hierarchy

The tracing procedure using the best property is called at tracing time

Indexing techniques

46

Transformation Sequences

Naive algorithm traces backwards one transformation at a time
- Need all intermediate results
- Poor performance for long sequences

[Figure: I → T1 → T2 → T3 → … → Tn → O]
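The naive backward trace can be sketched as follows. To keep the example short it assumes that every transformation in the sequence is a dispatcher; the three toy transformations and all function names are hypothetical and are not taken from the talk.

```python
def trace_step(transform, inputs, outputs):
    """Dispatcher trace for a set of outputs: inputs producing any of them."""
    return {i for i in inputs if transform({i}) & outputs}

def trace_sequence(transforms, source, targets):
    """Trace `targets` back through T1..Tn, one transformation at a time."""
    # Forward pass: the naive algorithm needs every intermediate result.
    intermediates = [source]
    for t in transforms:
        intermediates.append(t(intermediates[-1]))
    # Backward pass: pair each transformation with its own input set.
    current = set(targets)
    for t, inputs in zip(reversed(transforms), reversed(intermediates[:-1])):
        current = trace_step(t, inputs, current)
    return current

def split_words(lines):
    return {word for line in lines for word in line.split()}

def keep_long(words):
    return {word for word in words if len(word) > 3}

def to_upper(words):
    return {word.upper() for word in words}

SOURCE = {"data lineage", "a view", "big data"}
PIPELINE = [split_words, keep_long, to_upper]

print(trace_sequence(PIPELINE, SOURCE, {"DATA"}))
```

The list `intermediates` grows with the length of the sequence, which is the storage and recomputation cost that the slide attributes to the naive algorithm.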

47

Transformation Sequences

[Figure: the sequence I → T1 → T2 → T3 → … → Tn → O becomes I → T' → … → Tn → O after combining transformations.]

Combine transformations and trace as one
- Reduces number of intermediate results
- By combining judiciously: reduces tracing cost, doesn't lose accuracy
