AJAX: Model, Declarative Language, and Algorithms

Helena Galhardas

SAD Tagus

Plan

• Context

• Problem statement

• Contributions

• Our data cleaning solution

• Validation

• Related solutions

• Conclusions

Application context

– Eliminate errors and duplicates within a single source

– Integrate data from different sources

– Migrate poorly structured data into structured data

Typical architecture

[Figure: architecture diagram between SOURCE DATA and TARGET DATA, with components Data Extraction, Data Transformation, Data Loading, Data Analysis, Schema Integration, Metadata, Dictionaries, and Human Knowledge.]

Data cleaning

Activity of transforming source data into target data without errors, duplicates, and inconsistencies

Motivating example (1)

DirtyData(paper:String)

Data Cleaning

Events(eventKey, name)

Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)

Authors(authorKey, name)

PubsAuthors(pubKey, authorKey)

Motivating example (2)

DirtyData:

[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996

[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95

Data Cleaning

Events:

PDIS | Conference on Parallel and Distributed Information Systems

Publications:

QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996

Authors:

DQua | Dallan Quass
AGup | Ashish Gupta
JWid | Jennifer Widom
...

PubsAuthors:

QGMW96 | DQua
QGMW96 | AGup
...

Plan

• Context

• Problem statement

• Contributions

• Our data cleaning solution

• Validation

• Related solutions

• Conclusions

Modeling a data cleaning process

A data cleaning process is modeled by a directed acyclic graph of data transformations

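[Figure: example graph leading from DirtyData to Authors through the transformations Extraction, Standardization, Formatting, and Duplicate Elimination; other nodes shown are DirtyAuthors, DirtyTitles, DirtyEvents, Cities, and Tags.]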
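As a minimal sketch of this model, assuming Python's standard graphlib module, a cleaning process can be stored as a mapping from each transformation to the transformations whose output it consumes, and then run in dependency order. The node names are borrowed from the figure, but the edges are hypothetical: the slides do not spell out the exact graph.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each transformation maps to the set of
# transformations whose output it consumes (edges are illustrative only).
cleaning_graph = {
    "Extraction": set(),
    "Standardization": {"Extraction"},
    "Formatting": {"Standardization"},
    "DuplicateElimination": {"Formatting"},
}

def execution_order(graph):
    """Return one valid execution order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(cleaning_graph)
print(order)
```

Because the graph is acyclic, every transformation runs only after all of its inputs have been produced.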

Existing technology

• Ad-hoc code

– difficult to maintain

• Extraction Transformation Loading (ETI, Informatica, Sagent)

– limited cleaning functionality

• Data Reengineering (Integrity)

– fixed implementation for certain operators

• Specific-domain cleaning (idCentric, PureIntegrate)

– names and addresses

• Duplicate elimination (DataCleanser, matchIt)

– finds/eliminates duplicates

Problems of existing solutions (1)

The semantics of some data transformations is defined in terms of their implementation algorithms

[Figure: data cleaning transformations shown against application domains 1, 2, 3, ...]

Problems of existing solutions (2)

There is a lack of interactive facilities to tune a data cleaning application program

[Figure: dirty data enters the cleaning process, which outputs clean data and rejected data.]

AJAX

• An extensible data cleaning framework

• A declarative language for logical operators

• Efficient implementation of the match operator

• A debugger facility for tuning a data cleaning application

Data cleaning framework

• Logical level: set of logical operators to express cleaning criteria enclosed in each data transformation

• Physical level: set of algorithms that implement the logical operations

Logical level: parametric operators

• View: arbitrary SQL query

• Map: iterator-based one-to-many mapping with arbitrary user-defined functions

• Match: iterator-based approximate join

• Cluster: uses an arbitrary clustering function

• Merge: extends SQL group-by with user-defined aggregate functions

• Apply: executes an arbitrary user-defined algorithm
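A rough Python sketch of two of these operators, under the assumption that a relation is a list of dictionaries. The names `map_op` and `merge_op` are hypothetical and not part of AJAX. Map applies a user-defined function that may return zero, one, or many output tuples per input tuple; Merge groups tuples by a key and applies a user-defined aggregate to each group.

```python
from collections import defaultdict

def map_op(relation, fn):
    """One-to-many mapping: fn returns an iterable of output tuples per input tuple."""
    return [out for row in relation for out in fn(row)]

def merge_op(relation, key, aggregate):
    """Group rows by key(row), then reduce each group with a user-defined aggregate."""
    groups = defaultdict(list)
    for row in relation:
        groups[key(row)].append(row)
    return [aggregate(k, rows) for k, rows in groups.items()]

def split_authors(paper):
    """Illustrative user-defined function: one output tuple per author name."""
    for name in paper["authors"].split(","):
        yield {"pubKey": paper["pubKey"], "name": name.strip()}

def longest_name(cluster_id, rows):
    """Illustrative user-defined aggregate: keep the longest name of a cluster."""
    return {"cluster": cluster_id, "name": max((r["name"] for r in rows), key=len)}

papers = [{"pubKey": 1, "authors": "D. Quass, A. Gupta"}]
authors = map_op(papers, split_authors)

clustered = [
    {"cluster": 1, "name": "jc freytag"},
    {"cluster": 1, "name": "johann christoph freytag"},
]
merged = merge_op(clustered, key=lambda r: r["cluster"], aggregate=longest_name)
print(authors, merged)
```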

Logical level

[Figure: the data cleaning graph from DirtyData to Authors, with the transformations Extraction, Standardization, Formatting, and Duplicate Elimination; other nodes shown are DirtyAuthors, DirtyTitles, Cities, and Tags.]

Logical level

[Figure: the data cleaning graph (Extraction, Standardization, Formatting, Duplicate Elimination) annotated with logical operators: Map (three occurrences), Match, Cluster, and Merge; data nodes are DirtyData, DirtyAuthors, DirtyTitles, Cities, Tags, and Authors.]

Physical level

[Figure: the same graph annotated with the algorithms SQL Scan, Java Scan (three occurrences), NL, and TC.]

Contributions

• An extensible data cleaning framework

• A declarative language for logical operators

• Efficient implementation of the match operator

• A debugger facility for tuning a data cleaning application

Match

• Input: 2 relations

• Finds data records that correspond to the same real object

• Calls distance functions for comparing field values and computing the distance between input tuples

• Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples
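The behaviour described above can be sketched in Python as follows. This is only an illustration: `match_op` and the toy `length_gap` distance are hypothetical names, relations are assumed to be lists of dictionaries, and pairs are enumerated naively.

```python
def match_op(r1, r2, distance, max_dist):
    """Return (matches, unmatched_r1, unmatched_r2) for two input relations."""
    matches = []
    matched1, matched2 = set(), set()
    for i, t1 in enumerate(r1):
        for j, t2 in enumerate(r2):
            d = distance(t1, t2)
            if d < max_dist:
                matches.append((t1, t2, d))
                matched1.add(i)
                matched2.add(j)
    # Tuples that took part in no matching pair go to the optional outputs.
    unmatched1 = [t for i, t in enumerate(r1) if i not in matched1]
    unmatched2 = [t for j, t in enumerate(r2) if j not in matched2]
    return matches, unmatched1, unmatched2

def length_gap(t1, t2):
    """Toy distance for illustration only: difference in name length."""
    return abs(len(t1["name"]) - len(t2["name"]))

left = [{"name": "j freytag"}, {"name": "johann christoph freytag"}]
right = [{"name": "jc freytag"}]
matches, only_left, only_right = match_op(left, right, length_gap, max_dist=2)
print(matches, only_left, only_right)
```

Passing the distance function as a parameter mirrors the idea that the operator itself is generic and the comparison logic is user-supplied.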

Example

[Figure: Duplicate Elimination decomposed into the operators Match, Cluster, and Merge; Match reads DirtyAuthors and produces MatchAuthors, and the final output is Authors.]

Example

CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors

Input:

DirtyAuthors(authorKey, name)

861 | johann christoph freytag
822 | jc freytag
819 | j freytag
814 | j-c freytag

Output:

MatchAuthors(authorKey1, authorKey2, name1, name2)

861 | 822 | johann christoph freytag | jc freytag
822 | 814 | jc freytag | j-c freytag
...

Implementation of the match operator

s1 ∈ S1, s2 ∈ S2

(s1, s2) is a match if

editDistance(s1, s2) < maxDist
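For concreteness, here is a minimal sketch of the match condition, assuming that editDistance denotes the Levenshtein distance (the slides do not say which edit distance is used). The function names are illustrative.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def is_match(s1: str, s2: str, max_dist: int) -> bool:
    """Match condition from the slide: editDistance(s1, s2) < maxDist."""
    return edit_distance(s1, s2) < max_dist

print(edit_distance("jc freytag", "j-c freytag"))
```

Each call costs time proportional to the product of the two string lengths, which is why the number of compared pairs matters so much in what follows.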

Nested loop

[Figure: every tuple of S1 is compared with every tuple of S2 using editDistance.]

• Very expensive evaluation when handling large amounts of data

Need alternative execution algorithms for the same logical specification

A database solution

-- editDistance is an external function; maxDist is a user-supplied threshold
CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1,
             a2.authorKey authorKey2,
             editDistance(a1.name, a2.name) distance
      FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;

No optimization supported for a Cartesian product with external function calls

Window scanning

[Figure: a window of n tuples sliding over the relation S.]

May lose some matches
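A minimal sketch of window scanning, assuming the common sorted-neighbourhood formulation: S is sorted on a key and each tuple is compared only with the n - 1 tuples that follow it. The sort key, the `similar` predicate, and the function names are assumptions made for illustration; the slides do not specify them.

```python
def window_scan(records, n, sort_key, similar):
    """Compare each record only with the next n - 1 records in sorted order."""
    ordered = sorted(records, key=sort_key)
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + n]:
            if similar(left, right):
                pairs.append((left, right))
    return pairs

def same_surname(a, b):
    """Illustrative predicate: same last word (a stand-in for a distance test)."""
    return a.split()[-1] == b.split()[-1]

names = ["j freytag", "j-c freytag", "jc freytag", "johann christoph freytag"]

small = window_scan(names, n=2, sort_key=str, similar=same_surname)
large = window_scan(names, n=4, sort_key=str, similar=same_surname)
print(len(small), len(large))
```

With n = 2 only neighbouring tuples are compared, so pairs that are far apart in the sort order are never examined; this is the sense in which a small window may lose matches that a larger one finds.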

String distance filtering

maxDist = 1

[Figure: editDistance is computed between S1 and S2 only for strings of compatible length. The strings shown are John Smith (length), John Smit (length - 1), Jogn Smith (length), and John Smithe (length + 1).]
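The filter relies on the fact that the edit distance between two strings is at least the difference of their lengths, so pairs whose lengths differ too much can be discarded before the expensive call. The sketch below assumes Levenshtein distance and the strict condition distance < maxDist used in the earlier examples; the function names and the extra string "Jonathan Smithers" are illustrative. Because the condition is strict, the usage example takes a threshold of 2 so that strings at distance 1 qualify.

```python
def edit_distance(s1, s2):
    """Levenshtein distance (row-by-row dynamic programming)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        cur = [i]
        for j, c2 in enumerate(s2, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

def filtered_match(s1_values, s2_values, max_dist):
    """Call edit_distance only for pairs that pass the cheap length filter."""
    matches, calls = [], 0
    for a in s1_values:
        for b in s2_values:
            # Edit distance >= |len(a) - len(b)|, so this pair cannot match.
            if abs(len(a) - len(b)) >= max_dist:
                continue
            calls += 1
            if edit_distance(a, b) < max_dist:
                matches.append((a, b))
    return matches, calls

s1 = ["John Smith"]
# "Jonathan Smithers" is an added illustrative value, not from the slides.
s2 = ["John Smit", "Jogn Smith", "John Smithe", "Jonathan Smithers"]
matches, calls = filtered_match(s1, s2, max_dist=2)
print(matches, calls)
```

The filter never discards a true match; it only avoids distance computations that could not succeed.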

Annotation-based optimization

• The user specifies types of optimization

• The system suggests which algorithm to use

Ex:

CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map = length; dist = abs %
INTO MatchAuthors

Contributions

• An extensible data cleaning framework

• A declarative language for logical operators

• Efficient implementation of the match operator

• A debugger facility for tuning a data cleaning application

Management of exceptions

• Problem: to mark tuples not handled by the cleaning criteria of an operator

• Solution: to specify the generation of exceptional tuples within a logical operator

– exceptions are thrown by external functions

– output constraints are violated

Example (1)

CREATE MAP ExtractionCities
FROM StandardizedDirtyData dd
LET city = extractCities(dd.paper, Cities),
{ SELECT dd.paperKey AS pubKey, city AS city
  INTO ExtractedCities
  CONSTRAINT NOT NULL city }

[Figure: the Map operator ExtractionCities (Extraction) takes StandardizedDirtyData(pubKey, paper) and Cities as input and produces ExtractedCities(pubKey, city).]

Example (2)

StandardizedDirtyData:

4 | y ioannidis r ng k shim and t sellis parametric query optimization technical report univ of wisconsin madison and univ of maryland college park

[Figure: ExtractionCities reads StandardizedDirtyData and Cities, and writes ExtractedCities and the exception output StandardizedDirtyDataexc.]

StandardizedDirtyDataexc:

4 | ManyDifferentCities
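The exception mechanism can be sketched in Python as a map that routes tuples to a separate exception output when the external function raises or when the output constraint fails. Everything here is an assumption made for illustration: the function names, the "NullCity" label, the toy city list, the second input row, and the simplistic substring lookup that stands in for the real extractCities.

```python
class ManyDifferentCities(Exception):
    """Raised when more than one known city occurs in a paper string."""

def extract_city(paper, cities):
    """Toy stand-in for extractCities: substring lookup in a city list."""
    found = [c for c in cities if c in paper]
    if len(found) > 1:
        raise ManyDifferentCities(found)
    return found[0] if found else None

def extraction_cities(rows, cities):
    """Return (extracted, exceptions); exceptions hold (pubKey, reason) pairs."""
    extracted, exceptions = [], []
    for row in rows:
        try:
            city = extract_city(row["paper"], cities)
        except ManyDifferentCities:
            exceptions.append((row["pubKey"], "ManyDifferentCities"))
            continue
        if city is None:  # the NOT NULL constraint on city is violated
            exceptions.append((row["pubKey"], "NullCity"))
            continue
        extracted.append({"pubKey": row["pubKey"], "city": city})
    return extracted, exceptions

rows = [
    {"pubKey": 4, "paper": "technical report univ of wisconsin madison "
                           "and univ of maryland college park"},
    # Illustrative second row, not taken from the slides.
    {"pubKey": 5, "paper": "proceedings miami beach florida usa"},
]
extracted, exceptions = extraction_cities(rows, ["madison", "college park", "miami beach"])
print(extracted, exceptions)
```

Keeping the rejected tuples in their own output, with a reason attached, is what makes it possible to inspect and correct them later instead of silently dropping them.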

Debugger facility

• Supports the (backward and forward) data derivation of tuples with respect to an operator to debug exceptions

• Supports the interactive data modification and the incremental execution of some logical operators

Backward/forward data derivation

[Figure: data flow graph with the nodes KeyDirtyData, StandardizeData, StandardizedDirtyData, StandardizeDataForExtraction, StandardizedDirtyDataForExtraction, ExtractionAuthorsTitleEvent, DirtyEvents, ExtractionCities, Cities, ExtractedCities, and StandardizedDirtyDataexc; arrows labelled Backward Derivation and Forward Derivation connect the tuples below.]

Tuples shown:

4 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison and Univ. Of Maryland, College Park, 1992

4 | Technical Report, Univ. Of Wisconsin, and Univ. Of Maryland

4 | ManyDifferentCities

Interactive data correction (1)

[Figure: data flow graph with the nodes KeyDirtyData, StandardizeData, StandardizedDirtyData, StandardizeDataForExtraction, StandardizedDirtyDataForExtraction, ExtractionAuthorsTitleEvent, DirtyEvents, ExtractionCities, Cities, ExtractedCities, and StandardizedDirtyDataexc.]

Tuples shown:

4 | ManyDifferentCities

4 | Technical Report, Univ. Of Wisconsin and Univ. Of Maryland

Corrected source tuples:

4 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992

101 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992

Interactive data correction (2)

[Figure: data flow graph with the nodes KeyDirtyData, StandardizeData, StandardizedDirtyData, StandardizeDataForExtraction, StandardizedDirtyDataForExtraction, ExtractionAuthorsTitleEvent, DirtyEvents, ExtractionCities, Cities, and ExtractedCities; four steps are labelled incremental.]

Tuples shown:

4 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992

101 | Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992

4 | Technical Report, Univ. Of Wisconsin

101 | Technical Report, Univ. Of Maryland

4 | Madison

101 | College Park

AJAX Architecture

AJAX Demo