116
Scalable Information Extraction and Integration Eugene Agichtein Microsoft Research Emory University Sunita Sarawagi IIT Bombay

Eugene Agichtein Microsoft Research - Emory University

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Eugene Agichtein Microsoft Research - Emory University

Scalable Information

Extraction and Integration

Eugene Agichtein

Microsoft Research �

Emory U

niversity

SunitaSarawagi

IIT Bombay

Page 2: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

The Value of Text Data

�“U

nstructured”text data is the primary source of

human-generated inform

ation

�Citeseer, comparison shopping, PIM

systems, web search,

data warehousing

�Managing and utilizing text: inform

ation extraction

and integration

�Scalability: a bottleneck for deployment

�Relevance to data m

ining community

Page 3: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Example: Answering Queries Over Text

For years, Microsoft

CorporationCEOBill

Gateswas against open

source. But today he

appears to have changed

his mind. "We can be

open source. We love the

concept of shared

source," said Bill Veghte,

a MicrosoftVP. "That's a

super-important shift for

us in terms of code

access.“

Richard Stallman,

founderof the Free

Software Foundation,

countered saying…

Name Title Organization

Bill Gates

CEO

Microsoft

Bill Veghte

VP

Microsoft

Richard Stallman

Founder

Free Soft..

PEOPLE

Select Name

From PEOPLE

Where Organization = ‘Microsoft’

Bill Gates

Bill Veghte

(from W

illiam Cohen’s IE tutorial, 2003)

Page 4: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Managing Unstructured Text Data

�Information Extraction from text

�Represent information in text data in a structured form

�Identify instances of entities and relationships

�Main approaches and architectures

�Scaling up to large collections of documents (e.g., web)

�Information Integration

�Combine/resolve/clean information about entities

�Entity Resolution & Deduplication

�Scaling Up: Batch m

ode/algorithmic issues

�Connections between Inform

ation Extraction and Integration

�Coreference Resolution

�Deriving values from m

ultiple sources

�(W

eb) Question Answering

Page 5: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Part I: Tutorial Outline

�Overview of Information Extraction

�Entity tagging

�Relation extraction

�Scaling up Inform

ation Extraction

�Focus on scaling up to large collections (where

data m

ining and M

L techniques shine)

�Other dim

ensions of scalability

Page 6: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Information Extraction Components

Page 7: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Information Extraction Tasks

�Extracting entities and relations: this tutorial

�Entities: named (e.g., Person) and generic (e.g., disease name)

�Relations: entities related in a predefined way (e.g., Location of a

Disease outbreak)

�Common extraction subtasks:

�Preprocessing: sentence chunking, syntactic parsing,

morphological analysis

�Creating rules or extraction patterns: manual, m

achine learning,

and hybrid

�Applying extraction patterns to extract new inform

ation

�Postprocessing and complex extraction: not covered

�Co-reference resolution

�Combining Relations into Events and Facts

Page 8: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Related Tutorials

�Previous inform

ation extraction tutorials: consult for more

details

�R. Feldman, Inform

ation Extraction –

Theory and Practice,

ICML 2006

http://www.cs.biu.ac.il/~feldman/icml_tutorial.htm

l

�W. Cohen, A. McCallu

m, Inform

ation Extraction and

Integration: an O

verview, KDD 2003

http://www.cs.cmu.edu/~wcohen/ie-survey.ppt

�A. Doan, R. Ramakrishnan, S. Vaithyanathan, Managing

Inform

ation Extraction, SIG

MOD’06

�N. Koudas, D. Srivastava, S. Sarawagi, Record Linkage:

Sim

ilarity M

easures and Algorithms, SIG

MOD 2006

Page 9: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Entity Tagging

�Identifying m

entions of entities (e.g., person names, locations,

companies) in text

�MUC (1997): Person, Location, Organization, Date/Tim

e/Currency

�ACE (2005): m

ore than 100 m

ore specific types

�Hand-coded vs. Machine Learning approaches

�Best approach depends on entity type and domain:

�Closed class (e.g., geographical locations, disease names, gene &

protein names): hand coded + dictionaries

�Syntactic (e.g., phone numbers, zipcodes): regexes

�Others (e.g., person and company names): m

ixture of context,

syntactic features, dictionaries, heuristics, etc.

�“Alm

ost solved”for common/typical entity types

�Non-syntactic entities computationally expensive

Page 10: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Example: Extracting Entities from Text

�Useful for data warehousing, data cleaning, web

data integration

14089 Whispering Pines Nobel Drive San Diego CA 92122

House

number

Building

Road

City

Zip

State

Address

Citation

Year

2002

S4

Conference

Proc. of ACM SIG

MOD

S3

Title

Combining Fuzzy Inform

ation from M

ultiple Systems

S2

Author

Ronald Fagin

S1

Label(si)

Sequence

Segment(si)

Ronald Fag

in, Combining Fuzzy Inform

ation from M

ultiple

System

s, Proc. of ACM SIG

MOD, 2002

Page 11: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Hand-Coded Methods

�Easy to construct in m

any cases

�e.g., to recognize prices, phone numbers, zip codes,

conference names, etc.

�Easier to debug & m

aintain

�Especially if written in a “high-level”language (as is

usually the case): e.g.,

�Easier to incorporate / reuse domain knowledge

�Can be quite labor intensive to write

ContactPattern RegularExpression(Email.body,”can be reached at”)

PersonPhone

Precedes(Person

Precedes(ContactPattern, Phone, D),

D)

[From Avatar]

Page 12: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Example of Hand-Coded Entity Tagger

[Ramakrishnan. G

, 2005, Slides from Doan et al., SIGMOD 2006]

Rule 1

This rule will find person names with a salutation (e.g. Dr. Laura

Haas) and two capitalized words

<token> INITIAL</token>

<token>DOT </token>

<token>CAPSWORD</token>

<token>CAPSWORD</token>

Rule 2

This rule will find person names where two capitalized words

are present in a Person dictionary

<token>PERSONDICT, CAPSWORD </token>

<token>PERSONDICT,CAPSWORD</token>

CAPSWORD : W

ord starting with uppercase, second letter lowercase

E.g., DeWitt will satisfy it (D

EWITT will not)

\p{U

pper}\p{Lower}[\p{Alpha}]{1,25}

DOT

: The character ‘.’

Page 13: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Hand Coded Rule Example: Conference Name

# These are subordinate patterns

$wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|four

teenth|fifteenth)";

my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";

my $ordinals="(?:$wordOrdinals|$numberOrdinals)";

my $confTypes="(?:Conference|Workshop|Symposium)";

my $words="(?:[A-Z]\\w+\\s*)"; # A word starting with a capital letter and ending with 0 or more spaces

my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # .e.g "International Conference ...' or the conference

name for workshops (e.g. "VLDB Workshop ...")

my $connectors="(?:on|of)";

my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # Conference abbreviations like "(SIGMOD'06)"

# The actual pattern we search for. A typical conference name this pattern will find is

# "3rd International Conference on Blah BlahBlah(ICBBB-05)"

my

$fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\ s+)?$abb

reviations?)(?:\\n|\\r|\\.|<)";

############################## ################################

# Given a <dbworldMessage>, look for the conference pattern

##############################################################

lookForPattern($dbworldMessage, $fullNamePattern);

#########################################################

# In a given <file>, look for occurrences of <pattern>

# <pattern> is a regular expression

#########################################################

sub lookForPattern{

my ($file,$pattern) = @_;

Page 14: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Gene & Protein Tagger: AliBaba

�Extract gene names from PubMedabstracts

�Use C

lassifier (Support Vector Machine -SVM)

Vector

Generator

SVMlight

Tagged

Text

Post

Processor

Tokenized

Training

Corpus

SVM Model

driven

Tagger

New

Text

Vector

Generator

�Corpus of 7500 sentences

�140.000 non-gene words

�60.000 gene names

�SVM

lighton different feature sets

�Dictionary compile

d from G

enbank, HUGO, MGD, YDB

�Post-processing for compound gene names

Page 15: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Some Hand Coded Entity Taggers

�FRUMP [DeJong82]

�CIRCUS / AutoSlog[Riloff93]

�SRI FASTUS [Appelt, 1996]

�MITRE Alembic

(available for use)

�Alias-I LingPipe(available for use)

�OSMX [Embley, 2005]

�DBLife [Doan et al, 2006]

�Avatar [Jayram

et al, 2006]

Page 16: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Machine Learning Methods

�Can work well when training data is easy to construct and is

plentiful

�Can capture complex patterns that are hard to encode w

ith

hand-crafted rules

�e.g., determ

ine w

hether a review is positive or negative

�extract long complex gene names

The human T cell leukemia lymphotropic virus type 1 Tax protein

represses M

yoD-dependent transcription by inhibiting MyoD-

binding to the KIX domain of p300.“

[From AliB

aba]

�Can be labor intensive to construct training data

�Question: how m

uch training data is sufficient?

Page 17: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Popular Machine Learning Methods for IE

�Naive Bayes

�SRV [Freitag-98], Inductive Logic Programming

�Rapier [Califf& M

ooney-97]

�Hidden M

arkov M

odels [Leek, 1997]

�Maxim

um Entropy M

arkov M

odels [McCallu

m et al, 2000]

�Conditional Random Fields [Lafferty et al, 2000]

�Im

plementations available:

�Mallet (Andrew M

cCallum)

�crf.sourceforge.net (SunitaSarawagi)

�MinorThirdminorthird.sourceforge.net (W

illiam Cohen)

For details: [Feldman, 2006 and C

ohen, 2004]

Page 18: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Example of State-based ML Method

Page 19: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Extracted Entities: Resolving Duplicates

Document 1:The Justice Department has officially ended its inquiry into theassassinations

ofJohn F. Kennedyand Martin Luther King Jr., finding ``no persuasive evidence'' to

support conspiracy theories, according to department documents. The House Assassinations

Committee concluded in 1978 thatKennedywas ``probably'' assassinated as the result of a

conspiracy involving a second gunman, a finding that broke from the Warren Commission

's belief that Lee Harvey Oswald acted alone in Dallason Nov. 22, 1963.

Document 2: In 1953, MassachusettsSen. John F. Kennedymarried Jacqueline Lee

Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy

confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston,

``I do not speak for my church on public matters, and the churchdoes not speak for me.'‘

Document 3:David Kennedywas born in Leicester, England in 1959.…Kennedyco-

edited The New Poetry (BloodaxeBooks 1993), and is the author of New Relations: The

Refashioning Of British Poetry 1980-1994 (Seren1996).

[From Li, M

orie, & Roth, AI Magazine, 2005]

Page 20: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Important Problem, Addressed in Part II

�Appears in numerous real-world contexts

�Plagues m

any applications

�Citeseer, DBLife, AliB

aba, Rexa, etc.

Page 21: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Outline

�Overview of Inform

ation Extraction

�Entity tagging

�Relation extraction

�Scaling up Inform

ation Extraction

�Focus on scaling up to large collections (where

data m

ining and M

L techniques shine)

�Other dim

ensions of scalability

Page 22: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Relation Extraction: D

isease Outbreaks

�Extract structured relations from text

May 19 1995, Atlanta --The Centers for Disease Control

and Prevention, which is in the front line of the world's

response to the deadly Ebola epidemic in Zaire ,

is finding itself hard pressed to cope with the crisis…

Zaire

Ebola

May 1995

U.S.

Pneumonia

Feb. 1995

July 1995

Jan. 1995

Date

Location

Disease Name

U.K.

Mad Cow D

isease

Ethiopia

Malaria

Information

Extraction System

(e.g., NYU’s Proteus)

Disease O

utbreaks in The New York Tim

es

Page 23: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Example: Protein Interactions

„Weshowthat CBF-A and CBF-C interact

witheachother to form a CBF-A-CBF-C complex

and that CBF-B doesnotinteractwith CBF-A or

CBF-C individuallybutthatitassociateswiththe

CBF-A-CBF-C complex.“

CBF-A

CBF-C

CBF-B

CBF-A-CBF-C complex

interact

complex

associates

Page 24: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Relation Extraction

�Typically require Entity Tagging as preprocessing

�Knowledge Engineering

�Rules defined over lexical items

�“<company> located in <location>”

�Rules defined over parsed text

�“((O

bj<company>) (Verb located) (*) (Subj<location>))”

�Proteus, GATE, …

�Machine Learning-based

�Learn rules/patterns from examples

Dan Roth 2005, Cardie

2006, Mooney 2005, …

�Partially-supervised: bootstrap from “seed”examples

Agichtein & G

ravano2000, Etzioniet al., 2004, …

�Recently, hybrid m

odels [Feldman2004, 2006]

Page 25: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Example Extraction Rule [NYU Proteus]

Page 26: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Example Extraction Patterns:

Snowball [AG2000]

{<’s 0.7> <in 0.7>

<headquarters 0.7>}

ORGANIZATION

LOCATION

{<-0.75>

<based 0.75>}

ORGANIZATION

LOCATION

Page 27: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Accuracy of Information Extraction

�Errors cascade (error in entity tag �

error in relation

extraction)

�This estimate is optimistic:

�Holds for well-established tasks

�Many specific/novel IE tasks exhibit lower accuracy

[Feldman, ICML 2006 tutorial]

Page 28: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Outline

�Overview of Inform

ation Extraction

�Entity tagging

�Relation extraction

�Scaling up Information Extraction

�Focus on scaling up to large collections (where

data m

ining and M

L techniques shine)

�Other dim

ensions of scalability

Page 29: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Dimensions of Scalability

�Efficiency/corpus size

�Years to process a large colle

ctions (centuries for Web)

�Heterogeneity/diversity of inform

ation sources

�Requires m

any rules (expensive to apply)

�Many sources/conventions (expensive to m

aintain rules)

�Accessing required documents

�Hidden W

eb databases are not crawlable

�Number of Extraction Tasks (not covered)

�Many patterns/rules to develop and m

aintain

�Open research area

Page 30: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Scaling Up Information Extraction

�Scan-based extraction

�Classification/filtering to avoid processing documents

�Sharing common tags/annotations

�General (keyword) index-based techniques

�QXtract, KnowItAll

�Specialized indexes

�BE/KnowItNow, Linguist’s Search Engine

�Parallelization/Adaptive Processing

�IBM W

ebFountain, Google’s M

ap/Reduce

�Application: Question Answering

�AskMSR, Arranea, Mulder

Page 31: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Scan

Output Tokens

Extraction

System

Text Database

3.Extract output

tokens

2.Process

documents

1.Retrieve docs

from database

��Scan

Scanretrieves and processes documents

sequentially (until reaching target recall)

Execution time = |Retrieved Docs| ·(R + P)

Tim

e for retrieving

a document

Tim

e for processing

a document

Page 32: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Efficient Scanning for Information

Extraction

�80/20 rule: use few sim

ple rules to capture

majority of the cases [PRH2004]

�Train a classifier to discard irrelevant

documents without processing [GHY2002]

�Share base annotations (entity tags) across

multiple tasks

Page 33: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Filtered Scan

Output Tokens

Extraction

System

Text Database

4.Extract output

tokens

3.Process

documents

1.Retrieve docs

from database

��Scan

Scanretrieves and processes all documents(until reaching target recall)

��Filtered Scan

Filtered Scanuses a classifier to identify and process only promising documents

(e.g., the Sports section of NYT is unlikely to describe diseaseoutbreaks)

Execution time = |Retrieved Docs| * ( R + F + P)

Tim

e for retrieving

a document

Tim

e for filtering

a document

Classifier

2.Filter documents

Tim

e for processing

a document

Classifier selectivity(σ≤1)

σ

filtered

Page 34: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Exploiting Keyword and Phrase Indexes

�Generate queries to retrieve only relevant documents

�Data m

ining problem!

�Some m

ethods in literature:

�Traversing Q

uery G

raphs [AIG

2003]

�Iteratively refine queries [AG2003]

�Iteratively partition document space [Etzioniet al., WWW 2004]

�Case studies: QXtract, KnowItAll

Page 35: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Simple Strategy: Iterative Set Expansion

Output Tokens

Extraction

System

Text Database

3.Extract tokens

from docs

2.Process retrieved

documents

1.Query

database with

seed tokens

Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q

Tim

e for retrieving

a document

Tim

e for answering

a query

Tim

e for processing

a document

Query

Generation

4.Augment seed

tokens with

new tokens

(e.g., [Ebola AND Zaire])

(e.g., <Malaria, Ethiopia>)

Page 36: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Querying Graph

�The querying graph is a

bipartite graph, containing

tokens and documents

�Each token (transform

ed to a

keyword

query) retrieves

documents

�Documents contain

tokens

Tokens

Documents

t 1 t 2 t 3 t 4 t 5

d1

d2

d3

d4 d5

<SARS, China>

<Ebola, Zaire>

<Malaria, Ethiopia>

<Cholera, Sudan>

<H5N1, Vietnam>

[AIG

2003]

Page 37: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Recall Limit: ReachabilityGraph

t 1retrievesdocument d1

that containst 2

t 1 t 2t 3 t 4

t 5

Tokens

Documents

t 1 t 2 t 3 t 4 t 5

d1

d2

d3

d4 d5

Upper recall limit: determ

ined by the size

of the biggest connected component

Reachability

Graph

Page 38: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Reachability Graph for DiseaseOutbreaks

Page 39: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Connected Components Visualization

DiseaseOutbreaks,

New York Times 1995

Page 40: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Getting Around Reachability Limit

�KnowItAll:

�Add keywords to partition documents into

retrievable disjoint sets

�Submit queries with parts of extracted instances

�QXtract

�General queries with m

any m

atching documents

�Assumes m

any documents retrievable per query

Page 41: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

QXtract [AG2003]

1.Get docu

men

t sample

with

“likely neg

ative”

and “likely

positive”

exam

ples.

2.Lab

el sam

ple docu

men

ts using

inform

ation extraction system

as “oracle.”

3.Train classifiers

to “reco

gnize”

useful docu

men

ts.

4.Gen

erate queriesfrom classifier

model/rules.

Query Generation

Inform

ation Extraction

Text Database

Search Engine

??

??

??

??

++

++

--

--

Seed Sampling

Classifier Training

Queries

tuple1

tuple2

tuple3

tuple4

tuple5

++

++

--

--

User-Provided Seed Tuples

Page 42: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

KnowItAllArchitecture

�System W

ork Flow

Extractor

Search Engine Interface

Assessor

Database

Web Pages

Rule

Rule template

NP1 “such as”NPList2

& head(N

P1) = plural(name(C

lass1))

&properN

oun(head(each(N

PList2)))

=>

instanceOf(Class1,head(each(N

PList2)))

NP1 “such as”NPList2

& head(N

P1) = “countries”

&properN

oun(head(each(N

PList2)))

=>

instanceOf(Country,head(each(N

PList2)))

Keywords: “countries such as”

Slides: [ZhengShao, UIUC]

Page 43: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

KnowItAllArchitecture (Cont.)

�System W

ork Flow

Extractor

Search Engine Interface

Assessor

Database

Web Pages

Rule

Extracted Inform

ation

Knowledge

the United Kingdom and Canada

India

North Korea, Iran, India and Pakistan

Japan

Iraq, Italy and Spain

…the United Kingdom

Canada

India

North Korea

Iran

Discriminator Phrase

Country AND X

“Countries such as X”

Country AND the United Kingdom

Countries such as the United Kingdom

Frequency

Page 44: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Using Generic Indexes: Summary

�Order of magnitude scale-up in corpus size

�Indexes are approxim

ate (queries not precise)

�Require m

any documents to retrieve

�Can we do better?

Page 45: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Index Structures for Information Extraction

�Bindings Engine [CE2005]

�Indexes of entities: [CGHX2006], [IBM Avatar]

�Other systems (not covered)

�Linguist’s search engine (P. Resniket al.): indexes

syntactic structures

�FREE: Indexing regular expressions: J. Cho et al.

Page 46: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Bindings Engine (BE) [Slides: Cafarella2005]

�Bindings Engine (BE) is search engine where:

�No downloads during query processing

�Disk seeks constant in corpus size

�#queries = #phrases

�BE’sapproach:

�“Variabilized”search query language

�Pre-processes all documents before query-tim

e

�Integrates variable/type data with inverted index,

minim

izing query seeks

Page 47: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

BE Query Support

cities such as <NounPhrase>

President Bush <Verb>

<NounPhrase>is the capital of <NounPhrase>

reach m

e at <phone-number>

�Any sequence of concrete term

s and typed

variables

�NEAR is insufficient

�Functions (e.g., “head(<NounPhrase>)”)

Page 48: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

BE Operation

�Like a generic search engine, BE:

�Downloads a corpus of pages

�Creates an index

�Uses index to process queries efficiently

�BE further requires:

�Set of indexed types (e.g., “NounPhrase”), with a

“recognizer”for each

�String processing functions (e.g., “head()”)

�A BE system can only process types and

functions that its index supports

Page 49: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Index design

�Search engines handle scale with inverted index

�Single disk seekper term

�Mainly sequential reads

�Disk analysis

�Seeksrequire ~5 m

s, so only 200/sec

�Sequential readstransfer 10-40 M

B/sec

�Inverted index m

inim

izes expensive seeks; BE

should do the same

�Parallel downloads are just parallel, distributed

seeks; still very costly

Page 50: Eugene Agichtein Microsoft Research - Emory University

words

such

seattle

nickels

mayors

give

friendly

cities

billy

as

#docs

docid0

docid1

#docs

docid0

#docs

docid0

docid1

docid2

docid#docs-1

…#docs

docid0

docid1

docid2

#docs

docid0

docid1

#docs

docid0

docid1

docid2

docid#docs-1

#docs

docid0

#docs

docid0

docid1

docid2

docid3

#docs

docid0

docid1

docid2

docid#docs-1

#docs

docid0

docid1

docid2

docid#docs-1

Page 51: Eugene Agichtein Microsoft Research - Emory University

words

such

seattle

nickels

mayors

give

friendly

cities

billy

as

#docs

docid0

docid1

docid2

docid#docs-1

104

21

150

322

2501

15

99

322

426

1309

1.Test for equality

2.Advance smaller pointer

3.Abort when a list is exhausted

Returned docs:

322

Query: such as

#docs

docid0

docid1

docid2

docid#docs-1

Page 52: Eugene Agichtein Microsoft Research - Emory University

words

such

seattle

nickels

mayors

give

friendly

cities

billy

as

#docs

…pos 0

pos 1

docid#docs-1pos #

docs-1

#posns

pos 0

pos 1…

pos #

pos-1

docid0

docid1

#docs

…pos 0

pos 1

docid#docs-1pos #

docs-1

docid0

docid1

#posns

pos 0

pos 1…

pos #

pos-1

In phrase queries, match positions as well

#docs

docid0

docid1

docid2

docid#docs-1

#docs

docid0

docid1

docid2

docid#docs-1

“such as”

Page 53: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Neighbor Index

�At each position in the index, store “neighbor text”

that might be useful

�Let’s index <NounPhrase> and <Adj-Term

>

“I love cities such as Philadelphia.”

Left

Right

AdjT: “love”

Page 54: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Neighbor Index

�At each position in the index, store “neighbor text”

that might be useful

�Let’s index <NounPhrase> and <Adj-Term

>

“I love cities such as Philadelphia.”

Left

Right

AdjT: “cities”

NP: “cities”

AdjT: “I”

NP: “I”

Page 55: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Neighbor Index

Left

Right

AdjT: “such”

Query:“cities such as <NounPhrase>”

AdjT: “Philadelphia”

NP: “Philadelphia”

“I love cities such as Philadelphia.”

Page 56: Eugene Agichtein Microsoft Research - Emory University

neighbor 1

str 1

NPrightPhiladelphia

words

such

philadelphia

nickels

mayors

give

friendly

cities

billy

as

#docs

…pos 0

pos 1

docid#docs-1pos #

docs-1

#posns

pos 0

pos 1…

pos #

pos-1

docid0

docid1

“cities such as <NounPhrase>”

1.Find phrase query positions, as with phrase queries

2.If term is adjacent to variable, extract typed value

#posns

pos 0

neighbor 0

pos 1

neighbor 1…

pos #

pos-1…

#neighbors

blk_offset

3<offset>

19

12

In doc 19, starting at posn8:

“I love cities such as Philadelphia.”

neighbor 0

str 0

AdjTleft

such

Page 57: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Asymptotic Efficiency Analysis

�kconcrete term

s in query

�Bbindings found for query

�Ndocuments in corpus

�Tindexed types in corpus

O(N)

O(k + B)

Std M

odel

O(N * T)

O(k)

BE

Index Space

Query Tim

e

(in seeks)

Band Nscale together; koften small; Toften exclusive

Page 58: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Experiment 2: KnowItAllon BE

150k

50k

10k

Num

Extractions

89,641s

29,880s

5,976s

Std Imp/

Google

BE

Speedup

Page 59: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Experiment 2: KnowItAllon BE

150k

50k

10k

Num

Extractions

89,641s

29,880s

5,976s

Std Imp/

Google

N/A

95s

95s

BE

N/A

314x

63x

Speedup

Page 60: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

BE Summary

�Significant im

provement over generic indexes

�Index size grows linearly with number of types

�Some M

L-based patterns (e.g., HMMs, CRFs,

character models) not supported

�Can we use it for general QA, RE tasks?

Page 61: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Similar Approach: [CGHX2006]

�Support “relationship”keyword queries over indexed entities

�Top-K support for early term

ination

Page 62: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Indexing Thousands of Entity Types

[Slides from Chakrabartiet al., WWW 2006]

Page 63: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Workload-Driven Indexing

Page 64: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Selecting Types to Index

Page 65: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Parallelization/Adaptive Processing

�Parallelize processing:

�IBM W

ebFountain

[GCG+2004]

�Google’s M

ap/Reduce

�Select most efficient access strategy

�Cost Estimation and O

ptimization [IAJG2006]

Page 66: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Map/Reduce Framework

Page 67: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Map/Reduce Framework

�General framework

�Scales to 1000s of machines

�Im

plemented in Nutch

�“M

aps”easily to inform

ation extraction

�Map phase:

�Parse individual documents

�Tag entities

�Propose candidate relation tuples

�Reduce phase

�Merge m

ultuple

mentionesof same relation tuple

�Resolve co-references, duplicates

Page 68: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Cost Optimizer for Text-Centric Tasks

100

1,000

10,000

100,000 0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00

Recall

Execution Time (secs)

Scan

Filt. Scan

Iterative Set Exp

ansion

Automatic

Query G

en.

OPTIM

IZED

Page 69: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Other Dimensions of Scalability:

Managing Complex Features [CNS2006]

R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998

Surface features

(cheap)

Authors

Database lookup features

(expensive!)

Inverted

index

Efficient search for

top-k most similar

entities

Many large

tables

1.Batch up to do better than

individual top-k?

2.Find top segmentation without

top-k matches for all segments?

J. Widom

E. F. Codd

R. K. Narayan

Nick Koudas

S. Chakrabarti

S. Sudarshan

Steve Cook

Ronald Fagin

Page 70: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Other Dimensions of Scalability:

Extraction Pattern Discovery [K

onigand Brill, KDD 2006]

�Use suffix array to efficiently explore

candidate patterns

Page 71: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Application: W

eb Question Answering

AskMSR: does not use patterns

�Sim

plicity �

scalability (cheap to compute n-grams)

�Challenge: do better than n-grams on web Q

A

Page 72: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Summary

�Brief overview of inform

ation extraction from text

�Techniques to scale up inform

ation extraction

�Scan-based techniques (lim

ited impact)

�Exploiting general indexes (lim

ited accuracy)

�Build

ing specialized index structures (most promising)

�Scalability is a data m

ining problem

�Querying graphs �

link discovery

�Workload m

ining for index optimization

�Must be optimized for specific text mining application?

Page 73: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Related Challenges

�Duplicate entities, relation tuplesextracted

�Missing values

�Extraction errors

�Inform

ation spans m

ultiple documents

�Combining relation tuplesinto complex

events

Page 74: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Break

�Eugene Agichtein, Microsoft &

Emory University

�http://www.m

athcs.emory.edu/~eugene/

�eugene@

mathcs.emory.edu

�Next: Scalable Inform

ation Integration

�Core set of techniques to enable large-scale IE, text

mining

�SunitaSarawagi

Page 75: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

References

�[AGI2005] E. Agichtein, Scaling Inform

ation Extraction to Large Document Colle

ctions,

IEEE Data Engineering Bulletin, 2005

�[AG2003] E. Agichtein and L. Gravano. Querying text databases for efficient

inform

ation extraction. ICDE 2003

�[AIG

2003] E. Agichtein, P. Ipeirotis, and L. Gravano, Modelin

g Q

uery-Based Access to

Text Databases, WebDB2003

�[CDS+2005] l J. Cafarella, D. Downey, S. Soderland, and O

ren Etzioni. KnowItNow: Fast,

scalable inform

ation extraction from the web. (H

LT/EMNLP), 2005.

�[CE2005] M

. J. Cafarella

and O

. Etzioni. A search engine for natural language

applications. (W

WW), 2005

�[CNS2006] A. Chandel, P.C. Nagesh, and S. Sarawagi. Efficient batch top-k search for

dictionary-based entity recognition. ICDE 2006

�[CRW2005] S. Chaudhuri, R. Ramakrishnan, and G

. Weikum. Integrating db and ir

technologies: What is the sound of one hand clapping?, CIDR 2005.

�[CGHX2006] K. Chakrabarti, V. Ganti, JiaweiHan, D. Xin, Ranking O

bjects Based on

Relationships, SIG

MOD 2006

�[CPD 2006] S. Chakrabarti, KritiPuniyaniand SujathaDas, Optimizing Scoring Functions

and Indexes for Proxim

ity Search in Type-annotated Corpora. WWW 2006

Page 76: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

References II

�[DBB+2002] S. Dumais, M. Banko, E. Brill, J. Lin and A. Ng (2002). P. Bennett, S. Dumais

and E. Horvitz (2002). W

eb question answering: Is m

ore always better? SIG

IR 2002

�[G

HY2002] R. Grishman, S. Huttunen, and R. Yangarber. Inform

ation extraction for

enhanced access to disease outbreak reports. Journal of Biomedical Inform

atics, 2002.

�[G

CG+2004] D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J.

Zien. How to build a W

ebFountain: An architecture for very large-scale text analytics. IBM

Systems Journal, 2004.

�[IAJG2006] Ipeirotis, E. Agichtein, P. Jain, and L. Gravano, To Search or to Crawl:

Towards a Q

uery O

ptimizer for Text-Centric Tasks, SIG

MOD 2006

�[KRV+2004] R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, H. Zhu, Avatar: A

Database Approach to Semantic Search, SIG

MOD 2006

�[PRH2004] P. Pantel, D. Ravichandran, and E. Hovy. Towards terascale

knowledge

acquisition. In Conference on Computational Linguistics (COLING), 2004.

�[PE2005] P. Resnikand A. Elkiss. The linguist’s search engine: An overview

(demonstration). In ACL, 2005.

�[PDT2001] P.D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In

European Conference on M

achine Learning (ECML), 2001.

�C. König

and E. Brill, Reducing the Human O

verhead in Text Categorization, KDD 2006

Page 77: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

The data integration problem

De-duplication

Extracted entities

•People/organization names,

•Addresses,

•Citations,

•Disease outbreaks

Multiple related entities with m

ostly text attributes

Page 78: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

The challenge of data integration

�Large lists with m

ultiple noisy m

entions of the

same entity

�No single key to order or cluster likely

duplicates while separating them from sim

ilar

but different entities.

�Need to depend on fuzzy and compute-

intensive string sim

ilarity functions

�Cannot afford to compare with m

ention with

every other.

Page 79: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Integration: outline

�Duplicate Entity elim

ination

�Defining duplicates: string sim

ilarity functions

�Scalable algorithms for finding sim

ilar pairs

�Batch m

ode

�Join algorithms to find duplicate pairs

�Onlin

e m

ode

�Efficient index design (same as indices in

batch m

ode)

�Create groups from duplicate entity pairs

�Single entity: data partitions

�Multiple entities: Colle

ctive m

ulti-attribute deduplication

Page 80: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

String similarity measures

�Token-based

�Examples

�Jaccard

�TF-IDF Cosine sim

ilarities

�Suitable for large documents

�Character-based

�Examples:

�Edit-distance and variants like Levenshtein, Jaro-W

inkler

�Soundex

�Suitable for short strings with spelling m

istakes

�Hybrids

Page 81: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Token based

�Tokens/words

�‘AT&T Corporation’-> ‘AT&T’, ‘Corporation’

�Sim

ilarity: various m

easures of overlap of two sets S,T

�Jaccard(S,T) = |S∩T|/|S∪T|

�Example

�S = ‘AT&T C

orporation’-> ‘AT&T’, ‘Corporation’

�T = ‘AT&T C

orp’-> ‘AT&T’, ‘Corp.’

�Jaccard(S,T) = 1/3

�Variants: weights attached with each token

�Useful for large strings: example web documents

Page 82: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Cosine similarity with TF/IDF weights

�Cosine sim

ilarity:

�Sets transform

ed to vectors with each term

as dim

ension

�Sim

ilarity: dot-product of two vectors each norm

alized to unit

length

��

cosine of angle between them

�Term

weight == TF/IDF

�log (tf+1) * log idfwhere

�tf: frequency of ‘term

’in a document d

�idf: number of documents / number of documents containing ‘term

�Intuitively: rare ‘term

s’are m

ore important

�Widely used in traditional IR

�Example:

�‘AT&T Corporation’, ‘AT&T Corp’or ‘AT&T Inc’

�Low weights for ‘Corporation’,’Corp’,’Inc’, Higher weight for ‘AT&T’

Page 83: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Edit Distance [G98]

�Given two strings, S,T, edit(S,T):

�Minim

um cost sequence of operations to transform

S to T.

�Character Operations: I (insert), D (delete), R (Replace).

�Example: edit(Error,Eror) = 1, edit(great,grate) = 2

�Folklore dynamic programming algorithm to compute edit();

�O(m

2) versus O

(2m log m

) for token-based m

easures

�Several variants (gaps,weights) ---becomes NP-complete

easily.

�Varying costs of operations: can be learnt [RY97].

�Observations

�Suitable for common typing m

istakes on small strings

�Comprehensive vsComprenhensive

�Problematic for specific domains

Page 84: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Edit Distance with affine gaps

�Differences between ‘duplicates’often due to abbreviations or

whole word insertions.

�IBM Corp. closer to ATT Corp. than IBM Corporation

�John SmithvsJohn Edward SmithvsJohn E. Smith

�Allo

w sequences of mis-m

atched characters (gaps) in the

alig

nment of two strings.

�Penalty: using the affine cost model

�Cost(g) = s+e ⋅l

�s: cost of opening a gap

�e: cost of extending the gap

�l: length of a gap

�Sim

ilar dynamic programming algorithm

�Parameters domain-dependent, learnable, e.g., [BM03,

MBP05]

Page 85: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Hybrids [CRF03]

�Example: Edward, John Vs Jon Edwerd

�Let S = {a1,…

,aK}, T = {b1,…

bL} sets of term

s:

�Sim

(S,T) =

�Sim

’() some other sim

ilarity function

�C(t,S,T) = {w∈S s.t ∃

v ∈

T, sim

’(w,v) > t}

�D(w

,T) = m

axv∈Tsim

’(w,v), w ∈

C(t,S,T)

�sTFIDF=

)b,

a('max

1j

i

1

1sim

K

K i

L j∑ =

=

∑∈

),

,(

),

(*

),

(*

),

(T

St

Cw

Tw

DT

wW

Sw

W

Page 86: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Integration: outline

�Duplicate Entity elim

ination

�Defining duplicates: string sim

ilarity functions

�Scalable algorithms for finding sim

ilar pairs

�Join algorithms to find duplicate pairs

�Efficient index design

�Create groups from duplicate entity pairs

�Single entity: data partitions

�Multiple entities: Collective m

ulti-attribute

deduplication

Page 87: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Finding all duplicate pairs in large lists

�Input: a large list of records R with string attributes

�Output: all pairs (S,T) of records in R

which satisfy a Sim

ilarity

Criteria:

�Jaccard(S,T) > 0.7

�Overlapping tokens (S,T) > 5

�TF-IDF-C

osine(S,T) > 0.8

�Edit-distance(S,T) < k

�More complicated sim

ilarity functions use these as filters (high

recall, low precision)

�Naïve m

ethod: for each record pair, compute sim

ilarity score

�I/O and CPU intensive, not scalable to m

illions of records

�Goal: reduce O

(n2) cost to O

(n*w

), w

here w << n

�Reduce number of pairs on which sim

ilarity is computed

Page 88: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

General template for similarity functions

�Sets: r,s

threshold

Common

tokens

Page 89: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Approximating edit distance [GIJ+01]

�EditDistance(s,t) ≤d →

|q-grams(s)∩

q-

grams(t)| ≥

max(|s|,|t|) -(d-1)*q –

1

�Q-grams (sequence of q-characters in a field)

�‘AT&T Corporation’

�3-grams: {‘AT&’,’T&T’,’&T ‘, ‘T C’,’

Co’,’orp’,’rpo’,’por’,’ora’,’rat’,’ati’,’tio’,’ion’}

�Typically, q=3 �

Large q-gram sets

�Approxim

ate large q-gram sets to smalle

r sets

Page 90: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Reducing size of large sets

�MinHashmethod

�Random projection m

ethod

Page 91: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Minhashmethod

�MinHashsignature of sets

�Choose a random perm

utation P of all tokens

�MinHash(S

={s

1,s

2,..,sn}) = s

jif s

jis the first token in

perm

utation P

�Jaccard(s,t) = Probability(M

inhash(s) = M

inhash(t))

�Choose B such perm

utations to improve accuracy

�Extended to Jaccard

on weighted term

s(C

harikar STOC

2002)

�Highly effective in practice

�Mirror detection on webpages (Broder et al 1997)

�Cleaning database records (Chaudhuriet al 2004)

Page 92: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Random Projection method

�Project each token to a random value in [-1 1]

�Random projection of set S={s

1,s

2,..,sn}

�1 if sum of projection of each token in S > 0

�0 otherw

ise

�Cosine(s,t) ∝Probability(projection of s and t

same)

�Repeat for B projections to improve accuracy

Page 93: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Approximate join problem

�Find all pairs s,t where Token-overlap(s,t) > T

�Main idea: exploit threshold T to reduce work

�Two algorithms

�Pair-C

ount

�Indexed Count

Page 94: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

�Step 1: Pass over data, for each token create list of sets

that contain it

�Step 2: generate pairs of sets, count and output those

with count > T

Pair-Count

Inverted index

t1 t2

2 2 1

The pair-counting table could be large �

too memory intensive

Not good when list lengths highly skewed

Self-join

lists

(Broderet al WWW 1997)

Data

Page 95: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Step 2: Using each record,

probe

merge lists to find rids in T of them

Probe-Count

Step 1: Create inverted index

w1

w2

Data

Inverted index

Heap

w1, w4,wk

Page 96: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Threshold sensitive list merge

Lists to be merged

Sort by increasing size

Except T-1 largest,

organize rest in heap

(T=3)

Heap

Pop from heap successively

Heap

Search in large lists in increasing order

Use lower bounds to terminate early

(SK04, CGK06)

Page 97: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Summary of the pair creation step

�Can be extended to the weighted case fitting

the general framework.

�More complicated sim

ilarity functions use set

sim

ilarity functions as filters

�Set sizes can be reduced through techniques

like M

inHash (weighted versions also exist)

�Small sets (average set size < 20), m

ost database

entities with word tokens: use as-is

�Large sets: web documents, sets of q-grams

�Use M

inhash or random projection

Page 98: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Integration: outline

�Duplicate Entity elim

ination

�Defining duplicates: string sim

ilarity functions

�Scalable algorithms for finding sim

ilar pairs

�Join algorithms to find duplicate pairs

�Efficient index design

�Create groups from duplicate entity pairs

�Single entity: data partitions

�Multiple entities: Collective m

ulti-attribute

deduplication

Page 99: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Creating partitions

�Transitive closure

�Dangers: unrelated records collapsed into

a single cluster

2

1

10

7

93

5

8

6

4

2

1

10

7

93

5

8

6

4

3 disagreements

�Correlation clustering (Bansalet al 2002)

�Partition to m

inim

ize total disagreements

1.Edges across partitions

2.Missing edges within partition

�More appealing than clustering:

�No m

agic constants: number of clusters,

sim

ilarity thresholds, diameter, etc

�Extends to real-valued scores

�NP Hard: many approxim

ate algorithms

Page 100: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Algorithms for correlation clustering

�Integer programming form

ulation (Charikar03)

�Xij= 1 if i and j in same partition, 0 otherw

ise

�Im

practical: O

(n3) constraints

�Practical substitutes (Heuristics, no guarantees)

�Agglomerative clustering: repeatedly m

erge closest clusters

�Efficient im

plementation possible via heaps (BG 2005)

�Definition of closeness subject to tuning

�Greatest reduction in error

�Average/M

ax/M

in sim

ilarity

Page 101: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Empirical results on data partitioning

�Setup: Online comparison shopping,

�Fields: name, model, description, price

�Learner: O

nline perceptronlearner

�Complete-link clustering >> single-link clustering(transitive closure)

�An issue: when to stop m

erging clusters

Digital cameras

Camcoder

Luggage

(From: Bile

nkoet al,

2005)

Page 102: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Other methods of partitioning

[Chaudhuriet al IC

DE 2005]

�Partitions are compact and relatively far from other points

�A Partition has to satisfy a number of criteria

�Points within partition closer than any points outside

�#points within p-neighborhood of each partition < c

�Either number of points in partition < K, or diameter < θ

Page 103: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Algorithm

�Consider case where partitions required to be of

size < K �

if partition Pjof size m

in output then

�m-nearest neighbors of all r in Pi is Pi

�Neighborhood of each point is sparse

�For each record, do efficient index probes to get

�Get K nearest neighbors

�Count of number of points in p-neighborhood for each m

nearest neighbors

�Form

pairs and perform

grouping based on above

insight to find groups

Page 104: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Summary: partitioning

�Transitive closure is a bad idea

�No verdict yet on best alternative

�Difficult to design an objective and algorithms

�Correlation clustering

�Reasonable objective with a skewed scoring function

�Poor algorithms

�Greedy agglomerative clustering algorithms ok

�Greatest minim

um sim

ilarity (complete-link), benefit

�Reasonable perform

ance with heap-based implementation

�Dense/Sparse partitioning

�Positives: Declarative objective, efficient algorithm

�Parameter retuning across domains

�Need comparison betw

een complete-link, Dense/Sparse, and

Correlation clustering.

Page 105: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Collective de-duplication: m

ulti-attribute

Collectively de-duplicate entities and its many

attributes

Associate variables for predictions

for each attribute k

each record pair (i,j)

for each record pair

from Parag& Domingos2005

a1

a2

a3

Akij

Rij

Page 106: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Dependency graph

A112

A134

A212

A234

A312= A

334

R12

R34

Scoring functions

�Independent scores

�sk(A

k,a

i,aj) Attribute-level

�Any classifier on various text

sim

ilarities of attribute pairs

�s(R

,bi,b

j) Record-level

�Any classifier on various

sim

ilarities of all k attribute

pairs

�Dependency scores

�dk(A

k, R): record pair, attribute

pair

10

10

71

24

Page 107: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Joint de-duplication steps

�Jointly pick 0/1 labels for all record pairs R

ijand all K attribute

pairs A

kijto m

axim

ize

�When dependency scores associative

�dk(1,1) + d

k(0,0) >= d

k(1,0)+dk(0,1)

�Can find optimal scores through graph M

INCUT

[s(Rij

ij∑)+

s k(Aijk

k∑)+d k(Rij,Aijk)]

Page 108: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Other issues and approaches

�Partitioning

�Transitive-closure as a post processing

�Results:

•Collective deduplication

•does not help whole citations,

•helps attributes

•Transitive closure can cause drop in accuracy

�Combined partitioning and linked dedup

�Dong, HaLevy, Madhavan(SIG

MOD 2005)

�Bhattacharya and G

etoor(2005)

CitationAuthorVenue

PT

PT

PT

Independent

87

85

798949

59

Collective

86

89

898986

82

Page 109: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Collective linkage: set-oriented data

(Bhattacharya and G

etoor, 2005)

AnupGupta

P3

David W

hite

P4

Liu, Jane & J G

upta & W

hite, Don

P2

D W

hite, J Liu, A G

upta

P1

Scoring functions

�S(A

ij) Attribute-level

�Text sim

ilarity

�S(A

ij, N

ij) Dependency with

labels of co-author set

�Fraction of co-author set

assigned label 1.

�Score: αs(A

ij) + (1-α) s(A

ij, N

ij)

Algorithm

Greedy agglomerative clustering

�Merge author clusters with highest

score

�Redefine sim

ilarity between clusters

of authors instead of single authors

�Max of author-level sim

ilarity

D White

White, Don

J Liu

Liu Jane

A Gupta

AnupGupta

David White

D White

A Gupta

J Gupta

Page 110: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

Summary

�Scalable algorithms for finding pairs of duplicates

solved for selective set sim

ilarity functions

�Open problem: colle

ction of weak sim

ilarity functions

�Data partitioning: agglomerative heap-based

clustering algorithms ok.

�Open problems:

�scalable algorithms for correlation clustering

�multi-attribute colle

ctive partitioning

�Ambiguity resolution in networked entities

�Other issues: combined extraction and integration

(Online integration)

Page 111: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

1983

Year

10

Update Semantics

2

Canonical

Journal

Title

Id

ACM Trans.

Databases

16

17

AI

17

ACM TODS

10

Canonica

lName

Id

32

22

11

2

Autho

r

Article

4Jeffrey Ullm

an

4

3Ron Fagin

3

4J. Ullm

an

2

M Y Vardi

11

Canonical

Name

IdAuthors

Writes

Journals

Articles

Variant links

to canonical

entries

Database: norm

alized, stores noisy variants

3 Top-

level

entities

R. Faginand J. Helpern, Belief, awareness, reasoning. In AI1988[10] also see

Page 112: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

1988

1983

Year

17

Belief, awareness,

reasoning

7

10

Update Semantics

2

Canonical

Journal

Title

Id

10

ACM Trans.

Databases

16

17

AI

17

ACM TODS

10

Canonica

l

Name

Id

87

32

22

97

11

2

Autho

r

Article

4Jeffrey Ullm

an

4

3R Fagin

8

8J Helpern

9

3Ron Fagin

3

4J. Ullm

an

2

M Y Vardi

11

Canonic

al

Name

IdAuthors

Writes

Journals

Articles

R. Faginand J. Helpern, Belief, awareness, reasoning. In AI1988[10] also see

Extraction

Integration

Author: R. Fagin A

Author: J. Helpern

Title Belief,..reasoning

Journal: AI

Year: 1998

Page 113: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

References

�[ACG02] RohitAnanthakrishna, SurajitChaudhuri, VenkateshGanti: Elim

inating Fuzzy Duplicates in Data W

arehouses.

VLDB 2002: 586-597

�[BD83] Dina Bitton, David J. DeWitt: Duplicate Record Elim

ination in Large Data Files. ACM Trans. Database Syst. 8(2):

255-265 (1983)

�[BE77] Mike W

. Blasgen, KapaliP. Eswaran: Storage and Access in Relational Data Bases. IBM Systems Journal 16(4):

362-377 (1977)

�[BG04] IndrajitBhattacharya, LiseGetoor: Iterative record linkage for cleaning and integration. DMKD 2004: 11-18

�[C98] William W

. Cohen: Integration of Heterogeneous Databases W

ithout Common Domains Using Q

ueries Based on

Textual Sim

ilarity. SIG

MOD Conference 1998: 201-212

�[C00] William W

. Cohen: Data integration using sim

ilarity joins and a word-based inform

ation representation language. ACM

Trans. Inf. Syst. 18(3): 288-321 (2000)

�[CCZ02] Peter Christen, Tim

Churches, Xi Zhu: Probabilistic nameand address cleaning and standardization. Australasian

Data M

ining W

orkshop 2002.

�[CGGM04] SurajitChaudhuri, Kris G

anjam, VenkateshGanti, Rajeev M

otwani: Robust and Efficient Fuzzy M

atch for Online

Data Cleaning. SIG

MOD Conference 2003: 313-324

�[CGG+05] SurajitChaudhuri, Kris G

anjam, VenkateshGanti, RahulKapoor, VivekR. Narasayya, Theo Vassilakis: Data

cleaning in m

icrosoftSQL server 2005. SIG

MOD Conference 2005: 918-920

�[CGK06] SurajitChaudhuri, VenkateshGanti, RaghavKaushik: A primitive operator for sim

ilarity joins in data cleaning. ICDE

2006.

�[CGM05] SurajitChaudhuri, VenkateshGanti, Rajeev M

otwani: Robust Identification of Fuzzy Duplicates. ICDE 2005: 865-

876

�[CRF03] William W

. Cohen, PradeepRavikumar, Stephen E. Fienberg: A Comparison of String Distance M

etrics for Name-

Matching Tasks. IIWeb2003: 73-78

Page 114: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

References

�[DJ03] TamraparniDasu, Theodore Johnson: Exploratory D

ata M

ining and D

ata Cleaning John W

iley

2003

�[DNS91] David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider: An Evaluation of Non-E

quijoin

Algorithms. VLDB 1991: 443-452

�[DWI02] Data W

arehousing Institute report 2002

�[E00] Larry English: Plain English on D

ata Q

uality: Inform

ation Q

uality M

anagement: The Next Frontier.

DM Review M

agazine: April 2000. http://w

ww.dmreview.com/article_sub.cfm

?articleId=2073

�[FL95] ChristosFaloutsos, King-IpLin: FastM

ap: A Fast Algorithm for Indexing, Data-M

ining and

Visualization of Traditional and M

ultim

edia Datasets. SIG

MOD Conference 1995: 163-174

�[FS69] I. Fellegi, A. Sunter: A theory of record linkage. Journal of the American Statistical Association,

Vol64. No 328, 1969

�[G

98] D. Gusfield: Algorithms on strings, trees and sequences. Cambridge university press 1998

�[G

FS+01] Helena G

alhardas, Daniela Florescu, Dennis Shasha, Eric Sim

on, Cristian-AugustinSaita:

Declarative Data Cleaning: Language, Model, and Algorithms. VLDB2001: 371-380

�[G

IJ+01] Luis G

ravano, PanagiotisG. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh

Srivastava: Approxim

ate String Joins in a Database (Alm

ost) for Free. VLDB 2001: 491-500

�[G

IKS03] Luis G

ravano, PanagiotisG. Ipeirotis, Nick Koudas, Divesh Srivastava: Text joins in an RDBMS

for web data integration. WWW 2003: 90-101

�[G

KMS04] S. Guha, N. Koudas, A. Marathe, D. Srivastava: Merging the results of approxim

ate m

atch

operations. VLDB 2004.

�[G

KR98] David G

ibson, Jon M

. Kleinberg, PrabhakarRaghavan: Clustering C

ategorical Data: An

Approach Based on D

ynamical Systems. VLDB 1998: 311-322

Page 115: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

References

�[HS95] Mauricio A. Hernández, Salvatore J. Stolfo: The M

erge/Purge Problem for Large Databases.

SIG

MOD Conference 1995: 127-138

�[HS98] Gísli R. Hjaltason, HananSamet: Incremental Distance Join Algorithms for Spatial Databases.

SIG

MOD Conference 1998: 237-248

�[J89] M. A. Jaro: Advances in record linkage m

ethodology as applie

d to m

atching the 1985 census of

Tampa, Florida. Journal of the American Statistical Association 84: 414-420.

�[JLM03] LiangJin, Chen Li, SharadMehrotra: Efficient Record Linkage in Large Data Sets. DASFAA

2003

�[JU91] PetteriJokinen, EskoUkkonen: Two Algorithms for Approxim

ate String M

atching in Static Texts.

MFCS 1991: 240-248

�[KL51] S. Kullb

ack, R. Liebler: On inform

ation and sufficiency. The annals of mathematical statistics

22(1): 79-86. 1959.

�[KMC05] DmitriV. Kalashnikov, SharadMehrotra, ZhaoqiChen: Exploiting R

elationships for Domain-

Independent Data Cleaning. SDM 2005

�[KMS04] Nick Koudas, AmitMarathe, Divesh Srivastava: Flexible String M

atching Against Large

Databases in Practice. VLDB 2004: 1078-1086

�[KMS05] Nick Koudas, AmitMarathe, Divesh Srivastava: SPID

ER: flexible m

atching in databases.

SIG

MOD Conference 2005: 876-878

�[LLL00] Mong-Li Lee, TokWang Ling, WaiLupLow: IntelliClean: a knowledge-based intelligent data

cleaner. KDD 2000: 290-294

�[M

E96] Alvaro E. Monge, Charles Elkan: The Field M

atching Problem: Algorithms and Applications. KDD

1996: 267-270

Page 116: Eugene Agichtein Microsoft Research - Emory University

August 2006

Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration

References

�[M

E97] Alvaro E. Monge, Charles Elkan: An Efficient Domain-Independent Algorithm for Detecting Approxim

ately Duplicate

Database Records. DMKD 1997

�[RY97] E. Ristad, P. Yianilos: Learning string edit distance. IEEE Pattern analysis and m

achine intelligence 1998.

�[S83] Gerry Salton: Introduction to m

odern inform

ation retrieval. M

cGraw Hill 1987.

�[SK04] Sunita Sarawagi, AlokKirpal: Efficient set joins on sim

ilarity predicates. SIG

MOD Conference 2004: 743-754

�[TF95] Howard R. Turtle, James Flood: Query Evaluation: Strategies and O

ptimizations. Inf. Process. Manage. 31(6): 831-

850 (1995)

�[TKF01] S. Tejada, C. Knoblock, S. Minton : Learning object identification rules for inform

ation integration. Inform

ation

Systems, Vol26, No 8, 607-633, 2001.

�[W

94] William E. Winkler: Advanced m

ethods for record linkage. Proceedings of the section on survey research m

ethods,

American Statistical Association 1994: 467-472

�[W

99] William E. Winkler: The state of record linkage and current research problems. IRS publication R

99/04

(http://www.census.gov/srd/www/byname.htm

l)

�[Y02] William E. Yancey: BigMatch: A program for extracting probable m

atches from a large file for record linkage. RRC

2002-01. Statistical Research Division, U.S. Bureau of the Census.