Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

Language-Independent Set Expansion of Named Entities using the Web

Richard C. Wang & William W. CohenLanguage Technologies InstituteCarnegie Mellon UniversityPittsburgh, PA 15213 USA

2 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

Outline

Introduction System Architecture

FetcherExtractorRanker

Evaluation Conclusion



What is Set Expansion?

For example, Given a query: {“spit”, “boogers”, “ear wax”} Answer is: {“puke”, “toe jam”, “sweat”, ....}

More formally, Given a small number of seeds: x1, x2, …, xk wh

ere each xi St Answer is a listing of other probable elements: e1, e2, …, en where each ei St

A well-known example of a web-based set expansion system is Google Sets™ http://labs.google.com/sets



What is it used for?

Derive features for… Named Entity Recognition (Settles, 2004) (Talukdar, 2006)

Expand true named entities in training set Utilize expanded names to assign features to words

Concept Learning (Cohen, 2000) Given a set of instances, look in web pages for tables or lists

that contain some of those instances Automatically extract features from those pages Define features over the instances found

Relation Learning (Cafarella et al, 2005) (Etzioni et al, 2005) Extract items from tables or lists that contain given seeds Utilize extracted items and their contexts for learning relation

s



Our Set Expander: SEAL Features

Independent of human/markup language Support seeds in English, Chinese, Japanese, Korean, ... Accept documents in HTML, XML, SGML, TeX, WikiML, …

Does not require pre-annotated training data Utilize readily-available corpus: World Wide Web Learns wrappers on the fly

Based on two research contributions1. Automatic construction of wrappers

Extracts “lists” of entities on semi-structured web pages

2. Use of random graph walk Ranks extracted entities so that those most likely to be in the target

set are ranked higher

Set Expander for Any Language



System Architecture

Fetcher: download web pages from the Web Extractor: learn wrappers from web pages Ranker: rank entities extracted by wrappers

1. Canon2. Nikon3. Olympus

4. Pentax5. Sony6. Kodak7. Minolta8. Panasonic9. Casio10. Leica11. Fuji12. Samsung13. …



The Fetcher

Procedure:1. Compose a search query using all seeds

2. Use Google API to request for top N URLs We use N = 100, 200, and 300 for evaluation

3. Fetch URLs by using a crawler

4. Send fetched documents to the Extractor



The Extractor

Learn wrappers from web documents and seeds on the fly Utilize semi-structured documents Wrappers defined at character level

No tokenization required; thus language independent

However, very specific; thus page-dependent Wrappers derived from document d is applied to d only



<li class="toyota"><a href="http://www.geisauto.com/">

<li class="honda"><a href="http://www.curryauto.com/">

<li class="acura"><a href="http://www.curryauto.com/">

<li class="toyota"><a href="http://www.curryauto.com/">

<li class="nissan"><a href="http://www.curryauto.com/">

<li class="ford"><a href="http://www.curryauto.com/"> <li class="ford"><a href="http://www.curryauto.com/">

<li class="ford"><a href="http://www.curryauto.com/">




Extractor E1 finds maximally-long contexts that bracket

all instances of every seed

It seems to be working… but what if I add one more instance of “to

yota”?

It seems to be working too… but how about a more complex exa

mple?

…

…

…

…

…

…



<img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a> <ul><li><a href="http://www.curryhonda-ga.com/"> Curry Honda Atlanta...</li> <li><a href="http://www.curryhondamass.com/"> Curry Honda...</li> <li class="last"><a href="http://www.curryhondany.com/"> Curry Honda Yorktown...</li></ul> </li>

<li class="honda"><a href="http://www.curryauto.com/">

<li class="acura"><a href="http://www.curryauto.com/">

<li class="toyota"><a href="http://www.curryauto.com/">

<li class="nissan"><a href="http://www.curryauto.com/">

<li class="ford"><a href="http://www.curryauto.com/"> <img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a> <ul><li class="last"><a href="http://www.curryauto.com/"> Curry Ford...</li></ul> </li>

<img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a> <ul><li class="last"><a href="http://www.curryacura.com/"> Curry Acura...</li></ul> </li>

<img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a> <ul><li class="last"><a href="http://www.geisauto.com/toyota/"> Curry Toyota...</li></ul> </li>

<img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a> <ul><li class="last"><a href="http://www.geisauto.com/"> Curry Nissan...</li></ul> </li>

I am a noisy entity mention

Me too!

Can you find common contexts that bracket all instances of every seed?

I guess not!

Let’s try out

Extractor E2 and se

e if it works…

Extractor E2 finds

maximally-long contexts that bracket

at least one instance of every seed

Horray! It seems like Extractor E2 works! But how do we get rid of those noisy entity menti

ons?



Extractor: Summary

A wrapper consists of a pair of left (L) and right (R) context string All strings between (but not containing) L and R

are extracted Referred to as “candidate entity mention”

We compared two versions of wrapper: Maximally-long contextual strings that

bracket…1. all instances of every seed (Extractor E1)

2. at least one instance of every seed (Extractor E2)



The Ranker

Rank candidate entity mentions based on “similarity” to seeds

Noisy mentions should be ranked lower We compare two methods for ranking

1. Extracted Frequency (EF) # of times an entity mention is extracted

2. Random Graph Walk (GW) Probability of an “entity mention” node being rea

ched in a graph (explained in next slide)



Building a Graph

A graph consists of a fixed set of… Node Types: {seeds, document, wrapper, mention} Labeled Directed Edges: {find, derive, extract}

Each edge asserts that a binary relation r holds Each edge has an inverse relation r-1 (graph is cyclic)

“ford”, “nissan”, “toyota”

curryauto.com

Wrapper #3

Wrapper #2

Wrapper #1

Wrapper #4

“honda”26.1%

“acura”34.6%

“chevrolet”22.5%

“bmw pittsburgh”8.4%

“volvo chicago”8.4%

find

derive

extract northpointcars.com

Minkov et al. Contextual Search and Name Disambiguation in Email using Graphs. SIGIR 2006



Legend

Node: x, y, z

Edge Relation: r

An edge from x to y with relation r :

Stop Probability: λ

Random Graph Walk

where

otherwise

if

0

1)(I

zxzx

Probability of picking a target node y given an edge relation r and source n

ode x

“curryauto.com”, ...“wrapper #1”, ...“honda”, “acura”, ...

find, find-1, derive, derive-1, extract, extract-1

Probability of staying at a no

de (0.5)

Probability of picking an edge relation r given a s

ource node x

Probability of reaching any node z from x

yx rRecursive computation of probability

Probability of continuing to node z from x

Probability of staying at node x

ryx



Evaluation Datasets



Evaluation Method Mean Average Precision

Commonly used for evaluating ranked lists in IR Contains recall and precision-oriented aspects Sensitive to the entire ranking Mean of average precisions for each ranked list

Evaluation Procedure (per dataset)

1. Randomly select three true entities and use their first listed mentions as seeds

2. Expand the three seeds obtained from step 13. Repeat steps 1 and 2 five times4. Compute MAP for the five ranked lists

where L = ranked list of extracted mentions, r = rank

Prec(r) = precision at rank r

(a) Extracted mention at r matches any true mention

(b) There exist no other extracted mention at rank less than r that is of the same entity as the one at r

otherwise

trueare (b) and (a) if

0

1

)(NewEntity r

# True Entities = total number of true entities in this dataset



Experimental Results

Legend

[Extractor] + [Ranker] + [Top N URLs]

Extractor = { E1: Extractor E1, E2: Extractor E2 }

Ranker = { EF: Extracted Frequency, GW: Graph Walk }

N = { 100, 200, 300 }

Overall MAP vs. Various Methods

14.59%

43.76%

82.39%

0%

20%

40%

60%

80%

100%

G.Sets G.Sets (Eng) E1+EF+100

Methods

MA

P (

%)


82.39%

87.61%

93.13%

70%

75%

80%

85%

90%

95%

100%

E1+EF+100 E2+EF+100 E2+GW+100

Methods

MA

P (

%)


93.13% 94.03% 94.18%

70%

75%

80%

85%

90%

95%

100%

E2+GW+100 E2+GW+200 E2+GW+300

Methods

MA

P (

%)



Conclusion & Future Work

Conclusion Unsupervised approach for expanding sets of named entities

Domain and language independent SEAL performs better than Google Sets

Higher Mean Average Precision on our datasets Handle not only English, but also Chinese and Japanese

Future Work Learn from graphs to re-rank extracted mentions Bootstrap named entities by using extracted mentions in previous

expansion as seeds Identify possible class names for expanded sets

i.e. car makers, constellations, presidents…



References



Top three mentions are the seedsTry it out at http://rcwang.com/seal

Documents

Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University