20
Language- Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213 USA

Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

  • View
    214

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

Language-Independent Set Expansion of Named Entities using the Web

Richard C. Wang & William W. CohenLanguage Technologies InstituteCarnegie Mellon UniversityPittsburgh, PA 15213 USA

Page 2: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

2 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

Outline

Introduction System Architecture

FetcherExtractorRanker

Evaluation Conclusion

Page 3: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

3 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

What is Set Expansion?

For example, Given a query: {“spit”, “boogers”, “ear wax”} Answer is: {“puke”, “toe jam”, “sweat”, ....}

More formally, Given a small number of seeds: x1, x2, …, xk wh

ere each xi St Answer is a listing of other probable elements: e1, e2, …, en where each ei St

A well-known example of a web-based set expansion system is Google Sets™ http://labs.google.com/sets

Page 4: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

4 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

What is it used for?

Derive features for… Named Entity Recognition (Settles, 2004) (Talukdar, 2006)

Expand true named entities in training set Utilize expanded names to assign features to words

Concept Learning (Cohen, 2000) Given a set of instances, look in web pages for tables or lists

that contain some of those instances Automatically extract features from those pages Define features over the instances found

Relation Learning (Cafarella et al, 2005) (Etzioni et al, 2005) Extract items from tables or lists that contain given seeds Utilize extracted items and their contexts for learning relation

s

Page 5: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

5 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

Our Set Expander: SEAL Features

Independent of human/markup language Support seeds in English, Chinese, Japanese, Korean, ... Accept documents in HTML, XML, SGML, TeX, WikiML, …

Does not require pre-annotated training data Utilize readily-available corpus: World Wide Web Learns wrappers on the fly

Based on two research contributions1. Automatic construction of wrappers

Extracts “lists” of entities on semi-structured web pages

2. Use of random graph walk Ranks extracted entities so that those most likely to be in the target

set are ranked higher

Set Expander for Any Language

Page 6: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

6 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

System Architecture

Fetcher: download web pages from the Web Extractor: learn wrappers from web pages Ranker: rank entities extracted by wrappers

1. Canon2. Nikon3. Olympus

4. Pentax5. Sony6. Kodak7. Minolta8. Panasonic9. Casio10. Leica11. Fuji12. Samsung13. …

Page 7: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

7 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

The Fetcher

Procedure:1. Compose a search query using all seeds

2. Use Google API to request for top N URLs We use N = 100, 200, and 300 for evaluation

3. Fetch URLs by using a crawler

4. Send fetched documents to the Extractor

Page 8: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

8 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

The Extractor

Learn wrappers from web documents and seeds on the fly Utilize semi-structured documents Wrappers defined at character level

No tokenization required; thus language independent

However, very specific; thus page-dependent Wrappers derived from document d is applied to d only

Page 9: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

9 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

<li class="toyota"><a href="http://www.geisauto.com/">

<li class="honda"><a href="http://www.curryauto.com/">

<li class="acura"><a href="http://www.curryauto.com/">

<li class="toyota"><a href="http://www.curryauto.com/">

<li class="nissan"><a href="http://www.curryauto.com/">

<li class="ford"><a href="http://www.curryauto.com/"> <li class="ford"><a href="http://www.curryauto.com/">

<li class="ford"><a href="http://www.curryauto.com/">

<li class="ford"><a href="http://www.curryauto.com/">

<li class="ford"><a href="http://www.curryauto.com/">

<li class="ford"><a href="http://www.curryauto.com/">

Extractor E1 finds maximally-long contexts that bracket

all instances of every seed

It seems to be working… but what if I add one more instance of “to

yota”?

It seems to be working too… but how about a more complex exa

mple?

Page 10: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

10 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

<img src="/common/logos/honda/logo-horiz-rgb-lg-dkbg.gif" alt="4"></a> <ul><li><a href="http://www.curryhonda-ga.com/"> <span class="dName">Curry Honda Atlanta</span>...</li> <li><a href="http://www.curryhondamass.com/"> <span class="dName">Curry Honda</span>...</li> <li class="last"><a href="http://www.curryhondany.com/"> <span class="dName">Curry Honda Yorktown</span>...</li></ul> </li>

<li class="honda"><a href="http://www.curryauto.com/">

<li class="acura"><a href="http://www.curryauto.com/">

<li class="toyota"><a href="http://www.curryauto.com/">

<li class="nissan"><a href="http://www.curryauto.com/">

<li class="ford"><a href="http://www.curryauto.com/"> <img src="/common/logos/ford/logo-horiz-rgb-lg-dkbg.gif" alt="3"></a> <ul><li class="last"><a href="http://www.curryauto.com/"> <span class="dName">Curry Ford</span>...</li></ul> </li>

<img src="/curryautogroup/images/logo-horiz-rgb-lg-dkbg.gif" alt="5"></a> <ul><li class="last"><a href="http://www.curryacura.com/"> <span class="dName">Curry Acura</span>...</li></ul> </li>

<img src="/common/logos/toyota/logo-horiz-rgb-lg-dkbg.gif" alt="7"></a> <ul><li class="last"><a href="http://www.geisauto.com/toyota/"> <span class="dName">Curry Toyota</span>...</li></ul> </li>

<img src="/common/logos/nissan/logo-horiz-rgb-lg-dkbg.gif" alt="6"></a> <ul><li class="last"><a href="http://www.geisauto.com/"> <span class="dName">Curry Nissan</span>...</li></ul> </li>

I am a noisy entity mention

Me too!

Can you find common contexts that bracket all instances of every seed?

I guess not!

Let’s try out

Extractor E2 and se

e if it works…

Extractor E2 finds

maximally-long contexts that bracket

at least one instance of every seed

Horray! It seems like Extractor E2 works! But how do we get rid of those noisy entity menti

ons?

Page 11: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

11 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

Extractor: Summary

A wrapper consists of a pair of left (L) and right (R) context string All strings between (but not containing) L and R

are extracted Referred to as “candidate entity mention”

We compared two versions of wrapper: Maximally-long contextual strings that

bracket…1. all instances of every seed (Extractor E1)

2. at least one instance of every seed (Extractor E2)

Page 12: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

12 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

The Ranker

Rank candidate entity mentions based on “similarity” to seeds

Noisy mentions should be ranked lower We compare two methods for ranking

1. Extracted Frequency (EF) # of times an entity mention is extracted

2. Random Graph Walk (GW) Probability of an “entity mention” node being rea

ched in a graph (explained in next slide)

Page 13: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

13 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

Building a Graph

A graph consists of a fixed set of… Node Types: {seeds, document, wrapper, mention} Labeled Directed Edges: {find, derive, extract}

Each edge asserts that a binary relation r holds Each edge has an inverse relation r-1 (graph is cyclic)

“ford”, “nissan”, “toyota”

curryauto.com

Wrapper #3

Wrapper #2

Wrapper #1

Wrapper #4

“honda”26.1%

“acura”34.6%

“chevrolet”22.5%

“bmw pittsburgh”8.4%

“volvo chicago”8.4%

find

derive

extract northpointcars.com

Minkov et al. Contextual Search and Name Disambiguation in Email using Graphs. SIGIR 2006

Page 14: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

14 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

Legend

Node: x, y, z

Edge Relation: r

An edge from x to y with relation r :

Stop Probability: λ

Random Graph Walk

where

otherwise

if

0

1)(I

zxzx

Probability of picking a target node y given an edge relation r and source n

ode x

“curryauto.com”, ...“wrapper #1”, ...“honda”, “acura”, ...

find, find-1, derive, derive-1, extract, extract-1

Probability of staying at a no

de (0.5)

Probability of picking an edge relation r given a s

ource node x

Probability of reaching any node z from x

yx rRecursive computation of probability

Probability of continuing to node z from x

Probability of staying at node x

ryx

Page 15: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

15 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

Evaluation Datasets

Page 16: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

16 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

Evaluation Method Mean Average Precision

Commonly used for evaluating ranked lists in IR Contains recall and precision-oriented aspects Sensitive to the entire ranking Mean of average precisions for each ranked list

Evaluation Procedure (per dataset)

1. Randomly select three true entities and use their first listed mentions as seeds

2. Expand the three seeds obtained from step 13. Repeat steps 1 and 2 five times4. Compute MAP for the five ranked lists

where L = ranked list of extracted mentions, r = rank

Prec(r) = precision at rank r

(a) Extracted mention at r matches any true mention

(b) There exist no other extracted mention at rank less than r that is of the same entity as the one at r

otherwise

trueare (b) and (a) if

0

1

)(NewEntity r

# True Entities = total number of true entities in this dataset

Page 17: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

17 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

Experimental Results

Legend

[Extractor] + [Ranker] + [Top N URLs]

Extractor = { E1: Extractor E1, E2: Extractor E2 }

Ranker = { EF: Extracted Frequency, GW: Graph Walk }

N = { 100, 200, 300 }

Overall MAP vs. Various Methods

14.59%

43.76%

82.39%

0%

20%

40%

60%

80%

100%

G.Sets G.Sets (Eng) E1+EF+100

Methods

MA

P (

%)

Overall MAP vs. Various Methods

82.39%

87.61%

93.13%

70%

75%

80%

85%

90%

95%

100%

E1+EF+100 E2+EF+100 E2+GW+100

Methods

MA

P (

%)

Overall MAP vs. Various Methods

93.13% 94.03% 94.18%

70%

75%

80%

85%

90%

95%

100%

E2+GW+100 E2+GW+200 E2+GW+300

Methods

MA

P (

%)

Page 18: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

18 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

Conclusion & Future Work

Conclusion Unsupervised approach for expanding sets of named entities

Domain and language independent SEAL performs better than Google Sets

Higher Mean Average Precision on our datasets Handle not only English, but also Chinese and Japanese

Future Work Learn from graphs to re-rank extracted mentions Bootstrap named entities by using extracted mentions in previous

expansion as seeds Identify possible class names for expanded sets

i.e. car makers, constellations, presidents…

Page 19: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

19 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

References

Page 20: Language-Independent Set Expansion of Named Entities using the Web Richard C. Wang & William W. Cohen Language Technologies Institute Carnegie Mellon University

20 / 20Language Technologies Institute, Carnegie Mellon University

Language-Independent Set ExpansionRichard C. Wang

Top three mentions are the seedsTry it out at http://rcwang.com/seal