Cross-Language Information Retrieval Applied Natural Language Processing October 29, 2009 Douglas W. Oard


Page 1: Cross-Language Information Retrieval

Cross-Language Information Retrieval

Applied Natural Language Processing
October 29, 2009
Douglas W. Oard

Page 2: Cross-Language Information Retrieval

What Do People Search For?

• Searchers often don’t clearly understand
  – The problem they are trying to solve
  – What information is needed to solve the problem
  – How to ask for that information

• The query results from a clarification process

• Dervin’s “sense making”: Need → Gap → Bridge

Page 3: Cross-Language Information Retrieval

Taylor’s Model of Question Formation

Q1 Visceral Need
Q2 Conscious Need
Q3 Formalized Need
Q4 Compromised Need (Query)

[Diagram: the four levels, bracketed by arrows labeled “End-user Search” and “Intermediated Search”.]

Page 4: Cross-Language Information Retrieval

Design Strategies

• Foster human-machine synergy
  – Exploit complementary strengths
  – Accommodate shared weaknesses

• Divide-and-conquer
  – Divide task into stages with well-defined interfaces
  – Continue dividing until problems are easily solved

• Co-design related components
  – Iterative process of joint optimization

Page 5: Cross-Language Information Retrieval

Human-Machine Synergy

• Machines are good at:
  – Doing simple things accurately and quickly
  – Scaling to larger collections in sublinear time

• People are better at:
  – Accurately recognizing what they are looking for
  – Evaluating intangibles such as “quality”

• Both are pretty bad at:
  – Mapping consistently between words and concepts

Page 6: Cross-Language Information Retrieval

Process/System Co-Design

Page 7: Cross-Language Information Retrieval

Supporting the Search Process

[Flow diagram: Source Selection → Query Formulation → Query → Search (inside the IR System) → Ranked List → Selection → Document → Examination → Document → Delivery, with feedback loops for Query Reformulation and Relevance Feedback and for Source Reselection; stages are annotated Predict, Nominate, Choose.]

Page 8: Cross-Language Information Retrieval

Supporting the Search Process

[Flow diagram: Source Selection → Query Formulation → Query → Search → Ranked List → Selection → Document → Examination → Document → Delivery, now showing the IR System internals: Acquisition builds the Collection, and Indexing builds the Index that Search runs against.]

Page 9: Cross-Language Information Retrieval

Search Component Model

[Diagram: an Information Need relates, via Human Judgment, to the Utility of a Document. On the query side, Query Formulation produces a Query, and a Representation Function (Query Processing) produces a Query Representation; on the document side, a Representation Function (Document Processing) produces a Document Representation. A Comparison Function maps the two representations to a Retrieval Status Value.]

Page 10: Cross-Language Information Retrieval

Relevance

• Relevance relates a topic and a document
  – Duplicates are equally relevant, by definition
  – Constant over time and across users

• Pertinence relates a task and a document
  – Accounts for quality, complexity, language, …

• Utility relates a user and a document
  – Accounts for prior knowledge

Page 11: Cross-Language Information Retrieval

“Okapi” Term Weights

w_{i,j} = [ TF_{i,j} / (0.5 + 1.5 × (L_j / L̄) + TF_{i,j}) ] × log[ (N − DF_i + 0.5) / (DF_i + 0.5) ]

(TF_{i,j}: frequency of term i in document j; L_j: length of document j; L̄: average document length; DF_i: number of documents containing term i; N: number of documents.)

TF component / IDF component

[Left plot: the TF component (Okapi TF vs. raw TF, 0–25) for length ratios L/L̄ of 0.5, 1.0, and 2.0, saturating below 1.0. Right plot: the IDF component (classic IDF vs. Okapi IDF as raw DF grows, 0–25).]
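The two components are straightforward to compute; a minimal Python sketch of the formula above (function names are my own, not from the talk):

```python
import math

def okapi_tf(tf, dl, avdl):
    """Okapi TF component: TF / (0.5 + 1.5*(L/Lbar) + TF).
    Saturates below 1.0 as raw TF grows, and shrinks for
    longer-than-average documents."""
    return tf / (0.5 + 1.5 * (dl / avdl) + tf)

def okapi_idf(df, n_docs):
    """Okapi IDF component: log((N - DF + 0.5) / (DF + 0.5))."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def okapi_weight(tf, df, dl, avdl, n_docs):
    """Combined term weight w_ij for one term in one document."""
    return okapi_tf(tf, dl, avdl) * okapi_idf(df, n_docs)
```

For an average-length document the TF component reduces to tf / (2 + tf): one occurrence scores 1/3, and even 25 occurrences score only 25/27, matching the saturation visible in the plot.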

Page 12: Cross-Language Information Retrieval

A Ranking Function: Okapi BM25

score(d, Q) = Σ_{e ∈ Q} log[ (N − df(e) + 0.5) / (df(e) + 0.5) ]
              × [ 2.2 × tf(e, d) ] / [ 0.3 + 0.9 × (dl(d) / avdl) + tf(e, d) ]
              × [ 8 × qtf(e) ] / [ 7 + qtf(e) ]

(e: query term; Q: query; d: document; tf(e, d): term frequency of e in d; qtf(e): term frequency of e in the query; df(e): document frequency of e; dl(d): document length; avdl: average document length; N: number of documents.)
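In code the whole ranking function is a short loop. A sketch, with dictionary inputs of my own framing; the constants 2.2, 0.3, 0.9, 8, and 7 correspond to the usual BM25 parameters k1 = 1.2, b = 0.75, k3 = 7:

```python
import math

def bm25(query_tf, doc_tf, df, n_docs, dl, avdl):
    """Okapi BM25 score of one document for one query.
    query_tf: {term: frequency in the query}
    doc_tf:   {term: frequency in the document}
    df:       {term: document frequency in the collection}"""
    score = 0.0
    for e, qtf in query_tf.items():
        tf = doc_tf.get(e, 0)
        if tf == 0:
            continue  # absent terms contribute nothing
        idf = math.log((n_docs - df[e] + 0.5) / (df[e] + 0.5))
        tf_part = (2.2 * tf) / (0.3 + 0.9 * (dl / avdl) + tf)
        qtf_part = (8 * qtf) / (7 + qtf)
        score += idf * tf_part * qtf_part
    return score
```

A document matching more query terms, with higher term frequencies, scores higher; longer-than-average documents are penalized through the dl/avdl term.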

Page 13: Cross-Language Information Retrieval

Estimating TF and DF for Query Terms

etf(e_i, d_k) = Σ_j p(f_j | e_i) × tf(f_j, d_k)
edf(e_i) = Σ_j p(f_j | e_i) × df(f_j)

Example for a query term e1 with translations f1…f4:

p(f_j | e1):   f1: 0.4   f2: 0.3   f3: 0.2   f4: 0.1
tf(f_j, d_k):  f1: 20    f2: 5     f3: 2     f4: 50
df(f_j):       f1: 50    f2: 40    f3: 30    f4: 200

etf(e1, d_k) = 0.4×20 + 0.3×5 + 0.2×2 + 0.1×50 = 14.9
edf(e1) = 0.4×50 + 0.3×40 + 0.2×30 + 0.1×200 = 58
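These expected statistics are simple dot products between the translation probabilities and the per-translation counts. A sketch reproducing the worked example above (the dictionary-based framing is my own):

```python
def expected_tf(trans_probs, doc_tf):
    """etf(e_i, d_k) = sum_j p(f_j | e_i) * tf(f_j, d_k)."""
    return sum(p * doc_tf.get(f, 0) for f, p in trans_probs.items())

def expected_df(trans_probs, coll_df):
    """edf(e_i) = sum_j p(f_j | e_i) * df(f_j)."""
    return sum(p * coll_df.get(f, 0) for f, p in trans_probs.items())

# The worked example: translation probabilities for query term e1,
# plus the TF and DF of each translation alternative.
p_f_given_e1 = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
doc_tf = {"f1": 20, "f2": 5, "f3": 2, "f4": 50}
coll_df = {"f1": 50, "f2": 40, "f3": 30, "f4": 200}

print(round(expected_tf(p_f_given_e1, doc_tf), 1))   # 14.9
print(round(expected_df(p_f_given_e1, coll_df), 1))  # 58.0
```

The expected statistics then drop straight into the BM25 formula in place of tf and df, which is what lets a monolingual ranking function score documents in another language.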

Page 14: Cross-Language Information Retrieval

Learning to Translate

• Lexicons
  – Phrase books, bilingual dictionaries, …

• Large text collections
  – Translations (“parallel”)
  – Similar topics (“comparable”)

• Similarity
  – Similar pronunciation, similar users

• People

Page 15: Cross-Language Information Retrieval

Hieroglyphic

Demotic

Greek

Page 16: Cross-Language Information Retrieval

Statistical Machine Translation

Señora Presidenta , había pedido a la administración del Parlamento que garantizase

Madam President , I had asked the administration to ensure that

Page 17: Cross-Language Information Retrieval

Bidirectional Translation
wonders of ancient world (CLEF Topic 151)

Unidirectional (translations of “wonders”):
se/0.31, demande/0.24, demander/0.08, peut/0.07, merveilles/0.04, question/0.02, savoir/0.02, on/0.02, bien/0.01, merveille/0.01, pourrait/0.01, si/0.01, sur/0.01, me/0.01, t/0.01, emerveille/0.01, ambition/0.01, merveilleusement/0.01, veritablement/0.01, cinq/0.01, hier/0.01

Bidirectional:
merveilles/0.92, merveille/0.03, emerveille/0.03, merveilleusement/0.02
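One simple way to get this bidirectional effect (the slides do not spell out the exact method, so this sketch is an assumption): keep a translation f of e only when e itself appears among the back-translations of f, then renormalize the surviving probabilities.

```python
def bidirectional_translations(e, fwd, bwd):
    """fwd: {source: {target: p(f|e)}}; bwd: {target: {source: p(e|f)}}.
    Keep f only if e is a translation of f in the reverse direction,
    then renormalize. (An assumed filtering scheme, for illustration.)"""
    kept = {f: p for f, p in fwd.get(e, {}).items() if e in bwd.get(f, {})}
    total = sum(kept.values())
    return {f: p / total for f, p in kept.items()} if total else {}

# Toy tables (hypothetical, not the real GIZA++ output): only the
# "merveille*" words back-translate to "wonders".
fwd = {"wonders": {"se": 0.31, "demande": 0.24,
                   "merveilles": 0.04, "merveille": 0.01}}
bwd = {"se": {"it": 0.5}, "demande": {"request": 0.6},
       "merveilles": {"wonders": 0.9}, "merveille": {"wonders": 0.1}}
```

With these toy tables the filtered distribution concentrates almost all its mass on "merveilles", mirroring the sharpening shown on the slide.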

Page 18: Cross-Language Information Retrieval

Experiment Setup

• Test collections
• Document processing
  – Stemming, accent removal (CLEF French)
  – Word segmentation, encoding conversion (TREC Chinese)
  – Stopword removal (all collections)
• Training statistical translation models (GIZA++)

                         CLEF French             TREC Chinese
Source                   CLEF’01-03              TREC-5,6
Query language           English                 English
Document language        French                  Chinese
# of topics              151                     54
# of documents           87,191                  139,801
Avg # of rel docs        23                      95
Parallel corpus          Europarl                FBIS et al.
Languages                English-French          English-Chinese
# of sentence pairs      672,247                 1,583,807
Models (iterations)      M1(10), HMM(5), M4(5)   M1(10)

Page 19: Cross-Language Information Retrieval
Page 20: Cross-Language Information Retrieval

Pruning Translations

Translation probabilities: f1 (0.32), f2 (0.21), f3 (0.11), f4 (0.09), f5 (0.08), f6 (0.05), f7 (0.04), f8 (0.03), f9 (0.03), f10 (0.02), f11 (0.01), f12 (0.01)

[Chart: translations retained at each cumulative probability threshold from 0.0 to 1.0; low thresholds keep only f1, and the retained set grows to all twelve translations as the threshold approaches 1.0.]
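The pruning rule keeps translations in decreasing probability order until the cumulative probability of those already kept reaches the threshold. A minimal sketch, using the twelve probabilities from the slide:

```python
def prune_translations(trans_probs, threshold):
    """Keep translations in decreasing probability order, stopping
    once the cumulative probability of the kept set reaches the
    threshold."""
    kept, cum = {}, 0.0
    for f, p in sorted(trans_probs.items(), key=lambda kv: -kv[1]):
        if cum >= threshold:
            break
        kept[f] = p
        cum += p
    return kept

# Translation probabilities from the slide:
probs = {"f1": 0.32, "f2": 0.21, "f3": 0.11, "f4": 0.09, "f5": 0.08,
         "f6": 0.05, "f7": 0.04, "f8": 0.03, "f9": 0.03, "f10": 0.02,
         "f11": 0.01, "f12": 0.01}
```

At a threshold of 0.2 only f1 survives; at 0.5, f1 and f2; at 1.0 all twelve are kept, matching the growth pattern in the chart.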

Page 21: Cross-Language Information Retrieval

Unidirectional without Synonyms (PSQ)

Q → D

[Two plots: MAP as a fraction of monolingual (40%–110%) vs. cumulative probability threshold (0–1), for CLEF French and for TREC-5,6 Chinese.]

Statistical significance vs monolingual (Wilcoxon signed rank test):
• CLEF French: worse at peak
• TREC-5,6 Chinese: worse at peak

Page 22: Cross-Language Information Retrieval

Bidirectional with Synonyms (DAMM)

(Q) ↔ (D) vs. Q → D

[Two plots: MAP as a fraction of monolingual (40%–110%) vs. cumulative probability threshold (0–1), comparing DAMM, IMM, and PSQ on CLEF French and on TREC-5,6 Chinese.]

• DAMM significantly outperformed PSQ
• DAMM is statistically indistinguishable from monolingual at peak
• IMM: nearly as good as DAMM for French, but not for Chinese

Page 23: Cross-Language Information Retrieval

Indexing Time

[Plot: indexing time (sec) vs. collection size (0–500 thousand documents), for monolingual and cross-language indexing.]

Dictionary-based vector translation, single Sun SPARC in 2001

Page 24: Cross-Language Information Retrieval

The Problem Space

• Retrospective search
  – Web search
  – Specialized services (medicine, law, patents)
  – Help desks

• Real-time filtering
  – Email spam
  – Web parental control
  – News personalization

• Real-time interaction
  – Instant messaging
  – Chat rooms
  – Teleconferences

Key Capabilities
• Map across languages
  – For human understanding
  – For automated processing

Page 25: Cross-Language Information Retrieval

Making a Market

• Multitude of potential applications
  – Retrospective search, email, IM, chat, …
  – Natural consequence of language diversity

• Limiting factor is translation readability
  – Searchability is mostly a solved problem

• Leveraging human translation has potential
  – Translation routing, volunteers, caching