Cross-Language Information Retrieval
Applied Natural Language Processing, October 29, 2009, Douglas W. Oard
What Do People Search For?
• Searchers often don’t clearly understand
  – The problem they are trying to solve
  – What information is needed to solve the problem
  – How to ask for that information
• The query results from a clarification process
• Dervin’s “sense making”: Need → Gap → Bridge
Taylor’s Model of Question Formation
• Q1: Visceral Need
• Q2: Conscious Need
• Q3: Formalized Need
• Q4: Compromised Need (Query)
[Diagram labels: End-user Search; Intermediated Search]
Design Strategies
• Foster human-machine synergy
  – Exploit complementary strengths
  – Accommodate shared weaknesses
• Divide-and-conquer
  – Divide task into stages with well-defined interfaces
  – Continue dividing until problems are easily solved
• Co-design related components
  – Iterative process of joint optimization
Human-Machine Synergy
• Machines are good at:
  – Doing simple things accurately and quickly
  – Scaling to larger collections in sublinear time
• People are better at:
  – Accurately recognizing what they are looking for
  – Evaluating intangibles such as “quality”
• Both are pretty bad at:
  – Mapping consistently between words and concepts
Process/System Co-Design
Supporting the Search Process
[Diagram: Source Selection → Query Formulation → Query → Search → Ranked List → Selection → Document → Examination → Document → Delivery; the IR System performs Search; feedback loops for Query Reformulation and Relevance Feedback, and for Source Reselection; user activities labeled Predict, Nominate, Choose]
Supporting the Search Process
[Same diagram, adding the system side: Acquisition → Collection → Indexing → Index, feeding the IR System]
Search Component Model
[Diagram: on the query side, Information Need → Query Formulation → Query → Representation Function → Query Representation (query processing); on the document side, Document → Representation Function → Document Representation (document processing); a Comparison Function matches the two representations to produce a Retrieval Status Value; Human Judgment assesses Utility]
Relevance
• Relevance relates a topic and a document
  – Duplicates are equally relevant, by definition
  – Constant over time and across users
• Pertinence relates a task and a document
  – Accounts for quality, complexity, language, …
• Utility relates a user and a document
  – Accounts for prior knowledge
“Okapi” Term Weights

w_{i,j} = \frac{TF_{i,j}}{0.5 + 1.5\,(L_i/\bar{L}) + TF_{i,j}} \cdot \log\frac{N - DF_j + 0.5}{DF_j + 0.5}

where TF_{i,j} is the raw frequency of term j in document i, DF_j is the term’s document frequency, N is the collection size, and L_i/\bar{L} is the document length relative to the average.

[Figures: the Okapi TF component vs. raw TF (0–25) for L_i/\bar{L} = 0.5, 1.0, 2.0, showing saturation between 0.0 and 1.0; classic IDF vs. Okapi IDF as a function of raw DF (0–25)]
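The saturating TF component and classic-style IDF on this slide can be sketched in Python; the function name and the example statistics below are illustrative, not from the slides.

```python
from math import log

def okapi_weight(tf, df, doc_len, avg_doc_len, n_docs):
    """Okapi term weight: a TF component that saturates as raw TF grows
    (dampened for longer-than-average documents) times a classic-style IDF."""
    tf_part = tf / (0.5 + 1.5 * (doc_len / avg_doc_len) + tf)
    idf_part = log((n_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part

# Saturation: doubling a raw TF of 5 changes the weight only modestly.
w1 = okapi_weight(tf=5, df=100, doc_len=300, avg_doc_len=300, n_docs=100_000)
w2 = okapi_weight(tf=10, df=100, doc_len=300, avg_doc_len=300, n_docs=100_000)
```

This mirrors the two plotted curves: the TF factor flattens out quickly, and a longer document (larger doc_len/avg_doc_len) pushes the whole curve down.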
A Ranking Function: Okapi BM25

BM25(Q,d) = \sum_{e \in Q} \log\!\left[\frac{N - df(e) + 0.5}{df(e) + 0.5}\right] \cdot \frac{2.2 \cdot tf(e,d)}{0.3 + 0.9\,\frac{dl(d)}{avdl} + tf(e,d)} \cdot \frac{8 \cdot qtf(e)}{7 + qtf(e)}

where e is a query term from query Q, d is the document, df(e) is the document frequency, tf(e,d) is the term frequency of e in d, qtf(e) is the term frequency of e in the query, dl(d) is the document length, and avdl is the average document length.
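A minimal Python sketch of this BM25 form, keeping the slide's constants (2.2 = k1 + 1 for k1 = 1.2; 0.3 + 0.9·dl/avdl corresponds to b = 0.75; k3 = 7 in the query-TF factor). The dictionaries and counts in the usage example are hypothetical.

```python
from math import log

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, n_docs):
    """Sum, over query terms present in the document, of
    IDF * document-TF factor * query-TF factor, per the slide's formula."""
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # absent terms contribute nothing
        idf = log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
        tf_part = (2.2 * tf) / (0.3 + 0.9 * doc_len / avg_doc_len + tf)
        qtf_part = (8 * qtf) / (7 + qtf)
        score += idf * tf_part * qtf_part
    return score

# Hypothetical query and document statistics:
score = bm25_score(
    query_tf={"wonders": 1, "ancient": 1},
    doc_tf={"wonders": 3, "ancient": 1, "world": 2},
    doc_len=6, avg_doc_len=10,
    df={"wonders": 20, "ancient": 400}, n_docs=100_000,
)
```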
Estimating TF and DF for Query Terms

tf(e_i, d_k) = \sum_{f_j} p(f_j \mid e_i) \cdot tf(f_j, d_k)

df(e_i) = \sum_{f_j} p(f_j \mid e_i) \cdot df(f_j)

Worked example for query term e1 with translations f1–f4 and p(f_j | e1) = 0.4, 0.3, 0.2, 0.1:

  tf(e1, d_k) = 0.4·20 + 0.3·5 + 0.2·2 + 0.1·50 = 14.9
  df(e1) = 0.4·50 + 0.3·40 + 0.2·30 + 0.1·200 = 58
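The two estimates and the worked example above can be reproduced directly; the function names (`psq_tf`, `psq_df`) are illustrative.

```python
def psq_tf(translations, tf):
    """tf(e, d) = sum_j p(f_j | e) * tf(f_j, d): the English term inherits
    term frequency from its translations, weighted by translation probability."""
    return sum(p * tf.get(f, 0) for f, p in translations.items())

def psq_df(translations, df):
    """df(e) = sum_j p(f_j | e) * df(f_j), by the same construction."""
    return sum(p * df.get(f, 0) for f, p in translations.items())

# The slide's worked example: translations f1..f4 with p = 0.4, 0.3, 0.2, 0.1.
p = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
tf_e = psq_tf(p, {"f1": 20, "f2": 5, "f3": 2, "f4": 50})     # 14.9
df_e = psq_df(p, {"f1": 50, "f2": 40, "f3": 30, "f4": 200})  # 58
```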
Learning to Translate
• Lexicons
  – Phrase books, bilingual dictionaries, …
• Large text collections
  – Translations (“parallel”)
  – Similar topics (“comparable”)
• Similarity
  – Similar pronunciation, similar users
• People
[Slide image: the Rosetta Stone — Hieroglyphic, Demotic, Greek]
Statistical Machine Translation
Spanish: Señora Presidenta, había pedido a la administración del Parlamento que garantizase
English: Madam President, I had asked the administration to ensure that
Bidirectional Translation — “wonders of ancient world” (CLEF Topic 151)
Unidirectional (translations of “wonders”): se 0.31, demande 0.24, demander 0.08, peut 0.07, merveilles 0.04, question 0.02, savoir 0.02, on 0.02, bien 0.01, merveille 0.01, pourrait 0.01, si 0.01, sur 0.01, me 0.01, t 0.01, emerveille 0.01, ambition 0.01, merveilleusement 0.01, veritablement 0.01, cinq 0.01, hier 0.01
Bidirectional: merveilles 0.92, merveille 0.03, emerveille 0.03, merveilleusement 0.02
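One way to sketch the bidirectional idea: keep a forward translation only if the backward table maps it to the original term at all, then renormalize the survivors. The tables below are hypothetical toy values, not the CLEF statistics above.

```python
def bidirectional_filter(fwd, bwd, source_term):
    """Keep forward translations f of source_term for which the backward
    table translates f back to source_term, then renormalize probabilities.
    (A simplified reading of the unidirectional vs. bidirectional lists.)"""
    kept = {f: p for f, p in fwd.items() if source_term in bwd.get(f, {})}
    total = sum(kept.values())
    return {f: p / total for f, p in kept.items()} if total else {}

# Hypothetical tables for "wonders": the mistranslations ("se", "demande")
# do not translate back to "wonders", so only the "merveille" family survives.
fwd = {"se": 0.31, "demande": 0.24, "merveilles": 0.04, "merveille": 0.01}
bwd = {"se": {"it": 0.5}, "demande": {"request": 0.6},
       "merveilles": {"wonders": 0.9}, "merveille": {"wonder": 0.5, "wonders": 0.3}}
filtered = bidirectional_filter(fwd, bwd, "wonders")
```

The effect matches the slide: the bidirectional distribution concentrates its mass on the genuine translations.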
Experiment Setup
• Test collections
• Document processing
  – Stemming, accent removal (CLEF French)
  – Word segmentation, encoding conversion (TREC Chinese)
  – Stopword removal (all collections)
• Training statistical translation models (GIZA++)

                       English–French          English–Chinese
  Parallel corpus      Europarl                FBIS et al.
  # of sentence pairs  672,247                 1,583,807
  Models (iterations)  M1(10), HMM(5), M4(5)   M1(10)

                       CLEF French             TREC Chinese
  Source               CLEF’01-03              TREC-5,6
  Query language       English                 English
  Document language    French                  Chinese
  # of topics          151                     54
  # of documents       87,191                  139,801
  Avg # of rel docs    23                      95
Pruning Translations
[Figure: translation probability distribution for a query term — f1 (0.32), f2 (0.21), f3 (0.11), f4 (0.09), f5 (0.08), f6 (0.05), f7 (0.04), f8 (0.03), f9 (0.03), f10 (0.02), f11 (0.01), f12 (0.01) — and the set of translations retained at each cumulative probability threshold from 0.0 to 1.0: the lowest threshold keeps only f1; a threshold of 1.0 keeps all twelve]
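The cumulative-threshold pruning can be sketched as follows, using the probabilities from this slide (the function name is illustrative).

```python
def prune_translations(probs, threshold):
    """Keep the most probable translations, stopping once their cumulative
    probability reaches the threshold."""
    kept, cumulative = [], 0.0
    for term, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        if cumulative >= threshold:
            break
        kept.append(term)
        cumulative += p
    return kept

probs = {"f1": 0.32, "f2": 0.21, "f3": 0.11, "f4": 0.09, "f5": 0.08,
         "f6": 0.05, "f7": 0.04, "f8": 0.03, "f9": 0.03, "f10": 0.02,
         "f11": 0.01, "f12": 0.01}
prune_translations(probs, 0.3)   # keeps only f1
prune_translations(probs, 1.0)   # keeps all twelve translations
```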
Unidirectional without Synonyms (PSQ): Q → D
[Figures: MAP as a percentage of monolingual (40%–110%) vs. cumulative probability threshold (0–1); panels: CLEF French and TREC-5,6 Chinese]
Statistical significance vs. monolingual (Wilcoxon signed rank test):
• CLEF French: worse at peak
• TREC-5,6 Chinese: worse at peak
Bidirectional with Synonyms (DAMM): matching (Q) and (D) in both directions vs. unidirectional Q → D
[Figures: MAP as a percentage of monolingual (40%–110%) vs. cumulative probability threshold (0.0–1.0), comparing DAMM, IMM, and PSQ; panels: CLEF French and TREC-5,6 Chinese]
• DAMM significantly outperformed PSQ
• DAMM is statistically indistinguishable from monolingual at peak
• IMM: nearly as good as DAMM for French, but not for Chinese
Indexing Time
[Figure: indexing time (sec) vs. collection size in thousands of documents (0–500), monolingual vs. cross-language]
Dictionary-based vector translation, single Sun SPARC in 2001
The Problem Space
• Retrospective search
  – Web search
  – Specialized services (medicine, law, patents)
  – Help desks
• Real-time filtering
  – Email spam
  – Web parental control
  – News personalization
• Real-time interaction
  – Instant messaging
  – Chat rooms
  – Teleconferences
Key Capabilities
• Map across languages
  – For human understanding
  – For automated processing
Making a Market
• Multitude of potential applications
  – Retrospective search, email, IM, chat, …
  – A natural consequence of language diversity
• The limiting factor is translation readability
  – Searchability is mostly a solved problem
• Leveraging human translation has potential
  – Translation routing, volunteers, caching