Cross-Language Information Retrieval
Applied Natural Language Processing, October 29, 2009, Douglas W. Oard
What Do People Search For?
• Searchers often don’t clearly understand
  – The problem they are trying to solve
  – What information is needed to solve the problem
  – How to ask for that information
• The query results from a clarification process
• Dervin’s “sense making”: Need → Gap → Bridge
Taylor’s Model of Question Formation
• Q1: Visceral Need
• Q2: Conscious Need
• Q3: Formalized Need
• Q4: Compromised Need (Query)
[Diagram labels: End-user Search; Intermediated Search]
Design Strategies
• Foster human-machine synergy
  – Exploit complementary strengths
  – Accommodate shared weaknesses
• Divide-and-conquer
  – Divide task into stages with well-defined interfaces
  – Continue dividing until problems are easily solved
• Co-design related components
  – Iterative process of joint optimization
Human-Machine Synergy
• Machines are good at:
  – Doing simple things accurately and quickly
  – Scaling to larger collections in sublinear time
• People are better at:
  – Accurately recognizing what they are looking for
  – Evaluating intangibles such as “quality”
• Both are pretty bad at:
  – Mapping consistently between words and concepts
Process/System Co-Design
Supporting the Search Process
[Diagram: Source Selection → Query Formulation → Query → Search → Ranked List → Selection → Document → Examination → Document → Delivery; the IR System performs Search; feedback loops for Query Reformulation and Relevance Feedback, and for Source Reselection; user activities labeled Predict, Nominate, Choose]
Supporting the Search Process
[Same diagram, adding the system side: Acquisition → Collection → Indexing → Index, feeding the IR System]
Search Component Model
[Diagram: on the query side, Information Need → Query Formulation → Query → Representation Function → Query Representation (query processing); on the document side, Document → Representation Function → Document Representation (document processing); a Comparison Function matches the two representations to produce a Retrieval Status Value; Human Judgment assesses Utility]
Relevance
• Relevance relates a topic and a document
  – Duplicates are equally relevant, by definition
  – Constant over time and across users
• Pertinence relates a task and a document
  – Accounts for quality, complexity, language, …
• Utility relates a user and a document
  – Accounts for prior knowledge
“Okapi” Term Weights

w_{i,j} = \frac{TF_{i,j}}{0.5 + 1.5\,(L_i/\bar{L}) + TF_{i,j}} \cdot \log\frac{N - DF_j + 0.5}{DF_j + 0.5}

where TF_{i,j} is the raw frequency of term j in document i, DF_j is the term’s document frequency, N is the collection size, and L_i/\bar{L} is the document length relative to the average.

[Figures: the Okapi TF component vs. raw TF (0–25) for L_i/\bar{L} = 0.5, 1.0, 2.0, showing saturation between 0.0 and 1.0; classic IDF vs. Okapi IDF as a function of raw DF (0–25)]
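The saturating TF component and classic-style IDF on this slide can be sketched in Python; the function name and the example statistics below are illustrative, not from the slides.

```python
from math import log

def okapi_weight(tf, df, doc_len, avg_doc_len, n_docs):
    """Okapi term weight: a TF component that saturates as raw TF grows
    (dampened for longer-than-average documents) times a classic-style IDF."""
    tf_part = tf / (0.5 + 1.5 * (doc_len / avg_doc_len) + tf)
    idf_part = log((n_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part

# Saturation: doubling a raw TF of 5 changes the weight only modestly.
w1 = okapi_weight(tf=5, df=100, doc_len=300, avg_doc_len=300, n_docs=100_000)
w2 = okapi_weight(tf=10, df=100, doc_len=300, avg_doc_len=300, n_docs=100_000)
```

This mirrors the two plotted curves: the TF factor flattens out quickly, and a longer document (larger doc_len/avg_doc_len) pushes the whole curve down.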
A Ranking Function: Okapi BM25

BM25(Q,d) = \sum_{e \in Q} \log\!\left[\frac{N - df(e) + 0.5}{df(e) + 0.5}\right] \cdot \frac{2.2 \cdot tf(e,d)}{0.3 + 0.9\,\frac{dl(d)}{avdl} + tf(e,d)} \cdot \frac{8 \cdot qtf(e)}{7 + qtf(e)}

where e is a query term from query Q, d is the document, df(e) is the document frequency, tf(e,d) is the term frequency of e in d, qtf(e) is the term frequency of e in the query, dl(d) is the document length, and avdl is the average document length.
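A minimal Python sketch of this BM25 form, keeping the slide's constants (2.2 = k1 + 1 for k1 = 1.2; 0.3 + 0.9·dl/avdl corresponds to b = 0.75; k3 = 7 in the query-TF factor). The dictionaries and counts in the usage example are hypothetical.

```python
from math import log

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, n_docs):
    """Sum, over query terms present in the document, of
    IDF * document-TF factor * query-TF factor, per the slide's formula."""
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # absent terms contribute nothing
        idf = log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
        tf_part = (2.2 * tf) / (0.3 + 0.9 * doc_len / avg_doc_len + tf)
        qtf_part = (8 * qtf) / (7 + qtf)
        score += idf * tf_part * qtf_part
    return score

# Hypothetical query and document statistics:
score = bm25_score(
    query_tf={"wonders": 1, "ancient": 1},
    doc_tf={"wonders": 3, "ancient": 1, "world": 2},
    doc_len=6, avg_doc_len=10,
    df={"wonders": 20, "ancient": 400}, n_docs=100_000,
)
```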
Estimating TF and DF for Query Terms

tf(e_i, d_k) = \sum_{f_j} p(f_j \mid e_i) \cdot tf(f_j, d_k)

df(e_i) = \sum_{f_j} p(f_j \mid e_i) \cdot df(f_j)

Worked example for query term e1 with translations f1–f4 and p(f_j | e1) = 0.4, 0.3, 0.2, 0.1:

  tf(e1, d_k) = 0.4·20 + 0.3·5 + 0.2·2 + 0.1·50 = 14.9
  df(e1) = 0.4·50 + 0.3·40 + 0.2·30 + 0.1·200 = 58
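The two estimates and the worked example above can be reproduced directly; the function names (`psq_tf`, `psq_df`) are illustrative.

```python
def psq_tf(translations, tf):
    """tf(e, d) = sum_j p(f_j | e) * tf(f_j, d): the English term inherits
    term frequency from its translations, weighted by translation probability."""
    return sum(p * tf.get(f, 0) for f, p in translations.items())

def psq_df(translations, df):
    """df(e) = sum_j p(f_j | e) * df(f_j), by the same construction."""
    return sum(p * df.get(f, 0) for f, p in translations.items())

# The slide's worked example: translations f1..f4 with p = 0.4, 0.3, 0.2, 0.1.
p = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
tf_e = psq_tf(p, {"f1": 20, "f2": 5, "f3": 2, "f4": 50})     # 14.9
df_e = psq_df(p, {"f1": 50, "f2": 40, "f3": 30, "f4": 200})  # 58
```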
Learning to Translate
• Lexicons
  – Phrase books, bilingual dictionaries, …
• Large text collections
  – Translations (“parallel”)
  – Similar topics (“comparable”)
• Similarity
  – Similar pronunciation, similar users
• People
[Slide image: the Rosetta Stone — Hieroglyphic, Demotic, Greek]
Statistical Machine Translation
Spanish: Señora Presidenta, había pedido a la administración del Parlamento que garantizase
English: Madam President, I had asked the administration to ensure that
Bidirectional Translation — “wonders of ancient world” (CLEF Topic 151)
Unidirectional (translations of “wonders”): se 0.31, demande 0.24, demander 0.08, peut 0.07, merveilles 0.04, question 0.02, savoir 0.02, on 0.02, bien 0.01, merveille 0.01, pourrait 0.01, si 0.01, sur 0.01, me 0.01, t 0.01, emerveille 0.01, ambition 0.01, merveilleusement 0.01, veritablement 0.01, cinq 0.01, hier 0.01
Bidirectional: merveilles 0.92, merveille 0.03, emerveille 0.03, merveilleusement 0.02
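One way to sketch the bidirectional idea: keep a forward translation only if the backward table maps it to the original term at all, then renormalize the survivors. The tables below are hypothetical toy values, not the CLEF statistics above.

```python
def bidirectional_filter(fwd, bwd, source_term):
    """Keep forward translations f of source_term for which the backward
    table translates f back to source_term, then renormalize probabilities.
    (A simplified reading of the unidirectional vs. bidirectional lists.)"""
    kept = {f: p for f, p in fwd.items() if source_term in bwd.get(f, {})}
    total = sum(kept.values())
    return {f: p / total for f, p in kept.items()} if total else {}

# Hypothetical tables for "wonders": the mistranslations ("se", "demande")
# do not translate back to "wonders", so only the "merveille" family survives.
fwd = {"se": 0.31, "demande": 0.24, "merveilles": 0.04, "merveille": 0.01}
bwd = {"se": {"it": 0.5}, "demande": {"request": 0.6},
       "merveilles": {"wonders": 0.9}, "merveille": {"wonder": 0.5, "wonders": 0.3}}
filtered = bidirectional_filter(fwd, bwd, "wonders")
```

The effect matches the slide: the bidirectional distribution concentrates its mass on the genuine translations.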
Experiment Setup
• Test collections
• Document processing
  – Stemming, accent removal (CLEF French)
  – Word segmentation, encoding conversion (TREC Chinese)
  – Stopword removal (all collections)
• Training statistical translation models (GIZA++)

                       English–French          English–Chinese
  Parallel corpus      Europarl                FBIS et al.
  # of sentence pairs  672,247                 1,583,807
  Models (iterations)  M1(10), HMM(5), M4(5)   M1(10)

                       CLEF French             TREC Chinese
  Source               CLEF’01-03              TREC-5,6
  Query language       English                 English
  Document language    French                  Chinese
  # of topics          151                     54
  # of documents       87,191                  139,801
  Avg # of rel docs    23                      95
Pruning Translations
[Figure: translation probability distribution for a query term — f1 (0.32), f2 (0.21), f3 (0.11), f4 (0.09), f5 (0.08), f6 (0.05), f7 (0.04), f8 (0.03), f9 (0.03), f10 (0.02), f11 (0.01), f12 (0.01) — and the set of translations retained at each cumulative probability threshold from 0.0 to 1.0: the lowest threshold keeps only f1; a threshold of 1.0 keeps all twelve]
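The cumulative-threshold pruning can be sketched as follows, using the probabilities from this slide (the function name is illustrative).

```python
def prune_translations(probs, threshold):
    """Keep the most probable translations, stopping once their cumulative
    probability reaches the threshold."""
    kept, cumulative = [], 0.0
    for term, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        if cumulative >= threshold:
            break
        kept.append(term)
        cumulative += p
    return kept

probs = {"f1": 0.32, "f2": 0.21, "f3": 0.11, "f4": 0.09, "f5": 0.08,
         "f6": 0.05, "f7": 0.04, "f8": 0.03, "f9": 0.03, "f10": 0.02,
         "f11": 0.01, "f12": 0.01}
prune_translations(probs, 0.3)   # keeps only f1
prune_translations(probs, 1.0)   # keeps all twelve translations
```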
Unidirectional without Synonyms (PSQ): Q → D
[Figures: MAP as a percentage of monolingual (40%–110%) vs. cumulative probability threshold (0–1); panels: CLEF French and TREC-5,6 Chinese]
Statistical significance vs. monolingual (Wilcoxon signed rank test):
• CLEF French: worse at peak
• TREC-5,6 Chinese: worse at peak
Bidirectional with Synonyms (DAMM): matching (Q) and (D) in both directions vs. unidirectional Q → D
[Figures: MAP as a percentage of monolingual (40%–110%) vs. cumulative probability threshold (0.0–1.0), comparing DAMM, IMM, and PSQ; panels: CLEF French and TREC-5,6 Chinese]
• DAMM significantly outperformed PSQ
• DAMM is statistically indistinguishable from monolingual at peak
• IMM: nearly as good as DAMM for French, but not for Chinese
Indexing Time
[Figure: indexing time (sec) vs. collection size in thousands of documents (0–500), monolingual vs. cross-language]
Dictionary-based vector translation, single Sun SPARC in 2001
The Problem Space
• Retrospective search
  – Web search
  – Specialized services (medicine, law, patents)
  – Help desks
• Real-time filtering
  – Email spam
  – Web parental control
  – News personalization
• Real-time interaction
  – Instant messaging
  – Chat rooms
  – Teleconferences
Key Capabilities
• Map across languages
  – For human understanding
  – For automated processing
Making a Market
• Multitude of potential applications
  – Retrospective search, email, IM, chat, …
  – A natural consequence of language diversity
• The limiting factor is translation readability
  – Searchability is mostly a solved problem
• Leveraging human translation has potential
  – Translation routing, volunteers, caching