Hsin-Hsi Chen 1
Introduction to Cross-Language Information Retrieval
Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering
National Taiwan University
Outline
Multilingual Environments
What is Cross-Language Information Retrieval?
Major Problems in CLIR
Major Approaches in CLIR
Case Study: CLIR in NPDM
Summary
Multilingual Collections
There are 6,703 languages listed in the Ethnologue.
Digital libraries
– The OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and holds over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages.
World Wide Web
– Around 40% of Internet users do not speak English; however, 80% of Web sites are still in English.
Speakers by Language in the Real World (http://www.g11n.com/faq.htm)
[Figure: bar chart of speakers (millions, axis from 0 to 800) by language: Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese.]
[Figure: chart with language labels Spanish, German, Japanese, French, Chinese, Dutch, Portuguese, Italian, Swedish, Korean.]
(Statistics from Euro-Marketing Associates, 1998)
http://www.glreach.com/globstats/
(Statistics from Euro-Marketing Associates, 1999)
Chinese-speaking population share (6.1%) < French-speaking population share (8.8%) (1998)
Speakers by Language on the Internet
Internet Content
[Figure: bar chart of Internet hosts (thousands, axis from 100 to 100,000) by language (estimated by domain): English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish. Bar labels as extracted: 33,878; 1,687; 1,684; 654; 546; 473; 458; 432 (last label illegible). Source: Network Wizards Jan 99 Internet Domain Survey.]
40% of Internet users do not understand English, but 80% of Internet content is in English.
(Source: http://www.emarketer.com)
What is Cross-Language Information Retrieval?
Definition: Select information in one language based on queries in another.
Terminology
– Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval)
– Translingual Information Retrieval (Defense Advanced Research Projects Agency, DARPA)
Generalization: Multi- & Cross-Lingual Information Access
MLIR Applications
Multilingual information access in a multilingual country, organization, enterprise, etc.
Cross-language information retrieval for users who read a second language (large passive vocabulary) but are not able to formulate good queries (small active vocabulary).
Monolingual users may retrieve images by taking advantage of multilingual captions.
Monolingual users may retrieve documents and have them translated (automatically or manually) into their language.
Why is Cross-Language Information Retrieval Important?
More information workers with less time require fast access to global resources
global B2B interactions (virtual enterprises)
global B2C interactions (online trading, travelling)
time-critical information (translation comes too late)
History
1970 Salton runs retrieval experiments with a small English/German dictionary
1972 Pevzner shows for English and Russian that a controlled thesaurus can be used effectively for query term translation
1978 ISO Standard 5964 for developing multilingual thesauri (revised in 1985)
1990 Latent Semantic Indexing (LSI) applied to CLIR
History (Continued)
1994 First PhD thesis on CLIR by Khaled Radwan
1996 Similarity thesaurus applied to CLIR (ETH Zurich)
1996 Dictionary-based retrieval applied to CLIR (UMass & Xerox Grenoble)
1997 Generalized Vector Space Model (GVSM) applied to CLIR (CMU)
History (Continued)
1997 CLIR (Cross-Language Information Retrieval) track starts within TREC
1998 NTCIR starts in Japan
1999 TIDES (Translingual Information Detection, Extraction, and Summarization) starts in the U.S.
2000 CLEF starts in Europe
An Architecture of Multilingual Information Access
[Figure: architecture diagram. Component labels: User Interface (UI), Language Identification (LI), Language Translation, Query Translation, Document Translation, Text Processing, Information Extraction, Information Filtering, Information Retrieval, Text Summarization, Text Classification, Multilingual Resources, Multiple Languages, Native Language(s).]
Major Problems of CLIR
Queries and documents are in different languages.
– translation
Words in a query may be ambiguous.
– disambiguation
Queries are usually short.
– expansion
Major Problems of CLIR (Continued)
Queries may have to be segmented.
– segmentation
A document may be written in several languages.
– language identification
Enhancing Traditional Information Retrieval Systems
Which part(s) should be modified for CLIR?
[Figure: diagram of an IR system with boxes Documents, Queries, Document Representation, Query Representation, and Comparison; the candidate modification points are numbered (1) to (4).]
Enhancing Traditional Information Retrieval Systems (Continued)
(1): text translation
(2): vector translation
(3): query translation
(4): term vector translation
(1) and (2), (3) and (4): interlingual form
What are the Problems?
Ambiguous terms (e.g., performance)
Multiword phrases may correspond to single words (e.g., South Africa => 南非, Südafrika)
Coverage of the vocabulary
There is not a one-to-one mapping between two languages
Translating queries automatically (lack of syntax)
Translating documents automatically (performance, …)
Computing mixed result lists
Cross-Language Information Retrieval
[Figure: taxonomy trees of CLIR approaches. Node labels: Controlled Vocabulary, Free Text; Knowledge-based, Corpus-based, Hybrid; Thesaurus-based, Ontology-based, Dictionary-based; Term-aligned, Sentence-aligned, Document-aligned (Parallel, Comparable), Unaligned; Query Translation, Document Translation, No Translation; Text Translation, Vector Translation.]
Query Translation Based CLIR
[Figure: English Query => Translation Device => Chinese Query => Monolingual Chinese Retrieval System => Retrieved Chinese Documents.]
Translating the 400 Million Non-English Pages of the WWW
... would take 100,000 days (about 300 years) on one fast PC, or 1 month on 3,600 PCs.
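The figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: the three constants come from the slide, while the variable names and the assumption that the work splits perfectly evenly across machines are illustrative.

```python
# Figures quoted on the slide.
PAGES = 400_000_000        # non-English pages on the WWW
DAYS_ON_ONE_PC = 100_000   # translation time on one fast PC
PCS = 3_600                # size of the hypothetical PC cluster

# Throughput implied by the slide: pages translated per PC per day.
pages_per_pc_per_day = PAGES / DAYS_ON_ONE_PC

# Single-PC time expressed in years (365-day years).
years_on_one_pc = DAYS_ON_ONE_PC / 365

# Assumption: the work parallelizes perfectly across all PCs.
days_on_cluster = DAYS_ON_ONE_PC / PCS

print(pages_per_pc_per_day, round(years_on_one_pc), round(days_on_cluster, 1))
```

The slide's "300 years" and "1 month" are therefore rounded figures: the exact values are closer to 274 years and 28 days.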
Knowledge-Based
Examples
– Subject Thesaurus
• Hierarchical and associative relations.
• Unique term assigned to each node.
– Concept List
• Term space partitioned into concept spaces.
– Term List
• List of cross-language synonyms.
– Lexicon
• Machine readable syntax and/or semantics.
Ontology-Based Approaches
Exploit complex knowledge representations, e.g., EuroWordNet
A Proposal for Conceptual Indexing using EuroWordNet
Dictionary-Based Approaches
Exploit machine-readable dictionaries.
Problems
– translation ambiguity + target polysemy
– coverage (unknown words, abbreviations, ...)
Dictionary-Based Approaches (Continued)
Issue 1: selection strategy
– Select all.
– Select N randomly.
– Select best N.
Issue 2: which level
– word
– phrase
Selection Strategy: Select All
Hull and Grefenstette 1996
– Take the concatenation of all term translations.
E: politically motivated civil disturbances
F: troubles civils a caractere politique
trouble - turmoil, discord, trouble, unrest, disturbance, disorder
civil - civil, civilian, courteous
caractere - character, nature
politique - political, diplomatic, politician, policy
– Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8% of the original.
– Errors: multi-word expressions and ambiguity
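The select-all strategy can be sketched as follows. The dictionary entries are copied from the slide's example; the function names and the crude plural-stripping `lemma` helper are hypothetical stand-ins for real morphological analysis, not the method used by Hull and Grefenstette.

```python
# Transfer dictionary entries copied from the slide's French -> English example.
TRANSFER_DICT = {
    "trouble": ["turmoil", "discord", "trouble", "unrest", "disturbance", "disorder"],
    "civil": ["civil", "civilian", "courteous"],
    "caractere": ["character", "nature"],
    "politique": ["political", "diplomatic", "politician", "policy"],
}


def lemma(word):
    """Hypothetical stand-in for lemmatization: strip a plural 's' if that yields an entry."""
    if word.endswith("s") and word[:-1] in TRANSFER_DICT:
        return word[:-1]
    return word


def select_all(query):
    """Replace every source word by the concatenation of all its translations."""
    target_terms = []
    for word in query.lower().split():
        # Words without a dictionary entry (here the function word "a") are dropped.
        target_terms.extend(TRANSFER_DICT.get(lemma(word), []))
    return target_terms


english_query = select_all("troubles civils a caractere politique")
print(english_query)
```

The four content words expand into fifteen English terms, which illustrates how ambiguity adds extraneous terms to the translated query.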
Selection Strategy: Select All (Continued)
Davis 1997 (TREC5)
– Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
– Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12%
Evaluation Method
Average Precision (5-, 9-, 11-points)
Model
[Figure: three retrieval pipelines over the TREC Spanish Corpus. (a) Spanish Query => Mono IR Engine. (b) English Query => Bilingual Dictionary => Spanish Equivalents => Mono IR Engine. (c) English Query => POS => Bilingual Dictionary => Spanish Equivalents by POS => Mono IR Engine.]
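The slide names 5-, 9- and 11-point average precision without defining them. The sketch below assumes the standard 11-point interpolated definition (precision interpolated at recall levels 0.0, 0.1, ..., 1.0); the function name and input format are illustrative.

```python
def eleven_point_average_precision(ranked_relevance, total_relevant):
    """11-point interpolated average precision for one query.

    ranked_relevance: list of booleans, True where the document at that rank is relevant.
    total_relevant: number of relevant documents for the query in the collection.
    """
    # Record (recall, precision) after each rank position.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    interpolated = []
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11


score = eleven_point_average_precision([True, False, True], total_relevant=2)
print(round(score, 4))
```

Scores such as the 0.2895 monolingual figure are averages of a per-query measure of this kind over all test queries.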
Selection Strategy: Select N
Simple word-by-word translation
– Each query term is replaced by the word or group of words given for the first sense of the term’s definition.
• 50-60% drop in performance (average precision)
Selection Strategy: Select N (Continued)
Word/phrase translation
– Take at most three translations of each word, one from each of the first three senses. Take the phrase translation if the phrase appears in the dictionary.
• 30-50% worse than good translation
– Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
• WBW (0.0244) vs. phrasal (0.0148) vs. good phrasal (0.0610): -39.3%, +150.3%
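A minimal sketch of the word/phrase strategy described above: phrases found in the dictionary are translated as a unit, and every other word contributes one translation from each of its first three senses. The toy English-Spanish entries and all names are hypothetical.

```python
# Hypothetical toy dictionary: each word maps to a list of senses,
# and each sense is a list of candidate translations.
WORD_SENSES = {
    "light": [["luz", "lumbre"], ["ligero"], ["claro"], ["encender"]],
    "bulb": [["bulbo"], ["bombilla"]],
}
# Hypothetical phrase entries keyed by two-word tuples.
PHRASES = {("light", "bulb"): ["bombilla"]}


def select_n(tokens, max_senses=3):
    """Translate tokens, preferring phrase entries over word-by-word lookup."""
    translated = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            # Phrase found in the dictionary: use its translation as a unit.
            translated.extend(PHRASES[pair])
            i += 2
            continue
        # Otherwise take the first translation of each of the first max_senses senses.
        senses = WORD_SENSES.get(tokens[i], [])
        translated.extend(sense[0] for sense in senses[:max_senses])
        i += 1
    return translated


print(select_n(["light", "bulb"]))
print(select_n(["bulb", "light"]))
```

Setting `max_senses=1` gives the simple first-sense word-by-word translation of the previous slide.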
Selection Strategy: Select Best N
Hayashi, Kikui and Susaki 1997
– search for a dictionary entry corresponding to the longest sequence of words from left to right
– choose the most frequently used word (or phrase) in a text corpus collected from the WWW
– no results reported for this query translation approach
Davis 1997 (TREC5)
– POS disambiguation
– Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): nearly 67.3%
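The two steps attributed to Hayashi, Kikui and Susaki (longest left-to-right dictionary match, then choosing the candidate that is most frequent in a corpus) can be sketched like this. The English-French entries, the frequency counts, and the function name are all hypothetical illustrations, not data from their system.

```python
# Hypothetical dictionary: tuples of source words -> candidate translations.
DICTIONARY = {
    ("information", "retrieval"): ["recherche d'information", "recherche documentaire"],
    ("information",): ["information", "renseignement"],
    ("retrieval",): ["recherche", "extraction"],
    ("engine",): ["moteur", "machine"],
}
# Hypothetical target-language corpus frequencies.
CORPUS_FREQ = {
    "recherche d'information": 120, "recherche documentaire": 45,
    "information": 900, "renseignement": 80,
    "recherche": 700, "extraction": 150,
    "moteur": 300, "machine": 210,
}
MAX_ENTRY_LEN = max(len(key) for key in DICTIONARY)


def select_best(tokens):
    """Greedy longest match from left to right, then pick the most frequent candidate."""
    translated = []
    i = 0
    while i < len(tokens):
        # Try the longest possible entry first, then shorter ones.
        for n in range(min(MAX_ENTRY_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in DICTIONARY:
                best = max(DICTIONARY[key], key=lambda t: CORPUS_FREQ.get(t, 0))
                translated.append(best)
                i += n
                break
        else:
            i += 1  # no entry starts here: skip the unknown word
    return translated


print(select_best(["information", "retrieval", "engine"]))
```

Because the two-word entry is tried first, "information retrieval" is translated as a phrase rather than word by word.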
Corpus-Based Approaches
Categorization
– Term-Aligned
– Sentence-Aligned
– Document-Aligned (Parallel, Comparable)
– Unaligned
Usage
– Set up a thesaurus
– Vector Mapping
Term-Aligned Corpora
Fine-grained alignment in parallel corpora
Oard 1996
– Term alignment is a challenging problem.
[Figure: Parallel Bilingual Corpus => Cooccurrence Statistics => Translation Tables => Machine Translation System; English Query => Machine Translation System => Spanish Query.]
Sentence-Aligned Corpora
Davis & Dunning 1996 (TREC4)
– High-frequency Terms
Brief Summary
dictionary-based methods
– Specialized vocabulary not in the dictionaries will not be translated.
– Ambiguities will add extraneous terms to the query.
parallel/comparable corpora-based methods
– Parallel corpora are not always available.
– Available corpora tend to be relatively small or to cover only a small number of subjects.
– Performance depends on how well the corpora are aligned.
Brief Summary (Continued)
Dictionaries are very useful.
– Achieve about 50% of monolingual effectiveness on their own
Parallel corpora have limitations.
– Domain shifts
– Term alignment accuracy
Dictionaries and corpora are complementary.
– Dictionaries provide broad and shallow coverage.
– Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.
Hybrid Methods
What knowledge can be employed?
– lexical knowledge
– corpus knowledge
– ...
Hybrid Methods (Continued)
Query Expansion
– Issue 1: context
• Pseudo relevance feedback (local feedback): a query is modified by the addition of terms found in the top retrieved documents.
• Local context analysis: queries are expanded by the addition of the top-ranked concepts from the top passages.
Hybrid Methods (Continued)
– Issue 2: when
• before query translation
• after query translation
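Pseudo relevance feedback as described above can be sketched in a few lines. This is a simplification under stated assumptions: expansion terms are ranked by raw frequency in the top documents, whereas real systems typically use weighted term scores; the function name, parameters, and sample documents are hypothetical.

```python
from collections import Counter


def pseudo_relevance_feedback(query_terms, ranked_docs, top_docs=2, new_terms=1):
    """Expand a query with the most frequent new terms from the top-ranked documents."""
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        # Count only terms that are not already part of the query.
        counts.update(t for t in doc.lower().split() if t not in query_terms)
    expansion = [term for term, _ in counts.most_common(new_terms)]
    return list(query_terms) + expansion


# Hypothetical ranked result list for the initial query.
docs = ["landscape painting scroll", "landscape painting ink", "enamel vase"]
expanded = pseudo_relevance_feedback(["landscape"], docs)
print(expanded)
```

Applying the same function to the source-language query before translation, or to the target-language query after translation, corresponds to the two timing options listed under Issue 2.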
Hybrid Methods (Continued)
Ballesteros & Croft 1997
[Figure: experimental setup. Box labels: Original Spanish TREC Queries, human translation, English (BASE) Queries, automatic dictionary translation, query expansion, English Queries, Spanish Queries, INQUERY.]
Hybrid Methods (Continued)
– Performance Evaluation
• pre-translation: MRD (0.0823) vs. LF (0.1099) vs. LCA10 (0.1139): +33.5%, +38.5%
• post-translation: MRD (0.0823) vs. LF (0.0916) vs. LCA20 (0.1022): +11.3%, +24.1%
• combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242) vs. LCA20 (0.1358): +51.0%, +65.0%
• 32% below a monolingual baseline
Cross-Language Evaluation Forum
A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)
Extension of CLIR track at TREC (1997-1999)
Main Goals
Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
– CLIR system evaluation, testing and tuning
– Comparison and discussion of results
CLEF 2000 Task Description
Four evaluation tracks in CLEF 2000
– multilingual information retrieval
– bilingual information retrieval
– monolingual (non-English) information retrieval
– domain-specific IR
Case Study: CLIR for NPDM
3M in Digital Libraries/Museums
Multi-media
– Selecting suitable media to represent contents
Multi-linguality
– Decreasing the language barriers
Multi-culture
– Integrating multiple cultures
NPDM Project
The Palace Museum, Taipei, is one of the most famous museums in the world
NSC supports a pioneering study of a digital museum project, NPDM, starting from 2000
– Enamels from the Ming and Ch’ing Dynasties
– Famous Album Leaves of the Sung Dynasty
– Illustrations in Buddhist Scriptures with Relative Drawings
Design Issues
Standardization
– A standard metadata protocol is indispensable for the interchange of resources with other museums.
Multimedia
– A suitable presentation scheme is required.
Internationalization
– to share the valuable resources of NPDM with users of different languages
– to utilize knowledge presented in a foreign language
Translingual Issue
CLIR
– to allow users to issue queries in one language to access documents in another language
– the query language is English and the document language is Chinese
Two common approaches
– Query translation
– Document translation
Resources in NPDM pilot
an enamel, a work of calligraphy, a painting, or an illustration
MICI-DC
– Metadata Interchange for Chinese Information
– Fields accessible to users
• Short descriptions vs. full texts
• Bilingual versions vs. Chinese only
– Fields for maintenance only
Search Modes
Free search
– users describe their information need in natural language (Chinese or English)
Specific topic search
– users fill in specific fields denoting authors, titles, dates, and so on
Example
Information need
– Retrieve “Travelers Among Mountains and Streams, Fan K‘uan” (“范寬谿山行旅圖”)
Possible queries
– Author: Fan Kuan; Kuan, Fan
– Time: Sung Dynasty
– Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
– Free search: landscape painting; travelers, huge mountain, Nature; scenery; Shensi province
ECIR in NPDM
[Figure: system diagram. Component labels: English Query, Name Search, Title Search, Query Disambiguation, Query Translation, Chinese Query, Chinese IR System, NPDM Collection, Results; Chinese Names, Machine Transliteration, English Names; Chinese Titles, Document Translation, English Titles; Specific Bilingual Dictionary, Generic Bilingual Dictionary.]
Specific Topic Search
Proper names are important query terms
– Creators such as “林逋” (Lin P’u), “李建中” (Li Chien-chung), “歐陽脩” (Ou-yang Hsiu), etc.
– Emperors such as “康熙” (K'ang-hsi), “乾隆” (Ch'ien-lung), “徽宗” (Hui-tsung), etc.
– Dynasties such as “宋” (Sung), “明” (Ming), “清” (Ch’ing), etc.
Name Transliteration
The writing systems of Chinese and English are totally different
Wade-Giles (WG) and Pinyin are two well-known systems used in libraries to romanize Chinese
Backward transliteration
– Transliterate target language terms back to source language ones
– Chen, Huang, and Tsai (COLING, 1998)
– Lin and Chen (ROCLING, 2000)
Name Mapping Table
Divide a name into a sequence of Chinese characters, and transform each character into phonemes
Look up the phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name
Example
– “林逋” => “ㄌㄧㄣ ㄆㄨ” => “Lin P’u” (WG)
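The two lookups described above (character to phonemes, then phonemes to romanization) can be sketched as follows. The miniature tables are hypothetical and cover only the names used in these slides, with the reading of “逋” taken from the slide's example; the sketch also assumes a single-character surname and uses a plain apostrophe in the Wade-Giles output.

```python
# Hypothetical miniature tables; a real system would cover every character and syllable.
CHAR_TO_ZHUYIN = {"林": "ㄌㄧㄣ", "逋": "ㄆㄨ", "朱": "ㄓㄨ", "熹": "ㄒㄧ"}
ZHUYIN_TO_WG = {"ㄌㄧㄣ": "lin", "ㄆㄨ": "p'u", "ㄓㄨ": "chu", "ㄒㄧ": "hsi"}
ZHUYIN_TO_PINYIN = {"ㄌㄧㄣ": "lin", "ㄆㄨ": "pu", "ㄓㄨ": "zhu", "ㄒㄧ": "xi"}


def canonical_name(name, table=ZHUYIN_TO_WG):
    """Romanize a Chinese name character by character into a canonical form."""
    # Step 1: character -> phonemes; step 2: phonemes -> romanized syllable.
    syllables = [table[CHAR_TO_ZHUYIN[ch]] for ch in name]
    # Assumption: the first character is the surname, the rest is the given name.
    surname, given = syllables[0], syllables[1:]
    given_part = "-".join(given).capitalize()
    return f"{surname.capitalize()} {given_part}".strip()


print(canonical_name("林逋"))
print(canonical_name("朱熹", ZHUYIN_TO_PINYIN))
```

Swapping the second table switches the canonical form between Wade-Giles and Pinyin without touching the character-to-phoneme step.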
Name Similarity
Extract the named entity from the query
Select the most similar named entity from the name mapping table
Naming sequence/scheme
– LastName FirstName1, e.g., Chu Hsi (朱熹)
– FirstName1 LastName, e.g., Hsi Chu (朱熹)
– LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
– FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
– Any order, e.g., Tao Ning Hsu (許道寧)
– Any transliteration, e.g., Ju Shi (朱熹)
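One way to tolerate the naming schemes listed above is to compare every ordering of the query syllables against each canonical name and keep the best string similarity. The slides do not specify the similarity measure, so this sketch substitutes Python's `difflib.SequenceMatcher`; the small name table and function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import permutations

# Hypothetical name mapping table: canonical Wade-Giles form -> Chinese name.
NAME_TABLE = {"Chu Hsi": "朱熹", "Hsu Tao-ning": "許道寧", "Lin P'u": "林逋"}


def syllables(name):
    """Lower-case a romanized name and split it into syllables."""
    return name.lower().replace("-", " ").split()


def similarity(query_name, canonical):
    """Best character-level similarity over all orderings of the query syllables."""
    target = " ".join(syllables(canonical))
    return max(
        SequenceMatcher(None, " ".join(order), target).ratio()
        for order in permutations(syllables(query_name))
    )


def most_similar(query_name):
    """Return the canonical entry (and its Chinese form) closest to the query name."""
    best = max(NAME_TABLE, key=lambda canonical: similarity(query_name, canonical))
    return best, NAME_TABLE[best]


print(most_similar("Tao Ning Hsu"))
print(most_similar("Ju Shi"))
```

Trying all orderings handles the "any order" cases exactly, while the fuzzy ratio gives partial credit to variant transliterations such as "Ju Shi". Enumerating permutations is only practical because names have very few syllables.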
Title
“谿山行旅圖” (“Travelers among Mountains and Streams”)
"travelers", "mountains", and "streams" are basic components
Users can express their information need by describing the desired artwork
The system measures the similarity between artwork titles (descriptions) and the query
Free Search
A query is composed of several concepts.
Concepts are either transliterated or translated.
Query translation is similar to a small-scale IR system.
Resources
– Name-mapping table
– Title-mapping table
– Specific English-Chinese Dictionary
– Generic English-Chinese Dictionary
– …
Algorithm
(1) For each resource, the Chinese translations whose scores are larger than a specific threshold are selected.
(2) The Chinese translations identified from different resources are merged and sorted by their scores.
(3) Consider the Chinese translation with the highest score in the sorted sequence.
– If the intersection of the corresponding English description and the query is not empty, then select the translation and delete from the query the English terms it shares with the English description.
– Otherwise, skip the Chinese translation.
Algorithm (Continued)
(4) Repeat step (3) until the query is empty or all the Chinese translations in the sorted sequence have been considered.
(5) If the query is not empty, then the remaining words are looked up in the general dictionary. A Chinese query is composed of all the translated results.
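Steps (1) to (5) can be sketched as a greedy selection loop. The slides do not say how the scores are computed, so they are supplied here as given numbers; the resource entries, the threshold value, and all names are hypothetical.

```python
def translate_query(query_terms, resources, general_dict, threshold=0.5):
    """Greedy query translation.

    resources: list of resources, each a list of
               (english_description, chinese_translation, score) tuples.
    general_dict: fallback mapping from English word to Chinese translation.
    """
    # Steps (1)-(2): keep candidates above the threshold, merge, sort by score.
    candidates = sorted(
        (entry for resource in resources for entry in resource if entry[2] > threshold),
        key=lambda entry: entry[2],
        reverse=True,
    )
    remaining = list(query_terms)
    chinese_query = []
    # Steps (3)-(4): walk the sorted candidates until the query is used up.
    for english, chinese, _score in candidates:
        if not remaining:
            break
        common = set(english.lower().split()) & set(remaining)
        if common:
            chinese_query.append(chinese)
            remaining = [term for term in remaining if term not in common]
    # Step (5): look up leftover words in the general dictionary.
    chinese_query.extend(general_dict[t] for t in remaining if t in general_dict)
    return chinese_query


# Hypothetical resources with made-up scores.
name_table = [("fan kuan", "范寬", 0.9)]
title_table = [("travelers among mountains and streams", "谿山行旅圖", 0.8)]
specific_dict = [("mountains", "山", 0.4)]  # below the threshold, so filtered out
general = {"painting": "繪畫"}

result = translate_query(
    ["fan", "kuan", "mountains", "painting"],
    [name_table, title_table, specific_dict],
    general,
)
print(result)
```

Here "fan kuan" is consumed by the name table, "mountains" by the title table, and "painting" falls through to the general dictionary.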
Summary
Users can choose English input and Chinese output when they can read Chinese but are unfamiliar with Chinese input methods or lack a Chinese input device.
Images and videos remain accessible to users who cannot read or write Chinese.
The integration of different knowledge sources will be studied in the future.