
Hsin-Hsi Chen 1

Introduction to Cross-Language Information Retrieval

Hsin-Hsi Chen ( 陳信希 )

Department of Computer Science and Information Engineering

National Taiwan University

Hsin-Hsi Chen 2

Outline

Multilingual Environments
What is Cross-Language Information Retrieval?
Major Problems in CLIR
Major Approaches in CLIR
Case Study: CLIR in NPDM
Summary

Hsin-Hsi Chen 3

Multilingual Collections

There are 6,703 languages listed in the Ethnologue.

Digital libraries
– OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages.

World Wide Web
– Around 40% of Internet users do not speak English; however, 80% of Web sites are still in English.

Hsin-Hsi Chen 4

Real-world language populations (http://www.g11n.com/faq.htm)

[Figure: bar chart of speakers (millions, axis 0 to 800) by language: Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese.]

Hsin-Hsi Chen 5

(Statistics from Euro-Marketing Associates, 1998)

[Figure: languages shown: Spanish, German, Japanese, French, Chinese, Dutch, Portuguese, Italian, Swedish, Korean.]

Hsin-Hsi Chen 6

http://www.glreach.com/globstats/

(Statistics from Euro-Marketing Associates, 1999)

Share of Chinese speakers (6.1%) < share of French speakers (8.8%) (1998)

Hsin-Hsi Chen 7

Language populations on the Internet

Hsin-Hsi Chen 8

Internet content

[Figure: Internet hosts (thousands; axis ticks 100 to 100,000) by language, estimated by domain. Languages: English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish. Values shown: 33,878; 1,687; 1,684; 654; 546; 473; 458; 432; 546.]

(Network Wizards Jan 99 Internet Domain Survey)

40% of Internet users do not understand English, but 80% of Internet content is in English.

Hsin-Hsi Chen 9

(Source: http://www.emarketer.com)

Hsin-Hsi Chen 10

What is Cross-Language Information Retrieval?

Definition: Select information in one language based on queries in another.

Terminologies
– Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval)
– Translingual Information Retrieval (Defense Advanced Research Projects Agency, DARPA)

Hsin-Hsi Chen 11

Generalization: Multi- & Cross- Lingual Information Access

Hsin-Hsi Chen 12

MLIR Applications

Multilingual information access in a multilingual country, organization, enterprise, etc.

Cross-language information retrieval for users who can read a second language (large passive vocabulary) but cannot formulate good queries in it (small active vocabulary).

Monolingual users may retrieve images by taking advantage of multilingual captions.

Monolingual users may retrieve documents and have them translated (automatically or manually) into their language.

Hsin-Hsi Chen 13

Why is Cross-Language Information Retrieval Important?

More information workers with less time require fast access to global resources.

global B2B interactions (virtual enterprises)
global B2C interactions (online trading, travelling)
time-critical information (translation comes too late)

Hsin-Hsi Chen 14

History

1970 Salton runs retrieval experiments with a small English/German dictionary

1972 Pevzner shows for English and Russian that a controlled thesaurus can be used effectively for query term translation

1978 ISO Standard 5964 for developing multilingual thesauri (revised in 1985)

1990 Latent Semantic Indexing (LSI) applied to CLIR

Hsin-Hsi Chen 15

History (Continued)

1994 1st PhD thesis on CLIR by Khaled Radwan

1996 Similarity thesaurus applied to CLIR (ETH Zurich)

1996 Dictionary-based retrieval applied to CLIR (UMass & Xerox Grenoble)

1997 Generalized Vector Space Model (GVSM) applied to CLIR (CMU)

Hsin-Hsi Chen 16

History (Continued)

1997 CLIR (Cross-Language Information Retrieval) track starts within TREC

1998 NTCIR starts in Japan

1999 TIDES (Translingual Information Detection, Extraction, and Summarization) starts in the U.S.

2000 CLEF starts in Europe

Hsin-Hsi Chen 17

An Architecture of Multilingual Information Access

[Figure: architecture diagram. Components: User Interface (UI); Language Identification (LI); Query Translation; Document Translation; Information Retrieval; Information Filtering; Information Extraction; Text Summarization; Text Classification; Text Processing; Language Translation; Multilingual Resources; Multiple Languages; Native Language(s).]

Hsin-Hsi Chen 18

Major Problems of CLIR

Queries and documents are in different languages.
– translation

Words in a query may be ambiguous.
– disambiguation

Queries are usually short.
– expansion

Hsin-Hsi Chen 19

Major Problems of CLIR (Continued)

Queries may have to be segmented.
– segmentation

A document may be written in several languages.
– language identification

Hsin-Hsi Chen 20

Enhancing Traditional Information Retrieval Systems

Which part(s) should be modified for CLIR?

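[Figure: Documents → Document Representation → Comparison ← Query Representation ← Queries, with candidate modification points (1) Documents, (2) Document Representation, (3) Queries, (4) Query Representation.]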

Hsin-Hsi Chen 21

Enhancing Traditional Information Retrieval Systems

(Continued)

(1): text translation
(2): vector translation
(3): query translation
(4): term vector translation
(1) and (2), (3) and (4): interlingual form

Hsin-Hsi Chen 22

What are the Problems?

Ambiguous terms (e.g., performance)
Multiword phrases may correspond to single-word phrases (e.g., South Africa => 南非, Südafrika)
Coverage of the vocabulary
There is not a one-to-one mapping between two languages
Translating queries automatically (lack of syntax)
Translating documents automatically (performance, …)
Computing mixed result lists

Hsin-Hsi Chen 23

Cross-Language Information Retrieval

[Figure: taxonomy of Cross-Language Information Retrieval approaches. Nodes:
– Translation strategy: Query Translation; Document Translation (Text Translation, Vector Translation); No Translation
– Vocabulary: Controlled Vocabulary; Free Text
– Knowledge-based: Thesaurus-based, Ontology-based, Dictionary-based
– Corpus-based: Term-aligned, Sentence-aligned, Document-aligned (Parallel, Comparable), Unaligned
– Hybrid]

Hsin-Hsi Chen 24

Query Translation Based CLIR

English Query → Translation Device → Chinese Query → Monolingual Chinese Retrieval System → Retrieved Chinese Documents

Hsin-Hsi Chen 25

Translating the 400 Million non-English Pages of the WWW

... would take 100,000 days (300 years) on one fast PC, or 1 month on 3,600 PCs.
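The arithmetic behind these figures can be checked with a short sketch. The per-page time is not stated on the slide; it is derived here from the slide's own totals (400 million pages, 100,000 days on one PC), and the multi-PC estimate assumes the work splits evenly across machines.

```python
# Back-of-the-envelope check of the slide's translation-time estimate.
PAGES = 400_000_000        # non-English pages (from the slide)
DAYS_ON_ONE_PC = 100_000   # total time on one fast PC (from the slide)
SECONDS_PER_DAY = 24 * 60 * 60

# Implied translation time per page (derived, not stated on the slide).
seconds_per_page = DAYS_ON_ONE_PC * SECONDS_PER_DAY / PAGES

# The same workload expressed in years on a single machine.
years_on_one_pc = DAYS_ON_ONE_PC / 365


def days_with_pcs(num_pcs: int) -> float:
    """Days needed if the work splits evenly across num_pcs machines (assumption)."""
    return DAYS_ON_ONE_PC / num_pcs


print(f"{seconds_per_page:.1f} s per page")
print(f"{years_on_one_pc:.0f} years on one PC")
print(f"{days_with_pcs(3600):.1f} days on 3,600 PCs")
```

Under these assumptions the slide's "300 years" and "1 month" are rounded values: about 274 years and about 28 days, at roughly 21.6 seconds per page.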

Hsin-Hsi Chen 26

Knowledge-Based

Examples
– Subject Thesaurus
• Hierarchical and associative relations.
• Unique term assigned to each node.
– Concept List
• Term space partitioned into concept spaces.
– Term List
• List of cross-language synonyms.
– Lexicon
• Machine-readable syntax and/or semantics.

Hsin-Hsi Chen 27

Ontology-Based Approaches

Exploit complex knowledge representations, e.g., EuroWordNet

A Proposal for Conceptual Indexing using EuroWordNet

Hsin-Hsi Chen 28

Dictionary-Based Approaches

Exploit machine-readable dictionaries.

Problems
– translation ambiguity + target polysemy
– coverage (unknown words, abbreviations, ...)

Hsin-Hsi Chen 29

Dictionary-Based Approaches (Continued)

Issue 1: selection strategy
– Select all.
– Select N randomly.
– Select best N.

Issue 2: which level
– word
– phrase

Hsin-Hsi Chen 30

Selection Strategy: Select All

Hull and Grefenstette 1996
– Take the concatenation of all term translations.

E: politically motivated civil disturbances
F: troubles civils a caractere politique
trouble - turmoil, discord, trouble, unrest, disturbance, disorder
civil - civil, civilian, courteous
caractere - character, nature
politique - political, diplomatic, politician, policy

– Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8%.
– Errors: multi-word expressions and ambiguity
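The select-all strategy can be sketched in a few lines. The dictionary entries below are the ones listed in the example above; the stopword list, the crude plural stripping, and the function names are assumptions made for illustration and are not taken from Hull and Grefenstette's system.

```python
# Toy French-to-English transfer dictionary; entries copied from the slide example.
DICTIONARY = {
    "trouble": ["turmoil", "discord", "trouble", "unrest", "disturbance", "disorder"],
    "civil": ["civil", "civilian", "courteous"],
    "caractere": ["character", "nature"],
    "politique": ["political", "diplomatic", "politician", "policy"],
}
STOPWORDS = {"a"}  # assumption: function words are dropped before lookup


def lemmatize(word):
    """Crude normalisation (assumption): strip a plural 's' if that yields a headword."""
    if word not in DICTIONARY and word.endswith("s") and word[:-1] in DICTIONARY:
        return word[:-1]
    return word


def translate_select_all(query):
    """Replace every query word by the concatenation of all its dictionary translations."""
    target_terms = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue
        lemma = lemmatize(word)
        # Words missing from the dictionary are passed through untranslated.
        target_terms.extend(DICTIONARY.get(lemma, [lemma]))
    return target_terms


print(translate_select_all("troubles civils a caractere politique"))
```

The five-word French query becomes a 15-term English query, which illustrates how ambiguity adds extraneous terms such as "courteous" or "politician".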

Hsin-Hsi Chen 31

Selection Strategy: Select All (Continued)

Davis 1997 (TREC5)
– Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
– Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12%

Hsin-Hsi Chen 32

Evaluation Method

Average Precision (5-, 9-, 11-point)

Model
– Monolingual: Spanish Query → Mono IR Engine → TREC Spanish Corpus
– Dictionary substitution: English Query → Bilingual Dictionary → Spanish Equivalents → Mono IR Engine → TREC Spanish Corpus
– POS disambiguation: English Query → POS → Bilingual Dictionary → Spanish Equivalents by POS → Mono IR Engine → TREC Spanish Corpus
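As a reference point for the evaluation measure, the following is a minimal sketch of 11-point interpolated average precision for a single query, using the standard textbook definition (precision at recall level r is the highest precision observed at any recall at or above r). Whether the experiments cited here used exactly this variant is an assumption, and the document identifiers are made up.

```python
def eleven_point_average_precision(ranked_ids, relevant_ids):
    """11-point interpolated average precision for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0

    # (recall, precision) measured at each relevant document in the ranking.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    # Recall levels 0.0, 0.1, ..., 1.0.
    levels = [i / 10 for i in range(11)]
    total = 0.0
    for level in levels:
        # Interpolated precision: best precision at any recall >= level,
        # or 0 if that recall level is never reached.
        total += max((p for r, p in points if r >= level), default=0.0)
    return total / len(levels)


# Hypothetical ranking: d1 and d3 are relevant, retrieved at ranks 1 and 3.
print(eleven_point_average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```

In this toy ranking the interpolated precision is 1.0 for recall levels 0.0 to 0.5 and 2/3 for levels 0.6 to 1.0, so the result is (6 + 5 × 2/3) / 11, about 0.848.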

Hsin-Hsi Chen 33

Selection Strategy: Select N

Simple word-by-word translation
– Each query term is replaced by the word or group of words given for the first sense of the term’s definition.
• 50-60% drop in performance (average precision)

Hsin-Hsi Chen 34

Selection Strategy: Select N (Continued)

Word/phrase translation
– Take at most three translations of each word, one from each of the first three senses. Take the phrase translation if it appears in the dictionary.
• 30-50% worse than good translation
– Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
• WBW (0.0244), phrasal (0.0148, -39.3%), good phrasal (0.0610, +150.3%)

Hsin-Hsi Chen 35

Selection Strategy: Select Best N

Hayashi, Kikui and Susaki 1997
– search for a dictionary entry corresponding to the longest sequence of words from left to right
– choose the most frequently used word (or phrase) in a text corpus collected from the WWW
– no report for this query translation approach

Davis 1997 (TREC5)
– POS disambiguation
– Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): near 67.3%
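The two steps attributed to Hayashi, Kikui and Susaki (longest left-to-right dictionary match, then picking the candidate that is most frequent in a target-language corpus) can be sketched as follows. The English-to-Spanish entries, the frequency counts, and the function name are hypothetical and chosen only to make the example runnable; they are not from the cited system.

```python
def translate_best(query_words, dictionary, target_freq, max_len=3):
    """Longest-match lookup, keeping the most frequent translation of each match."""
    result = []
    i = 0
    while i < len(query_words):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(query_words) - i), 0, -1):
            phrase = " ".join(query_words[i:i + n])
            if phrase in dictionary:
                # Pick the translation with the highest corpus frequency.
                best = max(dictionary[phrase], key=lambda t: target_freq.get(t, 0))
                result.append(best)
                i += n
                break
        else:
            # No dictionary entry covers this word: keep it untranslated.
            result.append(query_words[i])
            i += 1
    return result


# Hypothetical toy resources.
DICTIONARY = {
    "south africa": ["Sudáfrica"],
    "south": ["sur"],
    "africa": ["África"],
    "bank": ["orilla", "banco"],
}
TARGET_FREQ = {"banco": 120, "orilla": 15, "Sudáfrica": 40, "sur": 300, "África": 90}

print(translate_best(["south", "africa", "bank"], DICTIONARY, TARGET_FREQ))
```

Here "south africa" is matched as a phrase before its parts are tried, and "bank" is rendered as "banco" because of its higher (made-up) corpus count.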

Hsin-Hsi Chen 36

Corpus-Based Approaches

Categorization
– Term-Aligned
– Sentence-Aligned
– Document-Aligned (Parallel, Comparable)
– Unaligned

Usage
– Setup Thesaurus
– Vector Mapping

Hsin-Hsi Chen 37

Term-Aligned Corpora

Fine-grained alignment in parallel corpora

Oard 1996
– Term alignment is a challenging problem.

[Figure: Parallel Bilingual Corpus; Co-occurrence Statistics; Translation Tables; Machine Translation System; English Query; Spanish Query.]

Hsin-Hsi Chen 38

Sentence-Aligned Corpora

Davis & Dunning 1996 (TREC4)
– High-frequency terms

Hsin-Hsi Chen 39

Brief Summary

Dictionary-based methods
– Specialized vocabulary not in the dictionaries will not be translated.
– Ambiguities will add extraneous terms to the query.

Parallel/comparable corpora-based methods
– Parallel corpora are not always available.
– Available corpora tend to be relatively small or to cover only a small number of subjects.
– Performance depends on how well the corpora are aligned.

Hsin-Hsi Chen 40

Brief Summary (Continued)

Dictionaries are very useful.
– Achieve 50% on their own

Parallel corpora have limitations.
– Domain shifts
– Term alignment accuracy

Dictionaries and corpora are complementary.
– Dictionaries provide broad and shallow coverage.
– Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.

Hsin-Hsi Chen 41

Hybrid Methods

What knowledge can be employed?
– lexical knowledge
– corpus knowledge
– ...

Hsin-Hsi Chen 42

Hybrid Methods (Continued)

Query Expansion
– Issue 1: context
• pseudo relevance feedback (local feedback): a query is modified by the addition of terms found in the top retrieved documents.
• local context analysis: queries are expanded by the addition of the top-ranked concepts from the top passages.
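Pseudo relevance feedback as described above can be sketched as follows. The sketch assumes an initial retrieval run has already produced a ranked list of tokenized documents, and it scores expansion terms by raw frequency in the top documents; that scoring rule, the parameter names, and the toy documents are simplifying assumptions, not the weighting used in the systems cited in these slides.

```python
from collections import Counter


def pseudo_relevance_feedback(query_terms, ranked_docs, top_docs=2, num_new_terms=2):
    """Expand a query with frequent terms from the top-ranked documents."""
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        # Count only terms that are not already in the query.
        counts.update(term for term in doc if term not in query_terms)

    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked_terms = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    new_terms = [term for term, _ in ranked_terms[:num_new_terms]]
    return list(query_terms) + new_terms


# Hypothetical ranked output of an initial retrieval run.
ranked = [
    ["election", "vote", "ballot", "vote"],
    ["vote", "poll", "election"],
    ["football", "match"],
]
print(pseudo_relevance_feedback(["election"], ranked))
```

With these toy documents the query "election" is expanded with "vote" (three occurrences in the top two documents) and "ballot" (tied with "poll", chosen alphabetically). The third document is ignored because only the top two are assumed relevant.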

Hsin-Hsi Chen 43

Hybrid Methods (Continued)

– Issue 2: when
• before query translation
• after query translation

Hsin-Hsi Chen 44

Hybrid Methods (Continued)

Ballesteros & Croft 1997

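[Figure: experimental setup. Original Spanish TREC Queries → human translation → English (BASE) Queries. The English queries are turned into Spanish queries by automatic dictionary translation, with query expansion applied to the English queries before translation, to the Spanish queries after translation, or both. The resulting Spanish queries are run with INQUERY.]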

Hsin-Hsi Chen 45

Hybrid Methods (Continued)

– Performance Evaluation
• pre-translation: MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1139, +38.5%)
• post-translation: MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%)
• combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.1358, +65.0%)
• 32% below a monolingual baseline

Hsin-Hsi Chen 46

Cross-Language Evaluation Forum

A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)

Extension of CLIR track at TREC (1997-1999)

Hsin-Hsi Chen 47

Main Goals

Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
– CLIR system evaluation, testing and tuning
– Comparison and discussion of results

Hsin-Hsi Chen 48

CLEF 2000 Task Description

Four evaluation tracks in CLEF 2000
– multilingual information retrieval
– bilingual information retrieval
– monolingual (non-English) information retrieval
– domain-specific IR

Hsin-Hsi Chen 49

Case Study: CLIR for NPDM

Hsin-Hsi Chen 50

3M in Digital Libraries/Museums

Multi-media– Selecting suitable media to represent contents

Multi-linguality– Decreasing the language barriers

Multi-culture– Integrating multiple cultures

Hsin-Hsi Chen 51

NPDM Project

Palace Museum, Taipei, one of the famous museums in the world

NSC supports a pioneering study, the digital museum project NPDM, starting in 2000
– Enamels from the Ming and Ch’ing Dynasties
– Famous Album Leaves of the Sung Dynasty
– Illustrations in Buddhist Scriptures with Relative Drawings

Hsin-Hsi Chen 52

Design Issues

Standardization
– A standard metadata protocol is indispensable for the interchange of resources with other museums.

Multimedia
– A suitable presentation scheme is required.

Internationalization
– to share the valuable resources of NPDM with users of different languages
– to utilize knowledge presented in a foreign language

Hsin-Hsi Chen 53

Translingual Issue

CLIR
– to allow users to issue queries in one language to access documents in another language
– the query language is English and the document language is Chinese

Two common approaches
– Query translation
– Document translation

Hsin-Hsi Chen 54

Resources in NPDM pilot

An enamel, a calligraphy, a painting, or an illustration

MICI-DC
– Metadata Interchange for Chinese Information
– Fields accessible to users
• Short descriptions vs. full texts
• Bilingual versions vs. Chinese only
– Fields for maintenance only

Hsin-Hsi Chen 55

Search Modes

Free search
– users describe their information need using natural language (Chinese or English)

Specific topic search
– users fill in specific fields denoting authors, titles, dates, and so on

Hsin-Hsi Chen 56

Example

Information need
– Retrieve “Travelers Among Mountains and Streams, Fan K’uan” (“范寬谿山行旅圖”)

Possible queries
– Author: Fan Kuan; Kuan, Fan
– Time: Sung Dynasty
– Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
– Free search: landscape painting; travelers, huge mountain, Nature; scenery; Shensi province

Hsin-Hsi Chen 57

ECIR in NPDM

[Figure: system diagram. Components: English Names; Chinese Names; Machine Transliteration; English Titles; Chinese Titles; Document Translation; Name Search; Title Search; English Query; Query Disambiguation; Specific Bilingual Dictionary; Generic Bilingual Dictionary; Query Translation; Chinese Query; Chinese IR System; NPDM Collection; Results.]

Hsin-Hsi Chen 58

Specific Topic Search

Proper names are important query terms
– Creators such as “林逋” (Lin P’u), “李建中” (Li Chien-chung), “歐陽脩” (Ou-yang Hsiu), etc.
– Emperors such as “康熙” (K’ang-hsi), “乾隆” (Ch’ien-lung), “徽宗” (Hui-tsung), etc.
– Dynasties such as “宋” (Sung), “明” (Ming), “清” (Ch’ing), etc.

Hsin-Hsi Chen 59

Name Transliteration

The writing systems of Chinese and English are totally different

Wade-Giles (WG) and Pinyin are two well-known systems used in libraries to romanize Chinese

Backward transliteration
– Transliterate target language terms back to source language ones
– Chen, Huang, and Tsai (COLING, 1998)
– Lin and Chen (ROCLING, 2000)

Hsin-Hsi Chen 60

Name Mapping Table

Divide a name into a sequence of Chinese characters, and transform each character into phonemes

Look up phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name

Example
– “林逋” → “ㄌㄧㄣ ㄆㄨ” → “Lin P’u” (WG)
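The two lookups described above (character to phonemes, then phonemes to Wade-Giles) can be sketched with two small tables. The entries for “林逋” follow the slide's example; the entries for “李建中”, the omission of tone marks, the single-character-surname assumption, and the function name are illustrative assumptions rather than the project's actual tables.

```python
# Character -> Zhuyin phonemes (tone marks omitted, as in the slide's example).
CHAR_TO_PHONEME = {
    "林": "ㄌㄧㄣ", "逋": "ㄆㄨ",
    "李": "ㄌㄧ", "建": "ㄐㄧㄢ", "中": "ㄓㄨㄥ",
}
# Zhuyin phonemes -> Wade-Giles syllable.
PHONEME_TO_WG = {
    "ㄌㄧㄣ": "lin", "ㄆㄨ": "p'u",
    "ㄌㄧ": "li", "ㄐㄧㄢ": "chien", "ㄓㄨㄥ": "chung",
}


def canonical_wg_name(name):
    """Derive a canonical Wade-Giles form, e.g. 'Li Chien-chung'.

    Assumes a single-character surname followed by the given name.
    """
    syllables = [PHONEME_TO_WG[CHAR_TO_PHONEME[ch]] for ch in name]
    surname = syllables[0].capitalize()
    # Given-name syllables are hyphenated; only the first one is capitalized.
    given = "-".join(syllables[1:]).capitalize()
    return f"{surname} {given}".strip()


print(canonical_wg_name("林逋"))    # Lin P'u
print(canonical_wg_name("李建中"))  # Li Chien-chung
```

A Pinyin canonical form would use the same structure with a phoneme-to-Pinyin table in place of `PHONEME_TO_WG`. Two-character surnames such as “歐陽” would need a surname list, which this sketch does not include.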

Hsin-Hsi Chen 61

Name Similarity

Extract the named entity from the query

Select the most similar named entity from the name mapping table

Naming sequence/scheme
– LastName FirstName1, e.g., Chu Hsi (朱熹)
– FirstName1 LastName, e.g., Hsi Chu (朱熹)
– LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
– FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
– Any order, e.g., Tao Ning Hsu (許道寧)
– Any transliteration, e.g., Ju Shi (朱熹)
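One way to make the lookup insensitive to the naming sequences listed above is to normalize both the query name and the table entries before comparing them. The sketch below lowercases, splits on hyphens and spaces, and sorts the syllables, then ranks table entries with `difflib.SequenceMatcher` from the Python standard library. The slides do not say how similarity is measured, so this character-level ratio is a stand-in, not the measure used in NPDM; the two-entry table is taken from the examples above.

```python
from difflib import SequenceMatcher

# Canonical (Wade-Giles) name -> Chinese name; entries from the examples above.
NAME_TABLE = {
    "Chu Hsi": "朱熹",
    "Hsu Tao-ning": "許道寧",
}


def normalize(name):
    """Order-insensitive key: lowercase syllables, hyphens split, sorted."""
    return " ".join(sorted(name.lower().replace("-", " ").split()))


def most_similar_name(query_name, table=NAME_TABLE):
    """Return (chinese_name, score) for the table entry closest to query_name."""
    key = normalize(query_name)

    def score(canonical):
        return SequenceMatcher(None, key, normalize(canonical)).ratio()

    best = max(table, key=score)
    return table[best], score(best)


print(most_similar_name("Hsi Chu"))       # surname and given name swapped
print(most_similar_name("Tao Ning Hsu"))  # any order, no hyphen
```

Reordered or unhyphenated variants normalize to the same key as the canonical entry and therefore score 1.0. A variant transliteration such as "Ju Shi" only gets a partial character-level score, so handling that case reliably would need a phonetic comparison, which this sketch does not attempt.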

Hsin-Hsi Chen 62

Title

“谿山行旅圖” (“Travelers among Mountains and Streams”)

"travelers", "mountains", and "streams" are basic components

Users can express their information need by describing a desired artwork

The system measures the similarity between art titles (descriptions) and a query

Hsin-Hsi Chen 63

Free Search

A query is composed of several concepts.
Concepts are either transliterated or translated.
Query translation is similar to a small-scale IR system.

Resources
– Name-mapping table
– Title-mapping table
– Specific English-Chinese Dictionary
– Generic English-Chinese Dictionary
– …

Hsin-Hsi Chen 64

Algorithm

(1) For each resource, the Chinese translations whose scores are larger than a specific threshold are selected.
(2) The Chinese translations identified from different resources are merged and sorted by their scores.
(3) Consider the Chinese translation with the highest score in the sorted sequence.
– If the intersection of the corresponding English description and the query is not empty, select the translation and delete from the query the English terms it shares with the English description.
– Otherwise, skip the Chinese translation.

Hsin-Hsi Chen 65

Algorithm (Continued)

(4) Repeat step (3) until query is empty or all the Chinese translations in the sorting sequence are considered.

(5) If the query is not empty, the remaining words are looked up in the general dictionary. The Chinese query is composed of all the translated results.
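Steps (1) to (5) can be sketched as a greedy selection loop. The slides do not specify how candidate scores are computed, so the sketch takes them as given. The data layout (each resource is a list of `(chinese, english_description_terms, score)` tuples), the threshold value, the toy entries, and the decision to drop words missing from the general dictionary are all assumptions made for illustration.

```python
def translate_query(query_terms, resources, general_dictionary, threshold=0.5):
    """Greedy English-to-Chinese query translation following steps (1)-(5)."""
    remaining = list(query_terms)

    # Steps (1)-(2): keep candidates above the threshold, merge, sort by score.
    candidates = [c for resource in resources for c in resource if c[2] > threshold]
    candidates.sort(key=lambda c: c[2], reverse=True)

    chinese_query = []
    # Steps (3)-(4): walk the sorted candidates until the query is used up.
    for chinese, english_description, _score in candidates:
        if not remaining:
            break
        common = set(english_description) & set(remaining)
        if common:
            chinese_query.append(chinese)
            # Remove the English terms covered by this translation.
            remaining = [t for t in remaining if t not in common]
        # Otherwise the candidate is skipped.

    # Step (5): look up leftover words in the general dictionary
    # (words missing from it are dropped in this sketch).
    for term in remaining:
        if term in general_dictionary:
            chinese_query.append(general_dictionary[term])
    return chinese_query


# Hypothetical resources with made-up scores.
name_table = [("范寬", ["fan", "kuan"], 0.9)]
title_table = [
    ("谿山行旅圖", ["travelers", "among", "mountains", "and", "streams"], 0.8),
    ("low-scoring candidate", ["mountains"], 0.3),
]
general = {"painting": "繪畫"}

query = ["fan", "kuan", "travelers", "mountains", "painting"]
print(translate_query(query, [name_table, title_table], general))
```

In this toy run the name entry consumes "fan" and "kuan", the title entry consumes "travelers" and "mountains", and the leftover "painting" is resolved through the general dictionary, giving a three-term Chinese query.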

Hsin-Hsi Chen 66

Summary

Users can choose English input and Chinese output when they can read Chinese but are unfamiliar with Chinese input methods or lack a Chinese input device.

Images and videos remain accessible to users who cannot read or write Chinese.

The integration of different knowledge sources will be studied in the future.
