
Introduction to Cross-Language Information Retrieval (跨語言資訊檢索導論)

Hsin-Hsi Chen (陳信希), Department of Computer Science and Information Engineering, National Taiwan University

Outline

Multilingual Environments

What is Cross-Language Information Retrieval?

Major Problems in CLIR

Major Approaches in CLIR

Case Study: CLIR in NPDM

Summary

Multilingual Collections

There are 6,703 languages listed in the Ethnologue.

Digital libraries
– The OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and holds over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages.

World Wide Web
– Around 40% of Internet users do not speak English; however, 80% of Web sites are still in English.

Language Speakers in the Real World (http://www.g11n.com/faq.htm)

[Figure: bar chart of speakers (millions, axis from 0 to 800) by language: Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese.]

[Figure: chart by language. Labels: Spanish, German, Japanese, French, Chinese, Dutch, Portuguese, Italian, Swedish, Korean. (Statistics from Euro-Marketing Associates, 1998)]

[Figure: chart from http://www.glreach.com/globstats/ (Statistics from Euro-Marketing Associates, 1999)]

Share of Chinese speakers (6.1%) < share of French speakers (8.8%) (1998)

Language Populations on the Internet

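Internet Content

[Figure: bar chart of Internet hosts (thousands; axis ticks at 100, 1,000, 10,000 and 100,000) by language, estimated by domain. Source: Network Wizards Jan 99 Internet Domain Survey. Languages shown: English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish. Values printed on the chart: 33,878; 1,687; 1,684; 654; 546; 473; 458; 432; 546.]

40% of Internet users do not understand English, but 80% of Internet content is in English.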

[Figure] (Source: http://www.emarketer.com)

What is Cross-Language Information Retrieval?

Definition: Select information in one language based on queries in another.

Terminologies
– Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval)
– Translingual Information Retrieval (Defense Advanced Research Projects Agency, DARPA)

Generalization: Multi- & Cross- Lingual Information Access

MLIR Applications

Multilingual information access in a multilingual country, organization, enterprise, etc.

Cross-language information retrieval for users who read a second language (large passive vocabulary) but are not able to formulate good queries in it (small active vocabulary).

Monolingual users may retrieve images by taking advantage of multilingual captions.

Monolingual users may retrieve documents and have them translated (automatically or manually) into their language.

Why is Cross-Language Information Retrieval Important?

More information workers with less time require fast access to global resources

global B2B interactions (virtual enterprises)

global B2C interactions (online trading, travelling)

time-critical information (translation comes too late)

History

1970 Salton runs retrieval experiments with a small English/German dictionary

1972 Pevzner shows for English and Russian that a controlled thesaurus can be used effectively for query term translation

1978 ISO Standard 5964 for developing multilingual thesauri (revised in 1985)

1990 Latent Semantic Indexing (LSI) applied to CLIR

History (Continued)

1994 1st PhD thesis on CLIR by Khaled Radwan

1996 Similarity thesaurus applied to CLIR (ETH Zurich)

1996 Dictionary-based retrieval applied to CLIR (UMass & Xerox Grenoble)

1997 Generalized Vector Space Model (GVSM) applied to CLIR (CMU)

History (Continued)

1997 CLIR (Cross-Language Information Retrieval) track starts within TREC

1998 NTCIR starts in Japan

1999 TIDES (Translingual Information Detection, Extraction, and Summarization) starts in the U.S.

2000 CLEF starts in Europe

An Architecture of Multilingual Information Access

[Figure: architecture diagram. Component labels: User Interface (UI); Language Identification (LI); Language Translation (Query Translation, Document Translation); Text Processing (Information Extraction, Information Filtering, Information Retrieval, Text Summarization, Text Classification); Multilingual Resources; Multiple Languages; Native Language(s).]

Major Problems of CLIR

Queries and documents are in different languages.
– translation

Words in a query may be ambiguous.
– disambiguation

Queries are usually short.
– expansion

Major Problems of CLIR (Continued)

Queries may have to be segmented.
– segmentation

A document may be written in several languages.
– language identification

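Enhancing Traditional Information Retrieval Systems

Which part(s) should be modified for CLIR?

[Figure: IR pipeline. Documents → Document Representation and Queries → Query Representation, with both representations feeding Comparison. Numbered modification points: (1) Documents, (2) Document Representation, (3) Queries, (4) Query Representation.]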

Enhancing Traditional Information Retrieval Systems (Continued)

(1): text translation

(2): vector translation

(3): query translation

(4): term vector translation

(1) and (2), (3) and (4): interlingual form

What are the Problems?

Ambiguous terms (e.g., performance)

Multiword phrases may correspond to single-word phrases (e.g., South Africa => 南非, Südafrika)

Coverage of the vocabulary

There is not a one-to-one mapping between two languages

Translating queries automatically (lack of syntax)

Translating documents automatically (performance, …)

Computing mixed result lists

Cross-Language Information Retrieval

[Figure: two taxonomy trees of Cross-Language Information Retrieval.
Tree 1 labels: Controlled Vocabulary; Free Text; Knowledge-based (Thesaurus-based, Ontology-based, Dictionary-based); Corpus-based (Term-aligned, Sentence-aligned, Document-aligned, Unaligned; Parallel, Comparable); Hybrid.
Tree 2 labels: Query Translation; Document Translation (Text Translation, Vector Translation); No Translation.]

Query Translation Based CLIR

[Figure: English Query → Translation Device → Chinese Query → Monolingual Chinese Retrieval System → Retrieved Chinese Documents]

Translating the 400 Million non-English Pages of the WWW

... would take 100,000 days (roughly 300 years) on one fast PC, or 1 month on 3,600 PCs.
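
The arithmetic behind these figures can be checked with a short sketch. The per-PC throughput is not stated on the slide; it is derived here from the 400 million pages and 100,000 days quoted above, so treat it as an implied value rather than a measured one.

```python
# Back-of-the-envelope check of the slide's figures.
PAGES = 400_000_000      # non-English Web pages (from the slide)
DAYS_ONE_PC = 100_000    # translation time on one fast PC (from the slide)
NUM_PCS = 3_600          # size of the PC pool (from the slide)

# Implied throughput of a single PC (derived, not stated on the slide).
pages_per_pc_per_day = PAGES / DAYS_ONE_PC
seconds_per_page = 24 * 60 * 60 / pages_per_pc_per_day

years_one_pc = DAYS_ONE_PC / 365
days_on_pool = DAYS_ONE_PC / NUM_PCS   # assumes perfect parallel speed-up

print(f"implied throughput: {pages_per_pc_per_day:.0f} pages per PC per day")
print(f"implied cost: {seconds_per_page:.1f} s per page")
print(f"one PC: {years_one_pc:.0f} years")
print(f"{NUM_PCS} PCs: {days_on_pool:.1f} days")
```

Under these assumptions one PC needs about 274 years, which the slide rounds to 300, and the pool of 3,600 PCs needs just under 28 days.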

Knowledge-Based

Examples
– Subject Thesaurus
• Hierarchical and associative relations.
• Unique term assigned to each node.
– Concept List
• Term space partitioned into concept spaces.
– Term List
• List of cross-language synonyms.
– Lexicon
• Machine-readable syntax and/or semantics.

Ontology-Based Approaches

Exploit complex knowledge representations, e.g., EuroWordNet

A Proposal for Conceptual Indexing using EuroWordNet

Dictionary-Based Approaches

Exploit machine-readable dictionaries.

Problems
– translation ambiguity + target polysemy
– coverage (unknown words, abbreviations, ...)

Dictionary-Based Approaches (Continued)

Issue 1: selection strategy
– Select all.
– Select N randomly.
– Select best N.

Issue 2: which level
– word
– phrase

Selection Strategy: Select All

Hull and Grefenstette 1996
– Take the concatenation of all term translations.

E: politically motivated civil disturbances
F: troubles civils a caractere politique
trouble - turmoil, discord, trouble, unrest, disturbance, disorder
civil - civil, civilian, courteous
caractere - character, nature
politique - political, diplomatic, politician, policy

– Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8% of the original.
– Errors: multi-word expressions and ambiguity
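
The select-all strategy can be sketched in a few lines. The dictionary below is copied from the example above; the function names and the plural-stripping helper are hypothetical stand-ins for the morphological analysis a real system would need, not the authors' implementation.

```python
# Toy French-English dictionary copied from the slide's example.
FR_EN = {
    "trouble": ["turmoil", "discord", "trouble", "unrest", "disturbance", "disorder"],
    "civil": ["civil", "civilian", "courteous"],
    "caractere": ["character", "nature"],
    "politique": ["political", "diplomatic", "politician", "policy"],
}

def naive_stem(word):
    """Hypothetical stand-in for morphological analysis: strip a plural 's'."""
    return word[:-1] if word.endswith("s") and word[:-1] in FR_EN else word

def translate_select_all(query):
    """Replace every source term by the concatenation of all its translations."""
    target_terms = []
    for token in query.lower().split():
        # Terms missing from the dictionary (here the word "a") are dropped.
        target_terms.extend(FR_EN.get(naive_stem(token), []))
    return target_terms

english_query = translate_select_all("troubles civils a caractere politique")
print(english_query)
```

The five-word French query turns into fifteen English terms, including unrelated ones such as "courteous" and "politician", which illustrates the ambiguity error noted above.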

Selection Strategy: Select All (Continued)

Davis 1997 (TREC5)
– Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
– Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12% of monolingual
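
The percentages on these slides express cross-language average precision relative to the monolingual run. A minimal sketch of that ratio, using only the figures quoted on the slides (the function name is an illustrative choice):

```python
def percent_of_baseline(score, baseline):
    """Cross-language average precision as a percentage of the monolingual one."""
    return 100.0 * score / baseline

# Average-precision pairs quoted on the slides: (cross-language, monolingual).
runs = {
    "Hull and Grefenstette 1996, select all": (0.235, 0.393),
    "Davis 1997, all-equivalent substitution": (0.1422, 0.2895),
}
for name, (cross, mono) in runs.items():
    print(f"{name}: {percent_of_baseline(cross, mono):.2f}% of monolingual")
```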

Evaluation Method

Average Precision (5-, 9-, 11-points)

Model

[Figure: three retrieval pipelines over the TREC Spanish Corpus.
– Spanish Query → Mono IR Engine → TREC Spanish Corpus
– English Query → Bilingual Dictionary → Spanish Equivalents → Mono IR Engine → TREC Spanish Corpus
– English Query → POS + Bilingual Dictionary → Spanish Equivalents by POS → Mono IR Engine → TREC Spanish Corpus]
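
For readers unfamiliar with the measure, here is a minimal sketch of N-point interpolated average precision for a single query. The recall levels used for the 5- and 9-point variants are not given on the slide; evenly spaced levels are assumed here, and the function name and arguments are illustrative.

```python
def interpolated_average_precision(ranking, relevant, points=11):
    """N-point interpolated average precision for one query.

    ranking  -- list of document ids in ranked order
    relevant -- set of ids judged relevant
    points   -- number of evenly spaced recall levels (11 -> 0.0, 0.1, ..., 1.0)
    """
    if not relevant:
        return 0.0
    # (recall, precision) observed after each relevant document in the ranking.
    observed = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            observed.append((hits / len(relevant), hits / rank))
    total = 0.0
    for i in range(points):
        level = i / (points - 1)
        # Interpolated precision: best precision at any recall >= level.
        candidates = [p for r, p in observed if r >= level - 1e-12]
        total += max(candidates, default=0.0)
    return total / points

score = interpolated_average_precision(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3"})
print(round(score, 4))
```

In the usage line, the two relevant documents sit at ranks 1 and 3, so precision is 1.0 up to recall 0.5 and 2/3 beyond it. Averaging over a set of queries gives the figure reported for a run.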

Selection Strategy: Select N

Simple word-by-word translation
– Each query term is replaced by the word or group of words given for the first sense of the term’s definition.
• 50-60% drop in performance (average precision)

Selection Strategy: Select N (Continued)

word/phrase translation
– Take at most three translations of each word, one from each of the first three senses. Take the phrase translation if the phrase appears in the dictionary.
• 30-50% worse than good translation
– Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
• WBW (0.0244) vs. phrasal (0.0148, -39.3%) vs. good phrasal (0.0610, +150.3%)
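
A minimal sketch of the word/phrase strategy described above. The sense-structured dictionary is hypothetical and made up for illustration (only the South Africa => Südafrika pair appears elsewhere in these slides), and phrases are limited to two words to keep the example short.

```python
# Hypothetical toy dictionary: each entry maps to a list of senses,
# and each sense is a list of target-language translations.
DICTIONARY = {
    "south africa": [["Südafrika"]],
    "south": [["Süden"], ["südlich"]],
    "africa": [["Afrika"]],
    "bank": [["Bank"], ["Ufer"], ["Damm"], ["Reihe"]],
}

def translate_select_n(query, max_senses=3):
    """Take one translation from each of the first max_senses senses,
    preferring a phrase entry over word-by-word lookup."""
    tokens = query.lower().split()
    output, i = [], 0
    while i < len(tokens):
        phrase = " ".join(tokens[i:i + 2])
        if len(tokens) - i >= 2 and phrase in DICTIONARY:
            key, step = phrase, 2      # phrase translation found
        else:
            key, step = tokens[i], 1   # fall back to the single word
        senses = DICTIONARY.get(key, [])
        output.extend(sense[0] for sense in senses[:max_senses])
        i += step
    return output

print(translate_select_n("bank south africa"))
```

The fourth sense of "bank" is cut off by the limit of three, and "south africa" is translated as a unit instead of as two separate words.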

Selection Strategy: Select Best N

Hayashi, Kikui and Susaki 1997
– Search for a dictionary entry corresponding to the longest sequence of words from left to right.
– Choose the most frequently used word (or phrase) in a text corpus collected from the WWW.
– No results are reported for this query translation approach.

Davis 1997 (TREC5)
– POS disambiguation
– Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): about 67.3% of monolingual
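
The longest-match lookup with frequency-based selection can be sketched as follows. The dictionary entries and corpus counts are hypothetical and chosen for illustration; this is a sketch of the general idea, not the cited system.

```python
# Hypothetical English-Chinese dictionary and target-corpus counts.
DICTIONARY = {
    "information retrieval": ["資訊檢索"],
    "information": ["資訊", "消息"],
    "retrieval": ["檢索", "取回"],
    "system": ["系統", "體系"],
}
CORPUS_FREQ = {"資訊檢索": 120, "資訊": 900, "消息": 300,
               "檢索": 700, "取回": 80, "系統": 1500, "體系": 60}

def translate_best(query, max_len=3):
    """Greedy left-to-right longest match; keep the most frequent candidate."""
    tokens = query.lower().split()
    output, i = [], 0
    while i < len(tokens):
        # Try the longest word sequence starting at position i first.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidates = DICTIONARY.get(" ".join(tokens[i:i + length]))
            if candidates:
                output.append(max(candidates, key=lambda t: CORPUS_FREQ.get(t, 0)))
                i += length
                break
        else:
            i += 1  # unknown word: skip it
    return output

print(translate_best("information retrieval system"))
```

Because the phrase entry is matched before its component words, "information retrieval" yields one translation rather than two independently chosen ones.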

Corpus-Based Approaches

Categorization
– Term-Aligned
– Sentence-Aligned
– Document-Aligned (Parallel, Comparable)
– Unaligned

Usage
– Setup Thesaurus
– Vector Mapping

Term-Aligned Corpora

Fine-grained alignment in parallel corpora

Oard 1996
– Term alignment is a challenging problem.

[Figure: Parallel Bilingual Corpus → Co-occurrence Statistics → Translation Tables → Machine Translation System; English Query → Machine Translation System → Spanish Query]

Sentence-Aligned Corpora

Davis & Dunning 1996 (TREC4)
– High-frequency Terms

Brief Summary

dictionary-based methods
– Specialized vocabulary not in the dictionaries will not be translated.
– Ambiguities will add extraneous terms to the query.

parallel/comparable corpora-based methods
– Parallel corpora are not always available.
– Available corpora tend to be relatively small or to cover only a small number of subjects.
– Performance is dependent on how well the corpora are aligned.

Brief Summary (Continued)

Dictionaries are very useful.
– Achieve about 50% of monolingual effectiveness on their own

Parallel corpora have limitations.
– Domain shifts
– Term alignment accuracy

Dictionaries and corpora are complementary.
– Dictionaries provide broad and shallow coverage.
– Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.

Hybrid Methods

What knowledge can be employed?
– lexical knowledge
– corpus knowledge
– ...

Hybrid Methods (Continued)

Query Expansion
– Issue 1: context
• pseudo relevance feedback (local feedback): A query is modified by the addition of terms found in the top retrieved documents.
• local context analysis: Queries are expanded by the addition of the top-ranked concepts from the top passages.
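
Pseudo relevance feedback can be sketched as follows. This is a deliberately simplified version that ranks expansion terms by how many of the top documents contain them; practical systems usually weight terms more carefully. The function name, parameters and toy documents are assumptions for illustration, and local context analysis (which works on passages and concepts) is not covered.

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, top_docs=2, new_terms=2):
    """Expand a query with frequent terms from the top-ranked documents.

    ranked_docs -- token lists, already ranked by an initial retrieval run
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        # Count each term once per document, ignoring the query's own terms.
        counts.update(set(doc) - set(query_terms))
    # Sort by count, then alphabetically, so that ties break deterministically.
    ranked_terms = sorted(counts, key=lambda t: (-counts[t], t))
    return list(query_terms) + ranked_terms[:new_terms]

docs = [
    ["museum", "painting", "sung", "dynasty"],
    ["museum", "painting", "enamel"],
    ["car", "engine"],
]
expanded = pseudo_relevance_feedback(["museum"], docs)
print(expanded)
```

Only the top two documents are treated as relevant, so terms from the third document never enter the expanded query.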

Hybrid Methods (Continued)

– Issue 2: when
• before query translation
• after query translation

Hybrid Methods (Continued)

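Ballesteros & Croft 1997

[Figure: experimental setup. Original Spanish TREC Queries → human translation → English (BASE) Queries. The English queries are turned back into Spanish Queries by automatic dictionary translation, with query expansion applied before translation (on the English queries), after translation (on the Spanish queries), or both. The resulting Spanish queries are run on INQUERY.]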

Hybrid Methods (Continued)

– Performance Evaluation (MRD: machine-readable dictionary translation alone; LF: local feedback; LCA: local context analysis)
• pre-translation: MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1139, +38.5%)
• post-translation: MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%)
• combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.1358, +65.0%)
• 32% below a monolingual baseline

Cross-Language Evaluation Forum

A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)

Extension of CLIR track at TREC (1997-1999)

Main Goals

Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
– CLIR system evaluation, testing and tuning
– Comparison and discussion of results

CLEF 2000 Task Description

Four evaluation tracks in CLEF 2000
– multilingual information retrieval
– bilingual information retrieval
– monolingual (non-English) information retrieval
– domain-specific IR

Case Study: CLIR for NPDM

3M in Digital Libraries/Museums

Multi-media
– Selecting suitable media to represent contents

Multi-linguality
– Decreasing the language barriers

Multi-culture
– Integrating multiple cultures

NPDM Project

The Palace Museum, Taipei, is one of the famous museums in the world.

NSC supports a pioneering study of a digital museum project, NPDM, starting in 2000.
– Enamels from the Ming and Ch’ing Dynasties
– Famous Album Leaves of the Sung Dynasty
– Illustrations in Buddhist Scriptures with Relative Drawings

Design Issues

Standardization
– A standard metadata protocol is indispensable for the interchange of resources with other museums.

Multimedia
– A suitable presentation scheme is required.

Internationalization
– to share the valuable resources of NPDM with users of different languages
– to utilize knowledge presented in a foreign language

Translingual Issue

CLIR
– to allow users to issue queries in one language to access documents in another language
– the query language is English and the document language is Chinese

Two common approaches
– Query translation
– Document translation

Resources in NPDM pilot

an enamel, a calligraphy, a painting, or an illustration

MICI-DC
– Metadata Interchange for Chinese Information
– Accessible fields to users
• Short descriptions vs. full texts
• Bilingual versions vs. Chinese only
– Fields for maintenance only

Search Modes

Free search
– users describe their information need using natural languages (Chinese or English)

Specific topic search
– users fill in specific fields denoting authors, titles, dates, and so on

Example

Information need
– Retrieve “Travelers Among Mountains and Streams, Fan K’uan” (“范寬谿山行旅圖”)

Possible queries
– Author: Fan Kuan; Kuan, Fan
– Time: Sung Dynasty
– Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
– Free search: landscape painting; travelers, huge mountain, Nature; scenery; Shensi province

ECIR in NPDM

[Figure: system diagram. Labels: English Query; Query Disambiguation; Query Translation; Chinese Query; Name Search (English Names, Chinese Names, Machine Transliteration); Title Search (English Titles, Chinese Titles, Document Translation); Specific Bilingual Dictionary; Generic Bilingual Dictionary; Chinese IR System; NPDM Collection; Results.]

Specific Topic Search

Proper names are important query terms
– Creators such as “林逋” (Lin P’u), “李建中” (Li Chien-chung), “歐陽脩” (Ou-yang Hsiu), etc.
– Emperors such as “康熙” (K'ang-hsi), “乾隆” (Ch'ien-lung), “徽宗” (Hui-tsung), etc.
– Dynasties such as “宋” (Sung), “明” (Ming), “清” (Ch’ing), etc.

Name Transliteration

The writing systems of Chinese and English are totally different

Wade-Giles (WG) and Pinyin are two well-known systems used in libraries to romanize Chinese

Backward transliteration
– Transliterate target language terms back to source language ones
– Chen, Huang, and Tsai (COLING, 1998)
– Lin and Chen (ROCLING, 2000)

Name Mapping Table

Divide a name into a sequence of Chinese characters, and transform each character into phonemes

Look up phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name

Example
– “林逋” → “ㄌㄧㄣ ㄆㄨ” → “Lin P’u” (WG)
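
The two lookups can be sketched with a pair of small tables. The entries below are copied from the example above; the table names, the one-character-surname assumption and the hyphenation of given names are illustrative choices, not the project's actual data structures. The code uses a plain ASCII apostrophe in the romanized form.

```python
# Tiny illustrative tables; a real system would cover the full character set.
CHAR_TO_ZHUYIN = {"林": "ㄌㄧㄣ", "逋": "ㄆㄨ"}       # character -> phonemes
ZHUYIN_TO_WG = {"ㄌㄧㄣ": "lin", "ㄆㄨ": "p'u"}       # phonemes -> Wade-Giles

def canonical_wade_giles(name):
    """Derive a canonical Wade-Giles form for a Chinese personal name."""
    syllables = [ZHUYIN_TO_WG[CHAR_TO_ZHUYIN[ch]] for ch in name]
    surname, given = syllables[0], syllables[1:]
    # Assumption: one-character surname; given-name syllables joined by hyphens.
    parts = [surname.capitalize()]
    if given:
        parts.append("-".join(given).capitalize())
    return " ".join(parts)

print(canonical_wade_giles("林逋"))
```

A Pinyin form would come from swapping in a phoneme-to-Pinyin table. Two-character surnames such as 歐陽 would need a surname list, which this sketch omits.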

Name Similarity

Extract the named entity from the query

Select the most similar named entity from the name mapping table

Naming sequence/scheme
– LastName FirstName1, e.g., Chu Hsi (朱熹)
– FirstName1 LastName, e.g., Hsi Chu (朱熹)
– LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
– FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
– Any order, e.g., Tao Ning Hsu (許道寧)
– Any transliteration, e.g., Ju Shi (朱熹)
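
One way to make the lookup tolerant of the naming schemes above is to compare order- and hyphen-insensitive keys with a string similarity. The slides do not say which similarity measure the system uses; the sketch below substitutes Python's standard `difflib.SequenceMatcher`, and the small name table is illustrative.

```python
from difflib import SequenceMatcher

def name_key(name):
    """Order- and hyphen-insensitive key: lowercase syllables, sorted."""
    syllables = name.lower().replace("-", " ").split()
    return " ".join(sorted(syllables))

def name_similarity(a, b):
    """Character-level similarity of the two normalized keys, in [0, 1]."""
    return SequenceMatcher(None, name_key(a), name_key(b)).ratio()

def most_similar(query_name, table):
    """table maps canonical romanized names to Chinese names."""
    best = max(table, key=lambda canonical: name_similarity(query_name, canonical))
    return best, table[best]

NAME_TABLE = {"Chu Hsi": "朱熹", "Hsu Tao-ning": "許道寧", "Lin P'u": "林逋"}

print(most_similar("Tao Ning Hsu", NAME_TABLE))
print(most_similar("Ju Shi", NAME_TABLE))
```

Reordered and unhyphenated forms map to exactly the same key as the canonical entry, while a spelling from a different romanization such as Ju Shi only matches approximately. A production system would likely compare at the phoneme level instead.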

Title

“谿山行旅圖” (“Travelers among Mountains and Streams”)

"travelers", "mountains", and "streams" are basic components

Users can express their information need by describing the desired artwork

The system measures the similarity between artwork titles (descriptions) and a query
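
A title-query similarity of this kind can be sketched with a token-overlap measure. The slides do not name the measure used; the Dice coefficient below is an assumption, the plural stripping is a crude placeholder for stemming, and the second title is a made-up distractor.

```python
def tokens(text):
    """Lowercase, strip punctuation, and crudely remove a plural 's'."""
    words = "".join(c if c.isalnum() else " " for c in text.lower()).split()
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words}

def dice(query, title):
    """Dice coefficient between the token sets of a query and a title."""
    q, t = tokens(query), tokens(title)
    return 2 * len(q & t) / (len(q) + len(t)) if q and t else 0.0

TITLES = [
    "Travelers among Mountains and Streams",
    "Hypothetical Title about Bamboo and Rocks",  # made-up distractor
]

def best_title(query):
    return max(TITLES, key=lambda title: dice(query, title))

print(dice("Travel among mountains", TITLES[0]))
print(best_title("Mountain and stream painting"))
```

Note that "travel" and "travelers" do not match under this crude normalization, which suggests why real stemming or a title-mapping table would be needed.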

Free Search

A query is composed of several concepts.

Concepts are either transliterated or translated.

Query translation is similar to a small-scale IR system.

Resources
– Name-mapping table
– Title-mapping table
– Specific English-Chinese Dictionary
– Generic English-Chinese Dictionary
– …

Algorithm

(1) For each resource, the Chinese translations whose scores are larger than a specific threshold are selected.

(2) The Chinese translations identified from different resources are merged and sorted by their scores.

(3) Consider the Chinese translation with the highest score in the sorted sequence.
– If the intersection of the corresponding English description and the query is not empty, then select the translation and delete from the query the English terms it shares with the English description.
– Otherwise, skip the Chinese translation.

Algorithm (Continued)

(4) Repeat step (3) until query is empty or all the Chinese translations in the sorting sequence are considered.

(5) If the query is not empty, then these words are looked up from the general dictionary. A Chinese query is composed of all the translated results.
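
Steps (1) to (5) can be sketched as a greedy loop. The scores are assumed to be precomputed, since the slides do not say how a resource entry is scored against the query; the function name, the data layout, the threshold value and the sample entries are all hypothetical.

```python
def translate_query(query_terms, resources, general_dictionary, threshold=0.5):
    """Greedy sketch of steps (1)-(5).

    resources -- list of resources; each is a list of
                 (english_description, chinese_translation, score) entries
                 whose scores against the query are already computed
    """
    remaining = set(query_terms)
    # Steps (1)-(2): filter by threshold, merge, sort by descending score.
    candidates = sorted(
        (entry for resource in resources for entry in resource
         if entry[2] > threshold),
        key=lambda entry: entry[2],
        reverse=True,
    )
    chinese_query = []
    # Steps (3)-(4): walk down the sorted list until the query is used up.
    for description, chinese, _score in candidates:
        if not remaining:
            break
        common = remaining & set(description.lower().split())
        if common:
            chinese_query.append(chinese)
            remaining -= common
        # Otherwise the translation is skipped.
    # Step (5): leftover words are looked up in the general dictionary.
    for word in query_terms:
        if word in remaining and word in general_dictionary:
            chinese_query.append(general_dictionary[word])
    return chinese_query

names = [("fan kuan", "范寬", 0.9)]
titles = [("travelers among mountains and streams", "谿山行旅圖", 0.3)]
specific = [("landscape painting", "山水畫", 0.8)]
general = {"scenery": "風景"}

print(translate_query(["fan", "kuan", "landscape", "painting"],
                      [names, titles, specific], general))
print(translate_query(["fan", "kuan", "scenery"],
                      [names, titles, specific], general))
```

In the second call "scenery" matches no resource entry above the threshold, so it falls through to the general dictionary in step (5).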

Summary

Users who can read Chinese but are unfamiliar with Chinese input methods, or lack a Chinese input device, can select English input and Chinese output.

Images and videos remain accessible to users who cannot read or write Chinese.

The integration of different knowledge sources will be studied in the future.