Hsin-Hsi Chen 9-1
Chinese Language Retrieval
Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
Chinese Text Retrieval without Using a Dictionary (Chen et al, SIGIR97)
• Segmentation
  – Break a string of characters into words.
• Chinese characters and words
  – Most Chinese words consist of two characters (趙元任).
  – 26.7% unigrams, 69.8% bigrams, 2.7% trigrams (北京, 現代漢語頻率辭典)
  – 5% unigrams, 75% bigrams, 14% trigrams, 6% others (Liu)
• Word segmentation
  – Statistical methods, e.g., mutual information statistics
  – Rule-based methods, e.g., morphological rules, longest-match rules, ...
  – Hybrid methods
Indexing Techniques
• Unigram indexing
  – Break a sequence of Chinese characters into individual characters.
  – Regard each individual character as an indexing unit.
  – GB 2312-80: 6763 characters
• Bigram indexing
  – Regard all adjacent pairs of hanzi characters in the text as indexing terms.
• Trigram indexing
  – Regard all consecutive sequences of three hanzi characters as indexing terms.
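The three schemes can be sketched with one helper (a minimal illustration; the function name and the sample string are hypothetical):

```python
def ngram_index_terms(text, n):
    """All overlapping n-grams of a hanzi string: n=1 gives unigram
    indexing, n=2 bigram indexing, n=3 trigram indexing."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A four-character string yields three overlapping bigrams:
terms = ngram_index_terms("中文檢索", 2)   # ['中文', '文檢', '檢索']
```

In practice the text would first be broken at punctuation so that n-grams do not span sentence boundaries.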
Examples
Indexing Techniques (Continued)
• Statistical indexing
  – Collect the occurrence frequency in the collection of every Chinese character occurring at least once in the collection.
  – Collect the occurrence frequency in the collection of every Chinese bigram occurring at least once in the collection.
  – Compute the mutual information of every Chinese bigram:
    I(x,y) = log2(p(x,y)/(p(x)*p(y))) = log2((f(x,y)/N)/((f(x)/N)*(f(y)/N))) = log2((f(x,y)*N)/(f(x)*f(y)))
  – Strongly related: much larger value
  – Not related: close to 0
  – Negatively related: negative
Strongly related (x and y always co-occur, so p(x,y) = p(x) = p(y)):
  I(x,y) = log2(p(x,y)/(p(x)*p(y))) = log2(p(x)/(p(x)*p(y))) = log2(1/p(y)) >> 0
Not related (x and y independent, so p(x|y) = p(x)):
  I(x,y) = log2(p(x,y)/(p(x)*p(y))) = log2(p(x|y)/p(x)) = log2(1) = 0
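A quick numeric check of the two limiting cases (the frequencies here are made up for illustration):

```python
import math

def mutual_information(f_xy, f_x, f_y, N):
    """I(x,y) = log2((f(x,y) * N) / (f(x) * f(y))), as in the formula above."""
    return math.log2((f_xy * N) / (f_x * f_y))

# Strongly related: x and y always co-occur, so f(x,y) = f(x) = f(y)
strong = mutual_information(10, 10, 10, 10_000)       # log2(N / f(y)) = log2(1000), about 9.97

# Not related: f(x,y)/N = (f(x)/N) * (f(y)/N) exactly
independent = mutual_information(2, 100, 200, 10_000)  # log2(1) = 0.0
```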
Indexing Techniques (Continued)
– Apply Richard’s algorithm to segment the text into words:
  • Compute the mutual information values of all adjacent bigrams in a phrase. (Only words of one or two characters are recognized.)
  • Treat the bigram with the largest mutual information value as a word and remove it from the phrase. The removal of the bigram may leave one or two shorter phrases.
  • Repeat the previous step on each of the shorter phrases until every phrase consists of one or two characters.
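The steps above can be sketched as follows (a simplified illustration, not the authors' implementation; the function and table names are hypothetical, and unseen bigrams are smoothed with frequency 1):

```python
import math

def segment_by_mi(phrase, f_uni, f_bi, N):
    """Greedy MI segmentation: repeatedly turn the bigram with the largest
    mutual information into a word, splitting the phrase around it, until
    only 1- and 2-character pieces remain. f_uni / f_bi map characters /
    bigrams to corpus frequencies; N is the corpus size."""
    def mi(bigram):
        return math.log2((f_bi.get(bigram, 1) * N)
                         / (f_uni[bigram[0]] * f_uni[bigram[1]]))

    words, stack = [], [phrase]
    while stack:
        p = stack.pop()
        if len(p) <= 2:
            if p:
                words.append(p)
            continue
        i = max(range(len(p) - 1), key=lambda j: mi(p[j:j + 2]))
        words.append(p[i:i + 2])        # the strongest bigram becomes a word
        stack.append(p[:i])             # its removal leaves one or
        stack.append(p[i + 2:])         # two shorter phrases
    return words
```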
f(c1): occurrence frequency of the first Chinese character of a bigram
f(c2): occurrence frequency of the second Chinese character
f(c1c2): occurrence frequency of the bigram
I(c1,c2): mutual information
  I(c1,c2) >> 0: c1 and c2 have a strong relationship
  I(c1,c2) ~ 0: c1 and c2 have no relationship
  I(c1,c2) << 0: c1 and c2 have a complementary relationship
Maximum Matching and Minimum Matching
• Maximum matching: group the longest initial sequence of characters that matches a dictionary entry as a word.
• Minimum matching: treat the shortest initial sequence of characters that matches a dictionary entry as a word.
• Forward: start from the beginning of the phrase; backward: start from the end of the phrase.
• Dictionary: 138,955 entries, including words, compounds, phrases, idioms, proper names, and so on.
• Unknown words: the longest initial sequence of characters that does not match any entry in the dictionary is treated as an unknown word.
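Forward maximum matching with the unknown-word rule can be sketched as follows (a simplified illustration; minimum matching would instead try the shortest match first, i.e., iterate n upward from 1):

```python
def forward_maximum_matching(text, dictionary, max_len=8):
    """At each position, group the longest initial sequence that matches a
    dictionary entry; a run of characters matching nothing is collected as
    one unknown word."""
    words, i, unknown = [], 0, ""
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                if unknown:              # flush the pending unknown word
                    words.append(unknown)
                    unknown = ""
                words.append(text[i:i + n])
                i += n
                break
        else:
            unknown += text[i]           # no entry matches here
            i += 1
    if unknown:
        words.append(unknown)
    return words
```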
The Test Collection
[table of collection sizes, in number of characters]
The Topics
• 28 topics of the TREC-5 Chinese track
• Topic number 14
  – Title: 中國的愛滋病歷
  – Description: 中國, 雲南, 愛滋病, HIV, 高危險群患者, 注射器, 病毒。
  – Narrative: 相關文件應當包括中國那些地區的病例最多, 愛滋病毒在中國是如何傳播的, 以及中國政府如何監視愛滋病毒並控制它的傳染。
• Short query: the title field only (中國的愛滋病歷)
• Long query: the full original topic (title, description, and narrative fields)
Construction of Manual Queries
• Iterative process
  – Do a trial run using the current query.
  – Examine the top-ranked documents and manually select the terms that seem promising.
  – Add the chosen terms to the current query and manually assign weights to the new terms to form a new query.
• Cost
  – On average, it takes 2.8 hours per topic.
*: terms selected from the test database manually
Dictionary and Stop List
• Stop list: 825 entries, including pronouns, determiners, prepositions, adverbs, conjunctions, and so on.
• Dictionary: 138,955 entries, including words, phrases, compounds, idioms, proper names, and so on. About 60,000 entries were manually selected from the TREC-5 Chinese collection and added to a dictionary of about 80,000 entries obtained from a website.
Evaluation
• For each query, the top 1,000 documents are retrieved from the Chinese collection of 164,789 documents. (0.61%)
• The retrieved documents are ranked in decreasing order by the probability of relevance.
• There are 2,182 relevant documents in total for all 28 queries.
• For all runs, 11-point recall-precision figures are calculated using the TREC evaluation program written by Chris Buckley.
• The base run uses the index file created from the text segmented using the forward maximum matching method.
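The 11-point figures can be illustrated with a small stand-in for the TREC evaluation program (this is not Buckley's code; the function name is hypothetical):

```python
def eleven_point_precision(ranked_relevant, total_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for one
    query. ranked_relevant lists the relevance judgments of the retrieved
    documents in rank order."""
    precisions, recalls, hits = [], [], 0
    for rank, rel in enumerate(ranked_relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
            recalls.append(hits / total_relevant)
    points = []
    for level in (i / 10 for i in range(11)):
        # interpolated precision: the best precision at any recall >= level
        ps = [p for p, r in zip(precisions, recalls) if r >= level]
        points.append(max(ps) if ps else 0.0)
    return points
```

Averaging each point over the 28 queries would give an 11-point curve like the ones tabulated in the following slides.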
The Long Queries (title, description, and narrative fields)
Automatically constructed long queries
The Long Queries (Continued)
Manually reformulated queries
[flattened table: baseline run, max(f) segmentation; 11-point precisions 0.8000, 0.6465, 0.5283, 0.4308, 0.3841, 0.3455, 0.2947, 0.2439, 0.1891, 0.1051, 0.0282; average precision 0.3558; 1910 relevant retrieved]
The manually reformulated queries are 27% better than the baseline.
Query terms (2-character words) vs. indexing terms (3-character words)
The Short Queries
Length: 12.5 characters (short queries) vs. 107.0 characters (long topic queries) vs. 249.7 characters (manually reformulated queries)
[flattened table: baseline run, max(f) segmentation; 11-point precisions 0.8000, 0.6465, 0.5283, 0.4308, 0.3841, 0.3455, 0.2947, 0.2439, 0.1891, 0.1051, 0.0282; average precision 0.3558; 1910 relevant retrieved; long-query run shown for comparison]
Discussion
• Unigram indexing
  – Ambiguity: 毒, 毒氣, 病毒, 毒品
  – Function words: 將, 的, 與, 在
  – The average precisions for the set of short queries (0.2770) and the set of manually reformulated queries (0.4203) are good compared with the results of the dictionary-based maximum matching (0.3558).
    • All function words are removed from the manually reformulated queries.
    • The number of single characters in the short queries is small.
• Bigram indexing
  – Most Chinese words consist of two characters.
  – Bigram indexing (0.3677 LA, 0.4522 LM, 0.2687 S) and mutual information segmentation (0.3744 LA, 0.4533 LM, 0.2849 S) outperform unigram indexing (0.2609 LA, 0.4203 LM, 0.2770 S).
    (LA: long & automatic, LM: long & manual, S: short)
Discussion (Continued)
• Trigram indexing
  – The words representing the key concepts in the topics are two-character words.
  – The number of unique trigrams in a corpus is much larger than the number of unique bigrams.
• Dictionary-based methods
  – Unknown words, including person names, place names, company names, transliterated names, names of new products, abbreviations of full names, and so on. Only 21 of 287 university and college names and 14 of 627 company names appear in the dictionary.
  – Identification and subsequent resolution of segmentation ambiguities.
  – For the set of short queries, minimum matching (0.2531 forward, 0.2465 backward) works better than maximum matching (0.2346 forward, 0.2250 backward).
  – For the set of long, manual queries, maximum matching (0.4519 forward, 0.4481 backward) outperforms minimum matching (0.3937 forward, 0.3904 backward).
• Bigram indexing and mutual-information-based segmentation outperform the popular dictionary-based maximum matching.
Comparing Representation in Chinese Information Retrieval (Kwok, SIGIR97)
• Representation Methods for Chinese Texts
  – 1-grams (single characters)
    • Punctuation signs are deleted.
    • Stop word removal is not performed.
    • 6763 characters (GB 2312) + English characters/words + … = 8093 index terms
  – Bigrams (two contiguous, overlapping characters)
    • 80% of modern Chinese words are bisyllabic (林語堂, 1972).
    • Problem: many meaningless character pairs are also produced.
  – Short-words
    • Segment texts into meaningful short-words of 1-4 characters.
Segmentation
• Step 1 (use facts)
  Look up a small (about 2,000 entries), manually created lexicon of commonly used short-words of one to three characters. Some 4-character names and proper nouns are also included.
• Step 2 (involve rules)
  Split the remaining chunks into short-words of 2 or 3 characters using manually determined rules, e.g., XX, AX, two by two.
• Step 3 (frequency filtering)
  A threshold on the frequency of occurrence in the corpus is used to extract the most common words from steps 1 and 2.
• Step 4 (iteration)
  Expand the initial lexicon of step 1 with the words discovered in step 3. Only one iteration is done in Kwok's experiment.
• No disambiguation is performed.
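Steps 1 and 2 can be sketched as follows (a rough illustration only; the real system's lexicon and rules are richer, and the frequency filtering and iteration of steps 3-4 are omitted):

```python
def kwok_segment(text, lexicon):
    """Greedy lookup of 1-3 character lexicon entries, then split each
    leftover chunk two characters at a time (the "two by two" rule)."""
    words, i = [], 0
    while i < len(text):
        for n in (3, 2, 1):                      # longest entry first
            if text[i:i + n] in lexicon:
                words.append(text[i:i + n])
                i += n
                break
        else:
            j = i                                # collect the out-of-lexicon chunk
            while j < len(text) and not any(
                    text[j:j + n] in lexicon for n in (3, 2, 1)):
                j += 1
            chunk = text[i:j]
            words.extend(chunk[k:k + 2] for k in range(0, len(chunk), 2))
            i = j
    return words
```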
The Chinese Retrieval Environment
• Documents
  – TREC-5: 24,988 Xinhua and 139,801 People's Daily news articles
  – Documents are divided into subdocuments of about 550 characters on a paragraph boundary.
  – Total number of subdocuments: 231,527
  – Index terms: 1-grams (8,093), 2-grams (1,482,172), short-words (494,288, about 1/3 of the 2-gram count)
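The subdocument split can be sketched as follows (an illustrative assumption about the packing rule, since the paper's exact procedure is not reproduced here):

```python
def split_subdocuments(paragraphs, target=550):
    """Pack paragraphs into subdocuments of roughly `target` characters,
    always cutting on a paragraph boundary."""
    subdocs, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > target:
            subdocs.append(current)
            current = ""
        current += para
    if current:
        subdocs.append(current)
    return subdocs
```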
The Chinese Retrieval Environment (Continued)
• Query Collection
  – 28 queries
  – Query 1
美國決定將中國大陸的人權狀況與其是否給予中共最惠國待遇分離最惠國待遇,中國,人權,經濟制裁,分離,脫鉤
相關文件必須提到美國為何將最惠國待遇與人權分離;相關文件也必須提到中共為什麼反對美國將人權與最惠國待遇相提並論。
segmented Chinese text美國 | 決定 | 將 | 中國 | 大陸 | 的 | 人權 | 狀況 | 與其 | 是否 | 給予 | 中共最 | 惠國 | 待遇 | 分離最 | 惠國 | 待遇 | , | 中國, | 人權, | 經濟 | 制裁, | 分離, | 脫鉤
相關文件 | 必須 | 提到 | 美國 | 為何 | 將 | 最 | 惠國 | 待遇 | 與 | 人權 | 分離 | ;相關文件 | 也 | 必須 | 提到 | 中共 | 為 | 什麼 | 反 | 對 | 美國 | 將 | 人權 | 與 | 最 | 惠國 | 待遇 | 相 | 提並 | 論。
The Chinese Retrieval Environment (Continued)
– Query 11
聯合國駐波斯尼亞維和部隊。波斯尼亞,前南斯拉夫,巴爾幹,聯合國,北約,武器禁運,維和,維持和平
相關文件必須包括聯合國和平部隊如何在戰火蹂躪的波斯尼亞進行維持和平的任務。
segmented Chinese text聯合國 | 駐波 | 斯 | 尼亞 | 維 | 和 | 部隊。| 波斯尼亞 | , | 前南 | 斯 | 拉夫, | 巴 | 爾 | 幹, | 聯合國, | 北 | 約, | 武器 | 禁運 | , | 維 | 和, | 維持 | 和平
相關文件 | 必須 | 包括 | 聯合國 | 和平 | 部隊 | 如何 | 在 | 戰火 | 蹂躪 | 的 | 波斯尼亞 | 進行 | 維持 | 和平 | 的 | 任務 | 。
Retrieval Based on 1-gram Indexing (Baseline Model)
• High-frequency threshold = 30k; low-frequency threshold = 3
  (the document-frequency thresholds delimit the mid-frequency range)
• Recall that the number of subdocuments is 231k.
• This is the best setting.
[figure residue: relative differences of 7%, 8.65%, 10.8%, 13%, 16% between settings]
Retrieval Based on 1-gram Indexing (using relevance feedback)
• Set thresholds = 30k/3 (the best setting from the baseline run)
• Number of feedback documents = 20; number of feedback terms = 40
[flattened table: baseline model vs. pseudo-relevance feedback; 1802 relevant retrieved; precision figures .324, .471, .438, .402, .352; improvements 6.4%, 18.6%, 16.8%, 1.9%, 13.2%, 6.7%]
Retrieval Based on 1-gram Indexing (using relevance feedback)
• Set thresholds = 30k/3; set the number of feedback documents = 20
• How many terms should be used for expansion? 40-60 terms give similar results.
• Summary: single characters, although highly ambiguous, give surprisingly good results.
Retrieval Based on 2-gram Indexing (Baseline Model)
• Thresholds = 30k/3
[flattened table: best 1-gram baseline result (1802 relevant retrieved; .324, .471, .438, .402, .352) vs. best 2-gram baseline result; 9% of the 231k subdocuments (document frequency); improvements 12.2%, 8.2%, 4.8%, 6.4%, 8%, 10.2%]
Retrieval Based on 2-gram Indexing (relevance feedback)
• Performance increases with the number of query expansion terms.
• Thresholds = 20k/3
[flattened table: baseline (2-gram) 2021 relevant retrieved, .345, .482, .464, .421, .385; 20 feedback documents / 50 terms; baseline (1-gram) 1919 relevant retrieved, .386, .543, .493, .456, .401]
• Summary: about 1.5 million character pairs are generated, yet bigrams do not lead to many noisy matches.
Retrieval Based on Short-word Indexing (baseline model)
• Thresholds = 20k/3
[flattened table: short-word indexing (baseline) 2021 relevant retrieved, .350, .493, .466, .435, .388 vs. 2-gram indexing (baseline); differences -3.8%, 10%, 10.8%, 3.9%, 5.1%, 3.7%]
Retrieval Based on Short-word Indexing (relevance feedback)
Fix the number of feedback documents at 40 and perform the 2nd-stage retrieval with various numbers of query expansion terms.
Retrieval Based on Short-word Indexing (relevance feedback)
Using the best t/2 terms that are single character and t/2 that are not.
[flattened table: 2021 relevant retrieved; .433, .600, .539, .505, .440]
Summary:
1. 1-gram indexing alone is surprisingly good but not sufficiently competitive.
2. 2-grams perform well: only slightly worse in precision, but about 5.5% better in relevant documents retrieved, compared to short-word indexing.
PAT-Tree-Based Keyword Extraction for Chinese IR (SIGIR97, Lee-Feng Chien)
• Two types of methods to overcome the problems of keyword extraction in Chinese IR
  – Use character-level information to replace word-level information
  – Use lexical analysis (lexicon + word segmentation)
• A statistics-based approach is employed to extract significant lexical patterns (SLPs) from a set of relevant documents
  – An SLP is an arbitrary number of successive characters which are specific and significant.
Keyword Extraction Approach
Pipeline: Relevant Documents → Lexical Pattern Construction (PAT trees) → Initial SLP Extraction (candidate SLPs) → Refined SLP Extraction (final SLPs)
Example: from 關鍵詞抽取, the lexical patterns 關, 關鍵, 關鍵詞, 關鍵詞抽, 關鍵詞抽取, 鍵, 鍵詞, 鍵詞抽, 鍵詞抽取, 詞, ... are constructed, and the keyword 關鍵詞 is finally extracted.
PAT Tree
• Record semi-infinite strings at the sentence level.
• These strings logically all have the same length (shorter suffixes are padded, shown here with 0s).
• Example sentences: 詞彙自動抽取, 簡化索引困難
  – 詞彙自動抽取 000…
  – 彙自動抽取 00000…
  – 自動抽取 0000000…
  – 動抽取 000000000…
  – 抽取 00000000000…
  – 取 0000000000000…
  – 簡化索引困難 000…
  – 化索引困難 00000…
  – 索引困難 0000000…
  – 引困難 000000000…
  – 困難 00000000000…
  – 難 0000000000000…
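Enumerating the padded semi-infinite strings of one sentence can be sketched as follows (illustrative only; a real PAT tree works on the bit representations of these strings, and the pad symbol stands in for the trailing 0-bits):

```python
def semi_infinite_strings(sentence, length=13, pad="0"):
    """One semi-infinite string per character position: the suffix starting
    there, padded on the right so all strings share the same logical length."""
    return [(sentence[i:] + pad * length)[:length] for i in range(len(sentence))]
```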
Implementation
• Each distinct semi-infinite string is represented as a node in the PAT tree and has a pointer to its position in the document.
• Each node records three parameters:
  – comparison bit
  – number of external nodes in its subtree (those nodes whose comparison bits are greater than their parents')
  – frequency count
• Example: 個人電腦, 人腦 (character positions 0 2 4 6 9 10)
  – 個人電腦 – 0
  – 人電腦 – 2
  – 電腦 – 4
  – 腦 – 6
  – 人腦 – 9
[PAT tree figure for 個人電腦, 人腦 (positions 0 2 4 6 9 10): internal nodes labeled 0, 2, 4, 6, 9, annotated with (comparison bit, # of external nodes, frequency) triples (0,6,1), (4,6,1), (5,3,1), (24,2,1), (8,3,2); suffix frequencies 個人電腦 – 0 (1), 人電腦 – 2 (1), 電腦 – 4 (1), 腦 – 6 (2), 人腦 – 9 (1). External nodes are those whose comparison bits are greater than their parents'.]
Extraction of significant lexical patterns
• Estimation of significant lexical patterns
  – A significant lexical pattern has a strong association between its composed and overlapped substrings.
  – SE_c = f_c / (f_a + f_b - f_c)
  – If SE_c is large, patterns a and b tend to occur together in the text collection.
  – Pattern c is then more complete in semantics than either a or b.
  – Example: a = 關鍵詞抽 (f_a = 6), b = 鍵詞抽取 (f_b = 6), c = 關鍵詞抽取 (f_c = 6), so SE_c = 1.
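The estimate and the worked example translate directly to code (the function name is illustrative):

```python
def significance_estimation(f_a, f_b, f_c):
    """SE_c = f_c / (f_a + f_b - f_c): how strongly the two overlapped
    substrings a and b of pattern c are associated."""
    return f_c / (f_a + f_b - f_c)

# The example above: a = 關鍵詞抽, b = 鍵詞抽取, c = 關鍵詞抽取, all with frequency 6
se_c = significance_estimation(6, 6, 6)   # 1.0: a and b always occur together
```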
The PAT-Tree-Based Filtering Algorithm
• For each lexical pattern a in the PAT tree, where a has not been marked as a non-SLP and f_a >= TH_f, consider one of its successors c and search for the other overlapped subpattern b (c is the pattern formed by overlapping a and b):
  – If a is a terminal node, then a is determined to be a candidate SLP.
  – If SE_c >= TH_se, then a and b are determined to be non-candidates. In addition, if f_c >= TH_f, the procedure is applied recursively from c to check whether c is a candidate SLP.
  – If SE_c < TH_se but f_a >> f_b, then a is determined to be a candidate SLP, while b and c remain uncertain. In addition, if f_c >= TH_f, the procedure is applied recursively from c to check whether c is a candidate SLP.
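A loose, PAT-tree-free sketch of the same filtering idea, using a plain n-gram frequency table (the names and thresholds are illustrative, and the recursive walk is flattened into a scan over all patterns):

```python
def candidate_slps(freq, THf=3, THse=0.8):
    """Keep a frequent pattern as a candidate SLP only if no pattern one
    character longer absorbs it (SE above the threshold)."""
    def se(fa, fb, fc):
        # SE_c = f_c / (f_a + f_b - f_c) for c overlapping a and b
        return fc / (fa + fb - fc)

    def absorbed(a):
        for c, fc in freq.items():
            if len(c) != len(a) + 1:
                continue
            if c.startswith(a) and c[1:] in freq and \
                    se(freq[a], freq[c[1:]], fc) >= THse:
                return True
            if c.endswith(a) and c[:-1] in freq and \
                    se(freq[c[:-1]], freq[a], fc) >= THse:
                return True
        return False

    return [a for a, fa in freq.items() if fa >= THf and not absorbed(a)]
```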
The Refined Procedure
• All candidate SLPs are checked against a common-word lexicon and a Chinese PAT tree constructed from a general-domain corpus.
• If a candidate SLP appears either in the common-word lexicon or in the candidate SLP list of the general-domain PAT tree, it is treated as a non-specific candidate and removed from the final SLP list.
• The remaining candidates are further checked by observing their frequencies and distributions in the text collection.
• Only significant patterns are extracted as final SLPs.
Experiments and Applications
• Book indexing
• Document classification
  – Define the keyword set
  – Analyze the documents
• Relevance feedback
[Relevance feedback architecture: a natural language query goes to the Csmart system; retrieved documents are presented for user interaction; user-selected relevant documents feed keyword extraction based on the proposed approach; the extracted keywords drive query expansion back into the Csmart system.]