Hsin-Hsi Chen 9-1
Chinese Language Retrieval
Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
Chinese Text Retrieval without Using a Dictionary (Chen et al, SIGIR97)
• Segmentation
  – Break a string of characters into words.
• Chinese characters and words
  – Most Chinese words consist of two characters (趙元任).
  – 26.7% unigrams, 69.8% bigrams, 2.7% trigrams (北京, 現代漢語頻率辭典)
  – 5% unigrams, 75% bigrams, 14% trigrams, 6% others (Liu)
• Word segmentation
  – Statistical methods, e.g., mutual information statistics
  – Rule-based methods, e.g., morphological rules, longest-match rules, ...
  – Hybrid methods
Indexing Techniques
• Unigram indexing
  – Break a sequence of Chinese characters into individual characters.
  – Regard each individual character as an indexing unit.
  – GB 2312-80: 6763 characters
• Bigram indexing
  – Regard all adjacent pairs of hanzi characters in the text as indexing terms.
• Trigram indexing
  – Regard all consecutive sequences of three hanzi characters as indexing terms.
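The three schemes can be sketched with one helper (a minimal illustration; the function name and the sample string are hypothetical):

```python
def ngram_index_terms(text, n):
    """All overlapping n-grams of a hanzi string: n=1 gives unigram
    indexing, n=2 bigram indexing, n=3 trigram indexing."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A four-character string yields three overlapping bigrams:
terms = ngram_index_terms("中文檢索", 2)   # ['中文', '文檢', '檢索']
```

In practice the text would first be broken at punctuation so that n-grams do not span sentence boundaries.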
Examples
Indexing Techniques (Continued)
• Statistical indexing
  – Collect the occurrence frequency in the collection of every Chinese character occurring at least once in the collection.
  – Collect the occurrence frequency in the collection of every Chinese bigram occurring at least once in the collection.
  – Compute the mutual information of every Chinese bigram:
    I(x,y) = log2(p(x,y)/(p(x)*p(y))) = log2((f(x,y)/N)/((f(x)/N)*(f(y)/N))) = log2((f(x,y)*N)/(f(x)*f(y)))
  – Strongly related: much larger value
  – Not related: close to 0
  – Negatively related: negative
Strongly related (x and y always co-occur, so p(x,y) = p(x) = p(y)):
  I(x,y) = log2(p(x,y)/(p(x)*p(y))) = log2(p(x)/(p(x)*p(y))) = log2(1/p(y)) >> 0
Not related (x and y independent, so p(x|y) = p(x)):
  I(x,y) = log2(p(x,y)/(p(x)*p(y))) = log2(p(x|y)/p(x)) = log2(1) = 0
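A quick numeric check of the two limiting cases (the frequencies here are made up for illustration):

```python
import math

def mutual_information(f_xy, f_x, f_y, N):
    """I(x,y) = log2((f(x,y) * N) / (f(x) * f(y))), as in the formula above."""
    return math.log2((f_xy * N) / (f_x * f_y))

# Strongly related: x and y always co-occur, so f(x,y) = f(x) = f(y)
strong = mutual_information(10, 10, 10, 10_000)       # log2(N / f(y)) = log2(1000), about 9.97

# Not related: f(x,y)/N = (f(x)/N) * (f(y)/N) exactly
independent = mutual_information(2, 100, 200, 10_000)  # log2(1) = 0.0
```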
Indexing Techniques (Continued)
– Apply Richard’s algorithm to segment the text into words:
  • Compute the mutual information values of all adjacent bigrams in a phrase. (Only words of one or two characters are recognized.)
  • Treat the bigram with the largest mutual information value as a word and remove it from the phrase. The removal of the bigram may leave one or two shorter phrases.
  • Repeat the previous step on each of the shorter phrases until every phrase consists of one or two characters.
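The steps above can be sketched as follows (a simplified illustration, not the authors' implementation; the function and table names are hypothetical, and unseen bigrams are smoothed with frequency 1):

```python
import math

def segment_by_mi(phrase, f_uni, f_bi, N):
    """Greedy MI segmentation: repeatedly turn the bigram with the largest
    mutual information into a word, splitting the phrase around it, until
    only 1- and 2-character pieces remain. f_uni / f_bi map characters /
    bigrams to corpus frequencies; N is the corpus size."""
    def mi(bigram):
        return math.log2((f_bi.get(bigram, 1) * N)
                         / (f_uni[bigram[0]] * f_uni[bigram[1]]))

    words, stack = [], [phrase]
    while stack:
        p = stack.pop()
        if len(p) <= 2:
            if p:
                words.append(p)
            continue
        i = max(range(len(p) - 1), key=lambda j: mi(p[j:j + 2]))
        words.append(p[i:i + 2])        # the strongest bigram becomes a word
        stack.append(p[:i])             # its removal leaves one or
        stack.append(p[i + 2:])         # two shorter phrases
    return words
```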
f(c1): occurrence frequency of the first Chinese character of a bigram
f(c2): occurrence frequency of the second Chinese character
f(c1c2): occurrence frequency of the bigram
I(c1,c2): mutual information
  I(c1,c2) >> 0: c1 and c2 have a strong relationship
  I(c1,c2) ~ 0: c1 and c2 have no relationship
  I(c1,c2) << 0: c1 and c2 have a complementary relationship
Maximum Matching and Minimum Matching
• Maximum matching: group the longest initial sequence of characters that matches a dictionary entry as a word.
• Minimum matching: treat the shortest initial sequence of characters that matches a dictionary entry as a word.
• Forward: start from the beginning of the phrase; backward: start from the end of the phrase.
• Dictionary: 138,955 entries, including words, compounds, phrases, idioms, proper names, and so on.
• Unknown words: the longest initial sequence of characters that does not match any entry in the dictionary is treated as an unknown word.
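Forward maximum matching with the unknown-word rule can be sketched as follows (a simplified illustration; minimum matching would instead try the shortest match first, i.e., iterate n upward from 1):

```python
def forward_maximum_matching(text, dictionary, max_len=8):
    """At each position, group the longest initial sequence that matches a
    dictionary entry; a run of characters matching nothing is collected as
    one unknown word."""
    words, i, unknown = [], 0, ""
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                if unknown:              # flush the pending unknown word
                    words.append(unknown)
                    unknown = ""
                words.append(text[i:i + n])
                i += n
                break
        else:
            unknown += text[i]           # no entry matches here
            i += 1
    if unknown:
        words.append(unknown)
    return words
```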
The Test Collection
[table of collection sizes, in number of characters]
The Topics
• 28 topics of the TREC-5 Chinese track
• Topic number 14
  – Title: 中國的愛滋病歷
  – Description: 中國, 雲南, 愛滋病, HIV, 高危險群患者, 注射器, 病毒。
  – Narrative: 相關文件應當包括中國那些地區的病例最多, 愛滋病毒在中國是如何傳播的, 以及中國政府如何監視愛滋病毒並控制它的傳染。
• Short query: the title field only (中國的愛滋病歷)
• Long query: the full original topic (title, description, and narrative fields)
Construction of Manual Queries
• Iterative process
  – Do a trial run using the current query.
  – Examine the top-ranked documents and manually select the terms that seem promising.
  – Add the chosen terms to the current query and manually assign weights to the new terms to form a new query.
• Cost
  – On average, it takes 2.8 hours per topic.
*: terms selected from the test database manually
Dictionary and Stop List
• Stop list: 825 entries, including pronouns, determiners, prepositions, adverbs, conjunctions, and so on.
• Dictionary: 138,955 entries, including words, phrases, compounds, idioms, proper names, and so on. About 60,000 entries were manually selected from the TREC-5 Chinese collection and added to a dictionary of about 80,000 entries obtained from a website.
Evaluation
• For each query, the top 1,000 documents are retrieved from the Chinese collection of 164,789 documents. (0.61%)
• The retrieved documents are ranked in decreasing order by the probability of relevance.
• There are 2,182 relevant documents in total for all 28 queries.
• For all runs, 11-point recall-precision figures are calculated using the TREC evaluation program written by Chris Buckley.
• The base run uses the index file created from the text segmented using the forward maximum matching method.
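The 11-point figures can be illustrated with a small stand-in for the TREC evaluation program (this is not Buckley's code; the function name is hypothetical):

```python
def eleven_point_precision(ranked_relevant, total_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for one
    query. ranked_relevant lists the relevance judgments of the retrieved
    documents in rank order."""
    precisions, recalls, hits = [], [], 0
    for rank, rel in enumerate(ranked_relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
            recalls.append(hits / total_relevant)
    points = []
    for level in (i / 10 for i in range(11)):
        # interpolated precision: the best precision at any recall >= level
        ps = [p for p, r in zip(precisions, recalls) if r >= level]
        points.append(max(ps) if ps else 0.0)
    return points
```

Averaging each point over the 28 queries would give an 11-point curve like the ones tabulated in the following slides.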
The Long Queries (title, description, and narrative fields)
Automatically constructed long queries
The Long Queries (Continued)
Manually reformulated queries
[flattened table: baseline run, max(f) segmentation; 11-point precisions 0.8000, 0.6465, 0.5283, 0.4308, 0.3841, 0.3455, 0.2947, 0.2439, 0.1891, 0.1051, 0.0282; average precision 0.3558; 1910 relevant retrieved]
The manually reformulated queries are 27% better than the baseline.
Query terms (2-character words) vs. indexing terms (3-character words)
The Short Queries
Length: 12.5 characters (short queries) vs. 107.0 characters (long topic queries) vs. 249.7 characters (manually reformulated queries)
[flattened table: baseline run, max(f) segmentation; 11-point precisions 0.8000, 0.6465, 0.5283, 0.4308, 0.3841, 0.3455, 0.2947, 0.2439, 0.1891, 0.1051, 0.0282; average precision 0.3558; 1910 relevant retrieved; long-query run shown for comparison]
Discussion
• Unigram indexing
  – Ambiguity: 毒, 毒氣, 病毒, 毒品
  – Function words: 將, 的, 與, 在
  – The average precisions for the set of short queries (0.2770) and the set of manually reformulated queries (0.4203) are good compared with the results of the dictionary-based maximum matching (0.3558).
    • All function words are removed from the manually reformulated queries.
    • The number of single characters in the short queries is small.
• Bigram indexing
  – Most Chinese words consist of two characters.
  – Bigram indexing (0.3677 LA, 0.4522 LM, 0.2687 S) and mutual information segmentation (0.3744 LA, 0.4533 LM, 0.2849 S) outperform unigram indexing (0.2609 LA, 0.4203 LM, 0.2770 S).
    (LA: long & automatic, LM: long & manual, S: short)
Discussion (Continued)
• Trigram indexing
  – The words representing the key concepts in the topics are two-character words.
  – The number of unique trigrams in a corpus is much larger than the number of unique bigrams.
• Dictionary-based methods
  – Unknown words, including person names, place names, company names, transliterated names, names of new products, abbreviations of full names, and so on. Only 21 of 287 university and college names and 14 of 627 company names appear in the dictionary.
  – Identification and subsequent resolution of segmentation ambiguities.
  – For the set of short queries, minimum matching (0.2531 forward, 0.2465 backward) works better than maximum matching (0.2346 forward, 0.2250 backward).
  – For the set of long, manual queries, maximum matching (0.4519 forward, 0.4481 backward) outperforms minimum matching (0.3937 forward, 0.3904 backward).
• Bigram indexing and mutual-information-based segmentation outperform the popular dictionary-based maximum matching.
Comparing Representation in Chinese Information Retrieval (Kwok, SIGIR97)
• Representation Methods for Chinese Texts
  – 1-grams (single characters)
    • Punctuation signs are deleted.
    • Stop word removal is not performed.
    • 6763 characters (GB 2312) + English characters/words + … = 8093 index terms
  – Bigrams (two contiguous, overlapping characters)
    • 80% of modern Chinese words are bisyllabic (林語堂, 1972).
    • Problem: many meaningless character pairs are also produced.
  – Short-words
    • Segment texts into meaningful short-words of 1-4 characters.
Segmentation
• Step 1 (use facts)
  Look up a small (about 2,000 entries), manually created lexicon of commonly used short-words of one to three characters. Some 4-character names and proper nouns are also included.
• Step 2 (involve rules)
  Split the remaining chunks into short-words of 2 or 3 characters using manually determined rules, e.g., XX, AX, two by two.
• Step 3 (frequency filtering)
  A threshold on the frequency of occurrence in the corpus is used to extract the most common words from steps 1 and 2.
• Step 4 (iteration)
  Expand the initial lexicon of step 1 with the words discovered in step 3. Only one iteration is done in Kwok's experiment.
• No disambiguation is performed.
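Steps 1 and 2 can be sketched as follows (a rough illustration only; the real system's lexicon and rules are richer, and the frequency filtering and iteration of steps 3-4 are omitted):

```python
def kwok_segment(text, lexicon):
    """Greedy lookup of 1-3 character lexicon entries, then split each
    leftover chunk two characters at a time (the "two by two" rule)."""
    words, i = [], 0
    while i < len(text):
        for n in (3, 2, 1):                      # longest entry first
            if text[i:i + n] in lexicon:
                words.append(text[i:i + n])
                i += n
                break
        else:
            j = i                                # collect the out-of-lexicon chunk
            while j < len(text) and not any(
                    text[j:j + n] in lexicon for n in (3, 2, 1)):
                j += 1
            chunk = text[i:j]
            words.extend(chunk[k:k + 2] for k in range(0, len(chunk), 2))
            i = j
    return words
```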
The Chinese Retrieval Environment
• Documents
  – TREC-5: 24,988 Xinhua and 139,801 People's Daily news articles
  – Documents are divided into subdocuments of about 550 characters on a paragraph boundary.
  – Total number of subdocuments: 231,527
  – Index terms: 1-grams (8,093), 2-grams (1,482,172), short-words (494,288, about 1/3 of the 2-gram count)
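The subdocument split can be sketched as follows (an illustrative assumption about the packing rule, since the paper's exact procedure is not reproduced here):

```python
def split_subdocuments(paragraphs, target=550):
    """Pack paragraphs into subdocuments of roughly `target` characters,
    always cutting on a paragraph boundary."""
    subdocs, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > target:
            subdocs.append(current)
            current = ""
        current += para
    if current:
        subdocs.append(current)
    return subdocs
```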
The Chinese Retrieval Environment (Continued)
• Query Collection
  – 28 queries
  – Query 1
美國決定將中國大陸的人權狀況與其是否給予中共最惠國待遇分離最惠國待遇,中國,人權,經濟制裁,分離,脫鉤
相關文件必須提到美國為何將最惠國待遇與人權分離;相關文件也必須提到中共為什麼反對美國將人權與最惠國待遇相提並論。
segmented Chinese text美國 | 決定 | 將 | 中國 | 大陸 | 的 | 人權 | 狀況 | 與其 | 是否 | 給予 | 中共最 | 惠國 | 待遇 | 分離最 | 惠國 | 待遇 | , | 中國, | 人權, | 經濟 | 制裁, | 分離, | 脫鉤
相關文件 | 必須 | 提到 | 美國 | 為何 | 將 | 最 | 惠國 | 待遇 | 與 | 人權 | 分離 | ;相關文件 | 也 | 必須 | 提到 | 中共 | 為 | 什麼 | 反 | 對 | 美國 | 將 | 人權 | 與 | 最 | 惠國 | 待遇 | 相 | 提並 | 論。
The Chinese Retrieval Environment (Continued)
– Query 11
聯合國駐波斯尼亞維和部隊。波斯尼亞,前南斯拉夫,巴爾幹,聯合國,北約,武器禁運,維和,維持和平
相關文件必須包括聯合國和平部隊如何在戰火蹂躪的波斯尼亞進行維持和平的任務。
segmented Chinese text聯合國 | 駐波 | 斯 | 尼亞 | 維 | 和 | 部隊。| 波斯尼亞 | , | 前南 | 斯 | 拉夫, | 巴 | 爾 | 幹, | 聯合國, | 北 | 約, | 武器 | 禁運 | , | 維 | 和, | 維持 | 和平
相關文件 | 必須 | 包括 | 聯合國 | 和平 | 部隊 | 如何 | 在 | 戰火 | 蹂躪 | 的 | 波斯尼亞 | 進行 | 維持 | 和平 | 的 | 任務 | 。
Retrieval Based on 1-gram Indexing (Baseline Model)
• High-frequency threshold = 30k; low-frequency threshold = 3
  (the document-frequency thresholds delimit the mid-frequency range)
• Recall that the number of subdocuments is 231k.
• This is the best setting.
[figure residue: relative differences of 7%, 8.65%, 10.8%, 13%, 16% between settings]
Retrieval Based on 1-gram Indexing (using relevance feedback)
• Set thresholds = 30k/3 (the best setting from the baseline run)
• Number of feedback documents = 20; number of feedback terms = 40
[flattened table: baseline model vs. pseudo-relevance feedback; 1802 relevant retrieved; precision figures .324, .471, .438, .402, .352; improvements 6.4%, 18.6%, 16.8%, 1.9%, 13.2%, 6.7%]
Retrieval Based on 1-gram Indexing (using relevance feedback)
• Set thresholds = 30k/3; set the number of feedback documents = 20
• How many terms should be used for expansion? 40-60 terms give similar results.
• Summary: single characters, although highly ambiguous, give surprisingly good results.
Retrieval Based on 2-gram Indexing (Baseline Model)
• Thresholds = 30k/3
[flattened table: best 1-gram baseline result (1802 relevant retrieved; .324, .471, .438, .402, .352) vs. best 2-gram baseline result; 9% of the 231k subdocuments (document frequency); improvements 12.2%, 8.2%, 4.8%, 6.4%, 8%, 10.2%]
Retrieval Based on 2-gram Indexing (relevance feedback)
• Performance increases with the number of query expansion terms.
• Thresholds = 20k/3
[flattened table: baseline (2-gram) 2021 relevant retrieved, .345, .482, .464, .421, .385; 20 feedback documents / 50 terms; baseline (1-gram) 1919 relevant retrieved, .386, .543, .493, .456, .401]
• Summary: about 1.5 million character pairs are generated, yet bigrams do not lead to many noisy matches.
Retrieval Based on Short-word Indexing (baseline model)
• Thresholds = 20k/3
[flattened table: short-word indexing (baseline) 2021 relevant retrieved, .350, .493, .466, .435, .388 vs. 2-gram indexing (baseline); differences -3.8%, 10%, 10.8%, 3.9%, 5.1%, 3.7%]
Retrieval Based on Short-word Indexing (relevance feedback)
Fix the number of feedback documents at 40 and perform the 2nd-stage retrieval with various numbers of query expansion terms.
Retrieval Based on Short-word Indexing (relevance feedback)
Using the best t/2 terms that are single character and t/2 that are not.
[flattened table: 2021 relevant retrieved; .433, .600, .539, .505, .440]
Summary:
1. 1-gram indexing alone is surprisingly good but not sufficiently competitive.
2. 2-grams perform well: only slightly worse in precision, but about 5.5% better in relevant documents retrieved, compared to short-word indexing.
PAT-Tree-Based Keyword Extraction for Chinese IR (SIGIR97, Lee-Feng Chien)
• Two types of methods to overcome the problems of keyword extraction in Chinese IR
  – Use character-level information to replace word-level information
  – Use lexical analysis (lexicon + word segmentation)
• A statistics-based approach is employed to extract significant lexical patterns (SLPs) from a set of relevant documents
  – An SLP is an arbitrary number of successive characters which are specific and significant.
Keyword Extraction Approach
Pipeline: Relevant Documents → Lexical Pattern Construction (PAT trees) → Initial SLP Extraction (candidate SLPs) → Refined SLP Extraction (final SLPs)
Example: from 關鍵詞抽取, the lexical patterns 關, 關鍵, 關鍵詞, 關鍵詞抽, 關鍵詞抽取, 鍵, 鍵詞, 鍵詞抽, 鍵詞抽取, 詞, ... are constructed, and the keyword 關鍵詞 is finally extracted.
PAT Tree
• Record semi-infinite strings at the sentence level.
• These strings logically all have the same length (shorter suffixes are padded, shown here with 0s).
• Example sentences: 詞彙自動抽取, 簡化索引困難
  – 詞彙自動抽取 000…
  – 彙自動抽取 00000…
  – 自動抽取 0000000…
  – 動抽取 000000000…
  – 抽取 00000000000…
  – 取 0000000000000…
  – 簡化索引困難 000…
  – 化索引困難 00000…
  – 索引困難 0000000…
  – 引困難 000000000…
  – 困難 00000000000…
  – 難 0000000000000…
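Enumerating the padded semi-infinite strings of one sentence can be sketched as follows (illustrative only; a real PAT tree works on the bit representations of these strings, and the pad symbol stands in for the trailing 0-bits):

```python
def semi_infinite_strings(sentence, length=13, pad="0"):
    """One semi-infinite string per character position: the suffix starting
    there, padded on the right so all strings share the same logical length."""
    return [(sentence[i:] + pad * length)[:length] for i in range(len(sentence))]
```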
Implementation
• Each distinct semi-infinite string is represented as a node in the PAT tree and has a pointer to its position in the document.
• Each node records three parameters:
  – comparison bit
  – number of external nodes in its subtree (those nodes whose comparison bits are greater than their parents')
  – frequency count
• Example: 個人電腦, 人腦 (character positions 0 2 4 6 9 10)
  – 個人電腦 – 0
  – 人電腦 – 2
  – 電腦 – 4
  – 腦 – 6
  – 人腦 – 9
[PAT tree figure for 個人電腦, 人腦 (positions 0 2 4 6 9 10): internal nodes labeled 0, 2, 4, 6, 9, annotated with (comparison bit, # of external nodes, frequency) triples (0,6,1), (4,6,1), (5,3,1), (24,2,1), (8,3,2); suffix frequencies 個人電腦 – 0 (1), 人電腦 – 2 (1), 電腦 – 4 (1), 腦 – 6 (2), 人腦 – 9 (1). External nodes are those whose comparison bits are greater than their parents'.]
Extraction of significant lexical patterns
• Estimation of significant lexical patterns
  – A significant lexical pattern has a strong association between its composed and overlapped substrings.
  – SE_c = f_c / (f_a + f_b - f_c)
  – If SE_c is large, patterns a and b tend to occur together in the text collection.
  – Pattern c is then more complete in semantics than either a or b.
  – Example: a = 關鍵詞抽 (f_a = 6), b = 鍵詞抽取 (f_b = 6), c = 關鍵詞抽取 (f_c = 6), so SE_c = 1.
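The estimate and the worked example translate directly to code (the function name is illustrative):

```python
def significance_estimation(f_a, f_b, f_c):
    """SE_c = f_c / (f_a + f_b - f_c): how strongly the two overlapped
    substrings a and b of pattern c are associated."""
    return f_c / (f_a + f_b - f_c)

# The example above: a = 關鍵詞抽, b = 鍵詞抽取, c = 關鍵詞抽取, all with frequency 6
se_c = significance_estimation(6, 6, 6)   # 1.0: a and b always occur together
```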
The PAT-Tree-Based Filtering Algorithm
• For each lexical pattern a in the PAT tree, where a has not been marked as a non-SLP and f_a >= TH_f, consider one of its successors c and search for the other overlapped subpattern b (c is the pattern formed by overlapping a and b):
  – If a is a terminal node, then a is determined to be a candidate SLP.
  – If SE_c >= TH_se, then a and b are determined to be non-candidates. In addition, if f_c >= TH_f, the procedure is applied recursively from c to check whether c is a candidate SLP.
  – If SE_c < TH_se but f_a >> f_b, then a is determined to be a candidate SLP, while b and c remain uncertain. In addition, if f_c >= TH_f, the procedure is applied recursively from c to check whether c is a candidate SLP.
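A loose, PAT-tree-free sketch of the same filtering idea, using a plain n-gram frequency table (the names and thresholds are illustrative, and the recursive walk is flattened into a scan over all patterns):

```python
def candidate_slps(freq, THf=3, THse=0.8):
    """Keep a frequent pattern as a candidate SLP only if no pattern one
    character longer absorbs it (SE above the threshold)."""
    def se(fa, fb, fc):
        # SE_c = f_c / (f_a + f_b - f_c) for c overlapping a and b
        return fc / (fa + fb - fc)

    def absorbed(a):
        for c, fc in freq.items():
            if len(c) != len(a) + 1:
                continue
            if c.startswith(a) and c[1:] in freq and \
                    se(freq[a], freq[c[1:]], fc) >= THse:
                return True
            if c.endswith(a) and c[:-1] in freq and \
                    se(freq[c[:-1]], freq[a], fc) >= THse:
                return True
        return False

    return [a for a, fa in freq.items() if fa >= THf and not absorbed(a)]
```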
The Refined Procedure
• All candidate SLPs are checked against a common-word lexicon and a Chinese PAT tree constructed from a general-domain corpus.
• If a candidate SLP appears either in the common-word lexicon or in the candidate SLP list of the general-domain PAT tree, it is treated as a non-specific candidate and removed from the final SLP list.
• The remaining candidates are further checked by observing their frequencies and distributions in the text collection.
• Only significant patterns are extracted as final SLPs.
Experiments and Applications
• Book indexing
• Document classification
  – Define the keyword set
  – Analyze the documents
• Relevance feedback
[Relevance feedback architecture: a natural language query goes to the Csmart system; retrieved documents are presented for user interaction; user-selected relevant documents feed keyword extraction based on the proposed approach; the extracted keywords drive query expansion back into the Csmart system.]