
Page 1: Unknown Word 08


Pattern Mining to Chinese Unknown Word Extraction

Second-year master's student, Computer Science, 955202037

楊傑程, 2008/08/12

Page 2: Unknown Word 08


Outline

Introduction
Related Works
Unknown Word Detection
Unknown Word Extraction
Experiments
Conclusions

Page 3: Unknown Word 08


Introduction

With the growing popularity of Chinese, Chinese text processing has become an active research area in recent years.

Before the knowledge in Chinese text can be exploited, some preprocessing is required, such as Chinese word segmentation: Chinese text contains no spaces to mark word boundaries.

Page 4: Unknown Word 08

Introduction

Chinese word segmentation encounters two major problems: ambiguity and unknown words.

Ambiguity: one unsegmented Chinese character string can be segmented differently depending on context. Ex: the sentence "研究生命起源" can be segmented into "研究 生命 起源" or "研究生 命 起源".

Unknown words: also known as out-of-vocabulary (OOV) words, mostly unfamiliar proper nouns or newly coined words. Ex: the sentence "王義氣熱衷於研究生命" would be segmented into "王 義氣 熱衷 於 研究 生命", because "王義氣" is an uncommon personal name that is not in the vocabulary.

Page 5: Unknown Word 08


Introduction - types of unknown words

In this paper, we focus on the Chinese unknown word problem.

Types of Chinese unknown words:

Proper names:
  Personal names, Ex: 王小明
  Organization names, Ex: 華碩電腦
  Abbreviations, Ex: 中油、中大
Derived words, Ex: 總經理、電腦化
Compounds, Ex: 電腦桌、搜尋法
Numeric type compounds, Ex: 1986 年、19 巷

Page 6: Unknown Word 08

Introduction - unknown word identification

Chinese word segmentation process:

1. Initial segmentation (dictionary-assisted): correctly identified words are called known words; unknown words are wrongly segmented into two or more parts. Ex: the personal name 王小明, after initial segmentation, becomes 王 小 明.

2. Unknown word identification: characters belonging to one unknown word should be combined. Ex: combine 王 小 明 into 王小明.

Page 7: Unknown Word 08

Introduction - unknown word identification

How does unknown word identification work?

A character can be a word by itself (馬) or part of an unknown word (馬 + 英 + 九).

Unknown word detection rules, built with the help of syntactic and context information, first detect suspect morphemes; extraction then focuses only on the detected morphemes and combines them.

Page 8: Unknown Word 08


Introduction - detection and extraction

In this paper, we apply continuity pattern mining to discover unknown word detection rules.

Then, we utilize syntactic information, context information, and heuristic statistical information to correctly extract unknown words.

Page 9: Unknown Word 08


Introduction - applied techniques

We adopt sequential data learning methods and machine learning algorithms to carry out unknown word extraction.

Our unknown word extraction method is general: it does not limit extraction to specific types of unknown words through hand-crafted rules.

Page 10: Unknown Word 08


Related Works - particular methods

Research on Chinese word segmentation has been going on for over a decade.

Early on, researchers applied particular kinds of information (patterns, frequency, context information) to discover particular kinds of unknown words, e.g. proper nouns ([Chen & Li, 1996], [Chen & Chen, 2000]).

Page 11: Unknown Word 08


Related Works - general methods (rule-based)

Later, researchers turned to methods that extract all kinds of unknown words.

Rule-based detection:

Distinguish monosyllabic words from monosyllabic morphemes ([Chen et al., 1998]).

Combine morphological rules with statistical rules to extract personal names, transliterated names, and compound nouns ([Chen et al., 2002]). <Precision: 89%, Recall: 68%>

Use a context-free-grammar concept and propose a bottom-up merging algorithm; adopt morphological rules and general rules to extract all kinds of unknown words ([Ma et al., 2003]). <Precision: 76%, Recall: 57%>

Page 12: Unknown Word 08

Related Works - general methods (statistical model-based)

Statistical model-based detection: apply machine learning algorithms and sequential supervised learning.

Direct method: generate one corresponding statistical model.
Initial segmentation and role tagging (HMM, CRF); chunking (SVM).
[Goh et al., 2006]: HMM + SVM, <Precision: 63.8%, Recall: 58.3%>
[Tsai et al., 2006]: CRF, <Recall: 73%>

Page 13: Unknown Word 08

Related Works - data

Sequential supervised learning:
Direct methods, like HMM and CRF.
Indirect methods, like sliding windows and recurrent sliding windows, which transform the sequential learning problem into a classification problem. <[T. G. Dietterich, 2002]>

Imbalanced data problem <[Seyda et al., 2007]>:
Select the most informative instances: randomly sample 59 instances in each iteration, then pick the instance closest to the hyper-plane, as sketched below.
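A minimal sketch of that selection loop, assuming scikit-learn's LinearSVC as the classifier and hypothetical arrays X_pool/y_pool (candidate pool) and X_seed/y_seed (initial training set); the cited work is not tied to this library:

    import numpy as np
    from sklearn.svm import LinearSVC

    def select_informative(X_pool, y_pool, X_seed, y_seed, rounds=100):
        # Grow the training set one instance per round, as in the policy above.
        X_train, y_train = list(X_seed), list(y_seed)
        pool = list(range(len(X_pool)))
        rng = np.random.default_rng(0)
        clf = LinearSVC(dual=False)
        for _ in range(rounds):
            if not pool:
                break
            clf.fit(X_train, y_train)
            # Randomly sample 59 candidates from the remaining pool.
            sample = rng.choice(pool, size=min(59, len(pool)), replace=False)
            # Keep the candidate closest to the hyper-plane (smallest margin).
            dist = np.abs(clf.decision_function(X_pool[sample]))
            chosen = int(sample[np.argmin(dist)])
            X_train.append(X_pool[chosen])
            y_train.append(y_pool[chosen])
            pool.remove(chosen)
        return clf.fit(X_train, y_train)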

Page 14: Unknown Word 08

Unknown Word Detection & Extraction

Our idea is similar to [Chen et al., 2002]:

Unknown word detection: continuity pattern mining to derive detection rules.

Unknown word extraction: utilize natural language information, content and context information, and statistical information to extract unknown words.

Sequential supervised learning methods (indirect) and machine-learning-based models are used.


Page 15: Unknown Word 08


Unknown Word Detection

We call unknown word detection the "Phase 1" process and unknown word extraction the "Phase 2" process.

The following is the flow chart of unknown word detection (Phase 1).

Page 16: Unknown Word 08

[Flow chart: Phase 1. Training side: training data (8/10 of the balanced corpus) → initial segmentation with dictionary (Libtabe lexicon) → POS tagging (TnT) → pattern mining to derive detection rules. Testing side: testing data (unsegmented 1/10 of the balanced corpus) → initial segmentation → POS tagging (TnT) → unknown word detection with the derived rules → labeled Phase 2 training data.]

Page 17: Unknown Word 08


Unknown word detection - pattern mining

Pattern mining:

Sequential pattern: "因為…, 所以…". Required items must appear in the pattern's order, but noise is allowed between them.

Continuity pattern: "打球" => "打球" matches, "打籃球" does not. Each item and its position are strictly constrained, which makes pattern mining efficient. The contrast is sketched below.
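A small illustrative sketch of the two notions (function names are mine, not from the paper): a sequential pattern tolerates gaps between its items, while a continuity pattern requires strict adjacency.

    def matches_sequential(pattern, seq):
        # Items must appear in order; gaps ("noise") are allowed between them.
        it = iter(seq)
        return all(item in it for item in pattern)

    def matches_continuity(pattern, seq):
        # Items must appear in order and strictly adjacent.
        n = len(pattern)
        return any(seq[i:i + n] == pattern for i in range(len(seq) - n + 1))

    print(matches_sequential("打球", "打籃球"))  # True: 打 … 球 appear in order
    print(matches_continuity("打球", "打籃球"))  # False: 打 and 球 are not adjacent
    print(matches_continuity("打球", "打球"))    # True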

Page 18: Unknown Word 08

Unknown word detection - continuity pattern mining

Prowl <[Huang et al., 2004]>:
Starts with 1-frequent patterns.
Extends to a 2-pattern by joining two adjacent 1-frequent patterns, then evaluates its frequency, and so on; see the sketch below.
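A minimal miner in this spirit (my sketch of the idea, not the authors' implementation): keep 1-patterns above a support threshold, then repeatedly join adjacent frequent patterns and re-count.

    from collections import Counter

    def mine_continuity(sequences, min_support=2, max_len=4):
        # Level 1: frequent single items, stored as 1-tuples.
        c1 = Counter(tok for s in sequences for tok in s)
        freq = {1: {(t,): c for t, c in c1.items() if c >= min_support}}
        for n in range(2, max_len + 1):
            counts = Counter()
            for s in sequences:
                for i in range(len(s) - n + 1):
                    window = tuple(s[i:i + n])
                    # Extend only when both adjacent (n-1)-sub-patterns are frequent.
                    if window[:-1] in freq[n - 1] and window[1:] in freq[n - 1]:
                        counts[window] += 1
            freq[n] = {p: c for p, c in counts.items() if c >= min_support}
            if not freq[n]:
                break
        return freq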

Page 19: Unknown Word 08


Encoding

Starting from the original segmentation, label each word based on lexicon matching: known (Y) or unknown (N).

"葡萄" is in the lexicon => "葡萄" is labeled as a known word (Y).
"葡萄皮" is not in the lexicon => "葡萄皮" is labeled as an unknown word (N).

Encoding examples (sketched in code below):
葡萄 (Na) → 葡 (Na) Y + 萄 (Na) Y
葡萄皮 (Na) → 葡 (Na) N + 萄 (Na) N + 皮 (Na) N
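A sketch of this encoding step, assuming a simple set-based lexicon:

    def encode(segmented, lexicon):
        # Each word becomes per-character records carrying its POS tag and a
        # known (Y) / unknown (N) label from lexicon matching.
        out = []
        for word, pos in segmented:
            label = "Y" if word in lexicon else "N"
            out.extend((ch, pos, label) for ch in word)
        return out

    lexicon = {"葡萄"}
    print(encode([("葡萄", "Na")], lexicon))    # [('葡','Na','Y'), ('萄','Na','Y')]
    print(encode([("葡萄皮", "Na")], lexicon))  # [('葡','Na','N'), ('萄','Na','N'), ('皮','Na','N')]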

Page 20: Unknown Word 08


Create detection rules

This pattern rule means: when "葡 (Na), 萄 (Na)" appears, the probability that "葡 (Na)" is a known word (or an unknown word) is 0.5, since the pattern was observed once with each label:

(葡 (Na), 萄 (Na)) → Y : 1
(葡 (Na), 萄 (Na)) → N : 1
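Turning such counts into a rule score can be sketched as follows; the (context, label) count layout is my assumption:

    from collections import Counter

    # Occurrence counts for the mined pattern: (context, label) -> frequency.
    counts = Counter({(("葡/Na", "萄/Na"), "Y"): 1,
                      (("葡/Na", "萄/Na"), "N"): 1})

    def rule_accuracy(context, label):
        # Accuracy of predicting `label` when `context` is observed.
        total = sum(c for (ctx, _), c in counts.items() if ctx == context)
        return counts[(context, label)] / total if total else 0.0

    print(rule_accuracy(("葡/Na", "萄/Na"), "N"))  # 0.5, as in the example above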

Page 21: Unknown Word 08

[Flow chart: Phase 2. The Phase 2 training data (term + term_attribute + POS) is stored and run through a sliding window over the sequential data. Positive examples are found via the B/I/E/S position labels; negative examples are learned and dropped. Term frequencies per document are calculated, and three SVM models are trained (2-gram, 3-gram, and 4-gram). For evaluation, 1/10 of the balanced corpus flows through the same processing; the three models' outputs are merged, overlap and conflict are resolved, and precision/recall are calculated against the correct segmentation.]

Page 22: Unknown Word 08


Unknown Word Extraction

After initial segmentation and application of the detection rules, each term carries a "term_attribute" label. The six "term_attributes" are:

ms() - monosyllabic word, Ex: 你、我、他
ms(?) - morpheme of an unknown word, Ex: 王, 小, 明 in 王小明
ds() - disyllabic word, Ex: 學校
ps() - polysyllabic word, Ex: 筆記型電腦
dot() - punctuation, Ex: "，"、"。"…
none() - no information available / new term

The extraction targets are the terms whose "term_attribute" is "ms(?)". A minimal labeling sketch follows below.
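A labeling sketch that approximates syllable count by character count; `detected_unknown` stands in for the verdict of the Phase 1 detection rules, and both names are mine:

    import unicodedata

    def term_attribute(term, detected_unknown):
        # none() (no information / new term) is omitted from this sketch.
        if all(unicodedata.category(ch).startswith("P") for ch in term):
            return "dot()"    # punctuation such as ， or 。
        if detected_unknown:
            return "ms(?)"    # morpheme of a suspected unknown word
        if len(term) == 1:
            return "ms()"     # monosyllabic word
        if len(term) == 2:
            return "ds()"     # double-syllabic word
        return "ps()"         # poly-syllabic word

    print(term_attribute("王", True), term_attribute("學校", False))  # ms(?) ds()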

Page 23: Unknown Word 08


Positive / Negative Judgment

A term is either a word or part of an unknown word. Based on the position of a term within an unknown word, we use four position labels:

B - Begin, ex: 王 of 王姿分
I - Intermediate, ex: 姿 of 王姿分
E - End, ex: 分 of 王姿分
S - Singular, ex: 我、你

Find a B + I* (zero or more) + E combination (positive): 王(?) B + 姿(?) I + 分(?) E, as sketched below.
Combine it as a new word (王姿分).
Randomly sample so that the numbers of positive and negative examples in the training model are equal.
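The B + I* + E judgment can be sketched as a regular-expression scan over the position labels:

    import re

    def find_unknown_words(terms, labels):
        # Accept every B, zero-or-more I, E run as one combined new word.
        tags = "".join(labels)
        return ["".join(terms[m.start():m.end()])
                for m in re.finditer(r"BI*E", tags)]

    terms = ["我", "王", "姿", "分", "你"]
    labels = ["S", "B", "I", "E", "S"]
    print(find_unknown_words(terms, labels))  # ['王姿分']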

Page 24: Unknown Word 08

Data Processing - Sliding Window

Sequential supervised learning, indirect method: transform sequential learning into classification learning.

Sliding window: each time we take n+2 terms (n terms plus prefix and suffix) as one instance, then shift one token to the right to generate the next one, and so on. At least one ms(?) must appear among the n terms.

We offer three choices of n (2, 3, 4); that is, three SVM models that extract unknown words of different lengths. We call these the n-gram data (models). A windowing sketch follows below.
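A sketch of the windowing under these assumptions; a "()" pad term stands in for a missing prefix or suffix at sentence boundaries:

    def sliding_windows(terms, attrs, n=3, pad="()"):
        # One instance = prefix + n terms + suffix, shifted one token at a time.
        terms = [pad] + list(terms) + [pad]
        attrs = [pad] + list(attrs) + [pad]
        for i in range(len(terms) - n - 1):
            middle_attrs = attrs[i + 1:i + n + 1]
            if "ms(?)" in middle_attrs:     # otherwise the window is discarded
                yield terms[i:i + n + 2]

Each surviving window is then judged positive or negative from the B/I/E/S labels of its middle terms.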


Page 25: Unknown Word 08


EX: 3-gram Model

Sentence: 運動會 () ‧ () 四年 () 甲班 () 王 (?) 姿 (?) 分 (?) ‧ () 本校 () 為 () 響 () 應 ()

Windows (prefix | three terms | suffix):

運動會 | ‧ 四年 甲班 | 王(?) → discard (no ms(?) among the three terms)
‧ | 四年 甲班 王(?) | 姿(?) → negative
四年 | 甲班 王(?) 姿(?) | 分(?) → negative
甲班 | 王(?) B 姿(?) I 分(?) E | ‧ → positive
王(?) | 姿(?) 分(?) ‧ | 本校 → negative

Page 26: Unknown Word 08


Statistical Information

For each n-gram instance (window layout: prefix at position 0, terms t1 t2 t3, suffix at position 4), we record:

1. The POS tag of each term.
2. The term_attribute (ms(), ms(?), ds(), …).
3. Statistical information (exemplified by the 3-gram model, and sketched in code below):
   Frequency of the 3-gram.
   p(prefix | 3-gram), e.g. p(prefix | t1~t3)
   p(suffix | 3-gram), e.g. p(suffix | t1~t3)
   p(first term | the other n-1 consecutive terms), e.g. p(t1 | t2~t3)
   p(last term | the other n-1 preceding terms), e.g. p(t3 | t1~t2)
   pos_freq(prefix) / pos_freq(prefix in training positives)
   pos_freq(suffix) / pos_freq(suffix in training positives)
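A sketch of these statistics for one 3-gram window; `count` is an assumed helper returning the corpus frequency of a term sequence, and the two POS-frequency ratios are omitted:

    def gram_features(prefix, t1, t2, t3, suffix, count):
        gram = count((t1, t2, t3))
        return {
            "freq_3gram": gram,
            # p(prefix | t1~t3) and p(suffix | t1~t3)
            "p_prefix": count((prefix, t1, t2, t3)) / gram if gram else 0.0,
            "p_suffix": count((t1, t2, t3, suffix)) / gram if gram else 0.0,
            # p(t1 | t2~t3) and p(t3 | t1~t2)
            "p_first": gram / count((t2, t3)) if count((t2, t3)) else 0.0,
            "p_last": gram / count((t1, t2)) if count((t1, t2)) else 0.0,
        }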

Page 27: Unknown Word 08

Experiments

Unknown word detection. Unknown word extraction.


Page 28: Unknown Word 08

Unknown Word Detection

8/10 of the balanced corpus (575m words) is used as training data, mined with the pattern mining tool Prowl [Huang et al., 2004]. A randomly picked 1/10 of the balanced corpus (not covered by the training data) is used as testing data. Rule accuracy is used as the threshold for keeping detection rules.


Threshold (accuracy)   Precision   Recall   F-measure (our system)   F-measure (AS system)
0.7                    0.9324      0.4305   0.589035                 0.71250
0.8                    0.9008      0.5289   0.66648                  0.752447
0.9                    0.8343      0.7148   0.769941                 0.76955
0.95                   0.764       0.8288   0.795082                 0.76553
0.98                   0.686       0.8786   0.770446                 0.744036

Page 29: Unknown Word 08

Unknown Word Extraction

The rest of the Sinica corpus is used as testing data in Phase 2.

[Chen et al., 2002] evaluates unknown word extraction mainly on Chinese personal names, foreign transliteration names, and compound nouns. We apply our extraction method to all types of unknown words.


Page 30: Unknown Word 08


Unknown Word Extraction

Judging the overlap and conflict problems among different unknown word combinations:

[Chen et al., 2002]: frequency(w) * length(w). Ex: for "律師 班 奈 特", compare freq(律師+班)*2 against freq(班+奈+特)*3.

Our method:

First solve the overlap problem within identical n-gram data by comparing P(prefix | overlap) against P(suffix | overlap). Ex: for "義民 廟 中", compare P(義民 | 廟) against P(中 | 廟), as sketched below.

Then solve the conflict problem across different n-gram data by real frequency: freq(X) - freq(Y) if X is included in Y (ex: X = "醫學", "學院"; Y = "醫學院"), and freq(n-gram) * freq(POS_n-gram), n: 2~4.
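A sketch of the overlap comparison, with `count` again an assumed corpus-frequency helper; the two candidates 義民+廟 and 廟+中 share the term 廟:

    def resolve_overlap(prefix, shared, suffix, count):
        # Compare P(prefix | shared) with P(suffix | shared) and keep the
        # more probable attachment of the shared term.
        p_prefix = count((prefix, shared)) / count((shared,))
        p_suffix = count((shared, suffix)) / count((shared,))
        return (prefix, shared) if p_prefix >= p_suffix else (shared, suffix)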

Page 31: Unknown Word 08


Testing result

We also evaluate the three kinds of unknown words covered in [Chen et al., 2002]:
3-gram unknown words: recall = 0.73
2-gram unknown words: recall = 0.70
3-gram and 2-gram combined: recall = 0.68

[Chen et al., 2002]:
Morphological rules only: F1 = 0.62 (precision = 0.92, recall = 0.47)
Statistical rules only: F1 = 0.52 (precision = 0.78, recall = 0.39)
Combination: F1 = 0.77 (precision = 0.89, recall = 0.68)

Page 32: Unknown Word 08


SVM testing result

For general purpose:

N-gram                         F1 score   Precision   Recall
Only 4-gram                    0.164      0.1         0.57
Only 3-gram                    0.377      0.257       0.70
Only 2-gram                    0.587      0.492       0.73
Three n-gram models combined   0.524      0.457       0.614

Page 33: Unknown Word 08

Ongoing Experiments

Two experimental directions:

1. Sampling policy <[Seyda et al., 2007]>: in SVM, the instances close to the hyper-plane are the most informative for learning. We use Weka's classification confidence, splitting the whole training data to obtain confidence values.

2. Ensemble methods: Bagging and AdaBoost.

Sample of the Weka prediction output used for confidence:

inst#   actual   predicted   error   prediction
1       2:-1     2:-1        -       0.984
2       1:1      1:1         -       0.933
…
116     2:-1     1:1         +       0.505
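The confidence column of this dump is what the sampling policy consumes; a parsing sketch, following the line format shown above:

    def low_confidence_instances(weka_lines, threshold=0.95):
        # Keep instance numbers whose prediction confidence is below the
        # threshold, i.e. the instances near the decision boundary.
        keep = []
        for line in weka_lines:
            fields = line.split()
            if not fields or not fields[0].isdigit():
                continue
            if float(fields[-1]) < threshold:
                keep.append(int(fields[0]))
        return keep

    lines = ["1 2:-1 2:-1 - 0.984", "2 1:1 1:1 - 0.933", "116 2:-1 1:1 + 0.505"]
    print(low_confidence_instances(lines))  # [2, 116]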

Page 34: Unknown Word 08

Result (inside):

Gram   Sample by                          Algorithm       Precision   Recall   F-Measure
2      P:N = 1:2                          Libsvm          0.637       0.716    0.674
2      Confidence=0.95 + error + all p    Libsvm          0.759       0.612    0.678
3      P:N = 1:4                          Libsvm          0.717       0.722    0.72
3      Confidence=0.97 + all p            Libsvm          0.829       0.674    0.743
3      Confidence=0.97 + all p            Bagging (SMO)   0.825       0.688    0.75