Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches
黃瀚萱, 2008
Outline
- Motivation and Goals
- Related Work
- System Design (HMMs & CRFs)
- Experiments
- Conclusion
2/38
3/38
4/38
CCSS: Classical Chinese Sentence Segmentation
Almost all pre-20th-century Chinese was written without any punctuation marks: nothing separates words from words, phrases from phrases, or sentences from sentences. Explicit boundaries of sentences and clauses are lacking, so readers have to identify these boundaries themselves while reading.
5/38
Example of CCSS from Zhuangzi (莊子)
北冥有魚其名為鯤鯤之大不知其幾千里也化而為鳥其名為鵬鵬之背不知其幾千里也怒而飛其翼若垂天之雲是鳥也海運則將徙於南冥南冥者天池也
北冥有魚.其名為鯤.鯤之大.不知其幾千里也.化而為鳥.其名為鵬.鵬之背.不知其幾千里也.怒而飛.其翼若垂天之雲.是鳥也.海運則將徙於南冥.南冥者.天池也.
6/38
Challenges
CCSS is not a trivial problem; it is inherently ambiguous:
道 / 可道 / 非常道 / 名 / 可名 / 非常名
道可道 / 非常道 / 名可名 / 非常名
It is difficult to construct a set of rules or a practical procedure for CCSS. Readers perform CCSS instinctively: they rely on their experience and sense of the language rather than on a systematic procedure.
7/38
Automated CCSS
Innumerable documents in Classical Chinese from the centuries of Chinese history remain to be segmented. To aid in processing these documents, an automated CCSS system is proposed, enabling segmentation tasks to be completed quickly and accurately.
8/38
Research Goals
- Evaluation metrics
- Datasets: training data and benchmarking
- Statistical segmenters
9/38
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
10/38
Related Areas

[Diagram: CCSS drawing on NLP, machine learning, and linguistics, via classifiers, chunking, tagging, sentence boundary detection (SBD), and Chinese word segmentation (CWS)]

11/38
Useful Chinese Features
Chinese characters:
- 則 and 而 usually appear at the head of sentences; 也 and 矣 usually appear at the tail of sentences.
Phonology:
- 反切, 平仄, 擬音
POS:
- Verbs, nouns, adjectives, adverbs, etc.
Antithesis and couplets:
- 道 / 可道 / 非常道 / 名 / 可名 / 非常名
12/38
Sentence Boundary Detection
Distinguish the periods used as end-of-sentence indicators from other uses:
- Parts of abbreviations, e.g. Dr. Wang.
- Decimal points, e.g. 1.618.
- Ellipses, e.g. "I don't know…"
Metrics for segmentation: F-measure and the NIST-SU error rate.
SBD in speech.
13/38
Chinese Word Segmentation
Identify the boundaries of the words in a given text. A non-trivial problem; inherently ambiguous:
日 文章 魚 怎麼 說
日文 章魚 怎麼 說
Known words can be handled with a dictionary.
Segmentation by character tagging [Xue, 2003]:
日 文 章 魚 怎 麼 說
LL RR LL RR LL RR LR
14/38
Part-of-Speech Tagging
Tagging the words of a sentence with their word classes:
The[AT] representative[NN] put[VBD] chairs[NNS] on[IN] the[AT] table[NN].
Tagging information other than word class:
- Position-of-character tagging.
Classical Chinese POS tagging [Huang et al., 2002]:
- Focused on sentence-segmented text.
15/38
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
16/38
CCSS Framework

[Diagram: a dataset is split into training data and test data; training produces a segmentation model; testing the model yields outcomes that are scored against measure metrics for performance measurement]

17/38
Sequential Data
Transform the sentence segmentation task into a character labeling task. Tag each character with one of four position-of-character tags:
- Left boundary (LL)
- Middle character (MM)
- Right boundary (RR)
- Single-character clause (LR)
北冥有魚 / 其名為鯤 / 鯤之大 / 不知其幾千里也
18/38
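The four-tag scheme above can be sketched in code. A minimal illustration (my own, not code from the thesis) that converts a list of segmented clauses into per-character tags:

```python
def clauses_to_tags(clauses):
    """Map each character of the segmented clauses to LL / MM / RR / LR."""
    tags = []
    for clause in clauses:
        if len(clause) == 1:
            tags.append("LR")                         # single-character clause
        else:
            tags.append("LL")                         # left boundary
            tags.extend("MM" for _ in clause[1:-1])   # middle characters
            tags.append("RR")                         # right boundary
    return tags

# Example from the slide: 北冥有魚 / 其名為鯤
print(clauses_to_tags(["北冥有魚", "其名為鯤"]))
# → ['LL', 'MM', 'MM', 'RR', 'LL', 'MM', 'MM', 'RR']
```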
Markov Chain for CCSS

[Diagram: state machine with Start and Finish states plus the four tag states LL, MM, RR, and LR, illustrated on the characters 北, 冥, 有, 魚]
19/38
Sequence Labeling Models
- Hidden Markov Models
- Maximum Entropy
- Conditional Random Fields [Lafferty, 2001], with the averaged perceptron [Collins, 2002]
- Large-margin methods: Support Vector Machines, AdaBoost
20/38
Conditional Random Fields
The model:

P(y|x) = (1/Z_x) exp( Σ_{t=1}^{T} Σ_k λ_k f_k(y_{t-1}, y_t, x, t) )

e.g. P("LL, MM, MM, RR" | "北冥有魚")
Tagging x with the Viterbi algorithm:

y* = argmax_y P(y|x)

λ is estimated by the averaged perceptron algorithm.
21/38
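Viterbi decoding for the tagging step can be sketched as follows. This is a generic linear-chain Viterbi in log space, not the thesis implementation; the toy transition and end scores at the bottom are illustrative assumptions (the LR tag is omitted for brevity).

```python
def viterbi(obs, states, start, end, trans, emit):
    """Return the highest-scoring tag sequence for obs (log scores).
    Unseen emissions get a neutral 0.0 score in this toy setup."""
    V = [{s: start.get(s, -1e9) + emit.get(s, {}).get(obs[0], 0.0) for s in states}]
    back = []
    for ch in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + trans.get(p, {}).get(s, -1e9))
            row[s] = V[-1][prev] + trans.get(prev, {}).get(s, -1e9) + emit.get(s, {}).get(ch, 0.0)
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s] + end.get(s, -1e9))  # best legal final state
    path = [last]
    for ptr in reversed(back):          # follow the back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy scores: clauses start with LL, continue with MM, and end with RR.
states = ["LL", "MM", "RR"]
start, end = {"LL": 0.0}, {"RR": 0.0}
trans = {"LL": {"MM": -0.5, "RR": -2.0},
         "MM": {"MM": -0.7, "RR": -0.7},
         "RR": {"LL": -2.0}}
print(viterbi("北冥有魚", states, start, end, trans, {}))
# → ['LL', 'MM', 'MM', 'RR']
```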
Datasets
Focused on the corpora of the Pre-Qin and Han Dynasties (先秦兩漢):
- The foundation of later Chinese.
- Simpler syntax and shorter sentences.
- Words largely consist of a single character.
Also: Qing Palace Memorials (奏摺).
22/38
Dataset Statistics

Dataset         Paragraphs   Characters   Distinct characters   Clauses
論語                   500        15982                  1368      4015
孟子                   260        35392                  1916      7351
莊子                  1128        65165                  2936     12574
春秋左傳              3381       195983                  3238     47281
春秋公羊傳            1804        44352                  1638     11151
春秋穀梁傳            1801        40711                  1585     10946
史記                  4778       503890                  4788     99792
上古漢語混合          1250        97476                  3489     20573
清代奏摺              1000       111739                  3147     15521
23/38
Evaluation Metrics
Specificity (1 − fallout):
- The probability of the true negative cases.
F-measure (F1):
- The harmonic mean of precision and recall.
NIST-SU error rate:
- The ratio of wrongly segmented boundaries to the reference boundaries.
- Can exceed 100% when mis-segmentation is severe.
24/38
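The two headline metrics above can be computed from clause-boundary positions. A hedged sketch of the definitions (my own formulation, not the thesis code):

```python
def boundary_metrics(reference, predicted):
    """reference, predicted: sets of character indices where a clause boundary falls."""
    tp = len(reference & predicted)        # correctly placed boundaries
    fp = len(predicted - reference)        # spurious boundaries (insertions)
    fn = len(reference - predicted)        # missed boundaries (deletions)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # NIST-SU error rate: wrong boundaries over reference boundaries;
    # exceeds 100% when insertions are numerous enough.
    nist_su = (fp + fn) / len(reference) if reference else 0.0
    return f1, nist_su

f1, err = boundary_metrics({4, 8, 11}, {4, 8, 13})
# one boundary missed (11) and one inserted (13): F1 = 2/3, NIST-SU = 2/3
```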
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
25/38
Experiment Design
Experiment 1:
- Evaluate the performance of HMMs and CRFs.
- 10-fold cross-validation.
Experiment 2:
- Find the best training data among the ancient Chinese datasets.
- Train the system on one dataset, and test it on the others.
Experiment 3:
- Cross-era evaluation.
- Train the system on the data from the Qing Dynasty and test it on the data from the Pre-Qin and Han Dynasties, and vice versa.
- 以古鑑近？以近鑑古？ (Judge the recent by the ancient? Judge the ancient by the recent?)
26/38
Result: Experiment 1

Dataset      HMM Specificity  HMM F1   HMM NIST-SU  CRF Specificity  CRF F1   CRF NIST-SU
論語 93.01% 73.84% 49.84% 93.23% 78.52% 42.63%
孟子 94.49% 70.86% 54.35% 91.95% 75.08% 52.90%
莊子 93.72% 70.48% 57.25% 94.13% 76.37% 48.20%
春秋左傳 94.57% 83.55% 33.70% 95.16% 88.25% 25.60%
春秋公羊傳 95.78% 88.52% 24.22% 97.83% 93.60% 13.53%
春秋穀梁傳 95.37% 86.92% 27.10% 97.23% 92.12% 16.19%
史記 91.68% 60.87% 75.60% 92.02% 72.69% 58.02%
清代奏摺 96.70% 73.19% 50.68% 98.68% 78.54% 35.24%
上古漢語混合 93.00% 69.30% 59.05% 90.56% 73.44% 58.93%
Overall 94.26% 75.28% 47.98% 94.53% 80.96% 39.03%
Evaluation in 10-fold cross-validation; 5 generations of CRF averaged perceptron with 100K feature functions.
27/38
Result: Experiment 2

Training Data  HMM Specificity  HMM F1   HMM NIST-SU  CRF Specificity  CRF F1   CRF NIST-SU
論語 90.36% 57.61% 80.96% 86.80% 58.93% 88.16%
孟子 92.57% 59.10% 72.99% 89.12% 63.90% 75.12%
莊子 92.77% 60.68% 70.50% 91.00% 65.08% 68.97%
春秋左傳 90.32% 63.57% 73.77% 87.37% 67.00% 75.93%
史記 93.56% 68.86% 57.97% 93.23% 74.11% 51.19%
Overall 91.92% 61.96% 71.24% 89.50% 65.80% 71.87%
5 generations of CRF averaged perceptron with 100K feature functions.
28/38
Experiment 3a: 以古鑑近 (judging the recent by the ancient)

[Diagram: HMMs/CRFs trained and validated on 論語, 孟子, 莊子, 左傳, and 史記; tested on 清代奏摺]
29/38
Result: Experiment 3a

Training Data  HMM Specificity  HMM F1   HMM NIST-SU  CRF Specificity  CRF F1   CRF NIST-SU
論語 90.40% 41.05% 123.42% 79.17% 37.26% 195.26%
孟子 91.20% 42.00% 119.78% 79.03% 37.22% 194.35%
莊子 90.70% 42.50% 120.34% 83.13% 40.75% 166.21%
春秋左傳 84.31% 34.36% 168.87% 76.16% 37.66% 209.51%
史記 88.00% 40.35% 138.59% 83.24% 41.82% 165.20%
上古漢語混合 87.06% 38.85% 146.95% 80.88% 38.77% 182.38%
Overall 88.61% 39.85% 136.33% 80.27% 38.91% 185.49%
30/38
Experiment 3b: 以近鑑古 (judging the ancient by the recent)

[Diagram: HMMs/CRFs trained and validated on 清代奏摺; tested on 論語, 孟子, 莊子, 左傳, and 史記]
31/38
Result: Experiment 3b

Training Data  HMM Specificity  HMM F1   HMM NIST-SU  CRF Specificity  CRF F1   CRF NIST-SU
論語 91.32% 47.15% 87.23% 94.67% 43.12% 83.79%
孟子 91.55% 46.95% 90.74% 95.01% 42.13% 86.50%
莊子 91.31% 48.35% 91.24% 94.63% 46.52% 83.46%
春秋左傳 95.16% 50.41% 73.65% 97.73% 49.01% 68.96%
史記 93.18% 37.73% 97.01% 96.49% 34.35% 88.89%
上古漢語混合 92.57% 46.59% 87.16% 95.76% 42.72% 82.50%
Overall 92.52% 46.20% 87.84% 95.72% 42.98% 82.35%
32/38
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
33/38
Overall
Built an automated CCSS system. Three tasks were completed during system development:
- A set of evaluation metrics.
- A set of datasets.
- Two segmentation models: HMMs and CRFs.
34/38
Datasets
Evaluated classics from the 5th century BCE to the 19th century CE, including 論語, 孟子, 莊子, 左傳, and 史記.
The system maintains its performance on test data that differs from the training data, but the gap between the eras in which the test and training data were written cannot be too great.
史記 is the best dataset for training.
35/38
Segmentation Models
Overall performance:

Model  Specificity  F-measure  NIST-SU error rate
HMMs   94.26%       75.28%     47.98%
CRFs   94.53%       80.96%     39.03%

Model comparison:

Model  Correctness  Training time  Run time  Sensitivity to training data
HMMs   Average      Fast           Fast      Insignificant
CRFs   Better       Slow           Fast      Sensitive
36/38
Future Work
- Apply more Chinese features: phonology, POS, and antithesis.
- Integrate pre-defined rules: names, places, dates, numbers.
- Mix several datasets to obtain a more general, robust dataset.
37/38
THE END
38/38
Conditional Random Fields
The model:

P(y|x) = (1/Z_x) exp( Σ_{t=1}^{T} Σ_k λ_k f_k(y_{t-1}, y_t, x, t) )

e.g. P("LL, MM, MM, RR" | "北冥有魚")
Feature functions f(y_{t-1}, y_t, x, t):
- f(MM, RR, '曰') = 1: '曰' appears at the end of clauses.
- f(RR, '孟') = 0: '孟' never appears at the end of a clause.
λ_k determines the importance of f_k; the λ_k are learned from the data.
39/38
Conditional Random Fields (cont.)
Tagging x:

y* = argmax_y P(y|x)

- Implemented with the Viterbi algorithm.
Parameter estimation (determining the λ values):
- Guaranteed to converge, but with no closed-form solution; must be approximated iteratively.
- High complexity, slow, and hard to implement: GIS, IIS, L-BFGS, etc.
- The averaged perceptron replaces these traditional numerical methods.
40/38
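The averaged-perceptron substitution [Collins, 2002] can be sketched generically. The `decode` and `features` callbacks stand in for Viterbi decoding and the feature templates; the function and the toy usage below are my own illustration, not the thesis code:

```python
from collections import defaultdict

def averaged_perceptron(examples, decode, features, epochs=5):
    """Structured perceptron with parameter averaging.
    examples: (x, gold_y) pairs; decode(x, w): best y under weights w;
    features(x, y): dict mapping feature identifiers to counts."""
    w = defaultdict(float)        # current weights
    total = defaultdict(float)    # running sums for the average
    steps = 0
    for _ in range(epochs):
        for x, gold in examples:
            pred = decode(x, w)
            if pred != gold:
                for f, v in features(x, gold).items():
                    w[f] += v     # reward gold-standard features
                for f, v in features(x, pred).items():
                    w[f] -= v     # penalize predicted features
            for f, v in w.items():
                total[f] += v     # accumulate after every example
            steps += 1
    return {f: v / steps for f, v in total.items()}

# Toy usage: two inputs, two labels, indicator features.
labels = ["B", "A"]
feats = lambda x, y: {(x, y): 1.0}
dec = lambda x, w: max(labels, key=lambda y: w.get((x, y), 0.0))
avg = averaged_perceptron([("a", "A"), ("b", "B")], dec, feats)
```

Averaging the weights over all updates, rather than keeping only the final vector, is what makes the perceptron a practical stand-in for the slower CRF optimizers.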
Feature Templates

Template                 Example
y_{i-1}, y_i             LL, MM
y_i, w_i                 北冥有魚其名為鯤
w_{i-2}, y_i             北冥有魚其名為鯤
w_{i-1}, y_i             北冥有魚其名為鯤
w_{i+1}, y_i             北冥有魚其名為鯤
w_{i+2}, y_i             北冥有魚其名為鯤
w_{i-2}, w_{i-1}, y_i    北冥有魚其名為鯤
w_{i-1}, w_i, y_i        北冥有魚其名為鯤
w_i, w_{i+1}, y_i        北冥有魚其名為鯤
w_{i+1}, w_{i+2}, y_i    北冥有魚其名為鯤
(each template picks out the corresponding characters of 北冥有魚其名為鯤)

41/38
Raw Results (Better Case)
42/38
齊宣王問曰.文王之囿.方七十里.有諸.孟子對曰.於傳有之.曰.若是其大乎.曰.民猶以為小也.曰.寡人之囿.方四十里.民猶以為大.何也.曰.文王之 囿.方七十里.芻蕘者往焉.雉兔者往焉.與民同之.民以為小.不亦宜乎.臣始至於境.問國之大禁.然後敢入.臣聞郊關之內.有囿方四十里.殺其麋鹿者.如 殺人之罪.則是方四十里為阱於國中.民以為大.不亦宜乎.
齊宣王問曰.文王之囿方七十里.有諸孟子對曰.於傳有之.曰.若是其大乎.曰.民猶以為小也.曰.寡人之囿.方四十里.民猶以為大.何也.曰.文王之囿方七十里.芻蕘者往焉.雉兔者往焉.與民同之.民以為小.不亦宜乎.臣始至於境.問國之大禁.然後敢入.臣聞郊關之內.有囿方四十里.殺其麋鹿者.如殺人之罪.則是方四十里.為阱於國中.民以為大.不亦宜乎.
Training Data: 史記, Test Data: 孟子
Raw Results (Worse Case)
43/38
孟子曰.三代之得天下也.以仁.其失天下也以不仁.國之所以廢興存亡者亦然.天子不仁.不保四海.諸侯不仁.不保社稷.卿大夫不仁.不保宗廟.士庶人不仁.不保四體.今惡死亡而樂不仁.是由惡醉而強酒.
孟子曰.三代之.得天下也.以仁其失天下也.以不仁國之所以廢興.存亡者.亦然.天子不仁.不保四海.諸侯不仁.不保社.稷卿大夫.不仁不保.宗廟士.庶人不仁.不保四體.今惡死亡.而樂不仁.是由惡醉.而強酒.
Training Data: 史記, Test Data: 孟子
44/38
Evaluation Measures

[Diagram: 2×2 confusion matrix with cells tp, fp, fn, and tn]

47/38
Statistical Machine Learning

[Diagram: training data feeds the learner (model), which maps input to output]

48/38
Conclusion: Evaluation Metrics
Primary reference metrics:
- Specificity
- F-measure (F1): the harmonic mean of recall and precision
- NIST-SU error rate
ROC curves: evaluate recall and specificity simultaneously; used to compare multiple segmentation results.
49/38
Statistical Segmenter Design
Statistical approach: the segmentation system can be viewed as a machine learner. Rather than having rules pre-defined by hand, it learns by adjusting itself from large amounts of training data.
Training-data requirements: generality and sufficient quantity.
50/38
Studies of Grammar and Segmentation in Classical Chinese
- Exegesis (訓詁學): the study of function words (虛字); 《爾雅》:〈釋詁〉,〈釋言〉,〈釋訓〉.
- Late Qing: Ma Jianzhong's (馬建忠) 《馬氏文通》 built a grammar of Classical Chinese modeled on Western grammar.
- Early Republican era: Yang Shuda's (楊樹達) 《詞詮》 classified the function words.
- Yang Shuda's 《古書句讀釋例》 examined the causes of misreading.
- No digitized versions of these documents exist yet.
52/38