Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches
黃瀚萱, 2008
Outline
- Motivation and Goals
- Related Work
- System Design (HMMs & CRFs)
- Experiments
- Conclusion
2/38
3/38
4/38
CCSS: Classical Chinese Sentence Segmentation
Almost all pre-20th-century Chinese was written without any punctuation marks: nothing separates words from words, phrases from phrases, or sentences from sentences. Explicit boundaries of sentences and clauses are lacking, so readers have to identify these boundaries themselves while reading.
5/38
Example of CCSS from Zhuangzi (莊子)
北冥有魚其名為鯤鯤之大不知其幾千里也化而為鳥其名為鵬鵬之背不知其幾千里也怒而飛其翼若垂天之雲是鳥也海運則將徙於南冥南冥者天池也
北冥有魚.其名為鯤.鯤之大.不知其幾千里也.化而為鳥.其名為鵬.鵬之背.不知其幾千里也.怒而飛.其翼若垂天之雲.是鳥也.海運則將徙於南冥.南冥者.天池也.
6/38
Challenges
CCSS is not a trivial problem; it is inherently ambiguous:
道 / 可道 / 非常道 / 名 / 可名 / 非常名
道可道 / 非常道 / 名可名 / 非常名
It is difficult to construct a set of rules or a practical procedure for CCSS. Readers perform CCSS instinctively: they rely on their experience and sense of the language rather than on a systematic procedure.
7/38
Automated CCSS
Innumerable documents in Classical Chinese from the centuries of Chinese history remain to be segmented. To aid in processing these documents, an automated CCSS system is proposed, enabling segmentation tasks to be completed quickly and accurately.
8/38
Research Goals
- Evaluation metrics
- Datasets: training data and benchmarking
- Statistical segmenters
9/38
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
10/38
Related Areas

[Diagram: CCSS drawing on NLP, machine learning, and linguistics, via classifiers, chunking, tagging, sentence boundary detection (SBD), and Chinese word segmentation (CWS)]

11/38
Useful Chinese Features
Chinese characters:
- 則 and 而 usually appear at the head of sentences; 也 and 矣 usually appear at the tail of sentences.
Phonology:
- 反切, 平仄, 擬音
POS:
- Verbs, nouns, adjectives, adverbs, etc.
Antithesis and couplets:
- 道 / 可道 / 非常道 / 名 / 可名 / 非常名
12/38
Sentence Boundary Detection
Distinguish the periods used as end-of-sentence indicators from other uses:
- Parts of abbreviations, e.g. Dr. Wang.
- Decimal points, e.g. 1.618.
- Ellipses, e.g. "I don't know…"
Metrics for segmentation: F-measure and the NIST-SU error rate.
SBD in speech.
13/38
Chinese Word Segmentation
Identify the boundaries of the words in a given text. A non-trivial problem; inherently ambiguous:
日 文章 魚 怎麼 說
日文 章魚 怎麼 說
Known words can be handled with a dictionary.
Segmentation by character tagging [Xue, 2003]:
日 文 章 魚 怎 麼 說
LL RR LL RR LL RR LR
14/38
Part-of-Speech Tagging
Tagging the words of a sentence with their word classes:
The[AT] representative[NN] put[VBD] chairs[NNS] on[IN] the[AT] table[NN].
Tagging information other than word class:
- Position-of-character tagging.
Classical Chinese POS tagging [Huang et al., 2002]:
- Focused on sentence-segmented text.
15/38
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
16/38
CCSS Framework

[Diagram: a dataset is split into training data and test data; training produces a segmentation model; testing the model yields outcomes that are scored against measure metrics for performance measurement]

17/38
Sequential Data
Transform the sentence segmentation task into a character labeling task. Tag each character with one of four position-of-character tags:
- Left boundary (LL)
- Middle character (MM)
- Right boundary (RR)
- Single-character clause (LR)
北冥有魚 / 其名為鯤 / 鯤之大 / 不知其幾千里也
18/38
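The four-tag scheme above can be sketched in code. A minimal illustration (my own, not code from the thesis) that converts a list of segmented clauses into per-character tags:

```python
def clauses_to_tags(clauses):
    """Map each character of the segmented clauses to LL / MM / RR / LR."""
    tags = []
    for clause in clauses:
        if len(clause) == 1:
            tags.append("LR")                         # single-character clause
        else:
            tags.append("LL")                         # left boundary
            tags.extend("MM" for _ in clause[1:-1])   # middle characters
            tags.append("RR")                         # right boundary
    return tags

# Example from the slide: 北冥有魚 / 其名為鯤
print(clauses_to_tags(["北冥有魚", "其名為鯤"]))
# → ['LL', 'MM', 'MM', 'RR', 'LL', 'MM', 'MM', 'RR']
```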
Markov Chain for CCSS

[Diagram: state machine with Start and Finish states plus the four tag states LL, MM, RR, and LR, illustrated on the characters 北, 冥, 有, 魚]
19/38
Sequence Labeling Models
- Hidden Markov Models
- Maximum Entropy
- Conditional Random Fields [Lafferty, 2001], with the averaged perceptron [Collins, 2002]
- Large-margin methods: Support Vector Machines, AdaBoost
20/38
Conditional Random Fields
The model:

P(y|x) = (1/Z_x) exp( Σ_{t=1}^{T} Σ_k λ_k f_k(y_{t-1}, y_t, x, t) )

e.g. P("LL, MM, MM, RR" | "北冥有魚")
Tagging x with the Viterbi algorithm:

y* = argmax_y P(y|x)

λ is estimated by the averaged perceptron algorithm.
21/38
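Viterbi decoding for the tagging step can be sketched as follows. This is a generic linear-chain Viterbi in log space, not the thesis implementation; the toy transition and end scores at the bottom are illustrative assumptions (the LR tag is omitted for brevity).

```python
def viterbi(obs, states, start, end, trans, emit):
    """Return the highest-scoring tag sequence for obs (log scores).
    Unseen emissions get a neutral 0.0 score in this toy setup."""
    V = [{s: start.get(s, -1e9) + emit.get(s, {}).get(obs[0], 0.0) for s in states}]
    back = []
    for ch in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + trans.get(p, {}).get(s, -1e9))
            row[s] = V[-1][prev] + trans.get(prev, {}).get(s, -1e9) + emit.get(s, {}).get(ch, 0.0)
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s] + end.get(s, -1e9))  # best legal final state
    path = [last]
    for ptr in reversed(back):          # follow the back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy scores: clauses start with LL, continue with MM, and end with RR.
states = ["LL", "MM", "RR"]
start, end = {"LL": 0.0}, {"RR": 0.0}
trans = {"LL": {"MM": -0.5, "RR": -2.0},
         "MM": {"MM": -0.7, "RR": -0.7},
         "RR": {"LL": -2.0}}
print(viterbi("北冥有魚", states, start, end, trans, {}))
# → ['LL', 'MM', 'MM', 'RR']
```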
Datasets
Focused on the corpora of the Pre-Qin and Han Dynasties (先秦兩漢):
- The foundation of later Chinese.
- Simpler syntax and shorter sentences.
- Words largely consist of a single character.
Also: Qing Palace Memorials (奏摺).
22/38
Dataset Statistics

Dataset         Paragraphs   Characters   Distinct characters   Clauses
論語                   500        15982                  1368      4015
孟子                   260        35392                  1916      7351
莊子                  1128        65165                  2936     12574
春秋左傳              3381       195983                  3238     47281
春秋公羊傳            1804        44352                  1638     11151
春秋穀梁傳            1801        40711                  1585     10946
史記                  4778       503890                  4788     99792
上古漢語混合          1250        97476                  3489     20573
清代奏摺              1000       111739                  3147     15521
23/38
Evaluation Metrics
Specificity (1 − fallout):
- The probability of the true negative cases.
F-measure (F1):
- The harmonic mean of precision and recall.
NIST-SU error rate:
- The ratio of wrongly segmented boundaries to the reference boundaries.
- Can exceed 100% when mis-segmentation is severe.
24/38
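The two headline metrics above can be computed from clause-boundary positions. A hedged sketch of the definitions (my own formulation, not the thesis code):

```python
def boundary_metrics(reference, predicted):
    """reference, predicted: sets of character indices where a clause boundary falls."""
    tp = len(reference & predicted)        # correctly placed boundaries
    fp = len(predicted - reference)        # spurious boundaries (insertions)
    fn = len(reference - predicted)        # missed boundaries (deletions)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # NIST-SU error rate: wrong boundaries over reference boundaries;
    # exceeds 100% when insertions are numerous enough.
    nist_su = (fp + fn) / len(reference) if reference else 0.0
    return f1, nist_su

f1, err = boundary_metrics({4, 8, 11}, {4, 8, 13})
# one boundary missed (11) and one inserted (13): F1 = 2/3, NIST-SU = 2/3
```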
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
25/38
Experiment Design
Experiment 1:
- Evaluate the performance of HMMs and CRFs.
- 10-fold cross-validation.
Experiment 2:
- Find the best training data among the ancient Chinese datasets.
- Train the system on one dataset, and test it on the others.
Experiment 3:
- Cross-era evaluation.
- Train the system on the data from the Qing Dynasty and test it on the data from the Pre-Qin and Han Dynasties, and vice versa.
- 以古鑑近？以近鑑古？ (Judge the recent by the ancient? Judge the ancient by the recent?)
26/38
Result: Experiment 1

Dataset      HMM Specificity  HMM F1   HMM NIST-SU  CRF Specificity  CRF F1   CRF NIST-SU
論語 93.01% 73.84% 49.84% 93.23% 78.52% 42.63%
孟子 94.49% 70.86% 54.35% 91.95% 75.08% 52.90%
莊子 93.72% 70.48% 57.25% 94.13% 76.37% 48.20%
春秋左傳 94.57% 83.55% 33.70% 95.16% 88.25% 25.60%
春秋公羊傳 95.78% 88.52% 24.22% 97.83% 93.60% 13.53%
春秋穀梁傳 95.37% 86.92% 27.10% 97.23% 92.12% 16.19%
史記 91.68% 60.87% 75.60% 92.02% 72.69% 58.02%
清代奏摺 96.70% 73.19% 50.68% 98.68% 78.54% 35.24%
上古漢語混合 93.00% 69.30% 59.05% 90.56% 73.44% 58.93%
Overall 94.26% 75.28% 47.98% 94.53% 80.96% 39.03%
Evaluation in 10-fold cross-validation; 5 generations of CRF averaged perceptron with 100K feature functions.
27/38
Result: Experiment 2

Training Data  HMM Specificity  HMM F1   HMM NIST-SU  CRF Specificity  CRF F1   CRF NIST-SU
論語 90.36% 57.61% 80.96% 86.80% 58.93% 88.16%
孟子 92.57% 59.10% 72.99% 89.12% 63.90% 75.12%
莊子 92.77% 60.68% 70.50% 91.00% 65.08% 68.97%
春秋左傳 90.32% 63.57% 73.77% 87.37% 67.00% 75.93%
史記 93.56% 68.86% 57.97% 93.23% 74.11% 51.19%
Overall 91.92% 61.96% 71.24% 89.50% 65.80% 71.87%
5 generations of CRF averaged perceptron with 100K feature functions.
28/38
Experiment 3a: 以古鑑近 (judging the recent by the ancient)

[Diagram: HMMs/CRFs trained and validated on 論語, 孟子, 莊子, 左傳, and 史記; tested on 清代奏摺]
29/38
Result: Experiment 3a

Training Data  HMM Specificity  HMM F1   HMM NIST-SU  CRF Specificity  CRF F1   CRF NIST-SU
論語 90.40% 41.05% 123.42% 79.17% 37.26% 195.26%
孟子 91.20% 42.00% 119.78% 79.03% 37.22% 194.35%
莊子 90.70% 42.50% 120.34% 83.13% 40.75% 166.21%
春秋左傳 84.31% 34.36% 168.87% 76.16% 37.66% 209.51%
史記 88.00% 40.35% 138.59% 83.24% 41.82% 165.20%
上古漢語混合 87.06% 38.85% 146.95% 80.88% 38.77% 182.38%
Overall 88.61% 39.85% 136.33% 80.27% 38.91% 185.49%
30/38
Experiment 3b: 以近鑑古 (judging the ancient by the recent)

[Diagram: HMMs/CRFs trained and validated on 清代奏摺; tested on 論語, 孟子, 莊子, 左傳, and 史記]
31/38
Result: Experiment 3b

Training Data  HMM Specificity  HMM F1   HMM NIST-SU  CRF Specificity  CRF F1   CRF NIST-SU
論語 91.32% 47.15% 87.23% 94.67% 43.12% 83.79%
孟子 91.55% 46.95% 90.74% 95.01% 42.13% 86.50%
莊子 91.31% 48.35% 91.24% 94.63% 46.52% 83.46%
春秋左傳 95.16% 50.41% 73.65% 97.73% 49.01% 68.96%
史記 93.18% 37.73% 97.01% 96.49% 34.35% 88.89%
上古漢語混合 92.57% 46.59% 87.16% 95.76% 42.72% 82.50%
Overall 92.52% 46.20% 87.84% 95.72% 42.98% 82.35%
32/38
Outline Motivation and Goals Related Work System Design
HMMs & CRFs Experiments Conclusion
33/38
Overall
Built an automated CCSS system. Three tasks were completed during system development:
- A set of evaluation metrics.
- A set of datasets.
- Two segmentation models: HMMs and CRFs.
34/38
Datasets
Evaluated classics from the 5th century BCE to the 19th century CE, including 論語, 孟子, 莊子, 左傳, and 史記.
The system maintains its performance on test data that differs from the training data, but the gap between the eras in which the test and training data were written cannot be too great.
史記 is the best dataset for training.
35/38
Segmentation Models
Overall performance:

Model  Specificity  F-measure  NIST-SU error rate
HMMs   94.26%       75.28%     47.98%
CRFs   94.53%       80.96%     39.03%

Model comparison:

Model  Correctness  Training time  Run time  Sensitivity to training data
HMMs   Average      Fast           Fast      Insignificant
CRFs   Better       Slow           Fast      Sensitive
36/38
Future Work
- Apply more Chinese features: phonology, POS, and antithesis.
- Integrate pre-defined rules: names, places, dates, numbers.
- Mix several datasets to obtain a more general, robust dataset.
37/38
THE END
38/38
Conditional Random Fields
The model:

P(y|x) = (1/Z_x) exp( Σ_{t=1}^{T} Σ_k λ_k f_k(y_{t-1}, y_t, x, t) )

e.g. P("LL, MM, MM, RR" | "北冥有魚")
Feature functions f(y_{t-1}, y_t, x, t):
- f(MM, RR, '曰') = 1: '曰' appears at the end of clauses.
- f(RR, '孟') = 0: '孟' never appears at the end of a clause.
λ_k determines the importance of f_k; the λ_k are learned from the data.
39/38
Conditional Random Fields (cont.)
Tagging x:

y* = argmax_y P(y|x)

- Implemented with the Viterbi algorithm.
Parameter estimation (determining the λ values):
- Guaranteed to converge, but with no closed-form solution; must be approximated iteratively.
- High complexity, slow, and hard to implement: GIS, IIS, L-BFGS, etc.
- The averaged perceptron replaces these traditional numerical methods.
40/38
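The averaged-perceptron substitution [Collins, 2002] can be sketched generically. The `decode` and `features` callbacks stand in for Viterbi decoding and the feature templates; the function and the toy usage below are my own illustration, not the thesis code:

```python
from collections import defaultdict

def averaged_perceptron(examples, decode, features, epochs=5):
    """Structured perceptron with parameter averaging.
    examples: (x, gold_y) pairs; decode(x, w): best y under weights w;
    features(x, y): dict mapping feature identifiers to counts."""
    w = defaultdict(float)        # current weights
    total = defaultdict(float)    # running sums for the average
    steps = 0
    for _ in range(epochs):
        for x, gold in examples:
            pred = decode(x, w)
            if pred != gold:
                for f, v in features(x, gold).items():
                    w[f] += v     # reward gold-standard features
                for f, v in features(x, pred).items():
                    w[f] -= v     # penalize predicted features
            for f, v in w.items():
                total[f] += v     # accumulate after every example
            steps += 1
    return {f: v / steps for f, v in total.items()}

# Toy usage: two inputs, two labels, indicator features.
labels = ["B", "A"]
feats = lambda x, y: {(x, y): 1.0}
dec = lambda x, w: max(labels, key=lambda y: w.get((x, y), 0.0))
avg = averaged_perceptron([("a", "A"), ("b", "B")], dec, feats)
```

Averaging the weights over all updates, rather than keeping only the final vector, is what makes the perceptron a practical stand-in for the slower CRF optimizers.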
Feature Templates

Template                 Example
y_{i-1}, y_i             LL, MM
y_i, w_i                 北冥有魚其名為鯤
w_{i-2}, y_i             北冥有魚其名為鯤
w_{i-1}, y_i             北冥有魚其名為鯤
w_{i+1}, y_i             北冥有魚其名為鯤
w_{i+2}, y_i             北冥有魚其名為鯤
w_{i-2}, w_{i-1}, y_i    北冥有魚其名為鯤
w_{i-1}, w_i, y_i        北冥有魚其名為鯤
w_i, w_{i+1}, y_i        北冥有魚其名為鯤
w_{i+1}, w_{i+2}, y_i    北冥有魚其名為鯤
(each template picks out the corresponding characters of 北冥有魚其名為鯤)

41/38
Raw Results (Better Case)
42/38
齊宣王問曰.文王之囿.方七十里.有諸.孟子對曰.於傳有之.曰.若是其大乎.曰.民猶以為小也.曰.寡人之囿.方四十里.民猶以為大.何也.曰.文王之 囿.方七十里.芻蕘者往焉.雉兔者往焉.與民同之.民以為小.不亦宜乎.臣始至於境.問國之大禁.然後敢入.臣聞郊關之內.有囿方四十里.殺其麋鹿者.如 殺人之罪.則是方四十里為阱於國中.民以為大.不亦宜乎.
齊宣王問曰.文王之囿方七十里.有諸孟子對曰.於傳有之.曰.若是其大乎.曰.民猶以為小也.曰.寡人之囿.方四十里.民猶以為大.何也.曰.文王之囿方七十里.芻蕘者往焉.雉兔者往焉.與民同之.民以為小.不亦宜乎.臣始至於境.問國之大禁.然後敢入.臣聞郊關之內.有囿方四十里.殺其麋鹿者.如殺人之罪.則是方四十里.為阱於國中.民以為大.不亦宜乎.
Training Data: 史記, Test Data: 孟子
Raw Results (Worse Case)
43/38
孟子曰.三代之得天下也.以仁.其失天下也以不仁.國之所以廢興存亡者亦然.天子不仁.不保四海.諸侯不仁.不保社稷.卿大夫不仁.不保宗廟.士庶人不仁.不保四體.今惡死亡而樂不仁.是由惡醉而強酒.
孟子曰.三代之.得天下也.以仁其失天下也.以不仁國之所以廢興.存亡者.亦然.天子不仁.不保四海.諸侯不仁.不保社.稷卿大夫.不仁不保.宗廟士.庶人不仁.不保四體.今惡死亡.而樂不仁.是由惡醉.而強酒.
Training Data: 史記, Test Data: 孟子
44/38
Evaluation Measures

[Diagram: 2×2 confusion matrix with cells tp, fp, fn, and tn]

47/38
Statistical Machine Learning

[Diagram: training data feeds the learner (model), which maps input to output]

48/38
Conclusion: Evaluation Metrics
Primary reference metrics:
- Specificity
- F-measure (F1): the harmonic mean of recall and precision
- NIST-SU error rate
ROC curves: evaluate recall and specificity simultaneously; used to compare multiple segmentation results.
49/38
Statistical Segmenter Design
Statistical approach: the segmentation system can be viewed as a machine learner. Rather than having rules pre-defined by hand, it learns by adjusting itself from large amounts of training data.
Training-data requirements: generality and sufficient quantity.
50/38
Studies of Grammar and Segmentation in Classical Chinese
- Exegesis (訓詁學): the study of function words (虛字); 《爾雅》:〈釋詁〉,〈釋言〉,〈釋訓〉.
- Late Qing: Ma Jianzhong's (馬建忠) 《馬氏文通》 built a grammar of Classical Chinese modeled on Western grammar.
- Early Republican era: Yang Shuda's (楊樹達) 《詞詮》 classified the function words.
- Yang Shuda's 《古書句讀釋例》 examined the causes of misreading.
- No digitized versions of these documents exist yet.
52/38