Page 1: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

黃瀚萱2008

Page 2: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Outline
Motivation and Goals
Related Work
System Design (HMMs & CRFs)
Experiments
Conclusion

2/38

Page 3: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

3/38

Page 4: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

4/38

Page 5: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

CCSS: Classical Chinese Sentence Segmentation
Almost all pre-20th-century Chinese is written without any punctuation marks.
Nothing separates words from words, phrases from phrases, or sentences from sentences.
Explicit boundaries of sentences and clauses are lacking: readers have to identify these boundaries themselves while reading.

5/38

Page 6: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Example of CCSS from Zhuangzi (莊子)

北冥有魚其名為鯤鯤之大不知其幾千里也化而為鳥其名為鵬鵬之背不知其幾千里也怒而飛其翼若垂天之雲是鳥也海運則將徙於南冥南冥者天池也

北冥有魚.其名為鯤.鯤之大.不知其幾千里也.化而為鳥.其名為鵬.鵬之背.不知其幾千里也.怒而飛.其翼若垂天之雲.是鳥也.海運則將徙於南冥.南冥者.天池也.

6/38

Page 7: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Challenges
CCSS is not a trivial problem; it is inherently ambiguous:
道 / 可道 / 非常道 / 名 / 可名 / 非常名
道可道 / 非常道 / 名可名 / 非常名
It is difficult to construct a set of rules or a practical procedure for CCSS.
Readers perform CCSS instinctively: they rely on their experience and sense of the language rather than on a systematic procedure.

7/38

Page 8: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Automated CCSS
Innumerable documents in Classical Chinese from the centuries of Chinese history remain to be segmented.
To aid in processing these documents, an automated CCSS system is proposed.
It enables segmentation tasks to be completed quickly and accurately.

8/38

Page 9: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Research Goals
Evaluation metrics
Datasets: training data, benchmarking
Statistical segmenters

9/38

Page 10: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Outline
Motivation and Goals
Related Work
System Design (HMMs & CRFs)
Experiments
Conclusion

10/38

Page 11: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Related Areas

(Diagram: CCSS sits at the intersection of NLP and machine learning, drawing on sentence boundary detection (SBD), Chinese word segmentation (CWS), tagging, chunking, classifiers, and linguistics.)

11/38

Page 12: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Useful Chinese Features
Chinese characters: 則 and 而 usually appear at the beginning of sentences; 也 and 矣 usually appear at the end of sentences.
Phonology: 反切, 平仄, 擬音
POS: verbs, nouns, adjectives, adverbs, etc.
Antithesis and couplets: 道 / 可道 / 非常道 / 名 / 可名 / 非常名

12/38

Page 13: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Sentence Boundary Detection
Distinguish periods used as end-of-sentence indicators from other usages:
parts of abbreviations, e.g. Dr. Wang; decimal points, e.g. 1.618; ellipses, e.g. “I don’t know…”
Metrics for segmentation: F-measure and NIST-SU error rate.
SBD in speech.

13/38

Page 14: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Chinese Word Segmentation
Identify the boundaries of the words in a given text.
A non-trivial problem: ambiguity.
日 文章 魚 怎麼 說
日文 章魚 怎麼 說
Words can be handled with a dictionary.
Segmentation by character tagging [Xue, 2003]:
日 文 章 魚 怎 麼 說
LL RR LL RR LL RR LR

14/38

Page 15: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Part-of-Speech Tagging
Tagging the words of a sentence with word classes:
The[AT] representative[NN] put[VBD] chairs[NNS] on[IN] the[AT] table[NN].
Tagging with other information instead of word classes: position-of-character tagging.
Classical Chinese POS tagging [Huang et al., 2002] focused on sentence-segmented text.

15/38

Page 16: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Outline
Motivation and Goals
Related Work
System Design (HMMs & CRFs)
Experiments
Conclusion

16/38

Page 17: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

CCSS Framework

(Diagram: the dataset is split into training data and test data; the training data is used to train the segmentation model; the model is then run on the test data, and the testing outcome is scored with the evaluation metrics to produce a performance measurement.)

17/38

Page 18: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Sequential Data
Transform the sentence segmentation task into a character labeling task.
Tag each character with one of four position-of-character tags:
Left boundary (LL)
Middle character (MM)
Right boundary (RR)
Single-character clause (LR)
北冥有魚 / 其名為鯤 / 鯤之大 / 不知其幾千里也

18/38
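As a minimal illustration (a Python sketch, not the thesis's actual code; the helper name clauses_to_tags is made up here), segmented clauses can be mapped to these four tags as follows:

def clauses_to_tags(clauses):
    """Convert a list of clauses into per-character position tags.

    LL = left boundary, MM = middle character,
    RR = right boundary, LR = single-character clause.
    """
    chars, tags = [], []
    for clause in clauses:
        for i, ch in enumerate(clause):
            chars.append(ch)
            if len(clause) == 1:
                tags.append("LR")
            elif i == 0:
                tags.append("LL")
            elif i == len(clause) - 1:
                tags.append("RR")
            else:
                tags.append("MM")
    return chars, tags

# Example from the slide: 北冥有魚 / 其名為鯤 / 鯤之大
chars, tags = clauses_to_tags(["北冥有魚", "其名為鯤", "鯤之大"])
# tags == ["LL", "MM", "MM", "RR", "LL", "MM", "MM", "RR", "LL", "MM", "RR"]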

Page 19: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Markov Chain for CCSS

(Diagram: a state machine over the tags LL, MM, RR, and LR, with Start and Finish states; characters such as 北, 冥, 有 are emitted along the transitions.)

19/38
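To make the state machine above concrete, the following rough sketch decodes a character sequence with a log-space Viterbi search over the four tags; the dict-of-dicts probability tables are assumed data structures used only for illustration, not the thesis implementation:

import math

TAGS = ["LL", "MM", "RR", "LR"]
SMOOTH = 1e-8  # floor for unseen transitions / emissions

def viterbi(chars, start_p, trans_p, emit_p):
    """Most probable tag sequence for `chars` under a first-order HMM."""
    def emit(tag, ch):
        return math.log(emit_p.get(tag, {}).get(ch, SMOOTH))

    def trans(prev, tag):
        return math.log(trans_p.get(prev, {}).get(tag, SMOOTH))

    # Initialization with the start probabilities.
    V = [{t: math.log(start_p.get(t, SMOOTH)) + emit(t, chars[0]) for t in TAGS}]
    back = [{}]
    for i in range(1, len(chars)):
        V.append({})
        back.append({})
        for t in TAGS:
            prev_best = max(TAGS, key=lambda p: V[i - 1][p] + trans(p, t))
            V[i][t] = V[i - 1][prev_best] + trans(prev_best, t) + emit(t, chars[i])
            back[i][t] = prev_best
    # Trace the best path backwards.
    tag = max(TAGS, key=lambda t: V[-1][t])
    path = [tag]
    for i in range(len(chars) - 1, 0, -1):
        tag = back[i][tag]
        path.insert(0, tag)
    return path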

Page 20: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Sequence Labeling Models
Hidden Markov Models
Maximum Entropy
Conditional Random Fields [Lafferty, 2001], trained with the averaged perceptron [Collins, 2002]
Large-margin methods: Support Vector Machines, AdaBoost

20/38

Page 21: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Conditional Random Fields
The model: P( “LL, MM, MM, RR” | “北冥有魚” )
x is tagged with the Viterbi algorithm.
λ is estimated by the averaged perceptron algorithm.

P(y \mid x) = \frac{1}{Z_x} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big)

y^* = \arg\max_{y} P(y \mid x)

21/38
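The sketch below (illustrative only; the feature function shown merely echoes the example in the appendix slides) shows how the weighted feature sum inside the exponential is accumulated for one candidate tag sequence; exponentiating it and dividing by Z_x gives P(y | x):

def sequence_score(chars, tags, weights, feature_funcs):
    """Sum of lambda_k * f_k(y_{t-1}, y_t, x, t) over all positions t."""
    score = 0.0
    prev = "<START>"
    for t, tag in enumerate(tags):
        for k, f in enumerate(feature_funcs):
            score += weights[k] * f(prev, tag, chars, t)
        prev = tag
    return score

# Illustrative feature: the clause-final character 曰 is tagged RR.
def f_yue_clause_final(prev_tag, tag, chars, t):
    return 1.0 if tag == "RR" and chars[t] == "曰" else 0.0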

Page 22: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Datasets
Focused on the corpora of the Pre-Qin and Han Dynasties (先秦兩漢):
the foundation of later Chinese; simpler syntax; shorter sentences; words largely composed of a single character.
Also the Qing palace memorials (奏摺).

22/38

Page 23: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Dataset Statistics

Dataset | Paragraphs | Characters | Distinct characters | Clauses
論語 | 500 | 15982 | 1368 | 4015
孟子 | 260 | 35392 | 1916 | 7351
莊子 | 1128 | 65165 | 2936 | 12574
春秋左傳 | 3381 | 195983 | 3238 | 47281
春秋公羊傳 | 1804 | 44352 | 1638 | 11151
春秋穀梁傳 | 1801 | 40711 | 1585 | 10946
史記 | 4778 | 503890 | 4788 | 99792
上古漢語混合 | 1250 | 97476 | 3489 | 20573
清代奏摺 | 1000 | 111739 | 3147 | 15521

23/38

Page 24: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Evaluation Metrics
Specificity (1 − fallout): the probability of the true negative cases.
F-measure (F1): the harmonic mean of precision and recall.
NIST-SU error rate: the ratio of wrongly segmented boundaries to the reference boundaries; it can exceed 100% when mis-segmentation is severe.

24/38
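A small sketch of how the three metrics can be computed, assuming boundaries are represented as sets of character positions after which a clause ends (one plausible reading of the definitions above; the helper name is hypothetical):

def evaluate(ref_bounds, hyp_bounds, n_positions):
    """Compute specificity, F1, and NIST-SU error rate.

    ref_bounds / hyp_bounds: sets of positions carrying a clause boundary
    in the reference and in the system output; n_positions: number of
    candidate boundary positions in the text.
    """
    tp = len(ref_bounds & hyp_bounds)
    fp = len(hyp_bounds - ref_bounds)
    fn = len(ref_bounds - hyp_bounds)
    tn = n_positions - tp - fp - fn

    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # NIST-SU: (insertions + misses) / reference boundaries;
    # this can exceed 100% when errors outnumber the reference boundaries.
    nist_su = (fp + fn) / len(ref_bounds) if ref_bounds else 0.0
    return specificity, f1, nist_su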

Page 25: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Outline
Motivation and Goals
Related Work
System Design (HMMs & CRFs)
Experiments
Conclusion

25/38

Page 26: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Experiment Design
Experiment 1: evaluate the performance of HMMs and CRFs with 10-fold cross-validation.
Experiment 2: find the best training data among the ancient Chinese datasets; train the system on one dataset and test it on the others.
Experiment 3: cross-era evaluation; train the system on data from the Qing Dynasty and test it on data from the Pre-Qin and Han Dynasties, and vice versa. 以古鑑近?以近鑑古? (Can the ancient illuminate the recent, and the recent the ancient?)

26/38

Page 27: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Result: Experiment 1

Dataset | HMMs Specificity | HMMs F1 | HMMs NIST-SU | CRFs Specificity | CRFs F1 | CRFs NIST-SU
論語 | 93.01% | 73.84% | 49.84% | 93.23% | 78.52% | 42.63%
孟子 | 94.49% | 70.86% | 54.35% | 91.95% | 75.08% | 52.90%
莊子 | 93.72% | 70.48% | 57.25% | 94.13% | 76.37% | 48.20%
春秋左傳 | 94.57% | 83.55% | 33.70% | 95.16% | 88.25% | 25.60%
春秋公羊傳 | 95.78% | 88.52% | 24.22% | 97.83% | 93.60% | 13.53%
春秋穀梁傳 | 95.37% | 86.92% | 27.10% | 97.23% | 92.12% | 16.19%
史記 | 91.68% | 60.87% | 75.60% | 92.02% | 72.69% | 58.02%
清代奏摺 | 96.70% | 73.19% | 50.68% | 98.68% | 78.54% | 35.24%
上古漢語混合 | 93.00% | 69.30% | 59.05% | 90.56% | 73.44% | 58.93%
Overall | 94.26% | 75.28% | 47.98% | 94.53% | 80.96% | 39.03%

Evaluation with 10-fold cross-validation. CRFs: 5 generations of the averaged perceptron with 100K feature functions.

27/38

Page 28: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Result: Experiment 2

Training Data | HMMs Specificity | HMMs F1 | HMMs NIST-SU | CRFs Specificity | CRFs F1 | CRFs NIST-SU
論語 | 90.36% | 57.61% | 80.96% | 86.80% | 58.93% | 88.16%
孟子 | 92.57% | 59.10% | 72.99% | 89.12% | 63.90% | 75.12%
莊子 | 92.77% | 60.68% | 70.50% | 91.00% | 65.08% | 68.97%
春秋左傳 | 90.32% | 63.57% | 73.77% | 87.37% | 67.00% | 75.93%
史記 | 93.56% | 68.86% | 57.97% | 93.23% | 74.11% | 51.19%
Overall | 91.92% | 61.96% | 71.24% | 89.50% | 65.80% | 71.87%

CRFs: 5 generations of the averaged perceptron with 100K feature functions.

28/38

Page 29: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Experiment 3a: 以古鑑近 (using the ancient to segment the recent)

(Diagram: training data: 論語, 孟子, 莊子, 左傳, 史記; test data: 清代奏摺; validation of HMMs/CRFs.)

29/38

Page 30: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Result: Experiment 3a

Training Data | HMMs Specificity | HMMs F1 | HMMs NIST-SU | CRFs Specificity | CRFs F1 | CRFs NIST-SU
論語 | 90.40% | 41.05% | 123.42% | 79.17% | 37.26% | 195.26%
孟子 | 91.20% | 42.00% | 119.78% | 79.03% | 37.22% | 194.35%
莊子 | 90.70% | 42.50% | 120.34% | 83.13% | 40.75% | 166.21%
春秋左傳 | 84.31% | 34.36% | 168.87% | 76.16% | 37.66% | 209.51%
史記 | 88.00% | 40.35% | 138.59% | 83.24% | 41.82% | 165.20%
上古漢語混合 | 87.06% | 38.85% | 146.95% | 80.88% | 38.77% | 182.38%
Overall | 88.61% | 39.85% | 136.33% | 80.27% | 38.91% | 185.49%

30/38

Page 31: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Experiment 3b: 以近鑑古 (using the recent to segment the ancient)

(Diagram: training data: 清代奏摺; test data: 論語, 孟子, 莊子, 左傳, 史記; validation of HMMs/CRFs.)

31/38

Page 32: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Result: Experiment 3b

Training Data | HMMs Specificity | HMMs F1 | HMMs NIST-SU | CRFs Specificity | CRFs F1 | CRFs NIST-SU
論語 | 91.32% | 47.15% | 87.23% | 94.67% | 43.12% | 83.79%
孟子 | 91.55% | 46.95% | 90.74% | 95.01% | 42.13% | 86.50%
莊子 | 91.31% | 48.35% | 91.24% | 94.63% | 46.52% | 83.46%
春秋左傳 | 95.16% | 50.41% | 73.65% | 97.73% | 49.01% | 68.96%
史記 | 93.18% | 37.73% | 97.01% | 96.49% | 34.35% | 88.89%
上古漢語混合 | 92.57% | 46.59% | 87.16% | 95.76% | 42.72% | 82.50%
Overall | 92.52% | 46.20% | 87.84% | 95.72% | 42.98% | 82.35%

32/38

Page 33: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Outline
Motivation and Goals
Related Work
System Design (HMMs & CRFs)
Experiments
Conclusion

33/38

Page 34: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Overall
Built an automated CCSS system.
Completed three tasks during system development:
a set of evaluation metrics,
a set of datasets,
and two segmentation models (HMMs and CRFs).

34/38

Page 35: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Datasets
Evaluated classics from the 5th century BCE to the 19th century, including 論語, 孟子, 莊子, 左傳, and 史記.
The system maintains its performance on test data that differs from the training data, provided the gap in era between the test data and the training data is not too great.
史記 is the best dataset for training.

35/38

Page 36: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Segmentation Models

Overall performance:
Model | Specificity | F-measure | NIST-SU error rate
HMMs | 94.26% | 75.28% | 47.98%
CRFs | 94.53% | 80.96% | 39.03%

Model comparison:
Model | Correctness | Training time | Run time | Sensitivity to training data
HMMs | Average | Fast | Fast | Insignificant
CRFs | Better | Slow | Fast | Sensitive

36/38

Page 37: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Future Work
Apply more Chinese features: phonology, POS, and antithesis.
Integrate pre-defined rules: names, places, dates, numbers.
Mix several datasets to obtain a more general, robust dataset.

37/38

Page 38: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

THE END

38/38

Page 39: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Conditional Random Fields
The model: P( “LL, MM, MM, RR” | “北冥有魚” )
Feature functions: f(y_{t-1}, y_t, x, t)
f(MM, RR, ‘曰’) = 1 (‘曰’ appears at the end of clauses)
f(RR, ‘孟’) = 0 (‘孟’ never appears at the end of a clause)
λ_k determines the importance of f_k; λ_k is learned from the data.

P(y \mid x) = \frac{1}{Z_x} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big)

39/38

Page 40: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Conditional Random Fields (cont.)
Tagging x: implemented with the Viterbi algorithm.
Parameter estimation (determining the λ values):
convergence is guaranteed, but there is no analytical solution; the values must be approximated iteratively.
The traditional numerical methods (GIS, IIS, L-BFGS, etc.) are complex, slow, and hard to implement.
The averaged perceptron is used in place of the traditional numerical methods.

y^* = \arg\max_{y} P(y \mid x)

40/38
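A compact sketch of the structured averaged-perceptron training loop described above (in the spirit of Collins, 2002): decode with the current weights, compare against the gold tags, and adjust the weights by the difference in feature counts. The helpers extract_features (returning a dict of feature counts) and decode (a Viterbi-style decoder) are assumed here, not defined in the slides:

from collections import defaultdict

def train_averaged_perceptron(data, extract_features, decode, epochs=5):
    """data: list of (chars, gold_tags) pairs; returns averaged weights."""
    weights = defaultdict(float)
    totals = defaultdict(float)   # running sums used for averaging
    steps = 0
    for _ in range(epochs):
        for chars, gold in data:
            pred = decode(chars, weights)
            if pred != gold:
                # Reward features of the gold tagging, penalize the predicted one.
                for feat, count in extract_features(chars, gold).items():
                    weights[feat] += count
                for feat, count in extract_features(chars, pred).items():
                    weights[feat] -= count
            steps += 1
            for feat, value in weights.items():
                totals[feat] += value
    return {feat: total / steps for feat, total in totals.items()}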

Page 41: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Feature Templates

41/38

Template | Example
y_{i-1}, y_i | LL, MM
y_i, w_i | 北冥有魚其名為鯤
w_{i-2}, y_i | 北冥有魚其名為鯤
w_{i-1}, y_i | 北冥有魚其名為鯤
w_{i+1}, y_i | 北冥有魚其名為鯤
w_{i+2}, y_i | 北冥有魚其名為鯤
w_{i-2}, w_{i-1}, y_i | 北冥有魚其名為鯤
w_{i-1}, w_i, y_i | 北冥有魚其名為鯤
w_i, w_{i+1}, y_i | 北冥有魚其名為鯤
w_{i+1}, w_{i+2}, y_i | 北冥有魚其名為鯤
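An illustrative expansion of the templates in the table above into string-valued features for a position i (a sketch; the feature-string format and the padding symbols are assumptions, not the thesis's encoding):

def template_features(chars, tags, i):
    """Instantiate the feature templates of the table for position i."""
    w = lambda j: chars[j] if 0 <= j < len(chars) else "<PAD>"
    prev_tag = tags[i - 1] if i > 0 else "<START>"
    return [
        f"y-1,y={prev_tag},{tags[i]}",            # y_{i-1}, y_i
        f"w0,y={w(i)},{tags[i]}",                 # w_i, y_i
        f"w-2,y={w(i-2)},{tags[i]}",              # w_{i-2}, y_i
        f"w-1,y={w(i-1)},{tags[i]}",
        f"w+1,y={w(i+1)},{tags[i]}",
        f"w+2,y={w(i+2)},{tags[i]}",
        f"w-2w-1,y={w(i-2)}{w(i-1)},{tags[i]}",   # character-bigram templates
        f"w-1w0,y={w(i-1)}{w(i)},{tags[i]}",
        f"w0w+1,y={w(i)}{w(i+1)},{tags[i]}",
        f"w+1w+2,y={w(i+1)}{w(i+2)},{tags[i]}",
    ]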

Page 42: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Raw Results (Better Case)

42/38

Reference segmentation:
齊宣王問曰.文王之囿.方七十里.有諸.孟子對曰.於傳有之.曰.若是其大乎.曰.民猶以為小也.曰.寡人之囿.方四十里.民猶以為大.何也.曰.文王之囿.方七十里.芻蕘者往焉.雉兔者往焉.與民同之.民以為小.不亦宜乎.臣始至於境.問國之大禁.然後敢入.臣聞郊關之內.有囿方四十里.殺其麋鹿者.如殺人之罪.則是方四十里為阱於國中.民以為大.不亦宜乎.

System output:
齊宣王問曰.文王之囿方七十里.有諸孟子對曰.於傳有之.曰.若是其大乎.曰.民猶以為小也.曰.寡人之囿.方四十里.民猶以為大.何也.曰.文王之囿方七十里.芻蕘者往焉.雉兔者往焉.與民同之.民以為小.不亦宜乎.臣始至於境.問國之大禁.然後敢入.臣聞郊關之內.有囿方四十里.殺其麋鹿者.如殺人之罪.則是方四十里.為阱於國中.民以為大.不亦宜乎.

Training Data: 史記, Test Data: 孟子

Page 43: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Raw Results (Worse Case)

43/38

Reference segmentation:
孟子曰.三代之得天下也.以仁.其失天下也以不仁.國之所以廢興存亡者亦然.天子不仁.不保四海.諸侯不仁.不保社稷.卿大夫不仁.不保宗廟.士庶人不仁.不保四體.今惡死亡而樂不仁.是由惡醉而強酒.

System output:
孟子曰.三代之.得天下也.以仁其失天下也.以不仁國之所以廢興.存亡者.亦然.天子不仁.不保四海.諸侯不仁.不保社.稷卿大夫.不仁不保.宗廟士.庶人不仁.不保四體.今惡死亡.而樂不仁.是由惡醉.而強酒.

Training Data: 史記, Test Data: 孟子

Page 44: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

44/38

Page 45: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

45/38

Page 46: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

46/38

Page 47: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Evaluation Measures

(Diagram: contingency table of true positives (tp), false positives (fp), false negatives (fn), and true negatives (tn).)

47/38

Page 48: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Statistical Machine Learning

(Diagram: training data is fed to a learner (model), which maps input to output.)

48/38

Page 49: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Conclusion: Evaluation Metrics
Primary reference metrics:
Specificity
F-measure (F1): the harmonic mean of recall and precision
NIST-SU error rate
ROC curves: evaluate recall and specificity at the same time; compare multiple segmentation results.

49/38

Page 50: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Statistical Segmenter Design
Statistical approach: the segmentation system can be viewed as a machine learner.
Rather than having rules defined manually in advance, it learns by adjusting itself on a large amount of training data.
Requirements on the training data: generality and sufficient quantity.

50/38

Page 51: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

51/38

Page 52: Classical Chinese Sentence Segmentation Using Sequence Labeling Approaches

Grammatical and Segmentation Studies of Old Chinese
Philology (訓詁學): the study of function words (虛字).
《爾雅》: the 釋詁, 釋言, and 釋訓 chapters.
Late Qing: 馬建忠's 《馬氏文通》 modeled Western grammar to establish a grammar of Old Chinese.
Early Republican era: 楊樹達's 《詞詮》 classified the function words.
楊樹達's 《古書句讀釋例》 examined the causes of mis-segmentation.
No digitized versions of these works are currently available.

52/38