Yuru Jiang Rou Song Beijing University of Technology

Topic Structure Identification of PClause Sequence Based on Generalized Topic Theory

Page 1: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

Yuru Jiang , Rou Song

Beijing University of Technology

Page 2: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

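Example: 斑鳐 (a skate species)

斑鳐 是 鳐形目 鳐科 鳐属 的 1 种 。吻 中长 ,尖 突 。尾 细长 ,
(English gloss: "斑鳐 is a species of the genus Raja, family Rajidae, order Rajiformes. Snout medium-long, pointed and protruding. Tail slender, ...")

PClause (Punctuation Clause) Sequence

c1: 斑鳐 是 鳐形目 鳐科 鳐属 的 1 种 。
c2: 吻 中长 ,
c3: 尖 突 。
c4: 尾 细长 ,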

Page 3: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

c1: 斑鳐 是 鳐形目 鳐科 鳐属 的 1 种 。
c2: 吻 中长 ,
c3: 尖 突 。
c4: 尾 细长 ,

t1: 斑鳐 是 鳐形目 鳐科 鳐属 的 1 种 。
t2: 斑鳐 吻 中长 ,
t3: 斑鳐 吻 尖 突 。
t4: 斑鳐 尾 细长 ,

What we have done

Page 4: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

Identification Process

Identification Algorithm

CTCs Scoring Function

Page 5: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

Example 2: 斑鳐 (excerpted from the Encyclopedia of China, 《中国大百科全书》)

c1: 斑鳐 是 鳐形目 鳐科 鳐属 的 1 种 。
c2: 吻 中长 ,
c3: 尖 突 。
c4: 尾 细长 ,

t1 = c1

t2 = ?

Page 6: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

if:
t1: 斑鳐 是 鳐形目 鳐科 鳐属 的 1 种 。
c2: 吻 中长 ,

then: t2 = ?

CTCs (candidate topic clauses) of c2:

1. 吻 中长 ,
2. 斑鳐 吻 中长 ,
3. 斑鳐 是 吻 中长 ,
4. 斑鳐 是 鳐形目 吻 中长 ,
5. 斑鳐 是 鳐形目 鳐科 吻 中长 ,
6. 斑鳐 是 鳐形目 鳐科 鳐属 吻 中长 ,
7. 斑鳐 是 鳐形目 鳐科 鳐属 的 吻 中长 ,
8. 斑鳐 是 鳐形目 鳐科 鳐属 的 1 吻 中长 ,
9. 斑鳐 是 鳐形目 鳐科 鳐属 的 1 种 吻 中长 ,

Page 7: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

t1

CTCs of c2

Topic Clause of c3

Page 8: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

if:
CTCs of c2:

1. 吻 中长 ,
2. 斑鳐 吻 中长 ,
3. 斑鳐 是 吻 中长 ,
4. 斑鳐 是 鳐形目 吻 中长 ,
5. 斑鳐 是 鳐形目 鳐科 吻 中长 ,
6. 斑鳐 是 鳐形目 鳐科 鳐属 吻 中长 ,
7. 斑鳐 是 鳐形目 鳐科 鳐属 的 吻 中长 ,
8. 斑鳐 是 鳐形目 鳐科 鳐属 的 1 吻 中长 ,
9. 斑鳐 是 鳐形目 鳐科 鳐属 的 1 种 吻 中长 ,

c3: 尖 突 ,

then: t3 = ?

Page 9: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

if:
one CTC of c2: 斑鳐 是 鳐形目 吻 中长 ,
c3: 尖 突 ,

then: one group of CTCs of c3 is:

1. 尖 突 ,
2. 斑鳐 尖 突 ,
3. 斑鳐 是 尖 突 ,
4. 斑鳐 是 鳐形目 尖 突 ,
5. 斑鳐 是 鳐形目 吻 尖 突 ,
6. 斑鳐 是 鳐形目 吻 中长 尖 突 ,
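The candidate lists above suggest a simple construction: each CTC is a word-level prefix (possibly empty) of the previous topic clause, followed by the words of the current PClause. The following is a minimal Python sketch of that reading; the function name `generate_ctcs` and the punctuation handling are assumptions for illustration, not taken from the slides.

```python
def generate_ctcs(prev_topic_clause, pclause):
    """Return candidate topic clauses (CTCs) for `pclause`.

    Assumption: a CTC is any word-level prefix of `prev_topic_clause`
    (including the empty prefix) followed by the words of `pclause`.
    Both arguments are whitespace-segmented strings.
    """
    # Clause-final punctuation tokens are dropped before taking prefixes.
    punct = {",", "。", ",", "."}
    prev_words = [w for w in prev_topic_clause.split() if w not in punct]
    cur_words = [w for w in pclause.split() if w not in punct]
    return [" ".join(prev_words[:k] + cur_words)
            for k in range(len(prev_words) + 1)]


t1 = "斑鳐 是 鳐形目 鳐科 鳐属 的 1 种 。"
ctcs_c2 = generate_ctcs(t1, "吻 中长 ,")
# Expanding one particular CTC of c2 gives one group of CTCs of c3.
ctcs_c3 = generate_ctcs("斑鳐 是 鳐形目 吻 中长 ,", "尖 突 ,")

for i, ctc in enumerate(ctcs_c3, start=1):
    print(i, ctc)
```

Under this reading, a previous topic clause of n words always yields n + 1 candidates, which matches the 9 CTCs of c2 and the 6 CTCs of c3 shown in the example.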

Page 11: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

t1

CTCs of c2

CTCs of c3

Page 12: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

How to choose the best path?

Page 13: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

Question 1: How is the value of each node in the CTC tree calculated?
◦ CTCs Scoring Function

Question 2: How is the value of the path from each leaf node to the root node calculated?
◦ Sum of the node values
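The two answers above can be sketched together: expand the CTC tree level by level, add up the node values along each root-to-leaf path, and keep the path with the largest sum. The Python sketch below is illustrative only. The exhaustive expansion, the names `best_path` and `toy_score`, and the tiny hand-made scoring set are all assumptions; the real node value would come from the CTCs scoring function.

```python
def generate_ctcs(prev_words, cur_words):
    """Word-level prefixes of the previous topic clause + the current PClause."""
    return [prev_words[:k] + cur_words for k in range(len(prev_words) + 1)]


def best_path(t1, pclauses, score):
    """Expand the CTC tree exhaustively; return (path value, chosen CTCs)."""
    best = (float("-inf"), [])

    def expand(prev, depth, total, path):
        nonlocal best
        if depth == len(pclauses):      # leaf: one CTC chosen per PClause
            if total > best[0]:
                best = (total, path)
            return
        for ctc in generate_ctcs(prev, pclauses[depth]):
            expand(ctc, depth + 1, total + score(ctc), path + [ctc])

    # The root t1 is shared by every path, so its value is left out of the sum.
    expand(t1, 0, 0.0, [])
    return best


# Hypothetical node scoring: 1.0 for clauses in a tiny hand-made set, else 0.0.
KNOWN = {("斑鳐", "吻", "中长"), ("斑鳐", "吻", "尖", "突")}


def toy_score(ctc):
    return 1.0 if tuple(ctc) in KNOWN else 0.0


t1 = "斑鳐 是 鳐形目 鳐科 鳐属 的 1 种".split()
pclauses = ["吻 中长".split(), "尖 突".split()]
value, path = best_path(t1, pclauses, toy_score)
print(value, [" ".join(ctc) for ctc in path])
```

Exhaustive expansion grows exponentially with the number of PClauses, so a practical system would presumably need pruning; the sketch ignores that to keep the path-sum idea visible.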

Page 14: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

Given a CTC d of PClause c, the topic clause most similar to d is found in the topic clause corpus (Tcorpus), and that similarity is denoted sim_CT(d). Writing sim(x, y) for the similarity of any two strings x and y, sim_CT(d) is defined as

$$\mathrm{sim\_CT}(d) = \max_{t \in \mathrm{Tcorpus}} \mathrm{sim}(d, t)$$
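As a rough illustration of this definition, the sketch below scores a CTC by its best match in a topic clause corpus. The slides do not say how sim(x, y) is computed, so `difflib.SequenceMatcher` over word sequences is used purely as a stand-in, and the two corpus clauses are made-up toy data.

```python
from difflib import SequenceMatcher


def sim(x, y):
    """Stand-in similarity of two whitespace-segmented strings, in [0, 1].

    Assumption: the source does not specify sim; this uses the
    SequenceMatcher ratio over word sequences.
    """
    return SequenceMatcher(None, x.split(), y.split()).ratio()


def sim_ct(d, tcorpus):
    """sim_CT(d): similarity of d to its most similar topic clause in tcorpus."""
    return max(sim(d, t) for t in tcorpus)


# Toy corpus, invented for illustration only.
tcorpus = ["斑鳐 吻 短", "斑鳐 尾 细长"]
print(sim_ct("斑鳐 吻 中长", tcorpus))
print(sim_ct("吻 中长", tcorpus))
```

With this stand-in, the candidate that restores the topic 斑鳐 scores higher than the bare PClause, because it shares more words with the corpus clauses.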

Page 15: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

Let CTset(c) be the set of CTCs of c. The topic clause of c is then:

$$\arg\max_{d \in \mathrm{CTset}(c)} \mathrm{sim\_CT}(d)$$

Accuracy rate is 0.6499

Reference: Yuru Jiang, Rou Song: Topic Clause Identification Based on Generalized Topic Theory. Journal of Chinese Information Processing 26(5) (2012)
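The selection rule above amounts to an argmax over the candidate set. A minimal sketch follows, again with `difflib.SequenceMatcher` standing in for the unspecified sim and with an invented two-clause corpus; `topic_clause` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def sim(x, y):
    # Stand-in similarity over word sequences (assumption, not from the source).
    return SequenceMatcher(None, x.split(), y.split()).ratio()


def sim_ct(d, tcorpus):
    return max(sim(d, t) for t in tcorpus)


def topic_clause(ctset, tcorpus):
    """Pick the CTC in ctset with the highest sim_CT value."""
    return max(ctset, key=lambda d: sim_ct(d, tcorpus))


# CTset(c2): prefixes of t1 followed by c2, as in the example.
t1 = "斑鳐 是 鳐形目 鳐科 鳐属 的 1 种".split()
c2 = "吻 中长".split()
ctset = [" ".join(t1[:k] + c2) for k in range(len(t1) + 1)]

tcorpus = ["斑鳐 吻 短", "斑鳐 尾 细长"]   # toy corpus, invented for illustration
print(topic_clause(ctset, tcorpus))
```

Each PClause is decided on its own here, which is the behaviour the accuracy figure above refers to; no information from neighbouring clauses is used.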

Page 16: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

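$$\mathrm{ctxSim\_CT}(d) = \max_{t \in \mathrm{Tcorpus}} \bigl( \lambda_1\,\mathrm{sim}(d, t) + \lambda_2\,\mathrm{sim}(d_c, t_c) + \lambda_3\,\mathrm{sim}(d_{tc_{pre}}, t_{tc_{pre}}) \bigr)$$

Here d_c and t_c are the PClause parts of d and t, and d_tcpre and t_tcpre are the topic clauses of their preceding PClauses. Only the indices 1 to 3 of the weights survive in the extracted slide; the symbol λ and the weighted-sum form are reconstructed.

Accuracy rate is 0.7625 > 0.6499 > baseline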
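A context-aware score of this kind can be sketched as a weighted combination of three similarities: the whole candidate against a corpus topic clause, the PClause parts against each other, and the preceding topic clauses against each other. Everything specific below is an assumption: the weighted-sum form, the weight values, the `difflib`-based sim, and the name `ctx_sim_ct`. The strings are those of Example 3.

```python
from difflib import SequenceMatcher


def sim(x, y):
    # Stand-in similarity over word sequences (assumption, not from the source).
    return SequenceMatcher(None, x.split(), y.split()).ratio()


def ctx_sim_ct(d, d_c, d_tcpre, tcorpus, weights=(0.5, 0.3, 0.2)):
    """Context-aware score of candidate d.

    tcorpus holds (t, t_c, t_tcpre) triples: a topic clause, its PClause,
    and the topic clause of the preceding PClause.
    The weights are hypothetical placeholders.
    """
    w1, w2, w3 = weights
    return max(
        w1 * sim(d, t) + w2 * sim(d_c, t_c) + w3 * sim(d_tcpre, t_tcpre)
        for t, t_c, t_tcpre in tcorpus
    )


d_tcpre = "A 一般 均 具 H 或 H C"
d_c = "用以 引诱 食饵"
candidates = [
    "A 一般 均 具 H 用以 引诱 食饵",
    "A 一般 均 具 H 或 H C 用以 引诱 食饵",
]
tcorpus = [("A 有些 B C 具 C 以 引诱 食饵", "以 引诱 食饵", "A 有些 B C 具 C")]

scores = {d: ctx_sim_ct(d, d_c, d_tcpre, tcorpus) for d in candidates}
for d, s in scores.items():
    print(round(s, 3), d)
```

Because the second and third terms depend only on the context and not on which prefix is chosen, they shift all candidates that share a context by the same amount for a given corpus entry; they matter when the maximizing corpus entry differs between candidates.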

Page 17: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

Example 3:

d_tcpre: A 一般 均 具 H 或 H C ,
d_c: 用以 引诱 食饵 。 ("used to lure prey")
t1: A 一般 均 具 H 用以 引诱 食饵 。
st1: A C 一般 具 H ,
t2: A 一般 均 具 H 或 H C 用以 引诱 食饵 。

t_tcpre: A 有些 B C 具 C ,
t_c: 以 引诱 食饵 , ("to lure prey")
t: A 有些 B C 具 C 以 引诱 食饵 ,

Page 18: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

Corpus

Evaluation Criteria

Experiment Result Analysis

Page 19: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

The corpus consists of 202 texts about fish from the Biology volume of the Encyclopedia of China.

15 texts are used for testing in the experiment.

K-1 testing is used.

Page 20: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

For N PClauses, if the number of PClauses whose topic clauses are correctly identified is hitN, then the identification accuracy rate is hitN/N.
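The criterion above is a plain hit ratio. A minimal sketch, assuming predicted and gold topic clauses are given as parallel lists of strings (the function name `accuracy` and the sample labels are illustrative):

```python
def accuracy(predicted, gold):
    """Identification accuracy rate hitN / N over N PClauses."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one PClause is required")
    # hitN: number of PClauses whose topic clause is identified correctly.
    hit_n = sum(p == g for p, g in zip(predicted, gold))
    return hit_n / len(gold)


gold = ["斑鳐 吻 中长", "斑鳐 吻 尖 突", "斑鳐 尾 细长"]
predicted = ["斑鳐 吻 中长", "斑鳐 尖 突", "斑鳐 尾 细长"]
print(accuracy(predicted, gold))
```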

Page 21: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

Fig. 2. PClause Count and Accuracy Rate for Topic Clause Identification for the 15 test texts

Page 25: Topic Structure Identification  of  PClause Sequence Based on Generalized Topic Theory

CTCs Scoring Function

CTC Tree

Extend to other text
