Sentence Semantic Distance and Novelty Detection

Hua-Ping Zhang
zhanghp@software.ict.ac.cn
LCC Group, Software Division, Inst. of Computing Tech., CAS
2003-10-20

Happy birthday to Wang Shuxi (王树西)!

Happy birthday to Yu Manquan (于满泉)!

Outline

• Motivation
• Semantic distance computation
• Overview of WordNet
• WordNet-based sentence semantic distance computation
• Novelty detection using sentence semantic distance
• Experiments and analysis
• Problems and future work
• Conclusion

Motivation

• Introduction to the Novelty Track in TREC

The Novelty Track is designed to investigate systems' abilities to locate relevant AND new information within a set of documents relevant to a TREC topic. Systems are given the topic and a set of relevant documents ordered by date, and must identify sentences containing relevant and/or new information in those documents.

Motivation II

• Topic
– <num>Number: N3
– <title>Gingrich Speaker of House
– <toptype>event/opinion
– <desc>Description: After a serious challenge by Rep. Livingston in 1998, Newt Gingrich announced he would not seek re-election as Speaker of the U.S. House of Representatives.
– <narr>Narrative: Relevant is the political maneuvering that led up to Newt Gingrich's announcement in 1998 that he would not run again for the position of speaker of the house;
– Relevant document

Motivation III

• Contrast between a sentence and a document
– Content: 6-12 words on average vs. over 30 sentences
– Overlap: very little vs. key words occurring frequently
– Information (after stemming and stop-word removal): 4 or 5 (word, frequency) pairs, mostly with frequency 1, vs. dozens of (word, frequency) pairs

Motivation IV

• Sample
– Event:
1) However, the Scottish team was the first to make a clone from adult animal cell.
2) The seminar was held in the context of a recently reported sheep cloning case in Britain.
– Opinion:
1) Daily we read news stories about dissatisfaction with managed care, Medicare fraud and overbilling.
2) Eighty percent agreed with this.
• Conclusion: considering words alone falls far short of what is required; we must expand each sentence as far as we can.

Semantic distance computation

• Semantic distance or similarity computation
• Previous work focuses on word-word semantic distance
– Corpus-based: co-occurrence distribution [Church and Hanks 1989, Grefenstette 1992]; knowledge-free
– Ontology-based: taxonomy or other pseudo-knowledge: WordNet, HowNet, Tongyi CiLin [Liu Qun (刘群) 2002, Wang Bin (王斌) 1999, Li Sujian (李素建) 2002, Che Wanxiang (车万翔) 2003]
– Hybrid [Jay J. Jiang 1997]

Semantic distance computation II

• Corpus-based approach
– Select a group of words as features and train a feature vector for each word. Then compute similarity with a vector measure (e.g. cosine).
– Assumption: similar word <--> similar context
– Lu Song (1999), Degan (1999)
– "Invariable Point" [Bai Shuo 1995] for word clustering
• [张三]{吃}<饭> ([Zhang San]{eats}<rice>)
• [李四]{穿}<衣> ([Li Si]{wears}<clothes>)
• One word per category _____?_____ all words in one category
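The corpus-based recipe above (pick feature words, build a vector per word, compare with cosine) can be sketched as follows. This is only an illustrative toy, not the method of any cited paper: the corpus, the feature list, the window size and the function names `context_vector` and `cosine` are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt

def context_vector(target, sentences, features, window=2):
    """Count how often each feature word occurs within `window` tokens of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in features:
                    counts[tokens[j]] += 1
    return [counts[f] for f in features]

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus and feature words.
sentences = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the dog".split(),
    "the car needs fuel".split(),
]
features = ["drinks", "chases", "needs", "milk", "water", "fuel"]

cat = context_vector("cat", sentences, features)
dog = context_vector("dog", sentences, features)
car = context_vector("car", sentences, features)

print(cosine(cat, dog), cosine(cat, car))
```

Words that share contexts ("cat", "dog") end up with a higher cosine than words that do not ("cat", "car"), which is exactly the "similar word <--> similar context" assumption.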

Semantic distance computation III

• Liu Qun, Li SuJian, 2002: using HowNet

For two Chinese words W1 and W2, if W1 has n senses (concepts) S11, S12, ..., S1n and W2 has m senses (concepts) S21, S22, ..., S2m, we define the similarity of W1 and W2 as the maximum of the similarities between their concepts, that is:

$$\mathrm{Sim}(W_1, W_2) = \max_{i=1..n,\; j=1..m} \mathrm{Sim}(S_{1i}, S_{2j})$$

$$\mathrm{Sim}(p_1, p_2) = \frac{\alpha}{d + \alpha} \qquad (3)$$

where p1 and p2 are two sememes (primitives), d is the path length between p1 and p2 in the sememe hierarchy (a positive integer), and α is an adjustable parameter.
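A minimal sketch of the two HowNet formulas on this slide, under loud simplifications: each sense is treated as a single sememe (the slide does not show how multi-sememe concepts are compared), the hierarchy `PARENT` is a made-up toy rather than HowNet data, and the default `alpha=1.6` is only a placeholder for the adjustable parameter α.

```python
# Hypothetical toy sememe hierarchy (child -> parent); not HowNet data.
PARENT = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}

def ancestors(node):
    """Return the chain from `node` up to the root, starting with `node` itself."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def path_length(p1, p2):
    """Number of edges on the path between two sememes through their lowest common ancestor."""
    chain1, chain2 = ancestors(p1), ancestors(p2)
    for steps_up, node in enumerate(chain1):
        if node in chain2:
            return steps_up + chain2.index(node)
    raise ValueError("sememes share no common ancestor")

def sememe_similarity(p1, p2, alpha=1.6):
    """Sim(p1, p2) = alpha / (d + alpha), with d the path length."""
    return alpha / (path_length(p1, p2) + alpha)

def word_similarity(senses1, senses2, alpha=1.6):
    """Sim(W1, W2) = maximum similarity over all sense pairs."""
    return max(sememe_similarity(s1, s2, alpha) for s1 in senses1 for s2 in senses2)

# A word with two senses compared with a word with one sense.
print(word_similarity(["dog", "tree"], ["cat"]))
```

Taking the maximum means an ambiguous word is judged by its closest sense, so one unrelated sense does not pull the similarity down.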

Semantic distance computation IV

• Wang Bin: using Tongyi CiLin

[Figure: the Tongyi CiLin tree, with root O, top-level classes A, B, ..., L, second-level classes a, b, ..., l, and numbered lower levels 01, 02, ...]

Dashed lines mark the path from an upper-level node to a lower-level node.

Semantic distance computation V

• Che Wanxiang (车万翔)
– Edit distance: deletion, insertion, substitution (costs estimated with HowNet and Tongyi CiLin)

(a) The edit distance between "爱吃苹果" ("love eating apples") and "喜欢吃香蕉" ("like eating bananas") is 4, as shown by the four dashed lines;

(b) The improved edit distance between "爱吃苹果" and "喜欢吃香蕉" is 1.1, where the cost of "爱" -> "喜欢" is 0.5 and the cost of "苹果" -> "香蕉" is 0.6.
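The two numbers in this example can be reproduced with a standard dynamic-programming edit distance. The sketch below is not the original implementation: it assumes the plain distance of 4 is counted over characters, that the improved distance works over segmented words, and that any substitution not listed in the hypothetical `SEMANTIC_COST` table costs 1.

```python
def edit_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Edit distance between sequences `a` and `b` with a pluggable substitution cost."""
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * indel_cost          # delete everything in a[:i]
    for j in range(1, cols):
        dp[0][j] = j * indel_cost          # insert everything in b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = min(
                dp[i - 1][j] + indel_cost,                           # deletion
                dp[i][j - 1] + indel_cost,                           # insertion
                dp[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),     # substitution or match
            )
    return dp[-1][-1]

# Substitution costs quoted on the slide; in the original work they are
# estimated from HowNet and Tongyi CiLin.
SEMANTIC_COST = {("爱", "喜欢"): 0.5, ("苹果", "香蕉"): 0.6}

def semantic_cost(x, y):
    """Look up a semantic substitution cost, symmetric, defaulting to 1.0."""
    if x == y:
        return 0.0
    return SEMANTIC_COST.get((x, y), SEMANTIC_COST.get((y, x), 1.0))

plain = edit_distance("爱吃苹果", "喜欢吃香蕉")                                   # character level
improved = edit_distance(["爱", "吃", "苹果"], ["喜欢", "吃", "香蕉"], semantic_cost)  # word level
print(plain, round(improved, 2))
```

Replacing the unit substitution cost with a semantic cost is the only change needed to move from (a) to (b); the recurrence itself stays the same.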

Semantic distance computation VI

• Ontology-based: simple but effective and understandable, however subjective
• Corpus-based: ignores inherent semantic knowledge; practical for applications in parsing, semantics or language usage
• Hybrid: uses the taxonomy as its frame and the corpus to compute edge values instead of setting them arbitrarily

Overview on WordNet

• URL:
– http://www.cogsci.princeton.edu/~wn/
• Developer:
– Psycholinguistics laboratory, Princeton University
– Originally a psycholinguistic research result on human lexical memory
– Widely used in natural language processing
• A free online lexical database
• Versions have been developed for many of the world's languages
– European languages: EuroWordNet
– Chinese: CCD (Chinese Concept Dictionary)

WordNet II

• Synset
– A concept is represented by a set of synonyms (a synset)
– Each concept has a descriptive gloss
• Relations
– Hypernym/hyponym relations (hyponymy, troponymy)
– Synonym/antonym relations (synonymy, antonymy)
– Part/whole relations (entailment, meronymy)
– ……

WordNet III

[Figure: organization of noun concepts]

WordNet IV

[Figure: organization of adjective concepts]

WordNet V

• Covers only open-class words: adj, n, v, adv
• Scale: see attachment
• Data format: see attachment
• API: how to use it?
wninit() -> (Word, POS) -> [getindex(char *, int)] -> (synset offset) -> [read_synset(int, long, char *)] -> tagged frequency
be%2:42:03:: 1 10720
sense_key sense_number tag_cnt
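The sample line pairs a WordNet sense key with a sense number and a tagged-frequency count. As a hedged illustration, the sketch below parses that one line in Python instead of going through the C API named on the slide. It assumes the standard sense key layout `lemma%ss_type:lex_filenum:lex_id:head_word:head_id`; the names `SenseCount` and `parse_sense_count` are invented for this example and are not part of WordNet.

```python
from typing import NamedTuple

# Synset type codes used inside WordNet sense keys.
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb", 5: "adjective satellite"}

class SenseCount(NamedTuple):
    lemma: str
    ss_type: str
    lex_filenum: int
    lex_id: int
    sense_number: int
    tag_cnt: int

def parse_sense_count(line):
    """Parse one 'sense_key sense_number tag_cnt' line into a SenseCount record."""
    sense_key, sense_number, tag_cnt = line.split()
    lemma, lex_sense = sense_key.split("%", 1)
    # head_word and head_id are empty unless the sense is an adjective satellite.
    ss_type, lex_filenum, lex_id, _head_word, _head_id = lex_sense.split(":")
    return SenseCount(
        lemma=lemma,
        ss_type=SS_TYPES[int(ss_type)],
        lex_filenum=int(lex_filenum),
        lex_id=int(lex_id),
        sense_number=int(sense_number),
        tag_cnt=int(tag_cnt),
    )

entry = parse_sense_count("be%2:42:03:: 1 10720")
print(entry)
```

The `tag_cnt` field is the tagged frequency that the later information-content computation relies on.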

WordNet-based Sentence semantic distance

• Information content
– P(c), 1/P(c)
– IC(c) = -log P(c)
– Example: "A dog bites a man" vs. "A man bites a dog"
– Entropy(H) = ∑ P(c)*IC(c)
• Information content of a synset
IC(c) = -log P(c)
P(c) = freq(c)/N
freq(c) = ∑ freq(w), where w ∈ c or w ∈ c* and c* is a child of c
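The synset frequency and information content definitions can be sketched on a toy taxonomy. Everything concrete here is an assumption for illustration: the hierarchy `CHILDREN`, the counts in `WORD_FREQ`, and the reading of "child" as any descendant, which the recursion below implements. N is taken to be the cumulative frequency of the root.

```python
from math import log

# Hypothetical toy taxonomy (synset -> direct children) and hypothetical
# tagged frequencies of the words that belong to each synset itself.
CHILDREN = {
    "entity": ["animal", "plant"],
    "animal": ["dog", "cat"],
    "plant": [],
    "dog": [],
    "cat": [],
}
WORD_FREQ = {"entity": 1, "animal": 4, "plant": 5, "dog": 6, "cat": 4}

def freq(c):
    """freq(c): counts of words in c plus counts of words in every synset below c."""
    return WORD_FREQ[c] + sum(freq(child) for child in CHILDREN[c])

N = freq("entity")  # total count, taken at the root

def probability(c):
    """P(c) = freq(c) / N."""
    return freq(c) / N

def information_content(c):
    """IC(c) = -log P(c); the root carries no information, leaves carry the most."""
    return -log(probability(c))

for synset in CHILDREN:
    print(synset, freq(synset), round(information_content(synset), 3))
```

Because frequencies accumulate upward, P(c) never decreases from child to parent, so IC never increases toward the root. That monotonicity is what makes IC(c) - IC(p) usable as a non-negative edge weight.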

WordNet-based Sentence semantic distance II

• Edge value: Wt(c,p) = (β+(1-β)E'/E(p)) {1+1/d(p)}^a [IC(c)-IC(p)] T(c,p), where:
– c: child, p: parent
– E: density
– d(p): depth of p
– T(c,p): link type
• Focusing only on IC, the edge value reduces to Wt(c,p) = IC(c)-IC(p)
• [Jay J. Jiang, 1997] Dist(w1,w2) = IC(c1)+IC(c2)-2*IC(LSuper(c1,c2))
– LSuper: lowest super-ordinate of c1 and c2
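With the edge value reduced to IC(c) - IC(p), the distance above is simply the sum of edge values along the path from c1 up to the lowest super-ordinate and down to c2. The sketch below checks this on a toy taxonomy. The hierarchy `PARENT` and the cumulative probabilities in `PROB` are hypothetical, and the step that maps words w1, w2 to synsets c1, c2 is left out because the slide does not specify it.

```python
from math import log

# Hypothetical toy taxonomy (child -> parent) and cumulative probabilities P(c).
PARENT = {"dog": "animal", "cat": "animal", "animal": "entity", "plant": "entity"}
PROB = {"entity": 1.0, "animal": 0.7, "plant": 0.25, "dog": 0.3, "cat": 0.2}

def ic(c):
    """Information content IC(c) = -log P(c)."""
    return -log(PROB[c])

def ancestors(c):
    """Chain from c up to the root, starting with c itself."""
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def lowest_superordinate(c1, c2):
    """LSuper(c1, c2): the deepest synset that subsumes both c1 and c2."""
    above_c1 = set(ancestors(c1))
    for node in ancestors(c2):
        if node in above_c1:
            return node
    raise ValueError("synsets share no super-ordinate")

def distance(c1, c2):
    """Dist = IC(c1) + IC(c2) - 2 * IC(LSuper(c1, c2))."""
    return ic(c1) + ic(c2) - 2 * ic(lowest_superordinate(c1, c2))

def edge_weight(c):
    """Simplified edge value Wt(c, p) = IC(c) - IC(p) for the edge above c."""
    return ic(c) - ic(PARENT[c])

print(distance("dog", "cat"), edge_weight("dog") + edge_weight("cat"))
```

For "dog" and "cat" the path has exactly two edges, both ending at "animal", so the two printed values agree up to rounding.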

WordNet-based Sentence semantic distance III

• Focusing only on hypernym links is not enough.
• We introduce more relations, such as:
– VERBGROUP $: develop 6 acquire 5 evolve
– Similar &: aborning 0 003 & 00003552 a 0000 & 00003671 a 0000 ! 00003777 a 0101
– Derivation: adj->adv; v<->n; v->adj
– Noun-Noun: ISMEMBERPTR, ISSTUFFPTR, ISPARTPTR, HASMEMBERPTR, HASSTUFFPTR, HASPARTPTR
– Example: Friend->friendly->friendliness

WordNet-based Sentence semantic distance IV

• Word-Sentence Semantic Distance (WSSD)

WSSD(W,S) = min {WWSD(W, wi) | wi ∈ S}, where W is a word, S is a sentence, wi is a word in sentence S, and WWSD is the word-word semantic distance.

• Sentence-Sentence Semantic Distance (SSSD)

$$\mathrm{SSSD}(S_a, S_b) = \frac{\sum_{w_i \in S_a} \mathrm{WSSD}(w_i, S_b) + \sum_{w_j \in S_b} \mathrm{WSSD}(w_j, S_a)}{|S_a| + |S_b|}$$
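A small sketch of the two definitions, assuming SSSD is the symmetric average of word-to-sentence distances normalised by |Sa| + |Sb| (the formula on this slide is damaged in the transcript, so that reading is a reconstruction). The word-word distance `toy_wwsd` and its lookup table are hypothetical stand-ins for a WordNet-based distance.

```python
def wssd(word, sentence, wwsd):
    """Word-sentence distance: distance from `word` to its nearest word in `sentence`."""
    return min(wwsd(word, w) for w in sentence)

def sssd(sa, sb, wwsd):
    """Sentence-sentence distance: average nearest-word distance in both directions."""
    total = sum(wssd(w, sb, wwsd) for w in sa) + sum(wssd(w, sa, wwsd) for w in sb)
    return total / (len(sa) + len(sb))

# Hypothetical word-word distances; unknown pairs default to the maximum 1.0.
TOY_DISTANCE = {frozenset(("dog", "cat")): 0.4}

def toy_wwsd(w1, w2):
    if w1 == w2:
        return 0.0
    return TOY_DISTANCE.get(frozenset((w1, w2)), 1.0)

sa = ["dog", "bites", "man"]
sb = ["cat", "bites", "man"]
print(sssd(sa, sb, toy_wwsd))
print(sssd(sa, ["man", "bites", "dog"], toy_wwsd))
```

The second call returns 0 for "dog bites man" against "man bites dog": the measure works on bags of words and ignores word order, which is worth keeping in mind when reading the experimental results.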

WordNet-based Sentence semantic distance IV

[Figure: Sentence A (words w1 ... w6) and Sentence B (words W1 ... W4). Arrows show the semantic distance from each word wi in Sentence A to Sentence B and from each word Wi in Sentence B to Sentence A; together they give the sentence semantic distance.]

Novelty Detection using sentence semantic distance

• What factors determine whether a sentence S is novel or new?
– Semantic distance between S and topic T: less is better
– Semantic distance between S and previous valid content: more is better
– Word overlap with T: more is better
– Word overlap with previous sentences: less is better
– Is it a paragraph head? A head sentence is more likely to be new.
– …

Novelty Detection using sentence semantic distance II

• How to link the various factors to the decision?
– A binary decision: new or not.
– Each factor is treated as one dimension; together they form a feature (factor) vector. Different dimensions carry different weights.
• The problem turns into binary categorization: into which category, new or not, does the factor vector of a relevant sentence S fall?
• Similar approaches could be applied to relevance detection.

Novelty Detection using sentence semantic distance III

• Training on known results using the Winnow algorithm.
– Factors should share the same direction and the same range: 0->1, ascending
– N2 XIE19970228.0169:05 0.430 1.00 0.174 1.00 1 1
– N2 XIE19970228.0169:06 0.364 0.52 0.143 0.81 1 0
– N2 XIE19970302.0039:04 0.5978 1.0 0.20 1.0 1 ?
– ∑Wi*Fi > N: new; otherwise: not new
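The slide gives the decision rule but not the update rule, so the following is only a sketch of one common Winnow-style variant, not the program actually used: weights start at 1 and are multiplied by `alpha ** f` after a missed "new" sentence or divided by it after a false alarm, which extends textbook Winnow from binary features to factors in [0, 1]. It also assumes that the five numbers after the sentence id are factor values and the last column is the known label. The threshold 2.5 and `alpha=2.0` are arbitrary choices for the example.

```python
def score(weights, factors):
    """Weighted sum of factors, i.e. the left-hand side of the decision rule."""
    return sum(w * f for w, f in zip(weights, factors))

def predict(weights, factors, threshold):
    """Decide 'new' when the weighted sum exceeds the threshold N."""
    return score(weights, factors) > threshold

def train_winnow(samples, threshold, alpha=2.0, max_epochs=100):
    """Mistake-driven multiplicative updates; stops after an epoch without mistakes."""
    weights = [1.0] * len(samples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for factors, is_new in samples:
            if predict(weights, factors, threshold) == is_new:
                continue
            mistakes += 1
            sign = 1 if is_new else -1   # promote on a missed 'new', demote on a false alarm
            weights = [w * alpha ** (sign * f) for w, f in zip(weights, factors)]
        if mistakes == 0:
            break
    return weights

# The two labelled rows from the slide: (factor values, known label).
samples = [
    ((0.430, 1.00, 0.174, 1.00, 1.0), True),
    ((0.364, 0.52, 0.143, 0.81, 1.0), False),
]
threshold = 2.5                                   # hypothetical N
weights = train_winnow(samples, threshold)

unlabelled = (0.5978, 1.0, 0.20, 1.0, 1.0)        # the row marked '?'
is_new = predict(weights, unlabelled, threshold)
print(weights, is_new)
```

Because updates are multiplicative, all factors must point in the same direction and share a range, which is why the slide asks for every factor to be scaled to 0->1, ascending.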

Experiments and analysis

• ICT03NOV4OTP: word overlap with topic and previous valid context
– Averages over 50 topics: average precision: 0.59; average recall: 0.70; average F: 0.610
– Best: N5 209 227 200 0.88 0.96 0.917
– Worst: N49 50 5 0 0.00 0.00 0.000
– Not bad for such simple information.
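The per-topic rows can be checked by hand if the three integers are read as the number of truly new sentences, the number of sentences returned, and the number of returned sentences that are correct. That column reading is an assumption, but it reproduces the reported figures for topic N5. A minimal sketch:

```python
def precision_recall_f(n_truth, n_returned, n_matched):
    """Precision, recall and balanced F-measure from raw counts; 0.0 where undefined."""
    precision = n_matched / n_returned if n_returned else 0.0
    recall = n_matched / n_truth if n_truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Best topic of the first run: N5 209 227 200
p, r, f = precision_recall_f(209, 227, 200)
print(f"{p:.2f} {r:.2f} {f:.3f}")   # prints: 0.88 0.96 0.917
```

Note that the averages on the slide are averaged per topic, so the average F is not the F of the average precision and recall.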

Experiments and analysis II

• ICT03NOV4LFF: word overlap + semantic similarity with topic and previous valid context
– Averages over 50 topics: average precision: 0.59; average recall: 0.64; average F: 0.568
– Best: N5 209 224 197 0.88 0.94 0.910
– Worst: N46 93 2 0 0.00 0.00 0.000
– Results drop slightly after introducing semantic distance.

Experiments and analysis III

• ICT03NOV4ALL: word overlap + semantic similarity with topic and previous valid context + head sentence
– Averages over 50 topics: average precision: 0.60; average recall: 0.68; average F: 0.598
– Best: N5 209 224 197 0.88 0.94 0.910
– Worst: N46 93 2 0 0.00 0.00 0.000
– Results improve after adding head-sentence information.

Problems and future work

• Corpus-based semantic computation depends heavily on the corpus. Data sparseness is the bottleneck, especially since no large-scale semantic corpus is available so far.

• The semantic computation model should unify taxonomic relations with other relations.

• The Winnow training algorithm is deficient in modeling the mapping from factors to decision.

Problems and future work II

• Some modifications could be applied to the semantic computation.

• Within the current approach, several things could be optimized: 1) extend the semantic distance between the selected sentence and its previous one to all valid contexts; 2) Winnow initialization; 3) Winnow step size; 4) KNN or another method to replace Winnow.

Problems and future work III

• Synset-based VSM or synset overlap to expand the sentence, though plain overlap/VSM has proved simple but effective.

Conclusion

• Semantic distance computation using WordNet is effective for many applications, such as disambiguation and bilingual word alignment. It helps compare words whose forms differ.

• Mapping factors to a decision can be cast as categorization.

• Relevance and novelty detection seems promising though difficult. It is still at an early stage. More work and more approaches could be tried. The best results determine the best approach. Good ideas should be checked against final results.

Acknowledgements

• Dr. Jian Sun for instructive discussion and providing papers on semantic computation and novelty.

• Mr. Wei-Feng Pan for winnow program and

• Associate Prof. Qun Liu for discussion on semantic similarity.

• Princeton Univ. for providing WordNet

Thanks for your attention!