Hebrew Sentence Compression Graduation project by: Parush Anat & Grisha Klots http://www.cs.bgu.ac.il/~klotsg Supervised by: Yoav Goldberg & Dr. Michael Elhadad Nov. 2010, CS BGU.


Page 1: Hebrew Sentence Compression

Hebrew Sentence Compression

Graduation project by: Parush Anat & Grisha Klots
http://www.cs.bgu.ac.il/~klotsg

Supervised by: Yoav Goldberg & Dr. Michael Elhadad

Nov. 2010, CS BGU.

Page 2: Hebrew Sentence Compression

A short example…

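A “long” sentence may look like:
אתמול בשעה שש איה אפתה עוגת תפוחים טעימה ואני אכלתי חתיכה קטנה ממנה
(“Yesterday at six o'clock Aya baked a tasty apple cake and I ate a small piece of it”)

And can be compressed to:
אתמול איה אפתה עוגת תפוחים ואני אכלתי ממנה
(“Yesterday Aya baked an apple cake and I ate some of it”)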

Page 3: Hebrew Sentence Compression

Sentence Compression – Why & How?
Motivations, implementation and some theoretical background

• Automatic text summarization (academic, technical, text books and so on…)
• A sentence-by-sentence approach – each sentence is compressed individually
• Our method is based on word deletion to generate a shorter “version” of the sentence

Page 4: Hebrew Sentence Compression

Work process

Our work consisted of two main phases:
1. Corpus Generation – developed in Python
2. Algorithm Implementation – developed in Java

• We implemented the algorithm developed by Ryan McDonald and described in his paper “Discriminative Sentence Compression with Soft Syntactic Evidence”
• First time implemented for Hebrew!

Page 5: Hebrew Sentence Compression

Phase #1
Sentence Generation – General

• Sentences were extracted from the “Haaretz” web-site
• A scoring method was employed to find pairs of “full” and “short” sentences
• All pairs were grouped into a single database (a large XML-like file)
• The XML file was scanned again and filtered for irregularities and for words that are not in the Hebrew lexicon
• The final output is formatted to a predefined structure that serves as the input for the 2nd phase – Algorithm Implementation

Page 6: Hebrew Sentence Compression

[Figure: a sample article page with its main headline, sub-headline and body marked]

Page 7: Hebrew Sentence Compression

Phase #1
Sentence Generation – Extraction (Extractor)

• A scoring method ensures that only the best matches are returned (the matching percentage varies)
• At this stage, we allowed clauses to change their relative place in the different versions of the sentence (the number of changes also varies)

Page 8: Hebrew Sentence Compression

Phase #1
Sentence Generation – Filtration

• Due to strict rules imposed by the Algorithm Implementation, much filtration has to be performed
• Each word that appears in the short version must appear in the long one as well
• No clause may change its position between the two sentences of a pair

More than 90% of pairs fail this test!
(The initial size of the DB was ~4500 sentences. Filtration left us with only ~400 sentences to work with.)
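The two filtration rules amount to requiring that the short sentence be an in-order subsequence of the long one. The following is a minimal sketch of such a check, not the project's actual filter; the function name `kept_indices` and the greedy leftmost matching are assumptions made for illustration.

```python
def kept_indices(short_words, long_words):
    """Return the 0-based positions in long_words that spell out short_words
    in order, or None if the pair violates the filtration rules."""
    indices = []
    pos = 0
    for word in short_words:
        # Scan forward for the next occurrence of this word.
        while pos < len(long_words) and long_words[pos] != word:
            pos += 1
        if pos == len(long_words):
            # Word is missing, or appears only before an already matched word.
            return None
        indices.append(pos)
        pos += 1
    return indices


if __name__ == "__main__":
    long_sentence = "yesterday at six Aya baked a tasty cake".split()
    print(kept_indices("yesterday Aya baked a cake".split(), long_sentence))
    print(kept_indices("Aya yesterday baked".split(), long_sentence))
```

A pair is kept only when the function returns a list; a reordered or reworded short sentence yields `None` and would be filtered out.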

Page 9: Hebrew Sentence Compression

Phase #1
Sentence Generation – Algorithm Input Formatting

Example (short sentence, its POS tags, long sentence, its POS tags, 0-based indices of the long-sentence words that are kept):

השוטר כופר בכל ההאשמות
NN VB DTT NN
עו"ד מיכאל בוסקילה המייצג את השוטר אמר כי החשוד כופר בכל ההאשמות נגדו
TTL NNP NNP BN AT NN VB CC NN VB DTT NN IN
5 9 10 11
------------------END------------------
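A reader for this input format could look roughly like the sketch below. It assumes each cluster is exactly five non-empty lines followed by a dashed END line, as in the single example on the slide; the names `Cluster` and `parse_clusters` are hypothetical and not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    short_words: list
    short_tags: list
    long_words: list
    long_tags: list
    kept: list  # 0-based positions of the kept words in long_words


def parse_clusters(text):
    """Split the input text into clusters (assumed layout: 5 lines + END line)."""
    clusters, block = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.strip("-") == "END":
            if len(block) != 5:
                raise ValueError(f"expected 5 lines per cluster, got {len(block)}")
            short_w, short_t, long_w, long_t, idx = (part.split() for part in block)
            clusters.append(Cluster(short_w, short_t, long_w, long_t,
                                    [int(i) for i in idx]))
            block = []
        else:
            block.append(line)
    return clusters


SAMPLE = """השוטר כופר בכל ההאשמות
NN VB DTT NN
עו"ד מיכאל בוסקילה המייצג את השוטר אמר כי החשוד כופר בכל ההאשמות נגדו
TTL NNP NNP BN AT NN VB CC NN VB DTT NN IN
5 9 10 11
------------------END------------------
"""

if __name__ == "__main__":
    cluster = parse_clusters(SAMPLE)[0]
    # The kept indices should pick the short sentence out of the long one.
    print([cluster.long_words[i] for i in cluster.kept] == cluster.short_words)
```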

Page 10: Hebrew Sentence Compression

Phase #2
Algorithm Implementation

Page 11: Hebrew Sentence Compression

Phase #2
Algorithm Implementation

The “heart” of the algorithm – Dynamic Programming:
Compress: long sentence x, requested length → short sentence

$C[i,j] = \max_{k<i} \{\, C[k,j-1] + S(x,k,i) \,\}$

• i – the i-th word from the long sentence
• j – length of the short sentence
• C[i,j] – maximum score for a short sentence having the desired length

Page 12: Hebrew Sentence Compression

Phase #2
Algorithm Implementation - basic terminology

• Feature – a string that characterizes a pair of words according to their syntactic analysis, their position in the sentence and the words that are between them.
• For example:
  • pi:pj = getPOS(i):getPOS(j), e.g. “pi:pj = NN:VB”
  • for i<k<j:
    IsNeg = isNeg(getWord(k)), e.g. “IsNeg = True”, “IsNeg = False”
    pi:pk:pj = getPOS(i):getPOS(k):getPOS(j)
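The three templates named above can be sketched as a single function that returns the feature strings for a pair of positions. This is an illustration only: the function name `extract_features` and the contents of `NEGATION_WORDS` are assumptions, since the slides do not say how isNeg decides that a word is a negation.

```python
# Assumption: a small hand-picked set of Hebrew negation words.
NEGATION_WORDS = {"לא", "אין", "אל"}


def extract_features(words, tags, i, j):
    """Return the feature strings for the word pair (i, j), with i < j."""
    features = [f"pi:pj = {tags[i]}:{tags[j]}"]
    for k in range(i + 1, j):
        # One pair of features per word dropped between positions i and j.
        features.append(f"IsNeg = {words[k] in NEGATION_WORDS}")
        features.append(f"pi:pk:pj = {tags[i]}:{tags[k]}:{tags[j]}")
    return features


if __name__ == "__main__":
    words = ["w0", "לא", "w2"]
    tags = ["NN", "CC", "VB"]
    for feature in extract_features(words, tags, 0, 2):
        print(feature)
```

Adjacent words (j = i + 1) produce only the `pi:pj` feature, because there is no k strictly between them.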

Page 13: Hebrew Sentence Compression

Phase #2
Algorithm Implementation - basic terminology (cont.)

• Weights Vector – contains ordered pairs of <Feature, Weight> for all feature instances induced by different pairs of words

For example:
<<“pi:pj = NN:VB”,100>,<“pi:pj = NN:DTT”,5>>

• All the feature templates are hard-coded and predefined.
• The Weights Vector is updated constantly during the learning phase.
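One simple way to hold such a vector is a mapping from feature string to weight. The sketch below uses a plain Python dict with the two example entries from the slide; the dict representation and the helper name `feature_score` are assumptions, not the project's Java data structure.

```python
# The two example entries from the slide; a feature not in the dict has weight 0.
weights = {
    "pi:pj = NN:VB": 100,
    "pi:pj = NN:DTT": 5,
}


def feature_score(features, weights):
    """Sum the weights of the given feature strings (unseen features count as 0)."""
    return sum(weights.get(feature, 0) for feature in features)


if __name__ == "__main__":
    print(feature_score(["pi:pj = NN:VB", "pi:pj = NN:DTT", "IsNeg = False"], weights))
```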

Page 14: Hebrew Sentence Compression

Phase #2
Algorithm Implementation – the Score function

S(x,k,i) returns the sum of the weights of the features extracted for the pair formed by the k-th and i-th words in sentence x

$C[i,j] = \max_{k<i} \{\, C[k,j-1] + S(x,k,i) \,\}$
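The recurrence can be filled in as a table with back-pointers, as in the sketch below. The slides give only the recurrence, so several details here are assumptions: positions are 0-based, a virtual `START` position scores the first kept word, the best final word is chosen by a plain maximum over the last column, and `score(k, i)` stands in for S(x,k,i).

```python
NEG_INF = float("-inf")
START = -1  # assumption: virtual position before the first word


def compress(n, target_len, score):
    """Pick target_len of the n word positions, in order, maximizing the sum of
    score(k, i) over consecutive kept positions. Returns (best_score, kept)."""
    if not 1 <= target_len <= n:
        raise ValueError("target_len must be between 1 and n")
    # C[i][j]: best score of a compression with j words whose last word is i.
    C = [[NEG_INF] * (target_len + 1) for _ in range(n)]
    back = [[None] * (target_len + 1) for _ in range(n)]
    for i in range(n):
        C[i][1] = score(START, i)  # assumed base case
        for j in range(2, min(i + 1, target_len) + 1):
            for k in range(j - 2, i):  # k must leave room for j - 1 words
                candidate = C[k][j - 1] + score(k, i)
                if candidate > C[i][j]:
                    C[i][j] = candidate
                    back[i][j] = k
    # Choose the best last word, then follow the back-pointers.
    end = max(range(target_len - 1, n), key=lambda i: C[i][target_len])
    kept, i = [], end
    for j in range(target_len, 0, -1):
        kept.append(i)
        i = back[i][j]
    kept.reverse()
    return C[end][target_len], kept


if __name__ == "__main__":
    good_pairs = {(START, 0), (0, 2), (2, 4)}
    toy_score = lambda k, i: 1.0 if (k, i) in good_pairs else 0.0
    print(compress(5, 3, toy_score))
```

With three nested loops the table costs on the order of n² · target_len score evaluations, which is small for single sentences.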

Page 15: Hebrew Sentence Compression

Phase #2
Learning – Dynamic Programming (Part 1)

• For each cluster from the input file, we iterate over the list of indices and, for each two adjacent indices, we extract their feature list. For example, the list

5 9 10 11

yields the adjacent pairs (5, 9), (9, 10) and (10, 11).

• For each feature in each list, we increase its weight by 1 in the Weights Vector

Page 16: Hebrew Sentence Compression

Phase #2
Learning – Dynamic Programming (Part 2)

• Now, we compress the long sentence to a new sentence having the length of the short one
• From the compressed sentence, we generate a new list of indices (as shown before) and extract the lists of features
• For each feature in each list, we decrease its weight by 1
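The two learning steps can be sketched as a reward pass over the reference indices followed by a penalty pass over the indices of the algorithm's own compression, which resembles a perceptron-style update. In this sketch the helper names, the single `pi:pj` template and the predicted index list are illustrative assumptions; the tags and the reference indices are those of the input-format example.

```python
from collections import Counter


def adjacent_pairs(indices):
    """[5, 9, 10, 11] -> [(5, 9), (9, 10), (10, 11)]"""
    return list(zip(indices, indices[1:]))


def adjust(weights, indices, features, delta):
    """Add delta to the weight of every feature of every adjacent index pair."""
    for i, j in adjacent_pairs(indices):
        for feature in features(i, j):
            weights[feature] += delta


tags = "TTL NNP NNP BN AT NN VB CC NN VB DTT NN IN".split()
pos_pair = lambda i, j: [f"pi:pj = {tags[i]}:{tags[j]}"]  # only one template here

weights = Counter()
adjust(weights, [5, 9, 10, 11], pos_pair, +1)  # Part 1: reward the reference
predicted = [0, 9, 10, 11]                     # hypothetical compressor output
adjust(weights, predicted, pos_pair, -1)       # Part 2: penalize the prediction

if __name__ == "__main__":
    print(dict(weights))
```

Features shared by the reference and the prediction cancel out, so only the disagreeing pairs change the Weights Vector.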

Page 17: Hebrew Sentence Compression

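Results & Discussion
A short example

• Long sentence:
כמו בפרשת ההכרה בישראל כמדינת העם היהודי העיתוי של החלטת הממשלה בדבר הענקת מעמד מיוחד לבירה מצטייר כפרובוקציה מכוונת
(“As in the affair of the recognition of Israel as the state of the Jewish people, the timing of the government's decision on granting a special status to the capital appears as a deliberate provocation”)
• Short sentence (original):
העיתוי של החלטת הממשלה בדבר הענקת מעמד מיוחד לבירה מצטייר כפרובוקציה מכוונת
(“The timing of the government's decision on granting a special status to the capital appears as a deliberate provocation”)
• Short sentence (algorithm):
העיתוי של החלטת הממשלה בדבר הענקת מעמד מיוחד לבירה מצטייר כפרובוקציה מכוונת
(identical to the original short sentence)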

Page 18: Hebrew Sentence Compression

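Results & Discussion
And another one…

• Long sentence:
למשרד המשפטים 60 יום לערער על ההחלטה שמציבה דילמה בפני ממשל אובמה
(“The Justice Department has 60 days to appeal the decision, which poses a dilemma for the Obama administration”)
• Short sentence (original):
למשרד המשפטים 60 יום לערער על ההחלטה
(“The Justice Department has 60 days to appeal the decision”)
• Short sentence (algorithm):
60 יום לערער על ההחלטה ממשל אובמה
(“60 days to appeal the decision Obama administration”)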

Page 19: Hebrew Sentence Compression

Results & Discussion
A (very) basic results analysis

We analyzed the compression of 50 “unseen” sentences:
• 8% exactly matched the shortened version
• 25% differed by one or two words from the shortened version
• 37% were valid Hebrew sentences
• 43% retained the general notion of the original sentence

Page 20: Hebrew Sentence Compression

Results & Discussion
Future improvements

• Increase DB size!!!
• Increase variety – use other sources of information
• Add more feature templates

Page 21: Hebrew Sentence Compression

Thank you!