
Page 1:

Extraction

Chapter 3 in Automatic Summarization

한경수 (Kyoung-Soo Han)
2001-11-08
Natural Language Processing Lab, Korea University

Page 2:

Contents

Introduction
The Edmundsonian paradigm
Corpus-based sentence extraction
– General considerations
– Aspects of learning approaches
Coherence of extracts
Conclusion

Page 3:

Extraction

Extraction (discussed here)

Analysis phase dominates. This analysis is relatively shallow. Discourse level information, if used at all, is mostly for …

– establishing coreference between proper names
– pronoun resolution

Extraction is not appropriate for every summarization task.
At high compression rates

– extraction seems less likely to be effective, unless some pre-existing highly compressed summary material is found.

In multi-document summarization
– both differences and similarities between documents need to be characterized.
Human abstractors produce abstracts, not extracts.

Introduction

Page 4:

Extraction element

The basic unit of extraction is the sentence.
Practical reason for preferring the sentence to the paragraph

– It offers better control over compression.
Linguistic motivation

– Sentence has historically served as a prominent unit in syntactic and semantic analysis.

– Logical accounts of meaning offer precise notions of sentential meaning.

o Sentences can be represented in a logical form, and taken to denote propositions.

Extraction of elements below the sentence level
– The extracts will often be fragmentary in nature.

The sentence seems a natural unit to consider in the general case.

Introduction

Page 5:

Classic work of Edmundson (1969)

Used a corpus of 200 scientific papers on chemistry.
– Each paper was between 100 and 3,900 words long.
Target extracts were prepared manually.

Features

Title words

– Words from the title, subtitles, and headings
– given a hand-assigned weight

Cue words
– Extracted from the training corpus based on selection ratio

o Selection ratio = # of occurrences in extract / # of occurrences in all sentences of the corpus

– Bonus words
o Evidence for selection: above an upper selection ratio threshold
o comparatives, superlatives, adverbs of conclusion, value terms, relative interrogatives, causality terms
– Stigma words
o Evidence for non-selection: below a lower selection ratio cutoff
o anaphoric expressions, belittling expressions, insignificant-detail expressions, hedging expressions
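
Since the selection ratio is just a frequency ratio over a labeled corpus, the cue-word extraction step can be sketched minimally as below; the thresholds and the corpus representation are illustrative assumptions, not Edmundson's actual values.

```python
from collections import Counter

def cue_word_lists(corpus_sentences, extract_sentences, upper=0.5, lower=0.1):
    """Classify corpus words into bonus/stigma cue lists by selection ratio.

    corpus_sentences: all tokenized sentences of the training corpus
    extract_sentences: the subset of those sentences appearing in the manual extracts
    upper, lower: illustrative selection-ratio thresholds (assumed, not Edmundson's)
    """
    total = Counter(w for s in corpus_sentences for w in s)
    in_extracts = Counter(w for s in extract_sentences for w in s)

    bonus, stigma = [], []
    for word, n in total.items():
        ratio = in_extracts[word] / n   # selection ratio
        if ratio > upper:
            bonus.append(word)
        elif ratio < lower:
            stigma.append(word)
    return bonus, stigma
```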

The Edmundsonian paradigm

Page 6:

Classic work of Edmundson (1969)

Features (continued)

Key words
– The word frequencies were tabulated in descending order

o until a given cutoff percentage of all the word occurrences in the document was reached

– Non-cue words above that threshold were extracted as key words.
– Each word's weight is its frequency in the document.

Sentence location
– Heading weight

o A short list of particular section headings was constructed.
• Like "Introduction" and "Conclusion"

o Sentences occurring under such headings were assigned a positive weight.

– Ordinal weight
o Sentences were assigned weights based on their ordinal position.
o If they occurred in the first or last paragraph, or if they were the first or last sentences of paragraphs, they were assigned a positive weight.

The Edmundsonian paradigm

Page 7:

Classic work of Edmundson (1969)

Sentence scoring

Based on a linear function of the weights of each feature

Edmundson adjusted the feature weights and the tuning parameters by hand

– by feedback from comparisons against manually created training extracts

Evaluation

Key words were poorer than the other three features.
The combination cue-title-location was the best.

– The best individual feature: location, the worst: key words

The Edmundsonian paradigm

W(s) = α·C(s) + β·K(s) + γ·L(s) + δ·T(s), where C, K, L, and T are the cue, key word, location, and title weights of sentence s, and α, β, γ, δ are tunable parameters.
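
A minimal sketch of this linear scoring and top-k selection is given below; the feature functions, weights, and toy sentences are illustrative placeholders, not Edmundson's actual features or parameter values.

```python
def edmundson_score(sent_idx, sentences, features, weights):
    """W(s) = sum of weight * feature value, one term per feature (cue, key, location, title, ...)."""
    return sum(weights[name] * f(sent_idx, sentences) for name, f in features.items())

def extract(sentences, features, weights, k):
    """Pick the k top-scoring sentences and restore document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: edmundson_score(i, sentences, features, weights),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

# Toy usage with illustrative (assumed) feature functions:
features = {
    "location": lambda i, sents: 1.0 if i == 0 or i == len(sents) - 1 else 0.0,
    "key":      lambda i, sents: len(sents[i].split()) / 25.0,
}
weights = {"location": 1.0, "key": 0.5}
print(extract(["First sentence.", "Middle filler sentence here.", "Last sentence."],
              features, weights, k=2))
```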

Page 8:

Feature reinterpretation: cue words

Cue words → cue phrases

Cue phrases

Expressions
– like "I conclude by", "this paper is concerned with", …

Bonus words, stigma words
In-text summary cues (indicator phrases)

– E.g. beginning with “in summary”

Useful for specific technical domains
Indicator phrases can be extracted by a pattern-matching process.
Black (1990): p. 49 example

The Edmundsonian paradigm

Page 9:

Feature reinterpretation: key words

Key words → presence of thematic term features

Selected based on term frequency
Including the key words of Edmundson

Thematic term assumption
– Relatively more frequent terms are more salient.
Luhn (1958)

– Find content words in a document by filtering against a stoplist of function words

– Arrange them by frequency
– Suitable high-frequency and low-frequency cutoffs were estimated from a collection of articles and their abstracts.
A variant of the thematic term assumption: tf*idf
– Its use in automatic summarization is somewhat less well-motivated.
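
For reference, one common tf*idf formulation (the exact variant a given summarizer uses may differ) can be sketched as follows.

```python
import math
from collections import Counter

def tfidf(doc_tokens, collection):
    """tf*idf weights for the terms of one document.

    doc_tokens: token list for the target document
    collection: list of token lists, one per document (the target document included)
    """
    tf = Counter(doc_tokens)
    df = Counter()
    for tokens in collection:
        df.update(set(tokens))          # document frequency
    n = len(collection)
    return {t: tf[t] * math.log(n / df[t]) for t in tf}
```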

The Edmundsonian paradigm

Page 10:

Feature reinterpretation: location

Baxendale (1958)

Found that important sentences were located at the beginning or end of paragraphs.
Salient sentences were likely to occur as …
– the first sentence in the paragraph 85% of the time
– the last sentence 7% of the time

Brandow et al. (1995)

Compared their thematic-term-based extraction system for news (ANES) against Searchable Lead, a system which just output sentences in order.
Searchable Lead outperformed ANES
– Acceptable 87% to 96% of the time
– Unacceptable cases

o anecdotal, human-interest style lead-ins, documents that contained multiple news stories, stories with unusual structural/stylistic features, …

The Edmundsonian paradigm

Page 11:

Feature reinterpretation: location

Lin & Hovy (1997)

Defined the Optimal Position Policy (OPP).
OPP

– A list of positions in the text in which salient sentences were likely to occur.

For 13,000 Ziff-Davis news articles
– Title, 1st sentence of 2nd paragraph, 1st sentence of 3rd paragraph, …

For the Wall Street Journal
– Title, 1st sentence of 1st paragraph, 2nd sentence of 1st paragraph, …
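
A minimal sketch of applying such a position policy is shown below; the (paragraph, sentence) indexing and the example policy are assumptions for illustration, not Lin & Hovy's implementation.

```python
def opp_extract(paragraphs, policy, k):
    """Pick up to k sentences following an ordered position policy.

    paragraphs: list of lists of sentences
    policy: ordered list of (paragraph_index, sentence_index) pairs, e.g. [(0, 0), (0, 1)]
    """
    picked = []
    for p, s in policy:
        if p < len(paragraphs) and s < len(paragraphs[p]):
            picked.append(paragraphs[p][s])
        if len(picked) == k:
            break
    return picked

# e.g. a WSJ-style policy (title handled separately): 1st and 2nd sentence of the 1st paragraph
print(opp_extract([["s1a.", "s1b."], ["s2a."]], policy=[(0, 0), (0, 1)], k=2))
```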

The Edmundsonian paradigm

Page 12:

Feature reinterpretation: title

Title words → Add Term

A sentence's weight is assigned based on the terms in it that are also present in the title, article headline, or the user's profile or query.

A user-focused summary
– Relatively heavy weight for …
– Will favor the relevance of the summary to the query or topic.
– Must be balanced against fidelity to the source document.
o Need for the summary to represent information in the document

The Edmundsonian paradigm

Page 13:

Criticism

The Edmundsonian equation is inadequate for summarization for the following reasons:

It extracts only single elements in isolation, rather than extracting sequences of elements.
– Incoherent summaries
– Knowing that a particular sentence has been selected should affect the choice of subsequent sentences.
The compression rate isn't directly referenced in the equation.

– The compression rate should be part of the summarization process, not just an afterthought.

o E.g.
• Most salient concept A – s1, s2
• Next-to-most salient concept B – s3
• One-sentence summary: s3
• Two-sentence summary: s1, s2

The Edmundsonian paradigm

Page 14:

Criticism (continued)

A linear equation may not be a powerful enough model for summarization.
– A non-linear model is required for certain applications.

o Spreading activation between words
o Other probabilistic models

Uses only shallow, morphological-level features for words and phrases in the sentence, along with the sentence’s location.

– There has been a body of work which explores different linear combinations of syntactic, semantic, and discourse-level features.

It is rather ad hoc.
– It doesn't tell us anything theoretically interesting about what makes a summary a summary.

The Edmundsonian paradigm

Page 15:

General considerations

The most interesting empirical work in the Edmundsonian paradigm has used some variant of Edmundson's equation, leveraging a corpus to estimate the weights.

Basic methodology for a corpus-based approach to sentence extraction
– Figure 3.1 (p. 54)

Corpus based sentence extraction

Page 16:

Labeling

A training extract is also preferred to a training abstract
– because it is somewhat less likely to vary across human summarizers.

Producing an extract from an abstract
Mani & Bloedorn (1998)
– Treat the abstract as a query.
– Rank the sentences by similarity to the abstract.

o Combined match
• Each source sentence is matched against the entire abstract treated as a single sentence.
• Equation 3.2 (p. 56)
o Individual match
• Each source sentence is compared against each sentence of the abstract.
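
As a rough illustration of the combined-match labeling idea, the sketch below uses plain word overlap as the similarity measure; that choice, and labeling the top-k sentences as positive, are assumptions rather than the actual Equation 3.2.

```python
def combined_match_labels(source_sentences, abstract_sentences, top_k):
    """Label the top_k source sentences most similar to the whole abstract as positive examples."""
    abstract_words = set(w for s in abstract_sentences for w in s.lower().split())

    def overlap(sentence):
        words = set(sentence.lower().split())
        return len(words & abstract_words) / (len(words) or 1)

    ranked = sorted(range(len(source_sentences)),
                    key=lambda i: overlap(source_sentences[i]), reverse=True)
    positive = set(ranked[:top_k])
    return [i in positive for i in range(len(source_sentences))]
```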

Corpus based sentence extraction

Page 17:

Labeling

Producing an extract from an abstract (continued)

Marcu (1999)
– Prunes away the clause of the source that is least similar to the abstract.

Jing & McKeown (1999)
– Word-sequence alignment using an HMM
– Refer to section 3 in Kyoung-Soo's Technical Note KS-TN-200103

This can result in a score for each sentence
– Yes/no label
– Or the labeling can be left as a continuous function.

Corpus based sentence extraction

Page 18:

Learning representation

The result of learning can be represented as …
– Rules
– Mathematical functions

If a human is to trust a machine's summaries
– The machine has to have some way of explaining why it produced the summary it did.
– Logical rules are usually preferred to mathematical functions.

Corpus based sentence extraction

Page 19:

Compression & Evaluation

Compression
Typically, compression is applied at testing time.
It is possible to train a summarizer for a particular compression rate.
– Different feature combinations may be used for different compression rates.

Evaluation
Precision, recall, accuracy, F-measure
– Table 3.1 / 3.2 (p. 59)
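
Since an extract is just a set of selected sentences, these measures reduce to simple set arithmetic; a minimal sketch:

```python
def extract_scores(predicted, reference, n_sentences):
    """Precision, recall, F-measure, and accuracy for sentence extraction.

    predicted, reference: sets of selected sentence indices
    n_sentences: total number of sentences in the document
    """
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    tn = n_sentences - len(predicted | reference)   # sentences correctly left out
    accuracy = (tp + tn) / n_sentences
    return precision, recall, f1, accuracy
```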

Corpus based sentence extraction

Page 20:

Aspects of learning approaches

Sentence extraction as Bayesian classification

Kupiec et al. (1995)
188 full-text/summary pairs
– drawn from 21 different collections of scientific articles
– Each summary was written by a professional abstractor and was 3 sentences long on average.
Features

– Sentence length, presence of fixed cue phrases, location, presence of thematic terms, presence of proper names

Bayesian classifier (Equation 3.4, p. 60)
Producing an extract from the abstract
– Direct match (79%)
o identical, or considered to have the same content

– Direct join (3%)
o two or more document sentences appear to have the same content as a single summary sentence.
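
A minimal sketch of a Kupiec-style naive Bayes classifier over binary features is given below, assuming feature independence given the class; the add-one smoothing and boolean feature encoding are illustrative choices, not details taken from the paper.

```python
def train_naive_bayes(examples):
    """Estimate the prior P(in summary) and P(feature is true | class) with add-one smoothing.

    examples: list of (feature_dict, in_summary_bool) pairs; feature values are booleans.
    """
    prior_pos = sum(1 for _, label in examples if label) / len(examples)
    feature_names = {name for feats, _ in examples for name in feats}
    cond = {}
    for name in feature_names:
        true_counts = [1, 1]      # [negative class, positive class], add-one smoothed
        class_counts = [2, 2]
        for feats, label in examples:
            cls = 1 if label else 0
            class_counts[cls] += 1
            if feats.get(name, False):
                true_counts[cls] += 1
        cond[name] = (true_counts[0] / class_counts[0], true_counts[1] / class_counts[1])
    return prior_pos, cond

def posterior(features, prior_pos, cond):
    """P(sentence in summary | features), combining per-feature likelihoods naively."""
    p_pos, p_neg = prior_pos, 1.0 - prior_pos
    for name, (p_true_neg, p_true_pos) in cond.items():
        value = features.get(name, False)
        p_pos *= p_true_pos if value else (1.0 - p_true_pos)
        p_neg *= p_true_neg if value else (1.0 - p_true_neg)
    return p_pos / (p_pos + p_neg)
```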

Corpus based sentence extraction

Page 21:

Aspects of learning approaches

Sentence extraction as Bayesian classification (cont'd)

Evaluation

– 43% recall
– As the summaries were lengthened, performance improved.

o 84% recall at 25% of the full text length

– Location was the best individual feature.
– Location + cue phrase + sentence length was the best combination.

Corpus based sentence extraction

Page 22:

Aspects of learning approaches

Classifier combination

Myaeng & Jang (1999)
– Tagged each sentence in the Introduction and Conclusion sections
o according to whether the sentence represented …
• Background
• Main theme
• Explanation of the document structure
• Description of future work

– 96% of the summary sentences were main-theme sentences.
– Training method

o Used a Bayesian classifier to determine whether a sentence belonged to the main theme
o Combined evidence from multiple Bayesian feature classifiers using voting
o Applied a filter to eliminate redundant sentences.
– Evaluation

o Cue words + location + title words was the best combination
o Suggests that the Edmundsonian features are not language-specific.

Corpus based sentence extraction

Page 23:

Aspects of learning approaches

Term aggregation

In a document about a certain topic,
– there will be many references to that topic.
– The references need not result in verbatim repetition.
o synonyms, more specialized words, related terms, …
Aone et al. (1999)

– Different methods of term aggregation can impact summarization performance.

o Treat morphological variants, synonyms, name aliases as instances of the same term.

– Performance can be improved
o when place names and organization names are identified as terms,
o and when person names are filtered out
o Reason: document topics are generally not about people.
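
A minimal sketch of term aggregation before frequency counting is shown below; the alias/synonym table is an illustrative stand-in for whatever morphological analysis and name recognition a real system would use.

```python
from collections import Counter

def aggregate_terms(tokens, canonical):
    """Map each token to a canonical term (morphological variant, synonym, or name alias)
    before counting, so all mentions of a topic reinforce a single term."""
    return Counter(canonical.get(t.lower(), t.lower()) for t in tokens)

# Illustrative aggregation table (assumed, not from Aone et al.)
canonical = {"summaries": "summary", "summarizing": "summary",
             "u.s.": "united_states", "usa": "united_states"}
print(aggregate_terms("Summaries of USA news while summarizing U.S. policy".split(), canonical))
```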

Corpus based sentence extraction

Page 24:

Aspects of learning approaches

Topic-focused summaries

Lin (1999)
– Used a corpus called the Q&A corpus
o 120 texts (4 topics * 30 relevant docs/topic)
o Human-created, topic-focused passage-extraction summaries

– Features
o Add-Term: query terms

• Sentences are weighted based on the number of query terms they contained.

o Additional relevance feature
• Relevance feedback weight for terms that occurred in documents most relevant to the topic.

o Presence of proper names, sentence length
o Cohesion features

• Number of terms shared with other sentences

o Numerical expression, pronoun, adjective, reference to specific weekdays or months, presence of quoted speech

Corpus based sentence extraction

Page 25:

Aspects of learning approaches

Topic-focused summaries (continued)

Lin (1999) (continued)
– Feature combination
o Naïve combination with each feature given equal weight
o Decision tree learner

– Naïve method outperformed the decision tree learner on 3 out of 4 topics.

– The baseline method (based on sentence order) also performed well on all topics.

Corpus based sentence extraction

Page 26:

Aspects of learning approaches

Topic-focused summaries (continued)

Mani & Bloedorn (1998)
– Cmp-lg corpus: a set of 198 pairs of full-text docs/abstracts
– Labeling

o The overall information need for a user was defined by a set of docs.
o A subject was told to pick a sample of 10 docs that matched his interests.
o Top content words were extracted from each doc.
o Words for the 10 docs were sorted by their scores.
o All words more than 2.5 standard deviations above the mean of these words' scores were treated as a representation of the user's interest, or topic.
• There were 72 such words.
o Relevance match
• Used spreading activation based on cohesion information to weight word occurrences in the document related to the topic.
• Each sentence was weighted based on the average of its word weights.
• The top C% of these sentences were picked as positive examples.
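
The topic-keyword selection step reduces to a z-score cutoff over word scores; a minimal sketch follows, where the content-word scoring itself is assumed to be done elsewhere and simply passed in.

```python
import statistics

def topic_keywords(word_scores, z_cutoff=2.5):
    """Keep words whose score is more than z_cutoff standard deviations above the mean.

    word_scores: dict mapping content words from the user's sample docs to their scores.
    """
    scores = list(word_scores.values())
    mean = statistics.mean(scores)
    stdev = statistics.pstdev(scores)
    return [w for w, s in word_scores.items() if s > mean + z_cutoff * stdev]
```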

Corpus based sentence extraction

Page 27:

Aspects of learning approaches

Topic-focused summaries (continued)

Mani & Bloedorn (1998) (continued)
– Features
o 2 additional user-interest-specific features

• Number of reweighted words (topic keywords) in the sentence
• Number of topic keywords / number of content words in the sentence
• Specific topic keywords weren't used as features, since it is preferable to learn rules that can transfer across user interests.
• Topic keywords are similar to 'relevance feedback' terms in Lin's study.

o Location, thematic features
o Cohesion features

• Synonymy: judged by using WordNet
• Statistical cooccurrence: scores between content words i and j up to 40 words apart were computed using mutual information.
• Equation 3.5 (p. 65)
• The association table only stores scores for tf counts greater than 10 and association scores greater than 10.
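
Since Equation 3.5 is not reproduced here, the sketch below uses standard pointwise mutual information over windowed co-occurrence counts as a stand-in; the 40-word window and the count/score thresholds echo the slide, but note they were defined for the book's association score, not necessarily for raw PMI.

```python
import math
from collections import Counter

def association_table(tokens, window=40, min_tf=10, min_score=10.0):
    """Pointwise mutual information between content words co-occurring within `window` words.

    Keeps only pairs whose word frequencies exceed min_tf and whose score exceeds min_score,
    mirroring the association-table thresholds described above (values are illustrative here).
    """
    tf = Counter(tokens)
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + 1 + window]:
            pairs[(w, v)] += 1
    n = len(tokens)
    table = {}
    for (w, v), c in pairs.items():
        if tf[w] > min_tf and tf[v] > min_tf:
            score = math.log2((c / n) / ((tf[w] / n) * (tf[v] / n)))
            if score > min_score:
                table[(w, v)] = score
    return table
```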

Corpus based sentence extraction

Page 28:

Aspects of learning approaches

Topic-focused summaries (continued)

Mani & Bloedorn (1998) (continued)
– Evaluation
o In user-focused summaries, the number of topic keywords in a sentence was the single most influential feature.
o The cohesion features contributed the least,

• Perhaps because the cohesion calculation was too imprecise.

– Some sample rules (Table 3.4, p. 66)
o The learned rules are highly intelligible, and can perhaps be edited in accordance with human intuitions.
o The discretization of the features degraded performance by about 15%.
• There is a tradeoff between accuracy and transparency.

Corpus based sentence extraction

Page 29:

Aspects of learning approaches

Case study: Noisy channel model

There has been a surge of interest in language-modeling approaches to summarization (Berger & Mittal 2000).
The problem of automatic summarization as a translation problem
– translating between a verbose language (of source documents) and a succinct language (of summaries)

– This idea is related to the notion of the abstractor reconstructing the author’s ideas in order to produce a summary.

Generic summarization

Corpus based sentence extraction

Noisy channel view: summary s → [noisy channel P(d|s)] → document d → [decoder] → s*

s* = argmax_s P(s|d) = argmax_s P(d|s) · P(s)

Page 30:

Aspects of learning approaches

Case study: Noisy channel model (continued)

User-focused summarization

– fidelity

– relevance

Corpus based sentence extraction

s* = argmax_s P(s|d,q)
   = argmax_s P(q|s,d) · P(s|d)
   ≈ argmax_s P(q|s) · P(s|d)

– P(q|s): relevance term
– P(s|d): fidelity term
– Each term is modeled as a product of smoothed unigram probabilities, taken over the terms of the query and of the summary respectively (a mixture of a local language model and a background word model; see the book's equations).
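
A minimal sketch of ranking candidate extracts by relevance × fidelity with simple add-one-smoothed unigram models is given below; the smoothing and the toy data are illustrative stand-ins for the book's exact mixture models.

```python
from collections import Counter

def unigram_lm(tokens, vocab):
    """Add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def noisy_channel_score(summary, document, query):
    """score(s) = P(q|s) * P(s|d): relevance under the summary's LM times fidelity under the document's LM."""
    vocab = set(summary) | set(document) | set(query)
    p_doc = unigram_lm(document, vocab)      # fidelity model
    p_sum = unigram_lm(summary, vocab)       # relevance model
    fidelity = 1.0
    for w in summary:
        fidelity *= p_doc[w]
    relevance = 1.0
    for w in query:
        relevance *= p_sum[w]
    return relevance * fidelity

# Rank candidate extracts for a query (toy example)
doc = "the noisy channel model treats summarization as translation".split()
candidates = [["noisy", "channel", "model"], ["summarization", "as", "translation"]]
query = ["channel", "model"]
print(max(candidates, key=lambda s: noisy_channel_score(s, doc, query)))
```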

Page 31:

Aspects of learning approaches

Case study: Noisy channel model (continued)

Training
– Used FAQ pages on the WWW
o Each lists a sequence of question-answer pairs (10,395 pairs in total)
o Culled from 201 Usenet FAQs and 4 call-center FAQs
o View each answer as the query-focused summary of its document

Evaluation
– Assigns the correct summary, on average, a rank of …
o 1.41 for the Usenet data
o 4.3 for the call-center data

Criticism
– The noisy channel model is appealing

o Because it decomposes the summarization problem for generic and user-focused summarization in a theoretically interesting way

– However, the model tends to rely on large quantities of training data.

Corpus based sentence extraction

Page 32:

Aspects of learning approaches

Conclusion

The corpus-based approach to sentence extraction is attractive because …
– It allows one to tune the summarizer to the characteristics of the corpus or genre of text.
– It is well-established.
– It can learn interesting and often quite intelligible rules.
But,

– Lots of design choices and parameters are involved in training.
Issues

– How is the training to be utilized in an end application?
– Learning sequences of sentences to extract deserves more attention.
– Evaluation

Corpus based sentence extraction

Page 33:

Coherence of extracts

When extracting sentences from a source,
– an obvious problem is preserving context.
– Picking sentences out of context can result in incoherent summaries.
Coherence problems

Dangling anaphors
– If an anaphor is present in a summary extract, the extract may not be entirely intelligible if the referent isn't included as well.

Gaps
– Breaking the connection between the ideas in a text can cause problems.
Structured environments

– Itemized lists, tables, logical arguments, etc., cannot be arbitrarily divided.

Page 34:

Conclusion

Abstracts vs. extracts

The most important aspect of an abstract …
– is not so much that it paraphrases the input in its own words,
– but that some level of abstraction of the input has been carried out

o Providing a degree of compression
o Requires knowledge of the meaning of the information talked about
o And the ability to make inferences at the semantic level

Extraction methods
– While knowledge-poor, they are not entirely knowledge-free.
– Knowledge about a particular domain is represented
o in terms of features specific to that domain
o in the particular rules or functions learned for that domain

– The knowledge here is entirely internal.
There is a fundamental limitation to the capabilities of extraction systems.
– Current attention is focused on the opportunity to exploit compression more effectively by producing abstracts automatically.