
Segmentation Similarity and Agreement


DESCRIPTION

We propose a new segmentation evaluation metric, called segmentation similarity (S), that quantifies the similarity between two segmentations as the proportion of boundaries that are not transformed when comparing them using edit distance, essentially using edit distance as a penalty function and scaling penalties by segmentation size. We propose several adapted inter-annotator agreement coefficients that use S and are suitable for segmentation. We show that S is configurable enough to suit a wide variety of segmentation evaluations, and that it improves upon the state of the art. We also propose using inter-annotator agreement coefficients to evaluate automatic segmenters in terms of human performance. For more information, view the paper and software at: http://nlp.chrisfournier.ca


Page 1: Segmentation Similarity and Agreement

Segmentation Similarity and Agreement

A metric for evaluating automatic and human segmenters

Chris Fournier, Diana Inkpen

School of Electrical Engineering and Computer Science, University of Ottawa

June 4, 2012

Page 2: Segmentation Similarity and Agreement

What is segmentation? (Introduction)

Figure: Baker (1990, pp. 76–77)

Page 3: Segmentation Similarity and Agreement

What is segmentation? (Introduction)

Par.   Topic
1–3    Intro - the search for life in space
4–5    The moon’s chemical composition
6–8    How early earth-moon proximity shaped the moon
9–12   How the moon helped life evolve on earth
13     Improbability of the earth-moon system
14–16  Binary/trinary star systems make life unlikely
17–18  The low probability of nonbinary/trinary systems
19–20  Properties of earth’s sun that facilitate life
21     Summary

Figure: Hyp. segmentation (Hearst 1997, p. 33)

Page 4: Segmentation Similarity and Agreement

Why do we segment? (Introduction)

To model topical shifts, aiding:

• Video and audio retrieval (Franz et al. 2007)

• Question answering (Oh et al. 2007)

• Subjectivity analysis (Stoyanov & Cardie 2008)

• Automatic summarization (Haghighi & Vanderwende 2009)

Page 5: Segmentation Similarity and Agreement

Types of segmentation (Introduction)

Linear: s1 = (3, 2, 3, 1)

Hierarchical: [Tree diagram: a segment of mass 5 splits into segments of mass 3 and 2; the mass-3 segment splits into (1, 1, 1) and the mass-2 segment into (1, 1)]

Page 6: Segmentation Similarity and Agreement

Automatic segmentation (Introduction)

Many automatic segmenters exist:

• TextTiling (Hearst 1997)

• Minimum Cut segmenter (Malioutov & Barzilay 2006)

• Bayesian segmenter (Eisenstein & Barzilay 2008)

• Affinity Propagation for Segmentation (Kazantseva & Szpakowicz 2011)

Page 7: Segmentation Similarity and Agreement

Problem: selecting a segmenter (Introduction)

How do we select the best-performing segmenter for a task?

• Ideally, evaluate performance in situ
  • Evaluate end-task performance while varying segmenters
  • Attain ecological validity¹
    • “. . . the ability of experiments to tell us how real people operate in the real world” (Cohen 1995, p. 102)

• This is time consuming and expensive

¹ For an example study, see McCallum et al. (2012)

Page 8: Segmentation Similarity and Agreement

Problem: selecting a segmenter (Introduction)

How do we select the best-performing segmenter for a task less expensively?

1. Identify/collect manual segmentations

2. Verify their reliability

3. Train an automatic segmenter

4. Compare automatic and manual segmentations using a metric

Page 9: Segmentation Similarity and Agreement

Focus (Introduction)

We focus on comparing segmentations to evaluate:

• Manual segmentation reliability

• Automatic segmenter performance

Page 10: Segmentation Similarity and Agreement

Why is this comparison difficult? (Difficulty)

Difficulty arises because:

• There is no one “true” segmentation
  • Low manual agreement (Hearst 1997)
  • Coders disagree on granularity (Pevzner & Hearst 2002)

• Few boundaries to agree upon (Hearst 1993, p. 6)

• Near misses often occur between boundaries

Page 11: Segmentation Similarity and Agreement

No one “true” segmentation (Difficulty)

Figure: 7 manual codings collected by Hearst (1997) of Stargazers Look for Life (Baker 1990)

Page 12: Segmentation Similarity and Agreement

Near misses (Difficulty)

[Plot: number of misses (full vs. near) versus the distance considered a near miss, in PBs]

Figure: S of Kazantseva & Szpakowicz (2012)

Page 13: Segmentation Similarity and Agreement

Existing evaluation metrics (Evaluation Metrics)

Existing segmentation evaluation metrics:

• Precision, Recall, Fβ-measure
  • Do not discount near misses

• Pk (Beeferman & Berger 1999)
  • Window-based near-miss accounting
  • Not stable (Pevzner & Hearst 2002)

• WindowDiff (Pevzner & Hearst 2002)
  • A substantial modification of Pk
  • More stable (Pevzner & Hearst 2002)
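Since WindowDiff is the baseline that S is later compared against, a minimal Python sketch may help. This is our own illustrative implementation (the function name and the mass-sequence input format are assumptions, and window-edge conventions vary between implementations):

    def window_diff(ref, hyp, k=None):
        """WindowDiff (Pevzner & Hearst 2002), sketched for mass-sequence
        inputs such as (3, 2, 3, 1): slide a k-unit window over both
        segmentations and count windows that disagree on how many
        boundaries fall inside."""
        def bounds(masses):
            position, out = 0, set()
            for mass in masses[:-1]:
                position += mass
                out.add(position)
            return out

        n = sum(ref)  # total mass; assumed equal to sum(hyp)
        if k is None:
            # conventional choice: half the mean reference segment size
            k = max(2, round(n / (2 * len(ref))))
        ref_b, hyp_b = bounds(ref), bounds(hyp)
        disagreements = 0
        for i in range(n - k):
            # count boundaries at PB positions i+1 .. i+k in each segmentation
            r = sum(1 for p in ref_b if i < p <= i + k)
            h = sum(1 for p in hyp_b if i < p <= i + k)
            if r != h:
                disagreements += 1
        return disagreements / (n - k)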


Page 14: Segmentation Similarity and Agreement

Stability & internal segment sizes (Evaluation Metrics)

[Plot: metric value (1−WD and S) versus internal segment size pairs (20,30), (15,35), (10,40), (5,45)]

Figure: 10 trials of 100 segmentations with FP & FN p = 0.5

Page 15: Segmentation Similarity and Agreement

Common failings (Evaluation Metrics)

Existing segmentation evaluation metrics:

• Require one “true” reference
  • Cannot use multiple manual codings

• Cannot be adapted for agreement
  • Pairwise means must be permuted
  • WD(s1, s2) ≠ WD(s2, s1)

Page 16: Segmentation Similarity and Agreement

A new metric: S (Segmentation Similarity)

Segmentation Similarity (S):

• A new boundary edit distance
  • Edit distance used to penalize error
  • Penalties scaled and normalized in relation to segment mass

S is ideal because it is:

• A minimum edit distance (stable)

• Symmetric (no “true” segmentation)

• Highly configurable

Page 17: Segmentation Similarity and Agreement

Parameters (Segmentation Similarity)

S has three parameters:

• n: the number of PBs considered a near miss (default is 2)

• TE (y/n): whether to use transposition error scaling (default is yes)

• Weights upon error types to reduce their severity (default is 1 PB each)

Page 18: Segmentation Similarity and Agreement

Mass and potential boundaries (Segmentation Similarity)

Segmentations have:

• Potential boundaries separating units

• Mass measured in units

• Types of boundaries

[Diagram: six units with potential boundaries marked at positions 0–6, annotated as the mass sequence (1, 3, 2)]

Figure: Annotation of segmentation mass
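As a concrete reading of the figure's notation, a mass sequence can be converted into the boundary positions it implies. This small helper is our own sketch, not the paper's software:

    def masses_to_boundaries(masses):
        """Turn a mass sequence, e.g. (1, 3, 2), into its set of internal
        boundary positions, e.g. {1, 4}; the final segment simply ends
        the document, so it places no boundary."""
        position, boundaries = 0, set()
        for mass in masses[:-1]:
            position += mass
            boundaries.add(position)
        return boundaries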


Page 19: Segmentation Similarity and Agreement

Modelling dissimilarity (Segmentation Similarity)

Linear segmentation errors can be modeled as edit operations at positions:

• position 1: an n-wise transposition
• positions 2, 3, 4: substitutions

[Diagram: segmentations s1 and s2 with mismatched boundaries marked FP/FN at positions 1–4]

Figure: Types of segmentation errors
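To make the error model concrete, here is a deliberately simplified sketch of a boundary edit distance over the positions produced by masses_to_boundaries. The paper's actual algorithm is more involved; the greedy pairing below is our approximation:

    def boundary_edit_distance(bounds_a, bounds_b, n=2):
        """Count edits between two boundary sets: unmatched boundaries
        within n - 1 PBs of each other pair off as transpositions
        (near misses); all remaining mismatches count as substitutions."""
        only_a = sorted(set(bounds_a) - set(bounds_b))
        only_b = sorted(set(bounds_b) - set(bounds_a))
        transpositions, i, j = 0, 0, 0
        while i < len(only_a) and j < len(only_b):
            if abs(only_a[i] - only_b[j]) <= n - 1:
                transpositions += 1  # a near miss: one transposition edit
                i += 1
                j += 1
            elif only_a[i] < only_b[j]:
                i += 1
            else:
                j += 1
        substitutions = len(only_a) + len(only_b) - 2 * transpositions
        return substitutions + transpositions

On the figure above this yields 4 edits: the FP/FN pair at position 1 collapses into one transposition, and the remaining three mismatches are substitutions.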


Page 20: Segmentation Similarity and Agreement

Normalization (Segmentation Similarity)

S(s_i1, s_i2) = (mass(i) − 1 − d(s_i1, s_i2)) / (mass(i) − 1)
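In code, the normalization and a full S over mass sequences might look like the following sketch, chaining the helpers from the previous slides (the transposition-scaling and weighting parameters are omitted):

    def similarity(mass, distance):
        """The normalization above: the proportion of the mass - 1
        potential boundaries left unedited."""
        return (mass - 1 - distance) / (mass - 1)

    def s(seg_a, seg_b, n=2):
        """S for two mass sequences of equal total mass."""
        d = boundary_edit_distance(masses_to_boundaries(seg_a),
                                   masses_to_boundaries(seg_b), n)
        return similarity(sum(seg_a), d)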


Page 21: Segmentation Similarity and Agreement

Calculating similarity (Segmentation Similarity)

From the previous example:

• 4 edits (3 substitutions and 1 transposition)

• 14 units of mass

[Diagram: s1 and s2 with the four edit positions marked]

S(s_i1, s_i2) = (14 − 1 − 4) / (14 − 1) = 9/13 = 0.6923

1 − WD = 0.6154
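Plugging the slide's numbers into the normalization sketch reproduces the result:

    >>> similarity(mass=14, distance=4)  # (14 - 1 - 4) / (14 - 1) = 9/13
    0.6923076923076923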

Page 22: Segmentation Similarity and Agreement

Near misses (Segmentation Similarity)

S can scale near misses by the number of PBs spanned:

te(n, b) = b − (1/b)^(n−2), where n ≥ 2 and b > 0
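A literal transcription of that formula follows; we are assuming the flattened exponent reads (1/b) raised to n − 2, so treat this as a sketch of the slide, not a verified reproduction of the paper's definition:

    def te(n, b):
        """Transposition error scaling as printed on the slide:
        te(n, b) = b - (1/b)**(n - 2), for span n >= 2 PBs and b > 0."""
        return b - (1 / b) ** (n - 2)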

[Diagram: s1 = (6, 8) and s2 = (7, 7); the single boundary is off by one unit]

S = 0.9231
1 − WD = 0.8182

Page 23: Segmentation Similarity and Agreement

Increasing near miss span size (Segmentation Similarity)

[Plot: metric value (0.7–1) versus difference in boundary position (0–10 units) for 1−WD, S(n = 3), S(n = 5, scale), and S(n = 5, wtrp = 0)]

Page 24: Segmentation Similarity and Agreement

Reliability of manual codings (Segmentation Agreement)

How do we verify manual reliability?

• Inter-coder agreement coefficients:²³

  κ, π, κ*, and π* = (Aa − Ae) / (1 − Ae)

• Adapt them to use Segmentation Similarity:

  κ_S, π_S, κ*_S, and π*_S

² Fleiss’s Multi-π (π*) is Siegel & Castellan’s (1988) κ
³ Formulations from Artstein & Poesio (2008) are used

Page 25: Segmentation Similarity and Agreement

Categories (Segmentation Agreement)

Calculate Ae using one category per t:

• boundary presence (K = {seg_t | t ∈ T})

Why?

• Coders either place a boundary or not

• Coders do not place non-boundaries

• We desire boundary agreement
  • “Unsure” and “no choice” are not options
  • The default is no boundary placement
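A minimal sketch of how an S-based multi-π could be assembled from these choices, using mean pairwise S as actual agreement (Aa) and the single boundary-presence category for expected agreement (Ae); the helper name and the pooled-probability estimate of Ae are our assumptions:

    from itertools import combinations

    def multi_pi_s(codings, s):
        """Fleiss-style multi-pi with S as the agreement measure.
        codings: mass sequences, one per coder; s: pairwise similarity."""
        # Actual agreement: mean pairwise S over all coder pairs.
        pairs = list(combinations(codings, 2))
        a_a = sum(s(a, b) for a, b in pairs) / len(pairs)
        # Expected agreement: squared pooled probability that a coder
        # places a boundary at any given potential boundary.
        pbs = sum(sum(c) - 1 for c in codings)     # PBs judged, all coders
        placed = sum(len(c) - 1 for c in codings)  # boundaries placed
        a_e = (placed / pbs) ** 2
        return (a_a - a_e) / (1 - a_e)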


Page 26: Segmentation Similarity and Agreement

Examples of manual codings (Multiply-Coded Corpora)

Linear multiply-coded segmentations:

• Kazantseva & Szpakowicz (2012)
  • The Moonstone by Wilkie Collins
  • Topically segmented by 4–6 coders
  • Paragraph-level

• Hearst (1997)
  • Stargazers Look for Life by Dan Baker
  • Topically segmented by 7 coders
  • Paragraph-level

Page 27: Segmentation Similarity and Agreement

Overall agreement (Multiply-Coded Corpora)

Kazantseva & Szpakowicz (2012)
  Mean coder group π*_S: 0.8923 ± 0.0377
  Mean S:                0.8885 ± 0.0662

Hearst (1997)
  π*_S:   0.7514
  Mean S: 0.7619 ± 0.0706

Page 28: Segmentation Similarity and Agreement

Overall error types (Multiply-Coded Corpora)

Misses:
                                  Full   Near
Kazantseva & Szpakowicz (2012)    1039    212
Hearst (1997)                       72     28

[Chart: proportions of substitutions, transpositions, and PBs without error for K&S (2012) and Hearst (1997)]

Page 29: Segmentation Similarity and Agreement

Comparing segmenters (Evaluation)

How can we compare auto segmenters?

• Pairwise mean S with manual codings

  [Diagram: one automatic segmentation compared against manual codings 1–3, yielding mean(S1, S2, S3)]

• Statistical hypothesis testing
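The pairwise mean is straightforward; a sketch (our own helper, using the s function from the normalization slide):

    def pairwise_mean_s(auto, manuals, s):
        """Mean S between one automatic segmentation and each manual
        coding; the individual scores can also feed a hypothesis test."""
        scores = [s(auto, manual) for manual in manuals]
        return sum(scores) / len(scores)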


Page 30: Segmentation Similarity and Agreement

Comparing segmenters (Evaluation)

How can we compare auto segmenters?

• Differences in agreement:

  1. Calculate manual coder agreement: π*_{S,3M}

  2. Recalculate agreement after adding an automatic segmenter’s values: π*_{S,3M,1A}

  3. Compare the two agreement values
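Putting the pieces together, the comparison might run as below; the codings are hypothetical mass sequences invented for illustration, not data from the paper:

    manual = [(3, 2, 5), (3, 3, 4), (4, 2, 4)]  # three hypothetical coders
    auto = (3, 2, 2, 3)                         # a hypothetical segmenter

    baseline = multi_pi_s(manual, s)            # agreement among coders only
    with_auto = multi_pi_s(manual + [auto], s)  # coders plus the segmenter
    # A small drop suggests the segmenter behaves much like another coder;
    # a large drop suggests it segments unlike the humans.
    print(baseline, with_auto)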


Page 31: Segmentation Similarity and Agreement

Summary (Conclusion)

Segmentation Similarity (S):

• Stable, unlike window metrics

• Highly configurable

• Gives detailed error information

• Mean values can be used to perform statistical hypothesis tests

Adapted inter-annotator agreement:

• Quantify manual agreement & reliability

• Compare automatic segmenters in terms of human performance

Page 32: Segmentation Similarity and Agreement

Future work & implementation (Conclusion)

Future work:

• Multiple boundary types

• Hierarchical segmentation

Software implementation

http://nlp.chrisfournier.ca/


Page 33: Segmentation Similarity and Agreement

References I

Artstein, R. & Poesio, M. (2008), ‘Inter-coder agreement for computational linguistics’, Computational Linguistics 34(4), 555–596.

Baker, D. (1990), ‘Stargazers look for life’, South Magazine 117, 76–77.

Beeferman, D. & Berger, A. (1999), ‘Statistical models for text segmentation’, Machine Learning 34(1–3), 177–210.

Cohen, P. R. (1995), Empirical Methods for Artificial Intelligence, Cambridge, MA, USA.

Eisenstein, J. & Barzilay, R. (2008), Bayesian unsupervised topic segmentation, in ‘Proceedings of the Conference on Empirical Methods in Natural Language Processing’, Association for Computational Linguistics, Morristown, NJ, USA, pp. 334–343.


Page 34: Segmentation Similarity and Agreement

References II

Franz, M., McCarley, J. S. & Xu, J.-M. (2007), User-oriented text segmentation evaluation measure, in ‘Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval’, pp. 701–702.

Haghighi, A. & Vanderwende, L. (2009), Exploring content models for multi-document summarization, in ‘Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics’, NAACL ’09, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 362–370.

Hearst, M. A. (1993), TextTiling: A Quantitative Approach to Discourse, Technical report.

Hearst, M. A. (1997), ‘TextTiling: segmenting text into multi-paragraph subtopic passages’, Computational Linguistics 23(1), 33–64.


Page 35: Segmentation Similarity and Agreement

References III

Kazantseva, A. & Szpakowicz, S. (2011), Linear text segmentation using affinity propagation, in ‘Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing’, Association for Computational Linguistics, Edinburgh, Scotland, UK, pp. 284–293.

Kazantseva, A. & Szpakowicz, S. (2012), Topical segmentation: a study of human performance, in ‘Proceedings of Human Language Technologies: The 2012 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT ’12)’, Association for Computational Linguistics.

Malioutov, I. & Barzilay, R. (2006), Minimum cut model for spoken lecture segmentation, in ‘Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics’, ACL-44, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 25–32.


Page 36: Segmentation Similarity and Agreement

References IV

McCallum, A., Munteanu, C., Penn, G. & Zhu, X. (2012), Ecological validity and the evaluation of speech summarization quality, in ‘Proceedings of the NAACL HLT 2012 Workshop on Evaluation Metrics and System Comparison for Automatic Summarization’, Association for Computational Linguistics.

Oh, H.-J., Myaeng, S. H. & Jang, M.-G. (2007), ‘Semantic passage segmentation based on sentence topics for question answering’, Information Sciences 177(18), 3696–3717.

Pevzner, L. & Hearst, M. (2002), ‘A critique and improvement of an evaluation metric for text segmentation’, Computational Linguistics 28(1), 19–36.

Siegel, S. & Castellan, N. (1988), Nonparametric Statistics for the Behavioral Sciences, second edn, McGraw-Hill, Inc.


Page 37: Segmentation Similarity and Agreement

References V

Stoyanov, V. & Cardie, C. (2008), Topic identification for fine-grained opinion analysis, in ‘Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1’, COLING ’08, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 817–824.
