
Segmentation Similarity and Agreement


DESCRIPTION

We propose a new segmentation evaluation metric, called segmentation similarity (S), that quantifies the similarity between two segmentations as the proportion of boundaries that are not transformed when comparing them using edit distance, essentially using edit distance as a penalty function and scaling penalties by segmentation size. We propose several adapted inter-annotator agreement coefficients that use S and are suitable for segmentation. We show that S is configurable enough to suit a wide variety of segmentation evaluations, and that it improves upon the state of the art. We also propose using inter-annotator agreement coefficients to evaluate automatic segmenters in terms of human performance. For more information, view the paper and software at: http://nlp.chrisfournier.ca


Page 1: Segmentation Similarity and Agreement

Segmentation Similarity and Agreement

A metric for evaluating automatic and human segmenters

Chris Fournier, Diana Inkpen

School of Electrical Engineering and Computer Science, University of Ottawa

June 4, 2012

Page 2: Segmentation Similarity and Agreement

What is segmentation? (Introduction)

Figure: Baker (1990, pp. 76–77)

Page 3: Segmentation Similarity and Agreement

What is segmentation? (Introduction)

Par.   Topic
1–3    Intro - the search for life in space
4–5    The moon’s chemical composition
6–8    How early earth-moon proximity shaped the moon
9–12   How the moon helped life evolve on earth
13     Improbability of the earth-moon system
14–16  Binary/trinary star systems make life unlikely
17–18  The low probability of nonbinary/trinary systems
19–20  Properties of earth’s sun that facilitate life
21     Summary

Figure: Hyp. segmentation (Hearst 1997, p. 33)

Page 4: Segmentation Similarity and Agreement

Why do we segment? (Introduction)

To model topical shifts, aiding:

• Video and audio retrieval (Franz et al. 2007)

• Question answering (Oh et al. 2007)

• Subjectivity analysis (Stoyanov & Cardie 2008)

• Automatic summarization (Haghighi & Vanderwende 2009)

Page 5: Segmentation Similarity and Agreement

Types of segmentation (Introduction)

Linear: s1 = (3, 2, 3, 1)

Hierarchical: [Tree diagram: a segment of mass 5 splits into segments of mass 3 and 2; the mass-3 segment splits into (1, 1, 1) and the mass-2 segment into (1, 1)]

Page 6: Segmentation Similarity and Agreement

Automatic segmentation (Introduction)

Many automatic segmenters exist:

• TextTiling (Hearst 1997)

• Minimum Cut segmenter (Malioutov & Barzilay 2006)

• Bayesian segmenter (Eisenstein & Barzilay 2008)

• Affinity Propagation for Segmentation (Kazantseva & Szpakowicz 2011)

Page 7: Segmentation Similarity and Agreement

Problem: selecting a segmenter (Introduction)

How do we select the best-performing segmenter for a task?

• Ideally, evaluate performance in situ
  • Evaluate end-task performance while varying segmenters
  • Attain ecological validity¹
    • “. . . the ability of experiments to tell us how real people operate in the real world” (Cohen 1995, p. 102)

• This is time consuming and expensive

¹ For an example study, see McCallum et al. (2012)

Page 8: Segmentation Similarity and Agreement

Problem: selecting a segmenter (Introduction)

How do we select the best-performing segmenter for a task less expensively?

1. Identify/collect manual segmentations

2. Verify their reliability

3. Train an automatic segmenter

4. Compare automatic and manual segmentations using a metric

Page 9: Segmentation Similarity and Agreement

Focus (Introduction)

We focus on comparing segmentations to evaluate:

• Manual segmentation reliability

• Automatic segmenter performance

Page 10: Segmentation Similarity and Agreement

Why is this comparison difficult? (Difficulty)

Difficulty arises because:

• There is no one “true” segmentation
  • Low manual agreement (Hearst 1997)
  • Coders disagree on granularity (Pevzner & Hearst 2002)

• Few boundaries to agree upon (Hearst 1993, p. 6)

• Near misses often occur between boundaries

Page 11: Segmentation Similarity and Agreement

No one “true” segmentation (Difficulty)

Figure: 7 manual codings collected by Hearst (1997) of Stargazers Look for Life (Baker 1990)

Page 12: Segmentation Similarity and Agreement

Near misses (Difficulty)

[Plot: number of misses (full vs. near) versus the distance considered a near miss, in PBs]

Figure: S of Kazantseva & Szpakowicz (2012)

Page 13: Segmentation Similarity and Agreement

Existing evaluation metrics (Evaluation Metrics)

Existing segmentation evaluation metrics:

• Precision, Recall, Fβ-measure
  • Do not discount near misses

• Pk (Beeferman & Berger 1999)
  • Window-based near-miss accounting
  • Not stable (Pevzner & Hearst 2002)

• WindowDiff (Pevzner & Hearst 2002)
  • A substantial modification of Pk
  • More stable (Pevzner & Hearst 2002)
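Since WindowDiff is the baseline that S is later compared against, a minimal Python sketch may help. This is our own illustrative implementation (the function name and the mass-sequence input format are assumptions, and window-edge conventions vary between implementations):

    def window_diff(ref, hyp, k=None):
        """WindowDiff (Pevzner & Hearst 2002), sketched for mass-sequence
        inputs such as (3, 2, 3, 1): slide a k-unit window over both
        segmentations and count windows that disagree on how many
        boundaries fall inside."""
        def bounds(masses):
            position, out = 0, set()
            for mass in masses[:-1]:
                position += mass
                out.add(position)
            return out

        n = sum(ref)  # total mass; assumed equal to sum(hyp)
        if k is None:
            # conventional choice: half the mean reference segment size
            k = max(2, round(n / (2 * len(ref))))
        ref_b, hyp_b = bounds(ref), bounds(hyp)
        disagreements = 0
        for i in range(n - k):
            # count boundaries at PB positions i+1 .. i+k in each segmentation
            r = sum(1 for p in ref_b if i < p <= i + k)
            h = sum(1 for p in hyp_b if i < p <= i + k)
            if r != h:
                disagreements += 1
        return disagreements / (n - k)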


Page 14: Segmentation Similarity and Agreement

Stability & internal segment sizes (Evaluation Metrics)

[Plot: metric value (1−WD and S) versus internal segment size pairs (20,30), (15,35), (10,40), (5,45)]

Figure: 10 trials of 100 segmentations with FP & FN p = 0.5

Page 15: Segmentation Similarity and Agreement

Common failings (Evaluation Metrics)

Existing segmentation evaluation metrics:

• Require one “true” reference
  • Cannot use multiple manual codings

• Cannot be adapted for agreement
  • Pairwise means must be permuted
  • WD(s1, s2) ≠ WD(s2, s1)

Page 16: Segmentation Similarity and Agreement

A new metric: S (Segmentation Similarity)

Segmentation Similarity (S):

• A new boundary edit distance
  • Edit distance used to penalize error
  • Penalties scaled and normalized in relation to segment mass

S is ideal because it is:

• A minimum edit distance (stable)

• Symmetric (no “true” segmentation)

• Highly configurable

Page 17: Segmentation Similarity and Agreement

Parameters (Segmentation Similarity)

S has three parameters:

• n: the number of PBs considered a near miss (default is 2)

• TE (y/n): whether to use transposition error scaling (default is yes)

• Weights upon error types to reduce their severity (default is 1 PB each)

Page 18: Segmentation Similarity and Agreement

Mass and potential boundaries (Segmentation Similarity)

Segmentations have:

• Potential boundaries separating units

• Mass measured in units

• Types of boundaries

[Diagram: six units with potential boundaries marked at positions 0–6, annotated as the mass sequence (1, 3, 2)]

Figure: Annotation of segmentation mass
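As a concrete reading of the figure's notation, a mass sequence can be converted into the boundary positions it implies. This small helper is our own sketch, not the paper's software:

    def masses_to_boundaries(masses):
        """Turn a mass sequence, e.g. (1, 3, 2), into its set of internal
        boundary positions, e.g. {1, 4}; the final segment simply ends
        the document, so it places no boundary."""
        position, boundaries = 0, set()
        for mass in masses[:-1]:
            position += mass
            boundaries.add(position)
        return boundaries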


Page 19: Segmentation Similarity and Agreement

Modelling dissimilarity (Segmentation Similarity)

Linear segmentation errors can be modeled as edit operations at positions:

• position 1: an n-wise transposition
• positions 2, 3, 4: substitutions

[Diagram: segmentations s1 and s2 with mismatched boundaries marked FP/FN at positions 1–4]

Figure: Types of segmentation errors
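To make the error model concrete, here is a deliberately simplified sketch of a boundary edit distance over the positions produced by masses_to_boundaries. The paper's actual algorithm is more involved; the greedy pairing below is our approximation:

    def boundary_edit_distance(bounds_a, bounds_b, n=2):
        """Count edits between two boundary sets: unmatched boundaries
        within n - 1 PBs of each other pair off as transpositions
        (near misses); all remaining mismatches count as substitutions."""
        only_a = sorted(set(bounds_a) - set(bounds_b))
        only_b = sorted(set(bounds_b) - set(bounds_a))
        transpositions, i, j = 0, 0, 0
        while i < len(only_a) and j < len(only_b):
            if abs(only_a[i] - only_b[j]) <= n - 1:
                transpositions += 1  # a near miss: one transposition edit
                i += 1
                j += 1
            elif only_a[i] < only_b[j]:
                i += 1
            else:
                j += 1
        substitutions = len(only_a) + len(only_b) - 2 * transpositions
        return substitutions + transpositions

On the figure above this yields 4 edits: the FP/FN pair at position 1 collapses into one transposition, and the remaining three mismatches are substitutions.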


Page 20: Segmentation Similarity and Agreement

Normalization (Segmentation Similarity)

S(s_i1, s_i2) = (mass(i) − 1 − d(s_i1, s_i2)) / (mass(i) − 1)
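In code, the normalization and a full S over mass sequences might look like the following sketch, chaining the helpers from the previous slides (the transposition-scaling and weighting parameters are omitted):

    def similarity(mass, distance):
        """The normalization above: the proportion of the mass - 1
        potential boundaries left unedited."""
        return (mass - 1 - distance) / (mass - 1)

    def s(seg_a, seg_b, n=2):
        """S for two mass sequences of equal total mass."""
        d = boundary_edit_distance(masses_to_boundaries(seg_a),
                                   masses_to_boundaries(seg_b), n)
        return similarity(sum(seg_a), d)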


Page 21: Segmentation Similarity and Agreement

Calculating similarity (Segmentation Similarity)

From the previous example:

• 4 edits (3 substitutions and 1 transposition)

• 14 units of mass

[Diagram: s1 and s2 with the four edit positions marked]

S(s_i1, s_i2) = (14 − 1 − 4) / (14 − 1) = 9/13 = 0.6923

1 − WD = 0.6154
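Plugging the slide's numbers into the normalization sketch reproduces the result:

    >>> similarity(mass=14, distance=4)  # (14 - 1 - 4) / (14 - 1) = 9/13
    0.6923076923076923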

Page 22: Segmentation Similarity and Agreement

Near misses (Segmentation Similarity)

S can scale near misses by the number of PBs spanned:

te(n, b) = b − (1/b)^(n−2), where n ≥ 2 and b > 0
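A literal transcription of that formula follows; we are assuming the flattened exponent reads (1/b) raised to n − 2, so treat this as a sketch of the slide, not a verified reproduction of the paper's definition:

    def te(n, b):
        """Transposition error scaling as printed on the slide:
        te(n, b) = b - (1/b)**(n - 2), for span n >= 2 PBs and b > 0."""
        return b - (1 / b) ** (n - 2)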

[Diagram: s1 = (6, 8) and s2 = (7, 7); the single boundary is off by one unit]

S = 0.9231
1 − WD = 0.8182

Page 23: Segmentation Similarity and Agreement

Increasing near miss span size (Segmentation Similarity)

[Plot: metric value (0.7–1) versus difference in boundary position (0–10 units) for 1−WD, S(n = 3), S(n = 5, scale), and S(n = 5, wtrp = 0)]

Page 24: Segmentation Similarity and Agreement

Reliability of manual codings (Segmentation Agreement)

How do we verify manual reliability?

• Inter-coder agreement coefficients:²³

  κ, π, κ*, and π* = (Aa − Ae) / (1 − Ae)

• Adapt them to use Segmentation Similarity:

  κ_S, π_S, κ*_S, and π*_S

² Fleiss’s Multi-π (π*) is Siegel & Castellan’s (1988) κ
³ Formulations from Artstein & Poesio (2008) are used

Page 25: Segmentation Similarity and Agreement

Categories (Segmentation Agreement)

Calculate Ae using one category per t:

• boundary presence (K = {seg_t | t ∈ T})

Why?

• Coders either place a boundary or not

• Coders do not place non-boundaries

• We desire boundary agreement
  • “Unsure” and “no choice” are not options
  • The default is no boundary placement
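A minimal sketch of how an S-based multi-π could be assembled from these choices, using mean pairwise S as actual agreement (Aa) and the single boundary-presence category for expected agreement (Ae); the helper name and the pooled-probability estimate of Ae are our assumptions:

    from itertools import combinations

    def multi_pi_s(codings, s):
        """Fleiss-style multi-pi with S as the agreement measure.
        codings: mass sequences, one per coder; s: pairwise similarity."""
        # Actual agreement: mean pairwise S over all coder pairs.
        pairs = list(combinations(codings, 2))
        a_a = sum(s(a, b) for a, b in pairs) / len(pairs)
        # Expected agreement: squared pooled probability that a coder
        # places a boundary at any given potential boundary.
        pbs = sum(sum(c) - 1 for c in codings)     # PBs judged, all coders
        placed = sum(len(c) - 1 for c in codings)  # boundaries placed
        a_e = (placed / pbs) ** 2
        return (a_a - a_e) / (1 - a_e)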


Page 26: Segmentation Similarity and Agreement

Examples of manual codings (Multiply-Coded Corpora)

Linear multiply-coded segmentations:

• Kazantseva & Szpakowicz (2012)
  • The Moonstone by Wilkie Collins
  • Topically segmented by 4–6 coders
  • Paragraph-level

• Hearst (1997)
  • Stargazers Look for Life by Dan Baker
  • Topically segmented by 7 coders
  • Paragraph-level

Page 27: Segmentation Similarity and Agreement

Overall agreement (Multiply-Coded Corpora)

Kazantseva & Szpakowicz (2012)
  Mean coder group π*_S: 0.8923 ± 0.0377
  Mean S:                0.8885 ± 0.0662

Hearst (1997)
  π*_S:   0.7514
  Mean S: 0.7619 ± 0.0706

Page 28: Segmentation Similarity and Agreement

Overall error types (Multiply-Coded Corpora)

Misses:
                                  Full   Near
Kazantseva & Szpakowicz (2012)    1039    212
Hearst (1997)                       72     28

[Chart: proportions of substitutions, transpositions, and PBs without error for K&S (2012) and Hearst (1997)]

Page 29: Segmentation Similarity and Agreement

Comparing segmenters (Evaluation)

How can we compare auto segmenters?

• Pairwise mean S with manual codings

  [Diagram: one automatic segmentation compared against manual codings 1–3, yielding mean(S1, S2, S3)]

• Statistical hypothesis testing
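The pairwise mean is straightforward; a sketch (our own helper, using the s function from the normalization slide):

    def pairwise_mean_s(auto, manuals, s):
        """Mean S between one automatic segmentation and each manual
        coding; the individual scores can also feed a hypothesis test."""
        scores = [s(auto, manual) for manual in manuals]
        return sum(scores) / len(scores)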


Page 30: Segmentation Similarity and Agreement

Comparing segmenters (Evaluation)

How can we compare auto segmenters?

• Differences in agreement:

  1. Calculate manual coder agreement: π*_{S,3M}

  2. Recalculate agreement after adding an automatic segmenter’s values: π*_{S,3M,1A}

  3. Compare the two agreement values
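Putting the pieces together, the comparison might run as below; the codings are hypothetical mass sequences invented for illustration, not data from the paper:

    manual = [(3, 2, 5), (3, 3, 4), (4, 2, 4)]  # three hypothetical coders
    auto = (3, 2, 2, 3)                         # a hypothetical segmenter

    baseline = multi_pi_s(manual, s)            # agreement among coders only
    with_auto = multi_pi_s(manual + [auto], s)  # coders plus the segmenter
    # A small drop suggests the segmenter behaves much like another coder;
    # a large drop suggests it segments unlike the humans.
    print(baseline, with_auto)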


Page 31: Segmentation Similarity and Agreement

Summary (Conclusion)

Segmentation Similarity (S):

• Stable, unlike window metrics

• Highly configurable

• Gives detailed error information

• Mean values can be used to perform statistical hypothesis tests

Adapted inter-annotator agreement:

• Quantify manual agreement & reliability

• Compare automatic segmenters in terms of human performance

Page 32: Segmentation Similarity and Agreement

Future work & implementation (Conclusion)

Future work:

• Multiple boundary types

• Hierarchical segmentation

Software implementation

http://nlp.chrisfournier.ca/


Page 33: Segmentation Similarity and Agreement

References I

Artstein, R. & Poesio, M. (2008), ‘Inter-coder agreement for computational linguistics’, Computational Linguistics 34(4), 555–596.

Baker, D. (1990), ‘Stargazers look for life’, South Magazine 117, 76–77.

Beeferman, D. & Berger, A. (1999), ‘Statistical models for text segmentation’, Machine Learning 34(1–3), 177–210.

Cohen, P. R. (1995), Empirical Methods for Artificial Intelligence, Cambridge, MA, USA.

Eisenstein, J. & Barzilay, R. (2008), Bayesian unsupervised topic segmentation, in ‘Proceedings of the Conference on Empirical Methods in Natural Language Processing’, Association for Computational Linguistics, Morristown, NJ, USA, pp. 334–343.


Page 34: Segmentation Similarity and Agreement

References II

Franz, M., McCarley, J. S. & Xu, J.-M. (2007), User-oriented text segmentation evaluation measure, in ‘Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval’, pp. 701–702.

Haghighi, A. & Vanderwende, L. (2009), Exploring content models for multi-document summarization, in ‘Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics’, NAACL ’09, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 362–370.

Hearst, M. A. (1993), TextTiling: A Quantitative Approach to Discourse, Technical report.

Hearst, M. A. (1997), ‘TextTiling: segmenting text into multi-paragraph subtopic passages’, Computational Linguistics 23(1), 33–64.


Page 35: Segmentation Similarity and Agreement

References III

Kazantseva, A. & Szpakowicz, S. (2011), Linear text segmentation using affinity propagation, in ‘Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing’, Association for Computational Linguistics, Edinburgh, Scotland, UK, pp. 284–293.

Kazantseva, A. & Szpakowicz, S. (2012), Topical segmentation: a study of human performance, in ‘Proceedings of Human Language Technologies: The 2012 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT ’12)’, Association for Computational Linguistics.

Malioutov, I. & Barzilay, R. (2006), Minimum cut model for spoken lecture segmentation, in ‘Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics’, ACL-44, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 25–32.


Page 36: Segmentation Similarity and Agreement

References IV

McCallum, A., Munteanu, C., Penn, G. & Zhu, X. (2012), Ecological validity and the evaluation of speech summarization quality, in ‘Proceedings of the NAACL HLT 2012 Workshop on Evaluation Metrics and System Comparison for Automatic Summarization’, Association for Computational Linguistics.

Oh, H.-J., Myaeng, S. H. & Jang, M.-G. (2007), ‘Semantic passage segmentation based on sentence topics for question answering’, Information Sciences 177(18), 3696–3717.

Pevzner, L. & Hearst, M. (2002), ‘A critique and improvement of an evaluation metric for text segmentation’, Computational Linguistics 28(1), 19–36.

Siegel, S. & Castellan, N. (1988), Nonparametric Statistics for the Behavioral Sciences, second edn, McGraw-Hill, Inc.


Page 37: Segmentation Similarity and Agreement

References V

Stoyanov, V. & Cardie, C. (2008), Topic identification for fine-grained opinion analysis, in ‘Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1’, COLING ’08, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 817–824.
