
  • Discovering Structural Similarities in Narrative Texts using Event Alignment Algorithms

    Dissertation for the attainment of the doctoral degree in Computational Linguistics

    at the Faculty of Modern Languages (Neuphilologische Fakultät) of Ruprecht-Karls-Universität Heidelberg

    submitted by

    Nils Reiter

  • Publication June 2014

    Commission chair: Prof. Dr. Ekkehard Felder, Germanistisches Seminar, Universität Heidelberg

    Supervisor and first reviewer: Prof. Dr. Anette Frank, Institut für Computerlinguistik, Universität Heidelberg

    Secondary reviewer: Prof. Dr. Sebastian Padó, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart


  • Acknowledgements

    I would like to thank my supervisor, Anette Frank, for supervising and guiding this thesis. In particular, I am thankful for the opportunity to delve into details, for the continuous support, the engaging discussions and the critical yet fruitful feedback. It was a pleasure to work in and with her entire group. In particular, I would like to thank my year-long office mate Eva Sourjiková for the pleasant office atmosphere, my friend and colleague Michael Roth for all the fun we had and Matthias Hartung for . . . I don’t know where to start.

    Furthermore, I thank “my” ritual experts, Oliver Hellwig and Christof Zotter, for answering questions and doing annotation work. I also thank Thomas Bögel, Irina Gossmann, Mareike Hartmann, Borayin Larios, Julio Cezar Rodrigues and Britta Zeller for working in this ritual research project and doing implementation and annotation. All this would not have been possible without the Sonderforschungsbereich 619 and the funding by the German Research Foundation, for which I am equally thankful.

    Following Propp, folktales have magical helper agents, and so does the tale of working on this thesis. I am very thankful to my helper agents and for the encouragement they gave me.

    Nils Reiter


  • Abstract

    This thesis is about the discovery of structural similarities across narrative texts. We will describe a method that is based on event alignments created automatically on automatically preprocessed texts. This opens up a path to large-scale empirical research on structural similarities across texts.

    Structural similarities are of interest for many areas in the humanities and social sciences. We will focus on folkloristics and ritual research as application scenarios. Folkloristics studies folktales, i.e., tales that have been passed down orally for a long time. Similarities across different folktales have been observed at the level of individual events (being abandoned in the woods) and participants (the gingerbread house), as well as structurally: events do not happen at random, but in a certain order. Rituals are an omnipresent part of human behavior and are studied in ethnology, social sciences and history. Similarities across types of rituals have been observed and have sparked a discussion about the structural principles that govern how individual ritual elements combine into rituals.

    As descriptions of rituals feature many uncommon language constructions, we will also discuss methods of domain adaptation in order to adapt existing NLP components to the domain of rituals. We will mainly use supervised methods and employ retraining as a means for adaptation. This presupposes annotating small amounts of domain data. We will discuss the following linguistic levels: part-of-speech tagging, chunking, dependency parsing, word sense disambiguation, semantic role labeling and coreference resolution. On all levels, we have achieved improvements. We will also describe how these annotation levels are brought together in a single, integrated discourse representation that is the basis for further experiments.

    In order to discover structural similarities, we employ three different alignment algorithms and use them to align semantically similar events. Sequence alignment (Needleman-Wunsch) is a classic algorithm with limited capabilities. A graph-based event alignment system (predicate alignment) that has been developed for newspaper texts will be used for comparison. As a third algorithm, we employ Bayesian model merging, which induces a hidden Markov model from which we extract an alignment. We will evaluate the algorithms in two experiments. In the first experiment, we evaluate against a gold standard of aligned descriptions of rituals. Bayesian model merging and predicate alignment achieve the best results, measured using the BLANC metric. Due to difficulties in creating an event alignment gold standard, the second experiment is based on cluster induction. Although this is not a strict evaluation of structural similarities, it gives some insight into the behavior of the algorithms.
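To make the sequence alignment idea concrete, the following is a minimal sketch of Needleman-Wunsch global alignment over two short event sequences. It is an illustration only, not the implementation used in the thesis: the event labels, the `match_score` function and the gap penalty of -1 are hypothetical choices made for this example, whereas the thesis aligns richer event representations with a semantic similarity measure.

```python
from typing import Callable, List, Sequence, Tuple


def needleman_wunsch(
    a: Sequence[str],
    b: Sequence[str],
    sim: Callable[[str, str], float],
    gap: float = -1.0,
) -> Tuple[float, List[Tuple[int, int]]]:
    """Globally align two sequences; return the score and aligned index pairs."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                          # leave a[i-1] unaligned
                score[i][j - 1] + gap,                          # leave b[j-1] unaligned
            )
    # Trace back from the bottom-right cell; each diagonal step is an aligned pair.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


def match_score(x: str, y: str) -> float:
    """Hypothetical toy similarity: reward identical labels, penalize all others."""
    return 1.0 if x == y else -1.0


# Hypothetical event sequences standing in for two ritual descriptions.
ritual_a = ["arrive", "wash", "pray", "leave"]
ritual_b = ["arrive", "pray", "eat", "leave"]
total, aligned = needleman_wunsch(ritual_a, ritual_b, match_score)
print(total, aligned)  # 1.0 [(0, 0), (2, 1), (3, 3)]
```

Because the alignment follows a single path through the score matrix, aligned pairs can never cross: if event i is aligned to event j, later events in the first text can only be aligned to later events in the second. This is one way to read the remark that the algorithm has limited capabilities.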

    We induce a document similarity measure from the generated alignments and use this measure to cluster the documents. The clustering is then compared against a gold standard classification of documents from both scenarios. In this experiment, the lemma alignment baseline achieves the best numerical performance on folktales (but as it aligns lemmas instead of event representations, its expressiveness is limited), followed by predicate alignment, Needleman-Wunsch and Bayesian model merging. On descriptions of rituals, the predicate alignment algorithm outperforms all baselines and the other algorithms. Shallow measures of semantic similarities of texts outperform the alignment-based algorithms on folktales, but they do not allow the exact localization of similarities.

    Finally, we present a graph-based algorithm that ranks events according to their participation in structurally similar regions across documents. This allows us to direct humanities researchers to interesting cases that are worth manual inspection. Because the accessibility of results to humanities researchers is of utmost importance in digital humanities scenarios, we close the thesis with a showcase scenario in which we analyze descriptions of rituals using the alignment, clustering and event ranking algorithms described before. In this showcase, we will show how results can be visualized and interpreted by researchers of rituals.


  • Contents

    1 Introduction

    2 Digital Humanities
        2.1 Existing Computational Linguistics Research within Digital Humanities
        2.2 Challenges for Computational Linguistics
        2.3 Summary

    3 Related Work
        3.1 Domain Adaptation
        3.2 Computational Narrative Analysis

    4 Application Scenarios
        4.1 Folktales
        4.2 Rituals
        4.3 Discussion

    5 Automatic Semantic Annotation and Domain Adaptation
        5.1 System Architecture
        5.2 Adaptation to the Ritual Domain
        5.3 Summary

    6 Discovering Structural Similarities
        6.1 Discovering Story Similarities through Event Alignments
        6.2 Event Alignment Algorithms
        6.3 Gold Standard and Evaluation
        6.4 Experiment 1: Comparison against an Alignment Gold Standard
        6.5 Experiment 2: Alignment-based Clustering Evaluation
        6.6 Graph-based Detection of Structural Similarities
        6.7 Summary

    7 Analyzing and Exploiting Structural Similarities in Digital Humanities
        7.1 Inspecting Story Similarities Globally
        7.2 Uncovering Structural Similarities
        7.3 Fine-grained Analysis of Structural Similarities

    8 Conclusions
        8.1 Challenges for Computational Linguistics
        8.2 Contributions
        8.3 Outlook and Future Work

    Appendix
        1 Folktale: Bearskin
        2 Proppian Event Functions
        3 Description of a Cūḍākaraṇa Ritual
        4 Mathematical Notation Overview
        5 Discourse Representation File Format


  • List of Tables

    2.1 Past digital humanities research in computational linguistics

    3.1 Approaches for statistical domain adaptation
    3.2 Collections with annotated story intention graphs
    3.3 Story modeling approaches
    3.4 Story aggregation approaches

    4.
