Research methods in corpus linguistics Xiaofei Lu

Overview

What is a corpus?
Types of corpora
Corpus design
Where to obtain corpora
Corpus annotation
Corpus analysis
Note on research project design
Exercises and demos in between
Future courses on corpus linguistics

What is a corpus?

Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer

Francis (1982): a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis

Sinclair (1991): a collection of naturally-occurring language text, chosen to characterise a state or a variety of language

Types of corpora

General-purpose monolingual corpora
  The British National Corpus
Specialized corpora
  Lancaster Corpus of Academic Written English
Learner corpora
  International Corpus of Learner English
Parallel & comparable corpora
  The JRC-Acquis Multilingual Parallel Corpus
  The English-Chinese Parallel Concordancer
Corpora and varieties
  International Corpus of English
Synchronic and diachronic corpora

Corpus design

Purpose
Comparability
Type
Content: mode, interaction, domain, medium
Structure: proportions
Size
Sampling?
Design of the BNC

Where to obtain corpora

Linguistic Data Consortium
Bookmarks for corpus-based linguists
Ask on the corpora list
Compile your own corpora
  Design your corpus
  Getting permission
  File format, metadata, and data markup
  Text capture
    Scanning, typing, electronic files, web crawlers, e.g., WebSPHINX
    Transcription tools, e.g., Transcriber
  A Guide to Good Practice

Corpus annotation

Why annotate
Levels of corpus annotation
Difficulties for corpus annotation
Tools for corpus annotation

Why annotate

For linguistic research
  Allow more effective corpus searches
For natural language processing
  Spelling and grammar checking
  Text summarization
  Machine translation
  Question answering

Levels of corpus annotation

Sentence segmentation
Word segmentation/tokenization
Part-of-speech (POS) tagging
Chunking/shallow parsing
Syntactic parsing
Semantic annotation
Pragmatic annotation
Parallel corpora: sentence alignment
Learner corpora: error annotation
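As a rough illustration of the first two levels, the sketch below segments a text into sentences and tokens using regular expressions only. The function names are hypothetical and the rules are deliberately naive (abbreviations such as "e.g." would be split incorrectly); real annotation pipelines use trained or rule-rich tools.

```python
import re

def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenization: words (with internal apostrophes or hyphens)
    or single punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)

text = "I saw a pig with binoculars. Was it real?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sent, toks in zip(sentences, tokens):
    print(sent, "->", toks)
```

Each later level (POS tagging, chunking, parsing) would take the token lists produced here as its input.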

Difficulties for corpus annotation

Ambiguity
  I saw a pig with binoculars.
  Problems for tagging, parsing, & word sense disambiguation (WSD)
Unknown words
  Identification
  POS tagging
  Semantic annotation

Tools for corpus annotation

Bookmarks for corpus-based linguists
Corpora and Corpus Annotation Tools on the WWW
POS tagger demonstration
  Sentence segmentation
  POS tagging
  Extracting NPs of the form DT NN NN
Dexter: Tools for analyzing language data
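The last step of the demonstration, extracting NPs of the form DT NN NN, could be sketched as follows. The sketch assumes the text has already been POS-tagged with Penn Treebank style tags and is available as (word, tag) pairs; the sample sentence is hand-tagged for illustration and the function name is hypothetical.

```python
def extract_dt_nn_nn(tagged):
    """Return word sequences whose tags are exactly DT, NN, NN in a row."""
    pattern = ("DT", "NN", "NN")
    matches = []
    for i in range(len(tagged) - len(pattern) + 1):
        window = tagged[i:i + len(pattern)]
        if tuple(tag for _, tag in window) == pattern:
            matches.append(" ".join(word for word, _ in window))
    return matches

# Hand-tagged sample (an assumption, not output from a real tagger).
tagged_sentence = [
    ("The", "DT"), ("research", "NN"), ("project", "NN"),
    ("uses", "VBZ"), ("a", "DT"), ("learner", "NN"), ("corpus", "NN"),
    (".", "."),
]
noun_phrases = extract_dt_nn_nn(tagged_sentence)
print(noun_phrases)
```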

Corpus analysis

Levels of corpus analysis
Tools for corpus analysis
Interpreting corpus data

Levels of corpus analysis

Word frequency lists
Concordances
  Collocation (lexical patterning)
  Colligation (syntactic patterning)
Keyword lists
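To make the notion of a concordance concrete, here is a minimal keyword-in-context (KWIC) sketch. It assumes the corpus is already tokenized into a list of strings; the function name and the `width` parameter are illustrative choices, not taken from any particular tool.

```python
def concordance(tokens, target, width=3):
    """Return (left context, keyword, right context) for each hit of target."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

tokens = "the corpus is large and the corpus is balanced".split()
kwic = concordance(tokens, "corpus", width=2)

for left, key, right in kwic:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>10} [{key}] {right}")
```

Reading down the aligned keyword column is what lets recurring lexical and syntactic patterns around the target word stand out.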

Tools for corpus analysis

Bookmarks for corpus-based linguists
Recommendations:
  WordSmith Tools (not free)
  AntConc (free)
  TextStat (free)
Unix tools
Write your own scripts

Exercise (part 1)

Download and install AntConc
Download some text for processing
  Project Gutenberg
Generate a word frequency list for your mini-corpus
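For those who prefer to write their own script rather than use AntConc, a word frequency list can be sketched in a few lines. The tokenization rule (lowercased alphabetic strings, optionally with an apostrophe) is a simplifying assumption, and the sample text stands in for a downloaded file.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercased word forms; a word is a run of letters,
    optionally followed by an apostrophe and more letters."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words)

sample = "The cat sat on the mat. The mat was flat."
freq = word_frequencies(sample)
top = freq.most_common(2)

for word, count in freq.most_common():
    print(f"{count}\t{word}")
```

To run it on a real mini-corpus, the sample string could be replaced by the contents of a plain-text file read with `open(path, encoding="utf-8").read()`, where `path` is whatever file you saved.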

Interpreting corpus data

Are frequency differences statistically significant?
  w appears x times in an n-word corpus, and y times in an m-word corpus
  Chi-square test (doesn’t work well for small numbers)
  Fisher’s Exact Test (doesn’t work for a cross table larger than 2×2)
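Both tests operate on the 2×2 table built from the four counts: x and n − x for the first corpus, y and m − y for the second. The sketch below uses only the standard library and is meant as an illustration, not a replacement for a statistics package. It assumes Pearson's chi-square without continuity correction (1 degree of freedom) and a two-sided Fisher's exact test that sums the probabilities of all tables no more likely than the observed one; all function names are hypothetical.

```python
import math

def contingency(x, n, y, m):
    """Cells (a, b, c, d): w vs. other words in corpus 1 and in corpus 2."""
    return (x, n - x, y, m - y)

def chi_square(a, b, c, d):
    """Pearson chi-square statistic and p-value (df = 1, no correction).
    Assumes no row or column total is zero."""
    total = a + b + c + d
    stat = total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))

def _log_prob(a, b, c, d):
    """Log hypergeometric probability of one table with fixed margins."""
    lf = lambda k: math.lgamma(k + 1)  # log factorial
    return (lf(a + b) + lf(c + d) + lf(a + c) + lf(b + d)
            - lf(a + b + c + d) - lf(a) - lf(b) - lf(c) - lf(d))

def fisher_exact(a, b, c, d):
    """Two-sided p-value: sum over tables at most as probable as observed."""
    row1, row2, col1 = a + b, c + d, a + c
    observed = _log_prob(a, b, c, d)
    p = 0.0
    for k in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = _log_prob(k, row1 - k, col1 - k, row2 - col1 + k)
        if lp <= observed + 1e-9:  # small tolerance for rounding
            p += math.exp(lp)
    return min(p, 1.0)

# Hypothetical counts: w occurs 30 times in 10,000 words vs. 10 in 10,000.
table = contingency(30, 10_000, 10, 10_000)
print("chi-square:", chi_square(*table))
print("Fisher p:  ", fisher_exact(*table))
```

With very small counts the chi-square approximation becomes unreliable, which is where the exact test is the safer choice.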

Exercise (part 2)

Compare your word frequency list with that of the BNC
  Anything interesting?
Run the chi-square test and Fisher’s Exact Test on some interesting words

Interpreting corpus data (cont.)

Collocational analysis: How strongly are X and Y associated?
  Mutual information
    Measures difference between observed and expected frequencies of (X,Y)
    Higher MI, stronger association
    Doesn’t work well for low frequencies
  T-test
    Measures confidence with which to claim strong association between X and Y
    Higher t-score, higher association
  Online calculations
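One common formulation of these two measures, assumed here because the slide does not give formulas, takes the expected co-occurrence frequency as E = f(X)·f(Y)/N, then MI = log2(O/E) and t = (O − E)/√O, where O is the observed co-occurrence frequency and N the corpus size. The sketch ignores window-span adjustments, and the counts are invented for illustration.

```python
import math

def expected(fx, fy, n):
    """Expected co-occurrence frequency if X and Y were independent."""
    return fx * fy / n

def mutual_information(fxy, fx, fy, n):
    """MI = log2(observed / expected)."""
    return math.log2(fxy / expected(fx, fy, n))

def t_score(fxy, fx, fy, n):
    """t = (observed - expected) / sqrt(observed)."""
    return (fxy - expected(fx, fy, n)) / math.sqrt(fxy)

# Hypothetical counts: X occurs 100 times, Y 400 times, and they
# co-occur 16 times in a 1,000,000-word corpus.
mi = mutual_information(16, 100, 400, 1_000_000)
t = t_score(16, 100, 400, 1_000_000)
print(f"MI = {mi:.2f}, t = {t:.2f}")
```

Because MI divides by a tiny expected frequency when either word is rare, a single chance co-occurrence of two rare words can yield a very high score, which is the low-frequency problem noted above; the t-score is more conservative in that situation.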

Exercise (part 3)

Generate a concordance for a target word

Find a word that co-occurs frequently with the target word

Test if the word is strongly associated with the target word
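For the second step of this exercise, candidate collocates can be found by counting the words that fall within a small window around each occurrence of the target word. This is a minimal sketch assuming a pre-tokenized corpus; the function name, the window size, and the toy sentence are all illustrative.

```python
from collections import Counter

def collocates(tokens, target, window=2):
    """Count words occurring within `window` tokens of each hit of target."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Count every token in the window except the target itself.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo)
                          if j != i)
    return counts

tokens = "strong tea and strong coffee or strong tea".split()
near_tea = collocates(tokens, "tea", window=1)
print(near_tea.most_common())
```

The co-occurrence count for the top candidate, together with the individual frequencies of both words and the corpus size, is what an association measure needs as input.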

Note on research project design

Purpose of project
Corpus compilation and annotation
Corpus analysis
  Bottom-up: from observations of recurring patterns to hypotheses and generalizations
  Top-down: start with given categories and search for evidence of use and variance
Caution on generalizability

Future courses on corpus linguistics

Spring 2007
  APLING 597E: Introduction to Corpus Linguistics
    Hands-on course on principles and tools for corpus compilation, annotation, processing, and analysis
Spring 2008
  APLING 597: Seminar on Corpus Linguistics
    Advanced seminar on using corpora for serious research projects