Research methods in corpus linguistics
Xiaofei Lu
2
Overview
What is a corpus?
Types of corpora
Corpus design
Where to obtain corpora
Corpus annotation
Corpus analysis
Note on research project design
Exercises and demos in between
Future courses on corpus linguistics
3
What is a corpus?
Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer
Francis (1982): a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis
Sinclair (1991): a collection of naturally-occurring language text, chosen to characterise a state or a variety of language
4
Types of corpora
General-purpose monolingual corpora
  The British National Corpus
Specialized corpora
  Lancaster Corpus of Academic Written English
Learner corpora
  International Corpus of Learner English
Parallel & comparable corpora
  The JRC-Acquis Multilingual Parallel Corpus
  The English-Chinese Parallel Concordancer
Corpora and varieties
  International Corpus of English
Synchronic and diachronic corpora
5
Corpus design
Purpose
Comparability
Type
Content: mode, interaction, domain, medium
Structure: proportions
Size
Sampling?
Design of the BNC
6
Where to obtain corpora
Linguistic Data Consortium
Bookmarks for corpus-based linguists
Ask on the Corpora list
Compile your own corpora
  Design your corpus
  Getting permission
  File format, metadata, and data markup
  Text capture
    Scanning, typing, electronic files, web crawlers, e.g., WebSPHINX
    Transcription tools, e.g., Transcriber
  A Guide to Good Practice
7
Corpus annotation
Why annotate
Levels of corpus annotation
Difficulties for corpus annotation
Tools for corpus annotation
8
Why annotate
For linguistic research
  Allow more effective corpus searches
For natural language processing
  Spelling and grammar checking
  Text summarization
  Machine translation
  Question answering
9
Levels of corpus annotation
Sentence segmentation
Word segmentation/tokenization
Part-of-speech (POS) tagging
Chunking/shallow parsing
Syntactic parsing
Semantic annotation
Pragmatic annotation
Parallel corpora: sentence alignment
Learner corpora: error annotation
10
Difficulties for corpus annotation
Ambiguity
  I saw a pig with binoculars.
  Problems for tagging, parsing, & word sense disambiguation (WSD)
Unknown words
  Identification
  POS tagging
  Semantic annotation
11
Tools for corpus annotation
Bookmarks for corpus-based linguists
Corpora and Corpus Annotation Tools on the WWW
POS tagger demonstration
  Sentence segmentation
  POS tagging
  Extracting NPs of the form DT NN NN
Dexter: Tools for analyzing language data
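The "Extracting NPs of the form DT NN NN" step from the demonstration can be sketched in a few lines of Python. This is only an illustration, not the demo's actual tool: the function name `extract_dt_nn_nn` and the hand-tagged sentence are assumptions, and the input is taken to be a list of (word, tag) pairs using Penn Treebank-style tags.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, POS tag)


def extract_dt_nn_nn(tagged: List[TaggedToken]) -> List[str]:
    """Return every three-word sequence whose tags are exactly DT NN NN."""
    pattern = ("DT", "NN", "NN")
    matches = []
    # Slide a three-token window over the sentence and compare tag sequences.
    for i in range(len(tagged) - len(pattern) + 1):
        window = tagged[i:i + len(pattern)]
        if tuple(tag for _, tag in window) == pattern:
            matches.append(" ".join(word for word, _ in window))
    return matches


# Hand-tagged example sentence (illustrative tags, not real tagger output).
sentence = [
    ("The", "DT"), ("corpus", "NN"), ("linguist", "NN"), ("built", "VBD"),
    ("a", "DT"), ("frequency", "NN"), ("list", "NN"), (".", "."),
]
print(extract_dt_nn_nn(sentence))  # ['The corpus linguist', 'a frequency list']
```

In practice the (word, tag) pairs would come from a POS tagger applied after sentence segmentation and tokenization.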
12
Corpus analysis
Levels of corpus analysis
Tools for corpus analysis
Interpreting corpus data
13
Levels of corpus analysis
Word frequency lists
Concordances
Collocation (lexical patterning)
Colligation (syntactic patterning)
Keyword lists
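To make the idea of a concordance concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is a rough approximation of what tools such as AntConc produce; the names `tokenize` and `concordance`, the crude regular-expression tokenizer, and the sample text are all assumptions made for illustration.

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Crude tokenizer: lowercase alphabetic words, optional internal apostrophe."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def concordance(tokens: List[str], target: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each hit of target."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


tokens = tokenize("The cat sat on the mat. The dog saw the cat.")
for left, keyword, right in concordance(tokens, "cat", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20}  [{keyword}]  {right}")
```

Sorting the lines by the first word of the right or left context is a common next step when looking for collocation and colligation patterns.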
14
Tools for corpus analysis
Bookmarks for corpus-based linguists
Recommendations:
  WordSmith Tools (not free)
  AntConc (free)
  TextStat (free)
Unix tools
Write your own scripts
15
Exercise (part 1)
Download and install AntConc
Download some text for processing
  Project Gutenberg
Generate a word frequency list for your mini-corpus
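For readers who prefer a script to a GUI tool, the frequency-list step of this exercise can be approximated as follows. This is a minimal sketch, not a replacement for AntConc: the function name `word_frequency_list` and the sample text are hypothetical, and the regular-expression tokenizer is deliberately crude.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple


def word_frequency_list(text: str, top_n: Optional[int] = None) -> List[Tuple[str, int]]:
    """Return (word, count) pairs sorted from most to least frequent."""
    # Lowercase the text and keep alphabetic words (with an optional apostrophe part).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens).most_common(top_n)


sample = "The cat sat on the mat. The dog sat too."
print(word_frequency_list(sample, top_n=2))  # [('the', 3), ('sat', 2)]
```

For a real mini-corpus, the text would be read from the downloaded file, for example with `open(path, encoding="utf-8").read()`, where `path` is wherever you saved the text.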
16
Interpreting corpus data
Are frequency differences statistically significant?
  w appears x times in an n-word corpus, and y times in an m-word corpus
  Chi-square test (unreliable when counts are small)
  Fisher’s Exact Test (in its usual form, limited to a 2×2 cross table)
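Both tests can be run on the 2×2 table built from the counts above: rows are the two corpora, columns are "occurrences of w" and "all other words". The sketch below uses only the standard library. The function names are hypothetical, the chi-square statistic is Pearson's without continuity correction, and the Fisher p-value follows one common two-sided convention (summing the probabilities of all tables no more likely than the observed one). It assumes w occurs at least once overall and that neither corpus consists only of w.

```python
from math import comb, erfc, sqrt
from typing import Tuple


def chi_square_2x2(x: int, n: int, y: int, m: int) -> Tuple[float, float]:
    """Pearson chi-square (no continuity correction) and p-value, 1 degree of freedom."""
    a, b = x, n - x  # corpus 1: occurrences of w, all other tokens
    c, d = y, m - y  # corpus 2: occurrences of w, all other tokens
    total = n + m
    chi2 = total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi-square > v) = erfc(sqrt(v / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


def fisher_exact_2x2(x: int, n: int, y: int, m: int) -> float:
    """Two-sided Fisher's exact test p-value for the same 2x2 table."""
    hits = x + y
    denom = comb(n + m, hits)

    def prob(k: int) -> float:
        # Hypergeometric probability that k of the hits fall in corpus 1.
        return comb(n, k) * comb(m, hits - k) / denom

    p_observed = prob(x)
    lowest, highest = max(0, hits - m), min(n, hits)
    # Sum every table at most as likely as the observed one (small float tolerance).
    p = sum(pk for pk in map(prob, range(lowest, highest + 1))
            if pk <= p_observed * (1 + 1e-9))
    return min(1.0, p)


chi2, p_chi2 = chi_square_2x2(10, 100, 20, 100)
print(round(chi2, 3), round(p_chi2, 3))
print(round(fisher_exact_2x2(3, 4, 1, 4), 4))
```

The exact binomial coefficients get expensive for very frequent words in very large corpora; in that case a statistics package is the more practical choice.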
17
Exercise (part 2)
Compare your word frequency list with that of the BNC
  Anything interesting?
Run the chi-square test and Fisher’s Exact Test on some interesting words
18
Interpreting corpus data (cont.)
Collocational analysis: how strongly are X and Y associated?
Mutual information (MI)
  Measures difference between observed and expected frequencies of (X,Y)
  Higher MI, stronger association
  Doesn’t work well for low frequencies
T-test
  Measures confidence with which to claim strong association between X and Y
  Higher t-score, stronger association
Online calculations
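The two association measures can be computed by hand from four counts. The sketch below uses one common formulation from the collocation literature, which may differ from what a given online calculator implements (for example, some adjust the expected frequency for the span size): expected frequency E = f(X) · f(Y) / N, MI = log2(O / E), and t-score = (O − E) / √O, where O is the observed co-occurrence frequency. Function and parameter names are hypothetical.

```python
from math import log2, sqrt


def expected_cooccurrence(freq_x: int, freq_y: int, corpus_size: int) -> float:
    """Expected co-occurrence count if X and Y were independent."""
    return freq_x * freq_y / corpus_size


def mutual_information(observed: int, freq_x: int, freq_y: int, corpus_size: int) -> float:
    """MI = log2(O / E); assumes observed > 0."""
    return log2(observed / expected_cooccurrence(freq_x, freq_y, corpus_size))


def t_score(observed: int, freq_x: int, freq_y: int, corpus_size: int) -> float:
    """t = (O - E) / sqrt(O); assumes observed > 0."""
    expected = expected_cooccurrence(freq_x, freq_y, corpus_size)
    return (observed - expected) / sqrt(observed)


# Made-up counts: X and Y each occur 1,000 times in a 1,000,000-word corpus
# and co-occur 8 times, so E = 1.
print(mutual_information(8, 1000, 1000, 1_000_000))  # 3.0
print(round(t_score(8, 1000, 1000, 1_000_000), 3))   # 2.475
```

Because MI divides by the expected frequency, rare word pairs with tiny E can score very high on a handful of co-occurrences, which is the low-frequency weakness noted above; the t-score is less affected, so the two are often reported together.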
19
Exercise (part 3)
Generate a concordance for a target word
Find a word that co-occurs frequently with the target word
Test if the word is strongly associated with the target word
20
Note on research project design
Purpose of project
Corpus compilation and annotation
Corpus analysis
  Bottom-up: from observations of recurring patterns to hypotheses and generalizations
  Top-down: start with given categories and search for evidence of use and variance
Caution on generalizability
21
Future courses on corpus linguistics
Spring 2007
  APLING 597E: Introduction to Corpus Linguistics
  Hands-on course on principles and tools for corpus compilation, annotation, processing, and analysis
Spring 2008
  APLING 597: Seminar on Corpus Linguistics
  Advanced seminar on using corpora for serious research projects