Corpus linguistics

Introduction to corpus linguistics

What is a corpus?

• A collection of words?
• Is it a theory or a methodology of language?

Why use a corpus?

• Large amounts of data tell us about tendencies and what’s normal or typical in real-life language use.
• Corpora also reveal instances of very rare or exceptional cases that we wouldn’t get from looking at single texts or from introspection.
• Human researchers make mistakes and are slow. Computers are much quicker and more accurate.

Criteria in building a corpus

1. It must be a large body of text.
2. It needs to be representative of language (or a genre of language).
3. It must be in machine-readable form (e.g. txt files on a computer).
4. It acts as a standard reference about what’s typical in language.
5. It is often annotated with additional linguistic information, e.g. grammatical codes.

Annotation and mark-up

Corpus texts may be enriched with additional information to ease analysis.

Note that this type of additional information may be called ‘mark up’, ‘annotation’, or ‘tagging’. All three terms are near synonyms. Annotation usually refers to linguistic information encoded in a corpus; however, the encoding is achieved using a mark-up language. Similarly, the annotation itself is usually undertaken by putting so-called tags (short codes that indicate some linguistic feature) into a text. Hence, while the terms can be separated, they can also be used interchangeably!

One final note: an XML closing tag is marked with a forward slash (as in </head>), not a backslash.

Some untagged text

“Arrest warrant out for Clowes’ partner years before collapse.” By Daniel John A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman.

Add tags for headers and paragraphs

<head type=MAIN>“Arrest warrant out for Clowes’ partner years before collapse.”</head>
<head type=BYLINE>By Daniel John</head>
A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman.

Add sentence tags

<head type=MAIN><s n=001>“Arrest warrant out for Clowes’ partner years before collapse.”</head>
<head type=BYLINE><s n=002>By Daniel John</head>
<s n=003>A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman.

Change quotes to SGML entities

<head type=MAIN><s n=001>&bquo;Arrest warrant out for Clowes’ partner years before collapse&equo;</head>
<head type=BYLINE><s n=002>By Daniel John</head>
<s n=003>A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman.

Add tags for punctuation

<head type=MAIN><s n=001><c PUQ>&bquo;Arrest warrant out for Clowes<c PUN>’ partner years before collapse <c PUQ>&equo;</head>
<head type=BYLINE><s n=002>By Daniel John</head>
<s n=003>A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman <c PUN>.

Add grammatical codes to word units

<head type=MAIN><s n=001><c PUQ>&bquo;<w NN1>Arrest <w NN1>warrant <w AVP>out <w PRP>for <w NP0>Clowes<c PUN>’ <w NN1>partner <w NN2>years <w PRP>before <w NN1>collapse <c PUQ>&equo;<c PUN>.</head>
<head type=BYLINE><s n=002><w PRP>By <w NP0>Daniel <w NP0>John</head>
<s n=003><w AT1>A <w NN1>WARRANT <w PRP>for <w AT0>the <w NN1>arrest <w PRF>of <w AT0>the <w DT0>former <w NN1>partner <w PRF>of <w NP0>Mr <w NP0>Peter <w NP0>Clowes <w VBD>was <w VVN>issued <w CRD>seven <w NN2>years <w CJS>before <w DPS>his <w NN1-NP0>Barlow <w NP0>Clowes <w NN1>investment <w NN1>empire <w VVD>collapsed<c PUN>, <w PRP>according to <w NN1>evidence <w VVN>submitted <w PRP>to <w AT0>the <w AJ0>Parliamentary <w NN1>Ombudsman<c PUN>.
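Grammatical codes make a text searchable by word class as well as by word form. The sketch below shows one way such tagged text might be read back into (unit type, code, text) triples. It assumes the simple `<w CODE>word` and `<c CODE>punctuation` layout of the example; the pattern `UNIT` and the function `parse_units` are illustrative names, not part of any corpus tool.

```python
import re

# One unit is <w CODE>text or <c CODE>text; the text runs up to the next "<".
# This pattern is an assumption based on the layout of the example above.
UNIT = re.compile(r"<([wc]) ([A-Z0-9-]+)>([^<]*)")

def parse_units(tagged: str) -> list[tuple[str, str, str]]:
    """Return (unit type, grammatical code, text) for each word or punctuation unit."""
    return [(kind, code, text.strip()) for kind, code, text in UNIT.findall(tagged)]

# A short fragment in the same style as the tagged example.
sample = "<s n=003><w AT0>the <w NN1>arrest <w PRF>of <w NP0>Mr <w NP0>Peter <w NP0>Clowes<c PUN>."
units = parse_units(sample)

# Search by grammatical code instead of by word form: all singular common nouns.
nouns = [text for kind, code, text in units if kind == "w" and code == "NN1"]
print(units)
print(nouns)
```

Sentence tags such as `<s n=003>` are skipped here because the pattern only matches `w` and `c` units.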

Types of Corpora

1. Specialised corpus, e.g.
• genre: the language of newspapers
• time: 2005 to the present day
• place: just texts published in China
2. General corpus: needs to be much larger. E.g. the British National Corpus (BNC) has about 100 million words of spoken and written British English.

The BNC

Types of Corpora

3. Multilingual corpus, e.g. English and Spanish, or American English and Indian English. http://ice-corpora.net/ICE/INDEX.HTM
4. Parallel corpus, e.g. English and Spanish: exactly the same texts translated. E.g. the CRATER corpus http://catalog.elra.info/product_info.php?products_id=84
5. Learner corpus: language use created by people learning a particular language. E.g. the International Corpus of Learner English.
6. Historical or diachronic corpus, e.g. the Helsinki corpus: 1.5 million words of texts from 700AD to 1700AD.
7. Monitor corpus: continually being added to, e.g. the Bank of English http://www.collins.co.uk/page/Wordbanks+Online


Frequency data, concordances and collocation

• Frequencies
Your query "wash" returned 2415 matches in 952 different texts (in 97,626,093 words; freq: 24.74 instances per million words)
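The per-million figure in this query result is a normalised frequency: the raw number of matches divided by the corpus size, scaled to one million words. A minimal sketch of that arithmetic follows; the function name `per_million` is only illustrative.

```python
def per_million(matches: int, corpus_size: int) -> float:
    """Normalise a raw frequency to instances per million words."""
    return matches / corpus_size * 1_000_000

# Figures taken from the query result for "wash".
rate = per_million(2415, 97_626_093)
print(f"{rate:.2f} instances per million words")  # 24.74 instances per million words
```

Normalising like this is what makes frequencies comparable between corpora of different sizes.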

Concordances aka Key Word In Context

Concordance (sorted at 1L)
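A concordance lists every occurrence of a search word with some context on either side, and sorting at 1L orders the lines by the first word to the left of the search word. The sketch below is one simple way to do this, assuming whitespace tokenisation and a made-up sentence; `kwic` and `sort_1l` are hypothetical helper names.

```python
def kwic(tokens: list[str], node: str, width: int = 3) -> list[tuple[list[str], str, list[str]]]:
    """Return (left context, node, right context) for every occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def sort_1l(lines):
    """Sort concordance lines by the first word to the left of the node (1L)."""
    return sorted(lines, key=lambda line: line[0][-1].lower() if line[0] else "")

# Invented text for illustration only; a real corpus would be far larger.
text = "you should wash your hands and then wash the cups but never wash wool in hot water"
tokens = text.split()

for left, node, right in sort_1l(kwic(tokens, "wash")):
    print(f"{' '.join(left):>25} [{node}] {' '.join(right)}")
```

Sorting on the left or right neighbour groups similar contexts together, which makes recurring patterns easier to spot by eye.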

Collocations
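Collocation is the tendency of words to co-occur with one another. The sketch below counts raw co-occurrences within a fixed window around a search word. The window size, the function name `collocates` and the sample text are all assumptions made for illustration; corpus tools usually rank collocates with a statistical association measure, not raw counts.

```python
from collections import Counter

def collocates(tokens: list[str], node: str, window: int = 2) -> Counter:
    """Count the words that occur within `window` tokens either side of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # do not count the node itself
                    counts[tokens[j]] += 1
    return counts

# Invented text for illustration only.
tokens = "wash your hands before dinner and wash your hands after work".split()
counts = collocates(tokens, "wash")
print(counts.most_common(2))
```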

Corpora and Language Teaching

• Textbooks
• Dictionaries
• Classroom Exercises
• Tests
• Learner Corpora

Limitations of Corpus linguistics

• It won’t tell us if something is possible in a language, or well-formed. E.g. is “he expired of heart disease” acceptable English?
• Any generalisations we make from corpus data can only be deductions, not facts.
• Corpora give us evidence, but not information or explanations. Why do women say “wash” more than men?
• Corpora give us language out of context, so no visual information, e.g. pictures, fonts etc. And with spoken data, there is no information on what the speakers look like, their behaviour or body language.

Further Reading

• McEnery, Tony & Wilson, Andrew (2001) Corpus Linguistics. Edinburgh: Edinburgh University Press. Chapter 1.
• Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge University Press. Chapter 1.

Question 1
What is a corpus?
• A theory of language.
• A collection of texts stored on a computer.
• An electronic database similar to a dictionary.
• Any large collection of words such as a collection of books, newspapers or magazines.

Question 2
What is the main reason for using corpora?
• Other methods of language analysis are not reliable.
• Computers can confirm our intuitions about language.
• Computers can help us discover interesting patterns in language which would be difficult to spot otherwise.
• With corpora we can answer all research questions about language.

Question 3
What is corpus annotation?
• Adding an extra layer of information to the text to allow for more sophisticated searches.
• Separating text into sentences.
• Manual coding of text for parts of speech.
• Adding critical comments to a text.

Question 4
What is a specialised corpus?
• A corpus that is used for historical language investigations.
• A corpus that is composed of a large variety of genres.
• A corpus that is used by language specialists.
• A corpus that focuses on e.g. one type of genre, one period, one place etc.

Question 5
Which of these is NOT a type of corpus?
• Multilingual corpus
• Learner corpus
• Diachronic corpus
• Observer corpus

Question 6
What is the BNC?
• A large general corpus of British English.
• A corpus of different genres of English writing.
• A large spoken corpus of British English.
• A specialised corpus representing the language of newspapers.

Question 7
Which of these statements is NOT true about a monitor corpus?
• It is frequently updated.
• The Bank of English is an example of a monitor corpus.
• The BNC is an example of a monitor corpus.
• It is used to monitor rapid change in language.

Question 8
What is a concordance?
• Information about word frequencies normalised per million words.
• Listing of examples of a word searched in a corpus with some context on the right and some context on the left.
• An alphabetical list of words that appear in a text.
• A list of words and their frequencies that can be used for identifying important words in a text.

Question 9
What is collocation?
• The tendency of speakers to talk over each other.
• The tendency of words to co-occur with one another.
• The tendency of words to appear in unique, different contexts each time.
• The tendency of sentences to create meaning.

Question 10
What is a frequency distribution in a corpus?
• Information about how frequent a word is in a corpus.
• Information about the frequency of use of a term across a number of different texts, corpus sections, speakers etc.
• Information about how frequent a word is per million words.
• Sociolinguistic information about the gender of the speakers that are represented in a corpus.

Brown and LOB

These corpora are sometimes referred to as ‘snapshot’ corpora - their design is such that they try to represent a broad range of genres of published, professionally authored, English. Their goal is to capture the language at one moment in time, hence the term ‘snapshot’.

Of course, as with any snapshot there are things you see and things you do not see. So, in this case, we are looking at professionally authored written English - not speech and not writing of a more informal variety. We are also only looking at certain genres. As with any snapshot, it was taken at a certain point of time in a certain place - Brown is America in the early 1960s, LOB is the UK in the early 1960s. Such corpora are often used to compare and contrast varieties of a language - in this case two varieties of English. They can also be looked at on their own to explore either variety of English in its own right.

The Brown corpus is so named because it was developed at Brown University in the US. LOB is an acronym, standing for Lancaster-Oslo-Bergen, the three Universities that collaborated to build that corpus.

Back to the snapshot metaphor! The two corpora can be compared because they are composed in the same way - the subject is the same, if you like. They look at broadly the same genres. Those genres are represented by similar numbers of similarly sized chunks of data. Also, of course, the data was gathered in roughly the same time period.

The genres covered in the two corpora are outlined below. Note the letter code for each genre - that is important, as it shows you which genre is associated with which file in the corpus. Following the letter code is a description of the type of data in the category, followed by two numbers in parentheses - the first is the number of chunks of data in that category in Brown, the second is the number of chunks of data in that category in LOB. There are five hundred chunks of data in each corpus. Each chunk is approximately 2,000 words in size, giving a rough overall corpus size of 1,000,000 words each.

A Press: reportage (44, 44)

B Press: editorial (27, 27)

C Press: reviews (17, 17)

D Religion (17, 17)

E Skills, trades and hobbies (36, 38)

F Popular lore (48, 44)

G Belles lettres, biography, essays (75, 77)

H Miscellaneous (documents, reports, etc.) (30, 30)

J Learned and scientific writings (80, 80)

K General fiction (29, 29)

L Mystery and detective fiction (24, 24)

M Science fiction (6, 6)

N Adventure and western fiction (29, 29)

P Romance and love story (29, 29)

R Humour (9, 9)
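The figures in this list can be checked against the stated design of five hundred chunks of about 2,000 words per corpus. The sketch below simply re-enters the counts from the list and adds them up; the names `GENRES` and `CHUNK_WORDS` are illustrative.

```python
# (Brown, LOB) chunk counts per genre code, copied from the list above.
GENRES = {
    "A": (44, 44), "B": (27, 27), "C": (17, 17), "D": (17, 17),
    "E": (36, 38), "F": (48, 44), "G": (75, 77), "H": (30, 30),
    "J": (80, 80), "K": (29, 29), "L": (24, 24), "M": (6, 6),
    "N": (29, 29), "P": (29, 29), "R": (9, 9),
}
CHUNK_WORDS = 2000  # approximate size of each chunk

brown_chunks = sum(brown for brown, _ in GENRES.values())
lob_chunks = sum(lob for _, lob in GENRES.values())
print(brown_chunks, lob_chunks)  # 500 500
print(brown_chunks * CHUNK_WORDS, lob_chunks * CHUNK_WORDS)  # 1000000 1000000

# Genres where the two corpora are not sampled with identical chunk counts.
differing = [code for code, (brown, lob) in GENRES.items() if brown != lob]
print(differing)  # ['E', 'F', 'G']
```

Only categories E, F and G differ between the two corpora, and then only by a few chunks, which is why the pair can be compared genre by genre.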
