Corpus Linguistics: Introduction to Corpus Linguistics

Page 1: Corpus linguistics

Corpus Linguistics

Introduction to Corpus Linguistics

Page 2: Corpus linguistics

What is a corpus?

• A collection of words?
• Is it a theory or a methodology of language?

Page 3: Corpus linguistics

Why use a corpus?

• Large amounts of data tell us about tendencies and what’s normal or typical in real-life language use.
• Corpora also reveal very rare or exceptional cases that we wouldn’t get from looking at single texts or from introspection.
• Human researchers make mistakes and are slow. Computers are much quicker and more accurate.

Page 4: Corpus linguistics

Criteria in building a corpus

1. It must be a large body of text.
2. It needs to be representative of language (or a genre of language).
3. It must be in machine-readable form (e.g. txt files on a computer).
4. It acts as a standard reference about what’s typical in language.
5. It is often annotated with additional linguistic information, e.g. grammatical codes.

Page 5: Corpus linguistics

Annotation and mark-up

Corpus texts may be enriched with additional information to ease analysis.

Note that this type of additional information may be called ‘mark-up’, ‘annotation’, or ‘tagging’. All three terms are near synonyms. Annotation usually refers to linguistic information encoded in a corpus; however, the encoding is achieved using a mark-up language. Similarly, the annotation itself is usually undertaken by putting so-called tags (short codes that indicate some linguistic feature) into a text. Hence, while the terms can be separated, they can also be used interchangeably.

One final note: an XML closing tag, such as </p>, is written with a forward slash rather than a backslash.

Page 6: Corpus linguistics

Some untagged text

“Arrest warrant out for Clowes’ partner years before collapse.”
By Daniel John
A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman.

Page 7: Corpus linguistics

Add tags for headers and paragraphs

<head type=MAIN>“Arrest warrant out for Clowes’ partner years before collapse.”</head>
<head type=BYLINE>By Daniel John</head>
<p>A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman.</p>

Page 8: Corpus linguistics

Add sentence tags

<head type=MAIN><s n=001>“Arrest warrant out for Clowes’ partner years before collapse.”</head>
<head type=BYLINE><s n=002>By Daniel John</head>
<p><s n=003>A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman.</p>

Page 9: Corpus linguistics

Change quotes to SGML entities

<head type=MAIN><s n=001>&bquo;Arrest warrant out for Clowes’ partner years before collapse&equo;</head>
<head type=BYLINE><s n=002>By Daniel John</head>
<p><s n=003>A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman.</p>

Page 10: Corpus linguistics

Add tags for punctuation

<head type=MAIN><s n=001><c PUQ>&bquo;Arrest warrant out for Clowes<c PUN>’ partner years before collapse <c PUQ>&equo;</head>
<head type=BYLINE><s n=002>By Daniel John</head>
<p><s n=003>A WARRANT for the arrest of the former partner of Mr Peter Clowes was issued seven years before his Barlow Clowes investment empire collapsed, according to evidence submitted to the Parliamentary Ombudsman <c PUN>.</p>

Page 11: Corpus linguistics

Add grammatical codes to word units

<head type=MAIN><s n=001><c PUQ>&bquo;<w NN1>Arrest <w NN1>warrant <w AVP>out <w PRP>for <w NP0>Clowes<c PUN>’ <w NN1>partner <w NN2>years <w PRP>before <w NN1>collapse <c PUQ>&equo;<c PUN>.</head>
<head type=BYLINE><s n=002><w PRP>By <w NP0>Daniel <w NP0>John</head>
<p><s n=003><w AT1>A <w NN1>WARRANT <w PRP>for <w AT0>the <w NN1>arrest <w PRF>of <w AT0>the <w DT0>former <w NN1>partner <w PRF>of <w NP0>Mr <w NP0>Peter <w NP0>Clowes <w VBD>was <w VVN>issued <w CRD>seven <w NN2>years <w CJS>before <w DPS>his <w NN1-NP0>Barlow <w NP0>Clowes <w NN1>investment <w NN1>empire <w VVD>collapsed<c PUN>, <w PRP>according to <w NN1>evidence <w VVN>submitted <w PRP>to <w AT0>the <w AJ0>Parliamentary <w NN1>Ombudsman<c PUN>.</p>
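Once every word carries a grammatical code, the codes can be read back out by a program. The following is a minimal sketch in Python, assuming well-formed units of the shape <w TAG>word and <c TAG>mark as on this slide. The names UNIT and extract_units are illustrative and do not come from any corpus toolkit.

```python
import re

# Matches one word unit (<w TAG>word) or punctuation unit (<c TAG>mark).
# Assumes tags consist of capitals, digits and hyphens, e.g. NN1 or NN1-NP0.
UNIT = re.compile(r"<(w|c) ([A-Z0-9-]+)>([^<]*)")

def extract_units(tagged):
    """Return (token, tag) pairs found in a tagged string."""
    return [(m.group(3).strip(), m.group(2)) for m in UNIT.finditer(tagged)]

# The byline from the slide; the <s n=002> sentence tag is skipped
# because it is neither a word unit nor a punctuation unit.
sample = "<s n=002><w PRP>By <w NP0>Daniel <w NP0>John"
pairs = extract_units(sample)
print(pairs)
```

With pairs like these, a search for "all proper nouns" becomes a simple filter on the second element of each pair.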

Page 12: Corpus linguistics

Types of Corpora

1. Specialised corpus – e.g.
   • genre: the language of newspapers
   • time: 2005 to the present day
   • place: just texts published in China
2. General corpus – needs to be much larger. E.g. the British National Corpus (BNC) has about 100 million words of spoken and written British English.

Page 13: Corpus linguistics

The BNC

Page 14: Corpus linguistics

Types of Corpora

3. Multilingual corpus – e.g. English and Spanish. Or American English and Indian English. http://ice-corpora.net/ICE/INDEX.HTM
4. Parallel corpus – e.g. English and Spanish – exactly the same texts translated. E.g. the CRATER corpus http://catalog.elra.info/product_info.php?products_id=84
5. Learner corpus – language use created by people learning a particular language. E.g. the International Corpus of Learner English.
6. Historical or diachronic corpus – e.g. the Helsinki corpus – 1.5 million words of texts from 700 AD to 1700 AD.
7. Monitor corpus – continually being added to. E.g. the Bank of English http://www.collins.co.uk/page/Wordbanks+Online

Page 18: Corpus linguistics

Frequency data, concordances and collocation

• Frequencies
Your query "wash" returned 2415 matches in 952 different texts (in 97,626,093 words; freq: 24.74 instances per million words)
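The "instances per million words" figure is a normalised frequency: the raw number of hits divided by the corpus size, scaled to one million words. A small sketch in Python using the numbers from the query above; the function name per_million is made up for illustration.

```python
def per_million(hits, corpus_size):
    """Normalise a raw hit count to a rate per one million words."""
    return hits / corpus_size * 1_000_000

# Figures taken from the query result on this slide.
rate = per_million(2415, 97_626_093)
print(f"{rate:.2f} instances per million words")
```

Normalising in this way is what allows frequencies from corpora of different sizes to be compared.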

Page 19: Corpus linguistics

Concordances, also known as Key Word in Context (KWIC)

Page 20: Corpus linguistics

Concordance (sorted at 1L)
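A concordance lists every hit for the search word with some context on either side, and "sorted at 1L" means the lines are ordered by the first word to the left of the search word. The sketch below shows the idea in Python on an invented sentence. The names kwic and sort_1l are hypothetical, and real concordancers work on far larger texts with proper tokenisation.

```python
def kwic(tokens, node, width=3):
    """Return (left, node, right) triples for every occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def sort_1l(lines):
    """Sort concordance lines by the first word to the left of the node (1L)."""
    return sorted(lines, key=lambda line: line[0][-1].lower() if line[0] else "")

# Invented example text, split on whitespace for simplicity.
text = "I wash the car and you wash the dishes before they wash up"
lines = sort_1l(kwic(text.split(), "wash"))
for left, node, right in lines:
    print(f"{' '.join(left):>20}  {node}  {' '.join(right)}")
```

Sorting on a neighbouring position groups similar contexts together, which makes recurring patterns around the search word easier to see.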

Page 21: Corpus linguistics

Collocations
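Collocation is about which words tend to occur near a given word. A minimal way to look for candidates is to count the words that fall within a fixed window around each hit. The Python sketch below does only that raw counting on an invented sentence; the function name collocates and the window size are assumptions, and corpus tools normally go further and rank candidates with a statistical association measure.

```python
from collections import Counter

def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens either side of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented example text, split on whitespace for simplicity.
tokens = "wash your hands then wash your face and wash the car".split()
counts = collocates(tokens, "wash")
for word, freq in counts.most_common():
    print(word, freq)
```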

Page 23: Corpus linguistics

Corpora and Language Teaching

• Textbooks
• Dictionaries
• Classroom Exercises
• Tests
• Learner Corpora

Page 24: Corpus linguistics

Limitations of Corpus Linguistics

• It won’t tell us whether something is possible or well-formed in a language. E.g. is “he expired of heart disease” acceptable English?
• Any generalisations we make from corpus data can only be deductions, not facts.
• Corpora give us evidence, but not information or explanations. Why do women say “wash” more than men?
• Corpora give us language out of context, so there is no visual information (e.g. pictures, fonts). With spoken data, there is no information on what the speakers look like, their behaviour or their body language.

Page 25: Corpus linguistics

Further Reading

• McEnery, Tony & Wilson, Andrew (2001) Corpus Linguistics. Edinburgh: Edinburgh University Press. Chapter 1.
• Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge University Press. Chapter 1.

Page 26: Corpus linguistics

Question 1
What is a corpus?
• A theory of language.
• A collection of texts stored on a computer.
• An electronic database similar to a dictionary.
• Any large collection of words such as a collection of books, newspapers or magazines.

Page 27: Corpus linguistics

Question 2
What is the main reason for using corpora?
• Other methods of language analysis are not reliable.
• Computers can confirm our intuitions about language.
• Computers can help us discover interesting patterns in language which would be difficult to spot otherwise.
• With corpora we can answer all research questions about language.

Page 28: Corpus linguistics

Question 3
What is corpus annotation?
• Adding an extra layer of information to the text to allow for more sophisticated searches.
• Separating text into sentences.
• Manual coding of text for parts of speech.
• Adding critical comments to a text.

Page 29: Corpus linguistics

Question 4
What is a specialised corpus?
• A corpus that is used for historical language investigations.
• A corpus that is composed of a large variety of genres.
• A corpus that is used by language specialists.
• A corpus that focuses on e.g. one type of genre, one period, one place etc.

Page 30: Corpus linguistics

Question 5
Which of these is NOT a type of corpus?
• Multilingual corpus
• Learner corpus
• Diachronic corpus
• Observer corpus

Page 31: Corpus linguistics

Question 6
What is the BNC?
• A large general corpus of British English.
• A corpus of different genres of English writing.
• A large spoken corpus of British English.
• A specialised corpus representing the language of newspapers.

Page 32: Corpus linguistics

Question 7
Which of these statements is NOT true about a monitor corpus?
• It is frequently updated.
• The Bank of English is an example of a monitor corpus.
• The BNC is an example of a monitor corpus.
• It is used to monitor rapid change in language.

Page 33: Corpus linguistics

Question 8
What is a concordance?
• Information about word frequencies normalised per million words.
• Listing of examples of a word searched in a corpus with some context on the right and some context on the left.
• An alphabetical list of words that appear in a text.
• A list of words and their frequencies that can be used for identifying important words in a text.

Page 34: Corpus linguistics

Question 9
What is collocation?
• The tendency of speakers to talk over each other.
• The tendency of words to co-occur with one another.
• The tendency of words to appear in unique, different contexts each time.
• The tendency of sentences to create meaning.

Page 35: Corpus linguistics

Question 10
What is a frequency distribution in a corpus?
• Information about how frequent a word is in a corpus.
• Information about the frequency of use of a term across a number of different texts, corpus sections, speakers etc.
• Information about how frequent a word is per million words.
• Sociolinguistic information about the gender of the speakers that are represented in a corpus.

Page 36: Corpus linguistics

Brown and LOB

These corpora are sometimes referred to as ‘snapshot’ corpora - their design is such that they try to represent a broad range of genres of published, professionally authored, English. Their goal is to capture the language at one moment in time, hence the term ‘snapshot’.

Of course, as with any snapshot there are things you see and things you do not see. So, in this case, we are looking at professionally authored written English - not speech and not writing of a more informal variety. We are also only looking at certain genres. As with any snapshot, it was taken at a certain point of time in a certain place - Brown is America in the early 1960s, LOB is the UK in the early 1960s. Such corpora are often used to compare and contrast varieties of a language - in this case two varieties of English. They can also be looked at on their own to explore either variety of English in its own right.

The Brown corpus is so named because it was developed at Brown University in the US. LOB is an acronym standing for Lancaster-Oslo-Bergen, the three universities that collaborated to build that corpus.

Back to the snapshot metaphor! The two corpora can be compared because they are composed in the same way: the subject is the same, if you like. They look at broadly the same genres. Those genres are represented by similar numbers of similarly sized chunks of data. Also, of course, the data was gathered in roughly the same time period.

The genres covered in the two corpora are outlined below. Note the letter code for each genre - that is important, as it shows you which genre is associated with which file in the corpus. Following the letter code is a description of the type of data in the category, followed by two numbers in parentheses - the first is the number of chunks of data in that category in Brown, the second is the number of chunks of data in that category in LOB. There are five hundred chunks of data in each corpus. Each chunk is approximately 2,000 words in size, giving a rough overall corpus size of 1,000,000 words each.

A Press: reportage (44, 44)

B Press: editorial (27, 27)

C Press: reviews (17, 17)

D Religion (17, 17)

E Skills, trades and hobbies (36, 38)

F Popular lore (48, 44)

G Belles lettres, biography, essays (75, 77)

H Miscellaneous (documents, reports, etc.) (30, 30)

J Learned and scientific writings (80, 80)

K General fiction (29, 29)

L Mystery and detective fiction (24, 24)

M Science fiction (6, 6)

N Adventure and western fiction (29, 29)

P Romance and love story (29, 29)

R Humour (9, 9)
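The chunk counts listed above can be checked against the stated totals of five hundred chunks and roughly 1,000,000 words per corpus. The Python sketch below simply copies the figures from the list; the names GENRES and WORDS_PER_CHUNK are illustrative, and the 2,000-word chunk size is the approximate figure given in the text.

```python
# Letter code -> (description, chunks in Brown, chunks in LOB), copied from the list above.
GENRES = {
    "A": ("Press: reportage", 44, 44),
    "B": ("Press: editorial", 27, 27),
    "C": ("Press: reviews", 17, 17),
    "D": ("Religion", 17, 17),
    "E": ("Skills, trades and hobbies", 36, 38),
    "F": ("Popular lore", 48, 44),
    "G": ("Belles lettres, biography, essays", 75, 77),
    "H": ("Miscellaneous (documents, reports, etc.)", 30, 30),
    "J": ("Learned and scientific writings", 80, 80),
    "K": ("General fiction", 29, 29),
    "L": ("Mystery and detective fiction", 24, 24),
    "M": ("Science fiction", 6, 6),
    "N": ("Adventure and western fiction", 29, 29),
    "P": ("Romance and love story", 29, 29),
    "R": ("Humour", 9, 9),
}
WORDS_PER_CHUNK = 2000  # approximate chunk size stated in the text

brown_total = sum(brown for _, brown, _ in GENRES.values())
lob_total = sum(lob for _, _, lob in GENRES.values())
differing = [code for code, (_, brown, lob) in GENRES.items() if brown != lob]

print("Brown chunks:", brown_total, "LOB chunks:", lob_total)
print("Approximate words per corpus:", brown_total * WORDS_PER_CHUNK)
print("Categories with different chunk counts:", differing)
```

Only categories E, F and G differ between the two corpora, which is why they can be compared genre by genre with little adjustment.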
