Types and Tokens Distribution in TITUS (Distribution of Word Forms in the TITUS Corpus)
Dr. Svetlana Ahlborn, Institut für Empirische Sprachwissenschaft, Universität


Distribution of Types and Tokens in TITUS Resources: Creation and Application


Dr. Svetlana Ahlborn
Institut für Empirische Sprachwissenschaft, Universität Frankfurt am Main
E-Mail: l.ahlborn@em.uni-frankfurt.de

Types- und Tokens-Verteilung in TITUS-Ressourcen (Distribution of Types and Tokens in TITUS Resources), 25.06.2013
Tokens and Types Distribution in TITUS, 26.06.2013

Outline

  • TITUS Resource Data
  • Peculiarities of TITUS texts
  • Tokens and Types calculation in TITUS Resources
  • Metadata for Tokens and Types distribution

TITUS Resource Data

TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien)

http://titus.uni-frankfurt.de


A token is a concrete occurrence of a linguistic unit; a type bundles the tokens that belong together.
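To make the distinction concrete, here is a minimal Perl sketch that counts tokens and types in a short string. It rests on assumptions that are not from the slides: a token is taken to be a whitespace-separated word form, tokens are grouped into types by their lowercased spelling, and the helper name `count_tokens_and_types` is hypothetical.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not from the slides): counts tokens and types,
# treating each whitespace-separated, lowercased word form as one token.
sub count_tokens_and_types {
    my ($text) = @_;
    my %freq;                               # type => number of its tokens
    my $tokens = 0;
    for my $word (split ' ', lc $text) {
        $word =~ s/^\P{L}+|\P{L}+$//g;      # trim non-letters at both edges
        next if $word eq '';                # skip items without any letters
        $freq{$word}++;
        $tokens++;
    }
    return ($tokens, scalar(keys %freq), \%freq);
}

my ($tokens, $types, $freq) =
    count_tokens_and_types('the cat saw the dog and the bird');
print "tokens=$tokens types=$types the=$freq->{the}\n";
# prints: tokens=8 types=6 the=3
```

The sample sentence has eight tokens but only six types, because the three occurrences of "the" are bundled into a single type.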

TITUS currently includes 660 texts in 55 languages, with more than 30 million tokens.

TITUS Data


http://www.clarin.eu/node/1512
Added by J. Gippert, R. Mittmann

TITUS Search Engine

The TITUS Search Engine does not determine the number of tokens in a given text, but the number of quotations of the word.

Peculiarities of TITUS texts: Gothic

Biblia Gothica contains additional parallel passages in Latin and Greek.

Biblia Gothica (http://titus.uni-frankfurt.de/texte/etcs/germ/got/gotnt/gotnt.htm).

Peculiarities of TITUS texts: Old Church Slavonic

Old Church Slavonic texts are represented in two ways: in the original Glagolitic alphabet and in a Cyrillic version.

Codex Marianus (http://titus.uni-frankfurt.de/texte/etcs/slav/aksl/marianus/maria.htm).
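Parallel representations of this kind matter for counting: if both script versions of a text were tokenized together, each word form would presumably be counted twice. The following Perl sketch shows one possible way to keep the versions apart using Unicode script properties. It is an illustration under assumptions, not the procedure used for TITUS: the helper `script_of` is hypothetical, and the sample line is made up from arbitrary letters of each Unicode block rather than quoted from Codex Marianus.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: assign a token to a script by inspecting its letters.
sub script_of {
    my ($token) = @_;
    return 'Glagolitic' if $token =~ /\p{Script=Glagolitic}/;
    return 'Cyrillic'   if $token =~ /\p{Script=Cyrillic}/;
    return 'other';
}

# Made-up sample line (illustration only): one token built from Glagolitic
# code points, one token built from Cyrillic code points.
my $line = "\x{2C31}\x{2C41}\x{2C33}\x{2C4F} \x{0431}\x{043E}\x{0433}\x{044A}";

my %tokens_by_script;
$tokens_by_script{ script_of($_) }++ for split ' ', $line;

print "$_: $tokens_by_script{$_}\n" for sort keys %tokens_by_script;
```

With per-script counts available, token and type statistics can be reported for one representation at a time instead of for the mixed file.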

Peculiarities of TITUS texts: Old Polish

Old Polish texts display, side by side, editions that originated at different times.

Kazania Świętokrzyskie (http://titus.uni-frankfurt.de/texte/etcs/slav/apoln/kazania/kazan.htm).

Peculiarities of TITUS texts: Ossetian

The Ossetian Nart epic is represented in Latin script and in extended Cyrillic.

Ossetian: Nart epic (http://titus.uni-frankfurt.de/texte/etcs/iran/niran/oss/nart/nart.htm).

Peculiarities of TITUS texts: Russian-Low German

Tönnies Fenne's Manual (17th century) contains at least 9 different languages or language variants.

Peculiarities of TITUS texts: Old Prussian

The Old Prussian corpus consists of at least 21 different languages or language variants (including Old Prussian, Old Lithuanian, Latin, Gothic, Old Low German, Old High German).

Creation

A digitized source consists not only of words in the source language; it also contains various information that does not originally belong to the document: numbers, tags, punctuation marks, edition information, etc.
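A clean-up step of this kind can be sketched in Perl with a few substitutions. This is a simplified illustration, not the script used for TITUS: the routine name `clean_line`, the angle-bracket tag syntax, and the sample line are all assumptions, and edition information usually needs source-specific rules that are not shown here.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical clean-up routine; the <...> tag syntax is an assumption.
sub clean_line {
    my ($zeile) = @_;
    $zeile =~ s/<[^>]*>/ /g;             # drop angle-bracket tags
    $zeile =~ s/\d+/ /g;                 # drop numbers (e.g. line or verse numbers)
    $zeile =~ s/[^\p{L}\p{M}\s]+/ /g;    # drop punctuation and symbols,
                                         # keep letters and combining marks
    $zeile =~ s/\s+/ /g;                 # collapse runs of whitespace
    $zeile =~ s/^ | $//g;                # trim leading and trailing space
    return $zeile;
}

# Made-up sample line (not taken from a TITUS text).
print clean_line('12 <pb n="3"> In principio, erat verbum.'), "\n";
# prints: In principio erat verbum
```

Only after such filtering does splitting on whitespace yield word forms of the source language that can be counted as tokens.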

$zeile =~ s/\d*\s+\x{003C}\x86\x87\x84\x{003E}//gi; # remove an optional number, whitespace, and the tag <\x86\x87\x84>

$zeile =~ s/\d*\s+