Principles of organizing a common morphological tagset and a search engine for PolUKR (Polish-Ukrainian Parallel Corpus) Польсько-Український паралельний корпус Polsko-Ukraiński Korpus Równoległy http://corpus.domeczek.pl Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences Olga Shypnivska, ULIF, Ukrainian Academy of Sciences Magdalena Turska, Warsaw University


Page 1: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Principles of organizing a common morphological tagset and

a search engine for PolUKR (Polish-Ukrainian Parallel Corpus)

Польсько-Український паралельний корпус Polsko-Ukraiński Korpus Równoległy

http://corpus.domeczek.pl

Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Olga Shypnivska, ULIF, Ukrainian Academy of Sciences

Magdalena Turska, Warsaw University

Page 2: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Main objectives and expected applications

• at least 3 million tokens; representative
• sentence-level alignment
• morphological annotation with a common tagset
• public access; user-friendly

• linguistic material for
  – (independent) language learning
  – bilingual dictionaries
  – research on grammar and lexis

• translation memory for humans and machines

Page 6: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Statistics (prototype version)

             total        Polish part   Ukrainian part
texts        70           35            35
tokens       359 926      179 087       180 120
characters   3 863 564    1 449 376     2 407 034
KB           3941         1492          2439

Page 8: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Search (present)

• based on Perl regular expressions
• any search string has to be enclosed in "/", e.g. /Холодна війна/ (Cold War)
• special characters:
  |    alternative
  ( )  beginning and end of a subchain (group)
  [ ]  beginning and end of a defined character class
  ?    1 or 0 occurrences
  *    0 or more occurrences
  +    1 or more occurrences
  \s   any whitespace character
  \w   any letter, digit or underscore
  \b   word boundary
  \    escape

Page 9: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Examples of search formulae

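/jako/ finds „jako” anywhere in a word

/jako\s/ finds „jako, niejako, dwojako” (words ending in „jako”)

/\bjako/ also finds „jakość” (words beginning with „jako”)

/norma\./ finds „norma” before a full stop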
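The four formulae above can be tried out with a small sketch. It uses Python's `re` module as a stand-in for the Perl regular expressions of the actual search engine (the constructs used here behave the same way in both); the helper `containing_words` and the sample line are hypothetical and are not part of PolUKR.

```python
import re

def containing_words(pattern, text):
    """Return the whitespace-delimited word in which each match of `pattern` starts."""
    words = []
    for match in re.finditer(pattern, text):
        start = end = match.start()
        # Walk left and right from the match start to the nearest whitespace.
        while start > 0 and not text[start - 1].isspace():
            start -= 1
        while end < len(text) and not text[end].isspace():
            end += 1
        words.append(text[start:end])
    return words

# Hypothetical sample line, not taken from the corpus.
sample = "jako niejako dwojako jakość norma. norma "

for pattern in (r"jako", r"jako\s", r"\bjako", r"norma\."):
    print(f"/{pattern}/ ->", containing_words(pattern, sample))
```

Reporting the whole containing word rather than the bare match makes the effect of `\s` and `\b` visible: the first restricts hits to words that end in the string, the second to words that begin with it.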

Page 10: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Sources of morphological information

• Polish: IPI PAN corpus + …

• Ukrainian:

- grammatical dictionary by ULIF, UAS (Igor Shevchenko) lemma <> wordform

- morphological analyzer (information is slightly different, built for homonymy disambiguation)

- no lemmatization (so far)

Page 13: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Types of tagsets

SYMBOLS: all possible grammatical characteristics of a wordform are encoded in one symbol (English (BNC), Ukrainian)

- takes little machine memory but requires too much human memory

CHAINS: contain codes corresponding to particular grammatical categories and/or their values; the morphological characteristics of a wordform are represented by a sequence of such codes

- can be even more economical than symbols if a query concerns morphological categories shared by several lexico-grammatical classes

• positional (Czech): every category (and its values) has a fixed position in a chain

• flexemic (Polish, Russian): every category has its own subtagset
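The difference between the two kinds of chain can be sketched as follows. The slot list and the class-to-category table are simplified assumptions made for illustration (only the first five positions of a Czech-style positional tag, and two classes written in the colon-separated style of the IPI PAN tagset); they are not the full official specifications.

```python
# Illustrative, simplified schemas; NOT the complete official tagsets.
POSITIONAL_SLOTS = ["pos", "subpos", "gender", "number", "case"]

def decode_positional(tag):
    """Positional chain: the meaning of a character depends on its index."""
    return {slot: ch for slot, ch in zip(POSITIONAL_SLOTS, tag) if ch != "-"}

# Chain whose first code names the class; the class decides which
# categories the remaining codes stand for (assumed table).
CLASS_CATEGORIES = {
    "subst": ["number", "case", "gender"],
    "fin": ["number", "person", "aspect"],
}

def decode_chain(tag):
    """Colon-separated chain: codes are interpreted relative to the class."""
    cls, *values = tag.split(":")
    return {"class": cls, **dict(zip(CLASS_CATEGORIES[cls], values))}

print(decode_positional("NNFS1"))
print(decode_chain("subst:sg:nom:f"))
```

In the positional variant an unused category still occupies its slot (written `-` here), whereas in the second variant a class simply lists fewer codes.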

Page 14: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Multext-East tagset for En Ro Sl Cz Bg Et Hu Hr Sr Re

• chain-like; criticised
• 14 PoS: N10, V15, A12, P(ron)17, Det10, T(he)6, adveRb6, S(adposition)4, C(onj)7, nuMeral12, Intjn2, X(residual), Yabbr5, Qparticle3
• only Bg and Hu do not have modal verbs and copulas
• En Ro have determiners, Ro Hu Re have articles, Bg has neither (analytism, segmentation)
• Is a Bg noun formally indefinite if the article is attached to the adj? (cf. agglutinativity of Pl być)
• negation as a morphological category
• Cz transgressives (adverbial participles)

Page 15: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Treatment of participles

• Polish (no aspectual characteristics)

(Here and below cited from: Adam Przepiórkowski and Marcin Woliński, A Flexemic Tagset for Polish.)

• Ukrainian (aspect and tense)

Verb, adverbial participle, perfective aspect, past tense, active voice

VW прочитавши (having read)

Verb, adverbial participle, imperfective aspect, present tense, active voice

UQ читаючи (reading)

(Here and below cited from: Широков В.А. et al. Корпусна лінгвістика.)

• PolUKR: participle I (doing/having done) characterised by aspect

Page 16: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Treatment of pronouns

• notorious Slavonic pronoun problem: 296 unique tags for 309 pronouns
• Polish: division into 1-2 p, 3 p and siebie (ów, jak?)
• Ukrainian: pro-noun, pro-adjective
• Russian: also pro-predicative and pro-adverb
• Czech: many subcategories on the level of SubPoS
• PolUKR: Ua approach and Pl division into 1-2 and 3 person

Page 17: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Treatment of predicatives

• Polish: adverbs with modal semantics like można, trzeba "(it is) allowed / one can", "(it is) necessary"; ?to

• Ukrainian (code X0): includes adverbs of state like жарко, шкода, жаль "(it is) hot", "(it is) a pity"

• PolUKR: the category is moved from the morphological level to the semantic one

Page 19: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

http://www.ruscorpora.ru/search-main.html

Page 22: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Search engine for PolUKR

• choose the direction of the search (Ua>Pl or Pl>Ua)
• search conditions for both languages (RvonW)
• 3 levels of search:
  - exact form
  - (lemma) with the morphological choice
  - using Poliqarp-like tag formulas (for advanced users)
• idea of subcategories (either a POS or a SUBPOS can be selected, but not both; similarly, one cannot select all subcategories of a POS), cf. aliases in IPI PAN corpus
• alternatives are offered through tick boxes, so that one can choose EITHER „VERB finite past” OR „NOUN dative neuter” OR something else, etc.
• restrictions on choice within 1 of 10 POS
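The tick-box alternatives amount to a disjunction of conjunctions: a token is a hit if it satisfies every condition of at least one ticked alternative. A minimal sketch of that matching logic follows; the feature names (`pos`, `form`, `tense`, `case`, `gender`) and the dictionary representation are assumptions made for illustration, not the engine's actual data model.

```python
def matches(token_features, alternatives):
    """True if the token satisfies all conditions of at least one alternative.

    token_features: dict of category -> value for one token (hypothetical format).
    alternatives:   list of dicts, each a ticked combination of category -> value.
    """
    return any(
        all(token_features.get(cat) == val for cat, val in alt.items())
        for alt in alternatives
    )

# EITHER "VERB finite past" OR "NOUN dative neuter"
selection = [
    {"pos": "VERB", "form": "finite", "tense": "past"},
    {"pos": "NOUN", "case": "dative", "gender": "neuter"},
]

token = {"pos": "NOUN", "case": "dative", "gender": "neuter", "number": "singular"}
print(matches(token, selection))
```

Categories that an alternative does not mention (here `number`) are left unconstrained, which corresponds to leaving the matching boxes unticked.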

Page 23: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

VERB: infinitive; participle I; non-finite form; finite form
    aspect: perfective, imperfective
    mood: imperative, indicative
    person: first, second, third
    tense: present, future, past
    gender: masculine, feminine, neuter
    number: singular, plural

NOUN: general; proper name; pro-noun 1-2 person; pro-noun 3 person
    case: nominative, genitive, dative, accusative, instrumental, locative, vocative
    gender: masculine, feminine, neuter, pluralia tantum
    number: singular, plural

ADJECTIVAL: adjective, participle I and cardinal numeral; pro-adjective; indeclinable adjective
    case: nominative, genitive, dative, accusative, instrumental, locative
    gender: masculine, feminine, neuter
    number: singular, plural

NUMERAL: genderic; non-genderic
    case: nominative, genitive, dative, accusative, instrumental, locative
    gender: masculine, feminine, neuter

ADVERB, PARTICLE, PREPOSITION, CONJUNCTION, INTERJECTION

Page 24: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Built-in restrictions on search

Category that can be selected (active)    Only if the user has selected the following category/value(s)

mood OR tense OR person                   finite form

gender                                    (finite form AND past tense) OR (adjective AND singular number) OR (pro-adjective AND singular number) OR (pro-noun 1-2 person AND singular number) OR (pro-noun 3 person AND singular number) OR (NUMERAL genderic AND singular number)

gender: pluralia tantum                   NOUN general OR NOUN proper name

case: vocative                            NOUN general OR NOUN proper name

none                                      indeclinable adjective OR ADVERB OR PARTICLE OR PREPOSITION OR CONJUNCTION OR INTERJECTION
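The restrictions listed above could be expressed as a function from the user's current selection to the set of restricted categories that become active. This is only a sketch: it assumes that AND binds more tightly than OR in the rules, represents the selection as a set of label strings, and covers only the categories named on this slide. The function and constant names are hypothetical.

```python
# Labels follow the slide; "NUMERAL genderic" renders "ЧИСЛІВНИК родовий".
SINGULAR_HOSTS = {"adjective", "pro-adjective", "pro-noun 1-2 person",
                  "pro-noun 3 person", "NUMERAL genderic"}
NOUN_CLASSES = {"NOUN general", "NOUN proper name"}
NO_CATEGORY_CLASSES = {"indeclinable adjective", "ADVERB", "PARTICLE",
                       "PREPOSITION", "CONJUNCTION", "INTERJECTION"}

def active_categories(selected):
    """Return the restricted categories that the current selection activates."""
    selected = set(selected)
    if selected & NO_CATEGORY_CLASSES:
        return set()  # these classes take no further categories
    active = set()
    if "finite form" in selected:
        active |= {"mood", "tense", "person"}
    # Assumed precedence: (finite AND past) OR (host class AND singular).
    if ("finite form" in selected and "past tense" in selected) or (
        "singular number" in selected and selected & SINGULAR_HOSTS
    ):
        active.add("gender")
    if selected & NOUN_CLASSES:
        active |= {"gender: pluralia tantum", "case: vocative"}
    return active

print(active_categories({"finite form", "past tense"}))
```

A search form could call such a function after every click and enable or grey out the corresponding tick boxes accordingly.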

Page 25: Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Literature

• INTERA unified tagset project. www.elda.org/intera

• Tomaž Erjavec et al. Multext-East specifications for Slavic languages. Budapest, 2003.

• Jan Hajič. Positional Tags: Quick Reference (Czech „HM” Morphology), 2000.

• Adam Przepiórkowski and Marcin Woliński. A Flexemic Tagset for Polish. In: Proceedings of the Workshop on Morphological Processing of Slavic Languages, EACL 2003. http://nlp.ipipan.waw.pl/~adamp/Papers/2003-eacl-ws12/ws12.pdf

• Elena Paskaleva. Balkan South-East Corpora Aligned to English. In: Proceedings of the Workshop on Common Natural Language Processing Paradigm for Balkan Languages, EACL 2003.

• Широков В.А. et al. Корпусна лінгвістика [Corpus Linguistics]. Київ: Довіра, 2005.