WordNet for Russian. Russian thesauri: what do we have and what do we need? Moderator: Pavel Braslavski

Russian wordnets: what do we have and what do we need?

AINL 2014

12.09.2014

Questions

• What do we have?

• What do we need?

• Who are the users?

• What can be done?

YARN

• Open project

• Crowdsourcing

http://russianword.net

Morphological annotation in OpenCorpora:

Who annotates? How fast? With what motivation?

[Charts not preserved. Recoverable axis labels: number of examples (horizontal axis, 100 to 100 000) and disagreement rate (vertical axis, 0% to 40%).]

Participants

#  Participant  Answers   % disagreements  % errors
1  Lvova        359 367   4.1%             0.7%
2  Nofenigma    109 663   4.2%             1.8%
3  Мяу          103 877   2.2%             0.6%
4  Rave          83 522   3.5%             0.6%
5  quorax        38 757   4.3%             0.6%

Round table on Russian thesauri @ AINL'2014

Iryna Gurevych

Language Resources: Status Quo

• Huge variety of resources, cf. LRE Map http://www.resourcebook.eu/

• Different size, language, quality

• Different origins: experts, user-generated, automatic

• Different purposes

Observation

• Requirements for the resource depend on the application

– Some applications need just flat frequency counts, such as Google N-Grams, but at large scale

– Some applications need rich and accurate semantic representations in narrow domains

Consequence

• Language resources should be:

– Large-scale, flexibly configurable and standardized

– Linked with corpus information and world knowledge

– Annotated with confidence and quality information when the data come from non-experts or automatic methods

Vision: One-Stop Resource

• High-quality backbone store of lexical-semantic and world knowledge by experts

– Enriched by a large community of users

– Enriched by the results of text analysis methods

– Standardized and continuously monitored for quality

• Pay-per-use business model: the input is a set of requirements, the output is a language resource correctly configured for your particular application

Universal Dictionary of Concepts

● A dictionary of UNL (Universal Networking Language), an artificial interlingua.

● The basic lexical units of UNL are concepts, which correspond to lexical senses. Each concept has a unique identifier called a "Universal Word" (UW)

– UWs are linked to words and expressions of natural languages

– UWs have semantic links to other UWs

– UWs are linked to ontology classes.

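[Diagram: the UW word(icl>information,pof>sentence) sits in the UNL layer. It is linked to natural-language words (слово, word, từ, mot), to the ontology class ContentBearingObject, and to the UWs "sentence" and "information".]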
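The UW notation shown above, `word(icl>information,pof>sentence)`, combines a headword with a list of labelled links to other UWs. The following is a minimal sketch of how such a string could be split into parts. It assumes only the flat form seen on the slide (no nested parentheses), and the class and function names (`UniversalWord`, `parse_uw`) are hypothetical, not part of any UNL toolkit. In UNL, `icl` is commonly read as "included in class" and `pof` as "part of".

```python
import re
from dataclasses import dataclass, field


@dataclass
class UniversalWord:
    """Hypothetical container for one parsed UW string."""
    headword: str
    # relation label -> target headwords, e.g. {"icl": ["information"]}
    relations: dict[str, list[str]] = field(default_factory=dict)


# headword, optionally followed by "(label>target,label>target,...)"
UW_PATTERN = re.compile(r"^(?P<head>[^()]+?)(?:\((?P<body>.*)\))?$")


def parse_uw(text: str) -> UniversalWord:
    """Parse a flat UW string; nested restrictions are not supported."""
    match = UW_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a UW-like string: {text!r}")
    uw = UniversalWord(headword=match.group("head").strip())
    body = match.group("body")
    if body:
        for part in body.split(","):
            label, sep, target = part.partition(">")
            if not sep:
                raise ValueError(f"restriction without '>': {part!r}")
            uw.relations.setdefault(label.strip(), []).append(target.strip())
    return uw


uw = parse_uw("word(icl>information,pof>sentence)")
print(uw.headword)   # word
print(uw.relations)  # {'icl': ['information'], 'pof': ['sentence']}
```

Keeping the relations in a dictionary keyed by label makes it easy to follow one link type at a time, for example all `icl` links when walking up a hierarchy.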

Current status

● Universal Words are sorted into 3 groups: general lexicon, terminology, named entities.

● Some of the data are already available for download.

Part             Number of UWs  Status
General lexicon  86410          Downloadable
Terminology      688617         Under development
Named entities   2109240        Soon to be released

Natural Languages

Language    General               Terms   Names    Total      Quality
English     82550 (41761 words)   688617  2109240  2 880 407  *****
Russian     55046 (34236 words)   688613  226595   970 254    **** Proofreading
French      37034 (25626 words)   103060  367888   507 982    *** Autoranking
Hindi       27813 (30219 words)   0       10823    38 635     *** Auto
Spanish     11758 (6983 words)    21990   298674   332 422    ** Experimental
Malay       21861 (17457 words)   0       46044    67 905     ** Experimental
Vietnamese  5927 (6456 words)     0       171367   177 294    *** Experimental

The English dictionary is based on a subset of Princeton WordNet

Russian dictionary is being proofread and extended (work in progress)

Links between UWs and other languages are ranked automatically

– Ranking is based on the number of sources confirming translations that can be deduced from the UNL dictionary and on the amount of manual proofreading.

– Russian has reached 93.4% of the English WordNet quality level.

Number of UWs linked to NL words and expressions

Links to other resources

[Diagram: the Universal Dictionary of Concepts and the resources it connects]

• UW lists: General, Terms, Names

• Local dictionaries: EN, FR, HI, SP, MS, RU, VI

• Semantic data: semantic network, ontology (SUMO), domain ontologies, DBpedia, wordnets

• MT systems that support UNL: ETAP-3, Ariane, CFILT

Data files

● Files available at

https://github.com/dikonov/Universal-Dictionary-of-Concepts

● Formats

– CSV

– XML (pivax, xdxf) for various dictionary shells

– LMF, RDF/Turtle (Open Linked Data) planned

● Free to use under GPLv3+ or Creative Commons CC-BY-SA / CC-BY-SA-NC

Questions?

A sample search using goldendict dictionary shell...

• RuThes-lite

– 96

– http://www.labinform.ru/pub/ruthes/index.htm

– xml-

• ACL 2014 (Bansal et al.): Structured Learning for Taxonomy Induction with Belief Propagation.

– 761 WordNet [...]: F-[...] = 54.8%

– 700 [...] = 66.6%

Russian Wiktionary as a source of semantic information

Alexander Silonov

ru.wiktionary.org

The YARN thesaurus: an outside view

Elena Treshcheva

Natalia Stepanova

Saratov State University named after N. G. Chernyshevsky

WordNet-like thesauri:

• Applied tasks (AI, IR, MT, etc.)

• Theoretical linguistics: modeling the lexical-semantic level of a given language

Requirements for semantic networks:

• Consistency with the system of categorial relations in the world

• Consistency with the nature of the language being modeled (in particular, treating the lexical system of a language not as a collection of words and their definitions, but as a hierarchical system whose elements differ in status)

• Accounting for the functional "inequality" of words in the language (and therefore within a synset):

o Nature of the semantics (presence / absence of additional meanings)

o Evaluative / expressive coloring

o Functional style

o Sphere of communication

o Frequency

o Distributional properties

Example:

Example:

• A synset is a lexicalized concept that enters into semantic relations

• Relations between words such as "хлеб" (bread) and "папка" (a child-speech word for bread) are not systemic; they hold between words as dictionary units

Solutions?

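1. "Complicated" quasi-synonymy → lexical relations:

Before: a single synset {хлеб, булка, папа, папка} (bread, bun, and two child-speech words for bread) placed under the synset {еда, пища, продовольствие, …} (food, nourishment, provisions)

After: the synset {хлеб, булка} stays under {еда, пища, продовольствие, …}, and the words папа, папка are attached to it separately. The two kinds of links in the diagram are labelled S (between synsets) and W (between words)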

2. Word order within a synset:

• Meaningful word order within a synset (from the most frequent and semantically neutral words to rare / stylistically marked words / words used only in restricted spheres of communication)
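The ordering proposal above can be sketched in a few lines. This is only an illustration under stated assumptions: the `SynsetMember` class, the `marked` flag, and the frequency values are all hypothetical placeholders, not YARN data or part of the YARN data model. Neutral words come first, then more frequent words before rarer ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetMember:
    """Hypothetical record for one word in a synset."""
    word: str
    frequency: float      # made-up relative frequency; higher means more common
    marked: bool = False  # True for stylistically marked or restricted words


def order_synset(members):
    """Neutral words first, then by descending frequency, then alphabetically."""
    return sorted(members, key=lambda m: (m.marked, -m.frequency, m.word))


# Frequencies below are invented for illustration only.
synset = [
    SynsetMember("папка", 0.5, marked=True),
    SynsetMember("булка", 20.0),
    SynsetMember("папа", 1.0, marked=True),
    SynsetMember("хлеб", 100.0),
]

ordered = [m.word for m in order_synset(synset)]
print(ordered)  # ['хлеб', 'булка', 'папа', 'папка']
```

A single boolean is the simplest possible stand-in for the stylistic, functional and communicative labels listed earlier; a fuller model would likely keep those labels separately and derive the order from them.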

One more question:

Domains in YARN

o different categorial levels

o heterogeneity (clearly delineated thematic groups vs names of specific objects)

o uneven level of detail across subject areas

Suggestions for the interface:

• Show the index of the lexical-semantic variant (word sense) of a lexeme in the interface

Cf.: БОЙ 1, БОЙ 2.1, БОЙ 2.2 ...

• Tie each "word + definition" pair to a specific synset

The YARN thesaurus as a resource

• Practical value

• Scientific and theoretical value

+

• Academic value (good lexicographic practice)

Thank you for your attention!
