Upload
kylynn-torres
View
56
Download
7
Embed Size (px)
DESCRIPTION
Методы разрешения (семантической) неоднозначности. Обзор методов NLP , IR и Data Mining (методов извлечения инфомации / знаний). -1. Некоторые замечания о методах NLP, IR и Data Mining. СТАНДАРТНЫЕ МАТЕМАТИЧЕСКИЕ МЕТОДЫ и МОДЕЛИ + ЛИНГВИСТИЧЕСКИЕ KNOW HOW Математические основания: - PowerPoint PPT Presentation
Citation preview
МЕТОДЫ РАЗРЕШЕНИЯ (СЕМАНТИЧЕСКОЙ) НЕОДНОЗНАЧНОСТИ
Обзор методов NLP, IR и Data Mining (методов извлечения инфомации/знаний)
-1. НЕКОТОРЫЕ ЗАМЕЧАНИЯ О МЕТОДАХ NLP, IR И DATA MINING
СТАНДАРТНЫЕ МАТЕМАТИЧЕСКИЕ МЕТОДЫ и МОДЕЛИ + ЛИНГВИСТИЧЕСКИЕ KNOW HOW
Математические основания:
Стандартные оценки вероятностей + условная вероятность
Метод максимального правдоподобия + наивная Байесовская модель
Стандартные методы проверки гипотез (хи-квадрат, t-статистика и т.п.)
Признаковое пространство + квантитативная оценка весов + метрика
Кластеризация и классификация Алгоритмы самообучения (методы распознавания образов):
Модель языка (скрытые марковские модели) Энтропийные модели
Классификаторы (метод максимальной энтропии, наивная байесовская классификация, скрытые марковские модели, деревья решений, bootstrapping, нейронные сети, генетические алгоритмы и т.п.)
ПРЕДВАРИТЕЛЬНАЯ ПОСТАНОВКА ЗАДАЧИ Проблема 1: омонимия и многозначность при автоматической
обработке текста
Проблема 2.: группировка лексики, автоматическое создание нужных лексикографических ресурсов, например, тезаурусов.
Семантическая неоднозначность (технический термин - семантическая многозначность):
Bank vs. bank (см. предыдущую лекцию) Он нашел возможность vs. Он нашел квартиру Язык: естественный язык vs. Говяжий язык
Используется в: (а) семантическая разметка корпусов (б) прикладные задачи информационного поиска: простой
поиск по запросу пользователя, автоматическая классификация текстов
(в) автоматические системы перевода (г) вопросно-ответные системы (д) извлечение знаний из текстов (омонимия
именованных сущностей NER) и т.п.
WSD — word sense disambiguation
На какие вопросы надо ответить, чтобы решить задачу
На что опираться, чтобы различить разные смыслы лексемы в тексте
Лингвистические основания
Значения: словари, тезаурусы, WordNet Источники:
(а) значение в словаре(б) тезаурусный класс(в) синонимический ряд в словаре
синонимов(г) synset в WordNet(д) Wikipedia
класс контекстных слов ((а) – (г) источники для выделения класса контекстных слов)задача для каждого из "значений" определить "класс эквивалентности"
Алгоритмы самообучения
Задача: найти множество признаков контекста, которые бы максимально точно определяли данное значение слова (значение X в контексте С (с1 …. ст) было бы максимально вероятно)
Можно приписать в лексикографическом источнике
Можно извлечь в процессе самообучения: (а) из размеченного корпуса (supervised learning) (б) из неразмеченного корпуса (unsupervised) Классификатор Кластеризация для контекстов
Выбор признаков (все, что только можно) Контекстные слова Контестные POS тэги Место контекстного слова
ЗАДАЧИ ОБУЧЕНИЯ:
Дано: "мешок" признаков Выход: максимально "близкая" к
действительности модель (такие параметры модели, которые бы давали максимальную вероятность того, что данная модель порождает то, что мы наблюдаем) – веса параметров, правила классификации и т.п.
ПРИМЕР
1. МЕТОДЫ, ОСНОВАННЫЕ НА ЗНАНИЯХ (KNOWLEDGE-BASED METHODS)
1.1. МЕТОДЫ КОНТЕКСТНОГО ПЕРЕСЕЧЕНИЯ
11
OVERLAP BASED APPROACHES
Require a Machine Readable Dictionary (MRD).
Find the overlap between the features of different senses of an ambiguous word (sense bag) and the features of the words in its context (context bag).
These features could be sense definitions, example sentences, hypernyms etc.
The features could also be given weights.
The sense which has the maximum overlap is selected as the contextually appropriate sense. 11
CFILT
- IITB
12
LESK’S ALGORITHM
CFILT
- IITB
12
Sense 1Trees of the olive family with pinnate leaves, thin furrowed bark and gray branches.
Sense 2The solid residue left when combustible material is thoroughly burned or oxidized.
Sense 3To convert into ash
Sense 1A piece of glowing carbon or burnt wood.
Sense 2charcoal.
Sense 3A black solid combustible substance formed by the partial decomposition of vegetable matter without free access to air and under the influence of moisture and often increased pressure and temperature that is widely used as a fuel for burning
Ash Coal
Sense Bag: contains the words in the definition of a candidate sense of the ambiguous word.
Context Bag: contains the words in the definition of each sense of each context word.
E.g. “On burning coal we get ash.”
In this case Sense 2 of ash would be the winner sense.
13
WALKER’S ALGORITHM A Thesaurus Based approach. Step 1: For each sense of the target word find the thesaurus
category to which that sense belongs. Step 2: Calculate the score for each sense by using the context
words. A context words will add 1 to the score of the sense if the thesaurus category of the word matches that of the sense.
E.g. The money in this bank fetches an interest of 8% per annum
Target word: bank Clue words from the context: money, interest, annum, fetch
Sense1: Finance Sense2: Location
Money +1 0
Interest +1 0
Fetch 0 0
Annum +1 0
Total 3 0
Context wordsadd 1 to thesense when the topic of theword matches thatof the sense
14
WSD USING CONCEPTUAL DENSITY
Select a sense based on the relatedness of that word-sense to the context.
Relatedness is measured in terms of conceptual distance (i.e. how close the concept represented by the word and the
concept represented by its context words are)
This approach uses a structured hierarchical semantic net (WordNet) for finding the conceptual distance.
Smaller the conceptual distance higher will be the conceptual density. (i.e. if all words in the context are strong indicators of a particular
concept then that concept will have a higher density.)
14
CFILT
- IITB
15
CONCEPTUAL DENSITY (EXAMPLE)
15
CFILT
- IITB
The dots in the figure represent the senses of the word to be disambiguated or the senses of the words in context.
The CD formula will yield highest density for the sub-hierarchy containing more senses.
The sense of W contained in the sub-hierarchy with the highest CD will be chosen.
16
The jury(2) praised the administration(3) and operation (8) of Atlanta Police Department(1)
Step 1: Make a lattice of the nouns in the context, their senses and hypernyms.
Step 2: Compute the conceptual density of resultant concepts (sub-hierarchies).
Step 3: The concept with highest CD is selected.
Step 4: Select the senses below the selected concept as the correct sense for the respective words.
operation
division
administrative_unit
jury
committee
police department
local department
government department
department
jury administration
body
CD = 0.256
CD = 0.062
16
CFILT
- IITB
CONCEPTUAL DENSITY (EXAMPLE)
17
WSD USING RANDOM WALK ALGORITHM
S3
Bell ring church Sunday
S2
S1
S3
S2
S1
S3
S2
S1 S1
a
c
b
e
f
g
hi
j
k
l
0.46
a
0.49
0.92
0.97
0.35
0.56
0.42
0.63
0.580.67
Step 1: Add a vertex for each possible sense of each
word in the text.
Step 2: Add weighted edges using definition based semantic similarity (Lesk’s method).
Step 3: Apply graph based ranking algorithm to find score of each vertex (i.e. for each word sense).
Step 4: Select the vertex (sense) which has the highest score.
17
CFILT
- IITB
KB APPROACHES – COMPARISONS
18
CFILT
- IITB
Algorithm Accuracy
WSD using Selectional Restrictions 44% on Brown Corpus
Lesk’s algorithm 50-60% on short samples of “Pride and Prejudice” and some “news stories”.
WSD using conceptual density 54% on Brown corpus.
WSD using Random Walk Algorithms 54% accuracy on SEMCOR corpus which has a baseline accuracy of 37%.
Walker’s algorithm 50% when tested on 10 highly polysemous English words.
KB APPROACHES –CONCLUSIONS
19
CFILT
- IITB
Drawbacks of WSD using Selectional Restrictions Needs exhaustive Knowledge Base.
Drawbacks of Overlap based approaches Dictionary definitions are generally very small. Dictionary entries rarely take into account the
distributional constraints of different word senses (e.g. selectional preferences, kinds of prepositions, etc. cigarette and ash never co-occur in a dictionary).
Suffer from the problem of sparse match. Proper nouns are not present in a MRD. Hence these
approaches fail to capture the strong clues provided by proper nouns.
E.g. “Sachin Tendulkar” will be a strong indicator of the category “sports”.
Sachin Tendulkar plays cricket.
KB APPROACHES –CONCLUSIONS
20
CFILT
- IITB
Drawbacks of WSD using Selectional Restrictions Needs exhaustive Knowledge Base.
Drawbacks of Overlap based approaches Dictionary definitions are generally very small. Dictionary entries rarely take into account the
distributional constraints of different word senses (e.g. selectional preferences, kinds of prepositions, etc. cigarette and ash never co-occur in a dictionary).
Suffer from the problem of sparse match. Proper nouns are not present in a MRD. Hence these
approaches fail to capture the strong clues provided by proper nouns.
E.g. “Sachin Tendulkar” will be a strong indicator of the category “sports”.
Sachin Tendulkar plays cricket.
Методы машинного обучения
автоматическая классификация и кластеризация
КОНТРОЛИРУЕМЫЕ МЕТОДЫ МАШИННОГО ОБУЧЕНИЯ
Задача разрешения семантической неоднозначности сводится к задаче классификации:
Дано: wj ~ {si} – множестов смыслов для слова wj
{fi } – множество признаков
(для данной задачи – множество контекстных слов, множество морфологических тэгов и т.п.)
Задача: наблюдаемое в некотором конкретном контексте словоупотребление отнести к одному из классов (si) на основе информации о значениях признаков {fi } для данного контекста
надо построить такую модель, чтобы по комбинации значений признаков максимально точно предсказывать si
КОНТРОЛИРУЕМЫЕ МЕТОДЫ ОБУЧЕНИЯ: ОБУЧАЕМЫЙ КЛАССИФИКАТОР
Обучающий корпус: размеченный по si вручную корпус текстов
Множество признаков: контекстные слова и POS тэги с весами
Найти такие значения признаков, чтобы их комбинация была «оптимальна» на данном значении слова (например, вероятность приписать данное значение словоформе в данном контексте была максимальной или энтропия была бы максимальной)
sˆ= argmax s ε senses Pr(s|Vw)
Обучение: извлечение из корпуса информации о совместной
встречаемости значения s i и признака vw
Targeted Word Sense Disambiguation
Disambiguate one target word“Take a seat on this chair”“The chair of the Math Department”
WSD is viewed as a typical classification problem use machine learning techniques to train a system
Training: Corpus of occurrences of the target word, each
occurrence annotated with appropriate sense Build feature vectors:
a vector of relevant linguistic features that represents the context (ex: a window of words around the target word)
Disambiguation: Disambiguate the target word in new unseen text
Targeted Word Sense Disambiguation
Take a window of n word around the target word Encode information about the words around the target word
typical features include: words, root forms, POS tags, frequency, … An electric guitar and bass player stand off to one side, not really
part of the scene, just as a sort of nod to gringo expectations perhaps.
Surrounding context (local features) [ (guitar, NN1), (and, CJC), (player, NN1), (stand, VVB) ]
Frequent co-occurring words (topical features) [fishing, big, sound, player, fly, rod, pound, double, runs,
playing, guitar, band] [0,0,0,1,0,0,0,0,0,0,1,0]
Other features: [followed by "player", contains "show" in the sentence,
…] [yes, no, … ]
Метод максимальной энтропии Единственным источником информации для
статистического моделирования являются примеры из обучающей выборки. Чем больше бит информации принесет каждый пример - тем лучше используются имеющиеся в нашем распоряжения даные.
Рассмотрим произвольную компоненту нормированных (предобработанных) данных: . Среднее количество информации, приносимой каждым примером , равно энтропии распределения значений этой компоненты . Если эти значения сосредоточены в относительно небольшой области единичного интервала, информационное содержание такой компоненты мало. В пределе нулевой энтропии, когда все значения переменной совпадают, эта переменная не несет никакой информации. Напротив, если значения переменной равномерно распределены в единичном интервале, информация такой переменной максимальна.
Общий принцип предобработки данных для обучения, таким образом, состоит в максимизации энтропии входов и выходов. Этим принципом следует руководствоваться и на этапе кодирования нечисловых переменных.
Пример. Байесовская классификация. Математические основания
Наиvвный баvйесовский классификаvтор — простой вероятностный классификатор, основанный на применении Теоремы Байеса со строгими (наивными) предположениями о независимости.
Абстрактно, вероятностная модель для классификатора — это условная модель
над зависимой переменной класса C с малым количеством результатов или классов, зависимая от нескольких переменных F1 … Fn. Проблема заключается в том, что когда количество свойств n очень велико или когда свойство может принимать большое количество значений, тогда строить такую модель на вероятностных таблицах становится невозможно. Поэтому мы переформулируем модель, чтобы сделать ее легко поддающейся обработке.
Используя теорему Байеса
28
NAÏVE BAYES
28
CFILT
- IITB
sˆ= argmax s ε senses Pr(s|Vw)
‘Vw’ is a feature vector consisting of: POS of w Semantic & Syntactic features of w Collocation vector (set of words around it) typically consists of
next word(+1), next-to-next word(+2), -2, -1 & their POS's Co-occurrence vector (number of times w occurs in bag of words
around it)
Applying Bayes rule and naive independence assumption
sˆ= argmax s ε senses Pr(s).Πi=1nPr(Vw
i|s)
29
DECISION LIST ALGORITHM
Based on ‘One sense per collocation’ property. Nearby words provide strong and consistent clues as to the sense of
a target word.
Collect a large set of collocations for the ambiguous word.
Calculate word-sense probability distributions for all such collocations.
Calculate the log-likelihood ratio
Higher log-likelihood = more predictive evidence Collocations are ordered in a decision list, with most
predictive collocations ranked highest.29
CFILT
- IITB
Pr(Sense-A| Collocationi)
Pr(Sense-B| Collocationi)Log( )
29
CFILT
- IITB
Assuming there are only
two senses for the word.
Of course, this can easily
be extended to ‘k’ senses.
SUPERVISED APPROACHES –CONCLUSIONS
30
CFILT
- IITB
General Comments Use corpus evidence instead of relying of dictionary defined
senses. Can capture important clues provided by proper nouns because
proper nouns do appear in a corpus.
Naïve Bayes Suffers from data sparseness. Since the scores are a product of probabilities, some weak
features might pull down the overall score for a sense. A large number of parameters need to be trained.
Decision Lists A word-specific classifier. A separate classifier needs to be
trained for each word. Uses the single most predictive feature which eliminates the
drawback of Naïve Bayes.
SUPERVISED APPROACHES –CONCLUSIONS
31
CFILT
- IITB
Exemplar Based K-NN A word-specific classifier. Will not work for unknown words which do not appear in the
corpus. Uses a diverse set of features (including morphological and
noun-subject-verb pairs)
SVM A word-sense specific classifier. Gives the highest improvement over the baseline accuracy. Uses a diverse set of features.
HMM Significant in lieu of the fact that a fine distinction between the
various senses of a word is not needed in tasks like MT. A broad coverage classifier as the same knowledge sources can
be used for all words belonging to super sense. Even though the polysemy was reduced significantly there was
not a comparable significant improvement in the performance.
32
ROADMAP
Knowledge Based Approaches WSD using Selectional Preferences (or restrictions) Overlap Based Approaches
Machine Learning Based Approaches Supervised Approaches Semi-supervised Algorithms Unsupervised Algorithms
Hybrid Approaches Reducing Knowledge Acquisition Bottleneck WSD and MT Summary Future Work
32
CFILT
- IITB
33
SEMI-SUPERVISED DECISION LIST ALGORITHM
Based on Yarowsky’s supervised algorithm that uses Decision Lists.
Step1: Train the Decision List algorithm using a small amount of seed data.
Step2: Classify the entire sample set using the trained classifier.
Step3: Create new seed data by adding those members which are tagged as Sense-A or Sense-B with high probability.
Step4: Retrain the classifier using the increased seed data.
Exploits “One sense per discourse” property Identify words that are tagged with low confidence and label
them with the sense which is dominant for that document33
CFILT
- IITB
34
INITIALIZATION, PROGRESS AND CONVERGENCE
Seed set grows Stop when residual set stabilizes
34
CFILT
- IITB
LifeManufacturin
g
Residual data
SEMI-SUPERVISED APPROACHES – COMPARISONS & CONCLUSIONS
35
CFILT
- IITB
Approach Average Precision
Corpus Average Baseline Accuracy
SupervisedDecision Lists
96.1% Tested on a set of 12 highlypolysemous English words
63.9%
Semi-SupervisedDecision Lists
96.1% Tested on a set of 12 highlypolysemous English words
63.9%
Works at par with its supervised version even though it needs significantly less amount of tagged data.
Has all the advantages and disadvantaged of its supervised version.
36
ROADMAP
Knowledge Based Approaches WSD using Selectional Preferences (or restrictions) Overlap Based Approaches
Machine Learning Based Approaches Supervised Approaches Semi-supervised Algorithms Unsupervised Algorithms
Hybrid Approaches Reducing Knowledge Acquisition Bottleneck WSD and MT Summary Future Work
36
CFILT
- IITB
37
HYPERLEX
KEY IDEA Instead of using “dictionary defined senses” extract the “senses
from the corpus” itself These “corpus senses” or “uses” correspond to clusters of
similar contexts for a word.
CFILT
- IITB
(river)
(water)
(flow)
(electricity) (victory)
(team)
(cup)
(world)
38
DETECTING ROOT HUBS
Different uses of a target word form highly interconnected bundles (or high density components)
In each high density component one of the nodes (hub) has a higher degree than the others.
Step 1: Construct co-occurrence graph, G.
Step 2: Arrange nodes in G in decreasing order of in-degree.
Step 3: Select the node from G which has the highest frequency. This
node will be the hub of the first high density component. Step 4:
Delete this hub and all its neighbors from G. Step 5:
Repeat Step 3 and 4 to detect the hubs of other high density components
CFILT
- IITB
39
DETECTING ROOT HUBS (CONTD.)
The four components for “barrage” can be characterized as:
CFILT
- IITB
40
DELINEATING COMPONENTS
Attach each node to the root hub closest to it. The distance between two nodes is measured as the
smallest sum of the weights of the edges on the paths linking them.
Step 1: Add the target word to the graph G.
Step 2: Compute a Minimum Spanning Tree (MST) over G taking the
target word as the root.
CFILT
- IITB
41
DISAMBIGUATION
Each node in the MST is assigned a score vector with as many dimensions as there are components.
E.g. pluei(rain) belongs to the component EAU(water) and d(eau, pluie) = 0.82, spluei = (0.55, 0, 0, 0)
Step 1: For a given context, add the score vectors of all words in that
context.
Step 2: Select the component that receives the highest weight.
CFILT
- IITB
42
DISAMBIGUATION (EXAMPLE)
Le barrage recueille l’eau a la saison des plueis.
The dam collects water during the rainy season.
EAU is the winner in this case.
A reliability coefficient (ρ) can be calculated as the difference (δ) between the best score and the second best score.
ρ = 1 – (1/(1+ δ))
CFILT
- IITB
43
YAROWSKY’S ALGORITHM (WSD USING ROGET’S THESAURUS CATEGORIES)
Based on the following 3 observations: Different conceptual classes of words (say ANIMALS and
MACHINES) tend to appear in recognizably different contexts. Different word senses belong to different conceptual classes (E.g.
crane). A context based discriminator for the conceptual classes can
serve as a context based discriminator for the members of those classes.
Identify salient words in the collective context of the thesaurus category and weigh appropriately.
Weight(word) = Salience(Word) =
43
CFILT
- IITB
ANIMAL/INSECT
species (2.3), family(1.7), bird(2.6), fish(2.4), egg(2.2), coat(2.5), female(2.0), eat (2.2), nest(2.5), wild
TOOLS/MACHINERY
tool (3.1), machine(2.7), engine(2.6), blade(3.8), cut(2.2), saw(2.5), lever(2.0), wheel (2.2), piston(2.5)
44
DISAMBIGUATION
Predict the appropriate category for an ambiguous word using the weights of words in its context.
ARGMAX RCat
44
CFILT
- IITB…lift water and to grind grain. Treadmills attached to cranes were
used to lift heavy objects from Roman times, ….
TOOLS/MACHINE Weight ANIMAL/INSECT Weight
lift 2.44 Water 0.76
grain 1.68
used 1.32
heavy 1.28
Treadmills 1.16
attached 0.58
grind 0.29
Water 0.11
TOTAL 11.30 TOTAL 0.76
45
LIN’S APPROACHC
FILT
- IITB
45
installation proficiency adeptness readiness toilet/bathroom
Word Freq Log Likelihood
ORG 64 50.4
Plant 14 31.0
Company 27 28.6
Industry 9 14.6
Unit 9 9.32
Aerospace 2 5.81
Memory device
1 5.79
Pilot 2 5.37
Senses of facility Subjects of “employ”
Two different words are likely to have similar meanings if they occur in identical local contexts.
E.g. The facility will employ 500 new employees.
In this case Sense 1 of installation would be the winner sense.