The University of Wisconsin-Madison
Universal Morphological Analysis using Structured Nearest Neighbor Prediction
Young-Bum Kim, João V. Graça, and Benjamin Snyder
University of Wisconsin-Madison
28 July 2011
Unsupervised NLP
Unsupervised learning in NLP has become popular
27 papers in this year's ACL and EMNLP
Relies on an inductive bias, encoded in the model structure or the learning algorithm
Example: an HMM for POS induction encodes transitional regularity
[Diagram: hidden tags "? ? ? ?" above the sentence "I like to read"]
Inductive Biases
Formulated with weak empirical grounding (or left implicit)
Single, simple bias for all languages
Result: low performance, complicated models, fragility, language dependence
Our approach: learn a complex, universal bias using labeled languages
i.e. Empirically learn what the space of plausible human languages looks like to guide unsupervised learning
Key Idea
1) Collect labeled corpora (non-parallel) for several training languages
[Diagram: several training languages and one test language]
Key Idea
2) Map each (x, y) pair into a "universal feature space": f(x_1, y_1), f(x_2, y_2), f(x_3, y_3), ...
- i.e. to allow cross-lingual generalization
[Diagram: training languages and the test language mapped into the universal feature space]
Key Idea
3) Train a scoring function score(·) over the universal feature space
- i.e. treat each annotated language as a single data point in a structured prediction problem
[Diagram: f(x_1, y_1), f(x_2, y_2), f(x_3, y_3) feeding into the scoring function]
Key Idea
4) Predict the test labels which yield the highest score:
y* = argmax_y score(f(x, y))
[Diagram: candidate analyses of the test language scored in the universal feature space]
Test Case: Nominal Morphology
Languages differ in morphological complexity
- Only 4 English noun tags in the Penn Treebank
- 154 noun tags in the Hungarian corpus (suffixes encode case, number, and gender)
Our analysis will break each noun into:
stem, phonological deletion rule, and suffix
- utiskom [ stem = utisak, del = (..ak# → ..k#), suffix = om ]
Question: Can we use morphologically annotated languages to train a universal morphological analyzer?
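The three-part analysis above can be sketched in code; `apply_analysis` and its argument names are hypothetical illustrations, not from the paper:

```python
def apply_analysis(stem, deletion_rule, suffix):
    """Reassemble a surface form from (stem, deletion rule, suffix).

    deletion_rule is a pair (old_ending, new_ending) rewriting the end
    of the stem before the suffix attaches; None means no change.
    """
    if deletion_rule is not None:
        old, new = deletion_rule
        assert stem.endswith(old)
        stem = stem[: -len(old)] + new
    return stem + suffix

# Serbian example from the slide: utiskom = utisak + (..ak# -> ..k#) + om
print(apply_analysis("utisak", ("ak", "k"), "om"))  # utiskom
```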
Our Method
Universal feature space (8 features)
- Size of the stem, suffix, and deletion rule lexicons
- Entropy of the stem, suffix, and deletion rule distributions
- Percentage of suffix-free words, and of words with phonological deletions
Learning algorithm
- Broad characteristics of morphology are often similar across select language pairs
- This motivates a nearest neighbor approach
- In the structured scenario, learning becomes a search problem over the label space
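A minimal sketch of how these 8 features might be computed from a full analysis of the vocabulary; function and variable names are my own, not the paper's:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of an empirical count distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def universal_features(analyses):
    """analyses: one (stem, deletion_rule_or_None, suffix_or_'') per word type.
    Returns the 8 language-independent features."""
    stems = Counter(a[0] for a in analyses)
    sufs = Counter(a[2] for a in analyses if a[2])
    dels = Counter(a[1] for a in analyses if a[1] is not None)
    n = len(analyses)
    return [
        len(stems), len(sufs), len(dels),                 # lexicon sizes
        entropy(stems),                                   # distribution entropies
        entropy(sufs) if sufs else 0.0,
        entropy(dels) if dels else 0.0,
        sum(1 for a in analyses if not a[2]) / n,         # % suffix-free words
        sum(1 for a in analyses if a[1] is not None) / n, # % words with deletion
    ]
```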
Structured Nearest Neighbor
Main Idea: predict the analysis for the test language which brings us closest in feature space to a training language.
1) Initialize the analysis of the test language: y^0
2) For each training language ℓ:
- iteratively and greedily update the test-language analysis to bring it closer in feature space to language ℓ
3) After T iterations, choose the training language closest in feature space, among the candidates y^(1,T), y^(2,T), y^(3,T), ...
4) Predict the associated analysis y^(ℓ*,T) for that closest language ℓ*
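The four steps above can be sketched as follows; `init`, `climb`, `feats`, and `dist` are hypothetical stand-ins for the initializer, the greedy search step, the feature extractor, and the distance function:

```python
def structured_nn(x, train_feats, init, climb, feats, dist, T=10):
    """x: unlabeled test corpus; train_feats: {language: feature vector}.
    Returns the predicted analysis of x."""
    candidates = {}
    for lang, target in train_feats.items():
        y = init(x)                      # step 1: initialize the analysis
        for _ in range(T):               # step 2: greedily move toward lang
            y = climb(x, y, target)
        candidates[lang] = y
    # steps 3-4: pick the training language we ended up closest to
    best = min(train_feats,
               key=lambda l: dist(feats(x, candidates[l]), train_feats[l]))
    return candidates[best]
```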
Structured Nearest Neighbor (walkthrough)
[Diagram, built up over several slides: training languages (x_1, y_1), (x_2, y_2), (x_3, y_3) plotted in feature space; the test-language labels are initialized; the iterative search moves the test analysis toward each training language in turn; finally the analysis that lands closest to a training language is predicted.]
Morphology Search Algorithm
Stage 0: Initialization
Stage 1: Reanalyze Each Word
Stage 2: Find New Stems
Stage 3: Find New Suffixes
Based on (Goldsmith 2005)
- He minimizes description length
- We minimize distance to the training language
[Diagram: each stage proposes candidate analyses; we select the candidate closest to the training language in feature space]
Iterative Search Algorithm
Stage 0: Using "character successor frequency," initialize sets T, F, and D.
[Diagram: Stem Set T, Suffix Set F, Deletion Rule Set D]
Iterative Search Algorithm
Stage 1: Reanalyze Each Word
- greedily reanalyze each word, keeping T and F fixed
Iterative Search Algorithm
Stage 2: Find New Stems
- greedily analyze unsegmented words, keeping F fixed
Iterative Search Algorithm
Stage 3: Find New Suffixes
- greedily analyze unsegmented words, keeping T fixed
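A rough sketch of Harris-style character successor frequency segmentation, as used in Stage 0; the simple threshold rule and function names are my assumptions, since the slide does not give the exact criterion:

```python
from collections import defaultdict

def successor_counts(vocab):
    """For every prefix seen in the vocabulary, collect the set of
    characters that can follow it."""
    succ = defaultdict(set)
    for w in vocab:
        for i in range(len(w)):
            succ[w[:i]].add(w[i])
    return succ

def initial_split(word, vocab, threshold=2):
    """Return (stem, suffix): split at the rightmost position whose prefix
    has at least `threshold` distinct successors; otherwise no split."""
    succ = successor_counts(vocab)
    for i in range(len(word) - 1, 0, -1):
        if len(succ[word[:i]]) >= threshold:
            return word[:i], word[i:]
    return word, ""

vocab = ["walk", "walks", "walked", "walking", "talk", "talks"]
print(initial_split("walks", vocab))  # ('walk', 's')
```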
Experimental Setup
Corpus: Orwell's Nineteen Eighty-Four (MULTEXT-East V3)
- Languages: Bulgarian, Czech, English, Estonian, Hungarian, Romanian, Slovene, Serbian
- 94,725 tokens (English). Slight confound: the data is parallel. Our method does not assume or exploit this fact.
- all words tagged with a morpho-syntactic analysis
Baseline: Linguistica model (Goldsmith 2005)
- same search procedure, but greedily minimizes description length
Upper bound: supervised model
- structured perceptron framework (Collins 2002)
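The supervised upper bound's training rule (Collins 2002) can be sketched generically; `phi` and `argmax_y` are problem-specific stand-ins, not names from the paper:

```python
def perceptron_epoch(data, phi, argmax_y, w):
    """One pass of the structured perceptron.
    data: list of (x, gold_y); phi maps (x, y) to a {feature: value} dict;
    argmax_y(x, w) returns the highest-scoring analysis under weights w."""
    for x, gold in data:
        pred = argmax_y(x, w)
        if pred != gold:
            for f, v in phi(x, gold).items():  # promote gold features
                w[f] = w.get(f, 0.0) + v
            for f, v in phi(x, pred).items():  # demote predicted features
                w[f] = w.get(f, 0.0) - v
    return w
```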
Aggregate Results
Accuracy: fraction of word types with the correct analysis, averaged over the 8 languages
[Bar chart, built up over four slides; y-axis 40-100:]
- Linguistica baseline: 64.6
- Our model (train with 7, test on 1): 76.4
  - average absolute increase of 11.8; reduces the error (relative to the supervised upper bound) by 42%
- Oracle (each language guided using its own gold-standard feature values): 81.1
- Supervised upper bound: 92.8
Accuracy is still below supervised due to (1) search errors and (2) the coarseness of the feature space
Results By Language
Our Model (train with 7, test on 1) vs. the Linguistica baseline:
- Linguistica: best accuracy on English, lowest on Estonian
- Biggest improvements for Serbian (15 points) and Slovene (22 points)
- For all languages other than English, our model improves over the baseline
[Bar chart: per-language accuracy for BG, CS, EN, ET, HU, RO, SL, SR; y-axis 40-100]
Visualization of Feature Space
Feature space reduced to 2D using MDS
[Scatter plots: each language's position under the Linguistica analysis, the Gold Standard, and Our Method]
- Serbian and Slovene: closely related Slavic languages; nearest neighbors under our model's analysis; essentially they "swap places"
- Estonian and Hungarian: highly inflected Uralic languages; they also "swap places"
- English: failed to find a good neighbor; pulled towards Bulgarian (the second least inflected language in the dataset)
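A minimal numpy-only sketch of classical MDS, the kind of 2-D reduction used for these plots (the slide does not specify which MDS variant was used, so this is an assumption):

```python
import numpy as np

def classical_mds(X, k=2):
    """X: (n, d) matrix of feature vectors. Returns (n, k) coordinates
    whose pairwise distances approximate those of the rows of X."""
    D2 = np.square(np.linalg.norm(X[:, None] - X[None, :], axis=-1))
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ D2 @ J                 # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)        # eigh returns ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]      # take the top-k eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```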
Accuracy as Training Languages Added
Averaged over all language combinations of various sizes:
- Accuracy climbs as training languages are added
- Worse than the baseline when only one training language is available
- Better than the baseline when two or more training languages are available
Why does accuracy improve with more languages?
Resulting distance vs. accuracy for all 56 train-test pairs:
- More training languages ⇒ find a closer neighbor
- Closer neighbor ⇒ higher accuracy
Summary
Main Idea: Recast unsupervised learning as cross-lingual structured prediction
Test case: morphological analysis of 8 languages.
Formulated a universal feature space for morphology
Developed a novel structured nearest neighbor approach
Our method yields substantial accuracy gains
Future Work
Shortcoming
- uniform weighting of dimensions in the universal feature space
- some features may be more important than others
Future work: learn a distance metric on the universal feature space
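The shortcoming and the proposed fix can be contrasted in a small sketch; the per-dimension weights `w` would be the learned metric, and all names here are mine:

```python
def uniform_dist(f1, f2):
    """Euclidean distance with every feature weighted equally (current)."""
    return sum((a - b) ** 2 for a, b in zip(f1, f2)) ** 0.5

def weighted_dist(f1, f2, w):
    """Weighted Euclidean distance; w would be learned per dimension."""
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, f1, f2)) ** 0.5
```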
Thank You