
Improving Distributional Similarity with Lessons Learned from Word Embeddings



Page 1: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Improving Distributional Similarity with Lessons Learned from Word Embeddings

Omer Levy, Yoav Goldberg, Ido Dagan

arXivTimes, 2017/1/24

Presenter: Hironsan (Twitter: Hironsan13, GitHub: Hironsan)

Transactions of the Association for Computational Linguistics, 2015

Page 2: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Abstract
• This paper examines whether much of the reported performance gain of word embeddings actually comes from tuning hyperparameters
• Four methods are compared across many hyperparameter configurations
• On word similarity tasks, count-based methods perform on par with prediction-based methods
• On analogy tasks, prediction-based methods come out ahead

Page 3: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Word Similarity & Relatedness• How similar is pizza to pasta?• How related is pizza to Italy?

• Representing words as vectors makes it possible to compute these similarities
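As a toy illustration (the vectors and values below are made up, not taken from the paper), cosine similarity between word vectors is one way to quantify such judgments:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 3-dimensional vectors, purely for illustration.
pizza = np.array([0.9, 0.1, 0.3])
pasta = np.array([0.8, 0.2, 0.4])
italy = np.array([0.4, 0.7, 0.2])

print(cosine_similarity(pizza, pasta))  # ~0.98: very similar
print(cosine_similarity(pizza, italy))  # ~0.62: related, but less similar
```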

Page 4: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Approaches for Representing Words

Word Embeddings (Predict)
• Inspired by deep learning
• word2vec (Mikolov et al., 2013)
• GloVe (Pennington et al., 2014)

Underlying Theory: The Distributional Hypothesis (Harris, ’54)

“Similar words appear in similar contexts.”

Distributional Semantics (Count)
• Used since the 90’s
• Sparse word-context PMI/PPMI matrix
• Decomposed with SVD

Page 5: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Approaches for Representing Words

Both approaches:
• rely on the distributional hypothesis
• use the same data
• are mathematically related

• “Neural Word Embedding as Implicit MatrixFactorization” (NIPS 2014)

• So which approach is better?
• “Don’t Count, Predict!” (Baroni et al., ACL 2014)
• As its title says, it argues for the prediction-based approach. Is that really so?

Page 6: Improving Distributional Similarity with Lessons Learned from Word Embeddings

The Contributions of Word Embeddings

Novel Algorithms (objective + training method)

• Skip-Gram + Negative Sampling
• CBOW + Hierarchical Softmax
• Noise Contrastive Estimation
• GloVe
• …

Which of the two matters more for improving performance?

New Hyperparameters (preprocessing, smoothing, etc.)

• Subsampling
• Dynamic Context Windows
• Context Distribution Smoothing
• Adding Context Vectors
• …

Page 7: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Contributions

1) Identifying the existence of new hyperparameters

• Not always mentioned in papers

2) Adapting the hyperparameters across algorithms
• Must understand the mathematical relation between algorithms

3) Comparing algorithms across all hyperparameter settings

• Over 5,000 experiments

Page 8: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Comparing Methods

The following four methods are compared while varying their hyperparameters:
• PPMI (Positive Pointwise Mutual Information)

• SVD(Singular Value Decomposition)

• SGNS(Skip-Gram with Negative Sampling)

• GloVe(Global Vectors)

Count-based: PPMI, SVD

Prediction-based: SGNS, GloVe

Page 9: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Hyperparameters

Preprocessing Hyperparameters
• Window Size (win)
• Dynamic Context Window (dyn)
• Subsampling (sub)
• Deleting Rare Words (del)

Association Metric Hyperparameters
• Shifted PMI (neg)
• Context Distribution Smoothing (cds)

Post-processing Hyperparameters
• Adding Context Vectors (w+c)
• Eigenvalue Weighting (eig)
• Vector Normalization (nrm)

Page 10: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Dynamic Context Windows (dyn)
• Context words are weighted by their distance from the target word
• word2vec samples the actual window size uniformly from 1..win, which amounts to weighting a context word at distance d by (win − d + 1)/win
• GloVe instead weights a context word at distance d by 1/d
• Either way, nearby context words count more (see the sketch below)
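A minimal sketch of how such distance weighting could enter the co-occurrence counts (my own illustration of the weighting schemes above, not the authors' code):

```python
from collections import Counter

def weighted_cooccurrences(tokens, win=5, scheme="word2vec"):
    """Word-context co-occurrence counts with distance weighting.

    "word2vec": a context word at distance d gets weight (win - d + 1) / win,
    the expected weight when the window size is sampled uniformly from 1..win.
    "glove": a context word at distance d gets weight 1 / d.
    """
    counts = Counter()
    for i, word in enumerate(tokens):
        for d in range(1, win + 1):
            w = (win - d + 1) / win if scheme == "word2vec" else 1.0 / d
            for j in (i - d, i + d):
                if 0 <= j < len(tokens):
                    counts[(word, tokens[j])] += w
    return counts

# Toy usage on a tiny corpus (illustrative only).
tokens = "similar words appear in similar contexts".split()
print(weighted_cooccurrences(tokens, win=2)[("words", "appear")])
```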

Page 11: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Subsampling (sub)
• Subsampling is used to thin out very frequent words
• A word whose frequency f exceeds a threshold t is removed with probability p = 1 − √(t/f)

• Here f is the word's (relative) frequency
• t is typically set between 10^-3 and 10^-5 (see the sketch below)
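A minimal sketch of this down-sampling, assuming relative corpus frequencies (not the authors' implementation):

```python
from collections import Counter
import numpy as np

def subsample(tokens, t=1e-4, seed=0):
    """Drop frequent tokens: a token with relative frequency f > t is
    removed with probability p = 1 - sqrt(t / f)."""
    rng = np.random.default_rng(seed)
    freq = Counter(tokens)
    total = len(tokens)
    kept = []
    for word in tokens:
        f = freq[word] / total
        p_remove = max(0.0, 1.0 - np.sqrt(t / f))
        if rng.random() >= p_remove:
            kept.append(word)
    return kept

# "the" (f ~0.83) is dropped ~65% of the time here; "pizza"/"pasta" (f < t) are always kept.
corpus = ["the"] * 50 + ["pizza", "pasta"] * 5
print(subsample(corpus, t=0.1))
```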

Page 12: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Deleting Rare Words (del)
• Words that occur fewer than some minimum number of times are removed
• They are deleted before the context windows are built, so the surviving tokens move closer together (see the sketch below)
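A trivial sketch of this filtering step (my own illustration; min_count is a hypothetical parameter name):

```python
from collections import Counter

def delete_rare_words(tokens, min_count=2):
    """Remove words occurring fewer than min_count times *before* building
    context windows, so the surviving tokens end up closer together."""
    freq = Counter(tokens)
    return [w for w in tokens if freq[w] >= min_count]

tokens = "pizza pasta pizza italy pizza pasta sushi".split()
print(delete_rare_words(tokens, min_count=2))
# -> ['pizza', 'pasta', 'pizza', 'pizza', 'pasta']
```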

Page 13: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Transferable Hyperparameters

Preprocessing Hyperparameters
• Window Size (win)
• Dynamic Context Window (dyn)
• Subsampling (sub)
• Deleting Rare Words (del)

Association Metric Hyperparameters
• Shifted PMI (neg)
• Context Distribution Smoothing (cds)

Post-processing Hyperparameters
• Adding Context Vectors (w+c)
• Eigenvalue Weighting (eig)
• Vector Normalization (nrm)

Page 14: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Shifted PMI (neg)
• Shift PMI by SGNS's negative-sampling parameter k (Levy and Goldberg, 2014)

SPPMI(w, c) = max(PMI(w, c) − log k, 0)

• That is, the negative-sampling parameter k is carried over into the PMI association metric (see the sketch below)
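A minimal sketch of computing SPPMI from a dense word-context count matrix (illustrative only; the paper works with sparse PPMI matrices):

```python
import numpy as np

def sppmi(counts, k=5):
    """Shifted positive PMI from a word-context co-occurrence count matrix.

    PMI(w, c)   = log( #(w,c) * |D| / (#(w) * #(c)) )
    SPPMI(w, c) = max(PMI(w, c) - log k, 0)
    """
    total = counts.sum()
    word_counts = counts.sum(axis=1, keepdims=True)
    ctx_counts = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):
        pmi = np.log(counts * total / (word_counts * ctx_counts))
    return np.maximum(pmi - np.log(k), 0.0)

# Toy 3x3 count matrix (rows: words, columns: contexts).
counts = np.array([[10.0, 2.0, 0.0],
                   [3.0, 8.0, 1.0],
                   [0.0, 1.0, 6.0]])
print(sppmi(counts, k=2))
```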

Page 15: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Context Distribution Smoothing (cds)
• SGNS draws its negative samples from a smoothed unigram distribution, P(c)^0.75 renormalized, rather than from P(c) itself

• The same smoothing can be applied to the context distribution inside PMI

• This dampens the influence of rare words and is consistently effective (see the sketch below)
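A minimal sketch of PMI with a smoothed context distribution (alpha = 0.75, as in SGNS's negative-sampling distribution; not the authors' code):

```python
import numpy as np

def pmi_cds(counts, alpha=0.75):
    """PMI where the context probabilities are smoothed by raising context
    counts to the power alpha and renormalizing."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    ctx = counts.sum(axis=0) ** alpha
    p_c_smoothed = (ctx / ctx.sum())[np.newaxis, :]
    with np.errstate(divide="ignore"):
        return np.log(p_wc / (p_w * p_c_smoothed))

counts = np.array([[10.0, 2.0, 0.0],
                   [3.0, 8.0, 1.0],
                   [0.0, 1.0, 6.0]])
print(np.maximum(pmi_cds(counts), 0.0))  # smoothed PPMI
```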

Page 16: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Transferable Hyperparameters

Preprocessing Hyperparameters
• Window Size (win)
• Dynamic Context Window (dyn)
• Subsampling (sub)
• Deleting Rare Words (del)

Association Metric Hyperparameters
• Shifted PMI (neg)
• Context Distribution Smoothing (cds)

Post-processing Hyperparameters
• Adding Context Vectors (w+c)
• Eigenvalue Weighting (eig)
• Vector Normalization (nrm)

Page 17: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Adding Context Vectors (w+c)
• SGNS produces word vectors w
• SGNS also produces context vectors c

• So do GloVe and SVD
• Represent each word as w + c instead of w alone (Pennington et al., 2014)

• Until now this trick had only been applied to GloVe (see the sketch below)
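A sketch with hypothetical trained matrices W and C (rows indexed by vocabulary), just to show what this post-processing step amounts to:

```python
import numpy as np

# Hypothetical trained matrices from SGNS, GloVe, or SVD (rows: vocabulary).
rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 100))  # word vectors
C = rng.normal(size=(1000, 100))  # context vectors

# Represent each word by the sum of its word and context vectors.
W_plus_C = W + C
print(W_plus_C.shape)  # (1000, 100)
```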

Page 18: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Eigenvalue Weighting (eig)
• Change how the singular values obtained from the SVD factorization are weighted
• Plain SVD yields W and C, but the factorization-faithful construction is not necessarily optimal for similarity tasks
• A symmetric weighting is better suited to semantic tasks

Symmetric: e.g. W = U_d · Σ_d^0.5 (or simply W = U_d), instead of W = U_d · Σ_d (see the sketch below)
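A minimal sketch of the exponent on the singular values (my own illustration; dim and p are the tunable choices):

```python
import numpy as np

def svd_embeddings(ppmi, dim=2, p=0.5):
    """Factorize a (P)PMI matrix with truncated SVD and weight the singular
    values by an exponent p: W = U_d * (Sigma_d ** p).

    p = 1 is the factorization-faithful choice; p = 0.5 or p = 0 are the
    symmetric variants that tend to do better on similarity tasks."""
    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
    return U[:, :dim] * (S[:dim] ** p)

ppmi = np.array([[1.2, 0.0, 0.3],
                 [0.0, 0.9, 0.1],
                 [0.4, 0.0, 1.5]])
print(svd_embeddings(ppmi, dim=2, p=0.5))
```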

Page 19: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Vector Normalization (nrm)
• The resulting word vectors are normalized

• Each vector is scaled to unit (L2) length, so dot products between vectors become cosine similarities (see the sketch below)
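A minimal sketch of row-wise L2 normalization (not the authors' code):

```python
import numpy as np

def normalize_rows(W, eps=1e-8):
    """Scale every word vector to unit L2 length; afterwards the dot product
    of two rows equals their cosine similarity."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, eps)

W = np.array([[3.0, 4.0],
              [1.0, 0.0]])
print(normalize_rows(W))  # rows become [0.6, 0.8] and [1.0, 0.0]
```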

Page 20: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Transferable Hyperparameters
• By putting the hyperparameters of the different algorithms into correspondence, the experiments keep conditions across methods as equal as possible

Page 21: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Comparing Algorithms

Page 22: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Systematic Experiments
• 9 Hyperparameters

• 4 Word Representation Algorithms
• PPMI (Sparse & Explicit)
• SVD(PPMI)
• SGNS
• GloVe

• 8 Benchmarks
• 6 Word Similarity Tasks
• 2 Analogy Tasks

Page 23: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Hyperparameter Settings

Classic Vanilla Setting (commonly used for baselines)
• Preprocessing: <None>
• Postprocessing: <None>
• Association Metric: Vanilla PMI/PPMI

Recommended word2vec Setting (tuned for SGNS)
• Preprocessing: Dynamic Context Window, Subsampling
• Postprocessing: <None>
• Association Metric: Shifted PMI/PPMI, Context Distribution Smoothing

Page 24: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Experiments: Hyperparameter Tuning

[Bar chart: Spearman's correlation on WordSim-353 Relatedness for PPMI (sparse vectors) and SGNS (embeddings) under different hyperparameter settings, up to the optimal one; y-axis from 0.3 to 0.7; plotted scores are 0.54, 0.587, 0.688 and 0.623, 0.697, 0.681.]

Page 25: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Overall Results
• Tuning the hyperparameters is sometimes more important than the choice of algorithm

• On word similarity tasks, count-based methods perform on par with prediction-based methods

• On analogy tasks, prediction-based methods come out ahead

Page 26: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Conclusions

Page 27: Improving Distributional Similarity with Lessons Learned from Word Embeddings

Conclusions: Distributional Similarity
The performance of word embeddings is affected by two things:

• Novel Algorithms

• New Hyperparameters

What really matters for performance?
• Hyperparameters (mostly)

• Algorithms

Page 28: Improving Distributional Similarity with Lessons Learned from Word Embeddings

References
• Omer Levy and Yoav Goldberg. 2014. Neural Word Embeddings as Implicit Matrix Factorization. NIPS 2014.
• Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
• Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. EMNLP 2014.

Page 29: Improving Distributional Similarity with Lessons Learned from Word Embeddings

If you are interested in Machine Learning, Natural Language Processing, or Computer Vision, follow arXivTimes @ https://twitter.com/arxivtimes