Improving Distributional Similarity with Lessons Learned from Word Embeddings

Improving Distributional Similaritywith Lessons Learned from Word Embeddings

Omer Levy, Yoav Goldberg, Ido Dagan

arXivTimes, 2017/1/24

Presenter: HironsanTwitter : Hironsan13GitHub : Hironsan

Transactions of the Association for Computational Linguistics, 2015

Abstract• word embedding の性能向上は Hyperparameters のチューニングによるところが大きいのではないかということで検証した論文• 4 つの手法について様々な Hyperparameters の組み合わせで検証• word similarity タスクにおいて、 count-base の手法でも

prediction-base の手法と同等の性能を示した• analogy タスクにおいては prediction-base 手法の方が優っていた

Word Similarity & Relatedness• How similar is pizza to pasta?• How related is pizza to Italy?

• 単語をベクトルとして表現することで類似度を計算できるようになる

Approaches for Representing Words

Word Embeddings (Predict)• Inspired by deep learning• word2vec (Mikolov et al., 2013)• GloVe (Pennington et al., 2014)

Underlying Theory: The Distributional Hypothesis (Harris, ’54)

“ ”似た単語は似た文脈で現れる

Distributional Semantics (Count)• Used since the 90’s• Sparse word-context PMI/PPMI matrix• Decomposed with SVD

Approaches for Representing WordsBoth approaches:• distributional hypothesis に頼っている• 使用するデータも同じ• 数学的にも関連している

• “Neural Word Embedding as Implicit MatrixFactorization” (NIPS 2014)

• どっちの手法の方が良いのだろうか？• “Don’t Count, Predict!” (Baroni et al., ACL 2014)• タイトルの通り Predict ベースを使うべきと主張。本当に？

The Contributions of Word Embeddings

Novel Algorithms(objective + training method)

• Skip Grams + Negative Sampling• CBOW + Hierarchical Softmax• Noise Contrastive Estimation• GloVe• …

どちらが性能を改善するのに重要なのか？

New Hyperparameters(preprocessing, smoothing, etc.)

• Subsampling• Dynamic Context Windows• Context Distribution Smoothing• Adding Context Vectors• …

Contributions1) Identifying the existence of new hyperparameters

• Not always mentioned in papers

2) Adapting the hyperparameters across algorithms• Must understand the mathematical relation between

algorithms

3) Comparing algorithms across all hyperparameter settings

• Over 5,000 experiments

Comparing Methods以下の 4 つの手法についてハイパーパラメータを変えて比較• PPMI(Positive Pointwise Mutual Information)

• SVD(Singular Value Decomposition)

• SGNS(Skip-Gram with Negative Sampling)

• GloVe(Global Vectors)

Count-base

Prediction-base

HyperparametersPreprocessing Hyperparameters

• Window Size(win)• Dynamic Context Window (dyn)• Subsampling (sub) • Deleting Rare Words (del)

Association Metric Hyperparameters• Shifted PMI (neg) • Context Distribution Smoothing (cds)

Post-processing Hyperparameters• Adding Context Vectors (w+c) • Eigenvalue Weighting (eig) • Vector Normalization (nrm)

Dynamic Context Windows•

Subsampling (sub) • Subsampling は頻出語を除去するために使う• 閾値 t 以上に頻出する単語を確率 p で除去する

• f は単語の頻度• t として設定する値は 10-3 〜 10-5

Deleting Rare Words (del) • 低頻度語の除去• context windows を作る前に除去する

Transferable HyperparametersPreprocessing Hyperparameters

Shifted PMI (neg) • PMI から SGNS の Negative Sampling のパラメータ k を shift する

(Levy and Goldberg 2014)

SPPMI(w, c) = max(PMI(w, c) − log k, 0)• Negative Sampling のパラメータ k を PMI に適用している

Context Distribution Smoothing (cds)• negative sampling は単語のユニグラム分布 P をスムージングした分布を用いて行う

• PMI でも同様にスムージング

• こうすることで低頻度語の影響を緩和でき、かなり効果的

Transferable HyperparametersPreprocessing Hyperparameters

Adding Context Vectors (w+c) • SGNS は単語ベクトル w を生成する• SGNS は文脈ベクトル c も生成する

• GloVe と SVD も• 単語を w で表さず、 w+c で表す by Pennington et al. (2014)

• 今までは GloVe だけに適応されてた

Eigenvalue Weighting (eig) • SVD で分解した結果得られた固有値の重み付け方法を変更する• SVD して W と C を作るが類似度タスクに最適な作り方とは限らない• symmetric な方法が semantic タスクには適している

Symmetric

Vector Normalization (nrm) • 得られた単語ベクトルは正規化

• 単位ベクトルにする

Transferable Hyperparameters• 各アルゴリズムにおける Hyperparameters の対応をとることで、なるべく条件を揃えて実験する

Comparing Algorithms

Systematic Experiments• 9 Hyperparameters

• 4 Word Representation Algorithms• PPMI (Sparse & Explicit)• SVD(PPMI)• SGNS• GloVe

• 8 Benchmarks• 6 Word Similarity Tasks• 2 Analogy Tasks

Hyperparameter SettingsClassic Vanilla Setting(commonly used for baselines)• Preprocessing

• <None>•

• Postprocessing• <None>

• Association Metric• Vanilla PMI/PPMI

Recommended word2vec Setting(tuned for SGNS)• Preprocessing

• Dynamic Context Window• Subsampling

• Postprocessing• <None>

• Association Metric• Shifted PMI/PPMI• Context Distribution

Smoothing

Experiments: Hyperparameter Tuning

PPMI (Sparse Vectors) SGNS (Embeddings)0.3

0.54 0.5870.688

0.6230.697 0.681

WordSim-353 Relatedness

’s Co

[optimal]

Overall Results• Hyperparameters のチューニングは時に algorithms より重要

• word similarity タスクにおいて、 count-base の手法でもprediction-base の手法と同等の性能を示した

• analogy タスクにおいては prediction-base 手法の方が優っていた

Conclusions

Conclusions: Distributional SimilarityWord Embeddings の性能には以下の 2 つが影響する :

• Novel Algorithms

• New Hyperparameters

何が性能にとって本当に重要なのか？• Hyperparameters (mostly)

• Algorithms

References• Omer Levy and Yoav Goldberg, 2014, Neural word

embeddings as implicit matrix factorization

• Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, 2013, Efficient Estimation of Word Representations in Vector Space

• Jeffrey Pennington, Richard Socher, Christopher D. Manning, 2014, GloVe: Global Vectors for Word Representation

If you are interested inMachine Learning, Natural Language Processing, Computer Vision,Follow arXivTimes @ https://twitter.com/arxivtimes

Improving Distributional Similarity with Lessons Learned from Word Embeddings

Software

Symbolic Data Analysis: Dissimilarity/Similarity/Distance ... · Dis/Similarity / Distance Measures Distance Measures, Similarity/Dissimilarity Matrices: Goalis to subdivide the complete

Jointly Learning Word and Phrase Embeddings Using Neural ...hassy/publications/talk/ucl2015/... · –parameterize the word embeddings –learn them based on co-occurrence statistics

New species and new distributional data on Cteniopodini …...New species and new distributional data on Cteniopodini from the Palaearctic Region (Coleoptera: Tenebrionidae: Alleculinae)

Segmentation Similarity and Agreement

PERBANDINGAN METODE DICE SIMILARITY DENGAN COSINE ...etheses.uin-malang.ac.id/13814/1/13650031.pdf · i perbandingan metode dice similarity dengan cosine similarity menggunakan query

Evaluation methods for unsupervised word embeddings EMNLP2015 読み会

Online Appendix Word Embeddings for the Analysis of Ideological … · 2019. 5. 6. · Online Appendix Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora

Distributional patterns of neotropical, afrotropical and

Meta-Prod2Vec: Simple Product Embeddings with Side-Information

Kernel Embeddings for Inference with Intractable …sejdinov/talks/pdf/2016-03-30_ISM...2016/03/30 · Kernel Embeddings for Inference with Intractable Likelihoods Dino Sejdinovic

Funciones léxicas en español utilizando embeddings

Exploiting Bilingual Word Embeddings to Establish ...weissweiler/kolloq/referate/eder.p… · Exploiting Bilingual Word Embeddings to Establish Translational Equivalence 15.05.2017

Estimating Distributional Parameters in Hierarchical Models

ETH-CVL @ MediaEval 2016: Textual-Visual Embeddings and …ceur-ws.org/Vol-1739/MediaEval_2016_paper_24.pdf · 2016-11-20 · ETH-CVL @ MediaEval 2016: Textual-Visual Embeddings and

Autour de la résolution automatique de la coréférence ... · GloVe embeddings (Pennington et al.,2014) and 50-dimensional embeddings from Turian et al. (2010), both normalized

New distributional records of Delphacidae …...Campodonico (2017): New distributional records of Delphacidae (Hemiptera: Fulgoroidea) from Chile. 120 Weglarz (2012) were followed

Similarity Join Wu Yang 2009.4.9. Main work MS--A Primitive Operator for Similarity Joins in Data Cleaning ICDE 2006 Google--Scaling Up All Pairs Similarity

Learning Composition Models for Phrase Embeddings

Word embeddings: reprezentacje właściwościowe słów

Economic Similarity and Bilateral Trade