Unsupervised Object Matching (2nd STAIR Lab AI Seminar)

Copyright©2016 NTT corp. All Rights Reserved.

Unsupervised Object Matching

Tomoharu Iwata, NTT Communication Science Laboratories


Research topics so far

Recommender system Clustering

Topic modeling

Information diffusion

Object matching

Visualization

Active Learning

Domain adaptation


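Today's outline

• Introduction to unsupervised object matching
• Specific methods
– Unsupervised cluster matching with latent probabilistic models
– Unsupervised cluster matching for network data
– Extracting a universal grammar from multilingual document data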


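Machine learning

• Supervised learning
– Learn from input-output pairs to predict the unknown output for a new input
– Example 1: spam filtering
– Example 2: image recognition
• Reinforcement learning
– The correct output for an input is not given, but an evaluation of each output is; learn the optimal output
– Example 1: robot control
– Example 2: games
• Unsupervised learning
– Extract the hidden structure underlying the data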


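Unsupervised learning

• Example 1: clustering
– Find groups of consumers with similar purchasing behavior
– Find groups of related documents
• Example 2: dimensionality reduction
– Map high-dimensional data to two dimensions for visualization
– Remove noise by keeping only the essential dimensions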


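Object matching

• Find correspondences between objects in different domains
• Examples
– Images and tags
– English and Japanese words
– IDs in different databases (record linkage)

[Figure: unsupervised cluster matching between English and Japanese]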


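Supervised object matching

• Corresponding pairs are given as training data
• Find correspondences for test data whose correspondences are unknown

[Figure: training data annotated with correspondences between domain 1 and domain 2; test data whose correspondences (?) are to be found]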


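Existing method: canonical correlation analysis

• Training phase
– Learn linear maps so that corresponding pairs are embedded at the same position in a low-dimensional latent space

[Figure: linear maps W_1 and W_2 between the high-dimensional spaces of domain 1 and domain 2 and a shared low-dimensional latent space]

• Test phase
– Map the test data into the low-dimensional latent space with the learned linear maps, and estimate that data placed close together correspond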















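Problems with supervised matching methods

• Correspondence data are required
– Examples: parallel text, dictionaries
• In some situations correspondence data are hard or impossible to obtain
– Privacy protection • Customer information cannot be shared between companies
– The data were collected for different purposes or by different methods • For data already collected, the correspondence may have been lost
– Manual matching is costly • Some languages with few speakers lack dictionaries and parallel data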



• Find correspondences without any correspondence data



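• Distances between objects in different domains cannot be measured
• Distances between objects within the same domain can be measured

[Figure: objects A, B, C in the domain 1 space and objects 1, 2, 3 in the domain 2 space]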









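[Figure: in domain 1, only C is far from the others, and A is closer to C than B is; in domain 2, only 1 is far from the others, and 2 is closer to 1 than 3 is. Comparing these within-domain distance patterns suggests the correspondences C ↔ 1, A ↔ 2, B ↔ 3.]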




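The idea that only within-domain distances are available can be illustrated with a small sketch. This is not one of the methods cited in this talk: it simply tries every one-to-one assignment and keeps the one under which the two within-domain distance matrices agree best. The coordinates, the function names, and the assumptions that both domains contain the same number of objects and use comparable distance scales are all hypothetical.

```python
from itertools import permutations
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of coordinate tuples."""
    return [[math.dist(p, q) for q in points] for p in points]


def match_by_distance_structure(dist1, dist2):
    """Brute-force search over one-to-one assignments.

    Returns a tuple perm such that object i of domain 1 is matched to
    object perm[i] of domain 2. Only within-domain distances are compared;
    no cross-domain distance is ever computed.
    """
    n = len(dist1)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(
            (dist1[i][j] - dist2[perm[i]][perm[j]]) ** 2
            for i in range(n)
            for j in range(n)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm


# Hypothetical coordinates: C is isolated in domain 1, "1" is isolated in domain 2.
names1, points1 = ["A", "B", "C"], [(1.0, 0.0), (0.0, 0.0), (4.0, 0.0)]
names2, points2 = ["1", "2", "3"], [(0.0, 4.0), (0.0, 1.0), (0.0, 0.0)]

perm = match_by_distance_structure(
    pairwise_distances(points1), pairwise_distances(points2)
)
matching = {names1[i]: names2[perm[i]] for i in range(len(names1))}
print(matching)
```

The search visits all n! assignments, so it is only usable for a handful of objects; it is meant to show the principle, not to scale.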


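Today's outline

• Basics of unsupervised object matching
• Specific methods
– Unsupervised cluster matching with latent probabilistic models
– Unsupervised cluster matching for network data
– Extracting a universal grammar from multilingual document data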


Unsupervised cluster matching with latent probabilistic models

Tomoharu Iwata joint work with Tsutomu Hirao and Naonori Ueda


Unsupervised object matching methods

• Existing methods
– kernelized sorting [Quadrianto et al., 2010]
– convex kernelized sorting [Djuric, Grbovic, and Vucetic, 2012]
– least squares object matching [Yamada and Sugiyama, 2011]
– matching canonical correlation analysis [Haghighi et al., 2008]
• Limitations
– They find only one-to-one correspondences
– Every domain must contain the same number of objects
– They cannot handle more than two domains


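Task: unsupervised cluster matching

• Find correspondences between clusters in different domains without supervision
– Correspondences need not be one-to-one
– The number of domains can be two or more
– The number of objects can differ across domains

[Figure: a cluster of synonyms shared across languages. English: car, automobile, motorcar; German: wagen, automobil; Japanese: 車, 自動車, 乗用車]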


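Proposed method: a latent variable model for unsupervised cluster matching

1. Embed the data of each domain into a shared low-dimensional latent space
2. Cluster the objects in the latent space
3. Objects assigned to the same cluster are considered to correspond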





















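Clustering with a probabilistic generative model

• Use probabilities to define the process by which data are generated given the clusters
• In practice, only the data are given
• Infer the most plausible clusters that generated the data
• Advantages
– Uncertainty can be taken into account
– Heterogeneous data can be integrated within the framework of probability theory

[Figure: clusters → data (generation); data → clusters (inference)]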


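Clustering with a Gaussian mixture model

• A probabilistic version of k-means
• Generative process
– The cluster means are {μ_1, μ_2, ⋯, μ_K}
– For each object n = 1, ⋯, N
• Draw a cluster assignment s_n ∼ Categorical(θ)
• Generate the object x_n ∼ Normal(μ_{s_n}, σ^2)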
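The generative process of the Gaussian mixture can be written out directly. The following is a minimal sketch using only the standard library; the function name, the example means and proportions, and the use of the same variance σ^2 in every coordinate are assumptions made for illustration.

```python
import random


def sample_gmm(means, theta, sigma, n_objects, seed=0):
    """Sample objects from a Gaussian mixture with isotropic noise.

    means: list of K mean vectors (tuples), theta: K cluster proportions,
    sigma: standard deviation shared by all clusters and coordinates.
    """
    rng = random.Random(seed)
    assignments, objects = [], []
    for _ in range(n_objects):
        # s_n ~ Categorical(theta)
        s = rng.choices(range(len(means)), weights=theta)[0]
        # x_n ~ Normal(mu_{s_n}, sigma^2), independently per coordinate
        x = tuple(rng.gauss(m, sigma) for m in means[s])
        assignments.append(s)
        objects.append(x)
    return assignments, objects


# Hypothetical example with K = 3 clusters in two dimensions.
example_means = [(0.0, 0.0), (5.0, 5.0), (-5.0, 5.0)]
example_s, example_x = sample_gmm(example_means, [0.2, 0.5, 0.3], sigma=0.1, n_objects=9)
for s, x in zip(example_s, example_x):
    print(s + 1, x)  # clusters are printed 1-based, as on the slides
```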






[Figure: step-by-step illustration of the generative process with cluster means μ_1, μ_2, μ_3 and cluster proportions θ. The first object is assigned s_1 = 2 and x_1 is generated around μ_2; the second object is assigned s_2 = 1 and x_2 is generated around μ_1; repeating this yields the objects x_1, ⋯, x_9.]



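• Inference
– Find the most plausible cluster assignments {s_1, s_2, ⋯, s_N}, cluster means {μ_1, μ_2, ⋯, μ_K}, and cluster proportions θ that generated the object set {x_1, x_2, ⋯, x_N}

[Figure: the observed objects x_1, ⋯, x_9 and the cluster means μ_1, μ_2, μ_3 inferred from them]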
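One standard way to carry out this inference for a Gaussian mixture is the EM algorithm. The sketch below is a generic textbook version, not necessarily the procedure used in the talk, and it assumes a known, fixed standard deviation `sigma` and a fixed number of clusters `k`; all names are hypothetical.

```python
import math
import random


def fit_gmm_em(points, k, sigma=1.0, n_iters=50, seed=0):
    """EM for a Gaussian mixture with fixed isotropic variance sigma^2.

    Returns hard assignments, cluster means, and cluster proportions.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    means = [list(p) for p in rng.sample(points, k)]  # initialize at data points
    theta = [1.0 / k] * k
    resp = []
    for _ in range(n_iters):
        # E-step: responsibility of each cluster for each object
        resp = []
        for x in points:
            logw = [
                math.log(theta[j])
                - sum((a - b) ** 2 for a, b in zip(x, means[j])) / (2 * sigma ** 2)
                for j in range(k)
            ]
            top = max(logw)  # subtract the max for numerical stability
            w = [math.exp(v - top) for v in logw]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: update proportions and means from the responsibilities
        for j in range(k):
            nj = sum(r[j] for r in resp)
            theta[j] = max(nj / len(points), 1e-12)  # avoid log(0) next round
            if nj > 0:
                means[j] = [
                    sum(r[j] * x[i] for r, x in zip(resp, points)) / nj
                    for i in range(dim)
                ]
    assignments = [max(range(k), key=lambda j: r[j]) for r in resp]
    return assignments, means, theta


# Hypothetical data: two well-separated groups of four objects each.
toy_points = [
    (0.0, 0.0), (0.3, 0.1), (0.1, 0.4), (0.6, 0.5),
    (10.0, 10.2), (10.4, 9.9), (9.8, 10.8), (10.7, 10.6),
]
toy_assignments, toy_means, toy_theta = fit_gmm_em(toy_points, k=2)
print(toy_assignments, toy_means, toy_theta)
```

Cluster indices are arbitrary, so only the grouping of the objects is meaningful, not which group is called 0 or 1.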


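Generative model for unsupervised cluster matching

• An infinite number of latent vectors {z_1, z_2, z_3, ⋯} in the latent space
• Linear projection matrices {W_1, ⋯, W_D} from the latent space to each domain
• For each domain d = 1, ⋯, D
– For each object n = 1, ⋯, N_d
• Draw a cluster assignment s_{dn} ∼ Categorical(θ)
• Generate the object x_{dn} ∼ Normal(W_d z_{s_{dn}}, α^{-1} I)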
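The generative process above can be simulated as follows. This is a sketch under simplifying assumptions that are not part of the slides: the infinite set of latent vectors is truncated to `n_clusters`, the cluster proportions θ are uniform, and the latent vectors and projection matrices are drawn from standard normal distributions. All function and parameter names are hypothetical.

```python
import random


def sample_cluster_matching_data(n_objects, obs_dims, latent_dim=2,
                                 n_clusters=3, alpha=100.0, seed=0):
    """Simulate multi-domain data with shared latent cluster vectors.

    n_objects[d] and obs_dims[d] give N_d and M_d for each domain d,
    and alpha is the precision of the observation noise.
    """
    rng = random.Random(seed)
    theta = [1.0 / n_clusters] * n_clusters  # assumption: uniform proportions
    # Shared latent vectors z_j (assumption: standard normal, truncated to n_clusters)
    Z = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(n_clusters)]
    noise_std = alpha ** -0.5
    W, S, X = [], [], []
    for N_d, M_d in zip(n_objects, obs_dims):
        # Domain-specific projection matrix W_d (M_d x latent_dim)
        W_d = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(M_d)]
        S_d, X_d = [], []
        for _ in range(N_d):
            s = rng.choices(range(n_clusters), weights=theta)[0]  # s_dn ~ Categorical(theta)
            mean = [sum(w * z for w, z in zip(row, Z[s])) for row in W_d]  # W_d z_s
            X_d.append([m + rng.gauss(0, noise_std) for m in mean])  # add N(0, alpha^-1) noise
            S_d.append(s)
        W.append(W_d)
        S.append(S_d)
        X.append(X_d)
    return Z, W, S, X


# Two domains with different numbers of objects (5, 8) and features (4, 6).
Z, W, S, X = sample_cluster_matching_data(n_objects=[5, 8], obs_dims=[4, 6])
print(S)
```

Because the latent vectors are shared, objects in different domains that receive the same assignment belong to the same cluster even though their feature spaces differ.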


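Generative model for unsupervised cluster matching

• Infinite Gaussian mixture model

\[ p(\mathbf{x}_{dn} \mid \mathbf{Z}, \mathbf{W}, \boldsymbol{\theta}) = \sum_{j=1}^{\infty} \theta_j \, \mathcal{N}(\mathbf{x}_{dn} \mid \mathbf{W}_d \mathbf{z}_j, \alpha^{-1} \mathbf{I}) \]

where W_d is the linear projection matrix of domain d, z_j is the latent vector of cluster j, α is the precision (inverse variance), and θ_j is the proportion of cluster j.

[Figure: latent vectors z_1, z_2, z_3 in the latent space are projected to W_1 z_1, W_1 z_2, W_1 z_3 in the observation space of domain 1 and to W_2 z_1, W_2 z_2, W_2 z_3 in the observation space of domain 2; objects scatter around each projected vector with variance α^{-1}]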


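Input and output of the proposed method

• Input: object sets from D domains, X_1 = {x_{11}, x_{12}, ⋯, x_{1N_1}}, ⋯, X_D
– x_{dn} ∈ R^{M_d} is the feature vector of the n-th object in domain d
– The number of objects and the number of features may differ across domains: N_d ≠ N_{d'}, M_d ≠ M_{d'}
• Output: a cluster assignment for each object, S_1 = {s_{11}, s_{12}, ⋯, s_{1N_1}}, ⋯, S_D
– s_{dn} ∈ {1, ⋯, ∞} is the cluster assignment of the n-th object in domain d; the clusters are shared by all domains

[Figure: input as one object-by-feature matrix per domain (domain 1, domain 2); output as the assignment of the objects in domain 1 and domain 2 to shared clusters]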


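Features of the proposed method

• The number of clusters is estimated automatically
– A Dirichlet process is used to assume an infinite number of clusters
• Objects in different domains can be assigned to common clusters
– The latent vectors are shared by all domains
• Different feature dimensionalities and statistical properties of each domain can be handled
– Each domain has its own linear projection matrix
• The number of objects may differ across domains
– Given the latent vectors, objects are generated independently in each domain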


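Inference

• Stochastic EM algorithm
– E-step: sample the cluster assignments s by Gibbs sampling
– M-step: estimate the projection matrices W by maximum likelihood
– The latent vectors z, cluster proportions θ, and precision α are integrated out analytically

\[ p(s_{dn} = j \mid \mathbf{X}, \mathbf{S}_{\setminus dn}, \mathbf{W}) \propto \frac{p(s_{dn} = j, \mathbf{S}_{\setminus dn})}{p(\mathbf{S}_{\setminus dn})} \cdot \frac{p(\mathbf{X} \mid s_{dn} = j, \mathbf{S}_{\setminus dn}, \mathbf{W})}{p(\mathbf{X}_{\setminus dn} \mid \mathbf{S}_{\setminus dn}, \mathbf{W})} \]

\[ \frac{p(s_{dn} = j, \mathbf{S}_{\setminus dn})}{p(\mathbf{S}_{\setminus dn})} \propto \begin{cases} N_{j \setminus dn} & \text{for an existing cluster} \\ \gamma & \text{for a new cluster} \end{cases} \]

\[ \frac{p(\mathbf{X} \mid s_{dn} = j, \mathbf{S}_{\setminus dn}, \mathbf{W})}{p(\mathbf{X}_{\setminus dn} \mid \mathbf{S}_{\setminus dn}, \mathbf{W})} = (2\pi)^{-M_d/2} \, r \, \frac{{b'_{\setminus dn}}^{a'_{\setminus dn}}}{{b'_{s_{dn}=j}}^{a'_{s_{dn}=j}}} \, \frac{\Gamma(a'_{s_{dn}=j})}{\Gamma(a'_{\setminus dn})} \, \frac{|\mathbf{C}_{j, s_{dn}=j}|^{1/2}}{|\mathbf{C}_{j \setminus dn}|^{1/2}} \]

[Figure: graphical model with hyperparameters γ, a, b, r; variables θ (mixture weight), α (precision), z (latent vector), W (projection matrix), x (object), and s (cluster assignment); plates over ∞ clusters, D domains, and N objects]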
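The Gibbs update for one assignment combines a prior weight (the cluster size N_{j∖dn} for an existing cluster, γ for a new one) with a likelihood ratio. The sketch below shows only how these two factors are combined and normalized; the marginal likelihood terms are passed in as precomputed log values, because their closed form depends on the quantities a', b', and C, which are not reproduced here. The function names and the example numbers are hypothetical.

```python
import math
import random


def assignment_probabilities(counts, gamma, loglik_existing, loglik_new):
    """Normalized probabilities over existing clusters plus one new cluster.

    counts[j]: number of other objects currently in cluster j (must be > 0;
    clusters that became empty are assumed to have been removed).
    loglik_existing[j]: log likelihood ratio of the object under cluster j.
    loglik_new: log likelihood ratio of the object under a new cluster.
    """
    log_weights = [math.log(n) + ll for n, ll in zip(counts, loglik_existing)]
    log_weights.append(math.log(gamma) + loglik_new)  # last entry: new cluster
    top = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def sample_assignment(probs, rng):
    """Draw a cluster index; index len(probs) - 1 means 'open a new cluster'."""
    return rng.choices(range(len(probs)), weights=probs)[0]


# Hypothetical example: two existing clusters with 3 and 1 members, gamma = 1,
# and equal likelihoods, so the probabilities are proportional to (3, 1, 1).
probs = assignment_probabilities([3, 1], 1.0, [0.0, 0.0], 0.0)
new_assignment = sample_assignment(probs, random.Random(0))
print(probs, new_assignment)
```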


Matching rotated handwritten digits

• Domain 1: original images
• Domain 2: rotated by 90 degrees
• Domain 3: rotated by 180 degrees


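Mapping to a different domain

• Mapping from domain d to the latent space
\[ \hat{\mathbf{z}} = (\mathbf{W}_d^T \mathbf{W}_d)^{-1} \mathbf{W}_d^T \mathbf{x} \]
• Mapping from the latent space to domain d'
\[ \hat{\mathbf{x}}_{d'} = \mathbf{W}_{d'} \hat{\mathbf{z}} \]
• Projection matrix from domain d to domain d'
\[ \mathbf{W}_{d'} (\mathbf{W}_d^T \mathbf{W}_d)^{-1} \mathbf{W}_d^T \]

[Figure: an object in domain d is mapped to the latent space and then to domain d']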
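The cross-domain mapping is a least-squares projection into the latent space followed by a projection into the target domain. The sketch below implements it with small list-based matrix helpers so that it needs no external library; the helper names and the example matrices are hypothetical, and W_d is assumed to have full column rank so that W_d^T W_d is invertible.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def inverse(A):
    """Gauss-Jordan inverse of a small square matrix with partial pivoting."""
    n = len(A)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cross_domain_matrix(W_src, W_dst):
    """Return W_dst (W_src^T W_src)^{-1} W_src^T."""
    Wt = transpose(W_src)
    to_latent = matmul(inverse(matmul(Wt, W_src)), Wt)  # domain d -> latent space
    return matmul(W_dst, to_latent)                      # latent space -> domain d'


# Hypothetical projection matrices: domain d has 3 features, domain d' has 2.
W_d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_d_prime = [[2.0, 0.0], [0.0, 3.0]]
M = cross_domain_matrix(W_d, W_d_prime)
x = [[1.0], [2.0], [3.0]]  # equals W_d z for z = (1, 2)
print(matmul(M, x))        # estimate of W_d' z in domain d'
```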


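Experiments

• Three synthetic data sets and four real data sets
• The features were randomly split into two groups to create two domains
• The goal is to match objects that have the same class label
• Evaluation measure: adjusted Rand index (higher is better)
• Compared methods (KM: k-means, CKS: convex kernelized sorting)
– KM: clustering each domain individually
– KM-CKS: clustering, then one-to-one matching
– CKS: one-to-one matching
– CKS-KM: one-to-one matching, then clustering

\[ p(s_{dn} = j \mid \mathbf{X}, \mathbf{S}_{\setminus dn}, \mathbf{W}) \propto \begin{cases} N_{j \setminus dn} \cdot p(\mathbf{x}_{dn} \mid s_{dn} = j, \mathbf{S}_{\setminus dn}, \mathbf{W}) & \text{for an existing cluster} \\ \gamma \cdot p(\mathbf{x}_{dn} \mid \mathbf{W}) & \text{for a new cluster} \end{cases} \]

[Figure: the object-by-feature matrix is split along the feature axis into two domains]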
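The adjusted Rand index used as the evaluation measure compares two partitions of the same objects and corrects the pair-counting agreement for chance. The following self-contained sketch follows the standard contingency-table formula; the function name and the toy labels are chosen for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same objects."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two objects: the partitions trivially agree
    # Pairs of objects placed together in both, in the true, and in the predicted partition
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions put everything in one cluster
    return (together_both - expected) / (maximum - expected)


# For cluster matching, the objects of all domains can be pooled into one list:
# class labels on one side, inferred cluster assignments on the other.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index does not depend on how clusters are numbered, which is why it suits a setting where cluster indices are arbitrary.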


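Experiments

• Synthetic data with true latent dimensionality K* = 5
• Performance is highest when the latent dimensionality equals the true one
• Thanks to Bayesian inference, the proposed method is robust to the choice of latent dimensionality

[Figure: adjusted Rand index versus latent dimensionality for Proposed, KM-CKS, CKS-KM, KM, and CKS]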


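Experiments

• The proposed method achieved the highest accuracy regardless of the number of domains

[Figure: adjusted Rand index versus the number of domains D for Proposed, CKS-KM, KM-CKS, KM, and CKS]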


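Semi-supervised setting

• A small number of correspondences are sometimes available
• In the E-step, objects known to correspond are always assigned to the same cluster

\[ p(s_{dn} = s_{d'n'} = j \mid \mathbf{X}, \mathbf{S}_{\setminus dn, d'n'}, \mathbf{W}) \]

[Figure: unsupervised versus semi-supervised matching]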


Semi-supervised experiments

[Figure: adjusted Rand index versus labeled object rate]


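Conclusion

• Proposed a method for unsupervised cluster matching
– No correspondence data required
– Handles multiple domains, many-to-many correspondences, and any number of objects
• Future work
– Nonlinear mappings
– Real-world applications • Biology, purchase data, multilingual analysis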


Unsupervised cluster matching for network data

Tomoharu Iwata, NTT Communication Science Laboratories
joint work with James Lloyd and Zoubin Ghahramani

Introduction

• Networks have common latent groups
– lexical networks from different languages
– social networks from different research labs
– biological networks from different species
– user-item networks from different stores

[Figure: lexical networks with shared latent groups. Japanese: 車, 自動車 / 運転する, 動かす / ドライバー, 運転手. English: car, motorcar / drive, run / driver, operator]

Introduction

• Find correspondence between clusters in multiple networks without node correspondence
– e.g. discover shared word clusters from multilingual document-word networks without cross-language alignment information
• Networks from different fields exhibit common characteristics
– e.g. scale-free, small world, community structure
• Multi-task learning for networks

[Figure: Input: two user-item networks (user-by-item matrices). Output: common user/item clusters, shown with users and items sorted by cluster.]

[Figure: Task. From the input networks, the output is a clustering of the nodes and a matching of the clusters across networks.]

Proposed Method: ReMatch

• based on Infinite Relational Models (IRM) [Kemp et al., 2006]
– infinite version of stochastic block models
– clustering nodes based on connectivity
• a single network is modeled by an IRM
• multiple IRMs are generated from shared connectivity and cluster proportions
• different networks can share clusters and their interaction patterns
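The sharing of connectivity and cluster proportions across networks can be illustrated with a finite stochastic block model. This is a simplified sketch, not the ReMatch model itself: the number of clusters is fixed instead of infinite, the shared parameters are given rather than drawn from priors, and all names and numbers are hypothetical.

```python
import random


def sample_networks(n_nodes_per_network, proportions, connectivity, seed=0):
    """Generate several networks from shared cluster proportions and connectivity.

    proportions[k]: probability that a node belongs to cluster k.
    connectivity[k][l]: probability of a link from a node in cluster k
    to a node in cluster l. Both are shared by every network.
    """
    rng = random.Random(seed)
    k = len(proportions)
    networks = []
    for n in n_nodes_per_network:
        # Each network has its own nodes, but cluster indices mean the same thing everywhere
        clusters = rng.choices(range(k), weights=proportions, k=n)
        adjacency = [
            [1 if rng.random() < connectivity[clusters[i]][clusters[j]] else 0
             for j in range(n)]
            for i in range(n)
        ]
        networks.append((clusters, adjacency))
    return networks


# Two networks of different sizes with dense links within clusters and sparse links between them.
shared_connectivity = [[0.9, 0.1], [0.1, 0.9]]
for clusters, adjacency in sample_networks([5, 7], [0.5, 0.5], shared_connectivity):
    print(clusters)
    for row in adjacency:
        print(row)
```

Because cluster k follows the same linking pattern in every network, recovering the cluster assignments in each network also yields the matching between clusters, without any node-level correspondence.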

Generative process of ReMatch

ReMatch: IRM with a combined matrix

[Figure: a combined matrix containing Network1 and Network2, with the entries between the two networks treated as missing]

Inference

• collapsed Gibbs sampling

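Unidentifiable Networks

[Figure: example networks with nodes A, B, C and a, b, c, and with nodes A, B and a, b]

Identifiable Networks

[Figure: example networks with nodes A, B, C and a, b, c]

[Figure: clustering and matching simultaneously; cluster index versus node index for users and items, ×: network1, ○: network2]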

Experiments with synthetic data

[Figure: Adjusted Rand Index]

Experiments with real-world user-item data

[Figure: user and item clusters]

Experiments with real-world document-word data (Wikipedia in English and German)

Conclusion

• We proposed a probabilistic model for unsupervised cluster matching for networks.

• Future work – investigate other common properties

• e.g. small world, scale free

– apply the proposed framework to other network models

• e.g. latent feature model, dynamic IRM


Extracting a universal grammar from multilingual document data

Tomoharu Iwata joint work with Daichi Mochihashi and Hiroshi Sawada

Introduction

• Languages share certain common properties
– word order in most European languages is SVO
• Reasons for commonalities
– a common ancestor language
– borrowing from nearby languages
– innate abilities of humans

[Figure: three explanations for commonalities between lang 1 and lang 2: a shared protolanguage, borrowing between the two languages, and the human brain]

Task

• Extract a common grammar from multilingual corpora

INPUT: non-parallel and non-annotated multilingual corpora
OUTPUT: common grammar and language-dependent grammars

Our approach

• Hierarchical Bayesian modeling
– Monolingual grammar: probabilistic context-free grammar (PCFG) • Each sentence is generated from the language-dependent PCFG
– PCFG for each language is generated from a prior (common grammar)

[Figure: the PCFG prior (common grammar) generates one PCFG each for English, German, and Swedish; each PCFG generates the sentences of its language]

Probabilistic context-free grammar

[Figure: parse tree of the sentence I saw a dog, with S → NP VP, VP → V NP, NP → Det N; S, NP, VP, V, Det, N are nonterminals and the words are terminals]

• probability of nonterminal production, e.g. S → NP VP : 0.5, S → VP VP : 0.1
• probability of terminal emission, e.g. V → saw : 0.002, V → study : 0.001
• probability of choosing production or emission, e.g. S: emission: 0.0, V: emission: 0.9
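The three kinds of probabilities on this slide (choosing emission or production, then choosing a specific production or word) determine both how a sentence is generated and how probable a given parse tree is. The sketch below uses a toy grammar whose structure follows the example tree; the probability values, the dictionary layout, and the function names are hypothetical and are not the numbers shown on the slide.

```python
import random

# For each nonterminal: "emit" is the probability of choosing emission
# rather than production, "productions" is a distribution over (B, C)
# pairs, and "emissions" is a distribution over words. Values are made up.
GRAMMAR = {
    "S":   {"emit": 0.0, "productions": {("NP", "VP"): 1.0}, "emissions": {}},
    "NP":  {"emit": 0.5, "productions": {("Det", "N"): 1.0}, "emissions": {"I": 1.0}},
    "VP":  {"emit": 0.0, "productions": {("V", "NP"): 1.0}, "emissions": {}},
    "V":   {"emit": 1.0, "productions": {}, "emissions": {"saw": 1.0}},
    "Det": {"emit": 1.0, "productions": {}, "emissions": {"a": 1.0}},
    "N":   {"emit": 1.0, "productions": {}, "emissions": {"dog": 1.0}},
}


def tree_probability(tree, grammar):
    """Probability of a tree given as (A, word) or (A, left_subtree, right_subtree)."""
    rules = grammar[tree[0]]
    if len(tree) == 2:  # emission A -> w
        return rules["emit"] * rules["emissions"].get(tree[1], 0.0)
    left, right = tree[1], tree[2]  # production A -> B C
    p = (1.0 - rules["emit"]) * rules["productions"].get((left[0], right[0]), 0.0)
    return p * tree_probability(left, grammar) * tree_probability(right, grammar)


def sample_tree(symbol, grammar, rng):
    """Generate a tree top-down: choose emission or production, then the rule."""
    rules = grammar[symbol]
    if rng.random() < rules["emit"]:
        words = list(rules["emissions"])
        word = rng.choices(words, weights=[rules["emissions"][w] for w in words])[0]
        return (symbol, word)
    pairs = list(rules["productions"])
    left, right = rng.choices(pairs, weights=[rules["productions"][p] for p in pairs])[0]
    return (symbol, sample_tree(left, grammar, rng), sample_tree(right, grammar, rng))


def tree_yield(tree):
    """Words at the leaves of the tree, left to right."""
    if len(tree) == 2:
        return [tree[1]]
    return tree_yield(tree[1]) + tree_yield(tree[2])


EXAMPLE_TREE = ("S", ("NP", "I"), ("VP", ("V", "saw"), ("NP", ("Det", "a"), ("N", "dog"))))
print(tree_yield(EXAMPLE_TREE), tree_probability(EXAMPLE_TREE, GRAMMAR))
print(tree_yield(sample_tree("S", GRAMMAR, random.Random(0))))
```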

Generative model of PCFG

PCFG of language l:

\[ G_l = (\mathbf{K}, \mathbf{W}_l, \boldsymbol{\Phi}_l), \qquad \boldsymbol{\Phi}_l = \{\boldsymbol{\theta}_{lA}, \boldsymbol{\phi}_{lA}, \boldsymbol{\psi}_{lA}\}_{A \in \mathbf{K}} \]

K: nonterminals, W_l: terminals, Φ_l: rule probabilities
* nonterminals are shared among languages

• θ_{lA}: probability of choosing emission or production (multinomial), with common Dirichlet parameters α^θ_A
• φ_{lA}: probability of nonterminal production A→BC (multinomial), with common Dirichlet parameters α^φ_A
• ψ_{lA}: probability of terminal emission A→w (multinomial)

[Figure: for nonterminal S, English (l = en) and German (l = de) each have their own θ_{l,S} over {emit|S, prod|S}, φ_{l,S} over the productions S→NN, S→NV, S→VN, S→VV, S→SN, and ψ_{l,S} over words (English: learn, universal, from, grammar, multi; German: dieser, vortrag, sehr, ist, gut); θ_{l,S} and φ_{l,S} of both languages are generated from the common Dirichlet parameters α^θ_S and α^φ_S]

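Inference

• variational Bayesian method
– estimate the posterior p(Z, Φ, α | X)
– via a tractable variational distribution q(Z, Φ, α)
– so as to minimize the KL divergence

\[ \arg\min_{q} \mathrm{KL}\big[ q(\mathbf{Z}, \boldsymbol{\Phi}, \boldsymbol{\alpha}) \,\|\, p(\mathbf{Z}, \boldsymbol{\Phi}, \boldsymbol{\alpha} \mid \mathbf{X}) \big] \]

Z: parse trees, Φ: language-dependent PCFG parameters, α: common grammar parameters, X: multilingual corpora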

Parameter update

• The parameters can be updated efficiently using the inside-outside algorithm

[Figure: updates among the parse trees, the language-dependent parameters, and the common parameters]
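The inside pass of the inside-outside algorithm sums the probabilities of all parse trees of a sentence by dynamic programming over spans. The sketch below shows only this inside pass for a grammar in the emission/production parameterization described earlier; the outside pass and the variational updates are omitted, and the toy grammar, its probability values, and the function name are hypothetical.

```python
# Hypothetical toy grammar: "emit" is the probability of choosing emission,
# "productions" a distribution over (B, C) pairs, "emissions" over words.
TOY_GRAMMAR = {
    "S":   {"emit": 0.0, "productions": {("NP", "VP"): 1.0}, "emissions": {}},
    "NP":  {"emit": 0.5, "productions": {("Det", "N"): 1.0}, "emissions": {"I": 1.0}},
    "VP":  {"emit": 0.0, "productions": {("V", "NP"): 1.0}, "emissions": {}},
    "V":   {"emit": 1.0, "productions": {}, "emissions": {"saw": 1.0}},
    "Det": {"emit": 1.0, "productions": {}, "emissions": {"a": 1.0}},
    "N":   {"emit": 1.0, "productions": {}, "emissions": {"dog": 1.0}},
}


def inside_probability(words, grammar, root="S"):
    """Probability that `root` generates `words`, summed over all parse trees."""
    n = len(words)
    if n == 0:
        return 0.0
    # inside[i][j][A]: probability that nonterminal A generates words[i:j]
    inside = [[{} for _ in range(n + 1)] for _ in range(n)]
    for i, word in enumerate(words):  # spans of width 1: emissions
        for A, rules in grammar.items():
            p = rules["emit"] * rules["emissions"].get(word, 0.0)
            if p > 0.0:
                inside[i][i + 1][A] = p
    for width in range(2, n + 1):  # wider spans: productions A -> B C
        for i in range(n - width + 1):
            j = i + width
            for A, rules in grammar.items():
                total = 0.0
                for (B, C), p_rule in rules["productions"].items():
                    for split in range(i + 1, j):
                        total += (p_rule
                                  * inside[i][split].get(B, 0.0)
                                  * inside[split][j].get(C, 0.0))
                total *= 1.0 - rules["emit"]  # probability of choosing production
                if total > 0.0:
                    inside[i][j][A] = total
    return inside[0][n].get(root, 0.0)


print(inside_probability(["I", "saw", "a", "dog"], TOY_GRAMMAR))
print(inside_probability(["dog", "saw"], TOY_GRAMMAR))
```

The table has O(n^2) cells and each cell considers O(n) split points, so the cost grows cubically with sentence length instead of exponentially with the number of trees.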

Experiments

• Data
– EuroParl corpus in 11 languages
– 100,000 sentences for each language
– not sentence-aligned
– 20 nonterminals
• Probable terminals for each nonterminal
– only nonterminals with high probabilities of selecting the emission rule
• Common grammar rule probability

\[ \hat{\phi}_{A \to BC} = \frac{\alpha^{\theta}_{A1}}{\alpha^{\theta}_{A0} + \alpha^{\theta}_{A1}} \cdot \frac{\alpha^{\phi}_{ABC}}{\sum_{B', C'} \alpha^{\phi}_{AB'C'}} \]

Probable terminals for nonterminal 9 (English translation)

da: of for in and to with on from
de: and in for on of to with to also
el: and in that from be of for
es: of that to in by with and for on
fi: and Europe is-like or are nor-is
fr: of has that for on in of and by
it: of and of of of in to 's of by
nl: of in for and on with to about but
pt: of of of and to in of and with
sv: in and for of to on with as

(English translation)

da: it I we there they therefore this debate what
de: is are have shall has must might my will can
el: be for to must not and this with that
es: is there thanks morning that you-have we is place are
fi: is not are should was sent not may concerns can-be
fr: new is in has place will you have
it: and that not one debate take-place president are
nl: is be must has have must can shall will is
pt: that and not mr. parliament of approves tomorrow with
sv: the I we this what therefore they debate you it

Inferred common grammar

[Figure: rules of the inferred common grammar]

R: root, S: sentence, SBJ: subject, VP: verb phrase, V: verb, NP: noun phrase, DT: determiner, N: noun, PR: preposition

* We named nonterminals using grammatical categories after the inference

Conclusion

• Bayesian approach for capturing commonalities at the syntax level for non-parallel multilingual corpora
• Future work
– model improvement • more sophisticated probabilistic grammar models • infer #nonterminals with nonparametric Bayes • more hierarchy for modeling an evolutionary tree of languages
– experiments with a greater diversity of languages
– finding a universal grammar

[Figure: evolutionary tree over lang 1, lang 2, lang 3]


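Today's outline

• Unsupervised cluster matching with latent probabilistic models
• Unsupervised cluster matching for network data
• Extracting a universal grammar from multilingual document data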
