36
Analyse conjointe de réseaux et de corpus avec Linkage Pierre Latouche (C. Bouveyron) MASHS 2018, Paris

Analyse conjointe de réseaux et de corpus avec Linkage

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Analyse conjointe de réseaux et de corpus avec Linkage

Analyse conjointe de réseaux et de corpus avecLinkage

Pierre Latouche (C. Bouveyron)

MASHS 2018, Paris

Page 2: Analyse conjointe de réseaux et de corpus avec Linkage

Outline

Introduction

STBM

Analysis of a co-authorship network

Linkage.fr

Page 3: Analyse conjointe de réseaux et de corpus avec Linkage

Outline

Introduction

STBM

Analysis of a co-authorship network

Linkage.fr

Page 4: Analyse conjointe de réseaux et de corpus avec Linkage

the Enron Email dataset (2001)

Nodes + edges

Page 5: Analyse conjointe de réseaux et de corpus avec Linkage

Introduction

Types of networks: (→ development of statistical approaches)I Binary + static edgesI Discrete / continuous / categorical / ...I Covariates on vertices / edgesI Dynamic edges:

I Continous time → point processesI Discrete time → Markov,...

Types of clusters: (→ development of statistical approaches)I Communities (transitivity)I Heterogeneous clustersI Partitions, overlapping clusters, hierarchy

Page 6: Analyse conjointe de réseaux et de corpus avec Linkage

Introduction

Essentially, two starting points:I The latent position model [HRH02]I The stochastic block model [WW87, NS01]

Page 7: Analyse conjointe de réseaux et de corpus avec Linkage

IntroductionNetworks can be observed directly or indirectly from a variety ofsources:

I social websites (Facebook, Twitter, ...),I personal emails (from your Gmail, Clinton’s mails, ...),I emails of a company (Enron Email data),I digital/numeric documents (Panama papers,

co-authorships, ...),I and even archived documents in libraries (digital

humanities).

⇒ most of these sources involve text!

Page 8: Analyse conjointe de réseaux et de corpus avec Linkage

Introduction

1 2

3

4

6

5

7

8

9

Figure: An (hypothetic) email network between a few individuals.

Page 9: Analyse conjointe de réseaux et de corpus avec Linkage

Introduction

1 2

3

4

6

5

7

8

9

Figure: A typical clustering result for the (directed) binary network.

Page 10: Analyse conjointe de réseaux et de corpus avec Linkage

Introduction

1 2

3

Bask

etb

all i

s gre

at!

4

6

5

7

I love watching basketball!

8

9

I love

fishing!

Fishing

isso

rela

xing!

Wha

t is thega

me

result?

Figure: The (directed) network with textual edges.

Page 11: Analyse conjointe de réseaux et de corpus avec Linkage

Introduction

1 2

3

Bask

etb

all i

s gre

at!

4

6

5

7

I love watching basketball!

8

9

I love

fishing!

Fishing

isso

rela

xing!

Wha

t is thega

me

result?

Figure: Expected clustering result for the (directed) network withtextual edges.

Page 12: Analyse conjointe de réseaux et de corpus avec Linkage

The stochastic topic block model

the stochastic topic block model (STBM) [BLZ16]:I generalizes both SBM and LDA modelsI allows to analyze (directed and undirected) networks with

textual edges.

Page 13: Analyse conjointe de réseaux et de corpus avec Linkage

Outline

Introduction

STBM

Analysis of a co-authorship network

Linkage.fr

Page 14: Analyse conjointe de réseaux et de corpus avec Linkage

Context and notationsWe are interesting in clustering the nodes of a (directed)network of M vertices into Q groups:

I the network is represented by its M ×M adjacency matrixA:

Aij =

{1 if there is an edge between i and j0 otherwise

I if Aij = 1, the textual edge is characterized by a set of Dij

documents:

Wij = (W 1ij , ...,W

dij , ...,W

Dij

ij )

I each document W dij is made of Nd

ij words:

W dij = (W d1

ij , ...,Wdnij , ...,W

dNdij

ij ).

Page 15: Analyse conjointe de réseaux et de corpus avec Linkage

Context and notationsWe are interesting in clustering the nodes of a (directed)network of M vertices into Q groups:

I the network is represented by its M ×M adjacency matrixA:

Aij =

{1 if there is an edge between i and j0 otherwise

I if Aij = 1, the textual edge is characterized by a set of Dij

documents:

Wij = (W 1ij , ...,W

dij , ...,W

Dij

ij )

I each document W dij is made of Nd

ij words:

W dij = (W d1

ij , ...,Wdnij , ...,W

dNdij

ij ).

Page 16: Analyse conjointe de réseaux et de corpus avec Linkage

Context and notationsWe are interesting in clustering the nodes of a (directed)network of M vertices into Q groups:

I the network is represented by its M ×M adjacency matrixA:

Aij =

{1 if there is an edge between i and j0 otherwise

I if Aij = 1, the textual edge is characterized by a set of Dij

documents:

Wij = (W 1ij , ...,W

dij , ...,W

Dij

ij )

I each document W dij is made of Nd

ij words:

W dij = (W d1

ij , ...,Wdnij , ...,W

dNdij

ij ).

Page 17: Analyse conjointe de réseaux et de corpus avec Linkage

Context and notationsWe are interesting in clustering the nodes of a (directed)network of M vertices into Q groups:

I the network is represented by its M ×M adjacency matrixA:

Aij =

{1 if there is an edge between i and j0 otherwise

I if Aij = 1, the textual edge is characterized by a set of Dij

documents:

Wij = (W 1ij , ...,W

dij , ...,W

Dij

ij )

I each document W dij is made of Nd

ij words:

W dij = (W d1

ij , ...,Wdnij , ...,W

dNdij

ij ).

Page 18: Analyse conjointe de réseaux et de corpus avec Linkage

Modeling of the edges

Let us assume that edges are generated according to a SBMmodel:

I each node i is associated with an (unobserved) groupamong Q according to:

Yi ∼M(1, ρ),

where ρ ∈ [0, 1]Q is the vector of group proportions,

I the presence of an edge Aij between i and j is drawnaccording to:

Aij |YiqYjr = 1 ∼ B(πqr),

where πqr ∈ [0, 1] is the connection probability betweenclusters q and r.

Page 19: Analyse conjointe de réseaux et de corpus avec Linkage

Modeling of the edges

Let us assume that edges are generated according to a SBMmodel:

I each node i is associated with an (unobserved) groupamong Q according to:

Yi ∼M(1, ρ),

where ρ ∈ [0, 1]Q is the vector of group proportions,

I the presence of an edge Aij between i and j is drawnaccording to:

Aij |YiqYjr = 1 ∼ B(πqr),

where πqr ∈ [0, 1] is the connection probability betweenclusters q and r.

Page 20: Analyse conjointe de réseaux et de corpus avec Linkage

Modeling of the documentsThe generative model for the documents is as follows:

I each pair of clusters (q, r) is first associated to a vector oftopic proportions θqr = (θqrk)k sampled from a Dirichletdistribution:

θqr ∼ Dir (α) ,

such that∑K

k=1 θqrk = 1,∀(q, r).

I the nth word W dnij of documents d in Wij is then associated

to a latent topic vector Zdnij according to:

Zdnij | {AijYiqYjr = 1, θ} ∼ M (1, θqr) .

I then, given Zdnij , the word W dnij is assumed to be drawn

from a multinomial distribution:

W dnij |Zdnkij = 1 ∼M (1, βk = (βk1, . . . , βkV )) ,

where V is the vocabulary size.

Page 21: Analyse conjointe de réseaux et de corpus avec Linkage

Modeling of the documentsThe generative model for the documents is as follows:

I each pair of clusters (q, r) is first associated to a vector oftopic proportions θqr = (θqrk)k sampled from a Dirichletdistribution:

θqr ∼ Dir (α) ,

such that∑K

k=1 θqrk = 1,∀(q, r).

I the nth word W dnij of documents d in Wij is then associated

to a latent topic vector Zdnij according to:

Zdnij | {AijYiqYjr = 1, θ} ∼ M (1, θqr) .

I then, given Zdnij , the word W dnij is assumed to be drawn

from a multinomial distribution:

W dnij |Zdnkij = 1 ∼M (1, βk = (βk1, . . . , βkV )) ,

where V is the vocabulary size.

Page 22: Analyse conjointe de réseaux et de corpus avec Linkage

Modeling of the documentsThe generative model for the documents is as follows:

I each pair of clusters (q, r) is first associated to a vector oftopic proportions θqr = (θqrk)k sampled from a Dirichletdistribution:

θqr ∼ Dir (α) ,

such that∑K

k=1 θqrk = 1,∀(q, r).

I the nth word W dnij of documents d in Wij is then associated

to a latent topic vector Zdnij according to:

Zdnij | {AijYiqYjr = 1, θ} ∼ M (1, θqr) .

I then, given Zdnij , the word W dnij is assumed to be drawn

from a multinomial distribution:

W dnij |Zdnkij = 1 ∼M (1, βk = (βk1, . . . , βkV )) ,

where V is the vocabulary size.

Page 23: Analyse conjointe de réseaux et de corpus avec Linkage

InferenceThe full joint distribution of the STBM model is given by:

p(A,W, Y, Z, θ|ρ, π, β) = p(W,Z, θ|A, Y, β)p(A, Y |ρ, π).

A key property of the STMB model:I let us assume that Y is observed (groups are known),

I it is then possible to reorganize the documentsD =

∑i,j Dij documents W such that:

W = (W̃qr)qr where W̃qr ={W dij , ∀(d, i, j), YiqYjrAij = 1

},

I since all words in W̃qr are associated with the same pair(q, r) of clusters, they share the same mixture distribution,

I and, simply seeing W̃qr as a document d, the samplingscheme then corresponds to the one of a LDA model withD = Q2 documents.

Page 24: Analyse conjointe de réseaux et de corpus avec Linkage

InferenceThe full joint distribution of the STBM model is given by:

p(A,W, Y, Z, θ|ρ, π, β) = p(W,Z, θ|A, Y, β)p(A, Y |ρ, π).

A key property of the STMB model:I let us assume that Y is observed (groups are known),

I it is then possible to reorganize the documentsD =

∑i,j Dij documents W such that:

W = (W̃qr)qr where W̃qr ={W dij , ∀(d, i, j), YiqYjrAij = 1

},

I since all words in W̃qr are associated with the same pair(q, r) of clusters, they share the same mixture distribution,

I and, simply seeing W̃qr as a document d, the samplingscheme then corresponds to the one of a LDA model withD = Q2 documents.

Page 25: Analyse conjointe de réseaux et de corpus avec Linkage

InferenceThe full joint distribution of the STBM model is given by:

p(A,W, Y, Z, θ|ρ, π, β) = p(W,Z, θ|A, Y, β)p(A, Y |ρ, π).

A key property of the STMB model:I let us assume that Y is observed (groups are known),

I it is then possible to reorganize the documentsD =

∑i,j Dij documents W such that:

W = (W̃qr)qr where W̃qr ={W dij , ∀(d, i, j), YiqYjrAij = 1

},

I since all words in W̃qr are associated with the same pair(q, r) of clusters, they share the same mixture distribution,

I and, simply seeing W̃qr as a document d, the samplingscheme then corresponds to the one of a LDA model withD = Q2 documents.

Page 26: Analyse conjointe de réseaux et de corpus avec Linkage

InferenceThe full joint distribution of the STBM model is given by:

p(A,W, Y, Z, θ|ρ, π, β) = p(W,Z, θ|A, Y, β)p(A, Y |ρ, π).

A key property of the STMB model:I let us assume that Y is observed (groups are known),

I it is then possible to reorganize the documentsD =

∑i,j Dij documents W such that:

W = (W̃qr)qr where W̃qr ={W dij , ∀(d, i, j), YiqYjrAij = 1

},

I since all words in W̃qr are associated with the same pair(q, r) of clusters, they share the same mixture distribution,

I and, simply seeing W̃qr as a document d, the samplingscheme then corresponds to the one of a LDA model withD = Q2 documents.

Page 27: Analyse conjointe de réseaux et de corpus avec Linkage

Inference

Given the above property of the model, we propose for inferenceto maximize the complete data log-likelihood:

log p(A,W, Y |ρ, π, β) = log∑Z

ˆθp(A,W, Y, Z, θ|ρ, π, β)dθ,

with respect to (ρ, π, β) and Y = (Y1, . . . , YM ).

Page 28: Analyse conjointe de réseaux et de corpus avec Linkage

Inference: the C-VEM algorithm

The C(-V)EM algorithm makes use of a variationaldecomposition:

log p(A,W, Y |ρ, π, β) = L (R;Y, ρ, π, β)+KL (R ‖ p(·|A,W, Y, ρ, π, β)) ,

where

L (R(·);Y, ρ, π, β) =∑Z

ˆθR(Z, θ) log

p(A,W, Y, Z, θ|ρ, π, β)R(Z, θ)

dθ,

and R(·) is assumed to factorize as follows:

R(Z, θ) = R(Z)R(θ) = R(θ)M∏

i 6=j,Aij=1

Dij∏d=1

Ndij∏

n=1

R(Zdnij ).

Page 29: Analyse conjointe de réseaux et de corpus avec Linkage

Outline

Introduction

STBM

Analysis of a co-authorship network

Linkage.fr

Page 30: Analyse conjointe de réseaux et de corpus avec Linkage

The HAL Paris Descartes co-authorship network

The Paris Descartes co-authorship network:I the last 10 000 articles published on HAL,I with at least one author from University Paris Descartes,I the network has 13 101 authors and 91 074 edges.

The analysis with Linkage.fr:I the whole analysis process took 38 min on the server,I which includes 3 steps:

I retrieving the data from HAL,I formatting and pre-processing the data,I the choice of (Q,K) and the clustering (app. 20% of the

whole process).

Page 31: Analyse conjointe de réseaux et de corpus avec Linkage

The Paris Descartes co-authorship network

Figure: The HAL Paris Descartes co-authorship network

Page 32: Analyse conjointe de réseaux et de corpus avec Linkage

The Paris Descartes co-authorship network

Figure: The HAL Paris Descartes co-authorship network

Page 33: Analyse conjointe de réseaux et de corpus avec Linkage

The Paris Descartes co-authorship network

Page 34: Analyse conjointe de réseaux et de corpus avec Linkage

Outline

Introduction

STBM

Analysis of a co-authorship network

Linkage.fr

Page 35: Analyse conjointe de réseaux et de corpus avec Linkage

Conclusion

I STBM : allows to model networks with textual edgesI C-VEM algorithm for inferenceI Model selection criterionI Find clusters of nodes and topics of discussions

Page 36: Analyse conjointe de réseaux et de corpus avec Linkage

Biblio I

Charles Bouveyron, Pierre Latouche, and Rawya Zreik, Thestochastic topic block model for the clustering of vertices innetworks with textual edges, Statistics and Computing(2016), 1–21.

Peter D Hoff, Adrian E Raftery, and Mark S Handcock,Latent space approaches to social network analysis, Journalof the american Statistical association 97 (2002), no. 460,1090–1098.

K. Nowicki and T.A.B. Snijders, Estimation and predictionfor stochastic blockstructures, Journal of the AmericanStatistical Association 96 (2001), 1077–1087.

Y.J. Wang and G.Y. Wong, Stochastic blockmodels fordirected graphs, Journal of the American StatisticalAssociation 82 (1987), 8–19.