
arXiv:1401.4988v2 [stat.ML] 11 Nov 2014

Marginal Pseudo-Likelihood Learning of Markov Network Structures

Johan Pensar [email protected]
Henrik Nyman [email protected]
Department of Mathematics and Statistics
Åbo Akademi University
20500 Turku, Finland

Juha Niiranen [email protected]
Jukka Corander [email protected]
Department of Mathematics and Statistics
University of Helsinki
00014 Helsinki, Finland

Editor:

Abstract

Undirected graphical models known as Markov networks are popular for a wide variety of applications ranging from statistical physics to computational biology. Traditionally, learning of the network structure has been done under the assumption of chordality, which ensures that efficient scoring methods can be used. In general, non-chordal graphs have intractable normalizing constants, which renders the calculation of Bayesian and other scores difficult beyond very small-scale systems. Recently, there has been a surge of interest towards the use of regularized pseudo-likelihood methods for structural learning of large-scale Markov network models, as such an approach avoids the assumption of chordality. The currently available methods typically necessitate the use of a tuning parameter to adapt the level of regularization for a particular dataset, which can be optimized for example by cross-validation. Here we introduce a Bayesian version of pseudo-likelihood scoring of Markov networks, which enables an automatic regularization through marginalization over the nuisance parameters in the model. We prove consistency of the resulting MPL estimator for the network structure via comparison with the pseudo information criterion. Identification of the MPL-optimal network on a prescanned graph space is considered with both greedy hill climbing and exact pseudo-Boolean optimization algorithms. We find that for reasonable sample sizes the hill climbing approach most often identifies networks that are at a negligible distance from the restricted global optimum. Using synthetic and existing benchmark networks, the marginal pseudo-likelihood method is shown to generally perform favorably against recent popular inference methods for Markov networks.

Keywords: Bayesian inference, Markov networks, structure learning, undirected graph, pseudo-likelihood, regularization

1. Introduction

Markov networks represent a ubiquitous modeling framework for multivariate systems, with applications ranging from statistical physics to computational biology and sociology (see


Lauritzen, 1996; Koller and Friedman, 2009). However, statistical inference for such models is in general challenging, both regarding estimation of parameters and learning the structure of the network. Under the assumption of chordality it is possible to use a closed-form factorization of a distribution with respect to a Markov network; however, in non-chordal cases the normalizing factor (or partition function) of these distributions is intractable beyond toy-sized systems. Since the chordality assumption is restrictive and may seriously bias learning of the dependencies among variables, considerable interest has been targeted towards making non-chordal networks tractable for applications as well. A revival of interest has in particular arisen from the need to consider high-dimensional models in a 'large p, small n' setting (Lee et al., 2006; Höfling and Tibshirani, 2009; Ravikumar et al., 2010; Aurell and Ekeberg, 2012; Ekeberg et al., 2013).

In physics, Markov network models have traditionally been fitted using the mean-field approximation, which has only recently started to become superseded by more elaborate approaches, such as the pseudo-likelihood method (Aurell and Ekeberg, 2012; Ekeberg et al., 2013). The pseudo-likelihood approach was originally motivated by the difficulties of maximizing the likelihood function for lattice models (Besag, 1972), and it simplifies the model fitting by a factorization of the likelihood over local neighborhoods of the random variables involved in the modeled system.

High-dimensional Markov networks usually necessitate some form of regularization to make the pseudo-likelihood estimation problem feasible to solve. Some of the currently available methods necessitate the use of a tuning parameter to adapt the level of regularization for a particular dataset. The value of the tuning parameter can then be optimized for example by cross-validation. Here we introduce a Bayesian version of the pseudo-likelihood approach to learn the structure of a Markov network without assuming chordality. Our method enables an automatic regularization of the resulting model complexity through marginalization over the nuisance parameters in the model.

The structure of the remaining article is as follows. In the next section the basic properties of Markov networks are reviewed, and the structure learning problem is formulated in Section 3. In Section 4, we introduce the marginal pseudo-likelihood (MPL) score and prove consistency of the corresponding structure estimator. Algorithms for optimizing the MPL score for a given dataset are derived in Section 5, and the penultimate section demonstrates the favorable performance of our method against other popular recent alternatives. The last section provides some additional remarks and conclusions.

2. Markov networks

We consider a set of d discrete random variables X = {X1, . . . , Xd}, where each variable Xj takes values from a finite set of outcomes Xj. A Markov network over X is an undirected probabilistic graphical model that compactly represents a joint distribution over the variables. The dependence structure over the d variables is specified by an undirected graph G = (V, E), where the nodes V = {1, . . . , d} correspond to the indices of the variables in X and the edge set E ⊆ V × V represents dependencies among the variables. We will use the terms node and variable interchangeably throughout this article. The complete set of undirected graphs is denoted by G.


A node i is a neighbor of j (and vice versa) if {i, j} ∈ E, and the set of all neighbors of j is called its Markov blanket, denoted by mb(j) = {i ∈ V : {i, j} ∈ E}. A clique in a graph is a subset of nodes, C ⊆ V, for which every pair of nodes is connected by an edge, that is, {i, j} ∈ E if i, j ∈ C. A clique is considered maximal if it cannot be extended by including an additional node without violating the clique criterion. The set of maximal cliques associated with a graph is denoted by C(G). The variables corresponding to a subset of nodes, S ⊆ V, are denoted by XS = {Xj}j∈S, and the corresponding joint outcome space is specified by the Cartesian product XS = ×j∈S Xj. The cardinality of an outcome space is denoted by |XS|. We use a lowercase letter xS to denote that the variables have been assigned a specific joint outcome in XS. A dataset x = (x1, . . . , xn) refers to a collection of n i.i.d. complete joint observations xk = (xk,1, . . . , xk,d) over the d variables, that is, xk,j ∈ Xj for all k and j.
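The notions of Markov blanket and maximal clique can be made concrete in code. The following minimal sketch uses a hypothetical 5-node graph (not one from this paper) and brute-force enumeration, which is feasible only for tiny graphs:

```python
from itertools import combinations

# Hypothetical 5-node graph (nodes indexed 1..5, as V = {1, ..., d} in the text).
V = {1, 2, 3, 4, 5}
E = {(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)}

def mb(j, V, E):
    """Markov blanket of node j: all neighbors of j in the undirected graph."""
    return {i for i in V if (i, j) in E or (j, i) in E}

def maximal_cliques(V, E):
    """Brute-force maximal-clique enumeration (feasible only for tiny graphs)."""
    def adj(i, j):
        return (i, j) in E or (j, i) in E
    cliques = [set(c) for r in range(1, len(V) + 1)
               for c in combinations(sorted(V), r)
               if all(adj(i, j) for i, j in combinations(c, 2))]
    # keep only cliques that are not a proper subset of another clique
    return [c for c in cliques if not any(c < c2 for c2 in cliques)]

print(mb(3, V, E))              # → {1, 2, 4}
print(maximal_cliques(V, E))    # → [{3, 4}, {4, 5}, {1, 2, 3}]
```

For realistic graph sizes a dedicated algorithm (e.g. Bron–Kerbosch) would replace the brute-force enumeration.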

In addition to the graph, to fully specify a Markov network one must also define a probability distribution that satisfies the restrictions imposed by the graph G. We restrict the models to positive and faithful distributions unless otherwise mentioned. A distribution is said to be faithful to G if it does not satisfy any additional independencies that are not conveyed by the graph. In this case G can be considered a true representation in the sense that no artificial dependencies are introduced. We use θG to denote the set of parameters describing a distribution of a model with graph G. The parameter space ΘG contains all possible instantiations of θG corresponding to a distribution satisfying G. Finally, we use p(xA | xB) as an abbreviated notation for the conditional probability p(XA = xA | XB = xB), while p(XA | XB) represents the corresponding family of conditional distributions.

The concept of graphical models is based on the assumption of modularity, manifested in the factorization of the joint distribution. In particular, the (positive) joint distribution of a Markov network can be factorized over the maximal cliques in the graph according to

$$p(\mathbf{x}) = \frac{1}{Z} \prod_{C \in \mathcal{C}(G)} \phi(x_C) \qquad (1)$$

where φ(xC) : XC → R+ is a clique factor (or potential) and Z = ∑x∈X ∏C∈C(G) φ(xC) is a normalizing constant known as the partition function. Markov networks are often also parameterized in terms of a log-linear model in which each clique factor is replaced by an exponentiated weighted sum of features according to

$$p(\mathbf{x}) = \frac{1}{Z} \exp\Bigl(\sum_{f_K \in F} w_K f_K(x_K)\Bigr) \qquad (2)$$

where F = {fK} is the set of feature functions and W = {wK} is the corresponding set of weights. A feature function fK : XK → R maps each value xK ∈ XK, for some K ⊆ V, to a numerical value; typically it is in the form of an indicator function that equals 1 if the value matches a specific feature and 0 otherwise. Every Markov network can be encoded as a log-linear model by defining a feature as an indicator function for every assignment of XC for each C ∈ C(G). In this case, the weights in (2) correspond to the natural logarithm of the clique factors in (1). Conversely, a log-linear model over X implicitly induces the graph of a Markov network by imposing an edge {i, j} for every pair of variables appearing in the domain of the same feature function fK(xK), that is, {i, j} ∈ E if {i, j} ⊆ K.


The absence of edges in the graph G = (V, E) of a Markov network encodes statements of conditional independence. The variables XA are conditionally independent of the variables XB given the variables XS if p(XA | XB, XS) = p(XA | XS) holds. We denote this by XA ⊥ XB | XS.

The dependence structure of a Markov network can be characterized by the following Markov properties:

1. Pairwise Markov property: Xi ⊥ Xj | XV\{i,j} for all {i, j} ∉ E.

2. Local Markov property: Xi ⊥ XV\(mb(i)∪{i}) | Xmb(i) for all i ∈ V.

3. Global Markov property: XA ⊥ XB | XS for all disjoint subsets (A, B, S) of V such that S separates A from B.

Although the strengths of the above properties differ in general, they are proven to be equivalent under the current assumption of positivity of the joint distribution (Lauritzen, 1996). While the last property is sufficient in the sense that it captures the entire set of independencies induced by a network, the first two properties are also useful since they allow one to focus on smaller sets of independencies. In particular, our MPL approach for structure learning is based on the local Markov property.
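To make the local Markov property concrete, the following sketch (a toy chain graph 1–2–3 with hypothetical factor tables, so mb(1) = {2}) checks numerically that p(x1 | x2, x3) = p(x1 | x2):

```python
import itertools

# Chain graph 1 - 2 - 3: mb(1) = {2}, so the local Markov property asserts
# X1 ⊥ X3 | X2. We verify it on a toy factorized distribution.
phi_12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
phi_23 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0}
states = list(itertools.product((0, 1), repeat=3))
joint = {x: phi_12[(x[0], x[1])] * phi_23[(x[1], x[2])] for x in states}
Z = sum(joint.values())
p = {x: v / Z for x, v in joint.items()}

def cond(x1, given):
    """p(X1 = x1 | fixed coordinates), given as a list of (index, value) pairs."""
    num = sum(v for x, v in p.items()
              if x[0] == x1 and all(x[i] == s for i, s in given))
    den = sum(v for x, v in p.items() if all(x[i] == s for i, s in given))
    return num / den

for x1, x2, x3 in states:
    full = cond(x1, [(1, x2), (2, x3)])   # conditioned on the full remainder
    local = cond(x1, [(1, x2)])           # conditioned on the Markov blanket only
    assert abs(full - local) < 1e-12
```

The check passes for every joint outcome, as the factorization guarantees: conditioning X1 on anything beyond its Markov blanket changes nothing.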

3. Structure learning

There are two main tasks associated with fitting graphical models to data: parameter estimation and structure learning. In this work, we focus entirely on the latter. By structure learning, we refer to the process of deducing the dependence structure from a set of data assumed to be generated from an unknown Markov network. The structure learning problem can be considered a model class learning problem in the sense that each specific structure alone represents a class of models. In many applications, the structure is a goal in itself in the sense that one merely wants to gain a qualitative insight into the dependence structure of an underlying process. However, given a known structure, the problem of model parameter estimation is simplified. Hence, if the distribution also needs to be explicitly estimated, this can be achieved by using any of the several existing methods conditional on the fixed structure learned by our approach.

3.1 Hypothesis space

When learning the structure of a Markov network, the considered space of model classes, or hypothesis space, can be formulated in terms of different degrees of granularity (Koller and Friedman, 2009). The most fine-grained structure learning methods aim at recovering distinct features in the log-linear parameterization (2); this approach is commonly referred to as feature selection (Pietra et al., 1997; Lee et al., 2006; Höfling and Tibshirani, 2009; Ravikumar et al., 2010; Lowd and Davis, 2014). The advantage of a very detailed structure is that it enables the model to better emulate the properties of a distribution without imposing redundant parameters. One possible drawback of this formulation is the risk of overfitting the structure through long specialized features. Since every pair of variables in a


feature results in an edge, such parameterizations can obscure the connection to the graph structure in the sense that sparsity in the number of features does not in general correspond to sparsity in the number of edges in the graph. In contrast to the very specific feature selection problem, the model space of our approach is formulated directly in terms of the graph structure alone, and the complexity of a model is defined by the size of the maximal cliques in the network.

Although dense graphs are not necessarily unfavorable, there are several situations where a sparse graph is preferred. In particular, when the ultimate goal is knowledge discovery, a dense graph may in the worst case hide the primary layer of the dependency pattern. Another important aspect is the feasibility of performing probabilistic inference in the model. One of the main inference tasks for graphical models is the process of computing the posterior probability of a list of query variables given some observed variables. Inference methods designed for this purpose often exploit the sparsity of the graph structure, and dense graphs inevitably hamper the efficiency of such algorithms.

3.2 Different approaches

Structure learning methods can roughly be divided into two categories: constraint-based and score-based methods. Constraint-based approaches aim at inferring the structure through a series of independence tests based on the Markov properties (Spirtes et al., 2000; Tsamardinos et al., 2003; Bromberg et al., 2009; Anandkumar et al., 2012). This approach is appealing in the sense that the independently performed tests can be combined into a structure through a divide-and-conquer approach. Under the assumptions that the distribution is faithful to the graph structure and that the tests are correct, the true structure can be reconstructed. However, constraint-based approaches can be quite sensitive to failures in individual tests in the sense that a wrong answer from an independence test can mislead the network construction procedure (Koller and Friedman, 2009). In practice, rather large sample sizes may be required for the independence tests to yield correct answers.

The score-based approach formulates structure learning as an optimization problem. One defines an objective or score function according to which the plausibility of each candidate in the model space can be evaluated. Since score functions consider the whole structure at once, they can be less sensitive to individual failures. The disadvantage of the score-based approach is that it usually requires the use of an optimization algorithm. This poses an obvious problem since the search space for d nodes consists of $2^{\binom{d}{2}}$ distinct undirected graphs. Finding the global optimum in such enormous combinatorial spaces becomes intractable already for moderate-sized models. For this reason, a selection of heuristic search algorithms has been developed for the sole purpose of finding high-scoring networks, and many of them have been shown to work well in practice.

The most commonly used objective function is the likelihood of the data given a graph,

$$l(\theta_G; \mathbf{x}) = p(\mathbf{x} \mid \theta_G) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \theta_G),$$

or, in practice, the corresponding log-likelihood function,

$$\ell(\theta_G; \mathbf{x}) = \log l(\theta_G; \mathbf{x}).$$


By maximizing the (log-)likelihood, the model fit to the data is maximized. Although there exists no analytical solution for non-chordal Markov networks, the concavity of the likelihood function enables it to be maximized by numerical optimization for moderate-sized models. The maximum likelihood alone is not an appropriate objective function, since it obtains its maximum value under the complete graph due to noise in the data. One option is to constrain the expressiveness of the graphs in the model space, for example by only considering tree structures (Chow and Liu, 1968). A problem with such a constraint is that it may easily end up limiting the model space to networks not suitable for modeling the data. A more popular approach is to regulate the fit by adding a sparsity-promoting penalty function to the log-likelihood (Akaike, 1974; Schwarz, 1978; Lee et al., 2006).

In contrast to the above methods, where the complexity of a model is penalized explicitly, the Bayesian framework provides an alternative by implicitly preventing overfitting. In the Bayesian approach a graph is scored by its posterior probability given the data,

$$p(G \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid G) \cdot p(G)}{p(\mathbf{x})}. \qquad (3)$$

In practice it suffices to consider the unnormalized posterior probability

$$p(G, \mathbf{x}) = p(\mathbf{x} \mid G) \cdot p(G), \qquad (4)$$

since p(x) is a normalizing constant that can be ignored when comparing graphs. The key factor of (4) is p(x | G), which is the marginal likelihood (ML) of the data given the network structure (also called the evidence). To evaluate the ML, one must integrate the likelihood function over all parameter values satisfying the restrictions imposed by the graph according to

$$p(\mathbf{x} \mid G) = \int_{\Theta_G} l(\theta_G; \mathbf{x}) \cdot f(\theta_G) \, d\theta_G, \qquad (5)$$

where f(θG) is a prior distribution that assigns a weight to each θG ∈ ΘG. Since the ML accounts for the parameter uncertainty through the prior, it implicitly regulates the fit to the data against the complexity of the network.

A drawback of the ML is that it is extremely hard to evaluate for non-chordal Markov networks. For this reason, various penalized maximum likelihood objectives have naturally been preferred. In particular, Schwarz (1978) introduced the Bayesian information criterion (BIC) as an asymptotic approximation of the ML. Still, due to the partition function, even maximum likelihood based techniques become intractable for larger models and require the use of approximate inference. Therefore, in the next section we derive an alternative Bayesian-type score applicable also to very large systems.

Given a scoring function, it is still necessary to specify a search algorithm to find high-scoring networks, since the discrete search space is in general too large for an exhaustive evaluation. To avoid the discrete nature of the model space, Lee et al. (2006) introduced an L1-based penalty to reformulate the structure learning problem as a convex optimization problem over the continuous parameter space. This is an elegant technique that has been further developed (Höfling and Tibshirani, 2009; Ravikumar et al., 2010) for the special class of binary pairwise Markov networks, for which the method is especially well-suited. Each edge in such a network is associated with a single parameter, and forcing an edge


parameter to zero is equivalent to removing the corresponding edge from the network. In fact, in this case the problem formulations of feature selection and graph structure discovery are equivalent due to the one-to-one correspondence between edges and features. For more general networks, sparsity must be enforced on groups of parameters in order to achieve sparsity in the number of edges in the resulting graph (Schmidt and Murphy, 2010). For the direct approach of Lee et al. (2006), the issue of maximizing the likelihood function still remains.

The main difference between constraint- and score-based methods is the level at which they approach the problem (Koller and Friedman, 2009). Score-based methods work on a global level by considering the whole structure at once. This makes them less sensitive to individual failures but has a negative effect on their scalability. The local approach of constraint-based methods allows them to scale up well, but it makes them more sensitive to failures in the individual tests. Although the MPL as such would fall into the score-based category, under the optimization strategy introduced in Section 5 our MPL method is rather a hybrid by which we aim to achieve scalability as well as reliable performance.

4. Marginal pseudo-likelihood

In order to avoid problems associated with the evaluation of the true likelihood function, one can preferably use alternative objectives that possess favorable properties from a computational perspective. In this work we consider the commonly used pseudo-likelihood, originally introduced by Besag (1972), from a Bayesian perspective.

4.1 Derivation

The pseudo-likelihood function approximates the likelihood function by a factorization into conditional likelihood functions according to

$$pl(\theta; \mathbf{x}) = \prod_{k=1}^{n} \prod_{j=1}^{d} p(x_{k,j} \mid x_{k, V \setminus j}, \theta).$$

For a fixed graph structure G, the local Markov property implies that a variable in a Markov network is independent of the remaining variables given its Markov blanket, such that

$$p(X_j \mid X_{V \setminus j}, G) = p(X_j \mid X_{mb(j)}, G)$$

must hold. Consequently, the pseudo-likelihood for a fixed graph is given by

$$pl(\theta_G; \mathbf{x}) = \prod_{k=1}^{n} \prod_{j=1}^{d} p(x_{k,j} \mid x_{k, mb(j)}, \theta_G). \qquad (6)$$

In terms of the log-linear parameterization (2), the pseudo-likelihood approximation offers huge computational savings compared to the true likelihood, since the global normalizing constant in the likelihood function is replaced by d local normalizing constants. By replacing the likelihood with the pseudo-likelihood, methods originally based on the maximum likelihood have been extended to work on larger systems. For example, the pseudo-likelihood


approximation of Höfling and Tibshirani (2009) and the closely related method by Ravikumar et al. (2010) highlight how the original idea of Lee et al. (2006) can be extended to higher dimensions. Ji and Seymour (1996) and Csiszár and Talata (2006) both derived a pseudo-likelihood version of the Bayesian information criterion by Schwarz (1978). An encouraging aspect is that several pseudo-likelihood approaches have been shown to enjoy consistency under the assumption that the data is generated by a distribution in the model class (Ji and Seymour, 1996; Csiszár and Talata, 2006; Ravikumar et al., 2010).
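As a concrete illustration of factorization (6), the following sketch evaluates the maximized log-pseudo-likelihood of a graph by plugging the empirical conditional frequencies into the node-wise factors. The dataset, the chain graph, and the 0-based indexing are all hypothetical, introduced only for this example:

```python
import math
from collections import Counter

# Hypothetical binary dataset over 3 variables, and the Markov blankets
# implied by a chain graph 1 - 2 - 3 (0-based indexing here).
data = [(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 1, 1), (0, 0, 1), (1, 0, 0)]
blankets = {0: (1,), 1: (0, 2), 2: (1,)}

def log_pl(data, blankets):
    """Maximized log-pseudo-likelihood: plug the empirical conditional
    frequencies (count of (x_j, blanket config) over count of blanket config)
    into the node-wise factorization (6)."""
    total = 0.0
    for j, blanket in blankets.items():
        pair = Counter((x[j], tuple(x[i] for i in blanket)) for x in data)
        marg = Counter(tuple(x[i] for i in blanket) for x in data)
        total += sum(c * math.log(c / marg[l]) for (_, l), c in pair.items())
    return total

print(round(log_pl(data, blankets), 3))
```

Each node contributes only a local conditional table, so no global partition function is ever computed; this is exactly the saving the paragraph above describes.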

From a Bayesian perspective, the structural form of (6) offers an interesting possibility. In fact, under certain assumptions it enables an analytical evaluation of the integral

$$p(\mathbf{x} \mid G) = \int_{\Theta_G} pl(\theta_G; \mathbf{x}) \cdot f(\theta_G) \, d\theta_G, \qquad (7)$$

which is here referred to as the marginal pseudo-likelihood (MPL). We parameterize the conditional probabilities associated with the pseudo-likelihood function of a graph by

$$\theta_{ijl} = p(X_j = x_j^{(i)} \mid X_{mb(j)} = x_{mb(j)}^{(l)}), \quad \text{where } \theta_{ijl} > 0 \text{ and } \sum_{i=1}^{r_j} \theta_{ijl} = 1. \qquad (8)$$

The indices i = 1, . . . , rj and l = 1, . . . , qj, where rj = |Xj| and qj = |Xmb(j)| = ∏i∈mb(j) ri, represent the configurations of the variable and its respective Markov blanket. The above set of graph-specific parameters is by no means a compact representation of a Markov network; in fact, it is a quite crude over-parameterization. Rather than actual model parameters, they should be considered temporary nuisance parameters, used solely for computational convenience in solving the structure learning problem. Similarly as above, we denote the counts of the corresponding configurations in x by

$$n_{ijl} = \sum_{k=1}^{n} I\left[(x_{k,j}, x_{k,mb(j)}) = (x_j^{(i)}, x_{mb(j)}^{(l)})\right] \quad \text{and} \quad n_{jl} = \sum_{i=1}^{r_j} n_{ijl}.$$

The pseudo-likelihood function can now be expressed in terms of our above notation by

$$pl(\theta_G; \mathbf{x}) = \prod_{j=1}^{d} \prod_{l=1}^{q_j} \prod_{i=1}^{r_j} \theta_{ijl}^{n_{ijl}}. \qquad (9)$$

Under the current parameterization it is easy to make out certain structural similarities between the above pseudo-likelihood function and the likelihood function of a Bayesian network under a standard conditional parameterization (see e.g. Koller and Friedman, 2009). In a Bayesian network the l-index would be associated with configurations of parent sets instead of configurations of Markov blankets. The parent sets must be such that they satisfy the acyclicity constraint imposed by a DAG, whereas the Markov blankets must be mutually consistent. Under certain assumptions listed by Heckerman et al. (1995), the ML of a Bayesian network has a nice analytical expression that factorizes variable-wise, making it attractive for the task of structure learning. Using a corresponding set of assumptions, we would like to achieve something similar for the MPL. We consider the parameters defined in (8) in terms of the sets

$$\theta_{jl} = \cup_{i=1}^{r_j} \{\theta_{ijl}\}, \quad \theta_j = \cup_{l=1}^{q_j} \{\theta_{jl}\}, \quad \text{and} \quad \theta_G = \cup_{j=1}^{d} \{\theta_j\}.$$


One of the fundamental assumptions behind the ML for Bayesian networks is an assumption regarding global and local parameter independence (Assumption 2, Heckerman et al., 1995). This assumption ultimately justifies a factorization of the parameter prior. We need to factorize the parameter prior in (7) in a corresponding fashion according to

$$f(\theta_G) = \prod_{j=1}^{d} f(\theta_j) = \prod_{j=1}^{d} \prod_{l=1}^{q_j} f(\theta_{jl}),$$

implying that θj ⊥ θj′ for j ≠ j′ (global parameter independence) and θjl ⊥ θjl′ for l ≠ l′ (local parameter independence). Whereas parameter independence can be a quite reasonable assumption in a Bayesian network parameterization, in our case it directly violates the properties of a Markov network. The conditional distributions, represented by our parameters, are connected to each other in the sense that they must satisfy certain algebraic relations for them to be consistent with a Markov network. We do not elaborate on these relations, but we note that they directly translate to restrictions between the corresponding parameter sets. At this point, the parameter independence assumption is mainly justified by the induced computational savings. In Section 4.4 we discuss the implications of the assumption in more detail from another perspective. Another fundamental assumption, necessary for our derivation, is to restrict each parameter set θjl to follow a Dirichlet distribution

$$\theta_{jl} \sim \text{Dirichlet}(\alpha_{1jl}, \ldots, \alpha_{r_j jl}),$$

where $\alpha_{1jl}, \ldots, \alpha_{r_j jl}$ are hyperparameters, for which we denote $\alpha_{jl} = \sum_{i=1}^{r_j} \alpha_{ijl}$.

Under the established assumptions, the integral in (7) can be reordered into a product of local integrals. Since the Dirichlet distribution is a conjugate prior of the multinomial distribution, each local integral is easily solved using standard Bayesian calculations:

$$\begin{aligned}
p(\mathbf{x} \mid G) &= \int_{\Theta_G} pl(\theta_G; \mathbf{x}) \cdot f(\theta_G) \, d\theta_G \\
&= \prod_{j=1}^{d} \prod_{l=1}^{q_j} \int_{\Theta_{jl}} \prod_{i=1}^{r_j} \theta_{ijl}^{n_{ijl}} \cdot f(\theta_{jl}) \, d\theta_{jl} \\
&= \prod_{j=1}^{d} \prod_{l=1}^{q_j} \frac{\Gamma(\alpha_{jl})}{\Gamma(n_{jl} + \alpha_{jl})} \prod_{i=1}^{r_j} \frac{\Gamma(n_{ijl} + \alpha_{ijl})}{\Gamma(\alpha_{ijl})}.
\end{aligned}$$

In practice, the logarithm of the formula is used since it is computationally more manageable.

To evaluate the above expression it is necessary to define the hyperparameters. We want to specify a symmetric prior, since we assume that there is no prior knowledge favoring one parameter in {θ1jl, . . . , θrjjl} over any of the others. We achieve this by modifying a prior originally defined for Bayesian networks by Buntine (1991), such that the hyperparameters are determined according to

$$\alpha_{ijl} = \frac{N}{|\mathcal{X}_j| \cdot |\mathcal{X}_{mb(j)}|} = \frac{N}{r_j \cdot q_j},$$

where N is the equivalent sample size, adjusting the strength of the prior.


4.2 Properties

The MPL possesses several advantageous properties. The parameter prior offers a natural regularization that prevents overfitting. Methods that explicitly penalize the degree of regularization are sensitive to the choice of some tuning parameter, which usually has to be determined empirically. In contrast, the MPL requires specification of the hyperparameters in the Dirichlet distribution. In our formulation this boils down to setting a value on the equivalent sample size N. Silander et al. (2007) show that the maximum a posteriori (MAP) Bayesian network structure optimization problem is indeed sensitive to the choice of value of the equivalent sample size parameter. Due to the similarity between the MPL and the BDeu score considered in Silander et al. (2007), one would expect the MPL to display a similar behavior. In this work we primarily focus on the setting where N = 1, and simulations are used to demonstrate the adequacy of this choice.

An important property preferably satisfied by a scoring function is consistency. By consistency we mean that, under the assumption that the generating distribution is faithful to a Markov network structure, the score will favor the true graph when the sample size tends to infinity. The following theorem establishes that the MPL is indeed a consistent scoring function for Markov networks.

Theorem 1 Let G∗ ∈ G be the true graph structure of a Markov network over (X1, . . . , Xd), with the corresponding Markov blankets mb(G∗) = {mb∗(1), . . . , mb∗(d)}. Let θG∗ ∈ ΘG∗ define the corresponding joint distribution, which is faithful to G∗ and from which a sample x of size n is obtained. The local MPL estimator

m̂b(j) = arg max_{mb(j) ⊆ V\j} p(xj | xmb(j))

is consistent in the sense that m̂b(j) = mb∗(j) eventually almost surely as n → ∞ for j = 1, . . . , d. Consequently, the global MPL estimator

Ĝ = arg max_{G ∈ G} p(x | G)

is consistent in the sense that Ĝ = G∗ eventually almost surely as n → ∞.

Proof See Appendix A.

Although consistency alone is a reassuring theoretical property, it is also important to recognize its limitations. In particular, the assumptions under which the result is obtained rarely hold in practice. Therefore it is important to investigate how well the MPL performs in practice. In Section 6 we perform a series of large-scale numerical simulations to investigate how well the MPL performs in combination with the search algorithms introduced in Section 5. Before that, we conduct a small-scale simulation study to gain insight into the behavior of the MPL, both when it comes to choosing the optimal graph and when ranking the most plausible graphs.

In the first part of the experiment we restrict the model space to chordal graphs. By doing so we can calculate the ML of each considered graph and compare it to the MPL. The ML of a chordal graph is usually calculated by factorizing the likelihood according to the maximal cliques and separators of the graph (see e.g. Corander et al., 2008); however, there


Figure 1: Graphs (a), (b), and (c) used in the simulations in Section 4.2.

Figure 2: Comparison of the MPL and ML graph rankings for different sample sizes (plotted against the sample size in thousands, from 0.25 to 8). In (a) the similarity of the rankings (intersection rate) is compared for the top ranked graph and the 100 top ranked graphs. In (b) the average FP and FN error rates are compared for the 100 top ranked graphs.

is an alternative approach. For any chordal graph, there exists a collection of Markov equivalent DAGs, each encoding a dependence structure equivalent to that of the undirected graph. We can therefore evaluate the ML of an undirected chordal graph by the BDeu metric (Buntine, 1991) of one of the equivalent DAGs. Since the BDeu metric assigns the same score to all Markov equivalent DAGs, it does not depend on which DAG is picked for evaluation. The main advantage of the DAG-based approach is that the structural form of the ML is similar to that of the MPL, since the factorization of the likelihood and the choice of hyperparameters are done analogously. Consequently, we can apply both methods under fairly similar conditions, such that the different behaviors are primarily due to the different fundamental characteristics of the score functions.

First, we used the graph in Figure 1a as the base for the generating model. The number of possible graphs over six nodes is 2^(6 choose 2) = 32768, and 18154 of these are chordal. To generate a distribution according to a graph, we assigned values to the maximal clique factors in (1) by independently sampling from a uniform distribution over (0, 1). We generated ten distributions, and for each distribution we generated ten samples. The final results are thus averaged over one hundred samples. We performed an exhaustive evaluation of the chordal graphs and listed the hundred highest ranked graphs for the respective score.
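The data-generating scheme described above can be sketched as follows (a minimal Python sketch with hypothetical helper names, not the authors' code): each maximal clique factor entry is drawn independently from Uniform(0, 1), and the product over cliques is normalized into a joint distribution.

```python
import itertools
import random

def random_clique_distribution(cliques, d, seed=0):
    """Draw each maximal clique factor entry independently from Uniform(0, 1)
    and normalize the product over all 2^d binary outcomes (a sketch of the
    generating scheme in the text, assuming binary variables)."""
    rng = random.Random(seed)
    # One factor table per clique: a value for every joint state of the clique.
    factors = [
        {state: rng.random() for state in itertools.product([0, 1], repeat=len(c))}
        for c in cliques
    ]
    probs = {}
    for x in itertools.product([0, 1], repeat=d):
        p = 1.0
        for c, f in zip(cliques, factors):
            p *= f[tuple(x[v] for v in c)]
        probs[x] = p
    z = sum(probs.values())  # normalizing constant
    return {x: p / z for x, p in probs.items()}

# Example: a chain 0-1-2 with maximal cliques {0, 1} and {1, 2}.
dist = random_clique_distribution([(0, 1), (1, 2)], d=3)
```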


Figure 3: Comparison of the MPL and ML top ranked graph for the graphs in Figure 1 (success rate plotted against the sample size in thousands, from 0.25 to 8, panels (a), (b), and (c)). The success rate refers to the rate at which the correct graph is identified, except in (c), where the ML success rate refers to the rate at which the top ranked graph contains all the true edges.

To begin with, we consider the similarity of the rankings. Figure 2a illustrates the rate at which the ML and MPL scores agree on the top graph, as well as the percentage of graphs included in both of the top 100 rankings. With an increased sample size, the two scores show an increased conformity in how they rank the graphs. To further investigate the differences, Figure 2b illustrates the average rate of falsely added edges (False Positives, FPs) and falsely omitted edges (False Negatives, FNs) among the 100 top ranked graphs. Although the overall error rates are quite similar, there is a key difference between the MPL and the ML which is nicely illustrated by the figure. Since the MPL in a sense over-determines the dependence structure, it is more conservative in terms of adding edges. This phenomenon is clearly reflected by the MPL consistently having a lower false positive rate and a higher false negative rate than the ML. The difference becomes less distinct for larger sample sizes, which is in concordance with Figure 2a.

One drawback of the above mentioned characteristic is that it makes the MPL less sample efficient than the ML in terms of identifying the correct graph. In Figure 3a we have plotted the rate at which the true graph was ranked as optimal by the respective score. Although the curves eventually converge for large enough sample sizes, the ML outperforms the MPL for all the considered sample sizes. This weakness is exaggerated for graphs containing large Markov blankets compared to the maximal clique sizes. As an extreme example of this, consider the star graph in Figure 1b, in which one hub node is connected to all the other nodes. In Figure 3b we see that the ML has a clear advantage over the MPL for this type of graph for limited sample sizes. Still, the curves will eventually converge, as confirmed by Theorem 1.

As expected, the ML is to be preferred over the MPL when it comes to picking the optimal chordal graph. However, we conclude this section by giving an example that illustrates the importance of going beyond chordal graphs, which for larger systems only make up a small fraction of the graph space. In the last experiment we also consider non-chordal graphs. In particular, we based our generating model on the non-chordal graph in Figure 1c. Since the ML can only be evaluated for chordal graphs, it cannot discover the true graph. Therefore, we change the criterion for success for the ML by looking at the true


positives. If the top ranked graph contains all of the true edges, we consider it to be correct. In this setup the MPL clearly outperforms the ML, as seen in Figure 3c. This is a natural consequence considering that the ML needs to add three spurious edges in order to form a chordal graph that contains all the true edges. As in the previous case, the curves will eventually converge for large enough sample sizes. Still, a model based on a graph with spurious edges contains redundant parameters which inevitably destabilize a subsequent parameter estimation process.

4.3 Computational complexity

Whereas the computational complexity of the ML is determined by the size of the maximal cliques, the computational complexity of the MPL is determined by the size of the Markov blankets. The (log-)MPL is calculated by the sum

log p(x | G) = Σ_{j=1}^{d} log p(xj | xmb(j), G)
             = Σ_{j=1}^{d} Σ_{l=1}^{qj} log p(xj | x(l)mb(j), G)
             = Σ_{j=1}^{d} Σ_{l=1}^{qj} [ log Γ(αjl) − log Γ(njl + αjl) + Σ_{i=1}^{rj} [ log Γ(nijl + αijl) − log Γ(αijl) ] ],

which consists of Σ_{j=1}^{d} qj(2 + 2rj) terms. Since rj = |Xj| does not depend on the graph, the number of terms associated with a node j is mainly determined by the number of Markov blanket configurations, qj, which grows exponentially with the size of the Markov blanket. Thereby, the complexity of calculating the MPL of a graph is to a high extent determined by the maximal Markov blanket size. Still, it is important to note that the partial sum

log p(xj | x(l)mb(j), G) = log Γ(αjl) − log Γ(njl + αjl) + Σ_{i=1}^{rj} [ log Γ(nijl + αijl) − log Γ(αijl) ]

does not contribute to the MPL if the corresponding Markov blanket configuration is not represented in the data. Consequently, the maximum number of terms evaluated by a non-naive implementation is Σ_{j=1}^{d} min(qj, n)(2 + 2rj), where n is the number of observations in the dataset. Furthermore, for a large Markov blanket of node j, the number of distinct configurations present in a dataset is, in practice, usually far less than min(qj, n).
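The local (log-)MPL term for a single node can be sketched as follows (an assumed implementation with hypothetical names, not the authors' code; it uses the hyperparameter choice αijl = N/(rj · qj) from Section 4.1 and, as discussed above, iterates only over Markov blanket configurations that actually occur in the data):

```python
from collections import Counter
from math import lgamma

def local_log_mpl(data, j, mb, r, N=1.0):
    """Local log-MPL of node j with Markov blanket mb (a tuple of node ids).
    `data` is a list of outcome tuples and `r` maps each node to its outcome
    count |X_j|.  Hyperparameters: alpha_ijl = N / (r_j * q_j)."""
    r_j = r[j]
    q_j = 1
    for v in mb:
        q_j *= r[v]                  # q_j = number of blanket configurations
    alpha_ijl = N / (r_j * q_j)
    alpha_jl = r_j * alpha_ijl       # = N / q_j, the row total of the prior
    n_ijl, n_jl = Counter(), Counter()
    for x in data:
        cfg = tuple(x[v] for v in mb)
        n_ijl[(cfg, x[j])] += 1
        n_jl[cfg] += 1
    score = 0.0
    for cfg, njl in n_jl.items():    # only observed configurations contribute
        score += lgamma(alpha_jl) - lgamma(njl + alpha_jl)
        for i in range(r_j):
            score += lgamma(n_ijl[(cfg, i)] + alpha_ijl) - lgamma(alpha_ijl)
    return score

# Four binary observations over two variables; score node 0 with and without
# node 1 in its blanket.  For this "independent" data the empty blanket wins.
data = [(0, 0), (0, 1), (1, 0), (1, 1)]
s_empty = local_log_mpl(data, 0, (), {0: 2, 1: 2})
s_with_1 = local_log_mpl(data, 0, (1,), {0: 2, 1: 2})
```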

If we look at the MPL from an optimization perspective, it is easy to see that its variable-wise decomposition makes it a convenient candidate for search algorithms based on local changes. To compare the plausibility of two graphs, G1 = (V, E1) and G2 = (V, E2), we can calculate the ratio of their MPLs,

K(G1, G2) = p(x | G1) / p(x | G2),

or equivalently the log-ratio,

logK(G1, G2) = log p(x | G1) − log p(x | G2),


which is basically the pseudo-version of the log-Bayes factor, that is, the log-Bayes pseudo-factor. Assume there is a single edge difference,

{E1 ∪ E2} \ {E1 ∩ E2} = {i, j},

between the graphs. This implies that mb(i) and mb(j) are the only Markov blankets that differ in the two graphs. Consequently, the log-Bayes pseudo-factor is simply evaluated by

logK(G1, G2) = log p(xi | xmb(i), G1) + log p(xj | xmb(j), G1) − log p(xi | xmb(i), G2) − log p(xj | xmb(j), G2)

since the rest of the terms cancel each other out.
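This cancellation can be sketched in a few lines (a hypothetical helper, not the authors' code): when two graphs differ in a single edge {i, j}, only the local scores of nodes i and j need to be compared.

```python
def log_pseudo_bayes_factor(local_scores_1, local_scores_2, i, j):
    """Single-edge log-Bayes pseudo-factor: `local_scores_k[v]` is assumed to
    hold log p(x_v | x_mb(v), G_k).  All terms for nodes other than i and j
    are identical in the two graphs and cancel out."""
    return (local_scores_1[i] + local_scores_1[j]
            - local_scores_2[i] - local_scores_2[j])

# Hypothetical numbers: G1 improves both affected local scores, so lk > 0
# favors G1 over G2.
lk = log_pseudo_bayes_factor({0: -10.1, 2: -9.5}, {0: -10.6, 2: -9.9}, 0, 2)
```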

4.4 Related work

In addition to the asymptotically equivalent PIC (see proof of Theorem 1) by Csiszar and Talata (2006), the MPL is very closely related to a class of models known as dependency networks (Heckerman et al., 2001). In fact, the general concept of using pseudo-likelihood for Markov networks has an alternative interpretation in terms of this class of models.

The distribution of a dependency network is, like that of a Bayesian network, represented by variable-wise conditional distributions in a pseudo-likelihood type manner. The directed graph of a dependency network may thus, in contrast to a Bayesian network, contain cycles. A dependency network does not in general represent a consistent distribution, in the sense that the local distributions cannot be inferred from a joint distribution over all the variables. Consequently, one must rely on Gibbs sampling to perform inference in such models. In contrast to dependency networks, Markov networks always represent a consistent distribution. Still, any Markov network with the undirected graph G can be represented by a consistent dependency network with a symmetric directed graph containing the same structural adjacencies as G (Theorems 1 & 4, Heckerman et al., 2001).

The obvious advantage of dependency networks in terms of structure learning is that the local structure of each node can be learned independently of the rest of the network. The local structures can be inferred using a variety of regression-based techniques. In particular, Heckerman et al. (2001) model the local structures using probabilistic decision trees in conjunction with a Bayesian score originally derived by Friedman and Goldszmidt (1996) for the purpose of including context-specific independence in the learning process of Bayesian networks. Lowd and Davis (2014) converted this approach into a feature selection method by transforming the trees into features of a Markov network. The authors mention the risk of overfitting by generating long specialized features. Due to the feature-edge relation described in Section 2, the implication of such overfitting would be emphasized in terms of graph structure discovery.

The logistic regression approach of Ravikumar et al. (2010) is another method that has a natural interpretation in terms of the dependency network framework. The solutions of the separate regression problems represent a structure of a general dependency network and must be made symmetric in order to be consistent with a structure of a Markov network. In contrast, the problem formulation in the closely related approach by Hofling and Tibshirani (2009) ensures that the network is kept symmetric and even consistent during the optimization process.


Our MPL can be interpreted as the ML of a symmetric dependency network under the standard conditional parameterization defined in (8). Without the parameter independence assumption, the MPL would correspond to the ML of a consistent dependency network under our parameterization. This would make up a very natural option for the objective function if it could be evaluated efficiently. Still, the goal of the MPL is to learn a graph structure rather than a specific model. If merely considering the dependence structure, each graph of a Markov network has an equivalent counterpart in terms of a symmetric dependency network structure. Therefore it makes sense to enforce consistency among the Markov blankets, that is, to only consider symmetric dependency networks. In contrast to a general dependency network, the MPL evaluates the inclusion of an edge {i, j} by comparing the potential benefit of adding j to mb(i) against the potential loss of adding i to mb(j).

5. MPL optimization

The straightforward global MPL-based optimization problem is formulated by

arg max_{G ∈ G} log p(x | G) + log p(G) (10)

where p(G) is the graph prior distribution, which can account for any prior belief regarding, for example, the degree of sparsity. To maintain the useful structure of the MPL, the prior must follow a similar decomposition. This is achieved by defining the prior in terms of mutually independent prior beliefs on the individual Markov blankets. In Section 6.3 we give an example of such a prior; however, in the remainder of this section we assume a uniform prior, and the term p(G) is therefore omitted. Still, the methods presented in this section are also directly applicable under any prior that follows the same decomposition as the MPL.

Due to the rapidly growing size of the discrete optimization space, the global optimization problem (10) is clearly intractable already for moderate-sized systems. Hence, we need to construct an algorithm that finds approximate solutions of satisfactory quality in a reasonable time. To ensure applicability in a genuinely high-dimensional setting, the algorithm is designed to exploit the structural decomposition of the MPL by breaking down the problem into two steps instead of directly approaching the global problem (10).

Since each graph G is uniquely specified by its collection of Markov blankets mb(G) = {mb(j)}_{j=1}^{d}, we can reformulate (10) as

arg max_{mb(G) ∈ ×_{j∈V} P(V\j)} Σ_{j=1}^{d} log p(xj | xmb(j))

subject to i ∈ mb(j) ⇒ j ∈ mb(i) for all i, j ∈ V    (11)

where P(V \ j) is the power set of V \ j, representing all possible Markov blankets of node j. From (11) it is easy to see that our problem is basically made up of d dependent subproblems that are connected through the consistency constraint. By omitting the constraint we remove the dependence among the subproblems and obtain the relaxed problem

arg max_{mb(G) ∈ ×_{j∈V} P(V\j)} Σ_{j=1}^{d} log p(xj | xmb(j)). (12)


Since the d subproblems are now independent of each other, we can finally reformulate (12) by breaking it down into a collection of stand-alone Markov blanket discovery problems,

arg max_{mb(j) ⊆ V\j} log p(xj | xmb(j)) for j = 1, . . . , d, (13)

which can be solved completely in parallel, considerably improving real-time efficiency. Since each individual subproblem is in itself still intractable, in Section 5.1 we introduce an efficient deterministic search algorithm that gives an approximate solution.

The relaxation step shifts the focus from the strictly score-based view in (10) towards a constraint- or regression-based view, or in terms of dependency networks, from symmetric to general. It is worth noticing that the consistency result established in Theorem 1 still holds under the relaxed problem formulation.

By solving the relaxed problem we usually obtain a solution inconsistent with a Markov network structure. We could simply post-process the solution using either a ∧ (and) criterion,

E∧ = {{i, j} ∈ {V × V } : i ∈ mb(j) ∧ j ∈ mb(i)}

or a ∨ (or) criterion,

E∨ = {{i, j} ∈ {V × V } : i ∈ mb(j) ∨ j ∈ mb(i)}.
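The two criteria can be sketched as follows (a hypothetical helper, not the authors' code): given the possibly inconsistent Markov blankets from the relaxed, node-wise problem, the AND rule keeps an edge only when both endpoints list each other, while the OR rule keeps it when either does.

```python
def combine_blankets(mb, rule="or"):
    """Combine node-wise Markov blankets into an edge set.  `mb` maps each
    node to its estimated blanket; every blanket member must also be a key."""
    edges = set()
    for i, blanket in mb.items():
        for j in blanket:
            if rule == "or" or i in mb[j]:  # AND additionally requires symmetry
                edges.add(frozenset((i, j)))
    return edges

# Inconsistent relaxed solution: node 0 lists 1, but node 1 does not list 0.
mb = {0: {1}, 1: {2}, 2: {1}}
e_and = combine_blankets(mb, rule="and")
e_or = combine_blankets(mb, rule="or")
```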

These criteria are quite standard among constraint- or regression-based methods; however, neither of them is quite satisfactory from an MPL optimization perspective. Therefore we propose a second optimization phase whose goal is to combine the inconsistent Markov blankets from the first phase into a coherent structure which is MPL-optimal on a reduced model space determined by the relaxed solution.

More specifically, the edge set E∨ is considered to be the result of a prescan that identifies eligible edges. The original problem (10) is then solved with respect to the reduced model space G∨ = {G ∈ G : E ⊆ E∨}, that is,

arg max_{G ∈ G∨} log p(x | G).

The reduced model space G∨ is in general considerably smaller than G. In Section 5.2 we discuss a method that under certain circumstances can solve the above problem exactly. In Section 5.3 we describe a fast deterministic approximate algorithm that can be applied also in situations where the exact method is infeasible.

5.1 Local Markov blanket discovery using greedy hill climbing

To solve the relaxed problem, we basically need a Markov blanket discovery algorithm whose goal is to optimize the local MPL for each node independently of the solutions of the other nodes. For this we use an approximate deterministic hill climbing procedure similar to the interIAMB algorithm by Tsamardinos et al. (2003).

An outline of the algorithm is presented in Algorithm 1 and the general idea is as follows. The algorithm is based on two basic operations by which members are added to or deleted from the Markov blanket. The method is initiated with the empty Markov blanket, and all other nodes are considered potential Markov blanket members. At each


Algorithm 1 Procedure for optimizing the local MPL of a node using greedy hill climbing.

Procedure Markov-Blanket-Hill-Climb(
    j,  // Current node
    x   // Complete dataset
)
 1: mb(j), m̂b(j) ← ∅
 2: while mb(j) has changed
 3:     C ← V \ {mb(j) ∪ j}
 4:     m̂b(j) ← mb(j)
 5:     for each i ∈ C
 6:         if log p(xj | xmb(j)∪i) > log p(xj | xmb(j))
 7:             mb(j) ← mb(j) ∪ i
 8:         end
 9:     end
10:     while mb(j) has changed & |mb(j)| > 2
11:         m̂b(j) ← mb(j)
12:         for each i ∈ mb(j)
13:             if log p(xj | xmb(j)\i) > log p(xj | xmb(j))
14:                 mb(j) ← mb(j) \ i
15:             end
16:         end
17:     end
18: end
19: return mb(j)

iteration it adds to the Markov blanket the node that induces the greatest improvement to the local MPL and updates the set of potential members accordingly. When the size of the Markov blanket grows larger than two, the algorithm interleaves each successful addition step with a deletion phase. In the deletion phase, the algorithm removes the node whose removal induces the largest improvement to the score. The deletion step is repeated until removal of a node no longer increases the score or the size of the Markov blanket is smaller than three. When the addition phase is iterated through without a successful addition, a local maximum has been reached, the algorithm terminates, and it returns the identified Markov blanket.
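The procedure can be sketched in runnable form as follows (a minimal Python sketch of Algorithm 1, not the authors' implementation; the `score` callback is a hypothetical stand-in for the local log-MPL of a node given a candidate blanket):

```python
def markov_blanket_hill_climb(j, nodes, score):
    """Greedy Markov blanket discovery for node j: repeatedly add the single
    best-improving candidate and, once the blanket exceeds two members,
    interleave a deletion phase that removes best-improving members."""
    mb = frozenset()
    changed = True
    while changed:
        changed = False
        # Addition phase: pick the candidate with the largest improvement.
        best, best_score = None, score(j, mb)
        for i in nodes:
            if i == j or i in mb:
                continue
            s = score(j, mb | {i})
            if s > best_score:
                best, best_score = i, s
        if best is not None:
            mb = mb | {best}
            changed = True
            # Interleaved deletion phase while the blanket has > 2 members.
            while len(mb) > 2:
                worst, cur = None, score(j, mb)
                for i in mb:
                    s = score(j, mb - {i})
                    if s > cur:
                        worst, cur = i, s
                if worst is None:
                    break
                mb = mb - {worst}
    return mb

# Toy score (hypothetical) peaked at the target blanket {1, 2} for node 0.
target = {1, 2}
def toy_score(j, mb):
    return -len(set(mb) ^ target)

mb = markov_blanket_hill_climb(0, range(5), toy_score)
```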

To examine the computational complexity of the algorithm, we consider the cost of performing a complete iteration, where mb denotes the current Markov blanket. In the addition phase, each of the d − 1 − |mb| candidate members needs to be evaluated by calculating the local MPL for Markov blankets of size |mb| + 1. Say that a node is added to the Markov blanket, which is now of size |mb| + 1. In the first iteration of a potential deletion phase, the removal of each of the |mb| + 1 Markov blanket members is evaluated by calculating the local MPL for Markov blankets of size |mb|. In practice, the most recently added node can be skipped in the first iteration. In a potential successive deletion step, |mb| Markov blankets of size |mb| − 1 must be evaluated, and so on.


The recurring deletion phase of the algorithm attempts to keep the size of the Markov blanket as small as possible during the search in order to improve both sample and time efficiency. Still, the computational cost of the method is strongly dependent on the size of the identified Markov blanket, since a large Markov blanket naturally requires many iterations. Furthermore, an iteration becomes more expensive as the current Markov blanket grows larger, since the cost of evaluating the local MPL is highly dependent on the size of the Markov blanket (see Section 4.3).

5.2 Global graph discovery using pseudo-boolean optimization

There has recently been considerable interest in the use of computational logic algorithms for structure learning of both Bayesian and Markov networks (Cussens, 2008; Bartlett and Cussens, 2013; Corander et al., 2013; Berg et al., 2014; Parviainen et al., 2014). In this section we describe how

arg max_{G ∈ G∨} log p(x | G) (14)

can be cast as a pseudo-boolean optimization (PBO) problem (Boros and Hammer, 2002), which can be solved by existing mixed integer programming solvers such as the SCIP solver (Berthold et al., 2009; Achterberg, 2009).

A PBO problem consists of an objective function and a set of (in)equality constraints over boolean variables. To formulate our optimization problem as a PBO problem, we need to introduce two types of propositional variables:

1. Edge variables: For each edge {i, j} ∈ E∨, a variable x{i,j} is introduced. If the value of x{i,j} is 1 (true) in a solution, the associated edge is included in the graph. If the value of x{i,j} is 0 (false), the associated edge is not included in the graph.

2. Markov blanket variables: Each node j is associated with a set of candidate Markov blankets, defined as all subsets of mb∨(j), which is the Markov blanket of node j in G∨. Let dj be the number of nodes in mb∨(j). The Markov blanket candidates are denoted by mbk(j) for k = 1, . . . , mj where mj = 2^dj. For each candidate Markov blanket, a variable xmbk(j) is introduced. If the value of xmbk(j) is 1 (true) in a solution, the Markov blanket of node j is equal to mbk(j) in the graph. If the value of xmbk(j) is 0 (false), the Markov blanket of node j is not equal to the k:th candidate.

Each complete instantiation of the edge variables will correspond to a distinct graph in the considered graph space. The purpose of the edge variables is to ensure that the combined Markov blankets correspond to a coherent graph structure. Consequently, we need to connect the edge variables to the blanket variables in such a way that the value of a blanket variable is true if and only if all edge variables associated with edges induced by the Markov blanket are true and the remaining edge variables are false. More formally, we need to introduce a constraint corresponding to the propositional formula

xmbk(j) ↔ (x{v1,j} ∧ x{v2,j} ∧ · · · ∧ x{vl,j} ∧ ¬x{vl+1,j} ∧ ¬x{vl+2,j} ∧ · · · ∧ ¬x{vdj ,j}) (15)

where

{v1, . . . , vl} = mbk(j) and {vl+1, . . . , vdj} = mb∨(j) \mbk(j).


If we now consider the variables taking on values 0 and 1 rather than false and true, the above formula can be expressed as the pseudo-boolean equality constraint

xmbk(j) − ( ∏_{i=1}^{l} x{vi,j} ) ( ∏_{i=l+1}^{dj} x̄{vi,j} ) = 0, (16)

where x̄{vi,j} = 1 − x{vi,j}. For any assignment of the variables, it is clear that the value of formula (15) is true if and only if constraint (16) is satisfied. For each node and Markov blanket candidate, we add constraint (16) to the PBO problem. This will ensure that any feasible instantiation of the introduced variables must coincide with a graph structure of a Markov network.

The constraints expressed by equation (16) are sufficient on their own; however, to facilitate the optimization process we also introduce the following constraint for each node:

Σ_{k=1}^{mj} xmbk(j) = 1 (17)

By including constraint (17), we explicitly require that exactly one candidate is selected for each node. Even though this is already implied by constraints (16), the implication is not straightforward, since the candidate variables are related to each other via the edge variables. Consequently, to realize that any two given candidate variables of the same node cannot be true simultaneously, some of the edge variables need to be assigned. Therefore, including constraint (17) helps the solver to tighten the bounds of the feasible region (and objective function) early on.

Finally, we need to express our objective function. For this we introduce the Markov blanket candidate weights

w(j, k) = −⌊K · log p(xj | xmbk(j))⌋

where K is a large positive integer. As required in a PBO objective function, the floor function transforms the weights into integers. The objective function to be minimized can now be expressed by

Σ_{j=1}^{d} Σ_{k=1}^{mj} w(j, k) · xmbk(j). (18)
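The candidate enumeration and weight construction can be sketched as follows (a hypothetical helper, not the authors' code; `score(j, mb)` stands in for the local log-MPL, and the scaling constant K here is an arbitrary small value chosen for illustration):

```python
import math
from itertools import combinations

def pbo_candidates(mb_or, score, K=1000):
    """Enumerate all 2^{d_j} candidate Markov blankets mb_k(j) of each
    prescan blanket mb_or[j] and attach the integer weight
    w(j, k) = -floor(K * log p(x_j | x_mb_k(j)))."""
    candidates = {}
    for j, blanket in mb_or.items():
        cands = []
        for size in range(len(blanket) + 1):
            for mbk in combinations(sorted(blanket), size):
                w = -math.floor(K * score(j, mbk))
                cands.append((mbk, w))
        candidates[j] = cands
    return candidates

# Toy prescan: mb_or(0) = {1, 2}; a hypothetical score that penalizes blanket
# size, so 2^2 = 4 candidate blankets are generated for node 0.
cands = pbo_candidates({0: {1, 2}}, lambda j, mb: -1.0 - 0.5 * len(mb))
```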

With a large enough K, the solution to the PBO problem

arg min_{mb(j) ⊆ mb∨(j), j = 1, . . . , d} Σ_{j=1}^{d} Σ_{k=1}^{mj} w(j, k) · xmbk(j), (19)

subject to constraint (16) (and (17)), is equivalent to the solution to the optimization problem

arg min_{mb(j) ⊆ mb∨(j), j = 1, . . . , d} −K Σ_{j=1}^{d} log p(xj | xmb(j)), (20)


Algorithm 2 Procedure for optimizing the MPL using greedy hill climbing.

Procedure Graph-Hill-Climb(
    G∨, // The considered graph space
    x   // Complete dataset
)
 1: G, Ĝ ← ∅
 2: while G has changed
 3:     Ĝ ← G
 4:     for each G′ ∈ NG∨(G)
 5:         if p(x | G′) > p(x | G)
 6:             G ← G′
 7:         end
 8:     end
 9: end
10: return G

subject to the constraint in (11), which in turn is equivalent to the original optimization problem (14).

The obvious advantage of this approach is that we are guaranteed to obtain the exact (or global) solution to the problem. However, the method can only be applied in certain situations, since the number of variables and constraints for each node grows exponentially with the number of potential Markov blanket members. More specifically, the total number of introduced boolean variables is |E∨| + Σ_{j=1}^{d} 2^dj and the total number of introduced equality constraints is d + Σ_{j=1}^{d} 2^dj. In addition, the weight of each candidate must be calculated and stored prior to the actual optimization. Consequently, the feasibility of this approach depends strongly on the sizes of the Markov blankets in G∨. Hence, in the next section we introduce an alternative approximate algorithm that can be applied also in intractable situations.

5.3 Global graph discovery using greedy hill climbing

The variable-wise factorization of the MPL makes it particularly well-suited for global search algorithms based on local changes. As a stochastic option, the non-reversible MCMC-based approach by Corander et al. (2008) is directly applicable to MPL optimization. Here we propose a simple deterministic approach in the form of a greedy hill climbing (HC) algorithm, which has also been used for learning Bayesian networks (see e.g. Heckerman et al., 1995). Local edge change algorithms move between neighboring graph structures during the optimization procedure. The set of neighbors of a graph G in a graph space G is denoted by NG(G) and defined as all graphs in G that can be reached from G by adding or removing a single edge.

An outline of the algorithm is presented in Algorithm 2 and the general idea is as follows. The empty graph is set as the initial graph and the considered optimization space is G∨. At each iteration, all neighbors of the current graph are evaluated. At the end of the iteration, we choose the highest scoring graph from the neighbors, assuming that it has a higher score


than the current graph, and repeat the procedure. If no candidate among the neighbors has a higher score than the current graph, a local maximum has been reached, the algorithm terminates, and it returns the identified graph.
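Algorithm 2 can be sketched in runnable form as follows (a minimal Python sketch, not the authors' implementation; `score(E)` is a hypothetical stand-in for log p(x | G) of the graph with edge set E, and neighbors are generated by toggling one eligible edge):

```python
def graph_hill_climb(edges_or, score):
    """Greedy hill climbing over the reduced space G_v: start from the empty
    graph and repeatedly move to the best-scoring neighbor, where a neighbor
    is reached by adding or removing a single eligible edge."""
    current = frozenset()
    while True:
        best, best_score = None, score(current)
        for e in edges_or:
            neighbor = current - {e} if e in current else current | {e}
            s = score(neighbor)
            if s > best_score:
                best, best_score = neighbor, s
        if best is None:          # local maximum reached
            return current
        current = best

# Toy example: three eligible edges and a hypothetical score peaked at the
# target edge set {{0,1}, {1,2}}.
eligible = [frozenset((0, 1)), frozenset((1, 2)), frozenset((0, 2))]
target = {frozenset((0, 1)), frozenset((1, 2))}
best_graph = graph_hill_climb(eligible, lambda E: -len(set(E) ^ target))
```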

We examine the computational complexity of the proposed algorithm by considering the calculations required at each iteration. The cost of evaluating the specified expressions was discussed in Section 4.3. We show that, by implementing smart caching, the efficiency of the algorithm can be improved considerably. Let Gt be the current graph at iteration t and let G′t ∈ NG∨(Gt) differ with respect to the edge {i, j}. To compare G′t to Gt we calculate the log-Bayes pseudo-factor

logK(G′t, Gt) = log p(xi | xmb(i), G′t) + log p(xj | xmb(j), G′t) − log p(xi | xmb(i), Gt) − log p(xj | xmb(j), Gt).

The above expression must be evaluated for each candidate in the neighbor set NG∨(Gt), which has a maximum cardinality of (d choose 2) if all edges are included. However, since

log p(xi | xmb(i), Gt) and log p(xj | xmb(j), Gt)

are determined by the current graph, they can be stored and re-used. Consequently,

log p(xi | xmb(i), G′t) and log p(xj | xmb(j), G′t)

are the only terms that specifically need to be calculated to evaluate the neighbor associated with the edge change {i, j}.

In the first iteration, d local MPLs with empty Markov blankets must be evaluated. Additionally, each possible neighbor must be evaluated by calculating two local MPLs with Markov blanket size one. In subsequent iterations, we can further exploit the decomposition of the MPL by noting that most of the log-factors from the previous iteration remain unchanged under the new current graph. In fact, the only edge changes that need to be re-evaluated are those that overlap with the previous change. Say that the current graph Gt was attained by adding (deleting) the edge {k, l} to (from) Gt−1. In line with earlier notation, let G′t−1 be the neighbor of Gt−1 that differs with respect to the edge {i, j}. Given the context, we now have that

logK(G′t, Gt) = logK(G′t−1, Gt−1) if {i, j} ∩ {k, l} = ∅.

Consequently, after the initial iteration it suffices to re-evaluate only a small fraction of the updated neighbor set. In the worst case, 2(d − 1) neighbors need to be evaluated, even though the maximum cardinality of the neighbor set of interest is (d choose 2) − 1 when excluding the graph from the previous iteration.

Under this optimization strategy, the MPL method is similar in spirit to the max-min hill climbing algorithm for learning Bayesian networks by Tsamardinos et al. (2006). The main difference is that both phases of our algorithm are derived from the notion of maximizing a single underlying score.


Network                        Grid   Hub   Loop   Clique
Number of nodes                16     16    16     16
Number of edges                24     15    19     20
Average Markov blanket size    3.25   1.88  2.38   2.5
Maximum Markov blanket size    4      8     4      4
Chordal                        No     Yes   No     Yes

Table 1: Properties of the graph components in Figure 4.

6. Experimental results

The main focus of this section is to empirically investigate the performance of the MPL using the optimization algorithms from the previous section. To evaluate our approach in a controlled setting, we compare it to other competitive methods on synthetic models as well as real-world networks. Since the graph structures of the generating models are known, a straightforward and fair assessment of the algorithms is possible. Finally, to illustrate the potential of our method, we also present a real high-dimensional knowledge discovery problem to which our method is applied.

When the true structure of the generating network is known, the quality of an output network is readily assessed by the number of errors in terms of FPs and FNs. As our main measure of quality we consider the sum of FPs and FNs, which is the Hamming distance between the output and the true network. Consequently, a low Hamming distance corresponds to structural resemblance to the true network, and the minimum value of zero is obtained for the correct graph. In addition to structural resemblance, we monitor the execution times for the different methods.1 The total runtimes of all algorithms are reported along with the maximum discovery time for a single Markov blanket. The maximum Markov blanket discovery time would be the total real time required if the local problems were solved in parallel rather than in serial fashion.
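As a minimal sketch (with an edge-set representation of our own choosing), the quality measure can be computed directly:

```python
def hamming_distance(true_edges, found_edges):
    """Structural Hamming distance between two undirected graphs:
    false positives (edges found but not in the true graph) plus
    false negatives (true edges that were missed)."""
    true_set = {frozenset(e) for e in true_edges}
    found_set = {frozenset(e) for e in found_edges}
    fp = len(found_set - true_set)
    fn = len(true_set - found_set)
    return fp + fn
```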

If not otherwise mentioned, we set the equivalent sample size parameter N = 1, which results in a weak parameter prior. For the main part of the experiments, we set p(G) to be uniform, since we want to investigate how well the MPL alone performs as a metric for graph structures.

6.1 Synthetic Markov networks

In this section we use synthetic models to generate datasets of different sizes to systematically compare the performance of the MPL combined with our optimization algorithms. Moreover, we compare the MPL against different competing methods for structure learning of Markov networks. For simplicity, we restrict the synthetic networks to be made up of binary variables.

The synthetic graphs were formed by combining disconnected components in the form of the four 16-node graphs illustrated in Figure 4. These graphs represent different structural characteristics present in realistic models, and some of their properties are listed in Table 1. In particular, as already shown in Section 4.2, the hub network in Figure 4b represents

1. All experiments were carried out in Matlab except for the PBO, which was solved using the SCIP solver (web site: http://scip.zib.de/). The experiments were performed on a standard PC architecture with 2.66GHz dual-core processors.



Figure 4: Synthetic graphs used in Section 6.1: (a) Grid, (b) Hub, (c) Loop, and (d) Clique.

a structural characteristic that is especially hard to capture for the MPL even though it is a rather simple tree structure. Initially, one replica of each subgraph was combined to form a structure over 64 variables. This procedure was then repeated with 2, 4, and 8 replicas to form network structures over 128, 256, and 512 variables, respectively. Each final network structure thus contained all the structural characteristics present in the graph components. The advantage of this approach is that the disconnected nature of the generating networks facilitates the sampling procedure substantially, since each distinct subnetwork can be sampled directly from its corresponding joint distribution independently of the rest of the network. In practice, a distribution was generated by randomly sampling the maximal clique factors in (1). Each factor value φ(xC) was drawn, independently of the other values, from a uniform distribution over (0, 1). Consequently, the strength of the dependencies entailed by the edges may have varied considerably. To increase the stability of our results, for each sample size and graph structure, we generated 10 distributions, from each of which 10 datasets were sampled. In total, under each setup, we learned 100 model structures over which the final results were averaged. The experiments were performed for sample sizes ranging from 250 to 32000.
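A minimal sketch of this per-component sampling scheme, assuming binary variables and components small enough to enumerate; the function name and representations are our own illustrative choices:

```python
import itertools
import random

def sample_component(cliques, d, n, rng):
    """Draw n joint samples over d binary variables from a distribution
    p(x) ∝ ∏_C φ_C(x_C), with each factor value φ_C(x_C) drawn
    independently from a uniform distribution over (0, 1)."""
    # one random table per clique: maps each clique configuration to φ
    factors = [{cfg: rng.random()
                for cfg in itertools.product((0, 1), repeat=len(C))}
               for C in cliques]
    # the component is small, so the joint can be enumerated directly
    states = list(itertools.product((0, 1), repeat=d))
    weights = []
    for x in states:
        w = 1.0
        for C, phi in zip(cliques, factors):
            w *= phi[tuple(x[v] for v in C)]
        weights.append(w)
    return rng.choices(states, weights=weights, k=n)
```

Disconnected components can each be sampled this way and concatenated, which is what makes the generating networks convenient.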

First we examine how our approximate algorithm, HC, performs in comparison with our exact algorithm, PBO, in the second phase of the optimization process. Since the exact method finds the globally MPL-optimal graph on the reduced graph space G∨, we can use it as a gold standard against which to compare our approximate method. Since the feasibility of the exact method is restricted by the output of the first phase, we need to filter out instances that are not solvable in a reasonable time. We restricted the comparison to instances where the total number of Markov blanket candidates is less than 15000. We also set a time limit on the solver of 3600 seconds per instance. The following results are thus based on the remaining solved instances (see Table 3 for more details).

In Figure 5a we have plotted the rate at which the two methods discover identical solutions. For the HC method, this corresponds to the rate at which the algorithm succeeds in reaching the global optimum. As expected, the success rate grows with an increased sample size. When given more observations, the algorithm is more firmly guided



Figure 5: Comparison of HC and PBO for different model sizes. In (a) the rate at which the two methods reach identical solutions is plotted against the sample size. In (b) the average Hamming distance between non-identical solutions is plotted against the sample size.

towards the optimal solution. In addition, the size of the optimization space is in general smaller for larger sample sizes due to an increased conformity among the Markov blankets from the initial phase. Overall, the HC method performs very well in identifying the optimal solution for reasonable sample sizes. We now consider the instances where the optimal solution was not reached by the HC method. In Figure 5b we have plotted the average Hamming distance between the HC solution and the optimal solution for instances when the two graphs were different. For all of the considered model sizes, the curves quickly converge towards a Hamming distance of two, which is the closest a local maximum can be to the global maximum under our definition of neighboring graphs. As expected, the approximate and exact solutions tend to resemble each other somewhat less for the more extreme “large d, small n” setups.

In the second part of the experiment, we compare the MPL against other structure learning methods that are also applicable in high dimensions. We limit the MPL approach to the less computationally expensive HC algorithm, which we from now on simply refer to as the MPL method. The other methods used in the comparison are the following:

• PIC: The PIC criterion by Csiszar and Talata (2006) is applied using exactly the same search technique as for the MPL method. From the proof of Theorem 1, we know that MPL and PIC are asymptotically equivalent estimators, but here we examine how they perform in practice for limited sample sizes.

• CMI: We apply the Markov blanket discovery approach of Tsamardinos et al. (2003), who use conditional mutual information2 (CMI) to assess whether two variables are conditionally independent given some set of variables. For a fair comparison against the MPL method, we use the CMI measure combined with Algorithm 1. To form the final

2. To calculate the conditional mutual information, the Matlab package of Peng (2007, Accessed 2013-10-14) was used.



Figure 6: Comparison of the MPL method with the other structure learning methods. The average Hamming distance between the induced and true graph is plotted against the sample size for synthetic models of size (a) d = 64, (b) d = 128, (c) d = 256, and (d) d = 512.

graph, we apply either the AND or the OR criterion, depending on which one results in a graph closer to the true graph in terms of Hamming distance.

• L1LR: We apply the L1-regularized logistic regression3 (L1LR) approach of Ravikumar et al. (2010), which is directly applicable to our models since we have restricted the experiments to binary variables. To form the final graph, we apply either the AND or the OR criterion, depending on which one results in a graph closer to the true graph in terms of Hamming distance.

An issue with both the CMI and the L1LR method is that they require the user to specify a crucial tuning parameter, in the form of a threshold value and a regularization weight, respectively. To circumvent this problem, the methods were executed for the

3. The L1-regularized logistic regression was performed using the Matlab package of Schmidt (2013, Accessed 2013-10-14).


Network                        Alarm   Insurance   Hailfinder   Barley
Number of nodes                   37          27           56       48
Number of edges (DAG)             46          52           66       84
Number of edges (moral graph)     65          70           99      126
Number of parameters             509         984         2656   114005
Average Markov blanket size     3.51        5.19         3.54     5.25
Maximum Markov blanket size        8          10           17       13
Average variable cardinality    2.84        3.30         3.98     8.77
Maximum variable cardinality       4           5           11       67

Table 2: Properties of the real-world Bayesian networks used in Section 6.2.

following ranges of values:

λCMI ∈ {0.25, 0.1, 0.075, 0.05, 0.025, 0.01, 0.0075, 0.005, 0.0025, 0.001, 0.00075, 0.0005},

λL1LR ∈ {4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024}.

This resulted in a range of overly sparse to overly dense graphs, from which we picked the graph (and parameter value) that minimized the Hamming distance as the final solution. Since the Hamming distance tends to increase steadily when moving away from the optimal parameter value, the above ranges of values were chosen such that the picked parameter values (see Table 8) would lie strictly between the smallest and largest value. In a sense, the MPL also requires us to choose a value for the equivalent sample size parameter N. However, in these experiments we have fixed N = 1, whereas the other methods are tuned with respect to the true graph in order to perform optimally given the range of parameter values.
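This tuning protocol can be sketched as follows, assuming a hypothetical `learn(lam)` routine that returns the estimated edge set (as frozensets) for a given parameter value:

```python
def oracle_tuning(learn, lambdas, true_edges):
    """Pick the tuning parameter whose output graph minimizes the Hamming
    distance to the known true graph (the best-case protocol used here for
    CMI and L1LR).  `learn(lam)` returns a set of frozenset edges."""
    true_set = {frozenset(e) for e in true_edges}

    def hd(edges):
        return len(edges - true_set) + len(true_set - edges)

    best_lam = min(lambdas, key=lambda lam: hd(learn(lam)))
    return best_lam, learn(best_lam)
```

In a real analysis the true graph is unknown, so this oracle selection overstates how well CMI and L1LR can be tuned in practice.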

The results of the simulations are summarized in Tables 4 and 5. In Figure 6 the average Hamming distance is illustrated for the different methods and model sizes. Overall, the MPL method performed highly satisfactorily and was marginally inferior to the other methods only for some of the most extreme “large d, small n” settings. It is difficult to say whether this is due to the MPL itself or to the approximation made by the HC method. In terms of speed (see Table 5), the MPL method was at a comparable level for all of the considered models. Furthermore, the task of determining the tuning parameter experimentally would significantly increase the total runtimes of the CMI and L1LR methods.

6.2 Real-world Bayesian networks

In this section we proceed to a more realistic setting by conducting experiments on well-known real-world models from the related class of Bayesian networks, in a similar fashion as Bromberg et al. (2009). The considered models are commonly used as benchmarks in research and are available from a number of sources4. To transform the directed acyclic graph of a Bayesian network into a corresponding undirected graph of a Markov network, a two-step procedure known as moralization is used (see Lauritzen, 1996; Koller and Friedman,

4. The networks used in this work were obtained from the Bayesian network repository at http://www.bnlearn.com/bnrepository/ (Accessed 2014-08-07) and sampled using the R package of Scutari (2010).


2009). In the first step, all parents of a common child are connected by an undirected edge if not already connected. In the second step, the graph is made undirected by removing the direction of all directed edges. Although the local Markov property remains valid in the transformed network, some conditional independencies are lost in the moralization process due to the added edges. Consequently, the associated distribution is no longer faithful to the undirected graph, making graph identification more challenging.
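A minimal sketch of the two-step moralization procedure, for a DAG given as a hypothetical child-to-parents mapping:

```python
from itertools import combinations

def moralize(parents):
    """Moralize a DAG given as {child: iterable of parents}: marry all
    parents of each child, then drop edge directions.  Returns the set of
    undirected edges (frozensets) of the moral graph."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                       # directed edges become undirected
            edges.add(frozenset((p, child)))
        for u, v in combinations(pa, 2):   # connect co-parents
            edges.add(frozenset((u, v)))
    return edges
```

For the v-structure a → c ← b, the moral graph gains the "marrying" edge {a, b}, which is exactly the kind of added edge that hides conditional independencies.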

We selected the four medium-sized networks, which are listed along with some of their properties in Table 2. Compared to the relatively simple and balanced synthetic networks in Section 6.1, these models are more challenging due to their higher edge density and larger Markov blankets. In addition, large variable cardinalities also tend to have a negative effect on the learning time for methods such as the MPL. As before, we sampled each network for sample sizes ranging from 250 to 32000. For each network and sample size, we generated 100 samples over which the final results were averaged. We applied the same methods as in the previous section, except for L1LR, which without modifications is restricted to binary variables. In order to have a sufficient range of threshold values for the CMI method, we added

{0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5}

to the range of values from the previous section.

The results of the simulations are summarized in Tables 6 and 7. In Figure 7 the average Hamming distance from the moralized graph is illustrated for the different methods and networks. Again, the MPL method displayed an overall stable and good performance compared to the other methods. However, the Hailfinder network exposes the main weakness of the MPL (and of the related PIC when consistency is enforced). Although the majority of the nodes in the moral graph have relatively small Markov blankets, there is one node with a Markov blanket of size 17. As shown earlier, the MPL struggles with large Markov blankets, or so-called hub nodes. As seen in Table 6, the CMI-AND method naturally suffers from the same problem; the CMI-OR method, however, can to some extent circumvent it. In terms of speed (see Table 7), the MPL is the slowest of the considered methods; however, its runtimes are still at a reasonable level considering the performance. A reason for the slower runtimes is that the MPL method tends to produce denser graphs than the PIC as well as the CMI method when tuned with respect to the Hamming distance. Consequently, the MPL method requires more iterations, which affects the runtimes negatively, especially when the variable cardinalities are increased. However, it should be noted that choosing the threshold value by cross-validation would again significantly increase the computation time for the CMI method.

All MPL simulations thus far have been performed under the fixed equivalent sample size N = 1. Whereas λCMI and λL1LR have a rather clear interpretation in terms of their effect on the graph, the effect of N is not as easy to interpret. Silander et al. (2007) showed experimentally that the maximum a posteriori (MAP) structure of the BDeu score for Bayesian networks is sensitive to the choice of N. Furthermore, they noted that larger values of N tend to produce denser MAP graphs. Since the MPL and BDeu share the same basic structure, one would expect to see a similar behavior between the two scores. To investigate this, we conclude this section by performing an additional simulation study for the Alarm network for

N ∈ {1, 4, 8, 16, 32, 64, 128, 256}.



Figure 7: Comparison of the MPL method with the other structure learning methods. The average Hamming distance between the identified and true moral graph is plotted against the sample size for the (a) Alarm, (b) Insurance, (c) Hailfinder, and (d) Barley network.

The results of the simulation are summarized in Table 6b and the Hamming distances are illustrated in Figure 8. As expected, the results indicate a behavior similar to that of the BDeu metric, in the sense that the MPL method produced denser graphs for larger values of N. Furthermore, the results in Table 6b also indicate that larger values of N can be beneficial for larger samples; however, N = 1 appears to be a reasonable choice when considering the complete range of sample sizes.

6.3 A real-world application

Finally, to illustrate the MPL in a high-dimensional real application, we consider a dataset of 1,000 aligned whole-genome DNA sequences of Mycobacterium tuberculosis (Casali et al., 2014). A Markov network can be used to reveal direct associations between variation over genome positions that may be relatively distant from each other. This purpose is similar



Figure 8: The MPL method applied to the Alarm network under different values of N. The average Hamming distance between the identified and true moral graph is plotted against the sample size.


Figure 9: Association pattern identified by the MPL method for 1kb genome intervals in M. tuberculosis. A blue point indicates found linkage between positions that reside within the intervals located in the genome according to the values on the horizontal and vertical axes (in thousands of nucleotides).

to the use of Markov networks for finding direct dependencies among amino acid sequence positions that relate to the underlying crystal structure (Ekeberg et al., 2013).

The original 5Mb multiple sequence alignment for M. tuberculosis had approximately 27,000 variable positions, out of which we chose the 589 that displayed sufficient variability, as determined by the threshold that the most frequent DNA base in a variable position was not allowed to represent more than 90% of the total number of observations. As in the amino acid dependency modeling with Markov networks, associations between genome positions that are close neighbors in the sequences are trivial and uninteresting for the biological purposes. We applied the MPL method to the 589 variables under the sparsity-promoting prior

$$p(G) \propto \prod_{j=1}^{d} 2^{-q_j(r_j-1)}$$

in order to filter out the strongest dependencies.
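The 90% variability filter can be sketched as follows; the alignment representation as a list of equal-length strings is our own assumption:

```python
from collections import Counter

def variable_positions(alignment, max_major_freq=0.9):
    """Return indices of alignment columns whose most frequent base
    accounts for at most `max_major_freq` of the sequences."""
    n = len(alignment)
    keep = []
    for j in range(len(alignment[0])):
        column = [seq[j] for seq in alignment]
        most_common = Counter(column).most_common(1)[0][1]
        if most_common <= max_major_freq * n:
            keep.append(j)
    return keep
```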

The variable positions included in the analysis were separated maximally by almost 2.5 million bases, since the bacterial genome is circular. To enable an efficient illustration of the results, the genome alignment was first split into non-overlapping intervals of 1000 (1kb) successive positions. Then, an adjacency matrix was created for the pairs of genome intervals such that a pair of intervals was deemed adjacent if the identified graph contained an edge between any variable positions in the two intervals. The resulting adjacency structure is illustrated in Figure 9, which succinctly demonstrates the long-distance linkage of genome variation in this bacterium, as expected on the basis of its very low recombination rate.
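A sketch of this interval-level summarization, assuming a hypothetical mapping from variables to genome coordinates:

```python
def interval_adjacency(edges, positions, width=1000):
    """Collapse a graph over genome positions into adjacency between
    fixed-width intervals: two intervals are adjacent if any edge links
    positions falling in them.  `positions[v]` is the genome coordinate
    of variable v; returns a set of (interval_i, interval_j) pairs."""
    adjacent = set()
    for u, v in edges:
        bu, bv = positions[u] // width, positions[v] // width
        if bu != bv:
            adjacent.add((min(bu, bv), max(bu, bv)))
    return adjacent
```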

7. Conclusions

In this work we have introduced a novel approach for learning the structure of a Markov network without imposing the restriction of chordality. Our marginal pseudo-likelihood scoring method is proven to be consistent and can be considered a small-sample analytical version of the information-theoretic PIC criterion (Csiszar and Talata, 2006). We have designed two algorithms for finding the MPL-optimal solution after an initial prescan for plausible edges. For moderately sized candidate sets of Markov blankets, we have shown that it is possible to obtain an exact solution to the restricted global optimization problem using pseudo-Boolean optimization. As a fast alternative to the exact method, we considered a greedy hill climbing approach, which gave near-optimal performance for reasonable sample sizes. The straightforward possibility of parallel use of the MPL makes it a viable candidate for high-dimensional knowledge discovery.

In comparison with the other methods for structure learning of Markov networks, our MPL method was overall superior, and only slightly inferior to alternatives under fairly extreme “large d, small n” settings, or when the underlying network contained hub nodes. Moreover, it should be kept in mind that we chose the tuning parameter values for the CMI and L1LR methods by optimizing their performance against the known underlying structures, to reduce the computational burden of the experiments. In a real data analysis scenario, it would be necessary to tune these methods, using for example cross-validation, which would plausibly have a negative effect on their performance. In this sense, the comparison was extremely fair to the alternative methods, since no parameter choice was made for the MPL based on the resulting performance. In terms of execution time, the MPL method is not necessarily as fast as the other methods; however, the runtimes of the CMI and L1LR methods are here reported under a best-case scenario. If the value of the penalty parameter were determined experimentally, the computation times would easily exceed those reported for the MPL method.

The main drawback of the MPL compared to the true ML is that the former in a sense over-specifies the dependence structure. As a result, the MPL is less data-efficient than the ML, especially when the true network contains hub nodes. On the other hand, calculation


of the true ML still remains intractable for non-chordal Markov networks. Therefore, the attractive properties of the MPL, combined with its solid performance in our experiments, suggest that our approach has considerable potential for both applications and further theoretical development.

Appendix A.

In this appendix we prove Theorem 1 from Section 4:

Theorem Let G∗ ∈ G be the true graph structure of a Markov network over (X1, . . . , Xd), with the corresponding Markov blankets mb(G∗) = {mb∗(1), . . . , mb∗(d)}. Let θG∗ ∈ ΘG∗ define the corresponding joint distribution, to which G∗ is faithful and from which a sample x of size n is obtained. The local MPL estimator

$$\widehat{mb}(j) = \arg\max_{mb(j) \subseteq V \setminus j} p(\mathbf{x}_j \mid \mathbf{x}_{mb(j)}) \tag{21}$$

is consistent in the sense that m̂b(j) = mb∗(j) eventually almost surely as n → ∞ for j = 1, . . . , d. Consequently, the global MPL estimator

$$\widehat{G} = \arg\max_{G \in \mathcal{G}} p(\mathbf{x} \mid G) \tag{22}$$

is consistent in the sense that Ĝ = G∗ eventually almost surely as n → ∞.

Proof The proof is based on an asymptotic comparison of the local log-MPL

$$\log p(\mathbf{x}_j \mid \mathbf{x}_{mb(j)}) = \sum_{l=1}^{q_j} \Big[ \log \Gamma(\alpha_{jl}) - \log \Gamma(n_{jl} + \alpha_{jl}) + \sum_{i=1}^{r_j} \big( \log \Gamma(n_{ijl} + \alpha_{ijl}) - \log \Gamma(\alpha_{ijl}) \big) \Big] \tag{23}$$

and the PIC criterion (Csiszar and Talata, 2006), which in our notation is defined by

$$\mathrm{PIC}(mb(j); \mathbf{x}) = -\sum_{l=1}^{q_j} \sum_{i=1}^{r_j} n_{ijl} \log \frac{n_{ijl}}{n_{jl}} + q_j \log n.$$

The PIC estimator is then defined as the Markov blanket (or neighborhood) that minimizes the above score. Csiszar and Talata (2006) conclude that the PIC criterion is consistent (Theorem 2.1). Their proof is based on two key propositions (Propositions 4.1 and 5.1), which together rule out the possibility of overestimation as well as underestimation. They also note that their results remain valid if the penalty term is multiplied by any constant c > 0. Thus, any estimator asymptotically equivalent to the PIC estimator is also consistent.


We proceed by investigating the asymptotic behavior of the local log-MPL when n → ∞. We re-arrange the terms in (23) and omit the constant term, introducing an O(1) error:

$$\begin{aligned}
&\sum_{l=1}^{q_j} \Big[ \log \Gamma(\alpha_{jl}) - \log \Gamma(n_{jl} + \alpha_{jl}) + \sum_{i=1}^{r_j} \big( \log \Gamma(n_{ijl} + \alpha_{ijl}) - \log \Gamma(\alpha_{ijl}) \big) \Big] \\
&\quad= \sum_{l=1}^{q_j} \Big[ -\log \Gamma(n_{jl} + \alpha_{jl}) + \sum_{i=1}^{r_j} \log \Gamma(n_{ijl} + \alpha_{ijl}) \Big] + \sum_{l=1}^{q_j} \Big[ \log \Gamma(\alpha_{jl}) - \sum_{i=1}^{r_j} \log \Gamma(\alpha_{ijl}) \Big] \\
&\quad= \sum_{l=1}^{q_j} \Big[ -\log \Gamma(n_{jl} + \alpha_{jl}) + \sum_{i=1}^{r_j} \log \Gamma(n_{ijl} + \alpha_{ijl}) \Big] + O(1).
\end{aligned}$$

We let n → ∞ and apply Stirling’s asymptotic formula

$$\log \Gamma(n) \to \Big(n - \frac{1}{2}\Big) \log n - n + O(1)$$

on the remaining terms:

$$\begin{aligned}
&\sum_{l=1}^{q_j} \Big[ -\log \Gamma(n_{jl} + \alpha_{jl}) + \sum_{i=1}^{r_j} \log \Gamma(n_{ijl} + \alpha_{ijl}) \Big] + O(1) \\
&\quad\to \sum_{l=1}^{q_j} \Big[ -\Big(n_{jl} + \alpha_{jl} - \frac{1}{2}\Big) \log(n_{jl} + \alpha_{jl}) + (n_{jl} + \alpha_{jl}) \\
&\qquad\quad + \sum_{i=1}^{r_j} \Big( \Big(n_{ijl} + \alpha_{ijl} - \frac{1}{2}\Big) \log(n_{ijl} + \alpha_{ijl}) - (n_{ijl} + \alpha_{ijl}) \Big) \Big] + O(1) \\
&\quad= \sum_{l=1}^{q_j} \sum_{i=1}^{r_j} n_{ijl} \log \frac{n_{ijl} + \alpha_{ijl}}{n_{jl} + \alpha_{jl}} + \sum_{l=1}^{q_j} \sum_{i=1}^{r_j} \alpha_{ijl} \log \frac{n_{ijl} + \alpha_{ijl}}{n_{jl} + \alpha_{jl}} \\
&\qquad + \sum_{l=1}^{q_j} \Big[ \frac{1}{2} \log(n_{jl} + \alpha_{jl}) - \sum_{i=1}^{r_j} \frac{1}{2} \log(n_{ijl} + \alpha_{ijl}) \Big] + O(1).
\end{aligned}$$

The second step is allowed since $n_{jl} = \sum_{i=1}^{r_j} n_{ijl}$ and $\alpha_{jl} = \sum_{i=1}^{r_j} \alpha_{ijl}$. As n → ∞ we have that

$$\frac{n_{ijl} + \alpha_{ijl}}{n_{jl} + \alpha_{jl}} = \frac{n_{ijl}\big(1 + \frac{\alpha_{ijl}}{n_{ijl}}\big)}{n_{jl}\big(1 + \frac{\alpha_{jl}}{n_{jl}}\big)} \to \frac{n_{ijl}}{n_{jl}}.$$

Since $n_{ijl}/n_{jl}$ is the maximum likelihood estimate of the parameter $\theta_{ijl}$, we further know that

$$\sum_{l=1}^{q_j} \sum_{i=1}^{r_j} \alpha_{ijl} \log \frac{n_{ijl} + \alpha_{ijl}}{n_{jl} + \alpha_{jl}} = O(1).$$


Finally, the remaining terms can be rewritten as

$$\begin{aligned}
&\sum_{l=1}^{q_j} \Big[ \frac{1}{2} \log(n_{jl} + \alpha_{jl}) - \sum_{i=1}^{r_j} \frac{1}{2} \log(n_{ijl} + \alpha_{ijl}) \Big] \\
&\quad= \frac{1}{2} \sum_{l=1}^{q_j} \Big[ \log \frac{n_{jl} + \alpha_{jl}}{n} + \log n - \sum_{i=1}^{r_j} \Big( \log \frac{n_{ijl} + \alpha_{ijl}}{n} + \log n \Big) \Big] \\
&\quad= \frac{1}{2} \sum_{l=1}^{q_j} \Big[ \log n - \sum_{i=1}^{r_j} \log n \Big] + \frac{1}{2} \sum_{l=1}^{q_j} \Big[ \log \frac{n_{jl} + \alpha_{jl}}{n} - \sum_{i=1}^{r_j} \log \frac{n_{ijl} + \alpha_{ijl}}{n} \Big] \\
&\quad= \frac{1}{2} \sum_{l=1}^{q_j} \Big[ \log n - \sum_{i=1}^{r_j} \log n \Big] + O(1) \\
&\quad= \frac{1}{2} \big( q_j \log n - q_j r_j \log n \big) + O(1) \\
&\quad= -\frac{q_j (r_j - 1)}{2} \log n + O(1).
\end{aligned}$$

Piecing everything together,

$$\log p(\mathbf{x}_j \mid \mathbf{x}_{mb(j)}) \to \sum_{l=1}^{q_j} \sum_{i=1}^{r_j} n_{ijl} \log \frac{n_{ijl}}{n_{jl}} - \frac{(r_j - 1) q_j}{2} \log n + O(1)$$

as n → ∞. Since the O(1) term does not grow with n, the local log-MPL is asymptotically equivalent to

$$\sum_{l=1}^{q_j} \sum_{i=1}^{r_j} n_{ijl} \log \frac{n_{ijl}}{n_{jl}} - c_j \cdot q_j \log n,$$

where $c_j = (r_j - 1)/2$ is a variable-specific constant. Consequently, the local MPL estimator is asymptotically equivalent to minimizing

$$-\sum_{l=1}^{q_j} \sum_{i=1}^{r_j} n_{ijl} \log \frac{n_{ijl}}{n_{jl}} + c_j \cdot q_j \log n,$$

which is equivalent to the consistent PIC estimator up to a constant factor on the penalty term. Hence the local MPL estimator (21) is consistent.

Since the local MPL estimator is consistent, the true collection of Markov blankets is eventually identified when n → ∞. A set of Markov blankets uniquely specifies the structure of a Markov network. Since the true model structure satisfies the structural properties of a Markov network, that is, i ∈ mb(j) if j ∈ mb(i), the global MPL estimator (22) is also consistent.
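The asymptotic equivalence can also be checked numerically: the gap between the exact local log-MPL, computed with `math.lgamma`, and the penalized log-likelihood above should stay bounded as the counts are scaled up. The count table and the symmetric hyperparameter α = 0.5 below are arbitrary illustrative choices, not values prescribed by the derivation:

```python
from math import lgamma, log

def log_mpl_local(counts, alpha=0.5):
    """Exact local log-MPL of eq. (23) for a count table counts[l][i],
    with a symmetric hyperparameter alpha_ijl = alpha (so that
    alpha_jl = alpha * r_j)."""
    total = 0.0
    for row in counts:
        a_l = alpha * len(row)
        n_l = sum(row)
        total += lgamma(a_l) - lgamma(n_l + a_l)
        for n_il in row:
            total += lgamma(n_il + alpha) - lgamma(alpha)
    return total

def penalized_ll(counts):
    """Asymptotic form: sum_l sum_i n_ijl log(n_ijl / n_jl)
    - ((r_j - 1) q_j / 2) log n."""
    q, r = len(counts), len(counts[0])
    n = sum(sum(row) for row in counts)
    body = sum(n_il * log(n_il / sum(row))
               for row in counts for n_il in row if n_il)
    return body - (r - 1) * q / 2 * log(n)
```

Scaling all counts by a common factor changes both quantities, but their difference settles to a constant, which is exactly the O(1) behavior used in the proof.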

Appendix B.

This appendix contains supplementary material for Section 6 in the form of numerical results.


Table 3: Results from the comparison of the PBO and HC methods in Section 6.1. For each model size d and sample size n, the table reports the number of solved instances and, for each method, the encoding/solving time (s), the log-MPL, and the TP/FP counts (bold font = lowest Hamming distance).


Table 4: Results of the comparison of the MPL, PIC, CMI (AND/OR), and L1LR (AND/OR) methods in Section 6.1 for each model size d and sample size n.
7/3.3

6345.1

5/42.1

6325.3

4/3.1

94000

484.44/9.65

439.5

5/0.1

7432.4

2/4.0

6432.7

3/60.9

8417.7

6/30.0

0406.0

6/9.5

78000

513.00/4.17

476.5

7/0.0

4485.4

4/7.5

6442.4

4/33.0

6433.8

4/12.0

7434.8

2/14.6

416000

532.61/2.25

504.7

9/0.0

4502.8

6/0.2

5470.2

9/23.7

9489.4

9/15.2

6447.9

1/15.3

532000

553.38/1.06

529.7

2/0.0

2545.6

9/0.1

2488.0

6/6.4

4505.8

0/1.2

5457.7

2/14.9

4

Tab

le4:

Tru

ean

dfa

lse

posi

tive

s(T

P/F

P)

for

the

met

hod

su

sed

inS

ecti

on6.

1.(b

old

font

=lo

wes

tH

amm

ing

dis

tan

ce)

35

Page 36: Abo Akademi University arXiv:1401.4988v2 [stat.ML] 11 Nov 2014

Pensar, Nyman, Niiranen and Corander

Table 5: Execution times (in seconds) for the methods used in Section 6.1.

Table 6: True and false positives (TP/FP) for the simulations in Section 6.2 for (a) the methods used in the comparison, and (b) the MPL-HC applied on the Alarm network under different values of the equivalent sample size parameter N (bold font = lowest Hamming distance).

Table 7: Execution times (in seconds) for the methods used in Section 6.2.

Table 8: Minimum and maximum tuning parameter values (min/max) picked by (a) the CMI and L1LR methods in Section 6.1, and (b) the CMI method in Section 6.2.
