Praveen, Paurush, and Holger Fröhlich. "Boosting probabilistic graphical model inference by incorporating prior knowledge from multiple sources." PLoS ONE 8.6 (2013): e67410. @St_Hakky

Boosting probabilistic graphical model inference by incorporating prior knowledge from multiple sources


Page 1: Boosting probabilistic graphical model inference by incorporating prior knowledge from multiple sources

Praveen, Paurush, and Holger Fröhlich. "Boosting probabilistic graphical model inference by incorporating prior knowledge from multiple sources." PLoS ONE 8.6 (2013): e67410.

@St_Hakky

This is a summary I wrote while reading the paper; it is not a paper I wrote.

Page 2

Introduction
• Graphical models are useful for extracting meaningful biological insights from experimental data in life science research.
  • (Dynamic) Bayesian Networks
  • Gaussian Graphical Models
  • Poisson Graphical Models
  • and many other methods…
• These models can infer features of cellular networks in a data-driven manner.

Page 3

Problem
• Network inference from experimental data is challenging because of:
  • the typically low signal-to-noise ratio
  • high dimensionality
  • a low number of replicates and noisy measurements (the sample size is very small)
• The inference of regulatory networks from such data is hence challenging and often fails to reach the desired level of accuracy.

Page 4

How to deal with this problem?
• One can work at the experimental level by increasing the sample size.
  • This would be the best way, but it is very difficult in practice because of cost and time limitations.
• Alternatively, we can deal with the problem at the inference level, which is the approach considered here.

Page 5

There is a lot of prior knowledge in biology
• Much biological knowledge is already available: pathway databases [6–8], Gene Ontology [9], and others.
• Integrating known information from databases and the biological literature as prior knowledge thus appears beneficial.

Page 6

Integrating heterogeneous information into the learning process is not straightforward.
• Biological knowledge covers many different aspects and is widely distributed across multiple knowledge resources.
• Hence, integrating this heterogeneous information into the learning process is not straightforward.

Page 7

Useful databases and information sources
• Pathway databases [6–8] (e.g. KEGG)
• Gene Ontology [9] / GO annotation
• Promoter sequences
• Protein-protein interactions
• Evolutionary information

Page 8

How to deal with the problem at the inference level
• From one information source: generate sample data.
• Integrate one or more particular pieces of information into the learning process:
  • Gene regulatory networks were inferred from a combination of gene expression data with transcription factor binding motifs, promoter sequences, protein-protein interactions, evolutionary information, KEGG pathways, and GO annotation.

Page 9

Several approaches have been proposed
• On the technical side, several approaches for integrating prior knowledge into the inference of probabilistic graphical models have been published.

Page 10

Prior Research
• In [16] and [17] the authors only generate candidate structures with significance above a certain threshold according to prior knowledge.
  • [16] Larsen, Peter, et al. "A statistical method to incorporate biological knowledge for generating testable novel gene regulatory interactions from microarray experiments." BMC Bioinformatics 8.1 (2007): 1.
  • [17] Almasri, Eyad, et al. "Incorporating literature knowledge in Bayesian network for inferring gene networks with gene expression data." International Symposium on Bioinformatics Research and Applications. Springer Berlin Heidelberg, 2008.
• Another idea is to introduce a probabilistic Bayesian prior over network structures. [18] introduced a prior for individual edges based on an a priori assumed degree of belief.
  • [18] Fröhlich, Holger, et al. "Large scale statistical inference of signaling pathways from RNAi and microarray data." BMC Bioinformatics 8.1 (2007): 1.

Page 11

Prior Research

• [19] describes a more general set of priors, which can also capture global network properties, such as scale-free behavior.
  • [19] Mukherjee, Sach, and Terence P. Speed. "Network inference using informative priors." Proceedings of the National Academy of Sciences 105.38 (2008): 14313-14318.
• [20] uses a similar form of prior as [18], but additionally combines multiple information sources via a linear weighting scheme.
  • [20] Werhli, Adriano V., and Dirk Husmeier. "Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge." Statistical Applications in Genetics and Molecular Biology 6.1 (2007).

Page 12

Prior Research
• [21] treats different information sources as statistically independent, so the overall prior is just the product of the priors for the individual information sources.
  • [21] Gao, Shouguo, and Xujing Wang. "Quantitative utilization of prior biological knowledge in the Bayesian network modeling of gene expression data." BMC Bioinformatics 12.1 (2011): 1.

Page 13

We focus on…
• The focus of this paper is on the construction of consensus priors from multiple, heterogeneous knowledge sources.

Page 14

Proposed Methods
• Latent Factor Model (LFM)
  • The LFM relies on the idea of a generative process for the individual information sources and uses Bayesian inference to estimate a consensus prior.
• Noisy-OR Model (NOM)
  • The second model integrates the different information sources via a Noisy-OR gate.

Page 15

Materials and Methods
• 1.1: Edge-wise Priors for Data Driven Network Inference
• 1.2: Latent Factor Model (LFM)
• 1.3: Noisy-OR Model (NOM)
• 1.4: Information Sources

Page 16

Edge-wise Priors for Data Driven Network Inference
• D: experimental data
• G: network graph (represented by an m×m adjacency matrix) to infer from the data
• Bayes' rule: P(G | D) ∝ P(D | G) · P(G)

Page 17

Edge-wise Priors for Data Driven Network Inference
• P(G) is a prior. We assume P(G) can be decomposed into a product of edge-wise terms:
  P(G) ∝ ∏_{i,j} P(G_ij | W_ij)
• W is a matrix of prior edge confidences.
• W_ij = 1: high prior degree of belief in the existence of the edge i→j.
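To make the edge-wise decomposition concrete, here is a minimal Python sketch that scores an adjacency matrix G against a prior confidence matrix W. Identifying P(G_ij = 1 | W_ij) with W_ij is an illustrative assumption (the paper's exact prior form is not reproduced in these slides), and `log_edge_prior` is a hypothetical helper name.

```python
import math

def log_edge_prior(G, W, eps=1e-12):
    """Log of an edge-wise decomposable structure prior.

    Illustrative assumption: P(G_ij = 1 | W_ij) = W_ij, so
    log P(G | W) = sum_ij [G_ij*log(W_ij) + (1 - G_ij)*log(1 - W_ij)].
    """
    total = 0.0
    for g_row, w_row in zip(G, W):
        for g, w in zip(g_row, w_row):
            w = min(max(w, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += g * math.log(w) + (1 - g) * math.log(1.0 - w)
    return total
```

A graph whose edges agree with high-confidence entries of W receives a higher log-prior than one that contradicts them.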

Page 18

Our purpose
• Our purpose is to compile W in a consistent manner from the n available resources.
• That is, we have n edge confidence matrices W^(1), …, W^(n), one per information source.

Page 19

Latent Factor Model (LFM)
• The LFM is based on the idea that the prior information encoded in the matrices W^(1), …, W^(n) all originates from the true but unknown network.

Page 20

Latent Factor Model (LFM)
• We use this notion to conduct joint Bayesian inference on W as well as additional parameters, given W^(1), …, W^(n).
• The idea behind this is that we can identify the consensus prior with the posterior P(W | W^(1), …, W^(n)).

Page 21

Latent Factor Model (LFM)
• The entries of each matrix W^(k) can be assumed to follow a beta distribution.
• (Details are omitted here, but because the α and β values can be tuned, the importance of each information source can be set independently.)
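One quick way to see why tuning α and β controls a source's importance is to look at the mean and variance of a Beta(α, β) distribution: for a fixed mean, a larger α + β gives a smaller variance, i.e. a source whose confidences are modeled as more concentrated (more trusted). This sketch only illustrates that intuition; the helper name is mine.

```python
def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution.
    For a fixed mean alpha/(alpha+beta), increasing alpha+beta
    shrinks the variance, i.e. the source counts as more reliable."""
    s = alpha + beta
    mean = alpha / s
    var = alpha * beta / (s * s * (s + 1))
    return mean, var
```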

Page 22

MCMC to learn W and the parameters
• An adaptive Markov chain Monte Carlo strategy is employed to learn the latent variable W together with the parameters α and β.
• In network space, the MCMC moves are edge insertion, deletion, and reversal.
• In parameter space, α and β are adapted on a log scale using a multivariate Gaussian transition kernel.
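The three network-space moves can be sketched as a proposal function for a structure-MCMC sampler. This is a minimal illustration only: the paper's adaptive scheme, acyclicity checks, and the Metropolis-Hastings acceptance step are all omitted, and `propose_structure` is a hypothetical name.

```python
import random

def propose_structure(G, rng=random):
    """One structure-MCMC proposal: pick an ordered node pair (i, j)
    and apply a random move: edge insertion, deletion, or reversal."""
    G = [row[:] for row in G]          # work on a copy
    m = len(G)
    i, j = rng.sample(range(m), 2)     # i != j, so no self-loops
    move = rng.choice(["insert", "delete", "reverse"])
    if move == "insert":
        G[i][j] = 1
    elif move == "delete":
        G[i][j] = 0
    else:                              # reverse the (i, j) orientation
        G[i][j], G[j][i] = G[j][i], G[i][j]
    return G
```

A full sampler would wrap this in an accept/reject loop that scores each proposal with the posterior.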

Page 23

What is MCMC?
• Chapter 11 of PRML and around Chapter 8 of the "green book" (緑本) are good references.

Page 24

Noisy-OR Model (NOM)
• The NOM represents a non-deterministic disjunctive relation between an effect and its possible causes, and has been used extensively in artificial intelligence.

Page 25

Noisy-OR Model (NOM)
• The Noisy-OR principle is governed by two hallmarks:
  • First, each cause has a probability of producing the effect.
  • Second, the probability of each cause being sufficient to produce the effect is independent of the presence of other causes.

Page 26

Noisy-OR Model (NOM)
• In our case the W^(k)_ij are interpreted as causes and W_ij as the effect. The link between both is given by
  W_ij = 1 − ∏_{k=1}^{n} (1 − W^(k)_ij)
• W_ij = 1: high prior degree of belief in the existence of the edge i→j.
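The Noisy-OR combination is easy to implement directly. A minimal sketch, with the hypothetical helper name `noisy_or_prior`, taking a list of confidence matrices and returning the consensus matrix:

```python
def noisy_or_prior(source_mats):
    """Combine n edge-confidence matrices W^(1..n) via a Noisy-OR gate:
    W_ij = 1 - prod_k (1 - W^(k)_ij)."""
    rows, cols = len(source_mats[0]), len(source_mats[0][0])
    # accumulate the product of (1 - W^(k)_ij) over all sources
    prod = [[1.0] * cols for _ in range(rows)]
    for Wk in source_mats:
        for i in range(rows):
            for j in range(cols):
                prod[i][j] *= 1.0 - Wk[i][j]
    return [[1.0 - p for p in row] for row in prod]
```

Note how a single source with confidence near 1 already drives the consensus near 1, matching the behaviour described on the next slide.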

Page 27

LFM vs NOM
• LFM: a high level of agreement between information sources is required in order to achieve high values in W.
• NOM: W_ij becomes close to 1 if the edge i→j has high confidence in at least one information source, because then the product ∏_k (1 − W^(k)_ij) gets close to 0.

Page 28

NOM.RNK
• In addition to the NOM, we also experimented with a variant based on relative ranks of the edge confidences (NOM.RNK).

Page 29

Information Sources
• We use the following information as prior knowledge:
  • GO annotation
  • Pathway databases
    • KEGG, Pathway Commons
  • Protein domain annotation
    • InterPro
  • Protein domain interaction
    • DOMINE

Page 30

GO annotation
• Interacting proteins often function in the same biological process.
• Hence, two proteins acting in the same biological process are more likely to interact than two proteins involved in different processes.
• We used the default similarity measure for gene products implemented in the R package GOSim.

Page 31

Pathway Databases
• KEGG
  • R package: KEGGgraph
• Pathway Commons database
• To compute a confidence value for each interaction between a pair of genes/proteins, we can work at the level of the pathway graph in different ways:
  • Shortest path distance between the two entities
    • Dijkstra's algorithm (R package: RBGL)
    • This is the measure used in the paper, because using a diffusion kernel in place of the shortest path measure did not affect the results much.
  • Diffusion kernels
    • R package: pathClass
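The shortest-path variant can be sketched with a plain Dijkstra implementation over a weighted adjacency dict. Mapping the distance d to a confidence via 1/(1+d) is my illustrative choice only; the slides do not specify the paper's exact distance-to-confidence transform, and both function names are hypothetical.

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path distances from src in a graph given as
    {node: [(neighbour, weight), ...]} (cf. Dijkstra via RBGL)."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def path_confidence(adj, a, b):
    """Map a shortest-path distance to a confidence in [0, 1].
    The 1/(1+d) transform is an illustrative choice, not the paper's;
    unreachable pairs get confidence 0."""
    d = dijkstra(adj, a).get(b, float("inf"))
    return 1.0 / (1.0 + d)
```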

Page 32

Protein domain annotation
• Proteins with similar domains are more likely to act in similar biological pathways.
• The confidence of an interaction between two proteins can thus be seen as a function of the similarity of the proteins' domain annotations.
• Protein domain annotation was compared on the basis of a binary vector representation via the cosine similarity.

Page 33

Protein domain annotation
• Protein domain annotation can be retrieved from the InterPro database.
• A "1" in a component indicates that the protein is annotated with the corresponding domain; otherwise a "0" is filled in.
• The similarity between two binary vectors u, v (domain signatures) is given by the cosine similarity:
  sim(u, v) = (u · v) / (‖u‖ ‖v‖)

Page 34

Protein domain interaction
• Two proteins are more likely to interact if they contain domains that can potentially interact.
• The DOMINE database collates known and predicted domain-domain interactions.
• The relative frequency of interacting protein domain pairs was taken as another confidence measure for an edge a–b:
  conf(A, B) = H / (D_A · D_B),
  where H is the number of hit pairs found in the DOMINE database and D_A and D_B are the numbers of domains in proteins A and B, respectively.
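Reading "relative frequency" as the number of DOMINE hit pairs divided by all D_A × D_B possible domain pairs (my interpretation of the slide), a minimal sketch with a hypothetical helper name:

```python
def domain_interaction_confidence(domains_a, domains_b, interacting_pairs):
    """Relative frequency of interacting domain pairs:
    conf = H / (D_A * D_B), where H counts domain pairs of (A, B)
    found in an interaction set such as DOMINE (order-insensitive)."""
    if not domains_a or not domains_b:
        return 0.0
    hits = sum(
        1
        for da in domains_a
        for db in domains_b
        if (da, db) in interacting_pairs or (db, da) in interacting_pairs
    )
    return hits / (len(domains_a) * len(domains_b))
```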

Page 35

Results
• 2.1: Correlation of Prior Edge Confidences with True Biological Network
  • Network Sampling
  • Simulated Information Sources
  • Weighting of Information Sources
  • Real Information Sources
• 2.2: Enhancement of Data Driven Network Reconstruction Accuracy
  • Simulated Data
  • Application to Breast Cancer
  • Application to Yeast Heat-Shock Network

Page 36


Page 37

• Omitted in this talk.

Page 38


Page 39

Enhancement of Data Driven Network Reconstruction Accuracy
• We next investigated to what extent our priors could enhance the reconstruction performance of Bayesian networks learned from data.
• From the set of best-fitting network structures, the one with the minimum BIC value was selected.
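The BIC-based model selection step can be sketched as follows. The candidate triples and helper names are hypothetical; only the criterion itself (BIC = −2·logL + k·log N, pick the minimum) is standard.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: -2*logL + k*log(N)."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

def select_by_bic(candidates, n_samples):
    """Pick the candidate structure with minimum BIC.
    candidates: list of (structure, log_likelihood, n_params)."""
    return min(candidates, key=lambda c: bic(c[1], c[2], n_samples))[0]
```

With few samples, the log(N) penalty favours sparser structures even when a denser one fits slightly better.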

Page 40

Network reconstruction for the breast cancer data
• (Right) Without using any prior
• (Left) Using the NOM prior
• Black edges could be verified with established literature knowledge.
• Grey edges could not be verified.

Page 41

Network reconstruction for the breast cancer data

Page 42

Yeast heat-shock response network obtained via Bayesian network reconstruction
• (a) Network without any prior knowledge
• (b) The gold-standard network from the YEASTRACT database
• (c) Network reconstructed with prior knowledge (NOM)

Page 43

Reconstruction performance of Yeast heat-shock response network with Bayesian Networks and different priors (NP = No Prior).

Page 44

Conclusion
• We propose two methods:
  • Latent Factor Model (LFM)
  • Noisy-OR Model (NOM)
• Results:
  • Both models yielded priors that were significantly closer to the true biological network than those of competing methods.
  • They could significantly enhance the reconstruction performance of Bayesian networks compared to a situation without any prior.

Page 45

Conclusion
• Our methods were also superior to purely using STRING edge confidence scores as prior information.
• The proposed methods are robust with respect to network size; they can handle small, medium, and large networks.
• The LFM's computational cost is higher than the NOM's, so the NOM should be preferred for large networks.