Praveen, Paurush, and Holger Fröhlich. "Boosting probabilistic graphical model inference by
incorporating prior knowledge from multiple sources." PloS one 8.6 (2013): e67410.
@St_Hakky
(This is a summary of a paper I read, not a paper I wrote.)
Introduction
• Graphical models are useful for extracting meaningful biological insights from experimental data in life science research.
• (Dynamic) Bayesian Networks
• Gaussian Graphical Models
• Poisson Graphical Models
• And many other methods…
• These models can infer features of cellular networks in a data-driven manner.
Problem
• Network inference from experimental data is challenging because of:
• the typically low signal-to-noise ratio
• high dimensionality
• a low number of replicates and noisy measurements (the sample size is very small)
• The inference of regulatory networks from such data is hence challenging and often fails to reach the desired level of accuracy.
How to deal with this problem?
• One can work at the experimental level by increasing the sample size.
• This would be the best way, but in practice it is very difficult because of cost and time limitations.
• Alternatively, we can deal with this problem at the inference level, which is what we consider here.
There is a lot of prior knowledge in biology
• Much knowledge is already available in biology: pathway databases [6–8], Gene Ontology [9] and others.
• Integrating known information from databases and the biological literature as prior knowledge thus appears beneficial.
Integrating heterogeneous information into the learning process is not straightforward
• However, biological knowledge covers many different aspects and is widely distributed across multiple knowledge resources.
• So, integrating heterogeneous information into the learning process is not straightforward.
Useful databases
• Pathway databases [6–8]
• Gene Ontology [9]
• Promoter sequences
• Protein-protein interactions
• Evolutionary information
• KEGG pathways
• GO annotation
How to deal with the problem at the inference level
• Integrate one or more particular information sources into the learning process.
• For example, gene regulatory networks have been inferred from a combination of gene expression data with transcription factor binding motifs.
• Other usable sources: promoter sequences, protein-protein interactions, evolutionary information, KEGG pathways, GO annotation.
Several approaches have been proposed
• On the technical side, several approaches for integrating prior knowledge into the inference of probabilistic graphical models have been published.
Prior Research
• In [16] and [17] the authors only generate candidate structures with significance above a certain threshold according to prior knowledge.
• [16] Larsen, Peter, et al. "A statistical method to incorporate biological knowledge for generating testable novel gene regulatory interactions from microarray experiments." BMC Bioinformatics 8.1 (2007): 1.
• [17] Almasri, Eyad, et al. "Incorporating literature knowledge in Bayesian network for inferring gene networks with gene expression data." International Symposium on Bioinformatics Research and Applications. Springer Berlin Heidelberg, 2008.
• Another idea is to introduce a probabilistic Bayesian prior over network structures. [18] introduced a prior for individual edges based on an a-priori assumed degree of belief.
• [18] Fröhlich, Holger, et al. "Large scale statistical inference of signaling pathways from RNAi and microarray data." BMC Bioinformatics 8.1 (2007): 1.
Prior Research
• [19] describes a more general set of priors, which can also capture global network properties, such as scale-free behavior.
• [19] Mukherjee, Sach, and Terence P. Speed. "Network inference using informative priors." Proceedings of the National Academy of Sciences 105.38 (2008): 14313-14318.
• [20] use a similar form of prior as [18], but additionally combine multiple information sources via a linear weighting scheme.
• [20] Werhli, Adriano V., and Dirk Husmeier. "Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge." Statistical Applications in Genetics and Molecular Biology 6.1 (2007).
Prior Research
• [21] treat different information sources as statistically independent, and consequently the overall prior is just the product over the priors for the individual information sources.
• [21] Gao, Shouguo, and Xujing Wang. "Quantitative utilization of prior biological knowledge in the Bayesian network modeling of gene expression data." BMC Bioinformatics 12.1 (2011): 1.
We focus on…
• The focus of this paper is the construction of consensus priors from multiple, heterogeneous knowledge sources.
Proposed Methods
• Latent Factor Model (LFM)
• LFM relies on the idea of a generative process for the individual information sources and uses Bayesian inference to estimate a consensus prior.
• Noisy-OR
• The second model integrates different information sources via a Noisy-OR gate.
Materials and Methods
• 1.1: Edge-wise Priors for Data Driven Network Inference
• 1.2: Latent Factor Model (LFM)
• 1.3: Noisy-OR Model (NOM)
• 1.4: Information Sources
Edge-wise Priors for Data Driven Network Inference
• D: experimental data
• G: the network graph (represented by an m×m adjacency matrix) to infer from the data.
• Bayes rule: P(G | D) ∝ P(D | G) · P(G)
Edge-wise Priors for Data Driven Network Inference
• P(G) is the prior over network structures. We assume P(G) can be decomposed into a product of edge-wise terms.
• B = (B_ij) is a matrix of prior edge confidences.
• B_ij = 1: high prior degree of belief in the existence of the edge i→j
Our purpose
• Our purpose is to compile B in a consistent manner from the n available resources.
• That is, we have n edge confidence matrices B^(1), …, B^(n), one per information source.
Latent Factor Model (LFM)
• LFM is based on the idea that the prior information encoded in the matrices B^(1), …, B^(n) all originates from the true but unknown network W.
Latent Factor Model (LFM)
• We use this notion to conduct joint Bayesian inference on W as well as additional parameters, given B^(1), …, B^(n).
• The idea behind this is that we can identify the consensus prior B with the posterior P(W | B^(1), …, B^(n)).
Latent Factor Model (LFM)
• The entries of each matrix B^(k) are assumed to follow a beta distribution.
• (Details are omitted here, but since the parameters α and β can be adjusted, the importance of each information source can be set independently.)
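As a rough sketch (not the paper's implementation), the role of the per-source beta parameters can be illustrated as follows; the (α, β) values below are hypothetical, chosen only to show how they encode source reliability:

```python
import random

random.seed(0)

def sample_edge_confidences(alpha, beta, n_edges):
    """Draw beta-distributed prior edge confidences for one source."""
    return [random.betavariate(alpha, beta) for _ in range(n_edges)]

# Hypothetical per-source (alpha_k, beta_k) values, not the paper's:
sources = {
    "trusted source":   (8.0, 2.0),  # confidences concentrate near 1
    "uncertain source": (2.0, 2.0),  # confidences spread around 0.5
}

for name, (a, b) in sources.items():
    draws = sample_edge_confidences(a, b, 1000)
    print(name, sum(draws) / len(draws))  # sample mean ~= a / (a + b)
```

A higher α relative to β shifts the confidence mass toward 1, so a source believed to be reliable contributes stronger edge priors.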
MCMC to get W and (α, β)
• Employ an adaptive Markov Chain Monte Carlo strategy to learn the latent variable W together with the parameters α and β.
• In network space, the MCMC moves are edge insertion, deletion and reversal.
• In parameter space, α and β are adapted on a log scale using a multivariate Gaussian transition kernel.
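The three structure moves can be sketched in a few lines; this is only an illustration of the proposal step, with the Metropolis-Hastings acceptance step and any acyclicity checks omitted:

```python
import random

def propose_move(adj, rng=random):
    """Propose one MCMC structure move on a binary adjacency matrix:
    edge insertion, deletion, or reversal. Acceptance and acyclicity
    checks, needed by a real sampler, are omitted in this sketch."""
    m = len(adj)
    new = [row[:] for row in adj]
    i, j = rng.sample(range(m), 2)   # pick an ordered node pair i != j
    if new[i][j] == 1:
        new[i][j] = 0                # deletion of i->j ...
        if rng.random() < 0.5:
            new[j][i] = 1            # ... or reversal to j->i
    else:
        new[i][j] = 1                # insertion of i->j
    return new

random.seed(1)
net = [[0, 1], [0, 0]]
print(propose_move(net))
```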
What is MCMC?
• Chapter 11 of PRML and around Chapter 8 of the "green book" (緑本) are useful references.
Noisy-OR model (NOM)
• The NOM represents a non-deterministic disjunctive relation between an effect and its possible causes, and has been extensively used in artificial intelligence.
Noisy-OR model (NOM)
• The Noisy-OR principle is governed by two hallmarks:
• First, each cause has a probability to produce the effect.
• Second, the probability of each cause being sufficient to produce the effect is independent of the presence of other causes.
Noisy-OR model (NOM)
• In our case the B^(k)_ij are interpreted as causes and W_ij as the effect. The link between both is given by
• B_ij = P(W_ij = 1 | B^(1)_ij, …, B^(n)_ij) = 1 − ∏_k (1 − B^(k)_ij)
• B_ij = 1: high prior degree of belief in the existence of the edge i→j.
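The Noisy-OR combination of per-source confidences is simple to compute; a minimal sketch:

```python
def noisy_or(confidences):
    """Combine per-source edge confidences b_ij^(1..n) into a single
    prior via the Noisy-OR gate: 1 - prod_k (1 - b_k)."""
    p = 1.0
    for b in confidences:
        p *= 1.0 - b
    return 1.0 - p

# One strong source is enough to push the combined prior close to 1:
print(noisy_or([0.9, 0.1, 0.2]))  # ~0.928
```

This makes the contrast with LFM concrete: the product of (1 − b_k) terms collapses toward 0 as soon as any single b_k is near 1, so no cross-source agreement is required.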
LFM vs NOM
• LFM
• A high level of agreement between information sources is required in order to achieve high values in B.
• NOM
• In contrast, B_ij becomes close to 1 if the edge i→j has a high confidence in at least one information source, because then the product ∏_k (1 − B^(k)_ij) gets close to 0.
NOM.RNK
• In addition to NOM, we also experimented with a variant based on relative ranks.
Information Sources
• We use the following information as prior information:
• GO annotation
• Pathway databases
• KEGG, PathwayCommons
• Protein domain annotation
• InterPro
• Protein domain interaction
• DOMINE
GO annotation
• Interacting proteins often function in the same biological process.
• So, two proteins acting in the same biological process are more likely to interact than two proteins involved in different processes.
• We used the default similarity measure for gene products implemented in the R package GOSim.
Pathway Databases
• KEGG
• R package: KEGGgraph
• PathwayCommons database
• To compute a confidence value for each interaction between a pair of genes/proteins we can work at the level of this graph in different ways:
• Shortest path distance between the two entities
• Dijkstra's algorithm; R package: RBGL
• In this paper, this measure is used, because we observed that using a diffusion kernel in place of the shortest path measure did not affect the results much.
• Diffusion kernels
• R package: pathClass
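A minimal sketch of the shortest-path idea, using a stdlib Dijkstra on a toy pathway graph; the mapping 1/(1 + d) from distance to confidence is an assumption for illustration, not necessarily the paper's exact transform:

```python
import heapq

def dijkstra(graph, src):
    """Shortest-path distances from src; graph is given as
    {node: [(neighbor, weight), ...]}."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy pathway graph (hypothetical), one unit of distance per interaction:
pathway = {"A": [("B", 1)], "B": [("C", 1)], "C": []}
d = dijkstra(pathway, "A")["C"]   # two steps from A to C
confidence = 1.0 / (1.0 + d)      # assumed distance-to-confidence mapping
print(confidence)
```

Entities close together in the pathway graph thus receive a higher edge confidence than distant ones.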
Protein domain annotation
• Proteins with similar domains are more likely to act in similar biological pathways.
• The confidence of an interaction between two proteins can thus be seen as a function of the similarity of the domain annotations of the proteins.
• Protein domain annotation was compared on the basis of a binary vector representation via the cosine similarity.
Protein domain annotation
• Protein domain annotation can be retrieved from the InterPro database.
• A "1" in a component indicates that the protein is annotated with the corresponding domain; otherwise a "0" is filled in.
• The similarity between two binary vectors u, v (domain signatures) is given by the cosine similarity: sim(u, v) = (u · v) / (‖u‖ ‖v‖)
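The cosine similarity of two binary domain signatures can be sketched as follows (the signatures are hypothetical):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two binary domain signatures."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Two hypothetical proteins sharing one of their two domains each:
u = [1, 1, 0, 0]  # annotated with domains 1 and 2
v = [1, 0, 1, 0]  # annotated with domains 1 and 3
print(cosine_similarity(u, v))  # ~0.5
```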
Protein domain interaction
• The relative frequency of interacting protein domain pairs was taken as another confidence measure for an edge a–b.
• Two proteins are more likely to interact if they contain domains which can potentially interact.
• The DOMINE database collates known and predicted domain-domain interactions.
• Confidence = H / (D_A · D_B), where H is the number of hit pairs found in the DOMINE database and D_A and D_B are the number of domains in proteins A and B, respectively.
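The H / (D_A · D_B) measure can be sketched directly; the domain identifiers and the interaction pair below are hypothetical, not real DOMINE content:

```python
def domain_interaction_confidence(domains_a, domains_b, domine_pairs):
    """Confidence H / (D_A * D_B): the fraction of domain pairs between
    two proteins found as interacting in a DOMINE-like pair set."""
    hits = sum((da, db) in domine_pairs or (db, da) in domine_pairs
               for da in domains_a for db in domains_b)
    return hits / (len(domains_a) * len(domains_b))

# Hypothetical example: protein A has two domains, protein B has one,
# and exactly one of the two cross pairs is a known interaction.
pairs = {("PF00001", "PF00010")}
print(domain_interaction_confidence(["PF00001", "PF00002"],
                                    ["PF00010"], pairs))  # 0.5
```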
Results
• 2.1: Correlation of Prior Edge Confidences with the True Biological Network
• Network Sampling
• Simulated Information Sources
• Weighting of Information Sources
• Real Information Sources
• 2.2: Enhancement of Data Driven Network Reconstruction Accuracy
• Simulated Data
• Application to Breast Cancer
• Application to Yeast Heat-Shock Network
• Section 2.1 is omitted in this summary.
Enhancement of Data Driven Network Reconstruction Accuracy
• We next investigated to what extent our priors could enhance the reconstruction performance of Bayesian Networks learned from data.
• From the set of best-fitting network structures, the one with the minimum BIC value was selected.
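The BIC-based selection step can be sketched as follows; the candidate structures and their log-likelihoods are hypothetical, just to show the selection rule:

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

# Hypothetical candidates: (log-likelihood, number of free parameters).
candidates = {
    "sparse": (-120.0, 4),
    "dense":  (-115.0, 12),
}
n_samples = 50

# Pick the structure with minimum BIC value.
best = min(candidates, key=lambda k: bic(*candidates[k], n_samples))
print(best)  # "sparse"
```

The log(n_samples) penalty on parameters is what lets a slightly worse-fitting but sparser network win, which matters at the small sample sizes typical for this data.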
Network reconstruction for the breast cancer data
• (Right) Without using any prior
• (Left) Using the NOM prior
• Black edges could be verified with established literature knowledge.
• Grey edges could not be verified.
Network reconstruction for the breast cancer data
Yeast heat-shock response network obtained via Bayesian network reconstruction
• (a) Network without any prior knowledge
• (b) The gold standard network from the YEASTRACT database
• (c) Network reconstructed with prior knowledge (NOM)
Reconstruction performance of the yeast heat-shock response network with Bayesian Networks and different priors (NP = No Prior).
Conclusion
• We propose two methods:
• Latent Factor Model (LFM)
• Noisy-OR Model (NOM)
• Results:
• Both of our models yielded priors which were significantly closer to the true biological network than competing methods.
• They could significantly enhance the reconstruction performance of Bayesian Networks compared to a situation without any prior.
Conclusion
• Our methods were also superior to purely using STRING edge confidence scores as prior information.
• The proposed methods are robust to network size: they can handle small, medium and large networks.
• LFM's computational cost is higher than NOM's, so NOM should be preferred for large networks.