35
Network analysis and applications Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576 [email protected] Dec 2 nd , 2014

Network analysis and applications Sushmita Roy BMI/CS 576 [email protected] Dec 2 nd, 2014

Embed Size (px)

Citation preview

Page 1: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Network analysis and applications

Sushmita RoyBMI/CS 576

www.biostat.wisc.edu/[email protected]

Dec 2nd, 2014

Page 2: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Computational problems in networks

• Network reconstruction– Infer the structure and parameters of networks– We examined this problem in the context of “expression-

based network inference”

• Network evaluation/analysis – Properties of networks

• Network applications– Prioritization of genes– Interpretation of gene sets– Identify new biological pathways

• Densely connected subnetworks

Page 3: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Using networks as tools for discovery

• So far we have considered problems in network inference

• Topological properties of networks can be informative

• Biological networks can also be used for numerous applications– Prioritization of genes– Identify new biological pathways

• Densely connected subnetworks

Page 4: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Network properties

• Degree distribution• Average shortest path length• Clustering coefficient• Modularity• Network motifs

Page 5: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Why should we care about network measures?

• From Barabasi and Oltvai 2004:“Probably the most important discovery of

network theory was the realization that despite the remarkable diversity of networks in nature, their architecture is governed by a few simple principles that are common to most networks of major scientific and technological interest”

Page 6: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Node degree

• Undirected network– Degree, k: Number of neighbors of a node

• Directed network– In degree, kin: Number of incoming edges

– Out degree, kout: Number of outgoing edges

Directed Edge

A

B C

D

E

FIn degree of F is 4Out degree of E is 0

Page 7: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Average degree

• Consider an undirected network with N nodes • Let ki denote the degree of node i• Average degree is

Page 8: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Degree distribution

• P(k) the probability that a node has k edges• Different networks can have different degree

distributions• A fundamental property that can be used to

characterize a network

Page 9: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Different degree distributions

• Poisson distribution– The mean is a good representation of ki of all nodes– Networks that have a Poisson degree distribution are

called Erdos Renyi or random networks

• Power law distribution– Also called scale free – There is no “typical” node that captures the degree of

nodes.

Page 10: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Poisson distribution

• A discrete distribution

• The Poisson is parameterized by which can be easily estimated by maximum likelihood

k

P(X

=k)

Page 11: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Power law distribution

• Used to capture the degree distribution of most real networks

• Typical value of is between 2 and 3.

• MLE exists for but is more complicated– See Power-Law Distributions in Empirical

Data. Clauset, Shalizi and Newman, 2009 for details

P(k)

Page 12: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Erdos Renyi random graphs

• Dates back to 1960 due to two mathematicians Paul Erdos and Alfred Renyi.

• Provides a probabilistic model to generate a graph• Starts with N nodes and connects two nodes with

probability p• Node degrees follow a Poisson distribution• Tail falls off exponentially, suggesting that nodes with

degrees different from the mean are very rare

Page 13: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Scale free networks

• Degree distribution is captured by a power law distribution

• There is no “typical” node that describes the degree of all other nodes

• Such networks are ubiquitous in nature

Page 14: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Poisson versus Scale free

Barabasi & Oltvai 2004, Nature Genetics Review

Page 15: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Yeast protein interaction network is believed to be scale free

• “Whereas most proteins participate in only a few interactions, a few participate in dozens”

• Such high degree nodes are called hubs

Barabasi & Oltvai 2004, Nature Genetics Review

Page 16: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Degree of a node is correlated to functional importance of a node

Red nodes on deletion cause the organism to dieRed nodes also among the most degree central

Yeast protein-protein interaction network

Page 17: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Origin of scale free networks

• Scale free networks are ubiquitous is nature• How do such networks form?• Such networks are the result of two processes– a growth process where new nodes join the network over

an extended period of time• Think about how the internet has grown

– Preferential attachment: new nodes tend to connect to nodes with many neighbors• Rich get richer.

Page 18: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Growth and preferential attachment in scale free networks

A new node (red) is more likely to connect to node 1than 2

Page 19: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Path lengths

• The shortest path length between two nodes A and B:– The smallest number of edges that need to be traversed to

get from A to B

• Mean path length is the average of all shortest path lengths

• Diameter of a graph is the longest of all shortest paths in the network

Page 20: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Scale-free networks tend to be ultra-small

• Two nodes on the network are connected by a small number of edges

• Average path length is log(log(N)), where N is the number of nodes in the network

• In a random network (Erdos Renyi network) the average path length is log(N)

Page 21: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Modularity in networks

• Modularity “refers to a group of physically or functionally linked nodes that work together to achieve a distinct function”

-- Barabasi & Oltvai

• Two questions– Given a network is it modular?

• Modularity can be assessed using “Clustering coefficient”• Modularity can also be assessed using the difference between the

number of edges within and between a given grouping of nodes.

– Given a network what are the modules in the network?• Graph clustering

Page 22: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

A modular network

Module 1

Module 2

Module 3

Page 23: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Clustering coefficient

• Measure of transitivity in the network that asks– If A is connected to B, and B is connected to C, how often is A

connected to C?

• Clustering coefficient Ci for each node i is

• ki Degree of node i

• ni is the number of edges among neighbors of i• Average clustering coefficient gives a measure of “modularity”

of the network

A

BC

?

Page 24: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Clustering coefficient example

A

C

BG

D

Page 25: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Finding modules in a graph

• Given a graph find the modules– Modules are represented by densely connected subgraphs

• The graph can be partitioned into modules using “Graph clustering”– Hierarchical or flat clustering using a notion of similarity

between nodes– Markov clustering algorithm– Spectral clustering– Girvan-Newman algorithm

Page 26: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Girvan-Newman algorithm

• General idea: “If two communities are joined by only a few inter-community edges, then all paths through the network from vertices in one community to vertices in the other must pass along one of those few edges.”

• Betweenness of an edge e is defined as the number of shortest paths that include e

• Edges that lie between communities tend to have high betweenness

M. E. J. Newman and M. Girvan. Finding and evaluating community structure

Page 27: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Girvan-Newman algorithm

• Initialize– Compute betweenness for all edges

• Repeat until convergence criteria1. Remove the edge with the highest betweenness2. Recompute betweenness of affected edges

• Convergence criteria can be– No more edges– Desired modularity.

Page 28: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Evaluating the “modularity” of the clusters

• Given K groups of nodes, we can compute modularity (Q) also as– difference between within group (community) connections and

expected connections within a group

K: number of groupseij: Fraction of total edges that link nodes in group i to group j

Page 29: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Zachary’s karate club study

Each node is an individual and edges represent social interactions among individuals. The shape and colors represent different groups.

Node grouping based on betweenness

Page 30: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Network motifs

• Network motifs are defined as small recurring subnetworks that occur much more than a randomized network

• A subgraph is called a network motif of a network if its occurrence in randomized networks is significantly less than the original network.

• Some motifs are associated to explain specific network dynamics

Milo Science 2002

Page 31: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Network motifs of size three in a directed network

Page 32: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Finding network motifs

• Enumerating motifs– Subgraph enumeration

• Calculating the number of occurrences in randomized networks

Milo 2002

Page 33: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Network motifs found in many complex networks

The occurrence of the feedforward loop in both networks suggests a fundamental similarity in the design on these networks

Page 34: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Structural common motifs seen in the yeast regulatory network

Lee et.al. 2002, Mangan & Alon, 2003

Auto-regulation Multi-component Feed-forward loop

Single Input Multi Input

Regulatory Chain

Feed-forward loops involved in speeding up in response of target gene

Page 35: Network analysis and applications Sushmita Roy BMI/CS 576  sroy@biostat.wisc.edu Dec 2 nd, 2014

Summary of network analysis

• Given a network, its topology can be characterized using different measures– Degree distribution– Average path length– Clustering coefficient

• Degree distribution can be– Poisson– Power law

• Such networks are called scale free

• Network modularity– Clustering coefficient– Edge betweennness

• Network motifs– Overrepresentation of subgraphs of specific types