Page 1

A Brief Introduction to Several Computational Problems in Bioinformatics

林魁 (Lin Kui)

Laboratory of Computational Molecular Biology [http://cmb.bnu.edu.cn]

College of Life Sciences, Beijing Normal University

Page 2

Outline

Introduction to bioinformatics

Genome annotation

Genome evolution

Metagenomics

Acknowledgements

Page 3

From U.S. Department of Energy Human Genome Program. http://www.ornl.gov/hgmis

Page 4

Is Biology an Informational Science?

The HGP changed how we view & practice biology.

Biology is an informational science.

Digital genome

Environmental signals

Biology has become a cross-disciplinary science.

Page 5

Bioinformatics as an intersecting discipline

Mathematical sciences Computer science

Life sciences

Developing the high-throughput technologies and computational/mathematical tools required for this new biology.

Page 6

Why? Where? What? How?

• Why: What are these huge datasets produced for? Biological background is needed.

• Where: Raw data need to be stored; IT platforms are required.

• What: Patterns in datasets that can be analyzed using computers. Various data models and their respective algorithms are needed.

• How: Different resources need to be integrated.

Page 7

What is Bioinformatics?

• The field of biology specializing in developing hardware and software to store and analyze the huge amounts of data being generated by life scientists. (NIH)

• More than 20 different definitions can be found on Google!

• Computational Biology?

• Computational Molecular Biology?

Data integration

Various molecular biology databases

Bioinformatics applications

Data analysis

Page 8

Key Challenge of Bioinformatics

The world of biology is very different from what it was even ten years ago.

To bridge the considerable gap between technical data production and its use by scientists for biological discovery.

Page 9

Schematic platform for bioinformatics applications

Clients: browsers, light applications

Network: intranet and/or Internet

Servers: WWW servers, database servers, intensive computing servers

Technologies: HTML/XML; Perl/C/C++/Java (BioPerl); MySQL; bioinformatics tools; statistical analysis (R); HPC with MPI

Page 10

Next-generation genome sequencing (NGS) technologies

Sanger (dideoxynucleotide method, 3730xl): 100,000 bp

Pyrosequencing: 100 million bp

Sequencing by synthesis: 1.5 billion bp

Sequencing by ligation: 2 billion bp

Page 11

Technology | Read length | Throughput | Sequencing cost
Sanger | Long | Low | High
454 | Medium | Medium | Medium
(unnamed) | Short | High | Low

Trade-off between read length and sequencing cost

With the new technology:

• New scientific questions emerge.

• Existing questions can be answered in a way that was not considered before.

Page 12

Sequence assembly by the shotgun approach

• The master sequence is reconstructed from short sequences, simply by examining the sequences for overlaps.

• No prior knowledge of the genome is needed.
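
The overlap idea can be illustrated with a minimal greedy sketch. This is a toy under stated assumptions, not the assembler used in any real project: the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the three reads are all hypothetical, and error-free reads from a single strand are assumed.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining overlaps of at least min_len
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical error-free reads from one strand.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
```

Greedy merging is easy to follow but can misassemble repeated regions, which is one reason practical assemblers rely on graph-based formulations instead.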

Page 13

CGGTTGAAAGCGGTAGCGTCCATGCGTATTACTCTTGAGCGGTCGAACCTTCTGAAATCGCTGAACCACGTCCACCGGGTCGTCGAGCGTCGCAACACGATCCCGATCCTGTCCAACGTTCTGCTGCGCGCCTCCGGCGCCAATCTGGACATGAAGGCGACCGACCTCGATCTGGAAATCACCGAAGCGACCCCGGCCATGGTGGAGCAGGCTGGCGCCACCACCGTACCGGCACACCTGCTTTACGAAATCGTGCGCAAGCTGCCGGATGGTTCCGAAGTGCTTCTGGCGACCAACCCGGACGGCTCCTCCATGACCGTTGCGTCCGGCCGCTCGAAATTCTCGCTGCAATGCCTGCCGGAAGCGGATTTCCCTGACCTCACCGCCGGCACCTTCAGCCACACCTTCAAACTGAAGGCGGCCGATCTGAAGATGCTGATCGACCGGACGCAGTTTGCGATTTCGACCGAAGAGACGCGTTATTACCTGAACGGCATTTTCTTCCACACCATCGAAAGCAATGGCGAGCTGAAACTGCGCGCCGTCGCCACCGACGGTCACCGCCTTGCGCGTGCTGACGTCGATGCGCCCTCCGGCTCCGAAGGCATGCCGGGCATCATCATTCCGCGCAAGACCGTCGGTGAACTGCAGAAGCTGATGGACAATCCGGAACTGGAAGTCACAGTCGAAGTCTCGGATGCGAAGATCCGCCTGGCCATCGGTTCCGTCGTTCTGACCTCGAAGCTGATCGACGGCACCTTTCCCGATTATCAGCGCGTCATCCCAACCGGCAACGACAAGGAAATGCGCGTCGATTGCCAGACCTTCGCCCGGGCAGTGGACCGTGTTTCGACGATTTCTTCCGAGCGCGGCCGCGCCGTGAAGCTGGCGCTAACTGACGGCCAGTTGACGCTGACCGTCAACAATCCCGACTCGGGAAGTGCTACCGAAGAAGTGGCCGTTGGCTACGACAATGATTCGATGGAAATCGGCTTCAATGCCAAATATCTCCTCGACATCACGTCGCAGCTCTCCGGCGAAGATGCGATTTTTCTGCTGGCGGATGCGGGTTCGCCAACACTGGTTCGCGATACCGCCGGCGACGACGCACTCTATGTTCTGATGCCGATGCGCGTTTAAAACCGACCGTTTTCTTCAATTTTTCCAGAAACGCCGGTGGATCGCTTCATCGGCGTTTTTTGATTCGGCGAACAGGTGGCTCTACCCGTAACTGAATTTTCTCAGTTACGACATTTTGCCTTGTTTTTGCGCCAAATGGGATCAACAGTACGTAACAATTTTTTGACAATGACCAATACATCCGAGGGGAATCATGGCACTCAACCTGAAGCAACGGCTTGAACAAAAATTTGAGGAAGAAATCCGCTTTTTCAAAGGTATGGTCAGCCAGCCGAAAAAAGTCGGCGCCATTGTCCCGACGGTTCCGTCGTTCTGACCTCGAAGCTGATCGACGGCACCTTTCCCGATTATCAGCGCGTCATCCCAACCGGCAACGACAAGGAAATGCGCGTCGATTGCCAGACCTTCGCCCGGGCAGTGGACCGTGTTTCGACGATTTCTTCCGAGCGCGGCCGCGCCGTGAAGCTGGCGCTAACTGACGGCCAGTTGACGCTGACCGTCAACAATCCCGACTCGGGAAGTGCTACCGAAGAAGTGGCCGTTGGCTACGACAATGATTCGATGGAAATCGGCTTCAATGCCAAATATCTCCTCGACATCACGTCGCAGCTCTCCGGCGAAGATGCGATTTTTCTGCTGGCGGATGCGGGTTCGCCAACACTGGTTCGCGATACCGCCGGCGACGACGCACTCTATGTTCTGATGCCGATGCGCGTTTAAAACCGACCGTTTTCTTCAATTTTTCCAGAAACGCCGGTGGATCGCTTCATCGGCGTTTTTTGATTCGGCGAACAGGTGGCTCTACCCGTAACTGAATTTTCTCAGTTACGACATTTTGCCTTGTTTTTGCGCCAAATGGG
ATCAACAGTACGTAACAATTTTTTGACAATGACCAATACATCCGAGGGGAATCATGGCACTCAACCTGAAGCAACGGCTT

Where is a gene?
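
One naive first pass at this question is to scan for open reading frames (ORFs). The sketch below is only an illustration: `find_orfs`, `min_codons`, and the toy sequence are hypothetical, only the forward strand is scanned, and real gene finders use far richer signals than start and stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (start, end, orf) for ATG...stop stretches on the forward strand.

    start is 0-based and end is exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Hypothetical toy sequence with a single ORF in the third reading frame.
toy = "CCATGGCACTCAACCTGTAAGG"
orfs = find_orfs(toy)
```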

Page 14

Three Layers of Genome Annotation. From Stein, L. 2001. Nature Reviews Genetics 2:493-503

1. Transcriptional control sites

2. Non-coding RNAs

3. mRNAs

• Evidence-based prediction

• Cis-alignment

• Trans-alignment

• Ab initio / de novo prediction

Page 15

Page 16

Some models

• Dynamic programming

• Hidden Markov Models (HMMs)

• Conditional random fields (CRFs)

• Support vector machines (SVMs)
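
As a small illustration of how two of these models fit together, the Viterbi algorithm uses dynamic programming to find the most probable state path of an HMM. The sketch below assumes a hypothetical two-state model (N for non-coding, C for coding) with made-up probabilities; it is not a trained gene-finding model.

```python
import math

STATES = ("N", "C")  # hypothetical: N = non-coding, C = coding
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {  # made-up emission probabilities: state C favors G and C
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq: str) -> str:
    """Most probable state path for seq, computed in log space."""
    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


path = viterbi("ATATATGCGCGCGC")
```

Working in log space avoids numerical underflow on long sequences, which matters once the input is a genome rather than a 14-base toy.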

Page 17

Cucumber Genome Annotation Project

The Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences

Laboratory of Computational Molecular Biology, Beijing Normal University

Page 18

CMB genome annotation pipeline

Inputs: genomic sequence, ESTs/cDNAs, UniProt proteins

Tools: RepeatMasker, EVM, PseudoPipe, Rfam

Outputs: repeats, protein-coding genes, pseudogenes, ncRNA genes

Functional annotation: protein homology; domain annotation (InterProScan); mapping to Gene Ontology; mapping to KEGG

Storage: MySQL DBMS

WWW service + visualization: GBrowse

Page 19

GBrowse

Page 20

Depending on the state of the sequencing project, genomic coordinates along a chromosome may change dramatically from assembly to assembly.

Page 21

Phylogenetics

• Evolutionary theory states that groups of similar organisms are descended from a common ancestor.

• Phylogenetic systematics (cladistics) is a method of classifying organisms based on their evolutionary history.

• It was developed by Willi Hennig, a German entomologist, in 1950.

Page 22

Major reasons to use phylogenetics

Understand the lineage of different species

Organizing principle to sort species into a taxonomy

Understand how various functions evolved

Understand forces and constraints on evolution

Perform multiple sequence alignment

Predict gene function (phylogenetic footprint)

Page 23

Species/Gene Trees

Species tree (how are my species related?)

• Contains only one representative from each species.

• When did speciation take place?

• All nodes indicate speciation events.

Gene tree (how are my genes related?)

• Often contains a number of genes from a single species.

• Nodes relate either to speciation or gene duplication events.
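
One simple heuristic for telling the two kinds of gene-tree node apart is the species-overlap rule: a node is called a duplication if its two subtrees share at least one species, and a speciation otherwise. The sketch below assumes a hypothetical nested-tuple tree encoding and made-up gene names; full reconciliation against a species tree is a more rigorous alternative.

```python
def leaf_species(tree) -> set[str]:
    """Collect the species found under a node.

    A leaf is a (gene_name, species) tuple of two strings; an internal
    node is a (left, right) tuple of two subtrees.
    """
    if isinstance(tree[0], str):
        return {tree[1]}
    return leaf_species(tree[0]) | leaf_species(tree[1])


def label_events(tree, events=None) -> list[str]:
    """Label each internal node, in post-order, by the species-overlap rule."""
    if events is None:
        events = []
    if isinstance(tree[0], str):
        return events  # leaves carry no event
    left, right = tree
    label_events(left, events)
    label_events(right, events)
    shared = leaf_species(left) & leaf_species(right)
    events.append("duplication" if shared else "speciation")
    return events


# Hypothetical gene family: two paralogs (A and B), each present in two species.
gene_tree = (
    (("geneA1", "human"), ("geneA2", "mouse")),
    (("geneB1", "human"), ("geneB2", "mouse")),
)
events = label_events(gene_tree)
```

Here the two inner nodes separate human from mouse and are labeled speciation, while the root joins two subtrees that both contain human and mouse and is labeled duplication.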

Page 24

Species tree

Page 25

Phylogenomics: Genome trees

Explore genome evolution based on large data sets of DNA or protein sequences.

Using entire genomes to infer a species tree (Eisen and Fraser 2003).

Uses the maximum amount of genetic information and averages out the anomalies.

Has become the standard for reconstructing reliable phylogenies (Ciccarelli et al. 2006; Daubin et al. 2002).
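
One common way to pool information across many genes is to concatenate per-gene alignments into a single supermatrix and compare taxa over all sites. The sketch below is an assumption-laden illustration, not necessarily the procedure of the cited studies: `concatenate`, `p_distance`, the taxon names, and the toy alignments are hypothetical.

```python
from itertools import combinations


def concatenate(alignments: list[dict[str, str]]) -> dict[str, str]:
    """Join per-gene alignments into one supermatrix, padding absent taxa with gaps."""
    taxa = sorted({t for aln in alignments for t in aln})
    matrix = {t: "" for t in taxa}
    for aln in alignments:
        width = len(next(iter(aln.values())))
        for t in taxa:
            matrix[t] += aln.get(t, "-" * width)
    return matrix


def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites among columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return float("nan")
    return sum(x != y for x, y in pairs) / len(pairs)


# Hypothetical aligned genes; sp3 lacks the second gene.
gene1 = {"sp1": "ACGT", "sp2": "ACGA", "sp3": "TCGA"}
gene2 = {"sp1": "GGCC", "sp2": "GGCC"}
matrix = concatenate([gene1, gene2])
distances = {
    (a, b): p_distance(matrix[a], matrix[b])
    for a, b in combinations(sorted(matrix), 2)
}
```

The resulting distance table could feed a distance-based tree method; likelihood-based inference on the supermatrix is the other usual route.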

Page 26

Phylogenomics and the tree of life

From Delsuc, F., et al. (2005) Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6, 361-375

Page 27

From Delsuc, F., et al. (2005) Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6, 361-375

Page 28

Taxonomic resolution of some of the novel approaches

Creating ever-more robust phylogenies on the basis of diverse data sets.

Page 29

Evolutionary theory is evolving

Page 30

100 trillion microbial cells

Page 31

The dominant form of life on Earth

~1,000 Gbp of microbial genome sequences per gram of soil!

Page 32

Metagenomics offers a way forward

• Who is out there?

• What are they doing?

What is being done by the community?

Why genomics is not enough

• Most microbes cannot be cultured

• Microbial diversity and variation have no limits

Definition: Both a set of research techniques & a research field.

Page 33

Difference between metagenomics & microbial genomics

Page 34

16S rRNA-based analysis methods: fast, efficient, and widely used

"Who is out there?" Developed and refined over many years (Olsen et al. 1986)

Currently undergoing a renaissance (Tringe & Hugenholtz 2008)

High conservation of 16S rRNA; variable regions

Databases containing large numbers of rRNA gene sequences (>200,000) (Cole et al. 2005; Medini et al. 2008); encountering the limitations of existing tools

hsp60 gene: more sensitive than SSU rRNA
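
A toy version of the "who is out there?" step is to assign each 16S read to the reference sequence with which it shares the most k-mers. Everything in the sketch is hypothetical (the function names, `k = 4`, the two reference fragments, and the read), and production classifiers use curated databases and statistical models rather than a raw shared-k-mer count.

```python
from collections import Counter


def kmer_profile(seq: str, k: int = 4) -> Counter:
    """Count all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a: Counter, b: Counter) -> int:
    """Number of k-mer occurrences common to both profiles."""
    return sum((a & b).values())


def classify(read: str, references: dict[str, str], k: int = 4) -> str:
    """Assign read to the reference taxon sharing the most k-mers with it."""
    profile = kmer_profile(read, k)
    return max(
        references,
        key=lambda name: shared_kmers(profile, kmer_profile(references[name], k)),
    )


# Hypothetical reference fragments; a real analysis would use a curated rRNA database.
references = {
    "taxon_A": "ACGTACGTGGCCTTAAGGCC",
    "taxon_B": "TTGACCATGCATGCAAGTCG",
}
label = classify("GTGGCCTTAAGG", references)
```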

Page 35

Cartoon of the general structure of the bacterial 16S rRNA gene

Who is there?

Page 36

Analysis of diversity in the human gut microbial community based on surveys of a limited number of humans.

"strain-level"

"species"

"genus"

"family"

Page 37

Broad-range PCR amplification and sequencing of microbial 16S rRNA genes

Page 38

Microbial diversity in environmental samples

We need to understand the underlying evolutionary and ecological mechanisms.

Why? Clusters of very closely related sequences at the tips of phylogenetic trees are separated by relatively long branches.

Page 39

Functional analysis of complex microbial communities using environmental gene tags (EGTs)

Page 40

From Turnbaugh et al. 2009. A core gut microbiome in obese and lean twins. Nature

Relative abundance of major phyla and relative abundance of categories of function.
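
Relative abundance is simply each category's share of the total count. The sketch below also computes the Shannon index, which is an added, commonly used diversity summary and not something the slide reports; the phylum counts are hypothetical and are not taken from Turnbaugh et al.

```python
import math


def relative_abundance(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw counts per taxon into fractions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def shannon_index(counts: dict[str, int]) -> float:
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero counts."""
    return -sum(
        p * math.log(p) for p in relative_abundance(counts).values() if p > 0
    )


# Hypothetical read counts per phylum for one sample.
sample = {"Firmicutes": 600, "Bacteroidetes": 300, "Actinobacteria": 100}
fractions = relative_abundance(sample)
diversity = shannon_index(sample)
```

The same two functions apply unchanged to counts per functional category instead of per phylum.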

Page 41

Pathways and subnetworks reflect the adaptation of microbial communities across environments and habitats.

From Gianoulis et al. 2009. PNAS

Page 42

Resolving strain-level heterogeneity

Sequence divergence

Gene content difference

Multiple strain sequence types

Gene insertion

Gene rearrangement

Allen & Banfield (2005) Community genomics in microbial ecology and evolution. Nat Rev Microbiol, 3, 489-498

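Page 43

Planned work

• Comparative analysis of community genomes across different habitats (or samples), to clarify specifically how changes in key environmental factors lead to variation in community composition;

• Combining gene expression, metabolic, and other data to explore the relationships between the genomes of different communities and major ecosystem processes (such as nitrogen fixation, carbon degradation, denitrification, and ammonium removal by anaerobic microorganisms).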

Page 44

Acknowledgements

Beijing Normal University

All members of LCMB, BNU

http://cmb.bnu.edu.cn

Page 45

Comments and Suggestions?