Fatchiyah, M.Kes., Ph.D.
Email: fatchiya@yahoo.co.id
BIOINFORMATICS AND COMPUTATIONAL BIOLOGY IN MOLECULAR BIOLOGY
Introduction
The Human Genome Project
Challenges of Molecular Biology
The changing role of the Biologist in the Age of Information
Bioinformatics software
Genomics
Impact on medicine
What genes cause the condition?
What is the normal function of the gene?
What mutations have been linked to diseases?
How does the mutation alter gene function?
What laboratories are performing DNA tests?
Are there gene therapies or clinical trials?
What names are used to refer to the genes and the diseases?
What other conditions are linked to these same genes?
2/28/2011
3
fatchiyah JB UB Bioinformatic
What is “bioinformatics”?
Bioinformatics:
The use of computers to collect, analyze, and interpret biological information at the molecular level.
"The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information."
A set of software tools for molecular sequence analysis
NIH Working Definition
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. http://www.bisti.nih.gov/CompuBioDef.pdf
Bioinformatics
Interdisciplinary approach
  Computer science, Mathematics & Statistics
  Molecular biology, Biochemistry & Medicine
Scopes of bioinformatics
  Genomics
    Primarily sequences (DNA and RNA)
    Databanks and search algorithms
    Supports studies of molecular evolution (“Tree wars”)
  Proteomics
    Sequences (protein) and structures
    Mass spectrometry, X-ray crystallography
    Databanks, knowledge bases, visualization
  Functional Genomics (transcriptomics)
    Microarray data
    Databanks, analysis tools, controlled terminologies
  Systems Biology (metabolomics)
    Metabolites and interacting systems (interactomics)
    Graphs, visualization, modeling, networks of entities
DNA → RNA → Protein → Phenotype
What is the function of bioinformatics analysis?
1. Genome sequence: for the first time there is a blueprint of the activity of a cell.
2. Gene expression, in the form of cDNA arrays and proteomic studies: how these genes interact and interface with each other, and how they form networks.
3. At the structural level: the mechanism by which these molecules work.
Major impact on diagnosis, treatment, drug discovery, regulation and metabolism, and biodegradation.
Analysis development: In Silico, In Vitro, In Vivo
MESDAMESETMESSRSMYN
AMEISWALTERYALLKINCAL
LMEWALLYIPREFERDREVIL
MYSELFIMACENTERDIRATV
ANDYINTENNESSEEILIKENM
RANDDYNAMICSRPADNAPRI
MASERADCALCYCLINNDRKI
NASEMRPCALTRACTINKAR
KICIPCDPKIQDENVSDETAVS
WILLWINITALL
Structural Scales (figure labels): Sequence, 3D structure, Complexes, Cell Structures, System Dynamics, Cell, Organism
Blueprint of the gene
Genome sequence: for the first time there is a blueprint of the activity of a cell.
Gene expression, in the form of cDNA arrays and proteomic studies: how these genes interact and interface with each other, and how they form networks.
At the structural level: the mechanism by which these molecules work.
Major impact on diagnosis, treatment, drug discovery, regulation and metabolism, and biodegradation.
Challenges in Computational Biology
February 28, 2011, BBSI Summer School - Iowa State University
1. Obtain the genome of an organism.
2. Identify and annotate genes.
3. Find the sequences, three-dimensional structures, and functions of proteins.
4. Find sequences of proteins that have desired three-dimensional structures.
5. Compare DNA sequences and protein sequences for similarity.
6. Study the evolution of sequences and species.
http://gila.engr.uic.edu/bioinformatics/
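Challenge 5 (comparing sequences for similarity) can be illustrated with the simplest possible measure, percent identity between two sequences that are already aligned. This is only a sketch: the function name and the choice to skip gap columns ("-") are assumptions made for illustration, not a method from the slides.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of identical positions between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') are skipped
    (an assumption of this sketch; other conventions exist).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore gap columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    print(percent_identity("ACGTACGT", "ACGTTCGT"))
```

The same function works for protein sequences, since it only compares characters position by position.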
Bioinformatics
Computational analysis of high-throughput biological data
Whole genome sequencing.
Global genomic expression & profiling.
Functional genomics.
Structural genomics/proteomics.
Comparative genomics.
In the mid-1990s, the GenBank database became part of the International Nucleotide Sequence Database Collaboration:
DDBJ, Mishima, Japan
GenBank, NCBI, USA
EMBL, Europe
International Networking Collaboration
NCBI investigators maintain ongoing collaborations with several institutes within NIH and also with numerous academic and government research laboratories.
www.ncbi.nlm.nih.gov/
http://www.ebi.ac.uk/
Entrez nodes: Nucleotide, Protein, PubMed
The original version of Entrez had just 3 nodes: nucleotides, proteins, and PubMed abstracts.
Entrez has now grown to nearly 20 nodes.
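The three original Entrez nodes can also be queried programmatically. The sketch below only builds a search URL and does not contact NCBI. The endpoint and the parameters `db`, `term` and `retmax` belong to NCBI's E-utilities, which the slides do not mention, so treat them as assumptions of this example; `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumption: not named in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The three original Entrez nodes mapped to E-utilities database names.
ORIGINAL_NODES = {
    "Nucleotide": "nucleotide",
    "Protein": "protein",
    "PubMed": "pubmed",
}


def build_search_url(node: str, term: str, retmax: int = 5) -> str:
    """Build (but do not send) a search URL for one Entrez node."""
    query = urlencode({"db": ORIGINAL_NODES[node], "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


if __name__ == "__main__":
    for node in ORIGINAL_NODES:
        print(build_search_url(node, "H5N1"))
```

Opening one of the printed URLs in a browser should return a list of matching record identifiers for that node.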
The future of genomics rests on the foundation of the Human Genome Project
Genome Databases
The Wellcome Trust
Free unrestricted access for all
The door to discovery is wide open
Genome browsers
Ensembl: www.ensembl.org
University of California Santa Cruz: http://genome.cse.ucsc.edu
European Bioinformatics Institute: www.ebi.ac.uk
MGD, the Jackson Laboratory: www.informatics.jax.org
GenBank: www.ncbi.nlm.nih.gov
DNA Data Bank of Japan: www.ddbj.nig.ac.jp
I. The Human Genome Project
The genome sequence is complete - almost!
approximately 3.2 billion base pairs.
The Flow of Biotechnology Information
DNA (nucleotide sequence) → Protein
Gene expression = Protein production
Gene
> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC
TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA
TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA
ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG
TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA
TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG
GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA
CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC
TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA
ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG
TAAGAAGATCGCGAACATCTAGTAGA
> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNEPDEAEQDCIEFGKKIANI
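The step from the DNA sequence to the protein sequence above can be reproduced with the standard genetic code. The sketch below finds the first ATG and translates codon by codon until a stop codon. The function name and the "first ATG" rule are assumptions made for illustration; real gene annotation is more involved.

```python
from itertools import product

# Standard genetic code in the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_from_first_atg(dna: str) -> str:
    """Translate from the first ATG until a stop codon or the end of the sequence."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":  # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    # First line of the DNA sequence shown on the slide.
    first_line = "AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC"
    print(translate_from_first_atg(first_line))  # MKIVYWSGTGN
```

The output matches the first eleven residues of the protein sequence on the slide; feeding in the complete DNA sequence extends the translation up to the TAG stop codon.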
Genome size
Organism                   Number of base pairs
ΦX174 virus                5,386
Epstein-Barr virus         172,282
Mycoplasma genitalium      580,000
Haemophilus influenzae     1.8 × 10^6
Yeast (S. cerevisiae)      12.1 × 10^6
Human                      3.2 × 10^9
Wheat                      16 × 10^9
Lilium longiflorum         90 × 10^9
Salamander                 100 × 10^9
Amoeba dubia               670 × 10^9
Sanger Technique for Sequencing
A Sequence print-out from a control sample
FASTA Format
>identifier descriptive text
nucleotide or amino-acid sequence, on multiple lines if needed.
Example:
>gi|41|emb|X63129.1|BTA1AT B. taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTCCATCACGCGGGGCCTTCTGCTGCTGGC ….
MOST important data format!!!
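Because the format is this simple, a reader for it fits in a few lines. The sketch below is a minimal parser written for illustration (the name `parse_fasta` and the tuple layout are assumptions); established libraries handle more edge cases.

```python
from typing import Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (identifier, description, sequence) for each FASTA record."""
    records = []
    header = None
    chunks: List[str] = []

    def flush() -> None:
        # Split the defline at the first space: identifier, then free text.
        identifier, _, description = header.partition(" ")
        records.append((identifier, description, "".join(chunks)))

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                flush()
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        flush()
    return records


if __name__ == "__main__":
    example = [
        ">gi|41|emb|X63129.1|BTA1AT B. taurus mRNA for alpha-1-anti-trypsin",
        "GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC",
        "CATCACGCGGGGCCTTCTGCTGCTGGC",
    ]
    for identifier, description, sequence in parse_fasta(example):
        print(identifier, "|", description, "|", len(sequence), "bases")
```

The example record uses only the part of the sequence shown on the slide; the rest is elided there and is not reconstructed here.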
Modified FASTA Format
1) A few tools follow the convention that lowercase sequence is masked (RepeatMasker, some versions of BLAST, MegaBLAST, BLASTZ).
2) A few analysis tools (like CLUSTAL) want a simplified identifier on the defline, so that they have a short string to label the alignment.
>X63129.1
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ….
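Both conventions are easy to handle in code. The sketch below shortens a pipe-delimited defline to its accession and measures how much of a sequence is lowercase (masked). It assumes the `gi|number|database|accession|locus` layout seen in the example above; both function names are hypothetical.

```python
def simplify_defline(defline: str) -> str:
    """Return a short identifier suitable for alignment tools.

    Assumes the 'gi|number|db|accession|locus' layout; any other
    defline is reduced to its first whitespace-delimited token.
    """
    text = defline.lstrip(">").strip()
    token = text.split()[0] if text else ""
    fields = token.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]  # accession.version, e.g. X63129.1
    return token


def masked_fraction(sequence: str) -> float:
    """Fraction of letters written in lowercase, i.e. marked as masked."""
    letters = [c for c in sequence if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.islower() for c in letters) / len(letters)


if __name__ == "__main__":
    defline = ">gi|41|emb|X63129.1|BTA1AT B. taurus mRNA for alpha-1-anti-trypsin"
    print(">" + simplify_defline(defline))
    print(masked_fraction("GACCAGccctgaccTAGG"))
```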
Example query: H5N1
Write the specific name of the gene
Click & Go
GenBank Record-2
GenBank Record
Header fields labeled in the figure: Locus Name, Sequence Length, Molecule Type, GenBank Division, Modification Date, Accession Number, Version Number
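Several of the header fields named above (locus name, sequence length, molecule type, GenBank division, modification date) sit together on the first line of a GenBank flat file, the LOCUS line. The sketch below splits such a line into those fields. The example line follows the layout of NCBI's sample record and is not taken from the slides; `parse_locus_line` is a hypothetical helper, and whitespace splitting is a simplification of the real column-based format.

```python
from typing import Dict, Union


def parse_locus_line(line: str) -> Dict[str, Union[str, int]]:
    """Split a GenBank LOCUS line into its main header fields.

    Simplification: fields are split on whitespace, and an optional
    topology word (linear/circular) before the division is ignored.
    """
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    return {
        "locus_name": tokens[1],
        "sequence_length": int(tokens[2]),
        "molecule_type": tokens[4],
        "division": tokens[-2],
        "modification_date": tokens[-1],
    }


if __name__ == "__main__":
    # Illustrative line in the layout of NCBI's sample record (assumption).
    example = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
    for field, value in parse_locus_line(example).items():
        print(f"{field}: {value}")
```

The accession and version numbers appear on their own ACCESSION and VERSION lines further down the record and are not covered by this sketch.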
Gene Sequence
Searching Biological Databases
BLAST (Basic Local Alignment Search Tool)
http://www.ncbi.nlm.nih.gov
BLASTN (DNA)
BLASTP (Protein)
BLASTX (DNA against Protein)
PSI-BLAST (Position Specific Iterative BLAST)
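The "local alignment" in the BLAST name means finding the best-matching region between two sequences rather than aligning them end to end. BLAST itself is a fast heuristic; the sketch below is not BLAST but the Smith-Waterman dynamic-programming score, the exact calculation that BLAST approximates. The scoring values (match +2, mismatch -1, gap -2) are arbitrary choices for illustration.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = previous[j] + gap        # gap in b
            left = current[j - 1] + gap   # gap in a
            score = max(0, diagonal, up, left)  # 0 restarts the alignment
            current.append(score)
            best = max(best, score)
        previous = current
    return best


if __name__ == "__main__":
    # The shared region ACGT is found even though the flanks differ.
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Only the score is computed here; recovering the aligned region itself would require keeping the full matrix and tracing back from the best cell.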
Multiple Alignment Software
ClustalW (http://www.ebi.ac.uk/clustalw)
MSA (http://softlib.rice.edu/softlib/msa.html)
HMMER (http://hmmer.wustl.edu/)
SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)
Biology Information on the Internet
Introduction to Databases
Searching the Internet for Biology Information.
General Search methods
Biology Web sites
Introduction to GenBank file format.
Introduction to Entrez and PubMed
Ref: Chapters 1,2,5,6 of “Bioinformatics”
RefSeq and LocusLink
Attempt to produce 1 mRNA, 1 protein, and 1 genomic gene for each frequently occurring allele of a protein-expressing gene.
www.ncbi.nlm.nih.gov/LocusLink
Special non-GenBank accession numbers
NM_nnnnnn    mRNA RefSeq
NP_nnnnnn    protein RefSeq
NC_nnnnnn    RefSeq genomic contig
NT_nnnnnn    temporary genomic contig
NX_nnnnnn    predicted gene
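The two-letter prefix is enough to tell what kind of record an accession refers to. The sketch below looks the prefix up in a table copied from the list above. The accession used in the example is a made-up placeholder, `classify_refseq` is a hypothetical name, and accepting any number of digits plus an optional version suffix is an assumption.

```python
import re
from typing import Optional

# Prefix meanings exactly as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "mRNA RefSeq",
    "NP": "protein RefSeq",
    "NC": "RefSeq genomic contig",
    "NT": "temporary genomic contig",
    "NX": "predicted gene",
}

# Two-letter prefix, underscore, digits, optional ".version" (assumption).
ACCESSION_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.\d+)?$")


def classify_refseq(accession: str) -> Optional[str]:
    """Return the record type for a RefSeq-style accession, or None."""
    found = ACCESSION_PATTERN.match(accession.strip())
    if not found:
        return None
    return REFSEQ_PREFIXES.get(found.group(1))


if __name__ == "__main__":
    # NM_123456 is a placeholder, not a real record.
    for accession in ["NM_123456", "NP_123456.1", "X63129.1"]:
        print(accession, "->", classify_refseq(accession))
```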
Ensembl Overview: searching and browsing
Background
HGP draft sequence of the human genome
Jigsaw puzzle: assembling semi-accurate sequences coming from all over the world
Ensembl
EMBL-EBI and Sanger Institute
Automatic system to track sequenced pieces (and their changes), assemble them into larger stretches, and analyze them to find genes
Ensembl Data
Annotation: features identified on each DNA sequence
Examples:
genes: known genes or genes predicted by Ensembl
SNPs (single nucleotide polymorphisms)
repeats (regions of simple repetitive sequence)
homologies (regions highly similar to other sequences in the public databases)
Data Access
A web-based genome browser which can be customized as required
A web-based system for data export and data mining
'Dumps' of sequence and other data sets for you to download
Direct access to the databases
A Perl-based object layer
Gene Prediction
Finding genes: Genscan, a gene-finding program
Comparison with all known genes: matches are considered supporting evidence
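To give a feel for what "finding genes" means computationally, the sketch below scans the three forward reading frames for open reading frames (ATG through a stop codon). This is a deliberately naive stand-in, far simpler than what a real gene finder or the Ensembl pipeline does; the function name and the minimum length are assumptions.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) of forward-strand ORFs; end is exclusive and includes the stop."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close it and keep scanning this frame
    return sorted(orfs)


if __name__ == "__main__":
    sequence = "CCATGAAATTTGGGTAACC"
    for start, end in find_orfs(sequence):
        print(start, end, sequence[start:end])
```

Candidate regions found this way would then be compared with known genes, in the spirit of the supporting-evidence step described above.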
Ensembl Services
BLAST
Sequence browsing
Identifier search
Known gene names
OMIM diseases
Free text search of OMIM, SWISS-PROT, InterPro annotation
Sequence Browsing
Chromosome-level
Contig View (Detailed)
Contig View (Customization)
Marker View
Ensembl Gene Report