Data, Bioinformatics and Genomics Roderic Guigó, Center for Genomic Regulation, Barcelona

Data, Bioinformatics and Genomics · 2015. 6. 15. · 1996: Bermuda principles The goal of the agreement was to provide a basis for a free sharing of pre-published data on gene sequences


Data, Bioinformatics and Genomics

Roderic Guigó,

Center for Genomic Regulation, Barcelona


ACTCAGCCCCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATGAGCTCAGGGGCCTCTAGAAAGATGTAGCTGGGACCTCGGGAAGCCCTGGCCTCCAGGTAGTCTCAGGAGAGCTACTCAGGGTCGGGCTTGGGGAGAGGAGGAGCGGGGGTGAGGCCAGCAGCAGGGGACTGGACCTGGGAAGGGCTGGGCAGCAGAGACGACCCGACCCGCTAGAAGGTGGGGTGGGGAGAGCATGTGGACTAGGAGCTAAGCCACAGCAGGACCCCCACGAGTTGTCACTGTCATTTATCGAGCACCTACTGGGTGTCCCCAGTGTCCTCAGATCTCCATAACTGGGAAGCCAGGGGCAGCGACACGGTAGCTAGCCGTCGATTGGAGAACTTTAAAATGAGGACTGAATTAGCTCATAAATGGAAAACGGCGCTTAAATGTGAGGTTAGAGCTTAGAATGTGAAGGGAGAATGAGGAATGCGAGACTGGGACTGAGATGGAACCGGCGGTGGGGAGGGGGAGGGGGTGTGGAATTTGAACCCCGGGAGAGAAAGATGGAATTTTGGCTATGGAGGCCGACCTGGGGATGGGGAAATAAGAGAAGACCAGGAGGGAGTTAAATAGGGAATGGGTTGGGGGCGGCTTGGTAACTGTTTGTGCTGGGATTAGGCTGTTGCAGATAATGGAGCAAGGCTTGGAAGGCTAACCTGGGGTGGGGCCGGGTTGGGGTCGGGCTGGGGGCGGGAGGAGTCCTCACTGGCGGTTGATTGACAGTTTCTCCTTCCCCAGACTGGCCAATCACAGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGGCAGGTATGGGGCGGGGCTTGCTCGGTTTTCCCCGCTTCTCCCCCTCTCATCCTCACCTCAACCTCCTGGCCCCATTCAAGCACACCCTGGGCCCCCTCTTCTTCTGCTGGTCTGTCCCCTGAGGGGAAAGCCCAGGTCTGAGGCTTCTATGCTGCTTTCTGGCTCAGAACAGCGATTTGACGCTCTGTGAGCCTCGGTTCCTCCCCCGCTTTTTTTTTTTCAGCCAGAGTCTCACTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAAGCTCCGCCTCCCGGGTTCACGCTATTCTCCCGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCATGCCCGGCTAATTTTTTGTACTTTGAGTAGGGAAGGGGTTTCACTGTATTATCCAGGATGGTCTCTATCTCCTGACCTCGTGATCTGCCCGCCTGGCCTCCCAAAGTGCTGGAATTACAGGCGTGAGCCTCCGCGCCCGGCCTCCCCATCCTTAATATAGGAGTTAGAAGTTTTTGTTTGTTTGTTTTGTTTTGTTTTTGTTTTGTTTTGAGATGAAGTCCCTCTGTCGCCCAGGCTGGAGTGCAGTGGCTCCCAGGCTGGAGTTCAGTGGCTGGATCTCGGCTCACTGCAAGCTCCGCCTCCCAGGTTCACGCCATTCTCCTGCCTCAGCCTCCGGAGTAGCTGGGACTACAGGAACATGCCACCACACCCGACTAACTTTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTTGGCCAGGCTGGTCTGGAACTCCTGACCTCAGGTGATCTGCCTGCTTCAACCTCCCAAAGTGCTGGGATTACAGACGTGGGCCACCGCGCCCGGCTGGGAGTTAAGAGGTTTCTAATGCATTGCATTAGAATACCAGACACGGGACAGCTGTGATCTTTATTCTCCATCACCCCACACAGCCCTGCCTGGGGCACACAAGGACACTCAATACACGCTTTTCGGGCGCGGTGGCTCAAGCTGTAATCCCAGCACTTTGGGAGGCTGAGGCGGGTGGTACATGAGGTCAGGAGATCGAGACCATCCTGGCTAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAA
AACTAGCCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGTGACACAGCGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATACACGCTTTTCCGCTAGGCACGGTGGCTCACCCCTGTAATCCCAGCATTTTGGGAGGCCAAGGTGGGAGGATCACTTGAGCCCAGGAGTTCAACACCAGACTCAGCAACATAGTGAGACTCTCTCTACTAAAAATACAAAAATTAGCCAGGCCTGGTGCCACACACCTGTGGTCCCAGCTACTCAGAAGGCTAAGGCAGGAGGATCGCTTAAGCCCAGAAGGTCAAGGTTGCAGTGAACCACGTTCAGGCCACTGCAGTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTGTAAATAAATAACGCTTTTCAAGTGATTAAACAGACTCCCCCCTCACCCTGCCCACCATGGCTCCAAAGCAGCATTTGTGGAGCACCTTCTGTGTGCCCCTAGGTACTAGCTGCCTGGACGGGGTCAGAAGGAACCTGAACCACCTTCAACTTGTTCCACACAGGATGCCAGGCCAAGGTGGAGCAACCGGTGGAGCCAGAGACAGAACCCGACGTTCGCCAGCAGGCTGAGTGGCAGAGCGGCCAGCCCTGGGAGCTGGCACTGGGTCGCTTTTGGGATTACCTGCGCTGGGTGCAGACACTGTCTGAGCAGGTGCAGGAGGAGCTGCTCAGCC


Bioinformatics in Medline


1953


MALWTRLRPLLALLALWPPPPARAFVNQHLCGS

HLVEALYLVCGERGFFYTPKARREVEGPQVGAL

ELAGGPGAGGLEGPPQKRGIVEQCCASVCSLYQ

LENYCN


ENIAC

Late 40s: first digital computers


1965


Sequence alignment and comparison


The total number of possible alignments between two sequences of length 100 is approximately 10^200.

With dynamic programming (DP), the number of operations required to obtain the optimal alignment is approximately 3 x 100^2.

Query: 25 IPREVIERLARSQIHSIRDLQRLLEIDSVGSEDSLDTSLRAHGVHATKHVPEKRPLPIRR 84

IP E+ + L+ I S DLQRLL+ DS G ED + L H+ + R

Sbjct: 10 IPEELYKMLSGHSIRSFDDLQRLLQGDS-GKEDGAELDLNMTRSHSGGELESLA----RG 64

Query: 85 KRSI------EEAVPAVCKTRTVIYEIPRSQVDPTSANFLIWPPCVEVKRCTGCCNTSSV 138

KRS+ E A+ A CKTRT ++EI R +D T+ANFL+WPPCVEV+RC+GCCN +V

Sbjct: 65 KRSLGSLSVAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNV 124

Query: 139 KCQPSRVHHRSVKVAKVEYVRKKPKLKEVQVRLEEHLECAC 179

+C+P++V R V+V K+E VRKKP K+ V LE+HL C C

Sbjct: 125 QCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKC 165

DYNAMIC PROGRAMMING,

Needleman and Wunsch, 1970

Smith and Waterman, 1981

1970: Optimal sequence alignment
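
The operation count quoted for DP on this slide (about 3 x 100^2 for two sequences of length 100) can be illustrated with a minimal sketch of global alignment scoring in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions made for illustration; they are not taken from the slides. The counter tallies the three candidate scores examined per matrix cell.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return (optimal global alignment score, candidate scores evaluated).

    Scoring parameters are illustrative assumptions, not values from the slides.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    evaluated = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap inserted in b
            left = score[i][j - 1] + gap    # gap inserted in a
            score[i][j] = max(diagonal, up, left)
            evaluated += 3                  # three candidates per cell
    return score[n][m], evaluated


if __name__ == "__main__":
    best, ops = needleman_wunsch_score("A" * 100, "A" * 100)
    print(best, ops)
```

For two sequences of length 100 the counter reaches 3 x 100 x 100 = 30,000, which is the quadratic cost the slide contrasts with the astronomically large number of possible alignments.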


1982: the first electronic databases


FASTA, 1982: Wilbur and Lipman, 1985: Lipman and Pearson

BLAST, 1990: Altschul, Gish, Miller, Myers and Lipman

1982: accelerating database searches with hash methods

Query Sequence

Position:  1  2  3  4  5  6  7  8  9 10 11 12 13
Residue:   W  A  T  S  N  A  N  D  C  R  I  C  K

Hash table K=1 (residue: positions in the query)

A: 2, 6
C: 9, 12
D: 8
I: 11
K: 13
N: 5, 7
R: 10
S: 4
T: 3
W: 1

http://www.ccl.rutgers.edu/~ouyang/5020/FASTA-BLAST.ppt
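
The hash table on this slide can be rebuilt with a short sketch: each k-letter word in the query is mapped to the 1-based positions where it starts. The function names `build_kmer_index` and `find_seeds` are hypothetical, and the seed lookup is only a simplified illustration of the idea behind hash-based searching, not the actual FASTA or BLAST implementation.

```python
from collections import defaultdict


def build_kmer_index(sequence, k=1):
    """Map every k-mer in `sequence` to the 1-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start + 1)
    return dict(index)


def find_seeds(index, target, k=1):
    """Return (query_position, target_position) pairs sharing a k-mer.

    Each k-mer of the target is looked up directly in the index instead of
    being compared against every position of the query.
    """
    seeds = []
    for start in range(len(target) - k + 1):
        for query_pos in index.get(target[start:start + k], []):
            seeds.append((query_pos, start + 1))
    return seeds


if __name__ == "__main__":
    index = build_kmer_index("WATSNANDCRICK", k=1)
    for residue in sorted(index):
        print(residue, index[residue])
    print(find_seeds(index, "CAT", k=1))
```

With K=1 the index reproduces the table above (for example, A at positions 2 and 6). Larger values of k make each lookup more selective, which is the trade-off such methods tune between speed and sensitivity.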


Search of the Platelet Derived Growth Factor sequence

1982, Doolittle: relationship between oncogenes and growth factors


1990: The Human Genome Project

THE HUMAN GENOME PROGRAM (HGP) is producing large quantities of complex map and DNA sequence data. Informatics projects in algorithms, software, and databases are crucial in accumulating and interpreting these data in a robust and automated fashion at genome and sequencing centers.

Computer systems play essential roles in all aspects of genome research, from data acquisition and analysis to data management. Without powerful computers and appropriately designed data-management systems, high-volume genome research cannot proceed.


Human Genome Project Milestones


Bioinformatics in Medline


Bioinformatics

Google search: X-informatics (June 4, 2015)

bioinformatics: 24,500,000
chemoinformatics: 275,000
astroinformatics: 27,800
neuroinformatics: 331,000
socioinformatics: 14,100
geoinformatics: 548,000
meteoinformatics: 146
econoinformatics: 2,010
ecoinformatics: 92,800


GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGT

CGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCG

AAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGA

GAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGA

CGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTG

GTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGT

MALWTRLRPLLALLALWPPPPARAFVNQHLCGSHLVEALYLVCGERGFFY

TPKARREVEGPQVGALELAGGPGAGGLEGPPQKRGIVEQCCASVCSLYQLENYCN


Openness

• Open Source Software

• Open Access to Genome Information


Bioinformatics & Unix

• The early development of bioinformatics was strongly linked to the UNIX operating system, and GNU tools became very popular among bioinformatics programmers and developers: emacs, gcc, bash, gawk


1996: Bermuda principles

The goal of the agreement was to provide a basis for a free sharing of pre-published data on gene sequences among scientists.

• Automatic release of sequence assemblies larger than 1 kb (preferably within 24 hours).

• Immediate publication of finished annotated sequences.

• Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.


2003: Fort Lauderdale principles

• Large-insert clone-based projects: DNA sequence assemblies of 2 kb or greater are to be deposited in a public nucleotide sequence database (GenBank, EMBL or DDBJ) within 24 hours of generation. Sequence traces from these projects are to be deposited in a trace archive (NCBI Trace Repository or Ensembl Trace Server) within one week of production.

• Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria.


2008: ENCODE Project

• NHGRI has designated the Encyclopedia of DNA Elements (ENCODE) and model organism ENCODE (modENCODE) Projects as community resource projects to accelerate access to and use of the data by the entire scientific community.

• Resource users are asked to respect the ability of the producers to publish an initial analysis of the data they have generated in a timely manner. To facilitate this compromise between unrestricted use of the data and unavailability of the data until publication, NHGRI will promote observation of a 9-month period during which resource users may freely use the ENCODE/modENCODE data to design and carry out their own research programs, but not to submit publications that use unpublished ENCODE/modENCODE data without prior consent.


2015 NIH Genomic Data Sharing policy

• Investigators should submit large-scale human genomic data as well as relevant associated data (e.g., phenotype and exposure data) to an NIH-designated data repository in a timely manner.

• In general, NIH will release data submitted to NIH-designated data repositories no later than six months after the initial data submission begins, or at the time of acceptance of the first publication, whichever occurs first, without restrictions on publication or other dissemination.


2005: A revolution in sequencing technologies

• The human genome project

– 12-15 years

– 5 large centers: hundreds of instruments

– Hundreds of scientists worldwide

– $3,000 million

• 2015

– A single instrument: 50 genomes / day

– $1,000 / genome


ENIAC, 1950s: 2.4 x 0.9 x 30 (m), 385 operations/second.

10^-6 operations/second/cm^3

MAC AIR, 2010s: ~1 x 32.5 x 22.7 (cm), 133,656,056 operations/second.

10^5 operations/second/cm^3
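
The two density figures on this slide follow from dividing operations per second by the volume of the machine. A minimal sketch of that arithmetic, using only the dimensions and speeds quoted on the slide (the function name is hypothetical):

```python
def ops_per_second_per_cm3(ops_per_second, dims_cm):
    """Computing density: operations per second divided by volume in cm^3."""
    width, depth, height = dims_cm
    return ops_per_second / (width * depth * height)


# ENIAC: 2.4 x 0.9 x 30 m, converted to centimetres
eniac = ops_per_second_per_cm3(385, (240, 90, 3000))

# Mac Air: ~1 x 32.5 x 22.7 cm
mac_air = ops_per_second_per_cm3(133_656_056, (1, 32.5, 22.7))

if __name__ == "__main__":
    print(f"ENIAC:   {eniac:.2e} operations/second/cm^3")
    print(f"Mac Air: {mac_air:.2e} operations/second/cm^3")
    print(f"Ratio:   {mac_air / eniac:.1e}")
```

The results are about 6 x 10^-6 and 1.8 x 10^5 operations/second/cm^3, consistent with the rounded orders of magnitude on the slide and a gain of roughly ten orders of magnitude in density.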


Moore’s Law


CELERA GENOMICS, year 2000: 1,000 m^2, 2 yr, 3 Gb at 10x

5x10^3 bp/day/m^3

HISEQ 2500, year 2012: 119 x 94 x 76 (cm), 1 day, 120 Gb

10^11 bp/day/m^3
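
The sequencing densities can be checked the same way. The sketch below uses only numbers from the slide, plus two stated assumptions: a year is taken as 365 days, and "3 Gb at 10x" is read as 3 x 10^9 bases sequenced ten times over. The slide gives only a floor area for Celera, so instead of guessing a building height the sketch reports the volume implied by the slide's own 5x10^3 figure.

```python
def bp_per_day_per_m3(total_bp, days, volume_m3):
    """Sequencing density: bases per day divided by instrument or facility volume."""
    return total_bp / days / volume_m3


# HiSeq 2500: 119 x 94 x 76 cm, 120 Gb in one day
hiseq_volume_m3 = 1.19 * 0.94 * 0.76
hiseq_density = bp_per_day_per_m3(120e9, 1, hiseq_volume_m3)

# Celera: 3 Gb genome at 10x coverage over 2 years (365-day years assumed)
celera_bp_per_day = 3e9 * 10 / (2 * 365)

# Volume implied by the slide's 5x10^3 bp/day/m^3 (derived, not stated on the slide)
celera_implied_volume_m3 = celera_bp_per_day / 5e3

if __name__ == "__main__":
    print(f"HiSeq 2500: {hiseq_density:.2e} bp/day/m^3")
    print(f"Celera:     {celera_bp_per_day:.2e} bp/day in total")
    print(f"Implied Celera volume: {celera_implied_volume_m3:.0f} m^3")
```

The HiSeq figure comes out at about 1.4 x 10^11 bp/day/m^3, matching the slide's order of magnitude. The Celera figure corresponds to a facility volume of roughly 8,000 m^3 over the quoted 1,000 m^2.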


The time for the sequencing of our genomes has arrived

• Privacy vs open access

• The cost and benefits of privacy

– At least in the context of genomic information


What is the EGA?

The EGA (European Genome-phenome Archive) is an archive for human genomic data that requires controlled access because it can be used to identify individuals

Data is provided by research centers and health care institutions

Access is controlled by Data Access Committees

Data requesters are researchers from other research or health care institutions


http://ega.crg.eu


The EGA contains a growing amount of data

1,600 TB


Today (June 5, 2015) in the Catalan press


• We really need to assess the cost of privacy, and whether privacy contributes to the common and personal good.

• Genomes in isolation have little value

– It is only through the comparison of thousands of genomes and phenotypic observations that we can infer associations between genetic variation and disease

• Genomic privacy may be an illusion.


Thanks


Graur et al., 2013, Genome Biology and Evolution

This absurd conclusion was reached through various means, chiefly

• (1) by employing the seldom used “causal role” definition of biological function and then applying it inconsistently to different biochemical properties,

• (2) by committing a logical fallacy known as “affirming the consequent,”

• (3) by failing to appreciate the crucial difference between “junk DNA” and “garbage DNA,”

• (4) by using analytical methods that yield biased errors and inflate estimates of functionality,

• (5) by favoring statistical sensitivity over specificity, and

• (6) by emphasizing statistical significance rather than the magnitude of the effect.

Here, we detail the many logical and methodological transgressions involved in assigning functionality to almost every nucleotide in the human genome.
