EMBL-EBI Now and in the Future
Proteomics and the big data trend: challenges and new possibilities
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
Juan A. [email protected], 11 August 2016
Overview
Intro: Concept of Big data in biology and proteomics
PRIDE Archive and ProteomeXchange
PRIDE tools
Reuse of public proteomics data
Working with Big data: PRIDE Cluster
Big data jobs
http://www.indeed.co.uk/Big-Data-jobs
Big data: definition
Slide from: http://www.ibmbigdatahub.com/
Volume: the quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can be considered big data at all.
Variety: the type and nature of the data. Knowing this helps the people who analyze the data to use the resulting insight effectively.
Velocity: the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth.
Veracity: the quality of captured data can vary greatly, affecting accurate analysis.
Big data is everywhere
Big data in biology
The term has been applied so far mainly to genomics
Big data in biology: personalised medicine
Aim: healthcare becomes patient-centric for the first time
Personalized medicine
Slide from: http://vector.childrenshospital.org/wp-content/uploads/2016/01/What-is-precision-medicine.jpg
One-slide intro to MS-based proteomics
Hein et al., Handbook of Systems Biology, 2012
Overview
Intro: Concept of Big data in biology and proteomics
PRIDE Archive and ProteomeXchange
PRIDE tools
Reuse of public proteomics data
Working with Big data: PRIDE Cluster
Data resources at EMBL-EBI
Genes, genomes & variation: Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive
Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE
Protein sequences, families & motifs: InterPro, Pfam, UniProt
Chemical biology: ChEMBL, ChEBI
Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
Systems: BioModels, Enzyme Portal, BioSamples
Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
This slide shows the core resources at EMBL-EBI and the range of data you can access through them.
PRIDE stores mass spectrometry (MS)-based proteomics data:
Peptide and protein expression data (identification and quantification)
Post-translational modifications
Mass spectra (raw data and peak lists)
Technical and biological metadata
Any other related information
Full support for tandem MS approaches
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
ProteomeXchange Consortium
Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK), MassIVE (UCSD, San Diego) and very recently jPOST (Japan).
Common identifier space (PXD identifiers)
Two supported data workflows: MS/MS and SRM.
Main objective: Make life easier for researchers
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
ProteomeXchange data workflow (diagram)
Submitted by researchers: raw data*, results, metadata
Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), jPOST (MS/MS data), PASSEL (SRM data)
ProteomeCentral: metadata / manuscript
Other labels in the diagram: Journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs, researchers' results, reprocessed results
What is a proteomics publication in 2016?
Proteomics studies generate potentially large amounts of data and results.
Ideally, a proteomics publication needs to:
Summarize the results of the study
Provide supporting information for the reliability of any results reported
Information in a publication:
Manuscript
Supplementary material
Associated data submitted to a public repository
PRIDE Archive: over 4,000 datasets from > 51 countries and 1,700 groups
Datasets by country: USA 814, Germany 528, UK 338, China 328, France 222, Netherlands 175, Canada 137
Data volume:
Total: ~225 TB
Number of all files: ~560,000
PXD000320-324: ~4 TB
PXD002319-26: ~2.4 TB
PXD001471: ~1.6 TB
1,973 datasets (52% of all) are publicly accessible
> 90% of all ProteomeXchange data
PRIDE Archive growth (chart: submissions per year; series: all submissions and complete submissions)
In the last year: > 150 submitted datasets per month
PRIDE: Source of MS proteomics data
PRIDE Archive already provides or will soon provide MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the EBI Expression Atlas.
http://www.ebi.ac.uk/pride/archive
Overview
Intro: Concept of Big data in biology and proteomics
PRIDE Archive and ProteomeXchange
PRIDE tools
Reuse of public proteomics data
Working with Big data: PRIDE Cluster
PRIDE Components: Data Submission Process
Tools: PRIDE Converter 2, PRIDE Inspector, PX Submission Tool
File formats: mzIdentML, PRIDE XML
In addition to PRIDE Archive, the PRIDE team develops and maintains tools and software libraries that facilitate the handling and visualisation of MS proteomics data and the submission process.
PRIDE Inspector Toolsuite
Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., Bioinformatics, 2015
Perez-Riverol et al., MCP, 2016
PRIDE Inspector: standalone tool to enable visualisation and validation of MS data.
Built on top of ms-data-core-api: open source algorithms and libraries for computational proteomics.
Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML.
Broad functionality.
https://github.com/PRIDE-Utilities/ms-data-core-api
https://github.com/PRIDE-Toolsuite/pride-inspector
PRIDE Inspector Functionality
Summary and QC charts (delta m/z, precursor charges, etc.)
Spectra view: peptide spectra annotation and visualisation (fragmentation table, ion series annotation)
Protein view containing protein inference information (protein inference algorithm and protein groups visualisation)
Quantification view
Multiple export options (.mgf, protein/peptide tables, mzTab file)
Direct access to PRIDE datasets
PX Submission Tool
Desktop application for data submissions to ProteomeXchange via PRIDE
Implemented in Java 7
Streamlines the submission process
Captures mappings between files
Retains metadata
Fast file transfer with Aspera (FASP transfer technology); FTP also available
Command-line option
Submission tool screenshot
Overview
Intro: Concept of Big data in biology and proteomics
PRIDE Archive and ProteomeXchange
PRIDE tools
Reuse of public proteomics data
Working with Big data: PRIDE Cluster
Datasets are being reused more and more.
Data download volume in 2015: ~ 200 TB
Vaudel et al., Proteomics, 2016
Data sharing in Proteomics
Vaudel et al., Proteomics, 2016
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
Kim et al., Nature, 2014
Two independent groups claimed to have produced the first complete draft of the human proteome by MS.
Some of their findings are controversial and need further validation but generated a lot of discussion and put proteomics in the spotlight.
They used many different tissues.
Nature cover 29 May 2014
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
Around 60% of the data used for the analysis came from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.
They complemented those data with exotic tissues.
Challenges for data reuse in proteomics
Insufficient technical and biological metadata.
Large computational infrastructure may be needed (e.g. when analysing many datasets together).
Shortage of expertise (people).
Lack of standardisation in the field.
Summary of the talk so far
PRIDE Archive and other ProteomeXchange resources make data sharing possible in the MS proteomics field.
Data sharing is becoming the norm in the field.
Standalone tools: PRIDE Inspector and PX Submission tool.
Datasets are increasingly reused (many opportunities):
Example of one of the drafts of the human proteome.
But there are important challenges as well.
Overview
Intro: Concept of Big data in biology and proteomics
PRIDE Archive and ProteomeXchange
PRIDE tools
Reuse of public proteomics data
Working with Big data: PRIDE Cluster
PRIDE Cluster: Initial Motivation
Provide a QC-filtered peptide-centric view of PRIDE.
Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done).
Heterogeneous quality, difficult to make the data comparable.
Enable assessment of (published) proteomics data. Pre-requisite for data reuse (e.g. in UniProt).
PRIDE Cluster - Concept
Griss et al., Nat Methods, 2013
Diagram: originally submitted identified spectra (several identified as NMMAACDPR, others as PPECPDFDPPR) go through spectrum clustering; each resulting cluster is represented by a consensus spectrum.
Threshold: at least 10 spectra in a cluster and ratio > 70%.
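The threshold above can be sketched in a few lines of Python. This is a minimal illustration, not the PRIDE Cluster implementation: the function name `is_reliable` is hypothetical, and it assumes that "ratio" means the share of identified spectra in the cluster that support the most common peptide sequence.

```python
from collections import Counter

def is_reliable(peptides, min_size=10, min_ratio=0.7):
    """Check one cluster against the size and ratio thresholds.

    peptides: one peptide sequence per identified spectrum in the cluster.
    Returns (reliable, most_common_peptide, ratio).
    Assumption: ratio = count of the most common sequence / cluster size.
    """
    if not peptides:
        return False, None, 0.0
    top_peptide, count = Counter(peptides).most_common(1)[0]
    ratio = count / len(peptides)
    reliable = len(peptides) >= min_size and ratio > min_ratio
    return reliable, top_peptide, ratio

# Hypothetical cluster: 9 spectra agree, 2 disagree.
cluster = ["NMMAACDPR"] * 9 + ["PPECPDFDPPR"] * 2
reliable, top_peptide, ratio = is_reliable(cluster)
```

Keeping the two thresholds as parameters means the same check can be rerun with a different minimum cluster size.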
PRIDE Cluster: Implementation
Griss et al., Nat. Methods, 2013
Clustered all public, identified spectra in PRIDE
EBI compute farm, LSF
20.7 M identified spectra
610 CPU days, two calendar weeks
Validation, calibration
Feedback into PRIDE datasets
PRIDE Cluster Iteration 2: Why?
PRIDE Archive has experienced a huge increase in data since 2013.
We wanted to develop an algorithm that could also work with unidentified spectra.
PRIDE Archive growth (chart: submissions per year; series: all submissions and complete submissions)
Parallelizing Spectrum Clustering: Hadoop
Hadoop is an open-source framework for parallel processing that uses the MapReduce programming model introduced by Google.
It optimizes work distribution among machines.
It solves many general issues of large parallel jobs:
Scheduling
Inter-job communication
Failure handling
https://hadoop.apache.org/
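To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It does not use the Hadoop API and is not the PRIDE Cluster code: the function names are hypothetical, and grouping spectra by precursor m/z bin is only an assumed example of a key that lets each group be processed independently.

```python
from collections import defaultdict

def map_phase(spectra, bin_width=1.0):
    """Map: emit one (key, value) pair per spectrum.

    Assumption for illustration: the key is the precursor m/z bin.
    """
    for spectrum_id, precursor_mz in spectra:
        yield int(precursor_mz // bin_width), spectrum_id

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: handle each key independently; here we only sort the members."""
    return {key: sorted(values) for key, values in groups.items()}

# Hypothetical (spectrum id, precursor m/z) pairs.
spectra = [("s1", 500.2), ("s2", 500.7), ("s3", 742.4), ("s4", 500.9)]
result = reduce_phase(shuffle(map_phase(spectra)))
```

Because each key is reduced on its own, a framework such as Hadoop can hand different keys to different machines and rerun only the pieces that fail.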
PRIDE Cluster: Second Implementation
First implementation (Griss et al., Nat. Methods, 2013):
Clustered all public, identified spectra in PRIDE
EBI compute farm, LSF
20.7 M identified spectra
610 CPU days, two calendar weeks
Validation, calibration
Feedback into PRIDE datasets
Second implementation (Griss et al., Nat. Methods, 2016):
Clustered all public spectra in PRIDE by April 2015
Apache Hadoop
Starting with 256 M spectra:
190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)
66 M identified spectra
Result: 28 M clusters
5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
PRIDE Cluster - Concept
Griss et al., Nat Methods, 2016
Diagram: originally submitted identified spectra (several identified as NMMAACDPR, others as PPECPDFDPPR) go through spectrum clustering; each resulting cluster is represented by a consensus spectrum.
Threshold: at least 3 spectra in a cluster and ratio > 70%.
PRIDE Cluster Home page
http://www.ebi.ac.uk/pride/cluster/#/
PRIDE Cluster: result of searches
http://www.ebi.ac.uk/pride/cluster/#/
A couple of examples
Examples: one perfect cluster
880 PSMs give the same peptide ID
4 species
28 datasets
Same instruments
Examples: one perfect cluster (2)
PRIDE Cluster
Diagram labels: sequence-based search engines; spectrum clustering; incorrectly identified or unidentified spectra
Output of the analysis
1. Inconsistent spectrum clusters
2. Clusters including identified and unidentified spectra.
3. Clusters just containing unidentified spectra.
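The three kinds of output can be sketched as a small classifier. This is a hypothetical illustration, not the PRIDE Cluster code: the function name, the use of `None` for an unidentified spectrum, and the reuse of the 50% proportion cut-off from the re-analysis slide are all assumptions.

```python
from collections import Counter

def classify_cluster(spectra, min_ratio=0.5):
    """Assign one cluster to an output category.

    spectra: one entry per spectrum; a peptide sequence if the spectrum
    was identified, or None if it was not (assumed representation).
    """
    identified = [peptide for peptide in spectra if peptide is not None]
    if not identified:
        return "unidentified only"
    top_count = Counter(identified).most_common(1)[0][1]
    if top_count / len(identified) <= min_ratio:
        # No sequence has a proportion above the cut-off.
        return "inconsistent"
    if len(identified) < len(spectra):
        return "identified and unidentified"
    return "consistent"

# Hypothetical clusters, one per category.
examples = {
    "a": ["NMMAACDPR", "IGGIGTVPVGR", "PPECPDFDPPR", "NMMAACDPR"],
    "b": ["NMMAACDPR", "NMMAACDPR", None],
    "c": [None, None, None],
}
labels = {name: classify_cluster(members) for name, members in examples.items()}
```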
1. Re-analysis of inconsistent clusters
Diagram: originally submitted identified spectra with several different identifications (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK) end up in one cluster after spectrum clustering, represented by one consensus spectrum.
No sequence has a proportion in the cluster > 50%.
Re-analysed 3,997 large (> 100 spectra), inconsistent clusters with PepNovo, SpectraST and X!Tandem.
453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin and hemoglobin.
In these cases, it is likely that a contaminants DB was not used in the search.
Validation
2. Inferring identifications for originally unidentified spectra
9.1 M unidentified spectra were contained in clusters with a reliable identification.
These are candidate new identifications (that need to be confirmed), often missed due to search engine settings.
Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
3. Consistently unidentified clusters
19 M clusters contain only unidentified spectra.
41,155 of these clusters have more than 100 spectra (= 12 M spectra).
Most of them are likely to be derived from peptides.
They could correspond to PTMs or variant peptides.
With various methods, we found likely identifications for about 20%.
A vast amount of data mining remains to be done.
PRIDE Cluster as a Public Data Mining Resource
http://www.ebi.ac.uk/pride/cluster
Spectral libraries for 16 species.
All clustering results, as well as specific subsets of interest, are available.
Source code (open source) and Java API.
Other applications of spectrum clustering
In individual datasets or small groups of similar proteomics datasets:
Can be used to target spectra that are consistently unidentified.
Unidentified spectra could represent PTMs or sequence variants.
Try more expensive computational analysis methods (e.g. spectral searches, de novo).
When mixing identified and unidentified spectra from different experiments, if PTMs that were not found initially are identified, one could modify the initial search parameters.
Spectrum clustering can also be applied to MS/MS lipidomics studies.
Summary part 2
Using a big data approach, we are able to get extra knowledge from all the public data in PRIDE Archive.
Spectrum clustering enables QC in proteomics resources such as PRIDE Archive.
It is possible to detect spectra that are consistently unidentified across hundreds of datasets (maybe peptide variants, or peptides with PTMs not initially considered).
Spectrum clustering is applicable in the analysis of individual datasets (and not only for proteomics!).
Acknowledgements: People
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!
Questions?
PXD identifier    Hits / No. files = dataset downloads    Dataset title
PXD000561         46578 / 2383 = 20                       A draft map of the human proteome
PXD001587         13435 / 140 = 96                        DIA-Umpire: comprehensive computational framework for data independent acquisition proteomics
PRD000066         12748 / 4090 = 3                        Quantitative Proteomics Analysis of the Secretory Pathway
PXD000658         4004 / 460 = 9                          Global phosphoproteomic profiling reveals distinct signatures in B-cell non-Hodgkin
PXD000149         3781 / 598 = 6                          The potato tuber mitochondrial proteome
PXD000865         12535 / 1368 = 9                        Mass spectrometry based draft of the human proteome
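The "dataset downloads" column follows from dividing the number of hits by the number of files in each dataset. A short sketch of that arithmetic, using the figures from the table; the function name and the rounding to the nearest whole number are assumptions chosen to match the values shown:

```python
# (identifier, hits, number of files), as listed in the table above.
rows = [
    ("PXD000561", 46578, 2383),
    ("PXD001587", 13435, 140),
    ("PRD000066", 12748, 4090),
    ("PXD000658", 4004, 460),
    ("PXD000149", 3781, 598),
    ("PXD000865", 12535, 1368),
]

def estimated_downloads(hits, n_files):
    """Approximate whole-dataset downloads as hits per file, rounded."""
    return round(hits / n_files)

downloads = {identifier: estimated_downloads(hits, n_files)
             for identifier, hits, n_files in rows}
```

This treats every hit as part of a complete download, so it is only a rough estimate: downloads of a few selected files would inflate or deflate the figure.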