Proteomics and the "big data" trend: challenges and new possibilities (Talk at ISAS Dortmund)

EMBL-EBI Now and in the Future

Proteomics and the big data trend: challenges and new possibilities

Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK

Juan A. [email protected], 11 August 2016

Overview

Intro: Concept of Big data in biology and proteomics

PRIDE Archive and ProteomeXchange

PRIDE tools

Reuse of public proteomics data

Working with Big data: PRIDE Cluster

Big data jobs

http://www.indeed.co.uk/Big-Data-jobs

Big data: definition

Slide from: http://www.ibmbigdatahub.com/

Volume. The quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can actually be considered big data or not.
Variety. The type and nature of the data. Knowing this helps the people who analyse it to use the resulting insight effectively.
Velocity. The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth.
Veracity. The quality of captured data can vary greatly, affecting the accuracy of the analysis.

Big data is everywhere


Big data in biology

The term has been applied so far mainly to genomics

Big data in biology: personalised medicine

Aim: healthcare becomes patient-centric for the first time

Slide from: http://vector.childrenshospital.org/wp-content/uploads/2016/01/What-is-precision-medicine.jpg

One-slide intro to MS-based proteomics

Hein et al., Handbook of Systems Biology, 2012


Data resources at EMBL-EBI

Genes, genomes & variation: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal
Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE
Protein sequences, families & motifs: InterPro, Pfam, UniProt
Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
Chemical biology: ChEMBL, ChEBI
Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
Systems: BioModels, Enzyme Portal, BioSamples
Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology

This slide shows the core resources at EMBL-EBI and the range of data you can access through them.


PRIDE (PRoteomics IDEntifications) database

http://www.ebi.ac.uk/pride/archive

PRIDE stores mass spectrometry (MS)-based proteomics data:
Peptide and protein expression data (identification and quantification)
Post-translational modifications
Mass spectra (raw data and peak lists)
Technical and biological metadata
Any other related information

Full support for tandem MS approaches

Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016

ProteomeXchange Consortium

Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.

Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK), MassIVE (UCSD, San Diego) and very recently jPOST (Japan).

Common identifier space (PXD identifiers)

Two supported data workflows: MS/MS and SRM.

Main objective: Make life easier for researchers

http://www.proteomexchange.org

Vizcaíno et al., Nat Biotechnol, 2014

ProteomeXchange data workflow

Diagram: researchers' results, raw data and metadata go to the receiving repositories: PRIDE, MassIVE and jPOST (MS/MS data) and PASSEL (SRM data). Other elements shown are ProteomeCentral, journals (metadata / manuscript), and resources that take up the data, including reprocessed results: PeptideAtlas, GPMDB, UniProt/neXtProt and other DBs.

What is a proteomics publication in 2016?

Proteomics studies generate potentially large amounts of data and results.

Ideally, a proteomics publication needs to:
Summarize the results of the study
Provide supporting information for reliability of any results reported

Information in a publication:
Manuscript
Supplementary material
Associated data submitted to a public repository

PRIDE Archive

Over 4,000 datasets from > 51 countries and 1,700 groups
Datasets by country: USA 814, Germany 528, UK 338, China 328, France 222, Netherlands 175, Canada 137

Data volume:
Total: ~225 TB
Number of all files: ~560,000
PXD000320-324: ~4 TB
PXD002319-26: ~2.4 TB
PXD001471: ~1.6 TB
1,973 datasets (52% of all) are publicly accessible
> 90% of all ProteomeXchange data

Chart: PRIDE Archive growth (submissions per year; all submissions vs. complete submissions)
In the last year: >150 submitted datasets per month

PRIDE: Source of MS proteomics data

PRIDE Archive already provides or will soon provide MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the EBI Expression Atlas.

http://www.ebi.ac.uk/pride/archive





PRIDE tools



PRIDE Components: Data Submission Process

In addition to PRIDE Archive, the PRIDE team develops and maintains tools and software libraries that facilitate the handling and visualisation of MS proteomics data and the submission process.

Tools: PRIDE Converter 2, PRIDE Inspector, PX Submission Tool
File formats: mzIdentML, PRIDE XML

PRIDE Inspector Toolsuite

PRIDE Inspector: standalone tool for visualisation and validation of MS data.
Built on top of ms-data-core-api: open source algorithms and libraries for computational proteomics.
Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML.
Broad functionality.

https://github.com/PRIDE-Utilities/ms-data-core-api
https://github.com/PRIDE-Toolsuite/pride-inspector

Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., Bioinformatics, 2015
Perez-Riverol et al., MCP, 2016

PRIDE Inspector Functionality

Summary and QC charts (delta m/z, precursor charges, etc.)
Spectra view (fragmentation table, ion series annotation)
Protein inference algorithm and protein groups visualisation
Protein view containing protein inference information
Quantification view
Multiple export options (.mgf, protein/peptide tables, mzTab file)
Direct access to PRIDE datasets


PX Submission Tool

Desktop application for data submissions to ProteomeXchange via PRIDE

Implemented in Java 7
Streamlines the submission process
Captures mappings between files
Retains metadata
Fast file transfer with Aspera (FASP transfer technology); FTP also available
Command line option






Datasets are being reused more and more.

Data download volume in 2015: ~ 200 TB

Vaudel et al., Proteomics, 2016



Data sharing in Proteomics

Vaudel et al., Proteomics, 2016

Draft Human proteome papers published in 2014

Wilhelm et al., Nature, 2014
Kim et al., Nature, 2014

Two independent groups claimed to have produced the first complete draft of the human proteome by MS.

Some of their findings are controversial and need further validation but generated a lot of discussion and put proteomics in the spotlight.

They used many different tissues.

Nature cover, 29 May 2014

Wilhelm et al., Nature, 2014

Around 60% of the data used for the analysis came from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE.

They complemented that data with exotic tissues.

Challenges for data reuse in proteomics

Insufficient technical and biological metadata.

Large computational infrastructure may be needed (e.g. when analysing many datasets together).

Shortage of expertise (people).

Lack of standardisation in the field.

Summary of the talk so far

PRIDE Archive and other ProteomeXchange resources make data sharing possible in the MS proteomics field.
Data sharing is becoming the norm in the field.

Standalone tools: PRIDE Inspector and PX Submission Tool.

Datasets are increasingly reused (many opportunities):
Example: one of the drafts of the human proteome.
But there are important challenges as well.




PRIDE Cluster: Initial Motivation

Provide a QC-filtered peptide-centric view of PRIDE.

Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done).

Heterogeneous quality, difficult to make the data comparable.

Enable assessment of (published) proteomics data. Pre-requisite for data reuse (e.g. in UniProt).

PRIDE Cluster - Concept

Griss et al., Nat Methods, 2013

Figure: originally submitted identified spectra (e.g. assigned to NMMAACDPR or PPECPDFDPPR) are grouped by spectrum clustering, and each cluster is represented by a consensus spectrum.

Threshold: at least 10 spectra in a cluster and ratio >70%.
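
The threshold on this slide can be read as a simple filter applied to each cluster. The sketch below assumes that "ratio" means the share of a cluster's identified spectra that carry the most frequent peptide sequence. That reading, the function names and the input format (one peptide sequence per identified spectrum) are illustrative assumptions, not the PRIDE Cluster code.

```python
from collections import Counter

def max_sequence_ratio(peptides):
    """Return (most common peptide, its share of the cluster's identified spectra)."""
    if not peptides:
        return None, 0.0
    sequence, count = Counter(peptides).most_common(1)[0]
    return sequence, count / len(peptides)

def is_reliable(peptides, min_size=10, min_ratio=0.7):
    """Apply the slide's threshold: at least `min_size` spectra and ratio > `min_ratio`.

    Assumption: `peptides` lists the peptide assigned to each identified spectrum.
    """
    _, ratio = max_sequence_ratio(peptides)
    return len(peptides) >= min_size and ratio > min_ratio

# Hypothetical cluster: 9 spectra assigned to one peptide, 2 to another.
cluster = ["NMMAACDPR"] * 9 + ["PPECPDFDPPR"] * 2
print(max_sequence_ratio(cluster))
print(is_reliable(cluster))
```

Keeping the ratio calculation separate from the size check makes it easy to reuse the same function with a different minimum cluster size.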


PRIDE Cluster: Implementation

Griss et al., Nat. Methods, 2013

Clustered all public, identified spectra in PRIDE
EBI compute farm, LSF
20.7 M identified spectra
610 CPU days, two calendar weeks
Validation, calibration
Feedback into PRIDE datasets

PRIDE Cluster Iteration 2: Why?

PRIDE Archive has experienced a huge increase in data since 2013.
We wanted to develop an algorithm that could also work with unidentified spectra.

Chart: PRIDE Archive growth (submissions per year; all submissions vs. complete submissions)

Parallelizing Spectrum Clustering: Hadoop

Hadoop is an open source framework for parallel processing based on the MapReduce model introduced by Google.
Optimizes work distribution among machines.
Solves many general issues of large parallel jobs: scheduling, inter-job communication, failure handling.

https://hadoop.apache.org/
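
To make the Map-Reduce idea concrete, here is a minimal single-machine sketch in plain Python (no Hadoop involved). The map step emits a key per spectrum, the shuffle step groups values by key, and the reduce step handles each group independently, which is what allows the work to be spread over many machines. Keying by precursor m/z bin, the bin width and all function names are assumptions chosen for illustration; they are not taken from the PRIDE Cluster implementation.

```python
from collections import defaultdict

def map_spectrum(spectrum, bin_width=4.0):
    """Map step: emit (key, value) with a precursor m/z bin as the key (assumed keying)."""
    key = int(spectrum["precursor_mz"] // bin_width)
    return key, spectrum

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_bin(key, spectra):
    """Reduce step: placeholder for clustering the spectra of one bin.

    Here it only reports how many spectra the bin received.
    """
    return key, len(spectra)

# Hypothetical input records.
spectra = [
    {"id": "s1", "precursor_mz": 500.3},
    {"id": "s2", "precursor_mz": 501.9},
    {"id": "s3", "precursor_mz": 742.4},
]
grouped = shuffle(map_spectrum(s) for s in spectra)
results = dict(reduce_bin(k, v) for k, v in grouped.items())
print(results)
```

Because each bin is reduced on its own, a failed bin can be rerun without touching the others, which is one of the general issues the framework takes care of.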

PRIDE Cluster: Second Implementation

First implementation (Griss et al., Nat. Methods, 2013):
Clustered all public, identified spectra in PRIDE
EBI compute farm, LSF
20.7 M identified spectra
610 CPU days, two calendar weeks
Validation, calibration
Feedback into PRIDE datasets

Second implementation (Griss et al., Nat. Methods, 2016):
Clustered all public spectra in PRIDE by April 2015
Apache Hadoop
Starting with 256 M spectra:
190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)
66 M identified spectra
Result: 28 M clusters
5 calendar days on a 30-node Hadoop cluster, 340 CPU cores

PRIDE Cluster - Concept

Griss et al., Nat Methods, 2016

Figure: originally submitted identified spectra (e.g. assigned to NMMAACDPR or PPECPDFDPPR) are grouped by spectrum clustering, and each cluster is represented by a consensus spectrum.

Threshold: at least 3 spectra in a cluster and ratio >70%.
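
A consensus spectrum can be thought of as a summary of all spectra in a cluster. The sketch below shows one simple way to build such a summary: peaks are placed into fixed-width m/z bins, intensities are averaged over the cluster, and bins seen in too few spectra are dropped. The bin width, the minimum fraction and the averaging rule are illustrative assumptions; the published algorithm may build its consensus spectra differently.

```python
from collections import defaultdict

def consensus_spectrum(spectra, bin_width=0.5, min_fraction=0.5):
    """Average binned peak intensities over the spectra of one cluster.

    `spectra` is a list of peak lists; each peak is an (mz, intensity) tuple.
    A bin is kept only if it is populated in at least `min_fraction` of the spectra.
    """
    intensity_sum = defaultdict(float)
    occurrences = defaultdict(int)
    for peaks in spectra:
        seen = set()
        for mz, intensity in peaks:
            bin_index = int(mz // bin_width)
            intensity_sum[bin_index] += intensity
            seen.add(bin_index)
        # Count each bin once per spectrum, however many peaks fell into it.
        for bin_index in seen:
            occurrences[bin_index] += 1

    n = len(spectra)
    consensus = []
    for bin_index in sorted(intensity_sum):
        if occurrences[bin_index] / n >= min_fraction:
            bin_centre = (bin_index + 0.5) * bin_width
            consensus.append((bin_centre, intensity_sum[bin_index] / n))
    return consensus

# Hypothetical cluster of three spectra; the peak at m/z 350 occurs only once.
cluster = [
    [(100.1, 10.0), (200.2, 50.0)],
    [(100.2, 20.0), (200.3, 70.0), (350.0, 5.0)],
    [(100.1, 30.0), (200.2, 60.0)],
]
print(consensus_spectrum(cluster))
```

Dropping rarely observed bins is a crude form of noise filtering: a peak present in only one of three spectra does not reach the consensus.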

PRIDE Cluster Home page

http://www.ebi.ac.uk/pride/cluster/#/

PRIDE Cluster: result of searches

http://www.ebi.ac.uk/pride/cluster/#/

A couple of examples

Examples: one perfect cluster

880 PSMs give the same peptide ID
4 species
28 datasets
Same instruments

PRIDE Cluster

Figure labels: sequence-based search engines; spectrum clustering; incorrectly identified or unidentified spectra.

Output of the analysis

1. Inconsistent spectrum clusters.

2. Clusters including identified and unidentified spectra.

3. Clusters just containing unidentified spectra.
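
The three outputs listed above amount to a labelling of clusters by their mix of identifications. The sketch below assumes that each cluster is given as a list with one entry per spectrum (a peptide sequence, or `None` for an unidentified spectrum) and that "inconsistent" means no sequence exceeds 50% of the identified spectra, the criterion shown on the re-analysis slide. The function name, the labels and the data layout are hypothetical.

```python
from collections import Counter

def classify_cluster(assignments, inconsistent_below=0.5):
    """Label a cluster by its mix of identified and unidentified spectra.

    `assignments` holds one entry per spectrum: a peptide sequence,
    or None if the spectrum was submitted without an identification.
    """
    identified = [a for a in assignments if a is not None]
    if not identified:
        return "unidentified only"
    top_count = Counter(identified).most_common(1)[0][1]
    # No sequence has a proportion above the cut-off: conflicting identifications.
    if top_count / len(identified) <= inconsistent_below:
        return "inconsistent"
    if len(identified) < len(assignments):
        return "identified and unidentified"
    return "consistent, identified only"

print(classify_cluster(["NMMAACDPR", "IGGIGTVPVGR", "PPECPDFDPPR", "VFDEFKPLVEEPQNLIK"]))
print(classify_cluster(["NMMAACDPR", "NMMAACDPR", "NMMAACDPR", None]))
print(classify_cluster([None, None, None]))
```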

1. Re-analysis of inconsistent clusters

Figure: originally submitted identified spectra with conflicting peptide assignments (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK) fall into the same cluster after spectrum clustering and share one consensus spectrum.

No sequence has a proportion in the cluster >50%.

Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST and X!Tandem.
453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin and hemoglobin.
In these cases, it is likely that a contaminants database was not used in the search.

Validation

2. Inferring identifications for originally unidentified spectra

9.1 M unidentified spectra were contained in clusters with a reliable identification.
These are candidate new identifications (that need to be confirmed), often missed due to search engine settings.
Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
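
The inference step can be sketched as follows: when a cluster has a reliable identification, its unidentified spectra receive the dominant peptide as a candidate identification. The thresholds reuse the "at least 3 spectra and ratio >70%" rule from the concept slide; counting all spectra towards the cluster size, computing the ratio over identified spectra only, the function name and the data layout are assumptions made for illustration. The output is a list of candidates that still need to be confirmed.

```python
from collections import Counter

def candidate_identifications(cluster, min_size=3, min_ratio=0.7):
    """Suggest the dominant peptide of a reliable cluster for its unidentified spectra.

    `cluster` maps spectrum id -> peptide sequence, or None if unidentified.
    Returns {spectrum id: candidate peptide}; empty if the cluster is not reliable.
    """
    identified = [p for p in cluster.values() if p is not None]
    # Assumption: cluster size counts every spectrum, identified or not.
    if len(cluster) < min_size or not identified:
        return {}
    peptide, count = Counter(identified).most_common(1)[0]
    # Assumption: the ratio is computed over identified spectra only.
    if count / len(identified) <= min_ratio:
        return {}
    return {sid: peptide for sid, p in cluster.items() if p is None}

# Hypothetical cluster with one unidentified spectrum.
cluster = {"s1": "NMMAACDPR", "s2": "NMMAACDPR", "s3": "NMMAACDPR", "s4": None}
print(candidate_identifications(cluster))
```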

3. Consistently unidentified clusters

19 M clusters contain only unidentified spectra.
41,155 of these clusters have more than 100 spectra (= 12 M spectra).
Most of them are likely to be derived from peptides.
They could correspond to PTMs or variant peptides.
With various methods, we found likely identifications for about 20%.
A vast amount of data mining remains to be done.

PRIDE Cluster as a Public Data Mining Resource

http://www.ebi.ac.uk/pride/cluster

Spectral libraries for 16 species.
All clustering results, as well as specific subsets of interest, are available.
Source code (open source) and Java API.


Other applications of spectrum clustering

In individual datasets or small groups of similar proteomics datasets:
Can be used to target spectra that are consistently unidentified.
Unidentified spectra could represent PTMs or sequence variants.
Try more computationally expensive analysis methods (e.g. spectral searches, de novo).

When identified and unidentified spectra from different experiments are clustered together and PTMs that were not initially searched for are identified, one could modify the initial search parameters.


Spectrum clustering can also be applied to MS/MS lipidomics studies

Summary part 2

Using a big data approach we are able to get extra knowledge from all the public data in PRIDE Archive.

Spectrum clustering enables QC in proteomics resources such as PRIDE Archive.

It is possible to detect spectra that are consistently unidentified across hundreds of datasets (maybe peptide variants, or peptides with PTMs not initially considered).

Spectrum clustering is applicable in the analysis of individual datasets (and not only for proteomics!).

Acknowledgements: People

Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak

Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob

Acknowledgements: The PRIDE Team

All data submitters !!!


Questions?


PXD identifier   Hits / No. files = dataset downloads   Dataset title
PXD000561        46578 / 2383 = 20                      A draft map of the human proteome
PXD001587        13435 / 140 = 96                       DIA-Umpire: comprehensive computational framework for data independent acquisition proteomics
PRD000066        12748 / 4090 = 3                       Quantitative Proteomics Analysis of the Secretory Pathway
PXD000658        4004 / 460 = 9                         Global phosphoproteomic profiling reveals distinct signatures in B-cell non-Hodgkin
PXD000149        3781 / 598 = 6                         The potato tuber mitochondrial proteome
PXD000865        12535 / 1368 = 9                       Mass spectrometry based draft of the human proteome
