Pride cluster presentation

Update to the PRIDE Cluster project

Dr. Juan Antonio Vizcaíno

Proteomics Team Leader

EMBL-European Bioinformatics Institute

Hinxton, Cambridge, UK


Bioinformatics Hub, HUPO 2016

Taipei, September 2016

PRIDE (PRoteomics IDEntifications) database

• PRIDE stores mass spectrometry (MS)-based proteomics data:

  • Peptide and protein expression data (identification and quantification)

  • Post-translational modifications

  • Mass spectra (raw data and peak lists)

  • Technical and biological metadata

  • Any other related information

• Full support for tandem MS approaches

http://www.ebi.ac.uk/pride/archive

Martens et al., Proteomics, 2005

Vizcaíno et al., NAR, 2016



PRIDE Cluster: Initial Motivation

• Provide a QC-filtered peptide-centric view of PRIDE.

• Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done).

• Data quality is heterogeneous, which makes the data difficult to compare.

• Enable assessment of (published) proteomics data, a prerequisite for data reuse (e.g. in UniProt).



PRIDE Cluster - Concept

Griss et al., Nat Methods, 2016

[Figure: originally submitted identified spectra (example peptide identifications NMMAACDPR and PPECPDFDPPR) are grouped by spectrum clustering; each cluster is represented by a consensus spectrum.]

Threshold: at least 3 spectra in a cluster and ratio >70%.
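The threshold above can be sketched in a few lines of Python. This is only an illustration under one assumption that the slide does not spell out: "ratio" is taken to mean the share of spectra in a cluster that carry the most frequent peptide identification. The function names are hypothetical and are not part of the PRIDE Cluster code base.

```python
from collections import Counter

MIN_SPECTRA = 3    # "at least 3 spectra in a cluster" (from the slide)
MIN_RATIO = 0.70   # "ratio >70%" (from the slide)


def cluster_ratio(peptide_ids):
    """Return (most frequent peptide, share of spectra carrying it).

    Assumption: the ratio is the fraction of spectra in the cluster
    that share the most frequent peptide identification.
    """
    if not peptide_ids:
        return None, 0.0
    peptide, count = Counter(peptide_ids).most_common(1)[0]
    return peptide, count / len(peptide_ids)


def is_reliable(peptide_ids):
    """Apply both thresholds: cluster size and ratio."""
    _, ratio = cluster_ratio(peptide_ids)
    return len(peptide_ids) >= MIN_SPECTRA and ratio > MIN_RATIO


# Toy clusters built from the peptide sequences shown on the slide.
consistent = ["NMMAACDPR"] * 4 + ["PPECPDFDPPR"]       # ratio 4/5
mixed = ["NMMAACDPR"] * 2 + ["PPECPDFDPPR"] * 2        # ratio 2/4
too_small = ["NMMAACDPR"] * 2                          # only 2 spectra
```

Here `is_reliable(consistent)` holds, while `mixed` fails the ratio test and `too_small` fails the size test.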



PRIDE Cluster: Second Implementation

• Griss et al., Nat. Methods, 2013

  • Clustered all public, identified spectra in PRIDE

  • EBI compute farm, LSF

  • 20.7 M identified spectra

  • 610 CPU days, two calendar weeks

  • Validation, calibration

  • Feedback into PRIDE datasets

• Griss et al., Nat. Methods, 2016

  • Clustered all public spectra in PRIDE by April 2015

  • Apache Hadoop

  • Starting with 256 M spectra:

    • 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)

    • 66 M identified spectra

  • Result: 28 M clusters

  • 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores



Parallelizing Spectrum Clustering: Hadoop

• Hadoop is an open-source framework for parallel computing based on the MapReduce programming model introduced by Google.

• Optimizes work distribution among machines.

• Solves many general issues of large parallel jobs:

  • Scheduling

  • Inter-job communication

  • Failure handling

https://hadoop.apache.org/
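To make the MapReduce idea concrete, here is a minimal single-process sketch in plain Python (it does not use Hadoop itself). It assumes, purely for illustration, that spectra are keyed by a precursor m/z bin so that each bin can be processed independently; the bin width, the field names and the placeholder reducer are hypothetical and do not describe the actual PRIDE Cluster implementation.

```python
from collections import defaultdict

BIN_WIDTH = 1.0  # hypothetical precursor m/z bin width, for illustration only


def map_spectrum(spectrum):
    """Map step: emit a (key, value) pair keyed by the precursor m/z bin."""
    key = int(spectrum["precursor_mz"] // BIN_WIDTH)
    return key, spectrum["id"]


def shuffle(pairs):
    """Shuffle step: group values by key (the framework does this in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_bin(key, spectrum_ids):
    """Reduce step: each bin can be handled independently on any machine.

    Placeholder: a real reducer would cluster the spectra within the bin.
    """
    return key, sorted(spectrum_ids)


# Toy input with hypothetical spectrum identifiers.
spectra = [
    {"id": "s1", "precursor_mz": 500.27},
    {"id": "s2", "precursor_mz": 500.31},
    {"id": "s3", "precursor_mz": 612.84},
]

pairs = [map_spectrum(s) for s in spectra]
result = dict(reduce_bin(k, v) for k, v in shuffle(pairs).items())
```

Because each key is reduced independently, the framework is free to schedule bins on different machines and to rerun a bin if a machine fails.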



PRIDE Cluster Home page

http://www.ebi.ac.uk/pride/cluster/#/



PRIDE Cluster: result of searches

http://www.ebi.ac.uk/pride/cluster/#/

A couple of examples …



Examples: one perfect cluster

- 880 PSMs give the same peptide ID

- 4 species

- 28 datasets

- Same instruments



Examples: one perfect cluster (2)



Output of the analysis

• 1. Inconsistent spectrum clusters.

• 2. Clusters including both identified and unidentified spectra.

• 3. Clusters containing only unidentified spectra.









2. Inferring identifications for originally unidentified spectra


• 9.1 M unidentified spectra were contained in clusters with a reliable identification.

• These are candidate new identifications (still to be confirmed), often missed because of search engine settings.

• Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
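The inference step can be illustrated with a short Python sketch. It rests on assumptions not stated on the slide: the cluster size counts all spectra, the ratio is computed over the identified spectra only, and the thresholds are the ones from the concept slide (at least 3 spectra, ratio >70%). The function name and the spectrum identifiers are hypothetical, and the output is only a list of candidates that would still need confirmation.

```python
from collections import Counter


def infer_identifications(cluster, min_spectra=3, min_ratio=0.70):
    """Propose peptide identifications for unidentified spectra in a cluster.

    cluster: list of (spectrum_id, peptide) pairs; peptide is None when
    the spectrum was submitted without an identification.
    Returns {spectrum_id: candidate_peptide} for the unidentified spectra,
    or an empty dict if the cluster is not considered reliable.
    """
    identified = [pep for _, pep in cluster if pep is not None]
    if len(cluster) < min_spectra or not identified:
        return {}
    peptide, count = Counter(identified).most_common(1)[0]
    # Assumption: ratio = share of identified spectra with the top peptide.
    if count / len(identified) <= min_ratio:
        return {}
    return {sid: peptide for sid, pep in cluster if pep is None}


# Toy cluster: three identified spectra agree, one spectrum is unidentified.
cluster = [
    ("spec_a", "NMMAACDPR"),
    ("spec_b", "NMMAACDPR"),
    ("spec_c", "NMMAACDPR"),
    ("spec_d", None),
]
candidates = infer_identifications(cluster)
```

In this toy case `spec_d` receives NMMAACDPR as a candidate identification; a cluster with conflicting identifications would yield no candidates.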









3. Consistently unidentified clusters

• 19 M clusters contain only unidentified spectra.

• 41,155 of these clusters contain more than 100 spectra each (= 12 M spectra in total).

• Most of them are likely to be derived from peptides.

• They could correspond to PTMs or variant peptides.

• With various methods, we found likely identifications for about 20%.

• Vast amount of data mining remains to be done.









PRIDE Cluster as a Public Data Mining Resource


• http://www.ebi.ac.uk/pride/cluster

• Spectral libraries for 16 species.

• All clustering results, as well as specific subsets of interest, are available.

• Source code (open source) and Java API



Consistently unidentified clusters

• We provide the results split per species in MGF and mzML format.

• We are very interested in people working with these data.

• Available for several species (currently the largest clusters).
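For anyone who wants to start exploring the MGF files, the sketch below reads spectra from MGF text using only the Python standard library. It assumes the usual MGF layout (`BEGIN IONS` / `END IONS` blocks, `KEY=value` parameter lines, and whitespace-separated m/z and intensity pairs); the function name and the example spectrum are hypothetical, and the parameters actually present in the PRIDE Cluster files may differ.

```python
def read_mgf(lines):
    """Yield one dict per spectrum from an iterable of MGF text lines."""
    spectrum = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None and line and not line.startswith("#"):
            if "=" in line and not line[0].isdigit():
                # Parameter line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                spectrum["params"][key] = value
            else:
                # Peak line: m/z followed by intensity.
                mz, intensity = line.split()[:2]
                spectrum["peaks"].append((float(mz), float(intensity)))


# Hypothetical example spectrum, not taken from a real file.
example = """BEGIN IONS
TITLE=hypothetical_cluster_1
PEPMASS=500.27
CHARGE=2+
100.1 10.0
200.2 20.5
END IONS
"""

spectra = list(read_mgf(example.splitlines()))
```

The same function works on an open file handle, for example `with open(path) as handle: spectra = list(read_mgf(handle))`, where `path` points to a downloaded MGF file.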





Acknowledgements: People

Attila Csordas

Tobias Ternent

Gerhard Mayer (de.NBI)

Johannes Griss

Yasset Perez-Riverol

Manuel Bernal-Llinares

Andrew Jarnuczak

Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob

Acknowledgements: The PRIDE Team

All data submitters !!!



Questions?
