
Biomedical informatics for proteomics


Page 1: Biomedical informatics                        for proteomics

Biomedical informatics for proteomics

Boguski, M. S. and M. W. McIntosh (2003). Nature 422(6928): 233-237.

Advisor: 趙坤茂 Kun-Mao Chao

Group members: 施光偉, 蕭雅茵, 計佩岑, 葉衍陞, 葉欣綺, 鍾宇彥, 蘇鈺惠, 陳雲濤

Page 2: Biomedical informatics                        for proteomics

Outline

• Introduction
• Study design and sample quality
• Protein databases
• Protein identification by database searching
• Pattern matching without protein identification
• Conclusions and future challenges

Page 3: Biomedical informatics                        for proteomics

Introduction

reporter: 施光偉

Page 4: Biomedical informatics                        for proteomics

Introduction

• The subtitle: “Genes Were Easy”.

• We have transitioned rapidly from a large but finite and complete human genome to a seemingly infinite biological universe.

• Proteomics is often referred to as a ‘post-genome’ science, but its antecedents actually predate the Human Genome Project by two to three decades.

• Although medical informatics has until recently been largely detached from bioinformatics, the emergence of clinical genomics and proteomics increasingly requires the integrated analysis of genetic, cellular, molecular and clinical information and the expertise of pathologists, epidemiologists and biostatisticians.

Page 5: Biomedical informatics                        for proteomics

Introduction

• Proteomics is the latest functional genomics technology to capture our imagination and it is instructive to review some lessons learned during the earlier adoption of another functional genomics technology, namely gene expression analysis using microarrays and similar technologies.

• There are many implications of biomedical informatics for proteomics, including multiple platform technologies, laboratory information-management systems, medical records systems, and documentation of clinical trial results for regulatory agencies.

• In the present work, we confine our discussions to mass spectrometry-based proteomics, and to study design and data resources, tools and analysis in a research setting.

Page 6: Biomedical informatics                        for proteomics

Introduction

• Proteomics depends on careful study design, high-quality biological samples and advanced information technologies.

• Proteome analysis is at a much earlier stage of development than genomics and gene expression (microarray) studies.

• Fundamental issues involving biological variability, pre-analytic factors and analytical reproducibility remain to be resolved.

Page 7: Biomedical informatics                        for proteomics

Study design and sample quality

reporter: 蕭雅茵

Page 8: Biomedical informatics                        for proteomics

Glossary

1) Case-control and cohort studies
Both are observational studies. Case-control: participants are selected by presence or absence of the phenotype (cases vs. controls). Cohort: participants are selected by presence or absence of a risk factor of interest and followed over time for development of an outcome.

2) Confounder / confounding
A factor that distorts the apparent relationship between an exposure and a phenotype of interest.

3) Plasma: the fluid, non-cellular portion of blood
Serum: the protein solution remaining after blood has coagulated

4) Pre-analytical variables
Variables that are present before laboratory testing and data analysis.

5) Randomized clinical trial
Treatments are randomly assigned in order to prevent confounding.

Page 9: Biomedical informatics                        for proteomics

Study design and sample quality

• Potter describes four study designs.

• However, the distinction between observational and experimental designs is often not clearly made in proteomics studies.

Page 10: Biomedical informatics                        for proteomics

Study design and sample quality

① Bias and confounding factors

Observational studies of gene expression and proteomic analyses involving human subjects are exposed to bias and confounding.

Human plasma and serum proteomics are susceptible to observational biases, which can be confused with a specific characteristic of the disease process and so mislead.

Each may induce a change in total protein concentrations of ± 10%.

Highlighting the human serum proteome → its nature, but confounding variables may complicate the findings.

Page 11: Biomedical informatics                        for proteomics

Study design and sample quality

It may not even be possible to adjust for confounding afterwards; the only safeguard is careful design and specimen ascertainment.

② Quality ③ Number

Margolin has admonished that “Scientists...need to avoid the tendency, often driven by the high price of some of the newer techniques, of running under-controlled experiments or experiments with fewer repeated conditions than would have been accepted with standard techniques.”

Proteomics discovery has no a priori enumeration of targets and lacks a well-described procedural structure.

Page 12: Biomedical informatics                        for proteomics

Protein databases

reporter: 計佩岑

Page 13: Biomedical informatics                        for proteomics

Proteome

[Diagram: DNA → mRNA → Proteins, with DNA labelled "Genome" and Proteins labelled "Proteome"]

Page 14: Biomedical informatics                        for proteomics

Protein databases

• Collections of protein sequences date back to the 1960s.

• Utilitarian goals of protein databases (1990s to today)
– Minimal redundancy
– Maximal annotation
– Integration with other databases

Page 15: Biomedical informatics                        for proteomics

Protein databases

• Current molecular sequence databases are classified according to their evolutionary history inferred from sequence homology.
– excellent tools for gene discovery, comparative genomics and molecular evolution
– much work to be done to even minimally serve the needs of proteomics and integrative biological science

Page 16: Biomedical informatics                        for proteomics

Protein databases

• Today's principal protein databases
– emphasize molecular and cellular features in their annotation
– are not well suited to represent physiology.

• A more ideal database for plasma proteome studies would classify proteins from a functional, rather than an evolutionary, viewpoint

Page 17: Biomedical informatics                        for proteomics

Data standards

• Multiple or specialized file formats have hindered accessibility, information exchange and integration

• eXtensible Markup Language (XML)
– an Internet standard for describing structured and semistructured data
– most of the main databases make their data available in XML, which makes it easy to publish and exchange the data

Page 18: Biomedical informatics                        for proteomics

Protein databases

• PDB (Protein Data Bank)
• GenBank
• SWISS-PROT
• EMBL
• HPRD (Human Protein Reference Database)

Page 19: Biomedical informatics                        for proteomics

Protein identification by database searching

reporter: 葉衍陞 葉欣綺 鍾宇彥

Page 20: Biomedical informatics                        for proteomics

Purpose of Protein identification by database searching

• NOT the conventional goals of sequence searching, pursued regardless of species or remoteness of the relationship:
– inferring similarity of function from similarity of sequence
– studying the evolution of protein families or domains

• Proteomics has different aims and therefore requires different strategies and tools

Page 21: Biomedical informatics                        for proteomics

Analysis of human serum

• The interest is in identifying proteins that are not normally present

• match between subsequences
• weak similarities

Page 22: Biomedical informatics                        for proteomics

Statistical significance

• Statistical significance is important, but not in the sense of the probability that two sequences are related by chance

• The question is whether a measurement deviates significantly from a normal range of values.

• If this criterion is met, one is then interested in attempting to demonstrate a significant correlation

Page 23: Biomedical informatics                        for proteomics

One of the factors affecting the database

DNA →(1) mRNA →(2) protein

1. Transcription
2. Translation

Post-translational modification (proteolytic processing, glycosylation, methylation, phosphorylation, removal of Met, disulfide bond formation, acetylation, hydroxylation)

Page 24: Biomedical informatics                        for proteomics

Post translational modification

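proteolytic processing
• Removes the amino acid residues of the signal sequence
• Transport to a specific cell
• Removal is carried out by specific peptidases

glycosylation
• On Asn, and on Ser or Thr
• Takes place mainly in the endoplasmic reticulum
• Oligosaccharide-containing chains with a lubricating function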

Page 25: Biomedical informatics                        for proteomics

Post translational modification

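methylation
• Occurs on specific Lys residues
• Found in some muscle proteins, histones, and cytochrome c

phosphorylation
• Mostly attached to amino acids that carry an -OH group
• Regulates the enzymatic activity of proteins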

Page 26: Biomedical informatics                        for proteomics

Post translational modification

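Met removal
• The N-terminal Met (AUG) is often removed before synthesis of the polypeptide chain is complete

disulfide bond formation
• Not encoded in the mRNA

acetylation
• On histones; regulates transcription

hydroxylation
• Collagen and others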

Page 27: Biomedical informatics                        for proteomics

• Peptide analysis
• Error tolerance
• Scoring methods

Page 28: Biomedical informatics                        for proteomics

Peptide analysis - Experimental process

• Cut into a mixture of short peptides
– Specific cleavage: proteolytic enzyme (protease)

• Mass spectrometry
– Detects the m/z of the compounds
– Tandem mass spectrometry (MS/MS): fragments ions of a specific m/z

• Chromatography
– Separation before MS
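The cleavage step can be imitated in software to predict which peptides a protein should yield. The sketch below assumes trypsin as the enzyme and the commonly used rule "cut after K or R unless the next residue is P"; the slide names no specific enzyme, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TrypsinDigestSketch {
    // Assumed rule: cleave after K or R, except when the next residue is P.
    static List<String> digest(String protein) {
        List<String> peptides = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < protein.length() - 1; i++) {
            char c = protein.charAt(i);
            boolean basic = (c == 'K' || c == 'R');
            boolean blockedByProline = protein.charAt(i + 1) == 'P';
            if (basic && !blockedByProline) {
                peptides.add(protein.substring(start, i + 1));
                start = i + 1;
            }
        }
        // Whatever remains after the last cleavage site is the C-terminal peptide.
        if (start < protein.length()) peptides.add(protein.substring(start));
        return peptides;
    }
}
```

Real digests also contain missed cleavages and non-specific cuts, which this sketch ignores.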

Page 29: Biomedical informatics                        for proteomics

Tandem mass spectrometry

http://en.wikipedia.org/wiki/Tandem_mass_spectrometry

Page 30: Biomedical informatics                        for proteomics

Chromatography

Dionex

Page 31: Biomedical informatics                        for proteomics

Peptide analysis- Mass Spectrometry

• Several approaches
– Analytic peptide-mass fingerprint: used as a profile
– Comparison with the predicted spectrum: match to a database
– De novo sequence interpretation: manual interpretation by an expert; time-consuming
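As a rough illustration of the database-matching idea, the sketch below computes a theoretical monoisotopic peptide mass and picks the candidate closest to an observed mass. The residue masses are standard monoisotopic values; the class, the method names and the absolute tolerance in daltons are assumptions made for this example, not something the slides specify.

```java
import java.util.HashMap;
import java.util.Map;

final class PeptideMassSketch {
    static final double WATER = 18.01056; // one H2O per peptide (free termini)
    static final Map<Character, Double> RESIDUE = new HashMap<>();
    static {
        String aa = "GASPVTCLINDQKEMHFRYW";
        double[] mass = {57.02146, 71.03711, 87.03203, 97.05276, 99.06841,
                         101.04768, 103.00919, 113.08406, 113.08406, 114.04293,
                         115.02694, 128.05858, 128.09496, 129.04259, 131.04049,
                         137.05891, 147.06841, 156.10111, 163.06333, 186.07931};
        for (int i = 0; i < aa.length(); i++) RESIDUE.put(aa.charAt(i), mass[i]);
    }

    // Sum of residue masses plus one water.
    static double monoisotopicMass(String peptide) {
        double sum = WATER;
        for (char c : peptide.toCharArray()) {
            Double m = RESIDUE.get(c);
            if (m == null) throw new IllegalArgumentException("unknown residue: " + c);
            sum += m;
        }
        return sum;
    }

    // Returns the candidate closest in mass to the observation, or null if none is within tolDa.
    static String bestMatch(double observedMass, String[] candidates, double tolDa) {
        String best = null;
        double bestErr = tolDa;
        for (String c : candidates) {
            double err = Math.abs(monoisotopicMass(c) - observedMass);
            if (err <= bestErr) { best = c; bestErr = err; }
        }
        return best;
    }
}
```

Mass alone cannot distinguish peptides with the same composition in a different order (for example "GAS" and "SAG"), which is one reason fragment spectra from MS/MS are used.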

Page 32: Biomedical informatics                        for proteomics

Consideration of Error Tolerance

• Enzyme (protease) non-specificity

• Precursor charge errors
– More than one charge acquired in ionization
– Isotopes

• Mass measurement errors
– Related to the accuracy of the instrument

• Unsuspected modifications
– e.g. post-translational modification

• Primary sequence variations
– deletions, insertions, substitutions

[2002] Error tolerant searching of uninterpreted tandem mass spectrometry data
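Two of these error sources, mass measurement error and unsuspected modifications, can be sketched as a tolerance check in parts per million (ppm) that also tries a few modification mass shifts. This is only an illustration, not the procedure of the cited 2002 paper: the short modification list, the names and the tolerance are assumptions; the mass deltas are standard monoisotopic values.

```java
final class ErrorTolerantMatchSketch {
    // Assumed, deliberately short list of candidate modifications (monoisotopic deltas in Da).
    static final String[] MOD_NAMES = {"none", "phosphorylation", "acetylation", "methylation", "hydroxylation"};
    static final double[] MOD_DELTAS = {0.0, 79.96633, 42.01057, 14.01565, 15.99491};

    // Relative mass error in parts per million.
    static double ppmError(double observed, double theoretical) {
        return (observed - theoretical) / theoretical * 1e6;
    }

    // Returns the first modification that explains the observed mass within tolPpm, or null.
    static String explain(double observed, double unmodifiedMass, double tolPpm) {
        for (int i = 0; i < MOD_DELTAS.length; i++) {
            double theoretical = unmodifiedMass + MOD_DELTAS[i];
            if (Math.abs(ppmError(observed, theoretical)) <= tolPpm) return MOD_NAMES[i];
        }
        return null;
    }
}
```

Every extra modification or sequence variant that is allowed enlarges the search space, so more tolerant searches also admit more chance matches.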

Page 33: Biomedical informatics                        for proteomics

Scoring methods description

• In general, each scoring algorithm designates a quantity related to the probability that the candidate peptide could have produced the observed spectrum by chance

• Ranking is required for high-throughput automated analysis

Page 34: Biomedical informatics                        for proteomics

Example of peptide identification

PB cannot be identified due to high variation.

Solution: reduce the number of target peptides.

Page 35: Biomedical informatics                        for proteomics

Another challenge

• Another challenge in automating proteomics: simply taking the best match of a scoring algorithm is not good enough.

• Establishing an overall criterion for acceptance therefore becomes the main focus of automated proteomics.

Page 36: Biomedical informatics                        for proteomics

Scoring threshold, P value

• It is generally assumed that higher-scoring assignments are more likely to be correct than lower-scoring assignments.

• Threshold:
i: Sensitivity
ii: Specificity
iii: Mixture, sequence database

• P values: if assignments with P < 0.05 are accepted, 5% of all false tests will be misidentified as true.
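The quantities on this slide can be written as simple ratios. The sketch below is a minimal illustration; the counts in the usage note are invented for the example and the names are hypothetical.

```java
final class ThresholdStatsSketch {
    // Fraction of truly correct assignments that pass the threshold.
    static double sensitivity(int truePositives, int falseNegatives) {
        return truePositives / (double) (truePositives + falseNegatives);
    }

    // Fraction of truly incorrect assignments that are rejected.
    static double specificity(int trueNegatives, int falsePositives) {
        return trueNegatives / (double) (trueNegatives + falsePositives);
    }

    // With a P-value cutoff alpha, about alpha of the incorrect assignments pass anyway.
    static double expectedFalsePositives(int incorrectAssignments, double alpha) {
        return incorrectAssignments * alpha;
    }
}
```

For instance, if 1000 incorrect assignments are tested at a cutoff of 0.05, about 50 of them would be expected to be accepted, regardless of how many correct assignments there are.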

Page 37: Biomedical informatics                        for proteomics

Scoring threshold, P value

http://rating.com.vn/home/_/Y-nghia-cua-tri-so-P-tuc-P-value.26.1080

[Figure: P value illustration; vertical axis: Probability]

Page 38: Biomedical informatics                        for proteomics

P value-like quantities

• Keller et al. estimate the reference distributions of the correct and incorrect assignments within any experiment.

• Keller et al. describe an approach that may allow a scoring algorithm to be converted into P value-like quantities that can then be used to control error rates.
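One way to picture such P value-like quantities is a two-component mixture: given score distributions for correct and incorrect assignments, Bayes' rule gives the probability that an assignment with a given score is correct. The sketch below is not taken from Keller et al.; the Gaussian shapes, the parameters and all names are assumptions chosen only to show the calculation.

```java
final class MixturePosteriorSketch {
    // Density of a normal distribution (assumed shape for both score distributions).
    static double gaussianPdf(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
    }

    // P(correct | score) by Bayes' rule for a mixture of "correct" and "incorrect" assignments.
    static double posteriorCorrect(double score, double priorCorrect,
                                   double meanCorrect, double sdCorrect,
                                   double meanIncorrect, double sdIncorrect) {
        double correct = priorCorrect * gaussianPdf(score, meanCorrect, sdCorrect);
        double incorrect = (1 - priorCorrect) * gaussianPdf(score, meanIncorrect, sdIncorrect);
        return correct / (correct + incorrect);
    }
}
```

Accepting only assignments whose posterior exceeds a chosen value then controls the expected error rate among accepted assignments, provided the assumed distributions fit the data.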

Page 39: Biomedical informatics                        for proteomics

*Pattern matching without protein identification

*Conclusions and future challenges

reporter: 蘇鈺惠 陳雲濤

Page 40: Biomedical informatics                        for proteomics

Time-of-flight mass spectrometry (TOF)

Page 41: Biomedical informatics                        for proteomics

Time-of-flight mass spectrometry

• A method of mass spectrometry
• Ions are accelerated by an electric field
• The velocity of the ion depends on the mass-to-charge ratio
• The flight time is measured
• From this time and the known experimental parameters, we can obtain the mass-to-charge ratio of the ion.

Page 42: Biomedical informatics                        for proteomics

Time-of-flight mass spectrometry

• Time-of-flight mass spectrometry (TOFMS) is a method of mass spectrometry in which ions are accelerated by an electric field of known strength. This acceleration results in an ion having the same kinetic energy as any other ion that has the same charge. The velocity of the ion depends on the mass-to-charge ratio. The time that it subsequently takes for the particle to reach a detector at a known distance is measured. This time will depend on the mass-to-charge ratio of the particle (heavier particles reach lower speeds). From this time and the known experimental parameters one can find the mass-to-charge ratio of the ion. The time of flight is the elapsed time from the instant a particle leaves a source to the instant it reaches a detector. (From Wikipedia)

Page 43: Biomedical informatics                        for proteomics

Principle & method

• Ep = q*U
• Ek = 1/2*m*v^2
• Ek = Ep, so q*U = 1/2*m*v^2, with v = d/t
• Therefore t = k*sqrt(m/q), where k = d/sqrt(2*U)

• The velocity is determined by the length of the time-of-flight tube (d) and the ion's time of flight (t): v = d/t
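The relation t = k*sqrt(m/q) can be checked numerically. The sketch below uses SI units and an idealized field-free flight tube with no initial ion velocity; the class name and the instrument settings in the usage note (20 kV, 1 m tube) are illustrative assumptions.

```java
final class TofSketch {
    static final double ELEMENTARY_CHARGE = 1.602176634e-19; // coulombs
    static final double DALTON = 1.66053906660e-27;          // kilograms

    // t = d / sqrt(2U) * sqrt(m/q), with m in kg and q in C.
    static double flightTimeSeconds(double massDa, int charge, double voltage, double tubeLengthM) {
        double m = massDa * DALTON;
        double q = charge * ELEMENTARY_CHARGE;
        double k = tubeLengthM / Math.sqrt(2 * voltage);
        return k * Math.sqrt(m / q);
    }

    // Inverse relation: m/q = (t/k)^2, reported in daltons per elementary charge.
    static double massToChargeDa(double timeSeconds, double voltage, double tubeLengthM) {
        double k = tubeLengthM / Math.sqrt(2 * voltage);
        double mOverQ = (timeSeconds / k) * (timeSeconds / k); // kg per coulomb
        return mOverQ * ELEMENTARY_CHARGE / DALTON;
    }
}
```

With these assumed settings, a singly charged 1000 Da ion takes roughly 16 microseconds to reach the detector, and an ion four times heavier takes twice as long.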

Page 44: Biomedical informatics                        for proteomics

application

• Matrix-assisted laser desorption/ionization (MALDI) is a pulsed ionization technique that is readily compatible with TOF MS; the combination is MALDI-TOF.

• 1. ionize molecules via a laser pulse
• 2. separate molecules according to mass-to-charge ratio
• 3. mainly used for detection of large biomolecules.

Page 45: Biomedical informatics                        for proteomics

Component of MALDI-TOF

Page 46: Biomedical informatics                        for proteomics

• http://www.youtube.com/watch?v=gTRsaAnkRVU

Page 47: Biomedical informatics                        for proteomics

Drawback of TOF

• Each m/z value of the spectrum reflects the abundance of possibly many peptides having a similar mass. Thus, with complex mixtures, these TOF methods are not able to identify individual peptides.

Page 48: Biomedical informatics                        for proteomics

Using TOF

• When used with complex mixtures, analysis methods are intended to identify peaks, or features, of the spectrum that can segregate identifiable groups

• As when evaluating expression arrays, the analysis uses clustering methods, plus pattern matching for alignment and peak identification.
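Peak identification can be sketched, in its simplest form, as a search for local maxima above a noise floor. The slides do not say which peak-picking method is used; the strict local-maximum rule, the threshold and the names below are assumptions, and real spectra usually need smoothing and baseline correction first.

```java
import java.util.ArrayList;
import java.util.List;

final class PeakPickSketch {
    // Returns indices whose intensity is at least minHeight and strictly above both neighbours.
    // Flat-topped peaks (plateaus) are ignored by this simple rule.
    static List<Integer> findPeaks(double[] intensity, double minHeight) {
        List<Integer> peaks = new ArrayList<>();
        for (int i = 1; i < intensity.length - 1; i++) {
            boolean aboveNoise = intensity[i] >= minHeight;
            boolean localMax = intensity[i] > intensity[i - 1] && intensity[i] > intensity[i + 1];
            if (aboveNoise && localMax) peaks.add(i);
        }
        return peaks;
    }
}
```

The resulting peak positions, rather than peptide identities, are the features that later get aligned across samples and clustered.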

Page 49: Biomedical informatics                        for proteomics

Expression array (figure)

Fig. 2 TOF-SIMS image and mass spectrum of a high-density array. (a) Binary array of melatonin (dark color) and uridine (light color) Each 50 μm × 50 μm vial was loaded with 2.4 pmol of the respective molecules. (b) Representative mass spectrum obtained from 2.4 pmol of melatonin localized within a single nanovial. (From R. M. Braun et al., Spatially resolved detection of attomole quantities of organic molecules localized in picoliter vials using time-of-flight secondary ion mass spectrometry, Anal. Chem, 71:3318–3324, 1999)


Page 50: Biomedical informatics                        for proteomics

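Cluster Algorithm

• A classification method: starting from a reference point, a cluster is described as containing no fewer than MinPt points within a bounded range (Eps) of that point.

• The range is measured with the Euclidean distance or the Manhattan distance.

• Widely used, e.g. in business market analysis, biological classification research, biomedical informatics, data mining, machine learning and image analysis.

• Types:
– Partitioning methods
– Hierarchical methods
– Density-based methods
– Grid-based methods
– Model-based methods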
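The Eps and MinPt condition can be sketched as a neighbour count. The slides do not name the algorithm; the condition resembles the core-point test of density-based clustering such as DBSCAN, and the sketch assumes two-dimensional points, Euclidean distance and the convention that a point counts as its own neighbour. All names are hypothetical.

```java
final class CorePointSketch {
    // Number of points (including the point itself) within distance eps of points[idx].
    static int neighbourCount(double[][] points, int idx, double eps) {
        int count = 0;
        for (double[] p : points) {
            double dx = p[0] - points[idx][0];
            double dy = p[1] - points[idx][1];
            if (dx * dx + dy * dy <= eps * eps) count++; // compare squared distances
        }
        return count;
    }

    // A point is a core point if its eps-neighbourhood holds at least minPts points.
    static boolean isCore(double[][] points, int idx, double eps, int minPts) {
        return neighbourCount(points, idx, eps) >= minPts;
    }
}
```

Points that are neither core points nor within Eps of a core point are the ones a density-based method labels as noise.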

Page 51: Biomedical informatics                        for proteomics

Cluster Algorithm

• Euclidean distance

• Taxicab geometry (Manhattan distance)

• In the figure, the green line is the Euclidean distance; the other paths shown all have the same total Manhattan distance.

(Reference: Wiki)
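The two distance measures differ only in how the coordinate differences are combined, as this minimal sketch shows (names are hypothetical; the vectors are assumed to have equal length).

```java
final class DistanceSketch {
    // Straight-line distance: square root of the sum of squared coordinate differences.
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    // Taxicab distance: sum of absolute coordinate differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum;
    }
}
```

For the points (0,0) and (3,4) the Euclidean distance is 5 and the Manhattan distance is 7; every staircase path between them that never moves away from the target has the same length of 7.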

Page 52: Biomedical informatics                        for proteomics

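Cluster Algorithm

• Example (see figure): assume MinPt == 4

(3,14) is judged to be "noise". Why, then, does the single point (8,3) form a "cluster"? Shouldn't a cluster contain at least MinPts points, so that with only one point (8,3) should be classified as noise? The reason is that early in the algorithm (8,3), (5,3), (8,6) and (10,4) were grouped into one cluster, and at that moment (8,3) was determined to be a core point; this decision is never changed. Only later were (5,3), (8,6) and (10,4) reassigned to other clusters.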

Page 53: Biomedical informatics                        for proteomics

Pattern matching

• Exact pattern matching: M = 6 (pattern "needle"), N = 17 (text "hayneedsanneedlex")

[Figure: the pattern "needle" is aligned under the text "hayneedsanneedlex" at each successive offset from 0 to 10; the match is found at offset 10.]

(Reference: http://www.cs.princeton.edu/~rs/AlgsDS07/21PatternMatching.pdf)

Page 54: Biomedical informatics                        for proteomics

Pattern matching

    public static int search(String pattern, String text)
    {
        int M = pattern.length();          // M = strlen("needle")
        int N = text.length();             // N = …
        for (int i = 0; i <= N - M; i++)   // try each of the N-M+1 start positions
        {
            int j;
            for (j = 0; j < M; j++)        // compare character by character from the start of the pattern
                if (text.charAt(i+j) != pattern.charAt(j))
                    break;                 // stop at the first mismatch
            if (j == M) return i;          // complete match: j equals M, return the start index i
        }
        return -1;                         // no match found
    }

Page 55: Biomedical informatics                        for proteomics

Pattern matching

• The method above is O((N-M)*M) …

• A method improved to O(N): the Knuth-Morris-Pratt (KMP) exact pattern-matching algorithm

• Idea behind the improvement:
– build a DFA from the pattern
– simulate the DFA with the text as input
– match of the input character: move from state i to i+1
– mismatch: move to a previous state

Page 56: Biomedical informatics                        for proteomics

KMP Pattern matching

Page 57: Biomedical informatics                        for proteomics

KMP Pattern matching

• DFA representation: a single state-indexed array next[]
– Upon character match in state j, go forward to state j+1.
– Upon character mismatch in state j, go back to state next[j].

Page 58: Biomedical informatics                        for proteomics

KMP Pattern matching

• Simulation of the KMP DFA, implemented with the precomputed next[] array

    int j = 0;
    for (int i = 0; i < N; i++)
    {
        if (t.charAt(i) == p.charAt(j)) j++;   // match
        else j = next[j];                      // mismatch: jump back to state next[j]
        if (j == M) return i - M + 1;          // found
    }
    return -1;                                 // not found
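The slide does not show how next[] is built, and its single-array form only works when next[j] already accounts for the mismatched character (as it can for a two-letter alphabet). As a hedged alternative for arbitrary alphabets, the sketch below uses the widely taught failure-function formulation of KMP, which is a different but equivalent representation; the class and method names are hypothetical.

```java
final class KmpSketch {
    // fail[i] = length of the longest proper prefix of p[0..i] that is also a suffix of it.
    static int[] buildFailure(String p) {
        int[] fail = new int[p.length()];
        int k = 0;
        for (int i = 1; i < p.length(); i++) {
            while (k > 0 && p.charAt(i) != p.charAt(k)) k = fail[k - 1];
            if (p.charAt(i) == p.charAt(k)) k++;
            fail[i] = k;
        }
        return fail;
    }

    // Returns the index of the first occurrence of p in t, or -1 if there is none.
    static int search(String p, String t) {
        if (p.isEmpty()) return 0;
        int[] fail = buildFailure(p);
        int j = 0; // number of pattern characters currently matched
        for (int i = 0; i < t.length(); i++) {
            // On a mismatch, fall back without moving i backwards in the text.
            while (j > 0 && t.charAt(i) != p.charAt(j)) j = fail[j - 1];
            if (t.charAt(i) == p.charAt(j)) j++;
            if (j == p.length()) return i - p.length() + 1;
        }
        return -1;
    }
}
```

Because the text index never moves backwards and the fallback steps are bounded by the number of earlier matches, the search runs in time proportional to N + M.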

Page 59: Biomedical informatics                        for proteomics

About TOF

• (That was quite a tangent... let's get back on track!)

• Even though the TOF algorithms have not yet led to peptide identification, this factor does not greatly limit their utility for identifying newer and far more accurate approaches for medical diagnostics, because diagnosing disease is a problem of prediction rather than of aetiology.

• Algorithms that have potential clinical relevance have already been identified by Petricoin et al. and Adam et al. for diagnosing ovarian and prostate cancer, respectively.

• The efficiency of the TOF approaches, and their demonstrated ability to generate highly accurate diagnostic tests, may provide advantages for this technology compared with others for the development of medical diagnostics.

Page 60: Biomedical informatics                        for proteomics

Conclusions and future challenges

• Proteomics is a powerful, post-genome paradigm that seeks to describe and explain what Erwin Chargaff called the “immensely diversified phenomenology” of cells and organisms.

• Beyond the enumerations and characterizations of different proteomes lies the elucidation of macromolecular interactions, complexes and networks. Informatics will play a crucial role in working towards these goals.

Page 61: Biomedical informatics                        for proteomics

Thank you

Report Group: Biomedical informatics for proteomics