Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
저 시-비 리- 경 지 2.0 한민
는 아래 조건 르는 경 에 한하여 게
l 저 물 복제, 포, 전송, 전시, 공연 송할 수 습니다.
다 과 같 조건 라야 합니다:
l 하는, 저 물 나 포 경 , 저 물에 적 된 허락조건 명확하게 나타내어야 합니다.
l 저 터 허가를 면 러한 조건들 적 되지 않습니다.
저 에 른 리는 내 에 하여 향 지 않습니다.
것 허락규약(Legal Code) 해하 쉽게 약한 것 니다.
Disclaimer
저 시. 하는 원저 를 시하여야 합니다.
비 리. 하는 저 물 리 목적 할 수 없습니다.
경 지. 하는 저 물 개 , 형 또는 가공할 수 없습니다.
이학석사학위논문
Identification of Synthetic
Survival and Synthetic Survival
burden in multiple Cancer
암 유전체 자료에서 합성생존 체세포 돌연변이
쌍의 검출과 분석
2015년 8월
서울대학교 대학원
자연과학부 협동과정 생물정보학 전공
임 영 균
Identification of Synthetic
Survival and Synthetic Survival
burden in multiple Cancer
암 유전체 자료에서 합성생존 체세포 돌연변이
쌍의 검출과 분석
지도교수 김 주 한
이 논문을 이학석사학위논문으로 제출함
2015년 7월
서울대학교 대학원
자연과학부 생물정보학 전공
임 영 균
임 영 균의 석사학위논문를 인준함
2015년 6월
위 원 장 김 선 (인)
부위원장 김 주 한 (인)
위 원 윤 성 로 (인)
Abstract
Identification of Synthetic
Survival and Synthetic Survival
burden in multiple Cancer
Younggyun, Lim
Bioinformatics
The graduate school
Seoul National University
Background: According to cancer genomics become prevalent, large
cancer genome projects are launched and almost finished. From these
project, a huge amount of genomic, transcriptomic, epigenomic data is
released in public. This public data have provided quite some studies
of somatic mutations in cancer cell, then many researchers have
identified several cancer-related genes. However, some existing cancer
genomics studies have a limitations. Many studies are restricted to
single cancer type or single cancer driver gene. It can limit study’s
scope to identify how cancer occur, develop and progress. To solve
this problem, a concept of synthetic lethality is arisen. Synthetic
lethality is a concept that a combination of mutations in two genes
lead to cell death, while a mutation in an only one of these genes
remains cell survival. We propose a concept synthetic survival based
on the idea of synthetic lethality. Synthetic Survival (SS) is a
concept that a combinations of mutations in two gene lead to patients
survival, whereas a mutation in only one gene lead to patients bad
prognosis. The goal of this dissertation is to identify synthetic
survival and synthetic survival burden in multiple cancers.
Method: We downloaded 9,184 patients’ clinical information and 6,936
patients’ somatic mutation data in 20 cancer types from The Cancer
Genome Atlas. all of data is open-accessed data. To make core data
set for further analysis, several criteria is applied to filter out data
which is not able to analysis. All genes harboring at least one
non-synonymous mutations are annotated to gene damaging score.
Gene damaging score is a measure to quantify each gene’s
deleteriousness. Gene damaging score is derived from both scores in
Sorting Intolerant From Tolerant and classification of LoF mutations.
By gene damaging score, patients are stratified to four groups.
Survival analysis are conducted for all gene pairs’ group with cox
proportional hazard model with penalized likelihood. From the cox
model, candidates of potential synthetic survival pairs are decided.
Result: The result from filtering raw data, we build the core data set
for 4,844 patients including clinical information and somatic mutation
profiles. For candidates of potential synthetic survival pairs, 436 gene
pairs are identified in 5 cancer types. We also identified synthetic
survival burden which means that the group which have more SS
pairs and triplets have higher survival rate than the group which
have less SS pairs
Discussion: In this dissertation, Candidate SS pairs are identify by
only performing survival analysis with cox proportional hazard model
and also ensure that the number of SS pairs is related to patient’s
prognosis. Genes consisting SS pairs are usually related to cell-cell
interaction or migration so might be action to prevent metastasis,
then shows good prognosis. This study might be approach to reduce
cost for screening and to apply practical clinical uses.
Keywords: Cancer genomics, The Cancer Genome
Atlas, Synthetic survival, Survival analysis, Synthetic
survival burden
Student Number: 2013-20406
Contents
Ⅰ. Introduction.....................................................................................1
1.1 Cancer genomics.......................................................................1
1.2 Existing studies in cancer genomics.....................................2
1.3 Synthetic lethality and Synthetic Survival (SS).................2
1.4 Research goal............................................................................3
1.5 Related studies..........................................................................4
Ⅱ. Methods & Materials....................................................................5
2.1 Data..............................................................................................5
2.2 Data Processing(Filtering)........................................................5
2.3 Gene Damaging Score...............................................................6
2.4 Cox proportional hazard model with penalized likelihood....7
2.5 Synthetic Surival burden.......................................................8
Ⅲ. Result..............................................................................................9
3.1 TCGA core data set.................................................................9
3.2 Gene damaging score distribution..........................................9
3.3 Candidate Synthetic Survival pairs.....................................10
3.4 Synthetic Survival burden...................................................11
Ⅳ. Conclusion & Discussion............................................................11
Ⅴ. References.....................................................................................13
국문초록...............................................................................................34
- 1 -
Ⅰ. Introduction
1.1 Cancer genomics
Cancer is genetic disease. The transition from a normal cell to a
malignant cancer cell is caused by changes to a cell’s DNA called
mutations [1]. Some of those mutations might by inherit, but most of
cancer causing mutations is acquired throughout life also known as
somatic mutations. Recently, whole genome sequencing and whole
exome sequencing technologies have provided quite several studies of
somatic mutations in variable cancer and have made researchers
identify multiple cancer-related driver genes [2, 3]. With this trend,
large cancer genome project such as The Cancer Genome Atlas
(TCGA), International Cancer Genome Consortium (ICGC) are
launched and some of those are almost finished [4, 5]. TCGA is a
comprehensive and coordinated to accelerate understanding of the
molecular basis of cancer through the multi-omics technologies and
TCGA released multi-omics data and clinical data in public.
One major application for cancer patients with primary tumor is
accurate prognosis, which helps divided patients into specific risk
groups and decide both treatment and observation patient status.
Traditionally, prognosis is based on only common clinical variables
such as ages, pathologic tumor stages. Recently, molecular variables
such as genetic alterations or amplifications are incorporated to cancer
patient prognosis. Representatively, ER, PR, HER2 protein levels are
significant biomarker in breast cancer, which is really applicable to
practical treatment [6].
- 2 -
1.2 Existing studies in cancer genomics
Lately, typical cancer genomics studies and papers are carried out by
TCGA. TCGA published genomic, transcriptomic and epigenetic
landscape profile of each cancer types for about 30 types [7, 8, 9, 10].
Those studies include finding cancer driver mutations, molecular
classification of cancer and heterogeneity of cancer and so on. To
find cancer driver mutations, a number of algorithms and tools are
developed [11, 12]. Those tools use statistical methods, so detect
significant mutation based on background mutations rate. In spite of
these developments, most of researches are focused single driver
gene. By convention, studies associated with prognosis also have
been limited single gene and single cancer lineage. Many cancers
related genes identified from these researches are not directly
‘druggable’ or are not applying predicting prognosis. A huge number
of studies are enacted, but practical clinical application is still difficult
and sure prognosis factor is not identified.
1.3 Synthetic lethality and Synthetic Survival
(SS)
Several decades ago, the concept of synthetic lethality is arisen [13].
Synthetic lethality is a concept that a combination of mutations in
two genes lead to cell death, while a mutation in an only one of
these genes remains cell survival [13, 14]. (Figure 1)
- 3 -
In this dissertation, we propose a concept of synthetic survival based
on the idea of synthetic lethality. Synthetic survival (SS) is a concept
that a combinations of mutations in two gene lead to patients
survival, whereas a mutation in only one gene lead to patients bad
prognosis. Synthetic Survival can analyze gene interactions in cancer,
then infer how those gene interaction affect to patients’ prognosis.
Most of synthetic lethality is identified with experimental method
such as RNAi screening [15]. However, an experimental method such
as RNAi screening is cost, time and labor-intensive. For this reason,
synthetic lethal screening is not able to conduct with high throughput
method. However, synthetic survival screening is able to conduct
with high throughput way if sufficient data exist. For synthetic
survival analysis, only patients’ clinical information such as survival
status and time and somatic mutation profile are needed. Although
experimental validations might be needed after computational high
throughput analysis, computational screening can save cost, time and
human labor.
1.4 Research goal
The goal of this dissertation is to identify synthetic survival and
synthetic survival burden in multiple cancers. Gene pairs associating
with synthetic survival might be potentially used as prognostic factor
in specific cancer type or across cancer types. In this dissertation, we
propose simple methods for identifying synthetic survival with
survival analysis. We conduct survival analysis with cox proportional
hazard model for all possible gene pairs and identify candidates of
- 4 -
synthetic survival. We also identify synthetic survival burden, which
is accumulation of synthetic survival gene pairs affect patient’s
prognosis.
1.5 Related studies
In 2011, a research to predict patients’ survival in ovarian cancer
with molecular profile is introduced [16]. In this study, patients are
stratified into groups by BRCA1 and BRCA2 mutation status. This
study is initial approach to predict prognosis of cancer patients’ with
not only clinical variables but also molecular profiles.
Recently, computational methods to find synthetic lethal pairs are
introduced. Xiaosheng Wang et al tried to use only expression data to
find synthetic lethal relationship with TP53 gene [17]. Although this
is study is not associated with prognosis and gene sets are limited to
only kinases, it has a meaning that using only genomic data with
computational approach.
Livnat Jerby-Arnon et al. find Synthetic lethal pairs with only
data-driven method such as Genomic survival of fittest, functional
examination and coexpression data [18]. In this research, a huge
amount of data is used to find potential candidate of synthetic lethal
gene pairs, and researchers are using not only variable genomic data
also clinical variables.
Here, we analysis synthetic survival with somatic mutation data and
clinical data in 20 cancer type, then identify potential SS gene
relationship with only simple survival analysis.
- 5 -
Ⅱ. Methods & Materials
Overall workflow is divided four part, which is data processing, gene
damaging score, survival analysis and interpretation.(Figure 2)
2.1 Data
For analysis, Data downloaded from TCGA data portal is used. Data
is downloaded on 4 Mar 2015. It includes somatic mutation level 2
data (Whole exome sequencing data) of 5,618 patients and clinical
level 2 data of 6,838 patients. Somatic mutation level 2 data in TCGA
is formatted as mutation annotation format (maf). Position and variant
classification is used for analysis among annotated information.
Variants are classified as ‘Missense mutation’, ‘Nonsense mutation’,
‘Frameshift indel’, ‘In frame indel’, ‘splice site mutation;, Silent
mutation’, ‘Intron’, ‘UTR’ and ‘Intergenic’. Among these, only
non-synonymous mutations are used for analysis. Clinical level 2 data
includes a number of clinical variables according to cancer types, and
clinical variables for cox proportional hazard model are reviewed by
professional pathologist (Table 1).
2.2 Data Processing (Filtering)
First of all, for clinical data, patients who do not have information for
cox proportional hazard model are excluded. Next, patients who is
relevant to strong confounding factor of patients’ prognosis such as
- 6 -
the history of other malignancy, the history of neo adjuvant
treatment, radiation adjuvant treatment, pharmaco adjuvant treatment,
ablation adjuvant and metastasis to distant organs are excluded. At
last, patients who do not have mutation information are excluded. For
mutations data, to begin with, synonymous mutations are excluded.
Secondly, mutations in ‘Unknown’ gene not annotated with official
gene symbol from HGNC [19] are excluded. Finally, mutations of
patients who do not have clinical information are excluded. Then
4,844 patients are remained for further analysis.(Figure 2, Table 2)
2.3 Gene Damaging Score
To quantify deleteriousness of a gene, gene damaging score is
defined. Gene damaging score is calculated by considering the number
and classification of mutations in the gene. The scale of gene
damaging score is zero to one. The smaller gene damaging score
means the more deleterious gene. If the gene harbors Lof mutation
which contains nonsense mutation, frameshift insertion and deletion,
nonstop mutation and splice site mutation, the gene damaging score
of the gene is zero.[20] If not, the gene damaging score of the gene
is the geometric mean of SIFT score[21, 22, 23] of mutations in the
gene. To avoid the situation dividing by zero, if SIFT score is zero,
it substituted 10-e8. In all cancer, most of gene damaging score is
one, and then zero is second large portion. (Figure 3, Figure 4)
- 7 -
2.4 Cox proportional hazard model with penalized
likelihood
Cox proportional hazard model [24] is used for survival analysis. Cox
proportional hazard model is able to adjust confounding clinical
variables. To identify prognosis effect of gene pair’ mutation status,
patients are divided into four groups by every gene pairs; both genes’
damaging score below 0.3, only one gene’s damaging score below 0.3,
another gene’s damaging score below 0.3 and both genes’ damaging
score above 0.3. Because of this kind of grouping, which is,
classification according two genes’ damaging score, the group whose
both gene damaging score is below 0.3 is very small and sometimes
has no death event. So, cox proportional hazard model with maximum
likelihood, which is generally used for cox regression, is not adequate.
Cox proportional hazard model with penalized likelihood [25] can solve
this convergence problem. Survival analysis is conducted with R
(3.2.0)[26] and ‘coxphf’ packages [27].
For every cancer types, clinical features are added to cox proportional
hazard model to adjust confounding effects. Common clinical features
like age, gender are added and cancer specific clinical features
reviewed by professional pathologist and prior studies are also added
to adjust confounding effects.[28, 29] (Table 3)
Potential SS pairs are determined by p-value of cox proportional
hazard model, p-values of mutational status and hazard ratio. Criteria
are p-value of cox model below 0.05 and p-values of mutational
status below 0.05 and hazard ratio above 1.
- 8 -
2.5 Synthetic survival triplet analysis and
synthetic survival burden
A undirected graph showing relation among the significant pairs is
visualized with Graphviz(2.36) function ‘neato’.[30] To identify
prognosis effect according to the number of SS pairs, we conduct
survival analysis with grouping by the number of SS pairs of
patients. Patients distribution by the number of SS pairs are very
skewed (Figure 5), so all groups are merge to four or five group.
(Figure 6)
- 9 -
Ⅲ. Result
3.1 TCGA core data set
The result from filtering raw data is 4,844 patients’ clinical
information and DNA somatic mutation data in 20 cancer types. This
data set include both data types and all of clinical variables needed
for cox proportional hazard model in each cancer types. In this
dissertation, this data set is named core data set, and further analysis
is conducted with this core data set. (Table 1)
3.2 Gene damaging score distribution
All genes that harbors at least one non-synonymous mutation in each
cancer types are annotated by gene damaging score. If genes have no
non-synonymous mutations, their gene damaging scores are
designated to 1. For gene damaging score of all patients, we make a
matrix of which row is gene and column is patient. Because of
scarcity of somatic mutation event, the matrix is very sparse which
means that most of gene damaging score in the matrix is 1. Except
1, 0 is high proportion of gene score. (Figure 4) To dichotomize gene
score to binary factor, threshold 0.3 is applied. It provides well
grouping for gene deleteriousness (Figure 4, black line)
- 10 -
3.3 Candidate Synthetic Survival pairs
The result from performing survival analysis for twenty cancer types
is 436 candidates of SS pairs in five cancer types with criteria, which
is p-value below 0.05, and 45 candidates of SS pairs in three cancer
types with criteria which is p-value below 0.01. (Table 4) Most of
results are skewed to particular cancer types such as Lung
adenocarcinoma, Skin cutaneous melanoma. Both cancer types have
high somatic mutation frequency [31], and somehow high mortality.
436 pairs are consisted of 281 genes, and some genes like XIRP2,
RYR3 are showing high frequency in pairs. (Figure 6) Most of
highly frequently appeared genes are related to cadherin genes.
Cadherin genes are closely involved cell surface, so also involved
cell-cell interactions [32]. This result infers that these frequently
appeared genes are probably connected to cancer cell metastasis
mechanism.
GO analysis is conducted with genes which consist SS pairs. Highly
enriched terms are mostly related to cell-cell interaction such as cell
adhesion or extracellular matrix part (Figure 7). This result also
infers that synthetic survival gene pairs have a role for cell-cell
interaction, then it might be associated with cancer cell metastasis
[33,34].
Most of patients have no SS pairs because of scarcity, and the
number of patients is decreased by increasing SS pairs (Figure 10).
- 11 -
3.4 Synthetic Survival burden
In synthetic survival burden test, the result to identify prognostic
effects by increasing SS pairs shows that the more SS pairs or
affect better patients’ survival. (Figure 9, Figure 10) The group,
which have more SS pairs and triplets, have higher survival rate
than the group, which have less SS pairs and triplets. From this
result, we are able to infer that SS pairs don’t work as a only
marker for cancer progress but their cumulation might be disturb
cancer progress such as metastasis to distal metastasis and result in
patient’s well prognosis. Of course, patients having only one SS pair
have better survival status than those who don’t having SS pairs.
Ⅳ. Conclusion & Discussion
In this research, Candidate SS pairs are identify by only performing
survival analysis with cox proportional hazard model and also ensure
that the number of SS pairs is related to patient’s prognosis. Genes
consisting SS pairs are usually related to cell-cell interaction or
migration so might be action to prevent metastasis, then shows good
prognosis. Patients’ prognosis is a key clinical factor to apply
genomic characters with simple methods and public data. Despite of
simple scheme of study, this research has some limitations.
First of all, gene damaging score may be biased score, because it
does not care about gene’s length. If gene’s length is quite large, the
probability of harboring LoF mutations might be high. It makes large
gene deleterious. Secondly, because the core of method is survival
analysis, low mortality cancer is hard to analysis with this method.
- 12 -
The adequate death events are needed for this kind of methods.
Thirdly, the cancer type with low mutation burden like Leukemia is
difficult to apply this method because of grouping.
This kind of pre-screening of potential synthetic lethal like synthetic
survival using only genomic profiles is a promising approach for
improving the efficiency of two genes relations screening, and may
supplement the standard method in some cases. However, the
reliability the method should be validated by in-vitro or in-vivo
experiments.
- 13 -
Ⅴ. References
1. Vogelstein, B. and K.W. Kinzler, The multistep nature of cancer.
Trends Genet, 1993. 9(4): p. 138-41.
2. Mardis, E.R., A decade's perspective on DNA sequencing
technology. Nature, 2011. 470(7333): p. 198-203.
3. Wold, B. and R.M. Myers, Sequence census methods for functional
genomics. Nat Methods, 2008. 5(1): p. 19-21.
4. Cancer Genome Atlas Research, N., et al., The Cancer Genome
Atlas Pan-Cancer analysis project. Nat Genet, 2013. 45(10): p.
1113-20.
5. Joly, Y., et al., Data sharing in the post-genomic world: the
experience of the International Cancer Genome Consortium (ICGC)
Data Access Compliance Office (DACO). PLoS Comput Biol, 2012.
8(7): p. e1002549.
6. Weigel, M.T. and M. Dowsett, Current and emerging biomarkers in
breast cancer: prognosis and prediction. Endocr Relat Cancer, 2010.
17(4): p. R245-62.
7. Cancer Genome Atlas Research, N., Comprehensive genomic
characterization defines human glioblastoma genes and core pathways.
Nature, 2008. 455(7216): p. 1061-8.
- 14 -
8. Cancer Genome Atlas Research, N., Genomic and epigenomic
landscapes of adult de novo acute myeloid leukemia. N Engl J Med,
2013. 368(22): p. 2059-74.
9. Cancer Genome Atlas Research, N., Comprehensive molecular
profiling of lung adenocarcinoma. Nature, 2014. 511(7511): p. 543-50.
10. Cancer Genome Atlas Network. Electronic address, i.m.o. and N.
Cancer Genome Atlas, Genomic Classification of Cutaneous
Melanoma. Cell, 2015. 161(7): p. 1681-96.
11. Lawrence, M.S., et al., Mutational heterogeneity in cancer and the
search for new cancer-associated genes. Nature, 2013. 499(7457): p.
214-8.
12. Dees, N.D., et al., MuSiC: identifying mutational significance in
cancer genomes. Genome Res, 2012. 22(8): p. 1589-98.
13. Lucchesi, J.C., Synthetic lethality and semi-lethality among
functionally related mutants of Drosophila melanfgaster. Genetics,
1968. 59(1): p. 37-44.
14. Kaelin, W.G., Jr., The concept of synthetic lethality in the context
of anticancer therapy. Nat Rev Cancer, 2005. 5(9): p. 689-98.
15. Canaani, D., Methodological approaches in application of synthetic
lethality screening towards anticancer therapy. Br J Cancer, 2009.
100(8): p. 1213-8.
- 15 -
16. Bolton, K.L., et al., Association between BRCA1 and BRCA2
mutations and survival in women with invasive epithelial ovarian
cancer. JAMA, 2012. 307(4): p. 382-90.
17. Wang, X. and R. Simon, Identification of potential synthetic lethal
genes to p53 using a computational biology approach. BMC Med
Genomics, 2013. 6: p. 30.
18. Jerby-Arnon, L., et al., Predicting cancer-specific vulnerability via
data-driven detection of synthetic lethality. Cell, 2014. 158(5): p.
1199-209.
19. Gray, K.A., et al., Genenames.org: the HGNC resources in 2013.
Nucleic Acids Res, 2013. 41(Database issue): p. D545-52.
20. MacArthur, D.G., et al., A systematic survey of loss-of-function
variants in human protein-coding genes. Science, 2012. 335(6070): p.
823-8.
21. Ng, P.C. and S. Henikoff, Predicting deleterious amino acid
substitutions. Genome Res, 2001. 11(5): p. 863-74.
22. Ng, P.C. and S. Henikoff, SIFT: Predicting amino acid changes
that affect protein function. Nucleic Acids Res, 2003. 31(13): p. 3812-4.
23. Kumar, P., S. Henikoff, and P.C. Ng, Predicting the effects of
coding non-synonymous variants on protein function using the SIFT
algorithm. Nat Protoc, 2009. 4(7): p. 1073-81.
- 16 -
24. Cox, D.R., Regression Models and Life-Tables. Journal of the
Royal Statistical Society Series B-Statistical Methodology, 1972. 34(2):
p. 187-+.
25. Lai, Y.T., M. Hayashida, and T. Akutsu, Survival Analysis by
Penalized Regression and Matrix Factorization. Scientific World
Journal, 2013.
26. R Core Team, R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria.
URL http://www.R-project.org/ (2015)
27. R by Meinhard Ploner and Fortran by Georg Heinze, coxphf: Cox
regression with Firth’s penalized likelihood. R package ersion 1.11.
http://CRAN.R-project.org/package=coxphf (2015).
28. Cheng, Y., et al., Significance of E-cadherin, beta-catenin, and
vimentin expression as postoperative prognosis indicators in cervical
squamous cell carcinoma. Hum Pathol, 2012. 43(8): p. 1213-20.
29. Sharma, P., et al., CD8 tumor-infiltrating lymphocytes are
predictive of survival in muscle-invasive urothelial carcinoma. Proc
Natl Acad Sci U S A, 2007. 104(10): p. 3967-72.
30. Gansner, E.R. and S.C. North, An open graph visualization system
and its applications to software engineering. Software-Practice &
Experience, 2000. 30(11): p. 1203-1233.
- 17 -
31. Kandoth, C., et al., Mutational landscape and significance across 12
major cancer types. Nature, 2013. 502(7471): p. 333-+.
32. Jeanes, A., C.J. Gottardi, and A.S. Yap, Cadherins and cancer: how
does cadherin dysfunction promote tumor progression? Oncogene,
2008. 27(55): p. 6920-6929.
33. Nguyen, D.X. and J. Massague, Genetic determinants of cancer
metastasis. Nature Reviews Genetics, 2007. 8(5): p. 341-352.
34.Yoshida, B.A., et al., Metastasis-suppressor genes: a review and
perspective on an emerging field. Journal of the National Cancer
Institute, 2000. 92(21): p. 1717-1730.
- 18 -
Table 1. Data statistics of 20 cancer types.
- 19 -
Table 2. Clincial variables used in cox proportional hazard model for
each cancer type.
- 20 -
Table 3. Clinical variables statistics in Lung adenocarinoma.
- 21 -
Table 4. The number of candidates SS pair/triplet
- 22 -
Figure 1. Synthetic lethality concept
- 23 -
Figure 2. Overall workflow
- 24 -
Figure 3. Gene damaging score calculation flow
- 25 -
Figure 4. Gene damaging score distribution excluding score 1.
- 26 -
Figure 5. The distribution of patients by the number of SS pairs.
- 27 -
Figure 6. The distribution of genes consisting SS pairs.
- 28 -
Figure 7. GO enrichment for genes consisting SS pairs.
- 29 -
Figure 8. undirected graph with 436 Synthetic survival pairs
- 30 -
Figure 9. Kaplen meier plot by the number of SS pairs in
LUAD/SKCM.
- 31 -
Figure 10. The distribution of patients by SS pair frequency
- 32 -
요약(국문초록)
배 경: 암유전체학이 발전함에 따라, 많은 암유전체학 과제들이 시작되
었고 몇몇은 거의 완성단계에 이르렀다. 이러한 과제들로부터 대용량의
유전체, 전사체, 후성유전체의 자료들이 공개적으로 게재되었다. 이러한
공개 데이터는 암세포에서의 체세포 돌연변이에 대한 많은 연구를 가능
하게 하였고, 많은 연구자들은 암과 관련있는 유전자를 발견하였다. 하지
만 기존의 암유전체연구들은 제한점을 가지고 있다. 많은 연구들이 하나
의 암종 혹은 하나의 암을 일으키는 유전자에 제한되어있다. 이것은 암
의 발생과 발달 그리고 진행에 대한 연구를 하는데 있어서 제한점이 될
수가 있다. 이러한 문제를 풀기위해 합성치사라는 개념이 대두되었다. 합
성치사는 두 개의 유전자중 하나의 유전자에만 돌연변이가 있거나 둘다
없는 경우에는 세포가 살지만, 두 개의 유전자에 있는 돌연변이의 조합
은 세포를 죽게 만든다는 개념이다. 우리는 이 연구에서 이 합성치사의
발상을 기본으로 하여 합성생존이라는 개념을 제안한다. 합성생존이란
두 개의 유전자 중 한 개의 유전자에만 돌연변이가 있거나 둘다 없는 경
우는 그 환자의 생존이 나쁘지만, 두 개의 유전자에 있는 돌연변이의 조
합은 그 환자의 생존을 좋게 만든다는 개념이다. 이 논문의 목표는 다양
한 암종에서 합성생존과 합성생존하중을 확인하는 것이다.
방 법: 우리는 20개의 암종에서 9,184명의 임상정보와 6,936명의 체세포
돌연변이 자료를 암 유전체 지도책 과제에서 내려받았다. 모든 자료들은
제한없이 접근이 가능하다. 추후 분석을 위한 핵심 자료를 만들기 위해
분석에 포함되지 않는 자료들을 제하려고 몇몇의 기준을 적용하였다. 최
소한 하나의 비동의돌연변이가 있는 유전자들에 대해 유전자위험점수를
계산하였다. 유전자위험점수란 각각의 유전자가 돌연변이로 인해 얼마나
망가졌는지를 측정하는 점수이다. 유전자위험점수는 기능상실변이와
SIFT 점수로부터 계산되었다. 유전자위험점수에 따라 환자들은 네 그룹
으로 나뉘어졌다. 그 그룹들에 대해 콕스비례위험모형을 가지고 생존분
- 33 -
석을 진행하였다. 이 분석을 통해 잠재적인 합성생존유전자쌍이 발견되
었다.
결 과: 처음 데이터를 가공한 결과 우리는 4,884명의 환자에 대해 핵심
자료모음을 만들었다. 이 자료는 임상정보와 체세포돌연변이를 포함한다.
잠재적인 합성생존유전자쌍으로 436개의 유전자쌍이 5개의 암종에서 발
견되었다. 합성생존세쌍 분석에서 2,427개의 세쌍이 2개의 암종에서 발견
되었다. 그 2개의 암종은 폐암과 흑색종이다. 우리는 또한 합성생존하중
을 발견하였다. 합성생존하중이란 더 많은 합성생존유전자쌍 혹은 합성
생존유전자세쌍을 가진 경우가 더 적게 가진 경우보다 예후가 더 좋은
경우이다.
토 의: 이 논문에서는 잠재적 합성생존유전자쌍이 간단한 생존분석을 통
해서 나타났고 이 유전자쌍들이 많아질수록 환자의 예후와 관계가 있음
을 발견하였다. 합성생존유전자쌍을 이루는 대부분의 유전자들은 세포세
포반응이나 세포의 이동과 관련이 있는 것으로 나타났는데 이것은 암에
서 전이와 연관이 있을것으로 생각된다. 이 연구에서의 방법론적인 것은
간단한 데이터와 분석방법을 통해 실제적인 임상으로서의 적용이 가능하
다는 장점이 있다.
주요어: 암유전체학, The Cancer Genome atlas, 합성생
존, 생존분석, 합성생존 하중.
학 번: 2013-20406
- 34 -
저 시-비 리- 경 지 2.0 한민
는 아래 조건 르는 경 에 한하여 게
l 저 물 복제, 포, 전송, 전시, 공연 송할 수 습니다.
다 과 같 조건 라야 합니다:
l 하는, 저 물 나 포 경 , 저 물에 적 된 허락조건 명확하게 나타내어야 합니다.
l 저 터 허가를 면 러한 조건들 적 되지 않습니다.
저 에 른 리는 내 에 하여 향 지 않습니다.
것 허락규약(Legal Code) 해하 쉽게 약한 것 니다.
Disclaimer
저 시. 하는 원저 를 시하여야 합니다.
비 리. 하는 저 물 리 목적 할 수 없습니다.
경 지. 하는 저 물 개 , 형 또는 가공할 수 없습니다.
이학석사학위논문
Identification of Synthetic
Survival and Synthetic Survival
burden in multiple Cancer
암 유전체 자료에서 합성생존 체세포 돌연변이
쌍의 검출과 분석
2015년 8월
서울대학교 대학원
자연과학부 협동과정 생물정보학 전공
임 영 균
Identification of Synthetic
Survival and Synthetic Survival
burden in multiple Cancer
암 유전체 자료에서 합성생존 체세포 돌연변이
쌍의 검출과 분석
지도교수 김 주 한
이 논문을 이학석사학위논문으로 제출함
2015년 7월
서울대학교 대학원
자연과학부 생물정보학 전공
임 영 균
임 영 균의 석사학위논문를 인준함
2015년 6월
위 원 장 김 선 (인)
부위원장 김 주 한 (인)
위 원 윤 성 로 (인)
Abstract
Identification of Synthetic
Survival and Synthetic Survival
burden in multiple Cancer
Younggyun, Lim
Bioinformatics
The graduate school
Seoul National University
Background: According to cancer genomics become prevalent, large
cancer genome projects are launched and almost finished. From these
project, a huge amount of genomic, transcriptomic, epigenomic data is
released in public. This public data have provided quite some studies
of somatic mutations in cancer cell, then many researchers have
identified several cancer-related genes. However, some existing cancer
genomics studies have a limitations. Many studies are restricted to
single cancer type or single cancer driver gene. It can limit study’s
scope to identify how cancer occur, develop and progress. To solve
this problem, a concept of synthetic lethality is arisen. Synthetic
lethality is a concept that a combination of mutations in two genes
lead to cell death, while a mutation in an only one of these genes
remains cell survival. We propose a concept synthetic survival based
on the idea of synthetic lethality. Synthetic Survival (SS) is a
concept that a combinations of mutations in two gene lead to patients
survival, whereas a mutation in only one gene lead to patients bad
prognosis. The goal of this dissertation is to identify synthetic
survival and synthetic survival burden in multiple cancers.
Method: We downloaded 9,184 patients’ clinical information and 6,936
patients’ somatic mutation data in 20 cancer types from The Cancer
Genome Atlas. all of data is open-accessed data. To make core data
set for further analysis, several criteria is applied to filter out data
which is not able to analysis. All genes harboring at least one
non-synonymous mutations are annotated to gene damaging score.
Gene damaging score is a measure to quantify each gene’s
deleteriousness. Gene damaging score is derived from both scores in
Sorting Intolerant From Tolerant and classification of LoF mutations.
By gene damaging score, patients are stratified to four groups.
Survival analysis are conducted for all gene pairs’ group with cox
proportional hazard model with penalized likelihood. From the cox
model, candidates of potential synthetic survival pairs are decided.
Result: The result from filtering raw data, we build the core data set
for 4,844 patients including clinical information and somatic mutation
profiles. For candidates of potential synthetic survival pairs, 436 gene
pairs are identified in 5 cancer types. We also identified synthetic
survival burden which means that the group which have more SS
pairs and triplets have higher survival rate than the group which
have less SS pairs
Discussion: In this dissertation, Candidate SS pairs are identify by
only performing survival analysis with cox proportional hazard model
and also ensure that the number of SS pairs is related to patient’s
prognosis. Genes consisting SS pairs are usually related to cell-cell
interaction or migration so might be action to prevent metastasis,
then shows good prognosis. This study might be approach to reduce
cost for screening and to apply practical clinical uses.
Keywords: Cancer genomics, The Cancer Genome
Atlas, Synthetic survival, Survival analysis, Synthetic
survival burden
Student Number: 2013-20406
Contents
Ⅰ. Introduction.....................................................................................1
1.1 Cancer genomics.......................................................................1
1.2 Existing studies in cancer genomics.....................................2
1.3 Synthetic lethality and Synthetic Survival (SS).................2
1.4 Research goal............................................................................3
1.5 Related studies..........................................................................4
Ⅱ. Methods & Materials....................................................................5
2.1 Data..............................................................................................5
2.2 Data Processing(Filtering)........................................................5
2.3 Gene Damaging Score...............................................................6
2.4 Cox proportional hazard model with penalized likelihood....7
2.5 Synthetic Surival burden.......................................................8
Ⅲ. Result..............................................................................................9
3.1 TCGA core data set.................................................................9
3.2 Gene damaging score distribution..........................................9
3.3 Candidate Synthetic Survival pairs.....................................10
3.4 Synthetic Survival burden...................................................11
Ⅳ. Conclusion & Discussion............................................................11
Ⅴ. References.....................................................................................13
국문초록...............................................................................................34
- 1 -
Ⅰ. Introduction
1.1 Cancer genomics
Cancer is genetic disease. The transition from a normal cell to a
malignant cancer cell is caused by changes to a cell’s DNA called
mutations [1]. Some of those mutations might by inherit, but most of
cancer causing mutations is acquired throughout life also known as
somatic mutations. Recently, whole genome sequencing and whole
exome sequencing technologies have provided quite several studies of
somatic mutations in variable cancer and have made researchers
identify multiple cancer-related driver genes [2, 3]. With this trend,
large cancer genome project such as The Cancer Genome Atlas
(TCGA), International Cancer Genome Consortium (ICGC) are
launched and some of those are almost finished [4, 5]. TCGA is a
comprehensive and coordinated to accelerate understanding of the
molecular basis of cancer through the multi-omics technologies and
TCGA released multi-omics data and clinical data in public.
One major application for cancer patients with primary tumor is
accurate prognosis, which helps divided patients into specific risk
groups and decide both treatment and observation patient status.
Traditionally, prognosis is based on only common clinical variables
such as ages, pathologic tumor stages. Recently, molecular variables
such as genetic alterations or amplifications are incorporated to cancer
patient prognosis. Representatively, ER, PR, HER2 protein levels are
significant biomarker in breast cancer, which is really applicable to
practical treatment [6].
- 2 -
1.2 Existing studies in cancer genomics
Lately, typical cancer genomics studies and papers are carried out by
TCGA. TCGA published genomic, transcriptomic and epigenetic
landscape profile of each cancer types for about 30 types [7, 8, 9, 10].
Those studies include finding cancer driver mutations, molecular
classification of cancer and heterogeneity of cancer and so on. To
find cancer driver mutations, a number of algorithms and tools are
developed [11, 12]. Those tools use statistical methods, so detect
significant mutation based on background mutations rate. In spite of
these developments, most of researches are focused single driver
gene. By convention, studies associated with prognosis also have
been limited single gene and single cancer lineage. Many cancers
related genes identified from these researches are not directly
‘druggable’ or are not applying predicting prognosis. A huge number
of studies are enacted, but practical clinical application is still difficult
and sure prognosis factor is not identified.
1.3 Synthetic lethality and Synthetic Survival
(SS)
Several decades ago, the concept of synthetic lethality is arisen [13].
Synthetic lethality is a concept that a combination of mutations in
two genes lead to cell death, while a mutation in an only one of
these genes remains cell survival [13, 14]. (Figure 1)
- 3 -
In this dissertation, we propose a concept of synthetic survival based
on the idea of synthetic lethality. Synthetic survival (SS) is a concept
that a combinations of mutations in two gene lead to patients
survival, whereas a mutation in only one gene lead to patients bad
prognosis. Synthetic Survival can analyze gene interactions in cancer,
then infer how those gene interaction affect to patients’ prognosis.
Most of synthetic lethality is identified with experimental method
such as RNAi screening [15]. However, an experimental method such
as RNAi screening is cost, time and labor-intensive. For this reason,
synthetic lethal screening is not able to conduct with high throughput
method. However, synthetic survival screening is able to conduct
with high throughput way if sufficient data exist. For synthetic
survival analysis, only patients’ clinical information such as survival
status and time and somatic mutation profile are needed. Although
experimental validations might be needed after computational high
throughput analysis, computational screening can save cost, time and
human labor.
1.4 Research goal
The goal of this dissertation is to identify synthetic survival and
synthetic survival burden in multiple cancers. Gene pairs associating
with synthetic survival might be potentially used as prognostic factor
in specific cancer type or across cancer types. In this dissertation, we
propose simple methods for identifying synthetic survival with
survival analysis. We conduct survival analysis with cox proportional
hazard model for all possible gene pairs and identify candidates of
- 4 -
synthetic survival. We also identify synthetic survival burden, which
is accumulation of synthetic survival gene pairs affect patient’s
prognosis.
1.5 Related studies
In 2011, a research to predict patients’ survival in ovarian cancer
with molecular profile is introduced [16]. In this study, patients are
stratified into groups by BRCA1 and BRCA2 mutation status. This
study is initial approach to predict prognosis of cancer patients’ with
not only clinical variables but also molecular profiles.
Recently, computational methods to find synthetic lethal pairs are
introduced. Xiaosheng Wang et al tried to use only expression data to
find synthetic lethal relationship with TP53 gene [17]. Although this
is study is not associated with prognosis and gene sets are limited to
only kinases, it has a meaning that using only genomic data with
computational approach.
Livnat Jerby-Arnon et al. find Synthetic lethal pairs with only
data-driven method such as Genomic survival of fittest, functional
examination and coexpression data [18]. In this research, a huge
amount of data is used to find potential candidate of synthetic lethal
gene pairs, and researchers are using not only variable genomic data
also clinical variables.
Here, we analysis synthetic survival with somatic mutation data and
clinical data in 20 cancer type, then identify potential SS gene
relationship with only simple survival analysis.
- 5 -
Ⅱ. Methods & Materials
Overall workflow is divided four part, which is data processing, gene
damaging score, survival analysis and interpretation.(Figure 2)
2.1 Data
For analysis, Data downloaded from TCGA data portal is used. Data
is downloaded on 4 Mar 2015. It includes somatic mutation level 2
data (Whole exome sequencing data) of 5,618 patients and clinical
level 2 data of 6,838 patients. Somatic mutation level 2 data in TCGA
is formatted as mutation annotation format (maf). Position and variant
classification is used for analysis among annotated information.
Variants are classified as ‘Missense mutation’, ‘Nonsense mutation’,
‘Frameshift indel’, ‘In frame indel’, ‘splice site mutation;, Silent
mutation’, ‘Intron’, ‘UTR’ and ‘Intergenic’. Among these, only
non-synonymous mutations are used for analysis. Clinical level 2 data
includes a number of clinical variables according to cancer types, and
clinical variables for cox proportional hazard model are reviewed by
professional pathologist (Table 1).
2.2 Data Processing (Filtering)
First of all, for clinical data, patients who do not have information for
cox proportional hazard model are excluded. Next, patients who is
relevant to strong confounding factor of patients’ prognosis such as
- 6 -
the history of other malignancy, the history of neo adjuvant
treatment, radiation adjuvant treatment, pharmaco adjuvant treatment,
ablation adjuvant and metastasis to distant organs are excluded. At
last, patients who do not have mutation information are excluded. For
mutations data, to begin with, synonymous mutations are excluded.
Secondly, mutations in ‘Unknown’ gene not annotated with official
gene symbol from HGNC [19] are excluded. Finally, mutations of
patients who do not have clinical information are excluded. Then
4,844 patients are remained for further analysis.(Figure 2, Table 2)
2.3 Gene Damaging Score
To quantify deleteriousness of a gene, gene damaging score is
defined. Gene damaging score is calculated by considering the number
and classification of mutations in the gene. The scale of gene
damaging score is zero to one. The smaller gene damaging score
means the more deleterious gene. If the gene harbors Lof mutation
which contains nonsense mutation, frameshift insertion and deletion,
nonstop mutation and splice site mutation, the gene damaging score
of the gene is zero.[20] If not, the gene damaging score of the gene
is the geometric mean of SIFT score[21, 22, 23] of mutations in the
gene. To avoid the situation dividing by zero, if SIFT score is zero,
it substituted 10-e8. In all cancer, most of gene damaging score is
one, and then zero is second large portion. (Figure 3, Figure 4)
- 7 -
2.4 Cox proportional hazard model with penalized
likelihood
Cox proportional hazard model [24] is used for survival analysis. Cox
proportional hazard model is able to adjust confounding clinical
variables. To identify prognosis effect of gene pair’ mutation status,
patients are divided into four groups by every gene pairs; both genes’
damaging score below 0.3, only one gene’s damaging score below 0.3,
another gene’s damaging score below 0.3 and both genes’ damaging
score above 0.3. Because of this kind of grouping, which is,
classification according two genes’ damaging score, the group whose
both gene damaging score is below 0.3 is very small and sometimes
has no death event. So, cox proportional hazard model with maximum
likelihood, which is generally used for cox regression, is not adequate.
Cox proportional hazard model with penalized likelihood [25] can solve
this convergence problem. Survival analysis is conducted with R
(3.2.0)[26] and ‘coxphf’ packages [27].
For every cancer types, clinical features are added to cox proportional
hazard model to adjust confounding effects. Common clinical features
like age, gender are added and cancer specific clinical features
reviewed by professional pathologist and prior studies are also added
to adjust confounding effects.[28, 29] (Table 3)
Potential SS pairs are determined by p-value of cox proportional
hazard model, p-values of mutational status and hazard ratio. Criteria
are p-value of cox model below 0.05 and p-values of mutational
status below 0.05 and hazard ratio above 1.
- 8 -
2.5 Synthetic survival triplet analysis and
synthetic survival burden
A undirected graph showing relation among the significant pairs is
visualized with Graphviz(2.36) function ‘neato’.[30] To identify
prognosis effect according to the number of SS pairs, we conduct
survival analysis with grouping by the number of SS pairs of
patients. Patients distribution by the number of SS pairs are very
skewed (Figure 5), so all groups are merge to four or five group.
(Figure 6)
- 9 -
Ⅲ. Result
3.1 TCGA core data set
The result from filtering raw data is 4,844 patients’ clinical
information and DNA somatic mutation data in 20 cancer types. This
data set include both data types and all of clinical variables needed
for cox proportional hazard model in each cancer types. In this
dissertation, this data set is named core data set, and further analysis
is conducted with this core data set. (Table 1)
3.2 Gene damaging score distribution
All genes that harbors at least one non-synonymous mutation in each
cancer types are annotated by gene damaging score. If genes have no
non-synonymous mutations, their gene damaging scores are
designated to 1. For gene damaging score of all patients, we make a
matrix of which row is gene and column is patient. Because of
scarcity of somatic mutation event, the matrix is very sparse which
means that most of gene damaging score in the matrix is 1. Except
1, 0 is high proportion of gene score. (Figure 4) To dichotomize gene
score to binary factor, threshold 0.3 is applied. It provides well
grouping for gene deleteriousness (Figure 4, black line)
- 10 -
3.3 Candidate Synthetic Survival pairs
The result from performing survival analysis for twenty cancer types
is 436 candidates of SS pairs in five cancer types with criteria, which
is p-value below 0.05, and 45 candidates of SS pairs in three cancer
types with criteria which is p-value below 0.01. (Table 4) Most of
results are skewed to particular cancer types such as Lung
adenocarcinoma, Skin cutaneous melanoma. Both cancer types have
high somatic mutation frequency [31], and somehow high mortality.
436 pairs are consisted of 281 genes, and some genes like XIRP2,
RYR3 are showing high frequency in pairs. (Figure 6) Most of
highly frequently appeared genes are related to cadherin genes.
Cadherin genes are closely involved cell surface, so also involved
cell-cell interactions [32]. This result infers that these frequently
appeared genes are probably connected to cancer cell metastasis
mechanism.
GO analysis is conducted with genes which consist SS pairs. Highly
enriched terms are mostly related to cell-cell interaction such as cell
adhesion or extracellular matrix part (Figure 7). This result also
infers that synthetic survival gene pairs have a role for cell-cell
interaction, then it might be associated with cancer cell metastasis
[33,34].
Most of patients have no SS pairs because of scarcity, and the
number of patients is decreased by increasing SS pairs (Figure 10).
- 11 -
3.4 Synthetic Survival burden
In synthetic survival burden test, the result to identify prognostic
effects by increasing SS pairs shows that the more SS pairs or
affect better patients’ survival. (Figure 9, Figure 10) The group,
which have more SS pairs and triplets, have higher survival rate
than the group, which have less SS pairs and triplets. From this
result, we are able to infer that SS pairs don’t work as a only
marker for cancer progress but their cumulation might be disturb
cancer progress such as metastasis to distal metastasis and result in
patient’s well prognosis. Of course, patients having only one SS pair
have better survival status than those who don’t having SS pairs.
Ⅳ. Conclusion & Discussion
In this research, Candidate SS pairs are identify by only performing
survival analysis with cox proportional hazard model and also ensure
that the number of SS pairs is related to patient’s prognosis. Genes
consisting SS pairs are usually related to cell-cell interaction or
migration so might be action to prevent metastasis, then shows good
prognosis. Patients’ prognosis is a key clinical factor to apply
genomic characters with simple methods and public data. Despite of
simple scheme of study, this research has some limitations.
First of all, gene damaging score may be biased score, because it
does not care about gene’s length. If gene’s length is quite large, the
probability of harboring LoF mutations might be high. It makes large
gene deleterious. Secondly, because the core of method is survival
analysis, low mortality cancer is hard to analysis with this method.
- 12 -
The adequate death events are needed for this kind of methods.
Thirdly, the cancer type with low mutation burden like Leukemia is
difficult to apply this method because of grouping.
This kind of pre-screening of potential synthetic lethal like synthetic
survival using only genomic profiles is a promising approach for
improving the efficiency of two genes relations screening, and may
supplement the standard method in some cases. However, the
reliability the method should be validated by in-vitro or in-vivo
experiments.
- 13 -
Ⅴ. References
1. Vogelstein, B. and K.W. Kinzler, The multistep nature of cancer.
Trends Genet, 1993. 9(4): p. 138-41.
2. Mardis, E.R., A decade's perspective on DNA sequencing
technology. Nature, 2011. 470(7333): p. 198-203.
3. Wold, B. and R.M. Myers, Sequence census methods for functional
genomics. Nat Methods, 2008. 5(1): p. 19-21.
4. Cancer Genome Atlas Research, N., et al., The Cancer Genome
Atlas Pan-Cancer analysis project. Nat Genet, 2013. 45(10): p.
1113-20.
5. Joly, Y., et al., Data sharing in the post-genomic world: the
experience of the International Cancer Genome Consortium (ICGC)
Data Access Compliance Office (DACO). PLoS Comput Biol, 2012.
8(7): p. e1002549.
6. Weigel, M.T. and M. Dowsett, Current and emerging biomarkers in
breast cancer: prognosis and prediction. Endocr Relat Cancer, 2010.
17(4): p. R245-62.
7. Cancer Genome Atlas Research, N., Comprehensive genomic
characterization defines human glioblastoma genes and core pathways.
Nature, 2008. 455(7216): p. 1061-8.
- 14 -
8. Cancer Genome Atlas Research, N., Genomic and epigenomic
landscapes of adult de novo acute myeloid leukemia. N Engl J Med,
2013. 368(22): p. 2059-74.
9. Cancer Genome Atlas Research, N., Comprehensive molecular
profiling of lung adenocarcinoma. Nature, 2014. 511(7511): p. 543-50.
10. Cancer Genome Atlas Network. Electronic address, i.m.o. and N.
Cancer Genome Atlas, Genomic Classification of Cutaneous
Melanoma. Cell, 2015. 161(7): p. 1681-96.
11. Lawrence, M.S., et al., Mutational heterogeneity in cancer and the
search for new cancer-associated genes. Nature, 2013. 499(7457): p.
214-8.
12. Dees, N.D., et al., MuSiC: identifying mutational significance in
cancer genomes. Genome Res, 2012. 22(8): p. 1589-98.
13. Lucchesi, J.C., Synthetic lethality and semi-lethality among
functionally related mutants of Drosophila melanfgaster. Genetics,
1968. 59(1): p. 37-44.
14. Kaelin, W.G., Jr., The concept of synthetic lethality in the context
of anticancer therapy. Nat Rev Cancer, 2005. 5(9): p. 689-98.
15. Canaani, D., Methodological approaches in application of synthetic
lethality screening towards anticancer therapy. Br J Cancer, 2009.
100(8): p. 1213-8.
- 15 -
16. Bolton, K.L., et al., Association between BRCA1 and BRCA2
mutations and survival in women with invasive epithelial ovarian
cancer. JAMA, 2012. 307(4): p. 382-90.
17. Wang, X. and R. Simon, Identification of potential synthetic lethal
genes to p53 using a computational biology approach. BMC Med
Genomics, 2013. 6: p. 30.
18. Jerby-Arnon, L., et al., Predicting cancer-specific vulnerability via
data-driven detection of synthetic lethality. Cell, 2014. 158(5): p.
1199-209.
19. Gray, K.A., et al., Genenames.org: the HGNC resources in 2013.
Nucleic Acids Res, 2013. 41(Database issue): p. D545-52.
20. MacArthur, D.G., et al., A systematic survey of loss-of-function
variants in human protein-coding genes. Science, 2012. 335(6070): p.
823-8.
21. Ng, P.C. and S. Henikoff, Predicting deleterious amino acid
substitutions. Genome Res, 2001. 11(5): p. 863-74.
22. Ng, P.C. and S. Henikoff, SIFT: Predicting amino acid changes
that affect protein function. Nucleic Acids Res, 2003. 31(13): p. 3812-4.
23. Kumar, P., S. Henikoff, and P.C. Ng, Predicting the effects of
coding non-synonymous variants on protein function using the SIFT
algorithm. Nat Protoc, 2009. 4(7): p. 1073-81.
- 16 -
24. Cox, D.R., Regression Models and Life-Tables. Journal of the
Royal Statistical Society Series B-Statistical Methodology, 1972. 34(2):
p. 187-+.
25. Lai, Y.T., M. Hayashida, and T. Akutsu, Survival Analysis by
Penalized Regression and Matrix Factorization. Scientific World
Journal, 2013.
26. R Core Team, R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria.
URL http://www.R-project.org/ (2015)
27. R by Meinhard Ploner and Fortran by Georg Heinze, coxphf: Cox
regression with Firth’s penalized likelihood. R package ersion 1.11.
http://CRAN.R-project.org/package=coxphf (2015).
28. Cheng, Y., et al., Significance of E-cadherin, beta-catenin, and
vimentin expression as postoperative prognosis indicators in cervical
squamous cell carcinoma. Hum Pathol, 2012. 43(8): p. 1213-20.
29. Sharma, P., et al., CD8 tumor-infiltrating lymphocytes are
predictive of survival in muscle-invasive urothelial carcinoma. Proc
Natl Acad Sci U S A, 2007. 104(10): p. 3967-72.
30. Gansner, E.R. and S.C. North, An open graph visualization system
and its applications to software engineering. Software-Practice &
Experience, 2000. 30(11): p. 1203-1233.
- 17 -
31. Kandoth, C., et al., Mutational landscape and significance across 12
major cancer types. Nature, 2013. 502(7471): p. 333-+.
32. Jeanes, A., C.J. Gottardi, and A.S. Yap, Cadherins and cancer: how
does cadherin dysfunction promote tumor progression? Oncogene,
2008. 27(55): p. 6920-6929.
33. Nguyen, D.X. and J. Massague, Genetic determinants of cancer
metastasis. Nature Reviews Genetics, 2007. 8(5): p. 341-352.
34.Yoshida, B.A., et al., Metastasis-suppressor genes: a review and
perspective on an emerging field. Journal of the National Cancer
Institute, 2000. 92(21): p. 1717-1730.
- 18 -
Table 1. Data statistics of 20 cancer types.
- 19 -
Table 2. Clincial variables used in cox proportional hazard model for
each cancer type.
- 20 -
Table 3. Clinical variables statistics in Lung adenocarinoma.
- 21 -
Table 4. The number of candidates SS pair/triplet
- 22 -
Figure 1. Synthetic lethality concept
- 23 -
Figure 2. Overall workflow
- 24 -
Figure 3. Gene damaging score calculation flow
- 25 -
Figure 4. Gene damaging score distribution excluding score 1.
- 26 -
Figure 5. The distribution of patients by the number of SS pairs.
- 27 -
Figure 6. The distribution of genes consisting SS pairs.
- 28 -
Figure 7. GO enrichment for genes consisting SS pairs.
- 29 -
Figure 8. undirected graph with 436 Synthetic survival pairs
- 30 -
Figure 9. Kaplen meier plot by the number of SS pairs in
LUAD/SKCM.
- 31 -
Figure 10. The distribution of patients by SS pair frequency
- 32 -
요약(국문초록)
배 경: 암유전체학이 발전함에 따라, 많은 암유전체학 과제들이 시작되
었고 몇몇은 거의 완성단계에 이르렀다. 이러한 과제들로부터 대용량의
유전체, 전사체, 후성유전체의 자료들이 공개적으로 게재되었다. 이러한
공개 데이터는 암세포에서의 체세포 돌연변이에 대한 많은 연구를 가능
하게 하였고, 많은 연구자들은 암과 관련있는 유전자를 발견하였다. 하지
만 기존의 암유전체연구들은 제한점을 가지고 있다. 많은 연구들이 하나
의 암종 혹은 하나의 암을 일으키는 유전자에 제한되어있다. 이것은 암
의 발생과 발달 그리고 진행에 대한 연구를 하는데 있어서 제한점이 될
수가 있다. 이러한 문제를 풀기위해 합성치사라는 개념이 대두되었다. 합
성치사는 두 개의 유전자중 하나의 유전자에만 돌연변이가 있거나 둘다
없는 경우에는 세포가 살지만, 두 개의 유전자에 있는 돌연변이의 조합
은 세포를 죽게 만든다는 개념이다. 우리는 이 연구에서 이 합성치사의
발상을 기본으로 하여 합성생존이라는 개념을 제안한다. 합성생존이란
두 개의 유전자 중 한 개의 유전자에만 돌연변이가 있거나 둘다 없는 경
우는 그 환자의 생존이 나쁘지만, 두 개의 유전자에 있는 돌연변이의 조
합은 그 환자의 생존을 좋게 만든다는 개념이다. 이 논문의 목표는 다양
한 암종에서 합성생존과 합성생존하중을 확인하는 것이다.
방 법: 우리는 20개의 암종에서 9,184명의 임상정보와 6,936명의 체세포
돌연변이 자료를 암 유전체 지도책 과제에서 내려받았다. 모든 자료들은
제한없이 접근이 가능하다. 추후 분석을 위한 핵심 자료를 만들기 위해
분석에 포함되지 않는 자료들을 제하려고 몇몇의 기준을 적용하였다. 최
소한 하나의 비동의돌연변이가 있는 유전자들에 대해 유전자위험점수를
계산하였다. 유전자위험점수란 각각의 유전자가 돌연변이로 인해 얼마나
망가졌는지를 측정하는 점수이다. 유전자위험점수는 기능상실변이와
SIFT 점수로부터 계산되었다. 유전자위험점수에 따라 환자들은 네 그룹
으로 나뉘어졌다. 그 그룹들에 대해 콕스비례위험모형을 가지고 생존분
- 33 -
석을 진행하였다. 이 분석을 통해 잠재적인 합성생존유전자쌍이 발견되
었다.
결 과: 처음 데이터를 가공한 결과 우리는 4,884명의 환자에 대해 핵심
자료모음을 만들었다. 이 자료는 임상정보와 체세포돌연변이를 포함한다.
잠재적인 합성생존유전자쌍으로 436개의 유전자쌍이 5개의 암종에서 발
견되었다. 합성생존세쌍 분석에서 2,427개의 세쌍이 2개의 암종에서 발견
되었다. 그 2개의 암종은 폐암과 흑색종이다. 우리는 또한 합성생존하중
을 발견하였다. 합성생존하중이란 더 많은 합성생존유전자쌍 혹은 합성
생존유전자세쌍을 가진 경우가 더 적게 가진 경우보다 예후가 더 좋은
경우이다.
토 의: 이 논문에서는 잠재적 합성생존유전자쌍이 간단한 생존분석을 통
해서 나타났고 이 유전자쌍들이 많아질수록 환자의 예후와 관계가 있음
을 발견하였다. 합성생존유전자쌍을 이루는 대부분의 유전자들은 세포세
포반응이나 세포의 이동과 관련이 있는 것으로 나타났는데 이것은 암에
서 전이와 연관이 있을것으로 생각된다. 이 연구에서의 방법론적인 것은
간단한 데이터와 분석방법을 통해 실제적인 임상으로서의 적용이 가능하
다는 장점이 있다.
주요어: 암유전체학, The Cancer Genome atlas, 합성생
존, 생존분석, 합성생존 하중.
학 번: 2013-20406
- 34 -