Disclaimer - Seoul National Universitys-space.snu.ac.kr/bitstream/10371/131178/1/000000056933.pdf저작자표시-비영리-변경금지 2.0 대한민국 이용자는 아래의 조건을

저 시-비 리- 경 지 2.0 한민

는 아래 조건 르는 경 에 한하여 게

l 저 물 복제, 포, 전송, 전시, 공연 송할 수 습니다.

다 과 같 조건 라야 합니다:

l 하는, 저 물 나 포 경 , 저 물에 적 된 허락조건 명확하게 나타내어야 합니다.

l 저 터 허가를 면 러한 조건들 적 되지 않습니다.

저 에 른 리는 내 에 하여 향 지 않습니다.

것 허락규약(Legal Code) 해하 쉽게 약한 것 니다.

Disclaimer

저 시. 하는 원저 를 시하여야 합니다.

비 리. 하는 저 물 리 목적 할 수 없습니다.

경 지. 하는 저 물 개 , 형 또는 가공할 수 없습니다.

http://creativecommons.org/licenses/by-nc-nd/2.0/kr/legalcode

http://creativecommons.org/licenses/by-nc-nd/2.0/kr/

이학석사학위논문

Identification of Synthetic

Survival and Synthetic Survival

burden in multiple Cancer

암 유전체 자료에서 합성생존 체세포 돌연변이

쌍의 검출과 분석

2015년 8월

서울대학교 대학원

자연과학부 협동과정 생물정보학 전공

임 영 균






지도교수 김 주 한

이 논문을 이학석사학위논문으로 제출함

2015년 7월


자연과학부 생물정보학 전공

임 영 균

임 영 균의 석사학위논문를 인준함

2015년 6월

위 원 장 김 선 (인)

부위원장 김 주 한 (인)

위 원 윤 성 로 (인)

Abstract




Younggyun, Lim

Bioinformatics

The graduate school

Seoul National University

Background: According to cancer genomics become prevalent, large

cancer genome projects are launched and almost finished. From these

project, a huge amount of genomic, transcriptomic, epigenomic data is

released in public. This public data have provided quite some studies

of somatic mutations in cancer cell, then many researchers have

identified several cancer-related genes. However, some existing cancer

genomics studies have a limitations. Many studies are restricted to

single cancer type or single cancer driver gene. It can limit study’s

scope to identify how cancer occur, develop and progress. To solve

this problem, a concept of synthetic lethality is arisen. Synthetic

lethality is a concept that a combination of mutations in two genes

lead to cell death, while a mutation in an only one of these genes

remains cell survival. We propose a concept synthetic survival based

on the idea of synthetic lethality. Synthetic Survival (SS) is a

concept that a combinations of mutations in two gene lead to patients

survival, whereas a mutation in only one gene lead to patients bad

prognosis. The goal of this dissertation is to identify synthetic

survival and synthetic survival burden in multiple cancers.

Method: We downloaded 9,184 patients’ clinical information and 6,936

patients’ somatic mutation data in 20 cancer types from The Cancer

Genome Atlas. all of data is open-accessed data. To make core data

set for further analysis, several criteria is applied to filter out data

which is not able to analysis. All genes harboring at least one

non-synonymous mutations are annotated to gene damaging score.

Gene damaging score is a measure to quantify each gene’s

deleteriousness. Gene damaging score is derived from both scores in

Sorting Intolerant From Tolerant and classification of LoF mutations.

By gene damaging score, patients are stratified to four groups.

Survival analysis are conducted for all gene pairs’ group with cox

proportional hazard model with penalized likelihood. From the cox

model, candidates of potential synthetic survival pairs are decided.

Result: The result from filtering raw data, we build the core data set

for 4,844 patients including clinical information and somatic mutation

profiles. For candidates of potential synthetic survival pairs, 436 gene

pairs are identified in 5 cancer types. We also identified synthetic

survival burden which means that the group which have more SS

pairs and triplets have higher survival rate than the group which

have less SS pairs

Discussion: In this dissertation, Candidate SS pairs are identify by

only performing survival analysis with cox proportional hazard model

and also ensure that the number of SS pairs is related to patient’s

prognosis. Genes consisting SS pairs are usually related to cell-cell

interaction or migration so might be action to prevent metastasis,

then shows good prognosis. This study might be approach to reduce

cost for screening and to apply practical clinical uses.

Keywords: Cancer genomics, The Cancer Genome

Atlas, Synthetic survival, Survival analysis, Synthetic

survival burden

Student Number: 2013-20406

Contents

Ⅰ. Introduction.....................................................................................1

1.1 Cancer genomics.......................................................................1

1.2 Existing studies in cancer genomics.....................................2

1.3 Synthetic lethality and Synthetic Survival (SS).................2

1.4 Research goal............................................................................3

1.5 Related studies..........................................................................4

Ⅱ. Methods & Materials....................................................................5

2.1 Data..............................................................................................5

2.2 Data Processing(Filtering)........................................................5

2.3 Gene Damaging Score...............................................................6

2.4 Cox proportional hazard model with penalized likelihood....7

2.5 Synthetic Surival burden.......................................................8

Ⅲ. Result..............................................................................................9

3.1 TCGA core data set.................................................................9

3.2 Gene damaging score distribution..........................................9

3.3 Candidate Synthetic Survival pairs.....................................10

3.4 Synthetic Survival burden...................................................11

Ⅳ. Conclusion & Discussion............................................................11

Ⅴ. References.....................................................................................13

국문초록...............................................................................................34

- 1 -

Ⅰ. Introduction

1.1 Cancer genomics

Cancer is genetic disease. The transition from a normal cell to a

malignant cancer cell is caused by changes to a cell’s DNA called

mutations [1]. Some of those mutations might by inherit, but most of

cancer causing mutations is acquired throughout life also known as

somatic mutations. Recently, whole genome sequencing and whole

exome sequencing technologies have provided quite several studies of

somatic mutations in variable cancer and have made researchers

identify multiple cancer-related driver genes [2, 3]. With this trend,

large cancer genome project such as The Cancer Genome Atlas

(TCGA), International Cancer Genome Consortium (ICGC) are

launched and some of those are almost finished [4, 5]. TCGA is a

comprehensive and coordinated to accelerate understanding of the

molecular basis of cancer through the multi-omics technologies and

TCGA released multi-omics data and clinical data in public.

One major application for cancer patients with primary tumor is

accurate prognosis, which helps divided patients into specific risk

groups and decide both treatment and observation patient status.

Traditionally, prognosis is based on only common clinical variables

such as ages, pathologic tumor stages. Recently, molecular variables

such as genetic alterations or amplifications are incorporated to cancer

patient prognosis. Representatively, ER, PR, HER2 protein levels are

significant biomarker in breast cancer, which is really applicable to

practical treatment [6].

- 2 -

1.2 Existing studies in cancer genomics

Lately, typical cancer genomics studies and papers are carried out by

TCGA. TCGA published genomic, transcriptomic and epigenetic

landscape profile of each cancer types for about 30 types [7, 8, 9, 10].

Those studies include finding cancer driver mutations, molecular

classification of cancer and heterogeneity of cancer and so on. To

find cancer driver mutations, a number of algorithms and tools are

developed [11, 12]. Those tools use statistical methods, so detect

significant mutation based on background mutations rate. In spite of

these developments, most of researches are focused single driver

gene. By convention, studies associated with prognosis also have

been limited single gene and single cancer lineage. Many cancers

related genes identified from these researches are not directly

‘druggable’ or are not applying predicting prognosis. A huge number

of studies are enacted, but practical clinical application is still difficult

and sure prognosis factor is not identified.

1.3 Synthetic lethality and Synthetic Survival

(SS)

Several decades ago, the concept of synthetic lethality is arisen [13].

Synthetic lethality is a concept that a combination of mutations in

two genes lead to cell death, while a mutation in an only one of

these genes remains cell survival [13, 14]. (Figure 1)

- 3 -

In this dissertation, we propose a concept of synthetic survival based

on the idea of synthetic lethality. Synthetic survival (SS) is a concept

that a combinations of mutations in two gene lead to patients


prognosis. Synthetic Survival can analyze gene interactions in cancer,

then infer how those gene interaction affect to patients’ prognosis.

Most of synthetic lethality is identified with experimental method

such as RNAi screening [15]. However, an experimental method such

as RNAi screening is cost, time and labor-intensive. For this reason,

synthetic lethal screening is not able to conduct with high throughput

method. However, synthetic survival screening is able to conduct

with high throughput way if sufficient data exist. For synthetic

survival analysis, only patients’ clinical information such as survival

status and time and somatic mutation profile are needed. Although

experimental validations might be needed after computational high

throughput analysis, computational screening can save cost, time and

human labor.

1.4 Research goal

The goal of this dissertation is to identify synthetic survival and

synthetic survival burden in multiple cancers. Gene pairs associating

with synthetic survival might be potentially used as prognostic factor

in specific cancer type or across cancer types. In this dissertation, we

propose simple methods for identifying synthetic survival with

survival analysis. We conduct survival analysis with cox proportional

hazard model for all possible gene pairs and identify candidates of

- 4 -

synthetic survival. We also identify synthetic survival burden, which

is accumulation of synthetic survival gene pairs affect patient’s

prognosis.

1.5 Related studies

In 2011, a research to predict patients’ survival in ovarian cancer

with molecular profile is introduced [16]. In this study, patients are

stratified into groups by BRCA1 and BRCA2 mutation status. This

study is initial approach to predict prognosis of cancer patients’ with

not only clinical variables but also molecular profiles.

Recently, computational methods to find synthetic lethal pairs are

introduced. Xiaosheng Wang et al tried to use only expression data to

find synthetic lethal relationship with TP53 gene [17]. Although this

is study is not associated with prognosis and gene sets are limited to

only kinases, it has a meaning that using only genomic data with

computational approach.

Livnat Jerby-Arnon et al. find Synthetic lethal pairs with only

data-driven method such as Genomic survival of fittest, functional

examination and coexpression data [18]. In this research, a huge

amount of data is used to find potential candidate of synthetic lethal

gene pairs, and researchers are using not only variable genomic data

also clinical variables.

Here, we analysis synthetic survival with somatic mutation data and

clinical data in 20 cancer type, then identify potential SS gene

relationship with only simple survival analysis.

- 5 -

Ⅱ. Methods & Materials

Overall workflow is divided four part, which is data processing, gene

damaging score, survival analysis and interpretation.(Figure 2)

2.1 Data

For analysis, Data downloaded from TCGA data portal is used. Data

is downloaded on 4 Mar 2015. It includes somatic mutation level 2

data (Whole exome sequencing data) of 5,618 patients and clinical

level 2 data of 6,838 patients. Somatic mutation level 2 data in TCGA

is formatted as mutation annotation format (maf). Position and variant

classification is used for analysis among annotated information.

Variants are classified as ‘Missense mutation’, ‘Nonsense mutation’,

‘Frameshift indel’, ‘In frame indel’, ‘splice site mutation;, Silent

mutation’, ‘Intron’, ‘UTR’ and ‘Intergenic’. Among these, only

non-synonymous mutations are used for analysis. Clinical level 2 data

includes a number of clinical variables according to cancer types, and

clinical variables for cox proportional hazard model are reviewed by

professional pathologist (Table 1).

2.2 Data Processing (Filtering)

First of all, for clinical data, patients who do not have information for

cox proportional hazard model are excluded. Next, patients who is

relevant to strong confounding factor of patients’ prognosis such as

- 6 -

the history of other malignancy, the history of neo adjuvant

treatment, radiation adjuvant treatment, pharmaco adjuvant treatment,

ablation adjuvant and metastasis to distant organs are excluded. At

last, patients who do not have mutation information are excluded. For

mutations data, to begin with, synonymous mutations are excluded.

Secondly, mutations in ‘Unknown’ gene not annotated with official

gene symbol from HGNC [19] are excluded. Finally, mutations of

patients who do not have clinical information are excluded. Then

4,844 patients are remained for further analysis.(Figure 2, Table 2)

2.3 Gene Damaging Score

To quantify deleteriousness of a gene, gene damaging score is

defined. Gene damaging score is calculated by considering the number

and classification of mutations in the gene. The scale of gene

damaging score is zero to one. The smaller gene damaging score

means the more deleterious gene. If the gene harbors Lof mutation

which contains nonsense mutation, frameshift insertion and deletion,

nonstop mutation and splice site mutation, the gene damaging score

of the gene is zero.[20] If not, the gene damaging score of the gene

is the geometric mean of SIFT score[21, 22, 23] of mutations in the

gene. To avoid the situation dividing by zero, if SIFT score is zero,

it substituted 10-e8. In all cancer, most of gene damaging score is

one, and then zero is second large portion. (Figure 3, Figure 4)

- 7 -

2.4 Cox proportional hazard model with penalized

likelihood

Cox proportional hazard model [24] is used for survival analysis. Cox

proportional hazard model is able to adjust confounding clinical

variables. To identify prognosis effect of gene pair’ mutation status,

patients are divided into four groups by every gene pairs; both genes’

damaging score below 0.3, only one gene’s damaging score below 0.3,

another gene’s damaging score below 0.3 and both genes’ damaging

score above 0.3. Because of this kind of grouping, which is,

classification according two genes’ damaging score, the group whose

both gene damaging score is below 0.3 is very small and sometimes

has no death event. So, cox proportional hazard model with maximum

likelihood, which is generally used for cox regression, is not adequate.

Cox proportional hazard model with penalized likelihood [25] can solve

this convergence problem. Survival analysis is conducted with R

(3.2.0)[26] and ‘coxphf’ packages [27].

For every cancer types, clinical features are added to cox proportional

hazard model to adjust confounding effects. Common clinical features

like age, gender are added and cancer specific clinical features

reviewed by professional pathologist and prior studies are also added

to adjust confounding effects.[28, 29] (Table 3)

Potential SS pairs are determined by p-value of cox proportional

hazard model, p-values of mutational status and hazard ratio. Criteria

are p-value of cox model below 0.05 and p-values of mutational

status below 0.05 and hazard ratio above 1.

- 8 -

2.5 Synthetic survival triplet analysis and

synthetic survival burden

A undirected graph showing relation among the significant pairs is

visualized with Graphviz(2.36) function ‘neato’.[30] To identify

prognosis effect according to the number of SS pairs, we conduct

survival analysis with grouping by the number of SS pairs of

patients. Patients distribution by the number of SS pairs are very

skewed (Figure 5), so all groups are merge to four or five group.

(Figure 6)

- 9 -

Ⅲ. Result

3.1 TCGA core data set

The result from filtering raw data is 4,844 patients’ clinical

information and DNA somatic mutation data in 20 cancer types. This

data set include both data types and all of clinical variables needed

for cox proportional hazard model in each cancer types. In this

dissertation, this data set is named core data set, and further analysis

is conducted with this core data set. (Table 1)

3.2 Gene damaging score distribution

All genes that harbors at least one non-synonymous mutation in each

cancer types are annotated by gene damaging score. If genes have no

non-synonymous mutations, their gene damaging scores are

designated to 1. For gene damaging score of all patients, we make a

matrix of which row is gene and column is patient. Because of

scarcity of somatic mutation event, the matrix is very sparse which

means that most of gene damaging score in the matrix is 1. Except

1, 0 is high proportion of gene score. (Figure 4) To dichotomize gene

score to binary factor, threshold 0.3 is applied. It provides well

grouping for gene deleteriousness (Figure 4, black line)

- 10 -

3.3 Candidate Synthetic Survival pairs

The result from performing survival analysis for twenty cancer types

is 436 candidates of SS pairs in five cancer types with criteria, which

is p-value below 0.05, and 45 candidates of SS pairs in three cancer

types with criteria which is p-value below 0.01. (Table 4) Most of

results are skewed to particular cancer types such as Lung

adenocarcinoma, Skin cutaneous melanoma. Both cancer types have

high somatic mutation frequency [31], and somehow high mortality.

436 pairs are consisted of 281 genes, and some genes like XIRP2,

RYR3 are showing high frequency in pairs. (Figure 6) Most of

highly frequently appeared genes are related to cadherin genes.

Cadherin genes are closely involved cell surface, so also involved

cell-cell interactions [32]. This result infers that these frequently

appeared genes are probably connected to cancer cell metastasis

mechanism.

GO analysis is conducted with genes which consist SS pairs. Highly

enriched terms are mostly related to cell-cell interaction such as cell

adhesion or extracellular matrix part (Figure 7). This result also

infers that synthetic survival gene pairs have a role for cell-cell

interaction, then it might be associated with cancer cell metastasis

[33,34].

Most of patients have no SS pairs because of scarcity, and the

number of patients is decreased by increasing SS pairs (Figure 10).

- 11 -

3.4 Synthetic Survival burden

In synthetic survival burden test, the result to identify prognostic

effects by increasing SS pairs shows that the more SS pairs or

affect better patients’ survival. (Figure 9, Figure 10) The group,

which have more SS pairs and triplets, have higher survival rate

than the group, which have less SS pairs and triplets. From this

result, we are able to infer that SS pairs don’t work as a only

marker for cancer progress but their cumulation might be disturb

cancer progress such as metastasis to distal metastasis and result in

patient’s well prognosis. Of course, patients having only one SS pair

have better survival status than those who don’t having SS pairs.

Ⅳ. Conclusion & Discussion

In this research, Candidate SS pairs are identify by only performing

survival analysis with cox proportional hazard model and also ensure

that the number of SS pairs is related to patient’s prognosis. Genes

consisting SS pairs are usually related to cell-cell interaction or

migration so might be action to prevent metastasis, then shows good

prognosis. Patients’ prognosis is a key clinical factor to apply

genomic characters with simple methods and public data. Despite of

simple scheme of study, this research has some limitations.

First of all, gene damaging score may be biased score, because it

does not care about gene’s length. If gene’s length is quite large, the

probability of harboring LoF mutations might be high. It makes large

gene deleterious. Secondly, because the core of method is survival

analysis, low mortality cancer is hard to analysis with this method.

- 12 -

The adequate death events are needed for this kind of methods.

Thirdly, the cancer type with low mutation burden like Leukemia is

difficult to apply this method because of grouping.

This kind of pre-screening of potential synthetic lethal like synthetic

survival using only genomic profiles is a promising approach for

improving the efficiency of two genes relations screening, and may

supplement the standard method in some cases. However, the

reliability the method should be validated by in-vitro or in-vivo

experiments.

- 13 -

Ⅴ. References

1. Vogelstein, B. and K.W. Kinzler, The multistep nature of cancer.

Trends Genet, 1993. 9(4): p. 138-41.

2. Mardis, E.R., A decade's perspective on DNA sequencing

technology. Nature, 2011. 470(7333): p. 198-203.

3. Wold, B. and R.M. Myers, Sequence census methods for functional

genomics. Nat Methods, 2008. 5(1): p. 19-21.

4. Cancer Genome Atlas Research, N., et al., The Cancer Genome

Atlas Pan-Cancer analysis project. Nat Genet, 2013. 45(10): p.

1113-20.

5. Joly, Y., et al., Data sharing in the post-genomic world: the

experience of the International Cancer Genome Consortium (ICGC)

Data Access Compliance Office (DACO). PLoS Comput Biol, 2012.

8(7): p. e1002549.

6. Weigel, M.T. and M. Dowsett, Current and emerging biomarkers in

breast cancer: prognosis and prediction. Endocr Relat Cancer, 2010.

17(4): p. R245-62.

7. Cancer Genome Atlas Research, N., Comprehensive genomic

characterization defines human glioblastoma genes and core pathways.

Nature, 2008. 455(7216): p. 1061-8.

- 14 -

8. Cancer Genome Atlas Research, N., Genomic and epigenomic

landscapes of adult de novo acute myeloid leukemia. N Engl J Med,

2013. 368(22): p. 2059-74.

9. Cancer Genome Atlas Research, N., Comprehensive molecular

profiling of lung adenocarcinoma. Nature, 2014. 511(7511): p. 543-50.

10. Cancer Genome Atlas Network. Electronic address, i.m.o. and N.

Cancer Genome Atlas, Genomic Classification of Cutaneous

Melanoma. Cell, 2015. 161(7): p. 1681-96.

11. Lawrence, M.S., et al., Mutational heterogeneity in cancer and the

search for new cancer-associated genes. Nature, 2013. 499(7457): p.

214-8.

12. Dees, N.D., et al., MuSiC: identifying mutational significance in

cancer genomes. Genome Res, 2012. 22(8): p. 1589-98.

13. Lucchesi, J.C., Synthetic lethality and semi-lethality among

functionally related mutants of Drosophila melanfgaster. Genetics,

1968. 59(1): p. 37-44.

14. Kaelin, W.G., Jr., The concept of synthetic lethality in the context

of anticancer therapy. Nat Rev Cancer, 2005. 5(9): p. 689-98.

15. Canaani, D., Methodological approaches in application of synthetic

lethality screening towards anticancer therapy. Br J Cancer, 2009.

100(8): p. 1213-8.

- 15 -

16. Bolton, K.L., et al., Association between BRCA1 and BRCA2

mutations and survival in women with invasive epithelial ovarian

cancer. JAMA, 2012. 307(4): p. 382-90.

17. Wang, X. and R. Simon, Identification of potential synthetic lethal

genes to p53 using a computational biology approach. BMC Med

Genomics, 2013. 6: p. 30.

18. Jerby-Arnon, L., et al., Predicting cancer-specific vulnerability via

data-driven detection of synthetic lethality. Cell, 2014. 158(5): p.

1199-209.

19. Gray, K.A., et al., Genenames.org: the HGNC resources in 2013.

Nucleic Acids Res, 2013. 41(Database issue): p. D545-52.

20. MacArthur, D.G., et al., A systematic survey of loss-of-function

variants in human protein-coding genes. Science, 2012. 335(6070): p.

823-8.

21. Ng, P.C. and S. Henikoff, Predicting deleterious amino acid

substitutions. Genome Res, 2001. 11(5): p. 863-74.

22. Ng, P.C. and S. Henikoff, SIFT: Predicting amino acid changes

that affect protein function. Nucleic Acids Res, 2003. 31(13): p. 3812-4.

23. Kumar, P., S. Henikoff, and P.C. Ng, Predicting the effects of

coding non-synonymous variants on protein function using the SIFT

algorithm. Nat Protoc, 2009. 4(7): p. 1073-81.

- 16 -

24. Cox, D.R., Regression Models and Life-Tables. Journal of the

Royal Statistical Society Series B-Statistical Methodology, 1972. 34(2):

p. 187-+.

25. Lai, Y.T., M. Hayashida, and T. Akutsu, Survival Analysis by

Penalized Regression and Matrix Factorization. Scientific World

Journal, 2013.

26. R Core Team, R: A language and environment for statistical

computing. R Foundation for Statistical Computing, Vienna, Austria.

URL http://www.R-project.org/ (2015)

27. R by Meinhard Ploner and Fortran by Georg Heinze, coxphf: Cox

regression with Firth’s penalized likelihood. R package ersion 1.11.

http://CRAN.R-project.org/package=coxphf (2015).

28. Cheng, Y., et al., Significance of E-cadherin, beta-catenin, and

vimentin expression as postoperative prognosis indicators in cervical

squamous cell carcinoma. Hum Pathol, 2012. 43(8): p. 1213-20.

29. Sharma, P., et al., CD8 tumor-infiltrating lymphocytes are

predictive of survival in muscle-invasive urothelial carcinoma. Proc

Natl Acad Sci U S A, 2007. 104(10): p. 3967-72.

30. Gansner, E.R. and S.C. North, An open graph visualization system

and its applications to software engineering. Software-Practice &

Experience, 2000. 30(11): p. 1203-1233.

- 17 -

31. Kandoth, C., et al., Mutational landscape and significance across 12

major cancer types. Nature, 2013. 502(7471): p. 333-+.

32. Jeanes, A., C.J. Gottardi, and A.S. Yap, Cadherins and cancer: how

does cadherin dysfunction promote tumor progression? Oncogene,

2008. 27(55): p. 6920-6929.

33. Nguyen, D.X. and J. Massague, Genetic determinants of cancer

metastasis. Nature Reviews Genetics, 2007. 8(5): p. 341-352.

34.Yoshida, B.A., et al., Metastasis-suppressor genes: a review and

perspective on an emerging field. Journal of the National Cancer

Institute, 2000. 92(21): p. 1717-1730.

- 18 -

Table 1. Data statistics of 20 cancer types.

- 19 -

Table 2. Clincial variables used in cox proportional hazard model for

each cancer type.

- 20 -

Table 3. Clinical variables statistics in Lung adenocarinoma.

- 21 -

Table 4. The number of candidates SS pair/triplet

- 22 -

Figure 1. Synthetic lethality concept

- 23 -

Figure 2. Overall workflow

- 24 -

Figure 3. Gene damaging score calculation flow

- 25 -

Figure 4. Gene damaging score distribution excluding score 1.

- 26 -

Figure 5. The distribution of patients by the number of SS pairs.

- 27 -

Figure 6. The distribution of genes consisting SS pairs.

- 28 -

Figure 7. GO enrichment for genes consisting SS pairs.

- 29 -

Figure 8. undirected graph with 436 Synthetic survival pairs

- 30 -

Figure 9. Kaplen meier plot by the number of SS pairs in

LUAD/SKCM.

- 31 -

Figure 10. The distribution of patients by SS pair frequency

- 32 -

요약(국문초록)

배 경: 암유전체학이 발전함에 따라, 많은 암유전체학 과제들이 시작되

었고 몇몇은 거의 완성단계에 이르렀다. 이러한 과제들로부터 대용량의

유전체, 전사체, 후성유전체의 자료들이 공개적으로 게재되었다. 이러한

공개 데이터는 암세포에서의 체세포 돌연변이에 대한 많은 연구를 가능

하게 하였고, 많은 연구자들은 암과 관련있는 유전자를 발견하였다. 하지

만 기존의 암유전체연구들은 제한점을 가지고 있다. 많은 연구들이 하나

의 암종 혹은 하나의 암을 일으키는 유전자에 제한되어있다. 이것은 암

의 발생과 발달 그리고 진행에 대한 연구를 하는데 있어서 제한점이 될

수가 있다. 이러한 문제를 풀기위해 합성치사라는 개념이 대두되었다. 합

성치사는 두 개의 유전자중 하나의 유전자에만 돌연변이가 있거나 둘다

없는 경우에는 세포가 살지만, 두 개의 유전자에 있는 돌연변이의 조합

은 세포를 죽게 만든다는 개념이다. 우리는 이 연구에서 이 합성치사의

발상을 기본으로 하여 합성생존이라는 개념을 제안한다. 합성생존이란

두 개의 유전자 중 한 개의 유전자에만 돌연변이가 있거나 둘다 없는 경

우는 그 환자의 생존이 나쁘지만, 두 개의 유전자에 있는 돌연변이의 조

합은 그 환자의 생존을 좋게 만든다는 개념이다. 이 논문의 목표는 다양

한 암종에서 합성생존과 합성생존하중을 확인하는 것이다.

방 법: 우리는 20개의 암종에서 9,184명의 임상정보와 6,936명의 체세포

돌연변이 자료를 암 유전체 지도책 과제에서 내려받았다. 모든 자료들은

제한없이 접근이 가능하다. 추후 분석을 위한 핵심 자료를 만들기 위해

분석에 포함되지 않는 자료들을 제하려고 몇몇의 기준을 적용하였다. 최

소한 하나의 비동의돌연변이가 있는 유전자들에 대해 유전자위험점수를

계산하였다. 유전자위험점수란 각각의 유전자가 돌연변이로 인해 얼마나

망가졌는지를 측정하는 점수이다. 유전자위험점수는 기능상실변이와

SIFT 점수로부터 계산되었다. 유전자위험점수에 따라 환자들은 네 그룹

으로 나뉘어졌다. 그 그룹들에 대해 콕스비례위험모형을 가지고 생존분

- 33 -

석을 진행하였다. 이 분석을 통해 잠재적인 합성생존유전자쌍이 발견되

었다.

결 과: 처음 데이터를 가공한 결과 우리는 4,884명의 환자에 대해 핵심

자료모음을 만들었다. 이 자료는 임상정보와 체세포돌연변이를 포함한다.

잠재적인 합성생존유전자쌍으로 436개의 유전자쌍이 5개의 암종에서 발

견되었다. 합성생존세쌍 분석에서 2,427개의 세쌍이 2개의 암종에서 발견

되었다. 그 2개의 암종은 폐암과 흑색종이다. 우리는 또한 합성생존하중

을 발견하였다. 합성생존하중이란 더 많은 합성생존유전자쌍 혹은 합성

생존유전자세쌍을 가진 경우가 더 적게 가진 경우보다 예후가 더 좋은

경우이다.

토 의: 이 논문에서는 잠재적 합성생존유전자쌍이 간단한 생존분석을 통

해서 나타났고 이 유전자쌍들이 많아질수록 환자의 예후와 관계가 있음

을 발견하였다. 합성생존유전자쌍을 이루는 대부분의 유전자들은 세포세

포반응이나 세포의 이동과 관련이 있는 것으로 나타났는데 이것은 암에

서 전이와 연관이 있을것으로 생각된다. 이 연구에서의 방법론적인 것은

간단한 데이터와 분석방법을 통해 실제적인 임상으로서의 적용이 가능하

다는 장점이 있다.

주요어: 암유전체학, The Cancer Genome atlas, 합성생

존, 생존분석, 합성생존 하중.

학 번: 2013-20406

- 34 -

저 시-비 리- 경 지 2.0 한민

는 아래 조건 르는 경 에 한하여 게

l 저 물 복제, 포, 전송, 전시, 공연 송할 수 습니다.

다 과 같 조건 라야 합니다:

l 하는, 저 물 나 포 경 , 저 물에 적 된 허락조건 명확하게 나타내어야 합니다.

l 저 터 허가를 면 러한 조건들 적 되지 않습니다.

저 에 른 리는 내 에 하여 향 지 않습니다.

것 허락규약(Legal Code) 해하 쉽게 약한 것 니다.

Disclaimer

저 시. 하는 원저 를 시하여야 합니다.

비 리. 하는 저 물 리 목적 할 수 없습니다.

경 지. 하는 저 물 개 , 형 또는 가공할 수 없습니다.

http://creativecommons.org/licenses/by-nc-nd/2.0/kr/legalcode

http://creativecommons.org/licenses/by-nc-nd/2.0/kr/

이학석사학위논문






2015년 8월


자연과학부 협동과정 생물정보학 전공

임 영 균






지도교수 김 주 한

이 논문을 이학석사학위논문으로 제출함

2015년 7월


자연과학부 생물정보학 전공

임 영 균

임 영 균의 석사학위논문를 인준함

2015년 6월

위 원 장 김 선 (인)

부위원장 김 주 한 (인)

위 원 윤 성 로 (인)

Abstract




Younggyun, Lim

Bioinformatics

The graduate school

Seoul National University

Background: According to cancer genomics become prevalent, large

cancer genome projects are launched and almost finished. From these

project, a huge amount of genomic, transcriptomic, epigenomic data is

released in public. This public data have provided quite some studies

of somatic mutations in cancer cell, then many researchers have

identified several cancer-related genes. However, some existing cancer

genomics studies have a limitations. Many studies are restricted to

single cancer type or single cancer driver gene. It can limit study’s

scope to identify how cancer occur, develop and progress. To solve

this problem, a concept of synthetic lethality is arisen. Synthetic

lethality is a concept that a combination of mutations in two genes

lead to cell death, while a mutation in an only one of these genes

remains cell survival. We propose a concept synthetic survival based

on the idea of synthetic lethality. Synthetic Survival (SS) is a

concept that a combinations of mutations in two gene lead to patients


prognosis. The goal of this dissertation is to identify synthetic

survival and synthetic survival burden in multiple cancers.

Method: We downloaded 9,184 patients’ clinical information and 6,936

patients’ somatic mutation data in 20 cancer types from The Cancer

Genome Atlas. all of data is open-accessed data. To make core data

set for further analysis, several criteria is applied to filter out data

which is not able to analysis. All genes harboring at least one

non-synonymous mutations are annotated to gene damaging score.

Gene damaging score is a measure to quantify each gene’s

deleteriousness. Gene damaging score is derived from both scores in

Sorting Intolerant From Tolerant and classification of LoF mutations.

By gene damaging score, patients are stratified to four groups.

Survival analysis are conducted for all gene pairs’ group with cox

proportional hazard model with penalized likelihood. From the cox

model, candidates of potential synthetic survival pairs are decided.

Result: The result from filtering raw data, we build the core data set

for 4,844 patients including clinical information and somatic mutation

profiles. For candidates of potential synthetic survival pairs, 436 gene

pairs are identified in 5 cancer types. We also identified synthetic

survival burden which means that the group which have more SS

pairs and triplets have higher survival rate than the group which

have less SS pairs

Discussion: In this dissertation, Candidate SS pairs are identify by

only performing survival analysis with cox proportional hazard model

and also ensure that the number of SS pairs is related to patient’s

prognosis. Genes consisting SS pairs are usually related to cell-cell

interaction or migration so might be action to prevent metastasis,

then shows good prognosis. This study might be approach to reduce

cost for screening and to apply practical clinical uses.

Keywords: Cancer genomics, The Cancer Genome

Atlas, Synthetic survival, Survival analysis, Synthetic

survival burden

Student Number: 2013-20406

Contents

Ⅰ. Introduction.....................................................................................1

1.1 Cancer genomics.......................................................................1

1.2 Existing studies in cancer genomics.....................................2

1.3 Synthetic lethality and Synthetic Survival (SS).................2

1.4 Research goal............................................................................3

1.5 Related studies..........................................................................4

Ⅱ. Methods & Materials....................................................................5

2.1 Data..............................................................................................5

2.2 Data Processing(Filtering)........................................................5

2.3 Gene Damaging Score...............................................................6

2.4 Cox proportional hazard model with penalized likelihood....7

2.5 Synthetic Surival burden.......................................................8

Ⅲ. Result..............................................................................................9

3.1 TCGA core data set.................................................................9

3.2 Gene damaging score distribution..........................................9

3.3 Candidate Synthetic Survival pairs.....................................10

3.4 Synthetic Survival burden...................................................11

Ⅳ. Conclusion & Discussion............................................................11

Ⅴ. References.....................................................................................13

국문초록...............................................................................................34

- 1 -

Ⅰ. Introduction

1.1 Cancer genomics

Cancer is genetic disease. The transition from a normal cell to a

malignant cancer cell is caused by changes to a cell’s DNA called

mutations [1]. Some of those mutations might by inherit, but most of

cancer causing mutations is acquired throughout life also known as

somatic mutations. Recently, whole genome sequencing and whole

exome sequencing technologies have provided quite several studies of

somatic mutations in variable cancer and have made researchers

identify multiple cancer-related driver genes [2, 3]. With this trend,

large cancer genome project such as The Cancer Genome Atlas

(TCGA), International Cancer Genome Consortium (ICGC) are

launched and some of those are almost finished [4, 5]. TCGA is a

comprehensive and coordinated to accelerate understanding of the

molecular basis of cancer through the multi-omics technologies and

TCGA released multi-omics data and clinical data in public.

One major application for cancer patients with primary tumor is

accurate prognosis, which helps divided patients into specific risk

groups and decide both treatment and observation patient status.

Traditionally, prognosis is based on only common clinical variables

such as ages, pathologic tumor stages. Recently, molecular variables

such as genetic alterations or amplifications are incorporated to cancer

patient prognosis. Representatively, ER, PR, HER2 protein levels are

significant biomarker in breast cancer, which is really applicable to

practical treatment [6].

- 2 -

1.2 Existing studies in cancer genomics

Lately, typical cancer genomics studies and papers are carried out by

TCGA. TCGA published genomic, transcriptomic and epigenetic

landscape profile of each cancer types for about 30 types [7, 8, 9, 10].

Those studies include finding cancer driver mutations, molecular

classification of cancer and heterogeneity of cancer and so on. To

find cancer driver mutations, a number of algorithms and tools are

developed [11, 12]. Those tools use statistical methods, so detect

significant mutation based on background mutations rate. In spite of

these developments, most of researches are focused single driver

gene. By convention, studies associated with prognosis also have

been limited single gene and single cancer lineage. Many cancers

related genes identified from these researches are not directly

‘druggable’ or are not applying predicting prognosis. A huge number

of studies are enacted, but practical clinical application is still difficult

and sure prognosis factor is not identified.

1.3 Synthetic lethality and Synthetic Survival

(SS)

Several decades ago, the concept of synthetic lethality is arisen [13].

Synthetic lethality is a concept that a combination of mutations in

two genes lead to cell death, while a mutation in an only one of

these genes remains cell survival [13, 14]. (Figure 1)

- 3 -

In this dissertation, we propose a concept of synthetic survival based

on the idea of synthetic lethality. Synthetic survival (SS) is a concept

that a combinations of mutations in two gene lead to patients


prognosis. Synthetic Survival can analyze gene interactions in cancer,

then infer how those gene interaction affect to patients’ prognosis.

Most of synthetic lethality is identified with experimental method

such as RNAi screening [15]. However, an experimental method such

as RNAi screening is cost, time and labor-intensive. For this reason,

synthetic lethal screening is not able to conduct with high throughput

method. However, synthetic survival screening is able to conduct

with high throughput way if sufficient data exist. For synthetic

survival analysis, only patients’ clinical information such as survival

status and time and somatic mutation profile are needed. Although

experimental validations might be needed after computational high

throughput analysis, computational screening can save cost, time and

human labor.

1.4 Research goal

The goal of this dissertation is to identify synthetic survival and

synthetic survival burden in multiple cancers. Gene pairs associating

with synthetic survival might be potentially used as prognostic factor

in specific cancer type or across cancer types. In this dissertation, we

propose simple methods for identifying synthetic survival with

survival analysis. We conduct survival analysis with cox proportional

hazard model for all possible gene pairs and identify candidates of

- 4 -

synthetic survival. We also identify synthetic survival burden, which

is accumulation of synthetic survival gene pairs affect patient’s

prognosis.

1.5 Related studies

In 2011, a research to predict patients’ survival in ovarian cancer

with molecular profile is introduced [16]. In this study, patients are

stratified into groups by BRCA1 and BRCA2 mutation status. This

study is initial approach to predict prognosis of cancer patients’ with

not only clinical variables but also molecular profiles.

Recently, computational methods to find synthetic lethal pairs are

introduced. Xiaosheng Wang et al tried to use only expression data to

find synthetic lethal relationship with TP53 gene [17]. Although this

is study is not associated with prognosis and gene sets are limited to

only kinases, it has a meaning that using only genomic data with

computational approach.

Livnat Jerby-Arnon et al. find Synthetic lethal pairs with only

data-driven method such as Genomic survival of fittest, functional

examination and coexpression data [18]. In this research, a huge

amount of data is used to find potential candidate of synthetic lethal

gene pairs, and researchers are using not only variable genomic data

also clinical variables.

Here, we analysis synthetic survival with somatic mutation data and

clinical data in 20 cancer type, then identify potential SS gene

relationship with only simple survival analysis.

- 5 -

Ⅱ. Methods & Materials

Overall workflow is divided four part, which is data processing, gene

damaging score, survival analysis and interpretation.(Figure 2)

2.1 Data

For analysis, Data downloaded from TCGA data portal is used. Data

is downloaded on 4 Mar 2015. It includes somatic mutation level 2

data (Whole exome sequencing data) of 5,618 patients and clinical

level 2 data of 6,838 patients. Somatic mutation level 2 data in TCGA

is formatted as mutation annotation format (maf). Position and variant

classification is used for analysis among annotated information.

Variants are classified as ‘Missense mutation’, ‘Nonsense mutation’,

‘Frameshift indel’, ‘In frame indel’, ‘splice site mutation;, Silent

mutation’, ‘Intron’, ‘UTR’ and ‘Intergenic’. Among these, only

non-synonymous mutations are used for analysis. Clinical level 2 data

includes a number of clinical variables according to cancer types, and

clinical variables for cox proportional hazard model are reviewed by

professional pathologist (Table 1).

2.2 Data Processing (Filtering)

First of all, for clinical data, patients who do not have information for

cox proportional hazard model are excluded. Next, patients who is

relevant to strong confounding factor of patients’ prognosis such as

- 6 -

the history of other malignancy, the history of neo adjuvant

treatment, radiation adjuvant treatment, pharmaco adjuvant treatment,

ablation adjuvant and metastasis to distant organs are excluded. At

last, patients who do not have mutation information are excluded. For

mutations data, to begin with, synonymous mutations are excluded.

Secondly, mutations in ‘Unknown’ gene not annotated with official

gene symbol from HGNC [19] are excluded. Finally, mutations of

patients who do not have clinical information are excluded. Then

4,844 patients are remained for further analysis.(Figure 2, Table 2)

2.3 Gene Damaging Score

To quantify deleteriousness of a gene, gene damaging score is

defined. Gene damaging score is calculated by considering the number

and classification of mutations in the gene. The scale of gene

damaging score is zero to one. The smaller gene damaging score

means the more deleterious gene. If the gene harbors Lof mutation

which contains nonsense mutation, frameshift insertion and deletion,

nonstop mutation and splice site mutation, the gene damaging score

of the gene is zero.[20] If not, the gene damaging score of the gene

is the geometric mean of SIFT score[21, 22, 23] of mutations in the

gene. To avoid the situation dividing by zero, if SIFT score is zero,

it substituted 10-e8. In all cancer, most of gene damaging score is

one, and then zero is second large portion. (Figure 3, Figure 4)

- 7 -

2.4 Cox proportional hazard model with penalized

likelihood

Cox proportional hazard model [24] is used for survival analysis. Cox

proportional hazard model is able to adjust confounding clinical

variables. To identify prognosis effect of gene pair’ mutation status,

patients are divided into four groups by every gene pairs; both genes’

damaging score below 0.3, only one gene’s damaging score below 0.3,

another gene’s damaging score below 0.3 and both genes’ damaging

score above 0.3. Because of this kind of grouping, which is,

classification according two genes’ damaging score, the group whose

both gene damaging score is below 0.3 is very small and sometimes

has no death event. So, cox proportional hazard model with maximum

likelihood, which is generally used for cox regression, is not adequate.

Cox proportional hazard model with penalized likelihood [25] can solve

this convergence problem. Survival analysis is conducted with R

(3.2.0)[26] and ‘coxphf’ packages [27].

For every cancer types, clinical features are added to cox proportional

hazard model to adjust confounding effects. Common clinical features

like age, gender are added and cancer specific clinical features

reviewed by professional pathologist and prior studies are also added

to adjust confounding effects.[28, 29] (Table 3)

Potential SS pairs are determined by p-value of cox proportional

hazard model, p-values of mutational status and hazard ratio. Criteria

are p-value of cox model below 0.05 and p-values of mutational

status below 0.05 and hazard ratio above 1.

- 8 -

2.5 Synthetic survival triplet analysis and

synthetic survival burden

A undirected graph showing relation among the significant pairs is

visualized with Graphviz(2.36) function ‘neato’.[30] To identify

prognosis effect according to the number of SS pairs, we conduct

survival analysis with grouping by the number of SS pairs of

patients. Patients distribution by the number of SS pairs are very

skewed (Figure 5), so all groups are merge to four or five group.

(Figure 6)

- 9 -

Ⅲ. Result

3.1 TCGA core data set

The result from filtering raw data is 4,844 patients’ clinical

information and DNA somatic mutation data in 20 cancer types. This

data set include both data types and all of clinical variables needed

for cox proportional hazard model in each cancer types. In this

dissertation, this data set is named core data set, and further analysis

is conducted with this core data set. (Table 1)

3.2 Gene damaging score distribution

All genes that harbors at least one non-synonymous mutation in each

cancer types are annotated by gene damaging score. If genes have no

non-synonymous mutations, their gene damaging scores are

designated to 1. For gene damaging score of all patients, we make a

matrix of which row is gene and column is patient. Because of

scarcity of somatic mutation event, the matrix is very sparse which

means that most of gene damaging score in the matrix is 1. Except

1, 0 is high proportion of gene score. (Figure 4) To dichotomize gene

score to binary factor, threshold 0.3 is applied. It provides well

grouping for gene deleteriousness (Figure 4, black line)

- 10 -

3.3 Candidate Synthetic Survival pairs

The result from performing survival analysis for twenty cancer types

is 436 candidates of SS pairs in five cancer types with criteria, which

is p-value below 0.05, and 45 candidates of SS pairs in three cancer

types with criteria which is p-value below 0.01. (Table 4) Most of

results are skewed to particular cancer types such as Lung

adenocarcinoma, Skin cutaneous melanoma. Both cancer types have

high somatic mutation frequency [31], and somehow high mortality.

436 pairs are consisted of 281 genes, and some genes like XIRP2,

RYR3 are showing high frequency in pairs. (Figure 6) Most of

highly frequently appeared genes are related to cadherin genes.

Cadherin genes are closely involved cell surface, so also involved

cell-cell interactions [32]. This result infers that these frequently

appeared genes are probably connected to cancer cell metastasis

mechanism.

GO analysis is conducted with genes which consist SS pairs. Highly

enriched terms are mostly related to cell-cell interaction such as cell

adhesion or extracellular matrix part (Figure 7). This result also

infers that synthetic survival gene pairs have a role for cell-cell

interaction, then it might be associated with cancer cell metastasis

[33,34].

Most of patients have no SS pairs because of scarcity, and the

number of patients is decreased by increasing SS pairs (Figure 10).

- 11 -

3.4 Synthetic Survival burden

In synthetic survival burden test, the result to identify prognostic

effects by increasing SS pairs shows that the more SS pairs or

affect better patients’ survival. (Figure 9, Figure 10) The group,

which have more SS pairs and triplets, have higher survival rate

than the group, which have less SS pairs and triplets. From this

result, we are able to infer that SS pairs don’t work as a only

marker for cancer progress but their cumulation might be disturb

cancer progress such as metastasis to distal metastasis and result in

patient’s well prognosis. Of course, patients having only one SS pair

have better survival status than those who don’t having SS pairs.

Ⅳ. Conclusion & Discussion

In this research, Candidate SS pairs are identify by only performing

survival analysis with cox proportional hazard model and also ensure

that the number of SS pairs is related to patient’s prognosis. Genes

consisting SS pairs are usually related to cell-cell interaction or

migration so might be action to prevent metastasis, then shows good

prognosis. Patients’ prognosis is a key clinical factor to apply

genomic characters with simple methods and public data. Despite of

simple scheme of study, this research has some limitations.

First of all, gene damaging score may be biased score, because it

does not care about gene’s length. If gene’s length is quite large, the

probability of harboring LoF mutations might be high. It makes large

gene deleterious. Secondly, because the core of method is survival

analysis, low mortality cancer is hard to analysis with this method.

- 12 -

The adequate death events are needed for this kind of methods.

Thirdly, the cancer type with low mutation burden like Leukemia is

difficult to apply this method because of grouping.

This kind of pre-screening of potential synthetic lethal like synthetic

survival using only genomic profiles is a promising approach for

improving the efficiency of two genes relations screening, and may

supplement the standard method in some cases. However, the

reliability the method should be validated by in-vitro or in-vivo

experiments.

- 13 -

Ⅴ. References

1. Vogelstein, B. and K.W. Kinzler, The multistep nature of cancer.

Trends Genet, 1993. 9(4): p. 138-41.

2. Mardis, E.R., A decade's perspective on DNA sequencing

technology. Nature, 2011. 470(7333): p. 198-203.

3. Wold, B. and R.M. Myers, Sequence census methods for functional

genomics. Nat Methods, 2008. 5(1): p. 19-21.

4. Cancer Genome Atlas Research, N., et al., The Cancer Genome

Atlas Pan-Cancer analysis project. Nat Genet, 2013. 45(10): p.

1113-20.

5. Joly, Y., et al., Data sharing in the post-genomic world: the

experience of the International Cancer Genome Consortium (ICGC)

Data Access Compliance Office (DACO). PLoS Comput Biol, 2012.

8(7): p. e1002549.

6. Weigel, M.T. and M. Dowsett, Current and emerging biomarkers in

breast cancer: prognosis and prediction. Endocr Relat Cancer, 2010.

17(4): p. R245-62.

7. Cancer Genome Atlas Research, N., Comprehensive genomic

characterization defines human glioblastoma genes and core pathways.

Nature, 2008. 455(7216): p. 1061-8.

- 14 -

8. Cancer Genome Atlas Research, N., Genomic and epigenomic

landscapes of adult de novo acute myeloid leukemia. N Engl J Med,

2013. 368(22): p. 2059-74.

9. Cancer Genome Atlas Research, N., Comprehensive molecular

profiling of lung adenocarcinoma. Nature, 2014. 511(7511): p. 543-50.

10. Cancer Genome Atlas Network. Electronic address, i.m.o. and N.

Cancer Genome Atlas, Genomic Classification of Cutaneous

Melanoma. Cell, 2015. 161(7): p. 1681-96.

11. Lawrence, M.S., et al., Mutational heterogeneity in cancer and the

search for new cancer-associated genes. Nature, 2013. 499(7457): p.

214-8.

12. Dees, N.D., et al., MuSiC: identifying mutational significance in

cancer genomes. Genome Res, 2012. 22(8): p. 1589-98.

13. Lucchesi, J.C., Synthetic lethality and semi-lethality among

functionally related mutants of Drosophila melanfgaster. Genetics,

1968. 59(1): p. 37-44.

14. Kaelin, W.G., Jr., The concept of synthetic lethality in the context

of anticancer therapy. Nat Rev Cancer, 2005. 5(9): p. 689-98.

15. Canaani, D., Methodological approaches in application of synthetic

lethality screening towards anticancer therapy. Br J Cancer, 2009.

100(8): p. 1213-8.

- 15 -

16. Bolton, K.L., et al., Association between BRCA1 and BRCA2

mutations and survival in women with invasive epithelial ovarian

cancer. JAMA, 2012. 307(4): p. 382-90.

17. Wang, X. and R. Simon, Identification of potential synthetic lethal

genes to p53 using a computational biology approach. BMC Med

Genomics, 2013. 6: p. 30.

18. Jerby-Arnon, L., et al., Predicting cancer-specific vulnerability via

data-driven detection of synthetic lethality. Cell, 2014. 158(5): p.

1199-209.

19. Gray, K.A., et al., Genenames.org: the HGNC resources in 2013.

Nucleic Acids Res, 2013. 41(Database issue): p. D545-52.

20. MacArthur, D.G., et al., A systematic survey of loss-of-function

variants in human protein-coding genes. Science, 2012. 335(6070): p.

823-8.

21. Ng, P.C. and S. Henikoff, Predicting deleterious amino acid

substitutions. Genome Res, 2001. 11(5): p. 863-74.

22. Ng, P.C. and S. Henikoff, SIFT: Predicting amino acid changes

that affect protein function. Nucleic Acids Res, 2003. 31(13): p. 3812-4.

23. Kumar, P., S. Henikoff, and P.C. Ng, Predicting the effects of

coding non-synonymous variants on protein function using the SIFT

algorithm. Nat Protoc, 2009. 4(7): p. 1073-81.

- 16 -

24. Cox, D.R., Regression Models and Life-Tables. Journal of the

Royal Statistical Society Series B-Statistical Methodology, 1972. 34(2):

p. 187-+.

25. Lai, Y.T., M. Hayashida, and T. Akutsu, Survival Analysis by

Penalized Regression and Matrix Factorization. Scientific World

Journal, 2013.

26. R Core Team, R: A language and environment for statistical

computing. R Foundation for Statistical Computing, Vienna, Austria.

URL http://www.R-project.org/ (2015)

27. R by Meinhard Ploner and Fortran by Georg Heinze, coxphf: Cox

regression with Firth’s penalized likelihood. R package ersion 1.11.

http://CRAN.R-project.org/package=coxphf (2015).

28. Cheng, Y., et al., Significance of E-cadherin, beta-catenin, and

vimentin expression as postoperative prognosis indicators in cervical

squamous cell carcinoma. Hum Pathol, 2012. 43(8): p. 1213-20.

29. Sharma, P., et al., CD8 tumor-infiltrating lymphocytes are

predictive of survival in muscle-invasive urothelial carcinoma. Proc

Natl Acad Sci U S A, 2007. 104(10): p. 3967-72.

30. Gansner, E.R. and S.C. North, An open graph visualization system

and its applications to software engineering. Software-Practice &

Experience, 2000. 30(11): p. 1203-1233.

- 17 -

31. Kandoth, C., et al., Mutational landscape and significance across 12

major cancer types. Nature, 2013. 502(7471): p. 333-+.

32. Jeanes, A., C.J. Gottardi, and A.S. Yap, Cadherins and cancer: how

does cadherin dysfunction promote tumor progression? Oncogene,

2008. 27(55): p. 6920-6929.

33. Nguyen, D.X. and J. Massague, Genetic determinants of cancer

metastasis. Nature Reviews Genetics, 2007. 8(5): p. 341-352.

34.Yoshida, B.A., et al., Metastasis-suppressor genes: a review and

perspective on an emerging field. Journal of the National Cancer

Institute, 2000. 92(21): p. 1717-1730.

- 18 -

Table 1. Data statistics of 20 cancer types.

- 19 -

Table 2. Clincial variables used in cox proportional hazard model for

each cancer type.

- 20 -

Table 3. Clinical variables statistics in Lung adenocarinoma.

- 21 -

Table 4. The number of candidates SS pair/triplet

- 22 -

Figure 1. Synthetic lethality concept

- 23 -

Figure 2. Overall workflow

- 24 -

Figure 3. Gene damaging score calculation flow

- 25 -

Figure 4. Gene damaging score distribution excluding score 1.

- 26 -

Figure 5. The distribution of patients by the number of SS pairs.

- 27 -

Figure 6. The distribution of genes consisting SS pairs.

- 28 -

Figure 7. GO enrichment for genes consisting SS pairs.

- 29 -

Figure 8. undirected graph with 436 Synthetic survival pairs

- 30 -

Figure 9. Kaplen meier plot by the number of SS pairs in

LUAD/SKCM.

- 31 -

Figure 10. The distribution of patients by SS pair frequency

- 32 -

요약(국문초록)

배 경: 암유전체학이 발전함에 따라, 많은 암유전체학 과제들이 시작되

었고 몇몇은 거의 완성단계에 이르렀다. 이러한 과제들로부터 대용량의

유전체, 전사체, 후성유전체의 자료들이 공개적으로 게재되었다. 이러한

공개 데이터는 암세포에서의 체세포 돌연변이에 대한 많은 연구를 가능

하게 하였고, 많은 연구자들은 암과 관련있는 유전자를 발견하였다. 하지

만 기존의 암유전체연구들은 제한점을 가지고 있다. 많은 연구들이 하나

의 암종 혹은 하나의 암을 일으키는 유전자에 제한되어있다. 이것은 암

의 발생과 발달 그리고 진행에 대한 연구를 하는데 있어서 제한점이 될

수가 있다. 이러한 문제를 풀기위해 합성치사라는 개념이 대두되었다. 합

성치사는 두 개의 유전자중 하나의 유전자에만 돌연변이가 있거나 둘다

없는 경우에는 세포가 살지만, 두 개의 유전자에 있는 돌연변이의 조합

은 세포를 죽게 만든다는 개념이다. 우리는 이 연구에서 이 합성치사의

발상을 기본으로 하여 합성생존이라는 개념을 제안한다. 합성생존이란

두 개의 유전자 중 한 개의 유전자에만 돌연변이가 있거나 둘다 없는 경

우는 그 환자의 생존이 나쁘지만, 두 개의 유전자에 있는 돌연변이의 조

합은 그 환자의 생존을 좋게 만든다는 개념이다. 이 논문의 목표는 다양

한 암종에서 합성생존과 합성생존하중을 확인하는 것이다.

방 법: 우리는 20개의 암종에서 9,184명의 임상정보와 6,936명의 체세포

돌연변이 자료를 암 유전체 지도책 과제에서 내려받았다. 모든 자료들은

제한없이 접근이 가능하다. 추후 분석을 위한 핵심 자료를 만들기 위해

분석에 포함되지 않는 자료들을 제하려고 몇몇의 기준을 적용하였다. 최

소한 하나의 비동의돌연변이가 있는 유전자들에 대해 유전자위험점수를

계산하였다. 유전자위험점수란 각각의 유전자가 돌연변이로 인해 얼마나

망가졌는지를 측정하는 점수이다. 유전자위험점수는 기능상실변이와

SIFT 점수로부터 계산되었다. 유전자위험점수에 따라 환자들은 네 그룹

으로 나뉘어졌다. 그 그룹들에 대해 콕스비례위험모형을 가지고 생존분

- 33 -

석을 진행하였다. 이 분석을 통해 잠재적인 합성생존유전자쌍이 발견되

었다.

결 과: 처음 데이터를 가공한 결과 우리는 4,884명의 환자에 대해 핵심

자료모음을 만들었다. 이 자료는 임상정보와 체세포돌연변이를 포함한다.

잠재적인 합성생존유전자쌍으로 436개의 유전자쌍이 5개의 암종에서 발

견되었다. 합성생존세쌍 분석에서 2,427개의 세쌍이 2개의 암종에서 발견

되었다. 그 2개의 암종은 폐암과 흑색종이다. 우리는 또한 합성생존하중

을 발견하였다. 합성생존하중이란 더 많은 합성생존유전자쌍 혹은 합성

생존유전자세쌍을 가진 경우가 더 적게 가진 경우보다 예후가 더 좋은

경우이다.

토 의: 이 논문에서는 잠재적 합성생존유전자쌍이 간단한 생존분석을 통

해서 나타났고 이 유전자쌍들이 많아질수록 환자의 예후와 관계가 있음

을 발견하였다. 합성생존유전자쌍을 이루는 대부분의 유전자들은 세포세

포반응이나 세포의 이동과 관련이 있는 것으로 나타났는데 이것은 암에

서 전이와 연관이 있을것으로 생각된다. 이 연구에서의 방법론적인 것은

간단한 데이터와 분석방법을 통해 실제적인 임상으로서의 적용이 가능하

다는 장점이 있다.

주요어: 암유전체학, The Cancer Genome atlas, 합성생

존, 생존분석, 합성생존 하중.

학 번: 2013-20406

- 34 -

Documents

Disclaimer - Seoul National Universitys-space.snu.ac.kr/bitstream/10371/131178/1/000000056933.pdf저작자표시-비영리-변경금지 2.0 대한민국 이용자는 아래의 조건을